
PolyCoder

  • 📙Paper: A Systematic Evaluation of Large Language Models of Code
  • 📚Publisher: MAPS
  • 🏠Author Affiliation: Carnegie Mellon University
  • 🔑Public: ✅
  • 🌐Architecture
    • Decoder-Only
  • 📏Model Size
    • 160M; 400M; 2.7B
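
PolyCoder follows the decoder-only (GPT-2 style) design listed above. As a rough illustration of what such a configuration looks like, the sketch below instantiates a small decoder-only model with Hugging Face `transformers` and counts its parameters; the depth, width, and vocabulary size are placeholders, since the post does not report them.

```python
# Minimal sketch: a GPT-2 style decoder-only model.
# Depth/width/vocab values are placeholders; the post does not list
# PolyCoder's exact layer counts or hidden sizes.
from transformers import GPT2Config, GPT2LMHeadModel

config = GPT2Config(
    vocab_size=50_000,   # assumption: ~50K BPE vocabulary, GPT-2 style
    n_positions=2_048,   # context window reported in the post
    n_embd=768,          # placeholder width (roughly 160M-class scale)
    n_layer=12,          # placeholder depth
    n_head=12,           # placeholder number of attention heads
)
model = GPT2LMHeadModel(config)
print(f"parameters: {sum(p.numel() for p in model.parameters()) / 1e6:.0f}M")
```
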
  • 🗂️Data pre-processing
    • Data Resource
      • The authors cloned the most popular repositories for 12 popular programming languages with at least 50 stars from GitHub in October 2021, stopping at about 25K repositories per language to avoid too heavy a skew towards the most popular languages. For each project, every file written in the project's majority language was extracted, yielding the initial training set. This initial, unfiltered dataset spanned 631 GB and 38.9M files.
    • De-duplication: ✅
    • Filter Strategies (see the filtering sketch after this section)
      • Files larger than 1 MB were removed
      • Files with fewer than 100 tokens were removed
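
A minimal sketch of the file-level filtering described above, under the assumption that the repositories are already cloned to disk: exact-duplicate removal via content hashing, a 1 MB size cap, and a 100-token minimum length. Whitespace splitting stands in for the real tokenizer, and all paths are illustrative.

```python
# Sketch of the file-level filters: drop exact duplicates, files > 1 MB,
# and files with fewer than 100 tokens. Whitespace splitting stands in
# for the real tokenizer; the root path is illustrative.
import hashlib
from pathlib import Path

MAX_BYTES = 1_000_000   # files larger than 1 MB are dropped
MIN_TOKENS = 100        # files shorter than 100 tokens are dropped

def filter_files(root: str) -> list[Path]:
    seen_hashes = set()
    kept = []
    for path in Path(root).rglob("*"):
        if not path.is_file() or path.stat().st_size > MAX_BYTES:
            continue
        text = path.read_text(encoding="utf-8", errors="ignore")
        if len(text.split()) < MIN_TOKENS:
            continue
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen_hashes:      # exact-duplicate de-duplication
            continue
        seen_hashes.add(digest)
        kept.append(path)
    return kept
```
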
  • 🍉Tokenizer
    • Technology
      • Byte-level Byte-Pair-Encoding (BBPE)
    • Details
      • A GPT-2 tokenizer was trained on a random 5% subset of the files (all languages); see the sketch after this section
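
A sketch of training a GPT-2-style byte-level BPE tokenizer on a random ~5% sample of the corpus with the Hugging Face `tokenizers` library; the vocabulary size, minimum frequency, and paths are assumptions, not values reported in the post.

```python
# Sketch: train a byte-level BPE (GPT-2 style) tokenizer on ~5% of the files.
# Vocabulary size, min_frequency, and paths are assumptions.
import random
from pathlib import Path
from tokenizers import ByteLevelBPETokenizer

all_files = [str(p) for p in Path("training_corpus/").rglob("*") if p.is_file()]
sample = random.sample(all_files, k=max(1, len(all_files) // 20))  # ~5% subset

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(files=sample, vocab_size=50_000, min_frequency=2)

Path("polycoder_tokenizer").mkdir(exist_ok=True)
tokenizer.save_model("polycoder_tokenizer")  # writes vocab.json and merges.txt
```
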
  • 🧪Hyperparameters (PolyCoder 2.7B)
    • optimizer: AdamW
      • betas: 0.9, 0.999
      • eps: 1e-8
    • batch size: 262K tokens
    • context window: 2,048
    • gradient accumulation steps: /
    • warmup steps: 1,600
    • learning rate: 1.6e-4
    • weight decay: /
    • decay schedule
      • Cosine
    • floating-point precision: /
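
The reported settings map onto a standard PyTorch optimizer/scheduler setup, sketched below: AdamW with betas (0.9, 0.999) and eps 1e-8, peak learning rate 1.6e-4, 1,600 warmup steps, and cosine decay over the 150K training steps. Weight decay and the final learning-rate floor are not reported, so the values used here are placeholders.

```python
# Sketch of the reported optimizer/schedule: AdamW(0.9, 0.999, eps=1e-8),
# peak LR 1.6e-4, 1,600 linear-warmup steps, cosine decay over 150K steps.
# weight_decay and the final LR floor are placeholders (not reported).
import math
import torch

def build_optimizer_and_scheduler(model, total_steps=150_000, warmup_steps=1_600):
    optimizer = torch.optim.AdamW(
        model.parameters(),
        lr=1.6e-4,
        betas=(0.9, 0.999),
        eps=1e-8,
        weight_decay=0.0,  # placeholder: not reported in the post
    )

    def lr_lambda(step):
        if step < warmup_steps:                       # linear warmup
            return step / max(1, warmup_steps)
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return 0.5 * (1.0 + math.cos(math.pi * progress))  # cosine decay to 0

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
    return optimizer, scheduler
```
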
  • 🏃‍♀️Training
    • model initialization: from scratch
    • training strategies
      • left-to-right
    • trained tokens/steps: 39B tokens or 150K steps
    • hardware: 8 Nvidia RTX 8000 GPUs
    • training time: about 6 weeks
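
A condensed sketch of a left-to-right (causal language modeling) training loop, matching the from-scratch, next-token-prediction setup above; the data loader, the batching to roughly 262K tokens per step, and the multi-GPU details on the 8 RTX 8000s are omitted or assumed.

```python
# Condensed sketch of left-to-right (causal LM) training from scratch.
# `model` is a decoder-only LM as sketched earlier; `batches` is assumed to
# yield token-ID tensors of shape (batch, 2048). Gradient accumulation and
# multi-GPU data parallelism are omitted.
import torch

def train(model, batches, optimizer, scheduler, max_steps=150_000, device="cuda"):
    model.to(device).train()
    for step, input_ids in enumerate(batches):
        if step >= max_steps:
            break
        input_ids = input_ids.to(device)
        # Decoder-only LM: labels equal the inputs; the model shifts them
        # internally, so the loss is next-token prediction left to right.
        outputs = model(input_ids=input_ids, labels=input_ids)
        outputs.loss.backward()
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()
```
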
This post is licensed under CC BY 4.0 by the author.
