- 📙Paper: A systematic evaluation of large language models of code
- 📚Publisher: MAPS
- 🏠Author Affiliation: Carnegie Mellon University
- 🔑Public: ✅
- 🌐Architecture
- Encoder-Decoder
- Decoder-Only ✅
- 📏Model Size: 160M; 400M; 2.7B
- 🗂️Data pre-processing
- Data Resource
- We cloned the most popular repositories for 12 popular programming languages from GitHub in October 2021, requiring at least 50 stars and stopping at about 25K repositories per language to avoid an overly heavy skew toward the most popular languages. For each project, every file written in that project's majority language was extracted, yielding the initial training set. This initial, unfiltered dataset spanned 631 GB and 38.9M files.
- De-duplication: ✅
- Filter Strategies (see the filtering sketch after this section)
- Files larger than 1 MB are filtered out.
- Files with fewer than 100 tokens are filtered out.
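The size and length filters above, together with deduplication, can be expressed compactly. The sketch below is a minimal illustration, assuming exact-content hashing for deduplication and a whitespace split as a stand-in for the real token count; neither of those details is specified in this summary.

```python
import os
from hashlib import sha256

MAX_BYTES = 1 * 1024 * 1024   # drop files larger than 1 MB
MIN_TOKENS = 100              # drop files with fewer than 100 tokens

def keep_file(path, seen_hashes):
    """Return True if the file passes the size/length filters and is not an exact duplicate."""
    if os.path.getsize(path) > MAX_BYTES:
        return False
    with open(path, "rb") as f:
        data = f.read()
    # Whitespace split is a rough stand-in for the tokenizer's token count.
    if len(data.split()) < MIN_TOKENS:
        return False
    digest = sha256(data).hexdigest()   # exact-content hash for deduplication (assumed strategy)
    if digest in seen_hashes:
        return False
    seen_hashes.add(digest)
    return True
```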
- 🍉Tokenizer
- Technology
- Byte-level Byte-Pair-Encoding (BBPE) ✅
- SentencePiece
- Details
- A GPT-2 tokenizer (byte-level BPE) was trained on a random 5% subset of the training data (all languages); see the sketch after this section.
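A minimal sketch of training a GPT-2-style byte-level BPE tokenizer on a random 5% sample of the corpus, using the Hugging Face `tokenizers` library. The directory names, the 50,257 vocabulary size (GPT-2's default), and the sampling seed are illustrative assumptions, not values taken from the paper.

```python
import random
from pathlib import Path

from tokenizers import ByteLevelBPETokenizer

# Random 5% sample of the filtered corpus, pooled across all 12 languages.
corpus_dir = Path("filtered_corpus")          # hypothetical location of the filtered files
all_files = [str(p) for p in corpus_dir.rglob("*") if p.is_file()]
random.seed(0)
sample = random.sample(all_files, k=max(1, len(all_files) // 20))

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=sample,
    vocab_size=50_257,                        # GPT-2 default; assumed, not stated in this summary
    special_tokens=["<|endoftext|>"],
)

out_dir = Path("tokenizer_out")               # hypothetical output directory
out_dir.mkdir(exist_ok=True)
tokenizer.save_model(str(out_dir))            # writes vocab.json and merges.txt
```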
- 🧪Hyperparameters (PolyCoder 2.7B; a configuration sketch follows this list)
- optimizer: AdamW
- betas: 0.9, 0.999
- eps: 1e-8
- batch size: 262K tokens
- context window: 2,048
- gradient accumulation steps: /
- warmup steps: 1,600
- learning rate: 1.6e-4
- weight decay: /
- decay schedule
- Cosine ✅
- Linear
- Polynomial
- Inverse Square
- floating-point precision: /
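A rough PyTorch reconstruction of the optimizer settings listed above, assuming a cosine decay to zero after the linear warmup (the minimum learning rate and exact decay horizon are not given in this summary) and using a tiny placeholder module in place of the 2.7B-parameter model.

```python
import math

import torch

PEAK_LR = 1.6e-4
WARMUP_STEPS = 1_600
TOTAL_STEPS = 150_000

model = torch.nn.Linear(8, 8)   # placeholder standing in for the 2.7B-parameter model

# Weight decay is not listed in the summary, so PyTorch's default is left in place.
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=PEAK_LR,
    betas=(0.9, 0.999),
    eps=1e-8,
)

def lr_lambda(step: int) -> float:
    """Linear warmup over 1,600 steps, then cosine decay toward zero at 150K steps."""
    if step < WARMUP_STEPS:
        return step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    return 0.5 * (1.0 + math.cos(math.pi * min(progress, 1.0)))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
```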
- 🏃♀️Training
- model initialization: from scratch
- training strategies
- left-to-right ✅
- fill-in-the-middle
- trained tokens/steps: 39B tokens or 150K steps (see the arithmetic check after this list)
- hardware: 8 Nvidia RTX 8000 GPUs
- training time: about 6 weeks
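The batch size, step count, and token count above are mutually consistent, assuming the 262K-token batch is exactly 262,144 tokens (128 sequences of 2,048 tokens each); a quick check:

```python
TOKENS_PER_STEP = 262_144    # batch size: 262K tokens (assumed to be 128 * 2,048)
CONTEXT_WINDOW = 2_048       # tokens per sequence
TOTAL_STEPS = 150_000

sequences_per_step = TOKENS_PER_STEP // CONTEXT_WINDOW   # 128 full-length sequences per step
total_tokens = TOKENS_PER_STEP * TOTAL_STEPS             # 39,321,600,000, i.e. roughly 39B tokens

print(sequences_per_step)             # 128
print(f"{total_tokens / 1e9:.1f}B")   # 39.3B
```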