- 📙Paper: CodeParrot
- 📚Publisher: other
- 🏠Author Affiliation: huggingface
- 🔑Public: ✅
- 🌐Architecture: Decoder-Only
- 📏Model Size: 110M, 1.5B
- 🗂️Data pre-processing
- Data Resource
- CodeParrot dataset
- De-duplication: ✅
- Filter Strategies (files matching any of the following are removed; see the sketch after this list)
- files larger than 1 MB
- files with a maximum line length > 1000 characters
- files with a mean line length > 100 characters
- files with a fraction of alphanumeric characters < 0.25
- files containing the keyword “auto-generated” or similar in the first 5 lines
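A minimal sketch of this pre-processing step, assuming the corpus is an iterable of raw source-file strings; the helper names and the exact-match (hash-based) de-duplication are illustrative, not the authors' exact implementation:

```python
import hashlib

def passes_filters(content: str) -> bool:
    """Return True if a source file survives the CodeParrot-style heuristics."""
    lines = content.splitlines()
    if not lines:
        return False
    # Drop files larger than 1 MB.
    if len(content.encode("utf-8")) > 1_000_000:
        return False
    # Drop files with a max line length > 1000 or a mean line length > 100 characters.
    lengths = [len(line) for line in lines]
    if max(lengths) > 1000 or sum(lengths) / len(lengths) > 100:
        return False
    # Drop files whose fraction of alphanumeric characters is below 0.25.
    if sum(ch.isalnum() for ch in content) / len(content) < 0.25:
        return False
    # Drop files that look auto-generated (keyword in the first 5 lines).
    if any("auto-generated" in line.lower() for line in lines[:5]):
        return False
    return True

def deduplicate(files):
    """Exact de-duplication by hashing whole file contents (illustrative)."""
    seen, unique = set(), []
    for content in files:
        digest = hashlib.md5(content.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(content)
    return unique
```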
- 🍉Tokenizer
- Technology: Byte-level Byte-Pair-Encoding (BBPE)
- Details
- Trained GPT-2 tokenizer on the training split
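A hedged sketch of this step with the 🤗 Transformers `train_new_from_iterator` API: it starts from GPT-2's byte-level BPE tokenizer and learns a new vocabulary on the CodeParrot training split. The streaming batch size and the 32,768-token vocabulary below are assumptions of this sketch.

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# Stream the training split so the full corpus never has to fit in memory.
dataset = load_dataset("codeparrot/codeparrot-clean-train", split="train", streaming=True)

def batch_iterator(batch_size=1000):
    batch = []
    for example in dataset:
        batch.append(example["content"])
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch

# Retrain GPT-2's byte-level BPE on code (the vocab size is an assumption here).
base_tokenizer = AutoTokenizer.from_pretrained("gpt2")
code_tokenizer = base_tokenizer.train_new_from_iterator(batch_iterator(), vocab_size=32_768)
code_tokenizer.save_pretrained("codeparrot-tokenizer")
```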
- 🧪Hyperparameters (CodeParrot 1.5B)
- optimizer: AdamW
- betas: 0.9, 0.999
- eps: 1e-8
- batch size: 512 sequences (512 × 1,024 ≈ 524K tokens per batch)
- context window: 1,024
- gradient accumulation steps: 16
- warmup steps: 750
- learning rate: 5e-5
- weight decay: 0.1
- decay schedule: Cosine
- precision (floating point): /
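These settings map directly onto PyTorch's `AdamW` and the 🤗 Transformers `get_scheduler` helper. The sketch below wires them together, using a small GPT-2 config as a stand-in for the 1.5B model and the 30K-step budget from the Training section as the schedule horizon:

```python
import torch
from transformers import AutoConfig, AutoModelForCausalLM, get_scheduler

# Stand-in model so the sketch runs; the real run uses a GPT-2 XL-sized (1.5B) config.
model = AutoModelForCausalLM.from_config(AutoConfig.from_pretrained("gpt2"))

# AdamW with the betas, eps and weight decay listed above.
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=5e-5,
    betas=(0.9, 0.999),
    eps=1e-8,
    weight_decay=0.1,
)

# Cosine decay with 750 warmup steps over the 30K-step run.
lr_scheduler = get_scheduler(
    name="cosine",
    optimizer=optimizer,
    num_warmup_steps=750,
    num_training_steps=30_000,
)

# Effective batch: 512 sequences × 1,024-token context ≈ 524K tokens per optimizer
# step, reached by accumulating gradients over 16 micro-batches.
gradient_accumulation_steps = 16
```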
- 🏃♀️Training
- model initialization: from scratch
- training strategies: left-to-right
- trained tokens/steps: 30K steps or 41B tokens
- hardware: 16 x A100 (40GB)
- training time: /
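A minimal sketch of the from-scratch, left-to-right (causal language modeling) training loop under the settings above; the small GPT-2 config and the random-token dataloader are self-contained stand-ins for the real 1.5B configuration and the tokenized CodeParrot data:

```python
import torch
from transformers import AutoConfig, AutoModelForCausalLM

# From-scratch initialization: a GPT-2-style config with randomly initialized weights.
# The real 1.5B model corresponds to a GPT-2 XL-sized config; "gpt2" keeps the sketch light.
config = AutoConfig.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_config(config)

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5, weight_decay=0.1)
gradient_accumulation_steps = 16

# Stand-in for the real dataloader of fixed-length, 1,024-token sequences.
train_loader = [torch.randint(0, config.vocab_size, (2, 128)) for _ in range(32)]

model.train()
for step, batch in enumerate(train_loader):
    # Left-to-right objective: labels equal the inputs; the model shifts them internally.
    outputs = model(input_ids=batch, labels=batch)
    loss = outputs.loss / gradient_accumulation_steps
    loss.backward()
    if (step + 1) % gradient_accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```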