- 📙Paper: Efficient Training of Language Models to Fill in the Middle
- 📚Publisher: arXiv
- 🏠Author Affiliation: OpenAI
- 🔑Public: ❌
- 🌐Architecture
- Encoder-Decoder
- Decoder-Only
- 📏Model Size: 50M; 77M; 164M; 411M; 844M; 1.4B; 2.8B; 6.9B
- 🗂️Data pre-processing
- Data Resource
- Same as Codex: a 159 GB Python dataset scraped in May 2020.
- De-duplication: ✅
- Filter Strategies
- We filtered out files that were likely auto-generated (see the sketch after this section): average line length greater than 100;
- maximum line length greater than 1000;
- contained a small percentage of alphanumeric characters.
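The filtering heuristics above are simple enough to express directly in code. A minimal sketch, assuming plain-text source files as input; the line-length thresholds come from the list above, while the alphanumeric-fraction cutoff (0.25) and the function names are illustrative assumptions, since the card does not give an exact fraction.

```python
# Sketch of the auto-generated-file filter described above.
# Line-length thresholds are from the card; the alphanumeric-fraction
# cutoff (0.25) is an assumed placeholder, not a value from the paper.
def looks_auto_generated(text: str, alnum_cutoff: float = 0.25) -> bool:
    lines = text.splitlines()
    if not lines:
        return True  # empty files carry no training signal
    lengths = [len(line) for line in lines]
    avg_len = sum(lengths) / len(lengths)
    max_len = max(lengths)
    alnum_fraction = sum(ch.isalnum() for ch in text) / max(len(text), 1)
    return (
        avg_len > 100                     # average line length > 100
        or max_len > 1000                 # maximum line length > 1000
        or alnum_fraction < alnum_cutoff  # few alphanumeric characters
    )

def filter_corpus(files: dict[str, str]) -> dict[str, str]:
    """Keep only files that do not look auto-generated."""
    return {path: text for path, text in files.items()
            if not looks_auto_generated(text)}
```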
- 🍉Tokenizer
- Technology
- Byte-level Byte-Pair-Encoding (BBPE)
- SentencePiece
- Details
- Same as Codex: the GPT-3 tokenizer plus an additional set of tokens for representing whitespace runs of different lengths (see the sketch below).
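A hedged sketch of what the additional whitespace-run tokens can look like in practice, using Hugging Face `transformers` on top of the GPT-2 byte-level BPE vocabulary; the specific run lengths (2 to 25 spaces) are an assumption, as the card does not list them.

```python
# Sketch only: extend a GPT-2-style byte-level BPE tokenizer with dedicated
# tokens for runs of spaces, in the spirit of the Codex tokenizer described
# above. The run lengths (2 to 25 spaces) are assumed, not from the paper.
from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
whitespace_runs = [" " * n for n in range(2, 26)]  # "  ", "   ", ..., 25 spaces
num_added = tokenizer.add_tokens(whitespace_runs)
print(f"added {num_added} whitespace-run tokens")

# Indented code should now spend far fewer tokens on leading whitespace.
print(tokenizer.tokenize("        return x"))

# If these tokens are added to a pretrained model, its embedding matrix must
# grow to match, e.g. model.resize_token_embeddings(len(tokenizer)).
```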
- 🧪Hyperparameters (FIM 6.9B)
- optimizer: Adam
- betas: /
- eps: /
- batch size: 2M tokens
- context window: 2,048 tokens
- gradient accumulation steps: /
- warmup steps: /
- learning rate: 2.4e-4
- weight decay: /
- decay schedule
- Cosine
- Linear
- Polynomial
- Inverse Square
- precision floating point: /
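The optimizer settings above map naturally onto a standard PyTorch setup. A minimal sketch under assumptions: Adam and the 2.4e-4 learning rate are from the card; betas, eps, weight decay, and warmup steps are marked "/" above, so library defaults and a placeholder warmup length are used; cosine decay is one of the listed schedule options and is picked here purely for illustration.

```python
# Sketch of the hyperparameter block above in PyTorch. Known values: Adam,
# lr = 2.4e-4. Assumptions: default betas/eps (the card marks them "/"),
# placeholder warmup_steps, and cosine decay chosen from the listed options.
import math
import torch

def build_optimizer_and_scheduler(model: torch.nn.Module,
                                  total_steps: int,
                                  warmup_steps: int = 1_000):  # placeholder
    optimizer = torch.optim.Adam(model.parameters(), lr=2.4e-4)

    def lr_lambda(step: int) -> float:
        if step < warmup_steps:                       # linear warmup
            return step / max(1, warmup_steps)
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return 0.5 * (1.0 + math.cos(math.pi * progress))  # cosine decay

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
    return optimizer, scheduler
```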
- 🏃‍♀️Training
- model initialization: from scratch
- training strategies
- left-to-right
- fill-in-the-middle (see the sketch after this section)
- trained tokens/steps: 100B tokens
- hardware: /
- training time: /
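The fill-in-the-middle strategy itself is a data transformation rather than an architecture change: with some probability a document is split into a prefix, middle, and suffix, and the pieces are re-ordered as prefix, suffix, middle (PSM) with sentinel tokens, so an ordinary left-to-right model learns to generate the middle last. A minimal character-level sketch follows; the sentinel strings and the 0.5 FIM rate are illustrative placeholders, not the exact tokens or setting from the paper.

```python
# Sketch of document-level fill-in-the-middle (FIM) preprocessing: with
# probability fim_rate, split a document at two random points and emit it in
# PSM order "<PRE> prefix <SUF> suffix <MID> middle"; otherwise leave it as a
# plain left-to-right example. Sentinel strings and the 0.5 rate are assumed.
import random

PRE, SUF, MID = "<PRE>", "<SUF>", "<MID>"  # stand-ins for special vocab tokens

def apply_fim(document: str, fim_rate: float = 0.5,
              rng: random.Random | None = None) -> str:
    rng = rng or random.Random()
    if rng.random() >= fim_rate:
        return document  # untouched left-to-right example
    # Two split points chosen uniformly at random over the document.
    i, j = sorted(rng.randrange(len(document) + 1) for _ in range(2))
    prefix, middle, suffix = document[:i], document[i:j], document[j:]
    # PSM ordering: the middle is moved to the end, so predicting it is just
    # ordinary next-token prediction for a causal decoder.
    return f"{PRE}{prefix}{SUF}{suffix}{MID}{middle}"
```

At inference time, infilling then amounts to prompting the trained model with `<PRE>prefix<SUF>suffix<MID>` and letting it generate the missing middle.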