- 📙Paper: Meet in the Middle: A New Pre-training Paradigm
- 📚Publisher: arXiv
- 🏠Author Affiliation: Microsoft Azure AI
- 🔑Public: ✅
- 🌐Architecture
- Encoder-Decoder
- Decoder-Only
- 📏Model Size: 350M, 1.3B, 2.7B
- 🗂️Data pre-processing
- Data Resource
- We first pre-train our models on a large and diverse corpus of public code with permissive licenses, covering multiple programming languages. Python, Java, and C++ are the dominant languages in our corpus, accounting for most of the pre-training data.
- Apart from pre-training on public code datasets, we also pre-train on natural language, specifically the union of the following datasets: CC-News, OpenWebText, CC-Stories, and CC-100.
- De-duplication: ✅ (see the sketch below)
- Filter Strategies
- Not reported
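The card marks de-duplication as applied but does not record the method. Below is a minimal sketch of exact-match de-duplication by content hash, included only as an illustrative assumption rather than the paper's reported pipeline.

```python
# Minimal sketch: exact-match de-duplication of documents by content hash.
# This is an illustrative assumption, not the paper's reported procedure.
import hashlib

def deduplicate(documents):
    """Keep only the first occurrence of each unique document."""
    seen = set()
    unique_docs = []
    for doc in documents:
        # Hash the stripped content so trivially identical documents collapse.
        digest = hashlib.sha256(doc.strip().encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique_docs.append(doc)
    return unique_docs

corpus = ["def f(x):\n    return x", "def f(x):\n    return x", "print('hi')"]
print(len(deduplicate(corpus)))  # -> 2
```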
- 🍉Tokenizer
- Technology
- Byte-level Byte-Pair-Encoding (BBPE)
- SentencePiece
- Details
- Our tokenizer is based on the Byte-Pair Encoding algorithm widely used in previous work [Codex] to directly encode raw bytes with a vocabulary of 100,257 tokens. We pre-tokenize the text using a special regex pattern that splits on digits and newlines, together with the default GPT-2 pre-tokenization.
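A minimal sketch of a comparable byte-level BPE setup using the Hugging Face `tokenizers` library is shown below. The exact pre-tokenization regex, the training corpus file, and the trainer settings are assumptions for illustration, not the paper's released tokenizer; splitting digits and newlines separately before the GPT-2-style byte-level step only approximates the described pre-tokenization.

```python
# Minimal sketch: training a byte-level BPE tokenizer with the `tokenizers`
# library. Corpus file, vocabulary size target, and the pre-tokenization
# composition are illustrative assumptions, not the paper's exact recipe.
from tokenizers import Tokenizer, models, pre_tokenizers, trainers, decoders

tokenizer = Tokenizer(models.BPE())
# Split digits and newlines into their own pre-tokens, then apply the
# GPT-2-style byte-level pre-tokenizer.
tokenizer.pre_tokenizer = pre_tokenizers.Sequence([
    pre_tokenizers.Digits(individual_digits=True),
    pre_tokenizers.Split(pattern="\n", behavior="isolated"),
    pre_tokenizers.ByteLevel(add_prefix_space=False),
])
tokenizer.decoder = decoders.ByteLevel()

trainer = trainers.BpeTrainer(
    vocab_size=100257,  # matches the reported vocabulary size
    initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),
    show_progress=True,
)
tokenizer.train(files=["corpus.txt"], trainer=trainer)  # hypothetical corpus file

print(tokenizer.encode("def add(a, b):\n    return a + b").tokens)
```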
- 🧪Hyperparameters (MIM 2.7B)
- optimizer: Adam
- betas: 0.9, 0.95
- eps: 1e-8
- batch size: 1M tokens
- context window: 2,048
- gradient accumulation steps: /
- warmup steps: 5,000
- learning rate: 2e-4
- weight decay: 0.1
- decay schedule
- Cosine
- Linear
- Polynomial
- Inverse Square
- precision floating point: fp32, fp16
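A minimal PyTorch sketch of the reported optimizer settings follows. The placeholder model, the linear-warmup-then-cosine schedule shape, and the decay floor are assumptions; only the numeric hyperparameters come from the card above, and cosine is assumed among the listed decay schedules.

```python
# Minimal sketch of the reported optimizer settings in PyTorch. Only the
# numeric hyperparameters are from the card; everything else is assumed.
import math
import torch

model = torch.nn.Linear(2048, 2048)  # placeholder, not the 2.7B-parameter model

# The card reports Adam; note PyTorch's Adam applies weight decay as an L2
# penalty, whereas decoupled AdamW is a common alternative.
optimizer = torch.optim.Adam(
    model.parameters(),
    lr=2e-4,
    betas=(0.9, 0.95),
    eps=1e-8,
    weight_decay=0.1,
)

warmup_steps = 5_000
total_steps = 286_000  # reported training length

def lr_lambda(step):
    # Linear warmup to the peak LR, then cosine decay to 10% of the peak
    # (the 10% floor is an assumption, not reported in the card).
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.1 + 0.45 * (1.0 + math.cos(math.pi * min(1.0, progress)))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
```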
- 🏃‍♀️Training
- model initialization: from scratch
- training strategies
- left-to-right
- fill-in-the-middle (see the sketch after this list)
- trained tokens/steps: 286K steps
- hardware: 128 A100 GPUs with 80 GB memory
- training time: about 4 days
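The card lists both left-to-right and fill-in-the-middle training; for scale, 286K steps at a 1M-token batch size correspond to roughly 286B training tokens. As a rough illustration only, the sketch below shows a generic fill-in-the-middle data transformation that reorders a document as prefix + suffix + middle with assumed sentinel tokens. This is not the paper's own formulation, which trains forward and backward decoders that agree and "meet in the middle" at inference.

```python
# Minimal sketch of a generic fill-in-the-middle (FIM) data transformation:
# split a document into (prefix, middle, suffix) and reorder it with sentinel
# tokens so a left-to-right model learns to infill. The sentinel strings and
# the split policy are illustrative assumptions, not the paper's exact setup.
import random

PRE, SUF, MID = "<fim_prefix>", "<fim_suffix>", "<fim_middle>"  # assumed sentinels

def to_fim_example(document, rng=random):
    """Rearrange a document as prefix + suffix + middle for infilling training."""
    if len(document) < 3:
        return document  # too short to split; fall back to plain left-to-right
    i, j = sorted(rng.sample(range(1, len(document)), 2))
    prefix, middle, suffix = document[:i], document[i:j], document[j:]
    return f"{PRE}{prefix}{SUF}{suffix}{MID}{middle}"

random.seed(0)
print(to_fim_example("def add(a, b):\n    return a + b"))
```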