- 📙Paper: SantaCoder: don’t reach for the stars!
- 📚Publisher: arXiv
- 🏠Author Affiliation: Hugging Face
- 🔑Public: ✅
- 🌐Architecture
- Encoder-Decoder
- Decoder-Only ✅
- 📏Model Size: 1.1B
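Since the checkpoint is public, here is a minimal loading sketch with 🤗 Transformers, assuming the `bigcode/santacoder` weights on the Hugging Face Hub (the repo ships custom modeling code, hence `trust_remote_code=True`):

```python
# A minimal sketch, assuming the public `bigcode/santacoder` checkpoint on the
# Hugging Face Hub; not taken from the paper itself.
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "bigcode/santacoder"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint, trust_remote_code=True)

inputs = tokenizer("def fibonacci(n):", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0]))
```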
- 🗂️Data pre-processing
- Data Resource
- The Stack v1.1
- De-duplication: ✅
- Filter Strategies
- removing data from opt-out requests, near-deduplication, and PII redaction
- filtering based on line length and the percentage of alphanumeric characters (see the sketch after this list)
- removing files that contain test samples from the following benchmarks: HumanEval, APPS, MBPP, MultiPL-E
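A minimal sketch of the line-length / alphanumeric-fraction filter mentioned above; the thresholds are illustrative placeholders, not the paper's exact cutoffs:

```python
def passes_basic_filters(code: str,
                         max_line_length: int = 1000,
                         min_alnum_fraction: float = 0.25) -> bool:
    """File-level filter: reject files with extremely long lines or with too
    small a share of alphanumeric characters. The thresholds here are
    hypothetical placeholders, not the values used for SantaCoder."""
    if not code:
        return False
    if max(len(line) for line in code.splitlines()) > max_line_length:
        return False
    alnum_fraction = sum(ch.isalnum() for ch in code) / len(code)
    return alnum_fraction >= min_alnum_fraction


print(passes_basic_filters("def add(a, b):\n    return a + b\n"))  # True
```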
- 🍉Tokenizer
- Technology
- Byte-level Byte-Pair-Encoding (BBPE) ✅
- SentencePiece
- Details
- We train a Hugging Face Tokenizer on 600k rows (around 2.6 GB) of data (see the sketch below).
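A minimal sketch of training a byte-level BPE tokenizer with the Hugging Face `tokenizers` library, in the spirit of the detail quoted above; the vocabulary size and special-token names are assumptions for illustration:

```python
from tokenizers import Tokenizer, decoders, models, pre_tokenizers, trainers

def train_code_tokenizer(text_iterator, vocab_size=49_152):
    """Train a byte-level BPE tokenizer on an iterator over source files.
    vocab_size and the special tokens are assumptions, not values confirmed
    by this outline."""
    tokenizer = Tokenizer(models.BPE())
    tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
    tokenizer.decoder = decoders.ByteLevel()
    trainer = trainers.BpeTrainer(
        vocab_size=vocab_size,
        special_tokens=["<|endoftext|>", "<fim-prefix>",
                        "<fim-middle>", "<fim-suffix>", "<fim-pad>"],
    )
    tokenizer.train_from_iterator(text_iterator, trainer=trainer)
    return tokenizer
```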
- 🧪Hyperparameters (SantaCoder 1.1B)
- optimizer: Adam (see the sketch after this list)
- betas: 0.9, 0.95
- eps: 1e-8
- global batch-size: 192
- context window: /
- gradient accumulation steps: /
- warmup steps: /
- learning rate: 2e-4
- weight decay: 0.1
- decay schedule
- Cosine
- Linear
- Polynomial
- Inverse Square
- floating-point precision: fp16
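A minimal sketch of the optimizer setup implied by the values above, assuming a PyTorch training loop (the stand-in module is only there to make the snippet runnable):

```python
import torch
from torch import nn

# Stand-in module for illustration; in practice this is the 1.1B decoder.
model = nn.Linear(8, 8)

optimizer = torch.optim.Adam(
    model.parameters(),
    lr=2e-4,            # learning rate from the table above
    betas=(0.9, 0.95),
    eps=1e-8,
    weight_decay=0.1,
)
# fp16 training would additionally wrap forward/backward in torch.cuda.amp
# autocast and scale gradients with a GradScaler.
```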
- 🏃‍♀️Training
- model initialization: from scratch
- training strategies
- left-to-right
- fill-in-the-middle (see the sketch after this section)
- trained tokens/steps: 118B tokens
- hardware: 96 Tesla V100 GPUs
- training time: 3.1 days per training run
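A minimal sketch of the fill-in-the-middle document transformation referenced in the training strategies, applied at the character level before tokenization; the sentinel strings and the 0.5 FIM rate are assumptions for illustration:

```python
import random

# Sentinel strings are assumptions for illustration; in training they are
# special tokens in the tokenizer's vocabulary.
FIM_PREFIX, FIM_MIDDLE, FIM_SUFFIX = "<fim-prefix>", "<fim-middle>", "<fim-suffix>"

def apply_fim(document: str, fim_rate: float = 0.5,
              rng: random.Random = random.Random(1)) -> str:
    """With probability `fim_rate`, rearrange a document into
    prefix-suffix-middle (PSM) order so the model learns to infill;
    otherwise keep it unchanged for ordinary left-to-right training."""
    if len(document) < 2 or rng.random() >= fim_rate:
        return document
    # Two random cut points split the document into prefix / middle / suffix.
    lo, hi = sorted(rng.sample(range(len(document)), 2))
    prefix, middle, suffix = document[:lo], document[lo:hi], document[hi:]
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}{middle}"

print(apply_fim("def add(a, b):\n    return a + b\n"))
```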