- 📙Paper: LLaMA: Open and Efficient Foundation Language Models
- 📚Publisher: arXiv
- 🏠Author Affiliation: Meta AI
- 🔑Public: √
- 🌐Architecture
- Decoder-Only
- 📏Model Size: 6.7B, 13.0B, 32.5B, 65.2B
- 🗂️Data pre-processing
- Data Resource
- English CommonCrawl (67%), C4 (15%), GitHub via Google BigQuery (4.5%), Wikipedia (4.5%), Gutenberg and Books3 (4.5%), arXiv (2.5%), Stack Exchange (2%)
- De-duplication: √
- Filter Strategies
- CommonCrawl data is preprocessed with the CCNet pipeline: line-level deduplication, fastText language identification to keep English pages, and n-gram language-model quality filtering.
- GitHub files are filtered with heuristics on line length and the proportion of alphanumeric characters, then deduplicated at the file level with exact matches (a dedup sketch follows this section).
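A minimal sketch of the exact-match, file-level deduplication described above, using content hashing; the `corpus_dir` argument and the `dedup_files_exact` helper name are illustrative, not from the paper.

```python
import hashlib
from pathlib import Path

def dedup_files_exact(corpus_dir: str) -> list[Path]:
    """Keep the first copy of every file; drop later files whose bytes are an exact match."""
    seen_hashes: set[str] = set()
    kept: list[Path] = []
    for path in sorted(Path(corpus_dir).rglob("*")):
        if not path.is_file():
            continue
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        if digest in seen_hashes:
            continue  # exact duplicate of an earlier file
        seen_hashes.add(digest)
        kept.append(path)
    return kept
```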
- 🍉Tokenizer
- Technology
- Byte-Pair Encoding (BPE), implemented with SentencePiece; numbers are split into individual digits and unknown UTF-8 characters fall back to bytes (a tokenizer sketch follows this section)
- Details
- /
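A minimal sketch of training a comparable SentencePiece BPE tokenizer with digit splitting and byte fallback; the corpus file and model prefix are placeholders, while 32,000 matches the vocabulary size of the released LLaMA tokenizer.

```python
import sentencepiece as spm

# Train a BPE model with the two options the paper highlights:
# split every number into digits and fall back to bytes for unknown UTF-8 characters.
spm.SentencePieceTrainer.train(
    input="corpus.txt",             # placeholder training corpus
    model_prefix="llama_like_bpe",  # placeholder output prefix
    vocab_size=32000,
    model_type="bpe",
    split_digits=True,
    byte_fallback=True,
)

sp = spm.SentencePieceProcessor(model_file="llama_like_bpe.model")
print(sp.encode("LLaMA was trained on 1.4T tokens.", out_type=str))
```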
- 🧪Hyperparameters (LLaMA 65.2B)
- optimizer: AdamW
- betas: 0.9, 0.95
- eps: /
- batch size: 4M tokens
- context window: /
- gradient accumulation steps: /
- warmup steps: 2,000
- learning rate: 1.5e-4
- weight decay: 0.1
- decay schedule
- Cosine (final learning rate set to 10% of the peak; see the sketch after this hyperparameter list)
- precision floating point: /
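A minimal PyTorch sketch of the optimizer and learning-rate schedule implied by the hyperparameters above: AdamW with betas (0.9, 0.95) and weight decay 0.1, 2,000 warmup steps, and cosine decay to 10% of the 1.5e-4 peak. The placeholder model and the derivation of `total_steps` from 1.4T tokens at 4M tokens per batch are assumptions for illustration.

```python
import math
import torch

model = torch.nn.Linear(4096, 4096)  # placeholder; any nn.Module works here

peak_lr = 1.5e-4
warmup_steps = 2_000
total_steps = 1_400_000_000_000 // 4_000_000  # 1.4T tokens / 4M tokens per batch = 350,000 steps

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=peak_lr,
    betas=(0.9, 0.95),
    weight_decay=0.1,
)

def lr_lambda(step: int) -> float:
    """Linear warmup, then cosine decay down to 10% of the peak learning rate."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.1 + 0.9 * 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
```

In a training loop, `scheduler.step()` would be called after each optimizer step, with gradients clipped to 1.0 as reported in the paper.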
- 🏃♀️Training
- model initialization: /
- training strategies
- left-to-right (standard autoregressive language modeling)
- trained tokens/steps: 1.4T tokens (≈ 350,000 steps at a 4M-token batch size)
- hardware: 2,048 A100 GPUs with 80GB of RAM
- training time: approximately 21 days to train over the 1.4T-token dataset (a throughput sanity check follows this section)
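A quick sanity check of these figures, assuming fully continuous training: dividing 1.4T tokens by 21 days on 2,048 GPUs gives roughly 377 tokens/sec/GPU, consistent with the ~380 tokens/sec/GPU the paper reports for the 65B model.

```python
# Implied per-GPU throughput from the numbers above (assumes no downtime).
total_tokens = 1.4e12
num_gpus = 2048
seconds = 21 * 24 * 3600  # 21 days

tokens_per_sec_per_gpu = total_tokens / (num_gpus * seconds)
print(f"{tokens_per_sec_per_gpu:.0f} tokens/sec/GPU")  # ≈ 377
```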