- 📙Paper: LaMDA: Language Models for Dialog Applications
- 📚Publisher: arXiv
- 🏠Author Affiliation: Google
- 🔑Public: ❌
- 🌐Architecture
- Decoder-Only (causal, left-to-right; see the sketch below)
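LaMDA is a decoder-only Transformer, i.e. each position attends only to itself and earlier positions, which is what allows plain left-to-right language-model pre-training. Below is a minimal, framework-free sketch of that causal masking; it is illustrative only, the shapes, values, and function name are made up and nothing here is taken from the paper's implementation.

```python
# Minimal sketch of causal (decoder-only) attention masking.
import numpy as np

def causal_attention_weights(scores: np.ndarray) -> np.ndarray:
    """scores: [seq_len, seq_len] raw attention logits for one head."""
    seq_len = scores.shape[0]
    # Upper-triangular positions (future tokens) are masked out.
    mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    masked = np.where(mask, -np.inf, scores)
    # Softmax over the key dimension.
    masked = masked - masked.max(axis=-1, keepdims=True)
    weights = np.exp(masked)
    return weights / weights.sum(axis=-1, keepdims=True)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = causal_attention_weights(rng.normal(size=(4, 4)))
    print(np.round(w, 2))  # row i has zero weight on every position j > i
```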
- 📏Model Size: 2B, 8B, 137B
- 🗂️Data pre-processing
- Data Resource
- We pre-trained LaMDA on a dataset created from public dialog data and other public web documents. The pre-training dataset consists of 2.97B documents, 1.12B dialogs, and 13.39B dialog utterances, for a total of 1.56T words (quick size arithmetic after this list).
- De-duplication: /
- Filter Strategies
- /
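The corpus statistics above allow a quick sanity check on the scale of the data. The averages below are derived from the paper's figures, not reported by the authors.

```python
# Quick sanity arithmetic on the reported pre-training corpus statistics.
# Inputs are the paper's figures quoted above; the averages are derived here.
documents = 2.97e9
dialogs = 1.12e9
dialog_utterances = 13.39e9
total_words = 1.56e12

print(f"~{dialog_utterances / dialogs:.1f} utterances per dialog")    # ~12.0
print(f"~{total_words / (documents + dialogs):.0f} words per item")   # ~381 (documents + dialogs combined)
```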
- 🍉Tokenizer
- Technology
- SentencePiece (see the usage sketch below)
- Details
- /
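The pre-training corpus is tokenized with SentencePiece. As a generic illustration (not an artifact released with the paper), this is how a SentencePiece model is typically trained and applied with the `sentencepiece` Python package; the corpus file, model prefix, and vocabulary size below are placeholders.

```python
# Generic SentencePiece usage (illustrative; corpus.txt, the model prefix, and
# the vocab size are placeholders, not LaMDA artifacts).
import sentencepiece as spm

# Train a subword model on a plain-text corpus, one sentence per line.
# The vocab size must be supported by the corpus; large corpora use ~32K here.
spm.SentencePieceTrainer.train(
    input="corpus.txt",
    model_prefix="lamda_demo",
    vocab_size=32000,
)

# Load the trained model and tokenize / detokenize text.
sp = spm.SentencePieceProcessor(model_file="lamda_demo.model")
ids = sp.encode("Hello, how are you today?")
pieces = sp.encode("Hello, how are you today?", out_type=str)
print(ids)
print(pieces)
print(sp.decode(ids))
```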
- 🧪Hyperparameters (LaMDA 137B)
- optimizer: /
- betas: /
- eps: /
- batch size: 256K tokens (see the conversion sketch after this list)
- context window: /
- gradient accumulation steps: /
- warmup steps: /
- learning rate: /
- weight decay: /
- decay schedule: /
- precision floating point: /
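Of these hyperparameters, only the 256K-tokens-per-batch figure is reported. A minimal sketch of how a token-denominated batch converts into sequences per step, under an assumed context window (marked "/" above, so the value here is purely illustrative):

```python
# Rough conversion from a token-denominated batch to sequences per batch.
# The 256K figure comes from the paper (read here as 2**18 tokens);
# the context window is an assumption for illustration only.
TOKENS_PER_BATCH = 256 * 1024          # 256K tokens per optimizer step
ASSUMED_CONTEXT_WINDOW = 2048          # hypothetical sequence length

sequences_per_batch = TOKENS_PER_BATCH // ASSUMED_CONTEXT_WINDOW
print(sequences_per_batch)             # -> 128 sequences per step under this assumption
```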
- 🏃‍♀️Training
- model initialization: /
- training strategies
- left-to-right
- trained tokens/steps: 2.81T tokens (rough step-count and throughput arithmetic after this list)
- hardware: LaMDA 137B: 1024 TPU-v3 chips
- training time: 57.7 days
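Combining the reported figures gives a rough, unofficial estimate of the step count and training throughput. This is a sketch: only the inputs come from the paper, 256K is read as 2^18 tokens per batch, and the derived numbers are approximations.

```python
# Back-of-the-envelope training arithmetic from the figures in this card.
# Inputs are the paper's reported numbers; everything derived is approximate.
total_tokens = 2.81e12          # 2.81T SentencePiece tokens seen in pre-training
tokens_per_batch = 256 * 1024   # 256K tokens per optimizer step (assumed binary K)
chips = 1024                    # TPU-v3 chips
days = 57.7                     # reported wall-clock training time

steps = total_tokens / tokens_per_batch
tokens_per_second = total_tokens / (days * 24 * 3600)

print(f"~{steps / 1e6:.1f}M optimizer steps")              # ~10.7M steps
print(f"~{tokens_per_second / 1e3:.0f}K tokens/second")    # ~564K tokens/s overall
print(f"~{tokens_per_second / chips:.0f} tokens/s/chip")   # ~550 tokens/s per chip
```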