- 📙Paper: LaMDA: Language Models for Dialog Applications
- 📚Publisher: arXiv
- 🏠Author Affiliation: Google
- 🔑Public: ❌
- 🌐Architecture
- Decoder-Only (causal, left-to-right; see the sketch below)
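LaMDA is a decoder-only Transformer, i.e. each position attends only to itself and earlier positions, which is what allows plain left-to-right language-model pre-training. Below is a minimal, framework-free sketch of that causal masking; it is illustrative only, the shapes, values, and function name are made up and nothing here is taken from the paper's implementation.

```python
# Minimal sketch of causal (decoder-only) attention masking.
import numpy as np

def causal_attention_weights(scores: np.ndarray) -> np.ndarray:
    """scores: [seq_len, seq_len] raw attention logits for one head."""
    seq_len = scores.shape[0]
    # Upper-triangular positions (future tokens) are masked out.
    mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    masked = np.where(mask, -np.inf, scores)
    # Softmax over the key dimension.
    masked = masked - masked.max(axis=-1, keepdims=True)
    weights = np.exp(masked)
    return weights / weights.sum(axis=-1, keepdims=True)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = causal_attention_weights(rng.normal(size=(4, 4)))
    print(np.round(w, 2))  # row i has zero weight on every position j > i
```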
- 📏Model Size: 2B, 8B, 137B
- 🗂️Data pre-processing
- Data Resource
- We pre-trained LaMDA on a dataset created from public dialog data and other public web documents. The pre-training dataset consists of 2.97B documents, 1.12B dialogs, and 13.39B dialog utterances, for a total of 1.56T words (quick size arithmetic after this list).
- De-duplication: /
- Filter Strategies
- /
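The corpus statistics above allow a quick sanity check on the scale of the data. The averages below are derived from the paper's figures, not reported by the authors.

```python
# Quick sanity arithmetic on the reported pre-training corpus statistics.
# Inputs are the paper's figures quoted above; the averages are derived here.
documents = 2.97e9
dialogs = 1.12e9
dialog_utterances = 13.39e9
total_words = 1.56e12

print(f"~{dialog_utterances / dialogs:.1f} utterances per dialog")    # ~12.0
print(f"~{total_words / (documents + dialogs):.0f} words per item")   # ~381 (documents + dialogs combined)
```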
- 🍉Tokenizer
- Technology
- SentencePiece (see the usage sketch below)
- Details
- /
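The pre-training corpus is tokenized with SentencePiece. As a generic illustration (not an artifact released with the paper), this is how a SentencePiece model is typically trained and applied with the `sentencepiece` Python package; the corpus file, model prefix, and vocabulary size below are placeholders.

```python
# Generic SentencePiece usage (illustrative; corpus.txt, the model prefix, and
# the vocab size are placeholders, not LaMDA artifacts).
import sentencepiece as spm

# Train a subword model on a plain-text corpus, one sentence per line.
# The vocab size must be supported by the corpus; large corpora use ~32K here.
spm.SentencePieceTrainer.train(
    input="corpus.txt",
    model_prefix="lamda_demo",
    vocab_size=32000,
)

# Load the trained model and tokenize / detokenize text.
sp = spm.SentencePieceProcessor(model_file="lamda_demo.model")
ids = sp.encode("Hello, how are you today?")
pieces = sp.encode("Hello, how are you today?", out_type=str)
print(ids)
print(pieces)
print(sp.decode(ids))
```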
- 🧪Hyperparameters (LaMDA 137B)
- optimizer: /
- betas: /
- eps: /
- batch size: 256K tokens (see the conversion sketch after this list)
- context window: /
- gradient accumulation steps: /
- warmup steps: /
- learning rate: /
- weight decay: /
- decay schedule: /
- precision floating point: /
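Of these hyperparameters, only the 256K-tokens-per-batch figure is reported. A minimal sketch of how a token-denominated batch converts into sequences per step, under an assumed context window (marked "/" above, so the value here is purely illustrative):

```python
# Rough conversion from a token-denominated batch to sequences per batch.
# The 256K figure comes from the paper (read here as 2**18 tokens);
# the context window is an assumption for illustration only.
TOKENS_PER_BATCH = 256 * 1024          # 256K tokens per optimizer step
ASSUMED_CONTEXT_WINDOW = 2048          # hypothetical sequence length

sequences_per_batch = TOKENS_PER_BATCH // ASSUMED_CONTEXT_WINDOW
print(sequences_per_batch)             # -> 128 sequences per step under this assumption
```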
- 🏃‍♀️Training
- model initialization: /
- training strategies
- left-to-right
- trained tokens/steps: 2.81T tokens (rough step-count and throughput arithmetic after this list)
- hardware: LaMDA 137B: 1024 TPU-v3 chips
- training time: 57.7 days
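Combining the reported figures gives a rough, unofficial estimate of the step count and training throughput. This is a sketch: only the inputs come from the paper, 256K is read as 2^18 tokens per batch, and the derived numbers are approximations.

```python
# Back-of-the-envelope training arithmetic from the figures in this card.
# Inputs are the paper's reported numbers; everything derived is approximate.
total_tokens = 2.81e12          # 2.81T SentencePiece tokens seen in pre-training
tokens_per_batch = 256 * 1024   # 256K tokens per optimizer step (assumed binary K)
chips = 1024                    # TPU-v3 chips
days = 57.7                     # reported wall-clock training time

steps = total_tokens / tokens_per_batch
tokens_per_second = total_tokens / (days * 24 * 3600)

print(f"~{steps / 1e6:.1f}M optimizer steps")              # ~10.7M steps
print(f"~{tokens_per_second / 1e3:.0f}K tokens/second")    # ~564K tokens/s overall
print(f"~{tokens_per_second / chips:.0f} tokens/s/chip")   # ~550 tokens/s per chip
```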