
LaMDA

  • 📙Paper: LaMDA: Language Models for Dialog Applications
  • 📚Publisher: arXiv
  • 🏠Author Affiliation: Google
  • 🔑Public: ❌
  • 🌐Architecture
    • Decoder-Only (see the parameter-count sketch at the end of this post)
  • 📏Model Size
    • 2B; 8B; 137B
  • 🗂️Data pre-processing
    • Data Resource
      • LaMDA was pre-trained on a dataset created from public dialog data and other public web documents. The pre-training dataset consists of 2.97B documents, 1.12B dialogs, and 13.39B dialog utterances, for a total of 1.56T words (see the dataset arithmetic sketch at the end of this post).
    • De-duplication: /
    • Filter Strategies
      • /
  • 🍉Tokenizer
    • Technology
      • SentencePiece (BPE; see the tokenizer sketch at the end of this post)
    • Details
      • 32K-token SentencePiece vocabulary
  • 🧪Hyperparameters (LaMDA 137B)
    • optimizer: /
      • betas: /
      • eps: /
    • batch size: 256K tokens
    • context window: /
    • gradient accumulation steps: /
    • warmup steps: /
    • learning rate: /
    • weight decay: /
    • decay schedule
      • Cosine
      • Linear
      • Polynomial
      • Inverse Square
    • precision floating point: /
  • 🏃‍♀️Training
    • model initialization: /
    • training strategies
      • left-to-right
    • trained tokens/steps: 3M steps (see the training-budget sketch at the end of this post)
    • hardware: 1024 TPU-v3 chips (LaMDA 137B)
    • training time: 57.7 days
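
A quick sanity check on the architecture and model-size bullets above: the paper reports that the 137B model is a decoder-only Transformer with 64 layers, d_model = 8192, d_ff = 65536, 128 attention heads with d_k = d_v = 128, and gated-GELU feed-forward layers. The sketch below is back-of-envelope arithmetic on those shapes (ignoring embeddings, biases, layer norms, and relative-attention parameters), not the authors' code.

```python
# Back-of-envelope non-embedding parameter count for the LaMDA 137B shape
# reported in the paper: 64 layers, d_model = 8192, d_ff = 65536,
# 128 heads with d_k = d_v = 128, gated-GELU feed-forward.
n_layers = 64
d_model = 8192
d_ff = 65536
n_heads = 128
d_head = 128  # d_k = d_v

# Attention: Q, K, V, and output projections, each d_model x (n_heads * d_head).
attn_params = 4 * d_model * n_heads * d_head

# Gated-GELU feed-forward: input, gate, and output projections.
ffn_params = 3 * d_model * d_ff

total = n_layers * (attn_params + ffn_params)
print(f"{total / 1e9:.1f}B non-embedding parameters")  # ~137.4B
```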
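
The dataset figures above also admit a couple of derived averages; the sketch below is plain arithmetic on the numbers listed in the card, nothing more.

```python
# Rough averages implied by the pre-training dataset figures above.
documents = 2.97e9    # public web documents
dialogs = 1.12e9      # public dialogs
utterances = 13.39e9  # dialog utterances
words = 1.56e12       # total words in the corpus

print(f"avg. utterances per dialog: {utterances / dialogs:.1f}")                   # ~12
print(f"avg. words per document or dialog: {words / (documents + dialogs):.0f}")   # ~381
```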
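
For the tokenizer bullet, here is a minimal sketch using the open-source SentencePiece library; the paper reports a 32K-token SentencePiece (BPE) vocabulary, but the corpus path, model prefix, and sample text below are hypothetical placeholders, and this is not the authors' actual configuration.

```python
import sentencepiece as spm

# Train a 32K BPE SentencePiece model on a plain-text corpus.
# "corpus.txt" and "lamda_sp" are hypothetical placeholders.
spm.SentencePieceTrainer.train(
    input="corpus.txt",
    model_prefix="lamda_sp",
    vocab_size=32000,
    model_type="bpe",
)

# Load the trained model and tokenize a sample dialog utterance.
sp = spm.SentencePieceProcessor(model_file="lamda_sp.model")
text = "Hello! How can I help you today?"
print(sp.encode(text, out_type=str))  # subword pieces
print(sp.encode(text, out_type=int))  # token ids
```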
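
Finally, a training-budget sketch that ties the hyperparameter and training bullets together: tokens processed follow from steps × batch size, and the common ≈6·N·D rule of thumb turns parameters and tokens into an approximate FLOP count. This is arithmetic on the card's own figures plus a standard approximation, not numbers quoted from the paper.

```python
# Back-of-envelope training-budget arithmetic for LaMDA 137B,
# taking the card's figures at face value.
params = 137e9          # model parameters
steps = 3e6             # pre-training steps (as listed above)
batch_tokens = 256e3    # tokens per batch ("256K", treating K as 1,000)
chips = 1024            # TPU-v3 chips
days = 57.7             # wall-clock training time

tokens = steps * batch_tokens   # tokens processed during pre-training
flops = 6 * params * tokens     # common ~6*N*D training-FLOPs approximation
chip_days = chips * days        # accelerator budget

print(f"tokens processed : {tokens:.2e}")     # ~7.7e+11
print(f"approx. FLOPs    : {flops:.2e}")      # ~6.3e+23
print(f"TPU-v3 chip-days : {chip_days:.0f}")  # ~59,085
```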
