
GPT-NeoX

  • 📙Paper: GPT-NeoX-20B: An Open-Source Autoregressive Language Model
  • 📚Publisher: ACL
  • 🏠Author Affiliation: EleutherAI
  • 🔑Public: ✅
  • 🌐Architecture
    • Decoder-Only
  • 📏Model Size
    • 20B
  • 🗂️Data pre-processing
    • Data Resource
      • The Pile
    • De-duplication: ✅ (applied to the whole Pile, not only the code data)
    • Filter Strategies
      • /
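
The de-duplication entry above gives no implementation detail, so the following is only a generic illustration of exact-match de-duplication over a list of documents; it is a minimal sketch, not EleutherAI's actual Pile pipeline.

```python
import hashlib

def deduplicate(documents):
    """Drop exact duplicates by hashing whitespace-normalized text.

    Illustrative only: production pipelines typically use fuzzy
    matching (e.g. MinHash) rather than exact hashes.
    """
    seen = set()
    unique_docs = []
    for doc in documents:
        # Normalize whitespace so trivially reformatted copies collide.
        digest = hashlib.sha256(" ".join(doc.split()).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique_docs.append(doc)
    return unique_docs

if __name__ == "__main__":
    corpus = ["a b  c", "a b c", "d e f"]
    print(deduplicate(corpus))  # ['a b  c', 'd e f']
```
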
  • 🍉Tokenizer
    • Technology
      • Byte-level Byte-Pair-Encoding (BBPE)
    • Details
      • We use a BPE-based tokenizer similar to that used in GPT-2, with the same total vocabulary size of 50257, with three major changes to the tokenizer:
      • 1) we train a new BPE tokenizer based on the Pile;
      • 2) the tokenizer applies consistent space delimitation regardless of whether a token appears at the start of a string (the GPT-2 tokenizer treats the start of a string differently);
      • 3) our tokenizer contains dedicated tokens for runs of repeated spaces, which helps when tokenizing whitespace-heavy text such as source code (see the tokenizer sketch after this section).
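
The whitespace handling described in points 2) and 3) can be checked against the released tokenizer. A minimal sketch, assuming the `transformers` library is installed and the `EleutherAI/gpt-neox-20b` and `gpt2` tokenizers can be fetched from the Hugging Face Hub:

```python
from transformers import AutoTokenizer

# GPT-NeoX-20B tokenizer vs. the GPT-2 tokenizer it is modeled on.
neox_tok = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")
gpt2_tok = AutoTokenizer.from_pretrained("gpt2")

# Runs of spaces (e.g. code indentation) are covered by dedicated tokens
# in the GPT-NeoX-20B vocabulary, so they split into fewer pieces.
line = "        return x + 1"
print("GPT-NeoX-20B:", neox_tok.tokenize(line))
print("GPT-2:       ", gpt2_tok.tokenize(line))
```
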
  • 🧪Hyperparameters (GPT-NeoX-20B)
    • optimizer: AdamW (with ZeRO)
      • betas: 0.9, 0.95
      • eps: 1e-8
    • batch size: approximately 3.15M tokens
    • context window: 2,048
    • gradient accumulation steps: 32
    • warmup steps: /
    • learning rate: 9.7e-5
    • weight decay: 0.01
    • decay schedule
      • Cosine
    • floating-point precision: fp16
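
For reference, the optimizer settings above map onto a plain PyTorch setup roughly as follows. This is only a sketch: the real run shards optimizer states with ZeRO across GPUs, and the warmup and decay floor are not listed in this post, so the schedule below simply decays cosine-style from the peak learning rate; `model` and `total_steps` are placeholders.

```python
import math
import torch

model = torch.nn.Linear(512, 512)  # placeholder, not the 20B model

# AdamW with the betas / eps / weight decay listed above.
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=9.7e-5,
    betas=(0.9, 0.95),
    eps=1e-8,
    weight_decay=0.01,
)

total_steps = 150_000  # from the training section below

# Cosine decay from the peak learning rate over the full run
# (warmup and a possible minimum-LR floor are omitted here).
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer,
    lr_lambda=lambda step: 0.5 * (1 + math.cos(math.pi * min(step, total_steps) / total_steps)),
)
```
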
  • 🏃‍♀️Training
    • model initialization: from scratch
    • training strategies
      • left-to-right (standard autoregressive language modeling)
    • trained tokens/steps: 150K steps (~472B tokens at ~3.15M tokens per step)
    • hardware: We trained GPT-NeoX-20B on twelve Supermicro AS-4124GO-NART servers, each with eight NVIDIA A100-SXM4-40GB GPUs and configured with two AMD EPYC 7532 CPUs. All GPUs can directly access the InfiniBand switched fabric through one of four ConnectX-6 HCAs for GPUDirect RDMA. Two NVIDIA MQM8700-HS2R switches, connected by 16 links, compose the spine of this InfiniBand network, with one link per node CPU socket connected to each switch.
    • training time: 1,830 hours
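
Combining the figures above gives a rough sense of the run's scale. This is simple arithmetic on the numbers listed in this post; treating the 1,830 hours as wall-clock time is an assumption.

```python
# Rough scale of the training run, derived from the numbers in this post.
servers = 12
gpus_per_server = 8
total_gpus = servers * gpus_per_server          # 96 A100-40GB GPUs

tokens_per_step = 3.15e6                        # approximate batch size in tokens
steps = 150_000
total_tokens = tokens_per_step * steps          # ~472B tokens

wall_clock_hours = 1_830                        # assumed to be wall-clock time
gpu_hours = total_gpus * wall_clock_hours       # ~176K A100 GPU-hours

print(f"{total_gpus} GPUs, ~{total_tokens / 1e9:.0f}B tokens, ~{gpu_hours:,} GPU-hours")
```
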
