
GPT-NeoX

  • 📙Paper: GPT-NeoX-20B: An Open-Source Autoregressive Language Model
  • 📚Publisher: ACL
  • 🏠Author Affiliation: EleutherAI
  • 🔑Public: ✅
  • 🌐Architecture
    • Decoder-Only
  • 📏Model Size
    • 20B
  • 🗂️Data pre-processing
    • Data Resource
      • The Pile
    • De-duplication: ✅ (applied to the whole Pile, not only the code data)
    • Filter Strategies
      • /
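
The de-duplication entry above gives no implementation detail, so the following is only a generic illustration of exact-match de-duplication over a list of documents; it is a minimal sketch, not EleutherAI's actual Pile pipeline.

```python
import hashlib

def deduplicate(documents):
    """Drop exact duplicates by hashing whitespace-normalized text.

    Illustrative only: production pipelines typically use fuzzy
    matching (e.g. MinHash) rather than exact hashes.
    """
    seen = set()
    unique_docs = []
    for doc in documents:
        # Normalize whitespace so trivially reformatted copies collide.
        digest = hashlib.sha256(" ".join(doc.split()).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique_docs.append(doc)
    return unique_docs

if __name__ == "__main__":
    corpus = ["a b  c", "a b c", "d e f"]
    print(deduplicate(corpus))  # ['a b  c', 'd e f']
```
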
  • 🍉Tokenizer
    • Technology
      • Byte-level Byte-Pair-Encoding (BBPE)
    • Details
      • We use a BPE-based tokenizer similar to that used in GPT-2, with the same total vocabulary size of 50257, with three major changes to the tokenizer:
      • 1) we train a new BPE tokenizer based on the Pile;
      • 2) the tokenizer applies consistent space delimitation regardless of whether a token appears at the start of a string (the GPT-2 tokenizer treats the start of a string differently);
      • 3) our tokenizer contains dedicated tokens for runs of repeated spaces, which helps when tokenizing whitespace-heavy text such as source code (see the tokenizer sketch after this section).
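
The whitespace handling described in points 2) and 3) can be checked against the released tokenizer. A minimal sketch, assuming the `transformers` library is installed and the `EleutherAI/gpt-neox-20b` and `gpt2` tokenizers can be fetched from the Hugging Face Hub:

```python
from transformers import AutoTokenizer

# GPT-NeoX-20B tokenizer vs. the GPT-2 tokenizer it is modeled on.
neox_tok = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")
gpt2_tok = AutoTokenizer.from_pretrained("gpt2")

# Runs of spaces (e.g. code indentation) are covered by dedicated tokens
# in the GPT-NeoX-20B vocabulary, so they split into fewer pieces.
line = "        return x + 1"
print("GPT-NeoX-20B:", neox_tok.tokenize(line))
print("GPT-2:       ", gpt2_tok.tokenize(line))
```
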
  • 🧪Hyperparameters (GPT-NeoX-20B)
    • optimizer: AdamW (with ZeRO)
      • betas: 0.9, 0.95
      • eps: 1e-8
    • batch size: approximately 3.15M tokens
    • context window: 2,048
    • gradient accumulation steps: 32
    • warmup steps: /
    • learning rate: 9.7e-5
    • weight decay: 0.01
    • decay schedule
      • Cosine
    • floating-point precision: fp16
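
For reference, the optimizer settings above map onto a plain PyTorch setup roughly as follows. This is only a sketch: the real run shards optimizer states with ZeRO across GPUs, and the warmup and decay floor are not listed in this post, so the schedule below simply decays cosine-style from the peak learning rate; `model` and `total_steps` are placeholders.

```python
import math
import torch

model = torch.nn.Linear(512, 512)  # placeholder, not the 20B model

# AdamW with the betas / eps / weight decay listed above.
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=9.7e-5,
    betas=(0.9, 0.95),
    eps=1e-8,
    weight_decay=0.01,
)

total_steps = 150_000  # from the training section below

# Cosine decay from the peak learning rate over the full run
# (warmup and a possible minimum-LR floor are omitted here).
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer,
    lr_lambda=lambda step: 0.5 * (1 + math.cos(math.pi * min(step, total_steps) / total_steps)),
)
```
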
  • 🏃‍♀️Training
    • model initialization: from scratch
    • training strategies
      • left-to-right (standard autoregressive language modeling)
    • trained tokens/steps: 150K steps (~472B tokens at ~3.15M tokens per step)
    • hardware: We trained GPT-NeoX-20B on twelve Supermicro AS-4124GO-NART servers, each with eight NVIDIA A100-SXM4-40GB GPUs and configured with two AMD EPYC 7532 CPUs. All GPUs can directly access the InfiniBand switched fabric through one of four ConnectX-6 HCAs for GPUDirect RDMA. Two NVIDIA MQM8700-HS2R switches, connected by 16 links, compose the spine of this InfiniBand network, with one link per node CPU socket connected to each switch.
    • training time: 1,830 hours
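
Combining the figures above gives a rough sense of the run's scale. This is simple arithmetic on the numbers listed in this post; treating the 1,830 hours as wall-clock time is an assumption.

```python
# Rough scale of the training run, derived from the numbers in this post.
servers = 12
gpus_per_server = 8
total_gpus = servers * gpus_per_server          # 96 A100-40GB GPUs

tokens_per_step = 3.15e6                        # approximate batch size in tokens
steps = 150_000
total_tokens = tokens_per_step * steps          # ~472B tokens

wall_clock_hours = 1_830                        # assumed to be wall-clock time
gpu_hours = total_gpus * wall_clock_hours       # ~176K A100 GPU-hours

print(f"{total_gpus} GPUs, ~{total_tokens / 1e9:.0f}B tokens, ~{gpu_hours:,} GPU-hours")
```
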
