- 📙Paper: GPT-NeoX-20B: An Open-Source Autoregressive Language Model
- 📚Publisher: ACL
- 🏠Author Affiliation: EleutherAI
- 🔑Public: ✅
- 🌐Architecture
- Encoder-Decoder
- Decoder-Only ✅ (confirmed in the config sketch below)
- 📏Model Size: 20B
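A minimal sketch (not from the paper) of how to confirm the decoder-only setup and the model dimensions from the public checkpoint's config; it assumes the `transformers` library is installed and that the `EleutherAI/gpt-neox-20b` repository on the Hugging Face Hub is reachable.

```python
# Minimal sketch (not from the paper): read the public checkpoint's config to
# confirm the decoder-only causal-LM setup and its dimensions.
# Assumes the `transformers` library is installed and the Hub is reachable.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("EleutherAI/gpt-neox-20b")

print(config.model_type)               # "gpt_neox" (a decoder-only architecture)
print(config.num_hidden_layers)        # transformer layers
print(config.hidden_size)              # model/embedding dimension
print(config.num_attention_heads)      # attention heads per layer
print(config.max_position_embeddings)  # context window (2,048 per this card)
```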
- 🗂️Data pre-processing
- Data Resource
- The Pile
- De-duplication: ✅ (applied to the whole Pile, not only to the code data)
- Filter Strategies
- /
- 🍉Tokenizer
- Technology
- Byte-level Byte-Pair-Encoding (BBPE) ✅
- SentencePiece
- Details
- We use a BPE-based tokenizer similar to that used in GPT-2, with the same total vocabulary size of 50257, but with three major changes:
- 1) we train a new BPE tokenizer based on the Pile;
- 2) the tokenizer applies consistent space delimitation regardless of where a token occurs, unlike the GPT-2 tokenizer's special handling of the start of a string;
- 3) our tokenizer contains dedicated tokens for runs of repeated spaces (compared against GPT-2 in the sketch below).
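A minimal sketch (not from the paper) comparing whitespace handling between the GPT-2 and GPT-NeoX-20B tokenizers; it assumes the `transformers` library and Hub access, and the sample string is illustrative only.

```python
# Minimal sketch (not from the paper): compare how the GPT-2 and GPT-NeoX-20B
# tokenizers split text with indentation and repeated spaces.
# Assumes the `transformers` library is installed and the Hub is reachable.
from transformers import AutoTokenizer

gpt2_tok = AutoTokenizer.from_pretrained("gpt2")
neox_tok = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")

text = "def f(x):\n        return x  +  1"  # indented code with doubled spaces

print("GPT-2       :", gpt2_tok.tokenize(text))
print("GPT-NeoX-20B:", neox_tok.tokenize(text))

# The NeoX tokenizer is expected to need fewer tokens for the space runs,
# since it has dedicated tokens for repeated whitespace.
print(len(gpt2_tok.tokenize(text)), "vs", len(neox_tok.tokenize(text)))
```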
- 🧪Hyperparameters (GPT-NeoX 20B)
- optimizer: AdamW (with ZeRO sharding of optimizer states); see the sketch after this list
- betas: 0.9, 0.95
- eps: 1e-8
- batch size: approximately 3.15M tokens
- context window: 2,048
- gradient accumulation steps: 32
- warmup steps: /
- learning rate: 9.7e-5
- weight decay: 0.01
- decay schedule
- Cosine ✅
- Linear
- Polynomial
- Inverse Square
- floating-point precision: fp16
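A minimal PyTorch sketch of the optimizer and decay schedule implied by the hyperparameters above; it is not the actual GPT-NeoX/DeepSpeed configuration. The stand-in model, the warmup length, and the decay-to-zero floor are placeholders, and fp16 mixed precision plus ZeRO sharding (which the real setup layers on via DeepSpeed) are omitted.

```python
# Minimal PyTorch sketch (not the actual GPT-NeoX/DeepSpeed setup): AdamW with
# the listed hyperparameters plus linear warmup and cosine decay.
# The tiny stand-in model, warmup length, and decay-to-zero floor are placeholders.
import math
import torch

model = torch.nn.Linear(128, 128)  # stand-in for the 20B-parameter network

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=9.7e-5,          # peak learning rate
    betas=(0.9, 0.95),
    eps=1e-8,
    weight_decay=0.01,
)

total_steps = 150_000
warmup_steps = 1_500    # placeholder; the card leaves warmup unspecified

def lr_lambda(step: int) -> float:
    """Multiplier on the peak LR: linear warmup, then cosine decay."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

# In a training loop: loss.backward(); optimizer.step(); scheduler.step(); optimizer.zero_grad()
```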
- 🏃‍♀️Training
- model initialization: from scratch
- training strategies
- left-to-right ✅
- fill-in-the-middle
- trained tokens/steps: 150K steps
- hardware: We trained GPT-NeoX-20B on twelve Supermicro AS-4124GO-NART servers, each with eight NVIDIA A100-SXM4-40GB GPUs and configured with two AMD EPYC 7532 CPUs. All GPUs can directly access the InfiniBand switched fabric through one of four ConnectX-6 HCAs for GPUDirect RDMA. Two NVIDIA MQM8700-HS2R switches, connected by 16 links, compose the spine of this InfiniBand network, with one link per node CPU socket connected to each switch.
- training time: 1,830 hours (see the back-of-the-envelope check below)
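A back-of-the-envelope sketch using only numbers from this card; the derived token and GPU-hour totals are estimates, not figures quoted from the paper, and it assumes the 1,830 hours is wall-clock time for the full 96-GPU cluster.

```python
# Back-of-the-envelope sketch using only numbers from this card; the derived
# totals are estimates, not figures quoted from the paper.
tokens_per_step = 3.15e6   # ~3.15M tokens per batch
steps = 150_000            # total training steps
gpus = 12 * 8              # twelve servers x eight A100s
wall_clock_hours = 1_830   # assumed to be wall-clock time for the whole cluster

total_tokens = tokens_per_step * steps
gpu_hours = gpus * wall_clock_hours

print(f"total training tokens ~ {total_tokens:.3g}")  # ~4.7e11, roughly 470B tokens
print(f"A100 GPU-hours        ~ {gpu_hours:,}")       # 175,680 under the assumption above
```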