- 📙Paper: GPT-J
- 📚Publisher: other
- 🏠Author Affiliation: EleutherAI
- 🔑Public: ✅
- 🌐Architecture
- Decoder-Only
- 📏Model Size: 6B
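The architecture and size above can be checked against the public checkpoint. A minimal sketch, assuming the Hugging Face `transformers` library and the `EleutherAI/gpt-j-6b` model id for the released weights (not stated in this card); it inspects the configuration rather than downloading the full ~24 GB of parameters:

```python
# Hedged sketch: inspect the public GPT-J 6B checkpoint's configuration.
# "EleutherAI/gpt-j-6b" is the Hugging Face model id of the released checkpoint
# (an assumption of this sketch, not part of the original card).
from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained("EleutherAI/gpt-j-6b")
print(config.model_type)    # "gptj" -> decoder-only causal language model
print(config.n_positions)   # 2048-token context window
print(config.n_embd, config.n_layer, config.n_head)  # 4096 hidden size, 28 layers, 16 heads

# Loading the weights themselves is a one-liner (needs roughly 24 GB of RAM in fp32):
# model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-j-6b")
```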
- 🗂️Data pre-processing
- Data Resource: The Pile
- De-duplication: /
- Filter Strategies: /
- 🍉Tokenizer
- Technology: Byte-level Byte-Pair-Encoding (BBPE)
- Details: same tokenizer as GPT-2/3 (original wording)
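A quick way to see the "same tokenizer as GPT-2/3" point in practice, assuming the public Hugging Face checkpoints for GPT-J and GPT-2 (this comparison is an illustrative sketch, not from the original card): GPT-J's vocabulary is padded with a few extra tokens, but the byte-level BPE merges match GPT-2's, so ordinary text tokenizes identically.

```python
from transformers import AutoTokenizer

gptj_tok = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6b")
gpt2_tok = AutoTokenizer.from_pretrained("gpt2")

text = "Byte-level BPE falls back to raw bytes, so no input is out of vocabulary."
print(gptj_tok.tokenize(text))
print(gptj_tok.tokenize(text) == gpt2_tok.tokenize(text))  # expected True: same BPE merges
```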
- 🧪Hyperparameters (GPT-J 6B)
- optimizer: Adam
- betas: /
- eps: /
- batch size: /
- context window: 2,048
- gradient accumulation steps: 16
- warmup steps: 3,000
- learning rate: /
- weight decay: 0.1
- decay schedule: Cosine
- precision floating point: bf16
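Only the warmup length (3,000 steps) and the weight decay are listed above; the peak learning rate, end learning rate, and total step count are not. The sketch below shows the general linear-warmup-then-cosine-decay shape under placeholder values (`peak_lr`, `end_lr`, and `total_steps` are hypothetical, not figures from the GPT-J card):

```python
import math

def lr_schedule(step, warmup_steps=3_000, total_steps=300_000,
                peak_lr=1e-4, end_lr=1e-5):
    """Linear warmup to peak_lr, then cosine decay to end_lr (placeholder values)."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = min((step - warmup_steps) / (total_steps - warmup_steps), 1.0)
    return end_lr + 0.5 * (peak_lr - end_lr) * (1 + math.cos(math.pi * progress))

for s in (0, 1_500, 3_000, 150_000, 300_000):
    print(s, f"{lr_schedule(s):.2e}")
```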
- 🏃‍♀️Training
- model initialization: /
- training strategies: left-to-right
- trained tokens/steps: /
- hardware: TPU v3-256 (via TPU Research Cloud)
- training time: /
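The left-to-right training strategy is the standard next-token (causal language modeling) objective. A minimal sketch of that objective with Hugging Face `transformers`; it uses the small public `gpt2` checkpoint as a stand-in so it runs on modest hardware, and is an illustration rather than EleutherAI's actual TPU training code:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Left-to-right objective: the labels are the inputs themselves; the library
# shifts them by one position so each token is predicted from its left context only.
batch = tok("The Pile is a large English text corpus.", return_tensors="pt")
loss = model(**batch, labels=batch["input_ids"]).loss
print(float(loss))  # next-token cross-entropy for this sentence
```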