- 📙Paper: GPT-J
- 📚Publisher: other
- 🏠Author Affiliation: EleutherAI
- 🔑Public: ✅
- 🌐Architecture
  - Encoder-Decoder
  - Decoder-Only ✅ (see the loading sketch below)
- 📏Model Size: 6B
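
GPT-J is a decoder-only, autoregressive transformer with 6B parameters. As a quick illustration (not part of the original summary), the released checkpoint can be loaded through the Hugging Face `transformers` causal-LM interface; the fp16 cast and the generation settings here are assumptions made for the sketch.

```python
# Minimal sketch: load GPT-J 6B as a decoder-only (causal) language model.
# Assumes `transformers` and `torch` are installed; fp16 is used only to keep
# the 6B weights around ~12 GB, not because the post prescribes it.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
model = AutoModelForCausalLM.from_pretrained(
    "EleutherAI/gpt-j-6B",
    torch_dtype=torch.float16,
)

prompt = "EleutherAI's GPT-J is"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
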
- 🗂️Data pre-processing
  - Data Resource
    - The Pile
  - De-duplication: /
  - Filter Strategies: /
- 🍉Tokenizer
  - Technology
    - Byte-level Byte-Pair-Encoding (BBPE) ✅
    - SentencePiece
  - Details
    - Same tokenizer as GPT-2/3 (original wording; quick check below)
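
Since the post notes GPT-J reuses the GPT-2/3 byte-level BPE tokenizer, a quick check (illustrative, not from the original summary; the Hugging Face model IDs are assumptions) is to encode the same string with both tokenizers and compare the IDs:

```python
# Quick check that GPT-J's tokenizer matches GPT-2's byte-level BPE:
# both should produce identical token IDs for ordinary English text.
# Assumes `transformers` is installed; model IDs are illustrative.
from transformers import AutoTokenizer

gpt2_tok = AutoTokenizer.from_pretrained("gpt2")
gptj_tok = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")

text = "Byte-level BPE splits raw bytes into subword units."
gpt2_ids = gpt2_tok.encode(text)
gptj_ids = gptj_tok.encode(text)

print(gpt2_ids)
print(gptj_ids)
print(gpt2_ids == gptj_ids)  # expected: True
```
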
- 🧪Hyperparameters (GPT-J 6B; see the setup sketch below)
  - optimizer: Adam
    - betas: /
    - eps: /
  - batch size: /
  - context window: 2,048
  - gradient accumulation steps: 16
  - warmup steps: 3,000
  - learning rate: /
  - weight decay: 0.1
  - decay schedule
    - Cosine
    - Linear
    - Polynomial
    - Inverse Square
  - precision floating point: bf16
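
To show how the listed values fit together, here is a hedged PyTorch sketch wiring up the optimizer, warmup, and gradient accumulation from the list above. The learning rate, total step count, the cosine decay, and the tiny stand-in model are assumptions; the post marks those fields as unspecified or only lists the schedule options.

```python
# Hedged sketch of the training plumbing implied by the listed hyperparameters.
# LEARNING_RATE, TOTAL_STEPS and the cosine decay are placeholders (not given
# in the post); AdamW is used as the decoupled-weight-decay variant of Adam.
import math
import torch

CONTEXT_WINDOW = 2048     # context window (tokens per sequence)
GRAD_ACCUM_STEPS = 16     # gradient accumulation steps
WARMUP_STEPS = 3_000      # warmup steps
WEIGHT_DECAY = 0.1        # weight decay
LEARNING_RATE = 1e-4      # placeholder: "/" in the post
TOTAL_STEPS = 100_000     # placeholder: "/" in the post

model = torch.nn.Linear(8, 8)  # tiny stand-in for the real 6B model
optimizer = torch.optim.AdamW(
    model.parameters(), lr=LEARNING_RATE, weight_decay=WEIGHT_DECAY
)

def lr_lambda(step: int) -> float:
    """Linear warmup for WARMUP_STEPS, then an (assumed) cosine decay."""
    if step < WARMUP_STEPS:
        return step / max(1, WARMUP_STEPS)
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    return 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

for step in range(TOTAL_STEPS):
    for _ in range(GRAD_ACCUM_STEPS):  # accumulate 16 micro-batches per update
        x = torch.randn(1, 8)          # stand-in for a CONTEXT_WINDOW-token batch
        loss = model(x).pow(2).mean() / GRAD_ACCUM_STEPS
        loss.backward()                # bf16 autocast would wrap this in practice
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
    break  # one optimizer step is enough for the sketch
```
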
- 🏃‍♀️Training
  - model initialization: /
  - training strategies
    - left-to-right ✅ (loss sketch below)
    - fill-in-the-middle
  - trained tokens/steps: /
  - hardware: TPU Research Cloud
  - training time: /
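
Here, "left-to-right" means the standard next-token (causal LM) objective. The sketch below illustrates it with the Hugging Face interface, using plain GPT-2 as a small stand-in so it runs anywhere; it is not EleutherAI's actual TPU training code, and passing `labels=input_ids` simply lets the library shift the targets internally.

```python
# Minimal sketch of the left-to-right (next-token prediction) objective.
# Plain GPT-2 is used as a small stand-in; swap in "EleutherAI/gpt-j-6B"
# (and real training data) for the actual model.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "gpt2"  # stand-in so the sketch runs on a laptop
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

batch = tokenizer(
    "GPT-J is trained to predict the next token, left to right.",
    return_tensors="pt",
)
# labels = input_ids: the library shifts them so position t predicts token t+1
outputs = model(**batch, labels=batch["input_ids"])
print(float(outputs.loss))  # cross-entropy over next-token predictions
outputs.loss.backward()     # gradients for one left-to-right training step
```
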
 
 
This post is licensed under CC BY 4.0 by the author.