- 📙Paper: CodeGeeX
- 📚Publisher:
other
- 🏠Author Affiliation:
Tsinghua University
- 🔑Public: ✅ (application required)
- 🌐Architecture
- Encoder-Decoder
- Decoder-Only ✅
- 📏Model Size
13B
- 🗂️Data pre-processing
- Data Resource
- The Pile
- CodeParrot
- Scraped from public GitHub repositories that do not appear in the previous datasets, covering Python, Java, and C++
- De-duplication: /
- Filter Strategies
- A file is filtered out if it has more than 100 characters per line on average,
- is automatically generated,
- has an alphabetic-character ratio below 40%, or
- is larger than 100KB or smaller than 1KB (see the sketch after this list)
- To help the model distinguish between languages, a language-specific prefix is added at the beginning of each segment, in the form [Comment sign] language: [LANG], e.g., # language: Python.
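A minimal sketch of these preprocessing steps, assuming hypothetical helper names; the thresholds come from the list above, while the auto-generation check is a naive stand-in, since the entry does not say how generated files are detected:

```python
# Sketch of the CodeGeeX-style file filters and language prefix.
# Thresholds are from the entry above; `looks_auto_generated` is a
# naive hypothetical stand-in for an unspecified heuristic.
COMMENT_SIGNS = {"Python": "#", "Java": "//", "C++": "//"}  # assumed mapping

def looks_auto_generated(text: str) -> bool:
    # Hypothetical heuristic: many code generators leave a header marker.
    head = text[:500].lower()
    return "auto-generated" in head or "do not edit" in head

def keep_file(text: str) -> bool:
    size = len(text.encode("utf-8"))
    if size > 100 * 1024 or size < 1024:        # >100KB or <1KB
        return False
    lines = text.splitlines() or [""]
    if sum(len(l) for l in lines) / len(lines) > 100:  # avg >100 chars/line
        return False
    if looks_auto_generated(text):
        return False
    alpha_ratio = sum(c.isalpha() for c in text) / max(len(text), 1)
    if alpha_ratio < 0.40:                      # alphabetic ratio below 40%
        return False
    return True

def add_language_prefix(text: str, lang: str) -> str:
    # Prepend "[Comment sign] language: [LANG]", e.g. "# language: Python".
    return f"{COMMENT_SIGNS[lang]} language: {lang}\n{text}"

print(add_language_prefix("print('hi')", "Python"))
```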
- 🍉Tokenizer
- Technology
- Byte-level Byte-Pair-Encoding (BBPE) ✅
- SentencePiece
- Details
- We use the same tokenizer as GPT-2 and process whitespace runs as extra tokens
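A minimal sketch of this idea with the Hugging Face GPT-2 tokenizer; the choice of space-run lengths is a hypothetical assumption, as the entry does not specify which whitespace tokens are added:

```python
# Sketch: extend a GPT-2 BPE tokenizer with whitespace-run tokens so that
# code indentation maps to single tokens. The run lengths (2..8 spaces)
# are a hypothetical choice, not specified in the entry above.
from transformers import GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
snippet = "def add(a, b):\n        return a + b"
print(len(tok.tokenize(snippet)))   # baseline token count

tok.add_tokens([" " * n for n in range(2, 9)])
print(len(tok.tokenize(snippet)))   # the 8-space indent should now be one token
```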
- 🧪Hyperparameters (CodeGeeX 13B)
- optimizer: Adam (with ZeRO optimizer-state sharding; the betas below are Adam hyperparameters)
- betas: 0.9, 0.95
- eps: /
- batch size: the micro-batch size is 16 and the global batch size reaches 3,072 (= 16 × 192, consistent with 192 data-parallel groups if no gradient accumulation is used)
- context window:
2,048
- gradient accumulation steps: /
- warmup steps: /
- learning rate: /
- weight decay:
0.1
- decay schedule
- Cosine
- Linear
- Polynomial
- Inverse Square
- precision floating point:
fp16
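A minimal sketch (not the authors' training code) wiring these values into a PyTorch optimizer; the model and learning rate are placeholders, since the entry leaves lr, eps, and the decay schedule unspecified:

```python
# Sketch: CodeGeeX-13B optimizer settings from the entry above.
import torch

model = torch.nn.Linear(4096, 4096)  # stand-in for the 13B transformer

# AdamW is used here to apply the listed weight decay in decoupled form.
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=1e-4,             # hypothetical placeholder; not given in this entry
    betas=(0.9, 0.95),   # from the entry above
    weight_decay=0.1,    # from the entry above
)

scaler = torch.cuda.amp.GradScaler()  # fp16 training typically needs loss scaling
```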
- 🏃‍♀️Training
- model initialization: /
- training strategies (contrasted in the sketch after this list)
- left-to-right ✅
- fill-in-the-middle
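For contrast, a minimal sketch of how a training sample is constructed under each of the two objectives above; the sentinel tokens follow the common fill-in-the-middle convention and are hypothetical here:

```python
# Sketch: sample construction for the two objectives above.
# Sentinel names (<fim_prefix> etc.) are hypothetical conventions.
code = "def add(a, b):\n    return a + b\n"

# Left-to-right: the model predicts the next token over raw text.
l2r_sample = code

# Fill-in-the-middle: split the file and move the middle to the end, so a
# left-to-right model learns to infill between a prefix and a suffix.
prefix, middle, suffix = code[:15], code[15:28], code[28:]
fim_sample = f"<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>{middle}"
print(fim_sample)
```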
- trained tokens/steps: /
- hardware: 1,536 Ascend 910 AI Processors (32GB each)
- training time: /