- 📙Paper: CodeGeeX
- 📚Publisher:
other
- 🏠Author Affiliation:
Tsinghua University
- 🔑Public: ✅ (application required)
- 🌐Architecture
- Encoder-Decoder
- Decoder-Only ✅
- 📏Model Size
13B
- 🗂️Data pre-processing
- Data Resource
- The Pile
- CodeParrot
- Scraped from public GitHub repositories that do not appear in the previous datasets, covering Python, Java, and C++
- De-duplication: /
- Filter Strategies
- A file is filtered out if it has more than 100 characters per line on average,
- is automatically generated,
- has an alphabetic-character ratio below 40%, or
- is larger than 100KB or smaller than 1KB (see the sketch after this list)
- To help the model distinguish between languages, a language-specific prefix is added at the beginning of each segment, in the form [Comment sign] language: [LANG], e.g., # language: Python.
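A minimal sketch of these preprocessing steps, assuming hypothetical helper names; the thresholds come from the list above, while the auto-generation check is a naive stand-in, since the entry does not say how generated files are detected:

```python
# Sketch of the CodeGeeX-style file filters and language prefix.
# Thresholds are from the entry above; `looks_auto_generated` is a
# naive hypothetical stand-in for an unspecified heuristic.
COMMENT_SIGNS = {"Python": "#", "Java": "//", "C++": "//"}  # assumed mapping

def looks_auto_generated(text: str) -> bool:
    # Hypothetical heuristic: many code generators leave a header marker.
    head = text[:500].lower()
    return "auto-generated" in head or "do not edit" in head

def keep_file(text: str) -> bool:
    size = len(text.encode("utf-8"))
    if size > 100 * 1024 or size < 1024:        # >100KB or <1KB
        return False
    lines = text.splitlines() or [""]
    if sum(len(l) for l in lines) / len(lines) > 100:  # avg >100 chars/line
        return False
    if looks_auto_generated(text):
        return False
    alpha_ratio = sum(c.isalpha() for c in text) / max(len(text), 1)
    if alpha_ratio < 0.40:                      # alphabetic ratio below 40%
        return False
    return True

def add_language_prefix(text: str, lang: str) -> str:
    # Prepend "[Comment sign] language: [LANG]", e.g. "# language: Python".
    return f"{COMMENT_SIGNS[lang]} language: {lang}\n{text}"

print(add_language_prefix("print('hi')", "Python"))
```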
- 🍉Tokenizer
- Technology
- Byte-level Byte-Pair-Encoding (BBPE) ✅
- SentencePiece
- Details
- We use the same tokenizer as GPT-2 and process whitespace runs as extra tokens
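A minimal sketch of this idea with the Hugging Face GPT-2 tokenizer; the choice of space-run lengths is a hypothetical assumption, as the entry does not specify which whitespace tokens are added:

```python
# Sketch: extend a GPT-2 BPE tokenizer with whitespace-run tokens so that
# code indentation maps to single tokens. The run lengths (2..8 spaces)
# are a hypothetical choice, not specified in the entry above.
from transformers import GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
snippet = "def add(a, b):\n        return a + b"
print(len(tok.tokenize(snippet)))   # baseline token count

tok.add_tokens([" " * n for n in range(2, 9)])
print(len(tok.tokenize(snippet)))   # the 8-space indent should now be one token
```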
- 🧪Hyperparameters (CodeGeeX 13B)
- optimizer: Adam (with ZeRO optimizer-state sharding; the betas below are Adam hyperparameters)
- betas: 0.9, 0.95
- eps: /
- batch size: the micro-batch size is 16 and the global batch size reaches 3,072 (= 16 × 192, consistent with 192 data-parallel groups if no gradient accumulation is used)
- context window:
2,048
- gradient accumulation steps: /
- warmup steps: /
- learning rate: /
- weight decay:
0.1
- decay schedule
- Cosine
- Linear
- Polynomial
- Inverse Square
- precision floating point:
fp16
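A minimal sketch (not the authors' training code) wiring these values into a PyTorch optimizer; the model and learning rate are placeholders, since the entry leaves lr, eps, and the decay schedule unspecified:

```python
# Sketch: CodeGeeX-13B optimizer settings from the entry above.
import torch

model = torch.nn.Linear(4096, 4096)  # stand-in for the 13B transformer

# AdamW is used here to apply the listed weight decay in decoupled form.
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=1e-4,             # hypothetical placeholder; not given in this entry
    betas=(0.9, 0.95),   # from the entry above
    weight_decay=0.1,    # from the entry above
)

scaler = torch.cuda.amp.GradScaler()  # fp16 training typically needs loss scaling
```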
- 🏃‍♀️Training
- model initialization: /
- training strategies (contrasted in the sketch after this list)
- left-to-right ✅
- fill-in-the-middle
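For contrast, a minimal sketch of how a training sample is constructed under each of the two objectives above; the sentinel tokens follow the common fill-in-the-middle convention and are hypothetical here:

```python
# Sketch: sample construction for the two objectives above.
# Sentinel names (<fim_prefix> etc.) are hypothetical conventions.
code = "def add(a, b):\n    return a + b\n"

# Left-to-right: the model predicts the next token over raw text.
l2r_sample = code

# Fill-in-the-middle: split the file and move the middle to the end, so a
# left-to-right model learns to infill between a prefix and a suffix.
prefix, middle, suffix = code[:15], code[15:28], code[28:]
fim_sample = f"<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>{middle}"
print(fim_sample)
```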
- trained tokens/steps: /
- hardware: 1,536 Ascend 910 AI Processors (32GB each)
- training time: /