- 📙Paper: CodeGen: An Open Large Language Model for Code with Multi-Turn Program Synthesis
- 📚Publisher:
arXiv
- 🏠Author Affiliation:
Salesforce Research
- 🔑Public: ✅
- 🌐Architecture
- Decoder-Only
- 📏Model Size
350M; 2.7B; 6.1B; 16.1B (all released; see the loading sketch below)
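Since the checkpoints are public, they can be loaded directly. A minimal sampling sketch, assuming the Hugging Face model IDs from the public release (e.g. `Salesforce/codegen-350M-mono`):

```python
# Minimal sketch: sample from a released CodeGen checkpoint with Hugging Face
# transformers. The model ID below is the 350M mono-lingual (Python) variant;
# -nl (The Pile) and -multi (BigQuery) variants follow the same naming.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Salesforce/codegen-350M-mono"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

prompt = "def hello_world():"
inputs = tokenizer(prompt, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```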
- 🗂️Data pre-processing
- Data Resource
- The natural language dataset The Pile is an 825.18 GiB English text corpus
- The multi-lingual dataset BigQuery is a subset of Google’s publicly available BigQuery dataset, which consists of code (under open-source license) in multiple programming languages
- The mono-lingual dataset BIGPYTHON contains a large amount of Python code: public, non-personal, permissively licensed Python code compiled from GitHub in October 2021
- De-duplication: ✅
- Filter Strategies (a file is kept only if it passes all of the following; see the sketch after this list)
- allowed file extension
- average line length < 100 characters
- maximum line length < 1,000 characters
- at most 90% of the characters are decimal or hexadecimal digits
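A minimal sketch of this pre-processing stage, assuming exact file-level de-duplication by content hash (the dedup granularity and the helper names here are assumptions; the thresholds are the ones listed above):

```python
# Sketch of the pre-processing above: exact de-duplication by content hash,
# then rule-based filtering. Helper names and the hash-based dedup are
# assumptions; the thresholds mirror the bullets above.
import hashlib

ALLOWED_EXTENSIONS = (".py",)  # mono-lingual setting; broader for BigQuery

def is_duplicate(text: str, seen: set) -> bool:
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    if digest in seen:
        return True
    seen.add(digest)
    return False

def keep_file(path: str, text: str) -> bool:
    lines = text.splitlines() or [""]
    avg_len = sum(len(line) for line in lines) / len(lines)
    max_len = max(len(line) for line in lines)
    # Approximation: fraction of characters that are decimal/hexadecimal digits.
    digit_frac = sum(c in "0123456789abcdefABCDEF" for c in text) / max(len(text), 1)
    return (
        path.endswith(ALLOWED_EXTENSIONS)
        and avg_len < 100       # average line length < 100 characters
        and max_len < 1000      # maximum line length < 1,000 characters
        and digit_frac <= 0.9   # drop files that are >90% digits
    )
```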
- 🍉Tokenizer
- Technology
- Byte-level Byte-Pair-Encoding (BBPE), as used by GPT-2
- Details
- The BPE vocabulary of GPT-2 is extended with special tokens representing repeated runs of tabs and white spaces (see the sketch below). In the multi-lingual setting of BigQuery, a prefix is prepended to indicate the name of the programming language.
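A sketch of this extension, assuming Hugging Face's GPT-2 tokenizer as the base BBPE; the whitespace run lengths and the language-prefix format are assumptions:

```python
# Sketch: extend the GPT-2 byte-level BPE vocabulary with single tokens for
# runs of tabs and spaces, as described above. Run lengths are illustrative.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
extra = ["\t" * n for n in range(2, 10)] + [" " * n for n in range(2, 32)]
tokenizer.add_tokens(extra)

# Multi-lingual (BigQuery) setting: prepend the programming-language name as a
# prefix to each document (prefix format assumed here).
doc = "# language: Python\n" + "def add(a, b):\n        return a + b\n"
print(tokenizer.tokenize(doc)[:12])
```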
- 🧪Hyperparameters (CodeGen 16.1B)
- optimizer: Adam
- betas: 0.9, 0.999
- eps: 1e-8
- batch size:
2M tokens
- context window:
2,048
- gradient accumulation steps: /
- warmup steps:
3,000
- learning rate:
0.5e-4
- weight decay:
0.1
- decay schedule: Cosine (following the GPT-3 configuration; see the sketch below)
- precision floating point: /
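A PyTorch sketch of this configuration, using AdamW as a stand-in for Adam with decoupled weight decay and a linear-warmup-then-cosine schedule (the model, total step count, and decay target are placeholders/assumptions):

```python
# Sketch of the 16.1B training hyperparameters above: Adam(W) with weight
# decay 0.1, 3,000-step linear warmup to 0.5e-4, then cosine decay.
import math
import torch

model = torch.nn.Linear(8, 8)  # placeholder for the actual 16.1B model
total_steps, warmup_steps, peak_lr = 150_000, 3_000, 0.5e-4

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=peak_lr, betas=(0.9, 0.999), eps=1e-8, weight_decay=0.1,
)

def lr_lambda(step: int) -> float:
    if step < warmup_steps:
        return step / warmup_steps  # linear warmup to the peak rate
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))  # cosine decay to 0

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

# Note: a 2M-token batch with a 2,048-token context is ~1,024 sequences/step.
```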
- 🏃‍♀️Training
- model initialization: /
- training strategies
- left-to-right (next-token prediction; see the sketch below)
- trained tokens/steps: 150K steps
- hardware: TPU-v4
- training time: /
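Left-to-right here means standard causal (next-token) language modeling; a minimal sketch of the objective:

```python
# Sketch of the left-to-right objective: predict token t+1 from tokens <= t,
# i.e. cross-entropy over the one-step-shifted sequence.
import torch
import torch.nn.functional as F

def causal_lm_loss(logits: torch.Tensor, input_ids: torch.Tensor) -> torch.Tensor:
    # logits: (batch, seq, vocab); input_ids: (batch, seq)
    shift_logits = logits[:, :-1, :]   # predictions for positions 1..T-1
    shift_labels = input_ids[:, 1:]    # targets are the next tokens
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
    )

# Toy example with random tensors:
logits = torch.randn(2, 16, 100)
ids = torch.randint(0, 100, (2, 16))
print(causal_lm_loss(logits, ids))
```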