- 📙Paper: CodeGen: An Open Large Language Model for Code with Multi-Turn Program Synthesis
- 📚Publisher:
arXiv
- 🏠Author Affiliation:
Salesforce Research
- 🔑Public: ✅
- 🌐Architecture
- Decoder-Only
- 📏Model Size
350M; 2.7B; 6.1B; 16.1B (all released; see the loading sketch below)
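Since the checkpoints are public, they can be loaded directly. A minimal sampling sketch, assuming the Hugging Face model IDs from the public release (e.g. `Salesforce/codegen-350M-mono`):

```python
# Minimal sketch: sample from a released CodeGen checkpoint with Hugging Face
# transformers. The model ID below is the 350M mono-lingual (Python) variant;
# -nl (The Pile) and -multi (BigQuery) variants follow the same naming.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Salesforce/codegen-350M-mono"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

prompt = "def hello_world():"
inputs = tokenizer(prompt, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```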
- 🗂️Data pre-processing
- Data Resource
- The natural language dataset The Pile is an 825.18 GiB English text corpus
- The multi-lingual dataset BigQuery is a subset of Google’s publicly available BigQuery dataset, which consists of code (under open-source license) in multiple programming languages
- The mono-lingual dataset BIGPYTHON contains a large amount of Python code: public, non-personal, permissively licensed Python code compiled from GitHub in October 2021
- De-duplication: ✅
- Filter Strategies (a file is kept only if it passes all of the following; see the sketch after this list)
- allowed file extension
- average line length < 100 characters
- maximum line length < 1,000 characters
- at most 90% of the characters are decimal or hexadecimal digits
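A minimal sketch of this pre-processing stage, assuming exact file-level de-duplication by content hash (the dedup granularity and the helper names here are assumptions; the thresholds are the ones listed above):

```python
# Sketch of the pre-processing above: exact de-duplication by content hash,
# then rule-based filtering. Helper names and the hash-based dedup are
# assumptions; the thresholds mirror the bullets above.
import hashlib

ALLOWED_EXTENSIONS = (".py",)  # mono-lingual setting; broader for BigQuery

def is_duplicate(text: str, seen: set) -> bool:
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    if digest in seen:
        return True
    seen.add(digest)
    return False

def keep_file(path: str, text: str) -> bool:
    lines = text.splitlines() or [""]
    avg_len = sum(len(line) for line in lines) / len(lines)
    max_len = max(len(line) for line in lines)
    # Approximation: fraction of characters that are decimal/hexadecimal digits.
    digit_frac = sum(c in "0123456789abcdefABCDEF" for c in text) / max(len(text), 1)
    return (
        path.endswith(ALLOWED_EXTENSIONS)
        and avg_len < 100       # average line length < 100 characters
        and max_len < 1000      # maximum line length < 1,000 characters
        and digit_frac <= 0.9   # drop files that are >90% digits
    )
```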
- 🍉Tokenizer
- Technology
- Byte-level Byte-Pair-Encoding (BBPE), as used by GPT-2
- Details
- The BPE vocabulary of GPT-2 is extended with special tokens representing repeated runs of tabs and white spaces (see the sketch below). In the multi-lingual setting of BigQuery, a prefix is prepended to indicate the name of the programming language.
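A sketch of this extension, assuming Hugging Face's GPT-2 tokenizer as the base BBPE; the whitespace run lengths and the language-prefix format are assumptions:

```python
# Sketch: extend the GPT-2 byte-level BPE vocabulary with single tokens for
# runs of tabs and spaces, as described above. Run lengths are illustrative.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
extra = ["\t" * n for n in range(2, 10)] + [" " * n for n in range(2, 32)]
tokenizer.add_tokens(extra)

# Multi-lingual (BigQuery) setting: prepend the programming-language name as a
# prefix to each document (prefix format assumed here).
doc = "# language: Python\n" + "def add(a, b):\n        return a + b\n"
print(tokenizer.tokenize(doc)[:12])
```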
- 🧪Hyperparameters (CodeGen 16.1B)
- optimizer: Adam
- betas: 0.9, 0.999
- eps: 1e-8
- batch size:
2M tokens
- context window:
2,048
- gradient accumulation steps: /
- warmup steps:
3,000
- learning rate:
0.5e-4
- weight decay:
0.1
- decay schedule: Cosine (following the GPT-3 configuration; see the sketch below)
- precision floating point: /
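A PyTorch sketch of this configuration, using AdamW as a stand-in for Adam with decoupled weight decay and a linear-warmup-then-cosine schedule (the model, total step count, and decay target are placeholders/assumptions):

```python
# Sketch of the 16.1B training hyperparameters above: Adam(W) with weight
# decay 0.1, 3,000-step linear warmup to 0.5e-4, then cosine decay.
import math
import torch

model = torch.nn.Linear(8, 8)  # placeholder for the actual 16.1B model
total_steps, warmup_steps, peak_lr = 150_000, 3_000, 0.5e-4

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=peak_lr, betas=(0.9, 0.999), eps=1e-8, weight_decay=0.1,
)

def lr_lambda(step: int) -> float:
    if step < warmup_steps:
        return step / warmup_steps  # linear warmup to the peak rate
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))  # cosine decay to 0

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

# Note: a 2M-token batch with a 2,048-token context is ~1,024 sequences/step.
```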
- 🏃‍♀️Training
- model initialization: /
- training strategies
- left-to-right (next-token prediction; see the sketch below)
- trained tokens/steps: 150K steps
- hardware: TPU-v4
- training time: /
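Left-to-right here means standard causal (next-token) language modeling; a minimal sketch of the objective:

```python
# Sketch of the left-to-right objective: predict token t+1 from tokens <= t,
# i.e. cross-entropy over the one-step-shifted sequence.
import torch
import torch.nn.functional as F

def causal_lm_loss(logits: torch.Tensor, input_ids: torch.Tensor) -> torch.Tensor:
    # logits: (batch, seq, vocab); input_ids: (batch, seq)
    shift_logits = logits[:, :-1, :]   # predictions for positions 1..T-1
    shift_labels = input_ids[:, 1:]    # targets are the next tokens
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
    )

# Toy example with random tensors:
logits = torch.randn(2, 16, 100)
ids = torch.randint(0, 100, (2, 16))
print(causal_lm_loss(logits, ids))
```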