- 📙Paper: InCoder: A Generative Model for Code Infilling and Synthesis
- 📚Publisher:
arXiv
- 🏠Author Affiliation:
Facebook AI Research
- 🔑Public: ✅
- 🌐Architecture
- Decoder-Only
- 📏Model Size
1.3B; 6.7B
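Both model sizes were publicly released; a minimal generation sketch is shown below, assuming the released Hugging Face checkpoints (`facebook/incoder-1B`, `facebook/incoder-6B`) — the prompt and sampling settings are illustrative, not from the paper.

```python
# Minimal left-to-right sampling sketch using the released 1.3B checkpoint.
# Prompt and decoding parameters are illustrative.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/incoder-1B")
model = AutoModelForCausalLM.from_pretrained("facebook/incoder-1B")

prompt = "def count_words(filename):\n"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=True, top_p=0.95)
print(tokenizer.decode(outputs[0]))
```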
- 🗂️Data pre-processing
- Data Resource
- We train our models on a corpus of (1) public code with permissive, non-copyleft, open-source licenses and (2) StackOverflow questions, answers, and comments.
- We obtained code files and repository metadata from GitHub and GitLab through the sites’ public APIs over a period ending on December 9th, 2021. Since Python files can also be contained in non-majority-Python repositories, we also included all other Python and Jupyter files obtainable through the GitHub archive on BigQuery that we did not already obtain from GitHub directly.
- De-duplication: ✅
- Filter Strategies (a hedged filtering sketch follows this list)
- We remove files that contain any line longer than 3000 tokens,
- have an average line length greater than 100 tokens,
- have less than 40% of their characters being alphanumeric or underscores, or
- appear to be automatically generated, which we determine using substring match on a small number of phrases produced by automatic code and documentation generation systems.
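A minimal sketch of these filters, assuming whitespace splitting approximates token counts and using a hypothetical phrase list (the paper does not publish the exact phrases it matches against):

```python
# Hypothetical examples; the paper's exact auto-generation phrase list is not published.
AUTOGEN_PHRASES = ["Autogenerated by", "Generated by Django"]

def keep_file(text: str) -> bool:
    """Return True if a source file passes all corpus filters."""
    lines = text.splitlines()
    if not text or not lines:
        return False
    # Token counts approximated by whitespace splitting (assumption).
    line_lengths = [len(line.split()) for line in lines]
    if max(line_lengths) > 3000:              # any line longer than 3000 tokens
        return False
    if sum(line_lengths) / len(lines) > 100:  # average line length > 100 tokens
        return False
    # Less than 40% of characters alphanumeric or underscore.
    good = sum(c.isalnum() or c == "_" for c in text)
    if good / len(text) < 0.4:
        return False
    # Substring match against known auto-generation phrases.
    if any(phrase in text for phrase in AUTOGEN_PHRASES):
        return False
    return True
```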
- 🍉Tokenizer
- Technology
- Byte-level Byte-Pair-Encoding (BBPE)
- Details
- We train a byte-level BPE tokenizer. We allow tokens to extend across whitespace (excluding newline characters) so that common code idioms (e.g., `import numpy as np`) are represented as single tokens in the vocabulary. A training sketch follows below.
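A hedged sketch of training such a tokenizer with the Hugging Face `tokenizers` library — an assumed toolkit, not confirmed by the paper. The key move is disabling the usual whitespace-splitting pre-tokenization while keeping newlines as merge boundaries; the vocabulary size and corpus file are illustrative.

```python
from tokenizers import Tokenizer, models, trainers, pre_tokenizers, decoders

# Byte-level BPE whose merges may cross spaces but never newlines.
tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.Sequence([
    # Keep newlines as boundaries so merges never span lines.
    pre_tokenizers.Split(pattern="\n", behavior="isolated"),
    # Byte-to-unicode mapping without GPT-2's whitespace-splitting regex,
    # so idioms like "import numpy as np" can become single tokens.
    pre_tokenizers.ByteLevel(add_prefix_space=False, use_regex=False),
])
tokenizer.decoder = decoders.ByteLevel()

trainer = trainers.BpeTrainer(vocab_size=50_000)       # illustrative size
tokenizer.train(files=["corpus.txt"], trainer=trainer)  # hypothetical corpus file
```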
- 🧪Hyperparameters (InCoder 6.7B)
- optimizer: Adam
- betas: 0.9, 0.98
- eps: /
- batch size: /
- context window:
2,048
- gradient accumulation steps: /
- warmup steps:
1,500
- learning rate: /
- weight decay: /
- decay schedule
- Cosine
- Linear
- Polynomial
- Inverse Square
- precision floating point: /
- 🏃‍♀️Training
- model initialization: from scratch
- training strategies
- left-to-right
- fill-in-the-middle, via the causal masking objective (see the sketch after this section)
- trained tokens/steps: 52B tokens
- hardware: 248 V100 GPUs
- training time: 24 days
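A sketch of the fill-in-the-middle (causal masking) document transformation: a span is cut out, replaced by a sentinel, and moved to the end of the document, so that ordinary left-to-right training teaches the model to infill when prompted with prefix + sentinel + suffix. The sentinel spellings (`<|mask:0|>`, `<|endofmask|>`) follow the released checkpoints and should be treated as assumptions; the paper's full objective also handles multiple spans.

```python
import random

def causal_mask_transform(doc: str, rng: random.Random) -> str:
    """Move one random span to the end of the document behind a sentinel.

    Left-to-right training on the transformed text teaches the model to
    generate the missing span given prefix + sentinel + suffix.
    """
    i, j = sorted(rng.sample(range(len(doc)), 2))
    prefix, span, suffix = doc[:i], doc[i:j], doc[j:]
    return f"{prefix}<|mask:0|>{suffix}<|mask:0|>{span}<|endofmask|>"

rng = random.Random(0)
print(causal_mask_transform("def add(a, b):\n    return a + b\n", rng))
```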