
PyMT5

  • 📙Paper: PyMT5: multi-mode translation of natural language and Python code with transformers
  • 📚Publisher: EMNLP
  • 🏠Author Affiliation: Microsoft
  • 🔑Public: ❌
  • 🌐Architecture
    • Encoder-Decoder ✅ (PyMT5 itself is a T5-style sequence-to-sequence model)
    • Decoder-Only (used only for the GPT2-medium baseline in the paper)
  • 📏Model Size
    • 374M
  • 🗂️Data pre-processing
    • Data Resource
      • The dataset consists of 118k GitHub repositories, comprising all public repositories labeled as containing primarily Python source code, with at least 10 stars, and with a commit in the past 5 years. 112k of these repositories were successfully cloned, yielding 5.3 million Python files extracted from the default HEAD state of each repository.
    • De-duplication: ✅
    • Filter Strategies
      • Use the Python 3.7 standard library ast module to produce the file-level AST for each Python file
      • Use 2to3 and autopep8 to normalize differences in code style and whitespace/tab conventions
      • Use the Python module astunparse to take the AST of each method and unparse it back into source code
      • Ignore comments, as they generally represent trivia and are not part of the normal language syntax
      • Clean the docstrings by removing non-ASCII characters, normalizing Unicode, and replacing commit hashes, file paths, and URLs with placeholder tokens (a sketch of this extraction follows the list)
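
As a rough illustration of the extraction pipeline above, here is a minimal Python sketch built on the standard ast module and the astunparse package. The placeholder tokens and regular expressions are assumptions for illustration, not the authors' exact implementation, and the 2to3/autopep8 normalization step is omitted.

```python
import ast
import re
import unicodedata

import astunparse  # pip install astunparse

# Hypothetical placeholder tokens; the post only says commit hashes,
# file paths, and URLs are replaced with placeholders.
_URL = re.compile(r"https?://\S+")
_COMMIT_HASH = re.compile(r"\b[0-9a-f]{7,40}\b")
_FILE_PATH = re.compile(r"(?:/[\w.\-]+){2,}")


def clean_docstring(doc: str) -> str:
    """Normalize Unicode, drop non-ASCII characters, and mask volatile tokens."""
    doc = unicodedata.normalize("NFKD", doc)
    doc = doc.encode("ascii", errors="ignore").decode("ascii")
    doc = _URL.sub("<URL>", doc)
    doc = _COMMIT_HASH.sub("<HASH>", doc)
    doc = _FILE_PATH.sub("<PATH>", doc)
    return doc.strip()


def extract_methods(source: str):
    """Yield (method_source, cleaned_docstring) pairs from one Python file."""
    tree = ast.parse(source)  # file-level AST via the standard library
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            code = astunparse.unparse(node)  # AST -> normalized source text
            doc = ast.get_docstring(node) or ""
            yield code, clean_docstring(doc)
```
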
  • 🍉Tokenizer
    • Technology
      • Byte-level Byte-Pair-Encoding (BBPE)
      • SentencePiece
    • Details
      • The same extended GPT tokenizer, trained on raw Python files (see the sketch below).
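
For context, a GPT-style byte-level BPE tokenizer can be trained on raw Python files roughly as follows, here using the Hugging Face tokenizers library. The file list, vocabulary size, and special tokens are illustrative assumptions, not details from the paper.

```python
import os

from tokenizers import ByteLevelBPETokenizer

# Hypothetical list of extracted Python source files.
python_files = ["corpus/file_0001.py", "corpus/file_0002.py"]

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=python_files,
    vocab_size=50_000,                 # assumed, GPT-2-like vocabulary size
    min_frequency=2,
    special_tokens=["<pad>", "<s>", "</s>"],
)

os.makedirs("pymt5-tokenizer", exist_ok=True)
tokenizer.save_model("pymt5-tokenizer")  # writes vocab.json / merges.txt

# Byte-level BPE can encode any input, including code with unusual whitespace.
ids = tokenizer.encode("def add(a, b):\n    return a + b").ids
```
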
  • 🧪Hyperparameters (PyMT5 374M)
    • optimizer: Adam
      • betas: 0.9, 0.98
      • eps: 1e-6
    • batch size: /
    • context window: 2,200
    • gradient accumulation steps: /
    • warmup steps: 5,000
    • learning rate: 9.1875e-5
    • weight decay: 0.01
    • decay schedule
      • Cosine
      • Linear
      • Polynomial
      • Inverse Square
    • precision floating point: fp16
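
Reconstructed in PyTorch, the reported optimizer settings would look roughly like the sketch below. The inverse-square-root decay after warmup is only one of the schedules listed above and is an assumption here, as is the use of plain torch.optim.Adam.

```python
import torch

model = torch.nn.Linear(512, 512)  # stand-in for the actual PyMT5 model

optimizer = torch.optim.Adam(
    model.parameters(),
    lr=9.1875e-5,          # peak learning rate
    betas=(0.9, 0.98),
    eps=1e-6,
    weight_decay=0.01,
)

warmup_steps = 5_000

def lr_lambda(step: int) -> float:
    # Linear warmup to the peak LR, then inverse-square-root decay (assumed).
    step = max(step, 1)
    if step < warmup_steps:
        return step / warmup_steps
    return (warmup_steps / step) ** 0.5

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
scaler = torch.cuda.amp.GradScaler()  # fp16 mixed-precision loss scaling
```
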
  • 🏃‍♀️Training
    • model initialization: /
    • training strategies
      • left-to-right
      • fill-in-the-middle
    • trained tokens/steps: convergence occurs after 397k steps or 183 epochs
    • hardware: 16 × 32GB Tesla V100 GPUs
    • training time: 3 weeks
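
The reported hardware (16 V100 GPUs with fp16) implies distributed data-parallel training. A minimal sketch of such a setup in PyTorch follows; the launch command and stand-in model are assumptions, since the paper does not describe the exact training harness.

```python
# Launch across 2 nodes x 8 GPUs each, e.g.:
#   torchrun --nnodes=2 --nproc_per_node=8 train.py
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(512, 512).cuda(local_rank)  # stand-in model
model = DDP(model, device_ids=[local_rank])
```
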
