- 📙Paper: PyMT5: multi-mode translation of natural language and Python code with transformers
- 📚Publisher: EMNLP
- 🏠Author Affiliation: Microsoft
- 🔑Public: ❌
- 🌐Architecture
- Encoder-Decoder
- Decoder-Only
- 📏Model Size: 374M
- 🗂️Data pre-processing
- Data Resource
- Our data consists of 118k GitHub repositories, which include all public repositories labelled as containing primarily Python source code, featuring at least 10 stars, and which have had a commit in the past 5 years. We successfully cloned 112k of these repositories, extracting 5.3 million Python files from the default HEAD state of each repository.
- De-duplication: ✅
- Filter Strategies
- Use the Python 3.7 standard library module ast to produce the file-level AST for each Python file
- Use 2to3 and autopep8 to normalize differing styles and whitespace or tab conventions
- Use the Python module astunparse to take the AST for each method and unparse it back into source code
- Ignore comments, as they generally represent trivia and are not part of the normal language syntax
- Clean the docstrings by removing non-ASCII characters, normalizing Unicode, and replacing commit hashes, file paths, and URLs with placeholder tokens
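The parse-and-unparse steps above can be sketched with the standard library alone. This is a minimal illustration, not the authors' pipeline: it assumes Python 3.9+ and swaps in `ast.unparse` for the third-party astunparse module, and the helper name `extract_methods` is hypothetical.

```python
import ast


def extract_methods(source: str):
    """Yield (name, docstring, normalized_source) for every function in `source`.

    Hypothetical helper: uses ast.unparse (Python 3.9+) in place of astunparse.
    """
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            # Comments never reach the AST, so unparsing drops them and
            # normalizes whitespace as a side effect.
            yield node.name, ast.get_docstring(node), ast.unparse(node)


example = '''
def add(a, b):
    """Return the sum of a and b."""
    # this comment is not part of the AST
    return a+b
'''

methods = list(extract_methods(example))
name, docstring, normalized = methods[0]
print(normalized)
```

Round-tripping through the AST is what makes the comment-dropping filter essentially free: anything the parser treats as trivia simply does not survive.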
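The docstring-cleaning filter can be approximated with a few regular expressions. The sketch below is an assumption-heavy illustration: the placeholder tokens `<URL>`, `<PATH>`, and `<HASH>`, the regexes, and the function name `clean_docstring` are all hypothetical, and the patterns are heuristics rather than the rules used in the paper.

```python
import re
import unicodedata

# Heuristic patterns (assumptions, not taken from the paper).
URL_RE = re.compile(r"https?://\S+")
PATH_RE = re.compile(r"(?:/[\w.\-]+){2,}")
# 7 to 40 hex characters containing at least one digit, e.g. a git commit hash.
HASH_RE = re.compile(r"\b(?=[0-9a-f]*\d)[0-9a-f]{7,40}\b")


def clean_docstring(text: str) -> str:
    """Normalize Unicode, replace URLs/paths/hashes, then drop non-ASCII."""
    text = unicodedata.normalize("NFKC", text)
    text = URL_RE.sub("<URL>", text)    # URLs first, since they contain paths
    text = PATH_RE.sub("<PATH>", text)
    text = HASH_RE.sub("<HASH>", text)
    return text.encode("ascii", errors="ignore").decode("ascii")


raw = "Fixes café bug in commit 3f2a9c1e, see https://example.com/x and /usr/local/lib/foo.py"
cleaned = clean_docstring(raw)
print(cleaned)
```

The replacement order matters in this sketch: URLs are substituted before file paths so that the path pattern does not split a URL into pieces.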
- 🍉Tokenizer
- Technology
- Byte-level Byte-Pair-Encoding (BBPE)
- SentencePiece
- Details
- An extended GPT tokenizer, trained on raw Python files.
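To make the byte-level BPE idea concrete, here is a toy trainer and encoder in pure Python. It is a teaching sketch under simplifying assumptions (no pre-tokenization, no special tokens, overlapping pairs counted naively); the function names `train_bbpe`, `merge_pair`, and `encode` are hypothetical and this is not the tokenizer used for PyMT5.

```python
from collections import Counter


def merge_pair(seq, left, right):
    """Replace every adjacent (left, right) pair in `seq` with left + right."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and seq[i] == left and seq[i + 1] == right:
            out.append(left + right)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def train_bbpe(corpus, num_merges):
    """Learn up to `num_merges` merge rules, starting from single bytes."""
    seqs = [[bytes([b]) for b in text.encode("utf-8")] for text in corpus]
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for seq in seqs:
            pairs.update(zip(seq, seq[1:]))
        if not pairs:
            break
        (left, right), count = pairs.most_common(1)[0]
        if count < 2:  # nothing left that is worth merging
            break
        merges.append((left, right))
        seqs = [merge_pair(seq, left, right) for seq in seqs]
    return merges


def encode(text, merges):
    """Apply the learned merges, in training order, to new text."""
    seq = [bytes([b]) for b in text.encode("utf-8")]
    for left, right in merges:
        seq = merge_pair(seq, left, right)
    return seq


merges = train_bbpe(["def foo():", "def bar():", "def baz():"], num_merges=10)
tokens = encode("def qux():", merges)
print(tokens)
```

Because the base vocabulary is the 256 possible byte values, any input string can be encoded without an unknown token, which is the main appeal of the byte-level variant for source code.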
- 🧪Hyperparameters (PyMT5 374M)
- optimizer: Adam
- betas: 0.9, 0.98
- eps: 1e-6
- batch size: /
- context window: 2,200
- gradient accumulation steps: /
- warmup steps: 5,000
- learning rate: 9.1875e-5
- weight decay: 0.01
- decay schedule
- Cosine
- Linear
- Polynomial
- Inverse Square
- precision floating point: fp16
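As a rough illustration of how the warmup and peak learning rate listed above interact, the sketch below implements linear warmup followed by a decay. The list here does not indicate which decay schedule was used, so the inverse-square-root decay is purely an assumption chosen for illustration; the names `ADAM_SETTINGS` and `learning_rate` are hypothetical.

```python
import math

# Values copied from the hyperparameter list above.
ADAM_SETTINGS = {"betas": (0.9, 0.98), "eps": 1e-6, "weight_decay": 0.01}
PEAK_LR = 9.1875e-5
WARMUP_STEPS = 5_000


def learning_rate(step, peak_lr=PEAK_LR, warmup_steps=WARMUP_STEPS):
    """Linear warmup to peak_lr, then inverse-square-root decay (assumed)."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    # ASSUMPTION: the decay shape is not confirmed by the source.
    return peak_lr * math.sqrt(warmup_steps / step)


for step in (0, 2_500, 5_000, 20_000):
    print(step, learning_rate(step))
```

Under this assumed schedule the rate reaches its peak exactly at step 5,000 and has halved by step 20,000; a cosine, linear, or polynomial decay would only change the second branch.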
- 🏃♀️Training
- model initialization: /
- training strategies
- left-to-right
- fill-in-the-middle
- trained tokens/steps: converged after 397k steps (183 epochs)
- hardware: 16 32GB Tesla V100 GPUs
- training time: 3 weeks
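The training figures above allow a quick back-of-the-envelope check. The snippet below is only arithmetic on the reported numbers; it assumes "3 weeks" means exactly 21 days of continuous training on all 16 GPUs, which the source does not state.

```python
TOTAL_STEPS = 397_000
EPOCHS = 183
NUM_GPUS = 16
TRAINING_DAYS = 21  # ASSUMPTION: "3 weeks" taken as exactly 21 days

steps_per_epoch = TOTAL_STEPS / EPOCHS
gpu_hours = NUM_GPUS * TRAINING_DAYS * 24
seconds_per_step = TRAINING_DAYS * 24 * 3600 / TOTAL_STEPS

print(f"steps per epoch:  {steps_per_epoch:.0f}")
print(f"GPU-hours:        {gpu_hours}")
print(f"seconds per step: {seconds_per_step:.2f}")
```

Under these assumptions the run works out to roughly 2,169 steps per epoch, about 8,064 V100 GPU-hours, and a little over 4.5 seconds per optimizer step.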
This post is licensed under CC BY 4.0 by the author.