
JuPyT5

  • 📙Paper: Training and Evaluating a Jupyter Notebook Data Science Assistant
  • 📚Publisher: arXiv
  • 🏠Author Affiliation: Microsoft
  • 🔑Public: ❌
  • 🌐Architecture
    • Encoder-Decoder
    • Decoder-Only
  • 📏Model Size
    • 350M
  • 🗂️Data pre-processing
    • Data Resource
      • The pre-training and training data consists of all Jupyter notebooks from GitHub repositories that GitHub labeled as consisting primarily of Jupyter Notebooks; the repositories were cloned and processed in April 2021. In total this yielded 1.97 million repositories and 9.06 million Jupyter notebooks, 7.24 million of which were unique. To prevent data leakage into DSP, all repositories appearing in the JuICe test and development sets were removed from the training set, and it was further ensured that no duplicates from these holdout sets were present.
    • De-duplication: ✅ (only unique notebooks retained: 7.24M of 9.06M)
    • Filter Strategies
      • Training Subset (Markdown Focused): notebooks with at least one code cell and in which at least 1/3 of the cells are Markdown cells. This subset contains 4.1M notebooks, roughly 3/5 of the total training set, and 15.7B training tokens (a preprocessing sketch follows this section).
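
The preprocessing code itself is not public, so the following is a minimal sketch of the two steps above, assuming notebooks stored in the standard nbformat-4 JSON layout: de-duplication by content hash (exact-duplicate removal is an assumption; the post does not define what counts as "unique") and the Markdown Focused filter. All function names are illustrative.

```python
import hashlib
import json

def notebook_hash(ipynb_path: str) -> str:
    """De-duplication sketch: hash raw bytes so byte-identical notebooks
    collapse to one copy. Exact-duplicate removal is an assumption; the
    post does not define what counts as "unique"."""
    with open(ipynb_path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

def is_markdown_focused(ipynb_path: str) -> bool:
    """Markdown Focused filter: at least one code cell and at least 1/3
    of all cells are Markdown cells."""
    with open(ipynb_path, encoding="utf-8") as f:
        nb = json.load(f)
    cells = nb.get("cells", [])  # nbformat-4 layout assumed
    if not cells:
        return False
    n_code = sum(c.get("cell_type") == "code" for c in cells)
    n_md = sum(c.get("cell_type") == "markdown" for c in cells)
    return n_code >= 1 and 3 * n_md >= len(cells)
```
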
  • 🍉Tokenizer
    • Technology
      • Byte-level Byte-Pair-Encoding (BBPE)
      • SentencePiece
    • Details
      • PyMT5 tokenizer (a BBPE training illustration follows this section)
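
The post lists BBPE and SentencePiece as the candidate tokenizer technologies and notes that the PyMT5 tokenizer is reused. As a hedged illustration of the byte-level BPE approach only (JuPyT5 reuses an existing tokenizer rather than training a new one), the Hugging Face tokenizers library can train one as below; the corpus path and vocabulary size are assumptions.

```python
from tokenizers import ByteLevelBPETokenizer

# Illustration only: train a byte-level BPE tokenizer from scratch on a
# hypothetical dump of notebook source. JuPyT5 itself reuses the PyMT5
# tokenizer; the corpus path and vocab size below are assumptions.
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["notebook_corpus.txt"],
    vocab_size=50_000,
    min_frequency=2,
    special_tokens=["<pad>", "<s>", "</s>"],
)
tokenizer.save_model("bbpe_tokenizer")

print(tokenizer.encode("import pandas as pd\n").tokens)
```
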
  • 🧪Hyperparameters (JuPyT5 350M)
    • optimizer: Adam
      • betas: 0.9, 0.98
      • eps: 1e-6
    • batch size: /
    • context window: 2,200
    • gradient accumulation steps: /
    • warmup steps: 5,000
    • learning rate: 9.1875e-5
    • weight decay: 0.01
    • decay schedule
      • Cosine
      • Linear
      • Polynomial
      • Inverse Square
    • floating-point precision: fp16 (a wiring sketch of these settings follows this section)
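
These settings map directly onto a standard PyTorch setup. Below is a minimal sketch wiring the reported values together, assuming linear decay (one of the four schedules listed above; the actual choice is not marked here) and a placeholder model and step count.

```python
import torch
import torch.nn as nn

model = nn.Linear(8, 8)   # stand-in for the 350M-parameter model
total_steps = 100_000     # hypothetical; the total step count is not reported

optimizer = torch.optim.Adam(
    model.parameters(),
    lr=9.1875e-5,         # peak learning rate from the list above
    betas=(0.9, 0.98),
    eps=1e-6,
    weight_decay=0.01,
)

def lr_lambda(step: int) -> float:
    # 5,000 warmup steps as listed; linear decay afterwards is just one
    # of the four schedules named above, not a confirmed choice.
    if step < 5_000:
        return step / 5_000
    return max(0.0, (total_steps - step) / (total_steps - 5_000))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
scaler = torch.cuda.amp.GradScaler()  # fp16 mixed-precision training
```
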
  • 🏃‍♀️Training
    • model initialization: PyMT5
    • training strategies
      • left-to-right
      • fill-in-the-middle (see the transform sketch after this section)
    • trained tokens/steps: 27.2B tokens
    • hardware: 80 × 32 GB Tesla V100 GPUs
    • training time: /
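
Fill-in-the-middle training rearranges each sequence so that a randomly chosen middle span is predicted last, letting a left-to-right model learn to infill. A minimal sketch of such a transform is below; the sentinel token ids, span sampling, and prefix-suffix-middle ordering are common conventions assumed for illustration, not details confirmed by the paper.

```python
import random

def fim_transform(tokens: list[int], fim_prefix: int, fim_suffix: int,
                  fim_middle: int) -> list[int]:
    """Move a random middle span behind sentinel tokens so a left-to-right
    model learns to infill. Sentinel ids and the prefix-suffix-middle
    ordering are illustrative assumptions, not details from the paper."""
    if not tokens:
        return tokens
    i, j = sorted(random.sample(range(len(tokens) + 1), 2))
    prefix, middle, suffix = tokens[:i], tokens[i:j], tokens[j:]
    return [fim_prefix] + prefix + [fim_suffix] + suffix + [fim_middle] + middle

# Example: tokens 0..9 with hypothetical sentinel ids 101, 102, 103.
print(fim_transform(list(range(10)), 101, 102, 103))
```
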
