
PyMT5

  • 📙Paper: PyMT5: multi-mode translation of natural language and Python code with transformers
  • 📚Publisher: EMNLP
  • 🏠Author Affiliation: Microsoft
  • 🔑Public: ❌
  • 🌐Architecture
    • Encoder-Decoder ✅ (PyMT5 itself is a T5-style sequence-to-sequence model)
    • Decoder-Only (used only for the GPT2-medium baseline in the paper)
  • 📏Model Size
    • 374M
  • 🗂️Data pre-processing
    • Data Resource
      • The dataset consists of 118k GitHub repositories, comprising all public repositories labeled as containing primarily Python source code, with at least 10 stars, and with a commit in the past 5 years. 112k of these repositories were successfully cloned, yielding 5.3 million Python files extracted from the default HEAD state of each repository.
    • De-duplication: ✅
    • Filter Strategies
      • Use the Python 3.7 standard library ast module to produce the file-level AST for each Python file
      • Use 2to3 and autopep8 to normalize differences in code style and whitespace/tab conventions
      • Use the Python module astunparse to take the AST of each method and unparse it back into source code
      • Ignore comments, as they generally represent trivia and are not part of the normal language syntax
      • Clean the docstrings by removing non-ASCII characters, normalizing Unicode, and replacing commit hashes, file paths, and URLs with placeholder tokens (a sketch of this extraction follows the list)
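
As a rough illustration of the extraction pipeline above, here is a minimal Python sketch built on the standard ast module and the astunparse package. The placeholder tokens and regular expressions are assumptions for illustration, not the authors' exact implementation, and the 2to3/autopep8 normalization step is omitted.

```python
import ast
import re
import unicodedata

import astunparse  # pip install astunparse

# Hypothetical placeholder tokens; the post only says commit hashes,
# file paths, and URLs are replaced with placeholders.
_URL = re.compile(r"https?://\S+")
_COMMIT_HASH = re.compile(r"\b[0-9a-f]{7,40}\b")
_FILE_PATH = re.compile(r"(?:/[\w.\-]+){2,}")


def clean_docstring(doc: str) -> str:
    """Normalize Unicode, drop non-ASCII characters, and mask volatile tokens."""
    doc = unicodedata.normalize("NFKD", doc)
    doc = doc.encode("ascii", errors="ignore").decode("ascii")
    doc = _URL.sub("<URL>", doc)
    doc = _COMMIT_HASH.sub("<HASH>", doc)
    doc = _FILE_PATH.sub("<PATH>", doc)
    return doc.strip()


def extract_methods(source: str):
    """Yield (method_source, cleaned_docstring) pairs from one Python file."""
    tree = ast.parse(source)  # file-level AST via the standard library
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            code = astunparse.unparse(node)  # AST -> normalized source text
            doc = ast.get_docstring(node) or ""
            yield code, clean_docstring(doc)
```
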
  • 🍉Tokenizer
    • Technology
      • Byte-level Byte-Pair-Encoding (BBPE)
      • SentencePiece
    • Details
      • The same extended GPT tokenizer, trained on raw Python files (see the sketch below).
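
For context, a GPT-style byte-level BPE tokenizer can be trained on raw Python files roughly as follows, here using the Hugging Face tokenizers library. The file list, vocabulary size, and special tokens are illustrative assumptions, not details from the paper.

```python
import os

from tokenizers import ByteLevelBPETokenizer

# Hypothetical list of extracted Python source files.
python_files = ["corpus/file_0001.py", "corpus/file_0002.py"]

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=python_files,
    vocab_size=50_000,                 # assumed, GPT-2-like vocabulary size
    min_frequency=2,
    special_tokens=["<pad>", "<s>", "</s>"],
)

os.makedirs("pymt5-tokenizer", exist_ok=True)
tokenizer.save_model("pymt5-tokenizer")  # writes vocab.json / merges.txt

# Byte-level BPE can encode any input, including code with unusual whitespace.
ids = tokenizer.encode("def add(a, b):\n    return a + b").ids
```
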
  • 🧪Hyperparameters (PyMT5 374M)
    • optimizer: Adam
      • betas: 0.9, 0.98
      • eps: 1e-6
    • batch size: /
    • context window: 2,200
    • gradient accumulation steps: /
    • warmup steps: 5,000
    • learning rate: 9.1875e-5
    • weight decay: 0.01
    • decay schedule
      • Cosine
      • Linear
      • Polynomial
      • Inverse Square
    • precision floating point: fp16
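
Reconstructed in PyTorch, the reported optimizer settings would look roughly like the sketch below. The inverse-square-root decay after warmup is only one of the schedules listed above and is an assumption here, as is the use of plain torch.optim.Adam.

```python
import torch

model = torch.nn.Linear(512, 512)  # stand-in for the actual PyMT5 model

optimizer = torch.optim.Adam(
    model.parameters(),
    lr=9.1875e-5,          # peak learning rate
    betas=(0.9, 0.98),
    eps=1e-6,
    weight_decay=0.01,
)

warmup_steps = 5_000

def lr_lambda(step: int) -> float:
    # Linear warmup to the peak LR, then inverse-square-root decay (assumed).
    step = max(step, 1)
    if step < warmup_steps:
        return step / warmup_steps
    return (warmup_steps / step) ** 0.5

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
scaler = torch.cuda.amp.GradScaler()  # fp16 mixed-precision loss scaling
```
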
  • 🏃‍♀️Training
    • model initialization: /
    • training strategies
      • left-to-right
      • fill-in-the-middle
    • trained tokens/steps: convergence occurs after 397k steps or 183 epochs
    • hardware: 16 × 32GB Tesla V100 GPUs
    • training time: 3 weeks
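
The reported hardware (16 V100 GPUs with fp16) implies distributed data-parallel training. A minimal sketch of such a setup in PyTorch follows; the launch command and stand-in model are assumptions, since the paper does not describe the exact training harness.

```python
# Launch across 2 nodes x 8 GPUs each, e.g.:
#   torchrun --nnodes=2 --nproc_per_node=8 train.py
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(512, 512).cuda(local_rank)  # stand-in model
model = DDP(model, device_ids=[local_rank])
```
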
