CodeGen

  • 📙Paper: CodeGen: An Open Large Language Model for Code with Multi-Turn Program Synthesis
  • 📚Publisher: arXiv
  • 🏠Author Affiliation: Salesforce Research
  • 🔑Public: ✅ (a usage sketch for the released checkpoints appears at the end of this post)
  • 🌐Architecture
    • Decoder-Only (an autoregressive transformer, as in GPT-2/GPT-3)
  • 📏Model Size
    • 350M; 2.7B; 6.1B; 16.1B
  • 🗂️Data pre-processing
    • Data Resource
      • The natural-language dataset The Pile is an 825.18 GiB English text corpus
      • The multi-lingual dataset BigQuery is a subset of Google’s publicly available BigQuery dataset, consisting of open-source-licensed code in multiple programming languages
      • The mono-lingual dataset BIGPYTHON contains a large amount of Python code: public, non-personal, permissively licensed Python files compiled from GitHub in October 2021
    • De-duplication: ✅
    • Filter Strategies (files failing any of these heuristics are removed; a sketch follows this list)
      • file extension (allow-list)
      • average line length < 100 characters
      • maximum line length < 1,000 characters
      • ≤ 90% of the characters being decimal or hexadecimal digits
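
The post lists these heuristics only at a high level. A minimal sketch of how the de-duplication and per-file filtering might be applied is below; the `ALLOWED_EXTENSIONS` set, the function name, and the exact-match hashing scheme are illustrative assumptions, not details from the paper.

```python
import hashlib

# Illustrative extension allow-list; the paper filters by file extension,
# but the exact set used is an assumption here.
ALLOWED_EXTENSIONS = {".py", ".java", ".js", ".c", ".cpp", ".go"}

seen_hashes = set()

def keep_file(path: str, text: str) -> bool:
    """Return True if a source file survives the pre-processing heuristics."""
    # 1. Filter by file extension.
    if not any(path.endswith(ext) for ext in ALLOWED_EXTENSIONS):
        return False

    lines = text.splitlines()
    if not lines:
        return False

    # 2. Average line length must be < 100 characters.
    if sum(len(line) for line in lines) / len(lines) >= 100:
        return False

    # 3. Maximum line length must be < 1,000 characters.
    if max(len(line) for line in lines) >= 1000:
        return False

    # 4. Drop files where > 90% of characters are decimal/hexadecimal digits.
    hex_digits = sum(ch.isdigit() or ch in "abcdefABCDEF" for ch in text)
    if hex_digits / len(text) > 0.9:
        return False

    # 5. Exact-match de-duplication via a content hash.
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    if digest in seen_hashes:
        return False
    seen_hashes.add(digest)
    return True
```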
  • 🍉Tokenizer
    • Technology
      • Byte-level Byte-Pair-Encoding (BBPE), inherited from GPT-2
    • Details
      • The BPE vocabulary of GPT-2 is extended with special tokens representing runs of tabs and white spaces. In the multi-lingual setting of BigQuery, a prefix is prepended to indicate the name of the programming language (see the sketch below).
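
A hedged sketch of this extension using the Hugging Face `transformers` API follows; the exact whitespace run lengths and the form of the language prefix are assumptions, not taken from the paper.

```python
from transformers import AutoTokenizer

# Start from the standard GPT-2 byte-level BPE tokenizer.
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Extend the vocabulary with tokens for runs of whitespace; the run lengths
# chosen here (2-32 spaces, 1-10 tabs) are an assumption for illustration.
whitespace_runs = [" " * n for n in range(2, 33)] + ["\t" * n for n in range(1, 11)]
tokenizer.add_tokens(whitespace_runs)

# In the multi-lingual (BigQuery) setting a language-name prefix is prepended;
# a comment-style prefix is assumed here.
prompt = "# Python\ndef add(a, b):\n    return a + b\n"
print(tokenizer.tokenize(prompt)[:12])
```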
  • 🧪Hyperparameters (CodeGen 16.1B; a PyTorch sketch follows this list)
    • optimizer: Adam
      • betas: 0.9, 0.999
      • eps: 1e-8
    • batch size: 2M tokens
    • context window: 2,048
    • gradient accumulation steps: /
    • warmup steps: 3,000
    • learning rate: 0.5e-4
    • weight decay: 0.1
    • decay schedule
      • Cosine (with warmup)
    • floating-point precision: /
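
A minimal PyTorch sketch wiring up the reported optimizer settings is below. Whether the weight decay is decoupled (AdamW) and the exact cosine-with-warmup shape are assumptions, and a stand-in linear layer replaces the actual 16.1B-parameter model.

```python
import math
import torch

model = torch.nn.Linear(8, 8)  # stand-in for the actual 16.1B-parameter model

# Reported settings: Adam (betas 0.9/0.999, eps 1e-8), lr 0.5e-4, weight decay 0.1.
# Decoupled weight decay (AdamW) is an assumption.
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=0.5e-4,
    betas=(0.9, 0.999),
    eps=1e-8,
    weight_decay=0.1,
)

WARMUP_STEPS = 3_000
TOTAL_STEPS = 150_000

def lr_lambda(step: int) -> float:
    # Linear warmup over the first 3,000 steps, then cosine decay to zero
    # over the remaining steps (the cosine shape is an assumption).
    if step < WARMUP_STEPS:
        return step / max(1, WARMUP_STEPS)
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
```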
  • 🏃‍♀️Training
    • model initialization: /
    • training strategies
      • left-to-right (autoregressive next-token prediction)
    • trained tokens/steps: 150K steps
    • hardware: TPU-v4
    • training time: /
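
Since the checkpoints are publicly released, generating code with them is straightforward. A minimal sketch via the Hugging Face Hub is below; the released checkpoints follow the `Salesforce/codegen-<size>-<data>` naming, with the small mono-lingual (Python) variant used here for speed.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load a released checkpoint from the Hugging Face Hub.
checkpoint = "Salesforce/codegen-350M-mono"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)

# Complete a Python function signature.
inputs = tokenizer("def fibonacci(n):", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```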