
CodeGeeX

  • 📙Paper: CodeGeeX
  • 📚Publisher: other
  • 🏠Author Affiliation: Tsinghua University
  • 🔑Public: ✅ (application required)
  • 🌐Architecture
    • Decoder-Only
  • 📏Model Size
    • 13B
  • 🗂️Data pre-processing
    • Data Resource
      • The Pile
      • CodeParrot
      • Scraped from public GitHub repositories that do not appear in the previous datasets, covering Python, Java, and C++
    • De-duplication: /
    • Filter Strategies (a sketch of these rules follows this section)
      • A file is filtered out if it has more than 100 characters per line on average,
      • is automatically generated,
      • has a ratio of alphabetic characters below 40%, or
      • is bigger than 100KB or smaller than 1KB.
      • To help the model distinguish between languages, a language-specific prefix is added at the beginning of each segment, in the form [Comment sign] language: [LANG], e.g., # language: Python.
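Taken together, the rules above amount to a simple per-file filter plus a prefixing pass. The following is a minimal Python sketch of them, assuming plain in-memory strings; the numeric thresholds come from the card, while the auto-generation heuristic, the COMMENT_SIGNS table, and the function names are illustrative assumptions rather than the paper's actual pipeline.

```python
def keep_file(text: str) -> bool:
    lines = text.splitlines()
    if not lines:
        return False
    # filtered out if more than 100 characters per line on average
    if len(text) / len(lines) > 100:
        return False
    # filtered out if automatically generated (assumed marker heuristic)
    head = text[:1000].lower()
    if "auto-generated" in head or "generated by" in head:
        return False
    # filtered out if the ratio of alphabetic characters is below 40%
    if sum(c.isalpha() for c in text) / len(text) < 0.4:
        return False
    # filtered out if bigger than 100KB or smaller than 1KB
    size = len(text.encode("utf-8"))
    if size > 100 * 1024 or size < 1 * 1024:
        return False
    return True


# Language prefix: "[Comment sign] language: [LANG]", e.g. "# language: Python"
COMMENT_SIGNS = {"Python": "#", "Java": "//", "C++": "//"}

def add_language_prefix(segment: str, lang: str) -> str:
    return f"{COMMENT_SIGNS[lang]} language: {lang}\n{segment}"
```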
  • 🍉Tokenizer
    • Technology
      • Byte-level Byte-Pair-Encoding (BBPE)
    • Details
      • We use the same tokenizer as GPT-2 and process whitespaces as extra tokens
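As a rough illustration, that setup can be approximated with the Hugging Face GPT-2 tokenizer; the exact extra-token scheme for whitespace is an assumption here, not something the card specifies.

```python
# Hypothetical sketch: start from GPT-2's byte-level BPE and register runs
# of spaces as extra tokens so that indentation does not explode into many
# single-space tokens. The 2..8-space scheme is an assumption, not the
# paper's exact vocabulary.
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.add_tokens([" " * n for n in range(2, 9)])

ids = tokenizer.encode("# language: Python\ndef f(x):\n        return x")
print(tokenizer.convert_ids_to_tokens(ids))
```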
  • 🧪Hyperparameters (CodeGeeX 13B)
    • optimizer: Adam (with ZeRO-style optimizer-state partitioning)
      • betas: 0.9, 0.95
      • eps: /
    • batch size: micro-batch size 16; global batch size 3,072
    • context window: 2,048
    • gradient accumulation steps: /
    • warmup steps: /
    • learning rate: /
    • weight decay: 0.1
    • decay schedule
      • Cosine
    • precision: fp16 (mixed precision)
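For concreteness, these fields can be collected into a config like the hypothetical PyTorch-style sketch below; "/" entries stay None, and the AdamW call and the 1e-4 fallback are illustrative assumptions. The listed numbers are at least self-consistent: assuming 8-way model parallelism over the 1,536 devices noted in the training section, 1,536 / 8 = 192 data-parallel replicas, and 192 × 16 micro-batches = 3,072 global batch, which would imply no gradient accumulation.

```python
# Hypothetical config collecting the card's hyperparameters; "/" fields are
# kept as None. The optimizer wiring is an illustrative PyTorch sketch, not
# the paper's Mindspore training code.
import torch

CODEGEEX_13B = {
    "context_window": 2048,
    "micro_batch_size": 16,
    "global_batch_size": 3072,
    "betas": (0.9, 0.95),
    "weight_decay": 0.1,
    "dtype": torch.float16,
    "learning_rate": None,  # "/" in the card
    "warmup_steps": None,   # "/" in the card
}

def build_optimizer(model: torch.nn.Module) -> torch.optim.Optimizer:
    # Adam-family optimizer with the betas and weight decay listed above;
    # the 1e-4 fallback is a placeholder, not a value from the card.
    return torch.optim.AdamW(
        model.parameters(),
        lr=CODEGEEX_13B["learning_rate"] or 1e-4,
        betas=CODEGEEX_13B["betas"],
        weight_decay=CODEGEEX_13B["weight_decay"],
    )
```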
  • 🏃‍♀️Training
    • model initialization: /
    • training strategies
      • left-to-right (autoregressive next-token prediction)
    • trained tokens/steps: /
    • hardware: 1,536 Ascend 910 AI Processors (32 GB each)
    • training time: /
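To make the left-to-right objective concrete, here is a minimal, hypothetical PyTorch sketch of the loss; the model and batch are placeholders, and only the shift-and-cross-entropy pattern reflects the training strategy above.

```python
# Minimal sketch of the left-to-right objective: predict token t+1 from
# tokens <= t via cross-entropy between shifted logits and labels.
import torch
import torch.nn.functional as F

def lm_loss(model: torch.nn.Module, tokens: torch.Tensor) -> torch.Tensor:
    # tokens: (batch, seq_len) integer ids, seq_len up to the 2,048 context.
    logits = model(tokens)            # (batch, seq_len, vocab_size)
    shift_logits = logits[:, :-1, :]  # predictions for positions 1..T-1
    shift_labels = tokens[:, 1:]      # targets: the next token at each step
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
    )
```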
