
MIM

  • 📙Paper: Meet in the Middle: A New Pre-training Paradigm
  • 📚Publisher: arXiv
  • 🏠Author Affiliation: Microsoft Azure AI
  • 🔑Public: ✅
  • 🌐Architecture
    • Encoder-Decoder
    • Decoder-Only
  • 📏Model Size
    • 350M, 1.3B, 2.7B
  • 🗂️Data pre-processing
    • Data Resource
      • We first pre-train our models on a large and diverse corpus of public code with permissive licenses, which covers multiple programming languages. Python, Java, and C++ are the dominant languages in our corpus, accounting for most of the pre-training data.
      • Apart from pre-training our model on public code datasets, we also pre-train our model on natural language, specifically the union of the following datasets: CC-News, OpenWebText, CC-Stories, and CC-100.
    • De-duplication: ✅
    • Filter Strategies
      • Not reported
  • 🍉Tokenizer
    • Technology
      • Byte-level Byte-Pair-Encoding (BBPE)
      • SentencePiece
    • Details
      • Our tokenizer is based on the Byte-Pair Encoding algorithm widely used in previous work [Codex] to directly encode raw bytes with a vocabulary of 100,257 tokens. We pre-tokenize the text using a special regex pattern that accounts for splitting on digits and newlines together with the default GPT-2 pre-tokenization (see the pre-tokenization sketch after this list).
  • 🧪Hyperparameters (MIM 2.7B; optimizer and schedule sketch after this list)
    • optimizer: Adam
      • betas: 0.9, 0.95
      • eps: 1e-8
    • batch size: 1M tokens
    • context window: 2,048
    • gradient accumulation steps: /
    • warmup steps: 5,000
    • learning rate: 2e-4
    • weight decay: 0.1
    • decay schedule
      • Cosine
      • Linear
      • Polynomial
      • Inverse Square
    • floating-point precision: fp32, fp16
  • 🏃‍♀️Training
    • model initialization: from scratch
    • training strategies
      • left-to-right
      • fill-in-the-middle (data-transform sketch after this list)
    • trained tokens/steps: 286K steps
    • hardware: 128 A100 GPUs with 80GB memory
    • training time: about 4 days
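
The tokenizer details above describe byte-level BPE with a pre-tokenization regex that splits on digits and newlines on top of the default GPT-2 pattern. The sketch below shows what such a pre-tokenization step can look like in Python; the exact pattern and the 100,257-token vocabulary are not reproduced in this card, so the regex (including the 3-digit grouping) is an illustrative assumption rather than the paper's actual tokenizer.

```python
# Minimal sketch of GPT-2-style pre-tokenization extended to split digit runs
# and newlines, as described in the tokenizer card above. The exact pattern
# used by MIM is not given here; this regex is an illustrative assumption.
import regex  # third-party 'regex' package, needed for \p{...} Unicode classes

PRETOKENIZE_PATTERN = regex.compile(
    r"""'s|'t|'re|'ve|'m|'ll|'d"""   # common English contractions (GPT-2 default)
    r"""| ?\p{L}+"""                 # words, with an optional leading space
    r"""|\p{N}{1,3}"""               # digits split into short groups (assumption)
    r"""| ?[^\s\p{L}\p{N}]+"""       # punctuation and symbols
    r"""|\s*[\r\n]+"""               # newline runs kept as their own chunks (assumption)
    r"""|\s+(?!\S)|\s+"""            # remaining whitespace
)

def pretokenize(text: str) -> list[str]:
    """Split raw text into chunks that a byte-level BPE would encode independently."""
    return PRETOKENIZE_PATTERN.findall(text)

print(pretokenize("def add(a, b):\n    return a + 12345"))
```

The standard-library `re` module does not support the `\p{...}` Unicode classes, hence the `regex` dependency.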
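
The hyperparameter card lists Adam with betas (0.9, 0.95), eps 1e-8, a peak learning rate of 2e-4, weight decay 0.1, 5,000 warmup steps, a 2,048-token context window, batches of roughly 1M tokens, and 286K training steps. The PyTorch sketch below wires those numbers into an optimizer and a warmup schedule; it uses cosine decay, the first option in the card's decay-schedule list, and the placeholder model, the decay floor, and the use of plain Adam rather than decoupled AdamW are assumptions where the card does not pin down the details.

```python
# Hedged sketch of the optimizer/schedule configuration listed in the card.
# The model is a placeholder; MIM's 2.7B decoder architecture is not reproduced.
import math
import torch

model = torch.nn.Linear(2048, 2048)  # placeholder; context window is 2,048 tokens

optimizer = torch.optim.Adam(
    model.parameters(),
    lr=2e-4,             # peak learning rate from the card
    betas=(0.9, 0.95),
    eps=1e-8,
    weight_decay=0.1,    # the card says Adam; decoupled AdamW is a common alternative
)

WARMUP_STEPS = 5_000
TOTAL_STEPS = 286_000    # "286K steps" from the card

def lr_lambda(step: int) -> float:
    """Linear warmup, then cosine decay (the 10% decay floor is an assumption)."""
    if step < WARMUP_STEPS:
        return step / max(1, WARMUP_STEPS)
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    return 0.1 + 0.9 * 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
```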
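
The training strategies above list both left-to-right and fill-in-the-middle objectives. As a generic illustration of the fill-in-the-middle idea only (not MIM's own recipe, which trains a forward and a backward model that meet in the middle), the sketch below splits a document into prefix, middle, and suffix and reorders it with sentinel tokens so that a left-to-right decoder generates the middle last; the sentinel names and the application rate are hypothetical.

```python
# Generic fill-in-the-middle (FIM) data transform, for illustration only.
# Sentinel strings and the application rate are assumptions, not MIM's recipe.
import random

PRE, SUF, MID = "<fim_prefix>", "<fim_suffix>", "<fim_middle>"  # hypothetical sentinels

def fim_transform(doc: str, rate: float = 0.5, rng: random.Random | None = None) -> str:
    """With probability `rate`, rewrite `doc` as prefix+suffix+middle; otherwise keep it left-to-right."""
    rng = rng or random.Random()
    if rng.random() >= rate or len(doc) < 3:
        return doc  # plain left-to-right training example
    i, j = sorted(rng.sample(range(1, len(doc)), 2))  # two cut points
    prefix, middle, suffix = doc[:i], doc[i:j], doc[j:]
    return f"{PRE}{prefix}{SUF}{suffix}{MID}{middle}"

print(fim_transform("def add(a, b):\n    return a + b\n", rate=1.0, rng=random.Random(0)))
```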
