
CodeT5

  • 📙Paper: CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation
  • 📚Publisher: EMNLP
  • 🏠Author Affiliation: Salesforce Research Asia
  • 🔑Public: ✅
  • 🌐Architecture
    • Encoder-Decoder
  • 📏Model Size
    • 60M; 220M; 770M
  • 🗂️Data pre-processing
    • Data Resource
      • CodeSearchNet
      • BigQuery
    • De-duplication: ❌
    • Filter Strategies
      • /
  • 🍉Tokenizer
    • Technology
      • Byte-level Byte-Pair-Encoding (BBPE)
    • Details
      • We train a byte-level BPE tokenizer and allow tokens to extend across whitespace (excluding newline characters) so that common code idioms (e.g., import numpy as np) are represented as single tokens in the vocabulary (see the tokenizer sketch after this list).
  • 🧪Hyperparameters (CodeT5 770M)
    • optimizer: AdamW
      • betas: /
      • eps: /
    • batch size: /
    • context window: 2,048
    • gradient accumulation steps: /
    • warmup steps: 1,000
    • learning rate: 2e-4
    • weight decay: 0.05
    • decay schedule
      • Cosine
      • Linear
      • Polynomial
      • Inverse Square
    • floating-point precision: fp16 (an optimizer/scheduler configuration sketch follows this list)
  • 🏃‍♀️Training
    • model initialization: from scratch
    • training strategies
      • left-to-right
      • fill-in-the-middle (a generic FIM data-transformation sketch follows this list)
    • trained tokens/steps: /
    • hardware: 16 NVIDIA A100 GPUs with 40 GB memory
    • training time: 21 days
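
Tokenizer sketch. The whitespace-spanning behaviour described under Tokenizer → Details can be approximated with the Hugging Face tokenizers library: split only on newlines before the byte-level mapping, so that learned merges may cross spaces within a line but never cross lines. This is a minimal sketch under stated assumptions, not CodeT5's released tokenizer-training script; the corpus file name, vocabulary size, and special tokens are illustrative placeholders.

```python
from tokenizers import Tokenizer, models, pre_tokenizers, decoders, trainers

# Byte-level BPE model with no merges yet.
tokenizer = Tokenizer(models.BPE())

# Pre-tokenize by isolating newlines only, then map bytes without the usual
# GPT-2 regex split, so merges can span spaces inside a line
# (e.g. "import numpy as np") but never cross line boundaries.
tokenizer.pre_tokenizer = pre_tokenizers.Sequence([
    pre_tokenizers.Split(pattern="\n", behavior="isolated"),
    pre_tokenizers.ByteLevel(add_prefix_space=False, use_regex=False),
])
tokenizer.decoder = decoders.ByteLevel()

trainer = trainers.BpeTrainer(
    vocab_size=32_000,  # assumed size, not taken from the paper
    special_tokens=["<pad>", "<s>", "</s>", "<unk>", "<mask>"],
    initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),
)

# "code_corpus.txt" is a hypothetical plain-text dump of the training code.
tokenizer.train(["code_corpus.txt"], trainer)
tokenizer.save("code_bpe_tokenizer.json")

print(tokenizer.encode("import numpy as np").tokens)
```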
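
Hyperparameter sketch. The values reported above (AdamW, peak learning rate 2e-4, 1,000 warmup steps, weight decay 0.05, fp16) can be wired together as in the following PyTorch sketch. The model, batch, and total step count are placeholders (they are not reported on this card), betas/eps fall back to PyTorch defaults since they are not reported, a linear decay schedule is assumed purely for illustration because the card lists several options, and a CUDA GPU is assumed for the fp16 path.

```python
import torch
from torch.optim import AdamW
from transformers import get_linear_schedule_with_warmup

# Tiny stand-in model; a real run would load CodeT5-large (770M).
model = torch.nn.Linear(1024, 1024).cuda()
total_steps = 100_000  # assumed; total steps are not reported on the card

optimizer = AdamW(
    model.parameters(),
    lr=2e-4,            # peak learning rate (from the card)
    weight_decay=0.05,  # from the card
    # betas / eps are not reported; PyTorch defaults apply
)

# 1,000 warmup steps (from the card); linear decay assumed for illustration.
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=1_000, num_training_steps=total_steps
)

scaler = torch.cuda.amp.GradScaler()  # fp16 mixed precision (from the card)

# One illustrative training step on a dummy batch and dummy loss.
x = torch.randn(8, 1024, device="cuda")
with torch.cuda.amp.autocast():       # forward pass in fp16
    loss = model(x).pow(2).mean()
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
scheduler.step()
optimizer.zero_grad()
```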
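
Fill-in-the-middle sketch. The card lists fill-in-the-middle (FIM) among the training strategies; purely as a generic illustration of that strategy (not taken from CodeT5's released training code), a training document can be split into a prefix, middle, and suffix and rearranged with sentinel tokens so that a left-to-right model learns to infill. The sentinel token strings below are hypothetical.

```python
import random

# Hypothetical sentinel tokens; the actual strings are model-specific.
PREFIX_TOK, MIDDLE_TOK, SUFFIX_TOK = "<fim_prefix>", "<fim_middle>", "<fim_suffix>"

def apply_fim(document: str, fim_rate: float = 0.5, rng=random) -> str:
    """Rearrange a document for fill-in-the-middle training.

    With probability `fim_rate`, split the text at two random positions into
    (prefix, middle, suffix) and emit
    prefix-sentinel + prefix + suffix-sentinel + suffix + middle-sentinel + middle,
    so a left-to-right model is trained to generate the middle given both sides.
    Otherwise the document is returned unchanged (plain left-to-right example).
    """
    if rng.random() >= fim_rate or len(document) < 3:
        return document
    i, j = sorted(rng.sample(range(len(document)), 2))
    prefix, middle, suffix = document[:i], document[i:j], document[j:]
    return f"{PREFIX_TOK}{prefix}{SUFFIX_TOK}{suffix}{MIDDLE_TOK}{middle}"

example = "def add(a, b):\n    return a + b\n"
print(apply_fim(example, fim_rate=1.0))
```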
