Data Resource

CodeNet

Project CodeNet is a large-scale dataset with approximately 14 million code samples, each of which is an intended solution to one of 4,000 coding problems.

Google BigQuery & BigQuery Guide

The Google BigQuery Public Datasets program now offers a full snapshot of the content of more than 2.8 million open source GitHub repositories in BigQuery.

Archive.org

The site hosts a large archive of Stack Overflow posts.

GHTorrent & GHTorrent Paper

A library and a collection of scripts used to retrieve data from the GitHub API and extract metadata into an SQL database, in a modular and scalable manner. The scripts are distributed as a Gem (ghtorrent), but they can also be run by checking out this repository.

CodeSearchNet

The primary dataset consists of 2 million (comment, code) pairs from open source libraries. Concretely, a comment is a top-level function or method comment (e.g. docstrings in Python), and code is an entire function or method. Currently, the dataset contains Python, JavaScript, Ruby, Go, Java, and PHP code.
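
As a rough illustration of what a (comment, code) pair looks like for Python, the sketch below uses the standard-library `ast` module to pair each function's docstring with the function's full source. This is not the CodeSearchNet extraction pipeline; the helper name `extract_pairs` and the sample source are hypothetical and exist only for this example.

```python
import ast


def extract_pairs(source):
    """Return (docstring, function source) pairs for documented functions."""
    tree = ast.parse(source)
    pairs = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            doc = ast.get_docstring(node)
            if doc:  # functions without a docstring yield no pair
                # get_source_segment returns the text from "def" to the end
                # of the function body (decorators are not included).
                code = ast.get_source_segment(source, node)
                pairs.append((doc, code))
    return pairs


# Hypothetical input: one documented and one undocumented function.
SAMPLE = '''
def add(a, b):
    """Return the sum of a and b."""
    return a + b


def undocumented(x):
    return x
'''

pairs = extract_pairs(SAMPLE)
for comment, code in pairs:
    print(repr(comment))
    print(code)
```

Only the documented function produces a pair here; the undocumented one is skipped, which mirrors the idea that each example needs both a comment and a code side.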

JuICe

To create JuICe we first collect all publicly available Jupyter notebooks from github.com created before May 2019 and filter for notebooks having NL markdown in English and Python 2/3 as their kernel type. We observe that the presence of NL markdown is correlated with notebook quality and remove any notebooks that have more than three times the number of code cells as the number of NL cells, leaving us with ~659K notebooks.
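
The cell-ratio filter described above can be sketched as follows. This is a minimal illustration, not the JuICe authors' code: it assumes notebooks in the nbformat 4 JSON layout (a top-level `cells` list whose entries carry a `cell_type` of `code` or `markdown`), and the function name `passes_ratio_filter` is hypothetical. The language and kernel checks are not shown, and how notebooks with no markdown cells were treated is not stated in the description; this sketch rejects them.

```python
def passes_ratio_filter(notebook, max_ratio=3):
    """Keep a notebook unless it has more than max_ratio code cells per NL cell."""
    cells = notebook.get("cells", [])
    n_code = sum(1 for c in cells if c.get("cell_type") == "code")
    n_markdown = sum(1 for c in cells if c.get("cell_type") == "markdown")
    if n_markdown == 0:
        # Assumption: a notebook with no NL markdown at all is dropped.
        return False
    return n_code <= max_ratio * n_markdown


def make_notebook(n_code, n_markdown):
    """Build a tiny hypothetical notebook dict for demonstration."""
    cells = [{"cell_type": "markdown", "source": "text"}] * n_markdown
    cells += [{"cell_type": "code", "source": "x = 1"}] * n_code
    return {"cells": cells}


balanced = make_notebook(n_code=3, n_markdown=1)    # exactly 3x: kept
code_heavy = make_notebook(n_code=4, n_markdown=1)  # more than 3x: removed

print(passes_ratio_filter(balanced))
print(passes_ratio_filter(code_heavy))
```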

The Pile

The Pile is an 825 GiB diverse, open source language modelling dataset that consists of 22 smaller, high-quality datasets combined together.

APPS

A benchmark for code generation with 10,000 problems.

Github-Code

The GitHub Code dataset consists of 115M code files from GitHub in 32 programming languages with 60 extensions, totaling 1TB of data. The dataset was created from the public GitHub dataset on Google BigQuery. Two versions are listed: 1) a 1TB dataset of GitHub files in 32 programming languages; 2) a cleaner version of the GitHub-Code dataset.

GitHub-Jupyter

The dataset was extracted from Jupyter Notebooks on BigQuery. Two versions are listed: 1) a 16.3GB dataset of Jupyter Notebooks from the BigQuery GitHub dataset; 2) a dataset of text and code pairs extracted from Jupyter notebooks, which is a parsed version of the github-jupyter dataset.

CodeClippy & GitHub Search

CodeParrot

A dataset of Python files from GitHub. This dataset has ~50GB of code and 5,361,373 files.

The Stack

The Stack contains over 6TB of permissively-licensed source code files covering 358 programming languages.

CoST

132,046 natural-language-code pairs (C++, Java, Python, C#, JS, PHP, C).

XLCoST

The dataset contains around 1 million parallel snippets and 123K parallel programs in total, which is significantly larger than many available parallel code datasets (C++, Java, Python, C#, JS, PHP, C, English).

CrossCodeBench

The large-scale benchmark includes 216 existing code-related tasks. Each task is annotated with the corresponding meta information, such as a task description and instruction, which contains detailed information about the task and a solution guide.

The Vault

The Vault dataset is a comprehensive, large-scale, multilingual parallel dataset that features high-quality code-text pairs derived from The Stack, the largest permissively-licensed source code dataset. The Vault contains code snippets from 10 popular programming languages: Java, JavaScript, Python, Ruby, Rust, Golang, C#, C++, C, and PHP. This dataset provides multiple code-snippet levels, metadata, and 11 docstring styles for enhanced usability and versatility.

Further Reading

Files Generated by the LLMs

CodeGeeX

HumanEval-X