LLM Coding Benchmarks and Datasets

Benchmark/Dataset	Year Created	Institution	Data Time Span	Dataset Size	Data Format	Capability Tested
LiveCodeBench	2024	UC Berkeley, MIT, Cornell	May 2023-ongoing (continuously updated)	Several hundred problems, growing with each release	Programming contest problems with test cases (LeetCode, CodeForces, AtCoder)	Coding ability on problems published after a model's training cutoff, which limits data contamination
SWE-bench	2023	Princeton University	2018-2023 GitHub data	2,294 task instances from 12 Python repos	Real-world GitHub issues with failing tests and repository context	Practical software engineering skills, issue resolution
Aider Polyglot	2024	Paul Gauthier (Aider project)	Static benchmark	225 Exercism exercises	Code editing/generation tasks in C++, Go, Java, JavaScript, Python, and Rust	Multi-language coding and code editing workflows
BIG-Bench Hard (BBH)	2022	Google Research + collaborators	Static benchmark	23 challenging tasks	Multiple-choice and generation tasks	Complex reasoning, logic, mathematics, world knowledge
HumanEval	2021	OpenAI	Static benchmark	164 problems	Python function signatures with docstrings and test cases	Functional correctness of Python code generation
MBPP	2021	Google Research	Static benchmark	974 problems (about 1,000)	Natural language descriptions with Python solutions and tests	Entry-level Python programming from short natural-language descriptions
Common Crawl	2007	Common Crawl Foundation	2007-ongoing (monthly snapshots)	Petabytes (billions of pages per crawl)	Raw web pages (WARC), metadata (WAT), and extracted text (WET)	Training data source (not a benchmark itself)
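
Several rows above (HumanEval, MBPP, LiveCodeBench) score a model on functional correctness: a generated solution counts only if it passes the task's test cases. The sketch below illustrates that idea in a few lines. The task, the two candidate completions, and the `passes_tests` helper are hypothetical and are not taken from any of the listed benchmarks; real harnesses also sandbox execution and enforce timeouts, which this sketch omits.

```python
def passes_tests(prompt: str, completion: str, test_code: str) -> bool:
    """Return True if prompt + completion passes every assertion in test_code."""
    namespace: dict = {}
    try:
        exec(prompt + completion, namespace)  # define the candidate function
        exec(test_code, namespace)            # run the assertions against it
    except Exception:
        # Any syntax error, runtime error, or failed assertion counts as a failure.
        return False
    return True


# Hypothetical task in the "signature + docstring + tests" style.
PROMPT = 'def add(a, b):\n    """Return the sum of a and b."""\n'
TESTS = "assert add(2, 3) == 5\nassert add(-1, 1) == 0\n"

# Two hypothetical model completions: one correct, one buggy.
good = "    return a + b\n"
bad = "    return a - b\n"

results = [passes_tests(PROMPT, c, TESTS) for c in (good, bad)]
pass_rate = sum(results) / len(results)
print(results, pass_rate)
```

Running untrusted model output through `exec` is unsafe outside an isolated environment, so treat this only as an illustration of the scoring rule, not as an evaluation harness.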
