LLM for Coding Benchmarks and Datasets

LiveCodeBench, SWE-bench, Aider Polyglot, BBH, HumanEval, MBPP, Common Crawl (Time Span, Dataset Size, Data Format, LLM Testing Capability)


I’m Anni Huang, an AI researcher-in-training currently at ByteDance, specializing in LLM training operations with a coding focus. I bridge the gap between engineering execution and model performance, ensuring the quality, reliability, and timely delivery of large-scale training projects.

| Benchmark/Dataset | Year Created | Institution | Time Span | Dataset Size | Data Format | LLM Testing Capability |
| --- | --- | --- | --- | --- | --- | --- |
| LiveCodeBench | 2024 | UC Berkeley, MIT, Cornell | May 2023-ongoing (continuously updated) | Several hundred problems | Programming contest problems with test cases (LeetCode, CodeForces, AtCoder) | Coding ability on freshly published problems; limits data contamination |
| SWE-bench | 2023 | Princeton University | 2018-2023 GitHub data | 2,294 task instances from 12 Python repos | Real-world GitHub issues with failing tests and repository context | Practical software engineering skills, issue resolution |
| Aider Polyglot | 2024 | Paul Gauthier (Aider project) | Contemporary | 225 Exercism exercises in C++, Go, Java, JavaScript, Python, and Rust | Code editing/generation tasks across multiple languages | Multi-language coding and code editing workflows |
| BIG-Bench Hard (BBH) | 2022 | Google Research + collaborators | Static benchmark | 23 challenging tasks | Multiple-choice and generation tasks | Complex reasoning, logic, mathematics, world knowledge |
| HumanEval | 2021 | OpenAI | Static benchmark | 164 problems | Python function signatures with docstrings and test cases | Functional correctness of Python code generation |
| MBPP | 2021 | Google Research | Static benchmark | 974 problems (about 1,000) | Natural language descriptions with Python solutions and tests | Entry-level Python programming from natural-language descriptions |
| Common Crawl | 2007 | Common Crawl Foundation | 2007-ongoing (monthly snapshots) | Petabytes (billions of pages per crawl) | Raw web pages (WARC), extracted text (WET), and metadata (WAT) | Training data source (not a benchmark itself) |
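
To make "functional correctness" concrete, here is a minimal sketch of how a HumanEval-style problem might be scored: the model's completion is appended to the prompt (function signature plus docstring), and the result counts as correct only if every assertion in the problem's test code passes. The toy `add` problem, the helper name `passes_tests`, and the sample completions are all hypothetical and are not taken from the real dataset. A real harness would also run untrusted model output in a sandbox with timeouts, which this sketch skips.

```python
from typing import Any, Dict

# Hypothetical toy problem laid out like a HumanEval record:
# a prompt (signature + docstring), test code defining check(candidate),
# and the name of the function under test.
PROMPT = '''def add(a: int, b: int) -> int:
    """Return the sum of a and b."""
'''
TEST = '''def check(candidate):
    assert candidate(1, 2) == 3
    assert candidate(-1, 1) == 0
'''
ENTRY_POINT = "add"

# Two made-up model completions: one correct, one buggy.
GOOD_COMPLETION = "    return a + b\n"
BAD_COMPLETION = "    return a - b\n"


def passes_tests(program: str, entry_point: str, test_source: str) -> bool:
    """Return True only if `program` defines `entry_point` and every
    assertion in `test_source` passes when run against it.

    WARNING: exec() on model output is unsafe outside a sandbox.
    """
    namespace: Dict[str, Any] = {}
    try:
        exec(program, namespace)       # defines the candidate function
        exec(test_source, namespace)   # defines check(candidate)
        namespace["check"](namespace[entry_point])
    except Exception:
        # Syntax errors, missing names, and failed asserts all count as a fail.
        return False
    return True


if __name__ == "__main__":
    print(passes_tests(PROMPT + GOOD_COMPLETION, ENTRY_POINT, TEST))
    print(passes_tests(PROMPT + BAD_COMPLETION, ENTRY_POINT, TEST))
```

Scoring is binary per problem: a completion that looks plausible but fails a single assertion gets no credit, which is what separates functional-correctness benchmarks from text-similarity metrics.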