LLM for Coding Benchmarks and Datasets
LiveCodeBench, SWE-bench, Aider Polyglot, BBH, HumanEval, MBPP, Common Crawl (Time Span, Dataset Size, Data Format, LLM Testing Capability)

I’m Anni Huang, an AI researcher-in-training currently at ByteDance, specializing in LLM training operations with a coding focus. I bridge the gap between engineering execution and model performance, ensuring the quality, reliability, and timely delivery of large-scale training projects.

| Benchmark/Dataset | Year Created | Institution | Time Span | Dataset Size | Data Format | LLM Testing Capability |
| --- | --- | --- | --- | --- | --- | --- |
| LiveCodeBench | 2024 | UC Berkeley, MIT, Cornell | Problems from May 2023 onward (continuously updated) | Several hundred problems at launch, growing with each release | Programming contest problems with test cases (LeetCode, Codeforces, AtCoder) | Coding ability on problems published after a model's training cutoff, which reduces the risk of data contamination |
| SWE-bench | 2023 | Princeton University, University of Chicago | 2018-2023 GitHub data | 2,294 task instances from 12 Python repos | Real-world GitHub issues with failing tests and repository context | Practical software engineering skills, issue resolution |
| Aider Polyglot | 2024 | Paul Gauthier (Aider project) | Static benchmark (released late 2024) | 225 Exercism exercises | Code editing/generation tasks in C++, Go, Java, JavaScript, Python, and Rust | Multi-language coding and code editing workflows |
| BIG-Bench Hard (BBH) | 2022 | Google Research + collaborators | Static benchmark | 23 challenging tasks | Multiple-choice and generation tasks | Complex reasoning, logic, mathematics, world knowledge (general reasoning, not coding-specific) |
| HumanEval | 2021 | OpenAI | Static benchmark | 164 problems | Python function signatures with docstrings and test cases | Functional correctness of Python code generation |
| MBPP | 2021 | Google Research | Static benchmark | 974 problems (roughly 1,000) | Natural language descriptions with Python solutions and tests | Basic, entry-level Python programming |
| Common Crawl | 2007 | Common Crawl Foundation | 2008-ongoing (roughly monthly snapshots) | Petabytes (billions of pages per crawl) | Raw web pages (WARC), metadata (WAT), and extracted text (WET) | Training data source (not a benchmark itself) |
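
To make the "functional correctness" column concrete, here is a minimal sketch of how a HumanEval-style check could work: the model's completion is appended to the prompt, and the result counts as correct only if every test assertion passes. The `problem` record and the `passes_tests` helper are hypothetical and written for illustration; they are modeled loosely on the prompt/test layout described in the table, not taken from any benchmark's official harness.

```python
# Hypothetical HumanEval-style record; field names are illustrative only.
problem = {
    "prompt": 'def add(a, b):\n    """Return the sum of a and b."""\n',
    "entry_point": "add",
    "test": (
        "def check(candidate):\n"
        "    assert candidate(1, 2) == 3\n"
        "    assert candidate(-1, 1) == 0\n"
    ),
}


def passes_tests(problem: dict, completion: str) -> bool:
    """Return True if prompt + completion passes all of the problem's tests."""
    program = problem["prompt"] + completion + "\n" + problem["test"]
    namespace: dict = {}
    try:
        # WARNING: exec runs arbitrary code. A real harness would run this
        # step in a sandbox with a timeout; this sketch does neither.
        exec(program, namespace)
        namespace["check"](namespace[problem["entry_point"]])
    except Exception:
        # Syntax errors, runtime errors, and failed assertions all count as a fail.
        return False
    return True


# Two candidate completions (function bodies) a model might produce.
good = "    return a + b\n"
bad = "    return a - b\n"

print(passes_tests(problem, good))
print(passes_tests(problem, bad))
```

The check is all-or-nothing on purpose: a completion that looks plausible but fails a single assertion scores the same as one that does not parse, which is what separates functional-correctness benchmarks from text-similarity metrics.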



