./ProgramBench
Can language models rebuild programs from scratch?
Given only a compiled binary and its documentation, agents must architect and implement a complete codebase that reproduces the original program's behavior.
Leaderboard
Evaluated with mini-SWE-agent · 200 tasks · See extended results →

| # | Model | Agent | Resolved | Almost resolved |
|---|---|---|---|---|
| 1 | Claude Opus 4.7 (Anthropic) | mini-SWE-agent | 0% | 3.0% |
| 2 | Claude Opus 4.6 (Anthropic) | mini-SWE-agent | 0% | 2.5% |
| 3 | Claude Sonnet 4.6 (Anthropic) | mini-SWE-agent | 0% | 1.0% |
| 4 | GPT 5.4 (OpenAI) | mini-SWE-agent | 0% | 0.0% |
| 5 | Gemini 3.1 Pro (Google) | mini-SWE-agent | 0% | 0.0% |
| 6 | Gemini 3 Flash (Google) | mini-SWE-agent | 0% | 0.0% |
| 7 | Claude Haiku 4.5 (Anthropic) | mini-SWE-agent | 0% | 0.0% |
| 8 | GPT 5.4 mini (OpenAI) | mini-SWE-agent | 0% | 0.0% |
| 9 | GPT 5 mini (OpenAI) | mini-SWE-agent | 0% | 0.0% |

Resolved: the number of fully solved instances as measured by the hidden behavioral tests. Note that behavioral tests can never cover all possible inputs. The behavioral tests of ProgramBench can be easily extended should any false positives arise.
Almost resolved: instances where the agent's solution solves ≥ 95% of all behavioral tests. See extended results.
About ProgramBench
In each task, the agent receives an executable and its documentation and must re-implement that executable. It has no access to the executable's source code, cannot decompile the executable, and cannot use the internet. There are 200 tasks in total, covering a range of program complexities, from small terminal utilities like jq and ripgrep to massive software projects like the PHP interpreter, FFmpeg, and SQLite.
The agent must choose a language, design the architecture, write all source code, and produce a build script. Every design decision is the model's to make.
Once the agent submits a program, our test suite compares the candidate program's behavior against the original program. A candidate program passes only if all tests for that task pass.
Our test suite is generated via agent-driven fuzzing, and it comprises more than 248,000 total behavioral tests for our 200 tasks.
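The comparison described above can be illustrated with a minimal sketch. It assumes that a program's "behavior" is its stdout, stderr, and exit code for a given set of arguments and stdin; which signals the actual ProgramBench test suite compares is not stated here, and the names `Observation`, `observe`, and `task_passes` are hypothetical.

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    """Externally visible result of one program invocation (assumed signals)."""
    stdout: bytes
    stderr: bytes
    exit_code: int


def observe(cmd: list[str], stdin: bytes = b"", timeout: float = 10.0) -> Observation:
    """Run a command and record what it printed and how it exited."""
    proc = subprocess.run(cmd, input=stdin, capture_output=True, timeout=timeout)
    return Observation(proc.stdout, proc.stderr, proc.returncode)


def task_passes(reference: list[str], candidate: list[str],
                cases: list[tuple[list[str], bytes]]) -> bool:
    """A candidate passes only if it matches the reference on every case."""
    return all(
        observe(reference + args, stdin) == observe(candidate + args, stdin)
        for args, stdin in cases
    )


if __name__ == "__main__":
    # Stand-in "programs": two tiny Python one-liners instead of real binaries.
    reference = [sys.executable, "-c",
                 "import sys; sys.stdout.write(sys.stdin.read().upper())"]
    candidate = [sys.executable, "-c",
                 "import sys; print(sys.stdin.read().upper(), end='')"]
    cases = [([], b"hello"), ([], b"")]
    print(task_passes(reference, candidate, cases))
```

Because `all` stops at the first mismatch, a single diverging case is enough to fail the whole task, which mirrors the all-or-nothing pass criterion.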
Can tasks in ProgramBench be fully solved at all?
Yes. The agent can run the given program with any input and observe exactly what it does, so there's nothing hidden that can't be discovered through experimentation. The benchmark is hard, but it's solvable by design: all the reference executables pass our test suites. Read more in our blog post.
Why are ProgramBench scores so low?
Building a program from scratch is a fundamentally challenging task. Agents do currently make partial progress on many tasks (see the extended results for details), but fully passing every test is still out of reach.
Agents truly have to architect. Scores are low in part because, unlike other whole-repo generation projects, we give the agent no hints or structure, so it has to design its own solutions (see "How is ProgramBench different?").
No harness tuning. Other recent and concurrent work also performed substantial harness tuning for a single task or a handful of tasks. We deliberately avoid this, since headline scores from a tuned harness on a curated handful of tasks can substantially overstate how capable agents really are at building software from scratch. Instead, ProgramBench is evaluated with a single generic harness across the entire task set.
Cleanroom implementation. We take substantial precautions to prevent cheating. Agents run in sandboxed containers without internet access, so they cannot retrieve the original source code or obtain any other form of help.
No decompilation. See "Can tasks be solved with decompilation?"
We review related work in section 6 of the paper. We also discuss cheating in the FAQ below and in section 4.1.
Is your agent scaffold sufficient to solve all tasks?
Widely adopted baseline. We use mini-SWE-agent because it is both widely adopted as a baseline by other benchmarks (SWE-bench Verified, SWE-bench Multilingual, Terminal-bench) and deliberately minimal in its scaffolding, reducing confounds between model capability and harness design. Most other agents (like Claude Code, which reportedly comprises several hundred thousand lines of code) also change constantly in non-transparent ways, whereas mini-SWE-agent will allow apples-to-apples comparisons of model performance for the foreseeable future.
Almost no runtime limitations. With very few exceptions, models submit their solutions deliberately rather than exceeding our generous time or step limits, and they never exhaust their context window. Because we do not limit total cost, our runs have cost up to $5k (for Sonnet 4.5).
Varying degrees of difficulty. ProgramBench deliberately includes tasks of varying difficulty, from very short repositories of only a few thousand lines of code to extremely large ones. We therefore believe that the extremely low scores signal inadequate model capabilities rather than indicate that only multi-agent systems can solve our tasks. Nonetheless, we would be excited to be one of the first systematic benchmarks to include tasks that can only be solved by multi-agent systems.
Kicking off a new scaffold race. We believe that mini-SWE-agent is the right choice of baseline and that it can absolutely solve (some of) the tasks. However, we'd be more than excited if ProgramBench kicks off a new scaffold race! We will be opening submissions soon.
Can agents cheat?
Agents run in sandboxed containers with no internet access, execute-only permissions on the binary, and no access to decompilation tools. In early trials without these restrictions, models found shortcuts like cloning source repositories from GitHub or downloading code through package managers. Read more in our blog post and in section 4.1 of the paper.
Why and how do you block decompilation?
The executable given to the agent has execute permission only, not read permission. That means that any operation other than execution (such as running a decompiler, disassembler, objdump, strings, or hexdump) will fail.
We do this because we want ProgramBench to answer the question "How well can LMs build programs from scratch", rather than "How well can LMs patch together bits of decompiled code".
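The execute-only restriction can be sketched on a POSIX system as follows. The exact mode bits used by ProgramBench are not stated here, so `0o111` (`--x--x--x`) is an assumption, and the helper names `make_execute_only` and `is_execute_only` are hypothetical. Note that this only restricts non-root users, and that it works for native binaries but not for interpreted scripts, whose interpreter needs to read the file.

```python
import os
import stat
import tempfile


def make_execute_only(path: str) -> None:
    """Drop read and write bits, keeping execute for user, group, and others."""
    os.chmod(path, stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)  # mode 0o111


def is_execute_only(path: str) -> bool:
    """True if the permission bits are exactly --x--x--x."""
    return stat.S_IMODE(os.stat(path).st_mode) == 0o111


if __name__ == "__main__":
    # Placeholder file standing in for a compiled binary.
    fd, path = tempfile.mkstemp()
    os.write(fd, b"placeholder bytes")
    os.close(fd)

    make_execute_only(path)
    print("execute-only:", is_execute_only(path))
    try:
        # Tools like strings or objdump would need to do this and fail.
        open(path, "rb").close()
        print("read succeeded (probably running as root)")
    except PermissionError:
        print("read denied")
    os.remove(path)
```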
How is the leaderboard sorted? What's the primary metric?
The primary metric that should be reported for ProgramBench is the fraction of fully resolved instances. We currently report "almost resolved" (at least 95% of test cases pass) as an additional point of reference while scores on the primary metric are low. The leaderboard is sorted by fully resolved first, almost resolved second, and finally the average test pass rate.
For a detailed understanding of model performance, we recommend the plot at the detailed leaderboard. See also: "Have you considered other metrics?"
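The sort order can be written as a three-part key. This is a minimal sketch: the `Entry` type and `rank` function are hypothetical, and the entries in the usage example are made-up placeholders rather than real leaderboard data.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    model: str
    resolved: float         # fraction of tasks with all tests passing
    almost_resolved: float  # fraction of tasks with >= 95% of tests passing
    avg_pass_rate: float    # mean per-task test pass rate


def rank(entries: list[Entry]) -> list[Entry]:
    """Sort by resolved, then almost resolved, then average pass rate (descending)."""
    return sorted(
        entries,
        key=lambda e: (e.resolved, e.almost_resolved, e.avg_pass_rate),
        reverse=True,
    )


if __name__ == "__main__":
    # Hypothetical entries, not actual results.
    entries = [
        Entry("model-a", 0.0, 0.00, 0.40),
        Entry("model-b", 0.0, 0.03, 0.30),
        Entry("model-c", 0.0, 0.00, 0.55),
    ]
    for position, entry in enumerate(rank(entries), start=1):
        print(position, entry.model)
```

With every model at 0% resolved, the almost-resolved column decides the order, and the average pass rate only breaks the remaining ties.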
How do I submit to the leaderboard?
Why do you not allow internet?
We have extensively studied different inference settings, including allowing internet access. We find that allowing internet access leads to an abundance of cheating, which requires an LM-as-a-judge to flag and disqualify solutions. This makes the benchmark less reliable, especially because defining exactly what cheating means in the context of obtaining source code online is not as clear-cut as it might seem.
However, except for instances that contained cheating, we did not observe a dramatic improvement of scores when allowing internet.
Find our ablations in section 4.1 of the paper and John's explanation.
Have you considered other metrics? E.g., average number of tests passed?
Yes, we've decided on our current resolved metric after a lot of thought. Our initial question was "Can LMs build programs from scratch?", and the most relevant metric is the fraction of programs that can be fully built. Reporting an average test pass rate would be extremely misleading, because every instance includes very simple tests (such as checking for the existence of flags, checking what happens if you call the executable with --help etc.).
We've also thought about using a more relaxed metric like "almost resolved". However, relaxing to ≥95% or even 99% of tests solved is also problematic. First, for some of our tasks, we have almost 15k tests; even 1% of that is still close to 150 tests. Second, even a single failed test can indicate severe issues with a program. Therefore "almost resolved" only serves as additional orientation until the main "resolved" metric has enough signal to differentiate all models.
However, all auxiliary metrics are still useful for diagnosing and improving models and scaffolds! They're just not the right metric as a benchmark. Check the extended results for more information. See also: "How is the leaderboard sorted?"
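To make the difference between these metrics concrete, here is a minimal sketch that computes all three from per-task test counts. The `summarize` function and the task names and counts are hypothetical illustrations, not ProgramBench data.

```python
def summarize(results: dict[str, tuple[int, int]],
              threshold_pct: int = 95) -> dict[str, float]:
    """Compute the three metrics; results maps task -> (tests passed, tests total)."""
    n = len(results)
    resolved = sum(passed == total for passed, total in results.values())
    # Integer arithmetic avoids floating-point edge cases at the threshold.
    almost = sum(passed * 100 >= threshold_pct * total
                 for passed, total in results.values())
    avg = sum(passed / total for passed, total in results.values()) / n
    return {
        "resolved": resolved / n,
        "almost_resolved": almost / n,
        "avg_pass_rate": avg,
    }


if __name__ == "__main__":
    # Hypothetical per-task results.
    results = {
        "task-a": (14850, 15000),  # 99% pass rate, yet 150 tests still fail
        "task-b": (300, 1000),
        "task-c": (900, 1000),
    }
    for name, value in summarize(results).items():
        print(f"{name}: {value:.2f}")
```

In this made-up example the average pass rate is 73% even though no task is fully resolved, and the one "almost resolved" task still fails 150 tests.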
🌸 A command-line fuzzy finder
simple terminal UI for git commands
ripgrep recursively searches directories for a regex pattern while respecting your gitignore
Mirror of https://git.ffmpeg.org/ffmpeg.git
A cat(1) clone with wings.
A markup-based typesetting system that is powerful and easy to learn.
Universal markup converter
A simple, fast and user-friendly alternative to 'find'
The PHP Interpreter
DuckDB is an analytical in-process SQL database management system
A smarter cd command. Supports all major shells.
Command-line JSON processor
A syntax-highlighting pager for git, diff, grep, rg --json, and blame output
A command-line benchmarking tool
A code-searching tool similar to ack, but faster.
Zstandard - Fast real-time compression algorithm
Library for fast text representation and classification.
TCP port scanner, spews SYN packets asynchronously, scanning entire Internet in under 5 minutes.
An incremental parsing system for programming tools
A simple, modern and secure encryption tool (and Go library) with small explicit keys, no config options, and UNIX-style composability.
Create book from markdown files. Like Gitbook but implemented in Rust
n³ The unorthodox terminal file manager
Terminal JSON viewer & processor
yq is a portable command-line YAML, JSON, XML, CSV, TOML, HCL and properties processor
⬛️ CLI tool and library for saving complete web pages as a single HTML file
unclutter your .profile
Brotli compression format
Make JSON greppable!
Count your code, quickly.
⚡A CLI tool for code structural search, lint and rewriting. Written in Rust
cheat allows you to create and view interactive cheatsheets on the command-line. It was designed to help remind *nix system administrators of options for commands that they use frequently, but not frequently enough to remember.
Text-mode interface for git
a small build system with a focus on speed
A new way to see and navigate directory trees : https://dystroy.org/broot
Ping, but with a graph
🌀 A nonsense activity generator
Extremely Fast Compression algorithm
Command-line Git information tool
A more intuitive version of du in rust
🕳 bore is a simple CLI tool for making tunnels to localhost
A fast CSV command line toolkit written in Rust.
Public repository of the QuickJS Javascript Engine.
Ohayou(おはよう), HTTP load generator, inspired by rakyll/hey with tui animation.
Log file navigator
A command-line hex viewer
A copy of the Lua development repository, as seen by the Lua team. Mirrored irregularly. All communication should be through the Lua mailing list https://www.lua.org/lua-l.html
Miller is like awk, sed, cut, join, and sort for name-indexed data such as CSV, TSV, and tabular JSON
Official Git mirror of the SQLite source tree
Sloc, Cloc and Code: scc is a very fast accurate code counter with complexity calculations and COCOMO estimates written in pure Go
Declarative schema migrations with schema-as-code workflows
A command-line tool and Rust library with Python bindings for generating regular expressions from user-provided test cases
htop - an interactive process viewer
Simplistic interactive filtering tool
🌀 A log file highlighter
Friendly and fast tool for sending HTTP requests
🌟 For when you really just want to serve some files over HTTP right now!
Like jq, but for HTML.
An extremely fast CSS parser, transformer, bundler, and minifier written in Rust.
A maintained ctags implementation
Intuitive find & replace CLI (sed alternative)
A command-line DNS client.
static analysis of C/C++ code
Official doxygen git repository
A command-line tool to generate, analyze, convert and manipulate colors
the official Rust and C implementations of the BLAKE3 cryptographic hash function
🌠 Manage your shell commands.
GDAL is an open source MIT licensed translator library for raster and vector geospatial data formats.
View disk space usage and delete unwanted data, fast.
Fast disk usage analyzer with console interface written in Go
Run arbitrary commands when files change
Mirror of the LuaJIT git repository
🔥 ~6x faster, stricter, configurable, extensible, and beautiful drop-in replacement for golint
Automatically generate Go test boilerplate from your source code.
Visualize Ownership and Lifetimes in Rust
Terminal based "The Matrix" like implementation
Async-friendly QUIC implementation in Rust
A general purpose syntax highlighter in pure Go
The corrective bash syntax highlighter
Melody is a language that compiles to regular expressions and aims to be more readable and maintainable
A hackable, minimal, fast TUI file explorer
📺🗿 Terminal graphics for the 21st century.
Find files with SQL-like queries
Convert your ascii diagram scribbles into happy little SVG
Commandline tool for running SQL queries against JSON, CSV, Excel, Parquet, and more.
Slice and dice logs on the command line
The power of curl, the ease of use of httpie.
Terminal file manager
⚙️ Convert HTML to Markdown. Even works with entire websites and can be extended through rules.
A cross-platform command-line tool to convert images into ascii art and print them on the console. Now supports braille art!
A flexible commandline tool for template rendering. Supports lots of local and remote datasources.
7-Zip
A parallel implementation of gzip for modern multi-processor, multi-core machines.
Unofficial mirror of mob development branch
Fast website link checker in Go
CLI for managing secrets
Go implement CLI, cURL-like tool for humans
Plain text note-taking assistant
errcheck checks that you checked errors.
Dropbear SSH
CLI tool that can execute SQL queries on CSV, LTSV, JSON, YAML and TBLN. Can output to various formats.
🐧ping command but with pingu
The most opinionated Go source code linter for code audit.
PROJ - Cartographic Projections and Coordinate Transformations Library
🎑Feature-rich terminal-based text viewer. It is a so-called terminal pager.
Tools (written in C using htslib) for manipulating next-generation sequencing data
Tool for helping developers keep their code bases clean and decoupled. It allows visualising a code base complexity using a 3d force-directed graph of files and the dependencies between them.
Claudio's FIGlet tree
Toolkit for processing sequences in FASTA/Q formats
XZ Utils
Declarative pure-SQL schema management for MySQL and MariaDB
CLI tool for summarizing go test output. Pipe friendly. CI/CD friendly.
A text-based calendar and scheduling application
WSDL2Go code generation as well as its SOAP proxy
Your dev tool to manage /etc/hosts like a pro!
iTerm2 expvar/JSON monitoring tool
Git powered terminal-based todo/note manager -- markdown note page per task. Single binary!
A Bash CLI framework, also a Bash command runner.
Command-line XML and HTML beautifier and content extractor
Clock using lib ncurses
A CLI code-typing game that turns your source code into typing challenges
A plain text-based spaced repetition system.
Fast Markdown linter and formatter written in Rust
CLI - Convert between YAML, TOML, JSON, and HCL. Preserves map order.
bedtools - the swiss army knife for genome arithmetic
Converts jpg images to ASCII
A modern alternative to the watch command, records the differences in execution results and can check this differences at after.
Draw images in your ANSI terminal with true color
A nginx log explorer
ditaa is a small command-line utility that can convert diagrams drawn using ascii art ('drawings' that contain characters that resemble lines like | / - ), into proper bitmap graphics.
ELF visualizer. Generates HTML files from ELF binaries.
A command-line shell like fish, but POSIX compatible.
A code search / replace tool
pls is a prettier and powerful ls(1) for the pros.
Universal multi-language runner and smart REPL written in Rust.
SoX, Swiss Army knife of sound processing
Generate beautiful changelogs from your Git commit history
An extended `cp`
a calculator REPL, similar to bc(1)
Command line tool to show clear git graphs arranged for your branching model
Public/backup repository of the GROMACS molecular simulation toolkit. Please do not mine the metadata blindly; we use https://gitlab.com/gromacs/gromacs for code review and issue tracking.
A command-line tool to prevent committing secret keys into your source code
Quickly Extracts IP's, Email Addresses, Hashes, Files, Credit Cards, Social Security Numbers and a lot More From Text
A grep-like tool which understands source code syntax and allows for manipulation in addition to search
tui file manager with vim-like key mapping
lints and suggestions for the nix programming language
a tool to analyze file system usage written in Rust
💡 CLI tool to input and store your ideas without leaving the terminal
Enrich `go test` outputs with text decorations.
Markdown makes sites - A Static Site Generator for Blogs
Generate Rust register maps (`struct`s) from SVD files
Interactive Grep
A simple timetracker for the command line. It saves a log of all tracked activities as a plaintext file and allows you to create flexible reports.
A simple terminal UI for search and replace, ala VS Code.
When cut doesn't cut it
A 3D software rasterizer... for the terminal!
Converts books written in Markdown to HTML, LaTeX/PDF and EPUB
An extremely fast LaTeX formatter written in Rust
A high-performance JSON Schema validator for Rust
A small terminal UTF-8 text editor written in Rust 📝🦀
Scan Nix files for dead code
A sharp cut(1) clone.
Git genealogy, untangled. A TUI for navigating commit graphs with color and clarity.
Your journal app if you live in a terminal
Right imports sorting & code formatting tool (goimports alternative)
Batch rename utility for developers
📠 Pretty and fast csv viewer for cli with cjk/emoji support.
A better xdg-utils
UNIX's missing `loop` command
Highly parallelized, blazing fast directory tree analyzer
Hush is a unix shell based on the Lua programming language
Dude, where are my bytes: Duc, a library and suite of tools for inspecting disk usage
🎨 Make your i3 config a little more stylish.
A TUI tool to help you type faster and learn new layouts. Includes a free cat.
Find outdated dependencies of your Go projects. go-mod-outdated provides a table view of the go list -u -m -json all command which lists all dependencies of a Go project and their available minor and patch updates. It also provides a way to filter indirect dependencies and dependencies without updates.
🛰 A high performance code minimap render.
Peek inside Parquet files right from your terminal
Stacked Git
Fast directory scanning and scraping tool
Flamegraph viewer in the terminal
Yet another diff highlighting tool
⚡Rapid note management for the terminal.
A (TUI/CLI) markdown navigator with tree-based structural navigation.
A CLI to organize and run short Unix shell scripts
✨ sleek typing tui with visualized results and historical logging
A command-line tool to batch rename files and directories
🔮 Futuristic take on hexdump, made in Rust.
Small command-line JSON Log viewer
🦀️📸 Pure Rust tool to generate beautiful code snapshots, provide CLI and Library
Automatically trims your branches whose tracking remote refs are merged or stray
🎁 generate beautiful landing pages for your developer tools
A tool to interactively write shell pipelines.
Blazingly fast, modular and contributor friendly Solidity compiler, written in Rust
Caesium Command Line Tools - Lossy/lossless image compression tool
Find the password of protected ZIP files.
Encode and decode smart contract invocations
A JSON terminal UI made in C++
A Go linter to check that errors from external packages are wrapped
A small TUI journaling tool. 📖
a tool for code clone detection
@twosigma's first artificial intelligence programming challenge
Citation
@misc{yang2026programbenchlanguagemodelsrebuild,
  title={ProgramBench: Can Language Models Rebuild Programs From Scratch?},
  author={John Yang and Kilian Lieret and Jeffrey Ma and Parth Thakkar and Dmitrii Pedchenko and Sten Sootla and Emily McMilin and Pengcheng Yin and Rui Hou and Gabriel Synnaeve and Diyi Yang and Ofir Press},
  year={2026},
  eprint={2605.03546},
  archivePrefix={arXiv},
  primaryClass={cs.SE},
  url={https://arxiv.org/abs/2605.03546},
}