./ProgramBench

Can language models rebuild programs from scratch?

Given only a compiled binary and its documentation, agents must architect and implement a complete codebase that reproduces the original program's behavior.

John Yang*, Kilian Lieret*, Jeffrey Ma, Parth Thakkar, Dmitrii Pedchenko, Sten Sootla, Emily McMilin, Pengcheng Yin, Rui Hou, Gabriel Synnaeve, Diyi Yang, Ofir Press
Meta Superintelligence Labs • Stanford University • Harvard University
Evaluated with mini-SWE-agent · 200 tasks · See extended results →
#  Model              Org        Agent           Resolved  Almost resolved
1  Claude Opus 4.7    Anthropic  mini-SWE-agent  0%        3.0%
2  Claude Opus 4.6    Anthropic  mini-SWE-agent  0%        2.5%
3  Claude Sonnet 4.6  Anthropic  mini-SWE-agent  0%        1.0%
4  GPT 5.4            OpenAI     mini-SWE-agent  0%        0.0%
5  Gemini 3.1 Pro     Google     mini-SWE-agent  0%        0.0%
6  Gemini 3 Flash     Google     mini-SWE-agent  0%        0.0%
7  Claude Haiku 4.5   Anthropic  mini-SWE-agent  0%        0.0%
8  GPT 5.4 mini       OpenAI     mini-SWE-agent  0%        0.0%
9  GPT 5 mini         OpenAI     mini-SWE-agent  0%        0.0%

Resolved: the number of fully solved instances as measured by the hidden behavioral tests. Note that behavioral tests can never cover all possible inputs; the behavioral tests of ProgramBench can be easily extended should any false positives arise.
Almost resolved: instances where the agent's solution passes ≥ 95% of all behavioral tests. See extended results.

About ProgramBench

In each task, the agent receives an executable and its documentation, and it must re-implement the given executable. It does not get access to any of the executable's source code, it cannot decompile the executable, and it cannot use the internet. There are 200 tasks in total covering a range of program complexities, from small terminal utilities like jq and ripgrep to massive software projects like the PHP interpreter, FFmpeg, and SQLite.

The agent must choose a language, design the architecture, write all source code, and produce a build script. Every design decision is the model's to make.

Once the agent submits a program, our test suite compares the candidate program's behavior against the original program. A candidate program passes only if all tests for that task pass.

Our test suite is generated via agent-driven fuzzing, and it comprises more than 248,000 total behavioral tests for our 200 tasks.
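As a rough illustration (a sketch, not the actual harness), a single behavioral test can be thought of as running both binaries on the same input and comparing everything observable:

```python
# Hypothetical sketch of one behavioral test; the real harness is more involved.
import subprocess

def behaviors_match(ref_bin, candidate_bin, argv, stdin=b""):
    """True if both programs behave identically on this particular input."""
    ref = subprocess.run([ref_bin, *argv], input=stdin, capture_output=True)
    cand = subprocess.run([candidate_bin, *argv], input=stdin, capture_output=True)
    # Compare exit code, stdout, and stderr byte-for-byte.
    return (ref.returncode, ref.stdout, ref.stderr) == (
        cand.returncode, cand.stdout, cand.stderr)
```

Fuzzing then amounts to generating many such inputs per task and keeping the ones that exercise interesting behavior.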

Can tasks in ProgramBench be fully solved at all?

Yes. The agent can run the given program with any input and observe exactly what it does, so there's nothing hidden that can't be discovered through experimentation. The benchmark is hard, but it's solvable by design: all the reference executables pass our test suites. Read more in our blog post.

Why are ProgramBench scores so low?

Building a program from scratch is a fundamentally challenging task. Agents do currently make partial progress on many tasks (see the extended results for details), but fully passing every test is still out of reach.

Agents truly have to architect. Unlike other whole-repo generation projects, we give no hints or structure to the agent, so it must architect its own solution from scratch (see "How is ProgramBench different?").

No harness tuning. Other recent and concurrent work also performed substantial harness tuning for a single task or a handful of tasks. We deliberately avoid this, since headline scores from a tuned harness on a curated handful of tasks can substantially overstate how capable agents really are at building software from scratch. Instead, ProgramBench is evaluated with a single generic harness across the entire task set.

Cleanroom implementation. We take substantial precautions to prevent cheating. Agents run in sandboxed containers without internet access, so they cannot retrieve the original source code or obtain any other form of help.

No decompilation. See "Can tasks be solved with decompilation?"

We review related work in section 6 of the paper. We also discuss cheating in the FAQ below and in section 4.1.

Is your agent scaffold sufficient to solve all tasks?

Widely adopted baseline. We use mini-SWE-agent because it is both widely adopted as a baseline by other benchmarks (SWE-bench Verified, SWE-bench Multilingual, Terminal-bench) and deliberately minimal in its scaffolding, reducing confounds between model capability and harness design. Most other agents (like Claude Code, with apparently several hundred thousand lines of code) are also constantly changing in non-transparent ways, whereas mini-SWE-agent will allow for apples-to-apples performance comparisons of models for the foreseeable future.

Almost no runtime limitations. With very few exceptions, models submit their solutions deliberately rather than exceeding our generous time or step limits, and they never exhaust their context window. Because we do not limit total cost, our runs have cost up to $5k (for Sonnet 4.5).

Varying degrees of difficulty. ProgramBench deliberately includes tasks of varying difficulty, from very short repositories of only a few thousand lines of code to extremely large ones. We believe that the extremely low scores are therefore more a signal of inadequate model capabilities than an indicator that only multi-agent systems can solve our tasks. Nonetheless, we would be excited to be one of the first systematic benchmarks that includes tasks that can only be solved by multi-agent systems.

Kicking off a new scaffold race. We believe that mini-SWE-agent is the right choice of baseline and that it can absolutely solve (some of) the tasks. However, we'd be more than excited if ProgramBench kicks off a new scaffold race! We will be opening submissions soon.

Can agents cheat?

Agents run in sandboxed containers with no internet access, execute-only permissions on the binary, and no access to decompilation tools. In early trials without these restrictions, models found shortcuts like cloning source repositories from GitHub or downloading code through package managers. Read more in our blog post and in section 4.1 of the paper.

Why and how do you block decompilation?

The executable that is given to the agent carries execute permission only, not read permission. That means that any operation other than executing it (such as running a decompiler, disassembler, objdump, strings, or hexdump) will fail.
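The mechanism can be sketched as follows (a toy setup with a hypothetical file name, using a copy of /bin/echo as a stand-in for the task binary):

```python
# Sketch: execute-only permissions on a stand-in binary (hypothetical setup).
import os, shutil, subprocess

shutil.copy("/bin/echo", "target_bin")
os.chmod("target_bin", 0o111)  # --x--x--x: execute allowed, read denied

# Executing requires only the execute bit, so this works:
out = subprocess.run(["./target_bin", "hello"], capture_output=True)

# Reading (and hence decompiling, disassembling, or hexdumping) requires the
# read bit; for a non-root user, open("target_bin", "rb") raises PermissionError.
```

The agent can therefore run the binary as often as it likes but cannot inspect its bytes.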

We do this because we want ProgramBench to answer the question "How well can LMs build programs from scratch?", rather than "How well can LMs patch together bits of decompiled code?".

How is the leaderboard sorted? What's the primary metric?

The primary metric that should be reported for ProgramBench is the number of fully resolved instances. We currently report "almost resolved" (more than 95% of test cases pass) as an additional point of reference while the scores of our primary metric are low. The leaderboard is sorted by fully resolved first, almost resolved second, and finally the average test pass rate.
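The sort order described above can be sketched as a lexicographic key (illustrative field names and made-up numbers, not our actual data model):

```python
# Illustrative leaderboard rows with made-up scores.
rows = [
    {"model": "B", "resolved": 0, "almost": 2.5, "avg_pass": 40.0},
    {"model": "A", "resolved": 0, "almost": 3.0, "avg_pass": 38.0},
    {"model": "C", "resolved": 0, "almost": 3.0, "avg_pass": 41.0},
]
# Sort by fully resolved first, then almost resolved, then average pass rate.
rows.sort(key=lambda r: (r["resolved"], r["almost"], r["avg_pass"]), reverse=True)
# Order: C before A (avg pass rate breaks the "almost" tie), then B.
```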

For a detailed understanding of model performance, we recommend the plot at the detailed leaderboard. See also: "Have you considered other metrics?"

How do I submit to the leaderboard?

A public submission portal is coming soon. Follow the authors (John, Kilian) for updates.

Why do you not allow internet?

We have extensively studied different inference settings, including allowing internet access. We find that allowing internet leads to an abundance of cheating that requires an LM judge to flag and disqualify solutions. This makes the benchmark less reliable, especially because defining exactly what counts as cheating when source code is obtained online is not as clear-cut as it might seem.

However, except for instances that contained cheating, we did not observe a dramatic improvement in scores when allowing internet access.

Find our ablations in section 4.1 of the paper and John's explanation.

Have you considered other metrics? E.g., average number of tests passed?

Yes, we've decided on our current resolved metric after a lot of thought. Our initial question was "Can LMs build programs from scratch?", and the most relevant metric is the fraction of programs that can be fully built. Reporting an average test pass rate would be extremely misleading, because every instance includes very simple tests (such as checking for the existence of flags, or checking what happens when you call the executable with --help).

We've also thought about using a more relaxed metric like "almost resolved". However, relaxing to ≥95% or even ≥99% of tests passed is also problematic. First, some of our tasks have almost 15k tests; even 1% of that is still around 150 tests. And even a single failed test can indicate severe issues with a program. Therefore "almost resolved" only serves as additional orientation, until the main "resolved" metric has enough signal to differentiate all models.

However, all auxiliary metrics are still useful for diagnosing and improving models and scaffolds! They're just not the right headline metric for a benchmark. Check the extended results for more information. See also: "How is the leaderboard sorted?"
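A toy calculation (made-up numbers) shows how the three metrics can diverge on the same results:

```python
# Made-up per-task results: (tests_passed, total_tests) for four tasks.
tasks = [(95, 100), (90, 100), (98, 100), (100, 100)]

# Average pass rate looks impressive even though only one task is solved.
avg_pass_rate = sum(p / t for p, t in tasks) / len(tasks)         # 0.9575
resolved      = sum(p == t for p, t in tasks) / len(tasks)        # 0.25
almost        = sum(p / t >= 0.95 for p, t in tasks) / len(tasks) # 0.75
```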

Citation

@misc{yang2026programbenchlanguagemodelsrebuild,
      title={ProgramBench: Can Language Models Rebuild Programs From Scratch?},
      author={John Yang and Kilian Lieret and Jeffrey Ma and Parth Thakkar and Dmitrii Pedchenko and Sten Sootla and Emily McMilin and Pengcheng Yin and Rui Hou and Gabriel Synnaeve and Diyi Yang and Ofir Press},
      year={2026},
      eprint={2605.03546},
      archivePrefix={arXiv},
      primaryClass={cs.SE},
      url={https://arxiv.org/abs/2605.03546},
}