We release ProgramBench, a benchmark that evaluates whether language models can rebuild black-box software systems from scratch.
The initial release includes 200 task instances sourced from open-source repositories, along with an evaluation of 9 frontier models. No model fully resolves even a single task, which highlights how difficult end-to-end software development remains for today's AI systems.