Evaluated with mini-SWE-agent · 200 tasks
| # | Model | Organization | Agent | Resolved | Almost resolved | Cost | Calls |
|---|-------|--------------|-------|----------|-----------------|------|-------|
| 1 | Claude Opus 4.7 | Anthropic | mini-SWE-agent | 0% | 3.0% | $3.81 | 93 |
| 2 | Claude Opus 4.6 | Anthropic | mini-SWE-agent | 0% | 2.5% | $11.38 | 260 |
| 3 | Claude Sonnet 4.6 | Anthropic | mini-SWE-agent | 0% | 1.0% | $26.73 | 472 |
| 4 | GPT 5.4 | OpenAI | mini-SWE-agent | 0% | 0.0% | $0.33 | 16 |
| 5 | Gemini 3.1 Pro | Google | mini-SWE-agent | 0% | 0.0% | $1.51 | 94 |
| 6 | Gemini 3 Flash | Google | mini-SWE-agent | 0% | 0.0% | $0.30 | 85 |
| 7 | Claude Haiku 4.5 | Anthropic | mini-SWE-agent | 0% | 0.0% | $0.80 | 124 |
| 8 | GPT 5.4 mini | OpenAI | mini-SWE-agent | 0% | 0.0% | $0.04 | 18 |
| 9 | GPT 5 mini | OpenAI | mini-SWE-agent | 0% | 0.0% | $0.03 | 15 |

Column definitions:

- **Resolved**: share of instances fully solved, as measured by the hidden behavioral tests. Behavioral tests can never cover all possible inputs; ProgramBench's behavioral tests can easily be extended should any false positives arise.
- **Almost resolved**: share of instances where the agent's solution passes ≥ 95% of all behavioral tests.
- **Cost**: average API cost in USD per task instance.
- **Calls**: average number of LLM calls per task instance.
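The column definitions above reduce to simple per-task aggregation. The sketch below (Python, not the official ProgramBench harness; the `TaskResult` fields are assumptions) shows how Resolved, Almost resolved, Cost, and Calls could be computed from per-instance results.

```python
from dataclasses import dataclass

@dataclass
class TaskResult:
    """Per-task outcome; all field names here are hypothetical."""
    task_id: str
    tests_passed: int   # hidden behavioral tests passed by the agent's solution
    tests_total: int    # hidden behavioral tests defined for this task
    cost_usd: float     # API cost spent on this task instance
    llm_calls: int      # number of LLM calls made on this task instance

def summarize(results: list[TaskResult]) -> dict[str, float]:
    """Aggregate leaderboard-style metrics over all task instances."""
    n = len(results)
    resolved = sum(r.tests_passed == r.tests_total for r in results)
    # "Almost resolved": the solution passes >= 95% of the behavioral tests.
    almost = sum(r.tests_passed / r.tests_total >= 0.95 for r in results)
    return {
        "resolved_pct": 100 * resolved / n,
        "almost_resolved_pct": 100 * almost / n,
        "avg_cost_usd": sum(r.cost_usd for r in results) / n,
        "avg_llm_calls": sum(r.llm_calls for r in results) / n,
    }
```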


[Figure: per-task results heatmap, all models × 200 tasks, color scale 0% to 100%]