Evaluated with mini-SWE-agent · 200 tasks · Updated May 11, 2026
| # | Model | Agent | Resolved | Almost | Cost | Calls |
|---|---|---|---|---|---|---|
| 1 | GPT 5.5 (xhigh) | mini-SWE-agent | 0.5% | 13.5% | $8.85 | 82 |
| 2 | GPT 5.5 (high) | mini-SWE-agent | 0.5% | 5.0% | $3.65 | 41 |
| 3 | Claude Opus 4.7 (xhigh) | mini-SWE-agent | 0% | 4.5% | $10.96 | 159 |
| 4 | Claude Opus 4.7 | mini-SWE-agent | 0% | 3.0% | $3.81 | 93 |
| 5 | Claude Opus 4.6 | mini-SWE-agent | 0% | 2.5% | $11.38 | 260 |
| 6 | GPT 5.5 | mini-SWE-agent | 0% | 1.5% | $1.21 | 19 |
| 7 | Claude Sonnet 4.6 | mini-SWE-agent | 0% | 1.0% | $26.73 | 472 |
| 8 | GPT 5.4 | mini-SWE-agent | 0% | 0.0% | $0.33 | 16 |
| 9 | Gemini 3.1 Pro | mini-SWE-agent | 0% | 0.0% | $1.51 | 94 |
| 10 | Gemini 3 Flash | mini-SWE-agent | 0% | 0.0% | $0.30 | 85 |
| 11 | Claude Haiku 4.5 | mini-SWE-agent | 0% | 0.0% | $0.80 | 124 |
| 12 | GPT 5.4 mini | mini-SWE-agent | 0% | 0.0% | $0.04 | 18 |
| 13 | GPT 5 mini | mini-SWE-agent | 0% | 0.0% | $0.03 | 15 |

Resolved: The number of fully solved instances as measured by the hidden behavioral tests. Note that behavioral tests can never cover all possible inputs. The behavioral tests of ProgramBench can be easily extended should any false positives arise.
Almost: Instances where the agent's solution solves ≥ 95% of all behavioral tests.
Cost: Average API cost in USD per task instance.
Calls: Average number of LLM calls per task instance.

Sorting: Resolved → Almost resolved → Avg. pass rate
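The column definitions and the sort order can be sketched in a few lines of Python. This is an illustrative sketch only, not the leaderboard's actual code: the names `ModelSummary`, `summarize`, and `rank`, and the sample pass rates, are hypothetical. It assumes each task yields a pass rate between 0 and 1 (the share of behavioral tests passed) and that "Almost" counts every instance at or above 95%, fully resolved ones included. With 200 tasks, 0.5% corresponds to 1 instance and 13.5% to 27 instances.

```python
from dataclasses import dataclass


@dataclass
class ModelSummary:
    name: str
    resolved: float  # share of tasks where every behavioral test passes
    almost: float    # share of tasks with a pass rate >= the threshold
    avg_pass: float  # mean per-task pass rate


def summarize(name, pass_rates, almost_threshold=0.95):
    """Aggregate per-task pass rates (0.0 to 1.0) into leaderboard columns.

    Assumption: 'almost' includes fully resolved tasks, since 100% >= 95%.
    """
    n = len(pass_rates)
    resolved = sum(1 for r in pass_rates if r == 1.0) / n
    almost = sum(1 for r in pass_rates if r >= almost_threshold) / n
    avg_pass = sum(pass_rates) / n
    return ModelSummary(name, resolved, almost, avg_pass)


def rank(summaries):
    """Sort by Resolved, then Almost, then average pass rate, best first."""
    return sorted(
        summaries,
        key=lambda s: (s.resolved, s.almost, s.avg_pass),
        reverse=True,
    )


if __name__ == "__main__":
    # Hypothetical pass rates for three made-up models on four tasks.
    rows = [
        summarize("model-c", [0.99, 0.2, 0.2, 0.2]),
        summarize("model-a", [1.0, 0.96, 0.5, 0.0]),
        summarize("model-b", [0.97, 0.96, 0.5, 0.0]),
    ]
    for position, row in enumerate(rank(rows), start=1):
        print(position, row.name, f"{row.resolved:.1%}", f"{row.almost:.1%}")
```

Sorting on a tuple key means later columns only break ties in earlier ones, which matches the stated Resolved → Almost resolved → Avg. pass rate order.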

[Figure: All models × 200 tasks; scale from 0% to 100%.]

[Figure: Each dot is one task instance.]