Extended Results
Evaluated with mini-SWE-agent · 200 tasks · Updated May 11, 2026

Resolved (Res.): Share of task instances fully solved, as measured by the hidden behavioral tests. Behavioral tests can never cover all possible inputs, but the behavioral tests of ProgramBench can easily be extended should any false positives arise.

Almost: Share of task instances where the agent's solution passes ≥ 95% of all behavioral tests.

Cost: Average API cost in USD per task instance.

Calls: Average number of LLM calls per task instance.

| # | Model | Agent | Resolved | Almost | Cost | Calls | |
|---|---|---|---|---|---|---|---|
| 1 | GPT 5.5 (xhigh) | mini-SWE-agent | 0.5% | 13.5% | $8.85 | 82 | |
| 2 | GPT 5.5 (high) | mini-SWE-agent | 0.5% | 5.0% | $3.65 | 41 | |
| 3 | Claude Opus 4.7 (xhigh) | mini-SWE-agent | 0% | 4.5% | $10.96 | 159 | |
| 4 | Claude Opus 4.7 | mini-SWE-agent | 0% | 3.0% | $3.81 | 93 | |
| 5 | Claude Opus 4.6 | mini-SWE-agent | 0% | 2.5% | $11.38 | 260 | |
| 6 | GPT 5.5 | mini-SWE-agent | 0% | 1.5% | $1.21 | 19 | |
| 7 | Claude Sonnet 4.6 | mini-SWE-agent | 0% | 1.0% | $26.73 | 472 | |
| 8 | GPT 5.4 | mini-SWE-agent | 0% | 0.0% | $0.33 | 16 | |
| 9 | Gemini 3.1 Pro | mini-SWE-agent | 0% | 0.0% | $1.51 | 94 | |
| 10 | Gemini 3 Flash | mini-SWE-agent | 0% | 0.0% | $0.30 | 85 | |
| 11 | Claude Haiku 4.5 | mini-SWE-agent | 0% | 0.0% | $0.80 | 124 | |
| 12 | GPT 5.4 mini | mini-SWE-agent | 0% | 0.0% | $0.04 | 18 | |
| 13 | GPT 5 mini | mini-SWE-agent | 0% | 0.0% | $0.03 | 15 | |
Rows are sorted by Resolved, then Almost resolved, then average pass rate.
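The Resolved and Almost columns and the sort order used for the table can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's actual scoring code: the `summarize` and `rank` helpers, the `(passed, total)` input layout, and the model names and numbers in `toy` are all hypothetical. It also assumes that Almost counts every instance at or above 95%, including fully resolved ones, which the source does not specify.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Summary:
    model: str
    resolved: float       # share of tasks where every behavioral test passes
    almost: float         # share of tasks where >= 95% of behavioral tests pass
    avg_pass_rate: float  # mean per-task pass rate, used as the last tie-breaker


def summarize(model: str, results: list[tuple[int, int]]) -> Summary:
    """Aggregate per-task (passed, total) test counts into leaderboard metrics."""
    n = len(results)
    resolved = sum(passed == total for passed, total in results) / n
    # Integer arithmetic avoids floating-point error right at the 95% boundary.
    almost = sum(passed * 100 >= 95 * total for passed, total in results) / n
    avg = mean(passed / total for passed, total in results)
    return Summary(model, resolved, almost, avg)


def rank(summaries: list[Summary]) -> list[Summary]:
    """Sort by Resolved, then Almost, then average pass rate (all descending)."""
    return sorted(
        summaries,
        key=lambda s: (s.resolved, s.almost, s.avg_pass_rate),
        reverse=True,
    )


# Hypothetical data: four tasks per model, 20 behavioral tests per task.
toy = {
    "model-b": [(19, 20), (19, 20), (5, 20), (0, 20)],
    "model-c": [(10, 20), (10, 20), (10, 20), (10, 20)],
    "model-a": [(20, 20), (19, 20), (10, 20), (0, 20)],
}

ranked = rank([summarize(name, results) for name, results in toy.items()])
print(", ".join(s.model for s in ranked))  # prints "model-a, model-b, model-c"
```

Sorting on a tuple key keeps the three criteria in one place: a model only falls back to Almost when Resolved ties, and to average pass rate when both tie, which is why most rows in the table above are separated by the Almost column alone.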
Score by Model × Task
[Figure: score for all models × 200 tasks, on a scale from 0% to 100%.]
Behavioral Test Pass Rate Distribution
[Figure: distribution of behavioral test pass rates, one series per model.]
Model Comparison
[Figure: model comparison; each dot is one task instance.]