Extended Results
Evaluated with mini-SWE-agent · 200 tasks

| # | Model | Agent | Resolved | Almost resolved | Cost | Calls |
|---|---|---|---|---|---|---|
| 1 | Claude Opus 4.7 (Anthropic) | mini-SWE-agent | 0% | 3.0% | $3.81 | 93 |
| 2 | Claude Opus 4.6 (Anthropic) | mini-SWE-agent | 0% | 2.5% | $11.38 | 260 |
| 3 | Claude Sonnet 4.6 (Anthropic) | mini-SWE-agent | 0% | 1.0% | $26.73 | 472 |
| 4 | GPT 5.4 (OpenAI) | mini-SWE-agent | 0% | 0.0% | $0.33 | 16 |
| 5 | Gemini 3.1 Pro (Google) | mini-SWE-agent | 0% | 0.0% | $1.51 | 94 |
| 6 | Gemini 3 Flash (Google) | mini-SWE-agent | 0% | 0.0% | $0.30 | 85 |
| 7 | Claude Haiku 4.5 (Anthropic) | mini-SWE-agent | 0% | 0.0% | $0.80 | 124 |
| 8 | GPT 5.4 mini (OpenAI) | mini-SWE-agent | 0% | 0.0% | $0.04 | 18 |
| 9 | GPT 5 mini (OpenAI) | mini-SWE-agent | 0% | 0.0% | $0.03 | 15 |

Resolved: The number of fully solved instances as measured by the hidden behavioral tests. Note that behavioral tests can never cover all possible inputs. The behavioral tests of ProgramBench can be easily extended should any false positives arise.

Almost resolved: Instances where the agent's solution solves ≥ 95% of all behavioral tests.

Cost: Average API cost in USD per task instance.

Calls: Average number of LLM calls per task instance.
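The Resolved and Almost resolved columns can be read as two thresholds applied to each task's behavioral test results. The following is a minimal sketch of that reading, not the benchmark's actual scoring code: the function name `summarize`, the `(passed, total)` input shape, and the sample data are all hypothetical. It also assumes the literal "≥ 95%" wording, so a fully resolved task counts as almost resolved too; the leaderboard may count the two differently.

```python
from typing import Sequence, Tuple

# Threshold taken from the column definition: >= 95% of behavioral tests.
ALMOST_PERCENT = 95


def summarize(results: Sequence[Tuple[int, int]]) -> dict:
    """Compute resolved / almost-resolved percentages for one model.

    `results` holds one hypothetical (tests_passed, tests_total) pair per task.
    """
    if not results:
        raise ValueError("results must not be empty")
    resolved = 0
    almost = 0
    for passed, total in results:
        if total <= 0 or not 0 <= passed <= total:
            raise ValueError(f"invalid test counts: {passed}/{total}")
        if passed == total:
            resolved += 1
        # Integer comparison avoids floating-point edge cases at exactly 95%.
        if passed * 100 >= ALMOST_PERCENT * total:
            almost += 1
    n = len(results)
    return {
        "resolved_pct": 100.0 * resolved / n,
        "almost_resolved_pct": 100.0 * almost / n,
    }


# Made-up data for four tasks, purely for illustration.
example = [(20, 20), (19, 20), (10, 20), (0, 20)]
print(summarize(example))
# {'resolved_pct': 25.0, 'almost_resolved_pct': 50.0}
```

Counting with integers rather than dividing first keeps a task that passes exactly 19 of 20 tests on the "almost resolved" side of the threshold.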
Score by Model × Task
Grid of all models × 200 tasks, colored on a scale from 0% to 100%.
Behavioral Test Pass Rate Distribution