Extended Results

Evaluated with mini-SWE-agent · 200 tasks · Updated May 11, 2026

Column definitions:
Resolved: Percentage of instances fully solved, as measured by the hidden behavioral tests. Note that behavioral tests can never cover all possible inputs. The behavioral tests of ProgramBench can be easily extended should any false positives arise.
Almost: Percentage of instances where the agent's solution passes ≥ 95% of all behavioral tests.
Cost: Average API cost in USD per task instance.
Calls: Average number of LLM calls per task instance.

#	Model	Agent	Resolved	Almost	Cost	Calls
1	GPT 5.5 (xhigh)	mini-SWE-agent	0.5%	13.5%	$8.85	82
2	GPT 5.5 (high)	mini-SWE-agent	0.5%	5.0%	$3.65	41
3	Claude Opus 4.7 (xhigh)	mini-SWE-agent	0%	4.5%	$10.96	159
4	Claude Opus 4.7	mini-SWE-agent	0%	3.0%	$3.81	93
5	Claude Opus 4.6	mini-SWE-agent	0%	2.5%	$11.38	260
6	GPT 5.5	mini-SWE-agent	0%	1.5%	$1.21	19
7	Claude Sonnet 4.6	mini-SWE-agent	0%	1.0%	$26.73	472
8	GPT 5.4	mini-SWE-agent	0%	0.0%	$0.33	16
9	Gemini 3.1 Pro	mini-SWE-agent	0%	0.0%	$1.51	94
10	Gemini 3 Flash	mini-SWE-agent	0%	0.0%	$0.30	85
11	Claude Haiku 4.5	mini-SWE-agent	0%	0.0%	$0.80	124
12	GPT 5.4 mini	mini-SWE-agent	0%	0.0%	$0.04	18
13	GPT 5 mini	mini-SWE-agent	0%	0.0%	$0.03	15

Sorting: Resolved → Almost resolved → Avg. pass rate
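The column definitions and sort order for the table can be illustrated with a minimal sketch. This is not ProgramBench's actual scoring code. The `Run` record layout, the function names, and the sample numbers are all hypothetical. The sketch also assumes that "Almost" includes fully resolved instances and that the average pass rate is the mean of per-task pass fractions; the page does not confirm either.

```python
from dataclasses import dataclass


@dataclass
class Run:
    """One model's result on one task (hypothetical record layout)."""
    tests_passed: int   # hidden behavioral tests passed
    tests_total: int    # hidden behavioral tests for this task
    cost_usd: float     # API cost for this task instance
    llm_calls: int      # number of LLM calls for this task instance


def summarize(runs: list[Run]) -> dict[str, float]:
    """Aggregate per-task runs into leaderboard-style columns."""
    n = len(runs)
    resolved = sum(r.tests_passed == r.tests_total for r in runs)
    # Integer comparison avoids float rounding at the 95% boundary.
    # Assumption: fully resolved tasks also count as "almost".
    almost = sum(100 * r.tests_passed >= 95 * r.tests_total for r in runs)
    return {
        "resolved": 100 * resolved / n,
        "almost": 100 * almost / n,
        "avg_pass_rate": 100 * sum(r.tests_passed / r.tests_total for r in runs) / n,
        "cost": sum(r.cost_usd for r in runs) / n,
        "calls": sum(r.llm_calls for r in runs) / n,
    }


def rank(summaries: dict[str, dict[str, float]]) -> list[str]:
    """Order model names by Resolved, then Almost, then average pass rate."""
    return sorted(
        summaries,
        key=lambda name: (
            -summaries[name]["resolved"],
            -summaries[name]["almost"],
            -summaries[name]["avg_pass_rate"],
        ),
    )


if __name__ == "__main__":
    # Made-up runs for illustration only; not taken from the leaderboard.
    demo = [Run(20, 20, 2.0, 30), Run(19, 20, 4.0, 50), Run(10, 20, 6.0, 70)]
    print(summarize(demo))
```

Sorting on a tuple of negated values gives descending order on each column in turn, so ties on Resolved fall through to Almost and then to average pass rate.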

Score by Model × Task

Figure: scores for all models × 200 tasks.

Figure: scatter plot in which each dot is one task instance.