Extended Results — ProgramBench

Evaluated with mini-SWE-agent · 200 tasks

Column definitions:
Resolved: the number of fully solved instances, as measured by the hidden behavioral tests, shown as a percentage. Behavioral tests can never cover all possible inputs; the behavioral tests of ProgramBench can easily be extended should any false positives arise.
Almost resolved: instances where the agent's solution passes ≥ 95% of all behavioral tests.
Cost: average API cost in USD per task instance.
Calls: average number of LLM calls per task instance.

#		Model	Agent	Resolved	Almost resolved	Cost	Calls
1		Claude Opus 4.7 Anthropic	mini-SWE-agent	0%	3.0%	$3.81	93
2		Claude Opus 4.6 Anthropic	mini-SWE-agent	0%	2.5%	$11.38	260
3		Claude Sonnet 4.6 Anthropic	mini-SWE-agent	0%	1.0%	$26.73	472
4		GPT 5.4 OpenAI	mini-SWE-agent	0%	0.0%	$0.33	16
5		Gemini 3.1 Pro Google	mini-SWE-agent	0%	0.0%	$1.51	94
6		Gemini 3 Flash Google	mini-SWE-agent	0%	0.0%	$0.30	85
7		Claude Haiku 4.5 Anthropic	mini-SWE-agent	0%	0.0%	$0.80	124
8		GPT 5.4 mini OpenAI	mini-SWE-agent	0%	0.0%	$0.04	18
9		GPT 5 mini OpenAI	mini-SWE-agent	0%	0.0%	$0.03	15
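
The column definitions can be made concrete with a small sketch. This is only an illustration under stated assumptions, not ProgramBench's actual scoring code: the `TaskResult` record, its field names, and the `summarize` helper are hypothetical. It also assumes that "almost resolved" is read literally, so a fully resolved instance counts toward both columns.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """Hypothetical per-task record; field names are assumptions."""
    passed: int      # behavioral tests passed
    total: int       # behavioral tests run
    cost_usd: float  # API cost for this task instance
    llm_calls: int   # LLM calls made for this task instance


def summarize(results: list[TaskResult]) -> dict[str, float]:
    """Aggregate per-task results into the four leaderboard columns."""
    n = len(results)
    if n == 0:
        raise ValueError("need at least one task result")

    # Resolved: every behavioral test passes.
    resolved = sum(r.total > 0 and r.passed == r.total for r in results)
    # Almost resolved: at least 95% of behavioral tests pass.
    # Integer arithmetic avoids floating-point edge cases at exactly 95%.
    almost = sum(r.total > 0 and r.passed * 100 >= 95 * r.total for r in results)

    return {
        "resolved_pct": 100.0 * resolved / n,
        "almost_resolved_pct": 100.0 * almost / n,
        "avg_cost_usd": sum(r.cost_usd for r in results) / n,
        "avg_calls": sum(r.llm_calls for r in results) / n,
    }


# Made-up example data, not taken from the leaderboard.
example = [
    TaskResult(passed=20, total=20, cost_usd=1.0, llm_calls=10),
    TaskResult(passed=19, total=20, cost_usd=3.0, llm_calls=30),
    TaskResult(passed=10, total=20, cost_usd=2.0, llm_calls=20),
    TaskResult(passed=0, total=20, cost_usd=2.0, llm_calls=20),
]
summary = summarize(example)
```

With 200 tasks, each task contributes 0.5 percentage points, so a value such as 3.0% corresponds to 6 task instances.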

[Figure: Score by Model × Task. All models × 200 tasks; scale from 0% to 100%.]

[Figure: Behavioral Test Pass Rate Distribution, by model.]