stacked-git/stgit — ProgramBench

stacked-git/stgit

Stacked Git

Generated Behavioral Tests: 1,488

Best Score: 48.7%

Results by Model

#		Model	Score (percentage of hidden behavioral tests passed)	Cost (total API cost in USD for this task instance)	Calls (number of LLM API calls for this task instance)
1		GPT 5.5 (high) OpenAI	48.7%	$4.62	44
2		GPT 5.5 (xhigh) OpenAI	46.4%	$7.28	80
3		GPT 5.5 OpenAI	37.7%	$1.13	18
4		Claude Opus 4.7 (xhigh) Anthropic	25.1%	$10.31	162
5		GPT 5 mini OpenAI	20.0%	$0.02	14
6		GPT 5.4 OpenAI	16.3%	$0.23	8
7		Gemini 3 Flash Google	12.6%	$0.24	114
8		Claude Opus 4.7 Anthropic	10.1%	$0.56	25
9		Gemini 3.1 Pro Google	8.6%	$1.18	94
10		Claude Haiku 4.5 Anthropic	8.3%	$0.60	104
11		GPT 5.4 mini OpenAI	7.1%	$0.02	9
12		Claude Sonnet 4.6 Anthropic	4.0%	$31.22	581
13		Claude Opus 4.6 Anthropic	0.0%	$1.93	213
