jonas/tig — ProgramBench

Text-mode interface for git

13,200 c medium

Generated Behavioral Tests: 1,586

Best Score: 85.2%

Results by Model

Score: percentage of hidden behavioral tests passed. Cost: total API cost in USD for this task instance. Calls: number of LLM API calls for this task instance.

#	Model	Provider	Score	Cost	Calls
1	GPT 5.5 (xhigh)	OpenAI	85.2%	$4.82	39
2	GPT 5.5 (high)	OpenAI	85.0%	$2.06	32
3	GPT 5.5	OpenAI	84.2%	$0.82	16
4	Gemini 3.1 Pro	Google	83.9%	$1.02	72
5	Claude Opus 4.7 (xhigh)	Anthropic	83.8%	$1.57	65
6	Claude Opus 4.7	Anthropic	83.6%	$0.84	44
7	Claude Haiku 4.5	Anthropic	82.9%	$0.22	80
8	Claude Opus 4.6	Anthropic	79.6%	$10.78	281
9	GPT 5 mini	OpenAI	76.9%	$0.01	12
10	Gemini 3 Flash	Google	71.2%	$0.19	54
11	GPT 5.4 mini	OpenAI	39.0%	$0.03	8
12	GPT 5.4	OpenAI	38.1%	$0.12	7
13	Claude Sonnet 4.6	Anthropic	0.0%	$14.93	382
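
The Score column is described as the percentage of hidden behavioral tests passed. As a rough illustration of that arithmetic, the sketch below converts between a pass count and a percentage. It assumes the score is computed over all 1,586 generated tests, which the page does not state; the function names `score_percent` and `approx_passed` are hypothetical and not part of ProgramBench.

```python
# Illustrative only: assumes the score is taken over all 1,586 generated tests.
TOTAL_TESTS = 1586


def score_percent(passed: int, total: int) -> float:
    """Return the percentage of tests passed, rounded to one decimal place."""
    if total <= 0:
        raise ValueError("total must be positive")
    return round(100.0 * passed / total, 1)


def approx_passed(score: float, total: int) -> int:
    """Return the approximate pass count implied by a percentage score."""
    return round(score / 100.0 * total)


if __name__ == "__main__":
    # Under the stated assumption, a score of 85.2% implies about 1351 passing tests.
    print(approx_passed(85.2, TOTAL_TESTS))
```

Because scores are shown to one decimal place, the pass count recovered this way is only an estimate; several nearby counts round to the same percentage.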
