o2sh/onefetch — ProgramBench

Command-line Git information tool

11,745 rs medium

Generated Behavioral Tests: 1,166

Best Score: 85.1%

Results by Model

Score: percentage of hidden behavioral tests passed. Cost: total API cost in USD for this task instance. Calls: number of LLM API calls for this task instance.

#		Model	Score	Cost	Calls
1		GPT 5.5 (xhigh) OpenAI	85.1%	$12.76	109
2		GPT 5.5 (high) OpenAI	81.9%	$4.06	45
3		Claude Opus 4.6 Anthropic	81.7%	$16.25	289
4		GPT 5.5 OpenAI	69.0%	$1.42	16
5		Claude Opus 4.7 (xhigh) Anthropic	66.0%	$3.68	108
6		Claude Opus 4.7 Anthropic	53.6%	$1.84	72
7		GPT 5.4 OpenAI	52.2%	$0.22	8
8		Claude Haiku 4.5 Anthropic	35.8%	$0.72	98
9		GPT 5.4 mini OpenAI	15.8%	$0.03	7
10		Gemini 3 Flash Google	13.6%	$0.61	91
11		Claude Sonnet 4.6 Anthropic	2.3%	$33.45	604
12		Gemini 3.1 Pro Google	0.0%	$1.68	89
13		GPT 5 mini OpenAI	0.0%	$0.02	13
