trasta298/keifu — ProgramBench

Git genealogy, untangled. A TUI for navigating commit graphs with color and clarity.

Generated Behavioral Tests: 262
Best Score: 87.0%

Results by Model

#	Model	Provider	Score (% of hidden behavioral tests passed)	Cost (total API cost in USD)	Calls (LLM API calls)
1	GPT 5.5 (xhigh)	OpenAI	87.0%	$7.34	70
2	GPT 5.5 (high)	OpenAI	75.2%	$3.94	50
3	Claude Opus 4.7	Anthropic	67.2%	$1.11	48
4	GPT 5.5	OpenAI	66.4%	$1.02	18
5	Claude Opus 4.7 (xhigh)	Anthropic	65.3%	$3.33	79
6	GPT 5.4	OpenAI	36.6%	$0.14	9
7	Gemini 3 Flash	Google	26.0%	$0.28	64
8	GPT 5.4 mini	OpenAI	21.8%	$0.10	142
9	GPT 5 mini	OpenAI	16.0%	$0.01	9
10	Claude Haiku 4.5	Anthropic	10.3%	$0.64	109
11	Gemini 3.1 Pro	Google	7.3%	$2.32	92
12	Claude Opus 4.6	Anthropic	0.0%	$2.19	178
13	Claude Sonnet 4.6	Anthropic	0.0%	$5.25	270