unhappychoice/gittype — ProgramBench

A CLI code-typing game that turns your source code into typing challenges

Generated Behavioral Tests: 741

Best Score: 91.3%

Results by Model

Score: percentage of hidden behavioral tests passed. Cost: total API cost in USD for this task instance. Calls: number of LLM API calls for this task instance.

#	Model	Provider	Score	Cost	Calls
1	Claude Opus 4.7	Anthropic	91.3%	$1.74	71
2	GPT 5.5 (xhigh)	OpenAI	76.7%	$11.11	61
3	GPT 5.5 (high)	OpenAI	63.2%	$3.70	51
4	GPT 5.4	OpenAI	63.1%	$0.25	10
5	GPT 5.5	OpenAI	61.1%	$0.74	16
6	Claude Haiku 4.5	Anthropic	56.3%	$0.34	93
7	Claude Opus 4.7 (xhigh)	Anthropic	50.3%	$4.73	133
8	Gemini 3.1 Pro	Google	45.6%	$1.16	65
9	Gemini 3 Flash	Google	41.7%	$0.39	129
10	GPT 5 mini	OpenAI	33.0%	$0.02	17
11	Claude Sonnet 4.6	Anthropic	1.0%	$18.17	396
12	GPT 5.4 mini	OpenAI	1.0%	$0.05	24
13	Claude Opus 4.6	Anthropic	0.0%	$4.14	242
