tarka/xcp — ProgramBench


An extended `cp`

Generated Behavioral Tests: 1,184

Best Score: 95.8%

Results by Model

Score is the percentage of hidden behavioral tests passed. Cost is the total API cost in USD for this task instance. Calls is the number of LLM API calls for this task instance.

#		Model	Score	Cost	Calls
1		GPT 5.5 (xhigh) OpenAI	95.8%	$6.37	62
2		Claude Opus 4.6 Anthropic	92.6%	$5.01	138
3		Claude Sonnet 4.6 Anthropic	91.7%	$8.99	330
4		Claude Opus 4.7 (xhigh) Anthropic	91.5%	$13.71	263
5		Claude Opus 4.7 Anthropic	84.7%	$2.57	85
6		GPT 5.4 OpenAI	84.5%	$0.28	10
7		Gemini 3.1 Pro Google	73.7%	$0.72	42
8		GPT 5.5 OpenAI	38.7%	$1.33	18
9		GPT 5.4 mini OpenAI	22.5%	$0.04	18
10		Gemini 3 Flash Google	18.6%	$0.24	57
11		GPT 5 mini OpenAI	4.3%	$0.02	16
12		Claude Haiku 4.5 Anthropic	0.0%	$0.73	107
13		GPT 5.5 (high) OpenAI	n/a	$1.95	25
