ducaale/xh — ProgramBench

Friendly and fast tool for sending HTTP requests

7,754 rs medium

Generated Behavioral Tests: 1,171

Best Score: 50.0%

Results by Model

#		Model	Score (% of hidden behavioral tests passed)	Cost (total API cost in USD for this task instance)	Calls (LLM API calls for this task instance)
1		Claude Opus 4.6 Anthropic	50.0%	$13.67	284
2		GPT 5.5 (high) OpenAI	48.5%	$2.92	27
3		GPT 5.5 (xhigh) OpenAI	43.5%	$10.20	89
4		GPT 5.5 OpenAI	42.2%	$1.18	17
5		Claude Opus 4.7 (xhigh) Anthropic	38.6%	$3.96	82
6		GPT 5.4 OpenAI	36.6%	$0.25	7
7		Claude Sonnet 4.6 Anthropic	33.6%	$45.79	702
8		Gemini 3.1 Pro Google	24.7%	$0.84	54
9		Claude Haiku 4.5 Anthropic	19.8%	$1.42	156
10		Gemini 3 Flash Google	16.7%	$0.23	73
11		Claude Opus 4.7 Anthropic	15.5%	$1.24	43
12		GPT 5 mini OpenAI	9.4%	$0.02	13
13		GPT 5.4 mini OpenAI	1.3%	$0.07	14
