sitkevij/hex — ProgramBench


🔮 Futuristic take on hexdump, made in Rust.

Generated Behavioral Tests: 823

Best Score: 99.6%

Results by Model

Score is the percentage of hidden behavioral tests passed. Cost is the total API cost in USD for this task instance. Calls is the number of LLM API calls for this task instance.

#	Model	Provider	Score	Cost	Calls
1	GPT 5.5 (high)	OpenAI	99.6%	$1.84	28
2	GPT 5.5 (xhigh)	OpenAI	98.4%	$7.97	87
3	Claude Opus 4.7 (xhigh)	Anthropic	97.3%	$6.72	121
4	Gemini 3.1 Pro	Google	91.7%	$1.25	84
5	GPT 5.5	OpenAI	82.7%	$1.00	15
6	Claude Sonnet 4.6	Anthropic	72.3%	$11.60	392
7	Claude Opus 4.6	Anthropic	71.3%	$5.97	227
8	GPT 5.4	OpenAI	67.2%	$0.19	9
9	Claude Opus 4.7	Anthropic	66.3%	$2.75	89
10	Gemini 3 Flash	Google	66.1%	$0.34	124
11	GPT 5.4 mini	OpenAI	36.1%	$0.03	13
12	Claude Haiku 4.5	Anthropic	28.7%	$0.63	114
13	GPT 5 mini	OpenAI	4.4%	$0.03	21
