bensadeh/tailspin — ProgramBench

bensadeh/tailspin

🌀 A log file highlighter

7,793 rs medium

Generated Behavioral Tests: 615

Best Score: 88.0%

Results by Model

Score: percentage of hidden behavioral tests passed. Cost: total API cost in USD for this task instance. Calls: number of LLM API calls for this task instance.

#		Model	Score	Cost	Calls
1		GPT 5.5 (xhigh) OpenAI	88.0%	$15.26	97
2		GPT 5.5 (high) OpenAI	77.9%	$3.64	33
3		Claude Sonnet 4.6 Anthropic	75.8%	$58.91	840
4		Claude Opus 4.7 Anthropic	75.6%	$2.18	65
5		Claude Opus 4.7 (xhigh) Anthropic	74.0%	$26.95	265
6		Claude Opus 4.6 Anthropic	72.0%	$20.30	324
7		GPT 5.5 OpenAI	68.5%	$1.45	24
8		GPT 5.4 mini OpenAI	62.9%	$0.10	10
9		GPT 5.4 OpenAI	59.7%	$0.26	8
10		Gemini 3 Flash Google	56.6%	$0.67	130
11		GPT 5 mini OpenAI	26.2%	$0.02	13
12		Gemini 3.1 Pro Google	2.1%	$1.56	51
13		Claude Haiku 4.5 Anthropic	0.0%	$0.87	101
