mkj/dropbear — ProgramBench


Dropbear SSH

Generated Behavioral Tests: 678

Best Score: 65.5%

Results by Model

Score: percentage of hidden behavioral tests passed. Cost: total API cost in USD for this task instance. Calls: number of LLM API calls for this task instance.

#		Model	Score	Cost	Calls
1		GPT 5.5 (xhigh) OpenAI	65.5%	$5.49	64
2		Claude Opus 4.7 (xhigh) Anthropic	65.0%	$3.00	72
3		GPT 5.4 OpenAI	58.1%	$0.15	10
4		Claude Opus 4.6 Anthropic	51.8%	$8.15	152
5		GPT 5.5 (high) OpenAI	50.1%	$2.08	30
6		GPT 5.5 OpenAI	46.5%	$0.96	18
7		Gemini 3 Flash Google	44.5%	$0.19	56
8		Claude Sonnet 4.6 Anthropic	41.6%	$67.81	600
9		Claude Opus 4.7 Anthropic	40.6%	$1.62	56
10		Gemini 3.1 Pro Google	38.3%	$1.05	69
11		Claude Haiku 4.5 Anthropic	26.5%	$0.26	78
12		GPT 5.4 mini OpenAI	22.0%	$0.02	7
13		GPT 5 mini OpenAI	17.0%	$0.01	12
