direnv/direnv — ProgramBench


unclutter your .profile

14,998 go medium

Generated Behavioral Tests: 849

Best Score: 80.9%

Results by Model

Score: percentage of hidden behavioral tests passed. Cost: total API cost in USD for this task instance. Calls: number of LLM API calls for this task instance.

#	Model	Provider	Score	Cost	Calls
1	GPT 5.5 (xhigh)	OpenAI	80.9%	$6.28	45
2	GPT 5.5 (high)	OpenAI	73.3%	$3.77	35
3	Claude Opus 4.6	Anthropic	62.0%	$14.72	312
4	Claude Sonnet 4.6	Anthropic	57.6%	$20.85	417
5	GPT 5.5	OpenAI	48.8%	$1.46	17
6	Claude Opus 4.7 (xhigh)	Anthropic	33.5%	$7.67	169
7	GPT 5.4	OpenAI	32.9%	$0.22	8
8	Claude Opus 4.7	Anthropic	32.2%	$2.28	84
9	Gemini 3.1 Pro	Google	27.1%	$3.32	161
10	Claude Haiku 4.5	Anthropic	19.3%	$1.04	133
11	GPT 5.4 mini	OpenAI	12.0%	$0.05	31
12	GPT 5 mini	OpenAI	10.1%	$0.03	19
13	Gemini 3 Flash	Google	6.8%	$0.73	127
