alexpovel/srgn — ProgramBench

A grep-like tool which understands source code syntax and allows for manipulation in addition to search

Generated Behavioral Tests: 1,852
Best Score: 80.1%

Results by Model

Score: percentage of hidden behavioral tests passed. Cost: total API cost in USD for this task instance. Calls: number of LLM API calls for this task instance.

#	Model	Provider	Score	Cost	Calls
1	GPT 5.5 (xhigh)	OpenAI	80.1%	$10.26	91
2	GPT 5.5 (high)	OpenAI	79.8%	$3.89	40
3	Claude Sonnet 4.6	Anthropic	69.5%	$19.30	565
4	GPT 5.5	OpenAI	68.7%	$1.63	22
5	Claude Opus 4.7 (xhigh)	Anthropic	65.9%	$6.38	118
6	Claude Opus 4.6	Anthropic	62.8%	$13.43	344
7	GPT 5.4	OpenAI	58.1%	$0.41	12
8	Gemini 3.1 Pro	Google	57.9%	$1.13	94
9	Gemini 3 Flash	Google	53.1%	$0.25	111
10	GPT 5 mini	OpenAI	46.0%	$0.04	13
11	Claude Haiku 4.5	Anthropic	43.6%	$1.15	167
12	Claude Opus 4.7	Anthropic	2.4%	$1.57	64
13	GPT 5.4 mini	OpenAI	0.6%	$0.03	6
