yassinebridi/serpl — ProgramBench

yassinebridi/serpl

A simple terminal UI for search and replace, à la VS Code.

Generated Behavioral Tests: 446

Best Score: 65.0%

Results by Model

#	Model	Provider	Score	Cost	Calls
1	Claude Opus 4.7 (xhigh)	Anthropic	65.0%	$2.51	79
2	GPT 5.5 (xhigh)	OpenAI	62.1%	$6.46	83
3	GPT 5.5 (high)	OpenAI	61.9%	$4.18	61
4	GPT 5.4	OpenAI	61.0%	$0.36	13
5	Claude Opus 4.7	Anthropic	60.8%	$2.92	96
6	GPT 5.5	OpenAI	60.3%	$0.91	22
7	Claude Haiku 4.5	Anthropic	54.3%	$0.64	95
8	Gemini 3.1 Pro	Google	52.9%	$5.31	134
9	Gemini 3 Flash	Google	48.4%	$0.44	116
10	GPT 5 mini	OpenAI	41.3%	$0.01	10
11	GPT 5.4 mini	OpenAI	41.0%	$0.03	15
12	Claude Opus 4.6	Anthropic	40.8%	$12.94	210
13	Claude Sonnet 4.6	Anthropic	6.5%	$17.28	310

Score: percentage of hidden behavioral tests passed.
Cost: total API cost in USD for this task instance.
Calls: number of LLM API calls for this task instance.
