Y2Z/monolith — ProgramBench


⬛️ CLI tool and library for saving complete web pages as a single HTML file

15,024 rs medium

Generated Behavioral Tests: 713

Best Score: 67.7%

Results by Model

Score: percentage of hidden behavioral tests passed.
Cost: total API cost in USD for this task instance.
Calls: number of LLM API calls for this task instance.

#	Model	Provider	Score	Cost	Calls
1	GPT 5.5 (xhigh)	OpenAI	67.7%	$6.59	61
2	GPT 5.5 (high)	OpenAI	60.6%	$4.38	57
3	Claude Opus 4.7 (xhigh)	Anthropic	56.2%	$6.26	113
4	GPT 5.5	OpenAI	51.8%	$1.00	19
5	Gemini 3 Flash	Google	51.2%	$0.33	72
6	Gemini 3.1 Pro	Google	38.7%	$0.95	77
7	Claude Haiku 4.5	Anthropic	36.7%	$0.89	121
8	GPT 5.4 mini	OpenAI	33.0%	$0.03	8
9	GPT 5.4	OpenAI	25.8%	$0.25	8
10	Claude Opus 4.7	Anthropic	2.4%	$3.80	118
11	Claude Sonnet 4.6	Anthropic	2.4%	$14.72	406
12	Claude Opus 4.6	Anthropic	1.3%	$12.81	305
13	GPT 5 mini	OpenAI	1.3%	$0.03	16
