jgm/pandoc — ProgramBench


Universal markup converter

Generated Behavioral Tests: 5,228

Best Score: 14.7%

Results by Model

Score is the percentage of hidden behavioral tests passed. Cost is the total API cost in USD for this task instance. Calls is the number of LLM API calls for this task instance.

#		Model	Score	Cost	Calls
1		GPT 5.5 (high) OpenAI	14.7%	$2.59	24
2		Claude Sonnet 4.6 Anthropic	14.1%	$35.83	568
3		Claude Opus 4.7 (xhigh) Anthropic	10.7%	$2.53	83
4		GPT 5.5 OpenAI	10.1%	$1.63	21
5		Gemini 3.1 Pro Google	6.3%	$1.33	95
6		Gemini 3 Flash Google	6.2%	$0.20	67
7		GPT 5.4 OpenAI	5.2%	$0.24	8
8		Claude Opus 4.7 Anthropic	4.7%	$1.04	46
9		GPT 5.4 mini OpenAI	0.6%	$0.05	19
10		GPT 5 mini OpenAI	0.4%	$0.06	20
11		Claude Opus 4.6 Anthropic	0.1%	$17.70	339
12		Claude Haiku 4.5 Anthropic	0.1%	$0.90	105
13		GPT 5.5 (xhigh) OpenAI	0.0%	$3.56	41
