hpjansson/chafa — ProgramBench



📺🗿 Terminal graphics for the 21st century.

Generated Behavioral Tests: 1,931

Best Score: 58.4%

Results by Model

#		Model	Score (% of hidden behavioral tests passed)	Cost (total API cost in USD for this task instance)	Calls (number of LLM API calls for this task instance)
1		Claude Sonnet 4.6 Anthropic	58.4%	$50.80	532
2		Claude Haiku 4.5 Anthropic	50.2%	$0.82	136
3		Gemini 3 Flash Google	47.1%	$0.17	65
4		Claude Opus 4.6 Anthropic	33.5%	$14.64	252
5		GPT 5.5 (xhigh) OpenAI	28.0%	$8.49	77
6		GPT 5.5 (high) OpenAI	21.4%	$3.78	44
7		Claude Opus 4.7 (xhigh) Anthropic	18.7%	$7.73	146
8		GPT 5.5 OpenAI	16.6%	$1.74	22
9		Claude Opus 4.7 Anthropic	16.1%	$1.45	42
10		GPT 5.4 OpenAI	8.7%	$0.16	7
11		GPT 5 mini OpenAI	4.6%	$0.01	6
12		GPT 5.4 mini OpenAI	3.5%	$0.04	9
13		Gemini 3.1 Pro Google	0.0%	$1.65	92
