zk-org/zk — ProgramBench


Plain text note-taking assistant

2,542 go medium

Generated Behavioral Tests: 1,108

Best Score: 43.1%

Results by Model

#		Model	Score (% of hidden behavioral tests passed)	Cost (total API cost in USD)	Calls (number of LLM API calls)
1		Claude Sonnet 4.6 Anthropic	43.1%	$24.25	494
2		GPT 5.4 OpenAI	17.0%	$0.30	8
3		GPT 5.5 (high) OpenAI	16.5%	$4.05	51
4		GPT 5.5 (xhigh) OpenAI	16.2%	$7.05	80
5		Claude Opus 4.7 (xhigh) Anthropic	16.1%	$9.09	113
6		GPT 5.5 OpenAI	15.3%	$1.30	22
7		Claude Opus 4.7 Anthropic	14.0%	$1.52	59
8		Gemini 3 Flash Google	13.9%	$0.21	62
9		Gemini 3.1 Pro Google	12.1%	$1.75	139
10		GPT 5 mini OpenAI	7.5%	$0.02	12
11		GPT 5.4 mini OpenAI	5.1%	$0.03	9
12		Claude Opus 4.6 Anthropic	1.2%	$14.07	286
13		Claude Haiku 4.5 Anthropic	1.2%	$0.59	91
