eudoxia0/hashcards — ProgramBench

A plain text-based spaced repetition system.

1,071 rs medium

Generated Behavioral Tests: 1,019

Best Score: 86.8%

Results by Model

#		Model	Score (% of hidden behavioral tests passed)	Cost (total API cost in USD)	Calls (number of LLM API calls)
1		GPT 5.5 (high) OpenAI	86.8%	$5.14	63
2		GPT 5.5 (xhigh) OpenAI	82.5%	$9.67	100
3		GPT 5.5 OpenAI	77.6%	$2.18	28
4		Gemini 3 Flash Google	56.3%	$0.33	114
5		Claude Haiku 4.5 Anthropic	51.5%	$0.82	141
6		GPT 5.4 mini OpenAI	41.5%	$0.04	7
7		GPT 5 mini OpenAI	38.0%	$0.04	25
8		Claude Opus 4.7 Anthropic	6.1%	$1.55	65
9		Claude Sonnet 4.6 Anthropic	6.1%	$51.19	537
10		Claude Opus 4.7 (xhigh) Anthropic	3.7%	$11.74	179
11		GPT 5.4 OpenAI	3.7%	$0.48	15
12		Claude Opus 4.6 Anthropic	0.0%	$6.91	254
13		Gemini 3.1 Pro Google	0.0%	$2.07	136
