kyoheiu/felix — ProgramBench

TUI file manager with Vim-like key mapping

Generated Behavioral Tests: 502

Best Score: 88.2%

Results by Model

Score: percentage of hidden behavioral tests passed. Cost: total API cost in USD for this task instance. Calls: number of LLM API calls for this task instance.

#		Model	Score	Cost	Calls
1		GPT 5.5 (high) OpenAI	88.2%	$3.14	51
2		GPT 5.5 OpenAI	88.0%	$1.12	22
3		Claude Opus 4.7 (xhigh) Anthropic	76.5%	$1.68	52
4		GPT 5.5 (xhigh) OpenAI	56.8%	$6.12	46
5		Gemini 3.1 Pro Google	49.2%	$1.12	62
6		GPT 5.4 mini OpenAI	46.0%	$0.06	44
7		Claude Opus 4.7 Anthropic	45.2%	$1.13	52
8		GPT 5.4 OpenAI	39.0%	$0.42	12
9		Gemini 3 Flash Google	33.3%	$0.27	76
10		Claude Haiku 4.5 Anthropic	23.5%	$1.01	142
11		GPT 5 mini OpenAI	22.3%	$0.05	20
12		Claude Opus 4.6 Anthropic	0.0%	$2.48	244
13		Claude Sonnet 4.6 Anthropic	0.0%	$4.16	360
