antonmedv/fx — ProgramBench

Terminal JSON viewer & processor

20,433 go medium

Generated Behavioral Tests: 2,047

Best Score: 75.7%

Results by Model

Score: percentage of hidden behavioral tests passed. Cost: total API cost in USD for this task instance. Calls: number of LLM API calls for this task instance.

#		Model	Score	Cost	Calls
1		Claude Opus 4.6 Anthropic	75.7%	$20.74	396
2		GPT 5.5 (high) OpenAI	74.4%	$3.29	50
3		GPT 5.5 OpenAI	56.5%	$1.21	24
4		Claude Opus 4.7 Anthropic	49.3%	$5.93	145
5		GPT 5.4 OpenAI	49.3%	$0.20	10
6		Claude Opus 4.7 (xhigh) Anthropic	47.0%	$7.48	135
7		GPT 5.5 (xhigh) OpenAI	43.3%	$7.63	71
8		Gemini 3.1 Pro Google	42.1%	$1.37	127
9		GPT 5.4 mini OpenAI	40.3%	$0.03	7
10		Claude Haiku 4.5 Anthropic	17.5%	$0.73	133
11		Gemini 3 Flash Google	16.1%	$0.19	85
12		GPT 5 mini OpenAI	11.6%	$0.01	11
13		Claude Sonnet 4.6 Anthropic	4.3%	$25.98	621
