blacknon/hwatch — ProgramBench


blacknon/hwatch

A modern alternative to the watch command that records the differences between execution results so they can be reviewed afterward.

1,016 rs medium

Generated Behavioral Tests: 1,016

Best Score: 85.3%

Results by Model

Score: percentage of hidden behavioral tests passed. Cost: total API cost in USD for this task instance. Calls: number of LLM API calls for this task instance.

#	Model	Provider	Score	Cost (USD)	Calls
1	GPT 5.5 (xhigh)	OpenAI	85.3%	$5.31	65
2	GPT 5.5 (high)	OpenAI	82.6%	$4.53	59
3	Claude Opus 4.6	Anthropic	81.1%	$10.71	221
4	GPT 5.5	OpenAI	80.0%	$1.67	23
5	Claude Opus 4.7 (xhigh)	Anthropic	76.3%	$10.58	210
6	Gemini 3.1 Pro	Google	75.1%	$1.04	73
7	Claude Opus 4.7	Anthropic	73.5%	$1.23	56
8	Claude Haiku 4.5	Anthropic	66.0%	$0.59	88
9	Gemini 3 Flash	Google	61.1%	$0.34	71
10	GPT 5.4	OpenAI	56.6%	$0.27	9
11	GPT 5.4 mini	OpenAI	24.6%	$0.02	9
12	GPT 5 mini	OpenAI	2.6%	$0.03	10
13	Claude Sonnet 4.6	Anthropic	1.9%	$14.13	378
