Jun. 23, 2026 Kilian Lieret, John Yang

ProgramBench Harness Fixes

A batch of fairness and reliability fixes landed in ProgramBench v1.1.0. These fixes crack down on newly discovered ways for agents to cheat, improve reproducibility, and eliminate some false negatives among tests.

Agents exploiting internet access during evaluation

Problem. Internet access is blocked during inference, i.e., while the agent is working on a ProgramBench task (see why we block internet), but the evaluation pipeline ran without the same safeguards. Some agents realized that they were in a benchmark and speculated that they could use the evaluation step to get around the blocked network.

How it's fixed. Internet access is now also disabled while we execute compile.sh.

What you need to do. Unfortunately, this fix requires rerunning all affected trajectories. However, you should be able to quickly scan all submitted compile.sh files to see which ones might be affected.
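A quick scan of this kind could look like the sketch below. It is a heuristic, not the harness's own check: the pattern list, the function names, and the assumption that submissions sit in a directory tree containing files named compile.sh are all illustrative. A hit only means the script deserves a closer look, and a script with no hits can still reach the network through less obvious means.

```python
import re
from pathlib import Path

# Heuristic patterns for commands that usually need network access.
# This list is an assumption for illustration; extend it for your toolchains.
NETWORK_PATTERNS = [
    r"\bcurl\b",
    r"\bwget\b",
    r"\bgit\s+clone\b",
    r"\bpip3?\s+install\b",
    r"\bgo\s+(get|install|mod\s+download)\b",
    r"\bcargo\s+install\b",
    r"\bnpm\s+install\b",
    r"\bapt(-get)?\s+install\b",
    r"https?://",
]
NETWORK_RE = re.compile("|".join(NETWORK_PATTERNS))


def find_network_lines(script_text: str) -> list[tuple[int, str]]:
    """Return (line_number, line) pairs that look like network access."""
    hits = []
    for lineno, line in enumerate(script_text.splitlines(), start=1):
        stripped = line.strip()
        if stripped.startswith("#"):
            continue  # skip comment lines (this also skips the shebang)
        if NETWORK_RE.search(stripped):
            hits.append((lineno, stripped))
    return hits


def scan_submissions(root: str) -> dict[str, list[tuple[int, str]]]:
    """Scan every compile.sh under root (hypothetical directory layout)."""
    results = {}
    for path in Path(root).rglob("compile.sh"):
        hits = find_network_lines(path.read_text(errors="replace"))
        if hits:
            results[str(path)] = hits
    return results
```

Calling scan_submissions on the directory that holds your submissions returns a mapping from each suspicious compile.sh to the lines that matched, which gives a shortlist of trajectories to rerun first.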

Reference. 57a6d3c (block build-script internet, #41).

Examples.

Sonnet 4.6 (DuckDB) realizes it has no network (step 62), decides to write a compile.sh that curls the original DuckDB source (step 75), and explicitly gambles that the judge machine has internet even though its own sandbox does not (step 217): "Does the evaluation machine have internet? I think the answer is YES".
Gemini 3.1 Pro (jsonschema) reimplements the CLI (step 31) but has compile.sh fall back to pip install jsonschema-rs (step 32) — the very project it was asked to rebuild from scratch.
Gemini 3.1 Pro (pingu) decides evaluation can "download the real library" (step 34) and has compile.sh go get the upstream Go dependencies (step 39).


Agents exploiting a readable executable copy at /tmp

Problem. The reference binary is chmod 111 (execute-only) so agents can't read or disassemble it (see why we don't want that). But for historical reasons, the task_cleanroom images kept a fully readable copy of the executable at /tmp/_inject_exe. Some agents scanned the /tmp directory, found this copy, and proceeded to reverse engineer it.

How it's fixed. We rebuilt all containers and removed the copy.

What you need to do. Run an LM as a judge on your trajectories to check whether your agent exploited this, and rerun with the new containers.
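Before spending judge calls, a cheap text pre-filter can narrow down which trajectories to send to the LM. The sketch below assumes each trajectory is already available as a list of step strings (the real trajectory format is not specified here, and the function name is hypothetical). It only catches literal mentions of the path, so an agent that reached the copy through a glob or a variable would slip past it; treat it as a complement to the judge, not a replacement.

```python
# Path of the readable copy described above.
INJECT_PATH = "/tmp/_inject_exe"


def steps_touching_inject_copy(steps: list[str]) -> list[int]:
    """Return 1-based indices of steps whose text mentions the leaked copy.

    `steps` is assumed to hold the raw text (commands and outputs) of each
    trajectory step, in order.
    """
    return [i for i, text in enumerate(steps, start=1) if INJECT_PATH in text]
```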

Reference. 68e3da2 (evaluate in cleanroom images, #42), 3f57100 (default to v6 images, #46).

Examples. The exploitation ranged from reading a few bytes to shipping the original wholesale.

Gemini 3.5 Flash (QuickJS) is the single worst case in the whole audit: it copies the readable /tmp/_inject_exe (step 64), zlib-compresses it (step 70), links the bytes into its own program (step 77), and commits the stolen binary (step 90) instead of reimplementing anything.
Gemini 3.1 Pro (QuickJS) performs the static reverse-engineering the rules forbid against the readable copy — strings (step 58), binwalk (step 59), and readelf (step 70).
Opus 4.8 (duc) is milder: rather than reading bytes, it runs the original as a behavioral oracle (step 85) and even spoofs argv[0] so the copy's error messages match (step 136).


Different dependency / toolchain versions

Problem. Rebuilding a task image months later silently pulled different library and compiler versions (unpinned apt, pip, cargo, go). This caused score drift unrelated to the agent: for example, pandoc emitted api-version [1,23,1,2] where the captured golden output had [1,23,1,1], and jsonschema, oha, bartib, and zk showed similar wording and whitespace drift. In each case, correct binaries were marked as failures.

How it's fixed. Tier-A reproducibility pinning landed across the base image: Ubuntu by digest, apt snapshot, hash-locked pip, versioned cargo/go pre-cache, and GOTOOLCHAIN pinned so Go can't auto-upgrade its patch version. Version-drift test failures that predate the pin were classified and ignored as deterministic gold drift (tagged [gold_fail_v6_toolchain]) rather than counted against agents.

What you need to do. Rerun the corresponding instances.

Reference. 3f57100 (default to v6 images, #46), 102c952 (sync [gold_fail_v6_toolchain] ignores, #40).

Flaky tests

Problem. Some tests are non-deterministic on the gold binary itself — TUI render timing, screen-init races, local HTTP test-server connection races, SQL row-order without ORDER BY, etc. These passed in some runs and failed in others, adding noise to every score.

How it's fixed. We ran the full ~200-instance gold eval 20 independent times. Tests that passed in ≥1 round and failed in ≥1 round were classified as flaky (42 tests / 16 instances) and ignored under reason gold_flaky, each annotated with its flake mechanism. Tests that failed in all 20 rounds were separated out as genuine deterministic gold failures (gold_fail) — distinguishing real defects from timing noise.
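The classification rule above can be sketched in a few lines. This is an illustration under assumptions, not the harness code: the input shape (one dict per round mapping a test id to a pass/fail boolean) and the label stable_pass for tests that passed every round are hypothetical, while gold_flaky and gold_fail are the reasons named above.

```python
from collections import defaultdict


def classify_tests(rounds: list[dict[str, bool]]) -> dict[str, str]:
    """Label each test from repeated gold runs.

    rounds: one dict per evaluation round, mapping test id -> True if the
    gold binary passed that test in that round (assumed input shape).
    """
    # Collect the set of distinct outcomes seen for every test.
    outcomes: dict[str, set[bool]] = defaultdict(set)
    for round_results in rounds:
        for test_id, passed in round_results.items():
            outcomes[test_id].add(passed)

    labels = {}
    for test_id, seen in outcomes.items():
        if seen == {True, False}:
            labels[test_id] = "gold_flaky"   # passed in >=1 round, failed in >=1
        elif seen == {False}:
            labels[test_id] = "gold_fail"    # failed in every round
        else:
            labels[test_id] = "stable_pass"  # hypothetical label: always passed
    return labels
```

With 20 rounds, a test that fails even once out of 20 is labelled gold_flaky, so the more rounds you run, the more low-probability flakes the rule can surface.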

Reference. 102c952 (sync gold_flaky/gold_fail test-ignore updates, #40), 064c3e0 (ignore slow/hanging tests).