A batch of fairness and reliability fixes landed in ProgramBench v1.1.0.
These fixes close newly discovered loopholes that let agents cheat, improve reproducibility, and remove some
false negatives among tests.
Agents exploiting internet access during evaluation
Problem. Internet access is blocked during inference, i.e., while the agent is working on a ProgramBench task (why we block internet),
but the evaluation pipeline ran without the same safeguards. Some clever agents realized that they were in a benchmark and speculated that they could exploit this gap
to circumvent the blocked internet.
How it's fixed. Internet access is now disabled while we execute compile.sh.
What you need to do. Unfortunately, this fix requires rerunning all affected trajectories. You should, however, be able to quickly scan all submitted compile.sh scripts to see which ones
might be affected.
Reference. 57a6d3c (block build-script internet, #41).
Examples.
- Sonnet 4.6 (DuckDB) realizes it has no network (step 62), decides to write a compile.sh that curls the original DuckDB source (step 75), and explicitly gambles that the judge machine has internet even though its own sandbox does not (step 217): "Does the evaluation machine have internet? I think the answer is YES".
- Gemini 3.1 Pro (jsonschema) reimplements the CLI (step 31) but has compile.sh fall back to pip install jsonschema-rs (step 32), the very project it was asked to rebuild from scratch.
- Gemini 3.1 Pro (pingu) decides evaluation can "download the real library" (step 34) and has compile.sh go get the upstream Go dependencies (step 39).
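The quick scan of submitted compile.sh scripts suggested above could look roughly like the following. This is a minimal sketch, not part of ProgramBench: the function names and the pattern list are hypothetical, the patterns are drawn from the examples above (curl, pip install, go get) plus a few other common download commands, and the directory layout of your submissions is an assumption.

```python
import re
from pathlib import Path

# Hypothetical patterns: curl, pip install and go get come from the examples
# above; the rest are common download commands added as assumptions.
NETWORK_PATTERNS = [
    r"\bcurl\b",
    r"\bwget\b",
    r"\bgit\s+clone\b",
    r"\bpip3?\s+install\b",
    r"\bgo\s+(get|install|mod\s+download)\b",
    r"\bcargo\s+(install|fetch)\b",
    r"\bnpm\s+(install|ci)\b",
    r"\bapt(-get)?\s+install\b",
    r"https?://",
]
NETWORK_RE = re.compile("|".join(f"(?:{p})" for p in NETWORK_PATTERNS))


def suspicious_lines(script_text: str) -> list[tuple[int, str]]:
    """Return (line_number, line) pairs that look like network use."""
    hits = []
    for number, line in enumerate(script_text.splitlines(), start=1):
        stripped = line.strip()
        if stripped.startswith("#"):
            continue  # skip shell comments (and the shebang line)
        if NETWORK_RE.search(stripped):
            hits.append((number, stripped))
    return hits


def scan_submissions(root: str) -> dict[str, list[tuple[int, str]]]:
    """Scan every compile.sh under root; keep only scripts with hits."""
    report = {}
    for path in sorted(Path(root).rglob("compile.sh")):
        hits = suspicious_lines(path.read_text(errors="replace"))
        if hits:
            report[str(path)] = hits
    return report
```

A hit only marks a script as a candidate for manual review. The scan can miss implicit network use (for example, a plain build command that fetches dependencies on its own), so an empty report is not proof that a trajectory is unaffected.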
Agents exploiting a readable executable copy at /tmp
Problem. The reference binary is chmod 111 (execute-only) so agents can't read or
disassemble it (see why we don't want that).
But for historical reasons, the task_cleanroom images kept a fully readable copy of the
executable at /tmp/_inject_exe. Some agents scanned the /tmp directory, found this copy,
and proceeded to reverse-engineer it.
How it's fixed. We rebuilt all containers and removed the copy.
What you need to do. Run an LM as a judge on your trajectories to check whether your
agent exploited this, and rerun with the new containers.
Reference. 68e3da2 (evaluate in cleanroom images, #42), 3f57100 (default to v6 images, #46).
Examples. The exploitation ranged from reading a few bytes to shipping the original wholesale.
- Gemini 3.5 Flash (QuickJS) is the single worst case in the whole audit: it copies the readable
/tmp/_inject_exe (step 64), zlib-compresses it (step 70), links the bytes into its own program
(step 77), and commits the stolen binary (step 90) instead of reimplementing anything.
- Gemini 3.1 Pro (QuickJS) performs the static reverse-engineering the rules forbid against the
readable copy: strings (step 58), binwalk (step 59), and readelf (step 70).
- Opus 4.8 (duc) is milder: rather than reading bytes, it runs the original as a behavioral
oracle (step 85) and even spoofs argv[0] so the copy's error messages match (step 136).
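Before running an LM judge over every trajectory, a cheap textual pre-filter can narrow down the candidates. The sketch below is an assumption-laden illustration, not the audit tooling used here: it assumes each trajectory is available as a list of step strings (the real trajectory format may differ), flag_steps is a hypothetical helper, and the tool list extends the strings, binwalk, and readelf examples above with objdump, xxd, and hexdump as guesses.

```python
import re

INJECT_PATH = "/tmp/_inject_exe"  # readable copy named in this post

# strings, binwalk and readelf appear in the examples above; objdump, xxd
# and hexdump are assumed additions.
STATIC_TOOLS_RE = re.compile(r"\b(strings|binwalk|readelf|objdump|xxd|hexdump)\b")


def flag_steps(steps: list[str]) -> list[dict]:
    """Return one record per step whose text mentions the readable copy.

    Step indices are 0-based positions in the input list.
    """
    flagged = []
    for index, text in enumerate(steps):
        if INJECT_PATH not in text:
            continue
        flagged.append({
            "step": index,
            # True when the same step also names a static-analysis tool.
            "static_tool": bool(STATIC_TOOLS_RE.search(text)),
        })
    return flagged
```

This only catches steps that name the path literally. Once an agent has copied the file elsewhere, later steps no longer match, so flagged trajectories still need the LM judge (or a human) to establish what was actually done with the copy.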
Score drift from unpinned toolchains
Problem. Rebuilding a task image months later silently pulled different library and
compiler versions (unpinned apt, pip, cargo, go). This caused score drift unrelated to the
agent: for example, pandoc emitted api-version [1,23,1,2] where the captured golden output had [1,23,1,1],
and jsonschema, oha, bartib, and zk showed similar wording/whitespace drift, so correct binaries
were marked as failures.
How it's fixed. Tier-A reproducibility pinning landed across the base image: Ubuntu by
digest, apt snapshot, hash-locked pip, versioned cargo/go pre-cache, and GOTOOLCHAIN
pinned so Go can't auto-upgrade its patch version. Version-drift test failures that predate
the pin were classified and ignored as deterministic gold drift (tagged
[gold_fail_v6_toolchain]) rather than counted against agents.
What you need to do. Rerun the corresponding instances.
Reference. 3f57100 (default to v6 images, #46), 102c952 (sync [gold_fail_v6_toolchain] ignores, #40).
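To illustrate what "hash-locked pip" means in practice, here is a minimal sketch of a check that every entry in a requirements file carries a --hash option, which is what pip's hash-checking mode expects. The helper name unhashed_requirements is hypothetical and this is not the check used in the ProgramBench images; it handles only backslash line continuations, full-line comments, and option lines starting with a dash.

```python
def unhashed_requirements(requirements_text: str) -> list[str]:
    """Return requirement entries that carry no --hash option."""
    # Join backslash-continued lines into single logical lines.
    logical_lines = []
    pending = ""
    for raw in requirements_text.splitlines():
        line = raw.rstrip()
        if line.endswith("\\"):
            pending += line[:-1] + " "
            continue
        logical_lines.append(pending + line)
        pending = ""
    if pending:
        logical_lines.append(pending)

    missing = []
    for line in logical_lines:
        entry = line.strip()
        if not entry or entry.startswith("#"):
            continue  # blank line or full-line comment
        if entry.startswith("-"):
            continue  # global option line such as --index-url or -r
        if "--hash" not in entry:
            missing.append(entry.split()[0])
    return missing
```

An empty result means every listed requirement is hash-locked. It says nothing about the other pins mentioned above (image digest, apt snapshot, cargo/go pre-cache, GOTOOLCHAIN), which need their own checks.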
Flaky tests
Problem. Some tests are non-deterministic on the gold binary itself — TUI render timing,
screen-init races, local HTTP test-server connection races, SQL row-order without
ORDER BY, etc. These passed in some runs and failed in others, adding noise to every
score.
How it's fixed. We ran the full ~200-instance gold eval 20 independent times. Tests that
passed in ≥1 round and failed in ≥1 round were classified as flaky (42 tests / 16 instances)
and ignored under reason gold_flaky, each annotated with its flake mechanism. Tests that
failed in all 20 rounds were separated out as genuine deterministic gold failures
(gold_fail) — distinguishing real defects from timing noise.
Reference. 102c952 (sync gold_flaky/gold_fail test-ignore updates, #40), 064c3e0 (ignore slow/hanging tests).
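The classification rule described above can be sketched in a few lines. This is an illustration under assumptions, not the project's actual tooling: classify_tests is a hypothetical name, and each round is assumed to be a dict mapping a test name to whether it passed on the gold binary.

```python
from collections import defaultdict


def classify_tests(rounds: list[dict[str, bool]]) -> dict[str, str]:
    """Map each test name to 'pass', 'gold_flaky', or 'gold_fail'."""
    # Collect every observed outcome per test across all rounds.
    outcomes = defaultdict(list)
    for round_results in rounds:
        for test_name, passed in round_results.items():
            outcomes[test_name].append(passed)

    labels = {}
    for test_name, results in outcomes.items():
        if all(results):
            labels[test_name] = "pass"        # passed in every round
        elif any(results):
            labels[test_name] = "gold_flaky"  # >=1 pass and >=1 fail
        else:
            labels[test_name] = "gold_fail"   # failed in every round
    return labels
```

With 20 rounds, a test labeled gold_flaky passed at least once and failed at least once, while gold_fail means 20 failures out of 20. A test missing from some rounds is judged only on the rounds where it appears, which is a simplification of this sketch.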