GPT 5.5 (high) Solves the First Instance!

TL;DR

The first ProgramBench instance is solved! GPT 5.5 (high) and GPT 5.5 (xhigh) both fully resolve the cmatrix instance -- interestingly in two different languages (C vs. Python). GPT 5.5 (xhigh) is significantly better than Claude Opus 4.7 (xhigh) across all metrics.

GPT 5.5 is our first update to the leaderboard. Going forward, we'll release a blog post like this one with each model release, documenting its performance and the interesting, distinct behaviors we notice. More additions are coming soon; stay tuned by following Kilian and John.

The results of adding GPT 5.5 genuinely surprised us. Evaluated with medium reasoning (the vendor's default setting), it barely beat Claude Sonnet 4.6. However, it benefits enormously from higher reasoning effort. GPT 5.5 now solves the first instance (a resolved rate of 0.5%), and it also sets a record high of 26 "almost resolved" tasks (95+% of behavioral tests passing)!

| # | Model | Resolved | Almost |
|---|-------|----------|--------|
| 1 | GPT 5.5 (xhigh) | 0.5% | 13.5% |
| 2 | GPT 5.5 (high) | 0.5% | 5.0% |
| 3 | Claude Opus 4.7 (xhigh) | 0% | 4.5% |
| 4 | Claude Opus 4.7 | 0% | 3.0% |
| 5 | Claude Opus 4.6 | 0% | 2.5% |

Resolved: the number of fully solved instances as measured by the hidden behavioral tests. Note that behavioral tests can never cover all possible inputs; the behavioral tests of ProgramBench can be easily extended should any false positives arise. Almost: instances where the agent's solution solves ≥ 95% of all behavioral tests. See extended results.

The dominance of GPT 5.5 (xhigh) is even stronger in the full cumulative histogram of scores: it is the best model across the entire range, meaning that whichever cutoff you pick (average or median score, ≥ 90% pass rate, ≥ 50% pass rate, and so on), it comes out on top:

(Interactive figure: cumulative histogram of scores per model.)

cmatrix: a detailed comparison

Let's now compare the solutions for cmatrix in detail. All runs used the no_int_no_dep (no internet, no dependencies) configuration.

| Run | Failures | Language | Cost | API calls |
|-----|----------|----------|------|-----------|
| GPT 5.5 (high) | 0 | C (raw ANSI) | $3.17 | 34 |
| GPT 5.5 (xhigh) | 0 | Python 3 | $4.84 | 40 |
| GPT 5.5 | 3 | C (raw ANSI) | $1.04 | 17 |
| Claude Opus 4.7 (xhigh) | 19 | C (ncurses) | $10.74 | 178 |

Common aspects

All four agents followed the same high-level strategy:

  1. Read README.md and cmatrix.1 man page
  2. Probed the original binary's CLI behavior (flags, exit codes, error messages)
  3. Discovered ncurses headers were missing (only runtime .so present; see appendix)
  4. Wrote a single-file reimplementation
  5. Committed and submitted

GPT 5.5 (high) -- full solve

The winning run hit a sweet spot of thorough but efficient exploration: 10 exploration turns probing 40+ flag combinations, followed by the full C implementation written in one pass plus 5 targeted patches.

GPT 5.5 (high) — abishekvashok/cmatrix

GPT 5.5 (xhigh) -- full solve

A strong run that chose Python instead of C. Thorough exploration (27 steps probing every CLI path before writing code), then wrote the entire implementation in one shot as a self-contained Python file.

GPT 5.5 (xhigh) — abishekvashok/cmatrix

GPT 5.5 -- 3 failures

The cheapest run ($1.04, 17 API calls) produced a working solution but didn't probe enough edge cases.

Failure 1: "--" not treated as end-of-options

The agent wrote a custom argument parser instead of using getopt():

for (int i = 1; i < argc; i++) {
    char *a = argv[i];
    if (a[0] != '-' || a[1] == '\0') continue;
    if (strcmp(a, "--help") == 0)    { *do_help = 1; return 0; }
    if (strcmp(a, "--version") == 0) { *do_help = 1; return 0; }
    for (int j = 1; a[j]; j++) {
        switch (a[j]) {
            ...
            default: *do_help = 1; return 0;
        }
    }
}

When argv[1] is "--", it matches neither --help nor --version; the inner loop then processes a[1] = '-', which hits the default: case and prints help instead of entering the render loop. The agent never tested ./executable -- during exploration.

Failures 2-3: Screensaver/quit keystroke detection

The agent used getchar() on a non-blocking file descriptor:

fcntl(STDIN_FILENO, F_SETFL, flags | O_NONBLOCK);
...
while ((ch = getchar()) != EOF) {
    if (o->screensaver) { stop_flag = 1; break; }
}

On a non-blocking fd with no data, read() returns -1 with errno=EAGAIN. getchar() interprets this as EOF and sets the stdio EOF flag permanently. After the first animation tick where no input is available, getchar() returns EOF forever and the program never reads subsequent keystrokes. The fix would be clearerr(stdin) before each read, or using raw read() with select() (as GPT 5.5 (high) did).

The agent never tested screensaver mode (-s) or piped keystroke input during exploration.

GPT 5.5 — abishekvashok/cmatrix

Claude Opus 4.7 (xhigh) -- 19 failures

The most expensive run ($10.74, 178 API calls) with the worst results.

This solution used ncurses, though considerable time went into getting it to work (see appendix).

Two simple bugs account for all 19 failures.

Bug 1: Case-sensitive color parsing (11 failures)

static int parse_color(const char *s) {
    if (!strcmp(s, "green"))   return COLOR_GREEN;
    if (!strcmp(s, "red"))     return COLOR_RED;
    // ... all lowercase comparisons with strcmp()
    return -1;
}

The parser uses strcmp() (case-sensitive) instead of strcasecmp(), so "GREEN", "Red", and "BLUE" all return -1 and are treated as invalid. A one-token fix (strcasecmp) would eliminate all 11 failures.

The agent never tested uppercase or mixed-case color inputs across all 178 steps. It only tested lowercase colors and the invalid color purple.

Bug 2: Wrong exit code for invalid colors (8 failures)

case 'C':
    opt_mcolor = parse_color(optarg);
    if (opt_mcolor < 0) {
        fprintf(stderr, " Invalid color selection\n");
        fprintf(stderr, " Valid colors are green, red, blue, ...\n");
        exit(1);    // should be exit(0)
    }

The original binary exits with code 0 for invalid colors. The agent actually observed this when testing the original binary:

$ ./executable -C purple; echo "exit=$?"
exit=0

But when testing its own implementation, it got exit=1 and never noticed the discrepancy. Adding to the confusion: it also tested valid colors without a proper TTY (TERM=unknown), which caused ncurses initscr() to fail with exit code 1 before the color validation path was reached. Both valid and invalid colors showed exit=1 in the non-TTY environment, masking the bug.

Appendix

An ignored test

While checking the GPT 5.5 (xhigh) result, we found one test that we considered unfair and have since removed: The test_u_out_of_range_or_overflow_values_still_enter_render_loop[args2] test passes -u 999999999999999999999999 and expects the program to enter the render loop. This only works in the original binary because C's atoi() silently overflows the 32-bit int, wrapping to a small/negative value, so napms() gets a tiny argument and the program runs at maximum speed. This is undefined behavior in C, not intentional design.

GPT 5.5 (xhigh)'s Python reimplementation includes a faithful c_atoi():

def c_atoi(s: str) -> int:
    # ... standard atoi parsing logic ...
    val = val * 10 + (ord(s[i]) - 48)
    return sign * val if seen else 0

This replicates the parsing algorithm but not C's overflow semantics. Python integers have arbitrary precision, so c_atoi("999999999999999999999999") returns the exact value. The delay code then calls time.sleep(1e22), which raises OverflowError: timestamp out of range for platform time_t.

This test was marked as ignored_manual because it enshrines accidental C overflow semantics rather than meaningful program behavior.

Missing ncurses headers

The Docker container had the runtime library packages (libncurses6, libncursesw6) installed because the original cmatrix binary needs them to run. The development packages (libncurses-dev) with header files were not installed.

The four agents handled this differently:

GPT 5.5 (high) and GPT 5.5 tried compiling a test program with #include <ncurses.h>, saw the header was missing, and immediately pivoted to raw ANSI escape sequences in C. They never investigated further.

GPT 5.5 (xhigh) did the same single compile check and pivoted to Python with raw ANSI escape sequences:

printf '#include <ncurses.h>\nint main(){return 0;}\n' > /tmp/test_nc.c
gcc /tmp/test_nc.c -lncurses -o /tmp/test_nc
# -> fatal error: ncurses.h: No such file or directory

It never checked for .so files or tried apt-get.

Claude Opus 4.7 (xhigh) spent ~20 steps investigating. It discovered that the runtime .so files existed via ldconfig -p and dpkg -l, then examined their exported symbols with nm -D:

ldconfig -p | grep ncurses
# -> libncursesw.so.6 => /lib/x86_64-linux-gnu/libncursesw.so.6
nm -D /lib/x86_64-linux-gnu/libncursesw.so.6 | grep initscr
# -> all ncurses symbols present

It created a minimal src/curses_decls.h (~106 lines) with hand-written typedefs and extern declarations for ~40 ncurses functions, then linked with:

gcc src/cmatrix.c -l:libncursesw.so.6 -l:libtinfo.so.6 -o executable

This was genuinely creative systems engineering, but the added complexity didn't translate to better scores. The bugs that mattered (color case-sensitivity, exit codes) had nothing to do with the rendering approach.