GPT 5.5 is our first update to the leaderboard. Going forward, we'll release a blog post like this one with each model release, documenting its performance and the interesting, distinct behaviors we notice. Follow Kilian and John to catch future additions.
The results of adding GPT 5.5 genuinely surprised us. Evaluated with medium reasoning (the vendor's default setting), it barely beat Claude Sonnet 4.6. It benefits enormously from higher reasoning effort, however: GPT 5.5 now fully solves an instance for the first time (a score of 0.05%), and it raises the record for "almost resolved" tasks (95+% of unit tests passing) to 26!
The dominance of GPT 5.5 (xhigh) is even clearer in the full cumulative histogram of scores: it is the best model across the entire range. No matter which metric you pick (average or median score, >= 90% pass rate, >= 50% pass rate, etc.), it comes out on top:
cmatrix: a detailed comparison
Let's now compare the solutions for cmatrix in detail. All runs used the no_int_no_dep (no internet, no dependencies) configuration.
Common aspects
All four agents followed the same high-level strategy:
- Read README.md and the cmatrix.1 man page
- Probed the original binary's CLI behavior (flags, exit codes, error messages; a sketch of typical probes follows this list)
- Discovered the ncurses headers were missing (only the runtime .so was present; see appendix)
- Wrote a single-file reimplementation
- Committed and submitted
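To give a flavor of that probing, here is a hypothetical transcript of the kind of checks the agents ran against the original binary (the exact probe set varied per run; the flags shown are from cmatrix's man page):

./executable --help; echo "exit=$?"       # help text and its exit code
./executable -V; echo "exit=$?"           # version output
./executable -C purple; echo "exit=$?"    # invalid color handling
./executable -u 0; echo "exit=$?"         # numeric argument edge case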
GPT 5.5 (high) -- full solve
The winning run hit a sweet spot of thorough but efficient exploration: 10 exploration turns probing 40+ flag combinations, then the full C implementation written in one pass, followed by 5 targeted patches.
GPT 5.5 (xhigh) -- full solve
A strong run that chose Python instead of C. After thorough exploration (27 steps probing every CLI path before writing code), it wrote the entire implementation in one shot as a self-contained Python file.
GPT 5.5 -- 3 failures
The cheapest run ($1.04, 17 API calls) produced a working solution but didn't probe enough
edge cases.
Failure 1: "--" not treated as end-of-options
The agent wrote a custom argument parser instead of using getopt():
for (int i = 1; i < argc; i++) {
    char *a = argv[i];
    if (a[0] != '-' || a[1] == '\0') continue;
    if (strcmp(a, "--help") == 0) { *do_help = 1; return 0; }
    if (strcmp(a, "--version") == 0) { *do_help = 1; return 0; }
    for (int j = 1; a[j]; j++) {
        switch (a[j]) {
            ...
            default: *do_help = 1; return 0;
        }
    }
}
When argv[1] is "--", it matches neither --help nor --version, so the inner loop processes a[1] = '-', which hits the default: case and prints help instead of entering the render loop. The agent never tested ./executable -- during exploration.
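For contrast, here is a minimal sketch of getopt()-based parsing (the flag set is abbreviated and illustrative, not cmatrix's full option table). getopt() consumes "--" itself, returns -1, and leaves optind pointing past it, so execution naturally falls through to the render loop:

#include <stdio.h>
#include <unistd.h>

int main(int argc, char *argv[]) {
    int opt;
    while ((opt = getopt(argc, argv, "absu:C:")) != -1) {
        switch (opt) {
        case 's':           /* screensaver mode */ break;
        case 'u': case 'C': /* flags taking an argument, via optarg */ break;
        case 'a': case 'b': /* simple toggles */ break;
        default:            /* '?' for unknown flags */ return 1;
        }
    }
    /* "./executable --" lands here: "--" was consumed as end-of-options. */
    puts("entering render loop");
    return 0;
}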
Failures 2-3: Screensaver/quit keystroke detection
The agent used getchar() on a non-blocking file descriptor:
fcntl(STDIN_FILENO, F_SETFL, flags | O_NONBLOCK);
...
while ((ch = getchar()) != EOF) {
    if (o->screensaver) { stop_flag = 1; break; }
}
On a non-blocking fd with no data, read() returns -1 with errno=EAGAIN. getchar() reports this as EOF and latches stdio's error state permanently. After the first animation tick where no input is available, getchar() returns EOF forever and the program never reads subsequent keystrokes. The fix would be calling clearerr(stdin) before each read, or using raw read() with select() (as GPT 5.5 (high) did).
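Here is a minimal sketch of that select()-plus-raw-read() approach, assuming the terminal has already been put into non-canonical mode. Because it bypasses stdio entirely, there is no sticky EOF/error flag to clear:

#include <sys/select.h>
#include <unistd.h>

/* Returns the next keystroke, or -1 if none is pending. */
static int poll_keystroke(void) {
    fd_set fds;
    struct timeval tv = {0, 0};  /* zero timeout: poll, don't block */
    FD_ZERO(&fds);
    FD_SET(STDIN_FILENO, &fds);
    if (select(STDIN_FILENO + 1, &fds, NULL, NULL, &tv) <= 0)
        return -1;               /* nothing to read this tick */
    unsigned char ch;
    if (read(STDIN_FILENO, &ch, 1) != 1)
        return -1;               /* raw read(): no stdio state to get stuck */
    return ch;
}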
The agent never tested screensaver mode (-s) or piped keystroke input during exploration.
Claude Opus 4.7 (xhigh) -- 19 failures
The most expensive run ($10.74, 178 API calls) produced the worst results.
This solution used ncurses, though the agent spent a lot of time getting it to work (see appendix).
Two simple bugs account for all 19 failures.
Bug 1: Case-sensitive color parsing (11 failures)
static int parse_color(const char *s) {
if (!strcmp(s, "green")) return COLOR_GREEN;
if (!strcmp(s, "red")) return COLOR_RED;
// ... all lowercase comparisons with strcmp()
return -1;
}
The parser uses case-sensitive strcmp() instead of strcasecmp(), so "GREEN", "Red", and "BLUE" all return -1 and are treated as invalid. A one-token fix (strcasecmp) would eliminate 11 failures.
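A minimal sketch of that one-token fix (strcasecmp() is declared in <strings.h>):

#include <strings.h>  /* strcasecmp(): case-insensitive comparison */

static int parse_color(const char *s) {
    if (!strcasecmp(s, "green")) return COLOR_GREEN;
    if (!strcasecmp(s, "red"))   return COLOR_RED;
    /* ... same one-token change for the remaining colors ... */
    return -1;
}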
The agent never tested uppercase or mixed-case color inputs across all 178 steps. It
only tested lowercase colors and the invalid color purple.
Bug 2: Wrong exit code for invalid colors (8 failures)
case 'C':
opt_mcolor = parse_color(optarg);
if (opt_mcolor < 0) {
fprintf(stderr, " Invalid color selection\n");
fprintf(stderr, " Valid colors are green, red, blue, ...\n");
exit(1); // should be exit(0)
}
The original binary exits with code 0 for invalid colors. The agent actually observed
this when testing the original binary:
$ ./executable -C purple; echo "exit=$?"
exit=0
But when testing its own implementation, it got exit=1 and never noticed the
discrepancy. Adding to the confusion: it also tested valid colors without a proper TTY
(TERM=unknown), which caused ncurses initscr() to fail with exit code 1 before the
color validation path was reached. Both valid and invalid colors showed exit=1 in the
non-TTY environment, masking the bug.
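One hypothetical way to sidestep that masking is to run both binaries under a real pseudo-terminal, e.g. with util-linux's script (its -e flag propagates the child's exit code), so initscr() succeeds and the exit code actually reflects the color-validation path. Here my_cmatrix stands in for the agent's own build:

script -qec './executable -C purple' /dev/null; echo "exit=$?"
# -> exit=0 (original binary)
script -qec './my_cmatrix -C purple' /dev/null; echo "exit=$?"
# -> exit=1 (the discrepancy is now visible)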
Appendix
An ignored test
While checking the GPT 5.5 (xhigh) result, we found one test that we considered unfair and have since removed:
The test_u_out_of_range_or_overflow_values_still_enter_render_loop[args2] test passes -u 999999999999999999999999 and expects the program to enter the render
loop. This only works in the original binary because C's atoi() silently overflows the
32-bit int, wrapping to a small/negative value, so napms() gets a tiny argument and the
program runs at maximum speed. This is undefined behavior in C, not intentional design.
GPT 5.5 (xhigh)'s Python reimplementation wrote a faithful c_atoi():
def c_atoi(s: str) -> int:
    # ... standard atoi parsing logic ...
    val = val * 10 + (ord(s[i]) - 48)
    return sign * val if seen else 0
This replicates the parsing algorithm but not C's overflow semantics. Python integers
have arbitrary precision, so c_atoi("999999999999999999999999") returns the exact value.
The delay code then calls time.sleep(1e22), which raises
OverflowError: timestamp out of range for platform time_t.
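A hypothetical variant that would have matched the original binary's observed behavior is to truncate the exact value to a signed 32-bit integer, mimicking the two's-complement wrap the C binary happens to exhibit (which, again, is undefined behavior in C, not a guarantee):

def c_atoi_wrap32(s: str) -> int:
    # Keep the low 32 bits of the exact value, then reinterpret as signed.
    v = c_atoi(s) & 0xFFFFFFFF
    return v - 0x100000000 if v >= 0x80000000 else v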
This test was marked as ignored_manual because it enshrines accidental C overflow
semantics rather than meaningful program behavior.
Missing ncurses headers
The Docker container had the runtime library packages (libncurses6, libncursesw6) installed because the original cmatrix binary needs them to run. The development package (libncurses-dev) with the header files was not installed.
The four agents handled this differently:
GPT 5.5 (high) and GPT 5.5 tried compiling a test program with #include <ncurses.h>, saw
the header was missing, and immediately pivoted to raw ANSI escape sequences in C. They
never investigated further.
GPT 5.5 (xhigh) did the same single compile check and pivoted to Python with raw ANSI
escape sequences:
printf '#include <ncurses.h>\nint main(){return 0;}\n' > /tmp/test_nc.c
gcc /tmp/test_nc.c -lncurses -o /tmp/test_nc
# -> fatal error: ncurses.h: No such file or directory
It never checked for .so files or tried apt-get.
Claude Opus 4.7 (xhigh) spent ~20 steps investigating. It discovered that the runtime .so
files existed via ldconfig -p and dpkg -l, then examined their exported symbols with
nm -D:
ldconfig -p | grep ncurses
# -> libncursesw.so.6 => /lib/x86_64-linux-gnu/libncursesw.so.6
nm -D /lib/x86_64-linux-gnu/libncursesw.so.6 | grep initscr
# -> all ncurses symbols present
It created a minimal src/curses_decls.h (~106 lines) with hand-written typedefs and
extern declarations for ~40 ncurses functions, then linked with:
gcc src/cmatrix.c -l:libncursesw.so.6 -l:libtinfo.so.6 -o executable
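To give a flavor, such a header looks roughly like the following reconstruction (ours, not the agent's actual file; the exact width of chtype is build-configuration-dependent):

/* curses_decls.h (sketch): just enough to compile against the runtime .so */
typedef struct _win_st WINDOW;  /* opaque: the code only passes pointers around */
typedef unsigned int chtype;    /* assumption: matches the default ncurses6 ABI */

extern WINDOW *initscr(void);
extern int endwin(void);
extern int noecho(void);
extern int curs_set(int);
extern int napms(int);
extern int start_color(void);
extern int init_pair(short, short, short);
extern int mvaddch(int, int, chtype);
extern int refresh(void);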
This was genuinely creative systems engineering, but the added complexity didn't translate
to better scores. The bugs that mattered (color case-sensitivity, exit codes) had nothing
to do with the rendering approach.