Submission Guide — ProgramBench

Today, we're creating a ProgramBench leaderboard so the community can open-source and share their results.

This guide explains how to turn your ProgramBench evaluation results into a leaderboard submission.

Before you start

Required:

programbench: uvx programbench (or pip install programbench).
A GitHub account: your submission lives in a public repo you own.

Note: these steps require programbench 1.2.0 or later. Update with uvx programbench@latest (or pip install -U programbench).

Optional (each just removes a manual step):

gh (the GitHub CLI), logged in: lets the publish step create the repo and the register step open the PR for you; without it, you do both by hand.
A HuggingFace token (huggingface-cli login): lets package --upload-to host the heavy artifacts so your git repo stays small; without it, everything stays local (fine for small runs).

What you'll end up with

A public GitHub repo: your packaged run dir, containing
- submission.yaml
- README.md
- _stats/
- _scripts/
- one folder per task, containing:
  - *.traj.json
  - *.eval.json
  - .url + .sha256 pointers (only if you used --upload-to)
A HuggingFace dataset (if you used --upload-to): the heavy submission.tar.gz solutions and eval.log.json logs.
A registry PR to ProgramBench/submissions adding submissions/<id>/; on merge, your leaderboard row appears.

We use our Gemini 3.1 Pro + mini-SWE-agent baseline as the running example (submission repo, HuggingFace dataset).

0. The evaluation run directory

After programbench eval, you should already have a run directory, with one folder per task:

20260429_mini-v2.2.6_gemini-3-1-pro/
  cmatsuoka__figlet.202a0a8/
    submission.tar.gz                        # the solution your agent built
    cmatsuoka__figlet.202a0a8.eval.json      # eval results
    cmatsuoka__figlet.202a0a8.traj.json      # the agent's trajectory
  ...                                        # one folder per task
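
Before packaging, it can help to confirm that every task folder has the three files shown above. The following is a minimal sketch, not part of programbench: the function name missing_files is hypothetical, and it assumes the file names follow the <task>/<task>.eval.json pattern from the example layout.

```python
from pathlib import Path


def missing_files(run_dir):
    """Map each task folder name to the expected files it lacks."""
    problems = {}
    for task_dir in sorted(Path(run_dir).iterdir()):
        # Skip stray files and helper folders such as _stats/ or _scripts/.
        if not task_dir.is_dir() or task_dir.name.startswith(("_", ".")):
            continue
        task = task_dir.name
        # ASSUMPTION: names mirror the example layout in this guide.
        expected = [
            "submission.tar.gz",
            f"{task}.eval.json",
            f"{task}.traj.json",
        ]
        missing = [name for name in expected if not (task_dir / name).is_file()]
        if missing:
            problems[task] = missing
    return problems
```

Calling missing_files("20260429_mini-v2.2.6_gemini-3-1-pro/") returns an empty dict when every task folder is complete.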

Everything below happens in place, and you never hand-edit a score.

1. Package

programbench submit package 20260429_mini-v2.2.6_gemini-3-1-pro/ --upload-to programbench

The package subcommand turns the run dir into a submission.

It writes a submission.yaml manifest and automatically generates a _stats/score.json (per-instance, per-test pass/fail).
It splits each large eval.json into a light eval.json (kept in git) plus a heavy eval.log.json (raw log + failure text).
(If --upload-to specified) It uploads each submission.tar.gz and eval.log.json to a per-submission HuggingFace dataset, leaving a .url + .sha256 behind.
The HF dataset is automatically named <org>/<run name> (programbench/20260429_mini-v2.2.6_gemini-3-1-pro in the example above).
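
The .sha256 pointer lets anyone confirm that a downloaded artifact is the one that was packaged. The sketch below shows the idea under a stated assumption: this guide does not specify the pointer file's layout, so the code assumes the hex digest is the first whitespace-separated token in the file. The function names are hypothetical and not part of programbench.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large tarballs are never fully in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_pointer(artifact_path, pointer_path):
    """Return True if the artifact's SHA-256 equals the digest in the pointer."""
    # ASSUMPTION: the pointer's first token is the hex digest.
    tokens = Path(pointer_path).read_text().split()
    return bool(tokens) and tokens[0].lower() == sha256_of(artifact_path)
```

For an independent check of the full submission, prefer programbench submit verify, described at the end of this guide.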

Omit --upload-to to keep everything local. That is fine for smaller runs, but be mindful that solutions and logs get large (ours shrank from 1.7 GB to ~280 MB once the heavy artifacts were hosted).

2. Fill in submission information

Edit submission.yaml: your name, the model/provider, the agent/scaffold, and is_os_model / is_os_scaffold.
Edit README.md: provide a short system description, and make sure all items on the checklist are satisfied.

3. Add cost / calls stats (optional)

Extended statistics, such as average cost and number of calls, are shown in the extended results on the website.

If you would like these shown for your submission, do the following:

In _scripts/: include one script for each statistic you compute from the submission's trajectories (examples for costs, calls).
In _stats/: include the output files (examples for costs, calls). Each output file must conform to the format {<instance_id>: <stat>}.
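
As a rough illustration, a statistics script might look like the sketch below. It is not one of the official examples. It assumes that the instance id is the task folder name and that each trajectory stores its call count under info.model_stats.api_calls; that key path and the output file name calls.json are assumptions, so adjust both to what your scaffold actually writes.

```python
import json
from pathlib import Path


def calls_per_instance(run_dir):
    """Build {<instance_id>: <stat>} from every task's trajectory file."""
    stats = {}
    for traj_path in sorted(Path(run_dir).glob("*/*.traj.json")):
        instance_id = traj_path.parent.name  # ASSUMPTION: folder name is the id
        traj = json.loads(traj_path.read_text())
        # ASSUMPTION: hypothetical key path; missing values become None.
        model_stats = traj.get("info", {}).get("model_stats", {})
        stats[instance_id] = model_stats.get("api_calls")
    return stats


def write_stat(run_dir, name, stats):
    """Write the mapping to _stats/<name>.json and return the output path."""
    out_dir = Path(run_dir) / "_stats"
    out_dir.mkdir(exist_ok=True)
    out_path = out_dir / f"{name}.json"
    out_path.write_text(json.dumps(stats, indent=2, sort_keys=True))
    return out_path
```

A script saved in _scripts/ could then call write_stat(run_dir, "calls", calls_per_instance(run_dir)) to produce the output file.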

4. Publish

programbench submit publish 20260429_mini-v2.2.6_gemini-3-1-pro/ --owner <your-org>

This commits the run dir and pushes it to a public GitHub repo: its permanent, citable home.

If you have gh installed, this step is automatic.
Otherwise, pass --remote <url> pointing to a repo you created, or follow the printed steps.

5. Register

programbench submit register 20260429_mini-v2.2.6_gemini-3-1-pro/

This opens a pull request against the submissions registry.

It adds submissions/<id>/ with your pointer.yaml (repo URL + pinned commit), submission.yaml, and _stats/.
That PR is the actual "submit", and on merge your row appears on the leaderboard.

Verifying

Submissions are public, so anyone can check one with just the installed package:

git clone <a-public-submission> && cd <it>
programbench submit verify .            # check score.json matches the eval.json (instant, no Docker)
programbench submit verify . --tier1    # fetch each solution from HF and re-run eval

Tier-0 checks that the submitted score.json faithfully reflects each eval.json (no scores are stored to compare against). Tier-1 downloads each submission.tar.gz and re-runs evaluation to confirm the artifacts reproduce it. Cost and calls are self-reported from your trajectories, so only the score is independently re-verifiable.
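
The Tier-0 idea can be sketched as a plain consistency check: re-derive the per-test results from each eval.json and compare them with what score.json claims. The code below is only an illustration of that comparison, not the verifier's implementation. It assumes both sides have already been loaded into dicts of the shape {instance_id: {test_name: passed}}, which is a hypothetical schema; the real file formats may differ.

```python
def score_mismatches(score, derived):
    """Return the sorted instance ids where score.json and eval.json disagree.

    score:   per-instance results as claimed in score.json
    derived: per-instance results re-derived from each eval.json
    Both use the hypothetical shape {instance_id: {test_name: passed}}.
    """
    instance_ids = set(score) | set(derived)
    # An instance present on only one side also counts as a mismatch.
    return sorted(i for i in instance_ids if score.get(i) != derived.get(i))
```

An empty result means the two sources agree for every instance under these assumptions.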