Today, we're creating a ProgramBench leaderboard to enable the community to open source and share their results.
This guide provides the complete instructions on turning your ProgramBench evaluation results into a leaderboard submission.
Before you start
Required:
- programbench: uvx programbench (or pip install programbench).
- A GitHub account: your submission lives in a public repo you own.
Note: these steps require programbench 1.2.0 or later. Update with uvx programbench@latest (or pip install -U programbench).
Optional (each just removes a manual step):
- gh, logged in: lets publish create the repo and register open the PR for you; without it you do those two by hand.
- A HuggingFace token (huggingface-cli login): lets package --upload-to host the heavy artifacts so your git repo stays small; without it everything stays local (fine for small runs).
What you'll end up with
- A public GitHub repo: your packaged run dir, containing
  - submission.yaml
  - README.md
  - _stats/
  - _scripts/
  - per-task folder:
    - *.traj.json
    - *.eval.json
    - .url + .sha256 pointers
- A HuggingFace dataset (if you used --upload-to): the heavy submission.tar.gz solutions and eval.log.json logs.
- A registry PR to ProgramBench/submissions adding submissions/<id>/; on merge, your leaderboard row appears.
We use our Gemini 3.1 Pro + mini-SWE-agent baseline as the running example (submission repo, HuggingFace dataset).
0. The evaluation run directory
After programbench eval, you should already have a run directory, with one folder per task:
20260429_mini-v2.2.6_gemini-3-1-pro/
  cmatsuoka__figlet.202a0a8/
    submission.tar.gz                      # the solution your agent built
    cmatsuoka__figlet.202a0a8.eval.json    # eval results
    cmatsuoka__figlet.202a0a8.traj.json    # the agent's trajectory
  ...                                      # one folder per task
Everything below happens in place, and you never hand-edit a score.
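Before packaging, it can help to confirm that every task folder has the three files shown in the layout above. The following is a minimal sketch, not part of programbench: check_run_dir is a hypothetical helper, and it assumes each task's eval and trajectory files are named after their folder, as in the figlet example.

```python
from pathlib import Path


def check_run_dir(run_dir: str) -> dict[str, list[str]]:
    """Map each task folder name to the expected files it is missing."""
    missing: dict[str, list[str]] = {}
    for task_dir in sorted(Path(run_dir).iterdir()):
        # Skip loose files, hidden folders, and underscore-prefixed
        # folders such as _stats/ or _scripts/.
        if not task_dir.is_dir() or task_dir.name.startswith(("_", ".")):
            continue
        # Assumption: files are named after the task folder, as in the example.
        expected = [
            "submission.tar.gz",
            f"{task_dir.name}.eval.json",
            f"{task_dir.name}.traj.json",
        ]
        absent = [name for name in expected if not (task_dir / name).exists()]
        if absent:
            missing[task_dir.name] = absent
    return missing
```

Calling check_run_dir("20260429_mini-v2.2.6_gemini-3-1-pro") returns an empty dict when every task folder is complete, and otherwise lists the missing files per task.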
1. Package
programbench submit package 20260429_mini-v2.2.6_gemini-3-1-pro/ --upload-to programbench
The package subcommand turns the run dir into a submission.
- It writes a submission.yaml manifest and automatically generates a _stats/score.json (per-instance, per-test pass/fail).
- It splits each large eval.json into a light eval.json (kept in git) plus a heavy eval.log.json (raw log + failure text).
- (If --upload-to is specified) It uploads each submission.tar.gz and eval.log.json to a per-submission HuggingFace dataset, leaving a .url + .sha256 behind.
- The name of the HF dataset will automatically be <org>/<run name> (so programbench/20260429_mini-v2.2.6_gemini-3-1-pro for the above).
Omit --upload-to to keep everything local.
That is fine for smaller runs, but be mindful that solutions and logs get large (ours went from 1.7 GB to ~280 MB once the heavy artifacts were hosted).
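If you want to double-check a hosted artifact by hand, the .sha256 pointer gives you something to compare a downloaded file against. The sketch below is illustrative only: matches_pointer is a hypothetical helper, and it assumes the pointer file's first whitespace-separated token is the hex digest, which the pointer format may or may not follow.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hex SHA-256 of a file, read in chunks so large archives stay out of memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_pointer(artifact: Path, pointer: Path) -> bool:
    """Compare an artifact against the digest stored in a .sha256 pointer.

    Assumption: the pointer's first whitespace-separated token is the hex digest.
    """
    expected = pointer.read_text().split()[0].lower()
    return sha256_of(artifact) == expected
```

For example, after downloading a solution archive you could call matches_pointer on the archive and its pointer file; the exact pointer file names are not specified here, so adjust the paths to what package leaves behind.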
2. Fill in submission information
- Edit submission.yaml: your name, the model/provider, the agent/scaffold, and is_os_model / is_os_scaffold.
- Edit README.md: provide a short system description, and make sure all items on the checklist are satisfied.
3. Add cost / calls stats (optional)
Additional statistics, such as average cost and number of calls, appear in the extended results on the website.
If you would like these statistics surfaced for your submission, do the following:
- In _scripts/: include a script for each statistic you compute against a submission's trajectories (examples for costs, calls).
- In _stats/: include the output files (examples for costs, calls). Each script's output must conform to the format {<instance_id>: <stat>}.
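As a rough illustration of what such a script might look like, here is a minimal sketch that reads one value from each trajectory and writes a {<instance_id>: <stat>} file. Everything specific in it is an assumption: collect_stat and write_stat are hypothetical helpers, the instance ID is taken to be the task folder name, and the key path into the trajectory JSON depends on your scaffold.

```python
import json
from pathlib import Path


def collect_stat(run_dir: str, key_path: tuple[str, ...]) -> dict[str, float]:
    """Return {<instance_id>: <stat>} read from each task's *.traj.json."""
    stats: dict[str, float] = {}
    for traj_file in sorted(Path(run_dir).glob("*/*.traj.json")):
        # Assumption: the task folder name is the instance ID.
        instance_id = traj_file.parent.name
        value = json.loads(traj_file.read_text())
        for key in key_path:  # walk down the nested keys
            value = value[key]
        stats[instance_id] = value
    return stats


def write_stat(run_dir: str, name: str, stats: dict[str, float]) -> Path:
    """Write the stats to <run_dir>/_stats/<name>.json (hypothetical file name)."""
    out = Path(run_dir) / "_stats" / f"{name}.json"
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(json.dumps(stats, indent=2, sort_keys=True))
    return out
```

If your trajectories stored a per-instance cost under, say, info → model_stats → instance_cost (a hypothetical path; check your own files), you would call collect_stat(run_dir, ("info", "model_stats", "instance_cost")) and pass the result to write_stat. Compare the output against the linked examples before submitting, since they are the reference for file names and layout.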
4. Publish
programbench submit publish 20260429_mini-v2.2.6_gemini-3-1-pro/ --owner <your-org>
This commits the run dir and pushes it to a public GitHub repo: its permanent, citable home.
- If you have gh installed, this step is automatic.
- Otherwise, pass --remote <url> pointing to a repo you made, or follow the printed steps.
5. Register
programbench submit register 20260429_mini-v2.2.6_gemini-3-1-pro/
This opens a pull request against the submissions registry.
- It adds submissions/<id>/ with your pointer.yaml (repo URL + pinned commit), submission.yaml, and _stats/.
- That PR is the actual "submit", and on merge your row appears on the leaderboard.
Verifying
Submissions are public, so anyone can check one with just the installed package:
git clone <a-public-submission> && cd <it>
programbench submit verify . # check score.json matches the eval.json (instant, no Docker)
programbench submit verify . --tier1 # fetch each solution from HF and re-run eval
Tier-0 checks that the submitted score.json faithfully reflects each eval.json (no scores are stored to compare against).
Tier-1 downloads each submission.tar.gz and re-runs evaluation to confirm the artifacts reproduce it.
Cost and calls are self-reported from your trajectories, so only score is independently re-verifiable.