comparotor
rotorbench-aero · v0.1 · 2026Q2

Methodology

How we score models, why the test set rotates, and the audit trail behind every published number.

§01 How a model gets scored

Each submission travels through a deterministic six-step pipeline. The runner is hermetic — it pulls the artefact, runs against a fixed test split, and posts results back over a signed webhook.

01. POST /submissions · client → API
02. R2 archive · artefact sealed
03. Cloudflare Queue · eval-runner pulls
04. Inference · 200 airfoils × 240 ops
05. SU2 oracle · high-fidelity labels
06. POST /webhooks · HMAC-signed metrics
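For illustration, here is what the receiving side of step 06 might check. This is a minimal sketch assuming HMAC-SHA256 over the raw request body with a hex-encoded `x-signature` header; the header name and encoding are our assumptions, and the OpenAPI contract is authoritative.

```ts
import { createHmac, timingSafeEqual } from "node:crypto";

// Verify an HMAC-SHA256 signature over the raw webhook body.
// Assumed: hex-encoded signature in an `x-signature` header and a
// shared secret issued at submission time.
function verifyWebhook(rawBody: string, signatureHex: string, secret: string): boolean {
  const expected = createHmac("sha256", secret).update(rawBody).digest();
  const received = Buffer.from(signatureHex, "hex");
  // timingSafeEqual throws on length mismatch, so compare lengths first.
  return received.length === expected.length && timingSafeEqual(received, expected);
}
```

Verifying over the raw body, rather than a re-serialised JSON object, avoids signature breakage from key reordering.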

§02 Contamination resistance

The private test set rotates every quarter: 50 of the 200 airfoils are replaced with fresh procedural perturbations of their parent geometries, generated under a fixed seed schedule. A submission cannot have pre-trained on the rotated airfoils because they did not exist when the model was trained.

[Figure: rotation schedule for 2026Q2 through 2027Q1. Each quarter keeps the 150 stable airfoils; the remaining 50 are the slice rotated that quarter.]

The 150 stable airfoils give cross-quarter comparability. A model submitted in 2026Q2 and re-run in 2027Q1 will show its drift against the rotated 50 alongside its persistence on the stable 150.
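One way a fixed seed schedule can be made auditable is to derive each quarter's seed from a published root seed and the quarter label. A minimal sketch, with the root-seed constant and the derivation entirely illustrative rather than the production pipeline:

```ts
import { createHash } from "node:crypto";

// Hypothetical published constant; the real schedule lives in SPEC.md.
const ROOT_SEED = "rotorbench-aero-v0.1";

// Deterministically derive the PRNG seed that drives a quarter's
// procedural perturbations: hash the root seed with the quarter label
// and take the first 8 bytes.
function quarterSeed(quarter: string): bigint {
  const digest = createHash("sha256").update(`${ROOT_SEED}:${quarter}`).digest();
  return digest.readBigUInt64BE(0);
}

// Anyone holding the root seed can reproduce the 2026Q3 rotation.
console.log(quarterSeed("2026Q3"));
```

Because the seed is a pure function of public inputs, a disputed rotation can be regenerated and compared byte for byte.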

§03 The composite score, derived from first principles

The composite is a weighted sum of five normalised metrics plus a latency tiebreaker. Each weight is chosen so that a real, useful improvement on one metric has roughly the same scoring impact as an equivalent improvement on its peers; see the annotations below.

composite = MAE_Cl + 10·MAE_Cd + 0.5·MAE_Cm + 0.2·(1 − ρ_LD) + 0.1·OOD_score + 0.001·latency_p50_ms
MAE_Cl · primary lift error; directly used in design loops
10·MAE_Cd · weighted up because Cd spans a range ~10× narrower than Cl
0.5·MAE_Cm · moment matters less in 2D scoring; halved
0.2·(1 − ρ_LD) · rank correlation is already in [0, 1]; a small weight is enough
0.1·OOD_score · OOD is a separate guardrail, not the dominant signal
0.001·latency_p50_ms · tiebreaker only; lets a fast model edge out a marginally more accurate slow one

Lower composite is better. The leaderboard's default sort is by composite ascending.
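A direct transcription of the formula, useful for checking a published row by hand. The field names here are illustrative, not the D1 schema's:

```ts
// Metric fields mirror the terms of the composite formula above.
interface Metrics {
  maeCl: number;        // MAE on lift coefficient
  maeCd: number;        // MAE on drag coefficient
  maeCm: number;        // MAE on moment coefficient
  rhoLD: number;        // rank correlation of L/D ordering, in [0, 1]
  oodScore: number;     // out-of-distribution guardrail score
  latencyP50Ms: number; // median inference latency, milliseconds
}

function composite(m: Metrics): number {
  return (
    m.maeCl +
    10 * m.maeCd +
    0.5 * m.maeCm +
    0.2 * (1 - m.rhoLD) +
    0.1 * m.oodScore +
    0.001 * m.latencyP50Ms // tiebreaker: one millisecond costs 0.001
  );
}
```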

§04 What we archive

Every submission's artefact is sealed in R2 for 24 months. If contamination is later suspected, we can replay the same artefact against a fresh test set. The bundle below is what gets archived per submission.

r2://comparotor-submissions/<submission_id>/ (retained 24 months)

- MODEL · model.onnx · 12–500 MB
- META · wrapper.json · <2 KB
- META · container.txt · <256 B
- ATTEST · submitter.txt · <512 B
- MANIFEST · SHA-256SUMS · <512 B
- META · submitted_at.txt · <32 B
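Anyone who pulls a bundle can recheck its integrity against the manifest. A sketch assuming SHA-256SUMS follows the conventional `<hex>  <filename>` line format of sha256sum(1); SPEC.md defines the actual layout:

```ts
import { createHash } from "node:crypto";
import { readFile } from "node:fs/promises";

// Recompute the digest of every file named in the manifest and compare.
async function verifyBundle(dir: string): Promise<boolean> {
  const manifest = await readFile(`${dir}/SHA-256SUMS`, "utf8");
  for (const line of manifest.trim().split("\n")) {
    const [expected, name] = line.split(/\s+/); // "<hex>  <filename>"
    const actual = createHash("sha256")
      .update(await readFile(`${dir}/${name}`))
      .digest("hex");
    if (actual !== expected) return false; // corrupted or tampered bundle
  }
  return true;
}
```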

§05 Audit and replay

If a submission is suspected of having seen the test set during training, we can replay it against a freshly rotated quarter without the submitter touching anything. The score difference is the audit signal.

01. Original score · 2026Q2 test set
02. Replay artefact · 2026Q3 test set
03. Suspect contamination · if Δ-score > 2σ
04. Audit signal · published in run history
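A sketch of the step-03 check. How σ is estimated is our assumption here (the spread of replay deltas across trusted submissions); the published procedure may differ:

```ts
// Flag a submission when its replay delta exceeds twice the standard
// deviation of deltas observed for submissions believed to be clean.
function isContaminationSuspect(
  originalScore: number, // composite on the original quarter
  replayScore: number,   // composite on the rotated quarter
  cleanDeltas: number[], // replay deltas of trusted submissions
): boolean {
  const mean = cleanDeltas.reduce((a, b) => a + b, 0) / cleanDeltas.length;
  const variance =
    cleanDeltas.reduce((a, d) => a + (d - mean) ** 2, 0) / cleanDeltas.length;
  return Math.abs(replayScore - originalScore) > 2 * Math.sqrt(variance);
}
```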

Want the full spec?

The complete SPEC.md covers the SU2 RANS generation pipeline, the full D1 schema, and the OpenAPI contract.