Insights

Analysis of 31 models across 40 benchmarks and 3 arms: 3,641 runs, 1,945 solved. Every claim cites specific data from experiment-scores.json. No marketing.

Methodology

Every insight article uses the same dataset: public/experiment-scores.json, built from the full 3-arm experiment run. Numbers are computed, not claimed. The leaderboard on the home page shows the same data without commentary.
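A minimal sketch of how headline numbers can be computed from such a dataset. The field names (`model`, `benchmark`, `arm`, `solved`) and the flat list-of-runs shape are assumptions for illustration; the real schema of public/experiment-scores.json may differ.

```python
import json

def summarize(runs):
    """Count total and solved runs from a list of run records.

    Assumed record shape (hypothetical, not the real schema):
    {"model": ..., "benchmark": ..., "arm": ..., "solved": bool}
    """
    total = len(runs)
    solved = sum(1 for r in runs if r["solved"])
    return {"runs": total, "solved": solved}

# Tiny stand-in dataset in place of public/experiment-scores.json.
example = [
    {"model": "m1", "benchmark": "b1", "arm": "a", "solved": True},
    {"model": "m1", "benchmark": "b1", "arm": "b", "solved": False},
]
print(summarize(example))  # {'runs': 2, 'solved': 1}
```

The same aggregation, grouped by model or arm, would reproduce a leaderboard without any manual tallying.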

Each model ran three arms with an identical prompt and an identical benchmark set.

Scoring is deterministic: test.sh exits 0 when the fix works and 1 when it does not. No LLM-as-judge, no subjective rubric.
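The exit-code contract above can be sketched as a tiny harness. The directory layout and function name here are hypothetical; the only thing taken from the source is that a run is solved iff the benchmark's test.sh exits 0.

```python
import os
import subprocess
import tempfile

def score(benchmark_dir):
    """Run the benchmark's test.sh; the exit code is the verdict.

    Returns True (solved) on exit 0, False (failed) otherwise.
    No judge model, no rubric -- just the exit status.
    """
    proc = subprocess.run(
        ["sh", "test.sh"],
        cwd=benchmark_dir,
        capture_output=True,
    )
    return proc.returncode == 0

# Demo with a stand-in test.sh that passes.
with tempfile.TemporaryDirectory() as d:
    with open(os.path.join(d, "test.sh"), "w") as f:
        f.write("exit 0\n")
    print(score(d))  # True
```

Because the verdict is a process exit code, reruns of the same fix against the same test script give the same score.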