Drop-in improvements to METR's methodology — three metrics (p50, p80, G[T]), bootstrap median crossings, validated on 18 frontier AI models
This is the interactive companion to the paper:
Read the Paper on arXiv →

Logistic regression assumes a symmetric sigmoid. Real success-vs-difficulty curves are often asymmetric: quick saturation on easy tasks, slow decay on hard ones.
Isotonic regression (monotone decreasing) makes no parametric assumption. It adapts its shape freely to the data, preserving only the prior that harder tasks should not be easier.
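The monotone-decreasing fit can be sketched with the classic pool-adjacent-violators algorithm. This is a pure-NumPy illustration, not the repo's implementation, and the data values are made up:

```python
import numpy as np

def isotonic_decreasing(y):
    """Non-increasing least-squares fit via pool-adjacent-violators.

    y: observed success rates, ordered by increasing task difficulty
    (e.g. log human completion time). Unit weight per point.
    """
    blocks = []  # [value, weight] pairs for the increasing fit of -y
    for v in -np.asarray(y, float):
        blocks.append([v, 1.0])
        # Pool while adjacent blocks violate the increasing constraint.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            v2, w2 = blocks.pop()
            v1, w1 = blocks.pop()
            blocks.append([(v1 * w1 + v2 * w2) / (w1 + w2), w1 + w2])
    return np.concatenate([np.full(int(w), -v) for v, w in blocks])

# Illustrative success rates by increasing difficulty (noisy, non-monotone).
fit = isotonic_decreasing([1.0, 0.9, 1.0, 0.7, 0.8, 0.4, 0.5, 0.1, 0.0])
# fit is the closest non-increasing sequence in least squares.
```

Violating neighbors are averaged into flat segments, so the curve adapts freely to the data while enforcing only "harder tasks should not be easier."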
A single threshold crossing (p50) is sensitive to local curve shape and jumps discontinuously for step-function fits. One number cannot capture the full picture of capability growth.
We report three complementary metrics: p50 (median-difficulty frontier), p80 (easy-task frontier where models achieve 80% success), and G[T] (geometric mean capability time from the full curve integral). For isotonic regression, threshold crossings use the bootstrap median curve instead of the single full-data fit, producing smoother per-model values.
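Given one fitted success curve, the three metrics can be sketched as below. The G[T] formula treats -dS as a density over log time, which is one plausible reading of "full curve integral"; the paper's exact definition, and all names here, are assumptions rather than the repo's API:

```python
import numpy as np

def curve_metrics(log_t, s):
    """p50/p80 threshold crossings and G[T] from a fitted success curve.

    log_t: ascending log task times; s: non-increasing success probabilities
    spanning 1 down to 0 (e.g. an isotonic fit).
    """
    log_t, s = np.asarray(log_t, float), np.asarray(s, float)

    def crossing(p):
        # s decreases, so reverse both arrays for np.interp (needs ascending x).
        return float(np.exp(np.interp(p, s[::-1], log_t[::-1])))

    # E[log T] under the -dS density; integration by parts gives
    # log_t[0] + the trapezoid integral of S over log time.
    integral = float(np.sum((s[:-1] + s[1:]) / 2 * np.diff(log_t)))
    g_t = float(np.exp(log_t[0] + integral))
    return crossing(0.5), crossing(0.8), g_t

p50, p80, g = curve_metrics(np.log([1, 2, 4, 8, 16]), [1, 0.8, 0.5, 0.2, 0])
# -> p50 = 4.0, p80 = 2.0, g = 4.0 on this toy curve
```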
Standard bootstrap is inconsistent for isotonic regression, which converges at rate n^(1/3), not n^(1/2).
Use m-out-of-n bootstrap with m = n^(2/3). This restores theoretical consistency for shape-constrained estimators.
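The resampling step is a small change to the standard bootstrap: draw m = n^(2/3) points per replicate instead of n, then take the pointwise median of the refitted curves. A hedged sketch (function names and the 200-replicate default are illustrative, not the repo's):

```python
import numpy as np

def m_out_of_n_median_curve(x, y, fit_fn, grid, n_boot=200, seed=0):
    """Pointwise bootstrap-median curve with m-out-of-n resampling.

    fit_fn(x_sub, y_sub, grid) -> fitted values on grid (e.g. an isotonic
    fit evaluated on grid; the signature is a placeholder). Using
    m = n^(2/3) < n resampled points restores consistency for
    cube-root-rate, shape-constrained estimators.
    """
    rng = np.random.default_rng(seed)
    n = len(x)
    m = max(2, int(round(n ** (2 / 3))))
    curves = np.empty((n_boot, len(grid)))
    for b in range(n_boot):
        idx = rng.choice(n, size=m, replace=True)
        order = np.argsort(x[idx])  # fit functions expect sorted difficulty
        curves[b] = fit_fn(x[idx][order], y[idx][order], grid)
    # Threshold crossings are then read off this median curve, not the
    # single full-data fit.
    return np.median(curves, axis=0)
```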
| Method | CV-Brier | CV-LogLoss | p50 Dbl | p50 R² | p80 Dbl | p80 R² | G[T] Dbl | G[T] R² |
|---|---|---|---|---|---|---|---|---|
| Logistic (METR) | 0.1212 | 0.376 | 121 d | 0.916 | 121 d | 0.901 | 128 d | 0.916 |
| Isotonic | 0.1115 | 0.368 | 130 d | 0.852 | 111 d | 0.881 | 130 d | 0.921 |
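The CV-Brier and CV-LogLoss columns score held-out per-task predictions; the cross-validation split itself lives in the repo, but the two scoring rules are standard:

```python
import numpy as np

def brier(y, p):
    """Mean squared error between binary outcomes and predicted probabilities."""
    y, p = np.asarray(y, float), np.asarray(p, float)
    return float(np.mean((p - y) ** 2))

def log_loss(y, p, eps=1e-15):
    """Negative mean log-likelihood; eps clipping guards against log(0)."""
    y = np.asarray(y, float)
    p = np.clip(np.asarray(p, float), eps, 1 - eps)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))
```

Lower is better for both, which is why isotonic's 0.1115 Brier and 0.368 log loss beat the logistic baseline in the table above.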
Both methods plotted on the same axes. Blue diamonds = logistic, green circles = isotonic (bootstrap median crossings).
Explore per-model curve fits with bootstrap median, p50/p80/G[T] trend charts, comparison tables, and binned probability checks.
Open Detailed Comparison →

18 frontier AI models from 4 families, released between March 2023 and December 2025:
| Model | Release Date | p80 | p50 | G[T] |
|---|---|---|---|---|
| GPT-4 | 2023-03-14 | 49 sec | 3.1 min | 3.4 min |
| GPT-4 1106 | 2023-11-06 | 52 sec | 2.0 min | 3.5 min |
| Claude 3 Opus | 2024-03-04 | 37 sec | 1.6 min | 3.2 min |
| GPT-4 Turbo | 2024-04-09 | 59 sec | 3.7 min | 3.7 min |
| GPT-4o | 2024-05-13 | 1.5 min | 8.5 min | 6.9 min |
| Claude 3.5 Sonnet | 2024-06-20 | 2.0 min | 14.5 min | 12.1 min |
| o1-preview | 2024-09-12 | 3.0 min | 46.5 min | 18.9 min |
| Claude 3.5 Sonnet (New) | 2024-10-22 | 2.4 min | 45.4 min | 16.9 min |
| o1 | 2024-12-05 | 6.1 min | 1.0 hr | 35.7 min |
| Claude 3.7 Sonnet | 2025-02-24 | 21.7 min | 57.8 min | 51.6 min |
| o3 | 2025-04-16 | 51.4 min | 1.6 hr | 1.8 hr |
| Claude 4 Opus | 2025-05-22 | 49.2 min | 1.7 hr | 1.4 hr |
| Claude 4.1 Opus | 2025-08-05 | 51.3 min | 1.6 hr | 1.4 hr |
| GPT-5 | 2025-08-07 | 1.1 hr | 1.8 hr | 2.6 hr |
| Gemini 3 Pro | 2025-11-18 | 1.4 hr | 2.0 hr | 3.5 hr |
| GPT-5.1 Codex-Max | 2025-11-19 | 52.5 min | 1.9 hr | 2.9 hr |
| Claude Opus 4.5 | 2025-11-24 | 1.5 hr | 3.7 hr | 3.8 hr |
| GPT-5.2 | 2025-12-11 | 1.7 hr | 6.2 hr | 4.4 hr |
All values from isotonic regression. p50/p80 use bootstrap median crossings for smoother estimates.
All code and data are open source. Reproduce the entire analysis in one command:
```shell
git clone https://github.com/guyko81/metr-ablation.git
cd metr-ablation
pip install -r requirements.txt
python run.py
```
Generates all 36 per-model fit plots, 6 trend charts (p50, p80, G[T] × 2 methods), comparison tables, and the detailed comparison viewer.
@article{gulyas2026improving,
title={Improving Curve Fitting for AI Capability Time Horizons: Three Drop-In Improvements to METR's Methodology},
author={Gulyas, Gabor},
journal={arXiv preprint arXiv:[ARXIV-ID]},
year={2026}
}