Improving Curve Fitting for AI Capability Time Horizons

Drop-in improvements to METR's methodology — three metrics (p50, p80, G[T]), bootstrap median crossings, validated on 18 frontier AI models

Isotonic Regression · Bootstrap Median Crossings · p50 / p80 / G[T] Metrics · m-out-of-n Bootstrap
TL;DR: METR's headline finding — AI agent capabilities double roughly every 4 months — is robust. We report three complementary metrics: p50 (median-difficulty frontier), p80 (easy-task frontier, doubling ~111–121 days), and G[T] (full-curve integral, doubling ~130 days, R² = 0.921). For isotonic regression, we use bootstrap median crossings instead of the single full-data fit, yielding smoother per-model values. The trend is real; our improvements make it more defensible and multi-dimensional.

The Three Improvements

1. Curve Fitting: Isotonic Regression

The issue

Logistic regression assumes a symmetric sigmoid. Real success-vs-difficulty curves are often asymmetric: quick saturation on easy tasks, slow decay on hard ones.

The fix

Isotonic regression (monotone decreasing) makes no parametric assumption. It adapts its shape freely to the data, keeping only the constraint that success probability must not increase as tasks get harder.

Evidence: 5-fold CV Brier: 0.1115 (isotonic) vs 0.1212 (logistic) — 8% improvement.
5-fold CV log-loss: 0.368 (isotonic) vs 0.376 (logistic) — 2.1% improvement.
Both improvements hold out of sample.
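
As a rough illustration of what a monotone decreasing fit does, here is a minimal standard-library sketch of the pool-adjacent-violators algorithm with an in-sample Brier score. The data and the function names (fit_isotonic_decreasing, predict, brier) are hypothetical, and this is not the project's own code; the figures above come from 5-fold cross-validation, which this sketch skips for brevity.

```python
def fit_isotonic_decreasing(x, y):
    """Least-squares non-increasing step fit via pool-adjacent-violators (PAVA)."""
    blocks = []  # each block holds [sum of y, count, largest x covered]
    for xi, yi in sorted(zip(x, y)):
        blocks.append([yi, 1, xi])
        # Merge while a later block has a higher mean than the one before it,
        # since that would make success rise with difficulty.
        while len(blocks) > 1 and (
            blocks[-1][0] / blocks[-1][1] > blocks[-2][0] / blocks[-2][1]
        ):
            total, count, x_max = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
            blocks[-1][2] = x_max
    return [(x_max, total / count) for total, count, x_max in blocks]


def predict(steps, x):
    """Evaluate the fitted step function at difficulty x."""
    for x_max, p in steps:
        if x <= x_max:
            return p
    return steps[-1][1]  # beyond the hardest observed task


def brier(probs, outcomes):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(outcomes)


# Hypothetical data: log2(task minutes) and pass/fail outcomes for one model.
log2_minutes = [0, 1, 2, 3, 4, 5, 6, 7]
success = [1, 1, 1, 0, 1, 0, 0, 0]

steps = fit_isotonic_decreasing(log2_minutes, success)
fitted = [predict(steps, x) for x in log2_minutes]
print(fitted)                  # the out-of-order pair at x=3 and x=4 is pooled
print(brier(fitted, success))  # in-sample score, not a cross-validated one
```

The fit is a step function, which is why a single threshold crossing can jump when one task outcome changes.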

2. Summary Metrics: p50, p80, and G[T]

The issue

A single threshold crossing (p50) is sensitive to local curve shape and jumps discontinuously for step-function fits. One number cannot capture the full picture of capability growth.

The fix

We report three complementary metrics: p50 (median-difficulty frontier), p80 (easy-task frontier where models achieve 80% success), and G[T] (geometric mean capability time from the full curve integral). For isotonic regression, threshold crossings use the bootstrap median curve instead of the single full-data fit, producing smoother per-model values.

Evidence: G[T] R²: 0.916–0.921 across methods (most stable).
p80 doubling: 111–121 days (R² = 0.881–0.901) — captures easy-task frontier growth.
p50 doubling: 121–130 days (R² = 0.852–0.916) — traditional metric, now stabilized via bootstrap median.
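
The bootstrap median crossing can be sketched as follows, assuming each bootstrap fit has already been evaluated on a shared grid of task lengths. The curves below are made up, the helper names are hypothetical, and interpolating in log-time between grid points is an assumption of this sketch rather than something stated above.

```python
import math
from statistics import median


def median_curve(curves):
    """Pointwise median across bootstrap curves evaluated on a shared grid."""
    return [median(values) for values in zip(*curves)]


def crossing(grid_minutes, probs, threshold):
    """Task length where a decreasing curve first drops below `threshold`.

    Interpolates linearly in log-time between neighbouring grid points
    (an assumption of this sketch). Returns None if there is no crossing.
    """
    for i in range(1, len(probs)):
        hi, lo = probs[i - 1], probs[i]
        if hi >= threshold > lo:
            frac = (hi - threshold) / (hi - lo)
            log_a = math.log(grid_minutes[i - 1])
            log_b = math.log(grid_minutes[i])
            return math.exp(log_a + frac * (log_b - log_a))
    return None


# Hypothetical step-function fits from three bootstrap resamples.
grid = [1, 2, 4, 8, 16, 32]  # task length in minutes
curves = [
    [1.0, 1.0, 0.5, 0.5, 0.0, 0.0],
    [1.0, 0.75, 0.75, 0.25, 0.25, 0.0],
    [1.0, 1.0, 1.0, 0.4, 0.0, 0.0],
]

smooth = median_curve(curves)
p50 = crossing(grid, smooth, 0.5)
p80 = crossing(grid, smooth, 0.8)
print(smooth, p50, p80)
```

Each individual curve has large flat steps, while the pointwise median has smaller ones, so its crossings move less when a single resample changes.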

3. Bootstrap: m-out-of-n Resampling

The issue

Standard bootstrap is inconsistent for isotonic regression, which converges at rate n^(1/3), not n^(1/2).

The fix

Use m-out-of-n bootstrap with m = n^(2/3). This restores theoretical consistency for shape-constrained estimators.

References: Kosorok (2008), Sen & Xu (2015), Leger & MacGibbon (2006).
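
A minimal sketch of the resampling step, assuming draws are made with replacement and m is rounded to the nearest integer. The function name is hypothetical, and how the spread from m-sized resamples is rescaled to the full sample size is not covered here.

```python
import random


def m_out_of_n_indices(n, rng, exponent=2 / 3):
    """Draw m = round(n ** exponent) row indices with replacement from range(n)."""
    m = max(1, round(n ** exponent))
    return [rng.randrange(n) for _ in range(m)]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
n_tasks = 1000          # hypothetical number of task attempts for one model
idx = m_out_of_n_indices(n_tasks, rng)
print(len(idx))         # resample size m, far smaller than n
```

Each resample would then be passed to the isotonic fit, and the resulting curves collected for the pointwise median.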

Results

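| Method | CV Brier | CV log-loss | p50 doubling | p50 R² | p80 doubling | p80 R² | G[T] doubling | G[T] R² |
|---|---|---|---|---|---|---|---|---|
| Logistic (METR) | 0.1212 | 0.376 | 121 d | 0.916 | 121 d | 0.901 | 128 d | 0.916 |
| Isotonic | 0.1115 | 0.368 | 130 d | 0.852 | 111 d | 0.881 | 130 d | 0.921 |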

G[T] doubling: ~130 d (isotonic, R² = 0.921)
p80 doubling: ~111 d (isotonic, R² = 0.881)
p50 doubling: ~130 d (isotonic, bootstrap median)
Best CV Brier: 0.1115 (isotonic)
Models evaluated: 18 (Mar 2023 – Dec 2025)

Logistic vs Isotonic: Direct Comparison

Both methods plotted on the same axes. Blue diamonds = logistic, green circles = isotonic (bootstrap median crossings).

[Figure panels: p50 comparison, G[T] comparison]

Detailed Comparison

Explore per-model curve fits with bootstrap median, p50/p80/G[T] trend charts, comparison tables, and binned probability checks.


Models Evaluated

18 frontier AI models from 4 families, released between March 2023 and December 2025:

| Model | Release Date | p80 | p50 | G[T] |
|---|---|---|---|---|
| GPT-4 | 2023-03-14 | 49 sec | 3.1 min | 3.4 min |
| GPT-4 1106 | 2023-11-06 | 52 sec | 2.0 min | 3.5 min |
| Claude 3 Opus | 2024-03-04 | 37 sec | 1.6 min | 3.2 min |
| GPT-4 Turbo | 2024-04-09 | 59 sec | 3.7 min | 3.7 min |
| GPT-4o | 2024-05-13 | 1.5 min | 8.5 min | 6.9 min |
| Claude 3.5 Sonnet | 2024-06-20 | 2.0 min | 14.5 min | 12.1 min |
| o1-preview | 2024-09-12 | 3.0 min | 46.5 min | 18.9 min |
| Claude 3.5 Sonnet (New) | 2024-10-22 | 2.4 min | 45.4 min | 16.9 min |
| o1 | 2024-12-05 | 6.1 min | 1.0 hr | 35.7 min |
| Claude 3.7 Sonnet | 2025-02-24 | 21.7 min | 57.8 min | 51.6 min |
| o3 | 2025-04-16 | 51.4 min | 1.6 hr | 1.8 hr |
| Claude 4 Opus | 2025-05-22 | 49.2 min | 1.7 hr | 1.4 hr |
| Claude 4.1 Opus | 2025-08-05 | 51.3 min | 1.6 hr | 1.4 hr |
| GPT-5 | 2025-08-07 | 1.1 hr | 1.8 hr | 2.6 hr |
| Gemini 3 Pro | 2025-11-18 | 1.4 hr | 2.0 hr | 3.5 hr |
| GPT-5.1 Codex-Max | 2025-11-19 | 52.5 min | 1.9 hr | 2.9 hr |
| Claude Opus 4.5 | 2025-11-24 | 1.5 hr | 3.7 hr | 3.8 hr |
| GPT-5.2 | 2025-12-11 | 1.7 hr | 6.2 hr | 4.4 hr |

All values from isotonic regression. p50/p80 use bootstrap median crossings for smoother estimates.
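
A doubling time can be read off the table with an ordinary least-squares line through log2 of the metric against release date. The sketch below does this for the G[T] column, with hours converted to minutes. It assumes an unweighted fit, which may differ from the project's trend code. On these rounded values it should land close to the reported ~130 days and R² of about 0.92.

```python
from datetime import date
import math

# (release date, G[T] in minutes), transcribed from the table above.
GT_MINUTES = [
    (date(2023, 3, 14), 3.4), (date(2023, 11, 6), 3.5),
    (date(2024, 3, 4), 3.2), (date(2024, 4, 9), 3.7),
    (date(2024, 5, 13), 6.9), (date(2024, 6, 20), 12.1),
    (date(2024, 9, 12), 18.9), (date(2024, 10, 22), 16.9),
    (date(2024, 12, 5), 35.7), (date(2025, 2, 24), 51.6),
    (date(2025, 4, 16), 1.8 * 60), (date(2025, 5, 22), 1.4 * 60),
    (date(2025, 8, 5), 1.4 * 60), (date(2025, 8, 7), 2.6 * 60),
    (date(2025, 11, 18), 3.5 * 60), (date(2025, 11, 19), 2.9 * 60),
    (date(2025, 11, 24), 3.8 * 60), (date(2025, 12, 11), 4.4 * 60),
]


def doubling_time(points):
    """Unweighted OLS of log2(value) on days since the first release.

    Returns (days per doubling, R squared).
    """
    t0 = points[0][0]
    xs = [(d - t0).days for d, _ in points]
    ys = [math.log2(v) for _, v in points]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx  # doublings per day
    return 1 / slope, sxy ** 2 / (sxx * syy)


days, r2 = doubling_time(GT_MINUTES)
print(f"G[T] doubling time: {days:.0f} days, R^2 = {r2:.3f}")
```

Swapping in the p50 or p80 column gives the corresponding trend for those metrics.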

Reproduce It Yourself

All code and data are open source. Reproduce the entire analysis with a few commands:

git clone https://github.com/guyko81/metr-ablation.git
cd metr-ablation
pip install -r requirements.txt
python run.py

Generates all 36 per-model fit plots, 6 trend charts (p50, p80, G[T] × 2 methods), comparison tables, and the detailed comparison viewer.

Citation

@article{gulyas2026improving,
  title={Improving Curve Fitting for AI Capability Time Horizons: Three Drop-In Improvements to METR's Methodology},
  author={Gulyas, Gabor},
  journal={arXiv preprint arXiv:[ARXIV-ID]},
  year={2026}
}