Improving Curve Fitting for AI Capability Time Horizons

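This is the interactive companion to the paper "Improving Curve Fitting for AI Capability Time Horizons: Three Drop-In Improvements to METR's Methodology" (Gulyas, 2026).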

TL;DR: METR's headline finding — AI agent capabilities double roughly every 4 months — is robust. We report three complementary metrics: p₅₀ (median-difficulty frontier), p₈₀ (easy-task frontier, doubling ~111–121 days), and G[T] (full-curve integral, doubling ~130 days, R² = 0.921). For isotonic regression, we use bootstrap median crossings instead of the single full-data fit, yielding smoother per-model values. The trend is real; our improvements make it more defensible and multi-dimensional.

The Three Improvements

1. Curve Fitting: Isotonic Regression

The issue

Logistic regression assumes a symmetric sigmoid. Real success-vs-difficulty curves are often asymmetric: quick saturation on easy tasks, slow decay on hard ones.

The fix

Isotonic regression (monotone decreasing) makes no parametric assumption. It adapts its shape freely to the data, preserving only the prior that success probability should not increase as tasks get harder.

Evidence: 5-fold CV Brier: 0.1115 (isotonic) vs 0.1212 (logistic) — 8% improvement.
5-fold CV log-loss: 0.368 (isotonic) vs 0.376 (logistic) — 2.1% improvement.
Both improvements hold out of sample.
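
As a rough illustration of the monotone-decreasing fit, the sketch below implements the pool-adjacent-violators algorithm in plain Python and scores the result with the Brier score. The outcome vector is made up, the function names are illustrative, and the score is computed in-sample rather than with the 5-fold CV used for the figures above; it is not the repository's code.

```python
def pava_decreasing(y):
    """Least-squares fit to y that is non-increasing along the input order."""
    blocks = []  # each block is [mean, number of points pooled]
    for v in y:
        blocks.append([float(v), 1])
        # Pool while a later block sits above the one before it.
        while len(blocks) > 1 and blocks[-1][0] > blocks[-2][0]:
            m2, c2 = blocks.pop()
            m1, c1 = blocks.pop()
            blocks.append([(m1 * c1 + m2 * c2) / (c1 + c2), c1 + c2])
    fitted = []
    for mean, count in blocks:
        fitted.extend([mean] * count)
    return fitted


def brier(pred, outcomes):
    """Mean squared difference between predicted probability and 0/1 outcome."""
    return sum((p - o) ** 2 for p, o in zip(pred, outcomes)) / len(outcomes)


# Hypothetical success (1) / failure (0) outcomes, sorted from easiest to hardest task.
outcomes = [1, 1, 0, 1, 1, 0, 0, 1, 0, 0]
fitted = pava_decreasing(outcomes)
score = brier(fitted, outcomes)
print(fitted)
print(round(score, 4))
```

The fitted values form a step function: runs of mixed outcomes are pooled into a shared success rate, which is the freedom in shape that the symmetric sigmoid lacks.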

2. Summary Metrics: p₅₀, p₈₀, and G[T]

The issue

A single threshold crossing (p₅₀) is sensitive to local curve shape and jumps discontinuously for step-function fits. One number cannot capture the full picture of capability growth.

The fix

We report three complementary metrics: p₅₀ (median-difficulty frontier), p₈₀ (easy-task frontier where models achieve 80% success), and G[T] (geometric mean capability time from the full curve integral). For isotonic regression, threshold crossings use the bootstrap median curve instead of the single full-data fit, producing smoother per-model values.

Evidence: G[T] R²: 0.916–0.921 across methods (most stable).
p₈₀ doubling: 111–121 days (R² = 0.881–0.901) — captures easy-task frontier growth.
p₅₀ doubling: 121–130 days (R² = 0.852–0.916) — traditional metric, now stabilized via bootstrap median.
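
The threshold crossings can be sketched as follows, assuming each bootstrap replicate has been evaluated on a shared grid of task durations. The three curves are invented, the helper names are illustrative, and linear interpolation in log-time between grid points is an assumption of this sketch, not something the page specifies.

```python
import math
from statistics import median


def median_curve(curves):
    """Pointwise median across bootstrap curves evaluated on a shared grid."""
    return [median(column) for column in zip(*curves)]


def crossing_time(times, probs, threshold):
    """Time at which a non-increasing success curve first drops below `threshold`.

    Interpolates linearly in log-time between the two neighbouring grid points
    (an assumption of this sketch). Returns None if the curve never crosses.
    """
    for i in range(len(times) - 1):
        p0, p1 = probs[i], probs[i + 1]
        if p0 >= threshold > p1:
            frac = (p0 - threshold) / (p0 - p1)
            log_t = math.log(times[i]) + frac * (math.log(times[i + 1]) - math.log(times[i]))
            return math.exp(log_t)
    return None


# Hypothetical grid of task durations (minutes) and three bootstrap curves.
times = [1, 2, 4, 8, 16, 32]
curves = [
    [1.0, 0.9, 0.7, 0.5, 0.2, 0.0],
    [1.0, 0.8, 0.6, 0.4, 0.2, 0.1],
    [0.9, 0.9, 0.8, 0.3, 0.1, 0.0],
]
med = median_curve(curves)
p50 = crossing_time(times, med, 0.5)
p80 = crossing_time(times, med, 0.8)
print(med, p50, p80)
```

Because the median is taken across many resampled curves, the crossing moves gradually as the data change instead of jumping from one step edge to the next.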

3. Bootstrap: m-out-of-n Resampling

The issue

Standard bootstrap is inconsistent for isotonic regression, which converges at rate n^(1/3), not n^(1/2).

The fix

Use m-out-of-n bootstrap with m = n^(2/3). This restores theoretical consistency for shape-constrained estimators.

References: Kosorok (2008), Sen & Xu (2015), Leger & MacGibbon (2006).
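
A minimal sketch of the resampling step, assuming draws are made with replacement and m is rounded to the nearest integer (the page does not state either detail). The function names and the toy data are illustrative; in a real run the statistic would refit the isotonic curve on each resample.

```python
import random


def m_out_of_n_size(n, exponent=2 / 3):
    """Resample size m = n^(2/3), rounded, with a floor of 1."""
    return max(1, round(n ** exponent))


def m_out_of_n_bootstrap(data, statistic, n_boot=200, seed=0):
    """Apply `statistic` to n_boot resamples of size m drawn with replacement."""
    rng = random.Random(seed)
    m = m_out_of_n_size(len(data))
    return [statistic(rng.choices(data, k=m)) for _ in range(n_boot)]


# Toy data: 1000 hypothetical task outcomes, so each resample has 100 draws.
data = [i % 2 for i in range(1000)]
sizes = m_out_of_n_bootstrap(data, len, n_boot=5)
rates = m_out_of_n_bootstrap(data, lambda s: sum(s) / len(s), n_boot=200)
print(m_out_of_n_size(len(data)), sizes, min(rates), max(rates))
```

Drawing fewer points than the original sample is what distinguishes this from the standard bootstrap; the smaller m makes each replicate noisier, which is the price of consistency for cube-root-rate estimators.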

Results

Method	CV Brier	CV log-loss	p₅₀ doubling	p₅₀ R²	p₈₀ doubling	p₈₀ R²	G[T] doubling	G[T] R²
Logistic (METR)	0.1212	0.376	121 d	0.916	121 d	0.901	128 d	0.916
Isotonic	0.1115	0.368	130 d	0.852	111 d	0.881	130 d	0.921
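
For readers who want to see where a doubling time and its R² come from, here is a sketch that regresses log2(horizon) on release date and inverts the slope. It assumes unweighted ordinary least squares, which the page does not confirm, and uses synthetic points constructed to double exactly every 130 days, not the reported data.

```python
import math
from datetime import date, timedelta


def doubling_time_days(dates, horizons):
    """Fit log2(horizon) = a + b * days by OLS; return (1 / b, R^2)."""
    x = [(d - dates[0]).days for d in dates]
    y = [math.log2(h) for h in horizons]
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx               # doublings per day
    r_squared = sxy ** 2 / (sxx * syy)
    return 1.0 / slope, r_squared


# Synthetic series: horizon (minutes) doubles every 130 days by construction.
start = date(2024, 1, 1)
dates = [start + timedelta(days=130 * k) for k in range(4)]
horizons = [1.0, 2.0, 4.0, 8.0]
dbl, r2 = doubling_time_days(dates, horizons)
print(dbl, r2)
```

Working in log2 means the slope reads directly as doublings per day, so its reciprocal is the doubling time in days.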

G[T] doubling: ~130 d (isotonic, R² = 0.921)
p₈₀ doubling: ~111 d (isotonic, R² = 0.881)
p₅₀ doubling: ~130 d (isotonic, bootstrap median)
Best CV Brier: 0.1115 (isotonic)
Models evaluated: Mar 2023 – Dec 2025

Logistic vs Isotonic: Direct Comparison

Both methods plotted on the same axes. Blue diamonds = logistic, green circles = isotonic (bootstrap median crossings).

Models Evaluated

18 frontier AI models from 4 families, released between March 2023 and December 2025:

Model	Release Date	p₈₀	p₅₀	G[T]
GPT-4	2023-03-14	49 sec	3.1 min	3.4 min
GPT-4 1106	2023-11-06	52 sec	2.0 min	3.5 min
Claude 3 Opus	2024-03-04	37 sec	1.6 min	3.2 min
GPT-4 Turbo	2024-04-09	59 sec	3.7 min	3.7 min
GPT-4o	2024-05-13	1.5 min	8.5 min	6.9 min
Claude 3.5 Sonnet	2024-06-20	2.0 min	14.5 min	12.1 min
o1-preview	2024-09-12	3.0 min	46.5 min	18.9 min
Claude 3.5 Sonnet (New)	2024-10-22	2.4 min	45.4 min	16.9 min
o1	2024-12-05	6.1 min	1.0 hr	35.7 min
Claude 3.7 Sonnet	2025-02-24	21.7 min	57.8 min	51.6 min
o3	2025-04-16	51.4 min	1.6 hr	1.8 hr
Claude 4 Opus	2025-05-22	49.2 min	1.7 hr	1.4 hr
Claude 4.1 Opus	2025-08-05	51.3 min	1.6 hr	1.4 hr
GPT-5	2025-08-07	1.1 hr	1.8 hr	2.6 hr
Gemini 3 Pro	2025-11-18	1.4 hr	2.0 hr	3.5 hr
GPT-5.1 Codex-Max	2025-11-19	52.5 min	1.9 hr	2.9 hr
Claude Opus 4.5	2025-11-24	1.5 hr	3.7 hr	3.8 hr
GPT-5.2	2025-12-11	1.7 hr	6.2 hr	4.4 hr

All values from isotonic regression. p₅₀/p₈₀ use bootstrap median crossings for smoother estimates.

Reproduce It Yourself

All code and data are open source. After cloning the repository and installing the dependencies, a single script reproduces the entire analysis:

git clone https://github.com/guyko81/metr-ablation.git
cd metr-ablation
pip install -r requirements.txt
python run.py

Running it generates all 36 per-model fit plots (18 models × 2 methods), 6 trend charts (p₅₀, p₈₀, and G[T], each for 2 methods), comparison tables, and the detailed comparison viewer.

Citation

@article{gulyas2026improving,
  title={Improving Curve Fitting for AI Capability Time Horizons: Three Drop-In Improvements to METR's Methodology},
  author={Gulyas, Gabor},
  journal={arXiv preprint arXiv:[ARXIV-ID]},
  year={2026}
}
