Backtest — Not a real-time track record

Hindcast Validation

We tested our methodology against 20 specific, falsifiable questions about events from October 2025 to February 2026 — events whose outcomes are now known. The model used (Gemini 2.5 Pro) has a training cutoff before the hindcast period, reducing data contamination.

Brier Score: 0.155 (lower = better)
Direction Correct: 16/20
Misses (published): 4
Resolved Questions: 20
What is a Brier Score? It measures prediction accuracy as the squared gap between forecast probability and outcome: 0.0 = perfect foresight, 0.25 = coin flip (guessing 50% every time), 1.0 = maximally wrong (predicting 100% certainty for events that never happen). Think of it like a golf score: lower is better. Our 0.155 beats random guessing by 38%. For context, top Metaculus forecasters score ~0.10-0.15, and weather forecasts typically score around 0.2.
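The reference points above follow directly from the definition. A minimal sketch (the function name `brier` is ours, not from the report):

```python
def brier(forecast: float, outcome: int) -> float:
    """Squared error between a probability forecast and a 0/1 outcome."""
    return (forecast - outcome) ** 2

print(brier(1.0, 1))  # perfect foresight  -> 0.0
print(brier(0.5, 1))  # coin-flip guess    -> 0.25
print(brier(1.0, 0))  # confidently wrong  -> 1.0
```

For many questions, the score is just the mean of these per-question squared errors.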

Calibration Analysis

A well-calibrated forecaster who says 70% should be right 70% of the time. Each circle represents one probability bucket; size reflects sample count.

Calibration Plot

[Figure: predicted probability vs. observed frequency, one circle per probability bucket; perfect calibration follows the dashed diagonal. Circle size = sample count; red circles = overconfident, green = underconfident.]

Bucket     Observed freq.   Samples
0-25%      67%              n=3
25-50%     67%              n=3
50-75%     100%             n=5
75-100%    100%             n=9

Reading this chart: circles on the dashed diagonal = perfect calibration; red circles below the line = overconfident (we predicted higher than reality delivered); green circles above the line = events that happened more often than we predicted. The two low-probability buckets (0-25% and 25-50%) are the weak spot: events we rated unlikely resolved YES 67% of the time, meaning we were too confident they would not happen.
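The bucket figures can be reproduced from the 20 published predictions. A sketch, assuming left-closed 25-point buckets (so a 75% forecast falls in the top bucket):

```python
# Published (predicted probability, resolved YES?) pairs for H01-H20.
preds = [(0.85, 1), (0.65, 1), (0.75, 1), (0.15, 1), (0.99, 1),
         (0.70, 1), (0.85, 1), (0.60, 1), (0.75, 1), (0.55, 1),
         (0.20, 0), (0.99, 1), (0.85, 1), (0.25, 0), (0.95, 1),
         (0.40, 1), (0.85, 1), (0.20, 1), (0.65, 1), (0.40, 1)]

buckets = [[] for _ in range(4)]  # 0-25%, 25-50%, 50-75%, 75-100%
for p, outcome in preds:
    buckets[min(int(p / 0.25), 3)].append(outcome)

for label, outcomes in zip(("0-25%", "25-50%", "50-75%", "75-100%"), buckets):
    freq = 100 * sum(outcomes) / len(outcomes)
    print(f"{label}: {freq:.0f}% observed, n={len(outcomes)}")
```

This prints the same 67% / 67% / 100% / 100% observed frequencies (n = 3, 3, 5, 9) shown in the chart.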

All 20 Predictions (including misses)

H01 | predicted 85% | YES | Brier 0.0225
OpenAI releases new flagship model before Jan 2026
Rationale: OpenAI had been on a quarterly release cadence; the GPT-5.x series was expected.

H02 | predicted 65% | YES | Brier 0.1225
At least 2 FAANG companies announce AI layoffs in Q4 2025
Rationale: Tech layoffs averaged 20K/month; AI-driven restructuring was accelerating.

H03 | predicted 75% | YES | Brier 0.0625
GitHub Copilot exceeds 4M paid subscribers by Jan 2026
Rationale: Was at ~2.5M, growing 75% YoY; the trajectory supported the 4M target.

H04 | predicted 15% | YES (MISS) | Brier 0.7225
Cursor ARR exceeds $500M by end of 2025
Rationale: Growth was rapid but $500M seemed aggressive; it actually crossed $1B.

H05 | predicted 99% | YES | Brier 0.0001
EU AI Act GPAI rules take effect Aug 2, 2025
Rationale: Already published in the Official Journal; EU regulations rarely slip.

H06 | predicted 70% | YES | Brier 0.0900
Google releases a model that tops GPT-4o before Dec 2025
Rationale: Google was catching up with Gemini 2.x; a leap was plausible.

H07 | predicted 85% | YES | Brier 0.0225
Anthropic releases a new Claude in Q4 2025
Rationale: Anthropic was on a quarterly cadence; Claude Opus 4.5 arrived Nov 24.

H08 | predicted 60% | YES | Brier 0.1600
Major AI coding tool acquired for >$100M in Q4 2025
Rationale: M&A activity was accelerating; Cognition acquired Windsurf for ~$250M.

H09 | predicted 75% | YES | Brier 0.0625
GitHub Copilot changes pricing before Mar 2026
Rationale: Metered billing introduced June 2025 signaled a structural pricing shift.

H10 | predicted 55% | YES | Brier 0.2025
US tech layoffs exceed 50K in Q4 2025
Rationale: Q3 was ~40K; the Q4 trend was upward but uncertain.

H11 | predicted 20% | NO | Brier 0.0400
AI-generated code exceeds 50% of Copilot user output
Rationale: Was at ~40%, growing ~5pp/year; 50% in 3 months was unlikely. Actual: 46%.

H12 | predicted 99% | YES | Brier 0.0001
Devin pricing drops below $100/month by end of 2025
Rationale: Already happened; Devin 2.0 had been $20/mo since April 2025.

H13 | predicted 85% | YES | Brier 0.0225
80%+ of Fortune 100 using AI coding assistants by Jan 2026
Rationale: Was at ~77% mid-2025 with a strong trajectory. Actual: 90% (Copilot alone).

H14 | predicted 25% | NO | Brier 0.0625
Any country enacts legislation specific to AI-generated code
Rationale: AI code legislation is a new category with no precedent. Correct: none enacted.

H15 | predicted 95% | YES | Brier 0.0025
NeurIPS 2025 accepts more than 3,000 papers
Rationale: NeurIPS 2024 accepted ~3,500; the upward trend continued.

H16 | predicted 40% | YES (MISS) | Brier 0.3600
Meta announces >5,000 layoffs in Q1 2026
Rationale: Meta had prior mass layoffs but 2025 seemed stable. Wrong: a 20% cut was announced.

H17 | predicted 85% | YES | Brier 0.0225
AI coding tools market exceeds $7B by end of 2025
Rationale: Was at $4.91B in 2024, growing ~50% YoY. Actual: $7.37B.

H18 | predicted 20% | YES (MISS) | Brier 0.6400
Claude Code becomes #1 most-used AI coding tool by Mar 2026
Rationale: Claude Code launched May 2025, seemingly too new to predict dominance. Wrong: it did.

H19 | predicted 65% | YES | Brier 0.1225
FAANG AI capex commitments for 2026 exceed $500B
Rationale: 2025 was ~$300-400B; $500B was aggressive but plausible. Actual: ~$700B.

H20 | predicted 40% | YES (MISS) | Brier 0.3600
90%+ of developers use AI tools weekly by early 2026
Rationale: Was at ~75-80% mid-2025; 90%+ seemed aggressive. Actual: 95% per survey.

Honest Caveats

  • This is a BACKTEST, not a real-time track record. Questions were defined with hindsight knowledge of outcomes.
  • Model: Gemini 2.5 Pro — training cutoff predates the Oct 2025 hindcast period, reducing but not eliminating data contamination risk.
  • Question set skews toward YES outcomes (16/20). A system that always predicts 70% YES would score well. Future validations will include more balanced outcomes.
  • Low-probability predictions (40% and below) were poorly calibrated: we assigned low odds to events that happened (Cursor growth, Claude Code adoption, Meta layoffs).
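The YES-skew caveat is easy to quantify: against 16 YES and 4 NO outcomes, a constant 70% forecast lands close to our headline score. A quick check using the published outcome counts:

```python
outcomes = [1] * 16 + [0] * 4  # 16 YES, 4 NO, per the question set
baseline = sum((0.70 - o) ** 2 for o in outcomes) / len(outcomes)
print(round(baseline, 3))      # 0.17, vs. our 0.155
```

A naive constant forecast scores 0.17, so the 0.155 result clears that baseline only modestly on this question set.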

Methodology

20 questions were defined about specific, falsifiable events from October 2025 to February 2026. Gemini 2.5 Pro was prompted with the question, a base rate hint, and instructed to predict as if the current date were the ask date. Predictions were batched in groups of 5. All results are published, including all 4 misses. The Brier score is computed as the mean squared error between predicted probability and binary outcome across all 20 questions.