Hindcast Validation
We tested our methodology against 20 specific, falsifiable questions about events from October 2025 to February 2026, all with outcomes that are now known. The model used (Gemini 2.5 Pro) has a training cutoff before the hindcast period, which reduces the risk of data contamination.
Calibration Analysis
A well-calibrated forecaster who says 70% should be right 70% of the time. Each circle represents one probability bucket; size reflects sample count.
Calibration Plot: predicted probability vs. observed frequency per bucket. Perfect calibration follows the dashed line; circle size reflects sample count.
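As a rough illustration of how the points on a calibration plot like this could be derived, here is a minimal sketch. It assumes ten equal-width probability buckets (the bucket edges actually used are not stated here), and the function name `calibration_buckets` and the sample data are hypothetical, not the hindcast predictions themselves.

```python
from collections import defaultdict


def calibration_buckets(predictions, outcomes, n_buckets=10):
    """Group forecasts into equal-width probability buckets.

    Returns a list of (mean_predicted, observed_frequency, count) tuples,
    one per non-empty bucket, ordered from lowest to highest bucket.
    """
    grouped = defaultdict(list)
    for p, y in zip(predictions, outcomes):
        # A prediction of exactly 1.0 is clamped into the top bucket.
        idx = min(int(p * n_buckets), n_buckets - 1)
        grouped[idx].append((p, y))

    rows = []
    for idx in sorted(grouped):
        pairs = grouped[idx]
        count = len(pairs)
        mean_pred = sum(p for p, _ in pairs) / count
        observed = sum(y for _, y in pairs) / count
        rows.append((mean_pred, observed, count))
    return rows


# Illustrative data only (assumption): 1 = event happened, 0 = it did not.
example_predictions = [0.85, 0.85, 0.65, 0.65, 0.65, 0.15, 0.15]
example_outcomes = [1, 1, 1, 1, 0, 0, 1]

for mean_pred, observed, count in calibration_buckets(
    example_predictions, example_outcomes
):
    print(f"predicted {mean_pred:.2f}  observed {observed:.2f}  n={count}")
```

Each returned tuple corresponds to one circle: the x-position is the mean predicted probability, the y-position is the observed frequency, and the count would drive the circle size. With only 20 questions, most buckets hold very few samples, so individual points are noisy.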
All 20 Predictions (including misses)
OpenAI releases new flagship model before Jan 2026
OpenAI had been on quarterly release cadence; GPT-5.x series was expected
At least 2 FAANG companies announce AI layoffs in Q4 2025
Tech layoffs averaged 20K/month; AI-driven restructuring was accelerating
GitHub Copilot exceeds 4M paid subscribers by Jan 2026
Was at ~2.5M growing 75% YoY; trajectory supported 4M target
Cursor ARR exceeds $500M by end 2025
Growth was rapid but $500M seemed aggressive — actually crossed $1B
EU AI Act GPAI rules take effect Aug 2, 2025
Already published in Official Journal; EU regulations rarely slip
Google releases model that tops GPT-4o before Dec 2025
Google was catching up with Gemini 2.x; a leap was plausible
Anthropic releases new Claude in Q4 2025
Anthropic was on quarterly cadence; Claude Opus 4.5 arrived Nov 24
Major AI coding tool acquired for >$100M in Q4 2025
M&A activity was accelerating; Cognition acquired Windsurf for ~$250M
GitHub Copilot changes pricing before Mar 2026
Metered billing introduced June 2025; structural pricing shift
US tech layoffs exceed 50K in Q4 2025
Q3 was ~40K; Q4 trend was upward but uncertain
AI-generated code exceeds 50% of Copilot user output
Was at ~40%, growing ~5pp/year; 50% in 3 months was unlikely. Actual: 46%
Devin pricing drops below $100/month by end 2025
Already happened — Devin 2.0 was $20/mo since April 2025
80%+ of Fortune 100 using AI coding assistants by Jan 2026
Was at ~77% mid-2025; trajectory strong. Actual: 90% (Copilot alone)
Any country enacts AI-generated code specific legislation
AI code legislation is a new category with no precedent. Correct: none enacted.
NeurIPS 2025 accepts more than 3,000 papers
NeurIPS 2024 accepted ~3,500; upward trend continued
Meta announces >5,000 layoffs in Q1 2026
Meta had prior mass layoffs but 2025 seemed stable — wrong: 20% cut announced
AI coding tools market exceeds $7B by end 2025
Was at $4.91B in 2024 growing ~50% YoY. Actual: $7.37B
Claude Code becomes #1 most-used AI coding tool by Mar 2026
Claude Code launched May 2025 — too new to predict dominance. Wrong: it did.
FAANG AI capex commitments for 2026 exceed $500B
2025 was ~$300-400B; $500B was aggressive but plausible. Actual: ~$700B
90%+ of developers use AI tools weekly by early 2026
Was at ~75-80% mid-2025; 90%+ seemed aggressive. Actual: 95% per survey
Honest Caveats
- This is a BACKTEST, not a real-time track record. Questions were defined with hindsight knowledge of outcomes.
- Model: Gemini 2.5 Pro — training cutoff predates the Oct 2025 hindcast period, reducing but not eliminating data contamination risk.
- Question set skews toward YES outcomes (16/20). A system that always predicts 70% YES would score well. Future validations will include more balanced outcomes.
- Low-confidence bucket (0-30%) was poorly calibrated: the model assigned low probabilities to events that did happen (Cursor growth, Claude Code adoption, Meta layoffs).
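To make the YES-skew caveat concrete, the sketch below computes the Brier score a constant forecaster would get on a 16 YES / 4 NO question set. The helper name `constant_forecast_brier` is hypothetical; only the 16/20 split comes from the caveats above.

```python
def constant_forecast_brier(q, n_yes, n_no):
    """Brier score of a forecaster that always predicts probability q of YES."""
    total = n_yes + n_no
    # Each YES outcome contributes (q - 1)^2; each NO outcome contributes q^2.
    return (n_yes * (q - 1.0) ** 2 + n_no * q ** 2) / total


# Outcome split stated in the caveats: 16 YES, 4 NO.
for q in (0.5, 0.7, 0.8):
    score = constant_forecast_brier(q, 16, 4)
    print(f"always {q:.0%} YES -> Brier {score:.3f}")
```

On this split, always answering 70% YES scores 0.17, compared with 0.25 for always answering 50%. Any reported Brier score is therefore better read against such a constant baseline than against 0.25 alone.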
Methodology
20 questions were defined about specific, falsifiable events from October 2025 to February 2026. Gemini 2.5 Pro was given each question together with a base rate hint and was instructed to predict as if the current date were the ask date. Predictions were batched in groups of 5. All results are published, including all 4 misses. The Brier score is computed as the mean squared error between predicted probability and binary outcome across all 20 questions.
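The batching and scoring steps described above can be sketched as follows. This is a minimal illustration, not the pipeline actually used: the helper names `batched` and `brier_score` are assumptions, the question IDs are placeholders, and the sample predictions are made up.

```python
def batched(items, size):
    """Yield consecutive chunks of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def brier_score(predictions, outcomes):
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    if not predictions or len(predictions) != len(outcomes):
        raise ValueError("predictions and outcomes must be non-empty and equal length")
    squared_errors = [(p - y) ** 2 for p, y in zip(predictions, outcomes)]
    return sum(squared_errors) / len(squared_errors)


# 20 placeholder question IDs, split into groups of 5 as described.
question_ids = list(range(1, 21))
batches = list(batched(question_ids, 5))
print(f"{len(batches)} batches of sizes {[len(b) for b in batches]}")

# Illustrative predictions and outcomes only (assumption), not the real results.
example_predictions = [0.9, 0.6, 0.3, 0.8]
example_outcomes = [1, 1, 1, 0]
print(f"Brier score: {brier_score(example_predictions, example_outcomes):.3f}")
```

Lower is better: 0 is a perfect score, and a forecaster who always says 50% scores 0.25 regardless of outcomes.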