Technical Whitepaper
A complete technical reference for Diamond AI's statistical prediction engine. This document covers model architecture, training methodology, validation results, and real-market backtesting.
Section 1: Overview
Diamond AI is a quantitative MLB prediction engine that generates calibrated win probabilities for every Major League Baseball game. The system is built on a logistic regression model trained on 10,192 historical games spanning 2020-2024 (real-odds only).
The model was validated on 1,716 held-out 2025 games that were never seen during training, achieving 57.17% accuracy with a Brier score of 0.2398 (a coin flip scores 0.25). This result is statistically significant at p < 10⁻¹¹.
Walk-forward validation across six independent test years (2020-2025) shows consistent performance in the 56-58% accuracy range, confirming the model generalizes across different seasons, rule changes, and competitive environments.
| Metric | Value | Detail |
|---|---|---|
| Holdout accuracy | 57.17% | 1,716 test games |
| Brier score | 0.2398 | coin flip = 0.2500 |
| Statistical significance | p < 10⁻¹¹ | z = 7.00 |
| Walk-forward years | 6 | 2020-2025 |
Section 2: Prediction Pipeline
The prediction pipeline transforms raw data through six sequential stages. Each stage is independently testable, versioned, and produces deterministic output given the same inputs.
1. Data Layer: Retrosheet, Lahman, MLB API, Statcast
2. Feature Engineering: Elo, Pythagorean, pitcher FIP
3. Statistical Model: logistic regression (L2 + ensemble)
4. Calibration: Platt scaling (proprietary parameters)
5. Consensus: 50% model + 50% Vegas
6. Output: calibrated win probability P(win) + confidence score
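Stages 4 and 5 are the most mechanical, so they are the easiest to illustrate. The sketch below shows Platt scaling and the fixed 50/50 consensus blend in Python; the parameters a and b are placeholders, since the production calibration parameters are proprietary.

```python
import math

def platt_scale(p_model: float, a: float = 1.0, b: float = 0.0) -> float:
    """Stage 4: Platt scaling maps the raw model probability to a calibrated one.
    a and b are placeholders; the production parameters are proprietary."""
    logit = math.log(p_model / (1.0 - p_model))
    return 1.0 / (1.0 + math.exp(-(a * logit + b)))

def consensus(p_calibrated: float, p_vegas: float) -> float:
    """Stage 5: fixed 50/50 blend of the calibrated model probability
    and the market-implied (Vegas) probability."""
    return 0.5 * p_calibrated + 0.5 * p_vegas
```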
Section 3: Elo Rating System
Diamond AI maintains a continuous Elo rating for all 30 MLB teams. The system adapts after every game, providing a real-time measure of team strength that captures momentum, roster changes, and form.
| Parameter | Value |
|---|---|
| K-factor | 4 |
| Home advantage | +24 points |
| Margin of victory | sqrt(MOV) |
| Season regression | 25% toward 1500 |
| Initial seeding | Pythagorean W% |
| Optimization | 96-config grid search |
E = 1 / (1 + 10^(-((R_home + HFA) - R_away) / 400))
R' = R + K * sqrt(MOV) * (S - E)

Where E = expected outcome, R = pre-game rating, HFA = home-field advantage (+24), S = actual result (1 = win, 0 = loss), K = update speed (4), MOV = margin of victory in runs.
The Elo parameters were optimized via grid search over 96 configurations (K = {2, 3, 4, 5, 6, 8}, HFA = {16, 20, 24, 28}, MOV = {none, log, sqrt, linear}), cross-validated on three independent holdout years. The selected configuration (K=4, HFA=24, sqrt MOV) minimized Brier score across all validation folds.
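A minimal sketch of the update rule these parameters define, assuming the standard Elo expectation with the home-field advantage added to the home rating; function names are ours, not the production code.

```python
import math

def expected_home_win(r_home: float, r_away: float, hfa: float = 24.0) -> float:
    """E from the formula above: standard Elo expectation with the
    home-field advantage added to the home rating."""
    return 1.0 / (1.0 + 10 ** (-((r_home + hfa) - r_away) / 400.0))

def update_elo(r_home: float, r_away: float, home_won: bool, run_diff: int,
               k: float = 4.0, hfa: float = 24.0) -> tuple[float, float]:
    """One post-game update under the selected configuration
    (K=4, HFA=+24, sqrt margin-of-victory)."""
    e = expected_home_win(r_home, r_away, hfa)
    s = 1.0 if home_won else 0.0
    mov = math.sqrt(max(abs(run_diff), 1))   # sqrt MOV multiplier
    delta = k * mov * (s - e)
    return r_home + delta, r_away - delta    # zero-sum rating exchange
```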
Section 4: Model Features
The model uses three engineered features plus a bias term. Each feature captures a distinct predictive signal. Weights shown are from the trained model (v2.1).
Feature 1: Elo Differential (team strength signal)
What it measures
The difference in team quality based on continuous Elo ratings. Captures overall team strength, recent form, and historical performance in a single number.
Computation
(Elo_home + 24) - Elo_away, normalized to a ~(-3, +3) range; the +24 home advantage is baked into the differential.
Feature 2: Pythagorean Differential (true-talent estimator)
What it measures
Expected win percentage based on runs scored vs. runs allowed. Filters out luck by estimating how many games a team should have won, independent of actual record. The strongest single predictor in the model.
Computation
W%_pyth = RS^1.83 / (RS^1.83 + RA^1.83), taken as the home-away difference. Range: roughly -0.3 to +0.3. The exponent 1.83 comes from Davenport's MLB-optimized formula.
Feature 3: Starter FIP Differential (starting pitcher quality)
What it measures
Fielding Independent Pitching isolates the starting pitcher's true skill from defense and luck. Captures the individual matchup advantage from the day's pitching assignment.
Computation
FIP = (13*HR + 3*(BB + HBP) - 2*K) / IP + C, with the constant C calibrated to league ERA; the feature is FIP_away - FIP_home, so positive = home pitcher advantage. Coverage: 99.3% of games matched.
Feature 4: Bias Term (baseline home advantage)
The learned intercept represents residual home-field advantage after Elo's explicit +24 adjustment. A positive bias indicates the home team wins slightly more often than Elo alone predicts.
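Putting Section 4 together, here is a hedged sketch of the feature computation and the logistic output. The Pythagorean and FIP formulas follow the standard definitions quoted above; the /100 Elo scaling and the FIP constant C are illustrative assumptions, and since the trained v2.1 weights are not published, the example weights are placeholders.

```python
import math

def pythagorean_wpct(rs: float, ra: float, exp: float = 1.83) -> float:
    """Davenport-exponent Pythagorean expectation: RS^1.83 / (RS^1.83 + RA^1.83)."""
    return rs ** exp / (rs ** exp + ra ** exp)

def fip(hr: int, bb: int, hbp: int, k: int, ip: float, c: float = 3.10) -> float:
    """Fielding Independent Pitching. c = 3.10 is a typical value; the production
    constant is calibrated so league-average FIP matches league ERA."""
    return (13 * hr + 3 * (bb + hbp) - 2 * k) / ip + c

def feature_vector(elo_home, elo_away, pyth_home, pyth_away,
                   fip_home_sp, fip_away_sp):
    """The three engineered features. The /100 Elo scaling is illustrative,
    chosen only to land in the ~(-3, +3) range the text describes."""
    return [
        (elo_home + 24 - elo_away) / 100,   # Elo differential, HFA baked in
        pyth_home - pyth_away,              # Pythagorean differential
        fip_away_sp - fip_home_sp,          # positive = home starter advantage
    ]

def predict(x, weights, bias):
    """Logistic regression: P(home win) = sigmoid(bias + w . x)."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))

# Placeholder weights for illustration; the trained v2.1 weights are not published.
x = feature_vector(1550, 1500, 0.55, 0.48, 3.2, 4.1)
print(f"P(home win) = {predict(x, weights=[0.4, 2.0, 0.15], bias=0.05):.3f}")
```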
Section 5: Walk-Forward Validation
Each year is predicted using only data available before that season. The model is never trained on data it later predicts. This is the gold standard for time-series model validation.
| Season | Games | Accuracy | vs. Baseline | Notes |
|---|---|---|---|---|
| 2020 | 898 | 57.9% | +7.9% | 60-game COVID season |
| 2021 | 2,429 | 57.9% | +7.9% | |
| 2022 | 2,430 | 58.8% | +8.8% | Best single year |
| 2023 | 2,430 | 56.4% | +6.4% | |
| 2024 | 2,430 | 56.1% | +6.1% | |
| 2025 | 2,430 | 56.6% | +6.6% | Holdout test set |
| Average | ~2,175 | 57.3% | +7.3% | 6-year mean |
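The protocol above amounts to an expanding-window loop; this is a schematic sketch, with train_fn and eval_fn standing in for the actual training and scoring code.

```python
def walk_forward(games_by_season: dict, train_fn, eval_fn) -> dict:
    """Expanding-window walk-forward: each test season is scored by a model
    trained only on seasons that precede it."""
    results = {}
    seasons = sorted(games_by_season)
    for i, season in enumerate(seasons):
        if i == 0:
            continue                       # the first season has no prior data
        train = [g for s in seasons[:i] for g in games_by_season[s]]
        model = train_fn(train)            # fit only on earlier seasons
        results[season] = eval_fn(model, games_by_season[season])
    return results
```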
Under the null hypothesis that the model is no better than a coin flip (50%), the observed 57.1% accuracy on 2,430 games yields:
| Statistic | Value |
|---|---|
| Observed accuracy | 57.17% |
| Sample size | 1,716 |
| z-score | 7.00 |
| p-value | < 10⁻¹¹ |
| 95% CI | [55.1%, 59.1%] |
| 99% CI | [54.5%, 59.7%] |
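The test itself is a one-sided binomial z-test against the coin-flip null; the snippet below reproduces the z ≈ 7.0 headline from the accuracy and the 2,430-game sample size quoted in the paragraph above.

```python
from math import erf, sqrt

def binomial_z_test(accuracy: float, n_games: int, p0: float = 0.5):
    """One-sided z-test of observed accuracy against the coin-flip null."""
    se = sqrt(p0 * (1 - p0) / n_games)         # standard error under H0
    z = (accuracy - p0) / se
    p_value = 0.5 * (1 - erf(z / sqrt(2)))     # upper-tail normal probability
    return z, p_value

z, p = binomial_z_test(0.5717, 2430)           # figures quoted in the text above
print(f"z = {z:.2f}, p = {p:.1e}")             # z = 7.07, p ~ 8e-13 (< 1e-11)
```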
| Metric | Value |
|---|---|
| Live accuracy | 54.3% |
| Predictions graded | 564 |
| Live Brier score | 0.2481 |
| Current model | v2.1 |
Section 6: Live Performance and Closing Line Value
Validation accuracy of 57.2% on a held-out 2025 season tells you the model can sort wins from losses on a single distribution it never trained on. That's necessary, but it's not sufficient. The MLB betting market is one of the most efficient public-information markets on earth — billions of dollars a year flow through it, and the closing line at first pitch is the consensus of every sharp bettor, every public bettor, every model, and every late-breaking lineup or weather signal. If a prediction system has real edge, the market will eventually move toward its picks; if it doesn't, any apparent accuracy comes from picking with the consensus, not from independent insight.
For every prediction we ship, we record the implied probability of our pick at predict time and again at the closing line. The difference is Closing Line Value. Positive CLV means the market moved toward our pick — late information confirmed our read of the game. Negative CLV means the line moved away from us — the market disagreed, and the market is usually right.
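In code, CLV reduces to a difference of implied probabilities. A minimal sketch, assuming American-odds inputs (the format the books above quote):

```python
def implied_prob(american_odds: int) -> float:
    """American odds to implied probability (vig included)."""
    if american_odds < 0:
        return -american_odds / (-american_odds + 100)
    return 100 / (american_odds + 100)

def closing_line_value(odds_at_predict: int, odds_at_close: int) -> float:
    """CLV in probability points: positive means the close moved toward our pick."""
    return implied_prob(odds_at_close) - implied_prob(odds_at_predict)

# Example: we took the home side at -120 and it closed -135.
print(f"{closing_line_value(-120, -135):+.3f}")   # +0.029: the market moved our way
```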
CLV is reported live on the Performance dashboard: 30-day rolling average, all-time average, and beat-closing percentage. Today's CLV is benchmarked against soft-book closes (FanDuel / DraftKings); a sharp-book reference (Pinnacle / Circa) is on the roadmap. Until that lands, we treat positive soft-book CLV as necessary but not sufficient evidence of edge — a real moat requires beating the sharper books.
Brier score. Mean squared error between predicted probability and actual outcome. Lower is better. Penalizes overconfidence symmetrically with underconfidence. Target: ≤ 0.235.

Calibration. For each 5pp confidence bucket above 50% with at least 30 picks, the absolute gap between the stated confidence midpoint and the actual win rate. Tracks whether "65% confident" actually means 65%.

Walk-forward stability. Every Saturday morning, we re-run walk-forward validation: the model is retrained at each month boundary in the holdout horizon and scored on the next month. A model that wins on a single split but degrades sharply month-over-month is overfit. We catch it before promotion.

Accuracy. We report it because users care, but we don't optimize for it. A 53% predictor with positive CLV is a real model. A 60% predictor with negative CLV got lucky.
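Of these, the calibration check is the least standard, so here is a sketch of the bucketing logic, assuming picks arrive as (stated confidence, won) pairs with confidence above 50%.

```python
from collections import defaultdict

def calibration_gaps(picks, bucket_pp: float = 5.0, min_picks: int = 30) -> dict:
    """picks: iterable of (confidence, won) with confidence in (0.5, 1.0].
    Returns {bucket midpoint: |midpoint - actual win rate|} for each 5pp
    bucket above 50% holding at least `min_picks` picks."""
    buckets = defaultdict(list)
    for conf, won in picks:
        idx = int((conf - 0.50) * 100 // bucket_pp)   # 0 -> [50,55), 1 -> [55,60), ...
        buckets[idx].append(1 if won else 0)
    gaps = {}
    for idx, outcomes in buckets.items():
        if len(outcomes) >= min_picks:
            midpoint = 0.50 + (idx + 0.5) * bucket_pp / 100.0
            gaps[midpoint] = abs(midpoint - sum(outcomes) / len(outcomes))
    return gaps
```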
Each Sunday morning, the weekly retraining job stages a candidate model in a versioned registry. The candidate is only promoted to the live serving file if all three promotion criteria hold versus the active model.
Failed candidates do not promote — the active model continues serving, and an alert fires. Manual rollback is one command. The history of every promotion and rejection is preserved in an append-only audit log. The point of the gate isn't to ship faster — it's to make sure that every change improves on a metric we've decided to optimize, not just on the metric the change happens to look good on.
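The specific gate criteria are not enumerated here, so the sketch below uses three illustrative ones (Brier score, calibration gap, walk-forward accuracy) drawn from the metrics this section tracks; it shows the shape of the gate, not the documented rule.

```python
def should_promote(candidate: dict, active: dict) -> bool:
    """Illustrative only: the three criteria below are assumptions drawn from
    the metrics this section tracks, not the documented gate."""
    return (
        candidate["brier"] <= active["brier"]                         # no Brier regression
        and candidate["calibration_gap"] <= active["calibration_gap"]
        and candidate["walk_forward_acc"] >= active["walk_forward_acc"]
    )
```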
Daily monitoring tracks 7-day, 30-day, and all-time CLV plus accuracy and Brier. Alerts fire when the rolling 30-day CLV goes negative on a meaningful sample, when accuracy drops below 53% on 50+ picks, or when data freshness checks fail (model weights older than 14 days, Elo ratings older than 48 hours). The same checks are exposed publicly via /api/v1/health.php for any customer to verify.
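The alert conditions translate directly into code. Thresholds below are the documented ones; the 50-pick floor for a "meaningful sample" and the function shape are assumptions.

```python
from datetime import datetime, timedelta

def health_alerts(clv_30d: float, clv_30d_n: int, acc: float, acc_n: int,
                  weights_at: datetime, elo_at: datetime,
                  now: datetime) -> list[str]:
    """Alert conditions from the monitoring rules above. Thresholds are the
    documented ones; the 50-pick floor for a 'meaningful sample' is assumed."""
    alerts = []
    if clv_30d_n >= 50 and clv_30d < 0:
        alerts.append("rolling 30-day CLV negative")
    if acc_n >= 50 and acc < 0.53:
        alerts.append("accuracy below 53% on 50+ picks")
    if now - weights_at > timedelta(days=14):
        alerts.append("model weights older than 14 days")
    if now - elo_at > timedelta(hours=48):
        alerts.append("Elo ratings older than 48 hours")
    return alerts
```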
Section 7: Real-Odds Backtesting
Unlike many prediction systems that backtest against synthetic odds, Diamond AI validates against 34,378 real historical closing odds from DraftKings, FanDuel, and BetMGM spanning 2010-2025.
| Divergence Threshold | Bets | ROI |
|---|---|---|
| All predictions | 2,430 | -1.2% |
| ≥ 3% divergence | ~820 | +0.4% |
| ≥ 5% divergence | ~410 | +1.7% |
| ≥ 10% divergence | 149 | +8.6% |
The 16-year average across all divergence levels is -3.5% ROI. The sportsbook vig (typically 4.5-5%) means that even a statistically significant accuracy edge does not automatically translate to positive expected value on all bets.
Profitable betting requires selective execution: only wagering when the model's probability diverges significantly from the market. At ≥5% divergence, the model has demonstrated consistent positive ROI.
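Selective execution is a one-line filter once odds are converted to implied probability. A sketch, assuming American-odds inputs:

```python
def implied_prob(american: int) -> float:
    """American odds to implied probability (includes the book's vig)."""
    return -american / (-american + 100) if american < 0 else 100 / (american + 100)

def should_bet(p_model: float, american: int, threshold: float = 0.05) -> bool:
    """Bet only when the model's probability exceeds the market's implied
    probability by at least `threshold` (the >=5% tier in the table above)."""
    return p_model - implied_prob(american) >= threshold

# Example: model says 58% on a side priced +110 (implied 47.6%), a 10.4% edge.
print(should_bet(0.58, 110))   # True
```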
Past performance does not guarantee future results. All predictions are for informational purposes only. This is not financial advice.
Section 8: Automation Pipeline
Diamond AI runs a fully automated pipeline of 20+ cron jobs that continuously collect data, grade predictions, update ratings, detect edges, and retrain the model — with zero manual intervention.
Data Sync
Live scores, odds, and game status from the MLB Stats API. Quick sync every 15 minutes from 11am to midnight; full sync every 6 hours.
Grade Predictions
Compare predictions against final scores. Calculate Brier scores, update accuracy metrics.
Auto-Learn
Analyze prediction accuracy patterns. Write journal entries. Detect systematic biases.
Elo Rebuild
Rebuild team Elo ratings incorporating yesterday's results. Full historical recalculation.
Edge Detection
Identify where model probability diverges from market odds. Generate value bet candidates.
Bankroll Simulation
Run Monte Carlo simulation on betting strategies. Track shadow portfolio performance.
Full Data Sync (6am)
Complete sync of all data sources including Statcast and lineup data.
Elo Rebuild + Edge Detection (7:30am)
Second pass with morning data. Captures any late-night game results.
Arbitrage Detection (7:50am)
Scan cross-book odds for arbitrage opportunities and value plays.
Predictions (8am / 10am / 12pm)
Three prediction windows to capture lineup changes and late scratches. Each run updates probabilities.
Lineup Sync (2pm)
Final lineup confirmation before evening games begin.
Model Retraining
Full logistic regression retraining on the expanded dataset. Multi-run ensemble with new random seeds. Weights written to model_weights.json. Previous model archived.
Accuracy Monitor
Check rolling accuracy windows. Alert if performance degrades below thresholds. Model staleness detection.
Section 9: Differentiators
- Not LLM predictions. A real logistic regression with interpretable weights trained on 10,000+ games. Every coefficient has a mathematical justification.
- 10-year validation window (not just backtested on training data). 56-58% accuracy sustained across 6 independent test seasons.
- 34,378 real historical closing odds from major sportsbooks. Not synthetic or reconstructed. Results are reproducible against actual market prices.
- Every prediction is logged with timestamp, confidence, and graded result. Full history available via API and web interface. No cherry-picking.
- Production-grade API serving predictions, team ratings, historical data, and model metadata. Full OpenAPI documentation. Ready for integration.
- 20+ automated cron jobs handle data collection, grading, Elo updates, edge detection, and weekly retraining. Zero manual intervention required.
- Claude provides natural language analysis, injury context, and narrative explanation alongside the statistical model. The AI adds readability and context; it is not the core predictor. The math drives the numbers.
Section 10: Roadmap
Current: Elo + Pythagorean + Pitcher FIP + Vegas consensus
Target: + game-day lineup OPS, bullpen fatigue index, weather adjustments, park-adjusted Statcast metrics
Includes model weights, training scripts, validation notebooks, and API documentation. Available under NDA for qualified parties.
Last updated: May 15, 2026 at 6:20am CT