Technical Whitepaper
A complete technical reference for Diamond AI's statistical prediction engine. This document covers model architecture, training methodology, validation results, and real-market backtesting.
Section 1: Overview
Diamond AI is a quantitative MLB prediction engine that generates calibrated win probabilities for every Major League Baseball game. The system is built on a logistic regression model trained on 10,192 historical games spanning 2020-2024 (real-odds only).
The model was validated on 1,716 held-out 2025 games that were never seen during training, achieving 57.17% accuracy with a Brier score of 0.2398 (a coin flip scores 0.25). This result is statistically significant at p < 10⁻¹¹.
Walk-forward validation across six independent test years (2020-2025) shows consistent performance in the 56-58% accuracy range, confirming the model generalizes across different seasons, rule changes, and competitive environments.
| Metric | Value | Detail |
|---|---|---|
| Holdout accuracy | 57.17% | 1,716 test games |
| Brier score | 0.2398 | coin flip = 0.2500 |
| Statistical significance | p < 10⁻¹¹ | z = 7.00 |
| Walk-forward years | 6 | 2020-2025 |
Section 2: Prediction Pipeline
The prediction pipeline transforms raw data through six sequential stages. Each stage is independently testable, versioned, and produces deterministic output given the same inputs.
1. Data Layer: Retrosheet, Lahman, MLB API, Statcast
2. Feature Engineering: Elo, Pythagorean, pitcher FIP
3. Statistical Model: logistic regression (L2 + ensemble)
4. Calibration: Platt scaling (proprietary parameters)
5. Consensus: 50% model + 50% Vegas
6. Output: calibrated win probability P(win) + confidence score
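Stages 4 and 5 are the most mechanical, so they are the easiest to illustrate. The sketch below shows Platt scaling and the fixed 50/50 consensus blend in Python; the parameters a and b are placeholders, since the production calibration parameters are proprietary.

```python
import math

def platt_scale(p_model: float, a: float = 1.0, b: float = 0.0) -> float:
    """Stage 4: Platt scaling maps the raw model probability to a calibrated one.
    a and b are placeholders; the production parameters are proprietary."""
    logit = math.log(p_model / (1.0 - p_model))
    return 1.0 / (1.0 + math.exp(-(a * logit + b)))

def consensus(p_calibrated: float, p_vegas: float) -> float:
    """Stage 5: fixed 50/50 blend of the calibrated model probability
    and the market-implied (Vegas) probability."""
    return 0.5 * p_calibrated + 0.5 * p_vegas
```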
Section 3: Elo Rating System
Diamond AI maintains a continuous Elo rating for all 30 MLB teams. The system adapts after every game, providing a real-time measure of team strength that captures momentum, roster changes, and form.
| Parameter | Value |
|---|---|
| K-factor | 4 |
| Home advantage | +24 points |
| Margin of victory | sqrt(MOV) |
| Season regression | 25% toward 1500 |
| Initial seeding | Pythagorean W% |
| Optimization | 96-config grid search |
E = 1 / (1 + 10^(-((R_home + HFA) - R_away) / 400))
R' = R + K * sqrt(MOV) * (S - E)

Where E = expected outcome, R = pre-game rating, HFA = home-field advantage (+24), S = actual result (1 = win, 0 = loss), K = update speed (4), MOV = margin of victory in runs.
The Elo parameters were optimized via grid search over 96 configurations (K = {2, 3, 4, 5, 6, 8}, HFA = {16, 20, 24, 28}, MOV = {none, log, sqrt, linear}), cross-validated on three independent holdout years. The selected configuration (K=4, HFA=24, sqrt MOV) minimized Brier score across all validation folds.
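A minimal sketch of the update rule these parameters define, assuming the standard Elo expectation with the home-field advantage added to the home rating; function names are ours, not the production code.

```python
import math

def expected_home_win(r_home: float, r_away: float, hfa: float = 24.0) -> float:
    """E from the formula above: standard Elo expectation with the
    home-field advantage added to the home rating."""
    return 1.0 / (1.0 + 10 ** (-((r_home + hfa) - r_away) / 400.0))

def update_elo(r_home: float, r_away: float, home_won: bool, run_diff: int,
               k: float = 4.0, hfa: float = 24.0) -> tuple[float, float]:
    """One post-game update under the selected configuration
    (K=4, HFA=+24, sqrt margin-of-victory)."""
    e = expected_home_win(r_home, r_away, hfa)
    s = 1.0 if home_won else 0.0
    mov = math.sqrt(max(abs(run_diff), 1))   # sqrt MOV multiplier
    delta = k * mov * (s - e)
    return r_home + delta, r_away - delta    # zero-sum rating exchange
```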
Section 4: Model Features
The model uses three engineered features plus a bias term. Each feature captures a distinct predictive signal. Weights shown are from the trained model (v2.1).
Feature 1: Elo Differential (team strength signal)
What it measures
The difference in team quality based on continuous Elo ratings. Captures overall team strength, recent form, and historical performance in a single number.
Computation
(Elo_home + 24) - Elo_away, normalized to a ~(-3, +3) range; the +24 home advantage is baked into the differential.
Feature 2: Pythagorean Differential (true-talent estimator)
What it measures
Expected win percentage based on runs scored vs. runs allowed. Filters out luck by estimating how many games a team should have won, independent of actual record. The strongest single predictor in the model.
Computation
W%_pyth = RS^1.83 / (RS^1.83 + RA^1.83), taken as the home-away difference. Range: roughly -0.3 to +0.3. The exponent 1.83 comes from Davenport's MLB-optimized formula.
Feature 3: Starter FIP Differential (starting pitcher quality)
What it measures
Fielding Independent Pitching isolates the starting pitcher's true skill from defense and luck. Captures the individual matchup advantage from the day's pitching assignment.
Computation
FIP = (13*HR + 3*(BB + HBP) - 2*K) / IP + C, with the constant C calibrated to league ERA; the feature is FIP_away - FIP_home, so positive = home pitcher advantage. Coverage: 99.3% of games matched.
Feature 4: Bias Term (baseline home advantage)
The learned intercept represents residual home-field advantage after Elo's explicit +24 adjustment. A positive bias indicates the home team wins slightly more often than Elo alone predicts.
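Putting Section 4 together, here is a hedged sketch of the feature computation and the logistic output. The Pythagorean and FIP formulas follow the standard definitions quoted above; the /100 Elo scaling and the FIP constant C are illustrative assumptions, and since the trained v2.1 weights are not published, the example weights are placeholders.

```python
import math

def pythagorean_wpct(rs: float, ra: float, exp: float = 1.83) -> float:
    """Davenport-exponent Pythagorean expectation: RS^1.83 / (RS^1.83 + RA^1.83)."""
    return rs ** exp / (rs ** exp + ra ** exp)

def fip(hr: int, bb: int, hbp: int, k: int, ip: float, c: float = 3.10) -> float:
    """Fielding Independent Pitching. c = 3.10 is a typical value; the production
    constant is calibrated so league-average FIP matches league ERA."""
    return (13 * hr + 3 * (bb + hbp) - 2 * k) / ip + c

def feature_vector(elo_home, elo_away, pyth_home, pyth_away,
                   fip_home_sp, fip_away_sp):
    """The three engineered features. The /100 Elo scaling is illustrative,
    chosen only to land in the ~(-3, +3) range the text describes."""
    return [
        (elo_home + 24 - elo_away) / 100,   # Elo differential, HFA baked in
        pyth_home - pyth_away,              # Pythagorean differential
        fip_away_sp - fip_home_sp,          # positive = home starter advantage
    ]

def predict(x, weights, bias):
    """Logistic regression: P(home win) = sigmoid(bias + w . x)."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))

# Placeholder weights for illustration; the trained v2.1 weights are not published.
x = feature_vector(1550, 1500, 0.55, 0.48, 3.2, 4.1)
print(f"P(home win) = {predict(x, weights=[0.4, 2.0, 0.15], bias=0.05):.3f}")
```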
Section 5: Walk-Forward Validation
Each year is predicted using only data available before that season. The model is never trained on data it later predicts. This is the gold standard for time-series model validation.
| Season | Games | Accuracy | vs. Baseline | Notes |
|---|---|---|---|---|
| 2020 | 898 | 57.9% | +7.9% | 60-game COVID season |
| 2021 | 2,429 | 57.9% | +7.9% | |
| 2022 | 2,430 | 58.8% | +8.8% | Best single year |
| 2023 | 2,430 | 56.4% | +6.4% | |
| 2024 | 2,430 | 56.1% | +6.1% | |
| 2025 | 2,430 | 56.6% | +6.6% | Holdout test set |
| Average | ~2,175 | 57.3% | +7.3% | 6-year mean |
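The protocol above amounts to an expanding-window loop; this is a schematic sketch, with train_fn and eval_fn standing in for the actual training and scoring code.

```python
def walk_forward(games_by_season: dict, train_fn, eval_fn) -> dict:
    """Expanding-window walk-forward: each test season is scored by a model
    trained only on seasons that precede it."""
    results = {}
    seasons = sorted(games_by_season)
    for i, season in enumerate(seasons):
        if i == 0:
            continue                       # the first season has no prior data
        train = [g for s in seasons[:i] for g in games_by_season[s]]
        model = train_fn(train)            # fit only on earlier seasons
        results[season] = eval_fn(model, games_by_season[season])
    return results
```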
Under the null hypothesis that the model is no better than a coin flip (50%), the observed 57.1% accuracy on 2,430 games yields:
| Statistic | Value |
|---|---|
| Observed accuracy | 57.17% |
| Sample size | 1,716 |
| z-score | 7.00 |
| p-value | < 10⁻¹¹ |
| 95% CI | [55.1%, 59.1%] |
| 99% CI | [54.5%, 59.7%] |
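The test itself is a one-sided binomial z-test against the coin-flip null; the snippet below reproduces the z ≈ 7.0 headline from the accuracy and the 2,430-game sample size quoted in the paragraph above.

```python
from math import erf, sqrt

def binomial_z_test(accuracy: float, n_games: int, p0: float = 0.5):
    """One-sided z-test of observed accuracy against the coin-flip null."""
    se = sqrt(p0 * (1 - p0) / n_games)         # standard error under H0
    z = (accuracy - p0) / se
    p_value = 0.5 * (1 - erf(z / sqrt(2)))     # upper-tail normal probability
    return z, p_value

z, p = binomial_z_test(0.5717, 2430)           # figures quoted in the text above
print(f"z = {z:.2f}, p = {p:.1e}")             # z = 7.07, p ~ 8e-13 (< 1e-11)
```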
| Metric | Value |
|---|---|
| Live accuracy | 54.3% |
| Predictions graded | 564 |
| Live Brier score | 0.2481 |
| Current model | v2.1 |
Section 6: Live Performance and Closing Line Value
Validation accuracy of 57.2% on a held-out 2025 season tells you the model can sort wins from losses on a single distribution it never trained on. That's necessary, but it's not sufficient. The MLB betting market is one of the most efficient public-information markets on earth — billions of dollars a year flow through it, and the closing line at first pitch is the consensus of every sharp bettor, every public bettor, every model, and every late-breaking lineup or weather signal. If a prediction system has real edge, the market will eventually move toward its picks; if it doesn't, any apparent accuracy comes from picking with the consensus, not from independent insight.
For every prediction we ship, we record the implied probability of our pick at predict time and again at the closing line. The difference is Closing Line Value. Positive CLV means the market moved toward our pick — late information confirmed our read of the game. Negative CLV means the line moved away from us — the market disagreed, and the market is usually right.
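In code, CLV reduces to a difference of implied probabilities. A minimal sketch, assuming American-odds inputs (the format the books above quote):

```python
def implied_prob(american_odds: int) -> float:
    """American odds to implied probability (vig included)."""
    if american_odds < 0:
        return -american_odds / (-american_odds + 100)
    return 100 / (american_odds + 100)

def closing_line_value(odds_at_predict: int, odds_at_close: int) -> float:
    """CLV in probability points: positive means the close moved toward our pick."""
    return implied_prob(odds_at_close) - implied_prob(odds_at_predict)

# Example: we took the home side at -120 and it closed -135.
print(f"{closing_line_value(-120, -135):+.3f}")   # +0.029: the market moved our way
```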
CLV is reported live on the Performance dashboard: 30-day rolling average, all-time average, and beat-closing percentage. Today's CLV is benchmarked against soft-book closes (FanDuel / DraftKings); a sharp-book reference (Pinnacle / Circa) is on the roadmap. Until that lands, we treat positive soft-book CLV as necessary but not sufficient evidence of edge — a real moat requires beating the sharper books.
Brier score. Mean squared error between predicted probability and actual outcome. Lower is better. Penalizes overconfidence symmetrically with underconfidence. Target: ≤ 0.235.

Calibration. For each 5pp confidence bucket above 50% with at least 30 picks, the absolute gap between the stated confidence midpoint and the actual win rate. Tracks whether "65% confident" actually means 65%.

Walk-forward stability. Every Saturday morning, we re-run walk-forward validation: the model is retrained at each month boundary in the holdout horizon and scored on the next month. A model that wins on a single split but degrades sharply month-over-month is overfit. We catch it before promotion.

Accuracy. We report it because users care, but we don't optimize for it. A 53% predictor with positive CLV is a real model. A 60% predictor with negative CLV got lucky.
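Of these, the calibration check is the least standard, so here is a sketch of the bucketing logic, assuming picks arrive as (stated confidence, won) pairs with confidence above 50%.

```python
from collections import defaultdict

def calibration_gaps(picks, bucket_pp: float = 5.0, min_picks: int = 30) -> dict:
    """picks: iterable of (confidence, won) with confidence in (0.5, 1.0].
    Returns {bucket midpoint: |midpoint - actual win rate|} for each 5pp
    bucket above 50% holding at least `min_picks` picks."""
    buckets = defaultdict(list)
    for conf, won in picks:
        idx = int((conf - 0.50) * 100 // bucket_pp)   # 0 -> [50,55), 1 -> [55,60), ...
        buckets[idx].append(1 if won else 0)
    gaps = {}
    for idx, outcomes in buckets.items():
        if len(outcomes) >= min_picks:
            midpoint = 0.50 + (idx + 0.5) * bucket_pp / 100.0
            gaps[midpoint] = abs(midpoint - sum(outcomes) / len(outcomes))
    return gaps
```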
Each Sunday morning, the weekly retraining job stages a candidate model in a versioned registry. The candidate is only promoted to the live serving file if all three promotion criteria hold versus the active model.
Failed candidates do not promote — the active model continues serving, and an alert fires. Manual rollback is one command. The history of every promotion and rejection is preserved in an append-only audit log. The point of the gate isn't to ship faster — it's to make sure that every change improves on a metric we've decided to optimize, not just on the metric the change happens to look good on.
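The specific gate criteria are not enumerated here, so the sketch below uses three illustrative ones (Brier score, calibration gap, walk-forward accuracy) drawn from the metrics this section tracks; it shows the shape of the gate, not the documented rule.

```python
def should_promote(candidate: dict, active: dict) -> bool:
    """Illustrative only: the three criteria below are assumptions drawn from
    the metrics this section tracks, not the documented gate."""
    return (
        candidate["brier"] <= active["brier"]                         # no Brier regression
        and candidate["calibration_gap"] <= active["calibration_gap"]
        and candidate["walk_forward_acc"] >= active["walk_forward_acc"]
    )
```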
Daily monitoring tracks 7-day, 30-day, and all-time CLV plus accuracy and Brier. Alerts fire when the rolling 30-day CLV goes negative on a meaningful sample, when accuracy drops below 53% on 50+ picks, or when data freshness checks fail (model weights older than 14 days, Elo ratings older than 48 hours). The same checks are exposed publicly via /api/v1/health.php for any customer to verify.
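The alert conditions translate directly into code. Thresholds below are the documented ones; the 50-pick floor for a "meaningful sample" and the function shape are assumptions.

```python
from datetime import datetime, timedelta

def health_alerts(clv_30d: float, clv_30d_n: int, acc: float, acc_n: int,
                  weights_at: datetime, elo_at: datetime,
                  now: datetime) -> list[str]:
    """Alert conditions from the monitoring rules above. Thresholds are the
    documented ones; the 50-pick floor for a 'meaningful sample' is assumed."""
    alerts = []
    if clv_30d_n >= 50 and clv_30d < 0:
        alerts.append("rolling 30-day CLV negative")
    if acc_n >= 50 and acc < 0.53:
        alerts.append("accuracy below 53% on 50+ picks")
    if now - weights_at > timedelta(days=14):
        alerts.append("model weights older than 14 days")
    if now - elo_at > timedelta(hours=48):
        alerts.append("Elo ratings older than 48 hours")
    return alerts
```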
Section 7: Real-Odds Backtesting
Unlike many prediction systems that backtest against synthetic odds, Diamond AI validates against 34,378 real historical closing odds from DraftKings, FanDuel, and BetMGM spanning 2010-2025.
| Divergence Threshold | Bets | ROI |
|---|---|---|
| All predictions | 2,430 | -1.2% |
| ≥ 3% divergence | ~820 | +0.4% |
| ≥ 5% divergence | ~410 | +1.7% |
| ≥ 10% divergence | 149 | +8.6% |
The 16-year average across all divergence levels is -3.5% ROI. The sportsbook vig (typically 4.5-5%) means that even a statistically significant accuracy edge does not automatically translate to positive expected value on all bets.
Profitable betting requires selective execution: only wagering when the model's probability diverges significantly from the market. At ≥5% divergence, the model has demonstrated consistent positive ROI.
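Selective execution is a one-line filter once odds are converted to implied probability. A sketch, assuming American-odds inputs:

```python
def implied_prob(american: int) -> float:
    """American odds to implied probability (includes the book's vig)."""
    return -american / (-american + 100) if american < 0 else 100 / (american + 100)

def should_bet(p_model: float, american: int, threshold: float = 0.05) -> bool:
    """Bet only when the model's probability exceeds the market's implied
    probability by at least `threshold` (the >=5% tier in the table above)."""
    return p_model - implied_prob(american) >= threshold

# Example: model says 58% on a side priced +110 (implied 47.6%), a 10.4% edge.
print(should_bet(0.58, 110))   # True
```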
Past performance does not guarantee future results. All predictions are for informational purposes only. This is not financial advice.
Section 8: Automation Pipeline
Diamond AI runs a fully automated pipeline of 20+ cron jobs that continuously collect data, grade predictions, update ratings, detect edges, and retrain the model — with zero manual intervention.
Data Sync
Live scores, odds, and game status from the MLB Stats API. Quick sync every 15 minutes from 11am to midnight; full sync every 6 hours.
Grade Predictions
Compare predictions against final scores. Calculate Brier scores, update accuracy metrics.
Auto-Learn
Analyze prediction accuracy patterns. Write journal entries. Detect systematic biases.
Elo Rebuild
Rebuild team Elo ratings incorporating yesterday's results. Full historical recalculation.
Edge Detection
Identify where model probability diverges from market odds. Generate value bet candidates.
Bankroll Simulation
Run Monte Carlo simulation on betting strategies. Track shadow portfolio performance.
Full Data Sync (6am)
Complete sync of all data sources including Statcast and lineup data.
Elo Rebuild + Edge Detection (7:30am)
Second pass with morning data. Captures any late-night game results.
Arbitrage Detection (7:50am)
Scan cross-book odds for arbitrage opportunities and value plays.
Predictions (8am / 10am / 12pm)
Three prediction windows to capture lineup changes and late scratches. Each run updates probabilities.
Lineup Sync (2pm)
Final lineup confirmation before evening games begin.
Model Retraining
Full logistic regression retraining on the expanded dataset. Multi-run ensemble with new random seeds. Weights written to model_weights.json. Previous model archived.
Accuracy Monitor
Check rolling accuracy windows. Alert if performance degrades below thresholds. Model staleness detection.
Section 9: Differentiators
- Not LLM predictions. A real logistic regression with interpretable weights trained on 10,000+ games. Every coefficient has a mathematical justification.
- 10-year validation window (not just backtested on training data). 56-58% accuracy sustained across 6 independent test seasons.
- 34,378 real historical closing odds from major sportsbooks. Not synthetic or reconstructed. Results are reproducible against actual market prices.
- Every prediction is logged with timestamp, confidence, and graded result. Full history available via API and web interface. No cherry-picking.
- Production-grade API serving predictions, team ratings, historical data, and model metadata. Full OpenAPI documentation. Ready for integration.
- 20+ automated cron jobs handle data collection, grading, Elo updates, edge detection, and weekly retraining. Zero manual intervention required.
- Claude provides natural language analysis, injury context, and narrative explanation alongside the statistical model. The AI adds readability and context; it is not the core predictor. The math drives the numbers.
Section 10: Roadmap
Current: Elo + Pythagorean + Pitcher FIP + Vegas consensus
Target: + game-day lineup OPS, bullpen fatigue index, weather adjustments, park-adjusted Statcast metrics
Includes model weights, training scripts, validation notebooks, and API documentation. Available under NDA for qualified parties.
Last updated: May 15, 2026 at 6:20am CT