Philosophy

ELO vs OPR: Why We Need Both

A Brief History of ELO

The ELO rating system was developed by physicist Arpad Elo in the 1960s for chess. Today it's used across competitive domains: FIFA world rankings, League of Legends matchmaking, FiveThirtyEight's NFL predictions, and professional esports. The system's power lies in its ability to predict outcomes and adapt based on results.

Why Not Just Use OPR?

If OPR estimates a team's scoring contribution, why can't we just add up OPRs to predict winners?

The Problem: Close Matches

Consider these two outcomes:

Match     Red Score   Blue Score   Result
Match A   200         198          Red Wins
Match B   50          48           Red Wins

OPR sees these as completely different matches (200 vs 50 points). But for winning, they're equally valuable - a 2-point victory either way. ELO captures this: both red alliances get similar rating boosts because both achieved the outcome that matters.

When Each Metric Shines

📊 Use OPR For:

  • Predicting expected scores
  • Evaluating robot hardware capability
  • Alliance selection scouting
  • Identifying high-scoring partners

🎯 Use ELO For:

  • Predicting match winners
  • Measuring competitive success rate
  • Bracket placement and seeding
  • Cross-regional ranking
💡 Key Insight: A team scoring 150 points per match (high OPR) but consistently losing 180-150 will have lower ELO than a team scoring 120 but winning 120-110. OPR says the first team has a better robot; ELO says the second team wins more often. Both are true - they measure different things.
Core Metric

Normalized cELO: The Best of Both Worlds

Normalized Cumulative ELO (cELO) combines competitive success with absolute performance, adjusted for regional strength and meta evolution. It's our most comprehensive single metric for ranking teams globally.

The Three-Level System

Event ELO

Isolated rating from a single event's matches

cELO (Cumulative ELO)

Rating carried across all of a team's matches and events, exponentially weighted toward recent performance

Normalized cELO

cELO adjusted for regional strength and blended with cOPR-based absolute performance

Recency Weighting

Teams improve throughout the season. To reflect current skill rather than historical averages, we apply exponential decay weighting to match importance. Recent matches contribute significantly more to your rating than matches from weeks or months ago.

$$ w(t) = e^{-\lambda \cdot \Delta t} $$

Where Δt is days since the match and λ is the decay parameter. This ensures ratings reflect a team's current skill level.
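
As a rough illustration (the decay rate below is a placeholder, not the production λ), the weight might be computed like this:

```python
import math

def recency_weight(days_since_match: float, decay_rate: float = 0.05) -> float:
    """w(t) = exp(-lambda * delta_t); decay_rate here is an illustrative placeholder."""
    return math.exp(-decay_rate * days_since_match)

print(recency_weight(0))    # today's match: full weight (1.0)
print(recency_weight(30))   # a month-old match: ~0.22 of full weight
```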

The Regional Normalization Challenge

Consider two teams with identical cELO ratings:

  • Team A: Dominates a weaker region (15-0 record, average opponent ELO below the global mean)
  • Team B: Competes in an elite region (8-7 record, average opponent ELO well above the global mean)

Which team is truly stronger? Raw ELO can't distinguish between "big fish in small pond" and "contender among elites."

Hybrid Normalization

Our normalization blends two components to create a globally fair rating:

1. Competitive Component (Majority Weight)

Traditional ELO from win/loss record - measures competitive success

2. Performance Component (Minority Weight)

Based on cOPR relative to global mean - measures absolute robot scoring capability
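
A highly simplified sketch of the blend (the 75/25 split, the 1500 baseline, and the 400-point performance scale are illustrative assumptions, not the production constants):

```python
def normalized_celo(celo: float, team_copr: float, global_mean_copr: float) -> float:
    """Blend competitive ELO (majority weight) with a cOPR-based performance term."""
    competitive_weight = 0.75          # illustrative majority weight
    # Express scoring ability relative to the global mean on an ELO-like scale.
    performance = 1500 + 400 * (team_copr - global_mean_copr) / max(global_mean_copr, 1e-9)
    return competitive_weight * celo + (1 - competitive_weight) * performance
```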

Evolution Scaling

To prevent artificial rating ceilings and account for meta evolution (teams collectively improving as the season progresses), the entire ELO scale adjusts proportionally to global scoring trends.

As teams collectively improve and raise the scoring ceiling, the ELO scale naturally inflates to match. A world-class team today might rate differently than a world-class team from an earlier season due to meta evolution.

Example: Cross-Regional Comparison

A team with a perfect record in a weak region but low scoring ability will be normalized down, while a team with a mediocre record in an elite region but high scoring ability will be normalized up. This enables meaningful cross-region comparisons.

Use Cases

  • Cross-regional team comparisons and world rankings
  • Championship seeding and advancement predictions
  • Identifying underrated teams from highly competitive regions
  • Multi-season historical comparisons despite meta evolution
Performance

Cumulative Offensive Power Rating (cOPR)

While ELO measures ability to win, cOPR measures ability to score points. It isolates an individual team's contribution to alliance scores, with exponentially higher weight given to recent events.

The Alliance Score Problem

FTC matches are 2v2, but we only observe total alliance scores. If Red Alliance (Teams 123 + 456) scores 180 points, how much did each team contribute individually?

Linear System Solution

We model alliance scores as a linear system across many matches:

$$ \text{cOPR}_{\text{Team}_1} + \text{cOPR}_{\text{Team}_2} \approx \text{Alliance Score} $$

Over an event with N teams and M matches, this creates an overdetermined system \( Ax = b \), solved using Weighted Least Squares Regression.
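
A minimal sketch of that setup using numpy, assuming each match is a `(red_teams, red_score, blue_teams, blue_score)` tuple (the function and variable names are illustrative, not the production code):

```python
import numpy as np

def weighted_copr(matches, match_weights, team_index):
    """Estimate each team's scoring contribution via weighted least squares."""
    rows, scores, weights = [], [], []
    for (red, red_score, blue, blue_score), w in zip(matches, match_weights):
        for teams, score in ((red, red_score), (blue, blue_score)):
            row = np.zeros(len(team_index))
            for t in teams:
                row[team_index[t]] = 1.0        # team t played on this alliance
            rows.append(row)
            scores.append(score)
            weights.append(w)                   # recency weight for this match
    A, b = np.array(rows), np.array(scores, dtype=float)
    sw = np.sqrt(np.array(weights))             # sqrt-weights: lstsq minimizes squared residuals
    x, *_ = np.linalg.lstsq(A * sw[:, None], b * sw, rcond=None)
    return {team: x[i] for team, i in team_index.items()}
```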

Time-Weighted Recency

Teams improve throughout the season. To emphasize current performance, recent matches receive significantly higher weight than older ones using exponential decay:

  • Most recent matches: Full weight
  • Older matches: Progressively less influence (exponential decay)

This makes cOPR more predictive of current capability than a simple average across all events.

💡 Why Weighted? A team that scored poorly at their first event but now scores well should be rated closer to their current ability, not dragged down by early-season struggles.
Trend

Momentum

Momentum quantifies the rate of improvement over time. It answers: "Is this team getting better, staying stable, or declining?"

Methodology

We perform Weighted Least Squares regression on match scores over time, with higher weights on recent matches. The slope of the fitted line represents points-per-match improvement rate.

$$ \text{Score}(t) = \beta_0 + \beta_1 \cdot t + \epsilon $$

Where β₁ (the slope) indicates improvement direction:

  • Positive slope: Improving performance
  • Near-zero slope: Stable performance
  • Negative slope: Declining performance

The raw slope is normalized to a 0-100 scale for interpretability, with 50 representing stable (no trend).
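
A minimal sketch of the fit and the 0-100 mapping (the tanh squashing and its scale constant are illustrative assumptions; only the weighted regression itself is part of the definition above):

```python
import numpy as np

def momentum(match_times, scores, weights, scale=2.0):
    """Weighted fit of Score(t) = b0 + b1*t, then map the slope b1 to 0-100 (50 = stable)."""
    X = np.column_stack([np.ones(len(match_times)), np.asarray(match_times, dtype=float)])
    sw = np.sqrt(np.asarray(weights, dtype=float))
    beta, *_ = np.linalg.lstsq(X * sw[:, None], np.asarray(scores, dtype=float) * sw, rcond=None)
    slope = beta[1]                              # points-per-match improvement rate
    return float(np.clip(50 + 50 * np.tanh(slope / scale), 0, 100))
```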

Reliability

Consistency Index

Consistency measures how reliably a team performs near their average. High consistency means few "bad matches," while low consistency indicates volatility.

Mathematical Foundation

Based on the Coefficient of Variation (CV):

$$ CV = \frac{\sigma}{\mu} $$

Where σ is standard deviation and μ is mean score. We invert and scale this to 0-100, where CV = 0 (perfect consistency) maps to 100.
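
A minimal sketch, assuming a simple linear mapping from CV to the 0-100 scale (the exact production scaling may differ):

```python
import numpy as np

def consistency_index(scores) -> float:
    """Coefficient of variation inverted onto 0-100, where CV = 0 maps to 100."""
    scores = np.asarray(scores, dtype=float)
    mu, sigma = scores.mean(), scores.std()
    if mu <= 0:
        return 0.0
    cv = sigma / mu
    return float(max(0.0, 100.0 * (1.0 - cv)))   # illustrative linear mapping
```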

💡 Why It Matters: A team with high variance is riskier for eliminations than a team scoring slightly lower but with tight consistency. Alliance captains should consider this when picking!
Penalties

Foul cOPR

Foul cOPR estimates the average penalty points a team gives to opponents per match. Like scoring OPR, penalties are reported per alliance, so we use the same linear system approach to isolate individual responsibility.

Lower Foul cOPR is better. A team with a high Foul cOPR hands its opponents a significant number of penalty points per match on average - something to watch out for during alliance selection!

Time-Weighted Evolution

Foul cOPR uses the same recency weighting as scoring cOPR. Teams that clean up their driving or fix problematic mechanisms will see rapid improvement in this metric.

Context

Schedule Grade

Schedule Grade measures the strength of opposition a team has faced. It answers: "Did this team earn their record against tough opponents or easy ones?"

How It Works

We calculate the average ELO rating of all opposing alliances a team has faced. This is then converted to a letter grade (A+ through F) based on how it compares to the typical opponent strength across all teams.

Interpretation

  • High Grade (A+/A): Team has faced tough opponents. Their record is "battle-tested."
  • Mid Grade (B/C): Team has faced average competition.
  • Low Grade (D/F): Team has faced weak opponents. Their record may be inflated.
💡 Note: A high schedule grade combined with a winning record is extremely valuable - it means the team has proven themselves against quality opponents.
Context

Strength of Schedule (SOS)

Strength of Schedule is the raw numerical value behind the Schedule Grade. It represents the average ELO rating of opposing alliances faced.

Calculation

For each match, we identify the opposing alliance and average their ELO ratings. The team's SOS is the average of all these opposing alliance ratings across all matches played.

$$ \text{SOS} = \frac{1}{n} \sum_{i=1}^{n} \left( \frac{\text{Opp}_1 \text{ ELO} + \text{Opp}_2 \text{ ELO}}{2} \right)_i $$
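
In code form (a minimal sketch; `elo` is assumed to be a lookup of each team's current rating):

```python
def strength_of_schedule(opposing_alliances, elo) -> float:
    """Average opposing-alliance ELO across all matches played.

    opposing_alliances: one (opponent_1, opponent_2) pair per match
    elo: dict mapping team number -> current ELO rating
    """
    alliance_ratings = [(elo[a] + elo[b]) / 2 for a, b in opposing_alliances]
    return sum(alliance_ratings) / len(alliance_ratings)
```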

Use Cases

  • Contextualizing win/loss records
  • Comparing teams with similar records but different opposition quality
  • Identifying teams that may be underrated due to tough schedules
Ranking Points

RP Reliability

In FTC games like INTO THE DEEP and DECODE, Ranking Points determine tournament seeding. Beyond just winning, teams can earn bonus RPs for achieving specific game objectives. RP Reliability estimates the probability of earning each bonus RP type in the next match.

Bayesian Inference with Recency

We blend three statistical approaches:

  1. Historical Success Rate: Long-term track record
  2. Recency Weighting: Recent matches weighted exponentially higher
  3. Bayesian Smoothing: Prevents overfitting to small samples (e.g., a single success shouldn't mean 100% probability)
$$ P(\text{RP}) = \frac{\sum_{i} w_i \cdot \text{Success}_i + \text{Prior Successes}}{\sum_{i} w_i + \text{Prior Trials}} $$

Where wᵢ are recency weights. This produces robust probabilities that adapt quickly to new strategies (like a new autonomous path) without overreacting to outliers.

Special Case: Never Achieved

If a team has never earned a particular RP in any match, their probability for that RP is forced to 0% - we won't predict they'll suddenly achieve something they've never demonstrated.
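
Putting the formula and the special case together (the prior counts and decay rate below are illustrative assumptions):

```python
import math

def rp_probability(successes, days_ago, prior_successes=1.0, prior_trials=4.0, decay=0.05):
    """Recency-weighted Bayesian estimate of earning a bonus RP in the next match.

    successes: 0/1 outcome per past match; days_ago: days since each of those matches.
    """
    if not any(successes):
        return 0.0                                   # never achieved -> forced to 0%
    weights = [math.exp(-decay * d) for d in days_ago]
    weighted_successes = sum(w * s for w, s in zip(weights, successes))
    return (weighted_successes + prior_successes) / (sum(weights) + prior_trials)
```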

Predictions

Match Win Probability

Given two alliances, what's the probability each alliance wins?

ELO-Based Probability

The probability Alliance A defeats Alliance B follows a logistic curve:

$$ P(A \text{ wins}) = \frac{1}{1 + 10^{(R_B - R_A) / D}} $$

Where R_A and R_B are the alliance ratings (the sum of both teams' Normalized cELOs) and D is a scaling constant.
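
As a sketch (using the familiar chess-style D = 400; the production constant may differ):

```python
def win_probability(red_rating: float, blue_rating: float, d: float = 400.0) -> float:
    """P(red wins) from summed alliance Normalized cELOs, via the logistic curve above."""
    return 1.0 / (1.0 + 10 ** ((blue_rating - red_rating) / d))

# Example: a 100-point rating edge gives roughly a 64% win probability.
print(win_probability(3100, 3000))   # ~0.64
```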

Score Prediction Enhancement

We also estimate expected scores using cOPR and Foul cOPR:

$$ \text{Expected Score}_A = \sum \text{cOPR}_{A} + \sum \text{Foul cOPR}_{B} $$

Alliance A's expected score equals their teams' combined scoring ability plus penalties they'll draw from Alliance B.
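
A minimal sketch of the score model (the numbers below are illustrative):

```python
def expected_score(alliance_coprs, opponent_foul_coprs) -> float:
    """Own scoring contribution plus penalty points the opponents are expected to concede."""
    return sum(alliance_coprs) + sum(opponent_foul_coprs)

red_expected = expected_score([62.5, 48.0], [4.2, 6.1])     # illustrative cOPR / Foul cOPR values
blue_expected = expected_score([55.0, 51.5], [2.0, 3.5])
```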

💡 Two Models, One Prediction: If ELO predicts Red wins but score prediction favors Blue, we flag this as a high-uncertainty match requiring further analysis.
Event Analysis

Event Difficulty

Event Difficulty quantifies how challenging an event is based on the strength of competing teams relative to the current global competition level. This dynamic system adapts throughout the season as teams improve.

Dynamic Percentile-Based Rating

Unlike absolute thresholds, difficulty ratings are calculated using global percentiles that update as the season progresses:

Why Relative Ratings Matter

An event with an average top-8 ELO of 1350 in Week 1 might be rated Elite (10/10) because teams are just starting out. That same 1350 ELO event in Week 20 might only rate Moderate (5/10) because the global competition has improved significantly.

Similarly, a team scoring 50 points in Week 1 could be considered highly competitive, while 60 points in Week 20 might be below average—it's all relative to the current meta.

Calculation Method

The system calculates difficulty by:

  1. Identifying Top Teams: Takes the average ELO of the top 8 teams at the event
  2. Computing Global Percentiles: Calculates current percentiles (p99, p90, p70, p50, etc.) from ALL teams in the season
  3. Mapping to Scale: Compares the event's top-8 average to global percentiles to assign a 1-10 score

Difficulty Scale

Ratings are assigned based on where the event falls in the global distribution:

  • Elite (9-10): Top 1-10% of global competition—championship-caliber field
  • High (7-8): Top 10-30%—very strong regional competition
  • Moderate (4-6): Top 30-70%—typical competitive event
  • Low (2-3): Bottom 30-50%—developing region or early-season event
  • Beginner (1): Bottom 10%—entry-level competition
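
A simplified sketch of the mapping (the percentile breakpoints below approximate the buckets above and are illustrative, not the production cutoffs):

```python
import numpy as np

def event_difficulty(event_elos, all_season_elos) -> int:
    """Map an event's top-8 average ELO onto the 1-10 difficulty scale."""
    top8_avg = np.mean(sorted(event_elos, reverse=True)[:8])
    # Illustrative global percentile breakpoints; clearing more of them means a harder event.
    cutoffs = np.percentile(all_season_elos, [10, 30, 40, 50, 60, 70, 80, 90, 99])
    return 1 + int(np.sum(top8_avg >= cutoffs))      # 1 (Beginner) .. 10 (Elite)
```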
💡 Key Insight: The same teams at the same event could receive different difficulty ratings depending on when the event occurs. This ensures ratings always reflect current competitive context, not arbitrary fixed thresholds.

Applications

  1. Contextualize team performance relative to competition strength
  2. Compare events across different regions and time periods fairly
  3. Predict advancement probabilities for Championship events
  4. Help teams strategize for event selection and preparation
Analysis

Upset Detection

An upset occurs when the predicted loser wins a match. We track upsets to identify matches with unexpected outcomes.

How We Detect Upsets

Before each match, we calculate win probabilities from team ratings. If the underdog (the alliance with the lower win probability) wins and the favorite had a confident prediction, we flag the match as an upset.

Upset Magnitude

Not all upsets are equal. A 45% underdog winning is barely an upset, but a 15% underdog winning is shocking. The magnitude is calculated based on how confident the original prediction was.
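
A minimal sketch of the flagging logic (the 65% confidence cutoff is an illustrative assumption):

```python
def upset_magnitude(favorite_win_prob: float, favorite_won: bool, threshold: float = 0.65):
    """Return an upset magnitude, or None if there was no (meaningful) upset."""
    if favorite_won or favorite_win_prob < threshold:
        return None                         # favorite won, or the match was near a coin flip
    return favorite_win_prob - 0.5          # magnitude grows with pre-match confidence
```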

💡 Why Track Upsets? They reveal which teams overperform under pressure, which matchups are volatile, and where our predictions need improvement.