Philosophy

ELO vs OPR: Why We Need Both

A Brief History of ELO

The ELO rating system was developed by physicist Arpad Elo in the 1960s for chess. Today it's used across competitive domains: FIFA world rankings, League of Legends matchmaking, FiveThirtyEight's NFL predictions, and professional esports. The system's power lies in its ability to predict outcomes and adapt based on results.

Why Not Just Use OPR?

If OPR estimates a team's scoring contribution, why can't we just add up OPRs to predict winners?

The Problem: Close Matches

Consider these two outcomes:

Match     Red Score   Blue Score   Result
Match A   200         198          Red Wins
Match B   50          48           Red Wins

OPR sees these as completely different matches (200 vs 50 points). But for winning, they're equally valuable - a 2-point victory either way. ELO captures this: both red alliances get similar rating boosts because both achieved the outcome that matters.

When Each Metric Shines

📊 Use OPR For:

  • Predicting expected scores
  • Evaluating robot hardware capability
  • Alliance selection scouting
  • Identifying high-scoring partners

🎯 Use ELO For:

  • Predicting match winners
  • Measuring competitive success rate
  • Bracket placement and seeding
  • Cross-regional ranking
💡 Key Insight: A team scoring 150 points per match (high OPR) but consistently losing 180-150 will have lower ELO than a team scoring 120 but winning 120-110. OPR says the first team has a better robot; ELO says the second team wins more often. Both are true - they measure different things.
Core Metric

Normalized cELO: The Best of Both Worlds

Normalized Cumulative ELO (cELO) combines competitive success with absolute performance, adjusted for regional strength and meta evolution. It's our most comprehensive single metric for ranking teams globally.

The Three-Level System

Event ELO

Isolated rating from a single event's matches

cELO (Cumulative ELO)

Rating carried across all of a team's matches and events, exponentially weighted toward recent performance

Normalized cELO

cELO adjusted for regional strength and blended with cOPR-based absolute performance

Recency Weighting

Teams improve throughout the season. To reflect current skill rather than historical averages, we apply exponential decay weighting to match importance. Recent matches contribute significantly more to your rating than matches from weeks or months ago.

$$ w(t) = e^{-\lambda \cdot \Delta t} $$

Where Δt is days since the match and λ is the decay parameter. This ensures ratings reflect a team's current skill level.
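
As a rough illustration (the decay rate below is a placeholder, not the production λ), the weight might be computed like this:

```python
import math

def recency_weight(days_since_match: float, decay_rate: float = 0.05) -> float:
    """w(t) = exp(-lambda * delta_t); decay_rate here is an illustrative placeholder."""
    return math.exp(-decay_rate * days_since_match)

print(recency_weight(0))    # today's match: full weight (1.0)
print(recency_weight(30))   # a month-old match: ~0.22 of full weight
```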

The Regional Normalization Challenge

Consider two teams with identical cELO ratings:

  • Team A: Dominates a weaker region (15-0 record, average opponent ELO below the global mean)
  • Team B: Competes in an elite region (8-7 record, average opponent ELO well above the global mean)

Which team is truly stronger? Raw ELO can't distinguish between "big fish in small pond" and "contender among elites."

Hybrid Normalization

Our normalization blends two components to create a globally fair rating:

1. Competitive Component (Majority Weight)

Traditional ELO from win/loss record - measures competitive success

2. Performance Component (Minority Weight)

Based on cOPR relative to global mean - measures absolute robot scoring capability
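
A highly simplified sketch of the blend (the 75/25 split, the 1500 baseline, and the 400-point performance scale are illustrative assumptions, not the production constants):

```python
def normalized_celo(celo: float, team_copr: float, global_mean_copr: float) -> float:
    """Blend competitive ELO (majority weight) with a cOPR-based performance term."""
    competitive_weight = 0.75          # illustrative majority weight
    # Express scoring ability relative to the global mean on an ELO-like scale.
    performance = 1500 + 400 * (team_copr - global_mean_copr) / max(global_mean_copr, 1e-9)
    return competitive_weight * celo + (1 - competitive_weight) * performance
```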

Evolution Scaling

To prevent artificial rating ceilings and account for meta evolution (teams collectively improving as the season progresses), the entire ELO scale adjusts proportionally to global scoring trends.

As teams collectively improve and raise the scoring ceiling, the ELO scale naturally inflates to match. A world-class team today might rate differently than a world-class team from an earlier season due to meta evolution.

Example: Cross-Regional Comparison

A team with a perfect record in a weak region but low scoring ability will be normalized down, while a team with a mediocre record in an elite region but high scoring ability will be normalized up. This enables meaningful cross-region comparisons.

Use Cases

  • Cross-regional team comparisons and world rankings
  • Championship seeding and advancement predictions
  • Identifying underrated teams from highly competitive regions
  • Multi-season historical comparisons despite meta evolution
Performance

Cumulative Offensive Power Rating (cOPR)

While ELO measures ability to win, cOPR measures ability to score points. It isolates an individual team's contribution to alliance scores, with exponentially higher weight given to recent events.

The Alliance Score Problem

FTC matches are 2v2, but we only observe total alliance scores. If Red Alliance (Teams 123 + 456) scores 180 points, how much did each team contribute individually?

Linear System Solution

We model alliance scores as a linear system across many matches:

$$ \text{cOPR}_{\text{Team}_1} + \text{cOPR}_{\text{Team}_2} \approx \text{Alliance Score} $$

Over an event with N teams and M matches, this creates an overdetermined system \( Ax = b \), solved using Weighted Least Squares Regression.
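
A minimal sketch of that setup using numpy, assuming each match is a `(red_teams, red_score, blue_teams, blue_score)` tuple (the function and variable names are illustrative, not the production code):

```python
import numpy as np

def weighted_copr(matches, match_weights, team_index):
    """Estimate each team's scoring contribution via weighted least squares."""
    rows, scores, weights = [], [], []
    for (red, red_score, blue, blue_score), w in zip(matches, match_weights):
        for teams, score in ((red, red_score), (blue, blue_score)):
            row = np.zeros(len(team_index))
            for t in teams:
                row[team_index[t]] = 1.0        # team t played on this alliance
            rows.append(row)
            scores.append(score)
            weights.append(w)                   # recency weight for this match
    A, b = np.array(rows), np.array(scores, dtype=float)
    sw = np.sqrt(np.array(weights))             # sqrt-weights: lstsq minimizes squared residuals
    x, *_ = np.linalg.lstsq(A * sw[:, None], b * sw, rcond=None)
    return {team: x[i] for team, i in team_index.items()}
```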

Time-Weighted Recency

Teams improve throughout the season. To emphasize current performance, recent matches receive significantly higher weight than older ones using exponential decay:

  • Most recent matches: Full weight
  • Older matches: Progressively less influence (exponential decay)

This makes cOPR more predictive of current capability than a simple average across all events.

💡 Why Weighted? A team that scored poorly at their first event but now scores well should be rated closer to their current ability, not dragged down by early-season struggles.
Trend

Momentum

Momentum quantifies the rate of improvement over time. It answers: "Is this team getting better, staying stable, or declining?"

Methodology

We perform Weighted Least Squares regression on match scores over time, with higher weights on recent matches. The slope of the fitted line represents points-per-match improvement rate.

$$ \text{Score}(t) = \beta_0 + \beta_1 \cdot t + \epsilon $$

Where β₁ (the slope) indicates improvement direction:

  • Positive slope: Improving performance
  • Near-zero slope: Stable performance
  • Negative slope: Declining performance

The raw slope is normalized to a 0-100 scale for interpretability, with 50 representing stable (no trend).
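
A minimal sketch of the fit and the 0-100 mapping (the tanh squashing and its scale constant are illustrative assumptions; only the weighted regression itself is part of the definition above):

```python
import numpy as np

def momentum(match_times, scores, weights, scale=2.0):
    """Weighted fit of Score(t) = b0 + b1*t, then map the slope b1 to 0-100 (50 = stable)."""
    X = np.column_stack([np.ones(len(match_times)), np.asarray(match_times, dtype=float)])
    sw = np.sqrt(np.asarray(weights, dtype=float))
    beta, *_ = np.linalg.lstsq(X * sw[:, None], np.asarray(scores, dtype=float) * sw, rcond=None)
    slope = beta[1]                              # points-per-match improvement rate
    return float(np.clip(50 + 50 * np.tanh(slope / scale), 0, 100))
```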

Reliability

Consistency Index

Consistency measures how reliably a team performs near their average. High consistency means few "bad matches," while low consistency indicates volatility.

Mathematical Foundation

Based on the Coefficient of Variation (CV):

$$ CV = \frac{\sigma}{\mu} $$

Where σ is standard deviation and μ is mean score. We invert and scale this to 0-100, where CV = 0 (perfect consistency) maps to 100.
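
A minimal sketch, assuming a simple linear mapping from CV to the 0-100 scale (the exact production scaling may differ):

```python
import numpy as np

def consistency_index(scores) -> float:
    """Coefficient of variation inverted onto 0-100, where CV = 0 maps to 100."""
    scores = np.asarray(scores, dtype=float)
    mu, sigma = scores.mean(), scores.std()
    if mu <= 0:
        return 0.0
    cv = sigma / mu
    return float(max(0.0, 100.0 * (1.0 - cv)))   # illustrative linear mapping
```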

💡 Why It Matters: A team with high variance is riskier for eliminations than a team scoring slightly lower but with tight consistency. Alliance captains should consider this when picking!
Penalties

Foul cOPR

Foul cOPR estimates the average penalty points a team gives to opponents per match. Like scoring OPR, penalties are reported per alliance, so we use the same linear system approach to isolate individual responsibility.

Lower Foul cOPR is better. A team with a high Foul cOPR hands its opponents a significant number of penalty points per match on average - something to watch out for during alliance selection!

Time-Weighted Evolution

Foul cOPR uses the same recency weighting as scoring cOPR. Teams that clean up their driving or fix problematic mechanisms will see rapid improvement in this metric.

Context

Schedule Grade

Schedule Grade measures the strength of opposition a team has faced. It answers: "Did this team earn their record against tough opponents or easy ones?"

How It Works

We calculate the average ELO rating of all opposing alliances a team has faced. This is then converted to a letter grade (A+ through F) based on how it compares to the typical opponent strength across all teams.

Interpretation

  • High Grade (A+/A): Team has faced tough opponents. Their record is "battle-tested."
  • Mid Grade (B/C): Team has faced average competition.
  • Low Grade (D/F): Team has faced weak opponents. Their record may be inflated.
💡 Note: A high schedule grade combined with a winning record is extremely valuable - it means the team has proven themselves against quality opponents.
Context

Strength of Schedule (SOS)

Strength of Schedule is the raw numerical value behind the Schedule Grade. It represents the average ELO rating of opposing alliances faced.

Calculation

For each match, we identify the opposing alliance and average their ELO ratings. The team's SOS is the average of all these opposing alliance ratings across all matches played.

$$ \text{SOS} = \frac{1}{n} \sum_{i=1}^{n} \left( \frac{\text{Opp}_1 \text{ ELO} + \text{Opp}_2 \text{ ELO}}{2} \right)_i $$
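
In code form (a minimal sketch; `elo` is assumed to be a lookup of each team's current rating):

```python
def strength_of_schedule(opposing_alliances, elo) -> float:
    """Average opposing-alliance ELO across all matches played.

    opposing_alliances: one (opponent_1, opponent_2) pair per match
    elo: dict mapping team number -> current ELO rating
    """
    alliance_ratings = [(elo[a] + elo[b]) / 2 for a, b in opposing_alliances]
    return sum(alliance_ratings) / len(alliance_ratings)
```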

Use Cases

  • Contextualizing win/loss records
  • Comparing teams with similar records but different opposition quality
  • Identifying teams that may be underrated due to tough schedules
Ranking Points

RP Reliability

In FTC games like INTO THE DEEP and DECODE, Ranking Points determine tournament seeding. Beyond just winning, teams can earn bonus RPs for achieving specific game objectives. RP Reliability estimates the probability of earning each bonus RP type in the next match.

Bayesian Inference with Recency

We blend three statistical approaches:

  1. Historical Success Rate: Long-term track record
  2. Recency Weighting: Recent matches weighted exponentially higher
  3. Bayesian Smoothing: Prevents overfitting to small samples (e.g., a single success shouldn't mean 100% probability)
$$ P(\text{RP}) = \frac{\sum_{i} w_i \cdot \text{Success}_i + \text{Prior Successes}}{\sum_{i} w_i + \text{Prior Trials}} $$

Where wᵢ are recency weights. This produces robust probabilities that adapt quickly to new strategies (like a new autonomous path) without overreacting to outliers.

Special Case: Never Achieved

If a team has never earned a particular RP in any match, their probability for that RP is forced to 0% - we won't predict they'll suddenly achieve something they've never demonstrated.
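
Putting the formula and the special case together (the prior counts and decay rate below are illustrative assumptions):

```python
import math

def rp_probability(successes, days_ago, prior_successes=1.0, prior_trials=4.0, decay=0.05):
    """Recency-weighted Bayesian estimate of earning a bonus RP in the next match.

    successes: 0/1 outcome per past match; days_ago: days since each of those matches.
    """
    if not any(successes):
        return 0.0                                   # never achieved -> forced to 0%
    weights = [math.exp(-decay * d) for d in days_ago]
    weighted_successes = sum(w * s for w, s in zip(weights, successes))
    return (weighted_successes + prior_successes) / (sum(weights) + prior_trials)
```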

Predictions

Match Win Probability

Given two alliances, what's the probability each alliance wins?

ELO-Based Probability

The probability Alliance A defeats Alliance B follows a logistic curve:

$$ P(A \text{ wins}) = \frac{1}{1 + 10^{(R_B - R_A) / D}} $$

Where R_A and R_B are the alliance ratings (the sum of both teams' Normalized cELOs) and D is a scaling constant.
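
As a sketch (using the familiar chess-style D = 400; the production constant may differ):

```python
def win_probability(red_rating: float, blue_rating: float, d: float = 400.0) -> float:
    """P(red wins) from summed alliance Normalized cELOs, via the logistic curve above."""
    return 1.0 / (1.0 + 10 ** ((blue_rating - red_rating) / d))

# Example: a 100-point rating edge gives roughly a 64% win probability.
print(win_probability(3100, 3000))   # ~0.64
```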

Score Prediction Enhancement

We also estimate expected scores using cOPR and Foul cOPR:

$$ \text{Expected Score}_A = \sum \text{cOPR}_{A} + \sum \text{Foul cOPR}_{B} $$

Alliance A's expected score equals their teams' combined scoring ability plus penalties they'll draw from Alliance B.
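
A minimal sketch of the score model (the numbers below are illustrative):

```python
def expected_score(alliance_coprs, opponent_foul_coprs) -> float:
    """Own scoring contribution plus penalty points the opponents are expected to concede."""
    return sum(alliance_coprs) + sum(opponent_foul_coprs)

red_expected = expected_score([62.5, 48.0], [4.2, 6.1])     # illustrative cOPR / Foul cOPR values
blue_expected = expected_score([55.0, 51.5], [2.0, 3.5])
```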

💡 Two Models, One Prediction: If ELO predicts Red wins but score prediction favors Blue, we flag this as a high-uncertainty match requiring further analysis.
Event Analysis

Event Difficulty

Event Difficulty quantifies how challenging an event is based on the strength of competing teams relative to the current global competition level. This dynamic system adapts throughout the season as teams improve.

Dynamic Percentile-Based Rating

Unlike absolute thresholds, difficulty ratings are calculated using global percentiles that update as the season progresses:

Why Relative Ratings Matter

An event with an average top-8 ELO of 1350 in Week 1 might be rated Elite (10/10) because teams are just starting out. That same 1350 ELO event in Week 20 might only rate Moderate (5/10) because the global competition has improved significantly.

Similarly, a team scoring 50 points in Week 1 could be considered highly competitive, while 60 points in Week 20 might be below average—it's all relative to the current meta.

Calculation Method

The system calculates difficulty by:

  1. Identifying Top Teams: Takes the average ELO of the top 8 teams at the event
  2. Computing Global Percentiles: Calculates current percentiles (p99, p90, p70, p50, etc.) from ALL teams in the season
  3. Mapping to Scale: Compares the event's top-8 average to global percentiles to assign a 1-10 score

Difficulty Scale

Ratings are assigned based on where the event falls in the global distribution:

  • Elite (9-10): Top 1-10% of global competition—championship-caliber field
  • High (7-8): Top 10-30%—very strong regional competition
  • Moderate (4-6): Top 30-70%—typical competitive event
  • Low (2-3): Bottom 30-50%—developing region or early-season event
  • Beginner (1): Bottom 10%—entry-level competition
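
A simplified sketch of the mapping (the percentile breakpoints below approximate the buckets above and are illustrative, not the production cutoffs):

```python
import numpy as np

def event_difficulty(event_elos, all_season_elos) -> int:
    """Map an event's top-8 average ELO onto the 1-10 difficulty scale."""
    top8_avg = np.mean(sorted(event_elos, reverse=True)[:8])
    # Illustrative global percentile breakpoints; clearing more of them means a harder event.
    cutoffs = np.percentile(all_season_elos, [10, 30, 40, 50, 60, 70, 80, 90, 99])
    return 1 + int(np.sum(top8_avg >= cutoffs))      # 1 (Beginner) .. 10 (Elite)
```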
💡 Key Insight: The same teams at the same event could receive different difficulty ratings depending on when the event occurs. This ensures ratings always reflect current competitive context, not arbitrary fixed thresholds.

Applications

  1. Contextualize team performance relative to competition strength
  2. Compare events across different regions and time periods fairly
  3. Predict advancement probabilities for Championship events
  4. Help teams strategize for event selection and preparation
Analysis

Upset Detection

An upset occurs when the predicted loser wins a match. We track upsets to identify matches with unexpected outcomes.

How We Detect Upsets

Before each match, we calculate win probabilities from team ratings. If the underdog (the alliance with the lower win probability) wins and the favorite had a confident prediction, we flag the match as an upset.

Upset Magnitude

Not all upsets are equal. A 45% underdog winning is barely an upset, but a 15% underdog winning is shocking. The magnitude is calculated based on how confident the original prediction was.
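
A minimal sketch of the flagging logic (the 65% confidence cutoff is an illustrative assumption):

```python
def upset_magnitude(favorite_win_prob: float, favorite_won: bool, threshold: float = 0.65):
    """Return an upset magnitude, or None if there was no (meaningful) upset."""
    if favorite_won or favorite_win_prob < threshold:
        return None                         # favorite won, or the match was near a coin flip
    return favorite_win_prob - 0.5          # magnitude grows with pre-match confidence
```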

💡 Why Track Upsets? They reveal which teams overperform under pressure, which matchups are volatile, and where our predictions need improvement.