Exploring NHL Athlete Longevity w/ Classification
- hlfwatson
- Mar 10
- 8 min read
Updated: Mar 24
Introduction
In this second project, I aimed to develop classification models that predict the longevity of a National Hockey League (NHL) player's career. Specifically, I wanted to determine whether an athlete would remain in the NHL for five or more seasons. I examined the impact of an athlete's performance metrics, such as penalty minutes (PIM), scoring (G, A, TP), and games played (GP). Understanding which features contribute to career longevity can provide valuable insights for general managers, scouts, analysts, and fitness coaches evaluating player development and sustainability under current projections.
Data Collection and Preprocessing
The dataset was collected using Patrick Bacon's TopDownHockey EliteProspects scraper (made possible by ___). The web scraping tool extracts NHL player statistics from EliteProspects.com. The dataset spans the 2003-04 through 2024-25 seasons, including every athlete who appeared in the league over that period. Notably, the historical range of the dataset was determined by the ages of the oldest active NHL athletes: Ryan Suter, Brent Burns, and Marc-André Fleury all began their careers in the 2003-04 season. This extensive timeframe provided a comprehensive dataset for analysis, amounting to several thousand data points.
Before modeling the dataset, I performed extensive data cleaning to ensure accuracy and consistency. First, I replaced null or blank ("-") values with zeroes.
This was necessary because missing values indicate players on long-term injury reserve (LTIR/IR) who were prevented from accumulating ice time or altering their plus-minus statistic. Text-based entries, such as player name and team, were cleaned by removing whitespace and normalizing their format. Many names contained diacritics, which caused inconsistencies when searching and filtering, so I applied the unidecode package to the entire dataset to standardize them. The position column from the scraper output was eliminated due to its frequent inaccuracies; instead, I categorized players under two position classifications to reflect coaching decisions regarding multi-way athletes. Additionally, I removed the league column, since every data point exclusively represented NHL games and athletes. Lastly, I reordered the columns to present key information more intuitively, enhancing readability and aligning with industry-standard formats such as ESPN and NHL.com. These preprocessing steps ensured the dataset was clean, uniform, and well-structured for subsequent modeling.
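As a minimal sketch of the cleaning steps above, the helper below zero-fills the scraper's blank "-" entries and strips whitespace and diacritics from text columns. The column names (`gp`, `g`, `a`, `tp`, `pim`, `+/-`, `player`, `team`) are assumptions about the scraper output, and the stdlib `unicodedata` decomposition stands in for the unidecode package used in the project:

```python
import unicodedata

import pandas as pd


def strip_diacritics(text: str) -> str:
    # Stand-in for unidecode: decompose accented characters (NFKD) and
    # drop the combining marks, e.g. "André" -> "Andre".
    return "".join(c for c in unicodedata.normalize("NFKD", text)
                   if not unicodedata.combining(c))


def clean_skater_stats(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the cleaning steps described above to a raw scraper export."""
    df = df.copy()
    # Treat the scraper's blank "-" entries as missing, then zero-fill:
    # LTIR/IR players accrued no ice time, so 0 is the intended value.
    df = df.replace("-", pd.NA)
    numeric_cols = [c for c in ["gp", "g", "a", "tp", "ppg", "pim", "+/-"]
                    if c in df.columns]
    df[numeric_cols] = (df[numeric_cols]
                        .apply(pd.to_numeric, errors="coerce")
                        .fillna(0))
    # Normalize text columns: strip whitespace and remove diacritics.
    for col in ["player", "team"]:
        if col in df.columns:
            df[col] = df[col].astype(str).str.strip().map(strip_diacritics)
    return df
```

The zero-fill is deliberate rather than a generic imputation: as noted above, these blanks encode "no ice time," not unknown values.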
Understanding and Visualization
To gain a better understanding of the dataset, I performed exploratory data analysis and visualized key trends with statistical methods. Finally, I added a feature importance plot to compare how different attributes contribute to predictions across models.
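As a sketch, the class split shown in the histogram below boils down to labeling each athlete by career length and counting the two groups. The season counts here are hypothetical stand-ins; the real values come from the EliteProspects scrape:

```python
import pandas as pd

# Hypothetical career lengths (in seasons) per player; the real values
# come from the 2003-04 through 2024-25 EliteProspects scrape.
seasons = pd.Series({"Player A": 2, "Player B": 7, "Player C": 12,
                     "Player D": 1, "Player E": 6, "Player F": 9,
                     "Player G": 4, "Player H": 15})

# Binary longevity label used throughout the project.
label = (seasons >= 5).map({True: ">= 5 seasons", False: "< 5 seasons"})
counts = label.value_counts()
print(counts)
```

Plotting `counts` as a bar chart reproduces the kind of distribution shown in the figure, and also surfaces the class imbalance that matters for the models later on.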

Histogram illustrating the distribution of athletes in the NHL with careers lasting less than or more than five seasons.
Modeling
My choice of models revolved around their individual classification strengths and their ability to work with various feature sets. The three chosen were Random Forest, Support Vector Machine, and Gradient Boosting. I expected the SVM model to take the longest to process a dataset of this size.
Random Forest
Random Forest is an ensemble learning technique that uses numerous decision trees to predict a target variable. Each tree is built from a bootstrap sample of the training data and a random subset of the features; this internal resampling happens inside the algorithm and is distinct from the train/test partition a programmer creates with scikit-learn's train_test_split. The final prediction is the majority vote of all the decision trees' outputs.
I chose Random Forest for the dataset's large size and complex feature set. A key advantage is its robustness to missing data, including athletes missing ice time due to physical injury or a healthy scratch (the athlete was well enough to skate, but the team did not dress them). However, Random Forest is less practical for real-time analysis due to its computational expense and slow performance in large-scale implementations. In the context of athlete longevity, the model can achieve strong classification metrics, but its limited interpretability may make it harder to identify the most influential features.
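A minimal sketch of the Random Forest pipeline, using a synthetic stand-in for the real feature matrix (the feature names and the 80/20 class imbalance are assumptions mirroring the results tables later in the post):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data; y = 1 means a career of >= 5 seasons.
features = ["gp", "g", "a", "tp", "ppg", "pim", "+/-",
            "position1", "position2", "team"]
X, y = make_classification(n_samples=2000, n_features=10, n_informative=5,
                           weights=[0.2, 0.8], random_state=42)

# Hold out 20% of players for evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

rf = RandomForestClassifier(n_estimators=200, random_state=42)
rf.fit(X_train, y_train)
pred = rf.predict(X_test)

print(classification_report(y_test, pred))
print(confusion_matrix(y_test, pred))
# Impurity-based importances, analogous to the table in the Analysis section.
print(pd.Series(rf.feature_importances_, index=features)
      .sort_values(ascending=False))
```

The `feature_importances_` attribute is what produces the ranked importance tables shown below, which partly offsets the interpretability concern noted above.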
Support Vector Machine
The next algorithm, Support Vector Machine (SVM), is used for both regression and classification. It is a supervised method because the machine is given the target categories to learn (as are Random Forest and Gradient Boosting). SVM seeks the optimal hyperplane (decision boundary) that separates data points into different classes. The data points closest to the hyperplane, called support vectors, are crucial because they determine the margin; SVM maximizes the margin distance between the hyperplane and the support vectors for better classification performance.
Given nearly 20,000 data points and ten predictive features, our dataset poses a size constraint for the Support Vector Machine (SVM) algorithm. While the algorithm is reliable in high-dimensional spaces and can handle nonlinear relationships, those benefits come at a time cost: a dataset this large makes SVM computationally inefficient and difficult to retrain regularly in a practical scenario. The lack of probabilistic outputs also means extra processing is needed to interpret prediction confidence (see the summary statistics and feature importances below). While SVM may aid in uncovering intricate patterns, its trade-offs could ultimately hurt the model's practicality in this case.
Gradient Boosting
Gradient Boosting (GB) is a popular boosting algorithm for classification and regression. It aggregates multiple 'weak' models, typically shallow decision trees, into a more accurate ensemble. Each new tree focuses on correcting the residual errors of all previous models. More accurate predictions develop over time by iteratively minimizing those residuals.
The Gradient Boosting algorithm becomes a strong learner through this iterative combination of decision trees. By correcting errors at each stage, GB captures subtle patterns in athlete performance that other models might miss, and its strength in handling imbalanced data supports more accurate classification of both categories. Like SVM, Gradient Boosting can be expensive for practical use due to long training times, and deeper trees increase the risk of overfitting. Some implementations handle missing values through surrogate splits, but performance can still degrade if missing data is not properly handled.
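A minimal Gradient Boosting sketch under the same synthetic stand-in data as above; the `n_estimators`, `learning_rate`, and `max_depth` values are illustrative assumptions, not the project's tuned settings:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data; y = 1 means a career of >= 5 seasons.
X, y = make_classification(n_samples=2000, n_features=10, n_informative=5,
                           weights=[0.2, 0.8], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Shallow trees (max_depth=3) are the weak learners; each new tree fits the
# residual errors of the ensemble so far, damped by learning_rate.
gb = GradientBoostingClassifier(n_estimators=200, learning_rate=0.1,
                                max_depth=3, random_state=42)
gb.fit(X_train, y_train)
print(gb.score(X_test, y_test))
```

Lowering `learning_rate` while raising `n_estimators` is the usual lever for trading training time against the overfitting risk mentioned above.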
Analysis
Random Forest
Accuracy: 0.82
Classification Report:
precision recall f1-score support
< 5 years 0.59 0.40 0.48 801
>= 5 years 0.86 0.93 0.89 3188
accuracy 0.82 3989
macro avg 0.73 0.66 0.69 3989
weighted avg 0.81 0.82 0.81 3989
Confusion Matrix:
[[ 319 482]
[ 218 2970]]
Feature Importances:
Feature Importance
0 gp 0.186836
9 team 0.171751
5 pim 0.120807
6 +/- 0.118799
4 ppg 0.089252
3 tp 0.083566
2 a 0.079325
1 g 0.060298
7 position1 0.048999
8 position2 0.040368
Findings:
Random Forest performed well, achieving 0.82 accuracy and 0.93 recall when predicting careers of >= 5 years. The model's performance on careers < 5 years did not succeed at the same rate; its recall is lower (0.40), which drags down the corresponding f1-score. The weighted-average f1-score of 0.81 reflects the balance of performance across both categories.
Support Vector Machine
Accuracy: 0.72
Classification Report:
precision recall f1-score support
< 5 years 0.39 0.74 0.51 801
>= 5 years 0.92 0.71 0.80 3188
accuracy 0.72 3989
macro avg 0.65 0.73 0.66 3989
weighted avg 0.81 0.72 0.74 3989
Confusion Matrix:
[[ 592 209]
[ 911 2277]]
Feature Importance:
Feature Importance
0 gp 0.020055
7 position1 0.016245
8 position2 0.010178
5 pim 0.007145
9 team 0.000827
4 ppg -0.000777
6 +/- -0.001504
1 g -0.012911
2 a -0.013111
3 tp -0.014791
Findings:
There is a noticeable improvement in Support Vector Machine's recall for careers < 5 years (0.74). Despite that gain, SVM exhibited a lower overall accuracy of 0.72. Its precision for the >= 5 years class was higher than Random Forest's (0.92 vs. 0.86), but its recall was roughly 0.20 lower, indicating shortfalls in identifying careers of >= 5 years and an increase in false negatives. The confusion matrix reflects these misclassified >= 5 year careers, partly due to the dataset size and the algorithm's inefficiencies.
Gradient Boosting
Accuracy: 0.80
Classification Report:
precision recall f1-score support
< 5 years 0.52 0.30 0.38 801
>= 5 years 0.84 0.93 0.88 3188
accuracy 0.80 3989
macro avg 0.68 0.61 0.63 3989
weighted avg 0.78 0.80 0.78 3989
Confusion Matrix:
[[ 239 562]
[ 221 2967]]
Feature Importances:
Feature Importance
0 gp 0.568982
3 tp 0.191297
2 a 0.049149
9 team 0.044409
5 pim 0.041878
6 +/- 0.028180
4 ppg 0.027459
7 position1 0.027287
8 position2 0.016644
1 g 0.004716
Findings:
Gradient Boosting achieved summary statistics for careers >= 5 years similar to Random Forest's: 0.84 precision, 0.93 recall, and 0.88 f1-score. The high recall for the >= 5 years category and the low false-negative count in the confusion matrix show this model predicts long careers nearly as well as Random Forest. However, its statistics for careers < 5 years show that it frequently misclassified short careers as long ones.
Overall, Random Forest provided the best trade-off between interpretability and performance, making it the most practical choice for NHL career-longevity classification.
Storytelling
Modeling and visualization helped identify key influences on NHL athletes' career longevity. The most important features across the three models were games played (gp), total points (tp), assists (a), and penalty minutes (pim). The algorithmic outputs lead us to conclude that athletes with more games played and higher point totals, particularly assists, are more likely to have careers of >= 5 years. Athletes with careers < 5 years, however, are harder to predict accurately. The disparities in summary statistics, particularly recall, suggest more variability among short-career athletes; I hypothesize this could be influenced by player type, team dynamics, front-office decisions, and long-term injury reserve. Incorporating such factors could improve my models' predictions for short careers, but Random Forest and Gradient Boosting already provide a reasonable classification of NHL athlete longevity.
Impact
Analytics play a crucial role in modern sports, and the NHL presents one of the most challenging environments for data-driven decision-making. The Stanley Cup is regarded as the hardest trophy to win in professional sports, making predictive algorithms and strategies valuable for optimizing team performance. While fans see the action and player interaction on the ice, data analytics and learning models could have wider implications for athlete scouting, salary negotiations, roster management, and front-office decisions come game time.
However, relying on a tool to make player predictions introduces complications. Surprise performers and late bloomers are more likely to be overlooked by an automated learning system. This raises concerns about the validity of predictive algorithms: teams should not prioritize algorithmic outputs in subjective matters, and should continue to rely on human intuition. Counterbalances must be maintained so that predictive tools enhance decision-making rather than govern it week to week.
Predictive analytics can also represent an athlete's exertion during games and practice. The consistent data from athlete monitors aids in developing dietary, lifestyle, and fitness regimens that optimize performance and recovery periods and reduce injury risk. By analyzing individual workload data, teams can support longer, more sustainable careers for athletes.
While our project demonstrates the potential of expanded machine learning in sports analytics, it is essential to recognize its ethical, social, and economic implications. These tools must be used responsibly to avoid reinforcing biases and to ensure human judgment remains central to decisions. Our ability to maintain fairness and competitive balance in modern professional sports will be shaped by how predictive insights are interpreted.