Exploring NHL Athlete Longevity w/ Classification
- hlfwatson
- Mar 10
- 8 min read
Updated: Mar 24
Introduction
In this second project, I aimed to develop classification models that predict the longevity of a National Hockey League (NHL) player's career. Specifically, I wanted to determine whether an athlete would remain in the NHL for five or more seasons. I examined the impact of an athlete's performance metrics, such as penalty minutes (PIM), scoring (G, A, TP), and games played (GP). Understanding which features contribute to career longevity can provide valuable insights for general managers, scouts, analysts, and fitness coaches evaluating player development and sustainability under current projections.
Data Collection and Preprocessing
The dataset was collected using Patrick Bacon's TopDownHockey EliteProspects scraper (made possible by ___). The web scraping tool extracts NHL player statistics from EliteProspects.com. The dataset spans the 2003-04 through 2024-25 seasons, including every athlete who appeared in the league over that period. Notably, the historical range of the dataset was determined by the ages of the oldest active NHL athletes: Ryan Suter, Brent Burns, and Marc-André Fleury all began their careers in the 2003-04 season. This extensive timeframe provided a comprehensive dataset for analysis, amounting to several thousand data points.
Before modeling the dataset, I performed extensive data cleaning to ensure accuracy and consistency. First, I replaced null or blank ("-") values with zeroes.
This was necessary because missing values indicate players on long-term injury reserve (LTIR/IR) who were prevented from accumulating ice time or altering their plus-minus statistic. Text-based entries, such as player name and team, were cleaned by removing whitespace and normalizing their format. Many names contained diacritics, which caused inconsistencies when searching and filtering, so I applied the unidecode package to the entire dataset to standardize them. The position column from the scraper output was eliminated due to its frequent inaccuracies; instead, I categorized players under two position classifications to reflect coaching decisions regarding multi-way athletes. Additionally, I removed the league column, since every data point exclusively represented NHL games and athletes. Lastly, I reordered the columns to present key information more intuitively, enhancing readability and aligning with industry-standard formats such as ESPN and NHL.com. These preprocessing steps ensured the dataset was clean, uniform, and well-structured for subsequent modeling.
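As a minimal sketch of the cleaning steps above, the helper below zero-fills the scraper's blank "-" entries and strips whitespace and diacritics from text columns. The column names (`gp`, `g`, `a`, `tp`, `pim`, `+/-`, `player`, `team`) are assumptions about the scraper output, and the stdlib `unicodedata` decomposition stands in for the unidecode package used in the project:

```python
import unicodedata

import pandas as pd


def strip_diacritics(text: str) -> str:
    # Stand-in for unidecode: decompose accented characters (NFKD) and
    # drop the combining marks, e.g. "André" -> "Andre".
    return "".join(c for c in unicodedata.normalize("NFKD", text)
                   if not unicodedata.combining(c))


def clean_skater_stats(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the cleaning steps described above to a raw scraper export."""
    df = df.copy()
    # Treat the scraper's blank "-" entries as missing, then zero-fill:
    # LTIR/IR players accrued no ice time, so 0 is the intended value.
    df = df.replace("-", pd.NA)
    numeric_cols = [c for c in ["gp", "g", "a", "tp", "ppg", "pim", "+/-"]
                    if c in df.columns]
    df[numeric_cols] = (df[numeric_cols]
                        .apply(pd.to_numeric, errors="coerce")
                        .fillna(0))
    # Normalize text columns: strip whitespace and remove diacritics.
    for col in ["player", "team"]:
        if col in df.columns:
            df[col] = df[col].astype(str).str.strip().map(strip_diacritics)
    return df
```

The zero-fill is deliberate rather than a generic imputation: as noted above, these blanks encode "no ice time," not unknown values.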
Understanding and Visualization
To gain a better understanding of the dataset, I performed exploratory data analysis and visualized key trends with statistical methods. Finally, I added a feature importance plot to compare how different attributes contribute to predictions across models.
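As a sketch, the class split shown in the histogram below boils down to labeling each athlete by career length and counting the two groups. The season counts here are hypothetical stand-ins; the real values come from the EliteProspects scrape:

```python
import pandas as pd

# Hypothetical career lengths (in seasons) per player; the real values
# come from the 2003-04 through 2024-25 EliteProspects scrape.
seasons = pd.Series({"Player A": 2, "Player B": 7, "Player C": 12,
                     "Player D": 1, "Player E": 6, "Player F": 9,
                     "Player G": 4, "Player H": 15})

# Binary longevity label used throughout the project.
label = (seasons >= 5).map({True: ">= 5 seasons", False: "< 5 seasons"})
counts = label.value_counts()
print(counts)
```

Plotting `counts` as a bar chart reproduces the kind of distribution shown in the figure, and also surfaces the class imbalance that matters for the models later on.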

Histogram illustrating the distribution of athletes in the NHL with careers lasting less than or more than five seasons.
Modeling
My choice of models revolved around their individual classification strengths and their ability to work with various feature sets. The three chosen were Random Forest, Support Vector Machine, and Gradient Boosting. I expected the SVM model to take the longest to process a dataset of this size.
Random Forest
Random Forest is an ensemble learning technique that uses numerous decision trees to predict a target variable. Each tree is built from a bootstrap sample of the training data and a random subset of the features; this internal resampling happens inside the algorithm and is distinct from the train/test partition a programmer creates with scikit-learn's train_test_split. The final prediction is the majority vote of all the decision trees' outputs.
I chose Random Forest for the dataset's large size and complex feature set. A key advantage is its robustness to missing data, including athletes missing ice time due to physical injury or a healthy scratch (the athlete was well enough to skate, but the team did not dress them). However, Random Forest is less practical for real-time analysis due to its computational expense and slow performance in large-scale implementations. In the context of athlete longevity, the model can achieve strong classification metrics, but its limited interpretability may make it harder to identify the most influential features.
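A minimal sketch of the Random Forest pipeline, using a synthetic stand-in for the real feature matrix (the feature names and the 80/20 class imbalance are assumptions mirroring the results tables later in the post):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data; y = 1 means a career of >= 5 seasons.
features = ["gp", "g", "a", "tp", "ppg", "pim", "+/-",
            "position1", "position2", "team"]
X, y = make_classification(n_samples=2000, n_features=10, n_informative=5,
                           weights=[0.2, 0.8], random_state=42)

# Hold out 20% of players for evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

rf = RandomForestClassifier(n_estimators=200, random_state=42)
rf.fit(X_train, y_train)
pred = rf.predict(X_test)

print(classification_report(y_test, pred))
print(confusion_matrix(y_test, pred))
# Impurity-based importances, analogous to the table in the Analysis section.
print(pd.Series(rf.feature_importances_, index=features)
      .sort_values(ascending=False))
```

The `feature_importances_` attribute is what produces the ranked importance tables shown below, which partly offsets the interpretability concern noted above.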
Support Vector Machine
The next algorithm, Support Vector Machine (SVM), is used for both regression and classification. It is a supervised method because the machine is given the target categories to learn (as are Random Forest and Gradient Boosting). SVM seeks the optimal hyperplane (decision boundary) that separates data points into different classes. The data points closest to the hyperplane, called support vectors, are crucial because they determine the margin; SVM maximizes the margin distance between the hyperplane and the support vectors for better classification performance.
Given nearly 20,000 data points and ten predictive features, our dataset poses a size constraint for the Support Vector Machine (SVM) algorithm. While the algorithm is reliable in high-dimensional spaces and can handle nonlinear relationships, those benefits come at a time cost: a dataset this large makes SVM computationally inefficient and difficult to retrain regularly in a practical scenario. The lack of probabilistic outputs also means extra processing is needed to interpret prediction confidence (see the summary statistics and feature importances below). While SVM may aid in uncovering intricate patterns, its trade-offs could ultimately hurt the model's practicality in this case.
Gradient Boosting
Gradient Boosting (GB) is a popular boosting algorithm for classification and regression. It aggregates multiple 'weak' models, typically shallow decision trees, into a more accurate ensemble. Each new tree focuses on correcting the residual errors of all previous models. More accurate predictions develop over time by iteratively minimizing those residuals.
The Gradient Boosting algorithm becomes a strong learner through this iterative combination of decision trees. By correcting errors at each stage, GB captures subtle patterns in athlete performance that other models might miss, and its strength in handling imbalanced data supports more accurate classification of both categories. Like SVM, Gradient Boosting can be expensive for practical use due to long training times, and deeper trees increase the risk of overfitting. Some implementations handle missing values through surrogate splits, but performance can still degrade if missing data is not properly handled.
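A minimal Gradient Boosting sketch under the same synthetic stand-in data as above; the `n_estimators`, `learning_rate`, and `max_depth` values are illustrative assumptions, not the project's tuned settings:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data; y = 1 means a career of >= 5 seasons.
X, y = make_classification(n_samples=2000, n_features=10, n_informative=5,
                           weights=[0.2, 0.8], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Shallow trees (max_depth=3) are the weak learners; each new tree fits the
# residual errors of the ensemble so far, damped by learning_rate.
gb = GradientBoostingClassifier(n_estimators=200, learning_rate=0.1,
                                max_depth=3, random_state=42)
gb.fit(X_train, y_train)
print(gb.score(X_test, y_test))
```

Lowering `learning_rate` while raising `n_estimators` is the usual lever for trading training time against the overfitting risk mentioned above.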
Analysis
Random Forest
Accuracy: 0.82
Classification Report:
precision recall f1-score support
< 5 years 0.59 0.40 0.48 801
>= 5 years 0.86 0.93 0.89 3188
accuracy 0.82 3989
macro avg 0.73 0.66 0.69 3989
weighted avg 0.81 0.82 0.81 3989
Confusion Matrix:
[[ 319 482]
[ 218 2970]]
Feature Importances:
Feature Importance
0 gp 0.186836
9 team 0.171751
5 pim 0.120807
6 +/- 0.118799
4 ppg 0.089252
3 tp 0.083566
2 a 0.079325
1 g 0.060298
7 position1 0.048999
8 position2 0.040368
Findings:
Random Forest performed well, achieving 0.82 accuracy and 0.93 recall when predicting careers of >= 5 years. The model's performance on careers < 5 years did not succeed at the same rate; its recall is lower (0.40), which drags down the corresponding f1-score. The weighted-average f1-score of 0.81 reflects the balance of performance across both categories.
Support Vector Machine
Accuracy: 0.72
Classification Report:
precision recall f1-score support
< 5 years 0.39 0.74 0.51 801
>= 5 years 0.92 0.71 0.80 3188
accuracy 0.72 3989
macro avg 0.65 0.73 0.66 3989
weighted avg 0.81 0.72 0.74 3989
Confusion Matrix:
[[ 592 209]
[ 911 2277]]
Feature Importance:
Feature Importance
0 gp 0.020055
7 position1 0.016245
8 position2 0.010178
5 pim 0.007145
9 team 0.000827
4 ppg -0.000777
6 +/- -0.001504
1 g -0.012911
2 a -0.013111
3 tp -0.014791
Findings:
There is a noticeable improvement in Support Vector Machine's recall for careers < 5 years (0.74). Despite that gain, SVM exhibited a lower overall accuracy of 0.72. Its precision for the >= 5 years class was higher than Random Forest's (0.92 vs. 0.86), but its recall was roughly 0.20 lower, indicating shortfalls in identifying careers of >= 5 years and an increase in false negatives. The confusion matrix reflects these misclassified >= 5 year careers, partly due to the dataset size and the algorithm's inefficiencies.
Gradient Boosting
Accuracy: 0.80
Classification Report:
precision recall f1-score support
< 5 years 0.52 0.30 0.38 801
>= 5 years 0.84 0.93 0.88 3188
accuracy 0.80 3989
macro avg 0.68 0.61 0.63 3989
weighted avg 0.78 0.80 0.78 3989
Confusion Matrix:
[[ 239 562]
[ 221 2967]]
Feature Importances:
Feature Importance
0 gp 0.568982
3 tp 0.191297
2 a 0.049149
9 team 0.044409
5 pim 0.041878
6 +/- 0.028180
4 ppg 0.027459
7 position1 0.027287
8 position2 0.016644
1 g 0.004716
Findings:
Gradient Boosting achieved summary statistics for careers >= 5 years similar to Random Forest's: 0.84 precision, 0.93 recall, and 0.88 f1-score. The high recall for the >= 5 years category and the low false-negative count in the confusion matrix show this model predicts long careers nearly as well as Random Forest. However, its statistics for careers < 5 years show that it frequently misclassified short careers as long ones.
Overall, Random Forest provided the best trade-off between interpretability and performance, making it the most practical choice for NHL career-longevity classification.
Storytelling
Modeling and visualization helped identify key influences on NHL athletes' career longevity. The most important features across the three models were games played (gp), total points (tp), assists (a), and penalty minutes (pim). The algorithmic outputs lead us to conclude that athletes with more games played and higher point totals, particularly assists, are more likely to have careers of >= 5 years. Athletes with careers < 5 years, however, are harder to predict accurately. The disparities in summary statistics, particularly recall, suggest more variability among short-career athletes; I hypothesize this could be influenced by player type, team dynamics, front-office decisions, and long-term injury reserve. Incorporating such factors could improve my models' predictions for short careers, but Random Forest and Gradient Boosting already provide a reasonable classification of NHL athlete longevity.
Impact
Analytics play a crucial role in modern sports, and the NHL presents one of the most challenging environments for data-driven decision-making. The Stanley Cup is regarded as the hardest trophy to win in professional sports, making predictive algorithms and strategies valuable for optimizing team performance. While fans see the action and player interaction on the ice, data analytics and learning models could have wider implications for athlete scouting, salary negotiations, roster management, and front-office decisions come game time.
However, relying on a tool to make player predictions introduces complications. Surprise performers and late bloomers are more likely to be overlooked by an automated learning system. This raises concerns about the validity of predictive algorithms: teams should not prioritize algorithmic outputs in subjective matters, and should continue to rely on human intuition. Counterbalances must be maintained so that predictive tools enhance decision-making rather than govern it week to week.
Predictive analytics can also represent an athlete's exertion during games and practice. The consistent data from athlete monitors aids in developing dietary, lifestyle, and fitness regimens that optimize performance and recovery periods and reduce injury risk. By analyzing individual workload data, teams can support longer, more sustainable careers for athletes.
While our project demonstrates the potential of expanded machine learning in sports analytics, it is essential to recognize its ethical, social, and economic implications. These tools must be used responsibly to avoid reinforcing biases and to ensure human judgment remains central to decisions. Our ability to maintain fairness and competitive balance in modern professional sports will be shaped by how predictive insights are interpreted.