
NHL Athlete Total Points Based on Performance

Updated: Apr 10

As the National Hockey League begins each October, one aspect of an athlete's next six months remains consistent: the number of games, 82. Since the 1995-96 season, NHL teams have been expected to play 82 games over the roughly six months of the regular season, not including playoff games. Games postponed by unexpected weather or other cancellations are made up later in the regular season or, in rare cases, forfeited. From my past projects working with NHL player data, there is an evident, if hard-to-explain, pattern that an athlete's season and career follow: most athletes, and teams themselves, cycle through stretches of very good and very bad performance. This streakiness leaves models prone to overfitting, projecting athletes with a few strong weekly performances to an inflated point pace for the season.

In this project, we aim to develop a regression model to predict NHL player performance from various statistical performance metrics. Specifically, I am interested in predicting a player's total points (tp), the sum of goals (g) and assists (a). Interpreting the results helps us understand which performance metrics contribute most to an athlete's total points. Athlete performance is crucial for coaches, scouts, and athletic trainers making informed decisions about organizational changes, and statistical modeling can aid in identifying patterns and making data-driven predictions about athlete contributions.

Regression

Regression is a statistical analysis method fundamental to modeling: it captures the relationship between a dependent variable and one or more independent variables. When it is first taught, the independent variable is described as what is changed, while the dependent variable is what is measured. For our purposes, we will use linear regression to predict an athlete's total points (tp) from input features such as goals, assists, and games played.

Linear regression relies on several assumptions for accurate predictions: the relationship between the independent and dependent variables is linear (linearity); observations are independent of one another (independence); the variance of the residuals is constant across levels of the independent variables (homoscedasticity); the independent variables are not highly correlated with one another (no multicollinearity); and the residuals are normally distributed (normality). The model estimates the best-fit line by minimizing the sum of squared residuals, the differences between observed and predicted tp. Common metrics for evaluating a linear regression are the coefficient of determination (R^2), mean squared error (MSE), and mean absolute error (MAE).
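
For reference, with observed values y_i, predicted values ŷ_i, their mean ȳ, and n players, these metrics take their standard forms:

MAE = (1/n) * Σ |y_i − ŷ_i|
MSE = (1/n) * Σ (y_i − ŷ_i)^2
R^2 = 1 − Σ (y_i − ŷ_i)^2 / Σ (y_i − ȳ)^2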

Experiment One: Data Understanding

Before data cleaning and preprocessing, we performed exploratory data analysis (EDA) to gain insight into the dataset's variables. We computed descriptive statistics for the performance metrics, which we defined as the game-contributing statistics: goals, assists, total points, games played, points per game, penalty minutes, and plus/minus rating. Finally, we checked each variable for missing values to identify potential gaps.

# Check for missing values: nhl_final_df.isnull().sum()
gp     0
g      0
a      0
tp     0
ppg    0
pim    0
+/-    0
dtype: int64
# Explore the performance stats: gp, g, a, tp, ppg, pim, +/-

nhl_final_df[['gp', 'g', 'a', 'tp', 'ppg', 'pim', '+/-']].describe()

        gp        g          a          tp         ppg        pim        +/-
count   904       904        904        904        904        904        904
unique  77        41         62         88         119        84         72
top     0.983686  -0.876247  -0.982878  -0.996877  -1.223664  -1.031831  0.049031
freq    97        193        143        117        117        120        97

Next, we created a correlation matrix and visualized it as a heatmap to better understand the relationships and contributions among variables. From this analysis, we concluded that total points (tp) correlates most strongly with goals (g), assists (a), and points per game (ppg). This makes sense: goals and assists contribute directly to total points, while points per game measures scoring efficiency. As a final step in exploring the data, I checked the value counts of each one-hot encoded team column. These counts give the number of players per team and confirm that encoding did not eliminate teams or collapse categories.
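
A minimal sketch of this step, assuming the frame is named nhl_final_df as in the snippets above and that the one-hot team columns carry the default 'team_' prefix from pandas (an assumption, not confirmed by the original code):

# Correlation heatmap over the performance metrics
import matplotlib.pyplot as plt
import seaborn as sns

perf_cols = ['gp', 'g', 'a', 'tp', 'ppg', 'pim', '+/-']
corr = nhl_final_df[perf_cols].corr()
sns.heatmap(corr, annot=True, fmt='.2f', cmap='coolwarm')
plt.title('Correlation of NHL performance metrics')
plt.show()

# Player counts per one-hot encoded team column ('team_' prefix assumed)
team_cols = [c for c in nhl_final_df.columns if c.startswith('team_')]
print(nhl_final_df[team_cols].sum())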



Data Cleaning & Preprocessing

Following the guidelines of Project Three, I was eager to implement experimental models on the data, although cleaning and converting the performance data was daunting. The team variable is categorical, yet there is numeric signal associated with groups of players on a hot streak, and I was initially unsure how to approach this. I began by one-hot encoding the categorical variables: team, position1, and position2. Unnecessary columns, such as the links, were dropped to keep the dataset legible and minimal. Missing values in the performance columns (gp, g, a, tp, etc.) indicate that there was nothing to record (i.e., the player did not contribute), so they were filled with zeroes. Beyond that, we standardized the numerical features for uniformity across different scales. To remove 'totals' as a team, we created a dictionary mapping traded athletes' names to their current teams and applied it to the team column, replacing 'totals', which appeared as the team for traded athletes' aggregate performance statistics. Finally, we checked that all relevant columns were properly formatted and contained no duplicate entries. Collectively, these steps enhanced dataset integrity and will allow more accurate regression modeling.
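
A minimal sketch of these cleaning steps, assuming the raw frame is loaded as nhl_df with 'player', 'team', and 'link' columns, and that traded_teams is the hypothetical {player: current team} dictionary described above; none of these names are confirmed by the original code:

import pandas as pd
from sklearn.preprocessing import StandardScaler

# Replace the aggregate 'totals' rows with each traded player's current team
mask = nhl_df['team'] == 'totals'
nhl_df.loc[mask, 'team'] = nhl_df.loc[mask, 'player'].map(traded_teams)

# Drop link columns and fill missing performance stats with zeroes
nhl_df = nhl_df.drop(columns=['link'])
stat_cols = ['gp', 'g', 'a', 'tp', 'ppg', 'pim', '+/-']
nhl_df[stat_cols] = nhl_df[stat_cols].fillna(0)

# One-hot encode categoricals, then standardize the numeric features
nhl_final_df = pd.get_dummies(nhl_df, columns=['team', 'position1', 'position2'])
nhl_final_df[stat_cols] = StandardScaler().fit_transform(nhl_final_df[stat_cols])

# Final integrity check: no duplicate rows
assert not nhl_final_df.duplicated().any()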

Experiment One: Results

In our first experiment, we trained a linear regression model on the preprocessed NHL athlete dataset. The dataset was split into training and testing sets using an 80/20 ratio to ensure a controlled evaluation. The model's performance was evaluated with Mean Absolute Error (MAE), Mean Squared Error (MSE), and the coefficient of determination (R^2).
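
A sketch of this setup, assuming the preprocessed frame nhl_final_df from above; the random_state is an illustrative choice, not taken from the original code:

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

X = nhl_final_df.drop(columns=['tp'])  # all features; target excluded
y = nhl_final_df['tp']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_test)

print(f"Mean Absolute Error: {mean_absolute_error(y_test, pred):.10f}")
print(f"Mean Squared Error: {mean_squared_error(y_test, pred):.10f}")
print(f"R-squared: {r2_score(y_test, pred):.10f}")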

EXPERIMENT ONE RESULTS :
Mean Absolute Error: 0.0000000000
Mean Squared Error: 0.0000000000
R-squared: 1.0000000000

Interpreted in the context of NHL data, the output reads as follows. The perfect scores show that our target variable is too closely related to other variables in the feature set. Looking back at the correlation heatmap, tp was trivially predictable as a target because tp = g + a, so its correlations with g and a are near-perfect (roughly 0.9 and above). The linear regression recognized this relationship in the scaled values and learned the direct identity rather than finding meaningful patterns.

Experiment Two

To address this correlation, I removed goals (g) and assists (a) from the feature set used to predict total points (tp). We aim to predict an athlete's total point production from the remaining features, such as penalty minutes (pim), points per game (ppg), and plus/minus rating, without the two strongest predictors, goals and assists. A linear regression model was trained on the revised features to provide a baseline independent of the components of total points. The dataset split and evaluation metrics remained the same as in Experiment One.
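
The only change from the Experiment One sketch is dropping the leaking features before the split:

# Drop the target's components along with the target itself
X = nhl_final_df.drop(columns=['tp', 'g', 'a'])
y = nhl_final_df['tp']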

EXPERIMENT TWO RESULTS (TRAIN):
Mean Absolute Error: 0.2058943171
Mean Squared Error: 0.0868641265
R-squared: 0.9143840594
EXPERIMENT TWO RESULTS (TEST):
Mean Absolute Error: 0.2120129867
Mean Squared Error: 0.0743256826
R-squared: 0.9210645862

Comparing the output to Experiment One, the original feature set achieved perfect values due to multicollinearity and feature leakage, driven by goals and assists, the two components that define total points. Experiment Two provided a more realistic model by removing those features, while still explaining roughly 91% of the variance in total points on the training set and 92% on the test set. As a reminder, the coefficient of determination, R-squared, describes how well a model explains its outcome. The test-set R-squared was slightly higher than the training-set value, suggesting the model generalizes well. The model also posted a low mean absolute error (MAE) on the test set, 0.2120, indicating that predictions deviated from actual values by about 0.21 on average (in standardized units); the mean squared error (MSE) was correspondingly low at 0.0743. Compared to Experiment One, this model was far more successful in honestly predicting and explaining variation in the data.

In the next experiment, I look to implement an alternative regression technique, such as Ridge Regression.

Experiment Three

Using the same splits, feature set, and evaluation metrics as Experiment Two, I implemented a Ridge Regression model on the NHL player data. Unlike ordinary linear regression, Ridge adds a regularization term that constrains coefficient magnitudes, which helps prevent overfitting, improves model stability, and should dampen any remaining multicollinearity among features such as ppg.
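
A sketch of the swap, reusing the Experiment Two split; alpha=1.0 is scikit-learn's default and an assumption here, since the text does not state the value used:

from sklearn.linear_model import Ridge

# Ridge minimizes ||y - Xb||^2 + alpha * ||b||^2, shrinking coefficients
ridge = Ridge(alpha=1.0).fit(X_train, y_train)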

EXPERIMENT THREE RESULTS (TRAIN):
Mean Absolute Error: 0.2049989349
Mean Squared Error: 0.0867066012
R-squared: 0.9145393211
EXPERIMENT THREE RESULTS (TEST):
Mean Absolute Error: 0.2097645706
Mean Squared Error: 0.0736002004
R-squared: 0.9218350634

The results are very similar to linear regression's, aside from small discrepancies in the thousandths place. These small differences are explained by ridge regression reinforcing the same patterns as the linear model within a more constrained coefficient space. On the test set, Ridge Regression did outperform the linear regression evaluation metrics.

Test Evaluation:        Ridge   ||  Linear
Mean Absolute Error:    0.2098  ||  0.2120
Mean Squared Error:     0.0736  ||  0.0743
R-squared:              0.9218  ||  0.9211

The performance on the training set is nearly identical, again with small discrepancies in the thousandths decimal places. Overall, the metrics suggest that regularization in Ridge improves the model's generalization on unseen data.

Experiment Four

Finally, I varied small parameters in the Ridge Regression model to gain insight into their effects on the evaluation metrics. For example, I increased the test size in the train/test split. The effect: the training set's Mean Absolute Error and Mean Squared Error rose by about 0.001, while the test-set metrics shifted by roughly 0.01-0.05; both MAE and MSE increased. With less information available to detect patterns, this error response is explained by the smaller training set. Meanwhile, the training R-squared increased by 0.0004 and the testing R-squared by 0.0048. While there is little to firmly explain the difference, we could hypothesize that it reflects reduced overfitting.
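
A sketch of the sweep, reusing X and y from the Experiment Two sketch; the test sizes follow the text, while alpha and random_state are illustrative assumptions:

from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

for test_size in (0.20, 0.25):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=test_size, random_state=42)
    ridge = Ridge(alpha=1.0).fit(X_tr, y_tr)
    print(f"test_size={test_size}: "
          f"train R^2 = {r2_score(y_tr, ridge.predict(X_tr)):.4f}, "
          f"test R^2 = {r2_score(y_te, ridge.predict(X_te)):.4f}")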

Results from Changing Test Size
## Test Size = 0.25
EXPERIMENT FOUR RESULTS (TRAIN):
Mean Absolute Error: 0.2061009562
Mean Squared Error: 0.0869992602
R-squared: 0.9142749272
EXPERIMENT FOUR RESULTS (TEST):
Mean Absolute Error: 0.2014805178
Mean Squared Error: 0.0701449659
R-squared: 0.9254222851
## Test Size = 0.2
EXPERIMENT FOUR RESULTS (TRAIN):
Mean Absolute Error: 0.2061009562
Mean Squared Error: 0.0869992602
R-squared: 0.9142749272
EXPERIMENT FOUR RESULTS (TEST):
Mean Absolute Error: 0.2014805178
Mean Squared Error: 0.0701449659
R-squared: 0.9254222851

Impact

Looking ahead to possible applications, the ability to predict NHL athlete performance can inform coaches' lineup decisions, contract negotiations, and scouting reports. A consistent, reliable model that benchmarks athletes against their potential could improve individual and team performance. A data-driven approach to talent evaluation can help teams recruit productive athletes, but it risks crowding out traditional scouting assessments. Finding a healthy balance between intuition-based and data-driven approaches can create a front office that is well informed about its athletes and prospect pools.

As mentioned above, predictive modeling can introduce biases when the available data is limited. There is only so much generalization and extrapolation that can be accurately interpreted when the data is not comprehensive; in this case, the data was collected with nearly fifteen games left in the regular season, before any playoff games. Furthermore, some athletes play incredibly well in team settings yet do not often capitalize on scoring chances (e.g., defensemen produce lower point totals because playing offense is not their job). Ensuring fairness in player evaluation is vital, and future modeling should consider mitigating these potential biases.

Finally, the results of Experiment One show how easily predictive statistical modeling can suffer from leakage and overfitting. In our case of multicollinearity, I did not consider how severely the high correlation would affect the modeling until after evaluation. As a cautionary tale, teams should double-check the meaning and math behind R-squared values to avoid misleading conclusions: a suspiciously high R-squared often indicates an underlying relationship, in this case multicollinearity with the target.

Conclusion

This project taught me valuable lessons about data preprocessing, exploratory analysis, model evaluation, and the applicability of regression techniques to predicting statistical metrics. Experiment One demonstrated the inadvertent effects of feature variables that correlate with the target: goals and assists were too closely related to total points, since tp is their sum. Feature variables must be selected carefully to avoid reinforcing multicollinear relationships of this kind. The correlation led to perfect performance metrics (MAE, MSE, and R-squared) that did not accurately describe the linear regression model I was trying to build. In Experiment Two, I removed these variables and obtained a much more realistic model; performance became interpretable, with an R-squared of ~0.921 indicating good predictive accuracy.

Experiment Three applied a different regression technique to the preprocessed data, Ridge Regression, which introduces regularization that nudges the model toward smaller coefficients to reduce overfitting. This creates a trade-off: shrinking coefficients reduces variance, but at the cost of flexibility (increased bias). Though it performed similarly, Ridge reduced the errors on the test set. Lastly, Experiment Four changed the parameters of the previous experiment; I observed the effects of changing the test size as well as the alpha. Increasing the test size resulted in higher training error metrics, likely because there is less information in the training set to learn from, while the R-squared values rose slightly, suggesting improved generalization. These observations highlighted the trade-offs in model selection and the importance of training data size for generalization. Ultimately, I enjoyed implementing regression on this dataset and exploring models under a topic of personal interest.
