I don't think I need to sell you on the utility of an expected goals model: a model that estimates the probability of a shot becoming a goal is simply a useful thing to have. I recently put one together, and I'm housing the numbers for the past year and a half on my site (offsidereview.com), so I figured I should explain my methodology. I'd also like to note that nothing I'm doing here is really new; this is all established territory. I highly recommend reading the previous work done on the subject. You don't have to read it all, but if you're unfamiliar with the topic I recommend looking through the links below (I'll also be referring to a few of them by name throughout the post):
- Evaluating NHL Goalies, Skaters, and Teams Using Weighted Shots - Brian Macdonald
- Expected Goals - Emmanuel Perry
- Expected Goals - DTMAboutHeart
- Expected Goals (xG), Uncertainty, and Bayesian Goalies - Cole Anderson
- Moneypuck
- NHL Expected Goals Model - Matthew Barlowe
Model
I probably should have said this in the intro but, just to be clear, I'm attempting to assign each unblocked shot a probability of being a goal. Some models attempt to do this for just shots on goal (SOGs), and since the NHL (for some reason) only records the coordinates of where a shot was blocked, rather than where it was taken from, we can't use blocked shots.
The coordinates are also adjusted for rink bias using the method employed by Schuckers (The code for this adjustment was graciously provided by @OilersNerdAlert).
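For those curious how the adjustment works: the general idea, assuming the standard CDF-matching approach (which is what I believe the Schuckers method boils down to), is to map each rink's recorded coordinates onto the league-wide distribution at the same quantile, so a rink whose scorer systematically records shots closer to the net gets pushed back toward the league norm. A minimal sketch of that idea, not the actual adjustment code:

```python
import numpy as np

def cdf_adjust(values, rink_values, league_values):
    """Quantile-map one coordinate (e.g., x) from a single rink's
    distribution onto the league-wide distribution (CDF matching).
    Hypothetical helper for illustration only."""
    rink_sorted = np.sort(rink_values)
    # Each value's quantile within its home rink's distribution
    q = np.searchsorted(rink_sorted, values) / len(rink_sorted)
    # The league-wide coordinate at that same quantile
    return np.quantile(np.asarray(league_values), np.clip(q, 0, 1))
```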
Here are the features/inputs for my model (a rough sketch of how a few of them are computed follows the list):
1. Distance: Distance of shot from the net
2. Angle: Angle of the shot relative to the center of the net
3. Shot Type: Slap Shot, Snap Shot, Wrist Shot, Deflected, Tip-In, Wrap-Around, Backhand
4. Off Wing: If the player took the shot from his off wing
5. Empty Net: If the net was empty
6. Strength: Game strength for the shooting team (5v5, 4v5, 5v4, 3v3, etc.)
7. Score Category: Score differential for the shooting team, binned from -3 to +3 (everything above +3 counts as +3 and everything below -3 counts as -3)
8. Is Forward: If the shooter is a forward
9. Is Home: If the shooter plays for the home team
10. Distance Change: Distance from previous event
11. Time Elapsed: The difference in time from the last event
12. Angle Change: The change in angle if it's a rebound shot (last event was an SOG <= 2 seconds ago)
13. Previous Event & Team: Whether the previous event was a faceoff (Fac), shot on goal (SOG), blocked/missed shot (Block/Miss), or takeaway/hit (Take/Hit), and which team it was for (I convert giveaways into takeaways for the other team). This is represented by eight dummy variables (the four event groups for each of the two teams).
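To make a few of these concrete, here's a rough sketch of how the geometric features could be computed. The exact definitions are in my code on GitHub; the coordinate conventions below (feet, goal line at x = ±89, center ice at the origin) and the helper functions themselves are just illustrative:

```python
import numpy as np

GOAL_X = 89  # approximate x-coordinate of the goal line (feet), assumed convention

def shot_features(x, y, prev_x, prev_y, prev_event, secs_since_prev):
    """Geometric features for one unblocked shot (illustrative)."""
    dist = np.hypot(GOAL_X - abs(x), y)                      # 1. Distance
    angle = np.degrees(np.arctan2(abs(y), GOAL_X - abs(x)))  # 2. Angle
    dist_change = np.hypot(x - prev_x, y - prev_y)           # 10. Distance Change
    rebound = (prev_event == "SHOT") and (secs_since_prev <= 2)  # rebound per 12.
    return dist, angle, dist_change, rebound

def score_category(score_diff):
    """7. Score Category: bin the differential into [-3, +3]."""
    return int(np.clip(score_diff, -3, 3))
```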
As you can see, and in a similar fashion to Moneypuck, I chose not to model rushes and rebounds explicitly like most other models do. I can't imagine it makes too much of a difference, but this is how I personally like it. I also looked into incorporating shift info into the model like Macdonald does (how long the shooter had been on the ice, and the average shift length for both the shooting team and the opposing team). Some early testing suggested the importance of those features was small, and since it takes a while to calculate the info for every shot, I chose not to include them.
Another thing to note is that I chose not to include "shooting talent" as a model feature. I plan on writing more about this in the near future, so I'll keep it brief here, but I think whether or not to include it as a model feature depends on what you are trying to measure. I also think more care could be taken in how it's calculated.
Training and Testing
I chose to fit the data using Gradient Boosting. I won't go through how it works here (and the details don't really matter for this post), but I encourage you to look into it if you're interested.
The data I used here is the regular season and playoff data from the 2007-2016 seasons. I shuffled the data and used 80% of it for training the model (so the training and testing sets are both random subsets of the full dataset). I then did 10-fold cross-validation on the training set to tune the hyperparameters and create the model.
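In scikit-learn terms, the setup looks roughly like this (the file name, label column, and hyperparameter grid below are placeholders, not my actual values):

```python
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import GradientBoostingClassifier

shots = pd.read_csv("shots_2007_2016.csv")  # placeholder file of unblocked shots
X, y = shots.drop(columns=["is_goal"]), shots["is_goal"]

# Shuffle and hold out 20% of the data as the test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=True, random_state=42)

# 10-fold cross-validation on the training set to tune hyperparameters
grid = {"n_estimators": [100, 300], "max_depth": [3, 5]}
search = GridSearchCV(GradientBoostingClassifier(), grid,
                      cv=10, scoring="neg_log_loss")
search.fit(X_train, y_train)
model = search.best_estimator_
```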
I then tested the model on the test set: using the model I just created, I calculated the probability of each shot in the test set being a goal. To evaluate these predictions, I calculated both the area under the ROC curve (AUC) and the log loss on the test set:
AUC score: .782 (Update: it's actually .781; not much of a difference)
Log Loss: .206
AUC scores range from 0 to 1, with higher being better. Log loss ranges from 0 to infinity, with 0 being perfect. A model that classified each shot randomly would get an AUC score of .5 and a log loss of .693 (that's ln 2, the log loss of always predicting a 50% chance), so the model is definitely better than random. The AUC score is also similar to those calculated by others (see both Macdonald and Anderson), so I'm confident in how the model performs.
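Continuing the sketch above, the evaluation (including the random baseline) looks like this:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, log_loss

p = model.predict_proba(X_test)[:, 1]  # P(goal) for each test-set shot
print("AUC:     ", roc_auc_score(y_test, p))
print("Log loss:", log_loss(y_test, p))

# The "random" baseline: predicting a 50% chance for every shot
# gives AUC = .5 and log loss = -ln(.5) ≈ .693
print("Baseline log loss:", log_loss(y_test, np.full(len(p), 0.5)))
```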
Conclusion
So that's really it. All my code can be found here on GitHub, and the numbers for the past year and a half for goalies, teams, and skaters can be found on my site. Nothing really new was done here, but it's always good to go over your methodology if you make the model outputs public. I'm also sure the model could be improved in certain areas: some inputs could be added (like shooter talent), and some parts could possibly be cleaned up a little (I'm fine with it, and highly doubt it would matter, but I could model each previous event type separately instead of grouping some together; it would just take a little longer to train).