Thursday, January 25, 2018

Expected Goals Model

I don't think I need to sell you on the utility of an expected goals model: a model that predicts the probability of a shot being a goal is a good thing to have. I've recently put one together and am housing the numbers for the past year and a half on my site, so I figured I should explain my methodology. I'd also like to note that nothing I'm doing here is really new. This is all established territory. I highly recommend reading the previous work done on the subject (you don't have to read them all through, but if you are unfamiliar with the topic I recommend looking through the links below; I'll also be referring to a few of them by name throughout the post):


I probably should have said this in the intro but, just to be clear, I'm attempting to assign each unblocked shot a probability of being a goal. Some models do so for just SOGs. Blocked shots are left out because the NHL (for some reason) records only the coordinates of where a shot was blocked, not where it was taken, so we can't use Blocks.

The coordinates are also adjusted for rink bias using the method employed by Schuckers (The code for this adjustment was graciously provided by @OilersNerdAlert).

Here are the features/inputs for my model:

1. Distance: Distance of shot from the net
2. Angle: Angle of shot
3. Shot Type: Slap Shot, Snap Shot, Wrist Shot, Deflected, Tip-In, Wrap-Around, Backhand
4. Off Wing: If the player took the shot from his off wing
5. Empty Net: If the net was empty
6. Strength: 5v5, 4x5, 5x5, 3x3...etc. for the shooting team
7. Score Category: Score differential for the shooting team. It spans from -3+ to 3+ (I just bin everything above 3 and below -3)
8. Is Forward: If the shooter is a forward
9. Is Home: If the shooter plays for the home team
10. Distance Change: Distance from previous event
11. Time Elapsed: The difference in time from the last event
12. Angle Change: The change in angle if it's a rebound shot (last event was an SOG <= 2 seconds ago)
13. Previous Event & Team: Whether the previous event was a Fac, Sog, Block/Miss, or a Take/Hit (I changed gives to takes for the other team) and for which team. This is represented by eight dummy variables (the four choices for both teams).
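
As a rough illustration of how a few of these inputs could be derived, here is a minimal sketch. It is not the code behind the model. It assumes coordinates in feet, already flipped so the shooting team attacks a net centred at (89, 0), and it assumes non-rebound shots get an angle change of 0. The event labels and all function names are hypothetical.

```python
import math

# Assumed net location in feet, with every shot flipped to attack this net.
NET_X, NET_Y = 89.0, 0.0

def shot_distance(x, y):
    """Straight-line distance from the shot location to the centre of the net."""
    return math.hypot(NET_X - x, NET_Y - y)

def shot_angle(x, y):
    """Angle in degrees off the line through the net; 0 is straight on, sign follows y."""
    return math.degrees(math.atan2(y - NET_Y, NET_X - x))

def score_category(goal_diff):
    """Bin the shooting team's score differential into the range -3 to 3."""
    return max(-3, min(3, goal_diff))

def angle_change(prev_event, seconds_since, prev_xy, xy):
    """Absolute change in angle for a rebound (previous event an SOG <= 2 s ago), else 0."""
    if prev_event == "SOG" and seconds_since <= 2:
        return abs(shot_angle(*xy) - shot_angle(*prev_xy))
    return 0.0  # assumption: non-rebounds carry no angle change

# Hypothetical raw event labels mapped onto the four groups from the list above.
PREV_GROUPS = {"FAC": "fac", "SOG": "sog", "BLOCK": "block_miss",
               "MISS": "block_miss", "TAKE": "take_hit", "HIT": "take_hit"}

def previous_event_dummies(event, by_shooting_team):
    """Eight 0/1 indicators: four event groups for each of the two teams."""
    if event == "GIVE":
        # A giveaway is treated as a takeaway by the other team.
        event, by_shooting_team = "TAKE", not by_shooting_team
    dummies = {f"prev_{g}_{side}": 0
               for g in ("fac", "sog", "block_miss", "take_hit")
               for side in ("for", "against")}
    group = PREV_GROUPS.get(event)
    if group is not None:
        side = "for" if by_shooting_team else "against"
        dummies[f"prev_{group}_{side}"] = 1
    return dummies
```

For example, `previous_event_dummies("GIVE", True)` sets only the `prev_take_hit_against` indicator, since a giveaway by the shooting team counts as a takeaway for the opponent.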

As you can see, I chose not to model rushes and rebounds explicitly, which is what most other models do. In this I follow Moneypuck. I can't imagine it makes too much of a difference, but this is how I personally like it. I also looked into incorporating shift info into the model like Macdonald (how long the shooter was on the ice and the average shift length for both the Ev. Team and the Opp. Team). Some early testing suggested that the importance of those features was small, and since it takes a while to calculate the info for every shot, I chose not to include them.

Another thing to note is that I chose not to include "shooting talent" as a model feature. I plan on writing more on this in the near future, so I'll keep this brief: I think whether or not to include it depends on what you are trying to measure. I also think more care could be taken in how it's calculated.

Training and Testing

I chose to fit the data using Gradient Boosting. I won't go through how it works (and it doesn't really matter here) but I encourage looking into it if you are interested.

The data I used here is the regular season and playoff data from the 2007-2016 seasons. I shuffled the data, used 80% of it for training the model, and held out the remaining 20% for testing (so the training and testing sets are both random subsets of the total dataset). I then did 10-fold cross-validation on the training set to tune the hyperparameters and create the model.
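
The splitting scheme described above can be sketched in plain Python. This is only an illustration of the protocol, not the code used for the model: the function names, the seed, and the toy size of 1000 shots are assumptions, and the Gradient Boosting fit itself is left as a comment.

```python
import random

def train_test_split_indices(n, train_frac=0.8, seed=0):
    """Shuffle the row indices 0..n-1 and cut them into train and test parts."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    cut = int(n * train_frac)
    return idx[:cut], idx[cut:]

def k_fold(indices, k=10):
    """Yield (fit, validate) index lists; every index is validated exactly once."""
    folds = [indices[i::k] for i in range(k)]
    for i, validate in enumerate(folds):
        fit = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield fit, validate

# Toy example: 1000 shots, 80% for training and 20% held out for testing.
train_idx, test_idx = train_test_split_indices(1000)

fold_sizes = []
for fit_idx, val_idx in k_fold(train_idx, k=10):
    # Here one would fit the model with a candidate hyperparameter setting
    # on fit_idx and score it on val_idx, then average over the ten folds.
    fold_sizes.append((len(fit_idx), len(val_idx)))
```

The test indices are never touched during tuning, so the scores on the held-out 20% are not influenced by the hyperparameter search.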

I then tested the model on the test set. Using the model I just created, I calculated the probability that each shot in the test set was a goal. To evaluate these predictions, I calculated both the area underneath the ROC Curve (AUC) and the Log Loss on the test set:

AUC score:  .782 (Update: Not much but it's actually .781)
Log Loss:     .206 

The AUC score can take values from 0 to 1, with higher being better. Log Loss ranges from 0 to infinity, with 0 being perfect. If we created a model that classified each shot randomly (that is, one that gave every shot a 50% chance of being a goal), we would get an AUC score of .5 and a log loss of .693, which is ln 2. So the model is definitely better than random. The AUC score is also similar to those calculated by others (see both Macdonald and Anderson). So I'm confident with how the model performs.
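
To make the two metrics and the random baseline concrete, here is a small self-contained sketch. The function names are my own, and the pairwise AUC loop is written for clarity rather than speed, so it would be slow on a full season of shots.

```python
import math

def log_loss(y_true, p_pred, eps=1e-15):
    """Mean negative log-likelihood of the 0/1 outcomes under the predictions."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # clip so log(0) never occurs
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

def auc(y_true, p_pred):
    """Chance that a random goal is ranked above a random non-goal; ties count half."""
    goals = [p for y, p in zip(y_true, p_pred) if y == 1]
    non_goals = [p for y, p in zip(y_true, p_pred) if y == 0]
    wins = sum((g > n) + 0.5 * (g == n) for g in goals for n in non_goals)
    return wins / (len(goals) * len(non_goals))

# Toy outcomes: 1 = goal, 0 = no goal.
outcomes = [0, 0, 1, 1]

# Baseline: every shot gets a 50% chance of being a goal.
baseline = [0.5] * len(outcomes)
baseline_auc = auc(outcomes, baseline)            # 0.5
baseline_log_loss = log_loss(outcomes, baseline)  # ln 2, about 0.693

# Predictions that rank most goals above non-goals do better on both.
better = [0.1, 0.4, 0.35, 0.8]
better_auc = auc(outcomes, better)                # 0.75
better_log_loss = log_loss(outcomes, better)
```

Note that the .693 baseline does not depend on how often shots actually go in: predicting 0.5 costs ln 2 on every shot, goal or not.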


So that's really it. All my code can be found here on GitHub, and the numbers for the past year and a half for goalies, teams, and skaters can be found on my site. Nothing really new was done here, but it's always good to go over your methodology if you make the model outputs public. I'm also sure that the model could be improved in certain areas. Some inputs could be added (like shooter talent) and some parts could possibly be cleaned up a little (I'm fine with it, and highly doubt it would matter, but I could model each previous event separately instead of grouping some together, though it would take a little longer to train).


  1. I'm sure others have explained it before, but, other than for simplicity's sake since xG is calculated using Fenwick rather than SOG, why does your adjSV% use Fenwick and not simply SOG? On the surface it seems like that would add noise to the data, as I can't imagine goalies have a ton of influence over how often shooters miss the net (although I can imagine an argument where they do have at least some control over this).

    Is the idea that the sample size will be increased using Fenwick vs SOG, and the "miss%" of each goalie should stabilize over time, meaning that adjFSV% would be just as valid as adjSV%?

    I ask mainly because I notice that Corsica uses dSV% on their site, which uses SOG only, while you and others use Fenwick for your adjSV%. Does adjFSV% have greater repeatability than adjSV%?

    1. In my last sentence, when I say adjSV% I mean using SOG only and not Fenwick, in case that wasn't clear.

    2. Hey, I'm the author. To answer your question it is because the model wasn't made specifically for analyzing goalies. There are a lot of other avenues on the team and player level where an xG model can be used. I guess an SOG based model could be made but, as you mentioned, I believe that goalies can influence miss%.