I recently developed and wrote about my expected goals model. One thing that I noted was missing from the model was a "shooter talent" input. The idea was first developed here by DTMAboutHeart and Asmae Toumi. Since then, I think, a couple of expected goals models have been developed using the same logic. So in this post, I'll attempt to do the same. This piece is part one of (what should be) two; I'll do some testing in the next part.
Defining the "Shooter Talent" Input
In their piece, DTM and Asmae defined it as regressed Sh% over the previous two seasons. The amount to regress by is 375 shots for forwards and 275 for defensemen. They then took the regressed Sh% and divided it by the league-average Sh% to get what they called the shot multiplier.
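As a rough sketch of that definition: the piece gives the regression amounts (375 and 275 shots) but the exact formula isn't restated here, so the version below assumes the common approach of padding a player's totals with that many league-average shots. The function names are placeholders, and the 6% league-average Sh% in the usage note is just an illustrative number.

```python
# Regression amounts quoted from DTM and Asmae's piece.
REGRESS_SHOTS = {"F": 375, "D": 275}


def regressed_sh_pct(goals, shots, league_sh_pct, regress_shots):
    """Shrink observed Sh% toward the league average.

    Assumption: regression is done by adding `regress_shots` shots that
    convert at the league-average rate.
    """
    return (goals + regress_shots * league_sh_pct) / (shots + regress_shots)


def shot_multiplier(goals, shots, league_sh_pct, position):
    """Regressed Sh% divided by league-average Sh%."""
    reg = regressed_sh_pct(goals, shots, league_sh_pct, REGRESS_SHOTS[position])
    return reg / league_sh_pct
```

Under these assumptions, a forward with no shot history gets a multiplier of 1.0, and a forward with 45 goals on 375 shots in a 6% league gets roughly 1.5.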
While I do generally agree with this, I do have a concern. I guess the best way to explain it is by talking about what we want our "shooter talent" input to do. A standard xG model controls for the situational factors (sans the shooter) and tries to determine the probability of a shot being a goal. But we know this isn't enough. If a shot with the same xG is taken by Steven Stamkos and Zac Rinaldo...I think we'd all agree one shot has a better chance of going in (if not I guess you can just stop reading now). So, therefore, we want our shooter talent input to account for how much better than expected we think this player is.
I think that's the issue with using just regressed Sh% (or just FSh%...they say Sh% but the model is Fenwick-based so I don't know): it doesn't get at what we are looking for. Sh% is a combination of two things: the standard probability of one's shots (whether this player gets high-quality chances) and the player's shooter talent (how much better than expected he does). A player could have a high Sh% because of the quality of shots he takes or because of his actual shooting ability. We need to distinguish between the two and focus only on how much better than expected this player is, controlling for the quality of shots he gets (since the other variables in our model attempt to control for the quality of the chance itself).
Well, so how do we do that? I talked about it a while back (somewhat clumsily) and it really just comes down to Goals/xGoals (this would be our "shooter talent" multiplier). Then for each player we could use his previous data and regress it to get our best estimate of his multiplier.
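To make the distinction concrete, here is a small sketch with made-up numbers: two hypothetical players with identical Sh% but very different shot quality. The player labels and season totals are invented purely for illustration.

```python
def sh_pct(goals, shots):
    """Plain shooting percentage."""
    return goals / shots


def raw_multiplier(goals, xgoals):
    """Unregressed shooter talent multiplier: Goals / xGoals."""
    return goals / xgoals


# Hypothetical season totals: (goals, shots, total xG)
net_front = (20, 200, 20.0)   # gets high-quality chances
perimeter = (20, 200, 12.5)   # gets low-quality chances

for name, (g, s, xg) in {"net_front": net_front, "perimeter": perimeter}.items():
    print(name, sh_pct(g, s), raw_multiplier(g, xg))
```

Both players shoot 10%, but only the second one is beating his expected goals, which is the thing the multiplier is meant to capture.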
But how do we do that? I mean, if we are building a model for xG how do we get the multiplier for each shot (how do we get the xG values to do it)? I struggled with this at first but I think it's pretty simple. We create a standard xG model that doesn't account for the shooter (which I already did). We use that to get our multiplier for each shot. We then create another model with the same exact parameters but we also include a shooter talent input (which is our multiplier we just calculated).
Regressing the Shooter Talent Multiplier
To get my multiplier for each shot (given the shooter), I need to use that player's previous data and regress it to account for randomness. The question, of course, is how much I need to regress it. This depends on how noisy the statistic is (or, alternatively, how much signal it has). This has been done many times in baseball analysis. DTM and Asmae determined how much to regress using the KR-21 test. That test (and KR-20) only works for binary data, so it isn't applicable in this case, and I chose to do it a little differently.
For each player-season combination (each season for a given player was considered different, as player talent changes over time) I listed the xG of each shot they took. I then randomly selected k shots from each player and split them into two halves. I then computed the correlation between the two halves across all players. This was done for values of k starting at 50 and continuing until there were fewer than 100 players in the regression. I repeated this procedure 250 times. I then calculated the average correlation for each value of k using the Fisher z-transformation (Note: Some of you may be wondering why I didn't just run Cronbach's alpha. The reason is that I couldn't get my code to work and I didn't feel like wasting any more time. This really shouldn't matter, as how I did it here should do a perfectly fine job of estimating the true values). So I do this and I search for when r = .5, as this will tell me how much I need to regress it.
Of course, as you could guess, I never actually reach r = .5, since I'm considering each player on a season-by-season basis and players don't take enough shots in a season to get there. That's fine; I can estimate the sample size at which r would reach .5 using the Spearman–Brown prediction formula. But which value of k do I use to make that prediction? I calculated the correlation for many different samples (Note: For each sample k, the actual sample size is k/2, because I cut it in half to correlate the halves against each other). Do I just use my last sample to estimate when it will be .5? I chose to do it like Derek Carty did here. I get the implied amount for each value of k and then get my final value by taking the weighted average of all my predictions (weighted by how many players were in the regression).
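A minimal sketch of that split-half procedure, using only the Python standard library. Two things here are assumptions rather than details from the post: each shot is stored as an (xG, goal) pair, and the statistic compared between the halves is Goals/xG. All function names are hypothetical.

```python
import math
import random


def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def split_half_stats(player_seasons, k, rng):
    """Sample k shots per player-season and return Goals/xG for each half.

    `player_seasons` maps a (player, season) key to a list of (xg, goal)
    tuples. Player-seasons with fewer than k shots are skipped.
    """
    first, second = [], []
    half = k // 2
    for shots in player_seasons.values():
        if len(shots) < k:
            continue
        sample = rng.sample(shots, k)  # random order, without replacement
        a, b = sample[:half], sample[half:2 * half]
        first.append(sum(g for _, g in a) / sum(x for x, _ in a))
        second.append(sum(g for _, g in b) / sum(x for x, _ in b))
    return first, second


def fisher_mean(rs):
    """Average correlations in Fisher z space, then transform back to r."""
    zs = [math.atanh(r) for r in rs]
    return math.tanh(sum(zs) / len(zs))


def reliability_at_k(player_seasons, k, reps=250, seed=0):
    """Mean split-half r at sample size k, plus the number of players used."""
    rng = random.Random(seed)
    rs, n_players = [], 0
    for _ in range(reps):
        first, second = split_half_stats(player_seasons, k, rng)
        n_players = len(first)
        rs.append(pearson(first, second))
    return fisher_mean(rs), n_players
```

Calling `reliability_at_k` for k = 50 and up, stopping once fewer than 100 player-seasons qualify, would reproduce the loop described above. The sketch does not guard against degenerate cases (zero variance in a half, or a correlation of exactly ±1), which would need handling on small samples.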
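The extrapolation step can be sketched as follows. The Spearman–Brown formula says that if a sample of size n has reliability r, a sample m times as large has reliability m·r / (1 + (m − 1)·r); setting that equal to .5 and solving gives m = (1 − r) / r. The function names and the input layout (a list of `(k, r, n_players)` tuples) are assumptions made for this illustration.

```python
def shots_for_half_signal(r, n):
    """Sample size at which reliability would reach 0.5.

    Spearman-Brown: r_m = m*r / (1 + (m - 1)*r) for a sample m times larger.
    Setting r_m = 0.5 and solving gives m = (1 - r) / r.
    """
    return n * (1 - r) / r


def weighted_regression_amount(results):
    """Weighted average of the implied regression amounts.

    `results` is a list of (k, r, n_players). Each half contains k/2 shots,
    so k/2 is the sample size the observed r applies to.
    """
    estimates = [(shots_for_half_signal(r, k / 2), n) for k, r, n in results]
    total_players = sum(n for _, n in estimates)
    return sum(est * n for est, n in estimates) / total_players
```

For example, an observed r of .2 with 50 shots per half implies that r = .5 would be reached at about 200 shots, and that number is also the amount of league-average shots to regress by.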
This was done separately for forwards and defensemen using all-situations data from 2007-2016 (it's all-situations because my xG model is...I'm just trying to stay consistent). The result was 280 shots for forwards and 2350 for defensemen. This may seem really high for defensemen (and possibly low for forwards) but I see no reason to doubt it. Defensemen aren't chosen for their shooting ability so I didn't expect a whole lot.
Building the Model
Ok, time to build the model. In order to compare it to my previous model, I built it on the same training data and will test it on the same test data. To calculate the shooter talent multiplier for a given shot, I summed up the player's stats from up to two seasons earlier plus the current season up to that shot, excluding the shot itself and any data after it. So, for example, for a shot during the 2010 season, I sum up the player's statistics for that year before that shot and for the 2008 and 2009 seasons. I then determine how much to regress his multiplier based on how many shots he took and the numbers calculated at the end of the previous section.
I think it's important to stress why I only include previous data to calculate the player's "shooter talent". If I did include future data, those numbers would then impact the xG of that shot and therefore inflate any correlations I take with the rest of that season (since it would in effect be telling that shot that this player scores more or less in the future). Also, from a classification perspective, including the shot itself would give a (small) indication as to the outcome of the shot (future data only matters once we use it to calculate stats).
But what do I regress to? I originally was going to regress to that year's average but I think that's a mistake. The reason is that since the average changes a little every year it would, in a sense, give information about the outcome of the shot (this is a smaller concern than using future info for the shooter multiplier but I still think it matters a little). So the average is based on the previous year's data and whatever we've accumulated so far for that given season (since the data spans 2007-2016, "2006" was just made as an average of all the years). One last note is that I then divided the regressed multiplier by the same average I regressed to, in order to normalize it across years.
As with my previous model, I used Gradient Boosting to fit the data (I also used the same features, with the addition of the shooter talent input). So how does the model do? Let's see:
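The lookback window can be sketched like this. The `Shot` record and its fields (in particular `order`, a chronological index within the season) are hypothetical; the post doesn't show how the data is actually stored.

```python
from dataclasses import dataclass


@dataclass
class Shot:
    shooter: str
    season: int
    order: int   # chronological position within the season (hypothetical field)
    xg: float    # xG from the standard (no-shooter) model
    goal: int    # 1 if the shot scored, else 0


def prior_totals(shots, current, lookback=2):
    """Shots, goals and xG the shooter accumulated strictly before `current`.

    Counts the current season up to (not including) the shot, plus the
    `lookback` seasons before it. Nothing from the future is used.
    """
    n, goals, xgoals = 0, 0, 0.0
    for s in shots:
        if s.shooter != current.shooter:
            continue
        earlier_this_season = s.season == current.season and s.order < current.order
        previous_seasons = current.season - lookback <= s.season < current.season
        if earlier_this_season or previous_seasons:
            n += 1
            goals += s.goal
            xgoals += s.xg
    return n, goals, xgoals
```

Scanning every shot for every shot is quadratic, so on a full dataset running totals per shooter would be the practical choice; the loop is written this way only to make the window rule easy to read.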
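Putting the regression and normalization together might look like the sketch below. The regression amounts (280 and 2350) come from the previous section, but the shrinkage formula itself, a weight of shots / (shots + regression amount), is an assumption, since the post doesn't spell it out. `league_avg` stands for the running league-average multiplier described above, and the function name is hypothetical.

```python
# Regression amounts found in the previous section.
REGRESS_SHOTS = {"F": 280, "D": 2350}


def shooter_talent(goals, xgoals, shots, league_avg, position):
    """Regressed and normalized Goals/xGoals multiplier.

    Assumption: shrink toward `league_avg` with weight
    shots / (shots + regress_shots), then divide by `league_avg`
    so the value is comparable across years.
    """
    raw = goals / xgoals if xgoals > 0 else league_avg
    weight = shots / (shots + REGRESS_SHOTS[position])
    regressed = weight * raw + (1 - weight) * league_avg
    return regressed / league_avg
```

With this form, a player with no prior shots gets exactly 1.0, and a forward with 280 prior shots sits halfway between his raw multiplier and the league average.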
Model          AUC     Log Loss
Standard xG    .781    .206
Shooter xG     .782    .206
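For readers unfamiliar with the two metrics, here is a small pure-Python sketch of what each one measures. These are illustrative definitions only, not the code that produced the numbers above; on a full shot dataset a library implementation such as scikit-learn's `roc_auc_score` and `log_loss` would be the practical choice.

```python
import math


def log_loss(labels, probs, eps=1e-15):
    """Mean negative log-likelihood of the outcomes (lower is better)."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(labels)


def auc(labels, probs):
    """Chance that a random goal has a higher xG than a random non-goal.

    Ties count as half. This pairwise version is O(n^2), so it is only
    suitable for small examples.
    """
    pos = [p for y, p in zip(labels, probs) if y == 1]
    neg = [p for y, p in zip(labels, probs) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))
```

AUC only cares about how the shots are ranked, while log loss also punishes probabilities that are badly calibrated, which is part of why two models can tie on one and still differ in how useful they are.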
So, it's the same (I guess a drop better). This may seem weird but it's important to remember a couple of things. One is that, in aggregate, how good of a shooter a player is can only matter so much. Other variables like distance, angle, etc. matter much more (this just makes sense intuitively). Another is that just because one doesn't do better than the other as a classifier doesn't mean it's not better in other respects. These metrics just focus on how good it is at predicting if a shot's a goal. It may do better as a stat at predicting future goals. I guess my point here is really not to get too caught up in how it's only a marginally better classifier (hold on until part two).
I know what some of you may be thinking: let's run some tests to see if it is a better predictor of goals. Well, this post is already a little longer than I expected it to be and I have a lot more to say, so I'll dedicate another post to it. For those interested, the code for this project can be found here.