Monday, June 19, 2017

Looking Back at Goalie Projections

Before the season started I made some goalie projections (maybe a couple of you remember). To recap very briefly, I put together a simple Marcel projection system that tried to predict Low/Mid/High danger Sv% (LSv%/MSv%/HSv%, as defined on Corsica.hockey) for each goalie. And, well, if you're going to make projections, you kind of have to look back at how they did. So let's do that.

***All numbers here are courtesy of Corsica.hockey (thankfully I downloaded the numbers before it went down for the summer)

I guess the first real question here is how do you grade a projection? There are a lot of ways to do this; here's how I did it:

The first thing I had to do was scale the projections to the 2016-2017 averages. The league numbers change a bit each year, so you have to account for that. My actual projections predict how much better or worse than average you'd expect a goalie to do. When I made the projections I provided a Google Doc which had them scaled to the previous year's averages. For the sake of comparing the projections to what actually happened, I'll scale them so they're on equal footing with this past year's numbers (I could have instead expressed this year's numbers as better/worse than average, but I prefer it this way).
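For illustration, here's a minimal sketch of that rescaling step in Python. All of the values and names below are made up for the example; none of them come from the actual spreadsheet.

```python
# Projections are stored as deltas relative to league average, so putting
# them on this season's footing just means adding this season's league
# averages back in. All values below are hypothetical.

projected_delta = {"LSv%": 0.002, "MSv%": -0.001, "HSv%": 0.005}

# hypothetical 2016-2017 league averages by danger zone
league_avg_2017 = {"LSv%": 0.975, "MSv%": 0.920, "HSv%": 0.820}

# the goalie's projected Sv% on this season's scale
scaled_projection = {
    zone: league_avg_2017[zone] + delta
    for zone, delta in projected_delta.items()
}
print(scaled_projection)
```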

Next, for each goalie who had a projection and played this year (even if he faced just one shot), take the absolute value of the difference between his projection and his observed Sv%. Then weight each goalie's absolute difference by the number of shots he faced and take the weighted average across all goalies, so goalies who faced fewer shots count for less than those who faced more.
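In code, the grading metric is just a shot-weighted mean absolute error. A minimal sketch (the example goalies are made up):

```python
def weighted_mae(goalies):
    """goalies: list of (projected_sv, observed_sv, shots_faced) tuples."""
    total_shots = sum(shots for _, _, shots in goalies)
    return sum(abs(proj - obs) * shots
               for proj, obs, shots in goalies) / total_shots

# three hypothetical goalies; the 50-shot goalie barely moves the average
print(weighted_mae([(0.921, 0.915, 800),
                    (0.918, 0.930, 1200),
                    (0.920, 0.905, 50)]))
```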

So is that it? No, because while checking how far my projections deviated is nice, without some sort of reference we have no idea how good they actually were. We need to see how other ways of predicting this past year's numbers did. Ideally I'd include projections made by other people, but I don't know of anyone who made projections for Low/Mid/High Sv%, so I'll just include a few simple baselines. Besides my projections, I'll also test: the previous year's numbers, the player's career numbers (going back to 2007-2008), and league average.

With all that out of the way, here are the numbers (lower is better):


Type            LSv%     MSv%     HSv%
Projections     0.0053   0.0123   0.0208
Previous Year   0.0110   0.0193   0.0317
Career          0.0069   0.0147   0.0276
League Average  0.0052   0.0121   0.0225

Let's first look at how the projections did across the three danger zones. The error for LSv% is smaller than for MSv%, which is smaller than for HSv%. You might be tempted to conclude that we're best at predicting them in that order. That's not true, though: LSv% and MSv% both contain less "talent" than HSv%, so their observed spread is smaller, and a smaller observed standard deviation means a smaller average difference. It may look better, but it isn't. A better way of doing this would probably have been in terms of standard deviations (this isn't really a big deal, though).

Ok, let's now look at how the four "projections" did. For each category there's a clear order. Just using last year's numbers is clearly the worst in every category (which is, of course, why you shouldn't trust just one year of data). Next comes the player's career numbers (going back to 2007-2008), then league average, and then my projections. My projections and league average are virtually the same for LSv% and MSv%, but mine edge it out slightly for HSv%.

Also, you might be surprised at how well league average does, but I think it makes good sense. Goalie performance contains a lot of randomness, so projecting average is a good bet. This is especially true for LSv% and MSv% (and to a lesser extent HSv%), which are both mostly random.

Conclusion

To conclude, my projections for this past year were better than a player's previous year's numbers and (to a lesser extent) his career numbers, with the biggest edge in HSv% and the smallest in LSv%. Compared to league average, my projections were about the same for LSv% and MSv% and slightly better for HSv%.


Monday, February 27, 2017

A Few Things about Miss%

This post is a combination of a few things about Miss%. DTMAboutHeart originally wrote about Miss%, and I wrote my own post in July. In that post I used only road numbers, noting that ideally I'd adjust for rink bias. I recently worked out some rink adjustments for misses, so in the first part of this post I'll use them to conduct a better analysis of Miss%. This isn't to say my method of adjusting is perfect, but I think it does a good enough job (and it's definitely better than just throwing out half the data). In the second part of this post I talk about a few things concerning saves and misses.

(**All Data courtesy of Corsica.hockey**)

Reliability

In DTM's original post he showed that Miss% had good repeatability. I kept it simple here: using the adjusted data, I ran year-over-year correlations for goalies with 20+ and 40+ games in back-to-back seasons, going back to 2007. I also included the corresponding values for Sv% from this piece by Emmanuel Perry. Here they are (numbers below are r^2 and 5v5 only):

              Miss%   Sv%
20+ Games:    .136    .042
40+ Games:    .231    .072

As you can see (and as I assume you already know), Miss% is easily more repeatable than Sv%.
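For reference, here's a minimal sketch of how a year-over-year test like this can be run, assuming a pandas DataFrame with one row per goalie-season (the column names here are mine, not Corsica's):

```python
import pandas as pd

def yoy_r2(df, stat="Miss%", min_gp=20):
    """Year-over-year r^2 for goalies meeting the games-played cutoff."""
    df = df[df["GP"] >= min_gp]
    # shift each season forward by one so it lines up with the next season
    prev = df.assign(Season=df["Season"] + 1)
    pairs = df.merge(prev, on=["Goalie", "Season"], suffixes=("", "_prev"))
    return pairs[stat].corr(pairs[stat + "_prev"]) ** 2

# e.g. yoy_r2(goalie_seasons, stat="Miss%", min_gp=40)
```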

Team Effects

I checked for this in my original post. The methodology here is the same, except now I'm not restricted to road data only. The numbers are 5v5 only and span 2008-2009 to 2015-2016. To get a better sample for players with more shots, I coupled up the years: 2008-2009 with 2009-2010, 2010-2011 with 2011-2012, and so on. For each couplet, I gathered each goalie's numbers by team and matched them up with that team's numbers in the same couplet. Then, for each goalie, I calculated his Miss% and the Miss% for his team when he wasn't on the ice. Finally, depending on the sample I'm using, I ran a correlation between what the goalie did for that team (that team only; any numbers the goalie accumulated for other teams were excluded) and what the team did without him.

The numbers below are r^2, and the sample restriction applies to both the goalie and the team without him. (You may note that in my original post I divided this up by danger zone and here I don't. Ideally I would, but to do so there are a few things I need to account for; hopefully I'll be able to update these numbers with those values in the near future.)

Sample   n     r^2
500+     306   .038
750+     255   .036
1000+    194   .035

The column n is the number of goalies in that sample. The correlations here, while higher than what I previously found, are rather small. There definitely is a relationship, but it's not very big and not something that should be a big concern. Adjusting for shot quality (as is done on Corsica) should also cut down on this.
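Here's a rough sketch of the with/without correlation described above, assuming one row per goalie-team-couplet with the goalie's shots and misses plus the team's totals without him (again, the column names are my own):

```python
import pandas as pd

def team_effect_r2(df, min_shots=500):
    """Correlate a goalie's Miss% with his team's Miss% when he's off the ice."""
    # the shot cutoff applies to both the goalie and the team without him
    df = df[(df["GoalieShots"] >= min_shots) &
            (df["TeamShotsWithout"] >= min_shots)]
    goalie_miss = df["GoalieMisses"] / df["GoalieShots"]
    team_miss = df["TeamMissesWithout"] / df["TeamShotsWithout"]
    return goalie_miss.corr(team_miss) ** 2
```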

Value of a Miss

I've heard a few common claims about goalies and misses (not referencing anyone in particular here). One I feel obliged to address briefly is the idea that since a missed shot has no chance of going in the net, it shouldn't be looked at: a miss is not on goal, so it never had a chance of being a goal (misses go in 0% of the time). I don't see how this makes any sense. If a goalie can influence the number of shots that miss the net, that means a goalie can "force" shots wide, so shots that would normally hit the net are now missing it. Is that not a positive thing?

The reason, I think, is that people are thinking too much in terms of shots on goal rather than unblocked shots. Consider a player releasing a shot (assume here it won't be blocked). There are two important factors: what is the probability of it hitting the net, and if it does, what is the probability that it goes in? These two values are baked into any Fenwick-based expected goals model; multiplying the two probabilities gives an expected goal amount for that shot. With Sv% we only look at the second factor: once it's on goal, how good is the goalie at stopping it? But the first factor matters too. The shot has to hit the net to go in, so if a goalie can affect the probability of a shot hitting the net, he thereby reduces the expected goal amount of that shot. Just as a goalie can influence whether a SOG goes in, he can also influence whether an unblocked shot hits the net. This is important too and can't be ignored.
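As a worked example, using the league-average numbers quoted later in this post:

```python
# A Fenwick-based xG is P(hits the net) * P(goal | on net).
p_on_net = 0.722          # league average: 72.2% of unblocked shots hit the net
p_goal_if_on_net = 0.078  # league average Sh% on shots that reach the net

print(p_on_net * p_goal_if_on_net)  # ~0.0563 xG for an average unblocked shot

# a goalie who "forces" shots wide lowers p_on_net, and with it the xG
print(0.70 * p_goal_if_on_net)      # ~0.0546: fewer expected goals per shot
```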

Saves and Misses

What's interesting about including misses in our evaluation of goalies is the consequence for saves. With Sv%, an average save was worth about .078 goals (since league average Sh% is about 7.8%). But if we include misses and look at Fenwick Sv% instead, then misses and saves are interchangeable: both result in the same thing, no goal. So a miss and a save are each worth about .056 goals (league average FSh% is about 5.6%), since they're in the same bucket.

Again, we can break this down into the probability of a shot hitting the net and, if it does, the probability of it going in. The average Miss% is about 27.8%, so 72.2% of shots hit the net. If a shot misses the net, the goalie has in effect stopped a SOG, which carries a .078 value. But since the average shot hits the net only 72.2% of the time, he really only prevented .722 of a SOG (he was expected to produce .278 misses on average, so he's .722 above expectation). And .722*.078 ≈ .056 goals.

For shots on goal, we used to value a save at .078. But on an average unblocked shot we expect only 72.2% to hit the net, so allowing a SOG at all is .278 more than we'd expect. Stopping the SOG is worth .078, but we have to penalize the goalie for allowing the SOG in the first place: .078 - (.278*.078) ≈ .056 goals.
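Here are the two calculations above as code, using the league averages quoted in the text:

```python
avg_sh_pct = 0.078   # league average Sh% on shots on goal
p_on_net = 0.722     # share of unblocked shots that hit the net

# a miss: credit for preventing a shot that hits the net 72.2% of the time
miss_value = p_on_net * avg_sh_pct

# a save: worth .078 for the stop, minus the penalty for allowing the SOG at all
save_value = avg_sh_pct - (1 - p_on_net) * avg_sh_pct

print(round(miss_value, 4), round(save_value, 4))  # both ~0.0563, the league FSh%
```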

With all that said, I don't think saves and misses are, on average, equal in terms of goals saved. The reason is simple. Imagine I said three shots happened in the past five minutes: one resulted in a goal, one in a save, and one in a miss. If you had to guess where each shot most likely came from and the circumstances surrounding it, I'd imagine you'd expect the goal came from closest to the net, followed by the save and then the miss (and that the goal was the most likely to be a rebound or a rush shot). So if we had to guess how dangerous each shot likely was, the order would be Goal -> Save -> Miss, and we can infer from the outcome that a save is likely to have been a more dangerous shot than a miss.

We can look at this directly. Using the PBP files graciously shared by Emmanuel Perry and his expected goals model, we can calculate the average expected goal amount for a save and for a miss. This is useful because it gives us an unbiased view of how dangerous the shot was without knowing the outcome. What we see is that the average save has an xG of .0569 and the average miss has .0488, a difference of ~.008 goals. (Note: this is a random aside, but adding up the total xG of every shot and calculating the average xG per shot gives .0586, which suggests the model slightly overrates the probability of a shot going in.)

So saves tend to come on more dangerous shots; they aren't "equal". But any model that uses a shot-quality component already has this factored in, so as long as one controls for quality it doesn't matter. So yes, misses are less dangerous shots, but most models already account for this (like DTM's model and Adj.Fsv% on Corsica); it would only be an issue with raw Fenwick Sv%. As an aside, I don't think the value of a save and a miss is quite as simple as this, but that's for another post.

Conclusion

To recap: using rink-adjusted numbers (as opposed to road-only numbers), I corroborated the findings that goalie Miss% is more repeatable than Sv% and that the team effects on Miss% are minimal. I then followed that up with some assorted thoughts on saves and misses.


Friday, January 27, 2017

Rink Bias for SOG and Misses

Sadly, because the NHL still records its data in a shitty manner, there are errors to be found. We see this in the location of recorded events and even in the actual counts of events (http://objectivenhl.blogspot.com/2010/03/shot-recording-bias-part-n.html). For example, as discussed in that link, some rinks may inflate or deflate the number of shots on goal that occurred (or really, saves). Every arena has its own trackers, and they all have their own biases, so this is to be expected. This has been looked at previously by Macdonald and Schuckers, but those methods are beyond my grasp, so I'll try to replicate the numbers in a simpler manner for both shots on goal and misses. All data used here is courtesy of Corsica.hockey. (It goes without saying that all numbers used here are 5v5.)

The way the rink factors (I'm lifting the term "factors" from baseball) will be calculated is very simple. I'm not looking for a perfect way of doing this, just an easy way of getting a solid estimate. What the factors tell us is how much each rink over/under counts saves and misses. If a factor is 1.2, that rink overcounts the statistic by a factor of 1.2, so to adjust the statistic in question, we multiply events recorded in that rink by 1/1.2. What we'll do is compare home numbers to road numbers, take multiple years into account, and regress (similar to this method). Comparing home to road numbers is meant to isolate the home-rink effect; it assumes the away numbers are indicative of the "true" numbers we'd expect. That isn't strictly true, because the road numbers won't perfectly even out either (there will be biases there too), but I think they should mostly even out, so I'll let it be for now.

Another problem is that both the home and road numbers are just a one-year sample (like looking at one year of a player's statistics), so I'll take multiple years into account (when possible, of course; this doesn't apply to teams that changed arenas). How many? I chose three, because going beyond three years didn't really seem to add anything. So, for example, the rink factors for 2015-2016 take the two previous years into account (each year weighted equally). The last problem is that, just as one year isn't indicative of the "true" factor, each measurement will also contain some randomness, so each factor has to be regressed a certain amount (because for all we know, we're just measuring randomness).
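A minimal sketch of the multi-year and regression steps; the regression amount here is a placeholder assumption (in practice it would be set from the repeatability numbers further down):

```python
def regressed_factor(yearly_factors, regress=0.5):
    """Average up to three seasons of raw factors, then pull toward 1."""
    raw = sum(yearly_factors) / len(yearly_factors)  # equal weights per year
    return 1 + (raw - 1) * (1 - regress)

# e.g. three seasons of raw home/road factors for one rink
print(regressed_factor([1.25, 1.18, 1.21]))  # -> ~1.107 with 50% regression
```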

Lastly, all numbers here are score-adjusted. Adjusting for score accounts for the fact that a team may have trailed or led more at home (or on the road); to isolate any one effect, we have to do our best to account for other possible ones. This is done similarly to how Micah Blake McCurdy laid it out here, except that while Micah calculates the coefficient for each state using the average events for both teams in that game state, I used the overall average, not just the average for that state. For example, here's the 5v5 Sv% by score state for the away team:

Road Lead   Sv%
-3+         .9116
-2          .9085
-1          .9165
0           .9230
1           .9225
2           .9236
3+          .9265
Average     .9202

To arrive at the coefficient for each state, we just divide Average/Sv%. I did it this way because with shot metrics we care what the other team does, since it affects both sides (how many shots Team A gets is how many Team B gives up), but for something like Sv%, how many goals Team A gives up per shot is irrelevant to Team B. So I just related everything back to the average. With all that said, I don't think adjusting for score here really changes anything.
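In code, the coefficients fall out of the table above directly:

```python
# Sv% by road lead (from the table above); -3 and 3 stand for the -3+/3+ buckets
state_sv = {-3: .9116, -2: .9085, -1: .9165, 0: .9230,
             1: .9225,  2: .9236,  3: .9265}
avg_sv = .9202

coef = {state: avg_sv / sv for state, sv in state_sv.items()}

# a save made while the road team trails by two counts as ~1.0129 saves
print(round(coef[-2], 4))
```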

Shots on Goal

As Tore Purdy originally noted, when we say a rink may over/under count the number of shots, we really mean saves; goals, thankfully, are impossible to miscount. He therefore used Sh% to examine the issue. I'll do it in the same spirit but in a slightly different manner: I'll use the ratio of saves to goals, Sv%/(1-Sv%). The higher the ratio, the higher the Sv%.
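Jumping slightly ahead, here's the base-factor computation in miniature, with hypothetical cumulative numbers for one rink (the home/road comparison is spelled out in the next paragraph):

```python
def save_goal_ratio(sv_pct):
    """Odds of a save versus a goal: Sv% / (1 - Sv%)."""
    return sv_pct / (1 - sv_pct)

# hypothetical cumulative (score-adjusted) Sv% for both teams combined
home_sv, road_sv = 0.925, 0.920

base_factor = save_goal_ratio(home_sv) / save_goal_ratio(road_sv)
print(round(base_factor, 3))  # >1 would mean this rink's scorer inflates saves
```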

So for every team I calculated the cumulative save-to-goal ratio (meaning both teams combined) at home and the same ratio on the road (reminder: the numbers here are score-adjusted). We then divide home by away (Home/Away) to get our base "factor". If it's above 1, the ratio was higher at home than on the road, and vice versa if it's below 1. But, as discussed, we should really use more than one year, and for all we know these numbers are completely random, so we need to check the repeatability of these "factors". For each season (when possible), I'll predict the save factor using the past season, the past two seasons, and the past three. The correlation coefficients are below:

Years        Sv% Factor
1 Year:       .033
2 Years:     .066
3 Years:     .0985

As you can see, the correlations here are pretty low, but based on the work by Schuckers and Macdonald (linked at the beginning) we expected this: as they noted, the effects for shots are, by and large, rather small. It's important to note that a small correlation doesn't mean the effect doesn't exist; there is a relationship, it just has a lot of noise. Here are the 3-year regressed factors for last year:


Team Shot Factor
CGY 0.986
OTT 0.987
MIN 0.988
TOR 0.988
STL 0.990
COL 0.992
WPG 0.992
CBJ 0.994
BUF 0.997
S.J 0.997
NYI 0.998
DAL 0.999
L.A 1.000
PIT 1.000
DET 1.000
NYR 1.000
N.J 1.001
T.B 1.002
NSH 1.002
FLA 1.004
EDM 1.004
WSH 1.004
VAN 1.005
ARI 1.009
CAR 1.010
ANA 1.011
PHI 1.011
BOS 1.014
MTL 1.017
CHI 1.017

Most of the factors are rather small (especially compared to misses, as we'll see soon) and closely centered around 1. I won't say it doesn't matter... but it kind of suggests that it mostly doesn't. It's also important to remember what these numbers mean: they measure how much each rink over/under counts saves. So when adjusting shots, they apply only to saves, not to total shots on goal.


Misses

For misses, I calculated the cumulative ratio of misses to shots on goal at home and away for each team (score-adjusted, with shots on goal rink-adjusted as detailed in the last section). I then divided the home ratio by the away ratio to get each team's home miss factor. And what we see is that there is a lot more signal in these numbers. Here's how well each year's factor is predicted by the past season, the past two seasons, and the past three (the numbers below are r, not r^2):

Years       Miss% Factor
1 Year:           .754
2 Years:         .7912
3 Years:         .797

Here are the 3 year regressed factors for last year:

Team Miss Factor
CHI 0.803
N.J 0.846
PIT 0.857
DET 0.868
COL 0.909
FLA 0.922
VAN 0.922
WPG 0.926
CBJ 0.929
NSH 0.936
NYI 0.941
MTL 0.942
BOS 0.958
T.B 0.964
OTT 0.965
CGY 0.980
BUF 0.983
WSH 0.990
NYR 1.009
PHI 1.030
EDM 1.035
MIN 1.042
STL 1.064
S.J 1.097
ARI 1.154
ANA 1.179
DAL 1.213
L.A 1.223
TOR 1.243
CAR 1.244

As you can see, these are a lot more significant. They also align with what Schuckers and Macdonald found (see the misses section of their study, Table 8): Toronto, Dallas, Carolina, and L.A tend to overcount, and Chicago and N.J tend to undercount. Of course, a few discrepancies stand out between the factors they listed and mine. This seems mostly due to the fact that their numbers cover the 2007-2013 seasons while mine above only cover 2013-2016. Looking back at the factors from the time period they used (I posted all the factors back to 2007-2008 at the end of this post), the numbers align more with those posted by Schuckers and Macdonald; for example, CBJ and BOS appear lower and Chicago's factor is closer to .6 (the exception is N.J, for which they report numbers lower than mine). It's possible the official scorers at each rink have shaped up a little since then.

Effect of Adjusting

It's nice seeing the raw factors, but a better way to see the effect of rink bias is to see it in action. So I calculated the rink-adjusted Sv% and Miss% for each goalie from 2007 until last season; that is, every save they made, not just the home numbers (these numbers are also included in the Google Doc). To see the effect of rink adjusting, I looked at all goalie seasons of at least 30 games and calculated the difference between the raw numbers and the adjusted ones. Below are the 10 largest differences for Sv%:

Goalie                   Year      Sv%     Adjusted Sv%  Difference
MIKE.CONDON              20152016  0.9138  0.9127         0.00109
JEAN-SEBASTIEN.GIGUERE   20102011  0.9130  0.9121         0.00090
ANTERO.NIITTYMAKI        20092010  0.9186  0.9179         0.00079
JOHAN.HEDBERG            20102011  0.9184  0.9192        -0.00074
MARTIN.BRODEUR           20092010  0.9248  0.9255        -0.00065
COREY.CRAWFORD           20152016  0.9332  0.9326         0.00062
MIKE.SMITH               20092010  0.9113  0.9107         0.00061
JONAS.GUSTAVSSON         20092010  0.9143  0.9137         0.00061
JOSH.HARDING             20112012  0.9205  0.9211        -0.00058
MARTIN.BRODEUR           20102011  0.9124  0.9130        -0.00057


As you can see, even at the extremes the differences are small. The largest, Mike Condon's, only moves about a point of Sv% (.0011), and the tenth largest (Brodeur's) moves a little over half a point. While I don't want to say these differences are irrelevant, they really don't move the needle much. Let's now look at the 10 largest for Miss%:


Goalie                Year      Miss%   Adjusted Miss%  Difference
CRISTOBAL.HUET        20092010  0.2633  0.2971          -0.03377
NIKOLAI.KHABIBULIN    20082009  0.2128  0.2447          -0.03196
JONATHAN.QUICK        20102011  0.3065  0.2773           0.02930
JONATHAN.BERNIER      20142015  0.3120  0.2835           0.02849
JONATHAN.QUICK        20132014  0.3117  0.2834           0.02835
NIKOLAI.KHABIBULIN    20072008  0.2070  0.2352          -0.02817
JONATHAN.QUICK        20142015  0.2959  0.2690           0.02687
CAM.WARD              20152016  0.3025  0.2768           0.02570
CRISTOBAL.HUET        20082009  0.2542  0.2799          -0.02565
ANTTI.NIEMI           20092010  0.2653  0.2898          -0.02443


The differences here are a lot bigger. To put them into perspective, let's look at Cam Ward's numbers in the chart above. Last season, among goalies who played at least 30 games, the mean Miss% was .2811 and the standard deviation was .0173. That gives Ward a z-score of 1.24 ((.3025-.2811)/.0173), which puts him at about the 89th percentile (.8925) in Miss% among goalies with 30+ games. After adjusting for rink, I repeated the procedure: Ward's z-score this time was -.196, which puts him at about the 42nd percentile (.425). After adjusting, he went from near the top of the league to slightly below average. I know this is one of the more extreme examples, but I think it shows how large a role rink bias can play with misses.
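For reference, here's the percentile arithmetic, assuming Miss% is roughly normal across goalies (the normality assumption is mine):

```python
from scipy.stats import norm

mean, sd = 0.2811, 0.0173            # 30+ GP goalies, raw Miss% (from the post)
z_raw = (0.3025 - mean) / sd         # Ward's raw z-score, ~1.24
print(round(norm.cdf(z_raw), 3))     # ~0.892 -> roughly the 89th percentile

# the post repeats this with the adjusted distribution, giving z = -0.196
print(round(norm.cdf(-0.196), 3))    # ~0.422 -> slightly below average
```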

Conclusion
To conclude, in this post I present a very simple way of calculating a rink's bias in regard to shots on goal and misses. The results are nothing new, as this was established previously by Schuckers and Macdonald: while biases exist for both SOG and misses, they are a fair deal greater for misses.

I'd also argue that the Cam Ward example shows how vital it is to adjust for rink bias in regard to misses. While I believe both shots on goal and misses should be adjusted (ideally you'd adjust shots too, though it really doesn't matter much), the errors for misses are much greater and warrant intervention. I hope this post inspires more (and better) research into rink effects, and I hope to expand on it in the near future.

Here's a Google Doc with all the numbers.

**All Data here is courtesy of Corsica.Hockey