Tuesday, May 31, 2016

Regressing Sv% by Danger Zone

Goalies aren't easy to evaluate. Or, as some might say, they are voodoo. Danger zones were introduced some time back by the War On Ice crew (which will soon be defunct) and used by Nick Mercadante in his Mercad60. They split shots into low, medium, and high danger based on the location of the shot and whether it was a rebound or a rush shot. They've been used quite a bit. I've been thinking about them lately, and strangely, I've never actually seen anyone regress them. So I did, and I thought I might as well share it.

What we would expect, from intuition and what we know, is that Low-Danger is almost all noise, Mid-Danger has a signal but it's weak, and High-Danger is the best and has a fair signal. Of course, we need to confirm this. My first idea was to test it the conventional way: take all goalies, line up all their shots, and run odd-even correlations at bigger and bigger shot counts until r=.5 (r being the correlation coefficient). The number of shots needed to reach r=.5 would be how much we regress (and we would regress toward league average). The problem is that Sv% (in all zones) takes a long time to show a good signal, and only so many goalies have faced a lot of shots, so we run out of goalies before the shot count gets high enough. (I was actually able to get a read on High-Danger Sv% and found it reached r=.5 at 1050, which makes sense as we'll see soon.)
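As a rough sketch of that odd-even approach (not the exact script used for this post), here is one way it could look in Python. The data layout, the function names, and the choice to count shots per half rather than in total are all assumptions made for illustration.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Plain Pearson correlation coefficient between two equal-length lists."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def odd_even_correlation(goalie_shots, shots_per_half):
    """goalie_shots (hypothetical layout): dict of goalie -> list of 0/1
    outcomes in chronological order, 1 = save, 0 = goal.

    Takes the first 2 * shots_per_half shots of every goalie who has that
    many, splits them into odd- and even-numbered shots, and correlates the
    Sv% of the two halves. Returns (r, number of goalies used).
    """
    odd_sv, even_sv = [], []
    for outcomes in goalie_shots.values():
        if len(outcomes) < 2 * shots_per_half:
            continue  # not enough shots; this is where we run out of goalies
        sample = outcomes[: 2 * shots_per_half]
        odd_half, even_half = sample[0::2], sample[1::2]
        odd_sv.append(sum(odd_half) / len(odd_half))
        even_sv.append(sum(even_half) / len(even_half))
    return pearson_r(odd_sv, even_sv), len(odd_sv)
```

You would call this for increasing values of `shots_per_half` and watch both r and the number of goalies left. The second return value is the catch described above: it shrinks quickly as the shot count goes up.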

Thankfully, there are other ways. One way, which is easier and more convenient, is shown here by Tom Tango. I strongly recommend you read that before continuing (and I also recommend reading his blog in general). OK (I'm assuming you read it), I ran the numbers for the last 4 years on goalies who faced at least 750 total shots. Here's what we find (it goes without saying that these are all even-strength numbers). I included overall Sv% mostly for the hell of it and as a self-check (I know it should be about 3000). The best way is regressing each zone independently.

                         n      Sv%     Low-Sv%   Mid-Sv%   High-Sv%
SD of z-scores:          76     1.307   1.136     1.033     1.298
# of shots regressed:    76     3534    3915      10127     1001
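For anyone who wants to see how the two rows relate, here is a minimal sketch of the z-score method as I understand it from Tango's write-up. Treat the details as assumptions: the function name, the `(saves, shots)` input layout, the use of a population standard deviation, and the use of the simple average of shots faced are illustrative choices, not necessarily what produced the numbers above.

```python
from math import sqrt

def shots_to_regress(goalies, league_sv):
    """goalies (hypothetical layout): list of (saves, shots) tuples for one zone.
    league_sv: league-average Sv% for that zone, e.g. 0.8347.

    Returns (SD of z-scores, number of shots to regress by).
    """
    z_scores = []
    for saves, shots in goalies:
        # Spread we'd expect from pure binomial luck over this many shots
        luck_sd = sqrt(league_sv * (1 - league_sv) / shots)
        z_scores.append((saves / shots - league_sv) / luck_sd)

    mean_z = sum(z_scores) / len(z_scores)
    sd_z = sqrt(sum((z - mean_z) ** 2 for z in z_scores) / len(z_scores))

    # In z-score units luck has variance 1, so sd_z**2 - 1 is the ratio of
    # talent variance to luck variance at the average number of shots faced.
    avg_shots = sum(shots for _, shots in goalies) / len(goalies)
    return sd_z, avg_shots / (sd_z ** 2 - 1)
```

Two things to note. An SD of z-scores close to 1 means the spread is barely bigger than luck alone, so the shots to regress blows up. And an SD below 1 gives a negative number, which has no sensible interpretation (an SD of exactly 1 would divide by zero in this sketch).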


My first thought from these numbers is that High-Danger and regular Sv% (which is near Tango's estimate) make sense. But Low and Mid seem to be in the wrong spots. Swap them and it would make sense, but as they stand they don't match what we expected. So what I did was run it on the 4 years previous to that (2008-2012). And what we find is:

                         n      Sv%     Low-Sv%   Mid-Sv%   High-Sv%
SD of z-scores:          71     1.457   .942      1.268     1.264
# of shots regressed:    71     2638    -11565    1352      1400

For this one, Sv% makes a little more sense and High-Danger stays about the same. But as you can easily tell, Low-Danger is fucked up (the number is negative, which is what happens when the SD of the z-scores comes in below 1) and Mid-Danger seems too low. So what we'll do is run them together. We'll just combine the data and see what we find (one note: the percentages used now are for the entire eight-year period, not each four-year period individually).

                         n      Sv%     Low-Sv%   Mid-Sv%   High-Sv%
SD of z-scores:          147    1.4     1.046     1.16      1.295
# of shots regressed:    147    2838    12933     2165      1121

And these estimates make the most sense. Sv% is near the 3000 mark, Low-Danger is pretty much just noise, Mid-Danger has a weaker signal, and High-Danger is the best and has a fair signal.
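To make "regressing" concrete, here is a small sketch of how one of these numbers would be applied: add that many league-average shots to the goalie's record and recompute the Sv%. The goalie below is made up, and pairing the eight-year High-Danger figure (1121) with the four-year league average listed at the end of the post (.8347) is a shortcut for illustration only.

```python
def regressed_sv(saves, shots, league_sv, regression_shots):
    """Blend a goalie's record with `regression_shots` league-average shots."""
    phantom_saves = league_sv * regression_shots
    return (saves + phantom_saves) / (shots + regression_shots)

# Hypothetical goalie: 430 saves on 500 High-Danger shots (.860 observed)
estimate = regressed_sv(430, 500, league_sv=0.8347, regression_shots=1121)
print(f"{estimate:.4f}")
```

The estimate lands between the league average and the observed .860, and much closer to the league average, because 500 shots is well under the 1121 needed to get halfway. Run the same goalie through the Low-Danger figure of 12933 and the estimate would barely move off league average at all.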

Lastly, I'd like you to think about the spread of Sv% in each danger zone. Over the past three years the observed standard deviation for each one is: Low .69%, Mid 1.317%, High 2.232%. And what we know is that the amount of noise decreases as we move to the right (from Low to High). So not only does the observed spread increase as we move to High-Danger, but we can attribute more of it to talent (the standard deviation due to talent is higher). I think this is important in showing why we should mainly focus on High-Danger Sv% when it comes to evaluating goalies (obviously not only on it, just mostly), as it has the biggest spread in talent.
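Here is a back-of-the-envelope sketch of that argument. If observed variance is talent variance plus luck variance, and the SD of z-scores is roughly observed SD divided by luck SD, then the talent SD is about the observed SD times sqrt(1 - 1/SD_z^2). Big caveats: this assumes goalies face similar numbers of shots, and it mixes the three-year observed SDs with the eight-year SDs of z-scores from the combined table, so read the output as a rough ordering, not a measurement.

```python
from math import sqrt

def talent_sd(observed_sd, sd_z):
    """Rough talent SD implied by an observed SD and an SD of z-scores."""
    if sd_z <= 1:
        return 0.0  # spread no bigger than luck alone; no detectable talent
    return observed_sd * sqrt(1 - 1 / sd_z ** 2)

# (observed SD in percentage points, SD of z-scores from the combined table)
zones = {"Low": (0.69, 1.046), "Mid": (1.317, 1.16), "High": (2.232, 1.295)}
for zone, (observed, sd_z) in zones.items():
    print(f"{zone}: talent SD of roughly {talent_sd(observed, sd_z):.2f}%")
```

Under these assumptions the implied talent spread grows from Low to Mid to High, which is the point being made above.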

***All Data courtesy of War On Ice

For those who care, the league average Sv% for each zone (and in general) over the past 4 years:

Sv%     Low-Sv%   Mid-Sv%   High-Sv%
.923    .9738     .927      .8347