2. With the exception of the metric SAFE, no confidence intervals or standard errors for these estimates are provided; thus, the illusion of precision is perpetuated without quantification. Even stalwart proponents of UZR, such as Fangraphs’ Dave Cameron, suggest that the margin of error is ± 5 runs per season.[26] This means that, among all players from 2002 to 2011, almost 92 percent had a performance that was not distinguishable from zero at their position.[27]
3. While these data sets provide an (x, y)-coordinate indicating where the play was made, they provide no indication of where the fielder was standing when the play began. Thus, the separate skills of range and positioning are inherently conflated. That is, it is impossible to tell the difference between a fielder who actually covers a lot of ground once the ball is hit, and one who merely happens to position himself well before the ball is pitched, so that “what looked like superior defense might have been brilliant defensive positioning by the bench coach.”[28] One might be tempted to argue that, philosophically, we shouldn’t care to distinguish between these two skills, since what is ultimately important is whether the ball is fielded, regardless of the method employed to do so. However, from a physiological standpoint it seems self-evident that range will decline with age, whereas positioning might not (in fact, it might improve), thus making the distinction important, inter alia, for forecasting purposes. Moreover, it is not even known whether positioning is in fact a skill. Further, the recent trend toward shifting infielders to the first-base side when left-handed sluggers are batting compounds the severity of this problem.[29] Finally, if the pitcher is successful in hitting the catcher’s target, it simplifies player positioning and facilitates the jump a player gets on a batted ball.
4. It is assumed that the estimate of the probability of a ball in each bin being fielded is accurate, but in many cases this is probably not true. There is an inherent trade-off between making bins small enough that all of the balls in them are actually similar, and keeping the sample size large enough to get a good estimate (see the first sketch following this list). Despite Lewis’s claim that “any ball hit any place on a baseball field had been hit just that way thousands of times before,”[30] the truth is that (especially if one wants to make bins for each ballpark and control for batter and pitcher handedness) the sample sizes are much smaller, and the data go back only to the mid-1990s at best. Furthermore, most data sets contain no information about the profoundly relevant issue of how long it took each ball to reach the designated (x, y)-coordinate on the field, or the spin on the ball, or the moisture on the grass, or the number of hops the ball took on its way there. In general, location is the dominant factor that determines into which bin a particular ball goes. Yet a ground ball that goes up the middle in one second is clearly more difficult to field than one that takes two seconds to get there. UZR employs ordinal data on three categories of how a ball is hit (slow, medium, fast) and two categories of the runner’s speed (above and below average). This enables some consideration of these variables, but hardly allows for the precision that is often claimed or attributed to this metric.[31] Further, a player’s range encompasses not only his side-to-side and back-and-forth mobility, but also his ability to leap and to time his leaps; the latter skills appear to be left out of UZR, since infield line drives are excluded.
5. Again, with the exception of SAFE, the metrics presented above “could value only past performance.”[32] The difference is that while the metrics in the methodology described above derive a model of the league-average fielder, and then measure the deviations from that model for individual fielders,[33] SAFE constructs a model of the individual fielder, using the league-average fielder as a guide,[34] and then evaluates the fielder based on the cumulative run value of the balls that he is expected to see in the future. This subtle distinction imbues SAFE with predictive inference that is not present in UZR.
6. The reliability of these defensive metrics is not particularly high. Yankee shortstop Derek Jeter notoriously scored poorly on UZR, on the claim that he moved slowly to his left. Then suddenly in 2009, at thirty-five years of age, Jeter’s UZR rating soared, placing him well above average. Outfielder Nate McLouth is another puzzle, with a UZR rating of –13.8 in 2008 but a +3.6 rating in 2009.[35] Jeter and McLouth are not isolated examples. Estimates of the year-to-year correlation for all players range from 0.35 to 0.45, which, as we saw above, is roughly akin to batting average; that is, it is an unreliable predictor of future performance (the second sketch following this list illustrates the calculation).[36]
7. Given the low reliability of these metrics, it has often been posited that three years’ worth of data is needed to make an accurate assessment. The problem with this interpretation is that as the length of time over which data is collected increases, the assumption that the player’s true defensive ability remains unchanged becomes less and less realistic. That is, it doesn’t help to widen the time interval of study when the player is already changing within that window.[37] This makes estimating fielding ability somewhat of a moving target.
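To make the bin trade-off in point 4 concrete, here is a minimal sketch of how the missing standard errors from point 2 might be attached to a bin’s estimated fielding probability. All counts are invented for illustration; only the binomial arithmetic is standard.

```python
# Minimal sketch: binomial standard errors for a bin's fielding rate.
# All counts are hypothetical; the point is the bin-size trade-off.
import math

def fielding_rate_with_se(successes, opportunities):
    """Estimated P(ball in this bin is fielded) and its standard error."""
    p = successes / opportunities
    se = math.sqrt(p * (1 - p) / opportunities)
    return p, se

# A coarse bin pools many balls, so the estimate is fairly stable...
p, se = fielding_rate_with_se(620, 1000)
print(f"coarse bin: {p:.3f} +/- {1.96 * se:.3f}")  # 0.620 +/- 0.030

# ...but slicing by ballpark and handedness shrinks the sample,
# and the same estimate comes with a far wider interval.
p, se = fielding_rate_with_se(13, 20)
print(f"fine bin:   {p:.3f} +/- {1.96 * se:.3f}")  # 0.650 +/- 0.209
```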
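And here is the second sketch, referenced in point 6: the year-to-year reliability figure is simply the correlation between a player’s rating in consecutive seasons. The ratings below are invented; with real UZR data the same calculation yields the 0.35 to 0.45 range cited above.

```python
# Sketch of a year-to-year reliability calculation (invented data).
from statistics import correlation  # Python 3.10+

# (UZR in year t, UZR in year t+1) for a handful of made-up players
pairs = [(8.2, 3.1), (-4.0, -6.5), (1.1, 5.0),
         (-9.3, -1.2), (12.5, 4.4), (0.3, -2.8)]
year_t = [a for a, _ in pairs]
year_t1 = [b for _, b in pairs]
print(f"year-to-year r = {correlation(year_t, year_t1):.2f}")
```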
Many of these issues may be addressed in the coming years, when the long-rumored FIELDf/x data set arrives. FIELDf/x data, which is collected using a series of cameras mounted at the press level of each ballpark, promises to deliver exactly the variables that have thus far been lacking: the 3D position of the fielders and the ball at every fifteenth of a second during each play.[38] This has the potential to eliminate the concept of bins entirely, since the probability of a ball being fielded successfully can now be modeled as a continuous function of its position and the time it takes to reach a fielder.[39]
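As an illustration of what bin-free modeling could look like, the sketch below fits a logistic regression in which the probability of an out is a smooth function of two continuous tracking variables. The data are synthetic and the feature choices are our own assumptions, not a description of any actual FIELDf/x model.

```python
# Hypothetical sketch of continuous (bin-free) fielding modeling.
# Synthetic data stand in for FIELDf/x-style tracking measurements.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 5000
dist = rng.uniform(0, 40, n)     # feet the fielder must cover (assumed)
hang = rng.uniform(0.5, 5.0, n)  # seconds until the ball arrives (assumed)

# Invented ground truth: more time and less distance mean more outs
logit = 2.0 - 0.15 * dist + 1.0 * hang
out = rng.random(n) < 1 / (1 + np.exp(-logit))

model = LogisticRegression().fit(np.column_stack([dist, hang]), out)
# P(out) for a ball 20 feet away arriving in 1 vs. 2 seconds
print(model.predict_proba([[20.0, 1.0], [20.0, 2.0]])[:, 1])
```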
But while the never-before-seen data conveyed by FIELDf/x holds promise, not everyone is convinced that it will lead directly to an accurate assessment of fielding. Bill James, for one, remains skeptical that FIELDf/x will usher in a new era of precise defensive evaluation. In an April 2010 interview, James opined: “We’ve had these cameras pointed at pitchers for several years now and we haven’t really learned a damn thing that is useful. . . . I’d suspect the same thing would be true with respect to fielding.”[40]
Another issue is that while it is taken as self-evident that important differences among fielders exist, it is not clear how important those differences are. In Moneyball, the A’s former statistical consultant Eric Walker is quoted as estimating that fielding is “at most five percent of the game.”[41] DePodesta makes the more nuanced argument that “the variance between the best and worst fielders on the outcome of the game is a lot smaller than the variance between the best and worst hitters.”[42]
That is, DePodesta believes that poor fielding can be overcome by good hitting. In other words, while the old adage is that “good pitching beats good hitting,” DePodesta suggests that good hitting beats good fielding. Contrast this with the experience of the Tampa Bay Rays, whose last-place finish in 2007 was driven in part by their .669 DER, the worst in baseball since the adoption of the 162-game schedule in 1961. Through various means, the Rays were able to improve their DER dramatically the following season, to a league-best .723. This led to a whopping 273 fewer runs allowed by the Rays, which fueled their first-place finish and World Series appearance. In many ways, the improvement in DER warrants the lion’s share of the credit for the Rays’ transformation, since the team’s offense actually scored 8 fewer runs in 2008, and the strikeout, walk, and home run rates of the team’s pitching staff were largely unchanged.[43]
Since 2008, the Rays have been able to sustain their defensive prowess, posting the league’s best DER by a wide margin from 2008 to 2011.[44] The question of how the Rays achieved this remarkable transformation remains open to debate. In particular, it is unclear how large a role defensive metrics like UZR played in the personnel changes the Rays made. Other factors may include position changes (e.g., permanently shifting B. J. Upton from the infield to center field) and an expanded use of defensive shifts. In any case, given the large reduction in runs allowed that can follow from a small change in DER, Walker, and by extension the A’s at the time of Moneyball, most likely undervalued the importance of defense.[45]
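The scale of the Rays’ improvement can be checked with a back-of-the-envelope calculation. The figures for balls in play (about 4,400 per team-season) and runs per hit allowed (roughly 0.75) are our own rough assumptions, not numbers from the text:

```python
# Rough sketch: runs implied by the Rays' DER jump (assumed constants).
balls_in_play = 4400   # assumed balls in play per team-season
runs_per_hit = 0.75    # assumed average run value of a hit allowed
der_2007, der_2008 = 0.669, 0.723

extra_outs = (der_2008 - der_2007) * balls_in_play
print(f"~{extra_outs:.0f} extra outs, ~{extra_outs * runs_per_hit:.0f} runs saved")
# ~238 extra outs, ~178 runs saved: a large share of the 273-run swing
```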
We shall return to consider the Rays’ sudden and surprising success in 2008 and after, and its relation to sabermetrics, in Chapter 7.
To this point, we have illustrated how sabermetricians estimate the offensive and defensive contributions of both position players and pitchers. In both cases, those estimates can be constructed so that they are on the scale of runs. A natural next step is to combine these elements into a measure of each player’s overall contribution, whether offensive or defensive, in terms of runs. A final adjustment will translate runs into wins (using James’s model for expected winning percentage), and the result is a statistic that measures how many additional wins each player contributes to his team. If we choose to interpret the size of this contribution in relation to a “replacement” level player, then the result is Wins Above Replacement (WAR).
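A brief sketch of the runs-to-wins step may help. Using James’s Pythagorean model for expected winning percentage (with his classic exponent of 2), adding ten runs in a typical run environment is worth roughly one win, which is where the rule of thumb used below comes from. The run totals are illustrative assumptions:

```python
# Sketch: James's Pythagorean expectation implies ~10 runs per win.
def expected_wpct(runs_scored, runs_allowed):
    return runs_scored**2 / (runs_scored**2 + runs_allowed**2)

# Assume a typical environment: ~750 runs scored and allowed over 162 games
base = expected_wpct(750, 750) * 162
plus_ten = expected_wpct(760, 750) * 162
print(f"{plus_ten - base:.2f} extra wins from 10 extra runs")  # ~1.07
```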
For example, during the 2008 season, Fangraphs estimates that David Wright contributed 44.0 runs above average to the Mets through his hitting. As a baserunner, Wright cost the team 4.5 runs, while the value provided by his defensive contributions as a third baseman amounted to 5.1 runs. The sum of Wright’s contributions was thus 44.6 runs above what an average player would produce. If a replacement level player, who is considerably worse than an average player, had played as often as Wright played in 2008, he would have been about 24.5 runs worse than average, so Wright is also credited with that amount. Finally, because Wright played third base, a more difficult defensive position than average, this provides additional value amounting to 2.3 runs. Thus, Wright is credited with contributing 71.4 runs (44.6 + 24.5 + 2.3) to the Mets beyond what a replacement level player would have contributed. Since a quick-and-dirty estimate of the number of runs necessary to produce one additional win is about 10,[46] Fangraphs estimates that David Wright was worth about 7.1 wins above replacement to the Mets in 2008.[47]
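The arithmetic behind Wright’s figure can be laid out explicitly; the component values are the ones cited above, and the divisor of 10 is the quick-and-dirty runs-per-win estimate:

```python
# Wright's 2008 fWAR, assembled from the components quoted above.
batting, baserunning, fielding = 44.0, -4.5, 5.1   # runs above average
replacement, positional = 24.5, 2.3                # adjustments, in runs

runs_above_average = batting + baserunning + fielding        # 44.6
total_runs = runs_above_average + replacement + positional   # 71.4
print(f"{total_runs:.1f} runs -> {total_runs / 10:.1f} WAR")  # 71.4 -> 7.1
```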
The interpretation is that had Wright gone down with an injury in spring training, the Mets would have been forced to replace him. And if they did not have a major-league-ready prospect in the minor leagues, or another major leaguer on the roster who could play third, they would likely have had to replace Wright with a so-called “4A” player.[48] This is a hypothetical journeyman who spends most of his time at the AAA level of the minor leagues and has success there. He probably has some major league time under his belt, but (at least recently) has not played a significant role for any major league team, and has not been able to hold on to a roster spot. Players like this are frequently minor league free agents at the end of the season and are assumed to be plentiful.[49] Thus, it is assumed that a player of this caliber is available to every major league team at any time, and his level of production represents essentially the worst-case scenario for the major league club. This is the Platonic ideal of a “replacement” level player.[50]
A replacement-level player, by definition, contributes 0 WAR. Under the assumption that the Mets could have replaced Wright with a player (or group of players) who would have produced 0 WAR, the value that Wright provided to the Mets must be understood in relation to this level of production. In economic terms, what is being measured is Wright’s marginal physical product.
While the idea behind WAR (modeling marginal physical product) is a good one, in our view the existing methodologies leave much to be desired. This is a shame, since the statistic appears to be easily understandable, which has enabled it to permeate the mainstream media.[51] Our concern, in a manner analogous to the discussion of UZR, is that the details of a statistic used by too many are known only to too few. Furthermore, there are some subtleties in the modeling aspect of WAR that may not be fully understood.
Currently, there are three popular implementations of WAR: one computed and made available through the Fangraphs website (known as fWAR), another computed by Sean Forman and Sean Smith and available on Baseball-Reference.com (rWAR, sometimes also called bWAR), and yet another available through Baseball Prospectus (WARP). As the three sites use neither the same data set nor the same methodology, the numbers returned by the three systems usually do not agree with one another.[52]
For example, in Table 5, we show the components that comprise the WAR calculation for David Wright’s 2008 season from all three sources. While in this case there is a general consensus that Wright was about a seven-win player in 2008, there is considerable disagreement over the components that comprise that estimate. In particular, Baseball Prospectus viewed Wright as being nearly 5 runs below average as a fielder, while Fangraphs had him at nearly 5 runs above average. In contrast, Baseball Prospectus viewed Wright’s baserunning as having made a small positive contribution (1.5 runs), while Fangraphs saw it as being negative (–4.5 runs).
This state of affairs is frustrating for anyone trying to understand the true value of Wright’s worth to the Mets. In part, the discrepancies among the numbers spit out by the three systems speak to the difficulty of estimating this unknown quantity. In effect, we have three different models, operating on at least two different data sets, created by many people occasionally working together and occasionally in competition, all trying to estimate the same unknown. What we would like to emphasize (and it is a subtlety generally missed by the media) is that WAR is not a statistic akin to OBP or even UZR. Rather, it is an unknown quantity that is modeled more or less independently by at least three statistics (fWAR, rWAR, and WARP). While the creators of these models do generally recognize that the margin of error in their calculations is relatively large (± 0.5 wins by one estimate),[53] they do not provide standard errors or confidence intervals. Moreover, the overwhelming majority of players have WARs that are within one win of zero.[54]
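If one took the ± 0.5-win margin cited above at face value, the three systems could at least publish interval estimates rather than bare point estimates. A hypothetical illustration (the rWAR and WARP values here are placeholders, not the actual Table 5 figures):

```python
# Hypothetical: reporting WAR with the +/- 0.5-win margin cited above.
def war_interval(war, margin=0.5):
    return f"{war:.1f} WAR ({war - margin:.1f} to {war + margin:.1f})"

# fWAR matches the Fangraphs figure above; rWAR and WARP are placeholders
for system, estimate in [("fWAR", 7.1), ("rWAR", 6.9), ("WARP", 7.3)]:
    print(system, war_interval(estimate))
```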