Table 4. Properties of Pitching Statistics
The first major blow to this contention was delivered by Tom Tippett in 2003.[7]
Tippett found that knuckleballers, as a group, had consistently lower BABIP than pitchers in general. Furthermore, he found that many pitchers with long, successful careers had consistently low BABIPs.[8] This suggests that pitch type plays a role in determining BABIP, and indeed this contention has stood up to further scrutiny. In addition, as we saw above, the rates of ground balls and fly balls put into play against a certain pitcher show high reliability, since they reflect attributes of that pitcher (e.g., his arm angle, the particular spin he puts on the ball, or the location of his pitches). Because BABIP differs on ground balls and fly balls, it stands to reason that the rate at which a pitcher induces ground balls would affect his BABIP. This line of thinking has led to numerous attempts to improve the ability to predict future BABIP, but to our knowledge those improvements have been only incremental.[9] Controlling for ballpark characteristics also helps, but controlling for defense is difficult. In many ways we consider the future prediction of batting average on balls in play against pitchers to be the most prominent open problem in the field of baseball analytics.[10]
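To make this line of thinking concrete, here is a minimal sketch of how an expected BABIP might be computed from a pitcher's batted-ball mix. The per-trajectory rates and the function name are our own illustrative assumptions, not published figures.

```python
# Sketch: expected BABIP from a pitcher's batted-ball profile.
# The per-trajectory BABIP values are hypothetical placeholders; real
# values would be estimated from league-wide play-by-play data.
LEAGUE_BABIP_BY_TRAJECTORY = {
    "ground_ball": 0.24,  # assumed league rate on grounders
    "fly_ball": 0.14,     # assumed league rate on flies
    "line_drive": 0.70,   # assumed league rate on liners
}

def expected_babip(batted_ball_rates):
    """Weight league BABIP by how often this pitcher allows each trajectory.

    batted_ball_rates: dict mapping trajectory -> share of balls in play,
    e.g. {"ground_ball": 0.52, "fly_ball": 0.28, "line_drive": 0.20}.
    """
    return sum(
        share * LEAGUE_BABIP_BY_TRAJECTORY[trajectory]
        for trajectory, share in batted_ball_rates.items()
    )

# A ground-ball pitcher projects to a different BABIP than a fly-ball
# pitcher, even before accounting for park and defense.
print(expected_babip({"ground_ball": 0.52, "fly_ball": 0.28, "line_drive": 0.20}))
```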
As the understanding of DIPS has permeated the sabermetric community, a plethora of alternative performance metrics for pitchers has been proposed. A clear evolution can be traced from Fielding Independent Pitching (FIP) and Expected Fielding Independent Pitching (xFIP), which ignore pitcher BABIP entirely, to Skill-Interactive ERA (SIERA), which incorporates additional variables like ground ball rate.[11]
However, in our view none of these pitching metrics represents true insight into the relationship between pitching and defense in the way that McCracken’s work did. FIP and xFIP continue in the longstanding tradition of conjuring arbitrary constants so as to peg the scale of a new metric to that of an old one (ERA). But the scale of ERA is itself arbitrary, in that only runs have intrinsic meaning in baseball; earned runs largely reflect an outdated convention. What this field truly needs is a simple, illustrative, but effective model to evaluate pitchers. Until a model can be constructed with interpretable coefficients (à la linear weights), or with meaningful interaction of terms (à la Runs Created), no real insight will be gained, and there is unlikely to be any consensus about which metric is best.[12]
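To illustrate the scale-pegging being criticized here: FIP is commonly computed as a weighted sum of home runs, walks, and strikeouts per inning, plus a constant chosen each season so that league-average FIP matches league-average ERA. The weights below follow the commonly published formula; the constant shown is illustrative, not authoritative.

```python
# Sketch of the commonly published FIP formula. The weights (13, 3, 2)
# are standard; walks here stand in for walks plus hit batsmen. The
# additive constant is recomputed each season purely so that league FIP
# equals league ERA (roughly 3.1 in most seasons; 3.10 is illustrative).
def fip(hr, bb, k, ip, constant=3.10):
    """Fielding Independent Pitching, pegged to the ERA scale."""
    return (13 * hr + 3 * bb - 2 * k) / ip + constant

# The constant carries no information about the pitcher; change it and
# every pitcher's FIP shifts by the same amount, which is exactly the
# authors' point about pegging a new metric to an old, arbitrary scale.
print(round(fip(hr=20, bb=50, k=200, ip=210.0), 2))
```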
The statistical prediction of future pitcher performance thus boils down to two things that are relatively easy to predict (strikeout rate and walk rate), one thing that is hard to predict (BABIP), and one thing that is somewhere in between (home run rate). A more comprehensive approach, one employed by Nate Silver’s PECOTA, is to pool pitchers into similar groups, and use the future performance of the group as a guide. Does the career path of a soft-tossing lefty like Jamie Moyer really have much to say about the career path of a big, strong, hard-throwing righty like Josh Johnson? Probably not, so it makes more sense to try to understand Johnson’s career trajectory among pitchers who are similar to him in body type, repertoire, velocity, and so on (such as Curt Schilling and Josh Beckett). How those comparisons are made, and how the results are harvested, form the distinctions between prediction systems of this type.
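As a rough illustration of the pooling idea behind PECOTA-style systems (the real system is proprietary; this nearest-neighbor sketch, its feature set, and all sample values are our assumptions), one might find a pitcher's most similar historical comparables and average their subsequent performance:

```python
import math

# Hypothetical feature vectors: (height_inches, fastball_mph, k_rate, gb_rate),
# paired with that pitcher's ERA over his next three seasons. A real system
# would use a far richer profile; these fields and values are illustrative.
HISTORICAL_PITCHERS = {
    "comp_a": ((77, 94, 0.26, 0.45), 3.40),
    "comp_b": ((76, 95, 0.27, 0.42), 3.15),
    "comp_c": ((72, 84, 0.12, 0.48), 4.60),
}

def distance(u, v):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def project(features, k=2):
    """Average the future performance of the k most similar historical pitchers."""
    comps = sorted(HISTORICAL_PITCHERS.values(),
                   key=lambda rec: distance(features, rec[0]))
    return sum(future for _, future in comps[:k]) / k

# A hard-throwing righty is matched with other hard throwers, not soft-tossers.
print(project((77, 95, 0.28, 0.44)))
```

How the comparables are weighted, and which features enter the distance calculation, are precisely the design choices that distinguish one prediction system from another.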
While much has changed in the decade following the publication of Moneyball, it remains true that there is no consensus about “exactly which part of defense [is] pitching and which part fielding, and no one [can] say exactly how important fielding [is].”[13]
Nevertheless, as the theory behind DIPS has been refined and assimilated, interest in differentiating pitching from fielding has increased. In particular, many new metrics that attempt to evaluate fielders have sprung up and gained popularity, not only in the sabermetrics community, but also in the mainstream media.[14] We are not alone in remaining dissatisfied with these metrics, which have been likened to “a flashlight in a dark room.”[15] In what follows, we illuminate the state of defensive metrics in a historical context, and provide a substantive critique of the limitations of the current state of the art.
Until play-by-play data became freely available on the Internet through the tireless efforts of Retrosheet and Project Scoresheet volunteers, attempts by the public to evaluate fielders in baseball were limited by the fact that only three basic statistics were commonly recorded: assists, errors, and put outs. Compounding the problem, the designation of an error was subjective, because it relied on the judgment of the home ballpark’s official scorer—a human being whose objectivity, not to mention visual acuity, was often called into question. Fielding percentage (FPCT), which is simply the ratio of total plays made (assists + put outs) to total recorded opportunities (assists + put outs + errors), is a sensible way to combine those three statistics, and it remains the definitive measurement of defensive prowess for much of the baseball-watching world. But while the lack of objectivity that goes into fielding percentage is troubling, the question that FPCT addresses is not even really that interesting. It does provide a somewhat reasonable, if subjective, assessment of the relative sure-handedness of a fielder. But it says nothing about the at least equally important skill of range, or the skill of leaping and timing. As Alan Schwarz points out in his excellent book, fielding percentage harkens back to the earliest days of baseball, when baseball gloves were little more than what Dan Marino might slip into the Christmas stockings of his offensive linemen, and the concept of an error referred literally to failing to catch a ball that hit you in the hands.[16]
In today’s game, the skill of sure-handedness or throwing accuracy is of questionable value compared to the skill of range, which measures how much ground a fielder can cover. Thus, the most important question is not “how often do you turn a ball into an out, given that it is hit to you?” but rather “how often do you turn a ball into an out?”
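As a minimal sketch of the statistic just described (the function name is ours):

```python
# Fielding percentage: plays made divided by total chances, exactly the
# ratio defined above.
def fielding_percentage(putouts, assists, errors):
    chances = putouts + assists + errors
    return (putouts + assists) / chances

# A shortstop with 250 putouts, 400 assists, and 15 errors:
print(round(fielding_percentage(250, 400, 15), 3))  # 0.977
```

Note what the statistic cannot see: a ball the fielder never reaches is neither a chance nor an error, so range never enters the calculation.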
At this point it probably comes as no surprise that one person who popularized this distinction was Bill James. By the mid-1970s, James had created Range Factor (RF) to evaluate individual fielders, and Defensive Efficiency Rating (DER) to evaluate teams, using only conventionally available statistics. Range Factor measures the number of plays made by a particular fielder, which in theory quantifies the player’s range. Unfortunately, this metric is of limited value when comparing different players, because the number of opportunities varies so widely based on a host of external factors, such as the composition of the team’s pitching staff. Fielders who play behind a staff of strikeout pitchers are hopelessly uncompetitive with those who play behind a staff who “pitch to contact.” Similarly, outfielders playing behind a fly ball-heavy pitching staff will have a leg up on those playing behind a staff of ground ball pitchers.
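A sketch of Range Factor in its common per-game form (the per-nine-innings variant simply replaces games with innings; the function name is ours):

```python
# Range Factor: plays made per game, James's proxy for how much ground a
# fielder covers. Often also quoted per nine innings as
# 9 * (putouts + assists) / innings.
def range_factor(putouts, assists, games):
    return (putouts + assists) / games

# Two fielders with identical fielding percentages can have very different
# range factors, which is exactly the distinction James was after.
print(round(range_factor(250, 400, 150), 2))  # 4.33
```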
However, Defensive Efficiency Rating (DER) is the perfect complement to DIPS,[17] in that it is essentially 1 – BABIP.[18] That is, what percentage of the balls put into play against a team is converted into outs? In typical fashion, James’s statistic is both simple and insightful, and it directly addresses a not-immediately-obvious question of profound interest. But while DER may adequately measure the defensive performance of a team, there is no obvious way to apply the metric to individual fielders. Part of the problem is that the interaction among fielders is very difficult to disentangle. At the time Moneyball was written, a second problem was that “there wasn’t the data available to make a meaningful appraisal of fielding.”[19]
Advances in the past decade may have begun to change that assessment.
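A sketch of the relationship just described, under the usual convention that balls in play exclude home runs (the exact treatment of sacrifices and reached-on-error varies by source; this simplified accounting is our assumption):

```python
# Defensive Efficiency Rating as the complement of opponents' BABIP:
# the share of balls in play that the defense converts into outs.
def der(hits_in_play, balls_in_play):
    """1 - BABIP for a team's defense.

    hits_in_play: hits allowed, excluding home runs (which never
    touch the defense).
    balls_in_play: batted balls allowed, excluding home runs.
    """
    return 1 - hits_in_play / balls_in_play

# A team allowing 1,220 in-play hits on 4,220 balls in play converts
# about 71 percent of balls in play into outs.
print(round(der(1220, 4220), 3))  # 0.711
```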
Play-by-play data that is currently available from a variety of vendors provides much more detailed information, enabling more sophisticated models of defensive performance. Retrosheet data gives a zone that indicates where on the field a ball was caught, dropped in for a hit, or went through the infield (Figure 9). Proprietary data from STATS, Inc., Baseball Info Solutions, and Major League Baseball Advanced Media provide an even greater level of precision (or at least the illusion of it), by giving an (x, y)-coordinate pair over a grid of the field for that same information. Armed with this data, groups of researchers (many of whom are working for teams on proprietary models) have attempted to answer questions like: “If Derek Jeter is playing shortstop, and a ground ball is hit through the hole 22 degrees to the right of the third base line, what is the probability that he will turn it into an out?” It is important to note that, for the most part, these data sets contain no information about the spin on the ball, and only an ordinal description of the trajectory (e.g., ground ball, fly ball, line drive) and speed (e.g., hard-hit, medium, soft).
A non-exhaustive list of models known to the public that start with this question includes ESPN’s Zone Rating, David Pinto’s Probabilistic Model of Range, John Dewan’s Plus/Minus system, Shane Jensen’s Spatial Aggregate Fielding Evaluation (SAFE), and the de facto industry standard, Mitchel Lichtman’s Ultimate Zone Rating (UZR). While a full dissection of the differences among these metrics is beyond the scope of what we can accomplish here, they share a common mathematical core, which we discuss for illustrative purposes.[20]
Suppose that all balls hit into play can be divided into bins, in which all of the balls in the same bin are similar, in some sense. (The methodology for determining bins, and assigning balls to them, differs from metric to metric, and from data set to data set.)[21] Then for each bin, we can estimate the probability that an average major league fielder at each of the nine defensive positions will successfully convert a ball in that bin into an out.[22] Moreover, we can estimate the average value (in runs) of a ball hit to each one of the bins. After all, a ball hit down the first-base line past the first baseman is more valuable than a ball hit through the hole between first and second, since it is more likely to become an extra-base hit. Finally, for each fielder, we have observations about each ball put into play while he was on the field, and which ones he successfully fielded.[23] By comparing his actual performance to his expected performance (based on the expectation of a league-average fielder at his position), we get an estimate of the defensive value (in runs) he provided relative to a league-average fielder. With the caveat that we have overlooked the details of this procedure, this is how UZR arrives at an estimate of the number of runs that a player has saved over the course of a season due to his range. The final estimate for UZR includes additional components for assessing throwing, sure-handedness, and the ability to turn a double play.
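A stripped-down sketch of this shared mathematical core (UZR itself is proprietary; the bin labels, league out probabilities, and run values below are illustrative assumptions, not real UZR inputs):

```python
# Shared core of zone-based fielding metrics, as described above: for each
# ball in a fielder's zone of responsibility, credit him with
# (made_play - league_avg_out_probability) * run_value_of_the_ball.
# All numbers are illustrative assumptions.

LEAGUE_OUT_PROB = {"ss_hole": 0.30, "ss_middle": 0.60, "ss_routine": 0.95}
RUN_VALUE = {"ss_hole": 0.55, "ss_middle": 0.48, "ss_routine": 0.45}

def range_runs(observations):
    """observations: list of (bin_label, made_play) pairs for balls hit
    while this fielder was at his position."""
    total = 0.0
    for bin_label, made_play in observations:
        expected = LEAGUE_OUT_PROB[bin_label]
        actual = 1.0 if made_play else 0.0
        # Runs saved relative to an average fielder on this ball.
        total += (actual - expected) * RUN_VALUE[bin_label]
    return total

# A shortstop who makes a tough play in the hole and a play up the middle,
# but botches a routine grounder, nets out slightly above average:
balls = [("ss_hole", True), ("ss_routine", False), ("ss_middle", True)]
print(round(range_runs(balls), 2))  # 0.15
```

Summed over a season, these per-ball credits yield the runs-saved estimate; the disagreements among the public systems lie almost entirely in how the bins, probabilities, and run values are estimated.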
Figure 9. Project Scoresheet Hit Location Diagram
This is a start at a smart and reasonable methodology for evaluating fielders. So why are we so dissatisfied with the state of defensive metrics? Here is a non-exhaustive list:
1. UZR is a proprietary metric developed by a single person. Furthermore, the data that is fed into the system is proprietary. Thus, the system is a black box that spits out numbers in which we should have little confidence. For proprietary metrics of this nature, there is no assurance that the computation is mathematically sound, does not contain bugs, or even that the numbers are not simply picked out of a hat. The dangers of this state of affairs came into full view after the 2009 season, when Jason Bay’s awful UZR numbers were “updated” by Lichtman.[24] Bay, who was considered by scouts to be an adequate, if unspectacular, left fielder, was one of the worst outfielders in baseball upon replacing Manny Ramirez as the Red Sox left fielder, according to UZR. Meanwhile, Ramirez went from being a laughingstock to a slightly below average left fielder upon his trade to the Los Angeles Dodgers, who play in a park with a relatively large outfield, unlike the notoriously cramped Fenway Park. After Lichtman improved his ballpark estimates, Bay’s UZR per 150 games of –13.8 runs was “updated” to +1.8 runs, transforming him from a lousy outfielder into an average one overnight, and during the offseason to boot.[25] Of course, this is nonsense, but we do not view the episode as reflecting especially poorly upon Lichtman. If anything, he should be praised for making improvements to his model, especially given the somewhat embarrassing light into which the changes were cast. Rather, the incident reflects poorly on all of those in the media and sabermetric community who tout the validity and accuracy of a closed-source, proprietary metric that is mysteriously controlled by one person. It seems hard to believe that Bay’s imprecise UZR would not have been caught earlier had the formula for UZR been known to the public.