The Sabermetric Revolution: Assessing the Growth of Analytics in Baseball
Authors: Benjamin Baumer, Andrew Zimbalist
Table 5. Wins Above Replacement for David Wright in 2008
What is needed at a minimum, in our view, to solidify the presence of WAR as a meaningful quantity worthy of discussion and comparison, is a fully open-source implementation of Wins Above Replacement. This would include:
1. A clear description of the methodology employed, preferably including mathematical notation and certainly including justifications for any arbitrary constants, scaling factors, or “corrections.”
2. An open data set. At the moment this makes Retrosheet the only option, unless one is willing to parse the MLBAM GameDay files, and release the source code for that parser.
55
3. The source code, using only open-source software, that will reproduce all of the calculations necessary to arrive at the final WAR estimates.
56
The payoff to such an undertaking would be to lift the veil that obscures the details known only to the select few who stand to profit from the current implementations of WAR. Baseball Prospectus is a private company whose business model is predicated on subscriptions, so that readers can get the proprietary metrics that only BP provides. They can always claim that their implementation of WAR (e.g., WARP) is better than the open-source version (let’s call it openWAR). Fangraphs and Mitchel Lichtman can always claim that UZR, a proprietary metric built on top of a proprietary data set, will provide superior defensive evaluations to the relatively crude estimates to which one might be limited by using only Retrosheet data. But at least when it is claimed that David Wright’s openWAR in 2008 was 6.9 wins, with a 0.5 win margin of error, there will be universal agreement on what that number means, and what exactly went into the computation to arrive at it. Then, and only then, will it make sense for third-party organizations (e.g., ESPN or the MLB Network) to claim that Wright’s WAR in 2008 “was” 6.9 wins. Under the current system, such a claim has no meaning.
As it stands, the problems with WAR have to do not only with the opaqueness of the underlying data and methodology, but also with known elements of the method that are dubious. For instance, a player’s run differential is judged in reference to a replacement player, but it is not clear that there exists a pool of replacement players with the productivity that is ascribed to them.
57
Even accepting the run differential, the use of James’s Pythagorean Expectation to convert runs into wins is less than robust. One need only reflect on the 2012 Baltimore Orioles, who outperformed their expected win total by 11 games, to see how inaccurate the runs-to-wins conversion can be.
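As a rough check on the Orioles example, the Pythagorean Expectation can be written out directly. The sketch below assumes the classic exponent of 2 and uses the 2012 Orioles' commonly reported season totals (712 runs scored, 705 runs allowed, 93 wins); those inputs are not taken from this chapter and should be verified against a source such as Retrosheet.

```python
def pythagorean_win_pct(runs_scored, runs_allowed, exponent=2.0):
    """Bill James's Pythagorean Expectation: RS^x / (RS^x + RA^x)."""
    rs = runs_scored ** exponent
    ra = runs_allowed ** exponent
    return rs / (rs + ra)

# Assumed inputs (not from the chapter): 2012 Orioles season totals.
RUNS_SCORED, RUNS_ALLOWED, ACTUAL_WINS, GAMES = 712, 705, 93, 162

expected_wins = GAMES * pythagorean_win_pct(RUNS_SCORED, RUNS_ALLOWED)
surplus = ACTUAL_WINS - expected_wins
print(f"expected wins: {expected_wins:.1f}, actual minus expected: {surplus:+.1f}")
```

With these inputs the formula expects roughly 82 wins from a team that scored only slightly more runs than it allowed, which is consistent with the 11-game gap cited above.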
Finally, many sabermetricians have taken the questionable WAR estimate and converted it into a marginal revenue product estimated via the metric MORP, or marginal value over replacement player. The basic idea here is to estimate the value of a player to a team and, thereby, inform the team how much it should be willing to pay the player.
58
To do this, an average value of a win is estimated at roughly $4 million and this average is then applied, with only the most minor of variations, to every player in MLB. As we show in Chapter 6, the value of a win varies significantly from team to team, depending on a team’s win percentage, the economic size of its market, and other factors. Since MORP is intended as a guide to building a competitive roster, it makes little sense to abstract from profound differences among teams to estimate a player’s value.
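The arithmetic behind a flat-rate valuation is simple enough to sketch. The $4 million figure and the 6.9 WAR figure come from the text; the two team-specific prices per win are invented purely to show how much the answer moves when the price of a win is allowed to vary, and they are not estimates from Chapter 6.

```python
FLAT_DOLLARS_PER_WIN = 4_000_000  # the rough league-wide average cited in the text

def flat_rate_value(war, dollars_per_win=FLAT_DOLLARS_PER_WIN):
    """Value a player as wins above replacement times a price per win."""
    return war * dollars_per_win

# Hypothetical team-specific prices per win, invented purely for illustration.
hypothetical_team_rates = {
    "large-market contender": 6_000_000,
    "small-market rebuilder": 2_500_000,
}

war = 6.9  # the WAR figure used for David Wright earlier in the text
print(f"flat rate: ${flat_rate_value(war):,.0f}")
for team, rate in hypothetical_team_rates.items():
    print(f"{team}: ${flat_rate_value(war, rate):,.0f}")
```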
The complexity of baseball is one of its great lures, and in the National League at least, strategic play continues to provide excitement night after night. Again, while many strategic maxims have become part of the conventional wisdom over decades of play, the more recent availability of play-by-play data has enabled sabermetricians to analyze these strategies with newfound precision. Although many baseball managers and fans are wedded to doing things by “the book” (e.g., sacrifice bunting), a new generation of thinkers has been more influenced by The Book, an ironically titled attempt to reconcile common knowledge with actual data.
59
Although some find the tone of The Book condescending, it ably covers important elements of baseball strategy in greater depth than we can achieve here. In a wide variety of publications, sabermetricians have tackled an ample array of questions, including platoon effects, lineup construction, reliever deployment, pitching rotations, intentional walks, base stealing, pitch selection, the hot hand, individual batter versus pitcher matchups, clutch hitting, sacrifice bunting, defensive shifts, and so on. In what follows, we will briefly characterize sabermetric thinking on a few of these topics, and illustrate common frameworks for thinking about how to address these issues.
It has long been observed that left-handed hitters have a much more difficult time hitting against left-handed pitchers, while right-handed hitters have a much harder time hitting against right-handed pitchers. Certainly, the angle of a pitcher’s delivery is meaningfully different, and many hitters claim to “see the ball better” against a pitcher of the opposite hand. Most breaking pitches, such as the ubiquitous slider, break away from a hitter of the same hand, making them more difficult to hit. Indeed, since 1995 left-handed batters have hit nearly 70 points higher in OPS against right-handed pitchers (.782 versus .713), while right-handed hitters have hit about 42 points higher in OPS against left-handed pitchers (.773 versus .731). This discrepancy leads to the so-called “platoon” advantage. The strategy of attempting to maximize the number of plate appearances in which one’s team has the platoon advantage has become de rigueur, and while its obsessive pursuit is often associated with Tony La Russa, it was clearly known not only to John McGraw, the legendary New York Giants manager, but in fact to the earliest professional players. Bill James notes the presence of switch-hitters in 1871 (the first season of what is now considered professional baseball) as proof.
60
In general, hitters have the platoon advantage about 54 percent of the time, but managers can increase this percentage by matching effective platoon partners, using pinch-hitters off their bench, and having lots of switch-hitters at their disposal. Since 1995, the Mets have the distinction of fielding teams that enjoyed the platoon advantage most often (71 percent of the time in 2008, thanks to switch-hitters including Jose Reyes and Carlos Beltran) and least often (only 35 percent of the time in 2000, when they went to the World Series). Not only does the frequency of having the platoon advantage vary from team to team, but the size of the effect attributable to each player varies as well. A few players, like Jim Thome and Ryan Howard, put up Hall of Fame numbers against opposite-handed pitchers (1.058 and 1.020 OPS, respectively), but would struggle to start at their position against same-handed pitchers (.777 and .749 OPS, respectively). Conversely, a few players (Alex Rodriguez and Matt Holliday, to name two) have slightly better career numbers against same-handed pitchers. In this respect, sabermetrics has merely clarified, rather than turned on its head, the conventional wisdom about the advantages of platooning.
One of the more controversial questions addressed by sabermetrics is about clutch hitting. The conventional wisdom has long held that certain players are “clutch” (such as Derek Jeter). The presence of clutchness imbues certain players with the ability to raise their game at the most critical juncture, often realized through a timely hit, a magnificent defensive play, or a biting third strike. A related (and often conflated) notion is that of the “hot hand,” or streakiness. This is the suggestion that a player who is performing especially well is more likely to continue to play especially well. In some sense this is a weak converse to the notion of regression to the mean.
The problem is that statisticians have long doubted the presence, or at least the importance, of clutchness and streakiness. In the framework that we have developed in this chapter, a major flaw in the notions of clutchness and streakiness is that they do not appear to be persistent skills. That is, measures of clutchness and streakiness have very low reliability, suggesting that they capture more randomness than skill. In fact, the reliability of such metrics has been so low that sabermetricians began to question whether the effect was real at all. Bill James points to a seminal study conducted by Dick Cramer in 1977 that in the author’s mind “established clearly that clutch hitting cannot be an important or a general phenomenon.”
61
In an example from basketball, psychologists from Cornell and Stanford analyzed the shooting patterns of members of the Philadelphia 76ers in 1985 and found “no evidence for a positive correlation between the outcomes of successive shots.”
62
Given the general interest in the subject, many researchers have tried and failed to identify incontrovertible proof that clutch hitting and/or streakiness exist.
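One common way to quantify reliability is the correlation between a metric measured on the same players in two separate samples, such as consecutive seasons. The sketch below is a simulation under stated assumptions, not an analysis of real players: each simulated player has a fixed true skill, each season adds fresh noise, and the spreads chosen for skill and noise are hypothetical. A metric dominated by noise shows almost no season-to-season correlation, which is the pattern described above for clutch measures.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def simulate_two_seasons(n_players, skill_sd, noise_sd, rng):
    """Each season's observed value is a fixed true skill plus fresh noise."""
    skills = [rng.gauss(0, skill_sd) for _ in range(n_players)]
    season1 = [s + rng.gauss(0, noise_sd) for s in skills]
    season2 = [s + rng.gauss(0, noise_sd) for s in skills]
    return season1, season2

rng = random.Random(0)
# Hypothetical settings: one metric dominated by skill, one dominated by noise.
skill_metric = simulate_two_seasons(300, skill_sd=1.0, noise_sd=0.5, rng=rng)
noise_metric = simulate_two_seasons(300, skill_sd=0.1, noise_sd=1.0, rng=rng)
r_skill = pearson(*skill_metric)
r_noise = pearson(*noise_metric)
print(f"skill-driven metric: r = {r_skill:.2f}")
print(f"noise-driven metric: r = {r_noise:.2f}")
```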
But while many adherents of sabermetric dogma interpret this absence of evidence as evidence of absence, James smartly expressed skepticism in an enormously influential article entitled “Underestimating the Fog,” shortly after the publication of Moneyball. James backtracks on several findings he had previously published and cautions his colleagues to embrace the limitations of their tools. Specifically, he warns against interpreting a failure to find evidence of an effect as evidence (or worse, proof) that said effect does not exist. From a statistician’s point of view, this is like interpreting a failure to reject the null hypothesis as evidence in confirmation of the null hypothesis, an elementary misinterpretation. Among the more notable pronouncements James makes is that he believes clutch hitting to be an “open question,” while “no one has made a compelling argument either in favor of or against the hot-hand phenomenon.”
63
We will leave the question of clutch hitting to the reader, but the question of streakiness has become an interesting episode in the history of sabermetrics. Since the world is generally far too complicated to model precisely on a computer (certainly this is true in baseball), statisticians create idealized models that behave nicely, and study those. A typical approach to addressing a question like streakiness goes like this:
1. Let’s suppose (even though we know it isn’t true) that all players behave like coins.
2. Let’s flip a whole bunch of coins.
3. If the results from reality (the real players) are grossly out of whack with what was generated by the coin flips, then we can claim that the idealized model does not accurately capture reality. This may be a finding, in that it implies that the idealized model, which does not include streakiness, is not sufficiently complex to generate the observations that we see in life. Conversely, if the idealized model (that doesn’t include streakiness) describes reality well, then we haven’t found evidence for streakiness (but we also can’t say that it doesn’t exist).
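The three steps above can be sketched for a single player. This is a minimal illustration, assuming the statistic of interest is the longest run of consecutive games with a hit; the game log is hypothetical, and the function names are ours rather than any published method's.

```python
import random

def longest_run(outcomes):
    """Length of the longest run of consecutive True values."""
    best = current = 0
    for hit in outcomes:
        current = current + 1 if hit else 0
        best = max(best, current)
    return best

def coin_flip_p_value(observed, n_sims=2000, seed=0):
    """Share of coin-flip seasons whose longest run matches or beats reality.

    The coin's bias is set to the player's overall success rate, so the
    simulated player has the same ability but, by construction, no streakiness.
    """
    rng = random.Random(seed)
    p = sum(observed) / len(observed)
    target = longest_run(observed)
    as_extreme = 0
    for _ in range(n_sims):
        season = [rng.random() < p for _ in range(len(observed))]
        if longest_run(season) >= target:
            as_extreme += 1
    return as_extreme / n_sims

# Hypothetical 162-game log: True means at least one hit in that game.
game_log = [True] * 30 + [False, True] * 60 + [False] * 12
print(coin_flip_p_value(game_log))
```

A share near zero says the coin model rarely produces a run as long as the observed one, which is step 3's "out of whack" case. A large share only means the coin model was not contradicted.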
Using this general methodology, researchers at Cornell found that although Joe DiMaggio was an unlikely candidate to have a fifty-six-game hitting streak, there was nearly a 50 percent chance that someone would have had a hitting streak of at least that length over the course of baseball history.
64
Implicit in this methodology is the assumption that a player’s likelihood of getting a hit is the same for each game. This is obviously not true in reality, but the results of the simulation suggest that the reality we have observed is a fairly likely outcome even if you do make this assumption.
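Under the constant-probability assumption, the chance of a long streak can also be computed exactly rather than simulated. The sketch below is not the Cornell researchers' code and does not use their inputs: the per-game hit probability, the season length, and the number of player-seasons are hypothetical values chosen only to show how a streak that is very unlikely for any one player can become plausible for someone.

```python
def prob_streak_at_least(n_games, streak_len, p_hit):
    """P(at least one run of `streak_len` straight hit games in `n_games`).

    Assumes every game is an independent trial with the same probability
    `p_hit` of containing a hit, which is the simplification the text flags.
    """
    # state[j] = probability of currently being on a j-game streak
    # without ever having reached streak_len.
    state = [1.0] + [0.0] * (streak_len - 1)
    reached = 0.0
    for _ in range(n_games):
        new_state = [0.0] * streak_len
        new_state[0] = sum(state) * (1 - p_hit)   # a hitless game resets the streak
        for j in range(streak_len - 1):
            new_state[j + 1] = state[j] * p_hit   # the streak grows by one game
        reached += state[streak_len - 1] * p_hit  # the streak reaches the target
        state = new_state
    return reached

# Hypothetical inputs, chosen only to illustrate the calculation.
P_HIT_GAME = 0.80        # chance the hitter gets a hit in any given game
N_GAMES = 150            # games played in one season
N_PLAYER_SEASONS = 5000  # comparable player-seasons across history

one_player = prob_streak_at_least(N_GAMES, 56, P_HIT_GAME)
# Treating player-seasons as independent, the chance that someone does it:
anyone = 1 - (1 - one_player) ** N_PLAYER_SEASONS
print(f"one player-season: {one_player:.6f}; anyone: {anyone:.3f}")
```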
In 2011, Trent McCotter pondered a more nuanced question: what happens if you switch the order of the games? In each game, each hitter either got a hit or didn’t, but the games occurred in a certain order (i.e., chronological order). If streakiness exists, then you would expect to see more long streaks of games in which hitters got hits in reality than you would if you randomly reordered all of the games.
65
And McCotter found that in fact there were more such streaks, to a statistically significant degree.
66
McCotter interpreted his discovery as strong evidence that streakiness exists, and while there is still some debate about that interpretation,
67
it is clear that McCotter’s work shows that the order of the games matters. This implies that players do go through longer phases in which they are more or less successful than randomness alone would predict. This finding would not have been possible even a few years prior, before Retrosheet provided the data and fast computers could perform the computations. Moreover, it is notable that the existence of streakiness has gone from self-evident conventional wisdom, to sabermetric anathema, back to some gray area where sabermetricians are grudgingly acknowledging its likely existence.
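The reordering idea is a permutation test, and a single-player version can be sketched as follows. This is written in the spirit of McCotter's question rather than as a reproduction of his analysis, which looked across many hitters and real Retrosheet game logs; the game log, the ten-game threshold, and the function names here are hypothetical.

```python
import random

def count_streaks(game_log, min_len):
    """Number of maximal hit streaks lasting at least `min_len` games."""
    count = run = 0
    for hit in list(game_log) + [False]:  # sentinel closes a trailing streak
        if hit:
            run += 1
        else:
            if run >= min_len:
                count += 1
            run = 0
    return count

def permutation_p_value(game_log, min_len, n_shuffles=2000, seed=0):
    """Share of random reorderings with at least as many long streaks as reality."""
    rng = random.Random(seed)
    observed = count_streaks(game_log, min_len)
    shuffled = list(game_log)
    as_extreme = 0
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)  # same games, random order
        if count_streaks(shuffled, min_len) >= observed:
            as_extreme += 1
    return as_extreme / n_shuffles

# Hypothetical, deliberately streaky game log (True = hit in that game).
streaky_log = ([True] * 12 + [False] * 6) * 9
print(permutation_p_value(streaky_log, min_len=10))
```

Because shuffling keeps the number of hit games fixed and changes only their order, a small share indicates that the chronological order itself carries information, which is the sense in which the order of the games matters.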