The Sabermetric Revolution: Assessing the Growth of Analytics in Baseball


With MLBAM cannibalizing play-by-play and pitch-by-pitch data revenue, the third-party providers have invented new products. Sportvision licenses its PITCHf/x data to MLBAM to disseminate to the clubs, but maintains separate relationships for other sources of data, such as HITf/x, which tracks the muzzle velocity and trajectory of batted balls coming off the bat, and COMMANDf/x, which measures the magnitude and direction of how far a pitcher missed his spot (i.e., how far the catcher had to move his glove). Trackman uses radar technology (in contrast to Sportvision’s camera technology) to measure the spin rate and flight time of pitches. TruMedia provides not data, but a user interface to data, via an interactive web application useful for advance scouting. Since many clubs had already allocated money for baseball analytics, the economy of scale achieved through MLBAM simply allowed them to spend their money on other products. What is unclear at this point is how much meaningful information the clubs are getting from these third-party products. By 2013, DePodesta began to question not the importance of data analysis, but the value of the data onslaught, writing, “more data is not always better data. What we are seeking is relevant data.”³⁸

Conclusion

In this chapter we have demonstrated that front offices have many more people now working on baseball analytics than ever before. In particular, more than half of the thirty clubs have more than one person who is primarily working on analytics, and just four clubs appear to have little or no analytical presence. At the same time, more and better data is flowing to clubs, which are spending money either to upgrade their technological infrastructure or to outsource that job to a third-party company. But whereas at the time of Moneyball the A’s had just a few employees who were craving more and better data so that they could figure out what they wanted to know, the challenge in today’s front offices is to find enough employees who are capable of extracting meaningful information from what is quickly becoming a torrent of data. The greatest unknown, and perhaps the source of the next market inefficiency, is how clubs will meet that challenge.

3

An Overview of Current Sabermetric Thought I
Offense

In the next two chapters we will present an overview of the current state of baseball analytics, while making careful attempts to compare the current results to those that were mentioned in Moneyball. Our emphasis is on exposition, in that we will attempt to explain and justify the basics of sabermetric theory to the reader. Although much lies beyond the scope of what we can accomplish here, a thorough reading should give the interested reader a firm grasp of how sabermetricians think about the game, and demystify some important results that are mentioned in passing in both Moneyball and the popular media.

Why Do Teams Win Games?

For those new to sabermetrics, one of the most eye-opening passages in Moneyball, the book, begins with Paul DePodesta “reducing the coming six months to a math problem.”¹ To accomplish this, DePodesta estimates four quantities:

1. the number of wins likely necessary to make the playoffs (about 95);

2. the number of runs by which the A’s need to outscore their opponents over the course of the season in order to win that many games (about 135);

3. the number of runs that the A’s, as currently constituted, are likely to score (810 ± 10); and,

4. the number of runs that the A’s, as currently constituted, are likely to allow (660 ± 10).

The answers to the last two questions allow DePodesta to determine whether the A’s will reach the threshold in the first question. Lewis mentions parenthetically the missing piece of the equation: a strong relationship between the number of runs that a team scores and allows over the course of the season, and the number of games that they win. This may seem obvious, but keep in mind that we are only talking about the cumulative number of runs scored and allowed over the course of a season, with no information about the distribution of how those runs are scored in any particular game.
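The arithmetic behind DePodesta's four estimates can be sketched in a few lines. The figures come from the list above; the function name and the way the two ± 10 margins are combined are illustrative assumptions, not something the book specifies.

```python
def run_differential_range(scored, allowed, margin):
    """Return (low, central, high) run differential, treating the
    +/- margin on each estimate as a simple worst/best case
    (an assumption made for illustration)."""
    low = (scored - margin) - (allowed + margin)
    central = scored - allowed
    high = (scored + margin) - (allowed - margin)
    return low, central, high


# Figures quoted in the text: about 810 runs scored, 660 allowed, each +/- 10,
# against a target differential of about 135 runs.
TARGET_DIFFERENTIAL = 135
low, central, high = run_differential_range(810, 660, 10)
print(f"projected differential: {central} (range {low} to {high})")
print(f"central estimate clears the target: {central >= TARGET_DIFFERENTIAL}")
print(f"worst case clears the target: {low >= TARGET_DIFFERENTIAL}")
```

Under this reading the central estimate of 150 clears the 135-run threshold, while the pessimistic end of the range (130) falls just short of it.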

The relationship to which Lewis alludes is known, somewhat misleadingly, as the Pythagorean Expectation, and it is one of Bill James’s more enduring contributions to the field of sabermetrics. James created a simple but nonlinear statistical model that relates runs scored (RS) and runs allowed (RA) to a team’s expected winning percentage (WPCT):

$$\mathrm{WPCT} = \frac{RS^2}{RS^2 + RA^2}$$

James described his formula as Pythagorean because the sum of squared terms reminded him of the Pythagorean Theorem ($a^2 + b^2 = c^2$, where $a$ and $b$ are the lengths of the shorter sides of a right triangle, and $c$ is the length of the hypotenuse). But this similarity was largely a coincidence. While James undoubtedly used the exponent of 2 (the solid line in Figure 1) for convenience and simplicity, later sabermetricians sought a more precise, less arbitrary value, and found that as the game has changed over the years, the value of the exponent that best fits the data has changed with it. For clarity, we show (dashed line) that the exponent that best fits the data from all team-seasons since 1954 is about 1.85.²

Figure 1. Winning Percentage Versus Run Ratio, 1954–2011

Each dot represents one team in one season, and all teams from 1954–2011 are represented (the dots are partially transparent, so a darker cluster indicates that more dots are present). The solid line shows James’s model for expected winning percentage as a function of a team’s run ratio (with an exponent of 2). The dashed line shows the best fit model (with an exponent of about 1.85).
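As a rough illustration of the two curves in Figure 1, the model can be written as a one-line function with the exponent as a parameter. The function name is an assumption made here; the inputs are DePodesta's estimates for the A's quoted earlier in the chapter.

```python
def pythagorean_wpct(runs_scored, runs_allowed, exponent=2.0):
    """Expected winning percentage from season run totals.

    exponent=2.0 is James's original choice; about 1.85 is the
    best-fit value reported in the text for 1954-2011.
    """
    rs = runs_scored ** exponent
    ra = runs_allowed ** exponent
    return rs / (rs + ra)


# DePodesta's estimates for the A's: about 810 runs scored, 660 allowed.
for exponent in (2.0, 1.85):
    wpct = pythagorean_wpct(810, 660, exponent)
    print(f"exponent {exponent}: WPCT {wpct:.3f}, "
          f"about {162 * wpct:.0f} wins over 162 games")
```

With these inputs the two exponents give roughly 97 and 96 expected wins, which shows how little the choice of exponent matters for a team with a typical run ratio.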

It is worth reiterating that James’s formula defines an expected winning percentage, based on the known ratio of runs scored to runs allowed. Since the formula seems to work so well in practice, it is commonly used to estimate a team’s projected finish midway through the season, given its current run ratio. For example, it is not uncommon for a baseball team to be several games over .500 at the All-Star break, but to have been outscored on the season. If we assume that this team will continue to score and allow runs at the same rates, then by James’s formula, the expected winning percentage for that team in the second half would be under .500. Thus, the team’s expected final win total would include the number of games they had actually won, plus the expectation that they would win less than half of their remaining games.
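The mid-season projection described above can be sketched as follows. The team record and run totals are hypothetical numbers chosen for illustration, and the function name is likewise an assumption.

```python
def projected_final_wins(wins, losses, runs_scored, runs_allowed,
                         season_games=162, exponent=2.0):
    """Actual wins so far plus expected wins in the remaining games,
    assuming the team keeps scoring and allowing runs at the same rates."""
    rs = runs_scored ** exponent
    ra = runs_allowed ** exponent
    expected_wpct = rs / (rs + ra)
    games_left = season_games - (wins + losses)
    return wins + games_left * expected_wpct


# Hypothetical team: 45-40 at the break, but outscored 380 to 350.
final_wins = projected_final_wins(45, 40, 350, 380)
print(f"projected final wins: {final_wins:.1f}")
```

This hypothetical club is five games over .500 at the break, yet its run ratio implies a second-half winning percentage of about .459, so the projection lands just above 80 wins rather than the 86 its record alone would suggest.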

Naturally, deviations from this expected winning percentage are the subject of some debate. The standard deviation of the gap between expected and actual wins is about four games, and it is rare for teams to underperform or overperform their expected win total by more than ten games. When that happens, is it pure luck? Is it the team’s performance in one-run games? Is it the presence of a spectacular bullpen or closer? Is it clutch hitting? Theories abound, but compelling explanations are elusive.

The notion of expected winning percentage has caught on in other sports, each having a different exponent. In basketball, the exponent is much higher (somewhere between 14 and 17), while in football, it is about 2.4.³ Nevertheless, an analytic explanation of why James’s model was so successful eluded researchers until 2005, when Steven Miller proved that James’s model, with an unknown exponent, could be derived by assuming that a team’s runs scored and runs allowed were independent, and each followed a well-known statistical distribution.⁴
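A small Monte Carlo check can make this kind of derivation concrete. The sketch below assumes the distribution is the Weibull, which is the one used in Miller's paper but is not named in the text above; the scale and shape values are arbitrary illustrative choices, and the paper's refinements are ignored. If runs scored and allowed are independent Weibull variables with a common shape parameter, the probability of outscoring the opponent takes the Pythagorean form, with the shape parameter playing the role of the exponent.

```python
import random


def simulated_win_fraction(scale_for, scale_against, shape,
                           games=200_000, seed=0):
    """Fraction of simulated games in which 'runs scored' exceeds
    'runs allowed', both drawn independently from Weibull distributions
    with the same shape (an assumption; see the lead-in)."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(games):
        scored = rng.weibullvariate(scale_for, shape)      # (scale, shape)
        allowed = rng.weibullvariate(scale_against, shape)
        wins += scored > allowed
    return wins / games


def pythagorean_form(scale_for, scale_against, shape):
    """Closed-form win probability implied by the Weibull assumption."""
    a = scale_for ** shape
    b = scale_against ** shape
    return a / (a + b)
```

Calling both functions with, say, scales 5.0 and 4.4 and shape 1.8 should give values that agree to within sampling error. Because the mean of a Weibull variable is proportional to its scale when the shape is fixed, the ratio of scales equals the ratio of average runs scored to average runs allowed.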

The fact that James’s expected winning percentage hewed so closely to a team’s actual winning percentage over the course of the season gave the A’s confidence that they could accurately predict the team’s likely finish once they had a good enough estimate of the strength of their offense relative to their defense.⁵ We proceed with a discussion of how current sabermetric thinking may have led to DePodesta’s estimates of those two quantities.

Offense

Clearly, the only quantity that really matters when evaluating a team’s offense is the number of runs that they score. How they score those runs is a matter of taste, and the success of James’s model for expected winning percentage over the course of the season might even diminish the importance of the distribution of when they are scored. Offense in baseball can be divided broadly into two skills: hitting and baserunning.

Baserunning

While baserunning is important, its value relative to hitting is small. Sabermetricians have estimated that most teams generally gain or lose at most 20 runs over the course of a season as a result of baserunning,⁶ while individual baserunners rarely add or subtract more than 10 runs from what their team would likely have scored if they ran more conservatively.⁷ Moreover, sabermetricians have suggested that even by creating a fantasy-style lineup of excellent baserunners, the upper limit on the value of baserunning is about ± 70 runs over the course of a season.⁸ The average team scores and allows about 700 runs over a 162-game season, so the contribution of baserunning toward a team’s offense is almost certainly less than 10 percent in practice.

Although some may view these results as dubious,⁹ their credibility is aided by the fact that researchers have employed two entirely different methodologies and achieved corroborating results. The first empirical approach is to sum the changes in the expected run matrix (see the Appendix for an illustration of how the expected run matrix works). That is, if there is a runner on first with one out, and if we ignore many particulars of the specific situation (who the pitcher and batter are, etc.), we can derive an estimate of how many runs are likely to be scored in the remainder of the inning. This estimate is likely to be about 0.5 runs. If the batter hits a single, then the runner necessarily advances to second base, and the number of expected runs increases to about 0.9 runs. However, if the baserunner advances all the way to third base on that single, then an additional 0.3 runs can be added to the expected run value. The empirical method for evaluating baserunning credits the runner with that 0.3 runs for each time he goes from first to third on a single. The reader can imagine any number of other scenarios in which a baserunner could be credited with taking the extra base. The fact that the myriad assumptions made are not actually true in each case is mitigated by the fact that we are primarily interested in how baserunners perform over entire seasons, giving us a sample large enough that many of those assumptions will be reasonable in an aggregate sense.
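The bookkeeping in this first approach can be sketched with a tiny lookup table. The three values are the approximate one-out figures quoted in the paragraph above (0.5, 0.9, and 0.9 + 0.3); the state labels, function names, and the count of fifteen extra bases are hypothetical.

```python
# Approximate expected runs for the rest of the inning with one out,
# using only the rounded figures quoted in the text.
EXPECTED_RUNS_ONE_OUT = {
    "runner on first": 0.5,
    "first and second": 0.9,
    "first and third": 1.2,   # 0.9 plus the additional 0.3 from the text
}


def extra_base_credit(actual_state, baseline_state):
    """Runs credited to the runner: expected runs in the state he actually
    reached minus expected runs had he taken only the required base."""
    return (EXPECTED_RUNS_ONE_OUT[actual_state]
            - EXPECTED_RUNS_ONE_OUT[baseline_state])


def season_credit(times_taken, actual_state, baseline_state):
    """Sum the same credit over every time the extra base was taken."""
    return times_taken * extra_base_credit(actual_state, baseline_state)


# Going first to third on a single with one out, instead of stopping at second:
per_play = extra_base_credit("first and third", "first and second")
print(f"credit per extra base: {per_play:.1f} runs")
# Hypothetical runner who does this 15 times in a season:
total = season_credit(15, "first and third", "first and second")
print(f"season credit: {total:.1f} runs")
```

A full implementation would use the complete matrix of base and out states from the Appendix and would also debit the runner when he is thrown out trying to advance.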

The second approach, which corroborates the findings of the first, is to build a simulation engine, and seed it with estimates of the probabilities of how often baserunners take the extra base. For example, one of the best baserunners in recent memory is Chase Utley, who, in addition to stealing about twelve bases per year while being caught less than 10 percent of the time, advances from first to third on a single roughly 45 percent of the time, compared to the league average of about 26 percent. If we simulate, say, one million innings of Phillies baseball with Utley running like he does, we get a good estimate of the average number of runs scored per inning for that team. But we can run the same simulation with Utley running at league-average rates, and compare the change in run scoring to the previous figure. The results suggest that even Utley’s baserunning is worth no more than ten runs per season over that of an average runner.
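A toy version of such a simulation engine might look like the sketch below. It is far cruder than anything a researcher would use: every event probability is an assumed round number, there are no steals, errors, or double plays, and the first-to-third probability is applied to every runner rather than to one player in a real lineup. It only illustrates the mechanics of changing one baserunning parameter and comparing average runs per inning.

```python
import random

# Assumed per-plate-appearance probabilities (illustrative, not from the text).
P_OUT, P_WALK, P_SINGLE, P_DOUBLE, P_TRIPLE = 0.68, 0.09, 0.15, 0.045, 0.005
# The remaining 0.03 is the home run probability.
P_SCORE_FROM_SECOND = 0.6   # assumed: runner on second scores on a single
P_SCORE_ON_OUT = 0.3        # assumed: runner on third scores on a non-final out


def simulate_inning(p_first_to_third, rng):
    """Play one half-inning and return the runs scored."""
    first = second = third = False
    outs = runs = 0
    while outs < 3:
        r = rng.random()
        if r < P_OUT:
            outs += 1
            if outs < 3 and third and rng.random() < P_SCORE_ON_OUT:
                runs += 1
                third = False
        elif r < P_OUT + P_WALK:
            # Only forced runners advance on a walk.
            if first and second and third:
                runs += 1
            elif first and second:
                third = True
            elif first:
                second = True
            first = True
        elif r < P_OUT + P_WALK + P_SINGLE:
            runs += third
            third = False
            if second:
                if rng.random() < P_SCORE_FROM_SECOND:
                    runs += 1
                else:
                    third = True
                second = False
            if first:
                # The extra base is available only if third is unoccupied.
                if not third and rng.random() < p_first_to_third:
                    third = True
                else:
                    second = True
            first = True
        elif r < P_OUT + P_WALK + P_SINGLE + P_DOUBLE:
            runs += second + third
            first, second, third = False, True, first
        elif r < P_OUT + P_WALK + P_SINGLE + P_DOUBLE + P_TRIPLE:
            runs += first + second + third
            first, second, third = False, False, True
        else:
            runs += 1 + first + second + third
            first = second = third = False
    return runs


def average_runs_per_inning(p_first_to_third, innings=200_000, seed=0):
    """Average runs per inning over many simulated innings."""
    rng = random.Random(seed)
    total = sum(simulate_inning(p_first_to_third, rng) for _ in range(innings))
    return total / innings
```

Comparing `average_runs_per_inning(0.45)` with `average_runs_per_inning(0.26)` mirrors the Utley comparison in spirit. Because the difference between those two settings is small, a large number of innings is needed before it stands out from simulation noise, which is presumably why the text speaks of a million innings.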
