The Sabermetric Revolution: Assessing the Growth of Analytics in Baseball
Authors: Benjamin Baumer, Andrew Zimbalist
The Retrosheet data is voluminous (on the order of several gigabytes), but it contains detailed play-by-play data going back to the 1940s. Working with Retrosheet data is more cumbersome, since it requires manipulating customized text-processing tools, but the scope of questions that can be addressed is virtually limitless. For example, the LahmanDB provides no information about individual games, batter-versus-pitcher matchups, or situational statistics. But a well-oiled Retrosheet database can make quick work of incredibly specific questions (e.g., did Mickey Mantle hit better in the fifth inning of home games with a runner on third and less than two outs on even or odd numbered days?).
27
Thanks to these data sources and online venues like Baseball Prospectus, work that would have taken James hours of painstaking calculation and self-publishing could be performed by a college student in a matter of minutes on a laptop in his dorm room, and sent around the world overnight. Accordingly, the pace of sabermetric activity and the potential for its assimilation swiftly increased.
When James joined the front office of the Red Sox in 2003, it presaged a flood of talent from the sabermetric community to the baseball industry. Sabermetrically inclined executives like Theo Epstein and Paul DePodesta were now running major league clubs, and they needed assistants who could perform cutting-edge analysis for them. As the most prominent venue for the exchange of sabermetric ideas, Baseball Prospectus became a clearinghouse for young sabermetric talent. BP alum Keith Law was already working for the Toronto Blue Jays when Moneyball was published, but in the years that followed James Click joined the Rays, Keith Woolner joined the Indians, Dan Fox joined the Pirates, and more recently, Mike Fast and former BP managing partner Kevin Goldstein left for the Astros. Mercifully, there was no shortage of talent being funneled to BP itself. Students who had been exposed to sabermetrics in college could parlay a well-written article or two for BP into a highly sought-after internship with a major league club. The jobs did not pay well, but they sometimes led to full-time jobs (most of which also didn’t pay well). The demand for sabermetric content online was so great that numerous sites sprang up around BP: FanGraphs, Hardball Times, Baseball Analysts, and many other websites provided articles with new thoughts, new metrics, and new venues for discussion every day. By the late 2000s, early BP members like Law and Will Carroll were respected journalists carrying cards of the Baseball Writers Association of America, and the voting privileges that go with them. In less than ten years, sabermetric bloggers had moved from the fringes of the Internet to the press box, and scratched out the line separating them from newspaper columnists.
But while the blogosphere has played a catalytic role in expanding the reach of sabermetrics, its stewardship of sabermetric theory has been more complicated. The decentralized nature of online communication has produced thousands of articles, with a relatively democratic vetting process. With the open data sources mentioned above and the computational power of just about any decent computer, barriers to enter the field are low. Moreover, few sabermetric articles require any understanding of statistics beyond what you would likely be exposed to in an introductory statistics class—mainly because most sabermetric practitioners have not had any statistical training beyond this level. This has led to very creative approaches to difficult questions. However, one downside to this decentralization has been an outpouring of acronyms that do little other than confuse even habitual sabermetric readers. For example, we have chosen to use BABIP as shorthand for “batting average on balls in play,” but nearly as many others (including Vörös McCracken) continue to refer to this exact same concept as “hits per balls in play” and use the abbreviation HPBP. Further, while it is undoubtedly true that the avoidance of statistical terminology has made sabermetric research accessible to many who would not otherwise follow, it has hindered the development of a coherent body of sabermetric theory. As a result, modeling decisions that are justified by theory are presented as if they were made ad hoc.
More important, the foundation of scientific research is reproducibility. And because much of the sabermetric work published online is derived from open data sources, it is particularly well suited to exact reproduction. While it is true that scientific terminology presents a barrier to many, its purpose is to streamline the intake of new ideas by those who wish to verify the details. Sabermetrics literature sometimes seems caught between two audiences: a lay audience that wants to know the results of a study without the details; and a technical audience that wants to verify the details. Research in many fields is presented in both venues simultaneously, where a technical paper is sent to a peer-reviewed journal, but a more accessible paper is released to the public. The Journal of Quantitative Analysis in Sports has helped in this respect, but in our view the gap between the online sabermetric community and the academic sabermetric community is too large. There are too many good ideas floating around in each venue that are not being translated to the other.
The relationship between statistics and baseball is long and storied, but the adoption of technology within baseball continues to be fraught with roadblocks. Discussion about expanding the use of instant replay during games continues to be a major topic at the owners’ meetings each fall, and while some progress has been made, the subject continues to be divisive. Behind the scenes, many front offices are still struggling to effectively integrate technology into their baseball operations staff. Still, all scouts are now equipped with smartphones in addition to radar guns, and some use tablets instead of laptops for entering their reports. But a more interesting evolution is taking place within team offices, where a new kind of employee is setting up shop. In scouting and player development, the same kinds of people as before are getting the new jobs (former players mostly)—but they may be expected to use new tools. Conversely, in many front offices entry-level positions are being filled by young staffers who would likely have had no claim to a front office job as recently as the 1990s—they would have been more likely to be found in an information technology department, or on Wall Street.
With an eye undoubtedly honed through his own background in finance, Lewis sees DePodesta as an early embodiment of this trend: “Everywhere one turned in competitive markets, technology was offering the people who understood it an edge. What was happening to capitalism should have happened to baseball: the technical man with his analytical magic should have risen to prominence in baseball management, just as he was rising to prominence on, say, Wall Street.”
28
Consider how Lewis’s words have been put into action since Moneyball. John Henry bought the Red Sox in 2002 with money he made building quantitative financial models, and after whiffing on Billy Beane himself, gave the reins to a twenty-eight-year-old Theo Epstein, and hired James to be a key advisor. When Stuart Sternberg assumed control of the Rays in 2005, he installed Andrew Friedman—a former Bear Stearns analyst with no previous experience in baseball—as his GM.
29
To say that those two hires were successful is an understatement. Between them they produced two World Series rings, another appearance, and a Sporting News Executive of the Year Award. Indeed, Lewis’s “technical man” rose to prominence in baseball more or less immediately after he wrote the words in the passage above.
In the years that have followed, more and more talent that might otherwise have been headed to Mountain View or Wall Street has flowed to Yawkey Way. Ironically, sabermetrically inclined executives like Beane and Indians GM Chris Antonetti already see themselves as future casualties of this migration. In Beane’s words: “The people who are coming into the game, the creativity, the intelligence—it’s unparalleled right now. In ten years if I applied for this job I wouldn’t even get an interview.”
30
Antonetti, who lacks professional playing experience, is also astounded by the quality of the resumes he sees today, and attributes the influx of talent directly to Moneyball.
31
There will always be a place in the front office for intelligent former players, but there are a few more places now for everyone else.
Moreover, the skills necessary to perform sabermetric analysis have changed. Whereas many of those associated with sabermetrics at the higher levels had degrees in the social sciences and used a spreadsheet as their weapon of choice, today’s entry-level sabermetricians are more likely to have a degree in applied mathematics or computer science and come writing their own code. Again, to illustrate the difference in the order of magnitude of the problems each can solve, until recently Microsoft Excel was limited to 65,536 (i.e., 2^16) rows of data, and even the current version is limited to a little more than 1 million rows. In contrast, commonly used database management systems like MySQL can easily store many millions of rows. This distinction has become relevant to baseball in just the past few years, as the deluge of play-by-play and even pitch-by-pitch data has overwhelmed clubs.
32
Thankfully, baseball’s data onslaught is merely a trickle compared to the flood faced by large Internet companies like Google, Facebook, and Amazon. The technological ecosystem that supports these companies is a powerful library from which baseball teams can borrow, often for free.
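The spreadsheet-versus-database comparison above can be made concrete with a small sketch. It uses Python's built-in sqlite3 module as a lightweight stand-in for a database system such as MySQL, and the table name `pitches`, the column `pitch_id`, and the helper functions are hypothetical names chosen only for illustration. The two constants encode the worksheet row limits mentioned in the text (2^16, and 2^20 for the "little more than 1 million" figure); the rows themselves are empty placeholders, not real pitch data.

```python
import sqlite3

# Worksheet row limits discussed in the text: 2**16 for older Excel
# versions, 2**20 ("a little more than 1 million") for newer ones.
OLD_EXCEL_ROW_LIMIT = 2 ** 16
NEW_EXCEL_ROW_LIMIT = 2 ** 20


def build_pitch_table(n_rows: int) -> sqlite3.Connection:
    """Create an in-memory table holding n_rows placeholder rows.

    The schema is hypothetical; a real pitch-by-pitch table would carry
    many more columns (location, velocity, pitch type, and so on).
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE pitches (pitch_id INTEGER PRIMARY KEY)")
    conn.executemany(
        "INSERT INTO pitches (pitch_id) VALUES (?)",
        ((i,) for i in range(1, n_rows + 1)),
    )
    conn.commit()
    return conn


def count_rows(conn: sqlite3.Connection) -> int:
    """Return the number of rows currently stored in the pitches table."""
    return conn.execute("SELECT COUNT(*) FROM pitches").fetchone()[0]


if __name__ == "__main__":
    # One row past the newer worksheet limit, stored without special handling.
    conn = build_pitch_table(NEW_EXCEL_ROW_LIMIT + 1)
    print(count_rows(conn) > NEW_EXCEL_ROW_LIMIT)
```

Even this minimal database accepts one row more than a single worksheet can hold; a production system would differ mainly in schema, indexing, and storage on disk rather than in the basic idea.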
The buzzword for the enormous data processed by companies like Google is “big data,” while the field devoted to studying modern data analysis techniques is known as “data science.” The former is distinguished by data streams that are three orders of magnitude larger than what major league clubs are currently storing (two if you include video). However, the introduction of FIELDf/x data has the potential to shrink that gap to two (one with video). Nevertheless, many baseball operations departments are investing in computer hardware to build data centers that rival those of the entire organization’s IT department. Others, recognizing their limited ability to keep up with the pace of technology, are paying six figures per season to outsource the job of warehousing, analyzing, and displaying their data to Bloomberg Sports, an offshoot of the financial giant.
33
Conversely, sabermetrics is very much a popular embodiment of data science. Data science is distinguished from statistics through its heavier emphasis on data, its reliance upon programming techniques common in computer science but rare in classical statistics, and the importance of incorporating extensive “domain knowledge” (specific knowledge of the subject being studied) into the analysis. Sabermetricians practice this exactly: they frequently work with large, sometimes messy databases that require customized code to manage, and their knowledge of baseball is integral to formulating a model that is applicable to the problem of interest. This happy coincidence is exactly what enables a successful practitioner of sabermetrics like Nate Silver to seamlessly migrate his data science skills to other application areas. Moreover, it provides an accessible venue for higher-education courses that equip graduates with skills applicable to a variety of fields. Accordingly, classes in sabermetrics have sprung up at Williams College (alma mater of former Mets and Orioles GM Jim Duquette), Tufts University, and Bowling Green State University, among others.
At its core, Moneyball was about a market inefficiency, and as we will show in Chapter 7, the market inefficiency the A’s exploited in the early 2000s is now likely closed. We showed above that more front-office talent is now devoted to analytics than ever before, but we will present evidence in Chapter 7 that suggests that sabermetrics is working, and thus the $60,000 statistical analyst more than makes up for his own salary in cost savings to the club. But this analyst is limited by the quality of his resources, be they in the form of data, hardware, storage, or software. This observation was made years ago by a whole host of third-party vendors who provide a variety of services in a desperate attempt to get a piece of the major league pie. Every year at the winter meetings, a massive trade show is held where hundreds of such companies vie for the attention of baseball front offices. Are these companies part of a symbiotic ecosystem surrounding analytics? Or are they capitalist parasites depleting the limited resources of MLB clubs?
The most precious resource to a statistical analyst is data, and a variety of companies have been licensing data to major league clubs for decades. From the ashes of Project Scoresheet, a group initiated by Bill James, two parallel data organizations arose: the aforementioned Retrosheet, a nonprofit that provides its data for free; and STATS LLC (better known as STATS, Inc.), a private company founded by John Dewan, a frequent business associate of James’s. While Retrosheet continued in the grassroots tradition of Project Scoresheet, cobbling play-by-play data together from newspaper accounts and donated scoresheets, STATS became a multimillion-dollar global sports information provider, partnering with industry giants like ESPN and the NBA.
34
Meanwhile, Dave Smith runs Retrosheet for $10 per month in web hosting fees.
35
After Dewan sold his interest in STATS in 1999, he got back into the baseball data game with Baseball Info Solutions, which has been competing with STATS in the proprietary data market since the early to mid-2000s. The data provided by these private companies was superior to the data provided by Retrosheet, primarily because it could be uploaded daily and included information about the location, velocity, and pitch type of nearly every pitch thrown.
36
A full season feed cost a major league club tens of thousands of dollars, and as the number of teams licensing the data increased, the clubs got wiser. By 2006, Major League Baseball Advanced Media, a limited partnership of the club owners, was providing analogous data, including PITCHf/x data collected by Sportvision, to the major league clubs at no additional charge.
37
While there were undoubtedly teams that still had little interest in this type of data, this turn of events suggests that enough of the major league clubs were interested so as to forgo an expensive proprietary advantage (licensing the data from a third party) to achieve an economy of scale.