The Baseball Economist: The Real Game Exposed

The main advantage of regression analysis is not that we can generate correlations between two variables. The most useful aspect of this method to the social scientist is its ability to accommodate more than one explanatory factor. By including other important determinants of an explained variable, we can estimate the added, or marginal, impact that each explanatory variable has on the value of the explained variable. This impact is separate from, and in addition to, the impact of the other factors included in the analysis.
In the analysis of income and education, we implicitly assume that all other possible explanatory influences are random and cancel out, but do they? Obviously, there are more factors than years of education that determine the incomes of workers, such as natural intelligence (IQ), work ethic, field of study, location, and physical attractiveness—just to name a few. The problem is that some excluded characteristics may, in fact, be correlated with education. For example, we expect people with high intelligence to continue to get more education than those who are not so gifted. If this occurs, then the β estimate of education may actually be picking up a correlation between income and intelligence. Education might actually not be as important as natural ability, but because we did not include a measure of worker intelligence in the model, it will look like education is a greater determinant of income than it actually is.
This problem is known as omitted variable bias, and it occurs when non-random outside factors not included in the regression estimation are correlated with explanatory variables that are included in the regression estimation. This can create serious problems; therefore the empirical researcher must be very careful to include all relevant factors in the regression. OLS allows us to control for the influence of other factors by including many relevant factors in the model. For example, if we had an IQ test score of every worker in the sample, we could estimate the marginal impacts of both factors on income. Our equation would be: Income = α + β₁ Schooling + β₂ IQ. β₁ and β₂ are individual estimated magnitudes for each factor, and each factor’s weight takes into account the impact of the other factor.
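To make omitted variable bias concrete, here is a minimal sketch using simulated workers. Everything in it is an assumption made for illustration: the "true" weights (2.0 for schooling, 0.3 for IQ), the link between IQ and schooling, the sample size, and the helper names `solve` and `ols`. It fits income on schooling alone, then on schooling and IQ together, so the two schooling estimates can be compared.

```python
import random

def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        # Partial pivoting: bring the largest remaining entry into place.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def ols(columns, y):
    """Return [alpha, beta_1, ...] that minimize the sum of squared errors."""
    rows = [[1.0] + list(values) for values in zip(*columns)]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)

rng = random.Random(0)  # fixed seed so the sketch is repeatable
iq = [rng.gauss(100, 15) for _ in range(5000)]
# Assumption: workers with higher IQs tend to stay in school longer.
schooling = [12 + 0.1 * (q - 100) + rng.gauss(0, 1.5) for q in iq]
# Assumed "true" model, income in thousands of dollars.
income = [5 + 2.0 * s + 0.3 * q + rng.gauss(0, 5)
          for s, q in zip(schooling, iq)]

short = ols([schooling], income)      # IQ omitted
full = ols([schooling, iq], income)   # IQ included

print("schooling weight, IQ omitted: ", round(short[1], 2))
print("schooling weight, IQ included:", round(full[1], 2))
print("IQ weight:                    ", round(full[2], 2))
```

With IQ left out, the schooling estimate absorbs part of the IQ effect and overstates the assumed weight of 2.0. With IQ included, both estimates land close to the weights used to build the data.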
How does this work? It involves a complicated mathematical procedure (which, thankfully, computers can do in a matter of seconds) that examines how all of the explained and explanatory variables differ across the sample. But it’s easier to understand if you think about it like this: Let’s say we were able to group every worker in our sample by number of years of schooling. We could then look at the income of the members of each group (e.g., twelve years, thirteen years, etc.) as their IQs differed. Because the education level remains constant within each group, we can assign a weight to the importance of IQ without having the education level confuse our estimate. Similarly, we could group every worker in the sample by IQ. Then we could look at whether each additional year of schooling impacted income while holding IQ constant. OLS and other multiple regression analysis procedures are able to hold numerous factors constant and assign marginal impacts to each.
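The grouping idea can be sketched directly. The workers below are simulated, and the income weights (2.0 per year of schooling, 0.3 per IQ point), the 12 to 16 year range, and the helper name `slope` are all assumptions chosen for illustration. The sketch estimates the IQ weight separately inside each schooling group, where education is constant, and compares it with a naive estimate that ignores schooling.

```python
import random
from collections import defaultdict

def slope(x, y):
    """Slope of the one-variable OLS line of y on x."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    num = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    den = sum((a - mean_x) ** 2 for a in x)
    return num / den

rng = random.Random(1)  # fixed seed so the sketch is repeatable
workers = []
for _ in range(4000):
    iq = rng.gauss(100, 15)
    # Assumption: higher-IQ workers tend to complete more years (12 to 16).
    years = min(16, max(12, round(14 + (iq - 100) / 15 + rng.gauss(0, 1))))
    # Assumed "true" model, income in thousands of dollars.
    income = 5 + 2.0 * years + 0.3 * iq + rng.gauss(0, 5)
    workers.append((years, iq, income))

# Group workers by years of schooling, so education is constant in each group.
groups = defaultdict(list)
for years, iq, income in workers:
    groups[years].append((iq, income))

# Within each group, any link between IQ and income cannot be due to schooling.
within = {}
for years in sorted(groups):
    iqs, incomes = zip(*groups[years])
    within[years] = slope(iqs, incomes)
    print(years, "years of schooling: IQ weight", round(within[years], 2))

# Ignoring schooling lets IQ take credit for part of the schooling effect.
naive = slope([w[1] for w in workers], [w[2] for w in workers])
print("IQ weight ignoring schooling:", round(naive, 2))
```

The within-group weights should sit near the assumed 0.3, while the naive weight comes out larger. OLS does not literally form groups, but it arrives at the same kind of "holding constant" comparison.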
Another useful result from multiple regression analysis is that we can see how much the explanatory factors influence the explained factors. When OLS minimizes the sum of squared errors, it generates useful information that tells us how well the model “fits” the data. The smaller the errors in prediction, the more the difference in the explained variable is explained by the explanatory variables. The measure of fit generated by OLS is the R², or “R-squared.” The R² ranges from 0 to 1 and can be understood as a percentage. As the R² approaches 1, the fit of the model improves. For example, assume that our hypothetical model estimating income from schooling and IQ has an R² of 0.75. This means that 75 percent of the differences in incomes across these workers is explained by differences in education and IQ.
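As a small worked example, R² can be computed as one minus the ratio of the squared prediction errors to the squared differences from the average. The four incomes and predictions below are made up for illustration (they are not taken from any real sample), and the function name `r_squared` is likewise hypothetical.

```python
def r_squared(actual, predicted):
    """Share of the variation in `actual` accounted for by `predicted`."""
    mean = sum(actual) / len(actual)
    # Sum of squared prediction errors (what the model fails to explain).
    sse = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    # Sum of squared differences from the average (total variation).
    sst = sum((a - mean) ** 2 for a in actual)
    return 1 - sse / sst

# Hypothetical incomes (thousands of dollars) and model predictions.
incomes = [20.0, 40.0, 60.0, 80.0]
predictions = [35.0, 55.0, 55.0, 75.0]

print(r_squared(incomes, predictions))
```

Here the squared errors sum to 500 and the total variation is 2,000, so the predictions account for 75 percent of the differences in income. Predicting every income perfectly would give 1, and predicting the average for everyone would give 0.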
Multiple regression analysis allows the empirical researcher to control for, or hold constant, many possible influences on an observed outcome. This is valuable because it enables the researcher to isolate individual impacts among many concurrent determinants. The multiple regression analysis estimator OLS is just one of many possible estimators; however, it is the estimator most commonly employed by economists. But all of the alternative procedures share the ability to hold multiple factors constant.
This appendix is only an introduction to the subject. Using regression analysis requires understanding issues beyond what can be provided in this mini-primer.
APPENDIX B
Baseball Statistics Glossary
Batting Average (AVG): Number of hits divided by number of at-bats.
Batting Average on Balls in Play (BABIP): The number of hits less the number of home runs divided by the number of at-bats less the number of home runs less the number of strikeouts.
Earned Run Average (ERA): Earned runs allowed divided by innings pitched, multiplied by nine.
ERA+: A measure of ERA expressed in terms of the run production of the league and the home park in which the pitcher pitches. A league-average pitcher has an ERA+ of 100. A better (worse) pitcher has an ERA+ above (below) 100. The metric is available for all players at Baseball-Reference.com.
Fielding Independent Pitching (FIP): Thirteen times home runs, plus three times walks, minus two times strikeouts, all divided by innings pitched, plus 3.2. A metric that generates an ERA using only defense-independent statistics. Some versions include hit batters and use a different constant. Developed by Tom Tango, it is a simplified version of Voros McCracken’s DIPS ERA.
Home Runs per 9 Innings (HR9): Home runs divided by innings pitched, multiplied by nine.
Isolated Power (Iso-Power): Slugging percentage minus batting average.
Linear Weights (LWTS): (0.46 × Singles) + (0.8 × Doubles) + (1.02 × Triples) + (1.4 × Home Runs) + (0.33 × Walks) + (0.33 × Hit by Pitch) + (0.3 × Stolen Bases) − (0.6 × Caught Stealing) − (0.25 × (At-Bats − Hits)). A metric developed by John Thorn and Pete Palmer that weights each offensive event according to its run production value.
On-Base Percentage (OBP): The sum of hits, walks, and hit by pitches divided by the sum of at-bats, walks, hit by pitches, and sacrifice flies.
OPS: The sum of on-base percentage and slugging percentage. A metric developed by John Thorn and Pete Palmer that correlates well with Linear Weights but is much simpler to calculate.
OPS+: A measure of OPS expressed in terms of the run production of the league and the home park in which the hitter plays. A league-average hitter has an OPS+ of 100. A better (worse) hitter has an OPS+ above (below) 100. The metric is available for all players at Baseball-Reference.com.
PrOPS: PrOPS is an OPS predicted by how a player hits the ball. It includes a player’s drive percentage, groundball-to-flyball ratio, strikeout rate, walk rate, hit batter rate, and home run rate. A player with a PrOPS greater (less) than his OPS is likely to improve (decline). The author developed this metric in conjunction with The Hardball Times, and it is available on their website (hardballtimes.com).
Pythagorean Winning Percentage: The square of runs scored divided by the sum of the squares of runs scored and runs allowed. A hypothetical winning percentage based on the runs scored and allowed by a team. The original metric was developed by Bill James, and several slight modifications to the formula exist.
Slugging Percentage (SLG): Singles plus two times doubles, plus three times triples, plus four times home runs, divided by at-bats. A batting average that weights each hit equal to the number of bases the hitter advances.
Strikeouts per 9 Innings (K9): Strikeouts divided by innings pitched, multiplied by nine.
Strikeout-to-Walk Ratio (K/BB): Strikeouts divided by walks.
Walks per 9 Innings (BB9): Walks divided by innings pitched, multiplied by nine.
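For readers who want to compute these metrics themselves, the sketch below translates several of the glossary formulas into Python exactly as they are worded above (FIP, for instance, uses the 3.2 constant and no hit batters). The function names and the sample stat lines are hypothetical. Innings pitched are assumed to be ordinary decimal numbers, so 200⅓ innings would be entered as about 200.333, not in the box-score style of 200.1.

```python
def batting_average(h, ab):
    return h / ab

def babip(h, hr, ab, k):
    # Hits less home runs, over at-bats less home runs less strikeouts.
    return (h - hr) / (ab - hr - k)

def on_base_pct(h, bb, hbp, ab, sf):
    return (h + bb + hbp) / (ab + bb + hbp + sf)

def slugging_pct(singles, doubles, triples, hr, ab):
    return (singles + 2 * doubles + 3 * triples + 4 * hr) / ab

def per_nine(events, ip):
    # Shared by ERA, HR9, K9, and BB9: events per nine innings pitched.
    return events / ip * 9

def fip(hr, bb, k, ip):
    return (13 * hr + 3 * bb - 2 * k) / ip + 3.2

def linear_weights(singles, doubles, triples, hr, bb, hbp, sb, cs, ab, h):
    return (0.46 * singles + 0.8 * doubles + 1.02 * triples + 1.4 * hr
            + 0.33 * bb + 0.33 * hbp + 0.3 * sb - 0.6 * cs
            - 0.25 * (ab - h))

def pythagorean_pct(rs, ra):
    return rs ** 2 / (rs ** 2 + ra ** 2)

# Hypothetical hitter: 500 at-bats, 150 hits (95 singles, 30 doubles,
# 5 triples, 20 home runs), 60 walks, 5 hit by pitch, 5 sacrifice flies,
# 100 strikeouts, 10 stolen bases, 5 caught stealing.
avg = batting_average(150, 500)
obp = on_base_pct(150, 60, 5, 500, 5)
slg = slugging_pct(95, 30, 5, 20, 500)
ops = obp + slg
iso = slg - avg
lwts = linear_weights(95, 30, 5, 20, 60, 5, 10, 5, 500, 150)

# Hypothetical pitcher: 200 innings, 80 earned runs, 20 home runs,
# 50 walks, 180 strikeouts.
era = per_nine(80, 200)
k9 = per_nine(180, 200)
pitcher_fip = fip(20, 50, 180, 200)

print(f"AVG {avg:.3f}  OBP {obp:.3f}  SLG {slg:.3f}  OPS {ops:.3f}")
print(f"Iso-Power {iso:.3f}  LWTS {lwts:.2f}")
print(f"ERA {era:.2f}  K9 {k9:.2f}  FIP {pitcher_fip:.2f}")
print(f"Pythagorean {pythagorean_pct(800, 700):.3f}")
```

Because ERA, HR9, K9, and BB9 share the same "per nine innings" form, one helper covers all four.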
APPENDIX C
Useful Websites
Statistics
The Baseball Cube: TheBaseballCube.com
Baseball-Reference: Baseball-Reference.com
First Inning: FirstInning.com
The Hardball Times: HardballTimes.com
The Lahman Baseball Database: Baseball1.com
Retrosheet: Retrosheet.org
Analysis
Baseball Analysts: BaseballAnalysts.com
Baseball Musings: BaseballMusings.com
Baseball Prospectus: BaseballProspectus.com
The Hardball Times: HardballTimes.com
Sabernomics.com (my favorite!): Sabernomics.com
The Sports Economist: thesportseconomist.com
APPENDIX D
Player Values
Season RS: Runs scored by the team if the player had 100 percent of the plate appearances for the team in the season.
Season RA: Runs allowed by the team if the player had 100 percent of the innings pitched for the team in the season.
%PA: The percentage of the team’s plate appearances made by the player.
%IP: The percentage of the team’s innings pitched made by the player.
RSAA: Runs scored above what the average player with the same percentage of plate appearances would have produced.
RABA: Runs allowed below what the average pitcher with the same percentage of innings pitched would have allowed.
$ValAA: Dollar value above what the average player would have produced given the same %PA or %IP.
MRP: The gross marginal revenue product of the player, which is the dollar value (in millions) of what the player is worth to the team in generating revenue. In a competitive market for talent, player wages should equal the gross MRP minus the marginal resource cost of putting the player on the field (i.e., training and equipment costs). The 2006 and 2007 values are based on an estimated 10 percent annual growth rate in revenue from 2005.
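The growth adjustment is simple compounding. The sketch below assumes, for illustration only, that a player’s dollar value scales in step with revenue, so a 2005 value grows by 10 percent for each year after 2005. The function name `project_mrp` and the $10 million starting value are hypothetical.

```python
def project_mrp(mrp_2005, year, growth=0.10):
    """Scale a 2005 value (in millions) forward at a constant annual rate."""
    return mrp_2005 * (1 + growth) ** (year - 2005)

# Hypothetical player worth $10 million in 2005.
for year in (2005, 2006, 2007):
    print(year, round(project_mrp(10.0, year), 2))
```

Under these assumptions the two-year adjustment is 21 percent rather than 20 percent, because the second year’s growth applies to the already larger 2006 value.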
