Statistics for Dummies

Authors: Deborah Jean Rumsey

Summarizing Numerical Data

With numerical data, measurable characteristics such as height, weight, IQ, age, or income are represented by numbers. Because the data have numerical meaning, you can summarize them in more ways than is possible with categorical data. Certain characteristics of a numerical data set can be described using statistics, such as where the center is, how spread out the data are, and where certain milestones are. These kinds of summaries occur often in the media, so knowing what these summary statistics say and don't say about the data helps you better understand the research that's presented to you in your everyday life.

Getting centered

The most common way to summarize a numerical data set is to describe where the center is. One way of thinking about what the center of a data set means is to ask, "What's a typical value?" Or, "Where is the middle of the data?" The center of a data set can actually be measured in different ways, and the method chosen can greatly influence the conclusions people make about the data.

Averaging out NBA salaries

NBA players make a lot of money, right? Compared to most people, they certainly do. But how much do they make, and is it really as much as you think it is? The answer depends on how you choose to summarize the information. You often hear about players like Shaquille O'Neal, who made $21.4 million in the 2001–2002 season. Is that what the typical NBA player makes? No. Shaquille O'Neal was the highest paid NBA player of that season.

So how much does the typical NBA player make? One way to answer this is to look at the average salary. The average is probably the most commonly used statistic of all time. It is one way to determine where the "center" of the data is.

Here is what you need to do to find the average for a data set, denoted x̄.

  1. Add up all the numbers in the data set.

  2. Divide by the number of numbers in the data set, n.

For example, player salary data for the 2001–2002 season is shown in Table 5-2 for the 13 players on the Los Angeles Lakers roster (excluding those who were released early in the season).

Table 5-2: Salaries for Los Angeles Lakers NBA Players, 2001–2002 Season

Player                    Salary ($)
Shaquille O'Neal          21,428,572
Kobe Bryant               11,250,000
Robert Horry               5,300,000
Rick Fox                   3,791,250
Lindsey Hunter             3,425,760
Derek Fisher               3,000,000
Samaki Walker              1,400,000
Mitch Richmond [*]         1,000,000
Brian Shaw [*]               963,415
Devean George                834,250
Mark Madsen                  759,960
Jelani McCoy                 565,850
Stanislav Medvedenko         465,850
Total                     54,184,907

[*] without salary cap adjustments

Adding all the salaries, the total payroll for this team is $54,184,907. Dividing by the total number of players (n = 13) gives an average salary of $4,168,069.77. That's a pretty nice average salary, isn't it? But notice that Shaquille O'Neal is at the top of this list, and in that year, his salary was the highest in the entire league. If you take the average salary of all of the Lakers players besides Shaq, you would get an average of $32,756,335 ÷ 12 = $2,729,694.58. This is still a hefty amount, but one that's significantly lower than the average salary of all players including Shaquille O'Neal. (Of course, fans would argue that this merely shows how important he is to the team. And this issue is but the tip of the iceberg of the never-ending debates that sports fans love to have about statistics.)
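
The two averages above can be checked with a few lines of Python. This is only a sketch: the salary figures are copied from Table 5-2, while the list name `salaries` and the helper `average` are names made up for this illustration.

```python
# Lakers 2001-2002 salaries in dollars, copied from Table 5-2 (highest first).
salaries = [
    21_428_572, 11_250_000, 5_300_000, 3_791_250, 3_425_760, 3_000_000,
    1_400_000, 1_000_000, 963_415, 834_250, 759_960, 565_850, 465_850,
]


def average(values):
    """Add up all the numbers, then divide by how many there are (n)."""
    return sum(values) / len(values)


mean_all = average(salaries)              # all 13 players
mean_without_top = average(salaries[1:])  # drop the first entry, the highest salary

print(f"Total payroll:             ${sum(salaries):,}")
print(f"Average, all 13 players:   ${mean_all:,.2f}")
print(f"Average, without the top:  ${mean_without_top:,.2f}")
```

Removing a single salary moves the average by more than $1.4 million, which is the point the text goes on to make about outliers.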

REMEMBER 

Another word that's used for average is the word mean.

So, for the 2001–2002 season, the average salary for the Lakers was about $4.2 million. But does the average always tell the whole story? In some cases, the average may be a bit misleading, and this is one of those cases. That's because every year, a few top-notch players (like Shaq) make much more money than anybody else (and, like Shaq, they also tend to be taller than anyone else, by the way). These are called outliers (numbers in the data set that are extremely high or extremely low, compared to the rest of the data). Because of the way the average is calculated, outliers that are high tend to drive the average upward (just like Shaq's salary did in the preceding example). Similarly, outliers that are extremely low tend to drive the average downward.

Remember in school when you took an exam, and you and most of the rest of the class did badly, while a couple of the nerds got 100? Remember how the teacher didn't change the grading scale to reflect the poor performance of most of the class? Your teacher was probably using the average, and the average in that case didn't really represent the true center of the students' scores.

What can you report, other than the average, to show what the salary of a "typical" NBA player would be or what the test score of a "typical" student in your class was? Another statistic that is used to measure the center of a data set is called the median. The median is still an unsung hero of statistics in the sense that it isn't used nearly as often as it should be, although people are beginning to report it more and more nowadays.

Splitting salaries down the median

The median of a data set is the value that lies exactly in the middle. Here are the steps for finding the median of a data set:

  1. Order the numbers from smallest to largest.

  2. If the data set contains an odd number of numbers, choose the one that is exactly in the middle.

    This is the median.

  3. If the data set contains an even number of numbers, take the two numbers that appear exactly in the middle and average them to find the median.

The salaries for the Los Angeles Lakers during the 2001–2002 season (refer to Table 5-2) are already ordered from smallest (starting at the bottom) to largest (at the top). Because the list contains the names and salaries of 13 players, the middle salary is the seventh one from the bottom (or top), or the salary of Samaki Walker, who earned $1.4 million that season from the Lakers. This is the median.

This median salary for the Lakers is well below the average of $4.2 million for this team. But because the average Laker salary includes outliers (like the salary of Shaquille O'Neal), the median salary is more representative of the middle salary for the team. (Notice that only 3 players earned more than the average Laker salary of $4.2 million, while 6 players earned more than the median salary of $1.4 million.) The median isn't affected by the salaries of those players who are way out there on the high end, the way the average is. (By the way, the lowest Lakers' salary for the 2001–2002 season was $465,850 — a lot of money by most people's standards but mere peanuts compared to what you think of when you think of an NBA player's salary!)
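
As a rough sketch, the three median steps can be written out in Python and applied to the salaries in Table 5-2. The function name `median` and the variable names are my own choices for illustration; Python's standard library also offers `statistics.median`, which gives the same result.

```python
# Lakers 2001-2002 salaries in dollars, copied from Table 5-2.
salaries = [
    21_428_572, 11_250_000, 5_300_000, 3_791_250, 3_425_760, 3_000_000,
    1_400_000, 1_000_000, 963_415, 834_250, 759_960, 565_850, 465_850,
]


def median(values):
    """Return the middle value (or the average of the two middle values)."""
    ordered = sorted(values)      # Step 1: order from smallest to largest
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:                # Step 2: odd count, take the middle one
        return ordered[middle]
    # Step 3: even count, average the two values in the middle
    return (ordered[middle - 1] + ordered[middle]) / 2


mean_salary = sum(salaries) / len(salaries)
median_salary = median(salaries)

# Count how many players earn more than each measure of center.
above_mean = sum(1 for s in salaries if s > mean_salary)
above_median = sum(1 for s in salaries if s > median_salary)

print(f"Median salary: ${median_salary:,}")
print(f"Players above the average: {above_mean}")
print(f"Players above the median:  {above_median}")
```

The counts come out to 3 players above the average and 6 above the median, matching the figures quoted in the text.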

The U.S. Government often uses the median to represent the center with respect to its data. For example, the U.S. Census Bureau reported that in 2001, the median household income was $42,228, down 2.2% from the year 2000, when the median household income was $43,162.

Interpreting the center: Comparing means to medians

Now suppose you're part of an NBA team trying to negotiate salaries. If you represent the owners, you want to show how much everyone is making and how much money you're spending, so you want to take into account those superstar players and report the average. But if you're on the side of the players, you would want to report the median, because that's more representative of what the players in the middle are making. Fifty percent of the players make a salary above the median, and 50% of the players make a salary below the median. That is why they call it the median — like the median of an interstate highway, it's the point in the exact middle.

TECHNICAL STUFF 

A histogram is a type of graph that organizes and displays numerical data in picture form, showing groups of data and the number or percentage of the data that fall into each group. (See Chapter 4 for more information on histograms and other types of data displays.) If the data have outliers on the upper end, the histogram of the data will be skewed to the right, and the mean will be larger than the median. (See the top histogram in Figure 5-1 for an example of data that are skewed to the right.) If the data have outliers on the lower end, the histogram of the data will be skewed to the left, and the mean will be smaller than the median. (The middle histogram in Figure 5-1 shows data that are skewed to the left.) If the data are symmetric (have about the same shape on either side of the middle), the mean and the median will be about the same. (The bottom histogram in Figure 5-1 shows an example of symmetric data.)

Figure 5-1: Data skewed to the right; data skewed to the left; and symmetric data.

REMEMBER

The average (or mean) of a data set is affected by outliers, but the median is not. If someone reports the average value, also ask for the median, so that you can compare the two statistics and get a better feel for what's actually going on in the data and what's truly typical.
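
To see the difference for yourself, here is a small sketch using Python's built-in `statistics` module. The two lists are made-up numbers chosen only for illustration; they are not data from this chapter.

```python
import statistics

# Made-up illustrative data: the second list swaps the largest value for an outlier.
values = [1, 2, 3, 4, 5]
values_with_outlier = [1, 2, 3, 4, 50]

mean_before = statistics.mean(values)
mean_after = statistics.mean(values_with_outlier)
median_before = statistics.median(values)
median_after = statistics.median(values_with_outlier)

print(f"Mean:   {mean_before} -> {mean_after}")      # the mean jumps from 3 to 12
print(f"Median: {median_before} -> {median_after}")  # the median stays at 3
```

One extreme value quadruples the mean here, while the median doesn't budge.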

Accounting for variation

Variation always exists in a data set, regardless of which characteristic you're measuring, because not every individual is going to have the same exact value for every variable. Variability is what makes the field of statistics what it is. For example, the price of homes varies from house to house, from year to year, and from state to state. Household income varies from household to household, from country to country, and from year to year. The number of passing yards a quarterback achieves in a game varies from player to player, from game to game, and from season to season. The amount of time that it takes you to get to work each day varies from day to day. The trick to dealing with variation is to be able to measure that variability in a way that best captures it.

Knowing what standard deviation means

By far the most commonly used measure of variability is the standard deviation. The standard deviation represents the typical distance from any point in the data set to the center. It's roughly the average distance from the center, and in this case, the center is the average. Most often, you don't hear a standard deviation given just by itself; if it's reported (and it's not reported nearly enough) it will probably be in the fine print, usually given in parentheses, like "(s = 2.68)."

TECHNICAL STUFF 

The standard deviation of an entire population of data is denoted with the Greek letter σ. The standard deviation of a sample from the population is denoted with the letter s. Because most of the time the population standard deviation isn't a value that's known, any formulas involving the standard deviation would leave you high and dry without something to plug in for it. But, never fear. When in Rome, do as the Romans do, right? So when dealing with statistics, do as the statisticians do: whenever they are stuck with an unknown value, they just estimate it and move on! So s is used to estimate σ in cases where σ is unknown.

In this book, when I use the term standard deviation, I mean s, the sample standard deviation. (If and when I refer to the population standard deviation, I let you know!)

Calculating the standard deviation

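The formula for the sample standard deviation is

$$ s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}} $$

where x̄ is the average of the data set and n is the number of values in it.
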
To calculate the sample standard deviation, s, do the following steps:

  1. Find the average of the data set.

    To find the average, add up all the numbers and divide by the number of numbers in the data set, n.

  2. Take each number and subtract the average from it.

  3. Square each of the differences.

  4. Add up all of the results from Step 3.

  5. Divide the sum of squares (found in Step 4) by the number of numbers in the data set, minus one (n – 1).

  6. Take the square root of the number you get.

TECHNICAL STUFF 

Statisticians divide by n – 1 instead of n in the formula for s so that the sample standard deviation has nice properties that work out with all of their theory. (Believe me, that's more than you want to know about that issue.) For example, dividing by n – 1 makes sure that the standard deviation isn't biased (off target) on average. In case you weren't confused enough already, here's more: If you do ever get the entire population of data and you want to find the population standard deviation, σ, use the same formula as the one for s, except do divide by n, not n – 1!

Look at the following small example. Suppose you have four numbers: 1, 3, 5, and 7. The mean is 16 ÷ 4 = 4. Subtracting the mean from each number, you get (1 – 4) = –3, (3 – 4) = –1, (5 – 4) = +1, and (7 – 4) = +3. Squaring each of these results, you get 9, 1, 1, and 9. Adding these up, the sum is 20. In this example, n = 4, and therefore n – 1 = 3, so you divide 20 by 3 to get 6.67. Finally, you take the square root of 6.67, which is 2.58, and that is the standard deviation of this data set. So for the data set 1, 3, 5, 7, the typical distance from the mean is 2.58.
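
The six steps can be sketched in Python and run on the same four numbers. The function name `sample_standard_deviation` is a name made up for this illustration. For comparison, the sketch also calls `statistics.stdev` (which divides by n – 1) and `statistics.pstdev` (which divides by n) from Python's standard library.

```python
import math
import statistics


def sample_standard_deviation(values):
    """Follow the six steps for s, dividing by n - 1."""
    n = len(values)
    mean = sum(values) / n                     # Step 1: find the average
    differences = [x - mean for x in values]   # Step 2: subtract the average
    squares = [d ** 2 for d in differences]    # Step 3: square each difference
    sum_of_squares = sum(squares)              # Step 4: add up the squares
    variance = sum_of_squares / (n - 1)        # Step 5: divide by n - 1
    return math.sqrt(variance)                 # Step 6: take the square root


data = [1, 3, 5, 7]
s = sample_standard_deviation(data)

print(f"s by hand:            {s:.2f}")                         # 2.58
print(f"statistics.stdev:     {statistics.stdev(data):.2f}")    # same, n - 1 divisor
print(f"statistics.pstdev:    {statistics.pstdev(data):.2f}")   # 2.24, n divisor
```

Dividing by n instead of n – 1 gives a slightly smaller number (2.24 instead of 2.58), and the gap shrinks as the data set gets larger.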

REMEMBER 

Because calculating the standard deviation involves many steps, in most cases, you will probably have a computer calculate it for you. But knowing how to calculate the standard deviation helps you better interpret this statistic and can help you figure out when the statistic may be wrong.

Interpreting the standard deviation

Standard deviation can be difficult to interpret as a single number on its own. Basically, a small standard deviation means that the values in the data set are close to the middle of the data set, on average, while a large standard deviation means that the values in the data set are farther away from the middle, on average.

A small standard deviation can be a goal in certain situations where the results are restricted (for example, in product manufacturing and quality control). A particular type of car part that has to be centimeters in diameter to fit properly had better not have a very big standard deviation. A big standard deviation in this case would mean that lots of parts end up in the trash because they don't fit right; either that, or the cars will have problems down the road.

In situations where you just observe and record data, a large standard deviation isn't necessarily a bad thing; it just reflects a large amount of variability in the group that is being studied. For example, if you look at salaries for everyone in a certain company, including everyone from the student intern to the CEO, the standard deviation could be very large. On the other hand, if you narrow the group down by looking only at the student interns or only at the corporate executives, the standard deviation will be smaller, because the individuals within each of those two groups have salaries that are less variable.

HEADS UP 

Watch for the units when determining whether a standard deviation is large. For example, a standard deviation of 2 in units of years is equivalent to a standard deviation of 24 in units of months. Also look at the value of the mean when putting standard deviation into perspective. If the average number of Internet newsgroups that a user posts to is 5.2, and the standard deviation is 3.4, that's a lot of variability, relatively speaking. But if you're talking about the age of the newsgroup users, where the mean is 25.6 years, a standard deviation of 3.4 would be comparatively smaller.
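
Both points can be checked with a quick calculation. In this sketch, dividing the standard deviation by the mean is just one convenient way to put the two newsgroup examples on the same footing; the helper name `relative_spread` is my own, not a term used in this chapter.

```python
# Changing units rescales the standard deviation by the same factor.
sd_years = 2
sd_months = sd_years * 12   # 2 years is the same spread as 24 months


def relative_spread(sd, mean):
    """Standard deviation expressed as a fraction of the mean."""
    return sd / mean


posts = relative_spread(3.4, 5.2)    # newsgroups posted to: mean 5.2, sd 3.4
ages = relative_spread(3.4, 25.6)    # user ages in years: mean 25.6, sd 3.4

print(f"Standard deviation in months: {sd_months}")
print(f"Posts: sd is {posts:.0%} of the mean")   # about 65%
print(f"Ages:  sd is {ages:.0%} of the mean")    # about 13%
```

The same standard deviation of 3.4 is a large share of a mean of 5.2 but a small share of a mean of 25.6.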

Another way to interpret standard deviation is to use it in conjunction with the mean to describe where most of the data are. If the data are distributed in a bell-shaped curve (with lots of data close to the middle, with fewer values as you move away from the middle) you can use something called the empirical rule to interpret the standard deviation. (See Chapter 4.) The empirical rule says that about 68% of the data should lie within one standard deviation on either side of the mean; about 95% of the data should lie within two standard deviations of the mean; and about 99.7% of the data (often rounded to 99%) should lie within three standard deviations of the mean.

In a study of how people make friends in cyberspace using newsgroups, for example, the age of the users of an Internet newsgroup was reported to have a mean of 31.65 years, with a standard deviation of 8.61 years. The data were distributed in a bell-shaped curve. According to the empirical rule, about 68% of the newsgroup users had ages within 1 standard deviation (8.61 years) of the mean (31.65 years). So, about 68% of the users were between ages 31.65 – 8.61 years and 31.65 + 8.61 years, or between 23.04 and 40.26 years. About 95% of the users were between the ages of 31.65 – 2(8.61) and 31.65 + 2(8.61), or between 14.43 and 48.87 years. Finally, about 99% of the Internet users' ages were between 31.65 – 3(8.61) and 31.65 + 3(8.61), or between 5.82 and 57.48 years. (For more applications of the empirical rule, see Chapter 8.)
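
The three age ranges can be reproduced with a short Python sketch. The function name `empirical_rule_intervals` is made up for this illustration, and the results only mean something when the data really are bell-shaped, as the study reported.

```python
def empirical_rule_intervals(mean, sd):
    """Return the ranges 1, 2, and 3 standard deviations either side of the mean."""
    return {k: (mean - k * sd, mean + k * sd) for k in (1, 2, 3)}


# Newsgroup user ages from the study: mean 31.65 years, standard deviation 8.61 years.
intervals = empirical_rule_intervals(31.65, 8.61)
coverage = {1: "about 68%", 2: "about 95%", 3: "about 99%"}

for k, (low, high) in intervals.items():
    print(f"{coverage[k]} of users: between {low:.2f} and {high:.2f} years")
```

The same function works for any bell-shaped data set: plug in a mean of 70,000 and a standard deviation of 20,000, and the two-standard-deviation range runs from 30,000 to 110,000.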

REMEMBER 

Most people don't bother trying to account for 99% of the values in a data set; they're usually happy with 95%. Going out one more standard deviation on either side of the mean just to pick up an extra 4% of the data (99% – 95%) doesn't seem worthwhile to many people.

Understanding the properties of the standard deviation

Here are some properties that can help you when interpreting a standard deviation:

  • The standard deviation can never be a negative number. (That's because of how it's calculated and the fact that it measures a distance; distances are never negative numbers.)

  • The smallest possible value for the standard deviation is 0, and that happens only in contrived situations where every single number in the data set is exactly the same (no deviation).

  • The standard deviation is affected by outliers (extremely low or extremely high numbers in the data set). That's because the standard deviation is based on the distance from the mean. And remember, the mean is also affected by outliers.

  • The standard deviation has the same units as the original data.

Lobbying for the standard deviation

The standard deviation is something that is not reported very often in the media, and that's a real problem. If you find out only where the center of the data is without some measure of how variable those data are, you have only part of the story. In fact, you could be missing the most interesting part of the story. Variety is the spice of life, yet without an indication of how diverse or varied the data are, you're not being told how spicy the data are.

Without knowing the standard deviation, you can't get a handle on whether all the data are close to the average (as are the diameters of car parts that come off of a conveyor belt when everything is operating correctly) or whether the data are spread out over a wide range (as are the salaries of NBA players). If someone told you that the average starting salary for someone working at Company Statistix is $70,000, you may think, "Wow! That's great." But if the standard deviation for starting salaries at Company Statistix is $20,000, by using the empirical rule assuming that the distribution of salaries is bell-shaped, you could be making anywhere from $30,000 to $110,000 (that is, $70,000, plus or minus two standard deviations, each worth $20,000). Company Statistix has a lot of variation in terms of how much money you can make, so the average starting salary of $70,000 isn't as informative in the end, is it? On the other hand, if the standard deviation was only $5,000, you would have a much better idea of what to expect for a starting salary at that company.

REMEMBER 

Without the standard deviation, you can't compare two data sets effectively. What if the two sets of data have about the same average and the same median; does that mean that the data are all the same? Not at all. For example, the data sets 199, 200, 201, and 0, 200, 400 both have the same average, which is 200, and the same median, which is also 200. Yet they have very different standard deviations. The first data set has a very small standard deviation compared to the second data set.
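
Here is a quick check of that claim, sketched with Python's standard `statistics` module. The variable names `tight` and `spread_out` are my own labels for the two data sets.

```python
import statistics

tight = [199, 200, 201]
spread_out = [0, 200, 400]

for name, data in (("199, 200, 201", tight), ("0, 200, 400", spread_out)):
    print(
        f"{name}: mean = {statistics.mean(data)}, "
        f"median = {statistics.median(data)}, "
        f"standard deviation = {statistics.stdev(data)}"
    )
```

Both data sets have a mean and a median of 200, but the sample standard deviation is 1 for the first and 200 for the second.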

Journalists often don't report the standard deviation. The only reason I can think of for this is that people must not ask for it — perhaps the public just isn't ready for the standard deviation yet. But reference to the standard deviation may become more commonplace in the media as more and more people discover what the standard deviation can tell them about a set of results. And in many workplaces, the standard deviation is frequently reported and used, because this statistic is a standard and well-accepted way of measuring variation.

Being out of range

Many times, the media will report the range of a data set as a way to measure the variability. The range is the largest value in the data set minus the smallest value in the data set. The range is easy to find; all you do is put the numbers in order (from smallest to largest) and do a quick subtraction. Maybe that's why the range is used so often; it certainly isn't because of its interpretative value.

HEADS UP 

The range of a data set is almost meaningless. It depends on only two numbers in the data set, both of which could reflect extreme values (outliers). My advice is to ignore the range and try to find the standard deviation, which is a more informative measure of the variability in the data set.

NBA salaries, as expected, have a great deal of variability. Salaries for one single team, the Los Angeles Lakers, for the 2001–2002 season are a typical example. Refer to Table 5-2 for the salaries of all 13 players on the team. The average salary is $4,168,069.77, and the median is $1,400,000. The salaries range from the highest, which is $21,428,572 (Shaquille O'Neal), to the lowest, $465,850 (Stanislav Medvedenko), with a range of $21,428,572 – $465,850 = $20,962,722. Wow, that's a huge range! It signifies a big difference between the highest and lowest paid players, that's for sure. But does it mean much in terms of the overall variability in salaries for the whole team? Not really. The standard deviation is $5.98 million, which is still a very big number, but because the standard deviation is based on all the salaries of the team (not just the largest and the smallest ones) the standard deviation has much more statistical meaning than the range does.
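
Both numbers can be recomputed from Table 5-2 with a short Python sketch. The variable names are my own, and `statistics.stdev` is the standard library's sample standard deviation (the n – 1 version used throughout this chapter).

```python
import statistics

# Lakers 2001-2002 salaries in dollars, copied from Table 5-2.
salaries = [
    21_428_572, 11_250_000, 5_300_000, 3_791_250, 3_425_760, 3_000_000,
    1_400_000, 1_000_000, 963_415, 834_250, 759_960, 565_850, 465_850,
]

# The range uses only two of the 13 values.
salary_range = max(salaries) - min(salaries)

# The standard deviation uses every salary on the team.
salary_sd = statistics.stdev(salaries)

print(f"Range:              ${salary_range:,}")
print(f"Standard deviation: ${salary_sd / 1_000_000:.2f} million")
```

Change any salary in the middle of the list and the range stays exactly the same, while the standard deviation shifts to reflect it.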

Tip 

When you come across summary statistics, look for the standard deviation to get a handle on how much variation is in the data. If it's not available, ask for it, or go to the source (the press release, the journal article, the researchers themselves), where you are sure to find it. Don't put much credibility in the range; it's too rough of an estimate of variability to account for much of anything.
