Statistics for Dummies (17 page)

Authors: Deborah Jean Rumsey

Picturing Data with a Histogram

Numerical data in their raw, unorganized form are hard to absorb. For example, look at Table 4-6, which shows the 2000 population estimates for each of the 50 states (and the District of Columbia), put together by the U.S. Census Bureau. Stare at the table for 30 seconds or so. After you've done that, go ahead and try to answer these questions quickly:

Which states have the largest/smallest populations?
How many people reside in most of the states? Give a rough range of values.
How much variability exists between state populations? (Are the states very similar, or very different, in terms of their total population?)

Table 4-6: Population Estimates by State (2000 Census)
State	Census 2000 Population
Alabama	4,447,100
Alaska	626,932
Arizona	5,130,632
Arkansas	2,673,400
California	33,871,648
Colorado	4,301,261
Connecticut	3,405,565
Delaware	783,600
District of Columbia	572,059
Florida	15,982,378
Georgia	8,186,453
Hawaii	1,211,537
Idaho	1,293,953
Illinois	12,419,293
Indiana	6,080,485
Iowa	2,926,324
Kansas	2,688,418
Kentucky	4,041,769
Louisiana	4,468,976
Maine	1,274,923
Maryland	5,296,486
Massachusetts	6,349,097
Michigan	9,938,444
Minnesota	4,919,479
Mississippi	2,844,658
Missouri	5,595,211
Montana	902,195
Nebraska	1,711,263
Nevada	1,998,257
New Hampshire	1,235,786
New Jersey	8,414,350
New Mexico	1,819,046
New York	18,976,457
North Carolina	8,049,313
North Dakota	642,200
Ohio	11,353,140
Oklahoma	3,450,654
Oregon	3,421,399
Pennsylvania	12,281,054
Rhode Island	1,048,319
South Carolina	4,012,012
South Dakota	754,844
Tennessee	5,689,283
Texas	20,851,820
Utah	2,233,169
Vermont	608,827
Virginia	7,078,515
Washington	5,894,121
West Virginia	1,808,344
Wisconsin	5,363,675
Wyoming	493,782
U.S. TOTAL	281,421,906

Without some way of organizing these data, you'll have difficulty answering these questions. Although most of the media favor tables for organizing numerical data, statisticians favor the histogram as their data display of choice for this kind of data. What is a histogram, you ask?

A histogram is basically a bar graph that applies to numerical data. Because the data are numerical, the groups are ordered from smallest to largest (as opposed to categorical data, such as gender, which have no inherent order). And because you want to be sure each number falls into exactly one group, the bars on a histogram touch each other but don't overlap. Each bar is marked on the x-axis (or horizontal axis) by the value representing its midpoint. For example, suppose a histogram showing length of time until failure of a car part (in hours) has two adjacent bars marked with midpoints of 1,000 hours and 2,000 hours, and each bar has a width of 1,000 hours. This means the first bar represents car parts that lasted anywhere from 500 to 1,500 hours, and the second bar represents car parts that lasted anywhere from 1,500 to 2,500 hours. (Numbers on the border can go on either side, as long as you're consistent for all the borderline values.)
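If you like to see ideas as code, here is a minimal Python sketch of the car-part example. The function names are made up for illustration, and the rule that a borderline value goes to the bar on its right is just one consistent choice among several.

```python
def bar_range(midpoint, half_width):
    """Return the (low, high) range covered by a bar centered at midpoint."""
    return (midpoint - half_width, midpoint + half_width)


def assign_bar(value, midpoints, half_width):
    """Return the midpoint of the bar that a value falls into, or None.

    Assumed border rule: low <= value < high, so a value sitting exactly
    on a border always goes to the bar on its right.
    """
    for mid in midpoints:
        low, high = bar_range(mid, half_width)
        if low <= value < high:
            return mid
    return None


# Two adjacent bars centered at 1,000 and 2,000 hours,
# each reaching 500 hours to either side of its midpoint.
midpoints = [1000, 2000]
half_width = 500

print(bar_range(1000, half_width))              # (500, 1500)
print(bar_range(2000, half_width))              # (1500, 2500)
print(assign_bar(1500, midpoints, half_width))  # 2000 (borderline value)
```

Because every value is tested against the same rule, a part that fails at exactly 1,500 hours lands in one bar only, which is the whole point of being consistent about the borders.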

The height of each bar of a histogram represents either the number of individuals in each group (also known as the frequency of each group) or the percentage of individuals in each group (also known as the relative frequency of each group). For example, if 50% of the car parts lasted between 500 and 1,500 hours, the first bar in the preceding example would have a relative frequency of 50%, and the height of that bar would be reflective of that.
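The difference between frequency and relative frequency can be sketched in a few lines of Python. The eight lifetimes below are invented purely for illustration, and the helper names are hypothetical.

```python
from collections import Counter


def frequencies(values, first_edge, width):
    """Count the values in each bar; each bar is keyed by its lower edge."""
    counts = Counter(
        first_edge + ((v - first_edge) // width) * width for v in values
    )
    return dict(sorted(counts.items()))


def relative_frequencies(freqs):
    """Convert bar counts into percentages of the whole data set."""
    total = sum(freqs.values())
    return {edge: 100 * n / total for edge, n in freqs.items()}


# Invented lifetimes (in hours) for eight car parts.
lifetimes = [600, 900, 1200, 1400, 1600, 1800, 2100, 2400]

freqs = frequencies(lifetimes, first_edge=500, width=1000)
print(freqs)                        # {500: 4, 1500: 4}
print(relative_frequencies(freqs))  # {500: 50.0, 1500: 50.0}
```

Either set of numbers gives a histogram with the same shape; only the labels on the vertical axis change.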

You can see a histogram of the state population data in Figure 4-14. You can easily answer most of the questions at the beginning of this section by looking quickly at the histogram. And in my opinion, in many situations, a histogram provides a more interesting organizational summary of a data set than a table does.

Figure 4-14: State population sizes (2000 Census).

A majority of the states and the District of Columbia (31 out of 51, or 60.8%) have fewer than 5 million people. Another 25.5% have populations of between 5 and 10 million. This means that 86.3% of the states have fewer than 10 million people each. Each of the remaining seven states has a very large population, making the histogram look lopsided and trailing off to the right (this is called skewed to the right). Except for those few very large states, the populations of the states aren't as variable as you may think. The histogram doesn't tell you which state is which, of course, but a quick sorting of the original data can tell you which states are largest and smallest. The five most populous states are California, Texas, New York, Florida, and Illinois (which is closely followed by Pennsylvania). The least populous state is Wyoming, with about 494,000 people.
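If you want to check these figures yourself, the following Python sketch groups the Table 4-6 populations into bars 5 million wide and prints a rough text histogram. The 5-million bar width and the variable names are assumptions made for this illustration; the printed figure may use different groupings.

```python
# Census 2000 populations from Table 4-6.
POPULATION_2000 = {
    "Alabama": 4_447_100, "Alaska": 626_932, "Arizona": 5_130_632,
    "Arkansas": 2_673_400, "California": 33_871_648, "Colorado": 4_301_261,
    "Connecticut": 3_405_565, "Delaware": 783_600, "District of Columbia": 572_059,
    "Florida": 15_982_378, "Georgia": 8_186_453, "Hawaii": 1_211_537,
    "Idaho": 1_293_953, "Illinois": 12_419_293, "Indiana": 6_080_485,
    "Iowa": 2_926_324, "Kansas": 2_688_418, "Kentucky": 4_041_769,
    "Louisiana": 4_468_976, "Maine": 1_274_923, "Maryland": 5_296_486,
    "Massachusetts": 6_349_097, "Michigan": 9_938_444, "Minnesota": 4_919_479,
    "Mississippi": 2_844_658, "Missouri": 5_595_211, "Montana": 902_195,
    "Nebraska": 1_711_263, "Nevada": 1_998_257, "New Hampshire": 1_235_786,
    "New Jersey": 8_414_350, "New Mexico": 1_819_046, "New York": 18_976_457,
    "North Carolina": 8_049_313, "North Dakota": 642_200, "Ohio": 11_353_140,
    "Oklahoma": 3_450_654, "Oregon": 3_421_399, "Pennsylvania": 12_281_054,
    "Rhode Island": 1_048_319, "South Carolina": 4_012_012, "South Dakota": 754_844,
    "Tennessee": 5_689_283, "Texas": 20_851_820, "Utah": 2_233_169,
    "Vermont": 608_827, "Virginia": 7_078_515, "Washington": 5_894_121,
    "West Virginia": 1_808_344, "Wisconsin": 5_363_675, "Wyoming": 493_782,
}

BIN_WIDTH = 5_000_000  # assumed bar width: 5 million people


def bin_counts(values, width):
    """Count values per bar; bar i covers i*width up to (i+1)*width."""
    counts = [0] * (max(values) // width + 1)
    for v in values:
        counts[v // width] += 1
    return counts


counts = bin_counts(list(POPULATION_2000.values()), BIN_WIDTH)
total = len(POPULATION_2000)

# One row per bar: range in millions, a bar of '#' marks, count, percentage.
for i, n in enumerate(counts):
    label = f"{i * 5}-{(i + 1) * 5} million"
    print(f"{label:>14} | {'#' * n} {n} ({100 * n / total:.1f}%)")

# Sorting the raw data answers the "which state is which" question.
ranked = sorted(POPULATION_2000, key=POPULATION_2000.get, reverse=True)
print("Five largest:", ranked[:5])
print("Smallest:", ranked[-1])
```

The first two bars hold 31 and 13 of the 51 entries, and the remaining seven trail off to the right, which is the lopsided pattern described above.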

Tip	If questions come up while you're looking at a data display, try to get access to the original data set. Researchers should be able to provide you with their data if you ask for them.

Analyzing mothers' ages

In one birth statistics example (refer to Table 4-3), the age of the mother is shown for various years from 1975 to 2000. For any year on the table, the age variable is divided into groups, and you're given the number of mothers in each group. Because you're given the total numbers, you can make a histogram of mothers' ages showing either the frequencies or the relative frequencies, whichever is the most appropriate in terms of the point you want to make.

Suppose you want to compare the ages of mothers in 1975 and 2000. You can make two histograms, one for each year, and compare the results. Figure 4-15 shows two such histograms for 1975 (top) and 2000 (bottom). Notice that the relative frequencies (or percentages) are shown on the vertical axis, and the age groups for the mothers are shown on the horizontal axis.

A histogram can summarize the features of numerical data quite well. One of the features that a histogram can show you is the so-called shape of the data (in other words, how the data are distributed among the groups). Are the data distributed evenly, in a uniform way? Are the data symmetric, meaning that the left-hand side of the histogram is a mirror image of the right-hand side of the histogram? Does the histogram have a U-shape, with lots of data on extreme ends and not much in the middle? Does the histogram of the data have a bell-shape, meaning that it looks like a mound in the middle with tails trailing off in either direction as you move away from the center? Or is the histogram skewed, meaning that it looks like a lopsided mound with one long tail either going off to the right (indicating the data are skewed right) or going off to the left (indicating the data are skewed left)?
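Judging shape is usually done by eye, but you can get a rough numerical hint as well. The Python sketch below computes a standard moment-based skewness figure from bar midpoints and bar heights. The cutoff of 0.1 and the function names are arbitrary assumptions made for this example, and the bar heights are invented.

```python
def skewness(midpoints, freqs):
    """Moment-based skewness from bar midpoints and bar frequencies.

    Needs data in at least two different bars (otherwise the spread is zero).
    """
    n = sum(freqs)
    mean = sum(m * f for m, f in zip(midpoints, freqs)) / n
    variance = sum(f * (m - mean) ** 2 for m, f in zip(midpoints, freqs)) / n
    third = sum(f * (m - mean) ** 3 for m, f in zip(midpoints, freqs)) / n
    return third / variance ** 1.5


def describe_shape(midpoints, freqs, tolerance=0.1):
    """Label a histogram using an assumed, arbitrary tolerance."""
    g = skewness(midpoints, freqs)
    if g > tolerance:
        return "skewed right"
    if g < -tolerance:
        return "skewed left"
    return "roughly symmetric"


bars = [1, 2, 3, 4, 5]  # invented bar midpoints

print(describe_shape(bars, [2, 5, 8, 5, 2]))   # roughly symmetric
print(describe_shape(bars, [10, 6, 3, 2, 1]))  # skewed right
print(describe_shape(bars, [1, 2, 3, 6, 10]))  # skewed left
```

Keep in mind that "roughly symmetric" covers uniform, U-shaped, and bell-shaped histograms alike, so a number like this supplements the picture rather than replacing it.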

Mothers' ages in Figure 4-15 for years 1975 and 2000 appear to be mostly mound-shaped, although the data for 1975 are slightly more skewed to the right, indicating that as the women got older, fewer of them had babies, relative to the situation in the year 2000. Another way of saying this is that in the year 2000, a higher proportion of older women were having babies compared to 1975.

Figure 4-15: Colorado live births, by age of mother for 1975 and 2000.

You can also get a sense of how much variability exists in the data by looking at a histogram. If a histogram is quite flat with the bars close to the same height, you may think this indicates less variability because the heights of the bars are similar. In fact, the opposite is true. That's because you have an equal number in each bar, but the bars themselves represent different ranges of values, so the entire data set is actually quite spread out. Now if the histogram has a big lump in the middle with tails on the sides, this indicates that more data are in the middle bars than the outer bars, so the data are actually closer together. Comparing 1975 mothers' ages to 2000 mothers' ages, you see more variability in 2000 than in 1975. This, again, indicates changing times: in 1975, most women had their children by age 30, whereas more of today's women are waiting to have children, and the length of time they wait varies. (Chapter 5 shows you ways to measure variability in a data set.)
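To convince yourself that a flat histogram means more spread, not less, you can compare a flat set of bars with a mound-shaped set over the same groups. This Python sketch uses the standard deviation (one common measure of variability) computed from bar midpoints. The age groups and percentages are invented for illustration and are not the Colorado figures.

```python
from math import sqrt


def binned_std(midpoints, freqs):
    """Standard deviation estimated from bar midpoints and bar heights."""
    n = sum(freqs)
    mean = sum(m * f for m, f in zip(midpoints, freqs)) / n
    variance = sum(f * (m - mean) ** 2 for m, f in zip(midpoints, freqs)) / n
    return sqrt(variance)


# Invented age-group midpoints and percentages.
age_midpoints = [17.5, 22.5, 27.5, 32.5, 37.5]
flat = [20, 20, 20, 20, 20]    # every bar the same height
mound = [5, 20, 50, 20, 5]     # big lump in the middle

print(round(binned_std(age_midpoints, flat), 2))   # 7.07
print(round(binned_std(age_midpoints, mound), 2))  # 4.47
```

The flat histogram comes out with the larger standard deviation, because equal-height bars put just as much data far from the middle as near it.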

HEADS UP

Variability in a histogram should not be confused with variability in a time chart (see the "Keeping Pace with Time Charts" section). If values change over time, they're shown on a time chart as highs and lows, and many changes from high to low (over time) indicate lots of variability. So, a flat line on a time chart indicates no change and no variability in the values across time. However, when the heights of bars of a histogram appear to be flat (uniform), this shows the opposite: the values are spread out uniformly over many groups, indicating a great deal of variability in the data at one point in time.

A histogram can also give you some idea of where the center of the data lies. The center of a data set is measured in different ways (see Chapter 5 for a discussion of these measures). One way to eyeball the center on a histogram is to think of the histogram as a picture of people sitting on a teeter-totter and the center as the point where the fulcrum has to be in order to balance the weight on each side. Refer to Figure 4-15, which shows the ages of Colorado mothers in 1975 and 2000, and note that the mid-point appears to be around 25 years for the 1975 histogram and around 27.5 years for the 2000 histogram. This suggests that in the year 2000, Colorado women were having children at older ages, on average, than they did in 1975.
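The teeter-totter idea translates directly into arithmetic: the balance point is the average of the bar midpoints, with each midpoint weighted by the height of its bar. Here is a short Python sketch, using invented percentages rather than the actual Colorado data; the function name is hypothetical.

```python
def balance_point(midpoints, heights):
    """Weighted average of bar midpoints: where the fulcrum would sit."""
    total = sum(heights)
    return sum(m * h for m, h in zip(midpoints, heights)) / total


# Invented age-group midpoints and relative frequencies (percentages).
age_midpoints = [17.5, 22.5, 27.5, 32.5, 37.5]
younger_heavy = [15, 35, 30, 15, 5]
symmetric = [10, 25, 30, 25, 10]

print(balance_point(age_midpoints, younger_heavy))  # 25.5
print(balance_point(age_midpoints, symmetric))      # 27.5
```

Shifting weight toward the older age groups moves the balance point to the right, which is the kind of change you look for when comparing two histograms.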

Histograms aren't as commonly found in the media as they should be. The reason for this is not clear, and tables are much more commonly used to show breakdowns for numerical data. However, a histogram can be informative, especially when used to compare one group or time period to another. At any rate, if you want to look at data graphically, you can always take data from a table and convert them to a data display.

HEADS UP

Watch for histograms that use unusual scales to mislead readers. As with bar graphs, you can exaggerate differences by using a smaller scale on the vertical axis of a histogram, and you can play down differences by using a larger scale.

Readers can be misled by a histogram in ways that aren't possible with a bar graph. Remember that a histogram deals with numerical data, not categorical data. This means that you need to determine how you want the numerical data to be broken down into groups to display on the horizontal axis. How you determine those groupings can make the graph look very different.
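To see how much the groupings matter, you can count the same data twice with different bar widths. The 15 values in this Python sketch are invented for illustration, and the helper name is hypothetical.

```python
def counts_by_width(values, width):
    """Group values into bars of the given width, keyed by lower edge.

    Bars with no data are simply left out of the result.
    """
    counts = {}
    for v in values:
        edge = (v // width) * width
        counts[edge] = counts.get(edge, 0) + 1
    return dict(sorted(counts.items()))


# Invented data set of 15 values.
values = [52, 58, 61, 64, 67, 69, 71, 73, 74, 76, 78, 79, 81, 85, 93]

print(counts_by_width(values, 10))  # {50: 2, 60: 4, 70: 6, 80: 2, 90: 1}
print(counts_by_width(values, 25))  # {50: 9, 75: 6}
```

With bars 10 units wide, the data look like a mound peaking in the 70s. With bars 25 units wide, the same data collapse into two bars, and the taller one now sits on the left, so the peak in the 70s is no longer visible.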
