Numerical data in their raw, unorganized form are hard to absorb. For example, look at Table 4-6, which shows the 2000 population estimates for each of the 50 states (and the District of Columbia), put together by the U.S. Census Bureau. Stare at the table for 30 seconds or so. After you've done that, go ahead and try to answer these questions quickly:
Which states have the largest/smallest populations?
How many people reside in most of the states? Give a rough range of values.
How much variability exists between state populations? (Are the states very similar, or very different, in terms of their total population?)
State | Census 2000 Population |
---|---|
Alabama | 4,447,100 |
Alaska | 626,932 |
Arizona | 5,130,632 |
Arkansas | 2,673,400 |
California | 33,871,648 |
Colorado | 4,301,261 |
Connecticut | 3,405,565 |
Delaware | 783,600 |
District of Columbia | 572,059 |
Florida | 15,982,378 |
Georgia | 8,186,453 |
Hawaii | 1,211,537 |
Idaho | 1,293,953 |
Illinois | 12,419,293 |
Indiana | 6,080,485 |
Iowa | 2,926,324 |
Kansas | 2,688,418 |
Kentucky | 4,041,769 |
Louisiana | 4,468,976 |
Maine | 1,274,923 |
Maryland | 5,296,486 |
Massachusetts | 6,349,097 |
Michigan | 9,938,444 |
Minnesota | 4,919,479 |
Mississippi | 2,844,658 |
Missouri | 5,595,211 |
Montana | 902,195 |
Nebraska | 1,711,263 |
Nevada | 1,998,257 |
New Hampshire | 1,235,786 |
New Jersey | 8,414,350 |
New Mexico | 1,819,046 |
New York | 18,976,457 |
North Carolina | 8,049,313 |
North Dakota | 642,200 |
Ohio | 11,353,140 |
Oklahoma | 3,450,654 |
Oregon | 3,421,399 |
Pennsylvania | 12,281,054 |
Rhode Island | 1,048,319 |
South Carolina | 4,012,012 |
South Dakota | 754,844 |
Tennessee | 5,689,283 |
Texas | 20,851,820 |
Utah | 2,233,169 |
Vermont | 608,827 |
Virginia | 7,078,515 |
Washington | 5,894,121 |
West Virginia | 1,808,344 |
Wisconsin | 5,363,675 |
Wyoming | 493,782 |
U.S. TOTAL | 281,421,906 |
Without some way of organizing these data, you'll have difficulty answering these questions. Although most of the media favor tables for organizing numerical data, statisticians favor the histogram as their display of choice for this kind of data. What is a histogram, you ask?
A histogram is basically a bar graph that applies to numerical data. Because the data are numerical, the categories are ordered from smallest to largest (as opposed to categorical data, such as gender, which have no inherent order). And because you want to be sure each number falls into exactly one group, the bars on a histogram touch each other but don't overlap. Each bar is marked on the x-axis (or horizontal axis) by the value representing its midpoint. For example, suppose a histogram showing length of time until failure of a car part (in hours) has two adjacent bars marked with midpoints of 1,000 hours and 2,000 hours, and each bar extends 500 hours on either side of its midpoint (for a total width of 1,000 hours). This means the first bar represents car parts that lasted anywhere from 500 to 1,500 hours, and the second bar represents car parts that lasted anywhere from 1,500 to 2,500 hours. (Numbers on the border can go on either side, as long as you're consistent for all the borderline values.)
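To see how midpoints translate into ranges, here is a minimal Python sketch of the car part example. It assumes each bar reaches 500 hours on either side of its midpoint (which is what the 500-to-1,500 range implies) and that borderline values always go into the higher bar. The function names `bar_range` and `assign_bar` are made up for this illustration.

```python
def bar_range(midpoint, half_width):
    """Return the (low, high) range covered by a bar with the given midpoint."""
    return (midpoint - half_width, midpoint + half_width)


def assign_bar(value, ranges):
    """Return the index of the bar a value falls into, or None if it fits no bar.

    Assumed convention: a borderline value goes into the higher bar,
    so each range includes its low end and excludes its high end.
    """
    for index, (low, high) in enumerate(ranges):
        if low <= value < high:
            return index
    return None


# Two adjacent bars with midpoints of 1,000 and 2,000 hours
ranges = [bar_range(midpoint, 500) for midpoint in (1000, 2000)]
print(ranges)

# A part that failed at exactly 1,500 hours sits on the border;
# under the convention above it lands in the second bar (index 1)
print(assign_bar(1500, ranges))
```

Whichever side you pick for the borderline values, the point is that the same rule applies to every border, so each number lands in exactly one bar.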
The height of each bar of a histogram represents either the number of individuals in each group (also known as the frequency of each group) or the percentage of individuals in each group (also known as the relative frequency of each group). For example, if 50% of the car parts lasted between 500 and 1,500 hours, the first bar in the preceding example would have a relative frequency of 50%, and the height of that bar would reflect that.
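The difference between frequency and relative frequency is easy to see with a few numbers. The following Python sketch uses eight made-up failure times (they are not from any real data set) and the same two bars as the car part example, with borderline values assumed to go into the higher bar.

```python
# Made-up failure times in hours, for illustration only
lifetimes = [620, 900, 1100, 1450, 1600, 1900, 2100, 2400]

# Bar borders: the first bar covers 500 to 1,500, the second 1,500 to 2,500
edges = [500, 1500, 2500]

# Frequency: the number of parts in each bar
frequency = [
    sum(1 for hours in lifetimes if low <= hours < high)
    for low, high in zip(edges, edges[1:])
]

# Relative frequency: the percentage of all parts in each bar
relative_frequency = [100 * count / len(lifetimes) for count in frequency]

print(frequency)           # counts per bar
print(relative_frequency)  # percentages per bar
```

Here four of the eight parts fall in the first bar, so its frequency is 4 and its relative frequency is 50%. The shape of the histogram is the same either way; only the labels on the vertical axis change.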
You can see a histogram of the state population data in Figure 4-14. You can easily answer most of the questions at the beginning of this section by looking quickly at the histogram. And in my opinion, in many situations, a histogram provides a more interesting organizational summary of a data set than a table does.
A majority of the states and the District of Columbia (31 out of 51, or 60.8%) have fewer than 5 million people. Another 25.5% have populations of between 5 and 10 million. This means that 86.3% of the states have fewer than 10 million people each. Each of the remaining seven states has a very large population, making the histogram look lopsided and trailing off to the right (this is called skewed to the right). Except for those few very large states, the populations of the states aren't as variable as you may think. The histogram doesn't tell you which state is which, of course, but a quick sorting of the original data can tell you which states are largest and smallest. The five most populous states are California, Texas, New York, Florida, and Illinois (which is closely followed by Pennsylvania). The smallest state is Wyoming with about 494,000 people.
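If you'd like to check these numbers yourself, the following Python sketch groups the populations from Table 4-6 into bars that are 5 million people wide and then sorts the data. The grouping (under 5 million, 5 to 10 million, 10 million and up) mirrors the breakdown described above; the variable names are just choices made for this sketch.

```python
# Census 2000 populations, copied from Table 4-6
populations = {
    "Alabama": 4_447_100, "Alaska": 626_932, "Arizona": 5_130_632,
    "Arkansas": 2_673_400, "California": 33_871_648, "Colorado": 4_301_261,
    "Connecticut": 3_405_565, "Delaware": 783_600, "District of Columbia": 572_059,
    "Florida": 15_982_378, "Georgia": 8_186_453, "Hawaii": 1_211_537,
    "Idaho": 1_293_953, "Illinois": 12_419_293, "Indiana": 6_080_485,
    "Iowa": 2_926_324, "Kansas": 2_688_418, "Kentucky": 4_041_769,
    "Louisiana": 4_468_976, "Maine": 1_274_923, "Maryland": 5_296_486,
    "Massachusetts": 6_349_097, "Michigan": 9_938_444, "Minnesota": 4_919_479,
    "Mississippi": 2_844_658, "Missouri": 5_595_211, "Montana": 902_195,
    "Nebraska": 1_711_263, "Nevada": 1_998_257, "New Hampshire": 1_235_786,
    "New Jersey": 8_414_350, "New Mexico": 1_819_046, "New York": 18_976_457,
    "North Carolina": 8_049_313, "North Dakota": 642_200, "Ohio": 11_353_140,
    "Oklahoma": 3_450_654, "Oregon": 3_421_399, "Pennsylvania": 12_281_054,
    "Rhode Island": 1_048_319, "South Carolina": 4_012_012, "South Dakota": 754_844,
    "Tennessee": 5_689_283, "Texas": 20_851_820, "Utah": 2_233_169,
    "Vermont": 608_827, "Virginia": 7_078_515, "Washington": 5_894_121,
    "West Virginia": 1_808_344, "Wisconsin": 5_363_675, "Wyoming": 493_782,
}

total = len(populations)  # 50 states plus the District of Columbia

# Count how many fall into each group
under_5 = sum(1 for p in populations.values() if p < 5_000_000)
from_5_to_10 = sum(1 for p in populations.values() if 5_000_000 <= p < 10_000_000)
over_10 = sum(1 for p in populations.values() if p >= 10_000_000)

for label, count in [("under 5 million", under_5),
                     ("5 to 10 million", from_5_to_10),
                     ("10 million and up", over_10)]:
    print(f"{label}: {count} of {total} ({100 * count / total:.1f}%)")

# Sort from most to least populous to see which state is which
ranked = sorted(populations, key=populations.get, reverse=True)
print("Largest five:", ranked[:5])
print("Smallest:", ranked[-1])
```

The counts tell you what the histogram shows (how many states fall in each range), and the sorted list fills in what the histogram can't show (which states they are).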
Tip | If questions come up while you're looking at a data display, try to get access to the original data set. Researchers should be able to provide you with their data if you ask for them. |
In one birth statistics example (refer to Table 4-3), the age of the mother is shown for various years from 1975 to 2000. For any year on the table, the age variable is divided into groups, and you're given the number of mothers in each group. Because you're given the total numbers, you can make a histogram of mothers' ages showing either the frequencies or the relative frequencies, whichever is the most appropriate in terms of the point you want to make.
Suppose you want to compare the ages of mothers in 1975 and 2000. You can make two histograms, one for each year, and compare the results. Figure 4-15 shows two such histograms for 1975 (top) and 2000 (bottom). Notice that the relative frequencies (or percentages) are shown on the vertical axis, and the age groups for the mothers are shown on the horizontal axis.
A histogram can summarize the features of numerical data quite well. One of the features that a histogram can show you is the so-called shape of the data (in other words, how the data are distributed among the groups). Are the data distributed evenly, in a uniform way? Are the data symmetric, meaning that the left-hand side of the histogram is a mirror image of the right-hand side of the histogram? Does the histogram have a U-shape, with lots of data on extreme ends and not much in the middle? Does the histogram of the data have a bell-shape, meaning that it looks like a mound in the middle with tails trailing off in either direction as you move away from the center? Or is the histogram skewed, meaning that it looks like a lopsided mound with one long tail either going off to the right (indicating the data are skewed right) or going off to the left (indicating the data are skewed left)?
Mothers' ages in Figure 4-15 for years 1975 and 2000 appear to be mostly mound-shaped, although the data for 1975 are slightly more skewed to the right, indicating that as the women got older, fewer of them had babies, relative to the situation in the year 2000. Another way of saying this is that in the year 2000, a higher proportion of older women were having babies compared to 1975.
You can also get a sense of how much variability exists in the data by looking at a histogram. If a histogram is quite flat with the bars close to the same height, you may think this indicates less variability because the heights of the bars are similar. In fact, the opposite is true. That's because you have about the same number in each bar, but the bars themselves represent different ranges of values, so the entire data set is actually quite spread out. Now if the histogram has a big lump in the middle with tails on the sides, this indicates that more data are in the middle bars than the outer bars, so the data are actually closer together. Comparing 1975 mothers' ages to 2000 mothers' ages, you see more variability in 2000 than in 1975. This, again, indicates changing times: in 1975 most women had their children by age 30, whereas in 2000 more women were waiting to have children, and the length of time they waited varied. (Chapter 5 shows you ways to measure variability in a data set.)
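The claim that a flat histogram means more variability can feel backward, so here is a small Python sketch that puts numbers on it. Both histograms below are made up for illustration (they are not the data behind Figure 4-15): each has 100 individuals spread over the same five bars, and every individual is treated as sitting at the midpoint of its bar, which is an approximation. The standard deviation, one common measure of spread, is computed with `statistics.pstdev` from the Python standard library.

```python
import statistics

# Midpoints of five bars that are each 5 years wide (made-up grouping)
midpoints = [17.5, 22.5, 27.5, 32.5, 37.5]

# Two made-up histograms with 100 individuals each
flat_counts = [20, 20, 20, 20, 20]   # bars all the same height
mound_counts = [5, 20, 50, 20, 5]    # big lump in the middle


def expand(midpoints, counts):
    """Rebuild an approximate data set by repeating each midpoint 'count' times."""
    return [m for m, count in zip(midpoints, counts) for _ in range(count)]


flat_sd = statistics.pstdev(expand(midpoints, flat_counts))
mound_sd = statistics.pstdev(expand(midpoints, mound_counts))

print(f"flat histogram:  standard deviation is about {flat_sd:.2f}")
print(f"mound histogram: standard deviation is about {mound_sd:.2f}")
```

The flat histogram comes out with the larger standard deviation, because just as many individuals sit in the far-out bars as in the middle one.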
HEADS UP | Variability in a histogram should not be confused with variability in a time chart (see the " |
A histogram can also give you some idea of where the center of the data lies. The center of a data set is measured in different ways (see Chapter 5 for a discussion of these measures). One way to eyeball the center on a histogram is to think of the histogram as a picture of people sitting on a teeter-totter and the center as the point where the fulcrum has to be in order to balance the weight on each side. Refer to Figure 4-15, which shows the ages of Colorado mothers in 1975 and 2000, and note that the midpoint appears to be around 25 years for the 1975 histogram and around 27.5 years for the 2000 histogram. This suggests that in the year 2000, Colorado women were having children at older ages, on average, than they did in 1975.
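The teeter-totter idea can be turned into arithmetic: the balance point is the average of the bar midpoints, with each midpoint weighted by the height of its bar. The Python sketch below uses made-up relative frequencies (they are not read from Figure 4-15) and assumes every individual sits at the midpoint of its bar, so treat the result as a rough estimate of the center rather than an exact value.

```python
# Midpoints of five age groups (made-up grouping, 5 years wide)
midpoints = [17.5, 22.5, 27.5, 32.5, 37.5]

# Made-up relative frequencies (as proportions) for each group
relative_freq = [0.10, 0.30, 0.35, 0.20, 0.05]

# Balance point: each midpoint weighted by the share of data in its bar
balance_point = (
    sum(m * f for m, f in zip(midpoints, relative_freq)) / sum(relative_freq)
)

print(f"The histogram balances at about {balance_point:.1f} years")
```

Bars that are taller or farther from the middle pull the balance point toward themselves, just as a heavier person or one sitting farther out does on a teeter-totter.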
Histograms aren't as commonly found in the media as they should be. The reason for this is not clear, and tables are much more commonly used to show breakdowns for numerical data. However, a histogram can be informative, especially when used to compare one group or time period to another. At any rate, if you want to look at data graphically, you can always take data from a table and convert them to a data display.
HEADS UP | Watch for histograms that use unusual scales to mislead readers. As with bar graphs, you can exaggerate differences by using a smaller scale on the vertical axis of a histogram, and you can play down differences by using a larger scale. |
Readers can be misled by a histogram in ways that aren't possible with a bar graph. Remember that a histogram deals with numerical data, not categorical data. This means that you need to determine how you want the numerical data to be broken down into groups to display on the horizontal axis. How you determine those groupings can make the graph look very different.
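To get a feel for how much the groupings matter, here is a short Python sketch that draws the same made-up data set twice as a rough text histogram: once with bars 5 years wide and once with bars 15 years wide. The ages and the helper function `bar_counts` are invented for this illustration, and borderline values are assumed to go into the higher bar.

```python
def bar_counts(values, start, width, n_bars):
    """Count how many values fall into each of n_bars bars of equal width."""
    counts = [0] * n_bars
    for value in values:
        index = int((value - start) // width)
        if 0 <= index < n_bars:
            counts[index] += 1
    return counts


# Twenty made-up ages, for illustration only
ages = [18, 19, 21, 22, 22, 23, 24, 24, 25, 26,
        26, 27, 28, 29, 31, 33, 34, 36, 38, 41]

narrow = bar_counts(ages, start=15, width=5, n_bars=6)   # 15-20, 20-25, ..., 40-45
wide = bar_counts(ages, start=15, width=15, n_bars=2)    # 15-30, 30-45

for label, counts, width in [("5-year bars", narrow, 5), ("15-year bars", wide, 15)]:
    print(label)
    for i, count in enumerate(counts):
        low = 15 + i * width
        print(f"  {low}-{low + width}: {'#' * count}")
```

With the narrow bars you can see a mound with a tail trailing off to the right; with the wide bars all of that detail collapses into one tall bar and one short bar.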