Statistics for Dummies (17 page)

Authors: Deborah Jean Rumsey


Picturing Data with a Histogram

Numerical data in their raw, unorganized form are hard to absorb. For example, look at Table 4-6, which shows the 2000 population estimates for each of the 50 states (and the District of Columbia), put together by the U.S. Census Bureau. Stare at the table for 30 seconds or so. After you've done that, go ahead and try to answer these questions quickly:

  • Which states have the largest/smallest populations?

  • How many people reside in most of the states? Give a rough range of values.

  • How much variability exists between state populations? (Are the states very similar, or very different, in terms of their total population?)

Table 4-6: Population Estimates by State (2000 Census)

State        Census 2000 Population
Alabama                   4,447,100
Alaska                      626,932
Arizona                   5,130,632
Arkansas                  2,673,400
California               33,871,648
Colorado                  4,301,261
Connecticut               3,405,565
Delaware                    783,600
District of Columbia        572,059
Florida                  15,982,378
Georgia                   8,186,453
Hawaii                    1,211,537
Idaho                     1,293,953
Illinois                 12,419,293
Indiana                   6,080,485
Iowa                      2,926,324
Kansas                    2,688,418
Kentucky                  4,041,769
Louisiana                 4,468,976
Maine                     1,274,923
Maryland                  5,296,486
Massachusetts             6,349,097
Michigan                  9,938,444
Minnesota                 4,919,479
Mississippi               2,844,658
Missouri                  5,595,211
Montana                     902,195
Nebraska                  1,711,263
Nevada                    1,998,257
New Hampshire             1,235,786
New Jersey                8,414,350
New Mexico                1,819,046
New York                 18,976,457
North Carolina            8,049,313
North Dakota                642,200
Ohio                     11,353,140
Oklahoma                  3,450,654
Oregon                    3,421,399
Pennsylvania             12,281,054
Rhode Island              1,048,319
South Carolina            4,012,012
South Dakota                754,844
Tennessee                 5,689,283
Texas                    20,851,820
Utah                      2,233,169
Vermont                     608,827
Virginia                  7,078,515
Washington                5,894,121
West Virginia             1,808,344
Wisconsin                 5,363,675
Wyoming                     493,782
U.S. TOTAL              281,421,906

Without some way of organizing these data, you have difficulty answering these questions. Although most of the media favors the use of tables to organize numerical data, statisticians favor the histogram as their data display of choice for this kind of data. What is a histogram, you ask?

A histogram is basically a bar graph that applies to numerical data. Because the data are numerical, the categories are ordered from smallest to largest (as opposed to categorical data, such as gender, which has no inherent order to it). And because you want to be sure each number falls into exactly one group, the bars on a histogram touch each other but don't overlap. Each bar is marked on the x-axis (or horizontal axis) by the value representing its midpoint. For example, suppose a histogram showing length of time until failure of a car part (in hours) has two adjacent bars marked with midpoints of 1,000 hours and 2,000 hours, and each bar has a width of 1,000 hours (500 hours on either side of its midpoint). This means the first bar represents car parts that lasted anywhere from 500 to 1,500 hours, and the second bar represents car parts that lasted anywhere from 1,500 to 2,500 hours. (Numbers on the border can go on either side, as long as you're consistent for all the borderline values.)
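As a minimal sketch of how a bar's midpoint determines its boundaries, the Python below works through the car-part example, in which the bars run from 500 to 1,500 hours and from 1,500 to 2,500 hours (so each bar spans 1,000 hours). The function names are made up for illustration, and sending border values to the upper bar is just one consistent convention among several.

```python
def bar_edges(midpoint, width):
    """Return the (lower, upper) boundaries of a bar centered on `midpoint`."""
    half = width / 2
    return (midpoint - half, midpoint + half)


def bar_index(value, first_lower_edge, width):
    """Return the position (0, 1, 2, ...) of the bar that `value` falls into.

    Border values are always sent to the upper bar, so every number lands
    in exactly one group.
    """
    return int((value - first_lower_edge) // width)


# The two bars from the car-part example.
first_bar = bar_edges(1000, 1000)    # (500.0, 1500.0)
second_bar = bar_edges(2000, 1000)   # (1500.0, 2500.0)

# A part that failed at exactly 1,500 hours sits on the border;
# under this convention it is counted in the second bar (index 1).
border_bar = bar_index(1500, first_bar[0], 1000)
```

Whichever side you pick for border values, the point is to apply the same rule to every bar.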

The height of each bar of a histogram represents either the number of individuals in each group (also known as the frequency of each group) or the percentage of individuals in each group (also known as the relative frequency of each group). For example, if 50% of the car parts lasted between 500 and 1,500 hours, the first bar in the preceding example would have a relative frequency of 50%, and the height of that bar would be reflective of that.

You can see a histogram of the state population data in Figure 4-14. You can easily answer most of the questions at the beginning of this section by looking quickly at the histogram. And in my opinion, in many situations, a histogram provides a more interesting organizational summary of a data set than a table does.

Figure 4-14: State population sizes (2000 Census).

A majority of the states and the District of Columbia (31 out of 51, or 60.8%) have fewer than 5 million people. Another 25.5% have populations of between 5 and 10 million. This means that 86.3% of the states have fewer than 10 million people each. Each of the remaining seven states has a very large population, making the histogram look lopsided and trailing off to the right (this is called skewed to the right). Except for those few very large states, the populations of the states aren't as variable as you may think. The histogram doesn't tell you which state is which, of course, but a quick sorting of the original data can tell you which states are largest and smallest. The five most populous states are California, Texas, New York, Florida, and Illinois (which is closely followed by Pennsylvania). The smallest state is Wyoming with about 494,000 people.
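The counts and percentages quoted above can be checked directly from Table 4-6. The following Python sketch groups the 51 populations into the same three ranges (under 5 million, 5 to 10 million, and 10 million or more) and sorts the states by size. The variable and function names are illustrative choices, not anything from the book.

```python
# Census 2000 populations copied from Table 4-6
# (50 states plus the District of Columbia).
POPULATIONS = {
    "Alabama": 4_447_100, "Alaska": 626_932, "Arizona": 5_130_632,
    "Arkansas": 2_673_400, "California": 33_871_648, "Colorado": 4_301_261,
    "Connecticut": 3_405_565, "Delaware": 783_600,
    "District of Columbia": 572_059, "Florida": 15_982_378,
    "Georgia": 8_186_453, "Hawaii": 1_211_537, "Idaho": 1_293_953,
    "Illinois": 12_419_293, "Indiana": 6_080_485, "Iowa": 2_926_324,
    "Kansas": 2_688_418, "Kentucky": 4_041_769, "Louisiana": 4_468_976,
    "Maine": 1_274_923, "Maryland": 5_296_486, "Massachusetts": 6_349_097,
    "Michigan": 9_938_444, "Minnesota": 4_919_479, "Mississippi": 2_844_658,
    "Missouri": 5_595_211, "Montana": 902_195, "Nebraska": 1_711_263,
    "Nevada": 1_998_257, "New Hampshire": 1_235_786, "New Jersey": 8_414_350,
    "New Mexico": 1_819_046, "New York": 18_976_457,
    "North Carolina": 8_049_313, "North Dakota": 642_200, "Ohio": 11_353_140,
    "Oklahoma": 3_450_654, "Oregon": 3_421_399, "Pennsylvania": 12_281_054,
    "Rhode Island": 1_048_319, "South Carolina": 4_012_012,
    "South Dakota": 754_844, "Tennessee": 5_689_283, "Texas": 20_851_820,
    "Utah": 2_233_169, "Vermont": 608_827, "Virginia": 7_078_515,
    "Washington": 5_894_121, "West Virginia": 1_808_344,
    "Wisconsin": 5_363_675, "Wyoming": 493_782,
}


def percent(count, total):
    """Relative frequency as a percentage, rounded to one decimal place."""
    return round(100 * count / total, 1)


total = len(POPULATIONS)
under_5m = sum(1 for p in POPULATIONS.values() if p < 5_000_000)
from_5m_to_10m = sum(1 for p in POPULATIONS.values()
                     if 5_000_000 <= p < 10_000_000)
over_10m = total - under_5m - from_5m_to_10m

# Sort state names from most to least populous.
ranked = sorted(POPULATIONS, key=POPULATIONS.get, reverse=True)

print(under_5m, percent(under_5m, total))              # frequency, relative frequency
print(from_5m_to_10m, percent(from_5m_to_10m, total))
print(over_10m, percent(over_10m, total))
print(ranked[:5], ranked[-1])
```

The first number on each printed line is the frequency for that group and the second is its relative frequency, the two quantities a histogram bar's height can represent.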

Tip 

If questions come up while you're looking at a data display, try to get access to the original data set. Researchers should be able to provide you with their data if you ask for them.

Analyzing mothers' ages

In one birth statistics example (refer to Table 4-3), the age of the mother is shown for various years from 1975 to 2000. For any year on the table, the age variable is divided into groups, and you're given the number of mothers in each group. Because you're given the total numbers, you can make a histogram of mothers' ages showing either the frequencies or the relative frequencies, whichever is the most appropriate in terms of the point you want to make.

Suppose you want to compare the ages of mothers in 1975 and 2000. You can make two histograms, one for each year, and compare the results. Figure 4-15 shows two such histograms for 1975 (top) and 2000 (bottom). Notice that the relative frequencies (or percentages) are shown on the vertical axis, and the age groups for the mothers are shown on the horizontal axis.

A histogram can summarize the features of numerical data quite well. One of the features that a histogram can show you is the so-called shape of the data (in other words, how the data are distributed among the groups). Are the data distributed evenly, in a uniform way? Are the data symmetric, meaning that the left-hand side of the histogram is a mirror image of the right-hand side of the histogram? Does the histogram have a U-shape, with lots of data on extreme ends and not much in the middle? Does the histogram of the data have a bell-shape, meaning that it looks like a mound in the middle with tails trailing off in either direction as you move away from the center? Or is the histogram skewed, meaning that it looks like a lopsided mound with one long tail either going off to the right (indicating the data are skewed right) or going off to the left (indicating the data are skewed left)?

Mothers' ages in Figure 4-15 for years 1975 and 2000 appear to be mostly mound-shaped, although the data for 1975 are slightly more skewed to the right, indicating that as the women got older, fewer of them had babies, relative to the situation in the year 2000. Another way of saying this is that in the year 2000, a higher proportion of older women were having babies compared to 1975.

Figure 4-15: Colorado live births, by age of mother for 1975 and 2000.

You can also get a sense of how much variability exists in the data by looking at a histogram. If a histogram is quite flat with the bars close to the same height, you may think this indicates less variability because the heights of the bars are similar. In fact, the opposite is true. That's because you have an equal number in each bar, but the bars themselves represent different ranges of values, so the entire data set is actually quite spread out. Now if the histogram has a big lump in the middle with tails on the sides, this indicates that more data are in the middle bars than the outer bars, so the data are actually closer together. Comparing 1975 mothers' ages to 2000 mothers' ages, you see more variability in 2000 than in 1975. This, again, indicates changing times; more of today's women are waiting to have children, and the length of time they wait varies, whereas in 1975 most women had their children by age 30. (Chapter 5 shows you ways to measure variability in a data set.)
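To see why a flat histogram means more variability than a mound-shaped one, here is a small Python sketch that compares the two. The bar midpoints and counts are invented purely for illustration (they are not taken from any figure in this chapter), and the standard deviation is used as one possible measure of spread.

```python
from statistics import pstdev


def expand(midpoints, counts):
    """Rebuild an approximate data set by repeating each bar's midpoint
    as many times as that bar's frequency."""
    return [m for m, c in zip(midpoints, counts) for _ in range(c)]


# Hypothetical bars: same midpoints and same total (20 values),
# but different shapes.
midpoints = [1, 2, 3, 4, 5]
flat_counts = [4, 4, 4, 4, 4]      # every bar the same height
mound_counts = [1, 4, 10, 4, 1]    # big lump in the middle

flat_spread = pstdev(expand(midpoints, flat_counts))
mound_spread = pstdev(expand(midpoints, mound_counts))

# The flat histogram has the larger standard deviation: its values are
# spread evenly over the whole range rather than piled near the center.
print(round(flat_spread, 3), round(mound_spread, 3))
```

Both made-up data sets cover the same range, so the difference in spread comes entirely from how the values are distributed among the bars.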

HEADS UP 

Variability in a histogram should not be confused with variability in a time chart (see the "Keeping Pace with Time Charts" section). If values change over time, they're shown on a time chart as highs and lows, and many changes from high to low (over time) indicate lots of variability. So, a flat line on a time chart indicates no change and no variability in the values across time. However, when the heights of bars of a histogram appear to be flat (uniform), this shows the opposite: the values are spread out uniformly over many groups, indicating a great deal of variability in the data at one point in time.

A histogram can also give you some idea of where the center of the data lies. The center of a data set is measured in different ways (see Chapter 5 for a discussion of these measures). One way to eyeball the center on a histogram is to think of the histogram as a picture of people sitting on a teeter-totter and the center as the point where the fulcrum has to be in order to balance the weight on each side. Refer to Figure 4-15, which shows the ages of Colorado mothers in 1975 and 2000, and note that the mid-point appears to be around 25 years for the 1975 histogram and around 27.5 years for the 2000 histogram. This suggests that in the year 2000, Colorado women were having children at older ages, on average, than they did in 1975.
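The teeter-totter idea can be sketched in Python: the balance point of a histogram is the average of the bar midpoints, with each midpoint weighted by the height of its bar. The age groups and percentages below are hypothetical placeholders chosen for illustration; they are not the actual Colorado figures behind Figure 4-15.

```python
def balance_point(midpoints, heights):
    """Weighted average of bar midpoints, using bar heights
    (frequencies or relative frequencies) as the weights."""
    total_weight = sum(heights)
    return sum(m * h for m, h in zip(midpoints, heights)) / total_weight


# Hypothetical age groups (midpoints in years) and relative frequencies (%).
age_midpoints = [17.5, 22.5, 27.5, 32.5, 37.5]
relative_freqs = [20, 35, 30, 10, 5]

center = balance_point(age_midpoints, relative_freqs)
print(center)   # the fulcrum position for these made-up bars
```

Because the calculation divides by the total weight, it gives the same answer whether the bar heights are counts or percentages.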

Histograms aren't as commonly found in the media as they should be. The reason for this is not clear, and tables are much more commonly used to show breakdowns for numerical data. However, a histogram can be informative, especially when used to compare one group or time period to another. At any rate, if you want to look at data graphically, you can always take data from a table and convert them to a data display.

HEADS UP 

Watch for histograms that use unusual scales to mislead readers. As with bar graphs, you can exaggerate differences by using a smaller scale on the vertical axis of a histogram, and you can play down differences by using a larger scale.

Readers can be misled by a histogram in ways that aren't possible with a bar graph. Remember that a histogram deals with numerical data, not categorical data. This means that you need to determine how you want the numerical data to be broken down into groups to display on the horizontal axis. How you determine those groupings can make the graph look very different.
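As a rough illustration of how the choice of groupings changes the picture, the Python sketch below counts the same values twice, once with 10-year groups and once with 5-year groups. The ages are invented for the example, and the helper name is hypothetical.

```python
from collections import Counter


def group_counts(values, width):
    """Count how many values fall in each group of the given width.
    Each group is labeled by its lower boundary; border values go
    to the upper group."""
    counts = Counter((v // width) * width for v in values)
    return dict(sorted(counts.items()))


# Hypothetical ages, used only to show the effect of regrouping.
ages = [18, 19, 21, 22, 23, 24, 26, 27, 28, 29, 31, 33, 36, 41]

wide_groups = group_counts(ages, 10)    # 10-19, 20-29, 30-39, 40-49
narrow_groups = group_counts(ages, 5)   # 15-19, 20-24, 25-29, ...

print(wide_groups)
print(narrow_groups)
```

With 10-year groups, one bar towers over the rest; with 5-year groups, the same values form two equal bars in the twenties and a gentler tail, so the two graphs would leave quite different impressions.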
