This test is used when the variable is numerical (for example, income, cholesterol level, or miles per gallon) and two populations or groups are being compared (for example, men versus women, athletes versus non-athletes, or cars versus SUVs). Two separate random samples need to be selected, one from each population, in order to collect the data needed for this test. The null hypothesis is that the two population means are the same; in other words, that their difference is equal to 0. The notation for the null hypothesis is $H_0: \mu_x - \mu_y = 0$, where $\mu_x$ represents the mean of the first population, and $\mu_y$ represents the mean of the second population.
The formula for the test statistic comparing two means is

$$\frac{\bar{x} - \bar{y}}{\sqrt{\dfrac{s_x^2}{n_1} + \dfrac{s_y^2}{n_2}}}$$

To calculate it, do the following:
1. Calculate the sample means ($\bar{x}$ and $\bar{y}$) and sample standard deviations ($s_x$ and $s_y$) for each sample separately. Let $n_1$ and $n_2$ represent the two sample sizes (they need not be equal). See Chapter 4 for these calculations.
2. Find the difference between the two sample means, $\bar{x} - \bar{y}$.
3. Calculate the standard error, $\sqrt{\dfrac{s_x^2}{n_1} + \dfrac{s_y^2}{n_2}}$. Save your answer.
4. Divide your result from Step 2 by your result from Step 3.
To interpret the test statistic, look up your test statistic on the standard normal distribution (see Table 8-1 in Chapter 8) and calculate the p-value (see Chapter 14 for more on p-value calculations).
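If you have the raw measurements rather than summary statistics, the four steps above can be sketched in a few lines of Python. This is a minimal sketch, not a definitive implementation: the function name `two_sample_z_test` is made up for illustration, it assumes both samples are large enough for the standard normal distribution to apply, and it assumes a not-equal-to alternative (so the tail area is doubled).

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def two_sample_z_test(sample_x, sample_y):
    """Return (test statistic, two-sided p-value) for H0: mu_x - mu_y = 0.

    Hypothetical helper; assumes large, independent random samples.
    """
    n1, n2 = len(sample_x), len(sample_y)
    # Step 1: sample means and sample standard deviations.
    x_bar, y_bar = mean(sample_x), mean(sample_y)
    s_x, s_y = stdev(sample_x), stdev(sample_y)
    # Step 2: difference between the two sample means.
    diff = x_bar - y_bar
    # Step 3: standard error of that difference.
    std_err = sqrt(s_x**2 / n1 + s_y**2 / n2)
    # Step 4: test statistic.
    z = diff / std_err
    # Not-equal-to alternative: double the area beyond |z| in one tail.
    p_value = 2 * NormalDist().cdf(-abs(z))
    return z, p_value
```

Calling `two_sample_z_test(sample_x, sample_y)` with two lists of numbers returns the test statistic and its p-value together. For a less-than or greater-than alternative you would use a single tail instead of doubling.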
For example, suppose you want to compare the absorbency of two brands of paper towels (call the brands Stats-absorbent and Sponge-o-matic). You can make this comparison by looking at the average number of ounces each brand can absorb before being saturated. $H_0$ says the difference between the average absorbencies is 0 (non-existent), and $H_a$ says the difference is not 0. In other words, $H_0: \mu_x - \mu_y = 0$ versus $H_a: \mu_x - \mu_y \neq 0$. Here, you have no indication of which paper towel may be more absorbent, so the not-equal-to alternative is the one to use. (See Chapter 14.)
Suppose you select a random sample of 50 paper towels from each brand and measure the absorbency of each paper towel. Suppose the average absorbency of Stats-absorbent ($\bar{x}$) is 3 ounces, with a standard deviation of 0.9 ounces, and for Sponge-o-matic ($\bar{y}$), the average absorbency is 3.5 ounces, with a standard deviation of 1.2 ounces.
1. Given these data, you have $\bar{x} = 3$, $s_x = 0.9$, $\bar{y} = 3.5$, $s_y = 1.2$, $n_1 = 50$, and $n_2 = 50$.
2. The difference between the sample means for (Stats-absorbent minus Sponge-o-matic) is $3 - 3.5 = -0.5$ ounces. (A negative difference simply means that the second sample mean was larger than the first.)
3. The standard error is $\sqrt{\dfrac{0.9^2}{50} + \dfrac{1.2^2}{50}} = \sqrt{0.045} \approx 0.2121$.
4. Divide the difference, $-0.5$, by the standard error, 0.2121, which gives you $-2.36$, which rounds to $-2.4$. This is your test statistic.
To find the p-value, look up $-2.4$ on the standard normal distribution (Z-distribution); see Table 8-1 in Chapter 8. The chance of being beyond, in this case to the left of, $-2.4$ is equal to the percentile, which is 0.82%. Because $H_a$ is a not-equal-to alternative, you double this percentage to get 2 × 0.82% = 1.64%. Finally, change this to a probability by dividing by 100 to get a p-value of 0.0164. This p-value is less than 0.05. That means you do have enough evidence to reject $H_0$.
Your conclusion is that a statistically significant difference exists between the absorbency levels of these two brands of paper towels, based on your samples. And it looks like Sponge-o-matic comes out on top, because it has a higher average.
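If you'd like to check the paper towel arithmetic without a table, here is a small sketch that works directly from the summary statistics in the example. The variable names are chosen for illustration only. It computes the p-value twice: once with the test statistic rounded to one decimal place (as the table lookup does) and once with the unrounded value, so you can see how much the rounding matters.

```python
from math import sqrt
from statistics import NormalDist

# Summary statistics from the paper towel example.
x_bar, s_x, n1 = 3.0, 0.9, 50  # Stats-absorbent
y_bar, s_y, n2 = 3.5, 1.2, 50  # Sponge-o-matic

diff = x_bar - y_bar                       # difference between sample means
std_err = sqrt(s_x**2 / n1 + s_y**2 / n2)  # standard error of the difference
z = diff / std_err                         # test statistic

# Not-equal-to alternative: double the area to the left of the (negative) z.
p_table = 2 * NormalDist().cdf(round(z, 1))  # z rounded to one decimal, as in the table lookup
p_exact = 2 * NormalDist().cdf(z)            # unrounded z

print(f"z = {z:.2f}")
print(f"p-value with rounded z   = {p_table:.4f}")
print(f"p-value with unrounded z = {p_exact:.4f}")
```

The unrounded test statistic gives a slightly larger p-value than the table lookup (roughly 0.018), but both are below 0.05, so the conclusion doesn't change.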
HEADS UP | Being the savvy statistician you are, don't fall for those commercials that show one single sheet from one single roll of paper towels (that is, a sample size of 1) being more absorbent than another. And don't give credibility to those morning TV news shows that send producers on the street asking two or three people for information and making comparisons. Anecdotes are interesting, but they can't be generalized. A hypothesis test, done right, gives results that are both interesting |
TECHNICAL STUFF | Most hypothesis tests comparing two separate population means are done using samples that are quite large, because they are most often based on surveys. However, if both samples do happen to be under 30 in size, you need to use the t-distribution (with degrees of freedom equal to |
This test is used when the variable is numerical (for example, income, cholesterol level, or miles per gallon), and the individuals in the sample are either paired up in some way (identical twins are often used) or the same people are used twice (for example, using a pre-test and post-test). Paired tests are typically used in studies that test whether a new treatment, technique, or method works better than an existing method, without having to worry about other factors about the subjects that may influence the results. See Chapter 17 for details.
For example, suppose a researcher wants to see whether teaching students to read using a computer game gives better results than teaching with a tried-and-true phonics method. She randomly selects 20 students and puts them into 10 pairs according to their reading readiness level, age, IQ, and so on. She randomly selects one student from each pair to learn to read via the computer game, and the other learns to read using the phonics method. At the end of the study, each student takes the same reading test. The data are shown in Table 15-1.
Student Pair # | Reading Score for Student under Computer Method | Reading Score for Student under Phonics Method | Paired Differences (Computer Score − Phonics Score) |
---|---|---|---|
1 | 85 | 80 | +5 |
2 | 80 | 80 | +0 |
3 | 95 | 88 | +7 |
4 | 87 | 90 | –3 |
5 | 78 | 72 | +6 |
6 | 82 | 79 | +3 |
7 | 57 | 50 | +7 |
8 | 69 | 73 | –4 |
9 | 73 | 78 | –5 |
10 | 99 | 95 | +4 |
The data are in pairs, but you're really interested only in the difference in reading scores (computer reading score minus phonics reading score) for each pair, not the reading scores themselves. So, you take the difference between the scores for each pair, and those paired differences make up your new set of data to work with. If the two reading methods are the same, the average of the paired differences should be 0. If the computer method is better, the average of the paired differences should be positive (because the computer reading score should be larger than the phonics score). So you really have a hypothesis test for one population mean, where the null hypothesis is that the mean (of the paired differences) is 0, and the alternative hypothesis is that the mean (of the paired differences) is > 0.
The notation for the null hypothesis is $H_0: \mu_d = 0$, where $\mu_d$ is the mean of the paired differences. (The $d$ in the subscript is just supposed to remind you that you're working with the paired differences.)
The formula for the test statistic for paired differences is

$$\frac{\bar{d}}{s / \sqrt{n}}$$

To calculate it, do the following:
1. For each pair of data, take the first value in the pair minus the second value in the pair to find the paired difference. Think of the differences as your new data set.
2. Calculate the mean, $\bar{d}$, and the standard deviation, $s$, of all the differences. Let $n$ represent the number of paired differences that you have.
3. Calculate the standard error: $s / \sqrt{n}$. Save your answer.
4. Take $\bar{d}$ divided by the standard error from Step 3.
REMEMBER | Remember that |
For the reading scores example, you can use the preceding steps to see whether the computer method is better in terms of teaching students to read.
1. Calculate the differences for each pair; you can see those differences in column 4 of Table 15-1. Notice that the sign on each of the differences is important; it indicates which method performed better for that particular pair.
2. The mean and standard deviation of the differences (column 4 of Table 15-1) must be calculated. (See Chapter 4 for calculating means and standard deviations.) The mean of the differences is found to be +2, and the standard deviation is 4.64. Note that $n = 10$ here.
3. The standard error is 4.64 divided by the square root of 10 (3.16). So you have 4.64 ÷ 3.16 = 1.47. (Remember that here, $n$ is the number of pairs, which is 10.)
4. For the last step, take the mean of the differences, +2, divided by the standard error, which is 1.47, to get +1.36, the test statistic. That means the average difference for this sample is 1.36 standard errors above 0. Is this enough to say that a difference in reading scores applies to the whole population in general?
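The same calculation can be reproduced from the raw scores in Table 15-1. The following is a minimal sketch using only Python's standard library; the variable names are illustrative, and `stdev` is the sample standard deviation (dividing by $n - 1$), which is what gives the 4.64 quoted here.

```python
from math import sqrt
from statistics import mean, stdev

# Reading scores from Table 15-1, one entry per student pair.
computer = [85, 80, 95, 87, 78, 82, 57, 69, 73, 99]
phonics = [80, 80, 88, 90, 72, 79, 50, 73, 78, 95]

# Step 1: paired differences (computer score minus phonics score).
diffs = [c - ph for c, ph in zip(computer, phonics)]

# Step 2: mean and sample standard deviation of the differences.
d_bar = mean(diffs)
s = stdev(diffs)
n = len(diffs)  # number of pairs, not number of students

# Step 3: standard error of the mean difference.
std_err = s / sqrt(n)

# Step 4: test statistic.
t_stat = d_bar / std_err

print(f"mean difference = {d_bar}, s = {s:.2f}, "
      f"standard error = {std_err:.2f}, test statistic = {t_stat:.2f}")
```

Notice that the individual reading scores drop out after Step 1; everything afterward uses only the list of differences, which is why this is really a one-sample test.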
Because $n$ is less than 30, you look up 1.36 on the t-distribution with 10 − 1 = 9 degrees of freedom (see Table 14-2 in Chapter 14) to calculate the p-value. The p-value in this case is greater than 0.05. Here's why: 1.36 is slightly less than the table value of 1.38, which sits in the column under the 90th percentile. Because $H_a$ is a greater-than alternative, the p-value that corresponds to 1.38 is 100% − 90% = 10% = 0.10, so a test statistic smaller than 1.38 has a p-value of more than 0.10. You conclude that there isn't enough evidence to reject $H_0$, so the computer game can't be touted as a better reading method. (This could be because a sample this small doesn't supply the additional evidence needed to prove the point.)
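If you don't have a t-table handy, the tail area can be approximated numerically. The sketch below is one way to do it under stated assumptions: the helper names `t_pdf` and `t_upper_tail` are made up for illustration, the area is approximated with Simpson's rule rather than looked up, and it handles only a greater-than alternative with a non-negative test statistic. If SciPy is available, `scipy.stats.t.sf` is the more usual way to get the same tail area.

```python
from math import gamma, pi, sqrt


def t_pdf(x, df):
    """Density of the t-distribution with df degrees of freedom."""
    coef = gamma((df + 1) / 2) / (sqrt(df * pi) * gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)


def t_upper_tail(t_stat, df, steps=2000):
    """Approximate P(T > t_stat) for t_stat >= 0 (hypothetical helper).

    Integrates the density from 0 to t_stat with Simpson's rule
    (steps must be even) and subtracts that area from 0.5.
    """
    if t_stat < 0:
        raise ValueError("this sketch only handles t_stat >= 0")
    h = t_stat / steps
    total = t_pdf(0, df) + t_pdf(t_stat, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area_0_to_t = total * h / 3
    # The t-distribution is symmetric about 0, so P(T > 0) = 0.5.
    return 0.5 - area_0_to_t


# Reading example: test statistic 1.36 with 10 - 1 = 9 degrees of freedom.
p_value = t_upper_tail(1.36, df=9)
print(f"one-sided p-value = {p_value:.3f}")
```

The result comes out slightly above 0.10, which agrees with the table-based reasoning: the p-value is more than 0.10, so it is well above 0.05.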
HEADS UP | In many paired experiments, the data sets will be small due to costs and time associated with doing these kinds of studies. That means the t-distribution (see |