HEADS UP | Notice that the standard error of the sample proportion actually contains |
You can use the central limit theorem to answer questions involving proportions. For example, suppose you want to know what proportion of incoming college students would like some help in math. A student survey accompanies the ACT test each year, and one of the questions asked is whether each student would like some help with his or her math skills. In 2002, 38% of the students taking the ACT test responded yes to this question. This is a situation in which the population proportion, p, is known (p = 0.38). The original data in this case (as with all categorical data) do not have a normal distribution because only two results are possible: yes or no. The distribution of the population of answers to the math skills question is shown in Figure 9-5 as a bar graph (see Chapter 4 for more information on bar graphs).
Suppose you were to take samples of size 100 from this population of over a million students (all of the students who took the ACT test in 2002), and find the proportion who indicated that they needed help with their math skills in each case. The distribution of all sample proportions is shown in Figure 9-6. It is approximately a normal distribution with mean p = 0.38 and standard error = $\sqrt{0.38(1 - 0.38)/100} \approx 0.049$, or 4.9%, or about 5%. Using the CLT, you can say that some of the sample proportions are higher than 0.38, some are lower, but most of them (about 95% of them) lie in the area of 0.38 plus or minus 2 × 0.05 = 0.10, or 38% ± 10%. These results still vary by quite a bit: by 10 percentage points on either side of the population proportion.
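The standard error and the two-standard-error range quoted above can be checked with a few lines of arithmetic. The following is a minimal sketch, not anything from the ACT survey itself; the function name `proportion_standard_error` is made up for illustration, and it assumes the usual formula for the standard error of a sample proportion, the square root of p(1 − p)/n.

```python
import math

def proportion_standard_error(p, n):
    """Standard error of a sample proportion: sqrt(p * (1 - p) / n)."""
    return math.sqrt(p * (1 - p) / n)

p = 0.38   # population proportion who want help with math (from the text)
n = 100    # size of each sample

se = proportion_standard_error(p, n)

# About 95% of sample proportions fall within two standard errors of p.
low, high = p - 2 * se, p + 2 * se

print(f"standard error = {se:.3f}")
print(f"about 95% of sample proportions fall between {low:.2f} and {high:.2f}")
```

Rounded to two decimal places, the range agrees with the 38% ± 10% figure in the text.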
Now take samples of size 1,000 from the original population and find the proportion who responded that they needed help with math skills for each sample. The distribution of sample proportions in this case will look much like Figure 9-7. Everything will look the same as Figure 9-6, except that the distribution will be tighter; the standard error would reduce to $\sqrt{0.38(1 - 0.38)/1000} \approx 0.015$, or 1.5%. About 95% of the sample results will lie between 0.38 − 2(0.015) and 0.38 + 2(0.015), or between 0.35 and 0.41 (that is, between 35% and 41%). In other words, if you take several different samples all of size 1,000 from this population and find the sample proportion for each sample, your sample proportions won't change much from sample to sample. Instead, they'll all be quite close together, thanks to the large sample size of 1,000.
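One way to see this tightening is to simulate it. The sketch below is an illustration under stated assumptions, not the survey's actual data: it treats each student's answer as an independent yes with probability 0.38, draws 1,000 samples at each sample size (an arbitrary choice), and compares the spread of the simulated sample proportions with the formula value. The function name `simulate_sample_proportions` and the seed are hypothetical.

```python
import math
import random
import statistics

def simulate_sample_proportions(p, n, num_samples, rng):
    """Draw num_samples samples of size n and return each sample's proportion of yes answers."""
    proportions = []
    for _ in range(num_samples):
        yes_count = sum(1 for _ in range(n) if rng.random() < p)
        proportions.append(yes_count / n)
    return proportions

rng = random.Random(0)  # fixed seed so the run is repeatable
p = 0.38

results = {}
for n in (100, 1000):
    proportions = simulate_sample_proportions(p, n, 1000, rng)
    simulated_se = statistics.pstdev(proportions)   # spread of the simulated proportions
    formula_se = math.sqrt(p * (1 - p) / n)          # standard error from the formula
    results[n] = (simulated_se, formula_se)
    print(f"n = {n}: simulated spread = {simulated_se:.4f}, formula = {formula_se:.4f}")
```

The simulated spread should land close to the formula value for each sample size, and the spread for samples of 1,000 should come out at roughly a third of the spread for samples of 100.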
REMEMBER | Before you draw conclusions from any sample percentages, get some idea of how much the results should vary by finding the standard error or the margin of error (which is about two standard errors; see |
TECHNICAL STUFF | How large is large enough for the central limit theorem to work for categorical data? Most statisticians agree that you should have |
The CLT is good news for people who are trying to interpret sample results. As long as the sample size is large enough (and the data are credible and unbiased), the information reported will be close to the truth. (But remember, I said as long as the results are credible and unbiased. See Chapter 2 for examples of how statistics can go wrong.)
HEADS UP | The central limit theorem also allows you to answer other important questions regarding sample means and proportions. For example, if a package delivery service promises an average delivery time of two days, and your sample of 30 packages took an average of 2.4 days, is this enough evidence to say that the company is guilty of false advertising? Or was this just an atypical sample of late packages? I address this type of question in |
If you're worried that you always need to know the population mean (μ) or the population proportion (p) in order to use the CLT, never fear! You will find out the secret that statisticians have known for years: If you don't know what a certain value is, just estimate it and move on. (More on this in Chapter 11.)
Two major factors influence the amount of variability in a sample mean or sample proportion: the size of the sample and the amount of variability in the original population.
The size of the sample affects the amount of variability in the sample results. Suppose that you have a pond of fish, and you want to find the average length of all the fish in the pond. If you take repeated random samples of size 100 and repeated random samples of size 1,000, recording the sample mean each time, which sample means would vary more, those of size 100 or those of size 1,000? Those of size 100 would vary more, because each of the sample means was based on less information (that is, on fewer fish). Sample proportions would be affected similarly.
REMEMBER | Small sample sizes result in sample means (and sample proportions) with large standard errors. Larger sample sizes result in sample means (and sample proportions) with smaller standard errors. In other words, the more data you collect with a single sample, the less variability you should have from sample to sample. |
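The fish-pond comparison can be tried out in a small simulation. This is a sketch under assumptions that are not in the text: fish lengths are taken to be roughly normal with a mean of 30 cm and a standard deviation of 6 cm, the pond is treated as large enough that samples are independent draws, and 500 repeated samples are taken at each sample size. The function name `sample_mean_spread` is hypothetical.

```python
import random
import statistics

def sample_mean_spread(n, num_samples, rng, mean_length=30.0, sd_length=6.0):
    """Take num_samples samples of n fish and return the spread of the sample means."""
    means = []
    for _ in range(num_samples):
        sample = [rng.gauss(mean_length, sd_length) for _ in range(n)]
        means.append(statistics.fmean(sample))
    return statistics.pstdev(means)

rng = random.Random(1)  # fixed seed so the run is repeatable

spread_by_size = {n: sample_mean_spread(n, 500, rng) for n in (100, 1000)}
for n, spread in spread_by_size.items():
    print(f"samples of {n} fish: sample means vary by about {spread:.3f} cm")
```

With these made-up numbers, the usual formula for the standard error of a mean (population standard deviation divided by the square root of n) gives about 0.6 cm for samples of 100 and about 0.19 cm for samples of 1,000, and the simulated spreads should land near those values.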
TECHNICAL STUFF | Variability in the sample means (or in the sample proportions) is measured by the standard errors. The variability of the sample means is |
As the variability in the population increases, so does the variability in the sample mean or sample proportion. Suppose that you have two ponds of fish, and you want to find the average length of all the fish in each pond. The fish in Pond Vary-Lot are much more variable in length than the fish in Pond Vary-Little. You take a sample of 100 fish from each pond and find the mean length of the fish in your sample. If you take repeated samples of size 100 from each pond and record the sample mean in each case, which sample means will vary more, those from Pond Vary-Lot or those from Pond Vary-Little? The sample means from Pond Vary-Lot would vary more, because the population of fish in Pond Vary-Lot was more variable in length to begin with.
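The two-pond comparison can be sketched the same way, this time holding the sample size at 100 and changing only the population's variability. The standard deviations used here (6 cm for Pond Vary-Lot, 1.5 cm for Pond Vary-Little), the 30 cm mean, and the function name `spread_of_sample_means` are assumptions chosen for illustration.

```python
import random
import statistics

def spread_of_sample_means(population_sd, rng, n=100, num_samples=500, mean_length=30.0):
    """Return the spread of sample means across repeated samples of n fish from one pond."""
    means = []
    for _ in range(num_samples):
        sample = [rng.gauss(mean_length, population_sd) for _ in range(n)]
        means.append(statistics.fmean(sample))
    return statistics.pstdev(means)

rng = random.Random(2)  # fixed seed so the run is repeatable

# Hypothetical standard deviations of fish length (cm) in each pond.
ponds = {"Vary-Lot": 6.0, "Vary-Little": 1.5}

spreads = {name: spread_of_sample_means(sd, rng) for name, sd in ponds.items()}
for name, spread in spreads.items():
    print(f"Pond {name}: sample means vary by about {spread:.3f} cm")
```

Because the sample size is the same for both ponds, any difference in the spread of the sample means comes from the ponds themselves; with these numbers, Pond Vary-Lot's sample means should vary roughly four times as much.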
TECHNICAL STUFF | Variability in sample proportions is affected in a similar way by the variability in the population. For example, suppose you want to estimate the proportion of fish in Pond Vary-Little that are in good health (call it |
REMEMBER | More variability in the original population contributes more variability to the standard error of the sample means (or the sample proportions). Note that this increased variability can be offset, however, by increasing the sample size, as discussed above. |
TECHNICAL STUFF | Recall that the variability of the sample means is |
HEADS UP | Anyone can plug numbers into a formula and report a measure of what they feel (or want you to believe) is the true accuracy of their results. But if those results are biased to begin with, their accuracy isn't relevant. (The formulas don't know this, though, so you need to be on the lookout.) Be sure to check to see how the sample in a particular study was selected and how the data were collected before examining any measures of how much those results are expected to vary. ( |