Statistics for Dummies (27 page)

Authors: Deborah Jean Rumsey

HEADS UP

Notice that the standard error of the sample proportion actually contains p, which is the population proportion. That value will most likely be unknown; you can estimate it with the sample proportion, $\hat{p}$. More on this in Chapter 12.

What proportion needs math help?

You can use the central limit theorem to answer questions involving proportions. For example, suppose you want to know what proportion of incoming college students would like some help in math. A student survey accompanies the ACT test each year, and one of the questions asked is whether each student would like some help with his or her math skills. In 2002, 38% of the students taking the ACT test responded yes to this question. This is a situation in which the population proportion, p, is known (p = 0.38). The original data in this case (as with all categorical data) do not have a normal distribution because only two results are possible: yes or no. The distribution of the population of answers to the math skills question is shown in Figure 9-5 as a bar graph (see Chapter 4 for more information on bar graphs).

Figure 9-5:
Population percentages for all students responding to the ACT math-help question in 2002.

Suppose you were to take samples of size 100 from this combined population of over a million students (all of the students who took the ACT test in 2002), and find the proportion who indicated that they needed help with their math skills in each case. The distribution of all sample proportions is shown in Figure 9-6. It is a normal distribution with mean p = 0.38 and standard error = $\sqrt{\frac{0.38(1 - 0.38)}{100}} \approx 0.049$, or 4.9%, or about 5%. Using the CLT, you can say that some of the sample proportions are higher than 0.38, some are lower, but most of them (about 95% of them) lie in the area of 0.38 plus or minus 2 × 0.05 = 0.10, or 38% ± 10%. These results still vary by quite a bit, by 10% on either side of the population proportion.
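If you'd like to check that arithmetic yourself, here is a minimal Python sketch. The function name `proportion_standard_error` is made up for this illustration; the formula and the numbers (p = 0.38, n = 100) come from the example above.

```python
import math

def proportion_standard_error(p, n):
    """Standard error of a sample proportion: sqrt(p * (1 - p) / n)."""
    return math.sqrt(p * (1 - p) / n)

p = 0.38   # population proportion from the ACT example
n = 100    # sample size

se = proportion_standard_error(p, n)

# "About 95%" of sample proportions fall within 2 standard errors of p.
low, high = p - 2 * se, p + 2 * se

print(f"standard error: {se:.3f}")
print(f"about 95% of sample proportions: {low:.2f} to {high:.2f}")
```

The range it prints matches the 38% ± 10% figure in the text, after rounding.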

Figure 9-6:
Proportion of students responding "yes" to the ACT math help question in 2002 for samples of size 100.

Now take samples of size 1,000 from the original population and find the proportion who responded that they needed help with math skills for each sample. The distribution of sample proportions in this case will look much like Figure 9-7. Everything will look the same as Figure 9-6, except that the distribution will be tighter; the standard error would reduce to $\sqrt{\frac{0.38(1 - 0.38)}{1000}} \approx 0.015$, or 1.5%. About 95% of the sample results will lie between 0.38 − 2(0.015) and 0.38 + 2(0.015), or between 0.35 and 0.41 (that is, between 35% and 41%). In other words, if you take several different samples all of size 1,000 from this population and find the sample proportion for each sample, your sample proportions won't change much from sample to sample. Instead, they'd all be quite close together. That's due to the large sample size of 1,000.
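You can also watch this happen by simulation. The sketch below is only an illustration under stated assumptions: it treats each student's answer as an independent yes/no draw with a 38% chance of "yes" (reasonable when the population is over a million), and the function name, the seed, and the choice of 1,000 repetitions are all arbitrary.

```python
import random
import statistics

def simulate_sample_proportions(p, n, repetitions, rng):
    """Draw `repetitions` samples of size n; return each sample's proportion of "yes" answers."""
    proportions = []
    for _ in range(repetitions):
        yes_count = sum(1 for _ in range(n) if rng.random() < p)
        proportions.append(yes_count / n)
    return proportions

rng = random.Random(2002)   # arbitrary seed so the run is repeatable
p = 0.38                    # population proportion from the ACT example

# Spread (standard deviation) of the sample proportions for each sample size.
spread_100 = statistics.stdev(simulate_sample_proportions(p, 100, 1000, rng))
spread_1000 = statistics.stdev(simulate_sample_proportions(p, 1000, 1000, rng))

print(f"n = 100:   sample proportions vary by about {spread_100:.3f}")
print(f"n = 1000:  sample proportions vary by about {spread_1000:.3f}")
```

The two spreads should land near the 4.9% and 1.5% standard errors computed from the formula, with the larger samples clustering much more tightly around 0.38.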

Figure 9-7:
Proportion of students responding "yes" to the ACT math help question in 2002 for samples of size 1,000.

REMEMBER

Before you draw conclusions from any sample percentages, get some idea of how much the results should vary by finding the standard error or the margin of error (which is about two standard errors; see Chapter 10). Knowing the expected amount of variability will help you keep the results in perspective.

TECHNICAL STUFF

How large is large enough for the central limit theorem to work for categorical data? Most statisticians agree that you should have n × p and n × (1 − p) both be greater than or equal to 5. This takes care of any situations in which the proportion is very close to either 1 or 0 (in other words, those extreme situations where either almost everybody or almost nobody is in the group of interest). In these extreme situations, you'd need a larger sample to ensure that all the groups are represented, even those that don't contain many people. Most surveys and polls easily sample enough people to take care of this condition.
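The rule of thumb above is easy to turn into a quick check. This is a small sketch, not an official test; the function name `clt_condition_met` and the example sample sizes are made up for illustration, and the threshold of 5 is the one stated in the text.

```python
def clt_condition_met(n, p, threshold=5):
    """Return True when both n * p and n * (1 - p) reach the threshold."""
    return n * p >= threshold and n * (1 - p) >= threshold

# ACT example: 100 * 0.38 = 38 and 100 * 0.62 = 62, so the condition holds.
print(clt_condition_met(100, 0.38))

# A proportion very close to 0: 100 * 0.01 = 1, so 100 people is too few.
print(clt_condition_met(100, 0.01))

# The same extreme proportion with a larger sample: 1000 * 0.01 = 10.
print(clt_condition_met(1000, 0.01))
```

The second and third calls show the point made above: when p is very close to 0 or 1, you need a larger sample before the condition is satisfied.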

The CLT is good news for people who are trying to interpret sample results. As long as the sample size is large enough (and the data are credible and unbiased), the information reported will be close to the truth. (But remember, I said as long as the results are credible and unbiased. See Chapter 2 for examples of how statistics can go wrong.)

HEADS UP

The central limit theorem also allows you to answer other important questions regarding sample means and proportions. For example, if a package delivery service promises an average delivery time of two days, and your sample of 30 packages took an average of 2.4 days, is this enough evidence to say that the company is guilty of false advertising? Or was this just an atypical sample of late packages? I address this type of question in Chapter 14.

If you're worried that you always need to know the population mean (μ) or the population proportion (p) in order to use the CLT, never fear! You will find out the secret that statisticians have known for years: If you don't know what a certain value is, just estimate it and move on. (More on this in Chapter 11.)

Examining Factors That Influence Variability in Sample Results

Two major factors influence the amount of variability in a sample mean or sample proportion: the size of the sample and the amount of variability in the original population.

Sample size

The size of the sample affects the amount of variability in the sample results. Suppose that you have a pond of fish, and you want to find the average length of all the fish in the pond. If you take repeated random samples of size 100 and repeated random samples of size 1,000, recording the sample mean each time, which sample means would vary more, those of size 100 or those of size 1,000? Those of size 100 would vary more, because each of the sample means was based on less information (that is, on fewer fish). Sample proportions would be affected similarly.

REMEMBER

Small sample sizes result in sample means (and sample proportions) with large standard errors. Larger sample sizes result in sample means (and sample proportions) with smaller standard errors. In other words, the more data you collect with a single sample, the less variability you should have from sample to sample.

TECHNICAL STUFF

Variability in the sample means (or in the sample proportions) is measured by the standard errors. The variability of the sample means is $\frac{\sigma}{\sqrt{n}}$, and the variability in the sample proportions is $\sqrt{\frac{p(1 - p)}{n}}$. The denominator of each of these formulas has n in it (and nothing else). Therefore, as the sample size (the denominator) increases, the standard error (the entire fraction) decreases. More information provided by the sample (through larger sample sizes) decreases the variability in the sample means (and in the sample proportions).
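To see how quickly the standard errors shrink, here is a short sketch that evaluates both formulas at a few sample sizes. The population standard deviation of 4.0 is a hypothetical value chosen only for illustration, and the function names are made up; p = 0.38 is borrowed from the ACT example.

```python
import math

def mean_standard_error(sigma, n):
    """Standard error of a sample mean: sigma / sqrt(n)."""
    return sigma / math.sqrt(n)

def proportion_standard_error(p, n):
    """Standard error of a sample proportion: sqrt(p * (1 - p) / n)."""
    return math.sqrt(p * (1 - p) / n)

sigma = 4.0   # hypothetical population standard deviation
p = 0.38      # population proportion from the ACT example

for n in (100, 400, 1600):
    se_mean = mean_standard_error(sigma, n)
    se_prop = proportion_standard_error(p, n)
    print(f"n = {n:5d}   SE of mean = {se_mean:.3f}   SE of proportion = {se_prop:.4f}")
```

Because n sits under a square root, each fourfold increase in sample size cuts the standard error in half rather than by a factor of four.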

Population variability

As the variability in the population increases, so does the variability in the sample mean or sample proportion. Suppose that you have two ponds of fish, and you want to find the average length of all the fish in each pond. The fish in Pond Vary-Lot are much more variable in length than the fish in Pond Vary-Little are. You take a sample of 100 fish from each pond and find the mean length of the fish in your sample. If you take repeated samples of size 100 from each pond and record the sample mean in each case, which sample means will vary more, those from Pond Vary-Lot or those from Pond Vary-Little? The sample means from Pond Vary-Lot would vary more, because the population of fish in Pond Vary-Lot was more variable in length to begin with.
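Here is a rough simulation of the two ponds. Everything numeric in it is an assumption made for illustration: fish lengths are drawn from a normal distribution with a mean of 20, Pond Vary-Lot gets a standard deviation of 8, and Pond Vary-Little gets a standard deviation of 2. The function name, seed, and number of repetitions are arbitrary as well.

```python
import random
import statistics

def sample_mean_spread(population_sd, n, repetitions, rng, population_mean=20.0):
    """Standard deviation of the sample means across repeated samples of size n."""
    means = []
    for _ in range(repetitions):
        sample = [rng.gauss(population_mean, population_sd) for _ in range(n)]
        means.append(statistics.mean(sample))
    return statistics.stdev(means)

rng = random.Random(9)   # arbitrary seed so the run is repeatable

# Hypothetical ponds: same mean length, very different variability.
spread_vary_lot = sample_mean_spread(population_sd=8.0, n=100, repetitions=500, rng=rng)
spread_vary_little = sample_mean_spread(population_sd=2.0, n=100, repetitions=500, rng=rng)

print(f"Pond Vary-Lot:    sample means vary by about {spread_vary_lot:.2f}")
print(f"Pond Vary-Little: sample means vary by about {spread_vary_little:.2f}")
```

With the same sample size from each pond, the sample means from the more variable pond should spread out roughly four times as much, mirroring the fourfold difference in the assumed population standard deviations.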

TECHNICAL STUFF

Variability in sample proportions is affected in a similar way by the variability in the population. For example, suppose you want to estimate the proportion of fish in Pond Vary-Little that are in good health (call it p). If the fish in Pond Vary-Little are almost all in good health (meaning p is close to 1), the standard deviation of the population, $\sqrt{p(1 - p)}$, is going to be small because most of the fish have the same health status. If you then take many samples of fish from this homogeneous (health-wise) population and find the percentage that is in good health, you shouldn't expect that percentage to change much from one sample to the next. So the standard error of the sample proportion is small when p is close to 1. The same thing happens when most of the fish are in poor health (p is close to 0). However, if about 50% of the fish are in good health and 50% are in poor health, you'll see more variability in your sample proportions from sample to sample, because the population has more variability in its health. In fact, a population where p is equal to 0.5 has the most variability in it, so the standard errors of the sample proportions are at their largest there, as well.
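A few lines of Python make the pattern visible. This sketch assumes the standard result that a yes/no population with proportion p has standard deviation sqrt(p × (1 − p)); the function name and the particular values of p are chosen only for illustration.

```python
import math

def population_sd(p):
    """Standard deviation of a yes/no population with proportion p of "yes"."""
    return math.sqrt(p * (1 - p))

# Walk from a nearly all-unhealthy pond to a nearly all-healthy one.
for p in (0.05, 0.25, 0.50, 0.75, 0.95):
    print(f"p = {p:.2f}   population standard deviation = {population_sd(p):.3f}")
```

The printed values rise toward p = 0.5 and fall off symmetrically on either side, which is why a 50-50 population produces the largest standard errors.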

REMEMBER

More variability in the original population means a larger standard error for the sample means (or the sample proportions). Note that this increased variability can be offset, however, by increasing the sample size, as discussed above.

TECHNICAL STUFF

Recall that the variability of the sample means is $\frac{\sigma}{\sqrt{n}}$, and the variability in the sample proportions is $\sqrt{\frac{p(1 - p)}{n}}$. The numerator of each of these formulas is actually the standard deviation of the original population in each case ($\sigma$ for numerical data, and $\sqrt{p(1 - p)}$ for categorical data). Therefore, as the population standard deviation (the numerator) increases, the standard error (the entire fraction) also increases. More variability in the population means more variability in the sample means (or in the sample proportions). This increased variability can be offset by increasing the sample size, because as n (the denominator) increases, the overall fraction comprising the standard error decreases.
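As a worked illustration of that trade-off (the doubling is a hypothetical example, not a figure from the text), suppose the population standard deviation doubles from σ to 2σ. Quadrupling the sample size brings the standard error of the sample mean back to where it started:

```latex
\[
\frac{2\sigma}{\sqrt{4n}} \;=\; \frac{2\sigma}{2\sqrt{n}} \;=\; \frac{\sigma}{\sqrt{n}}
\]
```

Because n appears under a square root, offsetting extra population variability takes a disproportionately large increase in sample size.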

HEADS UP

Anyone can plug numbers into a formula and report a measure of what they feel (or want you to believe) is the true accuracy of their results. But if those results are biased to begin with, their accuracy isn't relevant. (The formulas don't know this, though, so you need to be on the lookout.) Be sure to check to see how the sample in a particular study was selected and how the data were collected before examining any measures of how much those results are expected to vary. (Chapter 17 covers these issues in greater detail.)
