Statistics for Dummies

Authors: Deborah Jean Rumsey


Quantifying the Relationship: Correlations and Other Measures

After the bivariate data have been organized, the next step is to do some statistics that can quantify or measure the extent and nature of the relationship.

Quantifying a relationship between two numerical variables

If both variables are numerical or quantitative, statisticians can measure the direction and the strength of the linear relationship between the two variables x and y. Data that "resemble an uphill line" have a positive linear relationship, but it may not necessarily be a strong relationship. The strength of the relationship depends on how closely the data resemble a straight line. Of course, varying levels of "closeness to a line" exist. Plus you have to distinguish between the positive and the negative relationships. Can one statistic measure all of that? Sure!

Statisticians use what they call the correlation coefficient to measure the strength and direction of the linear relationship between x and y.

Calculating the correlation coefficient (r)

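The formula for the correlation coefficient (denoted r) is

$$r = \frac{1}{n - 1} \sum \frac{(x - \bar{x})(y - \bar{y})}{s_x \, s_y}$$

where $\bar{x}$ and $\bar{y}$ are the means of the x and y values, $s_x$ and $s_y$ are their standard deviations, and n is the number of (x, y) pairs.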

To calculate the correlation coefficient:

  1. Find the mean of all the x values (call it $\bar{x}$) and the mean of all the y values (call it $\bar{y}$).

    See Chapter 5 for calculations.

  2. Find the standard deviation of all the x values (call it $s_x$) and the standard deviation of all the y values (call it $s_y$).

    See Chapter 5.

  3. For each (x, y) pair in the data set, take x minus $\bar{x}$ and y minus $\bar{y}$, and then multiply these differences together.

  4. Add all of these products together to get a sum.

  5. Divide the sum by $s_x \times s_y$.

  6. Divide that result by n − 1, where n is the number of (x, y) pairs.
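The six steps above can be sketched in Python. This is a minimal sketch, not code from the book: the function names (mean, sample_std, correlation) are made up for illustration, and it assumes the standard deviation in Step 2 is the sample version that divides by n − 1.

```python
from math import sqrt


def mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)


def sample_std(values):
    """Sample standard deviation (assumption: divides by n - 1)."""
    m = mean(values)
    return sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))


def correlation(xs, ys):
    """Correlation coefficient r, following Steps 1 through 6."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)              # Step 1: the two means
    s_x, s_y = sample_std(xs), sample_std(ys)      # Step 2: the two standard deviations
    if s_x == 0 or s_y == 0:
        raise ValueError("r is undefined when x or y never varies")
    products = [(x - x_bar) * (y - y_bar)          # Step 3: product of differences per pair
                for x, y in zip(xs, ys)]
    total = sum(products)                          # Step 4: add the products
    scaled = total / (s_x * s_y)                   # Step 5: divide by s_x times s_y
    return scaled / (n - 1)                        # Step 6: divide by n - 1


if __name__ == "__main__":
    print(round(correlation([3, 3, 6], [2, 3, 4]), 2))
```

Calling correlation([3, 3, 6], [2, 3, 4]) uses the same three data points as the worked example that follows, so the two results can be compared.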

For example, suppose you have the data set (3, 2), (3, 3), and (6, 4). Following the preceding steps, you can calculate the correlation coefficient. Note that the x values are 3, 3, and 6, and the y values are 2, 3, and 4.

  1. $\bar{x}$ is 12 ÷ 3 = 4, and $\bar{y}$ is 9 ÷ 3 = 3.

  2. The standard deviations are $s_x$ = 1.73 and $s_y$ = 1.00.

  3. The differences found in Step 3 multiplied together are: (3 − 4)(2 − 3) = (−1)(−1) = +1; (3 − 4)(3 − 3) = (−1)(0) = 0; (6 − 4)(4 − 3) = (+2)(+1) = +2.

  4. The results from Step 3, all added up, are 1 + 0 + 2 = 3.

  5. Dividing the Step 4 result by $s_x \times s_y$ gives you 3 ÷ (1.73 × 1.00) = 3 ÷ 1.73 = 1.73.

  6. Dividing the Step 5 result by 3 − 1 (which is 2), you get 0.87.

    This is the correlation.

Interpreting the correlation

The correlation r is always between −1 and +1.

  • A correlation of exactly −1 indicates a perfect downhill linear relationship.

  • A correlation close to −1 indicates a strong downhill linear relationship.

  • A correlation close to 0 means that little or no linear relationship exists.

  • A correlation close to +1 indicates a strong uphill linear relationship.

  • A correlation of exactly +1 indicates a perfect uphill linear relationship.

HEADS UP 

Many folks make the mistake of thinking that a correlation of −1 is a bad thing, indicating no relationship. In fact, the opposite is true! A correlation of −1 means that the data are lined up in a perfect straight line, the strongest linear relationship you can get. That line just happens to be going downhill; that's what the minus sign is for!

How "close" do you have to get to −1 or +1 to indicate a strong linear relationship? Most statisticians like to see correlations above +0.6 (or below −0.6) before getting too excited about them. However, don't expect a correlation to always be +0.99 or −0.99; these are real data, and real data aren't perfect.
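The interpretation rules can be written down as a small helper. This is only a sketch: the function name describe_correlation is hypothetical, the 0.6 cutoff is the rule of thumb quoted above rather than a formal standard, and the label "weak" for everything under that cutoff is an assumption made for illustration.

```python
def describe_correlation(r, strong_cutoff=0.6):
    """Put a correlation r into words using the rule-of-thumb cutoff."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("a correlation must lie between -1 and +1")
    if r == 0:
        return "no linear relationship"
    # The sign gives the direction; the size gives the strength.
    direction = "uphill" if r > 0 else "downhill"
    if abs(r) == 1:
        strength = "perfect"
    elif abs(r) > strong_cutoff:
        strength = "strong"
    else:
        strength = "weak"   # assumption: one label for anything under the cutoff
    return f"{strength} {direction} linear relationship"


if __name__ == "__main__":
    for r in (-1.0, -0.75, 0.2, 0.98):
        print(r, "->", describe_correlation(r))
```

Notice that the sign and the size are handled separately, which is exactly why a correlation of −1 is as strong as a correlation of +1.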

Figure 18-4 shows examples of what various correlations look like, in terms of the strength and direction of the relationship.

Figure 18-4: Scatterplots with various correlations.

For my subset of the cricket chirps versus temperature data, I calculated a correlation of 0.98, which is almost unheard of in the real world (these crickets are good!).

Understanding the properties of the correlation coefficient

Here are some useful properties of correlations:

  • The correlation is a unitless measure. This means that if you change the units of x or y, the correlation won't change. (For example, changing from Fahrenheit to Celsius won't affect the correlation between the frequency of chirps and the outside temperature.)

  • The values of x and y can be switched in the data set, and the correlation won't change.
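Both properties are easy to check numerically. The sketch below is an illustration only: the chirp and temperature numbers are made up (they are not the book's cricket data), and the correlation function is a compact, hypothetical helper that assumes sample standard deviations (dividing by n − 1).

```python
from math import sqrt


def correlation(xs, ys):
    """Correlation coefficient r, using sample standard deviations (n - 1)."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_x = sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))
    s_y = sqrt(sum((y - y_bar) ** 2 for y in ys) / (n - 1))
    total = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return total / (s_x * s_y) / (n - 1)


# Made-up illustration data, NOT the data set used in the book.
chirps = [18, 20, 21, 23, 27, 30, 34, 39]      # chirps in 15 seconds
temps_f = [57, 60, 64, 65, 68, 71, 74, 77]     # degrees Fahrenheit

# Property 1: changing units (Fahrenheit to Celsius) leaves r unchanged.
temps_c = [(f - 32) * 5 / 9 for f in temps_f]
r_f = correlation(chirps, temps_f)
r_c = correlation(chirps, temps_c)

# Property 2: switching the roles of x and y leaves r unchanged.
r_swapped = correlation(temps_f, chirps)

if __name__ == "__main__":
    print(r_f, r_c, r_swapped)   # the three values agree up to rounding error
```

The unit change works because converting Fahrenheit to Celsius only shifts and rescales the values by a positive factor; multiplying one variable by a negative number would flip the sign of r.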

Quantifying a relationship between two categorical variables

If both variables are categorical (such as whether or not the patient took aspirin and whether or not the patient developed polyps), you really can't use the word "correlation" to describe the relationship, because correlation measures the strength of the linear relationship between numerical variables. (This mistake occurs in the media all the time, and it drives statisticians crazy!)

The word that is used to describe a relationship between two categorical variables is association. Two categorical variables (such as treatment group and outcome) are associated if the percentage of subjects who had a certain outcome in one group is significantly different than the percentage who had that same outcome in the other group. In the aspirin versus polyp example discussed in the first section of this chapter, researchers found that in the aspirin group, 17% of the colon cancer patients developed polyps, whereas in the placebo group, 27% developed polyps. Because these percentages are quite different, the two variables are associated.

How different do the percentages have to be in order to signal a meaningful association between the two variables? The difference found by the sample has to be statistically significant. That way, the same conclusion about a relationship can be made about the whole population, not just for a particular data set. A hypothesis test of two proportions will work for this purpose (see Chapter 15 for details on this test). I analyzed the data from the aspirin versus polyps study using that test and got a p-value of less than 0.0001. That means these results are highly significant. (See Chapter 14 for more on p-values.) You can see why the researchers stopped this study midstream and decided to give everyone the aspirin treatment!
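To see how a test of two proportions turns a difference in percentages into a p-value, here is a rough sketch. It uses the common pooled z-test, which may differ in detail from the version in Chapter 15, and the function name is hypothetical. The study's actual group sizes are not given in this section, so the counts below are invented purely to match the 17% and 27% figures; they will not reproduce the p-value quoted above.

```python
from math import sqrt, erfc


def two_proportion_p_value(count1, n1, count2, n2):
    """Two-sided p-value from a pooled two-proportion z-test."""
    p1, p2 = count1 / n1, count2 / n2
    pooled = (count1 + count2) / (n1 + n2)            # overall proportion
    std_error = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / std_error
    # erfc(|z| / sqrt(2)) equals 2 * (1 - Phi(|z|)) for a standard normal.
    return erfc(abs(z) / sqrt(2))


# HYPOTHETICAL counts: 17% versus 27% with two different group sizes.
p_small = two_proportion_p_value(17, 100, 27, 100)
p_large = two_proportion_p_value(170, 1000, 270, 1000)

if __name__ == "__main__":
    print("100 per group:", p_small)
    print("1000 per group:", p_large)
```

The same pair of percentages gives a much smaller p-value with the larger hypothetical groups, which is why the size of the difference alone can't tell you whether it is statistically significant.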

 

Explaining the Relationship: Association and Correlation versus Causation

If two variables are found to be either associated or correlated, that doesn't necessarily mean that a cause-and-effect relationship exists between the two variables. Whether two variables are found to be causally associated depends on how the study was conducted. Only a well-designed experiment (see Chapter 17) or several different observational studies can show that an association or a correlation crosses over into a cause-and-effect relationship.

Taking aspirin does seem to help

I feel confident about the conclusions drawn by the researchers in the aspirin versus polyps study discussed in the first section of this chapter; this study was a well-designed experiment, according to the criteria established in Chapter 17. It included random assignment of patients to treatments, it had large enough sample sizes to obtain accurate information, and it controlled for confounding variables. This means that the researchers truly are entitled to the headline of the press release, "Aspirin Prevents Polyps in Colon Cancer Patients." Because of the design of this study, you can say that a cause-and-effect relationship (association) exists between whether the colon cancer patients took aspirin on a daily basis and whether polyps developed.

Turning up the heat on cricket chirps

Does the outside temperature cause crickets to chirp faster or slower? (Obviously the reverse isn't true, but what about the possible causation in this direction?) Some people speculate that changes in the outside temperature cause crickets to chirp at different frequencies. However, I'm not aware of any data based on experiments (as opposed to observational studies) that would confirm or deny a cause-and-effect relationship here. Perhaps you can do an experiment of your own and turn up the heat on some crickets and see what happens! (Before leaping into this, and yes, the pun was intended, be sure to design a good experiment following the criteria in Chapter 17.)

 

Making Predictions: Regression and Other Methods

After you've found a relationship between two variables and you have some way of quantifying this relationship, you can create a model that allows you to use one variable to predict another.

Making predictions with correlated data

In the case of two numerical variables, if a strong correlation has been established, researchers often use the relationship between x and y to make predictions. Because x is correlated with y, a linear relationship exists between them. This means that you can describe the relationship with a straight line. If you know the slope and the y-intercept of that line, then you can plug in a value for x and predict the average value for y. In other words, you can predict y from x.

Because the correlation between cricket chirps and temperature is so high (r = 0.98), you can find a line that fits the data. This means that you want to find the one line that best fits the data (in terms of the average distance from all of the points in the data set to the line you generate). Statisticians call this search for the best-fitting line performing a regression analysis.

HEADS UP 

Never do a regression analysis unless you've already found a strong correlation (either positive or negative) between the two variables. I've seen cases where researchers go ahead and make predictions when a correlation was as low as 0.2! That doesn't make any sense. If the scatterplot of the data doesn't resemble a line to begin with, you shouldn't try to use a line to fit the data and to make predictions about the population.

REMEMBER 

Before examining any model that predicts one variable from another, find the correlation first; if the correlation is too weak, stop right there.

You may be thinking that you have to try lots and lots of different lines to see which one fits best. Fortunately, this is not the case (although eyeballing a line on the scatterplot does help you think about what you'd expect the answer to be). The best-fitting line has a distinct slope and y-intercept that can be calculated using formulas (and, I may add, these formulas aren't too hard to calculate).

Getting a formula for best-fitting line

The formula for the best-fitting line (or regression line) is y = mx + b, where m is the slope of the line and b is the y-intercept. The slope of a line is the change in y over the change in x. For example, a slope of 10/3 means that as x moves to the right 3 units, the y-value moves up 10 units, as you move from one point on the line to the next. The y-intercept is that place on the y-axis where the line crosses. For example, in any equation of the form y = mx − 6, the line crosses the y-axis at the point −6. The coordinates of this point are (0, −6); because you are crossing the y-axis, the x value of the y-intercept is always 0. To come up with the best-fitting line, you need to find values for m and b so that you have a real equation of a line (for example, y = 2x + 3; or y = −10x − 45).

Tip 

To save a great deal of time calculating the best-fitting line, keep in mind that five well-known summary statistics are all you need to do all the necessary calculations. Statisticians call them the big-five summary statistics:

  • The mean of the x values (denoted $\bar{x}$)

  • The mean of the y values (denoted $\bar{y}$)

  • The standard deviation of the x values (denoted $s_x$)

  • The standard deviation of the y values (denoted $s_y$)

  • The correlation between x and y (denoted r)

(This chapter and Chapter 5 contain formulas and step-by-step instructions for these statistics.)

Finding the slope of the best-fitting line

The formula for the slope, m, of the best-fitting line is $m = r \left( \frac{s_y}{s_x} \right)$, where r is the correlation between x and y, and $s_y$ and $s_x$ are the standard deviations of the y-values and the x-values, respectively (see Chapter 5 for more on standard deviation).

To calculate the slope, m, of the best-fitting line:

  1. Divide $s_y$ by $s_x$.

  2. Multiply the result in Step 1 by r.

HEADS UP 

The slope of the best-fitting line can be a negative number because the correlation can be a negative number. A negative slope indicates that the line is going downhill.

TECHNICAL STUFF 

The formula for slope simply takes the correlation (a unitless measurement) and attaches units to it. Think of $s_y$ ÷ $s_x$ as the change in y over the change in x. And the standard deviations are each in terms of their original units (for example, temperature in Fahrenheit and number of cricket chirps in 15 seconds).

Finding the y-intercept of the best-fitting line

The formula for the y-intercept, b, of the best-fitting line is $b = \bar{y} - m\bar{x}$, where $\bar{y}$ and $\bar{x}$ are the means of the y-values and the x-values, respectively, and m is the slope (the formula for which is given in the preceding section).

To calculate the y-intercept, b, of the best-fitting line:

  1. Find the slope, m, of the best-fitting line using the steps listed in the preceding section.

  2. Multiply m × $\bar{x}$.

  3. Take $\bar{y}$ and subtract your result from Step 2.

Tip 

Always calculate the slope before calculating the y-intercept. The formula for the y-intercept contains the slope in it, so you need m to calculate b.
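The two calculations fit naturally into one short function that takes the big-five summary statistics and returns the slope first, then the y-intercept. This is a minimal sketch; the function name best_fitting_line and its parameter names are made up for illustration.

```python
from math import sqrt


def best_fitting_line(x_bar, y_bar, s_x, s_y, r):
    """Return (m, b) for the line y = m*x + b from the big-five statistics."""
    if s_x <= 0:
        raise ValueError("s_x must be positive to compute a slope")
    m = r * (s_y / s_x)        # slope: divide s_y by s_x, then multiply by r
    b = y_bar - m * x_bar      # y-intercept: needs the slope, so it comes second
    return m, b


if __name__ == "__main__":
    # Big-five statistics for the data set (3, 2), (3, 3), (6, 4) used earlier;
    # sqrt(3) and sqrt(3) / 2 are the unrounded versions of 1.73 and 0.87.
    m, b = best_fitting_line(x_bar=4, y_bar=3, s_x=sqrt(3), s_y=1.0, r=sqrt(3) / 2)
    print(m, b)
```

The order of the two lines inside the function mirrors the tip: b can't be computed until m is known.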

Finding the best-fitting line for cricket chirps and temperature

Although the formula for the line that best fits the relationship between cricket chirps and temperature is subject to a bit of discussion (see Appendix), the consensus seems to be that a good working model for this relationship is y = x + 40, or temperature = 1 × (number of chirps in 15 seconds) + 40, where the temperature is in degrees Fahrenheit. Note that the slope of this line is 1, x = number of chirps in 15 seconds, and y = temperature in degrees Fahrenheit.

HEADS UP 

Notice that the formulas for the slope and y-intercept are in the form of x and y, so you have to decide which of your two variables you'll call x and which you'll call y. When doing correlations, the choice of which variable is x and which is y doesn't matter, as long as you're consistent for all the data; but when fitting lines and making predictions, the choice of x and y does make a difference. Take a look at the preceding formulas: switching the roles of x and y makes all of the formulas change.

So how do you determine which variable is which? In general, x is the variable that is the predictor. Statisticians call x the explanatory variable, because if you change x, that explains why and how y is going to change. In this case, x is the number of cricket chirps in 15 seconds. The y variable is called the response variable; it responds to changes in x. In other words, y is being predicted by x. Here, y is the temperature.

Comparing the working model to the data subset

The big-five summary statistics from the subset of cricket data are shown in Table 18-3.

Table 18-3: Cricket Data Big-Five Summary Statistics

Variable        Mean                Standard Deviation    Correlation
# Chirps (x)    $\bar{x}$ = 26.5    $s_x$ = 7.4           r = +0.98
Temp (y)        $\bar{y}$ = 67      $s_y$ = 6.8
 

The slope, m, for the best-fitting line for the subset of cricket chirp versus temperature data is r × ($s_y$ ÷ $s_x$) = 0.98 × (6.8 ÷ 7.4) = 0.98 × 0.919 = 0.90. Now, to find the y-intercept, b, you take $\bar{y}$ − m × $\bar{x}$, or 67 − (0.90)(26.5) = 67 − 23.85 = 43.15. So the best-fitting line for predicting temperature from cricket chirps based on the data is: y = 0.9x + 43.2, or

temperature (in degrees Fahrenheit) = 0.9 × (number of chirps in 15 seconds) + 43.2.
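The arithmetic above can be replayed in a few lines of Python using the values from Table 18-3. This is a sketch for checking the numbers; the function names predict_temperature and working_model are hypothetical, and the chirp counts fed into them at the end are arbitrary inputs, not observations from the data set.

```python
# Big-five summary statistics from Table 18-3.
x_bar, s_x = 26.5, 7.4     # number of chirps in 15 seconds (x)
y_bar, s_y = 67.0, 6.8     # temperature in degrees Fahrenheit (y)
r = 0.98

m = r * (s_y / s_x)        # slope
b = y_bar - m * x_bar      # y-intercept, computed from the unrounded slope


def predict_temperature(chirps, slope=m, intercept=b):
    """Predicted temperature (Fahrenheit) from the fitted line."""
    return slope * chirps + intercept


def working_model(chirps):
    """The rule-of-thumb model y = x + 40."""
    return chirps + 40


if __name__ == "__main__":
    print("slope:", round(m, 2), "intercept:", round(b, 2))
    for chirps in (20, 30, 40):   # arbitrary example inputs
        print(chirps, round(predict_temperature(chirps), 1), working_model(chirps))
```

Carrying the slope at full precision gives an intercept of about 43.14 instead of 43.15; the small gap comes only from rounding the slope to 0.90 before the second step.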

HEADS UP 

Note that the preceding equation is close to, but not quite the same as, the working model: y = x + 40. Why isn't the preceding equation exactly the same as the working model? A couple of reasons come to mind. First, "working model" is fancy language for "not necessarily precise, but very practical." I'm guessing that over the years, the slope has been rounded to the nearest whole number (1) and the y-intercept has been rounded to the nearest ten (which makes it 40) just to make it easier for people to remember and more fun to write about. (This isn't good statistical practice and is an example of how statistics can drift over a period of years.) Second, the data I used are just a random subset of the original data set (for purposes of illustration) and will be a bit off, just by chance (see Chapter 9 for more on variation between samples). However, because the data are so highly correlated, the difference between one sample of data and another shouldn't be much.
