Statistics Essentials For Dummies (43 page)


The variables X and Y can be switched in the data set, and the correlation doesn't change. For example, if height and weight have a correlation of 0.53, weight and height have the same correlation.
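
If you'd like to see that symmetry for yourself, here is a minimal Python sketch. The function name pearson_r and the height and weight numbers are my own made-up illustration, not data from this book.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Made-up heights (inches) and weights (pounds), for illustration only.
heights = [62, 65, 68, 70, 73]
weights = [120, 150, 145, 180, 175]

r_hw = pearson_r(heights, weights)  # height as X, weight as Y
r_wh = pearson_r(weights, heights)  # weight as X, height as Y

# The two values agree: swapping X and Y leaves the correlation unchanged.
print(abs(r_hw - r_wh) < 1e-12)
```

The formula treats the two lists the same way, so which one you call X makes no difference to r.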

Finding the Regression Line

After you've found a linear pattern in the scatterplot, and the correlation between the two numerical variables is moderate to strong, you can create an equation that allows you to predict one variable using the other. This equation is called the simple linear regression line.

Which is X and which is Y?

Before moving forward with your regression analysis, you have to identify which of your two variables is X and which is Y. When doing correlations, the choice of which variable is X and which is Y doesn't matter, as long as you're consistent for all the data; but when fitting lines and making predictions, the choice of X and Y makes a difference. In general, X is the variable that is the predictor. Statisticians call the X-variable (here cricket chirps) the explanatory variable, because if X changes, the slope tells you (or explains) how much Y is expected to change. The Y-variable (here temperature) is called the response variable because if X changes, the response (according to the equation of the line) is a change in Y. Hence Y can be predicted by X if a strong relationship exists.

Note: In this example, I want to predict the temperature based on listening to crickets. Obviously, the real cause-and-effect is the opposite: As temperature rises, crickets chirp more.
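
To see why the choice of X and Y matters once you start fitting lines, here is a small Python sketch. It assumes the standard least-squares slope (sum of cross-deviations divided by the sum of squared x-deviations); the helper name least_squares_slope and the chirp and temperature numbers are made up for illustration and are not this book's data.

```python
def least_squares_slope(xs, ys):
    """Slope of the least-squares line that predicts ys from xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

# Made-up chirp counts (per 15 seconds) and temperatures (degrees F).
chirps = [18, 20, 21, 23, 27, 30]
temps = [57, 60, 64, 65, 68, 71]

# Predict temperature from chirps (X = chirps, Y = temperature).
slope_temp_from_chirps = least_squares_slope(chirps, temps)

# Predict chirps from temperature (X = temperature, Y = chirps).
slope_chirps_from_temp = least_squares_slope(temps, chirps)

# If swapping X and Y merely flipped the same line, these would match.
print(slope_temp_from_chirps)
print(1 / slope_chirps_from_temp)
```

The two printed values differ, so the line for predicting temperature from chirps is not simply the chirps-from-temperature line turned around. They coincide only when the points fall exactly on a line.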

Checking the conditions

In the case of two numerical variables, it's possible to come up with a line that you can use to predict Y from X, if (and only if) the following two conditions we examined in the previous sections are met: 1) The scatterplot must show a linear pattern; and 2) The correlation, r, is moderate to strong (typically beyond ±0.60).

It's not always the case that folks actually check these conditions. I've seen cases where researchers go ahead and make predictions when a correlation was as low as 0.20, or where the data follow a curve instead of a line when you make the scatterplot! That doesn't make any sense.

But suppose the correlation is high; do we need to look at the scatterplot? Yes. There are situations where the data have a somewhat curved shape, yet the correlation is still strong.
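
Here is a quick Python sketch of that situation, using made-up data that lie exactly on the curve y = x squared (my own example, not one from this book). The helper pearson_r is a hypothetical name for the usual correlation formula.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Points that follow a curve, not a line: y = x squared for x = 1..10.
xs = list(range(1, 11))
ys = [x ** 2 for x in xs]

r = pearson_r(xs, ys)
print(round(r, 2))  # well beyond the 0.60 guideline despite the curve
```

The correlation passes the second condition easily, yet a scatterplot of these points would show a clear bend. That is why you check both conditions, not just r.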

Understanding the equation

For the crickets and temperature data, the scatterplot in Figure 10-1 shows a linear pattern. The correlation between cricket chirps and temperature was found to be very strong (r = 0.98). You can now find the one line that best fits the data (in terms of having the smallest average distance to all the points). Statisticians call this technique for finding the best-fitting line a simple linear regression analysis.

Do you have to try lots of different lines to see which one fits best? Fortunately, this is not the case (although eyeballing a line on the scatterplot does help you think about what you'd expect the answer to be). The best-fitting line has a distinct slope and y-intercept that can be calculated using formulas (and, I may add, these formulas aren't too hard to calculate).

The formula for the best-fitting line (or regression line) is y = mx + b, where m is the slope of the line and b is the y-intercept. (This is the same equation from algebra.) The slope of a line is the change in Y over the change in X. For example, a slope of 10/3 means as the x-value increases (moves right) by 3 units, the y-value moves up by 10 units on average.

The y-intercept is that place on the y-axis where the line crosses. For example, in the equation y = 2x - 6, the line crosses the y-axis at the point -6. The coordinates of this point are (0,-6); when a line crosses the y-axis, the x-value is always 0. To come up with the best-fitting line, you need to find values for m and b that fit the pattern of data the absolute best. The following sections find these values.
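
The two examples in this section (the line y = 2x - 6 and a slope of 10/3) can be checked with a few lines of Python. The function name predict is my own, and the intercept of 0 in the second example is an arbitrary choice for illustration.

```python
def predict(x, m, b):
    """Return the y-value on the line y = mx + b at the given x."""
    return m * x + b

# The line y = 2x - 6 crosses the y-axis where x = 0.
print(predict(0, m=2, b=-6))  # -6

# A slope of 10/3: moving x to the right by 3 units raises y by 10 units.
m, b = 10 / 3, 0.0  # b = 0 is an arbitrary choice for illustration
rise = predict(6, m, b) - predict(3, m, b)
print(round(rise, 10))
```

Changing b in the second example shifts the whole line up or down but leaves the rise over those 3 units the same, because the slope alone controls how fast y changes.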

Finding the slope

The formula for the slope, m, of the best-fitting line is

m = r(s_y/s_x),

where r is the correlation between X and Y, and s_x and s_y are the standard deviations of the x-values and the y-values. To calculate the slope, m, of the best-fitting line:

1. Divide s_y by s_x.

2. Multiply the result in Step 1 by r.
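
The two steps above can be sketched in Python as follows. The chirp and temperature values are made up for illustration (they are not this book's data set), and the variable names are my own. As a sanity check, the sketch also computes the slope directly from the standard least-squares formula, which should give the same number.

```python
from math import sqrt
from statistics import mean, stdev

# Made-up data for illustration only (not the book's data set).
chirps = [18, 20, 21, 23, 27, 30]  # x-values: chirps in 15 seconds
temps = [57, 60, 64, 65, 68, 71]   # y-values: degrees Fahrenheit

x_bar, y_bar = mean(chirps), mean(temps)
s_x, s_y = stdev(chirps), stdev(temps)  # sample standard deviations

# Correlation r, computed from the deviations about the means.
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(chirps, temps))
sxx = sum((x - x_bar) ** 2 for x in chirps)
syy = sum((y - y_bar) ** 2 for y in temps)
r = sxy / sqrt(sxx * syy)

ratio = s_y / s_x  # Step 1: divide s_y by s_x
m = r * ratio      # Step 2: multiply the result by r

# Cross-check: the direct least-squares slope should match m.
least_squares_m = sxy / sxx
print(round(m, 4), round(least_squares_m, 4))
```

Both printed numbers agree, which is a handy way to confirm you carried out the two steps in the right order.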

The correlation and the slope of the best-fitting line are not the same. The formula for slope takes the correlation (a unitless measurement) and attaches units to it. Think of s_y/s_x as the change in Y over the change in X, in units of X and Y; for example, change in temperature (degrees Fahrenheit) per increase of one cricket chirp (in 15 seconds).

Finding the y-intercept

The formula for the y-intercept, b, of the best-fitting line is b = ȳ - m x̄, where x̄ and ȳ are the means of the x-values and the y-values, respectively, and m is the slope (the formula for which is given in the preceding section). To calculate the y-intercept, b, of the best-fitting line: