Chapter Eight: Multivariate Analysis
Crosstabs Revisited
Simple crosstabs, which examine the influence of
one variable on another, should be only the first step in the analysis
of social science data (refer to Chapter Five). It is fun to hypothesize
that the more conservative a person's political orientation the more likely
they are to oppose abortion, then run the crosstabs, and then conclude
you were right. However, this one-step method of hypothesis testing is
very limited. What if all the Republicans in your sample are religiously
conservative and all the Democrats are atheists? Is it the political party
that best explains your findings, or is it religious orientation?
Or what if the political conservatives as a group are much older than the
liberals? Would age then be the real causal factor? Or is it some
combination of all of these variables that explains the varying opinions
of your respondents?
In Exercise 3 of Chapter Five, we wondered if TRUST
is related to RACE. Using TRUST as the dependent variable and RACE as the
independent, here is the table from Chapter Five (Figure 8-1):
Figure 8-1
At first glance, RACE differences appear to be very
important: overall, 58% of those surveyed said people cannot be trusted,
but the epsilon statistic (the difference between the highest and lowest
percentage) is 22. Also note that few respondents said “Depends”; most
had a definite opinion here.
Let’s do some recoding: RACE should be recoded into
a different variable called RACER (Race Recoded). Whites and Blacks will
stay the same, but Other is eliminated by recoding it as missing (see Figures
8-2 and 8-3). Review Chapter 3 if you need to refresh your memory on how
to recode.
Figure 8-2
Figure 8-3
Let’s also recode TRUST into a different variable
called TRUSTR to eliminate the “Depends” category. Don’t forget to create
new value labels after you recode. Now run the crosstabs for TRUSTR and
RACER. Your output should look like Figure 8-4.
Figure 8-4
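For readers who also work outside SPSS, the recode-then-crosstab step can be
sketched in Python. This is only an illustration under stated assumptions:
the numeric codes for RACE and TRUST and the handful of respondents below are
invented for the example and are not taken from the GSS data or codebook.

```python
from collections import Counter

# Hypothetical codes, assumed for illustration only.
RACE_LABELS = {1: "White", 2: "Black", 3: "Other"}
TRUST_LABELS = {1: "Can trust", 2: "Cannot trust", 3: "Depends"}

# Made-up respondents as (RACE, TRUST) pairs; not real survey data.
respondents = [(1, 1), (1, 2), (1, 2), (1, 3), (2, 2),
               (2, 2), (2, 1), (3, 1), (3, 2), (1, 1)]

def recode(value, keep):
    """Return the value if it is in `keep`, otherwise None (treated as missing)."""
    return value if value in keep else None

# RACER keeps White and Black; TRUSTR keeps the two definite answers.
recoded = [(recode(r, {1, 2}), recode(t, {1, 2})) for r, t in respondents]

# Drop any case that is missing on either variable.
valid = [(r, t) for r, t in recoded if r is not None and t is not None]

# Cell counts keyed by (TRUSTR, RACER), and column totals keyed by RACER.
cells = Counter((t, r) for r, t in valid)
col_totals = Counter(r for r, _ in valid)

def column_percent(trust_code, race_code):
    """Share of one race category (column) giving one trust answer (row)."""
    return 100 * cells[(trust_code, race_code)] / col_totals[race_code]

# Print the table with TRUSTR in the rows and RACER in the columns.
for t in (1, 2):
    row = [f"{column_percent(t, r):5.1f}%" for r in (1, 2)]
    print(f"{TRUST_LABELS[t]:<13}", *row)
```

Percentaging within the columns of the independent variable mirrors the
layout used in this chapter, so each column adds to 100%.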
When the percentage point difference (epsilon, abbreviated "pp")
is this high, it’s “interesting” (actually, anything higher than 10-12
is interesting) even if you don't yet know whether it is statistically
significant. Here you have a pp difference of 24. And here’s how you might
describe what you’ve found so far: “Although most respondents (62%) say
that other people cannot be trusted, over 80% of the Black respondents
said this compared to 58% of the Whites in this sample.” Or, “Fewer than
one-fifth of Blacks said that people can be trusted, compared to more than
two-fifths of Whites.”
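The epsilon arithmetic is simple enough to check by hand, but a short Python
sketch makes the steps explicit. The counts below are invented so that the
column percentages land near the ones quoted above; they are not the actual
cell counts from Figure 8-4.

```python
def epsilon(column_percentages):
    """Percentage-point gap between the highest and lowest column percentage in one row."""
    values = list(column_percentages)
    return max(values) - min(values)

# Invented counts for one row ("cannot be trusted") and the column totals.
cannot_trust = {"White": 580, "Black": 82}
column_totals = {"White": 1000, "Black": 100}

# Column percentages: percentage within each category of the independent variable.
percentages = {group: 100 * cannot_trust[group] / column_totals[group]
               for group in column_totals}

pp = epsilon(percentages.values())

# Rule of thumb from the text: a gap above roughly 10 to 12 points is worth a closer look.
worth_a_look = pp > 12
print(f"epsilon = {pp:.0f} percentage points; worth a closer look: {worth_a_look}")
```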
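Is this a strong relationship (statistically speaking)?
There are a lot of choices in the "Statistics" dialog box, but here we
will just look at the gamma statistic (your instructor will probably have
you look at other statistics, but gamma is almost always appropriate; see
Figure 8-5). The value of gamma tells you how strong the relationship is,
and its significance level tells you whether the relationship is likely
to be due to chance. Yes, it is significant.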
Figure 8-5
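To see what gamma measures, it can help to compute it by hand. The sketch
below uses the standard Goodman and Kruskal definition (concordant pairs
minus discordant pairs, divided by their sum). The function name and the
cell counts are made up for illustration, and the sketch reports only the
strength of the association, not the significance test that SPSS prints
alongside it.

```python
def goodman_kruskal_gamma(table):
    """Gamma for a table of counts whose rows and columns are both in a meaningful order.

    table: list of rows, each a list of cell counts.
    """
    concordant = discordant = 0
    n_rows, n_cols = len(table), len(table[0])
    for i in range(n_rows):
        for j in range(n_cols):
            # Pair this cell with every cell in a lower row.
            for k in range(i + 1, n_rows):
                for l in range(n_cols):
                    if l > j:      # lower and to the right: same ordering on both variables
                        concordant += table[i][j] * table[k][l]
                    elif l < j:    # lower and to the left: opposite ordering
                        discordant += table[i][j] * table[k][l]
    if concordant + discordant == 0:
        raise ValueError("gamma is undefined when there are no untied pairs")
    return (concordant - discordant) / (concordant + discordant)

# Invented counts: rows are "can trust", "cannot trust"; columns are White, Black.
example = [[420, 18],
           [580, 82]]
gamma = goodman_kruskal_gamma(example)
print(f"gamma = {gamma:.3f}")
```

For a two-by-two table this reduces to (ad - bc) / (ad + bc). The sign of
gamma depends on the order in which the categories are listed, so with
nominal categories such as race only its size is meaningful.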
Can you have confidence that race is the causal factor
here? While it may indeed be true that race is explanatory, you won't really
have confidence in this conclusion until you have failed to account for
this variation in any other way. To do this, we will need to do some elaboration
analysis by running crosstabs of (i.e., "controlling for") other independent
variables to see if something else might account for this variation among
respondents.
Recall that your original crosstabs procedure produces
one contingency table, with as many rows as there are categories (or values)
of the dependent variable, and as many columns as there are categories
of the independent variable. When you start using control (sometimes called
test) variables, you will get as many separate tables as there are categories
of the control variable. For instance, if you want to control for levels
of education, and simply used EDUC as the control variable, you end up
with 20 separate tables. This is NOT a good idea. Try doing this to see
what we mean. Notice how difficult it is to compare across this many tables.
So before you do any further analysis, recode your variables into the smallest
number of categories that are still logically useful. Review Chapter 3
if you have forgotten how to do this.
In the next example EDUC was recoded as EDUC2 into
two categories, those with high school or less (0-12 years), and those
with more than high school (13+ years). After you have done these
recodes, let's see what happens when we do crosstabs again. This time we
will control for our recoded education variable. To do the appropriate
crosstabs, go to the Analyze, Descriptive Statistics, Crosstabs menu. Enter
TRUSTR into the Row box and RACER into the Column box. Now you are ready
for the next step, the addition of a control variable. Choose EDUC2 from
your variables list and enter it into the empty box at the bottom of the
Crosstabs screen. Figure 8-6 shows you what the Crosstabs dialog box will
look like.
Figure 8-6
The SPSS output for this procedure is shown in Figure
8-7.
Figure 8-7
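The logic of adding a control variable can also be sketched in Python: split
the cases by the recoded education variable, then build one trust-by-race
table inside each group. The cases and the helper names below are
hypothetical and are not the GSS respondents behind Figure 8-7.

```python
from collections import Counter, defaultdict

def recode_educ(years):
    """Collapse years of schooling into the two EDUC2 categories used in the text."""
    return "0-12 years" if years <= 12 else "13+ years"

# Made-up cases: (EDUC in years, RACER, TRUSTR). Not real GSS respondents.
cases = [
    (10, "White", "Cannot trust"), (12, "White", "Can trust"),
    (11, "Black", "Cannot trust"), (12, "Black", "Cannot trust"),
    (16, "White", "Can trust"),    (14, "White", "Cannot trust"),
    (13, "Black", "Can trust"),    (18, "Black", "Cannot trust"),
    (16, "White", "Can trust"),    (9, "Black", "Cannot trust"),
]

# One Counter of (TRUSTR, RACER) cells per category of the control variable.
partials = defaultdict(Counter)
for years, race, trust in cases:
    partials[recode_educ(years)][(trust, race)] += 1

def percent_can_trust(table, race):
    """Column percentage saying 'Can trust' within one race category of one partial table."""
    column_total = sum(n for (_, r), n in table.items() if r == race)
    return 100 * table[("Can trust", race)] / column_total

# Report each partial table with its own percentage-point difference.
for level in ("0-12 years", "13+ years"):
    table = partials[level]
    white = percent_can_trust(table, "White")
    black = percent_can_trust(table, "Black")
    print(f"{level}: White {white:.1f}%, Black {black:.1f}%, epsilon {white - black:.1f}")
```

Each pass through the final loop corresponds to one partial table, which is
why a control variable with many categories quickly becomes unreadable.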
You still have the two columns of your independent
variable (RACER), but you can compare TRUSTR for people who have no college
education (0-12 Years) with those who do (13+). A possible description:
"Whites are more likely than Blacks to think people can be trusted, holding
education constant (50.1% vs. 31.3% and 32.3% vs. 9.3%). Those with
more education are more likely than those with less education to say people
can be trusted, holding race constant (50.1% vs. 32.3% and 31.3% vs. 9.3%).
Both education and race are related to trust of people."
So what is more important, race or years of education?
Just as you can’t stop with a crosstab of only two variables when you want
to test out your hypotheses, you also can’t stop with just one control
variable. Some of the other “major demographic variables” that might explain
social differences include sex, social class, income, occupation, marital
status, age, political ideology, and religion.
Figure 8-4 shows the original, or zero-order, contingency table of the
relationship between trust and race.
Figure 8-7 shows the two partial tables that resulted
from controlling for education, one for each category of that variable
(0-12, 13+).
Try other variables as a control to see what happens.
As a general rule, here is how to interpret what you find from this elaboration
analysis:
- If the partial tables are similar to the zero-order table, you have
  replicated your original findings, which means that in spite of the
  introduction of a particular control variable, the original relationship
  persists. The only way to convince us that this is indeed a strong, or
  even causal, relationship is to control for all the other logical
  independent variables you can think of and still find essentially no
  differences between the zero-order table and its partials.
- If the relationships in all the partials are substantially weaker than
  the one found in the original table AND IF your control variable is
  antecedent (occurs prior in time) to both the other variables, you have
  found a spurious relationship and explained away the original. This is
  called explanation. In other words, the original relationship was due to
  the influence of that other variable, not the one you hypothesized.
- If the partials are weaker AND IF your control variable is intervening
  (it comes between the independent and dependent variables), you have
  interpreted the relationship. If the time sequence between the
  independent and control variable is not determinable (or otherwise
  unclear), you don't know whether you have explanation or interpretation,
  but you do know that the control variable is important.
- If one or more partials is stronger than the original relationship and
  one or more is weaker, you have discovered the conditions under which the
  original relationship is strongest. This is referred to as specification,
  or the interaction effect.
- If the zero-order table showed a weak association between the variables,
  you might still find strong associations in the partials (which is a good
  argument for keeping on with your initial analysis of the data even if
  you didn't "find" anything with bivariate analysis). In that case, your
  control variable was acting as a suppressor in the original table.
- Last, if a zero-order table shows only a weak or moderate association,
  the partials might show the opposite relationship, due to the presence of
  a distorter variable.
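The rules above can be restated as a small decision procedure. The Python
sketch below is one possible reading, not a standard routine: the function
name, the `control_timing` argument, and especially the `tolerance` cutoff of
5 points for calling two values "similar" are assumptions made for the
example. In practice you judge similarity by looking at the tables.

```python
def classify_elaboration(zero_order, partials, control_timing="unknown", tolerance=5.0):
    """Label the pattern produced by one control variable.

    zero_order: signed association (for example epsilon) in the original table.
    partials: signed associations in the partial tables, one per control category.
    control_timing: "antecedent", "intervening", or "unknown".
    tolerance: hypothetical cutoff for calling two values "similar".
    """
    if not partials:
        raise ValueError("at least one partial table is required")

    # Every partial runs in the opposite direction from the original table.
    if all(p * zero_order < 0 for p in partials):
        return "distorter"

    # Change in strength (ignoring direction) relative to the zero-order table.
    changes = [abs(p) - abs(zero_order) for p in partials]

    if all(abs(c) <= tolerance for c in changes):
        return "replication"
    if all(c < -tolerance for c in changes):
        if control_timing == "antecedent":
            return "explanation"
        if control_timing == "intervening":
            return "interpretation"
        return "explanation or interpretation"
    if all(c > tolerance for c in changes):
        return "suppressor"
    if any(c > tolerance for c in changes) and any(c < -tolerance for c in changes):
        return "specification"
    return "no clear pattern"

# Hypothetical epsilons: a zero-order gap of 24 points that nearly vanishes in both partials.
print(classify_elaboration(24, [3, 2], control_timing="antecedent"))
```

The numeric cutoff only stands in for a judgment call; a different tolerance
can move a borderline case from one label to another.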
Try some of your own three-way (or higher) tables
using some of the variables in the GSS00A data set. Recall that for this
procedure, there should be few categories for each variable, particularly
your control variables (so you might need to recode), and you are limited
to variables measured at, or recoded to, nominal or ordinal levels.
Multiple Regression
Once you have discovered that several of your independent
variables are related to your dependent variable, you might want to try
multiple regression (multiple linear regression analysis). The three-way
(or higher) crosstabs shown previously are more an exploratory technique,
whereas multiple regression is more explanatory. With multiple regression
you can generate beta values (standardized partial regression coefficients)
which give you an idea of the relative impact of each independent variable
on the dependent.
You will also generate the R squared value, which
is a summary statistic of the impacts of all the independent variables
taken together. Remember the important assumptions for using regression:
a linear relationship between each independent variable and the dependent,
a normal distribution of your variables, and variables measured at interval
or ratio levels. Any variable with only two categories can be treated as
interval level.
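To make beta and R squared less abstract, here is a minimal Python sketch
that computes both with NumPy by fitting ordinary least squares to z-scored
variables. It enters all predictors at once and is not a reproduction of any
particular SPSS method. The eight cases, the 0/1 coding of the dependent
variable, and the function name are invented for illustration.

```python
import numpy as np

def standardized_regression(X, y):
    """Return (betas, r_squared) for an ordinary least squares fit on z-scored variables."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Convert every variable to z-scores so the coefficients are comparable.
    Xz = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    yz = (y - y.mean()) / y.std(ddof=1)
    # With z-scored variables the intercept is zero, so no constant column is needed.
    betas, *_ = np.linalg.lstsq(Xz, yz, rcond=None)
    residuals = yz - Xz @ betas
    # R squared: share of the variance in the dependent variable that is accounted for.
    r_squared = 1 - (residuals @ residuals) / (yz @ yz)
    return betas, r_squared

# Made-up cases, not GSS data: education (years), class (1-4), age (years).
educ = [8, 10, 12, 12, 14, 16, 16, 18]
klass = [1, 2, 2, 3, 2, 3, 4, 3]
age = [65, 30, 45, 52, 28, 39, 60, 47]
# Hypothetical two-category dependent variable (1 = can trust, 0 = cannot trust).
trust = [0, 0, 0, 1, 0, 1, 1, 1]

X = np.column_stack([educ, klass, age])
betas, r_squared = standardized_regression(X, trust)
for name, beta in zip(["EDUC", "CLASS", "AGE"], betas):
    print(f"{name}: beta = {beta:.3f}")
print(f"R squared = {r_squared:.3f}")
```

Because every variable is on the same z-score scale, the betas can be
compared directly with one another, which is what makes them useful for
judging relative impact.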
Go to the Analyze, Regression, Linear menu. For
your dependent variable, choose TRUST from the variable list. For the independent
variables choose EDUC (unrecoded), CLASS, AGE (see Figure 8-8).
Figure 8-8
Let's look at some of the options possible. Choose
the "Statistics" button at the bottom of the dialog box and a new dialog
box will appear, shown here in Figure 8-9 with the default options. The
defaults are appropriate for us.
Figure 8-9
Click on the "Continue" button to return to Figure
8-8, then click on the "Plots" button. Your screen should now look like
Figure 8-10. Again the defaults are appropriate for us.
Figure 8-10
Click on "Continue" and then "Options" and your screen
should look like Figure 8-11, which shows the default options.
Figure 8-11
The defaults are acceptable, so click "Continue" to return
to the Linear Regression dialog box (Figure 8-8). Your last task is to
choose your method of analysis. Click on the "Method:" button right under
the "Independent(s): " box. You have several choices here, and you can
use the scroll button to see what they are. "Stepwise" is the one we chose
for this example, and the one that you will probably use most often (see
Figure 8-12).
Figure 8-12
For an in-depth discussion of all the possible choices
for Multiple Regression, you will need to consult the SPSS manuals.
When you finally click "OK" in the Linear Regression
dialog box after having chosen stepwise regression using all the default
options discussed above, you will see the results in the Output window
(Figures 8-13 through 8-15).
Figure 8-13
Figure 8-14
Figure 8-15
Chapter Eight Exercises
1. Create some hypotheses that use RACER and TRUSTR
and some of the other independent variables found in our GSS00A data set.
Remember to recode any variables that have too many categories. Test your
hypotheses first using Crosstabs, then using Regression analysis. Do the
race differences ever go away completely?
2. How would you hypothesize the relationship between
FEAR (Afraid to walk at night in neighborhood) and SEX? After you have
looked at the output, control for CLASS. How would you discuss what you
found? Now run FEAR and SEX but control for TRUSTR. How would you characterize
the relationships among these variables?
3. After creating the appropriate hypotheses, run
the Crosstabs for each of the seven abortion variables with SEX, AGE (recoded),
some measure of religiosity, and some measure of political ideology, controlling
for RACER and for education (recoded). How are all these variables related?