Univariate analysis, which looks at single variables, can be used for data cleaning, exploring a data distribution, describing the data and inferring from the data to the population from which it was drawn. SPSS offers many univariate analysis tools to perform these tasks. We will first look at these, and then we will perform a few exercises that illustrate some of these uses. The SPSS tools for looking at single variables include the following procedures: Frequencies, Descriptives and Explore, all located under the Statistics menu.
To begin the process, start SPSS, open the data file and, under the "Statistics" menu, choose "Summarize" and then the procedure desired: "Frequencies", "Descriptives", or "Explore". For the following examples we will use the GSS96A.SAV file, so you need to open this data set in SPSS. Look back in Chapter 1, "Getting a Data File", if you need to refresh your memory on how to start SPSS and bring in a file.
Frequencies is primarily used for looking at detailed information on nominal (category) data. Categorical data are used for variables such as gender, e.g., males are coded as "1" and females are coded as "2". Frequencies options include a table showing counts and percentages, statistics including percentile values, central tendency, dispersion and distribution, and charts including bar charts and histograms. The steps for using the frequencies procedure are to choose "Frequencies" from the "Statistics" menu, select your variables for analysis, choose statistics options, choose chart options, choose format options, and then have SPSS calculate your request, after which you can examine the output.
For this example we are going to check out attitudes on the abortion issue. The 1996 General Social Survey, GSS96A.SAV, has the variable "ABANY" with the label "ABORTION IF WOMAN WANTS FOR ANY REASON". We will look at this variable for our initial investigation.
Choosing Frequencies Procedure:
From the "Statistics" menu, click "Summarize", Figure 4-1.
Move to the submenu and click on "Frequencies". A dialog box, Figure 4-2, will appear providing a scrollable list of the variables on the left, a "Variable(s)" choice box, and buttons for "Statistics", "Charts" and "Format" options.
If you want to know more about a variable, the label, codes, etc., click the left mouse button with the mouse on the variable name in the variable list and choose the appropriate choice in the menu that appears.
Selecting Variables for Analysis:
First select your variable from the main frequencies dialog box, Figure 4-2, by clicking the variable name once. (Use the scroll bar if you do not see the variable you want.) In this case "ABANY" is the first variable and is already selected (i.e., highlighted). Thus, you need not click on it.
Click the arrow to the left of the "Variable(s):" box, Figure 4-2, to move "ABANY" into the box. All variables selected for this box will be included in any procedures you decide to run. We could click OK to obtain a frequency and percentage distribution of the variables, but in most cases we would also choose one or more statistics.
Choosing Statistics for Variables:
Click the "Statistics" button, bottom of Figure 4-2, and a dialog box of statistical choices will appear, Figure 4-3.
This variable is a nominal variable so click only the "Mode" box within the central tendency choices. I have done this in Figure 4-3.
After clicking the "Mode" box (this is the appropriate measure of central tendency for nominal or categorical data) click the "Continue" button and we return to the main Frequencies dialog box, Figure 4-2. We would have made different statistical choices for ordinal, interval, or ratio measured variables.
We could now click "OK" and SPSS would calculate and present the frequency and percent distribution, Figure 4-4, with our chosen statistics.
You may do this as well to see what is produced. We will discuss this output shortly, but first, in the more typical manner, we will continue and include choices for charts and look at the format options. If you clicked "OK" as I did, just choose "Summarize" from the "Statistics" menu and "Frequencies" from the submenu, and you will be back to this point with your variable and statistics chosen.
Choosing Charts for Variables:
Click on "Statistics", then "Summarize", then "Frequencies" to get back to the frequencies dialog box if you created a table like Figure 4-4. On the main frequencies window, click the "Charts" button, Figure 4-2, and a dialog box of chart choices, Figure 4-5, will appear.
Click "Bar Chart" since this is a categorical variable then click "Continue" to return to the main Frequencies window box. We could now click "OK" in the main Frequencies dialog box and SPSS would calculate and present a frequency and percent distribution with our chosen statistics. However, first we will look to see if any additional choices should be made by clicking the "Format" button.
Choosing Format for Variable Output:
Click the "Format" button on the main frequencies dialog box, Figure 4-2, and a dialog box of format choices, Figure 4-6, will appear.
In the dialog box you can see the choices for ordering output and page format. Leave the default selections; they are fine for our purpose since we are looking at only one variable. (If you do not choose the default number of categories, you can type in your preference for the number of categories for suppression.) Next, click "Continue" and the main Frequencies window, Figure 4-2, will reappear.
Now click "OK" on the main variable dialog box and SPSS will calculate and present a frequency and percent distribution with our chosen format, statistics, and chart.
Looking at Output from Frequencies:
We will now take a brief look at our output from the SPSS frequencies procedure. (Processing time for SPSS to perform the analysis in the steps above will depend on the size of the data set and the amount of work you are asking SPSS to do). The SPSS Output Navigator with the output will appear when SPSS has completed its computations. Either scroll down to the chart in the left hand window, or click the "Bar Chart" icon in the outline pane to the left of the output.
Interpreting the Chart:
We now see the chart, Figure 4-7.
The chart is a bar chart with the categories at the bottom, X axis, and the frequency scale at the left, Y axis. To display the chart, drag the scroll bar on the right of your table. The variable label is displayed at the bottom of the chart. We see that there are slightly more "no" answers, 34.5%, than "yes" answers, 28.2%, when respondents were asked if a woman should be able to get an abortion for any reason. A much smaller number, 3.2%, chose don't know, "DK". If these were the only data for this variable presented in a report, you should look at the frequency output and report the total responses and the percentage of "yes", "no" and "DK" answers as I did.
We could choose to copy our chart to a word processor program for a report. To do this, first select the chart by clicking the mouse on the bar chart; a box with handles will appear around the chart. Select "Copy Objects" from the "Edit" menu. Start your word processing document, click the mouse where you want the chart to appear and choose "Paste" from the "Edit" menu to paste the chart into your document. There are lots of possibilities for enhancing this chart within SPSS, but these will be discussed later in the text.
Interpreting Frequency Output:
To display the frequency distribution, we simply have to move the scroll bar on the right of our output window or click the "Frequencies" icon in the outline pane to the left of the output. You may want to click on the Maximize Arrow in the upper-right corner of the SPSS Output Navigator window to enlarge the output window. Use the scroll bar to display different parts of the output. The most relevant part of this output is in Figure 4-8.
We can now see some of the specifics of the SPSS frequencies output for the variable "ABANY". At the top is the variable label "ABORTION IF WOMAN WANTS FOR ANY REASON". The major part of the display shows the Value Labels ("YES", "NO", "Total", and the missing categories "NAP" [Not Appropriate], "DK" [Don't Know] and "Total"), and the Frequency, Percent, Valid Percent and Cumulative Percent (the cumulative for values as they increase in size) for each classification of the variable. The "Total" frequency and percent are listed at the bottom of the table. We see that when asked if a woman should be able to obtain an abortion for any reason, 28.2% of our sample answered "yes" while 34.5% responded "no", with 3.2% choosing "DK", don't know. The 33.8% "NAP", for not appropriate, was that portion of the sample that was not asked this question. We report Valid Percent since this number does not include the missing cases.
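The relationship between the Percent, Valid Percent and Cumulative Percent columns can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not what SPSS runs internally. The response counts are made-up toy values (not the GSS figures), and treating "DK" and "NAP" as missing is an assumption that mirrors the table described above.

```python
from collections import Counter

# Hypothetical toy responses, NOT the GSS96A data.
responses = ["YES"] * 3 + ["NO"] * 5 + ["DK"] * 1 + ["NAP"] * 1
missing_codes = {"DK", "NAP"}  # assumption: these labels are declared missing

counts = Counter(responses)
total = len(responses)
valid_total = sum(n for label, n in counts.items() if label not in missing_codes)

table = []
cumulative = 0.0
for label, n in counts.items():
    percent = 100 * n / total  # Percent: share of all cases, missing included
    if label in missing_codes:
        table.append((label, n, percent, None, None))
    else:
        valid_percent = 100 * n / valid_total  # Valid Percent: share of non-missing cases
        cumulative += valid_percent            # Cumulative Percent: running total of Valid Percent
        table.append((label, n, percent, valid_percent, cumulative))

# The mode is the valid category with the highest count.
mode_label = max((row for row in table if row[3] is not None), key=lambda row: row[1])[0]

for row in table:
    print(row)
print("Mode:", mode_label)
```

Because the missing cases are dropped from the denominator, Valid Percent is always at least as large as Percent for the same category.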
Descriptives is used to obtain summary information about the distribution, variability, and central tendency of continuous variables. Possibilities for Descriptives include mean, sum, standard deviation, variance, range, minimum, maximum, S.E. mean, kurtosis and skewness. For this example we are going to look at the distribution of age and education for the General Social Survey sample. Since both these variables were measured at interval/ratio level, we will be using different statistics from our previous example.
Choosing Descriptive Procedure:
First click the "Statistics" menu, drag to "Summarize", then across to the submenu to "Descriptives" (Figure 4-9).
Selecting Variables for Analysis:
First click on "AGE", the variable name for AGE OF RESPONDENT. Click the select arrow and SPSS will place AGE in the Variable(s) box. Follow the same steps to choose "EDUC", the variable name for HIGHEST YEAR OF SCHOOL COMPLETED. The dialog box should look like Figure 4-10.
We could click "OK" and obtain the default descriptive statistics, but we will click the "Options" button and decide on statistics for our output.
Click "Options" and the "Descriptives: Options" dialog box, Figure 4-11, will open.
Since these variables are interval/ratio measures, choose: "Mean," "Std. deviation," "Minimum" and "Maximum". We will leave the defaults for the "Distribution" and "Display Order".
Next, click the "Continue" button to return to the main descriptives dialog box, Figure 4-10. Click "OK" in the main descriptives dialog box, Figure 4-10, and SPSS will calculate and display the output seen in Figure 4-12.
Interpretation of the Descriptives Output
In Figure 4-12, "AGE OF RESPONDENT" has a mean of 44.78 and a standard deviation of 16.87. The youngest respondent was 18 and the oldest was 89. "HIGHEST YEAR OF SCHOOL COMPLETED" has a mean of 13.36 (a little more than 1 year beyond high school) and a standard deviation of 2.93. Some respondents indicated no ("0") years of school completed. The most education reported was 20 years.
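For readers who want to see what these four statistics amount to, here is a minimal Python sketch using only the standard library. The list of values is hypothetical toy data, not the GSS sample, and the variable name `educ` is just a label chosen for the illustration.

```python
import statistics

# Hypothetical years-of-schooling values, NOT the GSS96A data.
educ = [8, 12, 12, 12, 14, 16, 16, 18]

summary = {
    "N": len(educ),
    "Minimum": min(educ),
    "Maximum": max(educ),
    "Mean": statistics.mean(educ),
    # Sample standard deviation (n - 1 in the denominator).
    "Std. deviation": statistics.stdev(educ),
}

for name, value in summary.items():
    print(f"{name}: {value}")
```

The sketch uses the sample standard deviation, which divides by n - 1 rather than n.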
Explore is primarily used to visually examine the central tendency and distributional characteristics of continuous variables. Explore statistics include M-estimators, outliers, and percentiles. Grouped frequency tables and displays, as well as Stem-and-leaf and box-plots, are available. Explore will aid in checking assumptions with Normality plots and Spread vs. Level with Levene test.
Choosing the Explore Procedure:
From the "Statistics" menu choose "Summarize", drag to the submenu and select "Explore".
As in the other procedures, find and click the variable(s) you want to explore, then click the select arrow to include your variable in the Dependent List box. Choose the variable "EDUC". The dialog box should look like Figure 4-13.
In the Display box on the bottom left, you may choose either Both, Statistics, or Plots. We left the default selection, "Both".
Click the "Statistics" button, bottom middle of Figure 4-13, and the Explore: Statistics dialog box will open, Figure 4-14.
Leave checked the default box for "Descriptives with 95% Confidence Interval for the Mean", and click the "Outliers" box so we can look at the extreme observations for our variable. Click "Continue" to return to the main explore dialog window.
Click the "Plots" button on the main Explore Dialog Box, Figure 4-13, and the Explore: Plots dialog box, Figure 4-15, will open.
Leave the default choices in the Boxplots box and then click "Histogram" and "Stem-and-leaf" in the Descriptive box. Click on "Normality Plots with Tests" so we can see how close the distribution of this variable is to normal. Leave the default for "Spread vs Level with Levene Test". Click "Continue" to return to the main explore dialog box.
Click the "Options" button in the main explore dialog box, Figure 4-13, and the Explore: Options dialog box, Figure 4-16, will be displayed.
No changes are needed here since the default of "Exclude cases listwise" is appropriate. Now click "Continue" to return to the main Explore dialog box, Figure 4-13. Click "OK" in the main Explore dialog box and SPSS will perform the chosen tasks and display the data in the SPSS Output Navigator.
Interpretation of Explore Output:
Use the scroll bar to view any part of the output. The first part of the output is the "Case Processing Summary", Figure 4-17.
We can see that 2895 (99.7%) of our respondents answered this question, with 9 (.3%) of the sample "Missing", not answering the question. The GSS in recent years has had a split sample where not all respondents in the sample are asked the same questions. This is a question where all respondents were asked the question, so the total sample size was 2904 (100%).
The "descriptive" statistics output should look like Figure 4-18.
We can see all the typical descriptive statistics on this output: mean (13.36), lower bound (13.26) and upper bound (13.47) for a 95% confidence interval of the mean (in polling terminology this says that we are 95% confident that the mean for the population is between 13.26 and 13.47), median (13.00), variance (8.58), standard deviation (2.93), minimum (0), maximum (20), range (20), interquartile range (4.00), skewness (-.147), kurtosis (.904). A narrative explaining the education of the US population in 1996 would be somewhat like the following:
Our sample from the General Social Survey of 1996 indicates that the average education for those over 18 in the US in 1996 was 13.36 years, with 95% confidence that the real average would fall between 13.26 and 13.47 years. The fewest years of education reported was 0 and the most was 20. The exact middle point of the distribution, with 50% falling below and 50% above, was 13.00.
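The confidence bounds can be approximated by hand from the mean, standard deviation and N reported in the output. The sketch below assumes the usual normal approximation (a multiplier of 1.96) and starts from the rounded figures given in the text, so the result may differ from the SPSS output in the second decimal place.

```python
import math

# Figures reported in the text for HIGHEST YEAR OF SCHOOL COMPLETED.
mean, sd, n = 13.36, 2.93, 2895

# Standard error of the mean.
se = sd / math.sqrt(n)

# Assumption: normal approximation with z = 1.96 for a 95% interval.
z = 1.96
lower = mean - z * se
upper = mean + z * se

print(f"S.E. of mean: {se:.4f}")
print(f"95% CI: {lower:.2f} to {upper:.2f}")
```

With these inputs the interval comes out at roughly 13.25 to 13.47, which shows how narrow the interval becomes with a sample of nearly 2,900 cases.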
The Extreme Values can be seen in Figure 4-19.
This figure shows the five highest and the five lowest values for our variable. We see that more than 5 respondents listed their years of education as 20. Four people listed their education as 0 years in our sample, and a number of respondents listed their education at the next lowest number of 3 years. The "Test of Normality" is shown next, Figure 4-20.
The histogram, Figure 4-21, shows that SPSS divided our distribution into nine groups with a width of 2.5 years of education for each group.
The largest group has a little less than 1200 cases, an eyeball guess. The smallest group has very few cases (we know there are four respondents who reported 0 years of education from our Extreme Values output, but the graph does not show us this). The statistics on the histogram tell us that the standard deviation is 2.93 with a mean of 13.4 for a total N of 2895.
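The grouping behind a histogram can be sketched as follows. The values are hypothetical toy data, and the bin start of 0 is an assumption made for the illustration; SPSS chooses its own bin boundaries, so the groups in Figure 4-21 need not line up with these.

```python
from collections import Counter

def bin_counts(values, width=2.5, start=0.0):
    """Count how many values fall in each interval [lower, lower + width)."""
    counts = Counter(int((v - start) // width) for v in values)
    return {
        (start + i * width, start + (i + 1) * width): counts[i]
        for i in sorted(counts)
    }

# Hypothetical years-of-schooling values, NOT the GSS96A data.
educ = [0, 3, 8, 12, 12, 12, 13, 14, 16, 16, 20]

hist = bin_counts(educ)
for (lower, upper), count in hist.items():
    # One asterisk per case gives a rough text histogram.
    print(f"{lower:5.1f} to {upper:5.1f}: {'*' * count}")
```

A group with only a handful of cases produces a bar that is nearly invisible next to a group of more than a thousand, which is why the four respondents with 0 years of education do not show up on the graph.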
The Stem-and-Leaf is next. Figure 4-22 reaffirms a close but not quite normal distribution with significant outliers on the end of the distribution. We saw this in our earlier output.
Interpretation of the Q-Q Plot of Education:
Continue scrolling down the SPSS Output Navigator to the "Normal Q-Q Plot of HIGHEST YEAR OF SCHOOL COMPLETED", Figure 4-23.
A q-q plot is a plot of our observed values against the expected values. If our distribution is normal, the plot would have observations distributed closely around the straight line. In Figure 4-23, the expected normal distribution is the straight line and the line of little boxes is the observed values from our data. Our plot shows the distribution deviates somewhat from normality at the low end. The high end of the distribution is pretty much normal.
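The pairing of observed and expected values behind such a plot can be sketched with the Python standard library. The data are hypothetical toy values, and the plotting positions (i - 0.5)/n are one common convention chosen for this illustration; SPSS may use a different formula, so treat this as a sketch of the idea rather than a reproduction of Figure 4-23.

```python
from statistics import NormalDist, mean, stdev

def qq_points(values):
    """Pair each sorted observed value with the value expected under normality."""
    ordered = sorted(values)
    n = len(ordered)
    # Normal distribution with the same mean and standard deviation as the data.
    fitted = NormalDist(mean(ordered), stdev(ordered))
    # Assumption: plotting positions (i - 0.5) / n for i = 1..n.
    expected = [fitted.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(ordered, expected))

# Hypothetical years-of-schooling values, NOT the GSS96A data.
educ = [0, 8, 11, 12, 12, 13, 14, 16, 18]

points = qq_points(educ)
for observed, expected in points:
    print(f"observed {observed:5.1f}   expected {expected:6.2f}")
```

If the data were normal, the observed and expected columns would track each other closely. Large gaps at one end, such as an observed 0 against a much higher expected value, correspond to points falling away from the straight line.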
Interpretation of the Boxplot:
In the SPSS Output Navigator, scroll to the box plot of HIGHEST YEAR OF SCHOOL COMPLETED. The box plot should look like Figure 4-24.
We can see that the major part of our distribution appears close to normal, but there are significant outliers, the cases beyond the lower line of our boxplot. Our outliers are at the lowest end of the distribution.
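A common rule for flagging boxplot outliers marks any case more than 1.5 box lengths (interquartile ranges) beyond the edge of the box. The sketch below applies that rule to hypothetical toy data. The quartile method used here (`method="inclusive"`) is an assumption for the illustration and may not match the way SPSS computes the hinges of its boxplot.

```python
import statistics

def boxplot_outliers(values, k=1.5):
    """Return the quartiles and the cases lying more than k IQRs outside the box."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - k * iqr
    high_fence = q3 + k * iqr
    outliers = [v for v in values if v < low_fence or v > high_fence]
    return q1, median, q3, outliers

# Hypothetical years-of-schooling values, NOT the GSS96A data.
educ = [0, 3, 10, 12, 12, 12, 13, 14, 14, 16, 16, 18, 20]

q1, median, q3, outliers = boxplot_outliers(educ)
print("Quartiles:", q1, median, q3)
print("Outliers:", outliers)
```

In this toy example only the lowest values are flagged, the same pattern we see in Figure 4-24, where the outliers sit at the low end of the distribution.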
Chapter Four Exercises
These exercises are designed to familiarize you with the SPSS univariate procedures. They are open-ended with no specific answers.
- In this chapter we looked at ABANY (WOMAN WANTS AN ABORTION FOR ANY REASON), one of the variables in the GSS96 data measuring people's attitudes about abortion. There are other variables measuring different aspects of the abortion issue. These are:
- ABDEFECT (POSSIBILITY OF SERIOUS DEFECT IN BABY),
- ABHLTH (WOMAN'S HEALTH IS SERIOUSLY THREATENED),
- ABNOMORE (WOMAN IS MARRIED AND DOESN'T WANT ANY MORE CHILDREN),
- ABPOOR (WOMAN IS POOR AND CAN'T AFFORD MORE CHILDREN),
- ABRAPE (PREGNANT AS RESULT OF RAPE),
- ABSINGLE (WOMAN IS NOT MARRIED).
Pick one of these variables and perform the appropriate techniques discussed in this chapter for the variable. Write up a short narrative explaining what you found about this variable. (Looking back at what we did with ABANY should help you with this. Your write-up should be designed to best explain what you found, so do not report all the techniques we used, just those necessary to clearly and accurately describe your findings.)
- In this chapter we looked at EDUC (HIGHEST YEAR OF SCHOOL COMPLETED). There are similar variables measuring the education of the respondent's parents:
- PAEDUC (HIGHEST YEAR OF SCHOOL COMPLETED, FATHER)
- MAEDUC (HIGHEST YEAR OF SCHOOL COMPLETED, MOTHER)
Pick one of these variables and perform the appropriate techniques discussed in this chapter for describing the variable. Write up a short narrative explaining what you found about this variable. (You might want to look back at what we did with EDUC. Your write-up should be designed to best explain what you found, so do not report all the techniques we used, just those necessary to clearly and accurately describe your findings.)
- The GSS96A.SAV file provides answers to a wide range of questions from a sample of respondents in the US in 1996 on their lifestyle and attitudes. Look over the attitude variables in the survey. You can do this by clicking the Utilities menu and choosing Variables. This will provide a dialog box, which can be used to examine the variable and value labels for our data file. There is also a codebook for this data set in Appendix A that lists all the variable information. Pick a couple of interesting attitude questions and use an appropriate SPSS univariate procedure discussed in this chapter to describe the responses for these variables by this sample. Write a narrative description of your SPSS output. (You might want to take another look at what we did in this chapter. Your write-up should be designed to best explain what you found, so do not report all the techniques we used, just those necessary to clearly and accurately describe your findings.)
- One way to evaluate how close a sample is to the population from which it was drawn is by a comparison of known variables of the population with the same variables in the sample. The 1996 General Social Survey has variables for which we know the US population distribution (distributions of age, race, gender, etc.). Pick a few of these and find their distribution in our GSS sample. Use the procedures we learned in this chapter. See how close the sample distribution for the variables you choose comes to matching the US population's distribution for the same variables by checking a library source for US census data (Statistical Abstracts is one source). Write a short narrative explaining what you found. For a challenge, attempt an explanation for any differences between the sample and the population. (You might want to look at the web site for the General Social Survey to determine how the survey was conducted and who was chosen and not chosen to be interviewed. The web site is: http://www.norc.uchicago.edu/gss/homepage.htm.)