SSRIC Teaching Resources Depository
Exploring the US Census
Eugene Turner, California State University, Northridge

Appendix A: Detailed Descriptions and Access to Major Files

© The Author, 1998; Last Modified 17 August 1998
  Summary Tape Files

    Short-Form Files

    STF1 contains 36 tables of population information and 45 tables of housing information. Each of these 81 tables is repeated for each geographic unit within a file. STF2 is much like STF1 except that it contains more table categories and has separate "a" and "b" record types for both person and housing data. The "a" record is the tabulation for the total population, while the "b" record is a tabulation for a particular ethnic group. One form of STF2 contains tabulations for 9 ethnic groups while another contains tabulations for up to 28 ethnic groups. STF2 contains 13 a-type person tables, 28 a-type housing tables, 27 b-type person tables, and 27 b-type housing tables.

    Long-Form Files

    STF3 contains 170 person tables and 92 housing tables. Like STF2, STF4 contains "a" and "b" record types, which make it by far the largest and most detailed of the Summary Tape Files. It contains 122 a-type person tables, 76 a-type housing tables, 161 b-type person tables, and 74 b-type housing tables. One form of STF4 contains "a" and "b" tabulations for 9 ethnic groups while another contains tabulations for up to 48 groups. The table below indicates the relative sizes of the four Summary Tape Files by the number of cells of data for each geographic unit.






  Geography in Summary Tape Files

    The table below indicates the smallest level of geography contained in each of the data files.

    File     Minimum Units
    STF1a    Block Group
    STF1b    Block
    STF1c    County and Place > 10,000
    STF1d    Congressional District
    STF2a    Tract
    STF2b    County Subdivision and Place > 1000
    STF2c    County and Place > 10,000
    STF3a    Block Group
    STF3b    ZIP Code
    STF3c    County Subdivision and Place > 10,000
    STF3d    Congressional District
    STF4a    Tract
    STF4b    County Subdivision and Place > 2500
    STF4c    County Subdivision > 10,000

    Appendix D lists the Summary Level Codes and the Geographic Component Codes for the STF3 data files. Note, for example, that the State code has potentially five records. If only the state total record is desired, the Geographic Component Code must be set to 00.

    Appendix D indicates the geographical hierarchy followed to reach the final geographic unit. Of particular importance is the hierarchy used to reach census tracts and block groups. Many tract boundaries are split by incorporated place boundaries. This means that selecting Summary Level Code 080 or 090 will result in many more geographic units than selecting Summary Level Code 140 or 150. The latter codes are for complete tracts and block groups, while the former codes might be useful when studying tracts only within a specific city. Accidentally selecting the split tracts will cause considerable difficulty with most mapping programs, since they typically use only complete tract boundary files. Statistical computations may also be affected because of numerous zero values in parts of the split tracts.

    Some Summary Level Codes also differ in STF4. Complete tracts in STF4a have Summary Level Code 141, places of over 2500 persons in STF4b have Summary Level Code 163, and places of over 10,000 in STF4c have Summary Level Code 161. Summary Level Codes are listed on page 6-1 of the U.S. Census Summary Tape File Codebooks. A discussion of census geography can be found in Appendix A of the same documentation.

    Coding Geographic Units - FIPS Codes

    Within a state or county, FIPS code numbers for various types of features increase with the alphabetical order of the location names. For example, Alameda County in California is 001 and Amador County is 003. A census tract FIPS code consists of six digits: the first four identify a tract and the last two serve as a suffix. A tract that is split in a later census because of increased population receives suffixes such as .01 and .02, giving 1101.01 and 1101.02. In some cases split tracts have been split a second time. The suffix may also have values from .80 to .98, indicating that the tract was created by modifying an existing boundary. A suffix of .99 indicates persons aboard a ship at the time the census was taken.
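    The suffix rules above can be sketched as a small routine. This is a minimal illustration, not part of the census documentation: the function name is hypothetical, a bare four-digit code is assumed to mean suffix 00, and every suffix from 01 to 79 is assumed to come from a split.

```python
def classify_tract(code):
    """Split a tract code into (base, suffix, kind).

    Accepts '1101.02' or '110102'. A bare four-digit code such as '1101'
    is treated as suffix '00' (an assumption for unsuffixed tracts).
    """
    code = str(code).strip()
    if "." in code:
        base, suffix = code.split(".", 1)
    else:
        base, suffix = code[:4], code[4:] or "00"
    if len(base) != 4 or len(suffix) != 2 or not (base + suffix).isdigit():
        raise ValueError(f"not a six-digit tract code: {code!r}")

    n = int(suffix)
    if n == 0:
        kind = "no suffix"
    elif n == 99:
        kind = "shipboard population"
    elif 80 <= n <= 98:
        kind = "modified boundary"
    else:
        # Assumption: all remaining suffixes (01-79) result from splits.
        kind = "split tract"
    return base, suffix, kind


for example in ("1101.02", "1101.85", "110199", "1101"):
    print(example, classify_tract(example))
```

    A check of this kind can help flag shipboard records or modified tracts before the data are compared across censuses.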

    In most cases a FIPS code must be specified to subset a desired geographic unit or units from a census file. Occasionally more than one FIPS code must be used to create a desired data set. For example, to get the tracts of Los Angeles County, one would request a Summary Level Code of 140 and a county FIPS Code of 037. To get the data for the state of California one might specify a Summary Level Code of 040, a Geographic Component Code of 00, and a state FIPS Code of 06.
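    The two selections just described amount to filtering records on a combination of codes. The sketch below uses Python rather than SPSS purely for illustration; the rows are invented, and the field names (SUMLEV, GEOCOMP, STATE, COUNTY, AREA) are assumptions that should be checked against the dictionary of the file actually being used.

```python
# Invented sample rows standing in for census records.
records = [
    {"SUMLEV": "040", "GEOCOMP": "00", "STATE": "06", "COUNTY": None,
     "AREA": "California"},
    {"SUMLEV": "040", "GEOCOMP": "01", "STATE": "06", "COUNTY": None,
     "AREA": "California (a component record)"},
    {"SUMLEV": "140", "GEOCOMP": "00", "STATE": "06", "COUNTY": "037",
     "AREA": "Tract 1101.02"},
    {"SUMLEV": "140", "GEOCOMP": "00", "STATE": "06", "COUNTY": "001",
     "AREA": "Tract 4001"},
    {"SUMLEV": "080", "GEOCOMP": "00", "STATE": "06", "COUNTY": "037",
     "AREA": "Tract 1101.02 (part)"},
]


def subset(rows, **criteria):
    """Keep only the rows whose fields equal every requested code."""
    return [r for r in rows
            if all(r.get(field) == value for field, value in criteria.items())]


# Complete tracts of Los Angeles County: Summary Level 140, county 037.
la_tracts = subset(records, SUMLEV="140", COUNTY="037")

# State total for California: Summary Level 040, component 00, state 06.
state_total = subset(records, SUMLEV="040", GEOCOMP="00", STATE="06")

print([r["AREA"] for r in la_tracts])
print([r["AREA"] for r in state_total])
```

    Codes are kept as strings so that leading zeros such as 037 and 06 are preserved.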

    For most mapping programs, FIPS codes from the census must be joined to create a value that matches the boundary codes in the program's mapping data. Thus, to map tracts in Los Angeles County, one would have to join the state, county, and tract FIPS codes to create an identifying label. For example, 060371101.02 would specify tract 1101.02 in county 037 in state 06. The number in the census file must match the number in the mapping file exactly or the census data will not load into the mapping program. One convenient approach is to export a list of geographic unit labels from the mapping program to check the format needed for the census data. Often this list can be pasted directly into the census data table, although some record checking is usually required to account for unlabeled areas.
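    As a rough sketch of this joining step, the routine below builds the label from its three parts and then lists census labels that have no exact match in the mapping file. The function names are hypothetical, and whether the decimal point belongs in the label depends on the mapping program, so it is left as an option.

```python
def tract_key(state, county, tract, with_point=True):
    """Join state, county, and tract codes into one identifying label."""
    state = str(state).zfill(2)    # two-digit state code, e.g. 06
    county = str(county).zfill(3)  # three-digit county code, e.g. 037
    digits = str(tract).replace(".", "")
    if len(digits) == 4:
        digits += "00"             # assumption: no suffix means 00
    if len(digits) != 6 or not digits.isdigit():
        raise ValueError(f"unexpected tract code: {tract!r}")
    tract_part = f"{digits[:4]}.{digits[4:]}" if with_point else digits
    return state + county + tract_part


def unmatched(census_keys, map_keys):
    """Return census labels that have no exact match in the mapping file."""
    return sorted(set(census_keys) - set(map_keys))


print(tract_key("06", "037", "1101.02"))
print(tract_key(6, 37, "110102", with_point=False))
print(unmatched(["060371101.01", "060371101.02"], ["060371101.02"]))
```

    Running the unmatched check before loading the data gives an early list of the records that need attention.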

    The Census Bureau publishes a large list of FIPS codes in its Geographic Identification Coding Scheme publication. Each census data record contains an area name (ANPSADPI) so that names of locations can be directly located in the census files.

    Considerable care is needed in comparing data from the 1980 and 1990 censuses at the tract level, not only because of split or aggregated tracts, but also because some boundaries have been shifted contrary to policy. The California Department of Finance maintains Tract Equivalency Files that can be used to locate where changes have occurred.

  Public-Use Microdata Sample Files

    Many spreadsheet programs seem incapable of dealing with the structure of PUMS. For this reason a modified structure has been created for the files stored at the Social Sciences Database Archive. In the SSDBA files, the household record has been appended to each person record. This greatly enlarges the database but makes it easier to work with. It also means that a user must restrict tabulations to head-of-household records when tabulating housing data. Otherwise calculations will be based on duplicate housing records appended to each person in the household.
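    The effect of the duplicated housing records can be seen in a small sketch. The rows are invented, and the field names and the relationship code "00" for the head of household are assumptions made for illustration; the actual names and codes should be taken from the PUMS codebook.

```python
from collections import Counter

# Invented person records, each carrying its household's housing data.
persons = [
    {"SERIALNO": 1, "RELAT": "00", "TENURE": "owner"},   # head of household
    {"SERIALNO": 1, "RELAT": "01", "TENURE": "owner"},
    {"SERIALNO": 1, "RELAT": "02", "TENURE": "owner"},
    {"SERIALNO": 2, "RELAT": "00", "TENURE": "renter"},  # head of household
]

HEAD = "00"  # assumed code for the head of household


def tenure_counts(rows, heads_only):
    """Count housing tenure, optionally using one record per household."""
    if heads_only:
        rows = [r for r in rows if r["RELAT"] == HEAD]
    return Counter(r["TENURE"] for r in rows)


print(tenure_counts(persons, heads_only=False))  # inflated by household size
print(tenure_counts(persons, heads_only=True))   # one count per household
```

    Without the restriction, the three-person owner household is counted three times.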

    One also needs to be particularly conscious of the population serving as the universe when trying to replicate aggregations used by the U.S. Census. For example, employment data should be tabulated only for persons over age 15 who are employed as civilians. One could also limit the tabulations to those employed full-time.
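    A universe restriction of this kind can be written as a single test applied to every person record. The sketch below is illustrative only: the rows, field names, and status labels are invented, and the 35-hour threshold for full-time work is an assumption rather than a census definition.

```python
# Invented person records; field names and labels are hypothetical.
persons = [
    {"AGE": 34, "STATUS": "civilian employed", "HOURS": 40},
    {"AGE": 15, "STATUS": "civilian employed", "HOURS": 10},
    {"AGE": 41, "STATUS": "armed forces", "HOURS": 40},
    {"AGE": 52, "STATUS": "unemployed", "HOURS": 0},
    {"AGE": 28, "STATUS": "civilian employed", "HOURS": 20},
]


def in_universe(person, full_time_only=False, full_time_hours=35):
    """True for persons over age 15 who are employed as civilians."""
    ok = person["AGE"] > 15 and person["STATUS"] == "civilian employed"
    if full_time_only:
        # Assumed threshold for full-time work; adjust to the codebook.
        ok = ok and person["HOURS"] >= full_time_hours
    return ok


employed = [p for p in persons if in_universe(p)]
full_time = [p for p in persons if in_universe(p, full_time_only=True)]
print(len(employed), len(full_time))
```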

    Since customized populations can be created from the PUMS files, some care is needed in dealing with the significance of very small counts especially when PUMAs are being used. Chapter 3 of the PUMS Codebook suggests procedures in dealing with this issue.

  Census Data in the Social Sciences Database Archive

    The Social Sciences Database Archive contains a number of digital files from the U.S. Census Bureau. These include the PUMS and Summary Tape Files for 1980 and 1990 as well as County-City Databooks and Current Population Estimates. Not all STF and PUMS files are available through the SSDBA for all states. Data files are in SPSS format, and in most cases additional files provide the dictionaries and codebooks necessary to extract information from the files. A few of the files have SPSS programs for reading the data, and these may be expanded to carry out various procedures. See Appendix B for a description of the available data sets. The following table describes the current location and availability of the 1990 STF and PUMS resources in the SSDBA.

    1990 STF and PUMS Resources in the SSDBA

    This table contains the descriptions of the locations of various census resources within the SSDBA. There are four types of files which are indicated with the following codes:

    Data: the file containing the database. All are SPSS system files. "By request" means that the data are not directly accessible.
    Cbk:  the codebook describing the database.
    Dic:  a data dictionary for the database.
    Prog: an SSDBA program that will describe the contents of the database. SPSS statistics commands can be appended to it.

    STF1a, CA
        Data: /usr/ssdba/ssdba46/c90stf1a-ca.sys
        Cbk:  /ssdba-data/docs/codebooks/c90stf1.cb
        Dic:  /ssdba-data/docs/codebooks/c90stf1a.dic

    STF1b, CA
        Data: By request
        Cbk:  /ssdba-data/docs/codebooks/c90stf1.cb
        Dic:  /ssdba-data/docs/codebooks/c90stf1b.dic
        Prog: /ssdba-data/docs/programs/untested/c90stf1b-1.uspss

    STF1c, U.S.
        Data: /usr/ssdba/ssdba42/c90stf1c.sys
        Cbk:  /ssdba-data/docs/codebooks/c90stf1c.cb

    STF2a, U.S.
        Data: By request

    STF3a
        Data (Los Angeles and Orange Counties only): /usr/ssdba/ssdba37/c90stf3a-ca1.sys
        Data (Other CA Counties): /usr/ssdba/ssdba38/c90stf3a-ca2.sys
        Cbk:  /ssdba-data/docs/codebooks/c90stf3.cb
        Prog: /ssdba-data/docs/programs/icp9782.spss

    STF3b, ZIP Codes beginning with 8 or 9
        Data: /usr/ssdba/ssdba76/c90stf3b.sys
        Cbk:  /ssdba-data/docs/codebooks/c90stf3.cb

    STF3c
        Data (NE U.S.): /usr/ssdba/ssdba37/c90stf3c-a.sys
        Data (Rest of U.S.): /usr/ssdba/ssdba47/c90stf3c-b.sys
        Cbk:  /ssdba-data/docs/codebooks/c90stf3.cb

    STF4a, CA
        Data (B recs, All persons): /usr/ssdba/ssdba50/c90stf4a-t1.sys
        Data (B recs, White): /usr/ssdba/ssdba48/c90stf4a-t2.sys
        Data (B recs, Black): /usr/ssdba/ssdba65/c90stf4a-t3.sys
        Data (B recs, Am Inds): /usr/ssdba/ssdba61/c90stf4a-t4.sys
        Data (B recs, Asian): /usr/ssdba/ssdba49/c90stf4a-t5.sys
        Data (B recs, Other Race): /usr/ssdba/ssdba51/c90stf4a-t6.sys
        Data (B recs, Hispanic): /usr/ssdba/ssdba61/c90stf4a-t7.sys
        Data (B recs, NonHisp Wh): /usr/ssdba/ssdba51/c90stf4a-t8.sys
        Data (B recs, NonHisp Bl): /usr/ssdba/ssdba53/c90stf4a-t9.sys
        Data (B recs, NonHisp Oth): /usr/ssdba/ssdba49/c90stf4a-t10.sys
        Data (A recs, All persons): /usr/ssdba/ssdba53/c90stf4a-t11.sys
        Cbk:  None: See Census Docs.

    PUMS
        Data (Person & Housing Recs): /c/census/c90pums-p5.sys
        Data (Housing Recs only): /c/census/c90pums-hr
        Cbk:  /ssdba-data/docs/codebooks/c90pums-p1.frq

    STF and PUMS SPSS Programs in the SSDBA

    The programs below are located in the following directory:






















  SPSS Programs for Extracting Data from the Sample Databases and from the SSDBA Archive

    To extract variables from one of the census files you need to know the variable names. One easy way to get these is to execute a DISPLAY DICTIONARY command, which lists the contents and formats of a database. Variable names are formatted slightly differently among the various Summary Tape Files in the SSDBA, so such a listing is necessary for each file.

    Data Dictionaries

    The following program will create a dictionary of an STF4b file on the Unix version of SPSS at the SSDBA.

    Program to read PUMS extract and crosstab ethnicity by occupation

    The following two programs were used to read the SSDBA PUMS file and create some crosstabulations. Note that the resulting table is for all persons who were employed, not just the civilian employed. A non-Hispanic white category was created by using the Hispanic and Race variables. This program could be copied and pasted into the PC version of SPSS.

    Program to Crosstab Ethnicity by Income Categories
  Running SPSS on Venus at the SSDBA

    Because the PUMS file is quite large and because some time is required to access STF databases, a program may well take 30 to 45 minutes to finish execution. An alternative to waiting at the terminal is to submit your SPSS file as a batch job. This can be done by entering the following statement at the Unix prompt.

    The program file contains the SPSS program statements and the listing file receives the results of the program execution. If a data output file is created, it will be saved under the name given in the program. Note that successive runs will fail if the program attempts to generate a new file with the same name as an existing file.
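    One way to guard against a name collision is to check for an existing output file before submitting the job. The helper below is only a sketch written in Python for illustration; it is not part of the SSDBA tools, and the file name in the example is hypothetical.

```python
import os


def free_output_name(path):
    """Return path if it is unused, otherwise add a numeric suffix.

    For example, 'extract.sys' becomes 'extract-1.sys' when 'extract.sys'
    already exists, then 'extract-2.sys', and so on.
    """
    if not os.path.exists(path):
        return path
    stem, ext = os.path.splitext(path)
    n = 1
    while os.path.exists(f"{stem}-{n}{ext}"):
        n += 1
    return f"{stem}-{n}{ext}"


# Hypothetical output name; use the returned name in the program file.
print(free_output_name("extract.sys"))
```

    Renaming or removing the earlier output file by hand before rerunning the job achieves the same result.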

    The SPSS program on Venus serves as a good editor for entering and updating programs. A number of basic edit functions are available through combinations of the Escape key and a number key. Esc-1 can be used to view directory contents. Esc-2 switches between the output and input screens. Esc-3 inputs an existing file, Esc-9 saves an edited file, and Esc-0 runs a program from the editor and quits the editor. Two Escape keystrokes will terminate a command selection.
