## Chapter 3 -- Population Characteristics

Because differences in the age, income, education, gender, ethnicity, and employment characteristics of the population may affect access to resources and social status, these population components are frequently studied in more detail or are controlled when studying an issue. For example, white males tend to earn higher incomes than white females or persons of many other ethnic groups, persons with higher educational attainment tend to earn higher incomes, women predominate in older age groups, and men and women are often found in different occupations.

In this section you will examine a few of the measures commonly used to describe some of these population components.

A. The Sex Ratio

The sex component of the population is a significant element affecting many statistical tabulations. For example, there are important differences between men and women in the areas of employment, age, and income. It may be useful to control these elements in an analysis.

The sex ratio is an often-used measure of the difference in the number of men and women in an area. It is the ratio of males per 100 females and is calculated by dividing the number of males by the number of females and multiplying the ratio by 100. Scores higher than 100 indicate more males than females.

| | Glendale | Los Angeles City | California |
|---|---|---|---|
| Sex ratio | 93 | 101 | 100 |
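The calculation described above can be sketched in a few lines of Python. This is a minimal illustration only: the function name `sex_ratio` and the male and female counts are hypothetical and do not come from the census tabulations.

```python
def sex_ratio(males, females):
    """Return the number of males per 100 females."""
    if females <= 0:
        raise ValueError("the number of females must be positive")
    return 100.0 * males / females


# Hypothetical counts, chosen only to illustrate the arithmetic.
ratio = sex_ratio(4650, 5000)
print(round(ratio, 1))  # 93.0, i.e. fewer males than females
```

A value below 100, as in this made-up case, indicates more females than males.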

Several underlying factors may influence the sex ratio. For very large populations in developed countries, such as the entire United States, the ratio is less than 100, indicating that there are more females than males. However, when the ratio is examined across age groups, it is greater than 100 in the youngest age groups before dipping below 100 in the early 20s. This is because more males are born than females: the sex ratio at birth is about 105 regardless of the country. In the more developed countries, however, males tend to die at a higher rate than females, so the ratio falls with age.

In the early twentieth century, men were predominant in the western United States because many people had migrated into the region and the majority of those migrants were young men. By 1990, however, the sex ratio was almost even at 99.5. Local conditions can still shift the ratio: state capitals with a large number of women in clerical and administrative jobs have lower sex ratios than other cities. Similarly, retirement communities with large elderly populations have still lower ratios.

While Table 4 indicates that overall the total number of males and females in California is about the same, Table 5 presents the sex ratios by age category. As expected, males predominate up to age 34. After age 60 the proportion of females increases markedly. Among people age 75 and older, the table shows that women outnumber men two to one. The increase in the proportion of males from age 10 to age 24 is due to the male predominance among in-migrants, some of whom were teenagers in young families coming from other states or from countries like Mexico.

| Age Group | Sex Ratio |
|---|---|
| 0 - 4 yrs | 105 |
| 5 - 9 yrs | 105 |
| 10 - 14 yrs | 110 |
| 15 - 19 yrs | 116 |
| 20 - 24 yrs | 110 |
| 25 - 29 yrs | 106 |
| 30 - 34 yrs | 103 |
| 35 - 39 yrs | 100 |
| 40 - 44 yrs | 99 |
| 45 - 49 yrs | 98 |
| 50 - 54 yrs | 95 |
| 55 - 59 yrs | 88 |
| 60 - 64 yrs | 82 |
| 65 - 69 yrs | 48 |
| 70 - 74 yrs | 68 |
| 75 - 79 yrs | 50 |
| 80+ yrs | 50 |

B. The Location Quotient

While percentages provide an indication of the relative proportion of a population subgroup in different areas, the location quotient can be used to compare a local proportion to that of a much larger area. Location quotients can determine if an ethnic population or employment within a certain occupation or industry is relatively strong or weak in different areas.

A location quotient is calculated by first dividing the number of persons in a population subgroup by the total population within a local area. This ratio is then divided by the comparable ratio for a much larger area such as an entire state. For example, if 3,000 out of 10,000 persons in a community were Hispanic (a ratio of 0.3) and 300,000 out of one million persons in a state were Hispanic (also a ratio of 0.3), the location quotient would be 1. This would indicate that the community has the same proportion of Hispanics as the entire state. When the location quotient is greater than 1, the community has a higher concentration of Hispanics than the state; when it is less than 1, the concentration is lower. A score of 2.0 would mean that the community has twice the proportion of Hispanics as the state, while a score of 0.25 would indicate that the community has one quarter of the state's proportion.
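The two-step division can be sketched as follows. The function name `location_quotient` is an illustrative choice. The first call uses the community example from the paragraph above. The second uses the Private Household Services counts for Los Angeles County and California from the occupation table later in this section, with the California total of 13,996,309 employed persons.

```python
def location_quotient(local_group, local_total, ref_group, ref_total):
    """Ratio of a subgroup's local share to its share in a larger reference area."""
    local_share = local_group / local_total
    ref_share = ref_group / ref_total
    return local_share / ref_share


# Community example from the text: 0.3 locally versus 0.3 statewide.
community = location_quotient(3_000, 10_000, 300_000, 1_000_000)

# Private household services workers, Los Angeles County versus California.
private_household = location_quotient(44_456, 4_203_792, 95_059, 13_996_309)

print(round(community, 2), round(private_household, 2))  # 1.0 1.56
```

Working from the raw counts rather than from the rounded proportions avoids compounding rounding error in the quotient.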

In the tables below, various occupational categories and classes of employment are compared between Los Angeles County and the State of California. The location quotients in the first table indicate that Los Angeles County has about 1.5 times the proportion of workers in private household and machine operator occupations as the state as a whole. Also, Los Angeles County has about half the proportion of workers in farming, forestry, and fishing occupations as found over the entire state. The second table reveals that Los Angeles County has about two-thirds the proportion of state and federal government workers as the entire state. This is surprising to many people, who may have imagined that the much larger number of poorer people in Los Angeles County compared to all other counties in the state would automatically translate into a higher proportion of government workers in that county. Also, there are relatively fewer self-employed people in Los Angeles County than in the state, despite the concentration of immigrants in the county and the fact that many immigrants have opened small businesses as a means of adaptation to this country.

| Occupations | California Employed Persons | LA County Employed Persons | California Proportion | LA County Proportion | Location Quotient |
|---|---|---|---|---|---|
| Executive, Admin, Managerial | 1,939,417 | 555,616 | 0.139 | 0.132 | 0.95 |
| Professional Specialty Occupations | 2,057,087 | 603,519 | 0.147 | 0.144 | 0.98 |
| Technicians and Support | 527,367 | 141,767 | 0.038 | 0.034 | 0.90 |
| Sales | 1,690,007 | 486,374 | 0.121 | 0.116 | 0.96 |
| Administrative Support | 2,319,459 | 730,744 | 0.166 | 0.174 | 1.05 |
| Private Household Services | 95,059 | 44,456 | 0.007 | 0.011 | 1.56 |
| Protective Services | 235,799 | 65,721 | 0.017 | 0.016 | 0.93 |
| Other Services | 1,402,919 | 406,436 | 0.100 | 0.097 | 0.96 |
| Farming, Forestry, Fishing | 382,369 | 52,446 | 0.027 | 0.012 | 0.46 |
| Precision Production, Repair | 1,548,625 | 462,923 | 0.111 | 0.110 | 1.00 |
| Machine Operators | 797,300 | 345,158 | 0.057 | 0.082 | 1.44 |
| Transportation and Moving | 480,057 | 142,276 | 0.034 | 0.034 | 0.99 |
| Helpers, Laborers | 520,844 | 166,356 | 0.037 | 0.040 | 1.06 |
| Total Employed | 13,996,309 | 4,203,792 | | | |

| Class of Worker | California Employed Persons | LA County Employed Persons | California Proportion | LA County Proportion | Location Quotient |
|---|---|---|---|---|---|
| Private for Profit | 10,000,783 | 3,134,368 | 0.715 | 0.746 | 1.04 |
| Private not for Profit | 734,520 | 223,631 | 0.052 | 0.053 | 1.01 |
| Local Government | 1,078,146 | 307,672 | 0.077 | 0.073 | 0.95 |
| State Government | 499,399 | 100,286 | 0.036 | 0.024 | 0.67 |
| Federal Government | 446,373 | 90,789 | 0.032 | 0.022 | 0.67 |
| Self Employed | 1,173,375 | 329,115 | 0.084 | 0.078 | 0.93 |
| Unpaid Family | 60,713 | 17,931 | 0.004 | 0.004 | 0.98 |
| Total Employed | 13,996,309 | 4,203,792 | | | |

C. The Entropy Index

The entropy index (H) is a measure of the diversity of groups in an area. If all component groups are equally present, the index reaches its maximum. If only one of several groups is present, it is 0. The maximum score increases with the number of groups used in computing the index: it equals the natural log of the number of groups. However, the index can be standardized to a maximum of 1 by dividing all values by the maximum possible score (i.e. the score when all groups are equally present in an area).

$$
H = -\sum_{k=1}^{n} \frac{P_k}{P} \ln\!\left(\frac{P_k}{P}\right)
$$

Here Pk is the population of subgroup k, P is the total population, and n is the number of groups.

In the table below five major ethnic categories have been tabulated for four California cities. For each group, the proportion of the city's population multiplied by the natural log of that proportion, with the sign reversed so that the values are positive, is reported in the lower part of the table. The sum of these terms for each city is the entropy index (H), which is reported in its raw and standardized form at the bottom. The raw scores were standardized by dividing by the maximum possible score for five groups (ln 5 = 1.609).

The cities of Los Angeles and San Francisco are found to be much more diverse than Glendale and Burbank (Table 7). Because of their large Asian and Hispanic populations, many cities in California are among the most ethnically diverse in the United States.

| Group | Los Angeles Persons | San Francisco Persons | Glendale Persons | Burbank Persons |
|---|---|---|---|---|
| Non Hispanic Whites | 1,299,604 | 337,118 | 114,765 | 64,453 |
| Blacks | 487,674 | 79,039 | 2,334 | 1,638 |
| American Indians | 16,379 | 3,456 | 629 | 501 |
| Asians & Pacific Islanders | 341,807 | 210,876 | 25,453 | 6,335 |
| Hispanics | 1,391,411 | 100,717 | 37,731 | 21,172 |
| Group Total 1990 | 3,536,875 | 731,206 | 180,912 | 94,099 |

| (Pk/P) ln(Pk/P) | Los Angeles | San Francisco | Glendale | Burbank |
|---|---|---|---|---|
| Non Hispanic Whites | 0.368 | 0.357 | 0.289 | 0.259 |
| Blacks | 0.273 | 0.240 | 0.056 | 0.071 |
| American Indians | 0.025 | 0.025 | 0.020 | 0.028 |
| Asians & Pacific Islanders | 0.226 | 0.359 | 0.276 | 0.182 |
| Hispanics | 0.367 | 0.273 | 0.327 | 0.336 |
| H | 1.259 | 1.254 | 0.967 | 0.875 |
| Standardized H | 0.782 | 0.779 | 0.601 | 0.544 |
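The entropy calculation can be reproduced from the group counts with a short Python sketch. The function names `entropy_index` and `standardized_entropy` are illustrative choices. The counts are the Los Angeles figures from the table above. Skipping groups with zero population is an assumption made here so that the logarithm is always defined.

```python
import math


def entropy_index(counts):
    """Raw entropy H = -sum(p * ln p) over the group proportions."""
    total = sum(counts)
    h = 0.0
    for count in counts:
        if count > 0:  # an absent group contributes nothing to H
            p = count / total
            h -= p * math.log(p)
    return h


def standardized_entropy(counts):
    """H divided by its maximum, ln(number of groups), so the result lies in [0, 1]."""
    if len(counts) < 2:
        return 0.0  # a single group has no diversity to measure
    return entropy_index(counts) / math.log(len(counts))


# Los Angeles counts: Non Hispanic Whites, Blacks, American Indians,
# Asians & Pacific Islanders, Hispanics.
los_angeles = [1_299_604, 487_674, 16_379, 341_807, 1_391_411]
print(round(entropy_index(los_angeles), 3))         # 1.259
print(round(standardized_entropy(los_angeles), 3))  # 0.782
```

Both values agree with the H and Standardized H entries reported for Los Angeles.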

D. Geographic Association

Scattergrams

Very frequently social scientists want to determine the strength of the association of two or more variables over space. For example, one might want to know if larger populations within metropolitan counties are associated with higher crime rates. One way to examine this association is to make a scattergram. Scattergrams graphically portray how closely changes in one variable correspond to changes in another. In the example below the population values for the 593 metropolitan counties in the U.S. have been plotted on the x-axis and the corresponding crimes per 100,000 persons have been plotted on the y-axis.

Figure 2. Scattergram of Population vs Crimes Per 100,000 Persons

In this scattergram there does appear to be some association between higher crime rates and larger populations. However, there is quite a bit of variability in this trend: a few counties with large populations have relatively low crime rates and a few small counties have relatively high crime rates. If the relationship were very strong, the points would fall close to a line; if it were very weak, the points would be scattered randomly over the plot. Very strong, almost linear, distributions may be found in physical relationships such as the increase in pressure in a container with an increase in temperature. However, such strong relationships are rare among social data.

Correlation

If a scatter of points does seem to exhibit a pattern, then one might choose to measure the strength and the direction of it through the use of correlation statistics. Correlation determines whether a relationship exists between two variables which have usually been sampled from a larger population. Correlation measures are usually standardized so that if an increase in the first variable, x, always brings the same increase in the second variable, y, then the correlation value would be +1.0. If the increase in x always brought the same decrease in the y variable, then the correlation score would be -1.0. If an increase in x brought no regular change in y, then the correlation would be 0. In most calculations of correlation, an approximation of a linear relationship is assumed. However, the relationship could be curvilinear or cyclical, and so one should always examine a scattergram to see if the relationship between two values is non-linear.

There are several types of correlation measures which can be applied to different measurement scales of a variable (i.e. nominal, ordinal, or interval). One of these, the Pearson product moment correlation coefficient, is based on interval-level data and on the concept of deviation from a mean for each of the variables. A statistic, the covariance, is the sum of the products of the deviations of the observed values from each of their means, divided by the number of observations. The covariance is then divided by the product of the standard deviations of the two variables to get the correlation, or:

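$$
r = \frac{\dfrac{\sum (X - \bar{X})(Y - \bar{Y})}{N}}{\sqrt{\dfrac{\sum (X - \bar{X})^2}{N}} \times \sqrt{\dfrac{\sum (Y - \bar{Y})^2}{N}}}
$$

where $\bar{X} = \sum X / N$ and $\bar{Y} = \sum Y / N$ are the means of the two variables.

The correlation statistic above is for the entire population. If a sample had been selected, N would be replaced by n - 1.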

Computing the Pearson product moment correlation for the crime and population data yields a correlation score of .449, which is only a moderate level of correlation. Another statistic, called the coefficient of determination, can be calculated to determine the percent of the total variance explained by the correlation between the two variables. The coefficient of determination is simply the square of r, the correlation coefficient. In this example, the coefficient of determination is only .202. Thus, about 20% of the variance in the crime rate is accounted for by its correlation with population size. This would suggest that other variables yet unaccounted for are more powerful influences on the crime rate.
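The population form of the Pearson coefficient (covariance divided by the product of the standard deviations, all with N in the denominator) can be sketched as below. The function name `pearson_r` is an illustrative choice, and the five population and crime-rate pairs are hypothetical values, not the 593-county data set.

```python
import math


def pearson_r(xs, ys):
    """Population Pearson correlation: covariance / (sd_x * sd_y)."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equally long sequences with at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    return covariance / (sd_x * sd_y)


# Hypothetical data: population (thousands) and crimes per 100,000 persons.
population = [50, 120, 300, 800, 2000]
crime_rate = [3100, 4800, 4200, 6900, 7400]

r = pearson_r(population, crime_rate)
print("r =", round(r, 3), " r squared =", round(r * r, 3))

# The coefficient of determination reported in the text: .449 squared.
print(round(0.449 ** 2, 3))  # 0.202
```

Because N cancels between the numerator and the denominator, replacing N with n - 1 throughout gives the same value of r.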

The distribution of points in the scattergram rises quickly and then spreads out to the right, which suggests that the relationship may be somewhat non-linear. Replacing the population with its natural logarithm increases the correlation coefficient to .605 and the coefficient of determination to .367. Thus, a non-linear form of correlation increases the percent of variance explained to about 37%. Apparently the crime rate does increase with population size, but at a decreasing rate.

Because all 593 metropolitan counties in the U.S. were used to compute the correlation statistic, there is less value in testing its significance. Had a sample of the counties been taken, one could consider the possibility that such a relationship could have occurred by chance. To test the significance of the relationship, one could assume that there is no relationship between the population size of counties and the crime rate (the null hypothesis) and that the value of r is due to sampling error. A statistic called the t statistic is commonly used to test the hypothesis that the correlation value is due to sampling error.

$$
t = \frac{|r| \sqrt{n - 2}}{\sqrt{1 - r^2}}
$$

where n is the number of observations.

If the 593 counties had been a sample, the t test would yield a value of 12.204. A table of t-statistic values indicates that a score of 1.96 or greater would be expected to occur by chance only 5% of the time and a score of 3.922 or greater only 0.01% of the time. The value of 12.204 is far higher than either. Thus, the null hypothesis could be rejected.
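As a quick check of the arithmetic, the t statistic can be computed directly from r and n. The function name `t_statistic` is an illustrative choice. The inputs are the rounded values quoted in the text, r = .449 and n = 593, so the result differs slightly from the 12.204 reported above, which was presumably computed from an unrounded r.

```python
import math


def t_statistic(r, n):
    """t = |r| * sqrt(n - 2) / sqrt(1 - r^2) for a correlation r from n observations."""
    if n <= 2:
        raise ValueError("n must be greater than 2")
    if abs(r) >= 1:
        raise ValueError("r must lie strictly between -1 and 1")
    return abs(r) * math.sqrt(n - 2) / math.sqrt(1 - r * r)


t = t_statistic(0.449, 593)
print(round(t, 1))  # 12.2
print(t > 1.96)     # True: well above the 5% critical value quoted in the text
```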

There are a number of assumptions made about the data in correlation analysis which are not always met. For example, the observations should be selected randomly, measured on the interval or ratio scale, normally distributed, and independent of each other. The last condition may be a particular problem when observations are geographically near to one another. However, large sample sizes can mitigate many of these problems.

Regression

If the correlation between two variables is found to be significant and there is reason to suspect that one variable influences the other, then it may be useful to calculate a regression line for the two variables. In this example one might expect that an increase in population produces an increase in the crime rate. Thus, the crime rate would be considered a dependent variable and the population size would be considered an independent variable. When plotting these variables, the dependent variable, crime, would be plotted on the y-axis and the independent variable would be plotted on the x-axis of a scattergram.

Regression expresses the relationship between the two variables as the equation for a line which best fits the scatter of points in a scattergram. The line minimizes the sum of the squared deviations of the dependent (y) variable from the line. From the equation one can estimate the value of y for a given value of x. Differences between the real and estimated y values are called residuals.
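A least-squares line for one independent variable can be fitted with the closed-form slope and intercept, as in the sketch below. The function names `fit_line`, `predict`, and `residuals` are illustrative choices, and the five data points are hypothetical values chosen to keep the arithmetic easy to follow.

```python
def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def predict(intercept, slope, x):
    """Estimated y for a given x."""
    return intercept + slope * x


def residuals(xs, ys, intercept, slope):
    """Observed y minus estimated y for every point."""
    return [y - predict(intercept, slope, x) for x, y in zip(xs, ys)]


# Hypothetical data, not the county data set.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
intercept, slope = fit_line(xs, ys)
print(round(intercept, 2), round(slope, 2))  # 2.2 0.6
print([round(e, 2) for e in residuals(xs, ys, intercept, slope)])
```

The residuals of a least-squares line with an intercept always sum to zero, which makes a convenient sanity check on a fit.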

Figure 3. Regression of Population vs Crimes Per 100,000 Persons

The equation for the above regression line is

$$
\text{Crimes/100k} = 3897.35 + 0.005149 \times \text{Pop}
$$

Since it is possible that quite different scatters of points could produce the same line, it is also helpful to calculate the standard error of the estimate. This provides an indication of the scatter of the points about the line. This value can be useful for comparing different samples.

$$
SE_{est} = \sqrt{\frac{\sum (Y - \hat{Y})^2}{N}}
$$

where $\hat{Y}$ is the value of Y estimated from the regression line.

For this crime example the standard error of the estimate is 2252.9.
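The standard error of the estimate is taken here to be the root of the mean squared residual, which measures the scatter of the points about the line, with N in the denominator as in the formula above. The function name and the observed and estimated values below are hypothetical. Note that some texts divide by n - 2 instead of N.

```python
import math


def standard_error_of_estimate(observed, predicted):
    """Square root of the mean squared difference between observed and estimated y."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("need two equally long, non-empty sequences")
    sse = sum((y - y_hat) ** 2 for y, y_hat in zip(observed, predicted))
    return math.sqrt(sse / len(observed))


# Hypothetical observed values and the estimates from a fitted line.
observed = [2, 4, 5, 4, 5]
predicted = [2.8, 3.4, 4.0, 4.6, 5.2]
print(round(standard_error_of_estimate(observed, predicted), 3))  # 0.693
```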

The reliability of the regression equation also may be tested with analysis of variance. With the F statistic one can determine how much of the total y variability is due to the regression line and how much is due to the residuals. If a large portion of the variance comes from the equation and the independent variable, then the model provides a good prediction of y and the value of F is high.

$$
F = \frac{\sum (\hat{Y} - \bar{Y})^2 \,/\, df}{\sum (Y - \hat{Y})^2 \,/\, (n - df - 1)}
$$

where $\hat{Y}$ is the value of Y estimated from the regression line, $\bar{Y} = \sum Y / N$ is the mean of Y, and df is the degrees of freedom of the regression (1 here, for the single independent variable).

For the crime example, the F statistic is 148.94. The null hypothesis would state that the regression equation fails to predict the variation in y, in which case a value of 3.86 or greater (from a table of F statistics) would occur by chance 5% of the time. Because 148.94 is much greater than 3.86, the null hypothesis can be rejected. This means that larger counties do indeed tend to have higher crime rates, even though other factors have a greater effect on crime rates than population size.
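The F ratio can be sketched as the regression mean square divided by the residual mean square. The function name `f_statistic` and the small data set are hypothetical. The estimated values are assumed to come from a least-squares line fitted to the same observations, with df = 1 for a single independent variable.

```python
def f_statistic(observed, predicted, df=1):
    """F = (explained sum of squares / df) / (residual sum of squares / (n - df - 1))."""
    n = len(observed)
    if n != len(predicted) or n - df - 1 <= 0:
        raise ValueError("not enough observations for the given degrees of freedom")
    mean_y = sum(observed) / n
    explained = sum((y_hat - mean_y) ** 2 for y_hat in predicted)
    residual = sum((y - y_hat) ** 2 for y, y_hat in zip(observed, predicted))
    return (explained / df) / (residual / (n - df - 1))


# Hypothetical observations and their least-squares estimates.
observed = [2, 4, 5, 4, 5]
predicted = [2.8, 3.4, 4.0, 4.6, 5.2]
print(round(f_statistic(observed, predicted), 2))  # 4.5
```

With a single independent variable, F equals the square of the t statistic for the correlation. This is consistent with the values in the text, since 12.204 squared is about 148.94.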