THE CROSS-TABULATION REFINES

THE STARTING point of any statistical analysis is the straightforward one-dimensional table, which shows a distribution among several groups or, in its simplest form, between two groups, as in Table 1. [1] From this simplest of all statistical tables, one can learn that at the time of the poll there was a bare majority for the Republican candidate, and that if things did not change, the Republican candidate could expect to carry the county. Thus, such a straight table has descriptive and, to some extent, predictive value.

PURPOSE OF CROSSTABULATION

Looked at in another way, such a table is but the starting point for explorations that proceed by dividing the sample into subgroups, in order to learn how the dependent variable (voting Republican or Democrat, in this case) varies from one group to the other. This, then, is the function of the cross-tabulation.

If the new, two-dimensional distribution differs from the old one-dimensional one, one step has been taken in the process of discovering the factors that determine the over-all proportions.

TABLE 1 Sample Poll in County X Before Presidential Election
Will Vote For:	Percentage
Republican candidate	52%
Democratic candidate	48%
Total	100%
(N)	(5,160)

For example, if the table is broken down by the prospective voter's economic status, Table 2 is obtained:

TABLE 2 Preelection Poll in County X by Economic Status
	Economic Status
Political Party	High	Low
Republican	60%	45%
Democratic	40%	55%
Total	100%	100%
(N)	(2,604)	(2,556)

This table shows that the proportion of Republican voters is larger among voters from the upper economic strata than among those from the lower. Conversely, the proportion of Democratic voters is larger in the lower economic brackets. Thus, generally speaking, economic status is one factor that determines the proportion of Democratic and Republican votes.

Such a fourfold, or two-by-two, table is the simplest type of cross-tabulation. Its purpose, as that of any cross-tabulation, is to find out whether the proportions to be studied vary significantly in the two (or more) subgroups of the sample.
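The way Table 2 breaks down Table 1 can be sketched in a few lines of Python. This is a minimal illustration and not part of the original study: the names `table2`, `pooled_percentage`, and `percentage_point_gap` are hypothetical, and the inputs are the rounded percentages and subgroup sizes printed in Table 2. Weighting each subgroup by its size gives roughly 52.6 per cent Republican, close to the 52 per cent of Table 1; the small discrepancy presumably reflects rounding in the printed percentages.

```python
# Percentages and subgroup sizes as printed in Table 2.
table2 = {
    "High": {"n": 2604, "Republican": 60, "Democratic": 40},
    "Low": {"n": 2556, "Republican": 45, "Democratic": 55},
}


def pooled_percentage(table, category):
    """Size-weighted average of a category's percentage across subgroups."""
    total_n = sum(group["n"] for group in table.values())
    weighted = sum(group["n"] * group[category] for group in table.values())
    return weighted / total_n


def percentage_point_gap(table, category, first, second):
    """Difference, in percentage points, between two subgroups."""
    return table[first][category] - table[second][category]


# Recombining the subgroups approximates the one-dimensional Table 1.
print(round(pooled_percentage(table2, "Republican"), 1))
# The cross-tabulation's finding: the gap between the economic strata.
print(percentage_point_gap(table2, "Republican", "High", "Low"))
```

The same two functions apply unchanged to any fourfold table whose columns carry percentages and subgroup sizes.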

Table 3 is another example from one of the many studies of automobile accidents.

TABLE 3 Accident Rate of Automobile Drivers*
Never had an accident while driving	62%
Had at least one accident while driving	38%
TOTAL	100%
(N)	(14,030)

* The example is based on poll data of the American Institute of Public Opinion as reproduced in Smash Hits of the Year (The Travelers Insurance Co., Hartford, 1940). Only the proportion of accidents is actually taken from these data; the remaining data presented here are fictitious.

If we want to find out what factors characterize the people who have automobile accidents, we must begin by finding subgroups which we suspect of having many accidents, and other groups which have relatively few. If we suspect, for instance, that the driver's sex affects the accident rate, we would break the sample down into male and female drivers, as in Table 4.

TABLE 4 Accident Rate of Male and Female Drivers
Driving Record	Men	Women
No Accidents	56%	68%
One or More Accidents	44%	32%
TOTAL	100%	100%
(N)	(7,080)	(6,950)

This table sustains the hunch that a larger proportion of male drivers than of female drivers have accidents. By introducing the additional factor (sex) into the analysis, the preliminary result is refined and light is shed on the factors that determine the original distribution.

TYPES OF CROSSTABULATION

The procedure can be extended, of course, by injecting alternative factors into the tabulation. Such a series of alternative breakdowns--by sex, by age, by economic status, and so on--is the prevalent form in which statistical surveys are presented. Yet, as the following paragraphs show, the results of such alternative cross-tabulations by various factors are unsatisfactory and sometimes even misleading. The correct procedure is to introduce each additional factor not as an alternative to, but simultaneously with, the other factors so that all possible interrelations among these factors become visible.

The simultaneous introduction of additional factors may produce any of the following effects:

It may refine the results of the simple cross-tabulation.

It may fail to refine the results of the simple cross-tabulation but may reveal an independent effect of a third factor.

It may explain the results of the simple cross-tabulation: (a) by confirming the original interpretation, or (b) by revealing the original interpretation as spurious.

THE ADDITIONAL FACTOR REFINES THE CORRELATION

By a simple cross-tabulation, users of a certain type of breakfast food were found to be more frequent among people below forty years of age than among older ones; see Table 5.

TABLE 5 Use of Breakfast Food XX, by Age
	Below 40	40 & Over
Use XX	28%	20%
Don't Use XX	72%	80%
Total	100%	100%
(N)	(1,224)	(952)

The investigator thought of sex as an additional factor influencing the use of XX breakfast food. The proper way of introducing this new factor into the analysis is shown by the scheme in Table 6. To simplify the table, the percentages of those who do not use XX were omitted:

TABLE 6 Use of Breakfast Food XX, by Sex and Age
	Men		Women
	Below 40	40 & Over	Below 40	40 & Over
Eat XX	36%	23%	20%	17%
(N)	(619)	(480)	(605)	(472)

This table presents the relationship between age and use of XX under two different conditions: one for men and one for women. Table 5 showed that a relationship exists between age and the use of XX. Table 6 now refines this knowledge by showing how this age relationship differs for the two sexes: age differentiates more sharply among men (36 per cent versus 23 per cent) than among women (20 per cent versus 17 per cent). Figure 1 shows how the percentages in Table 5 are related to those in Table 6. Moreover, by a rearrangement of columns two and three, this figure emphasizes a different aspect of Table 6: the sex difference by age.

In this graphic presentation of Table 6, the height of each bar represents 100 per cent of the respondents in the particular subgroup; the width indicates the number of persons in each of these groups. The dotted line represents the weighted average of men and women combined using breakfast food XX: 28 per cent among younger people, 20 per cent among the older ones. The solid lines show that in each age bracket there are more XX users among the men than among the women; but the sex difference is more accentuated among the young people than among the older ones (36 per cent vs. 20 per cent as against 23 per cent vs. 17 per cent).
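The weighted averages and the two readings of Table 6 can be checked with a short Python sketch. It is an illustration only: the names `table6`, `pooled_by_age`, `age_gap`, and `sex_gap` are hypothetical, and the inputs are the rounded percentages and subgroup sizes printed in Table 6. Pooling men and women within each age bracket recovers, to within rounding, the 28 and 20 per cent of Table 5.

```python
# (sex, age) -> (per cent using XX, subgroup size), as printed in Table 6.
table6 = {
    ("Men", "Below 40"): (36, 619),
    ("Men", "40 & Over"): (23, 480),
    ("Women", "Below 40"): (20, 605),
    ("Women", "40 & Over"): (17, 472),
}


def pooled_by_age(table, age):
    """Size-weighted average of men and women within one age bracket."""
    cells = [(pct, n) for (sex, a), (pct, n) in table.items() if a == age]
    return sum(pct * n for pct, n in cells) / sum(n for _, n in cells)


def age_gap(table, sex):
    """Younger minus older, in percentage points, within one sex."""
    return table[(sex, "Below 40")][0] - table[(sex, "40 & Over")][0]


def sex_gap(table, age):
    """Men minus women, in percentage points, within one age bracket."""
    return table[("Men", age)][0] - table[("Women", age)][0]


for age in ("Below 40", "40 & Over"):
    print(age, round(pooled_by_age(table6, age), 1), sex_gap(table6, age))
for sex in ("Men", "Women"):
    print(sex, age_gap(table6, sex))
```

`age_gap` corresponds to the first reading (age differentiates more sharply among men), and `sex_gap` to the rearranged reading (the sex difference is larger among the young).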

CORRELATIONS NEAR ZERO

Cases where the original correlation is zero or near zero are of special interest. Only by introducing a third factor may the interrelation of the involved factors become visible. Examine the cross-tabulation of age and listening to classical music in Table 7:

TABLE 7 Listening to Classical Music, by Age*
Listen To	Below 40	40 & Over
Classical Music	64%	64%
(N)	(603)	(676)

* This example is a modification of a table in Paul F. Lazarsfeld, Radio and the Printed Page, (New York: Duell, Sloan & Pearce, 1940), p. 98.

Contrary to expectation, there is no correlation between age and listening to classical music. However, when education is introduced into the analysis as an additional factor, Table 8 is obtained:

TABLE 8 Listening to Classical Music, by Age and Education
Education	Below 40	40 & Over
College	73% (224)	78% (251)
Below college	61% (379)	56% (425)

The various relationships are more easily seen in Figure 2. The introduction of education as an additional factor reveals that there is, in fact, a correlation between age and listening to classical music. College-educated people listen more to classical music when they are older (78 per cent vs. 73 per cent). But it is just the other way around with people on a lower educational level: they listen more to classical music when they are young (61 per cent vs. 56 per cent). If people are grouped by age, regardless of their level of education, these two tendencies compensate each other, reducing the over-all difference to nearly zero.
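The cancellation can be made concrete with a small Python sketch. It is a hedged illustration, not the original computation: the names `table8`, `older_minus_younger`, and `pooled` are hypothetical; the 78 per cent for older college-educated listeners is the figure quoted in the text; and that group's size is an assumption, taken as Table 7's total of 676 minus the 425 listed for the below-college group. Under these assumptions the two strata show age differences of +5 and -5 points, while the pooled figures (about 65.5 and 64.2 per cent) differ by little more than one point.

```python
# (education, age) -> (per cent listening, subgroup size).
# The 78 per cent is quoted in the text; the size 676 - 425 = 251 is an
# assumption derived from the age total in Table 7.
table8 = {
    ("College", "Below 40"): (73, 224),
    ("College", "40 & Over"): (78, 676 - 425),
    ("Below college", "Below 40"): (61, 379),
    ("Below college", "40 & Over"): (56, 425),
}


def older_minus_younger(table, education):
    """Age difference in percentage points within one educational level."""
    older = table[(education, "40 & Over")][0]
    younger = table[(education, "Below 40")][0]
    return older - younger


def pooled(table, age):
    """Size-weighted average over educational levels for one age bracket."""
    cells = [cell for (edu, a), cell in table.items() if a == age]
    return sum(pct * n for pct, n in cells) / sum(n for _, n in cells)


for education in ("College", "Below college"):
    print(education, older_minus_younger(table8, education))
print(round(pooled(table8, "40 & Over") - pooled(table8, "Below 40"), 1))
```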

A similar, if more complicated, situation is the substance of Table 9. It is based on a 1940 Gallup Poll designed to estimate the number of "isolationists," people who would have liked to see the United States not involved in what they considered a European war.

TABLE 9 Isolationists at Various Age and Economic Levels*
	Economic Status
Age	Upper	Middle	Lower	Total
Under 30	30%	28%	22%	26%
30 to 49	21%	23%	26%	24%
50 & Older	17%	23%	34%	26%

* Hadley Cantril & Associates, Gauging Public Opinion (Princeton, N.J.: Princeton University Press, 1944), p. 178.

From the Total column it would appear that age is not related to being an isolationist. The proportions vary only insignificantly (26 per cent-24 per cent-26 per cent). However, if the influence of age is studied separately for each economic level, a distinct relationship appears. In the upper-income bracket the young people are much more isolationist than the old ones (30 per cent vs. 17 per cent); in the lower-income bracket the situation is exactly reversed (22 per cent vs. 34 per cent). In the Total column these two tendencies compensate each other and produce a spurious pattern of non-correlation.
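The opposing age trends in Table 9 can be tabulated with a brief Python sketch. It is illustrative only: the names `AGES`, `table9`, `oldest_minus_youngest`, and `age_effect` are hypothetical, and the "Total" entries are the 26-24-26 per cent quoted in the text. No subgroup sizes are given, so the sketch compares percentages directly and does not attempt to reweight them.

```python
# Per cent isolationist, listed in the age order given by AGES.
AGES = ["Under 30", "30 to 49", "50 & Older"]
table9 = {
    "Upper": [30, 21, 17],
    "Middle": [28, 23, 23],
    "Lower": [22, 26, 34],
    "Total": [26, 24, 26],  # figures quoted in the text
}


def oldest_minus_youngest(percentages):
    """Percentage-point change from the youngest to the oldest age group."""
    return percentages[-1] - percentages[0]


# A negative value means isolationism declines with age at that level.
age_effect = {level: oldest_minus_youngest(p) for level, p in table9.items()}
for level, change in age_effect.items():
    print(level, change)
```

The sign of the change flips between the upper and the lower economic level, which is why the pooled Total column shows no change at all.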

A particularly interesting example of such a misleading non-correlation emerged from an experiment on the effectiveness of a headache remedy. [2] The manufacturer of an analgesic (A) was running short of one of the ingredients (X) that went into its making. In order to find out whether the absence of X made the analgesic less effective, 200 subjects suffering from infrequent headaches were treated in three successive two-week periods with three products on a rotating basis as follows: with the proper drug A, with drug A but lacking ingredient X, and with a placebo, an entirely inactive pill that had merely the appearance of a drug. The success of these three treatments was measured in terms of "percentage of relieved headaches" (Table 10).

TABLE 10 Effectiveness of Three Pills
Formula Used	Found Relief
A	84%
(A - X)	80%
Placebo	52%

The inactive pill clearly had a lower success rate than the two analgesics; but the difference between A and A lacking X was not statistically significant. On closer inspection, however, ingredient X did turn out to be relevant. The analyst justly reasoned that those patients who failed to react to the inactive pill would have been more sensitive test persons than those who professed that their headaches had been cured by the placebo. He therefore computed the success rates separately for these two groups, as in Table 11.

TABLE 11 Effectiveness of Two Analgesics
Among those who:	Reacted to Placebo	Did Not React to Placebo
A	82%	88%
(A - X)	84%	77%

This difference now, between 88 per cent and 77 per cent, was statistically significant. It had been obscured by being mixed up with an insignificant difference in the other direction, among those unreliable test persons who reacted to the inactive pill.

AN ADDITIONAL FACTOR REVEALS LIMITING CONDITIONS

The refinement brought about by the third factor sometimes consists of revealing that certain correlations tend to disappear under special conditions, and correspondingly increase in the absence of these conditions. A study of France's suicide statistics showed the suicide rate (number per 100,000 population) to be 20 for Catholics and 40 for Protestants; [3] the suicide rate among Catholics is exactly one-half of the Protestant rate. [4] When the two denominations were further divided by their place of residence, the data in Table 12 were obtained.

TABLE 12 Suicide Rate by Religion and Size of Community (per 100,000 population)
	Catholic	Protestant
Urban	31	38
Rural	9	41
Total	20	40

Table 12 shows that Catholics have a lower suicide rate irrespective of where they live, but the difference from Protestants is much more marked in the rural areas (9 vs. 41 per 100,000) than in the urban ones (31 vs. 38 per 100,000). Note that the data in Table 12 also permit a slightly different reading. Instead of making the comparison between Catholics and Protestants in different surroundings, one can compare the urban-rural difference among Protestants and Catholics. In Figure 3, the two arrangements, identical in substance, highlight these different aspects.

[Figure 3 about here]

Clearly, the difference is sharper between Catholics and Protestants in rural areas than it is in urban areas (A), and the difference is sharper between urban and rural Catholics than it is between urban and rural Protestants (B).
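That the two readings are identical in substance can be verified arithmetically. The following Python sketch is illustrative and uses hypothetical names (`rates`, `religion_gap`, `residence_gap`, `reading_a`, `reading_b`); the inputs are the four rates per 100,000 given in Table 12. Both readings reduce to the same difference of differences.

```python
# Suicide rates per 100,000 population, as given in Table 12.
rates = {
    ("Urban", "Catholic"): 31,
    ("Urban", "Protestant"): 38,
    ("Rural", "Catholic"): 9,
    ("Rural", "Protestant"): 41,
}


def religion_gap(place):
    """Protestant rate minus Catholic rate within one type of community."""
    return rates[(place, "Protestant")] - rates[(place, "Catholic")]


def residence_gap(religion):
    """Urban rate minus rural rate within one denomination."""
    return rates[("Urban", religion)] - rates[("Rural", religion)]


# Reading A: how much larger is the denominational gap in rural areas?
reading_a = religion_gap("Rural") - religion_gap("Urban")
# Reading B: how much larger is the urban-rural gap among Catholics?
reading_b = residence_gap("Catholic") - residence_gap("Protestant")

print(religion_gap("Rural"), religion_gap("Urban"), reading_a)
print(residence_gap("Catholic"), residence_gap("Protestant"), reading_b)
```

Because both contrasts are built from the same four cells, `reading_a` and `reading_b` must agree; the choice between them is one of emphasis, not of evidence.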

THE ADDITIONAL FACTOR HAS AN INDEPENDENT EFFECT

Sometimes the third factor may turn out to have no effect on the original correlation and hence does not refine it. Instead, it has an independent influence on the factor that in the original cross-tabulation was considered the effect.

If we introduce religion into Table 2, the pre-election poll result by economic status, we obtain Table 13:

TABLE 13 Election Poll in County X, by Economic Status and Religion (Percentage Voting Republican)
	Economic Status
Religion	High	Low
Catholics	27%	19%
Protestants	69%	52%

On each economic level, the Catholics produce less than half as many Republican votes as the Protestants (compare vertically), and within each religious group the higher economic strata produce more Republican votes than the lower strata (compare horizontally). Again, it will be helpful to see the relationship between these four cells graphically, as in Figure 4.

[Figure 4 about here]

The two graphs make it clear that both factors, economic level and religion, exert their influence more or less independently; hence, the proportion of Republican votes is highest among the well-to-do Protestants and lowest among the poor Catholics.
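The vertical and horizontal comparisons of Table 13 can be spelled out in a short Python sketch. It is an illustration under stated assumptions: the names `republican`, `economic_gap`, and `religion_ratio` are hypothetical, and the inputs are the four percentages printed in Table 13. The economic gap has the same sign within each religious group, and the Catholic share stays below half the Protestant share at both economic levels, which is one way of expressing that the two factors act more or less independently.

```python
# Per cent voting Republican, as printed in Table 13.
republican = {
    ("Catholics", "High"): 27,
    ("Catholics", "Low"): 19,
    ("Protestants", "High"): 69,
    ("Protestants", "Low"): 52,
}


def economic_gap(religion):
    """High minus low economic status, in points, within one religion."""
    return republican[(religion, "High")] - republican[(religion, "Low")]


def religion_ratio(status):
    """Catholic percentage as a fraction of the Protestant percentage."""
    return republican[("Catholics", status)] / republican[("Protestants", status)]


for religion in ("Catholics", "Protestants"):
    print(religion, economic_gap(religion))
for status in ("High", "Low"):
    print(status, round(religion_ratio(status), 2))

# The extreme cells named in the text.
print(max(republican, key=republican.get), min(republican, key=republican.get))
```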

SUMMARY

The cross-tabulation, that is, the breakdown of a distribution into subgroups, is the most common device of survey analysis. This chapter discusses a preliminary function of the cross-tabulation, namely, the setting off of differences in distributions, by pinpointing the subgroup in which certain measures reach their extreme values: In which population stratum is the Republican vote most concentrated? Who is most likely to listen to classical music? Which population segments have the lowest suicide rates? This refinement operation sets the ground for the step to be discussed in the next chapter: to discover why the differences occur at the points revealed by this refinement operation.

End Notes

[1] This table and the ones that follow are adaptations from a study of the Willkie-Roosevelt presidential campaign of 1940, as mirrored in a small Ohio town. That pioneering study was later published in Paul F. Lazarsfeld, Hazel Gaudet, and Bernard Berelson, The People's Choice (New York: Columbia University Press, 1948). See also Tables 13-17ff.

[2] E. M. Jellinek, "Clinical Tests on Comparative Effectiveness of Analgesic Drugs," Biometric Bulletin of the American Statistical Association, October 1946, pp. 87-91.

[3] From M. Halbwachs, Les Causes du Suicide (Paris: 1930), Chap. 4.

[4] See also the low suicide rate of Ireland in Table 2-2.