Cross tabulations (also known as cross tabs, or contingency tables) often arise in data analysis, whenever data can be placed into two distinct sets of categories. In market research, for example, we might categorize purchases of a range of products made at selected locations; or in medical testing, we might record adverse drug reactions according to symptoms and whether the patient received the standard or placebo treatment.
The statistical technique presented in this article, correspondence analysis, provides a means of graphically representing the structure of cross tabulations so as to shed light on the underlying mechanisms. The article provides a practical introduction to correspondence analysis in the form of a “five-finger exercise” in textual analysis—identifying the author of a text given samples of the works of likely candidates.
1. Introduction
Correspondence analysis is a statistical technique that provides a graphical representation of cross tabulations (which are also known as cross tabs, or contingency tables). Cross tabulations arise whenever it is possible to place events into two or more different sets of categories, such as product and location for purchases in market research or symptom and treatment in medical testing. This article provides a brief introduction to correspondence analysis in the form of an exercise in textual analysis—identifying the author of a text based on examination of its characteristics. The exercise is carried out using Mathematica (Version 5.2).
Perhaps the most illustrious exponent of textual analysis is the self-styled “literary detective” Donald Foster, whose 2001 book [1] describes how he identified the authors of several anonymous works, including the best-selling roman-à-clef Primary Colors [2], which satirized the 1992 Clinton presidential campaign. Foster’s methodology examines a broad spectrum of text characteristics, including word choice, punctuation, grammatical structure, and the like. The aim of the exercise in this article is to emulate Foster, though naturally the literary aspects of the approach taken are much more basic—the intent is not to describe a realistic method of textual analysis, but rather to use it to illustrate correspondence analysis.
Consider the following list of writers from the seventeenth through nineteenth centuries.
Imagine that we are given two fragments of text written by one or two of these writers, and charged with identifying the true author(s). To make things interesting, imagine also that the only information we are given about an unidentified fragment of text is the frequency with which certain letters appear in it. Accordingly, I have taken three distinct samples of about 1000 characters each from the writings of each of these authors, and added up the number of times each of the following characters appears in each of the samples (restricting ourselves to fewer than the complete alphabet prevents the tables in the rest of the discussion from becoming unwieldy; the characters chosen happen to occur with middling frequency in all the texts as a whole).
This is the cross tabulation.
2. Calculations
Is it possible to say with reasonable certainty that the distribution of letters differs significantly from sample to sample (i.e., from row to row in the cross tab)? The usual means of answering such questions is Pearson’s $\chi^2$ test for independence; it tests whether a cross tab deviates significantly from one in which rows and columns are independent. In our case, independence would imply that the letters occur with the same frequency in all of the text samples.
Assume that the cross tab under examination is described formally by the $m \times k$ matrix $N = [n_{ij}]$. We derive the correspondence matrix $P = [p_{ij}]$ from $N$ by dividing its entries by their grand total $n$:
$$p_{ij} = \frac{n_{ij}}{n}, \qquad n = \sum_{i=1}^{m} \sum_{j=1}^{k} n_{ij}. \tag{1}$$
Next, define row and column totals $r_i$ and $c_j$:
$$r_i = \sum_{j} p_{ij}, \qquad c_j = \sum_{i} p_{ij}. \tag{2}$$
The statistic, $\chi^2$, is calculated:
$$\chi^2 = n \sum_{i} \sum_{j} \frac{(p_{ij} - \mu_{ij})^2}{\mu_{ij}}. \tag{3}$$
Here $\mu_{ij}$ is an estimate of an entry’s value assuming independence:
$$\mu_{ij} = r_i c_j. \tag{4}$$
For our example, the calculations may be expressed in Mathematica as follows.
If rows and columns really are independent (i.e., “under the null hypothesis”), $\chi^2$ should follow a $\chi^2$ distribution with $(m - 1)(k - 1)$ degrees of freedom, where $m$ and $k$ are the numbers of rows and columns in the cross tab. We can compare the actual value computed for the example cross tab with its distribution under the null hypothesis as follows.
Thus there is (almost) no probability under the null hypothesis of observing a $\chi^2$ statistic as large as the one actually observed, and indeed only a 1% probability of seeing a value about half as large. According to the $\chi^2$ test, therefore, there is a statistically significant difference in the distribution of letters across the samples.
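The calculation in equations (1) to (4) can be sketched in a few lines. The following is a minimal Python illustration rather than the article's Mathematica code; the table `counts` is made up for the purpose (it is not the letter-frequency cross tab), and the function name `chi_square` is an assumption of this sketch.

```python
# Hypothetical 3 x 4 table of counts (NOT the article's letter-frequency data).
counts = [
    [20, 30, 25, 25],
    [35, 15, 30, 20],
    [10, 40, 20, 30],
]

def chi_square(table):
    """Return Pearson's chi-square statistic and its degrees of freedom."""
    n = sum(sum(row) for row in table)               # grand total
    P = [[x / n for x in row] for row in table]      # correspondence matrix, eq. (1)
    r = [sum(row) for row in P]                      # row totals, eq. (2)
    c = [sum(col) for col in zip(*P)]                # column totals, eq. (2)
    stat = n * sum(
        (P[i][j] - r[i] * c[j]) ** 2 / (r[i] * c[j])  # eqs. (3) and (4)
        for i in range(len(r))
        for j in range(len(c))
    )
    dof = (len(r) - 1) * (len(c) - 1)
    return stat, dof

stat, dof = chi_square(counts)
print(f"chi-square = {stat:.3f} on {dof} degrees of freedom")
```

Comparing the statistic with the tail of the $\chi^2$ distribution needs a distribution function, which this sketch deliberately leaves out.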
Unfortunately, the $\chi^2$ test by itself does not provide a solution to the problem of distinguishing the works of the different authors. Though it establishes that the distribution of letters differs significantly from one sample to another, it does not tell us whether the samples of one author differ from those of other authors more than they differ from each other, nor does it allow us to characterize the authors in terms of the distribution of letters in their works. Answers to these questions are provided by correspondence analysis.
3. Distances
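For the purposes of correspondence analysis, the differences between the distributions of letters in the text samples (which, you will recall, are given in the rows of the cross tab) are measured by so-called $\chi^2$ distances. These are weighted Euclidean distances between normalized rows (calculated by dividing row entries by their respective row total), in which each coordinate difference is divided by the square root of the corresponding column total. In symbols, writing $p_{ij}$ for the entries of the correspondence matrix and $r_i$, $c_j$ for its row and column totals as in (2), the $\chi^2$ distance $d(i, i')$ between row $i$ and row $i'$ is given by the expression:
$$d(i, i') = \sqrt{\sum_{j} \frac{1}{c_j} \left( \frac{p_{ij}}{r_i} - \frac{p_{i'j}}{r_{i'}} \right)^2}. \tag{5}$$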
This computes the distances between the text samples using the correspondence matrix and displays them in a reasonably compact table (after scaling up by 100 and rounding).
Certain characteristics of the samples can be detected in the table above. For example, it appears that the Mark Twain (MT) texts form a relatively isolated group, in that the distances from the MT samples to each other are considerably smaller than from the MT samples to those of other authors. By itself, however, the table does little to make apparent the overall pattern of the distances, a task taken up in the following sections. Before that, however, here is a little more on the nature of $\chi^2$ distances.
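As a rough illustration of equation (5), the following Python sketch builds a table of $\chi^2$ distances between the rows of a small made-up table and prints it scaled by 100, in the spirit of the table above. The data and the name `chi_square_distances` are assumptions of the sketch, not part of the article.

```python
from math import sqrt

# Hypothetical counts: three "samples" (rows) by four "characters" (columns).
counts = [
    [20, 30, 25, 25],
    [35, 15, 30, 20],
    [10, 40, 20, 30],
]

def chi_square_distances(table):
    """Matrix of chi-square distances between the rows of a table, as in (5)."""
    n = sum(sum(row) for row in table)
    P = [[x / n for x in row] for row in table]
    r = [sum(row) for row in P]                      # row totals
    c = [sum(col) for col in zip(*P)]                # column totals
    # Normalized rows: each row of P divided by its own total.
    profiles = [[p / r[i] for p in row] for i, row in enumerate(P)]

    def dist(a, b):
        return sqrt(sum((x - y) ** 2 / cj for x, y, cj in zip(a, b, c)))

    return [[dist(a, b) for b in profiles] for a in profiles]

D = chi_square_distances(counts)
for row in D:
    print(" ".join(f"{100 * d:6.1f}" for d in row))
```

Rows whose counts are proportional to one another have identical normalized rows and therefore lie at distance zero, whatever their totals.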
As their name suggests, $\chi^2$ distances are closely related to the $\chi^2$ statistic of the previous section. To show how they are related, consider the “average” row (termed the centroid or barycenter in correspondence analysis), whose entries are simply the column totals:
$$c = (c_1, c_2, \ldots, c_k). \tag{6}$$
From equation (5), since the row total for the centroid is 1 (by the definition of $P$), the distance $d_i$ of row $i$ to the centroid satisfies:
$$d_i^2 = \sum_{j} \frac{1}{c_j} \left( \frac{p_{ij}}{r_i} - c_j \right)^2. \tag{7}$$
Now with $\mu_{ij} = r_i c_j$ as defined in (4):
$$d_i^2 = \frac{1}{r_i} \sum_{j} \frac{(p_{ij} - \mu_{ij})^2}{\mu_{ij}}. \tag{8}$$
Drawing an analogy with the physical concept of moment of inertia, correspondence analysis defines the inertia of a row as the product of the row total $r_i$ (which is referred to as the row’s mass) and the square of its distance to the centroid, $r_i d_i^2$. Comparing the expression for $d_i^2$ in (8) with the definition of the $\chi^2$ statistic in (3), it follows that the total inertia of all the rows in a contingency matrix is equal to the $\chi^2$ statistic divided by the grand total $n$, a quantity known as Pearson’s mean-square contingency, denoted $\phi^2$:
$$\phi^2 = \sum_{i} r_i d_i^2 = \sum_{i} \sum_{j} \frac{(p_{ij} - \mu_{ij})^2}{\mu_{ij}} = \frac{\chi^2}{n}. \tag{9}$$
The total inertia of a table is used to assess the quality of its graphical representation in correspondence analysis. For future reference, we can calculate $\phi^2$ for our dataset.
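The identity in (9) is easy to check numerically. The Python sketch below computes the total inertia in two ways on a made-up table: as the mass-weighted sum of squared distances to the centroid, and as the $\chi^2$ statistic divided by the grand total. The helper names and the data are assumptions of the sketch.

```python
def masses(table):
    """Correspondence matrix P with its row totals r and column totals c."""
    n = sum(sum(row) for row in table)
    P = [[x / n for x in row] for row in table]
    r = [sum(row) for row in P]
    c = [sum(col) for col in zip(*P)]
    return P, r, c

def total_inertia(table):
    """Sum over rows of mass times squared chi-square distance to the centroid."""
    P, r, c = masses(table)
    inertia = 0.0
    for i, row in enumerate(P):
        d2 = sum((p / r[i] - cj) ** 2 / cj for p, cj in zip(row, c))  # eq. (7)
        inertia += r[i] * d2
    return inertia

def chi_square_over_n(table):
    """Mean-square contingency computed directly from (3) and (4)."""
    P, r, c = masses(table)
    return sum(
        (P[i][j] - r[i] * c[j]) ** 2 / (r[i] * c[j])
        for i in range(len(r))
        for j in range(len(c))
    )

# Hypothetical counts (not the article's data).
counts = [
    [20, 30, 25, 25],
    [35, 15, 30, 20],
    [10, 40, 20, 30],
]
print(total_inertia(counts), chi_square_over_n(counts))
```

The two printed values should agree up to floating-point rounding.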
4. Calculating Row Scores
Correspondence analysis provides a means of representing a table of distances in a graphical form, with rows represented by points, so that the distances between points approximate the distances between the rows they represent.
To compute such a representation, we begin with a matrix $S = [s_{ij}]$ of standardized residuals, which are the (signed) square roots of the terms comprising the $\chi^2$ statistic in (3):
$$s_{ij} = \frac{p_{ij} - \mu_{ij}}{\sqrt{\mu_{ij}}}. \tag{10}$$
Next, we compute the singular value decomposition of $S$, which is to say that we find orthogonal matrices $U$ and $V$, together with a diagonal matrix $\Sigma$, such that (with the transpose of matrix $A$ denoted $A^{\mathsf T}$, and writing $I$ for the identity matrix):
$$S = U \Sigma V^{\mathsf T}, \qquad U^{\mathsf T} U = V^{\mathsf T} V = I. \tag{11}$$
The scores of the rows, whose interpretation we discuss later, are given by the expression:
$$F = D_r^{-1/2}\, U \Sigma. \tag{12}$$
Here $D_r^{-1/2}$ is the diagonal matrix comprising the reciprocals of the square roots of the row totals:
$$D_r^{-1/2} = \operatorname{diag}\!\left( \frac{1}{\sqrt{r_1}}, \ldots, \frac{1}{\sqrt{r_m}} \right). \tag{13}$$
The scores of the rows in our sample cross tab are computed in the following (left multiplication by $D_r^{-1/2}$ being more conveniently carried out in Mathematica by row-wise division).
The row scores may be thought of as the coordinates of points in a high-dimensional space (14-dimensional, as it turns out in this case).
These points are arranged so that the Euclidean distance between two points is equal to the $\chi^2$ distance between the two rows to which they correspond. To show how the $\chi^2$ distances between the rows are reflected in their scores, the following reconstitutes the distances in the previous section from Euclidean distances between the scores computed above. As you can see, the distances are recovered perfectly.
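Why the recovery is exact can be sketched in a few lines. The derivation below is a reconstruction from (10) to (13) rather than something taken from the article; $f_i$ denotes the $i$-th row of $F$, a notation introduced here for convenience. Because $V^{\mathsf T} V = I$:

$$
\begin{aligned}
F F^{\mathsf T} &= D_r^{-1/2}\, U \Sigma \Sigma^{\mathsf T} U^{\mathsf T} D_r^{-1/2} = D_r^{-1/2}\, S S^{\mathsf T} D_r^{-1/2}, \\
f_i \cdot f_{i'} &= \sum_{j} \frac{s_{ij}\, s_{i'j}}{\sqrt{r_i r_{i'}}} = \sum_{j} \frac{1}{c_j} \left( \frac{p_{ij}}{r_i} - c_j \right) \left( \frac{p_{i'j}}{r_{i'}} - c_j \right), \\
\lVert f_i - f_{i'} \rVert^2 &= f_i \cdot f_i - 2\, f_i \cdot f_{i'} + f_{i'} \cdot f_{i'} = \sum_{j} \frac{1}{c_j} \left( \frac{p_{ij}}{r_i} - \frac{p_{i'j}}{r_{i'}} \right)^2 = d(i, i')^2 .
\end{aligned}
$$

The second line with $i' = i$ also gives $\lVert f_i \rVert^2 = d_i^2$, so each row point lies at its $\chi^2$ distance from the origin.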
5. Plotting Rows
Although the row scores faithfully reproduce the distances between rows in the original table, as coordinates their dimensionality is far too high for them to be presented graphically. Thanks to the properties of the singular value decomposition, however, taking just the first two components of each row’s score usually produces a reasonable approximation to the distances, and yields coordinates that can be placed on a two-dimensional plot. (Below we have labeled the components “X” and “Y” to highlight their role as 2D coordinates.)
The following displays each row’s (abbreviated) label at the position given by its coordinates and returns a key to the abbreviations.
Figure 1. Row plot for text samples.
The plot gives a much clearer picture of the way in which the letters are distributed across the text samples. For example, it is quite evident that, as we concluded from the table of distances, the Mark Twain samples differ significantly as a group from those of the other writers. The text samples of Darwin and Hobbes also appear to be sui generis, though the Descartes and Shelley samples appear less distinct. The plot suggests that it may be possible, therefore, to distinguish between the works of at least some of the authors using correspondence analysis of their letter distributions.
Diagnostics
Since it uses only the first two components of the row scores, the plot above only approximates the true configuration of the rows in the cross tab. Before using it to make firm inferences, we might try to gauge the quality of the representation it provides. One indicator is derived from the inertia of the rows defined in (9). Recall that $\phi^2$, the total inertia of the rows, is calculated from the row totals $r_i$ and the $\chi^2$ distances $d_i$ of the rows to the centroid:
$$\phi^2 = \sum_{i} r_i d_i^2. \tag{14}$$
It may be shown that for any contingency matrix, the procedure of the previous section always places the centroid at the origin of the plot. Therefore, since Euclidean distances on the plot are supposed to approximate $\chi^2$ distances, replacing each distance $d_i$ in the right-hand side of (14) by the distance of the corresponding row point to the origin should yield an approximation to $\phi^2$. The following derives this quantity and computes its ratio to the true value of $\phi^2$.
Thus our two-dimensional plot captures about 56% of the total inertia of the table rows. While this seems hardly an impressive fraction, Murtagh ([3], p. 39) points out that ratios like this are not uncommon in correspondence analysis, and do not necessarily point to a bad representation. Nonetheless, we might want to exercise a modicum of caution before drawing categorical conclusions from our analysis.
As an aside, it turns out that the total inertia $\phi^2$ of the contingency matrix, which was calculated “longhand” in (9), is equal to the sum of the squares of the diagonal elements of the matrix $\Sigma$ in (11). The latter comprise the singular values of the matrix $S$. Furthermore, the inertia retained in the two-dimensional plot is simply the sum of the squares of the first two singular values in $\Sigma$. Thus the following is an equivalent expression of the plot’s inertia.
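Both statements follow from the orthogonality conditions in (11). The short derivation below is a sketch under the definitions in (9) to (13), not a proof taken from the article; $\sigma_\ell$ denotes the $\ell$-th diagonal element of $\Sigma$ and $f_{i\ell}$ the $\ell$-th component of the score of row $i$, both notations introduced here.

$$
\begin{aligned}
\phi^2 &= \sum_{i} \sum_{j} s_{ij}^2 = \operatorname{tr}\!\left( S^{\mathsf T} S \right) = \operatorname{tr}\!\left( V \Sigma^{\mathsf T} U^{\mathsf T} U \Sigma V^{\mathsf T} \right) = \operatorname{tr}\!\left( \Sigma^{\mathsf T} \Sigma \right) = \sum_{\ell} \sigma_\ell^2, \\
F^{\mathsf T} D_r F &= \Sigma^{\mathsf T} U^{\mathsf T} D_r^{-1/2} D_r D_r^{-1/2} U \Sigma = \Sigma^{\mathsf T} \Sigma
\;\Longrightarrow\; \sum_{i} r_i \left( f_{i1}^2 + f_{i2}^2 \right) = \sigma_1^2 + \sigma_2^2 .
\end{aligned}
$$

The diagnostic ratio of the two-dimensional plot is therefore $(\sigma_1^2 + \sigma_2^2) / \sum_\ell \sigma_\ell^2$.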
6. Plotting Columns
We have seen how correspondence analysis can be used to derive a visual representation of the relationships between the rows of a contingency matrix. We can also use correspondence analysis to illustrate the relationship between the rows and the columns of a correspondence matrix—between the texts and letters in our example. Since our primary concern is with the text samples, the rows of the cross tab, it might seem a digression to look at the cross tab columns (the characters appearing in the texts), but we will see in the next section that the geometry of the columns is central to the identification of the mystery texts.
As with the rows, we begin by deriving scores for the columns from the singular value decomposition in (11). With reference to (11), the matrix $G$, whose rows are the column scores, is calculated as follows:
$$G = D_c^{-1/2}\, V, \tag{15}$$
where
$$D_c^{-1/2} = \operatorname{diag}\!\left( \frac{1}{\sqrt{c_1}}, \ldots, \frac{1}{\sqrt{c_k}} \right). \tag{16}$$
Again, left multiplication by the diagonal matrix $D_c^{-1/2}$ is more conveniently expressed as element-wise division in Mathematica.
As before, the two-dimensional column coordinates are simply the first two components of the scores.
We can display both columns and rows on the same plot with a slight elaboration of the method used to plot the rows alone. The column coordinates are scaled so that the column points occupy roughly the same region of the plot as the row points.
Figure 2. Row and column plot for text samples.
Interpreting the relationships between rows and columns from a plot such as this is not as straightforward as it was for the previous plot with the rows only. For example, it is not true in general that the closer a column appears to a row, the greater the prevalence of the corresponding letter in the corresponding text sample.
To show how such relationships are actually represented, consider the text sample “MT2” (a row) and characters “P” and “Y” (columns).
Possibly the simplest way to determine the relationship between a text sample and a character is to draw lines from their corresponding points in the plot to the origin. If the angle between the two lines is acute, then the character occurs more often in the sample than it does on average in the texts as a whole. Conversely, if the angle is obtuse, the character occurs less often than overall. The following draws the appropriate lines for our chosen text sample and characters; it appears the character “Y” occurs more often than average in “MT2”, while “P” occurs less often.
Figure 3. Simple analysis of row/column plot.
Unfortunately, the method described above only tells us if a character appears more or less often than average in a text sample, not whether one character appears more often than another in a sample. In particular, an angle that is more acute does not signify a character that is more prevalent in a text.
A rather more complicated method that does illustrate the relative frequencies of characters in a text sample entails first drawing a line on the plot through the origin and the point corresponding to the text sample in question. Perpendiculars to this line are dropped from each character’s position on the plot. The following draws such a construction for the selected text sample “MT2”.
Figure 4. Comprehensive analysis of row/column plot.
The relative frequencies of the characters in the text sample can be read off by traversing the line through the text sample (colored blue and green on the plot above), looking at the positions at which the perpendiculars from the characters intersect it. A character with an intersection on the green line segment (i.e., on the same side of the origin as the text sample) occurs more often in the sample than the average in the texts overall, whereas one on the blue line segment (on the other side of the origin) occurs less frequently than the average. In addition, the further from the origin on the green line segment such an intersection occurs, the greater the frequency of the character in the sample relative to its average frequency in the texts overall. Conversely, the further out on the blue segment an intersection falls, the lower the frequency of the character in the sample relative to that average.
So from the plot above, it appears that the character “W” is the most over-represented in the sample text relative to its average frequency, and that characters “L”, “D”, “Y”, “G”, “U”, “B”, “S”, “I”, “N”, “H”, “C”, “M”, “P”, “F”, and “R” are successively less so; characters “W” through “B” in the ranking appear more often than average, while “S” through “R” appear less often than average.
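The geometry behind both constructions can be sketched with a short derivation. It is a reconstruction that assumes row scores $F = D_r^{-1/2} U \Sigma$ and column scores $G = D_c^{-1/2} V$; the symbols $f_i$ (row $i$ of $F$), $g_j$ (row $j$ of $G$), and $\theta_{ij}$ (the angle between them at the origin) are introduced here for convenience. In the full-dimensional space:

$$
\begin{aligned}
f_i \cdot g_j &= \left( F G^{\mathsf T} \right)_{ij} = \left( D_r^{-1/2}\, U \Sigma V^{\mathsf T} D_c^{-1/2} \right)_{ij} = \frac{s_{ij}}{\sqrt{r_i c_j}} = \frac{p_{ij}/r_i - c_j}{c_j}, \\
\cos \theta_{ij} &= \frac{f_i \cdot g_j}{\lVert f_i \rVert\, \lVert g_j \rVert}, \qquad
\frac{f_i \cdot g_j}{\lVert f_i \rVert} = \frac{1}{d_i} \left( \frac{p_{ij}/r_i}{c_j} - 1 \right).
\end{aligned}
$$

The angle is thus acute exactly when the character's share of the sample, $p_{ij}/r_i$, exceeds its overall share $c_j$, and the signed length of the projection of $g_j$ onto the line through $f_i$ ranks the characters by the ratio of those two shares. On the two-dimensional plot these relations hold only approximately; rescaling all the column points by a common factor changes neither the signs nor the ranking.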
7. Supplementary Points: Identifying the Mystery Texts
Finally, we return to the problem we faced at the outset: identifying the author or authors of the unidentified text fragments. We have seen how the application of simple correspondence analysis to the text samples allows us to view them graphically in terms of their letter distributions. In Section 5 we saw that it was generally possible to distinguish the authors of the text samples based upon the locations of the corresponding row points—with a few exceptions, samples of work by the same writer tended to occupy the same area of the plot. One might logically surmise that if we were to plot the mystery texts on the same correspondence plot as the samples, we would be able to determine their authorship by looking at the authors of the nearest samples. To begin, we need to calculate an additional cross tab containing the distribution of the selected characters in the mystery texts.
We could proceed by simply appending these frequencies as new rows to the original text samples cross tab given in Section 1 and recalculating the scores and coordinates for all the rows (that is, both the original samples and the mystery texts) in the resulting table. In principle, however, it is possible that the unidentified texts overlap one or more of the text samples, and if this were the case, appending the new rows to the cross tab would distort the analysis by “double-counting” some of the samples.
A more satisfactory approach derives from the fact that the row scores computed in Section 4 are actually weighted sums of the column scores calculated in Section 6. In matrix terms, recalling that $P$ is the correspondence matrix and $G$ is the matrix of column scores, it can be shown that:
$$F = D_r^{-1} P\, G, \tag{17}$$
where
$$D_r^{-1} = \operatorname{diag}\!\left( \frac{1}{r_1}, \ldots, \frac{1}{r_m} \right). \tag{18}$$
If we replace the original correspondence matrix $P$ in (17) with a new correspondence matrix $P^*$ formed from the cross tab of the unidentified texts (and the row totals in (18) with those of $P^*$), we derive a set of row scores $F^*$ for the unidentified texts according to the transformation determined by the text samples only (since they alone produced the column scores $G$), eliminating the risk of double-counting. Treated in this way, the unidentified texts comprise supplementary points in the terminology of correspondence analysis.
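The whole procedure can be sketched in Python on made-up data, as a rough illustration rather than the article's Mathematica code. The sketch makes several assumptions that are not in the article: the tables `samples` and `mystery` are invented, all function names are hypothetical, only the two leading components are computed, and the singular vectors are found by simple power iteration on $S^{\mathsf T} S$ (chosen to keep the example dependency-free; a library SVD routine would normally be used instead).

```python
from math import sqrt

def masses(table):
    """Correspondence matrix P with its row totals r and column totals c."""
    n = sum(sum(row) for row in table)
    P = [[x / n for x in row] for row in table]
    r = [sum(row) for row in P]
    c = [sum(col) for col in zip(*P)]
    return P, r, c

def leading_right_vectors(S, count=2, iterations=500):
    """Leading right singular vectors of S: power iteration on S^T S, with deflation."""
    k = len(S[0])
    A = [[sum(row[a] * row[b] for row in S) for b in range(k)] for a in range(k)]
    vectors = []
    for t in range(count):
        v = [1.0 / (j + t + 1) for j in range(k)]        # arbitrary starting vector
        for _ in range(iterations):
            w = [sum(A[a][b] * v[b] for b in range(k)) for a in range(k)]
            norm = sqrt(sum(x * x for x in w))
            v = [x / norm for x in w]
        lam = sum(v[a] * A[a][b] * v[b] for a in range(k) for b in range(k))
        # Deflate: remove the component just found before seeking the next one.
        A = [[A[a][b] - lam * v[a] * v[b] for b in range(k)] for a in range(k)]
        vectors.append(v)
    return vectors

def correspondence_coordinates(table):
    """Two-dimensional row coordinates F and column coordinates G."""
    P, r, c = masses(table)
    rows, cols = range(len(r)), range(len(c))
    # Standardized residuals, as in (10).
    S = [[(P[i][j] - r[i] * c[j]) / sqrt(r[i] * c[j]) for j in cols] for i in rows]
    V = leading_right_vectors(S)
    # Column scores as in (15): one row of G per column of the table.
    G = [[v[j] / sqrt(c[j]) for v in V] for j in cols]
    # Row scores as in (12), using the identity U Sigma = S V.
    F = [[sum(S[i][j] * v[j] for j in cols) / sqrt(r[i]) for v in V] for i in rows]
    return F, G

def supplementary_rows(new_counts, G):
    """Equation (17) for new rows: profile-weighted sums of the column scores."""
    points = []
    for row in new_counts:
        total = sum(row)
        points.append([sum(x / total * g[d] for x, g in zip(row, G)) for d in range(2)])
    return points

samples = [[30, 10, 20, 40], [10, 40, 30, 20], [25, 25, 35, 15]]  # made-up counts
mystery = [[12, 38, 31, 19]]                                       # made-up unknown row

F, G = correspondence_coordinates(samples)
(point,) = supplementary_rows(mystery, G)
nearest = min(range(len(F)), key=lambda i: sum((a - b) ** 2 for a, b in zip(F[i], point)))
print("nearest sample row:", nearest)
```

Applying `supplementary_rows` to the sample table itself should reproduce `F`, which is a convenient check that the column scores and equation (17) have been implemented consistently.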
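The following calculates row scores for the mystery texts as supplementary points. Straightforward algebra vindicates the direct use of the new cross tab without the need to derive a new correspondence matrix: dividing each row of counts by its own total yields the same normalized row as dividing the corresponding row of the correspondence matrix by its row total, because the grand total cancels.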
Lastly, as with the rows and columns, we take the first two elements of the scores above to produce the coordinates of the supplementary points. In the following, they are displayed on the same plot as the original rows.
Figure 5. Plot of the mystery texts as supplementary points.
All points on this plot represent texts, or rows, and distances between points can be interpreted directly as degrees of similarity, just as with the row plot in Section 5. On this basis, judging by their closeness to the authors’ other works, it appears that mystery texts 1 and 2 belong to Mark Twain and Thomas Hobbes respectively. While the manifest isolation of the Mark Twain texts on the plot leaves little doubt as to the provenance of the first unidentified text, the author of the second is a little less clearly defined—particularly given the middling diagnostic ratio calculated in Section 5. Nonetheless, I am sure you will agree that considering the rather scant literary information on which the analysis was based (amounting to no more than a table of letter frequencies), the results are encouraging.
8. Conclusion
Correspondence analysis has a long and storied history that can be traced as far back as the 1930s. We have only scratched the surface of the subject in this brief introductory article. Of course, I have omitted proofs of the various assertions I have made in the course of the presentation. Furthermore, I have glossed over an important choice concerning the scaling of row and column scores and coordinates; I have used so-called row principal scoring (which preserves distances between rows, but not columns), but there are other approaches that are equally valid.
A number of extensions exist to the so-called simple correspondence analysis presented here. Most important are multiple and joint correspondence analysis, which apply to contingency tables involving three or more variables or sets of categories (see [4] for details). For a comprehensive examination of correspondence analysis and related techniques, Greenacre’s early book [5] remains among the best texts (in the English language, at least), though it is unfortunately currently out of print. Later books by Greenacre [6] and coeditor Blasius [7] explore applications of correspondence analysis and extensions to the basic methodology. Benzécri’s treatise [8] is notable in that its author championed the use of correspondence analysis for many years, developing many of the geometric underpinnings that inform modern practice and establishing a seminal school of statistical analysis in France; unfortunately, translation from the original French and a prodigious price detract from the appeal of the text itself. Most recently, Murtagh [3] gives a thorough (if somewhat telegraphic) treatment of the subject, with an emphasis on the coding of data for analysis. Sections devoted to correspondence analysis also appear in the books by Agresti [9], Borg and Groenen [10], and Legendre and Legendre [11].
In his foreword to [3], Benzécri writes of the immense opportunities afforded statisticians by “inexpensive means of computation that could not be dreamed of just thirty years ago” (indeed, correspondence analysis of realistically sized datasets is all but impossible without a computer). I hope that this article has demonstrated that Mathematica can play a valuable role in allowing all of us, statistician and non-statistician alike, to take advantage of these opportunities.
References
[1] D. W. Foster, Author Unknown: Tales of a Literary Detective, New York: Henry Holt & Company, 2001.
[2] [J. Klein], Primary Colors: A Novel of Politics, New York: Random House, 1996.
[3] F. Murtagh, Correspondence Analysis and Data Coding with Java and R, Boca Raton: Chapman & Hall/CRC, 2005.
[4] M. J. Greenacre, “Multiple and Joint Correspondence Analysis,” Correspondence Analysis in the Social Sciences (M. J. Greenacre and J. Blasius, eds.), London: Academic Press, 1994, pp. 141-161.
[5] M. J. Greenacre, Theory and Applications of Correspondence Analysis, London: Academic Press, 1984.
[6] M. J. Greenacre, Correspondence Analysis in Practice, London: Academic Press, 1993.
[7] M. J. Greenacre and J. Blasius, eds., Correspondence Analysis in the Social Sciences: Recent Developments and Applications, London: Academic Press, 1994.
[8] J. P. Benzécri, Correspondence Analysis Handbook, New York: Marcel Dekker, 1992.
[9] A. Agresti, Categorical Data Analysis, 2nd ed., New York: Wiley, 2002.
[10] I. Borg and P. Groenen, Modern Multidimensional Scaling: Theory and Applications, New York: Springer, 1997.
[11] P. Legendre and L. Legendre, Numerical Ecology, 2nd English ed., New York: Elsevier Science, 1998.
P. Yelland, “An Introduction to Correspondence Analysis,” The Mathematica Journal, 2010. dx.doi.org/doi:10.3888/tmj.12-4.
About the Author
Phillip Yelland is a data analyst at Facebook, Inc., where his work centers on the use of statistical techniques in fraud detection and risk management. He has an M.A. and a Ph.D. in computer science from the University of Cambridge in England, and an M.B.A. from the University of California at Berkeley.
Phillip Yelland
Facebook, Inc.
1601 South California Avenue
Palo Alto, CA 94304
phillip.yelland@gmail.com