Assignment 3


Part 1: Hypothesis tests and collocations


Let's continue exploring the congressional speech corpus as in Assignment 2. Just as for Assignment 2, if American politics is an unfamiliar topic, feel free to discuss your data with classmates (it's fine to use Piazza) or with friends who know the topic better than you do.
  1. Pick an association score based on a statistical hypothesis test, e.g. the chi-squared test, log-likelihood, etc. (Note that Fisher's Exact Test does not scale well to large counts, but Moore (2004) gives a number of optimizations that make it practical, if you're interested in implementing them.) Using exactly the same corpus setup as last time, what are the top 25 bigram collocations in this corpus, as measured automatically by the statistical association score you chose?

  2. Same question, but limit your attention to speeches by Democrats.

  3. Same question, but limit your attention to speeches by Republicans.

  4. Discuss the advantages and any disadvantages/limitations of the association score you used, including a comparison of this association score with frequency and PMI. Feel free to take full advantage of the work you've already done in Assignment 2. For full credit make sure to illustrate advantages and/or disadvantages, using the sorted top-25 lists of collocations.

  5. Are there any interesting differences between what you found for Democrats and what you found for Republicans?

  6. Finding some interesting contrasts by inspection is nice, but why should I trust your intuitions? For several examples of differences between Democrats and Republicans, make a statistical argument that the difference you're pointing out is a genuine difference supported by the data.

    To use the example from last assignment, having discovered that middle class might be an interesting phrase, I could use a chi-square test to show that the difference in the frequency of middle class among Democrats and Republicans is statistically significant. (Note that doing a chi-square test in this way, looking at a contingency table of {Democrat,Republican} x {middle_class, NOT middle_class}, is different from using chi-square to find an association between the word middle and the word class. Both cases involve a 2x2 contingency table, i.e. looking for evidence of an association between some binary variable A and a second variable B, but one case is looking for a lexical association between two words, and the other is looking for an association between a bigram and the party of the speaker using that bigram.)

    For full credit, be very explicit about the statistical hypothesis testing you're doing. For example, an explicit argument would (at least) explicitly describe null hypothesis H0, identify the test statistic, identify your choice for alpha (the cutoff for statistical significance), give the p-value, and present an argument for why we should draw the conclusions you want us to draw. (It's fine to use some of the same language I used in class.)

    You are welcome to use an on-line calculator or off-the-shelf code to do the statistical calculation if you'd like. Though you'll learn more if you do the calculation by hand at least once. :-) If you're thinking of using the chi-square test, note that there are problems using it when counts are small (see Problems in the chi-square test Wikipedia page). Depending on the counts, the likelihood ratio test may be a better choice. The classic formulation of that test for computational linguists is by Dunning (1993), but see also Moore (2004) for some more intuitive formulations (Figure 2) and an argument for using Fisher's Exact Test (typically viewed as intractable for large samples but Moore gives a number of optimizations that make it practical).

Extra credit (receive up to +10%)


Part 2: Information theory


First, a couple of notes.

For the mathematically inclined and/or those who liked the discussion about deriving an information measure from a set of properties such a measure should have, here are several derivations of the definition of entropy based on axioms about properties that a measure of uncertainty should have. I particularly like the discussion in A.I. Khinchin, Mathematical Foundations of Information Theory, Dover, New York, 1957.

Also, in case you're curious, here's the pointer to the algorithm for creating optimal codes using Huffman coding (see also this interesting discussion by Huffman's nephew).

Finally, a reminder about notation. If X is a random variable whose distribution is (p1,p2,...,pk) then we usually write H(X) for H(p1,p2,...,pk). I will occasionally write H(p) instead of H(X), where p abbreviates p1,p2,...,pk.


  1. (From Manning and Schuetze exercise 2.12) Show that the KL divergence is not symmetric by giving an example of two distributions p and q for which D(p||q) is not equal to D(q||p). Come up with two distributions different than the ones in the book and show the explicit computation of the KL divergences and how it is not symmetric.

    Note that simple toy distributions (e.g. over {heads,tails}) are boring to grade. You'll earn less goodwill on this problem for the boring kind of example, and more goodwill for an example that is significantly interesting or amusing.

  2. (Based loosely on Manning and Schuetze exercise 2.10) Compute three probability distributions over unigrams: one for the Republican speeches in the congressional speech corpus, one for the Democrats, and one for Obama's inaugural address of January 20, 2009.

    Obama's address will need to be tokenized, and you should convert all text to lowercase in order to mitigate sparse data problems. Note that you'll still need to do something about zero counts. As the simplest option, you can smooth using Laplace ("add one", Manning and Schuetze Section 6.2.2) to avoid zero, even though in practice this is not considered the state of the art. Or, slightly better, you could use add-k for a different k, e.g. setting k to 1/|V| where |V| is the size of the vocabulary. For simplicity, you can also assume that the total vocabulary is the union of all the unigrams in the three texts.

    Here's a simple observation that might help you avoid writing a lot of new code from scratch, allowing you to re-use work you've already done: if you have a bigram frequency table with entries count(a,b), then the unigram counts are just the marginal counts for a and b, i.e.:

    count(a) = ∑b' count(a,b')

    count(b) = ∑a' count(a',b)

    Best practice would be to smooth the bigram counts first, and then compute the unigram counts as above. If you smooth unigrams and bigrams separately, there's no guarantee that summing over smoothed bigram counts to compute the marginals will give you the same number you get from smoothing unigrams separately. But this is a minor point and for this assignment I'm fine with any fairly reasonable smoothing as long as you say clearly what you did.

    Questions.