To use the example from last assignment, having discovered that middle class might be an interesting phrase, I could use a chi-square test to show that the difference in the frequency of middle class among Democrats and Republicans is statistically significant. (Note that doing a chi-square test in this way, looking at a contingency table of {Democrat,Republican} x {middle_class, NOT middle_class}, is different from using chi-square to find an association between the word middle and the word class. Both cases involve a 2x2 contingency table, i.e. looking for evidence of an association between some binary variable A and a second variable B, but one case is looking for a lexical association between two words, and the other is looking for an association between a bigram and the party of the speaker using that bigram.)
For full credit, be very explicit about the statistical hypothesis testing you're doing. For example, an explicit argument would (at least) explicitly describe null hypothesis H0, identify the test statistic, identify your choice for alpha (the cutoff for statistical significance), give the p-value, and present an argument for why we should draw the conclusions you want us to draw. (It's fine to use some of the same language I used in class.)
You are welcome to use an on-line calculator or off-the-shelf code to do the statistical calculation if you'd like. Though you'll learn more if you do the calculation by hand at least once. :-) If you're thinking of using the chi-square test, note that there are problems using it when counts are small (see Problems in the chi-square test Wikipedia page). Depending on the counts, the likelihood ratio test may be a better choice. The classic formulation of that test for computational linguists is by Dunning (1993), but see also Moore (2004) for some more intuitive formulations (Figure 2) and an argument for using Fisher's Exact Test (typically viewed as intractable for large samples but Moore gives a number of optimizations that make it practical).
First, a couple of notes.
For the mathematically inclined and/or those who liked the discussion about deriving an information measure from a set of properties such a measure should have, here are several derivations of the definition of entropy based on axioms about properties that a measure of uncertainty should have. I particularly like the discussion in A.I. Khinchin, Mathematical Foundations of Information Theory, Dover, New York, 1957.
Also, in case you're curious, here's the pointer to the algorithm for creating optimal codes using Huffman coding (see also this interesting discussion by Huffman's nephew).
Finally, a reminder about notation. If X is a random variable whose distribution is (p1,p2,...,pk) then we usually write H(X) for H(p1,p2,...,pk). I will occasionally write H(p) instead of H(X), where p abbreviates p1,p2,...,pk.
Note that simple toy distributions (e.g. over {heads,tails}) are boring to grade. You'll earn less goodwill on this problem for the boring kind of example, and more goodwill for an example that is significantly interesting or amusing.
Obama's address will need to be tokenized, and you should convert all text to lowercase in order to mitigate sparse data problems. Note that you'll still need to do something about zero counts. As the simplest option, you can smooth using Laplace ("add one", Manning and Schuetze Section 6.2.2) to avoid zero, even though in practice this is not considered the state of the art. Or, slightly better, you could use add-k for a different k, e.g. setting k to 1/|V| where |V| is the size of the vocabulary. For simplicity, you can also assume that the total vocabulary is the union of all the unigrams in the three texts.
Here's a simple observation that might help you avoid writing a lot of new code from scratch, allowing you to re-use work you've already done: if you have a bigram frequency table with entries count(a,b), then the unigram counts are just the marginal counts for a and b, i.e.:
count(a) = ∑b' count(a,b')
count(b) = ∑a' count(a',b)
Best practice would be to smooth the bigram counts first, and then compute the unigram counts as above. If you smooth unigrams and bigrams separately, there's no guarantee that summing over smoothed bigram counts to compute the marginals will give you the same number you get from smoothing unigrams separately. But this is a minor point and for this assignment I'm fine with any fairly reasonable smoothing as long as you say clearly what you did.
Questions.