Assignment 2

In this assignment, you will get hands-on experience using computational methods to identify interesting collocations in a corpus. You are welcome to code everything from scratch, but you may also use code from existing NLP toolkits or code you find on the Web, on condition that you clearly identify any pre-existing code you've used.

The corpus

We'll work with the congressional speech corpus, created by Matt Thomas, Bo Pang, and Lillian Lee. It contains speeches made by politicians in the U.S. House of Representatives during debates over legislation. Although the work in this assignment is relatively straightforward from an NLP point of view, it's pretty close to what some real political scientists have done in text analysis; for example, see Monroe et al., Fightin’ Words: Lexical Feature Selection and Evaluation for Identifying the Content of Political Conflict, Political Analysis 16(4):372-403, 2008.

You should begin by downloading the convote dataset v1.1 (9.8 MB, tar.gz format). There are several different versions of the data in this archive; we are concerned only with the data_stage_three directory. For those with limited disk space, a way to extract just the directory of interest is:

    gunzip < convote_v1.1.tar.gz | tar xvf - convote_v1.1/data_stage_three

Each file in the data_stage_three directory contains a speech by a legislator. The filenames contain information about each speech. In the filename template, ###_@@@@@@_%%%%$$$_PMV:

the first three characters ### identify the bill under discussion
the six digits @@@@@@ uniquely identify the speaker
the character in position P indicates the speaker's party: D (Democrat), R (Republican), I (independent), or X (unknown).
the character in position V indicates whether the speaker eventually voted yes (Y) or no (N).
for our purposes we can ignore the other parts of the template.
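
If it helps to see the template in code, here is a minimal sketch of one way to pull the fields out of a filename. The function and class names are my own inventions, the example filename is made up to fit the template, and I'm assuming the files may carry an extension such as .txt (the code strips one if present).

```python
import os
from typing import NamedTuple


class SpeechInfo(NamedTuple):
    """The fields of the filename template that matter for this assignment."""
    bill: str     # the three characters ###
    speaker: str  # the six digits @@@@@@
    party: str    # position P: D, R, I, or X
    vote: str     # position V: Y or N


def parse_speech_filename(path: str) -> SpeechInfo:
    """Split a name of the form ###_@@@@@@_%%%%$$$_PMV into its fields."""
    # Drop any directory prefix and any extension (an assumption; see above).
    stem = os.path.splitext(os.path.basename(path))[0]
    bill, speaker, _ignored, pmv = stem.split("_")
    if len(pmv) != 3:
        raise ValueError(f"unexpected PMV field in {path!r}")
    # The middle character of PMV is one of the parts we can ignore here.
    return SpeechInfo(bill, speaker, party=pmv[0], vote=pmv[2])
```

For instance, parse_speech_filename("052_400011_0327014_DON.txt") returns a SpeechInfo with bill "052", speaker "400011", party "D", and vote "N".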

Please use all the data available under the three subdirectories of data_stage_three ("development_set", "test_set" and "training_set").
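
One possible way to pool the three directories and group the speeches by party is sketched below. The function name and the SUBSETS constant are hypothetical, and the code assumes that each of the three directories sits directly under the path you pass in and contains only speech files.

```python
import os
from collections import defaultdict

# The three directory names given in the assignment.
SUBSETS = ("development_set", "test_set", "training_set")


def speeches_by_party(stage_three_dir):
    """Map each party letter to a list of file paths, pooling all subsets."""
    by_party = defaultdict(list)
    for subset in SUBSETS:
        subset_dir = os.path.join(stage_three_dir, subset)
        for name in sorted(os.listdir(subset_dir)):
            # The party letter is the first character of the final PMV field.
            stem = os.path.splitext(name)[0]
            pmv = stem.rsplit("_", 1)[-1]
            if len(pmv) == 3:
                by_party[pmv[0]].append(os.path.join(subset_dir, name))
    return dict(by_party)
```

With this grouping, the Democratic and Republican subcorpora are just the lists stored under "D" and "R", and the full corpus is all of the lists together.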

The task

The idea here is to use what we've learned in class (and what you've read about in M&S Chapter 5) to answer some interesting questions about this corpus.

What are the top 25 bigram collocations in this corpus, as measured automatically by (a) frequency and (b) pointwise mutual information (PMI)?
Same question, but limit your attention to speeches by Democrats.
Same question, but limit your attention to speeches by Republicans.
Discuss the advantages and limitations of frequency and PMI. For full credit, make sure to illustrate the advantages and/or disadvantages with examples from the sorted lists of collocations you obtained.
Are there any interesting differences between what you found for Democrats and what you found for Republicans?
Here's an example. (In this case I actually used log-likelihood as the association score, rather than frequency or PMI, but the principle is exactly the same.) When I used the Democratic speeches as a corpus, and sorted all non-stopword bigrams by the association score, and did the same separately for Republicans, the phrase middle class ranked 16th for Democrats and 115th for Republicans; conversely, the phrase law enforcement ranked 28th for Republicans and 141st for Democrats. One could argue that this empirical observation is consistent with at least some characterizations of the priorities of the two political parties -- e.g. see discussions of the role that the phrase middle class had during the 2008 presidential debates, like this one.
Even if you're not particularly familiar with American politics, identify similar contrasts and offer your thoughts on why they might or might not be meaningful. (If this is an unfamiliar topic, feel free to discuss your data with classmates who know American politics better than you do.)
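
To make the two measures concrete, here is a rough sketch of ranking bigrams by frequency and by PMI, where PMI(x, y) = log2( P(x, y) / (P(x) P(y)) ). Treat it as one way to do it under stated assumptions, not the required method: the function names and the min_count parameter are my own, and the probabilities are simple relative-frequency estimates.

```python
import math
from collections import Counter


def bigram_counts(sentences):
    """Count unigrams and adjacent-word bigrams over lists of tokens."""
    unigrams, bigrams = Counter(), Counter()
    for tokens in sentences:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def top_by_frequency(bigrams, k=25):
    """Return the k most frequent bigrams with their counts."""
    return bigrams.most_common(k)


def top_by_pmi(unigrams, bigrams, k=25, min_count=1):
    """Return the k bigrams with the highest PMI, as (bigram, score) pairs.

    min_count is a hypothetical knob for skipping very rare bigrams;
    the default of 1 applies no filtering at all.
    """
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    scored = []
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue
        # Relative-frequency estimates of the joint and marginal probabilities.
        p_xy = count / n_bi
        p_x = unigrams[w1] / n_uni
        p_y = unigrams[w2] / n_uni
        scored.append(((w1, w2), math.log2(p_xy / (p_x * p_y))))
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]
```

Running both rankings on the same counts, with and without a count threshold, is one quick way to see how differently the two measures treat rare and common word pairs.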

Ways to go about it

Conveniently, the corpus is already tokenized and lowercased for you, and the filename conventions make it very easy to identify particular subsets of the corpus.
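
Because the text is already tokenized and lowercased, splitting on whitespace may be all the preprocessing you need. The sketch below reads one file that way; the function name is hypothetical, and treating each line as its own unit (so that bigrams never span a line break) is a design choice of mine rather than something the corpus requires. I'm also assuming the files can be read as UTF-8, with undecodable bytes replaced.

```python
def read_sentences(path):
    """Yield one list of tokens per non-empty line of a tokenized file."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        for line in handle:
            tokens = line.split()  # tokens are already separated by whitespace
            if tokens:
                yield tokens


def bigrams_in(tokens):
    """Return the adjacent word pairs within one list of tokens."""
    return list(zip(tokens, tokens[1:]))
```

If you would rather let bigrams cross line boundaries, concatenate the token lists for a file before pairing them up; either choice is defensible as long as you say which one you made.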

In terms of implementation:

If you like to implement everything from scratch, you should have all the information you need here in the assignment, in the book, and in standard references like Wikipedia.
If you like Python and NLTK, you might want to look at this nice discussion by Nitin Madnani.
If you like to take advantage of off-the-shelf tools, that's fine with me. If you do that, you might like Ted Pedersen's Ngram Statistics Package.

You are welcome to work together. If you do so, turn in separate assignments (since the discussion parts will be different), even if you're turning in the same code, but identify who you worked with.

What to turn in

Address for turning in assignments: See the course home page. Please turn in a file named firstname.lastname.tar.gz or firstname.lastname.zip. Include:

A PDF file (portrait 8.5x11 format) with your answers
A subdirectory code containing your code along with a README file that explains how to run it. You should not include a copy of the corpus -- ideally the README should identify filename arguments or configuration options, or, worst case, say where to change hardwired strings.

Note that you are not being graded on the quality of your code. Yes, all other things being equal, clean and easily runnable code would be nice. But if you have the choice between spending more time playing with and thinking about the data you're looking at, versus more time making your code pretty and easy to run, please focus on the former and not the latter!

Again, I strongly encourage you to talk to each other on the class discussion board.