Assignment 2
In this assignment, you will get some hands-on experience using
computational methods to identify interesting collocations in a
corpus. You are welcome to code up everything from scratch, but
you are equally welcome to use code from existing NLP toolkits or code you
find on the Web, on condition that you clearly identify any
pre-existing code you've used.
The corpus
We'll work with the congressional
speech corpus, created by Matt Thomas, Bo Pang, and Lillian Lee.
It contains speeches made by politicians in the U.S. House of
Representatives during debates over legislation. Although the work in
this assignment is relatively straightforward from an NLP point of
view, it's pretty close to what some real political scientists have done in
text analysis; for example, see Monroe et al.,
"Fightin’ Words: Lexical
Feature Selection and Evaluation for Identifying the Content of
Political Conflict," Political Analysis 16(4):372-403, 2008.
You should begin by downloading the convote
dataset v1.1 (9.8 MB, tar.gz format). There are several different
versions of the data in this archive; we are concerned only with the
data_stage_three directory. For those with limited disk space, a way
to extract just the directory of interest is:
gunzip < convote_v1.1.tar.gz | tar xvf - convote_v1.1/data_stage_three
Each file in the data_stage_three directory contains a speech by a legislator.
The filenames contain information about each speech. In the filename template,
###_@@@@@@_%%%%$$$_PMV:
- the first three characters ### identify the bill under discussion
- the six digits @@@@@@ uniquely identify the speaker
- the character in position P indicates the speaker's party: D (Democrat), R (Republican), I (Independent), or X (unknown)
- the character in position V indicates whether the speaker eventually voted yes (Y) or no (N)
- for our purposes we can ignore the other parts of the template.
Please use all of the data in the three subdirectories of data_stage_three ("development_set", "test_set", and "training_set").
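The filename template above can be decoded with a few lines of Python. This is only a minimal sketch: the names SpeechInfo, parse_speech_filename, and speeches_by_party are made up for illustration, the ".txt" extension and the example filename are assumptions rather than something stated in this assignment, and the code simply takes the first and third characters of the last underscore-separated field as P and V.

```python
from pathlib import Path
from typing import Iterator, NamedTuple, Optional


class SpeechInfo(NamedTuple):
    bill: str      # the ### field
    speaker: str   # the @@@@@@ field
    party: str     # P: D, R, I, or X
    vote: str      # V: Y or N


def parse_speech_filename(filename: str) -> Optional[SpeechInfo]:
    """Decode ###_@@@@@@_%%%%$$$_PMV; return None if the name does not fit."""
    stem = Path(filename).stem          # drop any directory and extension
    fields = stem.split("_")
    if len(fields) != 4 or len(fields[3]) != 3:
        return None
    bill, speaker, _ignored, pmv = fields
    return SpeechInfo(bill, speaker, party=pmv[0], vote=pmv[2])


def speeches_by_party(root: str, party: str) -> Iterator[Path]:
    """Yield speech files for one party from all three subdirectories of root."""
    for subdir in ("development_set", "test_set", "training_set"):
        for path in sorted(Path(root, subdir).glob("*")):
            info = parse_speech_filename(path.name)
            if info is not None and info.party == party:
                yield path


# A made-up filename that follows the template (not taken from the corpus).
print(parse_speech_filename("052_400011_0327014_DON.txt"))
```

With something like this in place, speeches_by_party("convote_v1.1/data_stage_three", "D") would give the Democratic subset, assuming the archive was extracted into the current directory.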
The task
The idea here is to use what we've learned in class (and what you've
read about in M&S Chapter 5) to answer some interesting questions
about this corpus.
- What are the top 25 bigram collocations in this corpus, as measured automatically
by (a) frequency and (b) pointwise mutual information (PMI)?
- Same question, but limit your attention to speeches by Democrats.
- Same question, but limit your attention to speeches by Republicans.
- Discuss the advantages and limitations of frequency and PMI.
For full credit make sure to illustrate advantages and/or disadvantages
using the sorted lists of collocations you obtained.
- Are there any interesting differences between what you found for
Democrats and what you found for Republicans?
Here's an example. (In this case I actually used log-likelihood as the association score, rather than frequency or PMI, but the principle is exactly the same.)
When I used the Democratic speeches as a corpus,
and sorted all non-stopword bigrams by the association score,
and did the same separately for Republicans, the
phrase middle class ranked 16th for Democrats and 115th
for Republicans; conversely, the phrase law enforcement
ranked 28th for Republicans and 141st for Democrats.
One could argue that this empirical observation is consistent
with at least some characterizations of the priorities
of the two political parties -- e.g. see
discussions of the role that the phrase middle class had during the 2008 presidential debates,
like
this one.
Even if you're not particularly familiar with American politics,
identify similar contrasts and offer your thoughts on why
they might or might not be meaningful. (If this is an unfamiliar topic, feel free to discuss your data with classmates who
know American politics better than you do.)
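One way to hunt for contrasts like the middle class / law enforcement example is to compare each bigram's rank in the two parties' sorted lists. The sketch below assumes you already have two ranked lists of bigrams (best first, no duplicates); the function names and the toy lists are invented for illustration and are not results from the corpus.

```python
def rank_table(ranked_bigrams):
    """Map each bigram to its 1-based position in a ranked list."""
    return {bigram: rank for rank, bigram in enumerate(ranked_bigrams, start=1)}


def rank_contrasts(ranked_a, ranked_b, top_n=50):
    """For the top_n bigrams of list A, report (bigram, rank in A, rank in B).

    Rows are sorted so that bigrams ranked much higher in A than in B come
    first. A bigram missing from B gets rank None and is treated as if it
    sat just past the end of B's list when sorting.
    """
    ranks_a = rank_table(ranked_a)
    ranks_b = rank_table(ranked_b)
    missing = len(ranked_b) + 1
    rows = [(bigram, ranks_a[bigram], ranks_b.get(bigram))
            for bigram in ranked_a[:top_n]]
    rows.sort(key=lambda row: (row[2] if row[2] is not None else missing) - row[1],
              reverse=True)
    return rows


# Toy ranked lists, made up purely to show the output shape.
party_a = [("middle", "class"), ("health", "care"), ("law", "enforcement")]
party_b = [("law", "enforcement"), ("health", "care"), ("middle", "class")]
for bigram, rank_a, rank_b in rank_contrasts(party_a, party_b):
    print(" ".join(bigram), rank_a, rank_b)
```

Calling the function a second time with the arguments swapped gives the contrasts in the other direction. A large rank gap is only a starting point for discussion, not evidence of a meaningful difference by itself.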
Ways to go about it
Conveniently, the corpus is already tokenized and lowercased for you,
and the filename conventions make it very easy to identify particular
subsets of the corpus.
In terms of implementation:
- For those of you who like to implement everything from scratch, you
should have all the information you need here in the assignment,
in the book, and in standard references like Wikipedia.
- For those who like Python and NLTK,
you might want to look at
this nice discussion by Nitin Madnani.
- For those who like to take advantage of off-the-shelf tools,
that's fine with me. If you do that, you might like
Ted Pedersen's Ngram
Statistics Package.
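For the from-scratch route, here is a minimal sketch of the two scores using only the Python standard library. It makes several assumptions that you may want to change: bigrams are counted within each token sequence (for example, one line of a speech file split on whitespace) and never across sequence boundaries; probabilities are plain relative frequencies; PMI is computed as log2(P(w1 w2) / (P(w1) P(w2))); and min_count is a hypothetical optional threshold, not something the assignment requires. The function names and the toy sentences are made up.

```python
import math
from collections import Counter


def count_ngrams(token_sequences):
    """Count unigrams and adjacent-word bigrams over lists of tokens."""
    unigrams, bigrams = Counter(), Counter()
    for tokens in token_sequences:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def top_by_frequency(bigrams, n=25):
    """Return the n most frequent bigrams with their counts."""
    return bigrams.most_common(n)


def top_by_pmi(unigrams, bigrams, n=25, min_count=1):
    """Return the n bigrams with the highest PMI, as (bigram, score) pairs."""
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    scored = []
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue
        p_xy = count / n_bi
        p_x = unigrams[w1] / n_uni
        p_y = unigrams[w2] / n_uni
        scored.append(((w1, w2), math.log2(p_xy / (p_x * p_y))))
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:n]


# Toy input, made up for illustration; the real input would be the speeches.
toy = [["new", "york", "city"], ["new", "york", "state"], ["the", "city"]]
unigrams, bigrams = count_ngrams(toy)
print(top_by_frequency(bigrams, n=3))
print(top_by_pmi(unigrams, bigrams, n=3))
```

Stopword filtering, as in the log-likelihood example earlier, is left out here; it could be added as a check on w1 and w2 before scoring.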
You are welcome to work together. If you do, each of you should turn
in a separate assignment (the discussion parts will be
different), even if you're turning in the same code, and identify who
you worked with.
What to turn in
Address for turning in assignments: See the course home page. Please turn in a file named firstname.lastname.tar.gz or firstname.lastname.zip. Include:
- A PDF file (portrait 8.5x11 format) with your answers
- A subdirectory code containing your code along with a README file that explains how to run it. You should not include a copy of the corpus -- ideally the README should identify filename arguments or configuration options, or, worst case, say where to change hardwired strings.
Note that you are not being graded on the quality of your code. Yes, all other things being equal, clean and easily runnable code would be nice. But if you have the choice between spending more time playing with and thinking about the data you're looking at, versus more time making your code pretty and easy to run, please focus on the former and not the latter!
Again, I strongly encourage you to talk to each other on the class
discussion board.