Exercise: Using a Hidden Markov Model


(This version, 21 February 2014, updates some earlier versions of this exercise.)

Description: In this exercise, we use a hidden Markov model (HMM) as a model of word generation from part-of-speech sequences. We will create training data from a small corpus, train the HMM using the forward-backward algorithm, inspect the resulting model, generate sentences at random from it, and find the most likely hidden state sequence (i.e. part-of-speech tags) for test sentences.

Note that this exercise was designed to be incredibly easy to walk through, even for people with no more than basic, "Unix for Poets"-level experience with Unix/Linux. If you prefer, you are welcome to adapt this assignment and use a different HMM implementation. For example, there's tagging code, including HMM code, in NLTK (see class HiddenMarkovModelTrainer, specifically method train_unsupervised, for the NLTK implementation of forward-backward). And of course feel free to roll your own if you want the experience of implementing an EM algorithm.

To turn in: Do everything suggested in the directions, and answer all questions that are asked. Make sure you read carefully and don't miss any questions!

Credits: The HMM package we are using in this exercise was written by Tapas Kanungo, and this exercise, the accompanying scripts, etc. were written by Philip Resnik.

Prerequisites: This exercise assumes basic familiarity with typical Unix commands, and the ability to create text files (e.g. using a text editor such as vi or emacs). No programming is required.

Notational Convention: The symbol <== will be used to identify a comment from the instructor on lines where you're typing something in. So, for example, in

    
    %  cp file1.txt file2.txt   <== The "cp" is short for "copy"
  

what you're supposed to type at the prompt (identified by the percent sign, here) is

    cp file1.txt file2.txt
  
followed by a carriage return.


Getting Logged In and Getting the Code

  1. Use anonymous ftp to get the code:
      % ftp umiacs.umd.edu        <== Invoke the "ftp" program
    
      Name (yourname): anonymous  <==   Type "anonymous" (without quotes)
      Password: name@address      <==   Type your e-mail address
    
      ftp> cd pub/resnik/723      <==   Go to the directory containing the exercise files
      ftp> binary                 <==   Tell ftp to use binary transfer mode
      ftp> get hmm.tar            <==   Download the file containing exercise code
      ftp> get hmm_src.tar        <==   Download the file containing the HMM source code
      ftp> bye                    <==   Exit from ftp
    
      % tar xvf hmm.tar           <== Extract code from the file
      % rm hmm.tar                <== Delete the file (to conserve space)
      % mv hmm_src.tar hmm        <== Move the tarfile into the subdirectory
      % cd hmm                    <== Go into hmm subdirectory
      % perl pickperl /usr/imports/bin/perl `which perl`  <== get correct perl (note backquotes!)
      % chmod u+x *.pl            <== Make perl programs executable
    

  2. Execute uname -sr to see what operating system you're running. If it's SunOS or Solaris, you can use the executable programs that are part of the tarfile; skip to the step where it says "Now go into directory example0". If not, you'll need to do the extra step of unpacking and compiling the HMM source code (the hmm_src.tar file you downloaded in step 1). Here's how:

    First, execute the following lines:

      % tar xvf hmm_src.tar       <== Extract the HMM source code
      % rm hmm_src.tar            <== Delete the file (to conserve space)
      % cd src                    <== Go into source subdirectory
    
    In an older version of this assignment, you needed to do one of these:
    • Execute 'make depend'
    • Edit the Makefile to delete every occurrence of /usr/include/sys/feature_tests.h -- not sure what this was, but we don't need it.
    And then, at the top of file viterbi.c, define MAXFLOAT by adding a line as follows: double MAXFLOAT = 1e100;

    However, this version of the assignment makes life easier: just grab this "clean" Makefile. (In addition, to get the four perl scripts to work on some Linux machines, you might need to edit them manually and replace the first THREE lines with a single line containing #!/usr/local/bin/perl.)

  3. Now that the makefile is updated, do the following:
      % make                          <== Compile files.  You can ignore warnings.
      % mv esthmm genseq testvit ../  <== Move executables where we'll use them.
    
    Say yes if you're asked whether it's ok to overwrite the existing files with the same names -- you're replacing the existing executables with the executables that are appropriate for the machine you're on.

    (Note for Mac OSX users: you may need to copy malloc.h from /usr/include/malloc/malloc.h into /usr/include to get the code to compile.)

  4. Now go into directory example0:
        
        %  cd ../example0
      
  5. Run link to make the HMM programs available for running in this directory:
        %  ./link 
      


Creating the Training Data

  1. Look at file example0.train, which contains a very small corpus generated using a very simple English-like grammar.
        
        %  more example0.train 
      
    Alternatively, open the file in an editor.

    Question 1: What parts of speech would you say are represented in this file? Write down a list of parts of speech and the words you think belong to each one. Notice that the same word can appear in multiple parts of speech.

  2. Turn example0.train into training input for the HMM code, by running the create_key.pl program. This reads in the training file and converts each unique word into a unique number, since the HMM program uses numbers rather than symbols. For example, "the tall man saw the short man" might be translated into the sequence "1 2 3 4 1 5 3".
        %  perl create_key.pl example0.key < example0.train > example0.seq
      
    Take a look at file example0.key, which now contains the mapping from words to their corresponding numbers, and at file example0.seq, which now contains the training input represented as numbers rather than words. The "T=" value at the top of example0.seq tells you how many symbols there are in the training sequence. (In this case the number should be 590.)
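
If you're curious what create_key.pl is doing under the hood, the idea is just a word-to-integer dictionary built as the training text is read. Here is a minimal sketch of that idea in Python; it is not the actual script, and it glosses over the details of the key and seq file formats.

      # Minimal sketch of the word-to-integer mapping idea behind create_key.pl.
      # This is NOT the actual script; it only illustrates the conversion.

      def words_to_ints(tokens):
          """Assign each distinct word the next unused integer, starting at 1."""
          key = {}   # word -> integer
          seq = []   # the token sequence, rewritten as integers
          for w in tokens:
              if w not in key:
                  key[w] = len(key) + 1
              seq.append(key[w])
          return key, seq

      tokens = "the tall man saw the short man".split()
      key, seq = words_to_ints(tokens)
      print(seq)       # [1, 2, 3, 4, 1, 5, 3]
      print(len(seq))  # 7 -- this count is what the "T=" line at the top of the seq file records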


Training the HMM

  1. To train the model we will use the esthmm program, which implements the forward-backward algorithm, also known as Baum-Welch. (It's really a very nice, clean implementation, and I encourage you to look at the code in baumwelch.c; for a bird's-eye view of what each iteration does, see the sketch at the end of this section.) This program needs to know the number of symbols in the alphabet of the HMM (that is, symbols that can be emitted); you can obtain this by typing
        %  wc -l example0.key
      
    In this case the number of symbols is 13. The number of states is something you can choose. Recall that for an HMM-based part-of-speech model like this one, each state corresponds to a part of speech (i.e. a syntactic category like noun, verb, etc.), so the number of states to choose corresponds to the number of parts of speech you believe are represented in the corpus. In this case, let's use 6 states. We run the esthmm program as follows:
        %  ./esthmm 6 13 example0.seq > example0.hmm
      
    This creates file example0.hmm, which contains the trained model. Depending on the computer you're running this on, this might take a minute.

  2. Open file example0.hmm in an editor -- it's not the easiest thing in the world to read, but you can see how the model is represented there. At the top are specified the number of symbols and the number of states (M=13 symbols, N=6 states). Then you have the complete A matrix, i.e. the 6-by-6 transition probability matrix. Next you have the 6-by-13 emission probability matrix, B. Finally you have the pi vector, giving initial probabilities for the 6 states.

  3. If the first line of example0.hmm contains the words "num iterations", edit the file and delete that line. (If you don't do so, you'll get errors that look like "Bad format at A in example0.hmm" later on.)
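
If you'd like a concrete picture of what esthmm does on each training iteration, here is a rough Python sketch of one forward-backward (Baum-Welch) re-estimation step. Treat it purely as a reading aid for baumwelch.c: the names A, B, pi, and obs are this sketch's own, it handles a single observation sequence, and because it uses raw (unscaled) probabilities it would underflow on a sequence as long as example0.seq -- which is exactly why practical implementations rescale the forward and backward values or work in log space.

      # Sketch of ONE unscaled Baum-Welch (forward-backward) re-estimation step.
      #   A[i][j] : probability of a transition from state i to state j
      #   B[i][k] : probability that state i emits symbol k
      #   pi[i]   : probability of starting in state i
      #   obs     : the observation sequence, as 0-based symbol indices

      def forward(A, B, pi, obs):
          N, T = len(pi), len(obs)
          alpha = [[0.0] * N for _ in range(T)]
          for i in range(N):
              alpha[0][i] = pi[i] * B[i][obs[0]]
          for t in range(1, T):
              for j in range(N):
                  alpha[t][j] = sum(alpha[t-1][i] * A[i][j] for i in range(N)) * B[j][obs[t]]
          return alpha

      def backward(A, B, obs, N):
          T = len(obs)
          beta = [[0.0] * N for _ in range(T)]
          for i in range(N):
              beta[T-1][i] = 1.0
          for t in range(T - 2, -1, -1):
              for i in range(N):
                  beta[t][i] = sum(A[i][j] * B[j][obs[t+1]] * beta[t+1][j] for j in range(N))
          return beta

      def baum_welch_step(A, B, pi, obs):
          N, M, T = len(pi), len(B[0]), len(obs)
          alpha, beta = forward(A, B, pi, obs), backward(A, B, obs, N)
          likelihood = sum(alpha[T-1][i] for i in range(N))
          # gamma[t][i] = P(state i at time t | obs); xi[t][i][j] = P(i at t, j at t+1 | obs)
          gamma = [[alpha[t][i] * beta[t][i] / likelihood for i in range(N)] for t in range(T)]
          xi = [[[alpha[t][i] * A[i][j] * B[j][obs[t+1]] * beta[t+1][j] / likelihood
                  for j in range(N)] for i in range(N)] for t in range(T - 1)]
          # Re-estimate pi, A, and B from the expected counts.
          new_pi = gamma[0][:]
          new_A = [[sum(xi[t][i][j] for t in range(T-1)) / sum(gamma[t][i] for t in range(T-1))
                    for j in range(N)] for i in range(N)]
          new_B = [[sum(gamma[t][i] for t in range(T) if obs[t] == k) / sum(gamma[t][i] for t in range(T))
                    for k in range(M)] for i in range(N)]
          return new_A, new_B, new_pi, likelihood

Baum-Welch simply repeats this step, feeding the re-estimated A, B, and pi back in, until the likelihood stops improving.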


Inspecting the Model

  1. To view the model you've just created a bit more readably, keeping only the most useful information, run the viewhmm.pl program. This shows the state-to-state transition probabilities when they are above a certain threshold probability, since very low transition probabilities probably don't tell you much about the structure of the model. Similarly, for the emission probabilities, it shows only the symbols emitted with a sufficiently high probability. (And it shows them as words, not numbers, for easier readability; a small sketch of this kind of thresholded display appears at the end of this section.) Run viewhmm.pl as follows:
        %  perl viewhmm.pl example0.hmm example0.key 0.01 0.01 
      
    Chances are it scrolls by too fast, so you may want to do this instead:
        %  perl viewhmm.pl example0.hmm example0.key 0.01 0.01 | more
      
    Or, you might want to save the output to a file:
        %  perl viewhmm.pl example0.hmm example0.key 0.01 0.01 > example0.view
      

    If you get an error about the format of example0.hmm, make sure that the first line is of the form M= number. If there is a first line talking about number of iterations (verbose output from esthmm) just delete it.

  2. Question 2. Based on what you see when you run viewhmm.pl, associate each numbered state with a part of speech. (You might find it convenient to draw, on paper, a transition diagram for this HMM, i.e. each state as a node labeled by the state number, with probabilities on the arcs from state to state. This used to be required in the old version of the assignment, but students were spending too much time trying to make pretty pictures, so I've made it optional.) The question we're interested in is: how good a match is there between your earlier intuitions (from Question 1) and the way the model has automatically decided which are the high-probability symbols for each state? And if there are mismatches, are they linguistically interesting?
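
If you're wondering how a thresholded display like viewhmm.pl's can be produced, here is a small Python sketch of the same idea. It assumes the transition matrix A, emission matrix B, vector pi, and a list id2word (the word for each symbol column of B) are already sitting in Python lists -- reading them out of example0.hmm and example0.key is left out -- and it is not the actual viewhmm.pl code.

      # Sketch of a thresholded model display (the idea behind viewhmm.pl).
      # Assumes A, B, pi, and id2word are already loaded; file parsing is omitted.

      def show_model(A, B, pi, id2word, trans_threshold=0.01, emit_threshold=0.01):
          N = len(pi)
          print("Initial state probabilities:")
          for i in range(N):
              print(f"  state {i+1}: {pi[i]:.3f}")   # states shown 1-based for readability
          print("Transitions above threshold:")
          for i in range(N):
              for j in range(N):
                  if A[i][j] >= trans_threshold:
                      print(f"  {i+1} -> {j+1}  {A[i][j]:.3f}")
          print("Emissions above threshold:")
          for i in range(N):
              words = [(B[i][k], id2word[k]) for k in range(len(B[i]))
                       if B[i][k] >= emit_threshold]
              for prob, word in sorted(words, reverse=True):
                  print(f"  state {i+1} emits '{word}'  {prob:.3f}")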


Generating Sentences at Random from the Model

  1. Informed by what you did in Question 2, start at the most likely start state, and write down the most likely symbol to be emitted there. (Break ties by choosing the first alternative.) Then follow the most likely arc to the next state, and write down the most likely symbol to be emitted there. Continue in this fashion until you emit a punctuation mark, or until you get bored. (A code sketch of this greedy walk, and of random generation, appears at the end of this section.)

  2. Now we'll have the computer do this same process. Since it doesn't ever get bored, we'll have to tell it how many symbols to emit before it stops -- say, 100. To do this, run the genseq program:
        %  ./genseq example0.hmm 100
      
    Notice that the output isn't very readable, since the program generates symbols as numbers. We can take that output, though, and run it through the ints2words.pl program to replace the numbers with the corresponding words:
        %  ./genseq example0.hmm 100 | perl ints2words.pl example0.key
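
Here is a small Python sketch of both generation procedures: the greedy "most likely path" walk you just did by hand in step 1, and random generation in the spirit of genseq from step 2. As in the earlier sketches, A, B, pi, and id2word are assumed to be already loaded as Python lists, and this is not the actual genseq code.

      import random

      # Two ways to generate from an HMM, given A, B, pi, and id2word.

      def argmax(probs):
          """Index of the largest probability; ties go to the first alternative."""
          return max(range(len(probs)), key=lambda i: (probs[i], -i))

      def greedy_walk(A, B, pi, id2word, length=10):
          """Always take the most likely start state, emission, and transition."""
          state, words = argmax(pi), []
          for _ in range(length):
              words.append(id2word[argmax(B[state])])
              state = argmax(A[state])
          return words

      def draw(probs):
          """Pick an index at random according to a probability distribution."""
          r, total = random.random(), 0.0
          for i, p in enumerate(probs):
              total += p
              if r < total:
                  return i
          return len(probs) - 1

      def random_walk(A, B, pi, id2word, length=100):
          """Generate symbols at random from the model, as genseq does."""
          state, words = draw(pi), []
          for _ in range(length):
              words.append(id2word[draw(B[state])])
              state = draw(A[state])
          return words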
      


Finding the Hidden State Sequence for a Test Sentence

  1. Let's create a sentence to use as input to the model. First, create a file example0.test.words containing the word sequence that you got when you ran through the model state by state by hand, above. You can do this using an editor, or you can do it by typing
        %  cat > example0.test.words
      
    then typing the sentence in, hitting return, and then typing control-D. Don't forget to make sure everything is lowercase, and make sure the punctuation mark is a separate word, not attached to the last word of the sentence.

  2. Turn the file you've just created into the right format for the HMM programs, by running words2seq.pl program as follows:
        %  perl words2seq.pl example0.key < example0.test.words > example0.test
      
    Examine file example0.test to see what the input sequence looks like.

  3. Find the sequence of hidden states most likely to have generated the symbol sequence in example0.test, by using the testvit program:
        %  ./testvit example0.hmm example0.test
      
    The program reports the Viterbi probability for the symbol sequence in the input file, and the Viterbi state sequence (backtrace) associated with that probability. (A sketch of the Viterbi computation itself appears at the end of this section.)

  4. Take that Viterbi state sequence, and replace each state number with the part-of-speech label you assigned to that state. Congratulations! You've just done some HMM-based part-of-speech tagging.

  5. Question 3. Show the sentence you tagged and the Viterbi part-of-speech sequence. Does the sequence of parts of speech correspond to what you expected? If not, explain.
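
For the curious, here is a minimal Python sketch of the Viterbi computation that testvit performs. A careful implementation works with log probabilities to avoid underflow on long inputs; this version uses raw probabilities for readability and, as in the earlier sketches, assumes A, B, pi, and a 0-based observation sequence obs are already loaded as Python lists.

      # Sketch of the Viterbi algorithm: the most probable hidden state sequence
      # for an observation sequence obs, given A, B, and pi.

      def viterbi(A, B, pi, obs):
          N, T = len(pi), len(obs)
          delta = [[0.0] * N for _ in range(T)]  # best path probability ending in state j at time t
          back = [[0] * N for _ in range(T)]     # backpointers for recovering that path
          for j in range(N):
              delta[0][j] = pi[j] * B[j][obs[0]]
          for t in range(1, T):
              for j in range(N):
                  best_i = max(range(N), key=lambda i: delta[t-1][i] * A[i][j])
                  delta[t][j] = delta[t-1][best_i] * A[best_i][j] * B[j][obs[t]]
                  back[t][j] = best_i
          # Pick the best final state, then follow the backpointers to time 0.
          last = max(range(N), key=lambda j: delta[T-1][j])
          path = [last]
          for t in range(T - 1, 0, -1):
              path.append(back[t][path[-1]])
          path.reverse()
          return delta[T-1][last], path  # (Viterbi probability, state sequence)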


Time for Fun

Now that you've gone through this exercise, we move on to further exploration:
  1. Try getting the state sequence (part-of-speech tags) for some more sentences -- you can look at example0.key to see what words you're allowed to use. To speed things up, you can skip the step of creating example0.test.words by executing:
      %  cat | perl words2seq.pl example0.key > example0.testA
    
    Then type your sentence in, hit return, and type control-D to end input and create file example0.testA. (You can use suffixes testA, testB, etc. for new sentences.) As before, don't forget to make sure everything is lowercase, and make sure punctuation marks are separated from other words by spaces. Once you've created example0.testA, run the testvit program on that file, as described above.

  2. Question 4. Go back to "Training the HMM", but this time increase the number of states to 7 or 8 or 9. Go through the rest of the exercise of labeling the resulting states with part-of-speech tags. What does the model do when it has more states to play with, for this training set? What do you think are some possible consequences of having more states, in terms of the model's ability to tag accurately, and in terms of the linguistic facts the model captures?

  3. Question 5. Go back to "Training the HMM", but this time decrease the number of states to, say, 3 or 4. Same questions as in the previous paragraph: what are the consequences?

  4. Let's look at some more interesting data, using a more interesting English-like language. Go into the example1 directory:
      %  cd ../example1
      
    Now do the following:

    1. Run the link program as described earlier. This time, though, you won't do the "Creating the Training Data" or "Training the HMM" steps, because the training would take too long. (As of when this exercise was originally written, it could be a few hours, depending on the machine. If you decide to do it, let me know how long it took!) Instead, you can go straight to "Inspecting the Model". Notice that this is a bigger HMM with a richer language: there are 36 symbols in the vocabulary, and the HMM has 12 states.

    2. Do some of the same steps as we did for example0, particularly assigning a part-of-speech label by hand to each state, and generating some text at random using the genseq program.

    3. Question 6. Examine example1.train and construct (and turn in) a context-free grammar that generates this language, or something close to it. In what ways does the language described by this grammar diverge from English?

    4. Question 7. Looking at the random text you generated using the genseq program, do you see some sentences that could not have been generated by the context-free grammar in the previous question? Why do those sentences get generated by the HMM but not the CFG?