Natural Language Processing
Annotations and personal elaborations on the Coursera material for the Natural Language Processing course. No affiliation to the course is implied and my interpretation of the content can be flawed. This document is purely intended to assist me in the process of learning the subject and is a mix between transcript and personal notes.
This markup is best read in MarkdownPad from markdownpad.com. Sadly the original implementation of Markdown does not include support for LaTeX; if it bugs me enough it might be worth writing a parser that does.
edit: It turns out this is also a showcase of a variety of TeX typesetting techniques.
Language modeling
Language modeling is about assigning probabilities to a sentence. It is used in machine translation and spell correction.
The goal of a language model (also called 'the grammar' or 'LM') is to compute the probability of a sentence or sequence of words.
How to compute this? Use the chain rule of probability. Definition of conditional probabilities:
 P(A | B) = P(A, B) / P(B)
 P(A, B) = P(A | B) * P(B)
To generalize to multiple variables:
 P(A, B, C, D) = P(A) * P(B | A) * P(C | A, B) * P(D | A, B, C)
 P(X1, X2, ..., Xn) = P(X1) * P(X2 | X1) * P(X3 | X1, X2) * ... * P(Xn | X1, X2, ..., Xn-1)
The joint probability is used for computing the probability of a string of words.
 sentence = "its water is so transparent"
 P(sentence) = P(its) * P(water | its) * P(is | its water) * P(so | its water is) * P(transparent | its water is so)
 P(W1, W2, ..., Wn) = product over all i of P(Wi | W1, W2, ..., Wi-1)
How do we estimate these probabilities? Counting every full history is hopeless (there are too many possible sentences), so we simplify with the Markov assumption.
 sentence = "its water is so transparent that"
 P("the"sentence) ≈ P(thetransparent that)
Formal definition: under the Markov assumption, the probability of a sequence of words is the product of the conditional probabilities of each word given a prefix of the last K words.
 P(W1, W2, ..., Wn) ≈ product over all i of P(Wi | Wi-k, ..., Wi-1)
 P(Wi | W1, W2, ..., Wi-1) ≈ P(Wi | Wi-k, ..., Wi-1)
The simplest case is the unigram model (K = 0). Its output is no more than a concatenation of words picked randomly from a body of text, and it tends to be unintelligible.
The bigram model is conditioned on the previous word. K = 1
Extending further gives trigrams, 4-grams, and so on: an N-gram model conditions on the previous N-1 words (K = N-1). While sentences will start to look more like natural language, N-grams are still an insufficient model of language, because language has long-distance dependencies. In most cases the output will look intelligible at a glance for short stretches of text.
This next line shows that "computer" and "crashed" can be separated by a large number of words. Statistically, the likelihood of the word "crashed" following the word "floor" is not high, but it becomes high once the subject of the sentence is the computer.
"The computer which I had just put into the machine room on the fifth floor crashed"
Estimating N-gram Probabilities
The Maximum Likelihood Estimate (MLE):
 P(Wi | Wi-1) = count(Wi-1, Wi) / count(Wi-1)
Or stated in English:
Of all the times we saw Wi-1, how many times was it followed by Wi?
&lt;s&gt; means start symbol
&lt;/s&gt; means end symbol
Dr. Seuss used as a corpus:
&lt;s&gt; I am Sam &lt;/s&gt;
&lt;s&gt; Sam I am &lt;/s&gt;
&lt;s&gt; I do not like green eggs and ham &lt;/s&gt;
 P(I | &lt;s&gt;) = 2/3
 P(Sam | &lt;s&gt;) = 1/3
 P(&lt;/s&gt; | Sam) = 1/2
 P(Sam | am) = 1/2 (Sam occurs after am 1 time, but am occurs 2 times in total)
 P(am | I) = 2/3 (am occurs after I 2 times, but I occurs 3 times in total)
 P(do | I) = 1/3
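These estimates are mechanical enough to verify in code. A minimal sketch in Python over the same three-sentence corpus (the counting code and names are mine, not from the course):

    from collections import Counter

    # The three-sentence corpus from above, with sentence boundary markers.
    corpus = [
        "<s> I am Sam </s>",
        "<s> Sam I am </s>",
        "<s> I do not like green eggs and ham </s>",
    ]

    unigram_counts = Counter()
    bigram_counts = Counter()
    for sentence in corpus:
        tokens = sentence.split()
        unigram_counts.update(tokens)
        bigram_counts.update(zip(tokens, tokens[1:]))

    def p_mle(word, prev):
        """P(word | prev) = count(prev, word) / count(prev)."""
        return bigram_counts[(prev, word)] / unigram_counts[prev]

    print(p_mle("I", "<s>"))   # 2/3
    print(p_mle("Sam", "am"))  # 1/2
    print(p_mle("do", "I"))    # 1/3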
Bigram estimates of sentence probabilities
P(&lt;s&gt; I want english food &lt;/s&gt;)
= P(I | &lt;s&gt;) * P(want | I) * P(english | want) * P(food | english) * P(&lt;/s&gt; | food)
= 0.000031
Zeroes in the probability matrix arise because a corpus doesn't contain certain word combinations. That doesn't mean the word could never follow that word in general English. A zero can also indicate that certain words don't logically follow another word; this becomes more visible with much larger bodies of mixed-content text.
Log space
With an abundance of near-zero values, arithmetic is made faster by adding logs instead of multiplying raw probabilities.
"Avoids arithmetic underflow"
 log(p1 * p2 * p3 * p4) = log(p1) + log(p2) + log(p3) + log(p4)
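A toy demonstration of why this matters (my own example, not from the course): multiplying many small conditional probabilities underflows to zero, while summing their logs stays representable.

    import math

    # 80 conditional probabilities of 1e-5 each: the true product is 1e-400,
    # smaller than the smallest positive double (about 1e-308).
    probs = [1e-5] * 80

    product = 1.0
    for p in probs:
        product *= p
    print(product)    # 0.0 (arithmetic underflow)

    log_prob = sum(math.log(p) for p in probs)
    print(log_prob)   # about -921, perfectly representable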
Publicly available language modeling toolkits:
 SRILM (speech.sri.com)
 Google N-gram release (Google web corpus)
 Google Books Ngrams (http://ngrams.googlelabs.com/)
Evaluation: How good is our model?
We want to assign higher probability to 'real' or 'frequently observed' sentences and assign lower probability to 'ungrammatical' or 'rarely observed' sentences.
We train on a training set and test the performance of the resulting LM on a test set. The test set is never-before-seen data. An evaluation metric tells us how well our model does on the test set.
Extrinsic evaluation of N-gram models
The best way to compare two models (A and B) is to apply each model to the same task: a spelling corrector, a speech recognizer, or a machine translation (MT) system. Run the task and measure the accuracy of A and B: how many misspelled words were corrected, how many words were recognized correctly, or how many words were translated correctly.
The problem with this extrinsic (in vivo) style of evaluation is that it is time- and processing-expensive (days or weeks).
Another way to evaluate models is an intrinsic evaluation called perplexity. It is a poor approximation unless the test and training data are very similar, so on its own it is mainly useful for pilot experiments. Both methods are valuable.
Intuition of perplexity
Claude Shannon. The Shannon Game: how well can we predict the next word? A good language model will look at the context to narrow down the body of probable words. Unigrams suck at this game.
If you can guess the next word right, then you are a good language model
Perplexity: the best language model is one that best predicts an unseen test set, i.e. it assigns the highest probability to the sentences that it sees. More formally:
 Perplexity is the inverse probability of the test set, normalized by the number of words:
 PP(W) = P(W1, W2, ..., WN) ^ (-1/N)
       = Nth root of ( 1 / P(W1, W2, ..., WN) )
When chain ruled:
 PP(W) = Nth root of ( product over all i of 1 / P(Wi | W1, ..., Wi-1) )
When chain ruled (bigrams only):
 PP(W) = Nth root of ( product over all i of 1 / P(Wi | Wi-1) )
"minimizing perplexity is the same as maximizing probability"
A second idea on perplexity comes from Josh Goodman, based on Shannon:
How hard is the task of recognizing digits '0,1,...9' ?
Perplexity is related to the average branching factor: on average, how many things could come next at any point in the sentence? (It's related to the entropy of the upcoming things.)
Example given: speech recognition for an automated operator that connects callers to employees. 30,000 equally likely full names give a perplexity of 30,000.
Perplexity is the weighted equivalent branching factor. Numbers example:
 Operator (1 in 4)
 Sales (1 in 4)
 Technical Support (1 in 4)
 30,000 names (1 in 120,000 each, since (1/4) / 30,000 = 1/120,000)
 Perplexity is: 53 (52.64...)
Lower perplexity indicates a better trained model.
My Python interpretation of the formula is as follows; rewriting it this way helped me understand the whole idea better:
    import math

    def perplexity(chances):
        """Pythonification of the formal definition of perplexity.
        input: a sequence of conditional word probabilities
        output: perplexity value.
        """
        N = len(chances)
        product = 1.0
        for chance in chances:
            product *= chance
        # Nth root of the inverse probability of the word sequence,
        # i.e. product ** (-1/N).
        return math.pow(product, -1.0 / N)
This is the general gist of the formula. I would run tests on variations and optimize for speed instead of readability (especially when considering larger sequences). For instance, the product of the chances might better be calculated like:
    # http://choorucode.wordpress.com/2010/02/05/python-product-of-elements-of-a-list/
    import functools
    import operator

    def product(seq):
        """Product of a sequence."""
        return functools.reduce(operator.mul, seq, 1)
My net searches on the subject of perplexity:
Gregor Heinrich: http://www.arbylon.net/publications/text-est.pdf
Generalization and zeroes
What to do when we see a lot of zeroes? It helps to consider the Shannon visualization method:
 choose a random bigram (&lt;s&gt;, w) according to its probability
 now choose a random bigram that starts with the previous word, according to its probability
 go on until you choose the end-of-sentence token
looks like:
&lt;s&gt; I
I want
want to
to eat
eat Chinese
Chinese food
food &lt;/s&gt;
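A sketch of this sampling procedure in Python; the bigram count table below is made up to mirror the example above, and random.choices does the weighted draw:

    import random

    # Toy table of bigram counts: for each context word, the words observed
    # to follow it (counts invented for illustration).
    follow_counts = {
        "<s>":     {"I": 3, "You": 1},
        "I":       {"want": 2, "am": 1},
        "You":     {"want": 1},
        "want":    {"to": 3},
        "to":      {"eat": 2},
        "eat":     {"Chinese": 1, "lunch": 1},
        "Chinese": {"food": 2},
        "lunch":   {"</s>": 1},
        "food":    {"</s>": 2},
        "am":      {"</s>": 1},
    }

    def shannon_sentence():
        """Start at <s>; repeatedly sample the next word in proportion to
        its bigram count until </s> is chosen."""
        word, out = "<s>", []
        while True:
            nexts = follow_counts[word]
            word = random.choices(list(nexts), weights=list(nexts.values()))[0]
            if word == "</s>":
                return " ".join(out)
            out.append(word)

    print(shannon_sentence())  # e.g. "I want to eat Chinese food"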
Shakespeare as corpus:
 N = 884,647 words (tokens)
 V = 29,066 vocabulary (unique word types)
 Produced 300K bigram types out of 844 million possible (V^2)
So 99.96% of the possible bigrams remain unseen and will have 0 in the table. A vast number of zeroes.
 Quadrigrams: on Shakespearean text these produce lines that are themselves mostly direct quotes from Shakespeare, because the corpus (N) is so small. Try it.
The perils of overfitting
N-grams only work well for word prediction if the test corpus looks like the training corpus. A language model trained on Shakespeare will give abysmal results on the Wall Street Journal, while a model trained on the Wall Street Journal (corpus A) might do better on another financial publication (corpus B).
We need to train models that do a better job of generalizing, make them more robust. Combinations that don't occur in the training set will result in zeros, but the test set might validly contain such combinations.
Bigrams with zero probability will result in division by zero when calculating perplexity, so they must be counteracted.
Bigrams with zero probability
How do we deal with these beasts?
Simplest idea: Smoothing.
Smoothing: Add-one estimation, also called Laplace smoothing
If we have sparse statistics, we want to steal probability mass from combinations we have seen and place it on combinations that didn't occur in the training data, so that combinations we might see later get non-zero probability.
in brief:
Laplace smoothing:
Pretend we saw each word one more time than we did. Add one to all counts for all bigrams.
The formal expression (sketched in code below):
V = size of the vocabulary (number of word types).
 P_Add-1(Wi | Wi-1) = ( C(Wi-1, Wi) + 1 ) / ( C(Wi-1) + V )
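As a sketch, this is a one-line change to the earlier MLE estimate: pad the numerator with 1 and the denominator with V (the count-table representation and names are mine):

    def p_add1(word, prev, bigram_counts, unigram_counts, vocab_size):
        """Add-1 (Laplace) bigram estimate:
        ( c(prev, word) + 1 ) / ( c(prev) + V ).
        Never returns zero, even for unseen bigrams."""
        return ((bigram_counts.get((prev, word), 0) + 1)
                / (unigram_counts.get(prev, 0) + vocab_size))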
With the Maximum Likelihood Estimate (MLE), suppose a word occurs 400 times in a corpus of 1 million words. The MLE is 400/1,000,000 = 0.0004. This may be a bad estimate for the likelihood of that word occurring in another corpus.
 PP_MLE(C) &lt;= PP_smoothed(C)
Add-1 estimation makes a massive difference if you compare the reconstituted bigram table to the raw bigram table; some counts change by up to a factor of 10. So now we got rid of the zeros but gained a much greater level of uncertainty about valid syntax. There are better methods!
Add-1 is used to smooth other NLP models:
 For text classification
 In domains where the number of zeros isn't so huge
It helps me to think of this as an analogy to the concept of anti-aliasing in bitmap graphics.
Further reading on smoothing:
Microsoft Research / Stanley F. Chen & Joshua Goodman
http://research.microsoft.com/~joshuago/tr-10-98.pdf
Interpolation
Sometimes it helps to use less context: condition on less context for contexts you haven't learned much about.
Backoff:
 Use a trigram if you have good evidence/data
 What if you haven't seen the trigram? Then look at the bigrams, and if they don't exist for that combination, look at the unigram.
Interpolation:
 mixing unigram, bigram, trigram.
 Interpolation tends to work better than backoff in practice.
There are (broadly speaking) two kinds of interpolation.
(lambda = λ; the λs must sum to 1 so the result is a probability)
Simple interpolation:
Add the unigram, bigram, and trigram estimates together, weighted by the λs (a code sketch follows below):
 P_hat(Wn | Wn-1, Wn-2) = λ1 * P(Wn | Wn-1, Wn-2) + λ2 * P(Wn | Wn-1) + λ3 * P(Wn)
Lambdas conditional on context (slightly more complicated):
Now the lambdas depend on what the previous two words were.
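A minimal sketch of the simple version; the probability functions and λ weights here are placeholders of mine (in practice the λs are tuned on held-out data, as described next):

    def p_interpolated(w, prev2, prev1, p_uni, p_bi, p_tri,
                       lambdas=(0.2, 0.3, 0.5)):
        """Simple linear interpolation of unigram, bigram and trigram
        estimates. p_uni, p_bi, p_tri are probability functions; the
        lambda weights must sum to 1."""
        l_uni, l_bi, l_tri = lambdas
        return (l_tri * p_tri(w, prev2, prev1)
                + l_bi * p_bi(w, prev1)
                + l_uni * p_uni(w))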
Where do the lambdas come from? How do we set them? We use a held-out corpus. Choose the λs to maximize the probability of the held-out data:
 Fix the N-gram probabilities (on the training data)
 Then search for the λs that give the largest probability to the held-out set:
 log P(W1 ... Wn | M(λ1 ... λk)) = sum over all i of log P_M(λ1...λk)(Wi | Wi-1)
The held-out corpus is used to set the lambdas: take training data, train some N-grams, then choose the lambdas that interpolate those N-grams such that they give the highest probability to the held-out data. In other words, find the λs such that the log probabilities of the words of the held-out data are highest.
## Unknown words: Open versus closed vocabulary tasks ##
If we know all the words in advance, then the vocabulary V is fixed and we are talking about a closed vocabulary task (menus, predefined scenarios). Often we don't know in advance whether the vocabulary is fixed. This translates into Out Of Vocabulary (OOV) words and turns the exercise into an open vocabulary task.
Instead: utilize an unknown word token &lt;UNK&gt; and train on &lt;UNK&gt; probabilities (sketched below):
 Create a fixed lexicon L of size V
 At the text normalization phase, change any training word not in L (all OOV or rare words) to &lt;UNK&gt;
 Now train its probabilities like any normal word
 At decoding time, use the &lt;UNK&gt; probabilities for any input word not in the training data
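A sketch of that normalization phase, with an assumed frequency cutoff standing in for a hand-built fixed lexicon L:

    from collections import Counter

    def replace_oov(train_sentences, min_count=2):
        """Map words rarer than min_count to <UNK> before counting N-grams.
        min_count is an arbitrary cutoff standing in for membership in a
        fixed lexicon L."""
        counts = Counter(w for s in train_sentences for w in s.split())
        lexicon = {w for w, c in counts.items() if c >= min_count}
        return [" ".join(w if w in lexicon else "<UNK>" for w in s.split())
                for s in train_sentences]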
Huge web-scale N-grams
How do we deal with computing probabilities in such large spaces? We prune: only store non-singleton counts, or compute the entropy/perplexity and remove N-grams that don't contribute (entropy-based pruning).
Efficiency
 Efficient data structures like tries
 Bloom filters: approximate language models
 Store words as indices, not strings
 Use Huffman coding to fit large numbers of words into two bytes
 Quantize probabilities (4-8 bits instead of an 8-byte float)
Smoothing for web-scale N-grams
Use "stupid backoff" (Brants et al. 2007), which uses scores rather than probabilities: no discounting, just relative frequencies.
 S(Wi | Wi-k+1 ... Wi-1) =
   count(Wi-k+1 ... Wi) / count(Wi-k+1 ... Wi-1)   if count(Wi-k+1 ... Wi) > 0
   0.4 * S(Wi | Wi-k+2 ... Wi-1)                   otherwise
 S(Wi) = count(Wi) / N
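A recursive sketch of that scoring function, assuming a dictionary mapping word tuples (of every order) to corpus frequencies; the representation is mine:

    def stupid_backoff(word, context, counts, total_words, alpha=0.4):
        """Stupid backoff score S(word | context); a score, not a probability.
        counts maps tuples of words to frequencies; context is a tuple of the
        preceding words. alpha = 0.4 is the weight from Brants et al. 2007."""
        if not context:
            # base case: relative unigram frequency
            return counts.get((word,), 0) / total_words
        ngram = context + (word,)
        if counts.get(ngram, 0) > 0:
            return counts[ngram] / counts[context]
        # back off to a shorter context, discounting the score by alpha
        return alpha * stupid_backoff(word, context[1:], counts,
                                      total_words, alpha)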
Add-1 smoothing is OK for text classification but not so great for language modeling. The most commonly used method is interpolated (extended) Kneser-Ney, covered below. However, for very large N-gram collections like the web, stupid backoff is often adequate.
Advanced Language Modeling
Recent research:
 Discriminative models: choose N-gram weights to improve a task, not to fit the training set
 Parsing-based models

Caching models: words that were used recently have an increased probability of reappearing:
 P_cache(W | history) = λ * P(Wi | Wi-2, Wi-1) + (1 - λ) * c(W in history) / |history|
Caching models perform poorly in speech recognition. (Why?)
Good-Turing Smoothing
 General formulation (add-k): P_Add-k(Wi | Wi-1) = ( c(Wi-1, Wi) + k ) / ( c(Wi-1) + kV )
 Unigram prior smoothing: P(Wi | Wi-1) = ( c(Wi-1, Wi) + m * P(Wi) ) / ( c(Wi-1) + m ) (works well, but not great for LM)
 Advanced smoothing algorithms, each based on some form of intuition:
  Good-Turing
  Kneser-Ney
  Witten-Bell
The goal of smoothing algorithms is to replace those unseen zeros with a number that relates to a proposed likelihood of being present in the body of text being looked at. Sometimes this means using the count of words you've seen once to help estimate the count of things you've never seen.
Notation: N_c = frequency of frequency c, i.e. the count of things we've seen c times.
Example corpus: "Sam I am I am Sam I do not eat"
 I    3   (N_3 = 1)
 Sam  2
 am   2   (N_2 = 2)
 do   1
 not  1
 eat  1   (N_1 = 3)
Imagine this scenario (by Josh Goodman). You are fishing and catch: 10 carp, 3 perch, 2 whitefish, 1 trout, 1 salmon, 1 eel. A total of 18 fish.
How likely is it that the next species is trout? 1/18
That's fine for species you encountered previously. But how do we respond to a new species: what is the likelihood of catfish or bass?
 We can use the estimate of things-we-saw-once to estimate the new things.
 3/18 (because N_1 = 3)
Assuming this statement, how likely is it that the next species is trout? It must be less than 1/18, because we are going to take probability mass from each fish we have seen and add it to the unseen-fish probability. How do we adjust the new probability spread?
Good-Turing calculations
 P*_GT(things with zero frequency) = N_1 / N
 c* = ( (c+1) * N_{c+1} ) / N_c

Unseen fish (bass, catfish):
 c = 0
 MLE: p = 0/18 = 0
adjusted for Good-Turing:
 P*_GT(unseen) = N_1 / N = 3/18

Seen-once fish (trout):
 c = 1
 MLE: p = 1/18
adjusted for Good-Turing:
 c*(trout) = (2 * N_2) / N_1 = (2 * 1) / 3 = 2/3
 P*_GT(trout) = (2/3) / 18 = 1/27
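The whole fish example fits in a few lines of Python, which makes the arithmetic easy to check (the helper names are mine):

    from collections import Counter

    # Josh Goodman's fishing example: 18 fish.
    catch = ["carp"] * 10 + ["perch"] * 3 + ["whitefish"] * 2 \
            + ["trout", "salmon", "eel"]
    N = len(catch)                    # 18
    species = Counter(catch)          # c for each species
    Nc = Counter(species.values())    # N_c: how many species were seen c times

    def c_star(c):
        """Good-Turing adjusted count: c* = (c + 1) * N_{c+1} / N_c."""
        return (c + 1) * Nc[c + 1] / Nc[c]

    print(Nc[1] / N)      # P*_GT(unseen) = 3/18 = 0.1666...
    print(c_star(1) / N)  # P*_GT(trout) = (2/3)/18 = 1/27 = 0.037...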
Complications of Good-Turing
We can't always use N_{k+1}:

For small k, N_k > N_{k+1}.

For large k, the counts are too jumpy; zeros wreck the estimates.

Simple Good-Turing [Gale &amp; Sampson]: replace the empirical N_k with a best-fit power law once the counts of counts get unreliable. (The graph then combines a discrete histogram with, past a cutoff point, a long-tail fitted curve.)
Resulting Good-Turing numbers:
Example: numbers from Church and Gale (1991), 22 million words of AP newswire.
c* = {{(c+1)N_{c+1}} \over {N_c}}
What is the general relationship between these counts?

Count c | Good-Turing c*
------- | --------------
0 | 0.0000270
1 | 0.446
2 | 1.26
3 | 2.24
4 | 3.24
5 | 4.22
6 | 5.19
7 | 6.21
8 | 7.24
9 | 8.25

Apart from the smallest counts, it's roughly a constant discount of 0.75 off each count (c* ≈ c - 0.75).
Kneser-Ney
Kneser-Ney is one of the most sophisticated and most commonly used smoothing methods, and it has a nice intuition behind it.
Absolute Discounting Interpolation
 Save time by just subtracting 0.75, no brains required (or some other discount value for a different corpus):
 P_AbsoluteDiscounting(Wi | Wi-1) = ( c(Wi-1, Wi) - d ) / c(Wi-1) + λ(Wi-1) * P(W)
Kneser-Ney Smoothing 1 offers a better estimate for the probabilities of lower-order unigrams.
Example from the Shannon game: "I can't see without my reading _________." ("Francisco" is a more frequent unigram than "glasses", but it essentially only ever follows "San", so it is a poor guess here.)
The unigram is useful exactly when we haven't seen the bigram.
 Instead of P(W), "how likely is w":
 P_continuation(W): "How likely is w to appear as a novel continuation?"
How do we measure novel continuation? For each word, count the number of bigram types it completes. Every bigram type was a novel continuation the first time it was seen:
 P_continuation(W) ∝ |{ Wi-1 : c(Wi-1, W) > 0 }|
Kneser-Ney Smoothing 2: how many times does w appear as a novel continuation?
 P_continuation(W) ∝ |{ Wi-1 : c(Wi-1, W) > 0 }|
Normalized by the total number of word bigram types, i.e. the cardinality of the set of all seen bigrams, |{ (Wj-1, Wj) : c(Wj-1, Wj) > 0 }|:
 P_continuation(W) = |{ Wi-1 : c(Wi-1, W) > 0 }| / |{ (Wj-1, Wj) : c(Wj-1, Wj) > 0 }|
or in LaTeX:
P_{CONTINUATION}(w) = {|\{w_{i-1} : c(w_{i-1},w)>0\}|
\over
|\{(w_{j-1},w_j) : c(w_{j-1},w_j)>0\}|}
Kneser-Ney Smoothing 3
Alternative metaphor: the number of word types seen to precede w, normalized by the number of word types preceding all words:
P_{CONTINUATION}(w) = {|\{w_{i-1} : c(w_{i-1},w)>0\}| \over \sum_{w'} |\{w'_{i-1} : c(w'_{i-1},w')>0\}|}
Kneser-Ney Smoothing 4: the full bigram formula:
P_{KN}(w_i|w_{i-1}) = {{\max(c(w_{i-1},w_i) - d, 0)} \over {c(w_{i-1})}} + \lambda(w_{i-1})P_{CONTINUATION}(w_i)
where \lambda(w_{i-1}) = {d \over c(w_{i-1})} |\{w : c(w_{i-1},w)>0\}| is the discounted probability mass, spread over the word types that can follow w_{i-1}.
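My attempt at turning the bigram formula into code, as a sketch (it assumes every queried context was seen at least once; all names are mine):

    from collections import Counter

    def make_kneser_ney(bigram_counts, d=0.75):
        """Build a bigram Kneser-Ney estimator from a Counter of
        (prev, word) pairs; d = 0.75 is a typical discount."""
        context_totals = Counter()  # c(prev)
        continuations = Counter()   # distinct contexts each word follows
        fanout = Counter()          # distinct words following each context
        for (prev, word), c in bigram_counts.items():
            context_totals[prev] += c
            continuations[word] += 1
            fanout[prev] += 1
        total_bigram_types = len(bigram_counts)

        def p_kn(word, prev):
            discounted = (max(bigram_counts[(prev, word)] - d, 0)
                          / context_totals[prev])
            lam = d * fanout[prev] / context_totals[prev]
            p_cont = continuations[word] / total_bigram_types
            return discounted + lam * p_cont

        return p_kn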
Kneser-Ney Recursion
The general recursive formulation backs off from higher- to lower-order models:
P_{KN}(w_i|w^{i-1}_{i-n+1}) = {{\max(c_{KN}(w^i_{i-n+1}) - d, 0)} \over {c_{KN}(w^{i-1}_{i-n+1})}} + \lambda(w^{i-1}_{i-n+1})P_{KN}(w_i|w^{i-1}_{i-n+2})
where c_KN is the raw count for the highest-order N-gram and the continuation count for the lower orders.
Harvesting semantic relations / Espresso: http://www.patrickpantel.com/download/papers/2006/acl06-01.pdf