22/02/2002 13
Training: estimating probabilities
- name-class bigram:
$$\Pr(NC \mid NC_{-1}, w_{-1}) = \frac{c(NC, NC_{-1}, w_{-1})}{c(NC_{-1}, w_{-1})}$$
- first-word-bigram:
$$\Pr(\langle w, f \rangle_{first} \mid NC, NC_{-1}) = \frac{c(\langle w, f \rangle_{first}, NC, NC_{-1})}{c(NC, NC_{-1})}$$
- non-first-word-bigram:
$$\Pr(\langle w, f \rangle \mid \langle w, f \rangle_{-1}, NC) = \frac{c(\langle w, f \rangle, \langle w, f \rangle_{-1}, NC)}{c(\langle w, f \rangle_{-1}, NC)}$$
where c(event) = number of occurrences of event in the training corpus
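As a rough illustration of the count-ratio estimates on this slide, the sketch below computes the name-class bigram probability Pr(NC | NC_-1, w_-1) from raw event counts. The function name, the tuple layout, and the toy events are assumptions made for this example, not part of the original system; the other two distributions would follow the same pattern with different event tuples.

```python
from collections import Counter

def estimate_name_class_bigram(events):
    """Relative-frequency estimate of Pr(NC | NC_-1, w_-1).

    `events` is an iterable of (nc, prev_nc, prev_word) tuples,
    a hypothetical representation of the training events.
    """
    joint = Counter()    # c(NC, NC_-1, w_-1)
    context = Counter()  # c(NC_-1, w_-1)
    for nc, prev_nc, prev_word in events:
        joint[(nc, prev_nc, prev_word)] += 1
        context[(prev_nc, prev_word)] += 1
    # Divide each joint count by the count of its conditioning context.
    return {event: n / context[event[1:]] for event, n in joint.items()}

# Toy training events (invented for illustration only).
events = [
    ("PERSON", "NOT-A-NAME", "Mr."),
    ("PERSON", "NOT-A-NAME", "Mr."),
    ("NOT-A-NAME", "NOT-A-NAME", "Mr."),
]
probs = estimate_name_class_bigram(events)
```

Here the context ("NOT-A-NAME", "Mr.") occurs three times and is followed by PERSON twice, so the estimate for PERSON in that context is 2/3.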
22/02/2002 14
Handling of unknown words
- The vocabulary is built as the model trains
- All unknown words are mapped to the token _UNK_
- _UNK_ can occur as the current word, the previous word, or both
- Train an unknown word model on held-out data
Gather statistics of unknown words in the midst of known words
Hold out 50% of the training data for the unknown word model
Do the same for the other 50%
Combine the bigram counts for the first unknown training file
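The held-out procedure can be sketched roughly as follows. This is a simplified illustration that counts plain word bigrams rather than the word-feature pairs and name classes used in the full model; the function names and the toy corpus are assumptions made for this example.

```python
from collections import Counter

UNK = "_UNK_"

def unk_bigram_counts(vocab_half, heldout_half):
    """Count bigrams in `heldout_half`, mapping every word that does not
    appear in `vocab_half` to the _UNK_ token."""
    vocab = {word for sentence in vocab_half for word in sentence}
    counts = Counter()
    for sentence in heldout_half:
        mapped = [word if word in vocab else UNK for word in sentence]
        counts.update(zip(mapped, mapped[1:]))
    return counts

def train_unknown_word_counts(sentences):
    """Hold out each half of the data in turn and combine the bigram counts."""
    mid = len(sentences) // 2
    first, second = sentences[:mid], sentences[mid:]
    return unk_bigram_counts(first, second) + unk_bigram_counts(second, first)

# Toy corpus (invented for illustration only).
corpus = [
    ["Mr.", "Smith", "spoke"],
    ["Mr.", "Jones", "spoke"],
]
counts = train_unknown_word_counts(corpus)
```

In this toy corpus each half contains one surname unseen in the other half, so the combined counts record _UNK_ both as the current word (after "Mr.") and as the previous word (before "spoke").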