INF4820: Algorithms for Artificial Intelligence and Natural Language Processing
Hidden Markov Models
Murhaf Fares & Stephan Oepen
Language Technology Group (LTG)
October 27, 2016
Recap: Probabilistic Language Models
◮ Basic probability theory: axioms, joint vs. conditional probability, independence, Bayes' Theorem;
◮ Previous context can help predict the next element of a sequence, for example words in a sentence;
◮ Rather than use the whole previous context, the Markov assumption says that the whole history can be approximated by the last n − 1 elements;
◮ An n-gram language model predicts the n-th word, conditioned on the n − 1 previous words;
◮ Maximum Likelihood Estimation uses relative frequencies to approximate the conditional probabilities needed for an n-gram model;
◮ Smoothing techniques are used to avoid zero probabilities.
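To make the recap concrete, here is a minimal sketch of Maximum Likelihood Estimation for a bigram language model. The corpus, the function name `train_bigram_lm`, and the `<s>`/`</s>` boundary markers are invented for illustration; no smoothing is applied.

```python
from collections import Counter

def train_bigram_lm(sentences):
    """MLE bigram model: P(w_i | w_{i-1}) = C(w_{i-1}, w_i) / C(w_{i-1})."""
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        tokens = ["<s>"] + sent + ["</s>"]
        unigrams.update(tokens[:-1])          # count each conditioning context
        bigrams.update(zip(tokens, tokens[1:]))
    return {bg: c / unigrams[bg[0]] for bg, c in bigrams.items()}

# Toy corpus, invented for this example:
corpus = [["she", "studies", "morphosyntax"],
          ["she", "studies", "linguistics"]]
lm = train_bigram_lm(corpus)
# lm[("she", "studies")] → 1.0, lm[("studies", "morphosyntax")] → 0.5
```

Any bigram unseen in training gets no entry at all, i.e. probability zero, which is exactly the problem smoothing addresses.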
Today
Determining
◮ which string is most likely:
  ◮ She studies morphosyntax vs. She studies more faux syntax
◮ which tag sequence is most likely for flies like flowers:
  ◮ NNS VB NNS vs. VBZ P NNS
◮ which syntactic analysis is most likely:
  ◮ [two parse trees for "I ate sushi with tuna", with the PP "with tuna" attaching either to the NP "sushi" or to the VP "ate"]
Parts of Speech
◮ Known by a variety of names: part-of-speech, POS, lexical categories, word classes, morphological classes, . . .
◮ 'Traditionally' defined semantically (e.g. "nouns are naming words"), but more accurately by their distributional properties.
◮ Open classes
  ◮ New words created/updated/deleted all the time
◮ Closed classes
  ◮ Smaller classes, relatively static membership
  ◮ Usually function words
Open Class Words
◮ Nouns: dog, Oslo, scissors, snow, people, truth, cups
  ◮ proper or common; countable or uncountable; plural or singular; masculine, feminine or neuter; . . .
◮ Verbs: fly, rained, having, ate, seen
  ◮ transitive, intransitive, ditransitive; past, present, passive; stative or dynamic; plural or singular; . . .
◮ Adjectives: good, smaller, unique, fastest, best, unhappy
  ◮ comparative or superlative; predicative or attributive; intersective or non-intersective; . . .
◮ Adverbs: again, somewhat, slowly, yesterday, aloud
  ◮ intersective; scopal; discourse; degree; temporal; directional; comparative or superlative; . . .
Closed Class Words
◮ Prepositions: on, under, from, at, near, over, . . .
◮ Determiners: a, an, the, that, . . .
◮ Pronouns: she, who, I, others, . . .
◮ Conjunctions: and, but, or, when, . . .
◮ Auxiliary verbs: can, may, should, must, . . .
◮ Interjections, particles, numerals, negatives, politeness markers, greetings, existential there, . . .
(Examples from Jurafsky & Martin, 2008)
POS Tagging
The (automatic) assignment of POS tags to word sequences
◮ non-trivial where words are ambiguous: fly (v) vs. fly (n)
◮ choice of the correct tag is context-dependent
◮ useful as pre-processing for parsing, etc.; but also directly for text-to-speech (TTS) systems: content (n) vs. content (adj)
◮ difficulty and usefulness can depend on the tagset
  ◮ English: Penn Treebank (PTB), 45 tags: NNS, NN, NNP, JJ, JJR, JJS
    http://bulba.sdsu.edu/jeanette/thesis/PennTags.html
  ◮ Norwegian: Oslo-Bergen Tagset, multi-part tags: subst appell fem be ent
    http://tekstlab.uio.no/obt-ny/english/tags.html
Labelled Sequences
◮ We are interested in the probability of sequences like:

  flies/NNS like/VB the/DT wind/NN   vs.   flies/VBZ like/P the/DT wind/NN

◮ In normal text, we see the words, but not the tags.
◮ Consider the POS tags to be the underlying skeleton of the sentence, unseen but influencing the sentence shape.
◮ A structure like this, consisting of a hidden state sequence and a related observation sequence, can be modelled as a Hidden Markov Model.
Hidden Markov Models
The generative story:

  states:       S    DT    NN    VBZ    NNS    /S
  observations:      the   cat   eats   mice

P(S, O) = P(DT|S) P(the|DT) P(NN|DT) P(cat|NN) P(VBZ|NN) P(eats|VBZ) P(NNS|VBZ) P(mice|NNS) P(/S|NNS)
Hidden Markov Models
For a bi-gram HMM, with observations O = o1 . . . oN:

  P(S, O) = ∏_{i=1}^{N+1} P(si | si−1) P(oi | si)   where s0 = S, sN+1 = /S

◮ The transition probabilities model the probabilities of moving from state to state.
◮ The emission probabilities model the probability that a state emits a particular observation.
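The product of transition and emission probabilities can be sketched directly in code. This is an illustration, not part of the lecture: the function name `joint_probability`, the dictionary layout, and the toy DT/NN probabilities are all invented.

```python
def joint_probability(states, observations, trans, emit):
    """P(S, O) = prod_i P(s_i | s_{i-1}) * P(o_i | s_i),
    with pseudo-states s_0 = 'S' and s_{N+1} = '/S'."""
    p, prev = 1.0, "S"
    for s, o in zip(states, observations):
        p *= trans[(prev, s)] * emit[(s, o)]  # transition, then emission
        prev = s
    return p * trans[(prev, "/S")]            # final transition to /S

# Made-up toy parameters for a two-tag fragment:
trans = {("S", "DT"): 0.5, ("DT", "NN"): 0.8, ("NN", "/S"): 0.4}
emit = {("DT", "the"): 0.6, ("NN", "cat"): 0.1}

p = joint_probability(["DT", "NN"], ["the", "cat"], trans, emit)
# p = 0.5 * 0.6 * 0.8 * 0.1 * 0.4 = 0.0096
```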
Using HMMs
The HMM models the process of generating the labelled sequence. We can use this model for a number of tasks:
◮ P(S, O), given S and O
◮ P(O), given O
◮ the S that maximises P(S|O), given O
◮ We can also learn the model parameters, given a set of observations.
Our observations will be words (wi), and our states PoS tags (ti).
Estimation
As so often in NLP, we learn an HMM from labelled data:

Transition probabilities
Based on a training corpus of previously tagged text, with tags as our states, the MLE can be computed from the counts of observed tags:

  P(ti | ti−1) = C(ti−1, ti) / C(ti−1)

Emission probabilities
Computed from relative frequencies in the same way, with the words as observations:

  P(wi | ti) = C(ti, wi) / C(ti)
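The two counting steps can be sketched as follows, assuming the tagged corpus is a list of (word, tag) sentences. The function name `train_hmm` and the two-sentence corpus are invented for illustration; no smoothing is applied.

```python
from collections import Counter

def train_hmm(tagged_sentences):
    """MLE estimates: P(t_i | t_{i-1}) = C(t_{i-1}, t_i) / C(t_{i-1})
    and P(w_i | t_i) = C(t_i, w_i) / C(t_i)."""
    tag_counts, trans_counts, emit_counts = Counter(), Counter(), Counter()
    for sent in tagged_sentences:
        tags = ["S"] + [t for _, t in sent] + ["/S"]   # add pseudo-states
        tag_counts.update(tags[:-1])                   # conditioning contexts
        trans_counts.update(zip(tags, tags[1:]))       # C(t_{i-1}, t_i)
        emit_counts.update((t, w) for w, t in sent)    # C(t_i, w_i)
    trans = {(a, b): c / tag_counts[a] for (a, b), c in trans_counts.items()}
    emit = {(t, w): c / tag_counts[t] for (t, w), c in emit_counts.items()}
    return trans, emit

# Tiny invented corpus:
corpus = [[("the", "DT"), ("cat", "NN")],
          [("the", "DT"), ("dog", "NN")]]
trans, emit = train_hmm(corpus)
# trans[("S", "DT")] → 1.0; emit[("NN", "cat")] → 0.5
```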
Implementation Issues

  P(S, O) = P(s1|S) P(o1|s1) P(s2|s1) P(o2|s2) P(s3|s2) P(o3|s3) . . .
          = 0.0429 × 0.0031 × 0.0044 × 0.0001 × 0.0072 × . . .

◮ Multiplying many small probabilities → underflow
◮ Solution: work in log(arithmic) space:
  ◮ log(AB) = log(A) + log(B)
  ◮ hence P(A)P(B) = exp(log(A) + log(B))
  ◮ log(P(S, O)) = −1.368 + −2.509 + −2.357 + −4 + −2.143 + . . .

The issues related to MLE / smoothing that we discussed for n-gram models also apply here . . .
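A quick demonstration of the log-space trick, using the probabilities from the slide (the slide's log values are base-10, so `math.log10` is used here):

```python
import math

probs = [0.0429, 0.0031, 0.0044, 0.0001, 0.0072]

# Direct multiplication: each factor shrinks the product toward underflow.
direct = 1.0
for p in probs:
    direct *= p

# In log space we add instead of multiply, so the magnitude stays manageable.
log_total = sum(math.log10(p) for p in probs)
# log_total ≈ -1.368 + -2.509 + -2.357 + -4 + -2.143, as on the slide

# Converting back recovers the same value (up to floating-point error):
assert abs(10 ** log_total - direct) < 1e-20
```

For five factors the direct product is still representable; with sentence-length sequences of probabilities this small, it would not be.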
Ice Cream and Global Warming
Missing records of weather in Baltimore for Summer 2007:
◮ Jason likes to eat ice cream.
◮ He records his daily ice cream consumption in his diary.
◮ The number of ice creams he ate was influenced, but not entirely determined, by the weather.
◮ Today's weather is partially predictable from yesterday's.

A Hidden Markov Model! With:
◮ Hidden states: {H, C} (plus pseudo-states S and /S)
◮ Observations: {1, 2, 3}
Ice Cream and Global Warming

Transition probabilities:
  P(H|S) = 0.8    P(C|S) = 0.2
  P(H|H) = 0.6    P(C|H) = 0.2    P(/S|H) = 0.2
  P(H|C) = 0.3    P(C|C) = 0.5    P(/S|C) = 0.2

Emission probabilities:
  P(1|H) = 0.2    P(2|H) = 0.4    P(3|H) = 0.4
  P(1|C) = 0.5    P(2|C) = 0.4    P(3|C) = 0.1
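These two probability tables can be written down directly as dictionaries; the names `trans` and `emit` are my own, but the numbers are the model's. The assertions check that each state's outgoing transitions and emissions form proper distributions.

```python
# Transition probabilities P(to | from), including pseudo-states S and /S.
trans = {("S", "H"): 0.8, ("S", "C"): 0.2,
         ("H", "H"): 0.6, ("H", "C"): 0.2, ("H", "/S"): 0.2,
         ("C", "H"): 0.3, ("C", "C"): 0.5, ("C", "/S"): 0.2}

# Emission probabilities P(observation | state): ice creams eaten per day.
emit = {("H", 1): 0.2, ("H", 2): 0.4, ("H", 3): 0.4,
        ("C", 1): 0.5, ("C", 2): 0.4, ("C", 3): 0.1}

# Sanity check: each row sums to one.
for s in ("H", "C"):
    assert abs(sum(p for (a, _), p in trans.items() if a == s) - 1.0) < 1e-9
    assert abs(sum(p for (a, _), p in emit.items() if a == s) - 1.0) < 1e-9
```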
Using HMMs
The HMM models the process of generating the labelled sequence. We can use this model for a number of tasks:
◮ P(S, O), given S and O
◮ P(O), given O
◮ the S that maximises P(S|O), given O
◮ P(sx|O), given O
◮ We can also learn the model parameters, given a set of observations.
Part-of-Speech Tagging
We want to find the tag sequence, given a word sequence. With tags as our states and words as our observations, we know:

  P(S, O) = ∏_{i=1}^{N+1} P(si | si−1) P(oi | si)

We want:

  P(S|O) = P(S, O) / P(O)

Actually, we want the state sequence Ŝ that maximises P(S|O):

  Ŝ = arg max_S P(S, O) / P(O)

Since P(O) is always the same, we can drop the denominator:

  Ŝ = arg max_S P(S, O)
Decoding
Task: What is the most likely state sequence S, given an observation sequence O and an HMM?

HMM:
  P(H|S) = 0.8    P(C|S) = 0.2
  P(H|H) = 0.6    P(C|H) = 0.2
  P(H|C) = 0.3    P(C|C) = 0.5
  P(/S|H) = 0.2   P(/S|C) = 0.2
  P(1|H) = 0.2    P(1|C) = 0.5
  P(2|H) = 0.4    P(2|C) = 0.4
  P(3|H) = 0.4    P(3|C) = 0.1

If O = 3 1 3:
  S H H H /S   0.0018432   ← most likely
  S H H C /S   0.0001536
  S H C H /S   0.0007680
  S H C C /S   0.0003200
  S C H H /S   0.0000576
  S C H C /S   0.0000048
  S C C H /S   0.0001200
  S C C C /S   0.0000500
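The brute-force version of this table can be sketched by enumerating every state sequence and scoring each one; the helper name `joint` is mine, but the model parameters and the observation sequence 3 1 3 are from the slide.

```python
from itertools import product

trans = {("S", "H"): 0.8, ("S", "C"): 0.2,
         ("H", "H"): 0.6, ("H", "C"): 0.2, ("H", "/S"): 0.2,
         ("C", "H"): 0.3, ("C", "C"): 0.5, ("C", "/S"): 0.2}
emit = {("H", 1): 0.2, ("H", 2): 0.4, ("H", 3): 0.4,
        ("C", 1): 0.5, ("C", 2): 0.4, ("C", 3): 0.1}

def joint(states, obs):
    """P(S, O) for one candidate state sequence."""
    p, prev = 1.0, "S"
    for s, o in zip(states, obs):
        p *= trans[(prev, s)] * emit[(s, o)]
        prev = s
    return p * trans[(prev, "/S")]

obs = (3, 1, 3)
# Score all 2^3 = 8 candidate sequences over states H and C:
scores = {seq: joint(seq, obs) for seq in product("HC", repeat=len(obs))}
best = max(scores, key=scores.get)
# best == ("H", "H", "H"), with probability 0.0018432, matching the table
```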
Dynamic Programming
For (only) two states and a (short) observation sequence of length three, comparing all possible sequences is workable, but . . .
◮ for N observations and L states, there are L^N sequences
◮ we do the same partial calculations over and over again

Dynamic Programming:
◮ records sub-problem solutions for further re-use
◮ useful when a complex problem can be described recursively
◮ examples: Dijkstra's shortest path, minimum edit distance, longest common subsequence, Viterbi algorithm
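One of the examples named above, minimum edit distance, makes a compact illustration of the idea: each cell of the table is a sub-problem solved once and then re-used, instead of being recomputed recursively. This sketch is mine, not part of the lecture.

```python
def edit_distance(a, b):
    """Minimum edit distance via dynamic programming: d[i][j] holds the
    cost of turning a[:i] into b[:j], so each sub-problem is solved once."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i                              # delete all of a[:i]
    for j in range(len(b) + 1):
        d[0][j] = j                              # insert all of b[:j]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            sub = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,       # deletion
                          d[i][j - 1] + 1,       # insertion
                          d[i - 1][j - 1] + sub) # substitution / match
    return d[len(a)][len(b)]

# Classic example: kitten → sitting takes 3 edits.
assert edit_distance("kitten", "sitting") == 3
```

The naive recursion would revisit the same (i, j) pairs exponentially often; the table makes the cost O(|a| · |b|), the same trick Viterbi plays for state sequences.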
Viterbi Algorithm
Recall our problem: maximise

  P(s1 . . . sn | o1 . . . on) = P(s1|s0) P(o1|s1) P(s2|s1) P(o2|s2) . . .

Our recursive sub-problem:

  vi(x) = max_{k=1}^{L} [ vi−1(k) · P(x|k) · P(oi|x) ]

The variable vi(x) represents the maximum probability that the i-th state is x, given that we have seen the first i observations, o1 . . . oi.

At each step, we record backpointers showing which previous state led to the maximum probability.
An Example of the Viterbi Algorithm

For O = 3 1 3, with the ice cream model:

  v1(H) = P(H|S) P(3|H) = 0.8 ∗ 0.4 = 0.32
  v1(C) = P(C|S) P(3|C) = 0.2 ∗ 0.1 = 0.02
  v2(H) = max(.32 ∗ .12, .02 ∗ .06) = .0384     (.12 = P(H|H)P(1|H), .06 = P(H|C)P(1|H))
  v2(C) = max(.32 ∗ .1, .02 ∗ .25) = .032       (.1 = P(C|H)P(1|C), .25 = P(C|C)P(1|C))
  v3(H) = max(.0384 ∗ .24, .032 ∗ .12) = .009216
  v3(C) = max(.0384 ∗ .02, .032 ∗ .05) = .0016
  vf(/S) = max(.009216 ∗ .2, .0016 ∗ .2) = .0018432

Following the backpointers from /S recovers the best path: H H H.
Pseudocode for the Viterbi Algorithm
Input: observations of length N, state set of size L
Output: best-path
create a path probability matrix viterbi[N, L + 2]
create a path backpointer matrix backpointer[N, L + 2]
for each state s from 1 to L do
    viterbi[1, s] ← trans(S, s) × emit(o1, s)
    backpointer[1, s] ← 0
end
for each time step i from 2 to N do
    for each state s from 1 to L do
        viterbi[i, s] ← max_{s′=1..L} viterbi[i − 1, s′] × trans(s′, s) × emit(oi, s)
        backpointer[i, s] ← arg max_{s′=1..L} viterbi[i − 1, s′] × trans(s′, s)
    end
end
viterbi[N, L + 1] ← max_{s=1..L} viterbi[N, s] × trans(s, /S)
backpointer[N, L + 1] ← arg max_{s=1..L} viterbi[N, s] × trans(s, /S)
return the path by following backpointers from backpointer[N, L + 1]
22
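As a sanity check, the pseudocode can be turned into a short Python sketch. The dictionary-based encoding of the model and the H/C state names are illustrative choices; the parameters are the ice cream model from the example slides.

```python
# A minimal Viterbi decoder, following the pseudocode on the slide.
# The model encoding (tuple-keyed dicts) is an illustrative choice.

def viterbi(observations, states, trans, emit):
    # v[i][s]: probability of the best path ending in state s after i+1 observations
    v = [{} for _ in observations]
    backpointer = [{} for _ in observations]
    for s in states:  # initialisation: transition out of the start state S
        v[0][s] = trans[("S", s)] * emit[(s, observations[0])]
        backpointer[0][s] = None
    for i in range(1, len(observations)):  # recursion
        for s in states:
            best_prev = max(states, key=lambda sp: v[i - 1][sp] * trans[(sp, s)])
            v[i][s] = v[i - 1][best_prev] * trans[(best_prev, s)] * emit[(s, observations[i])]
            backpointer[i][s] = best_prev
    # termination: transition into the end state /S
    last = max(states, key=lambda s: v[-1][s] * trans[(s, "/S")])
    prob = v[-1][last] * trans[(last, "/S")]
    path = [last]
    for i in range(len(observations) - 1, 0, -1):  # follow backpointers
        path.append(backpointer[i][path[-1]])
    return prob, list(reversed(path))

# Ice cream model from the example slides
trans = {("S", "H"): 0.8, ("S", "C"): 0.2,
         ("H", "H"): 0.6, ("H", "C"): 0.2, ("H", "/S"): 0.2,
         ("C", "H"): 0.3, ("C", "C"): 0.5, ("C", "/S"): 0.2}
emit = {("H", 1): 0.2, ("H", 3): 0.4, ("C", 1): 0.5, ("C", 3): 0.1}

prob, path = viterbi([3, 1, 3], ["H", "C"], trans, emit)
print(path, prob)  # best path H H H with probability 0.0018432
```

The result reproduces the trellis on the example slide: the best state sequence is hot hot hot, with probability 0.0018432.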
Diversion: Complexity and O(N)
Big-O notation describes the complexity of an algorithm.
◮ it describes the worst-case order of growth in terms of the size of the input
◮ only the largest-order term is represented
◮ constant factors are ignored
◮ determined by looking at loops in the code
23
Pseudocode for the Viterbi Algorithm
Input: observations of length N, state set of size L
Output: best-path
create a path probability matrix viterbi[N, L + 2]
create a path backpointer matrix backpointer[N, L + 2]
for each state s from 1 to L do                                              ⊳ L
    viterbi[1, s] ← trans(S, s) × emit(o1, s)
    backpointer[1, s] ← 0
end
for each time step i from 2 to N do                                          ⊳ N
    for each state s from 1 to L do                                          ⊳ L
        viterbi[i, s] ← max_{s′=1..L} viterbi[i − 1, s′] × trans(s′, s) × emit(oi, s)   ⊳ L
        backpointer[i, s] ← arg max_{s′=1..L} viterbi[i − 1, s′] × trans(s′, s)
    end
end
viterbi[N, L + 1] ← max_{s=1..L} viterbi[N, s] × trans(s, /S)
backpointer[N, L + 1] ← arg max_{s=1..L} viterbi[N, s] × trans(s, /S)
return the path by following backpointers from backpointer[N, L + 1]         ⊳ N
Total: L + L²N + N steps ⇒ O(L²N)
24
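The loop tally can be verified mechanically. The sketch below just counts iterations of each loop for a few arbitrary (made-up) values of L and N, confirming that the dominant term grows as L²N.

```python
# Sketch: count the work done by the Viterbi loops, mirroring the loop
# structure of the pseudocode. The (L, N) values below are arbitrary.

def viterbi_op_count(L, N):
    ops = 0
    for s in range(L):           # initialisation: L steps
        ops += 1
    for i in range(1, N):        # recursion: (N - 1) time steps ...
        for s in range(L):       # ... times L target states ...
            for sp in range(L):  # ... times L predecessor states
                ops += 1
    ops += L                     # termination: max over L states
    ops += N                     # backtrace: follow N backpointers
    return ops

for L, N in [(2, 3), (10, 50), (45, 100)]:
    assert viterbi_op_count(L, N) == L + L * L * (N - 1) + L + N
# the L^2 * (N - 1) term dominates, matching the slide's L + L^2 N + N
# tally up to lower-order terms, hence O(L^2 N)
```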
Using HMMs
The HMM models the process of generating the labelled sequence. We can use this model for a number of tasks:
◮ P(S, O) given S and O
◮ P(O) given O
◮ S that maximises P(S|O) given O
◮ P(sx|O) given O
◮ We can also learn the model parameters, given a set of observations.
25
Computing Likelihoods
Task: Given an observation sequence O, determine the likelihood P(O), according to the HMM.
Compute the sum over all possible state sequences:
P(O) = Σ_S P(O, S)
For example, for the ice cream sequence 3 1 3:
P(3 1 3) = P(3 1 3, cold cold cold) + P(3 1 3, cold cold hot) + P(3 1 3, hot hot cold) + . . .
⇒ O(L^N N): there are L^N state sequences, each requiring O(N) multiplications.
26
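This brute-force sum can be written directly. The sketch below enumerates all L^N state sequences of the ice cream model; it is exactly the O(L^N N) computation that the Forward algorithm replaces with dynamic programming.

```python
# Naive likelihood computation by enumerating every state sequence.
# Model parameters are the ice cream HMM from the example slides.
from itertools import product

trans = {("S", "H"): 0.8, ("S", "C"): 0.2,
         ("H", "H"): 0.6, ("H", "C"): 0.2, ("H", "/S"): 0.2,
         ("C", "H"): 0.3, ("C", "C"): 0.5, ("C", "/S"): 0.2}
emit = {("H", 1): 0.2, ("H", 3): 0.4, ("C", 1): 0.5, ("C", 3): 0.1}

def likelihood(observations, states=("H", "C")):
    total = 0.0
    for seq in product(states, repeat=len(observations)):  # all L^N sequences
        p = 1.0
        prev = "S"
        for state, obs in zip(seq, observations):
            p *= trans[(prev, state)] * emit[(state, obs)]
            prev = state
        p *= trans[(prev, "/S")]  # transition into the end state
        total += p                # sum over sequences: P(O) = sum_S P(O, S)
    return total

print(round(likelihood([3, 1, 3]), 7))  # 0.0033172
```

The printed value agrees with the Forward trellis worked through on the later example slide.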
The Forward Algorithm
Again, we use dynamic programming: storing and reusing the results of partial computations in a trellis α.
Each cell in the trellis stores the probability of being in state sx after seeing the first i observations:
αi(x) = P(o1 . . . oi, si = x) = Σ_{k=1..L} αi−1(k) · P(x|k) · P(oi|x)
Note the sum, instead of the max in Viterbi.
27
An Example of the Forward Algorithm
[Trellis over states H and C for the observation sequence 3 1 3, with start state S and end state /S]

Edge probabilities:
P(H|S)P(3|H) = 0.8 ∗ 0.4    P(C|S)P(3|C) = 0.2 ∗ 0.1
P(H|H)P(1|H) = 0.6 ∗ 0.2    P(C|H)P(1|C) = 0.2 ∗ 0.5
P(H|C)P(1|H) = 0.3 ∗ 0.2    P(C|C)P(1|C) = 0.5 ∗ 0.5
P(H|H)P(3|H) = 0.6 ∗ 0.4    P(C|H)P(3|C) = 0.2 ∗ 0.1
P(H|C)P(3|H) = 0.3 ∗ 0.4    P(C|C)P(3|C) = 0.5 ∗ 0.1
P(/S|H) = 0.2               P(/S|C) = 0.2

Forward values:
α1(H) = 0.32
α1(C) = 0.02
α2(H) = (.32 ∗ .12 + .02 ∗ .06) = .0396
α2(C) = (.32 ∗ .1 + .02 ∗ .25) = .037
α3(H) = (.0396 ∗ .24 + .037 ∗ .12) = .013944
α3(C) = (.0396 ∗ .02 + .037 ∗ .05) = .002642
αf(/S) = (.013944 ∗ .2 + .002642 ∗ .2) = .0033172

P(3 1 3) = 0.0033172
28
Pseudocode for the Forward Algorithm
Input: observations of length N, state set of size L
Output: forward-probability
create a probability matrix forward[N, L + 2]
for each state s from 1 to L do
    forward[1, s] ← trans(S, s) × emit(o1, s)
end
for each time step i from 2 to N do
    for each state s from 1 to L do
        forward[i, s] ← Σ_{s′=1..L} forward[i − 1, s′] × trans(s′, s) × emit(oi, s)
    end
end
forward[N, L + 1] ← Σ_{s=1..L} forward[N, s] × trans(s, /S)
return forward[N, L + 1]
29
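A direct implementation of this pseudocode, again with the ice cream model from the example slides encoded as tuple-keyed dictionaries (an illustrative choice):

```python
# A minimal Forward implementation: identical in shape to Viterbi,
# but with a sum over predecessors instead of a max.

def forward(observations, states, trans, emit):
    # alpha[i][s] = P(o_1 ... o_i, s_i = s)
    alpha = [{} for _ in observations]
    for s in states:  # initialisation: transition out of the start state S
        alpha[0][s] = trans[("S", s)] * emit[(s, observations[0])]
    for i in range(1, len(observations)):  # recursion: sum, not max
        for s in states:
            alpha[i][s] = emit[(s, observations[i])] * sum(
                alpha[i - 1][sp] * trans[(sp, s)] for sp in states)
    # termination: sum over transitions into the end state /S
    return sum(alpha[-1][s] * trans[(s, "/S")] for s in states)

trans = {("S", "H"): 0.8, ("S", "C"): 0.2,
         ("H", "H"): 0.6, ("H", "C"): 0.2, ("H", "/S"): 0.2,
         ("C", "H"): 0.3, ("C", "C"): 0.5, ("C", "/S"): 0.2}
emit = {("H", 1): 0.2, ("H", 3): 0.4, ("C", 1): 0.5, ("C", 3): 0.1}

p = forward([3, 1, 3], ["H", "C"], trans, emit)
print(round(p, 7))  # 0.0033172, matching the trellis on the example slide
```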
Tagger Evaluation
To evaluate a part-of-speech tagger (or any classification system) we:
◮ train on a labelled training set
◮ test on a separate test set
For a POS tagger, the standard evaluation metric is tag accuracy:
Acc = number of correct tags / number of words
The other metric sometimes used is error rate:
error rate = 1 − Acc
30
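A minimal sketch of these two metrics; the gold and predicted tag sequences below are invented for illustration, not output from any real tagger.

```python
# Tag accuracy and error rate over aligned gold/predicted sequences.

def accuracy(gold, predicted):
    assert len(gold) == len(predicted)  # one predicted tag per word
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)

gold      = ["DT", "NN", "VBZ", "DT", "JJ", "NN"]
predicted = ["DT", "NN", "VBZ", "DT", "NN", "NN"]  # one mistake

acc = accuracy(gold, predicted)
print(round(acc, 3), round(1 - acc, 3))  # accuracy 0.833, error rate 0.167
```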
Summary
◮ Part-of-speech tagging as an example of sequence labelling.
◮ Hidden Markov Models to model the observation and hidden sequences.
◮ Learn the parameters of an HMM (i.e. transition and emission probabilities) using MLE.
◮ Use Viterbi for decoding, i.e. find the S that maximises P(S|O), given O.
◮ Use the Forward algorithm to compute the likelihood P(O), given O.
31