 
Grammars, graphs and automata
Mark Johnson
Brown University
ESSLLI 2005
slides available from http://cog.brown.edu/~mj

High-level overview
• Probability distributions and graphical models
• (Probabilistic) finite state machines and context-free grammars
  – computation (dynamic programming)
  – estimation
• Log-linear models
  – stochastic unification-based grammars
  – reranking parsing
• Weighted CFGs and proper PCFGs

Topics
• Graphical models and Bayes networks
• (Hidden) Markov models
• (Probabilistic) context-free grammars and finite-state machines
• Computation with and estimation of PCFGs
• Lexicalized and bi-lexicalized PCFGs
• Non-local dependencies and log-linear models
• Features in reranking parsing
• Stochastic unification-based grammar
• Weighted CFGs and proper PCFGs

What is computational linguistics?
Computational linguistics studies the computational processes involved in language production, comprehension and acquisition.
• assumption that language is inherently computational
• scientific side:
  – modeling human performance (computational psycholinguistics)
  – understanding how it can be done at all
• technological applications:
  – speech recognition
  – information extraction (who did what to whom) and question answering
  – machine translation (translation by computer)

(Some of the) problems in modeling language
+ Language is a product of the human mind
  ⇒ any structure we observe is a product of the mind
− Language involves a transduction between form and meaning, but we don't know much about the way meanings are represented
+/− We have (reasonable?) guesses about some of the computational processes involved in language
− We don't know very much about the cognitive processes that language interacts with
− We know little about the anatomical layout of language in the brain
− We know little about neural networks that might support linguistic computations

Aspects of linguistic structure
• Phonetics: the production and perception of speech sounds
• Phonology: the organization and regularities of speech sounds
• Morphology: the structure and organization of words
• Syntax: the way words combine to form phrases and sentences
• Semantics: the way meaning is associated with sentences
• Pragmatics: how language can be used to do things
In general, the further we get from speech, the less well we understand what's going on!

Aspects of syntactic and semantic structure
[Figure: phrase structure tree for "Most people hate baked beans But the students promised to eat them", shown here in bracketed form]
(S
  (S (NP (DT Most) (NN people))
     (VP (VB hate)
         (NP (VBD baked) (NNS beans))))
  (CONJ But)
  (S (NP (DT the) (NNS students))
     (VP (VBD promised)
         (S (NP PRO)
            (VP (TO to)
                (VP (VB eat)
                    (NP (PRP them))))))))
• Anaphora: them refers to baked beans
• Predicate-argument structure: the students is agent of eat
• Discourse structure: second clause is contrasted with first
These all refer to phrase structure entities! Parsing is the process of recovering these entities.

A very brief history
(Antiquity) Birth of linguistics, logic, rhetoric
(1900s) Structuralist linguistics (phrase structure)
(1900s) Mathematical logic
(1900s) Probability and statistics
(1940s) Behaviorism (discovery procedures, corpus linguistics)
(1940s) Ciphers and codes
(1950s) Information theory
(1950s) Automata theory
(1960s) Context-free grammars
(1960s) Generative grammar dominates (US) linguistics (Chomsky)
(1980s) "Neural networks" (learning as parameter estimation)
(1980s) Graphical models (Bayes nets, Markov Random Fields)
(1980s) Statistical models dominate speech recognition
(1980s) Probabilistic grammars
(1990s) Statistical methods dominate computational linguistics
(1990s) Computational learning theory

Topics
• Graphical models and Bayes networks
• (Hidden) Markov models
• (Probabilistic) context-free grammars
• (Probabilistic) finite-state machines
• Computation with PCFGs
• Estimation of PCFGs
• Lexicalized and bi-lexicalized PCFGs
• Non-local dependencies and log-linear models
• Stochastic unification-based grammars

Probability distributions
• A probability distribution over a countable set $\Omega$ is a function $P : \Omega \to [0, 1]$ which satisfies $\sum_{\omega \in \Omega} P(\omega) = 1$.
• A random variable is a function $X : \Omega \to \mathcal{X}$.
\[ P(X = x) = \sum_{\omega : X(\omega) = x} P(\omega) \]
• If there are several random variables $X_1, \ldots, X_n$, then:
  – $P(X_1, \ldots, X_n)$ is the joint distribution
  – $P(X_i)$ is the marginal distribution of $X_i$
• $X_1, \ldots, X_n$ are independent iff $P(X_1, \ldots, X_n) = P(X_1) \cdots P(X_n)$, i.e., the joint is the product of the marginals
• The conditional distribution of $X$ given $Y$ is $P(X \mid Y) = P(X, Y) / P(Y)$, so $P(X, Y) = P(Y) P(X \mid Y) = P(X) P(Y \mid X)$ (Bayes rule)
• $X_1, \ldots, X_n$ are conditionally independent given $Y$ iff $P(X_1, \ldots, X_n \mid Y) = P(X_1 \mid Y) \cdots P(X_n \mid Y)$

Bayes inversion and the noisy channel model
Given an acoustic signal $a$, find the words $w^\star(a)$ most likely to correspond to $a$:
\[ w^\star(a) = \arg\max_w P(W = w \mid A = a) \]
\[ P(A) P(W \mid A) = P(W, A) = P(W) P(A \mid W) \]
\[ P(W \mid A) = \frac{P(W) P(A \mid W)}{P(A)} \]
\[ w^\star(a) = \arg\max_w \frac{P(W = w) P(A = a \mid W = w)}{P(A = a)} = \arg\max_w P(W = w) P(A = a \mid W = w) \]
[Figure: noisy channel pipeline: Language model $P(W)$ → Acoustic model $P(A \mid W)$ → Acoustic signal $A$]
Advantages of noisy channel model:
• $P(W \mid A)$ is hard to construct directly; $P(A \mid W)$ is easier
• noisy channel also exploits language model $P(W)$

Why graphical models?
• Graphical models depict factorizations of probability distributions
• Statistical and computational properties depend on the factorization
  – complexity of dynamic programming is size of a certain cut in the graphical model
• Two different (but related) graphical representations
  – Bayes nets (directed graphs; products of conditionals)
  – Markov Random Fields (undirected graphs; products of arbitrary terms)
• Each random variable $X_i$ is represented by a node
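
To make the noisy channel decision rule $w^\star(a) = \arg\max_w P(W = w) P(A = a \mid W = w)$ concrete, here is a minimal Python sketch. The word strings, the signal labels `a1` and `a2`, and all probabilities are invented for illustration and do not come from the slides; the function names `decode` and `posterior` are likewise hypothetical.

```python
# Toy noisy-channel decoder. Every word string, signal label and
# probability below is an invented illustration, not real data.

# Language model P(W)
language_model = {"recognize speech": 0.7, "wreck a nice beach": 0.3}

# Acoustic model P(A | W): for each word string, a distribution over signals
acoustic_model = {
    "recognize speech": {"a1": 0.2, "a2": 0.8},
    "wreck a nice beach": {"a1": 0.8, "a2": 0.2},
}


def decode(a):
    """Return argmax_w P(W=w) * P(A=a | W=w).

    P(A=a) does not depend on w, so it can be dropped from the argmax.
    """
    return max(language_model,
               key=lambda w: language_model[w] * acoustic_model[w][a])


def posterior(a):
    """Return the full posterior P(W | A=a) via Bayes inversion."""
    joint = {w: language_model[w] * acoustic_model[w][a]
             for w in language_model}
    p_a = sum(joint.values())  # P(A=a), obtained by marginalizing over W
    return {w: p / p_a for w, p in joint.items()}


if __name__ == "__main__":
    for signal in ("a1", "a2"):
        print(signal, decode(signal), posterior(signal))
```

Dividing by $P(A = a)$ only rescales the scores, so `decode` and the argmax of `posterior` always agree; the decoder never needs a direct model of $P(W \mid A)$.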

Bayes nets (directed graph)
• Factorize joint $P(X_1, \ldots, X_n)$ into product of conditionals
\[ P(X_1, \ldots, X_n) = \prod_{i=1}^{n} P(X_i \mid X_{\mathrm{Pa}(i)}) \quad \text{where } \mathrm{Pa}(i) \subseteq (1, \ldots, i-1) \]
• The Bayes net contains an arc from each $j \in \mathrm{Pa}(i)$ to $i$
\[ P(X_1, X_2, X_3, X_4) = P(X_1) P(X_2) P(X_3 \mid X_1, X_2) P(X_4 \mid X_3) \]
[Figure: directed graph with arcs $X_1 \to X_3$, $X_2 \to X_3$, $X_3 \to X_4$]

Markov Random Field (undirected)
• Factorize $P(X_1, \ldots, X_n)$ into product of potentials $g_c(X_c)$, where $c \subseteq (1, \ldots, n)$ and $c \in \mathcal{C}$ (a set of tuples of indices)
\[ P(X_1, \ldots, X_n) = \frac{1}{Z} \prod_{c \in \mathcal{C}} g_c(X_c) \]
• If $i, j \in c \in \mathcal{C}$, then an edge connects $i$ and $j$
\[ \mathcal{C} = \{ (1, 2, 3), (3, 4) \} \]
\[ P(X_1, X_2, X_3, X_4) = \frac{1}{Z} \, g_{123}(X_1, X_2, X_3) \, g_{34}(X_3, X_4) \]
[Figure: undirected graph with edges $X_1 - X_2$, $X_1 - X_3$, $X_2 - X_3$, $X_3 - X_4$]

A rose by any other name ...
• MRFs have the same form as Maximum Entropy models, Exponential models, Log-linear models, Harmony models, ...
\[ P(X) = \frac{1}{Z} \prod_{c \in \mathcal{C}} g_c(X_c) \]
\[ \phantom{P(X)} = \frac{1}{Z} \prod_{c \in \mathcal{C},\, x_c \in \mathcal{X}_c} \left( \theta_{X_c = x_c} \right)^{[[X_c = x_c]]}, \quad \text{where } \theta_{X_c = x_c} = g_c(x_c) \]
\[ \phantom{P(X)} = \frac{1}{Z} \exp \sum_{c \in \mathcal{C},\, x_c \in \mathcal{X}_c} [[X_c = x_c]] \, \phi_{X_c = x_c}, \quad \text{where } \phi_{X_c = x_c} = \log g_c(x_c) \]
For the example above:
\[ P(X) = \frac{1}{Z} \, g_{123}(X_1, X_2, X_3) \, g_{34}(X_3, X_4) \]
\[ \phantom{P(X)} = \frac{1}{Z} \exp \left( [[X_{123} = 000]] \, \phi_{000} + [[X_{123} = 001]] \, \phi_{001} + \ldots + [[X_{34} = 00]] \, \phi_{00} + [[X_{34} = 01]] \, \phi_{01} + \ldots \right) \]

Bayes nets and MRFs
• MRFs are more general than Bayes nets
• It's easy to find the MRF representation of a Bayes net
\[ P(X_1, X_2, X_3, X_4) = \underbrace{P(X_1) P(X_2) P(X_3 \mid X_1, X_2)}_{g_{123}(X_1, X_2, X_3)} \; \underbrace{P(X_4 \mid X_3)}_{g_{34}(X_3, X_4)} \]
• Moralization, i.e., "marry the parents"
[Figure: the Bayes net with arcs $X_1 \to X_3$, $X_2 \to X_3$, $X_3 \to X_4$ next to its moralized MRF, which has the same edges undirected plus an added edge $X_1 - X_2$]
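
The correspondence between the four-variable Bayes net, its MRF potentials $g_{123}$ and $g_{34}$, and the log-linear form can be checked numerically. The sketch below assumes binary variables and uses made-up conditional probability tables; the names `bayes_net`, `mrf` and `log_linear` are hypothetical helpers, not anything defined in the slides.

```python
import itertools
import math

# Hypothetical conditional probability tables for binary X1..X4.
p1 = {0: 0.6, 1: 0.4}                      # P(X1 = x1)
p2 = {0: 0.7, 1: 0.3}                      # P(X2 = x2)
p3_is_1 = {(0, 0): 0.1, (0, 1): 0.5,       # P(X3 = 1 | X1, X2)
           (1, 0): 0.6, (1, 1): 0.9}
p4_is_1 = {0: 0.2, 1: 0.8}                 # P(X4 = 1 | X3)


def bern(p_one, x):
    """Probability of binary outcome x when P(x = 1) = p_one."""
    return p_one if x == 1 else 1.0 - p_one


def bayes_net(x1, x2, x3, x4):
    """Joint as a product of conditionals (the directed factorization)."""
    return (p1[x1] * p2[x2]
            * bern(p3_is_1[(x1, x2)], x3)
            * bern(p4_is_1[x3], x4))


# MRF potentials obtained by grouping the conditionals ("marry the parents").
def g123(x1, x2, x3):
    return p1[x1] * p2[x2] * bern(p3_is_1[(x1, x2)], x3)


def g34(x3, x4):
    return bern(p4_is_1[x3], x4)


states = list(itertools.product((0, 1), repeat=4))
# Partition function: sum of the potential product over all configurations.
Z = sum(g123(a, b, c) * g34(c, d) for a, b, c, d in states)


def mrf(x1, x2, x3, x4):
    """Joint as a normalized product of potentials."""
    return g123(x1, x2, x3) * g34(x3, x4) / Z


def log_linear(x1, x2, x3, x4):
    """Same distribution with phi = log g, written as exp(sum of phi) / Z."""
    phi_sum = math.log(g123(x1, x2, x3)) + math.log(g34(x3, x4))
    return math.exp(phi_sum) / Z


if __name__ == "__main__":
    print("Z =", Z)
    for s in states:
        print(s, bayes_net(*s), mrf(*s), log_linear(*s))
```

Because these potentials are built from properly normalized conditionals, $Z$ comes out as 1 here; for arbitrary potentials it would not, which is why the MRF form carries the explicit $1/Z$.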

Conditionalization in MRFs
• Conditionalization is fixing the value of certain variables
• To get an MRF representation of the conditional distribution, delete nodes whose values are fixed and arcs connected to them
\[ P(X_1, X_2, X_4 \mid X_3 = v) = \frac{1}{Z \, P(X_3 = v)} \, g_{123}(X_1, X_2, v) \, g_{34}(v, X_4) = \frac{1}{Z'(v)} \, g'_{12}(X_1, X_2) \, g'_{4}(X_4) \]
[Figure: the MRF over $X_1, X_2, X_3, X_4$ with $X_3 = v$ fixed, next to the resulting graph with the edge $X_1 - X_2$ and an isolated node $X_4$]

Marginalization in MRFs
• Marginalization is summing over all possible values of certain variables
• To get an MRF representation of the marginal distribution, delete the marginalized nodes and interconnect all of their neighbours
\[ P(X_1, X_2, X_4) = \sum_{X_3} P(X_1, X_2, X_3, X_4) = \frac{1}{Z} \sum_{X_3} g_{123}(X_1, X_2, X_3) \, g_{34}(X_3, X_4) = g'_{124}(X_1, X_2, X_4) \]
(the constant $1/Z$ is absorbed into $g'_{124}$)
[Figure: the MRF over $X_1, X_2, X_3, X_4$ next to the marginal MRF in which $X_3$ is removed and $X_1, X_2, X_4$ are fully interconnected]

Classification
• Given value of $X$, predict value of $Y$
• Given a probabilistic model $P(Y \mid X)$, predict:
\[ y^\star(x) = \arg\max_y P(y \mid x) \]
• Learn $P(Y \mid X)$ from data $D = ((x_1, y_1), \ldots, (x_n, y_n))$
• Restrict attention to a parametric model class $P_\theta$ parameterized by parameter vector $\theta$
  – learning is estimating $\theta$ from $D$

ML and CML Estimation
• Maximum likelihood estimation (MLE) picks the $\theta$ that makes the data $D = (x, y)$ as likely as possible
\[ \hat{\theta} = \arg\max_\theta P_\theta(x, y) \]
• Conditional maximum likelihood estimation (CMLE) picks the $\theta$ that maximizes conditional likelihood of the data $D = (x, y)$
\[ \hat{\theta}' = \arg\max_\theta P_\theta(y \mid x) \]
• $P(X, Y) = P(X) P(Y \mid X)$, so CMLE ignores $P(X)$
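
The difference between the two estimators on the ML and CML Estimation slide can be seen in the simplest possible setting. The sketch below assumes an unrestricted tabular model (one free parameter per $(x, y)$ cell), for which both estimates reduce to relative frequencies; the data set and the names `mle_joint`, `cmle_conditional` and `classify` are invented for illustration.

```python
from collections import Counter

# Invented toy data set D = ((x_1, y_1), ..., (x_n, y_n)).
data = [("a", "N"), ("a", "N"), ("a", "V"), ("b", "V"),
        ("b", "V"), ("b", "N"), ("b", "V"), ("a", "N")]


def mle_joint(pairs):
    """MLE of a full joint table P(x, y): relative frequency of each pair."""
    n = len(pairs)
    return {xy: count / n for xy, count in Counter(pairs).items()}


def cmle_conditional(pairs):
    """CMLE of a conditional table P(y | x): count(x, y) / count(x).

    Nothing here estimates P(x); the conditional likelihood ignores it.
    """
    pair_counts = Counter(pairs)
    x_counts = Counter(x for x, _ in pairs)
    return {(x, y): count / x_counts[x]
            for (x, y), count in pair_counts.items()}


def classify(conditional, x):
    """Predict y*(x) = argmax_y P(y | x) from a conditional table."""
    candidates = [y for (x2, y) in conditional if x2 == x]
    return max(candidates, key=lambda y: conditional[(x, y)])


if __name__ == "__main__":
    joint = mle_joint(data)
    cond = cmle_conditional(data)
    print("joint MLE:", joint)
    print("conditional CMLE:", cond)
    print("prediction for x = 'a':", classify(cond, "a"))
```

In this unrestricted tabular case the conditional obtained by normalizing the joint MLE coincides with the CMLE table. With a restricted parametric model class the two estimates generally differ, since MLE also spends parameters on fitting $P(X)$.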