A Bayesian Approach to Learning the Structure of Human Languages (PowerPoint presentation)


SLIDE 1

A Bayesian Approach to Learning the Structure of Human Languages

Phil Blunsom

University of Oxford

SLIDE 2

Grammar Induction

The proposal would undermine effectiveness managers contend

?

Grammar Induction research pursues two main aims:

  • to produce testable models of human language acquisition,
  • to implement unsupervised parsers capable of reducing the reliance on annotated treebanks in Natural Language Processing.

SLIDE 3

Language Acquisition

The proposal would undermine effectiveness managers contend

?

  • The empirical success or otherwise of weak-bias models of grammar induction impacts on the viability of the Argument from the Poverty of the Stimulus.
  • This contrasts with the strong-bias hypothesis of Universal Grammar.

SLIDE 4

Machine Translation

Grammar Induction for Machine Translation

[Figure: the sentence "I wanted to read this book".]

SLIDE 5

Machine Translation

Learn the syntactic part-of-speech categories of words

[Figure: the tagged sentence I/PRP wanted/Verb to/TO read/Verb this/DT book/Noun.]

SLIDE 6

Machine Translation

Learn the grammatical structure of the sentences

[Figure: a dependency structure over I/PRP wanted/Verb to/TO read/Verb this/DT book/Noun.]

SLIDE 7

Machine Translation

Learn syntactic reorderings from Subject-Verb-Object to Subject-Object-Verb

[Figure: the tagged sentence reordered from Subject-Verb-Object to Subject-Object-Verb order.]

SLIDE 8

Machine Translation

Learn to translate

[Figure: "I wanted to read this book" aligned with its German translation "Ich wollte dieses Buch lesen".]

SLIDE 9

Dependency Grammar Induction

Of/IN course/NN the/DT health/NN of/IN the/DT economy/NN will/MD be/VB threatened/VBN if/IN the/DT market/NN continues/VBZ to/TO dive/VB this/DT week/NN

The Dependency Grammar formalism has provided one of the most promising avenues for this research.

SLIDE 10

Dependency Grammar Induction

Of/IN course/NN the/DT health/NN of/IN the/DT economy/NN will/MD be/VB threatened/VBN if/IN the/DT market/NN continues/VBZ to/TO dive/VB this/DT week/NN

We induce two probabilistic models:

  1. a model of the syntactic part-of-speech categories of the tokens (Noun, Verb, etc.),
  2. a model of the dependency derivations of the text given these syntactic categories.

SLIDE 11

Weak Bias: Power Laws

SLIDE 12

Weak Bias: Pitman-Yor Process Priors

In a Pitman-Yor Process (PYP) unigram language model, words (w_1, ..., w_n) are generated as follows:

G | a, b, P_0 ∼ PYP(a, b, P_0)
w_i | G ∼ G

  • G is a distribution over an infinite set of words,
  • P_0 is the probability that a word will be in the support of G,
  • a and b control the power-law behaviour of the PYP.

One way of understanding the predictions made by the PYP model is through the Chinese Restaurant Process (CRP) ...

SLIDE 13

The Chinese Restaurant Process

[Figure: the first customer, 'the', enters an empty restaurant (n0 = 0).]

Customers (words) enter a restaurant and choose a table, one of the K occupied tables or a new one, according to the distribution:

\[
P(z_i = k \mid w_i = w, \mathbf{z}^{-}) \propto
\begin{cases}
1_{w_k}(w)\,(n^{-}_{k} - a) & 0 \le k < K \\
(Ka + b)\,P_0(w) & k = K \ \text{(new table)}
\end{cases}
\]

where 1_{w_k}(w) is 1 if table k already serves word w (and 0 otherwise), and n^-_k is the number of customers seated at table k.
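To make the seating rule concrete, here is a minimal Python sketch of the choice above (not from the talk; the function names, the toy customer sequence and the values a = 0.5, b = 1.0 are illustrative assumptions):

    import random

    def seating_weights(word, tables, a, b, p0):
        """Unnormalised seating weights for a customer labelled `word`.
        `tables` is a list of [label, count] pairs for the K occupied tables;
        a table already serving `word` gets weight (count - a), and the
        new table gets weight (K*a + b) * p0(word)."""
        K = len(tables)
        weights = [(count - a) if label == word else 0.0
                   for label, count in tables]
        weights.append((K * a + b) * p0(word))   # open a new table
        return weights

    def seat(word, tables, a, b, p0):
        """Sample a table for `word` and update the seating in place."""
        weights = seating_weights(word, tables, a, b, p0)
        k = random.choices(range(len(weights)), weights=weights)[0]
        if k == len(tables):
            tables.append([word, 1])
        else:
            tables[k][1] += 1
        return k

    # toy run over the slides' customer sequence, with a uniform base
    # distribution over a 1,000-word vocabulary
    tables = []
    for w in ["the", "cats", "cats", "the", "the", "meow", "the"]:
        seat(w, tables, a=0.5, b=1.0, p0=lambda _: 1.0 / 1000)
    print(tables)   # e.g. [['the', 3], ['cats', 2], ['the', 1], ['meow', 1]]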

SLIDE 14

The Chinese Restaurant Process

[Figure: 'the' is seated at table 0 (n0 = 1); the next customer, 'cats', enters and may open a new table (n1 = 0). The same seating distribution applies at every step.]

SLIDE 15

The Chinese Restaurant Process

[Figure: 'cats' opens table 1: 'the' (n0 = 1), 'cats' (n1 = 1).]

SLIDE 16

The Chinese Restaurant Process

[Figure: a second 'cats' joins table 1: 'the' (n0 = 1), 'cats' (n1 = 2); the next customer, 'the', enters.]

SLIDE 17

The Chinese Restaurant Process

[Figure: the second 'the' joins table 0: 'the' (n0 = 2), 'cats' (n1 = 2); another 'the' enters and may open a new table (n2 = 0).]

SLIDE 18

The Chinese Restaurant Process

[Figure: the third 'the' opens table 2: 'the' (n0 = 2), 'cats' (n1 = 2), 'the' (n2 = 1); the customer 'meow' enters and may open a new table (n3 = 0).]

SLIDE 19

The Chinese Restaurant Process

[Figure: seating after six customers: 'the' (n0 = 2), 'cats' (n1 = 2), 'the' (n2 = 1), 'meow' (n3 = 1).]

The 7th customer, 'the', enters the restaurant and chooses a table from those already seating 'the', or opens a new table:

P(z_6 = 0 | w_6 = the, z^{-6}) ∝ 2 − a

SLIDE 20

The Chinese Restaurant Process

[Figure: the same seating as above.]

The 7th customer, 'the', may instead join table 2:

P(z_6 = 2 | w_6 = the, z^{-6}) ∝ 1 − a

SLIDE 21

The Chinese Restaurant Process

[Figure: the same seating as above, plus a potential new table (n4 = 0).]

Or the 7th customer, 'the', may open a new table:

P(z_6 = 4 | w_6 = the, z^{-6}) ∝ (4a + b) P_0(the)

SLIDE 22

Outline

  1. Inducing the syntactic categories of words
  2. Inducing the syntactic structure of sentences

SLIDE 23

Unsupervised PoS Tagging

A/DT simple/JJ example/NN   ↔   A/5 simple/6 example/1

Unsupervised part-of-speech tagging aims to learn a partitioning of tokens corresponding to syntactic equivalence classes.

SLIDE 24

Unsupervised PoS Tagging

A/DT simple/JJ example/NN   ↔   A/5 simple/6 example/1

Previous research has followed two paradigms:

  • word class induction, popular for language modelling and Machine
  • Translation. All tokens of a type must have the same class.
  • syntactic models, generally based on HMMs, allow multiple tags per

type and evaluate against an annotated treebank. For both paradigms most models optimise the likelihood of the training corpus, though more recently Bayesian approaches have become popular.

Inducing the syntactic categories of words 10/35

slide-25
SLIDE 25

A Hierarchical Pitman-Yor HMM

[Figure: trigram HMM graphical model: tags t1, t2, t3, ... emit words w1, w2, w3, ..., with transition parameters Tri_ij and emission parameters Em_j.]

t_l | t_{l-1}, t_{l-2}, Tri ∼ Tri_{t_{l-1}, t_{l-2}}
w_l | t_l, Em ∼ Em_{t_l}

SLIDE 26

A Hierarchical Pitman-Yor HMM

[Figure: the same graphical model, with the trigram transition distributions backing off to bigram parameters B_j.]

Tri_ij | a_Tri, b_Tri, B_j ∼ PYP(a_Tri, b_Tri, B_j)
Em_j | a_Em, b_Em ∼ PYP(a_Em, b_Em, Uniform)

SLIDE 27

A Hierarchical Pitman-Yor HMM

[Figure: the full back-off chain: trigram Tri_ij → bigram B_j → unigram Uni → Uniform.]

B_j | a_Bi, b_Bi, Uni ∼ PYP(a_Bi, b_Bi, Uni)
Uni | a_Uni, b_Uni ∼ PYP(a_Uni, b_Uni, Uniform)
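The hierarchy can be read as a stack of CRP restaurants in which each level uses the predictive distribution of the level below as its base distribution. A minimal Python sketch of that back-off (not the talk's implementation; the class, its attributes and the hyperparameter values are assumptions, and the bookkeeping that actually seats customers is omitted):

    from collections import defaultdict

    class PYPRestaurant:
        """Predictive distribution of a single PYP/CRP restaurant."""
        def __init__(self, a, b, base):
            self.a, self.b, self.base = a, b, base   # discount, strength, back-off
            self.counts = defaultdict(int)           # customers per tag
            self.tables = defaultdict(int)           # tables per tag
            self.total_count = 0
            self.total_tables = 0

        def prob(self, tag):
            # customers already seated (minus the discount) plus the back-off
            # term weighted by (a * total_tables + b)
            return ((self.counts[tag] - self.a * self.tables[tag]
                     + (self.a * self.total_tables + self.b) * self.base(tag))
                    / (self.total_count + self.b))

    NUM_TAGS = 50
    uni = PYPRestaurant(0.5, 1.0, base=lambda t: 1.0 / NUM_TAGS)   # Uni -> Uniform
    bi, tri = {}, {}                                 # one restaurant per context

    def transition_prob(tag, prev2, prev1):
        """P(t_l = tag | t_{l-2}, t_{l-1}) via the Tri -> Bi -> Uni chain."""
        b_rest = bi.setdefault(prev1, PYPRestaurant(0.5, 1.0, uni.prob))
        t_rest = tri.setdefault((prev2, prev1),
                                PYPRestaurant(0.5, 1.0, b_rest.prob))
        return t_rest.prob(tag)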

SLIDE 28

Unsupervised PoS Tagging

We perform inference in this model using Gibbs sampling, an MCMC technique:

  • the tagging of one token, conditioned on all others, is considered at each sampling step,
  • we employ a hierarchical Chinese Restaurant analogy in which trigrams are considered as customers sitting at restaurant tables (a minimal sketch of the sampling loop follows).
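A minimal sketch of one such sweep (illustrative only; cond_prob, remove and add are assumed helpers that score a candidate tag at position l and update the restaurant counts, they are not part of the talk):

    import random

    def gibbs_sweep(tags, tagset, cond_prob, remove, add):
        """Resample every token's tag, one at a time, given all the others."""
        for l in range(len(tags)):
            remove(l, tags[l])                           # unseat the old customers
            weights = [cond_prob(l, t) for t in tagset]  # trigram windows x emission
            tags[l] = random.choices(tagset, weights=weights)[0]
            add(l, tags[l])                              # reseat with the new tag
        return tags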

SLIDE 29

A Hierarchical Pitman-Yor HMM

A/DT simple/JJ example/?

[Figure: the restaurant Tri(DT,JJ), with tables serving JJ, JJ, NNP, NN and NNS; the tag of 'example' has been removed and is about to be resampled.]

SLIDE 30

A Hierarchical Pitman-Yor HMM

A/DT simple/JJ example/NN

The probability of seating the sampled tag NN at an existing table of the Tri(DT,JJ) restaurant:

\[
P_{\mathrm{Tri}}(t_l = \mathrm{NN},\, z_l \le \mathrm{tables} \mid z^{-l}, t^{-l}) \propto
\frac{\mathrm{count}^{-}_{(DT,JJ,NN)} - a_{\mathrm{Tri}}\,\mathrm{tables}^{-}_{(DT,JJ,NN)}}{\mathrm{count}^{-}_{(DT,JJ)} + b_{\mathrm{Tri}}}
\]

[Figure: the Tri(DT,JJ) restaurant with tables serving JJ, JJ, NNP, NN and NNS; the new NN customer joins an existing NN table.]

SLIDE 31

A Hierarchical Pitman-Yor HMM

A/DT simple/JJ example/NN

The probability of the NN customer opening a new table in the Tri(DT,JJ) restaurant, with its label drawn from the bigram back-off model:

\[
P_{\mathrm{Tri}}(t_l = \mathrm{NN},\, z_l = \mathrm{tables}+1 \mid z^{-l}, t^{-l}) \propto
\frac{\left(a_{\mathrm{Tri}}\,\mathrm{tables}^{-}_{(DT,JJ)} + b_{\mathrm{Tri}}\right) P_{\mathrm{Bi}}(\mathrm{NN} \mid z^{-l}, t^{-l})}{\mathrm{count}^{-}_{(DT,JJ)} + b_{\mathrm{Tri}}}
\]

[Figure: the Tri(DT,JJ) restaurant with a new table opened for NN.]

SLIDE 32

A Hierarchical Pitman-Yor HMM

A/DT simple/JJ example/NN

The bigram back-off probability of NN joining an existing table of the Bi(JJ) restaurant:

\[
P_{\mathrm{Bi}}(t_l = \mathrm{NN},\, z_l \le \mathrm{tables} \mid z^{-l}, t^{-l}) \propto
\frac{\mathrm{count}^{-}_{(JJ,NN)} - a_{\mathrm{Bi}}\,\mathrm{tables}^{-}_{(JJ,NN)}}{\mathrm{count}^{-}_{(JJ)} + b_{\mathrm{Bi}}}
\]

[Figure: the Tri(DT,JJ) restaurant and the Bi(JJ) back-off restaurant (tables serving JJ, NN, NNP and NNS); the NN customer joins an existing NN table in Bi(JJ).]

SLIDE 33

A Hierarchical Pitman-Yor HMM

A/DT simple/JJ example/NN

The probability of NN opening a new table in the Bi(JJ) restaurant, with its label drawn from the unigram back-off model:

\[
P_{\mathrm{Bi}}(t_l = \mathrm{NN},\, z_l = \mathrm{tables}+1 \mid z^{-l}, t^{-l}) \propto
\frac{\left(a_{\mathrm{Bi}}\,\mathrm{tables}^{-}_{(JJ)} + b_{\mathrm{Bi}}\right) P_{\mathrm{Uni}}(\mathrm{NN} \mid z^{-l}, t^{-l})}{\mathrm{count}^{-}_{(JJ)} + b_{\mathrm{Bi}}}
\]

[Figure: the Tri(DT,JJ) and Bi(JJ) restaurants, with a new NN table opened in Bi(JJ).]

SLIDE 34

A Hierarchical Pitman-Yor HMM

A/DT simple/JJ example/NN

Finally, the unigram probability of NN joining an existing table of the Uni restaurant:

\[
P_{\mathrm{Uni}}(t_l = \mathrm{NN},\, z_l \le \mathrm{tables} \mid z^{-l}, t^{-l}) \propto
\frac{\mathrm{count}^{-}_{\mathrm{NN}} - a_{\mathrm{Uni}}\,\mathrm{tables}^{-}_{\mathrm{NN}}}{\mathrm{count}^{-} + b_{\mathrm{Uni}}}
\]

[Figure: the Tri(DT,JJ), Bi(JJ) and Uni restaurants.]

SLIDE 35

PYP HMM Results

Model                                                 Many-to-1
Dirichlet Bigram HMM (Goldwater and Griffiths 2007)   63.2
Bigram PYP-HMM                                        66.9
Trigram PYP-HMM                                       69.8

SLIDE 36

PYP HMM Results

The hierarchical PYP priors give a substantial improvement in accuracy.

SLIDE 37

PYP HMM Results

Adding trigram conditioning leads to further improvements, countering previous work which has found decreases in performance with the increased complexity of trigrams.

SLIDE 38

PYP HMM Results

Previous research has shown that restricting the model to one class per word type provides the best performance. Next we incorporate this constraint into our PYP model.

SLIDE 39

One tag per type HMM

A/DT simple/? example/NN for/IN a/DT simple/? talk/NN

We modify the sampler to simultaneously sample a single tag assignment for every token corresponding to a word type.

  • we only consider taggings in which all tokens of a type receive the same tag,
  • note that we don't change the model, just what we are sampling.

SLIDE 40

One tag per type HMM

A/DT simple/? example/NN for/IN a/DT simple/? talk/NN

The simultaneous sampling of multiple token tags induces correlations:

  • calculating the probability of the type 'simple' taking the tag JJ requires enumerating all seating configurations after adding the trigram (DT,JJ,NN) twice,
  • the complexity of this calculation grows exponentially in the number of tokens being resampled.

This calculation is intractable, so we approximate it using fractional table counts (a minimal sketch of the type-level sampler follows).
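A rough Python sketch of the resulting type-level move (this shows only the control flow; score_token, add_approx, remove and the exact form of the fractional table counts are assumptions, not the talk's code):

    import math, random

    def resample_type(positions, tagset, score_token, add_approx, remove):
        """Sample one shared tag for every token of a word type."""
        for l in positions:                      # remove all tokens of the type
            remove(l)
        log_scores = []
        for tag in tagset:                       # score each candidate tag
            logp = 0.0
            for l in positions:
                logp += math.log(score_token(l, tag))  # approximate score using
                add_approx(l, tag)                     # fractional table counts
            for l in positions:                  # undo before the next candidate
                remove(l)
            log_scores.append(logp)
        m = max(log_scores)
        weights = [math.exp(s - m) for s in log_scores]
        tag = random.choices(tagset, weights=weights)[0]
        for l in positions:                      # commit the sampled tag
            add_approx(l, tag)
        return tag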

SLIDE 41

Results: 1 class per word type

Model                                                 M-1
Dirichlet Bigram HMM (Goldwater and Griffiths 2007)   63.2
Trigram PYP-HMM                                       69.8
Trigram PYP-1HMM                                      76.0

SLIDE 42

Incorporating Morphological Information

[Figure: the PYP-HMM graphical model with parameters Uni, B_j, Tri_ij and Em_j; so far, words are emitted directly from the tag-specific emission distributions.]

w_l | t_l, Em ∼ Em_{t_l}

SLIDE 43

Incorporating Morphological Information

[Figure: the emission of w3 = 'running' is generated character by character (r, u, n, n, i, n, g) by a tag-specific character bigram model EB.]

Em_j | a_Em, b_Em, EB_j ∼ PYP(a_Em, b_Em, EB_j)
w_{l,k} | w_{l,k-1}, t_l, EB ∼ EB_{t_l, w_{l,k-1}}

SLIDE 44

Incorporating Morphological Information

[Figure: the character bigram model EB backs off to a character unigram model EUni and then to a uniform distribution over characters.]

EB_{j,k} | a_EBi, b_EBi, EUni_j ∼ PYP(a_EBi, b_EBi, EUni_j)
EUni_j | a_EUni, b_EUni ∼ PYP(a_EUni, b_EUni, Uniform)
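Read as a generative story, the emission back-off scores a word character by character under its tag. A small illustrative Python sketch (the start/end-of-word symbols and the helper char_bigram_prob, standing in for the EB model, are assumptions):

    def word_base_prob(word, tag, char_bigram_prob, bos="^", eos="$"):
        """Probability of emitting `word` from `tag` as a product of
        tag-specific character-bigram probabilities."""
        p, prev = 1.0, bos
        for ch in list(word) + [eos]:
            p *= char_bigram_prob(tag, prev, ch)   # EB_{tag, prev}(ch)
            prev = ch
        return p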

SLIDE 45

Results: 1 class per word type

Model                                                 M-1
Dirichlet Bigram HMM (Goldwater and Griffiths 2007)   63.2
Trigram PYP-HMM                                       69.8
Trigram PYP-1HMM                                      76.0
Trigram PYP-1HMM-Morph                                77.5

SLIDE 46

Multilingual Results: CoNLL-X Shared Task

Language     PYP-HMM   PYP-1HMM   PYP-1HMM-Morph   Best pub.
Arabic       57.1      62.7       67.5             -
Bulgarian    67.8      69.7       73.2             -
Czech        62.0      66.3       70.1             -
Danish       69.9      73.9       76.2             66.7
Dutch        66.6      68.7       70.4             67.3
Hungarian    65.9      69.0       73.0             -
Japanese     76.8      81.7       81.9             -
Portuguese   72.1      73.5       78.5             75.3
Spanish      71.6      74.7       78.8             73.2
Swedish      66.6      67.0       68.6             60.6

SLIDE 47

Tag Confusion Matrix

[Figure: tag confusion matrices for the MLE and PYP models over the Penn Treebank tagset (NN, IN, NNP, DT, JJ, NNS, etc.).]

SLIDE 48

Outline

  1. Inducing the syntactic categories of words
  2. Inducing the syntactic structure of sentences

SLIDE 49

Dependency Grammar

The benchmark model for dependency grammar induction is parameterised in terms of p(child|head) multinomial distributions:

[Figure: the dependency tree for "The proposal would undermine effectiveness managers contend" is built head-outward, starting from ROOT.]

SLIDE 50

Dependency Grammar

[Figure: contend/VBP is attached to ROOT.]

SLIDE 51

Dependency Grammar

[Figure: managers/NNS is attached to contend/VBP.]

SLIDE 52

Dependency Grammar

[Figure: would/MD is added to the tree.]

SLIDE 53

Dependency Grammar

[Figure: proposal/NN is added to the tree.]

SLIDE 54

Dependency Grammar

[Figure: The/DT is attached to proposal/NN.]

SLIDE 55

Dependency Grammar

[Figure: undermine/VB is added to the tree.]

SLIDE 56

Dependency Grammar

[Figure: effectiveness/NN is added, completing the tree over "The proposal would undermine effectiveness managers contend".]

SLIDE 57

Dependency Grammar

[Figure: the completed dependency tree for The/DT proposal/NN would/MD undermine/VB effectiveness/NN managers/NNS contend/VBP, rooted at ROOT.]

This basic model was the first to beat a trivial right-attachment baseline.
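Concretely, a candidate tree is scored (ignoring the valence/STOP factors of the full DMV, which are omitted here) as a product of p(child tag | head tag) terms. A hedged Python sketch with illustrative names:

    def tree_prob(heads, tags, p_child_given_head):
        """`heads[i]` is the index of token i's head, or -1 for an attachment
        to ROOT; the score is the product over edges of p(child | head)."""
        p = 1.0
        for child, head in enumerate(heads):
            head_tag = "ROOT" if head == -1 else tags[head]
            p *= p_child_given_head(tags[child], head_tag)
        return p

    # toy usage: "managers contend", with contend/VBP headed by ROOT
    print(tree_prob(heads=[1, -1], tags=["NNS", "VBP"],
                    p_child_given_head=lambda c, h: 0.1))   # dummy multinomial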

SLIDE 58

Dependency Grammar

English syntax often allows a degree of flexibility in how dependents of a verb are organised:

[Figure: the sentence's dependency tree, with one arrangement of the verb's dependents.]

SLIDE 59

Dependency Grammar

[Figure: the same dependencies with the verb's dependents arranged differently.]

SLIDE 60

Dependency Grammar

However, there are definite limitations:

[Figure: two further rearrangements of the same dependents, illustrating these limitations.]

SLIDE 61

Dependency Grammar

Such limitations give strong clues to unsupervised grammar induction systems as to where noun phrases attach:

[Figure: the dependency tree, highlighting where a noun phrase attaches.]

SLIDE 62

Dependency Grammar

[Figure: the same tree, highlighting another noun phrase attachment.]

SLIDE 63

Dependency Grammar

Not all dependency fragments are created equal; we want to learn a preference for some over others:

[Figure: several competing dependency fragments over the tags MD, VBP and NNS.]

SLIDE 64

Model

[Figure: an example elementary tree over MD and VBP.]

Distribution over elementary trees e, given root non-terminal c:

G_c | a_c, b_c, P_cfg ∼ PYP(a_c, b_c, P_cfg(·|c))
e | c ∼ G_c

  • P_cfg(·|c) is the base distribution,
  • a_c is the discount parameter,
  • b_c is the concentration parameter.
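For reference, the predictive probability of an elementary tree e rooted in c then takes the usual PYP form (the counts n_e, t_e, n_c, T_c are notation introduced here for illustration, mirroring the tagging slides: customers and tables for e, and their totals in the restaurant for c):

\[
P(e \mid c) \;=\; \frac{n_e - a_c\, t_e}{n_c + b_c} \;+\; \frac{a_c\, T_c + b_c}{n_c + b_c}\, P_{\mathrm{cfg}}(e \mid c)
\]

so frequently reused tree fragments receive more probability mass than the PCFG base distribution alone would give them.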

SLIDE 65

Base Distribution (Unlexicalised)

Elementary trees → dependency edges → dependents

[Figure: an elementary tree over MD and VBP decomposed first into dependency edges (scored by P_cfg) and then into individual dependents (scored by P_sh).]

SLIDE 66

Inference

We have no observed trees, so we need to reason over the space of latent trees:

  • the space of TSG derivations is exponential,
  • we use a Metropolis-Hastings sampler (Cohn and Blunsom, 2010).

We also sample the hyperparameters a_c, b_c and s_c. Net effect: no free parameters.

SLIDE 67

Results

Directed Attachment Accuracy on WSJ23

Model                                  |w| ≤ 10        |w| ≤ ∞
Attach-Right                           38.4            31.7
EM (Klein 2004)                        46.1            35.9
Viterbi EM (Spitkovsky et al. 2010)    65.3            47.9
TSG-DMV                                65.1 (σ=2.2)    51.5 (σ=2.0)
LexTSG-DMV                             67.7 (σ=1.5)    55.7 (σ=2.0)
Supervised MLE                         84.5            68.8

SLIDE 68

Example Parse

[Figure: the induced parse of "Speculators are calling for a degree of liquidity that is not there in the market".]

SLIDE 69

Putting the two models together

Directed Attachment Accuracy on WSJ23

Model                      |w| ≤ ∞
Attach-Right               31.7
EM (Klein 2004)            35.9
LexTSG-DMV                 55.7
LexTSG-DMV + PYP-1HMM      28.5
LexTSG-DMV + PYP-HMM       35.8
Supervised MLE             68.8

SLIDE 70

Interpretation

The proposal would undermine effectiveness managers contend

?

What contribution does our model make to the language acquisition debate?

  • we provide empirical evidence that dependency grammars are learnable with a weak, non-language-specific bias,
  • our model's bias comes from Pitman-Yor Process priors that favour power-law distributions,
  • such distributions are common across many cognitive processes, not just language.

SLIDE 71

Conclusions

Summary:

  • Bayesian model of part-of-speech and dependency grammar induction
  • hierarchical prior allows expressive back-off
  • learns latent linguistic phenomena

Contributions:

  • state-of-the-art grammar induction
  • empirical evidence for the learnability of dependency grammar
  • unsupervised parse trees approaching a level usable in practical applications

SLIDE 72

The End

Questions?