

SLIDE 1

Maximum Entropy Model (I)

LING 572 Advanced Statistical Methods for NLP January 28, 2020

SLIDE 2

MaxEnt in NLP

  • The maximum entropy principle has a long history.
  • The MaxEnt algorithm was introduced to the NLP field by Berger et al. (1996).

  • Used in many NLP tasks: Tagging, Parsing, PP attachment, …

SLIDE 3

Readings & Comments

  • Several readings:
  • (Berger et al., 1996), (Ratnaparkhi, 1997)
  • (Klein & Manning, 2003): Tutorial
  • Note: Some of these are very ‘dense’
  • Don’t spend a huge amount of time on every detail
  • Take a first pass before class, review after lecture
  • Going forward:
  • Techniques more complex
  • Goal: Understand basic model, concepts
  • Training is complex; we’ll discuss, but not implement

SLIDE 4

Notation


                           Input   Output   Pair
Berger et al. (1996)       x       y        (x, y)    ← we use this one
Ratnaparkhi (1997)         b       a        x
Ratnaparkhi (1996)         h       t        (h, t)
Klein and Manning (2003)   d       c        (d, c)

SLIDE 5

Outline

  • Overview
  • The Maximum Entropy Principle
  • Modeling**
  • Decoding
  • Training**
  • Case study: POS tagging

SLIDE 6

Overview

SLIDE 7

Joint vs. Conditional models

  • Given training data {(x, y)}, we want to build a model to predict y for new x’s. For each model, we need to estimate the parameters µ.

  • Joint (aka generative) models estimate P(x, y) by maximizing the likelihood P(X, Y | µ)
  • Ex: n-gram models, HMM, Naïve Bayes, PCFG
  • Choosing weights is trivial: just use relative frequencies (a short sketch follows below)
  • Conditional (aka discriminative) models estimate P(y | x) by maximizing the conditional likelihood P(Y | X, µ)
  • Ex: MaxEnt, SVM, CRF, etc.
  • Computing weights is more complex.
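
To make “choosing weights is trivial” concrete, here is a minimal sketch (the toy data and labels are invented for illustration): the maximum-likelihood estimate of a joint model P(x, y) is just the relative frequency of each (x, y) pair in the training data.

```python
from collections import Counter

# Toy labeled data: (x, y) pairs; purely illustrative
data = [("rain", "weather"), ("sun", "weather"), ("goal", "sports"), ("rain", "weather")]

# MLE for a joint/generative model: relative frequencies of (x, y) pairs
counts = Counter(data)
total = sum(counts.values())
p_joint = {pair: c / total for pair, c in counts.items()}

print(p_joint[("rain", "weather")])   # 0.5 -- 2 of the 4 training pairs
```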

SLIDE 8

Naïve Bayes Model


[Graphical model: class node C with feature nodes f1, f2, …, fn as its children]

Assumption: each fi is conditionally independent of fj given C.

SLIDE 9

The conditional independence assumption

fm and fn are conditionally independent given c: P(fm | c, fn) = P(fm | c)

Counter-examples in the text classification task:
  • P("Manchester" | entertainment) != P("Manchester" | entertainment, "Oscar")

Q: How to deal with correlated features?
A: Many models, including MaxEnt, do not assume that features are conditionally independent.

SLIDE 10

Naïve Bayes highlights

  • Choose c* = arg maxc P(c) ∏k P(fk | c) (a decoding sketch follows below)
  • Two types of model parameters:
  • Class prior: P(c)
  • Conditional probability: P(fk | c)
  • The number of model parameters: |C| + |C|·|V|
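
As a concrete illustration of the decoding rule above, here is a minimal sketch with made-up toy parameters (not estimated from any real corpus); it computes c* in log space to avoid underflow.

```python
import math

# Hypothetical NB parameters, purely for illustration
prior = {"sports": 0.4, "weather": 0.6}                      # P(c)
cond = {("goal", "sports"): 0.3, ("rain", "sports"): 0.05,   # P(f | c)
        ("goal", "weather"): 0.02, ("rain", "weather"): 0.4}

def nb_decode(features):
    # c* = argmax_c P(c) * prod_k P(f_k | c), computed as a sum of logs
    scores = {c: math.log(p_c) + sum(math.log(cond[(f, c)]) for f in features)
              for c, p_c in prior.items()}
    return max(scores, key=scores.get)

print(nb_decode(["goal", "rain"]))   # "sports": 0.4*0.3*0.05 > 0.6*0.02*0.4
```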

SLIDE 11

P(f | c) in NB


       f1          f2          …    fj
c1     P(f1 | c1)  P(f2 | c1)  …    P(fj | c1)
c2     P(f1 | c2)  …           …    …
…      …           …           …    …
ci     P(f1 | ci)  …           …    P(fj | ci)

Each cell is a weight for a particular (class, feat) pair.

SLIDE 12

Weights in NB and MaxEnt

  • In NB
  • P(f | y) are probabilities (i.e., in [0,1])
  • P(f | y) are multiplied at test time
  • In MaxEnt
  • the weights are real numbers: they can be negative.
  • the weighted features are added at test time (see the sketch below)
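
The contrast can be seen in log space: NB also ends up adding weights at test time, but those weights are log P(f | y) and hence never positive, whereas MaxEnt weights are unrestricted. A minimal sketch with made-up numbers:

```python
import math

nb_probs = {"rain": 0.4, "goal": 0.02}                      # P(f | y) for one class y (toy values)
nb_weights = {f: math.log(p) for f, p in nb_probs.items()}  # log-probabilities: always <= 0
maxent_weights = {"rain": 2.0, "goal": -0.3}                # arbitrary reals (hypothetical)

x = ["rain", "goal"]
print(sum(nb_weights[f] for f in x))       # NB score for class y in log space
print(sum(maxent_weights[f] for f in x))   # MaxEnt score contribution for class y
```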

SLIDE 13

Highlights of MaxEnt


Training: to estimate the feature weights λj
Testing: to calculate P(y | x) = exp(∑j λj fj(x, y)) / Z(x), where Z(x) is a normalization factor
fj(x, y) is a feature function, which normally corresponds to a (feature, class) pair.
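
A minimal decoding sketch of this form, with hypothetical feature functions and λj weights invented purely for illustration:

```python
import math

classes = ["sports", "weather"]
# lambda_j for binary feature functions f_j(x, y) that fire when the given word
# occurs in x and the candidate class equals y (hypothetical values)
weights = {("goal", "sports"): 1.2, ("rain", "weather"): 2.0, ("rain", "sports"): -0.3}

def p_y_given_x(x):
    # Unnormalized score per class: exp of the sum of the weighted active features
    scores = {y: math.exp(sum(lam for (word, cls), lam in weights.items()
                              if word in x and cls == y))
              for y in classes}
    z = sum(scores.values())              # normalization factor Z(x)
    return {y: s / z for y, s in scores.items()}

print(p_y_given_x(["rain", "goal"]))      # "weather" gets most of the mass here
```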

SLIDE 14

Main questions

  • What is the maximum entropy principle?
  • What is a feature function?
  • Modeling: Why does P(y|x) have this form?
  • Training: How do we estimate λj?

SLIDE 15

Outline

  • Overview
  • The Maximum Entropy Principle
  • Modeling**
  • Decoding
  • Training*
  • Case study

SLIDE 16

Maximum Entropy Principle

SLIDE 17

Maximum Entropy Principle

  • Intuitively, model all that is known, and assume as little as possible about what is unknown.
  • Related to Occam’s razor and other similar justifications for scientific inquiry
  • Also: Laplace’s Principle of Insufficient Reason: when one has no information to distinguish between the probabilities of two events, the best strategy is to consider them equally likely.

SLIDE 18

Maximum Entropy

  • Why maximum entropy?
  • Maximize entropy = Minimize commitment
  • Model all that is known and assume nothing about what is unknown.
  • Model all that is known: satisfy a set of constraints that must hold
  • Assume nothing about what is unknown: choose the most “uniform” distribution ➔ choose the one with maximum entropy

SLIDE 19

Ex1: Coin-flip example
 (Klein & Manning, 2003)

  • Toss a coin: p(H)=p1, p(T)=p2.
  • Constraint: p1 + p2 = 1
  • Question: what’s p(x)? That is, what is the value of p1?
  • Answer: choose the p that maximizes H(p) = − ∑x p(x) log p(x) (checked numerically below)
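
A quick numerical check of this answer (a minimal sketch; the grid scan simply confirms where the entropy peaks):

```python
import numpy as np

# With only the constraint p1 + p2 = 1, scan p1 and find where H(p) is largest
p1 = np.linspace(0.001, 0.999, 999)
H = -(p1 * np.log2(p1) + (1 - p1) * np.log2(1 - p1))
print(p1[np.argmax(H)])   # 0.5, i.e. p(H) = p(T) = 1/2
```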

SLIDE 20

Ex2: An MT example
 (Berger et al., 1996)


Possible translations for the word "in": {dans, en, à, au cours de, pendant}
Constraint: p(dans) + p(en) + p(à) + p(au cours de) + p(pendant) = 1
Intuitive answer: p(dans) = p(en) = p(à) = p(au cours de) = p(pendant) = 1/5

SLIDE 21

An MT example (cont)


Constraints:
  p(dans) + p(en) + p(à) + p(au cours de) + p(pendant) = 1
  p(dans) + p(en) = 3/10
Intuitive answer: p(dans) = p(en) = 3/20; p(à) = p(au cours de) = p(pendant) = 7/30

SLIDE 22

An MT example (cont)


Constraints:
  p(dans) + p(en) + p(à) + p(au cours de) + p(pendant) = 1
  p(dans) + p(en) = 3/10
  p(dans) + p(à) = 1/2
Intuitive answer: ?? (no longer obvious; see the numerical sketch below)
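
When the intuitive answer is no longer obvious, the maximum-entropy distribution can still be found numerically. Below is a minimal sketch using scipy; the specific constraint values (3/10 and 1/2) follow the reconstruction of the Berger et al. (1996) example above and should be treated as assumptions.

```python
import numpy as np
from scipy.optimize import minimize

words = ["dans", "en", "à", "au cours de", "pendant"]

def neg_entropy(p):
    # Minimize -H(p); the small constant avoids log(0)
    return float(np.sum(p * np.log(p + 1e-12)))

constraints = [
    {"type": "eq", "fun": lambda p: p.sum() - 1.0},      # probabilities sum to 1
    {"type": "eq", "fun": lambda p: p[0] + p[1] - 0.3},  # p(dans) + p(en) = 3/10 (assumed)
    {"type": "eq", "fun": lambda p: p[0] + p[2] - 0.5},  # p(dans) + p(à) = 1/2 (assumed)
]
result = minimize(neg_entropy, x0=np.full(5, 0.2),
                  bounds=[(1e-9, 1.0)] * 5, constraints=constraints)
print(dict(zip(words, result.x.round(3))))
```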

SLIDE 23

Ex3: POS tagging
 (Klein and Manning, 2003)


SLIDE 24

Ex3 (cont)


SLIDE 25

Ex4: Overlapping features
 (Klein and Manning, 2003)


[Table with probabilities p1, p2, p3, p4]

SLIDE 26

Ex4 (cont)


[Table with probabilities p1, p2]

SLIDE 27

Ex4 (cont)


[Table with probability p1]

SLIDE 28

The MaxEnt Principle summary

  • Goal: Among all the distributions that satisfy the constraints, choose the one, p*, that maximizes H(p):

p* = arg maxp∈P H(p)

  • Q1: How to represent constraints? (illustrated below)
  • Q2: How to find such distributions?
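
For Q1, the constraints in MaxEnt are typically expected-value constraints: the model’s expectation of each feature function fj must match its empirical expectation in the training data, Ep[fj] = Ẽ[fj]. A minimal sketch of the empirical side only, with a hypothetical toy corpus and feature:

```python
# Toy training data: (x, y) pairs, purely illustrative
data = [(("rain",), "weather"), (("goal",), "sports"), (("rain",), "weather")]

def f_j(x, y):
    # Hypothetical binary feature: fires when "rain" occurs in x and y == "weather"
    return 1.0 if ("rain" in x and y == "weather") else 0.0

# Empirical expectation E~[f_j]: average value of f_j over the training pairs
empirical_expectation = sum(f_j(x, y) for x, y in data) / len(data)
print(empirical_expectation)   # 2/3 on this toy set
```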
