A Maximum Entropy Model for Part-of-Speech Tagging - PowerPoint PPT Presentation




SLIDE 1

Mawulolo Ameko and Sonia Baee Introduction The probability model Features for POS tagging Testing the Model Error Analysis Comparison with previous work Conclusion Question

A Maximum Entropy Model for Part-of-Speech Tagging

Adwait Ratnaparkhi, 1996 Mawulolo Ameko and Sonia Baee CS 6501-004 - Text Mining Paper Presentation April 12th, 2018

CS6501-004: A Maximum Entropy Model for Part-of-Speech Tagging Mawulolo Ameko and Sonia Baee 1

SLIDE 2

Table of Contents

1. Introduction
2. The probability model
3. Features for POS tagging
4. Testing the Model
5. Error Analysis
6. Comparison with previous work
7. Conclusion
8. Question



SLIDE 4

Background

Many natural language tasks require the accurate assignment of part-of-speech (POS) tags to previously unseen text. Previous use cases for Maximum Entropy (MaxEnt) models include:

  • Language modeling (Lau et al., 1993)
  • Machine translation (Berger et al., 1996)
  • Prepositional phrase attachment (Ratnaparkhi et al., 1995)
  • Word morphology (Della Pietra et al., 1995)



SLIDE 6

Model Formulation

Given a set of histories H and tag contexts T, the probability model is defined over the Cartesian product space H × T as:

Probability Model

$$p(h, t) = \pi \mu \prod_{j=1}^{k} \alpha_j^{f_j(h, t)}$$

where $\pi$ is a normalization constant, $\{\mu, \alpha_1, \dots, \alpha_k\}$ are positive model parameters, and $\{f_1, \dots, f_k\}$ are binary features, i.e. $f_j(h, t) \in \{0, 1\}$.

Likelihood Function

$$L(p) = \prod_{i=1}^{n} p(h_i, t_i) = \prod_{i=1}^{n} \pi \mu \prod_{j=1}^{k} \alpha_j^{f_j(h_i, t_i)}$$
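The product form above is easy to evaluate once the parameters are known: each active binary feature contributes its factor α_j, and inactive features contribute 1. A minimal sketch in Python (not the paper's implementation; `joint_prob` and its arguments are hypothetical names):

```python
def joint_prob(history, tag, alphas, features, pi=1.0, mu=1.0):
    """Evaluate p(h, t) = pi * mu * prod_j alpha_j ** f_j(h, t)
    for binary features f_j(h, t) in {0, 1}."""
    p = pi * mu
    for alpha, f in zip(alphas, features):
        if f(history, tag):  # alpha_j ** 0 == 1 when the feature is inactive
            p *= alpha
    return p

# toy example: a single feature that fires for "-ing" words tagged VBG
f1 = lambda h, t: h["word"].endswith("ing") and t == "VBG"
print(joint_prob({"word": "running"}, "VBG", [2.5], [f1]))  # 2.5
```

For a word where no feature fires, the product collapses to π·μ, which is why only the activated parameters matter.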


SLIDE 7

Equivalent Formulation

Maximum Entropy Formalism

$$H(p) = -\sum_{h \in H,\, t \in T} p(h, t) \log p(h, t) \quad \text{subject to} \quad E f_j = \tilde{E} f_j, \quad 1 \le j \le k$$

where $E f_j$ and $\tilde{E} f_j$ denote the model's feature expectation and the observed expectation from the training data, respectively. Generalized Iterative Scaling (Darroch and Ratcliff, 1972) is used to find the unique combination of parameters that maximizes the log-likelihood.
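Each Generalized Iterative Scaling step rescales α_j by the ratio of observed to model feature expectations, raised to the power 1/C, where C bounds the total feature count per (h, t) pair. A hedged sketch of one update (function and argument names are illustrative, not from the paper):

```python
def gis_update(alphas, observed_exp, model_exp, C):
    """One Generalized Iterative Scaling step:
    alpha_j <- alpha_j * (E~[f_j] / E[f_j]) ** (1 / C)."""
    return [a * (obs / mod) ** (1.0 / C)
            for a, obs, mod in zip(alphas, observed_exp, model_exp)]

# a parameter is left unchanged once the expectations match
print(gis_update([2.0], [0.3], [0.3], C=5))  # [2.0]
```

At convergence every ratio is 1, so the parameters stop moving, which is exactly the constraint $E f_j = \tilde{E} f_j$.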



SLIDE 9

Basic Definition

Feature Definition

For a given word and tag context available in a history:

$$h_i = \{w_i, w_{i+1}, w_{i+2}, w_{i-1}, w_{i-2}, t_{i-1}, t_{i-2}\}$$

$$f_j(h_i, t_i) = \begin{cases} 1 & \text{if } \mathrm{suffix}(w_i) = \text{``ing''} \text{ and } t_i = \text{VBG} \\ 0 & \text{otherwise} \end{cases}$$

The joint distribution of a history h and tag t is determined by the parameters activated by the feature definitions.
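The history and the example feature can be sketched in plain Python; `make_history` and `f_ing_vbg` are hypothetical names and the dictionary keys are an arbitrary encoding, not the paper's code:

```python
def make_history(words, tags, i):
    """h_i = {w_i, w_i+1, w_i+2, w_i-1, w_i-2, t_i-1, t_i-2},
    with None standing in past sentence boundaries."""
    def w(j): return words[j] if 0 <= j < len(words) else None
    def t(j): return tags[j] if 0 <= j < len(tags) else None
    return {"w0": w(i), "w+1": w(i + 1), "w+2": w(i + 2),
            "w-1": w(i - 1), "w-2": w(i - 2),
            "t-1": t(i - 1), "t-2": t(i - 2)}

def f_ing_vbg(h, tag):
    # fires iff suffix(w_i) == "ing" and t_i == VBG
    return 1 if (h["w0"] or "").endswith("ing") and tag == "VBG" else 0

h = make_history(["the", "dog", "is", "running"], ["DT", "NN", "VBZ"], 3)
print(f_ing_vbg(h, "VBG"))  # 1
```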


SLIDE 10

Specific Feature Definitions

Table: Feature Templates

Condition         Features
w_i is not rare   w_i = X & t_i = T
w_i is rare       X is a prefix of w_i, |X| <= 4 & t_i = T
                  X is a suffix of w_i, |X| <= 4 & t_i = T
                  w_i contains a number & t_i = T
                  w_i contains an uppercase character & t_i = T
                  w_i contains a hyphen & t_i = T
for all w_i       t_{i-1} = X & t_i = T
                  t_{i-2} t_{i-1} = X & t_i = T
                  w_{i-1} = X & t_i = T
                  w_{i-2} = X & t_i = T
                  w_{i+1} = X & t_i = T
                  w_{i+2} = X & t_i = T
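The templates above can be instantiated as feature strings for one (history, tag) pair. The sketch below is one possible encoding, not the paper's code; the paper treats a word as rare if it appears fewer than 5 times in training, and all function and key names here are illustrative:

```python
def extract_features(h, tag, rare):
    """Instantiate the feature templates for one (history, tag) pair.
    `rare` marks words seen fewer than 5 times in the training data."""
    feats = []
    w = h["w0"]
    if not rare:
        feats.append(f"w0={w}&t={tag}")
    else:
        for n in range(1, min(4, len(w)) + 1):  # prefixes/suffixes up to length 4
            feats.append(f"prefix={w[:n]}&t={tag}")
            feats.append(f"suffix={w[-n:]}&t={tag}")
        if any(c.isdigit() for c in w):
            feats.append(f"hasnum&t={tag}")
        if any(c.isupper() for c in w):
            feats.append(f"hasupper&t={tag}")
        if "-" in w:
            feats.append(f"hashyphen&t={tag}")
    feats.append(f"t-1={h['t-1']}&t={tag}")
    feats.append(f"t-2t-1={h['t-2']},{h['t-1']}&t={tag}")
    for k in ("w-1", "w-2", "w+1", "w+2"):
        feats.append(f"{k}={h[k]}&t={tag}")
    return feats
```

Rare words fall back to spelling features (affixes, digits, case, hyphens), since their identity alone is too sparse to estimate a reliable parameter.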



SLIDE 12

Testing the Model

  • Uses beam search as the search algorithm, with beam size N = 5
  • Uses a tag dictionary to restrict candidate tags for words seen in the training set
  • Assigns equal probability to all tags for unseen words
  • The test corpus is tagged one sentence at a time
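The search procedure above can be sketched as follows; `tags_for` stands in for the tag-dictionary lookup and `prob` for the conditional model, both hypothetical interfaces rather than the paper's implementation:

```python
import math

def beam_tag(words, tags_for, prob, beam=5):
    """Beam search over tag sequences: keep only the `beam` highest-scoring
    partial sequences at each position instead of all |T|**n sequences.
    tags_for(word) returns candidate tags (the tag dictionary for seen
    words, or every tag for unseen ones); prob(i, partial, t) returns the
    model's conditional probability of tag t at position i."""
    hyps = [((), 0.0)]  # (partial tag sequence, log-probability)
    for i, word in enumerate(words):
        expanded = [(seq + (t,), lp + math.log(prob(i, seq, t)))
                    for seq, lp in hyps for t in tags_for(word)]
        hyps = sorted(expanded, key=lambda x: x[1], reverse=True)[:beam]
    return list(hyps[0][0])
```

Because each history only looks two tags back, a beam of N = 5 loses little compared to exhaustive search while keeping tagging fast.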


SLIDE 13

Experiments

To conduct the tagging experiments, the Wall Street Journal data was split into three contiguous sections.

Table: WSJ Data Sizes

Dataset      Sentences  Words    Unknown Words
Training     40000      962687   —
Development  8000       192826   6107
Test         5485       133805   —


SLIDE 14

Experiments-Results

Table: Baseline Performance on the Development Set

                   Total Word Accuracy  Unknown Word Accuracy  Sentence Accuracy
Tag Dictionary     96.43%               86.32%                 47.55%
No Tag Dictionary  96.31%               86.28%                 47.38%

Error analysis reveals some "difficult words".

Table: Top Tagging Mistakes on the Training Set for the Baseline Model

Word   Correct Tag  Model Tag  Frequency
about  RB           IN         393
that   DT           IN         389
more   RBR          IN         389
up     IN           RB         187



SLIDE 16

Specialized Features and Consistency


SLIDE 17

Specialized Features

Thankfully, the Maximum Entropy model allows arbitrary binary-valued features on the context.

Specialized Feature Definition

For a given word and tag context available in a history:

$$h_i = \{w_i, w_{i+1}, w_{i+2}, w_{i-1}, w_{i-2}, t_{i-1}, t_{i-2}\}$$

$$f_j(h_i, t_i) = \begin{cases} 1 & \text{if } w_i = \text{``about''}, \; t_{i-2} t_{i-1} = \text{DT NNS}, \text{ and } t_i = \text{IN} \\ 0 & \text{otherwise} \end{cases}$$
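As a sketch, a specialized feature is just another binary predicate on the history, only with a much more specific firing condition (names and dictionary keys are hypothetical):

```python
def f_about_special(h, tag):
    # fires only when w_i == "about", t_{i-2} t_{i-1} == DT NNS, and t_i == IN
    return 1 if (h["w0"] == "about"
                 and (h["t-2"], h["t-1"]) == ("DT", "NNS")
                 and tag == "IN") else 0

print(f_about_special({"w0": "about", "t-2": "DT", "t-1": "NNS"}, "IN"))  # 1
```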


SLIDE 18

Specialized Features

Specialized features are constructed for ”difficult” words:

Table: Performance of the Baseline Model with Specialized Features

Number of "Difficult" Words  Development Set Performance
29                           96.49%

Table: Errors on "Difficult" Words, Baseline vs. Specialized Model (Training Set)

Word   Baseline Model Errors  Specialized Model Errors
that   246                    207
up     186                    169
about  110                    120



SLIDE 20

Specialized Features and Consistency

Consistency test:



SLIDE 22

Specialized Features and Consistency

Performance of the baseline and specialized models when tested on the consistent subset of the development set:

Table: Results on the Consistent Subset

Training Size (words)  Test Size (words)  Baseline  Specialized
571190                 44478              97.04%    97.13%

The marginal improvement of +0.1% may imply that the features remain impoverished, or that intra-annotator inconsistency persists.



SLIDE 24

Comparison with previous work

All previous models achieve an accuracy of about 96.5% overall and about 85% on unseen words. The proposed model, however, combines the merits of each.

Table: Comparison

Model         Probabilistic  Rich Representation  Independence  Data Fragmentation
SDT           +1             +1                   +1            -1
Markov Model  +1             +1                   -1            +1
TBL           -1             +1                   +1            +1
MaxEnt        +1             +1                   +1            +1



SLIDE 26

Conclusion

  • The Maximum Entropy model is an extremely flexible technique for linguistic modeling.
  • The implementation in this paper is a state-of-the-art POS tagger, as evidenced by its 96.6% accuracy on the test set.
  • This performance is close to the current state-of-the-art of roughly 97%.



SLIDE 28

Question

Questions?
