Natural Language Processing: Part of Speech Tagging and Named Entity Recognition
Alessandro Moschitti & Olga Uryupina, Department of Information and Communication Technology, University of Trento
Email: moschitti@disi.unitn.it


SLIDE 1

Natural Language Processing

Alessandro Moschitti & Olga Uryupina

Department of Information and Communication Technology, University of Trento

Email: moschitti@disi.unitn.it, uryupina@gmail.com

Part of Speech Tagging and Named Entity Recognition

SLIDE 2

NLP: why?

's (pd) . 2013 a abolish ally also an and as at became been berlusconi center-right chamber changing clear constitutional cumbersome democratic elected elections ended ensure for forza government had have he him important in institutional it italia italian its lawmaking leader less lost make matteo minister more of on pact party prime priority reforms renzi rules ruling said senate silvio since stable the to voting wants wednesday when winner with

SLIDE 3

NLP: why?

Italian Prime Minister Matteo Renzi lost an important ally on Wednesday when Silvio Berlusconi's center-right Forza Italia party said it had ended its pact with him on institutional and constitutional reforms. Changing voting rules to ensure a clear winner at elections and more stable government have been a priority for Renzi since he became leader of the ruling Democratic Party (PD) in 2013. He also wants to abolish the Senate as an elected chamber to make lawmaking less cumbersome.

SLIDE 4

NLP: why?

Texts are objects with an inherently complex structure. A simple bag-of-words (BoW) model is not good enough for text understanding. Natural Language Processing provides models that go deeper to uncover the meaning.

Part-of-speech tagging, NER
Syntactic analysis
Semantic analysis
Discourse structure
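The BoW claim above can be illustrated directly: reducing the Renzi paragraph to a bag of words reproduces the word salad shown two slides earlier, with all sentence structure gone. A minimal sketch:

```python
# A minimal bag-of-words sketch: lowercase, tokenize, keep sorted unique
# tokens. The result is the alphabetical word list of the earlier slide;
# word order (and hence most of the meaning) is lost.
import re

text = ("Italian Prime Minister Matteo Renzi lost an important ally on "
        "Wednesday when Silvio Berlusconi's center-right Forza Italia "
        "party said it had ended its pact with him.")

tokens = re.findall(r"[a-z'-]+", text.lower())
bow = sorted(set(tokens))
print(bow[:6])   # alphabetical fragments, no sentence structure left
```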

SLIDE 5

Upcoming lectures & labs

Part-of-speech tagging, NER
Parsing
Coreference
Using Tree Kernels for Syntactic/Semantic modeling
Question Answering with NLP
Pipelines and complex architectures
Neural Nets for NLP tasks

SLIDE 6

Labs

New repository with all the upcoming labs' material: https://github.com/mnicosia/anlpir-2016
Please download the current lab's material before the lab!

SLIDE 7

Parts of Speech

8 traditional parts of speech for Indo-European languages

Noun, verb, adjective, preposition, adverb, article, interjection, pronoun, conjunction, etc.

Around for over 2000 years (Dionysius Thrax of Alexandria, c. 100 B.C.)

Called: parts of speech, lexical categories, word classes, morphological classes, lexical tags, POS

SLIDE 8

POS examples for English

N     noun         chair, bandwidth, pacing
V     verb         study, debate, munch
ADJ   adjective    purple, tall, ridiculous
ADV   adverb       unfortunately, slowly
P     preposition  of, by, to
PRO   pronoun      I, me, mine
DET   determiner   the, a, that, those
CONJ  conjunction  and, or

SLIDE 9

Open vs. Closed classes

Closed:
  determiners: a, an, the
  pronouns: she, he, I
  prepositions: on, under, over, near, by, …

Open:
  Nouns, Verbs, Adjectives, Adverbs.

SLIDE 10

Open Class Words

Nouns

Proper nouns (Penn, Philadelphia, Davidson): English capitalizes these.
Common nouns (the rest): count nouns and mass nouns.
  Count nouns have plurals and get counted: goat/goats, one goat, two goats.
  Mass nouns don't get counted (snow, salt, communism) (*two snows).

Adjectives/Adverbs: tend to modify nouns/verbs

Unfortunately, John walked home extremely slowly yesterday.
  Directional/locative adverbs (here, home, downhill)
  Degree adverbs (extremely, very, somewhat)
  Manner adverbs (slowly, slinkily, delicately)

Verbs

In English, have morphological affixes (eat/eats/eaten).

SLIDE 11

Closed Class Words

Differ more from language to language than open class words

Examples:

prepositions: on, under, over, …
particles: up, down, on, off, …
determiners: a, an, the, …
pronouns: she, who, I, …
conjunctions: and, but, or, …
auxiliary verbs: can, may, should, …
numerals: one, two, three, third, …

SLIDE 12

Prepositions from CELEX

SLIDE 13

Conjunctions

SLIDE 14

Auxiliaries

SLIDE 15

POS Tagging: Choosing a Tagset

There are many parts of speech and potential distinctions we can draw.

To do POS tagging, we need to choose a standard set of tags to work with.

Could pick very coarse tagsets: N, V, Adj, Adv.

A more commonly used set is finer grained: the "Penn TreeBank tagset", 45 tags (PRP$, WRB, WP$, VBG, …).

Even more fine-grained tagsets exist.
The "Universal" tagset.
Task-specific tagsets (e.g. for Twitter).

SLIDE 16

Penn TreeBank POS Tagset

SLIDE 17

Using the Penn Tagset

The/DT grand/JJ jury/NN commented/VBD on/IN a/DT number/NN of/IN other/JJ topics/NNS ./.

Prepositions and subordinating conjunctions are marked IN ("although/IN I/PRP…").

Except the preposition/complementizer "to", which is just marked "TO".

SLIDE 18

Deciding on the correct part of speech can be difficult even for people

Mrs/NNP Shaefer/NNP never/RB got/VBD around/RP to/TO joining/VBG

All/DT we/PRP gotta/VBN do/VB is/VBZ go/VB around/IN the/DT corner/NN

Chateau/NNP Petrus/NNP costs/VBZ around/RB 250/CD

SLIDE 19

POS Tagging: Definition

The process of assigning a part-of-speech or lexical class marker to each word in a corpus:

the koala put the keys on the table

(Figure: each WORD is linked to one of the TAGS N, V, P, DET.)

SLIDE 20

POS Tagging example

WORD   TAG
the    DET
koala  N
put    V
the    DET
keys   N
on     P
the    DET
table  N
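The table above can be reproduced with a toy lookup tagger. The lexicon below is hypothetical, covers only this sentence, and ignores ambiguity, which the following slides address:

```python
# Hypothetical toy lexicon for the koala example; real taggers use large
# dictionaries plus context to resolve ambiguity.
LEXICON = {"the": "DET", "koala": "N", "put": "V",
           "keys": "N", "on": "P", "table": "N"}

def lookup_tag(sentence):
    """Tag each word by dictionary lookup (no ambiguity handling)."""
    return [(w, LEXICON.get(w, "N")) for w in sentence.split()]

print(lookup_tag("the koala put the keys on the table"))
```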

SLIDE 21

POS Tagging

Words often have more than one POS: back

The back door = JJ
On my back = NN
Win the voters back = RB
Promised to back the bill = VB

The POS tagging problem is to determine the POS tag for a particular instance of a word.

SLIDE 22

How Hard is POS Tagging? Measuring Ambiguity

SLIDE 23

How difficult is POS tagging?

About 11% of the word types in the Brown corpus are ambiguous with regard to part of speech.

But they tend to be very common words: 40% of the word tokens are ambiguous.

SLIDE 24

Rule-Based Tagging

Start with a dictionary.
Assign all possible tags to words from the dictionary.
Write rules by hand to selectively remove tags, leaving the correct tag for each word.

SLIDE 25

Start With a Dictionary

  • she: PRP
  • promised: VBN, VBD
  • to: TO
  • back: VB, JJ, RB, NN
  • the: DT
  • bill: NN, VB
  • Etc… for the ~100,000 words of English with more than 1 tag

SLIDE 26

Assign Every Possible Tag and apply rules

She  promised  to  back  the  bill
PRP  VBN       TO  VB    DT   NN
     VBD           JJ         VB
                   RB
                   NN


SLIDE 28

Assign Every Possible Tag and apply rules

She  promised  to  back  the  bill
PRP  VBN       TO  VB    DT   NN
     VBD           JJ
                   RB
                   NN
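The elimination step shown on these slides can be sketched in code. The dictionary mirrors the one on slide 25; the two rules are simplified stand-ins for a real hand-written rule set:

```python
# A sketch of rule-based pruning over the tag lattice above. The lattice
# and the two rules are illustrative, not a real tagger's rule set.
LATTICE = {"She": ["PRP"], "promised": ["VBN", "VBD"], "to": ["TO"],
           "back": ["VB", "JJ", "RB", "NN"], "the": ["DT"],
           "bill": ["NN", "VB"]}

def apply_rules(words):
    tags = {w: list(LATTICE[w]) for w in words}
    for prev, cur in zip(words, words[1:]):
        # Rule: after "to", keep only the base-form verb reading.
        if "TO" in tags[prev] and "VB" in tags[cur]:
            tags[cur] = ["VB"]
        # Rule: a determiner is followed by a noun, not a verb.
        if tags[prev] == ["DT"] and "NN" in tags[cur]:
            tags[cur] = ["NN"]
    return tags

result = apply_rules(["She", "promised", "to", "back", "the", "bill"])
print(result["back"], result["bill"])   # ['VB'] ['NN']
```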

SLIDE 29

Simple Statistical Approaches: Idea 1

SLIDE 30

Simple Statistical Approaches: Idea 2

For a string of words W = w1 w2 w3 … wn, find the string of POS tags T = t1 t2 t3 … tn which maximizes P(T|W),

i.e., the probability of tag string T given that the word string was W,

i.e., that W was tagged T.

SLIDE 31

The Sparse Data Problem

A simple, impossible approach to compute P(T|W): count up instances of the string "heat oil in a large pot" in the training corpus, and pick the most common tag assignment to the string.

SLIDE 32

A Practical Statistical Tagger
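The slide's formula survives only as an image; a reconstruction under the standard noisy-channel formulation (Bayes' rule, with the denominator P(W) constant over T) is:

```latex
\hat{T} = \arg\max_{T} P(T \mid W)
        = \arg\max_{T} \frac{P(W \mid T)\, P(T)}{P(W)}
        = \arg\max_{T} P(W \mid T)\, P(T)
```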

SLIDE 33

A Practical Statistical Tagger II

But we can't accurately estimate more than tag bigrams or so… Again, we change to a model that we CAN estimate:
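The model "we CAN estimate" is, reconstructed under the standard bigram (HMM) assumptions that each tag depends only on the previous tag and each word only on its own tag (t0 being a sentence-start symbol):

```latex
P(T) \approx \prod_{i=1}^{n} P(t_i \mid t_{i-1}), \qquad
P(W \mid T) \approx \prod_{i=1}^{n} P(w_i \mid t_i)
```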

SLIDE 34

A Practical Statistical Tagger III

So, for a given string W = w1w2w3…wn, the tagger needs to find the string of tags T which maximizes
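The maximized quantity, reconstructed under the same bigram assumptions, is:

```latex
\hat{T} = \arg\max_{T} \prod_{i=1}^{n} P(t_i \mid t_{i-1})\, P(w_i \mid t_i)
```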

SLIDE 35

Training and Performance

To estimate the parameters of this model, given an annotated training corpus, use relative-frequency counts of tag bigrams and word-tag pairs.

Because many of these counts are small, smoothing is necessary for best results.

Such taggers typically achieve about 95-96% correct tagging, for tag sets of 40-80 tags.
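A miniature end-to-end version of such a tagger can be sketched as follows. The two-sentence "corpus" is invented, smoothing is plain add-one, and the decoder is Viterbi; this is an illustration of the model above, not the cited taggers' actual implementation:

```python
# Relative-frequency counts C(t_{i-1}, t_i) and C(t, w) with add-one
# smoothing, decoded with Viterbi. Toy corpus for illustration only.
from collections import defaultdict

corpus = [
    [("the", "DT"), ("dog", "NN"), ("runs", "VBZ")],
    [("a", "DT"), ("dog", "NN"), ("barks", "VBZ")],
]

trans = defaultdict(lambda: defaultdict(int))   # C(t_{i-1}, t_i)
emit = defaultdict(lambda: defaultdict(int))    # C(t, w)
for sent in corpus:
    prev = "<s>"
    for word, tag in sent:
        trans[prev][tag] += 1
        emit[tag][word] += 1
        prev = tag

tags = list(emit)
vocab = {w for t in emit for w in emit[t]}

def p_trans(prev, tag):
    c = trans[prev]
    return (c[tag] + 1) / (sum(c.values()) + len(tags))       # add-one

def p_emit(tag, word):
    c = emit[tag]
    return (c[word] + 1) / (sum(c.values()) + len(vocab) + 1)  # add-one

def viterbi(words):
    """Return the most probable tag sequence under the bigram model."""
    best = {t: (p_trans("<s>", t) * p_emit(t, words[0]), [t]) for t in tags}
    for w in words[1:]:
        new = {}
        for t in tags:
            p, path = max(
                (best[s][0] * p_trans(s, t), best[s][1]) for s in tags)
            new[t] = (p * p_emit(t, w), path + [t])
        best = new
    return max(best.values())[1]

print(viterbi(["the", "dog", "barks"]))   # ['DT', 'NN', 'VBZ']
```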

SLIDE 36

Assigning tags to unseen words

Pretend that each unknown word is ambiguous among all possible tags, with equal probability.

Assume that the probability distribution of tags over unknown words is like the distribution of tags over words seen only once.

Morphological clues.

A combination of the above.
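Morphological clues can be sketched as suffix heuristics of the kind real taggers combine with the tag distribution over rare words. The rules below are illustrative only:

```python
# Suffix heuristics as a fallback for unseen words; the rule list is
# a toy illustration, not a tagger's real morphological model.
def guess_tag(word):
    if word[:1].isupper():
        return "NNP"   # capitalized: likely proper noun
    if word.endswith("ing"):
        return "VBG"
    if word.endswith("ed"):
        return "VBD"
    if word.endswith("ly"):
        return "RB"
    if word.endswith("s"):
        return "NNS"
    return "NN"

print([guess_tag(w) for w in ["Trento", "gerrymandering", "blorked"]])
```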

SLIDE 37

Sequence Labeling as Classification

Classify each token independently, but use information about the surrounding tokens as input features (sliding window). Applying the classifier to each position in turn yields one tag per token:

John/NNP saw/VBD the/DT saw/NN and/CC decided/VBD to/TO take/VB it/PRP to/IN the/DT table/NN .

Note that the two occurrences of "saw" (VBD vs. NN) and of "to" (TO vs. IN) receive different tags thanks to the surrounding context.
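The sliding-window feature extraction described above can be sketched as follows; the feature names and window size are illustrative, not a specific system's feature set:

```python
# Sliding-window features for token classification: each token is
# classified from itself plus its neighbours, so the two "saw"s end
# up with different feature vectors.
def window_features(tokens, i, size=1):
    feats = {"word": tokens[i].lower(),
             "is_capitalized": tokens[i][0].isupper()}
    for d in range(1, size + 1):
        feats[f"prev_{d}"] = tokens[i - d].lower() if i - d >= 0 else "<s>"
        feats[f"next_{d}"] = tokens[i + d].lower() if i + d < len(tokens) else "</s>"
    return feats

tokens = "John saw the saw".split()
print(window_features(tokens, 1))   # first "saw": neighbours john/the
print(window_features(tokens, 3))   # second "saw": neighbours the/</s>
```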

SLIDE 49

Sequence Labeling as Classification: Using Outputs as Inputs

Better input features are usually the categories of the surrounding tokens, but these are not available yet.

Can use the category of either the preceding or succeeding tokens by going forward or backward and using the previous output.
SLIDE 50

SVMs for tagging

http://www.lsi.upc.edu/~nlp/SVMTool/ (SVMTool.v1.4.ps)

We can use SVMs in a similar way.
We can use a window around the word.
97.16% on WSJ.

SLIDE 51

SVMs for tagging

From Gimenez & Marquez

SLIDE 52

No sequence modeling

SLIDE 53

Evaluation

So once you have your POS tagger running, how do you evaluate it?

Overall error rate with respect to a gold-standard test set.
Error rates on particular tags.
Error rates on particular words.
Tag confusions...

SLIDE 54

Evaluation

The result is compared with a manually coded "Gold Standard".

Typically accuracy reaches 96-97%.
This may be compared with the result for a baseline tagger (one that uses no context).

Important: 100% is impossible even for human annotators.

SLIDE 55

Error Analysis

Look at a confusion matrix.
See what errors are causing problems:

Noun (NN) vs. Proper Noun (NNP) vs. Adjective (JJ)
Past tense verb (VBD) vs. Participle (VBN) vs. Adjective (JJ)
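A confusion matrix is just a count table over (gold, predicted) tag pairs. The tag lists below are invented to mirror the typical confusions listed above:

```python
# Confusion matrix and accuracy from parallel gold/predicted tag lists.
# The data is invented for illustration.
from collections import Counter

gold = ["NN", "NN", "JJ", "VBD", "VBN", "NNP"]
pred = ["NN", "JJ", "JJ", "VBN", "VBN", "NN"]

confusion = Counter(zip(gold, pred))             # (gold, pred) -> count
accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)
print(confusion[("NN", "JJ")], round(accuracy, 2))   # 1 0.5
```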

SLIDE 56

Named Entity Recognition

SLIDE 57

Linguistically Difficult Problem

The NE task involves identification of proper names in texts, and their classification into a set of predefined categories of interest.

Three universally accepted categories: person, location and organisation.

Other common tasks: recognition of date/time expressions, measures (percent, money, weight etc.), email addresses etc.

Other domain-specific entities: names of drugs, medical conditions, names of ships, bibliographic references etc.

SLIDE 58

Applications of NER

Yellow pages with local search capabilities
Monitoring trends and sentiment in textual social media
Interactions between genes and cells in biology and genetics

SLIDE 59

Problems in NE Task Definition

Category definitions are intuitively quite clear, but there are many grey areas.

Many of these grey areas are caused by metonymy.

Organisation vs. Location: "England won the World Cup" vs. "The World Cup took place in England".

Company vs. Artefact: "shares in MTV" vs. "watching MTV".

Location vs. Organisation: "she met him at Heathrow" vs. "the Heathrow authorities".

SLIDE 60

NE System Architecture

(Architecture diagram; components: documents, tokeniser, gazetteer, NE grammar, NEs.)

SLIDE 61

Approach (cont'd)

Again text categorization: n-grams in a window centered on the NE. Features similar to POS tagging:

Gazetteer
Capitalized
Beginning of the sentence
Is it all capitalized
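These features can be sketched for a single token as follows; the gazetteer is a hypothetical stand-in for a real location list:

```python
# Sketch of the NER features listed above for one token. The gazetteer
# contents are invented for illustration.
GAZETTEER = {"trento", "london", "heathrow"}

def ner_features(tokens, i):
    w = tokens[i]
    return {
        "in_gazetteer": w.lower() in GAZETTEER,
        "is_capitalized": w[0].isupper(),
        "sentence_start": i == 0,
        "all_caps": w.isupper(),
    }

print(ner_features(["She", "met", "him", "at", "Heathrow"], 4))
```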

SLIDE 62

Approach (cont'd)

NE task in two parts:

Recognising the entity boundaries
Classifying the entities into the NE categories

Tokens in text are often coded with the IOB scheme:

O – outside, B-XXX – first word in NE, I-XXX – all other words in NE

Easy to convert to/from inline MUC-style markup:

Argentina B-LOCATION
played O
with O
Del B-PERSON
Bosque I-PERSON
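Decoding IOB tags back into entity spans, for the example above, can be done with a small scan; a minimal sketch:

```python
# Convert parallel token/IOB-tag lists into (label, text) entity spans.
def iob_to_spans(tokens, tags):
    spans, current = [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                spans.append(current)
            current = (tag[2:], [tok])          # start a new entity
        elif tag.startswith("I-") and current and current[0] == tag[2:]:
            current[1].append(tok)              # continue the entity
        else:                                   # O tag: close any entity
            if current:
                spans.append(current)
            current = None
    if current:
        spans.append(current)
    return [(label, " ".join(words)) for label, words in spans]

tokens = ["Argentina", "played", "with", "Del", "Bosque"]
tags = ["B-LOCATION", "O", "O", "B-PERSON", "I-PERSON"]
print(iob_to_spans(tokens, tags))
# [('LOCATION', 'Argentina'), ('PERSON', 'Del Bosque')]
```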

SLIDE 63

Feature types

Word-level features
List lookup features
Document & corpus features

SLIDE 64

Word-level features

SLIDE 65

List ¡lookup ¡features ¡

Exact ¡match ¡vs. ¡flexible ¡match ¡ ¡Stems ¡(remove ¡inflecHonal ¡and ¡derivaHonal ¡suffixes) ¡ ¡ ¡Lemmas ¡(remove ¡inflecHonal ¡suffixes ¡only) ¡ ¡Small ¡lexical ¡variaHons ¡(small ¡edit ¡distance) ¡ ¡Normalize ¡words ¡to ¡their ¡Soundex ¡codes ¡
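A simplified Soundex sketch (it omits the full algorithm's H/W adjacency rule) shows how spelling variants map to the same code:

```python
# Simplified Soundex: keep the first letter, map consonants to digits,
# drop vowels/h/w/y, collapse adjacent duplicate digits, pad to 4 chars.
CODES = {c: d for d, letters in
         enumerate(["bfpv", "cgjkqsxz", "dt", "l", "mn", "r"], start=1)
         for c in letters}

def soundex(word):
    word = word.lower()
    digits = [str(CODES.get(c, 0)) for c in word]
    out = word[0].upper()
    prev = digits[0]
    for d in digits[1:]:
        if d != "0" and d != prev:
            out += d
        prev = d
    return (out + "000")[:4]

print(soundex("Robert"), soundex("Rupert"))   # R163 R163
```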

SLIDE 66

Document and corpus features

SLIDE 67

Examples of uses of document and corpus features

Meta-information (e.g. names in email headers)

Multiword entities that do not contain rare lowercase words of a relatively long size are candidate NEs

Frequency of a word (e.g. Life) divided by its frequency in case-insensitive form
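The last feature can be computed directly from corpus counts; the toy corpus below is invented:

```python
# Capitalization-ratio feature: how often a word form appears as-is,
# relative to all case-insensitive occurrences. Words that behave like
# names (e.g. "Life" the magazine) score high.
from collections import Counter

corpus = "Life magazine said life in Trento is good , Life said".split()
freq = Counter(corpus)
freq_ci = Counter(w.lower() for w in corpus)

def cap_ratio(word):
    return freq[word] / freq_ci[word.lower()]

print(cap_ratio("Life"), cap_ratio("Trento"))
```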

SLIDE 68

Contributions on Italian Versions

Annotation of 220 documents from "La Repubblica"

Modification of some features, e.g. "date"
Accent treatment, e.g. Cinecittà

SLIDE 69

English Results

SUBTASK SCORES      ACT | REC  PRE
enamex
  organization      454 |  85   84
  person            381 |  90   88
  location          126 |  94   82
timex
  date              109 |  95   97
  time                0 |   0    0
numex
  money              87 |  97   85
  percent            26 |  94   62

Precision = 91%  Recall = 87%  F1 = 88.61

SLIDE 70

Italian Corpus from “La Repubblica”

Training data:

Class   Subtype       N°    Total
ENAMEX  Person        1825  3886
        Organization   769
        Location      1292
TIMEX   Date           511   613
        Time           102
NUMEX   Money          105   223
        Percent        118

SLIDE 71

Italian Corpus from “La Repubblica”

Test data:

Class   Subtype       N°   Total
ENAMEX  Person        333   537
        Organization  129
        Location       75
TIMEX   Date           45    48
        Time            3
NUMEX   Money           5    13
        Percent         8

SLIDE 72

Results of the Italian NER

11-fold cross validation (confidence at 99%)

            Basic Model   +Modified Features   +Accent treatment
Average F1  77.98±2.5     79.08±2.5            79.75±2.5

SLIDE 73

Learning Curve

(Figure: learning curve plotting F1, ranging from 50 to 80, against the number of training documents, from 20 to 220.)

SLIDE 74

Neural Networks for NER

In the last decade, Neural Networks have obtained state-of-the-art results for NER.

English CoNLL 2003 dataset:

Bi-LSTM: 90.94 F1 (Lample et al. 2016)

Italian Evalita 2009 dataset (500+ documents):

Recurrent Context Window Network: 82.81 F1 (Bonadiman et al. 2015)

SLIDE 75

Chunking

Chunking is useful for entity recognition.
Segment and label multi-token sequences.
Each of these larger boxes is called a chunk.

SLIDE 76

Chunking

The CoNLL 2000 corpus contains 270k words of Wall Street Journal text, annotated with part-of-speech tags and chunk tags.

Three chunk types in CoNLL 2000:

  • NP chunks
  • VP chunks
  • PP chunks
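An NP chunker over POS-tagged input can be sketched in plain Python, without a real chunk grammar: greedily group an optional determiner, any adjectives, and one or more nouns into a chunk. This is an illustration, not the CoNLL 2000 systems' approach:

```python
def np_chunks(tagged):
    """Greedy NP chunking: optional DT, any JJs, then one or more nouns."""
    chunks, i = [], 0
    while i < len(tagged):
        j = i
        if tagged[j][1] == "DT":
            j += 1
        while j < len(tagged) and tagged[j][1] == "JJ":
            j += 1
        k = j
        while k < len(tagged) and tagged[k][1] in ("NN", "NNS", "NNP"):
            k += 1
        if k > j:                       # at least one noun: emit a chunk
            chunks.append(" ".join(w for w, _ in tagged[i:k]))
            i = k
        else:
            i += 1
    return chunks

# The Penn-tagged sentence from the earlier slide.
sent = [("the", "DT"), ("grand", "JJ"), ("jury", "NN"),
        ("commented", "VBD"), ("on", "IN"), ("a", "DT"),
        ("number", "NN"), ("of", "IN"), ("other", "JJ"),
        ("topics", "NNS")]
print(np_chunks(sent))
# ['the grand jury', 'a number', 'other topics']
```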