Named Entity Recognition & Sequence Labeling (CSCI 699: ML for Knowledge Extraction & Reasoning) – PowerPoint PPT Presentation


slide-1
SLIDE 1

Named Entity Recognition & Sequence Labeling

CSCI 699: ML for Knowledge Extraction & Reasoning

Instructor: Xiang Ren USC Computer Science

slide-2
SLIDE 2

Recap: Information Extraction

  • Information extraction (IE) systems
  • Find and understand limited relevant parts of texts
  • Gather information from many pieces of text
  • Produce a structured representation of the relevant information: a knowledge base

2

slide-3
SLIDE 3

Recap: Information Extraction

  • Information extraction (IE) systems
  • Goals:
  • 1. Organize information so that it is useful to people
  • 2. Put information in a semantically precise form that allows further inferences to be made by computer algorithms

3

slide-4
SLIDE 4

Example uses of IE technologies

  • In applications like Apple or Google mail, and in web indexing

4

slide-5
SLIDE 5

Example uses of IE technologies

  • Entity panel (from Google Knowledge Graph)

5

slide-6
SLIDE 6

Example uses of IE technologies

  • How-to instructions

6

slide-7
SLIDE 7

7

What is Entity Recognition and Typing

  • Identify token spans of entity mentions in text, and classify them into types of interest

[Barack Obama] arrived this afternoon in [Washington, D.C.]. [President Obama]’s wife [Michelle] accompanied him. [TNF alpha] is produced chiefly by activated [macrophages].

slide-8
SLIDE 8

8

What is Entity Recognition and Typing

  • Identify token spans of entity mentions in text, and classify them into types of interest

[TNF alpha] is produced chiefly by activated [macrophages]. [Barack Obama] arrived this afternoon in [Washington, D.C.]. [President Obama]’s wife [Michelle] accompanied him.

Types of interest: PERSON, LOCATION, PROTEIN, CELL

slide-9
SLIDE 9

Applications of NER

slide-10
SLIDE 10

Why NER?

  • Coreference Resolution / Entity Linking
  • Relation Extraction
  • Knowledge Base Construction
  • Web Query Understanding
  • Question Answering

10

slide-11
SLIDE 11

Natural Language Understanding

11

slide-12
SLIDE 12

Entity linking

12

slide-13
SLIDE 13

Why NER?

  • The uses:
  • Named entities can be indexed, linked off, etc.
  • Sentiment can be attributed to companies or products
  • A lot of IE relations are associations between named entities
  • For question answering, answers are often named entities

13

slide-14
SLIDE 14

Why NER?

  • Concretely:
  • Many web pages tag various entities, with links to bio or topic pages, etc.
  • Reuters’ OpenCalais, Evri, AlchemyAPI, Yahoo’s Term Extraction, …
  • Apple/Google/Microsoft/… smart recognizers for document content

14

slide-15
SLIDE 15

Evaluation of NER

slide-16
SLIDE 16

Contingency table

  • “correct” → entity mention’s boundary is correct & predicted type label is correct

16

                 correct   not correct
  selected         tp          fp
  not selected     fn          tn

slide-17
SLIDE 17

Precision/Recall/F1 for NER

  • When there are “boundary errors”:
  • First Bank of Chicago announced earnings …
  • This counts as both a fp (the system selects a wrong one) and a fn (the system misses a true ne)

17

slide-18
SLIDE 18

Precision/Recall/F1 for NER

  • When there are “boundary errors”:
  • First Bank of Chicago announced earnings …
  • This counts as both a fp (the system selects a wrong one) and a fn (the system misses a true ne)
  • Selecting nothing would have been better
  • Some other metrics (e.g., the MUC scorer) give partial credit (according to complex rules)

18

slide-19
SLIDE 19

Precision & Recall

  • Precision: % of selected items that are correct
  • Recall: % of correct items that are selected

19

                 correct   not correct
  selected         tp          fp
  not selected     fn          tn

slide-20
SLIDE 20

F-score: A combined measure

  • A combined measure that assesses the P/R tradeoff is the F measure (weighted harmonic mean):

  F = 1 / ( α·(1/P) + (1−α)·(1/R) ) = (β² + 1)·P·R / (β²·P + R)

  • The harmonic mean is a very conservative average
  • People usually use the balanced F1 score
  • i.e., with β = 1 (that is, α = ½): F1 = 2PR / (P + R)

20
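To make the tradeoff concrete, a minimal plain-Python sketch (function names are ours) computing precision, recall, and the F measure from contingency counts:

  def precision(tp, fp):
      return tp / (tp + fp)

  def recall(tp, fn):
      return tp / (tp + fn)

  def f_beta(p, r, beta=1.0):
      # Weighted harmonic mean: beta > 1 weights recall higher, beta < 1 precision.
      return (beta ** 2 + 1) * p * r / (beta ** 2 * p + r)

  # The quiz on the next slides:
  print(f_beta(0.40, 0.40))  # 0.4   (P = R = 40%  ->  F1 = 40%)
  print(f_beta(0.75, 0.25))  # 0.375 (P = 75%, R = 25%  ->  F1 = 37.5%)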

slide-21
SLIDE 21

F1 score

  • P = 40% R = 40% F1 = ?
  • P = 75% R = 25% F1 = ?

21

slide-22
SLIDE 22

F1 score

  • P = 40% R = 40% F1 = 40%
  • P = 75% R = 25% F1 = 37.5%

22

slide-23
SLIDE 23

Public Benchmarks for NER

  • CoNLL-2002 and CoNLL-2003 (British newswire)
  • Multiple languages: Spanish, Dutch, English, German
  • 4 entities: Person, Location, Organization, Misc
  • MUC-6 and MUC-7 (American newswire)
  • 7 entities: Person, Location, Organization, Time, Date, Percent, Money
  • ACE
  • 5 entities: Location, Organization, Person, FAC, GPE
  • BBN (Penn Treebank)
  • 22 entities: Animal, Cardinal, Date, Disease, …

23

slide-24
SLIDE 24

Methods and Models for NER

slide-25
SLIDE 25

Three Approaches to NER

  • 1. Hand-crafted patterns/rules
  • 2. Standard classifiers
  • KNN, decision tree, naïve Bayes, SVM, …
  • 3. Sequence models
  • HMMs
  • CRFs
  • LSTM-CRF

25

slide-26
SLIDE 26

Knowledge Engineering vs. Machine Learning

26

Learning Systems

  + higher recall
  + no need to develop grammars
  + developers do not need to be experts
  + annotations are cheap
  − require lots of training data

Knowledge Engineering

  + very precise (hand-coded rules)
  + small amount of training data
  − expensive development & test cycle
  − domain dependent
  − changes over time are hard
slide-27
SLIDE 27

MUC: the NLP genesis of IE

  • DARPA funded significant efforts in IE in the early to mid 1990s
  • Message Understanding Conference (MUC) was an annual event/competition where results were presented
  • Focused on extracting information from news articles:
  • Terrorist events
  • Industrial joint ventures
  • Company management changes
  • Starting off, all rule-based, gradually moved to ML

27

Slide from Chris Manning

slide-28
SLIDE 28

MUC: the NLP genesis of IE

28

Slide from Chris Manning

Bridgestone Sports Co. said Friday it had set up a joint venture in Taiwan with a local concern and a Japanese trading house to produce golf clubs to be supplied to Japan. The joint venture, Bridgestone Sports Taiwan Co., capitalized at 20 million new Taiwan dollars, will start production in January 1990 with production of 20,000 iron and “metal wood” clubs a month.

JOINT-VENTURE-1

  • Relationship: TIE-UP
  • Entities: “Bridgestone Sports Co.”, “a local concern”, “a Japanese trading house”
  • Joint Ent: “Bridgestone Sports Taiwan Co.”
  • Activity: ACTIVITY-1
  • Amount: NT$20 000 000

ACTIVITY-1

  • Activity: PRODUCTION
  • Company: “Bridgestone Sports Taiwan Co.”
  • Product: “iron and ‘metal wood’ clubs”
  • Start date: DURING: January 1990
slide-29
SLIDE 29

Rule Based NER

  • Create regular expressions to extract:
  – Telephone number
  – E-mail
  – Capitalized names

29

slide-30
SLIDE 30

Regular Expressions

  • Regular expressions provide a flexible way to match strings of text, such as particular characters, words, or patterns of characters

Suppose you are looking for a word that:

  1. starts with a capital letter “P”
  2. is the first word on a line
  3. the second letter is a lower case letter
  4. is exactly three letters long
  5. the third letter is a vowel

The regular expression would be “^P[a-z][aeiou]”, where
  ^ – indicates the beginning of the string
  [a-z] – any letter in the range a to z
  [aeiou] – any vowel
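As a quick check, a minimal Python sketch (ours) that applies this pattern; we append $ so the “exactly three letters” requirement holds when matching whole words:

  import re

  # The slide's pattern with `$` appended so the match is exactly three letters
  # (rule 4); re.match anchors at the start of the string, covering rules 1-2.
  pattern = re.compile(r"^P[a-z][aeiou]$")

  for word in ["Pie", "Pea", "Pat", "pie", "Peoria"]:
      print(word, bool(pattern.match(word)))
  # Pie True, Pea True, Pat False (t is not a vowel),
  # pie False (no capital P), Peoria False (too long)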

30

slide-31
SLIDE 31

Regular Expressions

  • \w (word char) – any alpha-numeric
  • \d (digit char) – any digit
  • \s (space char) – any whitespace
  • . (wildcard) – anything
  • \b – word boundary
  • ^ – beginning of string
  • $ – end of string
  • ? – 0 or 1 occurrences
  • + – 1 or more occurrences
  • specific range of number of occurrences: {min,max}
  • A{1,5} – one to five A’s
  • A{5,} – five or more A’s
  • A{5} – exactly five A’s

31

slide-32
SLIDE 32

Rule Based NER

  • Create regular expressions to extract:
  – Telephone number
  – E-mail
  – Capitalized names

32

Telephone number: blocks of digits separated by hyphens
RegEx = (\d+\-)+\d+

slide-33
SLIDE 33

Rule Based NER

  • Create regular expressions to extract:
  – Telephone number
  – E-mail
  – Capitalized names

33

Telephone number: blocks of digits separated by hyphens
RegEx = (\d+\-)+\d+

  • matches valid phone numbers like 900-865-1125 and 725-1234
  • incorrectly extracts social security numbers like 123-45-6789
  • fails to identify numbers like 800.865.1125 and (800)865-CARE

Improved RegEx = (\d{3}[-.\ ()]){1,2}[\dA-Z]{4}
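A quick sanity check of both patterns, as a minimal Python sketch (ours); note that re.search looks for a match anywhere in the string:

  import re

  naive = re.compile(r"(\d+\-)+\d+")
  improved = re.compile(r"(\d{3}[-.\ ()]){1,2}[\dA-Z]{4}")

  for s in ["900-865-1125", "725-1234", "123-45-6789",
            "800.865.1125", "(800)865-CARE"]:
      print(s, bool(naive.search(s)), bool(improved.search(s)))

  # 900-865-1125  True  True      725-1234      True  True
  # 123-45-6789   True  False     800.865.1125  False True
  # (800)865-CARE False True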

slide-34
SLIDE 34

Rule Based NER

  • Create rules to extract locations
  • Capitalized word + {city, center, river} indicates location
  • Ex. New York city, Hudson river
  • Capitalized word + {street, boulevard, avenue} indicates location
  • Ex. Fifth avenue

34

slide-35
SLIDE 35

Rule Based NER

  • For unstructured human-written text, some NLP may help
  • Part-of-speech (POS) tagging
  • Mark each word as a noun, verb, preposition, etc.
  • Syntactic parsing
  • Identify phrases: NP, VP, PP
  • Semantic word categories (e.g. from WordNet)
  • KILL: kill, murder, assassinate, strangle, suffocate

35

slide-36
SLIDE 36

Why simple patterns/rules would not work?

  • Capitalization is a strong indicator for capturing proper names, but it can be tricky:
  • first word of a sentence is capitalized
  • sometimes titles in web pages are all capitalized
  • nested named entities contain non-capitalized words: University of Southern California is an Organization
  • all nouns in German are capitalized!

36

slide-37
SLIDE 37

Why simple patterns/rules would not work?

  • We already have discussed that currently no gazetteer contains all existing proper names
  • New proper names constantly emerge: movie titles, books, singers, restaurants, etc.

37

slide-38
SLIDE 38

Why simple patterns/rules would not work?

  • The same entity can have multiple variants of the same proper name:
  • Xiang Ren
  • Prof. Ren
  • Dr. Ren
  • Xiang
  • Proper names are ambiguous:
  • Jordan the person vs. Jordan the location
  • JFK the person vs. JFK the airport
  • May the person vs. May the month

38

slide-39
SLIDE 39

Standard Classifiers for NER

slide-40
SLIDE 40

Workflow of Token-wise Classifiers

Training
  1. Collect a set of representative training documents
  2. Label each token for its entity class or other (O)
  3. Design feature extractors appropriate to the text and classes
  4. Train a classifier to predict the labels of each token in the annotated training sentences

Testing
  1. Receive a set of testing documents
  2. Run trained classifier to label each token
  3. Appropriately output the recognized entities

40


slide-42
SLIDE 42

Encoding labels for sequence

42

  • IO labeling schema:
  • The O
  • European ORG
  • Commission ORG
  • said O
  • on O
  • Thursday O
  • it O
  • disagreed O
  • with O
  • German MISC

(European and Commission are part of an ORG entity; German is part of a MISC entity.)

slide-43
SLIDE 43

Encoding labels for sequence

43

  • BIOES labeling schema:
  • The O
  • European B-ORG
  • Commission E-ORG
  • said O
  • on O
  • Thursday O
  • it O
  • disagreed O
  • with O
  • German S-MISC

(B = beginning of entity, E = end of entity, S = singleton entity)
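To make the schema concrete, a small plain-Python sketch (ours) that decodes a BIOES-tagged token sequence back into typed entity spans:

  def bioes_to_spans(tokens, tags):
      """Decode BIOES tags into (entity text, type) pairs."""
      spans, start = [], None
      for i, tag in enumerate(tags):
          prefix, _, etype = tag.partition("-")
          if prefix == "S":                    # Singleton entity
              spans.append((tokens[i], etype))
          elif prefix == "B":                  # Beginning of a multi-token entity
              start = i
          elif prefix == "E" and start is not None:   # End of that entity
              spans.append((" ".join(tokens[start:i + 1]), etype))
              start = None
          # "I" tokens simply continue the current entity; "O" is outside.
      return spans

  tokens = "The European Commission said on Thursday it disagreed with German".split()
  tags = ["O", "B-ORG", "E-ORG", "O", "O", "O", "O", "O", "O", "S-MISC"]
  print(bioes_to_spans(tokens, tags))
  # [('European Commission', 'ORG'), ('German', 'MISC')]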

slide-44
SLIDE 44

Encoding labels for sequence

44

  • BIO / BIOES labeling schema:
  • The O
  • European B-ORG
  • Commission E-ORG
  • said O
  • on O
  • Thursday O
  • it O
  • disagreed O
  • with O
  • German S-MISC

The European Commission said on Thursday it disagreed with German …

slide-45
SLIDE 45

NER as a Classification Problem

  • NED: Identify named entities using BIO tags
  • B – beginning of an entity
  • I – continues the entity
  • O – word outside the entity
  • NEC: Classify into a predefined set of categories
  • Person names
  • Organizations (companies, governmental organizations, etc.)
  • Locations (cities, countries, etc.)
  • Miscellaneous (movie titles, sport events, etc.)

45

slide-46
SLIDE 46

Features for NER

  • Words
  • Current word (essentially like a learned dictionary)
  • Previous/next word (context)
  • Other kinds of inferred linguistic classification
  • Part-of-speech tags
  • Label context
  • Previous (and perhaps next) label

46

slide-47
SLIDE 47

Features: Word Substrings

47

(Figure: distribution of entity types (drug, company, movie, place, person) for words containing the substrings “oxa”, “:”, and “field”, e.g., Cotrimoxazole, “Alien Fury: Countdown to Invasion”, Wethersfield.)

slide-48
SLIDE 48

Features for NER

  • Word Shapes
  • Map words to a simplified representation that encodes attributes such as length, capitalization, numerals, Greek letters, internal punctuation, etc.

48

  Varicella-zoster → Xx-xxx
  mRNA → xXXX
  CPA1 → XXXd
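A minimal sketch of such a mapping (our simplified variant; real systems typically also collapse repeated shape characters, giving e.g. Xx-xxx instead of Xxxxxxxxx-xxxxxx):

  def word_shape(word):
      """Map each character to a shape class: X (upper), x (lower), d (digit)."""
      shape = []
      for ch in word:
          if ch.isupper():
              shape.append("X")
          elif ch.islower():
              shape.append("x")
          elif ch.isdigit():
              shape.append("d")
          else:
              shape.append(ch)        # keep punctuation such as '-'
      return "".join(shape)

  print(word_shape("Varicella-zoster"))  # Xxxxxxxxx-xxxxxx
  print(word_shape("CPA1"))              # XXXd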

slide-49
SLIDE 49

Classical Learning Models on NER

  • KNN
  • Decision Tree
  • Naïve Bayes
  • SVM
  • Boosting, …

49

slide-50
SLIDE 50

K Nearest Neighbor

  • Learning is just storing the representations of the training examples
  • Testing instance xp:
  • compute similarity between xp and all training examples
  • take a vote among xp’s k nearest neighbours
  • assign xp the category of the most similar examples in T

50

slide-51
SLIDE 51

Distance measures in KNN

  • Nearest neighbor method uses a similarity (or distance) metric
  • Given two objects x and y, both with n values, calculate the Euclidean distance as

  d(x, y) = √( Σᵢ₌₁..ₙ (xᵢ − yᵢ)² )

51
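Putting the pieces together, a compact sketch (ours; the feature vectors are adapted from the walk-through table on the next slide, and the labels are our simplification):

  import math
  from collections import Counter

  def euclidean(x, y):
      return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

  def knn_predict(query, examples, k=3):
      """examples: (feature vector, label) pairs; majority vote over the k nearest."""
      nearest = sorted(examples, key=lambda ex: euclidean(query, ex[0]))[:k]
      return Counter(label for _, label in nearest).most_common(1)[0][0]

  # Illustrative vectors [isCapitalized, isLiving, teachesCS544], adapted from
  # the walk-through table; labels are our simplification.
  train = [([1, 1, 1], "PersonName"),   # Jerry Hobbs
           ([1, 0, 0], "NotPerson"),    # USC
           ([0, 1, 1], "PersonName"),   # eduard hovy
           ([1, 1, 1], "PersonName")]   # Kevin Knight
  print(knn_predict([1, 1, 0], train, k=3))  # PersonName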

slide-52
SLIDE 52

An walk-through example

52

                 isPersonName   isCapitalized   isLiving   teachesCS544
  Jerry Hobbs         1               1             1            1
  USC                 1
  eduard hovy         1               1             1
  Kevin Knight        1               1             1            1

slide-53
SLIDE 53

3-NN

53

  • 1-NN: choose the category of the closest neighbor (can be erroneous due to noise)
  • k-NN: choose the category of the majority of the k neighbors

slide-54
SLIDE 54

KNN: Pros & Cons

54

Pros

  + robust
  + simple
  + training is very fast (storing examples)

Cons

  − depends on similarity measure & k
  − easily fooled by irrelevant attributes
  − computationally expensive
slide-55
SLIDE 55

Decision Tree

55

(Figure: a decision tree for “X is PersonName?”, evaluated on examples such as Jerry Hobbs (YES), USC (NO), and Jordan (NO) using attributes isCapitalized, isPersonName, isLiving.)

  • Each internal node tests an attribute
  • Each branch corresponds to an attribute value
  • Each leaf node assigns a classification

slide-56
SLIDE 56

Decision Tree: Pros & Cons

56

Pros

  + generate understandable rules
  + provide a clear indication of which features are most important for classification

Cons

  − error prone in multi-class classification with a small number of training examples
  − expensive to train due to pruning
slide-57
SLIDE 57

AdaBoost with Decision Tree (Carreras et al. 2002)

  • Learning algorithm: AdaBoost
  • Binary classification
  • Binary features
  • (Schapire & Singer, 99)
  • Weak rules (ht): Decision Trees of fixed depth.

57

slide-58
SLIDE 58

Feature generation

Adam Smith works for IBM in London .

58

slide-59
SLIDE 59

Feature generation

(Figure: per-token processing of “Adam Smith works for IBM in London .”)

59

  • Contextual
  • current word W0
slide-60
SLIDE 60

Feature generation

(Figure: context windows per token, e.g., “Adam”: Adam, null, null, null, Smith, works, for; “Smith”: Smith, Adam, null, null, works, for, IBM; “.”: fp, London, in, IBM, null, null, null.)

60

  • Contextual
  • current word W0
  • words around W0 in [-3,…,+3] window
slide-61
SLIDE 61

Feature generation

(Figure: the same context-window features extended with orthographic flags, e.g., “Adam”: …, initial-caps = 1, all-caps = 0.)

61

  • Orthographic
  • initial-caps
  • all-caps
slide-62
SLIDE 62

More features

62

slide-63
SLIDE 63

AdaBoost with Decision Tree (Carreras et al. 2002)

  • Learning algorithm: AdaBoost
  • Binary classification
  • Binary features
  • (Schapire & Singer, 99)
  • Weak rules (ht): Decision Trees of fixed depth.

63

slide-64
SLIDE 64

Results for Entity Boundary Detection

64

CoNLL-2002 Spanish Evaluation Data

  Data set      #tokens   #NEs
  Train         264,715   18,794
  Development    52,923    4,351
  Test           51,533    3,558

Carreras et al., 2002

            Precision   Recall   F-score
  BIO dev.    92.45      90.88    91.66

slide-65
SLIDE 65

Results for Entity Classification

65

  Spanish Dev.   Precision   Recall   F-score
  LOC              79.04      80.00    79.52
  MISC             55.48      54.61    55.04
  ORG              79.57      76.06    77.77
  PER              87.19      86.91    87.05
  Overall          79.15      77.80    78.47

  Spanish Test   Precision   Recall   F-score
  LOC              85.76      79.43    82.47
  MISC             60.19      57.35    58.73
  ORG              81.21      82.43    81.81
  PER              84.71      93.47    88.87
  Overall          81.38      81.40    81.39

slide-66
SLIDE 66

Sequence Labeling for NER

slide-67
SLIDE 67

Traditional Named Entity Recognition (NER) Systems

  • Heavy reliance on corpus-specific human labeling
  • Training sequence models is slow

67

A manual annotation interface

e.g., (McCallum & Li, 2003), (Finkel et al., 2005), (Ratinov & Roth, 2009), …

“The best BBQ I’ve tasted in Phoenix” → O O Food O O O Location

Sequence model training

NER Systems: Stanford NER, Illinois Name Tagger, IBM Alchemy APIs, …

slide-68
SLIDE 68

Encoding labels for sequence

68

  • BIOES labeling schema:
  • The O
  • European B-ORG
  • Commission E-ORG
  • said O
  • on O
  • Thursday O
  • it O
  • disagreed O
  • with O
  • German S-MISC

(B = beginning of entity, E = end of entity, S = singleton entity)

slide-69
SLIDE 69

Sequence Labeling as Classification

  • Classify each token independently but use as

input features, information about the surrounding tokens (sliding window).

John saw the saw and decided to take it to the table.

classifier NNP

Slide from Ray Mooney
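In code, the sliding-window feature extraction looks roughly like this (our sketch); a standard classifier then maps each token’s feature dictionary to a tag:

  def window_features(tokens, i, size=2):
      """Features for token i: the token itself plus neighbors in a +/-size window."""
      feats = {"w0": tokens[i]}
      for d in range(1, size + 1):
          feats[f"w-{d}"] = tokens[i - d] if i - d >= 0 else "<null>"
          feats[f"w+{d}"] = tokens[i + d] if i + d < len(tokens) else "<null>"
      return feats

  tokens = "John saw the saw and decided to take it to the table .".split()
  print(window_features(tokens, 3))
  # {'w0': 'saw', 'w-1': 'the', 'w+1': 'and', 'w-2': 'saw', 'w+2': 'decided'}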

slide-70
SLIDE 70

Sequence Labeling as Classification

  • Classify each token independently but use as

input features, information about the surrounding tokens (sliding window).

John saw the saw and decided to take it to the table.

classifier VBD

Slide from Ray Mooney

slide-71
SLIDE 71

Sequence Labeling as Classification

  • Classify each token independently but use as

input features, information about the surrounding tokens (sliding window).

John saw the saw and decided to take it to the table.

classifier DT

Slide from Ray Mooney

slide-72
SLIDE 72

Sequence Labeling as Classification

  • Classify each token independently but use as

input features, information about the surrounding tokens (sliding window).

John saw the saw and decided to take it to the table.

classifier NN

Slide from Ray Mooney

slide-73
SLIDE 73

Sequence Labeling as Classification

  • Classify each token independently but use as

input features, information about the surrounding tokens (sliding window).

John saw the saw and decided to take it to the table.

classifier CC

Slide from Ray Mooney

slide-74
SLIDE 74

Sequence Labeling as Classification

  • Classify each token independently but use as

input features, information about the surrounding tokens (sliding window).

John saw the saw and decided to take it to the table.

classifier VBD

Slide from Ray Mooney

slide-75
SLIDE 75

Sequence Labeling as Classification

  • Classify each token independently but use as

input features, information about the surrounding tokens (sliding window).

John saw the saw and decided to take it to the table.

classifier TO

Slide from Ray Mooney

slide-76
SLIDE 76

Sequence Labeling as Classification

  • Classify each token independently but use as

input features, information about the surrounding tokens (sliding window).

John saw the saw and decided to take it to the table.

classifier VB

Slide from Ray Mooney

slide-77
SLIDE 77

Sequence Labeling as Classification

  • Classify each token independently but use as

input features, information about the surrounding tokens (sliding window).

John saw the saw and decided to take it to the table.

classifier PRP

Slide from Ray Mooney

slide-78
SLIDE 78

Sequence Labeling as Classification

  • Classify each token independently but use as

input features, information about the surrounding tokens (sliding window).

John saw the saw and decided to take it to the table.

classifier IN

Slide from Ray Mooney

slide-79
SLIDE 79

Sequence Labeling as Classification

  • Classify each token independently but use as

input features, information about the surrounding tokens (sliding window).

John saw the saw and decided to take it to the table.

classifier DT

Slide from Ray Mooney

slide-80
SLIDE 80

Sequence Labeling as Classification

  • Classify each token independently but use as

input features, information about the surrounding tokens (sliding window).

John saw the saw and decided to take it to the table.

classifier NN

Slide from Ray Mooney

slide-81
SLIDE 81

Sequence Labeling vs. Classification

  • Sequence models are statistical models of whole token sequences that effectively label sub-sequences

81

slide-82
SLIDE 82

Workflow of Sequence Modeling

Training
  1. Collect a set of representative training documents
  2. Label each token for its entity class or other (O)
  3. Design feature extractors appropriate to the text and classes
  4. Train a sequence classifier to predict the labels from the data

Testing
  1. Receive a set of testing documents
  2. Run sequence model inference to label each token
  3. Appropriately output the recognized entities

82

slide-83
SLIDE 83

Hidden Markov Models (HMMs)

  • Generative
  • Find parameters to maximize P(X,Y)
  • Assumes features are independent
  • When labeling Xi, future observations are taken into account (forward-backward)

Slides by Jenny Finkel

slide-84
SLIDE 84

84

  • Graphical Model Representation: Variables by time
  • Circles indicate states
  • Arrows indicate probabilistic dependencies between states

What is an HMM?

slide-85
SLIDE 85

85

  • Green circles are hidden states
  • Dependent only on the previous state: Markov process
  • “The past is independent of the future given the present.”

What is an HMM?

slide-86
SLIDE 86

86

  • Purple nodes are observed states
  • Dependent only on their corresponding hidden state

What is an HMM?

slide-87
SLIDE 87

87

  • {S, K, P, A, B}
  • S : {s1…sN} are the values for the hidden states
  • K : {k1…kM} are the values for the observations

(Figure: a lattice of hidden states S emitting observations K over time.)

slide-88
SLIDE 88

88

HMM Formalism

  • {S, K, P, A, B}
  • P = {pi} are the initial state probabilities
  • A = {aij} are the state transition probabilities
  • B = {bik} are the observation state probabilities

A B A A A B B S S S K K K S K S K

slide-89
SLIDE 89

89

Inference for an HMM

  • Compute the probability of a given observation sequence
  • Given an observation sequence, compute the most likely hidden state sequence
  • Given an observation sequence and a set of possible models, which model most closely fits the data?

slide-90
SLIDE 90

90

Sequence Probability

  • Given an observation sequence and a model, compute the probability of the observation sequence

  O = (o1, …, oT),  μ = (A, B, π)
  Compute P(O | μ)

slide-91
SLIDE 91

91

Sequence Probability

  P(O, X | μ) = P(O | X, μ) · P(X | μ)
  P(O | X, μ) = b(x1, o1) · b(x2, o2) · … · b(xT, oT)
  P(X | μ) = π(x1) · a(x1, x2) · a(x2, x3) · … · a(xT−1, xT)
  P(O | μ) = Σ_X P(O | X, μ) · P(X | μ)
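Summing over all state sequences X directly is exponential in T; the forward algorithm computes P(O | μ) in O(N²T) time. A minimal numpy sketch (ours), in the notation above:

  import numpy as np

  def forward(pi, A, B, obs):
      """P(O | mu) for an HMM with initial probs pi (N,), transitions A (N,N),
      emissions B (N,M), and an observation index sequence obs."""
      alpha = pi * B[:, obs[0]]            # alpha_1(i) = pi_i * b(i, o_1)
      for o in obs[1:]:
          alpha = (alpha @ A) * B[:, o]    # alpha_{t+1}(j) = sum_i alpha_t(i) a_ij b(j, o)
      return alpha.sum()

  # Toy 2-state, 2-symbol HMM
  pi = np.array([0.6, 0.4])
  A = np.array([[0.7, 0.3], [0.4, 0.6]])
  B = np.array([[0.9, 0.1], [0.2, 0.8]])
  print(forward(pi, A, B, [0, 1, 0]))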

slide-92
SLIDE 92

MaxEnt Markov Models (MEMMs)

  • Discriminative
  • Find parameters to maximize P(Y|X)
  • No longer assume that features are independent
  • Do not take future observations into account (no forward-backward)

Slides by Jenny Finkel

slide-93
SLIDE 93

MEMM inference in systems

  • For a MEMM, the classifier makes a single decision at a time, conditioned on evidence from observations and previous decisions
  • A larger space of sequences is usually explored via search

(Figure: decision point at “22.6” in “The Dow fell 22.6 %”, with the previous tags DT NNP VBD already assigned and the remaining tags still undecided.)

Local Context Features

  W0          22.6
  W+1         %
  W-1         fell
  T-1         VBD
  T-1-T-2     NNP-VBD
  hasDigit?   true

(Ratnaparkhi 1996; Toutanova et al. 2003, etc.)

slide-94
SLIDE 94

Conditional Random Fields (CRFs)

  • Discriminative
  • Doesn’t assume that features are independent
  • When labeling Yi, future observations are taken into account → the best of both worlds!

Slides by Jenny Finkel

slide-95
SLIDE 95

CRFs [Lafferty, Pereira, and McCallum 2001]

  • A whole-sequence conditional model rather than a chaining of local models
  • The space of c’s is now the space of sequences
  • But if the features fi remain local, the conditional sequence likelihood can be calculated exactly using dynamic programming
  • Training is slower, but CRFs avoid causal-competition biases

  P(c | d, λ) = exp( Σi λi fi(c, d) ) / Σ_{c′} exp( Σi λi fi(c′, d) )
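For intuition, a brute-force sketch (ours; exponential in sequence length, toy sizes only) of this conditional probability, enumerating every competing label sequence c′ instead of using dynamic programming:

  import itertools, math

  def crf_prob(c, d, feats, lam, labels):
      """P(c | d, lambda) with global score sum_i lam_i * f_i(c, d)."""
      def score(seq):
          return sum(l * f(seq, d) for l, f in zip(lam, feats))
      z = sum(math.exp(score(cp))                                   # partition function
              for cp in itertools.product(labels, repeat=len(c)))   # over all sequences c'
      return math.exp(score(c)) / z

  # Toy: two global features over a 3-token sentence d.
  d = ["John", "saw", "Mary"]
  feats = [lambda c, d: sum(t == "PER" and w[0].isupper() for t, w in zip(c, d)),
           lambda c, d: sum(a == b for a, b in zip(c, c[1:]))]  # label smoothness
  print(crf_prob(("PER", "O", "PER"), d, feats, lam=[2.0, 0.5], labels=["PER", "O"]))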

slide-96
SLIDE 96

Model Trade-offs

          Speed        Discrim vs. Generative   Normalization
  HMM     very fast    generative               local
  MEMM    mid-range    discriminative           local
  CRF     kinda slow   discriminative           global

Slides by Jenny Finkel

slide-97
SLIDE 97

Greedy Inference

  • Greedy inference:
  • We just start at the left, and use our classifier at each position to assign a label
  • The classifier can depend on previous labeling decisions as well as observed data
  • Advantages:
  • Fast, no extra memory requirements
  • Very easy to implement
  • With rich features including observations to the right, it may perform quite well
  • Disadvantage:
  • Greedy. We may commit errors we cannot recover from

(Diagram: sequence model → inference → best sequence)

Slides by Chris Manning

slide-98
SLIDE 98

Viterbi Inference

  • Viterbi inference:
  • Dynamic programming or memoization
  • Requires a small window of state influence (e.g., past two states are relevant)
  • Advantage:
  • Exact: the global best sequence is returned
  • Disadvantage:
  • Harder to implement long-distance state-state interactions (but beam inference tends not to allow long-distance resurrection of sequences anyway)

(Diagram: sequence model → inference → best sequence)

Slides by Chris Manning
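A compact numpy sketch (ours) of Viterbi decoding for the HMM parameterization used earlier; unlike the forward algorithm’s sum, it takes a max and keeps backpointers:

  import numpy as np

  def viterbi(pi, A, B, obs):
      """Most likely state sequence for observation indices obs."""
      N, T = len(pi), len(obs)
      delta = np.log(pi) + np.log(B[:, obs[0]])   # best log-score ending in each state
      back = np.zeros((T, N), dtype=int)
      for t in range(1, T):
          cand = delta[:, None] + np.log(A)       # cand[i, j]: come from i, move to j
          back[t] = cand.argmax(axis=0)
          delta = cand.max(axis=0) + np.log(B[:, obs[t]])
      path = [int(delta.argmax())]
      for t in range(T - 1, 0, -1):               # follow the backpointers
          path.append(int(back[t, path[-1]]))
      return path[::-1]

  pi = np.array([0.6, 0.4])
  A = np.array([[0.7, 0.3], [0.4, 0.6]])
  B = np.array([[0.9, 0.1], [0.2, 0.8]])
  print(viterbi(pi, A, B, [0, 1, 0]))  # [0, 1, 0]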

slide-99
SLIDE 99

Beam Inference

  • Beam inference:
  • At each position keep the top k complete sequences
  • Extend each sequence in each local way
  • The extensions compete for the k slots at the next position
  • Advantages:
  • Fast; beam sizes of 3–5 are almost as good as exact inference in many cases
  • Easy to implement (no dynamic programming required)
  • Disadvantage:
  • Inexact: the globally best sequence can fall off the beam

(Diagram: sequence model → inference → best sequence)

Slides by Chris Manning
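A sketch of beam inference over tag sequences (ours; the scoring function is a hypothetical stand-in for a trained local classifier). With beam_size=1 it degenerates to greedy inference:

  def beam_decode(tokens, tags, local_score, beam_size=3):
      """Keep the top-k partial tag sequences at each position."""
      beam = [((), 0.0)]                       # (partial sequence, cumulative score)
      for tok in tokens:
          candidates = [(seq + (t,), s + local_score(tok, t, seq))
                        for seq, s in beam for t in tags]
          beam = sorted(candidates, key=lambda x: -x[1])[:beam_size]
      return beam[0][0]                        # best surviving complete sequence

  # Hypothetical scorer: reward PER on capitalized tokens, O elsewhere.
  def local_score(tok, tag, prev_seq):
      return 1.0 if (tok[0].isupper()) == (tag == "PER") else 0.0

  print(beam_decode("John saw Mary .".split(), ["PER", "O"], local_score))
  # ('PER', 'O', 'PER', 'O')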

slide-100
SLIDE 100

Stanford CRF System: A Case Study

https://nlp.stanford.edu/software/CRF-NER.shtml

slide-101
SLIDE 101

Stanford NER

  • CRF
  • Features are more important than model
  • How to train a new model

101

slide-102
SLIDE 102

Lexical Features

  • Word features: current word, previous word, next word, all words within a window
  • Orthographic features:
  • Jenny → Xxxx
  • IL-2 → XX-#

102

slide-103
SLIDE 103

Lexical Features

  • Word features: current word, previous word, next word, all words within a window
  • Word shape features:
  • Jenny → Xxxx
  • IL-2 → XX-#
  • Prefixes and suffixes:
  • Jenny → <J, <Je, <Jen, …, nny>, ny>, y>
  • Label sequences
  • Lots of feature conjunctions

103

slide-104
SLIDE 104

Distributional Similarity Features

  • Large, unannotated corpus
  • Each word will appear in contexts: induce a distribution over contexts
  • Cluster words based on how similar their distributions are
  • Use cluster IDs as features

104

slide-105
SLIDE 105

Distributional Similarity Features

  • Large, unannotated corpus
  • Each word will appear in contexts: induce a distribution over contexts
  • Cluster words based on how similar their distributions are
  • Use cluster IDs as features
  • Great way to combat sparsity
  • They used Alexander Clark’s distributional similarity code (easy to use, works great!)
  • 200 clusters, induced from 100 million words of the English Gigaword corpus

105
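A toy sketch of the idea (ours, using scikit-learn’s KMeans as a stand-in for Clark’s distributional clustering): build context-count vectors per word type, cluster them, and use the cluster ID as a feature:

  from collections import defaultdict
  from sklearn.feature_extraction import DictVectorizer
  from sklearn.cluster import KMeans

  corpus = [["the", "cat", "sat"], ["the", "dog", "sat"],
            ["a", "cat", "ran"], ["a", "dog", "ran"]]

  # Context distribution: counts of neighboring words for each word type.
  contexts = defaultdict(lambda: defaultdict(int))
  for sent in corpus:
      for i, w in enumerate(sent):
          for j in (i - 1, i + 1):
              if 0 <= j < len(sent):
                  contexts[w][f"{j - i}:{sent[j]}"] += 1

  words = sorted(contexts)
  X = DictVectorizer().fit_transform([contexts[w] for w in words])
  ids = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
  print(dict(zip(words, ids)))  # "cat" and "dog" should share a cluster ID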

slide-106
SLIDE 106

Distributed Models

  • Trained on CoNLL, MUC and ACE
  • Entities: Person, Location, Organization
  • Trained on both British and American newswire, so robust across both domains
  • Models with and without the distributional similarity features

106

slide-107
SLIDE 107

Incorporating NER into NLP systems

  • NER is a component technology
  • Common approach:
  • Labeled data
  • Pipe output to next stage
  • Better approach:
  • Sample output at each stage
  • Pipe sampled output to next stage
  • Repeat several times
  • Vote for final output
  • Sampling NER outputs is fast

107

slide-108
SLIDE 108

Neural Sequence Labeling

slide-109
SLIDE 109

Prior Approaches

  • Hand-crafted features
  • External dictionaries
  • Other resources that require human effort to generate

109

slide-110
SLIDE 110

Neural Sequence Models

  • No hand-engineered features
  • No specialized knowledge resources
  • Capture orthographic and distributional evidence for being a named entity

110

slide-111
SLIDE 111

LSTM-CRF model

111

slide-112
SLIDE 112

LSTM-CRF

112

slide-113
SLIDE 113

LSTM-CRF

113

slide-114
SLIDE 114

Char-level Embedding

114

slide-115
SLIDE 115

Changes of F1 score for Neural NER models (CONLL’03)

115

Results by Yves Peirsman

slide-116
SLIDE 116

State-of-the-Art Models

slide-117
SLIDE 117

Challenge of NER on Low- resource Domains

117

  • Lack of abundant supervision data: there is no ImageNet for NLP
  • Feature engineering

slide-118
SLIDE 118

Prior Neural NER Model

118

(Figure: word-level Bi-LSTM with a CRF output layer over a sentence mentioning “Italian champions Juve” and “the Argentine”, predicting tags such as S-ORG for “Juve” and S-MISC for “Italian”/“Argentine”. Each token input xt concatenates a word embedding xw, capturing word semantics, with a char-level representation xc, capturing lexical features.)

slide-119
SLIDE 119

Word-level Knowledge

  • Pre-trained word embeddings have demonstrated their potential to improve the performance of neural models

119

slide-120
SLIDE 120

Word-level Knowledge

120

(Figure: the same LSTM-CRF architecture as on slide 118, highlighting the word-embedding inputs xw.)

Performance on NER (F1 score) of the LSTM-CNN-CRF model by embedding:

  Embeddings   F1
  Random       80.76
  Senna        90.28
  Word2Vec     84.91
  GloVe        91.21

slide-121
SLIDE 121

Character-level Knowledge

  • The order of language contains valuable information, and can be utilized to improve performance
  • Specifically, we utilize a language model to guide the sequence labeling model to the key knowledge

121

slide-122
SLIDE 122

Character-level Knowledge

122

slide-123
SLIDE 123

Results of LM-LSTM-CRF

123

  Extra Resource          Model                        F1 score
  gazetteers              Collobert et al. 2011† [1]   89.59
                          Chiu et al. 2016† [2]        91.62±0.33
  AIDA dataset            Luo et al. 2015 [3]          91.20
  CONLL 2000 / PTB-POS    Yang et al. 2017† [4]        91.26
  1B Word Dataset         Peters et al. 2017†‡ [5]     91.93±0.19
  None                    Collobert et al. 2011† [1]   88.67
                          Luo et al. 2015 [3]          89.90
                          Chiu et al. 2016† [2]        90.91±0.20
                          Lample et al. 2016† [6]      90.94
                          Ma et al. 2016† [7]          91.21
                          Yang et al. 2017† [4]        91.20
                          Rei 2017†‡ [8]               86.26
                          Peters et al. 2017† [5]      90.87±0.13
                          Peters et al. 2017†‡ [5]     90.79±0.15
                          LM-LSTM-CRF†‡                91.24±0.12 (91.35)

slide-124
SLIDE 124

LSTM-CRF: Single-task Architecture

(Diagram: single-task LSTM-CRF architectures. Without co-training: a char emb from a char BiLSTM is concatenated with the word emb and fed through a word BiLSTM into a CRF. With co-training: the char BiLSTM additionally feeds a language-model (LM) objective.)

slide-125
SLIDE 125

Learning NER on Multiple Tag Sets and Datasets

125

slide-126
SLIDE 126

Multi-task Architecture1: MTM-C

  • Private word-level Bi-LSTM and shared char-level Bi-LSTM

126

slide-127
SLIDE 127

Multi-task Architecture1: MTM-C

  • Private char-level Bi-LSTM and shared word-level Bi-LSTM

127

slide-128
SLIDE 128

Multi-task Architecture2: MTM-CW

  • Shared char-level Bi-LSTM and word-level Bi-LSTM

128

slide-129
SLIDE 129

F1 Performance Comparison

129

slide-130
SLIDE 130

Results compared with baselines

130

slide-131
SLIDE 131

Other Settings of NER

slide-132
SLIDE 132

Different Learning Settings

  • Supervised learning
  • labeled training examples
  • methods: Hidden Markov Models, k-Nearest Neighbors, Decision Trees, AdaBoost, SVM, …
  • example: NE recognition, POS tagging, parsing
  • Unsupervised learning
  • labels must be automatically discovered
  • method: clustering
  • example: NE disambiguation, text classification

132

slide-133
SLIDE 133

Different Learning Settings

  • Semi-supervised learning
  • a small percentage of training examples are labeled, the rest is unlabeled
  • methods: bootstrapping, active learning, co-training, self-training
  • example: NE recognition, POS tagging, parsing, …

133

slide-134
SLIDE 134

Weak-Supervision Systems: Pattern-Based Bootstrapping

  • Requires manual seed selection & mid-point checking

134

Annotate corpus using entities → Generate candidate patterns → Score candidate patterns → Select top patterns → Apply patterns to find new entities

Seeds for Food: Pizza, French Fries, Hot Dog, Pancake, …

Patterns for Food: “the best <X> I’ve tried”, “in their <X>”, “<X> tastes amazing”, …

e.g., (Etzioni et al., 2005), (Talukdar et al., 2010), (Gupta et al., 2014), (Mitchell et al., 2015), …

Systems: CMU NELL, UW KnowItAll, Stanford DeepDive, Max-Planck PROSPERA, …
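A bare-bones sketch of this bootstrapping loop (ours; real systems score patterns by precision and coverage and include the manual mid-point checks mentioned above):

  import re

  def bootstrap(corpus, seeds, rounds=2, top_k=2):
      entities = set(seeds)
      for _ in range(rounds):
          # Steps 1-2: annotate the corpus with known entities, generate candidate patterns.
          counts = {}
          for sent in corpus:
              for e in entities:
                  if e in sent:
                      pat = sent.replace(e, "<X>")
                      counts[pat] = counts.get(pat, 0) + 1
          # Steps 3-4: score patterns (here simply by frequency) and keep the top ones.
          best = sorted(counts, key=counts.get, reverse=True)[:top_k]
          # Step 5: apply the selected patterns to find new entities.
          for sent in corpus:
              for pat in best:
                  left, _, right = pat.partition("<X>")
                  m = re.fullmatch(re.escape(left) + "(.+)" + re.escape(right), sent)
                  if m:
                      entities.add(m.group(1))
      return entities

  corpus = ["the best Pizza I've tried", "the best Ramen I've tried",
            "Hot Dog tastes amazing", "Sushi tastes amazing"]
  print(bootstrap(corpus, seeds={"Pizza", "Hot Dog"}))
  # {'Pizza', 'Hot Dog', 'Ramen', 'Sushi'}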

slide-135
SLIDE 135

Unsupervised Entity Extraction

  • Statistics based on massive text corpora
  • Popularity
  • Raw frequency
  • Frequency distribution based on Zipfian ranks [Deane’05]
  • Concordance
  • Significance score [Church et al.’91] [El-Kishky et al.’14]
  • Completeness
  • Comparison to super/sub-sequences [Parameswaran et al.’10]

135

slide-136
SLIDE 136

Frequency Distribution

  • Idea: ranks in a Zipfian frequency distribution are more reliable than raw frequency
  • Heuristic: Actual Rank / Expected Rank
  • Example:
  • Given a phrase like “east end”
  • Actual Rank: rank of “east end” among all occurrences of “east” (e.g., “east end”, “east side”, “the east”, “towards the east”, etc.)
  • Expected Rank: rank of “__ end” among all contexts of “east” (e.g., “__ end”, “__ side”, “the __”, “towards the __”, etc.)

136

slide-137
SLIDE 137

Significance Score

  • Significance score [Church et al.’91]
  • A.k.a. Z score
  • ToPMine [El-Kishky et al.’15]
  • If a phrase can be decomposed into two parts P = P1 ⊕ P2:
  • α(P1, P2) ≈ ( f(P1●P2) − µ0(P1, P2) ) / √ f(P1●P2)

137
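A small sketch (ours) of the score; we instantiate µ0 as the expected co-occurrence count under independence, one common choice, though the paper’s exact form may differ:

  import math

  def significance(f_p1p2, f_p1, f_p2, total):
      """Z-like score: (observed - expected) / sqrt(observed)."""
      mu0 = f_p1 * f_p2 / total      # expected count if P1 and P2 were independent
      return (f_p1p2 - mu0) / math.sqrt(f_p1p2)

  # "support vector" occurs 90 times; "support" 100, "vector" 95, 10,000 bigrams total
  print(significance(90, 100, 95, 10_000))  # high score -> likely a quality phrase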


slide-138
SLIDE 138

TopMine Algorithm

138

[Markov blanket] [feature selection] for [support vector machines]
[knowledge discovery] using [least squares] [support vector machine] [classifiers]
… [support vector] for [machine learning] …

Based on significance score [Church et al.’91]:

slide-139
SLIDE 139

Limitations of Unsupervised Methods

  • The thresholds should be carefully chosen
  • Only consider a subset of quality phrase requirements
  • Combining different signals in an unsupervised manner is difficult
  • Introducing some supervision may help!

139

slide-140
SLIDE 140

SegPhrase: Weakly-supervised Phrase Mining

140

(Figure: SegPhrase pipeline. Input: a raw corpus, e.g., documents about citation recommendation, heterogeneous information networks, and Principal Component Analysis → Phrase Mining → Quality Phrases → Phrasal Segmentation → Segmented Corpus.)

slide-141
SLIDE 141

SegPhrase: Segmentation of Phrases

  • Partition a sequence of words by maximizing the likelihood
  • Considering:
  • Phrase quality score
  • ClassPhrase assigns a quality score for each phrase
  • Probability in corpus
  • Length penalty α: when α > 1, it favors shorter phrases
  • Filter out phrases with low rectified frequency
  • Bad phrases are expected to rarely occur in the segmentation results

141

slide-142
SLIDE 142

Summary

  • Task definition, evaluation & applications
  • Methods: rule-based, classifier-based, (neural) sequence models
  • Semi-supervised learning & multi-tasking on NER
  • Other settings of NER

142