Named Entity Recognition & Sequence Labeling
CSCI 699: ML for Knowledge Extraction & Reasoning
Instructor: Xiang Ren, USC Computer Science
Recap: Information Extraction
- Information extraction (IE) systems:
- Find and understand limited relevant parts of texts
- Gather information from many pieces of text
- Produce a structured representation of relevant information: a knowledge base
Recap: Information Extraction
- Information extraction (IE) systems
- Goals:
- 1. Organize information so that it is useful to people
- 2. Put information in a semantically precise form that allows further inferences to be made by computer algorithms
Example uses of IE technologies
- In applications like Apple or Google mail, and web indexing
Example uses of IE technologies
- Entity panel (from Google Knowledge Graph)
Example uses of IE technologies
- How-to instructions
What is Entity Recognition and Typing?
- Identify token spans of entity mentions in text, and classify them into types of interest:
- [Barack Obama]PERSON arrived this afternoon in [Washington, D.C.]LOCATION. [President Obama]PERSON's wife [Michelle]PERSON accompanied him.
- [TNF alpha]PROTEIN is produced chiefly by activated [macrophages]CELL.
Applications of NER
Why NER?
- Coreference Resolution / Entity Linking
- Relation Extraction
- Knowledge Base Construction
- Web Query Understanding
- Question Answering
- …
Natural Language Understanding
Entity linking
Why NER?
- The uses:
- Named entities can be indexed, linked off, etc.
- Sentiment can be attributed to companies or products
- A lot of IE relations are associations between named entities
- For question answering, answers are often named entities
Why NER?
- Concretely:
- Many web pages tag various entities, with links to bio or topic pages, etc.
- Reuters' OpenCalais, Evri, AlchemyAPI, Yahoo's Term Extraction, …
- Apple/Google/Microsoft/… smart recognizers for document content
Evaluation of NER
Contingency table
- "correct" → entity mention's boundary is correct & predicted type label is correct

                correct   not correct
  selected      tp        fp
  not selected  fn        tn
Precision/Recall/F1 for NER
- When there are "boundary errors":
- First Bank of Chicago announced earnings … (e.g., the system tags only "Bank of Chicago")
- This counts as both a fp (the system selects a wrong span) and a fn (the system misses the true named entity)
- Selecting nothing would have been better
- Some other metrics (e.g., the MUC scorer) give partial credit (according to complex rules)
Precision & Recall
- Precision: % of selected items that are correct, i.e., P = tp / (tp + fp)
- Recall: % of correct items that are selected, i.e., R = tp / (tp + fn)

                correct   not correct
  selected      tp        fp
  not selected  fn        tn
F-score: A combined measure
- A combined measure that assesses the P/R tradeoff is the F measure (weighted harmonic mean):

  F = 1 / (α·(1/P) + (1−α)·(1/R)) = (β² + 1)·P·R / (β²·P + R)

- The harmonic mean is a very conservative average
- People usually use the balanced F1 score
- i.e., with β = 1 (that is, α = ½): F1 = 2PR/(P+R)
F1 score
- P = 40%, R = 40% → F1 = 2·0.40·0.40/(0.40+0.40) = 40%
- P = 75%, R = 25% → F1 = 2·0.75·0.25/(0.75+0.25) = 37.5%
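To make the entity-level metric concrete, below is a minimal Python sketch (an assumption for illustration, not an official CoNLL/MUC scorer): an entity counts as a true positive only when both its boundary and its type match exactly, so a boundary error costs both a fp and a fn.

```python
# A minimal entity-level scorer sketch; spans are (start, end, type).
def prf1(gold, pred):
    gold, pred = set(gold), set(pred)
    tp = len(gold & pred)                      # exact boundary + type match
    p = tp / len(pred) if pred else 0.0        # precision
    r = tp / len(gold) if gold else 0.0        # recall
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

gold = [(0, 4, "ORG"), (6, 7, "DATE")]
pred = [(1, 4, "ORG"), (6, 7, "DATE")]         # boundary error on the first span
print(prf1(gold, pred))                        # (0.5, 0.5, 0.5)
```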
Public Benchmarks for NER
- CoNLL-2002 and CoNLL-2003 (British newswire)
- Multiple languages: Spanish, Dutch, English, German
- 4 entities: Person, Location, Organization, Misc
- MUC-6 and MUC-7 (American newswire)
- 7 entities: Person, Location, Organization, Time, Date, Percent, Money
- ACE
- 5 entities: Location, Organization, Person, FAC, GPE
- BBN (Penn Treebank)
- 22 entities: Animal, Cardinal, Date, Disease, …
Methods and Models for NER
Three Approaches to NER
- 1. Hand-crafted patterns/rules
- 2. Standard classifiers
- KNN, decision tree, naïve Bayes, SVM, …
- 3. Sequence models
- HMMs
- CRFs
- LSTM-CRF
Knowledge Engineering vs. Machine Learning

Knowledge Engineering
+ very precise (hand-coded rules)
+ small amount of training data
- expensive development & test cycle
- domain dependent
- changes over time are hard

Learning Systems
+ higher recall
+ no need to develop grammars
+ developers do not need to be experts
+ annotations are cheap
- require lots of training data
MUC: the NLP genesis of IE
- DARPA funded significant efforts in IE in the early to mid 1990s
- The Message Understanding Conference (MUC) was an annual event/competition where results were presented
- Focused on extracting information from news articles:
- Terrorist events
- Industrial joint ventures
- Company management changes
- Started off all rule-based, gradually moved to ML
Slide from Chris Manning
MUC: the NLP genesis of IE
Slide from Chris Manning
Bridgestone Sports Co. said Friday it had set up a joint venture in Taiwan with a local concern and a Japanese trading house to produce golf clubs to be supplied to Japan. The joint venture, Bridgestone Sports Taiwan Co., capitalized at 20 million new Taiwan dollars, will start production in January 1990 with production of 20,000 iron and “metal wood” clubs a month.
JOINT-VENTURE-1
- Relationship: TIE-UP
- Entities: "Bridgestone Sports Co.", "a local concern", "a Japanese trading house"
- Joint Ent: "Bridgestone Sports Taiwan Co."
- Activity: ACTIVITY-1
- Amount: NT$20,000,000

ACTIVITY-1
- Activity: PRODUCTION
- Company: "Bridgestone Sports Taiwan Co."
- Product: "iron and 'metal wood' clubs"
- Start date: DURING: January 1990
Rule Based NER
- Create regular expressions to extract:
- Telephone number
- E-mail
- Capitalized names
Regular Expressions
- Regular expressions provide a flexible way to match strings of text, such as particular characters, words, or patterns of characters
- Suppose you are looking for a word that:
- 1. starts with a capital letter "P"
- 2. is the first word on a line
- 3. the second letter is a lower case letter
- 4. is exactly three letters long
- 5. the third letter is a vowel
- The regular expression would be "^P[a-z][aeiou]", where:
- ^ indicates the beginning of the string
- [a-z] matches any letter in range a to z
- [aeiou] matches any vowel
Regular Expressions
- \w (word char) any alpha-numeric
- \d (digit char) any digit
- \s (space char) any whitespace
- . (wildcard) anything
- \b word boundary
- ^ beginning of string
- $ end of string
- ? for 0 or 1 occurrences
- + for 1 or more occurrences
- specific range of number of occurrences: {min,max}
- A{1,5} one to five A's
- A{5,} five or more A's
- A{5} exactly five A's
Rule Based NER
- Create regular expressions to extract telephone numbers: blocks of digits separated by hyphens
- RegEx = (\d+\-)+\d+
- matches valid phone numbers like 900-865-1125 and 725-1234
- incorrectly extracts social security numbers like 123-45-6789
- fails to identify numbers like 800.865.1125 and (800)865-CARE
- Improved RegEx = (\d{3}[-.\ ()]){1,2}[\dA-Z]{4}
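A quick Python check of the two patterns against the slide's examples (a sketch; re.search scans for the pattern anywhere in the string):

```python
import re

basic = re.compile(r"(\d+\-)+\d+")
improved = re.compile(r"(\d{3}[-.\ ()]){1,2}[\dA-Z]{4}")

tests = ["900-865-1125", "725-1234",      # valid, hyphen-separated numbers
         "123-45-6789",                   # SSN: basic wrongly matches it
         "800.865.1125", "(800)865-CARE"] # only the improved RegEx matches
for t in tests:
    print(f"{t:15s} basic={bool(basic.search(t))} improved={bool(improved.search(t))}")
```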
Rule Based NER
- Create rules to extract locations:
- Capitalized word + {city, center, river} indicates location
- e.g., New York city, Hudson river
- Capitalized word + {street, boulevard, avenue} indicates location
- e.g., Fifth avenue
Rule Based NER
- For unstructured human-written text, some NLP may help
- Part-of-speech (POS) tagging
- Mark each word as a noun, verb, preposition, etc.
- Syntactic parsing
- Identify phrases: NP, VP, PP
- Semantic word categories (e.g., from WordNet)
- KILL: kill, murder, assassinate, strangle, suffocate
Why don't simple patterns/rules work?
- Capitalization is a strong indicator for capturing proper names, but it can be tricky:
- the first word of a sentence is capitalized
- sometimes titles in web pages are all capitalized
- nested named entities contain non-capitalized words (University of Southern California is an Organization)
- all nouns in German are capitalized!
Why don't simple patterns/rules work?
- We have already discussed that no current gazetteer contains all existing proper names
- New proper names constantly emerge: movie titles, books, singers, restaurants, etc.
Why don't simple patterns/rules work?
- The same entity can have multiple variants of its proper name:
- Xiang Ren, Prof. Ren, Dr. Ren, Xiang
- Proper names are ambiguous:
- Jordan the person vs. Jordan the location
- JFK the person vs. JFK the airport
- May the person vs. May the month
Standard Classifiers for NER

Workflow of Token-wise Classifiers
- Training:
- 1. Collect a set of representative training documents
- 2. Label each token with its entity class or other (O)
- 3. Design feature extractors appropriate to the text and classes
- 4. Train a classifier to predict the labels of each token in the annotated training sentences
- Testing:
- 1. Receive a set of testing documents
- 2. Run the trained classifier to label each token
- 3. Appropriately output the recognized entities
Encoding labels for sequence
- IO labeling schema:
- The         O
- European    ORG    (part of ORG entity)
- Commission  ORG    (part of ORG entity)
- said        O
- on          O
- Thursday    O
- it          O
- disagreed   O
- with        O
- German      MISC   (part of MISC entity)
Encoding labels for sequence
- BIOES labeling schema:
- The         O
- European    B-ORG  (Begin of entity)
- Commission  E-ORG  (End of entity)
- said        O
- on          O
- Thursday    O
- it          O
- disagreed   O
- with        O
- German      S-MISC (Singleton entity)
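Whichever schema is used, the tag sequence must eventually be decoded back into entity spans. A minimal sketch (an assumption, not from the lecture) for BIOES decoding, using the example sentence above:

```python
# Decode a BIOES tag sequence into (start, end, type) entity spans.
def bioes_to_spans(tags):
    spans, start = [], None
    for i, tag in enumerate(tags):
        if tag == "O":
            start = None
            continue
        prefix, etype = tag.split("-", 1)
        if prefix == "S":                      # singleton entity
            spans.append((i, i + 1, etype)); start = None
        elif prefix == "B":                    # begin of a multi-token entity
            start = i
        elif prefix == "E" and start is not None:   # end of the entity
            spans.append((start, i + 1, etype)); start = None
    return spans

tags = ["O", "B-ORG", "E-ORG", "O", "O", "O", "O", "O", "O", "S-MISC"]
print(bioes_to_spans(tags))  # [(1, 3, 'ORG'), (9, 10, 'MISC')]
```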
NER as a Classification Problem
- NED: Identify named entities using BIO tags
- B: beginning of an entity
- I: continues the entity
- O: word outside the entity
- NEC: Classify into a predefined set of categories:
- Person names
- Organizations (companies, governmental organizations, etc.)
- Locations (cities, countries, etc.)
- Miscellaneous (movie titles, sport events, etc.)
Features for NER
- Words
- Current word (essentially like a learned dictionary)
- Previous/next word (context)
- Other kinds of inferred linguistic classification
- Part-of-speech tags
- Label context
- Previous (and perhaps next) label
Features: Word Substrings
[Figure: counts by entity class (drug, company, movie, place, person) of names containing particular substrings — e.g., "oxa" as in Cotrimoxazole is indicative of drug, ":" as in "Alien Fury: Countdown to Invasion" of movie, and "field" as in Wethersfield of place]
Features for NER
- Word Shapes
- Map words to a simplified representation that encodes attributes such as length, capitalization, numerals, Greek letters, internal punctuation, etc.
- Varicella-zoster → Xx-xxx
- mRNA → xXXX
- CPA1 → XXXd
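A small sketch of a word-shape mapping (an assumption, not the lecture's exact code): uppercase → X, lowercase → x, digit → d, punctuation kept as-is. The run-collapsing rule for long words is one plausible choice that reproduces the slide's examples:

```python
import itertools

def word_shape(word, tail=3):
    classes = []
    for ch in word:
        if ch.isupper():
            classes.append("X")
        elif ch.islower():
            classes.append("x")
        elif ch.isdigit():
            classes.append("d")
        else:
            classes.append(ch)          # keep punctuation as-is
    if len(word) <= 4:                  # short words keep their full shape
        return "".join(classes)
    # collapse each run of repeats to one char; keep up to `tail` for the last run
    runs = [(c, len(list(g))) for c, g in itertools.groupby(classes)]
    out = []
    for i, (c, n) in enumerate(runs):
        keep = min(n, tail) if i == len(runs) - 1 else 1
        out.append(c * keep)
    return "".join(out)

for w in ["Varicella-zoster", "mRNA", "CPA1"]:
    print(w, "->", word_shape(w))  # Xx-xxx, xXXX, XXXd
```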
Classical Learning Models for NER
- KNN
- Decision Tree
- Naïve Bayes
- SVM
- Boosting, …
K Nearest Neighbor
- Learning is just storing the representations of the training examples
- For a testing instance xp:
- compute the similarity between xp and all training examples
- take a vote among xp's k nearest neighbors
- assign xp the category of the most similar examples in T
Distance measures in KNN
- The nearest neighbor method uses a similarity (or distance) metric
- Given two objects x and y, both with n values, calculate the Euclidean distance as d(x, y) = √( Σ_{i=1..n} (x_i − y_i)² )
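A tiny sketch (toy data, an assumption) of the Euclidean distance over binary feature vectors like those in the walk-through example that follows:

```python
import math

def euclidean(x, y):
    # distance between two n-dimensional feature vectors
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

# binary features: isPersonName, isCapitalized, isLiving, teachesCS544
jerry_hobbs = [1, 1, 1, 1]
usc = [0, 1, 0, 0]
print(euclidean(jerry_hobbs, usc))  # sqrt(3) ≈ 1.732
```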
A walk-through example

                isPersonName  isCapitalized  isLiving  teachesCS544
  Jerry Hobbs   1             1              1         1
  USC           0             1              0         0
  eduard hovy   1             0              1         1
  Kevin Knight  1             1              1         1
3-NN
- 1-NN: choose the category of the single closest neighbor (can be erroneous due to noise)
- 3-NN: choose the category of the majority of the neighbors
KNN: Pros & Cons
Pros
+ robust
+ simple
+ training is very fast (just storing examples)
Cons
- depends on the similarity measure & k
- easily fooled by irrelevant attributes
- computationally expensive at test time
Decision Tree
- Each internal node tests an attribute
- Each branch corresponds to an attribute value
- Each leaf node assigns a classification
[Figure: a decision tree over isCapitalized, isPersonName, and isLiving deciding "X is PersonName?" — e.g., Jerry Hobbs → YES; USC, Jordan → NO]
Decision Tree: Pros & Cons
Pros
+ generates understandable rules
+ provides a clear indication of which features are most important for classification
Cons
- error prone in multi-class classification with a small number of training examples
- expensive to train due to pruning
AdaBoost with Decision Trees (Carreras et al., 2002)
- Learning algorithm: AdaBoost (Schapire & Singer, 1999)
- Binary classification
- Binary features
- Weak rules (ht): decision trees of fixed depth
Feature generation
- Example sentence: Adam Smith works for IBM in London .
- Contextual features:
- current word W0
- words around W0 in a [-3, …, +3] window
- e.g., for "Smith": Smith, Adam, null, null, works, for, IBM
- Orthographic features:
- initial-caps
- all-caps
- e.g., "Smith" gets initial-caps = 1, all-caps = 0
- More features: …
Results for Entity Boundary Detection
CoNLL-2002 Spanish Evaluation Data (Carreras et al., 2002)

  Data set     #tokens  #NEs
  Train        264,715  18,794
  Development   52,923   4,351
  Test          51,533   3,558

              Precision  Recall  F-score
  BIO (dev.)  92.45      90.88   91.66
Results for Entity Classification

  Spanish Dev.  Precision  Recall  F-score
  LOC           79.04      80.00   79.52
  MISC          55.48      54.61   55.04
  ORG           79.57      76.06   77.77
  PER           87.19      86.91   87.05
  Overall       79.15      77.80   78.47

  Spanish Test  Precision  Recall  F-score
  LOC           85.76      79.43   82.47
  MISC          60.19      57.35   58.73
  ORG           81.21      82.43   81.81
  PER           84.71      93.47   88.87
  Overall       81.38      81.40   81.39
Sequence Labeling for NER

Traditional Named Entity Recognition (NER) Systems
- Heavy reliance on corpus-specific human labeling (via a manual annotation interface), e.g., (McCallum & Li, 2003), (Finkel et al., 2005), (Ratinov & Roth, 2009), …
- Training sequence models is slow
- Example annotation: "The best BBQ I've tasted in Phoenix" → O O Food O O O Location
- NER Systems: Stanford NER, Illinois Name Tagger, IBM Alchemy APIs, …
Sequence Labeling as Classification
- Classify each token independently, but use as input features information about the surrounding tokens (sliding window)
- John saw the saw and decided to take it to the table.
- Sliding the classifier across the sentence labels each token in turn: NNP VBD DT NN CC VBD TO VB PRP IN DT NN
Slide from Ray Mooney
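A minimal sketch (an assumption, not Ray Mooney's code) of the sliding-window features: the same word "saw" gets different feature vectors at positions 1 and 3, letting the classifier assign VBD to one and NN to the other:

```python
def window_features(tokens, i, size=1):
    # current word plus neighbors, padded with <null> at the edges
    feats = {"w0": tokens[i]}
    for k in range(1, size + 1):
        feats[f"w-{k}"] = tokens[i - k] if i - k >= 0 else "<null>"
        feats[f"w+{k}"] = tokens[i + k] if i + k < len(tokens) else "<null>"
    return feats

tokens = "John saw the saw and decided to take it to the table .".split()
print(window_features(tokens, 1))  # {'w0': 'saw', 'w-1': 'John', 'w+1': 'the'}
print(window_features(tokens, 3))  # {'w0': 'saw', 'w-1': 'the', 'w+1': 'and'}
```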
Sequence Labeling vs. Classification
- Sequence models are statistical models of whole token sequences that effectively label sub-sequences
Workflow of Sequence Modeling
- Training:
- 1. Collect a set of representative training documents
- 2. Label each token for its entity class or other (O)
- 3. Design feature extractors appropriate to the text and classes
- 4. Train a sequence classifier to predict the labels from the data
- Testing:
- 1. Receive a set of testing documents
- 2. Run sequence model inference to label each token
- 3. Appropriately output the recognized entities
Hidden Markov Models (HMMs)
- Generative
- Find parameters to maximize P(X,Y)
- Assumes features are independent
- When labeling Xi, future observations are taken into account (forward-backward)
Slides by Jenny Finkel
What is an HMM?
- Graphical model representation: variables by time
- Circles indicate states
- Arrows indicate probabilistic dependencies between states
- Green circles are hidden states
- Dependent only on the previous state: Markov process
- "The past is independent of the future given the present."
- Purple nodes are observed states
- Dependent only on their corresponding hidden state

HMM Formalism
- {S, K, Π, A, B}
- S: {s1…sN} are the values for the hidden states
- K: {k1…kM} are the values for the observations
- Π = {πi} are the initial state probabilities
- A = {aij} are the state transition probabilities
- B = {bik} are the observation state probabilities
Inference for an HMM
- Compute the probability of a given observation sequence
- Given an observation sequence, compute the most likely hidden state sequence
- Given an observation sequence and a set of possible models, which model most closely fits the data?

Sequence Probability
- Given an observation sequence O = (o1, …, oT) and a model µ = (A, B, Π), compute the probability of the observation sequence P(O | µ):
- P(O, X | µ) = P(O | X, µ) · P(X | µ)
- P(O | X, µ) = b(x1, o1) · b(x2, o2) · … · b(xT, oT)
- P(X | µ) = π(x1) · a(x1, x2) · a(x2, x3) · … · a(xT−1, xT)
- P(O | µ) = Σ_X P(O | X, µ) · P(X | µ)
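Summing P(O | X, µ)·P(X | µ) over all state sequences X is exponential; the forward algorithm computes the same quantity in O(T·N²). A minimal sketch with toy numbers (an assumption, not from the slides):

```python
import numpy as np

pi = np.array([0.6, 0.4])          # initial state probabilities
A = np.array([[0.7, 0.3],          # transition probabilities a[i][j]
              [0.4, 0.6]])
B = np.array([[0.5, 0.5],          # observation probabilities b[i][k]
              [0.1, 0.9]])
obs = [0, 1, 1]                    # an observation sequence

alpha = pi * B[:, obs[0]]          # alpha_1(i) = pi_i * b_i(o1)
for t in range(1, len(obs)):
    # alpha_t(j) = sum_i alpha_{t-1}(i) * a_ij * b_j(o_t)
    alpha = (alpha @ A) * B[:, obs[t]]
print("P(O | mu) =", alpha.sum())
```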
MaxEnt Markov Models (MEMMs)
- Discriminative
- Find parameters to maximize P(Y|X)
- No longer assumes that features are independent
- Does not take future observations into account (no forward-backward)
Slides by Jenny Finkel
MEMM inference in systems
- For a MEMM, the classifier makes a single decision at a time, conditioned on evidence from observations and previous decisions
- A larger space of sequences is usually explored via search
- Decision point example: The/DT Dow/NNP fell/VBD 22.6/??? %/???

Local Context Features (at the decision point "22.6")
  W0        22.6
  W+1       %
  W-1       fell
  T-1       VBD
  T-1-T-2   NNP-VBD
  hasDigit? true
  …         …
(Ratnaparkhi 1996; Toutanova et al. 2003, etc.)
Conditional Random Fields (CRFs)
- Discriminative
- Doesn't assume that features are independent
- When labeling Yi, future observations are taken into account → the best of both worlds!
Slides by Jenny Finkel
CRFs [Lafferty, McCallum, and Pereira 2001]
- A whole-sequence conditional model rather than a chaining of local models:

  P(c | d, λ) = exp(Σi λi fi(c, d)) / Σc' exp(Σi λi fi(c', d))

- The space of c's is now the space of sequences
- But if the features fi remain local, the conditional sequence likelihood can be calculated exactly using dynamic programming
- Training is slower, but CRFs avoid causal-competition biases
Model Trade-offs

  Model  Speed       Discrim vs. Generative  Normalization
  HMM    very fast   generative              local
  MEMM   mid-range   discriminative          local
  CRF    kinda slow  discriminative          global

Slides by Jenny Finkel
Greedy Inference
- Greedy inference:
- We just start at the left, and use our classifier at each position to assign a label
- The classifier can depend on previous labeling decisions as well as observed data
- Advantages:
- Fast, no extra memory requirements
- Very easy to implement
- With rich features including observations to the right, it may perform quite well
- Disadvantage:
- Greedy: we may commit errors we cannot recover from
Slides by Chris Manning
Viterbi Inference
- Viterbi inference:
- Dynamic programming or memoization
- Requires a small window of state influence (e.g., past two states are relevant)
- Advantage:
- Exact: the globally best sequence is returned
- Disadvantage:
- Harder to implement long-distance state-state interactions (but beam inference tends not to allow long-distance resurrection of sequences anyway)
Slides by Chris Manning
Beam Inference
- Beam inference:
- At each position keep the top k complete sequences
- Extend each sequence in each local way
- The extensions compete for the k slots at the next position
- Advantages:
- Fast; beam sizes of 3–5 are almost as good as exact inference in many cases
- Easy to implement (no dynamic programming required)
- Disadvantage:
- Inexact: the globally best sequence can fall off the beam
Slides by Chris Manning
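To contrast with greedy and beam search, here is a minimal Viterbi sketch (toy scores, an assumption): dynamic programming over per-token emission scores and tag-transition scores returns the exact best sequence:

```python
import numpy as np

def viterbi(emissions, transitions):
    T, K = emissions.shape             # T tokens, K tags
    score = emissions[0].copy()        # best score of length-1 paths per tag
    back = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        # cand[i, j]: best path ending in tag i at t-1, extended to tag j at t
        cand = score[:, None] + transitions + emissions[t][None, :]
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0)
    # follow back-pointers from the best final tag
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]

emissions = np.array([[2.0, 0.5], [0.2, 1.5], [1.0, 1.2]])
transitions = np.array([[0.5, -0.5], [-1.0, 0.8]])
print(viterbi(emissions, transitions))  # [0, 1, 1]
```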
Stanford CRF System: A Case Study
https://nlp.stanford.edu/software/CRF-NER.shtml
Stanford NER
- CRF
- Features are more important than the model
- How to train a new model
Lexical Features
- Word features: current word, previous word, next word, all words within a window
- Word shape features:
- Jenny → Xxxx
- IL-2 → XX-#
- Prefixes and suffixes:
- Jenny → <J, <Je, <Jen, …, nny>, ny>, y>
- Label sequences
- Lots of feature conjunctions
Distributional Similarity Features
- Large, unannotated corpus
- Each word will appear in contexts → induce a distribution over contexts
- Cluster words based on how similar their distributions are
- Use cluster IDs as features
- A great way to combat sparsity
- They used Alexander Clark's distributional similarity code (easy to use, works great!)
- 200 clusters, trained on 100 million words from the English Gigaword corpus
Distributed Models
- Trained on CoNLL, MUC and ACE
- Entities: Person, Location, Organization
- Trained on both British and American newswire, so robust across both domains
- Models with and without the distributional similarity features
Incorporating NER into NLP systems
- NER is a component technology
- Common approach:
- Labeled data
- Pipe output to the next stage
- Better approach:
- Sample output at each stage
- Pipe sampled output to the next stage
- Repeat several times
- Vote for the final output
- Sampling NER outputs is fast
Neural Sequence Labeling

Prior Approaches
- Hand-crafted features
- External dictionaries
- Other resources that require human effort to generate
Neural Sequence Models
- No hand-engineered features
- No specialized knowledge resources
- Capture orthographic and distributional evidence for being a named entity
LSTM-CRF model
[Figures: the LSTM-CRF architecture, built up layer by layer, and the char-level embedding component]
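As a reference point, a minimal PyTorch sketch (an assumption, not the lecture's implementation) of the BiLSTM backbone; a full LSTM-CRF would add a CRF layer on top that scores whole tag sequences instead of using the per-token scores independently:

```python
import torch
import torch.nn as nn

class BiLSTMTagger(nn.Module):
    def __init__(self, vocab_size, num_tags, emb_dim=100, hidden=128):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True,
                            bidirectional=True)
        self.out = nn.Linear(2 * hidden, num_tags)   # per-token emission scores

    def forward(self, token_ids):
        h, _ = self.lstm(self.emb(token_ids))        # (batch, seq_len, 2*hidden)
        return self.out(h)                           # (batch, seq_len, num_tags)

model = BiLSTMTagger(vocab_size=10000, num_tags=17)  # e.g., BIOES x 4 types + O
scores = model(torch.randint(0, 10000, (1, 12)))     # a 12-token sentence
print(scores.shape)                                  # torch.Size([1, 12, 17])
```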
Changes of F1 score for neural NER models (CoNLL'03)
Results by Yves Peirsman
State-of-the-Art Models

Challenge of NER on Low-resource Domains
- Lack of abundant supervision data: there is no ImageNet for NLP
- Feature engineering
Prior Neural NER Model
[Figure: a word-level BiLSTM with a CRF output layer; each word is represented by a word embedding (word semantics) concatenated with a char-level representation (lexical features), predicting tags such as S-MISC and S-ORG]
Word-level Knowledge
- Pre-trained word embeddings have demonstrated their potential to improve the performance of neural models
Word-level Knowledge

Performance on NER (F1 score, LSTM-CNN-CRF model):

  Embeddings  F1
  Random      80.76
  Senna       90.28
  Word2Vec    84.91
  GloVe       91.21
Character-level Knowledge
- The order of language contains valuable information, and can be utilized to improve performance
- Specifically, we utilize a language model to guide the sequence labeling model toward the key knowledge
Results of LM-LSTM-CRF

  Extra Resource        Model                       F1 score
  gazetteers            Collobert et al. 2011† [1]  89.59
  gazetteers            Chiu et al. 2016† [2]       91.62±0.33
  AIDA dataset          Luo et al. 2015 [3]         91.20
  CoNLL 2000 / PTB-POS  Yang et al. 2017† [4]       91.26
  1B Word Dataset       Peters et al. 2017†‡ [5]    91.93±0.19
  None                  Collobert et al. 2011† [1]  88.67
  None                  Luo et al. 2015 [3]         89.90
  None                  Chiu et al. 2016† [2]       90.91±0.20
  None                  Lample et al. 2016† [6]     90.94
  None                  Ma et al. 2016† [7]         91.21
  None                  Yang et al. 2017† [4]       91.20
  None                  Rei 2017†‡ [8]              86.26
  None                  Peters et al. 2017† [5]     90.87±0.13
  None                  Peters et al. 2017†‡ [5]    90.79±0.15
  None                  LM-LSTM-CRF†‡               91.24±0.12 (91.35)
LSTM-CRF: Single-task Architecture
[Figure: w/o co-training — word emb concatenated with char emb (from a char BiLSTM) feeds a word BiLSTM and a CRF; w/ co-training — the char BiLSTM additionally feeds a language model (LM) objective alongside the CRF]
Learning NER on Multiple Tag Sets and Datasets
Multi-task Architecture 1: MTM-C
- private word-level Bi-LSTM and shared char-level Bi-LSTM

Multi-task Architecture 2: MTM-W
- private char-level Bi-LSTM and shared word-level Bi-LSTM

Multi-task Architecture 3: MTM-CW
- shared char-level Bi-LSTM and word-level Bi-LSTM
F1 Performance Comparison
Results compared with baselines
Other Settings of NER

Different Learning Settings
- Supervised learning
- labeled training examples
- methods: Hidden Markov Models, k-Nearest Neighbors, Decision Trees, AdaBoost, SVM, …
- examples: NE recognition, POS tagging, parsing
- Unsupervised learning
- labels must be automatically discovered
- method: clustering
- examples: NE disambiguation, text classification
Different Learning Settings
- Semi-supervised learning
- a small percentage of training examples are labeled, the rest is unlabeled
- methods: bootstrapping, active learning, co-training, self-training
- examples: NE recognition, POS tagging, parsing, …
Weak-Supervision Systems: Pattern-Based Bootstrapping
- Requires manual seed selection & mid-point checking
- Loop: annotate corpus using entities → generate candidate patterns → score candidate patterns → select top patterns → apply patterns to find new entities
- Seeds for Food: Pizza, French Fries, Hot Dog, Pancake, …
- Patterns for Food: "the best <X> I've tried in", "their <X> tastes amazing", …
- e.g., (Etzioni et al., 2005), (Talukdar et al., 2010), (Gupta et al., 2014), (Mitchell et al., 2015), …
- Systems: CMU NELL, UW KnowItAll, Stanford DeepDive, Max-Planck PROSPERA, …
Unsupervised Entity Extraction
- Statistics based on massive text corpora
- Popularity
- Raw frequency
- Frequency distribution based on Zipfian ranks [Deane '05]
- Concordance
- Significance score [Church et al. '91] [El-Kishky et al. '14]
- Completeness
- Comparison to super/sub-sequences [Parameswaran et al. '10]
Frequency Distribution
- Idea: rank in a Zipfian frequency distribution is more reliable than raw frequency
- Heuristic: Actual Rank / Expected Rank
- Example: given a phrase like "east end"
- Actual Rank: rank of "east end" among all occurrences of "east" (e.g., "east end", "east side", "the east", "towards the east", etc.)
- Expected Rank: rank of "__ end" among all contexts of "east" (e.g., "__ end", "__ side", "the __", "towards the __", etc.)
Significance Score
- Significance score [Church et al. '91]
- a.k.a. Z score
- ToPMine [El-Kishky et al. '15]
- If a phrase can be decomposed into two parts P = P1 ● P2:
- α(P1, P2) ≈ (f(P1●P2) − µ0(P1, P2)) / √f(P1●P2)
- where f is the observed frequency and µ0 is the expected frequency under independence
- Used to identify quality phrases
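A small sketch (toy counts, an assumption) of the significance score: compare the observed count of P1●P2 with its expected count µ0 under independence, normalized by √f(P1●P2):

```python
import math

def significance(f_p1, f_p2, f_p1p2, total):
    # expected count of P1●P2 if P1 and P2 occurred independently
    mu0 = (f_p1 / total) * (f_p2 / total) * total
    return (f_p1p2 - mu0) / math.sqrt(f_p1p2)

# e.g., a collocation like "support vector" occurring far above chance
print(significance(f_p1=1200, f_p2=900, f_p1p2=800, total=1_000_000))  # ≈ 28.2
```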
ToPMine Algorithm
- Based on the significance score [Church et al. '91]
- Example segmentation: [Markov blanket] [feature selection] for [support vector machines] … [knowledge discovery] using [least squares] [support vector machine] [classifiers] … [support vector] for [machine learning] …
Limitations of Unsupervised Methods
- The thresholds must be carefully chosen
- They only consider a subset of quality-phrase requirements
- Combining different signals in an unsupervised manner is difficult
- Introducing some supervision may help!
SegPhrase: Weakly-supervised Phrase Mining
- Pipeline: raw corpus → Phrase Mining → quality phrases → Phrasal Segmentation → segmented corpus
- Document 1: Citation recommendation is an interesting but challenging research problem in data mining area.
- Document 2: In this study, we investigate the problem in the context of heterogeneous information networks using data mining technique.
- Document 3: Principal Component Analysis is a linear dimensionality reduction technique commonly used in machine learning applications.
SegPhrase: Segmentation of Phrases
- Partition a sequence of words by maximizing the likelihood, considering:
- Phrase quality score: ClassPhrase assigns a quality score to each phrase
- Probability in corpus
- Length penalty: when the penalty factor is > 1, it favors shorter phrases
- Filter out phrases with low rectified frequency
- Bad phrases are expected to rarely occur in the segmentation results
Summary
- Task definition, evaluation & applications
- Methods: rule-based, classifier-based, (neural) sequence models
- Semi-supervised learning & multi-tasking on NER
- Other settings of NER