

SLIDE 1

Information Extraction Part II

Kristina Lerman University of Southern California

Thanks to Andrew McCallum and William Cohen for overview, sliding windows, and CRF slides. Thanks to Matt Michelson for slides on exploiting reference sets. Thanks to Fabio Ciravegna for slides on LP2.
SLIDE 2

What is “Information Extraction”?

October 14, 2002, 4:00 a.m. PT

For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation. Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers. "We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access." Richard Stallman, founder of the Free Software Foundation, countered saying…

Information Extraction = segmentation + classification + association + clustering

Extracted: Microsoft Corporation, CEO, Bill Gates; Microsoft, Gates; Bill Veghte, Microsoft VP; Richard Stallman, founder, Free Software Foundation

SLIDE 3

Outline

  • IE History
  • Landscape of problems and solutions
  • Models for segmenting/classifying:

    – Lexicons/Reference Sets
    – Finite state machines
    – NLP Patterns

SLIDE 4

Finite State Machines

SLIDE 5

Information Ext: Graphical Models

  • Task: Given an input string and a set of states (labels) with probabilities (defined later)
    – What is the set of states that produces the input sequence?

Input: 2001 Ford Mustang GT V-8 Convertible - $12700

[Diagram: FSM with states start → year → make → model]

SLIDE 6

Probabilistic Generative Models

  • Generative models
    – Model the joint probability of X and Y → P(X,Y)
    – Example: there is a water bowl, 1 dog, 1 cat, and 1 chicken
      • What is the probability of seeing the dog and cat drink together?
      • Count the # of times the dog and cat drink together and divide by the # of times all animals drink together
  • Markov assumption: the probability of the current state depends only on the previous state
    – X, Y independent: you just saw the cat drink, but you can’t use this information to predict the next drinkers!
  • Standard model: the Hidden Markov Model (HMM)
SLIDE 7

Markov Process: Definitions

Let’s say we’re independent of time (stationary); then we can define:

  • a_ij = P(q_t = S_j | q_{t-1} = S_i), the STATE TRANSITION probability from S_i to S_j
    – What is the probability of moving to a new state from the old state? (e.g., cat/dog drinking, given that no one was drinking?)
    – a_ij >= 0
    – Conserve the “mass” of probability → all outgoing probabilities from a state sum to 1

[Diagram: two states, Not Drinking and Drinking, with transition probabilities 0.35 and 0.65]
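A minimal sketch of this in Python. The 0.35/0.65 values come from the diagram; which arc carries which number is not recoverable from the slide, so the assignment below is an illustrative assumption:

```python
# Sketch of a two-state Markov transition table a_ij = P(q_t = S_j | q_{t-1} = S_i).
# The 0.35/0.65 split is from the slide's diagram; its assignment to arcs is assumed.
transitions = {
    "not-drinking": {"not-drinking": 0.65, "drinking": 0.35},
    "drinking":     {"drinking": 0.65, "not-drinking": 0.35},
}

# Conservation of probability mass: all outgoing probabilities sum to 1.
for state, row in transitions.items():
    assert abs(sum(row.values()) - 1.0) < 1e-9, f"row for {state} must sum to 1"
```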

SLIDE 8

Markov Process: More Defs

  • Two more terms to define:
    – π_i = P(q_1 = S_i) = probability that we start in state S_i
    – b_j(k) = P(k | q_t = S_j) = probability of observation symbol k in state S_j (the “emission probability”)

So, let’s say symbols = {cat, dog}; then we could have something like b_1(cat) = P(cat | not-drinking-state), i.e., what is the probability that we output “cat” in the not-drinking state?

SLIDE 9

Hidden Markov Model

  • A Hidden Markov Model (HMM) is:
    – a set of states, a set of a_ij, a set of π_i, and a set of b_j(k)
  • Training
    – From sequences of observations, compute the transition probs (a_ij), emission probs (b_j(k)), and starting probs (π_i)
  • Decoding (after training)
    – Input comes in and is fit to the model’s observations
    – Output the best state-transition sequence that produces the input
  • You can observe the sequence of emissions, but you do not know what state the model is in → “Hidden”. If two states can output “yes” and all I see is “yes,” I have no idea which state (or set of states) produced it!
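A minimal sketch of the three parameter sets as NumPy arrays, reusing the drinking example; every number here is an illustrative assumption:

```python
import numpy as np

states  = ["not-drinking", "drinking"]   # hidden states S_i
symbols = ["cat", "dog"]                 # observation alphabet

pi = np.array([0.6, 0.4])                # pi_i = P(q_1 = S_i), assumed values
A  = np.array([[0.65, 0.35],             # a_ij = P(q_t = S_j | q_{t-1} = S_i)
               [0.35, 0.65]])
B  = np.array([[0.7, 0.3],               # b_j(k) = P(symbol k | state S_j)
               [0.2, 0.8]])              # e.g., b_1(cat) = P(cat | not-drinking) = 0.7

# Sanity checks: pi and every row of A and B are probability distributions.
assert np.isclose(pi.sum(), 1)
assert np.allclose(A.sum(axis=1), 1) and np.allclose(B.sum(axis=1), 1)
```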

SLIDE 10

IE with Hidden Markov Models

Given a sequence of observations:

  Yesterday Lawrence Saul spoke this example sentence.

and a trained HMM, find the most likely state sequence (Viterbi):

  Yesterday [Lawrence Saul] spoke this example sentence.

Any words said to be generated by the designated “person name” state are extracted as a person name:

  Person name: Lawrence Saul
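A compact Viterbi sketch over the (pi, A, B) arrays defined in the earlier HMM snippet; a production version would work in log space to avoid underflow:

```python
import numpy as np

def viterbi(obs, pi, A, B):
    """Most likely hidden-state sequence for `obs`, a list of symbol indices."""
    T, N = len(obs), len(pi)
    delta = np.zeros((T, N))           # delta[t, j]: best prob of any path ending in state j at t
    psi = np.zeros((T, N), dtype=int)  # backpointers
    delta[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        for j in range(N):
            scores = delta[t - 1] * A[:, j]
            psi[t, j] = scores.argmax()
            delta[t, j] = scores.max() * B[j, obs[t]]
    # Trace back from the best final state.
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t, path[-1]]))
    return path[::-1]

# e.g., viterbi([0, 1, 1], pi, A, B) -> state indices for the sequence "cat dog dog"
```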

SLIDE 11

HMM Example: “Nymble”

Task: Named Entity Extraction. Train on 450k words of news wire text. [Bikel et al., 1998], [BBN “IdentiFinder”]

[Diagram: HMM with name-class states Person, Org, Other (plus five other name classes) between start-of-sentence and end-of-sentence]

Results:

  Case   Language  F1
  Mixed  English   93%
  Upper  English   91%
  Mixed  Spanish   90%

SLIDE 12

Regrets from Atomic View of Tokens

Would like richer representation of text: multiple overlapping features, whole chunks of text.

Example line, sentence, or paragraph features:
  – length
  – is centered in page
  – percent of non-alphabetics
  – white-space aligns with next line
  – containing sentence has two verbs
  – grammatically contains a question
  – contains links to “authoritative” pages
  – emissions that are uncountable
  – features at multiple levels of granularity

Example word features:
  – identity of word
  – is in all caps
  – ends in “-ski”
  – is part of a noun phrase
  – is in a list of city names
  – is under node X in WordNet or Cyc
  – is in bold font
  – is in hyperlink anchor
  – features of past & future
  – last person name was female
  – next two words are “and Associates”

SLIDE 13

Problems with Richer Representation and a Generative Model

  • These arbitrary features are not independent:
    – Overlapping and long-distance dependences
    – Multiple levels of granularity (words, characters)
    – Multiple modalities (words, formatting, layout)
    – Observations from past and future
  • HMMs are generative models of the text, and generative models do not easily handle these non-independent features. Two choices:
    – Model the dependencies. Each state would have its own Bayes Net. But we are already starved for training data!
    – Ignore the dependencies. This causes “over-counting” of evidence (à la naïve Bayes). Big problem when combining evidence, as in Viterbi!

SLIDE 14

Conditional Model

  • We prefer a model that is trained to maximize a conditional probability rather than a joint probability: P(s|o) instead of P(s,o):
    – Allows arbitrary, non-independent features on the observation sequence
    – Transition probabilities between states might depend on past (and future!) observations
    – Conditionally trained: given some observations (your input), what are the likely labels (states) that the model traverses given this input?

SLIDE 15

Conditional Random Fields (CRFs)

  • CRFs
    – Conditional model
    – Based on “Random Fields”
    – Undirected acyclic graph
    – Allow some transitions to “vote” more strongly than others, depending on the corresponding observations
    – Remember the point: given some observations, what are the labels (states) that likely produced them?

SLIDE 16

Random Field: Definition

  • For a graph G(V,E), let V_i be a random variable.
    If P(V_i | all other V) = P(V_i | its neighbors), then G is a random field.

[Diagram: node A connected to neighbors B and C; nodes D and E elsewhere in the graph]

P(A | B, C, D, E) = P(A | B, C) → Random Field

SLIDE 17

Conditional Random Fields (CRFs)

[Diagram: linear-chain CRF with states S_t, S_{t+1}, …, S_{t+4} conditioned on observations O = O_t, O_{t+1}, O_{t+2}, O_{t+3}, O_{t+4}]

Markov on s, conditional dependency on o.

The Hammersley-Clifford-Besag theorem stipulates that the CRF has this form: an exponential function of the cliques in the graph. Assuming that the dependency structure of the states is tree-shaped (a linear chain is a trivial tree), inference can be done by dynamic programming in time O(|o| |S|^2), just like HMMs.
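For reference, the standard linear-chain form that this theorem licenses (a textbook formula, not reproduced on the slide) is:

  P(s | o) = (1 / Z(o)) · exp( Σ_t Σ_k λ_k f_k(s_{t-1}, s_t, o, t) )

where the f_k are feature functions over cliques (adjacent state pairs plus the whole observation sequence), the λ_k are learned weights, and Z(o) normalizes over all possible state sequences.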

SLIDE 18

Conditional Random Field Example

States = {A, B, C, D, E} and some observation, Blue.

[Diagram: observation Blue; node A with neighbors B and C; nodes D and E elsewhere]

A CRF is such that P(A | Blue, B, C, D, E) = P(A | Blue, B, C), or generally:

  P(Y_i | X, all other Y) = P(Y_i | X, neighbors of Y_i)

The “first-order chain” is common in extraction.

SLIDE 19

CRF: Usefulness

  • A CRF gives us P(label | obs, model)
    – Extraction: find the most probable label sequence (y’s) given an observation sequence (x’s)
      • 2001 Ford Mustang GT V-8 Convertible - $12700
    – No more independence assumption
      • Conditionally trained for the whole label sequence (given the input)
    – “Long range” features (future/past states)
    – Multi-features
  • Now we can use better features!
    – What are good features for identifying a price? (A sketch follows below.)
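As a hedged sketch, a few token features a linear-chain CRF could use for the price question, in the dict-per-token format common to CRF toolkits (such as sklearn-crfsuite); the feature names are illustrative, not from the slides:

```python
import re

def price_features(tokens, i):
    """Illustrative features for token i of a listing like
    '2001 Ford Mustang GT V-8 Convertible - $12700'."""
    w = tokens[i]
    return {
        "word.lower": w.lower(),
        "has_dollar_sign": w.startswith("$"),
        "is_numeric": bool(re.fullmatch(r"\$?\d[\d,]*", w)),
        "looks_like_year": bool(re.fullmatch(r"(19|20)\d{2}", w)),  # a 4-digit number may be a year, not a price
        "prev_is_dash": i > 0 and tokens[i - 1] == "-",             # contextual ("long range") feature
        "is_last_token": i == len(tokens) - 1,                      # prices often end the listing
    }

tokens = "2001 Ford Mustang GT V-8 Convertible - $12700".split()
print(price_features(tokens, len(tokens) - 1))
```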

SLIDE 20

General CRFs vs. HMMs

  • More general and expressive modeling technique
  • Comparable computational efficiency
  • Features may be arbitrary functions of any or all observations
  • Parameters need not fully specify the generation of observations; require less training data
  • Easy to incorporate domain knowledge
  • State means only “state of process”, vs. “state of process” plus “observational history I’m keeping”

SLIDE 21

Person name Extraction

[McCallum 2001, unpublished]

SLIDE 22

Features in Experiment

  • Capitalized: Xxxxx
  • Mixed caps: XxXxxx
  • All caps: XXXXX
  • Initial cap: X….
  • Contains digit: xxx5
  • All lowercase: xxxx
  • Initial: X
  • Punctuation: .,:;!(), etc.
  • Period: .
  • Comma: ,
  • Apostrophe: ‘
  • Dash: -
  • Preceded by HTML tag
  • Character n-gram classifier says string is a person name (80% accurate)
  • In stopword list (the, of, their, etc.)
  • In honorific list (Mr, Mrs, Dr, Sen, etc.)
  • In person suffix list (Jr, Sr, PhD, etc.)
  • In name particle list (de, la, van, der, etc.)
  • In Census lastname list; segmented by P(name)
  • In Census firstname list; segmented by P(name)
  • In locations lists (states, cities, countries)
  • In company name list (“J. C. Penny”)
  • In list of company suffixes (Inc, & Associates, Foundation)
  • Hand-built FSM person-name extractor says yes (prec/recall ~ 30/95)
  • Conjunctions of all previous feature pairs, evaluated at the current time step
  • Conjunctions of all previous feature pairs, evaluated at the current step and one step ahead
  • All previous features, evaluated two steps ahead
  • All previous features, evaluated one step behind

Total number of features = ~200k
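Many of the orthographic features above reduce to simple regular-expression tests; a minimal sketch (predicate names are illustrative):

```python
import re

SHAPE_TESTS = {
    "capitalized_Xxxxx": lambda w: re.fullmatch(r"[A-Z][a-z]+", w) is not None,
    "all_caps_XXXXX":    lambda w: re.fullmatch(r"[A-Z]+", w) is not None,
    "mixed_caps_XxXxxx": lambda w: re.fullmatch(r".*[a-z][A-Z].*", w) is not None,
    "contains_digit":    lambda w: any(c.isdigit() for c in w),
    "all_lowercase":     lambda w: w.islower(),
    "ends_in_ski":       lambda w: w.lower().endswith("ski"),
    "in_honorific_list": lambda w: w.rstrip(".") in {"Mr", "Mrs", "Dr", "Sen"},
}

def word_features(w):
    return {name: test(w) for name, test in SHAPE_TESTS.items()}

print(word_features("Kowalski"))  # capitalized_Xxxxx and ends_in_ski are True
```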

SLIDE 23

Training and Testing

  • Trained on 65,469 words from 85 pages, from 30 different companies’ web sites.
  • Training takes 4 hours on a 1 GHz Pentium.
  • Training precision/recall is 96% / 96%.
  • Tested on a different set of web pages with similar size characteristics.
  • Testing precision is 92 – 95%, recall is 89 – 91%.

SLIDE 24

Rule-based extraction from text

SLIDE 25

KnowItAll Approach to IE

  • KnowItAll is a domain-independent system that extracts facts, concepts, and relations from the Web
    – Entities: ‘Los Angeles’, ‘Albert Einstein’
    – Classes and relations
      • Class instances: ‘Los Angeles’ is a CITY; ‘Albert Einstein’ is a SCIENTIST
  • Unsupervised
    – Does not require manually-labeled data

SLIDE 26

KnowItAll Architecture

Two-stage approach to information extraction:

Extractor
  • Extraction patterns generate candidate facts
    – Pattern: “NP1 such as NP2”
    – “…we offer tours in cities such as Paris and Berlin”
    – Extracts class CITY with instances Paris and Berlin

Assessor
  • Tests candidate facts using Pointwise Mutual Information (PMI)
    – Statistics computed from all text on the Web
  • Uses existing Web search technology to efficiently compute statistics
  • Associates a probability with every fact it extracts
  • Automatically manages the tradeoff between precision and recall

SLIDE 27

Extractor

  • Uses natural language (NLP) patterns to extract instances of classes
  • Automatically instantiates extraction patterns:

    NP1 "such as" NPList2
      & head(NP1) = plural(Class1)
      & properNoun(head(each(NPList2)))
    => instanceOf(Class1, head(each(NPList2)))
    keywords: "plural(Class1) such as"

  • Uses a part-of-speech tagger to identify noun phrases (NP) and proper names

SLIDE 28

Extractor

  • Examples
    – “…offer travel to cities such as Paris and Berlin…”
    – “…famous physicists such as Einstein and Fermi…”
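A crude, regex-only approximation of this rule (the real Extractor uses a part-of-speech tagger to find noun phrases; this sketch just collects capitalized tokens after the keyword phrase):

```python
import re

def such_as_instances(text, plural_class):
    """Rough stand-in for: plural(Class1) "such as" NPList2 => instanceOf(Class1, ...)."""
    pattern = rf"{re.escape(plural_class)}\s+such\s+as\s+((?:[A-Z]\w+(?:,\s*|\s+and\s+)?)+)"
    instances = []
    for match in re.finditer(pattern, text):
        instances += re.findall(r"[A-Z]\w+", match.group(1))
    return instances

print(such_as_instances("we offer tours in cities such as Paris and Berlin", "cities"))
# -> ['Paris', 'Berlin']
```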

SLIDE 29

Search Engine Interface

  • KnowItAll automatically formulates queries based on its extraction rules
  • Queries the search engine with phrases
    – e.g., “cities such as”, to get Web pages with instances of CITY
  • Applies the Extractor to all pages returned by the search
SLIDE 30

Assessor

  • Uses statistics computed over all Web pages to assess the likelihood that an extracted fact is correct
    – Pointwise Mutual Information (PMI):

      PMI(I, D) = |Hits(D + I)| / |Hits(I)|

      • I is an instance, e.g., “Liege”
      • D is a discriminator phrase, e.g., “city of”
      • Hits(x) = number of Web pages that contain x
    – High PMI(“Liege”, “city of”) indicates a high likelihood that “Liege” is a city name
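In code, the score is just a ratio of hit counts; `hit_count` below is a hypothetical stand-in for the search-engine interface, not a real API:

```python
def hit_count(query: str) -> int:
    """Hypothetical: number of Web pages containing `query`
    (KnowItAll obtains this from an existing search engine)."""
    raise NotImplementedError("plug in a search-engine API here")

def pmi(instance: str, discriminator: str) -> float:
    """PMI(I, D) = |Hits(D + I)| / |Hits(I)|."""
    return hit_count(f"{discriminator} {instance}") / hit_count(instance)

# pmi("Liege", "city of"): a high value suggests "Liege" is a city name.
```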

SLIDE 31

Assessor

  • PMI is a feature given to a Naïve Bayes Classifier
    – More likely classes get higher probabilities
    – The probability threshold is a tunable parameter to increase precision (at the expense of recall)

SLIDE 32

Precision and Recall – a recap

[Venn diagram: R = facts about cities; E = extracted facts about cities; E∩R = their overlap]

Precision = |E∩R| / |E|
Recall = |E∩R| / |R|

Goal: make the blue circle (E) overlap more of the yellow circle (R)!
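With illustrative counts (assumed, not from the slides), the two measures work out as:

```python
E, R, overlap = 100, 200, 80   # |E| extracted, |R| real, |E∩R| correct extractions (assumed)
precision = overlap / E        # 0.80: fraction of extracted facts that are correct
recall    = overlap / R        # 0.40: fraction of correct facts that were extracted
print(precision, recall)
```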

SLIDE 33

Enhancements to KnowItAll

  • Enhancements to increase precision & recall:
    – Rule Learning
      • Learns domain-specific rules and validates the accuracy of the instances they extract
    – Subclass Extraction
      • Automatically identifies subclasses
      • Learns that physicists, geologists, etc. are subclasses of scientists
      • The rule “physicists such as …” will extract more scientists
    – List Extraction
      • Locates lists of class instances
      • Learns a wrapper for the list to extract instances
SLIDE 34

Enhancements: Rule Learning

  • Learn domain-specific rules to increase KnowItAll’s precision and recall
    – E.g., “… headquartered in <CITY> …”
  • Procedure:
    1. Start with instances extracted by the generic patterns
    2. Query the search engine with the instances → pages
    3. From each page, extract a context string for each instance (4 words before, and after)
    4. The ‘best’ substrings of the ‘best’ context strings are converted to new extraction rules that extract new instances with high precision
  • Heuristic: prefer substrings that appear in multiple pages
  • Heuristic: penalize substrings that lead to many false positives
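A sketch of the context-string step (step 3): pull the four words before and after each occurrence of a known instance, using naive whitespace tokenization:

```python
def context_strings(page_text, instance, window=4):
    """Up-to-`window` words before and after each occurrence of `instance`."""
    tokens = page_text.split()
    inst = instance.split()
    contexts = []
    for i in range(len(tokens) - len(inst) + 1):
        if tokens[i:i + len(inst)] == inst:
            before = tokens[max(0, i - window):i]
            after = tokens[i + len(inst):i + len(inst) + window]
            contexts.append((before, after))
    return contexts

print(context_strings("the company is headquartered in Seattle near the lake", "Seattle"))
# [(['company', 'is', 'headquartered', 'in'], ['near', 'the', 'lake'])]
```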
SLIDE 35

Examples of Rule Learning

Most productive rules learned for each class, with the number of correct extractions and precision:

  Rule                                 Extractions  Precision
  1. the cities of <city>                     5215       0.80
  2. headquartered in <city>                  4837       0.79
  3. for the city of <city>                   3138       0.79
  4. in the movie <film>                      1841       0.61
  5. <film> the movie starring                 957       0.64
  6. movie review of <film>                    860       0.64
  7. and physicist <scientist>                  89       0.61
  8. physicist <scientist>,                     87       0.59
  9. <scientist>, a British scientist           77       0.65

SLIDE 36

Subclass Extraction

  • Identify subclasses and instantiate new generic patterns
    – PHYSICIST is a subclass of SCIENTIST → new rule “physicists such as …”
    – Increases KnowItAll coverage
  • Subclasses of SCIENTIST found by KnowItAll:

    biologist, zoologist, astronomer, meteorologist, mathematician, economist, geologist, sociologist, chemist, oceanographer, anthropologist, pharmacist, psychologist, climatologist, paleontologist, neuropsychologist, engineer, microbiologist

SLIDE 37

Subclass Extraction

  • 1. Apply Subclass Extraction rules to extract candidate subclasses
    – “… such C1 as CN …” → CN is a subclass of C1
    – “… CN and other C1 …” → CN is a subclass of C1
  • 2. Assess the validity of each candidate
    – Is the subclass in a reference taxonomy (WordNet)?
    – Check word morphology → “microbiologist” is a subclass of “biologist”
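The morphology check amounts to a suffix test, roughly:

```python
def morphology_subclass(candidate: str, cls: str) -> bool:
    """Rough sketch: 'microbiologist' ends with 'biologist', so treat it as a
    plausible subclass; the real Assessor also consults WordNet."""
    return candidate != cls and candidate.endswith(cls)

print(morphology_subclass("microbiologist", "biologist"))  # True
```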

SLIDE 38

List Extraction

  • Extract information from formatted lists
  • Approach:
    – Query the search engine with k random instances extracted by KnowItAll
    – In each Web page, search for a list containing these keywords using an HLRT-like wrapper-induction algorithm*
      • Convert the Web page to a DOM tree
      • Select subtrees corresponding to positive examples
      • Find the greatest common prefixes (and suffixes) for these examples
      • Choose header and tail strings to limit extraction to good subtrees
    – Assess the likelihood of each extracted instance
      • Rank instances by the number of lists they appear in

  * Can learn a wrapper from a few positive examples
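The heart of the wrapper step is finding the longest common prefix and suffix of the contexts around known instances; a minimal string-level sketch (the actual algorithm works over DOM subtrees):

```python
import os

def common_affixes(snippets):
    """Longest shared prefix and suffix across raw strings surrounding known instances."""
    prefix = os.path.commonprefix(snippets)
    suffix = os.path.commonprefix([s[::-1] for s in snippets])[::-1]
    return prefix, suffix

snippets = ["<li><b>Paris</b></li>", "<li><b>Berlin</b></li>"]
print(common_affixes(snippets))  # ('<li><b>', '</b></li>')
```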
SLIDE 39

Evaluation

  • For the classes CITY and FILM
    – Extracted >40k instances (compared to a 10k baseline) at 90% precision
    – Most of the improvement is due to List Extraction
  • For the class SCIENTIST
    – Extracted 40k instances (compared to a ~2k baseline) at 90% precision
    – Most of the improvement is due to Subclass Extraction
SLIDE 40

Discussion

  • Automatic collection of a large body of facts
  • Extracts facts (e.g., instances of classes) from text using generic NLP rules
  • Heuristics added to KnowItAll (rule learning, subclass extraction, list extraction) greatly improve recall while maintaining high precision