IRDM WS 2005

Chapter 8: Information Extraction (IE)
8.1 Motivation and Overview
8.2 Rule-based IE
8.3 Hidden Markov Models (HMMs) for IE
8.4 Linguistic IE
8.5 Entity Reconciliation
8.6 IE for Knowledge Acquisition
IE by text segmentation
Source: concatenation of structured elements with limited reordering and some missing fields – Example: Addresses, bib records
Example (address record):
  4089 Whispering Pines Nobel Drive San Diego CA 92122
  → House number: 4089 | Building: Whispering Pines | Road: Nobel Drive | City: San Diego | State: CA | Zip: 92122

Example (bibliographic record):
  P.P. Wangikar, T.P. Graycar, D.A. Estell, D.S. Clark, J.S. Dordick (1993)
  Protein and Solvent Engineering of Subtilisin BPN' in Nearly Anhydrous Organic Media,
  J. Amer. Chem. Soc. 115, 12231-12237.
  → Author | Year | Title | Journal | Volume | Page
Source: Sunita Sarawagi: Information Extraction Using HMMs, http://www.cs.cmu.edu/~wcohen/10-707/talks/sunita.ppt
8.3 Hidden Markov Models (HMMs) for IE
Idea: a text document is assumed to be generated by a regular grammar (i.e. an FSA)
with some probabilistic variation and uncertainty
→ stochastic FSA = Markov model → HMM

Intuitive explanation:
- associate with each state a tag or symbol category (e.g. noun, verb, phone number, person name) that matches some words in the text;
- the instances of the category are given by a probability distribution of possible outputs in this state;
- the goal is to find a state sequence from an initial to a final state with maximum probability of generating the given text;
- the outputs are known, but the state sequence cannot be observed, hence the name hidden Markov model.
Hidden Markov Models in a Nutshell
- Doubly stochastic models
- Efficient dynamic programming algorithms exist for:
  – finding Pr(S)
  – the highest-probability path P that maximizes Pr(S,P) (Viterbi)
  – training the model (Baum-Welch algorithm)

[Figure: example HMM with states S1..S4, transition probabilities on the edges,
and a per-state output distribution over {A, C}]
Source: Sunita Sarawagi: Information Extraction Using HMMs, http://www.cs.cmu.edu/~wcohen/10-707/talks/sunita.ppt
Hidden Markov Model (HMM): Formal Definition
An HMM is a discrete-time, finite-state Markov model with
- state set S = (s1, ..., sn) and the state in step t denoted X(t),
- initial state probabilities pi (i=1, ..., n),
- transition probabilities pij: S×S→[0,1], denoted p(si→sj),
- output alphabet Σ = {w1, ..., wm}, and
- state-specific output probabilities qik: S×Σ→[0,1], denoted q(si ↑ wk)
  (or transition-specific output probabilities).

Probability of emitting output o1 ... ok ∈ Σ^k is:

  Pr[o1 ... ok] = Σ over x1 ... xk ∈ S^k of  Π i=1..k  p(x(i-1) → xi) · q(xi ↑ oi),
  with p(x0 → x1) := p(x1)

This can be computed iteratively with clever caching and reuse of intermediate
results („memoization“):

  αi(t) := P[o1 ... o(t-1), X(t) = i]
  αi(1) = p(i)
  αj(t+1) = Σ i=1..n  αi(t) · p(si → sj) · q(si ↑ ot)
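The forward recursion can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's code; the two-state model, its state names, and all probabilities below are made-up example values.

```python
# Forward algorithm sketch:
#   alpha_i(1) = p(i)
#   alpha_j(t+1) = sum_i alpha_i(t) * p(s_i -> s_j) * q(s_i ^ o_t)

def forward(init, trans, emit, outputs):
    """Pr[o_1 ... o_k], summed over all state sequences.

    init[i]     = p(i), initial probability of state i
    trans[i][j] = p(s_i -> s_j)
    emit[i][w]  = q(s_i ^ w), probability that state i emits word w
    """
    states = list(init)
    alpha = dict(init)                                   # alpha_i(1) = p(i)
    for o in outputs:                                    # consume o_1 ... o_k
        alpha = {j: sum(alpha[i] * trans[i][j] * emit[i][o] for i in states)
                 for j in states}
    return sum(alpha.values())

# Made-up two-state model for illustration:
init  = {"A": 0.6, "B": 0.4}
trans = {"A": {"A": 0.7, "B": 0.3}, "B": {"A": 0.4, "B": 0.6}}
emit  = {"A": {"x": 0.9, "y": 0.1}, "B": {"x": 0.2, "y": 0.8}}
prob = forward(init, trans, emit, ["x", "y"])            # ≈ 0.209
```

Each step reuses the previous step's α values instead of enumerating all n^k state sequences, which is exactly the memoization the slide refers to.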
Example for Hidden Markov Model
start: p(start)=1
states: title, author, email, address, abstract, section, ...

p[author → author] = 0.5
p[author → address] = 0.2
p[author → email] = 0.3
...
q[author ↑ <firstname>] = 0.1
q[author ↑ <initials>] = 0.2
q[author ↑ <lastname>] = 0.5
...
q[email ↑ @] = 0.2
q[email ↑ .edu] = 0.4
q[email ↑ <lastname>] = 0.3
...
Example
[Figure: a begin/end HMM with states 1..5 over the alphabet {A, C, G, T},
with a per-state emission table (e.g. b1(A)=0.4, b3(C)=0.3) and transition
probabilities 0.5, 0.5, 0.2, 0.8, 0.4, 0.6, 0.1, 0.9, 0.2, 0.8 on its edges]

Probability of emitting AAC along the path π = begin → 1 → 1 → 3 → end (state 5):

  Pr(AAC, π) = a01 · b1(A) · a11 · b1(A) · a13 · b3(C) · a35
             = 0.5 · 0.4 · 0.2 · 0.4 · 0.8 · 0.3 · 0.6
Source: Sunita Sarawagi: Information Extraction Using HMMs, http://www.cs.cmu.edu/~wcohen/10-707/talks/sunita.ppt
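As a sanity check, the path probability above can be multiplied out; this is plain arithmetic on the numbers shown, nothing more:

```python
# Pr(AAC, pi) = a01 * b1(A) * a11 * b1(A) * a13 * b3(C) * a35
prob = 0.5 * 0.4 * 0.2 * 0.4 * 0.8 * 0.3 * 0.6   # ≈ 0.002304
```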
Training of HMM
MLE for HMM parameters (based on fully tagged training sequences):

  p(si → sj) = #transitions(si → sj) / Σx #transitions(si → sx)

  q(si ↑ wk) = #outputs(wk in state si) / #outputs(in state si)

or use a special case of EM (Baum-Welch algorithm)
to incorporate unlabeled data (training: output sequence only, state sequence unknown)
learning of HMM structure (#states, connections): some work, but very difficult
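The counting estimators can be sketched directly; the tagged address tokens below are an invented toy example, not training data from the slides:

```python
from collections import Counter

def mle_train(tagged_seqs):
    """Estimate p(s_i -> s_j) and q(s_i ^ w_k) by counting over
    fully tagged sequences of (word, state) pairs."""
    trans, trans_out = Counter(), Counter()   # transition counts, per-state totals
    emit, emit_out = Counter(), Counter()     # emission counts, per-state totals
    for seq in tagged_seqs:
        for word, state in seq:
            emit[(state, word)] += 1
            emit_out[state] += 1
        for (_, s1), (_, s2) in zip(seq, seq[1:]):
            trans[(s1, s2)] += 1
            trans_out[s1] += 1
    p = {(i, j): n / trans_out[i] for (i, j), n in trans.items()}
    q = {(s, w): n / emit_out[s] for (s, w), n in emit.items()}
    return p, q

# Toy fully tagged address:
seqs = [[("4089", "housenum"), ("Nobel", "road"), ("Drive", "road"),
         ("San", "city"), ("Diego", "city")]]
p, q = mle_train(seqs)
# p[("road", "road")] == 0.5, q[("city", "Diego")] == 0.5
```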
Viterbi Algorithm for the Most Likely State Sequence
Find  argmax over x1 ... xt of  P[state sequence x1 ... xt | output o1 ... ot]

Viterbi algorithm (uses dynamic programming):

  δi(t) := max over x1 ... x(t-1) of  P[x1 ... x(t-1), o1 ... o(t-1), X(t) = i]
  δi(1) = p(i)
  δj(t+1) = max over i = 1, ..., n of  δi(t) · p(si → sj) · q(si ↑ ot)

store the argmax in each step
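A sketch of the recursion with stored backpointers; the toy two-state model and all its probabilities are made-up illustration values (not from the slides):

```python
def viterbi(init, trans, emit, outputs):
    """Most likely state sequence via
    delta_j(t+1) = max_i delta_i(t) * p(s_i -> s_j) * q(s_i ^ o_t),
    storing the argmax of each step for backtracking."""
    states = list(init)
    delta = dict(init)                                   # delta_i(1) = p(i)
    back = []                                            # argmax pointers per step
    for o in outputs:
        step = {j: max(states, key=lambda i: delta[i] * trans[i][j] * emit[i][o])
                for j in states}
        back.append(step)
        delta = {j: delta[step[j]] * trans[step[j]][j] * emit[step[j]][o]
                 for j in states}
    path = [max(delta, key=delta.get)]                   # best final state
    for step in reversed(back):                          # follow backpointers
        path.append(step[path[-1]])
    return path[::-1]

# Made-up two-state model for illustration:
init  = {"A": 0.6, "B": 0.4}
trans = {"A": {"A": 0.7, "B": 0.3}, "B": {"A": 0.4, "B": 0.6}}
emit  = {"A": {"x": 0.9, "y": 0.1}, "B": {"x": 0.2, "y": 0.8}}
path = viterbi(init, trans, emit, ["x", "y"])            # ["A", "B", "B"]
```

Since emissions here hang off the source state, the returned path has one state more than there are outputs: state t emits output t, and the last state emits nothing.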
HMMs for IE
The following 6 slides are from: Sunita Sarawagi: Information Extraction Using HMMs, http://www.cs.cmu.edu/~wcohen/10-707/talks/sunita.ppt
Combining HMMs with Dictionaries
- Augment dictionary
– Example: list of Cities
- Exploit functional dependencies
– Example
- Santa Barbara -> USA
- Piskinov -> Georgia
Example: 2001 University Avenue, Kendall Sq., Piskinov, Georgia
  - House number: 2001 | Road Name: University Avenue | Area: Kendall Sq. | City: Piskinov | State: Georgia
  - House number: 2001 | Road Name: University Avenue | Area: Kendall Sq. | City: Piskinov | Country: Georgia
Combining HMMs with Frequency Constraints
- Including constraints of the form: the same tag cannot appear in two disconnected segments
  – e.g.: Title in a citation cannot appear twice
  – Street name cannot appear twice
- Not relevant for named-entity tagging kinds of problems
→ extend the Viterbi algorithm with constraint handling
Comparative Evaluation
- Naïve model – one state per element in the HMM
- Independent HMM – one HMM per element
- Rule Learning Method – Rapier
- Nested Model – each state in the Naïve model replaced by an HMM
Results: Comparative Evaluation
The Nested model does best in all three cases
(from Borkar 2001)
Dataset                  Elements  Instances
US Addresses                 6        740
Company Addresses            6        769
IITB student Addresses      17       2388
Results: Effect of Feature Hierarchy
Feature Selection showed at least a 3% increase in accuracy
Results: Effect of training data size
HMMs are fast learners: we reach very close to the maximum accuracy with just 50 to 100 addresses
Semi-Markov Models for IE
The following 4 slides are from: William W. Cohen A Century of Progress on Information Integration: a Mid-Term Report http://www.cs.cmu.edu/~wcohen/webdb-talk.ppt
Features for information extraction
Example sentence:

  t:  1      2      3       4       5        6      7      8
  x:  I      met    Prof.   F.      Douglas  at     the    zoo.
  y:  Other  Other  Person  Person  Person   Other  Other  Location

Question: how can we guide this using a dictionary D?
Simple answer: make membership in D a feature fd
Existing Markov models for IE
- Feature vector for each position:
  features of the i-th label, word i & its neighbors, and the previous label
- Parameters: weight W for each feature (vector)
Semi-Markov models for IE

Instead of one label per position t, label whole segments [lj, uj]:

  x:  I        met      Prof. F. Douglas   at       the      zoo.
  y:  Other    Other    Person             Other    Other    Location
      l1=u1=1  l2=u2=2  l3=3, u3=5         l4=u4=6  l5=u5=7  l6=u6=8

COST: requires additional search in Viterbi;
learning and inference slower by O(maxNameLength)
Features for Semi-Markov models
Features now depend on the j-th label, the previous label, and the start and end of segment Sj.
Problems and Extensions of HMMs
- individual output letters/word may not show learnable patterns
→ output words can be entire lexical classes (e.g. numbers, zip codes)
- geared for flat sequences, not for structured text docs
→ use nested HMM where each state can hold another HMM
- cannot capture long-range dependencies
  (e.g. in addresses: with the first word being „Mr.“ or „Mrs.“ the probability of later seeing a P.O. box rather than a street address decreases substantially)
  → use dictionary lookups in critical states and/or combine HMMs with other techniques for long-range effects
  → use semi-Markov models
8.4 Linguistic IE
Preprocess input text using NLP methods:
- Part-of-speech (PoS) tagging:
each word (group) → grammatical role (NP, ADJ, VT, etc.)
- Chunk parsing: sentence → labeled segments (temp. adverb phrase, etc.)
- Link parsing: bridges between logically connected segments
NLP-driven IE tasks:
- Named Entity Recognition (NER)
- Coreference resolution (anaphor resolution)
- Template element construction
- Template relation construction
- Scenario template construction
…
- Logical representation of sentence semantics (e.g., FrameNet)
Named Entity Recognition and Coreference Resolution
Named Entity Recognition (NER):
- Run text through PoS tagging or stochastic-grammar parsing
- Use dictionaries to validate/falsify candidate entities
Example:
The shiny red rocket was fired on Tuesday. It is the brainchild of Dr. Big Head.
- Dr. Head is a staff scientist at We Build Rockets Inc.
→ <person> Dr. Big Head </person> <person> Dr. Head </person> <organization> We Build Rockets Inc </organization> <time> Tuesday </time>
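A minimal sketch of the dictionary-validation step for the example above. The tiny dictionaries, the greedy longest-match scan, and the crude punctuation handling are all simplifying assumptions; a real pipeline would first run PoS tagging or parsing to propose candidates:

```python
# Made-up entity dictionaries for the slide's example sentence.
DICTS = {
    "person": {"dr. big head", "dr. head"},
    "organization": {"we build rockets inc"},
    "time": {"tuesday"},
}
MAX_TOKENS = 4  # longest dictionary phrase, in tokens

def tag_entities(text):
    """Greedy longest-match scan: emit (phrase, type) for dictionary hits."""
    tokens = text.split()
    tagged, i = [], 0
    while i < len(tokens):
        for n in range(min(MAX_TOKENS, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + n]).lower().strip(".,")
            kind = next((t for t, d in DICTS.items() if phrase in d), None)
            if kind:
                tagged.append((phrase, kind))
                i += n
                break
        else:
            i += 1   # no dictionary phrase starts at this token
    return tagged

ents = tag_entities("The rocket was fired on Tuesday. "
                    "It is the brainchild of Dr. Big Head.")
# ents == [("tuesday", "time"), ("dr. big head", "person")]
```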
Coreference resolution (anaphor resolution):
- Connect pronouns etc. to the subject/object of the previous sentence
Examples:
- The shiny red rocket was fired on Tuesday. It is the brainchild of Dr. Big Head.
→ … on Tuesday. It <reference> The shiny red rocket </reference> is the …
- Alas, poor Yorick, I knew him Horatio.
Template Construction
- Identify semantic relations of interest
based on taxonomy of relations & classification
- Fill components of a tuple of an N-ary relation (slots of a frame)
Example:
Thompson is understood to be accused of importing heroin into the United States.
→ <event>
    <type> drug-smuggling </type>
    <destination> <country> United States </country> </destination>
    <source> unknown </source>
    <perpetrator> <person> Thompson </person> </perpetrator>
    <drug> heroin </drug>
  </event>

very difficult; unclear if this works with decent accuracy

Representation of extracted results:
FrameNet (625 different frame types) or similar logic-based representation
Logical Representation by FrameNet
Source: http://framenet.icsi.berkeley.edu/
8.5 Entity Reconciliation (Fuzzy Matching,
Entity Matching/Resolution, Record Linkage)
Problem:
- same entity appears in
- different spellings (incl. mis-spellings, abbr., multilingual, etc.)
e.g. Brittnee Speers vs. Britney Spears Microsoft Research vs. MS Research, Rome vs. Roma vs. Rom
- different levels of completeness
e.g. Britney Spears vs. Britney B. Spears Britney Spears (born Jan 1990) vs. Britney Spears (born 28/1/90) Microsoft (Redmond, USA) vs. Microsoft (Redmond, WA 98002)
- different entities happen to look the same
e.g. George W. Bush vs. George W. Bush, Paris vs. Paris
- Problem even occurs within structured databases and
requires data cleaning when integrating multiple databases (e.g. to build a data warehouse)
- Integrating heterogeneous databases or Deep-Web sources also
requires schema matching
Entity Reconciliation Example
[Figure: databases for two conferences, each with PC/committee, session, and
paper tables; the same people and papers appear under different spellings,
e.g. Alon Halevy / A. Halevy / A. Halewi / Halevy (U Washington / UW Seattle),
Mike Franklin / Michael J. Franklin / M.J. Franklin / M. Franklin
(UC Berkeley / U California), and the paper "Unbreakable X Files" listed in
different sessions]
Entity Reconciliation: More Examples
The following 4 slides are from: William W. Cohen: A Century of Progress on Information Integration: A Mid-Term Report, http://www.cs.cmu.edu/~wcohen/webdb-talk.ppt
Ted Kennedy's “Airport Adventure” [2004]
Washington -- Sen. Edward "Ted" Kennedy said Thursday that he was stopped and questioned at airports on the East Coast five times in March because his name appeared on the government's secret "no-fly" list…Kennedy was stopped because the name "T. Kennedy" has been used as an alias by someone on the list of terrorist suspects.
“…privately they [FAA officials] acknowledged being embarrassed that it took the senator and his staff more than three weeks to get his name removed.”
Florida Felon List [2000, 2004]
The purge of felons from voter rolls has been a thorny issue since the 2000 presidential election. A private company hired to identify ineligible voters before the election produced a list with scores of errors, and elections supervisors used it to remove voters without verifying its accuracy… The new list … contained few people identified as Hispanic; of the nearly 48,000 people on the list created by the Florida Department of Law Enforcement, only 61 were classified as Hispanics.
- Gov. Bush said the mistake occurred
because two databases that were merged to form the disputed list were incompatible. … when voters register in Florida, they can identify themselves as Hispanic. But the potential felons database has no Hispanic category… The glitch in a state that President Bush won by just 537 votes could have been significant — because of the state's sizable Cuban population, Hispanics in Florida have tended to vote Republican… The list had about 28,000 Democrats and around 9,500 Republicans…
Matching University Courses
Goal might be to merge results of two IE systems:
Record A:
  Name: Data Structures in Java | Room: 5032 Wean Hall | Time: 9-11am | Teacher: M. A. Kludge
Record B:
  Name: Introduction to Computer Science | Number: CS 101 | Dept: Computer Science | Start time: 9:10 AM | Topic: Java Programming | TA: John Smith
Record C:
  Title: Intro. to Comp. Sci. | Num: 101 | Teacher: Dr. Klüdge
[Minton, Knoblock, et al 2001], [Doan, Domingos, Halevy 2001], [Richardson & Domingos 2003]
When are two entities the same?
- Bell Labs
- Bell Telephone Labs
- AT&T Bell Labs
- A&T Labs
- AT&T Labs—Research
- AT&T Labs Research, Shannon Laboratory
- Shannon Labs
- Bell Labs Innovations
- Lucent Technologies/Bell Labs Innovations

"History of Innovation: From 1925 to today, AT&T has attracted some of the world's greatest scientists, engineers and developers…" [www.research.att.com]

"Bell Labs Facts: Bell Laboratories, the research and development arm of Lucent Technologies, has been operating continuously since 1925…" [bell-labs.com]
Entity Reconciliation Techniques
- Edit distance measures (both strings and records)
- Exploit context information for higher-confidence matchings
(e.g., publications and co-authors of Dave Dewitt vs. David J. DeWitt)
- Exploit reference dictionaries as ground truth
(e.g. for address cleaning)
- Propagate matching confidence values
in link-/reference-based graph structure
- Statistical learning in graph models
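A sketch of the first technique, a string-similarity matcher. Python's stdlib difflib ratio stands in for a proper edit-distance measure, and the 0.8 threshold and candidate list are made-up illustration values:

```python
import difflib

def similarity(a, b):
    """Character-level similarity in [0, 1] (difflib's ratio as a cheap
    stand-in for an edit-distance-based measure)."""
    return difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio()

def reconcile(name, candidates, threshold=0.8):
    """Candidates likely denoting the same entity as `name`, best first."""
    scored = sorted(((similarity(name, c), c) for c in candidates), reverse=True)
    return [c for s, c in scored if s > threshold]

matches = reconcile("Brittnee Speers", ["Britney Spears", "Mike Franklin"])
# matches == ["Britney Spears"]
```

In practice the similarity score would be combined with the context and dictionary evidence listed above rather than used alone.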
Additional Literature for Chapter 8
IE Overview Material:
- S. Chakrabarti, Section 9.1: Information Extraction
- N. Kushmerick, B. Thomas: Adaptive Information Extraction: Core
Technologies for Information Agents, AgentLink 2003
- H. Cunningham: Information Extraction, Automatic, to appear in:
Encyclopedia of Language and Linguistics, 2005, http://www.gate.ac.uk/ie/
- W.W. Cohen: Information Extraction and Integration: an Overview,
Tutorial Slides, http://www.cs.cmu.edu/~wcohen/ie-survey.ppt
- S. Sarawagi: Automation in Information Extraction and Data
Integration, Tutorial Slides, VLDB 2002, http://www.it.iitb.ac.in/~sunita/
Additional Literature for Chapter 8
Rule- and Pattern-based IE:
- M.E. Califf, R.J. Mooney: Relational Learning of Pattern-Match Rules for
Information Extraction, AAAI Conf. 1999
- S. Soderland: Learning Information Extraction Rules for Semi-Structured and
Free Text, Machine Learning 34, 1999
- Arnaud Sahuguet, Fabien Azavant: Looking at the Web through XML Glasses,
CoopIS Conf. 1999
- V. Crescenzi, G. Mecca: Automatic Information Extraction from
Large Websites, JACM 51(5), 2004
- G. Gottlob, C. Koch, R. Baumgartner, M. Herzog, S. Flesca: The Lixto
Data Extraction Project, PODS 2004
- A. Arasu, H. Garcia-Molina: Extracting Structured Data from Web Pages,
SIGMOD 2003
- A. Finn, N. Kushmerick: Multi-level Boundary Classification for
Information Extraction, ECML 2004
Additional Literature for Chapter 8
HMMs and HMM-based IE:
- Manning / Schütze, Chapter 9: Markov Models
- Duda/Hart/Stork, Section 3.10: Hidden Markov Models
- W.W. Cohen, S. Sarawagi: Exploiting dictionaries in named entity extraction:
combining semi-Markov extraction processes and data integration methods, KDD 2004

Entity Reconciliation:
- W.W. Cohen: An Overview of Information Integration, Keynote Slides,
WebDB 2005, http://www.cs.cmu.edu/~wcohen/webdb-talk.ppt
- S. Chaudhuri, R. Motwani, V. Ganti: Robust Identification of Fuzzy Duplicates,
ICDE 2005

Knowledge Acquisition:
- O. Etzioni: Unsupervised Named-Entity Extraction from the Web:
An Experimental Study, Artificial Intelligence 165(1), 2005
- E. Agichtein, L. Gravano: Snowball: extracting relations from large plain-text
collections, ICDL Conf., 2000
- E. Agichtein, V. Ganti: Mining reference tables for automatic text segmentation,
KDD 2004
- IEEE CS Data Engineering Bulletin 28(4), Dec. 2005, Special Issue on
Searching and Mining Literature Digital Libraries