Information Extraction
- Prof. Sameer Singh
CS 295: STATISTICAL NLP WINTER 2017
February 21, 2017
Based on slides from Dan Jurafsky, Chris Manning, Jay Pujara, and everyone else they copied from.
[Figure: Two ways to answer a query over a massive corpus of unstructured text (documents). Search (IR) runs the query directly against the documents; Information Extraction first converts the documents into a structured representation (database), which is then answered with a DB query.]
Query: Which AI startups have been acquired by Tech companies?
Massive Corpus of News Articles -> Information Extraction -> Structured Representation
Entity types: Company, Industry, People
Relations: acquired, belongsTo, founded, employee, expertIn
Query: Which two characters are not related by blood?
Collection of Books -> Information Extraction -> Structured Representation
Query: What is the interaction pathway between YY1 and TIP60?
Massive Corpus of Scientific Papers -> Information Extraction -> Structured Representation
[Figure: Documents -> Information Extraction -> Database or Graph, which feeds Question Answering, Visualization & Statistics, and downstream AI applications.]
The headquarters of BHP Billiton Limited, and the global headquarters of the combined BHP Billiton Group, are located in Melbourne, Australia.
headquarters(“BHP Billiton Limited”, “Melbourne, Australia”)
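As a minimal sketch of what "structured representation" can mean here, the extracted fact above can be stored as a (relation, subject, object) record and queried directly. The `Fact` type and `query` helper are hypothetical names used only for illustration; they are not from the slides.

```python
from typing import List, NamedTuple, Optional


class Fact(NamedTuple):
    """One extracted relation instance (hypothetical container)."""
    relation: str
    subject: str
    obj: str


# The single fact extracted from the example sentence.
facts = [
    Fact("headquarters", "BHP Billiton Limited", "Melbourne, Australia"),
]


def query(facts: List[Fact], relation: str, subject: Optional[str] = None) -> List[str]:
    """Return the objects of all facts matching the relation (and subject, if given)."""
    return [
        f.obj
        for f in facts
        if f.relation == relation and (subject is None or f.subject == subject)
    ]


hq = query(facts, "headquarters", "BHP Billiton Limited")
print(hq)
```

Once facts are in this form, answering a question becomes a lookup over the records rather than a search over raw text.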
Text:
John was born in Liverpool, to Julia and Alfred Lennon.

Natural Language Processing:
Part-of-speech tags: NNP VBD VBD IN NNP TO NNP CC NNP NNP
Entity types: Person, Location, Person, Person
Mentions elsewhere in the document: "Lennon", "John Lennon", "his mother", "his father", "Alfred", "he", "the Pool"

Literal Facts:
Entities: John Lennon, Alfred Lennon, Julia Lennon, Liverpool
Relations: birthplace, childOf, childOf
Levels of processing for the same sentence:
Sentence level: dependency parsing, part-of-speech tagging, named entity recognition, …
Document level: coreference resolution, …
Information Extraction: entity resolution, entity linking, relation extraction, …
Resulting graph: entities John Lennon, Alfred Lennon, Julia Lennon, Liverpool; relations birthplace, childOf, childOf, spouse
An important sub-task: find and classify names in text, for example:
support for the minority Labor government sounded dramatic but it should not further threaten its stability. When, after the 2010 election, Wilkie, Rob Oakeshott, Tony Windsor and the Greens agreed to support Labor, they gave just two guarantees: confidence and supply.
Entity classes: Person, Date, Location, Organization
Uses in Knowledge Extraction:
How it is done:
John was born in Liverpool, to Julia and Alfred Lennon.
Person Location Person Person
Stanford CoreNLP (http://nlp.stanford.edu/software/CRF-NER.shtml):
3 class: Location, Person, Organization
4 class: Location, Person, Organization, Misc
7 class: Location, Person, Organization, Money, Percent, Date, Time

spaCy.io:
PERSON: People, including fictional.
NORP: Nationalities or religious or political groups.
FACILITY: Buildings, airports, highways, bridges, etc.
ORG: Companies, agencies, institutions, etc.
GPE: Countries, cities, states.
LOC: Non-GPE locations, mountain ranges, bodies of water.
PRODUCT: Objects, vehicles, foods, etc. (Not services.)
EVENT: Named hurricanes, battles, wars, sports events, etc.
WORK_OF_ART: Titles of books, songs, etc.
LANGUAGE: Any named language.
From Ling & Weld. AAAI 2012 (http://aiweb.cs.washington.edu/ai/pubs/ling-aaai12.pdf)
Fine-grained Types
Words Lexicons
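[Table: counts by entity class (drug, company, movie, place, person); recovered row labels include ":" and "field". Cell values as extracted, alignment lost: 18 / 708 6 / 8 6 68 14.]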
Shape(c) = X if c is in A-Z; x if c is in a-z; d if c is in 0-9; c otherwise

Word        Word shape   Short shape
John        Xxxx         Xx
DC-100      XX-ddd       X-d
CamelCase   XxxxxXxxx    XxXx
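The shape mapping can be sketched in a few lines of Python. This is one possible reading of the slide, assuming the short shape is obtained by collapsing runs of identical shape characters; the function names are illustrative.

```python
def char_shape(c: str) -> str:
    """Map one character to its shape symbol: X, x, d, or the character itself."""
    if "A" <= c <= "Z":
        return "X"
    if "a" <= c <= "z":
        return "x"
    if "0" <= c <= "9":
        return "d"
    return c


def word_shape(word: str) -> str:
    """Apply the character mapping to every character of the word."""
    return "".join(char_shape(c) for c in word)


def short_shape(word: str) -> str:
    """Word shape with consecutive repeated symbols collapsed to one."""
    collapsed = []
    for symbol in word_shape(word):
        if not collapsed or collapsed[-1] != symbol:
            collapsed.append(symbol)
    return "".join(collapsed)


for w in ["John", "DC-100", "CamelCase"]:
    print(w, word_shape(w), short_shape(w))
```

Shapes let a tagger generalize to unseen words: a new product code such as a capitalized prefix followed by digits gets the same shape as codes seen in training.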
Features for token i in “John Deere announced” (i = Deere):
BIAS, WORD=Deere, LWORD=deere, FIRSTCAP=True, SSHAPE=Xx, LEXICON=company, …
The same feature templates are also computed for token i-1 (John) and token i+1 (announced) and added with the prefixes PREV_ and NEXT_.
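A hedged sketch of how such a feature dictionary might be built for one token and its neighbors. The feature names follow the slide; the `lexicon` dictionary, the helper names, and the choice to skip a missing neighbor at sentence boundaries are assumptions made for illustration.

```python
def short_shape(word: str) -> str:
    """Collapsed word shape: X for A-Z, x for a-z, d for 0-9, other chars kept."""
    out = []
    for c in word:
        s = "X" if "A" <= c <= "Z" else "x" if "a" <= c <= "z" else "d" if "0" <= c <= "9" else c
        if not out or out[-1] != s:
            out.append(s)
    return "".join(out)


def token_features(tokens, i, lexicon):
    """Features that look only at the token at position i."""
    word = tokens[i]
    feats = {
        "BIAS": True,
        "WORD": word,
        "LWORD": word.lower(),
        "FIRSTCAP": word[:1].isupper(),
        "SSHAPE": short_shape(word),
    }
    if word in lexicon:  # hypothetical word -> category lookup
        feats["LEXICON"] = lexicon[word]
    return feats


def window_features(tokens, i, lexicon):
    """Token features plus PREV_/NEXT_ copies from the neighboring tokens."""
    feats = token_features(tokens, i, lexicon)
    if i > 0:
        for name, value in token_features(tokens, i - 1, lexicon).items():
            feats["PREV_" + name] = value
    if i < len(tokens) - 1:
        for name, value in token_features(tokens, i + 1, lexicon).items():
            feats["NEXT_" + name] = value
    return feats


lexicon = {"Deere": "company"}  # toy lexicon, assumed for the example
feats = window_features(["John", "Deere", "announced"], 1, lexicon)
print(feats)
```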
Steedman, 2000
Parts of Speech
What  a    productive  day   .  Not  .
PRON  DET  ADJ         NOUN  .  ADV  .
Named Entity Recognition
‘  Breaking  Dawn     ’  Returns  to  Vancouver  on  January  11th
O  B-MOVIE   I-MOVIE  O  O        O   B-GEO-LOC  O   O        O
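The B-/I-/O tags above encode entity spans as per-token labels, which is what lets NER be treated as sequence tagging. A small sketch of decoding the tags back into spans follows; the function name is illustrative, and treating a stray I- tag with no matching open span as O is an assumption of this sketch.

```python
def bio_to_spans(tokens, tags):
    """Convert parallel token/BIO-tag lists into (label, text) entity spans."""
    spans = []
    start, label = None, None
    # A trailing "O" sentinel closes any span still open at the end.
    for i, tag in enumerate(list(tags) + ["O"]):
        continues = tag.startswith("I-") and label == tag[2:]
        if start is not None and not continues:
            spans.append((label, " ".join(tokens[start:i])))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
    return spans


tokens = ["‘", "Breaking", "Dawn", "’", "Returns", "to", "Vancouver", "on", "January", "11th"]
tags = ["O", "B-MOVIE", "I-MOVIE", "O", "O", "O", "B-GEO-LOC", "O", "O", "O"]
spans = bio_to_spans(tokens, tags)
print(spans)
```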
Logistic Regression
Conditional Random Fields
Feature Engineering
Viterbi Algorithm
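For reference, a minimal sketch of Viterbi decoding over tag sequences. It assumes additive (log-space) scores supplied as plain dictionaries and a non-empty sentence; in a trained tagger these scores would come from the model, whereas here they are toy values chosen so that the transition penalty on O -> I changes the answer.

```python
def viterbi(emissions, transitions, start):
    """Return the highest-scoring tag path and its score.

    emissions:   list (one entry per token) of dict tag -> score
    transitions: dict (prev_tag, tag) -> score
    start:       dict tag -> score for the first position
    """
    tags = list(start)
    best = [{t: start[t] + emissions[0][t] for t in tags}]
    backpointers = []
    for em in emissions[1:]:
        scores, ptrs = {}, {}
        for t in tags:
            # Best previous tag for ending in tag t at this position.
            prev = max(tags, key=lambda p: best[-1][p] + transitions[(p, t)])
            scores[t] = best[-1][prev] + transitions[(prev, t)] + em[t]
            ptrs[t] = prev
        best.append(scores)
        backpointers.append(ptrs)
    # Follow backpointers from the best final tag.
    last = max(tags, key=lambda t: best[-1][t])
    path = [last]
    for ptrs in reversed(backpointers):
        path.append(ptrs[path[-1]])
    path.reverse()
    return path, best[-1][last]


tag_set = ["O", "B", "I"]
transitions = {(p, t): 0.0 for p in tag_set for t in tag_set}
transitions[("O", "I")] = -10.0  # an I tag should not directly follow O
start = {"O": 0.0, "B": 0.0, "I": -10.0}
emissions = [
    {"O": 1.0, "B": 0.5, "I": 0.0},
    {"O": 0.0, "B": 0.0, "I": 2.0},
]
path, score = viterbi(emissions, transitions, start)
print(path, score)
```

Picking the best tag at each position independently would give O then I here; decoding the whole sequence jointly selects B then I, because the O -> I transition is heavily penalized.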
Homework
Project
Summaries