Information Extraction Prof. Sameer Singh CS 295: STATISTICAL NLP - - PowerPoint PPT Presentation




SLIDE 1

Information Extraction

  • Prof. Sameer Singh

CS 295: STATISTICAL NLP WINTER 2017

February 21, 2017

Based on slides from Dan Jurafsky, Chris Manning, Jay Pujara, and everyone else they copied from.

SLIDE 2

Outline

  • What is Information Extraction
  • Named Entity Recognition
  • Homework 3

SLIDE 4

Making Sense of Text

[Diagram: Query (?) → Search (IR) → Massive Corpus of Unstructured Text (Documents). Alternatively: Documents → Information Extraction → Structured Representation (Database or Graph), answered with a DB Query.]

SLIDE 5

News Articles

Query: Which AI startups have been acquired by Tech companies?

[Diagram: Massive Corpus of News Articles → Information Extraction → Structured Representation, with entity types Company, Industry, People and relations acquired, belongsTo, founded, employee, expertIn.]

SLIDE 6

Fiction

Query: Which two characters are not related by blood?

[Diagram: Collection of Books → Information Extraction → Structured Representation.]

SLIDE 7

Academic Research

Query: What is the interaction pathway between YY1 and TIP60?

[Diagram: Massive Corpus of Scientific Papers → Information Extraction → Structured Representation.]

SLIDE 8

Applications

[Diagram: Documents → Information Extraction → Database or Graph, which feeds:]

  • Question Answering
  • Visualization & Statistics
  • Downstream AI applications

SLIDE 9

Low-level Info. Extraction

SLIDE 10

Slightly better…

SLIDE 11

Slightly better?

The headquarters of BHP Billiton Limited, and the global headquarters of the combined BHP Billiton Group, are located in Melbourne, Australia.

headquarters(“BHP Billiton Limited”, “Melbourne, Australia”)

SLIDE 12

In the industry…

  • Google Knowledge Graph
  • Google Knowledge Vault
  • Amazon Product Graph
  • Facebook Graph API
  • IBM Watson
  • Microsoft Satori
  • Project Hanover/Literome
  • LinkedIn Knowledge Graph
  • Yandex Object Answer
  • Diffbot, GraphIQ, Maana, ParseHub, Reactor Labs, SpazioDati

SLIDE 13

Knowledge Extraction

Text: John was born in Liverpool, to Julia and Alfred Lennon.

Literal Facts:

  • Entities: John Lennon, Alfred Lennon, Julia Lennon, Liverpool
  • Relations: birthplace, childOf, childOf
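The literal facts on this slide can be read as (subject, relation, object) triples. The following is a minimal sketch under stated assumptions: the `Fact` type, the `objects_of` helper, and the edge direction childOf(child, parent) are illustrative choices, not something the slides specify.

```python
from typing import NamedTuple


class Fact(NamedTuple):
    subject: str
    relation: str
    obj: str


# Facts read off "John was born in Liverpool, to Julia and Alfred Lennon."
# Assumption: childOf points from the child to the parent.
FACTS = {
    Fact("John Lennon", "birthplace", "Liverpool"),
    Fact("John Lennon", "childOf", "Julia Lennon"),
    Fact("John Lennon", "childOf", "Alfred Lennon"),
}


def objects_of(facts, subject, relation):
    """Return, sorted, every object linked to `subject` by `relation`."""
    return sorted(
        f.obj for f in facts if f.subject == subject and f.relation == relation
    )


if __name__ == "__main__":
    print(objects_of(FACTS, "John Lennon", "childOf"))
```

Once text is reduced to triples like these, a question such as "who are John Lennon's parents?" becomes a simple lookup rather than a text search.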

SLIDE 14

Role of NLP?

Text: John was born in Liverpool, to Julia and Alfred Lennon.

Natural Language Processing:

  • Part-of-speech tags: John/NNP was/VBD born/VBD in/IN Liverpool/NNP to/TO Julia/NNP and/CC Alfred/NNP Lennon/NNP
  • Entity types: John (Person), Liverpool (Location), Julia (Person), Alfred Lennon (Person)
  • Coreferent mentions: Lennon.., John Lennon..., Mrs. Lennon.., .. his mother .., .. his father, Alfred, he, the Pool

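SLIDE 15

Information Extraction

Text: John was born in Liverpool, to Julia and Alfred Lennon.

  • Part-of-speech tags: John/NNP was/VBD born/VBD in/IN Liverpool/NNP to/TO Julia/NNP and/CC Alfred/NNP Lennon/NNP
  • Entity types: John (Person), Liverpool (Location), Julia (Person), Alfred Lennon (Person)
  • Coreferent mentions: Lennon.., John Lennon..., Mrs. Lennon.., .. his mother .., .. his father, Alfred, he, the Pool

Information Extraction:

  • Entities: John Lennon, Alfred Lennon, Julia Lennon, Liverpool
  • Relations: birthplace, childOf, childOf, spouse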

SLIDE 16

Breaking it Down

Text: John was born in Liverpool, to Julia and Alfred Lennon.

  • Sentence (Dependency Parsing, Part of speech tagging, Named entity recognition…): John/NNP was/VBD born/VBD in/IN Liverpool/NNP to/TO Julia/NNP and/CC Alfred/NNP Lennon/NNP; John (Person), Liverpool (Location), Julia (Person), Alfred Lennon (Person)
  • Document (Coreference Resolution...): Lennon.., John Lennon..., Mrs. Lennon.., .. his mother .., .. his father, Alfred, he, the Pool
  • Information Extraction (Entity resolution, Entity linking, Relation extraction…): entities John Lennon, Alfred Lennon, Julia Lennon, Liverpool; relations birthplace, childOf, childOf, spouse

SLIDE 17

Outline

  • What is Information Extraction
  • Named Entity Recognition
  • Homework 3
  • Relation Extraction

SLIDE 18

Named Entity Recognition

An important sub-task: find and classify names in text, for example:

  • The decision by the independent MP Andrew Wilkie to withdraw his support for the minority Labor government sounded dramatic but it should not further threaten its stability. When, after the 2010 election, Wilkie, Rob Oakeshott, Tony Windsor and the Greens agreed to support Labor, they gave just two guarantees: confidence and supply.

SLIDE 20

Named Entity Recognition

An important sub-task: find and classify names in text, for example:

  • The decision by the independent MP Andrew Wilkie to withdraw his support for the minority Labor government sounded dramatic but it should not further threaten its stability. When, after the 2010 election, Wilkie, Rob Oakeshott, Tony Windsor and the Greens agreed to support Labor, they gave just two guarantees: confidence and supply.

Entity types: Person, Date, Location, Organization

SLIDE 21

Detecting Named Entities

Uses in Knowledge Extraction:

  • Mentions describe the nodes
  • Types are incredibly important!
      • Often restrict relations
  • Fine-grained types are informative!
      • Brooklyn: city
      • Sanders: politician, senator

How it is done:

  • Context is important!
      • Georgia, Washington, …
      • John Deere, Thomas Cook, …
      • Princeton, Amazon, …
  • Label whole sentence together
      • Structured prediction again

Example: John (Person) was born in Liverpool (Location), to Julia (Person) and Alfred Lennon (Person).
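One way to picture how entity types restrict relations is a table of allowed argument types per relation. This is only a sketch: the `RELATION_SIGNATURES` table and the `is_type_compatible` helper are hypothetical names, and the signatures are assumptions chosen to match the Lennon example.

```python
# Hypothetical signatures: relation -> (subject type, object type).
RELATION_SIGNATURES = {
    "birthplace": ("Person", "Location"),
    "childOf": ("Person", "Person"),
}


def is_type_compatible(relation, subject_type, object_type):
    """Return True if the relation is allowed between these two entity types.

    Unknown relations are rejected, since they have no signature to match.
    """
    return RELATION_SIGNATURES.get(relation) == (subject_type, object_type)


if __name__ == "__main__":
    # John (Person) and Liverpool (Location) can stand in a birthplace relation.
    print(is_type_compatible("birthplace", "Person", "Location"))
    # Two Person mentions cannot.
    print(is_type_compatible("birthplace", "Person", "Person"))
```

A check like this lets a relation extractor discard candidate pairs before any scoring, which is one reason accurate (and fine-grained) types matter.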

SLIDE 22

NER: Entity Types

From Stanford CoreNLP (http://nlp.stanford.edu/software/CRF-NER.shtml):

  • 3 class: Location, Person, Organization
  • 4 class: Location, Person, Organization, Misc
  • 7 class: Location, Person, Organization, Money, Percent, Date, Time

From spaCy.io:

  • PERSON: People, including fictional.
  • NORP: Nationalities or religious or political groups.
  • FACILITY: Buildings, airports, highways, bridges, etc.
  • ORG: Companies, agencies, institutions, etc.
  • GPE: Countries, cities, states.
  • LOC: Non-GPE locations, mountain ranges, bodies of water.
  • PRODUCT: Objects, vehicles, foods, etc. (Not services.)
  • EVENT: Named hurricanes, battles, wars, sports events, etc.
  • WORK_OF_ART: Titles of books, songs, etc.
  • LANGUAGE: Any named language.

SLIDE 23

NER: Entity Types

From Ling & Weld. AAAI 2012 (http://aiweb.cs.washington.edu/ai/pubs/ling-aaai12.pdf)

Fine-grained Types

SLIDE 24

SLIDE 25

Sequence Labeling for NER

SLIDE 26

Features: Words and Lexicons

[Figure panels: Words, Lexicons]

SLIDE 27

Features: Prefixes/Suffixes

SLIDE 28

Features: Substrings of Words

[Figure: counts per entity type (drug, company, movie, place, person) for words containing a given substring: “oxa” (e.g. Cotrimoxazole), “:” (e.g. Alien Fury: Countdown to Invasion), and “field” (e.g. Wethersfield).]
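Substring features of this kind can be generated by enumerating character n-grams of each word. The sketch below is an assumption about one reasonable way to do it; the function names, the `SUB=` feature prefix, and the length range 1 to 5 are illustrative and not taken from the slides.

```python
def char_ngrams(word, n_min=1, n_max=5):
    """Return the set of all substrings of `word` with length n_min..n_max."""
    grams = set()
    for n in range(n_min, n_max + 1):
        for start in range(len(word) - n + 1):
            grams.add(word[start:start + n])
    return grams


def substring_features(word):
    """Map each lowercased substring to a binary feature name such as 'SUB=field'."""
    return {"SUB=" + gram.lower() for gram in char_ngrams(word)}


if __name__ == "__main__":
    print("SUB=oxa" in substring_features("Cotrimoxazole"))
    print("SUB=field" in substring_features("Wethersfield"))
```

Prefix and suffix features are the special case where only substrings anchored at the start or end of the word are kept.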

SLIDE 29

Features: Word Shapes

Shape(c) = X if c is A-Z; x if c is a-z; d if c is 0-9; c otherwise (o.w.)

Word          Word shape    Short shape
John          Xxxx          Xx
DC-100        XX-ddd        X-d
CamelCase     XxxxxXxxx     XxXx
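The shape mapping on this slide is small enough to write out directly. This is a sketch, assuming ASCII character ranges as on the slide; the function names are illustrative. The short shape is obtained here by collapsing runs of identical shape characters, which reproduces the slide's examples.

```python
import re


def char_shape(c):
    """Shape of one character: X for A-Z, x for a-z, d for 0-9, else c itself."""
    if "A" <= c <= "Z":
        return "X"
    if "a" <= c <= "z":
        return "x"
    if "0" <= c <= "9":
        return "d"
    return c


def word_shape(word):
    """Full word shape, one shape character per input character."""
    return "".join(char_shape(c) for c in word)


def short_shape(word):
    """Short shape: collapse each run of repeated shape characters to one."""
    return re.sub(r"(.)\1+", r"\1", word_shape(word))


if __name__ == "__main__":
    for w in ["John", "DC-100", "CamelCase"]:
        print(w, word_shape(w), short_shape(w))
```

Note that collapsing runs this way also merges repeated punctuation (for example "--" becomes "-"), which is a design choice of this sketch.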

SLIDE 30

Features: Surrounding Context

Example: John Deere announced (positions i-1, i, i+1)

  • Token i (Deere): BIAS, WORD=Deere, LWORD=deere, FIRSTCAP=True, SSHAPE=Xx, LEXICON=company, …
  • Token i-1 (John): also receives the features of token i, each prefixed with NEXT_ (e.g. NEXT_WORD=Deere)
  • Token i+1 (announced): also receives the features of token i, each prefixed with PREV_ (e.g. PREV_WORD=Deere)
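The context features can be sketched as: compute per-token features, then copy each neighbour's features with a PREV_ or NEXT_ prefix. Everything below is an assumption for illustration: the function names, the dictionary-based `lexicon` argument, and the exact feature set are not the homework's interface.

```python
import re


def short_shape(token):
    """Short word shape: map characters to X/x/d, then collapse repeated runs."""
    shape = re.sub(r"[A-Z]", "X", token)
    shape = re.sub(r"[a-z]", "x", shape)
    shape = re.sub(r"[0-9]", "d", shape)
    return re.sub(r"(.)\1+", r"\1", shape)


def token_features(token, lexicon):
    """Features of a single token, using the feature names from the slide."""
    feats = [
        "BIAS",
        f"WORD={token}",
        f"LWORD={token.lower()}",
        f"FIRSTCAP={token[:1].isupper()}",
        f"SSHAPE={short_shape(token)}",
    ]
    if token in lexicon:  # hypothetical lexicon: token -> category
        feats.append(f"LEXICON={lexicon[token]}")
    return feats


def context_features(tokens, i, lexicon):
    """Features of token i plus prefixed copies of its neighbours' features."""
    feats = token_features(tokens[i], lexicon)
    if i > 0:
        feats += ["PREV_" + f for f in token_features(tokens[i - 1], lexicon)]
    if i + 1 < len(tokens):
        feats += ["NEXT_" + f for f in token_features(tokens[i + 1], lexicon)]
    return feats


if __name__ == "__main__":
    print(context_features(["John", "Deere", "announced"], 1, {"Deere": "company"}))
```

With these features, "John" sees NEXT_LEXICON=company and "announced" sees PREV_WORD=Deere, which is the kind of context that helps separate John Deere the company from a person name.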

SLIDE 31

Outline

  • What is Information Extraction
  • Named Entity Recognition
  • Homework 3
  • Relation Extraction

SLIDE 32

Sequence Tagging on Twitter

Steedman, 2000

Parts of Speech

  What  a    productive  day   .  Not  .
  PRON  DET  ADJ         NOUN  .  ADV  .

Named Entity Recognition

  ‘  Breaking  Dawn     ’  Returns  to  Vancouver  on  January  11th
  O  B-MOVIE   I-MOVIE  O  O        O   B-GEO-LOC  O   O        O
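The B-/I-/O tags in the NER example encode entity spans: B- starts a span, I- continues it, and O is outside any entity. A minimal sketch of decoding such tags back into spans follows; the function name and the choice to drop an I- tag that does not continue an open span are assumptions of this sketch.

```python
def bio_to_spans(tokens, tags):
    """Convert parallel token and BIO-tag lists into (entity_type, text) spans."""
    spans = []
    current_type, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new span starts; close any span that is still open.
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_type == tag[2:]:
            current_tokens.append(token)
        else:
            # "O", or an I- tag that does not continue the open span.
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
    if current_tokens:
        spans.append((current_type, " ".join(current_tokens)))
    return spans


if __name__ == "__main__":
    tokens = "‘ Breaking Dawn ’ Returns to Vancouver on January 11th".split()
    tags = "O B-MOVIE I-MOVIE O O O B-GEO-LOC O O O".split()
    print(bio_to_spans(tokens, tags))
```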

SLIDE 33

Sequence Tagging Models

  • Logistic Regression
  • Conditional Random Fields

SLIDE 34

What do you have to do?

  • Feature Engineering
  • Viterbi Algorithm
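For orientation, here is a generic sketch of the Viterbi algorithm over additive scores. It is not the homework's interface: the function signature, the dictionary-based `emissions` and `transitions` inputs, and the default score of 0.0 for unlisted transitions are all assumptions made for this example.

```python
def viterbi(emissions, transitions, tags):
    """Return the highest-scoring tag sequence under additive scores.

    emissions:   list over positions of {tag: score}
    transitions: {(prev_tag, tag): score}; pairs not listed score 0.0
    """
    if not emissions:
        return []
    # best[i][t] = score of the best path that ends in tag t at position i
    best = [{t: emissions[0][t] for t in tags}]
    back = []  # back[i - 1][t] = best previous tag for tag t at position i
    for i in range(1, len(emissions)):
        scores, pointers = {}, {}
        for t in tags:
            prev = max(tags, key=lambda p: best[-1][p] + transitions.get((p, t), 0.0))
            scores[t] = best[-1][prev] + transitions.get((prev, t), 0.0) + emissions[i][t]
            pointers[t] = prev
        best.append(scores)
        back.append(pointers)
    # Start from the best final tag and follow the back-pointers.
    tag = max(tags, key=lambda t: best[-1][t])
    path = [tag]
    for pointers in reversed(back):
        tag = pointers[tag]
        path.append(tag)
    return path[::-1]


if __name__ == "__main__":
    tags = ["O", "B", "I"]
    emissions = [{"O": 1.0, "B": 0.9, "I": 0.0}, {"O": 0.0, "B": 0.0, "I": 1.0}]
    transitions = {("O", "I"): -10.0}  # penalise I directly after O
    print(viterbi(emissions, transitions, tags))
```

In the usage example, picking the best tag at each position independently would give O then I, but the transition penalty makes B then I the best joint sequence, which is the point of labeling the whole sentence together.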

Test data will be released very close to the deadline!

SLIDE 35

Upcoming…

Homework

  • Homework 3 is due on February 27
  • Write-up and data have been released.

Project

  • Status report due in 1.5 weeks: March 2, 2017
  • Instructions coming soon
  • Only 5 pages

Summaries

  • Paper summaries: February 28, March 14
  • Only 1 page each