SLIDE 1

Mining Knowledge Graphs from Text

WSDM 2018 JAY PUJARA, SAMEER SINGH

SLIDE 2

Tutorial Overview

Part 1: Knowledge Graphs
Part 2: Knowledge Extraction
Part 3: Graph Construction
Part 4: Critical Analysis

SLIDE 3

Tutorial Outline

  • 1. Knowledge Graph Primer

[Jay]

  • 2. Knowledge Extraction Primer

[Jay]

  • 3. Knowledge Graph Construction

a. Probabilistic Models [Jay]
(Coffee Break)
b. Embedding Techniques [Sameer]

  • 4. Critical Overview and Conclusion [Sameer]

SLIDE 4

What is NLP?

Text is unstructured and ambiguous, and there is lots and lots of it! Humans can read it, but only very slowly; they can't remember all of it or answer questions over it. Information Extraction turns text into "Knowledge": structured, precise, actionable, specific to the task, and usable by downstream applications, such as creating Knowledge Graphs!

SLIDE 5

Knowledge Extraction

[Figure: the extraction pipeline, Text → Annotated text → Extraction graph. NLP annotates the sentence "John was born in Liverpool, to Julia and Alfred Lennon." with POS tags (NNP, VBD, IN, …), entity types (Person, Location), and coreference mentions (Lennon, John Lennon, Mrs. Lennon, his mother, his father Alfred, he, the Pool). Information Extraction then produces an extraction graph linking John Lennon to Liverpool (birthplace) and to Julia Lennon and Alfred Lennon (childOf).]

SLIDE 6

Breaking it Down

John was born in Liverpool, to Julia and Alfred Lennon.

  • Sentence level: part-of-speech tagging (NNP, VBD, IN, …), dependency parsing, named entity recognition (Person, Location)…
  • Document level: coreference resolution (Lennon, John Lennon, Mrs. Lennon, his mother, his father Alfred, he, the Pool)…
  • Information extraction: entity resolution, entity linking, relation extraction… yielding a graph over John Lennon, Julia Lennon, Alfred Lennon, and Liverpool with relations such as birthplace, childOf, and spouse

SLIDE 7

Tagging the Parts of Speech

John/NNP was/VBD born/VBN in/IN Liverpool/NNP to/TO Julia/NNP and/CC Alfred/NNP Lennon/NNP

Nouns are entities; verbs are relations.

  • Common approaches include CRFs, CNNs, LSTMs
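As a contrast to the learned taggers above, a most-frequent-tag baseline can be sketched in a few lines. This is an illustrative toy, not one of the models the slide names; the training data and the NNP fallback for unseen words are made up.

```python
# Most-frequent-tag baseline for POS tagging: a toy contrast to the
# CRF/CNN/LSTM taggers mentioned above. Training data is made up.
from collections import Counter, defaultdict

def train_baseline(tagged_sentences):
    """Count how often each word receives each tag."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word][tag] += 1
    # Keep only the single most frequent tag per word.
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}

def tag(lexicon, words, default="NNP"):
    """Tag each word; unseen words fall back to a default (assumed) tag."""
    return [(w, lexicon.get(w, default)) for w in words]

train = [[("John", "NNP"), ("was", "VBD"), ("born", "VBN"),
          ("in", "IN"), ("Liverpool", "NNP")]]
lexicon = train_baseline(train)
print(tag(lexicon, ["John", "was", "born", "in", "Liverpool"]))
```

Despite its simplicity, this kind of baseline is a standard sanity check before training a sequence model.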
SLIDE 8

Detecting Named Entities

John [Person] was born in Liverpool [Location], to Julia [Person] and Alfred Lennon [Person].

  • Structured prediction approaches
  • Capture entity mentions and entity types
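The mention-and-type output can be illustrated by decoding BIO tags, a common encoding for structured NER prediction; the tag sequence here is hand-supplied rather than produced by a model.

```python
# Minimal sketch: decode BIO tags (the output of a structured NER
# predictor) into typed entity mentions. Tags are hand-supplied here.
def decode_bio(tokens, tags):
    """Group B-/I- tagged tokens into (mention, type) spans."""
    entities, current, etype = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                entities.append((" ".join(current), etype))
            current, etype = [token], tag[2:]
        elif tag.startswith("I-") and current:
            current.append(token)
        else:
            if current:
                entities.append((" ".join(current), etype))
            current, etype = [], None
    if current:
        entities.append((" ".join(current), etype))
    return entities

tokens = ["John", "was", "born", "in", "Liverpool", ",", "to",
          "Julia", "and", "Alfred", "Lennon", "."]
tags = ["B-PER", "O", "O", "O", "B-LOC", "O", "O",
        "B-PER", "O", "B-PER", "I-PER", "O"]
print(decode_bio(tokens, tags))
# [('John', 'PER'), ('Liverpool', 'LOC'), ('Julia', 'PER'), ('Alfred Lennon', 'PER')]
```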
SLIDE 9

NLP annotations → features for IE

Combine tokens, dependency paths, and entity types to define rules.

Pattern: Argument 1 = [Person], Argument 2 = [Organization], connected by an appositive "…, the CEO of …" (dependency edges appos, nmod, case, det).

  • Bill Gates, the CEO of Microsoft, said …
  • Mr. Jobs, the brilliant and charming CEO of Apple Inc., said …
  • … announced by Steve Jobs, the CEO of Apple.
  • … announced by Bill Gates, the director and CEO of Microsoft.
  • … mused Bill, a former CEO of Microsoft.
  • …and many other possible instantiations
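One such rule can be approximated as a regular expression over the raw sentence. This is a toy stand-in for matching over dependency paths; the pattern below is invented for illustration and is far looser than a real dependency-based rule.

```python
import re

# One hand-written extraction rule in the spirit of this slide: surface
# tokens plus entity-type-shaped placeholders yield (person, CEO_of, org)
# facts. A real system would match over dependency paths, not raw text;
# this regex is an invented approximation.
PATTERN = re.compile(
    r"(?P<arg1>[A-Z][a-z]+(?: [A-Z][a-z]+)*)"        # Person-like span
    r", (?:the|a) (?:[a-z]+ )*CEO of "               # appositive '... CEO of'
    r"(?P<arg2>[A-Z][A-Za-z]+(?: [A-Z][a-z]+\.?)*)"  # Organization-like span
)

def extract(sentence):
    m = PATTERN.search(sentence)
    return (m.group("arg1"), "CEO_of", m.group("arg2")) if m else None

print(extract("Bill Gates, the CEO of Microsoft, said hello."))
# ('Bill Gates', 'CEO_of', 'Microsoft')
```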

SLIDE 10

Within-document Coreference

John was born in Liverpool, to Julia and Alfred Lennon.

Mentions to resolve across the document: Lennon, John Lennon, He, his; Mrs. Lennon, his mother; his father Alfred; the Pool.

  • Pairwise model for each noun/pronoun
  • Can consolidate information, provide context
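The pairwise idea can be sketched with hand-set scores standing in for a learned classifier; the score values, pronoun list, and threshold below are all invented for illustration.

```python
# Pairwise coreference sketch: score each (earlier mention, mention) pair
# and link when the score clears a threshold. Real systems learn these
# weights; the scores, pronoun list, and threshold here are invented.
PRONOUNS = {"he", "his", "him", "she", "her", "they", "them"}

def pair_score(antecedent, mention):
    a, m = antecedent.lower(), mention.lower()
    if a == m:
        return 1.0   # exact string match
    if m in PRONOUNS:
        return 0.6   # a pronoun links to some prior mention
    if set(a.split()) & set(m.split()):
        return 0.8   # partial name overlap ("John" ~ "John Lennon")
    return 0.0

def resolve(mentions, threshold=0.5):
    """Link each mention to its best-scoring earlier mention (ties favour recency)."""
    links = {}
    for i, mention in enumerate(mentions):
        scored = [(pair_score(mentions[j], mention), j) for j in range(i)]
        if scored:
            best_score, j = max(scored)
            if best_score >= threshold:
                links[i] = j
    return links

print(resolve(["John", "Liverpool", "John Lennon", "he"]))  # {2: 0, 3: 2}
```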
SLIDE 11

Entity Resolution & Linking

  • ...during the late 60's and early 70's, Kevin Smith worked with several local...
  • ...the term hip-hop is attributed to Lovebug Starski. What does it actually mean...
  • The filmmaker Kevin Smith returns to the role of Silent Bob...
  • Nothing could be more irrelevant to Kevin Smith's audacious ''Dogma'' than ticking off...
  • Back in 2008, the Lions drafted Kevin Smith, even though Smith was badly...
  • ...backfield in the wake of Kevin Smith's knee injury, and the addition of Haynesworth...
  • ..."The Physiological Basis of Politics," by Kevin Smith, Douglas Oxley, Matthew Hibbing...

SLIDE 12

Entity Names: Two Main Problems

Different names for the same entity:
  • Inconsistent references: MSFT, APPL, GOOG…
  • Typos/misspellings: Baarak, Barak, Barrack, …
  • Nicknames: Bam Bam, Drumpf, …

Different entities with the same name:
  • Partial references and things named after each other (first names of people, location instead of team name, nicknames): Clinton, Washington, Paris, Amazon, Princeton, Kingston, …
  • Same type of entities sharing a name: Kevin Smith, John Smith, Springfield, …

SLIDE 13

Entity Linking Approach


Washington drops 10 points after game with UCLA Bruins.

  • Candidate generation: Washington DC, George Washington, Washington state, Lake Washington, Washington Huskies, Denzel Washington, University of Washington, Washington High School, …
  • Entity types: keep only candidates compatible with the predicted LOC/ORG type
  • Coreference: bring in evidence from coreferent mentions (UWashington, Huskies)
  • Coherence: prefer candidates consistent with other entities in the text (UCLA Bruins, USC Trojans)

[Vinculum, Ling, Singh, Weld, TACL 2015]
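The four stages can be sketched end to end, with a hand-built alias dictionary standing in for real candidate generation; all names, types, and scoring choices here are illustrative, not Vinculum's actual method.

```python
# Sketch of the candidate-pruning pipeline on this slide: generate
# candidates from an alias dictionary, filter by NER type, then re-rank
# by coherence with other entities in the document. All data is hand-made.
CANDIDATES = {
    "Washington": [
        ("Washington DC", "LOC"), ("George Washington", "PER"),
        ("Washington state", "LOC"), ("Washington Huskies", "ORG"),
        ("Denzel Washington", "PER"), ("University of Washington", "ORG"),
    ],
}

def link(mention, expected_types, context_entities):
    # 1. Candidate generation.
    candidates = CANDIDATES.get(mention, [])
    # 2. Entity types: drop candidates that conflict with the NER prediction.
    candidates = [(name, t) for name, t in candidates if t in expected_types]
    if not candidates:
        return None
    # 3. Coherence: prefer candidates sharing words with co-occurring entities.
    context_words = {w for e in context_entities for w in e.lower().split()}
    def coherence(name):
        return len(set(name.lower().split()) & context_words)
    return max(candidates, key=lambda c: coherence(c[0]))[0]

print(link("Washington", {"LOC", "ORG"}, ["UCLA Bruins", "Huskies"]))
# Washington Huskies
```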

SLIDE 14

Information Extraction

[Figure (recap): Information Extraction maps the annotated sentence "John was born in Liverpool, to Julia and Alfred Lennon." (POS tags, entity types, coreference chains) to the extraction graph over John Lennon, Julia Lennon, Alfred Lennon, and Liverpool, with relations birthplace, childOf, and spouse.]

SLIDE 15

Information Extraction

3 CONCRETE SUB-PROBLEMS
  • Defining the domain
  • Learning extractors
  • Scoring candidate facts

3 LEVELS OF SUPERVISION
  • Supervised
  • Semi-supervised
  • Unsupervised

SLIDE 16

Effect of supervision on extractions


[Figure: the supervision spectrum. Supervised methods sit at the high-precision, high-human-effort end; unsupervised methods sit at the high-recall, high-speed end.]

SLIDE 17

Information Extraction

3 CONCRETE SUB-PROBLEMS
  • Defining the domain (up next)
  • Learning extractors
  • Scoring candidate facts

3 LEVELS OF SUPERVISION
  • Supervised
  • Semi-supervised
  • Unsupervised

SLIDE 18

Defining Domain: Manual

[Figure: a manually defined ontology. Everything divides into Animals (Mammals, Reptiles) and Food (Fruits, Vegetables), with subset and disjointness constraints between categories and relations such as consumes defined between them.]

[Toward an Architecture for Never-Ending Language Learning, Carlson et al., AAAI 2010]

SLIDE 19

Defining Domain: Semi-automatic

  • A subset of types is manually defined
  • Semi-supervised learning methods discover new types from unlabeled data

[Figure: the manual ontology (Everything, Animals, Mammals, Reptiles, Food, Fruits, Vegetables) extended with discovered types such as Beverages and Location (Country, City).]

[Exploratory Learning, Dalvi et al., ECML 2013]
[Hierarchical Semi-supervised Classification with Incomplete Class Hierarchies, Dalvi et al., WSDM 2016]

SLIDE 20

Defining Domain: Automatic

  • Any noun phrase is a candidate entity: dog, cat, cow, reptile, mammal, apple, greens, mixed greens, lettuce, red leaf lettuce, romaine lettuce, iceberg lettuce…
  • Any verb phrase is a candidate relation: eats, feasts on, grazes, consumes, …


[Open Information Extraction from the Web, Banko et al., IJCAI 2007]
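In that spirit, a toy Open-IE extractor over POS-tagged text might treat any noun-phrase/verb-phrase/noun-phrase run as a candidate triple. The chunking rules and example tags below are simplified assumptions, not the actual TextRunner/Open IE algorithm.

```python
# Toy Open IE: over POS-tagged text, any noun-phrase / verb-phrase /
# noun-phrase run becomes a candidate triple. The chunking rules are a
# simplification, and the tags are supplied by hand instead of by a tagger.
def open_ie(tagged):
    """Extract (arg1, relation, arg2) triples from simple NP-VP-NP runs."""
    def read(pred, j):
        # Greedily collect consecutive tokens whose tag satisfies pred.
        chunk = []
        while j < len(tagged) and pred(tagged[j][1]):
            chunk.append(tagged[j][0])
            j += 1
        return " ".join(chunk), j

    triples, i = [], 0
    while i < len(tagged):
        arg1, j = read(lambda t: t.startswith("NN"), i)
        rel, k = read(lambda t: t.startswith("VB") or t == "IN", j)
        arg2, m = read(lambda t: t.startswith("NN"), k)
        if arg1 and rel and arg2:
            triples.append((arg1, rel, arg2))
        i = max(m, i + 1)
    return triples

tagged = [("cows", "NNS"), ("graze", "VBP"), ("on", "IN"), ("grass", "NN")]
print(open_ie(tagged))  # [('cows', 'graze on', 'grass')]
```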

SLIDE 21

Information Extraction

3 CONCRETE SUB-PROBLEMS
  • Defining the domain
  • Learning extractors (up next)
  • Scoring candidate facts

3 LEVELS OF SUPERVISION
  • Supervised
  • Semi-supervised
  • Unsupervised

SLIDE 22

Learning Extractors

  • Supervised: high-precision patterns, e.g. <PERSON> plays in <BAND>
  • Semi-supervised: bootstrapping to learn patterns; create seed examples such as (John Lennon, Beatles), find the patterns that connect them, and manually filter out incorrect patterns
  • Unsupervised: cluster phrases with constraints; identify candidate verb phrases, find candidate arguments, and cluster by NER types
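The bootstrapping loop can be sketched as follows, with a made-up corpus and seed pair; real systems additionally score and prune the induced patterns to limit semantic drift.

```python
import re

# Bootstrapping sketch: from seed (person, band) pairs, induce the text
# patterns that connect them, then apply those patterns to harvest new
# pairs. The corpus and seeds are made up for illustration.
corpus = [
    "John Lennon plays in the Beatles .",
    "Dave Grohl plays in Nirvana .",
    "Freddie Mercury sings for Queen .",
]
seeds = {("John Lennon", "the Beatles")}

def find_patterns(corpus, pairs):
    """Turn each sentence containing a seed pair into a pattern with slots."""
    patterns = set()
    for sentence in corpus:
        for person, band in pairs:
            if person in sentence and band in sentence:
                pattern = re.escape(sentence)
                pattern = pattern.replace(re.escape(person), "(?P<arg1>.+)")
                pattern = pattern.replace(re.escape(band), "(?P<arg2>.+)")
                patterns.add(pattern)
    return patterns

def apply_patterns(corpus, patterns):
    """Match every pattern against every sentence to extract new pairs."""
    pairs = set()
    for sentence in corpus:
        for p in patterns:
            m = re.fullmatch(p, sentence)
            if m:
                pairs.add((m.group("arg1"), m.group("arg2")))
    return pairs

patterns = find_patterns(corpus, seeds)
print(apply_patterns(corpus, patterns))
```

Running this harvests (Dave Grohl, Nirvana) from the second sentence while leaving the "sings for" sentence unmatched.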

SLIDE 23

Information Extraction

3 CONCRETE SUB-PROBLEMS
  • Defining the domain
  • Learning extractors
  • Scoring candidate facts (up next)

3 LEVELS OF SUPERVISION
  • Supervised
  • Semi-supervised
  • Unsupervised

SLIDE 24

Scoring the candidate facts

  • Supervised: a human-defined scoring function, or a scoring function learnt using supervised ML with a large amount of training data {expensive, high precision}
  • Semi-supervised: a small amount of training data is available; scoring is refined over multiple iterations using both labeled and unlabeled data
  • Unsupervised (self-training), completely automatic {cheap, leads to semantic drift}:
    Confidence(extraction pattern) ∝ #unique instances it could extract
    Score(candidate fact) ∝ #distinct extraction patterns that support it
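The two self-training heuristics above can be written down directly; the (pattern, fact) records below are fabricated for illustration.

```python
# Direct transcription of the self-training heuristics above: a pattern's
# confidence grows with the unique instances it extracts, and a fact's
# score grows with the distinct patterns that support it. The
# (pattern, fact) records below are fabricated for illustration.
from collections import defaultdict

extractions = [
    ("X plays in Y", ("John Lennon", "Beatles")),
    ("X plays in Y", ("Dave Grohl", "Nirvana")),
    ("X, member of Y", ("John Lennon", "Beatles")),
    ("X founded Y", ("John Lennon", "Beatles")),
]

pattern_instances = defaultdict(set)   # pattern -> unique facts it extracted
fact_patterns = defaultdict(set)       # fact -> distinct supporting patterns
for pattern, fact in extractions:
    pattern_instances[pattern].add(fact)
    fact_patterns[fact].add(pattern)

def pattern_confidence(pattern):
    return len(pattern_instances[pattern])

def fact_score(fact):
    return len(fact_patterns[fact])

print(pattern_confidence("X plays in Y"))      # 2
print(fact_score(("John Lennon", "Beatles")))  # 3
```

The drift risk is visible even here: one spurious pattern that fires on a popular fact raises that fact's score, which in turn lends confidence to the pattern.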

SLIDE 25

Impact of early supervision

Sub-problems: defining the domain, extractors for each relation of interest, scoring the candidate facts.

Benefits of early supervision:
  • Puts constraints on the space of possibly true extractions
  • Early removal of noisy extraction patterns can avoid semantic drift in later stages
  • Enables inheritance and mutual exclusion at the extractor level

Cost: domain expertise is needed.

SLIDE 26

Effect of supervision on extractions

[Figure (recap): the supervision spectrum. Supervised methods sit at the high-precision, high-human-effort end; unsupervised methods sit at the high-recall, high-speed end.]

SLIDE 27

IE Systems in Practice

[Table: deployed systems (ConceptNet, NELL, Knowledge Vault, OpenIE) compared across the sub-problems of defining the domain, learning extractors, scoring candidate facts, and fusing extractors, with scoring approaches ranging from heuristic rules to learned classifiers.]

SLIDE 28

Knowledge Extraction: Key Points

  • Built on the foundation of NLP techniques: part-of-speech tagging, dependency parsing, named entity recognition, coreference resolution…
  • Challenging problems with very useful outputs
  • Information extraction techniques use NLP to:
    • define the domain
    • extract entities and relations
    • score candidate outputs
  • Trade-off between manual and automatic methods