SLIDE 1

KV @ KDD 2014

Knowledge Vault: a web-scale approach to probabilistic knowledge fusion

Luna Dong, Evgeniy Gabrilovich, Geremy Heitz, Wilko Horn, Ni Lao, Kevin Murphy, Thomas Strohmann, Shaohua Sun, Wei Zhang
Google (Machine Intelligence group)

SLIDE 2

Outline of the talk

1. Knowledge Graph
2. Knowledge Vault
3. Fact mining from the web
4. Fact mining from graphs
5. Knowledge Fusion

SLIDE 3

A Knowledge Graph is a multi-graph where nodes = entities, edges = relations

[Figure: example graph. Nodes: Kobe Bryant, Pau Gasol, LA Lakers, NY Knicks. Edge labels: playFor, teammate, playInLeague, teamInLeague, opponent]
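
A minimal sketch of the multigraph idea, assuming a plain in-memory dictionary store. The class name `KnowledgeGraph`, its methods, and the specific edge endpoints are illustrative assumptions (the slide lists node and edge labels but not which edge connects which pair).

```python
from collections import defaultdict


class KnowledgeGraph:
    """Directed, labeled multigraph: one entity pair may be linked by several relations."""

    def __init__(self):
        # subject -> relation -> set of objects
        self._out = defaultdict(lambda: defaultdict(set))

    def add(self, subject, relation, obj):
        self._out[subject][relation].add(obj)

    def objects(self, subject, relation):
        """All objects o such that (subject, relation, o) is an edge."""
        return set(self._out.get(subject, {}).get(relation, set()))

    def relations_between(self, subject, obj):
        """All relation labels on edges from subject to obj."""
        return {r for r, objs in self._out.get(subject, {}).items() if obj in objs}


kg = KnowledgeGraph()
# Edge endpoints below are guesses for illustration only.
kg.add("Kobe Bryant", "playFor", "LA Lakers")
kg.add("Pau Gasol", "playFor", "LA Lakers")
kg.add("Kobe Bryant", "teammate", "Pau Gasol")
kg.add("LA Lakers", "opponent", "NY Knicks")

print(kg.objects("Kobe Bryant", "playFor"))
print(kg.relations_between("Kobe Bryant", "Pau Gasol"))
```

Storing edges as subject -> relation -> objects keeps lookups of the form "who does X play for?" cheap; a missing edge simply returns an empty set, which is how incompleteness shows up in practice.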

SLIDE 4

Example Knowledge Graphs

  • Walmart’s Kosmix
  • Google’s KG
  • Facebook’s Entity Graph
  • Microsoft’s Satori

SLIDE 5

KV Talk at KDD, New York, August 25, 2014

Freebase (FB) is created by fusing structured data sources and human contributions

[Figure: FB, with entity types (people, movies, companies, places, products) and sources (Wikipedia, Geo, MusicBrainz, TVDB)]

SLIDE 6

The long tail of knowledge

  • FB is very large (40M nodes, 637M edges)
  • But it is still very incomplete:
  • We are missing many edges (facts)
  • We are also missing many nodes (entities)
  • We are also missing many edge types (schema)

Relation          % unknown in Freebase
Profession        68%
Place of birth    71%
Nationality       75%
Education         91%
Spouse            92%
Parents           94%

This talk

SLIDE 7

Outline of the talk

1. Knowledge Graph
2. Knowledge Vault
3. Fact mining from the web
4. Fact mining from graphs
5. Knowledge Fusion

SLIDE 8

From Knowledge Graph to Knowledge Vault

  • There are many groups at Google working on enlarging KG while maintaining high precision.
  • KV is an exploratory research project to investigate other points along the precision-recall curve.
  • KV automatically extracts facts from public web sources.
  • KV embraces the inherent uncertainty associated with this process (every fact has associated confidence and provenance info).
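
A small sketch of what "confidence and provenance" plus "points along the precision-recall curve" could look like in code. The `ScoredTriple` fields, the example URLs, the confidences, and the ground-truth set are all made-up assumptions for illustration, not data from the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScoredTriple:
    subject: str
    relation: str
    obj: str
    confidence: float        # estimated probability that the triple is true
    provenance: tuple = ()   # hypothetical format: source URLs or extractor names


def precision_recall(triples, truth, threshold):
    """Precision and recall of the triples kept at a given confidence threshold."""
    kept = [t for t in triples if t.confidence >= threshold]
    true_pos = sum(1 for t in kept if (t.subject, t.relation, t.obj) in truth)
    precision = true_pos / len(kept) if kept else 1.0
    recall = true_pos / len(truth) if truth else 1.0
    return precision, recall


# Toy data: confidences and labels are invented.
TRIPLES = [
    ScoredTriple("Kobe Bryant", "playFor", "LA Lakers", 0.95, ("http://example.com/a",)),
    ScoredTriple("Pau Gasol", "playFor", "LA Lakers", 0.60, ("http://example.com/b",)),
    ScoredTriple("Kobe Bryant", "playFor", "NY Knicks", 0.55, ("http://example.com/c",)),
    ScoredTriple("Pau Gasol", "playFor", "NY Knicks", 0.20, ("http://example.com/d",)),
]
TRUTH = {
    ("Kobe Bryant", "playFor", "LA Lakers"),
    ("Pau Gasol", "playFor", "LA Lakers"),
}

for threshold in (0.9, 0.5):
    print(threshold, precision_recall(TRIPLES, TRUTH, threshold))
```

Lowering the threshold admits more triples, which raises recall at the cost of precision; keeping the confidence on every triple lets each consumer pick its own operating point.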

SLIDE 9

Previous projects on automatically building KBs (e.g., NELL, YAGO) predict facts based on text

[Figure: candidate edge (Kobe Bryant, playFor, LA Lakers) marked "?", supported by the text snippets “Kobe Bryant, the franchise player of the Lakers”, “Kobe once again saved his team”, “Kobe Bryant man of the match for Los Angeles”]

Pr(<s, r, o> = 1 | D)

SLIDE 10

KV: Predict new facts based on text AND existing edges in FB

[Figure: candidate edge (Kobe Bryant, playFor, LA Lakers) marked "?", supported by the text snippets “Kobe Bryant, the franchise player of the Lakers”, “Kobe once again saved his team”, “Kobe Bryant man of the match for Los Angeles”, and by an existing graph with nodes Kobe Bryant, Pau Gasol, LA Lakers, NY Knicks and edge labels playFor, teammate, playInLeague, teamInLeague, opponent]

Pr(<s, r, o> = 1 | D)

SLIDE 11

[Figure: system overview with the components Web, Extractors, Priors, and Fusion]

SLIDE 12

KV is 50x bigger than comparable KBs

Total # facts in KV > 2.5B
  • 302M with Prob > 0.9
  • 381M with Prob > 0.7

Open IE (e.g., Mausam et al., 2012): 5B assertions (Mausam, Michael Schmitz, personal communication, October 2013)

SLIDE 13

Uses for KV's uncertain triples

  • probably true triples uploaded to KG
  • probably false triples removed from KG
  • possibly true triples used as weak signals
  • possibly false triples used for error analysis
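
One way to read the four uses above is as bands of the triple's probability. The sketch below assumes that reading; the function name `route` and the cut-offs 0.9, 0.5, and 0.1 are illustrative assumptions, since the slide gives no thresholds.

```python
def route(probability, hi=0.9, lo=0.1):
    """Map a triple's probability to one of the four uses listed on the slide.

    The cut-offs hi and lo (and the 0.5 midpoint) are assumptions for illustration.
    """
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must lie in [0, 1]")
    if probability >= hi:
        return "upload to KG"        # probably true
    if probability <= lo:
        return "remove from KG"      # probably false
    if probability >= 0.5:
        return "weak signal"         # possibly true
    return "error analysis"          # possibly false


for p in (0.95, 0.7, 0.3, 0.05):
    print(p, route(p))
```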

SLIDE 14

Outline of the talk

1. Knowledge Graph
2. Knowledge Vault
3. Fact mining from the web
4. Fact mining from graphs
5. Knowledge Fusion

SLIDE 15

Fact extraction from the web

[Figure: NL text, page structure, tables, and webmaster annotations feed the Extractors, whose outputs go to Fusion]

SLIDE 16

Fact extraction from text (TXT)

  • First identify named entities (entity linkage).
  • Then classify the verb phrase as one of 2000 relations.

Example: “Patrick Newport, who has been working at IHS Global Insight, noted...”
  Patrick Newport: PER /m/101
  IHS Global Insight: ORG /m/201
  Relation: /people/person/employment

The result is a probabilistic triple: Pr(<subject, reln, object> = 1 | text)

Classifier trained using distant supervision.*

Details: see e.g. the tutorial by Ralph Grishman (NYU): “Information Extraction: Capabilities and Challenges”, 2012

* Mintz et al, RANLP 2009
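
A simplified sketch of how distant supervision can produce training labels: a sentence mentioning a linked entity pair is labeled with whatever relation the knowledge base already holds for that pair. The function name `label_mentions`, the `NO_RELATION` label, the second sentence, and the id `/m/999` are assumptions for illustration; treating unknown pairs as negatives is a simplification, not necessarily what KV does.

```python
# Toy knowledge base keyed by (subject id, object id). The ids /m/101 and /m/201
# are taken from the slide's example; the KB content itself is an assumption.
KB = {("/m/101", "/m/201"): {"/people/person/employment"}}


def label_mentions(mentions, kb):
    """Turn entity-linked sentences into (sentence, subject, object, label) examples."""
    examples = []
    for sentence, subj, obj in mentions:
        relations = kb.get((subj, obj))
        if relations:
            # Every known relation for the pair becomes a positive example.
            for rel in sorted(relations):
                examples.append((sentence, subj, obj, rel))
        else:
            # Simplification: pairs unknown to the KB are treated as negatives.
            examples.append((sentence, subj, obj, "NO_RELATION"))
    return examples


MENTIONS = [
    ("Patrick Newport, who has been working at IHS Global Insight, noted...",
     "/m/101", "/m/201"),
    ("Patrick Newport spoke at a conference.", "/m/101", "/m/999"),  # hypothetical
]

for example in label_mentions(MENTIONS, KB):
    print(example[1:], "<-", example[0])
```

No sentence is labeled by hand; the noise this introduces (a sentence may mention the pair without expressing the relation) is the price of getting training data at web scale.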

SLIDE 17

Fact extraction from DOM trees*

  • First identify named entities on page
  • Then classify the XPath connecting each entity pair as one of 2000 relations

* Cafarella et al, CACM’11
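
A rough sketch of the "path connecting each entity pair" idea, assuming a well-formed page parsed with the standard library. The path notation (tags up to the lowest common ancestor, marked with `^`, then tags down) and the sample markup are assumptions for illustration, not the feature format used in KV.

```python
import xml.etree.ElementTree as ET


def connecting_path(root, a, b):
    """Tag path from node a up to the lowest common ancestor and down to node b."""
    parent = {child: p for p in root.iter() for child in p}

    def ancestors(node):
        chain = [node]
        while chain[-1] in parent:
            chain.append(parent[chain[-1]])
        return chain  # node, its parent, ..., root

    up, down = ancestors(a), ancestors(b)
    common = next(n for n in up if n in down)  # lowest common ancestor
    up_tags = [n.tag for n in up[:up.index(common)]]
    down_tags = [n.tag for n in reversed(down[:down.index(common)])]
    return "/".join(up_tags + ["^" + common.tag] + down_tags)


def find_by_text(root, text):
    """First element whose own text equals the given entity mention."""
    return next(e for e in root.iter() if (e.text or "").strip() == text)


# Hypothetical page fragment.
PAGE = "<div><h1>Kobe Bryant</h1><ul><li>Team: <b>LA Lakers</b></li></ul></div>"
root = ET.fromstring(PAGE)
path = connecting_path(root, find_by_text(root, "Kobe Bryant"),
                       find_by_text(root, "LA Lakers"))
print(path)
```

The resulting string can serve as a categorical feature for a relation classifier, playing the role that the verb phrase plays for free text.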

SLIDE 18

Fact extraction from tables (TBL)*

Squares are CVT nodes

* Cafarella et al, VLDB’08

SLIDE 19

Fact extraction from schema.org annotation (ANO)

<script type="application/ld+json">
{"@context" : "http://www.schema.org",
 "@type" : "Event",
 "startDate" : "2014-07-26",
 ...}
</script>

  • About 20% of webpages have machine-readable annotations of commercial events, products, etc.
  • Automatically map to KG schema.
  • We still need to do entity linking.
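
A minimal sketch of pulling such annotations out of a page with the Python standard library. The sample page, the `name` field, and the helper `to_triples` are assumptions for illustration; the subject stays a raw string because, as noted above, entity linking is still needed.

```python
import json
from html.parser import HTMLParser


class JsonLdParser(HTMLParser):
    """Collects the parsed contents of <script type="application/ld+json"> blocks."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self.items.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # skip malformed annotations


def to_triples(item):
    """Flatten one annotation into (subject, property, value) tuples (unlinked strings)."""
    subject = item.get("name", "<unlinked>")
    return [(subject, key, value) for key, value in item.items()
            if not key.startswith("@") and key != "name"]


# Hypothetical page; "Example Concert" is invented.
PAGE = ('<html><head><script type="application/ld+json">'
        '{"@context": "http://www.schema.org", "@type": "Event", '
        '"name": "Example Concert", "startDate": "2014-07-26"}'
        '</script></head><body></body></html>')

parser = JsonLdParser()
parser.feed(PAGE)
parser.close()
triples = to_triples(parser.items[0])
print(triples)
```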

SLIDE 20

Combine outputs from all extractors

[Figure: NL text, page structure, tables, and webmaster annotations feed the Extractors, whose outputs go to Fusion]

  • Train binary classifier on f(t) = [score-txt(t), #txt(t), … ] using distant supervision.
  • Platt scaling to get calibrated probabilities.
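
A bare-bones sketch of Platt scaling: fit a sigmoid p = 1 / (1 + exp(-(a*s + b))) that maps a classifier's raw score s to a probability. The scores and labels are invented, plain gradient descent stands in for the optimizer, and Platt's smoothed targets are omitted, so treat this as an illustration rather than the calibration used in KV.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_platt(scores, labels, lr=0.1, epochs=5000):
    """Fit slope a and offset b by gradient descent on the mean log loss."""
    a, b = 1.0, 0.0
    n = len(scores)
    for _ in range(epochs):
        grad_a = grad_b = 0.0
        for s, y in zip(scores, labels):
            err = sigmoid(a * s + b) - y
            grad_a += err * s
            grad_b += err
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b


def calibrate(score, a, b):
    """Calibrated probability for one raw classifier score."""
    return sigmoid(a * score + b)


# Made-up raw scores from a fused classifier and distant-supervision labels.
SCORES = [-2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 2.0, 3.0]
LABELS = [0, 0, 1, 0, 1, 0, 1, 1]

A, B = fit_platt(SCORES, LABELS)
for s in SCORES:
    print(s, round(calibrate(s, A, B), 3))
```

After fitting, the average predicted probability matches the fraction of positive labels, which is the basic sense in which the output is calibrated.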

SLIDE 21

ROC for each extraction system

SLIDE 22

Confidence of true facts rises given more evidence

SLIDE 23

Outline of the talk

1. Knowledge Graph
2. Knowledge Vault
3. Fact mining from the web
4. Fact mining from graphs
5. Knowledge Fusion

SLIDE 24

Mining facts from graphs

[Figure: system overview with the components Web, Extractors, Priors, and Fusion]

SLIDE 25

Link prediction using tensor factorization

  • Many methods have been used to fill in missing values in binary matrices; e.g., tensor factorization associates a low-dimensional vector with every row and column.

[Figure: example graph (nodes Kobe Bryant, Pau Gasol, LA Lakers, NY Knicks; edge labels playFor, teammate, playInLeague, teamInLeague, opponent); a playFor edge from Kobe Bryant is written as a three-way product < , , > of vectors]
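
The "< , , >" notation on this slide reads like a three-way inner product of the subject, relation, and object vectors. Assuming that reading, a link score can be sketched as below; the 2-D vectors are hand-picked for illustration and are not learned embeddings.

```python
import math


def trilinear_score(e_s, w_r, e_o):
    """Three-way inner product: sum over k of e_s[k] * w_r[k] * e_o[k]."""
    return sum(a * b * c for a, b, c in zip(e_s, w_r, e_o))


def link_probability(e_s, w_r, e_o):
    """Squash the score into (0, 1) with a sigmoid."""
    return 1.0 / (1.0 + math.exp(-trilinear_score(e_s, w_r, e_o)))


# Hand-picked toy vectors (assumptions, not trained values).
ENTITY = {
    "Kobe Bryant": [1.0, 0.5],
    "LA Lakers": [1.0, 1.0],
    "NY Knicks": [-1.0, 0.5],
}
RELATION = {"playFor": [2.0, 1.0]}

for team in ("LA Lakers", "NY Knicks"):
    p = link_probability(ENTITY["Kobe Bryant"], RELATION["playFor"], ENTITY[team])
    print(team, round(p, 3))
```

In a real system the vectors would be fit so that observed edges score high and sampled non-edges score low; the sketch only shows the scoring step.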

SLIDE 26

(Deep) neural network for link prediction

  • Represent each entity and relation by its own low-dimensional (100D) embedding vector.
  • Stack together, feed into neural net.
  • Train model to maximize log-likelihood of observed positive and negative triples.
  • Outperforms neural tensor model (Socher et al).

[Figure: example graph (nodes Kobe Bryant, Pau Gasol, NBA, NY Knicks, LA Lakers; edge labels playFor, teammate, playInLeague, teamInLeague, opponent) and a neural network with 2 hidden layers scoring a playFor edge for Kobe Bryant]
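
A forward-pass sketch of the steps above: look up embeddings, stack them, feed them through two hidden layers, and read off a probability. Weights are random and untrained, the dimensions are shrunk from 100 to 4 for readability, and the tanh activation and layer sizes are assumptions, so the printed numbers carry no meaning.

```python
import math
import random

random.seed(0)
DIM, HIDDEN = 4, 5  # the slide uses 100-D embeddings; 4 keeps the toy small


def rand_vec(n):
    return [random.uniform(-0.5, 0.5) for _ in range(n)]


def rand_mat(rows, cols):
    return [rand_vec(cols) for _ in range(rows)]


entity_emb = {name: rand_vec(DIM) for name in ("Kobe Bryant", "LA Lakers", "NY Knicks")}
relation_emb = {"playFor": rand_vec(DIM)}

W1, b1 = rand_mat(HIDDEN, 3 * DIM), [0.0] * HIDDEN   # first hidden layer
W2, b2 = rand_mat(HIDDEN, HIDDEN), [0.0] * HIDDEN    # second hidden layer
w_out, b_out = rand_vec(HIDDEN), 0.0                 # output unit


def dense(W, b, x):
    """Fully connected layer with tanh activation."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x)) + bi)
            for row, bi in zip(W, b)]


def predict(s, r, o):
    """Probability that the triple (s, r, o) holds, under the current weights."""
    x = entity_emb[s] + relation_emb[r] + entity_emb[o]  # stack the three embeddings
    h = dense(W2, b2, dense(W1, b1, x))
    z = sum(w * hi for w, hi in zip(w_out, h)) + b_out
    return 1.0 / (1.0 + math.exp(-z))


def log_likelihood(examples):
    """Training objective: log-likelihood of labeled positive and negative triples."""
    total = 0.0
    for (s, r, o), label in examples:
        p = predict(s, r, o)
        total += math.log(p) if label else math.log(1.0 - p)
    return total


EXAMPLES = [
    (("Kobe Bryant", "playFor", "LA Lakers"), 1),
    (("Kobe Bryant", "playFor", "NY Knicks"), 0),
]
print(predict("Kobe Bryant", "playFor", "LA Lakers"))
print(log_likelihood(EXAMPLES))
```

Training would adjust both the embeddings and the layer weights by gradient ascent on `log_likelihood`; that step is left out here.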

SLIDE 27

Path Ranking Algorithm [Lao et al., EMNLP11]

Query: CityLocatedInCountry(Pittsburgh) = ?

Feature = Typed Path                                Feature Value   Logistic Regression Weight
CityInState, CityInState-1, CityLocatedInCountry    0.8             0.32
AtLocation-1, AtLocation, CityLocatedInCountry      0.6             0.20
…                                                   …               …

Result: CityLocatedInCountry(Pittsburgh) = U.S., p=0.58

[Figure: graph with nodes Pittsburgh, Pennsylvania, Philadelphia, Harrisburg, …(14), U.S., Delta, PPG, Atlanta, Dallas, Tokyo, Japan and edge labels AtLocation, CityLocatedInCountry]

Figure courtesy of Tom Mitchell and Partha Talukdar
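
A toy sketch of how a typed-path feature can be computed as a random-walk probability and then combined by logistic regression. The three-city graph is a cut-down assumption (the slide's graph has more cities), so the feature value here differs from the slide's 0.8; the function names and the zero bias are also assumptions.

```python
import math
from collections import defaultdict


def build_index(triples):
    """Adjacency lists for each relation and its inverse (suffix '-1')."""
    index = defaultdict(lambda: defaultdict(list))
    for s, r, o in triples:
        index[s][r].append(o)
        index[o][r + "-1"].append(s)
    return index


def path_feature(index, source, path, target):
    """Probability that a uniform random walk from source along path ends at target."""
    dist = {source: 1.0}
    for relation in path:
        nxt = defaultdict(float)
        for node, mass in dist.items():
            neighbours = index.get(node, {}).get(relation, [])
            for n in neighbours:
                nxt[n] += mass / len(neighbours)
        dist = nxt  # mass at nodes without a matching edge is dropped
    return dist.get(target, 0.0)


def combine(features, weights, bias=0.0):
    """Logistic regression over path features."""
    z = bias + sum(f * w for f, w in zip(features, weights))
    return 1.0 / (1.0 + math.exp(-z))


# Reduced toy graph; the query edge (Pittsburgh, CityLocatedInCountry, U.S.) is missing.
TRIPLES = [
    ("Pittsburgh", "CityInState", "Pennsylvania"),
    ("Philadelphia", "CityInState", "Pennsylvania"),
    ("Harrisburg", "CityInState", "Pennsylvania"),
    ("Philadelphia", "CityLocatedInCountry", "U.S."),
    ("Harrisburg", "CityLocatedInCountry", "U.S."),
]
INDEX = build_index(TRIPLES)
PATH = ["CityInState", "CityInState-1", "CityLocatedInCountry"]

value = path_feature(INDEX, "Pittsburgh", PATH, "U.S.")
print(value)
# Combining only the two feature rows shown on the slide; the slide's p=0.58
# presumably also reflects the rows elided by "…", so no match is expected.
print(combine([0.8, 0.6], [0.32, 0.20]))
```

The walk goes Pittsburgh -> Pennsylvania -> sibling cities -> their country; cities without a known country lose their share of the probability mass, which is why sparse graphs give weaker features.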

SLIDE 28

Example of paths / rules learned by PRA

CityLocatedInCountry(city, country):

8.04  cityliesonriver, cityliesonriver-1, citylocatedincountry
5.42  hasofficeincity-1, hasofficeincity, citylocatedincountry
4.98  cityalsoknownas, cityalsoknownas, citylocatedincountry
2.85  citycapitalofcountry, citylocatedincountry-1, citylocatedincountry
2.29  agentactsinlocation-1, agentactsinlocation, citylocatedincountry
1.22  statehascapital-1, statelocatedincountry
0.66  citycapitalofcountry
...

7 of the 2985 learned paths

Figure courtesy of Tom Mitchell and Partha Talukdar

SLIDE 29

PRA is similar in performance to the neural network

SLIDE 30

Outline of the talk

1. Knowledge Graph
2. Knowledge Vault
3. Fact mining from the web
4. Fact mining from graphs
5. Knowledge Fusion

SLIDE 31

[Figure: system overview with the components Web, Extractors, Priors, and Fusion]

SLIDE 32

Fusing web extractions with graph priors

SLIDE 33

Example: (Barry Richter, studiedAt, UW-Madison)

“In the fall of 1989, Richter accepted a scholarship to the University of Wisconsin, where he played for four years and earned numerous individual accolades ...”
“The Polar Caps' cause has been helped by the impact of knowledgeable coaches such as Andringa, Byce and former UW teammates Chris Tancill and Barry Richter.”
→ Web extraction confidence: 0.14

<Barry Richter, born in, Madison>
<Barry Richter, lived in, Madison>
→ Final belief (fused with prior): 0.61

SLIDE 34

Summary and future work

  • KV has 2.5B triples automatically extracted from the web.
  • Combining web mining and graph mining can improve precision.
  • Work in progress
      • Discovering new entities
          • Clustering open IE extractions, CIKM 2014
          • Robust wrapper induction for long-tail verticals (work in progress)
      • Discovering new relations
          • Clustering open IE extractions, CIKM 2014
          • “Biperpedia”, VLDB 2014
      • Assessing trustworthiness of web sites: VLDB 2014
      • Common sense fact mining, e.g. “apples” (work in progress)

SLIDE 35

EXTRA SLIDES

SLIDE 36

Application 1: Knowledge Panels

Augmenting the presentation with relevant facts

SLIDE 37

Application 2: Related Entities

SLIDE 38

Application 3: Structured Graph Search

Figure courtesy of Antoine Bordes (Facebook)

SLIDE 39

Application 4: Factoid Question Answering

  • EVI (Amazon)
  • Google
  • Siri (Apple)

SLIDE 40

The yield from different extraction systems

SLIDE 41

Overlap between extractors