LARGE-SCALE KNOWLEDGE GRAPH IDENTIFICATION USING PSL
Jay Pujara (1), Hui Miao (1), Lise Getoor (1), William Cohen (2)
(1) University of Maryland, College Park, US
(2) Carnegie Mellon University
AAAI Symposium on Semantics for Big Data, 11/16/2013

Overview
Problem: Build a Knowledge Graph from millions of noisy extractions
Approach: Knowledge Graph Identification reasons jointly over all facts in the knowledge graph
Method: Use probabilistic soft logic to easily specify models and efficiently optimize them
Results: State-of-the-art performance on real-world datasets, producing knowledge graphs with millions of facts

CHALLENGES IN KNOWLEDGE GRAPH CONSTRUCTION

Motivating Problem: New Opportunities
Internet: Massive source of publicly available information
Extraction: Cutting-edge IE methods
Knowledge Graph (KG): Structured representation of entities, their labels and the relationships between them

Motivating Problem: Real Challenges
Internet: Noisy!
Extraction: Difficult!
Knowledge Graph: Contains many errors and inconsistencies

NELL: The Never-Ending Language Learner
• Large-scale IE project (Carlson et al., 2010)
• Lifelong learning: aims to “read the web”
• Ontology of known labels and relations
• Knowledge base contains millions of facts

Examples of NELL errors

Entity co-reference errors
Kyrgyzstan has many variants:
• Kyrgystan
• Kyrgistan
• Kyrghyzstan
• Kyrgzstan
• Kyrgyz Republic

Missing and spurious labels Kyrgyzstan is labeled a bird and a country

Missing and spurious relations Kyrgyzstan’s location is ambiguous – Kazakhstan, Russia and US are included in possible locations

Violations of ontological knowledge
• Equivalence of co-referent entities (sameAs)
  • SameEntity(Kyrgyzstan, Kyrgyz Republic)
• Mutual exclusion (disjointWith) of labels
  • MUT(bird, country)
• Selectional preferences (domain/range) of relations
  • RNG(countryLocation, continent)
Enforcing these constraints requires jointly considering multiple extractions

KNOWLEDGE GRAPH IDENTIFICATION

Motivating Problem (revised)
[Figure: Internet → Large-scale IE → (noisy) Extraction Graph → Joint Reasoning → Knowledge Graph]

Knowledge Graph Identification
Problem: [Figure: Extraction Graph → Knowledge Graph Identification → Knowledge Graph]
Solution: Knowledge Graph Identification (KGI)
• Performs graph identification:
  • entity resolution
  • collective classification
  • link prediction
• Enforces ontological constraints
• Incorporates multiple uncertain sources

Illustration of KGI: Extractions
Uncertain Extractions:
.5: Lbl(Kyrgyzstan, bird)
.7: Lbl(Kyrgyzstan, country)
.9: Lbl(Kyrgyz Republic, country)
.8: Rel(Kyrgyz Republic, Bishkek, hasCapital)

Illustration of KGI: Extraction Graph
Uncertain Extractions:
.5: Lbl(Kyrgyzstan, bird)
.7: Lbl(Kyrgyzstan, country)
.9: Lbl(Kyrgyz Republic, country)
.8: Rel(Kyrgyz Republic, Bishkek, hasCapital)
[Figure: Extraction Graph with nodes Kyrgyzstan, Kyrgyz Republic, Bishkek, country and bird, connected by Lbl edges and a Rel(hasCapital) edge]

Illustration of KGI: Ontology + ER
Uncertain Extractions:
.5: Lbl(Kyrgyzstan, bird)
.7: Lbl(Kyrgyzstan, country)
.9: Lbl(Kyrgyz Republic, country)
.8: Rel(Kyrgyz Republic, Bishkek, hasCapital)
Ontology:
Dom(hasCapital, country)
Mut(country, bird)
Entity Resolution:
SameEnt(Kyrgyz Republic, Kyrgyzstan)
[Figure: (Annotated) Extraction Graph with nodes Kyrgyzstan, Kyrgyz Republic, Bishkek, country and bird, connected by Lbl, Rel(hasCapital), SameEnt and Dom edges]

Illustration of KGI
Uncertain Extractions:
.5: Lbl(Kyrgyzstan, bird)
.7: Lbl(Kyrgyzstan, country)
.9: Lbl(Kyrgyz Republic, country)
.8: Rel(Kyrgyz Republic, Bishkek, hasCapital)
Ontology:
Dom(hasCapital, country)
Mut(country, bird)
Entity Resolution:
SameEnt(Kyrgyz Republic, Kyrgyzstan)
[Figure: (Annotated) Extraction Graph with nodes Kyrgyzstan, Kyrgyz Republic, Bishkek, country and bird, connected by Lbl, Rel(hasCapital), SameEnt and Dom edges]
After Knowledge Graph Identification:
[Figure: resulting graph with Kyrgyzstan and Kyrgyz Republic, a Lbl edge to country and a Rel(hasCapital) edge to Bishkek]

MODELING KNOWLEDGE GRAPH IDENTIFICATION

Viewing KGI as a probabilistic graphical model
[Figure: graphical model over the variables Lbl(Kyrgyzstan, bird), Lbl(Kyrgyzstan, country), Lbl(Kyrgyz Republic, bird), Lbl(Kyrgyz Republic, country), Rel(hasCapital, Kyrgyzstan, Bishkek) and Rel(hasCapital, Kyrgyz Republic, Bishkek)]

Background: Probabilistic Soft Logic (PSL)
• Templating language for hinge-loss MRFs, very scalable!
• Model specified as a collection of logical formulas
  $\mathrm{SameEnt}(E_1, E_2) \,\tilde{\wedge}\, \mathrm{Lbl}(E_1, L) \Rightarrow \mathrm{Lbl}(E_2, L)$
• Uses soft-logic formulation
  • Truth values of atoms relaxed to [0,1] interval
  • Truth values of formulas derived from Lukasiewicz t-norm
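The Lukasiewicz operators mentioned above can be sketched in a few lines of Python. This is a minimal illustration, not PSL's implementation: the function names and the truth values assigned to the atoms are assumptions chosen for the example.

```python
def l_and(a: float, b: float) -> float:
    """Lukasiewicz t-norm: soft conjunction of two truth values in [0, 1]."""
    return max(0.0, a + b - 1.0)


def l_or(a: float, b: float) -> float:
    """Lukasiewicz t-conorm: soft disjunction."""
    return min(1.0, a + b)


def l_not(a: float) -> float:
    """Soft negation."""
    return 1.0 - a


def l_implies(body: float, head: float) -> float:
    """Truth value of body => head, equal to l_or(l_not(body), head)."""
    return min(1.0, 1.0 - body + head)


# Hypothetical truth values for the atoms of
# SameEnt(E1, E2) AND Lbl(E1, L) => Lbl(E2, L)
same_ent = 0.9
lbl_e1 = 0.8
lbl_e2 = 0.4

body_truth = l_and(same_ent, lbl_e1)        # roughly 0.7
rule_truth = l_implies(body_truth, lbl_e2)  # roughly 0.7
print(body_truth, rule_truth)
```

With truth values restricted to 0 and 1 these operators reduce to ordinary Boolean logic, which is why the relaxation is a natural one.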

Background: PSL Rules to Distributions
• Rules are grounded by substituting literals into formulas
  $w_{EL} : \mathrm{SameEnt}(\text{Kyrgyzstan}, \text{Kyrgyz Republic}) \,\tilde{\wedge}\, \mathrm{Lbl}(\text{Kyrgyzstan}, \text{country}) \Rightarrow \mathrm{Lbl}(\text{Kyrgyz Republic}, \text{country})$
• Each ground rule has a weighted distance to satisfaction derived from the formula’s truth value
  $P(G \mid E) = \frac{1}{Z} \exp\left[ -\sum_{r \in R} w_r \, \varphi_r(G) \right]$
• The PSL program can be interpreted as a joint probability distribution over all variables in the knowledge graph, conditioned on the extractions
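As a rough sketch of how a ground rule contributes to the distribution, the distance to satisfaction of an implication under the Lukasiewicz relaxation can be taken as one minus its truth value, which works out to max(0, body - head). The code below assumes this linear hinge form; the weight and the truth values are hypothetical, and the normalizing constant Z is not computed.

```python
import math


def distance_to_satisfaction(body_atoms, head):
    """Distance to satisfaction of (a1 AND ... AND an) => head.

    The body is combined with the Lukasiewicz t-norm; the distance is
    max(0, body - head), i.e. one minus the rule's truth value.
    """
    body = max(0.0, sum(body_atoms) - (len(body_atoms) - 1))
    return max(0.0, body - head)


def unnormalized_density(weighted_distances):
    """exp(-sum_r w_r * phi_r) for (weight, distance) pairs; Z is omitted."""
    return math.exp(-sum(w * d for w, d in weighted_distances))


# Ground rule: SameEnt(Kyrgyzstan, Kyrgyz Republic) AND Lbl(Kyrgyzstan, country)
#              => Lbl(Kyrgyz Republic, country)
# Truth values and the weight w_EL = 2.0 are assumptions for illustration.
same_ent = 1.0
lbl_kyrgyzstan_country = 0.7
lbl_kyrgyz_republic_country = 0.2
w_el = 2.0

phi = distance_to_satisfaction(
    [same_ent, lbl_kyrgyzstan_country], lbl_kyrgyz_republic_country
)
print(phi, unnormalized_density([(w_el, phi)]))
```

Raising Lbl(Kyrgyz Republic, country) to 0.7 or higher would drive the distance to zero, so assignments that satisfy the rule receive higher density.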

Background: Finding the best knowledge graph
• MPE inference solves $\arg\max_G P(G \mid E)$ to find the best KG
• In PSL, inference solved by convex optimization
• Efficient: running time scales with O(|R|)
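Maximizing P(G | E) is the same as minimizing the weighted sum of distances to satisfaction. The toy sketch below shows that objective for two variables, Lbl(Kyrgyzstan, bird) and Lbl(Kyrgyzstan, country), using the extraction confidences .5 and .7 from the illustration slides. Two things are assumptions: the rule weights are invented, and a brute-force grid search stands in for the convex solver, so this shows what is being optimized rather than how PSL optimizes it.

```python
def hinge(x: float) -> float:
    return max(0.0, x)


# Hypothetical rule weights, not taken from the slides.
W_BIRD, W_COUNTRY, W_MUT = 1.0, 2.0, 10.0


def objective(bird: float, country: float) -> float:
    """Weighted distance to satisfaction of three ground rules."""
    return (
        # CandLbl(Kyrgyzstan, bird) = 0.5  =>  Lbl(Kyrgyzstan, bird)
        W_BIRD * hinge(0.5 - bird)
        # CandLbl(Kyrgyzstan, country) = 0.7  =>  Lbl(Kyrgyzstan, country)
        + W_COUNTRY * hinge(0.7 - country)
        # Mut(bird, country) AND Lbl(Kyrgyzstan, bird) => NOT Lbl(Kyrgyzstan, country)
        + W_MUT * hinge(bird + country - 1.0)
    )


def mpe_by_grid(steps: int = 100):
    """Exhaustive search over a grid on [0, 1]^2 (stand-in for a convex solver)."""
    grid = [i / steps for i in range(steps + 1)]
    return min(((b, c) for b in grid for c in grid), key=lambda p: objective(*p))


best_bird, best_country = mpe_by_grid()
print(best_bird, best_country, objective(best_bird, best_country))
```

Because the mutual-exclusion rule carries the largest weight, the two labels cannot both keep their extracted confidence; the more heavily weighted country extraction keeps its value and the bird label is pushed down.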

PSL Rules for the KGI Model

PSL Rules: Uncertain Extractions
$w_{CR-T} : \mathrm{CandRel}_T(E_1, E_2, R) \Rightarrow \mathrm{Rel}(E_1, E_2, R)$
$w_{CL-T} : \mathrm{CandLbl}_T(E, L) \Rightarrow \mathrm{Lbl}(E, L)$
• $w_{CR-T}$: Weight for source T (relations)
• $\mathrm{CandRel}_T$: Predicate representing uncertain relation extraction from extractor T
• $\mathrm{Rel}$: Relation in Knowledge Graph
• $w_{CL-T}$: Weight for source T (labels)
• $\mathrm{CandLbl}_T$: Predicate representing uncertain label extraction from extractor T
• $\mathrm{Lbl}$: Label in Knowledge Graph

PSL Rules: Entity Resolution
• ER predicate captures confidence that entities are co-referent
• Rules require co-referent entities to have the same labels and relations
• Creates an equivalence class of co-referent entities

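PSL Rules: Ontology
Inverse:
$w_O : \mathrm{Inv}(R, S) \,\tilde{\wedge}\, \mathrm{Rel}(E_1, E_2, R) \Rightarrow \mathrm{Rel}(E_2, E_1, S)$
Selectional Preference:
$w_O : \mathrm{Dom}(R, L) \,\tilde{\wedge}\, \mathrm{Rel}(E_1, E_2, R) \Rightarrow \mathrm{Lbl}(E_1, L)$
$w_O : \mathrm{Rng}(R, L) \,\tilde{\wedge}\, \mathrm{Rel}(E_1, E_2, R) \Rightarrow \mathrm{Lbl}(E_2, L)$
Subsumption:
$w_O : \mathrm{Sub}(L, P) \,\tilde{\wedge}\, \mathrm{Lbl}(E, L) \Rightarrow \mathrm{Lbl}(E, P)$
$w_O : \mathrm{RSub}(R, S) \,\tilde{\wedge}\, \mathrm{Rel}(E_1, E_2, R) \Rightarrow \mathrm{Rel}(E_1, E_2, S)$
Mutual Exclusion:
$w_O : \mathrm{Mut}(L_1, L_2) \,\tilde{\wedge}\, \mathrm{Lbl}(E, L_1) \Rightarrow \tilde{\neg}\, \mathrm{Lbl}(E, L_2)$
$w_O : \mathrm{RMut}(R, S) \,\tilde{\wedge}\, \mathrm{Rel}(E_1, E_2, R) \Rightarrow \tilde{\neg}\, \mathrm{Rel}(E_1, E_2, S)$
Adapted from Jiang et al., ICDM 2012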
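To make the templating idea concrete, the sketch below grounds the domain rule Dom(R, L) AND Rel(E1, E2, R) => Lbl(E1, L) against a handful of facts from the Kyrgyzstan illustration. The tuple-based atom encoding, the GroundRule type and the function name are assumptions made for this example and do not reflect how PSL stores ground rules.

```python
from typing import List, NamedTuple, Tuple


class GroundRule(NamedTuple):
    body: Tuple[tuple, ...]  # atoms joined by soft conjunction
    head: tuple              # atom implied by the body


def ground_domain_rule(dom_facts, rel_atoms) -> List[GroundRule]:
    """Instantiate Dom(R, L) AND Rel(E1, E2, R) => Lbl(E1, L).

    dom_facts: iterable of (relation, label) pairs from the ontology.
    rel_atoms: iterable of (entity1, entity2, relation) candidate relations.
    """
    rules = []
    for dom_relation, label in dom_facts:
        for e1, e2, relation in rel_atoms:
            if relation == dom_relation:
                rules.append(
                    GroundRule(
                        body=(("Dom", relation, label), ("Rel", e1, e2, relation)),
                        head=("Lbl", e1, label),
                    )
                )
    return rules


dom_facts = [("hasCapital", "country")]
rel_atoms = [("Kyrgyz Republic", "Bishkek", "hasCapital")]

for rule in ground_domain_rule(dom_facts, rel_atoms):
    print(rule.body, "=>", rule.head)
```

Each template produces one ground rule per matching combination of facts, so the size of the grounded model grows with the number of candidate extractions and ontological constraints.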

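Probability Distribution over KGs
$P(G \mid E) = \frac{1}{Z} \exp\left[ -\sum_{r \in R} w_r \, \varphi_r(G) \right]$
$\mathrm{CandLbl}_T(\text{kyrgyzstan}, \text{bird}) \Rightarrow \mathrm{Lbl}(\text{kyrgyzstan}, \text{bird})$
$\mathrm{Mut}(\text{bird}, \text{country}) \,\tilde{\wedge}\, \mathrm{Lbl}(\text{kyrgyzstan}, \text{bird}) \Rightarrow \tilde{\neg}\, \mathrm{Lbl}(\text{kyrgyzstan}, \text{country})$
$\mathrm{SameEnt}(\text{kyrgyz republic}, \text{kyrgyzstan}) \,\tilde{\wedge}\, \mathrm{Lbl}(\text{kyrgyz republic}, \text{country}) \Rightarrow \mathrm{Lbl}(\text{kyrgyzstan}, \text{country})$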

EVALUATION

Two Evaluation Datasets
Description
  LinkedBrainz: Community-supplied data about musical artists, labels, and creative works
  NELL: Real-world IE system extracting general facts from the WWW
Noise
  LinkedBrainz: Realistic synthetic noise
  NELL: Imperfect extractors and ambiguous web pages
Candidate Facts
  LinkedBrainz: 810K
  NELL: 1.3M
Unique Labels and Relations
  LinkedBrainz: 27
  NELL: 456
Ontological Constraints
  LinkedBrainz: 49
  NELL: 67.9K

LinkedBrainz dataset for KGI
Mapping to FRBR/FOAF ontology:
DOM   rdfs:domain
RNG   rdfs:range
INV   owl:inverseOf
SUB   rdfs:subClassOf
RSUB  rdfs:subPropertyOf
MUT   owl:disjointWith
[Figure: ontology diagram with classes mo:Release, mo:Label, mo:Record, mo:MusicalArtist, mo:SoloMusicArtist, mo:MusicGroup, mo:Track and mo:Signal, connected by mo:label, mo:record, mo:track, mo:published_as, foaf:maker, foaf:made, inverseOf and subClassOf edges]

Adding noise to LinkedBrainz
Add realistic noise to LinkedBrainz data:
Error Type     Erroneous Data
Co-reference   User misspells artist
Label          User swaps artist and album fields
Relation       User omits or adds spurious albums for artist
Reliability    Gaussian noise on truth value of information
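The reliability noise described above might look roughly like the following sketch. The slides do not give the noise parameters, so the standard deviation, the clipping to [0, 1], the seed and the function name are all assumptions.

```python
import random


def add_reliability_noise(truth_values, sigma=0.1, seed=0):
    """Perturb each truth value with Gaussian noise and clip to [0, 1].

    sigma and seed are hypothetical; the slides do not state the values used.
    """
    rng = random.Random(seed)
    return [min(1.0, max(0.0, v + rng.gauss(0.0, sigma))) for v in truth_values]


clean = [1.0, 1.0, 0.0, 1.0]
noisy = add_reliability_noise(clean)
print(noisy)
```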

LinkedBrainz experiments
Comparisons:
Baseline      Use noisy truth values as fact scores
PSL-EROnly    Only apply rules for Entity Resolution
PSL-OntOnly   Only apply rules for Ontological reasoning
PSL-KGI       Apply Knowledge Graph Identification model

              AUC    Precision  Recall  F1 at .5  Max F1
Baseline      0.672  0.946      0.477   0.634     0.788
PSL-EROnly    0.797  0.953      0.558   0.703     0.831
PSL-OntOnly   0.753  0.964      0.605   0.743     0.832
PSL-KGI       0.901  0.970      0.714   0.823     0.919
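As a quick consistency check on the table, F1 is the harmonic mean of precision and recall. The sketch below assumes the "F1 at .5" column is computed from the Precision and Recall columns at the same threshold; because those columns are already rounded to three decimals, the recomputed values can differ from the reported ones in the last digit.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1 at .5) copied from the results table
results = {
    "Baseline": (0.946, 0.477, 0.634),
    "PSL-EROnly": (0.953, 0.558, 0.703),
    "PSL-OntOnly": (0.964, 0.605, 0.743),
    "PSL-KGI": (0.970, 0.714, 0.823),
}

for name, (p, r, reported) in results.items():
    print(f"{name}: recomputed {f1_score(p, r):.3f}, reported {reported:.3f}")
```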

NELL Evaluation: two settings
Target Set: restrict to a subset of KG (Jiang, ICDM12)
• Closed-world model
• Uses a target set: subset of KG
• Derived from 2-hop neighborhood
• Excludes trivially satisfied variables
Complete: Infer full knowledge graph
• Open-world model
• All possible entities, relations, labels
• Inference assigns truth value to each variable
