Discovering Graph Patterns for Fact Checking in Knowledge Graphs



slide-1
SLIDE 1

Discovering Graph Patterns for Fact Checking in Knowledge Graphs

Peng Lin, Qi Song, Jialiang Shen, Yinghui Wu

Washington State University; Beijing University of Posts and Telecommunications; Washington State University; Pacific Northwest National Laboratory

slide-2
SLIDE 2

What is fact checking?

Fact checking answers whether a fact belongs to the missing part of a KG.

[Figure: example knowledge graph with the philosophers Plato and Cicero, the book Dialogues, the topic Ancient Philosophy, and the speeches Against Verres and Against Piso, connected by the edges published, belongsTo, cited, and gaveBy.]

Knowledge Graph (KG): $G = (V, E, L)$

Fact: a triple predicate

Triple $\langle v_x, r, v_y \rangle$

  • $v_x$ and $v_y$ are two nodes;
  • $x$ and $y$ are node labels;
  • $r$ is a relationship;

e.g., $\langle$Cicero, influencedBy, Plato$\rangle$

  • $v_x$ = “Cicero”, $v_y$ = “Plato”
  • $x$, $y$ = “philosopher”
  • $r$ = “influencedBy”
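As a rough illustration of the definitions on this slide, the sketch below stores a knowledge graph as a node-label map plus a set of labeled edges and asks whether a triple is already present. The class name `KnowledgeGraph`, its methods, and the edge directions chosen for the example are assumptions made for this sketch; the slides do not prescribe a data structure.

```python
from dataclasses import dataclass, field


@dataclass
class KnowledgeGraph:
    """G = (V, E, L): nodes, labeled edges, and a node-label function."""
    labels: dict = field(default_factory=dict)  # L: node -> label (keys are V)
    edges: set = field(default_factory=set)     # E: (source, relation, target)

    def add_node(self, node, label):
        self.labels[node] = label

    def add_edge(self, src, rel, dst):
        self.edges.add((src, rel, dst))

    def has_fact(self, src, rel, dst):
        """True if the triple is already an edge of G."""
        return (src, rel, dst) in self.edges


kg = KnowledgeGraph()
kg.add_node("Cicero", "philosopher")
kg.add_node("Plato", "philosopher")
kg.add_node("Dialogues", "book")
kg.add_node("Against Verres", "speech")
# Edge directions below are assumed; the slide figure does not survive in text.
kg.add_edge("Plato", "published", "Dialogues")
kg.add_edge("Against Verres", "gaveBy", "Cicero")
kg.add_edge("Against Verres", "cited", "Dialogues")

# The fact to check is not an edge of G, so it is a candidate "missing" fact.
candidate = ("Cicero", "influencedBy", "Plato")
print(kg.has_fact(*candidate))
```

A fact that is absent from `edges` is not automatically false; deciding whether it belongs to the missing part of the graph is the fact checking task itself.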

slide-3
SLIDE 3

Fact Checking in Graphs

Graph structure can be evidence for fact checking.

“If a philosopher X gave one or more speeches, which cited a book of another philosopher Y with the same topic, then the philosopher X is likely to be influencedBy Y.”

[Figure: the example knowledge graph of Plato, Cicero, Dialogues, Ancient Philosophy, Against Verres and Against Piso, with the edges published, belongsTo, cited, and gaveBy.]

A fact can be supported by its surrounding substructures!

slide-4
SLIDE 4

Fact Checking via Graph Patterns

Graph structure can be evidence for fact checking.

[Figure: a graph pattern with two anchored (philosopher) nodes $u_x$ and $u_y$, connected through (book), (speech) and (topic) nodes, matched against the example knowledge graph of Plato and Cicero.]

We say a pattern $P$ covers a fact $\langle v_x, r, v_y \rangle$ if $v_x$ and $v_y$ match the anchored nodes $u_x$ and $u_y$ of $P$.

Pattern: regularity in KG

slide-5
SLIDE 5

Rule Model: Graph Fact Checking Rules (GFC)

GFC $\varphi: P(x, y) \rightarrow r(x, y)$

[Figure: the RHS is the edge $r$ between two (philosopher) nodes $x$ and $y$; the LHS is the pattern $P$ over the same two (philosopher) nodes together with (book), (speech) and (topic) nodes.]

Rule Semantics:

  • GFC $\varphi$ states that if pattern $P(x, y)$ covers a fact $\langle v_x, r, v_y \rangle$, then it is true.

Rule matching:

  • Subgraph isomorphism
  • Overkill: redundant, too strict, too many
  • Approximate matching (S. Ma, VLDB 2011)

A GFC rule contains two patterns connected by two anchored nodes.

slide-6
SLIDE 6

Rule Statistics

§ Given: $G = (V, E, L)$
§ GFC $\varphi: P(x, y) \rightarrow r(x, y)$
§ True facts $\Gamma^+$:

  • sampled from the edges $E$ in $G$.

§ False facts $\Gamma^-$:

  • sampled from node pairs $(v_x, v_y)$ that have no $r$ between them,
  • following the partial closed world assumption (PCA).

Statistical measures are defined in terms of the graph and a set of training facts.

slide-7
SLIDE 7

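Support and Confidence

GFC: $\varphi: P(x, y) \rightarrow r(x, y)$

§ $\text{supp}(\varphi) = \dfrac{|P(\Gamma^+) \cap r(\Gamma^+)|}{|r(\Gamma^+)|}$

Ratio of facts that can be covered out of $r(x, y)$ triples.

§ $\text{conf}(\varphi) = \dfrac{|P(\Gamma^+) \cap r(\Gamma^+)|}{|P(\Gamma^+)_N|}$

Ratio of facts that can be covered out of $(x, y)$ pairs, under PCA.

[Figure: node pairs $(v_x, v_y)$ with the matches of $P$ and the edges $r(v_x, v_y)$ marked on them, giving supp = 2/3 and conf = 1/2.]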

Support and confidence are for pattern mining.
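The two ratios can be made concrete with a small sketch. It assumes that the pairs matched by the pattern and the pairs connected by $r$ are already available as sets, and it reads the PCA denominator in the AMIE style: a matched pair only counts if its first node has at least one known $r$ edge. That reading, the function names, and the toy node names are assumptions of this sketch, chosen so that the numbers reproduce the supp = 2/3 and conf = 1/2 of the slide's example.

```python
from fractions import Fraction


def supp(covered, true_facts):
    """Covered true facts divided by all true facts with relation r."""
    return Fraction(len(covered & true_facts), len(true_facts))


def conf(covered, true_facts):
    """Covered true facts divided by covered pairs that count under PCA.

    Assumed PCA reading: a pair (x, y) matched by the pattern is kept in the
    denominator only if x has at least one known r edge.
    """
    subjects_with_r = {x for x, _ in true_facts}
    pca_pairs = {(x, y) for x, y in covered if x in subjects_with_r}
    if not pca_pairs:
        return Fraction(0)
    return Fraction(len(covered & true_facts), len(pca_pairs))


# Hypothetical node pairs: three true facts, five pairs matched by the pattern.
true_facts = {("a1", "b1"), ("a2", "b2"), ("a3", "b3")}
covered = {("a1", "b1"), ("a2", "b2"), ("a1", "b9"), ("a2", "b8"), ("a7", "b7")}

print(supp(covered, true_facts))  # 2/3
print(conf(covered, true_facts))  # 1/2
```

The pair `("a7", "b7")` is ignored by `conf` because `a7` has no known $r$ edge, so under PCA nothing can be said about it.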

slide-8
SLIDE 8

Significance

GFC: $\varphi: P(x, y) \rightarrow r(x, y)$

G-Test score: $\text{sig}(\varphi, p, n) = 2|\Gamma^+| \left( p \ln \dfrac{p}{n} + (1 - p) \ln \dfrac{1 - p}{1 - n} \right)$

$p$ and $n$ are the supports of $P(x, y)$ for positive and negative facts, respectively. A “rounded up” score $\max\{\text{sig}(\varphi, p, \delta), \text{sig}(\varphi, \delta, n)\}$ is used in practice, where $\delta$ is a small positive constant that prevents infinities. In our work, we also normalize it between 0 and 1 by a sigmoid function.

Significance is the ability to distinguish true and false facts.
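A minimal sketch of the G-test score, the rounded-up variant and the sigmoid normalization follows. The symbols `p`, `n` and `delta` name the positive support, the negative support and the small constant. Clamping `p` and `n` into `[delta, 1 - delta]` before taking logarithms is an extra guard added here so that the example always runs; the slides only state that a small positive constant is used.

```python
import math


def g_test(p, n, num_true):
    """2 * |Gamma+| * (p ln(p/n) + (1 - p) ln((1 - p)/(1 - n)))."""
    return 2 * num_true * (
        p * math.log(p / n) + (1 - p) * math.log((1 - p) / (1 - n))
    )


def rounded_sig(p, n, num_true, delta=1e-6):
    """max{sig(p, delta), sig(delta, n)} with inputs clamped away from 0 and 1."""
    p = min(max(p, delta), 1 - delta)  # assumed guard, not from the slides
    n = min(max(n, delta), 1 - delta)
    return max(g_test(p, delta, num_true), g_test(delta, n, num_true))


def normalized_sig(p, n, num_true, delta=1e-6):
    """Squash the rounded-up score with a sigmoid."""
    score = rounded_sig(p, n, num_true, delta)
    return 1.0 / (1.0 + math.exp(-score))


# A pattern that covers many true facts scores higher than one covering few.
print(rounded_sig(0.8, 0.1, 100) > rounded_sig(0.2, 0.1, 100))
print(normalized_sig(1.0, 0.0, 100))
```

Because the G-test score is non-negative, a plain sigmoid maps it into the upper half of the unit interval; any further rescaling would be a separate design choice that the slides do not describe.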

slide-9
SLIDE 9

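Diversity

$\Sigma$ is a set of GFCs.

$\text{div}(\Sigma) = \dfrac{1}{|\Gamma^+|} \sum_{t \in \Gamma^+} \sqrt{\sum_{\varphi \in \Phi_t(\Sigma)} \text{supp}(\varphi)}$

$\Phi_t(\Sigma)$ is the set of GFCs in $\Sigma$ that cover a true fact $t$.

E.g., $\Sigma_1 = \{\varphi_1, \varphi_2, \varphi_3\}$ and $\Sigma_2 = \{\varphi_4, \varphi_5, \varphi_6\}$ over three true facts $r(v_x, v_y)_1$, $r(v_x, v_y)_2$, $r(v_x, v_y)_3$, where each GFC covers two of the three facts:

$\text{div}(\Sigma_1) = 2 > \text{div}(\Sigma_2) = 1.6$

Diversity measures how non-redundant a set of GFCs is.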
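The diversity example can be replayed numerically. The sketch assumes the measure averages, over the true facts, the square root of the summed supports of the GFCs covering each fact. Which facts each GFC covers is not recoverable from the slide text, so the two assignments below are hypothetical: the first set spreads its rules over all three facts, the second piles all rules onto the same two. Support is taken as the raw number of covered facts (2 for every rule), which is what reproduces the slide's values of 2 and 1.6.

```python
import math


def div(rule_cover, true_facts, supp):
    """Average over true facts of sqrt(sum of supports of covering rules)."""
    total = 0.0
    for t in true_facts:
        covering = [phi for phi, facts in rule_cover.items() if t in facts]
        total += math.sqrt(sum(supp[phi] for phi in covering))
    return total / len(true_facts)


facts = ["t1", "t2", "t3"]

# Hypothetical coverage: every rule covers two of the three facts.
sigma1 = {"phi1": {"t1", "t2"}, "phi2": {"t2", "t3"}, "phi3": {"t1", "t3"}}
sigma2 = {"phi4": {"t1", "t2"}, "phi5": {"t1", "t2"}, "phi6": {"t1", "t2"}}

# Support as the raw count of covered facts (an assumption of this sketch).
supp = {phi: 2 for phi in list(sigma1) + list(sigma2)}

div1 = div(sigma1, facts, supp)
div2 = div(sigma2, facts, supp)
print(round(div1, 1), round(div2, 1))  # 2.0 1.6
```

The square root is what penalizes redundancy: three rules stacked on one fact contribute less than the same three rules spread over different facts.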

slide-10
SLIDE 10

Top-$k$ GFC Discovery Problem

Problem formulation: Given a graph $G$, a support threshold $\sigma$, a confidence threshold $\theta$, a set of true facts $\Gamma^+$, a set of false facts $\Gamma^-$, and an integer $k$, identify a size-$k$ set of GFCs $\Sigma$ such that:

(a) for each GFC $\varphi$ in $\Sigma$, $\text{supp}(\varphi) \geq \sigma$ and $\text{conf}(\varphi) \geq \theta$;
(b) $\text{cov}(\Sigma)$ is maximized.

To cope with diversity, the total significance is $\text{sig}(\Sigma) = \sum_{\varphi \in \Sigma} \text{sig}(\varphi)$.

Coverage function: $\text{cov}(\Sigma) = \text{sig}(\Sigma) + \text{div}(\Sigma)$

More significance, less redundancy.

slide-11
SLIDE 11

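Properties of $\text{cov}(\Sigma)$

§ $\text{cov}(\Sigma)$ is a set function.

Marginal gain: $\text{mg}(\varphi, \Sigma) = \text{cov}(\Sigma \cup \{\varphi\}) - \text{cov}(\Sigma)$

§ $\text{cov}(\Sigma)$ is monotone.

Adding elements to $\Sigma$ does not decrease $\text{cov}(\Sigma)$.

§ $\text{cov}(\Sigma)$ is submodular.

If $\Sigma_1 \subseteq \Sigma_2$ and $\varphi \notin \Sigma_2$, then $\text{mg}(\varphi, \Sigma_2) \leq \text{mg}(\varphi, \Sigma_1)$.

Submodularity is a good property for set optimization problems.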

slide-12
SLIDE 12

Discovery Algorithms

§ $\text{OPT} = \max \text{cov}(\Sigma)$

  • Cannot afford to enumerate every size-$k$ set of GFCs.
  • $\text{cov}(\Sigma)$ is a monotone submodular function.
  • A greedy algorithm can achieve a $(1 - \frac{1}{e})$ approximation of OPT.

§ GFC_batch:

1. Mine all the patterns satisfying support and confidence.
2. $\Sigma = \emptyset$
3. While $|\Sigma| < k$, do
4. Select the pattern $P$ with the largest marginal gain and add it to $\Sigma$.

GFC_batch: mining in batch and selecting greedily
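The greedy selection step of GFC_batch can be sketched generically for any monotone submodular set function. The toy `cov` below simply counts how many facts a set of rules covers; it is a stand-in chosen for illustration, not the significance-plus-diversity function from the slides, and the rule names and coverage sets are hypothetical.

```python
def greedy_top_k(candidates, cov, k):
    """Repeatedly add the candidate with the largest marginal gain."""
    selected = []
    remaining = list(candidates)
    while len(selected) < k and remaining:
        base = cov(selected)
        best = max(remaining, key=lambda phi: cov(selected + [phi]) - base)
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical mined rules and the facts each one covers.
covers = {
    "phi1": {1, 2, 3},
    "phi2": {3, 4, 6},
    "phi3": {1, 2},
    "phi4": {5},
}


def cov(rules):
    """Toy monotone submodular function: number of distinct covered facts."""
    return len(set().union(*(covers[phi] for phi in rules)))


top2 = greedy_top_k(covers, cov, k=2)
print(top2, cov(top2))  # ['phi1', 'phi2'] 5
```

Each round evaluates every remaining candidate, which is why the batch approach needs all patterns mined before selection can start.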

slide-13
SLIDE 13

Discovery Algorithms

§ GFC_batch is infeasible and slow.
§ Still, it requires mining all patterns first.
§ Can we do better?

GFC_stream: mining and selecting on-the-fly!

§ GFC_stream:

  • Interleave pattern generation and rule selection.
  • Find the top-$k$ GFCs on-the-fly.
  • One pass of pattern mining.
  • $(\frac{1}{2} - \epsilon)$ approximation of OPT

slide-14
SLIDE 14

Discovery Algorithms

[Diagram: PGen sends a pattern stream to PSel; PSel sends its decision back to PGen.]

Ø PGen: pattern generation

§ Generates patterns in a streaming fashion.
§ Passes the patterns on for selection.
§ Can be in any order, e.g., Apriori, DFS, or random.

Ø PSel: pattern selection

§ Selects and constructs GFCs on-the-fly.
§ Based on a “sieve” strategy, $(\frac{1}{2} - \epsilon)$ OPT.

1. Estimate the range of OPT by $\max\{\text{cov}(\varphi)\}$.
2. Maintain a set of sieves; each one is a size-$k$ sieve with an estimation $v$ for OPT.
3. While the sieves are not full
4. if $\text{mg}(\varphi, \Sigma) \geq \left(\frac{v}{2} - \text{cov}(\Sigma)\right) / (k - |\Sigma|)$, add $\varphi$ to sieve $\Sigma$.
5. Signal PGen to stop and output the sieve with the largest cov.

GFC_stream: mining and selecting on-the-fly!

Fast compute!
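The sieve strategy can be sketched as a simplified sieve-streaming loop. This version assumes the largest single-rule value `max_single` is known before the stream starts and builds a geometric grid of OPT estimates between `max_single` and `k * max_single`; how GFC_stream actually maintains its estimates is not spelled out on the slide. The toy `cov` and the rule names are hypothetical stand-ins.

```python
import math


def sieve_stream(stream, cov, k, max_single, eps=0.1):
    """One pass over the stream, keeping one size-k sieve per OPT estimate v."""
    lo = math.floor(math.log(max_single, 1 + eps))
    hi = math.ceil(math.log(k * max_single, 1 + eps))
    sieves = {(1 + eps) ** i: [] for i in range(lo, hi + 1)}
    for phi in stream:
        for v, sieve in sieves.items():
            if len(sieve) < k:
                gain = cov(sieve + [phi]) - cov(sieve)
                # Threshold from the slide: (v/2 - cov(S)) / (k - |S|).
                if gain >= (v / 2 - cov(sieve)) / (k - len(sieve)):
                    sieve.append(phi)
        if all(len(s) == k for s in sieves.values()):
            break  # every sieve is full: the generator can stop
    return max(sieves.values(), key=cov)


# Hypothetical rules and the facts each one covers.
covers = {
    "phi1": {1, 2, 3},
    "phi2": {3, 4, 6},
    "phi3": {1, 2},
    "phi4": {5},
}


def cov(rules):
    """Toy monotone submodular function: number of distinct covered facts."""
    return len(set().union(*(covers[phi] for phi in rules)))


result = sieve_stream(["phi3", "phi4", "phi1", "phi2"], cov, k=2, max_single=3)
print(result, cov(result))
```

Unlike the greedy batch selection, each pattern is examined once when it arrives, so selection can run while patterns are still being generated.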

slide-15
SLIDE 15

GFC-based fact checking

Ø GFactR: Using GFCs as rules:

§ Invokes GFC_stream to find top-$k$ GFCs.
§ “Hit and miss”:
  • True if a fact is covered by one GFC.
  • False if no GFC can cover the fact.
§ A typical rule model to compare with: AMIE+

Ø GFact: Using GFCs in supervised link prediction:

§ A feature vector of size $k$.
§ Each entry encodes the presence of one GFC.
§ Build a classifier, by default, Logistic Regression.
§ A typical rule model to compare with: PRA
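The two ways of using GFCs can be contrasted in a few lines. In this sketch each GFC is reduced to a hypothetical predicate over precomputed evidence about a candidate fact; real GFCs would be graph patterns matched against the knowledge graph, and the evidence keys below are invented for illustration.

```python
def feature_vector(evidence, rules):
    """GFact: one 0/1 entry per GFC, set when that GFC covers the fact."""
    return [1 if rule(evidence) else 0 for rule in rules]


def gfactr_predict(evidence, rules):
    """GFactR "hit and miss": true if at least one GFC covers the fact."""
    return any(rule(evidence) for rule in rules)


# Hypothetical stand-ins for k = 2 discovered GFCs.
rules = [
    lambda ev: ev.get("speech_cites_book_on_same_topic", False),
    lambda ev: ev.get("shares_teacher", False),
]

cicero_plato = {"speech_cites_book_on_same_topic": True}
unrelated_pair = {}

print(feature_vector(cicero_plato, rules), gfactr_predict(cicero_plato, rules))
print(feature_vector(unrelated_pair, rules), gfactr_predict(unrelated_pair, rules))
```

For GFact, vectors like these would be passed with true and false labels to a classifier such as logistic regression, which can weigh GFCs differently instead of treating any single hit as decisive.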

slide-16
SLIDE 16

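Experiment settings

| Dataset  | Category         | $|V|$  | $|E|$  | # node labels | # edge labels | # $\langle x, r, y \rangle$ |
|----------|------------------|--------|--------|---------------|---------------|-----------------------------|
| Yago     | Knowledge base   | 2.1 M  | 4.0 M  | 2273          | 33            | 15.5 K                      |
| DBpedia  | Knowledge base   | 2.2 M  | 7.4 M  | 73            | 584           | 8240                        |
| Wikidata | Knowledge base   | 10.8 M | 41.4 M | 18383         | 693           | 209 K                       |
| MAG      | Academic network | 0.6 M  | 1.71 M | 8665          | 6             | 11742                       |
| Offshore | Social network   | 1.0 M  | 3.3 M  | 356           | 274           | 633                         |

| Tasks              | Rule Mining                                  | Fact Checking                            |
|--------------------|----------------------------------------------|------------------------------------------|
| Our methods        | GFC_batch, GFC_stream                        | GFact, GFactR                            |
| Baselines          | AMIE+, PRA                                   | AMIE+, PRA, KGMiner                      |
| Evaluation metrics | running time vs. $|E|$, $|\Gamma^+|$         | prediction rate, precision, recall, F1   |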

slide-17
SLIDE 17

Experiment: efficiency

[Figure: running time in seconds (log scale, 1 to 10^4) of GFC_stream, GFC_batch, AMIE+ and PRA, varying $|E|$ from 0.6M to 1.8M (DBpedia).]

[Figure: running time in seconds (0.5K to 2K) of GFC_stream, GFC_batch, AMIE+ and PRA, varying $|\Gamma^+|$ from 3K to 15K (DBpedia).]

Overview

§ GFC_stream takes 25.7 seconds to discover 200 GFCs over Wikidata with 41.4 million edges and 6000 training facts.
§ On average, GFC_stream is 3.2 times faster than AMIE+ over DBpedia.

slide-18
SLIDE 18

Experiment: effectiveness

[Figure: prediction rate (0.6 to 1) of GFact and GFactR, varying $|\Gamma^+|$ from 75K to 135K (Wikidata).]

[Figure: prediction rate (0.6 to 1) of GFact and GFactR, varying $k$ from 50 to 250 (Wikidata).]

Compared with AMIE+, PRA and KGMiner, respectively, on average:

§ GFact achieves additional 30%, 20%, and 5% gains in precision over DBpedia.
§ GFact achieves additional 20%, 15%, and 16% gains in F1-score over Wikidata.

slide-19
SLIDE 19

Case study: are two anonymous companies the same? (Offshore)

GFC

[Figure: a GFC pattern over two anonymous company nodes (A. Company) $x$ and $y$, with (officer), (address), (place) and (jurisdiction) nodes and the edges shareholder, beneficiary, isActiveIn, registeredIn and isIn.]

  • If an officer is both a shareholder of company $x$ and a beneficiary of company $y$, and $x$ has an address and is registered through a jurisdiction in a place, and $y$ is active in the same place, then they are likely to be the same anonymous company.

AMIE+

  • registerIn($x$, $z$) $\wedge$ registerIn($y$, $z$) $\Rightarrow$ isSameAs($x$, $y$)
  • If two anonymous companies are registered in the same place, then they are the same.
  • Low accuracy.

slide-20
SLIDE 20

Conclusions and future work

Ø Graph Fact Checking Rules (GFCs)

Ø Top-$k$ GFCs discovery problem

§ Maximize a submodular cov function.

Ø A stream-based rule discovery algorithm

§ One pass, $(\frac{1}{2} - \epsilon)$ OPT

Ø Evaluation of GFC-based techniques

§ Rule models, fact checking (2 methods), efficiency, and case studies.

Ø Our future work: scalable GFC-based methods

§ Parallel mining, distributed learning

slide-21
SLIDE 21

Discovering Graph Patterns for Fact Checking in Knowledge Graphs

Thank you!

Related work: Gstream (IEEE BigData 2017). Mohammad Hossein Namaki, Peng Lin, Yinghui Wu. “Event Pattern Discovery by Keywords in Graph Streams.” https://ieeexplore.ieee.org/abstract/document/8258019/