Mayank Kejriwal Information Sciences Institute/USC kejriwal@isi.edu




SLIDE 1

Mayank Kejriwal

Information Sciences Institute/USC kejriwal@isi.edu http://usc-isi-i2.github.io/kejriwal/

SLIDE 2
SLIDE 3
SLIDE 4

Given one or more attribute-rich graphs and a training set of linked node pairs, how do we avoid evaluating all O(|V|^2) node pairs?

SLIDE 5

Idea: Candidate Generation via blocking

  • 'Exhaustive' set over Dataset 1 and Dataset 2: 4 x 6 = 24 pairs
  • Apply blocking key, e.g. Tokens(LastName), to assign records to blocks (Blocks 1 to 5 in the figure)
  • Generate candidate set (7 pairs), apply similarity function on each pair
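The blocking step on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the records, the `last_name` field, and the helper names `tokens` and `candidate_pairs` are invented here and are not the data or code behind the slide.

```python
from collections import defaultdict
from itertools import product


def tokens(value):
    """Blocking key: the set of lowercase whitespace-separated tokens."""
    return set(value.lower().split())


def candidate_pairs(dataset1, dataset2, field="last_name"):
    """Return index pairs (i, j) whose records share at least one token."""
    # Each block holds the record indices from both datasets for one token.
    blocks = defaultdict(lambda: (set(), set()))
    for i, rec in enumerate(dataset1):
        for tok in tokens(rec[field]):
            blocks[tok][0].add(i)
    for j, rec in enumerate(dataset2):
        for tok in tokens(rec[field]):
            blocks[tok][1].add(j)
    pairs = set()
    for left, right in blocks.values():
        pairs.update(product(left, right))
    return pairs


# Toy records, invented for illustration (4 and 6 records, as on the slide).
dataset1 = [{"last_name": n} for n in ["Smith", "Van Dyke", "Jones", "Lee"]]
dataset2 = [{"last_name": n} for n in
            ["smith", "Dyke", "Jones Smith", "Brown", "Lee", "Kim"]]

pairs = candidate_pairs(dataset1, dataset2)
exhaustive = len(dataset1) * len(dataset2)
print(f"{len(pairs)} candidate pairs instead of {exhaustive}")
```

Only the pairs in `pairs` would then be passed to the (more expensive) similarity function.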

SLIDE 6

Even better…learn candidate generation function

  • Doing it efficiently without losing (much) expressive power: Disjunctive Normal Form (DNF) blocking keys
  • Example: CharTriGrams(Last_Name) U (Numbers(Address) X Last4Chars(SSN))
  • Use functional elements like CharTriGrams to construct complex blocking keys
  • Optimal search is NP-Complete; use a greedy approximation with guarantees
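A rough sketch of how such a key might be represented and learned follows. It assumes that "U" denotes disjunction (a pair is a candidate if any term matches) and "X" denotes conjunction (all predicates in a term must agree). The greedy loop is a generic set-cover style heuristic written for illustration; it is not claimed to be the algorithm used in the talk, and all function names, the `max_neg` parameter, and the records are hypothetical.

```python
import re
from itertools import product


def char_trigrams(field):
    def f(rec):
        s = rec[field].lower()
        return {s[i:i + 3] for i in range(len(s) - 2)}
    return f


def numbers(field):
    return lambda rec: set(re.findall(r"\d+", rec[field]))


def last4(field):
    return lambda rec: {rec[field][-4:]}


def conj_keys(conj, rec):
    """Keys of one conjunction: cross product of its predicates' values."""
    return set(product(*(p(rec) for p in conj)))


def covers(conj, a, b):
    """True if records a and b share a key under this conjunction."""
    return bool(conj_keys(conj, a) & conj_keys(conj, b))


def dnf_covers(dnf, a, b):
    """A DNF key covers a pair if any of its conjunctions does."""
    return any(covers(c, a, b) for c in dnf)


def learn_dnf(candidates, positives, negatives, max_neg=0):
    """Greedy sketch: repeatedly add the conjunction that covers the most
    still-uncovered linked pairs, ignoring conjunctions that cover more
    than max_neg non-linked pairs (an assumed, simplified criterion)."""
    usable = [c for c in candidates
              if sum(covers(c, a, b) for a, b in negatives) <= max_neg]
    uncovered, dnf = list(positives), []
    while uncovered and usable:
        best = max(usable,
                   key=lambda c: sum(covers(c, a, b) for a, b in uncovered))
        if not any(covers(best, a, b) for a, b in uncovered):
            break
        dnf.append(best)
        usable.remove(best)
        uncovered = [(a, b) for a, b in uncovered if not covers(best, a, b)]
    return dnf


# Toy records with obviously fake values, invented for illustration.
a1 = {"last_name": "Smith", "address": "12 Oak St", "ssn": "000-00-1234"}
a2 = {"last_name": "Smyth", "address": "12 Oak Street", "ssn": "000-00-1234"}
b1 = {"last_name": "Johnson", "address": "7 Elm Rd", "ssn": "000-00-5678"}
b2 = {"last_name": "Johnsen", "address": "9 Pine Ave", "ssn": "000-00-9999"}

candidates = [
    (char_trigrams("last_name"),),           # CharTriGrams(Last_Name)
    (numbers("address"), last4("ssn")),      # Numbers(Address) X Last4Chars(SSN)
]
dnf = learn_dnf(candidates,
                positives=[(a1, a2), (b1, b2)],
                negatives=[(a1, b1)])
```

On this toy input neither conjunction alone covers both linked pairs, so the learned key is the disjunction of the two, mirroring the shape of the example on the slide.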

SLIDE 7

Some results

All values in %.

                 DNF blocking for RDF              Attribute Clustering (AC)
Name             Recall   Reduction   FMeasure     Recall   Reduction   FMeasure
Persons 1        100      99.75       99.88        100      98.86       99.43
Persons 2        99.00    99.79       99.39        99.75    99.02       99.38
Restaurants      100      99.73       99.87        100      95.57       99.79
Eprints-Rexa     98.16    99.28       98.72        99.60    99.37       99.48
IM-Similarity    100      98.14       99.06        100      62.79       77.14
IIMB-059         99.76    93.35       96.45        97.33    73.09       83.49
IIMB-062         47.73    98.11       64.22        77.27    90.80       83.49
Libraries        97.96    99.99       98.96        99.99    99.87       99.93
Parks            95.96    94.41       95.18        99.07    88.27       93.36
Video Game       98.73    99.96       99.34        99.72    99.85       99.79
Average          93.73    98.25       95.11        97.27    91.15       93.53
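The slides do not define the three metrics. The sketch below assumes the usual blocking definitions: Recall as the fraction of true links that survive blocking, Reduction as one minus the ratio of candidate pairs to exhaustive pairs, and FMeasure as the harmonic mean of the two. Most rows of the table are consistent with that harmonic mean up to rounding (for example IM-Similarity under AC: 100 and 62.79 give about 77.14), but this reading is an inference, and the function names are hypothetical.

```python
def pairs_completeness(candidates, true_matches):
    """Recall of blocking: share of true links kept in the candidate set."""
    true_matches = set(true_matches)
    return len(true_matches & set(candidates)) / len(true_matches)


def reduction_ratio(n_candidates, n_exhaustive):
    """Share of the exhaustive pair set that blocking avoids comparing."""
    return 1.0 - n_candidates / n_exhaustive


def f_measure(recall, reduction):
    """Harmonic mean of recall and reduction (same scale for both)."""
    if recall + reduction == 0:
        return 0.0
    return 2 * recall * reduction / (recall + reduction)


# Slide 5 example: 7 candidate pairs out of 4 x 6 = 24 exhaustive pairs.
rr = reduction_ratio(7, 24)

# Check against one table row (IM-Similarity, Attribute Clustering).
fm = f_measure(100.0, 62.79)
print(f"reduction = {rr:.4f}, f-measure = {fm:.2f}")
```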