SLIDE 1

Entity Linking to Knowledge Graphs to Infer Column Types and Properties

Avijit Thawani, Minda Hu, Erdong Hu, Husain Zafar, Naren Teja Divvala, Amandeep Singh, Ehsan Qasemi, Pedro Szekely, and Jay Pujara

SLIDE 2

About Us

Team ISI:

  • Information Sciences Institute
  • University of Southern California

Me:

  • PhD student, USC
SLIDE 3

Outline

1. CEA
2. tf-idf
3. CTA and CPA
4. Shortcomings
5. Analysis
6. Appendix: PSL

SLIDE 4
  • 1. CEA
SLIDE 5

Objective: CEA

Mark Knopfler -> dbp.org/resource/Mark_Knopfler
Super Furry Animals -> dbp.org/resource/Super_Furry_Animals
The Killers -> dbp.org/resource/The_Killers
Brian Wilson -> dbp.org/resource/Brian_Wilson
AlunaGeorge -> dbp.org/resource/AlunaGeorge

SLIDE 6

Approach: CEA

SLIDE 7
SLIDE 8

Lots of Cues

SLIDE 9

Lots of Cues

  • Class
SLIDE 10

Lots of Cues

  • Class
  • Properties
SLIDE 11

Lots of Cues

  • Class
  • Properties
  • Values
SLIDE 12

Lots of Cues

  • Class
  • Properties
  • Values
SLIDE 13

Lots of Cues

  • Class
  • Properties
  • Values

instanceOf: Human

SLIDE 14

Lots of Cues

  • Class
  • Properties
  • Values
SLIDE 15

Lots of Cues

  • Class
  • Properties
  • Values
Occupation: Singer
SLIDE 16

Lots of Cues

  • Class
  • Properties
  • Values
SLIDE 17

Lots of Cues

  • Class
  • Properties
  • Values

Record Label: ...

SLIDE 18

Lots of Cues

  • Class
  • Properties
  • Values
SLIDE 19

Lots of Cues (Features)

  • Class
  • Properties
  • Values
SLIDE 20

What to do with all those Features?

SLIDE 21

What to do with all those Features?

If labelled data -> Machine Learning

SLIDE 22

What to do with all those Features?

If labelled data -> Machine Learning

Feature:   Human?   Occ: Singer?   Record Label?   ...   Chef?
Value:     1        1              1               ...
Weight:    20       30             10              ...   0.5

Confidence = 60
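
The confidence on this slide reads as a weighted sum of binary feature indicators. A minimal sketch of that arithmetic follows. The value for "Chef?" is not legible on the slide, so 0 is assumed here, and the features hidden behind "..." are left out; the function name `confidence` is hypothetical.

```python
# Feature values and weights as read off the slide. The "Chef?" value is
# missing in the source, so 0 is an assumption made for this sketch.
features = {"Human?": 1, "Occ: Singer?": 1, "Record Label?": 1, "Chef?": 0}
weights = {"Human?": 20, "Occ: Singer?": 30, "Record Label?": 10, "Chef?": 0.5}


def confidence(features, weights):
    """Weighted sum of binary feature indicators."""
    return sum(weights[name] * value for name, value in features.items())


score = confidence(features, weights)
print(score)
```

With labelled data, the weights would be learned rather than set by hand, which is the point the slide makes.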

SLIDE 23

What to do with all those Features?

If labelled data -> Machine Learning

SLIDE 24

What to do with all those Features?

If labelled data -> Machine Learning
If not ->

Image Source: icon-library.net

SLIDE 25

What to do with all those Features?

If labelled data -> Machine Learning
If not -> Heuristics!

SLIDE 26
  • 2. tf-idf
SLIDE 27

Image Source: becominghuman.ai blog

SLIDE 28

Properties (feature columns): genre | family name | record label | discography | dbo:MusicalArtist
Score columns: TF/IDF | Levenshtein

Entities:
Q313013 (Brian Wilson, musician): features 1 1 1 1 1; TF/IDF 0.98; Levenshtein 1.0
Q913269 (Brian Wilson, baseball player): features 1; TF/IDF 0.64; Levenshtein 1.0
Q1135582 (Super Furry Animals, band): features 1 1 1 1; TF/IDF 0.23; Levenshtein 1.0
Q7642367 (Super Furry Animals Discography): TF/IDF 0.0; Levenshtein 0.61
Q185343 (Mark Knopfler, musician): features 1 1 1 1 1; TF/IDF 0.99; Levenshtein 1.0

DF = document frequency: 52 31 36 15 49
IDF = log: 3.20 1.85 1.65 3.46 2.11
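
The table scores each candidate entity by the features (properties and classes) it carries. A minimal sketch of one way such a score could be computed follows. The slides do not give the exact formula, so the weighting used here is an assumption: the share of the column's candidates that carry a feature, multiplied by a log inverse frequency over a background collection. The names `tfidf_scores`, `background_df`, `background_size`, the feature sets, and all counts are hypothetical.

```python
import math
from collections import Counter


def tfidf_scores(candidates, background_df, background_size):
    """Score candidates by tf-idf-weighted features, scaled so the best is 1.

    candidates: dict of candidate id -> set of feature names
    background_df: dict of feature name -> count in a background collection
    background_size: number of items in that background collection
    """
    n = len(candidates)
    # "Term frequency": how many of the column's candidates carry the feature.
    tf = Counter(f for feats in candidates.values() for f in feats)
    weight = {
        f: (tf[f] / n) * math.log(background_size / background_df[f])
        for f in tf
    }
    raw = {c: sum(weight[f] for f in feats) for c, feats in candidates.items()}
    top = max(raw.values(), default=0.0)
    return {c: (s / top if top > 0 else 0.0) for c, s in raw.items()}


# Toy input: illustrative feature sets, not the ones behind the slide.
music = {"genre", "family name", "record label", "discography", "dbo:MusicalArtist"}
candidates = {
    "Q313013": set(music),
    "Q913269": {"family name"},
    "Q185343": set(music),
}
background_df = {
    "genre": 100,
    "family name": 500,
    "record label": 80,
    "discography": 50,
    "dbo:MusicalArtist": 120,
}
scores = tfidf_scores(candidates, background_df, background_size=1000)
```

In this toy run the two musician candidates outscore the baseball player, because the music-related features are shared across the column and rare in the background collection.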

SLIDE 29
SLIDE 30
  • 3. CTA and CPA
SLIDE 31

Objective: CTA

Column cells: Auckland Los Angeles California ... Waikato District
Column type: dbp.org/ontology/Settlement
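
The CTA objective is to pick one type for the whole column. The slides do not spell out the team's approach, so the following is only a sketch of a common baseline: a vote over the classes of the entities already linked in the column. The function `infer_column_type`, the entity keys, and the class assignments are hypothetical.

```python
from collections import Counter


def infer_column_type(linked_entities, entity_classes):
    """Return the class shared by the most linked entities, or None."""
    votes = Counter()
    for entity in linked_entities:
        # Entities with no known classes simply cast no vote.
        votes.update(entity_classes.get(entity, ()))
    if not votes:
        return None
    return votes.most_common(1)[0][0]


# Toy class assignments, made up for illustration.
entity_classes = {
    "Auckland": ["dbp.org/ontology/Settlement", "dbp.org/ontology/City"],
    "Los_Angeles": ["dbp.org/ontology/Settlement", "dbp.org/ontology/City"],
    "Waikato_District": ["dbp.org/ontology/Settlement"],
}
column = ["Auckland", "Los_Angeles", "Waikato_District"]
column_type = infer_column_type(column, entity_classes)
```

A plain vote ignores the class hierarchy; a real system would likely need to trade off specific against general classes.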

SLIDE 32

Approach: CTA

SLIDE 33

CPA

SLIDE 34

Results: CEA

Round      F1      Precision
Round 1    0.884   0.908
Round 2    0.826   0.852
Round 3    0.857   0.866
Round 4    0.804   0.814

SLIDE 35
  • 4. Shortcomings
SLIDE 36

Shortcomings

SLIDE 37

Shortcomings

Another pass needed

SLIDE 38

Shortcomings

Another pass needed
Custom handling of data types

SLIDE 39

Shortcomings

Another pass needed
Custom handling of data types
Intra-row information

SLIDE 40
  • 5. Analysis
SLIDE 41

Analysis: # Rows

SLIDE 42

Analysis: # Rows

SLIDE 43

Analysis: # Rows

SLIDE 44

Analysis: Custom Handling

SLIDE 45

Analysis: Embeddings

Levenshtein Similarity

tf-idf

SLIDE 46

Analysis: Embeddings

Levenshtein Similarity
tf-idf on Property feature
tf-idf on Class feature
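
Levenshtein similarity compares a cell's text with a candidate's label character by character. A minimal sketch follows; dividing the edit distance by the length of the longer string is an assumed normalization, and the function names are hypothetical.

```python
def levenshtein(a, b):
    """Edit distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def similarity(a, b):
    """1.0 for identical strings, lower as more edits are needed."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


sim_same = similarity("Brian Wilson", "Brian Wilson")
sim_disco = similarity("Super Furry Animals", "Super Furry Animals discography")
```

Here `sim_same` is 1.0 for both the musician and the baseball player named Brian Wilson, which is why string similarity alone cannot separate them, while `sim_disco` comes out at roughly 0.61.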

SLIDE 47

Takeaways

SLIDE 48

Takeaways

  • Lots of Semantic Cues (not just classes)
SLIDE 49

Takeaways

  • Lots of Semantic Cues (not just classes)
  • When no data -> TF-IDF
SLIDE 50

Takeaways

  • Lots of Semantic Cues (not just classes)
  • When no data -> TF-IDF
  • Revising always good
SLIDE 51

Takeaways

  • Lots of Semantic Cues (not just classes)
  • When no data -> TF-IDF
  • Revising always good
  • Over-revising is overkill (PSL)
SLIDE 52

Takeaways

  • Lots of Semantic Cues (not just classes)
  • When no data -> TF-IDF
  • Revising always good
  • Over-revising is overkill (PSL)
  • String Similarity ⊥ Semantic Similarity
SLIDE 53

Fin.

Avijit Thawani

PhD student with Pedro Szekely and Jay Pujara
thawani@isi.edu

Thank You

kia mihi

SLIDE 54

Appendix

SLIDE 55

PSL

Graphical Model = Several passes!

SLIDE 56

Probabilistic Soft Logic

PSL is a

  • Probabilistic Programming Language

for easily defining

  • Hinge Loss Markov Random Fields
  • using a syntax like First Order Logic.
SLIDE 57

PSL in one slide

SLIDE 58

PSL in one slide

Define closed predicates:

  • instance(madonna, Singer)
    instance(st_madonna, Saint) …
  • candidate(R3C1, madonna)
    candidate(R3C1, st_madonna) …

SLIDE 59

PSL in one slide

Define closed predicates:

  • instance(madonna, Singer)
    instance(st_madonna, Saint) …
  • candidate(R3C1, madonna)
    candidate(R3C1, st_madonna) …

Define open predicates:

  • type(C1, Singer)?
    type(C1, Saint)?
  • entity(R3C1, madonna)?
    entity(R3C1, st_madonna)?

SLIDE 60

PSL in one slide

Define closed predicates:

  • instance(madonna, Singer)
    instance(st_madonna, Saint) …
  • candidate(R3C1, madonna)
    candidate(R3C1, st_madonna) …

Define open predicates:

  • type(C1, Singer)?
    type(C1, Saint)?
  • entity(R3C1, madonna)?
    entity(R3C1, st_madonna)?

Restrict with PSL rules:

  • 10: candidate(RxCy, Qz) -> entity(RxCy, Qz)
  • 20: candidate(RxCy, Qz) & type(Cy, Tw) & instance(Qz, Tw) -> entity(RxCy, Qz)
  • entity(RxCy, Q1) & Q1 != Q2 -> !entity(RxCy, Q2) .
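
Each weighted rule above becomes a hinge-loss penalty over soft truth values in [0, 1]. A minimal sketch of how one grounding of the weight-20 rule could be scored follows, using the Lukasiewicz relaxation that PSL is built on. The truth values are made up for illustration, and the helper names are hypothetical; this is not the PSL library's API.

```python
def luk_and(*values):
    """Lukasiewicz conjunction of soft truth values in [0, 1]."""
    return max(0.0, sum(values) - (len(values) - 1))


def distance_to_satisfaction(body, head):
    """How far the rule body -> head is from being satisfied (0 = satisfied)."""
    return max(0.0, luk_and(*body) - head)


# One grounding of: 20: candidate & type & instance -> entity
candidate = 1.0        # closed predicate, observed
instance = 1.0         # closed predicate, observed
type_c1_singer = 0.5   # open predicate, current guess (illustrative)
entity = 0.25          # open predicate, current guess (illustrative)

d = distance_to_satisfaction([candidate, type_c1_singer, instance], entity)
penalty = 20 * d
```

Inference then searches for the open-predicate values that minimize the total penalty across all groundings, subject to the unweighted rule ending in `.`, which acts as a hard constraint.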
SLIDE 61

PSL output

class(C1, Singer): 0.12
class(C1, Saint): 0.89
entity(R3C1, madonna): 0.23
entity(R3C1, st_madonna): 0.68

SLIDE 62

1st result baseline

F1: 0.865 Precision: 0.871 Recall: 0.858 (7 datasets annotated by us)

SLIDE 63

PSL results

F1: 0.903 Precision: 0.910 Recall: 0.896 (7 datasets annotated by us)

SLIDE 64

PSL without ranked priors

F1: 0.777 Precision: 0.783 Recall: 0.771 (7 datasets annotated by us)
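
The three appendix results each report F1 alongside precision and recall. As a small sanity check, the sketch below recomputes F1 as the harmonic mean of the reported precision and recall and compares it with the reported F1. The tolerance of 0.001 is an assumption that allows for the inputs being rounded to three decimals.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    total = precision + recall
    return 0.0 if total == 0 else 2 * precision * recall / total


# (precision, recall, reported F1) as shown on the slides.
reported = {
    "1st result baseline": (0.871, 0.858, 0.865),
    "PSL results": (0.910, 0.896, 0.903),
    "PSL without ranked priors": (0.783, 0.771, 0.777),
}

gaps = {
    name: abs(f1(p, r) - reported_f1)
    for name, (p, r, reported_f1) in reported.items()
}
for name, gap in gaps.items():
    print(f"{name}: |recomputed F1 - reported F1| = {gap:.4f}")
```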