Type inference through the analysis of Wikipedia links Andrea Giovanni Nuzzolese nuzzoles@cs.unibo.it Aldo Gangemi aldo.gangemi@cnr.it Valentina Presutti valentina.presutti@cnr.it Paolo Ciancarini ciancarini@cs.unibo.it stlab.istc.cnr.it


SLIDE 1

stlab.istc.cnr.it

Type inference through the analysis of Wikipedia links

Andrea Giovanni Nuzzolese

nuzzoles@cs.unibo.it

Aldo Gangemi

aldo.gangemi@cnr.it

Valentina Presutti

valentina.presutti@cnr.it

Paolo Ciancarini

ciancarini@cs.unibo.it

16 April 2012 - Lyon, France - LDOW 2012

SLIDE 2

Outline

  • Motivations
  • Materials
  • Applied methods
  • Results
  • Conclusions

SLIDE 3

Motivations

✦ Only a subset of the DBpedia resources is typed with the DBpedia ontology (DBPO)
✦ The typing procedure is top-down.
✦ Is the DBPO complete with respect to the DBpedia domain?
✦ How good and homogeneous is the granularity of DBPO types?

Resources used in wikilinks relations: 15,944,381

Resources having a DBPO type: 1,518,697

SLIDE 4

Materials

DBpedia 3.6
DBpedia ontology: 272 classes

Dataset                                   # of triples
wikilink triples                          107,892,317
infobox mapping-based “data” triples      9,357,273
rdfs:label triples                        7,972,225
rdf:type triples                          6,173,940
infobox mapping-based “object” triples    4,251,239

Wikilink triples: 107,892,317
Wikilink triples with typed subject/object: 16,745,830
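The counts on the Motivations and Materials slides can be turned into coverage percentages with a few lines of arithmetic. The sketch below is only an illustration; the variable and function names are ours, not part of the original work.

```python
# Counts quoted on the Motivations and Materials slides (DBpedia 3.6).
resources_in_wikilinks = 15_944_381
resources_with_dbpo_type = 1_518_697
wikilink_triples = 107_892_317
wikilink_triples_typed = 16_745_830


def share(part: int, whole: int) -> float:
    """Return part/whole expressed as a percentage."""
    return 100.0 * part / whole


typed_resource_share = share(resources_with_dbpo_type, resources_in_wikilinks)
typed_triple_share = share(wikilink_triples_typed, wikilink_triples)

print(f"Resources with a DBPO type: {typed_resource_share:.1f}%")
print(f"Wikilink triples with typed subject/object: {typed_triple_share:.1f}%")
```

Under these figures, roughly one resource in ten carries a DBPO type, and roughly one wikilink triple in six has typed subject/object.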

SLIDE 5

What we did

  • Wikilinks of a DBpedia resource convey knowledge that can be used for classifying it.
  • Classification methods
    ✦ Inductive learning: k-Nearest Neighbor algorithm
    ✦ Abductive classification based on EKPs [1] and homotypes used as background knowledge
  • The methods were performed on
    Sample of untyped resources: 1,000
    Resources used in wikilinks relations: 15,944,381
    Resources having a DBPO type: 1,518,697

[1] A. G. Nuzzolese, A. Gangemi, V. Presutti, and P. Ciancarini. Encyclopedic Knowledge Patterns from Wikipedia Links. In L. Aroyo, N. Noy, and C. Welty, editors, Proceedings of the 10th International Semantic Web Conference (ISWC2011), pages 520-536. Springer, 2011.

SLIDE 6

Inductive classification

  • We designed two inductive classification experiments based on the k-NN algorithm
    ✦ on 272 features, i.e., all the classes in the DBPO
    ✦ on 27 features, i.e., the top-level classes in the DBPO hierarchy
  • For each experiment we built a labeled feature space model as training set by using a randomly sampled 20% of typed resources
    ✦ the algorithms were tested on the remaining 80% of typed resources
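A 20%/80% random split of this kind could be sketched as follows. This is a minimal illustration, not the authors' code: the function name, the fixed seed, and the placeholder resource identifiers are all assumptions.

```python
import random


def split_typed_resources(resources, train_fraction=0.2, seed=42):
    """Shuffle a copy of `resources` and split it into (training, test) lists."""
    shuffled = list(resources)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Placeholder identifiers standing in for typed DBpedia resources.
typed = [f"dbpedia:Resource_{i}" for i in range(10)]
train, test = split_typed_resources(typed)
print(len(train), len(test))
```

Note the unusual direction of the split: the smaller part is used for training and the larger part for testing, as stated on the slide.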

SLIDE 7

Building the training set for K-Nearest Neighbor algorithm

[Figure: dbpedia:Steve_Jobs connected by dbpo:wikiPageWikiLink to dbpedia:Apple_Inc., dbpedia:NeXT, dbpedia:Cupertino,_California and dbpedia:Forbes]

[Table: columns Mammal, Scientist, Company, Drug, City, Magazine, ..., Class; row dbpedia:Steve_Jobs, values not yet filled in]

SLIDE 8

Building the training set for K-Nearest Neighbor algorithm

[Figure: dbpedia:Steve_Jobs connected by dbpo:wikiPageWikiLink to dbpedia:Apple_Inc., dbpedia:NeXT, dbpedia:Cupertino,_California and dbpedia:Forbes; rdf:type edges lead to dbpo:Organisation, dbpo:Magazine and dbpo:City, and dbpedia:Steve_Jobs has rdf:type dbpo:Person]

[Table: columns Mammal, Scientist, Company, Drug, City, Magazine, ..., Class; row dbpedia:Steve_Jobs with Class dbpo:Person]

SLIDE 9

Building the training set for K-Nearest Neighbor algorithm

[Figure: dbpedia:Steve_Jobs with dbpo:Organisation, dbpo:Magazine and dbpo:City; edge labels dbpo:wikiPageWikiLink, rdf:type and kp:linksTo]

[Table: columns Mammal, Scientist, Company, Drug, City, Magazine, ..., Class; row dbpedia:Steve_Jobs with Class dbpo:Person and three feature values set to 1]

SLIDE 10

Building the training set for K-Nearest Neighbor algorithm

[Figure: dbpedia:Steve_Jobs with dbpo:Organisation, dbpo:Magazine and dbpo:City; edge labels dbpo:wikiPageWikiLink, rdf:type and kp:linksTo]

[Table: columns Mammal, Scientist, Company, Drug, City, Magazine, ..., Class; row dbpedia:Steve_Jobs with Class dbpo:Person and three feature values set to 1, followed by further rows for other resources]
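The construction shown in the Steve Jobs example can be sketched as follows: each typed resource becomes a row whose binary features record which types it links to, labeled with its own class. This is a minimal sketch under stated assumptions. The pairing of each linked resource with a type, the four-feature toy space, and all function names are hypothetical; the experiments used 272 or 27 DBPO classes as features.

```python
# Wikilinks and rdf:type assertions loosely following the Steve Jobs example.
# Which linked resource carries which type is an assumption of this sketch.
wikilinks = {
    "dbpedia:Steve_Jobs": [
        "dbpedia:Apple_Inc.", "dbpedia:NeXT",
        "dbpedia:Cupertino,_California", "dbpedia:Forbes",
    ],
}
types = {
    "dbpedia:Steve_Jobs": "dbpo:Person",
    "dbpedia:Apple_Inc.": "dbpo:Organisation",
    "dbpedia:NeXT": "dbpo:Organisation",
    "dbpedia:Cupertino,_California": "dbpo:City",
    "dbpedia:Forbes": "dbpo:Magazine",
}
# Toy feature space (hypothetical selection of classes).
features = ["dbpo:Mammal", "dbpo:Organisation", "dbpo:City", "dbpo:Magazine"]


def feature_vector(resource, wikilinks, types, features):
    """Binary vector: 1 if the resource links to at least one resource of that type."""
    linked_types = {types[t] for t in wikilinks.get(resource, []) if t in types}
    return [1 if f in linked_types else 0 for f in features]


def training_row(resource):
    """Pair the feature vector with the known class label of the resource."""
    return feature_vector(resource, wikilinks, types, features), types[resource]


print(training_row("dbpedia:Steve_Jobs"))
```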

SLIDE 11

Building the training set for K-Nearest Neighbor algorithm

[Figure: dbpedia:Steve_Jobs with dbpo:Organisation, dbpo:Magazine and dbpo:City; edge labels dbpo:wikiPageWikiLink, rdf:type and kp:linksTo]

✦ Precision using all DBPO types as features: 31.65%
✦ Precision using the top-level of DBPO as features: 40.27%
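For readers unfamiliar with k-NN, the classification step over such labeled feature vectors can be sketched as below. The slides do not state the value of k or the distance metric, so k=3 and Euclidean distance are assumptions here, and the training rows are invented toy data.

```python
from collections import Counter
import math


def knn_classify(query, training_rows, k=3):
    """Return the majority class among the k training rows closest to `query`."""
    by_distance = sorted(training_rows, key=lambda row: math.dist(query, row[0]))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Toy training rows: (binary feature vector, class label). Values are invented.
training_rows = [
    ([0, 1, 1, 1], "dbpo:Person"),
    ([0, 1, 1, 0], "dbpo:Person"),
    ([1, 0, 0, 0], "dbpo:Species"),
    ([1, 0, 1, 0], "dbpo:Species"),
]
print(knn_classify([0, 1, 0, 1], training_rows, k=3))
```

With binary features, Euclidean distance orders neighbors the same way as Hamming distance, so either choice would give the same vote in this sketch.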

SLIDE 12

Abductive classification with EKPs

  • EKPs
    ✦ An EKP of a certain entity type is a small vocabulary that captures the core types used for describing that entity type, as it emerges from the Wikipedia crowds

Visit aemoo.org for an exploratory tool based on EKPs

SLIDE 13

How can we infer the type of “Galileo Galilei”?

http://www.aemoo.org

SLIDE 14

How can we infer the type of “Galileo Galilei”?

We know its path types

http://www.aemoo.org

SLIDE 15

We have 231 EKPs.

We compare the path types involving “Galileo Galilei” as subject with the EKPs in order to identify the most similar one, which is the "Scientist" EKP.

http://www.aemoo.org

SLIDE 16

The inferred type for the resource “Galileo Galilei” is the class “Scientist”

http://www.aemoo.org
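The matching step of the Galileo Galilei example can be sketched as follows. The slides do not say which similarity measure is used to compare an entity's path types with the EKPs, so Jaccard similarity over sets of types is an assumption here. The EKP vocabularies and the entity's path types below are invented for illustration.

```python
def jaccard(a, b):
    """Jaccard similarity between two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def most_similar_ekp(entity_path_types, ekps):
    """Return the name of the EKP whose type vocabulary best matches the entity."""
    return max(ekps, key=lambda name: jaccard(entity_path_types, ekps[name]))


# Invented EKP vocabularies; real EKPs are extracted from Wikipedia link data.
ekps = {
    "Scientist": {"Scientist", "University", "Award", "City"},
    "Company": {"Company", "Person", "City", "Software"},
}
# Invented object types of the wikilinks whose subject is the entity to classify.
galileo_path_types = {"Scientist", "University", "City", "Planet"}
print(most_similar_ekp(galileo_path_types, ekps))
```

The name of the winning EKP is then taken as the inferred type of the entity.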

SLIDE 17

Distinctive weakness of some EKPs

✦ The distinctive weakness seems to be due to wide overlaps among some EKPs
✦ Systematic ambiguity of the 4 largest classes
✦ Precision and recall on all DBPO types: both 44.4%
✦ Precision and recall on the top-level of DBPO hierarchy: 36.5% and 79.5%

SLIDE 18

Homotype-based abductive classification

  • Homotypes are wikilinks that have the same type on both the subject and the object of the triple
  • We have observed that the homotype is usually the most frequent (or in the top 3) wikilink type
  • Given an untyped entity, we hypothesize that the most frequent type involved in its ingoing/outgoing wikilinks detects its homotype, hence it indicates its type

[Figure: dbpedia:Immanuel_Kant and dbpedia:Plato connected by dbpo:wikiPageWikiLink; both have rdf:type dbpo:Philosopher]
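The homotype hypothesis translates into a simple majority count over the types of an entity's wikilink neighbors. The sketch below is a minimal illustration; the function name and the example triples are hypothetical, with dbpedia:Immanuel_Kant treated as untyped for the sake of the example.

```python
from collections import Counter


def infer_type_by_homotype(entity, wikilinks, types):
    """Guess an entity's type as the most frequent type among its wikilink neighbours."""
    neighbours = []
    for subject, obj in wikilinks:
        if subject == entity:
            neighbours.append(obj)       # outgoing link
        elif obj == entity:
            neighbours.append(subject)   # ingoing link
    counts = Counter(types[n] for n in neighbours if n in types)
    return counts.most_common(1)[0][0] if counts else None


# Invented wikilink triples (subject, object) and type assertions.
wikilinks = [
    ("dbpedia:Immanuel_Kant", "dbpedia:Plato"),
    ("dbpedia:David_Hume", "dbpedia:Immanuel_Kant"),
    ("dbpedia:Immanuel_Kant", "dbpedia:Königsberg"),
]
types = {
    "dbpedia:Plato": "dbpo:Philosopher",
    "dbpedia:David_Hume": "dbpo:Philosopher",
    "dbpedia:Königsberg": "dbpo:City",
}
print(infer_type_by_homotype("dbpedia:Immanuel_Kant", wikilinks, types))
```

An entity with no typed neighbors yields `None`, which is the situation the later "Why?" slides point to for many untyped resources.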

SLIDE 19

Homotype-based abductive classification

SLIDE 20

Homotype-based abductive classification

SLIDE 21

Results on classifying already typed resources

SLIDE 22

Results on untyped resources

  • Results on a sample of 1,000 untyped resources are much less satisfactory

[Figure: two panels, "With EKPs" and "With Homotypes"]

SLIDE 23

Why? [1]

  • Typed entities: 2:3 typed wikilinks ratio
  • Untyped entities: 1:3 typed wikilinks ratio
  • Link structure for untyped entities is not rich enough
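One way to read the ratios above is as the fraction of an entity's wikilinks whose target carries a DBPO type. That reading is an assumption, as the slide does not define the ratio. Under it, the measure could be computed as in this sketch, where the function name and the example data are hypothetical.

```python
from fractions import Fraction


def typed_link_ratio(entity, wikilinks, types):
    """Fraction of the entity's outgoing wikilinks whose target has a known type."""
    targets = [obj for subject, obj in wikilinks if subject == entity]
    if not targets:
        return Fraction(0)
    return Fraction(sum(1 for t in targets if t in types), len(targets))


# Invented example: three outgoing links, only one target is typed.
wikilinks = [
    ("dbpedia:Some_Untyped_Entity", "dbpedia:A"),
    ("dbpedia:Some_Untyped_Entity", "dbpedia:B"),
    ("dbpedia:Some_Untyped_Entity", "dbpedia:C"),
]
types = {"dbpedia:A": "dbpo:Place"}
print(typed_link_ratio("dbpedia:Some_Untyped_Entity", wikilinks, types))
```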

SLIDE 24

Why? [2]

  • DBPO does not provide a complete set of classes for correctly typing DBpedia resources

dbpedia:Counterattack                    Plan
dbpedia:Computer_Science                 ScientificDiscipline
dbpedia:List_of_FIFA_World_Cup_finals    Collection
dbpedia:Eros(concept)                    Concept
dbpedia:Gentlemen’s_agreement            Agreement

SLIDE 25

Conclusions

  • We have investigated different approaches for typing DBpedia resources based on the data set of wikilinks
  • Results are acceptable in the test set, but extensive untypedness in output links and poor DBPO coverage severely compromise automatic typing for untyped resources
  • We have analyzed possible causes deriving from some bias in DBpedia

SLIDE 26

Future work

  • YAGO could be helpful but
    ✦ there is a lack of mapping between YAGO and DBPO
    ✦ it has larger coverage and only partially overlaps with DBPO
    ✦ the granularity of its categories is finer, and not easily reusable, because the top level is very large

SLIDE 27

Thank you

Andrea Nuzzolese
STLab, ISTC-CNR & Dipartimento di Scienze dell’Informazione, University of Bologna, Italy