Supporting Data Interlinking in Semantic Libraries with Microtask Crowdsourcing


SLIDE 1

Institute for Web Science and Technologies · University of Koblenz-Landau, Germany

Supporting Data Interlinking in Semantic Libraries with Microtask Crowdsourcing

Cristina Sarasua SWIB 2014, Bonn

SLIDE 2

SLIDE 3

[Diagram: resources a and b connected by a relation]

MARC 21, FRBR, EDM

SLIDE 5

Please share your thoughts on interlinking! https://etherpad.mozilla.org/4IfZDaTBIe

SLIDE 6

Interlinking on the Web of Data

Linking Open Data cloud diagram 2014, by Max Schmachtenberg, Christian Bizer, Anja Jentzsch and Richard Cyganiak. http://lod-cloud.net/

SLIDE 7

Cross-dataset links

Links are triples (a, r, b) with a in D1 and b in D2.

Links between instances of D1 and D2:

d1:timbl owl:sameAs d2:timbernerslee;
d1:donostia owl:sameAs d2:sansebastian;
d1:bjork dc:creator d2:volta;
d1:Bonn wgs84:location d2:Germany;
d1:work2012 o:inspiredBy d2:song1900;

Links between vocabulary terms of o1 and o2:

o1:Conference owl:equivalentClass o2:Congress;
o1:Democracy skos:related o2:Government;
o1:Publication skos:broader o2:JournalArticle;
o1:ImpressionistPainting rdfs:subClassOf o2:Painting;
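The definition (a, r, b) | a in D1, b in D2 can be sketched in a few lines of Python. This is only an illustration: it assumes that dataset membership can be read off the prefix of an identifier, and the fourth triple is a made-up intra-dataset statement added for contrast.

```python
from typing import NamedTuple


class Link(NamedTuple):
    a: str         # subject, expected to come from D1
    relation: str  # linking predicate, e.g. owl:sameAs
    b: str         # object, expected to come from D2


def is_cross_dataset(link: Link, d1_prefix: str = "d1:", d2_prefix: str = "d2:") -> bool:
    """True if the subject belongs to D1 and the object to D2 (prefix-based assumption)."""
    return link.a.startswith(d1_prefix) and link.b.startswith(d2_prefix)


links = [
    Link("d1:timbl", "owl:sameAs", "d2:timbernerslee"),
    Link("d1:donostia", "owl:sameAs", "d2:sansebastian"),
    Link("d1:bjork", "dc:creator", "d2:volta"),
    Link("d1:bjork", "rdfs:seeAlso", "d1:volta"),  # hypothetical triple inside D1
]

# Keep only the triples that actually cross the dataset boundary.
cross_links = [link for link in links if is_cross_dataset(link)]
```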

SLIDE 8

Why is interlinking important?

  • Enhance the description of local entities
  • Richer queries over aggregated data
  • Cross-dataset browsing

What is known about Berlin?

x:berlin owl:sameAs dbpedia:Berlin, tour:berlin;
x:berlin o:homeOf authors:berlin;
x:img09112014 lode:atPlace geo:brandtor;

SELECT ?city1 WHERE {
  ?city1 gov:population ?pop .
  ?city1 owl:sameAs ?city2 .
  ?city2 unesco:count ?mon .
  FILTER (?pop > 1000000 && ?mon > 50)
}
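To make the role of owl:sameAs in the aggregated query concrete, here is a minimal in-memory sketch in plain Python. It is not a SPARQL engine; the identifiers and all numbers are toy values invented for illustration, not real statistics.

```python
# Toy data standing in for two datasets (all values are invented).
population = {"gov:cityA": 2_000_000, "gov:cityB": 300_000}          # gov:population
same_as = [("gov:cityA", "unesco:cityA"), ("gov:cityB", "unesco:cityB")]  # owl:sameAs links
monuments = {"unesco:cityA": 60, "unesco:cityB": 80}                 # unesco:count


def big_cities_with_many_monuments(min_pop=1_000_000, min_mon=50):
    """Join the two datasets through the sameAs links, then apply both filters."""
    results = []
    for city1, city2 in same_as:
        pop = population.get(city1)
        mon = monuments.get(city2)
        if pop is not None and mon is not None and pop > min_pop and mon > min_mon:
            results.append(city1)
    return results


matching = big_cities_with_many_monuments()
```

Without the sameAs links, the population and the monument count could not be combined, because each value is attached to a different identifier.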

http://www.w3.org/2005/Incubator/lld/XGR-lld-20111025/

SLIDE 9

Generating links

D1, D2: identify the resources to be connected with relation R

Comparison criteria

Picture: https://www.assembla.com/spaces/silk/wiki/Managing_Reference_Links

Decision boundary between link and non-link
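A comparison criterion with a decision boundary can be sketched as follows. This is a simplified illustration, not the method of any particular link discovery tool: it assumes that labels are the only criterion, uses Python's difflib for string similarity, and the threshold of 0.9 is an arbitrary choice.

```python
from difflib import SequenceMatcher


def label_similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity between two labels, in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def classify_pair(label_a: str, label_b: str, threshold: float = 0.9) -> str:
    """Pairs at or above the threshold are links, everything else is a non-link."""
    return "link" if label_similarity(label_a, label_b) >= threshold else "non-link"


same_city = classify_pair("Berlin", "berlin")
city_vs_event = classify_pair("Berlin", "Berlinale")
```

Pairs whose score falls close to the boundary are the ones an automatic tool is least sure about.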

SLIDE 10

He is already busy

Attribution: Thomas Leu

SLIDE 11

Attribution: Thomas Leu

He is already busy … but still would like correct and useful links

SLIDE 12

Crowdsourced Interlinking

SLIDE 13

Crowdsourcing

“Crowdsourcing represents the act of a company or institution taking a function once performed by employees and outsourcing it to an undefined (and generally large) network of people in the form of an open call” Jeff Howe, 2006

Fast, scalable

Microtask crowdsourcing
  • E.g. tweet sentiment analysis
  • Seconds, reward of cents
  • Crowd workers register with simple profile, limited filtering

Macrotask crowdsourcing
  • E.g. writing an e-book
  • Months, $30 per hour / hundreds or thousands of dollars
  • Freelancers: recruitment, interviews

Contest-based crowdsourcing
  • E.g. NLP algorithm for a particularly challenging scenario
  • Months, up to thousands of dollars
  • Final evaluation and winner selection

Citizen Science
  • E.g. classify galaxies in pictures
  • Seconds/minutes, no money
  • Open to everyone

SLIDE 14

An interlinking microtask

SLIDE 17

Approach

[Diagram: crowd interlinking workflow between datasets D1 and D2, with steps numbered 1 to 4]

Candidate links: cl1: (s,p,o), cl2: (s,p,o), …, cln: (s,p,o)

  • Analyse crowd workers
  • Collect crowd responses for the candidate links to be processed
  • Aggregated response
  • Crowd interlinking: cl5: (s,p,o), …, cln: (s,p,o)

Components: Parse RDF links; Generate and publish microtasks; Collect responses; Generate RDF file with final links; Query D1, D2
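The step from candidate links to microtasks could look roughly like the sketch below. The Microtask structure, the question wording and the answer options are assumptions made for illustration; the labels dictionary stands in for the context information obtained by querying D1 and D2.

```python
from dataclasses import dataclass


@dataclass
class Microtask:
    link_id: str
    question: str
    options: tuple


def generate_microtasks(candidate_links: dict, labels: dict) -> list:
    """Turn each candidate link (s, p, o) into one question for crowd workers."""
    tasks = []
    for link_id, (s, p, o) in candidate_links.items():
        # Fall back to the raw identifier when no human-readable label is known.
        question = f"Is '{labels.get(s, s)}' related to '{labels.get(o, o)}' by {p}?"
        tasks.append(Microtask(link_id, question, ("yes", "no", "I don't know")))
    return tasks


candidate_links = {"cl1": ("d1:donostia", "owl:sameAs", "d2:sansebastian")}
labels = {"d1:donostia": "Donostia", "d2:sansebastian": "San Sebastián"}
tasks = generate_microtasks(candidate_links, labels)
```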

SLIDE 19

Approach (II)

  • Analyse crowd workers to filter out people
    – With bad intentions (i.e. scammers)
    – Who do not have enough knowledge
  • Select representative links for which the answer is known (ground truth) and assess people → domain expert useful

Select different matching cases:

x:b rdfs:label “Berlin”; rdf:type o:City;
x:b2 rdfs:label “Berlinale”; rdf:type o:Event;

x:b rdfs:label “Córdoba”; rdf:type o:City;
x:b2 rdfs:label “Córdoba”; rdf:type o:City;

Measure difficulty based on data heuristics:

x:b rdfs:label “Córdoba”; rdf:type o:City; wgs84:lat -31.400;
x:b2 rdf:type o:City; wgs84:lat 37.883;

Two-way feedback
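Assessing workers against ground-truth links might be sketched like this. The link identifiers, the worker names and the 0.7 accuracy cut-off are hypothetical; the point is only that answers on links with a known answer give an accuracy estimate per worker.

```python
def worker_accuracy(answers: dict, gold: dict) -> float:
    """Share of ground-truth links that a worker answered correctly."""
    judged = [link for link in answers if link in gold]
    if not judged:
        return 0.0  # no evidence about this worker yet
    correct = sum(1 for link in judged if answers[link] == gold[link])
    return correct / len(judged)


def trusted_workers(all_answers: dict, gold: dict, min_accuracy: float = 0.7) -> set:
    """Workers whose accuracy on the ground truth reaches the cut-off."""
    return {w for w, a in all_answers.items() if worker_accuracy(a, gold) >= min_accuracy}


# g1 and g2 are links with a known answer; cl5 is a regular candidate link.
gold = {"g1": True, "g2": False}
all_answers = {
    "w1": {"g1": True, "g2": False, "cl5": True},
    "w2": {"g1": False, "g2": True, "cl5": True},
}
trusted = trusted_workers(all_answers, gold)
```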

SLIDE 20

Approach

[Diagram: crowd interlinking workflow between datasets D1 and D2, with steps numbered 1 to 4]

Candidate links: cl1: (s,p,o), cl2: (s,p,o), …, cln: (s,p,o)

  • Analyse crowd workers
  • Collect crowd responses for the candidate links to be processed
  • Aggregated response
  • Crowd interlinking: cl5: (s,p,o), …, cln: (s,p,o)

Components: Parse RDF links; Generate and publish microtasks; Collect responses; Generate RDF file with final links; Query D1, D2

Agreement, #workers per link, context information
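One simple way to turn several responses per link into an aggregated response is a majority vote with an agreement threshold. The sketch below assumes this scheme; the minimum of 3 workers per link and the 0.6 agreement level are illustrative values, not settings taken from the talk.

```python
from collections import Counter


def aggregate(responses: list, min_workers: int = 3, min_agreement: float = 0.6):
    """Return (answer, agreement); answer is None when the crowd is not conclusive."""
    if len(responses) < min_workers:
        return None, 0.0  # not enough workers have judged this link yet
    answer, votes = Counter(responses).most_common(1)[0]
    agreement = votes / len(responses)
    return (answer if agreement >= min_agreement else None), agreement


clear_case = aggregate(["yes", "yes", "no"])
too_few = aggregate(["yes", "no"])
split_case = aggregate(["yes", "yes", "no", "no"])
```

Links that end with None can be sent to more workers or handed to an expert.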

SLIDE 21

Approach (II)

[Diagram: D1 and D2 connected by manual interlinking, next to D1 and D2 connected by HCOMP interlinking; labels: Guide, Review, Algorithm]

SLIDE 22

Use cases

SLIDE 23

Mapping vocabularies

Run an automatic ontology alignment tool and post-process the results with the crowd

See also: [Sarasua et al., 2012]

Context information pre-configured

SLIDE 24

Discovering links between instances

a) To extract the patterns of the linkage rules (i.e. labelling)
b) To post-process irregular multilingual values, different name versions
c) To automatically identify patterns of errors in a resulting set of links, which may be afterwards reviewed by the experts

SLIDE 25

There are different possible targets for the interlinking of a dataset: which possibility to select for the Web portal?

Embed Web site in a microtask and ask for specific information or observe next Web site opened

Curating mapping extensions to authority files

Quality control can be done by giving these answers to other crowd workers

Checking usefulness of links with library users

SLIDE 26

3 Challenges

SLIDE 27

# Deciding whether to crowdsource or not

  • Depends to a large extent on the data
    – Specific domains require more crowd management effort
    – Benefit compared to automatically generated links may vary
    – Availability of workers may change over time
  • What should be processed by the crowd
    – Criteria for selecting subsets of the data (e.g. confidence of machine)

Libraries and the cultural heritage domain have high potential (multilinguality, different naming conventions, knowledge exploration)

> Trial, error and assessment
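Selecting the subset for the crowd by machine confidence can be sketched as a three-way split. The two thresholds and the example scores are assumptions for illustration; in practice they would come out of the trial, error and assessment mentioned above.

```python
def split_by_confidence(scored_links, accept_above=0.9, reject_below=0.3):
    """Split (link, confidence) pairs into accepted, crowd and rejected lists."""
    accepted, crowd, rejected = [], [], []
    for link, confidence in scored_links:
        if confidence >= accept_above:
            accepted.append(link)   # the machine is confident enough
        elif confidence < reject_below:
            rejected.append(link)   # very unlikely to be a correct link
        else:
            crowd.append(link)      # uncertain: worth asking the crowd
    return accepted, crowd, rejected


# Hypothetical confidence scores from an automatic tool.
scored_links = [("cl1", 0.97), ("cl2", 0.55), ("cl3", 0.10), ("cl4", 0.89)]
accepted, crowd, rejected = split_by_confidence(scored_links)
```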

SLIDE 28

# Building a loyal workforce

Attracting good crowd workers
– Microtasks are constantly being published
– Higher reward may also attract more malicious workers

Working with people repeatedly is not supported by the majority of crowdsourcing platforms

How to make crowd workers keep on working on these microtasks without them getting demotivated?

> Be fair (see also Guidelines on Crowd Work for Academic Researchers, 2014)
> Listen to crowd workers (e.g. direct comments, Twitter, ratings, monitor online discussions)
> Recognize their work
> Be aware that gamification is not always the best solution

“It's really easy to change people's motivations, [at Zooniverse] we find people are motivated by wanting to contribute, they want a sense that this is something real. And in adding game-like elements you can destroy that quite quickly” Chris Lintott, Zooniverse
http://www.wired.co.uk/news/archive/2013-09/12/fraxinus-gamifying-science/viewgallery/307960

SLIDE 29

# Working with unknown humans

  • Open call can be a problem and an opportunity at the same time: people have diverse
    – Motivation and dedication
    – Context and profile
    – Background knowledge
  • Crowdsourcing platforms have limited support for personalisation
  • Working with suitable crowd
    – Identify what they can do best
      ▪ Type of task / data level
      ▪ Competences vs experience, cross-platform analysis
    – Assign work accordingly
      ▪ Weight vs reject

> Towards a Crowd Work CV
See also: [Sarasua et al., 2014]
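The difference between weighting and rejecting workers can be shown with a small vote-counting sketch. The weights are hypothetical reliability scores (for example, accuracy on known links); nothing here is taken from the Crowd Work CV itself.

```python
from collections import defaultdict


def weighted_vote(responses, weights, min_weight=0.0, default_weight=0.5):
    """Sum the weights behind each answer; workers below min_weight are rejected."""
    totals = defaultdict(float)
    for worker, answer in responses:
        weight = weights.get(worker, default_weight)
        if weight < min_weight:
            continue  # "reject": this worker's answer is ignored entirely
        totals[answer] += weight  # "weight": the answer counts in proportion
    if not totals:
        return None
    return max(totals, key=totals.get)


responses = [("w1", "yes"), ("w2", "no"), ("w3", "no")]
weights = {"w1": 0.9, "w2": 0.45, "w3": 0.5}

weighted_only = weighted_vote(responses, weights)                   # everyone counts
with_rejection = weighted_vote(responses, weights, min_weight=0.5)  # w2 is dropped
```

With these numbers the two strategies disagree: two moderately reliable workers outweigh one reliable worker, unless one of them is rejected.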

SLIDE 30

Plea to this community

  • Interlinking is much more than deduplication; consider also using other relations
  • Consider connecting library datasets to different complementary domains
  • Interlinking to non-editorial data can also be enriching
  • The more datasets you connect the better
  • Document your interlinking in the VoID description of your dataset
  • Query and make use of available links
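Documenting a set of links in a VoID description could be sketched as below. The properties used (void:Linkset, void:subjectsTarget, void:objectsTarget, void:linkPredicate, void:triples) are part of the VoID vocabulary; the example.org namespace, the dataset names and the triple count are placeholders.

```python
def void_linkset(name, subjects_target, objects_target, link_predicate, triples):
    """Return a Turtle description of a linkset using the VoID vocabulary."""
    lines = [
        "@prefix void: <http://rdfs.org/ns/void#> .",
        "@prefix owl: <http://www.w3.org/2002/07/owl#> .",
        "@prefix : <http://example.org/void#> .  # placeholder namespace",
        "",
        f":{name} a void:Linkset ;",
        f"    void:subjectsTarget :{subjects_target} ;",
        f"    void:objectsTarget :{objects_target} ;",
        f"    void:linkPredicate {link_predicate} ;",
        f"    void:triples {triples} .",
    ]
    return "\n".join(lines)


# Placeholder values: a linkset of 1200 owl:sameAs links from D1 to D2.
description = void_linkset("D1_D2_sameAs", "D1", "D2", "owl:sameAs", 1200)
print(description)
```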

SLIDE 31

If you need humans to process data while interlinking datasets, consider crowd intervention because it can be very valuable for enhancing your results.

SLIDE 32

Institute for Web Science and Technologies · University of Koblenz-Landau, Germany

Thank you for your attention!

Cristina Sarasua
Institute for Web Science and Technologies
Universität Koblenz-Landau
csarasua@uni-koblenz.de
http://de.slideshare.net/cristinasarasua
https://github.com/criscod

SLIDE 33

References

Sarasua, C., Simperl, E., Noy, N.F.: CrowdMAP: Crowdsourcing ontology alignment with microtasks. In: Proceedings of the 11th International Semantic Web Conference (ISWC). (2012)

Sarasua, C., Thimm, M.: Crowd Work CV: Recognition for Micro Work. In: SoHuman workshop, co-located with Social Informatics (SocInfo). (2014)

Guidelines on Crowd Work for Academic Researchers (2014). http://wiki.wearedynamo.org/index.php/Guidelines_for_Academic_Requesters