A Collaborative Named Entity Focused URI Collection to Explore Web Archives


A Collaborative Named Entity Focused URI Collection to Explore Web Archives Workshop on Web Archiving and Digital Libraries Sergej Wildemann & Helge Holzmann June 6, 2019 L3S Research Center - University of Hannover - Germany Introduction


SLIDE 1

A Collaborative Named Entity Focused URI Collection to Explore Web Archives

Workshop on Web Archiving and Digital Libraries

Sergej Wildemann & Helge Holzmann June 6, 2019

L3S Research Center - University of Hannover - Germany

SLIDE 2

Introduction

SLIDE 3

Motivation

  • Named entities are evolving
      • Cities grow
      • People change positions in their careers
  • Information is spread on the Web
  • Content is changed or deleted on web pages
  • Web search engines forget or rerank resources
  • Limited search in web archives by:
      • Exact URI
      • "Full-text"

Challenge: Accessing online resources that describe an entity over time

SLIDE 4

Example: Site Search in the Wayback Machine

  • Indexed anchor texts
  • Problem: Domains only, no specific paths or dates

SLIDE 5

Dataset Generation

SLIDE 6

Processing Pipeline

  1. Entity collection
  2. URI collection and assignment
  3. URI unification
  4. Ranking
  5. Temporal enrichment

SLIDE 7

Incorporated Datasets

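[Table: datasets (rows) against the information each contributes (columns). Per-cell marks are not recoverable from the extracted slide text.]

  Columns: Entities, Classes, URIs, Tags, Dates
  Rows: Wikipedia, DBpedia, Wikidata, Delicious, GWA, GWW, Wayback CDX
  Legend: (•) extracted information, (⋊⋉) used for joining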

SLIDE 8

Entity Collection

  • Potential entities: List of all Wikipedia article titles
  • Filtering with DBpedia’s ontology:

Type             Count        %
Person           1,500,000    22.7
Place              840,000    12.7
Creative Work      496,000     7.5
Organization       286,000     4.3
Other            2,378,000    36.0
N/A              1,100,000    16.7
Total            6,600,000   100

SLIDE 9

URI Collection

  • Wikidata provides URIs directly or indirectly via identifiers
  • Wikipedia articles contain an "External Links" section
      • Official websites (usually without a path)
      • Databases like IMDb
  • GWA and GWW were generated from the German Web Archive (1996-2013)
      • Searching for entities in anchor texts
      • GWW restricted the search to pages referenced from Wikipedia
      • Up to 10 URIs per entity
      • With most prominent years in dataset

SLIDE 10

URI Collection from Delicious

Delicious contains tagged bookmarks with timestamps.

Record format: DATE USERID URI TAG...

  • General idea: Matching of tags against entities
  • Problem: Tags are single words
  • Solution: Normalizing tags and entity titles
  • Disambiguation terms must be found in tags
  • Additional tags as metadata

Normalization

  Entity: New_York_(State) ⇒ newyork
  Tag:    new-york         ⇒ newyork
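The tag matching described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the helper names `squash`, `normalize_title`, and `matches` are hypothetical, and the rule of dropping every character outside `a-z0-9` is assumed from the single example on the slide.

```python
import re


def squash(text):
    """Lowercase and drop everything except ASCII letters and digits.

    Assumption: non-ASCII letters are simply removed in this sketch.
    """
    return re.sub(r"[^a-z0-9]", "", text.lower())


def normalize_title(title):
    """Split a Wikipedia-style title into (normalized name, disambiguation terms)."""
    m = re.match(r"^(.*?)(?:[_ ]\((.+)\))?$", title)
    name, disambiguation = m.group(1), m.group(2)
    terms = []
    if disambiguation:
        terms = [squash(t) for t in re.split(r"[_ ,]+", disambiguation)]
    return squash(name), [t for t in terms if t]


def matches(title, tags):
    """A bookmark matches if its tags contain the name and all disambiguation terms."""
    key, terms = normalize_title(title)
    normalized_tags = {squash(t) for t in tags}
    return key in normalized_tags and all(t in normalized_tags for t in terms)


print(normalize_title("New_York_(State)"))
print(matches("New_York_(State)", ["new-york", "state", "travel"]))
```

Tags that are not consumed by the match (here `travel`) would remain available as additional metadata, as the slide notes.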

SLIDE 11

URI Unification

URI := protocol://domain[/path][?query][#fragment]

  • Protocol removal
  • Subdomain unification (www. prefix)
  • Path stripping (index pages)
  • Query parameter cleanup (empty or tracking keys)
  • Fragment removal

Unification

  http://www.example.org/index.html?foo=&ref=123#content
  https://example.org/?ref=456#about
  ⇒ www.example.org
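A minimal sketch of these unification steps, using Python's standard `urllib.parse`, might look like the following. The function name `unify`, the sets `TRACKING_KEYS` and `INDEX_PAGES`, and the choice to always emit a `www.` prefix are assumptions made to reproduce the example above; the slides do not list the actual keys or index page names.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit

# Assumed for illustration; the real lists are not given in the slides.
TRACKING_KEYS = {"ref", "utm_source", "utm_medium", "utm_campaign"}
INDEX_PAGES = {"index.html", "index.htm", "index.php"}


def unify(uri):
    """Reduce a URI to a protocol-free, fragment-free canonical form."""
    parts = urlsplit(uri)

    # Protocol removal and subdomain unification (always use the www. form).
    host = (parts.hostname or "").lower()
    if not host.startswith("www."):
        host = "www." + host

    # Path stripping: drop a trailing index page and trailing slashes.
    segments = parts.path.split("/")
    if segments[-1].lower() in INDEX_PAGES:
        segments[-1] = ""
    path = "/".join(segments).rstrip("/")

    # Query cleanup: drop empty values and tracking keys.
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if value and key.lower() not in TRACKING_KEYS
    ]

    # The fragment (parts.fragment) is deliberately ignored.
    result = host + path
    if query:
        result += "?" + urlencode(query)
    return result


print(unify("http://www.example.org/index.html?foo=&ref=123#content"))
print(unify("https://example.org/?ref=456#about"))
```

Both calls yield the same string, so URIs collected from different datasets can be joined on it.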

SLIDE 12

Ranking of Entity-URI Matches

  • Assign initial votes in the range [1, 10] per dataset
  • Wikipedia with links to homepages or databases
      • Vote: 10 (domains), 5 (URIs with path)
  • Wikidata contains hand-picked URIs, but many indirect ones do not seem useful
      • Vote: 10 (direct), 5 (indirect)
  • GWA and GWW seem to have weak results
      • Vote: 3
  • Delicious URI matches are ranked by the relative number of users of a tag
      • Vote: 10 (most used tag) down to 1
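The initial votes can be written down as a small lookup plus one scaling function. This is a sketch under assumptions: the fixed values come from the slide, but the names `FIXED_VOTES` and `delicious_vote` are hypothetical, and the linear scaling with rounding up is only one possible way to map "relative number of users" onto the range 10 down to 1.

```python
import math

# Fixed initial votes per dataset and URI kind, as listed on the slide.
FIXED_VOTES = {
    ("wikipedia", "domain"): 10,
    ("wikipedia", "path"): 5,
    ("wikidata", "direct"): 10,
    ("wikidata", "indirect"): 5,
    ("gwa", None): 3,
    ("gww", None): 3,
}


def delicious_vote(tag_users, max_tag_users):
    """Scale a tag's user count to a vote in [1, 10].

    Assumption: linear scaling relative to the most used tag, rounded up.
    """
    return max(1, math.ceil(10 * tag_users / max_tag_users))


print(FIXED_VOTES[("wikipedia", "domain")])
print(delicious_vote(50, 100))
```

How votes from several datasets are combined for a multi-source URI is not stated on the slide, so it is left out here.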

SLIDE 13

Dataset by the Numbers

  • 22.8 M URIs and 1.6 M described entities
  • 91.3 % of entities have matching URIs
  • 13.4 % of entities with at least one multi-source URI
  • 13.7 URIs per entity on average
  • URIs for 70 % of entities in both Wiki and GWW
  • 35 % in GWW and only 2.3 % in Delicious

Average URIs per type and dataset

Dataset      Person   Org.    Place   C.W.    All
Wikipedia      3.52    2.56    2.47    1.81    2.83
Wikidata       5.52    2.79    2.96    5.95    4.60
Delicious     22.15   51.43   63.53   61.43   55.86
GWA            6.54    7.16    6.52    8.49    7.14
GWW            4.43    5.32    4.63    6.10    5.13

SLIDE 14

Dataset Cross-Evaluation

SLIDE 15

URI Assignments to Entities

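[Figure: four-set Venn diagram of URI assignments across Delicious, Wiki, GWA, and GWW. Region counts as extracted, with region assignment not recoverable: 2,326,440; 8,329,706; 9,046,987; 3,365,652; 8,913; 16,056; 4,963; 17,801; 114,048; 171,895; 3,810; 5,634; 1,968; 13,688; 1,901]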

SLIDE 16

Overlapping URIs in Delicious

[Figure: overlapping URIs in Delicious per vote (1 to 10), grouped by entity type (CreativeWork, Organization, Person, Place). Axes: number of overlapping URIs (up to 10,000) and overlap in % (up to 50).]

SLIDE 17

Quality of Individual Datasets

Mean Reciprocal Rank

$$\mathrm{MRR} = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{\mathrm{rank}_i}$$

  • All URIs of a dataset as Q
  • Avg. inverse rank in results
  • Penalty for missing URIs
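As a rough illustration of the measure, the sketch below computes MRR over the ranks that a dataset's URIs receive in the combined results. The function name and the convention of passing `None` for a missing URI are assumptions; here a missing URI contributes 0 to the sum, which is one simple way to realize the penalty mentioned above.

```python
from typing import List, Optional


def mean_reciprocal_rank(ranks: List[Optional[int]]) -> float:
    """Average of 1/rank over all query URIs; None marks a missing URI."""
    if not ranks:
        return 0.0
    # Missing URIs add nothing to the sum but still count in |Q|.
    total = sum(1.0 / rank for rank in ranks if rank is not None)
    return total / len(ranks)


# Four URIs: found at ranks 1, 2, and 4, with one missing.
print(mean_reciprocal_rank([1, 2, None, 4]))
```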

Results

  • Delicious performs well overall
  • URI selection of GWW is better than that of GWA

[Figure: MRR (0.0 to 1.0) per dataset (Delicious, GWA, GWW, Wiki), one panel per entity type: Person, Organization, Place, CreativeWork.]

SLIDE 18

Quality of Overlapping Results

nDCG – simplified in this environment:

  • Only overlapping URIs
  • Ideal rank: all results rank 1
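One possible reading of this simplification is sketched below; it is an interpretation, not the authors' confirmed formula. It assumes binary gain with the standard logarithmic discount 1 / log2(1 + rank), and an ideal DCG in which every overlapping URI sits at rank 1. The function name `simplified_ndcg` is hypothetical.

```python
import math
from typing import List


def simplified_ndcg(ranks: List[int]) -> float:
    """nDCG over overlapping URIs only, with 'all at rank 1' as the ideal."""
    if not ranks:
        return 0.0
    dcg = sum(1.0 / math.log2(1 + rank) for rank in ranks)
    # Ideal case: each URI at rank 1, so each contributes 1 / log2(2) = 1.
    ideal_dcg = len(ranks) * (1.0 / math.log2(2))
    return dcg / ideal_dcg


print(simplified_ndcg([1, 3]))
```

Under this reading the score is 1.0 only when every overlapping URI is ranked first by the dataset under evaluation.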

Results

  • Best URIs in Wiki
  • GWA contains more of the popular URIs than GWW

[Figure: nDCG (0.0 to 1.0) per dataset (Delicious, GWA, GWW, Wiki), one panel per entity type: Person, Organization, Place, CreativeWork.]

SLIDE 19

Comparison with Web Search Engines

Bing results for 50,000 queries of our most promising entities

  • 10 URIs from the first result page
  • Precision:
      • Our top URI is in the result set of 83 % of entities
      • Average Bing position: 2.26
  • Recall:
      • 23.29 % of all Bing URIs are in our dataset
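The three figures above could be computed along these lines. This is a sketch with hypothetical names and input shapes: `top_uri` maps an entity to our highest-ranked URI, `search_results` maps it to the ordered result list of the search engine, and `dataset_uris` maps it to all URIs our dataset holds for that entity. It assumes all URIs were unified beforehand and that recall is counted per entity.

```python
def compare_with_search(top_uri, search_results, dataset_uris):
    """Return (share of entities whose top URI was found,
    average result position of those hits, share of result URIs in the dataset)."""
    positions = []
    found = 0
    total = 0
    for entity, results in search_results.items():
        top = top_uri.get(entity)
        if top in results:
            positions.append(results.index(top) + 1)  # 1-based position
        known = dataset_uris.get(entity, set())
        found += sum(1 for uri in results if uri in known)
        total += len(results)
    hit_rate = len(positions) / len(search_results) if search_results else 0.0
    avg_position = sum(positions) / len(positions) if positions else None
    recall = found / total if total else 0.0
    return hit_rate, avg_position, recall


# Toy data, invented purely for illustration.
top = {"a": "u1", "b": "u9"}
results = {"a": ["u0", "u1"], "b": ["u2", "u3"]}
ours = {"a": {"u1"}, "b": {"u3", "u9"}}
print(compare_with_search(top, results, ours))
```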

SLIDE 20

Conclusion

SLIDE 21

Conclusion

Ordered collection of annotated resources describing 1.6 M named entities over time.

  • Combination of multiple diverse datasets
  • Evaluation of dataset quality
  • Promising results with respect to regular search engines

Future Work

  • Integration of more data sources
  • Expansion of covered time frames and entities
  • Date tag evaluation
  • Language awareness
  • Improved entity matching

SLIDE 22

Explore Dataset Online

https://tempurion.l3s.uni-hannover.de
