SLIDE 1

So Far Away and Yet so Close: Augmenting Toponym Disambiguation and Similarity with Text-Based Networks

Andreas Spitz, Johanna Geiß and Michael Gertz

Heidelberg University, Institute of Computer Science
Database Systems Research Group, Heidelberg
{spitz, geiss, gertz}@informatik.uni-heidelberg.de

3rd GeoRich Workshop, San Francisco, June 26, 2016

SLIDE 2

Motivation Network Construction Network Properties Toponym Disambiguation Summary

Implicit Networks

Augmenting Toponym Disambiguation with Text-Based Networks Andreas Spitz 1 of 18

SLIDE 3

Implicit Text-Based Networks

“Most of the circuits currently in use are specially constructed for competition. The current street circuits are Monaco, Melbourne, Montreal, Singapore and Sochi, although races in other urban locations come and go (Las Vegas and Detroit, for example) and proposals for such races are often discussed – most recently New Jersey.”

en.wikipedia.org/wiki/Formula_One

SLIDE 5

Graph Extraction from Text

s(v, w) := distance in sentences between toponyms v and w

$$d(v, w) := \exp\left(-\frac{s(v, w)}{2}\right)$$
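The weighting above can be illustrated with a minimal Python sketch. It assumes that two toponyms in the same sentence have distance s = 0 and that mentions arrive as (sentence index, toponym) pairs; the function names and the toy input are hypothetical and not taken from the slides.

```python
import math

def sentence_distance_weight(s: int) -> float:
    """Weight d = exp(-s / 2) for two toponyms that are s sentences apart."""
    return math.exp(-s / 2)

def pairwise_weights(mentions):
    """Return (v, w, weight) for every pair of mentions of different toponyms."""
    result = []
    for i, (sent_v, v) in enumerate(mentions):
        for sent_w, w in mentions[i + 1:]:
            if v != w:
                s = abs(sent_v - sent_w)  # distance in sentences
                result.append((v, w, sentence_distance_weight(s)))
    return result

# Hypothetical toy document: (sentence index, toponym) pairs.
mentions = [(0, "Monaco"), (0, "Melbourne"), (1, "Las Vegas"), (3, "New Jersey")]
weights = pairwise_weights(mentions)

for v, w, d in weights:
    print(f"{v} - {w}: {d:.3f}")
```

The exponential decay means that toponyms in the same sentence get the full weight of 1, while pairs several sentences apart contribute only marginally.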

SLIDE 7

Edge Aggregation

Distance-based cosine for nodes v and w:

$$\mathrm{dicos}(v, w) := \frac{\sum_i d_i(v)\, d_i(w)}{\sqrt{\sum_i d_i(v)^2}\, \sqrt{\sum_i d_i(w)^2}}$$
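As a rough illustration, the distance-based cosine can be computed like an ordinary cosine similarity. The slide does not say what the index i ranges over, so this sketch simply treats d_i(v) and d_i(w) as the i-th entries of two equally long weight vectors; returning 0 for an all-zero vector is also an assumption.

```python
import math

def dicos(d_v, d_w):
    """Cosine of two equally long weight vectors d_v and d_w (indexed by i)."""
    dot = sum(a * b for a, b in zip(d_v, d_w))
    norm_v = math.sqrt(sum(a * a for a in d_v))
    norm_w = math.sqrt(sum(b * b for b in d_w))
    if norm_v == 0 or norm_w == 0:
        return 0.0  # assumption: no evidence at all means similarity 0
    return dot / (norm_v * norm_w)

# Hypothetical weight vectors for two nodes.
print(dicos([1.0, 0.5, 0.0], [0.5, 1.0, 0.0]))
```

Because the score is a cosine of non-negative weights, it lies between 0 and 1 and is symmetric in v and w.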

SLIDE 8

Nonreciprocal Relationships

Image: Dirk Beyer, Wikimedia Commons

SLIDE 10

Inducing Edge Directions

Normalize weights of outgoing edges:

$$\omega(v \to w) := \frac{\mathrm{dicos}(v, w)}{\sum_{x \in V} \mathrm{dicos}(v, x)}$$
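The normalization can be sketched in a few lines of Python. The nested-dictionary layout and the toy scores are assumptions made for illustration; only the division by the sum over all neighbours follows the formula above.

```python
def normalize_outgoing(dicos_edges):
    """Turn symmetric dicos scores into directed weights omega(v -> w).

    dicos_edges maps each node v to a dict {w: dicos(v, w)} of its neighbours.
    """
    omega = {}
    for v, neighbours in dicos_edges.items():
        total = sum(neighbours.values())
        if total > 0:
            omega[v] = {w: score / total for w, score in neighbours.items()}
        else:
            omega[v] = {}  # isolated node: no outgoing edges
    return omega

# Hypothetical toy scores (symmetric: dicos(v, w) == dicos(w, v)).
toy = {
    "Heidelberg": {"Mannheim": 0.6, "Berlin": 0.2},
    "Mannheim": {"Heidelberg": 0.6},
    "Berlin": {"Heidelberg": 0.2},
}
omega = normalize_outgoing(toy)
print(omega["Heidelberg"]["Berlin"], omega["Berlin"]["Heidelberg"])
```

In this toy example the edge from Heidelberg to Berlin receives roughly a quarter of Heidelberg's outgoing weight, while the reverse edge receives all of Berlin's, which is how a symmetric score becomes a nonreciprocal relationship.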

SLIDE 11

Adding Knowledge Base Support: Wikidata

SLIDE 12

Toponym Extraction in Wikipedia & Wikidata

SLIDE 14

Network Overview

Network statistics:

|V|        |E|            density        clustering coefficient
723,779    178,890,238    6.8 · 10^-4    0.56

[Figures: node types; Wikidata location hierarchy]
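As a quick plausibility check, the reported density can be recomputed from |V| and |E|. The sketch assumes that |E| counts undirected edges of a simple graph, which the slide does not state explicitly.

```python
num_nodes = 723_779
num_edges = 178_890_238

# Density of an undirected simple graph: |E| / (|V| * (|V| - 1) / 2).
density = num_edges / (num_nodes * (num_nodes - 1) / 2)
print(f"{density:.1e}")
```

Under that assumption the result rounds to the 6.8 · 10^-4 given in the table.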

SLIDE 15

Network Properties

[Figure: % of remaining edges, clustering coefficient, number of components, and assortativity (y-axis: network metric) plotted against the dicos threshold (x-axis, 0 to 0.0025).]

SLIDE 16

Network Centrality

city               c_deg     c_indeg   c^H_deg   c^H_indeg
Paris              63,150    89.87     8,064      7.56
New York City      79,398    71.74     9,294     12.12
Chicago            54,217    51.84     8,074      7.70
Los Angeles        49,961    51.47     7,276      7.76
Washington, D.C.   62,858    51.05     8,138      8.65
Boston             45,895    50.43     6,121      6.08
Philadelphia       51,237    45.19     6,372      5.03
Vienna             35,724    44.55     4,827      7.44
Moscow             29,026    43.77     4,644     19.47
San Francisco      43,759    40.87     6,029      4.76

[Figure: Network between the top 10 European cities by in-degree centrality.]
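The slides do not define the centrality measures, so the following is only one plausible reading: c_deg as the number of distinct neighbours and c_indeg as the sum of the normalized weights of incoming edges, which would explain the non-integer values in the table. The H variants are not covered. All names and the toy network are hypothetical.

```python
def degree_centrality(omega):
    """Number of distinct neighbours per node (edge direction ignored)."""
    neighbours = {}
    for v, out_edges in omega.items():
        neighbours.setdefault(v, set())
        for w in out_edges:
            neighbours[v].add(w)
            neighbours.setdefault(w, set()).add(v)
    return {v: len(ns) for v, ns in neighbours.items()}

def weighted_in_degree(omega):
    """Sum of the weights of incoming edges per node."""
    indeg = {}
    for v, out_edges in omega.items():
        indeg.setdefault(v, 0.0)
        for w, weight in out_edges.items():
            indeg[w] = indeg.get(w, 0.0) + weight
    return indeg

# Hypothetical normalized network: outgoing weights of each node sum to 1.
omega = {
    "A": {"B": 0.75, "C": 0.25},
    "B": {"A": 1.0},
    "C": {"A": 1.0},
}
print(degree_centrality(omega))
print(weighted_in_degree(omega))
```

Under this reading a node gains in-degree when many neighbours devote a large share of their outgoing weight to it, independent of how many neighbours it has itself.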

SLIDE 17

Geographically Embedded Network

[Figure: map of the geographically embedded network. Legend: cities; connection strength in four classes: 0.007 - 0.015, 0.015 - 0.030, 0.030 - 0.045, 0.045 - 1.000.]

SLIDE 18

Centrality-Based Hierarchy Classification

[Figure: precision (y-axis) against recall (x-axis), both from 0.0 to 1.0, for the centralities c_deg, c^H_deg, c_indeg, and c^H_indeg.]

Classification into classes country and city based on centrality.

SLIDE 19

Disambiguation Problem

Locations of towns and cities with the name Heidelberg.

SLIDE 21

Network-based Toponym Disambiguation

Given a document with toponyms, the following information is available:

  • a set of locations L in the network
  • a set of seeds S ⊆ L in the document (unambiguous toponyms)
  • an ambiguous toponym t in the document with candidates l ∈ L

Resolve toponyms by their neighbourhood in the network:

$$\mathrm{resolve}(t) := \arg\max_{l \in L} \sum_{s \in S} \omega(l, s)$$
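A minimal Python sketch of this resolution rule follows. It assumes that ω(l, s) is the weight of the directed edge from candidate l to seed s, that missing edges count as 0, and that the candidate list is non-empty; the location labels and weights are invented for illustration.

```python
def resolve(candidates, seeds, omega):
    """Pick the candidate l with the largest summed weight omega(l, s) over all seeds s."""
    def score(l):
        # Missing edges contribute 0 (assumption).
        return sum(omega.get(l, {}).get(s, 0.0) for s in seeds)
    return max(candidates, key=score)

# Hypothetical network excerpt and document context.
omega = {
    "Heidelberg, Germany": {"Mannheim": 0.3, "Stuttgart": 0.1},
    "Heidelberg, Victoria": {"Melbourne": 0.5},
}
candidates = ["Heidelberg, Germany", "Heidelberg, Victoria"]
seeds = {"Mannheim", "Stuttgart"}

print(resolve(candidates, seeds, omega))
```

With the seeds Mannheim and Stuttgart the German candidate wins; replacing the seeds with Melbourne would flip the decision, which is the intuition behind using unambiguous toponyms in the document as context.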

SLIDE 22

Evaluation on AIDA CoNLL-YAGO data set

         Precision in %             mean distance in km
         all     seeds   ambig.     all     seeds   ambig.
WLND     85.7    86.0    85.6       327.5   522.9   179.1
AIDA     84.9    86.0    83.2       120.4    87.7   142.3
BDIST    81.6    86.0    78.5       683.1   522.9   800.8
BMIN     81.4    86.0    78.8       650.9   522.9   745.0

WLND     Wikipedia Location Network disambiguation
AIDA     AIDA named entity disambiguation
BDIST    Baseline using minimum geographic distance
BMIN     Baseline using lowest Wikidata ID
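The two reported measures can be approximated as follows. The slide does not say how they were computed, so this sketch assumes that precision is the share of toponyms resolved to the correct location and that distance is the great-circle (haversine) distance between predicted and gold coordinates, averaged over all predictions. The record layout and the example coordinates are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, a common approximation

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (latitude, longitude) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def evaluate(predictions):
    """predictions: list of (predicted_id, gold_id, predicted_coords, gold_coords)."""
    correct = sum(1 for pred, gold, _, _ in predictions if pred == gold)
    distances = [haversine_km(*pc, *gc) for _, _, pc, gc in predictions]
    precision = 100.0 * correct / len(predictions)
    mean_distance = sum(distances) / len(distances)
    return precision, mean_distance

# Hypothetical predictions with approximate coordinates: one correct, one wrong.
predictions = [
    ("Heidelberg, Germany", "Heidelberg, Germany", (49.41, 8.71), (49.41, 8.71)),
    ("Heidelberg, Victoria", "Heidelberg, Germany", (-37.75, 145.07), (49.41, 8.71)),
]
precision, mean_distance = evaluate(predictions)
print(f"precision: {precision:.1f} %, mean distance: {mean_distance:.1f} km")
```

The two measures can diverge, as in the table: a method may pick the exact location more often yet land far away when it errs, so precision and mean distance are best read together.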

SLIDE 23

Summary

New method for implicit network extraction that

  • is based on text distances of toponyms,
  • works across documents,
  • can be applied to any geo-tagged corpus.

Application to Wikipedia & Wikidata

  • creates an accurate and reliable network,
  • supports disambiguation and entity linking,
  • provides a language-agnostic tool for NLP tasks.

SLIDE 25

The Wikipedia Location Network is available for download:
http://dbs.ifi.uni-heidelberg.de/index.php?id=data

Thank you! Questions?

SLIDE 26

Bibliography

Johanna Geiß and Michael Gertz. With a Little Help from my Neighbors: Person Name Linking Using the Wikipedia Social Network. In WWW Companion, 2016.

Johanna Geiß, Andreas Spitz, Jannik Strötgen, and Michael Gertz. The Wikipedia Location Network - Overcoming Borders and Oceans. In GIR, 2015.

Johannes Hoffart, Mohamed Amir Yosef, Ilaria Bordino, Hagen Fürstenau, Manfred Pinkal, Marc Spaniol, Bilyana Taneva, Stefan Thater, and Gerhard Weikum. Robust Disambiguation of Named Entities in Text. In EMNLP, 2011.

Michael Speriosu and Jason Baldridge. Text-Driven Toponym Resolution using Indirect Supervision. In ACL, 2013.