
slide-1
SLIDE 1

Terms over LOAD: Leveraging Named Entities for Cross-Document Extraction and Summarization of Events

Andreas Spitz and Michael Gertz

Heidelberg University, Institute of Computer Science
Database Systems Research Group, Heidelberg
{spitz, gertz}@informatik.uni-heidelberg.de

SIGIR ’16, Pisa, July 20, 2016

slide-2
SLIDE 2

Motivation Network Construction Applications Evaluation Summary & Outlook Terms over LOAD: Named Entities for Cross-Document Event Extraction Andreas Spitz 1 of 20

slide-8
SLIDE 8

Motivation

Definition: Event

“Something that happens at a given place and time between a group of actors.”

[CSG+02]

For large document collections, how can we...

  • obtain events from unstructured text?
  • identify connections across documents?
  • support ad-hoc event search?

slide-9
SLIDE 9

Graph Extraction from Unstructured Text

slide-15
SLIDE 15

Edge Weight Generation

For edges (x, y) for which y is a page or sentence, count only (co-)occurrences:

\[ \omega(x, y) = \begin{cases} 1 & \text{if } y \text{ contains } x \\ 0 & \text{otherwise} \end{cases} \]

For edges (x, y) between entity types and terms, aggregate co-occurrence instances I: sum over similarities derived from sentence distances s.

\[ \omega(x, y) := \sum_{i \in I} \exp(-s(x, y, i)) \]
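The second weighting rule can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper name `edge_weight` is hypothetical, s(x, y, i) is assumed to be the absolute difference between the sentence indices of the two mentions, and the cut-off `max_dist` is an assumed parameter that does not appear on the slide.

```python
import math
from itertools import product

def edge_weight(pos_x, pos_y, max_dist=5):
    """Sum exp(-s) over co-occurrence instances of x and y in one document.

    pos_x, pos_y: sentence indices at which x and y are mentioned.
    max_dist: hypothetical cut-off on the sentence distance s (an assumption).
    """
    total = 0.0
    for sx, sy in product(pos_x, pos_y):
        s = abs(sx - sy)  # assumed definition of the sentence distance
        if s <= max_dist:
            total += math.exp(-s)
    return total

# x is mentioned in sentences 0 and 3, y in sentence 1:
# the two instances have distances 1 and 2.
w = edge_weight([0, 3], [1])
```

Mentions in the same sentence contribute exp(0) = 1 each, and the contribution decays quickly with distance, so nearby mentions dominate the weight.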

slide-16
SLIDE 16

LOADing Wikipedia

For the entire English Wikipedia (∼ 4.5M articles with annotations):

  • use only unstructured text.
  • exclude pages of lists.
  • exclude info boxes.
  • exclude references.

Extract named entities with:

  • Stanford NER for locations, organizations and actors [FGM05]
  • Heideltime for dates [SG13]

slide-17
SLIDE 17

Wikipedia LOAD Graph

edges   LOC   ORG   ACT   DAT   TER   SEN   PAG
LOC
ORG      91
ACT     276   106
DAT      83    46   128
TER     183    94   317    57
SEN      71    21    84    38   412
PAG                                    54
nodes   2.7   3.4   7.1   0.2   4.9  53.5   4.5

Number of edges and nodes (in millions) of the LOAD graph of the English Wikipedia. ∼ 2B edges and ∼ 76M nodes in total.
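As a quick sanity check, the stated totals can be recomputed from the individual counts. The dictionary layout below is just one possible encoding, and placing the value 54 on the PAG–SEN pair is an assumption drawn from the table layout (pages connect to their sentences).

```python
# Edge counts in millions per pair of node types, read off the table.
edges = {
    ("ORG", "LOC"): 91,
    ("ACT", "LOC"): 276, ("ACT", "ORG"): 106,
    ("DAT", "LOC"): 83, ("DAT", "ORG"): 46, ("DAT", "ACT"): 128,
    ("TER", "LOC"): 183, ("TER", "ORG"): 94, ("TER", "ACT"): 317,
    ("TER", "DAT"): 57,
    ("SEN", "LOC"): 71, ("SEN", "ORG"): 21, ("SEN", "ACT"): 84,
    ("SEN", "DAT"): 38, ("SEN", "TER"): 412,
    ("PAG", "SEN"): 54,  # assumed column: pages link to sentences
}

# Node counts in millions per node type.
nodes = {"LOC": 2.7, "ORG": 3.4, "ACT": 7.1, "DAT": 0.2,
         "TER": 4.9, "SEN": 53.5, "PAG": 4.5}

total_edges = sum(edges.values())            # millions of edges
total_nodes = round(sum(nodes.values()), 1)  # millions of nodes
```

The sums come to 2061 million edges and 76.3 million nodes, which agrees with the rounded figures in the caption.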

slide-19
SLIDE 19

Single Entity Queries

How can we rank nodes in one set Y by their neighbours in set X? Adapt tf-idf scores to the graph [RV13]!

  • Term frequency: edge weights, \( \mathrm{tf}(x, y) \approx \omega(x, y) \)
  • Inverse document frequency: number of neighbours, \( \mathrm{idf}(x) \approx \frac{|Y|}{\deg_Y(x)} \)

\[ r(x, y) \approx \omega(x, y) \, \log \frac{|Y|}{\deg_Y(x)} \]

Query: Y : (X, value)

LOC : (ACT, Mark Spitz)

location       score
munich         1.00000
us             0.70651
states         0.49010
united states  0.46918
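A minimal sketch of the single-entity ranking, assuming the graph is stored as a nested dictionary of edge weights. The function names, the toy weights and the final normalisation to a top score of 1.0 are assumptions made for illustration; the slide does not say how the displayed scores are scaled.

```python
import math

def rank_neighbours(x, weights, size_y):
    """Score every y adjacent to x as r(x, y) = w(x, y) * log(|Y| / deg_Y(x)).

    weights: dict mapping x -> {y: edge weight w(x, y)} for one target set Y.
    size_y:  |Y|, the number of nodes in the target set.
    """
    neighbours = weights.get(x, {})
    if not neighbours:
        return {}
    # deg_Y(x) is the number of neighbours x has in Y.
    idf = math.log(size_y / len(neighbours))
    return {y: w * idf for y, w in neighbours.items()}

def normalise(scores):
    """Scale scores so the best one is 1.0 (an assumed post-processing step)."""
    top = max(scores.values(), default=0.0)
    return {y: s / top for y, s in scores.items()} if top > 0 else dict(scores)

# Toy weights, not taken from the Wikipedia LOAD graph.
weights = {"Mark Spitz": {"munich": 4.0, "us": 3.0, "states": 2.0}}
scores = normalise(rank_neighbours("Mark Spitz", weights, size_y=1000))
ranking = sorted(scores, key=scores.get, reverse=True)
```

For a single query entity the idf factor is the same for every candidate y, so the order is decided by the edge weights alone; the factor starts to matter once scores from several query entities are combined.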

slide-21
SLIDE 21

Multi-Entity Queries

How can we rank nodes in Y by neighbours in multiple sets \( X_n \)? Combine individual set scores:

\[ r(\vec{x}, y) := \frac{1}{n} \, \eta(\vec{x}, y) \sum_{i=1}^{n} r(x_i, y) \]

Ensure triangular cohesion when combining results:

\[ \eta(\vec{x}, y) := \begin{cases} 1 & \text{if } \sum_{i=1}^{n} \sum_{j>i}^{n} M_{y x_i} M_{y x_j} > 1 \\ 0 & \text{otherwise} \end{cases} \]

Where M is the adjacency matrix of the graph.
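The combination rule could be sketched as follows, reading the cohesion condition literally as printed. The function names are hypothetical, and M is assumed to hold 0/1 entries. Under that assumption the pair sum exceeds 1 only when y is adjacent to at least three query entities, so the threshold is exposed as a parameter rather than hard-coded.

```python
from itertools import combinations

def cohesion(adjacent, threshold=1):
    """eta: 1 if the sum of M[y][x_i] * M[y][x_j] over pairs i < j exceeds
    the threshold, else 0.

    adjacent: list of 0/1 flags with adjacent[i] = M[y][x_i] (assumed binary).
    """
    pair_sum = sum(a * b for a, b in combinations(adjacent, 2))
    return 1 if pair_sum > threshold else 0

def combined_score(single_scores, adjacent, threshold=1):
    """r(x_vec, y) = (1/n) * eta(x_vec, y) * sum_i r(x_i, y)."""
    n = len(single_scores)
    return cohesion(adjacent, threshold) * sum(single_scores) / n

# Toy single-entity scores r(x_i, y) for three query entities.
connected = combined_score([0.9, 0.6, 0.3], adjacent=[1, 1, 1])
partial = combined_score([0.9, 0.6, 0.0], adjacent=[1, 1, 0])
```

In the toy call, the candidate adjacent to all three query entities keeps the mean of its scores, while the candidate adjacent to only two of them is zeroed out by the cohesion factor.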

slide-23
SLIDE 23

Multi-Entity Query Examples

DAT : (ACT, Mark Spitz), (LOC, Munich)

date        score
1972-08-29  0.50851
1972-08-31  0.48217
1972-09-05  0.22738
1947-03-10  0.10511
2006-09-07  0.09226

TER : (ACT, Mark Spitz), (LOC, Munich), (DAT, 1972)

term   score
olymp  0.89630
medal  0.54205
gold   0.43211
won    0.38904
record 0.34548

slide-25
SLIDE 25

Summarization: Sentence Queries

How can sentences in S be used to describe combinations of entities in \( X_n \)? Find a sentence that contains them:

\[ r(\vec{x}, s) := \sum_{i=1}^{n} M_{s x_i} \]

SEN : (ACT, Mark Spitz)

Mark Spitz of the United States had a spectacular run, lining up for seven events, winning seven Olympic titles and setting seven world records.
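A small sketch of the sentence query, assuming the lost operator in the formula is a sum, so that a sentence is scored by how many of the query entities it mentions. The function name, the sentence identifiers and the toy data are hypothetical.

```python
def rank_sentences(query, sentence_entities):
    """Score each sentence s by sum_i M[s][x_i] and return matching ids, best first.

    query: iterable of query entities x_1 .. x_n.
    sentence_entities: dict mapping sentence id -> set of entities it mentions.
    """
    query = set(query)
    scores = {s: len(query & ents) for s, ents in sentence_entities.items()}
    # sorted() is stable, so ties keep the insertion order of the sentences.
    return sorted((s for s in scores if scores[s] > 0), key=lambda s: -scores[s])

# Toy sentence index, not taken from the Wikipedia LOAD graph.
sentence_entities = {
    "s1": {"Mark Spitz", "United States"},
    "s2": {"Munich"},
    "s3": {"Mark Spitz", "Munich"},
    "s4": {"Heidelberg"},
}
best = rank_sentences(["Mark Spitz", "Munich"], sentence_entities)
```

Sentences that mention every query entity come first; sentences that mention none are dropped.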

slide-27
SLIDE 27

Entity Linking: Document Queries

Since we created the LOAD graph from Wikipedia, can we link entities in \( X_n \) to pages P? Use sentences to find the page that contains them most frequently:

\[ r(\vec{x}, p) := \sum_{s \in S} \sum_{i=1}^{n} M_{s x_i} M_{s p} \]

PAG : (ACT, Mark Spitz)

Wiki page ID 66265: Mark Spitz
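The page query can be illustrated in the same spirit, assuming both garbled operators in the formula are sums and that every sentence belongs to exactly one page. The function name, the page identifiers and the toy data are hypothetical.

```python
from collections import Counter

def rank_pages(query, sentence_entities, sentence_page):
    """r(x_vec, p): for each page p, add up the query-entity mentions
    over all sentences s that belong to p.

    sentence_entities: dict mapping sentence id -> set of entities it mentions.
    sentence_page: dict mapping sentence id -> id of the page containing it.
    """
    query = set(query)
    scores = Counter()
    for s, ents in sentence_entities.items():
        hits = len(query & ents)  # sum_i M[s][x_i]
        if hits:
            scores[sentence_page[s]] += hits  # M[s][p] = 1 only for s's own page
    return scores.most_common()

# Toy index, not taken from the Wikipedia LOAD graph.
sentence_entities = {
    "s1": {"Mark Spitz"},
    "s2": {"Mark Spitz", "Munich"},
    "s3": {"Munich"},
}
sentence_page = {"s1": "page_a", "s2": "page_a", "s3": "page_b"}
pages = rank_pages(["Mark Spitz", "Munich"], sentence_entities, sentence_page)
```

The page whose sentences mention the query entities most often is returned first, which matches the "most frequently" wording on the slide.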

slide-28
SLIDE 28

Event Extraction and Completion

Intuition:

  • Events correspond to triangular structures in the network
  • Participating entities can be used to complete events
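The intuition can be made concrete with a toy adjacency structure. This sketch is not the authors' extraction procedure: it only enumerates triangles in an undirected graph and proposes common neighbours as completions. The function names and node labels are hypothetical.

```python
from itertools import combinations

def triangles(adj):
    """Yield each set of three mutually adjacent nodes exactly once.

    adj: dict mapping node -> set of neighbouring nodes (undirected graph).
    """
    for node, nbrs in adj.items():
        for a, b in combinations(sorted(nbrs), 2):
            # Emit the triangle only from its smallest node to avoid duplicates.
            if node < a and b in adj.get(a, ()):
                yield (node, a, b)

def complete(adj, a, b):
    """Candidates that close a triangle with the known participants a and b."""
    return adj.get(a, set()) & adj.get(b, set())

# Toy graph with one actor, one location, one date and one term.
adj = {
    "ACT:Mark Spitz": {"LOC:Munich", "DAT:1972-08-29", "TER:medal"},
    "LOC:Munich": {"ACT:Mark Spitz", "DAT:1972-08-29"},
    "DAT:1972-08-29": {"ACT:Mark Spitz", "LOC:Munich"},
    "TER:medal": {"ACT:Mark Spitz"},
}
found = list(triangles(adj))
missing = complete(adj, "ACT:Mark Spitz", "LOC:Munich")
```

Here the actor, location and date form the only triangle, and querying with the actor and the location proposes the date as the missing participant.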

slide-29
SLIDE 29

Query Answering Speed

[Figure: Query Execution Time. Average processing time in ms plotted against the number of query entities (1 to 20), by query type: entities, sentences, pages.]

Asymptotic complexity of entity queries: \( O(\deg_X(y) \, \deg_Y(x)) \)

slide-30
SLIDE 30

Historic Event Evaluation Data

Evaluation data set from a “This Day in History” website [Gui95]

  • old enough to not contain Wikipedia data
  • exactly one date per sentence
  • 500 hand-annotated historic events
  • example: Ernest Hemingway, Red Cross volunteer, wounded in Italy on 1918-07-08.

slide-31
SLIDE 31

Evaluation on Historic Event Data

[Figure: Retrieving Dates of Historic Events. Fraction of included dates plotted against the maximum rank (10 to 100), by method: LOADTsq, LOADsq, LOAD, BASEw.]
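One plausible reading of the plotted measure is the share of events whose annotated date appears among the top-ranked results up to a maximum rank. The sketch below computes that quantity. The function name and the toy rankings are hypothetical, and the interpretation of the axis label is an assumption.

```python
def fraction_included(ranked_lists, true_dates, max_rank):
    """Fraction of events whose true date is among the first max_rank results.

    ranked_lists: one ranked list of candidate dates per event.
    true_dates: the annotated date of each event, in the same order.
    """
    if not true_dates:
        return 0.0
    hits = sum(
        1
        for ranking, truth in zip(ranked_lists, true_dates)
        if truth in ranking[:max_rank]
    )
    return hits / len(true_dates)

# Toy rankings for three events; only the first event uses a date from the slides.
ranked_lists = [
    ["1918-07-08", "1918-07-09"],
    ["1972-09-05", "1972-08-29"],
    ["1947-03-10"],
]
true_dates = ["1918-07-08", "1972-08-29", "2006-09-07"]
at_1 = fraction_included(ranked_lists, true_dates, max_rank=1)
at_2 = fraction_included(ranked_lists, true_dates, max_rank=2)
```

Raising the maximum rank can only keep or increase the fraction, which is why such curves are non-decreasing.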

slide-32
SLIDE 32

NER based on Wikipedia & Wikidata

slide-34
SLIDE 34

Summary

Ongoing work:

  • online search and query interface for Wikipedia
  • streaming model for online news
  • inclusion of parts-of-speech

LOAD summary:

  • fast entity and event exploration
  • can support most entity-related IE tasks
  • can be extended to any kind of entity
  • scalable and fast
  • language-agnostic with entity linking

LOAD your data before you do entity-based analyses.

slide-35
SLIDE 35

Available for download:

  • Wikipedia LOAD network (Stanford NER)
  • Wikipedia LOAD network (Wikidata)
  • Code for generating LOAD networks
  • Code for LOAD query interface

http://dbs.ifi.uni-heidelberg.de/index.php?id=load

slide-37
SLIDE 37

Bibliography

[CSG+02] Christopher Cieri, Stephanie Strassel, David Graff, Nii Martey, Kara Rennert, and Mark Liberman. Corpora for topic detection and tracking. In Topic Detection and Tracking. Springer, 2002.

[FGM05] Jenny Rose Finkel, Trond Grenager, and Christopher Manning. Incorporating non-local information into information extraction systems by Gibbs sampling. In ACL, 2005.

[Gui95] Robert A Guisepi. History world: On this day in history, 1995. http://history-world.org/ontd.htm (2015-10-02).

[RV13] François Rousseau and Michalis Vazirgiannis. Graph-of-word and TW-IDF: new approach to ad hoc IR. In CIKM, 2013.

[SG13] Jannik Strötgen and Michael Gertz. Multilingual and cross-domain temporal tagging. Language Resources and Evaluation, 47(2):269–298, 2013.
