Network-centric Approaches to the Exploration of News Streams



slide-1
SLIDE 1

Network-centric Approaches to the Exploration of News Streams

Andreas Spitz November 12, 2018 — EPFL, Lausanne

Heidelberg University, Germany Database Systems Research Group

slide-2
SLIDE 2

Collaborators

Satya Almasian Gloria Feher Michael Gertz Jannik Strötgen

slide-3
SLIDE 3

Catching up on the News

www.deviantart.com/clearkid


slide-4
SLIDE 4

Part I Implicit Entity Networks

slide-6
SLIDE 6

The Importance of Entities in News

The Five Ws of journalism:

◮ Who was involved?
◮ Where did it take place?
◮ When did it take place?
◮ What happened?
◮ Why did that happen?

A common definition of event in IR:

◮ An event is something that happens at a given place and time between a group of actors.

slide-7
SLIDE 7

What Are Implicit Entity Networks?


slide-10
SLIDE 10

Implicit Network Construction

slide-11
SLIDE 11

Implicit Network Extraction


slide-12
SLIDE 12

Implicit Network Aggregation

  • A. Spitz and M. Gertz. “Terms over LOAD: Leveraging Named Entities for Cross-Document Extraction and Summarization of Events”. In: SIGIR. 2016

slide-15
SLIDE 15

Applications of Implicit Networks

NLP and IR applications:

◮ Entity disambiguation
◮ Entity linking
◮ Extractive summarization
◮ Relationship extraction
◮ ...

Interactive text stream exploration:

◮ Entity participation in events
◮ Evolving topic detection
◮ Visual summarization
◮ ...

slide-16
SLIDE 16

Entity-centric News Exploration

slide-19
SLIDE 19

News Article Data Set

English news articles from RSS feeds:

◮ 14 news outlets (from US, UK, and AU)
◮ 6 months (Jun 1 - Nov 30, 2016)
◮ 127k articles
◮ 5.4M sentences

The resulting implicit network has

◮ 125k entities
◮ 351k terms
◮ 83.4M edges

NLP processing pipeline:

◮ Part-of-speech and sentence tagging: Stanford POS tagger
◮ Temporal tagging: HeidelTime
◮ Entity classification: YAGO classes (LOC, ORG, PER)
◮ Named entity recognition and linking:

slide-20
SLIDE 20

Implicit Network Exploration Pipeline


slide-21
SLIDE 21

Interactive Entity-centric Search

Try it yourself:

  • A. Spitz, S. Almasian, and M. Gertz. “EVELIN: Exploration of Event and Entity Links in Implicit Networks”. In: WWW. 2017. url: http://evelin.ifi.uni-heidelberg.de:7777

slide-22
SLIDE 22

Interactive Entity-centric Search: An Example


slide-23
SLIDE 23

Evaluation Data: Entity Participation in Events


slide-24
SLIDE 24

Evaluation Results: Entity Participation

[Figure: recall@k (0.2 to 0.8) against rank k (10 to 50), comparing the implicit network with the embedding neighbourhood modes SUM, AVG, and MINMAX. Panels: w2v skip-gram, w2v CBOW, GloVe.]

slide-25
SLIDE 25

Evaluation Results: Performance vs. Entity Frequency

[Figure: entity rank (250 to 750) against entity frequency (up to 2 · 10^5). Panels: implicit network, w2v skip-gram, w2v CBOW, GloVe.]

slide-26
SLIDE 26

Entity-centric Network Topics

slide-27
SLIDE 27

What Are Network Topics?

term          score
skripal       0.83
nerve         0.77
agent         0.76
u.k.          0.61
russia        0.58
diplomat      0.45
intelligence  0.43
poison        0.33
daughter      0.19
yulia         0.17

slide-29
SLIDE 29

Implicit Network Extraction for Topic Detection

Andreas Spitz and Michael Gertz. “Entity-Centric Topic Extraction and Exploration: A Network-Based Approach”. In: ECIR. 2018

slide-30
SLIDE 30

Edge Aggregation and Weighting

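\[
\omega(e) = 3 \cdot \Bigg[
\underbrace{\frac{|D(v_1) \cup D(v_2)|}{|D(e)|}}_{\text{coverage}}
+ \underbrace{\frac{\max\{T(e)\} - \min\{T(e)\}}{|T(e)|}}_{\text{temporal coverage}}
+ \underbrace{\frac{c(e)}{\sum_{\delta \in \Delta(e)} \exp(-\delta)}}_{\text{distance}}
\Bigg]^{-1}
\]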
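The edge weight has the shape of a harmonic mean of three ratios (coverage, temporal coverage, distance). A minimal Python sketch under stated assumptions: D(·) are sets of document ids, T(e) is a list of numeric timestamps (for example days), Δ(e) is a list of mention distances, and c(e) is taken to be the number of co-occurrences unless given explicitly. The function and argument names are hypothetical and do not come from the slides.

```python
import math


def edge_weight(docs_v1, docs_v2, docs_e, times_e, distances_e, count_e=None):
    """Sketch of w(e): three times the inverse of the summed inverse ratios."""
    if count_e is None:
        # Assumption: c(e) is the number of co-occurrences on the edge.
        count_e = len(distances_e)
    # Inverse coverage: documents of either node per document of the edge.
    inv_coverage = len(docs_v1 | docs_v2) / len(docs_e)
    # Inverse temporal coverage: covered time span per timestamp.
    inv_temporal = (max(times_e) - min(times_e)) / len(times_e)
    # Inverse distance term: count over the summed exponential decay.
    inv_distance = count_e / sum(math.exp(-d) for d in distances_e)
    return 3.0 / (inv_coverage + inv_temporal + inv_distance)


# Made-up toy values: two co-occurrences in the same sentence (distance 0).
print(edge_weight({1, 2, 3}, {2, 3, 4}, {2, 3}, [0, 4], [0, 0]))  # 0.6
```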

slide-32
SLIDE 32

Topic Extraction and Triangular Growth

Intuition:

◮ edges between entities correspond to seeds of topics
◮ topics can be grown around seeds by adding relevant terms

slide-34
SLIDE 34

Topic Overlap and Merging Topics


slide-37
SLIDE 37

Topic Subgraph Exploration: An Example


slide-39
SLIDE 39

Term Ranking in Network Topics

term   score
t1     min{ω(e1, t1), ω(e2, t1)}
t2     min{ω(e1, t2), ω(e2, t2)}
...    ...
tn     min{ω(e1, tn), ω(e2, tn)}
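The ranking rule scores each term by the weaker of its two edges to the seed entities e1 and e2, so a term ranks high only if it is strongly connected to both. A minimal Python sketch, assuming the network is stored as a nested dictionary of edge weights and that term nodes are given as a set; the function name, the data layout, and the toy weights are hypothetical.

```python
def rank_topic_terms(adj, e1, e2, terms, top_n=10):
    """Rank terms adjacent to both seed entities by min(w(e1, t), w(e2, t)).

    adj:   dict node -> dict neighbour -> edge weight (assumed layout)
    terms: set of nodes that are terms rather than entities
    """
    # Only terms that close a triangle with the seed edge are candidates.
    shared = (adj[e1].keys() & adj[e2].keys()) & terms
    scored = [(t, min(adj[e1][t], adj[e2][t])) for t in shared]
    # Highest score first; ties broken alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:top_n]


# Made-up weights for illustration only.
adj = {
    "Skripal": {"nerve": 0.9, "agent": 0.8, "daughter": 0.3, "U.K.": 0.7},
    "U.K.": {"nerve": 0.77, "agent": 0.85, "brexit": 0.9, "Skripal": 0.7},
}
terms = {"nerve", "agent", "daughter", "brexit"}
print(rank_topic_terms(adj, "Skripal", "U.K.", terms))
# [('agent', 0.8), ('nerve', 0.77)]
```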

slide-40
SLIDE 40

Deriving Classic Topics From Network Topics

Beirut - Lebanon      Russia - Moscow     Russia - Putin      Trump - Obama
Q3820 - Q822          Q159 - Q649         Q159 - Q7747        Q22686 - Q76
term        score     term      score     term      score     term        score
syrian      0.14      russian   0.28      russian   0.29      presid      0.40
rebel-held  0.12      soviet    0.06      presid    0.18      american    0.21
rebel       0.06      nato      0.06      annex     0.09      republican  0.19
cease-fir   0.05      diplomat  0.06      nato      0.08      democrat    0.19
bombard     0.05      syrian    0.06      hack      0.08      campaign    0.18
bomb        0.04      rebel     0.05      west      0.08      administr   0.17

Network news topics from the New York Times (Jun - Nov 2016)

slide-42
SLIDE 42

Benefits of Entity-centric Network Topics

Benefits vs. traditional topics:

◮ faster extraction than LDA topics
◮ number of topics is flexible
◮ runtime contained in data preparation

Stream compatibility:

◮ document updates require only (sub-)graph updates

slide-43
SLIDE 43

Interactive Topic Exploration

Try it yourself:

  • A. Spitz, S. Almasian, and M. Gertz. “TopExNet: Entity-Centric Network Topic Exploration in News Streams”. In: WSDM. 2019. url: http://topexnet.ifi.uni-heidelberg.de

slide-44
SLIDE 44

Linking Topics to Source Articles


slide-45
SLIDE 45

Contexts of Entity Mentions

slide-46
SLIDE 46

Why the Context Matters

slide-47
SLIDE 47

Edge Context Extraction

Andreas Spitz and Michael Gertz. “Exploring Entity-centric Networks in Entangled News Streams”. In: WWW Companion. 2018


slide-49
SLIDE 49

Context-based Aggregation of Edges

Andreas Spitz and Michael Gertz. “Exploring Entity-centric Networks in Entangled News Streams”. In: WWW Companion. 2018


slide-52
SLIDE 52

Edge Aggregation Approaches

Streaming aggregation:

◮ Compare similarity of new edge (v, w, ·) to existing edges (v, w, ·)
◮ If similarity threshold is exceeded: merge with existing edge
◮ Otherwise, insert as new parallel edge
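The three streaming steps can be sketched in a few lines of Python. This is an illustration under assumptions, not the implementation behind the slides: an edge context is modelled as a set of terms, similarity is Jaccard similarity, and merging takes the union of contexts. The function names and the threshold value are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity of two term sets (an assumed similarity measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def add_edge_streaming(graph, v, w, context, threshold=0.5):
    """Insert a new edge (v, w, context) into a multigraph of parallel edges.

    graph: dict (v, w) -> list of parallel edges {"context": set, "count": int}
    """
    key = tuple(sorted((v, w)))  # undirected: (v, w) and (w, v) share a key
    parallel = graph.setdefault(key, [])
    # Find the most similar existing parallel edge between v and w.
    best, best_sim = None, 0.0
    for edge in parallel:
        sim = jaccard(edge["context"], context)
        if sim > best_sim:
            best, best_sim = edge, sim
    if best is not None and best_sim > threshold:
        # Threshold exceeded: merge with the existing edge.
        best["context"] |= context
        best["count"] += 1
    else:
        # Otherwise: insert as a new parallel edge.
        parallel.append({"context": set(context), "count": 1})
    return graph


graph = {}
add_edge_streaming(graph, "Cameron", "UK", {"brexit", "vote", "referendum"}, threshold=0.4)
add_edge_streaming(graph, "UK", "Cameron", {"brexit", "vote", "leave"}, threshold=0.4)
add_edge_streaming(graph, "Cameron", "UK", {"olympics", "medal"}, threshold=0.4)
print(len(graph[("Cameron", "UK")]))  # 2
```

Lower thresholds merge more aggressively and leave fewer parallel edges; higher thresholds keep more distinct contexts apart.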

Static aggregation / clustering:

◮ Collect all parallel edges
◮ Cluster parallel edges (density-based)
◮ Discard “noisy” edges
◮ Aggregate edges within clusters
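The static variant can be sketched with a small hand-written DBSCAN, one common density-based clustering method. The slides do not name the clustering algorithm, so DBSCAN, the Jaccard distance on term-set contexts, and the parameters `eps` and `min_pts` are all assumptions made for illustration.

```python
def jaccard_distance(a, b):
    """1 minus Jaccard similarity of two term sets (assumed distance)."""
    union = a | b
    return 1.0 - (len(a & b) / len(union) if union else 0.0)


def dbscan(items, eps, min_pts, dist):
    """Plain DBSCAN; returns one label per item, -1 marks noise."""
    n = len(items)
    labels = [None] * n

    def region(i):
        # All items within eps of item i (including i itself).
        return [j for j in range(n) if dist(items[i], items[j]) <= eps]

    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        neighbours = region(i)
        if len(neighbours) < min_pts:
            labels[i] = -1  # provisionally noise
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbours)
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # former noise becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = region(j)
            if len(reachable) >= min_pts:
                queue.extend(reachable)  # j is a core point: keep growing
    return labels


def aggregate_static(contexts, eps=0.7, min_pts=2):
    """Cluster the contexts of parallel edges, drop noise, merge per cluster."""
    labels = dbscan(contexts, eps, min_pts, jaccard_distance)
    clusters = {}
    for ctx, label in zip(contexts, labels):
        if label == -1:
            continue  # discard "noisy" edges
        merged = clusters.setdefault(label, {"context": set(), "count": 0})
        merged["context"] |= ctx
        merged["count"] += 1
    return list(clusters.values())


contexts = [
    {"brexit", "vote", "referendum"},
    {"brexit", "vote", "leave"},
    {"brexit", "vote", "referendum", "ukip"},
    {"olympics", "medal"},
]
merged_edges = aggregate_static(contexts)
print(len(merged_edges), merged_edges[0]["count"])  # 1 3
```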

slide-53
SLIDE 53

Evaluation Results: Entity Participation (with Context)

Comparison of context aggregation methods

[Figure: recall@k (0.1 to 0.8) against rank k (10 to 50) for the aggregation methods streaming, static, and no context.]

slide-54
SLIDE 54

Edge Deflation Potential

Edge deflation in streaming aggregation

[Figure: aggregated edges (50 to 150) against the number of unaggregated edges (2500 to 7500) for aggregation thresholds t = 0.6, t = 0.5, t = 0.4, and t = 0.3.]

slide-55
SLIDE 55

Evolving Network Topics

Topics for David Cameron (Q192) − UK (Q145)

[Figure: relative frequency of mentions (0.00 to 1.00) per month (Jun to Oct) for the topic terms: brexit nation favour demand govern referendum ukip vote westminst campaign prime minist leader resign pro-brexit]

slide-56
SLIDE 56

Summary and Overview (Part I)

slide-58
SLIDE 58

Entity-centric Implicit Networks: Outlook

The implicit network model

◮ is applicable to arbitrary types of entities in any domain
◮ accounts for the sparseness and distance of entity mentions
◮ works well for relatedness tasks (in contrast to similarity)
◮ can extend word embeddings

A focus on entities in news analysis

◮ creates a dependence on entity annotations in texts
◮ accounts for the central role of entities in news
◮ bridges structured and unstructured data
◮ provides a fine-grained view of news

slide-59
SLIDE 59

Part II News Citation Networks

slide-61
SLIDE 61

Why News Citations?

Citation networks exist for

◮ Academic papers
◮ Patents
◮ Case law
◮ Movies
◮ Tweets
◮ ...
◮ ... why not for news?!

www.vosviewer.com

slide-62
SLIDE 62

Citation Network Construction

slide-63
SLIDE 63

News Citation Network Extraction


slide-64
SLIDE 64

News Citation Network Overview

News articles from RSS feeds:

◮ Politics and business feeds
◮ 34 English news outlets (USA, UK, AUS, CAN, GER, CHN, QAT)
◮ 2 years (Nov 2015 - Oct 2017)
◮ 245k articles
◮ 367k edges

Andreas Spitz and Michael Gertz. “Breaking the News: Extracting the Sparse Citation Network Backbone of Online News Articles”. In: ASONAM. 2015


slide-65
SLIDE 65

Evolution of Network Metrics

[Figure: measure value over time (2016-01 to 2017-07) for the aggregated, politics, and business networks. Panels: clustering coefficient, average path length, average degree, undirected diameter.]

slide-66
SLIDE 66

Predicting Publication Dates

slide-68
SLIDE 68

Task Definition: Publication Date Prediction

Predict article publication dates from:

◮ Citation network topology
◮ Publication dates of adjacent articles
◮ Temporal expressions in adjacent articles

Andreas Spitz, Jannik Strötgen, and Michael Gertz. “Predicting Document Creation Times in News Citation Networks”. In: WWW Companion. 2018
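As an illustration of the second input, publication dates of adjacent articles can be summarized into per-article features. The sketch below is an assumption about how such features might be computed, not the feature extraction of the cited paper: edges are (citing, cited) pairs, dates are numeric day offsets, and the feature names are hypothetical.

```python
from statistics import mean, pstdev


def temporal_network_features(node, edges, pub_day):
    """Summarize the known publication days of a node's citation neighbours.

    edges:   iterable of (source, target) citation pairs
    pub_day: dict article -> publication day as a number (known dates only)
    """
    days_in = [pub_day[s] for s, t in edges if t == node and s in pub_day]
    days_out = [pub_day[t] for s, t in edges if s == node and t in pub_day]
    feats = {}
    for direction, days in (("in", days_in), ("out", days_out)):
        if not days:
            # No dated neighbour in this direction: the features are missing.
            for stat in ("min", "max", "mean", "std", "span"):
                feats[f"{stat}_pub_{direction}"] = None
            continue
        feats[f"min_pub_{direction}"] = min(days)
        feats[f"max_pub_{direction}"] = max(days)
        feats[f"mean_pub_{direction}"] = mean(days)
        feats[f"std_pub_{direction}"] = pstdev(days)
        feats[f"span_pub_{direction}"] = max(days) - min(days)
    return feats


# Made-up toy network: a and b cite x, x cites c.
edges = [("a", "x"), ("b", "x"), ("x", "c")]
pub_day = {"a": 10, "b": 14, "c": 3}
features = temporal_network_features("x", edges, pub_day)
print(features["min_pub_in"], features["max_pub_in"], features["span_pub_in"])  # 10 14 4
```

Articles without dated neighbours in one direction end up with missing values, which any downstream regression has to handle.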

slide-71
SLIDE 71

Feature Extraction for Publication Date Prediction

Network topology features:

◮ Node degree-based features
◮ Density-based features
◮ Centrality-based features

Temporal expression features:

Temporal network features:
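Of the topology features, the degree-based ones are the simplest to compute. A minimal Python sketch, assuming the citation network is given as a list of directed (citing, cited) pairs; the function name and the dictionary layout are hypothetical.

```python
from collections import Counter


def degree_features(edges):
    """Compute in-degree, out-degree, and total degree for every node."""
    deg_in, deg_out = Counter(), Counter()
    for source, target in edges:
        deg_out[source] += 1  # source cites one more article
        deg_in[target] += 1   # target is cited once more
    nodes = set(deg_in) | set(deg_out)
    return {
        n: {
            "deg_in": deg_in[n],
            "deg_out": deg_out[n],
            "deg_all": deg_in[n] + deg_out[n],
        }
        for n in nodes
    }


# Made-up toy network: a and b cite x, x cites c.
degrees = degree_features([("a", "x"), ("b", "x"), ("x", "c")])
print(degrees["x"])  # {'deg_in': 2, 'deg_out': 1, 'deg_all': 3}
```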

slide-73
SLIDE 73

Imputation of Missing Feature Values

Missing features

◮ 30.8% of feature values are missing
◮ 89.6% of news articles are missing at least one feature

Imputation of missing values

◮ Column mean of the feature
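Column-mean imputation replaces each missing value with the mean of the observed values of the same feature. A minimal Python sketch, assuming the feature matrix is a list of rows with `None` for missing entries; the function name and the fallback of 0.0 for a column with no observed values are assumptions.

```python
def impute_column_means(rows):
    """Replace None entries by the mean of the observed values in that column."""
    n_cols = len(rows[0])
    means = []
    for c in range(n_cols):
        observed = [row[c] for row in rows if row[c] is not None]
        # Assumption: a column without any observed value falls back to 0.0.
        means.append(sum(observed) / len(observed) if observed else 0.0)
    return [
        [means[c] if row[c] is None else row[c] for c in range(n_cols)]
        for row in rows
    ]


rows = [[1.0, None], [3.0, 4.0], [None, 8.0]]
print(impute_column_means(rows))  # [[1.0, 6.0], [3.0, 4.0], [2.0, 8.0]]
```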

slide-74
SLIDE 74

Evaluation Results: Absolute Prediction Error (in Days)

[Figure: absolute error in days (50 to 250) by regression method (BASE, LR, BAY, NN, RF, GB, SVM). Panels: out, in+out, all, in.]

slide-75
SLIDE 75

Feature Importance: Random Forest

[Figure: relative importance of features in the random forest (log scale, 10^-3 to 10^0), coloured by feature type (network topology, temporal expression, temporal network). Features in the order shown: max(T_out), min(T_in), µ(T_out), µ(T_in), min(T_out), max(T_in), max(X_in), µ(X_in), c_pr, σ(T_out), σ(X_in), c_cl,out, span(T_out), σ(T_in), span(T_in), min(X_in), span(X_in), c_cl,in, min(Dist), deg_out, µ(Dist), deg_in, deg_all, max(Dist), c_btw, cc, σ(Dist).]

slide-76
SLIDE 76

Topology of News Citations

slide-77
SLIDE 77

Exploring Citation Chains in News


slide-78
SLIDE 78

Citation Practices of News Outlets

short  news outlet            days  articles  %otherin  %otherout
AT     The Atlantic           334   7.2       16.7      50.6
BBC    British Bc. Corp.      730   8.1       19.1      8.0
DW     Deutsche Welle         334   1.2       48.1      5.9
FOX    Fox News               548   2.7       0.0       0.0
NPR    National Public Radio  334   0.4       63.6      58.5
NY     The New Yorker         548   3.0       33.5      30.6
NYT    New York Times         669   23.8      26.8      4.7
SMH    Sydney Morn. Herald    548   2.3       3.0       51.9
WP     Washington Post        548   62.7      13.7      5.1

slide-79
SLIDE 79

Summary and Overview (Part II)

slide-81
SLIDE 81

News Citation Networks: Outlook

News citation networks

◮ encode an implicit flow of information (in contrast to explicit mentions of phrases)
◮ depend on thorough data cleaning for their extraction

Possible extension

◮ to blogs and alternative media as additional data sources
◮ by utilizing more metadata beyond dates (e.g., authors)
◮ to citation paths as features (e.g., for recurrent NNs)

slide-82
SLIDE 82

Resources

More available online:

◮ These slides
◮ Data and evaluation ground truth
◮ Implementations

https://dbs.ifi.uni-heidelberg.de/team/spitz/
