Network-centric Approaches to the Exploration of News Streams
Andreas Spitz November 12, 2018 — EPFL, Lausanne
Heidelberg University, Germany Database Systems Research Group
Collaborators: Satya Almasian, Gloria Feher, Michael Gertz, Jannik Strötgen
www.deviantart.com/clearkid
The Five Ws of journalism:
◮ Who was involved?
◮ Where did it take place?
◮ When did it take place?
◮ What happened?
◮ Why did that happen?
A common definition of event in IR:
◮ An event is something that happens at a given place and time between a group of actors.
tion and Summarization of Events”. In: SIGIR. 2016
NLP and IR applications:
◮ Entity disambiguation
◮ Entity linking
◮ Extractive summarization
◮ Relationship extraction
◮ ...
Interactive text stream exploration:
◮ Entity participation in events
◮ Evolving topic detection
◮ Visual summarization
◮ ...
English news articles from RSS feeds:
◮ 14 news outlets (from US, UK, and AU)
◮ 6 months (Jun 1 - Nov 30, 2016)
◮ 127k articles
◮ 5.4M sentences
The resulting implicit network has
◮ 125k entities
◮ 351k terms
◮ 83.4M edges
NLP processing pipeline:
◮ Part-of-speech and sentence tagging: Stanford POS tagger
◮ Temporal tagging: HeidelTime
◮ Entity classification: YAGO classes (LOC, ORG, PER)
◮ Named entity recognition and linking
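The slides do not spell out how the implicit network is derived from the annotated sentences. The following is a minimal sketch under stated assumptions: each sentence is a list of (label, kind) mentions, co-occurrences are counted within a window of neighbouring sentences, the weight decays as exp(-d) with sentence distance d, and term-term pairs are skipped. The function name, input format, window size, and weighting are illustrative choices, not the authors' implementation.

```python
import math
from collections import defaultdict
from itertools import combinations


def build_implicit_network(sentences, window=2):
    """Build a weighted co-occurrence network from annotated sentences.

    sentences: list of sentences in document order; each sentence is a list
               of (label, kind) tuples, where kind is e.g. "PER", "LOC",
               "ORG" or "TERM" (hypothetical input format).
    window:    maximum sentence distance for a co-occurrence (assumption).
    Returns a dict mapping a sorted pair of mentions to an edge weight.
    """
    edges = defaultdict(float)
    last = len(sentences) - 1
    for i, left in enumerate(sentences):
        for j in range(i, min(i + window, last) + 1):
            right = sentences[j]
            # Assumed weighting: closer mentions contribute more.
            weight = math.exp(-(j - i))
            if i == j:
                pairs = combinations(left, 2)
            else:
                pairs = ((a, b) for a in left for b in right)
            for a, b in pairs:
                if a == b:
                    continue
                # Assumption: only keep edges that involve an entity.
                if a[1] == "TERM" and b[1] == "TERM":
                    continue
                edges[tuple(sorted((a, b)))] += weight
    return dict(edges)


# Toy document with two sentences (made-up annotations).
doc = [
    [("Putin", "PER"), ("Russia", "LOC"), ("annex", "TERM")],
    [("Moscow", "LOC"), ("nato", "TERM")],
]
net = build_implicit_network(doc, window=1)
```

Mentions in the same sentence get weight 1, while pairs one sentence apart get exp(-1), so sparse and distant mentions still contribute, only less.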
Try it yourself:
Networks”. In: WWW. 2017. url: http://evelin.ifi.uni-heidelberg.de:7777
[Figure: recall@k against rank k (10 to 50) in three panels (w2v skip-gram, w2v CBOW, GloVe); neighbourhood modes: implicit network, SUM, AVG, MINMAX]
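For readers unfamiliar with the recall@k measure used in this comparison, the sketch below shows the standard definition: the fraction of relevant items that appear among the top k ranked results. The function name and the toy data are illustrative and are not taken from the evaluation.

```python
def recall_at_k(ranked, relevant, k):
    """Fraction of the relevant items that occur in the top-k of a ranking."""
    if not relevant:
        return 0.0
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / len(relevant)


# Toy ranking and ground truth (made-up values).
ranked = ["a", "b", "c", "d"]
relevant = {"a", "d", "e"}
r2 = recall_at_k(ranked, relevant, 2)
r4 = recall_at_k(ranked, relevant, 4)
```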
[Figure: entity rank against entity frequency in four panels (implicit network, w2v skip-gram, w2v CBOW, GloVe)]
term          score
skripal       0.83
nerve         0.77
agent         0.76
u.k.          0.61
russia        0.58
diplomat      0.45
intelligence  0.43
poison        0.33
daughter      0.19
yulia         0.17
Andreas Spitz and Michael Gertz. “Entity-Centric Topic Extraction and Exploration: A Network-Based Approach”. In: ECIR. 2018
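Edge weight (the numerators of the first and third terms did not survive extraction and are left open):
\[
\omega(e) = 3 \cdot \left[ \frac{\dots}{|D(e)|} + \frac{\max\{T(e)\} - \min\{T(e)\}}{|T(e)|} + \frac{\dots}{c(e)} \right]^{-1}
\]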
Intuition:
◮ edges between entities correspond to seeds of topics
◮ topics can be grown around seeds by adding relevant terms
term  score
t1    min{ω(e1, t1), ω(e2, t1)}
t2    min{ω(e1, t2), ω(e2, t2)}
...   ...
tn    min{ω(e1, tn), ω(e2, tn)}
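The scoring rule in this table can be sketched in a few lines. The sketch assumes that edge weights between entities and terms are available as a dictionary keyed by (entity, term), and that terms connected to only one of the two seed entities are left out. Both the data layout and the function name are illustrative assumptions, and the weights are made-up toy values.

```python
def grow_topic(seed, weights, top_n=5):
    """Rank terms for a seed edge (e1, e2) by min{w(e1, t), w(e2, t)}.

    seed:    pair of entities that forms the topic seed.
    weights: dict mapping (entity, term) to an edge weight
             (hypothetical layout).
    """
    e1, e2 = seed
    terms1 = {t for (e, t) in weights if e == e1}
    terms2 = {t for (e, t) in weights if e == e2}
    # Assumption: only terms adjacent to both seed entities are candidates.
    scored = {
        t: min(weights[(e1, t)], weights[(e2, t)])
        for t in terms1 & terms2
    }
    # Highest score first; ties are broken alphabetically.
    return sorted(scored.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]


# Toy weights, not taken from the data set.
weights = {
    ("Russia", "russian"): 0.9,
    ("Putin", "russian"): 0.5,
    ("Russia", "nato"): 0.2,
    ("Putin", "nato"): 0.4,
    ("Russia", "soviet"): 0.3,
}
topic = grow_topic(("Russia", "Putin"), weights)
```

Taking the minimum means a term only scores highly if it is strongly connected to both seed entities, which keeps terms that are specific to just one of them out of the topic.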
Network news topics from the New York Times (Jun - Nov 2016)

Beirut - Lebanon     Russia - Moscow     Russia - Putin      Trump - Obama
Q3820 - Q822         Q159 - Q649         Q159 - Q7747        Q22686 - Q76
term        score    term      score     term     score      term        score
syrian      0.14     russian   0.28      russian  0.29       presid      0.40
rebel-held  0.12     soviet    0.06      presid   0.18       american    0.21
rebel       0.06     nato      0.06      annex    0.09       republican  0.19
cease-fir   0.05     diplomat  0.06      nato     0.08       democrat    0.19
bombard     0.05     syrian    0.06      hack     0.08       campaign    0.18
bomb        0.04     rebel     0.05      west     0.08       administr   0.17
Benefits vs. traditional topics:
◮ faster extraction than LDA topics
◮ number of topics is flexible
◮ runtime contained in data preparation
Stream compatibility:
◮ document updates require only (sub-)graph updates
Try it yourself:
Streams”. In: WSDM. 2019. url: http://topexnet.ifi.uni-heidelberg.de
Andreas Spitz and Michael Gertz. “Exploring Entity-centric Networks in Entangled News Streams”. In: WWW Companion. 2018
Streaming aggregation:
◮ Compare similarity of new edge (v, w, ·) to existing edges (v, w, ·)
◮ If similarity threshold is exceeded: merge with existing edge
◮ Otherwise, insert as new parallel edge
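The streaming procedure can be sketched as follows. The slides do not name the similarity measure or the merge operation, so this sketch assumes that each edge carries a set of context terms, that similarity is the Jaccard coefficient of those sets, and that merging takes the union of the contexts and increments a counter. The class and function names are hypothetical.

```python
from collections import defaultdict


def jaccard(a, b):
    """Jaccard similarity of two sets (assumed similarity measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


class StreamingAggregator:
    """Merge incoming edges into similar parallel edges between v and w."""

    def __init__(self, threshold=0.5):
        self.threshold = threshold
        # (v, w) -> list of parallel edges, each with a context and a count.
        self.edges = defaultdict(list)

    def add(self, v, w, context):
        key = tuple(sorted((v, w)))
        context = set(context)
        parallel = self.edges[key]
        # Find the most similar existing parallel edge, if there is one.
        best = max(
            parallel,
            key=lambda e: jaccard(e["context"], context),
            default=None,
        )
        if best is not None and jaccard(best["context"], context) > self.threshold:
            # Threshold exceeded: merge with the existing edge.
            best["context"] |= context
            best["count"] += 1
        else:
            # Otherwise: insert as a new parallel edge.
            parallel.append({"context": context, "count": 1})


# Toy stream of three edges between the same two entities.
agg = StreamingAggregator(threshold=0.3)
agg.add("Cameron", "UK", {"brexit", "vote", "referendum"})
agg.add("UK", "Cameron", {"brexit", "vote", "campaign"})
agg.add("Cameron", "UK", {"resign", "leader"})
result = agg.edges[("Cameron", "UK")]
```

In this toy run the second edge is merged into the first (similarity 0.5), while the third shares no terms with it and becomes a new parallel edge.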
Static aggregation / clustering:
◮ Collect all parallel edges
◮ Cluster parallel edges (density-based)
◮ Discard “noisy” edges
◮ Aggregate edges within clusters
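A minimal sketch of the static variant, assuming DBSCAN as the density-based clustering method and Jaccard distance between the context term sets of parallel edges. Neither choice is confirmed by the slides, and the parameters eps and min_pts as well as all function names are illustrative.

```python
def jaccard_distance(a, b):
    """1 - Jaccard similarity of two sets (assumed distance measure)."""
    union = a | b
    return 1.0 - (len(a & b) / len(union) if union else 1.0)


def dbscan(items, eps, min_pts):
    """Plain DBSCAN over a list of sets; returns one label per item (-1 = noise)."""
    n = len(items)
    # Neighbourhoods include the item itself (distance 0).
    neighbours = [
        [j for j in range(n) if jaccard_distance(items[i], items[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbours[i]) < min_pts:
            labels[i] = -1  # provisional noise, may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbours[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point: join, but do not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbours[j]) >= min_pts:
                queue.extend(neighbours[j])  # core point: expand the cluster
    return labels


def aggregate_static(contexts, eps=0.7, min_pts=2):
    """Cluster parallel edges, drop noise, and merge contexts per cluster."""
    labels = dbscan(contexts, eps, min_pts)
    clusters = {}
    for label, ctx in zip(labels, contexts):
        if label == -1:
            continue  # discard "noisy" edges
        clusters.setdefault(label, set()).update(ctx)
    return list(clusters.values())


# Toy contexts of four parallel edges between the same two entities.
contexts = [
    {"brexit", "vote", "referendum"},
    {"brexit", "vote", "campaign"},
    {"brexit", "referendum", "vote", "ukip"},
    {"resign", "leader"},
]
merged = aggregate_static(contexts)
```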
[Figure: Comparison of context aggregation methods. recall@k against rank k (10 to 50) for the aggregation methods streaming, static, and no context]
[Figure: Edge deflation in streaming aggregation. Aggregated edges against number of unaggregated edges for aggregation thresholds t = 0.6, t = 0.5, t = 0.4, t = 0.3]
[Figure: Topics for David Cameron (Q192) - UK (Q145). Relative frequency of mentions from Jun to Oct for the terms brexit, nation, favour, demand, govern, referendum, ukip, vote, westminst, campaign, prime, minist, leader, resign, pro-brexit]
The implicit network model
◮ is applicable to arbitrary types of entities in any domain
◮ accounts for the sparseness and distance of entity mentions
◮ works well for relatedness tasks (in contrast to similarity)
◮ can extend word embeddings
A focus on entities in news analysis
◮ creates a dependence on entity annotations in texts
◮ accounts for the central role of entities in news
◮ bridges structured and unstructured data
◮ provides a fine-grained view of news
Citation networks exist for
◮ Academic papers
◮ Patents
◮ Case law
◮ Movies
◮ Tweets
◮ ...
◮ ... why not for news?!
www.vosviewer.com
News articles from RSS feeds:
◮ Politics and business feeds
◮ 34 English news outlets (USA, UK, AUS, CAN, GER, CHN, QAT)
◮ 2 years (Nov 2015 - Oct 2017)
◮ 245k articles
◮ 367k edges
Andreas Spitz and Michael Gertz. “Breaking the News: Extracting the Sparse Citation Network Backbone of Online News Articles”. In: ASONAM. 2018
[Figure: clustering coefficient, average path length, average degree, and undirected diameter over time (2016-01 to 2017-07) for the aggregated, politics, and business networks]
Andreas Spitz, Jannik Strötgen, and Michael Gertz. “Predicting Document Creation Times in News Citation Networks”. In: WWW Companion. 2018
Predict article publication dates from:
◮ Citation network topology
◮ Publication dates of adjacent articles
◮ Temporal expressions in adjacent articles
Network topology features:
◮ Node degree-based features
◮ Density-based features
◮ Centrality-based features
Further feature groups: temporal expression features and temporal network features.
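As an illustration of what temporal network features could look like, the sketch below summarizes the publication dates of an article's adjacent articles by their minimum, maximum, mean, standard deviation, and span. The exact feature definitions are not given in this part of the slides, so the function name, the reference date used to turn dates into day offsets, and the choice of the population standard deviation are assumptions.

```python
from datetime import date
from statistics import mean, pstdev

FEATURE_NAMES = ("min", "max", "mean", "std", "span")


def temporal_network_features(neighbour_dates, reference=date(2015, 11, 1)):
    """Summarize the publication dates of adjacent articles.

    neighbour_dates: publication dates of articles linked to the target.
    reference:       hypothetical origin for converting dates to day offsets.
    Returns None for every feature if the article has no dated neighbours,
    which is one way such features end up missing.
    """
    if not neighbour_dates:
        return {name: None for name in FEATURE_NAMES}
    days = [(d - reference).days for d in neighbour_dates]
    return {
        "min": min(days),
        "max": max(days),
        "mean": mean(days),
        "std": pstdev(days),
        "span": max(days) - min(days),
    }


# Toy neighbourhood with three dated articles.
feats = temporal_network_features(
    [date(2016, 1, 1), date(2016, 1, 11), date(2016, 1, 21)]
)
empty = temporal_network_features([])
```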
Missing features
◮ 30.8% of feature values are missing
◮ 89.6% of news articles are missing at least one feature
Imputation of missing values
◮ Column mean of the feature
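Column-mean imputation is simple enough to sketch directly. The version below assumes that the feature matrix is a list of rows with None marking a missing value, and that a column without any observed value falls back to 0.0. Both are illustrative assumptions, as is the function name.

```python
def impute_column_means(rows):
    """Replace missing values (None) by the mean of the observed column values."""
    n_cols = len(rows[0])
    means = []
    for c in range(n_cols):
        observed = [r[c] for r in rows if r[c] is not None]
        # Assumption: fall back to 0.0 if a column has no observed value.
        means.append(sum(observed) / len(observed) if observed else 0.0)
    return [
        [means[c] if v is None else v for c, v in enumerate(r)]
        for r in rows
    ]


# Toy feature matrix with two missing values.
rows = [[1.0, None], [3.0, 4.0], [None, 8.0]]
imputed = impute_column_means(rows)
```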
[Figure: absolute error (days) per regression method (BASE, LR, BAY, NN, RF, GB, SVM); panel labels as extracted: in+out all in]
[Figure: relative importance of features (log scale, 10^-3 to 10^0), coloured by feature type (temporal expression, temporal network). Features in the extracted order: max(T), min(T_in), µ(T), µ(T_in), min(T), max(T_in), max(X_in), µ(X_in), c_pr, σ(T), σ(X_in), c_cl,out, span(T), σ(T_in), span(T_in), min(X_in), span(X_in), c_cl,in, min(Dist), deg, µ(Dist), deg_in, deg_all, max(Dist), c_btw, cc, σ(Dist)]
short  news outlet            days  articles  %other_in  %other_out
AT     The Atlantic           334   7.2       16.7       50.6
BBC    British Bc. Corp.      730   8.1       19.1       8.0
DW     Deutsche Welle         334   1.2       48.1       5.9
FOX    Fox News               548   2.7       0.0        0.0
NPR    National Public Radio  334   0.4       63.6       58.5
NY     The New Yorker         548   3.0       33.5       30.6
NYT    New York Times         669   23.8      26.8       4.7
SMH    Sydney Morn. Herald    548   2.3       3.0        51.9
WP     Washington Post        548   62.7      13.7       5.1
News citation networks
◮ encode an implicit flow of information (in contrast to explicit mentions of phrases)
◮ depend on thorough data cleaning for their extraction
Possible extensions
◮ to blogs and alternative media as additional data sources
◮ by utilizing more metadata beyond dates (e.g., authors)
◮ to citation paths as features (e.g., for recurrent NNs)
More available online:
◮ These slides
◮ Data and evaluation ground truth
◮ Implementations
https://dbs.ifi.uni-heidelberg.de/team/spitz/