Entity-centric Topic Extraction and Exploration: A Network-based Approach
Andreas Spitz and Michael Gertz March 27, 2018 — ECIR 2018, Grenoble
Heidelberg University, Germany Database Systems Research Group
A Topic From Recent News
◮ with the number of documents
◮ with the number of topics
◮ changing the number of topics
◮ updating the underlying data / processing data streams
◮ topic labels / topic descriptions
◮ relations between topics
◮ edges between entities correspond to seeds of topics
◮ topics can be grown around seeds by adding relevant terms
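The two points above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `grow_topics`, the dictionary-based edge format, and in particular the term score (the weaker of a term's two connections to the seed entities) are hypothetical choices, not taken from the slides.

```python
from collections import defaultdict


def grow_topics(entity_edges, term_edges, num_topics=2, topic_size=3):
    """Grow topics around the highest-weighted entity-entity edges.

    entity_edges: {(entity1, entity2): weight}
    term_edges:   {(entity, term): weight}
    Both formats are assumptions made for this sketch.
    """
    # Index term weights by entity for fast lookup.
    terms_of = defaultdict(dict)
    for (entity, term), weight in term_edges.items():
        terms_of[entity][term] = weight

    # Seeds: the top-ranked edges between entities.
    seeds = sorted(entity_edges, key=entity_edges.get, reverse=True)[:num_topics]

    topics = []
    for e1, e2 in seeds:
        # Candidate terms are those connected to both seed entities.
        shared = terms_of[e1].keys() & terms_of[e2].keys()
        # Assumed score: the weaker of the two connections (ties broken by name).
        ranked = sorted(
            shared,
            key=lambda t: (-min(terms_of[e1][t], terms_of[e2][t]), t),
        )
        topics.append(((e1, e2), ranked[:topic_size]))
    return topics


# Toy data, invented for illustration only.
entity_edges = {("A", "B"): 9.0, ("C", "D"): 7.0, ("A", "C"): 1.0}
term_edges = {
    ("A", "election"): 5.0, ("B", "election"): 4.0,
    ("A", "debate"): 3.0, ("B", "debate"): 3.5,
    ("A", "tower"): 6.0, ("B", "email"): 6.0,
    ("C", "vote"): 2.0, ("D", "vote"): 3.0,
}
topics = grow_topics(entity_edges, term_edges, num_topics=2, topic_size=2)
print(topics)
```

In this sketch, asking for more topics or larger topics only changes how many ranked edges and terms are kept, so no re-training step is involved.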
◮ 14 news outlets (from US, UK, and AU)
◮ 6 months (Jun 1 - Nov 30, 2016)
◮ 127.5 thousand articles
◮ 5.4 million sentences
◮ 119.3 thousand entities
◮ 329.0 thousand terms
◮ 10.6 million edges
◮ Part-of-speech and sentence tagging:
◮ Entity classification:
◮ Named entity recognition and linking:
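A network of entities, terms, and edges like the one counted above could be assembled from annotated sentences roughly as follows. The slides do not state how edges are weighted, so this sketch assumes plain same-sentence co-occurrence counts and keeps only pairs that involve at least one entity; the function name and input format are hypothetical as well.

```python
from collections import Counter
from itertools import combinations


def build_cooccurrence_network(sentences):
    """Count same-sentence co-occurrences between annotated tokens.

    sentences: list of sentences, each a list of (token, kind) pairs,
    where kind is "entity" or "term" (an assumed input format).
    Returns a Counter mapping node pairs to co-occurrence counts.
    """
    edges = Counter()
    for sentence in sentences:
        # Deduplicate and sort so each pair has one canonical orientation.
        nodes = sorted(set(sentence))
        for u, v in combinations(nodes, 2):
            # Assumption: term-term pairs are skipped, entities anchor every edge.
            if u[1] == "term" and v[1] == "term":
                continue
            edges[(u, v)] += 1
    return edges


# Toy input, invented for illustration only.
sentences = [
    [("A", "entity"), ("B", "entity"), ("vote", "term")],
    [("A", "entity"), ("vote", "term"), ("poll", "term")],
]
network = build_cooccurrence_network(sentences)
print(network)
```

Because the counts are additive, new documents can be folded in by running the same loop over their sentences and adding the resulting counter to the existing one.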
[Figure: average topic overlap (0.0 to 0.3) versus number of topics (5 to 20), for LDA and network topics at topic sizes 5, 10, and 50; one series per news outlet: BBC, CBS, CNN, Guardian, IBTimes, Independent, LATimes, NYTimes, Reuters, Skynews, SMH, Telegraph, USAtoday, WPost]
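The overlap measure behind "average topic overlap" is not defined in the text that survives here. One plausible reading, used purely as an assumption in this sketch, is the mean pairwise Jaccard similarity between the term sets of the topics extracted for one outlet.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two term collections (0.0 if both are empty)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def average_topic_overlap(topics):
    """Mean Jaccard similarity over all unordered pairs of topics."""
    pairs = list(combinations(topics, 2))
    if not pairs:
        return 0.0
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Toy topics, invented for illustration only.
example_topics = [["a", "b", "c"], ["b", "c", "d"], ["x", "y", "z"]]
overlap = average_topic_overlap(example_topics)
print(overlap)
```

Under this reading, a lower value means the topics of an outlet share fewer terms with each other.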
◮ faster extraction than LDA topics
◮ runtime contained in data preparation
◮ number of topics is flexible
◮ document updates require only
◮ network visualizations instead of term lists
◮ entities act as labels for topics
◮ Adding more topic seeds (edges):
◮ Adding more descriptive terms:
◮ [data] Implicit news network
◮ [code] Implicit network extraction
◮ [code] Topic exploration and extraction