slide-1
SLIDE 1

SigniTrend: Scalable Detection of Emerging Topics in Textual Streams by Hashed Significance Thresholds

Erich Schubert, Michael Weiler, Hans-Peter Kriegel

Institute of Informatics, Database Systems Group
Ludwig-Maximilians-Universität München

slide-2
SLIDE 2

Trend detection on streams should be early and accurate

Michael Weiler | SigniTrend: Scalable Detection of Emerging Topics in Textual Streams by Hashed Significance Thresholds | Page 1/10

[Figure: term frequency (y-axis, 14 to 70) over time (10:47 to 11:13) for the terms “Facebook” and “WhatsApp”. Source: Twitter Streaming API on Feb. 19th 2014]

slide-3
SLIDE 3

Trend detection on streams should be early and accurate

[Figure: term frequency (y-axis, 14 to 70) over time (10:47 to 11:13) for the terms “Facebook” and “WhatsApp”, with two points marked A and B. Source: Twitter Streaming API on Feb. 19th 2014]

slide-4
SLIDE 4

Trend detection on streams should be early and accurate

[Figure: term frequency (y-axis, 14 to 70) over time (10:47 to 11:13) for the terms “Facebook” and “WhatsApp”, with two points marked A and B and a question mark. Source: Twitter Streaming API on Feb. 19th 2014]

slide-5
SLIDE 5

Trend detection on streams should be early and accurate

[Figure: term frequency (y-axis, 14 to 70) over time (10:47 to 11:13) for the terms “Facebook” and “WhatsApp” and for the pair {“Facebook”, “WhatsApp”}, with two points marked A and B. Source: Twitter Streaming API on Feb. 19th 2014]

slide-6
SLIDE 6

Trend detection on streams should be early and accurate

[Figure: term frequency (y-axis, 14 to 70) over time (10:47 to 11:13) for the terms “Facebook” and “WhatsApp” and for the pair {“Facebook”, “WhatsApp”}, with two points marked A and B. Annotation: Facebook bought WhatsApp]

slide-7
SLIDE 7

Problem description

  • 1. Statistical significance score
      • Popular topics ≠ trending topics (e.g. Obama)
  • 2. Track interacting terms
      • Facebook bought WhatsApp
      • Edward Snowden traveled to Moscow
  • 3. Scalability
      • Efficient calculation for all terms and pairs

slide-8
SLIDE 8

SigniTrend on textual streams

Tracking both single terms and pairs:

  • A. Preprocessing (stopwords, stemming, duplicates)
  • B. Trend detection cycle
      • Temporal slicing for statistical aggregation
      • Score all terms and pairs based on expectations from past slices
  • C. Refinement with clustering
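Steps A and B of this outline can be illustrated with a minimal Python sketch that counts, for one time slice, the number of documents in which each term and each unordered pair of terms occurs. The stopword list, the `tokenize` and `count_slice` helpers, and the sample documents are hypothetical; stemming and duplicate-document removal are omitted, so this is not the authors' preprocessing.

```python
from collections import Counter
from itertools import combinations

# Hypothetical, deliberately tiny stopword list (assumption of this sketch).
STOPWORDS = {"the", "a", "to", "of", "and"}

def tokenize(text):
    """Lowercase, strip simple punctuation, drop stopwords.
    Returns the distinct terms of one document in sorted order."""
    tokens = (w.strip(".,!?\"'").lower() for w in text.split())
    return sorted({t for t in tokens if t and t not in STOPWORDS})

def count_slice(documents):
    """Document frequency of every term and every term pair in one time slice."""
    counts = Counter()
    for doc in documents:
        terms = tokenize(doc)
        counts.update((t,) for t in terms)       # single terms, stored as 1-tuples
        counts.update(combinations(terms, 2))    # unordered pairs, in sorted order
    return counts

slice_docs = [
    "Facebook bought WhatsApp",
    "Facebook to buy WhatsApp",
    "Obama visits Moscow",
]
counts = count_slice(slice_docs)
print(counts[("facebook",)], counts[("facebook", "whatsapp")])
```

Storing single terms as 1-tuples keeps terms and pairs in one counter, so the same scoring code can later be applied to both.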

slide-9
SLIDE 9

Trend detection cycle

[Diagram: terms and pairs → count frequency → exceeds threshold? → trend candidates; at the end of each time slice: update statistics → new alerting thresholds]

slide-10
SLIDE 10
Update statistics

for time slice t and term or pair e

  • How many standard deviations is the current frequency x higher than its mean?

$$z(x_{t,e}) := \frac{x_{t,e} - \mu_{t-1,e}}{\sigma_{t-1,e}}$$
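As a worked instance of this score, take purely illustrative numbers (not taken from the talk's data): a term whose past slices give a mean of 45 and a standard deviation of 30. A current frequency of 60 is then unremarkable, while a frequency of 135 lies three standard deviations above the mean:

$$z(60) = \frac{60 - 45}{30} = 0.5, \qquad z(135) = \frac{135 - 45}{30} = 3$$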

slide-11
SLIDE 11
Update statistics

for time slice t and term or pair e

  • How many standard deviations is the current frequency x higher than its mean?
  • Exponentially weighted moving average/variance for continuous estimation on a stream [Finch09]

$$z(x_{t,e}) := \frac{x_{t,e} - \mathrm{EWMA}_{t-1,e}}{\sqrt{\mathrm{EWMVar}_{t-1,e}}}$$

$$\Delta_{t,e} \leftarrow x_{t,e} - \mathrm{EWMA}_{t-1,e}$$

$$\mathrm{EWMA}_{t,e} \leftarrow \mathrm{EWMA}_{t-1,e} + \alpha \cdot \Delta_{t,e}$$

$$\mathrm{EWMVar}_{t,e} \leftarrow (1 - \alpha) \cdot \left(\mathrm{EWMVar}_{t-1,e} + \alpha \cdot \Delta_{t,e}^{2}\right)$$

[Finch09] T. Finch. Incremental calculation of weighted mean and variance. Technical report, University of Cambridge, 2009
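The update rules on this slide can be sketched in a few lines of Python. The class name `EwmaStats`, the zero initial values, and the guard that returns 0.0 while no variance has been observed are assumptions of this sketch, not details given in the talk.

```python
import math

class EwmaStats:
    """Exponentially weighted moving average and variance of one term or pair."""

    def __init__(self, alpha, mean=0.0, var=0.0):
        self.alpha = alpha  # learning rate, 0 < alpha < 1
        self.mean = mean    # EWMA of the previous slices
        self.var = var      # EWMVar of the previous slices

    def zscore(self, x):
        """Standard deviations by which x exceeds the previous EWMA."""
        std = math.sqrt(self.var)
        if std == 0.0:
            return 0.0      # assumption: no score until some variance is seen
        return (x - self.mean) / std

    def update(self, x):
        """Fold the frequency x of the slice that just ended into the statistics."""
        delta = x - self.mean
        self.mean += self.alpha * delta
        self.var = (1.0 - self.alpha) * (self.var + self.alpha * delta * delta)

stats = EwmaStats(alpha=0.5)
stats.update(10)          # mean becomes 5.0, variance becomes 25.0
z = stats.zscore(15)      # (15 - 5) / sqrt(25)
print(stats.mean, stats.var, z)
```

Scoring with `zscore` before calling `update` matches the indices in the formulas: the current frequency is compared against the statistics of the slices before it.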

slide-12
SLIDE 12

Significance and frequency for term “Facebook”


slide-13
SLIDE 13

Problem: Too many terms and pairs to track everything

How to track statistics of all pairs efficiently?

2013 News Dataset

             STEMMED TERMS   OBSERVED PAIRS
  TOTAL         56,661,782      660,430,059
  UNIQUES          300,141       71,289,359

  • Therefore, we designed an efficient hashing scheme (based on Bloom Filters and Heavy Hitters) for probabilistic upper-bound statistics

slide-14
SLIDE 14

Hashing scheme for efficient tracking

L=7 buckets, K=2 hash functions

[Diagram: hash table with buckets #1 to #7; incoming item {WhatsApp}: 60]

slide-15
SLIDE 15

Hashing scheme for efficient tracking

L=7 buckets, K=2 hash functions

[Diagram: {WhatsApp}: 60 is mapped by the hash functions h1 and h2 to two buckets, each holding 45 ± 30]

slide-16
SLIDE 16

Hashing scheme for efficient tracking

L=7 buckets, K=2 hash functions

[Diagram: {WhatsApp}: 60 occupies two buckets holding 45 ± 30; {Facebook, WhatsApp}: 2 is mapped by h1 and h2 to two buckets, each holding 2 ± 1]

slide-17
SLIDE 17

Hashing scheme for efficient tracking

L=7 buckets, K=2 hash functions

[Diagram: {WhatsApp}: 60 and {Facebook, WhatsApp}: 2 occupy buckets holding 45 ± 30 and 2 ± 1; {Obama, US}: 25, with statistics 20 ± 10, is mapped by h1 and h2 to a bucket holding 45 ± 30 and a bucket holding 2 ± 1. The collision is marked “?”]

slide-18
SLIDE 18

Hashing scheme for efficient tracking

L=7 buckets, K=2 hash functions

write MAX (upper bound)

[Diagram: items {WhatsApp}: 60, {Facebook, WhatsApp}: 2 and {Obama, US}: 25 hashed by h1 and h2; bucket values shown: 45 ± 30, 2 ± 1, 20 ± 10]

slide-19
SLIDE 19

Hashing scheme for efficient tracking

L=7 buckets, K=2 hash functions

Upper-bound estimate for mean and its variance

  • write MAX (upper bound)
  • read MIN (lowest collision)
  • read {Obama, US}: min(45±30, 20±10) = 20±10

[Diagram: items {WhatsApp}: 60, {Facebook, WhatsApp}: 2 and {Obama, US}: 25 hashed by h1 and h2; bucket values shown: 45 ± 30, 2 ± 1, 20 ± 10]

slide-20
SLIDE 20

Hashing scheme for efficient tracking

L=7 buckets, K=2 hash functions

Upper-bound estimate for mean and its variance

  • write MAX (upper bound)
  • read MIN (lowest collision)
  • read {Obama, US}: min(45±30, 20±10) = 20±10

Performance on news dataset: 104 s/day with a Raspberry Pi

[Diagram: items {WhatsApp}: 60, {Facebook, WhatsApp}: 2 and {Obama, US}: 25; bucket values shown: 45 ± 30, 2 ± 1, 20 ± 10]
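The "write MAX, read MIN" rule can be sketched in Python as follows. This is a simplified illustration, not the authors' implementation: the class name `HashedStats`, the use of `hashlib.blake2b` with an index prefix to derive K hash functions, and the choice to take the minimum of mean and variance independently are all assumptions, and the per-slice handling of the table is left out.

```python
import hashlib

class HashedStats:
    """Fixed-size table of (mean, variance) buckets shared by all terms and pairs."""

    def __init__(self, bits=16, num_hashes=2):
        self.size = 1 << bits           # number of buckets
        self.num_hashes = num_hashes    # K hash functions
        self.mean = [0.0] * self.size
        self.var = [0.0] * self.size

    def _buckets(self, key):
        """Bucket indices of key under the K hash functions (assumed construction)."""
        indices = []
        for i in range(self.num_hashes):
            data = f"{i}:{key}".encode("utf-8")
            digest = hashlib.blake2b(data, digest_size=8).digest()
            indices.append(int.from_bytes(digest, "big") % self.size)
        return indices

    def write(self, key, mean, var):
        """write MAX: a bucket keeps the largest value ever written to it."""
        for b in self._buckets(key):
            self.mean[b] = max(self.mean[b], mean)
            self.var[b] = max(self.var[b], var)

    def read(self, key):
        """read MIN: the bucket least inflated by collisions gives the estimate."""
        buckets = self._buckets(key)
        return (min(self.mean[b] for b in buckets),
                min(self.var[b] for b in buckets))

table = HashedStats(bits=16, num_hashes=2)
table.write("whatsapp", 45.0, 900.0)
table.write("obama|us", 20.0, 100.0)
mean, var = table.read("obama|us")   # never below the values written for this key
print(mean, var)
```

Because every write can only raise a bucket and every read takes the smallest of a key's buckets, the value read is an upper bound on what was written for that key. A collision can therefore only raise the alerting threshold, never lower it.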

slide-21
SLIDE 21

Artificial trends evaluation

Inject artificial words with frequency α, e.g. “Obama meets <X123> Netanyahu”

[Figure: recall of injected trends (0% to 100%) versus hash table bits ℓ (8 to 26), one curve each for α = 0.01, 0.03, 0.05, 0.07, 0.09, 0.11, 0.13, 0.15]

Hash table size large enough → recall saturation
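The injection step of this evaluation could look like the following Python sketch. It assumes that "frequency α" means the fraction of documents that receive the artificial word and that the word is placed at a random position; the function name `inject` and the seed handling are hypothetical.

```python
import random

def inject(documents, token, alpha, seed=0):
    """Insert token at a random position into roughly a fraction alpha of the documents."""
    rng = random.Random(seed)   # seeded so that an experiment can be repeated
    result = []
    for doc in documents:
        words = doc.split()
        if rng.random() < alpha:
            # randint is inclusive on both ends, so the token may also be appended
            words.insert(rng.randint(0, len(words)), token)
        result.append(" ".join(words))
    return result

docs = ["Obama meets Netanyahu", "Facebook bought WhatsApp"]
all_injected = inject(docs, "<X123>", alpha=1.0)
none_injected = inject(docs, "<X123>", alpha=0.0)
print(all_injected)
```

Recall is then the share of injected words that the detector reports as trends, which the figure plots against the hash table size.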

slide-22
SLIDE 22
Refinement & clustering

  • Inverted index (Apache Lucene) to verify trend candidates and measure exactly (without hashing) for precise reporting (false positives can be eliminated)
  • Single Link clustering with Ward of remaining trends (similarity matrix is built with the exact significance of all pairs)
  • Future work: include topic modeling techniques (e.g. pLSI, LDA)

slide-23
SLIDE 23

Thank You!

Questions?
