SigniTrend: Scalable Detection of Emerging Topics in Textual Streams by Hashed Significance Thresholds
Erich Schubert, Michael Weiler, Hans-Peter Kriegel
Institute of Informatics, Database Systems Group
Ludwig-Maximilians-Universität München
Michael Weiler | SigniTrend: Scalable Detection of Emerging Topics in Textual Streams by Hashed Significance Thresholds | Page 1/10
[Figure: term frequency (axis ticks 14 to 70) per minute from 10:47 to 11:13 for the terms “Facebook” and “WhatsApp”. Source: Twitter Streaming API on Feb. 19th 2014.]
Facebook bought WhatsApp
[Figure: term frequency (axis ticks 14 to 70) per minute from 10:47 to 11:13 for the terms “Facebook” and “WhatsApp” and the pair {“Facebook”, “WhatsApp”}, with points A and B marked.]
Count frequency → Update statistics → Exceeds threshold?
Trend candidates: terms and pairs
At the end of each time slice: new alerting thresholds
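Counting terms and pairs within one time slice could look like the following minimal sketch. It assumes documents arrive already tokenized and stemmed, and that each term or pair is counted once per document; both choices, and the function name `count_slice`, are illustrative assumptions rather than details taken from the slides.

```python
from collections import Counter
from itertools import combinations


def count_slice(documents):
    """Count, for one time slice, in how many documents each term
    and each unordered term pair occurs (hypothetical helper)."""
    counts = Counter()
    for tokens in documents:
        # Deduplicate and sort so every pair has one canonical key.
        terms = sorted(set(tokens))
        for term in terms:
            counts[(term,)] += 1
        for a, b in combinations(terms, 2):
            counts[(a, b)] += 1
    return counts


docs = [
    ["facebook", "bought", "whatsapp"],
    ["facebook", "whatsapp"],
    ["whatsapp", "update"],
]
counts = count_slice(docs)
print(counts[("whatsapp",)])             # 3
print(counts[("facebook", "whatsapp")])  # 2
```

Sorting the deduplicated terms gives every unordered pair a single key, so {“Facebook”, “WhatsApp”} and {“WhatsApp”, “Facebook”} are counted together.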
$$\sigma_{t-1,e} = \sqrt{\mathrm{EWMVar}_{t-1,e}}$$
$$\Delta_{t,e} \leftarrow x_{t,e} - \mathrm{EWMA}_{t-1,e}$$
$$\mathrm{EWMA}_{t,e} \leftarrow \mathrm{EWMA}_{t-1,e} + \alpha \cdot \Delta_{t,e}$$
$$\mathrm{EWMVar}_{t,e} \leftarrow (1 - \alpha) \cdot \left(\mathrm{EWMVar}_{t-1,e} + \alpha \cdot \Delta_{t,e}^{2}\right)$$
[Finch09] T. Finch. Incremental calculation of weighted mean and variance. Technical report, University of Cambridge, 2009.
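The incremental update of the exponentially weighted moving average (EWMA) and variance (EWMVar) from [Finch09] can be sketched in a few lines. The `z_score` helper is an assumption for illustration: it is a plain z-score against the previous statistics with a small hypothetical `eps` to avoid division by zero, not necessarily the exact significance score used by SigniTrend.

```python
import math


def update(ewma, ewmvar, x, alpha):
    """One incremental step of the exponentially weighted mean and variance."""
    delta = x - ewma
    new_ewma = ewma + alpha * delta
    new_ewmvar = (1.0 - alpha) * (ewmvar + alpha * delta * delta)
    return new_ewma, new_ewmvar


def z_score(ewma, ewmvar, x, eps=1e-9):
    """Deviation of x from the previous mean, in units of the previous
    standard deviation (eps is a hypothetical guard against sigma = 0)."""
    return (x - ewma) / (math.sqrt(ewmvar) + eps)


# Illustrative values: previous mean 45, previous variance 900 (sigma = 30),
# new observation 60, learning rate alpha = 0.5.
print(z_score(45.0, 900.0, 60.0))      # approximately 0.5
print(update(45.0, 900.0, 60.0, 0.5))  # (52.5, 506.25)
```

The score is computed against the statistics of the previous time slice before they are updated, so a sudden jump is not absorbed into its own baseline.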
Problem: Too many terms and pairs to track everything
Approach: hashing (based on Bloom filters and heavy hitters) for probabilistic upper-bound statistics
2013 News Dataset
            Stemmed terms    Observed pairs
Total       56,661,782       660,430,059
Uniques     300,141          71,289,359
L=7 buckets, K=2 hash functions
Buckets: #1 #2 #3 #4 #5 #6 #7
{WhatsApp}: 60
Items (each hashed by h1 and h2): {WhatsApp}: 60; {Facebook, WhatsApp}: 2; {Obama, US}: 25
Bucket statistics: 45 ± 30, 2 ± 1, 45 ± 30, 2 ± 1, 20 ± 10
write MAX (upper bound)
read MIN (lowest collision)
read {Obama, US}: min(45±30, 20±10) = 20±10
Upper-bound estimate for mean and its variance
Performance on news dataset: 104s/day with a Raspberry-Pi
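The read-MIN / write-MAX scheme can be illustrated with a small hashed table. This is a sketch under stated assumptions, not the authors' implementation: the class name `HashedStats`, the use of salted SHA-256 as the K hash functions, the string key format, and taking the minimum and maximum separately for mean and variance are all illustrative choices.

```python
import hashlib


class HashedStats:
    """Fixed-size table of (mean, variance) buckets shared by all keys."""

    def __init__(self, num_buckets=7, num_hashes=2):
        self.num_buckets = num_buckets
        self.num_hashes = num_hashes
        self.mean = [0.0] * num_buckets
        self.var = [0.0] * num_buckets

    def _buckets(self, key):
        # Assumption: K hash functions derived by salting SHA-256 with i.
        indexes = []
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            indexes.append(int.from_bytes(digest[:8], "big") % self.num_buckets)
        return indexes

    def read(self, key):
        # read MIN: the least-inflated bucket is the tightest upper bound.
        idx = self._buckets(key)
        return min(self.mean[i] for i in idx), min(self.var[i] for i in idx)

    def write(self, key, mean, var):
        # write MAX: never lower a bucket that another key may rely on.
        for i in self._buckets(key):
            self.mean[i] = max(self.mean[i], mean)
            self.var[i] = max(self.var[i], var)


table = HashedStats(num_buckets=7, num_hashes=2)
table.write("whatsapp", 45.0, 900.0)
table.write("facebook|whatsapp", 2.0, 1.0)
table.write("obama|us", 20.0, 100.0)
mean, var = table.read("obama|us")
print(mean >= 20.0 and var >= 100.0)  # True: reads never fall below what was written
```

Because every write stores a maximum and every read takes a minimum over the key's own buckets, collisions can only inflate the estimate, which is what makes it an upper bound. Memory stays fixed at L buckets regardless of how many distinct terms and pairs occur.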
Inject artificial words with frequency α, e.g. “Obama meets <X123> Netanyahu”
[Figure: Recall of Injected Trends (0% to 100%) versus Hash Table Bits ℓ (8 to 26), one curve per α = 0.01, 0.03, 0.05, 0.07, 0.09, 0.11, 0.13, 0.15.]
Hash table size large enough → recall saturation
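The injection step of this experiment might be sketched as follows. It assumes that "frequency α" means the fraction of documents in a time slice that receive the artificial token; the function name `inject`, the random insertion position, and the fixed seed are hypothetical choices made for illustration.

```python
import random


def inject(documents, token, alpha, seed=0):
    """Insert an artificial token into roughly a fraction alpha of the
    tokenized documents, at a random position (hypothetical helper)."""
    rng = random.Random(seed)
    result = []
    for tokens in documents:
        tokens = list(tokens)  # copy so the input stays unchanged
        if rng.random() < alpha:
            tokens.insert(rng.randrange(len(tokens) + 1), token)
        result.append(tokens)
    return result


docs = [["obama", "meets", "netanyahu"] for _ in range(1000)]
injected = inject(docs, "<X123>", alpha=0.05)
share = sum("<X123>" in d for d in injected) / len(injected)
print(share)  # close to 0.05, varying with the seed
```

Recall is then the share of injected tokens that the detector reports as trends, which can be measured for each hash table size.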