Topic Detection and Trend Sensing Via Joint Complexity



SLIDE 1

Topic Detection and Trend Sensing Via Joint Complexity

Dimitris Milioris (1, 2)

(1) Bell Labs, Alcatel-Lucent, France
(2) École Polytechnique ParisTech

Advisor: Philippe Jacquet (1, 2)

ResCom 2014

SLIDE 2

Overview

n Motivation & Challenges n I-Complexity n Joint Complexity n Benefits n Experiments n Conclusions n Future work

SLIDE 3

Motivation

n Online social media services have seen a huge expansion:

q The value of information has increased dramatically q Interactions and communication between users help predict the

evolution of information

q The ability to study Social Networks can provide relevant info in real

time

SLIDE 4

Challenges

The study of social networks poses several research challenges:

  • Searching in social media is still an open problem
      → posts are short and arrive in tremendous quantity in real time
  • Information on the correlation between groups of users
      → predict media consumption, network resources, and traffic
      → improve QoS
  • Analyzing the relationships between members of a group or community
      → reveal important teams
  • Spam and advertisement detection
      → the amount of irrelevant information grows continuously

SLIDE 5

I – Complexity

n X is a sequence and I(X) is a set of factors (distinct substr.) n Example: X = apple, then:

I(X) = {a, p, l, e, ap, pp, pl, le, app, ppl, ple, appl, pple, apple, v}

n |I(X)| is the complexity of a sequence

q |I(X)| = 15 (v denotes the empty string) 5

SLIDE 6

Joint Complexity [1]

n The information contained in a string may be revealed by

comparing with a reference string

n The Joint Complexity is the number of common distinct

factors in two sequences

n J(X, Y) = |I(X) ∩ I(Y)| n Efficient way to estimate similarity degree of two sequences n The analysis of a sequence in subcomponents is done by

Suffix Trees

q

Simple, fast and low complexity method to store and recall from memory

[1] P. Jacquet, D. Milioris and W. Szpankowski, “Classification of Markov Sources Through Joint String Complexity: Theory and Experiments”, in IEEE International Symposium on Information Theory (ISIT’13), Istanbul, Turkey, July 2013.
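The definition J(X, Y) = |I(X) ∩ I(Y)| translates directly into a set intersection. The sketch below is a naive illustration under that definition only; the names `factors` and `joint_complexity` are assumptions made here, and it does not use the suffix trees mentioned on the slide.

```python
def factors(x: str) -> set[str]:
    """Return I(x): every distinct substring of x, plus the empty string."""
    result = {""}
    for start in range(len(x)):
        for end in range(start + 1, len(x) + 1):
            result.add(x[start:end])
    return result


def joint_complexity(x: str, y: str) -> int:
    """Return J(x, y) = |I(x) ∩ I(y)|, the number of common distinct factors."""
    return len(factors(x) & factors(y))


print(joint_complexity("apple", "maple"))  # 9
```

For "apple" and "maple" the common factors are a, p, l, e, ap, pl, le, ple and the empty string. Comparing a sequence with itself gives its own complexity, so `joint_complexity("apple", "apple")` is 15.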

SLIDE 7

Suffix Trees Superposition [2]

n Suffix Tree superposition of X = apple and Y = maple n It reveals the common factors of X and Y, and gives a similarity metric n Time to build a S.T. = O(n logn) n Space in memory = O(n), n is the length of the tweet

7

JC(apple, maple) = 9

[2] D. Milioris and P. Jacquet, “Joint Sequence Complexity Analysis: Application to Social Networks Information Flow”, in Bell Laboratories Technical Journal, Issue on Data Analytics, Vol. 18, No. 4, 2014
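The idea of superposition can be sketched with an uncompressed suffix trie: every node corresponds to one distinct factor, so the nodes that exist in both tries at the same path are exactly the common factors. This is an illustrative simplification, not the authors' implementation. An uncompressed trie takes O(n²) time and space, unlike the O(n log n) time and O(n) space quoted for suffix trees, and the function names are chosen here for the example.

```python
def build_suffix_trie(text: str) -> dict:
    """Build an uncompressed suffix trie as nested dicts (one node per factor)."""
    root: dict = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
    return root


def count_common_nodes(a: dict, b: dict) -> int:
    """Count the nodes present in both tries, walking them in lockstep."""
    total = 1  # the current pair of nodes is itself one common factor
    for ch, child_a in a.items():
        child_b = b.get(ch)
        if child_b is not None:
            total += count_common_nodes(child_a, child_b)
    return total


def joint_complexity(x: str, y: str) -> int:
    """Joint Complexity via superposition of the two suffix tries."""
    return count_common_nodes(build_suffix_trie(x), build_suffix_trie(y))


print(joint_complexity("apple", "maple"))  # 9
```

The root pair accounts for the empty string, so the result agrees with the set-based definition. The recursion depth is bounded by the length of the shorter input, which is small for tweet-sized strings.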

SLIDE 8

Topic Detection

n Timeslot representation via connected weighted graphs n Each tweet is a node in the graph and an adjacency matrix

(triangular) holds the weight (JC) of every edge

8

[3] G. Burnside, D. Milioris and P. Jacquet, “One Day in Twitter: Topic Detection via Joint Complexity”, in Snow Data Challenge, Seoul, Korea, April 2014.
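The graph representation of a timeslot can be sketched as a strictly upper-triangular matrix of pairwise Joint Complexity values. This is a minimal sketch under assumptions: the function names, the choice to leave the diagonal at zero, and the sample strings are illustrative, and the set-based JC stands in for a suffix-tree implementation.

```python
def factors(x: str) -> set[str]:
    """Return every distinct substring of x, plus the empty string."""
    result = {""}
    for start in range(len(x)):
        for end in range(start + 1, len(x) + 1):
            result.add(x[start:end])
    return result


def joint_complexity(x: str, y: str) -> int:
    """Number of distinct factors shared by x and y."""
    return len(factors(x) & factors(y))


def jc_matrix(tweets: list[str]) -> list[list[int]]:
    """Upper-triangular adjacency matrix: entry [i][j], i < j, is JC(tweet i, tweet j)."""
    n = len(tweets)
    matrix = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):  # JC is symmetric, so one triangle suffices
            matrix[i][j] = joint_complexity(tweets[i], tweets[j])
    return matrix


# Hypothetical three-tweet timeslot, reusing the strings from the earlier example.
for row in jc_matrix(["apple", "maple", "xyz"]):
    print(row)
```

Because J(X, Y) = J(Y, X), storing only the entries above the diagonal halves the memory needed for a timeslot of tweets.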

SLIDE 9

Topic Detection

SLIDE 10

Algorithms

SLIDE 11

Experiments – Trend Sensing

[Figure: Joint Complexity of four sets of tweets from the 2012 United States presidential elections and the 2012 Olympic Games in London. x-axis: Text length (n), 2000 to 10000; y-axis: Joint Complexity (×10^4).]

Curves: OlympicsSet1 over OlympicsSet2; OlympicsSet1 over OlympicsSet1Sim; OlympicsSet2 over OlympicsSet2Sim; USelectionsSet1 over USelectionsSet2; USelectionsSet1 over USelectionsSet1Sim; USelectionsSet2 over USelectionsSet2Sim; USelectionsSet1 over OlympicsSet1; USelectionsSet2 over OlympicsSet2.

SLIDE 12

Experiments

[Figure: Joint Complexity of tweet sets from the 2012 United States presidential elections and the 2012 Olympic Games in London, compared with theoretical curves, using the third Markov order. x-axis: Text length (n), 1000 to 10000; y-axis: Joint Complexity (×10^4).]

Curves: USelectionsSet11Sim over USelectionsSet12Sim; USelectionsSet21Sim over USelectionsSet22Sim; USelectionsSet1Sim over OlympicsSet1Sim; USelectionsSet1Sim over OlympicsSet2Sim; USelectionsSet1Sim (theoretical); USelectionsSet2Sim (theoretical); USelectionsSet1Sim over OlympicsSet1Sim (theoretical).
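The "Sim" labels and the mention of the third Markov order suggest that some curves use simulated text. The slides do not describe how those texts were produced, so the following is only a sketch of one plausible approach: fit a character-level Markov model of order 3 to a source text and sample from it. The function names, the circular treatment of the text, and the sample sentence are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def fit_markov(text: str, order: int = 3) -> dict[str, list[str]]:
    """Map each context of `order` characters to the characters that follow it."""
    if len(text) <= order:
        raise ValueError("text must be longer than the Markov order")
    # Treat the text as circular so that every context has a successor.
    circular = text + text[:order]
    model = defaultdict(list)
    for i in range(len(text)):
        model[circular[i:i + order]].append(circular[i + order])
    return dict(model)


def simulate(model: dict[str, list[str]], length: int, seed: int = 0) -> str:
    """Sample `length` characters from the fitted model."""
    rng = random.Random(seed)
    context = rng.choice(sorted(model))  # random starting context
    out = list(context)
    while len(out) < length:
        nxt = rng.choice(model[context])  # draw in proportion to observed counts
        out.append(nxt)
        context = context[1:] + nxt  # slide the context window by one character
    return "".join(out)[:length]


# Hypothetical source text standing in for a set of tweets.
source = "the quick brown fox jumps over the lazy dog and the quick cat"
print(simulate(fit_markov(source), 40))
```

A simulated text produced this way could then be compared with the original set using the Joint Complexity, which is one reading of curves such as "USelectionsSet1 over USelectionsSet1Sim".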

SLIDE 13

Benefits

n Both message classification and identification of the growing

trends in real time (trend sensing)

n Track the information and timeline within a social network n Deal with languages other than English without specific pre-

processing or dictionaries, because the method is:

q simple, context-free, with no grammar and does not use semantics 13

SLIDE 14

Conclusions

n Implementation of a topic detection method applied to a

dataset of tweets emitted during a 24 hour period

n 3rd Prize in Snow Data Challenge in WWW conf. 2014 n It relies heavily on the concept of Joint String Complexity

which has the benefit

q of being language agnostic and does not require humans to deal

with list of keywords

q has high algorithmic efficiency 14

SLIDE 15

Future Work, Improvements

n Use the theoretical background in order to automatically fix

the threshold values, than empirical ones (topic detection)

n Extend the JC metric to make topological classification of

tweets and perform clustering based on this distance

SLIDE 16

Publications related to JC

  • G. Burnside, D. Milioris and P. Jacquet, “One Day in Twitter: Topic Detection via Joint Complexity”, Snow Data Challenge, WWW 2014
  • D. Milioris and P. Jacquet, “Joint Sequence Complexity Analysis: Application to Social Networks Information Flow”, in Bell Laboratories Technical Journal, Issue on Data Analytics, Vol. 18, No. 4, 2014
  • P. Jacquet, D. Milioris and W. Szpankowski, “Classification of Markov Sources Through Joint String Complexity: Theory and Experiments”, Proc. IEEE Internat. Symp. Inform. Theory (ISIT ’13)
  • P. Jacquet and W. Szpankowski, “Joint String Complexity for Markov Sources”, Proc. 23rd Internat. Meeting on Probabilistic, Combinatorial, and Asymptotic Methods for the Anal. of Algorithms (AofA ’12)
  • P. Jacquet, “Common Words Between Two Random Strings”, Proc. IEEE Internat. Symp. on Inform. Theory (ISIT ’07)

SLIDE 17

Publications related to JC

  • P. Jacquet and W. Szpankowski, “Analytical Depoissonization and Its Applications”, Theoret. Comput. Sci., 201:1-2 (1998), 1–62.
  • P. Jacquet and W. Szpankowski, “Autocorrelation on Words and Its Applications: Analysis of Suffix Trees by String-Ruler Approach”, J. Combin. Theory Ser. A, 66:2 (1994), 237–269.

SLIDE 18

Questions ?


dimitrios.milioris@polytechnique.edu