mining temporal networks



  1. Mining temporal networks • Aristides Gionis, Department of Computer Science, Aalto University • users.ics.aalto.fi/gionis • Nov 14, 2016

  2. networks • a simple abstraction used to model many different real-world datasets – social networks – information networks – technology networks – biological networks

  3. traditional view • networks represented as pure graph-theory objects – no additional vertex / edge information • emphasis on static networks • dynamic settings model structural changes – vertex / edge additions / deletions

  4. temporal networks • ability to collect and store large volumes of network data • available data have fine granularity • lots of additional information associated to vertices/edges • network topology is relatively stable, while lots of activity and interaction is taking place • giving rise to new concepts, new problems, and new computational challenges

  5. modeling activity in networks 1. network nodes perform actions (e.g., posting messages) [figure: actions of individual nodes along a time axis] 2. network nodes interact with each other (e.g., a “like”, a repost, or sending a message to each other) [figure: interactions between nodes along a time axis]

  6. many novel and interesting concepts • temporal information paths • new pattern types • network evolution • new types of events [figures: one example network per concept]

  7. temporal networks: objectives • identify new concepts and new problems • develop algorithmic solutions • demonstrate relevance to real-world applications

  8. agenda • tracking important nodes – maintaining neighborhood profiles – temporal PageRank • reconstructing an epidemic over time

  9. tracking important nodes: maintaining sliding-window neighborhood profiles • R. Kumar, T. Calders, A. Gionis, and N. Tatti, ECML PKDD 2015

  10. distance distributions in graphs • given a graph G, a node u, and a distance r: how many nodes of G are within distance r of u? • fundamental graph-mining primitive – median distance, diameter, effective diameter • related to small-world phenomena • a measure of centrality for nodes of G

  11. distance distributions in graphs • exact solution requires all-pairs shortest path computation – Floyd-Warshall algorithm: $O(n^3)$ – or, BFS from every node for unweighted graphs: $O(nm)$ • clearly not scalable • resort to approximations based on diffusion methods

  12. diffusion-based computation [Palmer et al., 2002] • let $B_t(x)$ be the ball of radius t around x (the set of nodes at distance ≤ t from x) • clearly $B_0(x) = \{x\}$ • moreover $B_{t+1}(x) = \bigcup_{(x,y) \in E} B_t(y) \cup \{x\}$ • so computing $B_{t+1}$ from $B_t$ just takes a single (sequential) scan of the graph
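The recurrence on this slide can be sketched with exact Python sets. This is only an illustration, not the code of Palmer et al.: the function name `ball_sizes` and the edge-list representation are assumptions, and an undirected graph has to be passed with both directions of every edge.

```python
def ball_sizes(edges, nodes, max_t):
    """Return {x: [|B_0(x)|, ..., |B_max_t(x)|]} using exact sets.

    edges: list of directed pairs (x, y); nodes: iterable of all nodes.
    (Hypothetical helper, written only to illustrate the recurrence.)
    """
    nodes = list(nodes)
    balls = {x: {x} for x in nodes}           # B_0(x) = {x}
    sizes = {x: [1] for x in nodes}
    for _ in range(max_t):
        # Start from a copy so that B_t(x) (and hence x) stays in B_{t+1}(x).
        new_balls = {x: set(b) for x, b in balls.items()}
        for x, y in edges:                    # one sequential scan of the edges
            new_balls[x] |= balls[y]          # add B_t(y) for every edge (x, y)
        balls = new_balls
        for x in nodes:
            sizes[x].append(len(balls[x]))
    return sizes
```

Each round reads only the balls of the previous round, which is what allows the single sequential scan; the price is one full set per node, the space problem that sketching addresses.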

  13. diffusion-based computation • every set requires $O(n)$ bits, hence $O(n^2)$ bits overall • amount of space is prohibitively large • instead use sketching for counting distinct elements • probabilistic counters require very small space (log log) • HyperANF algorithm [Boldi et al., 2011] – uses HyperLogLog counters [Flajolet et al., 2007] – with 40 bits you can count up to 4 billion with a standard deviation of 6%
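To make the role of the counters concrete, here is a minimal textbook-style HyperLogLog sketch. It is not the HyperANF implementation: the class name, the SHA-1 hashing, and the default of 2^10 registers are assumptions chosen for the example. The property that matters for diffusion is that two sketches can be merged into a sketch of the union by a register-wise maximum.

```python
import hashlib
import math


class HyperLogLog:
    """Illustrative HyperLogLog counter with 2**b registers (b >= 7 assumed)."""

    def __init__(self, b=10):
        self.b = b
        self.m = 1 << b
        self.registers = [0] * self.m

    def add(self, item):
        digest = hashlib.sha1(str(item).encode()).digest()
        x = int.from_bytes(digest[:8], "big")           # 64-bit hash value
        bucket = x >> (64 - self.b)                     # first b bits pick a register
        rest = x & ((1 << (64 - self.b)) - 1)           # remaining 64 - b bits
        rank = (64 - self.b) - rest.bit_length() + 1    # leading zeros + 1
        self.registers[bucket] = max(self.registers[bucket], rank)

    def merge(self, other):
        """Turn this sketch into a sketch of the union of both sets."""
        self.registers = [max(p, q) for p, q in zip(self.registers, other.registers)]

    def estimate(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)           # bias constant for m >= 128
        raw = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:               # small-range correction
            return self.m * math.log(self.m / zeros)
        return raw
```

With such counters, the set union in the diffusion step becomes a `merge` call, and each ball is stored in a fixed number of small registers instead of an n-bit set.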

  14. extension to temporal networks • limitations of existing solutions – consider static network – multi-pass algorithm • in this work – extension to temporal networks – streaming algorithm for sliding-window model : – consider only the most recent interactions (edges)

  15. setting • temporal network G = (V, E) • stream of edges $E = \langle (u_1, v_1, t_1), (u_2, v_2, t_2), \ldots \rangle$ with $t_1 \le t_2 \le \ldots$ • sliding window length w • snapshot network G(t, w) at time t contains all edges with time-stamps in (t − w, t] • problem: given node u, window length w, and distance r, how many nodes in G(t, w) are within distance r from u at time t?
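As a reference point for the problem statement, the query can be answered naively by rebuilding the snapshot G(t, w) and running a BFS. This is a baseline sketch, not the streaming algorithm of the talk; the function name, the treatment of interactions as undirected, and the choice to count u itself are assumptions.

```python
from collections import defaultdict, deque


def nodes_within(edge_stream, u, t, w, r):
    """Count nodes of G(t, w) at distance <= r from u (u itself included).

    edge_stream: iterable of (a, b, timestamp) triples.
    """
    adj = defaultdict(set)
    for a, b, ts in edge_stream:
        if t - w < ts <= t:              # keep edges with time-stamp in (t - w, t]
            adj[a].add(b)
            adj[b].add(a)
    dist = {u: 0}
    queue = deque([u])
    while queue:
        x = queue.popleft()
        if dist[x] == r:                 # do not expand beyond distance r
            continue
        for y in adj[x]:
            if y not in dist:
                dist[y] = dist[x] + 1
                queue.append(y)
    return len(dist)
```

Recomputing this for every node and every time step is the cost that an incremental algorithm is meant to avoid.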

  16. example [figure: an edge-labeled network on nodes a, b, c, d, e and its snapshot graphs G 3, G 4, G 5] a toy example, 3 snapshot graphs with a window size of 3

  17. proposed online algorithms 1. an exact but memory-inefficient streaming algorithm 2. an approximate memory-efficient streaming algorithm – the approximate algorithm uses the logic of the exact algorithm, combined with HyperLogLog sketches

  18. horizons • path horizon: time-stamp of the oldest edge on the path • h(u, v, i): the horizon for length i between nodes u and v, i.e., the maximum horizon of any path of length at most i between them
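The definition can be turned into a small dynamic program: a path of length at most i to a fixed target is either a path of length at most i − 1, or one edge followed by such a path, and its horizon is the minimum time-stamp along the way. The sketch below recomputes all horizons from scratch for undirected edges; the function name and input format are assumptions, and it is not the incremental update procedure of the paper.

```python
import math


def horizons(edges, target, max_len):
    """Return a list h where h[i][v] is the horizon h(v, target, i).

    edges: list of undirected (a, b, timestamp) triples.
    h[i][v] is -inf when no path of length <= i exists, and +inf for v == target.
    """
    nodes = {x for a, b, _ in edges for x in (a, b)} | {target}
    h = [{v: (math.inf if v == target else -math.inf) for v in nodes}]
    for _ in range(max_len):
        prev = h[-1]
        cur = dict(prev)                 # paths of length <= i - 1 still count
        for a, b, ts in edges:
            # Prepend edge (a, b) to the best shorter path from b, and vice versa;
            # the oldest edge on the extended path determines its horizon.
            cur[a] = max(cur[a], min(ts, prev[b]))
            cur[b] = max(cur[b], min(ts, prev[a]))
        h.append(cur)
    return h
```

For example, with edges a–b at time 1, a–c at time 5 and c–b at time 4, the direct path gives h(a, b, 1) = 1, while the two-hop path through c raises h(a, b, 2) to 4.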

  19. example [figure: two snapshot graphs on nodes a, b, c, d, e with time-stamped edges; each node u is annotated with the five values h(u, b, 0), ..., h(u, b, 4)] two snapshot graphs along with h(u, b, i) for i = 0, ..., 4

  20. neighborhood summaries • observation: if for a node u we know all horizons h(u, v, i), for all distances i and all nodes v, we can give the complete neighborhood profile of u for any window length • neighborhood summary: $S^u_t = (S^u_t[0], \ldots, S^u_t[r])$ where $S^u_t[i] = \{ (v, h_t(u, v, i)) \mid h_t(u, v, i) > -\infty \}$
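The observation can be illustrated directly: a node v is within distance i of u in G(t, w) exactly when some path of length at most i uses only edges newer than t − w, that is, when h(u, v, i) > t − w. A minimal sketch, assuming the summary is stored as a list of dictionaries (a hypothetical layout, not taken from the paper):

```python
def profile_from_summary(summary, t, w):
    """Return [n_0, ..., n_r], where n_i counts nodes within distance i of u.

    summary[i] maps node v -> h_t(u, v, i); nodes with horizon -inf are absent.
    u itself is stored with horizon +inf, so it is always counted.
    """
    threshold = t - w                    # window covers time-stamps in (t - w, t]
    return [sum(1 for hz in level.values() if hz > threshold) for level in summary]
```

The same summary answers the query for every window length w, which is why the horizons are sufficient.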

  21. updating neighborhood summaries • edge deletion : simply delete entries from summaries • edge addition : a change in summary at distance i for a node u will introduce a change in the summary of its neighbors at distance i + 1 – updates propagate in a BFS fashion

  22. exact algorithm • update time: $O(r m n \log n)$ • space complexity: $O(r n^2)$ – where r is an upper bound on the maximum distance • quadratic dependence is not acceptable for large graphs – hence the approximation algorithm

  23. approximate algorithm • sliding HyperLogLog sketch: extension of HyperLogLog to maintain a distinct-element counter over a sliding window • if number of buckets in the HLL counter is k then the worst case complexity changes to – update time: $O(r m 2^k \log\log n)$, from $O(r m n \log n)$ – space complexity: $O(r n 2^k \log\log n)$, from $O(r n^2)$

  24. empirical evaluation: quality
      dataset     nodes     distinct edges   total edges   clus. coef.   diam.   eff. diam.   avg. rel. error (k=7)
      Facebook      4 039           88 234        88 234          0.60       8          4.7                    0.08
      Cit-HepTh    27 771          352 801       352 801          0.31      13          5.3                    0.10
      Higgs       166 840          249 030       500 000          0.19      10          4.7                    0.14
      DBLP        192 357          400 000       800 000          0.63      21          8.0                    0.09

  25. empirical evaluation: running time [plots: running time (sec) against number of edges (in thousands), one curve each for k = 4, 5, 6, 7; panels (c) Higgs and (d) DBLP] contrast (DBLP) – offline HyperANF: 3.6 sec / sliding window – proposed approach: 0.003 sec / sliding window

  26. tracking important nodes: temporal PageRank • P. Rozenshtein and A. Gionis, ECML PKDD 2016

  27. PageRank • classic approach for measuring node importance • listed in the top-10 most important data-mining algorithms [Wu et al., 2008] • numerous applications – ranking web pages – trust and distrust computation – finding experts in social networks – . . .

  28. PageRank • PageRank defined as the stationary distribution of a random walk in the graph • inherently a static process • however, many modern networks can be viewed as a sequence (stream) of edges – temporal network: G = (V, E), with E = {(u, v, t)} – examples: Twitter, Instagram, IMs, email, ... • what is an appropriate PageRank definition for temporal networks?

  29. temporal networks • network nodes interact with each other (e.g., a “like”, a repost, or sending a message to each other) [figure: interactions between nodes along a time axis]

  30. motivating example [figure: three networks on nodes a to h; panel (a) static network, panels (b) and (c) temporal networks with time-stamped edges]

  31. research questions and objectives • extend PageRank to incorporate temporal information and network dynamics • adapt PageRank to reflect changes in network dynamics and node importance • estimate importance of a node u at any given time t

  32. dynamic PageRank vs. temporal PageRank • extensive work on dynamic PageRank • dynamic PageRank computation : – maintain correct PageRank during network updates – e.g., edge additions / deletions • computation should return the static PageRank at a given network snapshot • for edges present in a snapshot, order does not matter

  33. static PageRank • graph G = (V, E) • corresponding row-stochastic matrix $P \in \mathbb{R}^{n \times n}$ • personalization vector $h \in \mathbb{R}^{n}$ • PageRank is the stationary distribution of a random walk with restart probability $(1 - \alpha)$: $\pi(u) = \sum_{v \in V} \sum_{k=0}^{\infty} (1 - \alpha)\, \alpha^{k} \sum_{z \in \mathcal{Z}(v,u),\ |z| = k} h(v) \Pr[z \mid v]$, where $\mathcal{Z}(v, u)$ is the set of all paths from v to u and $\Pr[z \mid v] = \prod_{(i,j) \in z} P(i, j)$
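The path-sum formula is the series expansion of the fixed point π = (1 − α) h + α π P, so it can be evaluated by power iteration. The following is a plain-Python sketch under stated assumptions: the function name and adjacency-list input are hypothetical, every node is assumed to have at least one out-edge (no dangling-node handling), and P(v, u) = 1 / outdeg(v).

```python
def pagerank(out_edges, h, alpha=0.85, iters=100):
    """Approximate static PageRank by power iteration.

    out_edges: {node: [successors]}, every node a key with >= 1 successor.
    h: personalization vector as {node: weight}, weights summing to 1.
    """
    pi = dict(h)                                   # start the walk at h
    for _ in range(iters):
        # Restart mass: with probability (1 - alpha) the walk jumps back to h.
        nxt = {u: (1 - alpha) * h[u] for u in out_edges}
        for v, succ in out_edges.items():
            share = alpha * pi[v] / len(succ)      # alpha * pi(v) * P(v, u)
            for u in succ:
                nxt[u] += share
        pi = nxt
    return pi
```

After K iterations the result contains the terms k < K of the sum exactly, plus a remainder of total weight α^K, which is why a modest number of iterations is enough for α = 0.85.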

  34. temporal PageRank • make a random walk only on temporal paths – e.g., time-respecting paths – time-stamps increase along the path [figure: temporal network on nodes a to h with time-stamped edges] – c → b → a → c: time respecting – a → c → b → a: not time respecting
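Whether a node sequence is time respecting can be checked greedily: for each hop, take the earliest interaction that is later than the one used for the previous hop. The sketch assumes strictly increasing time-stamps and directed interactions stored in a dictionary; the function name and the time-stamps in the usage note are hypothetical and are not read off the slide's figure.

```python
def is_time_respecting(path, edge_times):
    """Check whether the node sequence `path` can be walked with increasing times.

    edge_times: {(u, v): [timestamps]} for directed interactions u -> v.
    """
    last = float("-inf")
    for u, v in zip(path, path[1:]):
        later = [ts for ts in edge_times.get((u, v), []) if ts > last]
        if not later:
            return False                 # no interaction u -> v after the previous hop
        last = min(later)                # earliest usable time-stamp keeps most options
    return True
```

With interactions c → b at time 1, b → a at time 2 and a → c at time 3, the sequence c, b, a, c is accepted, while a, c, b, a is rejected because c → b occurs before a → c.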
