
SEG5010 presentation: Graph Compression and Summarization. Wei Zhang, Dept. of Information Engineering, The Chinese University of Hong Kong. Most of the slides are borrowed from the authors' original presentation.


  1. SEG5010 presentation: Graph Compression and Summarization. Wei Zhang, Dept. of Information Engineering, The Chinese University of Hong Kong

  2. Most of the slides are borrowed from the authors' original presentations.
     - http://www.cs.umd.edu/~saket/pubs/sigmod2008.ppt
     - http://videolectures.net/kdd09_kumar_ocsn/

  3. Graph Summarization with Bounded Error
     - Saket Navlakha (UMCP)
     - Rajeev Rastogi (Yahoo! Labs, India)
     - Nisheeth Shrivastava (Bell Labs India)

  4. Large Graphs
     - Many interactions can be represented as graphs
       - Webgraphs: search engine, etc.
       - Netflow graphs (which IPs talk to each other): traffic patterns, security, worm attacks
       - Social (friendship) networks: mine user communities, viral marketing
       - Email exchanges: security, virus spread, spam detection
       - Market basket data: customer profiles, targeted advertising
     - Need to compress, understand
       - Webgraph ~ 50 billion edges; social networks ~ few million, growing quickly
       - Compression reduces size to one-tenth (webgraphs)
     - [Figure: example graphs labeled Netflow, Social Networks, and email; node labels include yahoo.com, cnn.com, jokes.com, 10.1.1.1, 20.20.2.2, and A through G]

  5. Our Approach
     - Graph compression (reference encoding)
       - Not applicable to all graphs: uses URLs, node labels for compression
       - Resulting structure is hard to visualize/interpret
     - Graph clustering
       - Nice summary, works for generic graphs
       - No compression: needs the same memory to store the graph itself
     - Our MDL-based representation R = (S,C)
       - S is a high-level summary graph: compact, highlights dominant trends, easy to visualize
       - C is a set of edge corrections: help in reconstructing the graph
       - Compression based on the MDL principle: minimize cost of S+C; information-theoretic approach; parameter-less; applicable to any graph
     - Novel approximate representation: reconstructs the graph with bounded error (ε); results in better compression

  6. How Do We Compress?
     - Compression possible (S)
       - Many nodes with similar neighborhoods
         - Communities in social networks; link-copying in webpages
       - Collapse such nodes into supernodes (clusters) and the edges into superedges
         - Bipartite subgraph to two supernodes and a superedge
         - Clique to supernode with a "self-edge"
     - [Figure: graph on nodes a, b, c and d, e, f, g; summary with supernodes X = {d,e,f,g} and Y = {a,b,c}]
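As a rough illustration of "nodes with similar neighborhoods", the sketch below groups nodes whose neighbor sets are exactly identical. This is a deliberate simplification and not the authors' algorithm (the outline names GREEDY and RANDOMIZED for that); the function name `group_by_neighborhood` and the complete bipartite input are assumptions made for the example.

```python
from collections import defaultdict

def group_by_neighborhood(edges):
    """Group nodes of an undirected graph whose neighbor sets are identical.

    Simplifying assumption: only *exact* matches are merged, whereas the
    slides talk about *similar* neighborhoods.
    """
    nbrs = defaultdict(set)
    for u, v in edges:
        nbrs[u].add(v)
        nbrs[v].add(u)
    groups = defaultdict(list)
    for node, neighbor_set in nbrs.items():
        groups[frozenset(neighbor_set)].append(node)
    return sorted(sorted(group) for group in groups.values())

# Hypothetical input: complete bipartite graph between {a,b,c} and {d,e,f,g}.
edges = [(y, x) for y in "abc" for x in "defg"]
supernodes = group_by_neighborhood(edges)
print(supernodes)
```

Exact matching recovers the two sides of a complete bipartite subgraph, but it would not detect a clique, because each clique member's neighbor set excludes the member itself.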

  7. How Do We Compress?
     - Compression possible (S)
       - Many nodes with similar neighborhoods
         - Communities in social networks; link-copying in webpages
       - Collapse such nodes into supernodes (clusters) and the edges into superedges
         - Bipartite subgraph to two supernodes and a superedge
         - Clique to supernode with a "self-edge"
     - Need to correct mistakes (C)
       - Most superedges are not complete
         - Nodes don't have exact same neighbors: friends in social networks
       - Remember edge-corrections
         - Edges not present in superedges (-ve corrections)
         - Extra edges not counted in superedges (+ve corrections)
     - Minimize overall storage cost = S+C
     - [Figure: original graph on nodes a through j, cost = 14 edges; summary with X = {d,e,f,g}, Y = {a,b,c} and corrections +(a,h), +(c,i), +(c,j), -(a,d), cost = 5 (1 superedge + 4 corrections)]

  8. Representation Structure R=(S,C)
     - Summary S(V_S, E_S)
       - Each supernode v represents a set of nodes A_v
       - Each superedge (u,v) represents all possible edges between A_u and A_v: π_uv = A_u × A_v
     - Corrections C: {(a,b); a and b are nodes of G}
     - Supernodes are key, superedges/corrections easy
       - E_uv: actual edges of G between A_u and A_v
       - Cost with (u,v) = 1 + |π_uv − E_uv|
       - Cost without (u,v) = |E_uv|
       - Choose the minimum; this decides whether edge (u,v) is in S
     - [Figure: graph on nodes a through j; summary with X = {d,e,f,g}, Y = {a,b,c}, nodes h, i, j, and C = {+(a,h), +(c,i), +(c,j), -(a,d)}]
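The cost rule above can be sketched for a fixed partition into supernodes. The edge list below is an assumption reconstructed from the slide's summary (every X-Y pair except (a,d), plus (a,h), (c,i), (c,j)); the function name `summarize`, the tie-breaking choice, and the omission of self-superedges are also assumptions made to keep the example short.

```python
from itertools import combinations, product

def summarize(edges, supernodes):
    """For each pair of distinct supernodes, keep a superedge only if
    1 + |pi_uv - E_uv| is smaller than |E_uv|; otherwise store E_uv as
    +ve corrections. Ties go to "no superedge" (an assumption)."""
    E = {frozenset(e) for e in edges}
    S, C = [], []
    for u, v in combinations(sorted(supernodes), 2):
        pi_uv = {frozenset(p) for p in product(supernodes[u], supernodes[v])}
        e_uv = pi_uv & E                      # actual edges between A_u and A_v
        cost_with = 1 + len(pi_uv - e_uv)     # superedge plus -ve corrections
        cost_without = len(e_uv)              # every edge as a +ve correction
        if cost_with < cost_without:
            S.append((u, v))
            C += [("-", tuple(sorted(p))) for p in pi_uv - e_uv]
        else:
            C += [("+", tuple(sorted(p))) for p in e_uv]
    return S, sorted(C)

# Assumed example graph: 11 X-Y edges plus 3 extra edges = 14 edges.
edges = [(y, x) for y in "abc" for x in "defg" if (y, x) != ("a", "d")]
edges += [("a", "h"), ("c", "i"), ("c", "j")]
supernodes = {"X": "defg", "Y": "abc", "h": "h", "i": "i", "j": "j"}

S, C = summarize(edges, supernodes)
total_cost = len(S) + len(C)
print(S, C, total_cost)
```

Under these assumptions the sketch yields one superedge and four corrections, matching the cost of 5 quoted on the earlier slide against 14 for the raw edge list.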

  9. Representation Structure R=(S,C)
     - Summary S(V_S, E_S)
       - Each supernode v represents a set of nodes A_v
       - Each superedge (u,v) represents all possible edges between A_u and A_v: π_uv = A_u × A_v
     - Corrections C: {(a,b); a and b are nodes of G}
     - Supernodes are key, superedges/corrections easy
       - E_uv: actual edges of G between A_u and A_v
       - Cost with (u,v) = 1 + |π_uv − E_uv|
       - Cost without (u,v) = |E_uv|
       - Choose the minimum; this decides whether edge (u,v) is in S
     - Reconstructing the graph from R
       - For all superedges (u,v) in S, insert all edges in π_uv
       - For all +ve corrections +(a,b), insert edge (a,b)
       - For all -ve corrections -(a,b), delete edge (a,b)
     - [Figure: graph on nodes a through j; summary with X = {d,e,f,g}, Y = {a,b,c}, nodes h, i, j, and C = {+(a,h), +(c,i), +(c,j), -(a,d)}]
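The three reconstruction steps can be sketched as follows. The data layout (a dict of supernodes, a list of superedges, and corrections as signed pairs) and the function name `reconstruct` are assumptions for illustration; the example values are the X, Y, and C shown on the slide.

```python
from itertools import product

def reconstruct(supernodes, superedges, corrections):
    """Rebuild the undirected edge set of G from R = (S, C)."""
    edges = set()
    # Step 1: expand every superedge (u, v) into all pairs in A_u x A_v.
    for u, v in superedges:
        for a, b in product(supernodes[u], supernodes[v]):
            if a != b:  # a self-superedge expands to a clique, without loops
                edges.add(frozenset((a, b)))
    # Steps 2 and 3: apply +ve corrections (insert) and -ve corrections (delete).
    for sign, (a, b) in corrections:
        if sign == "+":
            edges.add(frozenset((a, b)))
        else:
            edges.discard(frozenset((a, b)))
    return edges

supernodes = {"X": "defg", "Y": "abc"}
superedges = [("X", "Y")]
corrections = [("+", ("a", "h")), ("+", ("c", "i")),
               ("+", ("c", "j")), ("-", ("a", "d"))]

G = reconstruct(supernodes, superedges, corrections)
print(len(G))
```

With the slide's summary, the 12 pairs of the superedge lose (a,d) and gain three edges, which gives the 14 edges quoted as the cost of the original graph.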

  12. Approximate Representation R_ε
     - Approximate representation
       - Recreating the input graph exactly is not always necessary
       - Reasonable approximation enough: to compute communities, anomalous traffic patterns, etc.
       - Use approximation leeway to get further cost reduction
     - Generic neighbor query
       - Given node v, find its neighbors N_v in G
       - Apx-nbr set N'_v estimates N_v with ε-accuracy
       - Bounded error: error(v) = |N'_v − N_v| + |N_v − N'_v| ≤ ε |N_v|
       - Number of neighbors added or deleted is at most an ε-fraction of the true neighbors
     - Intuition for computing R_ε
       - If correction (a,d) is deleted, it adds error for both a and d
       - From the exact representation R for G, remove (maximum) corrections s.t. ε-error guarantees still hold
     - [Figure: graph on nodes a, b and d, e, f, g; summary with X = {d,e,f,g}, Y = {a,b}, C = {-(a,d), -(a,f)}. For ε = .5, we can remove one correction of a]
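The bounded-error condition can be checked per node with a few lines of code. This is a minimal sketch that reads the bound as "at most" (≤); the function names and the sample neighbor sets are hypothetical and are not taken from the slide's figure.

```python
def neighbor_error(true_nbrs, approx_nbrs):
    """error(v) = |N'_v - N_v| + |N_v - N'_v| (neighbors added plus neighbors lost)."""
    true_nbrs, approx_nbrs = set(true_nbrs), set(approx_nbrs)
    return len(approx_nbrs - true_nbrs) + len(true_nbrs - approx_nbrs)

def within_bound(true_nbrs, approx_nbrs, eps):
    """True if the approximate neighbor set satisfies error(v) <= eps * |N_v|."""
    return neighbor_error(true_nbrs, approx_nbrs) <= eps * len(set(true_nbrs))

# Hypothetical node with two true neighbors; dropping one -ve correction
# adds one spurious neighbor to its approximate neighbor set.
true_nbrs = {"e", "g"}
approx_nbrs = {"d", "e", "g"}

err = neighbor_error(true_nbrs, approx_nbrs)
ok_half = within_bound(true_nbrs, approx_nbrs, eps=0.5)
ok_quarter = within_bound(true_nbrs, approx_nbrs, eps=0.25)
print(err, ok_half, ok_quarter)
```

Because dropping a correction (a,b) changes the neighbor sets of both a and b, a check like this would have to pass for both endpoints before the correction can be removed.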

  13. Comparison with Existing Techniques
     - Webgraph compression [Adler-DCC-01]
       - Uses nodes sorted by URLs: not applicable to other graphs
       - More focus on bitwise compression: represents the sequence of neighbors (ids) using the fewest bits
     - Clique stripping [Feder-pods-99]
       - Collapses edges of a complete bipartite subgraph into a single cluster
       - Only compresses very large, complete bi-cliques
     - Representing webgraphs [Raghavan-icde-03]
       - Represents webgraphs as SNodes, Sedges
       - Uses URLs of nodes for compression (not applicable to other graphs)
       - No concept of approximate representation

  14. Outline
     - Compressed graph
       - MDL representation R=(S,C); ε-representation
     - Computing R
       - GREEDY, RANDOMIZED
     - Computing R_ε
       - APX-MDL, APX-GREEDY
     - Experimental results
     - Conclusions and future work
