

SLIDE 1

Towards Graph (Re-)Compression - Stefan Böttcher - University of Paderborn

Towards Graph (Re-)Compression

Design decisions and first results

Stefan Böttcher University of Paderborn

SLIDE 2

Why re-compression of a compressed graph?

large graphs ⇒ “long time” to find a “good” compression

idea: instead,
  • do any compression “fast” and in parallel on small sub-graphs ⇒ get compressed sub-graphs “fast”
  • re-compress compressed sub-graphs ⇒ re-compression time depends on size of compressed sub-graph

SLIDE 3

Overview of steps towards re-compressed graphs

[Diagram: compression and re-compression for strings, ordered trees, unordered trees and graphs]

compression: string compression, ordered tree compression, unordered tree compression, graph compression

re-compression: string re-compression, ordered tree re-compression, unordered tree re-compression, graph re-compression

SLIDE 4

Why digram-based compression?

S → b c d e c d e c d
S → b N e N e N,  N → c d
S → b N M M,  M → e N

  • replacing digram occurrences uses a “look for smallest repeated pattern first” approach
  • substitute larger frequently occurring patterns in multiple steps

SLIDE 5

(Re-)Compression by replacing a most frequent digram

S → b c d b c d
S → b N b N,  N → c d
S → M M,  M → b N,  N → c d
S → M M,  M → b c d

(Re-)Compression algorithm for strings / trees / graphs:

while at least one digram occurs more than once
    choose a most frequent digram D (e.g. c d)
    (if re-compression: isolate all occurrences of D by smart inlining)
    replace each occurrence of digram D by a new nonterminal N, which is thereafter treated as a terminal, i.e. not cut off again
    introduce a grammar rule (e.g. N → c d)
inline rules called only once (e.g. N → c d)
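The loop above can be sketched for plain strings as follows. This is a minimal illustration, not the authors' implementation: the function names (`count_digrams`, `replace_digram`, `inline_single_use`, `repair`, `expand`) and the fresh nonterminal names `N1`, `N2`, ... are assumptions made for this example, and the input is assumed not to contain those names as terminals.

```python
from collections import Counter

def count_digrams(seq):
    """Count non-overlapping occurrences of each adjacent pair, scanning left to right."""
    counts, last_pos = Counter(), {}
    for i in range(len(seq) - 1):
        d = (seq[i], seq[i + 1])
        if last_pos.get(d) != i - 1:  # skip overlaps such as the middle pair in "a a a"
            counts[d] += 1
            last_pos[d] = i
    return counts

def replace_digram(seq, digram, nonterminal):
    """Replace every non-overlapping occurrence of digram by nonterminal."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == digram:
            out.append(nonterminal)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

def inline_single_use(start, rules):
    """Inline (in place) every rule that is called exactly once."""
    changed = True
    while changed:
        changed = False
        bodies = [start, *rules.values()]
        uses = Counter(s for body in bodies for s in body if s in rules)
        for name in list(rules):
            if uses[name] == 1:
                body = rules.pop(name)
                for target in [start, *rules.values()]:
                    if name in target:
                        k = target.index(name)
                        target[k:k + 1] = body
                changed = True
                break
    return rules

def repair(text):
    """Digram-based compression of a string into a start rule plus grammar rules."""
    start, rules, fresh = list(text), {}, 0
    while True:
        counts = count_digrams(start)
        if not counts:
            break
        digram, freq = counts.most_common(1)[0]
        if freq < 2:
            break
        fresh += 1
        name = f"N{fresh}"  # hypothetical naming scheme for new nonterminals
        rules[name] = list(digram)
        start = replace_digram(start, digram, name)
    rules = inline_single_use(start, rules)
    return start, rules

def expand(seq, rules):
    """Decompress a sequence by expanding all nonterminals."""
    out = []
    for s in seq:
        out.extend(expand(rules[s], rules) if s in rules else [s])
    return out
```

For the input `"bcdbcd"`, `repair` ends with a start rule of two identical nonterminals and one remaining rule whose body is `b c d`, which corresponds to `S → M M, M → b c d` up to the choice of names.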

SLIDE 6

Digrams for strings and for trees

A digram is a pair of typed items (c,d) in a given relationship r.

String: b c d e c d e c d
    digram (c,d) with r is “d follows c”

Tree: [Figure: ordered tree with node labels b, c, d, e and a rule for N with parameter y1]
    digram (c,d) with r is “d is the second child of c”

Unordered tree: [Figure: unordered tree with node labels b, c, d, e]
    digram (c,d) with r is “d is a child of c”
    edge order does not matter, like in graphs
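A small sketch of how the three relationships can be turned into countable keys. The tree encoding `(label, [children])`, the function names and the example tree are assumptions for illustration only; the counts are raw occurrence counts and do not yet exclude overlapping occurrences.

```python
from collections import Counter

def string_digrams(seq):
    """Digram (c, d) with r = 'd follows c'."""
    return Counter(zip(seq, seq[1:]))

def tree_digrams(tree, ordered=True):
    """Count parent/child digrams in a tree given as (label, [children]).

    ordered=True:  key (parent, i, child) means 'child is the i-th child of parent'
    ordered=False: key (parent, child) means 'child is a child of parent'
    """
    counts = Counter()
    stack = [tree]
    while stack:
        label, children = stack.pop()
        for i, child in enumerate(children, start=1):
            key = (label, i, child[0]) if ordered else (label, child[0])
            counts[key] += 1
            stack.append(child)
    return counts

# Hypothetical example tree: two c-nodes whose d-child sits at different positions.
example = ("c", [("b", []), ("d", [("c", [("d", []), ("e", [])])])])
```

With this example, the ordered count distinguishes `('c', 2, 'd')` from `('c', 1, 'd')`, while the unordered count merges both into `('c', 'd')` with frequency 2.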

SLIDE 8

Digrams for a graph with labeled nodes and labeled edges

A digram is a pair of typed items (c,d) in a given relationship r.

Graph: [Figure: example graph with labels f, b, c, d, e]

  • digram (f,b) with r is “nodes f and b are connected by a hyperedge from f to b”
  • digram (d,e) with r is “there is a node shared by an incoming hyperedge d and an outgoing hyperedge e”
  • digram (b,e) with r is “node b has an outgoing hyperedge e”
  • digram (d,b) with r is “node b has an incoming hyperedge d”

SLIDE 12

Re-compression of a compressed string / tree / graph

A string / tree / graph
    S → d c d c d c
that has been compressed to
    S → d N N c,  N → c d
can be recompressed to
    S → M M M,  M → d c
to get a better compression.

SLIDE 13

Re-compress a compressed string: 1. Count digrams

S → d N N c
N → c d

digram generator    generated digram
d N                 d c
N                   c d (occurs twice)
N N                 d c
N c                 d c

⇒ (d,c) with r = “c follows d” is the most frequent digram in the decompressed string
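The counting step can be done on the grammar itself, without decompressing. The sketch below is one way to do this for straight-line string grammars, under assumptions not taken from the slides: rules are a dict from nonterminal to a non-empty list of symbols, the grammar is acyclic, and the function name `grammar_digram_counts` is made up for this example. Each adjacent pair in a rule body generates the digram (last terminal of the left symbol, first terminal of the right symbol), weighted by how often the rule is called. It counts all occurrences, so overlapping occurrences of digrams like (a,a) would still need separate handling.

```python
from collections import Counter

def grammar_digram_counts(rules, start="S"):
    """Count digrams of the generated string directly on a straight-line grammar."""
    def first(sym):  # first terminal generated by sym
        while sym in rules:
            sym = rules[sym][0]
        return sym

    def last(sym):  # last terminal generated by sym
        while sym in rules:
            sym = rules[sym][-1]
        return sym

    # Post-order DFS; the reversed order lists callers before callees.
    order, seen = [], set()
    def visit(nt):
        if nt in seen:
            return
        seen.add(nt)
        for s in rules[nt]:
            if s in rules:
                visit(s)
        order.append(nt)
    visit(start)

    # calls[X] = number of times rule X is expanded when deriving from start
    calls = Counter({start: 1})
    for nt in reversed(order):
        for s in rules[nt]:
            if s in rules:
                calls[s] += calls[nt]

    counts = Counter()
    for nt in order:
        body = rules[nt]
        for left, right in zip(body, body[1:]):
            counts[(last(left), first(right))] += calls[nt]
    return counts

counts = grammar_digram_counts({"S": ["d", "N", "N", "c"], "N": ["c", "d"]})
```

For the grammar above, `counts` holds 3 for `('d', 'c')` and 2 for `('c', 'd')`, matching the table: the pair inside rule N is counted twice because N is called twice.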

SLIDE 14

2. Isolate a most frequent digram by smart inlining

Task: isolate most frequent digram (d,c) with r = “c follows d”
    S → d c N c N c
    N → c e f g d
needed: partial decompression of N to isolate d from N
new rules that isolate d from the end of N:
    N → N-d d
    N-d → c e f g
    S → d c N-d d c N-d d c
trick: inline rewritten rule N → N-d d instead of N → c e f g d
finally, substitute digrams (d,c) with new nonterminal M:
    S → M N-d M N-d M
    M → d c
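The isolation trick can be sketched for string grammars like this. The helper names `isolate_last` and `replace_digram` and the dict-of-lists grammar encoding are assumptions for illustration; the sketch only handles the case shown on the slide, where the terminal to isolate is the last symbol of a rule body with at least two symbols.

```python
def isolate_last(rules, nt):
    """Split the last terminal x off rule nt and inline 'nt -> nt-x x' into all callers.

    The long body of nt is never copied: it lives on in the new rule 'nt-x'.
    """
    body = rules[nt]
    x = body[-1]
    rest_name = f"{nt}-{x}"  # e.g. "N-d", following the slide's naming
    new_rules = {}
    for name, rhs in rules.items():
        if name == nt:
            continue
        out = []
        for s in rhs:
            out.extend([rest_name, x] if s == nt else [s])
        new_rules[name] = out
    new_rules[rest_name] = body[:-1]
    return new_rules

def replace_digram(seq, digram, nonterminal):
    """Replace every non-overlapping occurrence of digram by nonterminal."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == digram:
            out.append(nonterminal)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

rules = {"S": ["d", "c", "N", "c", "N", "c"], "N": ["c", "e", "f", "g", "d"]}
rules = isolate_last(rules, "N")
rules["S"] = replace_digram(rules["S"], ("d", "c"), "M")
rules["M"] = ["d", "c"]
```

After these steps `rules["S"]` is `['M', 'N-d', 'M', 'N-d', 'M']` and `rules["N-d"]` is `['c', 'e', 'f', 'g']`, as in the example above.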

SLIDE 15

Re-compress a compressed ordered tree: 1. Count digrams

How to count all digrams generated by tree grammars?

[Figure: SLT grammar rules for A, C and D (the rule for A calls C with arguments b and D; A, C, D may be called several times), and the tree generated by A]

parent node (C) does not determine a digram, but child (D) does:

each non-root non-parameter node (e.g. D) in the RHS of each rule of an SLT grammar represents (a child of) a digram

⇒ count calls of rule A for the digram represented by child node D
⇒ O( size(G) )

[ICDE2016]

SLIDE 17

2. Smarter inlining needed for ordered tree grammars

to isolate a digram:
  • isolate root terminal of tree generated by D
  • isolate parent of 2nd parameter of tree generated by C

[Figure: rules for A, C (root r, parameters y1, y2) and D (root h), and the tree generated by A]

needs smarter inlining:

[Figure: rewritten rules that introduce C-r and C-e to isolate the parent of the 2nd parameter]

SLIDE 18

Tree grammar re-compression: compression ratio

document        #edges    compression ratio    compression ratio with max blow-up
NCBI                39          0 %                   0 %
EXI-Weblog          39          0.05 %                0.09 %
EXI-Telecomp        71          0.06 %                0.11 %
Medline          13096          4.71 %                4.89 %
XMark            34649          7.94 %               11.38 %
Treebank         52266         20.67 %               21.26 %

[Chart: max |intermediate grammar| / |final grammar| per document, axis from 0% to 200%]

document generated from seed by 5000 updates
  • re-compression after every 100 updates: blow-ups of a factor of 5 at most
  • without re-compression: blow-up up to a factor of 400

smarter inlining yields intermediate blow-ups of factor 2 at most

SLIDE 19

Tree grammar re-compression: runtime

Here: recompress after renaming 300 nodes to fresh labels

document         #edges    d+c runtime
EXI-Weblog        93434       2.8 s
XMark            167864       6.2 s
EXI-Telecomp     177633       5 s
Treebank        2437665      71.5 s
Medline         2866079     241.4 s
NCBI            3642224     293 s

[Chart: runtime of GrammarRePair / runtime of decom. + comp., axis from 0.00 to 1.00; series: GrammarRePair applied to grammar, decompress, GrammarRePair applied to tree]

  • for large files: re-compression applied to the grammar (in Java) is even faster than pure decompression of TreeRePair (written in C)
  • re-compression is faster than decompression + compression (d.+c.), even than (d.+c.) of TreeRePair (in C), except for the smallest file

SLIDE 20

Why trees with commutative labels?

because order does not always matter, e.g.
    *( a, b, c ) = *( b, c, a )
    name( first: peter, last: smith ) = name( last: smith, first: peter )
    order( item1, item2, item1 ) = order( item2, item1, item1 )

using the property that some node labels are commutative, i.e. the order of their children does not matter, may lead to better (re-)compression results (up to 1.42% → 0.15%)

evaluation: next slide

SLIDE 21

(Re-)Compression ratio of trees with commutative labels

SLIDE 22

Compress and re-compress unordered trees: 1. Count digrams

[Figure: order* node with children i1, i2, i2, i1, i1]

use the property that the order* node labels are commutative, i.e. the order of their children does not matter

Problem: avoid considering exponentially many sibling orders of children of order* node

Solution: count number of repetitions
    3x i1 and 2x i2 ⇒ min(3,2) i1-i2 digram occurrences
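The counting rule can be sketched on the multiset of child labels, so no sibling order is ever enumerated. The function name is hypothetical, and the treatment of a digram of two equal labels (count // 2) is an assumption of this sketch, not something stated on the slide. The counts are per digram; they are not simultaneously realizable for different digrams sharing the same children.

```python
from collections import Counter
from itertools import combinations

def commutative_sibling_digrams(child_labels):
    """Digram occurrences among the children of one commutative node."""
    mult = Counter(child_labels)  # label -> number of children with that label
    counts = {}
    for a, b in combinations(sorted(mult), 2):
        counts[(a, b)] = min(mult[a], mult[b])
    for a, m in mult.items():
        if m >= 2:
            counts[(a, a)] = m // 2  # assumption: pair up equal labels
    return counts

counts = commutative_sibling_digrams(["i1", "i2", "i2", "i1", "i1"])
```

For the children of the order* node above, `counts[('i1', 'i2')]` is `min(3, 2) = 2`.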

SLIDE 23

Re-Compress unordered trees: 2. Digram Isolation

1. reorder siblings
2. use digram isolation technique for ordered trees

We have implemented re-compression of partially ordered trees, i.e., some node labels (e.g. ‘+’) can be declared to be commutative, while others (e.g. ‘-’) are not
⇒ combined techniques to count digrams
⇒ combined techniques for digram isolation

SLIDE 24

Digrams for a graph with labeled nodes and labeled edges

A digram is a pair of typed items (c,d) in a given relationship r.

Graph: [Figure: example graph with labels f, b, c, d, e]

  • digram (f,b) with r is “nodes f and b are connected by a hyperedge from f to b”
  • digram (d,e) with r is “there is a node shared by an incoming hyperedge d and an outgoing hyperedge e”
  • digram (b,e) with r is “node b has an outgoing hyperedge e”
  • digram (d,b) with r is “node b has an incoming hyperedge d”
  • or non-connected nodes [BICOD 2015] and/or edges

SLIDE 25

Graph transformation

[Figure: Graph G (nodes a, b, c and hyperedge h) and Transformed Graph T (nodes a, b, c, h)]

edges e and hyperedges h of G are transformed into nodes of T
⇒ − more nodes
   + easier graph structure (no hyperedges, no edge labels)
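A minimal sketch of such a transformation, under an assumed encoding that the slides do not specify: G is given as a dict of node labels plus a list of hyperedges `(label, sources, targets)`, and each hyperedge becomes a fresh node `edge0`, `edge1`, ... in T with plain unlabeled edges to its attached nodes. The function name and the way direction is encoded are illustrative choices only.

```python
def transform_graph(node_labels, hyperedges):
    """Turn every (hyper)edge of G into a labeled node of T.

    node_labels: dict node id -> label
    hyperedges:  list of (label, source node ids, target node ids)
    Returns (labels of T, list of unlabeled binary edges of T).
    Assumes no node id of G already looks like 'edge<k>'.
    """
    t_labels = dict(node_labels)
    t_edges = []
    for k, (label, sources, targets) in enumerate(hyperedges):
        edge_node = f"edge{k}"
        t_labels[edge_node] = label  # the edge label becomes a node label
        t_edges += [(s, edge_node) for s in sources]
        t_edges += [(edge_node, t) for t in targets]
    return t_labels, t_edges

# Hyperedge h attached to nodes a, b, c (direction chosen arbitrarily here).
t_labels, t_edges = transform_graph(
    {"a": "a", "b": "b", "c": "c"},
    [("h", ["a"], ["b", "c"])],
)
```

T then has four nodes and three unlabeled edges, so digram search only has to deal with labeled nodes and binary edges.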

SLIDE 26

Graph (re-)compression algorithm

given: graph G

T = transform graph ( G )
while at least one digram occurs more than once in T
    choose a most frequent digram D (e.g. c d)
    (if re-compression: isolate all occurrences of D by smart inlining)
    replace each occurrence of digram D by a new nonterminal N, which is thereafter treated as a terminal
    introduce a grammar rule, e.g., N → c d
inline rules called only once
(optional step:) transform back resulting graph

SLIDE 27

1. Count digrams in transformed graph

digram type 1: two nodes of T connected by an edge in T
digram type 2: two nodes of T connected by two edges in T

these simulate multiple digram types in G,
  • e.g. digram consisting of two edges sharing a common node
  • e.g. digram consisting of two nodes sharing a common hyperedge
(in principle more digram types possible [BICOD 2015])

counting non-overlapping digrams is more difficult for graphs
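To illustrate why non-overlapping counting is harder in graphs, here is a sketch for type 1 digrams only. It uses a greedy choice of occurrences, which is a swapped-in simplification and not necessarily the method used by the authors: greedy selection yields a lower bound that depends on the order in which edges are visited. The function name and the input encoding (label dict plus edge list) are assumptions.

```python
from collections import Counter, defaultdict

def count_type1_digrams(labels, edges):
    """Greedily count non-overlapping occurrences of (label[u], label[v]) over edges (u, v)."""
    by_digram = defaultdict(list)
    for u, v in edges:
        by_digram[(labels[u], labels[v])].append((u, v))
    counts = Counter()
    for digram, occurrences in by_digram.items():
        used = set()  # nodes already consumed by an occurrence of this digram
        for u, v in occurrences:
            if u != v and u not in used and v not in used:
                used.update((u, v))
                counts[digram] += 1
    return counts

labels = {1: "c", 2: "d", 3: "c", 4: "d"}
good_order = count_type1_digrams(labels, [(1, 2), (3, 2), (3, 4)])
bad_order = count_type1_digrams(labels, [(3, 2), (1, 2), (3, 4)])
```

With the first edge order the greedy pass finds 2 non-overlapping (c,d) occurrences, with the second only 1, although the graph is the same. In a string, a left-to-right scan does not have this problem.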

SLIDE 28

2. Smart inlining and digram replacement

smart inlining: similar to isolation of the parent of a parameter within trees

digram replacement: like for commutative nodes, allow digram replacement independent of child order, e.g., for N → c d
    +( c, d, e ) ⇒ +( N, e )
    +( d, e, c ) ⇒ +( N, e )
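Order-independent replacement can be sketched on a list of child labels. The function name is hypothetical, children are simplified to bare labels without subtrees, and placing the new nonterminals first is an arbitrary choice, since the order of children does not matter for a commutative node.

```python
def replace_in_commutative(children, digram, nonterminal):
    """Replace digram (a, b) among the children of a commutative node, ignoring child order."""
    a, b = digram
    rest = list(children)
    replaced = []
    while a in rest and b in rest and (a != b or rest.count(a) >= 2):
        rest.remove(a)  # remove one occurrence of each digram member
        rest.remove(b)
        replaced.append(nonterminal)
    return replaced + rest

first = replace_in_commutative(["c", "d", "e"], ("c", "d"), "N")
second = replace_in_commutative(["d", "e", "c"], ("c", "d"), "N")
```

Both calls return `['N', 'e']`, matching the two `+( … ) ⇒ +( N, e )` examples.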

SLIDE 29

First results

compresses some graphs to compressed grammars
  • not achievable by node-sharing digrams only, and
  • not achievable by edge-sharing digrams only, and
  • with (#nodes,#edges) < (#nodes,#edges) of grammars generated by node-sharing only, and
  • with (#nodes,#edges) < (#nodes,#edges) of grammars generated by edge-sharing only

SLIDE 30

Overview of steps towards re-compressed graphs

[Diagram: compression and re-compression for strings, ordered trees, unordered trees and graphs]

compression: string compression, ordered tree compression, unordered tree compression, graph compression

re-compression: string re-compression, ordered tree re-compression, unordered tree re-compression, graph re-compression

  • re-compression: digram counting, smart decompression
  • ordered trees: SLT grammars
  • unordered trees: add commutativity
  • graphs: node-node, edge-edge & node-edge digrams