Index compression and efficient query processing

Index compression and efficient query processing, COMP90042 Lecture 3

Index compression and efficient query processing. COMP90042 Lecture 3, The University of Melbourne, by Matthias Petri, Tue 12/3/2019.


  1. Index compression and efficient query processing. COMP90042 Lecture 3, The University of Melbourne, by Matthias Petri, Tue 12/3/2019

  2. Index compression

  3. Inverted Index - Recap. For each term t, the index stores the document frequency f_t and the postings list for t: the ids of the documents that contain t, paired with the frequency of t in each of those documents. Example terms: in, where, sleep, house, night, the, old, big, and. Example postings lists, written as f_t: ⟨document ids⟩, ⟨frequencies⟩:
       6: ⟨1, 6, 7, 8, 9, 12⟩, ⟨1, 2, 1, 3, 1, 2⟩
       3: ⟨2, 5, 42⟩, ⟨1, 1, 1⟩
       1: ⟨32⟩, ⟨4⟩
       7: ⟨2, 3, 5, 6, 8, 14, 25⟩, ⟨1, 1, 4, 1, 5, 3, 1⟩
       52: ⟨1, 2, 3, 4, 5, 7, 8, 9, ...⟩, ⟨10, 21, 10, 42, 12, 14, 12, 4, ...⟩
       4: ⟨1, 12, 13, 14⟩, ⟨2, 2, 1, 3⟩
       5: ⟨6, 21, 32, 33, 43⟩, ⟨2, 3, 4, 2, 1⟩
       3: ⟨1, 51, 53⟩, ⟨1, 2, 3⟩
       4: ⟨1, 3, 4, 6⟩, ⟨1, 1, 2, 1⟩
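
The structure recapped above can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's implementation: the function name `build_index`, the whitespace tokenisation, and the three tiny documents are all assumptions made for the example.

```python
from collections import Counter, defaultdict

def build_index(docs):
    """Map each term to (document ids, within-document frequencies)."""
    postings = defaultdict(lambda: ([], []))
    # Document ids start at 1 and are visited in increasing order,
    # so every id list comes out sorted.
    for doc_id, text in enumerate(docs, start=1):
        for term, freq in Counter(text.lower().split()).items():
            ids, freqs = postings[term]
            ids.append(doc_id)
            freqs.append(freq)
    return dict(postings)

# Hypothetical toy collection, not the data behind the slide.
docs = [
    "the old house",
    "the big old house and the night",
    "sleep in the house",
]
index = build_index(docs)
ids, freqs = index["the"]
print("f_t =", len(ids), "ids =", ids, "freqs =", freqs)
```

Here f_t is simply the length of the id list, which is why it does not need to be stored separately in this sketch.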

  4. Index Compression - Motivation. The inverted index is mostly stored in RAM (query performance). Companies run 1000s of machines to answer search queries. Space reduction can lead to substantial cost reductions: saving 5% means shutting down 50/1000 machines! Uncompressed storage cost, inverted index size for 420 GB of web data (tiny): Documents: 25 Million; Terms: 35 Million; Postings: 6 Billion ≈ 32 GB.

  5. Index compression. Benefits of index compression: reduce storage requirements; keep larger parts of the index in memory; faster query processing. Example: a state-of-the-art inverted index of 25 million websites (420 GB) requires only 5 GB (1.2%) and can answer queries in ≈ 10 milliseconds. 32 GB → 5 GB makes the index about 6.4 times smaller (an 84% space reduction)!

  6. Compression Principles. Intuition: spend fewer bits on items that occur often. The information content of a text T is characterized by its entropy H: $H(T) = -\sum_{s \in \Sigma} \frac{f_s}{n} \log_2 \frac{f_s}{n}$, where f_s is the frequency of symbol s in T and n is the length of T. Compressibility is bounded by the information content of a data set. For example, H(abracadabra) = 2.040373 bits with n = 11, f_a = 5, f_b = 2, f_c = 1, f_d = 1, f_r = 2.
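
The entropy formula can be checked numerically. The sketch below assumes symbols are single characters and uses a hypothetical helper name `entropy`; it reproduces the abracadabra figure quoted on the slide to within rounding.

```python
import math
from collections import Counter

def entropy(text):
    """Empirical entropy of text in bits per symbol."""
    n = len(text)
    # Counter gives f_s for every distinct symbol s in the text.
    return -sum((f / n) * math.log2(f / n) for f in Counter(text).values())

print(entropy("abracadabra"))
```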

  7. Posting list Compression. Goals: minimize storage costs; fast sequential access; support the GEQ(x) operation: return the smallest item in the list that is greater than or equal to x.
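
To make the GEQ(x) contract concrete, here is a minimal sketch on an uncompressed, sorted Python list using the standard `bisect` module. The function name `geq` and the choice to return `None` when no such item exists are assumptions; a compressed index would need extra machinery to support the same operation without decoding the whole list.

```python
from bisect import bisect_left

def geq(postings, x):
    """Smallest item in the sorted list that is >= x, or None if there is none."""
    i = bisect_left(postings, x)
    return postings[i] if i < len(postings) else None

postings = [2, 3, 5, 6, 8, 14, 25]
print(geq(postings, 7), geq(postings, 14), geq(postings, 26))
```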

  8. Posting list Compression - Concepts. Each integer can be in [1, N], requiring log2(N) bits. A postings list corresponds to an increasing sequence of integers. Idea: gaps between two adjacent integers can be much smaller. Examples:
       the       ids: 25 26 29 … 12345 12347 …    gaps: 25 1 3 … 1 2 …
       house     ids: 5123 5234 5454 5591 …       gaps: 5123 111 220 137 …
       aeronaut  ids: 251235 251239 …             gaps: 251235 4 34 …
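
The gap transformation is easy to sketch. The helper names `to_gaps` and `from_gaps` below are hypothetical; the first gap is taken relative to 0, so it equals the first document id, as in the examples above.

```python
from itertools import accumulate

def to_gaps(ids):
    """Replace each id by its distance from the previous id (the first from 0)."""
    return [cur - prev for prev, cur in zip([0] + ids[:-1], ids)]

def from_gaps(gaps):
    """Undo to_gaps with a running sum."""
    return list(accumulate(gaps))

ids = [5123, 5234, 5454, 5591]
gaps = to_gaps(ids)
print(gaps, from_gaps(gaps) == ids)
```

Decoding is a prefix sum, so sequential access stays cheap, but reaching an arbitrary position requires summing all earlier gaps.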

  9. Variable Byte Compression. Idea: use a variable number of bytes to represent integers. Each byte contains 7 bits of "payload" and one continuation bit. Examples (number: encoding): 824: 00111000 10000110; 5: 10000101. Storage cost (number range: number of bytes): 0−127: 1; 128−16383: 2; 16384−2097151: 3.
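
The storage cost table follows directly from the 7-bit payload: every additional byte extends the representable range by a factor of 128. A small sketch, with the hypothetical helper name `vbyte_length`, that computes the byte count for any non-negative integer:

```python
def vbyte_length(x):
    """Number of bytes needed when each byte carries 7 payload bits."""
    n = 1
    while x >= 128:
        x >>= 7  # drop the 7 bits that the current byte would hold
        n += 1
    return n

for value in (0, 127, 128, 16383, 16384, 2097151):
    print(value, vbyte_length(value))
```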

  21. Variable Byte Compression - Example. Compress the number 512312, or 1111101000100111000 in binary. How many bytes? 11111 | 0100010 | 0111000. 3 bytes! How do we compress the number?
        1. Extract the lowest 7 bits. 512312 mod 128 = 56 = 0111000
        2. Discard the lowest 7 bits. 512312 ÷ 128 = 4002 (or 512312 >> 7)
        3. Extract the lowest 7 bits. 4002 mod 128 = 34 = 0100010
        4. Discard the lowest 7 bits. 4002 ÷ 128 = 31 (or 4002 >> 7)
        5. The number is now smaller than 128. Write it in the lowest 7 bits and set the top bit to 1. 31 = 11111, so we write 10011111, which is 31 + 128 = 159.

  22. Variable Byte - Algorithm.
      Encoding:
        function ENCODE(x)
          while x >= 128 do
            WRITE(x mod 128)
            x = x ÷ 128
          end while
          WRITE(x + 128)
        end function
      Decoding:
        function DECODE(bytes)
          x = 0, s = 0
          y = READBYTE(bytes)
          while y < 128 do
            x = x ^ (y << s)
            s = s + 7
            y = READBYTE(bytes)
          end while
          x = x ^ ((y − 128) << s)
          return x
        end function
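
The ENCODE and DECODE routines translate almost line for line into Python. The sketch below is one possible rendering, not the lecture's code: the function names, the use of `bytes` as the output buffer, the `pos` argument that stands in for READBYTE, and the list helpers are all assumptions. It uses `|` where the pseudocode uses `^`; the two give the same result here because the shifted 7-bit payloads never overlap.

```python
def vbyte_encode(x):
    """Encode one non-negative integer, lowest 7 bits first."""
    out = bytearray()
    while x >= 128:
        out.append(x % 128)  # payload byte, top bit 0: more bytes follow
        x //= 128
    out.append(x + 128)      # final byte, top bit 1 marks the end
    return bytes(out)

def vbyte_decode(data, pos=0):
    """Decode one integer starting at pos; return (value, next position)."""
    x, s = 0, 0
    while data[pos] < 128:
        x |= data[pos] << s
        s += 7
        pos += 1
    x |= (data[pos] - 128) << s
    return x, pos + 1

def encode_list(numbers):
    return b"".join(vbyte_encode(n) for n in numbers)

def decode_list(data):
    pos, out = 0, []
    while pos < len(data):
        x, pos = vbyte_decode(data, pos)
        out.append(x)
    return out

print(list(vbyte_encode(512312)))
print(decode_list(encode_list([5123, 111, 220, 137])))
```

Applied to 512312, the encoder emits the bytes 56, 34 and 159, matching the worked example. In practice the encoder would be fed the gaps of a postings list rather than raw document ids, since small gaps fit in a single byte.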
