Index compression and efficient query processing
COMP90042 Lecture 3, The University of Melbourne
by Matthias Petri, Tue 12/3/2019

Index compression

Inverted Index - Recap
Terms on the slide: in, where, sleep, house, night, the, old, big, and.
For each term t the index stores f_t and the postings list for t (document ids, then the per-document counts):
  f_t   Postings list for t
  6     ⟨1, 6, 7, 8, 9, 12⟩, ⟨1, 2, 1, 3, 1, 2⟩
  3     ⟨2, 5, 42⟩, ⟨1, 1, 1⟩
  1     ⟨32⟩, ⟨4⟩
  7     ⟨2, 3, 5, 6, 8, 14, 25⟩, ⟨1, 1, 4, 1, 5, 3, 1⟩
  52    ⟨1, 2, 3, 4, 5, 7, 8, 9, ...⟩, ⟨10, 21, 10, 42, 12, 14, 12, 4, ...⟩
  4     ⟨1, 12, 13, 14⟩, ⟨2, 2, 1, 3⟩
  5     ⟨6, 21, 32, 33, 43⟩, ⟨2, 3, 4, 2, 1⟩
  3     ⟨1, 51, 53⟩, ⟨1, 2, 3⟩
  4     ⟨1, 3, 4, 6⟩, ⟨1, 1, 2, 1⟩

Index Compression - Motivation
Inverted index mostly stored in RAM (query performance).
Companies run 1000s of machines to answer search queries.
Space reduction can lead to substantial cost reductions: saving 5% means shutting down 50/1000 machines!
Inverted index size for 420 GB of web data (tiny):
  Documents                  25 million
  Terms                      35 million
  Postings                   6 billion
  Uncompressed storage cost  ≈ 32 GB

Index compression
Benefits of index compression:
  Reduce storage requirements
  Keep larger parts of the index in memory
  Faster query processing
Example: A state-of-the-art inverted index of 25 million websites (420 GB) requires only 5 GB (1.2%) and can answer queries in ≈ 10 milliseconds.
Going from 32 GB to 5 GB makes the index about 6.4 times smaller (roughly an 84% space reduction).

Compression Principles
Intuition: Spend fewer bits on items that occur often.
Compressibility is bounded by the information content of a data set.
The information content of a text T is characterized by its entropy H:
$$H(T) = -\sum_{s \in \Sigma} \frac{f_s}{n} \log_2 \frac{f_s}{n}$$
where f_s is the frequency of symbol s in T and n is the length of T.
For example, H(abracadabra) = 2.040373 bits with n = 11, f_a = 5, f_b = 2, f_c = 1, f_d = 1, f_r = 2.
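The entropy formula can be checked numerically. The following is a minimal sketch, assuming the text is treated as a sequence of single-character symbols; the function name `entropy` is illustrative and not from the slides.

```python
from collections import Counter
from math import log2


def entropy(text: str) -> float:
    """Zero-order empirical entropy of `text`, in bits per symbol."""
    n = len(text)
    if n == 0:
        return 0.0
    # f_s for every symbol s that occurs in the text
    counts = Counter(text)
    return -sum((f / n) * log2(f / n) for f in counts.values())


if __name__ == "__main__":
    # The slide's example string
    print(round(entropy("abracadabra"), 6))
```

Rounded to six decimals, this reproduces the value quoted for abracadabra.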

Posting list Compression
Minimize storage costs
Fast sequential access
Support GEQ(x) operation: return the smallest item in the list that is greater than or equal to x
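To make the GEQ(x) semantics concrete, here is a small sketch on an uncompressed, sorted list of document ids. It uses binary search from Python's standard `bisect` module; the function name `geq` and the choice to return `None` when no such item exists are assumptions for illustration. A compressed postings list would need extra machinery to support the same operation.

```python
from bisect import bisect_left
from typing import Optional, Sequence


def geq(postings: Sequence[int], x: int) -> Optional[int]:
    """Return the smallest item in the sorted list that is >= x, or None."""
    i = bisect_left(postings, x)  # first position whose item is >= x
    return postings[i] if i < len(postings) else None


if __name__ == "__main__":
    doc_ids = [2, 3, 5, 6, 8, 14, 25]
    print(geq(doc_ids, 7))   # next document id at or after 7
    print(geq(doc_ids, 26))  # past the end of the list
```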

Posting list Compression - Concepts
A postings list corresponds to an increasing sequence of integers.
Each integer can be in [1, N], requiring log_2(N) bits.
Idea: Gaps between two adjacent integers can be much smaller.
  the       ids:  25 26 29 … 12345 12347
            gaps: 25 1 3 … 1 2
  house     ids:  5123 5234 5454 5591
            gaps: 5123 111 220 137
  aeronaut  ids:  251235 251239 …
            gaps: 251235 4 34 …
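The gap idea can be sketched in a few lines. This is an illustrative version, assuming the first gap is measured from 0 so that the first stored value is the first document id itself; the names `to_gaps` and `from_gaps` are hypothetical.

```python
from itertools import accumulate
from typing import List


def to_gaps(ids: List[int]) -> List[int]:
    """Turn an increasing list of ids into gaps (first gap is relative to 0)."""
    return [cur - prev for prev, cur in zip([0] + ids[:-1], ids)]


def from_gaps(gaps: List[int]) -> List[int]:
    """Recover the ids as a running sum over the gaps."""
    return list(accumulate(gaps))


if __name__ == "__main__":
    house = [5123, 5234, 5454, 5591]
    gaps = to_gaps(house)
    print(gaps)
    print(from_gaps(gaps) == house)
```

Decoding is a prefix sum, so sequential access stays cheap, while most of the stored numbers become small.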

Variable Byte Compression
Idea: Use a variable number of bytes to represent integers. Each byte contains 7 bits of "payload" and one continuation bit.
Examples:
  Number  Encoding
  824     00111000 10000110
  5       10000101
Storage cost:
  Number range     Number of bytes
  0 - 127          1
  128 - 16383      2
  16384 - 2097151  3
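The storage-cost table follows directly from the 7 payload bits per byte. A short sketch, assuming non-negative integers (the helper name `vbyte_length` is not from the slides):

```python
def vbyte_length(x: int) -> int:
    """Number of bytes a variable-byte code needs for a non-negative integer."""
    if x < 0:
        raise ValueError("variable-byte coding expects a non-negative integer")
    nbytes = 1
    while x >= 128:  # every extra byte carries 7 more payload bits
        x >>= 7
        nbytes += 1
    return nbytes


if __name__ == "__main__":
    for value in (5, 127, 128, 824, 16383, 16384, 2097151):
        print(value, vbyte_length(value))
```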

Variable Byte Compression - Example
Compress the number 512312, which is 1111101000100111000 in binary.
How many bytes? 11111 | 0100010 | 0111000. 3 bytes!
How do we compress the number?
1. Extract the lowest 7 bits. 512312 mod 128 = 56 = 0111000
2. Discard the lowest 7 bits. 512312 ÷ 128 = 4002 (or 512312 >> 7)
3. Extract the lowest 7 bits. 4002 mod 128 = 34 = 0100010
4. Discard the lowest 7 bits. 4002 ÷ 128 = 31 (or 4002 >> 7)
5. The number is now smaller than 128. Write it in the lowest 7 bits and set the top bit to 1. 31 = 11111, so we write 10011111, which is 31 + 128 = 159.

Variable Byte - Algorithm
Encoding
  function ENCODE(x)
    while x >= 128 do
      WRITE(x mod 128)
      x = x ÷ 128
    end while
    WRITE(x + 128)
  end function
Decoding
  function DECODE(bytes)
    x = 0, s = 0
    y = READBYTE(bytes)
    while y < 128 do
      x = x ^ (y << s)
      s = s + 7
      y = READBYTE(bytes)
    end while
    x = x ^ ((y − 128) << s)
    return x
  end function
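A runnable Python sketch of the same scheme is given here. It follows the slide's convention (lowest 7 bits first, top bit set only on the last byte of each number); the function names and the choice to decode a whole byte string into a list are assumptions for illustration. Bitwise OR is used where the pseudocode uses ^, which gives the same result because the combined bit ranges never overlap.

```python
from typing import List


def vbyte_encode(x: int) -> bytes:
    """Encode one non-negative integer, lowest 7 bits first."""
    if x < 0:
        raise ValueError("variable-byte coding expects a non-negative integer")
    out = bytearray()
    while x >= 128:
        out.append(x % 128)  # payload byte, top bit 0: more bytes follow
        x //= 128
    out.append(x + 128)      # final byte, top bit 1 marks the end
    return bytes(out)


def vbyte_decode(data: bytes) -> List[int]:
    """Decode a concatenation of variable-byte codes into integers."""
    values: List[int] = []
    x, s = 0, 0
    for y in data:
        if y < 128:
            x |= y << s      # place the 7 payload bits at shift s
            s += 7
        else:
            x |= (y - 128) << s
            values.append(x)
            x, s = 0, 0      # start the next number
    return values


if __name__ == "__main__":
    code = vbyte_encode(512312)
    print([format(b, "08b") for b in code])
    print(vbyte_decode(code + vbyte_encode(824) + vbyte_encode(5)))
```

Combined with gap encoding, a postings list can be stored as the concatenation of the variable-byte codes of its gaps and decoded sequentially.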
