CS6200: Information Retrieval
Slides by: Jesse Anderton
Compressing Indexes
Indexing, session 4
Index Size

Inverted lists often consume a large amount of space: e.g., 25-50% of the size of the raw documents for TREC collections with the Indri search engine.

Compressing indexes is important to conserve disk and/or RAM. There are fast, lossless compression algorithms with good compression ratios.
The entropy of a probability distribution is a measure of its randomness. The more random a sequence of data is, the less predictable and less compressible it is. The entropy of the probability distribution of a data sequence provides a bound on the best possible compression ratio.
H(p) = −∑_i p_i log₂ p_i
Entropy of a Binomial Distribution
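The definition can be checked directly. A short Python sketch (the `entropy` helper is illustrative, not from the slides): a fair coin carries exactly 1 bit per flip, the maximum for a Bernoulli distribution, and the five-symbol example distribution used below has entropy 1.875 bits.

```python
import math

def entropy(probs):
    """Shannon entropy in bits: H(p) = -sum(p_i * log2(p_i))."""
    # Terms with p = 0 contribute nothing (lim p*log p = 0), so skip them.
    return -sum(p * math.log2(p) for p in probs if p > 0)

# A fair coin is the least predictable Bernoulli distribution: 1 bit per flip.
print(entropy([0.5, 0.5]))                     # → 1.0
# The five-symbol example distribution used in the table below:
print(entropy([1/2, 1/4, 1/8, 1/16, 1/16]))    # → 1.875
```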
In an ideal encoding scheme, a symbol with probability pi of occurring is assigned a code which takes −log₂(pi) bits: the more probable a symbol is to occur, the shorter its code should be. By this view, UTF-32 assumes a uniform distribution over all Unicode symbols, while UTF-8 assumes ASCII characters are more common. Huffman Codes achieve the best possible compression ratio when the distribution is known and when no code can stand for multiple symbols.
Symbol  p_i   Code   p_i × length
a       1/2   0      0.5
b       1/4   10     0.5
c       1/8   110    0.375
d       1/16  1110   0.25
e       1/16  1111   0.25

Expected code length: 1.875 bits per symbol.
Plaintext: aedbbaae (64 bits in UTF-8)
Ciphertext: 0111111101010001111 (19 bits)
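The example can be verified with a small sketch. The code table is taken from the slide; the `encode`/`decode` helpers are illustrative names, not part of the lecture.

```python
# Code table from the slide above.
code = {'a': '0', 'b': '10', 'c': '110', 'd': '1110', 'e': '1111'}

def encode(text):
    # Concatenate each symbol's codeword.
    return ''.join(code[ch] for ch in text)

def decode(bits):
    # Prefix-free codes decode greedily: no codeword is a prefix of another,
    # so the first match in the inverse table is always correct.
    inverse = {v: k for k, v in code.items()}
    out, buf = [], ''
    for b in bits:
        buf += b
        if buf in inverse:
            out.append(inverse[buf])
            buf = ''
    return ''.join(out)

print(encode('aedbbaae'))             # → 0111111101010001111
print(decode('0111111101010001111'))  # → aedbbaae
```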
Huffman Codes are built using a binary tree which always joins the least probable remaining nodes.
1. Create a leaf node for each symbol, weighted by its probability.
2. Repeatedly join the two least probable nodes without a parent by creating a parent whose weight is the sum of the children's weights.
3. Each symbol's code is the sequence of edges (e.g., left = 0, right = 1) on the path from the root to its leaf.
[Tree diagram: leaves a: 1/2, b: 1/4, c: 1/8, d: 1/16, e: 1/16; internal node weights 1/8, 1/4, 1/2, 1; resulting codes a = 0, b = 10, c = 110, d = 1110, e = 1111]
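The construction above can be sketched with a heap (a minimal illustration, not the lecture's code). Tie-breaking between equal weights can produce different but equally optimal bit patterns, so only the code lengths are guaranteed to match the table.

```python
import heapq
from itertools import count

def huffman_codes(probs):
    """Build Huffman codes by repeatedly merging the two least probable nodes."""
    tiebreak = count()  # keeps heap comparisons away from non-comparable trees
    heap = [(p, next(tiebreak), sym) for sym, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, _, left = heapq.heappop(heap)
        p2, _, right = heapq.heappop(heap)
        # Parent weight is the sum of the children's weights.
        heapq.heappush(heap, (p1 + p2, next(tiebreak), (left, right)))
    codes = {}
    def walk(node, prefix):
        if isinstance(node, tuple):       # internal node: (left, right)
            walk(node[0], prefix + '0')
            walk(node[1], prefix + '1')
        else:                             # leaf: a symbol
            codes[node] = prefix or '0'   # single-symbol edge case
    walk(heap[0][2], '')
    return codes

codes = huffman_codes({'a': 1/2, 'b': 1/4, 'c': 1/8, 'd': 1/16, 'e': 1/16})
# Code lengths match the table: a=1, b=2, c=3, d=4, e=4 bits.
print({s: len(c) for s, c in codes.items()})
```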
Huffman codes achieve the theoretical limit for compressibility, assuming that the size of the code table is negligible and that each input symbol must correspond to exactly one output symbol. Other codes, such as Lempel-Ziv encoding, allow variable-length sequences of input symbols to correspond to particular output symbols and do not require transferring an explicit code table. Compression schemes such as gzip are based on Lempel-Ziv encoding. However, for encoding inverted lists it can be beneficial to have a 1:1 correspondence between code words and plaintext characters.
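As a rough illustration of Lempel-Ziv-style gains on repetitive input, Python's standard zlib module (DEFLATE, which combines LZ77 matching with Huffman coding) compresses a repeated string to far fewer bits than any 1:1 symbol code could:

```python
import zlib

# DEFLATE finds repeated substrings, so highly repetitive input compresses
# well below the one-codeword-per-symbol floor of a plain Huffman code.
text = b'aedbbaae' * 1000   # 8000 bytes
compressed = zlib.compress(text, level=9)
print(len(text), len(compressed))  # the compressed size is a tiny fraction
```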
The best any compression scheme can do depends on the entropy of the probability distribution over the data: more random data is less compressible. Huffman Codes meet the entropy limit (under the assumptions above) and can be built in linear time given symbols sorted by probability, so they are a common choice. Other schemes can do better, generally by interpreting the input sequence differently (e.g., encoding sequences of characters as if they were a single input symbol: a different distribution, and thus a different entropy limit). Next, we'll take a look at how to efficiently represent integers of arbitrary size using bit-aligned codes.