
Bit-aligned Codes Indexing, session 5 CS6200: Information Retrieval Slides by: Jesse Anderton

Compressing Inverted Lists

An inverted list is generally represented as multiple sequences of integers.
• Term and document IDs are used instead of the literal term or document URL/path/name.
• TF, DF, term position lists and other data in the inverted lists are often integers.
We’d like to efficiently encode this integer data to help minimize disk and memory usage. But how?
[Figure: Postings with DF, TF, and Positions]

Unary

The encodings used by processors for integers (e.g., two’s complement) use a fixed-width encoding with fixed upper bounds. Any number takes 32 (say) bits, with no ability to encode larger numbers.
Both properties are bad for inverted lists. Smaller numbers tend to be much more common, and should take less space. But very large numbers can happen: consider term positions in very large files, or document IDs in a large web collection.
What if we used a unary encoding? This encodes k by k 1s, followed by a 0.

decimal  binary    unary
0        00000000  0
1        00000001  10
7        00000111  11111110
13       00001101  11111111111110
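The unary rule above can be sketched in a few lines of Python. This is an illustrative sketch, not the slides' own code: the function names `unary_encode` and `unary_decode` are hypothetical, and codes are represented as strings of '0' and '1' characters for readability rather than as packed bits.

```python
def unary_encode(k: int) -> str:
    """Return the unary code for k: k ones followed by a single zero."""
    if k < 0:
        raise ValueError("unary codes are defined for k >= 0")
    return "1" * k + "0"


def unary_decode(bits: str) -> int:
    """Decode one unary codeword by counting the ones before the first zero."""
    return bits.index("0")


# Reproduce the rows of the table: decimal, 8-bit binary, unary.
for k in (0, 1, 7, 13):
    print(k, format(k, "08b"), unary_encode(k))
```

The code length is k + 1 bits, which is why unary is attractive only when small values dominate.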

Elias-γ Codes

Unary is efficient for small numbers, but very inefficient for large numbers. There are better ways to get a variable bit length.
With Elias-γ codes, we use unary to encode the bit length and then store the number in binary.
To encode a number k, compute:
$k_d = \lfloor \log_2 k \rfloor$
$k_r = k - 2^{\lfloor \log_2 k \rfloor}$

Decimal  k_d  k_r  Code (k_d in unary, k_r in binary)
1        0    0    0
2        1    0    10 0
3        1    1    10 1
6        2    2    110 10
15       3    7    1110 111
16       4    0    11110 0000
255      7    127  11111110 1111111
1023     9    511  1111111110 111111111
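A minimal Python sketch of the Elias-γ scheme described above. The names `gamma_encode` and `gamma_decode` are hypothetical, codes are plain '0'/'1' strings rather than packed bits, and the sketch assumes k_r is written with exactly k_d bits, as the table's code column suggests.

```python
def gamma_encode(k: int) -> str:
    """Elias-gamma code for k >= 1: k_d in unary, then k_r in k_d binary digits."""
    if k < 1:
        raise ValueError("Elias-gamma codes are defined for k >= 1")
    k_d = k.bit_length() - 1          # floor(log2(k))
    k_r = k - (1 << k_d)              # k - 2**floor(log2(k))
    unary = "1" * k_d + "0"
    binary = format(k_r, "b").zfill(k_d) if k_d else ""
    return unary + binary


def gamma_decode(bits: str) -> int:
    """Decode a single Elias-gamma codeword found at the start of bits."""
    k_d = bits.index("0")                       # length of the unary prefix
    k_r_bits = bits[k_d + 1 : 2 * k_d + 1]      # the next k_d bits
    k_r = int(k_r_bits, 2) if k_d else 0
    return (1 << k_d) + k_r


for k in (1, 2, 3, 6, 15, 16, 255, 1023):
    print(k, gamma_encode(k))
```

Each codeword uses k_d + 1 bits for the unary part and k_d bits for the binary part.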

Elias-δ Codes

Elias-γ codes take $2\lfloor \log_2 k \rfloor + 1$ bits. We can do better, especially for large numbers.
Elias-δ codes encode k_d using an Elias-γ code, and take approximately $2 \log_2 \log_2 k + \log_2 k$ bits. Because an Elias-γ code cannot represent 0, the value that is γ-encoded is k_d + 1.
We split k_d into:
$k_{dd} = \lfloor \log_2 (k_d + 1) \rfloor$
$k_{dr} = (k_d + 1) - 2^{\lfloor \log_2 (k_d + 1) \rfloor}$

Decimal  k_d  k_dd  k_dr  k_r  Code (k_dd in unary, k_dr in binary, k_r in binary)
1        0    0     0     0    0
2        1    1     0     0    10 0 0
3        1    1     0     1    10 0 1
6        2    1     1     2    10 1 10
15       3    2     0     7    110 00 111
16       4    2     1     0    110 01 0000
255      7    3     0     127  1110 000 1111111
1023     9    3     2     511  1110 010 111111111
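A minimal Python sketch of Elias-δ coding that reproduces the codes in the table. The function names are hypothetical, and the sketch assumes the convention implied by the table rows: the length part is the Elias-γ code of k_d + 1, followed by k_r in k_d binary digits. Codes are '0'/'1' strings for readability.

```python
def gamma_encode(k: int) -> str:
    """Elias-gamma code for k >= 1 (helper for the delta code below)."""
    k_d = k.bit_length() - 1
    k_r = k - (1 << k_d)
    return "1" * k_d + "0" + (format(k_r, "b").zfill(k_d) if k_d else "")


def delta_encode(k: int) -> str:
    """Elias-delta code for k >= 1: gamma code of (k_d + 1), then k_r in k_d bits."""
    if k < 1:
        raise ValueError("Elias-delta codes are defined for k >= 1")
    k_d = k.bit_length() - 1          # floor(log2(k))
    k_r = k - (1 << k_d)
    binary = format(k_r, "b").zfill(k_d) if k_d else ""
    return gamma_encode(k_d + 1) + binary


def delta_decode(bits: str) -> int:
    """Decode a single Elias-delta codeword found at the start of bits."""
    k_dd = bits.index("0")                                  # unary prefix
    k_dr = int(bits[k_dd + 1 : 2 * k_dd + 1], 2) if k_dd else 0
    k_d = (1 << k_dd) + k_dr - 1                            # undo the +1 shift
    start = 2 * k_dd + 1
    k_r = int(bits[start : start + k_d], 2) if k_d else 0
    return (1 << k_d) + k_r


for k in (1, 2, 3, 6, 15, 16, 255, 1023):
    print(k, delta_encode(k))
```

For small k the δ code can be longer than the γ code (for example k = 2), and the saving appears as k grows.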

Python Implementation

Delta Encoding

We now have an efficient variable bit length integer encoding scheme which uses just a few bits for small numbers, and can handle arbitrarily large numbers with ease.
To further reduce the index size, we want to ensure that docids, positions, etc. in our lists are small (for smaller encodings) and repetitive (for better compression).
We can do this by sorting the lists and encoding the difference, or delta, between the current number and the last.

Raw positions: 1, 5, 9, 18, 23, 24, 30, 44, 45, 48
Deltas:        1, 4, 4, 9, 5, 1, 6, 14, 1, 3

High-frequency words compress more easily: 1, 1, 2, 1, 5, 1, 4, 1, 1, 3, ...
Low-frequency words have larger deltas: 109, 3766, 453, 1867, 992, ...
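A small Python sketch of delta encoding, using the raw positions from the slide. The names `to_deltas`, `from_deltas`, and `gamma_bits` are hypothetical. The sketch assumes the list is sorted, strictly increasing, and positive, with the first delta taken relative to 0, so every delta is at least 1 and can be given an Elias-γ code. The bit count uses the Elias-γ length $2\lfloor \log_2 k \rfloor + 1$.

```python
from itertools import accumulate


def to_deltas(values):
    """Replace each value in a sorted list with its gap from the previous value."""
    deltas, previous = [], 0
    for value in values:
        if value <= previous:
            raise ValueError("values must be positive and strictly increasing")
        deltas.append(value - previous)
        previous = value
    return deltas


def from_deltas(deltas):
    """Recover the original values with a running sum over the deltas."""
    return list(accumulate(deltas))


def gamma_bits(k: int) -> int:
    """Length in bits of the Elias-gamma code for k >= 1."""
    return 2 * (k.bit_length() - 1) + 1


positions = [1, 5, 9, 18, 23, 24, 30, 44, 45, 48]
deltas = to_deltas(positions)
print("deltas:", deltas)
print("gamma bits, raw positions:", sum(gamma_bits(p) for p in positions))
print("gamma bits, deltas:       ", sum(gamma_bits(d) for d in deltas))
```

Decoding has to walk the list from the start, since each value depends on the running sum of all earlier deltas.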

Wrapping Up Bit-aligned codes allow us to minimize the storage used to encode integers. We can use just a few bits for small integers, and still represent arbitrarily large numbers. Inverted lists can also be made more compressible by delta-encoding their contents. Next, we’ll see how to encode integers using a variable byte code, which is more convenient for processing.
