Introduction to Data Indexing: Classifications and Properties Walid - - PowerPoint PPT Presentation

introduction to data indexing classifications and
SMART_READER_LITE
LIVE PREVIEW

Introduction to Data Indexing: Classifications and Properties Walid - - PowerPoint PPT Presentation

Introduction to Data Indexing: Classifications and Properties Walid G. Aref Walid G. Aref Introduction The target of an index is to provide speedy retrieval of data from an underlying table Input: One or more values (also, termed search


slide-1
SLIDE 1

Introduction to Data Indexing: Classifications and Properties

Walid G. Aref

Walid G. Aref

slide-2
SLIDE 2

Introduction

  • The target of an index is to provide speedy retrieval of data from an

underlying table

  • Input: One or more values (also, termed search key) or range of

values

  • Output: Tuples from the index with these matching values (or within

the range of values)

  • Search Key: One or more attributes from the schema of the

underlying table

Walid G. Aref

slide-3
SLIDE 3

Primary vs. Secondary Index

  • Primary-key index:
  • Search key of the index is the primary key for the underlying table
  • Secondary-key index:
  • This is not the case, i.e., Search key of the index is not the primary key for the underlying

table

  • Can have multiple tuples in the table with the same search key value
  • Example of a secondary-key index:
  • Index on GPA over the students table
  • Can have multiple students with the same GPA (for a given GPA value, multiple tids of

students with the same GPA)

  • This is not the case for a primary-key index
  • Example of a primary-key index:
  • Index on Student-id over the students table
  • Has only one tuple for a given sid (for a given sid value, only one tid has that value)

Walid G. Aref

slide-4
SLIDE 4

How are the Index and the Underlying Table Related?

  • What to store in the index?

Walid G. Aref

?

Table

slide-5
SLIDE 5

What Gets Stored in the Index?

  • Three choices on what to store in the index

1. The entire table (Index-based Tables) 2. (Key value, Tuple-identifier) pairs 3. Key value, Set of Tuple-identifiers

Walid G. Aref

?

Table

slide-6
SLIDE 6

Storing the Entire Table in the Index: Index-based Tables

  • The entire table is stored inside an index
  • Common: Hash and B-tree tables
  • Is used typically when the Search Key for the index is the

same as the primary key for table

  • Or at least is a unique key
  • Why? So that tuples are stored only once inside the index
  • B-tree-based table:
  • Search key of the b-tree is a unique key for the table.
  • Need log time to access a tuple
  • Hash-based table:
  • Buckets collectively contain the entire table
  • Hash function takes the key of the tuple as input and produces the

bucket (disk page) that contains the entire tuple

Walid G. Aref

Table Tree-based Index Search Path Hash-based Index Hash Table Table Stored Inside the Hash Buckets

slide-7
SLIDE 7

Index-based Tables

  • While table can conceptually have multiple

indexes, the table can only be stored in one index.

  • Thus, can have only one index-based table and

multiple other indexes from the other choices

Walid G. Aref

Table Tree-based Index Search Path Hash-based Index Hash Table Table Stored Inside the Hash Buckets

slide-8
SLIDE 8

Choice 2: The Index Contains (Key Value, Tuple-Identifier) Pairs

  • Tuple identifiers point to the tuples
  • Search key does not have to be the key for the

table

  • Can be a secondary index
  • e.g., index on salary

Walid G. Aref

Table Tree-based Index Search Path Leaf Level contains (value, Tid) pairs

slide-9
SLIDE 9

Choice 2: The Index Contains (Key Value, Tuple-Identifier) Pairs

  • What if the table is sorted by the same attribute

as the Search Key?

  • Comb-like connections between the index and

the table

  • Why is the comb-like shape important?
  • Because it is an indication of a clustered

index

Walid G. Aref

Table Tree-based Index Search Path Leaf Level contains (value, Tid) pairs Table is sorted based on the same attribute as that of the index

slide-10
SLIDE 10

Clustered vs. Un-clustered Index

  • Clustered index: Table is sorted based on the

same attribute as that of the index

  • Un-clustered index: Table is sorted based on a

different attribute than that of the index (or not sorted at all)

  • In a Range Search operation, tuples in the

range will end up being contiguous in the case of a clustered index

  • Reduces the I/Os significantly
  • Example:
  • Assume Disk page capacity DC, Number of tuples

in the range is Nr

  • Number of I/Os
  • Clustered index: ⌈!"

#$⌉, Un-clustered index: ~Nr

Walid G. Aref

Table Tree-based Index Search Range Table Tree-based Index Search Range

slide-11
SLIDE 11

Choice 3: The Index Contains a Key Value and Set of Tuple-identifiers

  • Leaf level contains
  • (key value, tid1, tid2, ..)
  • Suitable for a secondary key index (e.g.,

index on GPA) where values are repeated.

  • Can save in storage
  • However, need support for variable-

length records

  • In contrast to (key-value, tid) fixed-

length records (in most cases, e.g., an exception in when the key-value is a string)

Walid G. Aref

Table Tree-based Index Search Path Leaf Level contains (value, Set of tids) For tuples with that value

slide-12
SLIDE 12

Dense vs. Sparse Indexes

  • When table is sorted on the same key as the index, we have

this comb-like connection between index and table

  • Can habe the index point to only the first tuple in the page.

(the rest are sorted.

  • Gives rise to sparse vs. dense indexes
  • Dense index: Entry in the index per tuple in the table
  • Sparse index: Entry in the index per page in the table
  • Result in a smaller-size index.

Walid G. Aref

Table Tree-based Dense Index Table Tree-based Sparse Index assuming Page stores 3 tuples Table Tree-based Sparse Index assuming Page stores 3 tuples

slide-13
SLIDE 13

Un-clustered Indexes

  • Un-clustered index: Table is sorted based on a

different attribute than that of the index (or not sorted at all)

  • In a Range Search operation, tuples in the

range will end up being contiguous in the case of a clustered index

  • Reduces the I/Os significantly
  • In-evitable problem: Why?
  • Because for a given table, we can have only
  • ne clustered index. All the other indexes will

have to be un-clustered

  • Question:
  • How to deal with the cost of query processing for

un-clustered index?

Walid G. Aref

Table An Unclustered Tree-based Index Search Range Table A Clustered Tree-based Index Search Range

slide-14
SLIDE 14

Efficient Range Search using Un-clustered Indexes

  • In a Range Search operation, tuples in the range will end

up being in separate pages

  • Cost formula:
  • Clustered index: ⌈!"

#$⌉ I/O

  • Un-clustered index: ~Nr
  • We can have only one clustered index per table. All the
  • ther indexes will have to be un-clustered

Walid G. Aref

Table An Unclustered Tree-based Index Search Range

  • How to deal with the cost of query processing for un-clustered index?
  • If the number of tuples qualifying a range is small , e.g., 1, do nothing
  • For a one-tuple result, Clustered and un-clustered costs are the same (1 I/O)
  • Otherwise, collect the set of tids that qualify the range query without retrieving the tuples
  • Tid = (Page-id, slot-id)
  • Sort the tids based on Page-id
  • Retrieve a page only once and retrieve all the qualifying tuples from their corresponding slot-ids
  • Can result in some saving
slide-15
SLIDE 15

Multi-Dimensional Indexes (or Indexes with Composite Attributes)

  • Question: Can we realize an index when the search key is

more than one attribute?

  • Give rise to Composite index or Multi-dimensional Index
  • Example: For Table Enrolled (sid, cid, grade)
  • Index on (sid,cid):
  • Can answer queries on sid,cid:

Find grade of sid=0111 in cid=CS580

  • May also answer queries on sid only: Find courses taken by sid=0111
  • Can it answer queries on cid? E.g., Find students registered in cid=CS580?
  • Depends on how the composite index is constructed
  • How to realize a composite index?

Walid G. Aref

sid cid grade 0111 CS541 A 0111 CS580 B 0333 CS448 A- 0444 CS348 B 0333 CS580 A Enrolled Table

slide-16
SLIDE 16

How to Realize Composite or Multi- Dimensional Indexes?

  • Want to build an index on Attributes A and B
  • Many ways to do that. Each with different costs and query

capabilities

  • Simple solution (for now - More on this subject later):
  • Concatenate the two attribute values together and build a hash or tree-

based index

  • For example, concatenate sid,cid
  • Predicates that can be answered “efficiently” when

concatenating AB

  • Tree-based index on AB: Equality on AB, Equality on A only, Range

predicate on A and Range predicate on B, Range on only A

  • Hash-based index on AB: Only equality on AB unless hash supports
  • rder-preserving property (can support ranges on A and B or on A only

Walid G. Aref

sid cid grade 0111 CS541 A 0111 CS580 B 0333 CS448 A- 0444 CS348 B 0333 CS580 A Enrolled Table

slide-17
SLIDE 17

Composite Indexes vs. One-attribute Indexes

  • Question: What are the advantages/disadvantages of

Composite vs. One-attribute indexes

  • Assume we have a table with two attributes A and B
  • Scenario 1: Build a composite index on AB (the concatenation
  • f A and B
  • Scenario 2: Build two single-attribute indexes; one on A and
  • ne on B
  • Consider the query:
  • Find grade of sid=0111 in cid=CS580
  • For Scenario 1, use the composite index on sid,cid
  • For Scenario 2, we have multiple strategies to answer the query

Walid G. Aref

sid cid grade 0111 CS541 A 0111 CS580 B 0333 CS448 A- 0444 CS348 B 0333 CS580 A Enrolled Table

slide-18
SLIDE 18

Composite Indexes vs. One-attribute Indexes (2)

  • Scenario 2: Build two single-attribute indexes; one on A

and one on B

  • Query: Find grade of sid=0111 in cid=CS580

Walid G. Aref

sid cid grade 0111 CS541 A 0111 CS580 B 0333 CS448 A- 0444 CS348 B 0333 CS580 A Enrolled Table

  • Strategy 1: Use the index
  • f sid
  • Find the qualifying tids
  • Given a tid, retrieve the

corresponding tuple and check the cid value on the fly

  • If there is a match, report

this tuple as output

  • If not, discard the tuple
  • Strategy 2: Use the index
  • f cid
  • Find the qualifying tids
  • Given a tid, retrieve the

corresponding tuple and check the sid value on the fly

  • If there is a match, report

this tuple as output

  • If not, discard the tuple
  • Which strategy to

use?

  • The one that ends

up retrieving the least number of tuples (i.e., the one with high selectivity

slide-19
SLIDE 19

Composite Indexes vs. One-attribute Indexes (3)

  • Scenario 2: Build two single-attribute indexes; one on A

and one on B

  • Query: Find grade of sid=0111 in cid=CS580

Walid G. Aref

sid cid grade 0111 CS541 A 0111 CS580 B 0333 CS448 A- 0444 CS348 B 0333 CS580 A Enrolled Table

  • Strategy 3: Use the index of sid to Find the qualifying tids
  • Store these tids in Set Ssid
  • Use the index of cid to Find the qualifying tids
  • Store these tids in Set Scid
  • Intersect both sets Ssidand Scid to find the tids of the tuples

that qualify both predicates

slide-20
SLIDE 20

How to Choose Which Type of Index to Construct?

  • Already talked about this subject
  • More to come on this subject when we discuss query processing

Walid G. Aref