[PPT] - Introduction to Data Indexing: Classifications and Properties Walid PowerPoint Presentation

SLIDE 1

Introduction to Data Indexing: Classifications and Properties

Walid G. Aref

SLIDE 2

Introduction

The target of an index is to provide speedy retrieval of data from an

underlying table

Input: One or more values (also, termed search key) or range of

values

Output: Tuples from the index with these matching values (or within

the range of values)

Search Key: One or more attributes from the schema of the

underlying table

Walid G. Aref

SLIDE 3

Primary vs. Secondary Index

Primary-key index:
Search key of the index is the primary key for the underlying table
Secondary-key index:
This is not the case, i.e., Search key of the index is not the primary key for the underlying

table

Can have multiple tuples in the table with the same search key value
Example of a secondary-key index:
Index on GPA over the students table
Can have multiple students with the same GPA (for a given GPA value, multiple tids of

students with the same GPA)

This is not the case for a primary-key index
Example of a primary-key index:
Index on Student-id over the students table
Has only one tuple for a given sid (for a given sid value, only one tid has that value)

Walid G. Aref

SLIDE 4

How are the Index and the Underlying Table Related?

What to store in the index?

Walid G. Aref

?

Table

SLIDE 5

What Gets Stored in the Index?

Three choices on what to store in the index

1. The entire table (Index-based Tables) 2. (Key value, Tuple-identifier) pairs 3. Key value, Set of Tuple-identifiers

Walid G. Aref

?

Table

SLIDE 6

Storing the Entire Table in the Index: Index-based Tables

The entire table is stored inside an index
Common: Hash and B-tree tables
Is used typically when the Search Key for the index is the

same as the primary key for table

Or at least is a unique key
Why? So that tuples are stored only once inside the index
B-tree-based table:
Search key of the b-tree is a unique key for the table.
Need log time to access a tuple
Hash-based table:
Buckets collectively contain the entire table
Hash function takes the key of the tuple as input and produces the

bucket (disk page) that contains the entire tuple

Walid G. Aref

Table Tree-based Index Search Path Hash-based Index Hash Table Table Stored Inside the Hash Buckets

SLIDE 7

Index-based Tables

While table can conceptually have multiple

indexes, the table can only be stored in one index.

Thus, can have only one index-based table and

multiple other indexes from the other choices

Walid G. Aref

Table Tree-based Index Search Path Hash-based Index Hash Table Table Stored Inside the Hash Buckets

SLIDE 8

Choice 2: The Index Contains (Key Value, Tuple-Identifier) Pairs

Tuple identifiers point to the tuples
Search key does not have to be the key for the

table

Can be a secondary index
e.g., index on salary

Walid G. Aref

Table Tree-based Index Search Path Leaf Level contains (value, Tid) pairs

SLIDE 9

Choice 2: The Index Contains (Key Value, Tuple-Identifier) Pairs

What if the table is sorted by the same attribute

as the Search Key?

Comb-like connections between the index and

the table

Why is the comb-like shape important?
Because it is an indication of a clustered

index

Walid G. Aref

Table Tree-based Index Search Path Leaf Level contains (value, Tid) pairs Table is sorted based on the same attribute as that of the index

SLIDE 10

Clustered vs. Un-clustered Index

Clustered index: Table is sorted based on the

same attribute as that of the index

Un-clustered index: Table is sorted based on a

different attribute than that of the index (or not sorted at all)

In a Range Search operation, tuples in the

range will end up being contiguous in the case of a clustered index

Reduces the I/Os significantly
Example:
Assume Disk page capacity DC, Number of tuples

in the range is Nr

Number of I/Os
Clustered index: ⌈!"

#$⌉, Un-clustered index: ~Nr

Walid G. Aref

Table Tree-based Index Search Range Table Tree-based Index Search Range

SLIDE 11

Choice 3: The Index Contains a Key Value and Set of Tuple-identifiers

Leaf level contains
(key value, tid1, tid2, ..)
Suitable for a secondary key index (e.g.,

index on GPA) where values are repeated.

Can save in storage
However, need support for variable-

length records

In contrast to (key-value, tid) fixed-

length records (in most cases, e.g., an exception in when the key-value is a string)

Walid G. Aref

Table Tree-based Index Search Path Leaf Level contains (value, Set of tids) For tuples with that value

SLIDE 12

Dense vs. Sparse Indexes

When table is sorted on the same key as the index, we have

this comb-like connection between index and table

Can habe the index point to only the first tuple in the page.

(the rest are sorted.

Gives rise to sparse vs. dense indexes
Dense index: Entry in the index per tuple in the table
Sparse index: Entry in the index per page in the table
Result in a smaller-size index.

Walid G. Aref

Table Tree-based Dense Index Table Tree-based Sparse Index assuming Page stores 3 tuples Table Tree-based Sparse Index assuming Page stores 3 tuples

SLIDE 13

Un-clustered Indexes

Un-clustered index: Table is sorted based on a

different attribute than that of the index (or not sorted at all)

In a Range Search operation, tuples in the

range will end up being contiguous in the case of a clustered index

Reduces the I/Os significantly
In-evitable problem: Why?
Because for a given table, we can have only
ne clustered index. All the other indexes will

have to be un-clustered

Question:
How to deal with the cost of query processing for

un-clustered index?

Walid G. Aref

Table An Unclustered Tree-based Index Search Range Table A Clustered Tree-based Index Search Range

SLIDE 14

Efficient Range Search using Un-clustered Indexes

In a Range Search operation, tuples in the range will end

up being in separate pages

Cost formula:
Clustered index: ⌈!"

#$⌉ I/O

Un-clustered index: ~Nr
We can have only one clustered index per table. All the
ther indexes will have to be un-clustered

Walid G. Aref

Table An Unclustered Tree-based Index Search Range

How to deal with the cost of query processing for un-clustered index?
If the number of tuples qualifying a range is small , e.g., 1, do nothing
For a one-tuple result, Clustered and un-clustered costs are the same (1 I/O)
Otherwise, collect the set of tids that qualify the range query without retrieving the tuples
Tid = (Page-id, slot-id)
Sort the tids based on Page-id
Retrieve a page only once and retrieve all the qualifying tuples from their corresponding slot-ids
Can result in some saving

SLIDE 15

Multi-Dimensional Indexes (or Indexes with Composite Attributes)

Question: Can we realize an index when the search key is

more than one attribute?

Give rise to Composite index or Multi-dimensional Index
Example: For Table Enrolled (sid, cid, grade)
Index on (sid,cid):
Can answer queries on sid,cid:

Find grade of sid=0111 in cid=CS580

May also answer queries on sid only: Find courses taken by sid=0111
Can it answer queries on cid? E.g., Find students registered in cid=CS580?
Depends on how the composite index is constructed
How to realize a composite index?

Walid G. Aref

sid cid grade 0111 CS541 A 0111 CS580 B 0333 CS448 A- 0444 CS348 B 0333 CS580 A Enrolled Table

SLIDE 16

How to Realize Composite or Multi- Dimensional Indexes?

Want to build an index on Attributes A and B
Many ways to do that. Each with different costs and query

capabilities

Simple solution (for now - More on this subject later):
Concatenate the two attribute values together and build a hash or tree-

based index

For example, concatenate sid,cid
Predicates that can be answered “efficiently” when

concatenating AB

Tree-based index on AB: Equality on AB, Equality on A only, Range

predicate on A and Range predicate on B, Range on only A

Hash-based index on AB: Only equality on AB unless hash supports
rder-preserving property (can support ranges on A and B or on A only

Walid G. Aref

sid cid grade 0111 CS541 A 0111 CS580 B 0333 CS448 A- 0444 CS348 B 0333 CS580 A Enrolled Table

SLIDE 17

Composite Indexes vs. One-attribute Indexes

Question: What are the advantages/disadvantages of

Composite vs. One-attribute indexes

Assume we have a table with two attributes A and B
Scenario 1: Build a composite index on AB (the concatenation
f A and B
Scenario 2: Build two single-attribute indexes; one on A and
ne on B
Consider the query:
Find grade of sid=0111 in cid=CS580
For Scenario 1, use the composite index on sid,cid
For Scenario 2, we have multiple strategies to answer the query

Walid G. Aref

sid cid grade 0111 CS541 A 0111 CS580 B 0333 CS448 A- 0444 CS348 B 0333 CS580 A Enrolled Table

SLIDE 18

Composite Indexes vs. One-attribute Indexes (2)

Scenario 2: Build two single-attribute indexes; one on A

and one on B

Query: Find grade of sid=0111 in cid=CS580

Walid G. Aref

sid cid grade 0111 CS541 A 0111 CS580 B 0333 CS448 A- 0444 CS348 B 0333 CS580 A Enrolled Table

Strategy 1: Use the index
f sid
Find the qualifying tids
Given a tid, retrieve the

corresponding tuple and check the cid value on the fly

If there is a match, report

this tuple as output

If not, discard the tuple
Strategy 2: Use the index
f cid
Find the qualifying tids
Given a tid, retrieve the

corresponding tuple and check the sid value on the fly

If there is a match, report

this tuple as output

If not, discard the tuple
Which strategy to

use?

The one that ends

up retrieving the least number of tuples (i.e., the one with high selectivity

SLIDE 19

Composite Indexes vs. One-attribute Indexes (3)

Scenario 2: Build two single-attribute indexes; one on A

and one on B

Query: Find grade of sid=0111 in cid=CS580

Walid G. Aref

sid cid grade 0111 CS541 A 0111 CS580 B 0333 CS448 A- 0444 CS348 B 0333 CS580 A Enrolled Table

Strategy 3: Use the index of sid to Find the qualifying tids
Store these tids in Set Ssid
Use the index of cid to Find the qualifying tids
Store these tids in Set Scid
Intersect both sets Ssidand Scid to find the tids of the tuples

that qualify both predicates

SLIDE 20

How to Choose Which Type of Index to Construct?

Already talked about this subject
More to come on this subject when we discuss query processing

Walid G. Aref