[PPT] - Indexing Ramakrishnan/Gehrke Ch. 8 How index -learning turns no PowerPoint Presentation

SLIDE 1

1 340151 Big Databases & Cloud Services (P. Baumann)

Indexing

Ramakrishnan/Gehrke Ch. 8

“How index-learning turns no student pale Yet holds the eel of science by the tail.”

- Alexander Pope (1688-1744)

SLIDE 2

2 340151 Big Databases & Cloud Services (P. Baumann)

Range Searches

``Find all students with gpa > 3.0’’
sorted file (by gpa!), fixed-length records:

binary search to find first student, then scan to find rest

Cost of binary search can be quite high
Simple idea:

Create an `index’ file containing only key values + search values

Can do binary search on (smaller) index file!

tuple 1 tuple 2 tuple N tuple 3

Data File

k2 kN k1

Index File

SLIDE 3

3 340151 Big Databases & Cloud Services (P. Baumann)

Indexes

speeds up selections on predefined search key field(s)
one relation (~file)
Any attribute (except BLOB) can be search key for an index on the relation
collection of data entries
For efficient retrieval of all data entries k* for given key value k
Index vs sorted files
Both: search faster than just heap
Updates: index much faster

SLIDE 4

4 340151 Big Databases & Cloud Services (P. Baumann)

B+ Tree Indexes

Ordered, balanced tree of degree m
Non-leaf pages: index entries = keys & pointers
Leaf pages: keys + data pointers; prev/next page chain

Index pages Leaf pages

fill factor

P0 K1 P1 K2 P2 Km Pm

SLIDE 5

5 340151 Big Databases & Cloud Services (P. Baumann)

B+-Tree Definition

B+-Tree of Order m has the following properties...
Property #1 - All leaf nodes must be at same level.
Property #2 - All nodes except root must have at least [m/2]-1 keys and maximum of

m-1 keys.

Property #3 - All non leaf nodes except root (i.e. all internal nodes) must have at least

m/2 children.

Property #5 - A non leaf node with n-1 keys must have n number of children.
key values in a node sorted in ascending order

[http://btechsmartclass.com/data_structures/b-trees.html]

SLIDE 6

6 340151 Big Databases & Cloud Services (P. Baumann)

B+ Tree: Operations

Find 28*? 29*? All > 15* and < 30*?
Insert/delete: Find data entry in leaf, change it; adjust parent if needed
change sometimes bubbles up the tree
O( logF N ) where F = fan-out, N = # leaf pages

2* 3*

Root

17

30 14* 16* 33* 34* 38* 39* 13 5 7* 5* 8* 22* 24* 27 27* 29*

Entries < 17 Entries => 17

Note how data entries in leaf level are sorted

https://www.cs.usfca.edu/~galles/visualization/BTree.html

SLIDE 7

7 340151 Big Databases & Cloud Services (P. Baumann)

More Exercises

Consider
# node reads from disk – determines speed
# comparisons – not performance relevant, but for understanding mechanics
Find 15, 20, 0
Find all 11 – 15; 20 – 32

[https://condor.depaul.edu/ntomuro/courses/417/notes/lecture3.html]

SLIDE 8

8 340151 Big Databases & Cloud Services (P. Baumann)

B+ Trees in Practice

Typical fill-factor: 67% (outdated; today ~90%)
Average fan-out: 133
Typical capacities:
Height 3: 1333 = 2,352,637 records
Height 4: 1334 = 312,900,700 records
Can often hold top levels in buffer pool:
Level 1 = 1 page = 8 Kbytes
Level 2 = 133 pages = 1 Mbyte
Level 3 = 17,689 pages = 133 MBytes

SLIDE 9

9 340151 Big Databases & Cloud Services (P. Baumann)

Hash-Based Indexes

Goal: compute address without disk access
get data in O(1)
Idea: distribute data evenly into fixed number of “buckets”
Compute location from key via Hashing function
Ex: h(int r) = r*a mod b, b prime relative to a
overflow pages
Hash index = bucket set + hashing function
Bucket = primary page + 0..n overflow pages
only equality, no range queries

[Shankai Yan]

SLIDE 10

10 340151 Big Databases & Cloud Services (P. Baumann)

Index Classification

Primary index: index contains primary key, otherwise: secondary index
Unique index
Clustered vs. unclustered:
rder of data records same as / `close to’ order of data entries

 clustered index

Cost varies greatly

SLIDE 11

11 340151 Big Databases & Cloud Services (P. Baumann)

Clustered vs. Unclustered Index

To build clustered index, first sort Heap file
Overflow pages may be needed for inserts
order of data recs `close to’, but not identical to, sort order

Data Records Index entries direct search for data entries Data entries

CLUSTERED UNCLUSTERED

Data file Data Records Index File

SLIDE 12

12 340151 Big Databases & Cloud Services (P. Baumann)

Composite Search Keys =

Search on combination of fields

Equality query:

Every field value equal to a constant value Ex: for <sal,age> index: age=20 and sal=75

Range query:

Some field value within interval Ex: age>20; or age=20 and sal>10

Data entries in index sorted by search

key to support range queries

Composite Search Keys

<age, sal> 12,20 12,10 11,80 20,75 <age> 11 12 12 20

Data records sorted by name

20 75 12 10 20 80 11 12 name age sal sue bob cal joe <sal, age> 20,12 10,12 75,20 80,11

Data entries in index sorted by <sal,age>

<sal> 10 20 75 80

Data entries sorted by <sal>

Examples of composite key indexes using lexicographic order:

SLIDE 13

13 340151 Big Databases & Cloud Services (P. Baumann)

Composite Search Keys: How To Use

To retrieve Emp records with age=30 AND sal=4000:
index on <age,sal> better than index on age, or index on sal
Choice of index key orthogonal to clustering etc.
If condition is: 20<age<30 AND 3000<sal<5000:
Clustered tree index on <age,sal> or <sal,age> is best
If condition is: age=30 AND 3000<sal<5000:
Clustered <age,sal> index much better

than <sal,age> index!

Composite indexes are larger, updated more often

SLIDE 14

14 340151 Big Databases & Cloud Services (P. Baumann)

Index-Only Plans

A number of queries can be answered without retrieving any tuples

from one or more of the relations involved …if a suitable index is available

SELECT E.dno, COUNT(*) FROM Emp E GROUP BY E.dno

<E.dno>

SELECT E.dno, MIN(E.sal) FROM Emp E GROUP BY E.dno

<E.dno,E.sal> Tree index!

SELECT AVG(E.sal) FROM Emp E WHERE E.age=25 AND

E.sal BETWEEN 3000 AND 5000 <E. age,E.sal>

r

<E.sal, E.age> Tree index!

SLIDE 15

15 340151 Big Databases & Cloud Services (P. Baumann)

Index-Only Plans (contd.)

Index-only plans possible

if key is <dno,age> or <age,dno>

Which is better?

SELECT E.dno, COUNT (*) FROM Emp E WHERE E.age=30 GROUP BY E.dno

SLIDE 16

16 340151 Big Databases & Cloud Services (P. Baumann)

Index-Only Plans (contd.)

Index-only plans possible

if key is <dno,age> or <age,dno>

Which is better?
What if we consider the second query?

SELECT E.dno, COUNT (*) FROM Emp E WHERE E.age=30 GROUP BY E.dno SELECT E.dno, COUNT (*) FROM Emp E WHERE E.age>30 GROUP BY E.dno

SLIDE 17

17 340151 Big Databases & Cloud Services (P. Baumann)

Index-Only Plans (Contd.)

Index-only plans also for queries involving >1 table

SELECT D.dno FROM Dept D, Emp E WHERE D.dno=E.dno

<E.dno>

SELECT D.dno, E.eid FROM Dept D, Emp E WHERE D.dno=E.dno

<E.dno,E.eid>

SLIDE 18

18 340151 Big Databases & Cloud Services (P. Baumann)

Understanding the Workload

For each query in workload:
Which relations accessed?
Which attributes retrieved?
Which attributes involved in selection/join conditions? How selective likely?
For each update in workload:
Which attributes involved in selection/join conditions? How selective likely?
Type of update (INSERT/DELETE/UPDATE) + attributes affected
Trade-off: Indexes can make queries faster, updates slower
…and require disk space

SLIDE 19

19 340151 Big Databases & Cloud Services (P. Baumann)

Attributes in WHERE clause candidates for index keys
Exact match condition

hash index

Range query

tree index

Clustering especially useful for range queries
can also help on equality queries if many duplicates
Multi-attribute search keys should be considered when a WHERE clause

contains several conditions

Order of attributes is important for range queries
Such indexes can sometimes enable index-only strategies for important queries
For index-only strategies, clustering is not important!

Index Selection Guidelines

SLIDE 20

20 340151 Big Databases & Cloud Services (P. Baumann)

Choose indexes that benefit as many queries as possible
impact on updates: Indexes make queries faster, updates slower
require disk space!
only one index can be clustered per relation

choose based on important queries that benefit most from clustering

understand how DBMS evaluates queries & creates query evaluation plans
…a practitioner's approach:
Consider most important queries in turn:
Consider best plan using current indexes
See if a better plan is possible with an additional index
If so, create it

Index Selection Guidelines (contd.)

SLIDE 21

21 340151 Big Databases & Cloud Services (P. Baumann)

Summary

Many alternative file organizations exist
each appropriate in some situation
If selection queries are frequent: sort file or build index
Hash-based indexes only good for equality search
Sorted files and tree-based indexes best for range search; also good for equality

search

Files rarely kept sorted in practice; B+ tree index is better
Index = collection of data entries + way to quickly find entries with given

key values

SLIDE 22

22 340151 Big Databases & Cloud Services (P. Baumann)

Summary (Contd.)

Index data entries can be actual data records
<key, rid> pairs or <key, rid-list> pairs
Choice orthogonal to indexing technique used
Can have several indexes on a given file of data records
each with different search key
important influence factors for utility/performance:
clustered vs. unclustered
primary vs. secondary
dense vs. sparse

SLIDE 23

23 340151 Big Databases & Cloud Services (P. Baumann)

Summary (Contd.)

Understanding application workload & performance goals essential
important queries & updates?
What attributes/relations involved?
speed up important queries (and perhaps some updates!)
Index maintenance overhead on updates to key fields
Choose indexes that can help many queries, if possible
Build indexes to support index-only strategies
Clustering is an important decision; only one index per relation!
Order of fields in composite index key can be important