Indexing Ramakrishnan/Gehrke Ch. 8 How index -learning turns no - - PowerPoint PPT Presentation

indexing
SMART_READER_LITE
LIVE PREVIEW

Indexing Ramakrishnan/Gehrke Ch. 8 How index -learning turns no - - PowerPoint PPT Presentation

Indexing Ramakrishnan/Gehrke Ch. 8 How index -learning turns no student pale Yet holds the eel of science by the tail. -- Alexander Pope (1688-1744) 340151 Big Databases & Cloud Services (P. Baumann) 1 Range Searches `` Find


slide-1
SLIDE 1

1 340151 Big Databases & Cloud Services (P. Baumann)

Indexing

Ramakrishnan/Gehrke Ch. 8

“How index-learning turns no student pale Yet holds the eel of science by the tail.”

  • - Alexander Pope (1688-1744)
slide-2
SLIDE 2

2 340151 Big Databases & Cloud Services (P. Baumann)

Range Searches

  • ``Find all students with gpa > 3.0’’
  • sorted file (by gpa!), fixed-length records:

binary search to find first student, then scan to find rest

  • Cost of binary search can be quite high
  • Simple idea:

Create an `index’ file containing only key values + search values

  • Can do binary search on (smaller) index file!

tuple 1 tuple 2 tuple N tuple 3

Data File

k2 kN k1

Index File

slide-3
SLIDE 3

3 340151 Big Databases & Cloud Services (P. Baumann)

Indexes

  • speeds up selections on predefined search key field(s)
  • one relation (~file)
  • Any attribute (except BLOB) can be search key for an index on the relation
  • collection of data entries
  • For efficient retrieval of all data entries k* for given key value k
  • Index vs sorted files
  • Both: search faster than just heap
  • Updates: index much faster
slide-4
SLIDE 4

4 340151 Big Databases & Cloud Services (P. Baumann)

B+ Tree Indexes

  • Ordered, balanced tree of degree m
  • Non-leaf pages: index entries = keys & pointers
  • Leaf pages: keys + data pointers; prev/next page chain

Index pages Leaf pages

fill factor

P0 K1 P1 K2 P2 Km Pm

slide-5
SLIDE 5

5 340151 Big Databases & Cloud Services (P. Baumann)

B+-Tree Definition

  • B+-Tree of Order m has the following properties...
  • Property #1 - All leaf nodes must be at same level.
  • Property #2 - All nodes except root must have at least [m/2]-1 keys and maximum of

m-1 keys.

  • Property #3 - All non leaf nodes except root (i.e. all internal nodes) must have at least

m/2 children.

  • Property #5 - A non leaf node with n-1 keys must have n number of children.
  • key values in a node sorted in ascending order

[http://btechsmartclass.com/data_structures/b-trees.html]

slide-6
SLIDE 6

6 340151 Big Databases & Cloud Services (P. Baumann)

B+ Tree: Operations

  • Find 28*? 29*? All > 15* and < 30*?
  • Insert/delete: Find data entry in leaf, change it; adjust parent if needed
  • change sometimes bubbles up the tree
  • O( logF N ) where F = fan-out, N = # leaf pages

2* 3*

Root

17

30 14* 16* 33* 34* 38* 39* 13 5 7* 5* 8* 22* 24* 27 27* 29*

Entries < 17 Entries => 17

Note how data entries in leaf level are sorted

https://www.cs.usfca.edu/~galles/visualization/BTree.html

slide-7
SLIDE 7

7 340151 Big Databases & Cloud Services (P. Baumann)

More Exercises

  • Consider
  • # node reads from disk – determines speed
  • # comparisons – not performance relevant, but for understanding mechanics
  • Find 15, 20, 0
  • Find all 11 – 15; 20 – 32

[https://condor.depaul.edu/ntomuro/courses/417/notes/lecture3.html]

slide-8
SLIDE 8

8 340151 Big Databases & Cloud Services (P. Baumann)

B+ Trees in Practice

  • Typical fill-factor: 67% (outdated; today ~90%)
  • Average fan-out: 133
  • Typical capacities:
  • Height 3: 1333 = 2,352,637 records
  • Height 4: 1334 = 312,900,700 records
  • Can often hold top levels in buffer pool:
  • Level 1 = 1 page = 8 Kbytes
  • Level 2 = 133 pages = 1 Mbyte
  • Level 3 = 17,689 pages = 133 MBytes
slide-9
SLIDE 9

9 340151 Big Databases & Cloud Services (P. Baumann)

Hash-Based Indexes

  • Goal: compute address without disk access
  • get data in O(1)
  • Idea: distribute data evenly into fixed number of “buckets”
  • Compute location from key via Hashing function
  • Ex: h(int r) = r*a mod b, b prime relative to a
  • overflow pages
  • Hash index = bucket set + hashing function
  • Bucket = primary page + 0..n overflow pages
  • only equality, no range queries

[Shankai Yan]

slide-10
SLIDE 10

10 340151 Big Databases & Cloud Services (P. Baumann)

Index Classification

  • Primary index: index contains primary key, otherwise: secondary index
  • Unique index
  • Clustered vs. unclustered:
  • rder of data records same as / `close to’ order of data entries

 clustered index

  • Cost varies greatly
slide-11
SLIDE 11

11 340151 Big Databases & Cloud Services (P. Baumann)

Clustered vs. Unclustered Index

  • To build clustered index, first sort Heap file
  • Overflow pages may be needed for inserts
  • order of data recs `close to’, but not identical to, sort order

Data Records Index entries direct search for data entries Data entries

CLUSTERED UNCLUSTERED

Data file Data Records Index File

slide-12
SLIDE 12

12 340151 Big Databases & Cloud Services (P. Baumann)

  • Composite Search Keys =

Search on combination of fields

  • Equality query:

Every field value equal to a constant value Ex: for <sal,age> index: age=20 and sal=75

  • Range query:

Some field value within interval Ex: age>20; or age=20 and sal>10

  • Data entries in index sorted by search

key to support range queries

Composite Search Keys

<age, sal> 12,20 12,10 11,80 20,75 <age> 11 12 12 20

Data records sorted by name

20 75 12 10 20 80 11 12 name age sal sue bob cal joe <sal, age> 20,12 10,12 75,20 80,11

Data entries in index sorted by <sal,age>

<sal> 10 20 75 80

Data entries sorted by <sal>

Examples of composite key indexes using lexicographic order:

slide-13
SLIDE 13

13 340151 Big Databases & Cloud Services (P. Baumann)

Composite Search Keys: How To Use

  • To retrieve Emp records with age=30 AND sal=4000:
  • index on <age,sal> better than index on age, or index on sal
  • Choice of index key orthogonal to clustering etc.
  • If condition is: 20<age<30 AND 3000<sal<5000:
  • Clustered tree index on <age,sal> or <sal,age> is best
  • If condition is: age=30 AND 3000<sal<5000:
  • Clustered <age,sal> index much better

than <sal,age> index!

  • Composite indexes are larger, updated more often
slide-14
SLIDE 14

14 340151 Big Databases & Cloud Services (P. Baumann)

Index-Only Plans

  • A number of queries can be answered without retrieving any tuples

from one or more of the relations involved …if a suitable index is available

SELECT E.dno, COUNT(*) FROM Emp E GROUP BY E.dno

<E.dno>

SELECT E.dno, MIN(E.sal) FROM Emp E GROUP BY E.dno

<E.dno,E.sal> Tree index!

SELECT AVG(E.sal) FROM Emp E WHERE E.age=25 AND

E.sal BETWEEN 3000 AND 5000 <E. age,E.sal>

  • r

<E.sal, E.age> Tree index!

slide-15
SLIDE 15

15 340151 Big Databases & Cloud Services (P. Baumann)

Index-Only Plans (contd.)

  • Index-only plans possible

if key is <dno,age> or <age,dno>

  • Which is better?

SELECT E.dno, COUNT (*) FROM Emp E WHERE E.age=30 GROUP BY E.dno

slide-16
SLIDE 16

16 340151 Big Databases & Cloud Services (P. Baumann)

Index-Only Plans (contd.)

  • Index-only plans possible

if key is <dno,age> or <age,dno>

  • Which is better?
  • What if we consider the second query?

SELECT E.dno, COUNT (*) FROM Emp E WHERE E.age=30 GROUP BY E.dno SELECT E.dno, COUNT (*) FROM Emp E WHERE E.age>30 GROUP BY E.dno

slide-17
SLIDE 17

17 340151 Big Databases & Cloud Services (P. Baumann)

Index-Only Plans (Contd.)

  • Index-only plans also for queries involving >1 table

SELECT D.dno FROM Dept D, Emp E WHERE D.dno=E.dno

<E.dno>

SELECT D.dno, E.eid FROM Dept D, Emp E WHERE D.dno=E.dno

<E.dno,E.eid>

slide-18
SLIDE 18

18 340151 Big Databases & Cloud Services (P. Baumann)

Understanding the Workload

  • For each query in workload:
  • Which relations accessed?
  • Which attributes retrieved?
  • Which attributes involved in selection/join conditions? How selective likely?
  • For each update in workload:
  • Which attributes involved in selection/join conditions? How selective likely?
  • Type of update (INSERT/DELETE/UPDATE) + attributes affected
  • Trade-off: Indexes can make queries faster, updates slower
  • …and require disk space
slide-19
SLIDE 19

19 340151 Big Databases & Cloud Services (P. Baumann)

  • Attributes in WHERE clause candidates for index keys
  • Exact match condition

hash index

  • Range query

tree index

  • Clustering especially useful for range queries
  • can also help on equality queries if many duplicates
  • Multi-attribute search keys should be considered when a WHERE clause

contains several conditions

  • Order of attributes is important for range queries
  • Such indexes can sometimes enable index-only strategies for important queries
  • For index-only strategies, clustering is not important!

Index Selection Guidelines

slide-20
SLIDE 20

20 340151 Big Databases & Cloud Services (P. Baumann)

  • Choose indexes that benefit as many queries as possible
  • impact on updates: Indexes make queries faster, updates slower
  • require disk space!
  • only one index can be clustered per relation

choose based on important queries that benefit most from clustering

  • understand how DBMS evaluates queries & creates query evaluation plans
  • …a practitioner's approach:
  • Consider most important queries in turn:
  • Consider best plan using current indexes
  • See if a better plan is possible with an additional index
  • If so, create it

Index Selection Guidelines (contd.)

slide-21
SLIDE 21

21 340151 Big Databases & Cloud Services (P. Baumann)

Summary

  • Many alternative file organizations exist
  • each appropriate in some situation
  • If selection queries are frequent: sort file or build index
  • Hash-based indexes only good for equality search
  • Sorted files and tree-based indexes best for range search; also good for equality

search

  • Files rarely kept sorted in practice; B+ tree index is better
  • Index = collection of data entries + way to quickly find entries with given

key values

slide-22
SLIDE 22

22 340151 Big Databases & Cloud Services (P. Baumann)

Summary (Contd.)

  • Index data entries can be actual data records
  • <key, rid> pairs or <key, rid-list> pairs
  • Choice orthogonal to indexing technique used
  • Can have several indexes on a given file of data records
  • each with different search key
  • important influence factors for utility/performance:
  • clustered vs. unclustered
  • primary vs. secondary
  • dense vs. sparse
slide-23
SLIDE 23

23 340151 Big Databases & Cloud Services (P. Baumann)

Summary (Contd.)

  • Understanding application workload & performance goals essential
  • important queries & updates?
  • What attributes/relations involved?
  • speed up important queries (and perhaps some updates!)
  • Index maintenance overhead on updates to key fields
  • Choose indexes that can help many queries, if possible
  • Build indexes to support index-only strategies
  • Clustering is an important decision; only one index per relation!
  • Order of fields in composite index key can be important