Overview of Storage and Indexing Chapter 8 Instructor: Vladimir - - PDF document

overview of storage and indexing
SMART_READER_LITE
LIVE PREVIEW

Overview of Storage and Indexing Chapter 8 Instructor: Vladimir - - PDF document

Overview of Storage and Indexing Chapter 8 Instructor: Vladimir Zadorozhny vladimir@sis.pitt.edu Information Science Program School of Information Sciences, University of Pittsburgh 1 Database Management Systems, R. Ramakrishnan and J.


slide-1
SLIDE 1

1

Database Management Systems, R. Ramakrishnan and J. Gehrke INFSCI2710 Instructor: Vladimir Zadorozhny

Overview of Storage and Indexing

Chapter 8 Instructor: Vladimir Zadorozhny vladimir@sis.pitt.edu Information Science Program School of Information Sciences, University of Pittsburgh

2

Database Management Systems, R. Ramakrishnan and J. Gehrke INFSCI2710 Instructor: Vladimir Zadorozhny

Data on External Storage

 Disks: Can retrieve random page at fixed cost

  • But reading several consecutive pages is much cheaper than

reading them in random order

 File organization: Method of arranging a file of records

  • n external storage.
  • Record id (rid) is sufficient to physically locate record
  • Indexes are data structures that allow us to find the record ids
  • f records with given values in index search key fields

 Architecture: Buffer manager stages pages from external

storage to main memory buffer pool. File and index layers make calls to the buffer manager.

slide-2
SLIDE 2

3

Database Management Systems, R. Ramakrishnan and J. Gehrke INFSCI2710 Instructor: Vladimir Zadorozhny

Indexes

 An index on a file speeds up selections on the

search key fields for the index.

  • Any subset of the fields of a relation can be the

search key for an index on the relation.

  • Search key is not the same as key (minimal set of

fields that uniquely identify a record in a relation).

 An index supports efficient retrieval of all

data entries with a given key value k.

4

Database Management Systems, R. Ramakrishnan and J. Gehrke INFSCI2710 Instructor: Vladimir Zadorozhny

Index Classification

 Primary vs. secondary: If search key contains

primary key, then called primary index.

  • Unique index: Search key contains a candidate key.

 Clustered vs. unclustered: If order of data records

is the same as, or `close to’, order of data entries, then called clustered index.

  • A file can be clustered on at most one search key.
  • Cost of retrieving data records through index varies

greatly based on whether index is clustered or not!

 Dense vs. sparse: If there is an index data entry

for each data records then called dense index.

slide-3
SLIDE 3

5

Database Management Systems, R. Ramakrishnan and J. Gehrke INFSCI2710 Instructor: Vladimir Zadorozhny

B+ Tree Indexes

 Leaf pages contain data entries, and are chained (prev & next)  Non-leaf pages contain index entries and direct searches:

P0 K 1 P 1 K 2 P 2 K m P m

index entry

Non-leaf Pages Pages Leaf 6

Database Management Systems, R. Ramakrishnan and J. Gehrke INFSCI2710 Instructor: Vladimir Zadorozhny

B+ Tree

 Insert/delete at log F N cost; keep tree height-

  • balanced. (F = fanout, N = # leaf pages)

 Minimum 50% occupancy (except for root). Each

node contains d <= m <= 2d entries. The parameter d is called the order of the tree.

 Supports equality and range-searches efficiently.

Index Entries Data Entries ("Sequence set") (Direct search)

slide-4
SLIDE 4

7

Database Management Systems, R. Ramakrishnan and J. Gehrke INFSCI2710 Instructor: Vladimir Zadorozhny

Queries on B+-Trees

 Find all records with a search-key value of k.

  • Start with the root node
  • Examine the node for the smallest search-key

value > k.

  • If such a value exists, assume it is Kj. The

follow Pi to the child node

  • Otherwise k  Km–1, where there are m pointers

in the node. Then follow Pm to the child node.

  • If the node reached by following the pointer

above is not a leaf node, repeat the above procedure on the node, and follow the corresponding pointer.

  • Eventually reach a leaf node. If key Ki = k, follow

pointer Pi to the desired record or bucket. Else no record with search-key value k exists.

8

Database Management Systems, R. Ramakrishnan and J. Gehrke INFSCI2710 Instructor: Vladimir Zadorozhny

Example B+ Tree

 Find 28*? 29*? All > 15* and < 30*  Insert/delete: Find data entry in leaf, then

change it. Need to adjust parent sometimes.

  • And change sometimes bubbles up the tree

2* 3* Root

17

30 14* 16* 33* 34* 38* 39* 13 5 7* 5* 8* 22* 24* 27 27* 29*

Entries < 17 Entries >= 17

slide-5
SLIDE 5

9

Database Management Systems, R. Ramakrishnan and J. Gehrke INFSCI2710 Instructor: Vladimir Zadorozhny

B+ Trees in Practice

 Typical order: 100. Typical fill-factor: 67%.

  • average fanout = 133

 Typical capacities:

  • Height 4: 1334 = 312,900,700 records
  • Height 3: 1333 = 2,352,637 records

 Can often hold top levels in buffer pool:

  • Level 1 = 1 page = 8 Kbytes
  • Level 2 = 133 pages = 1 Mbyte
  • Level 3 = 17,689 pages = 133 MBytes

10

Database Management Systems, R. Ramakrishnan and J. Gehrke INFSCI2710 Instructor: Vladimir Zadorozhny

Index Definition in SQL

 Create an index

create index <index-name> on <relation-name> (<attribute-list>) E.g.: create index b-index on branch(branch-name)

 Use create unique index to indirectly specify

and enforce the condition that the search key is a candidate keyis a candidate key.

 To drop an index

drop index <index-name>

slide-6
SLIDE 6

11

Database Management Systems, R. Ramakrishnan and J. Gehrke INFSCI2710 Instructor: Vladimir Zadorozhny

Multiple-Key Access

 Use multiple indices for certain types of queries.  Example:

select account-number from account where branch-name = “Perryridge” and balance = 1000

 Possible strategies for processing query using indices on

single attributes:

  • 1. Use index on branch-name to find accounts with branch-name =

“Perryridge”.; test balance = $1000.

  • 2. Use index on balance to find accounts with balances of $1000; test

branch-name = “Perryridge”.

  • 3. Use branch-name index to find pointers to all records pertaining to the

Perryridge branch. Similarly use index on balance. Take intersection of both sets of pointers obtained.

12

Database Management Systems, R. Ramakrishnan and J. Gehrke INFSCI2710 Instructor: Vladimir Zadorozhny

Indices on Multiple Attributes

 With the where clause

where branch-name = “Perryridge” and balance = 1000

the index on the combined search-key will fetch only records that satisfy both conditions. Using separate indices in less efficient — we may fetch many records (or pointers) that satisfy only one of the conditions.

Suppose we have an index on combined search-key (branch-name, balance).

slide-7
SLIDE 7

13

Database Management Systems, R. Ramakrishnan and J. Gehrke INFSCI2710 Instructor: Vladimir Zadorozhny

Choice of Indexes

 What indexes should we create?

  • Which relations should have indexes? What field(s)

should be the search key? Should we build several indexes?

 For each index, what kind of an index should it

be?

  • Clustered? Hash/tree?

14

Database Management Systems, R. Ramakrishnan and J. Gehrke INFSCI2710 Instructor: Vladimir Zadorozhny

Hash-Based Indexes

 Good for equality selections.

  • Index is a collection of buckets. Bucket = primary

page plus zero or more overflow pages.

  • Hashing function h: h(r) = bucket in which

record r belongs. h looks at the search key fields

  • f r.
slide-8
SLIDE 8

15

Database Management Systems, R. Ramakrishnan and J. Gehrke INFSCI2710 Instructor: Vladimir Zadorozhny

Summary

 Many alternative file organizations exist, each

appropriate in some situation.

 If selection queries are frequent, sorting the

file or building an index is important.

  • Hash-based indexes only good for equality search.
  • Sorted files and tree-based indexes best for range

search; also good for equality search. (Files rarely kept sorted in practice; B+ tree index is better.)

 Index is a collection of data entries plus a way

to quickly find entries with given key values.

16

Database Management Systems, R. Ramakrishnan and J. Gehrke INFSCI2710 Instructor: Vladimir Zadorozhny

Summary (Contd.)

 Data entries can be actual data records, <key,

rid> pairs, or <key, rid-list> pairs.

  • Choice orthogonal to indexing technique used to

locate data entries with a given key value.

 Can have several indexes on a given file of

data records, each with a different search key.

 Indexes can be classified as clustered vs.

unclustered, primary vs. secondary, and dense vs. sparse. Differences have important consequences for utility/performance.