overview of storage and indexing
play

Overview of Storage and Indexing Chapter 8 Instructor: Vladimir - PDF document

Overview of Storage and Indexing Chapter 8 Instructor: Vladimir Zadorozhny vladimir@sis.pitt.edu Information Science Program School of Information Sciences, University of Pittsburgh 1 Database Management Systems, R. Ramakrishnan and J.


  1. Overview of Storage and Indexing Chapter 8 Instructor: Vladimir Zadorozhny vladimir@sis.pitt.edu Information Science Program School of Information Sciences, University of Pittsburgh 1 Database Management Systems, R. Ramakrishnan and J. Gehrke INFSCI2710 Instructor: Vladimir Zadorozhny Data on External Storage  Disks: Can retrieve random page at fixed cost  But reading several consecutive pages is much cheaper than reading them in random order  File organization: Method of arranging a file of records on external storage.  Record id (rid) is sufficient to physically locate record  Indexes are data structures that allow us to find the record ids of records with given values in index search key fields  Architecture: Buffer manager stages pages from external storage to main memory buffer pool. File and index layers make calls to the buffer manager. 2 Database Management Systems, R. Ramakrishnan and J. Gehrke INFSCI2710 Instructor: Vladimir Zadorozhny

  2. Indexes  An index on a file speeds up selections on the search key fields for the index.  Any subset of the fields of a relation can be the search key for an index on the relation.  Search key is not the same as key (minimal set of fields that uniquely identify a record in a relation).  An index supports efficient retrieval of all data entries with a given key value k . 3 Database Management Systems, R. Ramakrishnan and J. Gehrke INFSCI2710 Instructor: Vladimir Zadorozhny Index Classification  Primary vs. secondary : If search key contains primary key, then called primary index.  Unique index: Search key contains a candidate key.  Clustered vs. unclustered : If order of data records is the same as, or `close to ’ , order of data entries, then called clustered index.  A file can be clustered on at most one search key.  Cost of retrieving data records through index varies greatly based on whether index is clustered or not!  Dense vs. sparse : If there is an index data entry for each data records then called dense index. 4 Database Management Systems, R. Ramakrishnan and J. Gehrke INFSCI2710 Instructor: Vladimir Zadorozhny

  3. B+ Tree Indexes Non-leaf Pages Leaf Pages  Leaf pages contain data entries , and are chained (prev & next)  Non-leaf pages contain index entries and direct searches: index entry P0 K 1 P 1 K 2 P m P 2 K m 5 Database Management Systems, R. Ramakrishnan and J. Gehrke INFSCI2710 Instructor: Vladimir Zadorozhny B+ Tree  Insert/delete at log F N cost; keep tree height- balanced . (F = fanout, N = # leaf pages)  Minimum 50% occupancy (except for root). Each node contains d <= m <= 2 d entries. The parameter d is called the order of the tree.  Supports equality and range-searches efficiently. Index Entries (Direct search) Data Entries ("Sequence set") 6 Database Management Systems, R. Ramakrishnan and J. Gehrke INFSCI2710 Instructor: Vladimir Zadorozhny

  4. Queries on B + -Trees  Find all records with a search-key value of k.  Start with the root node • Examine the node for the smallest search-key value > k. • If such a value exists, assume it is K j . The follow P i to the child node • Otherwise k  K m – 1 , where there are m pointers in the node. Then follow P m to the child node.  If the node reached by following the pointer above is not a leaf node, repeat the above procedure on the node, and follow the corresponding pointer.  Eventually reach a leaf node. If key K i = k, follow pointer P i to the desired record or bucket. Else no record with search-key value k exists. 7 Database Management Systems, R. Ramakrishnan and J. Gehrke INFSCI2710 Instructor: Vladimir Zadorozhny Example B+ Tree Root 17 Entries < 17 Entries >= 17 5 13 27 30 2* 3* 33* 34* 38* 39* 5* 7* 8* 22* 24* 27* 29* 14* 16*  Find 28*? 29*? All > 15* and < 30*  Insert/delete: Find data entry in leaf, then change it. Need to adjust parent sometimes.  And change sometimes bubbles up the tree 8 Database Management Systems, R. Ramakrishnan and J. Gehrke INFSCI2710 Instructor: Vladimir Zadorozhny

  5. B+ Trees in Practice  Typical order: 100. Typical fill-factor: 67%.  average fanout = 133  Typical capacities:  Height 4: 133 4 = 312,900,700 records  Height 3: 133 3 = 2,352,637 records  Can often hold top levels in buffer pool:  Level 1 = 1 page = 8 Kbytes  Level 2 = 133 pages = 1 Mbyte  Level 3 = 17,689 pages = 133 MBytes 9 Database Management Systems, R. Ramakrishnan and J. Gehrke INFSCI2710 Instructor: Vladimir Zadorozhny Index Definition in SQL  Create an index create index <index-name> on <relation-name> (<attribute-list>) E.g.: create index b-index on branch(branch-name)  Use create unique index to indirectly specify and enforce the condition that the search key is a candidate keyis a candidate key.  To drop an index drop index <index-name> 10 Database Management Systems, R. Ramakrishnan and J. Gehrke INFSCI2710 Instructor: Vladimir Zadorozhny

  6. Multiple-Key Access  Use multiple indices for certain types of queries.  Example: select account-number from account where branch-name = “ Perryridge ” and balance = 1000  Possible strategies for processing query using indices on single attributes: 1. Use index on branch-name to find accounts with branch-name = “ Perryridge ” .; test balance = $1000. 2. Use index on balance to find accounts with balances of $1000; test branch-name = “ Perryridge ” . 3. Use branch-name index to find pointers to all records pertaining to the Perryridge branch. Similarly use index on balance . Take intersection of both sets of pointers obtained. 11 Database Management Systems, R. Ramakrishnan and J. Gehrke INFSCI2710 Instructor: Vladimir Zadorozhny Indices on Multiple Attributes Suppose we have an index on combined search-key ( branch-name, balance ).  With the where clause where branch-name = “ Perryridge ” and balance = 1000 the index on the combined search-key will fetch only records that satisfy both conditions. Using separate indices in less efficient — we may fetch many records (or pointers) that satisfy only one of the conditions. 12 Database Management Systems, R. Ramakrishnan and J. Gehrke INFSCI2710 Instructor: Vladimir Zadorozhny

  7. Choice of Indexes  What indexes should we create?  Which relations should have indexes? What field(s) should be the search key? Should we build several indexes?  For each index, what kind of an index should it be?  Clustered? Hash/tree? 13 Database Management Systems, R. Ramakrishnan and J. Gehrke INFSCI2710 Instructor: Vladimir Zadorozhny Hash-Based Indexes  Good for equality selections. • Index is a collection of buckets. Bucket = primary page plus zero or more overflow pages. • Hashing function h : h ( r ) = bucket in which record r belongs. h looks at the search key fields of r. 14 Database Management Systems, R. Ramakrishnan and J. Gehrke INFSCI2710 Instructor: Vladimir Zadorozhny

  8. Summary  Many alternative file organizations exist, each appropriate in some situation.  If selection queries are frequent, sorting the file or building an index is important.  Hash-based indexes only good for equality search.  Sorted files and tree-based indexes best for range search; also good for equality search. (Files rarely kept sorted in practice; B+ tree index is better.)  Index is a collection of data entries plus a way to quickly find entries with given key values. 15 Database Management Systems, R. Ramakrishnan and J. Gehrke INFSCI2710 Instructor: Vladimir Zadorozhny Summary (Contd.)  Data entries can be actual data records, <key, rid> pairs, or <key, rid-list> pairs.  Choice orthogonal to indexing technique used to locate data entries with a given key value.  Can have several indexes on a given file of data records, each with a different search key.  Indexes can be classified as clustered vs. unclustered, primary vs. secondary, and dense vs. sparse. Differences have important consequences for utility/performance. 16 Database Management Systems, R. Ramakrishnan and J. Gehrke INFSCI2710 Instructor: Vladimir Zadorozhny

Download Presentation
Download Policy: The content available on the website is offered to you 'AS IS' for your personal information and use only. It cannot be commercialized, licensed, or distributed on other websites without prior consent from the author. To download a presentation, simply click this link. If you encounter any difficulties during the download process, it's possible that the publisher has removed the file from their server.

Recommend


More recommend