Why Is This Important? DB performance depends on time it takes to - PDF document

Why Is This Important?  DB performance depends on time it takes to get the data from storage system and time to process Overview of Storage and Indexing  Choosing the right index for faster access can speed up queries significantly  Understanding why a query is slow helps finding a Chapter 8 remedy  Warning: DBMS is a complex system  Cannot understand every little detail  Our focus: Most important aspects, abstracted enough to make them “digestible” 1 2 Data on External Storage Components of a Disk Spindle  Disks: Can retrieve random page at fixed cost  Platters spin Tracks Disk head  But reading several consecutive pages is much cheaper than reading them  E.g., 10K rpm in random order  Arm assembly is moved  Tapes: Can only read pages in sequence in or out to position a Sector  Cheaper than disks; used for archival storage head on a desired track.  Flash memory: Starting to replace disks due to much faster random  Tracks under heads access make a cylinder.  Writes still slow, size often too small for DB applications  Only one head reads or  File organization: Method of arranging a file of records on external writes at any one time. Platters storage. Arm movement  Record id (rid) is sufficient to physically locate record  Block size is a multiple  Index: data structure for finding the ids of records with given values faster of sector size (which is fixed).  Architecture: Buffer manager stages pages from external storage to main memory buffer pool. File and index layers make calls to the  512 bytes (old), 4096 buffer manager. bytes (new) Arm assembly 3 4 Accessing a Disk Page Records on a Disk Page Rid = (i,N) Page i  Time to access (read/write) a disk block: Rid = (i,2)  Seek time (moving arms to position disk head on track) Rid = (i,1)  Rotational delay (waiting for block to rotate under head)  Transfer time (actually moving data to/from disk surface)  Seek time and rotational delay dominate.  Seek time typically a little below 9msec (consumer disks) 20 16 24 N Pointer to start  Rotational delay around 4msec on average (7.2K rpm disk) N . . . 2 1 # slots of free  Transfer rate disk-to-buffer of 70MB/sec (sustained) space SLOT DIRECTORY  Key to lower I/O cost: reduce seek/rotation delays.  Rid = <page#, slot#>  Hardware vs. software solutions?  Can move records on page without changing rid. 5 6

Possible File Organizations Indexes  Heap (random order) files  An index on a file speeds up selections on the search  Suitable when typical access is a file scan retrieving all key fields for the index. records.  Any subset of the fields of a relation can be the search key  Sorted Files for an index on the relation.  Best if records must be retrieved in some order, or only a  Search key is not the same as key (minimal set of fields `range’ of records is needed. that uniquely identify a record in a relation).  Indexes = data structures to organize records via  An index contains a collection of data entries, and trees or hashing. supports efficient retrieval of all data entries k* with  Like sorted files, they speed up searches for a subset of a given key value k. records, based on values in certain (“search key”) fields  Given data entry k*, we can find record with key k in at  Updates are much faster than in sorted files. most one disk I/O. (Details soon…) 7 8 B+ Tree Indexes Example B+ Tree Note how data entries Root Non-leaf in leaf level are sorted 17 Pages Entries <= 17 Entries > 17 Leaf 5 13 27 30 Pages (Sorted by search key)  Balanced index: all root-to-leaf paths have same length 33* 34* 38* 39*  2* 3* 5* 7* 8* 14* 16* 22* 24* 27* 29* For n data entries, tree has height log n  Leaf pages contain data entries, and are chained (prev & next)  Find 28*? 29*? All > 15* and < 30* Non-leaf pages have index entries; only used to direct searches:  index entry  Insert/delete: Find data entry in leaf, then change it. Need to adjust parent sometimes. P0 K 1 P 1 K 2 P 2 K m P m  And change sometimes propagates up the tree 9 10 Hash-Based Indexes Static Hashing  # primary pages fixed, allocated sequentially, never de-  Good for equality selections. allocated; overflow pages if needed.  Index is a collection of buckets.  h(k) mod N = bucket to which data entry with key k  Bucket = primary page plus zero or more overflow pages. belongs. (N = # of buckets)  h(key) = (a * key + b) usually works well  Buckets contain data entries.  Hashing function h: h(r) = bucket in which (data 0 h(key) mod N 1 entry for) record r belongs. key  h looks at the search key fields of r. h  No need for “index entries” in this scheme. N-1 Primary bucket pages Overflow pages 11 12

Alternative 1 for Data Entries Alternatives for Data Entry k* in Index  In a data entry k* we can store:  Actual data record stored in index  Index structure is a file organization for data records 1. Data record with key value k, or (instead of a Heap file or sorted file). 2. <k, rid of data record with search key value k>, or  At most one index on a given collection of data 3. <k, list of rids of data records with search key k> records can use Alternative 1.  Choice of alternative for data entries is orthogonal to  Otherwise, data records are duplicated, leading to the indexing technique used to locate data entries redundant storage and potential inconsistency. with a given key value k.  If data records are very large, # of pages containing  Typically, index contains auxiliary information that directs data entries is high. Implies size of auxiliary searches to the desired data entries information in the index is also large, typically. 13 14 Alternatives 2 and 3 for Data Entries Index Classification  Data entries typically much smaller than data  Primary vs. secondary: If search key contains primary records. So, better than Alternative 1 with large data key, then called primary index.  Unique index: Search key contains a candidate key. records, especially if search keys are small.  Clustered vs. unclustered: If order of data records is the  Portion of index structure used to direct search, which same as, or `close to’, order of data entries, then called depends on size of data entries, is much smaller than with Alternative 1. clustered index.  Alternative 1 implies clustered  Alternative 3 more compact than Alternative 2, but leads to variable-sized data entries even if search • In practice, clustered also implies Alternative 1 (since sorted files are rare). keys are of fixed length.  A file can be clustered on at most one search key.  Extra cost for accessing data records in another file  Cost of retrieving data records through index varies greatly  Index only return rids based on whether index is clustered or not. 15 16 Clustered vs. Unclustered Index Cost Model for Our Analysis  Suppose Alternative 2 is used for data entries, and that the  We ignore CPU costs, for simplicity: data records are stored in a Heap file.  B : The number of data pages (“blocks”)  To build clustered index, first sort the Heap file (with some free space on each page for future inserts).  R: Number of records per page  Overflow pages may be needed for inserts. (Thus, order of data  D: (Average) time to read or write a single disk page records is `close to’, but not identical to, the sort order.)  Measuring number of page I/O’s ignores gains of Index entries UNCLUSTERED pre-fetching a sequence of pages; thus, even I/O cost CLUSTERED direct search for data entries is only approximated.  Average-case analysis; based on several simplifying Data entries Data entries assumptions. (Index File) (Data file) Good enough to show the overall trends! Data Records Data Records 17 18

Why Is This Important? DB performance depends on time it takes to - PDF document

Why Is This Important? DB performance depends on time it takes to get the data from storage system and time to process Overview of Storage and Indexing Choosing the right index for faster access can speed up queries significantly

Why Im NOT Why Im NOT Why Im NOT Why Im NOT a Hindu Why Im NOT a Hindu

Why is this important? Why is this important? Anyone know what song this is? Yes thats right,

Why is shipping important? CARGO TRINITY HOUSE // KEY STAGE 1 WHY IS SHIPPING IMPORTANT?

The Worlds First SMART LASER The Platform That Redefines the Therapy Laser Industry. Why is

Smart City Wonderful Life Guo

Why is there a price to pay? Why is there a price to pay? Why cant God just

Crises Important and urgent Goals and planning Important but not urgent Assess tasks using

Occupational Hygiene Why is it important? And why ignoring it is bad business? Julian Dowson

WHAT IS HEATING VALUE WHY IS IT IMPORTANT WHAT IS HEATING VALUE WHY IS IT IMPORTANT

Deborah Grove, Principal Grove-Associates Why Green IT is Important Why Green IT is Important

WHY IS MIGRATION SO IMPORTANT? Why migration? German National Team 2014 Why migration?

G 98 7 Artist Workshop Spring 2014 Why We Do What We Do What? Why? The Reason Why what we

Property-Based Testing Matt Bachmann @mattbachmann Testing is Important Testing is Important

Important Concepts Important Concepts Some important concepts in financial and derivative

WELCOME Building Envelope Problems AGENDA WHY IMPORTANT CONDENSATION STONE

WELCOME VENTED and UNVENTED Roof/Deck Systems AGENDA WHY IMPORTANT CODE THEORY

Ceph: A Scalable, High-Performance Distributed File System Sage A. Weil, Scott A. Brandt, Ethan

Disk Formation in Magnetized Dense Cores with Turbulence and Ambipolar Diffusion March 21, 2019

LO-PH Low-Observable Physical Host Instrumentation for Malware Analysis Chad Spensky ,

Virtual Memory Philipp Koehn 25 April 2018 Philipp Koehn Computer Systems Fundamentals: Virtual

Live Disk Forensics on Bare Metal Hongyi Hu and Chad Spensky {hongyi.hu,chad.spensky}@ll.mit.edu

Scaling Your Cache & Caching at Scale Alex Miller @puredanger Mission Why does caching

A disk origin for inner stellar halo structures around the Milky Way Adrian Price-Whelan

wP wPerf rf: : Generic Off-CP CPU Analysis to Iden Identif ify B Bottlenec leneck W Wait

Why Is This Important? DB performance depends on time it takes to - PDF document

Why Is This Important? DB performance depends on time it takes to get the data from storage system and time to process Overview of Storage and Indexing Choosing the right index for faster access can speed up queries significantly

Why Im NOT Why Im NOT Why Im NOT Why Im NOT a Hindu Why Im NOT a Hindu

Why is this important? Why is this important? Anyone know what song this is? Yes thats right,

Why is shipping important? CARGO TRINITY HOUSE // KEY STAGE 1 WHY IS SHIPPING IMPORTANT?

The Worlds First SMART LASER The Platform That Redefines the Therapy Laser Industry. Why is

Smart City Wonderful Life Guo

Why is there a price to pay? Why is there a price to pay? Why cant God just

Crises Important and urgent Goals and planning Important but not urgent Assess tasks using

Occupational Hygiene Why is it important? And why ignoring it is bad business? Julian Dowson

WHAT IS HEATING VALUE WHY IS IT IMPORTANT WHAT IS HEATING VALUE WHY IS IT IMPORTANT

Deborah Grove, Principal Grove-Associates Why Green IT is Important Why Green IT is Important

WHY IS MIGRATION SO IMPORTANT? Why migration? German National Team 2014 Why migration?

G 98 7 Artist Workshop Spring 2014 Why We Do What We Do What? Why? The Reason Why what we

Property-Based Testing Matt Bachmann @mattbachmann Testing is Important Testing is Important

Important Concepts Important Concepts Some important concepts in financial and derivative

WELCOME Building Envelope Problems AGENDA WHY IMPORTANT CONDENSATION STONE

WELCOME VENTED and UNVENTED Roof/Deck Systems AGENDA WHY IMPORTANT CODE THEORY

Ceph: A Scalable, High-Performance Distributed File System Sage A. Weil, Scott A. Brandt, Ethan

Disk Formation in Magnetized Dense Cores with Turbulence and Ambipolar Diffusion March 21, 2019

LO-PH Low-Observable Physical Host Instrumentation for Malware Analysis Chad Spensky ,

Virtual Memory Philipp Koehn 25 April 2018 Philipp Koehn Computer Systems Fundamentals: Virtual

Live Disk Forensics on Bare Metal Hongyi Hu and Chad Spensky {hongyi.hu,chad.spensky}@ll.mit.edu

Scaling Your Cache &amp; Caching at Scale Alex Miller @puredanger Mission Why does caching

A disk origin for inner stellar halo structures around the Milky Way Adrian Price-Whelan

wP wPerf rf: : Generic Off-CP CPU Analysis to Iden Identif ify B Bottlenec leneck W Wait

Scaling Your Cache & Caching at Scale Alex Miller @puredanger Mission Why does caching