07 Part I Intro to Database Systems Andy Pavlo AP AP - PowerPoint PPT Presentation

Tree Indexes 07 Part I Intro to Database Systems Andy Pavlo AP AP 15-445/15-645 Computer Science Carnegie Mellon University Fall 2020

2 ADM IN ISTRIVIA Project #1 is due Sunday Sept 27 th Homework #2 is due Sunday Oct 4 th 15-445/645 (Fall 2020)

3 UPCO M IN G DATABASE TALKS CockroachDB Query Optimizer → Monday Sept 28 th @ 5pm ET Apache Arrow → Monday Oct 5 th @ 5pm ET DataBricks Query Optimizer → Monday Oct 12 th @ 5pm ET 15-445/645 (Fall 2020)

4 DATA STRUCTURES Internal Meta-data Core Data Storage Temporary Data Structures Table Indexes 15-445/645 (Fall 2020)

5 TABLE IN DEXES A table index is a replica of a subset of a table's attributes that are organized and/or sorted for efficient access using a subset of those attributes. The DBMS ensures that the contents of the table and the index are logically in sync. 15-445/645 (Fall 2020)

6 TABLE IN DEXES It is the DBMS's job to figure out the best index(es) to use to execute each query. There is a trade-off on the number of indexes to create per database. → Storage Overhead → Maintenance Overhead 15-445/645 (Fall 2020)

7 TO DAY'S AGEN DA B+Tree Overview Using B+Trees in a DBMS 15-445/645 (Fall 2020)

8 B- TREE FAM ILY There is a specific data structure called a B-Tree . People also use the term to generally refer to a class of balanced tree data structures: → B-Tree (1971) → B+Tree (1973) → B*Tree (1977?) → B link -Tree (1981) 15-445/645 (Fall 2020)

9 B+ TREE A B+Tree is a self-balancing tree data structure that keeps data sorted and allows searches, sequential access, insertions, and deletions in O(log n) . → Generalization of a binary search tree in that a node can have more than two children. → Optimized for systems that read and write large blocks of data. 15-445/645 (Fall 2020)

10 B+ TREE PRO PERTIES A B+Tree is an M -way search tree with the following properties: → It is perfectly balanced (i.e., every leaf node is at the same depth in tree). → Every node other than the root, is at least half-full M/2- 1 ≤ #keys ≤ M -1 → Every inner node with k keys has k+1 non-null children 15-445/645 (Fall 2020)

11 B+ TREE EXAM PLE Inner Node <node*>|<key> 5 9 <5 <9 ≥9 1 3 6 7 9 13 Leaf Nodes Sibling Pointers <value>|<key> 15-445/645 (Fall 2020)

12 N O DES Every B+Tree node is comprised of an array of key/value pairs. → The keys are derived from the attributes(s) that the index is based on. → The values will differ based on whether the node is classified as inner nodes or leaf nodes. The arrays are (usually) kept in sorted key order. 15-445/645 (Fall 2020)

13 B+ TREE LEAF N O DES B+Tree Leaf Node Prev Next ¤ ¤ V1 • • • Kn K1 Vn 15-445/645 (Fall 2020)

13 B+ TREE LEAF N O DES B+Tree Leaf Node Prev Next PageID PageID ¤ ¤ V1 • • • Kn K1 Vn 15-445/645 (Fall 2020)

13 B+ TREE LEAF N O DES B+Tree Leaf Node Prev Next PageID PageID ¤ ¤ ¤ ¤ V1 • • • Kn K1 Vn Key+ Value 15-445/645 (Fall 2020)

13 B+ TREE LEAF N O DES B+Tree Leaf Node Level Slots Prev Next ¤ ¤ # # Sorted Keys K5 • • • Kn K1 K2 K3 K4 Values • • • ¤ ¤ ¤ ¤ ¤ ¤ 15-445/645 (Fall 2020)

14 LEAF N O DE VALUES Approach #1: Record Ids → A pointer to the location of the tuple that the index entry corresponds to. Approach #2: Tuple Data → The actual contents of the tuple is stored in the leaf node. → Secondary indexes must store the record id as their values. 15-445/645 (Fall 2020)

15 B- TREE VS. B+ TREE The original B-Tree from 1972 stored keys + values in all nodes in the tree. → More space efficient since each key only appears once in the tree. A B+Tree only stores values in leaf nodes. Inner nodes only guide the search process. 15-445/645 (Fall 2020)

16 SELECTIO N CO N DITIO N S The DBMS can use a B+Tree index if the query provides any of the attributes of the search key. Example: Index on <a,b,c> → Supported: (a=5 AND b=3) → Supported: (b=3) . Not all DBMSs support this. For hash index, we must have all attributes in search key. 15-445/645 (Fall 2020)

17 SELECTIO N CO N DITIO N S Find Key=(A,B) A ≤ A B ≤ C A,C B,B C,C A,A A,B A,C B,A B,B B,C C,C C,D 15-445/645 (Fall 2020)

17 SELECTIO N CO N DITIO N S Find Key=(A,B) Find Key=(A,*) A ≤ A A,C B,B C,C A,A A,B A,C B,A B,B B,C C,C C,D 15-445/645 (Fall 2020)

17 SELECTIO N CO N DITIO N S Find Key=(A,B) Find Key=(A,*) A ≤ A A,C B,B C,C A,A A,B A,C B,A B,B B,C C,C C,D (A,*) ≤ (B,*) 15-445/645 (Fall 2020)

17 SELECTIO N CO N DITIO N S Find Key=(A,B) Find Key=(A,*) *,A < C,C Find Key=(*,A) A,C B,B C,C A,A A,B A,C B,A B,B B,C C,C C,D (A,A) (A,A) (B,A) (B,A) 15-445/645 (Fall 2020)

18 B+ TREE IN SERT Find correct leaf node L . Put data entry into L in sorted order. If L has enough space, done! Otherwise, split L keys into L and a new node L2 → Redistribute entries evenly, copy up middle key. → Insert index entry pointing to L2 into parent of L . To split inner node, redistribute entries evenly, but push up middle key. Source: Chris Re 15-445/645 (Fall 2020)

19 B+ TREE VISUALIZATIO N https://cmudb.io/btree Source: David Gales (Univ. of San Francisco) 15-445/645 (Fall 2020)

20 B+ TREE DELETE Start at root, find leaf L where entry belongs. Remove the entry. If L is at least half-full, done! If L has only M/2-1 entries, → Try to re-distribute, borrowing from sibling (adjacent node with same parent as L ). → If re-distribution fails, merge L and sibling. If merge occurred, must delete entry (pointing to L or sibling) from parent of L . Source: Chris Re 15-445/645 (Fall 2020)

21 B+ TREE DUPLICATE KEYS Approach #1: Append Record Id → Add the tuple's unique record id as part of the key to ensure that all keys are unique. → The DBMS can still use partial keys to find tuples. Approach #2: Overflow Leaf Nodes → Allow leaf nodes to spill into overflow nodes that contain the duplicate keys. → This is more complex to maintain and modify. 15-445/645 (Fall 2020)

22 B+ TREE APPEN D RECO RD ID Insert 6 5 9 <5 <9 ≥9 1 3 6 7 8 9 13 <Key,RecordId> 15-445/645 (Fall 2020)

22 B+ TREE APPEN D RECO RD ID Insert <6,(Page,Slot)> 5 9 <5 <9 ≥9 1 3 6 7 8 9 13 <Key,RecordId> 15-445/645 (Fall 2020)

22 B+ TREE APPEN D RECO RD ID Insert <6,(Page,Slot)> 5 7 9 9 <9 <5 <7 ≥9 1 3 6 7 6 8 7 8 9 13 <Key,RecordId> 15-445/645 (Fall 2020)

23 B+ TREE OVERFLOW LEAF N O DES Insert 6 5 9 <5 <7 ≥9 1 3 6 7 8 9 13 6 15-445/645 (Fall 2020)

23 B+ TREE OVERFLOW LEAF N O DES Insert 6 5 9 Insert 7 <5 <7 ≥9 1 3 6 7 8 9 13 6 7 15-445/645 (Fall 2020)

23 B+ TREE OVERFLOW LEAF N O DES Insert 6 5 9 Insert 7 <5 <7 ≥9 Insert 6 1 3 6 7 8 9 13 6 7 6 15-445/645 (Fall 2020)

24 CLUSTERED IN DEXES The table is stored in the sort order specified by the primary key. → Can be either heap- or index-organized storage. Some DBMSs always use a clustered index. → If a table does not contain a primary key, the DBMS will automatically make a hidden row id primary key. Other DBMSs cannot use them at all. 15-445/645 (Fall 2020)

25 CLUSTERED B+ TREE Index (Directs search) Traverse to the left-most leaf page, Data Entries ("Sequence set") and then retrieve tuples from all leaf pages. This will always better than external 101 102 103 104 sorting. Data Records 15-445/645 (Fall 2020)

26 H EAP CLUSTERIN G Tuples are sorted in the heap's pages using the order specified by a clustering index. If the query accesses tuples using the 101 102 103 104 clustering index's attributes, then the DBMS can jump directly to the pages that it needs. 15-445/645 (Fall 2020)

26 H EAP CLUSTERIN G Tuples are sorted in the heap's pages Scan Direction using the order specified by a clustering index. If the query accesses tuples using the 101 102 103 104 clustering index's attributes, then the DBMS can jump directly to the pages that it needs. 15-445/645 (Fall 2020)

27 IN DEX SCAN PAGE SO RTIN G Scan Direction Retrieving tuples in the order that appear in an unclustered index is inefficient. 101 102 103 104 The DBMS can first figure out all the tuples that it needs and then sort them based on their page id. 15-445/645 (Fall 2020)

07 Part I Intro to Database Systems Andy Pavlo AP AP - PowerPoint PPT Presentation

Tree Indexes 07 Part I Intro to Database Systems Andy Pavlo AP AP 15-445/15-645 Computer Science Carnegie Mellon University Fall 2020 2 ADM IN ISTRIVIA Project #1 is due Sunday Sept 27 th Homework #2 is due Sunday Oct 4 th 15-445/645

Indexes Database Systems: The Complete Book Ch. 13.1-13.3, 14.1-14.2 1 2 3 $88 $24 4 $88

Leading by Influence: From Within Presented by: Rebecca Durney Research Librarian, Embry-

Leadership Effectiveness in Crisis Robert Wilkinson June 10, 2020 AABE Some Definitions of

Find your Seat: Academic librarians sit at RED Corporate librarians sit at GREEN Allison

B+ tree a dynamic structure that adjusts to changes in the file gracefully. It is the the

Information Systems (Informationssysteme) Jens Teubner, TU Dortmund

Trees (Part 1) 1 / 57 Trees (Part 1) Recap Recap 2 / 57 Trees (Part 1) Recap Hash Tables

Two of many repository structure decisions path length placement of products and

Trees Announcements Congratulations to the Winners of the Hog Strategy Contest 1st Place with 146

nbtree-indexes in PostgreSQL. Useful novelties. 2019 Victor Yegorov vyegorov@dataegret.com

CS 61A Lecture 11 Announcements Box-and-Pointer Notation The Closure Property of Data Types

Assignment 3 Oct 4th Kyle Klassy Announcements Assignment 2 will be graded by October 8th

Trees 5 May 2016 OSU CSE 1 A New Math Type: tree A ubiquitous concept in computing is that

Mat 3770 Relationships Week 7 Mary Trees Examples Balanced Prufer Exercises Spring 2014

PiTree: Practical Implementations of ABR Algorithms Using Decision Trees Paper # P5C-04 Zili Meng

CS 10: Problem solving via Object Oriented Programming Prioritizing 2 Agenda 1. Heaps 2. Heap

B -trees CSCI 333 Williams College Logistics Lab 2b Office hours Tuesday night, 7-9pm

DS504/CS586: Big Data Analytics Data Management Prof. Yanhua Li Time: 6:00pm 8:50pm R

A Datatype for Binary Trees Functions on Trees Amtoft from Hatcliff Binary Trees One of many

CS 225 Data Structures Mar March h 11 11 BT BTrees Wade Fa Wa Fagen-Ul Ulmsch schnei

Storage and Indexing DBS Database Systems Reading: R&G Chapters 8, 9 & 10.1 Implementing

CS143: Index 1 Topics to Learn Important concepts Dense index vs. sparse index Primary

Module 6: Planning Indexes Overview Introduction to Indexes Index Architecture How

Virtual banking beyond borders 2 M a r k e t C h a l l e n g e 68 MIL ILLION REFUGEES

07 Part I Intro to Database Systems Andy Pavlo AP AP - PowerPoint PPT Presentation

Tree Indexes 07 Part I Intro to Database Systems Andy Pavlo AP AP 15-445/15-645 Computer Science Carnegie Mellon University Fall 2020 2 ADM IN ISTRIVIA Project #1 is due Sunday Sept 27 th Homework #2 is due Sunday Oct 4 th 15-445/645

Indexes Database Systems: The Complete Book Ch. 13.1-13.3, 14.1-14.2 1 2 3 $88 $24 4 $88

Leading by Influence: From Within Presented by: Rebecca Durney Research Librarian, Embry-

Leadership Effectiveness in Crisis Robert Wilkinson June 10, 2020 AABE Some Definitions of

Find your Seat: Academic librarians sit at RED Corporate librarians sit at GREEN Allison

B+ tree a dynamic structure that adjusts to changes in the file gracefully. It is the the

Information Systems (Informationssysteme) Jens Teubner, TU Dortmund

Trees (Part 1) 1 / 57 Trees (Part 1) Recap Recap 2 / 57 Trees (Part 1) Recap Hash Tables

Two of many repository structure decisions path length placement of products and

Trees Announcements Congratulations to the Winners of the Hog Strategy Contest 1st Place with 146

nbtree-indexes in PostgreSQL. Useful novelties. 2019 Victor Yegorov vyegorov@dataegret.com

CS 61A Lecture 11 Announcements Box-and-Pointer Notation The Closure Property of Data Types

Assignment 3 Oct 4th Kyle Klassy Announcements Assignment 2 will be graded by October 8th

Trees 5 May 2016 OSU CSE 1 A New Math Type: tree A ubiquitous concept in computing is that

Mat 3770 Relationships Week 7 Mary Trees Examples Balanced Prufer Exercises Spring 2014

PiTree: Practical Implementations of ABR Algorithms Using Decision Trees Paper # P5C-04 Zili Meng

CS 10: Problem solving via Object Oriented Programming Prioritizing 2 Agenda 1. Heaps 2. Heap

B -trees CSCI 333 Williams College Logistics Lab 2b Office hours Tuesday night, 7-9pm

DS504/CS586: Big Data Analytics Data Management Prof. Yanhua Li Time: 6:00pm 8:50pm R

A Datatype for Binary Trees Functions on Trees Amtoft from Hatcliff Binary Trees One of many

CS 225 Data Structures Mar March h 11 11 BT BTrees Wade Fa Wa Fagen-Ul Ulmsch schnei

Storage and Indexing DBS Database Systems Reading: R&amp;G Chapters 8, 9 &amp; 10.1 Implementing

CS143: Index 1 Topics to Learn Important concepts Dense index vs. sparse index Primary

Module 6: Planning Indexes Overview Introduction to Indexes Index Architecture How

Virtual banking beyond borders 2 M a r k e t C h a l l e n g e 68 MIL ILLION REFUGEES

Storage and Indexing DBS Database Systems Reading: R&G Chapters 8, 9 & 10.1 Implementing