CSE 232A Graduate Database Systems, Arun Kumar. Topic 2: Indexing and Sorting

SLIDE 1

Topic 2: Indexing and Sorting (Chapters 10, 11, and 13 of Cow Book)

Arun Kumar


CSE 232A Graduate Database Systems

Slide ACKs: Jignesh Patel, Paris Koutris

SLIDE 2

Motivation for Indexing

❖ Consider the following SQL query:

SELECT * FROM Movies WHERE Year=2017

Movies (M)

MovieID Name Year Director

Q: How to obtain the matching records from the file?
❖ Heap file? Need to do a linear scan! O(N) I/O and CPU
❖ “Sorted” file? Binary search! O(log2(N)) I/O and CPU

SELECT * FROM Movies WHERE Year>=2000 AND Year<2010

Indexing helps retrieve records faster for selective predicates!
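To make the O(N) vs O(log2(N)) contrast concrete, here is a back-of-the-envelope sketch in Python; the 1-million-page file size is a made-up illustration, not from the slides:

```python
import math

def heap_scan_ios(num_pages):
    """Heap file: a linear scan must read every page."""
    return num_pages

def sorted_file_ios(num_pages):
    """Sorted file: binary search over the pages locates a match."""
    return math.ceil(math.log2(num_pages))

# Hypothetical 1-million-page Movies file:
print(heap_scan_ios(1_000_000))    # 1000000 page I/Os
print(sorted_file_ios(1_000_000))  # 20 page I/Os
```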

SLIDE 3

Another View of Storage Manager

[Diagram] I/O Accesses ↔ I/O Manager
Access Methods: Heap File, Sorted File, B+ Tree Index, Hash Index
Buffer Manager, Concurrency Control Manager, Recovery Manager

SLIDE 4

Indexing: Outline

❖ Overview and Terminology
❖ B+ Tree Index
❖ Hash Index
❖ Learned Index

SLIDE 5

Indexing

❖ Index: A data structure to speed up record retrieval
❖ Search Key: Attribute(s) on which the file is indexed; also called Index Key (used interchangeably)
  ❖ Any permutation of any subset of a relation’s attributes can be the index key for an index
  ❖ Index key need not be a primary/candidate key
❖ Two main types of indexes:
  ❖ B+ Tree index: good for both range and equality search
  ❖ Hash index: good for equality search

SLIDE 6

Overview of Indexes

❖ Need to consider efficiency of search, insert, and delete
❖ Primarily optimized to reduce (disk) I/O cost
❖ B+ Tree index:
  ❖ O(logF(N)) I/O and CPU cost for equality search (N: number of “data entries”; F: “fanout” of a non-leaf node)
  ❖ Range search, Insert, and Delete all start with an equality search
❖ Hash index:
  ❖ O(1) I/O and CPU cost for equality search
  ❖ Insert and Delete start with an equality search
  ❖ Not “good” for range search!
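The O(logF(N)) bound above is just the tree height; a quick numeric sanity check (the 10-million-entry count is a hypothetical value, the fanout 133 comes from a later slide):

```python
import math

def btree_equality_search_ios(n_data_entries, fanout):
    """I/O cost of a B+ Tree equality search ~ tree height = ceil(log_F(N))."""
    return math.ceil(math.log(n_data_entries) / math.log(fanout))

# e.g., 10 million data entries with fanout F = 133; a hash index would be O(1):
print(btree_equality_search_ios(10_000_000, 133))  # 4
```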


SLIDE 7

What is stored in the Index?

❖ 2 things: Search/index key values and data entries
❖ Alternatives for data entries for a given key value k:
  ❖ AltRecord: Actual data records of the file that match k
  ❖ AltRID: <k, RID of a record that matches k>
  ❖ AltRIDlist: <k, list of RIDs of records that match k>
❖ API for operations on records:
  ❖ Search (IndexKey); could be a predicate for B+ Tree
  ❖ Insert (IndexKey, data entry)
  ❖ Delete (IndexKey); could be a predicate for B+ Tree


SLIDE 8

Overview of B+ Tree Index

[Figure] Non-leaf pages hold index entries of the form (IndexKey value, PageID); leaf pages hold data entries, e.g., AltRID: (IndexKey value, RID)

❖ Non-leaf pages do not contain data values; they contain between d and 2d index keys, where d is the order parameter
❖ Height-balanced tree; only the root can have fewer than d keys
❖ Leaf pages are in sorted order of IndexKey; connected as a doubly linked list
Q: What is the difference between “B+ Tree” and “B Tree”?


SLIDE 9

Overview of Hash Index

[Figure] SearchKey → hash function h → primary bucket pages 0 .. N-1; overflow pages chained off full buckets

❖ Bucket pages have data entries (same 3 Alternatives)
❖ Hash function helps obtain O(1) search time


SLIDE 10

Trade-offs of Data Entry Alternatives

❖ Pros and cons of the alternatives for data entries:
  ❖ AltRecord: The entire file is stored as an index! If records are long, the data entries of the index are large and search time could be high
  ❖ AltRID and AltRIDlist: Data entries typically smaller than records; often faster for equality search
  ❖ AltRIDlist has more compact data entries than AltRID, but its entries are variable-length
Q: A file can have at most one AltRecord index. Why?


SLIDE 11

More Indexing-related Terminology

❖ Composite Index: IndexKey has > 1 attributes
❖ Primary Index: IndexKey contains the primary key
❖ Secondary Index: Any index that is not a primary index
❖ Unique Index: IndexKey contains a candidate key
  ❖ All primary indexes are unique indexes!

MovieID Name Year Director IMDB_URL

IMDB_URL is a candidate key Index on MovieID? Index on Year? Index on Director? Index on IMDB_URL? Index on (Year,Name)?


SLIDE 12

More Indexing-related Terminology

❖ Clustered index: the order in which records are laid out is the same as (or “very close to”) the order of the IndexKey domain
  ❖ Matters for (range) search performance!
  ❖ AltRecord implies the index is clustered. Why?
  ❖ In practice, clustered almost always implies AltRecord
  ❖ In practice, a file is clustered on at most 1 IndexKey
❖ Unclustered index: an index that is not clustered

MovieID Name Year Director IMDB_URL

Index on Year? Index on (Year, Name)?


SLIDE 13

Indexing: Outline

❖ Overview and Terminology
❖ B+ Tree Index
❖ Hash Index
❖ Learned Index

SLIDE 14

[Figure] Root: 13 17 24 30; leaves: [2* 3* 5* 7*] [14* 16*] [19* 20* 22*] [24* 27* 29*] [33* 34* 38* 39*]; Height = 1

B+ Tree Index: Search


❖ Given SearchKey k, start from the root; compare k with the IndexKeys in the non-leaf/index entries; descend to the correct child; keep descending like so till a leaf node is reached
❖ Comparison within non-leaf nodes: binary/linear search
Examples: search 7*; 8*; 24*; range [19*,33*]
Order = 2
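The descent can be sketched in a few lines of Python; this is a toy mirror of the slide’s example tree, not a full B+ Tree (no page layout, no sibling links):

```python
import bisect

class Node:
    def __init__(self, keys, children=None, entries=None):
        self.keys = keys          # sorted IndexKeys within the node
        self.children = children  # non-leaf: len(keys)+1 child pointers
        self.entries = entries    # leaf: data entries aligned with keys

def search(node, k):
    """Descend from the root to the leaf that may contain SearchKey k."""
    while node.children is not None:           # still at a non-leaf node
        i = bisect.bisect_right(node.keys, k)  # binary search within node
        node = node.children[i]
    # at a leaf: return the matching data entries
    return [e for key, e in zip(node.keys, node.entries) if key == k]

# Tiny tree mirroring the slide (root keys 13 17 24 30, order d = 2):
leaves = [Node([2, 3, 5, 7], entries=["2*", "3*", "5*", "7*"]),
          Node([14, 16], entries=["14*", "16*"]),
          Node([19, 20, 22], entries=["19*", "20*", "22*"]),
          Node([24, 27, 29], entries=["24*", "27*", "29*"]),
          Node([33, 34, 38, 39], entries=["33*", "34*", "38*", "39*"])]
root = Node([13, 17, 24, 30], children=leaves)
print(search(root, 7))   # ['7*']
print(search(root, 8))   # []  (not found)
print(search(root, 24))  # ['24*']
```

A range search such as [19*, 33*] would descend to the leaf for 19* and then follow the leaf-level sibling links, which this sketch omits.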

SLIDE 15

B+ Tree Index: Page Format

Leaf Page: Prev Page Pointer P0; data entries (K1, R1), (K2, R2), …, (Kn, Rn), where each Ri points to record i; Next Page Pointer Pn+1

Non-leaf Page: index entries P1, K1, P2, K2, P3, …, Km, Pm+1
  P1 → page with Values < K1
  P2 → page with values s.t. K1 ≤ Values < K2
  P3 → page with values s.t. K2 ≤ Values < K3
  …
  Pm+1 → page with values ≥ Km

Order = m/2

SLIDE 16

B+ Trees in Practice


❖ Typical order value: 100 (so, a non-leaf node can have up to 200 index keys)
❖ Typical occupancy: 67%; so, typical “fanout” F = 133
❖ Computing the tree’s capacity using fanout:
  ❖ Height 1 stores 133 leaf pages
  ❖ Height 4 stores 133^4 = 312,900,721 leaf pages
❖ Typically, the higher levels of the B+ Tree are cached in the buffer pool:
  ❖ Level 0 (root) = 1 page = 8 KB
  ❖ Level 1 = 133 pages ~ 1 MB
  ❖ Level 2 = 17,689 pages ~ 138 MB, and so on
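The capacity figures above are plain arithmetic and can be checked directly (assuming the 8 KB page size stated on the slide):

```python
# Arithmetic check of the capacity numbers (8 KB pages assumed):
fanout = 133
print(fanout ** 1)             # 133 leaf pages at height 1
print(fanout ** 4)             # 312900721 leaf pages at height 4
print(fanout ** 2)             # 17689 level-2 pages
print(fanout ** 2 * 8 / 1024)  # ~138 MB of level-2 pages
```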

SLIDE 17

B+ Tree Index: Insert


❖ Search for the correct leaf L
❖ Insert the data entry into L; if L has enough space, done! Otherwise, must split L (into L and a new leaf L’)
  ❖ Redistribute entries evenly, copy up the middle key
  ❖ Insert an index entry pointing to L’ into the parent of L
❖ A split might have to propagate upwards recursively:
  ❖ To split a non-leaf node, redistribute entries evenly, but push up the middle key (not copy up, as in leaf splits!)
❖ Splits “grow” the tree; a root split increases the height
❖ Tree growth: gets wider or one level taller at the top
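The copy-up vs push-up distinction can be sketched directly; this toy code only shows the split decision on entry lists, not the full pointer surgery:

```python
def split_leaf(entries):
    """Split an overflowed leaf: redistribute evenly, COPY the middle key up
    (it continues to appear in the new right leaf)."""
    mid = len(entries) // 2
    left, right = entries[:mid], entries[mid:]
    copied_up = right[0][0]
    return left, right, copied_up

def split_nonleaf(keys):
    """Split an overflowed non-leaf: PUSH the middle key up (it moves out)."""
    mid = len(keys) // 2
    pushed_up = keys[mid]
    return keys[:mid], keys[mid + 1:], pushed_up

# Inserting 8* overflows the leaf [2* 3* 5* 7*] (order d = 2, max 4 entries);
# 5 is copied up, exactly as on the slide:
left, right, sep = split_leaf([(2, "2*"), (3, "3*"), (5, "5*"), (7, "7*"), (8, "8*")])
print(left, right, sep)  # [(2,'2*'), (3,'3*')] [(5,'5*'), (7,'7*'), (8,'8*')] 5

# The parent overflows when 5 arrives; 17 is pushed up into the new root:
print(split_nonleaf([5, 13, 17, 24, 30]))  # ([5, 13], [24, 30], 17)
```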

SLIDE 18

[Figure] Root: 13 17 24 30; inserting 8* overflows the leaf [2* 3* 5* 7*]; after the split, the leaves are [2* 3*] and [5* 7* 8*], and key 5 is the entry to be inserted in the parent node: copied up (and continues to appear in the leaf)

B+ Tree Index: Insert


Example: Insert 8* Split! Split! Height++

SLIDE 19

[Figure] The full non-leaf node [5 13 17 24 30] splits: 17 is the entry to insert in the parent node, pushed up (and only appears once in the index)
Minimum occupancy is guaranteed in both leaf and non-leaf page splits

B+ Tree Index: Insert


Example: Insert 8*

SLIDE 20

[Figure] New Root: 17; children [5 13] and [24 30]; leaves: [2* 3*] [5* 7* 8*] [14* 16*] [19* 20* 22*] [24* 27* 29*] [33* 34* 38* 39*]

B+ Tree Index: Insert


❖ Recursive splitting went up to the root; height went up by 1
❖ Splitting is somewhat expensive; is it avoidable?
❖ Can redistribute data entries with the left or right sibling, if there is space!
Example: Insert 8*

SLIDE 21

[Figure] Root: 8 17 24 30; instead of splitting, 8* is redistributed into the right sibling, giving leaves [2* 3* 5* 7*] and [8* 14* 16*]; the parent key 13 is updated to 8

Insert: Leaf Node Redistribution


❖ Redistributing data entries with a sibling improves page occupancy at the leaf level and avoids too many splits; but it is usually not used for non-leaf node splits:
  ❖ Could increase I/O cost (checking siblings)
  ❖ Propagating internal splits is better amortization
  ❖ Pointer management headaches
Example: Insert 8*

SLIDE 22

B+ Tree Index: Delete


❖ Start at the root, find the leaf L where the entry belongs
❖ Remove the entry; if L is at least half-full, done! Else, if L has only d-1 entries:
  ❖ Try to re-distribute, borrowing from a sibling L’
  ❖ If re-distribution fails, merge L and L’ into a single leaf
❖ If a merge occurred, must delete the entry (pointing to L or its sibling) from the parent of L
❖ A merge might have to propagate upwards recursively to the root, which decreases the height by 1
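The redistribute-vs-merge decision after a leaf deletion can be sketched as follows; this toy treats leaves as key lists and ignores parent-pointer updates:

```python
def delete_from_leaf(leaf, sibling, d, key):
    """Sketch of the decision after deleting `key` from leaf L with order d:
    borrow from sibling L' if it has spare entries, else merge the leaves."""
    leaf = [k for k in leaf if k != key]
    if len(leaf) >= d:                        # still at least half-full: done
        return ("done", leaf, sibling)
    if len(sibling) > d:                      # borrow: redistribute evenly
        merged = sorted(leaf + sibling)
        half = len(merged) // 2
        return ("redistributed", merged[:half], merged[half:])
    return ("merged", sorted(leaf + sibling), [])  # parent loses an entry

# With d = 2: deleting 20 leaves [19]; the sibling has a spare entry, so 24
# is borrowed and the new middle key (27) would be copied up in the parent:
print(delete_from_leaf([19, 20], [24, 27, 29], 2, 20))
# ('redistributed', [19, 24], [27, 29])

# Deleting 24 afterwards: the sibling has no spare entry, so the leaves merge
# (and the merge may propagate upwards):
print(delete_from_leaf([19, 24], [27, 29], 2, 24))
# ('merged', [19, 27, 29], [])
```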

SLIDE 23

[Figure] Root 17; children [5 13] and [27 30]; after deleting 22* and 20*, the leaf [19*] borrows 24* from its sibling [24* 27* 29*], giving leaves [19* 24*] and [27* 29*] with the new separator 27

B+ Tree Index: Delete

Example: Delete 22*. Example: Delete 20*
❖ Deleting 22* is easy
❖ Deleting 20* is followed by redistribution at the leaf level. Note how the middle key is copied up.
Example: Delete 24*?

SLIDE 24

[Figure] Merging the underflowed leaves gives [19* 27* 29*]; the parent non-leaf is left with only key 30: need to merge recursively upwards! New Root: [5 13 17 30], with leaves [2* 3*] [5* 7* 8*] [14* 16*] [19* 27* 29*] [33* 34* 38* 39*]

B+ Tree Index: Delete

Example: Delete 24*
❖ Must merge leaf nodes!
❖ In the non-leaf node, remove the index entry with key value = 27
❖ Pull down of the index entry (17) from the old root

SLIDE 25

[Figure] Root key 13; left child [5], right child [17 20 22 30]; leaves [2* 3*] [5* 7* 8*] [14* 16*] [17* 18*] [20* 21*] [22* 27* 29*] [33* 34* 38* 39*]

Delete: Non-leaf Node Redistribution

❖ Suppose this is the state of the tree when deleting 24*
❖ Instead of merging the root’s children, we can also redistribute an entry from the left child of the root to the right child

SLIDE 26

[Figure] After redistribution: root key 17; left child [5 13], right child [20 22 30]; same leaves as before

Delete: After Redistribution

❖ Rotate IndexKeys through the parent node
❖ It suffices to re-distribute the index entry with key 20; for illustration, 17 is also re-distributed

SLIDE 27

Delete: Redistribution Preferred

❖ Unlike Insert, where redistribution is discouraged for non-leaf nodes, Delete prefers redistribution over merging at both leaf and non-leaf levels. Why?
  ❖ Files usually grow, not shrink; deletions are rare!
  ❖ High chance of redistribution success (high fanouts)
  ❖ Only need to propagate changes to the parent node

SLIDE 28

Handling Duplicates/Repetitions

❖ Many data entries could have the same IndexKey value
  ❖ Related to AltRIDlist vs AltRID for data entries
  ❖ Also, a single data entry could still span multiple pages
❖ Solution 1:
  ❖ All data entries with a given IndexKey value reside on a single page
  ❖ Use “overflow” pages, if needed (not inside the leaf list)
❖ Solution 2:
  ❖ Allow repeated IndexKey values among data entries
  ❖ Modify Search appropriately
  ❖ Use the RID to get a unique composite key!

SLIDE 29

Order Concept in Practice

❖ In practice, the order (d) concept is replaced by a physical space criterion: at least half-full
❖ Non-leaf pages can typically hold many more entries than leaf pages, since leaf pages could have long data records (AltRecord) or RID lists (AltRIDlist)
❖ Often, different nodes could have different # entries:
  ❖ Variable-sized IndexKey
  ❖ AltRecord and variable-sized data records
  ❖ AltRIDlist could lead to different numbers of data entries sharing an IndexKey value

SLIDE 30

B+ Tree Index: Bulk Loading

❖ Given an existing file we want to index, multiple record-at-a-time Inserts are wasteful (too many IndexKeys!)
❖ Bulk loading avoids this overhead; reduces I/O cost
  ❖ 1) Sort the data entries by IndexKey (AltRecord sorts the file!)
  ❖ 2) Create an empty root page; copy the leftmost IndexKey of the leftmost leaf page to the root (non-leaf) and assign the child
  ❖ 3) Go left to right; Insert only the leftmost IndexKey from each leaf page into the index as usual (NB: fewer keys!)
  ❖ 4) When a non-leaf fills up, follow the usual Split procedure and recurse upwards, if needed
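Steps 1-3 can be sketched as follows; this toy packs sorted keys into fixed-size leaves (the 2-entries-per-leaf size is an illustration) and collects the few separator keys that actually get inserted into the index:

```python
def bulk_load_leaf_keys(sorted_keys, entries_per_leaf):
    """Pack sorted data entries into leaves and collect only each leaf's
    leftmost key (after the first leaf) for insertion into the index."""
    leaves = [sorted_keys[i:i + entries_per_leaf]
              for i in range(0, len(sorted_keys), entries_per_leaf)]
    separators = [leaf[0] for leaf in leaves[1:]]
    return leaves, separators

leaves, seps = bulk_load_leaf_keys([2, 3, 5, 7, 14, 16, 17, 18, 20, 21, 22, 27], 2)
print(leaves[:3])  # [[2, 3], [5, 7], [14, 16]]
print(seps)        # [5, 14, 17, 20, 22] -- far fewer inserts than 12 records
```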

SLIDE 31

B+ Tree Index: Bulk Loading

[Figure] Sorted pages of data entries: [2* 3*] [5* 7*] [14* 16*] [17* 18*] [20* 21*] [22* 27*] …
Insert IndexKey 17: the node [5 14] fills up, so Split and push up 14
Insert IndexKey 22 (no redistribution!): Split again and push up 20; Height++ … and so on

SLIDE 32

Indexing: Outline

❖ Overview and Terminology
❖ B+ Tree Index
❖ Hash Index
❖ Learned Index

SLIDE 33

Overview of Hash Indexing

❖ Reduces search cost to nearly O(1)
❖ Good for equality search (but not for range search)
❖ Many variants:
  ❖ Static hashing
  ❖ Extendible hashing
  ❖ Linear hashing, etc. (we will not discuss these)

SLIDE 34

Static Hashing

[Figure] SearchKey k → hash function h → primary bucket pages 0 .. N-1; overflow pages chained off full buckets

❖ N is fixed; primary bucket pages are never deallocated
❖ Bucket pages contain data entries (same 3 Alternatives)
❖ Search: compute bucket = h(k) MOD N; e.g., h(k) = a*k + b
  ❖ Overflow pages help handle hash collisions
  ❖ Average search cost is O(1) + #overflow pages

SLIDE 35

Static Hashing: Example

Say, AltRecord, and only one record fits per page! N = 3

MovieID | Name         | Year | Director
20      | Inception    | 2010 | Christopher Nolan
15      | Avatar       | 2009 | Jim Cameron
52      | Gravity      | 2013 | Alfonso Cuaron
74      | Blue Jasmine | 2013 | Woody Allen

Hash function (a = 1, b = 0): h(k) = k MOD 3; e.g., 15 MOD 3 = 0, 74 MOD 3 = 2
Buckets: 0 → (15, Avatar, …); 1 → (52, Gravity, …); 2 → (20, Inception, …), (74, Blue Jasmine, …)
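The slide’s bucket assignment can be reproduced in a few lines; dict-of-lists buckets stand in for primary bucket pages (a real system would spill to overflow pages when a page fills):

```python
# Static hashing with the slide's parameters: h(k) = k MOD 3, AltRecord
N = 3
buckets = {i: [] for i in range(N)}
movies = [(20, "Inception"), (15, "Avatar"), (52, "Gravity"), (74, "Blue Jasmine")]
for movie_id, name in movies:
    buckets[movie_id % N].append(name)  # AltRecord: full record in the bucket

print(buckets[0])  # ['Avatar']                    15 MOD 3 = 0
print(buckets[1])  # ['Gravity']                   52 MOD 3 = 1
print(buckets[2])  # ['Inception', 'Blue Jasmine'] 20, 74 MOD 3 = 2
```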

SLIDE 36

Static Hashing: Insert and Delete

❖ Insert:
  ❖ Equality search; find space on the primary bucket page
  ❖ If not enough space, add an overflow page
❖ Delete:
  ❖ Equality search; delete the record
  ❖ If an overflow page becomes empty, remove it
  ❖ Primary bucket pages are never removed!

SLIDE 37

Static Hashing: Issues

❖ Since N is fixed, #overflow pages might grow and degrade search cost; deletes waste a lot of space
❖ Full reorganization is expensive and could block query processing
❖ Skew in hashing:
  ❖ Could be due to a “bad” hash function that does not “spread” values; but this issue is well-studied/solved
  ❖ Could be due to skew in the data (duplicates); this could cause more overflow pages and is difficult to resolve
Extendible (dynamic) hashing helps resolve the first two issues

SLIDE 38

Extendible Hashing

❖ Idea: Instead of hashing directly to data pages, maintain a dynamic directory of pointers to data pages
❖ The directory can grow and shrink; chosen to double/halve
❖ Search I/O cost: 1 (2 if the directory is not cached)
❖ Check the last GD bits of h(k)

[Figure] GD = 2; directory 00 01 10 11; Bucket A (LD 2): 4* 12* 32* 16*; Bucket B (LD 2): 1* 13* 41* 17*; Bucket C (LD 2): 10*; Bucket D (LD 2): 7* 15*
Example: Search 17* (10001 → last 2 bits 01); Search 42* (…10)

SLIDE 39

Extendible Hashing: Insert

❖ Search for k; if the data page has space, add the record; done
❖ If the data page has no more space:
  ❖ If LD < GD, create a Split Image of the bucket; LD++ for both the bucket and its split image; insert the record properly
  ❖ If LD == GD, create a Split Image; LD++ for both buckets; insert the record; but also double the directory and GD++! Duplicate the other directory pointers properly
❖ The directory typically grows in spurts
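A minimal executable sketch of this insert procedure, assuming an identity hash h(k) = k and bucket capacity 4 as on the slides (buckets are plain dicts; a real implementation would manage pages and RIDs):

```python
class ExtendibleHash:
    """Minimal extendible-hashing sketch: h(k) = k, bucket capacity 4,
    directory indexed by the last global_depth (GD) bits of the key."""

    def __init__(self, capacity=4):
        self.capacity = capacity
        self.global_depth = 1
        self.directory = [{"depth": 1, "keys": []}, {"depth": 1, "keys": []}]

    def _slot(self, k):
        return k % (2 ** self.global_depth)        # last GD bits of h(k)

    def insert(self, k):
        bucket = self.directory[self._slot(k)]
        if len(bucket["keys"]) < self.capacity:
            bucket["keys"].append(k)
            return
        if bucket["depth"] == self.global_depth:   # LD == GD: GD++, dir doubles
            self.global_depth += 1
            self.directory = self.directory + self.directory
        bucket["depth"] += 1                       # LD++ on the old bucket ...
        image = {"depth": bucket["depth"], "keys": []}  # ... and its split image
        pending = bucket["keys"] + [k]
        bucket["keys"] = []
        for i, b in enumerate(self.directory):     # rehook directory pointers
            if b is bucket and (i >> (bucket["depth"] - 1)) & 1:
                self.directory[i] = image
        for key in pending:                        # re-insert displaced keys
            self.insert(key)

h = ExtendibleHash()
for k in [4, 12, 32, 16, 1, 13, 41, 17, 10, 7, 15]:
    h.insert(k)                # reproduces the GD = 2 state on the slide
h.insert(24)                   # bucket A is full and LD == GD: directory doubles
print(h.global_depth)          # 3
print(h.directory[0]["keys"])  # [32, 16, 24]  (bucket A, dir entry 000)
print(h.directory[4]["keys"])  # [4, 12]       (split image A2, dir entry 100)
```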
SLIDE 40

Extendible Hashing: Insert

[Figure] GD = 2; directory 00 01 10 11 (check last GD bits of h(k)); Bucket A (LD 2): 4* 12* 32* 16*; Bucket B (LD 2): 1* 13* 41* 17*; Bucket C (LD 2): 10*; Bucket D (LD 2): 7* 15*
Example: Insert 24* (…00): no space in bucket A; need to split and LD++; since LD was = GD, GD++ and the directory doubles
After the split: Bucket A (LD 3, dir entry 000): 32* 16* 24*; Bucket A2 (LD 3, dir entry 100): 4* 12*

SLIDE 41

Extendible Hashing: Insert

[Figure] GD = 3; directory 000 .. 111; Bucket A (LD 3): 32* 16* 24*; Bucket A2 (LD 3): 4* 12*; Bucket B (LD 2): 1* 13* 41* 17*; Bucket C (LD 2): 10*; Bucket D (LD 2): 7* 15*
Example: Insert 18* (…010): goes into Bucket C
Example: Insert 25* (…001): need to split bucket B! Since LD < GD, the directory does not double this time; only LD++ on the old bucket, and modify the directory pointers
After the split: Bucket B (LD 3): 1* 41* 17*; Bucket B2 (LD 3): 13*

SLIDE 42

Extendible Hashing: Delete

❖ Search for k; delete the record on the data page
❖ If the data page becomes empty, we can merge it with the data page with the same last GD-1 bits (its “historical split image”) and do LD--; update the directory pointers
❖ If all split images get merged back and all buckets have LD = GD-1, shrink the directory by half; GD--
NB: Never does a bucket’s LD become > GD!

SLIDE 43

Extendible Hashing: Delete

[Figure] GD = 2; directory 00 01 10 11; Bucket A (LD 2): 4* 12*; Bucket B (LD 2): 1* 13* 41* 17*; Bucket C (LD 2): 26*; Bucket D (LD 2): 7* 15*
Example: Delete 26* (…10): Bucket C is now empty; can merge with A, and LD-- (to 1)
Q: Why did we pick A to merge with?
Example: Delete 41* (…01)

In practice, deletes, and thus bucket merges, are rare; so, directory shrinking is even more uncommon

SLIDE 44

Static Hashing vs Extendible Hashing

Q: Why not let N in static hashing grow and shrink too?
❖ The extendible hash directory is typically much smaller than the data pages; with static hashing, reorganizing all data pages is far more expensive
❖ Hashing skew is a common issue for both; in static hashing, this could lead to a large # of overflow pages; in extendible hashing, this could blow up the directory size (this is OK)
❖ If too many data entries share a search key (duplicates), overflow pages are needed for extendible hashing too!
SLIDE 45

Indexing: Outline

❖ Overview and Terminology
❖ B+ Tree Index
❖ Hash Index
❖ Learned Index

SLIDE 46

Statistical Machine Learning 101

❖ An “ML model” is a program/function to approximate a hidden/unknown function based on a set of data points
❖ The set of functions representable by an ML model is called its hypothesis space

f : D_X → D_Y   (Input/features: D_X; Output/target: D_Y)
Hypothesis space: H = {f}

Examples (D_X = R, D_Y = R):
  Linear Regression: H ≅ R^2
  Polynomial Regression: H ≅ R^d
SLIDE 47

❖ Most ML models are parametric; an ML model’s learning algorithm is a program to help pick a specific function from its hypothesis space, i.e., set parameter values
❖ Accuracy of an ML model is quantified by how well the picked function “fits” the observed data
❖ Two kinds of accuracy: “train” vs “test”
  ❖ Train is on the same data seen by the learning algorithm
  ❖ Test is on held-out data; evaluates “generalization”
❖ Different ML models represent different sets of functions and thus typically yield different accuracy

Statistical Machine Learning 101

SLIDE 48

Artificial Neural Networks (ANNs)

❖ Extremely powerful, versatile, and popular ML model
❖ Can be viewed as a generalization of linear regression: append a non-linearity after linear regression (the composite is called a “neuron”); line up many neurons (a “layer”); stack many such layers from input to output
❖ Universal function approximator!
❖ A special form of ANNs can simulate arbitrary Turing Machines!

SLIDE 49

Artificial Neural Networks (ANNs)

❖ Different forms of ANNs are tailored for different “raw” data representations for input and/or output
❖ Multi-Layer Perceptrons (MLPs) deal with a “flat” numeric vector for input and/or output
❖ Recurrent Neural Networks (RNNs) are tailored for sequentially structured inputs (text, time series, etc.); Convolutional Neural Networks (CNNs) for images/spatially structured data, etc.
❖ Bottomline: ANNs can help approximate any function!

f : D_X → D_Y

(Switch to Tim Kraska’s slides)
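Since the deck defers to external slides here, the following is a minimal, hypothetical sketch (not the actual method from those slides) of the learned-index idea behind f : D_X → D_Y: a simple model predicts a key's position in a sorted array, and a bounded local search corrects the prediction.

```python
import bisect

# Hypothetical sketch: "learn" f: key -> position over sorted keys,
# then fix up the model's error with a small local search window.
keys = list(range(0, 2000, 2))  # sorted search keys; positions 0..999

# "Train" a linear model pos ~= a*key + b from the two endpoints.
a = (len(keys) - 1) / (keys[-1] - keys[0])
b = -a * keys[0]

def learned_lookup(k, max_err=8):
    guess = int(a * k + b)                    # model's predicted position
    lo = max(0, guess - max_err)              # error-bounded window
    hi = min(len(keys), guess + max_err + 1)
    i = bisect.bisect_left(keys, k, lo, hi)   # exact fix-up in the window
    return i if i < len(keys) and keys[i] == k else None

print(learned_lookup(1234))  # 617: keys[617] == 1234
```

Here the keys happen to be perfectly linear, so the model's error is zero; real learned indexes bound `max_err` per model from the training data.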

slide-50
SLIDE 50


Sorting: Outline

❖ Overview and Warm-up
❖ Multi-way External Merge Sort (EMS)
❖ EMS and B+ trees

slide-51
SLIDE 51


Motivation for Sorting

❖ User’s SQL query has an ORDER BY clause!
❖ First step of bulk loading of a B+ tree index
❖ Used in implementations of many relational ops: project, join, set ops, group by aggregate, etc. (next topic!)

Q: But sorting is well-known; why should a DBMS bother?
Often, the file (relation) to be sorted will not fit in RAM! Hence, “external” sorting.

slide-52
SLIDE 52


External Sorting: Overview

❖ Goal: Given relation R with N pages, SortKey A, and M buffer pages (often, M << N), sort R on A to get sorted R’
❖ Idea: The sorting algorithm should be disk page I/O-aware!
❖ Desiderata:
  ❖ High efficiency, i.e., low I/O cost, even for very large N
  ❖ Use sequential I/Os rather than random I/Os as much as possible (AMAP)
  ❖ Interleave I/O and computation (DMA); reduce CPU cost too

NB: I/O-aware sorting is also a key part of the implementation of MapReduce/Hadoop!

slide-53
SLIDE 53


Warm-up: 2-way External Merge Sort

1. Sort phase: Read each page into buffer memory; do an “internal” sort (use any popular fast sorting algorithm, e.g., quicksort); write it back to disk (a sorted “run”)
2. Merge phase: Read 2 runs, merge them on the fly, write out a new double-length run; recurse till the whole file is one run!

Idea: Make Merge Sort I/O-aware!
NB: The Sort phase is 1 pass; the Merge phase is often multi-pass!

slide-54
SLIDE 54

Input file:            3,4 | 6,2 | 9,4 | 8,7 | 5,6 | 3,1 | 2
PASS 0 (1-page runs):  3,4 | 2,6 | 4,9 | 7,8 | 5,6 | 1,3 | 2
PASS 1 (2-page runs):  2,3,4,6 | 4,7,8,9 | 1,3,5,6 | 2
PASS 2 (4-page runs):  2,3,4,4,6,7,8,9 | 1,2,3,5,6
PASS 3 (8-page runs):  1,2,2,3,3,4,4,5,6,6,7,8,9


Warm-up: 2-way External Merge Sort

Each pass does 1 read and 1 write of the whole file: 2N page I/Os per pass
Number of passes: Sort phase: 1; Merge phase: ⌈log2(N)⌉

I/O cost of 2-way EMS: 2N(1 + ⌈log2(N)⌉)

In the example above, N=7 pages, so the whole file is sorted with 2*7*(1+3) = 56 I/Os
Q: How to reduce this cost further?
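The trace and cost formula above can be checked with a toy simulation. Pages are modeled as small lists of ints, and the pairwise merge is simulated with `sorted()` for brevity rather than a streaming 2-way merge; the I/O counter charges 2N per pass as in the analysis.

```python
# Toy simulation of 2-way EMS that counts page I/Os.
def two_way_ems(pages):
    io = 2 * len(pages)                # Pass 0 (sort phase): read+write all pages
    runs = [sorted(p) for p in pages]  # N sorted 1-page runs
    while len(runs) > 1:
        # Merge pairs of runs (simulated with a re-sort for brevity).
        runs = [sorted(runs[i] + (runs[i + 1] if i + 1 < len(runs) else []))
                for i in range(0, len(runs), 2)]
        io += 2 * len(pages)           # each merge pass reads+writes the whole file
    return runs[0], io

file = [[3, 4], [6, 2], [9, 4], [8, 7], [5, 6], [3, 1], [2]]
sorted_out, io = two_way_ems(file)
print(io)  # 2N(1 + ceil(log2 N)) = 2*7*(1+3) = 56
```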

slide-55
SLIDE 55


Sorting: Outline

❖ Overview and Warm-up
❖ Multi-way External Merge Sort (EMS)
❖ EMS and B+ trees

slide-56
SLIDE 56


Multi-way EMS: Motivation

Idea: Why not exploit more buffer pages (say, B >> 3)?
Q: How many buffer pages does 2-way EMS use?
  Sort phase: 2 (1 for read, 1 for write)
  Merge phase: 3 (1 for each input run; 1 for merged output)
So, 2-way EMS uses only 3 buffer pages!
Sort phase with B buffers: Read B pages at a time (not just 1 at a time)! Write out ⌈N/B⌉ sorted runs of length B pages each (not length 1). But the I/O cost of the sort phase is still the same!

slide-57
SLIDE 57


Multi-way EMS: B-way Merge Phase

Idea: In 2-way EMS, we merge 2 sorted runs at a time; in multi-way EMS, we merge B−1 sorted runs at a time!

(B−1)-way merge; total # buffer pages used: B

[Figure: B−1 input buffer pages (INPUT 1 … INPUT B−1), one per run, streaming from disk into 1 OUTPUT buffer page, which is written back to disk]

# passes for the Merge Phase reduces to: ⌈log_{B−1}(⌈N/B⌉)⌉
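The merge step above can be sketched with Python's standard `heapq.merge`, which lazily merges already-sorted iterables, mirroring how one buffer page per input run feeds an in-memory min-heap that drips into the single output page.

```python
import heapq

# Sketch of one (B-1)-way merge step over sorted runs.
def multiway_merge(runs):
    return list(heapq.merge(*runs))

runs = [[1, 5, 9], [2, 6], [3, 4, 8], [7]]  # B-1 = 4 sorted runs (so B = 5)
print(multiway_merge(runs))  # [1, 2, 3, 4, 5, 6, 7, 8, 9]
```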

slide-58
SLIDE 58


Multi-way EMS I/O Cost

Overall, # passes = 1 + ⌈log_{B−1}(⌈N/B⌉)⌉
I/O cost per pass = 2N
Total I/O cost of EMS = 2N(1 + ⌈log_{B−1}(⌈N/B⌉)⌉)

Example: File with 100M records of length 0.5KB each; page size is 8KB; number of buffer pages for EMS B=1000
Number of pages N = 100M * 0.5KB / 8KB = 6.25M
Total I/O cost of EMS = 2 × 6.25M × (1 + ⌈log_999(6250)⌉) = 2 × 6.25M × (1 + 2) = 37.5M
(log_999(6250) ≈ 1.27; only the ceiling is needed, giving 2 merge passes)
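The example arithmetic can be checked directly with the cost formula (the function name below is illustrative):

```python
import math

# Total I/O cost of multi-way EMS: 2N * (1 + ceil(log_{B-1}(ceil(N/B))))
def ems_io_cost(n_pages, b_buffers):
    n_runs = math.ceil(n_pages / b_buffers)   # runs produced by the sort phase
    merge_passes = math.ceil(math.log(n_runs, b_buffers - 1)) if n_runs > 1 else 0
    return 2 * n_pages * (1 + merge_passes)

# Slide's example: 100M records * 0.5KB each / 8KB pages = 6.25M pages, B = 1000
n = (100_000_000 * 512) // 8192
print(n, ems_io_cost(n, 1000))  # 6250000 37500000
```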

slide-59
SLIDE 59


Multi-way EMS I/O Cost

Total number of passes = 1 + ⌈log_{B−1}(⌈N/B⌉)⌉

 N      Naive 2-way EMS   B=1K   B=10K   B=100K
 1M            21           3      2       2
 10M           25           3      2       2
 100M          28           3      3       2
 1B            31           4      3       2

(With an 8KB page, N=1B pages is ~7.5TB of data, while B=100K buffer pages is only ~782MB of RAM.)
Only 2 passes to sort up to 74.5TB! (2 is the lower bound for EMS!)
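The rows of the table follow directly from the pass-count formulas; a quick sketch to regenerate them:

```python
import math

# Total passes for multi-way EMS: 1 + ceil(log_{B-1}(ceil(N/B)))
def multiway_passes(n, b):
    return 1 + math.ceil(math.log(math.ceil(n / b), b - 1))

# Naive 2-way EMS: 1 + ceil(log2(N))
def naive_2way_passes(n):
    return 1 + math.ceil(math.log2(n))

for n in (10**6, 10**7, 10**8, 10**9):
    print(f"N={n:>10}: 2-way={naive_2way_passes(n):>2}",
          *(f"B={b}: {multiway_passes(n, b)}" for b in (10**3, 10**4, 10**5)))
```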

slide-60
SLIDE 60


Multi-way EMS: Improvements

❖ While already efficient, some key algorithmic and systems-oriented improvements have been made to multi-way EMS to reduce overall runtime (not just counting I/O cost)
❖ Three prominent improvements:
  1. Replacement sort (aka heap sort) as the internal sort
  2. “Blocked” I/O
  3. Double buffering
slide-61
SLIDE 61


Improvement 1: Replacement Sort

❖ In standard EMS, quicksort is used during the Sort Phase
  ❖ Produces runs of length B pages each, i.e., ⌈N/B⌉ runs
❖ Replacement sort is an alternative for the Sort Phase
  ❖ Produces runs of average length 2B pages each
  ❖ So, the number of runs is reduced on average to ⌈N/2B⌉
  ❖ Maintains a sorted heap in B−2 pages; 1 page for reading; 1 for sorted output
  ❖ Slightly higher CPU cost, but significantly lower I/O cost
(We are skipping the details of this algorithm)
New total I/O cost = 2N(1 + ⌈log_{B−1}(⌈N/2B⌉)⌉)
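Although the slide skips the details, the run-formation idea can be sketched as follows. This is a minimal, illustrative version of replacement selection over individual records, where the heap capacity `cap` stands in for the B−2 pages of heap memory:

```python
import heapq

# Minimal sketch of replacement selection: emit the heap minimum to the
# current run; incoming records smaller than the last emitted value are
# "frozen" for the next run.
def replacement_selection(records, cap):
    it = iter(records)
    heap = []
    for _ in range(cap):                 # fill the heap initially
        try:
            heap.append(next(it))
        except StopIteration:
            break
    heapq.heapify(heap)
    runs, cur, frozen = [], [], []
    for x in it:
        cur.append(heapq.heappop(heap))  # emit current minimum to the run
        if x >= cur[-1]:
            heapq.heappush(heap, x)      # x can still join the current run
        else:
            frozen.append(x)             # x must wait for the next run
        if not heap:                     # everything frozen: close the run
            runs.append(cur)
            cur, heap, frozen = [], frozen, []
            heapq.heapify(heap)
    cur.extend(sorted(heap))             # drain leftovers into current run
    if cur:
        runs.append(cur)
    if frozen:
        runs.append(sorted(frozen))
    return runs

print(replacement_selection([5, 3, 8, 1, 9, 2, 7, 4, 6, 0], 3))
```

On random inputs, runs produced this way average about twice the heap capacity, which is where the 2B-page claim above comes from.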

slide-62
SLIDE 62


Improvement 2: “Blocked” I/O

❖ The Merge Phase did not recognize the distinction between sequential I/O and random I/O!
  ❖ The time difference is not reflected in counting I/O cost
❖ Idea: Read a “block” of b pages of each run at a time!
  ❖ So, only ⌊B/b⌋ − 1 runs can be merged at a time
  ❖ b controls the trade-off of # passes vs. time-per-pass
New “fan-in” of the Merge Phase: F = ⌊B/b⌋ − 1
New total I/O cost = 2N(1 + ⌈log_F(Nr)⌉), where Nr is the # of initial sorted runs
slide-63
SLIDE 63


Improvement 3: Double Buffering

❖ Most machines have DMA; this enables I/O-CPU parallelism
❖ Trivially feasible to exploit DMA in the Sort Phase
❖ But in the Merge Phase, the CPU is blocked by I/O for the runs
❖ Idea: Allocate double the buffers (2b pages) for each run; while the CPU processes one set, read pages (I/O) into the other set!
  ❖ So, only ⌊B/2b⌋ − 1 runs can be merged at a time
New fan-in of the Merge Phase: F = ⌊B/2b⌋ − 1
New total I/O cost = 2N(1 + ⌈log_F(Nr)⌉), where Nr is the # of initial sorted runs

slide-64
SLIDE 64


Sorting: Outline

❖ Overview and Warm-up
❖ Multi-way External Merge Sort (EMS)
❖ EMS and B+ trees

slide-65
SLIDE 65


Using B+ Tree for EMS

❖ Suppose we already have a B+ tree index with the SortKey equal to (or a prefix of) the IndexKey
❖ The data entries of the B+ tree are already in sorted order!
Q: Is it a “good” idea to simply read the leaf level of the B+ tree to achieve the EMS then?
It depends on whether the index is clustered or not! Clustered: good idea! Unclustered: might be really bad!

slide-66
SLIDE 66

[Figure: B+ tree index, with data entries in the leaf level pointing to data pages]

Using Clustered B+ Tree for EMS

❖ Go down the tree to reach the left-most leaf
❖ Scan the leaf pages (data entries) left to right
❖ If AltRecord, done! Otherwise, retrieve the data pages pointed to by successive data entries
❖ I/O cost if AltRecord: height + # leaf pages
❖ I/O cost otherwise: height + # leaf pages + # data pages
❖ Either way, the I/O cost is often << from-scratch EMS!

slide-67
SLIDE 67

[Figure: B+ tree index, with data entries in the leaf level pointing to scattered data pages]

Using Unclustered B+ Tree for EMS

❖ Unclustered means not AltRecord! (Why?)
❖ Same procedure as for the clustered B+ tree
❖ Same I/O “cost” as for a clustered tree with AltRID/AltRIDlist, but with many back-to-back random I/Os; thrashing!
❖ Usually much slower than from-scratch EMS!
Q: But when is this faster than from-scratch EMS?

slide-68
SLIDE 68


External Sorting as Competitive Sport!

The geekiest “sport” in the world: sortbenchmark.org