CSE 232A Graduate Database Systems, Arun Kumar. Topic 2: Indexing and Sorting

SLIDE 1

Topic 2: Indexing and Sorting (Chapters 10, 11, and 13 of Cow Book)

Arun Kumar


CSE 232A Graduate Database Systems

Slide ACKs: Jignesh Patel, Paris Koutris

SLIDE 2

Motivation for Indexing

❖ Consider the following SQL query:

SELECT * FROM Movies WHERE Year=2017

Movies (M)

MovieID Name Year Director

Q: How to obtain the matching records from the file?
❖ Heap file? Need to do a linear scan! O(N) I/O and CPU
❖ “Sorted” file? Binary search! O(log2(N)) I/O and CPU

SELECT * FROM Movies WHERE Year>=2000 AND Year<2010

Indexing helps retrieve records faster for selective predicates!
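To make the O(N) vs O(log2(N)) contrast concrete, here is a back-of-the-envelope sketch in Python; the 1-million-page file size is a made-up illustration, not from the slides:

```python
import math

def heap_scan_ios(num_pages):
    """Heap file: a linear scan must read every page."""
    return num_pages

def sorted_file_ios(num_pages):
    """Sorted file: binary search over the pages locates a match."""
    return math.ceil(math.log2(num_pages))

# Hypothetical 1-million-page Movies file:
print(heap_scan_ios(1_000_000))    # 1000000 page I/Os
print(sorted_file_ios(1_000_000))  # 20 page I/Os
```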

SLIDE 3

Another View of Storage Manager

[Diagram] I/O Accesses ↔ I/O Manager
Access Methods: Heap File, Sorted File, B+ Tree Index, Hash Index
Buffer Manager, Concurrency Control Manager, Recovery Manager

SLIDE 4

Indexing: Outline

❖ Overview and Terminology
❖ B+ Tree Index
❖ Hash Index
❖ Learned Index

SLIDE 5

Indexing

❖ Index: A data structure to speed up record retrieval
❖ Search Key: Attribute(s) on which the file is indexed; also called Index Key (used interchangeably)
  ❖ Any permutation of any subset of a relation’s attributes can be the index key for an index
  ❖ Index key need not be a primary/candidate key
❖ Two main types of indexes:
  ❖ B+ Tree index: good for both range and equality search
  ❖ Hash index: good for equality search

SLIDE 6

Overview of Indexes

❖ Need to consider efficiency of search, insert, and delete
❖ Primarily optimized to reduce (disk) I/O cost
❖ B+ Tree index:
  ❖ O(logF(N)) I/O and CPU cost for equality search (N: number of “data entries”; F: “fanout” of a non-leaf node)
  ❖ Range search, Insert, and Delete all start with an equality search
❖ Hash index:
  ❖ O(1) I/O and CPU cost for equality search
  ❖ Insert and Delete start with an equality search
  ❖ Not “good” for range search!
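The O(logF(N)) bound above is just the tree height; a quick numeric sanity check (the 10-million-entry count is a hypothetical value, the fanout 133 comes from a later slide):

```python
import math

def btree_equality_search_ios(n_data_entries, fanout):
    """I/O cost of a B+ Tree equality search ~ tree height = ceil(log_F(N))."""
    return math.ceil(math.log(n_data_entries) / math.log(fanout))

# e.g., 10 million data entries with fanout F = 133; a hash index would be O(1):
print(btree_equality_search_ios(10_000_000, 133))  # 4
```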


SLIDE 7

What is stored in the Index?

❖ 2 things: Search/index key values and data entries
❖ Alternatives for data entries for a given key value k:
  ❖ AltRecord: Actual data records of the file that match k
  ❖ AltRID: <k, RID of a record that matches k>
  ❖ AltRIDlist: <k, list of RIDs of records that match k>
❖ API for operations on records:
  ❖ Search (IndexKey); could be a predicate for B+ Tree
  ❖ Insert (IndexKey, data entry)
  ❖ Delete (IndexKey); could be a predicate for B+ Tree


SLIDE 8

Overview of B+ Tree Index

[Figure] Non-leaf pages hold index entries of the form (IndexKey value, PageID); leaf pages hold data entries, e.g., AltRID: (IndexKey value, RID)

❖ Non-leaf pages do not contain data values; they contain between d and 2d index keys, where d is the order parameter
❖ Height-balanced tree; only the root can have fewer than d keys
❖ Leaf pages are in sorted order of IndexKey; connected as a doubly linked list
Q: What is the difference between “B+ Tree” and “B Tree”?


SLIDE 9

Overview of Hash Index

[Figure] SearchKey → hash function h → primary bucket pages 0 .. N-1; overflow pages chained off full buckets

❖ Bucket pages have data entries (same 3 Alternatives)
❖ Hash function helps obtain O(1) search time


SLIDE 10

Trade-offs of Data Entry Alternatives

❖ Pros and cons of the alternatives for data entries:
  ❖ AltRecord: The entire file is stored as an index! If records are long, the data entries of the index are large and search time could be high
  ❖ AltRID and AltRIDlist: Data entries typically smaller than records; often faster for equality search
  ❖ AltRIDlist has more compact data entries than AltRID, but its entries are variable-length
Q: A file can have at most one AltRecord index. Why?


SLIDE 11

More Indexing-related Terminology

❖ Composite Index: IndexKey has > 1 attributes
❖ Primary Index: IndexKey contains the primary key
❖ Secondary Index: Any index that is not a primary index
❖ Unique Index: IndexKey contains a candidate key
  ❖ All primary indexes are unique indexes!

MovieID Name Year Director IMDB_URL

IMDB_URL is a candidate key Index on MovieID? Index on Year? Index on Director? Index on IMDB_URL? Index on (Year,Name)?


SLIDE 12

More Indexing-related Terminology

❖ Clustered index: the order in which records are laid out is the same as (or “very close to”) the order of the IndexKey domain
  ❖ Matters for (range) search performance!
  ❖ AltRecord implies the index is clustered. Why?
  ❖ In practice, clustered almost always implies AltRecord
  ❖ In practice, a file is clustered on at most 1 IndexKey
❖ Unclustered index: an index that is not clustered

MovieID Name Year Director IMDB_URL

Index on Year? Index on (Year, Name)?


SLIDE 13

Indexing: Outline

❖ Overview and Terminology
❖ B+ Tree Index
❖ Hash Index
❖ Learned Index

SLIDE 14

[Figure] Root: 13 17 24 30; leaves: [2* 3* 5* 7*] [14* 16*] [19* 20* 22*] [24* 27* 29*] [33* 34* 38* 39*]; Height = 1

B+ Tree Index: Search


❖ Given SearchKey k, start from the root; compare k with the IndexKeys in the non-leaf/index entries; descend to the correct child; keep descending like so till a leaf node is reached
❖ Comparison within non-leaf nodes: binary/linear search
Examples: search 7*; 8*; 24*; range [19*,33*]
Order = 2
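The descent can be sketched in a few lines of Python; this is a toy mirror of the slide’s example tree, not a full B+ Tree (no page layout, no sibling links):

```python
import bisect

class Node:
    def __init__(self, keys, children=None, entries=None):
        self.keys = keys          # sorted IndexKeys within the node
        self.children = children  # non-leaf: len(keys)+1 child pointers
        self.entries = entries    # leaf: data entries aligned with keys

def search(node, k):
    """Descend from the root to the leaf that may contain SearchKey k."""
    while node.children is not None:           # still at a non-leaf node
        i = bisect.bisect_right(node.keys, k)  # binary search within node
        node = node.children[i]
    # at a leaf: return the matching data entries
    return [e for key, e in zip(node.keys, node.entries) if key == k]

# Tiny tree mirroring the slide (root keys 13 17 24 30, order d = 2):
leaves = [Node([2, 3, 5, 7], entries=["2*", "3*", "5*", "7*"]),
          Node([14, 16], entries=["14*", "16*"]),
          Node([19, 20, 22], entries=["19*", "20*", "22*"]),
          Node([24, 27, 29], entries=["24*", "27*", "29*"]),
          Node([33, 34, 38, 39], entries=["33*", "34*", "38*", "39*"])]
root = Node([13, 17, 24, 30], children=leaves)
print(search(root, 7))   # ['7*']
print(search(root, 8))   # []  (not found)
print(search(root, 24))  # ['24*']
```

A range search such as [19*, 33*] would descend to the leaf for 19* and then follow the leaf-level sibling links, which this sketch omits.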

SLIDE 15

B+ Tree Index: Page Format

Leaf Page: Prev Page Pointer P0; data entries (K1, R1), (K2, R2), …, (Kn, Rn), where each Ri points to record i; Next Page Pointer Pn+1

Non-leaf Page: index entries P1, K1, P2, K2, P3, …, Km, Pm+1
  P1 → page with Values < K1
  P2 → page with values s.t. K1 ≤ Values < K2
  P3 → page with values s.t. K2 ≤ Values < K3
  …
  Pm+1 → page with values ≥ Km

Order = m/2

SLIDE 16

B+ Trees in Practice


❖ Typical order value: 100 (so, a non-leaf node can have up to 200 index keys)
❖ Typical occupancy: 67%; so, typical “fanout” F = 133
❖ Computing the tree’s capacity using fanout:
  ❖ Height 1 stores 133 leaf pages
  ❖ Height 4 stores 133^4 = 312,900,721 leaf pages
❖ Typically, the higher levels of the B+ Tree are cached in the buffer pool:
  ❖ Level 0 (root) = 1 page = 8 KB
  ❖ Level 1 = 133 pages ~ 1 MB
  ❖ Level 2 = 17,689 pages ~ 138 MB, and so on
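The capacity figures above are plain arithmetic and can be checked directly (assuming the 8 KB page size stated on the slide):

```python
# Arithmetic check of the capacity numbers (8 KB pages assumed):
fanout = 133
print(fanout ** 1)             # 133 leaf pages at height 1
print(fanout ** 4)             # 312900721 leaf pages at height 4
print(fanout ** 2)             # 17689 level-2 pages
print(fanout ** 2 * 8 / 1024)  # ~138 MB of level-2 pages
```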

SLIDE 17

B+ Tree Index: Insert


❖ Search for the correct leaf L
❖ Insert the data entry into L; if L has enough space, done! Otherwise, must split L (into L and a new leaf L’)
  ❖ Redistribute entries evenly, copy up the middle key
  ❖ Insert an index entry pointing to L’ into the parent of L
❖ A split might have to propagate upwards recursively:
  ❖ To split a non-leaf node, redistribute entries evenly, but push up the middle key (not copy up, as in leaf splits!)
❖ Splits “grow” the tree; a root split increases the height
❖ Tree growth: gets wider or one level taller at the top
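The copy-up vs push-up distinction can be sketched directly; this toy code only shows the split decision on entry lists, not the full pointer surgery:

```python
def split_leaf(entries):
    """Split an overflowed leaf: redistribute evenly, COPY the middle key up
    (it continues to appear in the new right leaf)."""
    mid = len(entries) // 2
    left, right = entries[:mid], entries[mid:]
    copied_up = right[0][0]
    return left, right, copied_up

def split_nonleaf(keys):
    """Split an overflowed non-leaf: PUSH the middle key up (it moves out)."""
    mid = len(keys) // 2
    pushed_up = keys[mid]
    return keys[:mid], keys[mid + 1:], pushed_up

# Inserting 8* overflows the leaf [2* 3* 5* 7*] (order d = 2, max 4 entries);
# 5 is copied up, exactly as on the slide:
left, right, sep = split_leaf([(2, "2*"), (3, "3*"), (5, "5*"), (7, "7*"), (8, "8*")])
print(left, right, sep)  # [(2,'2*'), (3,'3*')] [(5,'5*'), (7,'7*'), (8,'8*')] 5

# The parent overflows when 5 arrives; 17 is pushed up into the new root:
print(split_nonleaf([5, 13, 17, 24, 30]))  # ([5, 13], [24, 30], 17)
```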

SLIDE 18

[Figure] Root: 13 17 24 30; inserting 8* overflows the leaf [2* 3* 5* 7*]; after the split, the leaves are [2* 3*] and [5* 7* 8*], and key 5 is the entry to be inserted in the parent node: copied up (and continues to appear in the leaf)

B+ Tree Index: Insert


Example: Insert 8* Split! Split! Height++

SLIDE 19

[Figure] The full non-leaf node [5 13 17 24 30] splits: 17 is the entry to insert in the parent node, pushed up (and only appears once in the index)
Minimum occupancy is guaranteed in both leaf and non-leaf page splits

B+ Tree Index: Insert


Example: Insert 8*

SLIDE 20

[Figure] New Root: 17; children [5 13] and [24 30]; leaves: [2* 3*] [5* 7* 8*] [14* 16*] [19* 20* 22*] [24* 27* 29*] [33* 34* 38* 39*]

B+ Tree Index: Insert


❖ Recursive splitting went up to the root; height went up by 1
❖ Splitting is somewhat expensive; is it avoidable?
❖ Can redistribute data entries with the left or right sibling, if there is space!
Example: Insert 8*

SLIDE 21

[Figure] Root: 8 17 24 30; instead of splitting, 8* is redistributed into the right sibling, giving leaves [2* 3* 5* 7*] and [8* 14* 16*]; the parent key 13 is updated to 8

Insert: Leaf Node Redistribution


❖ Redistributing data entries with a sibling improves page occupancy at the leaf level and avoids too many splits; but it is usually not used for non-leaf node splits:
  ❖ Could increase I/O cost (checking siblings)
  ❖ Propagating internal splits is better amortization
  ❖ Pointer management headaches
Example: Insert 8*

SLIDE 22

B+ Tree Index: Delete


❖ Start at the root, find the leaf L where the entry belongs
❖ Remove the entry; if L is at least half-full, done! Else, if L has only d-1 entries:
  ❖ Try to re-distribute, borrowing from a sibling L’
  ❖ If re-distribution fails, merge L and L’ into a single leaf
❖ If a merge occurred, must delete the entry (pointing to L or its sibling) from the parent of L
❖ A merge might have to propagate upwards recursively to the root, which decreases the height by 1
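The redistribute-vs-merge decision after a leaf deletion can be sketched as follows; this toy treats leaves as key lists and ignores parent-pointer updates:

```python
def delete_from_leaf(leaf, sibling, d, key):
    """Sketch of the decision after deleting `key` from leaf L with order d:
    borrow from sibling L' if it has spare entries, else merge the leaves."""
    leaf = [k for k in leaf if k != key]
    if len(leaf) >= d:                        # still at least half-full: done
        return ("done", leaf, sibling)
    if len(sibling) > d:                      # borrow: redistribute evenly
        merged = sorted(leaf + sibling)
        half = len(merged) // 2
        return ("redistributed", merged[:half], merged[half:])
    return ("merged", sorted(leaf + sibling), [])  # parent loses an entry

# With d = 2: deleting 20 leaves [19]; the sibling has a spare entry, so 24
# is borrowed and the new middle key (27) would be copied up in the parent:
print(delete_from_leaf([19, 20], [24, 27, 29], 2, 20))
# ('redistributed', [19, 24], [27, 29])

# Deleting 24 afterwards: the sibling has no spare entry, so the leaves merge
# (and the merge may propagate upwards):
print(delete_from_leaf([19, 24], [27, 29], 2, 24))
# ('merged', [19, 27, 29], [])
```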

SLIDE 23

[Figure] Root 17; children [5 13] and [27 30]; after deleting 22* and 20*, the leaf [19*] borrows 24* from its sibling [24* 27* 29*], giving leaves [19* 24*] and [27* 29*] with the new separator 27

B+ Tree Index: Delete

Example: Delete 22*. Example: Delete 20*
❖ Deleting 22* is easy
❖ Deleting 20* is followed by redistribution at the leaf level. Note how the middle key is copied up.
Example: Delete 24*?

SLIDE 24

[Figure] Merging the underflowed leaves gives [19* 27* 29*]; the parent non-leaf is left with only key 30: need to merge recursively upwards! New Root: [5 13 17 30], with leaves [2* 3*] [5* 7* 8*] [14* 16*] [19* 27* 29*] [33* 34* 38* 39*]

B+ Tree Index: Delete

Example: Delete 24*
❖ Must merge leaf nodes!
❖ In the non-leaf node, remove the index entry with key value = 27
❖ Pull down of the index entry (17) from the old root

SLIDE 25

[Figure] Root key 13; left child [5], right child [17 20 22 30]; leaves [2* 3*] [5* 7* 8*] [14* 16*] [17* 18*] [20* 21*] [22* 27* 29*] [33* 34* 38* 39*]

Delete: Non-leaf Node Redistribution

❖ Suppose this is the state of the tree when deleting 24*
❖ Instead of merging the root’s children, we can also redistribute an entry from the left child of the root to the right child

SLIDE 26

[Figure] After redistribution: root key 17; left child [5 13], right child [20 22 30]; same leaves as before

Delete: After Redistribution

❖ Rotate IndexKeys through the parent node
❖ It suffices to re-distribute the index entry with key 20; for illustration, 17 is also re-distributed

SLIDE 27

Delete: Redistribution Preferred

❖ Unlike Insert, where redistribution is discouraged for non-leaf nodes, Delete prefers redistribution over merging at both leaf and non-leaf levels. Why?
  ❖ Files usually grow, not shrink; deletions are rare!
  ❖ High chance of redistribution success (high fanouts)
  ❖ Only need to propagate changes to the parent node

SLIDE 28

Handling Duplicates/Repetitions

❖ Many data entries could have the same IndexKey value
  ❖ Related to AltRIDlist vs AltRID for data entries
  ❖ Also, a single data entry could still span multiple pages
❖ Solution 1:
  ❖ All data entries with a given IndexKey value reside on a single page
  ❖ Use “overflow” pages, if needed (not inside the leaf list)
❖ Solution 2:
  ❖ Allow repeated IndexKey values among data entries
  ❖ Modify Search appropriately
  ❖ Use the RID to get a unique composite key!

SLIDE 29

Order Concept in Practice

❖ In practice, the order (d) concept is replaced by a physical space criterion: at least half-full
❖ Non-leaf pages can typically hold many more entries than leaf pages, since leaf pages could have long data records (AltRecord) or RID lists (AltRIDlist)
❖ Often, different nodes could have different # entries:
  ❖ Variable-sized IndexKey
  ❖ AltRecord and variable-sized data records
  ❖ AltRIDlist could lead to different numbers of data entries sharing an IndexKey value

SLIDE 30

B+ Tree Index: Bulk Loading

❖ Given an existing file we want to index, multiple record-at-a-time Inserts are wasteful (too many IndexKeys!)
❖ Bulk loading avoids this overhead; reduces I/O cost
  ❖ 1) Sort the data entries by IndexKey (AltRecord sorts the file!)
  ❖ 2) Create an empty root page; copy the leftmost IndexKey of the leftmost leaf page to the root (non-leaf) and assign the child
  ❖ 3) Go left to right; Insert only the leftmost IndexKey from each leaf page into the index as usual (NB: fewer keys!)
  ❖ 4) When a non-leaf fills up, follow the usual Split procedure and recurse upwards, if needed
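Steps 1-3 can be sketched as follows; this toy packs sorted keys into fixed-size leaves (the 2-entries-per-leaf size is an illustration) and collects the few separator keys that actually get inserted into the index:

```python
def bulk_load_leaf_keys(sorted_keys, entries_per_leaf):
    """Pack sorted data entries into leaves and collect only each leaf's
    leftmost key (after the first leaf) for insertion into the index."""
    leaves = [sorted_keys[i:i + entries_per_leaf]
              for i in range(0, len(sorted_keys), entries_per_leaf)]
    separators = [leaf[0] for leaf in leaves[1:]]
    return leaves, separators

leaves, seps = bulk_load_leaf_keys([2, 3, 5, 7, 14, 16, 17, 18, 20, 21, 22, 27], 2)
print(leaves[:3])  # [[2, 3], [5, 7], [14, 16]]
print(seps)        # [5, 14, 17, 20, 22] -- far fewer inserts than 12 records
```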

SLIDE 31

B+ Tree Index: Bulk Loading

[Figure] Sorted pages of data entries: [2* 3*] [5* 7*] [14* 16*] [17* 18*] [20* 21*] [22* 27*] …
Insert IndexKey 17: the node [5 14] fills up, so Split and push up 14
Insert IndexKey 22 (no redistribution!): Split again and push up 20; Height++ … and so on

SLIDE 32

Indexing: Outline

❖ Overview and Terminology
❖ B+ Tree Index
❖ Hash Index
❖ Learned Index

SLIDE 33

Overview of Hash Indexing

❖ Reduces search cost to nearly O(1)
❖ Good for equality search (but not for range search)
❖ Many variants:
  ❖ Static hashing
  ❖ Extendible hashing
  ❖ Linear hashing, etc. (we will not discuss these)

SLIDE 34

Static Hashing

[Figure] SearchKey k → hash function h → primary bucket pages 0 .. N-1; overflow pages chained off full buckets

❖ N is fixed; primary bucket pages are never deallocated
❖ Bucket pages contain data entries (same 3 Alternatives)
❖ Search: compute bucket = h(k) MOD N; e.g., h(k) = a*k + b
  ❖ Overflow pages help handle hash collisions
  ❖ Average search cost is O(1) + #overflow pages

SLIDE 35

Static Hashing: Example

Say, AltRecord, and only one record fits per page! N = 3

MovieID | Name         | Year | Director
20      | Inception    | 2010 | Christopher Nolan
15      | Avatar       | 2009 | Jim Cameron
52      | Gravity      | 2013 | Alfonso Cuaron
74      | Blue Jasmine | 2013 | Woody Allen

Hash function (a = 1, b = 0): h(k) = k MOD 3; e.g., 15 MOD 3 = 0, 74 MOD 3 = 2
Buckets: 0 → (15, Avatar, …); 1 → (52, Gravity, …); 2 → (20, Inception, …), (74, Blue Jasmine, …)
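The slide’s bucket assignment can be reproduced in a few lines; dict-of-lists buckets stand in for primary bucket pages (a real system would spill to overflow pages when a page fills):

```python
# Static hashing with the slide's parameters: h(k) = k MOD 3, AltRecord
N = 3
buckets = {i: [] for i in range(N)}
movies = [(20, "Inception"), (15, "Avatar"), (52, "Gravity"), (74, "Blue Jasmine")]
for movie_id, name in movies:
    buckets[movie_id % N].append(name)  # AltRecord: full record in the bucket

print(buckets[0])  # ['Avatar']                    15 MOD 3 = 0
print(buckets[1])  # ['Gravity']                   52 MOD 3 = 1
print(buckets[2])  # ['Inception', 'Blue Jasmine'] 20, 74 MOD 3 = 2
```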

SLIDE 36

Static Hashing: Insert and Delete

❖ Insert:
  ❖ Equality search; find space on the primary bucket page
  ❖ If not enough space, add an overflow page
❖ Delete:
  ❖ Equality search; delete the record
  ❖ If an overflow page becomes empty, remove it
  ❖ Primary bucket pages are never removed!

SLIDE 37

Static Hashing: Issues

❖ Since N is fixed, #overflow pages might grow and degrade search cost; deletes waste a lot of space
❖ Full reorganization is expensive and could block query processing
❖ Skew in hashing:
  ❖ Could be due to a “bad” hash function that does not “spread” values; but this issue is well-studied/solved
  ❖ Could be due to skew in the data (duplicates); this could cause more overflow pages and is difficult to resolve
Extendible (dynamic) hashing helps resolve the first two issues

SLIDE 38

Extendible Hashing

❖ Idea: Instead of hashing directly to data pages, maintain a dynamic directory of pointers to data pages
❖ The directory can grow and shrink; chosen to double/halve
❖ Search I/O cost: 1 (2 if the directory is not cached)
❖ Check the last GD bits of h(k)

[Figure] GD = 2; directory 00 01 10 11; Bucket A (LD 2): 4* 12* 32* 16*; Bucket B (LD 2): 1* 13* 41* 17*; Bucket C (LD 2): 10*; Bucket D (LD 2): 7* 15*
Example: Search 17* (10001 → last 2 bits 01); Search 42* (…10)

SLIDE 39

Extendible Hashing: Insert

❖ Search for k; if the data page has space, add the record; done
❖ If the data page has no more space:
  ❖ If LD < GD, create a Split Image of the bucket; LD++ for both the bucket and its split image; insert the record properly
  ❖ If LD == GD, create a Split Image; LD++ for both buckets; insert the record; but also double the directory and GD++! Duplicate the other directory pointers properly
❖ The directory typically grows in spurts
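A minimal executable sketch of this insert procedure, assuming an identity hash h(k) = k and bucket capacity 4 as on the slides (buckets are plain dicts; a real implementation would manage pages and RIDs):

```python
class ExtendibleHash:
    """Minimal extendible-hashing sketch: h(k) = k, bucket capacity 4,
    directory indexed by the last global_depth (GD) bits of the key."""

    def __init__(self, capacity=4):
        self.capacity = capacity
        self.global_depth = 1
        self.directory = [{"depth": 1, "keys": []}, {"depth": 1, "keys": []}]

    def _slot(self, k):
        return k % (2 ** self.global_depth)        # last GD bits of h(k)

    def insert(self, k):
        bucket = self.directory[self._slot(k)]
        if len(bucket["keys"]) < self.capacity:
            bucket["keys"].append(k)
            return
        if bucket["depth"] == self.global_depth:   # LD == GD: GD++, dir doubles
            self.global_depth += 1
            self.directory = self.directory + self.directory
        bucket["depth"] += 1                       # LD++ on the old bucket ...
        image = {"depth": bucket["depth"], "keys": []}  # ... and its split image
        pending = bucket["keys"] + [k]
        bucket["keys"] = []
        for i, b in enumerate(self.directory):     # rehook directory pointers
            if b is bucket and (i >> (bucket["depth"] - 1)) & 1:
                self.directory[i] = image
        for key in pending:                        # re-insert displaced keys
            self.insert(key)

h = ExtendibleHash()
for k in [4, 12, 32, 16, 1, 13, 41, 17, 10, 7, 15]:
    h.insert(k)                # reproduces the GD = 2 state on the slide
h.insert(24)                   # bucket A is full and LD == GD: directory doubles
print(h.global_depth)          # 3
print(h.directory[0]["keys"])  # [32, 16, 24]  (bucket A, dir entry 000)
print(h.directory[4]["keys"])  # [4, 12]       (split image A2, dir entry 100)
```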
SLIDE 40

Extendible Hashing: Insert

[Figure] GD = 2; directory 00 01 10 11 (check last GD bits of h(k)); Bucket A (LD 2): 4* 12* 32* 16*; Bucket B (LD 2): 1* 13* 41* 17*; Bucket C (LD 2): 10*; Bucket D (LD 2): 7* 15*
Example: Insert 24* (…00): no space in bucket A; need to split and LD++; since LD was = GD, GD++ and the directory doubles
After the split: Bucket A (LD 3, dir entry 000): 32* 16* 24*; Bucket A2 (LD 3, dir entry 100): 4* 12*

SLIDE 41

Extendible Hashing: Insert

[Figure] GD = 3; directory 000 .. 111; Bucket A (LD 3): 32* 16* 24*; Bucket A2 (LD 3): 4* 12*; Bucket B (LD 2): 1* 13* 41* 17*; Bucket C (LD 2): 10*; Bucket D (LD 2): 7* 15*
Example: Insert 18* (…010): goes into Bucket C
Example: Insert 25* (…001): need to split bucket B! Since LD < GD, the directory does not double this time; only LD++ on the old bucket, and modify the directory pointers
After the split: Bucket B (LD 3): 1* 41* 17*; Bucket B2 (LD 3): 13*

SLIDE 42

Extendible Hashing: Delete

❖ Search for k; delete the record on the data page
❖ If the data page becomes empty, we can merge it with the data page with the same last GD-1 bits (its “historical split image”) and do LD--; update the directory pointers
❖ If all split images get merged back and all buckets have LD = GD-1, shrink the directory by half; GD--
NB: Never does a bucket’s LD become > GD!

SLIDE 43

Extendible Hashing: Delete

[Figure] GD = 2; directory 00 01 10 11; Bucket A (LD 2): 4* 12*; Bucket B (LD 2): 1* 13* 41* 17*; Bucket C (LD 2): 26*; Bucket D (LD 2): 7* 15*
Example: Delete 26* (…10): Bucket C is now empty; can merge with A, and LD-- (to 1)
Q: Why did we pick A to merge with?
Example: Delete 41* (…01)

In practice, deletes, and thus bucket merges, are rare; so, directory shrinking is even more uncommon

SLIDE 44

Static Hashing vs Extendible Hashing

Q: Why not let N in static hashing grow and shrink too?
❖ The extendible hash directory is typically much smaller than the data pages; with static hashing, reorganizing all data pages is far more expensive
❖ Hashing skew is a common issue for both; in static hashing, this could lead to a large # of overflow pages; in extendible hashing, this could blow up the directory size (this is OK)
❖ If too many data entries share a search key (duplicates), overflow pages are needed for extendible hashing too!
SLIDE 45

Indexing: Outline

❖ Overview and Terminology
❖ B+ Tree Index
❖ Hash Index
❖ Learned Index

SLIDE 46

Statistical Machine Learning 101

❖ An “ML model” is a program/function to approximate a hidden/unknown function based on a set of data points
❖ The set of functions representable by an ML model is called its hypothesis space

f : D_X → D_Y   (Input/features: D_X; Output/target: D_Y)
Hypothesis space: H = {f}

Examples (D_X = R, D_Y = R):
  Linear Regression: H ≅ R^2
  Polynomial Regression: H ≅ R^d
SLIDE 47

❖ Most ML models are parametric; an ML model’s learning algorithm is a program to help pick a specific function from its hypothesis space, i.e., set parameter values
❖ Accuracy of an ML model is quantified by how well the picked function “fits” the observed data
❖ Two kinds of accuracy: “train” vs “test”
  ❖ Train is on the same data seen by the learning algorithm
  ❖ Test is on held-out data; evaluates “generalization”
❖ Different ML models represent different sets of functions and thus typically yield different accuracy

Statistical Machine Learning 101

SLIDE 48

Artificial Neural Networks (ANNs)

❖ Extremely powerful, versatile, and popular ML model
❖ Can be viewed as a generalization of linear regression: append a non-linearity after linear regression (the composite is called a “neuron”); line up many neurons (a “layer”); stack many such layers from input to output
❖ Universal function approximator!
❖ A special form of ANNs can simulate arbitrary Turing Machines!

SLIDE 49

Artificial Neural Networks (ANNs)

❖ Different forms of ANNs are tailored for different “raw” data representations for input and/or output
❖ Multi-Layer Perceptrons (MLPs) deal with a “flat” numeric vector for input and/or output
❖ Recurrent Neural Networks (RNNs) are tailored for sequentially structured inputs (text, time series, etc.); Convolutional Neural Networks (CNNs) for images/spatially structured data, etc.
❖ Bottomline: ANNs can help approximate any function!

f : D_X → D_Y

(Switch to Tim Kraska’s slides)
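Since the deck defers to external slides here, the following is a minimal, hypothetical sketch (not the actual method from those slides) of the learned-index idea behind f : D_X → D_Y: a simple model predicts a key's position in a sorted array, and a bounded local search corrects the prediction.

```python
import bisect

# Hypothetical sketch: "learn" f: key -> position over sorted keys,
# then fix up the model's error with a small local search window.
keys = list(range(0, 2000, 2))  # sorted search keys; positions 0..999

# "Train" a linear model pos ~= a*key + b from the two endpoints.
a = (len(keys) - 1) / (keys[-1] - keys[0])
b = -a * keys[0]

def learned_lookup(k, max_err=8):
    guess = int(a * k + b)                    # model's predicted position
    lo = max(0, guess - max_err)              # error-bounded window
    hi = min(len(keys), guess + max_err + 1)
    i = bisect.bisect_left(keys, k, lo, hi)   # exact fix-up in the window
    return i if i < len(keys) and keys[i] == k else None

print(learned_lookup(1234))  # 617: keys[617] == 1234
```

Here the keys happen to be perfectly linear, so the model's error is zero; real learned indexes bound `max_err` per model from the training data.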

slide-50
SLIDE 50


Sorting: Outline

❖ Overview and Warm-up
❖ Multi-way External Merge Sort (EMS)
❖ EMS and B+ trees

slide-51
SLIDE 51


Motivation for Sorting

❖ User’s SQL query has an ORDER BY clause!
❖ First step of bulk loading of a B+ tree index
❖ Used in implementations of many relational ops: project, join, set ops, group by aggregate, etc. (next topic!)

Q: But sorting is well-known; why should a DBMS bother?
Often, the file (relation) to be sorted will not fit in RAM! Hence, “external” sorting.

slide-52
SLIDE 52


External Sorting: Overview

❖ Goal: Given relation R with N pages, SortKey A, and M buffer pages (often, M << N), sort R on A to get sorted R’
❖ Idea: The sorting algorithm should be disk page I/O-aware!
❖ Desiderata:
  ❖ High efficiency, i.e., low I/O cost, even for very large N
  ❖ Use sequential I/Os rather than random I/Os as much as possible (AMAP)
  ❖ Interleave I/O and computation (DMA); reduce CPU cost too

NB: I/O-aware sorting is also a key part of the implementation of MapReduce/Hadoop!

slide-53
SLIDE 53


Warm-up: 2-way External Merge Sort

1. Sort phase: Read each page into buffer memory; do an “internal” sort (use any popular fast sorting algorithm, e.g., quicksort); write it back to disk (a sorted “run”)
2. Merge phase: Read 2 runs, merge them on the fly, write out a new double-length run; recurse till the whole file is one run!

Idea: Make Merge Sort I/O-aware!
NB: The Sort phase is 1 pass; the Merge phase is often multi-pass!

slide-54
SLIDE 54

Input file:            3,4 | 6,2 | 9,4 | 8,7 | 5,6 | 3,1 | 2
PASS 0 (1-page runs):  3,4 | 2,6 | 4,9 | 7,8 | 5,6 | 1,3 | 2
PASS 1 (2-page runs):  2,3,4,6 | 4,7,8,9 | 1,3,5,6 | 2
PASS 2 (4-page runs):  2,3,4,4,6,7,8,9 | 1,2,3,5,6
PASS 3 (8-page runs):  1,2,2,3,3,4,4,5,6,6,7,8,9


Warm-up: 2-way External Merge Sort

Each pass does 1 read and 1 write of the whole file: 2N page I/Os per pass
Number of passes: Sort phase: 1; Merge phase: ⌈log2(N)⌉

I/O cost of 2-way EMS: 2N(1 + ⌈log2(N)⌉)

In the example above, N=7 pages, so the whole file is sorted with 2*7*(1+3) = 56 I/Os
Q: How to reduce this cost further?
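The trace and cost formula above can be checked with a toy simulation. Pages are modeled as small lists of ints, and the pairwise merge is simulated with `sorted()` for brevity rather than a streaming 2-way merge; the I/O counter charges 2N per pass as in the analysis.

```python
# Toy simulation of 2-way EMS that counts page I/Os.
def two_way_ems(pages):
    io = 2 * len(pages)                # Pass 0 (sort phase): read+write all pages
    runs = [sorted(p) for p in pages]  # N sorted 1-page runs
    while len(runs) > 1:
        # Merge pairs of runs (simulated with a re-sort for brevity).
        runs = [sorted(runs[i] + (runs[i + 1] if i + 1 < len(runs) else []))
                for i in range(0, len(runs), 2)]
        io += 2 * len(pages)           # each merge pass reads+writes the whole file
    return runs[0], io

file = [[3, 4], [6, 2], [9, 4], [8, 7], [5, 6], [3, 1], [2]]
sorted_out, io = two_way_ems(file)
print(io)  # 2N(1 + ceil(log2 N)) = 2*7*(1+3) = 56
```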

slide-55
SLIDE 55


Sorting: Outline

❖ Overview and Warm-up
❖ Multi-way External Merge Sort (EMS)
❖ EMS and B+ trees

slide-56
SLIDE 56


Multi-way EMS: Motivation

Idea: Why not exploit more buffer pages (say, B >> 3)?
Q: How many buffer pages does 2-way EMS use?
  Sort phase: 2 (1 for read, 1 for write)
  Merge phase: 3 (1 for each input run; 1 for merged output)
So, 2-way EMS uses only 3 buffer pages!
Sort phase with B buffers: Read B pages at a time (not just 1 at a time)! Write out ⌈N/B⌉ sorted runs of length B pages each (not length 1). But the I/O cost of the sort phase is still the same!

slide-57
SLIDE 57


Multi-way EMS: B-way Merge Phase

Idea: In 2-way EMS, we merge 2 sorted runs at a time; in multi-way EMS, we merge B−1 sorted runs at a time!

(B−1)-way merge; total # buffer pages used: B

[Figure: B−1 input buffer pages (INPUT 1 … INPUT B−1), one per run, streaming from disk into 1 OUTPUT buffer page, which is written back to disk]

# passes for the Merge Phase reduces to: ⌈log_{B−1}(⌈N/B⌉)⌉
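The merge step above can be sketched with Python's standard `heapq.merge`, which lazily merges already-sorted iterables, mirroring how one buffer page per input run feeds an in-memory min-heap that drips into the single output page.

```python
import heapq

# Sketch of one (B-1)-way merge step over sorted runs.
def multiway_merge(runs):
    return list(heapq.merge(*runs))

runs = [[1, 5, 9], [2, 6], [3, 4, 8], [7]]  # B-1 = 4 sorted runs (so B = 5)
print(multiway_merge(runs))  # [1, 2, 3, 4, 5, 6, 7, 8, 9]
```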

slide-58
SLIDE 58


Multi-way EMS I/O Cost

Overall, # passes = 1 + ⌈log_{B−1}(⌈N/B⌉)⌉
I/O cost per pass = 2N
Total I/O cost of EMS = 2N(1 + ⌈log_{B−1}(⌈N/B⌉)⌉)

Example: File with 100M records of length 0.5KB each; page size is 8KB; number of buffer pages for EMS B=1000
Number of pages N = 100M * 0.5KB / 8KB = 6.25M
Total I/O cost of EMS = 2 × 6.25M × (1 + ⌈log_999(6250)⌉) = 2 × 6.25M × (1 + 2) = 37.5M
(log_999(6250) ≈ 1.27; only the ceiling is needed, giving 2 merge passes)
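The example arithmetic can be checked directly with the cost formula (the function name below is illustrative):

```python
import math

# Total I/O cost of multi-way EMS: 2N * (1 + ceil(log_{B-1}(ceil(N/B))))
def ems_io_cost(n_pages, b_buffers):
    n_runs = math.ceil(n_pages / b_buffers)   # runs produced by the sort phase
    merge_passes = math.ceil(math.log(n_runs, b_buffers - 1)) if n_runs > 1 else 0
    return 2 * n_pages * (1 + merge_passes)

# Slide's example: 100M records * 0.5KB each / 8KB pages = 6.25M pages, B = 1000
n = (100_000_000 * 512) // 8192
print(n, ems_io_cost(n, 1000))  # 6250000 37500000
```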

slide-59
SLIDE 59


Multi-way EMS I/O Cost

Total number of passes = 1 + ⌈log_{B−1}(⌈N/B⌉)⌉

 N      Naive 2-way EMS   B=1K   B=10K   B=100K
 1M            21           3      2       2
 10M           25           3      2       2
 100M          28           3      3       2
 1B            31           4      3       2

(With an 8KB page, N=1B pages is ~7.5TB of data, while B=100K buffer pages is only ~782MB of RAM.)
Only 2 passes to sort up to 74.5TB! (2 is the lower bound for EMS!)
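The rows of the table follow directly from the pass-count formulas; a quick sketch to regenerate them:

```python
import math

# Total passes for multi-way EMS: 1 + ceil(log_{B-1}(ceil(N/B)))
def multiway_passes(n, b):
    return 1 + math.ceil(math.log(math.ceil(n / b), b - 1))

# Naive 2-way EMS: 1 + ceil(log2(N))
def naive_2way_passes(n):
    return 1 + math.ceil(math.log2(n))

for n in (10**6, 10**7, 10**8, 10**9):
    print(f"N={n:>10}: 2-way={naive_2way_passes(n):>2}",
          *(f"B={b}: {multiway_passes(n, b)}" for b in (10**3, 10**4, 10**5)))
```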

slide-60
SLIDE 60


Multi-way EMS: Improvements

❖ While already efficient, some key algorithmic and systems-oriented improvements have been made to multi-way EMS to reduce overall runtime (not just counting I/O cost)
❖ Three prominent improvements:
  1. Replacement sort (aka heap sort) as the internal sort
  2. “Blocked” I/O
  3. Double buffering
slide-61
SLIDE 61


Improvement 1: Replacement Sort

❖ In standard EMS, quicksort is used during the Sort Phase
  ❖ Produces runs of length B pages each, i.e., ⌈N/B⌉ runs
❖ Replacement sort is an alternative for the Sort Phase
  ❖ Produces runs of average length 2B pages each
  ❖ So, the number of runs is reduced on average to ⌈N/2B⌉
  ❖ Maintains a sorted heap in B−2 pages; 1 page for reading; 1 for sorted output
  ❖ Slightly higher CPU cost, but significantly lower I/O cost
(We are skipping the details of this algorithm)
New total I/O cost = 2N(1 + ⌈log_{B−1}(⌈N/2B⌉)⌉)
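Although the slide skips the details, the run-formation idea can be sketched as follows. This is a minimal, illustrative version of replacement selection over individual records, where the heap capacity `cap` stands in for the B−2 pages of heap memory:

```python
import heapq

# Minimal sketch of replacement selection: emit the heap minimum to the
# current run; incoming records smaller than the last emitted value are
# "frozen" for the next run.
def replacement_selection(records, cap):
    it = iter(records)
    heap = []
    for _ in range(cap):                 # fill the heap initially
        try:
            heap.append(next(it))
        except StopIteration:
            break
    heapq.heapify(heap)
    runs, cur, frozen = [], [], []
    for x in it:
        cur.append(heapq.heappop(heap))  # emit current minimum to the run
        if x >= cur[-1]:
            heapq.heappush(heap, x)      # x can still join the current run
        else:
            frozen.append(x)             # x must wait for the next run
        if not heap:                     # everything frozen: close the run
            runs.append(cur)
            cur, heap, frozen = [], frozen, []
            heapq.heapify(heap)
    cur.extend(sorted(heap))             # drain leftovers into current run
    if cur:
        runs.append(cur)
    if frozen:
        runs.append(sorted(frozen))
    return runs

print(replacement_selection([5, 3, 8, 1, 9, 2, 7, 4, 6, 0], 3))
```

On random inputs, runs produced this way average about twice the heap capacity, which is where the 2B-page claim above comes from.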

slide-62
SLIDE 62


Improvement 2: “Blocked” I/O

❖ The Merge Phase did not recognize the distinction between sequential I/O and random I/O!
  ❖ The time difference is not reflected in counting I/O cost
❖ Idea: Read a “block” of b pages of each run at a time!
  ❖ So, only ⌊B/b⌋ − 1 runs can be merged at a time
  ❖ b controls the trade-off of # passes vs. time-per-pass
New “fan-in” of the Merge Phase: F = ⌊B/b⌋ − 1
New total I/O cost = 2N(1 + ⌈log_F(Nr)⌉), where Nr is the # of initial sorted runs
slide-63
SLIDE 63


Improvement 3: Double Buffering

❖ Most machines have DMA; this enables I/O-CPU parallelism
❖ Trivially feasible to exploit DMA in the Sort Phase
❖ But in the Merge Phase, the CPU is blocked by I/O for the runs
❖ Idea: Allocate double the buffers (2b pages) for each run; while the CPU processes one set, read pages (I/O) into the other set!
  ❖ So, only ⌊B/2b⌋ − 1 runs can be merged at a time
New fan-in of the Merge Phase: F = ⌊B/2b⌋ − 1
New total I/O cost = 2N(1 + ⌈log_F(Nr)⌉), where Nr is the # of initial sorted runs

slide-64
SLIDE 64


Sorting: Outline

❖ Overview and Warm-up
❖ Multi-way External Merge Sort (EMS)
❖ EMS and B+ trees

slide-65
SLIDE 65


Using B+ Tree for EMS

❖ Suppose we already have a B+ tree index with the SortKey equal to (or a prefix of) the IndexKey
❖ The data entries of the B+ tree are already in sorted order!
Q: Is it a “good” idea to simply read the leaf level of the B+ tree to achieve the EMS then?
It depends on whether the index is clustered or not! Clustered: good idea! Unclustered: might be really bad!

slide-66
SLIDE 66

[Figure: B+ tree index, with data entries in the leaf level pointing to data pages]

Using Clustered B+ Tree for EMS

❖ Go down the tree to reach the left-most leaf
❖ Scan the leaf pages (data entries) left to right
❖ If AltRecord, done! Otherwise, retrieve the data pages pointed to by successive data entries
❖ I/O cost if AltRecord: height + # leaf pages
❖ I/O cost otherwise: height + # leaf pages + # data pages
❖ Either way, the I/O cost is often << from-scratch EMS!

slide-67
SLIDE 67

[Figure: B+ tree index, with data entries in the leaf level pointing to scattered data pages]

Using Unclustered B+ Tree for EMS

❖ Unclustered means not AltRecord! (Why?)
❖ Same procedure as for the clustered B+ tree
❖ Same I/O “cost” as for a clustered tree with AltRID/AltRIDlist, but with many back-to-back random I/Os; thrashing!
❖ Usually much slower than from-scratch EMS!
Q: But when is this faster than from-scratch EMS?

slide-68
SLIDE 68


External Sorting as Competitive Sport!

The geekiest “sport” in the world: sortbenchmark.org