CSE 132C Database System Implementation, Arun Kumar, Exercise 6 - PowerPoint PPT Presentation



SLIDE 1

Exercise 6 Answers

CSE 132C Database System Implementation

Arun Kumar

SLIDE 2

Common Info: Netflix Schema

Ratings / R:
RatingID | Stars | RateDate | UID | MID
7254     | 4.5   | 12/15/19 | 839 | 123
…        | …     | …        | …   | …

Users / U:
UID | UName   | Age | JoinDate
32  | Alvarez | 39  | 11/02/14
…   | …       | …   | …

Movies / M:
MID | Name     | Year | Director
492 | Parasite | 2019 | Bong Joon-Ho
…   | …        | …    | …

SLIDE 3

Common Info: Netflix Database

  • All attributes in the given Netflix database are of fixed length and
    8 bytes long, except for Director, Name, and UName, which are 40 bytes long.
  • The number of tuples in R, U, and M are 10 billion, 50 million, and
    1 million, respectively.
  • All tables are stored as heap files, with the pages storing the
    fixed-length records in unpacked layout.
  • The foreign keys have no dangling references.
  • In the given instance, every UID in U and every MID in M appears at
    least once in R.

SLIDE 4

Exercise

Q1) Page size is 4096 bytes.

  • A. [9pts] What is the size of each table in pages? Round to the
    nearest thousand.

Tuple sizes: R is 40 B, U is 64 B, and M is 96 B.
NTuples(R) = 10 billion, NTuples(U) = 50 million, NTuples(M) = 1 million.
Page size is 4096 B. Unpacked layout has a slot bitmap (1 bit per slot) and a
slot count (4 B).
So, #tuples/page of R: 4096 >= 40x + x/8 + 4 => max x is 101.
So, #pages of R is NR = ceil(10 billion / 101) ~ 99,010,000.
Likewise, we get NU ~ 794,000 and NM ~ 24,000.
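The slide's arithmetic can be checked with a short script. This is a sketch; `tuples_per_page` and `pages_for` are hypothetical helper names, not from the lecture material.

```python
import math

def tuples_per_page(page_bytes: int, tuple_bytes: int) -> int:
    # Unpacked layout: each of the x slots costs tuple_bytes plus 1 bit
    # in the slot bitmap, and the page keeps a 4-byte slot count, so we
    # want the largest x with tuple_bytes*x + x/8 + 4 <= page_bytes.
    return math.floor((page_bytes - 4) / (tuple_bytes + 1 / 8))

def pages_for(n_tuples: int, page_bytes: int, tuple_bytes: int) -> int:
    return math.ceil(n_tuples / tuples_per_page(page_bytes, tuple_bytes))

P = 4096
NR = pages_for(10_000_000_000, P, 40)  # 99,009,901 ~ 99,010,000
NU = pages_for(50_000_000, P, 64)      # 793,651    ~ 794,000
NM = pages_for(1_000_000, P, 96)       # 23,810     ~ 24,000
```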

SLIDE 5

Exercise

Q1) Page size is 4096 bytes.

  • B. [3pts] Which 2-table key-foreign key join has the larger output in
    terms of cardinality? In terms of arity? In terms of size?

Two KFK joins are possible: R JOIN U and R JOIN M.
Both have the same cardinality (that of R), because neither foreign key has
dangling references.
Both have the same arity, because U and M have the same arity and each foreign
key is a single attribute.
R JOIN M is larger in size, because the attribute set from M is larger than
that from U.

SLIDE 6

Exercise

Q1) Page size is 4096 bytes.

  • C. [5pts] What is the size (in pages) of the output of the 3-table star
    join? Round to the nearest thousand.

Size of output tuple: 40 + 64 + 96 - 2 * 8 = 184 B.
#tuples/page of output: 4096 >= 184x + x/8 + 4 => max x is 22.
#tuples of output is the same as R: 10 billion.
So, #pages of output is ceil(10 billion / 22) ~ 454,545,000.

SLIDE 7

Exercise

Q1) Page size is 4096 bytes.

  • D. [8pts] What is the smallest possible size (in pages) of the output
    of the following query? Round to the nearest thousand.

πUID(U) × πMID(M) − πUID,MID(R)

Size of output tuple: 2 * 8 = 16 B.
#tuples/page of output: 4096 >= 16x + x/8 + 4 => max x is 253.
Observe that the crossproduct output cardinality is fixed: 50 million x
1 million = 50,000 billion. So, to get the smallest possible output
cardinality, we subtract the largest possible number of tuples via the second
projection. Clearly, that cannot exceed NTuples(R) = 10 billion. So, the
smallest output cardinality is 49,990 billion.
So, #pages of output is ceil(49,990 billion / 253) ~ 197,588,933,000.
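As a sanity check on the arithmetic above (a sketch; the variable names are ours):

```python
import math

tpp = math.floor((4096 - 4) / (16 + 1 / 8))  # 253 16-byte output tuples/page
cross = 50_000_000 * 1_000_000               # fixed crossproduct cardinality
min_card = cross - 10_000_000_000            # subtract at most NTuples(R) pairs
pages = (min_card + tpp - 1) // tpp          # ceiling division: ~197,588,933,000
```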

SLIDE 8

Exercise

Q2) Page size is 4096 bytes. You have 1 million buffer frames.

  • A. [6pts] What is the lowest possible I/O cost (in pages) of sorting
    R on Stars using any of the optimizations discussed in the lectures?
    Round to the nearest thousand.

NR = 99.01 million. Note that the output #tuples, #tuples/page, and #pages are
the same as for R. To get the lowest I/O cost, use internal replacement sort
but no double buffering or blocked I/O:

2N × (1 + ⌈log_{B−1}(⌈N/2B⌉)⌉) = 2N × (1 + 1) = 396,040,000
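The cost formula can be evaluated mechanically. This sketch assumes replacement sort's average run length of 2B pages; `sort_cost` is a hypothetical name.

```python
import math

def sort_cost(N: int, B: int) -> int:
    # Pass 0 (replacement sort) leaves ceil(N / 2B) initial runs of ~2B
    # pages each; every later pass merges up to B-1 runs at a time.
    runs = math.ceil(N / (2 * B))
    merge_passes = math.ceil(math.log(runs, B - 1)) if runs > 1 else 0
    # Each pass reads and writes all N pages.
    return 2 * N * (1 + merge_passes)

cost = sort_cost(99_010_000, 1_000_000)  # 2N(1 + 1) = 396,040,000
```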

SLIDE 9

Exercise

Q2) Page size is 4096 bytes. You have 1 million buffer frames.

  • B. [10pts] Suppose page pointers are 8 bytes long and record IDs are 12
    bytes long. What is the lowest possible size (in pages) of an extendible
    hash index built on Users with IndexKey UID using the AltRID alternative?
    Assume the hash function enables uniform hashing without skews. Round to
    the nearest thousand. (Hint: Count the number of slots first.)

IndexKey is the primary key of U. So, #slots = NTuples(U) = 50 million.
Slot size = 8 B (UID) + 12 B (record ID) = 20 B; 4 B per bucket for the local
depth.
#slots/bucket = floor((4096 - 4)/20) = 204.
#buckets = ceil(50 million/204) = 245,099.
The directory has 1 page pointer per bucket; #pointers/page = 4096/8 = 512;
so #directory pages = ceil(245,099/512) = 479.
So, total #pages = 245,099 + 479 ~ 246,000.
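The index sizing can be reproduced step by step (a sketch; the constant names are ours):

```python
import math

PAGE, PTR, RID, KEY, LOCAL_DEPTH = 4096, 8, 12, 8, 4

slot = KEY + RID                                # 20 B per (UID, RID) entry
per_bucket = (PAGE - LOCAL_DEPTH) // slot       # 204 slots fit per bucket page
buckets = math.ceil(50_000_000 / per_bucket)    # 245,099 bucket pages
dir_pages = math.ceil(buckets / (PAGE // PTR))  # 479 directory pages (512 ptrs each)
total = buckets + dir_pages                     # 245,578 ~ 246,000 pages
```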

SLIDE 10

Exercise

Q3) Page size is 4096 bytes. You have 500,000 buffer frames for each query
(independent of the others). What is the lowest possible I/O cost (in pages,
rounded to the nearest thousand) of each of the following queries, regardless
of the data distributions, using only the operator implementations discussed
in the lectures? Include the cost of writing the output in C and D.

  • A. [6pts]

πJoinDate(U)

NU = 794,000. Without extra histogram info on JoinDate, assume the worst-case
dedup count of NTuples(U). The worst-case non-deduplicated intermediate result
T then has NT ~ (8/64) NU ~ 99,250 pages. Since T fits entirely in DRAM, the
overall I/O cost (the output write cost is excluded for A and B) is just one
scan of U: NU = 794,000.

SLIDE 11

Exercise

Q3) Page size is 4096 bytes. You have 500,000 buffer frames for each query
(independent of the others). What is the lowest possible I/O cost (in pages,
rounded to the nearest thousand) of each of the following queries, regardless
of the data distributions, using only the operator implementations discussed
in the lectures? Include the cost of writing the output in C and D.

  • B. [8pts]

γDirector,COUNT(*)(M ⋈ R)

NR = 99.01 million; NM = 24,000; B = 0.5 million frames ~ 2 GB.
Clearly, a hash join is feasible with just one pass over each base table.
Pipeline it into a hash-based group by. The worst-case #entries in the
group-by hash table is NTuples(M) = 1 million. Entry size is 40 B + 8 B =
48 B. So, the max hash table size is just 1.4 * 1 million * 48 B ~ 67 MB,
well under B. So, the whole PQP is fully pipelined and no partitioning is
needed anywhere. The overall I/O cost is just one read of each base table:
NR + NM = 99,034,000.
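The memory-fit argument can be spelled out numerically (a sketch; the 1.4 fudge factor on hash table size follows the convention the answer cites, and the variable names are ours):

```python
NR, NM, B = 99_010_000, 24_000, 500_000  # R pages, M pages, buffer frames

# Worst-case group-by hash table: 1 million entries of 48 B each, padded
# by a 1.4 fudge factor, expressed in 4 KB pages.
ht_bytes = 1_000_000 * 48 * 14 // 10     # 67,200,000 B
ht_pages = (ht_bytes + 4095) // 4096     # ~16,400 pages, well under B

io_cost = NR + NM                        # one scan of each base table
```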
SLIDE 12

Exercise

Q3) Page size is 4096 bytes. You have 500,000 buffer frames for each query
(independent of the others). What is the lowest possible I/O cost (in pages,
rounded to the nearest thousand) of each of the following queries, regardless
of the data distributions, using only the operator implementations discussed
in the lectures? Include the cost of writing the output in C and D.

  • C. [10pts] R ⋈ U ⋈ M

NR = 99.01 million; NU = 794,000; NM = 24,000; B = 0.5 million frames ~ 2 GB.
Clearly, R JOIN U cannot execute with just one read of each base table because
U is larger than available RAM. HJ and SMJ have the same cost given this B.
But R JOIN M can run with just one read of M because M is much smaller than
available RAM. So, use a PQP with a right-deep tree: M as the leftmost input,
joined with the output of U joined with R as the rightmost input. This fully
pipelined plan has join I/O cost = NM + 3(NU + NR) = 299,436,000. The output
size for the write is 454,545,000. So, the total I/O cost is 753,981,000.
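Plugging the page counts into the plan's cost terms (a sketch with our own variable names; 3(NU + NR) is the partitioned-join cost of one partitioning read+write over each input plus one read of the partitions):

```python
NR, NU, NM = 99_010_000, 794_000, 24_000
OUT = 454_545_000             # star-join output size in pages, from Q1C

# Right-deep plan: partitioned join of U and R (3 passes over each),
# pipelined through an in-memory probe of M (read once).
join_io = NM + 3 * (NU + NR)  # 299,436,000
total = join_io + OUT         # 753,981,000 including the output write
```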

SLIDE 13

Exercise

Q3) Page size is 4096 bytes. You have 500,000 buffer frames for each query
(independent of the others). What is the lowest possible I/O cost (in pages,
rounded to the nearest thousand) of each of the following queries, regardless
of the data distributions, using only the operator implementations discussed
in the lectures? Include the cost of writing the output in C and D.

  • D. [10pts]

πUID(U) × πMID(M)

Output tuple size: 2 * 8 = 16 B. So, #tuples/page of output:
4096 >= 16x + x/8 + 4 => max x is 253.
Output cardinality: 50 million x 1 million = 50,000 billion.
So, #output pages is ceil(50,000 billion / 253) ~ 197,628,458,500.
Use BNLJ for the crossproduct. NU = 794,000; NM = 24,000; B = 0.5 million
frames ~ 2 GB. So, the smaller table M fits entirely in DRAM as a single
block. The projections on UID and MID implicitly need no dedup, since these
are primary keys. So, the whole crossproduct needs only 1 read of M entirely
into DRAM and 1 streaming read of U, which adds a small NU + NM to the above.
Total I/O cost: 197,629,276,500.
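The total can be reproduced as follows (a sketch; exact ceiling division gives 197,628,458,499 output pages, which the slide reports as 197,628,458,500):

```python
tpp = 253                             # 16 B output tuples per 4 KB page
cross = 50_000_000 * 1_000_000        # crossproduct cardinality
out_pages = (cross + tpp - 1) // tpp  # ceiling division over output tuples
total = out_pages + 794_000 + 24_000  # plus one read each of U and M
```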

SLIDE 14

Exercise

Q4) Are these pairs of queries equivalent? If you say yes, explain why.
If you say no, provide a counterexample.

  • A. [6pts]

πUID(U) × πMID(M)

πUID,MID(U × M)

  • Yes. The schemas are the same. The output cardinality will always be the
    same, viz., NTuples(U) * NTuples(M), since UID and MID are the primary
    keys. Both outputs contain all pairs of a UID from U and an MID from M.

SLIDE 15

Exercise

Q4) Are these pairs of queries equivalent? If you say yes, explain why.
If you say no, provide a counterexample.

  • B. [8pts]

γYear,MAX(Stars)(R ⋈ M)

γYear,MAX(Stars)(M ⋈ γMID,MAX(Stars)(R))

  • Yes. The schemas are the same. The output cardinality will always be the
    same, viz., the #values of Year in M. The partial pre-aggregation on R in
    the second query reduces its cardinality to the #unique values of MID in
    R. But MID is the join attribute, so the join cardinality is unaffected.
    Multiple MIDs might have the same Year in M, but due to the property of
    MAX, the per-MID MAX values in the second query are then topped up with a
    MAX taken per Year. A MAX of multiple MAX values corresponding to the
    same Year (one per MID) still gives the same global MAX for each Year as
    the first query.
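The decomposability of MAX that this argument relies on can be checked on a toy instance (all values below are made up):

```python
R = [(1, 4.5), (1, 2.0), (2, 5.0), (3, 1.0)]  # (MID, Stars) pairs from R
M = {1: 2019, 2: 2019, 3: 1999}               # MID -> Year from M

# First query: join, then take MAX(Stars) per Year directly.
direct: dict[int, float] = {}
for mid, stars in R:
    year = M[mid]
    direct[year] = max(direct.get(year, stars), stars)

# Second query: pre-aggregate MAX(Stars) per MID, then MAX again per Year.
per_mid: dict[int, float] = {}
for mid, stars in R:
    per_mid[mid] = max(per_mid.get(mid, stars), stars)
rolled: dict[int, float] = {}
for mid, mx in per_mid.items():
    year = M[mid]
    rolled[year] = max(rolled.get(year, mx), mx)

assert direct == rolled  # MAX of per-MID MAXes equals the global MAX per Year
```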

SLIDE 16

Exercise

Q4) Are these pairs of queries equivalent? If you say yes, explain why.
If you say no, provide a counterexample.

  • C. [8pts]

γCOUNT(*)(R ⋈ U ⋈ M)

γCOUNT(*)(πUID,MID(R) ⋈ U ⋈ M)

  • No. The schemas are the same, and the output cardinality after the group
    by (a single tuple) is the same. But in the second query, the
    deduplication of R on (UID, MID) may alter the star join's intermediate
    output cardinality before the group by, because duplicates of (UID, MID)
    across tuples are possible in R. Thus, the output instance values may
    differ. Simplest counterexample: R has [(RID = 1, …, UID = 1, MID = 1),
    (RID = 2, …, UID = 1, MID = 1)]. U has [(UID = 1, …)]. M has
    [(MID = 1, …)]. The first query gives a count of 2; the second gives a
    count of 1.
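The counterexample can be executed directly (a hypothetical mini-instance mirroring the slide):

```python
R = [(1, 1, 1), (2, 1, 1)]  # (RatingID, UID, MID): two ratings, same (UID, MID)
U, M = {1}, {1}             # a single user and a single movie

# First query: COUNT(*) over the raw star join.
q1 = sum(1 for _, uid, mid in R if uid in U and mid in M)

# Second query: dedup R on (UID, MID) first, then join and count.
dedup = {(uid, mid) for _, uid, mid in R}
q2 = sum(1 for uid, mid in dedup if uid in U and mid in M)

assert (q1, q2) == (2, 1)   # the two queries disagree, so not equivalent
```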