CSE 132C Database System Implementation, Arun Kumar, Exercise 6 - PowerPoint PPT Presentation



SLIDE 1

Exercise 6 Answers

CSE 132C Database System Implementation

Arun Kumar

SLIDE 2

Common Info: Netflix Schema

Ratings / R:
RatingID | Stars | RateDate | UID | MID
7254     | 4.5   | 12/15/19 | 839 | 123
…        | …     | …        | …   | …

Users / U:
UID | UName   | Age | JoinDate
32  | Alvarez | 39  | 11/02/14
…   | …       | …   | …

Movies / M:
MID | Name     | Year | Director
492 | Parasite | 2019 | Bong Joon-Ho
…   | …        | …    | …

SLIDE 3

Common Info: Netflix Database

  • All attributes in the given Netflix database are of fixed length and
    8 bytes long, except for Director, Name, and UName, which are 40 bytes long.
  • The number of tuples in R, U, and M are 10 billion, 50 million, and
    1 million, respectively.
  • All tables are stored as heap files, with the pages storing the
    fixed-length records in unpacked layout.
  • The foreign keys have no dangling references.
  • In the given instance, every UID in U and every MID in M appears at
    least once in R.

SLIDE 4

Exercise

Q1) Page size is 4096 bytes.

  • A. [9pts] What is the size of each table in pages? Round to the
    nearest thousand.

Tuple sizes: R is 40 B, U is 64 B, and M is 96 B.
NTuples(R) = 10 billion, NTuples(U) = 50 million, NTuples(M) = 1 million.
Page size is 4096 B. Unpacked layout has a slot bitmap (1 bit per slot) and a
slot count (4 B).
So, #tuples/page of R: 4096 >= 40x + x/8 + 4 => max x is 101.
So, #pages of R is NR = ceil(10 billion / 101) ~ 99,010,000.
Likewise, we get NU ~ 794,000 and NM ~ 24,000.
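The slide's arithmetic can be checked with a short script. This is a sketch; `tuples_per_page` and `pages_for` are hypothetical helper names, not from the lecture material.

```python
import math

def tuples_per_page(page_bytes: int, tuple_bytes: int) -> int:
    # Unpacked layout: each of the x slots costs tuple_bytes plus 1 bit
    # in the slot bitmap, and the page keeps a 4-byte slot count, so we
    # want the largest x with tuple_bytes*x + x/8 + 4 <= page_bytes.
    return math.floor((page_bytes - 4) / (tuple_bytes + 1 / 8))

def pages_for(n_tuples: int, page_bytes: int, tuple_bytes: int) -> int:
    return math.ceil(n_tuples / tuples_per_page(page_bytes, tuple_bytes))

P = 4096
NR = pages_for(10_000_000_000, P, 40)  # 99,009,901 ~ 99,010,000
NU = pages_for(50_000_000, P, 64)      # 793,651    ~ 794,000
NM = pages_for(1_000_000, P, 96)       # 23,810     ~ 24,000
```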

SLIDE 5

Exercise

Q1) Page size is 4096 bytes.

  • B. [3pts] Which 2-table key-foreign key join has the larger output in
    terms of cardinality? In terms of arity? In terms of size?

Two KFK joins are possible: R JOIN U and R JOIN M.
Both have the same cardinality (that of R), because neither foreign key has
dangling references.
Both have the same arity, because U and M have the same arity and each foreign
key is a single attribute.
R JOIN M is larger in size, because the attribute set from M is larger than
that from U.

SLIDE 6

Exercise

Q1) Page size is 4096 bytes.

  • C. [5pts] What is the size (in pages) of the output of the 3-table star
    join? Round to the nearest thousand.

Size of output tuple: 40 + 64 + 96 - 2 * 8 = 184 B.
#tuples/page of output: 4096 >= 184x + x/8 + 4 => max x is 22.
#tuples of output is the same as R: 10 billion.
So, #pages of output is ceil(10 billion / 22) ~ 454,545,000.

SLIDE 7

Exercise

Q1) Page size is 4096 bytes.

  • D. [8pts] What is the smallest possible size (in pages) of the output
    of the following query? Round to the nearest thousand.

πUID(U) × πMID(M) − πUID,MID(R)

Size of output tuple: 2 * 8 = 16 B.
#tuples/page of output: 4096 >= 16x + x/8 + 4 => max x is 253.
Observe that the crossproduct output cardinality is fixed: 50 million x
1 million = 50,000 billion. So, to get the smallest possible output
cardinality, we subtract the largest possible number of tuples via the second
projection. Clearly, that cannot exceed NTuples(R) = 10 billion. So, the
smallest output cardinality is 49,990 billion.
So, #pages of output is ceil(49,990 billion / 253) ~ 197,588,933,000.
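As a sanity check on the arithmetic above (a sketch; the variable names are ours):

```python
import math

tpp = math.floor((4096 - 4) / (16 + 1 / 8))  # 253 16-byte output tuples/page
cross = 50_000_000 * 1_000_000               # fixed crossproduct cardinality
min_card = cross - 10_000_000_000            # subtract at most NTuples(R) pairs
pages = (min_card + tpp - 1) // tpp          # ceiling division: ~197,588,933,000
```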

SLIDE 8

Exercise

Q2) Page size is 4096 bytes. You have 1 million buffer frames.

  • A. [6pts] What is the lowest possible I/O cost (in pages) of sorting
    R on Stars using any of the optimizations discussed in the lectures?
    Round to the nearest thousand.

NR = 99.01 million. Note that the output #tuples, #tuples/page, and #pages are
the same as for R. To get the lowest I/O cost, use internal replacement sort
but no double buffering or blocked I/O:

2N × (1 + ⌈log_{B−1}(⌈N/2B⌉)⌉) = 2N × (1 + 1) = 396,040,000
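The cost formula can be evaluated mechanically. This sketch assumes replacement sort's average run length of 2B pages; `sort_cost` is a hypothetical name.

```python
import math

def sort_cost(N: int, B: int) -> int:
    # Pass 0 (replacement sort) leaves ceil(N / 2B) initial runs of ~2B
    # pages each; every later pass merges up to B-1 runs at a time.
    runs = math.ceil(N / (2 * B))
    merge_passes = math.ceil(math.log(runs, B - 1)) if runs > 1 else 0
    # Each pass reads and writes all N pages.
    return 2 * N * (1 + merge_passes)

cost = sort_cost(99_010_000, 1_000_000)  # 2N(1 + 1) = 396,040,000
```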

SLIDE 9

Exercise

Q2) Page size is 4096 bytes. You have 1 million buffer frames.

  • B. [10pts] Suppose page pointers are 8 bytes long and record IDs are 12
    bytes long. What is the lowest possible size (in pages) of an extendible
    hash index built on Users with IndexKey UID using the AltRID alternative?
    Assume the hash function enables uniform hashing without skews. Round to
    the nearest thousand. (Hint: Count the number of slots first.)

IndexKey is the primary key of U. So, #slots = NTuples(U) = 50 million.
Slot size = 8 B (UID) + 12 B (record ID) = 20 B; 4 B per bucket for the local
depth.
#slots/bucket = floor((4096 - 4)/20) = 204.
#buckets = ceil(50 million/204) = 245,099.
The directory has 1 page pointer per bucket; #pointers/page = 4096/8 = 512;
so #directory pages = ceil(245,099/512) = 479.
So, total #pages = 245,099 + 479 ~ 246,000.
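The index sizing can be reproduced step by step (a sketch; the constant names are ours):

```python
import math

PAGE, PTR, RID, KEY, LOCAL_DEPTH = 4096, 8, 12, 8, 4

slot = KEY + RID                                # 20 B per (UID, RID) entry
per_bucket = (PAGE - LOCAL_DEPTH) // slot       # 204 slots fit per bucket page
buckets = math.ceil(50_000_000 / per_bucket)    # 245,099 bucket pages
dir_pages = math.ceil(buckets / (PAGE // PTR))  # 479 directory pages (512 ptrs each)
total = buckets + dir_pages                     # 245,578 ~ 246,000 pages
```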

SLIDE 10

Exercise

Q3) Page size is 4096 bytes. You have 500,000 buffer frames for each query
(independent of the others). What is the lowest possible I/O cost (in pages,
rounded to the nearest thousand) of each of the following queries, regardless
of the data distributions, using only the operator implementations discussed
in the lectures? Include the cost of writing the output in C and D.

  • A. [6pts]

πJoinDate(U)

NU = 794,000. Without extra histogram info on JoinDate, assume the worst-case
dedup count of NTuples(U). The worst-case non-deduplicated intermediate result
T then has NT ~ (8/64) NU ~ 99,250 pages. Since T fits entirely in DRAM, the
overall I/O cost (the output write cost is excluded for A and B) is just one
scan of U: NU = 794,000.

SLIDE 11

Exercise

Q3) Page size is 4096 bytes. You have 500,000 buffer frames for each query
(independent of the others). What is the lowest possible I/O cost (in pages,
rounded to the nearest thousand) of each of the following queries, regardless
of the data distributions, using only the operator implementations discussed
in the lectures? Include the cost of writing the output in C and D.

  • B. [8pts]

γDirector,COUNT(*)(M ⋈ R)

NR = 99.01 million; NM = 24,000; B = 0.5 million frames ~ 2 GB.
Clearly, a hash join is feasible with just one pass over each base table.
Pipeline it into a hash-based group by. The worst-case #entries in the
group-by hash table is NTuples(M) = 1 million. Entry size is 40 B + 8 B =
48 B. So, the max hash table size is just 1.4 * 1 million * 48 B ~ 67 MB,
well under B. So, the whole PQP is fully pipelined and no partitioning is
needed anywhere. The overall I/O cost is just one read of each base table:
NR + NM = 99,034,000.
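The memory-fit argument can be spelled out numerically (a sketch; the 1.4 fudge factor on hash table size follows the convention the answer cites, and the variable names are ours):

```python
NR, NM, B = 99_010_000, 24_000, 500_000  # R pages, M pages, buffer frames

# Worst-case group-by hash table: 1 million entries of 48 B each, padded
# by a 1.4 fudge factor, expressed in 4 KB pages.
ht_bytes = 1_000_000 * 48 * 14 // 10     # 67,200,000 B
ht_pages = (ht_bytes + 4095) // 4096     # ~16,400 pages, well under B

io_cost = NR + NM                        # one scan of each base table
```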
SLIDE 12

Exercise

Q3) Page size is 4096 bytes. You have 500,000 buffer frames for each query
(independent of the others). What is the lowest possible I/O cost (in pages,
rounded to the nearest thousand) of each of the following queries, regardless
of the data distributions, using only the operator implementations discussed
in the lectures? Include the cost of writing the output in C and D.

  • C. [10pts] R ⋈ U ⋈ M

NR = 99.01 million; NU = 794,000; NM = 24,000; B = 0.5 million frames ~ 2 GB.
Clearly, R JOIN U cannot execute with just one read of each base table because
U is larger than available RAM. HJ and SMJ have the same cost given this B.
But R JOIN M can run with just one read of M because M is much smaller than
available RAM. So, use a PQP with a right-deep tree: M as the leftmost input,
joined with the output of U joined with R as the rightmost input. This fully
pipelined plan has join I/O cost = NM + 3(NU + NR) = 299,436,000. The output
size for the write is 454,545,000. So, the total I/O cost is 753,981,000.
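Plugging the page counts into the plan's cost terms (a sketch with our own variable names; 3(NU + NR) is the partitioned-join cost of one partitioning read+write over each input plus one read of the partitions):

```python
NR, NU, NM = 99_010_000, 794_000, 24_000
OUT = 454_545_000             # star-join output size in pages, from Q1C

# Right-deep plan: partitioned join of U and R (3 passes over each),
# pipelined through an in-memory probe of M (read once).
join_io = NM + 3 * (NU + NR)  # 299,436,000
total = join_io + OUT         # 753,981,000 including the output write
```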

SLIDE 13

Exercise

Q3) Page size is 4096 bytes. You have 500,000 buffer frames for each query
(independent of the others). What is the lowest possible I/O cost (in pages,
rounded to the nearest thousand) of each of the following queries, regardless
of the data distributions, using only the operator implementations discussed
in the lectures? Include the cost of writing the output in C and D.

  • D. [10pts]

πUID(U) × πMID(M)

Output tuple size: 2 * 8 = 16 B. So, #tuples/page of output:
4096 >= 16x + x/8 + 4 => max x is 253.
Output cardinality: 50 million x 1 million = 50,000 billion.
So, #output pages is ceil(50,000 billion / 253) ~ 197,628,458,500.
Use BNLJ for the crossproduct. NU = 794,000; NM = 24,000; B = 0.5 million
frames ~ 2 GB. So, the smaller table M fits entirely in DRAM as a single
block. The projections on UID and MID implicitly need no dedup, since these
are primary keys. So, the whole crossproduct needs only 1 read of M entirely
into DRAM and 1 streaming read of U, which adds a small NU + NM to the above.
Total I/O cost: 197,629,276,500.
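The total can be reproduced as follows (a sketch; exact ceiling division gives 197,628,458,499 output pages, which the slide reports as 197,628,458,500):

```python
tpp = 253                             # 16 B output tuples per 4 KB page
cross = 50_000_000 * 1_000_000        # crossproduct cardinality
out_pages = (cross + tpp - 1) // tpp  # ceiling division over output tuples
total = out_pages + 794_000 + 24_000  # plus one read each of U and M
```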

SLIDE 14

Exercise

Q4) Are these pairs of queries equivalent? If you say yes, explain why.
If you say no, provide a counterexample.

  • A. [6pts]

πUID(U) × πMID(M)

πUID,MID(U × M)

  • Yes. The schemas are the same. The output cardinality will always be the
    same, viz., NTuples(U) * NTuples(M), since UID and MID are the primary
    keys. Both outputs contain all pairs of a UID from U and an MID from M.

SLIDE 15

Exercise

Q4) Are these pairs of queries equivalent? If you say yes, explain why.
If you say no, provide a counterexample.

  • B. [8pts]

γYear,MAX(Stars)(R ⋈ M)

γYear,MAX(Stars)(M ⋈ γMID,MAX(Stars)(R))

  • Yes. The schemas are the same. The output cardinality will always be the
    same, viz., the #values of Year in M. The partial pre-aggregation on R in
    the second query reduces its cardinality to the #unique values of MID in
    R. But MID is the join attribute, so the join cardinality is unaffected.
    Multiple MIDs might have the same Year in M, but due to the property of
    MAX, the per-MID MAX values in the second query are then topped up with a
    MAX taken per Year. A MAX of multiple MAX values corresponding to the
    same Year (one per MID) still gives the same global MAX for each Year as
    the first query.
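The decomposability of MAX that this argument relies on can be checked on a toy instance (all values below are made up):

```python
R = [(1, 4.5), (1, 2.0), (2, 5.0), (3, 1.0)]  # (MID, Stars) pairs from R
M = {1: 2019, 2: 2019, 3: 1999}               # MID -> Year from M

# First query: join, then take MAX(Stars) per Year directly.
direct: dict[int, float] = {}
for mid, stars in R:
    year = M[mid]
    direct[year] = max(direct.get(year, stars), stars)

# Second query: pre-aggregate MAX(Stars) per MID, then MAX again per Year.
per_mid: dict[int, float] = {}
for mid, stars in R:
    per_mid[mid] = max(per_mid.get(mid, stars), stars)
rolled: dict[int, float] = {}
for mid, mx in per_mid.items():
    year = M[mid]
    rolled[year] = max(rolled.get(year, mx), mx)

assert direct == rolled  # MAX of per-MID MAXes equals the global MAX per Year
```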

SLIDE 16

Exercise

Q4) Are these pairs of queries equivalent? If you say yes, explain why.
If you say no, provide a counterexample.

  • C. [8pts]

γCOUNT(*)(R ⋈ U ⋈ M)

γCOUNT(*)(πUID,MID(R) ⋈ U ⋈ M)

  • No. The schemas are the same, and the output cardinality after the group
    by (a single tuple) is the same. But in the second query, the
    deduplication of R on (UID, MID) may alter the star join's intermediate
    output cardinality before the group by, because duplicates of (UID, MID)
    across tuples are possible in R. Thus, the output instance values may
    differ. Simplest counterexample: R has [(RID = 1, …, UID = 1, MID = 1),
    (RID = 2, …, UID = 1, MID = 1)]. U has [(UID = 1, …)]. M has
    [(MID = 1, …)]. The first query gives a count of 2; the second gives a
    count of 1.
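The counterexample can be executed directly (a hypothetical mini-instance mirroring the slide):

```python
R = [(1, 1, 1), (2, 1, 1)]  # (RatingID, UID, MID): two ratings, same (UID, MID)
U, M = {1}, {1}             # a single user and a single movie

# First query: COUNT(*) over the raw star join.
q1 = sum(1 for _, uid, mid in R if uid in U and mid in M)

# Second query: dedup R on (UID, MID) first, then join and count.
dedup = {(uid, mid) for _, uid, mid in R}
q2 = sum(1 for uid, mid in dedup if uid in U and mid in M)

assert (q1, q2) == (2, 1)   # the two queries disagree, so not equivalent
```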