CSE 232A Graduate Database Systems Fall 2019 Arun Kumar 1 About - - PowerPoint PPT Presentation

cse 232a graduate database systems
SMART_READER_LITE
LIVE PREVIEW

CSE 232A Graduate Database Systems Fall 2019 Arun Kumar 1 About - - PowerPoint PPT Presentation

CSE 232A Graduate Database Systems Fall 2019 Arun Kumar 1 About Myself 2009: Bachelors in CSE from IIT Madras Summers: 110F! 200916: MS and PhD in CS from UW-Madison PhD thesis area: Data systems for ML workloads Winters: 40F!


slide-1
SLIDE 1

CSE 232A
 Graduate Database Systems

1

Fall 2019 Arun Kumar

slide-2
SLIDE 2

2009: Bachelors in CSE from IIT Madras 2009—16: MS and PhD in CS from UW-Madison PhD thesis area: Data systems for ML workloads 2016-: Asst. Prof. at UC San Diego CSE 2019-: + Asst. Prof. at UC San Diego HDSI

Summers: 110F! Winters: —40F! Ahh! :)

About Myself

2

slide-3
SLIDE 3

3

What is this course about? Why take it?

slide-4
SLIDE 4
  • 1. IBM’s Watson beats humans in Jeapordy!
slide-5
SLIDE 5

5

How did Watson achieve that?

slide-6
SLIDE 6

6

Watson devoured LOTS of data!

slide-7
SLIDE 7

7

  • 2. “Structured” data with Google search results
slide-8
SLIDE 8

8

How does Google know that?

slide-9
SLIDE 9

9

Google also devours LOTS of data!

slide-10
SLIDE 10
  • 3. Amazon’s “spot-on” recommendations

10

slide-11
SLIDE 11

11

How does Amazon know that?

slide-12
SLIDE 12

12

You guessed it! LOTS and LOTS of data!

Analysis

slide-13
SLIDE 13

13

And innumerable “traditional” applications

slide-14
SLIDE 14

14

slide-15
SLIDE 15

15

Large-scale data management systems are the cornerstone of many digital applications, both modern and traditional

slide-16
SLIDE 16

16

The Age of “Big Data”/“Data Science”

slide-17
SLIDE 17

17

Data data everywhere, All the wallets did shrink! Data data everywhere, Nor any moment to think?

slide-18
SLIDE 18

18

CSE 232A will get you thinking about the fundamentals of database systems 1. How are large structured datasets stored and organized? 2. How are “queries” handled? 3. How to make the system faster? 4. Deeper and more recent issues

slide-19
SLIDE 19

19

What this course is NOT about

❖ NOT a course on basics of relational algebra or SQL ❖ Take CSE 132A instead (pre-requisite for 232A!) ❖ NOT a course on how to use an RDBMS for database- backed applications (triggers, physical tuning, etc.) ❖ Take CSE 132B instead ❖ NOT a course on distributed systems or transactions ❖ Take CSE 223B instead

http://cseweb.ucsd.edu/classes/fa19/cse232-a

slide-20
SLIDE 20

20

And now for the (boring) logistics …

slide-21
SLIDE 21

21

Prerequisites

❖ CSE 132A (or equivalent) is essential ❖ CSE 120 is also helpful ❖ For all other cases, email the instructor with proper justification. A waiver can be considered.

http://cseweb.ucsd.edu/classes/fa19/cse232-a

slide-22
SLIDE 22

22

Course Administrivia

❖ Lectures: MWF 3:00-3:50pm, Ledden Auditorium Attending ALL lectures is mandatory! ❖ Instructor: Arun Kumar; arunkk@eng.ucsd.edu Office hours: Wed 4-5pm, 3218 CSE (EBU3b) ❖ TAs: Nikos Koulouris, Kaiqi Yao, Aman Achpal, and Allen Ordookhanians ❖ Piazza: TBD

http://cseweb.ucsd.edu/classes/fa19/cse232-a

slide-23
SLIDE 23

23

Grading

❖ Midterm Exam 1: 20% Date: Friday, October 25; in-class ❖ Midterm Exam 2: 20% Date: Monday, November 25; in-class ❖ Final Exam: 60% (cumulative) Date: Friday, December 13, 3-6pm; Room TBD ❖ (Optional/Extra Credit) 6 Paper Reviews: 3%

http://cseweb.ucsd.edu/classes/fa19/cse232-a

slide-24
SLIDE 24

24

Course Outline

  • 1. How are large structured datasets stored?

Storage and file layout

  • 2. How are “queries” handled?

Indexing, sorting, relational operator implementations, and query processing

  • 3. How to make the system faster?

Query optimization and parallelism

  • 4. Recent issues and trends

Data systems for ML, Data integration and cleaning, semi-structured data, ML for RDBMSs

http://cseweb.ucsd.edu/classes/fa19/cse232-a

slide-25
SLIDE 25

25

The primary focus will be the relational data model and Relational Database Management Systems (RDBMS)

slide-26
SLIDE 26

26

Relational model in a nutshell

Basically, Relation:Table :: Pilot:Driver (okay, a bit more) The model formalizes “operations” to manipulate relations RatingID Rating Date UserID MovieID 1 3.5 08/27/15 23294 20 2 4.0 07/20/15 4232 293 3 2.5 08/02/15 54551 846 … … … … …

slide-27
SLIDE 27

27

Relational model in a nutshell

Invented by E. F. Codd in 1970s at IBM Research in CA

slide-28
SLIDE 28

28

Relational DBMS in a nutshell

A software system to implement the relational model, i.e., enable users to manage data stored as relations

slide-29
SLIDE 29

29

Relational DBMS in a nutshell

First RDBMSs: System R (IBM) and Ingres (Berkeley) in 1970s A rare photo of the original System R manual Mike Stonebraker won the Turing Award in 2015!

slide-30
SLIDE 30

30

Relational DBMS in a nutshell

RDBMS software is now a USD 40+ billions/year industry! Numerous open source RDBMSs also popular People still start companies about what are basically RDBMSs!

slide-31
SLIDE 31

31

Course Textbook

Prescribed: “Database Management Systems” 3rd Edition Raghu Ramakrishnan and Johannes Gehrke Optional: “Database Systems: The Complete Book” by Garcia-Molina, Widom, and Ullman “Big Data Integration” by Dong and Srivastava Aka The “Cow Book” Which cow are you?

slide-32
SLIDE 32

32

Tentative Course Schedule

Week Topic 0-1 Introduction; Data Storage (Disks; Files; New Hardware) 2-3 Indexing (B+ Tree; Hash Indexes; Learned Indexes); Sorting 3-4 Relational Operator Implementations; Query Processing 4 Midterm Exam 1 on Friday, 10/25 5-6 Query Optimization; Materialized Views 6-7 Parallel RDBMSs; Dataflow Systems; Cloud RDBMSs 7-8 Data Systems for ML Workloads 9 Midterm Exam 2 on Monday, 11/25 9-10 Data Integration and Cleaning; Semi-Structured Data 10 More ML for RDBMSs; Recap 11 Final Exam on Friday, 12/13

slide-33
SLIDE 33

33

General Dos and Do NOTs

Do: ❖ Raise your hand before asking questions during lectures ❖ Discuss papers in the reading list with peers and others ❖ Participate in class discussions; and on Piazza, if you like ❖ Use “CSE232A:” as subject prefix for all emails to me/TAs Do NOT: ❖ Plagiarize or share content for your paper reviews ❖ Harass, cut off, or be disrespectful to others in class ❖ Use email as primary communication mechanism for doubts/questions instead of Office Hours ❖ Record or quote the instructor’s anecdotes out of class! ☺

slide-34
SLIDE 34

34

Questions?

slide-35
SLIDE 35

35

Example Relational DB: Netflix!

RatingID NumStars Timestamp UserID MovieID 1 3.5 08/27/15 79 20 … … … … … UserID Name Age JoinDate 79 Alice 23 01/10/13 … … … MovieID Name ReleaseDate Director 20 Inception 07/13/2010 Christopher Nolan … … …

Ratings Users Movies

slide-36
SLIDE 36

36

Recap: Relational Model

slide-37
SLIDE 37

37

Relational Model: Basic Terms

RatingID NumStars Timestamp UserID MovieID 1 3.5 08/27/15 79 20 2 4.0 07/20/15 4232 293 3 2.5 08/02/15 54551 846 … … … … …

What is a Relation? A glorified table! What are Attributes? These things What are Domains? The mathematical “domains” for the attributes Integers Real … What is Arity? Ratings Number of attributes

slide-38
SLIDE 38

38

Relational Model: Basic Terms

RatingID NumStars Timestamp UserID MovieID 1 3.5 08/27/15 79 20 2 4.0 07/20/15 4232 293 3 2.5 08/02/15 54551 846 … … … … …

What are Tuples? What is Cardinality? These things Number of tuples Ratings

slide-39
SLIDE 39

39

Referring to “tuples”: Two notations

1. Without using attribute names (positional/sequence) 2. Using attribute names (named/set)

RatingID NumStars Timestamp UserID MovieID 1 3.5 08/27/15 79 20 2 4.0 07/20/15 4232 293 3 2.5 08/02/15 54551 846 … … … … …

Ratings (R) t[1] = 3.5 t.NumStars = 3.5

slide-40
SLIDE 40

40

Relational Model: Basic Terms

RatingID NumStars Timestamp UserID MovieID 1 3.5 08/27/15 79 20 2 4.0 07/20/15 4232 293 3 2.5 08/02/15 54551 846 … … … … …

What is Schema? The relation name, and the name and logical descriptions of the attributes (including domains) Aka “metadata” Ratings

slide-41
SLIDE 41

41

Relational Model: Basic Terms

RatingID NumStars Timestamp UserID MovieID 1 3.5 08/27/15 79 20 2 4.0 07/20/15 4232 293 3 2.5 08/02/15 54551 846 … … … … …

What is an Instance? A given relation populated with a set of tuples (loose analogy: schema:instance::type:value in PL) Instance 1

RatingID NumStars Timestamp UserID MovieID 3292 1.5 06/27/14 794 10 294122 4.0 07/10/14 232 329 74423 0.5 03/08/14 8451 846 … … … … …

Instance 2 Ratings

slide-42
SLIDE 42

42

Relational Model: Basic Terms

RatingID NumStars Timestamp UserID MovieID 1 3.5 08/27/15 79 20 … … … … …

What is a Relational Database?

UserID Name Age JoinDate 79 Alice 23 01/10/13 … … … MovieID Name ReleaseDate Director 20 Inception 07/13/2010 Christopher Nolan … … …

A collection of relations; similarly, schema vs. instance Ratings Users Movies

slide-43
SLIDE 43

43

“Write Operations” on a Relation

❖ Insert Add tuples to a relation ❖ Delete Remove tuples from a relation (typically based on “predicate” matches, e.g., “NumStars <= 4.5” ❖ Modify Logically, deletes + inserts, but typically implemented as in-place updates to a relation instance

slide-44
SLIDE 44

44

“Read Operations” on a Relation

❖ “Select” Select all tuples from Ratings with “UserID == 19” ❖ “Project” Select only Director attribute from Movies ❖ “Aggregate” Select Average of all NumStars in Ratings

And a few more formal “algebraic” operations …

slide-45
SLIDE 45

45

Recap: Relational Algebra

slide-46
SLIDE 46

Basic Relational Operations

Select Project Rename Cross Product (aka Cartesian Product) Set Operations: Union Set Difference

46

slide-47
SLIDE 47

Select

47

RatingID NumStars Timestamp UserID MovieID 1 3.5 08/27/15 79 20 2 4.0 07/20/15 80 20 3 2.5 08/02/14 79 16 4 4.5 03/05/14 80 16

Ratings (R)

Example: Get all ratings with 4.0 or more stars “Selection condition/predicate” Select “Operator”

slide-48
SLIDE 48

Select

48

RatingID NumStars Timestamp UserID MovieID 1 3.5 08/27/15 79 20 2 4.0 07/20/15 80 20 3 2.5 08/02/14 79 16 4 4.5 03/05/14 80 16

R

RatingID NumStars Timestamp UserID MovieID 2 4.0 07/20/15 80 20 4 4.5 03/05/14 80 16

R’

Schema preserved Subset of tuples (satisfying selection condition)

slide-49
SLIDE 49

Complex Selection Conditions

49

RatingID NumStars Timestamp UserID MovieID 1 3.5 08/27/15 79 20 2 4.0 07/20/15 80 20 3 2.5 08/02/14 79 16 4 4.5 03/05/14 80 16

R

RatingID NumStars Timestamp UserID MovieID 1 3.5 08/27/15 79 20

R’

slide-50
SLIDE 50

Basic Relational Operations

Select Project Rename Cross Product (aka Cartesian Product) Set Operations: Union Set Difference

50

slide-51
SLIDE 51

Project

51

RatingID NumStars Timestamp UserID MovieID 1 3.5 08/27/15 79 20 2 4.0 07/20/15 80 20 3 2.5 08/02/14 79 16 4 4.5 03/05/14 80 16

Ratings (R)

Example: Get all MovieIDs “Projection list”

slide-52
SLIDE 52

Project

52

RatingID NumStars Timestamp UserID MovieID 1 3.5 08/27/15 79 20 2 4.0 07/20/15 80 20 3 2.5 08/02/14 79 16 4 4.5 03/05/14 80 16

R

MovieID 20 16

R’

Schema reduced (to projection list) Tuple values “deduplicated” (slightly different semantics in SQL)

slide-53
SLIDE 53

Composition of Relational Ops

53

RatingID NumStars Timestamp UserID MovieID 1 3.5 08/27/15 79 20 2 4.0 07/20/15 80 20 3 2.5 08/02/14 79 16 4 4.5 03/05/14 80 16

R

Example: Get UserID and NumStars of ratings with less than 3 stars UserID NumStars 79 2.5

R’

RelOp(Relation) gives a Relation

slide-54
SLIDE 54

Basic Relational Operations

Select Project Rename Cross Product (aka Cartesian Product) Set Operations: Union Set Difference

54

slide-55
SLIDE 55

Rename

55

RatingID NumStars Timestamp UserID MovieID 1 3.5 08/27/15 79 20 2 4.0 07/20/15 80 20 3 2.5 08/02/14 79 16 4 4.5 03/05/14 80 16

Ratings (R)

Example: Rename Timestamp to RateDate

ρC(2−>RateDate)(R) ρRatingID,NumStars,RateDate,UserID,MovieID(R)

slide-56
SLIDE 56

Rename

56

RatingID NumStars Timestamp UserID MovieID 1 3.5 08/27/15 79 20 2 4.0 07/20/15 80 20 3 2.5 08/02/14 79 16 4 4.5 03/05/14 80 16

R R’

Instance preserved Schema modified

RatingID NumStars RateDate UserID MovieID 1 3.5 08/27/15 79 20 2 4.0 07/20/15 80 20 3 2.5 08/02/14 79 16 4 4.5 03/05/14 80 16

ρC(2−>RateDate)(R)

slide-57
SLIDE 57

Basic Relational Operations

Select Project Rename Cross Product (aka Cartesian Product) Set Operations: Union Set Difference

57

slide-58
SLIDE 58

Cross Product

58

UserID Name Age JoinDate 79 Alice 23 01/10/13 80 Bob 41 05/10/13 MovieID Name ReleaseYear Director 20 Inception 2010 Christopher Nolan 16 Avatar 2009 Jim Cameron

Users (U) Movies (M)

❖ Cartesian product (construct all pairs of tuples across tables) ❖ Schema of output “concatenates” the input schemas ❖ Be careful with attribute name conflicts! Use Rename op

slide-59
SLIDE 59

Cross Product

59

UserID Name Age JoinDate 79 Alice 23 01/10/13 80 Bob 41 05/10/13 MovieID Name ReleaseYear Director 20 Inception 2010 Christopher Nolan 16 Avatar 2009 Jim Cameron

Users (U) Movies (M)

UserID U.Name Age JoinDate MovieID M.Name Release Year Director 79 Alice 23 01/10/13 20 Inception 2010 Christopher Nolan 79 Alice 23 01/10/13 16 Avatar 2009 Jim Cameron 80 Bob 41 05/10/13 20 Inception 2010 Christopher Nolan 80 Bob 41 05/10/13 16 Avatar 2009 Jim Cameron

ρC(1−>U.Name(U) × ρC(1−>M.Name(M)

slide-60
SLIDE 60

Basic Relational Operations

Select Project Rename Cross Product (aka Cartesian Product) Set Operations: Union Set Difference

60

slide-61
SLIDE 61

Union

61

RatingID NumStars Timestamp UserID MovieID 1 3.5 08/27/15 79 20 2 4.0 07/20/15 80 20 3 2.5 08/02/14 79 16

R1

RatingID NumStars Timestamp UserID MovieID 3 2.5 08/02/14 79 16 4 4.5 03/05/14 80 16 5 5.0 06/09/13 135 20

R2

Union of sets of tuples (instances) Inputs must have identical schema: “Union-compatible”

slide-62
SLIDE 62

Union

62

RatingID NumStars Timestamp UserID MovieID 1 3.5 08/27/15 79 20 2 4.0 07/20/15 80 20 3 2.5 08/02/14 79 16 4 4.5 03/05/14 80 16 5 5.0 06/09/13 135 20

slide-63
SLIDE 63

Basic Relational Operations

Select Project Rename Cross Product (aka Cartesian Product) Set Operations: Union Set Difference

63

slide-64
SLIDE 64

Set Difference

64

RatingID NumStars Timestamp UserID MovieID 1 3.5 08/27/15 79 20 2 4.0 07/20/15 80 20 3 2.5 08/02/14 79 16

R1

RatingID NumStars Timestamp UserID MovieID 3 2.5 08/02/14 79 16 4 4.5 03/05/14 80 16 5 5.0 06/09/13 135 20

R2

Set difference of sets of tuples (instances) Inputs must be “Union-compatible”

slide-65
SLIDE 65

Set Difference

65

RatingID NumStars Timestamp UserID MovieID 1 3.5 08/27/15 79 20 2 4.0 07/20/15 80 20 3 2.5 08/02/14 79 16

R1

RatingID NumStars Timestamp UserID MovieID 3 2.5 08/02/14 79 16 4 4.5 03/05/14 80 16 5 5.0 06/09/13 135 20

R2

RatingID NumStars Timestamp UserID MovieID 1 3.5 08/27/15 79 20 2 4.0 07/20/15 80 20

R’ = R1 – R2

slide-66
SLIDE 66

Basic Relational Operations

Select Project Rename Cross Product (aka Cartesian Product) Set Operations: Union Set Difference

66

slide-67
SLIDE 67

Derived and Other Relational Ops

Set Operation: Intersection Join Group By Aggregate

67

slide-68
SLIDE 68

Intersection

68

RatingID NumStars Timestamp UserID MovieID 1 3.5 08/27/15 79 20 2 4.0 07/20/15 80 20 3 2.5 08/02/14 79 16

R1

RatingID NumStars Timestamp UserID MovieID 3 2.5 08/02/14 79 16 4 4.5 03/05/14 80 16 5 5.0 06/09/13 135 20

R2

Set intersection of sets of tuples (instances) Inputs must be “Union-compatible”

slide-69
SLIDE 69

Intersection

69

RatingID NumStars Timestamp UserID MovieID 1 3.5 08/27/15 79 20 2 4.0 07/20/15 80 20 3 2.5 08/02/14 79 16

R1

RatingID NumStars Timestamp UserID MovieID 3 2.5 08/02/14 79 16 4 4.5 03/05/14 80 16 5 5.0 06/09/13 135 20

R2

RatingID NumStars Timestamp UserID MovieID 3 2.5 08/02/14 79 16

slide-70
SLIDE 70

Derived and Other Relational Ops

Set Operation: Intersection Join Group By Aggregate

70

slide-71
SLIDE 71

Join

71

❖ Equivalent Select on Cross Product, but “bypasses” full X ❖ Perhaps the most intensively studied Rel Op! ❖ Several “types” of Joins: Natural Join and Equi-Join Condition Join (aka Theta Join) Semi-Join, Inner Join, Outer Join, Anti-Join, etc.

R . /JoinCondition M

σJoinCondition(R × M)

slide-72
SLIDE 72

Natural Join

72

MovieID Name ReleaseYear Director 20 Inception 2010 Christopher Nolan 16 Avatar 2009 Jim Cameron

Movies (M)

RatingID NumStars Timestamp UserID MovieID 1 3.5 08/27/15 79 20 2 4.0 07/20/15 80 20 3 2.5 08/02/14 79 16 4 4.5 03/05/14 80 16

Ratings (R)

R . / M

slide-73
SLIDE 73

Natural Join

73

Rating ID NumS tars Timesta mp UserI D MovieI D Name Release Year Director 1 3.5 08/27/15 79 20 Inception 2010 Christopher Nolan 2 4.0 07/20/15 80 20 Inception 2010 Christopher Nolan 3 2.5 08/02/14 79 16 Avatar 2009 Jim Cameron 4 4.5 03/05/14 80 16 Avatar 2009 Jim Cameron

❖ “Join attributes”: attributes that determine matching tuples Have same name in both inputs! “MovieID” in R and M ❖ Implicit equality condition on join attributes If > 1 pair, implicit logical “and” of all equality terms ❖ Output schema concatenates input schemas But join attributes appear only once in output (Project)

R . / M

slide-74
SLIDE 74

Equi-Join

74

❖ Generalization of the Natural Join ❖ Attribute names of “join attributes” need not be the same ❖ EqualityCondition is a general boolean expression (logical “and”, and/or “or”) of terms with equality predicates only ❖ Join attributes from both R and M in output (no Project) Perhaps the most important and common type of Join! Lots of R&D on efficient implementations!

R . /EqualityCondition M

slide-75
SLIDE 75

Equi-Join: Example

75

J K 10 x 20 y 30 x

R1(J,K)

P Q x 4 y 9 x 8

R2(P,Q) T(J,K,P,Q)

J K P Q 10 x x 4 20 y y 9 30 x x 8 10 x x 8 30 x x 4

T(J, K, P, Q) = R1 . /K=P R2

slide-76
SLIDE 76

(Primary) Key-Foreign Key Join

76

❖ A special kind of equi-join ❖ One of the join attributes is the (Primary) Key of an input relation; the other is a Foreign Key in the other relation RatingID NumStars Timestamp UID MovieID UserID Name Age JoinDate

Ratings Users

Also a common and important (sub) type of join with even more specialized efficient implementations!

Ratings . /UID=UserID Users

slide-77
SLIDE 77

Condition Join (aka “Theta” Join)

77

❖ Generalization of the Equi-Join

R . /JoinCondition M

slide-78
SLIDE 78

Condition Join: Example

78

J K 10 x 20 y 30 x

R1(J,K)

P Q x 4 y 9 x 12

R2(P,Q) T(J,K,P,Q)

J K P Q 10 x x 4 20 y x 4 20 y y 9 30 x x 4 30 x y 9 30 x x 12 Perhaps the most difficult type of Join to implement efficiently!

T(J, K, P, Q) = R1 . /J/2>Q R2

slide-79
SLIDE 79

Join Expressions

79

Can compose many joins into a single complex expression RatingID NumStars Timestamp UserID MovieID UserID UName Age JoinDate MovieID Name ReleaseYear Director

Ratings Users Movies

  • Q. What do we get as the output?

UserI D UNa me Ag e JoinDat e RatingI D NumSta rs Timesta mp MovieI D Name ReleaseY Ear Directo r

AllStuff

slide-80
SLIDE 80

80

Taxonomy of Joins

All kinds of joins Inner joins Outer joins Semi joins Anti joins Theta joins Equi joins Natural joins Key-Foreign Key Joins “Snowflake” joins “Star” joins

slide-81
SLIDE 81

Derived and Other Relational Ops

Set Operation: Intersection Join Group By Aggregate

81

slide-82
SLIDE 82

Group By Aggregate

82

❖ NOT a part of relational algebra, but “Extended RA”! ❖ Useful for “analytics” queries that aggregate numerical data RatingID NumStars Timestamp UserID MovieID

Ratings

What is the average rating for each movie? How many movies has each user rated? ❖ Standard 5 numerical aggregations supported in SQL: Count, Sum, Average, Maximum, and Minimum Extra: Median, Mode, Variance, Standard Deviation, etc.

slide-83
SLIDE 83

Group By Aggregate

83

“Grouping Attributes” (Subset of R’s attributes) A numerical attribute in R “Aggregate Function” (SUM, COUNT, AVG, MAX, MIN)

❖ Output schema will have X and an extra numerical attribute (result of the aggregate function) ❖ Can list multiple aggregate functions in the same operation

γX,Agg(Y )(R)

slide-84
SLIDE 84

Group By Aggregate

84

RatingID NumStars Timestamp UserID MovieID 1 3.5 08/27/15 79 20 2 4.0 07/20/15 80 20 3 2.5 08/02/14 79 16 4 4.5 03/05/14 80 16

R

MovieID AVG(NumStars) 20 3.75 16 3.5

R’

What is the average rating for each movie?

γMovieID,AV G(NumStars)(R)

slide-85
SLIDE 85

Group By Aggregate

85

AVG(Age) 25.75

U’

What is the average age of the users? UserID Name Age JoinDate 79 Alice 23 01/10/13 80 Bob 41 05/10/13 123 Carol 19 08/09/14 420 Dan 20 03/01/15

Users (U)

The set of Grouping Attributes can be empty too!

γAV G(Age)(U)

slide-86
SLIDE 86

Derived and Other Relational Ops

Set Operation: Intersection Join Group By Aggregate

86

slide-87
SLIDE 87

87

Recap: SQL

slide-88
SLIDE 88

Basic Form of an SQL Query

88

SELECT [DISTINCT] target-list FROM relation-list [WHERE condition]

List of attributes to project Optional List of relations (possibly with “aliases”) Selection/join condition (optional)

slide-89
SLIDE 89

What does it mean logically?

89

SELECT [DISTINCT] target-list FROM relation-list [WHERE condition]

1. Cross-product of relations in relation-list 2. If condition given, apply it to filter out tuples 3. Remove attributes not present in target-list 4. If DISTINCT given, deduplicate tuples in result

The above is only a logical interpretation. It is NOT a “plan” an RDBMS would use in general to run an SQL query!

slide-90
SLIDE 90

90

Example SQL Query

SELECT M.Name FROM Movies M WHERE M.Year = 2013

Movies (M)

MovieID Name Year Director 20 Inception 2010 Christopher Nolan 16 Avatar 2009 Jim Cameron 53 Gravity 2013 Alfonso Cuaron 74 Blue Jasmine 2013 Woody Allen Example: Get the names of movies released in 2013

πName(σY ear=2013(M))

slide-91
SLIDE 91

91

Example SQL Query

SELECT M.Name FROM Movies M WHERE M.Year = 2013

Movies (M)

MovieID Name Year Director 20 Inception 2010 Christopher Nolan 16 Avatar 2009 Jim Cameron 53 Gravity 2013 Alfonso Cuaron 74 Blue Jasmine 2013 Woody Allen Name Gravity Blue Jasmine

slide-92
SLIDE 92

92

Example SQL Query

SELECT * FROM Movies M WHERE M.Year = 2013

Movies (M)

MovieID Name Year Director 20 Inception 2010 Christopher Nolan 16 Avatar 2009 Jim Cameron 53 Gravity 2013 Alfonso Cuaron 74 Blue Jasmine 2013 Woody Allen MovieID Name Year Director 53 Gravity 2013 Alfonso Cuaron 74 Blue Jasmine 2013 Woody Allen

slide-93
SLIDE 93

93

Example SQL Query

SELECT M.Name FROM Movies M WHERE M.Year <> 2013

Movies (M)

MovieID Name Year Director 20 Inception 2010 Christopher Nolan 16 Avatar 2009 Jim Cameron 53 Gravity 2013 Alfonso Cuaron 74 Blue Jasmine 2013 Woody Allen Example: Get the names of movies from years other than 2013

πName(σY ear6=2013(M))

slide-94
SLIDE 94

94

Example SQL Query

SELECT M.Year FROM Movies M

Movies (M)

MovieID Name Year Director 20 Inception 2010 Christopher Nolan 16 Avatar 2009 Jim Cameron 53 Gravity 2013 Alfonso Cuaron 74 Blue Jasmine 2013 Woody Allen Example: For which years do we have movie data?

πY ear(M)

slide-95
SLIDE 95

95

Example SQL Query

SELECT M.Year FROM Movies M

Movies (M)

MovieID Name Year Director 20 Inception 2010 Christopher Nolan 16 Avatar 2009 Jim Cameron 53 Gravity 2013 Alfonso Cuaron 74 Blue Jasmine 2013 Woody Allen Year 2010 2009 2013 2013

SQL allows repetitions of tuples in a relation! Not the same semantics as RA’s Project Called “bag semantics” vs. RA’s set semantics

slide-96
SLIDE 96

96

DISTINCT in SQL

SELECT DISTINCT M.Year FROM Movies M

Movies (M)

MovieID Name Year Director 20 Inception 2010 Christopher Nolan 16 Avatar 2009 Jim Cameron 53 Gravity 2013 Alfonso Cuaron 74 Blue Jasmine 2013 Woody Allen Year 2010 2009 2013

DISTINCT needed to achieve set semantics of RA’s Project in SQL

slide-97
SLIDE 97

97

Aliases in SQL

Movies (M)

MovieID Name Year Director 20 Inception 2010 Christopher Nolan 16 Avatar 2009 Jim Cameron 53 Gravity 2013 Alfonso Cuaron 74 Blue Jasmine 2013 Woody Allen

SELECT M.Name FROM Movies M WHERE M.Year = 2013

Why bother with the alias? Not needed here!

SELECT Name FROM Movies WHERE Year = 2013

slide-98
SLIDE 98

98

Aliases in SQL – Useful for Joins!

Movies (M)

MovieID Name Year DirectorID

SELECT M.Name FROM Movies M, Directors D WHERE D.Name = “Jim Cameron” AND M.DirectorID = D.DID

Aliases help disambiguate attributes with the same name from multiple relations (or even a self-join!) Example: Get names of movies directed by “Jim Cameron”

Directors (D)

DID Name Age

slide-99
SLIDE 99

99

More SQL Examples

Example: Get names of movies released in 2013 by Woody Allen or some other director 50 years or older

SELECT M.Name FROM Movies M, Directors D WHERE (D.Name = “Woody Allen” OR D.Age >= 50) AND M.Year = 2013 AND M.DirectorID = D.DID Movies (M)

MovieID Name Year DirectorID

Directors (D)

DID Name Age

slide-100
SLIDE 100

100

LIKE in SQL

SELECT DISTINCT M.Director FROM Movies M WHERE M.Name LIKE “Blue%”

Movies (M)

MovieID Name Year Director 20 Inception 2010 Christopher Nolan 16 Avatar 2009 Jim Cameron 53 Gravity 2013 Alfonso Cuaron 74 Blue Jasmine 2013 Woody Allen Example: Get the directors of movies that start with “Blue” “%” matches any number of characters; “_” matches one

slide-101
SLIDE 101

101

ORDER BY in SQL

SELECT M.Name FROM Movies M WHERE M.Year = 2013

Movies (M)

MovieID Name Year Director 20 Inception 2010 Christopher Nolan 16 Avatar 2009 Jim Cameron 53 Gravity 2013 Alfonso Cuaron 74 Blue Jasmine 2013 Woody Allen Name Gravity Blue Jasmine

ORDER BY M.Name

Name Blue Jasmine Gravity

Useful for data readability Ordering defined by domain semantics Can specify DESC; multiple attributes

slide-102
SLIDE 102

102

LIMIT in SQL

SELECT M.Name FROM Movies M WHERE M.Year >= 2010

MovieID Name Year Director 20 Inception 2010 Christopher Nolan 16 Avatar 2009 Jim Cameron 53 Gravity 2013 Alfonso Cuaron 74 Blue Jasmine 2013 Woody Allen

ORDER BY M.Year

Year Inception Gravity Blue Jasmine

Also useful for data readability Prevents “flooding” of screen with data Be wary of using it without ORDER BY!

LIMIT 2

slide-103
SLIDE 103

103

Aggregate Functions in SQL

SELECT COUNT(*) FROM Movies M WHERE M.Year > 2010 Movies (M)

MovieID Name Year Director 20 Inception 2010 Christopher Nolan 16 Avatar 2009 Jim Cameron 53 Gravity 2013 Alfonso Cuaron 91 Interstellar 2014 Christopher Nolan How many movies came out after 2010?

slide-104
SLIDE 104

104

Aggregate Functions in SQL

SELECT COUNT(*) FROM Movies M WHERE M.Year > 2010

Movies (M)

MovieID Name Year Director 20 Inception 2010 Christopher Nolan 16 Avatar 2009 Jim Cameron 53 Gravity 2013 Alfonso Cuaron 91 Interstellar 2014 Christopher Nolan How many movies came out after 2010? COUNT(*) 2

slide-105
SLIDE 105

105

5 Native Aggregate Functions in SQL

❖ COUNT ([DISTINCT] attribute) ❖ AVG ([DISTINCT] attribute) ❖ SUM ([DISTINCT] attribute) ❖ MAX (attribute) ❖ MIN (attribute)

slide-106
SLIDE 106

106

Aggregate Functions in SQL

SELECT COUNT(DISTINCT M.Director) FROM Movies M Movies (M)

MovieID Name Year Director 20 Inception 2010 Christopher Nolan 16 Avatar 2009 Jim Cameron 53 Gravity 2013 Alfonso Cuaron 91 Interstellar 2014 Christopher Nolan How many directors do we have?

slide-107
SLIDE 107

Aggregate Functions in SQL

107

RatingID Stars UserID MovieID 1 3.5 79 42 2 4.0 80 20 3 2.5 79 53 4 4.5 123 42

Ratings (R) SELECT R.MovieID, MAX(R.Stars) FROM Ratings R

Which MovieID(s) have the highest rating? Other attributes NOT allowed in the target-list as such!

slide-108
SLIDE 108

Aggregate Functions in SQL

108

RatingID Stars UserID MovieID 1 3.5 79 42 2 4.0 80 20 3 2.5 79 53 4 4.5 123 42

Ratings (R) SELECT DISTINCT R.MovieID FROM Ratings R WHERE R.Stars = (SELECT MAX(R2.Stars) FROM Ratings R2)

Which MovieID(s) have the highest rating?

slide-109
SLIDE 109

Group By Aggregate in SQL

109

SELECT [DISTINCT] target-list FROM relation-list [WHERE condition] GROUP BY grouping-list HAVING group-condition

X Condition on each group in aggregate target-list must be in this form: X’, Agg(Y) Subset of X

γX,Agg(Y )(R)

slide-110
SLIDE 110

Group By Aggregate in SQL

110

RatingID Stars UserID MovieID 1 3.5 79 42 2 4.0 80 20 3 2.5 79 53 4 4.5 123 42

Ratings (R)

What is the average rating for each movie?

SELECT R.MovieID, AVG(R.Stars) AS AvgRating FROM Ratings R GROUP BY R.MovieID

γMovieID,AVG(Stars)(R)

slide-111
SLIDE 111

Group By Aggregate in SQL

111

RatingID Stars UserID MovieID 1 3.5 79 42 2 4.0 80 20 3 2.5 79 53 4 4.5 123 42

Ratings (R)

SELECT R.MovieID, AVG(R.Stars) AS AvgRating FROM Ratings R GROUP BY R.MovieID

MovieID AvgRating 20 4.0 42 4.0 53 2.5

One tuple in output per unique value of R.MovieID (aka “group”)

slide-112
SLIDE 112

112

  • Q1. Which of the following is not a basic operation in

relational algebra?

A B C D

σ ∪ ρ . /

Review Questions

slide-113
SLIDE 113

113

  • Q2. What type of join is the following NOT an example of?

RID NumStars RateDate UID MID UID UName Age JoinDate MID MName Year Director

R U M A B C D

R . / U

Inner join Outer join Natural join Theta join

Review Questions

slide-114
SLIDE 114

114

  • Q3. How many movies did “Jim Cameron” direct?

γCOUNT(∗)(σDirector=“Jim Cameron”(M))

γReleaseYear,COUNT(∗)(σDirector=“Jim Cameron”(M))

COUNT(∗)(Director=“Jim Cameron”(M . / R)) σDirector=“Jim Cameron”(γCOUNT(∗)(M))

A B C D

RID NumStars RateDate UID MID UID UName Age JoinDate MID MName Year Director

R U M

Review Questions

slide-115
SLIDE 115

115

  • Q4. Which of the following attributes is a primary key?

RID NumStars RateDate UID MID UID UName Age JoinDate MID MName Year Director

R U M A B C D

U.UID R.UID R.MID M.Year

Review Questions

slide-116
SLIDE 116

116

  • Q5. Which of the following SQL features do not have a

counterpart operation in extended relational algebra?

A B C D

SELECT DISTINCT WHERE LIMIT GROUP BY

Review Questions

slide-117
SLIDE 117

117

  • Q6. What is the cardinality of this query’s output?

A B C D

1 2 3 4 RID NumStars RateDate UID MID 1 3.5 08/27/15 79 20 2 4.0 07/20/15 80 20 3 2.5 08/02/14 79 16 4 4.5 03/05/14 80 16 5 5.0 06/09/13 135 20

SELECT COUNT(DISTINCT MID) FROM R R

Review Questions

slide-118
SLIDE 118

118

  • Q7. Get the year and director of all movies from the last

decade that have the term “Avengers” in their title.

A

SELECT Year, Director from M WHERE Year >= 2008 AND MName LIKE “%Avengers%” MID MName Year Director

M B

SELECT Year, Director from M WHERE Year >= 2008 AND MName LIKE “%Avengers”

C

SELECT Year, Director from M WHERE Year >= 2008 AND MName LIKE “Avengers%”

D

SELECT Year, Director from M WHERE Year >= 2008 AND MName = “Avengers”

Review Questions

slide-119
SLIDE 119

119

RatingID Stars UserID MovieID

Ratings (R)

  • Q8. Write an SQL query to get the number of 5-star ratings

for each movie directed by “Christopher Nolan” SELECT R.MovieID, COUNT(R.Stars) AS NumHighRatings FROM Ratings R, Movies M WHERE M.Director = “Christopher Nolan” AND R.MovieID = M.MovieID AND R.Stars = 5 GROUP BY R.MovieID

Movies (M)

MovieID Name Year Director

Review Questions

slide-120
SLIDE 120

120

Advanced SQL Operations (Optional)

slide-121
SLIDE 121

UNION in SQL

121

Get the IDs of users that have rated a movie directed by “Ang Lee” or a movie that released in 2013 RID Stars UID MID MID Name Year Director

Ratings (R) Movies (M)

SELECT R.UID FROM Ratings R, Movies M WHERE R.MID = M.MID AND M.Director = “Ang Lee” UNION SELECT R.UID FROM Ratings R, Movies M WHERE R.MID = M.MID AND M.Year = 2013 Union-compatible!

slide-122
SLIDE 122

Semantics of UNION in SQL

122

MID Name Year Director 20 Inception 2010 Christopher Nolan 42 Life of Pi 2012 Ang Lee 53 Gravity 2013 Alfonso Cuaron RID Stars UID MID 1 3.5 79 42 2 4.0 80 20 3 2.5 79 53 4 4.5 123 42 R

M

SELECT R.UID FROM Ratings R, Movies M WHERE R.MID = M.MID AND M.Director = “Ang Lee” UNION SELECT R.UID FROM Ratings R, Movies M WHERE R.MID = M.MID AND M.Year = 2013

UID 79 UID 79 123

slide-123
SLIDE 123

Semantics of UNION in SQL

123

UNION

UID 79 123 UID 79 UID 79 123

UNION implicitly deduplicates tuples (unlike SELECT)!

  • Q. How to retain duplicates with UNION?

UNION ALL

UID 79 123 UID 79 UID 79 79 123

slide-124
SLIDE 124

INTERSECT in SQL

124

MID Name Year Director 20 Inception 2010 Christopher Nolan 42 Life of Pi 2012 Ang Lee 53 Gravity 2013 Alfonso Cuaron RID Stars UID MID 1 3.5 79 42 2 4.0 80 20 3 2.5 79 53 4 4.5 123 42 R

M

SELECT R.UID FROM Ratings R, Movies M WHERE R.MID = M.MID AND M.Director = “Ang Lee” INTERSECT SELECT R.UID FROM Ratings R, Movies M WHERE R.MID = M.MID AND M.Year = 2013

UID 79 UID 79 123 UID 79

INTERSECT

slide-125
SLIDE 125

EXCEPT (Set Difference) in SQL

125

MID Name Year Director 20 Inception 2010 Christopher Nolan 42 Life of Pi 2012 Ang Lee 53 Gravity 2013 Alfonso Cuaron RID Stars UID MID 1 3.5 79 42 2 4.0 80 20 3 2.5 79 53 4 4.5 123 42 R

M

SELECT R.UID FROM Ratings R, Movies M WHERE R.MID = M.MID AND M.Director = “Ang Lee” EXCEPT SELECT R.UID FROM Ratings R, Movies M WHERE R.MID = M.MID AND M.Year = 2013

UID 79 UID 79 123 UID 123

EXCEPT

slide-126
SLIDE 126

The Contentious Bag Semantics!

126

UID 79 79 79 123 123 80 UID 79 123 123 92

UNION ALL

UID 79 79 79 79 123 123 123 123 80 92

Add the number

  • f repetitions
slide-127
SLIDE 127

The Contentious Bag Semantics!

127

UID 79 79 79 123 123 80 UID 79 123 123 92

EXCEPT ALL

UID 79 79 80

Subtract the number

  • f repetitions
slide-128
SLIDE 128

The Contentious Bag Semantics!

128

UID 79 79 79 123 123 80 UID 79 123 123 92

INTERSECT ALL

UID 79 123 123

Minimum of the number of repetitions