Querying Relational Data: Algebra Gerome Miklau UMass Amherst - - PowerPoint PPT Presentation

querying relational data algebra
SMART_READER_LITE
LIVE PREVIEW

Querying Relational Data: Algebra Gerome Miklau UMass Amherst - - PowerPoint PPT Presentation

Querying Relational Data: Algebra Gerome Miklau UMass Amherst CMPSCI 645 Database Systems Jan 31, 2008 Some slide content courtesy of Zack Ives, Ramakrishnan & Gehrke, Dan Suciu, Ullman & Widom Next lectures Today


slide-1
SLIDE 1

Querying Relational Data: Algebra

Gerome Miklau

UMass Amherst CMPSCI 645 – Database Systems

Jan 31, 2008

Some slide content courtesy of Zack Ives, Ramakrishnan & Gehrke, Dan Suciu, Ullman & Widom

slide-2
SLIDE 2

Next lectures

  • Today

– Relational model, relational algebra

  • Next Tuesday

– SQL

  • Homework 1 will be on these topics
slide-3
SLIDE 3

The Relational Model

  • The relational data model (Codd, 1970):

– Data independence: details of physical storage are hidden from users – High-level declarative query language

  • say what you want, not how to compute it.
  • mathematical foundation

– A theory of normalization guides the design of relations

slide-4
SLIDE 4

Relational Database: Definitions

  • Relational database: a set of relations
  • Relation: made up of 2 parts:

– Instance : a table, with rows and columns. – Schema : specifies name of relation, plus

name and type/domain of each column.

Restriction: all attributes are of atomic type, no nested tables

Students(sid: string, name: string, login: string, age: integer, gpa: real).

slide-5
SLIDE 5

Relational instances: tables

Arity (number of attributes) is 5 Students

column, attribute, field row, tuple

Attribute value A relation is a set of tuples: no tuple can occur more than once

– Real systems may allow duplicates for efficiency or other reasons – we’ll come back to this.

slide-6
SLIDE 6

Relational Query Languages

  • Query languages: Allow manipulation and retrieval
  • f data from a database.
  • Query Languages != programming languages!

– QLs not expected to be “Turing complete”. – QLs not intended to be used for complex calculations. – QLs support easy, efficient access to large data sets.

slide-7
SLIDE 7

Preliminaries

  • A query is applied to one or more relation

instances

  • The result of a query is a relation instance.
  • Input and output schema:

– Schema of input relations for a query are fixed – The schema for the result of a given query is also fixed:

determined by definition of query language constructs.

Query Q: R1..Rn → R’

slide-8
SLIDE 8

What is an “Algebra”

  • Mathematical system consisting of:

– Operands --- variables or values from which new values can be constructed. – Operators --- symbols denoting procedures that construct new values from given values.

slide-9
SLIDE 9

What is the Relational Algebra?

  • An algebra whose operands are

relations or variables that represent relations.

  • Operators are designed to do the most

common things that we need to do with relations in a database.

– The result is an algebra that can be used as a query language for relations.

slide-10
SLIDE 10

Relational Algebra

  • Operates on relations, i.e. sets

– Later: we discuss how to extend this to bags

  • Five operators:

– Union: ∪ – Difference: - – Selection: σ – Projection: Π – Cartesian Product: ×

  • Derived or auxiliary operators:

– Intersection, complement – Joins (natural,equi-join, theta join) – Renaming: ρ – Division: /

slide-11
SLIDE 11

Example Database

sid name 1 Jill 2 Bo 3 Maya fid name 1 Diao 2 Saul 8 Weems sid cid 1 645 1 683 3 635 cid name sem 645 DB F05 683 AI S05 635 Arch F05 fid cid 1 645 2 683 8 635

STUDENT Takes COURSE PROFESSOR Teaches

slide-12
SLIDE 12
  • 1. Union and 2. Difference

sid name 1 Jill 2 Bo 3 Maya

R1

sid name 1 Jill 4 Bob

R2

sid name 2 Bo 3 Maya sid name 1 Jill 2 Bo 3 Maya 4 Bob

R1 – R2 R1 ∪ R2

slide-13
SLIDE 13

What about Intersection ?

  • It is a derived operator
  • R1 ∩ R2 = R1 – (R1 – R2)
  • Also expressed as a join (will see later)

R1 R2 R1 – R2

slide-14
SLIDE 14
  • 3. Selection
  • Returns all tuples which satisfy a

condition

  • Notation: σc(R)
  • Examples

σCID > 600 (Course) σname = “AI” (Course)

  • The condition c can be =, <, ≤, >, ≥, <>

cid name sem 645 DB F05 683 AI S05 635 Arch F05

Course

slide-15
SLIDE 15
  • 4. Projection
  • Eliminates columns, then removes duplicates
  • Notation: Π A1,…,An (R)
  • Example: project cid and name

Π cid, name (Course) Output schema: Answer(cid, name)

cid name sem 645 DB F05 683 AI S05 645 DB S05

Course

cid name 645 DB 683 AI

Answer Π

slide-16
SLIDE 16
  • 5. Cartesian Product
  • Each tuple in R1 with each tuple in R2
  • Notation: R1 × R2
  • Very rare in practice; mainly used to

express joins

Also called “Cross Product”

slide-17
SLIDE 17

Cartesian Product

17

sid cid 1 645 1 683 3 635 sid name 1 Jill 2 Bo

Student Takes Student × Takes

sid name sid cid 1 Jill 1 645 1 Jill 1 683 1 Jill 3 635 2 Bo 1 645 2 Bo 1 683 2 Bo 3 635

slide-18
SLIDE 18

Renaming

  • Changes the schema, not the instance
  • Notation: ρ B1,…,Bn (R)
  • Example:

ρcourseID, cname, term (Course)

cid name sem 645 DB F05 683 AI S05 645 DB S05

Course

courseID

cname term

645 DB F05 683 AI S05 645 DB S05

ρ

slide-19
SLIDE 19

Natural Join

  • Notation: R1 R2
  • Meaning: R1 R2 = ΠA(σC(R1 × R2))
  • Where:

– The selection σC checks equality of all common attributes – The projection eliminates the duplicate common attributes

slide-20
SLIDE 20

Natural join example

20

sid name 1 Jill 2 Bo 3 Maya sid cid 1 645 1 683 3 635

Takes Student

sid name cid 1 Jill 645 1 Jill 683 3 Maya 635

Student Takes

slide-21
SLIDE 21

Example Database

sid name 1 Jill 2 Bo 3 Maya fid name 1 Diao 2 Saul 8 Weems sid cid 1 645 1 683 3 635 cid name sem 645 DB F05 683 AI S05 635 Arch F05 fid cid 1 645 2 683 8 635

STUDENT Takes COURSE PROFESSOR Teaches

slide-22
SLIDE 22

Natural join questions

  • Given the schemas R(A, B, C, D), S(A, C, E),

what is the schema of R S ?

  • Given R(A, B, C), S(D, E), what is R S ?
  • Given R(A, B), S(A, B), what is R S ?

– R(A,B,C,D,E) – Cartesian Product – Intersection

slide-23
SLIDE 23

Theta Join

  • A join that involves a predicate
  • R1 θ R2 = σ θ (R1 × R2)
  • Here θ can be any condition:

=, <, ≠, ≤, >, ≥

Example: Student sid<sid Takes

slide-24
SLIDE 24

Equi-join

  • A theta join where θ is an equality
  • R1 A=B R2 = σ A=B (R1 × R2)
  • Very useful join in practice
  • Example: Student sid=sid Takes
slide-25
SLIDE 25

Division

 Not supported as a primitive operator, but useful for

expressing queries like:

Find students who have enrolled in all systems courses.

 Let A have 2 fields, x and y; B have only field y:

  • A/B =
  • i.e., A/B contains all x tuples (students) such that for

every y tuple (course) in B, there is an xy tuple in A.

  • Or: If the set of y values (courses) associated with an x

value (student) in A contains all y values in B, the x value is in A/B.

 In general, x and y can be any lists of fields; y is the

list of fields in B, and x y is the list of fields of A.

{ (x) | ∀ (y) ∈ B, ∃ (x,y) ∈ A }

slide-26
SLIDE 26

Division examples

sno pno s1 p1 s1 p2 s1 p3 s1 p4 s2 p1 s2 p2 s3 p2 s4 p2 s4 p4 pno p2 pno p2 p4 pno p1 p2 p4 sno s1 s2 s3 s4 sno s1 s4 sno s1

A B1 B2 B3 A/B1 A/B2 A/B3

slide-27
SLIDE 27

Expressing A/B Using Basic Operators

 Division is not essential op; just a useful

shorthand.

  • (Also true of joins, but joins are so common that

systems implement joins specially.)

 Idea: For A/B, compute all x values that are not

`disqualified’ by some y value in B.

  • x value is disqualified if by attaching y value from B, we
  • btain an xy tuple that is not in A.

Disqualified x values: A/B: all disqualified tuples

slide-28
SLIDE 28

Combining operators: complex expressions

Πname,sid (σname=”DB” (Students (Takes Course)))

Students Course Takes

σname=”DB”

Πname,sid

slide-29
SLIDE 29

Query Optimization Is Based on Algebraic Equivalences

  • Relational algebra has laws of commutativity,

associativity, etc. that imply certain expressions are equivalent.

  • They may be different in cost of evaluation!

σc ∧ d(R) ≡ σc( σd(R) ) σc (R ⋈ S) ≡ σc(R) ⋈ S

  • Query optimization finds the most efficient

representation to evaluate (or one that’s not bad)

R ⋈ (S ⋈ T) ≡ (R ⋈ S) ⋈ T)

cascading selection join associativity pushing selections

Definition: Query Equivalence Two queries Q and Q’ are equivalent if: for all databases D, Q(D) = Q’(D)

slide-30
SLIDE 30

Operations on Bags

A bag = a set with repeated elements Relational Engines work on bags, not sets ! All operations need to be defined carefully on bags

  • {a,b,b,c}∪{a,b,b,b,e,f,f}={a,a,b,b,b,b,b,c,e,f,f}
  • {a,b,b,b,c,c} – {b,c,c,c,d} = {a,b,b}
  • σC(R): preserves the number of occurrences
  • ΠA(R): no duplicate elimination
  • Cartesian product, join: no duplicate elimination
slide-31
SLIDE 31

Beware: Bag Laws != Set Laws

  • Some, but not all algebraic laws that

hold for sets also hold for bags.

  • Example: the commutative law for

union (R ∪ S = S ∪ R ) does hold for bags.

– Since addition is commutative, adding the number of times x appears in R and S doesn’t depend on the order of R and S.

slide-32
SLIDE 32

Example of the Difference

  • Set union is idempotent, meaning that

S ∪ S = S.

  • However, for bags, if x appears n times

in S, then it appears 2n times in S ∪ S.

  • Thus S ∪ S != S in general.
slide-33
SLIDE 33

Relational calculus

  • What is a “calculus”?

– The term "calculus" means a system of computation – The relational calculus is a system of computing with relations

33

slide-34
SLIDE 34

Relational calculus (in 1 slide)

We will study another logic-based formalism for queries called Datalog later. Name and sid of students who are taking the course “DB” English:

{xname, xsid | ∃xcid∃xterm Students(xsid,xname) ∧ Takes(xsid,xcid) ∧ Course(xcid,”DB”, xterm) }

RC: RA:

Πname,sid (Students Takes σname=”DB” (Course)

Where are the joins?

slide-35
SLIDE 35

Algebra v. Calculus

  • Relational Algebra: More operational;

very useful for representing execution plans.

  • Relational Calculus: More declarative,

basis of SQL

  • The calculus and algebra have

equivalent expressive power (Codd)

A language that can express this core class

  • f queries is called Relationally Complete
slide-36
SLIDE 36

What can’t you express in RA,RC?

  • Can I get from

Oakland to Boston in 2 flights?

  • Can I get from

Oakland to Reno?

36

depart arrive NYC Reno NYC Oakland Boston Tampa Oakland Boston Tampa NYC