CMPSCI 645 Database Design & Implementation Instructor: Gerome - - PowerPoint PPT Presentation

cmpsci 645 database design implementation
SMART_READER_LITE
LIVE PREVIEW

CMPSCI 645 Database Design & Implementation Instructor: Gerome - - PowerPoint PPT Presentation

Welcome to CMPSCI 645 Database Design & Implementation Instructor: Gerome Miklau Overview of Databases Gerome Miklau CMPSCI 645 Database Design & Implementation UMass Amherst Jan 29, 2008 Some slide content courtesy of Zack


slide-1
SLIDE 1

CMPSCI 645 Database Design & Implementation

Instructor: Gerome Miklau

Welcome to

slide-2
SLIDE 2

Overview of Databases

Gerome Miklau

CMPSCI 645 – Database Design & Implementation UMass Amherst

Jan 29, 2008

Some slide content courtesy of Zack Ives, Ramakrishnan & Gehrke, Dan Suciu, Ullman & Widom

slide-3
SLIDE 3

Today

  • Student information form
  • Overview of databases
  • Course topics
  • Course requirements
slide-4
SLIDE 4

Databases & DBMS’s

  • A database is a large, integrated collection of

data.

  • A database management system (DBMS) is a

collection of software designed to store and manage databases, allowing:

– Define the kind of data stored – Querying/updating interface – Reliable storage & recovery of 100s of GB – Control access to data from many concurrent users

slide-5
SLIDE 5

Can filesystems do it?

  • Schema for files is limited
  • No query language for data in files
  • Files can store large amounts of data, but

– no efficient access to items within file – no recovery from failure

  • Concurrent access not safe

No

slide-6
SLIDE 6

Evolution

  • Early DBMS’s (1960’s), evolved from

file systems.

  • Data with many small items & many

queries or modifications:

– Airline reservations – Banking

slide-7
SLIDE 7

Early DB systems

  • Tree-based hierarchical data model
  • Graph-based network data model
  • Encouraged users to think about data

the way it was stored.

  • No high level query language

Data model

The data model includes basic assumptions about what’s an “item” of data, how to represent it and interpret it.

slide-8
SLIDE 8

The Relational Model

  • The relational data model (Codd, 1970):

– Data independence: details of physical storage are hidden from users – High-level declarative query language

  • say what you want, not how to compute it.
  • mathematical foundation

– A theory of normalization guides the design of relations Side-note: Turing Awards in Databases 1973: Bachman, networked data model 1981: Codd, relational model 1998: Jim Gray, transaction processing

slide-9
SLIDE 9

DBMS Benefit #1: Generality and Declarativity

  • The programmer or user does not need to

know details like indices, sort orders, machine speeds, disk speeds, concurrent users, etc.

  • Instead, the programmer/user programs with

a logical model in mind

  • The DBMS “makes it happen” based on an

understanding of relative costs of different methods

slide-10
SLIDE 10

Benefit #2: Efficiency and Scale

  • Efficient storage of hundreds of GBs of data
  • Efficient access to data
  • Rapid processing of transactions
slide-11
SLIDE 11

Benefit #3: Management of Concurrency and Reliability

  • Simultaneous transactions handled safely.
  • Recovery of system data after system failure.
  • More formally: the ACID properties

– Atomicity - all or nothing – Consistency - sensible state not violated – Isolation - separated from effects – Durability - once completed, never lost

slide-12
SLIDE 12

How Does One Build a Database?

  • Start with a conceptual model
  • Design & implement schema
  • Write applications using DBMS and other

tools

– Many ways of doing this (DBMS, API writers, library authors, web server, etc.) – Common applications include PHP/JSP/servlet- driven web sites

  • The DBMS takes care of query optimization

and execution

slide-13
SLIDE 13

Conceptual Design

STUDENT COURSE Takes name sid cid name PROFESSOR Teaches semester fid name

slide-14
SLIDE 14

Designing a Schema (Set of Relations)

  • Convert to tables +

constraints

  • Then need to do

“physical” design: the layout on disk, indices, etc.

sid name 1 Jill 2 Bo 3 Maya fid name 1 Diao 2 Saul 8 Weems sid cid 1 645 1 683 3 635 cid name sem 645 DB F05 683 AI S05 635 Arch F05 fid cid 1 645 2 683 8 635

STUDENT Takes COURSE PROFESSOR Teaches

slide-15
SLIDE 15

Queries

  • Find all courses that “Mary” takes
  • What happens behind the scene ?

– Query processor figures out how to answer the query efficiently. SELECT C.name FROM Students S, Takes T, Courses C WHERE S.name=“Mary” and S.sid = T.sid and T.cid = C.cid

slide-16
SLIDE 16

Queries, behind the scene

Query execution plan: SELECT C.name FROM Students S, Takes T, Courses C WHERE S.name=“Mary” and S.sid = T.sid and T.cid = C.cid Declarative SQL query

Students Takes

sid=sid sname name=“Mary” cid=cid

Courses

The optimizer chooses the best execution plan for a query

slide-17
SLIDE 17

An Issue: 80% of the World’s Data is Not in a DB!

Examples:

– Scientific data (large images, complex programs that analyze the data) – Personal data – WWW and email (some of it is stored in something resembling a DBMS)

Data management is expanding to tackle these problems

slide-18
SLIDE 18

DBMSs in the Real World

A huge industry for 20% of the world’s data!

  • Big, mature relational databases

– IBM DB2, Oracle, Microsoft SQL Server – Adding advanced features, including “native XML” support

  • “Middleware” above these systems

– SAP, Siebel, PeopleSoft, dozens of special-purpose apps

  • Integration and warehousing systems

– BEA AquaLogic, DB2 Information Integrator

  • Current trends:

– Web services; XML everywhere – Smarter, self-tuning systems – Distributed databases, column-stores

slide-19
SLIDE 19

Database Research

  • One of the broadest, most exciting areas in CS!
  • A microcosm of CS in general
  • languages, operating systems, concurrent

programming, data structures, algorithms, theory, distributed systems, statistical techniques.

  • Theory and systems well-integrated.
slide-20
SLIDE 20

Recent Trends in Databases

  • XML

– Relational databases with XML support – Middleware between XML and relational databases – Large-scale XML message systems

  • Main memory database systems
  • Peer data management
  • Stream data management
  • Model management, provenance
  • Security and privacy
  • Modeling uncertainty, probabilistic databases
slide-21
SLIDE 21

What is the Field of Databases ?

  • To an applied researcher (SIGMOD/VLDB/ICDE)

– Query optimization – Query processing (yet-another join algorithm) – Transaction processing, recovery (but most stuff is already done) – Novel applications: data mining, high-dimensional search

  • To a theoretical researcher (PODS/ICDT/LICS)

– Focus on the query languages – Query language = logic = complexity classes

slide-22
SLIDE 22

Course topics

  • Fundamentals: relational design, query

languages.

  • Theory: expressiveness of query languages,

static analysis, complexity.

  • Database internals: storage, indexing, query

processing, query optimization, transaction management.

  • XML and semi-structured data models.
  • Security: access control, privacy.
  • Advanced topics: incomplete/probabilistic

DBs, parallel and distributed DBs.

slide-23
SLIDE 23

Prerequisites

  • Official: undergrad course in DB or OS
  • Also:

– Elementary complexity theory

slide-24
SLIDE 24

Grading

  • Homework: 30%
  • Paper reviews & participation: 15%
  • Midterm: 30%
  • Project: 25%
slide-25
SLIDE 25

Homework: 30%

  • ~ 4 assignments throughout the course

– written problem sets – practical experience with SQL, XQuery

slide-26
SLIDE 26

Paper Reviews & Participation: 15%

  • Approximately 5 classic papers will be

assigned

  • Short written reviews are due before

the day of class. Email to:

– cs645-reviews@cs.umass.edu

First paper review: Read thru 1.4 of Codd’s paper Due Wed Feb 5th

slide-27
SLIDE 27

Project: 25%

  • General theme: apply database principles to a new

problem

  • Suggested topics will be discussed next Tuesday
  • Groups of 2 preferred. 3 possible.
  • Project work will include:

– Reading some of the research literature – Implementation – Written report – In-class presentation

  • Periodic consultation with the instructor
slide-28
SLIDE 28

Midterm Exam (30%)

  • Midterm scheduled for Apr 17th at 7pm
  • (No Final!)
slide-29
SLIDE 29

Textbook

Database Management Systems

Ramakrishnan and Gehrke

slide-30
SLIDE 30

Other useful resources

  • Database systems: the complete book (Ullman,

Widom and Garcia-Molina)

  • Readings in Database Systems (Stonebraker and

Hellerstein)

  • Foundations of Databases (Abiteboul, Hull, Vianu)
  • Data on the Web (Abiteboul, Buneman, Suciu)
  • Parallel and Distributed DBMS (Ozsu and Valduriez)
  • Transaction Processing (Gray and Reuter)
  • Data and Knowledge based Systems (volumes I, II)

(Ullman)

  • Proceedings of SIGMOD, VLDB, PODS conferences.
slide-31
SLIDE 31

Communication

  • Instructor

– Office hours: by appointment – Email: miklau at cs dot umass dot edu

  • Check the course webpage often
  • You should have been added to the

mailing list.

31

slide-32
SLIDE 32

Questions about the course?

32