

Course introduction

Introduction to databases CSCC43 Winter 2013 Ryan Johnson

Thanks to Arnold Rosenbloom and Renee Miller for material in these slides


What is a database system?

  • Database: a large, integrated collection of data.
  • Models a real‐world enterprise
    – Entities (teams, games)
    – Relationships (Orhan Pamuk received the Nobel Prize)
    – Constraints (at least one doctor on duty during off‐hours)
    – More recently, active components (“business logic”)
  • Database Management System (DBMS): a software system designed to store, manage, and facilitate access to databases.
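
A minimal sketch of how entities, relationships, and constraints might look in practice, using Python's built-in sqlite3 module. The table names, columns, team names, and the try_insert helper are all made up for illustration; they are not from the course material.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite checks foreign keys only when enabled

conn.executescript("""
    -- Entity: a team
    CREATE TABLE team (
        team_id INTEGER PRIMARY KEY,
        name    TEXT NOT NULL UNIQUE
    );
    -- Entity that also records a relationship: a game between two teams
    CREATE TABLE game (
        game_id   INTEGER PRIMARY KEY,
        home_id   INTEGER NOT NULL REFERENCES team(team_id),
        away_id   INTEGER NOT NULL REFERENCES team(team_id),
        played_on TEXT NOT NULL,
        CHECK (home_id <> away_id)  -- constraint: a team cannot play itself
    );
""")

# Hypothetical sample data
conn.executemany("INSERT INTO team (team_id, name) VALUES (?, ?)",
                 [(1, "Reds"), (2, "Blues")])
conn.execute("INSERT INTO game (home_id, away_id, played_on) VALUES (1, 2, '2013-01-07')")


def try_insert(home_id, away_id, played_on="2013-01-08"):
    """Attempt to record a game; report whether the DBMS accepted it."""
    try:
        conn.execute(
            "INSERT INTO game (home_id, away_id, played_on) VALUES (?, ?, ?)",
            (home_id, away_id, played_on),
        )
        return "accepted"
    except sqlite3.IntegrityError:
        return "rejected"
```

Under these assumptions, try_insert(1, 1) violates the CHECK constraint and try_insert(1, 99) refers to a team that does not exist, so the DBMS should reject both without any checking code in the application.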


In the beginning…

  • There was The Mainframe
    – Cost: millions
    – Watts: millions
    – Size: acres
    – Speed: 40kHz
    – Memory: 2kB
    – Storage: 3.5MB (tape)

SAGE (1954)

Few organizations could afford two!


Early computing challenges

  • Time sharing
    – ~100 terminals per mainframe
    – Users share hardware
    – Want to share data, too
  • Bare hardware
    – No OS
    – No device drivers
    – No file system

UNIVAC (1951)
SABRE (1960)

=> File Management System
=> “The Database”


“The Database”

  • Abstract concept dating back to the 1950s
    – Centralized repository for all the enterprise’s data
    – Real‐time updates from many sources
    – Concurrent access by many users
    – Interactive (ad‐hoc) exploration and reporting
  • Semi‐Automatic Ground Environment (SAGE)
    – Computer‐aided tracking and interception of aircraft
    – Dozens of SAGE installations (big one in North Bay)
    – Hundreds of radar stations throughout North America
    – Thousands of operators

Goal: all relevant information at your fingertips


File management systems (FMS)

Huge need for portability, abstraction

  • File management ca. 1935
    – File: box of punchcards
    – Metadata: label on the box
    – Ad‐hoc report: no big deal
    – Hardware change: no big deal
  • File management ca. 1955
    – File: several km of magnetic tape
    – Metadata: embedded in application logic
    – Ad‐hoc report: hire a couple programmers
    – Hardware change: hire a dozen programmers…


Database Management System

  • File management systems meet The Database
    – Protect users from each other (isolation, consistency)
    – Protect application from data changes (at logical level)
    – Protect data from hardware changes (at physical level)
  • Split personality remains to this day
    – Theory/applications (declarative access to changing data)
    – Systems (make it run fast on ever‐changing hardware)
  • Why so important?
    – Rate of change of DB applications is incredibly slow
    – d(app)/dt << d(platform)/dt

This semester: the theory/application side

Why study databases?

  • Shift from computation to information
    – always true for corporate computing
    – Web made this point for personal computing
    – more and more true for scientific computing
  • Need for DBMS has exploded
    – Corporate: retail swipe/clickstreams, “customer relationship mgmt”, “supply chain mgmt”, “data warehouses”, etc.
    – Scientific: digital libraries, Human Genome project, Sloan Digital Sky Survey, physical sensors, grid physics network
  • A practical discipline spanning much of CS
    – OS, languages, theory, AI, multimedia, logic
    – Yet with a focus on real‐world apps


What’s the intellectual content?

  • Representing information
    – data modeling
  • Languages and systems for querying data
    – complex queries with real semantics*
    – over massive data sets
  • Concurrency control for data manipulation
    – controlling concurrent access
    – ensuring transactional semantics
  • Reliable data storage
    – maintain data semantics even if the lights go out

* semantics: the meaning or relationship of meanings of a sign or set of signs


Is the WWW a DBMS?

  • Fairly sophisticated search available
    – Crawler indexes pages on the web
    – Keyword‐based search for pages
  • But…
    – Data is mostly unstructured and untyped
    – Search only (can’t modify, summarize, analyze, correlate, …)
    – Few (zero) guarantees of freshness, accuracy, durability, consistency
    – DBMS lurking behind most Web sites provides these functions
  • The picture is changing
    – New standards like XML can help data modeling
    – The WWW/DB boundary is blurry!


“Search” vs. Query

  • What if you wanted to find out which actors donated to Stephen Harper’s campaign?
  • Try “actors donate to harper campaign” in your favorite search engine.
  • Stephen Harper (politician) or Hill Harper (actor)?
  • Did Harper give or receive the donation?
  • Year? Comparison with other donations?
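
To make the contrast concrete, here is a small sketch of how a structured query could answer those questions. It assumes a hypothetical donation table with invented placeholder rows ("Actor A", "Campaign X", and so on); none of the data is real.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE donation (
        donor     TEXT    NOT NULL,
        donor_job TEXT    NOT NULL,
        recipient TEXT    NOT NULL,
        year      INTEGER NOT NULL,
        amount    INTEGER NOT NULL   -- whole dollars
    )
""")

# Invented placeholder data, for illustration only
conn.executemany("INSERT INTO donation VALUES (?, ?, ?, ?, ?)", [
    ("Actor A",    "actor",        "Campaign X", 2008, 500),
    ("Actor B",    "actor",        "Campaign X", 2011, 250),
    ("Actor A",    "actor",        "Campaign Y", 2011, 100),
    ("Teacher C",  "teacher",      "Campaign X", 2011, 75),
    ("Campaign X", "organization", "Actor B",    2011, 40),  # money flowing the other way
])

# Who gave, to whom, and when: the direction of the donation is explicit
actors_to_x = conn.execute("""
    SELECT donor, year, amount
    FROM donation
    WHERE donor_job = 'actor' AND recipient = 'Campaign X'
    ORDER BY year, donor
""").fetchall()

# Comparison with other donations in a given year
totals_2011 = conn.execute("""
    SELECT recipient, SUM(amount)
    FROM donation
    WHERE year = 2011
    GROUP BY recipient
    ORDER BY recipient
""").fetchall()
```

A keyword search has no notion of donor versus recipient, typed years, or sums; the query states each of these explicitly.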


Is my file system a DBMS?

  • Strong shared heritage
    – Direct descendant of file management system
    – Excellent insulator against hardware changes
  • But…
    – Data is mostly unstructured and untyped
    – No concept of constraints, relationships
    – Minimal support for atomicity, isolation, consistency
  • The picture is changing
    – File systems adopting database concepts (logging, transactions)
    – Object‐oriented file systems provide finer grain data model
    – The FS/DBMS boundary is blurry!


Database vs. file system

  • Thought experiment #1
    – You and your project partner are editing the same file.
    – You both save it at the same time.
    – Whose changes survive?
      A) Yours B) Partner’s C) Both D) Neither E) Who knows
  • Thought experiment #2
    – You’re updating a file when the lights go out
    – Which of your changes survive?
      A) All B) None C) All since last save D) Who knows
  • How to code against “who knows”?
    – Very, very carefully…


OS support for data management

  • Again, strong shared heritage
    – Another direct descendant of file management system
    – Powerful API abstractions
    – Bring your favorite programming language
    – Enforces protections on files, objects
  • But…
    – Scheduling, resource management inadequate for big data
    – Error handling: “program terminated with SIGSEGV”
    – Ad‐hoc query? Hire a programmer…
    – Concurrency? Write code very, very carefully…


DBMS vs. {OS, FS, WWW}

  • Key services missing from some or all
    – Recovery, isolation, consistency
    – Support for ad‐hoc queries
    – Effective concurrency control
    – Preserve semantics across crashes, outages
  • SMOP? Simple matter of programming?
    – Not really (we’ll see this semester)
    – In fact, OS/FS often get in the way (next semester)
    – Analogy: Memory management in C++ vs. Java

  • Misquoting Greenspun’s tenth rule:

Any sufficiently complex data processing system resembles a buggy, half‐implemented, and poorly performing DBMS


Concept: transaction

  • “Business transaction”
    – Old idea: withdraw money, reserve seats, escrow, etc.
    – Atomic: I deliver and you pay, or neither
    – Consistent: Sell each seat to only one person
    – Isolated: Doctor doesn’t talk about the patient next door
    – Durable: Sales receipt, confirmation number, etc.
  • Database transaction
    – Sequence of reads and writes to underlying data
    – Writes [appear to] take effect atomically
    – Each transaction moves the system between consistent states**
    – Transactions can’t see (or interfere with) each other
    – Once the system returns success it will not lose the data

Formalized into an entire programming model

** user responsible to write sane transactions
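
As a rough illustration of atomicity, the sketch below moves money between two accounts with Python's sqlite3 module. The account table, the names, the balances, and the transfer and balances helpers are assumptions made up for this example. The credit is deliberately applied before the debit so that a failed debit shows the earlier write being undone.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE account (
        name    TEXT PRIMARY KEY,
        balance INTEGER NOT NULL CHECK (balance >= 0)  -- no overdrafts
    )
""")
conn.executemany("INSERT INTO account VALUES (?, ?)", [("alice", 100), ("bob", 20)])
conn.commit()


def transfer(src, dst, amount):
    """Move `amount` from src to dst: both updates take effect, or neither does."""
    try:
        with conn:  # commits if the block succeeds, rolls back if it raises
            conn.execute("UPDATE account SET balance = balance + ? WHERE name = ?",
                         (amount, dst))
            conn.execute("UPDATE account SET balance = balance - ? WHERE name = ?",
                         (amount, src))
        return True
    except sqlite3.IntegrityError:
        return False  # the CHECK constraint fired; the credit above was rolled back


def balances():
    """Return the current balances as a {name: balance} dictionary."""
    return dict(conn.execute("SELECT name, balance FROM account"))


ok = transfer("alice", "bob", 30)        # fits within alice's balance
after_ok = balances()
failed = transfer("alice", "bob", 500)   # would overdraw alice
after_failed = balances()                # unchanged: no half-finished transfer
```

The second transfer credits bob first, then fails on the debit; the rollback removes the credit, so no money appears from nowhere.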


Concept: concurrency control

  • Concurrent execution: key to high performance.
    – Disk accesses frequent, pretty slow
    – Keep the CPU working on several programs concurrently
  • Interleaving two programs’ actions: trouble!
    – Print statements during active account transfer
    – He and She both withdraw the last $100 from the ATM
  • DBMS ensures “anomalies” don’t arise
    – Give users/programmers illusion of a single‐user system
    – Thank goodness! Don’t have to program “very, very carefully”.
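
The ATM anomaly can be replayed by hand. The sketch below is plain Python, not a DBMS: the Account class and its methods are hypothetical, and a lock stands in for the much richer concurrency control a real DBMS provides. The first half replays one bad interleaving step by step; the second half runs two threads through a locked withdrawal.

```python
import threading


class Account:
    """Toy account; read/write are separate steps so a bad schedule can be replayed."""

    def __init__(self, balance):
        self.balance = balance
        self._lock = threading.Lock()

    def read(self):
        return self.balance

    def write(self, value):
        self.balance = value

    def withdraw_safely(self, amount):
        """Check and update as one indivisible step."""
        with self._lock:
            if self.balance >= amount:
                self.balance -= amount
                return True
            return False


# Bad interleaving: both users read the balance before either one writes.
shared = Account(100)
his_view = shared.read()
her_view = shared.read()
dispensed = 0
if his_view >= 100:
    shared.write(his_view - 100)
    dispensed += 100
if her_view >= 100:
    shared.write(her_view - 100)  # works from a stale read
    dispensed += 100
# The ATM handed out $200 from an account that held $100.

# Same two withdrawals, each executed as an indivisible step.
guarded = Account(100)
results = []
threads = [
    threading.Thread(target=lambda: results.append(guarded.withdraw_safely(100)))
    for _ in range(2)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
# Exactly one withdrawal succeeds, whichever thread gets there first.
```

A DBMS provides this guarantee for arbitrary transactions, so application code does not need to manage locks itself.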

Concept: data models

  • Data model: a collection of concepts for describing data.
  • Schema: a description of a particular collection of data, using a given data model.
  • Many possible data models
    – Network, hierarchical, relational, object‐oriented, …
    – The relational model is the most widely used today

A good data model is key to data independence


Concept: data independence

  • FMS (1950’s)
    – File, metadata management
    – Hardware abstraction layer
  • CODASYL/DBTG (1965)
    – Decouple application from schema
    – Decouple schema from physical data layout
  • Edgar Codd (1970)
    – Relational algebra
    – Move from procedural to declarative
  • Charles Bachman (1973)
    – Programmer navigates data instead of (merely) writing code
    – Move from machine‐centric to data‐centric programming
  • Fast forward to today
    – SQL, ODBC/JDBC, federation, web services, …
    – Data integration, cleaning, performance tuning, …

Big Deal™… but still a work in progress
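
One way to picture data independence is an application query that keeps working after the stored data is reorganized. The sketch below assumes a hypothetical employee(name, city) logical schema and invented sample rows; in the second database the data is split across two tables and indexed, and a view restores the shape the application expects.

```python
import sqlite3


def report(conn):
    """Application code: knows only the logical schema employee(name, city)."""
    return conn.execute("SELECT name, city FROM employee ORDER BY name").fetchall()


# Version 1: the data lives in a single table.
v1 = sqlite3.connect(":memory:")
v1.executescript("""
    CREATE TABLE employee (name TEXT, city TEXT);
    INSERT INTO employee VALUES ('Ada', 'Toronto'), ('Bob', 'North Bay');
""")

# Version 2: the data is split into two tables and indexed;
# a view presents the same logical schema to the application.
v2 = sqlite3.connect(":memory:")
v2.executescript("""
    CREATE TABLE office (office_id INTEGER PRIMARY KEY, city TEXT);
    CREATE TABLE person (name TEXT, office_id INTEGER REFERENCES office(office_id));
    INSERT INTO office VALUES (1, 'Toronto'), (2, 'North Bay');
    INSERT INTO person VALUES ('Ada', 1), ('Bob', 2);
    CREATE INDEX person_by_name ON person(name);   -- physical tuning only
    CREATE VIEW employee AS
        SELECT p.name AS name, o.city AS city
        FROM person p JOIN office o ON o.office_id = p.office_id;
""")

same_answer = report(v1) == report(v2)  # report() was not edited between versions
```

The report function is never modified; only the schema and physical layout behind it change.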


Advantages of a DBMS

  • Data independence
  • Efficient data access
  • Data integrity & security
  • Data administration
  • Concurrent access, crash recovery
  • Reduced application development time
  • So why not use them always?
    – Expensive/complicated to set up & maintain
    – Cost & complexity must be offset by need
    – General‐purpose, not suited for special‐purpose tasks (e.g. text search!)


Databases make these folks happy ...

  • DBMS vendors, programmers
    – Oracle, IBM, MS, Sybase, NCR, …
  • End users in many fields
    – Business, education, science, …
  • DB application programmers
    – Build enterprise applications on top of DBMSs
    – Build web services that run off DBMSs
  • Database administrators (DBAs)
    – Design logical/physical schemas
    – Handle security and authorization
    – Data availability, crash recovery
    – Database tuning as needs evolve

Summary (part 1)

  • DBMS marries two very old concepts
    – The Database (idealistic vision)
    – File management system (eminently practical)
  • Benefits
    – Maintain, query large datasets
    – Manipulate data and exploit semantics
    – Recover from system crashes
    – Juggle concurrent access, automatic parallelization
    – Quick application development
    – Preserve data integrity and security
  • Powerful abstractions provide data independence
    – Application safe from changes to data organization, hardware
    – Key when d(app)/dt << d(platform)/dt

Summary (cont.)

  • DB administrators, developers are the bedrock of the information economy
  • Data management R&D spans a broad, fundamental branch of the science of computation

This semester: become an effective DBMS user


Course administrivia

  • Professor
    – Ryan Johnson (IC481)
    – Office hours: by appointment
  • TA
    – Cicely Zhang
  • Email: cscc43‐instructor@cs.utoronto.ca
    – No promises for class‐related email sent elsewhere
    – My main email gets hundreds of messages, things get lost
  • Course web site:
    – http://www.cs.utoronto.ca/~ryanjohn/teaching/cscc43‐s13/


What is an inverted classroom?

  • Students read lectures outside class
    – Knowledge transfer is the first part of learning
    – Avoids long in‐class lectures about dry material
  • Instructor as tutor
    – Focus: Q&A, review of tricky material, hands‐on practice
    – Instructor gives you what Google can’t
  • Formative assessment
    – Many assignments marked for effort, not correctness
    – Opportunity to learn from mistakes
    – Some peer marking (see syllabus for privacy issues)


Course marks

  • Formative/Participation (15%)
    – Gain marks: in‐class assignments and activities
    – Lose marks: distract or disturb others, come unprepared
    => Reflects fact that “lectures” are hands‐on and participatory
  • In‐class homework and quizzes (15%)
    – Lots of small ones (< 1% each)
    – “Encouragement” to read before class
  • Midterm exam (28%)
    – 2h, week of 4 or 11 Feb (still waiting for room assignment)
  • Final exam (42%)
    – 3h (time/place TBA)
    – Comprehensive, 2h for new material


A few do’s and don’ts

  • Do
    – read materials before class!
    – take assignments seriously
    – ask questions if you don’t understand something
    – bring your laptop/tablet to class (we’ll use it!)
  • Don’t
    – expect a high mark if you ignore reading/class/assignments
    – hand in other people’s work (it’s cheating)
    – harass others (it’s against University policy)
    – distract or disrupt the class (it’s immature)