CSE 462 - Databases Oliver Kennedy okennedy@buffalo.edu 1 Why - - PowerPoint PPT Presentation

cse 462 databases
SMART_READER_LITE
LIVE PREVIEW

CSE 462 - Databases Oliver Kennedy okennedy@buffalo.edu 1 Why - - PowerPoint PPT Presentation

CSE 462 - Databases Oliver Kennedy okennedy@buffalo.edu 1 Why Study Databases? 2 3 3 2 Queries per Second 3 Interesting Problems Algorithms Systems Databases Hardware Theory 4 $$$ 8 of the top 10 Forbes Global 2000 Software &


slide-1
SLIDE 1

CSE 462 - Databases

Oliver Kennedy

  • kennedy@buffalo.edu

1

slide-2
SLIDE 2

Why Study Databases?

2

slide-3
SLIDE 3

3

slide-4
SLIDE 4

3

slide-5
SLIDE 5

2 Queries per Second

3

slide-6
SLIDE 6

Interesting Problems

Databases Algorithms Systems Theory Hardware

4

slide-7
SLIDE 7

$$$

8 of the top 10 Forbes Global 2000 Software & Programming Companies base their business on data management

5

slide-8
SLIDE 8

What is “Databases”?

6

slide-9
SLIDE 9

Databases

  • How do we ask and answer questions about data?
  • How do we manipulate and persist data?

7

slide-10
SLIDE 10

Databases

  • How do we ask and answer questions about data?
  • How do we manipulate and persist data?

efficiency accuracy multiple sources consistency correctness summaries parallelism

7

slide-11
SLIDE 11

Database Tools

Recipes: Techniques: Knowledge: Join Algorithms Index Datastructures Data Modeling Cost-Based Optimization The Memory Hierarchy Data Consistency

8

slide-12
SLIDE 12

Which tools do you use … and when?

9

slide-13
SLIDE 13

This Course in a Nutshell

10

slide-14
SLIDE 14

This Course in a Nutshell

There might be many correct options…

10

slide-15
SLIDE 15

This Course in a Nutshell

There might be many correct options… …but some are better than others…

10

slide-16
SLIDE 16

This Course in a Nutshell

There might be many correct options… …but some are better than others… …for specific tasks.

10

slide-17
SLIDE 17

This Course in a Nutshell

There might be many correct options… …but some are better than others… …for specific tasks. How do you define ‘correct’ and ‘better’?

10

slide-18
SLIDE 18

This Course in a Nutshell

There might be many correct options… …but some are better than others… …for specific tasks. How do you define ‘correct’ and ‘better’? How do you find alternatives that are correct?

10

slide-19
SLIDE 19

This Course in a Nutshell

There might be many correct options… …but some are better than others… …for specific tasks. How do you define ‘correct’ and ‘better’? How do you find alternatives that are correct? How do you find alternatives that are better?

10

slide-20
SLIDE 20

What is ‘Better’?

  • Declarative Queries: ‘Easy to think about’ vs ‘Fast’
  • Data Layouts: Space vs Fast Updates vs Fast Queries
  • Parallel Updates: Reactive vs Proactive Concurrency

11

slide-21
SLIDE 21

Today

  • Logistics: What you need to know
  • Project Outline: Build the next big data startup
  • Ways to Fail: What not to do and why
  • Intro: So what is a database anyway?

12

slide-22
SLIDE 22

General Course Information

13

slide-23
SLIDE 23

People

  • Oliver Kennedy (okennedy@buffalo.edu)
  • Jun Chu (jchu6@buffalo.edu)
  • Nikhil Londhe (support role only)

14

slide-24
SLIDE 24

Syllabus & Website

Course Forum: Piazza Course Project: DµBStep

http://odin.cse.buffalo.edu/teaching/cse-462

15

slide-25
SLIDE 25

Course Structure

  • Programming Assignment (50% of overall grade)
  • 4-Person Groups
  • Build a relational query engine
  • Course Content (50% of overall grade)
  • 2 Midterm Exams (5 or 10% of overall grade each)
  • Comprehensive Final Exam (20, 25, or 30% of overall grade)
  • Final Grade replaces up to 5% of each midterm’s grade
  • Homeworks due on Thursdays (10% of overall grade; drop lowest 2)

16

slide-26
SLIDE 26

Data µBases Step-by Step

(a.k.a., how to be the next ‘big’ data startup)

17

slide-27
SLIDE 27

Embedded Databases

  • SQLite (in your browser, computer, phone, etc…)
  • Simple, easy-to-use, declarative data management
  • Critical for future tech: Part of Mobile, IoT, Web

18

slide-28
SLIDE 28

Embedded Databases

  • SQLite (in your browser, computer, phone, etc…)
  • Simple, easy-to-use, declarative data management
  • Critical for future tech: Part of Mobile, IoT, Web

Your startup’s goal… …build (part of) an embedded database

18

slide-29
SLIDE 29

Data µBases (Step-by Step)

I give you data (CSV Files + Schema) I ask you a question about the data (SQL) You give me an answer

19

slide-30
SLIDE 30

Data µBases (Step-by Step)

Real World Challenge: You start with… … an empty GIT repository … open-source libraries (more on this next week)

20

slide-31
SLIDE 31

Data µBases (Step-by Step)

Real World Challenge: You get graded on your code’s… … correctness (do you produce the right answer) minimum 1/3 of grade for producing the right answer … speed (how fast did you produce the answer) +2/3 for meeting/beating the reference implementation

21

slide-32
SLIDE 32

DµBStep

SUBMIT

You write code You push to GIT DµBStep compiles your code π-graders run your code DµBStep emails your group

22

slide-33
SLIDE 33

DµBStep

SUBMIT

You write code You push to GIT DµBStep compiles your code π-graders run your code DµBStep emails your group

22

slide-34
SLIDE 34

Project Outline

SQL Query Parser & Translator Relational Algebra Optimizer Execution Plan Evaluation Engine Query Result

Statistics

A relational query processor

23

slide-35
SLIDE 35

Project Outline

SQL Query Parser & Translator Relational Algebra Optimizer Execution Plan Evaluation Engine Query Result

Statistics

A relational query processor

JSqlParser.jar

24

slide-36
SLIDE 36

Project Outline

SQL Query Parser & Translator Relational Algebra Optimizer Execution Plan Evaluation Engine Query Result

Checkpoint 1 Statistics JSqlParser.jar

25

slide-37
SLIDE 37

Project Outline

SQL Query Parser & Translator Relational Algebra Optimizer Execution Plan Evaluation Engine Query Result

Statistics Checkpoint 2 JSqlParser.jar

26

slide-38
SLIDE 38

Project Outline

SQL Query Parser & Translator Relational Algebra Optimizer Execution Plan Evaluation Engine Query Result

Statistics Checkpoint 3 JSqlParser.jar

27

slide-39
SLIDE 39

Projects

  • Checkpoint 0: “Hello World” Set-up (Due Feb 8)
  • 5% of your overall grade (free points)
  • Checkpoint 1: Basic SPJU Query Evaluation
  • 15% of your overall grade
  • Checkpoint 2: “Big” Data & Query Optimization
  • 15% of your overall grade
  • Checkpoint 3: Pre-computation
  • 15% of your overall grade

28

slide-40
SLIDE 40

Those 5 free points sounded interesting… … what do I need to do to get them?

29

slide-41
SLIDE 41

Those 5 free points sounded interesting… … what do I need to do to get them?

29

http://odin.cse.buffalo.edu/dubstep/checkpoint0.html

slide-42
SLIDE 42

5 free points

30

slide-43
SLIDE 43

OMGWTFBBQTooHard

31

slide-44
SLIDE 44

5 free points

  • Create a group of up to 4 people.
  • Register your group.
  • Access your group’s GIT repository.
  • Commit a “Hello World” program.
  • Hit “Submit”

32

slide-45
SLIDE 45

If it doesn’t work, try again

33

slide-46
SLIDE 46

Submit any project as many times as you need to (before the deadline) Your grade will not go down if you submit again

34

slide-47
SLIDE 47

Any questions on the project?

35

slide-48
SLIDE 48

Ways to Fail

(do not do these things)

  • Start your project at the last minute
  • Don’t go to office hours
  • Don’t ask questions on Piazza
  • Wait until the deadline to submit for the first time
  • Cheat

36

slide-49
SLIDE 49

Ways to Fail

(do not do these things)

  • Start your project at the last minute
  • Don’t go to office hours
  • Don’t ask questions on Piazza
  • Wait until the deadline to submit for the first time
  • Cheat

36

slide-50
SLIDE 50

37

slide-51
SLIDE 51

Academic Integrity

Cheating is submitting any work that you did not perform by yourself as if you did. References (be sure to cite properly): Wikipedia, Wikibooks (or similar): OK Public Code: StackExchange (or similar): NOT OK Discussing ideas with classmates out of class: “A hash index has O(1) lookups”: OK (except during exams 😈 ) Sharing code or answers with classmates: “Just have a look at how I implemented it”: NOT OK

38

slide-52
SLIDE 52

MOSS

Submission Overlap (Ignoring Library Code)

39

slide-53
SLIDE 53

MOSS-Details

Identical Code Structure Code in Case Statement Code in “Operator Class”

40

slide-54
SLIDE 54

Academic Integrity

Zero Tolerance: If I catch you submitting someone else’s code, you will fail the class. Group Responsibility: If your teammate cheats on a group project, the entire group will be penalized. Share Code, Share Blame: If someone else submits your code as their own, you will be penalized as well.

41

slide-55
SLIDE 55

Questions/Concerns?

42

slide-56
SLIDE 56

What does a data- management system do?

43

slide-57
SLIDE 57

Data Management

Manipulation: Safely persisting and sharing data updates

Analysis: Answering user-provided questions about a dataset

What kind of tools can we give end-users?

  • Declarative Languages
  • Organizational Datastructures (e.g., Indexes)

What kind of tools can we give end-users?

  • Consistency Primitives
  • Data Validation Primitives

44

slide-58
SLIDE 58

Data

vs

45

slide-59
SLIDE 59

Data

vs

{ "firstName": "John", "lastName": "Smith", "age": 25, "address": { "streetAddress": "21 2nd Street", "city": "New York", "state": "NY", "postalCode": 10021 }, "phoneNumbers": [ { "type": "home", "number": "212 555-1234" }, { "type": "fax", "number": "646 555-4567" } ] }

46

slide-60
SLIDE 60

Data

vs

{ "firstName": "John", "lastName": "Smith", "age": 25, "address": { "streetAddress": "21 2nd Street", "city": "New York", "state": "NY", "postalCode": 10021 }, "phoneNumbers": [ { "type": "home", "number": "212 555-1234" }, { "type": "fax", "number": "646 555-4567" } ] }

Databases exploit the data’s structure!

46

slide-61
SLIDE 61

So let’s talk structure…

47

slide-62
SLIDE 62

Types

48

slide-63
SLIDE 63

Types

Integer Floating Point Number String List/Array Set Bag Struct Dictionary/Object

48

slide-64
SLIDE 64

Types

Integer Floating Point Number String List/Array Set Bag Struct Dictionary/Object Primitive Collection Tuple

49

slide-65
SLIDE 65

Type Glossary

  • Primitive: Basic building blocks like Int, Float, Char, String
  • Tuple: Several ‘fields’ of different types. (N-Tuple = N fields)
  • A Tuple has a ‘schema’ defining names/types for each field
  • Set: A collection of unique records, all of the same type
  • Bag: An unordered collection of records, all of the same type
  • List: An ordered collection of records, all of the same type

50

slide-66
SLIDE 66

Relational Database Glossary

51

slide-67
SLIDE 67

Relational Database Glossary

51

Relational Database

Relation Relation Relation

slide-68
SLIDE 68

Relational Database Glossary

51

Relational Database

Relation Relation Relation Schema Instance

Specifies the name

  • f the relation, plus

the name and type

  • f each column

Officers( firstname: string, lastname: string, id: int )

The Data

[James, Kirk, 2260] [Jean Luc, Picard, 2360] [Benjamin, Sisko, 2365]

slide-69
SLIDE 69

Relational Database Glossary

51

Relational Database

Relation Relation Relation Schema Instance

Columns (# = degree/arity) Specifies the name

  • f the relation, plus

the name and type

  • f each column

Officers( firstname: string, lastname: string, id: int )

The Data

[James, Kirk, 2260] [Jean Luc, Picard, 2360] [Benjamin, Sisko, 2365]

slide-70
SLIDE 70

Relational Database Glossary

51

Relational Database

Relation Relation Relation Schema Instance

Columns (# = degree/arity) Rows (# = cardinality) Specifies the name

  • f the relation, plus

the name and type

  • f each column

Officers( firstname: string, lastname: string, id: int )

The Data

[James, Kirk, 2260] [Jean Luc, Picard, 2360] [Benjamin, Sisko, 2365]

slide-71
SLIDE 71

Relational Database Glossary

51

Relational Database

Relation Relation Relation Schema Instance

Columns (# = degree/arity) Rows (# = cardinality) Specifies the name

  • f the relation, plus

the name and type

  • f each column

Officers( firstname: string, lastname: string, id: int )

The Data

[James, Kirk, 2260] [Jean Luc, Picard, 2360] [Benjamin, Sisko, 2365]

A relation is a set of tuples (rows) with the same schema

slide-72
SLIDE 72

Why?

Your data is currently an Unordered Set of 100-attribute Tuples Tomorrow, you’ll be repeatedly asked for 1 specific attribute

  • f 5 specific rows identified by the first attribute

Can you do better?

52

slide-73
SLIDE 73

Why?

Better Idea: Rewrite data into a 99-Tuple of Maps keyed on the 1st attribute This representation is equivalent, and better for your needs.

Declarative specs make it easier to find equivalences.

53