Data Management Systems Fall Semester 2020 Gustavo Alonso - - PowerPoint PPT Presentation

data management systems
SMART_READER_LITE
LIVE PREVIEW

Data Management Systems Fall Semester 2020 Gustavo Alonso - - PowerPoint PPT Presentation

Data Management Systems Fall Semester 2020 Gustavo Alonso Institute of Computing Platforms Content Introduction Department of Computer Science ETH Zrich Course Introduction 1 Course contents Intro & basics Storage Access


slide-1
SLIDE 1

Data Management Systems

Fall Semester 2020 Content Introduction

1

Gustavo Alonso Institute of Computing Platforms Department of Computer Science ETH Zürich

Course Introduction

slide-2
SLIDE 2

Course contents

  • Intro & basics
  • Storage
  • Access methods
  • Query Processing
  • Concurrency Control and Recovery
  • Distributed Transaction Processing
  • Key Value Stores
  • Cloud Data Management Systems

2 Course Introduction

slide-3
SLIDE 3

Why data management?

  • Databases and data management are at the heart of many key

applications

  • The “assembly code” of data science
  • Organizing and processing information
  • Data management engines
  • Architecture and algorithms
  • Scalability and distribution
  • Cloud and new hardware
  • What you learn here will help you be more effective as a computer

scientist or as a scientist dealing with data

3 Course Introduction

slide-4
SLIDE 4

A Database System (DBMS)

  • A DBMS is a tool that helps develop and run data-intensive

applications.

  • Push the complexity of dealing with the data (storage, processing,

consistency) to the database rather than to the program

  • Share the database
  • The Database is a tool
  • Many shapes and forms
  • Many applications

4 Course Introduction

slide-5
SLIDE 5

5 Course Introduction

slide-6
SLIDE 6

How does a database look like

  • Until recently, a database often meant a relational database
  • Today, there are many forms of data management engines (or

databases)

  • Principles and ideas behind relational databases apply to almost all

forms of data management

6 Course Introduction

slide-7
SLIDE 7

SQL Accelerators

From Oracle M7 documentation

7 Course Introduction

slide-8
SLIDE 8

Why use a DBMS?

  • Avoid redundancy and inconsistency
  • Rich (declarative) access to the data
  • Synchronize concurrent data access
  • Recovery after system failures
  • Security and privacy
  • Facilitate reuse of the data
  • Reduce cost and pain of doing something useful
  • There is always an alternative!!!

8 Course Introduction

slide-9
SLIDE 9

What makes a database great?

  • Physical and logical data independence
  • Declarative language (SQL)
  • Query optimization
  • Data integrity
  • Access controls
  • Persistence
  • Concurrency Control
  • Recovery
  • Scalability and flexibility

9 Course Introduction

slide-10
SLIDE 10

Data independence

Physical data independence

  • The database hides how the

data is actually stored persistently and represented in memory

  • What the user sees are just

relations and a schema

  • How to map physical data to the

schema is the engine’s problem Logical data independence

  • The database allows to build

views over the schema so that different logical interpretations

  • f the same data are possible
  • This is implemented through

views and the engine is responsible for maintaining such views

10 Course Introduction

slide-11
SLIDE 11

Declarative language

  • SQL (Structured Query

Language) specifies how the result looks like and describes the tuples that should be part of the result

  • It does not say how to get to the

result (no control flow)

  • This is what allows the database

to optimize queries

  • Standardized (to some extent)

11

TPC-H Query 1 SELECT L_RETURNFLAG, L_LINESTATUS, SUM(L_QUANTITY) AS SUM_QTY, SUM(L_EXTENDEDPRICE) AS SUM_BASE_PRICE, SUM(L_EXTENDEDPRICE*(1-L_DISCOUNT)) AS SUM_DISC_PRICE, SUM(L_EXTENDEDPRICE*(1-L_DISCOUNT)*(1+L_TAX)) AS SUM_CHARGE, AVG(L_QUANTITY) AS AVG_QTY, AVG(L_EXTENDEDPRICE) AS AVG_PRICE, AVG(L_DISCOUNT) AS AVG_DISC, COUNT(*) AS COUNT_ORDER FROM LINEITEM WHERE L_SHIPDATE <= dateadd(dd, -90, cast('1998-12-01' as date)) GROUP BY L_RETURNFLAG, L_LINESTATUS ORDER BY L_RETURNFLAG,L_LINESTATUS

Course Introduction

slide-12
SLIDE 12

Query optimization

  • Relational algebra allows to

prove the equivalence of certain transformations.

  • One an SQL statement is

transformed into an operator tree (a plan), the plan can be

  • ptimized using additional

information, heuristics, cost estimations, etc.

  • This is automatic, user is not

involved (hints allowed)

12 Course Introduction

slide-13
SLIDE 13

Data integrity

  • Databases use a number of features to preserve the integrity of the

data:

  • Foreign key constraints (if a tuple appears in one table, it has to appear in

another table)

  • NULL/NOT-NULL (preventing that key attributed are left empty)
  • Value constraints (preventing that attributes get assigned invalid values)
  • This allows to define rules controlling the consistency of the data,

letting the engine enforce these rules at run time. These rules are declarative.

13 Course Introduction

slide-14
SLIDE 14

Access controls

  • Relations and queries are logically the same thing
  • Database engines provide access control by determining what data

can be seen using queries

  • Different possibilities of implementation
  • The engine decides how to enforce access control
  • Allows a declarative specification of access control
  • Use similar tricks as query optimization …

14 Course Introduction

slide-15
SLIDE 15

Concurrency control

  • One of the big selling points of database engines
  • No matter how many users read and write the data concurrently, they all see

a consistent database

  • The application does not need to worry about the fact that other

applications/users are also modifying the data

  • The engine ensures that you read the correct data even if it is being

modified and it also ensures your changes make it to the database even when others are also reading and writing the same data

  • A hot topic for decades and still very much being discussed (lots of

fake news from companies when they do not know how to do it)

15 Course Introduction

slide-16
SLIDE 16

Recovery

  • Another big selling point of database engines
  • Data committed to the database will be persistent and recoverable even

when failures occur

  • Crucial in a wide range of applications (financial, banking, etc.)
  • Upon restart (even after a failure) the database will restore its last

consistent stage before shutting down

  • The database can recreate its past states and records all the changes

made, keeping track of all modifications to the data or the system

16 Course Introduction

slide-17
SLIDE 17

Scalability and flexibility

  • Database engines today provide very high levels of performance and

a wide variety of features

  • Very complex systems (potentially very expensive as well) with many

extension and deployments

  • Established code basis have been optimized to the extreme and are

highly reliable (code tested by very many users)

  • Often, old engines are better not because of their design but because of their

reliability and hugely optimized code

  • In spite of diversification, relational engines remain the key

benchmark in functionality as well as performance

17 Course Introduction

slide-18
SLIDE 18

Architecture

18 Course Introduction

slide-19
SLIDE 19

Database Architecture

19

Physical storage Blocks, files, segments Pages in memory Physical records Logical records (tuples) Logical data (tables, schemas) Relations, views Queries, Transactions (SQL) Record Interface Record Access Page access File Access Application Logical view (logical data) Access Paths Physical data in memory Page structure Storage allocation

Course Introduction

slide-20
SLIDE 20

Database Architecture (another view)

20

https://docs.oracle.com/cd/E11882_01/server.112/e40540/pro cess.htm#CNCPT902

Course Introduction

slide-21
SLIDE 21

Database Architecture (yet another view)

21

https://dsf.berkeley.edu/papers/fntdb07-architecture.pdf

Course Introduction

slide-22
SLIDE 22

Basic Architectures – shared nothing

  • Each node runs on its own data and
  • n its own engine. They share

neither disk nor memory

  • Function shipping = the query has

to go to where the data is

  • Data Shipping = the data has to go

where the query is

  • Easy to maintain and to scale
  • Ideal when data can be sharded
  • Replication used for fault tolerance

and to minimize data shipping

22

DB Engine Disk NETWORK DB Engine Disk DB Engine Disk

Course Introduction

slide-23
SLIDE 23

Who uses shared nothing?

  • Shared nothing works well when:
  • Data can be sharded (partitioned) across the

nodes

  • There are not many updates across nodes
  • The data fits within the local storage of each

node

  • Replication does not add too much overhead
  • Queries can be parallelized across the nodes
  • Shared nothing does not work when
  • Data cannot be easily partitioned
  • There are too many updates across all the data

23

DB Engine Disk NETWORK DB Engine Disk DB Engine Disk

Course Introduction

slide-24
SLIDE 24

Example of processing with shared nothing

24

DB Engine Disk DB Engine Disk DB Engine Disk T T1 T3 T2 SELECT * FROM T WHERE Age > 50 SELECT * FROM T WHERE Age > 50 SELECT * FROM T WHERE Age > 50 SELECT * FROM T WHERE Age > 50 Results Results Merge and answer

Course Introduction

slide-25
SLIDE 25

Example of processing with shared nothing

25

DB Engine Disk DB Engine Disk DB Engine Disk R S SELECT * FROM R,S WHERE R.name = S.name R S Data Shipping (S) SELECT * FROM R,S WHERE R.name = S.name Process and answer

Course Introduction

slide-26
SLIDE 26

Example of processing with shared nothing

26

DB Engine Disk DB Engine Disk DB Engine Disk R S SELECT * FROM R,S WHERE R.id = S.id R S S SELECT * FROM R,S WHERE R.id = S.id Function shipping

Course Introduction

slide-27
SLIDE 27

It gets arbitrarily complex …

… but databases deal with all such cases:

  • The tables do not fit in memory and need to be processed in chunks
  • The tables are so big that need to be processed in several nodes
  • The queries have selection predicates that change the relative sizes of

the tables involved

  • The query might require many joins of many tables (very common!)
  • Tables can be partitioned on different attributes
  • There might be indexes in the tables

27 Course Introduction

slide-28
SLIDE 28

Basic Architectures – shared memory

  • A coherent memory abstraction

is provided over the network merging the memory of each node

  • Data and queries still move but

programming is much easier (the shared memory abstraction is provided at lower levels)

  • Used in high scale transactional

systems

28

DB Engine Disk NETWORK DB Engine Disk DB Engine Disk Shared memory abstraction

Course Introduction

slide-29
SLIDE 29

Who uses shared memory?

  • The big ones:
  • Often implemented on a

supercomputer or large scale machines

  • Shared memory abstraction

provided by the hardware and highly specialized networks

  • Allows to run transactions at very

large scales without having to change the engine

29 Course Introduction

slide-30
SLIDE 30

Shared memory vs multicore

  • Today’s multicore architectures are a form of shared memory that

allows a growing number of processors connected to a large pool of memory in a single machine

  • Avoids network overheads
  • Virtual memory and cache coherency at the hardware level
  • NUMA effects
  • Scalability effects
  • Common for large scale systems but not without problems (single

machine, connection bottlenecks, need backup, price …)

30 Course Introduction

slide-31
SLIDE 31

Basic Architectures – shared disk

  • Data is stored in shared storage

(typically networks attached storage)

  • Data Shipping = the data has to

go where the query is

  • Very common in the cloud
  • Simplifies data placement and

partitioning

  • Separates storage from compute

31

DB Engine Disk NETWORK DB Engine Disk DB Engine Disk STORAGE SYSTEM

Course Introduction

slide-32
SLIDE 32

Query processing

32 Course Introduction

slide-33
SLIDE 33

Database operators

  •  Selection
  •  Projection
  • X Cartesian Product
  • ⋈ Join
  •  Rename
  •  Set Minus
  •  Relational Division
  •  Union
  •  Intersection
  • ⋉ Semi-Join (left)
  • ⋊ Semi-Join (right)
  • left outer Join
  • right outer Join

33

⟕ ⟖

Course Introduction

slide-34
SLIDE 34

What is great about relational algebra …

… is that it can be used to prove equivalence

  • Basic operators.
  • Selection: p (E1)
  • Projection: S (E1)
  • Cartesian Product: E1 x E2
  • Rename: V (E1), A  B (E1)
  • Union: E1  E2
  • Minus: E1 - E2
  • All other operators can be derived from these

34 Course Introduction

slide-35
SLIDE 35

Query optimization

  • Databases optimize queries before execution
  • Rewriting queries
  • Using relational algebra
  • Using heuristics
  • Using statistics gathered during execution
  • Remembering past queries and learning from them

35

SQL STATEMENT OPERATOR TREE OPTIMIZED OPERATOR TREE COMPILED QUERY INTERPRETER

Course Introduction

slide-36
SLIDE 36

It starts by choosing the right operator

  • Nested loop join (R ⋈ S)

for all X in R for all Y in S if X = Y { merge X and Y; return X+Y }

  • Complexity is MxN (tuples in R

times tuples in S)

  • Potentially expensive unless

tables small (fit in cache)

36

R S

Course Introduction

slide-37
SLIDE 37

Optimizations and options galore

  • Sorting one or both tables
  • Join is fast (sorting can be

expensive)

  • Hash one table and probe with

the other

  • Use the hash as an index so that

the comparison is just a hash lookup

  • Partition the tables first using
  • ne hash, then do a hash join
  • Makes sure the data fits in the

cache when doing the comparison

37

R S

Course Introduction

slide-38
SLIDE 38

More options

  • Keep basic statistics on a table such a max and min values
  • Before marching two tuples, check with the max and min, if it is outside the

range, there is no match

  • Use indexes to find the matching ranges and ignore those where no

matches exists

  • The indexed show where the data is in the distribution, there is no point in

comparing ranges where there are no matches

  • Use an index on one table and do the lookup of the other table

against the index

  • That way we do not have to traverse the entire table with every comparison

38 Course Introduction

slide-39
SLIDE 39

Which operator?

  • Most commercial database engines will implement many customized

version of every operator:

  • Optimized for concrete data types
  • Optimized for sorted or indexed data
  • Different algorithms depending on statistics (table sizes, selectivity)
  • Which one to use is determine by many factors
  • Current database state
  • Availability of statistics
  • Availability of resources
  • Query optimization
  • Availability of cached data (intermediate results, actual results, etc.)

39 Course Introduction

slide-40
SLIDE 40

Transactions

40 Course Introduction

slide-41
SLIDE 41

What is a transaction?

Example: transfer CHF 50 from A to B

1. Read balance of A from DB into Variable a: read(A,a); 2. Subtract 50.- CHF from the balance: a:= a – 50; 3. Write new balance back into DB: write(A,a); 4. Read balance of B from DB into Variable b: read(B,b); 5. Add 50,- CHF to balance: b := b + 50; 6. Write new balance back into DB: write(B, b); This set of operations is correct if all of them are completed. In between, the state is inconsistent

=> THIS IS NOT HOW BANKS DO WITHDRAWALS AND DEPOSITS

41 Course Introduction

slide-42
SLIDE 42

Transactional properties

42

Transaction = ordered sequence of read and writes over the database that either commits or aborts T1 = b1 r1[x] r1[y] w1[x] c1 Basic assumptions:

  • transactions are delimited by a begin operation and either an abort or a commit operation
  • transaction must end with a commit or abort but cannot have both
  • after a transaction commits or aborts, no more operations from that transaction are possible
  • operations within a transaction are totally ordered with respect to each other
  • Communication between transactions only through the database by reading and writing
  • Transactions are correct programs
  • If they operate on a consistent state of the database …
  • … they will leave it in a consistent state when they commit

Consistent state Consistent state

transaction

Course Introduction

slide-43
SLIDE 43

ACID principle

43

ATOMICITY: a transaction is executed in its entirety or not at all

Consistent state

transaction CONSISTENCY: a transaction executed in its entirety over a consistent DB produces a consistent DB

Consistent state Consistent state

transaction ISOLATION: a transaction executes as if it were alone in the system

Consistent state Consistent state

T1 T2 DURABILITY: committed changes of a transaction are never lost - can be recovered

Consistent state Consistent state

transaction Persistent storage

Course Introduction

slide-44
SLIDE 44

Transactional databases

  • The concept of transactions plays a key role in the scalability of data

management:

  • Distribution helps with scalability and performance
  • Makes concurrency control more complicated and more expensive
  • The trade-off has lead to many definitions of correctness and

consistency

  • Snapshot isolation (Multi-version Concurrency Control)
  • Eventual consistency
  • Important concept beyond databases as it applied to the

management of any large data collection (replication, cloud)

Course Introduction 44

slide-45
SLIDE 45

In the course

  • We will first study how traditional database engines deal with all

these issues

  • We will then consider options and alternative designs that are

tailored to particular applications and use cases (transactions, analytics, warehouses, etc.)

  • We will discuss how modern hardware change the design decisions

and affect the architectures

  • We will look into how modern data management systems apply these

ideas when running in the cloud and data centers

45 Course Introduction