Data Management Systems Fall Semester 2020 Gustavo Alonso - PowerPoint PPT Presentation

Data Management Systems Fall Semester 2020 Gustavo Alonso Institute of Computing Platforms Content Introduction Department of Computer Science ETH Zürich Course Introduction 1

Course contents • Intro & basics • Storage • Access methods • Query Processing • Concurrency Control and Recovery • Distributed Transaction Processing • Key Value Stores • Cloud Data Management Systems Course Introduction 2

Why data management? • Databases and data management are at the heart of many key applications • The “assembly code” of data science • Organizing and processing information • Data management engines • Architecture and algorithms • Scalability and distribution • Cloud and new hardware • What you learn here will help you be more effective as a computer scientist or as a scientist dealing with data Course Introduction 3

A Database System (DBMS) • A DBMS is a tool that helps develop and run data-intensive applications. • Push the complexity of dealing with the data (storage, processing, consistency) to the database rather than to the program • Share the database • The Database is a tool • Many shapes and forms • Many applications Course Introduction 4

Course Introduction 5

How does a database look like • Until recently, a database often meant a relational database • Today, there are many forms of data management engines (or databases) • Principles and ideas behind relational databases apply to almost all forms of data management Course Introduction 6

SQL Accelerators From Oracle M7 documentation Course Introduction 7

Why use a DBMS? • Avoid redundancy and inconsistency • Rich (declarative) access to the data • Synchronize concurrent data access • Recovery after system failures • Security and privacy • Facilitate reuse of the data • Reduce cost and pain of doing something useful • There is always an alternative!!! Course Introduction 8

What makes a database great? • Physical and logical data independence • Declarative language (SQL) • Query optimization • Data integrity • Access controls • Persistence • Concurrency Control • Recovery • Scalability and flexibility • … Course Introduction 9

Data independence Physical data independence Logical data independence • The database hides how the • The database allows to build data is actually stored views over the schema so that persistently and represented in different logical interpretations memory of the same data are possible • What the user sees are just • This is implemented through relations and a schema views and the engine is responsible for maintaining such • How to map physical data to the views schema is the engine’s problem Course Introduction 10

Declarative language • SQL (Structured Query TPC-H Query 1 SELECT L_RETURNFLAG, L_LINESTATUS, SUM(L_QUANTITY) AS Language) specifies how the SUM_QTY, SUM(L_EXTENDEDPRICE) AS SUM_BASE_PRICE, result looks like and describes SUM(L_EXTENDEDPRICE*(1-L_DISCOUNT)) AS the tuples that should be part of SUM_DISC_PRICE, SUM(L_EXTENDEDPRICE*(1-L_DISCOUNT)*(1+L_TAX)) AS the result SUM_CHARGE, AVG(L_QUANTITY) AS AVG_QTY, AVG(L_EXTENDEDPRICE) AS AVG_PRICE, AVG(L_DISCOUNT) AS • It does not say how to get to the AVG_DISC, COUNT(*) AS COUNT_ORDER result (no control flow) FROM LINEITEM WHERE L_SHIPDATE <= dateadd(dd, -90, cast('1998-12-01' as • This is what allows the database date)) to optimize queries GROUP BY L_RETURNFLAG, L_LINESTATUS ORDER BY L_RETURNFLAG,L_LINESTATUS • Standardized (to some extent) Course Introduction 11

Query optimization • Relational algebra allows to prove the equivalence of certain transformations. • One an SQL statement is transformed into an operator tree (a plan), the plan can be optimized using additional information, heuristics, cost estimations, etc. • This is automatic, user is not involved (hints allowed) Course Introduction 12

Data integrity • Databases use a number of features to preserve the integrity of the data: • Foreign key constraints (if a tuple appears in one table, it has to appear in another table) • NULL/NOT-NULL (preventing that key attributed are left empty) • Value constraints (preventing that attributes get assigned invalid values) • … • This allows to define rules controlling the consistency of the data, letting the engine enforce these rules at run time. These rules are declarative. Course Introduction 13

Access controls • Relations and queries are logically the same thing • Database engines provide access control by determining what data can be seen using queries • Different possibilities of implementation • The engine decides how to enforce access control • Allows a declarative specification of access control • Use similar tricks as query optimization … Course Introduction 14

Concurrency control • One of the big selling points of database engines • No matter how many users read and write the data concurrently, they all see a consistent database • The application does not need to worry about the fact that other applications/users are also modifying the data • The engine ensures that you read the correct data even if it is being modified and it also ensures your changes make it to the database even when others are also reading and writing the same data • A hot topic for decades and still very much being discussed (lots of fake news from companies when they do not know how to do it) Course Introduction 15

Recovery • Another big selling point of database engines • Data committed to the database will be persistent and recoverable even when failures occur • Crucial in a wide range of applications (financial, banking, etc.) • Upon restart (even after a failure) the database will restore its last consistent stage before shutting down • The database can recreate its past states and records all the changes made, keeping track of all modifications to the data or the system Course Introduction 16

Scalability and flexibility • Database engines today provide very high levels of performance and a wide variety of features • Very complex systems (potentially very expensive as well) with many extension and deployments • Established code basis have been optimized to the extreme and are highly reliable (code tested by very many users) • Often, old engines are better not because of their design but because of their reliability and hugely optimized code • In spite of diversification, relational engines remain the key benchmark in functionality as well as performance Course Introduction 17

Architecture Course Introduction 18

Database Architecture Relations, views Application Queries, Transactions (SQL) Logical data (tables, schemas) Logical view (logical data) Record Interface Logical records (tuples) Access Paths Record Access Physical records Physical data in memory Page access Page structure Pages in memory File Access Storage allocation Blocks, files, segments Physical storage Course Introduction 19

Database Architecture (another view) https://docs.oracle.com/cd/E11882_01/server.112/e40540/pro cess.htm#CNCPT902 Course Introduction 20

Database Architecture (yet another view) https://dsf.berkeley.edu/papers/fntdb07-architecture.pdf Course Introduction 21

Basic Architectures – shared nothing • Each node runs on its own data and NETWORK on its own engine. They share neither disk nor memory • Function shipping = the query has to go to where the data is DB DB DB • Data Shipping = the data has to go Engine Engine Engine where the query is • Easy to maintain and to scale Disk Disk Disk • Ideal when data can be sharded • Replication used for fault tolerance and to minimize data shipping Course Introduction 22

Who uses shared nothing? • Shared nothing works well when: • Data can be sharded (partitioned) across the NETWORK nodes • There are not many updates across nodes • The data fits within the local storage of each node DB DB DB Engine Engine Engine • Replication does not add too much overhead • Queries can be parallelized across the nodes • Shared nothing does not work when Disk Disk Disk • Data cannot be easily partitioned • There are too many updates across all the data Course Introduction 23

Example of processing with shared nothing Merge and answer T Results Results SELECT * FROM T WHERE Age > 50 SELECT * DB SELECT * SELECT * DB DB FROM T Engine FROM T FROM T Engine Engine WHERE WHERE WHERE Age > 50 Age > 50 Age > 50 Disk Disk Disk T3 T1 T2 Course Introduction 24

Example of processing with shared nothing Process and answer Data Shipping (S) R S SELECT * FROM R,S WHERE R.name = S.name DB SELECT * DB DB Engine FROM R,S Engine Engine WHERE R.name = S.name Disk Disk Disk S R Course Introduction 25

Example of processing with shared nothing R S Function shipping SELECT * FROM R,S WHERE R.id = S.id SELECT * FROM R,S DB DB DB WHERE S Engine Engine Engine R.id = S.id Disk Disk Disk S R Course Introduction 26

Data Management Systems Fall Semester 2020 Gustavo Alonso - PowerPoint PPT Presentation

Data Management Systems Fall Semester 2020 Gustavo Alonso Institute of Computing Platforms Content Introduction Department of Computer Science ETH Zrich Course Introduction 1 Course contents Intro & basics Storage Access

Systems Systems Systems Integration Systems Integration Systems Systems Integration Systems

Types of Expert Systems Interpretation Systems Prediction Systems Diagnosis Systems

DataCamp Data Types for Data Science DataCamp Data Types for Data Science Data types Data type

CMPUT 391 Database Management Systems Spatial Data Management 1 Dr. Jrg Sander, 2006

BSD Data Systems Report to the School Board, April 2019 Data-Driven ESSA Best Practices Data

Data Collection and Data Management saverio . giallorenzo @gmail.com 1 Web Science Data

PRESENTS PRESENTS TRUE HOTEL MANAGEMENT SYSTEM TRUE HOTEL MANAGEMENT SYSTEM MANAGEMENT FEATURE

PUBLIC FINANCE MANAGEMENT SYSTEMS MANAGEMENT SYSTEMS AND RISK MODELLING IN TURKISH TREASURY

Database Management Objectives of Lecture 10 Systems Spatial Data Management Spatial Data

Bracing Systems Bracing Systems 1 1 Rod Bracing Rod Bracing 2 2 Wind Bracing Systems Wind

COMP9313: Big Data Management Introduction to Big Data Management What is big data? Tweeted by

Database Management Objectives of Lecture 5 Systems Data Warehousing and OLAP Data Warehousing

Environmental Health Science Data Streams Data Streams Health Data Health Data Brian S.

Diagnose data for cleaning Cleaning Data in Python Cleaning data Prepare data for analysis

CS378 Introduction to Data Mining Data Exploration and Data Preprocessing Li Xiong Data

Data Preparation Discretization Data cleaning (Data pre-processing) Data

CAS CS 460/660 Data Base Design En3ty/Rela3onship Model

History Drawbacks of using file systems (cont.) Atomicity of updates Failures may

Architecture of DBMS Mrs. Maninder Kaur professormaninder@gmail.com Mrs. Maninder Kaur

1 5.1 Introduction A Typical Memory Hierarchy Memory Technology q Take advantage of the

Wentworth Institute of Technology College of Engineering and Technology COMP2670 Database

Chapter 1 Introduction 1

CSE 461 MIDTERM REVIEW HELP YOURSELF TO SNACKS MIDTERM OVERVIEW 50 minutes Closed

Combining Instrumentation and Sampling for Trace-based Application Performance Analysis 8th