DATA ANALYTICS USING DEEP LEARNING GT 8803 // FALL 2019 // JOY - PowerPoint PPT Presentation

DATA ANALYTICS USING DEEP LEARNING GT 8803 // FALL 2019 // JOY ARULRAJ
LECTURE #07: STORAGE MODELS & COMPRESSION

ADMINISTRIVIA
• Reminder
  – Assignment 1 is due next Wednesday
  – Sign up for discussion slots next Thursday

LAST CLASS
• Disk-centric & in-memory DBMSs
  – Buffer management (the D in ACID)
  – Query processing
  – Concurrency control (the I in ACID)
  – Logging and recovery (the A and D in ACID)

TODAY’S AGENDA
• Storage Models
• Compression
• Visual Storage Engine

STORAGE MODELS

ANATOMY OF A DATABASE SYSTEM
• Process Manager
  – Connection Manager + Admission Control
• Query Processor
  – Query Parser
  – Query Optimizer
  – Query Executor
• Transactional Storage Manager
  – Lock Manager (Concurrency Control)
  – Access Methods (or Indexes)
  – Buffer Pool Manager
  – Log Manager
• Shared Utilities
  – Memory Manager + Disk Manager
  – Networking Manager
Source: Anatomy of a Database System

DATA ORGANIZATION
• One can think of an in-memory database as just a large array of bytes.
  – The schema tells the DBMS how to convert the bytes into the appropriate type (e.g., INTEGER, DATE).
  – Each tuple is prefixed with a header that contains meta-data (e.g., last modified time-stamp).
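The idea of a schema that interprets raw bytes can be sketched in a few lines. This is only an illustration: the 16-byte layout, the field names, and the choice to store a DATE as a day ordinal are assumptions made for the example, not something the slides specify.

```python
import struct
from datetime import date

# Hypothetical fixed-length tuple layout (assumed for illustration only):
#   header : 8-byte last-modified timestamp
#   id     : 4-byte INTEGER
#   day    : 4-byte DATE, stored as a proleptic Gregorian ordinal
TUPLE_FORMAT = "<qiI"                       # little-endian, no padding
TUPLE_SIZE = struct.calcsize(TUPLE_FORMAT)  # 8 + 4 + 4 bytes


def encode_tuple(last_modified, tuple_id, day):
    """Serialize one tuple (header + attributes) into raw bytes."""
    return struct.pack(TUPLE_FORMAT, last_modified, tuple_id, day.toordinal())


def decode_tuple(storage, slot):
    """Use the schema to turn the bytes of one slot back into typed values."""
    last_modified, tuple_id, ordinal = struct.unpack_from(
        TUPLE_FORMAT, storage, slot * TUPLE_SIZE
    )
    return {
        "last_modified": last_modified,
        "id": tuple_id,
        "enrolled_on": date.fromordinal(ordinal),
    }


# The "database" is just one growing array of bytes.
storage = bytearray()
storage += encode_tuple(1_567_000_000, 1, date(2019, 8, 19))
storage += encode_tuple(1_567_000_500, 2, date(2019, 8, 20))

second = decode_tuple(storage, 1)
```

Without the format string, `storage` is an opaque run of bytes; the schema alone decides where a tuple starts and how each field is typed.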

TABLE STORAGE FORMAT
• Storage Models
  – N-ary Storage Model (NSM) / Row-Store
  – Decomposition Storage Model (DSM) / Column-Store
  – Flexible or Hybrid Storage Model

N-ARY STORAGE MODEL (NSM)
• The DBMS stores all of the attributes for a single tuple contiguously.
• Ideal for insert-heavy workloads and for OLTP workloads where txns tend to operate only on an individual entity.
• Use the tuple-at-a-time iterator model.

N-ARY STORAGE MODEL (NSM)
ID  University       Enrollment  City
1   Georgia Tech     15000       Atlanta
2   Wisconsin        30000       Madison
3   Carnegie Mellon  6000        Pittsburgh
4   UC Berkeley      30000       Berkeley
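A minimal sketch of a row-store layout for the table above. The fixed field widths (16 bytes for the university, 12 for the city) and the helper names `pack_row` and `fetch` are assumptions chosen for the example.

```python
import struct

# Assumed fixed-width row format: 4-byte id, 16-byte university,
# 4-byte enrollment, 12-byte city (short strings are padded with NUL bytes).
ROW = struct.Struct("<i16si12s")

rows = [
    (1, "Georgia Tech", 15000, "Atlanta"),
    (2, "Wisconsin", 30000, "Madison"),
    (3, "Carnegie Mellon", 6000, "Pittsburgh"),
    (4, "UC Berkeley", 30000, "Berkeley"),
]


def pack_row(row):
    """Lay out all attributes of one tuple contiguously."""
    tuple_id, university, enrollment, city = row
    return ROW.pack(tuple_id, university.encode(), enrollment, city.encode())


# NSM: tuple after tuple, each one stored in a single contiguous run of bytes.
heap = b"".join(pack_row(row) for row in rows)


def fetch(heap, slot):
    """Read one whole tuple with a single contiguous access."""
    tuple_id, university, enrollment, city = ROW.unpack_from(heap, slot * ROW.size)
    return (
        tuple_id,
        university.rstrip(b"\0").decode(),
        enrollment,
        city.rstrip(b"\0").decode(),
    )


carnegie = fetch(heap, 2)

# Scanning one attribute still walks over every full row.
total_enrollment = sum(fetch(heap, slot)[2] for slot in range(len(rows)))
```

Fetching a whole tuple is one read of `ROW.size` bytes, while the enrollment scan has to step over the university and city bytes of every row.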

NSM PHYSICAL STORAGE
• Choice #1: Heap-Organized Tables
  – Tuples are stored in blocks called a heap.
  – The heap does not necessarily define an order.
• Choice #2: Index-Organized Tables
  – Tuples are stored in the primary key index itself.
  – The index does define an order based on the primary key.

N-ARY STORAGE MODEL (NSM)
• Advantages
  – Fast inserts, updates, and deletes.
  – Good for queries that need the entire tuple.
  – Can use index-oriented physical storage.
• Disadvantages
  – Not good for scanning large portions of the table and/or a subset of the attributes.
  – OLAP workloads & wide tables with lots of attributes.

DECOMPOSITION STORAGE MODEL (DSM)
• The DBMS stores a single attribute for all tuples contiguously in a block of data.
  – Sometimes also called vertical partitioning.
• Ideal for OLAP workloads where read-only queries perform large scans over a subset of the table’s attributes.
• Use the vector-at-a-time iterator model.

DECOMPOSITION STORAGE MODEL (DSM)
ID  University       Enrollment  City
1   Georgia Tech     15000       Atlanta
2   Wisconsin        30000       Madison
3   Carnegie Mellon  6000        Pittsburgh
4   UC Berkeley      30000       Berkeley
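A minimal sketch of the same table stored column by column. The variable names and the `stitch` helper are hypothetical, and plain Python lists stand in for the string columns.

```python
from array import array

# DSM: one contiguous array per attribute; position i in every array
# belongs to the same logical tuple.
ids = array("i", [1, 2, 3, 4])
universities = ["Georgia Tech", "Wisconsin", "Carnegie Mellon", "UC Berkeley"]
enrollments = array("i", [15000, 30000, 6000, 30000])
cities = ["Atlanta", "Madison", "Pittsburgh", "Berkeley"]

# An analytical scan touches only the attribute it needs.
average_enrollment = sum(enrollments) / len(enrollments)


def stitch(offset):
    """Rebuild one full tuple by reading the same offset in every column."""
    return (ids[offset], universities[offset], enrollments[offset], cities[offset])


wisconsin = stitch(1)
```

The average reads a single array, whereas `stitch` has to visit four separate arrays to rebuild one tuple.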

TUPLE IDENTIFICATION IN DSM
• Choice #1: Fixed-length Offsets
  – Each value is the same length for an attribute.
• Choice #2: Embedded Tuple Ids
  – Each value is stored with its tuple id in a column.
[Figure: columns A, B, C, D for tuples 0 to 3, shown once under "Offsets" and once under "Embedded Ids".]
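The two choices can be contrasted with a small sketch. The 4-byte integer encoding and the helper names `value_at` and `find` are assumptions for illustration.

```python
import struct

values = [15000, 30000, 6000, 30000]

# Choice #1: fixed-length offsets. Every value occupies 4 bytes, so the
# value of tuple i starts at byte i * 4 and no id needs to be stored.
VALUE = struct.Struct("<i")
offset_column = b"".join(VALUE.pack(v) for v in values)


def value_at(column, tuple_offset):
    """Jump straight to a tuple's value using arithmetic on its offset."""
    return VALUE.unpack_from(column, tuple_offset * VALUE.size)[0]


# Choice #2: embedded tuple ids. Every entry carries (tuple id, value),
# which costs extra space and requires a search to locate a tuple.
PAIR = struct.Struct("<ii")
embedded_column = b"".join(PAIR.pack(i, v) for i, v in enumerate(values))


def find(column, tuple_id):
    """Scan the (id, value) pairs until the requested tuple id is found."""
    for stored_id, value in PAIR.iter_unpack(column):
        if stored_id == tuple_id:
            return value
    return None
```

For four tuples the offset column takes 16 bytes and the embedded-id column 32 bytes, which is the space cost of carrying the ids.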

DECOMPOSITION STORAGE MODEL (DSM)
• Advantages
  – Reduces the amount of wasted work because the DBMS only reads the data that it needs.
  – Better compression.
• Disadvantages
  – Slow for point queries, inserts, updates, and deletes because of tuple splitting/stitching (OLTP workloads).

OBSERVATION
• Can we build a single system that supports both OLTP and OLAP workloads?
• Data is “hot” when first entered into the database.
  – A newly inserted tuple is more likely to be updated again in the near future.
• As a tuple ages, it is updated less frequently.
  – At some point, a tuple is only accessed in read-only queries along with other tuples.

BIFURCATED ENVIRONMENT
[Figure: OLTP data silos, Extract-Transform-Load, OLAP data warehouse.]

HYBRID STORAGE MODEL
• Single database instance that uses different storage models for hot and cold data.
  – Store new data in NSM for fast OLTP.
  – Migrate data to DSM for more efficient OLAP.

HYBRID STORAGE MODEL
ID  University       Enrollment  City
1   Georgia Tech     15000       Atlanta
2   Wisconsin        30000       Madison
3   Carnegie Mellon  6000        Pittsburgh
4   UC Berkeley      30000       Berkeley
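A toy sketch of the hot/cold split described above. The `HybridTable` class and its age-based migration policy (oldest rows go cold first) are assumptions made for the example; a real system would decide what is cold from observed access patterns.

```python
class HybridTable:
    """Hypothetical table that keeps new rows in NSM and old rows in DSM."""

    def __init__(self, columns):
        self.columns = list(columns)
        self.hot_rows = []                           # NSM: whole tuples
        self.cold = {name: [] for name in columns}   # DSM: one list per attribute

    def insert(self, row):
        """New data always lands in the row-oriented hot region."""
        self.hot_rows.append(tuple(row))

    def migrate(self, keep_hot):
        """Move all but the newest keep_hot rows into the column store."""
        cut = max(len(self.hot_rows) - keep_hot, 0)
        aging, self.hot_rows = self.hot_rows[:cut], self.hot_rows[cut:]
        for row in aging:
            for name, value in zip(self.columns, row):
                self.cold[name].append(value)

    def scan(self, name):
        """Yield one attribute for every tuple, cold region first."""
        yield from self.cold[name]
        index = self.columns.index(name)
        for row in self.hot_rows:
            yield row[index]


table = HybridTable(["ID", "University", "Enrollment", "City"])
table.insert((1, "Georgia Tech", 15000, "Atlanta"))
table.insert((2, "Wisconsin", 30000, "Madison"))
table.insert((3, "Carnegie Mellon", 6000, "Pittsburgh"))
table.insert((4, "UC Berkeley", 30000, "Berkeley"))
table.migrate(keep_hot=1)

enrollments = list(table.scan("Enrollment"))
```

After the migration, the three oldest tuples live in per-attribute lists and only the newest tuple is still stored as a row, yet `scan` returns the same values regardless of where each tuple lives.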

PELOTON ADAPTIVE STORAGE
• Employ a single execution engine architecture that is able to operate on both NSM and DSM data.
  – Don’t need to store two copies of the database.
  – Don’t need to sync multiple database segments.
• Note that a DBMS can still use the delta-store approach with this single-engine architecture.
Source: Bridging the Archipelago between Row-Stores and Column-Stores for Hybrid Workloads, SIGMOD 2016

PELOTON ADAPTIVE STORAGE
[Figure: Original Data and Adapted Data layouts of a table with attributes A, B, C, D; regions are labeled Hot and Cold.]
UPDATE AndySux SET A = 123, B = 456, C = 789 WHERE D = 'xxx'
SELECT AVG(B) FROM AndySux WHERE C = 'yyy'

FLEXIBLE STORAGE MODEL
ID  University       Enrollment  City
1   Georgia Tech     15000       Atlanta
2   Wisconsin        30000       Madison
3   Carnegie Mellon  6000        Pittsburgh
4   UC Berkeley      30000       Berkeley

TILE ABSTRACTION
• Introduce an indirection layer that abstracts the true layout of tuples from query operators.
[Figure: a table with attributes A, B, C, D is divided into Tile Group A and Tile Group B. A tile group consists of a header (H) and tiles (Tile #1 to Tile #4). The operators of the example query are shown alongside the tiles.]
SELECT AVG(B) FROM AndySux WHERE C = 'yyy'
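One way to picture the indirection layer is a small lookup structure that maps an attribute name to the tile that physically holds it. This is a simplified sketch, not Peloton's implementation: the `TileGroup` class, its `value` method, and the sample data are all hypothetical.

```python
class TileGroup:
    """Hypothetical tile group: a header (attribute map) plus physical tiles."""

    def __init__(self, layout, rows):
        # Header: attribute name -> (tile number, position inside the tile).
        self.where = {
            attr: (tile_no, pos)
            for tile_no, attrs in enumerate(layout)
            for pos, attr in enumerate(attrs)
        }
        # Each tile stores only its own subset of the attributes.
        self.tiles = [
            [tuple(row[attr] for attr in attrs) for row in rows]
            for attrs in layout
        ]
        self.size = len(rows)

    def value(self, offset, attr):
        """Operators ask by attribute name; the header resolves the layout."""
        tile_no, pos = self.where[attr]
        return self.tiles[tile_no][offset][pos]


def avg_b_where_c(tile_groups, c_value):
    """SELECT AVG(B) ... WHERE C = c_value, written against the abstraction."""
    matches = [
        group.value(offset, "B")
        for group in tile_groups
        for offset in range(group.size)
        if group.value(offset, "C") == c_value
    ]
    return sum(matches) / len(matches) if matches else None


# One group keeps all attributes together (row-like); the other splits them.
row_like = TileGroup(
    [("A", "B", "C", "D")],
    [{"A": 1, "B": 10, "C": "yyy", "D": "x"}, {"A": 2, "B": 20, "C": "zzz", "D": "x"}],
)
split = TileGroup(
    [("A",), ("B", "C"), ("D",)],
    [{"A": 3, "B": 30, "C": "yyy", "D": "x"}, {"A": 4, "B": 50, "C": "yyy", "D": "q"}],
)

result = avg_b_where_c([row_like, split], "yyy")
```

The query function never inspects the layout, so changing how a tile group is tiled does not require any change to the operator code.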

PARTING THOUGHTS
• A flexible architecture that supports a hybrid storage model is the next major trend in DBMSs.
  – This will enable relational DBMSs to support both OLTP and OLAP database workloads.

COMPRESSION

OBSERVATION
• I/O is the main bottleneck if the DBMS has to fetch data from disk.
  – CPU cost for decompressing data < I/O cost for fetching uncompressed data.
• Compression always helps.

OBSERVATION
• In-memory DBMSs are more complicated.
  – Compressing the database reduces DRAM requirements and processing.
• Key trade-off is speed vs. compression ratio.
  – In-memory DBMSs (always?) choose speed.
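The two observations on compression can be compared with a back-of-the-envelope cost model. Everything numeric here is an assumption picked for illustration (a 4x compression ratio, 200 MB/s disk, 10 GB/s DRAM, 1 GB/s decompression), and the model ignores any overlap between fetching and decompressing.

```python
def uncompressed_scan_seconds(raw_bytes, fetch_bytes_per_s):
    """Time to fetch the data as-is from the storage device."""
    return raw_bytes / fetch_bytes_per_s


def compressed_scan_seconds(raw_bytes, ratio, fetch_bytes_per_s, decompress_bytes_per_s):
    """Time to fetch the smaller compressed data, then decompress it on the CPU."""
    fetch = (raw_bytes / ratio) / fetch_bytes_per_s
    decompress = raw_bytes / decompress_bytes_per_s
    return fetch + decompress


RAW = 1_000_000_000            # 1 GB of table data (assumed)
RATIO = 4                      # assumed compression ratio
DECOMPRESS = 1_000_000_000     # assumed 1 GB/s decompression throughput

# Disk-centric DBMS: fetching is slow, so shrinking the data pays off.
DISK = 200_000_000             # assumed 200 MB/s
disk_plain = uncompressed_scan_seconds(RAW, DISK)
disk_packed = compressed_scan_seconds(RAW, RATIO, DISK, DECOMPRESS)

# In-memory DBMS: fetching is fast, so decompression dominates.
DRAM = 10_000_000_000          # assumed 10 GB/s
dram_plain = uncompressed_scan_seconds(RAW, DRAM)
dram_packed = compressed_scan_seconds(RAW, RATIO, DRAM, DECOMPRESS)
```

Under these assumed numbers the compressed scan is faster on disk but slower in memory, which is why an in-memory system tends to prefer lightweight schemes with fast decompression over a high compression ratio.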
