DATA ANALYTICS USING DEEP LEARNING
GT 8803 // FALL 2019 // JOY ARULRAJ
L E C T U R E # 0 7 : S T O R A G E M O D E L S & C O M P R E S S I O N
DATA ANALYTICS USING DEEP LEARNING GT 8803 // FALL 2019 // JOY - - PowerPoint PPT Presentation
DATA ANALYTICS USING DEEP LEARNING GT 8803 // FALL 2019 // JOY ARULRAJ L E C T U R E # 0 7 : S T O R A G E M O D E L S & C O M P R E S S I O N administrivia Reminder Assignment 1 due on next Wednesday Sign up for discussion
L E C T U R E # 0 7 : S T O R A G E M O D E L S & C O M P R E S S I O N
GT 8803 // Fall 2019
– Assignment 1 due on next Wednesday – Sign up for discussion slots on next Thursday
2
GT 8803 // Fall 2019
– Buffer management (ACID) – Query processing – Concurrency control (ACID) – Logging and recovery (ACID)
3
GT 8803 // Fall 2019
4
GT 8803 // Fall 2018
5
GT 8803 // Fall 2019
Connection Manager + Admission Control Query Parser Query Optimizer Query Executor Lock Manager (Concurrency Control) Access Methods (or Indexes) Buffer Pool Manager Log Manager Memory Manager + Disk Manager Networking Manager
6
Source: Anatomy of a Database System
GT 8803 // Fall 2019
– The schema tells the DBMS how to convert the bytes into the appropriate type (e.g., INTEGER, DATE). – Each tuple is prefixed with a header that contains meta-data (e.g., last modified time-stamp).
7
GT 8803 // Fall 2019
– N-ary Storage Model (NSM) / Row-Store – Decomposition Storage Model (DSM) / Column- Store – Flexible or Hybrid Storage Model
8
GT 8803 // Fall 2019
9
GT 8803 // Fall 2019
10
1 Georgia Tech 15000 Atlanta 2 Wisconsin 30000 Madison 3 Carnegie Mellon 6000 Pittsburgh ID University Enrollment City 4 UC Berkeley 30000 Berkeley
GT 8803 // Fall 2019
– Tuples are stored in blocks called a heap. – The heap does not necessarily define an order
– Tuples are stored in the primary key index itself. – Index does define an order based on the primary key
11
GT 8803 // Fall 2019
– Fast inserts, updates, and deletes. – Good for queries that need the entire tuple. – Can use index-oriented physical storage.
– Not good for scanning large portions of the table and/or a subset of the attributes. – OLAP workloads & wide tables with lots of attributes
12
GT 8803 // Fall 2019
– Sometimes also called vertical partitioning.
13
GT 8803 // Fall 2019
14
1 2 3 4
Georgia Tech Wisconsin Carnegie Mellon UC Berkeley 15000 30000 6000 30000 Atlanta Madison Pittsburgh Berkeley ID University Enrollment City
GT 8803 // Fall 2019
– Each value is the same length for an attribute.
– Each value is stored with its tuple id in a column.
15
1 2 3
A B C D
A
1 2 3
B
1 2 3
C
1 2 3
D
1 2 3
GT 8803 // Fall 2019
– Reduces the amount wasted work because the DBMS only reads the data that it needs. – Better compression.
– Slow for point queries, inserts, updates, and deletes because of tuple splitting/stitching (OLTP workloads).
16
GT 8803 // Fall 2019
– A newly inserted tuple is more likely to be updated again the near future.
– At some point, a tuple is only accessed in read-only queries along with other tuples.
17
GT 8803 // Fall 2018
18
OLTP DATA SILOS OLAP DATA WAREHOUSE
GT 8803 // Fall 2018
19
OLTP DATA SILOS OLAP DATA WAREHOUSE
GT 8803 // Fall 2018
20
OLTP DATA SILOS OLAP DATA WAREHOUSE
GT 8803 // Fall 2019
21
GT 8803 // Fall 2019
22
1 2 3 4
Georgia Tech 15000 Wisconsin 30000 Carnegie Mellon 6000 UC Berkeley 30000 Atlanta Madison Pittsburgh Berkeley ID University Enrollment City
GT 8803 // Fall 2019
– Don’t need to store two copies of the database. – Don’t need to sync multiple database segments.
23
BRIDGING THE ARCHIPELAGO BETWEEN ROW-STORES AND COLUMN-STORES FOR HYBRID WORKLOADS SIGMOD 2016
GT 8803 // Fall 2018
24
SELECT AVG(B) FROM AndySux WHERE C = “yyy” UPDATE AndySux SET A = 123, B = 456, C = 789 WHERE D = “xxx”
A B C D
GT 8803 // Fall 2018
25
SELECT AVG(B) FROM AndySux WHERE C = “yyy” UPDATE AndySux SET A = 123, B = 456, C = 789 WHERE D = “xxx”
A B C D
GT 8803 // Fall 2018
26
SELECT AVG(B) FROM AndySux WHERE C = “yyy” UPDATE AndySux SET A = 123, B = 456, C = 789 WHERE D = “xxx”
A B C D
Cold Hot
GT 8803 // Fall 2018
27
SELECT AVG(B) FROM AndySux WHERE C = “yyy” UPDATE AndySux SET A = 123, B = 456, C = 789 WHERE D = “xxx”
A B C D A B C D A B C D
Cold Hot
GT 8803 // Fall 2018
28
SELECT AVG(B) FROM AndySux WHERE C = “yyy” UPDATE AndySux SET A = 123, B = 456, C = 789 WHERE D = “xxx”
A B C D A B C D A B C D
Cold Hot
GT 8803 // Fall 2019
29
1 2 Georgia Tech 15000 Wisconsin 30000 Atlanta Madison ID University Enrollment City 3 Carnegie Mellon 4 UC Berkeley 6000 30000 Pittsburgh Berkeley
GT 8803 // Fall 2019
30
A B C D
GT 8803 // Fall 2019
31
A B C D
GT 8803 // Fall 2019
32
A B C D
Tile #1 Tile #2 Tile #3 Tile #4
GT 8803 // Fall 2019
33
A B C D
Tile #1 Tile #2 Tile #3 Tile #4
H
+ + + + +
GT 8803 // Fall 2019
34
A B C D H
+ + + + +
GT 8803 // Fall 2019
35
A B C D H
+ + + + +
SELECT AVG(B) FROM AndySux WHERE C = “yyy”
GT 8803 // Fall 2019
36
A B C D H
+ + + + +
SELECT AVG(B) FROM AndySux WHERE C = “yyy”
1 2
B
1 2 3
GT 8803 // Fall 2019
37
A B C D H
+ + + + +
SELECT AVG(B) FROM AndySux WHERE C = “yyy”
1 2
B
1 2 3
GT 8803 // Fall 2019
– This will enable relational DBMSs to support both OLTP and OLAP database workloads.
38
GT 8803 // Fall 2018
39
GT 8803 // Fall 2019
– CPU cost for decompressing data < – I/O cost for fetching un-compressed data.
40
GT 8803 // Fall 2019
– Compressing the database reduces DRAM requirements and processing.
– In-memory DBMSs (always?) choose speed.
41
GT 8803 // Fall 2019
– Example: Zipfian distribution of the Brown Corpus – Words like “the”, “a” occur very frequently in books
42
GT 8803 // Fall 2019
– Example: Order Date to Ship Date (few days) – (June 5, +5) instead of (June 5, June 10)
43
GT 8803 // Fall 2019
44
GT 8803 // Fall 2019
– Example: Sensor data. Readings are taken every second, but we may only store average per minute.
– Example: BlinkDB, SnappyData, XDB, Oracle (2017)
45
GT 8803 // Fall 2019
46
Original Data
val 100 200 300 400 400
GT 8803 // Fall 2019
47
Zone Map
val 100 400 280 1400 type MIN MAX AVG SUM 5 COUNT
Original Data
val 100 200 300 400 400
GT 8803 // Fall 2019
48
Zone Map
val 100 400 280 1400 type MIN MAX AVG SUM 5 COUNT
Original Data
val 100 200 300 400 400
SELECT * FROM table WHERE val > 600
GT 8803 // Fall 2019
– Run-length Encoding – Bitmap Encoding – Delta Encoding – Incremental Encoding – Mostly Encoding – Dictionary Encoding
49
GT 8803 // Fall 2019
50
DICTIONARY-BASED ORDER-PRESERVING STRING COMPRESSION FOR MAIN MEMORY COLUMN STORES
SIGMOD 2009
GT 8803 // Fall 2019
– Encode: For a given uncompressed value, convert it into its compressed form. – Decode: For a given compressed value, convert it back into its original form.
51
GT 8803 // Fall 2019
52
GT 8803 // Fall 2019
– Compute the dictionary for all the tuples at a given point of time. – New tuples must use a separate dictionary or the all tuples must be recomputed.
– Merge new tuples in with an existing dictionary. – Likely requires re-encoding of existing tuples.
53
GT 8803 // Fall 2019
– Only include a subset of tuples within a single table. – Lower compression ratio, but can add tuples easily – Impact of dictionary data corruption is localized
– Construct a dictionary for the entire table. – Better compression ratio, but expensive to update.
– Sometimes helps with joins and set operations.
54
GT 8803 // Fall 2019
55
GT 8803 // Fall 2018
56
GT 8803 // Fall 2019
– Query parser – Query optimizer – Query execution engine – Storage engine
57
GT 8803 // Fall 2019
– This slows down model training – Traditional video compression formats are
– Accelerate model training – Leverage the observation that the compression format need not be optimized for human consumption
58
GT 8803 // Fall 2019
– I-frame (intra-coded picture) – P-frame (predicted picture i.e. delta from I-frame) – B-frame (bi-directional predicted picture i.e. deltas from both the preceding and following frames)
59
GT 8803 // Fall 2019
60
ENCODER (CONVOLUTIONS) DECODER (DECONVOLUTIONS) COMPRESSED DATA
GT 8803 // Fall 2019
– Compress frames using the auto encoder – Store compressed frames in the storage engine
– Reduce storage footprint by orders of magnitude – Accelerate query processing
61
GT 8803 // Fall 2019
62
GT 8803 // Fall 2019
63