Big Data for Data Science: SQL on Big Data
event.cwi.nl/lsde | www.cwi.nl/~boncz/bigdatacourse

THE DEBATE: DATABASE SYSTEMS VS MAPREDUCE
A major step backwards?
- MapReduce is a step backward in database access
  – Schemas are good
  – Separation of the schema from the application is good
  – High-level access languages are good
- MapReduce is a poor implementation
  – Brute force and only brute force (no indexes, for example)
- MapReduce is not novel
- MapReduce is missing features
  – Bulk loader, indexing, updates, transactions…
- MapReduce is incompatible with DBMS tools
Source: blog post by DeWitt and Stonebraker (Michael Stonebraker, Turing Award winner 2015)
Known and unknown unknowns
- Databases only help if you know what questions to ask
  – "Known unknowns"
- What if you don't know what you're looking for?
  – "Unknown unknowns"
ETL: redux
- Often, with noisy datasets, ETL is the analysis!
- Note that ETL necessarily involves brute-force data scans
- L, then E and T? (load the raw data first; extract and transform later)
Structure of Hadoop warehouses
Source: Wikipedia (Star Schema)
Don’t normalize!
Relational databases vs. MapReduce
- Relational databases:
  – Multipurpose: analysis and transactions; batch and interactive
  – Data integrity via ACID transactions
  – Lots of tools in the software ecosystem (for ingesting, reporting, etc.)
  – Supports SQL (and SQL integration, e.g., JDBC)
  – Automatic SQL query optimization
- MapReduce (Hadoop):
  – Designed for large clusters, fault tolerant
  – Data is accessed in "native format"
  – Supports many query languages
  – Programmers retain control over performance
  – Open source
Source: O'Reilly blog post by Joseph Hellerstein (11/19/2008)
Philosophical differences
- Parallel relational databases
  – Schema on write
  – Failures are relatively infrequent
  – "Possessive" of data
  – Mostly proprietary
- MapReduce
  – Schema on read
  – Failures are relatively common
  – In situ data processing
  – Open source
MapReduce vs. RDBMS: grep
SELECT * FROM Data WHERE field LIKE '%XYZ%';
Source: Pavlo et al. (2009) A Comparison of Approaches to Large-Scale Data Analysis. SIGMOD.
MapReduce vs. RDBMS: select
SELECT pageURL, pageRank FROM Rankings WHERE pageRank > X;
Source: Pavlo et al. (2009) A Comparison of Approaches to Large-Scale Data Analysis. SIGMOD.
MapReduce vs. RDBMS: aggregation
SELECT sourceIP, SUM(adRevenue) FROM UserVisits GROUP BY sourceIP;
Source: Pavlo et al. (2009) A Comparison of Approaches to Large-Scale Data Analysis. SIGMOD.
MapReduce vs. RDBMS: join
Source: Pavlo et al. (2009) A Comparison of Approaches to Large-Scale Data Analysis. SIGMOD.
Why?
- Schemas are a good idea
  – Parsing fields out of flat text files is slow
  – Schemas define a contract, decoupling logical from physical
- Schemas allow for building efficient auxiliary structures
  – Value indexes, join indexes, etc.
- Relational algorithms have been optimised for the underlying system
  – The system itself has complete control of performance-critical decisions
  – Storage layout, choice of algorithm, order of execution, etc.
Storage layout: row vs. column stores
(figure: records R1–R4 stored contiguously in a row store vs. split per column in a column store)
- Row stores
  – Easy to modify a record
  – Might read unnecessary data when processing
- Column stores
  – Only read necessary data when processing
  – Tuple writes require multiple accesses
Advantages of column stores
- Read efficiency
  – If you only need to access a few columns, no need to drag around the rest of the values
- Better compression
  – Repeated values appear more frequently in a column than repeated rows appear
- Vectorised processing
  – Leveraging CPU architecture-level support
- Opportunities to operate directly on compressed data
  – For instance, when evaluating a selection, or when projecting a column
Why not in Hadoop?
- No reason why not: RCFile
Source: He et al. (2011) RCFile: A Fast and Space-Efficient Data Placement Structure in MapReduce-based Warehouse Systems. ICDE.
Some small steps forward
- MapReduce is a step backward in database access:
  – Schemas are good
  – Separation of the schema from the application is good
  – High-level access languages are good
- MapReduce is a poor implementation
  – Brute force and only brute force (no indexes, for example)
- MapReduce is not novel
- MapReduce is missing features
  – Bulk loader, indexing, updates, transactions…
- MapReduce is incompatible with DBMS tools
✔ ✔ ✔  ? ?  (several of these criticisms have since been addressed; others only partially)
Source: blog post by DeWitt and Stonebraker
MODERN SQL-ON-HADOOP SYSTEMS
Analytical Database Systems
- Parallel (MPP): Teradata, ParAccel, Pivotal, Vertica, Redshift, Oracle (IMM), Netezza, DB2-BLU, InfoBright, SQLserver (columnstore), Vectorwise
- Open source: MySQL, LucidDB, MonetDB
SQL-on-Hadoop Systems
Open source:
- Hive (Hortonworks)
- Impala (Cloudera)
- Drill (MapR)
- Presto (Facebook)
Commercial:
- HAWQ (Pivotal)
- Vortex (Actian)
- Vertica Hadoop (HP)
- Big SQL (IBM)
- Databricks
- Splice Machine
- CitusData
- InfiniDB Hadoop
Analytical DB engines for Hadoop
- storage: columnar storage + compression; table partitioning / distribution; exploiting correlated data
- query processor: CPU-efficient query engine (vectorized or JIT codegen); many-core ready; rich SQL (+ authorization, analytical/windowing functions, ..)
- system: batch update infrastructure; scaling with multiple nodes; MetaStore & file formats; YARN & elasticity
Columnar Storage
(figure: row-store vs. column-store layout of a table with Date, Customer, Product, Store, Price columns; panels show querying the data and inserting a new record)
- Row store: easy to add/modify a record, but might read in unnecessary data
- Column store: only need to read in relevant data, but tuple writes require multiple accesses
=> suitable for read-mostly, read-intensive, large data repositories
Columnar Compression
- Trades I/O for CPU
  – A winning proposition currently
  – Even trading RAM bandwidth for CPU wins
    - 64-core machines are starved for RAM bandwidth
- Additional column-store synergy:
  – Column store: data of the same distribution close together
    - Better compression rates
  – Generic compression (gzip) vs domain-aware compression
  – Synergy with vectorized processing (see later): compress/decompress/execution, SIMD
  – Can use extra space to store multiple copies of data in different sort orders (see later)
Run-length Encoding
(figure: sorted Product ID and Quarter columns compress into (value, start_pos, run_length) triples — Product ID: (1, 1, 5), (2, 6, 2), (1, 301, 3), (2, 304, 1), …; Quarter: (Q1, 1, 300), (Q2, 301, 350), (Q3, 651, 500), (Q4, 1151, 600); the unsorted Price column — 5 7 2 9 6 8 5 3 8 1 4 … — stays uncompressed)
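A minimal sketch of such an encoder (not from the slides; the Run struct, the 1-based start_pos convention and the tiny input are illustrative):

#include <stdio.h>

/* RLE triple: (value, start_pos, run_length), as in the figure */
typedef struct { int value; int start_pos; int run_length; } Run;

/* Encode col[0..n) into runs[]; returns the number of runs produced. */
int rle_encode(const int *col, int n, Run *runs) {
    int nruns = 0;
    for (int i = 0; i < n; ) {
        int j = i;
        while (j < n && col[j] == col[i]) j++;          /* extend the current run */
        runs[nruns++] = (Run){ col[i], i + 1, j - i };  /* 1-based start position */
        i = j;
    }
    return nruns;
}

int main(void) {
    int product_id[] = {1, 1, 1, 1, 1, 2, 2};            /* prefix of the sorted column */
    Run runs[8];
    int nruns = rle_encode(product_id, 7, runs);
    for (int r = 0; r < nruns; r++)                      /* prints (1, 1, 5) (2, 6, 2) */
        printf("(%d, %d, %d)\n", runs[r].value, runs[r].start_pos, runs[r].run_length);
    return 0;
}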
Bitmap Encoding
(figure: Product ID column 1 1 1 1 1 2 2 1 1 2 3 … expands into one bit-vector per value — ID: 1, ID: 2, ID: 3 — with a 1 wherever that value occurs)
- For each unique value v in column c, create a bit-vector b
  – b[i] = 1 if c[i] = v
- Good for columns with few unique values
- Each bit-vector can be further compressed if sparse
"Integrating Compression and Execution in Column-Oriented Database Systems", Abadi et al., SIGMOD '06
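A minimal sketch, assuming small non-negative integer values and at most 64 rows so each bit-vector fits in one machine word:

#include <stdint.h>
#include <stdio.h>

/* For each distinct value v, bit i of bitmap[v] is set iff col[i] == v. */
void bitmap_encode(const int *col, int n, uint64_t *bitmap, int nvalues) {
    for (int v = 0; v < nvalues; v++) bitmap[v] = 0;
    for (int i = 0; i < n; i++)
        bitmap[col[i]] |= (uint64_t)1 << i;
}

int main(void) {
    int product_id[] = {1, 1, 1, 1, 1, 2, 2, 1, 1, 2, 3};  /* column from the figure */
    uint64_t bv[4];
    bitmap_encode(product_id, 11, bv, 4);
    /* bv[1], bv[2], bv[3] are the per-value bit-vectors; a sparse one such
       as bv[3] (a single set bit) could be compressed further. */
    for (int v = 1; v <= 3; v++)
        printf("ID %d: %#llx\n", v, (unsigned long long)bv[v]);
    return 0;
}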
Dictionary Encoding
(figure: Quarter column Q1 Q2 Q4 Q1 Q3 Q1 Q1 Q2 Q4 Q3 Q3 … stored as small integer codes plus the dictionary 0: Q1, 1: Q2, 2: Q3, 3: Q4)
- For each unique value create a dictionary entry
- Dictionary can be per-block or per-column
- Column stores have the advantage that dictionary entries may encode multiple values at once
"Integrating Compression and Execution in Column-Oriented Database Systems", Abadi et al., SIGMOD '06
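A minimal sketch of a per-column string dictionary; note it assigns codes in order of first appearance rather than the sorted 0: Q1 .. 3: Q4 dictionary of the figure, and the sizes are illustrative:

#include <stdio.h>
#include <string.h>

/* Return the dictionary code for v, adding a new entry if needed. */
int dict_code(char dict[][4], int *ndict, const char *v) {
    for (int d = 0; d < *ndict; d++)
        if (strcmp(dict[d], v) == 0) return d;
    strcpy(dict[(*ndict)++], v);
    return *ndict - 1;
}

int main(void) {
    const char *quarter[] = {"Q1","Q2","Q4","Q1","Q3","Q1","Q1","Q2","Q4","Q3","Q3"};
    char dict[4][4]; int ndict = 0;
    int codes[11];
    for (int i = 0; i < 11; i++)
        codes[i] = dict_code(dict, &ndict, quarter[i]);
    for (int d = 0; d < ndict; d++) printf("%d: %s\n", d, dict[d]);  /* 0: Q1 1: Q2 2: Q4 3: Q3 */
    for (int i = 0; i < 11; i++) printf("%d ", codes[i]);            /* 0 1 2 0 3 0 0 1 2 3 3 */
    printf("\n");
    return 0;
}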
Differential Encoding
(figure: a Time column — 5:00 5:02 5:03 5:03 5:04 5:06 5:07 5:08 5:10 5:15 5:16 5:16 … — stored as 2-bit differences from the previous value, with ∞ escape codes marking exceptions; there are better ways to deal with exceptions)
- Encodes values as a b-bit offset from the previous value
- A special escape code (just like frame-of-reference encoding) indicates a difference larger than can be stored in b bits
  – After the escape code, the original (uncompressed) value is written
- Performs well on columns containing increasing/decreasing sequences
  – inverted lists
  – timestamps
  – object IDs
  – sorted / clustered columns
"Improved Word-Aligned Binary Compression for Text Indexing", Ahn, Moffat, TKDE '06
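A simplified sketch of the scheme: for clarity each delta occupies a whole byte here, where a real implementation would bit-pack b-bit deltas; the escape value is an illustrative choice:

#include <stdio.h>

#define ESCAPE 0xFF  /* escape code: the next byte holds an uncompressed value */

/* Delta-encode col[0..n) as small differences (0..3, i.e. what fits in
   b = 2 bits) from the previous value; larger gaps become exceptions. */
int delta_encode(const int *col, int n, unsigned char *out) {
    int k = 0, prev;
    out[k++] = ESCAPE; out[k++] = (unsigned char)col[0];  /* first value verbatim */
    prev = col[0];
    for (int i = 1; i < n; i++) {
        int d = col[i] - prev;
        if (d >= 0 && d <= 3)
            out[k++] = (unsigned char)d;                  /* fits in 2 bits */
        else {
            out[k++] = ESCAPE;                            /* exception */
            out[k++] = (unsigned char)col[i];             /* original value */
        }
        prev = col[i];
    }
    return k;
}

int main(void) {
    /* the Time column as minutes past 5:00 — 5:00 5:02 5:03 5:03 5:04 ... 5:15 */
    int times[] = {0, 2, 3, 3, 4, 6, 7, 8, 10, 15};
    unsigned char out[32];
    int k = delta_encode(times, 10, out);
    printf("%d bytes (deltas, plus escapes for gaps > 3)\n", k);
    return 0;
}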
Heavy-Weight Compression Schemes
- Modern disks (SSDs) can achieve > 1 GB/s
  – so if 1/3 of CPU time goes to decompression, it must decompress at 3 GB/s to keep up
=> Lightweight compression schemes are better
=> Even better: operate directly on compressed data
"Super-Scalar RAM-CPU Cache Compression", Zukowski, Heman, Nes, Boncz, ICDE '06
Operating Directly on Compressed Data
Examples:
- SUM_i(rle-compressed column[i]) becomes SUM_g(count[g] * value[g]) over the runs
- (country == "Asia") becomes (countryCode == 6) under dictionary encoding: a strcmp turns into an integer comparison (SIMD-friendly)
Benefits:
- The I/O–CPU tradeoff is no longer a tradeoff (CPU also gets improved)
- Reduces memory–CPU bandwidth requirements
- Opens up the possibility of operating on multiple records at once
"Integrating Compression and Execution in Column-Oriented Database Systems", Abadi et al., SIGMOD '06
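The first example as a sketch: summing an RLE-compressed column touches only the runs, never the decompressed values (reusing the illustrative Run layout from the RLE slide):

#include <stdio.h>

typedef struct { int value; int start_pos; int run_length; } Run;

/* SUM_i col[i] over an RLE column == SUM_g value[g] * run_length[g]. */
long sum_rle(const Run *runs, int nruns) {
    long total = 0;
    for (int g = 0; g < nruns; g++)
        total += (long)runs[g].value * runs[g].run_length;
    return total;
}

int main(void) {
    Run runs[] = { {1, 1, 5}, {2, 6, 2} };    /* decodes to 1 1 1 1 1 2 2 */
    printf("sum = %ld\n", sum_rle(runs, 2));  /* 1*5 + 2*2 = 9 */
    return 0;
}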
Table Partitioning and Distribution
- Data is spread based on a key
  – Functions: hash, range, list
- "Distribution"
  – Goal: parallelism
    - give each compute node a piece of the data
    - each query has work on every piece (keep everyone busy)
- "Partitioning"
  – Goal: data lifecycle management
    - a data warehouse e.g. keeps the last six months
    - every night: load one new day, drop the oldest partition
  – Goal: improve access pattern
    - when querying for May, skip Q1, Q3, Q4 ("partition pruning")
(figure: distribute by hash across nodes; partition by range into Q1 Q2 Q3 Q4)
Which kind of function would you use for which method? (see the sketch below)
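One possible answer as a sketch (the routing functions are hypothetical): hash for distribution, because it spreads arbitrary keys evenly over the nodes; range for partitioning, because contiguous date ranges give prunable, droppable partitions:

#include <stdio.h>

enum { NNODES = 4 };

int distribute_node(unsigned key)  { return key % NNODES; }     /* hash distribution */
int partition_quarter(int month)   { return (month - 1) / 3; }  /* range partitioning: Q1..Q4 */

int main(void) {
    unsigned order_key = 12345;
    int order_month = 5;  /* May */
    printf("node %d, partition Q%d\n",
           distribute_node(order_key), partition_quarter(order_month) + 1);
    /* A query restricted to May touches only the Q2 partition on every node;
       Q1, Q3 and Q4 are pruned without being read. */
    return 0;
}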
Data Placement in Hadoop
- Each node writes the partitions it owns
  – Where does the data end up, really?
- HDFS default block placement strategy:
  – The node that initiates the write gets the first copy
  – 2nd copy on the same rack
  – 3rd copy on a different rack
- Rows from the same record should be on the same node
  – Not entirely trivial in column stores
- Column partitions should be co-located
  – Simple solution:
    - Put all columns together in one file (RCFile, ORCFile, Parquet)
  – Complex solution:
    - Replace the default HDFS block placement strategy by a custom one
Popular File Formats in Hadoop
- Good old CSV
  – Textual, easy to parse (but slow); better compress it!
- Sequence Files
  – Binary data, faster to process
- RCFile
  – Hive's first attempt at a column-store
- ORCFile
  – Columnar compression, MinMax
- Parquet
  – Proposed by Twitter and Cloudera Impala
  – Like ORCFile, but no MinMax
Example: Parquet Format
(figures: Parquet file layout)
Exploiting Natural Order
- Data is often naturally ordered
  – very often, on date
- Data is often correlated
  – orderdate / paydate / shipdate
  – marketing campaigns / date
  – ..correlation is everywhere, ..hard to predict
- Zone maps
  – Very sparse index
  – Keeps MinMax for every column
  – Cheap to maintain: just widen the bounds on each modification
(figure: zones 0–3 with per-zone MinMax; Q: key BETWEEN 13 AND 15? Q: acctno BETWEEN 150 AND 200?)
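A minimal zone-map sketch (the zone bounds are illustrative): the BETWEEN predicate is answered per zone with an interval-overlap test, and only overlapping zones are scanned:

#include <stdio.h>

/* A zone map keeps MinMax per zone (block of rows) and per column. */
typedef struct { int min, max; } Zone;

int zone_may_match(Zone z, int lo, int hi) {
    return !(z.max < lo || z.min > hi);  /* can [min,max] overlap [lo,hi]? */
}

int main(void) {
    /* naturally ordered key column => tight, cheap-to-maintain bounds */
    Zone key_zones[] = { {1, 8}, {9, 16}, {17, 24}, {25, 32} };
    int lo = 13, hi = 15;                /* Q: key BETWEEN 13 AND 15? */
    for (int z = 0; z < 4; z++)
        if (zone_may_match(key_zones[z], lo, hi))
            printf("scan zone %d\n", z); /* only zone 1 is scanned */
    return 0;
}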
DBMS Computational Efficiency?
TPC-H 1GB, query 1
- selects 98% of the fact table, computes net prices and aggregates all
- Results:
  – C program: 0.2s
  – MySQL: 26.2s
  – DBMS "X": 28.1s
"MonetDB/X100: Hyper-Pipelining Query Execution", Boncz, Zukowski, Nes, CIDR '05
How Do Query Engines Work?
SELECT id, name, (age-30)*50 AS bonus FROM employee WHERE age > 30
(figure: a SCAN → SELECT → PROJECT pipeline; each next() call pulls one tuple up — alice/22/101 fails 22 > 30?, ivan/37/102 passes 37 > 30?, and 37 – 30 = 7, 7 * 50 = 350 yields ivan/350/102)
Operators implement an iterator interface:
- open()
- next(): tuple
- close()
Primitives provide computational functionality: all arithmetic allowed in expressions, e.g. multiplication mult(int,int) → int, as in 7 * 50
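A minimal sketch of this pull-based pipeline for the query above (the tuple layout and the hard-wired operator chain are illustrative; a real engine dispatches through open()/next()/close() function pointers per operator):

#include <stdio.h>

typedef struct { const char *name; int age; int id; } Tuple;

static Tuple employee[] = { {"alice", 22, 101}, {"ivan", 37, 102} };
static int scan_pos = 0;
static Tuple current;  /* scratch tuple passed up the pipeline */

Tuple *scan_next(void) {                   /* SCAN: read the base table */
    if (scan_pos >= 2) return NULL;
    current = employee[scan_pos++];
    return &current;
}

Tuple *select_next(void) {                 /* SELECT: age > 30 */
    Tuple *t;
    while ((t = scan_next()) != NULL)
        if (t->age > 30) return t;
    return NULL;
}

Tuple *project_next(void) {                /* PROJECT: bonus = (age - 30) * 50 */
    Tuple *t = select_next();
    if (t) t->age = (t->age - 30) * 50;    /* reuse the age slot for bonus */
    return t;
}

int main(void) {
    Tuple *t;
    while ((t = project_next()) != NULL)   /* the root pulls one tuple per next() */
        printf("%s %d %d\n", t->name, t->age, t->id);  /* ivan 350 102 */
    return 0;
}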
Vectorized In-Cache Processing
(figure: the same SCAN → SELECT → PROJECT pipeline, but each next() returns vectors of values — ids 101 102 104 105, names alice ivan peggy victor, ages 22 37 45 25; the selection yields FALSE TRUE TRUE FALSE, the projection computes 7 15 → 350 750 for ivan and peggy)
A vector = an array of ~100 values, processed in a tight loop, CPU-cache resident
Observations: next() is called much less often
- more time is spent in primitives, less in overhead
- primitive calls process an array of values in a loop: > 30 ?, – 30, * 50
CPU efficiency depends on "nice" code
- out-of-order execution
- few dependencies (control, data)
- compiler support
Compilers like simple loops over arrays
- loop-pipelining
- automatic SIMD
"MonetDB/X100: Hyper-Pipelining Query Execution", Boncz, Zukowski, Nes, CIDR '05
In a vectorized engine, each primitive is just a simple loop over arrays:
for(i=0; i<n; i++) res[i] = (col[i] > x)
for(i=0; i<n; i++) res[i] = (col[i] - x)
for(i=0; i<n; i++) res[i] = (col[i] * x)
(figure: applied to the age vector, the selection produces FALSE TRUE TRUE FALSE, the subtraction 7 15, the multiplication 350 750)
"MonetDB/X100: Hyper-Pipelining Query Execution", Boncz, Zukowski, Nes, CIDR '05
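The loops above assembled into a runnable sketch: each primitive processes a whole cache-resident vector per call, so interpretation overhead is paid once per vector instead of once per tuple (the names and the tiny vector size are illustrative):

#include <stdio.h>

enum { VECTOR_SIZE = 4 };  /* hundreds in practice: large enough to amortize
                              call overhead, small enough to stay in cache */

/* Simple loops over arrays: few dependencies, easy for the compiler to
   loop-pipeline and auto-SIMDize. */
void sel_gt(const int *col, int x, int *res, int n) {
    for (int i = 0; i < n; i++) res[i] = (col[i] > x);
}
void sub_const(const int *col, int x, int *res, int n) {
    for (int i = 0; i < n; i++) res[i] = col[i] - x;
}
void mul_const(const int *col, int x, int *res, int n) {
    for (int i = 0; i < n; i++) res[i] = col[i] * x;
}

int main(void) {
    int age[VECTOR_SIZE] = {22, 37, 45, 25};
    int mask[VECTOR_SIZE], tmp[VECTOR_SIZE], bonus[VECTOR_SIZE];
    sel_gt(age, 30, mask, VECTOR_SIZE);      /* 0 1 1 0 */
    sub_const(age, 30, tmp, VECTOR_SIZE);    /* -8 7 15 -5 */
    mul_const(tmp, 50, bonus, VECTOR_SIZE);  /* -400 350 750 -250 */
    for (int i = 0; i < VECTOR_SIZE; i++)
        if (mask[i]) printf("bonus %d\n", bonus[i]);  /* 350 and 750 */
    return 0;
}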
Varying the Vector Size
(figure: query time as a function of vector size)
- Larger vectors mean less and less iterator.next() and primitive function calls ("interpretation overhead")
- But vectors eventually start to exceed the CPU cache, causing additional memory traffic
"MonetDB/X100: Hyper-Pipelining Query Execution", Boncz, Zukowski, Nes, CIDR '05
Systems That Use Vectorization
- Actian Vortex (Vectorwise-on-Hadoop)
- Hive, Drill
Batch Update Infrastructure (Vertica)
Challenge: hard to update columnar compressed data
Trickle Load → Write Optimized Store (WOS)
- Memory-based
- Unsorted / uncompressed
- Segmented
- Low latency / small quick inserts
TUPLE MOVER (asynchronous data transfer)
→ Read Optimized Store (ROS)
- On disk
- Sorted / compressed
- Segmented
- Large data loaded direct
(figure: columns A B C flow from the WOS through the tuple mover into the ROS)
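A toy sketch of the WOS/ROS split (all names and sizes hypothetical; the insertion sort stands in for the sort-and-compress work of the real tuple mover, and real queries would have to read both stores):

#include <stdio.h>

enum { WOS_CAPACITY = 4 };

int wos[WOS_CAPACITY], wos_n = 0;  /* memory-based, unsorted, low latency */
int ros[64], ros_n = 0;            /* on disk in reality: sorted, compressed */

void tuple_mover(void) {           /* runs asynchronously in a real system */
    for (int i = 0; i < wos_n; i++) {          /* move WOS rows into ROS, sorted */
        int v = wos[i], j = ros_n++;
        while (j > 0 && ros[j-1] > v) { ros[j] = ros[j-1]; j--; }
        ros[j] = v;
    }
    wos_n = 0;                                 /* WOS drained */
}

void insert(int v) {               /* trickle load: quick, no sort/compress */
    if (wos_n == WOS_CAPACITY) tuple_mover();
    wos[wos_n++] = v;
}

int main(void) {
    int vals[] = {7, 3, 9, 1, 8, 2};
    for (int i = 0; i < 6; i++) insert(vals[i]);
    tuple_mover();
    for (int i = 0; i < ros_n; i++) printf("%d ", ros[i]);  /* 1 2 3 7 8 9 */
    printf("\n");
    return 0;
}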
Batch Update Infrastructure (Hive)
Challenge: HDFS is read-only + large block size
- Solution: merge updates during query processing
YARN possibilities and limitations
- Containers are used to assign:
  – cores
  – RAM
- Limitations:
  – no support for disk I/O or network (thrashing still possible)
  – long-running systems (e.g. a DBMS) may want to adjust cores and RAM over time depending on workload ("elasticity")
Conclusion
- The SQL-on-Hadoop area is very active
  – many open-source and commercial initiatives
- There are many design dimensions
  – All design dimensions of analytical database systems:
    - column storage, compression, vectorization/JIT, MinMax pushdown, partitioning, parallel scaling, update handling, SQL99, ODBC/JDBC APIs, authorization
  – Hadoop-specific design dimensions:
    - HCatalog support, reading from and getting read by other Hadoop tools (/writing to..), file format support, HDFS locality, YARN integration
SQL IN THE CLOUD - BUT NOT ON HADOOP
Amazon Redshift
- Cloud version of ParAccel, a parallel database
  – ParAccel is hard to manage and maintain
  – Redshift invested in simplifying management, using a web interface
- No knobs, kind of elastic; user-defined functions (Python)
- Highly performant, but its storage (local disks) is more expensive than S3
Snowflake
- Brand-new, from-scratch system that runs in AWS – a Redshift competitor
- Stores data on S3 (cheap!) but caches it on local disks for performance
- Highly elastic; supports UDFs in JavaScript and table snapshots ("clone table")
- Automatically recognizes the table structure of JSON documents, making them queryable