Big Data for Data Science: SQL on Big Data (event.cwi.nl/lsde) - PowerPoint PPT Presentation

  1. Big Data for Data Science SQL on Big Data event.cwi.nl/lsde

  2. THE DEBATE: DATABASE SYSTEMS VS MAPREDUCE www.cwi.nl/~boncz/bigdatacourse

  3. A major step backwards? • MapReduce is a step backward in database access – Schemas are good – Separation of the schema from the application is good – High-level access languages are good • MapReduce is a poor implementation – Brute force and only brute force (no indexes, for example) • MapReduce is not novel • MapReduce is missing features – Bulk loader, indexing, updates, transactions… • MapReduce is incompatible with DBMS tools (Michael Stonebraker, Turing Award Winner 2015) www.cwi.nl/~boncz/bigdatacourse Source: Blog post by DeWitt and Stonebraker

  4. Known and unknown unknowns • Databases only help if you know what questions to ask – “Known unknowns” • What if you don’t know what you’re looking for? – “Unknown unknowns” www.cwi.nl/~boncz/bigdatacourse

  5. ETL: redux • Often, with noisy datasets, ETL is the analysis! • Note that ETL necessarily involves brute-force data scans • L, then E and T? (i.e., ELT) www.cwi.nl/~boncz/bigdatacourse

  6. Structure of Hadoop warehouses Don’t normalize! www.cwi.nl/~boncz/bigdatacourse Source: Wikipedia (Star Schema)

  7. Relational databases vs. MapReduce • Relational databases: – Multipurpose: analysis and transactions; batch and interactive – Data integrity via ACID transactions – Lots of tools in software ecosystem (for ingesting, reporting, etc.) – Supports SQL (and SQL integration, e.g., JDBC) – Automatic SQL query optimization • MapReduce (Hadoop): – Designed for large clusters, fault tolerant – Data is accessed in “native format” – Supports many query languages – Programmers retain control over performance – Open source www.cwi.nl/~boncz/bigdatacourse Source: O’Reilly Blog post by Joseph Hellerstein (11/19/2008)

  8. Philosophical differences • Parallel relational databases – Schema on write – Failures are relatively infrequent – “Possessive” of data – Mostly proprietary • MapReduce – Schema on read – Failures are relatively common – In situ data processing – Open source www.cwi.nl/~boncz/bigdatacourse

  9. MapReduce vs. RDBMS: grep SELECT * FROM Data WHERE field LIKE '%XYZ%'; www.cwi.nl/~boncz/bigdatacourse Source: Pavlo et al. (2009) A Comparison of Approaches to Large-Scale Data Analysis. SIGMOD.
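For contrast, the same grep can be sketched as a map-only MapReduce-style job in Python (the function name and sample data are illustrative, not from the Pavlo et al. benchmark):

```python
def grep_mapper(records, pattern="XYZ"):
    """Map-only 'grep': emit each record whose field contains the pattern,
    mirroring SELECT * FROM Data WHERE field LIKE '%XYZ%'."""
    for record in records:
        if pattern in record:  # brute-force scan; no index to help
            yield record

# Illustrative input, not the benchmark data
data = ["abcXYZdef", "no match here", "XYZ up front"]
print(list(grep_mapper(data)))  # ['abcXYZdef', 'XYZ up front']
```

Note that both systems must scan all the data for this query: a leading-wildcard LIKE defeats ordinary value indexes, so even the RDBMS falls back to a full scan here.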

  10. MapReduce vs. RDBMS: select SELECT pageURL, pageRank FROM Rankings WHERE pageRank > X; www.cwi.nl/~boncz/bigdatacourse Source: Pavlo et al. (2009) A Comparison of Approaches to Large-Scale Data Analysis. SIGMOD.

  11. MapReduce vs. RDBMS: aggregation SELECT sourceIP, SUM(adRevenue) FROM UserVisits GROUP BY sourceIP; www.cwi.nl/~boncz/bigdatacourse Source: Pavlo et al. (2009) A Comparison of Approaches to Large-Scale Data Analysis. SIGMOD.
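A hedged sketch of how the same aggregation looks in the MapReduce model, with single-process Python stand-ins for the map and reduce phases (the visit data is made up):

```python
from collections import defaultdict

def map_phase(visits):
    # Emit one (sourceIP, adRevenue) pair per UserVisits record
    for source_ip, ad_revenue in visits:
        yield source_ip, ad_revenue

def reduce_phase(pairs):
    # Sum revenue per sourceIP, as GROUP BY sourceIP with SUM(adRevenue) would
    totals = defaultdict(float)
    for source_ip, ad_revenue in pairs:
        totals[source_ip] += ad_revenue
    return dict(totals)

visits = [("1.2.3.4", 0.5), ("5.6.7.8", 1.0), ("1.2.3.4", 2.0)]
print(reduce_phase(map_phase(visits)))  # {'1.2.3.4': 2.5, '5.6.7.8': 1.0}
```

In a real cluster the framework's shuffle phase groups the pairs by key between the two functions; here the reducer does its own grouping for simplicity.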

  12. MapReduce vs. RDBMS: join www.cwi.nl/~boncz/bigdatacourse Source: Pavlo et al. (2009) A Comparison of Approaches to Large-Scale Data Analysis. SIGMOD.

  13. Why? • Schemas are a good idea – Parsing fields out of flat text files is slow – Schemas define a contract, decoupling logical from physical • Schemas allow for building efficient auxiliary structures – Value indexes, join indexes, etc. • Relational algorithms have been optimised for the underlying system – The system itself has complete control of performance-critical decisions – Storage layout, choice of algorithm, order of execution, etc. www.cwi.nl/~boncz/bigdatacourse

  14. Storage layout: row vs. column stores [Figure: records R1..R4 laid out record-by-record in a row store vs. column-by-column in a column store] www.cwi.nl/~boncz/bigdatacourse

  15. Storage layout: row vs. column stores • Row stores – Easy to modify a record – Might read unnecessary data when processing • Column stores – Only read necessary data when processing – Tuple writes require multiple accesses www.cwi.nl/~boncz/bigdatacourse
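The trade-off on this slide can be made concrete with a toy layout in Python (the three-column schema is hypothetical, chosen only for illustration):

```python
# Row store: one tuple per record, stored together
row_store = [
    ("2015-01-03", "S1", 19.99),
    ("2015-01-04", "S2", 5.49),
]

# Column store: one list per column
column_store = {
    "date":  ["2015-01-03", "2015-01-04"],
    "store": ["S1", "S2"],
    "price": [19.99, 5.49],
}

# A scan like SUM(price) touches only one column in the column store ...
total = sum(column_store["price"])

# ... but appending a record is a single write in the row store,
row_store.append(("2015-01-05", "S1", 7.00))

# versus one access per column in the column store.
for col, val in zip(("date", "store", "price"), ("2015-01-05", "S1", 7.00)):
    column_store[col].append(val)
```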

  16. Advantages of column stores • Read efficiency – If only need to access a few columns, no need to drag around the rest of the values • Better compression – Repeated values appear more frequently in a column than repeated rows appear • Vectorised processing – Leveraging CPU architecture-level support • Opportunities to operate directly on compressed data – For instance, when evaluating a selection; or when projecting a column www.cwi.nl/~boncz/bigdatacourse

  17. Why not in Hadoop? No reason why not: RCFile www.cwi.nl/~boncz/bigdatacourse Source: He et al. (2011) RCFile: A Fast and Space-Efficient Data Placement Structure in MapReduce-based Warehouse Systems. ICDE.

  18. Some small steps forward • MapReduce is a step backward in database access: – Schemas are good ✔ – Separation of the schema from the application is good ✔ – High-level access languages are good ? • MapReduce is a poor implementation: – Brute force and only brute force (no indexes, for example) ✔ • MapReduce is not novel • MapReduce is missing features: – Bulk loader, indexing, updates, transactions… ? • MapReduce is incompatible with DBMS tools www.cwi.nl/~boncz/bigdatacourse Source: Blog post by DeWitt and Stonebraker

  19. MODERN SQL-ON-HADOOP SYSTEMS www.cwi.nl/~boncz/bigdatacourse

  20. Analytical Database Systems • Parallel (MPP): Teradata, Paraccel, Pivotal, Vertica, Redshift, Oracle (IMM), Netezza, DB2-BLU, InfoBright, SQLserver (columnstore), Vectorwise • Open source: MySQL (?), LucidDB, MonetDB www.cwi.nl/~boncz/bigdatacourse

  21. SQL-on-Hadoop Systems • Open Source: Hive (HortonWorks), Impala (Cloudera), Drill (MapR), Presto (Facebook), DataBricks, CitusData • Commercial: HAWQ (Pivotal), Vortex (Actian), Vertica Hadoop (HP), Big SQL (IBM), Splice Machine, InfiniDB Hadoop www.cwi.nl/~boncz/bigdatacourse

  22. Analytical DB engines for Hadoop • Storage: columnar storage + compression; table partitioning / distribution; exploiting correlated data • Query processor: CPU-efficient query engine (vectorized or JIT codegen); many-core ready • System: rich SQL (+authorization+..); batch update infrastructure; scaling with multiple nodes; MetaStore & file formats; YARN & elasticity www.cwi.nl/~boncz/bigdatacourse

  23. Columnar Storage [Figure: querying and inserting on a table (Date, Store, Product, Customer, Price) in row-store vs. column-store layout] • Row store: + easy to add/modify a record, − might read in unnecessary data • Column store: + only need to read in relevant data, − tuple writes require multiple accesses ⇒ suitable for read-mostly, read-intensive, large data repositories www.cwi.nl/~boncz/bigdatacourse

  24. Analytical DB engines for Hadoop • Storage: columnar storage + compression; table partitioning / distribution; exploiting correlated data • Query processor: CPU-efficient query engine (vectorized or JIT codegen); many-core ready • System: rich SQL (+authorization+..); batch update infrastructure; scaling with multiple nodes; MetaStore & file formats; YARN & elasticity www.cwi.nl/~boncz/bigdatacourse

  25. Columnar Compression • Trades I/O for CPU – A winning proposition currently – Even trading RAM bandwidth for CPU wins • 64 core machines starved for RAM bandwidth • Additional column-store synergy: – Column store: data of the same distribution close together • Better compression rates • Generic compression (gzip) vs Domain-aware compression – Synergy with vectorized processing (see later) compress/decompress/execution, SIMD – Can use extra space to store multiple copies of data in different sort orders (see later) www.cwi.nl/~boncz/bigdatacourse

  26. Run-length Encoding • Compress each run of repeated values into a (value, start_pos, run_length) triple [Figure: table (Quarter, Product ID, Price); the Quarter column compresses to (Q1, 1, 300), (Q2, 301, 350), (Q3, 651, 500), (Q4, 1151, 600); Product ID to (1, 1, 5), (2, 6, 2), …, (1, 301, 3), (2, 304, 1), …; the Price column has no long runs and is left uncompressed] www.cwi.nl/~boncz/bigdatacourse
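A minimal run-length encoder/decoder in Python, using the slide's (value, start_pos, run_length) triples with 1-based positions (the sample column is much shorter than the slide's):

```python
def rle_encode(column):
    """Compress a column into (value, start_pos, run_length) triples, 1-based."""
    runs = []
    for pos, value in enumerate(column, start=1):
        if runs and runs[-1][0] == value:
            v, start, length = runs[-1]
            runs[-1] = (v, start, length + 1)  # extend the current run
        else:
            runs.append((value, pos, 1))       # start a new run
    return runs

def rle_decode(runs):
    """Expand the triples back into the original column."""
    return [value for value, _start, length in runs for _ in range(length)]

quarter = ["Q1"] * 3 + ["Q2"] * 2 + ["Q3"] * 4
print(rle_encode(quarter))  # [('Q1', 1, 3), ('Q2', 4, 2), ('Q3', 6, 4)]
assert rle_decode(rle_encode(quarter)) == quarter
```

This is why sorted or clustered columns (like Quarter in the slide's example) compress so well: the longer the runs, the fewer the triples.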

  27. Bitmap Encoding • For each unique value v in column c, create a bit-vector b with b[i] = 1 if c[i] = v • Good for columns with few unique values • Each bit-vector can be further compressed if sparse [Figure: a Product ID column and its bit-vectors for ID 1, ID 2, ID 3] www.cwi.nl/~boncz/bigdatacourse Source: “Integrating Compression and Execution in Column-Oriented Database Systems”, Abadi et al., SIGMOD ’06
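A hedged sketch of the bitmap idea in Python (toy Product ID column, not the slide's full data):

```python
def bitmap_encode(column):
    """One bit-vector per distinct value v: bit i is 1 iff column[i] == v."""
    return {
        value: [1 if v == value else 0 for v in column]
        for value in sorted(set(column))  # sorted only for deterministic output
    }

product_id = [1, 1, 1, 2, 2, 1, 3]
bitmaps = bitmap_encode(product_id)
print(bitmaps[1])  # [1, 1, 1, 0, 0, 1, 0]
```

Each sparse bit-vector can then be compressed further, e.g. by run-length encoding its long runs of zeros.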

  28. Dictionary Encoding • For each unique value create a dictionary entry • Dictionary can be per-block or per-column • Column-stores have the advantage that dictionary entries may encode multiple values at once [Figure: Quarter column encoded as integer codes with dictionary 0: Q1, 1: Q2, 2: Q3, 3: Q4] www.cwi.nl/~boncz/bigdatacourse Source: “Integrating Compression and Execution in Column-Oriented Database Systems”, Abadi et al., SIGMOD ’06
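A minimal dictionary-encoding sketch in Python (codes are assigned in first-seen order here, whereas the slide's dictionary is sorted; the column is illustrative):

```python
def dict_encode(column):
    """Replace each value by a small integer code; return (codes, decode map)."""
    dictionary = {}  # value -> code
    codes = []
    for value in column:
        if value not in dictionary:
            dictionary[value] = len(dictionary)  # next unused code
        codes.append(dictionary[value])
    # Invert for decoding: code -> value
    decode = {code: value for value, code in dictionary.items()}
    return codes, decode

quarter = ["Q1", "Q2", "Q4", "Q1", "Q3", "Q1"]
codes, decode = dict_encode(quarter)
print(codes)  # [0, 1, 2, 0, 3, 0]
assert [decode[c] for c in codes] == quarter
```

With only four distinct quarters, each code fits in two bits, which is where the compression comes from.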
