

  1. Analytical Query Processing
     Marco Serafini, COMPSCI 532, Lecture 7

  2. Announcement
     • Midterm date and location confirmed
     • October 22, 7-9pm, in ILC S331

  3. MapReduce vs. DBMSs

  4. Advantages of DBMSs
     • Abstract data representation
       • Relational model
       • Data storage is delegated to the DBMS
     • Functional query language (SQL)
       • Queries are specified as simple relational operators
       • Actual query execution is delegated to the DBMS…
       • … including parallelism, distribution, pipelining, etc.
     • Support for indexing

  5. Disadvantages of DBMSs
     • SQL is a limited interface for complex analytics
       • E.g. image analysis, creating maps
     • Need to define a schema for the data a priori
     • High cost of loading data and indexing
       • Can be amortized only if the same data and schema are reused
       • Too complex for “one shot” analytics

  6. Advantages of MapReduce
     • Support for arbitrary UDFs
     • Support for a variety of arbitrary data formats
     • Simple API
     • Scalability

  7. Disadvantages of MapReduce
     • Many DBMS optimizations must be reimplemented, for example:
       • Indices
       • Query execution plans (logical + physical)
       • Column-based storage
       • Data format specifications (ProtoBuf)
       • Support for updates
     • Several efforts aim at closing this gap for analytics

  8. Data Analytics

  9. In Situ Analytics
     • Data is dumped on GFS/HDFS (data lake)
     • Some of this data is relational
     • Several systems execute relational queries directly on HDFS data (sketch below)
       • SQL-like language
       • Query optimization
       • Columnar data representation
     • Can build on top of MR/Spark (e.g. Hive, SparkSQL) or not (e.g. Dremel, Impala, Presto)
     • We will discuss both classes
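A minimal sketch of the in-situ idea, using PySpark and a hypothetical hdfs:///logs Parquet directory: the data stays where it was dumped in the lake, gets exposed under a SQL-like interface, and is queried in place with no separate load step.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("in-situ-example").getOrCreate()

# Read files already sitting on HDFS; Parquet is a columnar format.
logs = spark.read.parquet("hdfs:///logs")

# Expose the data under a table name and query it with SQL, in place.
logs.createOrReplaceTempView("logs")
top_urls = spark.sql("""
    SELECT url, COUNT(*) AS hits
    FROM logs
    GROUP BY url
    ORDER BY hits DESC
    LIMIT 10
""")
top_urls.show()
```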

  10. Analytical Queries
     • Long-running, complex queries
       • Often aggregates
     • Run on read-only data or on snapshots of dynamic data
     • Data characteristics
       • Tuples (rows) have many possible attributes (columns)
       • A row typically has only a subset of its attributes set

  11. Star Schema: Facts and Dimensions
     • Popular schema for analytics / data warehousing
       • Many others exist!
     • At the center is a large fact table
     • Foreign-key references to small dimension tables (example query sketched below)
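A minimal sketch of a typical star-schema query, with hypothetical sales (fact), product and store (dimension) tables and made-up column names; it is written with PySpark DataFrames, which the lecture introduces later. The fact table references each dimension by foreign key, and the query joins and aggregates.

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("star-schema-example").getOrCreate()

sales = spark.read.parquet("hdfs:///warehouse/sales")      # large fact table
product = spark.read.parquet("hdfs:///warehouse/product")  # small dimension
store = spark.read.parquet("hdfs:///warehouse/store")      # small dimension

revenue = (
    sales
    .join(product, "product_id")   # foreign key into the product dimension
    .join(store, "store_id")       # foreign key into the store dimension
    .groupBy("category", "region")
    .agg(F.sum("amount").alias("revenue"))
)
revenue.show()
```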

  12. (untitled figure slide)

  13. Dremel
     • In-situ analytics
     • Independent query executor (not built on top of MR)
     • Uses a columnar store

  14. Data Model: Column Families
     • Also called column groups or nested columns
     • Common to many systems, e.g. Cassandra, HBase
     • (Figure: example nested record, with Language nested inside the repeated Name field and record boundaries marked; decomposition sketched below)
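A minimal sketch of how a nested, repeated record decomposes into per-field column streams. This is not Dremel's actual encoding (which also stores repetition and definition levels); the field names only follow the spirit of the Dremel paper's Document example, where Name is repeated and Language is nested and repeated inside Name.

```python
record = {
    "DocId": 10,
    "Name": [
        {"Url": "http://A",
         "Language": [{"Code": "en-us", "Country": "us"},
                      {"Code": "en"}]},
        {"Url": "http://B"},
    ],
}

def flatten(value, path, out):
    """Recursively append leaf values to one column list per field path."""
    if isinstance(value, dict):
        for key, child in value.items():
            flatten(child, path + (key,), out)
    elif isinstance(value, list):
        for child in value:
            flatten(child, path, out)
    else:
        out.setdefault(".".join(path), []).append(value)

columns = {}
flatten(record, (), columns)
# columns == {'DocId': [10],
#             'Name.Url': ['http://A', 'http://B'],
#             'Name.Language.Code': ['en-us', 'en'],
#             'Name.Language.Country': ['us']}
```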

  15. Assembling a Row
     • A finite state machine reads the column streams and reassembles records (sketch below)
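A highly simplified sketch of FSM-style record assembly from column streams. Here each column value is paired with a repetition level: 0 means "starts a new record", greater than 0 means "continues the current repeated field". Real Dremel also uses definition levels and generates the state machine from the full schema; this toy version hard-codes two fields (DocId and Name.Url) just to show the two states and the transition.

```python
columns = {
    "DocId":    [(0, 10), (0, 20)],
    "Name.Url": [(0, "http://A"), (1, "http://B"), (0, "http://C")],
}

def assemble(columns):
    readers = {f: iter(vals) for f, vals in columns.items()}
    pending = {f: next(r, None) for f, r in readers.items()}
    records = []
    while pending["DocId"] is not None:
        # State 1: a DocId value starts a new record.
        _, doc_id = pending["DocId"]
        pending["DocId"] = next(readers["DocId"], None)
        record = {"DocId": doc_id, "Name.Url": []}
        # State 2: consume the Name.Url values belonging to this record:
        # the first one (repetition level 0) plus any continuations (> 0).
        first = True
        while pending["Name.Url"] is not None and (first or pending["Name.Url"][0] > 0):
            record["Name.Url"].append(pending["Name.Url"][1])
            pending["Name.Url"] = next(readers["Name.Url"], None)
            first = False
        records.append(record)
    return records

# assemble(columns) ==
#  [{'DocId': 10, 'Name.Url': ['http://A', 'http://B']},
#   {'DocId': 20, 'Name.Url': ['http://C']}]
```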

  16. Pros and Cons of the Columnar Model
     • Pros
       • Compression: values within a column are uniform (see the sketch below)
       • Less data to scan on projections (which are common)
     • Cons
       • Additional CPU load to decompress columns and rebuild rows
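A minimal sketch of why columnar layouts compress well: values within one column are uniform, so even simple run-length encoding collapses long runs (real systems such as Parquet combine dictionary, RLE and bit-packing codecs). The decode step is the extra CPU cost the Cons bullet refers to.

```python
from itertools import groupby

country_column = ["US", "US", "US", "DE", "DE", "US", "US"]

def rle_encode(values):
    """Encode a column as (value, run_length) pairs."""
    return [(v, sum(1 for _ in run)) for v, run in groupby(values)]

def rle_decode(runs):
    """Rebuild the column; this is the CPU paid at scan time."""
    return [v for v, n in runs for _ in range(n)]

encoded = rle_encode(country_column)   # [('US', 3), ('DE', 2), ('US', 2)]
assert rle_decode(encoded) == country_column
```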

  17. Query Execution

  18. SparkSQL: Spark + DBMS
     • Extends Spark with
       • Simple, high-level SQL-like operators
       • Query optimization
     • No need to transfer data across systems
       • ETL, query processing, and complex analytics in one system

  19. Architecture

  20. DataFrames
     • Collection of rows with a homogeneous schema
       • Like a table in a DBMS
     • Can be manipulated like an RDD
     • DataFrame operations
       • Similar to Python Pandas or R data frames
       • Evaluated lazily (query planning is postponed)
       • Can optimize across multiple queries (sketch below)
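A minimal sketch of DataFrame operations in PySpark, with made-up data and column names. Transformations such as filter and groupBy only extend a logical plan; nothing executes until an action (show, collect) forces it, which is what lets the optimizer see the whole query at once.

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("dataframe-example").getOrCreate()

# A DataFrame: rows with a homogeneous schema, like a table.
users = spark.createDataFrame(
    [("alice", 25, "NYC"), ("bob", 41, "LA"), ("carol", 29, "NYC")],
    ["name", "age", "city"],
)

young_by_city = (
    users
    .filter(F.col("age") < 30)            # lazy: just extends the plan
    .groupBy("city")
    .agg(F.count("*").alias("n_users"))   # still lazy
)

young_by_city.show()                      # action: plan is optimized and run
```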

  21. SparkSQL Query Execution

  22. Advantages
     • Relational structure enables query optimization
     • In-memory caching uses a columnar representation
       • Better compression
     • Mix SQL-like operators and arbitrary code (sketch below)
       • More flexible than UDFs in DBMSs
     • Can optimize across multiple SQL operations
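A minimal sketch of mixing relational operators with arbitrary Python code through a UDF, again with hypothetical column names. The UDF body is opaque to the optimizer, but the surrounding relational operators (filter, groupBy, agg) remain visible to it.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf, col, avg
from pyspark.sql.types import StringType

spark = SparkSession.builder.getOrCreate()
events = spark.read.parquet("hdfs:///data/events")

@udf(returnType=StringType())
def domain_of(url):
    # Arbitrary Python logic that would be awkward to express in SQL.
    return url.split("/")[2] if url and "://" in url else "unknown"

result = (
    events
    .filter(col("status") == 200)                     # relational
    .withColumn("domain", domain_of(col("url")))      # arbitrary code
    .groupBy("domain")
    .agg(avg("latency_ms").alias("avg_latency"))      # relational
)
result.show()
```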

  23. Catalyst
     • Query optimizer of SparkSQL
     • Rule-based optimization
       • Rule: find a pattern and transform it
       • Used for both logical and physical plans
       • Rules can be customized
     • Code generation
       • Directly outputs bytecode (as opposed to interpreting a plan)
       • Much more CPU efficient
     • Flexible data sources
       • Can change the physical representation of DataFrames
       • Still uses the optimizer

  24. Catalyst: Rule-Based Optimization
     • Apply rules to subtrees until a fixed point is reached (analogy sketched below)
     • (Figure: execution tree and transformation rules)
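A minimal sketch of the rule-to-fixed-point idea in plain Python (Catalyst itself expresses rules as Scala pattern matches over plan trees): a rule matches a subtree and returns a rewritten one, the rule is applied over the whole tree, and the process repeats until nothing changes. Constant folding stands in for a real optimizer rule.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Lit:            # literal constant node
    value: int

@dataclass(frozen=True)
class Add:            # x + y expression node
    left: object
    right: object

def constant_fold(node):
    """Rule: Add(Lit a, Lit b) -> Lit(a + b); otherwise leave the node alone."""
    if isinstance(node, Add) and isinstance(node.left, Lit) and isinstance(node.right, Lit):
        return Lit(node.left.value + node.right.value)
    return node

def transform(node, rule):
    """Apply a rule bottom-up over the whole tree."""
    if isinstance(node, Add):
        node = Add(transform(node.left, rule), transform(node.right, rule))
    return rule(node)

def to_fixed_point(node, rule):
    """Keep applying the rule until the tree stops changing."""
    while True:
        new = transform(node, rule)
        if new == node:
            return new
        node = new

expr = Add(Add(Lit(1), Lit(2)), Add(Lit(3), Lit(4)))
print(to_fixed_point(expr, constant_fold))   # Lit(value=10)
```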

  25. Catalyst: Code Generation
     • Faster than interpreting a physical plan (analogy sketched below)
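A minimal Python analogy for why code generation beats interpretation (Catalyst actually emits JVM bytecode via Scala quasiquotes): instead of walking an expression tree for every row, generate source once, compile it, and call the compiled predicate per row.

```python
# Expression tree for: age > 21 AND country == "US"  (hypothetical columns)
tree = ("and", (">", "age", 21), ("==", "country", "US"))

def interpret(node, row):
    """Interpret the tree for one row: per-row recursion and branching."""
    op = node[0]
    if op == "and":
        return interpret(node[1], row) and interpret(node[2], row)
    if op == ">":
        return row[node[1]] > node[2]
    if op == "==":
        return row[node[1]] == node[2]
    raise ValueError(op)

def generate(node):
    """Emit a Python expression string for the whole tree."""
    op = node[0]
    if op == "and":
        return f"({generate(node[1])} and {generate(node[2])})"
    if op in (">", "=="):
        return f"(row[{node[1]!r}] {op} {node[2]!r})"
    raise ValueError(op)

compiled = eval("lambda row: " + generate(tree))   # compile once, reuse per row

row = {"age": 30, "country": "US"}
assert interpret(tree, row) == compiled(row) == True
```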
