Impala A Modern, Open Source SQL Engine for Hadoop Yogesh - PowerPoint PPT Presentation

Impala A Modern, Open Source SQL Engine for Hadoop Yogesh Chockalingam

Agenda • Introduction • Architecture • Front End • Back End • Evaluation • Comparison with Spark SQL

Introduction

Why not use Hive or HBase? • Hive is a data warehousing tool • HBase is a NoSQL database that built on top of Hadoop and uses runs on top of HDFS that Hive Query Language(HQL) for provides real-time read/write querying data stored in a access. Hadoop cluster. • HQL automatically translates queries into MapReduce jobs. • Hive doesn’t support transactions.

Impala • General purpose SQL query engine: • Works across analytical and transactional workloads • High performance: • Execution engine written in C++ • Runs directly within Hadoop • Does not use MapReduce • MPP database support: • Multi-user workloads

Creating tables CREATE TABLE T (...) PARTITIONED BY (day int, month int) LOCATION '<hdfs-path>' STORED AS PARQUET; For a partitioned table, data is placed in subdirectories whose paths reflect the partition columns' values. For example, for day 17, month 2 of table T, all data files would be located in <root>/day=17/month=2/

Metadata • Table metadata including the table definition, column names, data types, schema etc. are stored in HCatalog.

INSERT / UPDATE / DELETE • The user can add data to a table simply by copying/moving data files into the directory! • Does NOT support UPDATE and DELETE . • Limitation of HDFS, as it does not support an in-place update. • Recompute the values and replace the data in the partitions. • COMPUTE STATS <table> after inserts. • Those statistics will subsequently be used during query optimization.

Architecture

I: Impala Daemon Impala daemon service is dually responsible for: 1. Accepting queries from client processes and orchestrating their execution across the Catalog Statestore cluster. In this role it’s called the query coordinator. 2. Executing individual query fragments on behalf of other Impala daemons. The Impala daemons are in constant • . . . . . . communication with the statestore , to confirm which nodes are healthy and can accept new work. They also receive broadcast messages from the • catalog daemon via the statestore , to keep track Impala Daemon of metadata changes.

II: Statestore Daemon • Handles cluster membership information. • Periodically sends two kinds of messages to Impala daemons: • Topic update: The new changes made since the last topic update message • Keepalive: A heartbeat mechanism • If an Impala daemon goes offline, the statestore informs all the other Impala daemons so that future queries can avoid making requests to the unreachable node.

III: Catalog Daemon • Impala's catalog service serves catalog metadata to Impala daemons via the statestore broadcast mechanism, and executes DDL operations on behalf of Impala daemons. • The catalog service pulls information from Hive Metastore and aggregates that information into an Impala-compatible catalog structure. • This structure is then passed on to the statestore daemon which communicates with the Impala daemons.

1. Request arrives from client via Thrift API SQL App Hive Metastore HDFS NN Statestore ODBC SQL request Impala Daemon Impala Daemon Impala Daemon

2. Planner turns request into collections of plan fragments. Coordinator initiates execution on remote Impala daemons. Hive Metastore HDFS NN SQL App Statestore ODBC

3. Intermediate results are streamed between Impala daemons. Query results are streamed back to client. SQL App Hive Metastore HDFS NN Statestore ODBC Query Results Query Planner Query Coordinator Query Executor HDFS DN HBase

Front-End

Query Plans • The Impala frontend is responsible for compiling SQL text into query plans executable by the Impala backends. • The query compilation process proceeds as follows: • Query parsing • Semantic analysis • Query planning/optimization • Query planning 1. Single node planning 2. Plan parallelization and fragmentation

Query Planning: Single Node • In the first phase, the parse tree is translated into a non-executable single-node plan tree. E.g. Query joining two HDFS tables (t1, t2) and one HBase table (t3) followed by an aggregation and order by with limit (top-n). SELECT t1.custid, SUM(t2.revenue) AS revenue Agg FROM LargeHdfsTable t1 JOIN LargeHdfsTable t2 ON (t1.id1 = t2.id) HashJoin Scan: t3 JOIN SmallHbaseTable t3 ON (t1.id2 = t3.id) WHERE t3.category = 'Online' HashJoin Scan: t2 GROUP BY t1.custid ORDER BY revenue DESC LIMIT 10; Scan: t1

Query Planning: Distributed Nodes • The second planning phase takes the single-node plan as input and produces a distributed execution plan. Goal: • To minimize data movement • Maximize scan locality as remote reads are considerably slower than local ones. • Cost--based decision based on column stats/estimated cost of data transfers • Decide parallel join strategy: • Broadcast Join : Join is collocated with left-hand side input; right--hand side table is broadcast to each node executing join. Preferred for small right-hand side input. • Partitioned Join : Both tables are hash-partitioned on join columns. Preferred for large joins.

Back-End

Executing the Query • Impala's backend receives query fragments from the front-end and is responsible for their execution. • High performance: • Written in C++ for minimal execution overhead • Internal in-memory tuple format puts fixed-width data at fixed offsets • Uses intrinsic/special CPU instructions for tasks like text parsing and CRC computation. • Runtime code generation for “big loops”

Runtime Code Generation Impala uses runtime code generation to produce query-specific versions of functions that are critical to performance. • For example, to convert every record to Impala’s in-memory tuple format: • Known at query compile time: # of tuples in a batch, tuple layout, column types, etc. • Generate at compile time: unrolled loop that inlines all function calls, dead code elimination and minimizes branches. • Code generated using LLVM

Evaluation

Comparison of query response times on single-user runs.

Comparison of query response times and throughput on multi-user runs.

Comparison of the performance of Impala and a commercial analytic RDBMS. https://github.com/cloudera/impala-tpcds-kit

Comparison with Spark SQL

• Impala is faster than Spark SQL as it is an engine designed especially for the mission of interactive SQL over HDFS, and it has architecture concepts that helps it achieve that. • For example the Impala ‘always-on’ daemons are up and waiting for queries 24/7 — something that is not part of Spark SQL.

Thank you!

Impala A Modern, Open Source SQL Engine for Hadoop Yogesh - PowerPoint PPT Presentation

Impala A Modern, Open Source SQL Engine for Hadoop Yogesh Chockalingam Agenda Introduction Architecture Front End Back End Evaluation Comparison with Spark SQL Introduction Why not use Hive or HBase? Hive is a data

Impala: A Modern, Open-Source SQL Engine for Hadoop Marcel Kornacker Cloudera, Inc. Agenda

Impala Platinum Limited Impala Platinum Limited Rock Engineering Rock Engineering The

Impala Platinum Holdings Limited Interim results 1 March 2018 Impala Rustenburg 16 Shaft

IMP IMPO IMP - Impala Platinum Holdings - Consolidated interim results for the six

Impala mine approach to Covid-19 17/07/2020 Screening and Tracing Work Flowsheet 2 Employee

Cheetah By Bre'Aziah Habitat Asia Africa Europe Savanna's Cheetah food Antelope Impala's

Impala Platinum Holdings Limited Annual results 14 September 2017 Forward looking statement 2

Using Pig, Hive, and Impala with Hadoop Jay Urbain,

15-721 ADVANCED DATABASE SYSTEMS Lecture #25 End of Semester + Impala/Kudu Tech Talk Andy

Motorcycle Safety in Pinellas County, Florida Chanyoung Lee, Ph.D., CUTR/USF Edie Peters, FDOT

Motorcycle Safety in Florida (Hernando and Citrus Counties) Chanyoung Lee, Ph.D., CUTR/USF Edie

( Aepyceros melampus ) R.A. Engels & Prof. L.C. Hoffman Introduction Game farming - a

SUBTATIONS (APPROX 19KMS) AND MTUBATUBA NSELENI SUBSTATIONS (APPROX 42KMS), IN THE

Thabo Dube MHSC CEO PRESENTATION CONTENTS MHSC History & Journey to Zero Harm

Date: WEDNESDAY, 01 July 2020 Time: 11H00 AM Venue: RAND WATER, HEAD OFFICE 522 IMPALA ROAD

NEW MEXICO DEPARTMENT OF CULTURAL AFFAIRS Presentatjon to the Legislatjve Finance Commituee July

Field of a moving particle in a dielectric The E- and B-field associated to a uniformly

Frequency, Edges, and Corners Various slides from previous courses by: D.A. Forsyth (Berkeley /

Computer Science & Engineering 150A Introduction Problem Solving Using Computers Structures

Complex Structures in Algebra, Geometry, Topology, Analysis and Dynamical Systems Zalman Balanov

V-trace, PopArt Normalization, Partially Observable MDPs Milan Straka January 7, 2019 Charles

of Big Data 10/01/2018 25 Storage of Big Data Data is growing faster than Moores Law Too

Big Data Compete by asking bigger questions $$$... $ ??? SLA Yaaaay Hadoop to Save the

DATABASE SYSTEM IMPLEMENTATION GT 4420/6422 // SPRING 2019 // @JOY_ARULRAJ LECTURE #13: QUERY

Impala A Modern, Open Source SQL Engine for Hadoop Yogesh - PowerPoint PPT Presentation

Impala A Modern, Open Source SQL Engine for Hadoop Yogesh Chockalingam Agenda Introduction Architecture Front End Back End Evaluation Comparison with Spark SQL Introduction Why not use Hive or HBase? Hive is a data

Impala: A Modern, Open-Source SQL Engine for Hadoop Marcel Kornacker Cloudera, Inc. Agenda

Impala Platinum Limited Impala Platinum Limited Rock Engineering Rock Engineering The

Impala Platinum Holdings Limited Interim results 1 March 2018 Impala Rustenburg 16 Shaft

IMP IMPO IMP - Impala Platinum Holdings - Consolidated interim results for the six

Impala mine approach to Covid-19 17/07/2020 Screening and Tracing Work Flowsheet 2 Employee

Cheetah By Bre'Aziah Habitat Asia Africa Europe Savanna's Cheetah food Antelope Impala's

Impala Platinum Holdings Limited Annual results 14 September 2017 Forward looking statement 2

Using Pig, Hive, and Impala with Hadoop Jay Urbain,

15-721 ADVANCED DATABASE SYSTEMS Lecture #25 End of Semester + Impala/Kudu Tech Talk Andy

Motorcycle Safety in Pinellas County, Florida Chanyoung Lee, Ph.D., CUTR/USF Edie Peters, FDOT

Motorcycle Safety in Florida (Hernando and Citrus Counties) Chanyoung Lee, Ph.D., CUTR/USF Edie

( Aepyceros melampus ) R.A. Engels &amp; Prof. L.C. Hoffman Introduction Game farming - a

SUBTATIONS (APPROX 19KMS) AND MTUBATUBA NSELENI SUBSTATIONS (APPROX 42KMS), IN THE

Thabo Dube MHSC CEO PRESENTATION CONTENTS MHSC History &amp; Journey to Zero Harm

Date: WEDNESDAY, 01 July 2020 Time: 11H00 AM Venue: RAND WATER, HEAD OFFICE 522 IMPALA ROAD

NEW MEXICO DEPARTMENT OF CULTURAL AFFAIRS Presentatjon to the Legislatjve Finance Commituee July

Field of a moving particle in a dielectric The E- and B-field associated to a uniformly

Frequency, Edges, and Corners Various slides from previous courses by: D.A. Forsyth (Berkeley /

Computer Science &amp; Engineering 150A Introduction Problem Solving Using Computers Structures

Complex Structures in Algebra, Geometry, Topology, Analysis and Dynamical Systems Zalman Balanov

V-trace, PopArt Normalization, Partially Observable MDPs Milan Straka January 7, 2019 Charles

of Big Data 10/01/2018 25 Storage of Big Data Data is growing faster than Moores Law Too

Big Data Compete by asking bigger questions $$$... $ ??? SLA Yaaaay Hadoop to Save the

DATABASE SYSTEM IMPLEMENTATION GT 4420/6422 // SPRING 2019 // @JOY_ARULRAJ LECTURE #13: QUERY

( Aepyceros melampus ) R.A. Engels & Prof. L.C. Hoffman Introduction Game farming - a

Thabo Dube MHSC CEO PRESENTATION CONTENTS MHSC History & Journey to Zero Harm

Computer Science & Engineering 150A Introduction Problem Solving Using Computers Structures