

SLIDE 1

SamzaSQL

Scalable Fast Data Management with Streaming SQL

Milinda Pathirage (IU), Julian Hyde (Hortonworks), Yi Pan (LinkedIn), Beth Plale (IU)

School of Informatics and Computing INDIANA UNIVERSITY

SLIDE 2

Fast Data

Data has to be processed as it arrives, so that we can react immediately to changing conditions.

BIG DATA ISN’T JUST BIG; IT’S ALSO FAST.

Big data is often data that is generated at incredible speeds, such as click-stream data, financial ticker data, log aggregation, and sensor data.

John Hugg, "Fast data: The next step after big data"

SLIDE 3

Applications

  • Real-time distributed tracing for website performance and efficiency optimizations
  • Calculating click-through rates
  • Data stream enrichment:
      • Counting page views by a group key, where the group key is retrieved from a key/value store
      • Enriching data streams related to user activities with user information such as location and company
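
As an illustration of the first enrichment pattern, a minimal sketch in Java: an in-memory map stands in for Samza's local key/value store, and the page-to-group mappings are hypothetical sample data, not anything from the paper.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of the "count page views by group key" pattern: each page-view
// event triggers a group-key lookup in the store, then a counter update.
public class PageViewCounter {
    final Map<String, String> groupKeyStore = new HashMap<>(); // page -> group key
    final Map<String, Long> counts = new HashMap<>();          // group key -> count

    PageViewCounter() {
        // Hypothetical mappings standing in for the key/value store contents.
        groupKeyStore.put("/home", "landing");
        groupKeyStore.put("/jobs", "careers");
        groupKeyStore.put("/about", "landing");
    }

    // For each page-view event, look up the group key and increment its count.
    void process(String page) {
        String group = groupKeyStore.getOrDefault(page, "unknown");
        counts.merge(group, 1L, Long::sum);
    }

    public static void main(String[] args) {
        PageViewCounter task = new PageViewCounter();
        for (String page : List.of("/home", "/jobs", "/about", "/home")) {
            task.process(page);
        }
        System.out.println(task.counts.get("landing")); // 3 views in the "landing" group
        System.out.println(task.counts.get("careers")); // 1 view
    }
}
```

In the real deployment the store lookup would hit Samza's fault-tolerant local storage rather than a plain HashMap.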

At the time of writing, LinkedIn uses 90 Kafka clusters deployed across 1,500 nodes to process 150 TB of input data daily.

SLIDE 4

Lambda Architecture (LA)

LA is a technology-agnostic data processing architecture that attempts to balance latency, accuracy, throughput, and fault tolerance by providing a unified serving layer on top of batch and stream processing sub-systems.

From: https://www.oreilly.com/ideas/questioning-the-lambda-architecture

SLIDE 5

Kappa Architecture (KA)

KA is a simplification of the Lambda Architecture that uses an append-only immutable log as the canonical data store; batch processing is replaced by stream replay.

From: https://www.oreilly.com/ideas/questioning-the-lambda-architecture

SLIDE 6

MOTIVATION

SLIDE 7

Programming APIs for LA and KA

Summingbird is a well-known abstraction for writing LA-style applications. KA-style applications are mainly written using the stateful stream processing APIs provided by frameworks such as Apache Samza.

Limitations

  • Need to maintain two complex distributed systems
  • Users need to understand complex programming abstractions
  • Long turnaround times

SLIDE 8

Summingbird

WORD COUNT

def wordCount[P <: Platform[P]](source: Producer[P, String],
                                store: P#Store[String, Long]) =
  source
    .flatMap { sentence => toWords(sentence).map(_ -> 1L) }
    .sumByKey(store)

More examples at https://github.com/twitter/summingbird

SLIDE 9

Samza API

WINDOW AGGREGATION

public class WikipediaStatsStreamTask
    implements StreamTask, InitableTask, WindowableTask {
  ...
  private KeyValueStore<String, Integer> store;

  public void init(Config config, TaskContext context) {
    this.store = (KeyValueStore<String, Integer>)
        context.getStore("wikipedia-stats");
  }

  @Override
  public void process(IncomingMessageEnvelope envelope,
      MessageCollector collector, TaskCoordinator coordinator) {
    Map<String, Object> edit = (Map<String, Object>) envelope.getMessage();
    ...
  }

  @Override
  public void window(MessageCollector collector, TaskCoordinator coordinator) {
    ...
    collector.send(new OutgoingMessageEnvelope(
        new SystemStream("kafka", "wikipedia-stats"), counts));
    ...
  }
}

SLIDE 10

SQL for Big Data

There are several well-known SQL-on-Hadoop solutions, and most organizations that use Hadoop use one or more of them:

  • Apache Hive
  • Presto
  • Apache Drill
  • Apache Impala
  • Apache Kylin
  • Apache Tajo
  • Apache Phoenix

SLIDE 11

Motivating Research Questions

  • Can the same low barrier to entry and clear semantics of SQL be extended to queries that execute simultaneously over data streams (data in motion) and tables (data at rest)?
  • Can this be done with minimal and well-founded extensions to SQL?
  • And with minimal overhead over a non-SQL-based LA/KA?

SLIDE 12

SAMZASQL

SLIDE 13

Streaming SQL - Data Model

  • Stream: A stream S is a possibly indefinite partitioned sequence of temporally-defined elements, where an element is a tuple belonging to the schema of S.
  • Partition: A partition is a time-ordered, immutable sequence of elements existing within a single stream.
  • Relation: Analogous to a relation/table in relational databases, a relation R is a bag of tuples belonging to the schema of R.
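
The three definitions can be sketched as plain data types. This is only an illustration; the field names (rowtime, productId, units) follow the Orders schema used on later slides, and the sample values are hypothetical.

```java
import java.util.List;

// Minimal sketch of the data model: a stream is a set of partitions, each
// partition a time-ordered immutable sequence of tuples (schema elements).
public class DataModel {
    record Tuple(long rowtime, String productId, int units) {}  // element of a stream
    record Partition(List<Tuple> elements) {}                   // time-ordered, immutable
    record Stream(String name, List<Partition> partitions) {}   // possibly indefinite

    public static void main(String[] args) {
        Partition p0 = new Partition(List.of(
            new Tuple(1000L, "p1", 10), new Tuple(2000L, "p2", 30)));
        Partition p1 = new Partition(List.of(new Tuple(1500L, "p1", 5)));
        Stream orders = new Stream("Orders", List.of(p0, p1));
        // Note: time ordering holds within a partition, not across partitions.
        System.out.println(orders.partitions().size()); // 2
    }
}
```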

SLIDE 14

Streaming SQL - Continuous Queries

SAMZASQL

SELECT STREAM rowtime, productId, units FROM Orders WHERE units > 25

CQL

SELECT ISTREAM rowtime, productId, units FROM Orders WHERE units > 25;
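
The semantics of the filter query above can be mimicked in plain Java: each arriving Orders tuple is emitted only if the predicate holds. The sample orders are hypothetical; SamzaSQL itself compiles the query into a Samza job rather than anything like this loop.

```java
import java.util.List;

// Rough Java equivalent of: SELECT STREAM ... FROM Orders WHERE units > 25
public class FilterQuery {
    record Order(long rowtime, String productId, int units) {}

    // Emit only the tuples whose units exceed 25, preserving arrival order.
    static List<Order> filter(List<Order> arriving) {
        return arriving.stream().filter(o -> o.units() > 25).toList();
    }

    public static void main(String[] args) {
        List<Order> out = filter(List.of(
            new Order(1L, "p1", 10),
            new Order(2L, "p2", 40),
            new Order(3L, "p3", 30)));
        System.out.println(out.size()); // 2 orders pass the predicate
    }
}
```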

SLIDE 15

Streaming SQL - Window Aggregations

SAMZASQL

SELECT STREAM TUMBLE_END (rowtime, INTERVAL '1' HOUR) AS rowtime, productId, COUNT(*) AS c, SUM(units) AS units FROM Orders GROUP BY TUMBLE (rowtime, INTERVAL '1' HOUR), productId

CQL

SELECT ISTREAM ... AS rowtime, productId, COUNT(*) AS c, SUM(units) AS units FROM Orders[Range '1' HOUR, Slide '1' HOUR] GROUP BY productId;
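
A sketch of what the TUMBLE aggregation computes, assuming millisecond rowtimes: each tuple falls into exactly one non-overlapping one-hour window, and units are summed per (window end, product), mirroring GROUP BY TUMBLE(rowtime, INTERVAL '1' HOUR), productId. The sample orders are hypothetical.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy tumbling-window aggregation over an Orders stream.
public class TumblingWindow {
    record Order(long rowtimeMillis, String productId, int units) {}

    static final long HOUR = 3_600_000L;

    // Key "windowEnd|productId" -> SUM(units).
    static Map<String, Integer> aggregate(List<Order> orders) {
        Map<String, Integer> sums = new TreeMap<>();
        for (Order o : orders) {
            long windowEnd = (o.rowtimeMillis() / HOUR + 1) * HOUR; // TUMBLE_END
            sums.merge(windowEnd + "|" + o.productId(), o.units(), Integer::sum);
        }
        return sums;
    }

    public static void main(String[] args) {
        Map<String, Integer> sums = aggregate(List.of(
            new Order(10L, "p1", 5),
            new Order(20L, "p1", 7),
            new Order(HOUR + 1, "p1", 2))); // falls in the next window
        System.out.println(sums); // two windows for p1: sums 12 and 2
    }
}
```

SamzaSQL implements this with Samza's fault-tolerant local storage rather than an in-memory map, so window state survives task failures.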

SLIDE 16

Streaming SQL - Sliding Windows

SAMZASQL

SELECT STREAM rowtime, productId, units, SUM(units) OVER (PARTITION BY productId ORDER BY rowtime RANGE INTERVAL '1' HOUR PRECEDING) unitsLastHour FROM Orders;

CQL

SELECT ISTREAM rowtime, productId, units, SUM(units) AS unitsLastHour FROM Orders[Range '1' HOUR] GROUP BY productId;
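
A sketch of the sliding-window semantics, assuming millisecond rowtimes: each arriving order is emitted together with SUM(units) over the same product's orders in the preceding hour (the RANGE bound includes the current row). This is a deliberately naive O(n²) toy to show the semantics, not SamzaSQL's actual operator; the sample orders are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Toy sliding-window sum: unitsLastHour for each arriving Orders tuple.
public class SlidingWindow {
    record Order(long rowtimeMillis, String productId, int units) {}

    static final long HOUR = 3_600_000L;

    // Returns unitsLastHour for each input order, in arrival order.
    static List<Integer> unitsLastHour(List<Order> orders) {
        List<Integer> out = new ArrayList<>();
        List<Order> seen = new ArrayList<>();
        for (Order o : orders) {
            seen.add(o);
            int sum = 0;
            for (Order prev : seen) {
                if (prev.productId().equals(o.productId())
                        && prev.rowtimeMillis() >= o.rowtimeMillis() - HOUR) {
                    sum += prev.units();
                }
            }
            out.add(sum);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Integer> out = unitsLastHour(List.of(
            new Order(0L, "p1", 10),
            new Order(1000L, "p1", 5),
            new Order(2 * HOUR, "p1", 1))); // earlier orders have left the window
        System.out.println(out); // [10, 15, 1]
    }
}
```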

SLIDE 17

Streaming SQL - Window Joins

SAMZASQL

SELECT STREAM GREATEST(PacketsR1.rowtime, PacketsR2.rowtime) AS rowtime, PacketsR1.sourcetime, PacketsR1.packetId, PacketsR2.rowtime - PacketsR1.rowtime AS timeToTravel FROM PacketsR1 JOIN PacketsR2 ON PacketsR1.rowtime BETWEEN PacketsR2.rowtime - INTERVAL '2' SECOND AND PacketsR2.rowtime + INTERVAL '2' SECOND AND PacketsR1.packetId = PacketsR2.packetId
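
The join condition above can be illustrated with a toy matcher: two packet events join when they share a packetId and their rowtimes lie within two seconds of each other, producing the time the packet took between the two observation points. The sample packets are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Toy window join between two packet streams on packetId within +/- 2s.
public class WindowJoin {
    record Packet(long rowtimeMillis, String packetId) {}

    // Emits rowtime differences for pairs satisfying the join condition.
    static List<Long> timeToTravel(List<Packet> r1, List<Packet> r2) {
        List<Long> out = new ArrayList<>();
        for (Packet a : r1) {
            for (Packet b : r2) {
                if (a.packetId().equals(b.packetId())
                        && Math.abs(b.rowtimeMillis() - a.rowtimeMillis()) <= 2000) {
                    out.add(b.rowtimeMillis() - a.rowtimeMillis());
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Long> travel = timeToTravel(
            List.of(new Packet(1000L, "x"), new Packet(2000L, "y")),
            List.of(new Packet(1500L, "x"), new Packet(9000L, "y")));
        System.out.println(travel); // [500] -- "y" arrived outside the 2s window
    }
}
```

A real streaming join only buffers each side for the window's duration instead of scanning full histories, which is what makes the bounded time window essential.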

SLIDE 18

Streaming SQL - Stream-to-Relation Joins

SAMZASQL

SELECT STREAM * FROM Orders AS o JOIN Products AS p ON o.productId = p.productId
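A sketch of the semantics, assuming the Products relation fits in memory, as it does when SamzaSQL caches it via Samza's bootstrap streams. The product and supplier ids are hypothetical sample data.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Toy stream-to-relation join: enrich each arriving order with its supplier.
public class StreamTableJoin {
    record Order(long rowtime, String productId, int units) {}

    // productId -> supplierId, standing in for the cached Products table.
    static final Map<String, String> PRODUCTS = Map.of("p1", "s9", "p2", "s4");

    static List<String> join(List<Order> orders) {
        List<String> out = new ArrayList<>();
        for (Order o : orders) {
            String supplier = PRODUCTS.get(o.productId());
            if (supplier != null) {                  // inner join: drop unmatched orders
                out.add(o.productId() + ":" + supplier);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(join(List.of(
            new Order(1L, "p1", 10),
            new Order(2L, "p3", 5)))); // [p1:s9] -- p3 has no Products row
    }
}
```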
SLIDE 19

SamzaSQL - Implementation

  • Uses the Apache Calcite query planning framework
  • Utilizes Calcite’s code generation framework
  • Generates Samza jobs for streaming SQL queries
  • Uses Samza’s local storage to implement fault-tolerant window aggregations
  • Uses Samza’s bootstrap stream feature to cache the relation when executing stream-to-relation join queries
  • Uses Janino to compile operators generated during stream task initialization

SLIDE 20

SamzaSQL - Architecture

[Architecture diagram. Components: SamzaSQL Shell, Samza YARN Client, Calcite Model, Schema Registry, Zookeeper, SamzaSQL Job.]

SLIDE 21

SamzaSQL - Samza Job

[Diagram: the Samza YARN Client submits the job to the YARN Resource Manager; the Samza App Master schedules SamzaContainers (s-p0, s-p1, s-p2) on Node Managers, each consuming stream partitions from a Kafka cluster (Kafka Brokers 1..n).]

SLIDE 22

SamzaSQL - Kafka

SLIDE 23

SamzaSQL - Query Planner

SELECT STREAM … → Parser → Validator → Convert to Logical Plan → Generic Optimizations → Convert to SamzaSQL Model → SamzaSQL Optimizations → Execution Plan → Samza Job Configuration (planning stages implemented with Apache Calcite)

SLIDE 24

EVALUATION

SLIDE 25

Evaluation - Environment

  • 100-byte messages (based on previous Kafka benchmarks)
  • 3-node (EC2 r3.2xlarge) Kafka cluster
  • 3-node (EC2 r3.2xlarge) YARN cluster
  • Each r3.2xlarge instance has 8 vCPUs, 61 GB of RAM, and 160 GB of SSD-backed storage
  • Data model:
      • Stream: Orders (rowtime, productId, orderId, units)
      • Table: Products (productId, name, supplierId)
SLIDE 26

Evaluation - Results

  • Per-task throughput is around 550 MB/min for simple queries (100-byte messages)
  • Throughput is around 200 MB/min when local storage is used (100-byte messages)
  • 30–40% overhead for simple queries compared with Samza jobs written in Java
  • Overheads are mainly due to the message format transformations required by the streaming SQL runtime
  • Overheads increase when local storage is used, due to message serialization/deserialization

SLIDE 27

Evaluation - SamzaSQL Message Processing Flow

MESSAGE PROCESSING FLOW

Decode → Avro to Array → Process → Array to Avro → Encode
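
The flow can be sketched with a toy CSV codec in place of Avro; the point is the pair of conversions on each side of the operator, which is where the SQL runtime pays the overhead measured above. The row format and the projection are illustrative assumptions, not SamzaSQL internals.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Per-message flow: decode -> to row -> process -> from row -> encode.
public class MessageFlow {
    // Decode bytes and split into the runtime's Object[] row representation.
    static Object[] decodeToRow(byte[] msg) {
        return new String(msg, StandardCharsets.UTF_8).split(",");
    }

    // The relational operator itself, here a simple projection of two fields.
    static Object[] process(Object[] row) {
        return new Object[] { row[0], row[2] };
    }

    // Convert the row back and re-encode it for the output stream.
    static byte[] rowToEncoded(Object[] row) {
        return String.join(",", Arrays.stream(row).map(Object::toString).toList())
                     .getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] in = "1000,p1,30".getBytes(StandardCharsets.UTF_8);
        byte[] out = rowToEncoded(process(decodeToRow(in)));
        System.out.println(new String(out, StandardCharsets.UTF_8)); // 1000,30
    }
}
```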

SLIDE 28

Evaluation - Filter Throughput

[Chart: throughput (msg/min, on the order of 10^7) vs. number of tasks (2, 4, 8, 16), comparing SamzaSQL with native Samza]

SELECT STREAM * FROM Orders WHERE units > 50

SLIDE 29

Evaluation - Project Throughput

[Chart: throughput (msg/min, on the order of 10^7) vs. number of tasks (2, 4, 8, 16), comparing SamzaSQL with native Samza]

SELECT STREAM rowtime, productId, units FROM Orders

SLIDE 30

Evaluation - Stream-to-Relation Join Throughput

[Chart: throughput (msg/min, on the order of 10^7) vs. number of tasks (2, 4, 8, 16), comparing SamzaSQL with native Samza]

SELECT STREAM Orders.rowtime, Orders.orderId, Orders.productId, Orders.units, Products.supplierId FROM Orders JOIN Products ON Orders.productId = Products.productId

SLIDE 31

Evaluation - Sliding Window Throughput

[Chart: throughput (msg/min, on the order of 10^6) vs. number of tasks, comparing SamzaSQL with native Samza]

SELECT STREAM rowtime, productId, units, SUM(units) OVER (PARTITION BY productId ORDER BY rowtime RANGE INTERVAL '5' MINUTE PRECEDING) unitsLastFiveMinutes FROM Orders

Sliding window query throughput was measured on an iMac due to limitations in EC2 I/O rates.

SLIDE 32

RELATED WORK

SLIDE 33

Related Work

  • Early work on streaming SQL: TelegraphCQ, Tribeca, GSQL, CQL
  • Streaming SQL support for Apache Flink and Apache Storm is based on our work in Apache Calcite
SLIDE 34

FUTURE WORK AND CONCLUSION

SLIDE 35

Future Work

  • Code generation to bring SamzaSQL-generated physical plans closer to Samza Java API based queries
  • Local storage improvements to reduce serialization/deserialization overheads
  • Streaming query optimizations for fast data management systems
  • Ordering guarantees in the presence of stream repartitioning
  • Stream-to-relation queries
  • Intra-query optimizations
  • Handling out-of-order arrivals

SLIDE 36

Summary and Conclusion

We proposed a novel set of extensions to standard SQL for expressing streaming queries. SamzaSQL is an implementation of the proposed streaming SQL variant on top of Apache Samza. We demonstrated that reasonable performance can be achieved by utilizing existing libraries. Our evaluation results show that further improvements, such as code generation, are needed to bring the streaming SQL runtime closer in performance to streaming queries written in languages such as Java and Scala.

SLIDE 37

References

  • Apache Samza
  • Apache Calcite
  • High-Level Language for Samza
  • Calcite Streaming SQL
  • Stream Processing for Everyone with SQL and Apache Flink

SLIDE 38

Acknowledgments

The authors thank:

  • Chris Riccomini, Jay Kreps, Martin Kleppmann, Navina Ramesh, Guozhang Wang, and the Apache Samza and Apache Calcite communities for their valuable feedback.
  • Amazon Web Services for the resource allocation award.