SLIDE 1

Evolution of an Apache Spark Architecture for Processing Game Data

Nick Afshartous, WB Analytics Platform
May 17th, 2017

SLIDE 2

About Me

  • nafshartous@wbgames.com
  • WB Analytics Core Platform Lead
  • Contributor to Reactive Kafka
  • Based in Turbine Game Studio (Needham, MA)
  • Hobbies
    • Sailing
    • Chess
SLIDE 3

Some of our games…

SLIDE 4
  • Intro
  • Ingestion pipeline
  • Redesigned ingestion pipeline
  • Summary and lessons learned
SLIDE 5

Problem Statement

  • How to evolve a Spark Streaming architecture to address challenges in streaming data into Amazon Redshift

SLIDE 6

Tech Stack

SLIDE 7

Tech Stack – Data Warehouse

SLIDE 8

Kafka

[Diagram: producers write (key, value) messages to topic partitions; consumers read messages at a (partition, offset)]

  • A message is a (key, value) pair
  • An optional key is used to assign a message to a partition
  • Consumers can start processing from the earliest, latest, or a specific offset
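A minimal sketch (not from the slides) of choosing where a consumer starts reading, using the plain Kafka consumer API; the servers, topic, group, and offset below are placeholders.

import java.util.Properties
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.common.TopicPartition
import org.apache.kafka.common.serialization.StringDeserializer

val props = new Properties()
props.put("bootstrap.servers", "localhost:9092")
props.put("group.id", "example-group")
props.put("auto.offset.reset", "earliest")                   // or "latest"; applies when no committed offset exists

val consumer = new KafkaConsumer(props, new StringDeserializer, new StringDeserializer)
val partition = new TopicPartition("game.events", 0)         // placeholder topic
consumer.assign(java.util.Collections.singletonList(partition))
consumer.seek(partition, 42L)                                 // or jump to a specific offset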

SLIDE 9

Game Data

  • Games (mobile and console) are instrumented to send event data
  • Volume varies, up to 100,000 events per second per game
  • Games have up to ~70 event types
  • Data use-cases
    • Development
    • Reporting
    • Decreasing player churn
    • Increasing revenue
SLIDE 10

Ingestion Pipelines

  • Batch Pipeline
    • Input: JSON
    • Processing: Hadoop MapReduce
    • Storage: Vertica
  • Spark / Redshift Real-time Pipeline
    • Input: Avro
    • Processing: Spark Streaming
    • Storage: Redshift

SLIDE 11

Spark Versions

  • In the process of upgrading from Spark 1.5.2 to 2.0.2
  • Load tested Spark 2.1
    • Blocked by a deadlock issue
    • SPARK-19300: Executor is waiting for lock
SLIDE 12
  • Intro
  • Ingestion pipeline
  • Re-designed ingestion pipeline
  • Summary and lessons learned
SLIDE 13

Process for a Game Sending Events

[Diagram: the game's Avro schema is registered with the Schema Registry, which returns a schema hash; Avro data is then sent with its schema hash to Event Ingestion]

  • The returned hash is based on the schema's fields and types
  • Registration triggers Redshift table create/alter statements

SLIDE 14

Ingestion Pipeline

[Diagram: game events (Avro + schema hash) are sent over HTTPS to the Event Ingestion Service, which publishes them to a Kafka data topic; Spark Streaming consumes micro-batches, writes the data to S3, and runs COPY to load it into Redshift]

SLIDE 15

Redshift Copy Command

person.txt:
1|john doe
2|sarah smith

create table if not exists public.person (
  id integer,
  name varchar
)

  • Redshift is optimized for loading from S3
  • COPY is a SQL statement executed by Redshift
  • Example COPY:

copy public.person from 's3://mybucket/person.txt'
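A minimal sketch (not from the slides) of issuing such a COPY over JDBC; the JDBC URL, credentials, bucket, and IAM role below are placeholders.

import java.sql.DriverManager

// Placeholder connection details and IAM role, for illustration only
val conn = DriverManager.getConnection("jdbc:redshift://cluster:5439/db", "user", "password")
val stmt = conn.createStatement()
stmt.execute(
  """copy public.person from 's3://mybucket/person.txt'
    |iam_role 'arn:aws:iam::123456789012:role/redshift-load'
    |delimiter '|'""".stripMargin)
stmt.close()
conn.close()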

SLIDE 16

Ingestion Pipeline

[Diagram: the same pipeline as slide 14; events flow over HTTPS to the Event Ingestion Service, through the Kafka data topic, into Spark Streaming micro-batches, and out to S3]

SLIDE 17

Challenges

  • Redshift is designed for loading large data files
    • Not for highly concurrent workloads (single-threaded commit queue)
  • Redshift latency can destabilize Spark Streaming
  • Data loading competes with user queries and reporting workloads
  • Weekly maintenance window
SLIDE 18
  • Intro
  • Ingestion pipeline
  • Redesigned ingestion pipeline
  • Summary and lessons learned
SLIDE 19

Redesign The Pipeline

  • Goals
    • De-couple the Spark Streaming job from Redshift
    • Tolerate Redshift unavailability
  • High-level Solution
    • Spark only writes to S3
    • Spark sends copy tasks to a Kafka topic consumed by a (new) Redshift loader
    • Design the Redshift loader to be fault-tolerant w.r.t. Redshift
SLIDE 20

Technical Design Options

  • Options considered for building the Redshift loader
    • A second Spark Streaming job
    • A lightweight custom consumer
SLIDE 21

Redshift Loader

  • Redshift loader built using Reactive Kafka
    • APIs for Scala and Java
  • Reactive Kafka
    • High-level Kafka consumer API
    • Leverages Akka Streams and Akka

[Diagram: Reactive Kafka layered on Akka Streams, layered on Akka]

SLIDE 22

Akka

  • Akka is an implementation of the Actor model
    • Actors: A Model of Concurrent Computation in Distributed Systems, Gul Agha, 1986
  • Actors
    • Single-threaded entities with an asynchronous message queue (mailbox)
    • No shared memory
  • Features
    • Location transparency: actors can be distributed over a cluster
    • Fault-tolerance: actors are restarted on failure

[Diagram: an actor with its mailbox queue] http://akka.io
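A minimal sketch (not from the slides) of a classic Akka actor: a single-threaded entity that processes one mailbox message at a time. The actor and system names are illustrative.

import akka.actor.{Actor, ActorSystem, Props}

class Greeter extends Actor {
  def receive = {
    case name: String => println(s"Hello, $name")   // messages are processed one at a time
  }
}

val system = ActorSystem("example")
val greeter = system.actorOf(Props[Greeter], "greeter")
greeter ! "world"                                   // asynchronous send; the message is queued in the mailbox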

SLIDE 23

Akka Streams

  • Stream processing is hard to implement correctly when you consider
    • Backpressure: slowing the rate down to that of the slowest part of the stream
    • Not dropping messages
  • Akka Streams is a domain-specific language for stream processing
    • Streams are executed by Akka
SLIDE 24

Akka Streams DSL

  • A Source generates stream elements
  • A Flow is a transformer (has both input and output)
  • A Sink is a stream endpoint

[Diagram: Source -> Flow -> Sink]
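A minimal sketch (not from the slides) wiring the three pieces together; the value names are illustrative.

import akka.actor.ActorSystem
import akka.stream.ActorMaterializer
import akka.stream.scaladsl.{Flow, Sink, Source}

implicit val system = ActorSystem("example")
implicit val materializer = ActorMaterializer()      // executes the stream on Akka

val numbers = Source(1 to 10)                        // Source: generates stream elements
val double  = Flow[Int].map(_ * 2)                   // Flow: transformer with input and output
val printer = Sink.foreach[Int](println)             // Sink: stream endpoint

numbers.via(double).runWith(printer)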

SLIDE 25

Akka Streams Example

  • Run stream to process two elements

// assumes an implicit ActorSystem and ActorMaterializer are in scope (see the imports slide)
val s = Source(1 to 2)
s.map(x => println("Hello: " + x))
  .runWith(Sink.ignore)

Output:
Hello: 1
Hello: 2

  • Not executed by the calling thread
  • Nothing happens until the run method is invoked

SLIDE 26

Reactive Kafka

  • A Reactive Kafka stream is a type of Akka Stream
  • The supported version targets Kafka 0.10+
    • The 0.8 branch is unsupported and less stable

https://github.com/akka/reactive-kafka

SLIDE 27

Reactive Kafka – Example

  • Create the consumer config

implicit val system = ActorSystem("Example")
implicit val materializer = ActorMaterializer()                                // needed to run the stream

val consumerSettings =
  ConsumerSettings(system, new ByteArrayDeserializer, new StringDeserializer)  // deserializers for key and value
    .withBootstrapServers("localhost:9092")                                    // Kafka endpoint
    .withGroupId("group1")                                                     // consumer group

  • Create and run the stream

Consumer.plainSource(consumerSettings, Subscriptions.topics("topic.name"))     // creates a Source that streams elements from Kafka
  .map { message =>                                                            // message has type ConsumerRecord (Kafka API)
    println("message: " + message.value())
  }
  .runWith(Sink.ignore)

SLIDE 28

Backpressure

  • Slows consumption when the message rate is faster than the consumer can process
  • Asynchronous operations inside map bypass the backpressure mechanism
  • Use mapAsync instead of map for asynchronous operations (futures)
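A minimal sketch (not from the slides) contrasting the two; writeBatch is an illustrative asynchronous operation, and an implicit ActorSystem, ActorMaterializer, and ExecutionContext (plus the Source/Sink imports from the final slide) are assumed in scope.

import scala.concurrent.Future

def writeBatch(n: Int): Future[Unit] = Future { /* e.g. write files to S3 */ }

// map: the stage finishes as soon as the Future is created, so backpressure is bypassed
Source(1 to 100).map(writeBatch).runWith(Sink.ignore)

// mapAsync: at most 4 uncompleted Futures are in flight, so backpressure is respected
Source(1 to 100).mapAsync(4)(writeBatch).runWith(Sink.ignore)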
SLIDE 29

Revised Architecture

[Diagram: game clients send Avro events over HTTPS to the Event Ingestion Service, which publishes to the Kafka data topic; Spark Streaming consumes the data topic, writes to S3, and publishes copy tasks to a Kafka COPY topic; the Redshift loader consumes the COPY topic and loads the data from S3 into Redshift]

SLIDE 30

Goals

  • De-couple Spark streaming job from Redshift
  • Tolerate Redshift unavailability
SLIDE 31

Redshift Cluster Status

  • Cluster status displayed on AWS console
  • Can be obtained programmatically via AWS SDK
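A hedged sketch (not from the slides) of such a status check, assuming the AWS SDK for Java; the cluster identifier is a placeholder.

import com.amazonaws.services.redshift.AmazonRedshiftClientBuilder
import com.amazonaws.services.redshift.model.DescribeClustersRequest

val redshift = AmazonRedshiftClientBuilder.defaultClient()

def redshiftAvailable(clusterId: String): Boolean = {
  val cluster = redshift
    .describeClusters(new DescribeClustersRequest().withClusterIdentifier(clusterId))
    .getClusters.get(0)
  cluster.getClusterStatus == "available"            // other values include "maintenance", "rebooting"
}

redshiftAvailable("analytics-cluster")               // placeholder cluster identifier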
SLIDE 32

Redshift Fault Tolerance

  • The loader checks the health of Redshift using the AWS SDK
    • Starts consuming when Redshift is available
    • Shuts down the consumer when Redshift is not available

Consumer.Control.shutdown()

  • Run a test query to validate database connections
    • Don't rely on the JDBC driver's Connection.isClosed() method
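A minimal sketch (not from the slides) of validating a connection with a test query; Connection.isClosed() only reports whether close() was called locally, not whether the server side is still usable.

import java.sql.Connection
import scala.util.Try

def connectionIsHealthy(conn: Connection): Boolean =
  Try {
    val stmt = conn.createStatement()
    try stmt.executeQuery("select 1")                // any cheap query works as a probe
    finally stmt.close()
  }.isSuccess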
SLIDE 33

Transactions

  • With auto-commit enabled, each COPY is its own transaction
    • The commit queue limits throughput
  • Better throughput by executing multiple COPYs in a single transaction
  • Run several concurrent transactions per job
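A minimal sketch (not from the slides) of batching several COPY statements into one JDBC transaction, so the whole batch pays for a single commit.

import java.sql.Connection

def runCopyBatch(conn: Connection, copies: Seq[String]): Unit = {
  conn.setAutoCommit(false)                          // one commit for the whole batch
  val stmt = conn.createStatement()
  try {
    copies.foreach(stmt.execute)                     // each element is a COPY statement
    conn.commit()
  } catch {
    case e: Exception => conn.rollback(); throw e
  } finally stmt.close()
}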
SLIDE 34

Deadlock

  • Concurrent transactions create the potential for deadlock, since COPY statements lock tables
  • Redshift detects the deadlock and returns a deadlock exception
SLIDE 35

Deadlock

[Timeline: Transaction 1 copies table A then table B; Transaction 2 copies table B then table A; after the first COPYs, A and B are locked, each transaction waits for the other's lock, and the result is a deadlock]

SLIDE 36

Deadlock Avoidance

  • The master hashes each copy task to a worker based on the Redshift table name
    • Ensures that all COPYs for the same table are dispatched to the same worker
  • Alternatively, COPYs could be ordered within each transaction to avoid deadlock
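A minimal sketch (not from the slides) of the routing idea: hash the Redshift table name to a worker index so two workers never COPY into the same table concurrently. The names are illustrative.

def workerFor(tableName: String, numWorkers: Int): Int =
  Math.floorMod(tableName.hashCode, numWorkers)      // stable, non-negative worker index

// All COPY tasks for "public.person" always land on the same worker
val worker = workerFor("public.person", numWorkers = 4)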
SLIDE 37
  • Intro
  • Ingestion pipeline
  • Redesigned ingestion pipeline
  • Summary and lessons learned
SLIDE 38

Lessons (Re)learned

  • Distinguish post-processing from data processing
    • Use Spark for data processing
  • Assume dependencies will fail
  • Load testing
    • Don't focus exclusively on load volume; the number of event types was also a factor
    • Load test often, facilitated by automation
SLIDE 39

Monitoring

  • Monitor individual components
    • The Redshift loader sends a heartbeat via the CloudWatch metrics API (sketch below)
  • Monitor flow through the entire pipeline
    • Send test data and verify it is processed successfully
    • Catches failures anywhere in the pipeline
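A hedged sketch (not from the slides) of such a heartbeat, assuming the AWS SDK for Java; the namespace and metric name are placeholders.

import com.amazonaws.services.cloudwatch.AmazonCloudWatchClientBuilder
import com.amazonaws.services.cloudwatch.model.{MetricDatum, PutMetricDataRequest}

val cloudWatch = AmazonCloudWatchClientBuilder.defaultClient()

def sendHeartbeat(): Unit =
  cloudWatch.putMetricData(
    new PutMetricDataRequest()
      .withNamespace("Analytics/RedshiftLoader")     // placeholder namespace
      .withMetricData(new MetricDatum()
        .withMetricName("Heartbeat")
        .withValue(1.0)))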
SLIDE 40

Possible Future Directions

  • Explore using Akka Persistence
  • Implement more of the loader using Reactive Kafka
    • Use Source.groupedWithin for batching COPYs instead of batching in Akka
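A minimal sketch (not from the slides): groupedWithin emits a batch when either the size limit or the time window is reached, whichever comes first. It reuses the consumer settings from the earlier example; the topic name and runCopyTransaction are illustrative.

import scala.concurrent.duration._

Consumer.plainSource(consumerSettings, Subscriptions.topics("copy.tasks"))
  .groupedWithin(500, 10.seconds)                    // Seq of up to 500 copy tasks per batch
  .mapAsync(1)(batch => runCopyTransaction(batch))   // illustrative: one transaction per batch
  .runWith(Sink.ignore)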
SLIDE 41

Related Note

  • Kafka Streams was released as part of Kafka 0.10+
  • Provides a streams API similar to Reactive Kafka
  • Reactive Kafka has API integration with Akka
SLIDE 42

Questions?

SLIDE 43

Imports for Code Examples

import org.apache.kafka.clients.consumer.ConsumerRecord
import akka.actor.{ActorRef, ActorSystem}
import akka.stream.ActorMaterializer
import akka.stream.scaladsl.{Keep, Sink, Source}
import scala.util.{Success, Failure}
import scala.concurrent.ExecutionContext.Implicits.global
import akka.kafka.ConsumerSettings
import org.apache.kafka.clients.consumer.ConsumerConfig
import org.apache.kafka.common.serialization.{StringDeserializer, ByteArrayDeserializer}
import akka.kafka.Subscriptions
import akka.kafka.ConsumerMessage.{CommittableOffsetBatch, CommittableMessage}
import akka.kafka.scaladsl.Consumer