SLIDE 1

Evolution of an Apache Spark Architecture for Processing Game Data

Nick Afshartous, WB Analytics Platform
May 17th, 2017

SLIDE 2

About Me

  • nafshartous@wbgames.com
  • WB Analytics Core Platform Lead
  • Contributor to Reactive Kafka
  • Based in Turbine Game Studio (Needham, MA)
  • Hobbies
    • Sailing
    • Chess
SLIDE 3

Some of our games…

SLIDE 4
  • Intro
  • Ingestion pipeline
  • Redesigned ingestion pipeline
  • Summary and lessons learned
SLIDE 5

Problem Statement

  • How to evolve a Spark Streaming architecture to address challenges in streaming data into Amazon Redshift

SLIDE 6

Tech Stack

SLIDE 7

Tech Stack – Data Warehouse

SLIDE 8

Kafka

[Diagram: producers write (key, value) messages to topic partitions; consumers read messages at a (partition, offset)]

  • A message is a (key, value) pair
  • An optional key is used to assign a message to a partition
  • Consumers can start processing from the earliest, latest, or a specific offset
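A minimal sketch (not from the slides) of choosing where a consumer starts reading, using the plain Kafka consumer API; the servers, topic, group, and offset below are placeholders.

import java.util.Properties
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.common.TopicPartition
import org.apache.kafka.common.serialization.StringDeserializer

val props = new Properties()
props.put("bootstrap.servers", "localhost:9092")
props.put("group.id", "example-group")
props.put("auto.offset.reset", "earliest")                   // or "latest"; applies when no committed offset exists

val consumer = new KafkaConsumer(props, new StringDeserializer, new StringDeserializer)
val partition = new TopicPartition("game.events", 0)         // placeholder topic
consumer.assign(java.util.Collections.singletonList(partition))
consumer.seek(partition, 42L)                                 // or jump to a specific offset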

SLIDE 9

Game Data

  • Games (mobile and console) are instrumented to send event data
  • Volume varies, up to 100,000 events per second per game
  • Games have up to ~70 event types
  • Data use-cases
    • Development
    • Reporting
    • Decreasing player churn
    • Increasing revenue
SLIDE 10

Ingestion Pipelines

  • Batch Pipeline
    • Input: JSON
    • Processing: Hadoop MapReduce
    • Storage: Vertica
  • Spark / Redshift Real-time Pipeline
    • Input: Avro
    • Processing: Spark Streaming
    • Storage: Redshift

SLIDE 11

Spark Versions

  • In the process of upgrading from Spark 1.5.2 to 2.0.2
  • Load tested Spark 2.1
    • Blocked by a deadlock issue
    • SPARK-19300: Executor is waiting for lock
SLIDE 12
  • Intro
  • Ingestion pipeline
  • Re-designed ingestion pipeline
  • Summary and lessons learned
SLIDE 13

Process for a Game Sending Events

[Diagram: the game's Avro schema is registered with the Schema Registry, which returns a schema hash; Avro data is then sent with its schema hash to Event Ingestion]

  • The returned hash is based on the schema's fields and types
  • Registration triggers Redshift table create/alter statements

SLIDE 14

Ingestion Pipeline

[Diagram: game events (Avro + schema hash) are sent over HTTPS to the Event Ingestion Service, which publishes them to a Kafka data topic; Spark Streaming consumes micro-batches, writes the data to S3, and runs COPY to load it into Redshift]

SLIDE 15

Redshift Copy Command

person.txt:
1|john doe
2|sarah smith

create table if not exists public.person (
  id integer,
  name varchar
)

  • Redshift is optimized for loading from S3
  • COPY is a SQL statement executed by Redshift
  • Example COPY:

copy public.person from 's3://mybucket/person.txt'
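A minimal sketch (not from the slides) of issuing such a COPY over JDBC; the JDBC URL, credentials, bucket, and IAM role below are placeholders.

import java.sql.DriverManager

// Placeholder connection details and IAM role, for illustration only
val conn = DriverManager.getConnection("jdbc:redshift://cluster:5439/db", "user", "password")
val stmt = conn.createStatement()
stmt.execute(
  """copy public.person from 's3://mybucket/person.txt'
    |iam_role 'arn:aws:iam::123456789012:role/redshift-load'
    |delimiter '|'""".stripMargin)
stmt.close()
conn.close()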

SLIDE 16

Ingestion Pipeline

[Diagram: the same pipeline as slide 14; events flow over HTTPS to the Event Ingestion Service, through the Kafka data topic, into Spark Streaming micro-batches, and out to S3]

SLIDE 17

Challenges

  • Redshift is designed for loading large data files
    • Not for highly concurrent workloads (single-threaded commit queue)
  • Redshift latency can destabilize Spark Streaming
  • Data loading competes with user queries and reporting workloads
  • Weekly maintenance window
SLIDE 18
  • Intro
  • Ingestion pipeline
  • Redesigned ingestion pipeline
  • Summary and lessons learned
SLIDE 19

Redesign The Pipeline

  • Goals
    • De-couple the Spark Streaming job from Redshift
    • Tolerate Redshift unavailability
  • High-level Solution
    • Spark only writes to S3
    • Spark sends copy tasks to a Kafka topic consumed by a (new) Redshift loader
    • Design the Redshift loader to be fault-tolerant w.r.t. Redshift
SLIDE 20

Technical Design Options

  • Options considered for building the Redshift loader
    • A second Spark Streaming job
    • A lightweight custom consumer
SLIDE 21

Redshift Loader

  • Redshift loader built using Reactive Kafka
    • APIs for Scala and Java
  • Reactive Kafka
    • High-level Kafka consumer API
    • Leverages Akka Streams and Akka

[Diagram: Reactive Kafka layered on Akka Streams, layered on Akka]

SLIDE 22

Akka

  • Akka is an implementation of the Actor model
    • Actors: A Model of Concurrent Computation in Distributed Systems, Gul Agha, 1986
  • Actors
    • Single-threaded entities with an asynchronous message queue (mailbox)
    • No shared memory
  • Features
    • Location transparency: actors can be distributed over a cluster
    • Fault-tolerance: actors are restarted on failure

[Diagram: an actor with its mailbox queue] http://akka.io
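A minimal sketch (not from the slides) of a classic Akka actor: a single-threaded entity that processes one mailbox message at a time. The actor and system names are illustrative.

import akka.actor.{Actor, ActorSystem, Props}

class Greeter extends Actor {
  def receive = {
    case name: String => println(s"Hello, $name")   // messages are processed one at a time
  }
}

val system = ActorSystem("example")
val greeter = system.actorOf(Props[Greeter], "greeter")
greeter ! "world"                                   // asynchronous send; the message is queued in the mailbox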

SLIDE 23

Akka Streams

  • Stream processing is hard to implement correctly when you consider
    • Backpressure: slowing the rate down to that of the slowest part of the stream
    • Not dropping messages
  • Akka Streams is a domain-specific language for stream processing
    • Streams are executed by Akka
SLIDE 24

Akka Streams DSL

  • A Source generates stream elements
  • A Flow is a transformer (has both input and output)
  • A Sink is a stream endpoint

[Diagram: Source -> Flow -> Sink]
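A minimal sketch (not from the slides) wiring the three pieces together; the value names are illustrative.

import akka.actor.ActorSystem
import akka.stream.ActorMaterializer
import akka.stream.scaladsl.{Flow, Sink, Source}

implicit val system = ActorSystem("example")
implicit val materializer = ActorMaterializer()      // executes the stream on Akka

val numbers = Source(1 to 10)                        // Source: generates stream elements
val double  = Flow[Int].map(_ * 2)                   // Flow: transformer with input and output
val printer = Sink.foreach[Int](println)             // Sink: stream endpoint

numbers.via(double).runWith(printer)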

SLIDE 25

Akka Streams Example

  • Run stream to process two elements

// assumes an implicit ActorSystem and ActorMaterializer are in scope (see the imports slide)
val s = Source(1 to 2)
s.map(x => println("Hello: " + x))
  .runWith(Sink.ignore)

Output:
Hello: 1
Hello: 2

  • Not executed by the calling thread
  • Nothing happens until the run method is invoked

SLIDE 26

Reactive Kafka

  • A Reactive Kafka stream is a type of Akka Stream
  • The supported version targets Kafka 0.10+
    • The 0.8 branch is unsupported and less stable

https://github.com/akka/reactive-kafka

SLIDE 27

Reactive Kafka – Example

  • Create the consumer config

implicit val system = ActorSystem("Example")
implicit val materializer = ActorMaterializer()                                // needed to run the stream

val consumerSettings =
  ConsumerSettings(system, new ByteArrayDeserializer, new StringDeserializer)  // deserializers for key and value
    .withBootstrapServers("localhost:9092")                                    // Kafka endpoint
    .withGroupId("group1")                                                     // consumer group

  • Create and run the stream

Consumer.plainSource(consumerSettings, Subscriptions.topics("topic.name"))     // creates a Source that streams elements from Kafka
  .map { message =>                                                            // message has type ConsumerRecord (Kafka API)
    println("message: " + message.value())
  }
  .runWith(Sink.ignore)

SLIDE 28

Backpressure

  • Slows consumption when the message rate is faster than the consumer can process
  • Asynchronous operations inside map bypass the backpressure mechanism
  • Use mapAsync instead of map for asynchronous operations (futures)
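A minimal sketch (not from the slides) contrasting the two; writeBatch is an illustrative asynchronous operation, and an implicit ActorSystem, ActorMaterializer, and ExecutionContext (plus the Source/Sink imports from the final slide) are assumed in scope.

import scala.concurrent.Future

def writeBatch(n: Int): Future[Unit] = Future { /* e.g. write files to S3 */ }

// map: the stage finishes as soon as the Future is created, so backpressure is bypassed
Source(1 to 100).map(writeBatch).runWith(Sink.ignore)

// mapAsync: at most 4 uncompleted Futures are in flight, so backpressure is respected
Source(1 to 100).mapAsync(4)(writeBatch).runWith(Sink.ignore)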
SLIDE 29

Revised Architecture

[Diagram: game clients send Avro events over HTTPS to the Event Ingestion Service, which publishes to the Kafka data topic; Spark Streaming consumes the data topic, writes to S3, and publishes copy tasks to a Kafka COPY topic; the Redshift loader consumes the COPY topic and loads the data from S3 into Redshift]

SLIDE 30

Goals

  • De-couple Spark streaming job from Redshift
  • Tolerate Redshift unavailability
SLIDE 31

Redshift Cluster Status

  • Cluster status displayed on AWS console
  • Can be obtained programmatically via AWS SDK
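A hedged sketch (not from the slides) of such a status check, assuming the AWS SDK for Java; the cluster identifier is a placeholder.

import com.amazonaws.services.redshift.AmazonRedshiftClientBuilder
import com.amazonaws.services.redshift.model.DescribeClustersRequest

val redshift = AmazonRedshiftClientBuilder.defaultClient()

def redshiftAvailable(clusterId: String): Boolean = {
  val cluster = redshift
    .describeClusters(new DescribeClustersRequest().withClusterIdentifier(clusterId))
    .getClusters.get(0)
  cluster.getClusterStatus == "available"            // other values include "maintenance", "rebooting"
}

redshiftAvailable("analytics-cluster")               // placeholder cluster identifier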
SLIDE 32

Redshift Fault Tolerance

  • The loader checks the health of Redshift using the AWS SDK
    • Starts consuming when Redshift is available
    • Shuts down the consumer when Redshift is not available

Consumer.Control.shutdown()

  • Run a test query to validate database connections
    • Don't rely on the JDBC driver's Connection.isClosed() method
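A minimal sketch (not from the slides) of validating a connection with a test query; Connection.isClosed() only reports whether close() was called locally, not whether the server side is still usable.

import java.sql.Connection
import scala.util.Try

def connectionIsHealthy(conn: Connection): Boolean =
  Try {
    val stmt = conn.createStatement()
    try stmt.executeQuery("select 1")                // any cheap query works as a probe
    finally stmt.close()
  }.isSuccess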
SLIDE 33

Transactions

  • With auto-commit enabled, each COPY is its own transaction
    • The commit queue limits throughput
  • Better throughput by executing multiple COPYs in a single transaction
  • Run several concurrent transactions per job
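A minimal sketch (not from the slides) of batching several COPY statements into one JDBC transaction, so the whole batch pays for a single commit.

import java.sql.Connection

def runCopyBatch(conn: Connection, copies: Seq[String]): Unit = {
  conn.setAutoCommit(false)                          // one commit for the whole batch
  val stmt = conn.createStatement()
  try {
    copies.foreach(stmt.execute)                     // each element is a COPY statement
    conn.commit()
  } catch {
    case e: Exception => conn.rollback(); throw e
  } finally stmt.close()
}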
SLIDE 34

Deadlock

  • Concurrent transactions create the potential for deadlock, since COPY statements lock tables
  • Redshift detects the deadlock and returns a deadlock exception
SLIDE 35

Deadlock

[Timeline: Transaction 1 copies table A then table B; Transaction 2 copies table B then table A; after the first COPYs, A and B are locked, each transaction waits for the other's lock, and the result is a deadlock]

SLIDE 36

Deadlock Avoidance

  • The master hashes each copy task to a worker based on the Redshift table name
    • Ensures that all COPYs for the same table are dispatched to the same worker
  • Alternatively, COPYs could be ordered within each transaction to avoid deadlock
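A minimal sketch (not from the slides) of the routing idea: hash the Redshift table name to a worker index so two workers never COPY into the same table concurrently. The names are illustrative.

def workerFor(tableName: String, numWorkers: Int): Int =
  Math.floorMod(tableName.hashCode, numWorkers)      // stable, non-negative worker index

// All COPY tasks for "public.person" always land on the same worker
val worker = workerFor("public.person", numWorkers = 4)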
SLIDE 37
  • Intro
  • Ingestion pipeline
  • Redesigned ingestion pipeline
  • Summary and lessons learned
SLIDE 38

Lessons (Re)learned

  • Distinguish post-processing from data processing
    • Use Spark for data processing
  • Assume dependencies will fail
  • Load testing
    • Don't focus exclusively on load volume; the number of event types was also a factor
    • Load test often, facilitated by automation
SLIDE 39

Monitoring

  • Monitor individual components
    • The Redshift loader sends a heartbeat via the CloudWatch metrics API (sketch below)
  • Monitor flow through the entire pipeline
    • Send test data and verify it is processed successfully
    • Catches failures anywhere in the pipeline
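A hedged sketch (not from the slides) of such a heartbeat, assuming the AWS SDK for Java; the namespace and metric name are placeholders.

import com.amazonaws.services.cloudwatch.AmazonCloudWatchClientBuilder
import com.amazonaws.services.cloudwatch.model.{MetricDatum, PutMetricDataRequest}

val cloudWatch = AmazonCloudWatchClientBuilder.defaultClient()

def sendHeartbeat(): Unit =
  cloudWatch.putMetricData(
    new PutMetricDataRequest()
      .withNamespace("Analytics/RedshiftLoader")     // placeholder namespace
      .withMetricData(new MetricDatum()
        .withMetricName("Heartbeat")
        .withValue(1.0)))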
SLIDE 40

Possible Future Directions

  • Explore using Akka Persistence
  • Implement more of the loader using Reactive Kafka
    • Use Source.groupedWithin for batching COPYs instead of batching in Akka
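A minimal sketch (not from the slides): groupedWithin emits a batch when either the size limit or the time window is reached, whichever comes first. It reuses the consumer settings from the earlier example; the topic name and runCopyTransaction are illustrative.

import scala.concurrent.duration._

Consumer.plainSource(consumerSettings, Subscriptions.topics("copy.tasks"))
  .groupedWithin(500, 10.seconds)                    // Seq of up to 500 copy tasks per batch
  .mapAsync(1)(batch => runCopyTransaction(batch))   // illustrative: one transaction per batch
  .runWith(Sink.ignore)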
SLIDE 41

Related Note

  • Kafka Streams was released as part of Kafka 0.10+
  • Provides a streams API similar to Reactive Kafka
  • Reactive Kafka has API integration with Akka
SLIDE 42

Questions?

SLIDE 43

Imports for Code Examples

import org.apache.kafka.clients.consumer.ConsumerRecord
import akka.actor.{ActorRef, ActorSystem}
import akka.stream.ActorMaterializer
import akka.stream.scaladsl.{Keep, Sink, Source}
import scala.util.{Success, Failure}
import scala.concurrent.ExecutionContext.Implicits.global
import akka.kafka.ConsumerSettings
import org.apache.kafka.clients.consumer.ConsumerConfig
import org.apache.kafka.common.serialization.{StringDeserializer, ByteArrayDeserializer}
import akka.kafka.Subscriptions
import akka.kafka.ConsumerMessage.{CommittableOffsetBatch, CommittableMessage}
import akka.kafka.scaladsl.Consumer