slide-1
SLIDE 1

Apache Spark Streaming, Kafka and HarmonicIO: A Performance Benchmark and Architecture Comparison for Enterprise and Scientific Computing

Ben Blamey, Andreas Hellander and Salman Toor
Uppsala University, Sweden
Ben.Blamey@it.uu.se
Bench’19, Denver, USA, November 2019.
http://www.benchcouncil.org/bench19/index.html

slide-2
SLIDE 2

Summary

  • Performance Benchmark for Streaming Frameworks
    – Apache Spark (under various integrations…)
    – HarmonicIO
  • Large Message Size (and higher processing cost)
    – Scientific use cases: microscopy
  • Key finding: ‘islands’ of good performance over the 2D domain of message size and CPU cost, with varying utility w.r.t. theoretical bounds.

slide-3
SLIDE 3

Background

  • Apache Spark
    – Enterprise grade (resilient, great features, etc.)
    – Proven performance for typical enterprise use cases.
  • HASTE Project:
    – Microscopy use cases
    – Message size 1-10MB, >1 second of processing per message.
  • How well do enterprise tools adapt to scientific computing?
slide-4
SLIDE 4

The Parameter Space

  • 2D Parameter Space
  • Theoretical Bounds
    – Network
    – CPU
  • How does performance generalize across this domain?

[Figure: the 2D parameter space. Axes: Message Size and CPU Cost (Map Function). Regions: (A) CPU Bound, (B) Network Bound, (C) ‘Framework’ Bound.]
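
The network and CPU bounds named on this slide can be sketched numerically. This is a minimal illustration, not the authors' code. It assumes the network bound is link bandwidth divided by message size, the CPU bound is the total number of worker cores divided by the per-message CPU cost, and the achievable frequency is at most the smaller of the two. The function name and the cluster figures (link speed, core count) are hypothetical.

```python
def max_frequency_hz(message_bytes, cpu_cost_s, link_bytes_per_s, total_cores):
    """Upper bound on sustainable message frequency (messages/second).

    Assumption: the stream is limited either by the ingress link or by
    the aggregate CPU available to the map function, whichever is lower.
    """
    network_bound = link_bytes_per_s / message_bytes
    # With zero CPU cost the CPU imposes no limit at all.
    cpu_bound = float("inf") if cpu_cost_s == 0 else total_cores / cpu_cost_s
    return min(network_bound, cpu_bound)


# Hypothetical cluster: 1.25e9 bytes/s link (10 Gbit/s), 40 worker cores.
LINK = 1.25e9
CORES = 40

# Small, cheap messages are limited by the network; large, costly ones by CPU.
small = max_frequency_hz(1_000, 0.0, LINK, CORES)
large = max_frequency_hz(10_000_000, 1.0, LINK, CORES)
print(small, large)
```

Under these assumed figures, a framework that falls well short of both limits would sit in the ‘framework’ bound region of the figure.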

slide-5
SLIDE 5

HarmonicIO

Image source: Torruangwatthana et al., HarmonicIO: Scalable Data Stream Processing for Scientific Datasets, IEEE Services 2018

  • Favors P2P message transfer.
  • Falls back to a master queue.
  • Processing runs inside Docker containers.
  • Intended for scientific computing applications.
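
The P2P-first delivery with a master-queue fallback described above can be modelled in a few lines. This is a toy sketch under stated assumptions, not HarmonicIO's implementation or API: the class and method names are hypothetical, and it assumes that a worker which finishes a task takes queued work before it becomes idle again.

```python
from collections import deque


class Dispatcher:
    """Toy model of P2P-first delivery with a master-queue fallback.

    Hypothetical names throughout; this is not HarmonicIO's API.
    """

    def __init__(self, workers):
        self.idle = deque(workers)   # workers ready to accept a message
        self.master_queue = deque()  # backlog held at the master

    def send(self, message):
        """Deliver directly to an idle worker, else queue at the master."""
        if self.idle:
            worker = self.idle.popleft()
            return ("p2p", worker)
        self.master_queue.append(message)
        return ("queued", None)

    def worker_finished(self, worker):
        """A worker that finishes takes queued work before going idle."""
        if self.master_queue:
            return ("from_queue", self.master_queue.popleft())
        self.idle.append(worker)
        return ("idle", None)


d = Dispatcher(["w1", "w2"])
results = [d.send(m) for m in ("a", "b", "c")]
drained = d.worker_finished("w1")
print(results, drained)
```

With two workers and three messages, the first two messages go peer-to-peer and the third waits at the master until a worker frees up.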
slide-6
SLIDE 6

Methodology

[Figure: the four architectures under test.
– Apache Spark Streaming w. File Streaming: Stream Source, Master, Worker 1 … Worker N; File Listing (NFS Share), File Transfer (NFS Share).
– Apache Spark Streaming w. TCP: Stream Source, Master, Worker 1 … Worker N; Message Transfer.
– Apache Spark Streaming w. Kafka: Stream Source, Kafka Server, Master, Worker 1 … Worker N-1; Message Transfer.
– HarmonicIO: Stream Source, Master, Worker 1 … Worker N; Message Transfer – Queue Mode, Message Transfer – P2P Mode.]

slide-7
SLIDE 7

Experimental Setup

[Figure: experimental setup. Components: Streaming Source, Spark Benchmarking Application, Throttling Application. Message content: CPU pause, padded to length. Arrow labels: monitoring via logs and REST API; monitoring via logs; sets message size and CPU cost via REST API.]
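
The label "CPU Pause, padded to length" suggests that each synthetic message carries its own processing cost and is then padded to the target size. The sketch below shows one way to build such a message. The byte layout (an ASCII number of seconds, a `|` delimiter, then padding) and both function names are assumptions made for illustration; the slides do not specify the encoding.

```python
def build_message(cpu_pause_s, size_bytes, pad=b"#"):
    """Encode the CPU pause as a text header, then pad to size_bytes.

    Assumed layout (hypothetical): b"<seconds>|" followed by pad bytes.
    """
    header = f"{cpu_pause_s:.6f}|".encode("ascii")
    if len(header) > size_bytes:
        raise ValueError("size_bytes too small to hold the header")
    return header + pad * (size_bytes - len(header))


def parse_cpu_pause(message):
    """Recover the CPU pause (seconds) from a message built above."""
    header, _, _ = message.partition(b"|")
    return float(header.decode("ascii"))


msg = build_message(0.25, 1024)
print(len(msg), parse_cpu_pause(msg))
```

Encoding the cost inside the message lets a single streaming source cover the whole parameter space by changing only two values, message size and CPU pause.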

slide-8
SLIDE 8

The Workload

StreamingBenchmark.scala
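
The Scala listing itself is not reproduced here. As a rough Python analogue of the kind of map function this benchmark describes (a fixed CPU cost per message), the sketch below reads the requested cost from the message and spins for that long. It assumes a hypothetical layout of `b"<seconds>|<padding>"` and is not a translation of StreamingBenchmark.scala.

```python
import time


def busy_wait(seconds):
    """Spin on the CPU (rather than sleeping) for roughly `seconds`."""
    deadline = time.perf_counter() + seconds
    spins = 0
    while time.perf_counter() < deadline:
        spins += 1
    return spins


def process_message(message):
    """Map function: read the requested CPU cost, then burn that much CPU.

    Assumes the hypothetical layout b"<seconds>|<padding>".
    """
    header, _, _ = message.partition(b"|")
    cpu_cost_s = float(header.decode("ascii"))
    busy_wait(cpu_cost_s)
    return len(message)


start = time.perf_counter()
size = process_message(b"0.05|" + b"#" * 95)
elapsed = time.perf_counter() - start
print(size, elapsed)
```

A busy loop is used instead of `time.sleep` so that the cost actually occupies a core, which is what a CPU bound would measure.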

slide-9
SLIDE 9

Legend: Dark = high frequency; Black = best; Light = low frequency.

slide-10
SLIDE 10
  • Excellent performance near the origin.
  • 300 kHz
  • Relatively weaker for high CPU load
  • Cores used for message forwarding
  • Crashes for large messages.
slide-11
SLIDE 11
  • Excellent performance near the origin.
  • At the origin, beaten by Spark+TCP
  • Weaker for high CPU load
  • Overhead of Kafka server
  • Weaker for larger messages.
  • Not intended use case.
slide-12
SLIDE 12
  • Great performance at low frequencies.
  • Spark’s filesystem polling struggles at high frequency.

slide-13
SLIDE 13
  • Good overall performance.
  • Able to match the performance of Spark+FS and Spark+Kafka in their regions of good performance
  • …and in between.
  • Struggles at higher frequencies near the origin.

slide-14
SLIDE 14

Results – Theoretical Bounds

slide-15
SLIDE 15

Performance for nil CPU Load

slide-16
SLIDE 16

Discussion

  • ‘Islands’ of good performance.

slide-17
SLIDE 17

Conclusions

  • Choice of Spark integration matters
    – depends on the parameters, frequency.
  • A 2D parameter sweep is a nice way to visualize performance.
  • Various phenomena are visible only in some regions:
    – Bottlenecks, overhead costs.
    – Varying utility (w.r.t. theoretical bounds).
  • ‘Middle Region’ – 1-10MB, >1 second cost
    – Neglected in streaming benchmark studies?
    – A region where HarmonicIO does well.
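
The idea of a 2D sweep scored against theoretical bounds can be sketched as a grid of ratios, measured frequency divided by ideal frequency. Everything below is illustrative. The bound formula is an assumption (the lower of the network and CPU limits), the cluster figures are hypothetical, and `fake_measure` is a placeholder that returns synthetic values, not benchmark results.

```python
def bound_hz(size_bytes, cpu_cost_s, link_bytes_per_s, cores):
    """Assumed ideal frequency: the lower of the network and CPU limits."""
    net = link_bytes_per_s / size_bytes
    cpu = float("inf") if cpu_cost_s == 0 else cores / cpu_cost_s
    return min(net, cpu)


def sweep(sizes, costs, measure, link_bytes_per_s, cores):
    """Return {(size, cost): measured / bound} over the 2D grid."""
    grid = {}
    for size in sizes:
        for cost in costs:
            ideal = bound_hz(size, cost, link_bytes_per_s, cores)
            grid[(size, cost)] = measure(size, cost) / ideal
    return grid


def fake_measure(size_bytes, cpu_cost_s):
    """Placeholder, NOT real data: pretends to reach half of the bound."""
    return 0.5 * bound_hz(size_bytes, cpu_cost_s, 1.25e9, 40)


ratios = sweep([1_000_000, 10_000_000], [0.5, 1.0], fake_measure, 1.25e9, 40)
print(ratios)
```

In a real study `measure` would run the framework at that grid point and report the highest sustained frequency; plotting the resulting ratios over the grid would show where a framework sits close to its bounds and where it does not.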

slide-18
SLIDE 18

Funding

The HASTE Project (http://haste.research.it.uu.se/) is funded by the Swedish Foundation for Strategic Research (SSF) under award no. BD15-0008, and the eSSENCE strategic collaboration for eScience.

slide-19
SLIDE 19

Apache Spark Streaming, Kafka and HarmonicIO: A Performance Benchmark and Architecture Comparison for Enterprise and Scientific Computing

Ben Blamey, Andreas Hellander and Salman Toor Uppsala University, Sweden

Questions?

http://haste.research.it.uu.se/
https://github.com/HASTE-project/HarmonicIO
https://github.com/HASTE-project

slide-20
SLIDE 20

Results