SLIDE 1

Apache Spark Streaming, Kafka and HarmonicIO: A Performance Benchmark and Architecture Comparison for Enterprise and Scientific Computing

Ben Blamey, Andreas Hellander and Salman Toor
Uppsala University, Sweden
Ben.Blamey@it.uu.se
Bench’19, Denver, USA, November 2019.
http://www.benchcouncil.org/bench19/index.html

SLIDE 2

Summary

Performance Benchmark for Streaming Frameworks

– Apache Spark (under various integrations…)
– HarmonicIO

Large Message Size (and higher processing cost)

– Scientific use cases: microscopy

Key finding: ‘islands’ of good performance over the 2D domain of message size and CPU cost, with varying utility w.r.t. theoretical bounds.

SLIDE 3

Background

Apache Spark

– Enterprise grade (resilient, great features, etc.)
– Proven performance for typical enterprise use cases.

HASTE Project:

– Microscopy use cases
– Message size 1-10 MB, >1 second CPU cost per message.

How well do enterprise tools adapt to scientific computing?

SLIDE 4

The Parameter Space

2D Parameter Space
Theoretical Bounds

– Network
– CPU

How does performance generalize across this domain?

[Figure: the 2D parameter space. Axes: CPU Cost (Map Function) and Message Size. Regions: (A) CPU Bound, (B) Network Bound, (C) ‘Framework’ Bound.]
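The network and CPU bounds on this slide can be illustrated with a small calculation. The sketch below is only an assumption about how such bounds might be computed: the network limits frequency to link bandwidth divided by message size, and the CPU limits it to the number of cores divided by the per-message CPU cost. The function names, the 1 Gbit/s link and the 40-core cluster are hypothetical values chosen for illustration; they are not taken from the slides.

```python
def max_frequency_hz(message_size_bytes, cpu_cost_s, link_bytes_per_s, total_cores):
    """Upper bound on messages per second from the network and CPU limits.

    All parameters are illustrative assumptions, not values from the talk.
    """
    # Network bound: the link can carry at most this many messages per second.
    network_bound = link_bytes_per_s / message_size_bytes
    # CPU bound: each core completes 1 / cpu_cost_s messages per second.
    # With zero CPU cost, the CPU imposes no limit.
    cpu_bound = total_cores / cpu_cost_s if cpu_cost_s > 0 else float("inf")
    return min(network_bound, cpu_bound)


def binding_limit(message_size_bytes, cpu_cost_s, link_bytes_per_s, total_cores):
    """Name the tighter of the two bounds: 'network' or 'cpu'."""
    network_bound = link_bytes_per_s / message_size_bytes
    cpu_bound = total_cores / cpu_cost_s if cpu_cost_s > 0 else float("inf")
    return "network" if network_bound <= cpu_bound else "cpu"


# Hypothetical cluster: 1 Gbit/s link (125,000,000 bytes/s) and 40 cores.
LINK = 125_000_000
CORES = 40

# 10 MB messages with 1 s of CPU each: the network is the tighter limit.
large = max_frequency_hz(10_000_000, 1.0, LINK, CORES)
# 1 kB messages with 0.5 s of CPU each: the CPU is the tighter limit.
small = max_frequency_hz(1_000, 0.5, LINK, CORES)
```

Under these assumed numbers, a point in the (message size, CPU cost) plane falls in the network-bound or CPU-bound region depending on which of the two quotients is smaller. The ‘framework’ bound from the figure is not modelled here.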

SLIDE 5

HarmonicIO

Image source: Torruangwatthana et al., HarmonicIO: Scalable Data Stream Processing for Scientific Datasets, IEEE Services 2018

Favors P2P message transfer.
Fallback to Master Queue
Processing runs inside Docker containers.
Intended for scientific computing applications.
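The "P2P first, master queue as fallback" idea can be sketched as a toy dispatcher. This is not HarmonicIO's actual API or implementation: the `Master` class, its methods and the worker names are hypothetical, and the sketch only illustrates the routing decision described in the bullets above.

```python
from collections import deque


class Master:
    """Toy stand-in for a master node (hypothetical, not HarmonicIO's API).

    It tracks which workers are idle and holds a backlog queue for messages
    that cannot be delivered directly.
    """

    def __init__(self, worker_names):
        self.idle = deque(worker_names)   # workers ready for a direct transfer
        self.queue = deque()              # fallback backlog held at the master

    def send(self, message):
        """Deliver directly to an idle worker if one exists, else queue it."""
        if self.idle:
            worker = self.idle.popleft()
            return ("p2p", worker)
        self.queue.append(message)
        return ("queued", None)

    def worker_finished(self, worker):
        """A finished worker takes the next queued message, or becomes idle."""
        if self.queue:
            return self.queue.popleft()
        self.idle.append(worker)
        return None


master = Master(["worker-1"])
first = master.send("image-a")    # an idle worker exists: direct transfer
second = master.send("image-b")   # no idle worker left: held in the queue
```

In this sketch the queue only fills when every worker is busy, so under light load all messages would take the direct path.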

SLIDE 6

Methodology

[Figure: architectures of the benchmarked configurations. Panel labels:
Apache Spark Streaming w. File Streaming: Stream Source, Master, Worker 1 ... Worker N; File Listing (NFS Share), File Transfer (NFS Share).
Apache Spark Streaming w. TCP: Stream Source, Master, Worker 1 ... Worker N; Message Transfer.
Apache Spark Streaming w. Kafka: Stream Source, Kafka Server, Master, Worker 1 ... Worker N; Message Transfer.
HarmonicIO: Stream Source, Master, Worker 1 ... Worker N; Message Transfer – Queue Mode, Message Transfer – P2P Mode.]

SLIDE 7

Experimental Setup

[Figure: experimental setup. Components: Streaming Source, Spark Benchmarking Application, Throttling Application. Message content: CPU Pause, padded to length. Arrow labels: Monitoring via Logs, REST API; Monitoring via Logs; Sets Message Size, CPU Cost via REST API.]
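The slides do not say how the Throttling Application decides on a frequency. One plausible approach, shown purely as an assumption, is to search for the highest frequency at which the framework still keeps up. The sketch below bisects on a geometric scale using a hypothetical `keeps_up(frequency_hz)` predicate, which in a real setup would be answered from the monitoring data; it also assumes that if the framework keeps up at one frequency, it keeps up at every lower one.

```python
def find_max_frequency(keeps_up, low_hz=1.0, high_hz=1_000_000.0, rel_tol=0.05):
    """Search for the highest frequency at which keeps_up(f) is True.

    keeps_up is a hypothetical callback (an assumption of this sketch);
    it must be monotonic: True below some threshold, False above it.
    Returns 0.0 if even low_hz is too fast, or high_hz if nothing fails.
    """
    if not keeps_up(low_hz):
        return 0.0
    if keeps_up(high_hz):
        return high_hz
    # Invariant: keeps_up(low_hz) is True and keeps_up(high_hz) is False.
    while high_hz / low_hz > 1.0 + rel_tol:
        mid = (low_hz * high_hz) ** 0.5   # geometric midpoint suits a log-scale range
        if keeps_up(mid):
            low_hz = mid
        else:
            high_hz = mid
    return low_hz


# Stand-in for a real measurement: pretend the system saturates at 250 Hz.
estimate = find_max_frequency(lambda f: f <= 250.0)
```

The returned value is a sustainable frequency within `rel_tol` of the true threshold; repeating the search at each (message size, CPU cost) point would give one cell of a 2D sweep.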

SLIDE 8

The Workload

StreamingBenchmark.scala
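The contents of StreamingBenchmark.scala are not reproduced in this transcript. As a rough illustration of the synthetic workload suggested by the setup slide ("CPU Pause, padded to length"), the Python sketch below builds a message that carries its own CPU cost and is padded to a target size, and a map function that parses the cost and busy-waits for that long. The message layout and the function names are assumptions, not the authors' code.

```python
import time


def make_message(cpu_pause_s, size_bytes):
    """Build a message of exactly size_bytes that encodes its CPU cost.

    Assumed layout (hypothetical): an ASCII header '<seconds>;' followed
    by filler bytes up to the requested length.
    """
    header = f"{cpu_pause_s:.6f};".encode("ascii")
    if len(header) > size_bytes:
        raise ValueError("size_bytes is too small to hold the header")
    return header + b"0" * (size_bytes - len(header))


def process_message(message):
    """Map function: read the CPU cost from the header and burn that much CPU."""
    cpu_pause_s = float(message.split(b";", 1)[0])
    deadline = time.perf_counter() + cpu_pause_s
    while time.perf_counter() < deadline:
        pass  # busy-wait so the core is occupied, unlike time.sleep
    return cpu_pause_s


message = make_message(cpu_pause_s=0.01, size_bytes=1024)
cost = process_message(message)
```

Encoding the cost in the message itself lets a single streaming source control both axes of the parameter space without reconfiguring the processing job.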

SLIDE 9

Dark = high frequency; black = best; light = low frequency.

SLIDE 10

Excellent performance near origin.
300 kHz
Relatively weaker for high CPU load
Cores used for message forwarding
Crashes for large messages.

SLIDE 11

Excellent performance near origin.
At origin, beaten by Spark+TCP
Weaker for high CPU load
Overhead of Kafka server
Weaker for larger messages.
Not intended use case.

SLIDE 12

Great performance at low frequencies.
Spark’s filesystem polling struggles at high frequency.

SLIDE 13

Good overall performance.
Able to match performance of Spark+FS and Spark+Kafka in their regions of good performance
…and in between.
Struggles at higher frequencies near origin.

SLIDE 14

Results – Theoretical Bounds

SLIDE 15

Performance for nil CPU Load

SLIDE 16

Discussion

‘islands’ of good performance.

SLIDE 17

Conclusions

Choice of Spark Integration matters

– depends on the parameters, frequency.

2D Parameter Sweep is a nice way to visualize performance.
Various phenomena visible only in some regions:

– Bottlenecks, overhead costs.
– Varying utility (w.r.t. theoretical bounds).

‘Middle Region’ – 1-10 MB, >1 second cost

– Neglected in streaming benchmark studies?
– A region where HarmonicIO does well.

SLIDE 18

Funding

The HASTE Project (http://haste.research.it.uu.se/) is funded by the Swedish Foundation for Strategic Research (SSF) under award no. BD15-0008, and the eSSENCE strategic collaboration for eScience.

SLIDE 19

Questions?

http://haste.research.it.uu.se/
https://github.com/HASTE-project/HarmonicIO
https://github.com/HASTE-project

SLIDE 20