Apache Spark Streaming, Kafka and HarmonicIO: A Performance Benchmark and Architecture Comparison for Enterprise and Scientific Computing
Ben Blamey, Andreas Hellander and Salman Toor
Uppsala University, Sweden
Ben.Blamey@it.uu.se
Bench19
Summary
- Performance Benchmark for Streaming Frameworks
– Apache Spark (under various integrations…)
– HarmonicIO
- Large Message Size (and higher processing cost)
– Scientific use cases: microscopy
- Key finding: ‘islands’ of good performance over the 2D domain (message size × CPU cost), with varying utility w.r.t. theoretical bounds.
Background
- Apache Spark
– Enterprise grade (resilient, great features, etc.)
– Proven performance for typical enterprise use cases.
- HASTE Project:
– Microscopy use cases
– Message size 1-10 MB, >1 second of processing per message.
- How well do enterprise tools adapt to scientific computing?
The Parameter Space
- 2D Parameter Space
- Theoretical Bounds
– Network
– CPU
- How does performance generalize across this domain?
[Figure: the parameter space. Axes: CPU Cost (Map Function) and Message Size. Regions: (A) CPU Bound, (B) Network Bound, (C) ‘Framework’ Bound.]
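The two theoretical bounds can be sketched as a small calculation. This is only an illustration: the function name, the core count and the link speed are assumptions chosen for the example, not values from the slides.

```python
def theoretical_max_frequency(message_bytes, cpu_cost_s, total_cores, link_bits_per_s):
    """Upper bound on sustainable message frequency (Hz) at one point of the 2D domain.

    All parameters are illustrative assumptions; the slides do not give cluster specs.
    """
    # Network bound: the link carries at most this many messages per second.
    network_bound = link_bits_per_s / (8 * message_bytes)
    # CPU bound: each core completes 1 / cpu_cost_s messages per second.
    cpu_bound = float("inf") if cpu_cost_s == 0 else total_cores / cpu_cost_s
    limiting = "network" if network_bound <= cpu_bound else "cpu"
    return min(network_bound, cpu_bound), limiting


# Hypothetical cluster: 40 worker cores behind a 1 Gbit/s link, 10 MB messages, 1 s each.
bound_hz, limiting = theoretical_max_frequency(
    message_bytes=10_000_000, cpu_cost_s=1.0, total_cores=40, link_bits_per_s=1e9
)
print(bound_hz, limiting)
```

Under these assumed numbers the network is the limiting resource; shrinking the message or raising the CPU cost moves the point into the CPU-bound region.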
HarmonicIO
Image source: Torruangwatthana et al., HarmonicIO: Scalable Data Stream Processing for Scientific Datasets, IEEE Services 2018
- Favors P2P message transfer.
- Fallback to Master Queue
- Processing runs inside Docker containers.
- Intended for scientific computing applications.
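The P2P-first, queue-fallback behaviour listed above can be modelled with a toy dispatcher. This is a sketch of the idea only: the class and method names are hypothetical and do not reflect HarmonicIO's actual API or implementation.

```python
from collections import deque


class ToyMaster:
    """Toy model (hypothetical names): prefer direct transfer, else queue at the master."""

    def __init__(self, worker_ids):
        self.idle = deque(worker_ids)   # workers currently free to accept a message
        self.queue = deque()            # fallback backlog held at the master

    def send(self, message):
        # P2P mode: an idle worker exists, so the source sends to it directly.
        if self.idle:
            return ("p2p", self.idle.popleft())
        # Queue mode: no worker is free, so the message waits at the master.
        self.queue.append(message)
        return ("queued", None)

    def worker_finished(self, worker):
        # A freed worker drains the backlog before being marked idle again.
        if self.queue:
            return ("from_queue", worker, self.queue.popleft())
        self.idle.append(worker)
        return ("idle", worker, None)


master = ToyMaster(["w1"])
print(master.send("frame-1"))         # sent directly to w1
print(master.send("frame-2"))         # w1 is busy, so this one is queued
print(master.worker_finished("w1"))   # w1 picks up frame-2 from the queue
```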
Methodology
[Figure: architecture diagrams for the four setups. Each shows a Stream Source, a Master and Worker 1 ... Worker N.]
- Apache Spark Streaming w. File Streaming: File Listing (NFS Share), File Transfer (NFS Share)
- Apache Spark Streaming w. TCP: Message Transfer
- Apache Spark Streaming w. Kafka: Kafka Server, Worker 1 ... Worker N-1, Message Transfer
- HarmonicIO: Message Transfer – Queue Mode, Message Transfer – P2P Mode
Experimental Setup
[Figure: experimental setup. Components: Spark Benchmarking Application, Streaming Source, Throttling Application. Message content: CPU Pause, padded to length. Arrows: Monitoring via Logs, REST API; Monitoring via Logs; Sets Message Size, CPU Cost via REST API.]
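The setup diagram labels the message content as "CPU Pause, padded to length". One way such a synthetic message could be built and consumed is sketched below. The 8-byte header layout, the function names and the busy-wait are assumptions made for illustration; the actual benchmark format is not given in the slides.

```python
import struct
import time

# Assumed layout: an 8-byte big-endian float holding the CPU pause in seconds,
# followed by zero padding up to the requested message size.
HEADER = struct.Struct("!d")


def make_message(cpu_pause_s, size_bytes):
    """Build a synthetic message of exactly size_bytes carrying its own CPU cost."""
    if size_bytes < HEADER.size:
        raise ValueError("message too small to hold the header")
    return HEADER.pack(cpu_pause_s).ljust(size_bytes, b"\0")


def process_message(message):
    """Stand-in for the map function: keep one core busy for the encoded duration."""
    (cpu_pause_s,) = HEADER.unpack_from(message)
    deadline = time.perf_counter() + cpu_pause_s
    while time.perf_counter() < deadline:
        pass  # busy-wait, so the core is occupied rather than sleeping
    return cpu_pause_s


message = make_message(cpu_pause_s=0.01, size_bytes=1024)
print(len(message), process_message(message))
```

Encoding the cost inside the message lets a single source vary both axes of the parameter space without redeploying the processing code.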
The Workload
StreamingBenchmark.scala
[Figure legend: Dark = high frequency; Black = best; Light = low frequency.]
- Excellent performance near the origin.
- 300 kHz
- Relatively weaker for high CPU load
- Cores used for message forwarding
- Crashes for large messages.
- Excellent performance near the origin.
- At the origin, beaten by Spark+TCP
- Weaker for high CPU load
- Overhead of Kafka server
- Weaker for larger messages.
- Not intended use case.
- Great performance at low frequencies.
- Spark's filesystem polling struggles at high frequency.
- Good overall performance.
- Able to match the performance of Spark+FS and Spark+Kafka in their regions of good performance
- …and in between.
- Struggles at higher frequencies near the origin.
Results – Theoretical Bounds
Performance for nil CPU Load
Discussion
- ‘Islands’ of good performance.
Conclusions
- Choice of Spark Integration matters
– depends on the parameters, frequency.
- A 2D parameter sweep is a nice way to visualize performance.
- Various phenomena are visible only in some regions:
– Bottlenecks, overhead costs.
– Varying utility (w.r.t. theoretical bounds).
- ‘Middle Region’ – 1-10 MB, >1 second cost
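A 2D sweep that reports utility relative to the theoretical bound might be organised as follows. This is a minimal sketch: `measure` and `bound` are hypothetical callables standing in for a real benchmark run and a bound calculation, and the grid ranges are illustrative only.

```python
def log_grid(lo, hi, points):
    """Return `points` logarithmically spaced values from lo to hi inclusive."""
    ratio = (hi / lo) ** (1 / (points - 1))
    return [lo * ratio ** i for i in range(points)]


def sweep(sizes, cpu_costs, measure, bound):
    """Map each (message size, CPU cost) cell to measured / theoretical frequency."""
    return {
        (size, cost): measure(size, cost) / bound(size, cost)
        for size in sizes
        for cost in cpu_costs
    }


# Illustrative grid: 1 kB to 10 MB messages, 1 ms to 1 s of CPU per message.
sizes = log_grid(1e3, 1e7, 5)
costs = log_grid(1e-3, 1.0, 4)

# Placeholder callables: a real sweep would run the framework and read its logs.
fake_bound = lambda size, cost: 100.0
fake_measure = lambda size, cost: 50.0

utilities = sweep(sizes, costs, fake_measure, fake_bound)
print(len(utilities), min(utilities.values()), max(utilities.values()))
```

Plotting these per-cell utilities as a heatmap is what would make the ‘islands’ and the middle region visible at a glance.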