Apache Spark Streaming, Kafka and HarmonicIO: A Performance Benchmark and Architecture Comparison for Enterprise and Scientific Computing


  1. Apache Spark Streaming, Kafka and HarmonicIO: A Performance Benchmark and Architecture Comparison for Enterprise and Scientific Computing Ben Blamey, Andreas Hellander and Salman Toor Uppsala University, Sweden Ben.Blamey@it.uu.se Bench’19, Denver, USA, November 2019. http://www.benchcouncil.org/bench19/index.html

  2. Summary • Performance Benchmark for Streaming Frameworks – Apache Spark (under various integrations…) – HarmonicIO • Large Message Size (and higher processing cost) – Scientific use cases: microscopy • Key finding: ‘islands’ of good performance over that 2D domain, varying utility w.r.t. theoretical bounds.

  3. Background • Apache Spark – Enterprise grade (resilient, great features, etc.) – Proven performance for typical enterprise use cases. • HASTE Project: – Microscopy use cases – Message Size 1-10MB, >1 second per message. • How well do enterprise tools adapt to sci. computing?

  4. The Parameter Space • 2D Parameter Space • Theoretical Bounds – Network – CPU • How does performance generalize across this domain? [Figure: map function CPU cost (vertical axis) against message size (horizontal axis), with regions (A) CPU Bound, (B) Network Bound and (C) ‘Framework’ Bound.]
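
The two theoretical bounds named on this slide can be sketched as a small calculation. This is a minimal illustration, not the authors' code: the function names, the link speed and the core count are assumptions chosen for the example, and the ‘framework’ bound (C) is not modeled because the slides give no formula for it.

```python
def theoretical_max_frequency(message_bytes, cpu_cost_s, link_bytes_per_s, total_cores):
    """Upper bound on message frequency (Hz) at one point of the 2D domain.

    All four parameters are hypothetical inputs for illustration.
    """
    # Network bound: the link is saturated by message payloads.
    network_bound = link_bytes_per_s / message_bytes
    # CPU bound: every core is busy running the map function.
    cpu_bound = float("inf") if cpu_cost_s == 0 else total_cores / cpu_cost_s
    return min(network_bound, cpu_bound)


def limiting_resource(message_bytes, cpu_cost_s, link_bytes_per_s, total_cores):
    """Name the bound that applies: 'network' or 'cpu'."""
    network_bound = link_bytes_per_s / message_bytes
    cpu_bound = float("inf") if cpu_cost_s == 0 else total_cores / cpu_cost_s
    return "network" if network_bound <= cpu_bound else "cpu"


# Assumed cluster: 10 Gbit/s link (1.25e9 bytes/s) and 40 worker cores.
# A 10 MB message costing 1 s of CPU is CPU bound at 40 Hz.
print(theoretical_max_frequency(10_000_000, 1.0, 1.25e9, 40))
print(limiting_resource(10_000_000, 1.0, 1.25e9, 40))
```

A measured frequency can then be divided by this bound to express how close a framework gets to the theoretical limit at each point of the domain.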

  5. HarmonicIO • Favors P2P message transfer. • Fallback to Master Queue • Processing runs inside Docker containers. • Intended for scientific computing applications. Image source: Torruangwatthana et al., HarmonicIO: Scalable Data Stream Processing for Scientific Datasets, IEEE Services 2018
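
The "P2P first, master queue as fallback" idea can be sketched as follows. This is a toy model under stated assumptions, not HarmonicIO's actual API: the `Master` class and its method names are hypothetical, and networking and Docker containers are left out entirely.

```python
from collections import deque


class Master:
    """Hypothetical dispatcher: tracks idle workers and a backlog queue."""

    def __init__(self, worker_ids):
        self.idle = deque(worker_ids)  # workers free to receive a message
        self.queue = deque()           # backlog held at the master

    def submit(self, message):
        """Return the worker to send to directly (P2P), or None if queued."""
        if self.idle:
            return self.idle.popleft()
        self.queue.append(message)     # fallback: no idle worker available
        return None

    def worker_finished(self, worker_id):
        """Give a freed worker the oldest queued message, or mark it idle."""
        if self.queue:
            return self.queue.popleft()
        self.idle.append(worker_id)
        return None


master = Master(["w1"])
print(master.submit(b"a"))           # sent P2P to the only idle worker
print(master.submit(b"b"))           # no idle worker, so it is queued
print(master.worker_finished("w1"))  # the freed worker drains the queue
```

In this model the queue is touched only when every worker is busy, which is the sense in which direct transfer is favored.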

  6. Methodology [Figure: the four architectures compared, each with a stream source, a master and workers 1 to N. HarmonicIO: message transfer in P2P mode or queue mode. Apache Spark Streaming w. File Streaming: file transfer and file listing over an NFS share. Apache Spark Streaming w. TCP: message transfer. Apache Spark Streaming w. Kafka: message transfer via a Kafka server.]

  7. Experimental Setup [Figure: experimental setup. Recoverable labels: Streaming Source; Spark; Benchmarking Application; Throttling Application; Monitoring; message content is a CPU pause, padded to length. The remaining labels are not recoverable.]

  8. The Workload StreamingBenchmark.scala
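
The synthetic workload described in the setup (a CPU pause, padded to length) might look roughly like the sketch below. The original is StreamingBenchmark.scala, which is not reproduced in this transcript, so everything here is an assumption: the message layout, the function names, and the use of a busy-wait rather than a sleep.

```python
import time


def make_message(cpu_pause_s, size_bytes):
    """Build a message whose header encodes the CPU pause (hypothetical format)."""
    header = f"C{cpu_pause_s:.6f}|".encode("ascii")
    if len(header) > size_bytes:
        raise ValueError("size_bytes is too small to hold the header")
    # Pad with filler bytes so the message has exactly the requested size.
    return header + b"x" * (size_bytes - len(header))


def process_message(message):
    """Map function: parse the pause from the header and burn CPU for that long."""
    header, _, _ = message.partition(b"|")
    pause_s = float(header[1:].decode("ascii"))
    deadline = time.perf_counter() + pause_s
    while time.perf_counter() < deadline:
        pass  # busy-wait, so the core stays occupied (assumed behavior)
    return pause_s


msg = make_message(0.01, 1000)
print(len(msg), process_message(msg))
```

Encoding the cost in the message itself lets one source sweep both axes of the parameter space, message size and CPU cost, without redeploying the processing application.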

  9. Dark = High Freq Black = Best Light = Low Freq

  10. - Excellent Performance near origin. - 300 kHz - Relatively weaker for high CPU Load - Cores used for message forwarding - Crashes for Large Messages.

  11. - Excellent Performance near origin. - At origin, beaten by Spark+TCP - Weaker for high CPU load - Overhead of Kafka server - Weaker for larger messages. - Not intended use case.

  12. - Great Performance at low frequencies. - Spark’s filesystem polling struggles at high frequency.

  13. - Good overall performance. - Able to match performance of Spark+FS, and Spark+Kafka in their regions of good performance - …and in between. - Struggles at higher frequencies near origin.

  14. Results – Theoretical Bounds

  15. Performance for nil CPU Load

  16. Discussion • ‘islands’ of good performance.

  17. Conclusions • Choice of Spark Integration matters – depends on the parameters, frequency. • 2D Parameter Sweep is a nice way to viz. performance. • Various phenomena visible only in some regions: – Bottlenecks, overhead costs. – Varying utility (w.r.t. theoretical bounds). • ‘Middle Region’ – 1-10MB, >1 second cost – Neglected in streaming benchmark studies? – A region where HarmonicIO does well.

  18. Funding The HASTE Project (http://haste.research.it.uu.se/) is funded by the Swedish Foundation for Strategic Research (SSF) under award no. BD15-0008, and the eSSENCE strategic collaboration for eScience.

  19. Questions? Apache Spark Streaming, Kafka and HarmonicIO: A Performance Benchmark and Architecture Comparison for Enterprise and Scientific Computing Ben Blamey, Andreas Hellander and Salman Toor Uppsala University, Sweden http://haste.research.it.uu.se/ https://github.com/HASTE-project/HarmonicIO https://github.com/HASTE-project

  20. Results
