Data Processing at the Speed of 100 Gbps using Apache Crail

  1. Data Processing at the Speed of 100 Gbps using Apache Crail. Patrick Stuedi, IBM Research

  2. Apache Crail (crail.apache.org)

  4. Ephemeral Data. [Diagram: input data (HDFS, S3); map-reduce job with broadcast, map, shuffle, and reduce stages; output data (HDFS, S3).]

  6. Ephemeral Data. [Same diagram, with Apache Crail shown at the broadcast, map, shuffle, and reduce stages.]

  7. Ephemeral Data. [Same diagram, with intermediate data (HDFS, S3) added.]

  8. Ephemeral Data. [Diagram: ML pre-processing (map-reduce job); normalized images; ML training (Tensorflow job). Labels: input data; HDFS, S3; Apache Crail.]

  11. Why/when to use Crail

  12. Why/when to use Crail. No Crail needed: 100MB/s, 10ms; 10Gb/s, 20us.

  13. Why/when to use Crail. Crail land (100x): 10GB/s, 10us; 200Gb/s, 1us. No Crail needed: 100MB/s, 10ms; 10Gb/s, 20us.

  14. Why/when to use Crail. Terasort, 12.8 TB data, 128 nodes: Spark/Crail 88.3s, Spark/Vanilla 527.6s. [Figure: throughput (Gbit/s) versus elapsed time (seconds) for Spark/Crail and Spark/Vanilla, with the hardware limit marked.]
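
The Terasort figures on this slide can be turned into a speedup and an end-to-end average rate. The sketch below uses only the numbers quoted on the slide; the function names are made up for illustration, and it assumes decimal units (1 TB = 1e12 bytes) and that each byte is counted once over the whole run, so the result is an average and not the peak shuffle throughput drawn in the plot.

```python
# Figures quoted on the Terasort slide.
DATA_TB = 12.8            # total data sorted
NODES = 128               # cluster size
CRAIL_SECONDS = 88.3      # Spark/Crail elapsed time
VANILLA_SECONDS = 527.6   # Spark/Vanilla elapsed time


def speedup(baseline_s: float, improved_s: float) -> float:
    """How many times faster the improved run is than the baseline."""
    return baseline_s / improved_s


def avg_gbit_per_s_per_node(data_tb: float, nodes: int, seconds: float) -> float:
    """End-to-end average rate per node, counting each byte once.

    Assumption: decimal units (1 TB = 1e12 bytes, 1 Gbit = 1e9 bits).
    """
    total_bits = data_tb * 1e12 * 8
    return total_bits / nodes / seconds / 1e9


if __name__ == "__main__":
    print(f"speedup: {speedup(VANILLA_SECONDS, CRAIL_SECONDS):.2f}x")
    print(f"Spark/Crail:   {avg_gbit_per_s_per_node(DATA_TB, NODES, CRAIL_SECONDS):.2f} Gbit/s per node")
    print(f"Spark/Vanilla: {avg_gbit_per_s_per_node(DATA_TB, NODES, VANILLA_SECONDS):.2f} Gbit/s per node")
```

Under these assumptions the Crail run is roughly 6x faster, and the averaged per-node rate rises from about 1.5 Gbit/s to about 9 Gbit/s.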

  15. Performance Challenge. [Diagram of the software stack: Sorting Application (Sorter, Serializer), Data Processing Framework, JVM; network path: sockets, Netty, TCP/IP, Ethernet, NIC; storage path: filesystem, block layer, iSCSI, SSD.]

  16. Performance Challenge. Two steps are highlighted on the stack: fetch chunk over the network, and process chunk in reduce task (HotNets’16).

  18. Performance Challenge. Software overheads are spread over the entire stack (HotNets’16).

  19. Crail Overview. Multiple interfaces; multiple storage backends (pluggable, open interface).

  20. Crail Overview. Multiple interfaces; multiple storage backends (pluggable, open interface), with the primary high-performance storage backends highlighted.

  22. Crail Architecture & API. Annotations: MultiFile, optimized for shuffle data; key-value semantics; append-only file.

  25. Crail Architecture & API. [API examples in Java and C++, annotated: MultiFile; node type; non-blocking & asynchronous.]
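
The slide describes the API as non-blocking and asynchronous, and an earlier slide mentions append-only files. A minimal Python sketch of that calling pattern follows. `ToyAppendOnlyFile`, `append`, `poll`, and `contents` are hypothetical names invented for illustration; this is an in-memory stand-in, not the Crail Java or C++ API.

```python
from collections import deque
from concurrent.futures import Future


class ToyAppendOnlyFile:
    """Hypothetical in-memory stand-in; NOT the Crail API."""

    def __init__(self) -> None:
        self._data = bytearray()
        self._pending = deque()  # queued (payload, future) pairs

    def append(self, payload: bytes) -> Future:
        """Queue an append and return immediately with a future."""
        future = Future()
        self._pending.append((payload, future))
        return future

    def poll(self) -> int:
        """Complete queued appends in order; return how many finished."""
        completed = 0
        while self._pending:
            payload, future = self._pending.popleft()
            offset = len(self._data)
            self._data.extend(payload)
            future.set_result(offset)  # offset at which the payload landed
            completed += 1
        return completed

    def contents(self) -> bytes:
        """Copy of everything appended so far."""
        return bytes(self._data)


if __name__ == "__main__":
    f = ToyAppendOnlyFile()
    first = f.append(b"key1")     # returns at once, nothing written yet
    second = f.append(b"value1")  # the caller could do other work here
    f.poll()                      # drive completion without extra threads
    print(first.result(), second.result(), f.contents())
```

The point of the pattern is that issuing an operation and waiting for it are separate steps, so a caller can keep several operations in flight before collecting results.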

  26. Where does the performance come from?

  28. User-Level I/O: Metadata. Crail client library: no threads, no context switches. [Diagram with numbered steps 1 and 2.]

  30. User-Level I/O: Data. Zero-copy; transfer only data that is requested. [Diagram with the application and numbered steps 1 and 2.]
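
The two properties on this slide, zero-copy and transferring only the requested data, can be illustrated in-process with Python's `memoryview`. This is only an analogy under stated assumptions: `read_range` is a hypothetical helper, and the sketch shows a byte range being handed out without an intermediate copy, not how Crail moves data over the network.

```python
def read_range(buffer: bytearray, offset: int, length: int) -> memoryview:
    """Return a view of exactly the requested bytes without copying them."""
    if offset < 0 or length < 0 or offset + length > len(buffer):
        raise ValueError("requested range is outside the buffer")
    return memoryview(buffer)[offset:offset + length]


if __name__ == "__main__":
    block = bytearray(b"0123456789" * 100)  # stand-in for a stored block
    view = read_range(block, 10, 4)         # only 4 bytes are exposed
    print(bytes(view))                      # b'0123'
    block[10] = ord("X")                    # change the underlying block
    print(bytes(view))                      # b'X123': the view shares memory
```

Because the view shares memory with the block, a change to the block is visible through the view, which is what distinguishes it from a copied slice.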

  31. Crail Deployment Modes: compute/storage co-located; storage disaggregation; flash storage disaggregation.

  32. YCSB KeyValue Workload. [Figures: GET latency [us], value size 1KB and value size 100KB.] Crail offers GET latencies of ~12us and 30us for DRAM and NVM, respectively, for 100-byte KV pairs, and ~30us and 40us for 1000-byte KV pairs.

  33. Spark GroupBy (80M keys, 4K). [Figures: throughput (Gbit/s) versus elapsed time (seconds) for Spark/Vanilla and Spark/Crail with 1, 4, and 8 cores; the Spark/Crail panel is marked 2x, 2.5x, and 5x.] Spark shuffling via Crail on a single core is 2x faster than vanilla Spark on 8 cores per executor (8 executors).

  34. DRAM & Flash Disaggregation Crail enables disaggregation of temporary data at no cost

  35. DRAM/Flash Tiering. [Figure: map and reduce runtime (seconds) for vanilla Spark (100% memory) and for memory-to-flash ratios 100/0, 80/20, 60/40, 40/60, 20/80, and 0/100.] Using only flash increases the sorting time by around 48%.

  36. Conclusions
      ● Apache Crail: Fast distributed “tmp”
        – User-level I/O
        – Storage disaggregation
        – Memory/flash convergence
      ● Applications
        – Intra-job scratch space (shuffle, broadcast, etc.)
        – Multi-job pipelines
      ● Coming soon
        – Native Crail (C++)
        – Tensorflow-Crail
