Data Processing at the Speed of 100 Gbps using Apache Crail
Patrick Stuedi, IBM Research
Apache Crail (crail.apache.org)
Ephemeral Data

[Diagram: a map-reduce job reads input data from HDFS/S3 and writes output data back to HDFS/S3; the ephemeral intermediate data exchanged inside the job (broadcast and shuffle data between map and reduce tasks) is stored in Apache Crail instead of HDFS/S3.]

[Diagram: a multi-job pipeline in which an ML pre-processing job (map-reduce) reads input data from HDFS/S3 and hands normalized images to an ML training job (Tensorflow) through Apache Crail.]
Why/when to use Crail

No Crail needed: storage at 100 MB/s with 10 ms latency, networks at 10 Gb/s with 20 us latency.
Crail land (roughly 100x faster hardware): storage at 10 GB/s with 10 us latency, networks at 200 Gb/s with 1 us latency.

[Plot: TeraSort, 12.8 TB of data on 128 nodes; network throughput (Gbit/s) over elapsed time (seconds). Spark/Crail completes in 88.3 s, running close to the hardware limit; vanilla Spark takes 527.6 s.]
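To make the hardware shift concrete, here is a hypothetical back-of-the-envelope model (not from the talk) using the bandwidth and latency figures above: the time to fetch a 1 MB chunk drops from roughly 860 us on the 10 Gb/s stack to roughly 43 us on the 200 Gb/s stack, so a fixed software overhead of, say, 100 us per operation is negligible on the old hardware but dominant on the new.

```java
// Hypothetical per-chunk fetch model: time = latency + size / bandwidth.
public class TransferTime {
    static double chunkSeconds(double latencySec, double bwBitsPerSec, double chunkBytes) {
        return latencySec + (chunkBytes * 8) / bwBitsPerSec;
    }

    public static void main(String[] args) {
        double chunk = 1 << 20; // a 1 MB shuffle chunk
        double slow = chunkSeconds(20e-6, 10e9, chunk);  // "no Crail needed": 10 Gb/s, 20 us
        double fast = chunkSeconds(1e-6, 200e9, chunk);  // "Crail land": 200 Gb/s, 1 us
        System.out.printf("10 Gb/s: %.0f us/chunk, 200 Gb/s: %.0f us/chunk (%.0fx)%n",
                slow * 1e6, fast * 1e6, slow / fast);
    }
}
```

For 1 MB chunks this works out to about a 20x gap per chunk; the full 100x only materializes once the software stack stops adding overhead of its own.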
Performance Challenge

[Diagram: the stack of a sorting application, from the Sorter and Serializer in the JVM through the data processing framework, Netty, sockets, and TCP/IP down to the Ethernet NIC on the network path, and through the filesystem, block layer, and iSCSI down to the SSD on the storage path. In a reduce task, fetching a chunk over the network overlaps with processing the chunk. (HotNets'16)]

Software overheads are spread over the entire stack.
Crail Overview

- Multiple interfaces
- Multiple storage backends (pluggable, open interface), including primary high-performance storage backends
Crail Architecture & API

- Node types: e.g. MultiFile (optimized for shuffle data), nodes with key-value semantics, and append-only files
- Client APIs in Java and C++
- Operations are non-blocking & asynchronous
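The slides stress that the API is non-blocking and asynchronous: an operation returns immediately with a future, so the client can overlap computation with in-flight I/O. The Crail calls themselves are not shown in the deck; the sketch below illustrates the same pattern with plain `java.nio` (`AsynchronousFileChannel` is a stand-in here, not the Crail API).

```java
import java.nio.ByteBuffer;
import java.nio.channels.AsynchronousFileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.concurrent.Future;

public class AsyncReadSketch {
    // Issue a read without blocking, then wait only at the point of use.
    static String readAsync(Path p) throws Exception {
        try (AsynchronousFileChannel ch =
                 AsynchronousFileChannel.open(p, StandardOpenOption.READ)) {
            ByteBuffer buf = ByteBuffer.allocateDirect(4096);
            Future<Integer> pending = ch.read(buf, 0); // returns immediately
            // ... overlap computation with the in-flight I/O here ...
            int n = pending.get();                     // block only when data is needed
            buf.flip();
            byte[] data = new byte[n];
            buf.get(data);
            return new String(data);
        }
    }

    public static void main(String[] args) throws Exception {
        Path p = Files.createTempFile("crail-sketch", ".dat");
        Files.write(p, "ephemeral data".getBytes());
        System.out.println(readAsync(p));
        Files.delete(p);
    }
}
```

The design point is the same one the deck makes for Crail: the caller decides where to pay the synchronization cost, rather than the library blocking on every operation.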
Where does the performance come from?
User-Level I/O: Metadata

[Diagram: the Crail client library issues metadata operations directly from user space: no threads, no context switches.]
User-Level I/O: Data

[Diagram: data moves zero-copy between the application and storage; only the data that is actually requested is transferred.]
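Crail's zero-copy data path is built on user-level networking hardware and cannot be reproduced in a few lines; as a loose analogy only, the snippet below uses `FileChannel.transferTo`, the classic JVM zero-copy primitive, which moves bytes without staging them in a user-space buffer.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class ZeroCopySketch {
    // Copy src to dst without pulling the bytes through a user-space buffer.
    static long transfer(Path src, Path dst) throws IOException {
        try (FileChannel in = FileChannel.open(src, StandardOpenOption.READ);
             FileChannel out = FileChannel.open(dst, StandardOpenOption.WRITE,
                                               StandardOpenOption.CREATE,
                                               StandardOpenOption.TRUNCATE_EXISTING)) {
            long pos = 0, size = in.size();
            while (pos < size) {
                // transferTo may move fewer bytes than requested, so loop
                pos += in.transferTo(pos, size - pos, out);
            }
            return pos;
        }
    }

    public static void main(String[] args) throws IOException {
        Path src = Files.createTempFile("zc-src", ".dat");
        Path dst = Files.createTempFile("zc-dst", ".dat");
        Files.write(src, new byte[1 << 20]); // 1 MB of data
        System.out.println("copied " + transfer(src, dst) + " bytes");
        Files.delete(src);
        Files.delete(dst);
    }
}
```

In both cases the win is the same: the payload is never copied into and back out of an intermediate application buffer on its way to the destination.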
Crail Deployment Modes
- Compute/storage co-located
- Storage disaggregation
- Flash storage disaggregation
YCSB KeyValue Workload
Crail offers GET latencies of ~12 us (DRAM) and ~30 us (NVM) for 100-byte KV pairs, and ~30 us (DRAM) and ~40 us (NVM) for 1000-byte KV pairs.

[Plots: GET latency [us], one panel per value size (1KB and 100KB).]
Spark GroupBy (80M keys, 4K)
[Plots: network throughput (Gbit/s) over elapsed time (seconds) for Spark/Vanilla and Spark/Crail with 1, 4, and 8 cores per executor; annotated speedups of 5x, 2.5x, and 2x.]
Spark shuffling via Crail on a single core is 2x faster than vanilla Spark on 8 cores per executor (8 executors)
DRAM & Flash Disaggregation
Crail enables disaggregation of temporary data at no cost
DRAM/Flash Tiering
Using only flash (no DRAM) increases the sorting time by around 48%.
[Plot: runtime (seconds) of the map and reduce phases across memory-to-flash ratios from 100/0 to 0/100, compared to vanilla Spark at 100% memory.]
- Apache Crail: Fast distributed “tmp”
  – User-level I/O
  – Storage disaggregation
  – Memory/flash convergence
- Applications
  – Intra-job scratch space (shuffle, broadcast, etc.)
  – Multi-job pipelines
- Coming soon
  – Native Crail (C++)
  – Tensorflow-Crail