Data Processing at the Speed of 100 Gbps using Apache Crail - PowerPoint PPT Presentation



SLIDE 1

Patrick Stuedi IBM Research

Data Processing at the Speed of 100 Gbps using Apache Crail

SLIDE 2

Apache Crail (crail.apache.org)


SLIDE 4

Ephemeral Data

(Diagram: a map-reduce job reads input data from HDFS/S3, runs Map, Broadcast, Shuffle, and Reduce stages, and writes output data back to HDFS/S3.)


SLIDE 6

Ephemeral Data

(Diagram: the same map-reduce pipeline, now with Apache Crail serving the Broadcast and Shuffle data; input and output data remain on HDFS/S3.)

SLIDE 7

Ephemeral Data

(Diagram: as above, with the Broadcast/Shuffle output labeled as intermediate data held by Apache Crail; HDFS/S3 stores the input and output data.)

SLIDE 8

Ephemeral Data

(Diagram: a multi-job pipeline in which ML pre-processing, a map-reduce job, reads input data from HDFS/S3 and produces normalized images for ML training, a TensorFlow job; Apache Crail shown alongside HDFS/S3.)

SLIDE 9

Ephemeral Data

(Diagram: the same pipeline with the normalized images staged on HDFS/S3 between the two jobs.)

SLIDE 10

Ephemeral Data

(Diagram: the same pipeline with the normalized images staged in Apache Crail between the two jobs.)

SLIDE 11

Why/when to use Crail

SLIDE 12

Why/when to use Crail

No Crail needed: storage at ~100 MB/s with ~10 ms latency, network at ~10 Gb/s with ~20 µs latency.

SLIDE 13

Why/when to use Crail

Crail land (hardware roughly 100x faster): storage at ~10 GB/s with ~10 µs latency, network at ~200 Gb/s with ~1 µs latency.

No Crail needed: storage at ~100 MB/s with ~10 ms latency, network at ~10 Gb/s with ~20 µs latency.
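To see why the "Crail land" regime is different, here is a small back-of-envelope sketch; the fixed 10 µs of per-operation software overhead is a hypothetical figure, chosen only for illustration:

```java
// Illustrative only: compare a fixed 10 us per-operation software
// overhead against the network latencies quoted on the slide.
public class OverheadBudget {
    public static void main(String[] args) {
        double softwareUs = 10.0;   // hypothetical per-op software cost
        double oldNetUs = 20.0;     // 10 Gb/s era network latency
        double newNetUs = 1.0;      // 200 Gb/s era network latency

        // Fraction of each operation spent in software, not on the wire
        double oldShare = softwareUs / (softwareUs + oldNetUs); // ~33%
        double newShare = softwareUs / (softwareUs + newNetUs); // ~91%

        System.out.printf("old: %.0f%% of op time, new: %.0f%%%n",
                oldShare * 100, newShare * 100);
    }
}
```

With 100x faster hardware, the same software cost grows from roughly a third of each operation to nearly all of it, which is exactly the gap Crail's user-level I/O targets.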

SLIDE 14

Why/when to use Crail

Crail land (100x) vs. no Crail needed: same storage and network figures as on the previous slides.

(Chart: Terasort on 12.8 TB of data across 128 nodes; throughput in Gbit/s over elapsed time in seconds. Spark/Crail completes in 88.3 s, running close to the 100 Gbit/s hardware limit; vanilla Spark takes 527.6 s.)
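The Terasort figures can be sanity-checked with simple arithmetic; all numbers come from the slide, and the one-pass time assumes each node moves its full partition over the network once:

```java
// Sanity-check of the Terasort figures: 12.8 TB over 128 nodes
// with a 100 Gbit/s per-node network limit.
public class TerasortEnvelope {
    public static void main(String[] args) {
        long totalBytes = 12_800_000_000_000L;              // 12.8 TB
        int nodes = 128;
        long perNodeBytes = totalBytes / nodes;             // 100 GB per node
        double linkBytesPerSec = 100e9 / 8;                 // 100 Gbit/s = 12.5 GB/s
        double onePassSec = perNodeBytes / linkBytesPerSec; // one full network pass
        double speedup = 527.6 / 88.3;                      // vanilla vs Crail runtime
        System.out.printf("per node: %d GB, one network pass: %.0f s, speedup: %.1fx%n",
                perNodeBytes / 1_000_000_000L, onePassSec, speedup);
        // -> per node: 100 GB, one network pass: 8 s, speedup: 6.0x
    }
}
```

At 12.5 GB/s, a single network pass over a node's 100 GB partition already costs ~8 s, so the 88.3 s Spark/Crail runtime is within a small factor of the wire-speed floor, while 527.6 s points to time dominated by software overhead.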

SLIDE 15

Performance Challenge

(Diagram: the software stack of a sorting application: Sorter and Serializer inside the JVM; the data processing framework on Netty and sockets; TCP/IP over an Ethernet NIC on the network path; filesystem, block layer, iSCSI, and SSD on the storage path.)

SLIDE 16

Performance Challenge

(Same stack diagram as on Slide 15, annotated per HotNets’16: fetch chunk over the network, process chunk in the reduce task.)

SLIDE 17

Performance Challenge

(Same stack diagram as on Slide 15.)

HotNets’16

SLIDE 18

Performance Challenge

(Same stack diagram as on Slide 15.)

Software overheads are spread over the entire stack.

HotNets’16

SLIDE 19

Crail Overview

Multiple interfaces; multiple storage backends (pluggable, open interface).
SLIDE 20

Crail Overview

Multiple interfaces; multiple storage backends (pluggable, open interface).

(Diagram labels: primary; high-performance storage backends.)

SLIDE 21

Crail Architecture & API

MultiFile

SLIDE 22

Crail Architecture & API

MultiFile: optimized for shuffle data, with key-value semantics and append-only files.

SLIDE 23

Crail Architecture & API

Java and C++ APIs.

MultiFile

SLIDE 24

Crail Architecture & API

Java and C++ APIs.

Node type

MultiFile

SLIDE 25

Crail Architecture & API

Java and C++ APIs.

non-blocking & asynchronous

MultiFile
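To illustrate the non-blocking, future-based style described above, here is a minimal Java sketch using only the standard library; `readAsync` and its buffer-based signature are hypothetical stand-ins for illustration, not Crail's actual API:

```java
import java.nio.ByteBuffer;
import java.util.concurrent.CompletableFuture;

// Sketch of the non-blocking, asynchronous I/O pattern: a read is
// issued immediately and returns a future, so the caller can overlap
// computation with I/O and only block when the bytes are needed.
public class AsyncReadSketch {
    // Hypothetical stand-in for an asynchronous read into a buffer.
    static CompletableFuture<Integer> readAsync(ByteBuffer dst) {
        return CompletableFuture.supplyAsync(() -> {
            byte[] payload = "hello".getBytes();
            dst.put(payload);          // simulate data arriving in the buffer
            return payload.length;     // number of bytes "read"
        });
    }

    public static void main(String[] args) {
        ByteBuffer buf = ByteBuffer.allocateDirect(64);
        CompletableFuture<Integer> pending = readAsync(buf);
        // ... do other useful work while the read is in flight ...
        int n = pending.join();        // block only when the data is needed
        System.out.println("read " + n + " bytes");
    }
}
```

The direct (off-heap) buffer mirrors the zero-copy theme later in the deck: the application hands the I/O layer a buffer it can fill without intermediate copies.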

SLIDE 26

Where does the performance come from?

SLIDE 27

User-Level I/O: Metadata


Crail client library

SLIDE 28

User-Level I/O: Metadata


Crail client library

No threads, no context switches.

SLIDE 29

User-Level I/O: Data


SLIDE 30

User-Level I/O: Data

Zero-copy: only the data the application requests is transferred.

SLIDE 31

Crail Deployment Modes

Deployment modes: compute/storage co-located; storage disaggregation; flash storage disaggregation.

SLIDE 32

YCSB KeyValue Workload

Crail offers GET latencies of ~12 µs (DRAM) and ~30 µs (NVM) for 100-byte KV pairs, and ~30 µs (DRAM) and ~40 µs (NVM) for 1000-byte KV pairs.

(Charts: GET latency in µs for 1 KB and 100 KB value sizes.)

SLIDE 33

Spark GroupBy (80M keys, 4K)

(Charts: throughput in Gbit/s over elapsed time in seconds for Spark/Vanilla and Spark/Crail, each with 1, 4, and 8 cores per executor; annotated speedups of 5x, 2.5x, and 2x.)

Spark shuffling via Crail on a single core is 2x faster than vanilla Spark with 8 cores per executor (8 executors).

SLIDE 34

DRAM & Flash Disaggregation

Crail enables disaggregation of temporary data at no performance cost.

SLIDE 35

DRAM/Flash Tiering

Using only flash increases the sorting time by around 48%.

(Chart: runtime in seconds of the Map and Reduce phases for memory-to-flash ratios from 100/0 down to 0/100, compared against vanilla Spark at 100% memory.)

SLIDE 36

Conclusions

  • Apache Crail: a fast, distributed “tmp” (user-level I/O, storage disaggregation, memory/flash convergence)

  • Applications: intra-job scratch space (shuffle, broadcast, etc.) and multi-job pipelines

  • Coming soon: Native Crail (C++) and TensorFlow-Crail