Data Processing on Future Hardware
Gustavo Alonso, Systems Group, Computer Science, ETH Zurich, Switzerland
DAGSTUHL 17101, March 2017
www.systems.ethz.ch
Systems Group: 7 faculty, ~30 PhD students, ~11 postdocs, researching all aspects of systems.
Intel HARP: this is an experimental system provided by Intel; any results presented are generated using pre-production hardware and software and may not reflect the performance of production systems.
Istvan et al., FCCM'16
Owaida et al., FCCM'17
Sidler et al., SIGMOD'17
From the Oracle M7 documentation
Target architecture: Intel Xeon+FPGA. An Intel Xeon CPU and an Altera Stratix V FPGA share 96 GB of main memory: the CPU reaches memory at ~30 GB/s through its caches, while the FPGA is attached over QPI at 6.5 GB/s through a QPI endpoint.

[Figure: a partitioner on the FPGA streams relations R and S from memory over QPI and writes back partitioned R and S together with per-partition counts, padded to 64 B cache lines; the CPU cores then consume the partitions. Kaan et al., SIGMOD'17]
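The partitioning scheme sketched above can be illustrated in software. A minimal sketch (plain Python, with illustrative names and tuple sizes; the real system does this in FPGA logic) of hash-partitioning a relation into cache-line-aligned partitions with per-partition counts and padding:

```python
# Software analogue of the FPGA partitioner: hash tuples into partitions,
# record real tuple counts, and pad each partition to a 64 B cache-line
# boundary so consumers never share a cache line across partitions.
# (Hypothetical names; tuple size chosen for illustration.)

CACHE_LINE_TUPLES = 8  # e.g. 8-byte tuples per 64-byte cache line

def partition(relation, num_partitions):
    """Split tuples by hash; pad each partition to a cache-line multiple."""
    parts = [[] for _ in range(num_partitions)]
    for t in relation:
        parts[hash(t) % num_partitions].append(t)
    counts = [len(p) for p in parts]        # real tuple counts per partition
    for p in parts:                         # padding, as in the figure
        while len(p) % CACHE_LINE_TUPLES:
            p.append(None)
    return counts, parts

counts, parts = partition(range(100), 4)
```

The counts let the consumer skip the padding; the padding keeps every partition cache-line aligned, which is what makes the partitioned output cheap for the CPU cores to scan.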
Stochastic gradient descent on the FPGA.

[Figure: a pipelined SGD engine. Each 64 B cache line delivers 16 32-bit floating-point values; 16 float multipliers compute a·x, feeding an adder tree built from 16 float-to-fixed converters, 16 fixed-point adders, and fixed-to-float converters on the output; FIFOs buffer a and b; the gradient γ(ax−b)a is then formed with one float adder and one float multiplier, and the model x is updated once the batch size is reached. The engine can sit next to custom logic (computation), FPGA BRAM (storage), or a data source such as a sensor or a database. Processing rate: 12.8 GB/s. Kaan et al., SIGMOD'17; Kaan et al., FCCM'17]
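The three pipeline stages (dot product, gradient calculation γ(ax−b)a, model update) correspond to one SGD step for a linear model under squared loss. A minimal software sketch (plain Python, illustrative only; the FPGA processes 16 values per cache line in parallel):

```python
# One SGD step for a linear model: given sample (a, b), model x, and
# learning rate gamma, compute a.x, then the gradient gamma*(a.x - b)*a,
# then update the model. Mirrors the three stages in the figure.

def sgd_step(x, a, b, gamma):
    dot = sum(ai * xi for ai, xi in zip(a, x))        # dot product a.x
    scale = gamma * (dot - b)                         # gamma * (a.x - b)
    return [xi - scale * ai for xi, ai in zip(x, a)]  # model update

# Tiny usage example: learn x such that a.x ~= b for two samples.
x = [0.0, 0.0]
samples = [([1.0, 0.0], 2.0), ([0.0, 1.0], 3.0)]
for _ in range(100):
    for a, b in samples:
        x = sgd_step(x, a, b, gamma=0.1)
# x converges toward [2.0, 3.0]
```

On the FPGA the same arithmetic is laid out spatially (multipliers, adder tree, converters), so a new cache line of data enters the pipeline every cycle instead of looping as above.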
(Woods, VLDB’14)
The goal is to be able to do this at all levels:
- Smart storage
- On the network switch (SDN-like)
- On the network card (smart NIC)
- On the PCI Express bus
- On the memory bus (active memory)

Every element in the system (a node, a rack, a cluster) will be a processing component.
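The common pattern at all of these levels is pushing computation to where the data sits, so only results cross the interconnect. A hypothetical software model of smart storage with predicate pushdown (plain Python, illustrative names only):

```python
# Hypothetical model of near-data processing: a "smart" storage device
# evaluates a filter predicate locally and returns only matching tuples,
# instead of shipping the whole table to the host (illustrative only).

class SmartStorage:
    def __init__(self, table):
        self.table = table  # data resident on the device

    def scan(self, predicate=None):
        """Scan on the device; apply the predicate before moving data."""
        if predicate is None:
            return list(self.table)                     # ship everything
        return [t for t in self.table if predicate(t)]  # near-data filter

dev = SmartStorage(range(1_000_000))
hits = dev.scan(lambda t: t % 1000 == 0)  # only matches cross the bus
```

With the predicate pushed down, 1,000 tuples cross the bus instead of 1,000,000; the same argument applies to a switch, a NIC, or active memory filtering in flight.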
[Figure: Xilinx VC709 evaluation board with four SFP+ network ports and 8 GB of DRAM. The FPGA implements the networking stack, atomic broadcast, and a replicated key-value store. Software clients issue reads and writes over TCP; replication traffic to the other nodes goes over direct connections.]
[Figure: experimental setup: a 3-FPGA cluster behind a 10 Gbps switch with clients attached, plus direct node-to-node connections. The system also implements leader election and recovery.]
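The core of the replication protocol that the FPGA runs in hardware is a leader committing a command once a majority of replicas acknowledge it. A simplified software model of one such round (hypothetical names, illustrative only; it omits leader election and recovery):

```python
# Simplified model of one atomic-broadcast round: the leader appends the
# command to its log, replicates it, and commits once a majority of the
# cluster (including itself) has acknowledged. (Illustrative only.)

class Replica:
    def __init__(self):
        self.log = []

    def accept(self, seq, cmd):
        self.log.append((seq, cmd))
        return True  # acknowledgment

def consensus_round(leader, followers, seq, cmd):
    leader.accept(seq, cmd)
    acks = 1 + sum(f.accept(seq, cmd) for f in followers)
    majority = (1 + len(followers)) // 2 + 1
    return acks >= majority  # committed once a majority acked

nodes = [Replica() for _ in range(3)]  # 3-node cluster, as in the setup
committed = consensus_round(nodes[0], nodes[1:], seq=1, cmd="put k v")
```

In software each round costs several network and OS round trips; implementing exactly this message pattern in FPGA logic with direct connections is what removes most of that latency.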
[Figure: consensus latency vs. throughput. Measured with memaslap (ixgbe) over TCP / 10 Gbps Ethernet, a round takes 15-35 μs, with ~10 μs for consensus itself; over direct connections, ~3 μs. The plot compares throughput (consensus rounds/s) against consensus latency (μs) for FPGA (Direct), FPGA (TCP), DARE* (InfiniBand), Libpaxos (TCP), Etcd (TCP), and Zookeeper (TCP): the specialized solutions outperform the general-purpose ones by 10-100x.]

[1] Dragojevic et al. FaRM: Fast Remote Memory. In NSDI'14.
[2] Poke et al. DARE: High-Performance State Machine Replication on RDMA Networks. In HPDC'15.
* Extrapolated from the 5-node setup to a 3-node setup.
- There is a killer application: data science / big data.
- There is a very fast evolution of the infrastructure for data processing (appliances, data centers).
- Conventional processors and architectures are not good enough.
- FPGAs are great tools to:
  - Explore parallelism
  - Explore new architectures
  - Explore Software Defined X/Y/Z
  - Prototype accelerators