SLIDE 1

Data Processing on Future Hardware

Gustavo Alonso Systems Group Computer Science ETH Zurich Switzerland

DAGSTUHL 17101, March 2017

SLIDE 2

www.systems.ethz.ch

Systems Group

  • 7 faculty
  • ~30 PhD
  • ~11 postdocs

Researching all aspects of system architecture, software and hardware

SLIDE 3

Do not replace, enhance: help the engine to do what it does not do well

SLIDE 4
Databases are great at:

  • Persistence
  • Consistency
  • Fault tolerance
  • Concurrency
  • Optimization
  • Declarative interfaces
  • Extensibility

Databases are not great at:

  • Text
  • Large data types
  • Multimedia
  • Floating point
  • Geometry, spatial data
  • Graphs
  • Most of ML
  • Scalability
SLIDE 5

Text search in databases

INTEL HARP: This is an experimental system provided by Intel; any results presented are generated using pre-production hardware and software, and may not reflect the performance of production or future systems.

Istvan et al., FCCM'16
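The FCCM'16 design implements the pattern matching itself in hardware. As a rough software analogue (a minimal sketch, not the paper's circuit), the shift-and (bitap) matcher below updates a single bit vector of partial-match states per input byte, the kind of one-update-per-cycle state machine that maps naturally onto an FPGA:

```python
def bitap_find(pattern: bytes, stream) -> list[int]:
    """Shift-and (bitap) matcher: one bit-vector update per input byte,
    mirroring the per-cycle state update of a hardware matcher."""
    m = len(pattern)
    # Precompute a bit mask per byte value: bit i is set if pattern[i] == byte.
    masks = [0] * 256
    for i, b in enumerate(pattern):
        masks[b] |= 1 << i
    state, hit_bit = 0, 1 << (m - 1)
    matches, pos = [], 0
    for chunk in stream:                 # process fixed-size chunks, like cache lines
        for b in chunk:
            state = ((state << 1) | 1) & masks[b]
            if state & hit_bit:          # bit m-1 set: full pattern matched here
                matches.append(pos - m + 1)
            pos += 1
    return matches

# Example: scan a "table column" delivered in 64-byte chunks.
data = b"select * from docs where body like '%fpga%' -- fpga fpga"
chunks = [data[i:i + 64] for i in range(0, len(data), 64)]
print(bitap_find(b"fpga", chunks))       # offsets of every match
```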

SLIDE 6

100% processing on FPGA

SLIDE 7

Hybrid Processing CPU/FPGA

SLIDE 8

Inside the FPGA …

Owaida et al., FCCM'17

SLIDE 9

Accelerating real engines

Sidler et al., SIGMOD’17

SLIDE 10

Near memory processing

From Oracle M7 documentation

See previous talk, or …

SLIDE 11

DoppioDB: An engine that actually processes data

  • MonetDB + Intel HARP (v1 and v2)
  • String processing
  • Skyline
  • Data Partitioning
  • Stochastic Gradient Descent
  • Decision trees
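A hybrid engine also needs a policy for when to offload. Purely as an illustration (the function names, the threshold, and the fallback are invented here, not DoppioDB's actual MonetDB integration), a per-operator dispatch might look like:

```python
# Hypothetical sketch of hybrid CPU/FPGA operator dispatch -- names and
# threshold are illustrative, not DoppioDB's API.
import re

def regex_cpu(rows, pattern):
    rx = re.compile(pattern)
    return [r for r in rows if rx.search(r)]

def regex_fpga(rows, pattern):
    # Stand-in for: copy rows to shared memory, configure the FPGA
    # matcher, and collect qualifying row ids from the result queue.
    raise NotImplementedError("offload path: shared memory + FPGA job queue")

FPGA_THRESHOLD = 100_000          # offload only when the scan is large enough

def regex_select(rows, pattern):
    """Dispatch a string-matching selection to the CPU or the FPGA."""
    if len(rows) >= FPGA_THRESHOLD:
        try:
            return regex_fpga(rows, pattern)
        except NotImplementedError:
            pass                   # fall back if the accelerator is absent
    return regex_cpu(rows, pattern)

rows = [f"tuple-{i}-fpga" if i % 7 == 0 else f"tuple-{i}" for i in range(1000)]
print(len(regex_select(rows, r"fpga")))   # small input: takes the CPU path
```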
SLIDE 12

Integration of Partitioned Hash Joins

Target Architecture: Intel Xeon+FPGA

Figure: an Intel Xeon CPU (10 cores with caches and a memory controller, ~30 GB/s to 96 GB of main memory) is paired with an Altera Stratix V FPGA over QPI (6.5 GB/s). The partitioner sits on the FPGA behind a QPI endpoint, with a matching QPI accelerator endpoint on the CPU side; its output (counts for R and S, the partitioned R and S, and padding) is laid out in memory as 64 B cache lines.

Kara et al., SIGMOD'17
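In software terms, the FPGA runs the partitioning pass of a classic partitioned (radix) hash join: fan both inputs out by hash so each build/probe pair fits in cache. A minimal sketch of that algorithm (illustrative only; the SIGMOD'17 operator streams padded 64 B cache lines over QPI):

```python
# Software sketch of radix partitioning + build/probe for a partitioned
# hash join. Illustrative only; the SIGMOD'17 design runs the
# partitioning pass on the FPGA.
RADIX_BITS = 6                      # 2^6 = 64 partitions
FANOUT = 1 << RADIX_BITS

def part_of(key: int) -> int:
    return hash(key) & (FANOUT - 1)

def partition(table):
    """Histogram pass + scatter pass ('Counts' + 'Partitioned' output)."""
    counts = [0] * FANOUT
    for key, _ in table:
        counts[part_of(key)] += 1
    parts = [[] for _ in range(FANOUT)]
    for key, payload in table:
        parts[part_of(key)].append((key, payload))
    return counts, parts

def partitioned_hash_join(R, S):
    _, r_parts = partition(R)
    _, s_parts = partition(S)
    out = []
    for rp, sp in zip(r_parts, s_parts):     # each pair fits in cache
        ht = {}
        for key, payload in rp:              # build
            ht.setdefault(key, []).append(payload)
        for key, payload in sp:              # probe
            for rpay in ht.get(key, []):
                out.append((key, rpay, payload))
    return out

R = [(k, f"r{k}") for k in range(1000)]
S = [(k % 500, f"s{k}") for k in range(2000)]
print(len(partitioned_hash_join(R, S)))      # 2000 matches
```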

SLIDE 13

SGD on FPGA

Figure: the SGD datapath consumes one 64 B cache line (16 32-bit floats) per cycle. A dot-product stage (16 floating-point multipliers, 16 float-to-fixed converters, and a tree of 16 fixed-point adders) computes ax while a is buffered in a FIFO; a gradient stage computes γ(ax-b) against the b FIFO (one fixed-to-float converter, one float adder, one float multiplier); a model-update stage (16 float multipliers, 16 fixed adders, 16 fixed-to-float converters) applies x - γ(ax-b)a, writing the model back once the batch size is reached. Computation runs in custom logic with FPGA BRAM as storage; data sources range from sensors to a database. Processing rate: 12.8 GB/s (16 32-bit floating-point values per cycle).

Kara et al., SIGMOD'17

Kara et al., FCCM'17
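The same update written out in software: for each sample (a, b) the pipeline computes the dot product ax, the scaled residual γ(ax - b), and the update x ← x - γ(ax - b)a. A minimal sketch of that loop on synthetic data (float-only; the FCCM'17 work additionally trades precision for convergence with fixed-point arithmetic):

```python
import random

# One 64 B "cache line" = 16 x 32-bit floats = one sample vector a.
DIM = 16

def sgd_epoch(samples, x, gamma=0.05):
    for a, b in samples:                             # stream of (a, b) pairs
        ax = sum(ai * xi for ai, xi in zip(a, x))    # dot-product stage
        g = gamma * (ax - b)                         # gradient-scale stage
        x = [xi - g * ai for xi, ai in zip(x, a)]    # model-update stage
    return x

# Synthetic linear data: b = a . x_true, so SGD should recover x_true.
x_true = [random.uniform(-1, 1) for _ in range(DIM)]
samples = []
for _ in range(2000):
    a = [random.uniform(-1, 1) for _ in range(DIM)]
    samples.append((a, sum(ai * ti for ai, ti in zip(a, x_true))))

x = [0.0] * DIM
for _ in range(10):
    x = sgd_epoch(samples, x)
print(max(abs(xi - ti) for xi, ti in zip(x, x_true)))  # should be near zero
```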

SLIDE 14

If the data moves, do it efficiently

Bumps in the wire(s)

SLIDE 15

(Woods, VLDB’14)

IBEX

SLIDE 16

Sounds good?

The goal is to be able to do this at all levels:

  • Smart storage
  • On the network switch (SDN-like)
  • On the network card (smart NIC)
  • On the PCI Express bus
  • On the memory bus (active memory)

Every element in the system (a node, a computer rack, a cluster) will be a processing component

SLIDE 17

Disaggregated data center

Near Data Computation

SLIDE 18


Consensus in a Box (Istvan et al., NSDI'16)

Figure: a Xilinx VC709 evaluation board with four SFP+ ports and 8 GB of DRAM. Inside the FPGA, a networking layer, an atomic-broadcast layer, and a replicated key-value store are stacked; software clients issue reads and writes over TCP, while other nodes are reached over direct connections.
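The commit rule the atomic-broadcast layer implements is the standard leader-based one: replicate an operation, commit once a majority has acknowledged it. A minimal in-memory sketch of that rule (class names are illustrative; the NSDI'16 system runs this pipeline directly in hardware, without round trips through software):

```python
# Minimal sketch of majority-commit, leader-based atomic broadcast
# (Zookeeper-style) over an in-memory stand-in for the network.
class Replica:
    def __init__(self):
        self.log, self.store = [], {}

    def ack_proposal(self, seqno, op):
        self.log.append((seqno, op))      # durably log, then acknowledge
        return True

    def commit(self, seqno):
        _, (key, value) = self.log[seqno]
        self.store[key] = value           # apply the write to the KVS

class Leader(Replica):
    def __init__(self, followers):
        super().__init__()
        self.followers = followers

    def put(self, key, value):
        seqno, op = len(self.log), (key, value)
        self.ack_proposal(seqno, op)      # leader logs its own proposal
        acks = 1 + sum(f.ack_proposal(seqno, op) for f in self.followers)
        if acks * 2 > len(self.followers) + 1:   # majority reached
            for node in [self] + self.followers:
                node.commit(seqno)
            return True
        return False

followers = [Replica(), Replica()]
leader = Leader(followers)
leader.put("k", "v")
print([n.store["k"] for n in [leader] + followers])   # ['v', 'v', 'v']
```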

SLIDE 19
The system

  • Drop-in replacement for memcached with Zookeeper's replication
  • Standard tools for benchmarking (libmemcached)
  • Simulating 100s of clients
  • Communication over TCP/IP and over direct connections
  • Plus leader election and recovery

Figure: clients (×12) drive a 3-FPGA cluster through a 10 Gbps switch.

SLIDE 20

Latency of puts in a KVS

Figure: put-latency breakdown: consensus takes 15-35 µs; the TCP / 10 Gbps Ethernet path (memaslap, ixgbe) adds ~10 µs; direct connections add ~3 µs.
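memaslap drives this load through libmemcached; a self-contained alternative (a sketch, with HOST and PORT as placeholders for any memcached-protocol KVS) is to time SET round trips over a raw socket:

```python
import socket, statistics, time

# Sketch: measure put (SET) latency against a memcached-protocol KVS over
# TCP, the same round trip memaslap times. HOST/PORT are placeholders.
HOST, PORT, N = "10.1.1.1", 11211, 1000

def measure_put_latency():
    s = socket.create_connection((HOST, PORT))
    s.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    value = b"x" * 64
    cmd = b"set bench 0 0 64\r\n" + value + b"\r\n"   # memcached text protocol
    lat = []
    for _ in range(N):
        t0 = time.perf_counter()
        s.sendall(cmd)
        s.recv(4096)                       # expect b"STORED\r\n" back
        lat.append((time.perf_counter() - t0) * 1e6)
    s.close()
    lat.sort()
    print(f"median {statistics.median(lat):.1f} us, "
          f"p99 {lat[int(0.99 * len(lat))]:.1f} us")

measure_put_latency()
```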

SLIDE 21

The benefit of specialization…

Chart: throughput (consensus rounds/s, 10³ to 10⁷) vs. consensus latency (µs, 1 to 1,000) for FPGA (Direct), FPGA (TCP), DARE* (Infiniband), Libpaxos (TCP), Etcd (TCP), and Zookeeper (TCP). The specialized solutions outperform the general-purpose ones by 10-100x.

[1] Dragojevic et al. FaRM: Fast Remote Memory. NSDI'14.
[2] Poke et al. DARE: High-Performance State Machine Replication on RDMA Networks. HPDC'15.
* We extrapolated from the 5-node setup to a 3-node setup.

SLIDE 22

This is the end …

There is a killer application (data science / big data). There is a very fast evolution of the infrastructure for data processing (appliances, data centers). Conventional processors and architectures are not good enough. FPGAs are great tools to:

  • Explore parallelism
  • Explore new architectures
  • Explore Software Defined X/Y/Z
  • Prototype accelerators