SLIDE 1

Data Processing on Future Hardware

Gustavo Alonso Systems Group Computer Science ETH Zurich Switzerland

DAGSTUHL 17101, March 2017

SLIDE 2

www.systems.ethz.ch

Systems Group

  • 7 faculty
  • ~30 PhD
  • ~11 postdocs

Researching all aspects of system architecture, software and hardware

SLIDE 3

Do not replace, enhance: help the engine to do what it does not do well

SLIDE 4
Databases are great at:

  • Persistence
  • Consistency
  • Fault tolerance
  • Concurrency
  • Optimization
  • Declarative interfaces
  • Extensibility

Databases are not great at:

  • Text
  • Large data types
  • Multimedia
  • Floating point
  • Geometry, spatial data
  • Graphs
  • Most of ML
  • Scalability
SLIDE 5

Text search in databases

INTEL HARP: This is an experimental system provided by Intel; any results presented are generated using pre-production hardware and software, and may not reflect the performance of production or future systems.

Istvan et al., FCCM'16
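The FCCM'16 design implements the pattern matching itself in hardware. As a rough software analogue (a minimal sketch, not the paper's circuit), the shift-and (bitap) matcher below updates a single bit vector of partial-match states per input byte, the kind of one-update-per-cycle state machine that maps naturally onto an FPGA:

```python
def bitap_find(pattern: bytes, stream) -> list[int]:
    """Shift-and (bitap) matcher: one bit-vector update per input byte,
    mirroring the per-cycle state update of a hardware matcher."""
    m = len(pattern)
    # Precompute a bit mask per byte value: bit i is set if pattern[i] == byte.
    masks = [0] * 256
    for i, b in enumerate(pattern):
        masks[b] |= 1 << i
    state, hit_bit = 0, 1 << (m - 1)
    matches, pos = [], 0
    for chunk in stream:                 # process fixed-size chunks, like cache lines
        for b in chunk:
            state = ((state << 1) | 1) & masks[b]
            if state & hit_bit:          # bit m-1 set: full pattern matched here
                matches.append(pos - m + 1)
            pos += 1
    return matches

# Example: scan a "table column" delivered in 64-byte chunks.
data = b"select * from docs where body like '%fpga%' -- fpga fpga"
chunks = [data[i:i + 64] for i in range(0, len(data), 64)]
print(bitap_find(b"fpga", chunks))       # offsets of every match
```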

SLIDE 6

100% processing on FPGA

SLIDE 7

Hybrid Processing CPU/FPGA

SLIDE 8

Inside the FPGA …

Owaida et al., FCCM'17

SLIDE 9

Accelerating real engines

Sidler et al., SIGMOD’17

SLIDE 10

Near memory processing

From Oracle M7 documentation

See previous talk, or …

SLIDE 11

DoppioDB: An engine that actually processes data

  • MonetDB + Intel HARP (v1 and v2)
  • String processing
  • Skyline
  • Data Partitioning
  • Stochastic Gradient Descent
  • Decision trees
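A hybrid engine also needs a policy for when to offload. Purely as an illustration (the function names, the threshold, and the fallback are invented here, not DoppioDB's actual MonetDB integration), a per-operator dispatch might look like:

```python
# Hypothetical sketch of hybrid CPU/FPGA operator dispatch -- names and
# threshold are illustrative, not DoppioDB's API.
import re

def regex_cpu(rows, pattern):
    rx = re.compile(pattern)
    return [r for r in rows if rx.search(r)]

def regex_fpga(rows, pattern):
    # Stand-in for: copy rows to shared memory, configure the FPGA
    # matcher, and collect qualifying row ids from the result queue.
    raise NotImplementedError("offload path: shared memory + FPGA job queue")

FPGA_THRESHOLD = 100_000          # offload only when the scan is large enough

def regex_select(rows, pattern):
    """Dispatch a string-matching selection to the CPU or the FPGA."""
    if len(rows) >= FPGA_THRESHOLD:
        try:
            return regex_fpga(rows, pattern)
        except NotImplementedError:
            pass                   # fall back if the accelerator is absent
    return regex_cpu(rows, pattern)

rows = [f"tuple-{i}-fpga" if i % 7 == 0 else f"tuple-{i}" for i in range(1000)]
print(len(regex_select(rows, r"fpga")))   # small input: takes the CPU path
```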
SLIDE 12

Integration of Partitioned Hash Joins

Target Architecture: Intel Xeon+FPGA

Figure: an Intel Xeon CPU (10 cores with caches and a memory controller, ~30 GB/s to 96 GB of main memory) is paired with an Altera Stratix V FPGA over QPI (6.5 GB/s). The partitioner sits on the FPGA behind a QPI endpoint, with a matching QPI accelerator endpoint on the CPU side; its output (counts for R and S, the partitioned R and S, and padding) is laid out in memory as 64 B cache lines.

Kara et al., SIGMOD'17
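In software terms, the FPGA runs the partitioning pass of a classic partitioned (radix) hash join: fan both inputs out by hash so each build/probe pair fits in cache. A minimal sketch of that algorithm (illustrative only; the SIGMOD'17 operator streams padded 64 B cache lines over QPI):

```python
# Software sketch of radix partitioning + build/probe for a partitioned
# hash join. Illustrative only; the SIGMOD'17 design runs the
# partitioning pass on the FPGA.
RADIX_BITS = 6                      # 2^6 = 64 partitions
FANOUT = 1 << RADIX_BITS

def part_of(key: int) -> int:
    return hash(key) & (FANOUT - 1)

def partition(table):
    """Histogram pass + scatter pass ('Counts' + 'Partitioned' output)."""
    counts = [0] * FANOUT
    for key, _ in table:
        counts[part_of(key)] += 1
    parts = [[] for _ in range(FANOUT)]
    for key, payload in table:
        parts[part_of(key)].append((key, payload))
    return counts, parts

def partitioned_hash_join(R, S):
    _, r_parts = partition(R)
    _, s_parts = partition(S)
    out = []
    for rp, sp in zip(r_parts, s_parts):     # each pair fits in cache
        ht = {}
        for key, payload in rp:              # build
            ht.setdefault(key, []).append(payload)
        for key, payload in sp:              # probe
            for rpay in ht.get(key, []):
                out.append((key, rpay, payload))
    return out

R = [(k, f"r{k}") for k in range(1000)]
S = [(k % 500, f"s{k}") for k in range(2000)]
print(len(partitioned_hash_join(R, S)))      # 2000 matches
```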

SLIDE 13

SGD on FPGA

Figure: the SGD datapath consumes one 64 B cache line (16 32-bit floats) per cycle. A dot-product stage (16 floating-point multipliers, 16 float-to-fixed converters, and a tree of 16 fixed-point adders) computes ax while a is buffered in a FIFO; a gradient stage computes γ(ax-b) against the b FIFO (one fixed-to-float converter, one float adder, one float multiplier); a model-update stage (16 float multipliers, 16 fixed adders, 16 fixed-to-float converters) applies x - γ(ax-b)a, writing the model back once the batch size is reached. Computation runs in custom logic with FPGA BRAM as storage; data sources range from sensors to a database. Processing rate: 12.8 GB/s (16 32-bit floating-point values per cycle).

Kara et al., SIGMOD'17

Kara et al., FCCM'17
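The same update written out in software: for each sample (a, b) the pipeline computes the dot product ax, the scaled residual γ(ax - b), and the update x ← x - γ(ax - b)a. A minimal sketch of that loop on synthetic data (float-only; the FCCM'17 work additionally trades precision for convergence with fixed-point arithmetic):

```python
import random

# One 64 B "cache line" = 16 x 32-bit floats = one sample vector a.
DIM = 16

def sgd_epoch(samples, x, gamma=0.05):
    for a, b in samples:                             # stream of (a, b) pairs
        ax = sum(ai * xi for ai, xi in zip(a, x))    # dot-product stage
        g = gamma * (ax - b)                         # gradient-scale stage
        x = [xi - g * ai for xi, ai in zip(x, a)]    # model-update stage
    return x

# Synthetic linear data: b = a . x_true, so SGD should recover x_true.
x_true = [random.uniform(-1, 1) for _ in range(DIM)]
samples = []
for _ in range(2000):
    a = [random.uniform(-1, 1) for _ in range(DIM)]
    samples.append((a, sum(ai * ti for ai, ti in zip(a, x_true))))

x = [0.0] * DIM
for _ in range(10):
    x = sgd_epoch(samples, x)
print(max(abs(xi - ti) for xi, ti in zip(x, x_true)))  # should be near zero
```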

SLIDE 14

If the data moves, do it efficiently

Bumps in the wire(s)

SLIDE 15

(Woods, VLDB’14)

IBEX

SLIDE 16

Sounds good?

The goal is to be able to do this at all levels:

  • Smart storage
  • On the network switch (SDN-like)
  • On the network card (smart NIC)
  • On the PCI Express bus
  • On the memory bus (active memory)

Every element in the system (a node, a computer rack, a cluster) will be a processing component

SLIDE 17

Disaggregated data center

Near Data Computation

SLIDE 18


Consensus in a Box (Istvan et al., NSDI'16)

Figure: a Xilinx VC709 evaluation board with four SFP+ ports and 8 GB of DRAM. Inside the FPGA, a networking layer, an atomic-broadcast layer, and a replicated key-value store are stacked; software clients issue reads and writes over TCP, while other nodes are reached over direct connections.
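The commit rule the atomic-broadcast layer implements is the standard leader-based one: replicate an operation, commit once a majority has acknowledged it. A minimal in-memory sketch of that rule (class names are illustrative; the NSDI'16 system runs this pipeline directly in hardware, without round trips through software):

```python
# Minimal sketch of majority-commit, leader-based atomic broadcast
# (Zookeeper-style) over an in-memory stand-in for the network.
class Replica:
    def __init__(self):
        self.log, self.store = [], {}

    def ack_proposal(self, seqno, op):
        self.log.append((seqno, op))      # durably log, then acknowledge
        return True

    def commit(self, seqno):
        _, (key, value) = self.log[seqno]
        self.store[key] = value           # apply the write to the KVS

class Leader(Replica):
    def __init__(self, followers):
        super().__init__()
        self.followers = followers

    def put(self, key, value):
        seqno, op = len(self.log), (key, value)
        self.ack_proposal(seqno, op)      # leader logs its own proposal
        acks = 1 + sum(f.ack_proposal(seqno, op) for f in self.followers)
        if acks * 2 > len(self.followers) + 1:   # majority reached
            for node in [self] + self.followers:
                node.commit(seqno)
            return True
        return False

followers = [Replica(), Replica()]
leader = Leader(followers)
leader.put("k", "v")
print([n.store["k"] for n in [leader] + followers])   # ['v', 'v', 'v']
```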

SLIDE 19
The system

  • Drop-in replacement for memcached with Zookeeper's replication
  • Standard tools for benchmarking (libmemcached)
  • Simulating 100s of clients
  • Communication over TCP/IP and over direct connections
  • Plus leader election and recovery

Figure: clients (×12) drive a 3-FPGA cluster through a 10 Gbps switch.

SLIDE 20

Latency of puts in a KVS

Figure: put-latency breakdown: consensus takes 15-35 µs; the TCP / 10 Gbps Ethernet path (memaslap, ixgbe) adds ~10 µs; direct connections add ~3 µs.
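memaslap drives this load through libmemcached; a self-contained alternative (a sketch, with HOST and PORT as placeholders for any memcached-protocol KVS) is to time SET round trips over a raw socket:

```python
import socket, statistics, time

# Sketch: measure put (SET) latency against a memcached-protocol KVS over
# TCP, the same round trip memaslap times. HOST/PORT are placeholders.
HOST, PORT, N = "10.1.1.1", 11211, 1000

def measure_put_latency():
    s = socket.create_connection((HOST, PORT))
    s.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    value = b"x" * 64
    cmd = b"set bench 0 0 64\r\n" + value + b"\r\n"   # memcached text protocol
    lat = []
    for _ in range(N):
        t0 = time.perf_counter()
        s.sendall(cmd)
        s.recv(4096)                       # expect b"STORED\r\n" back
        lat.append((time.perf_counter() - t0) * 1e6)
    s.close()
    lat.sort()
    print(f"median {statistics.median(lat):.1f} us, "
          f"p99 {lat[int(0.99 * len(lat))]:.1f} us")

measure_put_latency()
```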

SLIDE 21

The benefit of specialization…

Chart: throughput (consensus rounds/s, 10³ to 10⁷) vs. consensus latency (µs, 1 to 1,000) for FPGA (Direct), FPGA (TCP), DARE* (Infiniband), Libpaxos (TCP), Etcd (TCP), and Zookeeper (TCP). The specialized solutions outperform the general-purpose ones by 10-100x.

[1] Dragojevic et al. FaRM: Fast Remote Memory. NSDI'14.
[2] Poke et al. DARE: High-Performance State Machine Replication on RDMA Networks. HPDC'15.
* We extrapolated from the 5-node setup to a 3-node setup.

SLIDE 22

This is the end …

There is a killer application (data science / big data). There is a very fast evolution of the infrastructure for data processing (appliances, data centers). Conventional processors and architectures are not good enough. FPGAs are great tools to:

  • Explore parallelism
  • Explore new architectures
  • Explore Software Defined X/Y/Z
  • Prototype accelerators