

SLIDE 1

(Re)Configurable Clouds and the Dawn of a New Era

Doug Burger @ Microsoft Research NExT · FPL Keynote, August 30, 2016

SLIDE 2

SLIDE 3

             Client    Cloud
Training     Humans    GPUs
Inference    ASICs     ?

SLIDE 4

SLIDE 5

200+ Cloud Services: Diversity

  • 2.4+ million emails per day
  • 1+ billion customers · 20+ million businesses · 90+ markets worldwide
  • 5.8+ billion worldwide queries each month
  • 1 in 4 enterprise customers
  • 50+ billion minutes of connections handled each month
  • 48+ million users in 41 markets
  • 50+ million active users
  • 400+ million active accounts
  • 250+ million active users
  • 8.6+ trillion objects in Microsoft Azure storage

SLIDE 6

What Drives a Post-CPU “Enhanced” Cloud?

Efficiency (ASICs) vs. homogeneity

SLIDE 7

SLIDE 8


Catapult V0: BFB (2011)

  • Use commodity SuperMicro servers
  • 1U rack-mounted
  • 2 x 10GbE ports
  • 3 x16 PCIe slots
  • 12 Intel Westmere cores (2 sockets)
  • 6 Xilinx LX240T FPGAs
  • One appliance per rack
  • All rack machines communicate over 1Gb Ethernet

SLIDE 9

Bing Ranking Implementation Details

[Block diagram: the ranking pipeline on one FPGA. A stream-preprocessing FSM feeds feature extraction (FE), free-form expression cores (FFE), and decision-tree scoring (DTS) tiles (DTT), connected by a feature-transmission network carrying control/data tokens; each tile holds ALUs/DSPs, scheduling logic, and local storage for instructions, constants, and compression thresholds. A software sketch of the tree-scoring computation follows below.]

  • FE0: 89 non-BodyBlock features, 34 state machines, 55% utilization
  • FE1: 55 BodyBlock features, 20 state machines, 45% utilization
  • FFE: 64 cores/chip, 256-512 threads
  • DTT: 48 tiles/chip, 240 tree processors, 2,880 trees/chip
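The DTT tiles implement the scoring stage: each document's feature vector is walked through thousands of small decision trees and the per-tree scores are summed. Below is a minimal software model of that computation, assuming simple threshold-split trees; the class and function names are illustrative, not the production interface.

```python
# Minimal software model of decision-tree scoring (illustrative only;
# the FPGA spreads ~2,880 such trees across 48 DTT tiles in parallel).

class Node:
    """Binary decision node: go left if features[idx] < threshold."""
    def __init__(self, idx=None, threshold=None, left=None, right=None, score=0.0):
        self.idx, self.threshold = idx, threshold
        self.left, self.right = left, right
        self.score = score  # meaningful only at leaves

    def is_leaf(self):
        return self.left is None and self.right is None

def eval_tree(node, features):
    """Walk one tree from root to leaf for a single feature vector."""
    while not node.is_leaf():
        node = node.left if features[node.idx] < node.threshold else node.right
    return node.score

def score_document(trees, features):
    """The ensemble score is the sum of all per-tree leaf scores."""
    return sum(eval_tree(t, features) for t in trees)

# Tiny usage example with two hand-built trees.
t0 = Node(idx=0, threshold=0.5, left=Node(score=-1.0), right=Node(score=2.0))
t1 = Node(idx=1, threshold=1.0, left=Node(score=0.5), right=Node(score=1.5))
print(score_document([t0, t1], [0.7, 0.2]))  # 2.0 + 0.5 = 2.5
```

In hardware the 240 tree processors evaluate trees concurrently, so throughput scales with the number of tiles rather than with this sequential loop.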

SLIDE 10
  • Fundamental flaws:
  • Additional single point of failure
  • Additional SKU to maintain
  • Too much load on the 1Gb network
  • Inelastic FPGA scaling or stranded capacity
SLIDE 11

Catapult V1 Card (2012-2013)


  • Altera Stratix V D5
  • 172.6K ALMs, 2,014 M20Ks
  • 457 KLEs
  • 1 KLE == ~12K gates
  • M20K is a 2.5KB SRAM
  • PCIe Gen 2 x8, 8GB DDR3
  • 20 Gb network among FPGAs

[Card photo: Stratix V, 8GB DDR3, PCIe Gen3 x8. A back-of-the-envelope capacity check follows below.]
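The two rules of thumb on this slide make a quick capacity check possible; a sketch using only the numbers above:

```python
# Back-of-the-envelope Stratix V D5 capacity, from the slide's rules of
# thumb: 1 KLE ~= 12K gates, one M20K block ~= 2.5 KB of SRAM.
kles = 457
gates_per_kle = 12_000
m20k_blocks = 2_014
kb_per_m20k = 2.5

print(f"~{kles * gates_per_kle / 1e6:.1f}M gate-equivalents")      # ~5.5M
print(f"~{m20k_blocks * kb_per_m20k / 1024:.1f} MB on-chip SRAM")  # ~4.9 MB
```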

SLIDE 12

Mapped Fabric into a Pod

  • Low-latency access to a local FPGA
  • Compose multiple FPGAs to accelerate large workloads
  • Low-latency, high-bandwidth sharing of storage and memory across server boundaries

[Diagram: one 48-server pod behind a Top-of-Rack (TOR) switch. Each server hosts an FPGA; the FPGAs are wired to one another in a dedicated 6x8 torus, separate from the servers' 10Gb Ethernet links. A sketch of the torus wiring follows below.]
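A minimal sketch of that torus wiring, assuming the 6x8 arrangement with wraparound links (the server-to-coordinate mapping here is illustrative):

```python
# Sketch of the 6x8 FPGA torus in one 48-server pod: each FPGA links to
# its four neighbors, with wraparound at the edges.
ROWS, COLS = 6, 8  # 48 FPGAs total

def neighbors(fpga_id):
    """Return the four torus neighbors of an FPGA (ids 0..47)."""
    r, c = divmod(fpga_id, COLS)
    return [
        ((r - 1) % ROWS) * COLS + c,  # up
        ((r + 1) % ROWS) * COLS + c,  # down
        r * COLS + (c - 1) % COLS,    # left
        r * COLS + (c + 1) % COLS,    # right
    ]

print(neighbors(0))   # [40, 8, 7, 1] -- corner node, wraps both ways
print(neighbors(20))  # [12, 28, 19, 21] -- interior node
```

With wraparound, no two FPGAs in the pod are more than floor(6/2) + floor(8/2) = 7 hops apart.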

SLIDE 13

1,632-server pilot deployed in a production datacenter


SLIDE 14
  • Fundamental flaws:
  • Microsoft was converging on a single SKU
  • No one else wanted the secondary network
  • Complex, difficult to handle failures
  • Difficult to service boxes
  • No killer infrastructure accelerator
  • Application presence is too small
SLIDE 15

Catapult V2 Architecture

[Diagram: WCS 2.0 server blade plus Catapult V2. Two QPI-linked CPUs with DRAM; the FPGA attaches over PCIe Gen3 2x8 (plus Gen3 x8 to the NIC) and sits between the NIC and the switch via 40Gb/s QSFP ports. Shown: WCS Gen4.1 blade with Mellanox NIC and Catapult FPGA, Pikes Peak, WCS tray backplane, option-card mezzanine connectors, Catapult V2 mezzanine card.]

  • The architecture justifies the economics
  • 1. Can act as a local compute accelerator
  • 2. Can act as a network/storage accelerator
  • 3. Can act as a remote compute accelerator
SLIDE 16
(Also need to build a complete platform)

[Diagram: hardware acceleration platform development components]

  • Shell package: health & config, DRAM, PCIe DMA, network
  • Shell libs / HW library package(s): HEX, LTL, encryption, JPEG compression, queue manager, RAMs, FIFOs, flight recorder
  • SDK package: Catapult kernel driver, software dev kit & runtime library (Catapult Runtime Library), role HW API
  • Production HW/RTL and product SW: DTS, LZMA
  • Operation: factory & integration test suites, FPGA watchdog (AutoPilot integration), driver deployment, CloudBuild FPGA flow, license servers, golden image service, verification team, CSI board & SKU qualification process
  • Tools: OpenCL/HLS compiler, OpenCL BSP

SLIDE 17

Case 1: Use as a local accelerator

SLIDE 18

Production Results (December 2015)


[Plot: 99.9th-percentile query latency versus queries/sec, HW vs. SW. Series: average software load, 99.9% software latency, 99.9% FPGA latency, average FPGA query load.]

SLIDE 19

Case 2: Use as an infrastructure accelerator

SLIDE 20

FPGA SmartNIC for Cloud Networking

  • Azure runs Software Defined Networking on the hosts
  • Software Load Balancer, Virtual Networks – new features each month
  • We rely on ASICs to scale and to be COGS-competitive at 40G+
  • But a 12-to-18-month ASIC cycle plus the time to roll out new HW is too slow to keep up with SDN
  • SmartNIC gives us the agility of SDN with the speed and COGS of HW
  • Base SmartNIC will provide common functions like crypto, GFT, QoS, RDMA on all hosts
  • On a 40Gb/s network, 20Gb/s of crypto takes a significant fraction of a 24-core machine
  • Example: crypto and vswitch inline on the FPGA: 0% CPU cost (a GFT-style lookup sketch follows the diagram below)
[Diagram: SmartNIC datapath. VFP's VMSwitch layers (SLB decap, SLB NAT, VNET, ACL, metering) each hold rule → action tables; their decisions are compiled into a single GFT flow → action table on the 50G SmartNIC, which also hosts QoS, crypto, and RDMA, with SR-IOV host bypass for the VM. Example flow action: Decap, DNAT, Rewrite, Meter; 1.2.3.1->1.3.4.1, 62362->80.]
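GFT is a match-action table: the first packet of a flow takes the VFP software path, which then installs a flow entry so that subsequent packets are transformed entirely in hardware. A minimal sketch of that split, with illustrative keys and actions rather than the real GFT data structures:

```python
# Minimal sketch of a GFT-style match-action flow table (illustrative;
# real GFT matches on full header stacks and runs in FPGA hardware).
flow_table = {}  # flow key -> list of actions

def install_flow(key, actions):
    """VFP software installs a flow's actions after its first packet."""
    flow_table[key] = actions

def process_packet(key, packet):
    """Fast path: apply installed actions; on a miss, punt to software."""
    actions = flow_table.get(key)
    if actions is None:
        return "exception packet: send up to the VFP software path"
    for action in actions:
        packet = action(packet)
    return packet

# Illustrative actions for the slide's example flow:
# Decap, DNAT, Rewrite, Meter; 1.2.3.1->1.3.4.1, 62362->80.
decap = lambda p: {**p, "outer": None}
dnat  = lambda p: {**p, "dst_ip": "1.3.4.1", "dst_port": 80}
meter = lambda p: p  # would bump byte/packet counters here

key = ("1.2.3.1", 62362)
print(process_packet(key, {"outer": "vxlan"}))  # miss: software path
install_flow(key, [decap, dnat, meter])
print(process_packet(key, {"outer": "vxlan"}))  # hit: rewritten inline
```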

SLIDE 21

Case 3: Use as a remote accelerator

SLIDE 22

Inter-FPGA communication

[Diagram: two racks, each with four FPGA+NIC servers behind a ToR; the ToRs connect up through higher tiers of switches (CS0-CS3, SP0-SP3), with the network tiers labeled L0 and L1/L2.]

  • FPGAs can encapsulate their own UDP packets (see the header sketch below)
  • Low-latency inter-FPGA communication (LTL)
  • Can provide strong network primitives
  • But this topology opens up other opportunities
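"Encapsulate their own UDP packets" means the FPGA builds the header stack itself instead of traversing the host network stack. As a small illustration of the framing involved, a UDP header packer in software (LTL's own lightweight transport header, carried inside the payload, is not modeled here):

```python
import struct

def udp_datagram(src_port, dst_port, payload):
    """Pack an 8-byte UDP header (checksum 0 = unused, legal for IPv4)."""
    length = 8 + len(payload)
    return struct.pack("!HHHH", src_port, dst_port, length, 0) + payload

msg = udp_datagram(51000, 51001, b"ltl-message")
print(len(msg), msg[:8].hex())  # 19 bytes total; first 8 are the header
```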

SLIDE 23

FPGA-to-FPGA LTL Round-Trip Latencies

SLIDE 24

Hardware Acceleration as a Service

[Diagram: Hardware-as-a-Service. FPGAs across racks (behind ToR and CS switches) are pooled over the network and allocated to workloads: Bing ranking SW, Bing ranking HW, HPC, speech-to-text, large-scale deep learning.]

SLIDE 25

BrainWave: Scaling FPGAs To Ultra-Large Models

  • Thanks to Eric Chung and team
  • Distribute NN models across as many FPGAs as needed (up to thousands)
  • Recent ImageNet competition: 152-layer model
  • Use HaaS and LTL to manage multi-FPGA execution
  • Very close to live production
  • Only vectors travel over the network
  • Low FPGA-FPGA latency at ~1.8us per L0 hop (see the latency sketch below)
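The ~1.8us per-hop figure makes the network cost of distributing a model easy to estimate. A quick sketch, where the linear-pipeline shape and hop counts are assumptions and only the per-hop latency comes from the slide:

```python
# Rough network latency for a model pipelined across several FPGAs,
# using the slide's ~1.8us per L0 hop. Hop counts are an assumption.
HOP_US = 1.8

def network_latency_us(num_fpgas, hops_per_stage=1):
    """Total FPGA-to-FPGA transfer time for a linear pipeline of stages."""
    return (num_fpgas - 1) * hops_per_stage * HOP_US

for n in (2, 8, 32):
    print(f"{n:3d} FPGAs -> ~{network_latency_us(n):5.1f} us in hops")
# 2 -> 1.8 us, 8 -> 12.6 us, 32 -> 55.8 us
```

Even spread across dozens of FPGAs, the added hop overhead stays in the tens of microseconds.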
SLIDE 26

V2 Architecture Makes Configurable Clouds Possible

[Diagram: datacenter-scale view. Racks of CPUs, FPGAs, and DRAM behind ToR and CS switches, with the FPGAs forming a single network-attached fabric.]

  • Massive amounts of programmable logic will change datacenter architecture broadly
  • The fabric is an independent computer running outside of the CPU domain
  • Will affect network architecture (protocols, switches), storage architecture, security models
SLIDE 27

Will Catapult v2 be Deployed at Scale?


SLIDE 28

Configurable Clouds will Change the World

  • Ability to reprogram a datacenter’s hardware protocols
  • Networking, storage, security
  • Can turn homogeneous machines into specialized SKUs dynamically
  • Unprecedented performance and low latency at hyperscale
  • Exa-ops of performance with a 10 microsecond diameter (see the sketch below)
  • What would you do with the world’s most powerful fabric?
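The exa-ops figure is order-of-magnitude arithmetic over a hyperscale fleet; a hedged sketch in which both inputs are illustrative assumptions, not figures from the talk:

```python
# Order-of-magnitude check of "exa-ops" across a hyperscale fleet.
# Both inputs are assumptions for illustration, not Microsoft figures.
num_fpgas = 1_000_000  # hypothetical fleet size
ops_per_fpga = 1e12    # ~1 tera-op/s sustained per FPGA

total = num_fpgas * ops_per_fpga
print(f"{total:.1e} ops/s = {total / 1e18:.0f} exa-op/s")  # 1.0e+18 -> 1
```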


SLIDE 29

SLIDE 30