SLIDE 1

Microsoft’s Production Configurable Cloud

Derek Chiou, Microsoft Azure Cloud Silicon, UT Austin

H2RC, Nov 14, 2016

slide-2
SLIDE 2

Today’s Data Centers

  • O(100K) servers/data center
  • Very dense, maximize number of servers
  • Tens of MegaWatts
  • Strict power and cooling requirements
  • Secure, hot, noisy
  • Incrementally upgraded
  • 3 year server depreciation, upgraded quarterly
  • Applications change very rapidly (weekly, monthly)
  • Many advantages, including economies of scale, all data in one place, etc.
  • At data center scale, an improvement doesn't need to be an order of magnitude to make sense
  • Positive ROI is easier to achieve at large scale
  • How can we improve efficiencies?
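The scale argument is back-of-envelope arithmetic; all numbers below are made up for illustration and are not Microsoft's figures:

```python
# Illustrative (invented) numbers: at data-center scale, even a modest
# efficiency gain can pay back a large one-time engineering cost.
def payback_years(servers, cost_per_server_per_year, gain, nre):
    """Years until the yearly savings cover the non-recurring cost."""
    yearly_savings = servers * cost_per_server_per_year * gain
    return nre / yearly_savings

# 100K servers, $2K/server/year to operate, 10% gain, $50M one-time cost
years = payback_years(100_000, 2_000, 0.10, 50_000_000)
print(round(years, 1))  # 2.5
```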

SLIDE 3

Efficiency via Specialization

[Chart: energy efficiency of ASICs vs. FPGAs vs. general-purpose processors. Source: Bob Brodersen, Berkeley Wireless group]

SLIDE 4

What Does a Data Center Server With an FPGA Look Like?

Depends on your point of view

SLIDE 5

Classic View of Computer

[Diagram: classic view: a CPU with DRAM, network, and storage attached]

SLIDE 6

Networking View of Computer

[Diagram: CPU with DRAM and storage; an accelerator sits on the network path as a network "offload"]

SLIDE 7

“Offload” Accelerator view of Server

[Diagram: server with NIC, DRAM, storage, and several accelerators attached to the CPU, e.g. an Intel CPU+FPGA multi-chip package (MCP)]

SLIDE 8

Our View of a Data Center Computer

[Diagram: the FPGA sits between the network and the rest of the server (bump-in-the-wire), with its own DRAM, in front of the CPUs, their DRAM, an accelerator, and storage]

SLIDE 9

Benefits

  • Software receives packets slowly
  • Interrupt or polling
  • Parse packet, start right work
  • FPGA processes every packet anyway
  • Packet arrival is an event that the FPGA deals with
  • Identify FPGA work, pass CPU work to CPU
  • Map common-case work to FPGA
  • Processor never sees the packet
  • Can read/modify system memory to keep app state consistent
  • CPU is a complexity offload engine for the FPGA!
  • Many possibilities
  • Distributed machine learning
  • Software-defined networking
  • Memcached get
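The common-case/complexity split can be sketched in a few lines; the memcached-style handling and every name here are hypothetical illustrations, not the production design:

```python
# Sketch (not Microsoft's implementation): a bump-in-the-wire FPGA sees
# every packet and handles the common case itself, punting the rest to
# the CPU. Here the "FPGA" serves memcached-style GETs from a local table.

cache = {"user:42": "alice"}   # state kept consistent with host memory
cpu_queue = []                 # work handed off to the CPU

def on_packet(packet):
    op, _, key = packet.partition(" ")
    if op == "get" and key in cache:
        return f"VALUE {key} {cache[key]}"   # CPU never sees this packet
    cpu_queue.append(packet)                 # complexity offload to CPU
    return None

print(on_packet("get user:42"))     # VALUE user:42 alice
print(on_packet("set user:7 bob"))  # None (handed to the CPU)
```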

SLIDE 10

Converged Bing/Azure Architecture

[Diagram: WCS 2.0 server blade with Catapult v2 mezzanine card. The FPGA connects to the CPUs over PCIe Gen3 links (2x8 and x8, with QPI between the CPUs) and sits between the NIC and the top-of-rack switch via 40 Gb/s QSFP ports. Also shown: WCS Gen4.1 blade with Mellanox NIC and Catapult FPGA, "Pikes Peak", WCS tray backplane, option-card mezzanine connectors]

  • Completely flexible architecture:
  • 1. Local compute accelerator
  • 2. Remote compute accelerator
  • 3. Network/storage accelerator

SLIDE 11

Network Connectivity (IP)

SLIDE 12

Case 1: Local compute accelerator
Bing Ranking as a Service

SLIDE 13

Bing Document Ranking Flow

[Diagram: a query fans out to Selection-as-a-Service servers (SaaS 1..48, each with IFM 1..44); the selected documents go to Ranking-as-a-Service servers (RaaS 1..48), which return the "10 blue links"]

Selection-as-a-Service (SaaS)

  • Find all docs that contain query terms
  • Filter and select candidate documents for ranking

Ranking-as-a-Service (RaaS)

  • Compute scores for how relevant each selected document is for the search query
  • Sort the scores and return the results
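The two stages can be sketched as a toy software pipeline; the tiny in-memory corpus and the naive term-count score are purely illustrative:

```python
# Toy sketch of the two stages: Selection finds candidate documents that
# contain all query terms; Ranking scores, sorts, and returns the top links.

DOCS = {
    "d1": "fpga configuration in the cloud",
    "d2": "cloud cooking recipes",
    "d3": "fpga configuration and fpga fabric",
}

def select(query):
    terms = query.lower().split()
    return [d for d, text in DOCS.items() if all(t in text for t in terms)]

def rank(query, candidates, top_n=10):
    terms = query.lower().split()
    score = lambda d: sum(DOCS[d].count(t) for t in terms)
    return sorted(candidates, key=score, reverse=True)[:top_n]

print(rank("fpga configuration", select("fpga configuration")))
# ['d3', 'd1']
```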

SLIDE 14

FE: Feature Extraction

Query: “FPGA Configuration”

NumberOfOccurrences_0 = 7
NumberOfOccurrences_1 = 4
NumberOfTuples_0_1 = 1

[Diagram: a {Query, Document} pair yields ~4K dynamic features, which feed ~2K synthetic features and an L2 model that produces the document score]
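A software sketch of the dynamic features named above, assuming simple whitespace tokenization (the real FE hardware computes thousands of such features per {query, document} pair):

```python
# Illustrative feature extraction: per-term occurrence counts plus an
# adjacent-tuple count, matching the feature names on this slide.
def extract_features(query, doc):
    q = query.lower().split()
    d = doc.lower().split()
    feats = {f"NumberOfOccurrences_{i}": d.count(t) for i, t in enumerate(q)}
    # count adjacent occurrences of (term0, term1) in the document
    feats["NumberOfTuples_0_1"] = sum(
        1 for a, b in zip(d, d[1:]) if (a, b) == (q[0], q[1])
    )
    return feats

doc = "fpga " * 6 + "fpga configuration " + "configuration " * 3
print(extract_features("FPGA Configuration", doc))
# {'NumberOfOccurrences_0': 7, 'NumberOfOccurrences_1': 4, 'NumberOfTuples_0_1': 1}
```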

SLIDE 15

[Diagram: Feature Extraction accelerator. Compressed documents arrive over PCIe; a stream-preprocessing FSM and distribution latches pass control/data tokens to the feature state machines; a feature-gathering network collects results for the Free-Form Expression (FFE) stage]

SLIDE 16

Bing Production Results

[Chart: 99.9th-percentile query latency vs. queries/sec for software and FPGA ranking in Bing production, showing average software load, 99.9% software latency, 99.9% FPGA latency, and average FPGA query load]

SLIDE 17

Case 2: Remote accelerator

SLIDE 18

Feature Extraction FPGA faster than needed

  • Single feature extraction FPGA much faster than a single server
  • Wasted capacity and/or wasted FPGA resources
  • Two choices:
  • Somehow reduce performance and save FPGA resources
  • Allow multiple servers to use a single FPGA?
  • Use the network to transfer requests and return responses
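The sharing argument is simple arithmetic; the throughput numbers below are invented for illustration, not measured Catapult rates:

```python
# Back-of-envelope (made-up rates): if one feature-extraction FPGA
# sustains far more requests than one server generates, several servers
# can share it over the network instead of leaving each FPGA mostly idle.
def servers_per_fpga(fpga_docs_per_sec, server_docs_per_sec, headroom=0.8):
    """How many servers one FPGA can serve while staying at `headroom` utilization."""
    return int(fpga_docs_per_sec * headroom // server_docs_per_sec)

print(servers_per_fpga(200_000, 20_000))  # 8
```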

SLIDE 19

Inter-FPGA communication

[Diagram: two racks, each with a ToR switch and four servers (CS0..CS3 and SP0..SP3); in every server the FPGA sits between the NIC and the ToR. L0 spans one ToR; L1/L2 span the switch hierarchy]

  • FPGAs can encapsulate their own UDP packets
  • Low-latency inter-FPGA communication (LTL)
  • Can provide strong network primitives
  • But this topology opens up other opportunities
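Encapsulating a well-formed UDP datagram (RFC 768: source port, destination port, length, checksum) is cheap enough for hardware; this sketch shows only the framing, and the port numbers and zero checksum are illustrative, not the LTL wire format:

```python
import struct

# Sketch: an FPGA that "encapsulates its own UDP packets" prepends an
# 8-byte RFC 768 header (all fields big-endian 16-bit) to its payload.
def udp_encapsulate(src_port, dst_port, payload, checksum=0):
    length = 8 + len(payload)          # UDP length covers header + payload
    header = struct.pack("!HHHH", src_port, dst_port, length, checksum)
    return header + payload

dgram = udp_encapsulate(5000, 5001, b"ltl-message")
print(len(dgram))  # 19
```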

SLIDE 20

Lightweight Transport Layer (LTL) Latencies

[Charts: round-trip latency (µs, 5-25) vs. number of reachable hosts/FPGAs (1 to 1M, log scale) for LTL at L0 (same ToR), L1, and L2, with example latency histograms for different FPGA pairs. LTL average and 99.9th-percentile latencies are compared against a 6x8 torus, which can reach only up to 48 FPGAs, while LTL scales to 10K / 100K / 250K reachable hosts]

SLIDE 21

Hardware Acceleration as a Service Across Data Center (or even across Internet)

[Diagram: FPGAs pooled across racks (ToRs and cluster switches) and allocated as a service to different workloads: Bing Ranking SW, Bing Ranking HW, HPC, speech to text, large-scale deep learning]

SLIDE 22

BrainWave: Scaling FPGAs To Ultra-Large DNN Models

  • Distribute NN models across as many FPGAs as needed (up to thousands)
  • Use HaaS and LTL to manage multi-FPGA execution
  • Very close to live production
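A greedy sketch of distributing a model across FPGAs, assuming each FPGA holds a fixed budget of on-chip model state; the capacities and layer sizes are made-up units, and this is not the BrainWave scheduler:

```python
# Illustrative partitioner: pack consecutive layers onto an FPGA until
# its capacity is exceeded, then move on to the next FPGA.
def partition_layers(layer_sizes, fpga_capacity):
    fpgas, current, used = [], [], 0
    for size in layer_sizes:
        if used + size > fpga_capacity and current:
            fpgas.append(current)          # this FPGA is full
            current, used = [], 0
        current.append(size)
        used += size
    fpgas.append(current)
    return fpgas

print(partition_layers([4, 4, 3, 5, 2, 6], fpga_capacity=8))
# [[4, 4], [3, 5], [2, 6]]
```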

SLIDE 23

BrainWave Publicly Demoed

  • Ignite 2016
  • Translation DNN running on FPGAs
  • 2 orders of magnitude lower latency than the CPU implementation
  • < 10% of the power

SLIDE 24

Case 3: Networking accelerator

SLIDE 25

FPGA SmartNIC for Cloud Networking

  • Azure runs Software Defined Networking on the hosts
  • Software Load Balancer, Virtual Networks – new features each month
  • Before, we relied on ASICs to scale and to be COGS-competitive at 40G+
  • But 12 to 18 month ASIC cycle + time to roll out new HW is too slow to keep up with SDN
  • SmartNIC gives us the agility of SDN with the speed and COGS of HW
  • Base SmartNIC provides common functions like crypto, GFT, QoS, RDMA on all hosts

[Table: VFP layered flow tables, each mapping rules to actions: SLB Decap -> Decap, SLB NAT -> DNAT, VNET -> Rewrite, ACL -> Allow, Metering -> Meter]

[Diagram: SmartNIC (50G) with GFT flow table, crypto, RDMA, and QoS blocks. VFP in the host VMSwitch installs per-flow actions into GFT, e.g. flow -> Decap, DNAT, Rewrite (1.2.3.1->1.3.4.1, 62362->80), Meter. VMs bypass the host via SR-IOV]
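The GFT idea, match a flow and apply a compiled action list, with the first packet of each flow taking the software exception path through VFP, can be sketched as follows (field names and actions are illustrative, not the real GFT schema):

```python
# Sketch of a GFT-style match/action table: the first packet of a flow
# misses and goes to VFP (software), which installs a per-flow action
# list; subsequent packets hit the "hardware" fast path.

flow_table = {}   # 5-tuple -> compiled action list

def vfp_compile(flow):
    # software slow path: decide this flow's transformations
    return ["decap", "dnat", "rewrite", "meter"]

def process(flow):
    if flow not in flow_table:
        flow_table[flow] = vfp_compile(flow)   # first packet: exception path
        return "slow-path", flow_table[flow]
    return "fast-path", flow_table[flow]       # later packets: FPGA applies actions

flow = ("1.2.3.1", 62362, "1.3.4.1", 80, "tcp")
print(process(flow)[0])  # slow-path
print(process(flow)[0])  # fast-path
```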

SLIDE 26

Azure Accelerated Networking

  • SR-IOV turned on
  • VM accesses NIC hardware directly, sends messages with no OS/hypervisor call
  • FPGA determines the flow of each packet, rewrites the header to make it data-center compatible
  • Reduces latency to roughly bare metal
  • Azure now has the fastest public cloud network
  • 25 Gb/s at 25 µs latency
  • Fast crypto developed

[Diagram: the guest OS in the VM talks to a virtual NIC directly; the hypervisor/VFP path is bypassed, with GFT on the FPGA handling per-packet processing]

SLIDE 27

We Are Hiring and Collaborating

  • We are hiring FPGA and software folks
  • Academic engagements
  • Research.Microsoft.com/catapult
  • Will provide boards to a limited number of academics (1 page proposal)
  • Will be giving access to clusters of up to 48 at TACC
  • Research grants
  • Internships
  • Please contact me if you’re interested
  • dechiou@microsoft.com
  • catapult@Microsoft.com

SLIDE 28

Will Configurable Clouds Change the World?

  • Being deployed for all new Azure and Bing machines
  • Many other properties as well
  • Ability to reprogram a datacenter’s hardware
  • Specialized compute acceleration
  • Networking, storage, security
  • Can turn homogeneous machines into specialized SKUs dynamically
  • Hyperscale performance with low-latency communication
  • Exa-ops of performance with an O(10 µs) diameter
  • What should we do with the world's most powerful configurable fabric?
