SLIDE 1

Heterogeneous Datacenters: Options and Opportunities

Jason Cong¹, Muhuan Huang¹,², Di Wu¹,², Cody Hao Yu¹

¹ Computer Science Department, UCLA
² Falcon Computing Solutions, Inc.

SLIDE 2

Data Center Energy Consumption is a Big Deal

In 2013, U.S. data centers consumed an estimated 91 billion kilowatt-hours of electricity, projected to grow to roughly 140 billion kilowatt-hours annually by 2020. That is equivalent to:

  • the annual output of 50 large (500-megawatt, coal-fired) power plants
  • about $13 billion per year in electricity costs
  • about 100 million metric tons of carbon pollution per year

https://www.nrdc.org/resources/americas-data-centers-consuming-and-wasting-growing-amounts-energy
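A rough sanity check on the power-plant equivalence (my own arithmetic, assuming a 500 MW plant running at roughly a 65% capacity factor, which is not stated in the source):

  500 MW × 8760 h/yr × 0.65 ≈ 2.8 billion kWh/yr per plant
  140 billion kWh/yr ÷ 2.8 billion kWh/yr ≈ 50 plants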

SLIDE 3

Extensive Efforts on Improving Datacenter Energy Efficiency

◆ Understand the scale-out workloads

  § ISCA’10, ASPLOS’12
  § Mismatch between workloads and processor designs
  § Modern processors are over-provisioned

◆ Trade-off of big-core vs. small-core

  § ISCA’10: web search on small cores with better energy efficiency
  § Baidu taps Marvell for ARM storage server SoC

SLIDE 4

Focus of Our Research (since 2008) -- Customization

[Figure] Parallelization → Customization (source: Shekhar Borkar, Intel)

Customization: adapt the architecture to the application domain

SLIDE 5

SLIDE 6

Computing Industry is Looking at Customization Seriously

◆ FPGAs gaining popularity among tech giants

  § Microsoft, IBM, Baidu, etc.

◆ Intel’s acquisition of Altera

  § Intel prediction: 30% of datacenter nodes with FPGAs by 2020

[Images] Microsoft Catapult, Intel HARP

SLIDE 7

Contributions of This Paper

◆ Evaluation of different integration options of heterogeneous technologies in datacenters

◆ Efficient programming support for heterogeneous datacenters

SLIDE 8

Small-core on Compute-intensive Workloads

Normalized execution time (relative to the Xeon reference):
                LR       KM
  8x ARM       10.97     7.8
  8x Atom       5.26     3.13

Normalized energy (relative to the Xeon reference):
                LR       KM
  8x ARM        6.86     4.88
  8x Atom       5.21     3.1

◆ Data set

  § MNIST, 700K samples
  § 784 features, 10 labels

◆ Results

  § Normalized to reference Xeon performance

◆ Baselines

  § Xeon: Intel E5-2620, 12-core CPU, 2.40 GHz
  § Atom: Intel D2500, 1.8 GHz
  § ARM: Cortex-A9 in Zynq, 800 MHz

◆ Power consumption (averaged)

  § Xeon: 175 W/node
  § Atom: 30 W/node
  § ARM: 10 W/node

SLIDE 9

Small Cores Alone Are Not Efficient!

SLIDE 10

Small Core + ACC: FARM

◆ Boost small-core performance with FPGAs

  • 8 Xilinx ZC706 boards
  • 24-port Ethernet switch
  • ~100 W power

SLIDE 11

Small-core with FPGA Performance

Normalized execution time (relative to the Xeon reference):
                LR       KM
  8x ARM       10.97     7.8
  8x Atom       5.26     3.13
  8x Zynq       0.69     1.06

Normalized energy (relative to the Xeon reference):
                LR       KM
  8x ARM        6.86     4.88
  8x Atom       5.21     3.1
  8x Zynq       0.43     0.66

◆ Setup

§ Data set

  • MNIST 700K Samples
  • 784 Features, 10 Labels

§ Power consumption (averaged)

  • Atom: 30W/node
  • ARM: 10W/node

◆ Results

§ Normalized to reference Xeon performance

SLIDE 12

Small Cores + FPGAs Are More Interesting!

SLIDE 13

Inefficiencies in Small-core

◆ Slower core and memory clock

  § Task scheduling is slow
  § JVM-to-FPGA data transfer is slow

◆ Limited DRAM size and Ethernet bandwidth

  § Slow data shuffling between nodes

◆ Another option: Big-core + FPGA

SLIDE 14

Big-Core + ACC: CDSC FPGA-Enabled Cluster

◆ A 24-node cluster with FPGA-based accelerators

  § Run on top of Spark and Hadoop (HDFS)

Cluster: 1 master/driver, 22 workers, 1 file server, 1 10GbE switch

Each node:
  1. Two Xeon processors
  2. One FPGA PCIe card (Alpha Data)
  3. 64 GB RAM
  4. 10GbE NIC

Alpha Data board:
  1. Virtex-7 FPGA
  2. 16 GB on-board RAM

SLIDE 15

Experimental Results

◆ Experimental setup

  § Data set
    • MNIST, 700K samples
    • 784 features, 10 labels

◆ Results

  § Normalized to reference Xeon performance

Normalized execution time (relative to the Xeon reference):
                 LR       KM
  1x Xeon+AD    0.33     0.5
  8x Zynq       0.69     1.06

Normalized energy (relative to the Xeon reference):
                 LR       KM
  1x Xeon+AD    0.38     0.56
  8x Zynq       0.43     0.66
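One way to read the two charts together (my own rough check, assuming normalized energy ≈ normalized execution time × the ratio of cluster power to the Xeon reference power): for LR on 1x Xeon+AD, 0.38 / 0.33 ≈ 1.15, which corresponds to roughly 1.15 × 175 W ≈ 200 W, i.e. a Xeon node plus a PCIe FPGA card drawing a few tens of watts.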

SLIDE 16

Overall Evaluation Results

◆ Based on two machine learning workloads

  § Normalized performance (speedup) and energy efficiency (performance/W) relative to the big-core solution

                     Performance     Energy Efficiency
  Big-Core+FPGA      Best   | 2.5    Best   | 2.6
  Small-Core+FPGA    Better | 1.2    Best   | 1.9
  Big-Core           Good   | 1.0    Good   | 1.0
  Small-Core         Bad    | 0.25   Bad    | 0.24

SLIDE 17

Contributions of This Paper

◆ Evaluation of different integration options of heterogeneous technologies in datacenters

◆ Efficient programming support for heterogeneous datacenters

  § Heterogeneity makes programming hard!

SLIDE 18

More Heterogeneity in the Future

[Images] Intel-Altera HARP, AlphaData PCIe FPGA, NVIDIA GPU, datacenter

SLIDE 19

Programming Challenges

Application → Accelerator-Rich Systems

  • Lots of setup/initialization code
  • Too much hardware-specific knowledge required
      • Data transfer between host and accelerator
      • Manual data partitioning and task scheduling
  • Only supports a single application
  • Lack of portability

Java-to-FPGA: JNI (see the sketch below)

FPGA-as-a-Service
  • Sharing, isolation

Heterogeneous hardware:
  • OpenCL (SDAccel)
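To make the JNI pain point concrete, here is a minimal sketch (my own illustration, not code from the paper) of the manual Java/Scala-to-FPGA path; the library name and the native method are hypothetical:

  // Hand-written JNI glue: the application must load a native wrapper library,
  // marshal data across the JVM boundary, and manage the accelerator itself.
  object ManualFpgaGradient {
    System.loadLibrary("logistic_fpga_jni")  // hypothetical JNI wrapper (.so)

    // Hypothetical native entry point: copies samples and weights to the FPGA,
    // runs the kernel, and copies the gradient back into a JVM array.
    @native def logisticGradient(samples: Array[Double],
                                 weights: Array[Double],
                                 numSamples: Int): Array[Double]
  }

Every application that wants acceleration has to repeat this kind of setup, data-transfer, and partitioning logic by hand, which is the boilerplate that the Blaze runtime introduced on the next slides takes over.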
SLIDE 20

Introducing Blaze Runtime System

[Diagram] The Blaze runtime sits between applications and accelerator-rich systems: client frameworks (Spark via AccRDD, Hadoop MapReduce, and others) talk to a per-node ACC Manager, which manages the local accelerators (FPGA, GPU, other ACCs).

Blaze Runtime System
  • Friendly interface
  • User-transparent scheduling
  • Efficient
SLIDE 21

Overview of Deployment Flow

[Diagram] Application designer: the user application sends ACC requests and input data to the Node ACC Manager and receives output data back. Accelerator designer: accelerators (FPGA/GPU ACCs) are registered with the manager (ACC register) and recorded in an ACC look-up table.

SLIDE 22

Programming Interface for Application

Original Spark code (logistic regression):

  val points = sc.textFile().cache()
  for (i <- 1 to ITERATIONS) {
    val gradient = points.map(p =>
      (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
    ).reduce(_ + _)
    w -= gradient
  }

With Blaze:

  val points = blaze.wrap(sc.textFile())
  for (i <- 1 to ITERATIONS) {
    val gradient = points.map(
      new LogisticGrad(w)
    ).reduce(_ + _)
    w -= gradient
  }

  class LogisticGrad(..) extends Accelerator[T, U] {
    val id: String = "Logistic"
    def call(in: T): U = {
      p => (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
    }
  }

Spark.mllib Integration

  • No user code changes or recompilation
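A note on the design (my reading of the interface, not stated verbatim on this slide): the lambda passed to map is replaced by an Accelerator[T, U] subclass whose id names an accelerator registered with the node manager, while call() keeps the pure-Scala computation, presumably so Blaze can fall back to CPU execution when no matching accelerator is available.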

SLIDE 23

Under the Hood: Getting data to FPGA

[Diagram] Data path: Spark task → inter-process memcpy → NAM → PCIe memcpy → FPGA device

Time breakdown:
  • Receive data: 17.99%
  • Data preprocessing: 17.99%
  • Data transfer: 32.13%
  • FPGA computation: 28.06%
  • Other: 3.84%

Solutions
  • Data caching
  • Pipelining
(a sketch of the caching idea follows below)
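To illustrate the data-caching solution, here is a minimal sketch (my own, under the assumption that the node manager can key already-transferred input blocks by a block ID; the class and method names are hypothetical, not the Blaze API). Iterative workloads such as LR and KMeans then pay the inter-process and PCIe copies only once per block:

  import scala.collection.concurrent.TrieMap

  // Hypothetical handle to a buffer already staged on the native/FPGA side.
  final case class NativeBuffer(handle: Long, bytes: Long)

  class BlockCache(copyToNative: Array[Double] => NativeBuffer) {
    // blockId -> buffer that has already been transferred
    private val cache = TrieMap.empty[Long, NativeBuffer]

    // Transfer the block on first use; later iterations reuse the cached buffer.
    def getOrTransfer(blockId: Long, data: => Array[Double]): NativeBuffer =
      cache.getOrElseUpdate(blockId, copyToNative(data))

    def invalidate(blockId: Long): Unit = cache.remove(blockId)
  }

In the logistic-regression loop shown earlier, the (immutable) training points would hit this cache on every iteration, so only the small updated weight vector still needs to cross the JVM-to-FPGA boundary.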

SLIDE 24

Programming Interface for Accelerator

  class LogisticACC : public Task {      // extend the basic Task interface
    LogisticACC(): Task(2) {;}           // specify # of inputs

    // overwrite the compute function
    virtual void compute() {
      // get input/output using provided APIs
      int num_sample  = getInputNumItems(0);
      int data_length = getInputLength(0) / num_sample;
      int weight_size = getInputLength(1);
      double *data    = (double*)getInput(0);
      double *weights = (double*)getInput(1);
      double *grad    = (double*)getOutput(0, weight_size, sizeof(double));
      double *loss    = (double*)getOutput(1, 1, sizeof(double));

      // perform computation
      RuntimeClient runtimeClient;
      LogisticApp theApp(out, in, data_length * sizeof(double), &runtimeClient);
      theApp.run();
    }
  };

Compile to ACC_Task (*.so)
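Presumably (my inference from the deployment-flow slide, not stated here), the compiled ACC_Task shared object is what the accelerator designer registers with the node accelerator manager's ACC look-up table, and the id string used on the Spark side ("Logistic") is how application tasks get matched to it.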

SLIDE 25

Complete Accelerator Management Solution

◆ Global Accelerator Manager (GAM)

  § Global resource allocation within a cluster to optimize system throughput and accelerator utilization

◆ Node Accelerator Manager (NAM)

  § Virtualize accelerators for application tasks
  § Provide accelerator sharing/isolation

[Diagram] Frameworks (MapReduce, Spark, MPI, and others) obtain resources through YARN and the Global Acc Manager; each node runs a Node Acc Manager in front of its accelerators, on top of a distributed file system (HDFS).

SLIDE 26

Global Accelerator Management

◆ Places accelerable applications on nodes with accelerators

  § Minimizes FPGA reprogramming overhead via accelerator-locality-aware scheduling (see the sketch below)
  § → less reprogramming and less straggler effect

◆ Heartbeats with the node accelerator managers to collect accelerator status

Naïve allocation: multiple applications share an FPGA, so frequent reprogramming is needed.
Better allocation: applications on a node use the same accelerator, so no reprogramming is needed.
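The following is a minimal sketch of what accelerator-locality-aware placement can look like; it is my illustration, not the GAM implementation, and the names (NodeState, placeTask) are hypothetical. The policy is simply to prefer a node whose FPGA is already programmed with the requested accelerator and to accept a reprogram only when no such node has capacity:

  // Hypothetical view of a node as seen by a global manager.
  final case class NodeState(nodeId: String,
                             loadedAccId: Option[String], // accelerator currently on the FPGA
                             freeSlots: Int)              // remaining task capacity

  object LocalityAwarePlacement {
    // Pick a node for a task that wants accelerator `accId`:
    // 1) prefer nodes already programmed with `accId` (no reprogramming),
    // 2) otherwise fall back to any node with capacity and reprogram its FPGA.
    def placeTask(accId: String, nodes: Seq[NodeState]): Option[NodeState] = {
      val candidates = nodes.filter(_.freeSlots > 0)
      val local      = candidates.filter(_.loadedAccId.contains(accId))
      (if (local.nonEmpty) local else candidates)
        .sortBy(n => (-n.freeSlots, n.nodeId))   // prefer the least-loaded node
        .headOption
    }
  }

Under this policy, tasks that need the same accelerator end up co-located, which matches the "better allocation" case above: each FPGA is programmed once and then reused.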

SLIDE 27

Results of GAM Optimizations

  • static partition: manual cluster partition into a KM cluster and an LR cluster
  • naïve sharing: workloads are free to run on any node

[Chart] Throughput (normalized) vs. the ratio of LR workloads in the mixed LR-KM workload, comparing static partition, naïve sharing, and GAM.

SLIDE 28

Productivity and Efficiency of Blaze

Lines of Code (LOC):

  Workload              Application   Accelerator setup   Partial FaaS
  Logistic Regression   26 → 9        104 → 99            325 → 0
  Kmeans                37 → 7        107 → 103           364 → 0
  Compression           1 → 1         70 → 65             360 → 0
  Genomics              1 → 1         227 → 142           896 → 0

The accelerator kernels are wrapped as a function.
Partial FaaS does not support accelerator sharing among different applications.

SLIDE 29

Blaze Overhead Analysis

◆ Blaze has overhead compared to a manual design
◆ With multiple threads the overhead can be effectively mitigated

[Charts] (1) Throughput normalized to the manual design, 1 thread vs. 12 threads, for software, manual, and Blaze. (2) Execution time (ms) breakdown for manual vs. Blaze: JVM-to-native, native-to-FPGA, FPGA kernel, native-private-to-shared.

SLIDE 30

Acknowledgements – CDSC and C-FAR

Center for Domain-Specific Computing (CDSC) under the NSF Expeditions in Computing Program

C-FAR Center under the STARnet Program

Bingjun Xiao (UCLA), Hui Huang (UCLA), Muhuan Huang (UCLA), Di Wu (UCLA), Yuting Chen (UCLA), Cody Hao Yu (UCLA)