Heterogeneous Datacenters: Options and Opportunities
Jason Cong1, Muhuan Huang1,2, Di Wu1,2, Cody Hao Yu1
1 Computer Science Department, UCLA 2 Falcon Computing Solutions, Inc.
Data Center Energy Consumption is a Big Deal
In 2013, U.S.
(https://www.nrdc.org/resources/americas-data-centers-consuming-and-wasting-growing-amounts-energy)
Parallelization (source: Shekhar Borkar, Intel)
Customization: adapt the architecture to the application domain
Microsoft Catapult, Intel HARP
[Chart: normalized execution time — LR: 10.97 (8x ARM), 5.26 (8x Atom); KM: 7.8 (8x ARM), 3.13 (8x Atom). Normalized energy — LR: 6.86 (8x ARM), 5.21 (8x Atom); KM: 4.88 (8x ARM), 3.1 (8x Atom).]
[Chart: normalized execution time — LR: 10.97 (8x ARM), 5.26 (8x Atom), 0.69 (8x Zynq); KM: 7.8 (8x ARM), 3.13 (8x Atom), 1.06 (8x Zynq). Normalized energy — LR: 6.86 (8x ARM), 5.21 (8x Atom), 0.43 (8x Zynq); KM: 4.88 (8x ARM), 3.1 (8x Atom), 0.66 (8x Zynq).]
Cluster setup: 1 master/driver, 22 workers, 1 file server, 1 10GbE switch. Each node has an Alpha Data FPGA board with on-board RAM.
[Chart: normalized execution time — LR: 0.33 (1x Xeon+AD), 0.69 (8x Zynq); KM: 0.5 (1x Xeon+AD), 1.06 (8x Zynq). Normalized energy — LR: 0.38 (1x Xeon+AD), 0.43 (8x Zynq); KM: 0.56 (1x Xeon+AD), 0.66 (8x Zynq).]
Java-to-FPGA integration via JNI; FPGA-as-a-Service.
[Diagram: client frameworks (Spark with AccRDD, Hadoop MapReduce, and other clients) send work to per-node ACC Managers, which manage the heterogeneous hardware (FPGA, GPU, and other ACCs) on each node.]
[Diagram: the user application (application designer's side) sends ACC requests and input data to the Node ACC Manager; accelerator designers register their ACC implementations (ACC register) in an ACC look-up table; the manager dispatches requests to FPGA/GPU ACCs and returns output data to the application.]
Original Spark code:

    val points = sc.textFile().cache()
    for (i <- 1 to ITERATIONS) {
      val gradient = points.map(p =>
        p.x * (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y
      ).reduce(_ + _)
      w -= gradient
    }

With Blaze:

    val points = blaze.wrap(sc.textFile())
    for (i <- 1 to ITERATIONS) {
      val gradient = points.map(new LogisticGrad(w)).reduce(_ + _)
      w -= gradient
    }

    class LogisticGrad(..) extends Accelerator[T, U] {
      val id: String = "Logistic"
      def call(p: T): U = {
        p.x * (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y
      }
    }
[Diagram: a Spark task sends data to the Node ACC Manager (NAM) via inter-process memcpy, then to the FPGA device via PCIe memcpy.]
Time breakdown: receive data 17.99%, data preprocessing 17.99%, data transfer 32.13%, FPGA computation 28.06%, other 3.84%.
    class LogisticACC : public Task {  // extend the basic Task interface
    public:
      LogisticACC() : Task(2) {}       // specify # of inputs

      // overwrite the compute function
      virtual void compute() {
        // get input/output using provided APIs
        int num_sample  = getInputNumItems(0);
        int data_length = getInputLength(0) / num_sample;
        int weight_size = getInputLength(1);
        double *data    = (double*)getInput(0);
        double *weights = (double*)getInput(1);
        double *grad    = (double*)getOutput(0, weight_size, sizeof(double));
        double *loss    = (double*)getOutput(1, 1, sizeof(double));

        // perform computation
        RuntimeClient runtimeClient;
        LogisticApp theApp(out, in, data_length * sizeof(double), &runtimeClient);
        theApp.run();
      }
    };

Compiled to ACC_Task (*.so)
[Diagram: overall stack — frameworks (MapReduce, Spark, MPI, and other frameworks) run on YARN with a Global ACC Manager; each node runs a Node ACC Manager over its local accelerators (acc); storage is provided by the Hadoop Distributed File System (HDFS).]
Naïve allocation: multiple applications share an FPGA, so frequent reprogramming is needed. Better allocation: applications on the same node use the same accelerator, so no reprogramming is needed.
[Chart: throughput ratio of LR workloads in the mixed LR-KM workloads, for LR ratios of 1, 0.8, 0.6, 0.5, 0.4, and 0.2, comparing static partition (baseline 1.0), naïve sharing, and GAM.]
[Chart: throughput normalized to the manual design, at 1 thread and 12 threads, for Software, Manual, and Blaze implementations.]
[Chart: execution time (ms) breakdown for Manual vs. Blaze: JVM-to-native, Native-to-FPGA, FPGA-kernel, and Native-private-to-share.]
Acknowledgments: Bingjun Xiao (UCLA), Hui Huang (UCLA), Muhuan Huang (UCLA), Di Wu (UCLA), Yuting Chen (UCLA), Cody Hao Yu (UCLA)