Modeling Communication Costs in Blade Servers
Qiuyun Wang, Benjamin Lee (Duke University)


SLIDE 1

Duke Computer Architecture

Modeling Communication Costs in Blade Servers

Qiuyun Wang, Benjamin Lee Duke University October 4th, 2015

SLIDE 2

Case for Blade Servers

An era of big data


SLIDE 3

Case for Blade Servers

An era of big data needs big memory.


  • Machines with large memory (e.g., HP Moonshot server cartridge)
  • Distributed memory systems

SLIDE 4


Figure 1: Two blade server nodes connected through Ethernet [1,2]

Case for Blade Servers


[1] K. Lim, J. Chang, T. Mudge, P. Ranganathan. Disaggregated memory for expansion and sharing in blade servers.
[2] R. Hou, T. Jiang, L. Zhang, P. Qi, J. Dong. Cost effective data center servers.

SLIDE 5

Figure 2: A server node design (blades 0-3) with inter-processor links (e.g., HyperTransport) and inter-blade links (e.g., PCIe).

Figure 1: Two blade server nodes connected through Ethernet [1,2]

Case for Blade Servers


Blade servers provide compute and memory capacity in a dense form factor.

SLIDE 6


Applications: in-memory computational frameworks

  • Big-data analytics frameworks: e.g., Spark
  • Graph workloads: e.g., GraphLab, Spark GraphX
  • In-memory databases: e.g., MonetDB

Challenges: hardware-software co-design costs both engineering effort and time. A fast and cost-effective way to understand the system is through technology models.

Case for Blade Servers

SLIDE 7

Motivation for Technology Models

We identify and derive key technology parameters and analyze their effects on system performance, throughput, and energy. These models can help to

  • choose hardware technologies and configurations
  • understand performance and energy impacts
  • close the loop for hardware and software co-design


SLIDE 8

  • 1. Derive technology models
  • 2. Characterize non-uniform memory access
  • 3. Develop NUMA-aware schedulers


Agenda

SLIDE 9

Figure 2: A blade server node design (blades 0-3) with inter-processor links (e.g., HyperTransport) and inter-blade links (e.g., PCIe).

Communication Technologies

  • Memory: DDR3
  • Inter-processor: HyperTransport, Intel QuickPath
  • Inter-blade: PCIe 3.0, InfiniBand
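As a rough illustration of how such per-link parameters turn into delay estimates, a first-order model adds fixed per-hop latency and payload serialization time. All numbers below are illustrative placeholders, not the surveyed values of Figure 3:

```python
def transfer_delay_ns(payload_bytes, hops, per_hop_latency_ns, bandwidth_gbps):
    """First-order delay: fixed latency per hop plus time to serialize
    the payload onto the link (payload_bytes * 8 bits at bandwidth_gbps)."""
    serialization_ns = payload_bytes * 8 / bandwidth_gbps  # bits / (Gb/s) = ns
    return hops * per_hop_latency_ns + serialization_ns

# Illustrative (not surveyed) numbers for moving a 64 B cache line:
local = transfer_delay_ns(64, hops=0, per_hop_latency_ns=0,   bandwidth_gbps=12.8)  # DDR3 channel
ip1   = transfer_delay_ns(64, hops=1, per_hop_latency_ns=40,  bandwidth_gbps=25.6)  # HyperTransport
ib    = transfer_delay_ns(64, hops=1, per_hop_latency_ns=500, bandwidth_gbps=8.0)   # PCIe-attached blade
```

Even this crude model reproduces the ordering that motivates the rest of the deck: local DRAM access is fastest, then inter-processor, then inter-blade.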
SLIDE 10


Figure 3: Derived and surveyed technology and architectural parameters

Delay and Energy Estimates

Key Estimates

SLIDE 11

  • Explore system organizations for blade servers
  • Analyze communication delay and energy
  • Address challenges in system management
  • e.g.: non-uniform memory access (NUMA)


With these Estimates

SLIDE 12

  • 1. Derive technology models
  • 2. Characterize non-uniform memory access
  • 3. Develop NUMA-aware schedulers


Agenda

SLIDE 13


  • Processors access different memory regions with different latencies: non-uniform memory access (NUMA)
  • NUMA degrades application performance
  • Multiple communication paths introduce multiple levels of NUMA

Figure 4: Single-thread performance (CPI, normalized to local access) degradation under NUMA for inter-processor and inter-blade access, for workloads including wordcount, PageRank, and logistic regression.

NUMA Effects
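The gap between compute- and memory-intensive workloads in Figure 4 can be reasoned about with a first-order CPI model; the parameters below are invented for illustration, not measured values:

```python
def remote_cpi(local_cpi, accesses_per_instr, miss_rate, extra_latency_cycles):
    """First-order NUMA model: every off-chip access that crosses a remote
    link stalls the instruction stream for extra_latency_cycles more cycles."""
    return local_cpi + accesses_per_instr * miss_rate * extra_latency_cycles

# The same 100-cycle link penalty hurts a memory-intensive workload far more:
compute_bound = remote_cpi(1.0, accesses_per_instr=0.3, miss_rate=0.005,
                           extra_latency_cycles=100)  # few off-chip accesses
memory_bound  = remote_cpi(1.0, accesses_per_instr=0.3, miss_rate=0.02,
                           extra_latency_cycles=100)  # many off-chip accesses
```

The model suggests why CPI degradation varies widely across benchmarks: it scales with how often a workload actually crosses the remote link.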

SLIDE 14


Figure 5: NUMA-aware scheduling algorithms [3]

[3] M. Zaharia et al. Delay scheduling: a simple technique for achieving locality and fairness in cluster scheduling.

NUMA-aware Scheduling Policies

SLIDE 15


  • Local execution
  • IP-1: inter-processor 1-hop execution
  • IP-2: inter-processor 2-hop execution
  • IB: inter-blade execution

Applications’ NUMA effects vary; throughput and latency goals differ. Choose the optimal policy accordingly.

NUMA-aware Scheduling Policies
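The guidelines above can be sketched as a simple placement selector; the utilization threshold and the exact rules are assumptions for illustration, not the deck's precise policy:

```python
def choose_placement(memory_intensive, utilization):
    """Pick an execution placement for a task.
    Compute-intensive tasks tolerate NUMA, so inter-blade (IB) placement
    keeps them out of the queue; memory-intensive tasks stay local unless
    the server is heavily loaded, when a 1-hop compromise is preferable."""
    if not memory_intensive:
        return "IB"      # CPI barely degrades; permit NUMA freely
    if utilization > 0.8:
        return "IP-1"    # selectively permit NUMA under high load
    return "Local"       # otherwise wait for a local slot
```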

SLIDE 16


Methods - NUMA Simulation

MARSSx86 + DRAMSim simulates CPUs with local DRAM connected by interconnects; additional latency is added for each type of communication path.

Characterize application sensitivity to NUMA over each type of communication technology
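A minimal sketch of that latency-injection idea (the cycle penalties are placeholders, not the actual MARSSx86 + DRAMSim hook):

```python
# Extra round-trip cycles injected on top of the simulated DRAM latency
# for each communication path -- placeholder values, not derived parameters.
EXTRA_LATENCY_CYCLES = {"Local": 0, "IP-1": 60, "IP-2": 120, "IB": 600}

def memory_access_cycles(dram_cycles, path):
    """Model a NUMA access as baseline DRAM latency plus the interconnect
    penalty for the chosen communication path."""
    return dram_cycles + EXTRA_LATENCY_CYCLES[path]
```

Sweeping the penalty table then characterizes each application's sensitivity to NUMA over each communication technology.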

SLIDE 17


Figure 7: Fraction of local vs. remote accesses per benchmark, assuming the heap is remote (x-axis: benchmark ID 1-20; y-axis: fraction of accesses).

Benchmarks:

  • 1-7: Apache Spark
  • 8-11: Phoenix MapReduce
  • 12-20: PARSEC 2.0

Methods - Remote vs Local

SLIDE 18


Figure 8: Queueing simulation parameters

  One blade server node:
  Cores per socket: 16
  Sockets per blade: 4
  Blades per node: 4
  Task size: 100M instructions
  Inter-arrival time: exponential distribution, λ = 6000 tasks/s
  Service time per core: # instructions / IPC / core frequency

Service time per core changes with NUMA effects. Inter-arrival time is varied to change system utilization.

Model task queues and analyze queueing dynamics.

Methods - Queueing Simulation
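The setup in Figure 8 can be sketched as a toy multi-core queueing simulation with Poisson arrivals and a deterministic service time of instructions / (IPC × frequency); the core count, IPC, and frequency below are assumed values, not the exact configuration:

```python
import random

def simulate_queue(n_tasks, arrival_rate, instructions, ipc, freq_hz, n_cores, seed=1):
    """Toy queueing simulation: exponential inter-arrival times feed a pool
    of cores; each task's service time is instructions / (IPC * frequency)."""
    rng = random.Random(seed)
    service_s = instructions / (ipc * freq_hz)
    free_at = [0.0] * n_cores          # time at which each core becomes free
    now, latencies = 0.0, []
    for _ in range(n_tasks):
        now += rng.expovariate(arrival_rate)            # next arrival time
        core = min(range(n_cores), key=lambda c: free_at[c])
        start = max(now, free_at[core])                 # queue if all cores busy
        free_at[core] = start + service_s
        latencies.append(free_at[core] - now)           # waiting + service
    return latencies

# NUMA lowers effective IPC, raising service time and lengthening the tail:
lat = simulate_queue(n_tasks=1000, arrival_rate=6000, instructions=100e6,
                     ipc=2.0, freq_hz=3e9, n_cores=256)
```

Sweeping `arrival_rate` varies utilization, and scaling `ipc` down for remote placements mimics the NUMA-dependent service times described above.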

SLIDE 19


Results — Throughput

  • Increase the system load to test the maximum sustained throughput.
  • Avoiding NUMA always increases throughput.
  • Compute-intensive: 7, 9-11, 13-20
  • Memory-intensive: 1-6, 8, 12

Figure: Maximum sustained throughput per benchmark (1-20) for Local, IP-1, and IP-2 placements, normalized to IB (roughly 0.9-1.6x).

SLIDE 20


Results — Latency/QoS

Figure: 95th-percentile response time at high utilization for Local, IP-1, and IP-2 placements, shown as speed-up relative to IB (roughly 0.8-2.2x) across benchmarks 1-20.

  • Permitting NUMA can improve the quality of service.
  • CI tasks should choose IB to permit NUMA.
  • MI tasks should choose IP-1 and IP-2 to selectively permit NUMA in highly loaded servers.

  • Compute-intensive: 7, 9-11, 13-20
  • Memory-intensive: 1-6, 8, 12
SLIDE 21


Results — Communication Energy

Figure: Data migration energy for inter-processor 1-hop, inter-processor 2-hop, and inter-blade links across benchmarks 1-20, normalized to remote access (roughly 1-8x).

  • Compute-intensive: 7, 9-11, 13-20
  • Memory-intensive: 1-6, 8, 12
  • 18-20 are out of scope
  • If data is near, remote access is more beneficial (3-4x) for saving energy.
  • If data is far, remote access is less beneficial because of high-cost links.
  • Energy benefits depend on page reuse rate and communication channels.

SLIDE 22


Results — Communication Channels

Figure 9: Link utilization percentages for application 1 under Local, IP-1, IP-2, and IB placements, broken down into local DRAM, inter-processor 1-hop, inter-processor 2-hop, and inter-blade links.

  • Use link utilization percentages to estimate average communication power.
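A sketch of that estimate, weighting each link's peak power by its utilization fraction; the peak-power numbers are placeholders, not the surveyed parameters:

```python
# Peak power per link type in watts -- illustrative placeholders.
PEAK_POWER_W = {"DRAM": 5.0, "IP-1": 10.0, "IP-2": 10.0, "IB": 15.0}

def average_power_w(link_utilization):
    """Estimate average communication power as the utilization-weighted
    sum of each active link's peak power (in the spirit of Figure 9)."""
    return sum(PEAK_POWER_W[link] * u for link, u in link_utilization.items())

# A local run touches only DRAM; an inter-blade run also loads the IB link:
local_w = average_power_w({"DRAM": 0.8})
ib_w    = average_power_w({"DRAM": 0.3, "IB": 0.25})
```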

SLIDE 23


Results — Communication Power

  • Compute-intensive: 7, 9-11, 13-20
  • Memory-intensive: 1-6, 8, 12
  • 12 is out-of-scope

Figure: Communication power (W) per benchmark (1-20) for Local, IP-1, IP-2, and IB placements.

  • HyperTransport and PCIe dissipate around 40 W and 60 W, respectively, at peak utilization.
  • Benchmarks S1-S6 suggest that these Spark workloads use about 25% of the link bandwidth.

SLIDE 24


  • Model blade servers for emerging big-data applications.
  • Study NUMA-aware schedulers and their effects on throughput, latency, and power.

  • Provide guidelines for choosing an optimal policy.

Future directions:

  • Extend validation to real system measurements.

Conclusions and Future Directions

SLIDE 25

Modeling Communication Costs in Blade Servers

Qiuyun Wang, Benjamin Lee Duke University October 4th, 2015