Duke Computer Architecture
Modeling Communication Costs in Blade Servers
Qiuyun Wang, Benjamin Lee
Duke University
October 4th, 2015
Case for Blade Servers
An era of big data needs big memory.
- Machines with large memory (e.g., HP Moonshot server cartridges)
- Distributed memory systems
Figure 1: Two blade server nodes connected through Ethernet [1,2]
Case for Blade Servers
[1] K. Lim, J. Chang, T. Mudge, P. Ranganathan. Disaggregated memory for expansion and sharing in blade servers.
[2] R. Hou, T. Jiang, L. Zhang, P. Qi, J. Dong. Cost-effective data center servers.
Figure 2: A server node design with inter-blade links (e.g., PCIe) and inter-processor links (e.g., HyperTransport) connecting four blades.
Case for Blade Servers
Blade servers provide compute and memory capacity in a dense form factor.
Applications: in-memory computational frameworks
- Big data analytical frameworks: e.g., Spark
- Graph workloads: e.g., GraphLab, Spark GraphX
- In-memory databases: e.g., MonetDB
Challenges: hardware-software co-design costs both engineering effort and time. A fast and cost-effective way to understand the system is through technology models.
Case for Blade Servers
Motivation for Technology Models
We identify and derive key technology parameters and analyze their effects on system performance, throughput, and energy. These models can help to
- choose hardware technologies and configurations
- understand performance and energy impacts
- close the loop for hardware and software co-design
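As an illustration of what such a technology model might look like, here is a minimal sketch in Python. All latency and energy figures are illustrative placeholders, not the surveyed parameters from the talk.

```python
# Toy per-link communication cost model. Latency and energy values
# below are illustrative placeholders, not surveyed parameters.
from dataclasses import dataclass

@dataclass
class Link:
    name: str
    latency_ns: float        # one-way latency per hop
    energy_pj_per_bit: float # dynamic energy per transferred bit

LINKS = {
    "DDR3": Link("DDR3", 50.0, 20.0),
    "HyperTransport": Link("HyperTransport", 40.0, 15.0),
    "PCIe3": Link("PCIe 3.0", 250.0, 35.0),
}

def transfer_cost(link: Link, payload_bytes: int, hops: int = 1):
    """Return (delay_ns, energy_nJ) for moving a payload over `hops` links."""
    delay = hops * link.latency_ns
    energy = payload_bytes * 8 * link.energy_pj_per_bit * hops / 1000.0
    return delay, energy
```

With such a model, comparing system organizations reduces to swapping parameter sets rather than rebuilding hardware.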
- 1. Derive technology models
- 2. Characterize non-uniform memory access
- 3. Develop NUMA-aware schedulers
Agenda
Figure 2: A blade server node design with inter-blade links (e.g., PCIe) and inter-processor links (e.g., HyperTransport).
Communication Technologies
- Memory: DDR3
- Inter-processor: HyperTransport, Intel QuickPath
- Inter-blade: PCIe 3.0, InfiniBand
Figure 3: Derived and surveyed technology and architectural parameters
Delay and Energy Estimates
Key Estimates
- Explore system organizations for blade servers
- Analyze communication delay and energy
- Address challenges in system management
- e.g.: non-uniform memory access (NUMA)
With these Estimates
- 1. Derive technology models
- 2. Characterize non-uniform memory access
- 3. Develop NUMA-aware schedulers
Agenda
- Processors access different memory regions with different latencies: non-uniform memory access (NUMA)
- NUMA degrades application performance
- Multiple communication paths introduce multiple levels of NUMA
Figure 4: Single-thread performance degradation (CPI normalized to local access) for NUMA accesses over inter-processor and inter-blade links, for workloads such as wordcount, PageRank, and logistic regression.
NUMA Effects
Figure 5: NUMA-aware scheduling algorithms [3]
[3] M. Zaharia et al. Delay scheduling: a simple technique for achieving locality and fairness in cluster scheduling.
NUMA-aware Scheduling Policies
- Local execution
- IP-1: inter-processor 1-hop execution
- IP-2: inter-processor 2-hop execution
- IB: inter-blade execution
Applications’ NUMA effects vary; throughput and latency goals differ. Choose the optimal policy accordingly.
NUMA-aware Scheduling Policies
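The policies above can be sketched as a placement rule: each policy caps how far from its data a task may execute, and the scheduler picks the closest acceptable free core or lets the task wait. This is an illustrative sketch in the spirit of delay scheduling, not the authors' implementation.

```python
# Illustrative NUMA-aware placement sketch (not the authors' code).
# Each policy permits execution up to a maximum distance level.

# Distance levels, from closest to farthest.
LEVELS = ["local", "ip1", "ip2", "ib"]

POLICY_MAX_LEVEL = {
    "Local": "local",  # never permit remote execution
    "IP-1": "ip1",     # permit 1-hop inter-processor execution
    "IP-2": "ip2",     # permit up to 2-hop inter-processor execution
    "IB": "ib",        # permit inter-blade execution
}

def acceptable(policy, placement):
    """True if a placement at this distance is allowed under the policy."""
    return LEVELS.index(placement) <= LEVELS.index(POLICY_MAX_LEVEL[policy])

def schedule(policy, free_cores):
    """Pick the closest acceptable free core, or None (task waits)."""
    candidates = [c for c in free_cores if acceptable(policy, c)]
    return min(candidates, key=LEVELS.index) if candidates else None
```

A restrictive policy (Local) trades queueing delay for fast memory access; a permissive one (IB) trades slow memory access for immediate execution.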
Methods - NUMA Simulation
- Simulate with MARSSx86 + DRAMSim
- Add additional latency for each type of communication path (interconnect)
Characterize application sensitivity to NUMA over each type of communication technology
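The latency-injection idea can be sketched as follows: a remote access is modeled as the local DRAM latency plus extra delay per hop over the traversed link. The latency values here are illustrative assumptions, not simulator outputs.

```python
# Sketch of latency injection for remote memory accesses.
# All latency values are illustrative assumptions.

LOCAL_DRAM_NS = 60.0  # assumed local DRAM access latency

# Added latency per hop, by link type (assumed values).
EXTRA_NS = {"ip_hop": 40.0, "ib": 500.0}

def access_latency(path):
    """Total access latency for a path given as a list of link hops,
    e.g. [] for local, ["ip_hop", "ip_hop"] for 2-hop inter-processor."""
    return LOCAL_DRAM_NS + sum(EXTRA_NS[hop] for hop in path)
```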
Figure 7: Fraction of remote vs. local accesses per benchmark (assuming the heap is remote; x-axis is benchmark ID, y-axis is the percentage of accesses).
Benchmarks:
- 1-7: Apache Spark
- 8-11: Phoenix MapReduce
- 12-20: PARSEC 2.0
Methods - Remote vs Local
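Under the stated assumption that heap data is remote while stack and static data stay local, the remote fraction can be computed by a simple counter over a memory-access trace. The trace format here is hypothetical.

```python
# Sketch of the remote-vs-local split, assuming heap accesses are
# remote and stack/static accesses are local. Trace format is
# hypothetical: an iterable of region tags.

def remote_fraction(trace):
    """Fraction of accesses classified as remote (heap)."""
    accesses = list(trace)
    if not accesses:
        return 0.0
    remote = sum(1 for region in accesses if region == "heap")
    return remote / len(accesses)
```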
Figure 8: Queueing simulation parameters (one blade server node):
- Cores per socket: 16
- Sockets per blade: 4
- Blades per node: 4
- Task size: 100M instructions
- Inter-arrival time: exponential distribution, λ = 6000 tasks/s
- Service time per core: # instructions / IPC / core frequency
- Service time per core changes based on NUMA effects.
- Change the inter-arrival time to vary system utilization.
Model task queues and analyze queueing dynamics.
Methods - Queueing Simulation
Results — Throughput
- Increase the system load to test the maximum sustained throughput.
- Avoiding NUMA always increases throughput.
- Compute-intensive: 7, 9-11, 13-20
- Memory-intensive: 1-6, 8, 12
(Plot: maximum sustained throughput per benchmark, normalized to IB, under the Local, IP-1, and IP-2 policies.)
Results — Latency/QoS
(Plot: speed-up in 95th-percentile response time relative to IB at high utilization, under the Local, IP-1, and IP-2 policies.)
- Permitting NUMA can improve the quality of service.
- CI tasks should choose IB to permit NUMA.
- MI tasks should choose IP-1 and IP-2 to selectively permit NUMA in highly loaded servers.
- Compute-intensive: 7, 9-11, 13-20
- Memory-intensive: 1-6, 8, 12
Results — Communication Energy
(Plot: data migration energy per benchmark, normalized to remote access, over inter-processor 1-hop, inter-processor 2-hop, and inter-blade links.)
- Compute-intensive: 7, 9-11, 13-20
- Memory-intensive: 1-6, 8, 12
- 18-20 are out of scope
- If data is near, remote access is more beneficial (3-4x) for saving energy.
- If data is far, remote access is less beneficial because of high-cost links.
- Energy benefits depend on page reuse rate and communication channels.
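The page-reuse trade-off can be illustrated with a simple break-even calculation: migrating a page pays a one-time transfer cost, after which every reuse is local, while staying remote pays the link cost on every access. This is a simplified sketch with assumed energy values, not the surveyed parameters.

```python
# Sketch of the migrate-vs-remote-access energy trade-off.
# Energy-per-bit values are illustrative assumptions.

PAGE_BYTES = 4096
CACHELINE_BYTES = 64

def migration_wins(reuses, link_pj_per_bit, local_pj_per_bit=20.0):
    """True if migrating a page costs less energy than serving
    `reuses` cache-line accesses remotely over the same link."""
    migrate = PAGE_BYTES * 8 * link_pj_per_bit          # one page transfer
    local = reuses * CACHELINE_BYTES * 8 * local_pj_per_bit
    remote = reuses * CACHELINE_BYTES * 8 * (link_pj_per_bit + local_pj_per_bit)
    return migrate + local < remote
```

In this simplified model the break-even point falls at roughly page-size / cache-line-size reuses; richer models (per-message overheads, multi-hop paths) shift it per link type.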
Results — Communication Channels
Figure 9: Link utilization percentages for application 1 under the Local, IP-1, IP-2, and IB policies, broken down by channel (local DRAM, inter-processor 1-hop, inter-processor 2-hop, inter-blade).
- Use link utilization percentages to estimate average communication power.
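The estimate amounts to weighting each channel's peak power by its measured utilization. The peak-power figures below follow the ballpark numbers cited on the next slide (roughly 40W for HyperTransport and 60W for PCIe at peak) but should be read as assumptions in this sketch.

```python
# Sketch: average communication power as utilization-weighted peak
# power per channel. Peak values are the talk's ballpark figures,
# used here as assumptions.

PEAK_W = {"hypertransport": 40.0, "pcie": 60.0}

def avg_comm_power(utilization):
    """utilization: dict mapping channel -> fraction of peak (0-1)."""
    return sum(PEAK_W[link] * u for link, u in utilization.items())
```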
Results — Communication Power
- Compute-intensive: 7, 9-11, 13-20
- Memory-intensive: 1-6, 8, 12
- 12 is out-of-scope
(Plot: communication power in watts per benchmark under the Local, IP-1, IP-2, and IB policies.)
- HyperTransport and PCIe dissipate around 40W and 60W, respectively, at peak utilization.
- S1-S6 suggest that these Spark workloads use about 25% of the link bandwidth.
- Model blade servers for emerging big-data applications.
- Study NUMA-aware schedulers and their effects on throughput, latency, and power.
- Provide guidelines for choosing an optimal policy.
Future directions:
- Extend validation to real system measurements.
Conclusions and Future Directions