SLIDE 1

Memory Scalability Evaluation of the Next-Generation Intel Bensley Platform with InfiniBand

Matthew Koop, Wei Huang, Abhinav Vishnu, Dhabaleswar K. Panda
Network-Based Computing Laboratory
Department of Computer Science & Engineering
The Ohio State University

SLIDE 2

Introduction

  • Computer systems have increased significantly in processing capability over the last few years in various ways
    – Multi-core architectures are becoming more prevalent
    – High-speed I/O interfaces, such as PCI-Express, have enabled high-speed interconnects such as InfiniBand to deliver higher performance
  • The area that has improved the least during this time is the memory controller

SLIDE 3

Traditional Memory Design

  • Traditional memory controller design has limited the number of DIMMs per memory channel as signal rates have increased
  • Due to the high pin count (240) required for each channel, adding additional channels is costly
  • The end result is equal or lower memory capacity in recent years

SLIDE 4

Fully-Buffered DIMMs (FB-DIMMs)

  • FB-DIMM uses serial lanes with a buffer on each DIMM to eliminate this tradeoff
  • Each channel requires only 69 pins
  • Using the buffer allows larger numbers of DIMMs per channel as well as increased parallelism

[Figure: FB-DIMM channel architecture. Image courtesy of Intel Corporation]

SLIDE 5

Evaluation

  • With multi-core systems coming, a scalable memory subsystem is increasingly important
  • Our goal is to compare FB-DIMM against a traditional design and evaluate the scalability
  • Evaluation process:
    – Test the memory subsystem on a single node
    – Evaluate network-level performance with two InfiniBand Host Channel Adapters (HCAs)

SLIDE 6

Outline

  • Introduction & Goals
  • Memory Subsystem Evaluation

    – Experimental testbed
    – Latency and throughput

  • Network results
  • Conclusions and Future work

SLIDE 7

Evaluation Testbed

  • Intel “Bensley” system
    – Two 3.2 GHz dual-core Intel Xeon “Dempsey” processors
    – FB-DIMM-based memory subsystem
  • Intel Lindenhurst system
    – Two 3.4 GHz Intel Xeon processors
    – Traditional memory subsystem (2 channels)
  • Both contain:
    – Two 8x PCI-Express slots
    – DDR2 533-based memory
    – Two dual-port Mellanox MT25208 InfiniBand HCAs

SLIDE 8

Bensley Memory Configurations

[Figure: Bensley memory layout: Processor 0 and Processor 1; two branches (Branch 0, Branch 1) with two FB-DIMM channels each (Channels 0-3) and four DIMM slots per channel]

  • The standard allows up to 6 channels with 8 DIMMs/channel, for 192GB
  • Our systems have 4 channels, each with 4 DIMM slots
  • To fill 4 DIMM slots we have 3 combinations (spread across 1, 2, or 4 channels)

SLIDE 9

Subsystem Evaluation Tool

lmbench 3.0-a5: Open-source benchmark suite for evaluating system-level performance

  • Latency
    – Memory read latency
  • Throughput
    – Memory read benchmark
    – Memory write benchmark
    – Memory copy benchmark
  • Aggregate performance is obtained by running multiple long-running processes and reporting the sum of their averages (see the sketch below)
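
As an illustration only (not lmbench source code), the read-throughput measurement can be sketched as a simple streaming kernel; the buffer size and iteration count below are arbitrary assumptions. Several such processes run concurrently, and the per-process averages are summed to obtain the aggregate numbers on the following slides.

/* Sketch of a sequential memory-read bandwidth kernel (illustration only;
 * the evaluation used lmbench 3.0-a5). Buffer size and iteration count are
 * arbitrary assumptions. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define BUF_BYTES  (64UL * 1024 * 1024)   /* 64 MB working set (assumed) */
#define ITERATIONS 100

int main(void)
{
    size_t n = BUF_BYTES / sizeof(long);
    long *buf = malloc(BUF_BYTES);
    volatile long sink = 0;

    for (size_t i = 0; i < n; i++)        /* touch pages so they are resident */
        buf[i] = (long)i;

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int it = 0; it < ITERATIONS; it++)
        for (size_t i = 0; i < n; i++)    /* stream sequentially through memory */
            sink += buf[i];
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    double mbytes = (double)ITERATIONS * BUF_BYTES / (1024.0 * 1024.0);
    printf("read throughput: %.1f MB/sec\n", mbytes / secs);
    (void)sink;
    free(buf);
    return 0;
}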

SLIDE 10

Bensley Memory Throughput

  • To study the impact of additional channels we evaluated using 1, 2, and 4 channels
  • Throughput increases significantly from one to two channels in all operations

[Figure: aggregate throughput (MB/sec) vs. number of processes (1, 2, 4) for the Read, Write, and Copy benchmarks]

SLIDE 11

Access Latency Comparison

[Figure: read latency (ns), unloaded vs. loaded, for Bensley (4GB, 8GB, 16GB) and Lindenhurst (2GB, 4GB) configurations]

  • Comparison when unloaded and loaded
  • "Loaded" means a memory read throughput test is run in the background while the latency test is running (a sketch of such a measurement follows below)
  • From unloaded to loaded latency:
    – Lindenhurst: 40% increase
    – Bensley: 10% increase

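As an illustration only (the actual measurements used lmbench, and the exact invocation is not shown on the slide), a loaded-latency measurement can be sketched as a pointer-chasing latency loop with an optional background process streaming reads; all sizes and counts below are assumptions.

/* Sketch of an unloaded vs. loaded read-latency measurement (illustration
 * only, not the lmbench code used in the paper). A pointer-chasing loop
 * measures latency; with any command-line argument, a child process streams
 * reads in the background to create load. */
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

#define CHAIN_LEN   (8UL * 1024 * 1024)    /* pointers in the chase chain */
#define CHASE_STEPS (20 * 1000 * 1000)
#define STREAM_LEN  (8UL * 1024 * 1024)    /* longs in the background buffer */

static void stream_reads(void)             /* background memory load */
{
    long *buf = malloc(STREAM_LEN * sizeof(long));
    volatile long sink = 0;
    for (size_t i = 0; i < STREAM_LEN; i++) buf[i] = (long)i;
    for (;;)
        for (size_t i = 0; i < STREAM_LEN; i++) sink += buf[i];
}

int main(int argc, char **argv)
{
    int loaded = (argc > 1);               /* any argument selects the loaded run */
    pid_t child = 0;
    if (loaded && (child = fork()) == 0) { stream_reads(); _exit(0); }

    /* Build a random cyclic pointer chain so each read misses the caches. */
    void **chain = malloc(CHAIN_LEN * sizeof(void *));
    size_t *perm = malloc(CHAIN_LEN * sizeof(size_t));
    for (size_t i = 0; i < CHAIN_LEN; i++) perm[i] = i;
    for (size_t i = CHAIN_LEN - 1; i > 0; i--) {   /* Fisher-Yates shuffle */
        size_t j = (size_t)rand() % (i + 1);
        size_t t = perm[i]; perm[i] = perm[j]; perm[j] = t;
    }
    for (size_t i = 0; i < CHAIN_LEN; i++)
        chain[perm[i]] = &chain[perm[(i + 1) % CHAIN_LEN]];

    void **p = &chain[perm[0]];
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < CHASE_STEPS; i++)         /* dependent loads */
        p = (void **)*p;
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = ((t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec))
                / (double)CHASE_STEPS;
    printf("%s read latency: %.1f ns (last=%p)\n",
           loaded ? "loaded" : "unloaded", ns, (void *)p);
    if (child > 0) kill(child, SIGKILL);
    return 0;
}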

SLIDE 12

Memory Throughput Comparison

  • Comparison of Lindenhurst and Bensley platforms with increasing memory size
  • Performance increases with two concurrent read or write operations on the Bensley platform

[Figure: aggregate throughput (MB/sec) vs. number of processes (1, 2, 4) for the Read, Write, and Copy benchmarks; series are Bensley 4GB, Bensley 8GB, and Bensley 16GB]

SLIDE 13

Outline

  • Introduction & Goals
  • Memory Subsystem Evaluation

    – Experimental testbed
    – Latency and throughput

  • Network results
  • Conclusions and Future work

SLIDE 14

OSU MPI over InfiniBand

  • Open-source high-performance implementations
    – MPI-1 (MVAPICH)
    – MPI-2 (MVAPICH2)
  • Has enabled a large number of production IB clusters all over the world to take advantage of InfiniBand
    – Largest being the Sandia Thunderbird cluster (4512 nodes with 9024 processors)
  • Have been directly downloaded and used by more than 395 organizations worldwide (in 30 countries)
    – Time-tested and stable code base with novel features
  • Available in the software stack distributions of many vendors
  • Available in the OpenFabrics (OpenIB) Gen2 stack and OFED
  • More details at http://nowlab.cse.ohio-state.edu/projects/mpi-iba/

SLIDE 15

Experimental Setup

  • Evaluation is with two InfiniBand DDR HCAs, using the “multi-rail” feature of MVAPICH
  • Results with one process use both rails in a round-robin pattern
  • Results with 2 and 4 process pairs use a process-binding assignment (a bandwidth-test sketch follows below)

[Figure: rail-assignment schemes, Round Robin and Process Binding, mapping processes P0-P3 onto HCA 0 and HCA 1]
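
The slides do not reproduce the benchmark code; the following is a minimal sketch of a windowed uni-directional bandwidth test between process pairs, in the spirit of (but not identical to) the OSU bandwidth benchmark. Message size, window size, and iteration count are assumptions, and rail selection (round robin vs. process binding) is handled inside MVAPICH rather than in this code.

/* Sketch of a windowed uni-directional MPI bandwidth test between process
 * pairs (illustration only). The lower half of the ranks send to the upper
 * half; aggregate bandwidth is the sum over sending ranks. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MSG_SIZE (64 * 1024)   /* one message size; the slides sweep 1B to 1MB */
#define WINDOW   64            /* messages in flight per iteration (assumed) */
#define ITERS    100

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int peer = (rank < size / 2) ? rank + size / 2 : rank - size / 2;
    char *buf = malloc(MSG_SIZE);
    memset(buf, rank, MSG_SIZE);
    MPI_Request reqs[WINDOW];

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();

    for (int it = 0; it < ITERS; it++) {
        if (rank < size / 2) {                 /* sender half */
            for (int w = 0; w < WINDOW; w++)
                MPI_Isend(buf, MSG_SIZE, MPI_CHAR, peer, 0,
                          MPI_COMM_WORLD, &reqs[w]);
            MPI_Waitall(WINDOW, reqs, MPI_STATUSES_IGNORE);
            MPI_Recv(buf, 1, MPI_CHAR, peer, 1, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);       /* ack keeps the pair in lockstep */
        } else {                               /* receiver half */
            for (int w = 0; w < WINDOW; w++)
                MPI_Irecv(buf, MSG_SIZE, MPI_CHAR, peer, 0,
                          MPI_COMM_WORLD, &reqs[w]);
            MPI_Waitall(WINDOW, reqs, MPI_STATUSES_IGNORE);
            MPI_Send(buf, 1, MPI_CHAR, peer, 1, MPI_COMM_WORLD);
        }
    }

    double secs = MPI_Wtime() - t0;
    if (rank < size / 2) {
        double mb = (double)ITERS * WINDOW * MSG_SIZE / (1024.0 * 1024.0);
        printf("pair %d: %.1f MB/sec\n", rank, mb / secs);
    }
    free(buf);
    MPI_Finalize();
    return 0;
}

Running such a test with 2, 4, or 8 ranks split across the two nodes would give the 1-, 2-, and 4-pair curves, with aggregate throughput taken as the sum over the sending ranks.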

SLIDE 16

Uni-Directional Bandwidth

  • Comparison of Lindenhurst and Bensley with dual DDR HCAs
  • Due to higher memory copy bandwidth, Bensley significantly outperforms Lindenhurst for medium-sized messages

[Figure: uni-directional throughput (MB/sec) vs. message size (1 byte to 1MB) for Bensley with 1, 2, and 4 processes and Lindenhurst with 1 and 2 processes]

SLIDE 17

Bi-Directional Bandwidth

  • Improvement at 1KB messages:
    – Lindenhurst: 1 to 2 processes: 15%
    – Bensley: 1 to 2 processes: 75%; 2 to 4 processes: 45%
  • Lindenhurst peak bi-directional bandwidth is only 100 MB/sec greater than uni-directional

[Figure: bi-directional throughput (MB/sec) vs. message size (1 byte to 1MB) for Bensley with 1, 2, and 4 processes and Lindenhurst with 1 and 2 processes]

SLIDE 18

Messaging Rate

  • For very small messages, both show similar performance
  • At 512 bytes: the Lindenhurst 2-process case is only 52% higher than 1 process, while Bensley still shows a 100% improvement
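
A note on the metric (our reading; the slide does not spell it out): the messaging rate is the same windowed test reported as messages per second rather than bytes per second, so rate and bandwidth are related by rate = bandwidth / message size. For example, 2 million 512-byte messages per second corresponds to roughly 1000 MB/sec.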

[Figure: messaging rate (millions of messages/sec) vs. message size (1 byte to 256KB) for Bensley with 1, 2, and 4 processes and Lindenhurst with 1 and 2 processes]

SLIDE 19

Outline

  • Introduction & Goals
  • Memory Subsystem Evaluation

    – Experimental testbed
    – Latency and throughput

  • Network results
  • Conclusions and Future work

SLIDE 20

Conclusions and Future Work

  • Performed a detailed analysis of the memory subsystem scalability of Bensley and Lindenhurst
  • Bensley shows a significant advantage in scalable throughput and capacity in all measures tested
  • Future work:
    – Profile real-world applications on a larger cluster and observe the effects of contention in multi-core architectures
    – Expand the evaluation to include NUMA-based architectures

SLIDE 21

Acknowledgements

Our research is supported by the following organizations:

  • Current equipment support by [sponsor logos]
  • Current funding support by [sponsor logos]

SLIDE 22

Web Pointers

http://nowlab.cse.ohio-state.edu/
MVAPICH Web Page: http://nowlab.cse.ohio-state.edu/projects/mpi-iba/

{koop, huanwei, vishnu, panda}@cse.ohio-state.edu