SLIDE 1

Memory Scalability Evaluation of the Next-Generation Intel Bensley Platform with InfiniBand

Matthew Koop, Wei Huang, Abhinav Vishnu, Dhabaleswar K. Panda
Network-Based Computing Laboratory
Department of Computer Science & Engineering
The Ohio State University

SLIDE 2

Introduction

  • Computer systems have increased significantly in processing capability over the last few years in various ways
    – Multi-core architectures are becoming more prevalent
    – High-speed I/O interfaces, such as PCI-Express, have enabled high-speed interconnects such as InfiniBand to deliver higher performance
  • The area that has improved the least during this time is the memory controller

SLIDE 3

Traditional Memory Design

  • Traditional memory controller design has limited the number of DIMMs per memory channel as signal rates have increased
  • Due to the high pin count (240) required for each channel, adding additional channels is costly
  • The end result is equal or lower memory capacity in recent years

SLIDE 4

Fully-Buffered DIMMs (FB-DIMMs)

  • FB-DIMM uses serial lanes with a buffer on each DIMM to eliminate this tradeoff
  • Each channel requires only 69 pins
  • Using the buffer allows larger numbers of DIMMs per channel as well as increased parallelism

[Figure: FB-DIMM channel architecture. Image courtesy of Intel Corporation]

SLIDE 5

Evaluation

  • With multi-core systems coming, a scalable memory subsystem is increasingly important
  • Our goal is to compare FB-DIMM against a traditional design and evaluate the scalability
  • Evaluation process:
    – Test the memory subsystem on a single node
    – Evaluate network-level performance with two InfiniBand Host Channel Adapters (HCAs)

SLIDE 6

Outline

  • Introduction & Goals
  • Memory Subsystem Evaluation

    – Experimental testbed
    – Latency and throughput

  • Network results
  • Conclusions and Future work

SLIDE 7

Evaluation Testbed

  • Intel “Bensley” system
    – Two 3.2 GHz dual-core Intel Xeon “Dempsey” processors
    – FB-DIMM-based memory subsystem
  • Intel Lindenhurst system
    – Two 3.4 GHz Intel Xeon processors
    – Traditional memory subsystem (2 channels)
  • Both contain:
    – Two 8x PCI-Express slots
    – DDR2 533-based memory
    – Two dual-port Mellanox MT25208 InfiniBand HCAs

SLIDE 8

Bensley Memory Configurations

[Figure: Bensley memory layout: Processor 0 and Processor 1; two branches (Branch 0, Branch 1) with two FB-DIMM channels each (Channels 0-3) and four DIMM slots per channel]

  • The standard allows up to 6 channels with 8 DIMMs/channel, for 192GB
  • Our systems have 4 channels, each with 4 DIMM slots
  • To fill 4 DIMM slots we have 3 combinations (spread across 1, 2, or 4 channels)

SLIDE 9

Subsystem Evaluation Tool

lmbench 3.0-a5: Open-source benchmark suite for evaluating system-level performance

  • Latency
    – Memory read latency
  • Throughput
    – Memory read benchmark
    – Memory write benchmark
    – Memory copy benchmark
  • Aggregate performance is obtained by running multiple long-running processes and reporting the sum of their averages (see the sketch below)
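
As an illustration only (not lmbench source code), the read-throughput measurement can be sketched as a simple streaming kernel; the buffer size and iteration count below are arbitrary assumptions. Several such processes run concurrently, and the per-process averages are summed to obtain the aggregate numbers on the following slides.

/* Sketch of a sequential memory-read bandwidth kernel (illustration only;
 * the evaluation used lmbench 3.0-a5). Buffer size and iteration count are
 * arbitrary assumptions. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define BUF_BYTES  (64UL * 1024 * 1024)   /* 64 MB working set (assumed) */
#define ITERATIONS 100

int main(void)
{
    size_t n = BUF_BYTES / sizeof(long);
    long *buf = malloc(BUF_BYTES);
    volatile long sink = 0;

    for (size_t i = 0; i < n; i++)        /* touch pages so they are resident */
        buf[i] = (long)i;

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int it = 0; it < ITERATIONS; it++)
        for (size_t i = 0; i < n; i++)    /* stream sequentially through memory */
            sink += buf[i];
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    double mbytes = (double)ITERATIONS * BUF_BYTES / (1024.0 * 1024.0);
    printf("read throughput: %.1f MB/sec\n", mbytes / secs);
    (void)sink;
    free(buf);
    return 0;
}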

SLIDE 10

Bensley Memory Throughput

  • To study the impact of additional channels we evaluated using 1, 2, and 4 channels
  • Throughput increases significantly from one to two channels in all operations

[Figure: aggregate throughput (MB/sec) vs. number of processes (1, 2, 4) for the Read, Write, and Copy benchmarks]

SLIDE 11

Access Latency Comparison

[Figure: read latency (ns), unloaded vs. loaded, for Bensley (4GB, 8GB, 16GB) and Lindenhurst (2GB, 4GB) configurations]

  • Comparison when unloaded and loaded
  • "Loaded" means a memory read throughput test is run in the background while the latency test is running (a sketch of such a measurement follows below)
  • From unloaded to loaded latency:
    – Lindenhurst: 40% increase
    – Bensley: 10% increase

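As an illustration only (the actual measurements used lmbench, and the exact invocation is not shown on the slide), a loaded-latency measurement can be sketched as a pointer-chasing latency loop with an optional background process streaming reads; all sizes and counts below are assumptions.

/* Sketch of an unloaded vs. loaded read-latency measurement (illustration
 * only, not the lmbench code used in the paper). A pointer-chasing loop
 * measures latency; with any command-line argument, a child process streams
 * reads in the background to create load. */
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

#define CHAIN_LEN   (8UL * 1024 * 1024)    /* pointers in the chase chain */
#define CHASE_STEPS (20 * 1000 * 1000)
#define STREAM_LEN  (8UL * 1024 * 1024)    /* longs in the background buffer */

static void stream_reads(void)             /* background memory load */
{
    long *buf = malloc(STREAM_LEN * sizeof(long));
    volatile long sink = 0;
    for (size_t i = 0; i < STREAM_LEN; i++) buf[i] = (long)i;
    for (;;)
        for (size_t i = 0; i < STREAM_LEN; i++) sink += buf[i];
}

int main(int argc, char **argv)
{
    int loaded = (argc > 1);               /* any argument selects the loaded run */
    pid_t child = 0;
    if (loaded && (child = fork()) == 0) { stream_reads(); _exit(0); }

    /* Build a random cyclic pointer chain so each read misses the caches. */
    void **chain = malloc(CHAIN_LEN * sizeof(void *));
    size_t *perm = malloc(CHAIN_LEN * sizeof(size_t));
    for (size_t i = 0; i < CHAIN_LEN; i++) perm[i] = i;
    for (size_t i = CHAIN_LEN - 1; i > 0; i--) {   /* Fisher-Yates shuffle */
        size_t j = (size_t)rand() % (i + 1);
        size_t t = perm[i]; perm[i] = perm[j]; perm[j] = t;
    }
    for (size_t i = 0; i < CHAIN_LEN; i++)
        chain[perm[i]] = &chain[perm[(i + 1) % CHAIN_LEN]];

    void **p = &chain[perm[0]];
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < CHASE_STEPS; i++)         /* dependent loads */
        p = (void **)*p;
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = ((t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec))
                / (double)CHASE_STEPS;
    printf("%s read latency: %.1f ns (last=%p)\n",
           loaded ? "loaded" : "unloaded", ns, (void *)p);
    if (child > 0) kill(child, SIGKILL);
    return 0;
}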

SLIDE 12

Memory Throughput Comparison

  • Comparison of Lindenhurst and Bensley platforms with increasing memory size
  • Performance increases with two concurrent read or write operations on the Bensley platform

[Figure: aggregate throughput (MB/sec) vs. number of processes (1, 2, 4) for the Read, Write, and Copy benchmarks; series are Bensley 4GB, Bensley 8GB, and Bensley 16GB]

SLIDE 13

Outline

  • Introduction & Goals
  • Memory Subsystem Evaluation

    – Experimental testbed
    – Latency and throughput

  • Network results
  • Conclusions and Future work

SLIDE 14

OSU MPI over InfiniBand

  • Open-source high-performance implementations
    – MPI-1 (MVAPICH)
    – MPI-2 (MVAPICH2)
  • Has enabled a large number of production IB clusters all over the world to take advantage of InfiniBand
    – Largest being the Sandia Thunderbird cluster (4512 nodes with 9024 processors)
  • Have been directly downloaded and used by more than 395 organizations worldwide (in 30 countries)
    – Time-tested and stable code base with novel features
  • Available in the software stack distributions of many vendors
  • Available in the OpenFabrics (OpenIB) Gen2 stack and OFED
  • More details at http://nowlab.cse.ohio-state.edu/projects/mpi-iba/

SLIDE 15

Experimental Setup

  • Evaluation is with two InfiniBand DDR HCAs, using the “multi-rail” feature of MVAPICH
  • Results with one process use both rails in a round-robin pattern
  • Results with 2 and 4 process pairs use a process-binding assignment (a bandwidth-test sketch follows below)

[Figure: rail-assignment schemes, Round Robin and Process Binding, mapping processes P0-P3 onto HCA 0 and HCA 1]
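
The slides do not reproduce the benchmark code; the following is a minimal sketch of a windowed uni-directional bandwidth test between process pairs, in the spirit of (but not identical to) the OSU bandwidth benchmark. Message size, window size, and iteration count are assumptions, and rail selection (round robin vs. process binding) is handled inside MVAPICH rather than in this code.

/* Sketch of a windowed uni-directional MPI bandwidth test between process
 * pairs (illustration only). The lower half of the ranks send to the upper
 * half; aggregate bandwidth is the sum over sending ranks. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MSG_SIZE (64 * 1024)   /* one message size; the slides sweep 1B to 1MB */
#define WINDOW   64            /* messages in flight per iteration (assumed) */
#define ITERS    100

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int peer = (rank < size / 2) ? rank + size / 2 : rank - size / 2;
    char *buf = malloc(MSG_SIZE);
    memset(buf, rank, MSG_SIZE);
    MPI_Request reqs[WINDOW];

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();

    for (int it = 0; it < ITERS; it++) {
        if (rank < size / 2) {                 /* sender half */
            for (int w = 0; w < WINDOW; w++)
                MPI_Isend(buf, MSG_SIZE, MPI_CHAR, peer, 0,
                          MPI_COMM_WORLD, &reqs[w]);
            MPI_Waitall(WINDOW, reqs, MPI_STATUSES_IGNORE);
            MPI_Recv(buf, 1, MPI_CHAR, peer, 1, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);       /* ack keeps the pair in lockstep */
        } else {                               /* receiver half */
            for (int w = 0; w < WINDOW; w++)
                MPI_Irecv(buf, MSG_SIZE, MPI_CHAR, peer, 0,
                          MPI_COMM_WORLD, &reqs[w]);
            MPI_Waitall(WINDOW, reqs, MPI_STATUSES_IGNORE);
            MPI_Send(buf, 1, MPI_CHAR, peer, 1, MPI_COMM_WORLD);
        }
    }

    double secs = MPI_Wtime() - t0;
    if (rank < size / 2) {
        double mb = (double)ITERS * WINDOW * MSG_SIZE / (1024.0 * 1024.0);
        printf("pair %d: %.1f MB/sec\n", rank, mb / secs);
    }
    free(buf);
    MPI_Finalize();
    return 0;
}

Running such a test with 2, 4, or 8 ranks split across the two nodes would give the 1-, 2-, and 4-pair curves, with aggregate throughput taken as the sum over the sending ranks.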

SLIDE 16

Uni-Directional Bandwidth

  • Comparison of Lindenhurst and Bensley with dual DDR HCAs
  • Due to higher memory copy bandwidth, Bensley significantly outperforms Lindenhurst for medium-sized messages

[Figure: uni-directional throughput (MB/sec) vs. message size (1 byte to 1MB) for Bensley with 1, 2, and 4 processes and Lindenhurst with 1 and 2 processes]

SLIDE 17

Bi-Directional Bandwidth

  • Improvement at 1KB messages:
    – Lindenhurst: 1 to 2 processes: 15%
    – Bensley: 1 to 2 processes: 75%; 2 to 4 processes: 45%
  • Lindenhurst peak bi-directional bandwidth is only 100 MB/sec greater than uni-directional

[Figure: bi-directional throughput (MB/sec) vs. message size (1 byte to 1MB) for Bensley with 1, 2, and 4 processes and Lindenhurst with 1 and 2 processes]

SLIDE 18

Messaging Rate

  • For very small messages, both show similar performance
  • At 512 bytes: the Lindenhurst 2-process case is only 52% higher than 1 process, while Bensley still shows a 100% improvement
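
A note on the metric (our reading; the slide does not spell it out): the messaging rate is the same windowed test reported as messages per second rather than bytes per second, so rate and bandwidth are related by rate = bandwidth / message size. For example, 2 million 512-byte messages per second corresponds to roughly 1000 MB/sec.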

[Figure: messaging rate (millions of messages/sec) vs. message size (1 byte to 256KB) for Bensley with 1, 2, and 4 processes and Lindenhurst with 1 and 2 processes]

SLIDE 19

Outline

  • Introduction & Goals
  • Memory Subsystem Evaluation

    – Experimental testbed
    – Latency and throughput

  • Network results
  • Conclusions and Future work

SLIDE 20

Conclusions and Future Work

  • Performed a detailed analysis of the memory subsystem scalability of Bensley and Lindenhurst
  • Bensley shows a significant advantage in scalable throughput and capacity in all measures tested
  • Future work:
    – Profile real-world applications on a larger cluster and observe the effects of contention in multi-core architectures
    – Expand the evaluation to include NUMA-based architectures

SLIDE 21

Acknowledgements

Our research is supported by the following organizations:

  • Current equipment support by [sponsor logos]
  • Current funding support by [sponsor logos]

SLIDE 22

Web Pointers

http://nowlab.cse.ohio-state.edu/
MVAPICH Web Page: http://nowlab.cse.ohio-state.edu/projects/mpi-iba/

{koop, huanwei, vishnu, panda}@cse.ohio-state.edu