Lecture 7: Single Node Architectures, Abhinav Bhatele, Department of Computer Science (PowerPoint presentation)



SLIDE 1

Lecture 7: Single Node Architectures

Abhinav Bhatele, Department of Computer Science

High Performance Computing Systems (CMSC714)

SLIDE 2

Abhinav Bhatele, CMSC714

Summary of last lecture

  • Task-based programming models and Charm++
  • Key principles:
  • Over-decomposition, virtualization
  • Message-driven execution
  • Automatic load balancing, checkpointing, fault tolerance

SLIDE 3

von Neumann architecture


https://en.wikipedia.org/wiki/Von_Neumann_architecture

SLIDE 4

UMA vs. NUMA


Uniform Memory Access Non-uniform Memory Access

https://frankdenneman.nl/2016/07/07/numa-deep-dive-part-1-uma-numa/

SLIDE 6

Fast vs. slow cores

  • Intel Core line (Nehalem, Sandy Bridge, Ivy Bridge, Haswell, Broadwell, …)
  • AMD processors
  • IBM Power line
  • Slower cores: Low frequency, low power
  • IBM PowerPC line (440, 450, A2, …)

SLIDE 7

Intel Haswell Chip

SLIDE 8

BQC Chip

  • A2 processor core, runs at 1.6 GHz
  • Shared L2 cache
  • Peak performance per core: 12.8 Gflop/s
  • Total performance per node: 204.8 Gflop/s
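
The per-core and per-node peaks follow from the clock rate by simple arithmetic. A quick sanity check, assuming (not stated on the slide) that each A2 core's QPX unit is a 4-wide double-precision SIMD fused multiply-add (8 flops/cycle) and that 16 of the chip's cores are available to applications:

```python
# Peak-FLOP sanity check for the Blue Gene/Q compute chip.
# Assumptions (not on the slide): 4-wide SIMD FMA = 4 * 2 = 8 flops/cycle,
# and 16 compute cores per node.

clock_ghz = 1.6            # A2 clock rate (from the slide)
flops_per_cycle = 4 * 2    # 4-wide SIMD x (multiply + add)
compute_cores = 16

per_core = clock_ghz * flops_per_cycle   # Gflop/s per core
per_node = per_core * compute_cores      # Gflop/s per node

print(f"per core: {per_core} Gflop/s")   # 12.8, matching the slide
print(f"per node: {per_node} Gflop/s")   # 204.8, matching the slide
```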

SLIDE 9

GPUs

  • NVIDIA: Fermi, Kepler, Maxwell, Pascal, Volta, …
  • AMD
  • Intel
  • Figure on the right shows a single node of Summit @ ORNL

[Summit node diagram: 2 POWER9 CPUs, each with 256 GB DRAM and 3 GPUs (7 TF, 16 GB HBM each). Node totals: 42 TF (6x7 TF), 96 GB HBM (6x16 GB), 512 GB DRAM, 25 GB/s network (2x12.5 GB/s), 83 MMsg/s. Link speeds: NVLink 50 GB/s, HBM 900 GB/s per GPU, X-Bus (SMP) 64 GB/s, CPU DRAM bus 135 GB/s, PCIe Gen4 16 GB/s to the NIC, EDR IB; NVM 6.0 GB/s read, 2.2 GB/s write. HBM & DRAM speeds are aggregate (read+write); all other speeds (X-Bus, NVLink, PCIe, IB) are bi-directional.]
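
The aggregate figures in the node diagram are simple sums over the node's two CPUs and six GPUs; a small sketch that rebuilds them from the per-device numbers shown in the diagram:

```python
# Recompute the Summit per-node aggregates from the per-device numbers
# in the node diagram (2 POWER9 CPUs, 3 GPUs attached to each).

gpus_per_cpu = 3
cpus_per_node = 2
gpus_per_node = gpus_per_cpu * cpus_per_node      # 6 GPUs per node

gpu_tflops = 7            # TF per GPU
hbm_per_gpu_gb = 16       # GB of HBM per GPU
dram_per_cpu_gb = 256     # GB of DRAM per CPU
nic_link_gbps = 12.5      # GB/s per network link (two links per node)

node_tflops = gpus_per_node * gpu_tflops          # 42 TF
node_hbm_gb = gpus_per_node * hbm_per_gpu_gb      # 96 GB
node_dram_gb = cpus_per_node * dram_per_cpu_gb    # 512 GB
node_net_gbps = 2 * nic_link_gbps                 # 25 GB/s

print(node_tflops, node_hbm_gb, node_dram_gb, node_net_gbps)
```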

SLIDE 10

Volta GV100


The World’s Most Advanced Data Center GPU

SLIDE 11

Volta GV100 SM


https://images.nvidia.com/content/volta-architecture/pdf/volta-architecture-whitepaper.pdf

  • Each Volta Streaming Multiprocessor (SM) has:
  • 64 FP32 cores
  • 64 INT32 cores
  • 32 FP64 cores
  • 8 Tensor cores
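
Scaling the per-SM counts up to a full chip gives the chip-level totals quoted in the whitepaper; a sketch assuming (not stated on the slide) 80 enabled SMs on a shipping V100:

```python
# Per-SM execution resources on Volta (from the slide), scaled up to a
# full V100. The SM count of 80 is an assumption taken from NVIDIA's
# Volta whitepaper, not from the slide.

sms = 80
fp32_per_sm = 64
fp64_per_sm = 32
tensor_per_sm = 8

fp32_total = sms * fp32_per_sm      # FP32 ("CUDA") cores on the chip
fp64_total = sms * fp64_per_sm      # FP64 cores on the chip
tensor_total = sms * tensor_per_sm  # Tensor cores on the chip

print(fp32_total, fp64_total, tensor_total)  # 5120 2560 640
```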
SLIDE 12

Questions

  • Why are the L2 caches sitting at the center of the chip? Why not the other way around? Is this a standard design?
  • Why is this paper, the Blue Gene/Q, or the A2 processor so important?
  • Are there significant new prefetching methods in recent architectures, other than list and stream prefetching?
  • Is the "multiply add pipeline" a commonly used operation, or is the architecture just trying to increase its FLOP count? What other commonly used operations get pipelined in other architectures?


The IBM Blue Gene/Q Compute Chip

SLIDE 13

Questions

  • The paper is from 2010, which is rather old. GPUs have evolved a lot in the last decade; how would the comparison look today?
  • The GPU in the first paper is 1.5 years older than the CPU; what would the results be if both were from the same time? How does Moore's law apply to GPUs? Do they get 2x faster every 2 years?
  • GPUs have several types of caches (shared buffer, constant cache, texture cache). How should these caches be differentiated (chosen) for a given purpose?
  • Where did the "myth" come from? Is the CPU more difficult to optimize?
  • Have the features recommended by the authors become true in current CPUs/GPUs?
  • Why is radix sort chosen as a benchmark, when it is not the default algorithm in most programming languages? (Java has mergesort, Python Timsort, C++ implements quicksort.) Is it used more in HPC?
  • The paper says they discarded the delays related to memory bandwidth because GPUs have 5x faster bandwidth than CPUs. What would be the approximate real-life speeds with memory included? How important is it to optimize bandwidth?


Debunking the 100X GPU vs. CPU myth

SLIDE 14

Abhinav Bhatele, 5218 Brendan Iribe Center (IRB), College Park, MD 20742 / phone: 301.405.4507 / e-mail: bhatele@cs.umd.edu

Questions?