

SLIDE 1

TiNy Threads on BlueGene/P: Exploring Many-Core Parallelisms Beyond The Traditional OS

Handong Ye, Robert Pavel, Aaron Landwehr, Guang R. Gao
Department of Electrical & Computer Engineering, University of Delaware

MTAAP’2010, Atlanta, Georgia, 2010-04-23

SLIDE 2

Introduction

• The modern OS is based upon a sequential execution model (the von Neumann model).
• Multi-core/many-core chip technology has progressed rapidly.
• Parallel computer systems are now implemented on single chips.


SLIDE 3

Introduction

• The conventional OS model must adapt to these underlying changes to further exploit the many levels of parallelism, in hardware as well as software.
• We present a study of how to make this adaptation for the IBM BlueGene/P multi-core system.


SLIDE 4

Outline

• Introduction
• Contributions
• TNT on BlueGene/P
  • Scheduling TNT across nodes
  • Synchronization across nodes
  • TNT Distributed Shared Memory
• Results
• Conclusions and Future Work


SLIDE 5

Contributions

• Isolated traditional OS functions to a single core of the BG/P multi-core chip.
• Ported the TiNy Threads (TNT) execution model to allow for further utilization of the BG/P compute cores.
• Expanded the design framework to a multi-chip system designed to scale to a large number of chips.


SLIDE 6

Outline

• Introduction
• Contributions
• TNT on BlueGene/P
  • Scheduling TNT across nodes
  • Synchronization across nodes
  • TNT Distributed Shared Memory
• Results
• Conclusions and Future Work


SLIDE 7

TiNy Threads on BG/P

TiNy Threads

• Lightweight, non-preemptive threads with an API similar to POSIX Threads (a usage sketch follows).
• Originally presented in “TiNy Threads: A Thread Virtual Machine for the Cyclops64 Cellular Architecture” [1].
• Runs on the IBM Cyclops64.
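The deck shows the names tnt_create(), tnt_join(), and tnt_exit() on later slides; a minimal usage sketch follows, where the prototypes and the tnt_t id type are assumptions modeled on the pthreads-like API described above, not confirmed signatures.

    /* Minimal TNT usage sketch. The names tnt_create(), tnt_join(), and
     * tnt_exit() appear in the deck; the prototypes and the tnt_t type
     * below are assumptions modeled on the pthreads-like API. */
    #include <stdio.h>
    #include <stdint.h>

    typedef uint32_t tnt_t;                     /* assumed thread-id type */

    extern int  tnt_create(tnt_t *tid, void (*fn)(void *), void *arg);
    extern int  tnt_join(tnt_t tid);
    extern void tnt_exit(void);

    static void worker(void *arg)
    {
        printf("hello from thread %d\n", *(int *)arg);
        tnt_exit();            /* non-preemptive: runs to completion, then exits */
    }

    int main(void)
    {
        tnt_t tid;
        int   id = 0;

        tnt_create(&tid, worker, &id);          /* spawn a TiNy Thread */
        tnt_join(tid);                          /* wait for it to call tnt_exit() */
        return 0;
    }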

Kernel Modifications

• Alterations to the thread scheduler to allow for non-preemption.


SLIDE 8

Outline

• Introduction
• Contributions
• TNT on BlueGene/P
  • Scheduling TNT across nodes
  • Synchronization across nodes
  • TNT Distributed Shared Memory
• Results
• Conclusions and Future Work


SLIDE 9

Multinode Thread Scheduler

• The thread scheduler allows TNT to run across multiple nodes.
• Requests are facilitated through RPCs in IBM’s Deep Computing Messaging Framework (DCMF).
• Multiple scheduling algorithms (see the sketch below):
  • Workload-unaware: Random, Round-Robin
  • Workload-aware
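The two workload-unaware policies named above can be sketched as simple node-selection functions. The pick_node_*() names and NUM_NODES are illustrative; the real scheduler would dispatch a creation request to the chosen node via a DCMF RPC.

    /* Sketch of the two workload-unaware node-selection policies.
     * pick_node_*() and NUM_NODES are illustrative names only. */
    #include <stdlib.h>

    #define NUM_NODES 64                  /* assumed node count */

    static int next_node = 0;             /* round-robin cursor */

    /* Round-robin: cycle through the nodes in order. */
    static int pick_node_round_robin(void)
    {
        int node = next_node;
        next_node = (next_node + 1) % NUM_NODES;
        return node;
    }

    /* Random: pick any node uniformly, ignoring current load. */
    static int pick_node_random(void)
    {
        return rand() % NUM_NODES;
    }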


SLIDE 10

Multinode Thread Scheduling


[Figure: remote thread creation. Node A calls tnt_create(), which returns a tid for a thread spawned on Node B; Node A’s later tnt_join() waits until the remote thread calls tnt_exit().]

SLIDE 11

Outline

• Introduction
• Contributions
• TNT on BlueGene/P
  • Scheduling TNT across nodes
  • Synchronization across nodes
  • TNT Distributed Shared Memory
• Results
• Conclusions and Future Work


SLIDE 12

Synchronization

• Three forms: mutexes, thread joining, and barriers.
• Handled similarly to thread scheduling: lock requests, join requests, and barrier notifications are sent to the node responsible for the given synchronization object (see the sketch below).
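A minimal sketch of the home-node routing idea follows. The modulo mapping from object id to owner is an assumption (the deck only says requests go to the responsible node), and both send_lock_request() and the tnt_mutex_lock name are hypothetical stand-ins.

    /* Home-node routing sketch: each synchronization object is owned by
     * one node, derived here by a simple modulo mapping (an assumption).
     * send_lock_request() is a hypothetical DCMF RPC stub. */
    #include <stdint.h>

    #define NUM_NODES 64                          /* assumed node count */

    extern void send_lock_request(int node, uint32_t mutex_id);  /* RPC stub */

    /* Derive the owning node from the mutex id. */
    static int mutex_home_node(uint32_t mutex_id)
    {
        return (int)(mutex_id % NUM_NODES);
    }

    static void tnt_mutex_lock(uint32_t mutex_id)    /* assumed API name */
    {
        send_lock_request(mutex_home_node(mutex_id), mutex_id);
        /* ... block until the home node grants the lock ... */
    }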


SLIDE 13

Multinode Thread Synchronization


[Figure: cross-node thread joining. A thread with a given tid runs on Node B; when it calls tnt_exit(), Node B notifies Node A so that Node A’s tnt_join() on that tid can return.]

SLIDE 14

Outline

• Introduction
• Contributions
• TNT on BlueGene/P
  • Scheduling TNT across nodes
  • Synchronization across nodes
  • TNT Distributed Shared Memory
• Results
• Conclusions and Future Work


SLIDE 15

Characteristics of TDSM

• Provides one-sided access to memory distributed among nodes through IBM’s DCMF.
• Allows for virtual address manipulation:
  • Maps distributed memory to a single virtual address space.
  • Allows for array indexing and memory offsets.
• Scalable to a variety of applications: the size of the desired global shared memory is set at runtime.
• Mutability: the memory allocation and memory distribution algorithms can be easily altered and/or replaced.


SLIDE 16

Example of TDSM

A global array of ints is block-distributed across Nodes 5, 6, and 7, with node boundaries at global indices 15, 30, and 45.

tdsm_read(global[15], local, 20*sizeof(int));

This reads global[15] through global[34] into a local buffer. With the array based at virtual address 0x00040012, the request covers addresses 0x0004004E to 0x0004009A, which map to local elements 0 to 14 on Node 6 and local elements 0 to 4 on Node 7.
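The sketch below reproduces the block-distribution arithmetic of this example, assuming 15 int elements per node starting at Node 5, as in the figure; tdsm_locate() is a hypothetical helper, not part of the TDSM API.

    /* Block-distribution arithmetic behind the example above, assuming
     * 15 elements per node starting at Node 5. tdsm_locate() is a
     * hypothetical helper, not the TDSM API. */
    #include <stdio.h>
    #include <stddef.h>

    #define ELEMS_PER_NODE 15
    #define FIRST_NODE      5

    static void tdsm_locate(size_t global_index, int *node, size_t *local_index)
    {
        *node        = FIRST_NODE + (int)(global_index / ELEMS_PER_NODE);
        *local_index = global_index % ELEMS_PER_NODE;
    }

    int main(void)
    {
        int node;
        size_t local;

        tdsm_locate(15, &node, &local);   /* first element of the read */
        printf("global[15] -> Node %d, local %zu\n", node, local);  /* Node 6, 0 */

        tdsm_locate(34, &node, &local);   /* last element of the read  */
        printf("global[34] -> Node %d, local %zu\n", node, local);  /* Node 7, 4 */
        return 0;
    }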

SLIDE 17

Outline

• Introduction
• Contributions
• TNT on BlueGene/P
  • Scheduling TNT across nodes
  • Synchronization across nodes
  • TNT Distributed Shared Memory
• Results
• Conclusions and Future Work


SLIDE 18

Summary of Results

• The TNT thread system shows speedup comparable to that of Pthreads running on the same hardware.
• The distributed shared memory operates at 95% of the experimental peak performance of the network, and the distance between nodes is not a sensitive factor.
• The total cost of thread creation grows linearly as the number of threads increases.
• The cost of waiting at a barrier is constant and independent of the number of threads involved.


SLIDE 19

Single-Node Thread System Performance

• Benchmark based upon the radix-2 Cooley-Tukey algorithm, with the Kiss FFT library for the underlying DFT (see the sketch below).
• The underlying TNT thread model performs comparably to the POSIX standard when the number of threads does not exceed the number of available processor cores.
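As an illustration of where the thread-level parallelism comes from, here is a hedged sketch of one parallel radix-2 step: the even- and odd-index sub-DFTs are independent, so each can run in its own TNT thread, with dft() standing in for the Kiss FFT call. All names are illustrative, not the paper’s benchmark code.

    /* One parallel radix-2 Cooley-Tukey step. The two half-size DFTs
     * are independent, so each runs in its own TNT thread; dft() stands
     * in for the Kiss FFT call. All names are illustrative. */
    #include <complex.h>
    #include <math.h>
    #include <stdint.h>

    typedef uint32_t tnt_t;                       /* assumed thread-id type */
    extern int  tnt_create(tnt_t *tid, void (*fn)(void *), void *arg);
    extern int  tnt_join(tnt_t tid);
    extern void dft(double complex *in, double complex *out, int n);

    typedef struct { double complex *in, *out; int n; } dft_args;

    static void dft_thread(void *p)
    {
        dft_args *a = (dft_args *)p;
        dft(a->in, a->out, a->n);                 /* half-size DFT */
    }

    /* Combine the DFTs of the even- and odd-indexed samples into the
     * full transform of length 2*half. */
    static void fft_radix2_parallel(double complex *even, double complex *odd,
                                    double complex *out, int half)
    {
        double complex e[half], o[half];
        dft_args ea = { even, e, half }, oa = { odd, o, half };
        tnt_t t0, t1;

        tnt_create(&t0, dft_thread, &ea);         /* run the two sub-DFTs */
        tnt_create(&t1, dft_thread, &oa);         /* concurrently */
        tnt_join(t0);
        tnt_join(t1);

        for (int k = 0; k < half; k++) {          /* butterfly combine */
            double complex w = cexp(-I * M_PI * k / half) * o[k];
            out[k]        = e[k] + w;
            out[k + half] = e[k] - w;
        }
    }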


SLIDE 20

Memory System Performance

• Reads and writes of varying sizes between one and two nodes.
• For inter-node communications, data transfers at approximately 357 MB/s, roughly 95% of the 374 MB/s experimental peak that Kumar et al. determined for BG/P in their ICS’08 paper [2].


SLIDE 21

Memory System Performance

• Size of each read/write is a function of the number of nodes across which the data is distributed.
• Latency increases linearly as the amount of data increases, regardless of the distance between nodes (a rough cost model follows).
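This linear trend is consistent with a simple bandwidth-bound cost model using the measured ~357 MB/s figure; the fixed per-message overhead below is an assumed placeholder, not a number from the slides.

    /* Bandwidth-bound latency model implied by the slide: time grows
     * linearly with transfer size at the measured ~357 MB/s. ALPHA_S is
     * an assumed fixed overhead, not a number from the deck. */
    #define BYTES_PER_S (357.0 * 1024 * 1024)   /* measured bandwidth */
    #define ALPHA_S     1e-5                    /* assumed fixed overhead (s) */

    static double estimate_latency_s(double bytes)
    {
        return ALPHA_S + bytes / BYTES_PER_S;
    }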


SLIDE 22

Multinode Thread Creation Cost

• Approximately 0.2 seconds per thread.
• Per-thread creation cost remained effectively constant.


SLIDE 23

Synchronization Costs

The cost of waiting at a barrier is effectively a constant 0.2 seconds.


SLIDE 24

Outline

• Introduction
• Contributions
• TNT on BlueGene/P
  • Scheduling TNT across nodes
  • Synchronization across nodes
  • TNT Distributed Shared Memory
• Results
• Conclusions and Future Work


SLIDE 25

Conclusions and Future Work

• Proven the feasibility of the system.
• Demonstrated the benefits of an execution-model-driven approach.
• Room for improvement:
  • Improvements to the kernel
  • More rigorous benchmarks
  • Improved allocation and scheduling algorithms


SLIDE 26


Thank You


SLIDE 27

Bibliography

[1] J. del Cuvillo, W. Zhu, Z. Hu, and G. R. Gao, “TiNy Threads: A thread virtual machine for the Cyclops64 cellular architecture,” Parallel and Distributed Processing Symposium, International, vol. 15, p. 265b, 2005.
[2] S. Kumar, G. Dozsa, G. Almasi et al., “The deep computing messaging framework: generalized scalable message passing on the Blue Gene/P supercomputer,” in ICS ’08.
