

SLIDE 1

TiNy Threads on BlueGene/P: Exploring Many-Core Parallelisms Beyond The Traditional OS

Handong Ye, Robert Pavel, Aaron Landwehr, Guang R. Gao
Department of Electrical & Computer Engineering, University of Delaware

MTAAP’2010, Atlanta, Georgia, 2010-04-23

SLIDE 2

Introduction

• The modern OS is based upon a sequential execution model (the von Neumann model).
• Multi-core/many-core chip technology has progressed rapidly.
• Parallel computer systems are now implemented on single chips.


SLIDE 3

Introduction

• The conventional OS model must adapt to these underlying changes to further exploit the many levels of parallelism, in hardware as well as software.
• We present a study of how to make this adaptation for the IBM BlueGene/P multi-core system.


SLIDE 4

Outline

• Introduction
• Contributions
• TNT on BlueGene/P
  • Scheduling TNT across nodes
  • Synchronization across nodes
  • TNT Distributed Shared Memory
• Results
• Conclusions and Future Work


SLIDE 5

Contributions

• Isolated traditional OS functions to a single core of the BG/P multi-core chip.
• Ported the TiNy Threads (TNT) execution model to allow for further utilization of the BG/P compute cores.
• Expanded the design framework to a multi-chip system designed to scale to a large number of chips.


SLIDE 6

Outline

• Introduction
• Contributions
• TNT on BlueGene/P
  • Scheduling TNT across nodes
  • Synchronization across nodes
  • TNT Distributed Shared Memory
• Results
• Conclusions and Future Work


SLIDE 7

TiNy Threads on BG/P

TiNy Threads

• Lightweight, non-preemptive threads with an API similar to POSIX Threads (a usage sketch follows).
• Originally presented in “TiNy Threads: A Thread Virtual Machine for the Cyclops64 Cellular Architecture” [1].
• Runs on the IBM Cyclops64.
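The deck shows the names tnt_create(), tnt_join(), and tnt_exit() on later slides; a minimal usage sketch follows, where the prototypes and the tnt_t id type are assumptions modeled on the pthreads-like API described above, not confirmed signatures.

    /* Minimal TNT usage sketch. The names tnt_create(), tnt_join(), and
     * tnt_exit() appear in the deck; the prototypes and the tnt_t type
     * below are assumptions modeled on the pthreads-like API. */
    #include <stdio.h>
    #include <stdint.h>

    typedef uint32_t tnt_t;                     /* assumed thread-id type */

    extern int  tnt_create(tnt_t *tid, void (*fn)(void *), void *arg);
    extern int  tnt_join(tnt_t tid);
    extern void tnt_exit(void);

    static void worker(void *arg)
    {
        printf("hello from thread %d\n", *(int *)arg);
        tnt_exit();            /* non-preemptive: runs to completion, then exits */
    }

    int main(void)
    {
        tnt_t tid;
        int   id = 0;

        tnt_create(&tid, worker, &id);          /* spawn a TiNy Thread */
        tnt_join(tid);                          /* wait for it to call tnt_exit() */
        return 0;
    }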

Kernel Modifications

• Alterations to the thread scheduler to allow for non-preemption.


SLIDE 8

Outline

• Introduction
• Contributions
• TNT on BlueGene/P
  • Scheduling TNT across nodes
  • Synchronization across nodes
  • TNT Distributed Shared Memory
• Results
• Conclusions and Future Work


SLIDE 9

Multinode Thread Scheduler

• The thread scheduler allows TNT to run across multiple nodes.
• Requests are facilitated through RPCs in IBM’s Deep Computing Messaging Framework (DCMF).
• Multiple scheduling algorithms (see the sketch below):
  • Workload-unaware: Random, Round-Robin
  • Workload-aware
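The two workload-unaware policies named above can be sketched as simple node-selection functions. The pick_node_*() names and NUM_NODES are illustrative; the real scheduler would dispatch a creation request to the chosen node via a DCMF RPC.

    /* Sketch of the two workload-unaware node-selection policies.
     * pick_node_*() and NUM_NODES are illustrative names only. */
    #include <stdlib.h>

    #define NUM_NODES 64                  /* assumed node count */

    static int next_node = 0;             /* round-robin cursor */

    /* Round-robin: cycle through the nodes in order. */
    static int pick_node_round_robin(void)
    {
        int node = next_node;
        next_node = (next_node + 1) % NUM_NODES;
        return node;
    }

    /* Random: pick any node uniformly, ignoring current load. */
    static int pick_node_random(void)
    {
        return rand() % NUM_NODES;
    }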


SLIDE 10

Multinode Thread Scheduling


[Figure: remote thread creation. Node A calls tnt_create(), which returns a tid for a thread spawned on Node B; Node A’s later tnt_join() waits until the remote thread calls tnt_exit().]

SLIDE 11

Outline

• Introduction
• Contributions
• TNT on BlueGene/P
  • Scheduling TNT across nodes
  • Synchronization across nodes
  • TNT Distributed Shared Memory
• Results
• Conclusions and Future Work


SLIDE 12

Synchronization

• Three forms: mutexes, thread joining, and barriers.
• Handled similarly to thread scheduling: lock requests, join requests, and barrier notifications are sent to the node responsible for the given synchronization object (see the sketch below).
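A minimal sketch of the home-node routing idea follows. The modulo mapping from object id to owner is an assumption (the deck only says requests go to the responsible node), and both send_lock_request() and the tnt_mutex_lock name are hypothetical stand-ins.

    /* Home-node routing sketch: each synchronization object is owned by
     * one node, derived here by a simple modulo mapping (an assumption).
     * send_lock_request() is a hypothetical DCMF RPC stub. */
    #include <stdint.h>

    #define NUM_NODES 64                          /* assumed node count */

    extern void send_lock_request(int node, uint32_t mutex_id);  /* RPC stub */

    /* Derive the owning node from the mutex id. */
    static int mutex_home_node(uint32_t mutex_id)
    {
        return (int)(mutex_id % NUM_NODES);
    }

    static void tnt_mutex_lock(uint32_t mutex_id)    /* assumed API name */
    {
        send_lock_request(mutex_home_node(mutex_id), mutex_id);
        /* ... block until the home node grants the lock ... */
    }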


SLIDE 13

Multinode Thread Synchronization


[Figure: cross-node thread joining. A thread with a given tid runs on Node B; when it calls tnt_exit(), Node B notifies Node A so that Node A’s tnt_join() on that tid can return.]

SLIDE 14

Outline

• Introduction
• Contributions
• TNT on BlueGene/P
  • Scheduling TNT across nodes
  • Synchronization across nodes
  • TNT Distributed Shared Memory
• Results
• Conclusions and Future Work


SLIDE 15

Characteristics of TDSM

• Provides one-sided access to memory distributed among nodes through IBM’s DCMF.
• Allows for virtual address manipulation:
  • Maps distributed memory to a single virtual address space.
  • Allows for array indexing and memory offsets.
• Scalable to a variety of applications: the size of the desired global shared memory is set at runtime.
• Mutability: the memory allocation and memory distribution algorithms can be easily altered and/or replaced.


SLIDE 16

Example of TDSM

A global array of ints is block-distributed across Nodes 5, 6, and 7, with node boundaries at global indices 15, 30, and 45.

tdsm_read(global[15], local, 20*sizeof(int));

This reads global[15] through global[34] into a local buffer. With the array based at virtual address 0x00040012, the request covers addresses 0x0004004E to 0x0004009A, which map to local elements 0 to 14 on Node 6 and local elements 0 to 4 on Node 7.
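The sketch below reproduces the block-distribution arithmetic of this example, assuming 15 int elements per node starting at Node 5, as in the figure; tdsm_locate() is a hypothetical helper, not part of the TDSM API.

    /* Block-distribution arithmetic behind the example above, assuming
     * 15 elements per node starting at Node 5. tdsm_locate() is a
     * hypothetical helper, not the TDSM API. */
    #include <stdio.h>
    #include <stddef.h>

    #define ELEMS_PER_NODE 15
    #define FIRST_NODE      5

    static void tdsm_locate(size_t global_index, int *node, size_t *local_index)
    {
        *node        = FIRST_NODE + (int)(global_index / ELEMS_PER_NODE);
        *local_index = global_index % ELEMS_PER_NODE;
    }

    int main(void)
    {
        int node;
        size_t local;

        tdsm_locate(15, &node, &local);   /* first element of the read */
        printf("global[15] -> Node %d, local %zu\n", node, local);  /* Node 6, 0 */

        tdsm_locate(34, &node, &local);   /* last element of the read  */
        printf("global[34] -> Node %d, local %zu\n", node, local);  /* Node 7, 4 */
        return 0;
    }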

SLIDE 17

Outline

• Introduction
• Contributions
• TNT on BlueGene/P
  • Scheduling TNT across nodes
  • Synchronization across nodes
  • TNT Distributed Shared Memory
• Results
• Conclusions and Future Work


SLIDE 18

Summary of Results

• The TNT thread system shows speedup comparable to that of Pthreads running on the same hardware.
• The distributed shared memory operates at 95% of the experimental peak performance of the network, and the distance between nodes is not a sensitive factor.
• The total cost of thread creation grows linearly as the number of threads increases.
• The cost of waiting at a barrier is constant and independent of the number of threads involved.


SLIDE 19

Single-Node Thread System Performance

• Benchmark based upon the radix-2 Cooley-Tukey algorithm, with the Kiss FFT library for the underlying DFT (see the sketch below).
• The underlying TNT thread model performs comparably to the POSIX standard when the number of threads does not exceed the number of available processor cores.
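As an illustration of where the thread-level parallelism comes from, here is a hedged sketch of one parallel radix-2 step: the even- and odd-index sub-DFTs are independent, so each can run in its own TNT thread, with dft() standing in for the Kiss FFT call. All names are illustrative, not the paper’s benchmark code.

    /* One parallel radix-2 Cooley-Tukey step. The two half-size DFTs
     * are independent, so each runs in its own TNT thread; dft() stands
     * in for the Kiss FFT call. All names are illustrative. */
    #include <complex.h>
    #include <math.h>
    #include <stdint.h>

    typedef uint32_t tnt_t;                       /* assumed thread-id type */
    extern int  tnt_create(tnt_t *tid, void (*fn)(void *), void *arg);
    extern int  tnt_join(tnt_t tid);
    extern void dft(double complex *in, double complex *out, int n);

    typedef struct { double complex *in, *out; int n; } dft_args;

    static void dft_thread(void *p)
    {
        dft_args *a = (dft_args *)p;
        dft(a->in, a->out, a->n);                 /* half-size DFT */
    }

    /* Combine the DFTs of the even- and odd-indexed samples into the
     * full transform of length 2*half. */
    static void fft_radix2_parallel(double complex *even, double complex *odd,
                                    double complex *out, int half)
    {
        double complex e[half], o[half];
        dft_args ea = { even, e, half }, oa = { odd, o, half };
        tnt_t t0, t1;

        tnt_create(&t0, dft_thread, &ea);         /* run the two sub-DFTs */
        tnt_create(&t1, dft_thread, &oa);         /* concurrently */
        tnt_join(t0);
        tnt_join(t1);

        for (int k = 0; k < half; k++) {          /* butterfly combine */
            double complex w = cexp(-I * M_PI * k / half) * o[k];
            out[k]        = e[k] + w;
            out[k + half] = e[k] - w;
        }
    }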


SLIDE 20

Memory System Performance

• Reads and writes of varying sizes between one and two nodes.
• For inter-node communications, data transfers at approximately 357 MB/s, roughly 95% of the 374 MB/s experimental peak that Kumar et al. determined for BG/P in their ICS’08 paper [2].


SLIDE 21

Memory System Performance

• Size of each read/write is a function of the number of nodes across which the data is distributed.
• Latency increases linearly as the amount of data increases, regardless of the distance between nodes (a rough cost model follows).
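This linear trend is consistent with a simple bandwidth-bound cost model using the measured ~357 MB/s figure; the fixed per-message overhead below is an assumed placeholder, not a number from the slides.

    /* Bandwidth-bound latency model implied by the slide: time grows
     * linearly with transfer size at the measured ~357 MB/s. ALPHA_S is
     * an assumed fixed overhead, not a number from the deck. */
    #define BYTES_PER_S (357.0 * 1024 * 1024)   /* measured bandwidth */
    #define ALPHA_S     1e-5                    /* assumed fixed overhead (s) */

    static double estimate_latency_s(double bytes)
    {
        return ALPHA_S + bytes / BYTES_PER_S;
    }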


SLIDE 22

Multinode Thread Creation Cost

• Approximately 0.2 seconds per thread.
• Per-thread creation cost remained effectively constant.


SLIDE 23

Synchronization Costs

The cost of waiting at a barrier is effectively a constant 0.2 seconds.


SLIDE 24

Outline

• Introduction
• Contributions
• TNT on BlueGene/P
  • Scheduling TNT across nodes
  • Synchronization across nodes
  • TNT Distributed Shared Memory
• Results
• Conclusions and Future Work


SLIDE 25

Conclusions and Future Work

• Proven the feasibility of the system.
• Demonstrated the benefits of an execution-model-driven approach.
• Room for improvement:
  • Improvements to the kernel
  • More rigorous benchmarks
  • Improved allocation and scheduling algorithms


SLIDE 26


Thank You


SLIDE 27

Bibliography

[1] J. del Cuvillo, W. Zhu, Z. Hu, and G. R. Gao, “TiNy Threads: A thread virtual machine for the Cyclops64 cellular architecture,” Parallel and Distributed Processing Symposium, International, vol. 15, p. 265b, 2005.
[2] S. Kumar, G. Dozsa, G. Almasi et al., “The deep computing messaging framework: generalized scalable message passing on the Blue Gene/P supercomputer,” in ICS ’08.
