THEIA GPU Open Source multicore programmable GPU Problem Statement - PowerPoint PPT Presentation

THEIA GPU Open Source multicore programmable GPU

Problem Statement
● Develop an open source 3D Graphic Processor (GPU).
● Develop a high level language to program the GPU.
● Provide all of the necessary tools, test-bench and regressions.
● Should be different from current state-of-the-art (at least a little different).

What kind of GPU?
● Vector processing.
● Multiple hardware threads.
● Multiple cores.
● Out-of-order execution.
● And a lot of other funky stuff...

Vector processing adds data-level parallelism
● Instructions operate on “ranges” of registers instead of operating on single registers.
● Example: R3[50:10] = R1[50:10] + R2[50:10]
[Figure: elements Array1[0..n] and Array2[0..n] feed Reservation Station 0 and its Execution Unit.]
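
The range semantics can be sketched in Python. This is only an illustrative model, assuming that `R3[50:10]` denotes the inclusive register range 10 through 50 (high index first, as in hardware slice notation). The `vector_add_range` helper and the dictionary-based register file are hypothetical names, not part of THEIA.

```python
def vector_add_range(regs, dst, src_a, src_b, hi, lo):
    """Model dst[hi:lo] = src_a[hi:lo] + src_b[hi:lo] as a single instruction.

    Assumption: the range is inclusive on both ends, with hi >= lo.
    """
    if hi < lo:
        raise ValueError("expected hi >= lo")
    for i in range(lo, hi + 1):
        regs[dst][i] = regs[src_a][i] + regs[src_b][i]


# Hypothetical register file: three banks of 64 entries each.
regs = {
    "R1": list(range(64)),
    "R2": [10] * 64,
    "R3": [0] * 64,
}
vector_add_range(regs, "R3", "R1", "R2", hi=50, lo=10)
```

Entries outside the range (for example `R3[9]` and `R3[51]`) are left untouched, which is the point of addressing a range rather than a whole register bank.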

3 data lanes add further parallelism to vector operations
● Each execution unit is replicated three times for parallel execution.
● Memory locations are logically divided into x, y and z components (32 bits each).
[Figure: the x, y and z components of Array1[0..n] and Array2[0..n] feed Reservation Station 0 and Execution Units X, Y and Z.]
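
A minimal Python sketch of the lane split, assuming each memory location is modelled as an `(x, y, z)` tuple. The function names are hypothetical, and the three lane computations run one after another here, whereas the hardware would run them on three replicated execution units at once.

```python
def lane_add(a, b):
    """Work done by one execution unit: element-wise add of one component."""
    return [p + q for p, q in zip(a, b)]


def vector3_add(array1, array2):
    """Add two arrays of (x, y, z) tuples, one independent lane per component."""
    # Split each array into its x, y and z lanes.
    x1, y1, z1 = zip(*array1)
    x2, y2, z2 = zip(*array2)
    # The three calls share no data, so they could execute in parallel.
    xs = lane_add(x1, x2)
    ys = lane_add(y1, y2)
    zs = lane_add(z1, z2)
    return list(zip(xs, ys, zs))


result = vector3_add([(1, 2, 3), (4, 5, 6)], [(10, 20, 30), (40, 50, 60)])
```

The sketch expects non-empty arrays of equal length.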

More parallelism: out-of-order execution of the vector operations
● Vector operations can be executed out of order as long as there are available reservation stations.
● Register renaming is used (Tomasulo's algorithm).
    vector array1[10],array2[10];
    vector result1[10],result2[10],result3[10];
    result1 = array1 / array2;
    result1 = array1 + array2;
    result1 = array1 * array2;
    ...
[Figure: Reservation Station 0, Reservation Station 1, ..., Reservation Station k.]
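
The effect of having several reservation stations can be illustrated with a toy cycle-by-cycle simulation in Python. It is a sketch under strong assumptions: the latencies are made up, operations are treated as independent, and neither operand tracking nor register renaming is modelled, so it is not Tomasulo's algorithm itself.

```python
def simulate(ops, num_stations):
    """Return operation names in completion order.

    ops: list of (name, latency_in_cycles) in program order.
    Operations issue in order into free reservation stations and
    finish whenever their (hypothetical) latency has elapsed.
    """
    pending = list(ops)
    busy = []        # entries are [remaining_cycles, name]
    completed = []
    while pending or busy:
        # In-order issue: fill free stations from the head of the program.
        while pending and len(busy) < num_stations:
            name, latency = pending.pop(0)
            busy.append([latency, name])
        # Advance every occupied station by one cycle.
        for entry in busy:
            entry[0] -= 1
        completed.extend(name for remaining, name in busy if remaining <= 0)
        busy = [entry for entry in busy if entry[0] > 0]
    return completed


program = [("div", 8), ("add", 1), ("mul", 3)]
out_of_order = simulate(program, num_stations=3)
in_order = simulate(program, num_stations=1)
```

With three stations the short `add` and `mul` finish before the long `div`. With a single station the operations complete strictly in program order.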

Simultaneous multi-threading (SMT)
● Only 1 thread can issue at a given point in time (in-order issue).
● Operations can start executing whenever a reservation station (RS) becomes available (out-of-order execution).
    Thread 1:
    result1 = array1 / array2;
    result1 = array1 + array2;
    ...
    Thread N:
    result1 = array1 / array2;
    result1 = array1 + array2;
    ...
[Figure: Reservation Station 0, Reservation Station 1, ..., Reservation Station k.]
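
The "one thread issues at a time" rule can be sketched as follows. The slides do not say how the issuing thread is chosen, so the round-robin policy below is purely an assumption, as is the `issue_order` helper name.

```python
from collections import deque


def issue_order(threads):
    """Return (thread, op) pairs in issue order, one issue slot per cycle.

    threads: dict mapping a thread name to its operations in program order.
    Assumption: threads take turns round-robin; within a thread the
    operations always issue in program order.
    """
    queues = deque((name, deque(ops)) for name, ops in threads.items() if ops)
    order = []
    while queues:
        name, ops = queues.popleft()
        order.append((name, ops.popleft()))
        if ops:
            # The thread still has work: send it to the back of the line.
            queues.append((name, ops))
    return order


order = issue_order({"Thread 1": ["div", "add"], "Thread N": ["div", "add"]})
```

Once issued, each operation would wait in a reservation station and could still complete out of order, which this sketch does not model.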

Multiple vector processing cores
● Multiple vector processing cores operate in parallel.
● Each vector processing core executes multiple threads in parallel.
[Figure: Core0 ... CoreM, each running Thread 1 ... Thread N on reservation stations RS0 ... RSk.]

Control processor handles load and resource distribution of the system
* The control processor (CP) allows the user to programmatically control the resource allocation and the workload distribution of the GPU.
* Instead of implementing complex dynamic hardware based scheduling algorithms, the CP allows for these algorithms to be implemented in software.
[Figure: the Control Processor (CP) above Core0 ... CoreM, each running Thread 0 ... Thread N.]

The control processor
● The CP controls the global execution of the system.
● The CP does not process data; it only schedules the data processing that will occur in the vector processors (VPs).
● The CP is a simple but fully programmable processor.
● A special extension of the high level language has been developed specifically for the CP.
● The CP controls the interface between the VP cores and the GPU memory.
    #include "theia.thh"
    #include "code_block_header.thh"

    scalar DstOffsetAndLen,SrcOffset,CoredId;

    //First send the data into cores
    SrcOffset = 0;
    DstOffsetAndLen = (0x0 | (CORE_INPUT_AREA_SIZE << 20) );
    while (CoredId <= THEIA_CAPABILTIES_MAX_CORES)
    {
        copy_data_block< CoredId, DstOffsetAndLen, SrcOffset>;
        SrcOffset += INPUT_DATA_LEN;
        CoredId++;
    }
    //wait until enqueued block transfers are complete
    while ( block_transfer_in_progress ) {}

    //Then send the code to all cores and start them
    SrcOffset = SIMPLE_RENDER_OFFSET;
    DstOffsetAndLen = (0x0 | SIMPLE_RENDER_SIZE | VP_DST_CODE_MEM );
    copy_data_block < ALLCORES , DstOffsetAndLen ,SrcOffset>;
    start <ALLCORES>;
    exit ;
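
The first loop of the listing can be mirrored in Python to make the data flow explicit. The bit layout is an assumption inferred only from the `<< 20` in the listing: the destination offset is taken to occupy the low 20 bits of `DstOffsetAndLen` and the length the bits above. All function names and the numeric sizes are hypothetical.

```python
OFFSET_BITS = 20                      # assumption: taken from the "<< 20" in the listing
OFFSET_MASK = (1 << OFFSET_BITS) - 1


def pack_dst(offset, length):
    """Pack a destination offset (low bits) and a length (high bits) into one word."""
    if not 0 <= offset <= OFFSET_MASK:
        raise ValueError("offset does not fit in the low 20 bits")
    if length < 0:
        raise ValueError("length must be non-negative")
    return offset | (length << OFFSET_BITS)


def unpack_dst(word):
    """Inverse of pack_dst: return (offset, length)."""
    return word & OFFSET_MASK, word >> OFFSET_BITS


def plan_input_transfers(core_ids, area_size, input_len):
    """One block transfer per core, as in the first loop of the listing.

    Every core receives its data at destination offset 0, while the
    source offset advances by input_len for each core.
    """
    dst_word = pack_dst(0, area_size)
    return [(core, dst_word, i * input_len) for i, core in enumerate(core_ids)]


transfers = plan_input_transfers(core_ids=[0, 1, 2], area_size=64, input_len=256)
```

Each tuple corresponds to one `copy_data_block` command: the target core, the packed destination word, and the source offset in external memory.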

Memories and the memory controller
[Figure: block diagram showing the external memory, the memory controller, the Control Processor (CP), Core0 ... CoreM with Thread 0 ... Thread N, the output memories OM0 ... OMK, the cross bar and the texture memory (TMEM).]

The memory controller
Takes care of transferring data from the “external memory” to the texture memory or the vector processors.
The CP controls the memory controller, issuing asynchronous block transfer commands.

The external memory
Used by the CPU to write or read data for the GPU to process.
Can store GPU code or data.
Is not part of the GPU, per se.
Conceptually a large RAM.
The GPU can only access this memory via the memory controller.

The texture memory
Read-only from the vector processor perspective.
Multiple VPs can simultaneously read using a full mesh cross bar.
Only the memory controller can write into TMEM.
Default store location for texture data (although the CP code can decide to store anything in there).

The output memories
Write-only from the vector processor perspective.
Each VP can only write into a single and unique OM.
Each OM is “owned” by a VP for write operations; the OM cannot be shared.
Default store location for program result data.
The CP can request the OM data to be transferred back into the external memory, or into the graphics frame buffer.
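
The access rules for the texture memory and the output memories can be summarised with a small Python model. It is a sketch only: the class and method names are hypothetical, and the memories are plain lists rather than anything resembling the real hardware.

```python
class TextureMemory:
    """TMEM: written only by the memory controller, read by any VP."""

    def __init__(self, size):
        self._data = [0] * size

    def mc_write(self, addr, value):
        # Memory-controller side: the only write path into TMEM.
        self._data[addr] = value

    def vp_read(self, addr):
        # Vector-processor side: read-only access.
        return self._data[addr]


class OutputMemory:
    """OM: write-only for its single owning VP, read out by the memory controller."""

    def __init__(self, owner_vp, size):
        self.owner_vp = owner_vp
        self._data = [0] * size

    def vp_write(self, vp_id, addr, value):
        if vp_id != self.owner_vp:
            raise PermissionError(f"VP {vp_id} does not own this OM")
        self._data[addr] = value

    def mc_drain(self):
        # Memory-controller side: copy the results out, e.g. to external memory.
        return list(self._data)


tmem = TextureMemory(size=4)
tmem.mc_write(0, 42)
om0 = OutputMemory(owner_vp=0, size=4)
om0.vp_write(0, 1, tmem.vp_read(0) + 1)
```

A write from any VP other than the owner raises `PermissionError`, reflecting the rule that an OM cannot be shared.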

Programming the GPU
* Has a high level programming language called “T-Language”.
* Resembles C but is designed for 3D operations.
* Cleanly exposes the features of the hardware with no need for the user to know about low-level details.
* The user writes separate code for the CP and the VP (the grammar is similar, but the features change).

What does the VP code look like?
T-Language allows thread declaration as part of the grammar.
Variables are declared as “Vector” data types: 3D vectors divided into x, y and z.
Allows subroutines, variable stacks, arrays and many other things.
