

SLIDE 1

Portable Designs for Performance Using the Hybrid Task Graph Scheduler

Tim Blattner NIST | ITL | SSD | ISG

SLIDE 2

Disclaimer

2018-03-28 GPU Technology Conference 2

No approval or endorsement of any commercial product by NIST is intended or implied. Certain commercial software, products, and systems are identified in this report to facilitate better understanding. Such identification does not imply recommendation or endorsement by NIST, nor does it imply that the software and products identified are necessarily the best available for the purpose.

SLIDE 3

Acknowledgements

2018-03-28 GPU Technology Conference 3

• University of Maryland, College Park
  • Shuvra Bhattacharyya, Jiahao Wu
• Green Bank Observatory, WV
  • Richard Prestage
• NIST
  • Walid Keyrouz, Derek Juba, Alexandre Bardakoff, Peter Bajcsy, Mike Majurski, Adele Peskin, Zachary Levine, Adam Pintar, Mary Brady

SLIDE 4

Outline

2018-03-28 GPU Technology Conference 4

• Introduction
• Experiments with HTGS
• Current HTGS applications
• Lessons Learned and Future
• Closure

SLIDE 5

Performance of Scalable Systems --- Research Goals

2018-03-28 GPU Technology Conference 5

Instruments / Sensors: GBs/TBs/PBs of data

• Software approaches for parallelism
  • Scale with hardware parallelism
  • Multicore, GPU, and cluster computing
• Scalable programmer
  • Modest programming effort
  • Achieve 80-90% of attainable performance
• Built on abstractions and software infrastructure
  • Accessible performance model

IBM “Minsky” Power8+ NVLink 4 Tesla P100 GPUs, GTC DC 2016

Storing / Streaming Data

[Figure: HTGS profile graph for a 50000-unknown, 500-block-size multi-GPU block-panel LU decomposition: total compute time 87.67 s at 885.26 GFLOPS; each task node is annotated with its compute time, wait time, and maximum queue size (e.g., MatrixMulPanelTask compute 82.61 s; GausElimTask compute 5.40 s, wait 82.27 s).]
SLIDE 6

Challenging Hardware Landscape

• Modern computers have
  • Multi-core CPUs (10+ cores per CPU)
  • Many-core accelerators (GPUs)
• How to take advantage of these machines?
  • Particularly with multi-GPU configurations
• Need a programming model at a higher level of abstraction
  • Focus on parallelism, data motion, and memory usage

2018-03-28 GPU Technology Conference 6

Abstract execution models

SLIDE 7

Current Practice—Scalability Perspective

2018-03-28 GPU Technology Conference 7

• Retrofitting approach
  • Fine- vs. coarse-grained parallelism
• Parallel directives
  • OpenMP, OpenACC
• Parallel libraries
  • OpenCV, OpenBLAS, …
• Task libraries
  • OpenMP, StarPU, Legion, …
• Performance portability programming
  • Kokkos

[Figure: traditional offload approach vs. pipelined workflow approach]

SLIDE 8

Expanding on our Lessons Learned

8

• Image Stitching (2013)
  • Hybrid pipeline workflows
  • Multi-GPU
  • Multiple producers – multiple consumers
  • Significant effort to implement
• Generalize and extend the image stitching workflow for other applications
  • The Hybrid Task Graph Scheduler
  • The scalable programmer
  • Experimentation for performance

[Figure: CPU image stitching hybrid pipeline workflow (read → FFT/Disp. → BK, with one or more threads per stage) and multi-GPU image stitching hybrid pipeline workflow (per-GPU pipelines read → copier → FFT → BK → Disp feeding a shared CCF stage).]

2018-03-28 GPU Technology Conference

SLIDE 9

Experimentation for Performance

2018-03-28 GPU Technology Conference 9

• The essence of portable designs for performance
• Ability to programmatically adapt to the hardware landscape
• Modify algorithms at a high level of abstraction as new techniques are discovered
• Easily identify bottlenecks
  • Modify traversal strategies
  • Modify decomposition strategies
• Must maintain high-level abstractions from analysis to execution
• Improved profiling and debugging → experimentation for performance

SLIDE 10

Hybrid Task Graph Scheduler

10

• Maintains an explicit dataflow representation
  • Persists through analysis and implementation
• Experimentation for performance
  • Debug, profile, and visualize performance using the dataflow representation
• Focus on
  • Separation of concerns
  • Coarse-grain parallelism
  • Hiding the latency of data motion
  • Memory management
• Efforts spilling over into
  • Computational Tomography
  • Fast Image – high-performance image processing (prior to MITS)
  • Radio astronomy radio-frequency interference mitigation

2018-03-28 GPU Technology Conference

SLIDE 11

HTGS

2018-03-28 GPU Technology Conference 11

Model
• Blends dataflow and task graph
  • Nodes – tasks
  • Edges – dataflow between tasks

C++ API
• Header only

Methodology (a small data-type sketch follows this list)
  1. Start with a parallel algorithm
  2. Represent it as a dataflow graph
  3. Map it onto an HTGS task graph
  4. Implement the graph using the API and annotate for memory
  5. Refine and optimize
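As a concrete illustration of steps 2-4, the data flowing along each edge of the graph is wrapped in a small class derived from htgs::IData. The sketch below is illustrative only: the MatrixBlockData name, its fields, and the include path are assumptions based on the public HTGS headers, not code shown in this talk.

#include <cstddef>
#include <htgs/api/IData.hpp>

// Hypothetical payload for one dataflow edge: a matrix block keyed by its
// (row, col) position in the decomposed matrix.
class MatrixBlockData : public htgs::IData {
 public:
  MatrixBlockData(std::size_t row, std::size_t col, double *block)
      : row(row), col(col), block(block) {}
  std::size_t row;   // block-row index
  std::size_t col;   // block-column index
  double *block;     // pointer to the block's elements
};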
SLIDE 12

HTGS API

2018-03-28 GPU Technology Conference 12

• Task interface (see the task sketch after this list)
  • Initialize
  • Execute
  • Can-terminate
  • Shutdown
• Each task binds to one or more CPU threads
• Edges between tasks are thread-safe data queues
• Apply binding to accelerator
  • GPU (NVIDIA/AMD), FPGA, …
• Specialty tasks
  • Bookkeeper task
    • Manages complex data dependencies
    • Maintains state of computation
  • CUDA task
    • Binds a task to an NVIDIA CUDA GPU
  • Execution pipeline task
    • Creates copies of a task graph, each copy bound to a specified GPU
  • Memory manager
    • Attaches a memory edge to a task: getMemory("nameOfEdge")
    • Binds the memory allocation to an address space (CPU, GPU, etc.)
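A minimal task sketch tying the interface above together. It is written against the public HTGS tutorials; the MADTask name, its data types, and the body of executeTask are placeholders, and exact virtual-method signatures may differ between HTGS versions.

#include <memory>
#include <vector>
#include <htgs/api/IData.hpp>
#include <htgs/api/ITask.hpp>

// Placeholder input/output payloads for the task's edges.
class DataBlockData : public htgs::IData { public: std::vector<float> samples; };
class MADResultData : public htgs::IData { public: std::vector<float> filtered; };

class MADTask : public htgs::ITask<DataBlockData, MADResultData> {
 public:
  explicit MADTask(size_t numThreads)
      : htgs::ITask<DataBlockData, MADResultData>(numThreads) {}

  void initialize() override { /* per-thread setup (scratch buffers, etc.) */ }

  void executeTask(std::shared_ptr<DataBlockData> data) override {
    auto result = std::make_shared<MADResultData>();
    // ... median-absolute-deviation filtering of data->samples would go here ...
    this->addResult(result);   // send output along the outgoing edge
  }

  void shutdown() override { /* release per-thread resources */ }

  MADTask *copy() override { return new MADTask(this->getNumThreads()); }
  // canTerminate() is inherited: by default a task terminates once all of its
  // producers have finished.
};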

SLIDE 13

Sample Code to Build Graph (RFI Mitigation)

2018-03-28 GPU Technology Conference 13

ReadStreamTask *readTask = new ReadStreamTask(inputFileName);
MADTask *madTask = new MADTask(numMADThreads, …);
WriteResultTask *writeResultTask = new WriteResultTask(…);

// build HTGS graph
auto graph = new htgs::TaskGraphConf<htgs::VoidData, htgs::VoidData>();
graph->addEdge(readTask, madTask);
graph->addEdge(madTask, writeResultTask);
graph->addMemoryManagerEdge("DataBlock", readTask,
                            new DataBlockAllocator(size), numDataBlocks,
                            htgs::MMType::Static);
graph->writeDotToFile("MADGraph-Pre-Exec.dot");

// Launch the graph
htgs::TaskGraphRuntime *runtime = new htgs::TaskGraphRuntime(graph);
// Launch runtime and produce/consume data to/from graph . . .
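Continuing the listing above, a minimal sketch of the launch step hinted at by the final comment. The executeRuntime()/waitForRuntime() calls and the post-execution writeDotToFile() are taken from the public HTGS tutorials; treat them as assumptions if your HTGS version differs.

runtime->executeRuntime();   // spawn the threads for every task in the graph
// ... ReadStreamTask streams blocks through MADTask to WriteResultTask ...
runtime->waitForRuntime();   // block until every task has terminated
graph->writeDotToFile("MADGraph-Post-Exec.dot");  // profiled graph (next slide)
delete runtime;              // per the tutorials, the runtime owns the graph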

SLIDE 14

Pre-Execution Graph

2018-03-28 GPU Technology Conference 14

[Figure: pre-execution dot graph, ReadStreamTask x1 → MADTask x40 → WriteResultTask x1, with a static memory manager edge MM(static): DataBlock attached to the read task.]

• Memory manager "DataBlock" (see the sketch below)
  • Ensures the system stays within memory limits
• Separate threads for read/write
  • Asynchronous I/O
• MADTask pool of 40 threads
  • Dual 10-core CPU w/ hyperthreading
  • Parallel processing
  • ~90x speedup over sequential
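A sketch of how the "DataBlock" memory edge throttles the reader, based on the HTGS tutorials: the producing task asks the memory manager for one of the numDataBlocks pooled blocks (and waits when none are free), and a downstream task releases it. The ReleaseAfterOneUse rule below is hypothetical; HTGS lets you supply your own htgs::IMemoryReleaseRule, and exact method names may differ between versions.

#include <htgs/api/IMemoryReleaseRule.hpp>

// Hypothetical release rule: the block may return to the pool after one use.
class ReleaseAfterOneUse : public htgs::IMemoryReleaseRule {
 public:
  void memoryUsed() override { uses--; }
  bool canReleaseMemory() override { return uses == 0; }
 private:
  int uses = 1;
};

// Producer side, inside ReadStreamTask::executeTask:
//   auto block = this->getMemory<char>("DataBlock", new ReleaseAfterOneUse());
//   ... fill block->get() with the next chunk of the input stream ...
// Consumer side, inside the task that finishes with the block:
//   this->releaseMemory(block);   // returns the block to the "DataBlock" pool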

SLIDE 15

Experiments with HTGS

Image Stitching for Microscopy | Matrix Multiplication | LU Decomposition

2018-03-28 GPU Technology Conference 15

SLIDE 16

Microscopy Image Stitching

2018-03-28 GPU Technology Conference 16

• Grid of overlapping images
  • Compute pair-wise relative displacements
  • 17k x 22k pixels per image
• Studying cell growth over time
  • >300 time-series images
• ImageJ software took ~6 hours to stitch

[Figure: stem cell data with a red outline for each tile; multi-GPU HTGS stitching implementation with tasks Read, FFT, BK, PCIAM, CCF and a MemManagerFFT memory edge.]

SLIDE 17

Stitching Results

2018-03-28 GPU Technology Conference 17

2013-05-20 results

Hardware: dual quad-core Xeon, 32 GB DDR3, 2 Tesla C2070s
Reference code: >3 hours
Speedup: sequential / implementation
Effective speedup: reference / implementation

                                    Time         Speedup   Effective Speedup   Threads
CPU      Sequential                 10 min 37 s  -         21x                 1
         Pipelined Multi-Threaded   1 min 20 s   7.7x      162.4x              19
CPU-GPU  Simple GPU                 9 min 17 s   1.08x     22.7x               1
         Pipelined-GPU, 1 GPU       43.6 s       14.6x     305.5x              11
         Pipelined-GPU, 2 GPUs      24.5 s       26x       512.3x              15

SLIDE 18

Stitching Results

2018-03-28 GPU Technology Conference 18

Hardware: dual 10-core Xeon, 128 GB DDR3, 3 Tesla K40s

                                    Time      Speedup   Effective Speedup   Threads
CPU      Sequential                 4.1 min   -         52.48x              1
         Pipelined Multi-Threaded   13 s      18.9x     993x                40
CPU-GPU  Simple GPU                 2.1 min   1.95x     102.46x             1
         Pipelined-GPU, 1 GPU       17.3 s    14.2x     746.28x             40
         Pipelined-GPU, 2 GPUs      9.7 s     25.36x    1331x               40
         Pipelined-GPU, 3 GPUs      8.3 s     29.6x     1555.5x             40

SLIDE 19

Motivation – Hybrid Pipeline Workflows

2018-03-28 GPU Technology Conference 19

• Performance gains from HTGS

[Figure: GPU profiler timelines comparing the simple GPU offload implementation with the HTGS implementation.]

SLIDE 20

Multi-GPU Stitching

2018-03-28 GPU Technology Conference 20

SLIDE 21

Matrix Multiplication (GEMM)

2018-03-28 GPU Technology Conference 21

• Well-known algorithm
  • "Hello World" of numerical computing
  • Naïve implementation: O(n³)
  • Component of BLAS
• With unlimited parallelism
  • Can compute all elements of C independently
• A × B = C
  • Memory & communication issues!
• Assumptions
  • In-core kernels available
  • Organize data as blocks
• Implementations
  • CPU only
  • CPU + GPU(s)

SLIDE 22

Matrix Multiplication (GEMM)

2018-03-28 GPU Technology Conference 22

A × B = C, where C_{r,c} = Σ_{i=1..k} A_{r,i} × B_{i,c}

CPU only (inner product):
  BK1 → LoadA / LoadB → BK2 → MatMul: A_{r,k} × B_{k,c} = C^k_{r,c} → BK3 → Accumulate: C^k_{r,c} += C^{k+1}_{r,c} → WriteC

CPU + k GPUs (outer product), one execution pipeline per GPU:
  BK1 → LoadA / LoadB → CopyA / CopyB → BK2 → MatMul: A_{r,k} × B_{k,c} = C^k_{r,c} → CopyC → BK3 → Accumulate → WriteC
  Memory manager edges: MemManagerA, MemManagerB, MemManagerC
  Outer product over panels A_{1,k} … A_{m,k} and B_{k,1} … B_{k,n}

Problem sizes: 16k² and 32k² matrices

A block-loop sketch of the CPU computation follows.
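The arithmetic the CPU graph performs can be written as a plain block loop: for each output block C_{r,c}, the MatMul task produces the partial products A_{r,i} × B_{i,c} and the Accumulate task sums them. The sketch below shows that accumulation for one block triple over row-major storage; the function name and layout are illustrative, not HTGS code.

#include <cstddef>

// C_{r,c} += A_{r,i} * B_{i,c} for one triple of nb x nb row-major blocks.
// In the HTGS graph, BK2 pairs up the loaded blocks, MatMul runs this kernel,
// and BK3 fires the accumulate/write once all k partial products have arrived.
void multiplyAccumulateBlock(const double *A, const double *B, double *C,
                             std::size_t nb) {
  for (std::size_t i = 0; i < nb; ++i)
    for (std::size_t p = 0; p < nb; ++p)
      for (std::size_t j = 0; j < nb; ++j)
        C[i * nb + j] += A[i * nb + p] * B[p * nb + j];
}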

SLIDE 23

HTGS MM on the IBM Minsky

2018-03-28 GPU Technology Conference 23

[Figure: 64k matrix multiplication, HTGS vs. cuBLAS-XT, time (s, 25-550) vs. block size (1024, 2048, 4096, 8192), for 1-4 GPUs each.]

Hardware: 2x 10-core IBM Power8+, 1 TB DDR3, 4x Tesla P100 w/ NVLink

SLIDE 24

HTGS MM on the IBM Minsky…

2018-03-28 GPU Technology Conference 24

[Figure: 64k matrix multiplication, double-precision TFlops (2-20) vs. block size (1024, 2048, 4096, 8192), HTGS vs. cuBLAS-XT for 1-4 GPUs each.]

IBM Minsky: Peak ~21 Tflops

SLIDE 25

4 GPU HTGS vs cuBLAS-XT 64k matrix size

2018-03-28 GPU Technology Conference 25

• Why the jump in performance between 4k and 8k block size?
• Let's visualize . . .

             Block size   Runtime (s)   TFlops
HTGS         4k           64.2915       8.76
cuBLAS-XT    4k           54.5407       10.32
HTGS         8k           31.1398       18.63
cuBLAS-XT    8k           42.5759       13.22

SLIDE 26

HTGS MM Visual Profile I

2018-03-28 GPU Technology Conference 26

[Figure: profiled HTGS task graph with four execution pipelines (one per GPU). Key compute times in Execution Pipeline0: CudaCopyInTask(MatrixA) 47.52 s, CudaCopyInTask(MatrixB) 46.47 s, MatrixMulBlkTask 62.44 s, CudaCopyOutTask(MatrixC) 62.41 s; shared tasks: MatrixAccumTask x40 5.59 s, OutputTask 13.53 s.]

• 64k x 64k matrices
• 4k x 4k blocks

SLIDE 27

HTGS MM Visual Profile II

2018-03-28 GPU Technology Conference 27

[Figure: the same profiled graph with 8k x 8k blocks. Key compute times in Execution Pipeline0: CudaCopyInTask(MatrixA) 16.46 s, CudaCopyInTask(MatrixB) 14.62 s, MatrixMulBlkTask 32.04 s, CudaCopyOutTask(MatrixC) 28.07 s; shared tasks: MatrixAccumTask x40 2.02 s, OutputTask 13.20 s.]

• 64k x 64k matrices
• 8k x 8k blocks

SLIDE 28

2018-03-28 GPU Technology Conference 28

Per-pipeline profile, 4k x 4k vs. 8k x 8k blocks:
• 4k x 4k blocks: CudaCopyInTask(MatrixA) 47.52 s, CudaCopyInTask(MatrixB) 46.47 s, MatrixMulBlkTask 62.44 s, CudaCopyOutTask(MatrixC) 62.41 s
• 8k x 8k blocks: CudaCopyInTask(MatrixA) 16.46 s, CudaCopyInTask(MatrixB) 14.62 s, MatrixMulBlkTask 32.04 s, CudaCopyOutTask(MatrixC) 28.07 s
SLIDE 29

LU Decomposition

2018-03-28 GPU Technology Conference 29

• LAPACK's GETRF
  • GEneral TRiangular Factorization
• Complex algorithm
  • The diagonal is a dependency critical path
  • Non-uniform computation: work decreases with traversal of the main diagonal
• No pivoting
• Assume in-core kernels available
• Implementations
  • CPU only: block; block+panel
  • CPU + GPU(s): block; block+panel; block+panel+window

SLIDE 30

Block LU Decomposition

• Recursive algorithm (a naive-kernel sketch follows below)
  • GETRF of the block on the diagonal
  • TRSM along the horizontal & vertical panels
  • Update the rest with GEMM

2018-03-28 GPU Technology Conference 30
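For reference, a self-contained sketch of the recursion on this slide, with naive triple-loop kernels standing in for the in-core GETRF/TRSM/GEMM kernels the talk assumes. It matches the slide's no-pivoting assumption; function names and the row-major layout are illustrative.

#include <algorithm>
#include <cstddef>
#include <vector>

using Matrix = std::vector<double>;   // row-major N x N matrix

// LU-factor the kb x kb diagonal block at (k,k) in place (unit-lower L, upper U).
static void factorDiagonal(Matrix &A, std::size_t N, std::size_t k, std::size_t kb) {
  for (std::size_t p = k; p < k + kb; ++p)
    for (std::size_t i = p + 1; i < k + kb; ++i) {
      A[i * N + p] /= A[p * N + p];                       // multiplier l_ip
      for (std::size_t j = p + 1; j < k + kb; ++j)
        A[i * N + j] -= A[i * N + p] * A[p * N + j];
    }
}

// Horizontal TRSM: U_kj = L_kk^{-1} * A_kj (forward substitution, unit lower).
static void factorRowPanel(Matrix &A, std::size_t N, std::size_t k, std::size_t kb,
                           std::size_t j, std::size_t jb) {
  for (std::size_t p = k; p < k + kb; ++p)
    for (std::size_t i = p + 1; i < k + kb; ++i)
      for (std::size_t c = j; c < j + jb; ++c)
        A[i * N + c] -= A[i * N + p] * A[p * N + c];
}

// Vertical TRSM: L_ik = A_ik * U_kk^{-1} (column-by-column substitution).
static void factorColPanel(Matrix &A, std::size_t N, std::size_t k, std::size_t kb,
                           std::size_t i, std::size_t ib) {
  for (std::size_t p = k; p < k + kb; ++p)
    for (std::size_t r = i; r < i + ib; ++r) {
      A[r * N + p] /= A[p * N + p];
      for (std::size_t c = p + 1; c < k + kb; ++c)
        A[r * N + c] -= A[r * N + p] * A[p * N + c];
    }
}

// GEMM update of one trailing block: A_ij -= L_ik * U_kj.
static void gemmUpdate(Matrix &A, std::size_t N, std::size_t k, std::size_t kb,
                       std::size_t i, std::size_t ib, std::size_t j, std::size_t jb) {
  for (std::size_t r = i; r < i + ib; ++r)
    for (std::size_t p = k; p < k + kb; ++p)
      for (std::size_t c = j; c < j + jb; ++c)
        A[r * N + c] -= A[r * N + p] * A[p * N + c];
}

// Right-looking block LU without pivoting: factor the diagonal block, solve the
// row/column panels, then GEMM-update the trailing submatrix and continue.
void blockLU(Matrix &A, std::size_t N, std::size_t nb) {
  for (std::size_t k = 0; k < N; k += nb) {
    std::size_t kb = std::min(nb, N - k);
    factorDiagonal(A, N, k, kb);                              // GETRF on the diagonal
    for (std::size_t j = k + kb; j < N; j += nb)              // horizontal TRSM
      factorRowPanel(A, N, k, kb, j, std::min(nb, N - j));
    for (std::size_t i = k + kb; i < N; i += nb)              // vertical TRSM
      factorColPanel(A, N, k, kb, i, std::min(nb, N - i));
    for (std::size_t i = k + kb; i < N; i += nb)              // trailing GEMM update
      for (std::size_t j = k + kb; j < N; j += nb)
        gemmUpdate(A, N, k, kb, i, std::min(nb, N - i), j, std::min(nb, N - j));
  }
}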

SLIDE 31

Block LU Decomposition Dataflow and Task Graph

2018-03-28 31

• Block LUD dataflow
• Block LUD task graph
• Three bookkeepers
  • BK1 produces data for the factor tasks once their blocks have been updated
  • Enables concurrent execution of Factor, Update, and GaussElim

Figure 1: Block LU decomposition dataflow
Figure 2: Block LU decomposition task graph

GPU Technology Conference

SLIDE 32

Block LU Decomposition CPU Results

2018-03-28 32

• Increasing the block size improves matrix multiplication utilization
• Gaussian elimination becomes the bottleneck
• How do we increase the computation per matrix multiplication while using a small block size for the Gaussian elimination?

GPU Technology Conference

SLIDE 33

Block 50k Unknowns -- 500 vs 5000 Block Sizes

2018-03-28 GPU Technology Conference 33

SLIDE 34

Block + Panel LU Decomposition

2018-03-28 34

• Use blocks for Gauss and Factor
• Panels for Update
• Common approach
  • Used in OpenBLAS, MAGMA, and PLASMA

GPU Technology Conference

SLIDE 35

Block + Panel LU Decomposition Task Graph

2018-03-28 35

• New graph collects blocks until a panel is formed
• BK2 produces for BK3 only when one of the lower diagonal panels has been factored
• BK4 distributes the panels back into blocks for BK1 and GaussElim

Figure 2: Block+Panel HTGS task graph

GPU Technology Conference

Figure 1: Block HTGS task graph

SLIDE 36

Block + Panel LU Decomposition CPU Results

2018-03-28 36 GPU Technology Conference

Block Approach Block + Panel Approach

SLIDE 37

Block+Panel 50k Unknowns -- 500 vs 5000 Block Sizes

2018-03-28 GPU Technology Conference 37

SLIDE 38

LU Decomposition on the GPU

2018-03-28 38

• Consider memory limits between CPU and GPU
• Block LUD
  • Gaussian elimination and Factor on the CPU
  • Update on the GPU
• Keep factored blocks in GPU memory until all computation is complete
• Blocks that still require update must be copied to/from the GPU multiple times

GPU Technology Conference

SLIDE 39

Block LU Decomposition GPU Task Graph

2018-03-28 39

• Similar to block LUD on the CPU
• Memory manager edges
  • MemManagerL keeps memory on the GPU until it has been used for the entire update
  • MemManagerU and MemManagerUpdate release memory each iteration
    • Requires re-copying blocks to/from the GPU

GPU Technology Conference

Figure 1: Block HTGS task graph
Figure 2: Block HTGS GPU task graph

SLIDE 40

Block LU Decomposition 1 GPU Results

2018-03-28 40 GPU Technology Conference

[Figure: block LU decomposition runtimes, CPU vs. GPU implementations.]

SLIDE 41

Block LU Decomposition on GPU Analysis

2018-03-28 41

• Poor GPU utilization
• Analyze performance using HTGS profiling
  • Copy-in task for the update dominates
  • Average max queue size ~10-20; copy-in-for-update max queue size 347
• How to improve data locality?

GPU Technology Conference

SLIDE 42

Sliding Window GPU LU Decomposition

[Diagram: Gauss → Factor → Update → Iterate]

• Use window for GEMM & horizontal TRSM (on GPUs)
• Keep panels in window in GPU memory until completely computed
• Round-robin for multiple GPUs
  • Window size x #GPUs

2018-03-28 GPU Technology Conference 42

SLIDE 43

LUD Sliding Window Task Graphs

2018-03-28 GPU Technology Conference 43

[Figure: CPU + GPU(s) sliding-window task graph with tasks GaussElim, FactorL, FactorU and Update, CopyInU, CopyInW, CopyInL, CopyOut, bookkeepers BK1-BK5, and memory managers MML, MMU, MMW.]

SLIDE 44

HTGS Block+Panel+Window GPU Results

2018-03-28 GPU Technology Conference 44

[Figure: runtimes for the Block+Panel CPU implementation vs. the GPU sliding-window implementation.]

• Uses a sliding window for updating panels of the matrix
• Keeps the window in GPU memory
• Panels outside the window are copied to/from the GPU

SLIDE 45

Visualizing 1 GPU versus 2 GPU

2018-03-28 GPU Technology Conference 45

SLIDE 46

WIP Applications Designed with HTGS

2018-03-28 46 GPU Technology Conference

SLIDE 47

Radio Astronomy RFI Mitigation

2018-03-28 GPU Technology Conference 47

• Richard Prestage (NRAO & WVU)
• Real-time data filtering with FPGAs and GPUs
• HTGS
  • Model decomposes the processing pipeline
• Acceleration
  • ~90x

SLIDE 48

HTGS and Data Streaming

2018-03-28 GPU Technology Conference 48

• Graph to process a block of data sent from an IP address
• Wrap in an execution pipeline (see the sketch below)
  • 1 pipeline per IP address
  • CPU core affinity when processing data from an IP
• 1 graph per computer
• Implementation coming soon . . .

[Figure: per-IP streaming pipeline StreamIPi → CopyIn → FFT → Processing → FFT⁻¹ → CopyOut → Write, with a MemManagerFFT memory edge.]
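A sketch, following the HTGS execution-pipeline tutorial, of the idea on this slide: build the per-block processing graph once and wrap it in an execution pipeline so that one copy runs per IP address. StreamBlockData, IpDecompositionRule, and buildStreamGraph() are placeholders, and the ExecutionPipeline constructor and addInputRule() call are paraphrased from the tutorials, so treat the details as assumptions.

// Per-block processing graph: CopyIn -> FFT -> Processing -> FFT^-1 -> CopyOut -> Write
htgs::TaskGraphConf<StreamBlockData, htgs::VoidData> *streamGraph = buildStreamGraph();

// One copy of streamGraph per IP address; a decomposition rule routes each
// incoming block to the pipeline that owns its source IP.
auto execPipeline = new htgs::ExecutionPipeline<StreamBlockData, htgs::VoidData>(
    numIpAddresses, streamGraph);
execPipeline->addInputRule(new IpDecompositionRule(numIpAddresses));

// Outer graph: data produced into mainGraph is consumed by the pipelines.
auto mainGraph = new htgs::TaskGraphConf<StreamBlockData, htgs::VoidData>();
mainGraph->setGraphConsumerTask(execPipeline);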

SLIDE 49

Computational Tomography

2018-03-28 GPU Technology Conference 49

Computational Tomography
• Zach Levine, Adele Peskin, Adam Pintar
• Iterative technique to predict materials through measurements
• Using new techniques to improve performance and accuracy
• Better physics integrated
  • Fixed force detection
  • Pre-conditioning
  • Scattering
  • …

SLIDE 50

Computational Tomography

2018-03-28 GPU Technology Conference 50

• HTGS implementation
  • Iterative
  • Forward problem
    • Compute each angle in parallel
    • Parallel reduction
  • Inverse problem
    • Uses L-BFGS-B, a limited-memory quasi-Newton code for bound-constrained optimization
• Incorporation of new physics and optimizations into tasks
  • Maintains parallelism and performance

[Figure: profiled tomography task graph: FwdCTTask x40 compute 6.85 s / wait 17.50 s, BfgsTask x1 compute 15.13 s / wait 9.23 s, AccumulateLike x10 compute 0.67 s, with a static LikeMem memory manager edge.]

SLIDE 51

Fast Image

51

• C++ library based on HTGS
• High-level API to access an image, or part of it
  • Only access the interesting views in an image
• Available upon request; eventually will be published on GitHub

[Diagram: end-user algorithm ↔ Fast Image (high-level image accessor, cache system, parallelization system via HTGS) ↔ tile loader; views are built from image tiles, and the algorithm computes with views.]

2018-03-28 GPU Technology Conference

SLIDE 52

Views and Tiles

52

Tile
• Part of the image (here 28x28 pixels) given by the image loader

View
• Center tile (here 4x4 pixels)
• Neighboring pixels within a radius (here 2 pixels)
  • Ghost values

Advantages (a small geometry sketch follows)
• Reduce memory footprint
• Enable tile caching

2018-03-28 GPU Technology Conference
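A small geometry sketch of the view described above: given the view (center-tile) size and the ghost radius, compute the pixel bounds of a view and clamp them at the image border. The names and the clamping policy are illustrative; Fast Image's actual API may differ.

#include <algorithm>
#include <cstddef>

struct ViewBounds { std::size_t x0, y0, x1, y1; };   // half-open pixel rectangle

// Pixel bounds of view (viewX, viewY) with its ghost ring, clamped to the image.
ViewBounds viewWithGhost(std::size_t viewX, std::size_t viewY,
                         std::size_t viewSize, std::size_t radius,
                         std::size_t imgW, std::size_t imgH) {
  std::size_t cx = viewX * viewSize, cy = viewY * viewSize;   // center-tile origin
  ViewBounds b;
  b.x0 = cx >= radius ? cx - radius : 0;
  b.y0 = cy >= radius ? cy - radius : 0;
  b.x1 = std::min(cx + viewSize + radius, imgW);
  b.y1 = std::min(cy + viewSize + radius, imgH);
  return b;
}
// Away from the border, a 4x4 center tile with radius 2 yields an 8x8 view,
// matching the example on this slide.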

SLIDE 53

Result overview

53

                              Statistics        Convolution      Connectivity
Execution time speedup        12.63x – 30.38x   1.55x – 2.83x    5.78x – 17.60x
Memory footprint reduction    9.80x – 82.85x    3.22x – 35.97x   9.94x – 22.15x

2018-03-28 GPU Technology Conference

SLIDE 54

Inferencing on Large Microscopy Images

2018-03-28 GPU Technology Conference 54

[Figure: inference task graph driven by graph->produceData(tile) and fastImage->requestAllTiles(); tasks include Bookkeeper, QA Blur x20, QA Intensity, Inference, Point Cloud x10, Build Pyramid, Progress SQL, Detection SQL, ROI Builder, ROI SQL, SQL Query, and Write Thumbnails; input is a 100,000 x 60,000 px image (~15 GB @ 24 bpp).]

SLIDE 55

Accelerated Deep Learning Framework for Biomedical Image Segmentation

2018-03-28 GPU Technology Conference 55

• Collaboration with UMBC
  • Dorsa Ziaei
• Build the model in your favorite framework
• Abstractions in HTGS to define input/output
• Load the model and train
• Built-in augmentation methods
  • Blur
  • Rotation
  • …
• Simplify multi-GPU training
  • Possibly cluster training

[Diagram: Fast Image → Bookkeeper → Augmentation → Training]

SLIDE 56

Lessons Learned and Future

2018-03-28 GPU Technology Conference 56

SLIDE 57

HTGS Software Engineering

57

• HTGS tasks
  • Specifications, requirements, interfaces
  • Well-defined inputs and outputs
  • + Performance specifications
• Mapped tasks to individual contributors
• Performance requirements
  • Clear critical path
  • Consume data based on input requirements
  • Produce data based on output requirements
  • Maintain parallelism

2018-03-28 GPU Technology Conference

SLIDE 58

HTGS Profiling

58

• Zero-overhead profiling
  • Profiling is gathered in both Release and Debug builds
• Graph visualization report generated after every run
  • Immediately identify performance impacts per task
  • Customize task profiling to obtain more details
• Optimize per task
  • Find alternative methods

2018-03-28 GPU Technology Conference

SLIDE 59

HTGS Adapting

59

• Scaling
  • Modify threading
  • Multi-GPU inferencing
• Add/remove task edges

2018-03-28 GPU Technology Conference

SLIDE 60

Future of HTGS

• New profiling visualization
  • Gantt chart
  • Integration into the NVIDIA profiler (?)
  • Time selector + generated bar graphs
• HTGS project generator
  • Graphical interface to draw the graph
  • Export as a ZIP file
  • Compile and execute
  • Fill in the task functionality
• Ultimate goal (the dream)
  • Standardized pre-defined tasks with compatible inputs/outputs
  • Build an entire parallel program from a GUI

[Figure: Gantt-chart profiling mock-up, tasks (Image Load, Bookkeeper, Blur, Intensity, DetectA, DetectB, Build Pyramid, Progress SQL, DetectionSQL, ROI Builder, ROI SQL, SQL Query, Write Thumbnails) vs. time (ms).]

2018-03-28 GPU Technology Conference

SLIDE 61

Conclusions

2018-03-28 GPU Technology Conference 61

• Task graph representation persists throughout execution
  • Guides the developer in algorithm analysis
  • Analysis maps to implementation (and back)
  • Localizes performance bottlenecks
• Complexity is important
  • Runtime
  • Space
  • Communication
  • Memory operations
• Experimentation for Performance

SLIDE 62

Thank You

2018-03-28 GPU Technology Conference 62

Questions?

Email: timothy.blattner@nist.gov
Landing page: https://pages.nist.gov/HTGS
Main repo: https://github.com/usnistgov/HTGS
Tutorials: https://github.com/usnistgov/HTGS-Tutorials