SLIDE 1

A CnC-driven Implementation of Medical Imaging Algorithms on Heterogeneous Processors

Yi Zou*, Zoran Budimlić+, Alina Sbîrlea+, Sağnak Taşırlar+, Vivek Sarkar+

*University of California at Los Angeles +Rice University

SLIDE 2

Outline

♦ Domain-Specific Computation
♦ Medical Imaging Pipeline
♦ CnC Model of the Medical Imaging Pipeline
♦ Locality and Heterogeneity: Hierarchical Place Trees
♦ Experimental Results and Conclusions

SLIDE 3

Customizable Heterogeneous Platform (CHP)

[Figure: a CHP combines fixed cores, custom cores, programmable fabric, caches ($), DRAM, and I/O, with multiple CHPs connected through a reconfigurable RF-I bus, a reconfigurable optical bus, transceivers/receivers, and an optical interface]

♦ Domain-specific modeling (healthcare applications)
♦ CHP mapping: source-to-source CHP mapper, reconfiguring and optimizing backend, adaptive runtime
♦ CHP creation: customizable computing engines, customizable interconnects, architecture modeling, customization settings
♦ Design once (configure), invoke many times (customize)

SLIDE 4

Heterogeneous Server Testbed (HC-1 Architecture)

♦ Xeon dual-core LV5138: 35 W TDP
♦ Tesla C1060: 100 GB/s off-chip bandwidth, 200 W TDP
♦ 4x XC5LX330 FPGAs: 80 GB/s off-chip bandwidth, 90 W design power

SLIDE 5

Outline

♦ Domain-Specific Computation
♦ Medical Imaging Pipeline
♦ CnC Model of the Medical Imaging Pipeline
♦ Locality and Heterogeneity: Hierarchical Place Trees
♦ Experimental Results and Conclusions

SLIDE 6

Case Study: Medical Imaging Pipeline

♦ Medical image processing pipeline

§ Covering all imaging tasks: reconstruction, denoising/deblurring, registration, segmentation, and analysis
§ Each task can involve the use of different algorithms, dependent on the data and disease domain
§ Initially targeting automated volumetric tumor assessment for cancer

♦ Base sequential pipeline

§ C/C++ with a common data API to wrap each algorithm; the API handles image and parameter passing and result output (a sketch follows below)
§ The Java Native Interface (JNI) is used to execute the pipeline from an image viewing application

Pipeline: Raw data acquisition → Reconstruction → Image restoration (denoising, deblurring) → Registration → Segmentation → Analysis
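The common data API itself is not shown in the deck; the following is a minimal sketch, assuming hypothetical type and function names (image_t, params_t, stage_fn, rician_denoise), of what such a per-algorithm wrapper could look like in C:

    /* Hypothetical sketch of the common data API; the real type and function
     * names used in the pipeline are not shown in the deck. */
    typedef struct {
        float *voxels;                 /* volume data */
        int    nx, ny, nz;             /* volume dimensions */
    } image_t;

    typedef struct {
        int    max_iters;              /* e.g., registration/segmentation iterations */
        double lambda;                 /* algorithm-specific regularization parameter */
    } params_t;

    /* Every stage (denoise, registration, segmentation, ...) is wrapped behind the
     * same signature, so the C/C++ driver and the JNI layer can invoke any algorithm
     * uniformly: the wrapper handles image/parameter passing and result output. */
    typedef int (*stage_fn)(const image_t *in, const params_t *p, image_t *out);

    /* Hypothetical denoising routine provided by one of the algorithm libraries. */
    int rician_denoise(const float *in, float *out, int nx, int ny, int nz,
                       double lambda, int max_iters);

    /* Example wrapper that adapts the routine above to the common signature. */
    int denoise_stage(const image_t *in, const params_t *p, image_t *out) {
        return rician_denoise(in->voxels, out->voxels,
                              in->nx, in->ny, in->nz, p->lambda, p->max_iters);
    }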

SLIDE 7

Pipeline Algorithms

Pipeline: Raw data acquisition → Reconstruction → Image restoration (denoising, deblurring) → Registration → Segmentation → Analysis

Algorithm | Language(s) | Platform(s)
CoSAMP | MatLab | Single-thread
IHT | MatLab | Single-thread
EM+TV | MatLab, C++ | Single/multi-thread
SART+TV | MatLab | Single-thread
Rician denoising | MatLab, C, CnC, Cuda | Single-thread, GPU, FPGA
Poisson denoising | MatLab, C | Single-thread
Poisson/Rician denoising and deblurring | MatLab, C, Cuda | Single-thread, GPU, FPGA
Fluid (non-rigid) registration | C++, CnC | Single/multi-thread, GPU, FPGA
Geodesic active contours | C++ | Single-thread
Two-phase active contours | C++, CnC | Single/multi-thread, GPU, FPGA

SLIDE 8

[Figure: the imaging pipeline: Raw data acquisition → Reconstruction → Image restoration (denoising, deblurring) → Registration → Segmentation → Analysis]

SLIDE 9

Outline

♦ Domain-Specific Computation
♦ Medical Imaging Pipeline
♦ CnC Model of the Medical Imaging Pipeline
♦ Locality and Heterogeneity: Hierarchical Place Trees
♦ Experimental Results and Conclusions

SLIDE 10

Toolchain

♦ CnC-HC (application modeling)
♦ Multi-core parallelism using Habanero-C
♦ GPU programming: CUDA tasks called from Habanero-C
♦ FPGA design using AutoPilot; FPGA tasks called from Habanero-C
♦ Habanero-C runtime using Hierarchical Place Trees

SLIDE 11

Why CnC for Modeling?

♦ Specify only the semantic ordering requirements
  § Easier, and depends only on the application
  § Separation of concerns
♦ Application modeling is similar to drawing on a whiteboard
♦ Reuse the CnC model for mapping

SLIDE 12

Coarse-Grained CnC Graph for the Image Pipeline
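The coarse-grained graph appears in the deck only as a figure. As a rough illustration, a pipeline like the one above could be written in CnC-style textual notation along the following lines (the step, item, and tag names are hypothetical, and the exact CnC-HC syntax may differ):

    // Hypothetical coarse-grained CnC graph for the imaging pipeline.
    [rawImage]; [denoised]; [registered]; [segmented];   // item collections, one entry per image
    <imageTag>;                                          // one tag instance per image

    env -> [rawImage], <imageTag>;                       // the environment supplies inputs and tags
    <imageTag> :: (denoise), (register), (segment);      // prescription: one step instance per image

    [rawImage]   -> (denoise)  -> [denoised];            // producer/consumer relations
    [denoised]   -> (register) -> [registered];
    [registered] -> (segment)  -> [segmented];

    [segmented] -> env;                                  // final results returned to the environment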

SLIDE 13

Lessons Learned: Registration and Segmentation

♦ CnC is great for coarse-grained modeling
♦ Hierarchy would help a lot in the modeling phase
  § Right now, we have multiple versions of the same CnC code
♦ Memory management is an issue
  § Still have to resort to "cheating" (violating DSA)
  § Relatively simple problem; get counts and/or DSA space folding would solve it
♦ Habanero-C is still a more "natural" choice for expressing fine-grained, regular parallelism
  § Parallel loops inside CnC steps are implemented in HC

SLIDE 14

Fine-Grained CnC Graph for the 3D Denoise

SLIDE 15

Lessons Learned: Rician Denoising

♦ Lack of reductions
  § Convergence checking is an AND-reduction that is hardcoded
♦ Non-native iteration-space description
  § 2D tiling increases tuple sizes to 5
  § Non-intuitive coding of the time dimension
♦ Tag-function restrictions for data-driven execution
  § 5-stencil computation needs padding if the step code doesn't change
  § Or every base condition has to be a separate step implementation

SLIDE 16

Outline

♦ Domain-Specific Computation
♦ Medical Imaging Pipeline
♦ CnC Model of the Medical Imaging Pipeline
♦ Locality and Heterogeneity: Hierarchical Place Trees
♦ Experimental Results and Conclusions

SLIDE 17

Implementing Application Steps using Habanero-C

♦ Extension of the C language with support for async-finish lightweight task parallelism (a minimal sketch follows this list)
  § The principle is similar to X10 and Habanero Java
  § Lower-level compared to CnC
    • CnC does dependency tracking; HC requires manual dependency control between async tasks
  § More suitable for loop-level parallelism with in-place updates
  § Coprocessor invocation can also be done from HC
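As an illustration of this async-finish style, here is a minimal sketch of loop-level parallelism inside a step, assuming a hypothetical kernel denoise_slice() and following the async/finish/IN syntax used in the examples later in this deck:

    /* Hypothetical slice-level kernel; not from the deck. */
    void denoise_slice(float *in, float *out, int nx, int ny, int z);

    void denoise_volume(float *in, float *out, int nx, int ny, int nz) {
        finish {                                  /* waits for all asyncs spawned inside */
            for (int z = 0; z < nz; z++) {
                async IN(in, out, nx, ny, z) {    /* one lightweight task per z-slice */
                    denoise_slice(in, out, nx, ny, z);
                }
            }
        }                                         /* every slice has been denoised here */
    }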

SLIDE 18

Hierarchical Place Trees (HPT)

♦ Past approaches
  § Flat single-level partition, e.g., HPF, PGAS
  § Hierarchical memory model with static parallelism, e.g., Sequoia
♦ HPT approach
  § Hierarchical memory + dynamic parallelism
♦ A place represents a memory hierarchy level
  § Cache, SDRAM, device memory, …
♦ Leaf places include worker threads
  § e.g., W0, W1, W2, W3
♦ Places can be used for CPUs and accelerators
♦ Multiple HPT configurations
  § For the same hardware and programs
  § Trade-off between locality and load balance

"Hierarchical Place Trees: A Portable Abstraction for Task Parallelism and Data Movement", Y. Yan et al., LCPC 2009

[Figure: three different HPTs for a quad-core processor]

SLIDE 19

Locality-aware Scheduling using the HPT

♦ Workers are attached to leaf places
  § Bound to a hardware core
♦ Each place has a queue
  § async at(pl) <stmt>: pushes the task onto pl's queue
♦ A worker executes tasks from its ancestor places
  • W0 executes tasks from PL3, PL1, PL0
♦ Tasks in a place queue can be executed by all workers in the place's subtree
  • A task in PL2 can be executed by workers W2 or W3

[Figure: example HPT with root PL0, intermediate places PL1 and PL2, and leaf places PL3–PL6 hosting workers w0–w3]

SLIDE 20

Adding Heterogeneity to HPT

[Figure: heterogeneous HPT with places PL0–PL8 covering physical memory, cache, GPU memory, and reconfigurable FPGA memory, with implicit and explicit data movement; W0–W3 are CPU computation workers and W4, W5 are device agent workers]

♦ Devices (GPU or FPGA) are represented as memory-module places and agent workers
  § GPU memory configurations are fixed, while FPGA memory is reconfigurable at runtime
♦ Explicit data transfer between main memory and device memory
  § The programmer may still enjoy implicit data copy between them
♦ Device agent workers
  § Perform asynchronous data copy and task launching for the device
  § Lightweight, event-based, and time-sharing with the CPU

SLIDE 21

Hybrid scheduling

♦ A device place has two HC (half-concurrent) mailboxes: an inbox (green) and an outbox (red)
  § No locks, so highly efficient
♦ The inbox maintains asynchronous device tasks (with IN/OUT)
  § Concurrent enqueuing of device tasks by CPU workers at the tail
  § Sequential dequeuing of tasks by device agent workers from the head
♦ The outbox maintains continuations of the finish scopes of tasks
  § Sequential enqueuing of continuations by agent workers
  § Concurrent dequeuing (stealing) by CPU workers

[Figure: device place PL7 with agent worker W4; device tasks created by CPU workers via async at(gpl) IN(...) OUT(...) { … } enter the inbox at the tail and are drained from the head; continuations in the outbox are stolen by CPU workers]
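The mailbox data structure itself is not shown in the deck. Below is a minimal sketch of a bounded "concurrent tail, sequential head" inbox using C11 atomics (a simplified Vyukov-style sequence-number ring buffer; the actual Habanero-C mailboxes may be implemented differently):

    #include <stdatomic.h>
    #include <stddef.h>

    #define MBOX_CAP 1024u                       /* capacity, a power of two */

    typedef struct {
        void         *slot[MBOX_CAP];
        atomic_size_t seq[MBOX_CAP];             /* per-slot sequence number */
        atomic_size_t tail;                      /* claimed concurrently by producers */
        size_t        head;                      /* owned by the single agent worker */
    } mailbox_t;

    void mbox_init(mailbox_t *m) {
        for (size_t i = 0; i < MBOX_CAP; i++) atomic_init(&m->seq[i], i);
        atomic_init(&m->tail, 0);
        m->head = 0;
    }

    /* Concurrent enqueue: any CPU worker may call this (tail side). */
    void mbox_put(mailbox_t *m, void *task) {
        size_t t = atomic_fetch_add(&m->tail, 1);     /* claim a slot index */
        size_t i = t & (MBOX_CAP - 1);
        while (atomic_load(&m->seq[i]) != t)          /* wait until the slot is reusable */
            ;                                         /* (a real runtime would back off) */
        m->slot[i] = task;
        atomic_store(&m->seq[i], t + 1);              /* publish the task */
    }

    /* Sequential dequeue: only the device agent worker calls this (head side). */
    void *mbox_get(mailbox_t *m) {
        size_t i = m->head & (MBOX_CAP - 1);
        if (atomic_load(&m->seq[i]) != m->head + 1)   /* nothing published yet: empty */
            return NULL;
        void *task = m->slot[i];
        atomic_store(&m->seq[i], m->head + MBOX_CAP); /* mark the slot reusable */
        m->head++;
        return task;
    }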

SLIDE 22

Asynchronous data copy and task execution

♦ Each device task has three asynchronous stages (see the sketch after this list)
  § Data copy-in, task launching, data copy-out
  § All three can overlap across different tasks; data copies use hardware DMAs
♦ Lightweight, event-based agent workers
  § No blocking on any of the three stages
  § Zero contention when accessing both the inbox and the outbox
♦ Can be implemented in hardware!

[Figure: device task lifecycle on the device (e.g., GPU or FPGA): async IN, IN-finish event, async task launch, task-complete event, async OUT, OUT-complete event, then a possible continuation; the agent worker W4 drives the inbox and outbox]
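To make the three-stage overlap concrete, here is a sketch using the CUDA runtime API from C; the kernel wrapper launch_denoise_kernel() and the buffer names are hypothetical, and the host buffers are assumed to be pinned (cudaHostAlloc) so the copies are truly asynchronous:

    #include <cuda_runtime.h>

    /* Hypothetical host-side wrapper around a denoising kernel; not from the deck. */
    void launch_denoise_kernel(float *d_in, float *d_out, int n, cudaStream_t s);

    /* Issue one device task: copy-in, kernel launch, and copy-out are all queued
     * asynchronously on one stream. The agent worker records an event and can
     * later poll it with cudaEventQuery() instead of blocking, moving on to
     * other inbox tasks in the meantime. */
    void issue_device_task(const float *h_in, float *h_out,
                           float *d_in, float *d_out, int n,
                           cudaStream_t s, cudaEvent_t done) {
        size_t nbytes = (size_t)n * sizeof(float);
        cudaMemcpyAsync(d_in, h_in, nbytes, cudaMemcpyHostToDevice, s);   /* copy-in    */
        launch_denoise_kernel(d_in, d_out, n, s);                         /* task launch */
        cudaMemcpyAsync(h_out, d_out, nbytes, cudaMemcpyDeviceToHost, s); /* copy-out   */
        cudaEventRecord(done, s);                    /* completion event; no blocking */
    }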

SLIDE 23

Cross-Platform Work Stealing

♦ Steps are compiled for execution on CPU, GPU, or FPGA
  § Same-source, multiple-target compilation in the future
♦ The device inbox is now a concurrent queue, and tasks can be stolen by CPU or other device workers
  § Multi-tasks, range stealing, and range merging in the future

[Figure: device place PL7 with agent worker W4; device tasks created by CPU workers via async at(gpl) IN(...) OUT(...) { … } can now be stolen from the inbox by CPU and other device workers, while continuations are still stolen from the outbox by CPU workers]
SLIDE 24

Support for heterogeneous execution in CnC

♦ GPU steps can be specified in CnC using a {step} syntax
  § The translator generates the appropriate async at(gpu_place) calls (a sketch follows this list)
♦ Ranges (t1..t2) are useful for specifying sets of steps to be executed on the GPU
♦ CnC-HC requires the tags of input items to be a function of the step tags
  § This simplifies scheduling, since we only create a device task when all of its input data is available
♦ A similar approach can be used for FPGA step specification
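The generated code is not shown on this slide; based on the compiler and coprocessor-invocation examples later in this deck, the translator output for a GPU step presumably looks roughly like the sketch below (GPU_PLACE, denoise_gpu(), and the item names are hypothetical; Slide 31 shows the same pattern with FPGA_PLACE):

    /* Hypothetical sketch of translator output for a GPU-mapped CnC step over
     * the tag range (t1..t2). */
    void run_denoise_steps_on_gpu(float **noisyTile, float **denoisedTile,
                                  int t1, int t2, int m, int n, int p) {
        place_t **gpu_pls = (place_t **) malloc(sizeof(place_t *));
        hc_get_places(gpu_pls, GPU_PLACE);
        finish {
            place_t *gpl = gpu_pls[0];
            for (int t = t1; t <= t2; t++) {
                /* In CnC-HC the step for tag t becomes ready only once all of its
                 * input items exist; the availability check is omitted here. */
                async at (gpl) IN(noisyTile, denoisedTile, t, m, n, p) {
                    denoise_gpu(noisyTile[t], denoisedTile[t], m, n, p);
                }
            }
        }
    }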

SLIDE 25

Outline

♦ Domain-Specific Computation
♦ Medical Imaging Pipeline
♦ CnC Model of the Medical Imaging Pipeline
♦ Locality and Heterogeneity: Hierarchical Place Trees
♦ Experimental Results and Conclusions

SLIDE 26

Experimental Results

Pipeline: Denoise → Registration (200 iterations) → Segmentation (100 iterations)
Multiple images (4 images)

SLIDE 27

Conclusions and Future Work

♦ CnC is very suitable for domain-specific application modeling
  § Hierarchy, reductions, and a better iteration-space description would make it even better
♦ With an efficient runtime and translator implementation, CnC can lead to a very efficient application mapping onto heterogeneous, customizable platforms
  § Cross-platform work stealing, load balancing vs. data movement, and memory management remain as future work

SLIDE 28

Habanero-C Compiler

• Habanero-C compiler (source-to-source)
  – AST nodes and parser
  – Traversal pass: canonicalization of function calls
  – Traversal pass: mark suspendable functions
  – Traversal pass: build function frames (optimization)
  – Transformation: async, finish, and suspendable functions
• Example Habanero-C program:

    finish {
        /* launch GPU partitions on ngpus GPUs */
        for (i = 0; i < ngpus; i++) {
            async at (gpu_pls[i]) in(A_part, B_part, part_size)
                                  out(C_part, part_size) {
                vecadd_gpu(A_part, B_part, C_part, part_size);  // C[*] = A[*] + B[*]
            }
        } // for
        . . .
    } // finish

[Figure: HC code → Habanero-C compiler → C code with calls to the HC runtime → HC runtime system]

SLIDE 29

Habanero-C runtime: Scheduling Paradigms

• Work-sharing (X10 v1.5, OpenMP, …)
  • A busy worker re-distributes tasks eagerly
  • Global thread/task/team queue
  • Access to the global queue needs to be synchronized: a scalability bottleneck
• Work-stealing (Cilk, TBB, …)
  • Distributed task pools: each worker has a local double-ended queue (deque); a simplified deque sketch follows this list
  • An idle worker steals tasks from busy workers
  • A busy worker pays little overhead just to enable stealing
  • Better scalability

[Figure: work-sharing, with workers w1–w4 putting and getting tasks from a single global queue, vs. work-stealing, with each worker pushing and popping tasks at the tail of its own deque while idle workers steal from the head]
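The deque itself is not shown in the deck; below is a deliberately simplified sketch in C of the push/pop/steal operations (a single mutex keeps it short and overflow handling is omitted, whereas production runtimes such as Cilk and Habanero-C use lock-free Chase-Lev deques so the owner's push/pop path stays nearly free):

    #include <pthread.h>
    #include <stddef.h>

    #define DEQ_CAP 4096

    /* Simplified work-stealing deque: the owning worker pushes and pops at the
     * tail; idle workers steal from the head. */
    typedef struct {
        void           *task[DEQ_CAP];
        size_t          head, tail;    /* tasks live in [head, tail) */
        pthread_mutex_t lock;
    } deque_t;

    void deq_init(deque_t *d) {
        d->head = d->tail = 0;
        pthread_mutex_init(&d->lock, NULL);
    }

    void deq_push(deque_t *d, void *t) {           /* owner only */
        pthread_mutex_lock(&d->lock);
        d->task[d->tail % DEQ_CAP] = t;
        d->tail++;
        pthread_mutex_unlock(&d->lock);
    }

    void *deq_pop(deque_t *d) {                    /* owner only: newest task first */
        void *t = NULL;
        pthread_mutex_lock(&d->lock);
        if (d->tail > d->head) t = d->task[--d->tail % DEQ_CAP];
        pthread_mutex_unlock(&d->lock);
        return t;
    }

    void *deq_steal(deque_t *d) {                  /* any idle worker: oldest task first */
        void *t = NULL;
        pthread_mutex_lock(&d->lock);
        if (d->tail > d->head) t = d->task[d->head++ % DEQ_CAP];
        pthread_mutex_unlock(&d->lock);
        return t;
    }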

SLIDE 30

CHP Modeling Role

[Figure: CHP modeling role. The programmer writes domain-specific applications with abstract execution against a domain-specific programming model (a domain-specific coordination graph plus domain-specific language extensions). A source-to-source CHP mapper, driven by application characteristics and CHP architecture models, produces C/C++ code and analysis annotations for a C/C++ front-end and a reconfiguring and optimizing back-end, which emit binary code for fixed and customized cores, customized target code, and RTL for the programmable fabric (RTL synthesizer xPilot, from a C/SystemC behavioral spec). An adaptive runtime with lightweight threads and adaptive configuration runs on CHP architectural prototypes (CHP hardware testbeds, CHP simulation testbed, full CHP) and provides performance feedback.]

SLIDE 31

Coprocessor Invocation with Multi-Threading

♦ Problem: the coprocessor may not allow two simultaneous coprocessor calls
  § Use HC places (and the HPT) to enqueue coprocessor calls on a queue that is dedicated to each coprocessor
♦ Example: a simple pipeline where denoise is mapped to the CPU, registration is mapped to the FPGA, and segmentation is mapped to the GPU

    place_t **fpga_pls = (place_t **) malloc(sizeof(place_t *));
    hc_get_places(fpga_pls, FPGA_PLACE);
    finish {
        place_t *pl = fpga_pls[0];
        async(pl) IN(denoisedT0, S1, interpT_float_h, m, n, p) {
            REG_fpga(denoisedT0, S1, interpT_float_h, m, n, p);
        }
    }