
SLIDE 1

11/4/14

Vectorization and Mapping of Software Defined Radio Applications on GPU Platforms

GPU Summit at the UMD/NVIDIA CUDA Center for Excellence, College Park MD, October 27, 2014
Shuvra S. Bhattacharyya
Professor, ECE and UMIACS
University of Maryland at College Park
ssb@umd.edu, http://www.ece.umd.edu/~ssb
With Contributions from G. Zaki, W. Plishker, C. Clancy, and J. Kuykendall

Outline

Introduction
Contribution: Novel Vectorization and Mapping Workflow.

Evaluation
Summary


DSPCAD Methodologies: Computer-Aided Design (CAD) for Digital Signal Processing (DSP) Systems

Platforms: FPGA, programmable DSP, GPU, microcontroller

Applications and tasks [Bhattacharyya 2013]:

– Image (medical, computer vision, feature detection, etc.): imaging device, data preprocessing, image reconstruction, post reconstruction, advanced image analysis, image visualization
– Video (coding, compression, etc.): color processing, prediction, transformation & quantization, entropy coding
– Audio (sample rate conversion, speech, etc.): audio device, data preprocessing, feature extraction, data postprocessing
– Wireless communication systems: source encoding, channel encoding, digital modulation, D/A conversion, RF back-end


Motivation

Diversity of platforms:

– ASIC, FPGA, DSPs, GPUs, GPP

Complex application environments

– GNU Radio

Exposing parallelism

– Task, Data and Pipeline

Difficult Mapping Problem
Multi-objective (throughput, Latency)


Background: GNU Radio

A software development framework that provides software defined radio (SDR) developers a rich library and a customized runtime engine to design and test radio applications [Blossom 2004]


DSP-oriented Dataflow Models of Computation

Application is modeled as a directed graph

– Nodes (actors) represent functions of arbitrary complexity
– Edges represent communication channels between functions
– Nodes produce and consume data from edges
– Edges buffer data (logically) in a FIFO (first-in, first-out) fashion

Data-driven execution model

– An actor can execute whenever it has sufficient data on its input edges.
– The order in which actors execute is not part of the specification.
– The order is typically determined by the compiler, the hardware, or both.

Iterative execution

– Body of loop to be iterated a large or infinite number of times
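The firing rule above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: the `Edge` class, the helper names, and the two example actors (`scale`, `add_pairs`) are all hypothetical. Each edge is a FIFO, and an actor fires only when every input edge holds at least as many tokens as the actor consumes.

```python
from collections import deque


class Edge:
    """A FIFO buffer between two actors (hypothetical helper)."""

    def __init__(self, initial_tokens=()):
        self.fifo = deque(initial_tokens)


def can_fire(inputs):
    """inputs: list of (edge, consumption_rate) pairs."""
    return all(len(edge.fifo) >= rate for edge, rate in inputs)


def fire(func, inputs, outputs):
    """Consume tokens from the inputs, apply func, push results to the outputs.

    func takes one list of tokens per input edge and returns one list of
    tokens per output edge.
    """
    consumed = [[edge.fifo.popleft() for _ in range(rate)] for edge, rate in inputs]
    produced = func(*consumed)
    for edge, tokens in zip(outputs, produced):
        edge.fifo.extend(tokens)


# Hypothetical example: a scaler (consumes 1, produces 1) feeding a
# pairwise adder (consumes 2, produces 1).
e_in, e_mid, e_out = Edge([1, 2, 3, 4]), Edge(), Edge()
scale = lambda xs: ([2 * x for x in xs],)
add_pairs = lambda xs: ([xs[0] + xs[1]],)

actors = [
    (scale, [(e_in, 1)], [e_mid]),
    (add_pairs, [(e_mid, 2)], [e_out]),
]

# Data-driven loop: fire any actor whose inputs hold enough tokens.
# The firing order is decided here, not by the graph specification.
progress = True
while progress:
    progress = False
    for func, inputs, outputs in actors:
        if can_fire(inputs):
            fire(func, inputs, outputs)
            progress = True

result = list(e_out.fifo)
print(result)
```

The loop stops once no actor has sufficient input data; a real runtime would instead keep iterating as new samples arrive.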


DSP-oriented Dataflow Graphs

Vertices (actors) represent computational modules
Edges represent FIFO buffers
Edges may have delays, implemented as initial tokens
Tokens are produced and consumed on edges
Different models have different rules for production (SDF → fixed, CSDF → periodic, BDF → dynamic)

[Figure: dataflow graph with actors X, Y, and Z, edges e1 and e2, production rates p1,i and p2,i, consumption rates c1,i and c2,i, and an annotation of 5]


Dataflow Production and Consumption Rates

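[Figure: dataflow graph with actors X, Y, and Z, edges e1 and e2, production rates p1,i and p2,i, consumption rates c1,i and c2,i, and an annotation of 5]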


Dataflow Graph Scheduling

Assigning actors to processors, and ordering actor subsets that share common processors

Here, a “processor” means a hardware resource for actor execution on which assigned actors are time-multiplexed

Scheduling objectives include

– Exploiting parallelism
– Buffer management
– Minimizing power/energy consumption


Background: Contemporary Architectures

Graphics Processing Units (GPUs)
Vector Operations in General Purpose Processors (GPPs)

Primary Contribution

A novel workflow for scheduling SDF graphs while taking into account

– Actor execution times.
– Efficient vectorization.
– Heterogeneous multiprocessors.

Demonstration system

– Applications described in a domain specific language.
– Systematic integration of precompiled libraries.
– Targeted to architectures consisting of GPPs and GPUs.


Previous Work

Automatic SIMDization [Hormati 2010]
– Based on the StreamIt compiler.
Hierarchical models for SDR [Lin 2007]
– Targeted towards special architectures.
Multi-processor scheduling [Stuijk 2007]
– Formulation towards special objectives.
Vectorization [Ritz 1992]
– Single processor block processing optimization.


DIF-GR-GPU Workflow

Start from a model-based application description.
Use tools to optimize scheduling, assignment.
Generate an accelerated system.

[Figure: DIF-GR-GPU workflow diagram. Labels: Application Graph (Data Flow Graph; Throughput, Latency Constraints), Dataflow Scheduler, Multiprocessor Scheduler, Mapping and Ordering Schedule, GNU Radio Engine, Final Implementation, Platform Description, Actor Profiles, Library of Actor Implementations (GNU Radio)]


Workflow Goals

Make adequate use of all sources of parallelism in order to utilize the underlying architecture.

Sources of parallelism:

– Data Parallelism (production and consumption rates in SDF)
– Task Parallelism (implicit in DFG)
– Pipeline Parallelism (Looped schedules)

[Figure: two example graphs: actors A and B joined by an edge labeled 100 and 1; actors A, B, and C]


SDF Scheduling Preliminaries

An SDF graph G = (V,E) has a valid (periodic) schedule if it is deadlock-free and is sample rate consistent (i.e., it has a periodic schedule that fires each actor at least once and produces no net change in the number of tokens on each edge).

For each actor v in a consistent SDF graph, there is a unique repetition count q(v), which gives the number of times that v must be executed in a minimal valid schedule.

Example: actor A produces 2 tokens per firing on the edge to actor B, and B consumes 3 tokens per firing. Balancing the edge requires 2 × q(A) = 3 × q(B), so q(A) = 3 and q(B) = 2.

Some Possible Schedules: (1) AABAB (2) AAA BB
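The repetition counts can be derived mechanically from the balance equations. The sketch below is illustrative only: the function names are hypothetical, and it assumes a connected graph with no initial tokens (delays). It computes q for the two-actor example and checks candidate schedules by simulating token counts on each edge.

```python
from fractions import Fraction
from functools import reduce
from math import gcd


def repetition_vector(actors, edges):
    """edges: list of (src, dst, production, consumption) tuples.

    Solves q(src) * production == q(dst) * consumption for every edge and
    returns the smallest positive integer solution.
    """
    rate = {actors[0]: Fraction(1)}
    changed = True
    while changed:  # propagate relative firing rates across the graph
        changed = False
        for src, dst, prod, cons in edges:
            if src in rate and dst not in rate:
                rate[dst] = rate[src] * prod / cons
                changed = True
            elif dst in rate and src not in rate:
                rate[src] = rate[dst] * cons / prod
                changed = True
    if len(rate) != len(actors):
        raise ValueError("graph is not connected")
    for src, dst, prod, cons in edges:
        if rate[src] * prod != rate[dst] * cons:
            raise ValueError("graph is not sample rate consistent")
    # Scale the fractional rates to the smallest integer vector.
    scale = reduce(lambda a, b: a * b // gcd(a, b),
                   (r.denominator for r in rate.values()), 1)
    counts = {v: int(r * scale) for v, r in rate.items()}
    common = reduce(gcd, counts.values())
    return {v: n // common for v, n in counts.items()}


def is_valid_schedule(schedule, edges, q):
    """True if the firing sequence never underflows an edge, fires each actor
    q(v) times, and returns every edge to zero tokens (no delays assumed)."""
    tokens = {(src, dst): 0 for src, dst, _, _ in edges}
    fired = {v: 0 for v in q}
    for actor in schedule:
        for src, dst, prod, cons in edges:
            if dst == actor:
                if tokens[(src, dst)] < cons:
                    return False
                tokens[(src, dst)] -= cons
        for src, dst, prod, cons in edges:
            if src == actor:
                tokens[(src, dst)] += prod
        fired[actor] += 1
    return fired == q and all(n == 0 for n in tokens.values())


edges = [("A", "B", 2, 3)]
q = repetition_vector(["A", "B"], edges)
print(q)
print(is_valid_schedule("AABAB", edges, q), is_valid_schedule("AAABB", edges, q))
```

A sequence such as ABAAB is rejected, because B would fire when only 2 tokens are available on its input edge.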

DIF-GR-GPU: Dataflow Scheduler

Objective:

– Optimize exploitation of data and pipeline parallelism → Higher throughput.

Flat Schedule: Executes an SDF graph as a cascade of distinct loops with no inter-actor nesting of loops.

Vectorization of a schedule S: A unique positive integer B, called the blocking factor of S, such that S invokes each actor v exactly (B × q(v)) times.

[Figure: Original SDF Graph and Corresponding BPDAG for B = 10, illustrating Data Parallelism and Pipeline Parallelism; edge labels 10, 20, and 60]
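A flat, vectorized schedule can be written down directly from q and B. The following is a minimal sketch under stated assumptions: the function names are hypothetical, the actor order passed in is assumed to be a valid topological order of the graph, and the repetition counts are those of the earlier two-actor example (q(A) = 3, q(B) = 2).

```python
def flat_vectorized_schedule(order, q, blocking_factor):
    """Return one loop per actor as (actor, count) pairs.

    Each actor v is invoked blocking_factor * q[v] times, with no nesting
    of loops across actors.
    """
    if blocking_factor < 1:
        raise ValueError("blocking factor must be a positive integer")
    return [(actor, blocking_factor * q[actor]) for actor in order]


def run(schedule, fire):
    """Execute the cascade of loops, calling fire(actor) for each invocation."""
    for actor, count in schedule:
        for _ in range(count):
            fire(actor)


q = {"A": 3, "B": 2}
schedule = flat_vectorized_schedule(["A", "B"], q, blocking_factor=10)
trace = []
run(schedule, trace.append)
print(schedule)
```

Because all invocations of one actor are grouped together, each inner loop is a natural candidate for a single vectorized call that processes the whole block at once.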



Heterogeneous Multiprocessor Scheduler

Objective: Utilize available multiprocessors in the platform.
Architecture Descriptions: The platform is described by a set P of processors and a set B of all-to-all communication buses.
Execution times depend on the blocking factor.
Every processor is assumed to have a shared memory.

[Figure: Task Parallelism. Processors GPU0, GPU1, through GPU N connected by a Communication Bus]


Scheduler Inputs

Architecture description: set P of processors and a set B of communication buses.

Application description: The application model (input BPDAG) consists of a set T of tasks, and a set E of edges.

Task and edge profiles: These profiles are described by two functions:

– RTP(t ∈ T, p ∈ P) → R defines the execution time of task t on processor p,
– REB(e ∈ E, b ∈ B) → R defines the execution time of edge e on bus b.

Dependency analysis: Task t1 is said to be dependent on task t2 if there is a path that starts at t1 and ends at t2. If no such path exists between t1 and t2, then they are called parallel tasks. A similar concept can be applied to edges.
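The dependency analysis amounts to a reachability test on the task graph. The sketch below is illustrative: the function names and the four-task example graph are hypothetical, and it reads "no such path exists between t1 and t2" as "no path in either direction", which is an assumption about the intended definition.

```python
def reachable(edges, start):
    """Return the set of tasks reachable from start along directed edges."""
    successors = {}
    for src, dst in edges:
        successors.setdefault(src, []).append(dst)
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for nxt in successors.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def parallel_pairs(tasks, edges):
    """Return all task pairs with no path between them in either direction."""
    reach = {t: reachable(edges, t) for t in tasks}
    return [
        (a, b)
        for i, a in enumerate(tasks)
        for b in tasks[i + 1:]
        if b not in reach[a] and a not in reach[b]
    ]


# Hypothetical diamond-shaped task graph: A feeds B and C, which both feed D.
tasks = ["A", "B", "C", "D"]
edges = [("A", "B"), ("A", "C"), ("B", "D"), ("C", "D")]
pairs = parallel_pairs(tasks, edges)
print(pairs)
```

Only parallel pairs need an explicit ordering decision when they share a processor; dependent tasks are already ordered by the dataflow.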


Multiprocessor Scheduler

The basic scheduler functionality is to

– Map every task to a given processor.
– Order the execution of parallel actors assigned to the same processor.
– “Zero” out the communication cost of collocated dependent actors.

The scheduler objective is: Minimize the latency LB of B graph iterations.


MLP formulation

Why?

– Offline analysis of SDF graphs.
– Coarse grain nature of SDF graphs.
– The solver reports a bound on how far its solution is from the optimal solution.

Basic Variables:

– Mapping: XT[t, p] = 1 if task t is assigned to processor p; XT[t, p] = 0 otherwise.
– Ordering: For all parallel tasks t1, t2 that are assigned to the same processor, YT[t1, t2] = 1 if t1 is ordered before t2; YT[t1, t2] = 0 otherwise.
– Running time: RT[t] = actual (platform dependent) execution time of the task t depending on its mapping.
– Start time: ST[t] is the start time for execution of task t.


MLP formulation (continued)

Constraints:

– Assignment
– Dataflow dependency
– Zero cost communication

Objective: Minimize M
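To make the roles of the mapping, running time, and start time variables concrete, the sketch below replaces the MLP solver with plain exhaustive search over mappings on a tiny example. It is not the formulation from the slides. It assumes that M stands for the overall latency (the latest finish time), fixes the execution order on each processor to the order of the task list instead of optimizing it, and ignores bus contention. All task names, processor names, and profile values are hypothetical.

```python
from itertools import product


def latency(tasks, edges, mapping, rtp, reb):
    """Latest finish time for a given mapping.

    tasks must be in topological order. Communication between tasks on the
    same processor costs zero; otherwise the edge's bus time is added.
    """
    proc_free = {}  # time at which each processor becomes idle
    finish = {}     # finish time of each task
    for t in tasks:
        p = mapping[t]
        ready = 0.0
        for src, dst in edges:
            if dst == t:
                comm = 0.0 if mapping[src] == p else reb[(src, dst)]
                ready = max(ready, finish[src] + comm)
        start = max(ready, proc_free.get(p, 0.0))  # plays the role of ST[t]
        finish[t] = start + rtp[(t, p)]            # rtp[(t, p)] plays the role of RT[t]
        proc_free[p] = finish[t]
    return max(finish.values())


def best_mapping(tasks, edges, processors, rtp, reb):
    """Try every task-to-processor assignment and keep the lowest latency."""
    best = None
    for choice in product(processors, repeat=len(tasks)):
        mapping = dict(zip(tasks, choice))
        cost = latency(tasks, edges, mapping, rtp, reb)
        if best is None or cost < best[0]:
            best = (cost, mapping)
    return best


# Hypothetical profiles: task A feeds the parallel tasks B and C.
tasks = ["A", "B", "C"]
edges = [("A", "B"), ("A", "C")]
rtp = {
    ("A", "GPP"): 2.0, ("A", "GPU"): 1.0,
    ("B", "GPP"): 4.0, ("B", "GPU"): 2.0,
    ("C", "GPP"): 2.0, ("C", "GPU"): 2.0,
}
reb = {("A", "B"): 0.5, ("A", "C"): 0.5}

cost, mapping = best_mapping(tasks, edges, ["GPP", "GPU"], rtp, reb)
print(cost, mapping)
```

Exhaustive search grows exponentially with the number of tasks, which is one reason a solver-based formulation is attractive for larger graphs.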


DIF-GR-GPU Workflow

Start from a model-based application description using the dataflow interchange format (DIF).
Use tools to optimize scheduling, assignment.
Generate an accelerated system.

[Figure: DIF-GR-GPU workflow diagram. Labels: Application Graph (Dataflow Graph; Throughput, Latency Constraints), Dataflow Scheduler, Multiprocessor Scheduler, Mapping and Ordering Schedule, GNU Radio Engine, Final Implementation, Platform Description, Actor Profiles, Library of Actor Implementations (GNU Radio)]


GPU interface GRGPU [Plishker 2011]

[Figure: GRGPU build flow within GNU Radio. Labels: GPU Kernel in CUDA (.cu), CUDA Synthesizer (nvcc), GPU Kernel in C++ (.cc), C++ Block with call to GPU Kernel (.cc), CUDA Libraries (libcudart, libcutil), libtool, Standalone Python package, device_work(). Example flowgraph: source, H2D, FIR (GPU accelerated), D2H, sink]

MP-Sched Benchmark (GNU Radio)

[Figure: MP-Sched benchmark graph: a source (SRC) and a sink (SNK) with a grid of FIR filter actors between them, parameterized by # of Stages and # of Pipelines]
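The benchmark topology can be generated programmatically for a chosen size. The sketch below is a guess at the structure based only on the figure labels: it assumes that SRC feeds the first FIR of every pipeline, that each pipeline is a chain of FIR stages, and that the last stage of every pipeline feeds SNK. The function name and the actor naming scheme are hypothetical.

```python
def build_mp_sched(num_pipelines, num_stages):
    """Return (tasks, edges) for an assumed MP-Sched style graph."""
    tasks = ["SRC"]
    edges = []
    for p in range(num_pipelines):
        prev = "SRC"
        for s in range(num_stages):
            name = f"FIR_{p}_{s}"
            tasks.append(name)
            edges.append((prev, name))
            prev = name
        edges.append((prev, "SNK"))  # last stage of the pipeline feeds the sink
    tasks.append("SNK")
    return tasks, edges


# Example size; reading it as 2 pipelines of 5 stages is an assumption.
tasks, edges = build_mp_sched(num_pipelines=2, num_stages=5)
print(len(tasks), len(edges))
```

The pipelines are mutually independent in this construction, so they expose task parallelism, while the stages within a pipeline expose pipeline parallelism.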


Realization of a 2x5 MP-Sched Graph

Application: 2x5 mp-sched graph
Platform: 1 GPP (Intel Xeon CPU, 3 GHz), 1 GPU (an NVIDIA GTX 260), and a PCI
Blocking factor B: 2048

Amount of improvement    All GPP    All GPU
Analytical               55%        19%
Empirical                39%        21%

Latency vs. Throughput Trade-offs

[Figure: Latency per iteration (primary axis) and overall latency of B iterations (secondary axis) versus Blocking Factor (B), for B from 1K to 64K; axis units are microseconds and milliseconds]

Each point is an optimized assignment for each blocking factor.


MLP Solver Running Time

Problem written in MathProg.
Solved using the IBM ILOG CPLEX optimizer
On Intel Core 2 Duo processor at 3 GHz.


Summary

Diversity of platforms:

– ASIC, FPGA, DSPs, GPUs, GPP

Complex application environments

– GNU Radio

Exposing parallelism

– Task, Data and Pipeline

Difficult Mapping Problem
Multi-objective (throughput, Latency)


To Probe Further …

G. Zaki, W. Plishker, S. S. Bhattacharyya, C. Clancy, and J. Kuykendall. Vectorization and mapping of software defined radio applications on heterogeneous multi-processor platforms. In Proceedings of the IEEE Workshop on Signal Processing Systems, pages 31-36, Beirut, Lebanon, October 2011.

G. Zaki, W. Plishker, S. S. Bhattacharyya, C. Clancy, and J. Kuykendall. Integration of dataflow-based heterogeneous multiprocessor scheduling techniques in GNU radio. Journal of Signal Processing Systems, 70(2): 177-191, February 2013. DOI:10.1007/s11265-012-0696-0.


To Probe Even Further

Foreword by S. Y. Kung
Part 1: Applications
Part 2: Architectures
Part 3: Programming and Simulation Tools
Part 4: Design Methods

First edition, 2010; second edition, 2013


Acknowledgements

This research was supported in part by the Laboratory for Telecommunications Sciences.

For more details on this project, other projects in the Maryland DSPCAD Research Group, and associated publications: http://www.ece.umd.edu/DSPCAD/home/dspcad.htm


References 1

[Bhattacharyya 2013] S. S. Bhattacharyya, E. Deprettere, R. Leupers, and J. Takala, editors. Handbook of Signal Processing Systems. Springer, second edition, 2013. ISBN: 978-1-4614-6858-5 (Print); 978-1-4614-6859-2 (Online).

[Plishker 2011] W. Plishker, G. Zaki, S. S. Bhattacharyya, C. Clancy, and J. Kuykendall. Applying graphics processor acceleration in a software defined radio prototyping environment. In Proceedings of the International Symposium on Rapid System Prototyping, pages 67-73, Karlsruhe, Germany, May 2011.

[Hormati 2010] A. H. Hormati, Y. Choi, M. Woh, M. Kudlur, R. Rabbah, T. Mudge, and S. Mahlke. MacroSS: macro-SIMDization of streaming applications. In Symposium on Architectural Support for Programming Languages and Operating Systems, pages 285-296, 2010.

[Lin 2007] Y. Lin, M. Kudlur, S. Mahlke, and T. Mudge. Hierarchical coarse-grained stream compilation for software defined radio. In Proceedings of the International Conference on Compilers, Architecture, and Synthesis of Embedded Systems, pages 115-124, 2007.

References 2

[Stuijk 2007] S. Stuijk, T. Basten, M. C. W. Geilen, and H. Corporaal. Multiprocessor resource allocation for throughput-constrained synchronous dataflow graphs. In Proceedings of the Design Automation Conference, 2007.

[Blossom 2004] E. Blossom. GNU radio: tools for exploring the radio frequency spectrum. Linux Journal, June 2004.

[Ritz 1992] S. Ritz, M. Pankert, and H. Meyr. High level software synthesis