Pattern-guided Big Data Processing on Hybrid Parallel Architectures



SLIDE 1

Pattern-guided Big Data Processing on Hybrid Parallel Architectures

Fahad Khalid, Frank Feinbube, and Andreas Polze

Operating Systems and Middleware Group

SLIDE 2

Motivation

  • Insights from developing simulations for:

– Enumeration of Elementary Flux Modes in Metabolic Networks
– Prediction of aftershocks following earthquakes
– Prediction of volcanic events
– Adiabatic Quantum Computing

  • Collaborations

– Max Planck Institute of Molecular Plant Physiology
– GFZ German Research Center for Geosciences

September 25, 2014 Frank Feinbube | BigSys 2014 2

SLIDE 3

Motivation

  • Complications with Hybrid Architectures

– Memory hierarchy per processor type
– Designed for high FLOP/s, not Big Data

  • Then, assuming the hardware available is hybrid,

– How can we improve both performance and productivity of a simulation that requires processing of very large data sets?

SLIDE 4

Definitions

  • Performance

– Significant speedup

  • Productivity

– Ease of development

  • Hybrid Architecture

– One or more CPUs = Host
– One or more accelerators, e.g., GPUs = Device

SLIDE 5

Efficient Hybrid-Resource Utilization (EHRU)

  • Design Approach

– Hierarchical application of patterns for parallel programming

  • Expected Outcome

– Improved simulation performance
– Improved productivity, by serving as a foundation for:

  • Frameworks
  • Automation tools

SLIDE 6

Parallel Pipeline Pattern

Figure: serial processing of stages versus pipelined processing of stages. Stages T1, T2 and T3 are applied to a stream of items (values shown: 3, 1, 7, 4, 9, 5, 4).
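The pipelined arrangement in the figure can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the helpers `stage` and `run_pipeline` and the three stage functions standing in for T1, T2 and T3 are hypothetical. Each stage runs in its own thread and hands items to the next stage through a FIFO queue, so stage T1 can start on the next item while T2 and T3 are still busy with earlier ones.

```python
import queue
import threading

SENTINEL = object()  # marks the end of the input stream


def stage(fn, inbox, outbox):
    """Run one pipeline stage: apply fn to each item until the sentinel arrives."""
    while True:
        item = inbox.get()
        if item is SENTINEL:
            outbox.put(SENTINEL)  # forward the shutdown signal downstream
            return
        outbox.put(fn(item))


def run_pipeline(items, stage_fns):
    """Push items through all stages; every stage runs in its own thread."""
    queues = [queue.Queue() for _ in range(len(stage_fns) + 1)]
    threads = [
        threading.Thread(target=stage, args=(fn, queues[i], queues[i + 1]))
        for i, fn in enumerate(stage_fns)
    ]
    for t in threads:
        t.start()
    for item in items:
        queues[0].put(item)
    queues[0].put(SENTINEL)

    results = []
    while True:
        out = queues[-1].get()
        if out is SENTINEL:
            break
        results.append(out)
    for t in threads:
        t.join()
    return results


# Hypothetical stage functions standing in for T1, T2 and T3.
def t1(x):
    return x + 1


def t2(x):
    return x * 2


def t3(x):
    return x - 3


items = [3, 1, 7, 4, 9, 5, 4]  # sample values from the figure
pipelined = run_pipeline(items, [t1, t2, t3])
serial = [t3(t2(t1(x))) for x in items]
```

Both variants produce the same results in the same order, since each stage is a single thread reading from a FIFO queue. In CPython, threads mainly help when stages wait on I/O or release the interpreter lock; the sketch shows the structure of the pattern rather than a speedup.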

SLIDE 7

Parallel Pipeline Pattern

  • Simulation as Pipeline

– Read input data from file
– Analytical solutions to 3D Partial Differential Equations in Vectors
– Numerical solution to a System of Linear Equations
– Write output data to file

SLIDE 8

Data Partitioning

  • Motivation

– Main memory and Cache sizes are limited

  • Factors affecting partitioning

– Total memory required/available
– Impact of partition size on pipeline performance

Figure: partition P1 processed as a whole runs out of memory; split into P1,1, P1,2 and P1,3, each part is OK. Further labels: Complete Dataset, Chunk, Partition 0, Partition 1, …
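The idea of splitting a partition until every part fits in memory can be sketched as follows. This is a simplified illustration under stated assumptions: the function names `partition_sizes` and `partitions`, the fixed per-item size, and the single memory budget are all hypothetical, and the sketch ignores the second factor named above (the impact of partition size on pipeline performance).

```python
def partition_sizes(total_items, item_bytes, memory_budget_bytes):
    """Return sizes of the fewest near-equal partitions that each fit the budget."""
    if item_bytes <= 0 or item_bytes > memory_budget_bytes:
        raise ValueError("item size must be positive and fit in the memory budget")
    if total_items == 0:
        return []
    max_items = memory_budget_bytes // item_bytes   # items that fit at once
    n_parts = -(-total_items // max_items)          # ceiling division
    base, extra = divmod(total_items, n_parts)
    # The first `extra` partitions take one additional item.
    return [base + 1 if i < extra else base for i in range(n_parts)]


def partitions(data, sizes):
    """Yield consecutive slices of data with the given sizes."""
    start = 0
    for size in sizes:
        yield data[start:start + size]
        start += size


# Hypothetical numbers: 10 items of 8 bytes each, 32 bytes of memory available.
sizes = partition_sizes(total_items=10, item_bytes=8, memory_budget_bytes=32)
parts = list(partitions(list(range(10)), sizes))
```

With these numbers the whole dataset needs 80 bytes and does not fit, while three partitions of at most 32 bytes each do, which mirrors the P1 versus P1,1 to P1,3 case in the figure.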

SLIDE 9

EHRU Pattern Hierarchy

SLIDE 10

Hybrid Pipeline

  • Uses of Hybrid Pipelining

– Overlapping computation and communication
– Load balancing and optimal resource utilization
– Kernel placement based on architecture

SLIDE 11

Hybrid Pipeline Framework (HyPi)

  • HyPi Stages

– DeviceFilter: CUDA Device kernel
– CallbackFilter: D2H Communication
– PostProcessFilter: Host processing

Figure: the stage sequence Device Filter, Callback Filter, PostProcess Filter, shown three times.

SLIDE 12

HyPi & EHRU – Evaluation

Figure: time (seconds, axis from 5 to 60) against the number of candidate vectors generated (500 million, 2 billion, 2.5 billion, 3.5 billion, 4.5 billion, 6.3 billion, 8.1 billion) for three variants: CPU-only Parallel, Custom Pipeline, HPF Pipeline.

SLIDE 13

Feasibility and Limitations of EHRU

  • Suitable for

– Dense Linear Algebra
– Structured Grids
– Monte Carlo

  • Not suitable for

– Sparse Linear Algebra
– Unstructured Grids
– Graph Traversal

SLIDE 14

Architecture-based Algorithm Decomposition

  • Decompose the algorithm into two parts:

1. Suitable for execution on the GPU
2. Suitable for execution on the CPU

  • CPUs support a diverse range of kernels

– Everything goes, except for massive parallelism

  • How do we decide which part of the algorithm is suitable for GPUs?

Figure labels: Algorithm; Pattern 1, Pattern 2, …, Pattern n − 1, Pattern n; Accelerator; CPU.

SLIDE 15

Characteristics of Computational Kernels

  • Degree of Parallelism (DoP)

– The amount of parallelism exposed by the kernel

  • Arithmetic Intensity

– Ratio of the number of arithmetic instructions to the number of memory access instructions

  • Control Divergence

– No. and complexity of conditional statements
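The arithmetic intensity definition above can be made concrete with a small calculation. This is a hedged sketch: the function `arithmetic_intensity` is a hypothetical helper, and the instruction counts are hand-counted at source level for two textbook kernels (they are not taken from the talk and ignore compiler effects such as fused multiply-add).

```python
def arithmetic_intensity(arithmetic_instructions, memory_instructions):
    """Ratio of arithmetic instructions to memory access instructions."""
    if memory_instructions <= 0:
        raise ValueError("kernel must perform at least one memory access")
    return arithmetic_instructions / memory_instructions


# SAXPY, y[i] = a * x[i] + y[i], counted per element:
#   arithmetic: 1 multiply + 1 add        = 2
#   memory:     load x[i], load y[i], store y[i] = 3
saxpy = arithmetic_intensity(2, 3)

# Degree-8 polynomial evaluated with Horner's rule, counted per element,
# assuming the coefficients stay in registers:
#   arithmetic: 8 multiplies + 8 adds     = 16
#   memory:     load x[i], store y[i]     = 2
horner8 = arithmetic_intensity(16, 2)
```

Under these counts SAXPY does less than one arithmetic instruction per memory access, while the polynomial kernel does eight, so the second kernel spends far more of its time computing than moving data.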

SLIDE 16

Design Patterns and Algorithm Decomposition

  • Patterns suitable for GPUs

– Map
– Stencil

  • Patterns NOT suitable for GPUs

– Reduce
– Scan
– Dynamic Programming

  • This categorization is based on Degree of Parallelism

SLIDE 17

Program Flow with Algorithm Decomposition

Figure: <<Map>> GPU Kernel → Intermediate Result → <<Reduce>> CPU Kernel
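The program flow in the figure can be sketched as a map step that produces an intermediate result, followed by a reduce step that consumes it. This is plain Python with no GPU involved: `map_kernel` and `reduce_kernel` are hypothetical stand-ins for the <<Map>> GPU kernel and the <<Reduce>> CPU kernel, and squaring followed by summation is an arbitrary example workload.

```python
from functools import reduce


def map_kernel(chunk):
    """Stand-in for the <<Map>> GPU kernel: independent work per element."""
    return [x * x for x in chunk]


def reduce_kernel(intermediate):
    """Stand-in for the <<Reduce>> CPU kernel: combine the intermediate result."""
    return reduce(lambda acc, x: acc + x, intermediate, 0)


def run(chunks):
    """Process the data chunk by chunk: map first, then reduce on the result."""
    total = 0
    for chunk in chunks:
        intermediate = map_kernel(chunk)      # would run on the device
        total += reduce_kernel(intermediate)  # would run on the host
    return total


result = run([[1, 2, 3], [4, 5]])  # 1 + 4 + 9 + 16 + 25
```

In a hybrid setting the intermediate result is what crosses from device to host, so its size determines how much communication each chunk costs.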

SLIDE 18

Tool-guided Parallelization for Hybrid Architectures

  • Motivation

– Automatically discerning patterns from serial code
– Efficient mapping of parallel code with EHRU

  • How?

– Dependence Analysis to discern patterns
– Developer feedback to improve affine transformations

  • This is work in progress

SLIDE 19

Future Work

  • Information Theoretic approach to improve serial to parallel transformations
  • Partitioning for Complex Data structures
  • Automated tool for architecture-based algorithm decomposition

Thank You!