SLIDE 1

Multi-platform Automatic Parallelization and Power Reduction by OSCAR Compiler

Hironori Kasahara

Professor, Dept. of Computer Science & Engineering
Director, Advanced Multicore Processor Research Institute
Waseda University, Tokyo, Japan

IEEE Computer Society Board of Governors
IEEE Computer Society Multicore STC Chair

URL: http://www.kasahara.cs.waseda.ac.jp/

SLIDE 2

To improve effective performance, cost-performance and software productivity and reduce power

OSCAR Parallelizing Compiler

Multigrain Parallelization

Coarse-grain parallelism among loops and subroutines and near-fine-grain parallelism among statements, in addition to loop parallelism

Data Localization

Automatic data management for distributed shared memory, cache and local memory

Data Transfer Overlapping

Data transfer overlapping using Data Transfer Controllers (DMAs)

Power Reduction

Reduction of power consumption by compiler-controlled DVFS and power gating with hardware support.

[Figure: macro tasks 1 to 33 grouped into Data Localization Groups dlg0, dlg1, dlg2 and dlg3]

SLIDE 3

Low Power Heterogeneous Multicore Code Generation

API Analyzer

(Available from Waseda)

Existing sequential compiler

Multicore Program Development Using OSCAR API V2.0

Sequential Application Program in Fortran or C

(Consumer Electronics, Automobiles, Medical, Scientific computation, etc.)

Low Power Homogeneous Multicore Code Generation

API Analyzer Existing sequential compiler

Proc0 Thread 0 Code with directives

Waseda OSCAR Parallelizing Compiler

  • Coarse grain task parallelization
  • Data Localization
  • DMAC data transfer
  • Power reduction using DVFS, Clock/Power gating

Proc1 Thread 1 Code with directives

Parallelized API F or C program

OSCAR API for Homogeneous and/or Heterogeneous Multicores and Manycores
Directives for thread generation, memory, data transfer using DMA, power management

Generation of parallel machine codes using sequential compilers

Executable on various multicores

OSCAR: Optimally Scheduled Advanced Multiprocessor
API: Application Program Interface

Homogeneous Multicores from Vendor A (SMP servers)

Server Code Generation OpenMP Compiler

Shared memory servers

Heterogeneous Multicores from Vendor B

Hitachi, Renesas, NEC, Fujitsu, Toshiba, Denso, Olympus, Mitsubishi, Esol, Cats, Gaio, 3 univ.

Accelerator 1

Code

Accelerator 2

Code

Homogeneous

Accelerator Compiler/ User

Add “hint” directives before a loop or a function to specify that it is executable by the accelerator and in how many clocks

Hetero Manual parallelization / power reduction

SLIDE 4

Model-Based Designed Engine Control on V850 Multicore with Denso

Although parallel processing of engine control on multicores has so far been very difficult, Denso and Waseda achieved a 1.95 times speedup on a 2-core V850 multicore processor.

[Figure: 1 core vs. 2 cores]

Hard real-time automobile engine control by multicore

SLIDE 5

Parallelizing Handwritten Engine Control Programs on Multi-core Processors

  • Current automotive crankshaft program
    – Developed by TOYOTA Motor Corp
    – About 300,000 lines
    – Difficulty of parallel processing
      • Too fine granularity
      • Many conditional branches and small basic blocks, but no parallelizable loops
    – Minimizing run-time overhead and improving parallelism are necessary
      • Current product compilers cannot parallelize it
      • Current accelerators are not applicable

➢ Automatic parallelization of a crankshaft program using multi-grain parallelization in the OSCAR Compiler
  • Performance improvement and efficient multi-threaded program development

2013/04/19 Cool Chips XVI 5

SLIDE 6

Analysis of Coarse Grain Parallelism by OSCAR Compiler

[Figure: Macro-Flow Graph and Macro-Task Graph (with earliest executable conditions) of a sample program with macro tasks 1 to 14. Legend: data dependency, control flow, conditional branch.]

➢ Decomposes a program into coarse grain tasks, or macro tasks (MTs)
  1. BB (Basic Block)
  2. RB (Repetition Block, or loop)
  3. SB (Subroutine Block, or function)
➢ Generates MFG (Macro Flow Graph)
  • Control flow and data dependencies
➢ Generates MTG (Macro Task Graph)
  • Coarse grain parallelism
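The idea of an earliest executable condition can be illustrated with a minimal sketch that handles data dependencies only: a macro task may start as soon as every macro task it depends on has finished. The `MacroTask` record, the bitmask encoding and the `ready_tasks` function are hypothetical names chosen for this illustration and are not part of the OSCAR compiler; real earliest executable conditions also account for control dependencies (branch outcomes), which this sketch ignores.

```c
#include <stdint.h>

/* Hypothetical macro-task record: bit j of `preds` is set when this task
   needs data produced by macro task j (data dependencies only). */
typedef struct {
    uint32_t preds;
} MacroTask;

/* Returns a bitmask of the macro tasks that may start now: tasks that are
   not finished yet and whose predecessors have all finished.
   Supports up to 32 macro tasks. */
uint32_t ready_tasks(const MacroTask *mt, int n, uint32_t finished)
{
    uint32_t ready = 0;
    for (int i = 0; i < n; i++) {
        uint32_t bit = (uint32_t)1 << i;
        if (!(finished & bit) && (mt[i].preds & ~finished) == 0)
            ready |= bit;
    }
    return ready;
}
```

With four tasks where task 2 needs tasks 0 and 1, and task 3 needs task 2, tasks 0 and 1 are ready at the start and can run in parallel on different cores.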

SLIDE 7

Coarse Grain Task Parallelization of Hand-written Engine Control Program

[Figure: MTG of crankshaft programs]

Loop parallelization
  • No parallelizable loops in engine control codes

Fine grain parallelization
  • Each BB has a very low cost: less than 100 clock cycles
  • Branches prevent compilers

Coarse grain parallelization
  • Utilize parallelism between SBs and BBs

SLIDE 8

Static Task Scheduling

[Figure: MFG of sample program]

➢ Dynamic task scheduling
  • Prevents traceability
  • Adds run-time overhead
➢ Static task scheduling
  • Guarantees real-time constraints
  • Ensures traceability
  • Minimizes run-time overhead
➢ BBs having branches cannot be assigned statically
  • Static task scheduling can be applied only if the MTG has only data dependencies
  • The compiler cannot know at compile time whether a branch will be taken
➢ Fuse tasks by hiding conditional branches in the MFG to avoid dynamic task scheduling
  • Macro Task Fusion
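Once the MTG contains only data dependencies, a schedule can be fixed at compile time. The following is a minimal sketch of one simple way to do that: a greedy list scheduler that places each task on the core where it can start earliest. The `Task` record, the limits `MAX_TASKS` and `MAX_CORES`, and the function `static_schedule` are assumptions made for illustration; the slides do not state which scheduling heuristic the OSCAR compiler uses.

```c
#define MAX_TASKS 32
#define MAX_CORES 8

/* Hypothetical task record; tasks are assumed to be listed in topological
   order, so every predecessor index is smaller than the task's own index. */
typedef struct {
    int cost;               /* estimated execution cost in clock cycles */
    int npreds;             /* number of data-dependence predecessors   */
    int preds[MAX_TASKS];   /* indices of those predecessors            */
} Task;

/* Assigns every task to a core and a start time at "compile time".
   Fills core_of[] and start[] and returns the schedule length. */
int static_schedule(const Task *t, int n, int ncores, int *core_of, int *start)
{
    int core_free[MAX_CORES] = {0};   /* time at which each core becomes idle */
    int finish[MAX_TASKS] = {0};
    int makespan = 0;

    for (int i = 0; i < n; i++) {
        /* A task cannot start before all of its predecessors have finished. */
        int ready = 0;
        for (int k = 0; k < t[i].npreds; k++) {
            int p = t[i].preds[k];
            if (finish[p] > ready)
                ready = finish[p];
        }
        /* Pick the core on which the task can start earliest. */
        int best = 0, best_start = -1;
        for (int c = 0; c < ncores; c++) {
            int s = core_free[c] > ready ? core_free[c] : ready;
            if (best_start < 0 || s < best_start) {
                best = c;
                best_start = s;
            }
        }
        core_of[i] = best;
        start[i] = best_start;
        finish[i] = best_start + t[i].cost;
        core_free[best] = finish[i];
        if (finish[i] > makespan)
            makespan = finish[i];
    }
    return makespan;
}
```

Because the assignment and start times are computed before execution, no scheduler code has to run on the target, which is the property the slide associates with traceability and low run-time overhead.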
SLIDE 9

Analysis of A Crankshaft Program Using Macro Task Fusion

[Figure: MTG of crankshaft program before macro task fusion]
MTs cannot be scheduled at compile time.

[Figure: MTG of crankshaft program after macro task fusion]
There is not enough parallelism: sb4 and block5 account for over 90% of the whole execution time.

SLIDE 10

MTG of Crankshaft Program Using Inline Expansion and Duplicating If-statements

Successfully increased coarse grain parallelism

[Figure: MTG of crankshaft program before restructuring]
The Critical Path (CP) accounts for over 99% of the whole execution time.

[Figure: MTG of crankshaft program after restructuring]
The CP accounts for about 60% of the whole execution time.

➢ Succeeded in reducing the CP: 99% -> 60%
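The share of the critical path matters because it bounds the achievable speedup: no schedule can finish faster than the critical path, so a CP of 99% allows at most about 1.01 times speedup, while a CP of 60% allows up to about 1/0.6 ≈ 1.67 times, whatever the number of cores. The sketch below computes the critical path length of a task graph and the resulting bound; the `Node` record and the function names are hypothetical and only illustrate the calculation.

```c
#define MAX_NODES 64
#define MAX_PREDS 8

/* Hypothetical task-graph node; nodes are assumed to be in topological
   order, so every predecessor index is smaller than the node's own index. */
typedef struct {
    int cost;
    int npreds;
    int preds[MAX_PREDS];
} Node;

/* Returns the critical path length (longest cost-weighted path) and
   stores the sum of all costs, i.e. the sequential time, in *total. */
int critical_path(const Node *g, int n, int *total)
{
    int finish[MAX_NODES];
    int cp = 0;

    *total = 0;
    for (int i = 0; i < n; i++) {
        int start = 0;
        for (int k = 0; k < g[i].npreds; k++) {
            int p = g[i].preds[k];
            if (finish[p] > start)
                start = finish[p];
        }
        finish[i] = start + g[i].cost;
        if (finish[i] > cp)
            cp = finish[i];
        *total += g[i].cost;
    }
    return cp;
}

/* Upper bound on speedup for any number of cores: sequential time / CP. */
double speedup_bound(int total, int cp)
{
    return (double)total / (double)cp;
}
```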

SLIDE 11

Evaluation Environment : Embedded Multi-core Processor RPX

  • SH-4A 648MHz * 8

– As a first step, we use just two SH-4A cores because target dual-core processors are currently under design for next-generation automobiles


SLIDE 12

Evaluation of Crankshaft Program with Multi-core Processors

  • Attained 1.54 times speedup on RPX
    – The program has no loops, only many conditional branches and small basic blocks, which makes it difficult to parallelize
  • This result shows the potential of multi-core processors for engine control programs

[Chart: execution time and speedup ratio]
              execution time (us)   speedup ratio
  1 core      0.57                  1.00
  2 cores     0.37                  1.54

SLIDE 13

Performance of OSCAR Compiler on Intel Core i7 Notebook PC

  • OSCAR Compiler accelerates Intel Compiler by about 2.0 times on average

Speedup Ratio
                    Intel Compiler Ver.14.0   OSCAR
  SPEC95 su2cor     1.00                      1.70
  SPEC95 hydro2d    1.18                      2.24
  SPEC95 mgrid      2.70                      4.12
  SPEC95 turb3d     1.00                      2.91
  AAC Encoder       1.00                      2.53

CPU: Intel Core i7 3720QM (Quad-core)
MEM: 32GB DDR3-SODIMM PC3-12800
OS: Ubuntu 12.04 LTS

SLIDE 14

Parallel Processing of JPEG XR Encoder on TILEPro64

Multimedia Applications: AAC Encoder, JPEG XR Encoder, Optical Flow Calc.

[Figure: tool flow. Sequential C Source Code; (1) OSCAR Parallelization by OSCAR Compiler; Parallelized C Program with OSCAR API; API Analyzer + Sequential Compiler; (2) Cache Allocation Setting; Parallelized Executable Binary for TILEPro64]

Speedup (JPEG XR Encoder)
                              1 core   64 cores
  Default Cache Allocation    1x       28x
  Our Cache Allocation        1x       55x

Local cache optimization: Parallel Data Structure (tile) on heap allocating to local cache
55x speedup on 64 cores

[Figure: TILEPro64 8x8 tile array with I/O and Memory Controllers 0 to 3]

SLIDE 15

Parallel Processing of Face Detection on Manycore, High-end and PC Server

  • OSCAR compiler gives us 11.55 times speedup for 16 cores against 1 core on SR16000 Power7 high-end server

Automatic Parallelization of Face Detection: speedup ratio vs. number of cores
  Number of cores                          1      2      4      8      16
  tilepro64 gcc                            1.00   1.72   3.01   5.74   9.30
  SR16k (Power7 8core*4cpu*4node) xlc      1.00   1.93   3.57   6.46   11.55
  rs440 (Intel Xeon 8core*4cpu) icc        1.00   1.93   3.67   6.46   10.92

SLIDE 16

92 Times Speedup against the Sequential Processing for GMS Earthquake Wave Propagation Simulation on Hitachi SR16000

(Power7 Based 128 Core Linux SMP)

[Chart: speedup against sequential processing (axis 10 to 100) for 1, 2, 4, 8, 16, 32, 64 and 128 PEs; series: oscar]
SLIDE 17

Profile-Based Automatic Parallelization and Sequential Program Tuning for Android 2D Rendering on Nexus7

WASEDA UNIVERSITY

Profile-Based Automatic Parallelization (OSCAR Compiler)

[Figure: tool flow with the elements Original Source Files; ARM Compiler (GCC); Sequential Binary File; Profile Result; Hotspot Analyzer; Analyze Result; Hotspot Source File; Rewrite to Parallelizable C; OSCAR Parallelization Compiler; Parallelized Source Files with OSCAR API; OSCAR API Analyzer; Parallelized Source Files; ARM Compiler (GCC); Parallelized Binary File]

Green Computing Systems Research and Development Center, Waseda University

Android 2D Rendering “Skia”

Standard library which performs 2D rendering on Android

Skia 2D Rendering Pipeline
  • Path Generation: transforms a figure into paths
  • Rasterization: transforms a path into a bitmap (mask)
  • Shading: makes color data from a command
  • BitBlit: calculates and transfers the display image from the source image, mask and destination image, producing the modified destination image

The execution flow is different for each benchmark. High load on the BitBlit process.

DrawArc
  SkRGB16_Blitter::blitH       81.84%
  memset_128_loop               2.70%
  sk_fill_path                  2.21%
  memset32_loop128              1.42%
  SkString::~SkString()         0.61%
  Others                       11.22%

DrawRect
  SkRGB16_Blitter::blitRect    97.43%
  Others                        2.57%

Tuning Method for “Skia” [DrawRect]

After Tuning
  • The true dependency on variable deviceRB is resolved.
  • Separate C++ file

    void SkRGB16_Blitter_blitRect_oscar(int width, int height, uint16_t* device,
                                        unsigned deviceRB, SkPMColor src32) {
        int i;
        uint16_t* deviceTMP;
        for (i = height; i > 0; i--) {
            /* The row address is computed from the loop index, so no
               iteration depends on a pointer updated by the previous one. */
            deviceTMP = (uint16_t*)((char*)device + (deviceRB * (height - i)));
            blend32_16_row(src32, deviceTMP, width);
        }
    }

    void SkRGB16_Blitter::blitRect(int x, int y, int width, int height) {
        SkASSERT(x + width <= fDevice.width() && y + height <= fDevice.height());
        uint16_t* SK_RESTRICT device = fDevice.getAddr16(x, y);
        unsigned deviceRB = fDevice.rowBytes();
        SkPMColor src32 = fSrcColor32;
        SkRGB16_Blitter_blitRect_oscar(width, height, device, deviceRB, src32);
    }

Evaluations

Google NEXUS 7
  • Processor: NVIDIA Tegra3 chip, ARM Cortex-A9, 4 cores
  • Clock frequency: 1.2 [GHz]

Benchmarks: 0xbenchmark 2D Canvas Test

“Skia” Profiling
  • Profiling by Oprofile
  • Parallelizing with hotspot information (Android, Skia, multicore)

Performance: FPS (Frames Per Second), benchmarks DrawRect, DrawCircle2…
  Ordinary compilation    22.82   38.58   33.86   27.16
  Parallelized            43.57   50.98   50.77   52.88
  Speedup                 1.91x   1.32x   1.50x   1.95x

CPU Load Graph [DrawRect]
  • Sequential Skia (1PE): 1 frame 50 msec, finish 15 sec
  • Parallelized Skia (3PE) with OSCAR API Runtime Library: 1 frame 28 msec, finish 8 sec

Original Source Code [DrawRect]

    void SkRGB16_Blitter::blitRect(int x, int y, int width, int height) {
        SkASSERT(x + width <= fDevice.width() && y + height <= fDevice.height());
        uint16_t* SK_RESTRICT device = fDevice.getAddr16(x, y);
        unsigned deviceRB = fDevice.rowBytes();
        SkPMColor src32 = fSrcColor32;
        while (--height >= 0) {
            blend32_16_row(src32, device, width);
            /* device is advanced every iteration: a loop-carried dependency */
            device = (uint16_t*)((char*)device + deviceRB);
        }
    }

DrawCircle2
  SkRGB16_Blitter::blitAntiH   78.60%
  SkRGB16_Blitter::blitRect     8.54%
  SkAlphaRuns::add              2.47%
  SuperBlitter::blitH           2.37%
  SkAlphaRuns::Break            2.02%
  Others                        6.01%

1.50x
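The DrawRect tuning on this slide replaces a pointer that is advanced in every iteration with a row address computed from the loop index. The sketch below reproduces that transformation in plain C and lets both forms be compared. `fill_row` is a hypothetical stand-in for Skia's `blend32_16_row` (it only stores a value), and `rect_sequential` and `rect_indexed` are names invented for this illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for blend32_16_row: stores `value` into one row. */
static void fill_row(uint16_t value, uint16_t *row, int width)
{
    for (int x = 0; x < width; x++)
        row[x] = value;
}

/* Form of the original loop: `device` is advanced every iteration, so each
   iteration needs the pointer value produced by the previous one. */
void rect_sequential(uint16_t *device, unsigned row_bytes,
                     int width, int height, uint16_t value)
{
    while (--height >= 0) {
        fill_row(value, device, width);
        device = (uint16_t *)((char *)device + row_bytes);
    }
}

/* Form of the tuned loop: the row address depends only on the index i,
   so the iterations are independent and can be distributed over cores. */
void rect_indexed(uint16_t *device, unsigned row_bytes,
                  int width, int height, uint16_t value)
{
    for (int i = 0; i < height; i++) {
        uint16_t *row = (uint16_t *)((char *)device + (size_t)row_bytes * i);
        fill_row(value, row, width);
    }
}
```

Both functions write the same pixels; only the second exposes the rows as independent units of work, which is what a parallelizing compiler needs in order to split the loop.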

SLIDE 18

Parallelization of 2D Rendering Engine SKIA on 3 cores of Google NEXUS7

On Nexus7, 3 core parallelization gave us
  • 1.91 speedup for DrawRect
  • 1.95 speedup for DrawImage

FPS (axis 15 to 60)
               Ordinary 1-core execution   Parallelized 3-core execution   Speedup
  DrawRect     22.82                       43.57                           ×1.91
  DrawImage    27.16                       52.88                           ×1.95

http://www.youtube.com/channel/UCS43lNYEIkC8i_KIgFZYQBQ

SLIDE 19

Low-Power Optimization with OSCAR API

[Figure: Scheduled Result by OSCAR Compiler. Macro tasks MT1, MT2, MT3 and MT4 are assigned to VC0 and VC1, with a Sleep interval.]

[Figure: Generate Code Image by OSCAR Compiler. Functions main_VC0() and main_VC1() contain MT1 to MT4, the Sleep interval, and the directives:]

    #pragma oscar fvcontrol \
        (1,(OSCAR_CPU(),100))

    #pragma oscar fvcontrol \
        ((OSCAR_CPU(),0))
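How a compiler might decide which frequency value to request can be sketched as follows: given the cycle count of a task and its deadline, choose the lowest frequency level that still finishes in time. This is only an illustration under stated assumptions. The level set {25, 50, 100} percent, the function `pick_frequency` and its parameters are hypothetical; the percent scale merely mirrors the values 100 and 0 that appear in the fvcontrol directives on this slide, and the slides do not describe the actual selection rule used by the OSCAR compiler.

```c
/* Hypothetical frequency levels, in percent of the full clock, lowest first. */
static const int kLevels[] = {25, 50, 100};
static const int kNumLevels = 3;

/* cycles:      estimated cost of the task in clock cycles
   full_mhz:    full clock frequency in MHz
   deadline_us: time available for the task in microseconds
   Returns the lowest level (percent) that meets the deadline,
   or -1 if even the full frequency is too slow. */
int pick_frequency(long long cycles, long long full_mhz, long long deadline_us)
{
    for (int i = 0; i < kNumLevels; i++) {
        /* time_us = cycles / (full_mhz * level / 100) <= deadline_us
           is equivalent to cycles * 100 <= deadline_us * full_mhz * level */
        if (cycles * 100 <= deadline_us * full_mhz * kLevels[i])
            return kLevels[i];
    }
    return -1;
}
```

For example, with a hypothetical 600 MHz core and a 33 ms deadline, a task of 4,000,000 cycles fits at the 25% level, while a task of 15,000,000 cycles needs the full frequency.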

SLIDE 20

Power Reduction of MPEG2 Decoding to 1/4 on 8 Core Homogeneous Multicore RP-2 by OSCAR Parallelizing Compiler

MPEG2 Decoding with 8 CPU cores
  • Without Power Control (Voltage: 1.4V): Avg. Power 5.73 [W]
  • With Power Control (Frequency, Resume Standby: Power shutdown & Voltage lowering 1.4V-1.0V): Avg. Power 1.52 [W]

73.5% Power Reduction

SLIDE 21

33 Times Speedup Using OSCAR Compiler and OSCAR API on RP-X

(Optical Flow with a hand-tuned library)

Speedups against a single SH processor
  1SH        1
  2SH        2.29
  4SH        3.09
  8SH        5.4
  2SH+1FE    18.85
  4SH+2FE    26.71
  8SH+4FE    32.65

3.4 [fps] (1SH) to 111 [fps] (8SH+4FE)

SLIDE 22

Power Reduction in a real-time execution controlled by OSCAR Compiler and OSCAR API on RP-X (Optical Flow with a hand-tuned library)

  • Without Power Reduction: Average 1.76 [W]
  • With Power Reduction by OSCAR Compiler: Average 0.54 [W]

1 cycle: 33 [ms] → 30 [fps]

70% of power reduction

SLIDE 23

Automatic Power Reduction for MPEG2 Decode on Android Multicore

ODROID X2, ARM Cortex-A9, 4 cores

  • On 3 cores, automatic power reduction control reduced power to 1/7 of that without power reduction control.
  • 3 cores with the compiler power reduction control reduced power to 1/3 of that of ordinary 1-core execution.

Power Consumption [W]
  Number of processor cores     1               2               3
  Without Power Reduction       0.97            1.88            2.79
  With Power Reduction          0.63            0.46            0.37
  With vs. without              2/3 (-35.0%)    1/4 (-75.5%)    1/7 (-86.7%)

3 cores with power reduction vs. 1 core without power reduction: 1/3 (-61.9%)

http://www.youtube.com/channel/UCS43lNYEIkC8i_KIgFZYQBQ
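The reduction figures on this slide follow directly from the measured average power values. A minimal sketch of the arithmetic is given below; the function names `reduction_percent` and `reduction_factor` are chosen for this illustration.

```c
/* Percentage by which power drops when going from before_w to after_w. */
double reduction_percent(double before_w, double after_w)
{
    return (1.0 - after_w / before_w) * 100.0;
}

/* Factor by which power shrinks, e.g. about 7.5 means "reduced to roughly 1/7". */
double reduction_factor(double before_w, double after_w)
{
    return before_w / after_w;
}
```

For 3 cores, going from 2.79 W to 0.37 W is a reduction of about 86.7%, or a factor of about 7.5; comparing 0.37 W against the 0.97 W of ordinary 1-core execution gives about 61.9%.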

SLIDE 24

Automatic Power Reduction on 4 core Intel Haswell

  • Haswell Processor

    – OS: Ubuntu 13.10
    – CPU: Intel Core i7 4770K
      • 4 cores
      • L1 cache: load 64 bytes/cycle, store 32 bytes/cycle
      • L2 cache: 64 bytes/cycle
      • L3 cache: 8 MB
      • Frequency: 3.5 GHz to 0.8 GHz
    – Memory: 16 GB (8 GB × 2)

SLIDE 25

Power Reduction on Intel Haswell for Real-time Optical Flow

Power was reduced to 1/4 by the compiler power optimization on the same 3 cores.

The power with 3 cores was reduced to 1/3 against ordinary execution on 1 core.

For HD 720p (1280x720) moving pictures at 15 fps (deadline 66.6 [ms/frame])

Average Power [W]
  No. of processor cores     1       2       3
  Without Power Control      29.29   36.59   41.58
  With Power Control         28.40   13.22   10.49

SLIDE 26

Power Waves for 1 Core to 3 Cores without the Compiler Power Control on Intel Haswell for Real-time Optical Flow

2014/6/17 DEMO 26

[Figure: waveforms of voltage (V), current (A) and power (W)]

  1 Core     Power 29.29W
  2 Cores    Power 36.59W
  3 Cores    Power 41.58W

SLIDE 27

Power Waves for 1 Core to 3 Cores with the Compiler Power Control on Intel Haswell for Real-time Optical Flow

[Figure: waveforms of voltage (V), current (A) and power (W)]

  1 Core     Power 28.40W
  2 Cores    Power 13.22W
  3 Cores    Power 10.49W

SLIDE 28

Power for 1 & 3 Cores without Control vs. for 3 Cores with Control on Haswell

Without Power Control
  1 Core     Power 29.29W
  3 Cores    Power 41.58W

With Power Control
  3 Cores    Power 10.49W

SLIDE 29

Future Multicore Products

Next Generation Automobiles
  - Safer, more comfortable, energy efficient, environment friendly
  - Cameras, radar, car2car communication, internet information integrated brake, steering, engine, motor control

Personal / Regional Supercomputers
  Solar powered with more than 100 times power efficient: FLOPS/W
  • Regional Disaster Simulators saving lives from tornadoes, localized heavy rain, fires with earthquakes

Smart phones
  - From everyday recharging to less than once a week
  - Solar powered operation in emergency condition
  - Keep health

Advanced medical systems
  Cancer treatment, Drinkable inner camera
  • Emergency solar powered
  • No cooling fan, no dust, clean usable inside OP room