Multi-platform Automatic Parallelization and Power Reduction by OSCAR Compiler
Hironori Kasahara
Professor, Dept. of Computer Science & Engineering
Director, Advanced Multicore Processor Research Institute
Waseda University, Tokyo, Japan
To improve effective performance, cost-performance and software productivity and reduce power
OSCAR Parallelizing Compiler
Multigrain Parallelization
coarse-grain parallelism among loops and subroutines, near fine grain parallelism among statements in addition to loop parallelism
Data Localization
Automatic data management for distributed shared memory, cache and local memory
Data Transfer Overlapping
Data transfer overlapping using Data Transfer Controllers (DMAs)
Power Reduction
Reduction of power consumption by compiler-controlled DVFS and power gating with hardware support.
[Figure: task graph with nodes 1 to 33 grouped into Data Localization Groups dlg0, dlg1, dlg2 and dlg3]
Low Power Heterogeneous Multicore Code Generation
API Analyzer
(Available from Waseda)
Existing sequential compiler
Multicore Program Development Using OSCAR API V2.0
Sequential Application Program in Fortran or C
(Consumer Electronics, Automobiles, Medical, Scientific computation, etc.)
Low Power Homogeneous Multicore Code Generation
API Analyzer Existing sequential compiler
Proc0 Thread 0 Code with directives
Waseda OSCAR Parallelizing Compiler
- Coarse grain task parallelization
- Data Localization
- DMAC data transfer
- Power reduction using DVFS, Clock/Power gating
Proc1 Thread 1 Code with directives
Parallelized API F or C program
OSCAR API for Homogeneous and/or Heterogeneous Multicores and Manycores
Directives for thread generation, memory, data transfer using DMA, power management
Generation of parallel machine codes using sequential compilers
Executable on various multicores
OSCAR: Optimally Scheduled Advanced Multiprocessor
API: Application Program Interface
Homogeneous Multicores from Vendor A (SMP servers)
Server Code Generation OpenMP Compiler
Shared memory servers
Heterogeneous Multicores from Vendor B
Hitachi, Renesas, NEC, Fujitsu, Toshiba, Denso, Olympus, Mitsubishi, Esol, Cats, Gaio, 3 univ.
Accelerator 1
Code
Accelerator 2
Code
Homogeneous
Accelerator Compiler/ User
Add “hint” directives before a loop or a function to specify that it is executable by the accelerator and how many clocks it takes
Hetero Manual parallelization / power reduction
Model Base Designed Engine Control on V850 Multicore with Denso
Parallel processing of engine control on multicore has so far been very difficult, but Denso and Waseda achieved a 1.95 times speedup on a 2-core V850 multicore processor.
1 core 2 cores
Hard real-time automobile engine control by multicore
Parallelizing Handwritten Engine Control Programs on Multi-core Processors
- Current automotive crankshaft program
– Developed by TOYOTA Motor Corp
– About 300,000 lines
– Difficulty of parallel processing
- Too fine granularity
- Many conditional branches and small basic blocks,
but no parallelizable loops
– Minimizing run‐time overhead and improvement of parallelism are necessary
- Current product compilers can not parallelize
- Current accelerators are not applicable
Automatic parallelization of a crankshaft program using
multi‐grain parallelization in OSCAR Compiler
- Performance improvement and efficient multi‐threaded
programming development
2013/04/19 Cool Chips XVI 5
Analysis of Coarse Grain Parallelism by OSCAR Compiler
[Figure: Macro-Flow Graph and Macro-Task Graph (with Earliest Executable Conditions) of a sample program with macro tasks 1 to 14]
Decomposes a program into coarse grain tasks, or macro tasks (MTs)
1. BB (Basic Block)
2. RB (Repetition Block, or loop)
3. SB (Subroutine Block, or function)
Generates MFG (Macro Flow Graph): control flow and data dependencies
Generates MTG (Macro Task Graph): coarse grain parallelism
Legend: Data Dependency, Control Flow, Conditional Branch
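The idea of a macro task becoming executable once its dependencies are satisfied can be sketched as follows. This is a minimal illustration, not the OSCAR Compiler's implementation: it models data dependencies only (the earliest executable condition on the slide also involves control dependencies), and the type and function names (`macro_task`, `mt_is_ready`, `mt_ready_set`) are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Kinds of macro task named on the slide. */
typedef enum { MT_BB, MT_RB, MT_SB } mt_kind;

#define MAX_MT 32

/* One macro task: its kind and a bitmask of the macro tasks whose
 * results it needs (data-dependence predecessors). Hypothetical layout. */
typedef struct {
    mt_kind  kind;
    uint32_t preds;   /* bit j set: depends on macro task j */
} macro_task;

/* Simplified readiness check: a macro task may start once every
 * data-dependence predecessor has finished. */
static bool mt_is_ready(const macro_task *mt, uint32_t finished)
{
    return (mt->preds & ~finished) == 0;
}

/* Return a bitmask of all unfinished macro tasks that are ready. */
static uint32_t mt_ready_set(const macro_task *g, int n, uint32_t finished)
{
    assert(n >= 0 && n <= MAX_MT);
    uint32_t ready = 0;
    for (int i = 0; i < n; i++) {
        if (!(finished & (1u << i)) && mt_is_ready(&g[i], finished))
            ready |= 1u << i;
    }
    return ready;
}
```

Macro tasks whose bits appear together in the returned mask have no unfinished predecessors, so they are candidates for execution on different cores.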
Coarse Grain Task Parallelization of Hand-written Engine Control Program
MTG of crankshaft programs
Loop parallelization
- No parallelizable loops
in engine control codes
Fine grain parallelization
- Each BB has a very low cost
- Less than 100 clock cycles
- Branches prevent compilers
Coarse grain parallelization
- Utilize parallelism between
SBs and BBs
Static Task Scheduling
MFG of sample program
Dynamic task scheduling
- Prevents traceability
- Adds run-time overhead
Static task scheduling
- Guarantees real-time constraints
- Ensures traceability
- Minimizes run-time overhead
- Cannot statically assign BBs that contain branches
Static task scheduling can be applied if the MTG has only data dependency
The compiler cannot see if the branch is taken or not at compile time.
Fuse tasks by hiding conditional branches in MFG
to avoid dynamic task scheduling
- Macro Task Fusion
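Once fusion leaves an MTG with only data dependencies, every start time can be fixed at compile time. The sketch below illustrates that with a simple greedy list scheduler; it is not the scheduling algorithm used by the OSCAR Compiler. The task costs, the bitmask encoding and the names `sched_task`, `schedule` and `static_schedule` are assumptions made for illustration, and tasks are assumed to be numbered in topological order.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_TASKS 32
#define MAX_CORES 8

typedef struct {
    int      cost;    /* estimated execution time (hypothetical units) */
    uint32_t preds;   /* bit j set: needs the result of task j, j < own index */
} sched_task;

typedef struct {
    int core[MAX_TASKS];    /* core assigned to each task */
    int start[MAX_TASKS];   /* fixed start time of each task */
    int finish[MAX_TASKS];  /* fixed finish time of each task */
    int makespan;           /* finish time of the last task */
} schedule;

/* Greedy static scheduling: visit tasks in index order and place each one
 * on the core where it can start earliest. No run-time decisions remain. */
static schedule static_schedule(const sched_task *t, int n, int cores)
{
    assert(n >= 0 && n <= MAX_TASKS && cores >= 1 && cores <= MAX_CORES);
    schedule s = {0};
    int core_free[MAX_CORES] = {0};

    for (int i = 0; i < n; i++) {
        /* Earliest time at which all inputs of task i are available. */
        int data_ready = 0;
        for (int j = 0; j < i; j++)
            if (((t[i].preds >> j) & 1u) && s.finish[j] > data_ready)
                data_ready = s.finish[j];

        /* Pick the core on which task i can start earliest. */
        int best = 0, best_start = -1;
        for (int c = 0; c < cores; c++) {
            int st = core_free[c] > data_ready ? core_free[c] : data_ready;
            if (best_start < 0 || st < best_start) { best = c; best_start = st; }
        }
        s.core[i]   = best;
        s.start[i]  = best_start;
        s.finish[i] = best_start + t[i].cost;
        core_free[best] = s.finish[i];
        if (s.finish[i] > s.makespan) s.makespan = s.finish[i];
    }
    return s;
}
```

Because the whole table of cores and start times is produced ahead of execution, the behavior is traceable and adds no scheduling overhead at run time, which is the property the slide asks for.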
Analysis of A Crankshaft Program Using Macro Task Fusion
There is not enough parallelism
Cannot schedule MTs at compile time
[Figure: MTG of crankshaft program before macro task fusion; MTG of crankshaft program after macro task fusion]
sb4 and block5 account for over 90% of the whole execution time.
MTG of Crankshaft Program Using Inline Expansion and Duplicating If-statements
Successfully increased coarse grain parallelism
Critical Path(CP)
CP accounts for over 99% of whole execution time.
MTG of crankshaft program before restructuring
Critical Path(CP)
CP accounts for about 60% of whole execution time.
MTG of crankshaft program after restructuring
Succeeded in reducing the CP
99% -> 60%
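Why the critical path matters can be shown with a small calculation: no schedule can finish faster than the critical path, so the speedup is bounded by total work divided by CP length. A CP of 99% of the execution time bounds the speedup at about 1.01 times, while a CP of 60% raises the bound to about 1.67 times. The sketch below uses hypothetical names (`critical_path`, `speedup_bound`) and assumes tasks are numbered in topological order.

```c
#include <assert.h>
#include <stdint.h>

/* Longest cost-weighted path through a task graph whose tasks are numbered
 * in topological order (every predecessor has a smaller index).
 * preds[i] has bit j set when task i depends on task j. */
static int critical_path(const int *cost, const uint32_t *preds, int n)
{
    assert(n >= 0 && n <= 32);
    int finish[32];
    int cp = 0;
    for (int i = 0; i < n; i++) {
        int start = 0;
        for (int j = 0; j < i; j++)
            if (((preds[i] >> j) & 1u) && finish[j] > start)
                start = finish[j];
        finish[i] = start + cost[i];
        if (finish[i] > cp) cp = finish[i];
    }
    return cp;
}

/* Upper bound on speedup with unlimited cores: total work / critical path. */
static double speedup_bound(int total_cost, int cp_cost)
{
    assert(cp_cost > 0);
    return (double)total_cost / (double)cp_cost;
}
```

For example, `speedup_bound(100, 99)` and `speedup_bound(100, 60)` correspond to the before and after restructuring cases on the slide.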
Evaluation Environment : Embedded Multi-core Processor RPX
- SH-4A 648MHz * 8
– As a first step, we use just two SH-4A cores because target dual-core processors are currently under design for next-generation automobiles
Evaluation of Crankshaft Program with Multi-core Processors
- Attained a 1.54 times speedup on RPX
– The program has no loops, only many conditional branches and small basic blocks, which makes it difficult to parallelize
- This result shows the possibility of multi-core processors for engine control programs
[Figure: execution time (us) and speedup ratio. 1 core: 0.57 us, 1.00; 2 cores: 0.37 us, 1.54]
Performance of OSCAR Compiler on Intel Core i7 Notebook PC
- OSCAR Compiler accelerates Intel Compiler about 2.0 times on average
Speedup ratio:
Benchmark        Intel Compiler Ver.14.0   OSCAR
SPEC95 su2cor    1.00                      1.70
SPEC95 hydro2d   1.18                      2.24
SPEC95 mgrid     2.70                      4.12
SPEC95 turb3d    1.00                      2.91
AAC Encoder      1.00                      2.53
CPU: Intel Core i7 3720QM (Quad-core)
MEM: 32GB DDR3-SODIMM PC3-12800
OS: Ubuntu 12.04 LTS
Parallel Processing of JPEG XR Encoder on TILEPro64
Multimedia Applications: Sequential C Source Code Parallelized C Program with OSCAR API
OSCAR Compiler
Parallelized Executable Binary for TILEPro64 API Analyzer + Sequential Compiler Cache Allocation Setting
[Figure: Speedup (JPEG XR Encoder) vs. number of cores. 1 core: 1x; 64 cores: 28x with default cache allocation, 55x with our cache allocation]
(1)OSCAR Parallelization (2)Cache Allocation Setting
Local cache optimization: the parallel data structure (tile) on the heap is allocated to the local cache, giving a 55x speedup on 64 cores
AAC Encoder, JPEG XR Encoder, Optical Flow Calc.
[Figure: TILEPro64 8x8 tile grid with I/O and Memory Controllers 0 to 3]
Parallel Processing of Face Detection on Manycore, High-end and PC Server
- OSCAR compiler gives us 11.55 times speedup for 16 cores against 1 core on SR16000 Power7 high-end server
Speedup ratio by number of cores:
Platform (compiler)                          1      2      4      8      16
tilepro64 (gcc)                              1.00   1.72   3.01   5.74   9.30
SR16k, Power7 8core*4cpu*4node (xlc)         1.00   1.93   3.57   6.46   11.55
rs440, Intel Xeon 8core*4cpu (icc)           1.00   1.93   3.67   6.46   10.92
Automatic Parallelization of Face Detection
92 Times Speedup against the Sequential Processing for GMS Earthquake Wave Propagation Simulation on Hitachi SR16000
(Power7 Based 128 Core Linux SMP)
[Figure: speedup against sequential processing with OSCAR for 1, 2, 4, 8, 16, 32, 64 and 128 PEs]
Profile-Based Automatic Parallelization and Sequential Program Tuning for Android 2D Rendering on Nexus7
OSCAR Compiler
Profile-Based Automatic Parallelization
[Figure: tool flow with the following elements]
- Original Source Files
- ARM Compiler (GCC)
- Sequential Binary File
- Profile Result
- Hotspot Analyzer
- Analyze Result
- Hotspot Source File
- Rewrite to Parallelizable C
- OSCAR Parallelization Compiler
- Parallelized Source Files with OSCAR API
- OSCAR API Analyzer
- Parallelized Source Files
- ARM Compiler (GCC)
- Parallelized Binary File
Green Computing Systems Research and Development Center, Waseda University

Android 2D Rendering “Skia”
Standard library which performs 2D rendering on Android
Skia 2D Rendering Pipeline
- Path Generation: transform a figure into paths
- Rasterization: transform a path into a bitmap (mask)
- Shading: make color data from a command
- BitBlit: calculate and transfer the display image from the source image, mask and destination image
Data: paths, mask, source image, destination image, modified destination image
Execution flow is different on each benchmark.
High load on the BitBlit process.

DrawArc profile:
SkRGB16_Blitter::blitH       81.84%
memset_128_loop               2.70%
sk_fill_path                  2.21%
memset32_loop128              1.42%
SkString::~SkString()         0.61%
Others                       11.22%

DrawRect profile:
SkRGB16_Blitter::blitRect    97.43%
Others                        2.57%
/* Hotspot loop moved into a separate function; each row address is
   computed from the loop index instead of being carried between iterations. */
void SkRGB16_Blitter_blitRect_oscar(int width, int height, uint16_t* device,
                                    unsigned deviceRB, SkPMColor src32) {
    int i;
    uint16_t* deviceTMP;
    for (i = height; i > 0; i--) {
        deviceTMP = (uint16_t*)((char*)device + (deviceRB * (height - i)));
        blend32_16_row(src32, deviceTMP, width);
    }
}

/* The C++ member function now only gathers its arguments and calls the loop. */
void SkRGB16_Blitter::blitRect(int x, int y, int width, int height) {
    SkASSERT(x + width <= fDevice.width() && y + height <= fDevice.height());
    uint16_t* SK_RESTRICT device = fDevice.getAddr16(x, y);
    unsigned deviceRB = fDevice.rowBytes();
    SkPMColor src32 = fSrcColor32;
    SkRGB16_Blitter_blitRect_oscar(width, height, device, deviceRB, src32);
}
Tuning Method for “Skia” [DrawRect]: original source code vs. code after tuning
After tuning, the true dependency on variable deviceRB is resolved and the hotspot loop is separated from the C++ file into C.
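The effect of this tuning can be reproduced in a small self-contained sketch. It is not Skia code: `blend_row` is a hypothetical stand-in for `blend32_16_row`, and `blit_carried` and `blit_rows` are illustrative names. The first loop carries the row pointer from one iteration to the next; the second derives each row address from the row index alone, so disjoint row ranges can be handed to different cores.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for Skia's blend32_16_row: updates one row. */
static void blend_row(uint16_t color, uint16_t *row, int width)
{
    for (int x = 0; x < width; x++)
        row[x] = (uint16_t)(row[x] + color);
}

/* Original shape: each iteration needs the pointer produced by the
 * previous one (loop-carried dependence through `device`). */
static void blit_carried(uint16_t *device, size_t row_bytes,
                         int width, int height, uint16_t color)
{
    while (--height >= 0) {
        blend_row(color, device, width);
        device = (uint16_t *)((char *)device + row_bytes);
    }
}

/* Tuned shape: the row address is computed from the row index alone, so
 * any sub-range of rows can be processed independently of the others. */
static void blit_rows(uint16_t *device, size_t row_bytes, int width,
                      int row_begin, int row_end, uint16_t color)
{
    assert(row_begin >= 0 && row_begin <= row_end);
    for (int r = row_begin; r < row_end; r++) {
        uint16_t *row = (uint16_t *)((char *)device + row_bytes * (size_t)r);
        blend_row(color, row, width);
    }
}
```

Calling `blit_rows` once per row range, in any order, writes the same pixels as a single `blit_carried` call over the whole rectangle.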
Evaluations
Platform: Google NEXUS 7, NVIDIA Tegra3 chip (ARM Cortex-A9, 4 cores, clock frequency 1.2 [GHz])
Benchmark: 0xbenchmark 2D Canvas Test
Performance in FPS (frames per second); only the DrawRect and DrawCircle2 axis labels survive in the chart:
Benchmark                 Ordinary compilation   OSCAR   Speedup
DrawRect                  22.82                  43.57   1.91x
(label not recoverable)   38.58                  50.98   1.32x
(label not recoverable)   33.86                  50.77   1.50x
(label not recoverable)   27.16                  52.88   1.95x
[Figure: CPU Load Graph [DrawRect]]
“Skia” Profiling
[Figure: profiling by OProfile; labels: Android, Skia, hotspot information, parallelizing, multicore, C]
Sequential Skia, 1PE: 1 frame 50 msec, finish 15 sec
Parallelized Skia, 3PE (OSCAR API Runtime Library): 1 frame 28 msec, finish 8 sec
Original source code:
void SkRGB16_Blitter::blitRect(int x, int y, int width, int height) {
    SkASSERT(x + width <= fDevice.width() && y + height <= fDevice.height());
    uint16_t* SK_RESTRICT device = fDevice.getAddr16(x, y);
    unsigned deviceRB = fDevice.rowBytes();
    SkPMColor src32 = fSrcColor32;
    /* The row pointer is carried from one iteration to the next. */
    while (--height >= 0) {
        blend32_16_row(src32, device, width);
        device = (uint16_t*)((char*)device + deviceRB);
    }
}
DrawCircle2 profile:
SkRGB16_Blitter::blitAntiH   78.60%
SkRGB16_Blitter::blitRect     8.54%
SkAlphaRuns::add              2.47%
SuperBlitter::blitH           2.37%
SkAlphaRuns::Break            2.02%
Others                        6.01%
1.50x
Parallelization of 2D Rendering Engine SKIA on 3 cores of Google NEXUS7
On Nexus7, 3-core parallelization gave us a 1.91 times speedup for DrawRect and a 1.95 times speedup for DrawImage.
DrawRect FPS: 22.82 (ordinary 1-core execution) to 43.57 (parallelized 3-core execution), ×1.91
DrawImage FPS: 27.16 (ordinary 1-core execution) to 52.88 (parallelized 3-core execution), ×1.95
http://www.youtube.com/channel/UCS43lNYEIkC8i_KIgFZYQBQ
Low-Power Optimization with OSCAR API
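Scheduled result by OSCAR Compiler: macro tasks MT1 to MT4 are assigned to VC0 and VC1, with a Sleep period on one of them.
Generated code image by OSCAR Compiler: main_VC0() begins with MT1 and main_VC1() begins with MT2. The generated code contains the directives
    #pragma oscar fvcontrol \
        (1,(OSCAR_CPU(),100))
    #pragma oscar fvcontrol \
        ((OSCAR_CPU(),0))
together with the Sleep period, MT3 and MT4.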
Power Reduction of MPEG2 Decoding to 1/4 on 8 Core Homogeneous Multicore RP-2 by OSCAR Parallelizing Compiler
MPEG2 Decoding with 8 CPU cores
- Without Power Control (Voltage: 1.4V): Avg. Power 5.73 [W]
- With Power Control (Frequency, Resume Standby: Power shutdown & Voltage lowering 1.4V-1.0V): Avg. Power 1.52 [W]
73.5% Power Reduction
33 Times Speedup Using OSCAR Compiler and OSCAR API on RP-X
(Optical Flow with a hand-tuned library)
Speedups against a single SH processor:
Configuration   Speedup
1SH             1
2SH             2.29
4SH             3.09
8SH             5.4
2SH+1FE         18.85
4SH+2FE         26.71
8SH+4FE         32.65
Frame rate: 3.4 [fps] (1SH) to 111 [fps] (8SH+4FE)
Power Reduction in a real-time execution controlled by OSCAR Compiler and OSCAR API on RP-X (Optical Flow with a hand-tuned library)
Without Power Reduction: Average 1.76 [W]
With Power Reduction by OSCAR Compiler: Average 0.54 [W]
1 cycle: 33 [ms] → 30 [fps]
70% of power reduction
Automatic Power Reduction for MPEG2 Decode on Android Multicore
ODROID X2, ARM Cortex-A9, 4 cores
- On 3 cores, Automatic Power Reduction control reduced power to 1/7 of that without Power Reduction control.
- 3 cores with the compiler power reduction control reduced power to 1/3 of ordinary 1-core execution.
Power Consumption [W] by Number of Processor Cores:
Cores   Without Power Reduction   With Power Reduction   Ratio
1       0.97                      0.63                   2/3 (-35.0%)
2       1.88                      0.46                   1/4 (-75.5%)
3       2.79                      0.37                   1/7 (-86.7%)
3 cores with Power Reduction vs. 1 core without: 1/3 (-61.9%)
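The reduction percentages can be recomputed from the measured watt values with a one-line formula. The helper below is only an arithmetic check, and the name `reduction_percent` is hypothetical.

```c
#include <assert.h>

/* Percentage by which `with_w` is lower than `without_w`. */
static double reduction_percent(double without_w, double with_w)
{
    assert(without_w > 0.0);
    return (1.0 - with_w / without_w) * 100.0;
}
```

For instance, `reduction_percent(2.79, 0.37)` gives the 3-core figure and `reduction_percent(0.97, 0.37)` compares 3 cores with power control against ordinary 1-core execution. The values agree with the slide to within rounding of the displayed watts.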
http://www.youtube.com/channel/UCS43lNYEIkC8i_KIgFZYQBQ
Automatic Power Reduction on 4 core Intel Haswell
- Haswell Processor
– OS: Ubuntu 13.10
– CPU: Intel Core i7 4770K
- 4 cores
- L1 Cache: Load 64 Bytes/cycle, Store 32 Bytes/cycle
- L2 Cache: 64 Bytes/cycle
- L3 Cache: 8 MB
- Frequency: 3.5 GHz to 0.8 GHz
– Memory: 16 GB (8 GB × 2)
Power Reduction on Intel Haswell for Real-time Optical Flow
Power was reduced to 1/4 by the compiler power optimization on the same 3 cores.
The power with 3 cores was reduced to 1/3 against 1-core ordinary execution.
For HD 720p (1280x720) moving pictures, 15 fps (deadline 66.6 [ms/frame])
Average Power [W] by No. of Processor Cores:
Cores   Without Power Control   With Power Control
1       29.29                   28.40
2       36.59                   13.22
3       41.58                   10.49
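Power control under a per-frame deadline can be illustrated by picking the lowest clock frequency that still finishes the frame's work in time and leaving the remaining slack idle. This is a generic DVFS sketch under stated assumptions, not the OSCAR Compiler's algorithm: the function names, the cycle counts and the frequency list passed in by the caller are hypothetical, and only the 66.6 ms deadline comes from the slide.

```c
#include <assert.h>
#include <stddef.h>

/* Return the lowest frequency (MHz) from `freqs_mhz` (sorted ascending)
 * at which `cycles` of work finish within `deadline_ms`, or 0 if none. */
static unsigned pick_frequency_mhz(const unsigned *freqs_mhz, size_t n,
                                   double cycles, double deadline_ms)
{
    assert(deadline_ms > 0.0);
    for (size_t i = 0; i < n; i++) {
        /* f MHz corresponds to f * 1000 cycles per millisecond. */
        double time_ms = cycles / ((double)freqs_mhz[i] * 1000.0);
        if (time_ms <= deadline_ms)
            return freqs_mhz[i];
    }
    return 0;
}

/* Idle time left in the frame at the chosen frequency. */
static double slack_ms(unsigned freq_mhz, double cycles, double deadline_ms)
{
    assert(freq_mhz > 0);
    return deadline_ms - cycles / ((double)freq_mhz * 1000.0);
}
```

With more cores sharing the frame's work, each core has fewer cycles to execute before the deadline, so a lower frequency (or a longer idle period) becomes feasible. That is one plausible reading of why power falls as cores are added under power control.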
Power Waves for 1 Core to 3 Cores without the Compiler Power Control on Intel Haswell for Real-time Optical Flow
2014/6/17 DEMO
[Figure: voltage (V), current (A) and power (W) waveforms without power control. Power: 1 core 29.29 W, 2 cores 36.59 W, 3 cores 41.58 W]
Power Waves for 1 Core to 3 Cores with the Compiler Power Control on Intel Haswell for Real-time Optical Flow
[Figure: voltage (V), current (A) and power (W) waveforms with power control. Power: 1 core 28.40 W, 2 cores 13.22 W, 3 cores 10.49 W]
Power for 1 & 3 Cores without Control vs. for 3 Cores with Control on Haswell
[Figure: power waveforms. Without Power Control: 1 core 29.29 W, 3 cores 41.58 W. With Power Control: 3 cores 10.49 W]
Future Multicore Products
Solar powered, with more than 100 times better power efficiency (FLOPS/W)
Next Generation Automobiles
- Safer, more comfortable, energy efficient, environment friendly
- Cameras, radar, car2car communication, internet information integrated with brake, steering, engine, motor control
Smart phones
- From everyday recharging to less than once a week
- Solar powered operation in emergency condition
- Keep health
Advanced medical systems
- Cancer treatment, drinkable inner camera
- Emergency solar powered
- No cooling fan, no dust, clean, usable inside OP room
Personal / Regional Supercomputers
- Regional Disaster Simulators saving lives from tornadoes, localized heavy rain, fires with earthquakes