KPN on heterogeneous multi / many-core processors Yukoh Matsumoto, - - PowerPoint PPT Presentation

Debugging of application software based on KPN on heterogeneous multi / many-core processors Yukoh Matsumoto, Ph.D. President & CEO, Architect TOPS Systems Corp. Multicore / Manycore provider in Japan MAD 2013 Agenda Porting Application


SLIDE 1

MAD 2013

Debugging of application software based on KPN on heterogeneous multi / many-core processors

Yukoh Matsumoto, Ph.D. President & CEO, Architect

TOPS Systems Corp.

Multicore / Manycore provider in Japan

SLIDE 2

Agenda

Porting Application onto Heterogeneous Manycore

Case Study : Real-Time Ray Tracing, 800TFLOPS on Desk Top Machine

Architecture & Algorithm Co-Design

Deep Performance Analysis

Software Partitioning into Kahn Process Network

System Performance Modeling

System Performance Simulation

Debugging Issues and Challenges

Working on Better Solutions

Conclusions

SLIDE 3

Parallel Processing Goal

Exploiting Coarse-Grained Task, Data, and Pipeline Parallelism in Stream Programs Michael I. Gordon, William Thies, and Saman Amarasinghe, MIT

[Figure: progression from Single CPU to Dual Core CPU, CPU + GPU, and TOPSTREAM (Cluster, Core)]

Exploit more parallelism for higher performance

SLIDE 4

Our Multi-/Many-core SW Development Flow

Sequential to Distributed Processing

  1. Sequential Computation
  2. Decomposition into Fine Grain Tasks and Multiple Processes
  3. Distributed Processing (Kahn Process Network)
  4. Mapping
  5. Distributed Processing on Multicore

Verification : Functional Verification, Performance Verification

Debugging of both Functionality and Performance

SLIDE 5

Programming Model based on KPN

Basic Distributed Processing based on KPN

  • Process works with local memory
  • Point-to-point unidirectional communication channel
  • Finite depth of FIFO within a channel
  • Process consists of multiple tasks
  • Single executable task

Global Memory

  • Allows read-only access to Global Memory
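The programming model above can be illustrated with a minimal Python sketch. It is not TOPS Systems code: threads stand in for processes, `queue.Queue(maxsize=...)` stands in for a finite-depth FIFO channel, and the names `GLOBAL_MEMORY`, `END`, `producer`, `scale`, `consumer`, and `run_network` are hypothetical. Each process touches only its local variables and its channels; global memory is only read.

```python
import queue
import threading

GLOBAL_MEMORY = (1, 2, 3, 4, 5)   # read-only: a tuple cannot be modified
END = None                        # hypothetical end-of-stream token


def producer(out_ch):
    # Reads the read-only global memory and streams it out.
    for value in GLOBAL_MEMORY:
        out_ch.put(value)         # blocks while the FIFO is full
    out_ch.put(END)


def scale(in_ch, out_ch, factor):
    # Works only on local state and on tokens read from its input channel.
    while True:
        token = in_ch.get()       # blocking read is the only way to receive data
        if token is END:
            out_ch.put(END)
            return
        out_ch.put(token * factor)


def consumer(in_ch, result):
    while True:
        token = in_ch.get()
        if token is END:
            return
        result.append(token)


def run_network(fifo_depth=2):
    a_to_b = queue.Queue(maxsize=fifo_depth)   # point-to-point, unidirectional
    b_to_c = queue.Queue(maxsize=fifo_depth)
    result = []
    procs = [
        threading.Thread(target=producer, args=(a_to_b,)),
        threading.Thread(target=scale, args=(a_to_b, b_to_c, 10)),
        threading.Thread(target=consumer, args=(b_to_c, result)),
    ]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    return result


if __name__ == "__main__":
    print(run_network())
```

Because every read blocks and every channel has exactly one writer and one reader, the result does not depend on thread scheduling or on the FIFO depth; the depth only changes how often a process has to wait.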

SLIDE 6

Experiences : Sequential to Distributed Processing

Computer Vision

  • SIFT ; 10 cores, 4 cores, 8 cores
  • Haar-Like ; 4 cores
  • SVM ; 4 cores

Computer Graphics

  • Ray Tracing ; 73 cores

Codec

  • H.264 Decoder ; 10 cores
  • JPEG Decoder ; 10 cores

Wireless Communication

  • 802.11b MAC and Baseband ; 4 cores

Distributed Processing on Heterogeneous Multicore / Manycore

SLIDE 7

Ultra-Accurate Real-Time Ray Tracing

  • Color Model with 35 bands
  • Rendering on Free Surface (Bezier)
  • HD (1920 x 1080 pixels) @ 30 frame/s

Performance Requirement

  • ≈ 800 TFLOPS / system; 88 TFLOPS / chip

LSI Design (Estimated)

  • Technology : TSMC 45nm
  • Clock Frequency : 750 MHz
  • Chip Size : 17mm × 17mm
  • Logic : 267.7 MGate
  • Memory : 23 Mbit

Case Study : TOPSTREAM™ RTRT

※ Joint R&D with TOYOTA Motor & NIHON UNISYS

[Figure: system and chip block diagram, with an image generated by visual simulation]

  • Desk Top Machine : Master Node (CPU, Memory, HDD) and Slave Node
  • Size : 60cm × 60cm × 20cm = 72,000cm3 ; Power Consumption : 1000W (max)
  • TOPSTREAM™ RTRT : 73 Heterogeneous Manycore ; a Cluster includes 9 Heterogeneous Cores
  • Chip-level labels : C0 to C7 ; 64-bit RISC Processor ; L2-I Memory (Code) 96kByte ; L2-D Memory (Data) 784kByte ; Peripheral Bus ; Bus Bridge ; I/O ; Distributed Arbitration On Chip Global bus (TOPSTREAM™ bus)
  • Cluster-level labels : Core0 to Core7, each with L0 ; L1-I Memory 64kByte ; L1-D Memory 128kByte ; I$ ; D$ ; Bus Ctrl ; I/O ; Local bus (TOPSTREAM™ bus)
  • Other blocks : Memory Interface, MMP, Inter Core Event Interface

Chip count comparison

  • Intel CPU : 100k Chips
  • NVIDIA GPU : 20k~30k Chips
  • TOPSTREAM : 9 Chips

Heterogeneous Many Core : 0.88TFLOPS/W
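The headline figures on this slide can be cross-checked with a few lines of arithmetic. The sketch below uses only numbers stated in the slides (9 chips, 88 TFLOPS per chip, 88TFLOPS@100W, 1920 x 1080 at 30 frame/s). The per-pixel budget is a derived upper bound that assumes every peak FLOP goes into pixel rendering, which the slides do not state.

```python
CHIPS = 9
TFLOPS_PER_CHIP = 88.0
WATTS_PER_CHIP = 100.0          # from the 88TFLOPS@100W requirement
WIDTH, HEIGHT, FPS = 1920, 1080, 30

system_tflops = CHIPS * TFLOPS_PER_CHIP              # 792, quoted as about 800
tflops_per_watt = TFLOPS_PER_CHIP / WATTS_PER_CHIP   # 0.88 TFLOPS/W
chip_power_watts = CHIPS * WATTS_PER_CHIP            # 900 W, under the 1000 W maximum
pixels_per_second = WIDTH * HEIGHT * FPS             # 62,208,000

# Assumption: peak throughput spread evenly over all pixels of every frame.
flop_per_pixel = system_tflops * 1e12 / pixels_per_second

if __name__ == "__main__":
    print(system_tflops, tflops_per_watt, chip_power_watts)
    print(pixels_per_second, round(flop_per_pixel))
```

Under that assumption the budget is on the order of ten million floating-point operations per pixel per frame.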

SLIDE 8

Heterogeneous Multi-Core drives Computer Graphics Paradigm Shift

Synthesis

  • Animations, movies, video games
  • Algorithms : Polygon based Ray Tracing
  • Computer Performance Requirement : ~ 1TFLOPS

Reproduction

  • Replace prototypes and samples
  • Industrial Design & Showrooms : Automotives, Buildings, Houses, etc.
  • Algorithms : Natural Surface based Ray Tracing, Photon Mapping
  • Computer Performance Requirement : 100’s TFLOPS ~

SLIDE 9

Architecture-Algorithm Co-Design for Application Domain Specific Computing

[Figure: Architecture-Algorithm Co-Design flow]

  • Application : “Ray Tracing”, “Photon Mapping”
  • Requirements : 88TFLOPS@100W, 750MHz
  • System Spec : SW Spec, HW Spec
  • HW/SW Co-Design : SW design, HW design, Analysis, HW/SW Co-Verification
  • TOPSTREAM™ Platform IP : Base HW (RTL), TS-ISIM (ISS), TOPS_Lib (RTL), Patents
  • SW Partitioning : Legacy Software (Sequential) → Splitting → Fine Grain Tasks → Grouping → Process Elements → Co-operating → Distributed Processing KPN model → Mapping → Distributed Processing Multi-Core
  • Algorithmic Optimization, Architectural Optimization, Performance & Power Simulation

Reduces Walls

  • ILP Wall
  • Memory Wall
  • Power Wall

Performance vs. Power Optimization

  Performance = f × IPC
  Power = ½ α C V² f

TOPSTREAM™ Architecture
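The two relations on this slide, Performance = f × IPC and Power = ½ α C V² f, explain why spreading work over more, slower cores can save power. The sketch below is a worked example under loud assumptions that are not in the slides: the supply voltage is scaled down in proportion to frequency, IPC per core stays the same, the work parallelizes perfectly, and the values of `ALPHA` and `CAP` are hypothetical.

```python
import math


def performance(f_hz, ipc, cores=1):
    # Performance = f x IPC, summed over identical cores.
    return cores * f_hz * ipc


def dynamic_power(alpha, c_farad, v_volt, f_hz, cores=1):
    # Power = 1/2 * alpha * C * V^2 * f, summed over identical cores.
    return cores * 0.5 * alpha * c_farad * v_volt ** 2 * f_hz


ALPHA, CAP = 0.2, 1e-9                                # hypothetical activity factor and capacitance
one_fast = dict(v_volt=1.0, f_hz=750e6, cores=1)      # one core at 750 MHz
two_slow = dict(v_volt=0.5, f_hz=375e6, cores=2)      # two cores at half frequency and half voltage

perf_ratio = (performance(two_slow["f_hz"], 1.0, two_slow["cores"])
              / performance(one_fast["f_hz"], 1.0, one_fast["cores"]))
power_ratio = (dynamic_power(ALPHA, CAP, **two_slow)
               / dynamic_power(ALPHA, CAP, **one_fast))

if __name__ == "__main__":
    print(perf_ratio, power_ratio)
```

With these assumptions the two-core configuration delivers the same performance at one quarter of the dynamic power, because V² f shrinks by a factor of eight per core.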

SLIDE 10

Architecture & Algorithm Co-Design

Optimizations go Bidirectional

[Figure: software components SW-C1 to SW-C7, processes P-1 to P-4, and a multi-core target (CPU, Core1 to Core4, OS, Bus / Network). Models shown: Distributed Processing model, KPN model, Multi-core model]

  • Partitioning based on analysis
  • Map to KPN : Merging, FIFO
  • Mapping onto cores : Static, Dynamic
  • Functional Equivalency Checking
  • Communication method Optimization : Network Topology, Functional Partitioning
  • Optimization : Extended Instructions, Memory Hierarchy, etc.

More than 10x performance improvement can be expected

SLIDE 11

System Level Architecture

Distributed Processing with KPN

  – Non-Shared Memory Processes
  – Zero-Overhead Message Passing Mechanism (ZOMP)

Combination of Parallelisms

  – Distributed Parallel Processing (Task, Pipeline)
  – Data Parallelism (High-Level, Instruction Level)

Stream Processing (Core)

  – Kernel
  – Stream-In (Read Message)
  – Stream-Out (Write Message)

Optimization of Core

  – Support Stream Processing : background Stream
  – Complex Inst : Reduction of Kernel cycle
  – FIFO support mechanism
  – Reduction of energy for instruction / data supply

[Figure: Kahn Process Network combining data and task parallelism. Task-A to Task-D are connected through FIFOs and use Local Memory; labels include Task Parallel, Data Parallel, and Data Parallel (SIMD). Timeline caption: the core can keep processing the kernel]

Combination of Parallelisms, Stream Processing, and ASIP

SLIDE 12

Basic concept of stream processing:

“Maximize processor efficiency”

Careful Scheduling of Stream-In and Stream-Out
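Why the scheduling of Stream-In and Stream-Out matters can be seen in a small timing model. The sketch below is an illustration, not the TOPSTREAM scheduler: it assumes each data block needs a stream-in, a kernel, and a stream-out phase, that the three phases can run as a pipeline in the background of each other, and that buffering between phases is unlimited. The function names and cycle counts are hypothetical.

```python
def serial_cycles(blocks):
    # The core does stream-in, kernel, and stream-out one after another.
    return sum(i + k + o for i, k, o in blocks)


def overlapped_cycles(blocks):
    # Stream-in of the next block and stream-out of the previous block
    # proceed in the background while the kernel processes the current block.
    in_done = kernel_done = out_done = 0
    for stream_in, kernel, stream_out in blocks:
        in_done += stream_in
        kernel_done = max(in_done, kernel_done) + kernel
        out_done = max(kernel_done, out_done) + stream_out
    return out_done


if __name__ == "__main__":
    blocks = [(10, 40, 10)] * 4          # (stream-in, kernel, stream-out) cycles
    print(serial_cycles(blocks), overlapped_cycles(blocks))
```

In the example the kernel is busy for 160 of 180 cycles when streaming is overlapped, against 160 of 240 cycles when it is not. When stream-in is slower than the kernel, the same model shows the kernel starving, which is the case careful scheduling has to avoid.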

SLIDE 13

Real-Time Ray Tracing

Performance Analysis Result Examples

Performance Requirement Analysis

  • Performance / Frame
  • Performance / Area
  • Performance / Ray
  • Performance / Ray Type
  • Performance / Function
  • Computation / Function
  • Memory Access / Function

  ・Memory Allocation
  ・Memory Hierarchy
  ・Special Instruction
  ・Floating Point to Fixed Point
  ・Processing unit

The big challenge was huge dynamic load changes (max. 3751 x)

SLIDE 14

Partitioning and KPN model for Ray Tracing

Partitioning of Ray Tracing Process

  • Based on processing flow : Functional partitioning
  • Mapping of Processor ics

Kahn Process Network (KPN) model for Ray Tracing

[Figure: KPN for ray tracing. Legend: Process (local memory) ; Channel (point-to-point, FIFO)]

  • Coarse flow : Ray Generation, Space Check, BBox Check, Surface Intersect, Create Node
  • Processes : Input, Ray Generation, Space Check, Voxel Traverse, BBox Check, Surf Intersect, Trim Rough Check, Trim Check, Depth Test, Create Node, Haikei (background) Intersect, Object Lighting, Background Lighting, Light, Output

Rendering on Intersect with Color and Lighting

Two Levels of Functional Verification

  • 1st Level : Each Process
  • 2nd Level : Whole KPN

Equivalency Test with a number of Input / Output Data Sets
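The two verification levels can be sketched as follows. This is a toy illustration, not the equivalency checker used in the project: the stage functions are stand-ins with hypothetical names and arithmetic, and KPN processes are modeled as Python generators that consume one stream and produce another.

```python
# Reference functions taken from the "sequential" program (toy stand-ins).
def seq_intersect(ray):
    return ray * 2


def seq_light(hit):
    return hit + 1


def seq_render(rays):
    return [seq_light(seq_intersect(r)) for r in rays]


# KPN-style processes: each consumes an input stream and yields an output stream.
def p_intersect(in_stream):
    for ray in in_stream:
        yield ray * 2


def p_light(in_stream):
    for hit in in_stream:
        yield hit + 1


def kpn_render(rays):
    return list(p_light(p_intersect(iter(rays))))


def check_process(process, reference, samples):
    """1st level: one process against its sequential counterpart."""
    return list(process(iter(samples))) == [reference(s) for s in samples]


def check_network(data_sets):
    """2nd level: the whole KPN against the whole sequential program."""
    return all(kpn_render(d) == seq_render(d) for d in data_sets)


if __name__ == "__main__":
    print(check_process(p_intersect, seq_intersect, [0, 1, 5]))
    print(check_network([[1, 2, 3], [], [7]]))
```

Running the first level before the second keeps a failing data set attributable to one process rather than to the network as a whole.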

SLIDE 15

Debugging of Multicore is crazy!

[Figure: 11 cores, each executing its own instruction stream, meeting at synchronization points]

SLIDE 16

Human nature

For typical engineers,

  • can follow “a single instruction stream” for debugging
  • make mistakes with “two instruction streams”
  • no way with “three instruction streams”

Key for Multicore Debugging

  • Extract “One Stream” of information, and concentrate on it.
  • Provide tools to be able to concentrate on debugging

SLIDE 17

Debugging of application with KPN model

Something is wrong in the Filter function

QCP 1 QCP 2 QCP 3 QCP 4

Focus on FIFO
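Focusing on FIFOs works because the token sequence on a KPN channel does not depend on scheduling, so each channel gives one stream that can be compared against a known-good run. The sketch below is an illustration with hypothetical stage names and functions, not the MPArchitect tooling: it records what each stage writes to its output FIFO and reports the first channel whose trace deviates, which points at the process feeding that channel.

```python
def run_pipeline(stages, inputs):
    """Run the stages in order and record what each writes to its output FIFO."""
    traces = {}
    stream = list(inputs)
    for name, func in stages:
        stream = [func(token) for token in stream]
        traces[name] = stream
    return traces


def first_divergence(order, observed, expected):
    """Return (stage name, token index) of the first deviating FIFO, or None."""
    for name in order:
        got, want = observed[name], expected[name]
        if got != want:
            index = next(
                (i for i, (g, w) in enumerate(zip(got, want)) if g != w),
                min(len(got), len(want)),
            )
            return name, index
    return None


# Hypothetical three-stage pipeline; the buggy version breaks the filter for one value.
good = [("decode", lambda x: x + 1), ("filter", lambda x: x * 2), ("scale", lambda x: x - 3)]
buggy = [("decode", lambda x: x + 1), ("filter", lambda x: 0 if x == 3 else x * 2),
         ("scale", lambda x: x - 3)]

if __name__ == "__main__":
    order = [name for name, _ in good]
    expected = run_pipeline(good, [1, 2, 3])
    observed = run_pipeline(buggy, [1, 2, 3])
    print(first_divergence(order, observed, expected))
```

Checking channels in pipeline order matters: every FIFO downstream of the faulty process also deviates, so only the first deviation identifies the culprit.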

SLIDE 18

MPArchitect provides several tools

  • Activities inside core
  • Instruction Profile
  • On-Chip bus usage

Activity monitor helps programming for Low Power

SLIDE 19

Performance Considerations : Real-Time Ray Tracing

KPN based Distributed Processing Flow

[Figure: KPN process graph with numeric annotations on the channels (n, 4, 16, 23, 32, DEPTH) and markers ①, ②, ③]

  • Processes : Primary Ray Gen, Priority Selection, Ray Gen, Space Check, Voxel Traverse, BBoxCheck, Surf Intersect, Shadow Ray Gen, DepthTest/Haikei/CreateNode, Ray Tree Gen, Lighting (OL, BL), Reflection, Brightness, Virtual Process
  • Data : Selected Sub-Area, Lighting Sub-Area (35 band), Reflection Ray

Optimization for Load Balancing

  • Critical Path
  • Critical Loop and Buffers for Load Balancing

SLIDE 20

Performance Simulation

Architecture Simulation Model & Accuracy

Computation (cores)

  • sequential, untimed
  • concurrent, untimed
  • concurrent, estimated timed
  • concurrent, Instruction Accurate
  • concurrent, Cycle Accurate

Communication (Bus & Memory)

  • untimed
  • Transaction Level, timed
  • Cycle Accurate

Simulation Model : BusModel, CoreModel, Chip Partitioning, Cycle counts, Extraction of Bus Transactions

Result of Trade-Off between speed and accuracy

SLIDE 21

Hardware and Software Modeling Flow

[Figure: TOPSTREAM™ RTRT chip block diagram. Labels: C0 to C7 ; 64-bit RISC Processor ; L2-C Memory (Code) ; L2-D Memory (Data) ; Peripheral Bus ; Distributed Arbitration On Chip Global bus (TOPSTREAM™ bus) ; Bus Bridge ; I/O ; L1-C Memory (Code) ; L1-D Memory (Data) ; Local bus (TOPSTREAM™ bus) ; Core0 to Core7 with L0 Mem ; Inst Cache ; Data Cache ; Bus Ctrl ; IHB ; External Memory Interface ; External Memory ; MMP ; Inter Core Event Interface]

Labels in the modeling flow

  • Sequential Software written in C : Ray Generation, Surf Intersect, Lighting
  • Partitioning ; Architecture & Algorithm Optimization
  • Distributed Software (KPN) : C’ (C0), C’ (C1), C’ (C2), C’ (C3), ・・・・
  • Annotation of estimated cycle time : C’’ w/ cycles (C0), C’’ w/ cycles (C1), C’’ w/ cycles (C2), C’’ w/ cycles (C3), ・・・・
  • Architectural Design (TOPSTREAM™ RTRT) : ISA, Datapath Design, I/F
  • Modeling of Processor Cores
  • Modeling of Bus and Memory Hierarchy ; External Memory
  • System Memory Map : Stack, Heap, Text (Code), Library, Monitor, Data ; nrBez()
  • Initialization
  • Performance Simulation Model
  • Performance Simulation Results

SLIDE 22

Performance Simulation Results

Chip Level Performance Analysis

[Chart: Processing Time [M cycles], axis 10 to 80, cycle counts with 72-way parallelism. Bars: Core0, Core1, Core2, Core3, Core4, Core5(OL), Core5(BL), Chip. Series: core computation cycles (millions), required performance (cycles), target cycles]

  • 3.0 x the Target Cycles : overhead due to waiting for FIFO ready
    1) Input FIFO data
    2) Room for Output FIFO

2 X Improvement of Performance by

  • Optimizing FIFO depth by SW
  • Changing priority for critical loop control ; Ray Generation
  • Extending Application Specific Instructions

Figured out where to improve performance

SLIDE 23

Performance Simulation

Simulation results (Cycles/Area)

Area based processing

[Figure: area map of the image, 16 rows x 20 columns, areas numbered 1 to 320. 56 areas simulated ; red : 5 areas, first-priority areas ; blue : 5 areas, second-priority areas]

[Chart: core cycles per area, areas 1 to 316 on the axis, 5,000,000 to 25,000,000 cycles. Average=2,101,951 Total=670,328,434]

[Chart: active ratio (%) of Core0 to Core4 per area : Center of Body(1), Background(280), Rear Glass(69), Rear Lamp(35), Front Glass + BG(66), Ground(180), Wheel(82), Front Lamp(8), Roof(10), Front Glass(62)]

[Chart: utilization of each core, averaged over the whole image : Core0 to Core4 Active Ratio (%)]

Core Active Ratio (%) = ActiveCycle / FinalCycle

Average core utilization : 35%

Each core utilization (average)

Better Load balancing : max. 3751 x ⇒ max. 9 x
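The two metrics on this slide can be written down directly. The active ratio follows the slide's definition, ActiveCycle / FinalCycle. Reading the "max. 3751 x" and "max. 9 x" figures as the ratio between the heaviest and lightest load is an assumption made for this sketch; the slides do not define it. All input numbers below are hypothetical.

```python
def active_ratio(active_cycles, final_cycles):
    # Core Active Ratio (%) = ActiveCycle / FinalCycle
    return 100.0 * active_cycles / final_cycles


def load_imbalance(cycles_per_unit):
    # Assumed definition: heaviest load divided by lightest load.
    return max(cycles_per_unit) / min(cycles_per_unit)


def average_utilization(active_cycles_per_core, final_cycles):
    ratios = [active_ratio(a, final_cycles) for a in active_cycles_per_core]
    return sum(ratios) / len(ratios)


if __name__ == "__main__":
    print(active_ratio(350, 1000))
    print(load_imbalance([100, 900, 300]))
    print(average_utilization([500, 300, 250], 1000))
```

A low average utilization together with a high imbalance suggests that most cores are waiting on a few heavily loaded ones, which is what the load balancing work on this slide targets.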

SLIDE 24

Fast Verification Environment

  • Software Verification
    – TS-ISIM (in house tool) : Multicore ISS @ 17MIPS
    – Virtual COM port

  • System Verification
    – Standard Methodology : OVM
    – RTL Simulator : Questa
    – Test Bench : Virtual System
    – Hardware Emulator : Veloce
      A. Stand Alone : Core Level HW/SW Co-Verification
      B. Virtual System : Multicore System

  • Veloce (Emulator) connection to WS through TBX

SLIDE 25

ISS vs. RTL Comparison at Instruction Level

Objective

  – Speed-up Debugging of Applications at System Level

Instruction Trace : TS-ISIM (ISS)

  FA/PFA, Code, Rs1, Rs2, Rd, PSW, MRW, Addr/PAddr, Data
  FA/PFA, Code, Rs1, Rs2, Rd, PSW
  FA/PFA, Code, Rs1, Rs2, Rd, PSW
  ・・・・・

Instruction Trace : RTL

  FA/PFA, Code, Rs1, Rs2, Rd, PSW, MRW, Addr/PAddr, Data
  FA/PFA, Code, Rs1, Rs2, Rd, PSW
  FA/PFA, Code, Rs1, Rs2, Rd, PSW
  ・・・・・

Compare by Inst.

Bug is inside an instruction (1000 x performance with Hardware Emulator)

SLIDE 26

Challenges to provide tool sets for KPN programming

[Figure: tool flow from Sequential Software to Distributed Processing Software and Hardware]

  • Inputs : Sequential Software (C code), Sample Input, SHIM Model Description (XML)
  • Tools : KPN Editor, Profiler, Grouping, Fine Grain Partitioning, Data Conversion, Mapping, Architecture Exploration, Equivalence Checker (EQ?), Visualization Tool, SystemC Generator (Performance Est.)
  • SystemC Framework : SystemC, Synthesizable SystemC
  • Outputs : Error Report, Perf. Report (Perform., Cost), Distributed Processing Software, Hardware
  • Targets : SMYLEvideo, SMYLEref, TOPSTREAM™ <various multi-/many-core processors>
  • User : Visualization, Optimization, Exploration, Hard Wired

Goal : Exploration of best SW and best HW architecture

SLIDE 27

Thank you for your attention!

Post PC Era