MAD 2013
Debugging of application software based on KPN on heterogeneous multi / many-core processors
Yukoh Matsumoto, Ph.D.
President & CEO, Architect
TOPS Systems Corp.
Multicore / Manycore provider in Japan
Agenda
Porting Application onto Heterogeneous Manycore
Case Study : Real-Time Ray Tracing, 800TFLOPS on Desk Top Machine
Architecture & Algorithm Co-Design
Deep Performance Analysis
Software Partitioning into Kahn Process Network
System Performance Modeling
System Performance Simulation
Debugging Issues and Challenges
Working on Better Solutions
Conclusions
Parallel Processing Goal
Exploiting Coarse-Grained Task, Data, and Pipeline Parallelism in Stream Programs Michael I. Gordon, William Thies, and Saman Amarasinghe, MIT
[Figure labels: Single CPU; Dual Core CPU; CPU + GPU; TOPSTREAM (Cluster, Core)]
Exploit more parallelism for higher performance
Our Multi-/Many-core SW Development Flow
Sequential to Distributed Processing
Sequential Computation
Decomposition
Fine Grain Tasks
Multiple Processes
Mapping
Distributed Processing on Multicore
Distributed Processing (Kahn Process Network)
Functional Verification
Performance Verification
Debugging of both Functionality and Performance
Programming Model based on KPN
Basic Distributed Processing based on KPN
- Process works with local memory
- Point-to-point unidirectional communication channel
- Finite depth of FIFO within a channel
- Process consists of multiple tasks
- Single executable task
Global Memory
Allows read only access to Global Memory
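The programming model above can be sketched in a few lines of Python. This is a minimal illustration, not the TOPSTREAM implementation: threads stand in for cores, `queue.Queue(maxsize=depth)` stands in for a finite-depth FIFO channel, and the names `producer`, `scale`, `run_network` and the `None` end-of-stream marker are assumptions of this sketch. A strict Kahn network has unbounded FIFOs with blocking reads; blocking writes on a bounded FIFO are the usual practical variant.

```python
import queue
import threading

def producer(out_ch, items):
    # Works only on its own local data; talks to others through its channel.
    for x in items:
        out_ch.put(x)      # blocks while the finite-depth FIFO is full
    out_ch.put(None)       # end-of-stream marker (assumption of this sketch)

def scale(in_ch, out_ch, table):
    # 'table' stands in for global memory and is only ever read.
    while True:
        x = in_ch.get()    # blocking read on the input channel
        if x is None:
            out_ch.put(None)
            return
        out_ch.put(x * table["gain"])

def run_network(items, depth=2):
    table = {"gain": 3}
    a = queue.Queue(maxsize=depth)   # producer -> scale (point-to-point)
    b = queue.Queue(maxsize=depth)   # scale -> collector (point-to-point)
    threads = [
        threading.Thread(target=producer, args=(a, items)),
        threading.Thread(target=scale, args=(a, b, table)),
    ]
    for t in threads:
        t.start()
    result = []
    while True:
        y = b.get()
        if y is None:
            break
        result.append(y)
    for t in threads:
        t.join()
    return result
```

Because each process only blocks on its own channels, the output order is the same for any FIFO depth; only the timing changes.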
Experiences : Sequential to Distributed Processing
Computer Vision
- SIFT : 10 cores, 4 cores, 8 cores
- Haar-Like : 4 cores
- SVM : 4 cores
Computer Graphics
- Ray Tracing : 73 cores
Codec
- H.264 Decoder : 10 cores
- JPEG Decoder : 10 cores
Wireless Communication
- 802.11b MAC and Baseband : 4 cores
Distributed Processing on Heterogeneous Multicore / Manycore
Ultra-Accurate Real-Time Ray Tracing
- Color Model with 35 bands
- Rendering on Free Surface (Bezier)
- HD (1920 x 1080 pixels) @ 30 frame/s
Performance Requirement
- ≒800 TFLOPS / system; 88TFLOPS / chip
LSI Design (Estimated)
- Technology : TSMC 45nm
- Clock Frequency : 750 MHz
- Chip Size : 17mm×17mm
- Logic : 267.7 MGate
- Memory : 23 Mbit
Case Study:TOPSTREAM™ RTRT
※Joint R&D with TOYOTA Motor & NIHON UNISYS
Desk Top Machine
・60cm×60cm×20cm = 72,000cm³
・Power Consumption : 1000W (max)
Master Node : CPU, Memory, HDD
Slave Node
[Figure: cluster block diagram. Labels in the diagram:]
- C0 : 64-bit RISC Processor
- L2-I Memory (Code) : 96kByte; L2-D Memory (Data) : 784kByte
- L1-I Memory : 64kByte; L1-D Memory : 128kByte
- Core0 ... Core7 (C1 ... C7); L0; I$; D$
- Peripheral Bus; Distributed Arbitration On Chip Global bus (TOPSTREAM™ bus); Local bus (TOPSTREAM™ bus)
- Bus Bridge; I/O Bus Bridge; Bus Ctrl; I/O
- Memory Interface; MMP; Inter Core Event Interface
(A Cluster includes 9 Heterogeneous Cores)
TOPSTREAM™ RTRT (73 Heterogeneous Manycore)
(Image Generated by Visual Simulation)
・Intel CPU : 100k Chips
・NVIDIA GPU : 20k~30k Chips
・TOPSTREAM : 9 Chips
Heterogeneous Many Core : 0.88TFLOPS/W
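The headline numbers of the case study can be cross-checked with simple arithmetic. The sketch below assumes that the system consists of the 9 TOPSTREAM chips quoted above and that each chip runs at the 88TFLOPS@100W requirement stated on the co-design slide; the variable names are illustrative only.

```python
CHIPS = 9                    # "TOPSTREAM: 9 Chips"
TFLOPS_PER_CHIP = 88         # "88TFLOPS / chip"
WATTS_PER_CHIP = 100         # from the "88TFLOPS@100W" requirement
WIDTH, HEIGHT, FPS = 1920, 1080, 30

# 9 x 88 = 792 TFLOPS, i.e. roughly the quoted 800 TFLOPS per system
system_tflops = CHIPS * TFLOPS_PER_CHIP

# 88 / 100 = 0.88 TFLOPS/W, matching the quoted efficiency
tflops_per_watt = TFLOPS_PER_CHIP / WATTS_PER_CHIP

# Pixels to render per second at HD, 30 frame/s
pixels_per_second = WIDTH * HEIGHT * FPS

# Floating-point budget per pixel if the full 800 TFLOPS is used
flop_per_pixel = 800e12 / pixels_per_second

print(system_tflops, tflops_per_watt, pixels_per_second, round(flop_per_pixel))
```

Under these assumptions the budget is on the order of 13 million floating-point operations per pixel.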
Heterogeneous Multi-Core drives Computer Graphics Paradigm Shift
Synthesis
- Animations, movies, video games
- Algorithms : Polygon based Ray Tracing
- Computer Performance Requirement : ~ 1 TFLOPS
Reproduction
- Replace prototypes and samples
- Industrial Design & Showrooms (Automotive, Buildings, Houses, etc.)
- Algorithms : Natural Surface based Ray Tracing, Photon Mapping
- Computer Performance Requirement : 100's TFLOPS ~
Architecture-Algorithm Co-Design for Application Domain Specific Computing
Architecture-Algorithm Co-Design
- Application : “Ray Tracing”, “Photon Mapping”
- Requirements : 88TFLOPS@100W, 750MHz
- System Spec : SW Spec, HW Spec
- Algorithmic Optimization
- HW/SW Co-Design : SW design, HW design
- Analysis, HW/SW Co-Verification
TOPSTREAM™ Platform IP
- Base HW (RTL)
- TS-ISIM (ISS)
- TOPS_Lib (RTL)
- Patents
SW Partitioning
- Legacy Software (Sequential), Fine Grain Tasks, Process Elements, Distributed Processing KPN model, Distributed Processing Multi-Core
- Splitting, Grouping, Co-operating, Mapping
Performance & Power Simulation
Architectural Optimization
Reduces Walls
- ILP Wall
- Memory Wall
- Power Wall
Performance vs. Power Optimization
- Performance = f × IPC
- Power = ½ α C V² f
TOPSTREAM™ Architecture
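The two formulas on this slide explain why spreading work over more cores can pay off in power. A minimal sketch, assuming ideal parallel speed-up and a hypothetical supply voltage of 0.8 of nominal at half the clock frequency (both are illustrative assumptions, not figures from the slides):

```python
def performance(f_hz, ipc):
    # Performance = f x IPC (instructions per second)
    return f_hz * ipc

def dynamic_power(alpha, c_farad, v_volt, f_hz):
    # Power = 1/2 * alpha * C * V^2 * f (dynamic switching power)
    return 0.5 * alpha * c_farad * v_volt ** 2 * f_hz

# One core at nominal frequency and voltage (normalised units).
one_core_perf = performance(1.0, 1.0)
one_core_power = dynamic_power(1.0, 1.0, 1.0, 1.0)

# Two cores at half the frequency and (hypothetically) 0.8 x the voltage.
two_core_perf = 2 * performance(0.5, 1.0)
two_core_power = 2 * dynamic_power(1.0, 1.0, 0.8, 0.5)

power_ratio = two_core_power / one_core_power
print(two_core_perf / one_core_perf, power_ratio)
```

With these numbers the two-core configuration delivers the same performance for about 0.64 of the dynamic power, because power falls with the square of the voltage while performance only depends on the aggregate f × IPC.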
Architecture & Algorithm Co-Design
Optimizations go Bidirectional
[Figure labels: SW-C1 ... SW-C7 (software components on CPU with OS); P-1 ... P-4 (processes); Core1 ... Core4 connected by Bus / Network]
Partitioning based on analysis
・Functional Equivalency Checking
Map to KPN
・Merging
・FIFO
Mapping onto cores
・Static
・Dynamic
Communication Optimization
・Network Topology
・Functional Partitioning
Optimization
・Extended Instructions
・Memory Hierarchy
・etc.
KPN model, Multi-core model, Distributed Processing model
Can expect more than 10 X of Performance Improvement
Distributed Processing with KPN
– Non-Shared Memory Processes
– Zero-Overhead Message Passing Mechanism (ZOMP)
Combination of Parallelisms
– Distributed Parallel Processing (Task, Pipeline)
– Data Parallelism (High-Level, Instruction Level)
Stream Processing (Core)
– Kernel
– Stream-In (Read Message)
– Stream-Out (Write Message)
Optimization of Core
– Support Stream Processing : background Stream
– Complex Inst : Reduction of Kernel cycle
– FIFO support mechanism
– Reduction of energy for instruction / data supply
System Level Architecture
[Figure: Kahn Process Network of Task-A to Task-D connected by FIFOs, with Local Memory; combination of Data Parallel (SIMD) and Task Parallel; time axis annotated "Core can keep Processing of Kernel"]
Combination of Parallelisms, Stream Processing, and ASIP
Basic concept of stream processing:
“Maximize processor efficiency”
Careful Scheduling of Stream-In and Stream-Out
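Why scheduling of Stream-In and Stream-Out matters can be shown with a small timing model. This is a sketch under simple assumptions: every block takes fixed cycle counts for stream-in, kernel and stream-out, and in the overlapped case the three stages run as a pipeline with enough buffering. The function names and cycle counts are hypothetical.

```python
def serial_cycles(n_blocks, t_in, t_kernel, t_out):
    # The core waits for each transfer: nothing overlaps.
    return n_blocks * (t_in + t_kernel + t_out)

def overlapped_cycles(n_blocks, t_in, t_kernel, t_out):
    # Three-stage pipeline: fill once, then one block per bottleneck period.
    bottleneck = max(t_in, t_kernel, t_out)
    return (t_in + t_kernel + t_out) + (n_blocks - 1) * bottleneck

def kernel_utilization(total_cycles, n_blocks, t_kernel):
    # Fraction of time the core spends executing the kernel.
    return n_blocks * t_kernel / total_cycles

n, t_in, t_k, t_out = 100, 20, 100, 20
serial = serial_cycles(n, t_in, t_k, t_out)
overlapped = overlapped_cycles(n, t_in, t_k, t_out)
print(serial, overlapped)
print(kernel_utilization(serial, n, t_k), kernel_utilization(overlapped, n, t_k))
```

As long as the kernel is the longest stage, background streaming hides almost all transfer time and the core stays busy with kernel work.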
Real-Time Ray Tracing
Performance Analysis Result Examples
Performance Requirement Analysis
- Performance / Frame
- Performance / Area
- Performance / Ray
- Performance / Ray Type
- Performance / Function
- Computation / Function
- Memory Access / Function
・Memory Allocation
・Memory Hierarchy
・Special Instruction
・Floating Point to Fixed Point
・Processing unit
Big challenge : huge dynamic load changes (max. 3751×)
Partitioning and KPN model for Ray Tracing
Partitioning of Ray Tracing Process
Based on processing flow : Functional partitioning
Mapping of Processor ics
Kahn Process Network (KPN) model for Ray Tracing
Processes (functional partitioning) : Ray Generation, Space Check, BBox Check, Surface Intersect, Create Node
Processes (KPN model) : Ray Generation, Space Check, Voxel Traverse, BBox Check, Surf Intersect, Trim Rough Check, Trim Check, Depth Test, Create Node, Background (Haikei) Intersect, Object Lighting, Background Lighting, Input, Output
Legend : Process (local memory); Channel (point-to-point, FIFO)
Light
Rendering on Intersect with Color and Lighting
Two Levels of Functional Verification
1st Level : Each Process
2nd Level : Whole KPN
Equivalency Test with a number of Input / Output Data Sets
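The two verification levels can be sketched as follows. The stage functions are toy stand-ins chosen for this example (not the ray tracing processes), and the random data sets and function names are assumptions: level 1 checks each process against its own expected output, level 2 checks the whole partitioned chain against the sequential reference.

```python
import random

def sequential_reference(xs):
    # Original sequential computation (toy stand-in).
    return [(x * 2 + 1) % 256 for x in xs]

def stage_a(xs):
    # Process 1 of the partitioned version.
    return [x * 2 for x in xs]

def stage_b(xs):
    # Process 2 of the partitioned version.
    return [(x + 1) % 256 for x in xs]

def partitioned(xs):
    # Whole network: stage_a feeds stage_b.
    return stage_b(stage_a(xs))

def check_process(process, data_sets):
    # 1st level: each (input, expected output) pair for a single process.
    return all(process(inp) == out for inp, out in data_sets)

def check_network(n_sets=100, seed=0):
    # 2nd level: whole network against the sequential reference.
    rng = random.Random(seed)
    for _ in range(n_sets):
        xs = [rng.randrange(256) for _ in range(rng.randrange(1, 32))]
        if sequential_reference(xs) != partitioned(xs):
            return False, xs      # keep the failing input for debugging
    return True, None
```

Returning the failing input set from `check_network` gives a reproducible starting point when the two versions disagree.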
Debugging of Multicore is crazy!
11 cores
Each core is executing its instruction stream
Synchronization Point
Human nature
For typical engineers,
- can follow “a single instruction stream” for debugging
- make mistakes with “two instruction streams”
- no way with “three instruction streams”
Key for Multicore Debugging
- Extract “One Stream” of information, and concentrate on it.
- Provide tools to be able to concentrate on debugging
Debugging of application with KPN model
Something is wrong in the Filter function
QCP 1 QCP 2 QCP 3 QCP 4
Focus on FIFO
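One way to "focus on FIFO" is to record the tokens on every channel and compare them with a known-good run; the first channel that diverges points at the process that writes it. The sketch below is an illustration only: `TracedFifo`, `run`, `locate_fault` and the three toy stages are hypothetical names, not part of MPArchitect or any TOPSTREAM tool.

```python
from collections import deque

class TracedFifo:
    """FIFO channel that records every token written to it."""
    def __init__(self, name):
        self.name = name
        self.log = []
        self._q = deque()

    def put(self, token):
        self.log.append(token)
        self._q.append(token)

    def get(self):
        return self._q.popleft()

    def __len__(self):
        return len(self._q)

def run(stages, inputs):
    """Push inputs through the stages; return one traced FIFO per stage."""
    fifos, data = [], list(inputs)
    for name, fn in stages:
        ch = TracedFifo(name + ".out")
        for x in data:
            ch.put(fn(x))
        data = [ch.get() for _ in range(len(ch))]
        fifos.append(ch)
    return fifos

def locate_fault(stages, golden, inputs):
    """Return (channel name, token index) of the first divergence, or None."""
    for got, ref in zip(run(stages, inputs), run(golden, inputs)):
        for i, (g, r) in enumerate(zip(got.log, ref.log)):
            if g != r:
                return got.name, i
    return None

GOLDEN = [("scale", lambda x: 2 * x),
          ("filter", lambda x: min(x, 10)),
          ("offset", lambda x: x + 1)]
# Same network with a faulty filter (clamps to 0 instead of 10).
BUGGY = [GOLDEN[0], ("filter", lambda x: x if x < 10 else 0), GOLDEN[2]]
```

Calling `locate_fault(BUGGY, GOLDEN, [1, 2, 7, 3])` reports the filter's output channel and the index of the first wrong token, so the engineer can concentrate on one stream of data instead of several instruction streams.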
MPArchitect provides several tools
- Activities inside core
- Instruction Profile
- On-Chip bus usage
Activity monitor helps programming for Low Power
Performance Considerations Real-Time Ray Tracing
KPN based Distributed Processing Flow
Processes : Primary Ray Gen, Priority Selection, Ray Gen, Space Check, Voxel Traverse, BBox Check, Surf Intersect, Shadow Ray Gen, Depth Test / Background (Haikei) / Create Node, Ray Tree Gen, Object Lighting (OL), Background Lighting (BL), Reflection Ray, Reflection, Brightness, Virtual Process
Data : Selected Sub-Area, Lighting Sub-Area (35 band)
[Figure: KPN flow diagram annotated with numbers (n, 23, 4, 16, 32 DEPTH) and markers ①②③]
Optimization for Load Balancing
Critical Path
Critical Loop and Buffers for Load Balancing
Performance Simulation
Architecture Simulation Model & Accuracy
Computation (cores)
- sequential, untimed
- concurrent, untimed
- concurrent, estimated timed
- concurrent, Instruction Accurate
- concurrent, Cycle Accurate
Communication (Bus & Memory)
- untimed
- Transaction Level timed
- Cycle Accurate
Simulation Model
- BusModel
- CoreModel
- Chip Partitioning
- Cycle counts
- Extraction of Bus Transactions
Result of Trade-Off between speed and accuracy
Hardware and Software Modeling Flow
[Figure: cluster block diagram. Labels: C0 (64-bit RISC Processor); L2-C Memory (Code); L2-D Memory (Data); L1-C Memory (Code); L1-D Memory (Data); Core0 ... Core7 (C1 ... C7); L0 Mem; Inst Cache; Data Cache; Peripheral Bus; Distributed Arbitration On Chip Global bus (TOPSTREAM™ bus); Local bus (TOPSTREAM™ bus); Bus Bridge; Bus Ctrl; I/O; IHB; MMP; Inter Core Event Interface; External Memory Interface; External Memory]
Sequential Software written in C
- Ray Generation, Surf Intersect, Lighting, ・・・・
- nrBez()
Architecture & Algorithm Optimization, Partitioning
Distributed Software (KPN)
- C’ (C0), C’ (C1), C’ (C2), C’ (C3), ・・・・
System Memory Map
- Stack, Heap, Text (Code), Library, Monitor, Data
Architectural Design : TOPSTREAM™ RTRT
- Datapath Design, ISA, I/F (per core)
- External Memory
Modeling of Processor Cores
- Annotation of estimated cycle time
- C’’ w/ cycles (C0), C’’ w/ cycles (C1), C’’ w/ cycles (C2), C’’ w/ cycles (C3), ・・・・
Modeling of Bus and Memory Hierarchy
Initialization
Performance Simulation Model
Performance Simulation Results
Performance Simulation Results
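[Chart: core computation cycles (millions, axis 10 to 80) with 72-way parallelism for Core0, Core1, Core2, Core3, Core4, Core5(OL), Core5(BL) and Chip, against the required performance (cycles); annotations: target cycles, 3.0×]
Overhead due to wait for FIFO ready
1) Input FIFO data
2) Room for Output FIFO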
2 X Improvement of Performance by
- Optimizing FIFO depth by SW
- Changing priority for critical loop control ; Ray Generation
- Extending Application Specific Instructions
Target Cycles 3 x
Processing Time [M cycles]
Chip Level Performance Analysis
Figured out where to improve performance
Performance Simulation
Simulation results (Cycles / Area)
[Figure: image divided into a 16 × 20 grid of areas numbered 1 to 320; 56 areas simulated; red: 5 first-priority areas; blue: 5 second-priority areas]
[Chart: utilization of each core (average over the whole image), Active Ratio (%) of Core0 to Core4]
[Chart: core cycles per area, areas 1 to 316 on the x-axis, y-axis up to 25,000,000 cycles; Average = 2,101,951; Total = 670,328,434]
Area based processing
[Chart: Active Ratio (%) of Core0 to Core4 for selected areas: Center of Body (1), Background (280), Rear Glass (69), Rear Lamp (35), Front Glass + BG (66), Ground (180), Wheel (82), Front Lamp (8), Roof (10), Front Glass (62)]
Core Active Ratio (%) = ActiveCycle/FinalCycle
Average core utilization : 35%
Each core utilization (average)
Better Load balancing : max. 3751 x ⇒ max. 9 x
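The two metrics on this slide are easy to compute from simulation output. A minimal sketch, assuming that the load-balancing figure ("max. 3751 x", "max. 9 x") is the ratio between the heaviest and the lightest load; the function names and the sample cycle counts are hypothetical.

```python
def active_ratio(active_cycles, final_cycles):
    # Core Active Ratio (%) = ActiveCycle / FinalCycle
    return 100.0 * active_cycles / final_cycles

def load_imbalance(cycles_per_area):
    # Ratio of heaviest to lightest load (assumed meaning of "max. N x").
    return max(cycles_per_area) / min(cycles_per_area)

def average_utilization(active_per_core, final_cycles):
    # Mean active ratio over all cores, in percent.
    ratios = [active_ratio(a, final_cycles) for a in active_per_core]
    return sum(ratios) / len(ratios)

# Hypothetical numbers for illustration only.
print(active_ratio(350, 1000))
print(load_imbalance([1000, 9000, 3000]))
print(average_utilization([200, 400, 450], 1000))
```

Tracking both values per area makes it visible whether a change in partitioning or FIFO depth actually evens out the load.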
Fast Verification Environment
- Software Verification
– TS-ISIM : Multicore ISS @ 17 MIPS
– Virtual COM port
- System Verification
– Standard Methodology : OVM
– RTL Simulator : Questa
- Test Bench : Virtual System
– Hardware Emulator : Veloce
- A. Stand Alone : Core Level HW/SW Co-Verification
- B. Virtual System : Multicore System
Veloce (Emulator) connection to WS through TBX
TS-ISIM (in house tool)
ISS vs. RTL Comparison at Instruction Level
Objective
- Speed-up Debugging of Applications at System Level
TS-ISIM (ISS) Instruction Trace
FA/PFA, Code, Rs1, Rs2, Rd, PSW, MRW, Addr/PAddr, Data
FA/PFA, Code, Rs1, Rs2, Rd, PSW
FA/PFA, Code, Rs1, Rs2, Rd, PSW
・・・・・
RTL Instruction Trace
FA/PFA, Code, Rs1, Rs2, Rd, PSW, MRW, Addr/PAddr, Data
FA/PFA, Code, Rs1, Rs2, Rd, PSW
FA/PFA, Code, Rs1, Rs2, Rd, PSW
・・・・・
Compare by Inst.
Bug is inside an instruction (1000 x performance with Hardware Emulator)
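Comparing the two instruction traces record by record can be sketched as below. The record layout (a dict per instruction with a subset of the field names from the slide) and the function name `first_mismatch` are assumptions of this example, not the actual TS-ISIM or RTL trace format.

```python
FIELDS = ("FA", "Code", "Rs1", "Rs2", "Rd", "PSW")

def first_mismatch(iss_trace, rtl_trace):
    """Return (instruction index, differing fields) or None if traces agree."""
    for n, (iss, rtl) in enumerate(zip(iss_trace, rtl_trace)):
        diff = [f for f in FIELDS if iss.get(f) != rtl.get(f)]
        if diff:
            return n, diff
    if len(iss_trace) != len(rtl_trace):
        # One trace ended early: report where the shorter one stops.
        return min(len(iss_trace), len(rtl_trace)), ["length"]
    return None

# Hypothetical three-instruction traces; the RTL run writes a wrong Rd value.
iss = [
    {"FA": 0x100, "Code": 0x01, "Rs1": 1, "Rs2": 2, "Rd": 3, "PSW": 0},
    {"FA": 0x104, "Code": 0x02, "Rs1": 3, "Rs2": 1, "Rd": 4, "PSW": 0},
    {"FA": 0x108, "Code": 0x03, "Rs1": 4, "Rs2": 4, "Rd": 8, "PSW": 0},
]
rtl = [dict(r) for r in iss]
rtl[1]["Rd"] = 5

print(first_mismatch(iss, rtl))
```

Stopping at the first differing instruction narrows the search to a single instruction, which is the point of comparing at instruction level.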
Challenges to provide tool sets for KPN programming
[Figure: tool flow for KPN programming. Labels:]
- Inputs : Sequential Software (C code); Sample Input; SHIM Model Description (XML) with Perform. and Cost; User
- Tools : KPN Editor; Profiler; Fine Grain Partitioning; Grouping; Mapping; Data Conversion; Architecture Exploration; Equivalence Checker (EQ?); Visualization Tool; Visualization, Optimization, Exploration
- SystemC Framework; SystemC Generator (Performance Est.); Synthesizable SystemC
- Outputs : Distributed Processing Software; Hardware (Hard Wired); Error Report; Perf. Report
- Targets (various multi-/many-core processors) : SMYLEref, SMYLEvideo, TOPSTREAM™
Goal : Exploration of best SW and best HW architecture