Debugging of application software based on KPN on heterogeneous multi / many-core processors Yukoh Matsumoto, Ph.D. President & CEO, Architect TOPS Systems Corp. Multicore / Manycore provider in Japan MAD 2013

Agenda: Porting Application onto Heterogeneous Manycore
- Case Study: Real-Time Ray Tracing, 800 TFLOPS on Desk Top Machine
- Architecture & Algorithm Co-Design
- Deep Performance Analysis
- Software Partitioning into Kahn Process Network
- System Performance Modeling
- System Performance Simulation
- Debugging Issues and Challenges
- Working on Better Solutions
- Conclusions

Parallel Processing Goal
[Figure labels: Single CPU, Dual Core CPU, CPU + GPU, TOPSTREAM Cluster Core]
Reference: "Exploiting Coarse-Grained Task, Data, and Pipeline Parallelism in Stream Programs", Michael I. Gordon, William Thies, and Saman Amarasinghe, MIT
Exploit more parallelism for higher performance

Our Multi-/Many-core SW Development Flow: Sequential to Distributed Processing
- Sequential Computation
- Decomposition into Fine Grain Tasks
- Multiple Processes: Distributed Processing (Kahn Process Network), Functional Verification
- Mapping onto cores: Distributed Processing on Multicore, Performance Verification
Debugging of both Functionality and Performance

Programming Model based on KPN
Basic Distributed Processing based on KPN
- A process consists of multiple tasks
- A task is a single executable
- A process works with local memory
- Point-to-point unidirectional communication channels
- Finite depth of FIFO within a channel
- Read-only access to Global Memory is allowed
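The programming model above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the TOPSTREAM implementation: threads stand in for cores, `queue.Queue(maxsize=...)` stands in for a finite-depth FIFO channel, and the names `producer`, `scale`, `consumer`, `run_network`, and the `SENTINEL` end-of-stream marker are all hypothetical.

```python
import queue
import threading

SENTINEL = object()  # end-of-stream marker (an assumption of this sketch)

def producer(out_ch, values):
    """Process with one output channel: writes each token, then end-of-stream."""
    for v in values:
        out_ch.put(v)              # blocks while the finite FIFO is full
    out_ch.put(SENTINEL)

def scale(in_ch, out_ch, factor):
    """Process with one input and one output channel; state is purely local."""
    while True:
        token = in_ch.get()        # blocking read on a point-to-point channel
        if token is SENTINEL:
            out_ch.put(SENTINEL)
            return
        out_ch.put(token * factor)

def consumer(in_ch, results):
    """Process that drains its input channel into a local result list."""
    while True:
        token = in_ch.get()
        if token is SENTINEL:
            return
        results.append(token)

def run_network(values, factor=2, fifo_depth=4):
    """Wire producer -> scale -> consumer with unidirectional bounded FIFOs."""
    a_to_b = queue.Queue(maxsize=fifo_depth)
    b_to_c = queue.Queue(maxsize=fifo_depth)
    results = []
    threads = [
        threading.Thread(target=producer, args=(a_to_b, values)),
        threading.Thread(target=scale, args=(a_to_b, b_to_c, factor)),
        threading.Thread(target=consumer, args=(b_to_c, results)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

if __name__ == "__main__":
    print(run_network([1, 2, 3]))
```

Because every process only blocks on its own channels and shares no writable memory, the output does not depend on thread scheduling or on the FIFO depth chosen.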

Experiences: Sequential to Distributed Processing
- Computer Vision
  - SIFT: 10 cores, 4 cores, 8 cores
  - Haar-Like: 4 cores
  - SVM: 4 cores
- Computer Graphics
  - Ray Tracing: 73 cores
- Codec
  - H.264 Decoder: 10 cores
  - JPEG Decoder: 10 cores
- Wireless Communication
  - 802.11b MAC and Baseband: 4 cores
Distributed Processing on Heterogeneous Multicore / Manycore

Case Study: TOPSTREAM™ RTRT
Ultra-Accurate Real-Time Ray Tracing
- Color Model with 35 bands
- Rendering on Free Surface (Bezier)
- HD (1920 x 1080 pixels) @ 30 frame/s
Performance Requirement
- ≈ 800 TFLOPS / system; 88 TFLOPS / chip
- Intel CPU: 100k chips
- NVIDIA GPU: 20k ~ 30k chips
- TOPSTREAM: 9 chips
LSI Design (Estimated)
- Technology: TSMC 45nm
- Clock Frequency: 750 MHz
- Chip Size: 17mm × 17mm
- Logic: 267.7 MGate (73 Heterogeneous Manycore)
- Memory: 23 Mbit
TOPSTREAM™ RTRT (Desk Top Machine)
- 60cm × 60cm × 20cm = 72,000 cm³
- Power Consumption: 1000 W (max)
- Heterogeneous Many Core: 0.88 TFLOPS/W
[Figure: chip block diagram. Master Node: 64-bit RISC Processor with I$ and D$, L2-I Memory (Code) 96 kByte, L2-D Memory (Data) 784 kByte, Peripheral Bus, Bus Ctrl, Memory Interface, I/O, CPU Bus Bridge, HDD. On-Chip Global bus (TOPSTREAM™ bus) with Distributed Arbitration, MMP, and Bus Bridges to clusters C0 to C7. Slave Node: L1-I Memory 64 kByte, L1-D Memory 128 kByte, Inter Core Event Interface, Local bus (TOPSTREAM™ bus), Core0 to Core7, each with L0. A cluster includes 9 heterogeneous cores.]
(Image Generated by Visual Simulation)
※ Joint R&D with TOYOTA Motor & NIHON UNISYS
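The headline figures on this slide can be cross-checked with a little arithmetic. The sketch below uses only numbers quoted in the deck (9 chips, 88 TFLOPS per chip at 100 W, HD at 30 frame/s, a 60 × 60 × 20 cm enclosure); the per-pixel budget is a derived illustration, not a figure from the author, and all variable names are hypothetical.

```python
# Figures quoted in the slides
CHIPS = 9
TFLOPS_PER_CHIP = 88
WATTS_PER_CHIP = 100           # from "88TFLOPS@100W"
WIDTH, HEIGHT, FPS = 1920, 1080, 30
BOX_CM = (60, 60, 20)

# Derived values
system_tflops = CHIPS * TFLOPS_PER_CHIP                     # 792, i.e. roughly 800
pixels_per_second = WIDTH * HEIGHT * FPS                    # 62,208,000
flop_per_pixel = system_tflops * 1e12 / pixels_per_second   # about 12.7 million
volume_cm3 = BOX_CM[0] * BOX_CM[1] * BOX_CM[2]              # 72,000
tflops_per_watt = TFLOPS_PER_CHIP / WATTS_PER_CHIP          # 0.88

if __name__ == "__main__":
    print(f"system throughput : {system_tflops} TFLOPS")
    print(f"pixels per second : {pixels_per_second:,}")
    print(f"FLOP per pixel    : {flop_per_pixel:,.0f}")
    print(f"enclosure volume  : {volume_cm3:,} cm^3")
    print(f"efficiency        : {tflops_per_watt} TFLOPS/W")
```

Nine chips at 88 TFLOPS give 792 TFLOPS, which matches the "≈ 800 TFLOPS / system" requirement, and the budget works out to roughly 12.7 million floating-point operations per pixel per frame.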

Heterogeneous Multi-Core drives Computer Graphics Paradigm Shift
- Synthesis
  - Animations, movies, video games
  - Algorithms: Polygon based Ray Tracing
  - Computer Performance Requirement: ~1 TFLOPS
- Reproduction
  - Replace prototypes and samples
  - Industrial Design & Showrooms: Automotives, Buildings, Houses, etc.
  - Algorithms: Natural Surface based Ray Tracing, Photon Mapping
  - Computer Performance Requirement: 100's of TFLOPS and up

Architecture-Algorithm Co-Design for Application Domain Specific Computing
- Requirements
  - Application: "Ray Tracing", "Photon Mapping"
  - 88 TFLOPS @ 100 W, 750 MHz
- Performance vs. Power Optimization
  - Performance = f × IPC
  - Power = ½ α C V² f
- SW Partitioning
  - Legacy Software (Sequential)
  - Analysis
  - Splitting into Fine Grain Tasks
  - Grouping into Process Elements
  - KPN model, Mapping, Distributed Processing
  - Performance & Power Simulation
- Architecture-Algorithm Co-Design
  - Architectural Optimization reduces walls: ILP Wall, Memory Wall, Power Wall
  - Algorithmic Optimization reduces walls: ILP Wall, Memory Wall, Power Wall
- TOPSTREAM™ Architecture and TOPSTREAM™ Platform
  - IP, Patents, TS-ISIM (ISS), TOPS_Lib, Multi-Core Base HW
- HW/SW Co-Design
  - System Spec, HW Spec, SW Spec
  - HW design (RTL), SW design
  - Co-operating HW and SW; HW/SW Co-Verification (RTL)
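The two formulas on this slide, Performance = f × IPC and Power = ½ α C V² f, explain why spreading work over more, slower cores can save power. The sketch below is a worked example only: the activity factor, capacitance, voltages, and IPC are placeholder assumptions (not values from the talk), and it assumes the workload parallelizes perfectly across the two cores.

```python
def dynamic_power(alpha, c, v, f):
    """Dynamic power: P = 1/2 * alpha * C * V^2 * f."""
    return 0.5 * alpha * c * v ** 2 * f

def performance(f, ipc):
    """Performance = f * IPC (instructions per second)."""
    return f * ipc

# Placeholder parameters (assumptions for illustration only)
ALPHA = 0.2        # switching activity factor
C = 1e-9           # switched capacitance in farads
IPC = 1.0          # instructions per cycle per core

# Option A: one core at 750 MHz and 1.0 V
perf_a = performance(750e6, IPC)
power_a = dynamic_power(ALPHA, C, 1.0, 750e6)

# Option B: two cores at 375 MHz; the lower clock is assumed to allow 0.8 V
perf_b = 2 * performance(375e6, IPC)
power_b = 2 * dynamic_power(ALPHA, C, 0.8, 375e6)

power_ratio = power_b / power_a   # 0.8^2 = 0.64 under these assumptions

if __name__ == "__main__":
    print(f"performance A vs B: {perf_a:.3e} vs {perf_b:.3e}")
    print(f"power B / power A : {power_ratio:.2f}")
```

Under these assumptions both options deliver the same throughput, while the two-core option draws about 64% of the dynamic power, because frequency enters the power formula linearly and voltage enters quadratically.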

Architecture & Algorithm Co-Design: Optimizations go Bidirectional
- Distributed Processing model (SW-C1 to SW-C7, OS)
  - Partitioning based on analysis
  - Functional Equivalency Checking method
- KPN model (P-1 to P-4)
  - Map to KPN
  - Optimization: Communication, Network Topology, Functional Partitioning, Merging, FIFO
- Multi-core model (CPU, Core1 to Core4, Bus / Network)
  - Mapping onto cores: Static, Dynamic
  - Optimization: Extended Instructions, Memory Hierarchy, etc.
Can expect more than 10X Performance Improvement

System Level Architecture
- Distributed Processing with KPN
  - Non-Shared Memory Processes
  - Zero-Overhead Message Passing Mechanism (ZOMP)
- Combination of Parallelisms
  - Distributed Parallel Processing (Task, Pipeline)
  - Data Parallelism (High-Level, Instruction Level)
  - Data Parallel (SIMD)
- Stream Processing (Core)
  - Kernel
  - Stream-In (Read Message)
  - Stream-Out (Write Message)
- Optimization of Core
  - Support Stream Processing: background Stream
  - Complex Inst: Reduction of Kernel cycle
  - FIFO support mechanism
  - Reduction of energy for instruction / data supply
[Figure: Kahn Process Network. Task-A and Task-B, each with Local Memory, connected by FIFOs (*)]
[Figure: Combination of Data & Task Parallel over time. Data Parallel: Task-A; Task Parallel: Task-B, Task-C, Task-D]
Core can keep Processing of Kernel
Combination of Parallelisms, Stream Processing, and ASIP

Basic concept of stream processing: "Maximize processor efficiency"
Careful Scheduling of Stream-In and Stream-Out
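A simple timing model illustrates why scheduling Stream-In and Stream-Out in the background matters. The sketch below is an assumption-laden model, not the TOPSTREAM scheduler: it treats stream-in, kernel, and stream-out as three stages with fixed per-block cycle counts, assumes enough buffering for the stages to overlap fully, and uses hypothetical function names and example cycle counts.

```python
def serial_cycles(n_blocks, t_in, t_kernel, t_out):
    """Core does stream-in, kernel, stream-out one after another per block."""
    return n_blocks * (t_in + t_kernel + t_out)

def overlapped_cycles(n_blocks, t_in, t_kernel, t_out):
    """Three-stage pipeline: fill latency plus one bottleneck slot per extra block."""
    if n_blocks == 0:
        return 0
    bottleneck = max(t_in, t_kernel, t_out)
    return (t_in + t_kernel + t_out) + (n_blocks - 1) * bottleneck

def kernel_utilization(total_cycles, n_blocks, t_kernel):
    """Fraction of cycles in which the core executes the kernel."""
    return n_blocks * t_kernel / total_cycles

if __name__ == "__main__":
    n, t_in, t_k, t_out = 100, 20, 50, 20   # example values, not measurements
    s = serial_cycles(n, t_in, t_k, t_out)
    o = overlapped_cycles(n, t_in, t_k, t_out)
    print(f"serial    : {s} cycles, utilization {kernel_utilization(s, n, t_k):.2f}")
    print(f"overlapped: {o} cycles, utilization {kernel_utilization(o, n, t_k):.2f}")
```

With these example numbers the serial schedule takes 9000 cycles and the overlapped one 5040, so the core spends almost all of its time in the kernel once the transfers are hidden behind it.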

Real-Time Ray Tracing Performance Analysis Result Examples
- Performance Requirement Analysis
  - Performance / Frame
  - Performance / Area
  - Performance / Ray
  - Performance / Ray Type
  - Performance / Function
  - Computation / Function
  - Memory Access / Function
- Design decisions covered
  - Memory Allocation
  - Memory Hierarchy
  - Processing unit
  - Special Instruction
  - Floating point to Fixed Point
Big Challenge was Dynamic Huge Load Changes (max. 3751x)

Partitioning and KPN model for Ray Tracing
- Partitioning of Ray Tracing Process
  - Based on processing flow: Functional partitioning
  - Rendering stages: Ray Generation, Intersect with Object, Color and Lighting
- Mapping of Processor
- Two Levels of Functional Verification
  - 1st Level: Each Process
  - 2nd Level: Whole KPN
  - Equivalency Test with a number of Input / Output Data Sets
[Figure: Kahn Process Network (KPN) model for Ray Tracing. Legend: Process (local memory); Channel (point-to-point, FIFO). Process labels: Input, Ray Generation, Space Check, Voxel Traverse, BBox Check, Surf Rough Check, Surface Intersect, Trim Check, Depth Test, Haikei (background), Create Node, Light, Object Lighting, Background Lighting, Output]
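The two levels of functional verification can be sketched as an equivalency test against the sequential code. The example below is a toy under stated assumptions: the three stages are placeholder arithmetic rather than ray tracing functions, Python generators stand in for KPN processes, and every name (`reference`, `p_transform`, `p_filter`, `p_scale`, `check_process`, `check_network`) is hypothetical.

```python
def reference(samples):
    """Sequential 'legacy' version, used as the golden model."""
    out = []
    for x in samples:
        y = x * x + 1
        if y % 2 == 1:
            out.append(y * 2)
    return out

# The same computation partitioned into three processes.
def p_transform(tokens):
    for x in tokens:
        yield x * x + 1

def p_filter(tokens):
    for y in tokens:
        if y % 2 == 1:
            yield y

def p_scale(tokens):
    for y in tokens:
        yield y * 2

def kpn(samples):
    """Whole network: p_transform -> p_filter -> p_scale."""
    return list(p_scale(p_filter(p_transform(samples))))

def check_process(process, cases):
    """1st level: one process against its own input/output data sets."""
    return all(list(process(iter(inp))) == expected for inp, expected in cases)

def check_network(data_sets):
    """2nd level: whole KPN against the sequential reference."""
    return all(kpn(d) == reference(d) for d in data_sets)

if __name__ == "__main__":
    print(check_process(p_filter, [([1, 2, 3, 4], [1, 3])]))
    print(check_network([[0, 1, 2, 3], list(range(50)), []]))
```

Testing each process first keeps a failure local; the whole-network test then confirms that partitioning has not changed the result of the sequential program.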

Debugging of Multicore is crazy!
[Figure: 11 cores with Synchronization Points; each core is executing its instruction stream]

Human's nature is
- For typical engineers,
  - can follow "a single instruction stream" for debugging
  - make mistakes with "two instruction streams"
  - no way with "three instruction streams"
Key for Multicore Debugging
- Extract "One Stream" of information, and concentrate on it.
- Provide tools to be able to concentrate on debugging

Debugging of application with KPN model
[Figure: pipeline of QCP 1, QCP 2, QCP 3, QCP 4]
- Something wrong on Filter Function
- Focus on FIFO
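"Focus on FIFO" can be read as: record the token stream on each channel and compare it with a known-good run, so that only one stream of information needs attention at a time. The sketch below is one possible illustration, not the author's tool: stages are modeled as one-token-in, one-token-out functions, the golden pipeline is assumed to be available (for example from the sequential code), and all names are hypothetical.

```python
def first_divergence(observed, expected):
    """Index of the first differing token, or None if the streams match."""
    for i, (o, e) in enumerate(zip(observed, expected)):
        if o != e:
            return i
    if len(observed) != len(expected):
        return min(len(observed), len(expected))
    return None

def trace_pipeline(stages, tokens):
    """Run the stages in order and record the token stream on every FIFO."""
    traces = [list(tokens)]
    for stage in stages:
        traces.append([stage(t) for t in traces[-1]])
    return traces

def locate_fault(stages, golden_stages, tokens):
    """Return (stage index, token position) of the first FIFO that diverges."""
    actual = trace_pipeline(stages, tokens)
    golden = trace_pipeline(golden_stages, tokens)
    for fifo in range(1, len(actual)):
        pos = first_divergence(actual[fifo], golden[fifo])
        if pos is not None:
            return fifo - 1, pos   # the stage writing this FIFO is the suspect
    return None

if __name__ == "__main__":
    golden = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3, lambda x: x * x]
    # Hypothetical bug: the second stage (a "filter") misbehaves for inputs >= 5.
    buggy = list(golden)
    buggy[1] = lambda x: x * 2 if x < 5 else x * 2 + 1
    print(locate_fault(buggy, golden, range(8)))
```

Because all FIFOs upstream of the first diverging one carry correct tokens, the process that writes that FIFO is the one to inspect, and the token position tells which input triggered the fault.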

MPArchitect provides several tools
- Instruction Profile
- Activities inside core
- On-Chip bus usage
Activity monitor helps programming for Low Power

Performance Considerations
Real-Time Ray Tracing: KPN based Distributed Processing Flow
[Figure labels: Virtual Area Process, Selected Sub-Area, Primary Ray Gen (①), Reflection Ray Gen (②), Shadow Ray Gen (③), Ray Tree Gen, Priority Selection, Space Check, Voxel Traverse, BBoxCheck, Surf Intersect, DepthTest/Haikei/CreateNode, DEPTH, OL Lighting, BL Lighting, Reflection, Brightness (35 band), Critical Path; numeric annotations 4, 16, 23, 32, n]
- Ray Gen Optimization for Load Balancing
- Critical Loop and Buffers for Load Balancing
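The role of buffers in load balancing can be illustrated with a small discrete-time model of one channel. This is a sketch under assumptions, not a model of the RTRT system: the arrival pattern, service rate, and FIFO depths are made-up example values, and `producer_stalls` is a hypothetical helper that counts how many time steps the producing process is blocked on a full FIFO.

```python
def producer_stalls(arrivals, service_rate, depth):
    """Count time steps in which the producer cannot write all pending tokens.

    arrivals     -- tokens the producer generates in each time step
    service_rate -- tokens the consumer can read per time step
    depth        -- FIFO depth of the channel between them
    """
    fifo = 0       # tokens currently in the channel
    pending = 0    # tokens produced but not yet written (back-pressure)
    stalls = 0
    for produced in arrivals:
        pending += produced
        written = min(pending, depth - fifo)   # write as much as fits
        fifo += written
        pending -= written
        if pending > 0:
            stalls += 1                        # producer blocked this step
        fifo -= min(fifo, service_rate)        # consumer drains the FIFO
    return stalls

if __name__ == "__main__":
    bursty = [8, 0, 0, 0] * 4    # same average load as 2 tokens per step
    for depth in (4, 8):
        print(f"depth {depth}: {producer_stalls(bursty, 2, depth)} stalled steps")
```

In this example the average load matches the consumer's rate, yet a depth-4 FIFO stalls the producer for 8 steps while a depth-8 FIFO absorbs every burst, which is the effect deeper buffers on a critical loop are meant to provide.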
