

SLIDE 1

Programming the Adapteva Epiphany 64-core Network-on-chip Coprocessor

Anish Varghese, Robert Edwards, Gaurav Mitra and Alistair Rendell

Research School of Computer Science The Australian National University

May 19, 2014

SLIDE 2

Outline

1. Introduction
2. Architecture
   • Parallella Hardware Architecture
   • Software Environment
3. Performance Experiments
   • On-chip Communication
   • Off-chip Communication
4. Heat Stencil
   • Implementation
   • Results
5. Conclusions

SLIDE 3

Introduction

Adapteva Epiphany Coprocessor
• New scalable many-core architecture
• Energy-efficient platform (50 GFLOPS/Watt)
• $99 for a Parallella board

Contributions
• Explored the features of the Epiphany
• Evaluated its performance
• Demonstrated how to write high-performance applications on this platform
SLIDE 4

Outline

1. Introduction
2. Architecture
   • Parallella Hardware Architecture
   • Software Environment
3. Performance Experiments
4. Heat Stencil
5. Conclusions

SLIDE 5

Parallella Board

SLIDE 6

Epiphany Coprocessor

Features
• Multi-core MIMD architecture
• No cache
• 32 KB of local SRAM per core, in four banks of 8 KB
• Shared address space
• 64 general-purpose registers
• Epiphany instruction set
• Superscalar CPU: two floating point operations (Fused Multiply-Add) and one 64-bit memory load/store operation per cycle
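Because all eCores share one flat address space, a core can read or write a neighbour's local SRAM with ordinary loads and stores. A minimal device-side sketch, assuming the e-lib helpers e_get_coreid, e_coords_from_coreid and e_get_global_address behave as documented in the Epiphany SDK:

    #include <e_lib.h>

    int local_buf[8];   /* resides in this core's 32 KB local SRAM */

    int main(void)
    {
        unsigned row, col;
        e_coords_from_coreid(e_get_coreid(), &row, &col);

        /* Global address of the same buffer inside the core one column east
           (edge handling omitted; this is only a sketch) */
        int *neigh = (int *)e_get_global_address(row, col + 1, local_buf);

        neigh[0] = 42;  /* a plain store that travels over the on-chip mesh */
        return 0;
    }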

SLIDE 7

Outline

1. Introduction
2. Architecture
   • Parallella Hardware Architecture
   • Software Environment
3. Performance Experiments
4. Heat Stencil
5. Conclusions

SLIDE 8

Software Environment

Programming Environment
• C/C++ with the Epiphany SDK

Programming Considerations

Memory Size
• Relatively small: 32 KB of local RAM per eCore, holding both code and data
• Store code and data in different local memory banks
• Distribute code between multiple cores

Processor Capability
• Currently no hardware support for integer multiply, floating point divide or double-precision floating point operations
• Branching costs 3 cycles; unroll inner loops to increase performance (see the sketch below)
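As a concrete illustration of the last point, a minimal sketch of four-way manual unrolling in plain C (the function and variable names are ours, not from the SDK):

    /* One branch per four elements instead of one per element,
       amortising the 3-cycle branch cost */
    void scale(float *x, float w, int n)   /* assumes n is a multiple of 4 */
    {
        for (int i = 0; i < n; i += 4) {
            x[i]     *= w;
            x[i + 1] *= w;
            x[i + 2] *= w;
            x[i + 3] *= w;
        }
    }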

SLIDE 9

Outline

1. Introduction
2. Architecture
3. Performance Experiments
   • On-chip Communication
   • Off-chip Communication
4. Heat Stencil
5. Conclusions

SLIDE 10

Experiment Platform

• ZedBoard evaluation module with Zynq SoC
• Daughter card with Epiphany-IV 64-core (E64G401)
• Dual-core ARM Cortex-A9 host at 667 MHz
• Epiphany eCores at 600 MHz
• 512 MB of DDR3 RAM on the host, 32 MB of which is shared with the eCores

SLIDE 11

Bandwidth

Experiment: To evaluate the cost of sending messages from one eCore to another (a device-side sketch follows)
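A minimal device-side sketch of such a measurement, assuming the e-lib timer calls (e_ctimer_set/e_ctimer_start/e_ctimer_get) and e_get_global_address behave as in the SDK documentation; SIZE and the destination core are illustrative:

    #include <e_lib.h>
    #include <string.h>

    #define SIZE 4096
    char src[SIZE];                        /* source buffer in local SRAM */

    int main(void)
    {
        unsigned row, col;
        e_coords_from_coreid(e_get_coreid(), &row, &col);
        char *dst = (char *)e_get_global_address(row, col + 1, src);

        e_ctimer_set(E_CTIMER_0, E_CTIMER_MAX);   /* counts down from max */
        e_ctimer_start(E_CTIMER_0, E_CTIMER_CLK);
        memcpy(dst, src, SIZE);                   /* remote stores over the mesh */
        unsigned cycles = E_CTIMER_MAX - e_ctimer_get(E_CTIMER_0);

        /* bandwidth = SIZE bytes / (cycles / 600 MHz) */
        return (int)cycles;
    }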

SLIDE 12

Latency

Latency for small message transfers

SLIDE 13

Latency

Experiment: To evaluate the effect of node distance on transfer latency

Node 1   Node 2   Distance   Time per transfer (ns)
(0,0)    (0,1)       1            11.12
(0,0)    (0,2)       2            11.14
(0,0)    (1,2)       3            11.19
(0,0)    (0,4)       4            11.38
(0,0)    (3,3)       5            11.62
(0,0)    (4,4)       6            11.86
(0,0)    (7,7)      14            12.57

80 bytes are transferred from one eCore to another, taking ≈ 7 cycles per transfer.
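As a sanity check at the 600 MHz eCore clock:

$t_{\mathrm{cycle}} = \frac{1}{600\ \mathrm{MHz}} \approx 1.67\ \mathrm{ns}, \qquad \frac{11.12\ \mathrm{ns}}{t_{\mathrm{cycle}}} \approx 6.7, \qquad \frac{12.57\ \mathrm{ns}}{t_{\mathrm{cycle}}} \approx 7.5$

so every transfer costs roughly 7 cycles, almost independent of node distance.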

SLIDE 14

Outline

1. Introduction
2. Architecture
3. Performance Experiments
   • On-chip Communication
   • Off-chip Communication
4. Heat Stencil
5. Conclusions

SLIDE 15

Shared Memory Access

Experiment: To evaluate the performance of the external shared memory, multiple nodes write to the shared memory simultaneously. Each eCore continuously writes blocks of 2 KB over 2 seconds and the utilization is measured.

2 x 2 nodes:
  Node    Iterations   Utilization
  (0,0)   61037        0.41
  (0,1)   48829        0.33
  (1,0)   24414        0.17
  (1,1)   12207        0.08

8 x 8 nodes (iterations per node, grouped; node count in parentheses):
  27460+       0.187 utilization each   (8 nodes, incl. (0,7), (1,7), (2,7), (3,7))
  3050+        0.021 each               (4 nodes)
  2040+        0.014 each               (8 nodes)
  100 - 1000                            (9 nodes)
  10 - 100                              (7 nodes)
  1 - 10                                (24 nodes)

Nodes closer to column 7 and row 0 get the best write access. Write throughput: 150 MB/sec.
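A device-side sketch of one such writer, assuming the e-lib ctimer API; the 0x8e000000 external-memory window base and the per-core offset layout are assumptions for illustration (the real mapping comes from the SDK's linker descriptions):

    #include <e_lib.h>
    #include <string.h>

    #define BLOCK 2048

    int main(void)
    {
        unsigned row, col;
        e_coords_from_coreid(e_get_coreid(), &row, &col);

        char block[BLOCK];
        /* assumed shared-DRAM window and per-core slot layout */
        char *shared = (char *)0x8e000000 + (row * 8 + col) * BLOCK;

        unsigned iters = 0;
        e_ctimer_set(E_CTIMER_0, E_CTIMER_MAX);
        e_ctimer_start(E_CTIMER_0, E_CTIMER_CLK);
        /* run for ~2 s: 1.2e9 cycles at 600 MHz */
        while (E_CTIMER_MAX - e_ctimer_get(E_CTIMER_0) < 1200000000u) {
            memcpy(shared, block, BLOCK);   /* 2 KB block to external DRAM */
            iters++;
        }
        return (int)iters;                  /* iteration count = utilization proxy */
    }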

SLIDE 16

Outline

1. Introduction
2. Architecture
3. Performance Experiments
4. Heat Stencil
   • Implementation
   • Results
5. Conclusions

SLIDE 17

Heat Stencil Equation

Five-point star-shaped stencil:

$T^{\mathrm{new}}_{i,j} = w_1\,T^{\mathrm{prev}}_{i,j+1} + w_2\,T^{\mathrm{prev}}_{i,j} + w_3\,T^{\mathrm{prev}}_{i,j-1} + w_4\,T^{\mathrm{prev}}_{i+1,j} + w_5\,T^{\mathrm{prev}}_{i-1,j}$
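For reference, a minimal sketch of this update in plain C. Unlike the authors' hand-tuned in-place assembly (next slides), it writes to a separate output array; the names, row-major layout and halo convention are illustrative:

    /* next[i][j] = w1*prev[i][j+1] + w2*prev[i][j] + w3*prev[i][j-1]
     *            + w4*prev[i+1][j] + w5*prev[i-1][j]
     * H x W grid including one halo row/column on each side */
    void stencil_step(const float *prev, float *next, int H, int W,
                      const float w[5])
    {
        for (int i = 1; i < H - 1; i++)
            for (int j = 1; j < W - 1; j++)
                next[i * W + j] = w[0] * prev[i * W + (j + 1)]
                                + w[1] * prev[i * W + j]
                                + w[2] * prev[i * W + (j - 1)]
                                + w[3] * prev[(i + 1) * W + j]
                                + w[4] * prev[(i - 1) * W + j];
    }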

SLIDE 18

Implementation

• Hand-tuned, unrolled assembly code
• "In-place" implementation
• Size of grid limited by local memory (and by the size of the assembly code)
• All 64 registers used and managed carefully
• Grid initialized on the host and transferred to each eCore (host-side sketch below)
• Computation phase followed by a communication phase in each iteration
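A minimal host-side sketch of the load-and-transfer step, assuming the standard e-hal calls (e_init, e_open, e_load_group, e_write); the program name, the 0x2000 local destination address and the 20 x 20 tile are illustrative:

    #include <e-hal.h>

    int main(void)
    {
        e_platform_t platform;
        e_epiphany_t dev;
        float tile[20 * 20];              /* one core's portion of the grid */

        e_init(NULL);
        e_reset_system();
        e_get_platform_info(&platform);
        e_open(&dev, 0, 0, platform.rows, platform.cols);
        e_load_group("e_stencil.elf", &dev, 0, 0,
                     platform.rows, platform.cols, E_FALSE);

        /* copy each core's sub-grid into its local SRAM */
        for (unsigned r = 0; r < platform.rows; r++)
            for (unsigned c = 0; c < platform.cols; c++)
                e_write(&dev, r, c, 0x2000, tile, sizeof(tile));

        e_start_group(&dev);              /* kick off the eCore programs */
        e_close(&dev);
        e_finalize();
        return 0;
    }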

SLIDE 19

Computation Phase

• Grid sizes of 20 × X; the width of 20 was decided based on register availability
• Two rows of grid points buffered into registers while performing FMADDs
• Continuous runs of Fused Multiply-Add (FMADD) interleaved with 64-bit loads/stores (see the sketch below)
• 5 grid points accumulated at a time
• Each grid point loaded into a register only once
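A sketch of the accumulation pattern in C, with fmaf() from <math.h> standing in for the eCore's FMADD instruction; the variable names are ours, and the real hand-tuned code keeps two full rows of the grid in registers:

    #include <math.h>

    /* One point update mirroring the equation on slide 17:
       t_jp1..t_im1 are the loaded neighbours T[i][j+1], T[i][j],
       T[i][j-1], T[i+1][j], T[i-1][j] */
    static inline float point_update(float t_jp1, float t_c, float t_jm1,
                                     float t_ip1, float t_im1,
                                     const float w[5])
    {
        float acc = w[0] * t_jp1;
        acc = fmaf(w[1], t_c,   acc);   /* one FMADD per weighted neighbour */
        acc = fmaf(w[2], t_jm1, acc);
        acc = fmaf(w[3], t_ip1, acc);
        acc = fmaf(w[4], t_im1, acc);
        return acc;
    }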

SLIDE 20

Communication Phase

• Synchronization between neighbouring eCores
• Transfers started after the neighbour's computation phase completes
• DMA used for boundary transfers (sketch below)
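A device-side sketch of pushing one boundary row to the northern neighbour with the eCore DMA engine. e_dma_copy and e_get_global_address follow the e-lib API as we understand it; ROW_BYTES, the buffer names and the flag-based handshake are illustrative stand-ins for the synchronization the hand-tuned code actually uses:

    #include <e_lib.h>

    #define ROW_BYTES (20 * sizeof(float))

    float top_row[20];              /* our first interior row */
    volatile int neigh_ready = 0;   /* set remotely by the neighbour */

    void send_boundary(unsigned row, unsigned col)
    {
        while (!neigh_ready)        /* wait for neighbour's computation phase */
            ;
        neigh_ready = 0;

        /* halo buffer at the same address inside the neighbour's SRAM
           (edge handling omitted in this sketch) */
        void *dst = e_get_global_address(row - 1, col, top_row);
        e_dma_copy(dst, top_row, ROW_BYTES);   /* blocking DMA transfer */
    }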

SLIDE 21

Outline

1. Introduction
2. Architecture
3. Performance Experiments
4. Heat Stencil
   • Implementation
   • Results
5. Conclusions

SLIDE 22

Floating point performance

Single-core floating point performance in GFLOPS

Stencil evaluated for 50 iterations; 81-95% of peak performance achieved.

SLIDE 23

Floating point performance

64-core floating point performance in GFLOPS

*Lighter colors show performance without communication. 83% of peak performance achieved with communication.

SLIDE 24

Weak Scaling

Weak Scaling - Number of eCores vs Time

Number of eCores varied from 1 to 64; problem size varied from 60 × 60 to 480 × 480.

SLIDE 25

Strong Scaling

Strong Scaling - Number of eCores vs Speedup

Number of eCores varied from 1 to 64 with the problem size fixed.

SLIDE 26

Conclusions and Future Work

• Heat Stencil running at 65 GFLOPS (83% of peak)
• ≈ 32 GFLOPS/Watt assuming 2 W power consumption
• Double-buffering of boundary regions to overlap computation and communication
• The Epiphany platform holds high potential for HPC
• Considerable effort is needed to extract high performance
• Memory constraints are an important factor when designing algorithms
• Streaming algorithm to process larger grid sizes
• A future version of Epiphany is to have 4096 cores (70 GFLOPS/Watt)

SLIDE 27

Questions?