
Master Thesis: Exploring the Epiphany manycore architecture for the Lattice Boltzmann algorithm
Sebastian Raase
18th November, 2014

Preface
● This thesis is a cooperation between Volvo Penta AB and Högskolan i Halmstad.
● Volvo Penta designs and builds boat drive systems.
(source: www.sintef.no)

Motivation
● The Parallella system was advertised on Kickstarter as “A Supercomputer For Everyone”, and the campaign succeeded!
● Computational Fluid Dynamics (CFD) is the largest user of high-performance computing in engineering. [citation needed]
● Connecting those might provide interesting insights about the architecture, and as far as I know, nobody has done it before.
(Of course, it might also have to do with me needing a master thesis to finish my degree, HH having access to the Parallella systems, and Volvo Penta being interested in CFD…)

I will talk about:
● Computational Fluid Dynamics
● Lattice Boltzmann algorithm
● Adapteva Epiphany and Parallella board
● Implementation
● Results
● Conclusion

Computational Fluid Dynamics
● uses numerical methods to analyze fluid flows
  → both gases and liquids are fluids
● widespread applications in aerodynamics, architecture, automotive, chemistry, meteorology, navy, …
● computationally very intensive
  → high-performance computing, parallelization, …
● focus on a single, particle-based algorithm
  → Lattice Boltzmann

Lattice Boltzmann algorithm (I)
● based on the Boltzmann equation, late 19th century:

\[
\frac{\partial f}{\partial t} = \left.\frac{\partial f}{\partial t}\right|_{\text{collision}} + \left.\frac{\partial f}{\partial t}\right|_{\text{diffusion}} + \left.\frac{\partial f}{\partial t}\right|_{\text{external}}
\]

● f = f(x, v, t) describes the particle probability density in phase space (i.e. at specific position, velocity and time)
● collision term is particularly hard to solve
● Particle distribution is only affected by collisions (particle-particle interactions), diffusion (particle movement), and external forces (environment), nothing else.

Lattice Boltzmann algorithm (II)
● phase space f(x, v, t) is discretized (lattice models)
  → discrete positions, velocities and time (and angles)
● named DmQn (m: dimensions, n: number of discrete velocities)
● focus on two models: D2Q9 and D3Q19
[Figures: D3Q19 (single node), D2Q9 (single node)]
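To make the DmQn naming concrete, the following minimal sketch enumerates the discrete velocities of both models. It assumes the usual construction for these two lattices: components taken from {-1, 0, 1}, keeping only vectors whose squared length is at most 2. The function name `velocity_set` and the resulting ordering are illustrative and need not match the numbering used in the thesis.

```python
from itertools import product


def velocity_set(dims, max_norm_sq=2):
    """Return all velocity vectors with components in {-1, 0, 1}
    whose squared length does not exceed max_norm_sq."""
    return [
        c
        for c in product((-1, 0, 1), repeat=dims)
        if sum(x * x for x in c) <= max_norm_sq
    ]


# Illustrative names; the ordering is an assumption, not the thesis numbering.
D2Q9 = velocity_set(2)   # rest vector, 4 axis directions, 4 diagonals
D3Q19 = velocity_set(3)  # rest vector, 6 axis directions, 12 edge diagonals

print(len(D2Q9), len(D3Q19))  # prints: 9 19
```

In 3D the length limit is what excludes the eight cube-corner directions such as (1, 1, 1), leaving 19 of the 27 candidates.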

Adapteva Epiphany (I)
● two-dimensional mesh network-on-chip consisting of eCore processor nodes
● low power (16 cores @ 800 MHz < 1 W)
● single shared, flat 32-bit address space
● 1 MiB address space per node, 64x64 (=4096) nodes maximum
[Figure: mesh structure (source: Ep. Arch. Ref.)]
[Figure: mesh address format with fields row, column, local address; bit labels 31, 25, 19, 0]
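The address format can be illustrated with a small sketch. It assumes the field widths implied by the slide: 1 MiB per node gives a 20-bit local address (bits 19 to 0), and 64x64 nodes give 6 bits each for column (bits 25 to 20) and row (bits 31 to 26). The function names and the example values (row 32, column 8, offset 0x4000) are illustrative only.

```python
ROW_BITS = 6      # 64 rows
COL_BITS = 6      # 64 columns
LOCAL_BITS = 20   # 1 MiB of address space per node


def encode(row, col, local):
    """Compose a 32-bit global address from mesh coordinates and a local offset."""
    if not (0 <= row < 1 << ROW_BITS and 0 <= col < 1 << COL_BITS):
        raise ValueError("mesh coordinate out of range")
    if not 0 <= local < 1 << LOCAL_BITS:
        raise ValueError("local address out of range")
    return (row << (COL_BITS + LOCAL_BITS)) | (col << LOCAL_BITS) | local


def decode(addr):
    """Split a 32-bit global address into (row, col, local)."""
    local = addr & ((1 << LOCAL_BITS) - 1)
    col = (addr >> LOCAL_BITS) & ((1 << COL_BITS) - 1)
    row = (addr >> (COL_BITS + LOCAL_BITS)) & ((1 << ROW_BITS) - 1)
    return row, col, local


addr = encode(32, 8, 0x4000)  # illustrative coordinates and offset
print(hex(addr))              # prints: 0x80804000
print(decode(addr))           # prints: (32, 8, 16384)
```

Because the coordinates sit in the upper address bits, any core can reach another core's local memory with an ordinary load or store to the corresponding global address.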

Adapteva Epiphany (II)
● eCores are 32-bit RISC processors with IALU (integer) and FPU (float)
  → single-precision FPU only
● only 32 KiB local memory per node, divided into independent 8 KiB banks
● timers allow counting of events, allowing clock-cycle precise runtime measurements
[Figure: mesh node containing the eCore (Reg. File, IALU, FPU), 32 KiB local memory, 2 timers, 2ch DMA controller, and mesh controller]

Parallella-16 board
● currently available “reference” platform for the Epiphany architecture
● Xilinx Zynq (1 GHz, dual-core ARM Cortex-A9) as host
● 16-core Epiphany E16G3 chip connected using FPGA logic
● 32 MiB of (Epiphany-)external shared memory
[Figure: Parallella-16 board, Epiphany chip is marked red (source: www.parallella.org)]

Implementation (I)
● D2Q9 and D3Q19 implementations completely separate
● each implementation consists of two applications
● host application:
  – single-threaded ARM Linux application running on the Zynq
  – loads eCores with code and starts them
  – reads lattice data (results) from shared memory
  – creates density/velocity grayscale images and GIF animations
  – writes lattice data and time measurements to ASCII files

Implementation (II)
● Epiphany application:
  – single-threaded, but running on all active eCores simultaneously
  – works on a part of the lattice (block), which is always kept in local memory
  – after iteration, result may be copied to shared memory (→ to the host)
  – only next-neighbor communication (except for shared memory)
  – all cores run in lockstep, using barriers
[Figure: blocking approaches, showing how cores 0 to 15 are laid out over the lattice in 2D (x, y) and 3D (x, y, z); bold: domain boundaries]
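The next-neighbor communication pattern can be sketched as follows. This assumes the 16 blocks are arranged as a 4x4 grid numbered row by row, which is only one of the possible blocking approaches, and it assumes non-periodic domain boundaries. The function `neighbors` and its `diagonals` switch are hypothetical; whether diagonal blocks need to exchange data depends on the lattice model and the implementation.

```python
def neighbors(core, rows=4, cols=4, diagonals=False):
    """Return the indices of the blocks adjacent to `core` in a rows x cols grid.

    Blocks are numbered row by row (an assumption for illustration).
    Blocks on the domain boundary simply have fewer neighbors.
    """
    r, c = divmod(core, cols)
    offsets = [(-1, 0), (1, 0), (0, -1), (0, 1)]
    if diagonals:
        offsets += [(-1, -1), (-1, 1), (1, -1), (1, 1)]
    result = []
    for dr, dc in offsets:
        nr, nc = r + dr, c + dc
        if 0 <= nr < rows and 0 <= nc < cols:
            result.append(nr * cols + nc)
    return sorted(result)


print(neighbors(0))                   # prints: [1, 4]
print(neighbors(5))                   # prints: [1, 4, 6, 9]
print(neighbors(5, diagonals=True))   # prints: [0, 1, 2, 4, 6, 8, 9, 10]
```

The number of communication partners per core is bounded regardless of how many cores are active, which fits the observation that adding cores beyond the first few does not slow the calculation further.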

Results (I)
● very consistent results
● excellent scalability for the calculations (growing problem)
  – calculation times (almost) independent of number of cores
  – tiny 3% speed decrease* going from one to four active cores, but no further speed decrease (next-neighbor communication only)
● linear scalability for transmitting lattice to host
  – increased number of blocks (cores) → increased lattice size
(* 2D case, 24x24 block size, -O3 optimization level)

Results (II)
● good computational performance in 2D
  – 2.8 MLU/s* per core (45 MLU/s @ 16 cores)
  – in 2005, a single-core AMD Opteron was measured at 7 MLU/s, but in double precision
● much less impressive for 3D case
  – 0.34 MLU/s per core (5.4 MLU/s @ 16 cores)
  – in 2012, a single Nvidia Tesla achieved 650 MLU/s...
● comparison numbers were measured on much larger lattices…
(* MLU/s: millions of lattice node updates per second)

Results (III)
● very small local memory, split into 8 KiB code / 24 KiB lattice
  – at most 682 (2D, ~26x26) or 323 (3D, ~7x6x7) nodes/core
  – bulk-based optimization ineffective in 3D (too few bulk nodes), but 2.2x speedup in 2D compared to naive approach
    → more with large blocks
  – maximum lattice size 384 KiB @ 16 cores
[Figure: comparing naive / bulk-optimized algorithm]
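The node limits above follow from simple arithmetic, sketched below. The sketch assumes each lattice node stores exactly one single-precision (4-byte) value per discrete velocity and nothing else; this assumption reproduces the numbers on the slide. All names are illustrative.

```python
KIB = 1024
BYTES_PER_VALUE = 4       # single-precision float (assumed storage format)
LATTICE_BYTES = 24 * KIB  # local memory left for lattice data per core
CORES = 16


def max_nodes(lattice_bytes, q):
    """Largest number of nodes that fit if each node holds q values."""
    return lattice_bytes // (q * BYTES_PER_VALUE)


print(max_nodes(LATTICE_BYTES, 9))    # D2Q9:  prints 682
print(max_nodes(LATTICE_BYTES, 19))   # D3Q19: prints 323
print(CORES * LATTICE_BYTES // KIB)   # total lattice size in KiB: prints 384
```

A 26x26 block (676 nodes) and a 7x6x7 block (294 nodes) are the block shapes named on the slide that stay within these limits.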

Results (IV)
● very small bandwidth to shared memory
  – measured 85 MiB/s (i.e. ~270 lattices/second @ 16-core)
  – theoretical maximum is 600 MiB/s, or 200 MiB/s if non-optimal accesses*
    → not enough to stream lattice each iteration
  – no overlap possible between calculation and transmission…
(* but further limited by current FPGA logic)
[Figure: computation / host copy comparison (2D, 24x24 block size, 16 cores)]
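The "~270 lattices/second" figure can be checked with a short calculation. The sketch assumes the 2D case with 24x24 blocks on 16 cores, nine single-precision values per node, and that the 45 MLU/s aggregate rate from Results (II) applies to this lattice size. All names are illustrative.

```python
MIB = 1024 * 1024
CORES = 16
NODES_PER_CORE = 24 * 24   # 2D block size from the slide
BYTES_PER_NODE = 9 * 4     # D2Q9, single precision (assumed storage format)

lattice_nodes = CORES * NODES_PER_CORE
lattice_bytes = lattice_nodes * BYTES_PER_NODE


def lattices_per_second(bandwidth_mib_s):
    """Full-lattice copies per second at a given shared-memory bandwidth."""
    return bandwidth_mib_s * MIB / lattice_bytes


copies = lattices_per_second(85)     # measured bandwidth
iterations = 45e6 / lattice_nodes    # assumed: 45 MLU/s applies to this lattice

print(round(copies))                 # roughly 270 lattices per second
print(round(iterations))             # roughly 4900 iterations per second
print(round(iterations / copies))    # computation outpaces copying about 18x
```

Under these assumptions even the theoretical 600 MiB/s would allow fewer than 2000 copies per second, so streaming the full lattice after every iteration would still stall the computation.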

Conclusion
● computations show excellent scalability, fair performance, and still room for optimization
● too small local memory, too little external bandwidth
  → currently not suitable for Lattice Boltzmann algorithm
● However: This work used the very first publicly available Epiphany chip.

The End.
