

  1. A RISC-V Extension for the Fresh Breeze Architecture Jack B. Dennis Willie Lim MIT CSAIL First Workshop on Computer Architecture Research with RISC-V (CARRV 2017) Boston, MA, USA, October 14, 2017

  2. Talk Outline ▪ Fresh Breeze – Architecture, Programming Model, Codelets, and Memory Model ▪ RISC-V Extensions ▪ AutoBuffer Implementation ▪ Trees of Chunk Based Memory Structure ▪ Garbage Collection ▪ Instructions for Tasking ▪ Observations ▪ Current System Limitations ▪ Future Work 2

  3. Fresh Breeze ▪ A Programming Model and System Architecture ▪ Two Key Concepts: ▪ Tree-of-chunks Memory Model ▪ The Codelet Model for fine-grain tasking ▪ Goals: ▪ Energy-efficient architecture for exascale computing ▪ Satisfy requirements for modular software construction 3

  4. A 4-Core Fresh Breeze System 4

  5. The Fresh Breeze Memory Model ▪ A chunk (1024 bits) holds 16 64-bit scalars or handles (to chunks). ▪ All data and program objects are represented by trees of fixed-size memory chunks. ▪ Memory chunks are write-once. ▪ Distributed memory hierarchy without consistency issues. ▪ Low-overhead memory management with reference-count garbage collection. ▪ Global name space for all data and programs. ▪ Supports capability-based security. 5
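The tree-of-chunks model above can be sketched in plain Java. This is an illustrative assumption about the addressing scheme, not the authors' implementation: each chunk holds 16 slots, so successive 4-bit digits of an element index select a child at each level of the tree.

```java
// Sketch (assumed layout, not the authors' code): reading element i from a
// tree of write-once 16-slot chunks, as in the Fresh Breeze memory model.
final class Chunk {
    final long[] slots = new long[16];      // 16 x 64-bit items per 1024-bit chunk
    final Chunk[] children = new Chunk[16]; // non-null where a slot holds a handle
}

final class ChunkTree {
    // Smallest depth whose 16-ary tree covers n leaf elements.
    static int depthFor(long n) {
        int d = 1;
        long cap = 16;
        while (cap < n) { cap *= 16; d++; }
        return d;
    }

    // Follow 4-bit digits of index i from the root down to a leaf slot.
    static long read(Chunk root, long i, int depth) {
        Chunk c = root;
        for (int level = depth - 1; level > 0; level--) {
            c = c.children[(int) ((i >> (4 * level)) & 0xF)];
        }
        return c.slots[(int) (i & 0xF)];
    }
}
```

With this layout a 4096-element vector (as in the dot-product example later in the talk) fits exactly in a depth-3 tree, since 16^3 = 4096.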

  6. Fine Grain Tasking with Codelets ▪ Fresh Breeze Codelets: ▪ The unit for scheduling processing resources. ▪ Contains a block of instructions executed to completion; non-preemptable. ▪ Activated by availability of input data objects; signals successor codelets when results are ready (data flow). ▪ Hardware-supported scheduling and load distribution. 6

  7. Running funJava Programs ▪ funJava is a functional subset of Java. ▪ The funJava compiler has four stages: ▪ javac: Produce bytecode files from the funJava code. ▪ Convert: Construct a Data Flow Graph (DFG) from the bytecode for each Java method. ▪ Transform: Identify loops amenable to data-parallel implementation (map or reduce) and construct a set of DFGs representing abstract codelets for each method. ▪ Generate: Convert each abstract codelet DFG into real codelets. ▪ Load and run codelets on a target machine using the Kiva simulator. 7

  8. Tasking Model – Spawning Team of Workers 8

  9. funJava Sample Code

     long dotProduct(long[] a, long[] b, int len) {
         long sum = 0;
         for (int i = 0; i < len; i++) {
             sum = sum + a[i] * b[i];
         }
         return sum;
     }
     9

  10. Tree of Tasks (Map) — The Master Task (vector length = 4096) spawns Worker 0 (i: 0..255), Worker 1 (i: 256..511), …, Worker 15 (i: 3840..4095). Each worker spawns Leaf 0 (i: 0..15), Leaf 1 (i: 16..31), …, Leaf 15 (i: 240..255). Each leaf computes the sum of a[i] * b[i] for its range of i. 10

  11. Tree of Tasks (Reduce) — The Root (vector length = 4096) computes the sum of the results of its child nodes: Worker 0 (i: 0..255), Worker 1 (i: 256..511), …, Worker 15 (i: 3840..4095). Each worker computes the sum of the results of its leaf nodes: Leaf 0 (i: 0..15), Leaf 1 (i: 16..31), …, Leaf 15 (i: 240..255). Each leaf computes the sum of a[i] * b[i] for its range of i. 11
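The map/reduce task tree of slides 10 and 11 can be mimicked with a short recursive sketch. Sequential recursion here stands in for hardware task spawning; the 16-way split and 16-element leaves follow the slides, while the function and class names are illustrative.

```java
// Illustrative sketch of the dot-product task tree: the master splits the
// index range into up to 16 worker ranges, each worker into 16-element
// leaf ranges; leaves compute partial sums of a[i] * b[i], and each level
// sums (reduces) its children's results.
final class DotProductTree {
    static long dotProduct(long[] a, long[] b, int lo, int hi) {
        if (hi - lo <= 16) {                      // leaf task: at most 16 elements
            long sum = 0;
            for (int i = lo; i < hi; i++) sum += a[i] * b[i];
            return sum;
        }
        long sum = 0;
        int step = Math.max(16, (hi - lo) / 16);  // split into up to 16 child tasks
        for (int lo2 = lo; lo2 < hi; lo2 += step) {
            sum += dotProduct(a, b, lo2, Math.min(lo2 + step, hi));
        }
        return sum;
    }
}
```

For a 4096-element vector this yields exactly the shape on the slides: 16 workers of 256 indices each, each with 16 leaves of 16 indices.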

  12. Traverse Codelet

     Codelet 1:
     19]: LMove S0: 0; -> D: 26
     20]: IMove S0: 2; -> D: 28
     21]: IMove S0: 41; -> D: 29
     22]: SyncCreate Code: 3; sigCnt: 11; itemCnt: 41 -> D: 26; argsBase: 13 argsCnt: 2
     23]: ISet 0 -> D: 28
     24]: IfIGeq S0: 28; S1: 41; Lab: 41
     25]: ISub S0: 41; S1: 0; LV: 1; -> D: 52
     26]: IfINeq S0: 28; S1: 52; Lab: 29
     27]: IMove S0: 42; -> D: 53
     28]: Jump Lab: 30
     29]: IMove S0: 37; -> D: 53
     30]: ReadFull H: 4; Off: 28; -> D: 54
     31]: ReadFull H: 6; Off: 28; -> D: 56
     32]: IMul S0: 28; S1: 37; -> D: 58
     33]: IAdd S0: 8; S1: 58; -> D: 59
     34]: LMove S0: 54; -> D: 30
     35]: LMove S0: 56; -> D: 32
     36]: IMove S0: 59; -> D: 34
     37]: IMove S0: 53; -> D: 35
     38]: TaskSpawn Code: 2; argsBase: 13; argsCnt: 5
     39]: IAdd S0: 28; S1: 0; LV: 1; -> D: 28
     40]: Jump Lab: 24
     41]: TaskQuit
     12

  13. Leaf Codelet

     Codelet 2:
      0]: ISet 1 -> D: 10
      1]: ISet 0 -> D: 11
      2]: LSet 0 -> D: 12
      3]: IMove S0: 11; -> D: 14
      4]: LMove S0: 12; -> D: 16
      5]: IfIGeq S0: 14; S1: 9; Lab: 13
      6]: IAdd S0: 14; S1: 8; -> D: 15
      7]: ReadFull H: 6; Off: 14; -> D: 18
      8]: ReadFull H: 4; Off: 14; -> D: 20
      9]: LMul S0: 18; S1: 20; -> D: 22       (a[i] * b[i])
     10]: IAdd S0: 14; S1: 0; LV: 1; -> D: 14
     11]: LAdd S0: 16; S1: 22; -> D: 16       (sum of products)
     12]: Jump Lab: 5
     13]: SyncUpdate Sync: 0; Off: 2; Data: 16
     14]: TaskQuit
     13

  14. Reduce Codelet

     Codelet 3:
     0]: ISet CV: 1 -> D: 7
     1]: ISet CV: 0 -> D: 10
     2]: LSet CV: 0 -> D: 8
     3]: IfIGeq S0: 10; S1: 5; Lab: 8
     4]: ReadFull H: 0; Off: 10; -> D: 12
     5]: LAdd S0: 8; S1: 12; -> D: 8          (sum of sums of products)
     6]: IAdd S0: 10; S1: 7; LV: 1; -> D: 10
     7]: Jump Lab: 3
     8]: SyncUpdate Sync: 4; Off: 4; Data: 8
     9]: TaskQuit
     14

  15. RISC-V Extensions ▪ Instructions for building and accessing data objects using the Tree-of-chunks Memory Model ▪ Instructions to support spawning and coordination of worker tasks. 15

  16. Task Record ▪ Represents a task in the Core Scheduler's queue ▪ Moved between Core Schedulers' queues as directed by the Load Balancer ▪ Fields:
      codeIdx (16 bits): the index of the codelet to be executed
      argsCnt (4 bits): the number of arguments needed for the task
      argsList (64 bits): the handle of a chunk containing the arguments
      16
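The Task Record's three fields map naturally onto a small class. A minimal sketch, assuming only the field names and widths from the slide; the class itself is illustrative, with each constructor argument masked to its stated width.

```java
// Sketch of the Task Record from slide 16; field widths from the slide,
// class and masking are illustrative.
final class TaskRecord {
    final int codeIdx;    // 16 bits: index of the codelet to be executed
    final int argsCnt;    //  4 bits: number of arguments needed for the task
    final long argsList;  // 64 bits: handle of the chunk holding the arguments

    TaskRecord(int codeIdx, int argsCnt, long argsList) {
        this.codeIdx = codeIdx & 0xFFFF;  // keep low 16 bits
        this.argsCnt = argsCnt & 0xF;     // keep low 4 bits
        this.argsList = argsList;
    }
}
```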

  17. The AutoBuffer ▪ Used in place of the usual level one cache. ▪ Holds several memory chunks for direct access by the processor. ▪ For direct access, each processor register has an extra index field and valid bit. ▪ index and valid are set by the ChunkCreate instruction, or when the chunk is loaded into the AutoBuffer in response to a read instruction. 17

  18. Memory Instructions

     ChunkCreate ( dest )
     WriteFull ( handle, index, value )
     WriteLeft ( handle, index, value )
     WriteRight ( handle, index, value )
     ReadFull ( dest, handle, index )
     ReadLeft ( dest, handle, index )
     ReadRight ( dest, handle, index )
     18

  19. Garbage Collection ▪ Automatically reclaims free chunk space ▪ Garbage collection is done by hardware ▪ Uses a reference-count (RC) scheme; the RC is: ▪ Held as metadata for each chunk in each Memory Unit ▪ Accessed using the handle of the chunk ▪ Initially zero ▪ Incremented by one by the ChunkCreate instruction and whenever the handle is copied ▪ Decremented when a task holding the chunk handle terminates ▪ When the RC reaches zero, the chunk is marked free and the reference counts of any chunks referenced by handles in the freed chunk are decremented 19
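The reference-count rules above can be expressed as a short software model. This is a sketch of the scheme's behavior, not the hardware design: the maps standing in for per-chunk metadata and the method names are assumptions; only the increment/decrement rules come from the slide.

```java
import java.util.HashMap;
import java.util.Map;

// Software model of the hardware reference-count scheme: each chunk carries
// a count; ChunkCreate and handle copies increment it, task termination
// decrements it, and freeing a chunk recursively decrements the counts of
// the chunks its handles reference.
final class RefCountGC {
    final Map<Long, Integer> refCount = new HashMap<>();
    final Map<Long, long[]> handlesIn = new HashMap<>(); // handles stored in each chunk

    void create(long handle, long[] childHandles) { // ChunkCreate: RC starts at 1
        refCount.put(handle, 1);
        handlesIn.put(handle, childHandles);
    }

    void copyHandle(long handle) {                  // a handle is copied: RC + 1
        refCount.merge(handle, 1, Integer::sum);
    }

    void release(long handle) {                     // task holding handle terminates
        int rc = refCount.merge(handle, -1, Integer::sum);
        if (rc == 0) {                              // chunk becomes free
            refCount.remove(handle);
            for (long child : handlesIn.remove(handle)) release(child);
        }
    }
}
```

Note the recursive release: because chunks are write-once and form trees, freeing a parent can cascade down to its children, exactly as the last bullet describes.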

  20. Sync Chunk — A sync chunk (1024 bits) contains 16 memory items (64-bit words):
      Item 0: handle of the sync data chunk; null if argsCnt is zero
      Items 1–14: hold up to 14 argument items (values or handles)
      Item 15: Sync Control Word
      20

  21. Sync Control Word
      codeIdx (16 bits): index of the codelet
      flags (16 bits): a boolean flag for each of itmCnt worker tasks; used to check that each task contributes exactly one result item to the sync data chunk
      -- (2 bits): not used
      argsCnt (6 bits): number of arguments
      sigCnt (6 bits): not used (for streams)
      sigIdx (6 bits): not used (for streams)
      itmCnt (6 bits): number of data items expected
      itmIdx (6 bits): count of items received (increment counter)

      Bit layout (bit 63 down to bit 0):
      codeIdx [63:48] | flags [47:32] | -- [31:30] | argsCnt [29:24] | sigCnt [23:18] | sigIdx [17:12] | itmCnt [11:6] | itmIdx [5:0]
      21

  22. Sync Data Chunk ▪ A chunk holding at most 16 handles to data items ▪ Example: N data item handles referencing M data items (scalars and handles) in total — each of the 16 slots is either a scalar or a handle to a chunk of scalars and handles, which may itself be the root of a tree of chunks. 22

  23. Instructions for Tasking

     SyncCreate ( dest, code, count )
         Creates a SyncChunk for a task to be executed upon completion of the current task.
     TaskSpawn ( code, args, sync, index )
         Puts a TaskRecord in the scheduler queue.
     SyncUpdate ( sync, index, result )
         Puts a worker result in the data chunk of a SyncChunk.
     TaskQuit ()
     23
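The join behavior behind SyncCreate and SyncUpdate can be modeled in a few lines. A sketch under stated assumptions: a plain countdown and callback stand in for the hardware scheduler, and the reduction-by-sum successor mirrors the dot-product example; none of this is the authors' implementation.

```java
import java.util.Arrays;
import java.util.function.LongConsumer;

// Software model of the sync-chunk join: SyncCreate sets up a chunk with an
// expected item count, workers deposit results with SyncUpdate, and when the
// last result arrives the successor codelet runs (here: summing the results,
// as the Reduce codelet does).
final class SyncChunk {
    final long[] data;            // sync data slots, one per worker result
    int remaining;                // items still outstanding (itmCnt countdown)
    final LongConsumer successor; // codelet to run once all results are in

    SyncChunk(int itemCount, LongConsumer successor) {   // models SyncCreate
        this.data = new long[itemCount];
        this.remaining = itemCount;
        this.successor = successor;
    }

    void syncUpdate(int index, long result) {            // models SyncUpdate
        data[index] = result;
        if (--remaining == 0) {                          // last result arrived
            successor.accept(Arrays.stream(data).sum()); // activate successor
        }
    }
}
```

This captures the data-flow activation rule from slide 6: the successor is enabled only by the availability of all of its input data.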

  24. Observations ▪ From machine learning and linear algebra performance studies: for sufficiently large input data, computation performance scales linearly as the number of cores increases, due to: ▪ The ability to decompose computations into many parallel data-driven tasks. ▪ Efficient load balancing using hardware. ▪ A similar observation holds for experiments running several different funJava programs at the same time. ▪ Tasks can be from the same or from different computations. 24

  25. Current System Limitations ▪ Running out of JVM heap space. ▪ Long simulation time. ▪ The need for garbage collection by the Java VM between successive simulation runs. 25

  26. Future Work ▪ Multi-host version of Kiva ▪ Compiler enhancements to support streams and transactions ▪ Build an FPGA-based prototype system ▪ Model a Fresh Breeze multi-core processor with extended RISC-V Processing Cores. ▪ Use BlueDBM facility in the Computation Structures Group (CSG) of MIT CSAIL 26
