COSMOS: Coordination of High-Level Synthesis and Memory Optimization for Hardware Accelerators

Luca Piccolboni, Paolo Mantovani, Giuseppe Di Guglielmo, Luca Carloni
Columbia University, New York, USA
ACM/IEEE CODES+ISSS 2017, Seoul, South Korea

Hardware Accelerators: Motivations
• Hardware accelerators are devices designed and optimized to realize very specific functionalities
[Figure: generality versus efficiency. Labels: General-Purpose Processor Cores, Hardware Accelerators, DianNao [T. Chen et al., ASPLOS’14].]

Hardware Accelerators: Architecture
[Figure: an accelerator with Component #1, Component #2, … Component #K and an on-chip interconnect. A component consists of the Component Interface, the Component Logic (Loop #1 … Loop #N), the Component Datapath, and a Private Local Memory (PLM) organized in banks.]

Hardware Accelerators: High-Level Synthesis (HLS)
[Figure: high-level synthesis takes the SystemC specification of a component (Component Interface, Component Logic with Loop #1 … Loop #N, Component Datapath, PLM banks) and produces RTL. Each knob configuration (conf. #1, conf. #2, …) gives one RTL implementation, plotted as cost (area) versus performance (latency). The plot marks the Pareto-optimal implementations and the Pareto-dominated ones.]
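The distinction between Pareto-optimal and Pareto-dominated implementations can be made concrete with a small filter. This is a minimal sketch, not the authors' tool: the type `Impl` and the functions `dominates` and `pareto_optimal` are hypothetical names, and both latency and area are assumed to be minimized.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One RTL implementation produced by one knob configuration (hypothetical type).
struct Impl {
    double latency_ms;  // performance axis: lower is better
    double area_mm2;    // cost axis: lower is better
};

// True if `a` is no worse than `b` on both axes and strictly better on at least one.
bool dominates(const Impl& a, const Impl& b) {
    return a.latency_ms <= b.latency_ms && a.area_mm2 <= b.area_mm2 &&
           (a.latency_ms < b.latency_ms || a.area_mm2 < b.area_mm2);
}

// Keep only the implementations that no other implementation dominates.
// Quadratic in the number of points, which is fine for a sketch.
std::vector<Impl> pareto_optimal(const std::vector<Impl>& all) {
    std::vector<Impl> front;
    for (std::size_t i = 0; i < all.size(); ++i) {
        bool dominated = false;
        for (std::size_t j = 0; j < all.size() && !dominated; ++j) {
            dominated = (i != j) && dominates(all[j], all[i]);
        }
        if (!dominated) front.push_back(all[i]);
    }
    return front;
}
```

The result preserves the input order, so the points can be mapped back to the knob configurations that produced them.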

Hardware Accelerators: High-Level Synthesis (HLS)
Which knobs can be used to obtain several RTL implementations?
1. Loop unrolling

Original loop:
    for (k = 0; k < N; ++k)
        a[k] = b[k] + c[k];

After applying unrolling by a factor of 2 (N assumed even):
    for (k = 0; k < N; k += 2) {
        a[k+0] = b[k+0] + c[k+0];
        a[k+1] = b[k+1] + c[k+1];
    }

[Figure: the addition of b[k] and c[k] into a[k], before and after unrolling, with the resulting RTL implementations plotted as cost (area) versus performance (latency).]
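The unrolling transformation on this slide can be written out by hand to check that it preserves the result. This is a minimal sketch under stated assumptions: the function names `add_rolled` and `add_unrolled2` are hypothetical, the element type is `int`, and an epilogue is added for odd lengths, which the slide does not show. In practice an HLS tool would normally apply unrolling through a tool-specific directive rather than through manual rewriting.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Reference loop: one addition per iteration.
void add_rolled(std::vector<int>& a, const std::vector<int>& b,
                const std::vector<int>& c) {
    assert(a.size() == b.size() && b.size() == c.size());
    for (std::size_t k = 0; k < a.size(); ++k)
        a[k] = b[k] + c[k];
}

// Same loop unrolled by 2: two independent additions per iteration, which a
// scheduler may run in parallel if the memories offer enough ports.
void add_unrolled2(std::vector<int>& a, const std::vector<int>& b,
                   const std::vector<int>& c) {
    assert(a.size() == b.size() && b.size() == c.size());
    const std::size_t n = a.size();
    std::size_t k = 0;
    for (; k + 1 < n; k += 2) {
        a[k + 0] = b[k + 0] + c[k + 0];
        a[k + 1] = b[k + 1] + c[k + 1];
    }
    // Epilogue: handles the last element when n is odd.
    if (k < n)
        a[k] = b[k] + c[k];
}
```

Each unrolled iteration reads two elements of `b` and two of `c`, which is why the number of memory ports becomes the next knob to consider.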

Hardware Accelerators: High-Level Synthesis (HLS)
Which knobs can be used to obtain several RTL implementations?
2. Memory ports
[Figure: a Private Local Memory (PLM) with four banks and two ports (port 1, port 2). Increasing the number of ports to four (port 1 … port 4) gives a PLM with eight banks. The resulting RTL implementations are plotted as cost (area) versus performance (latency).]
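One common way to serve several accesses per cycle from a banked memory is cyclic interleaving, where consecutive array elements are placed in consecutive banks. The sketch below illustrates that idea only. It is an assumption made for illustration and not necessarily how the PLMs in COSMOS are organized, and the names `BankAddr`, `map_cyclic`, and `conflict_free` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Location of one array element inside a banked memory (hypothetical layout).
struct BankAddr {
    std::size_t bank;    // which bank holds the element
    std::size_t offset;  // word index inside that bank
};

// Cyclic interleaving: element i goes to bank i mod num_banks.
BankAddr map_cyclic(std::size_t index, std::size_t num_banks) {
    assert(num_banks > 0);
    return BankAddr{index % num_banks, index / num_banks};
}

// True if the `unroll` consecutive accesses starting at element `k` all fall
// in different banks, so each bank sees at most one access in that cycle.
bool conflict_free(std::size_t k, std::size_t unroll, std::size_t num_banks) {
    for (std::size_t i = 0; i < unroll; ++i)
        for (std::size_t j = i + 1; j < unroll; ++j)
            if (map_cyclic(k + i, num_banks).bank ==
                map_cyclic(k + j, num_banks).bank)
                return false;
    return true;
}
```

Under this mapping, an unroll factor of 4 is conflict free with 4 banks but not with 2, which mirrors the slide's point that more parallel accesses require more ports and more banks.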

Motivational Examples
• Performing an accurate and exhaustive design-space exploration for a hardware accelerator is complex:
1. HLS tools do not always support the generation (and optimization) of the private local memories

Motivational Examples: Need for multi-port memories
[Figure: area (mm²) versus effective latency (ms) for the Gradient component with 1, 2, 4, and 8 ports. Using standard memories: latency span 1.4×, area span 1.2×. Using multi-port memories: latency span 7.9×, area span 3.7×.]

Motivational Examples
2. The algorithms adopted by HLS tools are based on heuristics that make it hard to set the knobs

Motivational Examples: Unpredictability of HLS tools
[Figure: area (mm²) versus effective latency (ms) for the Gradient component with 1, 2, 4, and 8 ports. An inset covering area 1.00 to 1.20 mm² and latency 0.32 to 0.48 ms labels the points by number of unrolls (2u through 10u, and 14u).]

Motivational Examples
3. HLS tools do not handle the simultaneous optimization of multiple components

Motivational Examples: Need for compositionality
[Figure: three plots. Grayscale: area (mm²) versus effective latency (ms). Gradient: area (mm²) versus effective latency (ms). Composition: area (mm²) versus effective throughput (1/ms), with points marked as Pareto or dominated.]
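Why per-component Pareto points do not automatically give Pareto-optimal compositions can be shown with a small enumeration. This is a minimal sketch under explicit assumptions that are not taken from the slides: the two components are assumed to work as a pipeline, so the throughput of the composition is limited by the slower component, and the area of the composition is assumed to be the sum of the component areas. The names `Impl`, `Composite`, `compose`, and `pareto` are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

struct Impl { double latency_ms; double area_mm2; };

// One pairing of implementation i of the first component with j of the second.
struct Composite {
    std::size_t i, j;
    double throughput_per_ms;  // higher is better
    double area_mm2;           // lower is better
};

// Enumerate all pairings. Assumption: pipeline, so the slowest stage sets the rate.
std::vector<Composite> compose(const std::vector<Impl>& first,
                               const std::vector<Impl>& second) {
    std::vector<Composite> out;
    for (std::size_t i = 0; i < first.size(); ++i)
        for (std::size_t j = 0; j < second.size(); ++j) {
            const double slowest = std::max(first[i].latency_ms, second[j].latency_ms);
            assert(slowest > 0.0);
            out.push_back({i, j, 1.0 / slowest, first[i].area_mm2 + second[j].area_mm2});
        }
    return out;
}

bool dominates(const Composite& a, const Composite& b) {
    return a.throughput_per_ms >= b.throughput_per_ms && a.area_mm2 <= b.area_mm2 &&
           (a.throughput_per_ms > b.throughput_per_ms || a.area_mm2 < b.area_mm2);
}

// Keep the pairings that no other pairing dominates.
std::vector<Composite> pareto(const std::vector<Composite>& all) {
    std::vector<Composite> front;
    for (std::size_t p = 0; p < all.size(); ++p) {
        bool dominated = false;
        for (std::size_t q = 0; q < all.size() && !dominated; ++q)
            dominated = (p != q) && dominates(all[q], all[p]);
        if (!dominated) front.push_back(all[p]);
    }
    return front;
}
```

With a fast and a slow implementation of the first component and a second component that is always the bottleneck, the pairings that use the fast first component are dominated: they cost more area without improving throughput. Exhaustive enumeration grows as the product of the per-component choices, which is the scalability issue a compositional methodology has to address.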

Contributions
• We propose COSMOS, an automatic methodology for the design-space exploration of complex accelerators
1. COSMOS is able to efficiently coordinate high-level synthesis and memory generator tools
2. COSMOS leverages a scalable compositional design-space exploration methodology

Contributions
Step 1: Component Characterization
[Figure: starting from the SystemC specification of the accelerator, Step 1 processes Component #1 … Component #K and yields, for each component, an area versus latency plot with region 1 and region 2 marked.]

Contributions
Step 2: Design-Space Exploration
[Figure: Step 2 takes the area versus latency regions (region 1, region 2) of components #1 … #K and yields the design space of the accelerator, plotted as area versus throughput.]

Component Characterization
• Goal: for each component of the accelerator, identify the regions with the Pareto-optimal implementations
[Figure: area (mm²) versus effective latency (ms) with points for 1 port, 2 ports, and 4 ports. Region 1 and region 2 are marked, with an upper-left point and a lower-right point annotated.]
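The upper-left and lower-right points annotated in the figure can be read as the corners that bound a region in the latency-area plane. The sketch below computes such corners for a group of implementations. It is an illustration under assumptions, not the authors' characterization procedure: how implementations are grouped into a region (for example by number of ports) is assumed, and the names `Point`, `Region`, and `bounding_region` are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

struct Point { double latency_ms; double area_mm2; };

// A region described by two corners in the latency-area plane.
struct Region {
    Point upper_left;   // smallest latency, largest area
    Point lower_right;  // largest latency, smallest area
};

// Bounding corners of a non-empty group of implementations.
Region bounding_region(const std::vector<Point>& impls) {
    assert(!impls.empty());
    Region r{impls.front(), impls.front()};
    for (const Point& p : impls) {
        r.upper_left.latency_ms  = std::min(r.upper_left.latency_ms,  p.latency_ms);
        r.upper_left.area_mm2    = std::max(r.upper_left.area_mm2,    p.area_mm2);
        r.lower_right.latency_ms = std::max(r.lower_right.latency_ms, p.latency_ms);
        r.lower_right.area_mm2   = std::min(r.lower_right.area_mm2,   p.area_mm2);
    }
    return r;
}
```

Describing a region by two corners keeps the characterization compact: only the extreme points need to be known, and the implementations in between can be explored later.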
