SNACKNOC: PROCESSING IN THE COMMUNICATION LAYER (PowerPoint PPT Presentation)
slide-1
SLIDE 1

SNACKNOC: PROCESSING IN THE COMMUNICATION LAYER

VLSI and Architecture Lab

Karthik Sangaiah, Michael Lui, Ragh Kuttappa, Baris Taskin, and Mark Hempstead

Feb 25th 2020

slide-2
SLIDE 2

Opportunistic Resources for Graduate Students

"Free leftovers" → Steak dinner

Opportunistically collecting snacks toward a meal.

slide-4
SLIDE 4

Opportunistic Resources in the CMP

Communication interconnect: Intel Skylake 8180 HCC [1]. NoC router: "free leftovers"; opportunistically collecting "snacks" to make a "meal".

What performance gain can we add by opportunistically "snacking" on CMP resources?

[1] Intel Skylake SP HCC, Wikichip.

slide-10
SLIDE 10

Quantifying Design Slack in the NoC

• NoC designed to minimize latency during heavy traffic
• NoC implementation can account for 60% to 75% of the miss latency [2]
• Study of NoC resource utilization on recent NoC designs:
  • 3 best-paper-nominated NoCs have similar performance: DAPPER [3], AxNoC [4], BiNoCHS [5]
  • Reducing resources substantially reduced performance
  • Further details of the study are in our paper
• Opportunities in Network-on-Chip slack: crossbar, network links, and internal buffers

[Figure: NoC router diagram highlighting the crossbar, network links, and internal buffers]

[2] Sanchez et al., ACM TACO, 2010. [3] Raparti et al., IEEE/ACM NOCS, 2018. [4] Ahmed et al., IEEE/ACM NOCS, 2018. [5] Mirhosseini et al., IEEE/ACM NOCS, 2017.

slide-16
SLIDE 16

Quantifying Design Slack in the NoC

Simulated a 16-core CMP with 4 benchmarks representing "low", "medium", "medium-high", and "high" traffic.

• Crossbar utilization:
  • Peak utilization (Graph500): 42%
  • Highest median utilization (Graph500): 13.3%
  • Median utilization, Router 5: 8.6%
• Link utilization:
  • Peak link utilization (Graph500): 18%
  • Highest median link utilization (LULESH): 3.3%
• Buffer utilization:
  • Raytrace: 4% of cycles have localized contention, with 10% utilization during contention
  • For 3M of the 2.4T flits forwarded, buffer utilization reaches 30-55% of total capacity

[Figure: Router 5 crossbar usage (%) over time (10^8 cycles); crossbar and link utilization plots]

The SnackNoC platform improves efficiency and performance of the CMP by offloading data-parallel workloads and "snacking" on network resources.
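The utilization figures above come from per-cycle activity counts binned into sampling windows. As a hedged illustration (our own sketch, not the paper's actual instrumentation), the function below turns a per-cycle busy trace into per-window crossbar utilization percentages:

```python
def crossbar_utilization(trace, window=10_000):
    """Compute per-window crossbar utilization percentages.

    trace  -- iterable of 0/1 flags, one per cycle; 1 means the crossbar
              forwarded a flit that cycle.
    window -- number of cycles per sampling window.
    """
    utilizations, busy = [], 0
    for cycle, used in enumerate(trace, start=1):
        busy += used
        if cycle % window == 0:
            utilizations.append(100.0 * busy / window)
            busy = 0
    return utilizations

# A window where the crossbar forwards flits in 42% of cycles reports 42.0
sample = crossbar_utilization([1] * 420 + [0] * 580, window=1000)
```

Median and peak statistics over the returned windows then give numbers comparable to those reported above.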

slide-17
SLIDE 17

Overview

• "Slack" of the Communication Fabric
• The SnackNoC Platform
• Experimental Results
• Conclusion and Future Considerations

slide-23
SLIDE 23

SnackNoC Platform Overview

• Goals:
  • Opportunistically "snack" on existing network resources for additional performance
  • Limited additional overhead to the uncore
  • Minimal or zero interference with CMP traffic
• Opportunistic NoC-based compute platform:
  • Limited dataflow engine
• Applications: data-parallel workloads used in scientific computing, graph analytics, and machine learning

[Images: Google Cloud TPU [7], Celerity RISC-V SoC [6], Intel Skylake 8180 HCC [1] interconnect, steak dinner]

[1] Intel Skylake SP HCC, Wikichip. [6] S. Davidson et al., IEEE Micro, 2018. [7] Jouppi et al., IEEE/ACM ISCA, 2017.

slide-29
SLIDE 29

SnackNoC System Overview

Added components to a traditional NoC:

• Central Packet Manager (CPM)
  • Assembles and issues instruction packets
  • Manages execution state of kernels
  • Located at a memory controller
• Router Compute Units (RCUs)
  • Lightweight accumulator-based processing element (PE) with instruction buffering and an ALU
  • Located in the router pipeline

Added features to a traditional NoC:

• CPU traffic priority arbitration
• Available NoC buffers as transient data storage

[Figure: 4x4 mesh of NoC routers with a PE per router, four memory controllers, the Central Packet Manager, and instruction/result data flits]

slide-30
SLIDE 30

NoC Router Modification and RCU Additions

• Router Compute Units (RCUs)
  • 32-bit accumulator-based processing element
  • Instruction re-ordering and buffering
• Modifications to input buffer queues, allocators, and crossbar

[Figure: NoC router diagram, marking components added to the baseline and components modified]


slide-34
SLIDE 34

CPU Traffic Priority Arbitration

• The primary function of the NoC is to transfer CPU core and memory traffic
• "Fair" allocators typically select traffic in round-robin order
• Allocators are modified to prioritize CPU traffic over SnackNoC instruction and data traffic
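A minimal sketch of this arbitration policy (our own illustration; the port count and function names are assumptions, not the paper's RTL): CPU flits win the output port round-robin, and a SnackNoC flit is granted only when no CPU flit is competing for it.

```python
NUM_PORTS = 5  # assumed 2D-mesh router: N, S, E, W, local

def allocate_output(requests, rr_ptr):
    """Grant one input port the output port for this cycle.

    requests -- list of (input_port, traffic_class) pairs requesting the
                output; traffic_class is "cpu" or "snack".
    rr_ptr   -- round-robin pointer keeping CPU arbitration fair.
    Returns (granted_port_or_None, new_rr_ptr).
    """
    cpu = [p for p, cls in requests if cls == "cpu"]
    if cpu:
        # CPU/memory traffic always has priority over SnackNoC traffic;
        # pick the CPU requester closest after the round-robin pointer
        grant = min(cpu, key=lambda p: (p - rr_ptr) % NUM_PORTS)
        return grant, (grant + 1) % NUM_PORTS
    snack = [p for p, cls in requests if cls == "snack"]
    if snack:
        # SnackNoC flits only "snack" on an otherwise idle output port
        return snack[0], rr_ptr
    return None, rr_ptr
```

With priority arbitration OFF, the same round-robin rule would simply run over both traffic classes together.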

slide-40
SLIDE 40

Transient Data Storage

• Input buffers typically have low contention
• Available buffers and bandwidth can be used as transient storage
• Useful for keeping intermediate results and read-only values on chip

1. An RCU executes an instruction; the intermediate result is sent to transient storage
2. An RCU waiting on the intermediate value receives it from transient storage
3. The result is returned to memory

[Figure: 4x4 mesh of NoC routers with PEs, illustrating the three steps]
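The three steps above can be sketched as a tiny functional model (an illustration under our own assumptions, not the paper's microarchitecture): idle input-buffer slots act as a keyed store that parks intermediate values for a waiting RCU, falling back to memory when no slot is free.

```python
class TransientStore:
    """Toy model of idle NoC input buffers used as transient storage."""

    def __init__(self, free_slots):
        self.free_slots = free_slots  # idle buffer slots currently available
        self.slots = {}               # tag -> parked intermediate value

    def park(self, tag, value):
        """Step 1: an RCU sends an intermediate result to transient storage.

        Returns False when no slot is free, so the caller falls back to
        writing the value to memory instead.
        """
        if len(self.slots) >= self.free_slots:
            return False
        self.slots[tag] = value
        return True

    def claim(self, tag):
        """Step 2: the waiting RCU receives the value; the slot is freed
        and becomes available again for regular CPU traffic."""
        return self.slots.pop(tag, None)

store = TransientStore(free_slots=4)
store.park("partial-sum", 19)       # step 1: park the intermediate result
value = store.claim("partial-sum")  # step 2: waiting RCU picks it up
# step 3 would return the final result to memory via the CPM
```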

slide-44
SLIDE 44

Running a SnackNoC Kernel

1. CPU core calls the C-code API for matrix multiply
2. CPM sets up the kernel
3. RCUs execute the kernel
4. Result is returned to main memory via the CPM

[Figure: CPU core, Central Packet Manager, main memory, and the 4x4 mesh of NoC routers with PEs]
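The four steps can be mirrored in a small functional model (illustrative only: the class and method names are our own, and the real CPM issues instruction packets over the NoC rather than making Python calls):

```python
class RCU:
    """Toy router compute unit: multiply-accumulates one output row."""

    def execute(self, row, matrix_b):
        # accumulator-style dot products of one row against each column
        return [sum(a * b for a, b in zip(row, col))
                for col in zip(*matrix_b)]

class CentralPacketManager:
    """Toy CPM: sets up a kernel, farms rows out to RCUs, gathers results."""

    def __init__(self, rcus):
        self.rcus = rcus

    def run_matmul(self, a, b):
        # Steps 1-2: CPU submits the kernel; the CPM assembles instruction
        # packets (here, one packet per output row).
        packets = list(enumerate(a))
        # Step 3: RCUs execute the packets across the mesh.
        result = [self.rcus[i % len(self.rcus)].execute(row, b)
                  for i, row in packets]
        # Step 4: result returned to main memory via the CPM.
        return result

cpm = CentralPacketManager([RCU() for _ in range(16)])
c = cpm.run_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])  # -> [[19, 22], [43, 50]]
```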

slide-49
SLIDE 49

Example of a Reduction Kernel

1. SnackNoC instructions and data are sent to the RCUs
2. Data-dependent instructions are sent to reduce intermediate results
3. Intermediate results are sent to the data-dependent instructions
4. Results are reduced on the way to the corner RCU and returned to the CPM

We repurposed the NoC router crossbar, network links, and internal buffers to compute this kernel.

[Figure: NoC routers with RCUs, the Central Packet Manager, SnackNoC flits, and data-dependent instructions]
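The reduction pattern can be sketched as follows (our own simplification: routing, flit formats, and timing are abstracted away; a 4x4 mesh is assumed, with rows reducing toward the west column and that column reducing toward the corner RCU):

```python
def mesh_reduce(values):
    """Reduce a 4x4 grid of per-RCU partial values to a single sum at the
    corner RCU, mimicking the in-network reduction described above.

    values[y][x] is the partial result held by the RCU at mesh position
    (x, y); the corner RCU sits at (0, 0).
    """
    # Each row reduces westward into its x == 0 RCU...
    row_partials = [sum(row) for row in values]
    # ...then the west column reduces toward the corner RCU, which
    # returns the final result to the CPM.
    return sum(row_partials)

grid = [[x + 4 * y for x in range(4)] for y in range(4)]  # values 0..15
total = mesh_reduce(grid)  # sum of 0..15 == 120
```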

slide-50
SLIDE 50

Overview

• "Slack" of the Communication Fabric
• The SnackNoC Platform
• Experimental Results
• Conclusion and Future Considerations

slide-51
SLIDE 51

Methodology

Experiments:

1. Assess the performance of SnackNoC: how many additional cores' worth of performance can SnackNoC provide opportunistically?
2. Quantify the performance interference of operating SnackNoC on the CPU cores

• Implemented four SnackNoC kernels (SGEMM, Reduction, MAC, SPMV)
• Executed 16 multi-threaded benchmarks from PARSEC3, Splash2X, and FastForward2 to assess performance interference

slide-57
SLIDE 57

Methodology – Quantifying SnackNoC Performance

• SnackNoC is modeled in the gem5 simulation framework
• To quantify performance, four SnackNoC kernels are executed on:
  1. A simulated CMP with the SnackNoC platform (kernels compiled to SnackNoC instructions)
  2. A native Dell server with an Intel Xeon E5-2660 (C++ multi-threaded with OpenMP)

Simulated CMP Parameters:
  Core Count: 16 in-order cores
  Core Frequency: 2 GHz
  L1 I&D Cache: 32 KB, 4-way
  L2 Cache: 256 KB, 4-way
  NoC Topology: 2D 4x4 mesh, 4 memory controllers
  NoC Flit Size: 32 B
  Virtual Channels: 4
  Buffers: 4

SnackNoC Parameters:
  RCU Count: 16 RCUs
  RCU Frequency: 1 GHz
  Flit Priority Arbitration: ON/OFF

Native CPU Parameters:
  Processor: Intel Xeon E5-2660 v3
  Core Frequency: 2.6 GHz
  L1 I&D Cache: 32 KB, 8-way
  L2 Cache: 256 KB, 8-way
  L3 Cache: 20 MB, 20-way

slide-61
SLIDE 61

Quantifying SnackNoC Performance Gain

• SnackNoC kernels are executed on an increasing number of cores to determine SnackNoC's comparable performance
• CMP performance increases roughly linearly with core count, with the exception of SPMV
• SnackNoC's performance gain is equivalent to between 2 and 6 x86 OOO cores

slide-64
SLIDE 64

SnackNoC Area and Power Overhead

• SnackNoC components' RTL implemented and synthesized with Synopsys Design Compiler:
  • 45 nm NCSU technology node
  • Operating frequency: 1 GHz
• Single RCU per NoC router: under 10% additional power and area per router
• Single CPM per NoC: 12.85% additional power and 33.04% additional area per NoC; the largest contributor is the instruction buffer

Router Compute Unit (RCU)               Additional Power (%)   Additional Area (%)
  32-bit Parallel Adder                        1.14                  1.15
  32-bit Parallel Subtractor                   1.14                  1.15
  32-bit Multiply and Accumulate (MAC)         2.05                  1.73
  Ordered Instruction Buffer                   2.05                  2.30
  Dependency Buffer                            2.51                  1.15
  Accumulator Buffer                           0.68                  0.12
  Sub Block List                               0.23                  1.73
  Total                                        9.81                  9.33

Central Packet Manager                  Additional Power (%)   Additional Area (%)
  Assembly Logic and Buffers                   0.08                  2.43
  Kernel State                                 0.16                  0.10
  Instruction Buffer                          10.71                 25.75
  Offload Data Memory Buffer                   0.95                  2.28
  Output Result FIFO                           0.95                  2.28
  Total                                       12.85                 33.04

slide-68
SLIDE 68

SnackNoC's Small Contribution to the Total Uncore

• The full uncore of the 16-core CMP is modeled in 45 nm with CACTI 7.0 and Orion 3.0
• A 16-RCU SnackNoC contributes only 1.6% of uncore power and 1.1% of uncore area
• Satisfies the goal of limited overhead

[Figure: uncore power and area breakdown]

slide-71
SLIDE 71

Methodology – Quantifying SnackNoC Interference

• To quantify performance interference, the performance of the CMP is compared with and without SnackNoC traffic
• Simulated the 16-core CMP with benchmarks from PARSEC3, Splash2X, and FastForward2
• SnackNoC kernels are executed simultaneously

(Simulated CMP and SnackNoC parameters as in the performance methodology.)

slide-76
SLIDE 76

Minimal impact of "Snacking" on CMP performance

Performance impact varies with NoC utilization:

• Peak performance impact on the CMP cores: 1.1%
• Average impact: ~0.30% for SGEMM, MAC, and SPMV; 0.11% for Reduction
• SnackNoC kernel completion time is impacted by at most 3.9% with fair arbitration

slide-81
SLIDE 81

Minimal impact of "Snacking" on CMP performance

SnackNoC traffic added to LULESH: minimal impact on CMP performance.

[Figure: LULESH NoC traffic with and without SnackNoC traffic]

slide-87
SLIDE 87

Further Reducing Impact with Priority Arbitration

Adding priority flit arbitration for CMP traffic:

• Average performance impact drops from 0.25% to 0.17%
• Flit interference is reduced by up to 92%
• Peak performance impact with priority arbitration is 0.83%

Satisfies the goal of limited performance impact.

slide-88
SLIDE 88

Overview

• "Slack" of the Communication Fabric
• The SnackNoC Platform
• Experimental Results
• Conclusion and Future Considerations

slide-90
SLIDE 90

Conclusion and Future Considerations

• Opportunistically "snacking" on NoC resources can add performance to our CMPs
• Added 2 to 6 cores' worth of performance with only a 1.3% increase in uncore area
• Further tradeoffs we're investigating:
  1. Growing application coverage
  2. Scaling compute density
  3. Supporting future topologies

slide-91
SLIDE 91

Questions?

Main contributions:

• Quantified the design slack in the communication fabric
• Opportunistically added 2 to 6 cores' worth of performance to the CMP by repurposing NoC resources with low overhead

Karthik Sangaiah, Michael Lui, Ragh Kuttappa, Baris Taskin [Drexel University], and Mark Hempstead [Tufts University], "SnackNoC: Processing in the Communication Layer", Proceedings of the IEEE International Symposium on High Performance Computer Architecture (HPCA), February 2020. http://vlsi.ece.drexel.edu/ & https://sites.tufts.edu/tcal/