SLIDE 1

NETWORK‐ON‐CHIP‐ASSISTED ADAPTIVE PARTITIONING AND ISOLATION FOR “DYNAMIC” HOMOGENEOUS MANYCORES

Davide Bertozzi

MPSoC Research Group – University of Ferrara – Italy email: davide.bertozzi@unife.it

A collaboration with José Flich, Universidad Politecnica de Valencia (Spain) A collaboration with Giorgos Dimitrakopoulos, Democritus University of Thrace (Greece)

SLIDE 2

Workload consolidation

Consolidation of multiple computation workloads onto the same high-end embedded computing platform is well underway in many domains

Examples: aggregation of ECUs, multimedia home gateways, IoT platforms.

Embedded system virtualization is one relevant branch of this trend

SLIDE 3

[Block diagram: host multi-core heterogeneous processor, general-purpose programmable accelerator, hardware accelerators, high-speed I/O, DMA engine, graphics, reconfigurable fabric, and DRAM memory controller, all attached to a top-level NoC]

Heterogeneous Parallel Computer Architecture

  • Proliferation of more or less programmable computing acceleration resources.
  • Dark silicon will be harnessed through specialization.

GOPS/Watt spectrum:

  • General-purpose computing (SMPs): coarse-grain parallelism
  • Throughput computing (GPGPUs): massive HW multithreading for data parallelism
  • Programmable and customizable accelerators: parallel threads heavily dependent on local data content
  • HW IPs: highest GOPS/W

The accelerator store: Specialization and Parallelism are THE design paradigm for embedded SoCs


Multi-programmed mixed-criticality workloads meet parallel hardware platforms

SLIDE 4

Host Processor Running processes

Programmable Manycore Accelerator

Multiple applications may concurrently need to offload computation to a manycore accelerator, with each application unaware of the existence of the others

Concurrent Acceleration Requests

HOW TO SHARE THE ACCELERATOR?

High-end GPGPUs: leverage fine-grained temporal multiplexing, relying on dedicated hardware support for fast and lightweight context switching! But the same full-fledged HW solutions proposed in high-end GPGPUs won't be affordable in low-power SoCs!

SLIDE 5

Host Processor Running processes

Programmable Manycore Accelerator

Multiple applications may concurrently need to offload computation to a manycore accelerator, with each application unaware of the existence of the others

Concurrent Acceleration Requests

HOW TO SHARE THE ACCELERATOR?

Common sense: use a coarser form of accelerator time-sharing, executing offload requests in a run-to-completion, first-come first-served manner. But this leads to overly long waiting times: latency-critical requests may have to resort to host execution.

SLIDE 6

Host Processor Running processes

Programmable Manycore Accelerator

Multiple applications may concurrently need to offload computation to a manycore accelerator, with each application unaware of the existence of the others

Concurrent Acceleration Requests

HOW TO SHARE THE ACCELERATOR?

HPC: to shorten time-to-completion, use up all of the available cores! But embedded applications exhibit a limited amount of data parallelism, alternated with task-level parallelism, so performance is likely to saturate as core allocation grows.

SLIDE 7

Host Processor Running processes

Programmable Manycore Accelerator

OK, I got it: SPATIAL‐DIVISION MULTIPLEXING (SDM) is the solution!

Concurrent Acceleration Requests

SDM is trivial. Where is the challenge?

  • Current accelerator architectures are at odds with SDM
  • Partition just the cores? Or also the memory?
  • Program parallelism should be matched to the execution environment
  • Software-only solutions cannot provide complete isolation
  • Designing SDM for predictability? For security? For both?

Much more than a concept: a design philosophy!

SLIDE 8

Davide Bertozzi MPSoC Research Group

[Chart: normalized speedup vs. #clusters (1-9); legend: IDEAL, FAST, ROD, Convert, DetectUniScaleResize, Distance, GaussianBlur, ComputeKeypoints, rBrief]

Image-processing benchmarks run on a gem5-based general-purpose many-core platform simulator, configured to emphasize the computation speedup and minimize communication and memory-access effects (ideal crossbar, 1-cycle memory-access latency). With some exceptions, the trend is confirmed: real applications cannot exploit the whole parallelism provided by the hardware! By relying on a Spatial-Division Multiplexing approach we relinquish the maximum parallelism, but:

  • Such parallelism is actually not needed
  • Non‐Uniform Memory Access (NUMA) effects can be minimized
  • Interferences of other applications are avoided inside the partition

Does SDM make sense at all?

SLIDE 9


A batch of applications (9 apps in total, 8 requests per app) is run and evaluated with several memory configurations, using both the SDM and the coarse-grain TDM approach.

SDM outperforms TDM, speeding up the whole-batch execution by 35% in the best case (i.e., with full knowledge of the incoming request pattern) and by 19% with random scheduling of acceleration requests.

What about TDM?

[Chart legend: SDM with L2 partitioning (best / random schedule), SDM with global L2 (best / random schedule)]

Memory partitioning helps smooth out NUMA effects

SLIDE 10

SDM Technology

  • SDM is OK, but how?
  • The Mapping Challenge
  • The Reconfiguration Challenge
  • The Adaptivity Challenge
SLIDE 11

The traffic generated by different applications collides in the accelerator NoC…

…as NoC paths are shared between nodes assigned to different applications, even under smart allocation schemes! It might be a good idea to prevent traffic from different applications from mixing. Or not? At least for better composability and analyzability.

The Isolation Property

Smart task allocation cannot guarantee the isolation property. You need NoC support for that!

SLIDE 12

A deterministic (or partially adaptive) routing algorithm without cyclic dependencies among links or buffers can be represented by the set of routing restrictions it imposes.

  • A routing restriction forbids packets from using two given consecutive channels

For irregular topologies as well

Our Approach: Routing Restrictions

Can we design a routing mechanism that finds a packet’s way to destination by interpreting such routing restrictions?

SLIDE 13

Routing logic is assisted by a 26-bit configuration register per switch:

  • Routing restrictions are coded at each switch by means of routing bits Rxy
  • Unconnected ports are coded by means of connectivity bits Cx

[Figure: a switch at coordinates (Xcurr, Ycurr)]

LBDR logic (example: destination in the north-east quadrant, with one turn forbidden):

1. compute the target quadrant;
2. take North if at the next hop the packet can turn East;
3. take East if at the next hop it can turn North;
4. go East… provided the East port is connected!

Logic‐Based Distributed Routing

  • More flexible than algorithmic routing: it supports different routing algorithms and many (though not all) irregular 2D-mesh topologies
  • More scalable than routing tables: the configuration register stays the same size regardless of the network size
  • Lower coverage than routing tables (~80%)
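The quadrant-and-restriction decision above can be sketched in Python. This is an illustrative model with assumed port and bit names (C for connectivity, Rxy for "a packet leaving through port x may turn to y at the next hop"), not the actual gate-level LBDR equations:

```python
# Hedged sketch of LBDR output-port selection at one switch.
# C maps 'N','E','S','W' -> bool (connectivity bits);
# R maps turn names like 'NE' -> bool (routing bits).

def lbdr_output_ports(xc, yc, xd, yd, C, R):
    """Ports a packet at (xc, yc) headed to (xd, yd) may take."""
    north, east = yd > yc, xd > xc
    south, west = yd < yc, xd < xc
    ports = set()
    # Straight-line cases need only connectivity, no turn permission.
    if north and not (east or west) and C['N']:
        ports.add('N')
    if east and not (north or south) and C['E']:
        ports.add('E')
    if south and not (east or west) and C['S']:
        ports.add('S')
    if west and not (north or south) and C['W']:
        ports.add('W')
    # Quadrant cases: take a port only if the turn required at the
    # next hop is not forbidden by a routing restriction.
    if north and east:
        if C['N'] and R['NE']: ports.add('N')
        if C['E'] and R['EN']: ports.add('E')
    if north and west:
        if C['N'] and R['NW']: ports.add('N')
        if C['W'] and R['WN']: ports.add('W')
    if south and east:
        if C['S'] and R['SE']: ports.add('S')
        if C['E'] and R['ES']: ports.add('E')
    if south and west:
        if C['S'] and R['SW']: ports.add('S')
        if C['W'] and R['WS']: ports.add('W')
    return ports

# Example: XY routing forbids every turn from a Y channel into an
# X channel (Rne, Rnw, Rse, Rsw = 0).
C = {p: True for p in 'NESW'}
R = {'NE': False, 'NW': False, 'SE': False, 'SW': False,
     'EN': True, 'ES': True, 'WN': True, 'WS': True}
assert lbdr_output_ports(0, 0, 2, 2, C, R) == {'E'}   # go X first
assert lbdr_output_ports(0, 0, 0, 2, C, R) == {'N'}   # straight north
```

With a partially adaptive restriction set, the function may return more than one port; an allocator would then pick among them.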

SLIDE 14

Setting connectivity bits to zero at partition boundaries prevents messages from escaping their partition

Additional benefits:

  • complexity on the order of algorithmic XY routing
  • no modification of the routing algorithm required
  • no additional provisioning to guarantee deadlock freedom
  • no virtual channel needed (yet)

Basic Partitioning Support

LBDR configuration bits
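A toy reachability check illustrates why zeroed boundary connectivity bits confine packets; the 4x4 mesh and the 2x2 partition here are invented for illustration, not taken from the slides:

```python
# Toy model: zeroing connectivity (Cx) bits at a partition boundary.

MOVES = {'N': (0, 1), 'E': (1, 0), 'S': (0, -1), 'W': (-1, 0)}

def mesh_connectivity(w, h):
    """Connectivity bits of a w x h mesh: edge ports are disconnected."""
    return {(x, y): {p for p, (dx, dy) in MOVES.items()
                     if 0 <= x + dx < w and 0 <= y + dy < h}
            for x in range(w) for y in range(h)}

def reachable(start, conn):
    """Switches reachable from `start` following only enabled ports."""
    seen, frontier = {start}, [start]
    while frontier:
        x, y = frontier.pop()
        for p in conn[(x, y)]:
            dx, dy = MOVES[p]
            nxt = (x + dx, y + dy)
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return seen

conn = mesh_connectivity(4, 4)
partition = {(0, 0), (1, 0), (0, 1), (1, 1)}
for x, y in partition:
    # Zero the Cx bit of every port that crosses the partition boundary.
    conn[(x, y)] = {p for p in conn[(x, y)]
                    if (x + MOVES[p][0], y + MOVES[p][1]) in partition}

assert reachable((0, 0), conn) == partition   # messages cannot escape
```

No change to the routing logic itself is needed: the boundary simply looks like the edge of the chip to packets inside the partition.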

SLIDE 15

With the basic approach, not all partition shapes are feasible!

There is a mismatch between partition shapes and the underlying routing algorithm… …and in fact another routing algorithm works for the same partition shapes!

TWO POSSIBLE SOLUTIONS:

  • set up only those partition shapes that are “legal” for the chosen routing algorithm
  • adapt the routing algorithms to the partition shapes

The Flexibility Challenge

Out‐of‐reach!

SLIDE 16

Not all network traffic is headed to switches inside the partition: global traffic to memory controllers and/or an unpartitioned L2 must also be supported!

Solution

Underlying philosophy: there is ONE GLOBAL ROUTING ALGORITHM, for intra-partition messages as well as for global traffic. The routing algorithm is unmodified:

  • no deadlock risks

Local Cx bits / Global Cx bits

What about Global Traffic?

Provide two sets of LBDR bits, differing

  • only in the connectivity bits

…but you start «invading» other partitions!

SLIDE 17

Unrelated per-partition algorithms: a different algorithm is implemented in each partition, locally deadlock-free but globally not.

E.g., different instances of the Segment-based Routing (SR) strategy applied on a per-partition basis. Global traffic support is no longer straightforward, since the global routing function is not necessarily deadlock-free any more!

Why not Changing the Philosophy?

SLIDE 18

~2X INCREASE IN COMPLEXITY OF THE LBDR ROUTING MECHANISM! MOREOVER, YOU HAVE VIRTUAL CHANNELS!

What about Global Traffic?

Two virtual channels to separate local from global traffic

Each virtual channel has its own routing algorithm. What about Isolation?

  • VC0 traffic (local) suffers from link-level interference with VC1 traffic (global). Can be solved by using 2 networks!
  • VC1 traffic is a mix of global traffic originating from different partitions. Can be solved only through temporal isolation!

SLIDE 19

Temporal Isolation of Global Traffic

PhaseNoC, a collaboration with Giorgos Dimitrakopoulos, Democritus University of Thrace (Greece)

A domain is defined as an individual VC, or a group of VCs, serving one or more partition flows. The domains never “compete” with each other to gain access to any network resource, so there is no information leak whatsoever across domains.

Global traffic flows from domains can propagate concurrently and in complete isolation, despite sharing the same hardware infrastructure, by means of a VC-level TDM.
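The VC-level TDM idea can be sketched as a phase schedule that is shifted by one per hop, so a packet always rides its own domain's wave. The function name and the per-hop offset are illustrative assumptions, not the PhaseNoC RTL:

```python
# Hedged sketch of VC-level TDM: at every router, each scheduling phase
# admits exactly one domain, and the phase pattern is shifted by one per
# hop, so packets advance without ever competing across domains.

def active_domain(router_hop, t, num_domains):
    """Domain allowed to use the router at hop `router_hop` in cycle t."""
    return (t - router_hop) % num_domains

# A packet of domain d, injected at a cycle aligned with d's phase,
# advances one hop per cycle and always finds its own phase active:
# no competition, hence no interference or leakage across domains.
D = 4        # number of domains
d = 2        # this packet's domain
t0 = 2       # injection cycle with t0 % D == d
for hop in range(8):
    assert active_domain(hop, t0 + hop, D) == d
```

A packet injected off-phase simply waits at the source until its domain's phase comes around, which bounds the added latency by the schedule period.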

SLIDE 20

Temporal Isolation of Global Traffic

PhaseNoC, a collaboration with Giorgos Dimitrakopoulos, Democritus University of Thrace (Greece)

Latency optimization: the phases are coordinated into optimally scheduled, interlocked propagating waves, which ensure that in-flight packets of all domains experience the minimum possible latency. The topology structure limits the number of domains under a perfect schedule: a perfect schedule exists if D ≤ 2(P+1).

SLIDE 21

SDM Technology

  • SDM is OK, but how?
  • The Mapping Challenge
  • The Reconfiguration Challenge
  • The Adaptivity Challenge
SLIDE 22


When an application offloads computation to a programmable manycore accelerator, a resource manager should take care of mapping it onto the available resources.

  • Assume an application is requesting N cores (performance saturates beyond N cores)
  • It is not always true that striving to grant N cores to this application is the best-performing solution!

Assumption: 2 applications are already mapped on the accelerator resources. If a third application requests 6 cores to offload computation, what is the best mapping option?

The Mapping Challenge

Sometimes the answer is: a 5‐core mapping!
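A toy cost model shows how that can happen. The speedup saturation point, workload size, and shape penalty below are invented for illustration only, not measured values from the deck:

```python
# Hedged sketch of the mapping trade-off: granting fewer, better-arranged
# cores can beat granting the full request.

def completion_time(work, cores, shape_penalty):
    """Toy model: saturating speedup degraded by a shape-dependent
    communication penalty (1.0 = compact shape, >1 = scattered)."""
    speedup = min(cores, 6) / shape_penalty   # performance saturates at 6
    return work / speedup

work = 600.0
scattered_6 = completion_time(work, 6, shape_penalty=1.4)  # 6 cores, bad shape
compact_5   = completion_time(work, 5, shape_penalty=1.0)  # 5 cores, compact

assert compact_5 < scattered_6   # the 5-core mapping wins
```

A real resource manager would evaluate candidate partition shapes with a model like this (or profiled data) before granting a request.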

SLIDE 23


[Chart: execution time (cycles, 200,000 to 1,600,000) vs. number of clusters per partition and partition shape (1I, 2I, 3I, 3L, 4I, 4L, 4T, 4Q, 5L, 5V, 5S, 5Y, 5F, 5X, 5G, 6L, 6G, 6R, 7G, 8R, 9Q, 10Q)]

  • CASE STUDY: FAST‐ROSTEN

PARTITION SHAPES MAKE THE DIFFERENCE! The speedups are significantly shape-dependent, so another challenge arises for the resource manager: sometimes mapping an application onto fewer but better-arranged resources gives more benefit than leveraging more computational units!

Partition shape matters

SLIDE 24

SDM Technology

  • SDM is OK, but how?
  • The Mapping Challenge
  • The Reconfiguration Challenge
  • The Adaptivity Challenge
SLIDE 25

Runtime modifications of the routing algorithm may be needed in order to enable/maximize/adapt/prolong utilization of accelerator resources

SCENARIO: GLOBAL ROUTING ALGORITHM

Global reconfiguration of the network routing function to enable allocation of Application 2

SCENARIO: PER-PARTITION ROUTING ALGORITHM

Runtime adaptation of the partition routing algorithm for performance optimization, or for fault tolerance

Reconfiguration of the Routing Function

SLIDE 26
  • During the reconfiguration transient, packets of both the new and the old routing function co-exist in the network.
  • Although each routing function is deadlock-free on its own, this might not hold for their combination!

Channel Dependency Graph (CDG)

  • XY routing
  • YX routing
  • their mix

[Figure: example with SR routing. CDGs for XY routing, YX routing, and their mix; each per-configuration CDG (CONF. A, CONF. B) is acyclic, but the transient configuration mixing XY and YX dependencies closes a cycle: DEADLOCK!]

MPSoC Research Group, ENDIF Department, University of Ferrara , via Saragat 1, 44122, Ferrara, ITALY

The Deadlock Concern
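The transient-configuration deadlock can be reproduced with a small CDG cycle check. The 2x2 mesh and the hand-built dependency sets below are illustrative (they model plain XY and YX, not the slides' SR instance):

```python
# Channel dependency graphs (CDGs) on a 2x2 mesh, node = 2*y + x.
# A channel is a directed link (src, dst); a dependency (c1, c2) means a
# packet may use channel c2 immediately after channel c1.

def has_cycle(deps):
    """Kahn's algorithm: the CDG has a cycle iff it cannot be fully
    topologically sorted."""
    nodes = {c for dep in deps for c in dep}
    indeg = {n: 0 for n in nodes}
    succ = {n: [] for n in nodes}
    for a, b in deps:
        succ[a].append(b)
        indeg[b] += 1
    queue = [n for n in nodes if indeg[n] == 0]
    removed = 0
    while queue:
        n = queue.pop()
        removed += 1
        for m in succ[n]:
            indeg[m] -= 1
            if indeg[m] == 0:
                queue.append(m)
    return removed < len(nodes)

# XY routing only turns from an X channel into a Y channel...
xy_deps = {((0, 1), (1, 3)), ((1, 0), (0, 2)),
           ((3, 2), (2, 0)), ((2, 3), (3, 1))}
# ...while YX routing only turns from a Y channel into an X channel.
yx_deps = {((0, 2), (2, 3)), ((2, 0), (0, 1)),
           ((1, 3), (3, 2)), ((3, 1), (1, 0))}

assert not has_cycle(xy_deps)        # XY alone: acyclic, deadlock-free
assert not has_cycle(yx_deps)        # YX alone: acyclic, deadlock-free
assert has_cycle(xy_deps | yx_deps)  # transient mix: cycle, deadlock risk
```

This is exactly the hazard OSR addresses: the union of old and new dependencies, not either function alone, is what must be kept acyclic during the transient.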

SLIDE 27
  • Static reconfiguration approaches drain the network of ongoing packets before reconfiguring its routing function.
  • Better idea: Overlapped Static Reconfiguration (OSR)


Old messages vs. new messages: deadlock would require old packets following new packets and vice versa. OSR prevents this from happening by removing old dependencies and adding new ones in a controlled way.

[Figure: old and new epochs separated by the token.] Special epoch-transition rules enforce that no old packet can follow a token.

OSR
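The per-port epoch rule can be sketched as follows; the data structures and names are illustrative assumptions, not the actual switch RTL:

```python
# Hedged sketch of the OSR epoch rule at one switch input port: packets
# routed with the OLD function drain before the token; once the token
# passes, only NEW-epoch packets (routed with the new function) follow.

def process_port(fifo):
    """Consume an input-port FIFO in order; return the routing function
    ('old'/'new') applied to each packet. 'TOKEN' flips the epoch."""
    epoch, log = 'old', []
    for item in fifo:
        if item == 'TOKEN':
            epoch = 'new'          # epoch transition: reconfigure this port
            continue
        # Epoch-transition rule: an old packet must never follow the token,
        # otherwise old and new dependency graphs would mix.
        assert not (epoch == 'new' and item == 'old'), \
            "old packet behind the token"
        log.append(epoch)
    return log

# In-order FIFO: old packets drain, the token passes, new packets follow.
assert process_port(['old', 'old', 'TOKEN', 'new', 'new']) == \
    ['old', 'old', 'new', 'new']
```

Because each port flips exactly once when its token arrives, the old routing function's dependencies disappear monotonically, which is what keeps the combined CDG acyclic during the transient.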

SLIDE 28
  • The token signals propagate among switches throughout the network in the order of the CDG (Channel Dependency Graph) of the old routing function.


[Figure: token propagation wavefront with per-switch arrival times from 3 to 75 cycles, in a scrolling-up phase followed by a scrolling-down phase]

If OLD CDG is acyclic, each input port of each switch will receive a token exactly once.

Total reconfiguration time: 75 cycles

Token Propagation

SLIDE 29


  • Baseline OSR may have significant implications on the latency of ongoing packets
  • Optimizations do exist in order to minimize the performance drop
  • The most aggressive one yields fully transparent reconfiguration
  • At the cost of possible out‐of‐order delivery in some unfortunate cases
  • Other solutions do exist that yield a good performance‐complexity compromise

[Chart: maximum latency (cycles) and execution time (cycles) during the reconfiguration transient in an 8x8 chip multiprocessor, comparing baseline OSR (OSR-Lite), an intermediate optimization (OSR-Lite_opt_inorder), and fully transparent reconfiguration (OSR-Lite_opt)]

Performance Perturbation

SLIDE 30

SDM Technology

  • SDM is OK, but how?
  • The Mapping Challenge
  • The Reconfiguration Challenge
  • The Adaptivity Challenge
SLIDE 31

[Figure: a large tiled manycore array, each tile pairing a processor (P) with a router (R)]

So far:

  • Effective resource sharing implies

some form of partitioning…

  • … and isolation for protection.

Next step:

  • Partition scheduling.
  • Partition reshaping.


Strong implications on the programming model!

Looking Forward: Adaptivity

Mode-based partitions Asynchronous partitions Partition Reshaping

Robert Hilbrich and J. Reinier van Kampenhout, “Partitioning and Task Transfer on NoC‐based Many‐Core Processors in the Avionics Domain”, ADA Deutschland 2011

SLIDE 32


OUT‐OF‐ORDER DELIVERY DEADLOCK

[Figure: existing partition with the current S ↦ D path, and extended partition with the new S ↦ D path]

No guarantee that a packet following the red path reaches the destination before one following the green path:

  • Old routes can be non-minimal after a new topology is established
  • Congestion can cause delays

[Figure: all possible S ↦ D paths]

Scheduling of actions in time plays a critical role!

  • Safe option: run on 3 cores, pause, reconfigure the NoC, then resume on 4 cores. DEADLOCK-FREE but INEFFICIENT
  • Aggressive option: reconfigure the NoC while execution continues, then run on 4 cores. DEADLOCK- and OoO-PRONE

The OSR mechanism is again needed to provide:

  • a separation token between old and new packets, thus avoiding out-of-order delivery
  • deadlock freedom

Partition Extension

SLIDE 33

A new era for high‐end systems‐on‐chip!

SLIDE 34

Full‐Fledged HW/SW System Vision

HYPERVISOR RESOURCE MANAGER

Virtualization as a means of simplifying programming Master the reconfiguration hooks exposed by the hardware for energy‐efficient platform‐management

[Figure: dynamic repartitioning of applications A, B, C, and D across the accelerator, with idle regions put to sleep]

  • Delivering efficient support for resource sharing (TDM, SDM, or a mix)…
  • …and adaptive resource sharing for workload adaptivity: workload-adaptive power management while meeting application requirements
  • Programmers should be allowed to specify hardware-platform-agnostic execution requirements (performance targets, quality of service, reliability, or security)
  • Programmers do not need to adapt their applications to the host system hardware, but just to the abstracted environment provided inside each VM.

  • App. requirements
  • Platform state

SLIDE 35


  • Spatial-division multiplexing as a way to consolidate multi-program workloads onto shared manycore programmable accelerators
  • Design methods for spatial partitioning encompass the HW/SW stack and build up a «design-for-partitioning» methodology

– Different degrees of isolation

  • The shape of a partition makes a difference for the application's execution time

– Challenges the resource management policy

  • Next step: support for dynamism

– Runtime modification of the routing function
– Adaptive partition size and shape

  • New vision for high‐end embedded systems
  • Future work:

– Programming model and resource management for a dynamic hardware platform.

Conclusions