SLIDE 1

NETWORK‐ON‐CHIP‐ASSISTED ADAPTIVE PARTITIONING AND ISOLATION FOR “DYNAMIC” HOMOGENEOUS MANYCORES

Davide Bertozzi

MPSoC Research Group – University of Ferrara – Italy email: davide.bertozzi@unife.it

A collaboration with José Flich, Universidad Politecnica de Valencia (Spain) A collaboration with Giorgos Dimitrakopoulos, Democritus University of Thrace (Greece)

SLIDE 2

Workload consolidation

Consolidation of multiple computation workloads onto the same high-end embedded computing platform is well underway in many domains

Examples: aggregation of ECUs, multimedia home gateways, IoT platforms.

Embedded system virtualization is one relevant branch of this trend

SLIDE 3

[Block diagram: host multi-core heterogeneous processor, general-purpose programmable accelerator, hardware accelerators, high-speed I/O, DMA engine, graphics, reconfigurable fabric, and DRAM memory controller, all attached to a top-level NoC]

Heterogeneous Parallel Computer Architecture

  • Proliferation of more or less programmable computing acceleration resources.
  • Dark silicon will be harnessed through specialization.

GOPS/Watt spectrum:

  • General-purpose computing (SMPs): coarse-grain parallelism
  • Throughput computing (GPGPUs): massive HW multithreading for data parallelism
  • Programmable and customizable accelerators: parallel threads heavily dependent on local data content
  • HW IPs: highest GOPS/W

The accelerator store: Specialization and Parallelism are THE design paradigm for embedded SoCs


Multi-programmed mixed-criticality workloads meet parallel hardware platforms

SLIDE 4

Host Processor Running processes

Programmable Manycore Accelerator

Multiple applications may concurrently need to offload computation to a manycore accelerator, with each application unaware of the existence of the others

Concurrent Acceleration Requests

HOW TO SHARE THE ACCELERATOR?

High-end GPGPUs: leverage fine-grained temporal multiplexing, relying on dedicated hardware support for fast and lightweight context switching! But the same full-fledged HW solutions proposed in high-end GPGPUs won't be affordable in low-power SoCs!

SLIDE 5

Host Processor Running processes

Programmable Manycore Accelerator

Multiple applications may concurrently need to offload computation to a manycore accelerator, with each application unaware of the existence of the others

Concurrent Acceleration Requests

HOW TO SHARE THE ACCELERATOR?

Common sense: use a coarser form of accelerator time-sharing, executing offload requests in a run-to-completion, first-come first-served manner. But this leads to overly long waiting times: latency-critical requests may have to resort to host execution.

SLIDE 6

Host Processor Running processes

Programmable Manycore Accelerator

Multiple applications may concurrently need to offload computation to a manycore accelerator, with each application unaware of the existence of the others

Concurrent Acceleration Requests

HOW TO SHARE THE ACCELERATOR?

HPC: to shorten time-to-completion, use up all of the available cores! But embedded applications exhibit a limited amount of data parallelism, alternated with task-level parallelism, so performance is likely to saturate as core allocation grows.

SLIDE 7

Host Processor Running processes

Programmable Manycore Accelerator

OK, I got it: SPATIAL‐DIVISION MULTIPLEXING (SDM) is the solution!

Concurrent Acceleration Requests

SDM is trivial. Where is the challenge?

  • Current accelerator architectures are at odds with SDM
  • Partition just the cores? Or also the memory?
  • Program parallelism should be matched to the execution environment
  • Software-only solutions cannot provide complete isolation
  • Designing SDM for predictability? For security? For both?

Much more than a concept: a design philosophy!

SLIDE 8

Davide Bertozzi MPSoC Research Group

[Chart: normalized speedup vs. #clusters (1-9); legend: IDEAL, FAST, ROD, Convert, DetectUniScaleResize, Distance, GaussianBlur, ComputeKeypoints, rBrief]

Image-processing benchmarks run on a gem5-based general-purpose many-core platform simulator, configured to emphasize the computation speedup and minimize communication and memory-access effects (ideal crossbar, 1-cycle memory-access latency). With some exceptions, the trend is confirmed: real applications cannot exploit the whole parallelism provided by the hardware! By relying on a Spatial-Division Multiplexing approach we relinquish the maximum parallelism, but:

  • Such parallelism is actually not needed
  • Non‐Uniform Memory Access (NUMA) effects can be minimized
  • Interferences of other applications are avoided inside the partition

Does SDM make sense at all?

SLIDE 9


A batch of applications (9 apps in total, 8 requests per app) is run and evaluated with several memory configurations, using both the SDM and the coarse-grain TDM approach.

SDM outperforms TDM, speeding up the whole-batch execution by 35% in the best case (i.e., with full knowledge of the incoming request pattern) and by 19% with random scheduling of acceleration requests.

What about TDM?

[Chart legend: SDM with L2 partitioning (best / random schedule), SDM with global L2 (best / random schedule)]

Memory partitioning helps smooth out NUMA effects

SLIDE 10

SDM Technology

  • SDM is OK, but how?
  • The Mapping Challenge
  • The Reconfiguration Challenge
  • The Adaptivity Challenge
SLIDE 11

The traffic generated by different applications collides in the accelerator NoC…

…as NoC paths are shared between nodes assigned to different applications, even under smart allocation schemes! It might be a good idea to prevent traffic from different applications from mixing. Or not? At least for better composability and analyzability.

The Isolation Property

Smart task allocation cannot guarantee the isolation property. You need NoC support for that!

SLIDE 12

A deterministic (or partially adaptive) routing algorithm without cyclic dependencies among links or buffers can be represented by the set of routing restrictions it imposes.

  • A routing restriction forbids packets from using two given consecutive channels

For irregular topologies as well

Our Approach: Routing Restrictions

Can we design a routing mechanism that finds a packet’s way to destination by interpreting such routing restrictions?

SLIDE 13

Routing logic is assisted by a 26-bit configuration register per switch:

  • Routing restrictions are coded at each switch by means of routing bits Rxy
  • Unconnected ports are coded by means of connectivity bits Cx

[Figure: a switch at coordinates (Xcurr, Ycurr)]

LBDR logic (example: destination in the north-east quadrant, with one turn forbidden):

1. compute the target quadrant;
2. take North if at the next hop the packet can turn East;
3. take East if at the next hop it can turn North;
4. go East… provided the East port is connected!

Logic‐Based Distributed Routing

  • More flexible than algorithmic routing: it supports different routing algorithms and many (though not all) irregular 2D-mesh topologies
  • More scalable than routing tables: the configuration register stays the same size regardless of the network size
  • Lower coverage than routing tables (~80%)
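The quadrant-and-restriction decision above can be sketched in Python. This is an illustrative model with assumed port and bit names (C for connectivity, Rxy for "a packet leaving through port x may turn to y at the next hop"), not the actual gate-level LBDR equations:

```python
# Hedged sketch of LBDR output-port selection at one switch.
# C maps 'N','E','S','W' -> bool (connectivity bits);
# R maps turn names like 'NE' -> bool (routing bits).

def lbdr_output_ports(xc, yc, xd, yd, C, R):
    """Ports a packet at (xc, yc) headed to (xd, yd) may take."""
    north, east = yd > yc, xd > xc
    south, west = yd < yc, xd < xc
    ports = set()
    # Straight-line cases need only connectivity, no turn permission.
    if north and not (east or west) and C['N']:
        ports.add('N')
    if east and not (north or south) and C['E']:
        ports.add('E')
    if south and not (east or west) and C['S']:
        ports.add('S')
    if west and not (north or south) and C['W']:
        ports.add('W')
    # Quadrant cases: take a port only if the turn required at the
    # next hop is not forbidden by a routing restriction.
    if north and east:
        if C['N'] and R['NE']: ports.add('N')
        if C['E'] and R['EN']: ports.add('E')
    if north and west:
        if C['N'] and R['NW']: ports.add('N')
        if C['W'] and R['WN']: ports.add('W')
    if south and east:
        if C['S'] and R['SE']: ports.add('S')
        if C['E'] and R['ES']: ports.add('E')
    if south and west:
        if C['S'] and R['SW']: ports.add('S')
        if C['W'] and R['WS']: ports.add('W')
    return ports

# Example: XY routing forbids every turn from a Y channel into an
# X channel (Rne, Rnw, Rse, Rsw = 0).
C = {p: True for p in 'NESW'}
R = {'NE': False, 'NW': False, 'SE': False, 'SW': False,
     'EN': True, 'ES': True, 'WN': True, 'WS': True}
assert lbdr_output_ports(0, 0, 2, 2, C, R) == {'E'}   # go X first
assert lbdr_output_ports(0, 0, 0, 2, C, R) == {'N'}   # straight north
```

With a partially adaptive restriction set, the function may return more than one port; an allocator would then pick among them.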

SLIDE 14

Setting connectivity bits to zero at partition boundaries prevents messages from escaping their partition

Additional benefits:

  • complexity on the order of algorithmic XY routing
  • no modification of the routing algorithm required
  • no additional provisioning to guarantee deadlock freedom
  • no virtual channel needed (yet)

Basic Partitioning Support

LBDR configuration bits
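A toy reachability check illustrates why zeroed boundary connectivity bits confine packets; the 4x4 mesh and the 2x2 partition here are invented for illustration, not taken from the slides:

```python
# Toy model: zeroing connectivity (Cx) bits at a partition boundary.

MOVES = {'N': (0, 1), 'E': (1, 0), 'S': (0, -1), 'W': (-1, 0)}

def mesh_connectivity(w, h):
    """Connectivity bits of a w x h mesh: edge ports are disconnected."""
    return {(x, y): {p for p, (dx, dy) in MOVES.items()
                     if 0 <= x + dx < w and 0 <= y + dy < h}
            for x in range(w) for y in range(h)}

def reachable(start, conn):
    """Switches reachable from `start` following only enabled ports."""
    seen, frontier = {start}, [start]
    while frontier:
        x, y = frontier.pop()
        for p in conn[(x, y)]:
            dx, dy = MOVES[p]
            nxt = (x + dx, y + dy)
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return seen

conn = mesh_connectivity(4, 4)
partition = {(0, 0), (1, 0), (0, 1), (1, 1)}
for x, y in partition:
    # Zero the Cx bit of every port that crosses the partition boundary.
    conn[(x, y)] = {p for p in conn[(x, y)]
                    if (x + MOVES[p][0], y + MOVES[p][1]) in partition}

assert reachable((0, 0), conn) == partition   # messages cannot escape
```

No change to the routing logic itself is needed: the boundary simply looks like the edge of the chip to packets inside the partition.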

SLIDE 15

With the basic approach, not all partition shapes are feasible!

There is a mismatch between partition shapes and the underlying routing algorithm… …and in fact another routing algorithm works for the same partition shapes!

TWO POSSIBLE SOLUTIONS:

  • set up only those partition shapes that are “legal” for the chosen routing algorithm
  • adapt the routing algorithms to the partition shapes

The Flexibility Challenge

Out‐of‐reach!

SLIDE 16

Not all network traffic is headed to switches inside the partition: global traffic to memory controllers and/or an unpartitioned L2 must also be supported!

Solution

Underlying philosophy: there is ONE GLOBAL ROUTING ALGORITHM, for intra-partition messages as well as for global traffic. The routing algorithm is unmodified:

  • no deadlock risks

Local Cx bits / Global Cx bits

What about Global Traffic?

Provide two sets of LBDR bits, differing

  • only in the connectivity bits

…but you start «invading» other partitions!

SLIDE 17

Unrelated per-partition algorithms: a different algorithm is implemented in each partition, locally deadlock-free but globally not.

E.g., different instances of the Segment-based Routing (SR) strategy applied on a per-partition basis. Global traffic support is no longer straightforward, since the global routing function is not necessarily deadlock-free any more!

Why not Changing the Philosophy?

SLIDE 18

~2X INCREASE IN COMPLEXITY OF THE LBDR ROUTING MECHANISM! MOREOVER, YOU HAVE VIRTUAL CHANNELS!

What about Global Traffic?

Two virtual channels to separate local from global traffic

Each virtual channel has its own routing algorithm. What about Isolation?

  • VC0 traffic (local) suffers from link-level interference with VC1 traffic (global). Can be solved by using 2 networks!
  • VC1 traffic is a mix of global traffic originating from different partitions. Can be solved only through temporal isolation!

SLIDE 19

Temporal Isolation of Global Traffic

PhaseNoC, a collaboration with Giorgos Dimitrakopoulos, Democritus University of Thrace (Greece)

A domain is defined as an individual VC, or a group of VCs, serving one or more partition flows. The domains never “compete” with each other to gain access to any network resource, so there is no information leak whatsoever across domains.

Global traffic flows from domains can propagate concurrently and in complete isolation, despite sharing the same hardware infrastructure, by means of a VC-level TDM.
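The VC-level TDM idea can be sketched as a phase schedule that is shifted by one per hop, so a packet always rides its own domain's wave. The function name and the per-hop offset are illustrative assumptions, not the PhaseNoC RTL:

```python
# Hedged sketch of VC-level TDM: at every router, each scheduling phase
# admits exactly one domain, and the phase pattern is shifted by one per
# hop, so packets advance without ever competing across domains.

def active_domain(router_hop, t, num_domains):
    """Domain allowed to use the router at hop `router_hop` in cycle t."""
    return (t - router_hop) % num_domains

# A packet of domain d, injected at a cycle aligned with d's phase,
# advances one hop per cycle and always finds its own phase active:
# no competition, hence no interference or leakage across domains.
D = 4        # number of domains
d = 2        # this packet's domain
t0 = 2       # injection cycle with t0 % D == d
for hop in range(8):
    assert active_domain(hop, t0 + hop, D) == d
```

A packet injected off-phase simply waits at the source until its domain's phase comes around, which bounds the added latency by the schedule period.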

SLIDE 20

Temporal Isolation of Global Traffic

PhaseNoC, a collaboration with Giorgos Dimitrakopoulos, Democritus University of Thrace (Greece)

Latency optimization: the phases are coordinated into optimally scheduled, interlocked propagating waves, which ensure that in-flight packets of all domains experience the minimum possible latency. The topology structure limits the number of domains under a perfect schedule: a perfect schedule exists if D ≤ 2(P+1).

SLIDE 21

SDM Technology

  • SDM is OK, but how?
  • The Mapping Challenge
  • The Reconfiguration Challenge
  • The Adaptivity Challenge
SLIDE 22


When an application offloads computation to a programmable manycore accelerator, a resource manager should take care of mapping it onto the available resources.

  • Assume an application is requesting N cores (performance saturates beyond N cores)
  • It is not always true that striving to grant N cores to this application is the best-performing solution!

Assumption: 2 applications are already mapped on the accelerator resources. If a third application requests 6 cores to offload computation, what is the best mapping option?

The Mapping Challenge

Sometimes the answer is: a 5‐core mapping!
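A toy cost model shows how that can happen. The speedup saturation point, workload size, and shape penalty below are invented for illustration only, not measured values from the deck:

```python
# Hedged sketch of the mapping trade-off: granting fewer, better-arranged
# cores can beat granting the full request.

def completion_time(work, cores, shape_penalty):
    """Toy model: saturating speedup degraded by a shape-dependent
    communication penalty (1.0 = compact shape, >1 = scattered)."""
    speedup = min(cores, 6) / shape_penalty   # performance saturates at 6
    return work / speedup

work = 600.0
scattered_6 = completion_time(work, 6, shape_penalty=1.4)  # 6 cores, bad shape
compact_5   = completion_time(work, 5, shape_penalty=1.0)  # 5 cores, compact

assert compact_5 < scattered_6   # the 5-core mapping wins
```

A real resource manager would evaluate candidate partition shapes with a model like this (or profiled data) before granting a request.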

SLIDE 23


[Chart: execution time (cycles, 200,000 to 1,600,000) vs. number of clusters per partition and partition shape (1I, 2I, 3I, 3L, 4I, 4L, 4T, 4Q, 5L, 5V, 5S, 5Y, 5F, 5X, 5G, 6L, 6G, 6R, 7G, 8R, 9Q, 10Q)]

  • CASE STUDY: FAST‐ROSTEN

PARTITION SHAPES MAKE THE DIFFERENCE! The speedups are significantly shape-dependent, so another challenge arises for the resource manager: sometimes mapping an application onto fewer but better-arranged resources gives more benefit than leveraging more computational units!

Partition shape matters

SLIDE 24

SDM Technology

  • SDM is OK, but how?
  • The Mapping Challenge
  • The Reconfiguration Challenge
  • The Adaptivity Challenge
SLIDE 25

Runtime modifications of the routing algorithm may be needed in order to enable/maximize/adapt/prolong utilization of accelerator resources

SCENARIO: GLOBAL ROUTING ALGORITHM

Global reconfiguration of the network routing function to enable allocation of Application 2

SCENARIO: PER-PARTITION ROUTING ALGORITHM

Runtime adaptation of the partition routing algorithm for performance optimization, or for fault tolerance

Reconfiguration of the Routing Function

SLIDE 26
  • During the reconfiguration transient, packets of both the new and the old routing function co-exist in the network.
  • Although each routing function is deadlock-free on its own, this might not hold for their combination!

Channel Dependency Graph (CDG)

  • XY routing
  • YX routing
  • their mix

[Figure: example with SR routing. CDGs for XY routing, YX routing, and their mix; each per-configuration CDG (CONF. A, CONF. B) is acyclic, but the transient configuration mixing XY and YX dependencies closes a cycle: DEADLOCK!]

MPSoC Research Group, ENDIF Department, University of Ferrara , via Saragat 1, 44122, Ferrara, ITALY

The Deadlock Concern
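The transient-configuration deadlock can be reproduced with a small CDG cycle check. The 2x2 mesh and the hand-built dependency sets below are illustrative (they model plain XY and YX, not the slides' SR instance):

```python
# Channel dependency graphs (CDGs) on a 2x2 mesh, node = 2*y + x.
# A channel is a directed link (src, dst); a dependency (c1, c2) means a
# packet may use channel c2 immediately after channel c1.

def has_cycle(deps):
    """Kahn's algorithm: the CDG has a cycle iff it cannot be fully
    topologically sorted."""
    nodes = {c for dep in deps for c in dep}
    indeg = {n: 0 for n in nodes}
    succ = {n: [] for n in nodes}
    for a, b in deps:
        succ[a].append(b)
        indeg[b] += 1
    queue = [n for n in nodes if indeg[n] == 0]
    removed = 0
    while queue:
        n = queue.pop()
        removed += 1
        for m in succ[n]:
            indeg[m] -= 1
            if indeg[m] == 0:
                queue.append(m)
    return removed < len(nodes)

# XY routing only turns from an X channel into a Y channel...
xy_deps = {((0, 1), (1, 3)), ((1, 0), (0, 2)),
           ((3, 2), (2, 0)), ((2, 3), (3, 1))}
# ...while YX routing only turns from a Y channel into an X channel.
yx_deps = {((0, 2), (2, 3)), ((2, 0), (0, 1)),
           ((1, 3), (3, 2)), ((3, 1), (1, 0))}

assert not has_cycle(xy_deps)        # XY alone: acyclic, deadlock-free
assert not has_cycle(yx_deps)        # YX alone: acyclic, deadlock-free
assert has_cycle(xy_deps | yx_deps)  # transient mix: cycle, deadlock risk
```

This is exactly the hazard OSR addresses: the union of old and new dependencies, not either function alone, is what must be kept acyclic during the transient.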

SLIDE 27
  • Static reconfiguration approaches drain the network of ongoing packets before reconfiguring its routing function.
  • Better idea: Overlapped Static Reconfiguration (OSR)


Old messages vs. new messages: deadlock would require old packets following new packets and vice versa. OSR prevents this from happening by removing old dependencies and adding new ones in a controlled way.

[Figure: old and new epochs separated by the token.] Special epoch-transition rules enforce that no old packet can follow a token.

OSR
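The per-port epoch rule can be sketched as follows; the data structures and names are illustrative assumptions, not the actual switch RTL:

```python
# Hedged sketch of the OSR epoch rule at one switch input port: packets
# routed with the OLD function drain before the token; once the token
# passes, only NEW-epoch packets (routed with the new function) follow.

def process_port(fifo):
    """Consume an input-port FIFO in order; return the routing function
    ('old'/'new') applied to each packet. 'TOKEN' flips the epoch."""
    epoch, log = 'old', []
    for item in fifo:
        if item == 'TOKEN':
            epoch = 'new'          # epoch transition: reconfigure this port
            continue
        # Epoch-transition rule: an old packet must never follow the token,
        # otherwise old and new dependency graphs would mix.
        assert not (epoch == 'new' and item == 'old'), \
            "old packet behind the token"
        log.append(epoch)
    return log

# In-order FIFO: old packets drain, the token passes, new packets follow.
assert process_port(['old', 'old', 'TOKEN', 'new', 'new']) == \
    ['old', 'old', 'new', 'new']
```

Because each port flips exactly once when its token arrives, the old routing function's dependencies disappear monotonically, which is what keeps the combined CDG acyclic during the transient.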

SLIDE 28
  • The token signals propagate among switches throughout the network in the order of the CDG (Channel Dependency Graph) of the old routing function.


[Figure: token propagation wavefront with per-switch arrival times from 3 to 75 cycles, in a scrolling-up phase followed by a scrolling-down phase]

If OLD CDG is acyclic, each input port of each switch will receive a token exactly once.

Total reconfiguration time: 75 cycles

Token Propagation

SLIDE 29


  • Baseline OSR may have significant implications on the latency of ongoing packets
  • Optimizations do exist in order to minimize the performance drop
  • The most aggressive one yields fully transparent reconfiguration
  • At the cost of possible out‐of‐order delivery in some unfortunate cases
  • Other solutions do exist that yield a good performance‐complexity compromise

[Chart: maximum latency (cycles) and execution time (cycles) during the reconfiguration transient in an 8x8 chip multiprocessor, comparing baseline OSR (OSR-Lite), an intermediate optimization (OSR-Lite_opt_inorder), and fully transparent reconfiguration (OSR-Lite_opt)]

Performance Perturbation

SLIDE 30

SDM Technology

  • SDM is OK, but how?
  • The Mapping Challenge
  • The Reconfiguration Challenge
  • The Adaptivity Challenge
SLIDE 31

[Figure: a large tiled manycore array, each tile pairing a processor (P) with a router (R)]

So far:

  • Effective resource sharing implies

some form of partitioning…

  • … and isolation for protection.

Next step:

  • Partition scheduling.
  • Partition reshaping.


Strong implications on the programming model!

Looking Forward: Adaptivity

Mode-based partitions Asynchronous partitions Partition Reshaping

Robert Hilbrich and J. Reinier van Kampenhout, “Partitioning and Task Transfer on NoC‐based Many‐Core Processors in the Avionics Domain”, ADA Deutschland 2011

SLIDE 32


OUT‐OF‐ORDER DELIVERY DEADLOCK

[Figure: existing partition with the current S ↦ D path, and extended partition with the new S ↦ D path]

No guarantee that a packet following the red path reaches the destination before one following the green path:

  • Old routes can be non-minimal after a new topology is established
  • Congestion can cause delays

[Figure: all possible S ↦ D paths]

Scheduling of actions in time plays a critical role!

  • Safe option: run on 3 cores, pause, reconfigure the NoC, then resume on 4 cores. DEADLOCK-FREE but INEFFICIENT
  • Aggressive option: reconfigure the NoC while execution continues, then run on 4 cores. DEADLOCK- and OoO-PRONE

The OSR mechanism is again needed to provide:

  • a separation token between old and new packets, thus avoiding out-of-order delivery
  • deadlock freedom

Partition Extension

SLIDE 33

A new era for high‐end systems‐on‐chip!

SLIDE 34

Full‐Fledged HW/SW System Vision

HYPERVISOR RESOURCE MANAGER

Virtualization as a means of simplifying programming Master the reconfiguration hooks exposed by the hardware for energy‐efficient platform‐management

[Figure: dynamic repartitioning of applications A, B, C, and D across the accelerator, with idle regions put to sleep]

  • Delivering efficient support for resource sharing (TDM, SDM, or a mix)…
  • …and adaptive resource sharing for workload adaptivity: workload-adaptive power management while meeting application requirements
  • Programmers should be allowed to specify hardware-platform-agnostic execution requirements (performance targets, quality of service, reliability, or security)
  • Programmers do not need to adapt their applications to the host system hardware, but just to the abstracted environment provided inside each VM.

  • App. requirements
  • Platform state

SLIDE 35


  • Spatial-division multiplexing as a way to consolidate multi-program workloads onto shared manycore programmable accelerators
  • Design methods for spatial partitioning encompass the HW/SW stack and build up a «design-for-partitioning» methodology

– Different degrees of isolation

  • The shape of a partition makes a difference for the application's execution time

– Challenges the resource management policy

  • Next step: support for dynamism

– Runtime modification of the routing function
– Adaptive partition size and shape

  • New vision for high‐end embedded systems
  • Future work:

– Programming model and resource management for a dynamic hardware platform.

Conclusions