SLIDE 1
Refactoring Conventional Task Schedulers to Exploit Asymmetric ARM big.LITTLE Architectures in Dense Linear Algebra
Luis Costero, Francisco D. Igual, Katzalin Olcoz, Sandra Catalán, Rafael Rodríguez-Sánchez, Enrique S. Quintana-Ortí
SLIDE 2 https://www.youtube.com/watch?v=
SLIDE 3
Task parallelism
SLIDE 4
Contribution
Asymmetry-oblivious scheduler + Asymmetry-aware DLA library
SLIDE 5
Contribution
Asymmetry-oblivious scheduler + Asymmetry-aware DLA library
Task parallelism Data parallelism
SLIDE 6
Contribution
Asymmetry-oblivious scheduler + Asymmetry-aware DLA library
Task parallelism Data parallelism Virtual Cores
SLIDE 7
Software execution models for ARM big.LITTLE
SLIDE 8
Target architecture
SLIDE 9
Execution Models
- Cluster switching mode
- CPU migration
- Global task scheduling
SLIDE 10 Parallel execution of DLA operations
- On multi-threaded architectures
SLIDE 11
A = UᵀU
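As a concrete reference for the factorization A = UᵀU on this slide, here is a minimal unblocked Cholesky sketch (illustrative only, not the authors' implementation, which builds on tuned BLAS-3 kernels):

```python
# Minimal sketch: unblocked Cholesky factorization computing the
# upper-triangular factor U such that A = U^T * U, for a symmetric
# positive-definite matrix A given as a list of lists.
def cholesky_upper(A):
    n = len(A)
    U = [[0.0] * n for _ in range(n)]
    for j in range(n):
        # Diagonal entry: subtract the squares of the entries above it.
        s = A[j][j] - sum(U[k][j] ** 2 for k in range(j))
        U[j][j] = s ** 0.5
        for i in range(j + 1, n):
            # Off-diagonal entries of row j (upper triangle).
            U[j][i] = (A[j][i] - sum(U[k][j] * U[k][i] for k in range(j))) / U[j][j]
    return U
```

For example, A = [[4, 2], [2, 3]] factors into U = [[2, 1], [0, √2]].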
SLIDE 12 Runtime task scheduling of DLA operations
- Task scheduling for the Cholesky factorization
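The task decomposition that a runtime would schedule for the blocked Cholesky factorization can be sketched as follows (kernel names follow BLAS/LAPACK convention; the task tuples are illustrative, not any runtime's API):

```python
# Sketch: classic right-looking blocked Cholesky task decomposition,
# enumerated in the order a runtime (OmpSs/StarPU-style) would discover
# the tasks and then schedule them as a dependency DAG.
def cholesky_tasks(nb):
    """Enumerate tasks for an nb x nb grid of matrix blocks."""
    tasks = []
    for k in range(nb):
        tasks.append(("potrf", k, k))            # factorize diagonal block
        for i in range(k + 1, nb):
            tasks.append(("trsm", k, i))         # triangular solve on panel block
        for i in range(k + 1, nb):
            tasks.append(("syrk", i, i, k))      # symmetric rank-k update of diagonal
            for j in range(i + 1, nb):
                tasks.append(("gemm", i, j, k))  # general update of trailing block
    return tasks
```

For a 2x2 block grid this yields four tasks: potrf, trsm, syrk, potrf; larger grids expose more concurrency in the trailing-matrix updates.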
SLIDE 13 Runtime task scheduling of DLA operations
- Task scheduling in heterogeneous architectures
– The runtime distinguishes between CPU and GPU targets: OmpSs, StarPU, MAGMA, libflame
– Tasks are assigned according to target properties, and target-specific techniques are applied
SLIDE 14 Runtime task scheduling of DLA operations
- Task scheduling in asymmetric architectures
– Asymmetry-conscious runtime: Botlev-OmpSs
– Criticality-aware task scheduler policy
– Each task is mapped to a single core
SLIDE 15 Data parallel libraries of BLAS3 kernels
- Multi-threaded implementation of the BLAS-3
SLIDE 16 Data parallel libraries of BLAS3 kernels
- Data-parallel libraries for asymmetric architectures:
– Global Task Scheduling
– Dynamic workload distribution between the clusters
– Static workload distribution within a cluster
– Specific loop strides for each type of core
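The static split between the big and LITTLE clusters described above can be sketched as follows (the speed ratio and core counts are made-up example values, not measurements from the paper):

```python
# Illustrative sketch of a static big/LITTLE workload split: loop
# iterations are divided between the clusters in proportion to an
# assumed relative speed of a big core versus a LITTLE core.
# speed_ratio = 2.0 is a hypothetical example value.
def split_iterations(n, big_cores=4, little_cores=4, speed_ratio=2.0):
    """Return (big_share, little_share) matching relative core throughput."""
    big_weight = big_cores * speed_ratio
    total_weight = big_weight + little_cores
    big_share = round(n * big_weight / total_weight)
    return big_share, n - big_share
```

With 4+4 cores and a 2x speed ratio, 120 iterations split into 80 for the big cluster and 40 for the LITTLE cluster; each cluster can then traverse its share with a loop stride tuned to its core type.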
SLIDE 17
Retargeting existing task schedulers to asymmetric architectures
SLIDE 18 Evaluation of conventional runtimes
SLIDE 19 Combining conventional runtimes with asymmetric libraries
- GTS model (inspired by CPUM)
– Virtual cores composed of 1 A15 + 1 A7
– Both cores are active simultaneously
– Task level: symmetric runtime
– Data level: asymmetric library
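The virtual-core pairing above can be sketched as a simple mapping (core IDs are hypothetical; in the real system the asymmetric library splits each task's work inside the pair while the symmetric runtime only sees the virtual cores):

```python
# Sketch of the virtual-core abstraction: each virtual core pairs one
# Cortex-A15 ("big") with one Cortex-A7 ("little"). A conventional,
# asymmetry-oblivious scheduler targets the virtual cores; the
# asymmetry-aware library partitions each task internally.
def make_virtual_cores(a15_ids, a7_ids):
    """Pair big and LITTLE core IDs one-to-one into virtual cores."""
    assert len(a15_ids) == len(a7_ids), "pairing requires equal counts"
    return [{"big": b, "little": l} for b, l in zip(a15_ids, a7_ids)]
```

For example, pairing A15 cores 0-3 with A7 cores 4-7 yields four virtual cores, so the task scheduler behaves exactly as it would on a symmetric quad-core machine.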
SLIDE 20 Combining conventional runtimes with asymmetric libraries
- Comparison with other approaches:
✔ Any conventional task scheduler will work transparently, with no special modifications
✔ Any improvement in the runtime will impact performance on an AMP
✔ Any improvement in the asymmetry-aware library will impact performance on an AMP
✗ Requires a tuned asymmetry-aware DLA library
SLIDE 21
Experimental results
SLIDE 22
Performance evaluation of the asymmetric BLIS
SLIDE 23
Performance evaluation of the asymmetric BLIS
SLIDE 24
Integration of the asymmetric BLIS in a conventional task scheduler
SLIDE 25
Performance comparison versus asymmetry-aware task scheduler
SLIDE 26
Conclusions
SLIDE 27 In this work...
- Task-parallelism + Data-parallelism on AMPs
- Reuse of existing task schedulers.
- Competitive with asymmetry-aware schedulers
SLIDE 28
Thank you