Refactoring Conventional Task Schedulers to Exploit Asymmetric ARM - - PowerPoint PPT Presentation



SLIDE 1

Refactoring Conventional Task Schedulers to Exploit Asymmetric ARM big.LITTLE Architectures in Dense Linear Algebra

Luis Costero, Francisco D. Igual, Katzalin Olcoz, Sandra Catalán, Rafael Rodríguez-Sánchez, Enrique S. Quintana-Ortí

SLIDE 2

https://www.youtube.com/watch?v=

SLIDE 3

Task parallelism

SLIDE 4

Contribution

Asymmetry-oblivious scheduler + Asymmetry-aware DLA library

SLIDE 5

Contribution

Asymmetry-oblivious scheduler + Asymmetry-aware DLA library

Task parallelism + Data parallelism

SLIDE 6

Contribution

Asymmetry-oblivious scheduler + Asymmetry-aware DLA library

Task parallelism + Data parallelism + Virtual Cores

SLIDE 7

Software execution models for ARM big.LITTLE

SLIDE 8

Target architecture

SLIDE 9

Execution Models

  • Cluster switching mode
  • CPU migration
  • Global task scheduling

SLIDE 10

Parallel execution of DLA operations on multi-threaded architectures
SLIDE 11

$A = U^T U$ (Cholesky factorization, with U upper triangular)
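As a small worked illustration (not the authors' code), the unblocked algorithm below computes U row by row from $u_{jj} = \sqrt{a_{jj} - \sum_{k<j} u_{kj}^2}$ and $u_{ji} = (a_{ji} - \sum_{k<j} u_{kj} u_{ki}) / u_{jj}$ for $i > j$. It is a minimal pure-Python sketch, assuming a symmetric positive definite input stored as a list of rows; the function name `cholesky_upper` is hypothetical.

```python
import math


def cholesky_upper(a):
    """Return the upper-triangular U with A = U^T U.

    `a` is a symmetric positive definite matrix given as a list of rows;
    only its upper triangle (a[j][i] with i >= j) is read.
    """
    n = len(a)
    u = [[0.0] * n for _ in range(n)]
    for j in range(n):
        # Diagonal entry: u_jj = sqrt(a_jj - sum_{k<j} u_kj^2)
        s = a[j][j] - sum(u[k][j] ** 2 for k in range(j))
        if s <= 0.0:
            raise ValueError("matrix is not positive definite")
        u[j][j] = math.sqrt(s)
        # Remaining entries of row j: u_ji = (a_ji - sum_{k<j} u_kj * u_ki) / u_jj
        for i in range(j + 1, n):
            s = a[j][i] - sum(u[k][j] * u[k][i] for k in range(j))
            u[j][i] = s / u[j][j]
    return u


if __name__ == "__main__":
    print(cholesky_upper([[4.0, 2.0, 2.0], [2.0, 5.0, 3.0], [2.0, 3.0, 6.0]]))
    # [[2.0, 1.0, 1.0], [0.0, 2.0, 1.0], [0.0, 0.0, 2.0]]
```

Blocked implementations apply the same recurrences to submatrices, which is what turns the factorization into the set of tasks a runtime can schedule.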

SLIDE 12

Runtime task scheduling of DLA operations

  • Task scheduling for the Cholesky factorization
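To make the task decomposition concrete, the sketch below enumerates the tasks of one blocked, right-looking variant of the factorization on an nt × nt grid of blocks (POTRF on the diagonal block, TRSM on the blocks to its right, SYRK and GEMM on the trailing submatrix) and derives the dependency graph a runtime would build from the blocks each task reads and writes. The variant, the function names `cholesky_tasks` and `build_dag`, and the tuple layout are assumptions for illustration. Tracking only the last writer of each block suffices here because, in this variant, no block is written again after it has been read as a finished input.

```python
def cholesky_tasks(nt):
    """Tasks of a blocked right-looking Cholesky (A = U^T U, upper triangle) on an
    nt x nt grid of blocks. Each task is (kernel, block_written, blocks_read)."""
    tasks = []
    for k in range(nt):
        tasks.append(("POTRF", (k, k), [(k, k)]))              # U_kk = chol(A_kk)
        for j in range(k + 1, nt):
            tasks.append(("TRSM", (k, j), [(k, k), (k, j)]))   # A_kj := U_kk^-T A_kj
        for i in range(k + 1, nt):
            tasks.append(("SYRK", (i, i), [(k, i), (i, i)]))   # A_ii -= A_ki^T A_ki
            for j in range(i + 1, nt):
                # A_ij -= A_ki^T A_kj
                tasks.append(("GEMM", (i, j), [(k, i), (k, j), (i, j)]))
    return tasks


def build_dag(tasks):
    """deps[t] = indices of the tasks that must finish before task t starts.
    A task depends on the last writer of every block it reads; its output block
    is listed among its reads, so write-after-write chains are captured too."""
    last_writer = {}
    deps = []
    for t, (_, out, ins) in enumerate(tasks):
        deps.append({last_writer[b] for b in ins if b in last_writer})
        last_writer[out] = t
    return deps


if __name__ == "__main__":
    tasks = cholesky_tasks(3)
    for t, ((name, out, _), d) in enumerate(zip(tasks, build_dag(tasks))):
        print(t, name, out, sorted(d))
```

For a 3 × 3 grid this yields 10 tasks; independent tasks (for example, the two TRSMs of the first step) are exactly the parallelism a task scheduler can exploit.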
SLIDE 13

Runtime task scheduling of DLA operations

  • Task scheduling in heterogeneous architectures

– The runtime distinguishes between CPU and GPU targets: OmpSs, StarPU, MAGMA, libflame

– Tasks are assigned depending on the properties of each target, and target-specific techniques are applied

SLIDE 14

Runtime task scheduling of DLA operations

  • Task scheduling in asymmetric architectures

– Asymmetry-conscious runtime: Botlev-OmpSs
– Criticality-aware task scheduler policy
– Each task is mapped to a single core
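The idea of a criticality-aware policy can be illustrated with a simplified sketch: compute the bottom level of every task (its cost plus the longest path to an exit task), mark the tasks on the longest path as critical, and send those to the big cluster while the rest go to the LITTLE one. This is not the Botlev-OmpSs implementation; the task costs, the ahead-of-time classification and the function names are assumptions for illustration.

```python
def bottom_levels(succ, cost):
    """bl[t] = cost[t] + longest path from t to an exit task.
    Assumes tasks are numbered in topological order (every edge goes t -> s with s > t)."""
    bl = [0.0] * len(cost)
    for t in reversed(range(len(cost))):
        bl[t] = cost[t] + max((bl[s] for s in succ[t]), default=0.0)
    return bl


def critical_path(succ, bl):
    """Start at the entry task with the largest bottom level and repeatedly
    follow the successor with the largest bottom level."""
    has_pred = {s for ss in succ for s in ss}
    t = max((v for v in range(len(bl)) if v not in has_pred), key=lambda v: bl[v])
    path = [t]
    while succ[t]:
        t = max(succ[t], key=lambda v: bl[v])
        path.append(t)
    return path


def map_to_clusters(succ, cost):
    """Hypothetical static policy: critical tasks -> big cluster, others -> LITTLE."""
    critical = set(critical_path(succ, bottom_levels(succ, cost)))
    return {t: ("big" if t in critical else "LITTLE") for t in range(len(cost))}


if __name__ == "__main__":
    # Diamond DAG 0 -> {1, 2} -> 3 with made-up task costs.
    succ = [[1, 2], [3], [3], []]
    cost = [1.0, 5.0, 2.0, 1.0]
    print(map_to_clusters(succ, cost))  # {0: 'big', 1: 'big', 2: 'LITTLE', 3: 'big'}
```

Since each task runs on a single core, shortening the critical path by placing its tasks on fast cores is what matters most for the total execution time.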

SLIDE 15

Data-parallel libraries of BLAS-3 kernels

  • Multi-threaded implementation of the BLAS-3
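For reference, the simplest data-parallel scheme for a BLAS-3 kernel such as GEMM (C += A·B) partitions the columns of C into contiguous panels, one per thread, so that threads write disjoint parts of the output. The pure-Python sketch below shows only this partitioning; the names `gemm_panel` and `parallel_gemm` are hypothetical, real libraries such as BLIS use optimized micro-kernels and several nested blocking loops, and CPython threads will not run this arithmetic in parallel because of the global interpreter lock.

```python
from concurrent.futures import ThreadPoolExecutor


def gemm_panel(a, b, c, j0, j1):
    """C[:, j0:j1] += A @ B[:, j0:j1] for matrices stored as lists of rows."""
    for i in range(len(a)):
        for j in range(j0, j1):
            s = 0.0
            for p in range(len(b)):
                s += a[i][p] * b[p][j]
            c[i][j] += s


def parallel_gemm(a, b, c, nthreads):
    """Data-parallel GEMM: thread t owns a contiguous panel of columns of C,
    so no two threads ever write the same element."""
    n = len(b[0])
    bounds = [(t * n // nthreads, (t + 1) * n // nthreads) for t in range(nthreads)]
    with ThreadPoolExecutor(max_workers=nthreads) as pool:
        futures = [pool.submit(gemm_panel, a, b, c, j0, j1) for j0, j1 in bounds]
        for f in futures:
            f.result()  # re-raise any exception from a worker


if __name__ == "__main__":
    c = [[0.0] * 3 for _ in range(2)]
    parallel_gemm([[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0, 7.0], [8.0, 9.0, 10.0]], c, 3)
    print(c)  # [[21.0, 24.0, 27.0], [47.0, 54.0, 61.0]]
```

An equal split like this one assumes identical cores; on an AMP the slow cores would finish last, which motivates the asymmetric distribution schemes.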
SLIDE 16

Data-parallel libraries of BLAS-3 kernels

  • Data-parallel libraries for asymmetric architectures:

– Global Task Scheduling
– Dynamic workload distribution between the clusters
– Static workload distribution within a cluster
– Specific loop strides for each type of core
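A toy simulation of this two-level distribution: each cluster fetches a chunk of columns of C from a shared counter whenever it becomes idle (dynamic, between clusters), each cluster uses its own chunk size (standing in for the per-core-type loop strides), and each chunk is split evenly among the cores of that cluster (static, within a cluster). The function name, chunk sizes, per-core speeds and core counts are made-up parameters, and the timing model is a deliberate simplification, not the asymmetric BLIS code.

```python
import heapq


def distribute_columns(n, chunk, rate, cores):
    """Toy model of a two-level distribution of the n columns of C.

    chunk[x]: columns cluster x fetches at a time (its own "stride"),
    rate[x]:  columns per time unit processed by one core of cluster x,
    cores[x]: number of cores in cluster x.
    Returns a list of (cluster, core, first_col, end_col) assignments.
    """
    next_col = 0                               # shared counter (dynamic part)
    idle = [(0.0, x) for x in sorted(chunk)]   # (time cluster x becomes idle, x)
    heapq.heapify(idle)
    plan = []
    while next_col < n:
        t, x = heapq.heappop(idle)             # earliest idle cluster grabs next chunk
        lo, hi = next_col, min(n, next_col + chunk[x])
        next_col = hi
        p, width = cores[x], hi - lo
        for core in range(p):                  # even static split inside the cluster
            a, b = lo + core * width // p, lo + (core + 1) * width // p
            if a < b:
                plan.append((x, core, a, b))
        per_core = -(-width // p)              # ceil(width / p) columns per core
        heapq.heappush(idle, (t + per_core / rate[x], x))
    return plan


if __name__ == "__main__":
    plan = distribute_columns(
        1000,
        chunk={"big": 64, "LITTLE": 16},
        rate={"big": 4.0, "LITTLE": 1.0},
        cores={"big": 4, "LITTLE": 4},
    )
    for x in ("big", "LITTLE"):
        print(x, sum(b - a for y, _, a, b in plan if y == x))
```

With these invented parameters the big cluster ends up with 792 of the 1000 columns and the LITTLE cluster with 208, because the shared counter lets the faster cluster come back for work more often.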

SLIDE 17

Retargeting existing task schedulers to asymmetric architectures

SLIDE 18

Evaluation of conventional runtimes on AMPs
SLIDE 19

Combining conventional runtimes with asymmetric libraries

  • GTS model (inspired by CPUM, the CPU migration model)

– Virtual cores composed of one A15 core + one A7 core
– Both cores are active simultaneously

  • Parallelism:

– Task level: symmetric runtime
– Data level: asymmetric library
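The virtual-core idea can be sketched as two small steps: pair each A15 with one A7 so that the conventional runtime sees a set of identical workers, and, inside a task, split the data-parallel work between the two physical cores in proportion to their relative speed. The function names, the core numbering and the 3:1 speed ratio in the example are hypothetical values, not measurements from the slides, and the static proportional split is a simplification of what an asymmetry-aware library does.

```python
def make_virtual_cores(big_ids, little_ids):
    """Pair each big (A15) core with one LITTLE (A7) core.
    The symmetric runtime then schedules tasks onto len(big_ids) identical workers."""
    if len(big_ids) != len(little_ids):
        raise ValueError("expected as many big cores as LITTLE cores")
    return list(zip(big_ids, little_ids))


def split_columns(n, big_speed, little_speed):
    """Split n columns of one task between the two cores of a virtual core so that,
    under a constant-speed model, both finish at about the same time.
    Returns half-open column ranges (big_range, little_range)."""
    n_big = round(n * big_speed / (big_speed + little_speed))
    return (0, n_big), (n_big, n)


if __name__ == "__main__":
    print(make_virtual_cores([4, 5, 6, 7], [0, 1, 2, 3]))  # [(4, 0), (5, 1), (6, 2), (7, 3)]
    print(split_columns(256, 3.0, 1.0))                     # ((0, 192), (192, 256))
```

The runtime never needs to know about the asymmetry: it only sees four equal workers, while the split inside each task is left to the library.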

SLIDE 20

Combining conventional runtimes with asymmetric libraries

  • Comparison with other approaches:

✔ Any conventional task scheduler works transparently, with no special modifications

✔ Any improvement in the runtime is reflected in the performance on an AMP

✔ Any improvement in the asymmetry-aware library is reflected in the performance on an AMP

✗ Requires a tuned asymmetry-aware DLA library

SLIDE 21

Experimental results

SLIDE 22

Performance evaluation of the asymmetric BLIS

SLIDE 23

Performance evaluation of the asymmetric BLIS

SLIDE 24

Integration of the asymmetric BLIS in a conventional task scheduler

SLIDE 25

Performance comparison versus asymmetry-aware task scheduler

SLIDE 26

Conclusions

SLIDE 27

In this work...

  • Task-parallelism + Data-parallelism on AMPs
  • Reuse of existing task schedulers
  • Competitive with asymmetry-aware schedulers
SLIDE 28

Thank you