Computational Significance (and its implications for HPC)




SLIDE 1

Computational Significance

(and its implications for HPC)

Dimitrios S. Nikolopoulos
CCDSC, Dareizé, Oct. 5, 2016

SLIDE 2

Challenge

  • Transistors

      ✓ Aggressive shrinking
      ✓ Variation in performance, data retention times

  • Two approaches

      ✓ Mitigate it, lose performance
      ✓ Embrace it, gain performance, introduce errors

  • Best effort computing

      ✓ Where algorithms are inherently approximate
      ✓ Where algorithms or systems can mitigate errors

SLIDE 3

Significance-Driven Computing

  • Not all lines of code or variables are equal

      ✓ Each has a unique contribution to the output
      ✓ Estimating this contribution needs domain expertise

  • Computational significance

      ✓ Value of contribution to output

  • Disciplined approximation
  • Abstraction for software

      ✓ Selectively protect execution
          ❑ Memory objects, tasks, threads
      ✓ Control error in the compiler, runtime, language
      ✓ Algorithm complexity control

SLIDE 4

GMRES Resilience

Gschwandtner et al., CSR&D, 2015

SLIDE 5

Significance-driven GMRES

Vassiliadis et al., IJPP, 2016; Chalios et al., CDT, 2015

SLIDE 6

Self-stabilizing CG

Aliaga et al., PARCO, 2015

  • Algorithmic fault correction

      ✓ Periodic step correcting state of algorithm
      ✓ Guaranteed convergence with accurate healing step
      ✓ No assumptions about convergence rate

  • Heterogeneous architecture

      ✓ 1-N reliable-unreliable cores
      ✓ Designed with iso-efficiency metrics
      ✓ Healing step on reliable core
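
The idea of a periodic healing step can be illustrated with a minimal sketch. This is not the published self-stabilizing CG correction: it is a simplified stand-in, assuming the healing step recomputes the true residual from the current iterate and restarts the search direction, and assuming that step runs fault-free (as it would on the reliable core). The function names, the `heal_every` parameter and the injected fault are all hypothetical.

```python
def matvec(A, x):
    """Dense matrix-vector product on plain Python lists."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cg_with_healing(A, b, heal_every=4, max_iters=200, tol=1e-12, fault=None):
    """Conjugate gradient with a periodic healing step (simplified sketch).

    `fault(k, q)` is a hypothetical hook that may corrupt the product q
    in iteration k, standing in for an error on an unreliable core.
    """
    x = [0.0] * len(b)
    r, p, rr = [], [], 0.0
    for k in range(max_iters):
        if k % heal_every == 0:
            # Healing step (assumed reliable): rebuild the state from x alone.
            r = [bi - yi for bi, yi in zip(b, matvec(A, x))]
            rr = dot(r, r)
            if rr <= tol:          # convergence is only trusted here
                return x, k
            p = list(r)            # restart the search direction
        q = matvec(A, p)
        if fault is not None:
            q = fault(k, q)        # possibly corrupted on the unreliable core
        pq = dot(p, q)
        if pq == 0.0 or rr == 0.0:
            continue               # degenerate step; wait for the next healing
        alpha = rr / pq
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * qi for ri, qi in zip(r, q)]
        rr_new = dot(r, r)
        p = [ri + (rr_new / rr) * pi for ri, pi in zip(r, p)]
        rr = rr_new
    return x, max_iters


def corrupt_once(k, q):
    """Hypothetical fault: perturb one entry of q in iteration 1 only."""
    return [q[0] + 5.0] + q[1:] if k == 1 else q


# Small symmetric positive definite system with solution [1, 2, 3].
A = [[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]]
b = [6.0, 10.0, 8.0]
x, iters = cg_with_healing(A, b, fault=corrupt_once)
```

Because convergence is tested only against the recomputed residual, a corrupted recursive residual cannot cause a false early exit in this sketch; the cost is one extra matrix-vector product per healing step.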

SLIDE 7

Language & runtime support

  • Disciplined approximation

      ✓ User controls significance, error, performance

  • Significance abstraction of code & data

      ✓ Binary
      ✓ Continuous

  • Approximate alternatives of code blocks
  • Examples

      ✓ OpenMP tasks
          ❑ Significance ‘score’, task alternatives
      ✓ Dataflow annotations
          ❑ Data criticality
      ✓ App-specific error checks
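
A significance score with task alternatives can be sketched outside OpenMP as follows. This is only an illustration under stated assumptions: each task carries a score, an accurate version and an optional approximate version, and a user-chosen `ratio` sets the fraction of tasks (most significant first) that must run accurately. The name `run_by_significance` and the tuple layout are hypothetical, not the API of any real runtime.

```python
import math


def run_by_significance(tasks, ratio):
    """Run each task accurately or approximately according to its score.

    tasks: list of (significance, accurate_fn, approx_fn_or_None)
    ratio: fraction in [0, 1] of tasks that must run accurately,
           taken in decreasing order of significance.
    """
    n_accurate = math.ceil(ratio * len(tasks))
    order = sorted(range(len(tasks)), key=lambda i: tasks[i][0], reverse=True)
    accurate_ids = set(order[:n_accurate])
    results = []
    for i, (_, accurate, approx) in enumerate(tasks):
        # Tasks without an approximate alternative always run accurately.
        fn = accurate if (i in accurate_ids or approx is None) else approx
        results.append(fn())
    return results


def exact(chunk):
    return lambda: sum(v * v for v in chunk)


def sampled(chunk):
    # Approximate alternative: use every other element and scale by 2.
    return lambda: 2 * sum(v * v for v in chunk[::2])


low, high = [0, 1, 2, 3], [4, 5, 6, 7]
tasks = [
    (0.9, exact(high), sampled(high)),  # large values: high significance
    (0.2, exact(low), sampled(low)),    # small values: low significance
]
full = sum(run_by_significance(tasks, 1.0))   # every task accurate
half = sum(run_by_significance(tasks, 0.5))   # only the top task accurate
none = sum(run_by_significance(tasks, 0.0))   # every task approximate
```

Lowering `ratio` trades output quality for less work, and the scores decide where the error lands first.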

SLIDE 8

Programming Model

Aliaga et al., PARCO, 2015

SLIDE 9

Simple example: Convolution

Aliaga et al., PARCO, 2015

SLIDE 10

Significance-driven runtime

  • On-the-fly task versioning

      ✓ Controlled approximation & error checking

  • Quality-aware synchronization

      ✓ Flimsy barriers

  • Significance propagation

      ✓ Track & tune significance of task groups & chains

  • Multi-dimensional Optimization

      ✓ Performance, Power, Energy, Quality

Vassiliadis et al., IJPP, 2016
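
Controlled approximation with error checking can be pictured as "try the cheap version, verify, fall back". The sketch below is an assumption about the general pattern, not the runtime described in the cited work; `run_checked`, the Newton-iteration approximation and the tolerance are all hypothetical choices made for illustration.

```python
import math


def run_checked(approx_fn, accurate_fn, check):
    """Run the approximate version; re-execute accurately if the check fails."""
    result = approx_fn()
    if check(result):
        return result, "approximate"
    return accurate_fn(), "accurate"


def newton_sqrt(v, steps):
    """Cheap square root: a fixed, small number of Newton iterations."""
    g = v if v > 1.0 else 1.0
    for _ in range(steps):
        g = 0.5 * (g + v / g)
    return g


def sqrt_task(v, rel_tol=1e-2):
    # App-specific check: the squared result must be close to the input.
    return run_checked(
        approx_fn=lambda: newton_sqrt(v, steps=2),
        accurate_fn=lambda: math.sqrt(v),
        check=lambda r: abs(r * r - v) <= rel_tol * v,
    )


small_value, small_version = sqrt_task(2.0)      # two iterations suffice
large_value, large_version = sqrt_task(10000.0)  # check fails, falls back
```

The check has to be much cheaper than the accurate version for this to pay off; here it is a single multiplication and comparison.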

SLIDE 11

Convolution trade-offs

Vassiliadis et al., CF, 2015; Vassiliadis et al., IJPP, 2016

SLIDE 12

Some HPC app results

Vassiliadis et al., CF, 2015; Vassiliadis et al., IJPP, 2016

SLIDE 13

Lulesh error

Vassiliadis et al., CF, 2015; Vassiliadis et al., IJPP, 2016

SLIDE 14

Variable-reliability memory

  • DRAM refresh consumes significant power

      ✓ Projected to reach 40%-50% in future large-memory systems

  • Refresh-free memories

      ✓ Additional errors
      ✓ Many mitigation options (ECC, application)

  • Significance-driven memory management

      ✓ Data placement & migration
      ✓ Memory reliability control

SLIDE 15

Variable-reliability memory

SLIDE 16

Relaxing refresh on an HPC server

  ✓ Divide physical memory into Reliable Domains (RD) and Variably-Reliable Domains (VRD)
  ✓ Allocate kernel to RD
  ✓ Allocate critical app data to RD
  ✓ Allow programmer to allocate heap to VRD
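
The placement policy above can be modelled in a few lines. This is a toy model, not an operating-system allocator: the `Region` type, the `critical` flag, the sizes and the capacity are hypothetical and only show how criticality decides the domain.

```python
from dataclasses import dataclass


@dataclass
class Region:
    name: str
    size_mb: int
    critical: bool  # hypothetical flag set by the OS or the programmer


def place(regions, rd_capacity_mb):
    """Map each region to "RD" (reliable) or "VRD" (variably reliable)."""
    placement, rd_used = {}, 0
    for region in regions:
        if region.critical:
            # Critical data never spills into variably-reliable memory.
            if rd_used + region.size_mb > rd_capacity_mb:
                raise MemoryError(f"reliable domain full: {region.name}")
            rd_used += region.size_mb
            placement[region.name] = "RD"
        else:
            placement[region.name] = "VRD"
    return placement


regions = [
    Region("kernel", 512, True),          # kernel always goes to RD
    Region("solver_state", 256, True),    # critical application data
    Region("image_buffer", 4096, False),  # error-tolerant heap data
]
placement = place(regions, rd_capacity_mb=1024)
```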

SLIDE 17

Application resilience

  ✓ Applications are naturally resilient, just by accessing data
  ✓ Potential for significant performance & energy gains

SLIDE 18

Application-level resilience methods

  • Data classification based on criticality

      ✓ E.g. low/high-frequency coefficients

  • Refresh by access

      ✓ Exploit the natural refresh
      ✓ Spread accesses to variably-reliable memory
      ✓ Iterative algorithms (e.g. k-means)
      ✓ Controlled anti-locality techniques (e.g. stencils)

  • Access-aware scheduling

      ✓ Postpone writes to variably-reliable memory
      ✓ Prioritize reads to variably-reliable memory

SLIDE 19

Refresh-by-data-access

  ✓ Accesses during window of vulnerability act as natural refresh
  ✓ Move writes late, move reads early
  ✓ Scheduling controls data refresh time
  ✓ Anti-locality optimization problem
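
The effect of moving writes late and reads early can be quantified with a small model. It assumes every access to a memory block refreshes it, so a block is at risk when the longest gap between consecutive accesses exceeds its retention time. The time units, block names, schedules and the `retention` value are hypothetical.

```python
def longest_gaps(accesses):
    """Longest interval between consecutive accesses, per block.

    accesses: iterable of (time, block) pairs; reads and writes are
    treated alike because either one refreshes the block in this model.
    """
    last_seen, gaps = {}, {}
    for t, block in sorted(accesses):
        if block in last_seen:
            gaps[block] = max(gaps.get(block, 0), t - last_seen[block])
        else:
            gaps.setdefault(block, 0)
        last_seen[block] = t
    return gaps


def at_risk(accesses, retention):
    """Blocks whose longest unrefreshed gap exceeds the retention time."""
    gaps = longest_gaps(accesses)
    return sorted(block for block, gap in gaps.items() if gap > retention)


# Baseline: both blocks are written early and read much later.
baseline = [(0, "a"), (1, "b"), (8, "a"), (9, "b")]
# Rescheduled: the writes are postponed until just before the reads.
rescheduled = [(6, "a"), (7, "b"), (8, "a"), (9, "b")]

risky_before = at_risk(baseline, retention=4)
risky_after = at_risk(rescheduled, retention=4)
```

In this model the rescheduled order finishes at the same time as the baseline but shrinks every gap from 8 to 2 time units, which is the sense in which scheduling controls the data refresh time.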

SLIDE 20

Refresh-by-access

  ✓ Scheduling parallel tasks to control refresh time
  ✓ Improved resilience at no performance cost

SLIDE 21

HPC in a different context

SLIDE 22

Acknowledgments

  • The team

      ✓ Charalampos Chalios
      ✓ Kostas Tovletoglou
      ✓ Giorgis Georgakoudis
      ✓ George Karakonstantis
      ✓ Hans Vandierendonck

  • The support

      ✓ EPSRC (SERT)
      ✓ EU (SCoRPiO, UniServer)
      ✓ Royal Society (Wolfson Award)