SLIDE 1

Correctness Checking Concepts and Tools for High Performance Computing

Ganesh Gopalakrishnan, Wei-Fan Chiang, and Alexey Solovyev
School of Computing, University of Utah, Salt Lake City, UT 84112
URL: http://www.cs.utah.edu/fv

Supported by NSF awards SI2 (ACI-1148127), EAGER (CCF-1241849), Failure Resistant Systems (CCF-1255776), SRC Task 2426.001, NSF Medium (CCF-7298529), and EAGER (CCF-1346756); by the SUPER Institute (for resilience research); and with special thanks to Microsoft for funding (2006-2010) that helped us get established in this area.

SLIDE 2

Correctness Checking Concepts and Tools for High Performance Computing
Bugs: Black Ice on the Road to Exascale

SLIDE 3

SLIDE 4

Relevant Personal History

  • PhD from Stony Brook, 1981 (the era of Mead/Conway VLSI, Hennessy's MIPS, Patterson's SPARC)
  • Joined Utah in 1986
  • Taught OS as my second class
  • Wrote to Tanenbaum
  • Got Minix on 5.25-inch floppy
  • Class did kernel hacking on dual-floppy 5.25-inch IBM PCs
  • …
  • Worked on various aspects of concurrency
  • Self-timed Circuit Design
  • Pipelined Processor Verification
  • Cache Coherence Protocols
  • Shared Memory Consistency Models
  • Feel privileged to work on Formal Methods for Concurrency in Service of HPC!

SLIDE 5

We have been fortunate to have built some tools in support of HPC FV

  • Let us do some demos … so that you have some context for what I'll be saying later

SLIDE 6

DEMO: Dynamic Execution-based Debugging of MPI Programs

Tool Name: ISP

SLIDE 7

DEMO: Symbolic Execution-based debugging of Sequential programs and GPU CUDA programs

Tool name: GKLEE

SLIDE 8

Brief History of Why We are Where We Are

  • CISC machines (70s)
  • Pipelining —> Clock Frequency growth + Compilers
  • Hennessy and Patterson outdid the industry using "Mead and Conway" VLSI design
  • Pipelining —> Better ILP use
  • Moore's law: afforded pipelining tricks
  • Dennard's law: allowed voltage scaling
  • POWER DENSITY stayed the same
  • Ridiculous Frequencies, Diminishing ILP Returns, Moore Alive, Dennard Dying already…
  • Tejas Project Write-off — NY Times
  • Dick Lyon, Charles Leiserson, Guy Blelloch, … were right ALL ALONG!

SLIDES 9-12 (repeat SLIDE 8)
SLIDE 13

Smart Phones (describe the shape of things to come in HPC)

(from Adve, http://www.cs.berkeley.edu/~bodik/ASPLOS13/Symposium/sarita-adve-12-asplos-pc-symposium.pdf)

SLIDE 14

Today's main HPC Mantra

  • "Maximize the volume of computational results obtained per Watt"

SLIDE 15

But what about correctness… ?

(Image collage — Industrial Flares; credits: Nvidia, NASA, Uintah (SCI Group, Utah), Marsden Lab UCSD, Wikipedia)

SLIDE 16

Today's main HPC Mantra

  • "Maximize the volume of computational results obtained per Watt"
  • Subject to Moore's and Dennard's laws

(Courtesy Bob Colwell)

SLIDE 17 (repeats SLIDE 16)

SLIDE 18

So, how prepared are we to debug Heterogeneous Concurrent Systems?

SLIDE 19

What is the young many-core world already facing?

  • Multiple heterogeneous cores
  • Multiple concurrency models
  • Data Races
  • Dead Dennard —> Dark Silicon
  • Bit Flips
  • Floating-Point Uncertainties
  • OFTEN clueless (about concurrency) programming community — will provide examples
  • WE JUST DON'T KNOW HOW TO CALIBRATE THE RISKS

SLIDES 20-27 (repeat SLIDE 19)

SLIDE 28

Power-6 Studies

SLIDE 29

Getting Resilience Ground Truths (Power-6)

SLIDE 30

Power-7 Studies

SLIDE 31

A "feel" of HPC Correctness

  • Constant pressure: the "most science per dollar"
  • Many dimensions of correctness
  • HPC explores unknown aspects of the Sciences
  • Algorithmic approximations are often made
  • Growing heterogeneity in HPC platforms
  • Floating-point representation is inexact
  • "Bit flips"
  • Correctness training is lacking
  • Busy enough doing Science
  • Finding and keeping "Pi men" is difficult
  • Always makes sense to switch to the latest HW
  • Often the poorest documented

(Images: RIKEN K machine; HPC Sciences — credit Lazowska)

SLIDE 32 (repeats SLIDE 31, adding "(Our twist)": FM for HPC)

SLIDE 33

A Heterogeneity-induced bug (Berzins, Meng, Humphrey, XSEDE'12)

P = 0.421874999999999944488848768742172978818416595458984375
C = 0.0026041666666666665221063770019327421323396265506744384765625

Compute: floor(P / C)

Xeon:     P / C = 161.9999…   floor(P / C) = 161
Xeon Phi: P / C = 162         floor(P / C) = 162

Expecting 161 msgs; Sent 162 msgs

SLIDE 34 (repeats SLIDE 33, adding:)

Authors' fix: used double precision for P/C
Question: Is there a more deft solution?

SLIDE 35 (repeats SLIDE 34, adding:)

More important question: What exactly went wrong?? (the XSEDE'12 authors moved along…)
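A hedged, self-contained sketch (ours, not the Uintah code) of why this computation is fragile: the quotient P/C sits within an ulp of the integer 162, so adjacent floating-point values straddle the integer and floor() flips between 161 and 162. Any platform difference in how the division is evaluated (intermediate precision, contraction, vectorization) can therefore change the message count by one.

    /* Hedged illustration: the P and C values are the ones from the slide;
     * the point is only that floor() of a quotient lying within one ulp of
     * an integer is fragile. Compile as C or C++ and link with -lm. */
    #include <math.h>
    #include <stdio.h>

    int main(void) {
        double P = 0.421874999999999944488848768742172978818416595458984375;
        double C = 0.0026041666666666665221063770019327421323396265506744384765625;

        double q  = P / C;                   /* one IEEE rounding of the quotient */
        double lo = nextafter(q, 0.0);       /* adjacent double just below q      */
        double hi = nextafter(q, INFINITY);  /* adjacent double just above q      */

        printf("q  = %.17g   floor = %g\n", q,  floor(q));
        printf("lo = %.17g   floor = %g\n", lo, floor(lo));
        printf("hi = %.17g   floor = %g\n", hi, floor(hi));
        /* The neighbors straddle 162, so these floors disagree: a one-ulp
         * difference in evaluating P/C changes how many messages get sent. */
        return 0;
    }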

SLIDE 36

Resilience

  • ~7 B transistors per GPU (and many billions for CPUs), plus a ton of memory
  • 10^18 transistors throbbing at GHz for weeks
  • Some bit changes MUST be unplanned ones
  • In HPC, results combine more (than, say, in the "cloud")
  • "Bit flip" is a catch-all term for:
  • High speed-variability of devices coupled with DVFS jitter
  • Local hot spots develop, aging chip electronics
  • Particle strikes
  • Energy is the main currency
  • Some of the energy-saving "games" that must be played (this invites bit flips):
  • Dynamic slack detection, followed by lowering voltage + frequency
  • One PNNL study (Kevin Baker): 36 KW -> 18 KW

SLIDE 37

Our Position (1)

  • Despite "bit flips" and such, it is amply clear that sequential and concurrency bugs still ought to be our principal focus
  • They occur quite predictably (unlike bit flips)
  • They are something we can control (and eliminate in many cases)

SLIDE 38

Our Position (2)

  • Unless we can debug in the small, there is NO WAY we can debug in the large

SLIDE 39

Our Observations (3)

  • There are SO MANY instances where experts are getting it wrong — and spreading the wrong advice

SLIDE 40

Example-1

  • IBM Documentation: "If you debug your MPI program under zero Eager Limit (buffering for MPI sends), then adding additional buffering does not cause new deadlocks"
  • It can — see the sketch below
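A hedged sketch (our construction, not IBM's example) of why the claim fails: under a zero eager limit the wildcard receive on rank 1 can only ever match rank 0, but once sends are buffered, rank 2's message may arrive first and claim the wildcard, leaving the subsequent Recv(from 2) unmatched.

    /* Minimal sketch, assuming exactly 3 ranks; tags and payloads are illustrative.
     * Zero eager limit (synchronous sends): rank 0's Send(to 1) cannot complete
     * until it is matched, so rank 2's Send(to 1) has not been posted yet and
     * rank 1's wildcard MUST match rank 0 -> no deadlock.
     * With eager buffering: rank 0's sends return immediately, rank 2 receives
     * and sends to rank 1, and the wildcard may now match rank 2 instead; rank 1's
     * second receive (from rank 2) then never matches -> a new deadlock. */
    #include <mpi.h>

    int main(int argc, char **argv) {
        int rank, buf = 0;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            MPI_Send(&buf, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);              /* to 1 */
            MPI_Send(&buf, 1, MPI_INT, 2, 0, MPI_COMM_WORLD);              /* to 2 */
        } else if (rank == 1) {
            MPI_Recv(&buf, 1, MPI_INT, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);                                   /* wildcard */
            MPI_Recv(&buf, 1, MPI_INT, 2, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);                                   /* from 2 */
        } else if (rank == 2) {
            MPI_Recv(&buf, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(&buf, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);              /* to 1 */
        }

        MPI_Finalize();
        return 0;
    }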

SLIDE 41 (repeats SLIDE 40)

SLIDE 42

Example-2

  • A reduction kernel given as an early-chapter example of a recent CUDA book is broken
  • Reason: Assumes that CUDA atomic-add has "fence" semantics (see the sketch below)
  • Erratum has been issued on the book's website
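A hedged sketch of the pitfall (illustrative names; this is not the book's kernel): in the common "last block finishes the reduction" pattern, an explicit __threadfence() before the atomic counter update is what makes the earlier global store visible to the final reader. Assuming atomicAdd alone provides that ordering — i.e., that it has fence semantics — is exactly the mistake described above.

    // Hedged CUDA sketch with hypothetical names: final stage of a block-wise sum.
    __device__ unsigned int blocksDone = 0;

    __global__ void finishReduction(float *partialSums, float *result, int numBlocks)
    {
        if (threadIdx.x == 0) {
            partialSums[blockIdx.x] = 1.0f;  // stand-in for this block's partial sum

            __threadfence();                 // make that store visible device-wide...
            unsigned int done = atomicAdd(&blocksDone, 1u);  // ...then count this block

            if (done == (unsigned)numBlocks - 1) {           // last block to arrive
                float total = 0.0f;
                for (int i = 0; i < numBlocks; ++i)          // safe: stores were fenced
                    total += partialSums[i];
                *result = total;
            }
        }
        // Omitting the __threadfence() -- i.e., relying on atomicAdd to order the
        // earlier store -- lets the last block read stale partial sums.
    }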

SLIDE 43

Example-3

  • A work-stealing queue in "GPU Gems" is incorrect
  • Reason: Assumes "store-store" ordering between two sequentially issued stores (must have used a fence in between); see the sketch below
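A hedged sketch of the publish/consume step the bullet refers to (illustrative structure, not the GPU Gems code): without the fence, another block may observe the published index before it observes the task data, i.e., the assumed store-store ordering does not hold.

    // Hedged CUDA sketch (hypothetical names): pushing work into a globally
    // visible queue. The fence between the two stores guarantees that a stealer
    // which sees the new 'tail' also sees the task written into slot t.
    __device__ int          tasks[1024];
    __device__ volatile int tail = 0;

    __device__ void pushTask(int task)
    {
        int t = tail;
        tasks[t] = task;     // store #1: the payload
        __threadfence();     // order store #1 before store #2, device-wide
        tail = t + 1;        // store #2: publish the slot to other blocks
    }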

SLIDE 44

Feature of GPU programming

  • Programmers face concurrency corner-cases quite frequently
  • As opposed to (e.g.) OS, where low-level concurrency is usually hidden within the kernel

SLIDE 45

Example-4

  • "If your code ran correctly in FORTRAN, it will also run correctly in C"

SLIDE 46

Example-4 invalidated

SLIDE 47

Example-5: Simple questions can't be answered by today's tools. Does this program deadlock? (Yes.)

SLIDE 48 (repeats SLIDE 47, with a "Match" annotation on the figure)

SLIDE 49

Example-6: Does Warp-Synchronous Programming Help Avoid a __syncthreads?

    __global__ void kernel(int* x, int* y)
    {
        int index = threadIdx.x;
        y[index] = x[index] + y[index];
        if (index != 63 && index != 31)
            y[index + 1] = 1111;
    }

Initially: x[i] == y[i] == i    Warp size = 32
Expected Answer: 0, 1111, 1111, …, 1111, 64, 1111, …

SLIDE 50 (repeats SLIDE 49, adding:)

The hardware schedules these instructions in "warps" (SIMD groups).

SLIDE 51 (repeats SLIDE 50, adding:)

However, this "warp view" often appears to be lost.

SLIDE 52 (repeats SLIDE 51, adding:)

E.g., when compiling with optimizations.
New Answer: 0, 2, 4, 6, 8, …

SLIDE 53 (repeats SLIDE 52, adding:)

But if you read the CUDA documentation carefully, you notice you had to use a C "volatile" that restored the "correct" answers!  (volatile x[], y[]…)
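A sketch of what the "volatile x[], y[]" fix refers to (our reconstruction, assuming the kernel from SLIDE 49): qualifying the pointers keeps the compiler from caching array elements in registers, so the lockstep warp behavior reappears. As SLIDE 54 notes, this rescue is no longer guaranteed since CUDA 5.0.

    __global__ void kernel(volatile int* x, volatile int* y)
    {
        int index = threadIdx.x;
        y[index] = x[index] + y[index];   // with volatile, this store is not
        if (index != 63 && index != 31)   // held in a register by the optimizer
            y[index + 1] = 1111;
    }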

SLIDE 54 (repeats SLIDE 53, adding:)

But the ability to "rescue the correct answer" this way is no longer a guarantee (since CUDA 5.0).

SLIDE 55

So you really trust your compilers?

  • Talk to Prof. John Regehr of Utah
  • Csmith: differential testing of compilers
  • The single most impressive compiler-testing work (IMHO) in recent times
  • Has found goof-ups in -O0 for short programs
  • Many bugs around C volatiles
  • Learned that NOTHING is known about how compilers (ought to) treat floating-point

SLIDE 56

Without swift action, the "din" of the blind leading the blind will sow more confusion

Some threads offer advice ranging from "use volatiles" (was in early CUDA documentation; gone since 5.0)

Others advocate the use of __syncthreads (barriers)

Or query device registers to learn the warp size

https://devtalk.nvidia.com/default/topic/512376/
https://devtalk.nvidia.com/default/topic/499715/
https://devtalk.nvidia.com/default/topic/382928/

And there are several threads that simply discuss this issue:

https://devtalk.nvidia.com/default/topic/632471
https://devtalk.nvidia.com/default/topic/377816/

There isn't a comprehensive picture of the dos and don'ts — and WHY!

SLIDE 57

Discussions on "warp-synchronous" code

https://devtalk.nvidia.com/default/topic/499715/are-threads-of-a-warp-really-sync-/?offset=2

SLIDE 58

Example-8: Do GPUs obey coherence? (Coherence = per-location Sequential Consistency)

  • Ask me after the talk……. :)
  • We are stress-testing real GPUs and finding things out
  • (Work is inspired by Bill Collier, who called it "X-raying real machines" in his famous RAPA book)

SLIDE 59

Our (humble) suggestion

  • There is NO WAY the complexity of anything can be conquered without mathematics
  • The complexity of debugging needs the "mathematics of debugging" — the true mathematics of Software Engineering
  • i.e., formal methods
  • Must develop the "right kind" of formal methods
  • Coexist with the grubby
  • Take on problems in context
  • Win practitioner friends early — and KEEP THEM

SLIDE 60

What is hard about HPC Concurrency?

  • The Scale of Concurrency and the Number of Interacting APIs
  • MPI-2, MPI-3, OpenMP, CUDA, OpenCL, OpenACC, PThreads, use of Non-Blocking Data Structures, Dynamic Scheduling
  • Each API thinks it "owns" the machine
  • Exposure of the Everyday Programmer to Low-Level Concurrency is a worrisome reality!
  • Memory Consistency Models Matter
  • Govern visibility across threads / fences
  • Yet, very poorly specified / understood
  • Compiler Optimizations — not even basic studies exist

SLIDE 61

Is there a role for Formal Methods?

  • Yes indeed!
  • For instance, why is it that microprocessors don't do "Pentium FDIV" any more?
  • Processor ALUs have only become even more complex
  • Answer: Formal gets serious use in the industry
  • Intel: Symbolic Trajectory Evaluation
  • Others: similar methods
  • Processors get FV to varying degrees for other subsystems
  • E.g., Cache Coherence (at the protocol level)

SLIDE 62

Is there a role for Formal Methods?

  • Yes indeed!
  • There is a fascinating array of correctness challenges
  • Very little involvement from the mainstream CS side
  • Lack of exposure, limited interactions across departments
  • Need "cool show-pieces" to draw students to HPC research…

SLIDE 63

An example "cool project": the Utah Pi "cluster", built by Utah PhD students "Mo" Mohammed Saeed Al Mahfoudh and Simone Atzeni

(Under $500; runs MPI, Habanero Java, …)

SLIDE 64

Anyone wanting to do software testing for concurrency must slay two exponentials

SLIDE 65 (repeats SLIDE 64)

SLIDE 66

An FM Grab-bag for anyone wanting to debug concurrent programs

  • Slay the input-space exponential using Symbolic Execution
  • Slay the schedule-space exponential by not jiggling schedules that are Happens-Before equivalent

SLIDE 67

Not Exploring HB-Equivalent Schedules

SLIDE 68

An FM Grab-bag for anyone wanting to debug concurrent programs

  • Concepts in the fuel-tank must include:
  • Lamport's "happens-before"
  • Define concurrency coverage using it
  • Design active-testing methods that systematically explore the schedule space
  • Memory consistency models
  • Data races and how to detect them (see the happens-before sketch below)
  • Symbolic execution
  • Helps achieve input-space coverage
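A minimal sketch (our illustration, with hypothetical names) of the happens-before machinery behind these bullets: vector clocks order events, and two accesses to the same location race when neither happens before the other and at least one is a write.

    #include <algorithm>
    #include <vector>

    // One vector clock per thread; ticked on local events, joined on sync edges.
    struct VClock {
        std::vector<unsigned> c;
        explicit VClock(size_t nThreads) : c(nThreads, 0) {}
        void tick(size_t tid) { ++c[tid]; }                     // local event
        void join(const VClock &o) {                            // receive / acquire
            for (size_t i = 0; i < c.size(); ++i) c[i] = std::max(c[i], o.c[i]);
        }
        bool happensBefore(const VClock &o) const {             // this -> o ?
            bool strictlyLess = false;
            for (size_t i = 0; i < c.size(); ++i) {
                if (c[i] > o.c[i]) return false;
                if (c[i] < o.c[i]) strictlyLess = true;
            }
            return strictlyLess;
        }
    };

    // Race: same location, different threads, unordered by happens-before,
    // and at least one of the two accesses writes.
    bool isRace(const VClock &a, const VClock &b, bool aWrites, bool bWrites) {
        return (aWrites || bWrites) && !a.happensBefore(b) && !b.happensBefore(a);
    }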

SLIDE 69

Overview of our (active) projects

  • HPC Concurrency
  • Dynamic Verification Methods for MPI: CACM, Dec 2011
  • GPU data-race checking: PPoPP'12, SC'12, SC'14
  • Floating-point
  • Finding inputs that cause the highest relative error ("sour spot search"): PPoPP'14
  • Detecting and Root-Causing Non-determinism
  • Pruner project at LLNL — combined static / dynamic analysis for OpenMP race checking
  • System Resilience
  • We have developed an LLVM-level Fault Injector called KULFI
  • Using Coalesced Stack Trace Graphs to Highlight Behavioral Differences
  • Our main focus continues to be correctness tools for HPC Concurrency

SLIDE 70

Biggest Gain due to Formal Methods: Conceptual Cohesion

  • Example: Helps understand that Concurrency and Sequential Abstractions Tessellate
  • Helps understand that Sequential == Deterministic
  • Helps understand Data Races as Breaking the Sequential Contract

SLIDE 71

Concurrency and Sequential Abstractions Tessellate

(Figure: alternating layers of abstraction — fine-grained concurrency of transistor-level circuits; sequential view of Boolean functions (gates); concurrent state machines using gates and flops; sequential program abstractions (e.g., ISA); shared-memory or message-passing based parallelism; solving A x = B)

SLIDE 72

Why Fixate on Data Races?

  • Key assumption that enables sequential thinking
  • Sequential almost always means Deterministic
  • In an Out-of-Order CPU, nothing is sequential
  • Yet we think of assembly programs as "sequential"
  • Only because they yield deterministic results
  • Create hazards (say, in a time-sensitive way)
  • Then we lose this sequential / deterministic abstraction
  • Parallel Programming Almost Always Strives to produce Sequential, i.e., Deterministic Outcomes

SLIDE 73

Races and Race-Free Generalized

(Layered-abstraction figure from SLIDE 71, annotated:) Critical races give gates that spike (broken Boolean abstraction)

SLIDE 74

Races and Race-Free Generalized

(Same figure, annotated:) Races between clocks and data break the sequential abstraction

SLIDE 75

Races and Race-Free Generalized

(Same figure, annotated:) Data races break Sequential Consistency (unsynchronized interleavings matter)

SLIDE 76

Results on UT Lonestar Benchmarks

SLIDE 77

Results on UIUC Parboil Benchmarks

SLIDE 78

Uintah: A Scalable Computational Framework for Multi-physics Problems

  • Under continuous development over the past decade
  • Scalability to 700K CPU cores possible now
  • ~1M LOC or more
  • Modular extensibility to accommodate GPUs and Xeon Phis
  • Partitions concerns:
  • App developer writes sequential apps
  • Infrastructure developer tunes / improves performance

SLIDE 79

Uintah Organization

(Figure: Application Packages (ICE, MPM, ARCHES) feed an abstract directed acyclic task graph (tasks t1…t13), executed by the Runtime System — Simulation Controller, Load Balancer, Scheduler)

SLIDE 80

Case Study: Data Warehouse Error. Collect coalesced call-paths leading to DW::put(); diff across two scheduler versions to isolate the bug

SLIDE 81

Conceptual view of Uintah equipped with a monitoring network (future work)

(Figure: planned pieces — static analysis of the Data Warehouse and Scheduler; automaton learning from traces, tailored to hybrid-concurrency events; cross-layer monitoring hierarchies; system control invariants derived to document and debug via CSTGs; hierarchical active testing and monitoring using standardized interfaces; automata to trigger CSTG collection; task-graph compilation to generate salient high-level events to cross-check. The diagram also shows the scheduler's task-queue states (internal/external/GPU ready queues, MPI send/receive checks, host-device copies) and a coalesced stack trace graph of call paths into DW::reduceMPI and DW::override via MPIScheduler / UnifiedScheduler.)

SLIDE 82

Concluding Remarks

  • Slaying bugs in HPC is essential for Exascale
  • Need a mix of empirical to formal
  • Formal helps with concurrency coverage
  • Formal helps write clear, unambiguous, and validated specs
  • and educate sure-footedly

SLIDE 83

Thanks!

  • www.cs.utah.edu/fv
  • Thanks to my former students who have taught me everything I know about FV and its relevance in the industry

SLIDE 84

The rest of the talk

  • Some results in GPU Data Race Checking
  • Demo of Symbolic Execution and GKLEE
  • Data Race Detection in GPU Programs
  • Computational Frameworks
  • Uintah
  • How Coalesced Stack Trace Graphs help debug
  • Other projects: Floating-Point Correctness and System Resilience
  • Concluding Remarks

SLIDE 85 (repeats SLIDE 84)

SLIDE 86

The key to data race checking

  • For the most part, CUDA code is synchronized via barriers (__syncthreads)
  • Thus, explore a "canonical" interleaving, hoping to detect the "first race" if there is any race (see the sketch of per-barrier-interval checking below)
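A minimal sketch (our illustration, not GKLEE's implementation) of what "check one canonical interleaving" means operationally: log every shared access made between two barriers, then flag any pair from different threads that touches the same address with at least one write — regardless of the order in which the canonical schedule happened to execute them.

    #include <cstdio>
    #include <vector>

    struct Access { int tid; unsigned addr; bool isWrite; };

    // Check one barrier interval of a canonical schedule: any conflicting pair
    // of accesses from different threads is reported as a race.
    void checkBarrierInterval(const std::vector<Access> &log) {
        for (size_t i = 0; i < log.size(); ++i)
            for (size_t j = i + 1; j < log.size(); ++j) {
                const Access &a = log[i], &b = log[j];
                if (a.tid != b.tid && a.addr == b.addr && (a.isWrite || b.isWrite))
                    std::printf("race: t%d and t%d conflict on address 0x%x\n",
                                a.tid, b.tid, a.addr);
            }
    }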

SLIDE 87

Interleaving exploration

For example: if the green dots are local thread actions, then all schedules that arrive at the "cut line" are equivalent!

SLIDE 88

Finding Representative Interleavings

(Same figure and caption as SLIDE 87)

SLIDE 89 (repeats SLIDE 88)

SLIDE 90

GKLEE Examines the Canonical Schedule

Instead of considering all schedules and all potential races…

SLIDE 91 (repeats SLIDE 90, adding:)

…consider JUST THIS SINGLE CANONICAL SCHEDULE!!
Folk theorem (proved in our paper): "We will find A RACE if there is ANY race"!!

SLIDE 92

An Example with Two Data Races

SLIDE 93 (repeats SLIDE 92, annotating:)

The "classic race": threads i and i+1 race

SLIDE 94 (repeats SLIDE 93, annotating further:)

Not explained in any CUDA book as a race: this is the "porting race" (evaluation order between divergent warps is unspecified). A hedged sketch of the "classic" race pattern follows below.
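The slide's actual code is an image not captured in this transcript; as a hedged stand-in, this is the general shape of the "classic" threads-i-and-i+1 race: adjacent threads touch the same element with no intervening __syncthreads.

    __global__ void classicRace(int *a)
    {
        int i = threadIdx.x;
        a[i] = a[i + 1] + 1;   // thread i reads a[i+1] while thread i+1 writes it:
                               // a read-write race between adjacent threads
    }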

SLIDE 95

GKLEE's steps

SLIDE 96

GKLEE's steps: Symbolic Execution

SLIDE 97

GKLEE's steps: Symbolic Execution, then compute conflicts and solve for races

SLIDE 98

GKLEE of PPoPP 2012

(Architecture: C++ CUDA programs with symbolic variable declarations —> LLVM-GCC —> LLVM byte-code instructions —> Symbolic Analyzer and Scheduler + Error Monitors)

Reports: deadlocks, data races, concrete test inputs, bank conflicts, warp divergences, non-coalesced accesses
Test cases: provide high coverage; can be run on HW
SLIDE 99

The advantages of a symbolic-execution based GPU Race Checker: produces concrete witnesses!

    __global__ void histogram64Kernel(unsigned *d_Result, unsigned *d_Data, int dataN) {
        const int threadPos = ((threadIdx.x & (~63)) >> 0)
                            | ((threadIdx.x & 15) << 2)
                            | ((threadIdx.x & 48) >> 4);
        ...
        __syncthreads();
        for (int pos = IMUL(blockIdx.x, blockDim.x) + threadIdx.x; pos < dataN;
             pos += IMUL(blockDim.x, gridDim.x)) {
            unsigned data4 = d_Data[pos];
            ...
            addData64(s_Hist, threadPos, (data4 >> 26) & 0x3FU);
        }
        __syncthreads(); ...
    }

    inline void addData64(unsigned char *s_Hist, int threadPos, unsigned int data)
    { s_Hist[threadPos + IMUL(data, THREAD_N)]++; }

"GKLEE: Is there a Race?"

SLIDE 100 (repeats SLIDE 99, adding GKLEE's answer:)

Threads 5 and 13 have a WW race when d_Data[5] = 0x04040404 and d_Data[13] = 0.

SLIDE 101

GKLEE of SC'12 introduced the idea of Parametric Flows; the GKLEEp tool was introduced (for race-checking, mostly)

SLIDE 102

Idea behind Parametric Flows: capitalize on thread symmetry; divide behavior into flow equivalence classes

SLIDE 103 (repeats SLIDE 102, adding:)

Keep two symbolic threads per flow-group and race-check per flow.

SLIDE 104

Where Race-Checking Happens under Parameterized Flows

SLIDE 105

Where Race-Checking Happens under Parameterized Flows: Intra-Flow

SLIDE 106

Where Race-Checking Happens under Parameterized Flows: Inter-Flow

SLIDE 107

Favorable Results

SLIDE 108

Yet, Unfavorable Results often…

When parametric flow division happens inside a loop, we can get an exponential number of flows.

SLIDE 109

Symbolic Execution with Static Analysis (SC'14 accepted paper)

SLIDE 110

How SESA works

  • A static analysis pass marks how variables affected by each flow in a barrier interval may affect the generation of addresses in the next barrier interval

(Figure: successive barriers delimiting barrier intervals. There are two classes of flows: (1) flows that modify a global or shared variable that flows into control predicates or array-indexing positions, and (2) flows that don't do so. Within the next barrier interval, "OR" the green flows into one flow.)

SLIDE 111

SESA Results

  • We have been able to run SESA on:
  • Lonestar Benchmarks (UT)
  • Parboil Benchmarks (UIUC)
  • It scales well and finds issues:
  • Races
  • Out-of-bounds accesses
  • Tool being integrated into Eclipse