Real World Embedded DSLs: Scottish Programming Languages and Verification Summer School 2019 (PowerPoint PPT Presentation)



SLIDE 1

Real World Embedded DSLs

Scottish Programming Languages and Verification Summer School 2019

Rob Stewart (R.Stewart@hw.ac.uk) August 2019

Heriot-Watt University, Edinburgh

SLIDE 2

What is a DSL

SLIDE 3

What is a DSL

Paul Hudak: “A DSL is…”

  • Programming language geared for application domain
  • Capture semantics of a domain, no more no less
  • User immersed in domain knows domain semantics
  • Just need a notation to express those semantics

Paul Hudak. “Domain Specific Languages”. In: ed. by Peter Salus. Vol. 3. Handbook of Programming Languages, Little Languages and Tools. MacMillan, Indiana, 1998. Chap. 3.

1

SLIDE 4

DSL Design Guidelines

  • 1. Choose a domain
  • 2. Design DSL to accurately capture domain semantics
  • 3. Use the KISS (keep it simple, stupid) principle
  • 4. “Little languages” are a Good Thing
  • 5. Concentrate on domain semantics; not too much on syntax
  • 6. Don’t let performance dominate design
  • 7. Don’t let design dominate performance either
  • 8. Prototype your design, refine, iterate
  • 9. Build tools to support the DSL
  • 10. Develop applications with the DSL
  • 11. Keep end user in mind; Success = A Happy Customer

Hudak, “Domain Specific Languages”.

2

SLIDE 5

Domain Specificity

SLIDE 6

Application Domain Examples

  • Scheduling
  • Simulation
  • Lexing/parsing
  • Robotics
  • Graphics & animation
  • Databases
  • Logic
  • Security
  • Modelling
  • Graphical user interfaces
  • Symbolic computing
  • Hardware description
  • Text processing
  • Computer music
  • Distributed & parallel computing

3

SLIDE 7

Domain Specificity

DSLs ACM Computing Survey:

  • Some consider Cobol a DSL for business applications
  • Others argue this is pushing the notion of application domain too far
  • Think of DSLs on a gradual scale: specialised DSLs e.g. BNF on the left, GPLs such as C++ on the right
  • Hard to tell if command languages like the Unix shell or scripting languages like Tcl are DSLs
  • Domain-specificity is a matter of degree

Marjan Mernik, Jan Heering, and Anthony M. Sloane. “When and how to develop domain-specific languages”. In: ACM Comput. Surv. 37.4 (2005), pp. 316–344.

4

SLIDE 8

Why DSLs?

SLIDE 9

DSL Advantages

  • 1. More concise: easy to look at, see, think about, show
  • 2. Increased programmer productivity: DSLs tend to be high level, meaning shorter programs
  • 3. Programs are easier to maintain
  • less code == less maintenance
  • 4. Easier to reason about: programs are expressed at the level of the problem domain, so domain knowledge can be conserved, validated, and reused

Debasish Ghosh. DSLs in Action. Greenwich, CT, USA: Manning Publications Co., 2010.

5

SLIDE 10

The DSL pay off

  • Initial DSL costs high, but software development costs low
  • Should eventually start saving time and money

Hudak, “Domain Specific Languages”.

6

SLIDE 11

DSLs: Return On Investment

  • Rhapsody: UML models used to develop software components
  • Philips had issues with Rhapsody (see paper)
  • Dezyne: another modelling language; verifies properties such as livelock freedom and determinism
  • Philips developed the ComMA DSL
  • automates translation of Rhapsody models to Dezyne

Mathijs Schuts, Jozef Hooman, and Paul Tielemans. “Industrial Experience with the Migration of Legacy Models using a DSL”. In: Real World Domain Specific Languages Workshop, Vienna, Austria, February 24. 2018, 1:1–1:10.

7

SLIDE 12

DSLs: Return On Investment

  • Manual: 576 hours (16 person weeks)
  • manual transformation of 8 state machines
  • Automated: 190 hours to develop automation
  • 60 hours: Rhapsody input, Dezyne output with ComMA
  • 15 hours: model learning, equivalence checking
  • 25 hours: Visual Studio integration
  • 90 hours: develop additional state machine support

ROI = (gain from investment − cost of investment) / cost of investment

ROI = (576 − 190) / 190 ≈ 2

Schuts, Hooman, and Tielemans, “Industrial Experience with the Migration of Legacy Models using a DSL”.
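The ROI arithmetic above is simple enough to check mechanically; a short Python sketch using the figures quoted on this slide:

```python
# ROI from the slide: hours saved by automation, relative to what the
# automation cost to build.

def roi(gain, cost):
    """(gain from investment - cost of investment) / cost of investment"""
    return (gain - cost) / cost

manual_hours = 576      # 16 person weeks of manual transformation
automation_hours = 190  # 60 + 15 + 25 + 90 hours of automation work

print(round(roi(manual_hours, automation_hours), 2))  # 2.03
```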

8

SLIDE 13

Early DSL example

SLIDE 14

APT

APT (Automatically Programmed Tool):

  • Numerically controlled machine tools
  • One of the 1st DSLs

Douglas T. Ross. “Origins of the APT Language for Automatically Programmed Tools”. In: SIGPLAN Not. 13.8 (Aug. 1978), pp. 61–99.

9

SLIDE 15

Future Proofing APT

10

SLIDE 16

Domain Specificity of APT Semantics

11

SLIDE 17

APT Declarative vs Imperative

12

SLIDE 18

APT Implementation Concerns

13

SLIDE 19

APT

14

SLIDE 20

APT

15

SLIDE 21

APT

16

SLIDE 22

APT

17

SLIDE 23

APT Vocabulary

18

SLIDE 24

DSLs used Today

  • PERL: text manipulation
  • VHDL: hardware description
  • LaTeX: typesetting

  • HTML: document markup
  • SQL: database transactions
  • Maple: symbolic computing
  • AutoCAD: computer aided design
  • Prolog: logic
  • Excel

Hudak, “Domain Specific Languages”.

19

SLIDE 25

The rest of this talk

  • 1. Counterexamples for many “in general” observations
  • 2. Code examples mostly extracted from publications
  • Footnote citations on these slides

20

SLIDE 26

Modern DSL examples

SLIDE 27

Motivations for DSLs: Examples

  • Familiar notation for domain experts (SQL)
  • High level abstraction (Keras)
  • Compositionality (Frenetic)
  • Speed (Halide)
  • Productivity (Halide)
  • Correctness (Ivory)

21

SLIDE 28

Domain Expert Familiarity: SQL

SELECT firstName, lastName, address
FROM employee
WHERE salary > ALL (SELECT salary
                    FROM employee
                    WHERE firstName = 'Paul')

  • Programmer training
  • 1 day to become SQL competent
  • months to become SQL expert

Hudak, “Domain Specific Languages”.

22

SLIDE 29

Abstraction: Keras

Embedded in Python for defining neural networks

model = Sequential()
model.add(Dense(12, input_dim=8, activation='relu'))
model.add(Dense(8, activation='relu'))
model.add(Dense(1, activation='sigmoid'))

  • High level API on top of Tensorflow
  • Rapid prototyping of neural networks
  • Insert Tensorflow code into the Keras model/training pipeline
  • TF flexibility: custom cost function or layer
  • TF functionality: threads, debugger
  • TF control: set variables to be trainable or not
  • Analogous to inline ASM, inline C, etc.

23

SLIDE 30

Compositionality: Frenetic Network programming language

  • Problem with OpenFlow and NOX (SDN languages)
  • lack compositionality
  • low level: programs unnecessarily complicated
  • two-tier programs lead to race conditions
  • Solution: Frenetic DSL
  • high level compositional patterns (translates to OpenFlow)
  • two sub-languages
  • 1. “see every packet” network query language
  • 2. functional reactive network policy language
  • queries and policies compose

Nate Foster et al. “Frenetic: a network programming language”. In: Proceedings of the 16th ACM SIGPLAN ICFP 2011, Tokyo, Japan, September 19-21, 2011. ACM, 2011, pp. 279–291.

24

SLIDE 31

Compositionality: Frenetic

Embedded in Python… “to ease adoption”

def host_query():
    return (Select(sizes) *
            Where(inport_fp(2)) *
            GroupBy([dstmac]) *
            Every(60))

def all_stats():
    Merge(host_query(), web_query()) >> Print()

def repeater_web_monitor():
    repeater()
    all_stats()

25

SLIDE 32

Speed: Halide

  • High performance C++ embedded image/array processing
  • Separates algorithm from scheduling code

Func blur_3x3(Func input) {
    Func blur_x, blur_y;
    Var x, y, xi, yi;

    // The algorithm - no storage or order
    blur_x(x, y) = (input(x-1, y) + input(x, y) + input(x+1, y))/3;
    blur_y(x, y) = (blur_x(x, y-1) + blur_x(x, y) + blur_x(x, y+1))/3;

    // The schedule - defines order, locality; implies storage
    blur_y.tile(x, y, xi, yi, 256, 32)
          .vectorize(xi, 8).parallel(y);
    blur_x.compute_at(blur_y, x).vectorize(x, 8);

    return blur_y;
}

Jonathan Ragan-Kelley et al. “Halide: a language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines”. In: ACM SIGPLAN PLDI, Seattle, WA, USA, June 2013, pp. 519–530.

26

SLIDE 33

Speed and Productivity: Halide

  • Programmer productivity and fast performance
  • Bilateral slicing layer
  • a high-performance image processing architecture to approximate complicated image processing pipelines

  • Halide extensions
  • Automatic Differentiation
  • Scheduling
  • Programmer productivity
  • Halide 24 lines, PyTorch 42 lines, CUDA 308 lines
  • Halide 10x faster than CUDA, 20x faster than PyTorch

Tzu-Mao Li et al. “Differentiable programming for image processing and deep learning in halide”. In: ACM Trans. Graph. 37.4 (2018), 139:1–139:13.

27

SLIDE 34

Speed and Productivity: Halide

Li et al., “Differentiable programming for image processing and deep learning in halide”.

28

SLIDE 35

Correctness: Ivory

  • Ivory: safe systems programming, memory and type safety
  • Type system shallowly embedded using GHC type features
  • Syntax is deeply embedded, from one AST:
  • Embedded C generation
  • SMT-based symbolic simulator
  • Theorem-prover back-end

Industry strength EDSL:

  • Boeing use Ivory to implement level-of-interoperability for a NATO standard interface for Unmanned Control System (UCS) & Unmanned Aerial Vehicle (UAV) interoperability

Trevor Elliott et al. “Guilt free ivory”. In: Proceedings of the 8th ACM SIGPLAN Symposium on Haskell, Vancouver, BC, Canada, September 3-4, 2015. 2015, pp. 189–200.

29

SLIDE 36

Correctness: Ivory

fib_loop :: Def ('[Ix 1000] :-> Uint32)

  • Def is Ivory procedure (aka C function)
  • '[Ix 1000] :-> Uint32
  • takes index argument n
  • 0 <= n < 1000
  • this procedure returns unsigned 32 bit integer

fib_loop = proc "fib_loop" $ \ n -> body $ do

  • Ivory body func takes argument of type Ivory eff ()
  • eff effect scope enforces type & memory safety
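For contrast, a runtime sketch in Python of the same bounded-index idea (hypothetical code, not from Ivory): Ivory rejects an out-of-range index at compile time via the '[Ix 1000] type, whereas a language without such type machinery can only check the bound dynamically.

```python
# Hypothetical runtime analogue of Ivory's Ix 1000 bound: the check that
# Ivory's type system performs statically is performed here dynamically.

def ix(bound):
    def check(n):
        if not (0 <= n < bound):
            raise ValueError(f"index {n} out of range [0, {bound})")
        return n
    return check

ix1000 = ix(1000)

def fib_loop(n):
    n = ix1000(n)       # reject out-of-range indices at call time
    a, b = 0, 1
    for _ in range(n):  # run the loop n times, as the Ivory body does
        a, b = b, a + b
    return a

print(fib_loop(10))  # 55
```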

30

SLIDE 37

Correctness: Ivory

a <- local (ival 0)
b <- local (ival 1)

  • a and b local stack variables

n `times` \_ith -> do
  a' <- deref a
  b' <- deref b
  store a b'
  store b (a' + b')

  • Run a loop 1000 times (inferred from [Ix 1000])

31

SLIDE 38

Correctness: Ivory

fib_loop :: Def ('[Ix 1000] :-> Uint32)
fib_loop = proc "fib_loop" $ \ n -> body $ do
  a <- local (ival 0)
  b <- local (ival 1)
  n `times` \_ith -> do
    a' <- deref a
    b' <- deref b
    store a b'
    store b (a' + b')
  result <- deref a
  ret result

fib_module :: Module
fib_module = package "fib" (incl fib_loop)

main = C.compile [ fib_module ]

https://ivorylang.org/ivory-fib.html

32

SLIDE 39

Implementations

Notice distinguishing feature?

  • Internal
  • Keras (Python)
  • Frenetic (Python)
  • Halide (C++)
  • Ivory (Haskell)
  • External
  • SQL

Embeddings of external languages exist too, e.g. Selda: a type safe SQL EDSL

Anton Ekblad. “Scoping Monadic Relational Database Queries”. In: Proceedings of the 12th ACM SIGPLAN International Symposium on Haskell. Haskell 2019. Berlin, Germany, 2019, pp. 114–124.

33

SLIDE 40

Internal and External DSLs

SLIDE 41

DSL Implementation Choices

External

  • 1. Parser + Interpreter: interactive read–eval–print loop
  • 2. Parser + Compiler: DSL constructs to another language
  • LLVM a popular IR to target for CPUs/GPUs

Internal

  • Embed in a general purpose language
  • Reuse features/infrastructure of existing language
  • frontend (syntax + type checker)
  • maybe its backend too
  • maybe its runtime system too
  • Concentrate on semantics
  • Metaprogramming tools to have uniform look and feel

Trend: towards language embeddings, away from external approaches

34

SLIDE 42

External Advantages

  • Domain specific notation not constrained by host’s syntax
  • Building DSLs from scratch: better error messages
  • DSL syntax close to notations used by domain experts
  • Domain specific analysis, verification, optimisation, parallelisation and transformation (AVOPT) is possible
  • AVOPT for internal? host’s syntax or semantics may be too complex or not well defined, limiting AVOPT

35

SLIDE 43

External Disadvantages

  • External DSLs are a large development effort because a complex language processor must be implemented
  • syntax, semantics, interpreter/compiler, tools
  • DSLs from scratch often lead to incoherent designs
  • DSL design is hard, requiring both domain and language development expertise. Few people have both.

  • Mission creep: programmers want more features
  • A new language for every domain?

Mernik, Heering, and Sloane, “When and how to develop domain-specific languages”.

36

SLIDE 44

Implementation of Internal DSLs

  • Syntax tree manipulation (deeply embedded compilers)
  • create & traverse AST, AST manipulations to generate code
  • Type embedding (e.g. Par monad, parser combinators)
  • DS types, operations over them
  • Runtime meta-programming (e.g. MetaOCaml, Scala LMS)
  • Program fragments generated at runtime
  • Compile-time meta-programming (e.g. Template Haskell)
  • Program fragments generated at compile time
  • Preprocessor (e.g. macros)
  • DSL translated to host language before compilation
  • Static analysis limited to that performed by base language
  • Extend a compiler for domain specific code generation
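The "type embedding" style above can be sketched in a few lines of Python (a hypothetical image mini-DSL, not any of the systems cited here): the DSL is just domain-specific types plus operations over them, evaluated directly by the host.

```python
# Type embedding sketch (hypothetical mini-DSL): domain types and operations,
# with every operation computing its result immediately in the host language.

class Image:
    def __init__(self, pixels):
        self.pixels = list(pixels)

def invert(img):
    return Image(255 - p for p in img.pixels)

def brighten(img, amount):
    # clamp to the 8-bit range, as an image DSL would
    return Image(min(255, p + amount) for p in img.pixels)

out = brighten(invert(Image([10, 20, 250])), 5)
print(out.pixels)  # [250, 240, 10]
```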

37

SLIDE 45

Internal DSL Advantages/Disadvantages

  • Advantages
  • modest development effort, rapid prototyping
  • many language features for free
  • host tooling (debugging, perf benchmarks, editors) for free
  • lower user training costs
  • Disadvantages
  • syntax may be far from optimal
  • cannot easily introduce arbitrary syntax
  • difficult to express/implement domain specific optimisations, affecting efficiency
  • cannot easily extend compiler
  • bad error reporting

Mernik, Heering, and Sloane, “When and how to develop domain-specific languages”.

38

SLIDE 46

Counterexamples

Claimed disadvantages of EDSLs:

  • 1. Difficult to extend a host language compiler
  • 2. Bad error messages

Are these fair criticisms?

39

SLIDE 47

Extending a Compiler

Counterexample to the “extensible compiler” argument:

  • user defined GHC rewrites
  • GHC makes no attempt to verify rule is an identity
  • GHC makes no attempt to ensure that the right hand side is more efficient than the left hand side

  • Opportunity for domain specific optimisations?

blur5x5 :: Image -> Image
blur3x3 :: Image -> Image

{-# RULES
"blur5x5/blur3x3" forall image.
    blur3x3 (blur3x3 image) = blur5x5 image
  #-}

40

SLIDE 48

Custom Error Message

EDSL “bad error reporting” claim not entirely true.

3 + False

<interactive>:1:1 error:
    No instance for (Num Bool) arising from a use of `+'
    In the expression: 3 + False
    In an equation for `it': it = 3 + False

George Wilson. “Functional Programming in Education”. YouTube. July 2019.

41

SLIDE 49

Custom Error Message

import GHC.TypeLits

instance TypeError (Text "Booleans are not numbers"
                    :$$: Text "so we cannot add or multiply them")
  => Num Bool where ...

3 + False

<interactive>:1:1 error:
    Booleans are not numbers
    so we cannot add or multiply them
    In the expression: 3 + False
    In an equation for `it': it = 3 + False

42

SLIDE 50

Library versus EDSL?

SLIDE 51

Are EDSL just libraries?

  • X is an EDSL for image processing
  • Y is an EDSL for web programming
  • Z is an EDSL for ….

When is a library not domain specific? Are all libraries EDSLs?

43

SLIDE 52

DSL design patterns

  • Language exploitation
  • 1. Specialisation: restrict host for safety, optimisation..
  • 2. Extension: host language syntax/semantics extended
  • Informal designs
  • Natural language and illustrative DSL programs
  • Formal designs
  • BNF grammars for syntax specifications
  • Rewrite systems
  • Abstract state machines for semantic specification

If library formally defined does it constitute ”language” status?

Mernik, Heering, and Sloane, “When and how to develop domain-specific languages”.

44

SLIDE 53

Library versus EDSL?

When is a library an EDSL?

  • 1. Well defined DS semantics: library has a formal semantics, e.g. HdpH-RS has a formal operational semantics for its constructs?
  • 2. Compiler: library has its own compiler for its constructs, e.g. Accelerate?
  • 3. Language restriction: library is a restriction of expressivity, e.g. lifting values into the library’s types?
  • 4. Extends syntax: library extends host’s syntax, e.g. use of compile time meta-programming?
45

SLIDE 54

Library versus EDSL?

HdpH-RS embedded in Haskell

-- task distribution
data Par a      -- monadic parallel computation of type 'a'
type Task a

spawn   :: Task a -> Par (Future a)          -- lazy
spawnAt :: Node -> Task a -> Par (Future a)  -- eager

-- communication of results via futures
type Future a
get :: Future a -> Par a                     -- local read

Robert J. Stewart, Patrick Maier, and Phil Trinder. “Transparent fault tolerance for scalable functional computation”. In: J. Funct. Program. 26 (2016), e5.

46

SLIDE 55

Library versus EDSL?

States R, S, T ::=
    S | T              parallel composition
  | ⟨M⟩p               thread on node p, executing M
  | ⟨ ⟨M⟩ ⟩p           spark on node p, to execute M
  | i{M}p              full IVar i on node p, holding M
  | i{⟨M⟩q}p           empty IVar i on node p, supervising thread ⟨M⟩q
  | i{⟨ ⟨M⟩ ⟩Q}p       empty IVar i on node p, supervising spark ⟨ ⟨M⟩ ⟩q
  | i{⊥}p              zombie IVar i on node p
  | deadp              notification that node p is dead

⟨E[spawn M]⟩p −→ νi.(⟨E[return i]⟩p | i{⟨ ⟨M >>= rput i⟩ ⟩{p}}p | ⟨ ⟨M >>= rput i⟩ ⟩p)   (spawn)
⟨E[spawnAt q M]⟩p −→ νi.(⟨E[return i]⟩p | i{⟨M >>= rput i⟩q}p | ⟨M >>= rput i⟩q)        (spawnAt)
⟨ ⟨M⟩ ⟩p1 | i{⟨ ⟨M⟩ ⟩P}q −→ ⟨ ⟨M⟩ ⟩p2 | i{⟨ ⟨M⟩ ⟩P}q,  if p1, p2 ∈ P                    (migrate)
⟨ ⟨M⟩ ⟩p | i{⟨ ⟨M⟩ ⟩P1}q −→ ⟨ ⟨M⟩ ⟩p | i{⟨ ⟨M⟩ ⟩P2}q,  if p ∈ P1 ∩ P2                   (track)
etc...

47

SLIDE 56

Library versus EDSL?

[Message sequence diagram: supervisor node A, victim node B and thief node C negotiate migration of a spark via FISH, REQ, AUTH, SCHEDULE and ACK messages, moving the spark’s recorded location from OnNode B through InTransition B C to OnNode C.]

i{⟨ ⟨M⟩ ⟩{B}}A | ⟨ ⟨M⟩ ⟩B
   −→ (track)
i{⟨ ⟨M⟩ ⟩{B,C}}A | ⟨ ⟨M⟩ ⟩B
   −→ (migrate)
i{⟨ ⟨M⟩ ⟩{B,C}}A | ⟨ ⟨M⟩ ⟩C
   −→ (track)
i{⟨ ⟨M⟩ ⟩{C}}A | ⟨ ⟨M⟩ ⟩C

48

SLIDE 57

Library versus EDSL?

HdpH-RS domain: scalable fault tolerant parallel computing

  • 1. 3 primitives, 3 types
  • 2. An operational semantics for these primitives
  • domain: task parallelism + fault tolerance
  • 3. A verified scheduler

It is a shallow embedding:

  • primitives implemented in Haskell that return values
  • uses GHC’s frontend, backend and its RTS

Is HdpH-RS “just” a library, or a DSL?

49

SLIDE 58

Library versus EDSL?

Accelerate DSL for parallel array processing

  • GHC frontend: yes
  • GHC code generator backend: no
  • GHC runtime system: no

Has multiple backends from Accelerate AST

  • LLVM IR
  • CUDA

50

SLIDE 59

Language Embeddings

SLIDE 60

Shallow Embeddings: Par monad

  • Abstract data types for the domain
  • Operators over those types
  • In Haskell a monad might be the central construct

newtype Par a
instance Monad Par
data IVar a

runPar :: Par a -> a
spawn  :: NFData a => Par a -> Par (IVar a)
get    :: IVar a -> Par a

  • Shallow embeddings simple to implement
  • no compiler construction
  • Host compiler has no domain knowledge
  • applies host language’s backend to generate machine code

51

SLIDE 61

Shallow Embeddings: Repa

data family Array rep sh e

data instance Array D sh e = ADelayed sh (sh -> e)
data instance Array U sh e = AUnboxed sh (Vector e)

-- types for array representations
data D   -- Delayed
data U   -- Manifest, unboxed

computeP :: (Load rs sh e, Target rt e)
         => Array rs sh e -> Array rt sh e

Ben Lippmeier et al. “Guiding parallel array fusion with indexed types”. In: Proceedings of the 5th ACM SIGPLAN Symposium on Haskell, Haskell 2012, Copenhagen, Denmark, 13 September 2012. 2012, pp. 25–36.

52

SLIDE 62

Shallow Embeddings: Repa

  • function composition on delayed arrays
  • fusion e.g. map/map, permutation, replication, slicing, etc.
  • relies on GHC for code generation
  • makes careful use of GHC’s primops (more next lecture)
  • at the mercy of GHC’s code gen capabilities

53

SLIDE 63

Language and Compiler Embeddings

SLIDE 64

Overview

Let’s look at three approaches:

  • 1. Deeply embedded compilers e.g. Accelerate
  • 2. Compile time metaprogramming e.g. Template Haskell
  • 3. Compiler staging e.g. MetaOCaml, Scala

54

SLIDE 65

Deeply Embedded Compilers

SLIDE 66

Deep Embeddings

  • Deep EDSLs don’t use all of the host language
  • may have their own compiler
  • or runtime system
  • constructs return AST structures, not values
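The last point, constructs returning ASTs rather than values, can be sketched in Python (a hypothetical mini-DSL, not Accelerate): overloaded operators build a tree, and a separate backend decides what to do with it.

```python
# Deep embedding sketch (hypothetical): DSL expressions build an AST;
# evaluation is one of possibly many backends walking that AST.

class Expr:
    def __add__(self, other): return Add(self, to_expr(other))
    def __mul__(self, other): return Mul(self, to_expr(other))

class Lit(Expr):
    def __init__(self, v): self.v = v

class Add(Expr):
    def __init__(self, l, r): self.l, self.r = l, r

class Mul(Expr):
    def __init__(self, l, r): self.l, self.r = l, r

def to_expr(x):
    return x if isinstance(x, Expr) else Lit(x)

def run(e):
    # one backend: a direct interpreter; a compiler backend could emit code instead
    if isinstance(e, Lit): return e.v
    if isinstance(e, Add): return run(e.l) + run(e.r)
    if isinstance(e, Mul): return run(e.l) * run(e.r)

tree = Lit(2) + Lit(3) * Lit(4)  # builds Add(Lit 2, Mul(Lit 3, Lit 4)); computes nothing
print(run(tree))  # 14
```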

55

SLIDE 67

Deep EDSL: Accelerate

dotp :: Vector Float -> Vector Float -> Acc (Scalar Float)
dotp xs ys = let xs' = use xs
                 ys' = use ys
             in fold (+) 0 (zipWith (*) xs' ys')

dotProductGPU xs ys = LLVM.run (dotp xs ys)

Manuel M. T. Chakravarty et al. “Accelerating Haskell array codes with multicore GPUs”. In: DAMP 2011, Austin, TX, USA, January 23, 2011. ACM, 2011, pp. 3–14.

56

SLIDE 68

Deep EDSL: Accelerate

My function:

brightenBy :: Int -> Acc Image -> Acc Image
brightenBy i = map (+ (lift i))

The structure returned:

Map (\x y -> PrimAdd `PrimApp` ...)

57

SLIDE 69

Deep EDSL: Compiling and Executing Accelerate

run :: Arrays a => Acc a -> a
run a = unsafePerformIO (runIO a)

runIO :: Arrays a => Acc a -> IO a
runIO a = withPool defaultTargetPool (\target -> runWithIO target a)

runWithIO :: Arrays a => PTX -> Acc a -> IO a
runWithIO target a = execute
  where
    !acc = convertAcc a
    execute = do
      dumpGraph acc
      evalPTX target $ do
        build <- phase "compile" (compileAcc acc) >>= dumpStats
        exec  <- phase "link"    (linkAcc build)
        res   <- phase "execute" (evalPar (executeAcc exec >>= copyToHostLazy))
        return res

58

SLIDE 70

Leaking Abstractions

SLIDE 71

Where does EDSL stop and host start?

In February 2016 I asked on Halide-dev about my functions:

Image<uint8_t> blurX(Image<uint8_t> image);
Image<uint8_t> blurY(Image<uint8_t> image);
Image<uint8_t> brightenBy(Image<uint8_t> image, float);

Hi Rob, You’ve constructed a library that passes whole images across C++ function call boundaries, so no fusion can happen, and so you’re missing out on all the benefits of Halide. This is a long way away from the usage model of Halide. The tutorials give a better sense of …

On [Halide-dev]: https://lists.csail.mit.edu/pipermail/halide-dev/2016-February/002188.html

59

SLIDE 72

Where does EDSL stop and host start?

Correct solution:

Func blurX(Func image);
Func blurY(Func image);
Func brightenBy(Func image, float);

Reason: Halide is a functional language embedded in C++.

But my program compiled and was executed (slowly). I discovered the error of my ways by:

  • 1. Emailing Halide-dev
  • 2. Reading Halide code examples

Why not a type error?

60

SLIDE 73

Compile Time Metaprogramming

SLIDE 74

Compile time metaprogramming

  • Main disadvantages of embedded compilers
  • cannot access the host language’s optimisations
  • cannot use language constructs requiring host language types e.g. if/then/else
  • Shallow embeddings don’t suffer these problems
  • but inefficient execution performance
  • no domain specific optimisations
  • Compile time metaprogramming transforms user written code to syntactic structures
  • host language -> AST transforms -> host language
  • all happens at compile time

Sean Seefried, Manuel M. T. Chakravarty, and Gabriele Keller. “Optimising Embedded DSLs Using Template Haskell”. In: GPCE 2004, Vancouver, Canada, October 24-28, 2004. Proceedings. Springer, 2004, pp. 186–205.

61

SLIDE 75

Compile time metaprogramming with Template Haskell

For an n × n matrix M, domain knowledge is: M × M⁻¹ = I

The host language does not know this property for matrices. Consider the computation: m * inverse m * n

  • Metaprogramming algorithm:
  • 1. reify code into an AST data structure

exp_mat = [| \m n -> m * inverse m * n |]

  • 2. AST -> AST optimisation for M × M⁻¹ = I
  • 3. reflect AST back into code (also called splicing)

Seefried, Chakravarty, and Keller, “Optimising Embedded DSLs Using Template Haskell”.

62

SLIDE 76

Compile time metaprogramming with Template Haskell

Apply the optimisation:

rmMatByInverse (InfixE (Just 'm) 'GHC.Num.* (Just (AppE 'inverse 'm))) = VarE (mkName "identity")

Pattern match with λp.e

rmMatByInverse (LamE pats exp) = LamE pats (rmMatByInverse exp)

Pattern match with f a

rmMatByInverse (AppE exp exp') = AppE (rmMatByInverse exp) (rmMatByInverse exp')

And the rest

rmMatByInverse exp = exp

63

SLIDE 77

Compile time metaprogramming with Template Haskell

Our computation: \m n -> m * inverse m * n

Reify: exp_mat = [| \m n -> m * inverse m * n |]

Splice this back into the program: $(rmMatByInverse exp_mat)

Becomes: \m n -> n

At compile time.
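The same reify → optimise → reflect loop can be mimicked with Python's standard ast module (a rough runtime analogue of the compile-time Template Haskell version; a further constant-folding pass would reduce 1 * n to n):

```python
import ast

class RmMatByInverse(ast.NodeTransformer):
    """Rewrite  x * inverse(x)  to the constant 1 (standing in for identity I)."""
    def visit_BinOp(self, node):
        self.generic_visit(node)  # optimise subtrees first
        if (isinstance(node.op, ast.Mult)
                and isinstance(node.left, ast.Name)
                and isinstance(node.right, ast.Call)
                and isinstance(node.right.func, ast.Name)
                and node.right.func.id == "inverse"
                and len(node.right.args) == 1
                and isinstance(node.right.args[0], ast.Name)
                and node.right.args[0].id == node.left.id):
            return ast.copy_location(ast.Constant(1), node)
        return node

tree = ast.parse("m * inverse(m) * n", mode="eval")            # reify
optimised = ast.fix_missing_locations(RmMatByInverse().visit(tree))
print(ast.unparse(optimised))  # 1 * n
```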

64

SLIDE 78

Comparison with Deeply Embedded Compiler Approach

Our computation: \m n -> m * inverse m * n Optimised at runtime:

rmMatByInverse :: Exp -> Exp
rmMatByInverse exp@(Multiply (Var x) (Inverse (Var y))) =
  if x == y then Identity else exp
rmMatByInverse (Lambda pats exp) = Lambda pats (rmMatByInverse exp)
rmMatByInverse (App exp exp') = App (rmMatByInverse exp) (rmMatByInverse exp')
rmMatByInverse exp = exp

optimise :: AST -> AST
optimise = .. rmMatByInverse ..

65

SLIDE 79

Deep Compilers vs Metaprogramming

  • Pan: Deeply embedded compiler for image processing
  • ”Compiling embedded languages”
  • PanTHeon: Compile time metaprogramming
  • ”Optimising Embedded DSLs Using Template Haskell”
  • Performance: both sometimes faster/slower
  • Pan aggressively unrolls expressions, PanTHeon doesn’t
  • PanTHeon: cannot profile spliced code (TemplateHaskell)
  • Source lines of code implementation
  • Pan: ~13k
  • PanTHeon: ~4k (code generator + optimisations for free)

Conal Elliott, Sigbjørn Finne, and Oege de Moor. “Compiling embedded languages”. In: J. Funct. Program. 13.3 (2003), pp. 455–481. Seefried, Chakravarty, and Keller, “Optimising Embedded DSLs Using Template Haskell”.

66

SLIDE 80

Staged Compilation

SLIDE 81

Staging

Staged program = conventional program + staging annotations

  • Programmer delays evaluation of program expressions
  • A stage is a code generator that constructs the next stage
  • Generator and generated code are expressed in a single program
  • Partial evaluation
  • performs aggressive constant propagation
  • produces an intermediate program specialised to static inputs
  • Partial evaluation is a form of program specialisation.
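A rough Python analogue of staging (hypothetical helper names, not MetaOCaml): the first stage generates source specialised to the static exponent, and exec plays the role of "run", compiling the generated code.

```python
# Staging sketch (hypothetical): stage one generates code specialised to the
# static input n; "running" the generated code yields the specialised function.

def specialise_power(n):
    expr = "1"
    for _ in range(n):
        expr = f"x * {expr}"          # unrolled: x * x * ... * 1
    src = f"def power_{n}(x):\n    return {expr}\n"
    ns = {}
    exec(src, ns)                     # compile and "run" the generated stage
    return ns[f"power_{n}"]

power2 = specialise_power(2)          # behaves like: lambda x: x * x * 1
print(power2(3))  # 9
```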

67

SLIDE 82

Multi Stage Programming (MSP) with MetaOCaml

  • 1. Brackets (.<..>.) around an expression delay computation

# let a = 1+2;;
val a : int = 3
# let a = .<1+2>.;;
val a : int code = .<1+2>.

  • 2. Escape (.~) splices in delayed values

# let b = .<.~a * .~a >. ;;
val b : int code = .<(1 + 2) * (1 + 2)>.

  • 3. Run (.!) compiles and executes code

# let c = .! b;;
val c : int = 9

Walid Taha. “A Gentle Introduction to Multi-stage Programming”. In: Domain-Specific Program Generation, Dagstuhl Castle, Germany, Revised Papers. Springer, 2003, pp. 30–50.

68

SLIDE 83

MetaOCaml Example

let rec power (n, x) =
  match n with
    0 -> 1
  | n -> x * (power (n-1, x));;

let power2 = fun x -> power (2,x);;

(* power2 3                  *)
(* => power (2,3)            *)
(* => 3 * power (1,3)        *)
(* => 3 * (3 * power (0,3))  *)
(* => 3 * (3 * 1)            *)
(* => 9                      *)

let my_fast_power2 = fun x -> x*x*1;;

69

SLIDE 84

MetaOCaml Example: Specialising Code

let rec power (n, x) =
  match n with
    0 -> .<1>.
  | n -> .<.~x * .~(power (n-1, x))>.;;

  • this returns code of type integer, not integer
  • bracket around multiplication returns code of type integer
  • escape of power splices in more code

let power2 = .! .<fun x -> .~(power (2,.<x>.))>.;;

behaves just like:

fun x -> x*x*1;;

We can keep specialising power

let power3 = .! .<fun x -> .~(power (3,.<x>.))>.;;
let power4 = .! .<fun x -> .~(power (4,.<x>.))>.;;

70

SLIDE 85

MetaOCaml Example: Arithmetic Staged Interpreter

let rec eval1 e env fenv =
  match e with
    Int i -> i
  | Var s -> env s
  | App (s,e2) -> (fenv s) (eval1 e2 env fenv)
  | Add (e1,e2) -> (eval1 e1 env fenv) + (eval1 e2 env fenv)
  | Sub (e1,e2) -> (eval1 e1 env fenv) - (eval1 e2 env fenv)
  | Mul (e1,e2) -> (eval1 e1 env fenv) * (eval1 e2 env fenv)
  | Div (e1,e2) -> (eval1 e1 env fenv) / (eval1 e2 env fenv)
  | Ifz (e1,e2,e3) -> if (eval1 e1 env fenv) = 0
                      then (eval1 e2 env fenv)
                      else (eval1 e3 env fenv)

Taha, “A Gentle Introduction to Multi-stage Programming”.

71

SLIDE 86

MetaOCaml Example: Arithmetic Staged Interpreter

fact (x) { return (x * fact (x-1)); }

Build a lexer/parser to construct the AST:

Program ([Declaration
  ("fact", "x",
    Ifz (Var "x",
         Int 1,
         Mul (Var "x", App ("fact", Sub (Var "x", Int 1)))))
  ], App ("fact", Int 10))

  • Interpreter 20 times slower than fact(20) in OCaml :(

72

SLIDE 87

MetaOCaml Example: Arithmetic Staged Interpreter

let rec eval2 e env fenv =
  match e with
    Int i -> .<i>.
  | Var s -> env s
  | App (s,e2) -> .<.~(fenv s) .~(eval2 e2 env fenv)>.
  ...
  | Div (e1,e2) -> .<.~(eval2 e1 env fenv) / .~(eval2 e2 env fenv)>.
  | Ifz (e1,e2,e3) -> .<if .~(eval2 e1 env fenv) = 0
                        then .~(eval2 e2 env fenv)
                        else .~(eval2 e3 env fenv)>.

(* fact(10) same as OCaml, but we didn't write it by hand! *)
.<let rec f = fun x -> if x = 0 then 1 else x * (f (x - 1)) in (f 10)>.

Taha, “A Gentle Introduction to Multi-stage Programming”.

73

SLIDE 88

MetaOCaml Example: QBF Staged Interpreter

A DSL for quantified boolean logic (QBF)

type bexp = True | False
          | And of bexp * bexp
          | Or of bexp * bexp
          | Not of bexp
          | Implies of bexp * bexp
          (* forall x. x and not x *)
          | Forall of string * bexp
          | Var of string

∀p.T ⇒ p

Forall ("p", Implies(True, Var "p"))

Krzysztof Czarnecki et al. “DSL Implementation in MetaOCaml, Template Haskell, and C++”. In: Domain-Specific Program Generation, Dagstuhl Castle, Germany, March 2003, Revised Papers. Springer, 2003, pp. 51–72.

74

SLIDE 89

MetaOCaml Example: QBF Staged Interpreter

let rec eval b env =
  match b with
    True -> true
  | False -> false
  | And (b1,b2) -> (eval b1 env) && (eval b2 env)
  | Or (b1,b2) -> (eval b1 env) || (eval b2 env)
  | Not b1 -> not (eval b1 env)
  | Implies (b1,b2) -> eval (Or(b2,And(Not(b2),Not(b1)))) env
  | Forall (x,b1) ->
      let trywith bv = (eval b1 (ext env x bv))
      in (trywith true) && (trywith false)
  | Var x -> env x

eval (parse "forall x. x and not x");;

  • Staging separates 2 phases of computation
  • 1. traversing a program
  • 2. evaluating a program

75

SLIDE 90

MetaOCaml Example: QBF Staged Interpreter

let rec eval' b env =
  match b with
    True -> .<true>.
  | False -> .<false>.
  | And (b1,b2) -> .< .~(eval' b1 env) && .~(eval' b2 env) >.
  | Or (b1,b2) -> .< .~(eval' b1 env) || .~(eval' b2 env) >.
  | Not b1 -> .< not .~(eval' b1 env) >.
  | Implies (b1,b2) -> .< .~(eval' (Or(b2,And(Not(b2),Not(b1)))) env) >.
  | Forall (x,b1) ->
      .< let trywith bv = .~(eval' b1 (ext env x .<bv>.))
         in (trywith true) && (trywith false) >.
  | Var x -> env x

# let a = eval' (Forall ("p", Implies(True, Var "p"))) env0;;
a : bool code =
  .<let trywith = fun bv -> (bv || ((not bv) && (not true))) in
    ((trywith true) && (trywith false))>.
# .! a;;
- : bool = false

76

slide-91
SLIDE 91

Metaprogramming: MetaOCaml versus Template Haskell

MetaOCaml (staged interpreter)    Template Haskell (templates)
.<E>.  (bracket)                  [| E |]  (quotation)
.~     (escape)                   $s       (splice)
.<t>.  (type for staged code)     Q Exp    (quoted values)
.!     (run)                      none

  • Template Haskell allows inspection of quoted values, so it can alter code’s semantics before the code reaches the compiler

  • Template Haskell: compile-time code generation, no runtime overhead
  • MetaOCaml: runtime code generation, some runtime overhead
  • speedups possible when dynamic variables become static values: the incremental compiler optimises away condition checks, specialises functions, etc.

77

slide-92
SLIDE 92

Lightweight Modular Staging (LMS) in Scala

  • Programming abstractions used during code generation, not reflected in generated code

  • L = lightweight, just a library
  • M = modular, easy to extend
  • S = staging
  • Types distinguish when expressions are evaluated
  • ”execute now” has type: T
  • ”execute later” (delayed) has type: Rep[T]

78

slide-93
SLIDE 93

Lightweight Modular Staging (LMS) in Scala

Scala:

def power(b: Double, p: Int): Double =
  if (p == 0) 1.0
  else b * power(b, p - 1)

Scala LMS:

def power(b: Rep[Double], p: Int): Rep[Double] =
  if (p == 0) 1.0
  else b * power(b, p - 1)

power(x, 5)

def apply(x1: Double): Double = {
  val x2 = x1 * x1
  val x3 = x1 * x2
  val x4 = x1 * x3
  val x5 = x1 * x4
  x5
}

Alexey Rodriguez Blanter et al. Concepts of Programming Design: Scala and Lightweight Modular Staging (LMS). http://www.cs.uu.nl/docs/vakken/mcpd/slides/slides-scala-lms.pdf.

79
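The effect of Rep[Double] can be mimicked in plain Haskell with a deep expression type: the exponent is an ordinary static Int, while the base is an expression value, so the recursion unfolds entirely while the expression is built, leaving straight-line "generated code". A sketch only; the Expr type and render function are invented for illustration, not LMS's API.

```haskell
-- A deep expression type standing in for LMS's Rep[Double] (hypothetical).
data Expr = Var String | Lit Double | Mul Expr Expr
  deriving Show

-- The exponent p is static, so this recursion runs at generation time:
-- no loop or conditional survives in the result.
power :: Expr -> Int -> Expr
power _ 0 = Lit 1.0
power b p = Mul b (power b (p - 1))

-- "Code generation": print the residual straight-line expression.
render :: Expr -> String
render (Var x)   = x
render (Lit n)   = show n
render (Mul a b) = "(" ++ render a ++ " * " ++ render b ++ ")"

main :: IO ()
main = putStrLn (render (power (Var "x") 3))
-- prints (x * (x * (x * 1.0)))
```
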

slide-94
SLIDE 94

Lightweight Modular Staging (LMS) in Scala

def power(b: Rep[Double], p: Int): Rep[Double] = {
  def loop(x: Rep[Double], ac: Rep[Double], y: Int): Rep[Double] = {
    if (y == 0) ac
    else if (y % 2 == 0) loop(x * x, ac, y / 2)
    else loop(x, ac * x, y - 1)
  }
  loop(b, 1.0, p)
}

power(x, 11)

def apply(x1: Double): Double = {
  val x2  = x1 * x1   // x * x
  val x3  = x1 * x2   // ac * x
  val x4  = x2 * x2   // x * x
  val x8  = x4 * x4   // x * x
  val x11 = x3 * x8   // ac * x
  x11
}

80

slide-95
SLIDE 95

LMS in Practice: Delite

  • Delite: compiler framework and runtime for parallel EDSLs
  • Scala success story: Delite uses LMS for high performance
  • Successful DSLs developed with Delite
  • OptiML: Machine Learning and linear algebra
  • OptiQL: Collection and query operations
  • OptiMesh: Mesh-based PDE solvers
  • OptiGraph: Graph analysis

81

slide-96
SLIDE 96

Summary

Approach                Host frontend   Host backend   Optimise via
Embedded compiler       yes             no             traditional compiler opts
Staged compiler         no              yes            MP: delayed expressions
Ext. metaprogramming    yes             yes            MP: transformation

(MP = metaprogramming)

  • Embedded compilers: Accelerate (Haskell)
  • Extensional metaprogramming: Template Haskell
  • Staged compilers: MetaOCaml, Scala LMS

Seefried, Chakravarty, and Keller, “Optimising Embedded DSLs Using Template Haskell”.

82

slide-97
SLIDE 97

Haskell Take on DSLs

slide-98
SLIDE 98

haskell-cafe mailing list

Subject: [Haskell-cafe] What *is* a DSL?
From: Günther Schmidt <gue.schmidt () web ! de>
Date: 2009-10-07 15:10:58

Hi all,

for people that have followed my posts on the DSL subject this question probably will seem strange, especially asking it now. Because out there I see quite a lot of stuff that is labeled as DSL, I mean for example packages on hackage, quite useful ones too, where I don't see the split of assembling an expression tree from evaluating it, to me that seems more like combinator libraries. Thus: What is a DSL?

Günther

83

slide-99
SLIDE 99

haskell-cafe mailing list

84

slide-100
SLIDE 100

haskell-cafe mailing list

A DSL is just a domain-specific language. It doesn’t imply any specific implementation technique. A shallow embedding of a DSL is when the ”evaluation” is done immediately by the functions and combinators of the DSL. I don’t think it’s possible to draw a line between a combinator library and a shallowly embedded DSL. A deep embedding is when interpretation is done on an intermediate data structure. – Emil Axelsson, Chalmers University.

85
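Axelsson's distinction can be made concrete with a toy arithmetic DSL embedded both ways. The Expr type and function names below are hypothetical illustrations, not from the mailing-list thread:

```haskell
-- Shallow: a term *is* its meaning; "evaluation" happens immediately.
type ShallowExpr = Int

sLit :: Int -> ShallowExpr
sLit = id

sAdd :: ShallowExpr -> ShallowExpr -> ShallowExpr
sAdd = (+)

-- Deep: terms build an intermediate data structure, which a separate
-- interpreter (or pretty-printer, optimiser, ...) consumes later.
data Expr = Lit Int | Add Expr Expr
  deriving Show

eval :: Expr -> Int
eval (Lit n)     = n
eval (Add e1 e2) = eval e1 + eval e2

render :: Expr -> String
render (Lit n)     = show n
render (Add e1 e2) = "(" ++ render e1 ++ " + " ++ render e2 ++ ")"

main :: IO ()
main = do
  print (sAdd (sLit 1) (sLit 2))     -- shallow: already just 3
  let e = Add (Lit 1) (Lit 2)
  print (eval e)                     -- deep, first interpretation
  putStrLn (render e)                -- deep, second interpretation
```

The deep version pays for the intermediate tree, but gains multiple interpretations of one program, which is exactly what a shallow combinator library cannot offer.
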

slide-101
SLIDE 101

haskell-cafe mailing list

I’ve argued that every monad gives a DSL. They all have the same syntax – do-notation – but each choice of monad gives quite different semantics for this notation. – Dan Piponi

86
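Piponi's point is easy to demonstrate: one do-notation "program", read under two different monads. The pairUp name is an illustrative invention:

```haskell
-- The same do-notation program, generic in the monad.
pairUp :: Monad m => m Int -> m Int -> m (Int, Int)
pairUp mx my = do
  x <- mx
  y <- my
  return (x, y)

main :: IO ()
main = do
  -- Maybe monad: failure-propagating semantics.
  print (pairUp (Just 1) (Just 2))
  print (pairUp Nothing  (Just 2))
  -- List monad: nondeterministic, all-combinations semantics.
  print (pairUp [1, 2] [10, 20])
```
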

slide-102
SLIDE 102

haskell-cafe mailing list

I’ve informally argued that a true DSL – separate from a good API – should have semantic characteristics of a language: binding forms, control structures, abstraction, composition. Some have type systems. Basic DSLs may only have a few characteristics of languages though – a (partial) grammar. That’s closer to a well-defined API in my books. – Don Stewart

87

slide-103
SLIDE 103

haskell-cafe mailing list

Parsec, like most other parser combinator libraries, is a shallowly embedded DSL… a Haskell function that does parsing, i.e. a function of type String -> Maybe (String, a). You can’t analyse it further—you can’t transform it into another grammar to optimise it or print it out—because the information about what things it accepts has been locked up into a non-analysable Haskell function. The only thing you can do with it is feed it input and see what happens. – Bob Atkey

88
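A minimal sketch makes Atkey's point concrete: with the parser-as-function type he quotes, we can run parsers but never inspect them. The Parser, char and andThen names below are illustrative, not Parsec's API:

```haskell
-- A shallow parser is just a function: String -> Maybe (String, a).
newtype Parser a = Parser { runParser :: String -> Maybe (String, a) }

-- Accept exactly the character c.
char :: Char -> Parser Char
char c = Parser $ \s -> case s of
  (x:rest) | x == c -> Just (rest, x)
  _                 -> Nothing

-- Sequence two parsers; the Maybe monad threads failure.
andThen :: Parser a -> Parser b -> Parser (a, b)
andThen p q = Parser $ \s -> do
  (s',  a) <- runParser p s
  (s'', b) <- runParser q s'
  Just (s'', (a, b))

-- There is no AST to optimise or pretty-print: the grammar is locked
-- inside an opaque function. All we can do is feed it input.
main :: IO ()
main = print (runParser (char 'a' `andThen` char 'b') "abc")
```
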

slide-104
SLIDE 104

Embeddings in Haskell

slide-105
SLIDE 105

Embeddings with Haskell

  • GHC gives us
  • frontend: syntax & type checking
  • interpreter: test components and small programs
  • Haskell EDSLs often rely on
  • higher-order functions
  • type class overloading
  • monads
  • Choices
  • 1. functions directly capture the semantics of the language (shallow)
  • 2. based on the abstract syntax of the EDSL program (deep)
  • multiple interpretations, e.g. acceleration, visualisation…

89

slide-106
SLIDE 106

Shallow Embeddings

90

slide-107
SLIDE 107

Compile Time Metaprogramming

91

slide-108
SLIDE 108

Three case studies

  • 1. Repa: array processing
  • 2. Accelerate: array processing
  • strict evaluation semantics (host language is lazy)
  • 3. Lava: circuit description

92

slide-109
SLIDE 109

Array Processing: Repa

slide-110
SLIDE 110

Haskell Embeddings

93

slide-111
SLIDE 111

Parallel Shallow Embedding

94

slide-112
SLIDE 112

Repa Language

data family Array rep sh e
data instance Array D sh e = ADelayed sh (sh -> e)
data instance Array U sh e = AUnboxed sh (Vector e)

-- types for array representations
data D   -- Delayed
data U   -- Manifest, unboxed

computeP :: (Load rs sh e, Target rt e)
         => Array rs sh e -> Array rt sh e

Lippmeier et al., “Guiding parallel array fusion with indexed types”.

95

slide-113
SLIDE 113

Repa Example

type Image a = Array U DIM2 a

gradientX :: Image Float -> IO (Image Float)
gradientX img = computeP
  $ forStencil2 BoundClamp img
    [stencil2| -1 0 1
               -2 0 2
               -1 0 1 |]

gradientY :: Image Float -> IO (Image Float)
gradientY img = computeP
  $ forStencil2 BoundClamp img
    [stencil2|  1  2  1
                0  0  0
               -1 -2 -1 |]

gradMagnitude :: Float -> Image Float -> Image Float
              -> IO (Image (Float, Word8))
gradMagnitude threshLow dX dY = computeP $ R.zipWith mag dX dY
  where mag = ...

96

slide-114
SLIDE 114

Repa Example

readImage :: String -> IO Image
saveImage :: Image -> String -> IO ()

main = do
  image1 <- readImage "input.png"
  image2 <- gradientX image1
  image3 <- gradientY image1
  image4 <- gradMagnitude thresh image2 image3
  saveImage image4 "output.png"

  • Each computeP call uses static scheduler
  • assumes well balanced regular parallelism
  • Monadic interface sequences parallel ”gang” schedulers
  • avoid: cache contention, overloading OS scheduler

Lippmeier et al., “Guiding parallel array fusion with indexed types”.

97

slide-115
SLIDE 115

Repa Parallelism: Use Multithreaded GHC

-- 'n' is the number of threads to use
forkGang :: Int -> IO Gang
forkGang n = ...
    zipWithM_ forkOn [0..]   -- create worker threads
      $ zipWith3 gangWorker [0 .. n-1] mvsRequest mvsDone

gangWorker :: Int -> MVar Req -> MVar () -> IO ()
gangWorker threadId varRequest varDone = do
    -- Wait for a request
    req <- takeMVar varRequest
    case req of
      ReqDo action
        -> do -- Run the action we were given.
              action threadId
              ...

98

slide-116
SLIDE 116

Array Processing: Accelerate

slide-117
SLIDE 117

Haskell Embeddings

99

slide-118
SLIDE 118

Parallel Deep Embedding

100

slide-119
SLIDE 119

Deep Embeddings with Haskell

Andy Gill. “Domain-specific languages and code synthesis using Haskell”. In: Commun. ACM 57.6 (2014), pp. 42–49.

101

slide-120
SLIDE 120

Accelerate Language Surface AST

map :: (Shape sh, Elt a, Elt b)
    => (Exp a -> Exp b)
    -> Acc (Array sh a)
    -> Acc (Array sh b)

zipWith :: (Shape sh, Elt a, Elt b, Elt c)
        => (Exp a -> Exp b -> Exp c)
        -> Acc (Array sh a)
        -> Acc (Array sh b)
        -> Acc (Array sh c)

stencil :: (Stencil sh a stencil, Elt b)
        => (stencil -> Exp b)
        -> Boundary (Array sh a)
        -> Acc (Array sh a)
        -> Acc (Array sh b)

-- slice, fold, backpermute, ...

102

slide-121
SLIDE 121

Accelerate Literature

Material from

  • Trevor McDonell’s PhD thesis
  • Email exchanges with Trevor

Trevor L. McDonell. “Optimising Purely Functional GPU Programs”. PhD thesis. University of New South Wales, Sydney, Australia, 2015.

103

slide-122
SLIDE 122

Accelerate

  • User programs generate CUDA/LLVM programs at runtime

dotp :: Num a => Vector a -> Vector a -> Acc (Scalar a)
dotp xs ys = let xs' = use xs
                 ys' = use ys
             in
             fold (+) 0 (zipWith (*) xs' ys')

  • Acc is an Accelerate program that will produce a value of type a
  • the run function generates code, compiles it, and executes it

run :: Arrays a => Acc a -> a

104

slide-123
SLIDE 123

Accelerate Language

McDonell, “Optimising Purely Functional GPU Programs”.

105

slide-124
SLIDE 124

Compiling and Executing Accelerate

  • Skeletons build trees to represent array computations
  • GADTs preserve embedded program’s type info in term tree
  • Smart constructors

Data.Array.Accelerate.Language

map     = Acc $$  Map
zipWith = Acc $$$ ZipWith
fold    = Acc $$$ Fold
...

  • Internal conversion from HOAS to de Bruijn representation enables program transformations and recovers sharing

-- convert array expression to de Bruijn form,
-- incorporating sharing information
convertAcc :: Arrays arrs => Acc arrs -> AST.Acc arrs

106

slide-125
SLIDE 125

Accelerate Internal IR

dotp :: Num a => Vector a -> Vector a -> Acc (Scalar a)
dotp xs ys = let xs' = use xs
                 ys' = use ys
             in
             fold (+) 0 (zipWith (*) xs' ys')

Becomes:

Fold add (Const 0) (ZipWith mul xs' ys')
  where
    add = Lam (Lam (Body (
            PrimAdd (FloatingNumType (TypeFloat FloatingDict))
              `PrimApp`
            Tuple (NilTup `SnocTup` (Var (SuccIdx ZeroIdx))
                          `SnocTup` (Var ZeroIdx)))))
    mul = ...   -- same as add, but using PrimMul

The generated IR is optimised (e.g. fusion) then compiled to object code, which is linked at runtime and executed

107

slide-126
SLIDE 126

Skeleton Code Templates: Map (CUDA)

[cunit|
  $esc:("#include <accelerate_cuda.h>")
  $edecls:texIn
  extern "C" __global__ void map
  (
      // types of the elements of the input/output arrays
      $params:argIn,
      $params:argOut
  )
  {
      const int shapeSize = size(shOut);
      const int gridSize  = $exp:(gridSize dev);
      int ix;

      for ( ix = $exp:(threadIdx dev)
          ; ix < shapeSize
          ; ix += gridSize )
      {
          // gets input array element from index
          $items:(dce x .=. get ix)
          // scalar operation per element
          $items:(setOut "ix" .=. f x)
      }
  }
|]

Listing 4.1 from McDonell’s PhD thesis.

108

slide-127
SLIDE 127

Skeleton Code Templates

  • Accelerate is now LLVM based (not CUDA)
  • But same template skeleton idea
  • Parallel code structure defined by skeleton templates
  • Types & user-defined functions added to template during code gen
  • Doesn’t use Template Haskell’s quasiquotation
  • Instead uses Haskell LLVM library API

Trevor L. McDonell et al. “Type-safe runtime code generation: Accelerate to LLVM”. In: Proceedings of the 8th ACM SIGPLAN Symposium on Haskell, Vancouver, BC, Canada, September 3-4, 2015. ACM, 2015, pp. 201–212.

109

slide-128
SLIDE 128

Skeleton Code Templates: Map (LLVM)

mkMap aenv apply =
  let (arrOut, paramOut) = mutableArray @sh "out"
      (arrIn,  paramIn)  = mutableArray @sh "in"
      paramEnv           = envParam aenv
  in
  makeOpenAcc "map" (paramOut ++ paramIn ++ paramEnv) $ do
    start <- return (lift 0)
    end   <- shapeSize (irArrayShape arrIn)
    imapFromTo start end $ \i -> do
      xs <- readArray arrIn i
      ys <- app1 apply xs
      writeArray arrOut i ys
    return_

-- from 'accelerate-llvm' package
imapFromTo :: IR Int -> IR Int
           -> (IR Int -> CodeGen Native ()) -> CodeGen Native ()

110

slide-129
SLIDE 129

LLVM Backend

run :: Arrays a => Acc a -> a
run a = unsafePerformIO (runIO a)

runIO :: Arrays a => Acc a -> IO a
runIO a = withPool defaultTargetPool (\target -> runWithIO target a)

runWithIO :: Arrays a => PTX -> Acc a -> IO a
runWithIO target a = execute
  where
    !acc    = convertAcc a
    execute = do
      dumpGraph acc
      evalPTX target $ do
        build <- phase "compile" (compileAcc acc) >>= dumpStats
        exec  <- phase "link"    (linkAcc build)
        res   <- phase "execute" (evalPar (executeAcc exec >>= copyToHostLazy))
        return res

111

slide-130
SLIDE 130

Comparing Accelerate and Repa

slide-131
SLIDE 131

Comparing Accelerate and Repa

Same goals:

  • Collective operations on regular multidimensional arrays
  • Non-nested, flat data-parallelism
  • Embed in Haskell

Achieve these goals in very different ways:

  • Repa uses type-indexed array representations to help GHC generate better code
  • Accelerate avoids GHC’s code generation altogether

112

slide-132
SLIDE 132

Performance

McDonell et al., “Type-safe runtime code generation: Accelerate to LLVM”.

113

slide-133
SLIDE 133

Benefits of Accelerate’s Deep Embedding

Things you can do with an Accelerate program:

  • 1. Pretty print it
  • 2. Interpret it
  • 3. Generate & execute CUDA for GPUs
  • 4. Generate & execute LLVM for CPUs/GPUs
  • 5. Visualise program graph with GraphViz

114

slide-134
SLIDE 134

Accelerate Arrays and Functions

arr1 :: Acc (Array DIM2 Int)
arr1 = A.use $ A.fromList (Z :. 3 :. 3) [1..9]

arr2 :: Acc (Array DIM2 Int)
arr2 = A.use $ A.fromList (Z :. 3 :. 3) [10..19]

f :: Acc (Array DIM2 Int) -> Acc (Array DIM2 Int)
f = A.map (+2) . A.map (+1)

g :: Acc (Array DIM2 Int) -> Acc (Array DIM2 Int)
g = A.transpose

115

slide-135
SLIDE 135

Pretty Print It

let program = A.zip (f arr1) (g arr2)
print program   -- show it

let a0 = use (Array (Z :. 3 :. 3) [1,2,3,4,5,6,7,8,9]) in
let a1 = use (Array (Z :. 3 :. 3) [10,11,12,13,14,15,16,17,18]) in
generate
  (intersect (shape a0)
             (let x0 = shape a1
              in Z :. indexHead x0 :. indexHead (indexTail x0)))
  (\x0 -> (2 + (1 + (a0!x0)),
           a1!Z :. indexHead x0 :. indexHead (indexTail x0)))

116

slide-136
SLIDE 136

Run It

let program = A.zip (f arr1) (g arr2)
print (A.run program)   -- run it

Matrix (Z :. 3 :. 3)
  [ (4,10),  (5,13),  (6,16),
    (7,11),  (8,14),  (9,17),
    (10,12), (11,15), (12,18)]

117

slide-137
SLIDE 137

Comparison of Profiling Tooling

slide-138
SLIDE 138

Repa Profiling

  • Repa uses GHC runtime system
  • Threadscope for profiling GHC generated parallel code
  • Hence: Repa can inherit Threadscope profiling tool

118

slide-139
SLIDE 139

Accelerate Profiling

  • Accelerate doesn’t generate parallel code via GHC
  • Doesn’t have access to GHC tools e.g. Threadscope
  • Use NVIDIA’s GPU profiling tools instead

Figure 4.2 from McDonell’s thesis.

McDonell, “Optimising Purely Functional GPU Programs”.

119

slide-140
SLIDE 140

Implementation Considerations

slide-141
SLIDE 141

Repa Implementation Considerations

  • Good: GHC has good multicore/concurrency support
  • Good: less engineering effort, by reusing GHC code generation
  • Questionable: at the mercy of GHC code generation

Question: Can GHC Core be relied on for producing efficient high-performance numerical code, e.g. inlining and constant propagation for aggressive array fusion? GHC Core is a System F language, not an array processing IR.

120

slide-142
SLIDE 142

Accelerate Implementation Considerations

  • Obscure LLVM code might rule out LLVM optimisations
  • Generate LLVM IR that LLVM will optimise well
  • usually equates to “simple LLVM”
  • Accelerate tells LLVM exactly which CPU is being used
  • Ask LLVM to vectorise for this CPU
  • Don’t generate SIMD instructions
  • Rely on LLVM auto-vectorisation
  • vectorisation will happen if LLVM thinks it is beneficial
  • Accelerate produces code it knows LLVM can vectorise well

121

slide-143
SLIDE 143

Another Domain: Circuit Description

slide-144
SLIDE 144

Deep EDSL: Lava

  • Strongly typed EDSL for describing hardware circuits
  • Deeply embedded
  • test circuit designs with GHCi (host language interpreter)
  • generate VHDL to synthesise circuits to hardware

Example from Andy Gill’s ACM Communications paper.

Gill, “Domain-specific languages and code synthesis using Haskell”.

122

slide-145
SLIDE 145

Counting Pulses Schematic

counter :: (Rep a, Num a) => Signal Bool -> Signal Bool -> Signal a
counter restart inc = loop
  where
    reg  = register 0 loop
    reg' = mux2 restart (0, reg)
    loop = mux2 inc (reg' + 1, reg')

123

slide-146
SLIDE 146

Counting Pulses

Simulate with GHCi (shallow, host language interpreter):

GHCi> counter low (toSeq (cycle [True,False,False]))
1 : 1 : 1 : 2 : 2 : 2 : 3 : 3 : 3 : ...

Reify counter to deep embedding:

GHCi> reify (counter (Var "restart") (Var "inc"))
[(0,MUX2 1 (2,3)),
 (1,VAR "inc"),
 (2,ADD 3 4),
 (3,MUX2 5 (6,7)),
 (4,LIT 1),
 (5,VAR "restart"),
 (6,LIT 0),
 (7,REGISTER 0 0)]

124
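The shallow simulation that GHCi performs can be approximated with plain lazy lists: a signal is an infinite stream of per-cycle values. This is only a sketch; Signal, register and mux2 here are illustrative stand-ins for Lava's real types (in particular, Lava overloads numeric literals as signals, which this sketch writes out explicitly with repeat and map).

```haskell
-- A signal is an infinite stream of values, one per clock cycle.
type Signal a = [a]

-- A register delays its input by one cycle, starting from an initial value.
register :: a -> Signal a -> Signal a
register initial xs = initial : xs

-- Two-way multiplexer, selected per cycle.
mux2 :: Signal Bool -> (Signal a, Signal a) -> Signal a
mux2 sel (t, f) = zipWith3 (\s a b -> if s then a else b) sel t f

-- The slide's pulse counter, with literals lifted to streams explicitly.
counter :: Num a => Signal Bool -> Signal Bool -> Signal a
counter restart inc = loop
  where
    reg  = register 0 loop
    reg' = mux2 restart (repeat 0, reg)
    loop = mux2 inc (map (+1) reg', reg')

main :: IO ()
main = print (take 9 (counter (repeat False) (cycle [True, False, False])))
-- [1,1,1,2,2,2,3,3,3], matching the GHCi trace above
```

The feedback loop works because register delays by one cycle, so the lazy knot between reg and loop is productive.
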

slide-147
SLIDE 147

Deep: Translating Reified AST to VHDL

architecture str of counter is
  signal sig_2_o0 : std_logic_vector(3 downto 0);
  ...
begin
  sig_2_o0 <= sig_4_o0 when (inc = '1') else sig_6_o0;
  sig_5_o0 <= std_logic_vector(...);
  sig_6_o0 <= "0000" when (restart = '1') else sig_10_o0;
  sig_10_o0_next <= sig_2_o0;

  proc14 : process(rst, clk) is
  begin
    if rst = '1' then
      sig_10_o0 <= "0000";
    elsif rising_edge(clk) then
      if (clk_en = '1') then
        sig_10_o0 <= sig_10_o0_next;
  ...
end architecture;

125

slide-148
SLIDE 148

Haskell EDSLs Summary

slide-149
SLIDE 149

Haskell EDSLs Summary

Approach   domain specific opts    host opts   language      examples
shallow    yes (rewrite rules)     yes         host          Repa, HdpH-RS
deep       yes (runtime)           no          host          Accelerate, Lava
MP         yes (compile time)      yes         quasiquotes   PanTHeon

(MP = metaprogramming)

126

slide-150
SLIDE 150

Conclusions

slide-151
SLIDE 151

Conclusions

  • DSL: notation that captures domain semantics
  • Why DSLs?
  • AVOPT: Analysis, Verification (ComMA), Optimisation, Parallelisation (HdpH-RS, Accelerate) and Transformation
  • Compositionality (Frenetic), performance and productivity (Halide), correctness (Ivory)
  • Drawbacks
  • engineering effort, incoherent designs
  • poor implementation choice from a plethora of options
  • unenforced boundaries between EDSL and host language
  • Implementation choices
  • Internal or external
  • Shallowly embed the language (Repa), deeply embed a compiler (Accelerate), compile-time metaprogramming (Template Haskell), staged metaprogramming (MetaOCaml, Scala LMS)

127