SLIDE 1

Parallel functional programming

“Parallelism from a functional angle” Thanks to Lennart Edblom (deputy head of department) for the material.

Basic Idea

”The main task of a functional programmer should be to specify what has to be evaluated in parallel, and not how the parallel evaluation has to be organized.”

Main goal: speedup (as always).

Functional Languages

Why functional languages??

  • Easier to partition a parallel program as tasks to evaluate
  • Simple communication model (data dependence)

– The rest is hidden (in general) from the user

  • Determinism:

– Suppose the sequential program is correct
– Deadlock cannot occur
– The result is independent of the scheduling

  • Simpler debugging (= sequential program)
  • Easy to utilize parallel language constructs at a high level

Is this really always true??? We will try to find out….

Problems

Operational aspects (not the semantics)

  • Performance monitoring
  • Cost modeling
  • Locality
SLIDE 2

A basic concept…

"Pure" functional languages possess the Church-Rosser property: independent subexpressions can be evaluated in any order (sequentially or in parallel). The result will be the same (except for memory management)!

  • Only data dependencies control the execution order
  • No side effects

…and a question of definition

Parallelism – several processes solve parts of a joint problem. Aim: speedup.
Concurrency – independent processes cooperate; deadlock is possible, often non-deterministic, explicit communication. Aim: better structure, higher level of abstraction.

Simple Partitioning

Basic idea: every computation needed to produce the final result can be executed as a separate task (in parallel). (No side effects; only data dependencies control the order.)

    parallel x = (f1 x, f2 x)
    f1 y = y + 1
    f2 z = z * 3

(f1 x) and (f2 x) may be evaluated in parallel. But first x must have a value. Data dependence!

    par_g x = g (f1 x) (f2 x)
    g a b = a + b

g's arguments may be evaluated in parallel… before the evaluation of g can start (strict languages!).
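The slide's language is unspecified; in GHC Haskell (assuming the `parallel` package and a `-threaded` build) the same intent can be sketched with the `par` and `pseq` combinators:

```haskell
import Control.Parallel (par, pseq)

f1 :: Int -> Int
f1 y = y + 1

f2 :: Int -> Int
f2 z = z * 3

-- Spark the evaluation of (f1 x) while this thread computes (f2 x);
-- both need x first, which is exactly the data dependence on the slide.
parallelPair :: Int -> (Int, Int)
parallelPair x = a `par` (b `pseq` (a, b))
  where
    a = f1 x
    b = f2 x

main :: IO ()
main = print (parallelPair 5)  -- (6,15)
```

Without parallel hardware the program still computes the same pair sequentially; `par` is only a hint, not a semantic change.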

Independent Tasks are Evaluated in Parallel

Example: compute Fibonacci numbers recursively. The two recursive calls are evaluated in parallel, recursively.

fun nfib n = if n <= 1 then 1 else 1 + nfib(n-1) + nfib(n-2)
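A hedged Haskell sketch of the same divide-and-conquer parallelism (assuming the `parallel` package; the name `nfibPar` is mine):

```haskell
import Control.Parallel (par, pseq)

-- Sequential nfib, as on the slide (it counts the number of calls).
nfib :: Int -> Int
nfib n = if n <= 1 then 1 else 1 + nfib (n - 1) + nfib (n - 2)

-- Parallel variant: spark one recursive call, evaluate the other locally.
-- Without a cut-off this creates very fine-grained tasks -- exactly the
-- granularity problem discussed later in the deck.
nfibPar :: Int -> Int
nfibPar n
  | n <= 1    = 1
  | otherwise = a `par` (b `pseq` (1 + a + b))
  where
    a = nfibPar (n - 1)
    b = nfibPar (n - 2)
```

Both versions return the same values, e.g. `nfib 10 == 177`; determinism is preserved because only the evaluation order changes.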

slide-3
SLIDE 3

Simple Communication Model

Data dependence = communication channel. "Simple" debugging: the same program executes sequentially or in parallel; communication, scheduling etc. do not have to be considered when debugging. No deadlocks may be introduced by parallelization! (But an erroneous (sequential) program is of course erroneous in a parallel execution.) Performance problems still have to be examined with real tests (or simulations).

Language Issues – Design

Redex – an expression (often a function application) that can be evaluated.

There are two main classes of functional languages: strict vs non-strict.

  • Strict language – all arguments are evaluated (possibly in parallel) before the body of the function.
  • Non-strict – arguments are evaluated if/when they are needed => the evaluation of the function body may start before the arguments "exist". "Lazy evaluation". "Data-driven" vs "demand-driven" evaluation.
  • Strict – sometimes the parallelism has to be limited. Non-strict – problems finding enough parallelism.
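A minimal Haskell illustration of non-strict (demand-driven) evaluation: an infinite structure is harmless as long as only a finite part is demanded.

```haskell
-- An infinite list of naturals: a strict (data-driven) language would
-- try to build the whole argument first and never terminate.
naturals :: [Int]
naturals = [0 ..]

-- Under lazy evaluation, `take` demands only five cells of the list.
firstFive :: [Int]
firstFive = take 5 naturals

main :: IO ()
main = print firstFive  -- [0,1,2,3,4]
```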

  • In non-strict languages, strictness analysis is used to decide (at compile time) which expressions are really needed. A function f is strict if f ⊥ = ⊥, where ⊥ denotes an undefined value (⊥ is also used for programs that never finish executing). If a function is strict it is safe to evaluate its arguments (and the function body) in parallel.
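In Haskell, ⊥ can be modelled by `undefined` (one common convention); this sketch shows a function that is strict in one argument and non-strict in another:

```haskell
-- `const` ignores its second argument, so it is NOT strict in it:
-- const 0 ⊥ = 0, and the undefined argument is never evaluated.
nonStrictUse :: Int
nonStrictUse = const 0 (undefined :: Int)

-- (+) is strict in both arguments: forcing (undefined + 1) would
-- propagate ⊥, so eager/parallel evaluation of its arguments is safe
-- only because they were needed anyway.
main :: IO ()
main = print nonStrictUse  -- 0
```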

How to utilize the parallelism?

  • Partitioning into tasks – implicit or explicit?
  • Static or dynamic load balancing?
  • Task placement?
  • Granularity?

Where is the control?

Implicit parallelism – the compiler & runtime system decide about partitioning, distribution of data, load balancing, communication.
– Strict languages: easy to partition into tasks, often (too) fine grained.
– Non-strict: strictness analysis needed. Limited implicit parallelism.

  • Some language constructs match parallel computation schemes
  • Data parallelism (SIMD) – the same operation is applied in parallel on every element of a large data structure. Powerful in functional languages with advanced data structures and higher-order functions.

Controlled (semi-explicit) parallelism
– Annotations – directives / suggestions to the compiler
– Evaluation strategies

Explicit parallelism
– Language constructs for partitioning, communication etc.
– ”Algorithm skeletons” – capture common patterns of parallel computation in higher-order functions. Express programs in these ”skeletons”.
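A well-known skeleton of this kind in GHC Haskell is `parMap` from `Control.Parallel.Strategies` (in the `parallel` package, which also pulls in `deepseq`); a sketch of data-parallel mapping:

```haskell
import Control.Parallel.Strategies (parMap, rdeepseq)

-- The skeleton captures the pattern "apply f to every element in
-- parallel"; the strategy rdeepseq says each element should be fully
-- evaluated by its own spark.
squares :: [Int] -> [Int]
squares = parMap rdeepseq (\x -> x * x)

main :: IO ()
main = print (squares [1 .. 5])  -- [1,4,9,16,25]
```

The program text only says *what* is mapped; partitioning and scheduling stay inside the skeleton, as the slide advocates.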
SLIDE 4

Languages

"Pure" vs "non-pure" functional languages. Pure – no side effects (assignment, I/O etc). Pure language – easier to parallelize & partition. Explicit control is hard to combine with "pureness".

Type system: small influence on the parallelism. Some languages have special types for "parallel data structures".

Computation Models

Data flow: "data-driven evaluation" – an operation can be performed as soon as all its operands are available. Can be described by data flow graphs: a directed graph where the nodes represent operations and the arrows represent data dependencies between the operations.

Ex:

    let x = a * b
        y = 4 * c
    in (x + y) * (x - y) / c end;
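Rendered in Haskell (with hypothetical values for a, b and c), the same expression makes the graph structure visible: x and y do not depend on each other, so a dataflow machine may fire both multiplications as soon as the operands arrive.

```haskell
-- Dataflow example from the slide. Nodes: a*b, 4*c, x+y, x-y, *, /.
-- Only the arrows (data dependencies) constrain the firing order.
dataflow :: Double -> Double -> Double -> Double
dataflow a b c =
  let x = a * b   -- independent of y: may fire in parallel with 4*c
      y = 4 * c
  in (x + y) * (x - y) / c

main :: IO ()
main = print (dataflow 3 2 1)  -- x=6, y=4 => 10 * 2 / 1 = 20.0
```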

  • (Idealized) behavior: the values are sent directly between the nodes/instructions. No shared memory.
  • Only data dependencies limit the parallelism
  • Computations may be "pipelined" through the graph
  • Operations can not have side effects
  • "Merge" and "switch" nodes are used to build conditionals and loops
  • A suitable representation of data flow graphs can be used as machine language in a data flow machine.

Reduction

  • A (functional) program = one (large) expression
  • Evaluation is done by stepwise substitution of subexpressions with their values until a "normal form" is reached.
  • Expressions represented as a graph => graph reduction
  • Ex: f x = (4 + (2 * x)) / ((2 * x) - 5)
  • Common subexpressions share their graph representation.
  • Parallel reductions are possible
  • Usually demand-driven evaluation
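The sharing of the common subexpression (2 * x) can be made explicit with a let binding; in graph reduction both occurrences point at the same node, so it is reduced only once:

```haskell
-- f from the slide: (2 * x) occurs twice. Binding it to t means the
-- expression graph contains a single shared node, reduced once and
-- then read by both the numerator and the denominator.
f :: Double -> Double
f x = let t = 2 * x
      in (4 + t) / (t - 5)

main :: IO ()
main = print (f 3)  -- t = 6 => 10 / 1 = 10.0
```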
SLIDE 5

Hardware

For many years there was a lot of research on special architectures that closely matched these computation models. Reduction machines – ALICE, Flagship, GRIP etc. Data flow machines – Manchester, TTDA & Monsoon (MIT). Many ideas from this research have been adopted in modern (parallel) computer architecture. Not everything was implemented, but a lot was simulated using conventional hardware. The idea of special hardware for parallel functional programming is now "very unfashionable". Most work is nowadays done on traditional hardware (newer things: Cell, GPU??) – the programming and communication models may however be totally separate from reality (high level of abstraction).

Dataflow vs. Control Flow

von Neumann or control flow computing model

– a program is a series of addressable instructions, each of which either

  • specifies an operation along with memory locations of the operands or
  • specifies (un)conditional transfer of control to some other instruction.

– Essentially: the next instruction to be executed depends on what happened during the execution of the current instruction.
– The next instruction to be executed is pointed to and triggered by the PC.
– The instruction is executed even if some of its operands are not yet available.

Dataflow model: the execution is driven only by the availability of operands!

– no PC and no global updateable store
– the two features of the von Neumann model that become bottlenecks in exploiting parallelism are missing

Implementation Issues

Early implementations were interpreted. Functional languages are often implemented with the help of an abstract machine; this is often also true for parallel implementations. The level of abstraction of the abstract machine determines how easily it can be realized on a concrete architecture:
– interpretation
– concrete machine code

(Diagram: a program's meaning, given by the semantics, must be equivalent to the normal form obtained by compiling the program to abstract code and reducing it on the abstract machine, for the same input/output.)

Issues of the ordering

Computations on a von Neumann machine must be performed in some order, which raises the question of reduction order.

Normal order evaluation: evaluates the arguments when they are needed; is implemented using call-by-need ≈ lazy evaluation (or call-by-name); realizes non-strict semantics. Always terminates if the value of the expression ≠ ⊥. Often used in graph reduction.

Applicative order evaluation: evaluates the arguments before a function is called; is implemented using call-by-value; realizes strict semantics. Can get into an infinite loop when evaluating arguments that are not used. Often used in data flow.
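The classic illustration of the difference, in Haskell (call-by-need): an argument whose value is ⊥ is harmless as long as it is never demanded.

```haskell
-- loop never terminates; its value is ⊥.
loop :: Int
loop = loop

-- Under normal order / call-by-need, const 42 loop returns 42: the
-- second argument is never demanded. Under applicative order
-- (call-by-value) the argument would be evaluated first, and the
-- program would never terminate.
answer :: Int
answer = const 42 loop

main :: IO ()
main = print answer  -- 42
```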

SLIDE 6

Questions

Shall functional parallelism be realized in the language itself or by the abstract machine? Graph reduction with lazy evaluation is a natural model, but not simple to implement efficiently => places high demands on the compiler & run-time system (strictness analysis, task generation etc). Control-flow based implementations => many operational details must be controlled in the language => higher demands on the programmer. Which type of parallelism gives performance gains? "Compile-time performance prediction", "run-time performance modeling".

Speculative Evaluation

(For non-strict languages.) Evaluate nodes even though their values are not necessarily needed, in order to reduce the total execution time.

  • "Compulsory" tasks must have higher priority
  • May not "over-use" memory (and other resources)
  • Must be able to "upgrade" and "kill" a speculative task

Garbage Collection

  • Is done in parallel with ”productive” work
  • Often takes place on individual processors, rarely globally

Parallel Implementation of Non-strict Languages

  • Based on a graph representation of the program (may be stack- or packet-based)
  • A number of processes, each composed of a (sequential) abstract machine, execute a thread corresponding to the reduction of a part of the graph
  • Threads ready to be evaluated are called sparks, stored in a spark pool
  • Synchronization is needed when a thread needs a value computed by another thread
  • Potential sparks are identified during compilation, automatically or based on annotations / code instructions
  • It can be compulsory or optional to let a spark create a new process
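GHC's runtime implements exactly this spark-pool model; a sketch using the `Eval` monad from `Control.Parallel.Strategies` (in the `parallel` package):

```haskell
import Control.Parallel.Strategies (runEval, rpar, rseq)

-- rpar puts a computation into the spark pool (a hint: "may be
-- evaluated in parallel"); rseq evaluates in the current thread.
-- The final rseq on `a` is the synchronization point: if no worker
-- picked up the spark, the parent evaluates it itself
-- ("evaluate-and-die", see the next slide).
twoTasks :: (Int, Int)
twoTasks = runEval $ do
  a <- rpar (sum [1 .. 100])     -- sparked
  b <- rseq (product [1 .. 5])   -- evaluated here
  _ <- rseq a                    -- wait for / claim the spark's value
  return (a, b)

main :: IO ()
main = print twoTasks  -- (5050,120)
```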

SLIDE 7

Parallel Implementation of Non-strict Languages

There exist different kinds of mechanisms for scheduling, task creation, blocking and resuming, load management etc. One example: "evaluate-and-die" – if a parent needs the value of a "child spark", it starts to evaluate the corresponding computation itself:
i) the child has already finished the computation – just collect the value;
ii) the child has not started the computation yet – "kill the child" and do the computation yourself;
iii) the child is computing right now – wait and synchronize.

Proofs

Due to their mathematical foundation, functional languages are more suitable for proofs and formal manipulation:
– that a program matches its specification
– that two programs are equivalent
– program transformations
– "refinement" of programs
For completely implicit parallelism the same techniques as for a sequential program can be used. With explicit parallelism one also has to prove that the synchronization is correct. Program transformations may also influence the parallel properties of the program.

Implicit Parallelism

a) Independent subexpressions can be evaluated in parallel (Church-Rosser)
b) Data parallelism.

Ex: binom
– The recursive calls to binom can be evaluated in parallel (divide & conquer – a pair)
– The comparisons in the "case analysis" can be done in parallel (fine grained!)

Will the gain in time be swallowed by overhead (communication & synchronization)? Grain/cost analysis is needed!

binom: since + is a strict function, both recursive calls can be evaluated in parallel. binom can be derived to be a strict function => the arguments can be evaluated eagerly and in parallel.

binom :: Int -> Int -> Int
binom n k
  | k == 0 && n >= 0 = 1
  | n < k  && n >= 0 = 0
  | n >= k && k >= 0 = binom (n-1) k + binom (n-1) (k-1)
  | otherwise        = error "negative params"

Indicated Parallelism, Annotations

"Advice" to the compiler. Only changes the run-time behavior, not the semantics. Can still be compiled sequentially.

Example: parallel let-expression

binom :: Int -> Int -> Int
binom n k
  | k == 0 && n >= 0 = 1
  | n < k  && n >= 0 = 0
  | n >= k && k >= 0 = letpar v = binom (n-1) (k-1)
                       in binom (n-1) k + v
  | otherwise        = error "negative params"

Can use a special compilation scheme.

Parallel map (can be used in matrix-vector multiplication):

parmap f []     = []
parmap f (x:xs) = letpar y = (f x)  -- only if NF evaluation
                  in y : (parmap f xs)
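In GHC Haskell the letpar annotation corresponds roughly to `par` from `Control.Parallel` (an assumption about the slide's hypothetical notation; the name `binomPar` is mine):

```haskell
import Control.Parallel (par, pseq)

-- `letpar v = e in body` ≈ spark v, then evaluate the body.
binomPar :: Int -> Int -> Int
binomPar n k
  | k == 0 && n >= 0 = 1
  | n < k  && n >= 0 = 0
  | n >= k && k >= 0 = v `par` (w `pseq` (w + v))
  | otherwise        = error "negative params"
  where
    v = binomPar (n - 1) (k - 1)  -- sparked, like the slide's letpar v
    w = binomPar (n - 1) k        -- evaluated by the current thread

main :: IO ()
main = print (binomPar 4 2)  -- C(4,2) = 6
```

As the slide stresses, only the run-time behavior changes: the result is the same as the sequential binom.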

SLIDE 8

Controlled Parallelism

Semi-explicit: no explicit notation for parallel processes, but still some explicit constructs / operations. Objective: a clear separation between what shall be computed ("purely functional") and the control of parallelism and the dynamic behavior.

Para-functional programming: explicit constructs for
– scheduling – specify a partial order for computations
– mapping of the program onto processors
– distribution of data

Quotation

”A parallel imperative program specifies in detail many resource-allocation decisions which the parallel functional program does not mention at all”

Quotation

On the other hand:

”To take a general purpose program, automatically to partition that program into parallel threads, automatically and dynamically to manage those threads,… and to achieve high efficiency from that parallel system is a significant intellectual challenge”

Quotation

On the other hand:

”Arbitrary programs are rarely parallel. Quite a bit of work needs to go into designing and expressing a parallel algorithm”

SLIDE 9

Conclusions

  • Programming model – important to find the right level of abstraction for the problem / program
  • Memory consumption – often a problem
  • Foreign language interfacing – necessary if one does not want to "reinvent the wheel"
  • Familiar syntax – facilitates acceptance
  • Architecture independence – language & applications that work on different types of computers