Parallel and Concurrent Haskell Part I
Simon Marlow (Microsoft Research, Cambridge, UK)




All you need is X
• Where X is actors, threads, transactional memory, futures, ...
• Often true, but for a given application, some Xs will be much more suitable than others.
• In Haskell, our approach is to give you lots of different Xs.
  – The Parallel and Concurrent Haskell ecosystem: threads, locks, MVars, concurrent data structures, asynchronous agents, asynchronous exceptions, the IO manager, parallel algorithms, the Eval monad, the Par monad, Strategies, lightweight threads, Software Transactional Memory.
• "Embrace diversity (but control side effects)" (Simon Peyton Jones)

Parallelism vs. Concurrency
• Parallel Haskell: multiple cores for performance.
• Concurrent Haskell: multiple threads for modularity of interaction.
• Primary distinguishing feature of Parallel Haskell: determinism.
  – The program does "the same thing" regardless of how many cores are used to run it.
  – No race conditions or deadlocks: add parallelism without sacrificing correctness.
  – Parallelism is used to speed up pure (non-IO-monad) Haskell code.
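To make the distinction concrete, here is a minimal sketch (mine, not from the slides) contrasting the two styles: par and pseq from the parallel package express deterministic parallelism over pure code, while forkIO creates a thread of control whose effects interleave nondeterministically with the main thread's.

  import Control.Concurrent (forkIO, newEmptyMVar, putMVar, takeMVar)
  import Control.Parallel (par, pseq)

  -- Parallel Haskell: pure and deterministic. `par` is only a hint that
  -- its first argument may be evaluated on another core; the result is
  -- the same whether the program runs on 1 core or 8.
  parSum :: [Int] -> Int
  parSum xs = left `par` (right `pseq` (left + right))
    where
      (as, bs) = splitAt (length xs `div` 2) xs
      left     = sum as
      right    = sum bs

  -- Concurrent Haskell: threads of control in the IO monad; the order
  -- in which the two putStrLn effects appear is nondeterministic.
  main :: IO ()
  main = do
    done <- newEmptyMVar
    _ <- forkIO $ do
      putStrLn "hello from a forked thread"
      putMVar done ()
    putStrLn ("parSum = " ++ show (parSum [1 .. 1000000]))
    takeMVar done    -- wait for the forked thread to finish

Compile with -threaded and run with +RTS -N2 to actually use two cores.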

Parallelism vs. Concurrency (continued)
• Primary distinguishing feature of Concurrent Haskell: threads of control.
• Concurrent programming is done in the IO monad:
  – because threads have effects
  – effects from multiple threads are interleaved nondeterministically at runtime.
• Concurrent programming allows programs that interact with multiple external agents to be modular:
  – the interaction with each agent is programmed separately
  – allows programs to be structured as a collection of interacting agents (actors).

I. Parallel Haskell
• In this part of the course, you will learn how to:
  – Do basic parallelism:
    • compile and run a Haskell program, and measure its performance
    • parallelise a simple Haskell program (a Sudoku solver)
    • use ThreadScope to profile parallel execution
    • do dynamic rather than static partitioning
    • measure parallel speedup, and use Amdahl's law to calculate the possible speedup (a worked example appears near the end of this section).
  – Work with Evaluation Strategies:
    • build simple Strategies
    • parallelise a data-mining problem: K-Means.
  – Work with the Par monad:
    • use the Par monad for expressing dataflow parallelism
    • parallelise a type-inference engine.

Running example: solving Sudoku
• Code from the Haskell wiki (brute-force search with some intelligent pruning).
• Can solve all 49,000 problems in 2 minutes.
• Input: a line of text representing a problem, e.g.

  .......2143.......6........2.15..........637...........68...4.....23........7....
  .......241..8.............3...4..5..7.....1......3.......51.6....2....5..3...7...
  .......24....1...........8.3.7...1..1..8..5.....2......2.4...6.5...7.3...........

Solving Sudoku problems
• Sequentially:
  – divide the file into lines
  – call the solver for each line.

  import Sudoku
  import Control.Exception
  import System.Environment

  main :: IO ()
  main = do
    [f] <- getArgs
    grids <- fmap lines $ readFile f
    mapM_ (evaluate . solve) grids

• The Sudoku module provides the solver, and evaluate comes from Control.Exception:

  solve    :: String -> Maybe Grid
  evaluate :: a -> IO a

Compile the program...

  $ ghc -O2 sudoku1.hs -rtsopts
  [1 of 2] Compiling Sudoku           ( Sudoku.hs, Sudoku.o )
  [2 of 2] Compiling Main             ( sudoku1.hs, sudoku1.o )
  Linking sudoku1 ...
  $

Run the program...

  $ ./sudoku1 sudoku17.1000.txt +RTS -s
  2,392,127,440 bytes allocated in the heap
     36,829,592 bytes copied during GC
        191,168 bytes maximum residency (11 sample(s))
         82,256 bytes maximum slop
              2 MB total memory in use (0 MB lost due to fragmentation)

  Generation 0:  4570 collections,     0 parallel,  0.14s,  0.13s elapsed
  Generation 1:    11 collections,     0 parallel,  0.00s,  0.00s elapsed
  ...
  INIT  time    0.00s  (  0.00s elapsed)
  MUT   time    2.92s  (  2.92s elapsed)
  GC    time    0.14s  (  0.14s elapsed)
  EXIT  time    0.00s  (  0.00s elapsed)
  Total time    3.06s  (  3.06s elapsed)
  ...
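A quick sanity check on these numbers (my arithmetic, not from the slides): productivity is MUT / Total = 2.92 / 3.06 ≈ 95%, so the sequential run spends almost all of its time doing useful work and GC overhead is small. If the file contains 1,000 puzzles, as the name sudoku17.1000.txt suggests, throughput is roughly 1000 / 3.06 ≈ 327 puzzles per second. The 3.06s elapsed time is the sequential baseline against which the speedup calculations below are measured.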

Now to parallelise it...
• Doing parallel computation entails specifying coordination in some way:
  – compute A in parallel with B.
• This is a constraint on evaluation order.
• But by design, Haskell does not have a specified evaluation order.
• So we need to add something to the language to express constraints on evaluation order.

The Eval monad

  import Control.Parallel.Strategies

  data Eval a
  instance Monad Eval

  runEval :: Eval a -> a
  rpar    :: a -> Eval a
  rseq    :: a -> Eval a

• Eval is pure.
• Just for expressing sequencing between rpar/rseq; nothing more.
• Compositional: larger Eval sequences can be built by composing smaller ones using monad combinators.
• Internal workings of Eval are very simple (see the Haskell Symposium 2010 paper).

What does rpar actually do?

  x <- rpar e

• rpar creates a spark by writing an entry in the spark pool.
  – rpar is very cheap! (not a thread)
• The spark pool is a circular buffer.
• When a processor has nothing to do, it tries to remove an entry from its own spark pool, or steal an entry from another spark pool (work stealing).
• When a spark is found, it is evaluated.
• The spark pool can be full: watch out for spark overflow!

Basic Eval patterns
• To compute a in parallel with b, and return a pair of the results:

  do a' <- rpar a    -- start evaluating a in the background
     b' <- rseq b    -- evaluate b, and wait for the result
     return (a', b')

• Alternatively:

  do a' <- rpar a
     b' <- rseq b
     rseq a'
     return (a', b')

• What is the difference between the two? (See the sketch after this slide.)

Parallelising Sudoku
• Let's divide the work in two, so we can solve each half in parallel:

  let (as, bs) = splitAt (length grids `div` 2) grids

• Now we need something like:

  runEval $ do
    as' <- rpar (map solve as)
    bs' <- rpar (map solve bs)
    rseq as'
    rseq bs'
    return ()

But this won't work...
• rpar evaluates its argument to Weak Head Normal Form (WHNF).
• WTF is WHNF?
  – it evaluates as far as the first constructor
  – e.g. for a list, we get either [] or (x:xs)
  – e.g. the WHNF of "map solve (a:as)" would be "solve a : map solve as".
• But we want to evaluate the whole list, and the elements.
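The answer to the question above: in the first pattern, rseq waits only for b, so the block can return while the spark for a is still unevaluated; the second pattern additionally waits for a with rseq a'. A sketch of the two patterns as reusable functions (the names parPair1 and parPair2 are mine, not from the slides):

  import Control.Parallel.Strategies (Eval, rpar, rseq)

  -- Pattern 1: fork a, evaluate b, return immediately.
  -- The caller may receive the pair before a has finished evaluating.
  parPair1 :: a -> b -> Eval (a, b)
  parPair1 a b = do
    a' <- rpar a
    b' <- rseq b
    return (a', b')

  -- Pattern 2: fork a, evaluate b, then also wait for a.
  -- Both components are in WHNF by the time the pair is returned.
  parPair2 :: a -> b -> Eval (a, b)
  parPair2 a b = do
    a' <- rpar a
    b' <- rseq b
    _  <- rseq a'
    return (a', b')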

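The WHNF-versus-full-evaluation distinction is easy to observe. A minimal sketch (mine, not from the slides) that hides an error inside a list element: forcing to WHNF, or even forcing the whole spine with length, never touches the elements, while deepseq does.

  import Control.DeepSeq (deepseq)
  import Control.Exception (evaluate)

  xs :: [Int]
  xs = 1 : error "boom" : [3]

  main :: IO ()
  main = do
    _ <- evaluate xs                -- WHNF: only the outermost (:); no error
    _ <- evaluate (length xs)       -- forces the spine; elements untouched, still no error
    _ <- evaluate (xs `deepseq` ()) -- normal form: forces every element, so this throws "boom"
    putStrLn "never reached"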
We need 'deep'

  import Control.DeepSeq

  deep :: NFData a => a -> a
  deep a = deepseq a a

• deep fully evaluates a nested data structure and returns it
  – e.g. a list: the list is fully evaluated, including the elements.
• It uses overloading: the argument must be an instance of NFData.
  – instances for most common types are provided by the library.

Ok, adding deep

  runEval $ do
    as' <- rpar (deep (map solve as))
    bs' <- rpar (deep (map solve bs))
    rseq as'
    rseq bs'
    return ()

• Now we just need to evaluate this at the top level in 'main':

  evaluate $ runEval $ do
    a <- rpar (deep (map solve as))
    ...

• (Normally using the result would be enough to force evaluation, but we're not using the result here.)

Let's try it...
• Compile sudoku2
  – (add -threaded and -rtsopts)
• Run with sudoku17.1000.txt +RTS -N2
• Take note of the Elapsed Time.
• (A complete sudoku2.hs sketch appears at the end of this section.)

Runtime results...

  $ ./sudoku2 sudoku17.1000.txt +RTS -N2 -s
  2,400,125,664 bytes allocated in the heap
     48,845,008 bytes copied during GC
      2,617,120 bytes maximum residency (7 sample(s))
        313,496 bytes maximum slop
              9 MB total memory in use (0 MB lost due to fragmentation)

  Generation 0:  2975 collections,  2974 parallel,  1.04s,  0.15s elapsed
  Generation 1:     7 collections,     7 parallel,  0.05s,  0.02s elapsed

  Parallel GC work balance: 1.52 (6087267 / 3999565, ideal 2)

  SPARKS: 2 (1 converted, 0 pruned)

  INIT  time    0.00s  (  0.00s elapsed)
  MUT   time    2.21s  (  1.80s elapsed)
  GC    time    1.08s  (  0.17s elapsed)
  EXIT  time    0.00s  (  0.00s elapsed)
  Total time    3.29s  (  1.97s elapsed)

Calculating Speedup
• Calculating speedup with 2 processors:
  – Elapsed time (1 proc) / Elapsed time (2 procs)
  – NB. not CPU time (2 procs) / Elapsed time (2 procs)!
  – NB. compare against the sequential program, not the parallel program running on 1 proc.
• Speedup for sudoku2: 3.06 / 1.97 = 1.55
  – not great...

Why not 2?
• There are two reasons for lack of parallel speedup:
  – less than 100% utilisation (some processors are idle for part of the time)
  – extra overhead in the parallel version.
• Each of these has many possible causes...
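Amdahl's law, mentioned in the course outline, puts a ceiling on achievable speedup: if a fraction P of the work can be parallelised over N processors, the best possible speedup is S = 1 / ((1 - P) + P / N). A small illustration in Haskell (mine, not from the slides):

  -- Amdahl's law: maximum speedup for parallel fraction p on n cores.
  amdahl :: Double -> Double -> Double
  amdahl p n = 1 / ((1 - p) + p / n)

  main :: IO ()
  main = do
    print (amdahl 0.97 2)    -- ~1.94: even 97% parallel work keeps 2 cores below 2x
    print (amdahl 0.97 1e9)  -- ~33.3: with unlimited cores, the serial 3% dominates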

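Putting the fragments above together, a complete sudoku2.hs might look like the following sketch. It assumes the course's Sudoku module is on the search path and that Grid has an NFData instance; the file actually distributed with the course may differ.

  import Sudoku                            -- course module: solve :: String -> Maybe Grid
  import Control.DeepSeq (NFData, deepseq) -- assumes Grid has an NFData instance
  import Control.Exception (evaluate)
  import Control.Parallel.Strategies (runEval, rpar, rseq)
  import System.Environment (getArgs)

  -- Fully evaluate a nested data structure and return it.
  deep :: NFData a => a -> a
  deep a = deepseq a a

  main :: IO ()
  main = do
    [f]   <- getArgs
    grids <- fmap lines (readFile f)
    -- static partitioning: one half of the puzzles per spark
    let (as, bs) = splitAt (length grids `div` 2) grids
    evaluate $ runEval $ do
      as' <- rpar (deep (map solve as))
      bs' <- rpar (deep (map solve bs))
      _   <- rseq as'
      _   <- rseq bs'
      return ()

Compile with ghc -O2 -threaded -rtsopts sudoku2.hs and run with ./sudoku2 sudoku17.1000.txt +RTS -N2 -s, as on the slides.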