Microsoft Research The free lunch is over. Multicores are here. We - PowerPoint PPT Presentation

Simon Peyton Jones Microsoft Research

The free lunch is over. Multicores are here. We have to program them. This is hard. Yada-yada-yada.
Programming parallel computers
• Plan A. Start with a language whose computational fabric is by-default sequential, and by heroic means make the program parallel
• Plan B. Start with a language whose computational fabric is by-default parallel
Every successful large-scale application of parallelism has been largely declarative and value-oriented
• SQL Server
• LINQ
• Map/Reduce
• Scientific computation
Plan B will win. Parallel programming will increasingly mean functional programming

• “Just use a functional language and your troubles are over”
• Right idea:
  - No side effects; more realistically, limited side effects
  - Strong guarantees that sub-computations do not interfere
• But far too starry eyed. No silver bullet:
  - one size does not fit all
  - need to “think parallel”: if the algorithm has sequential data dependencies, no language will save you!

Different problems need different solutions.
• Shared memory vs distributed memory
• Transactional memory
• Message passing
• Data parallelism
• Locality
• Granularity
• Map/reduce
• ...on and on and on...
Common theme:
• the cost model matters: you can’t just say “leave it to the system”
• no single cost model is right for all
A “cost model” gives the programmer some idea of what an operation costs, without burying her in details. Examples:
• Send message: copy data or swing a pointer?
• Memory fetch: uniform access or do cache effects dominate?
• Thread spawn: tens of cycles or tens of thousands of cycles?
• Scheduling: can a thread starve?

• Goal: express the “natural structure” of a program involving lots of concurrent I/O (e.g. a web server, or a responsive GUI, or downloading lots of URLs in parallel)
  - Makes perfect sense with or without multicore
  - Most threads are blocked most of the time
• Usually done with
  - Thread pools
  - Event handlers
  - Message pumps
• Really really hard to get right, esp. when combined with exceptions, error handling
NB: Significant steps forward in F#/C# recently: Async<T>. See http://channel9.msdn.com/blogs/pdc2008/tl11

• Sole goal: performance using multiple cores
• …at the cost of a more complicated program
#include “StdTalk.io”
• Clock speeds not increasing
• Transistor count still increasing
• Delivered in the form of more cores
• Often with an inadequate memory bandwidth
• No alternative: the only way to ride Moore’s law is to write parallel code

• Use a functional language
• But offer many different approaches to parallel/concurrent programming, each with a different cost model
• Do not force an up-front choice:
  - Better one language offering many abstractions
  - …than many languages offering one each
  - (HPF, map/reduce, pthreads …)

This talk: lots of different concurrent/parallel programming paradigms (cost models) in Haskell
Multicore: use Haskell!
• Task parallelism: explicit threads, synchronised via locks, messages, or STM
  - Modest parallelism
  - Hard to program
• Semi-implicit parallelism: evaluate pure functions in parallel
  - Modest parallelism
  - Implicit synchronisation
  - Easy to program
• Data parallelism: operate simultaneously on bulk data
  - Massive parallelism
  - Easy to program
  - Single flow of control
  - Implicit synchronisation
Slogan: no silver bullet: embrace diversity

Multicore: parallel programming essential
Task parallelism: explicit threads, synchronised via locks, messages, or STM

• Lots of threads, all performing I/O
  - GUIs
  - Web servers (and other servers of course)
  - BitTorrent clients
• Non-deterministic by design
• Needs
  - Lightweight threads
  - A mechanism for threads to coordinate/share
• Typically: pthreads/Java threads + locks/condition variables

• Very very lightweight threads
  - Explicitly spawned, can perform I/O
  - Threads cost a few hundred bytes each
  - You can have (literally) millions of them
  - I/O blocking via epoll => OK to have hundreds of thousands of outstanding I/O requests
  - Pre-emptively scheduled
• Threads share memory
• Coordination via Software Transactional Memory (STM)
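To make the "very lightweight threads" claim concrete, here is a minimal sketch that forks ten thousand threads and waits for all of them. The name spawnMany and the thread count are illustrative assumptions, not taken from the talk; only forkIO and the MVar operations from GHC's base library are used.

```haskell
import Control.Concurrent (forkIO)
import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)
import Control.Monad (replicateM, replicateM_)

-- Hypothetical helper: fork n threads that each report completion,
-- then block until every one of them has reported.
spawnMany :: Int -> IO Int
spawnMany n = do
  done <- newEmptyMVar
  -- Each forked thread blocks on putMVar until the main thread takes its token.
  replicateM_ n (forkIO (putMVar done ()))
  acks <- replicateM n (takeMVar done)
  pure (length acks)

main :: IO ()
main = spawnMany 10000 >>= print
```

Raising the count by one or two orders of magnitude is a reasonable experiment; how far it scales depends on the machine and runtime settings.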

main = do { putStr (reverse "yes")
          ; putStr "no" }
• Effects are explicit in the type system
  - (reverse "yes") :: String   -- No effects
  - (putStr "no")   :: IO ()    -- Can have effects
• The main program is an effect-ful computation
  - main :: IO ()

newRef   :: a -> IO (Ref a)
readRef  :: Ref a -> IO a
writeRef :: Ref a -> a -> IO ()
main = do { r <- newRef 0
          ; incR r
          ; s <- readRef r
          ; print s }
incR :: Ref Int -> IO ()
incR r = do { v <- readRef r
            ; writeRef r (v+1) }
Reads and writes are 100% explicit! You can’t say (r + 6), because r :: Ref Int
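The Ref operations on this slide are talk-level names. As a runnable sketch, the same program can be written with Data.IORef from GHC's base library, assuming Ref, newRef, readRef and writeRef are meant to behave like IORef, newIORef, readIORef and writeIORef.

```haskell
import Data.IORef (IORef, newIORef, readIORef, writeIORef)

-- Increment a mutable reference: one explicit read, then one explicit write.
incR :: IORef Int -> IO ()
incR r = do
  v <- readIORef r
  writeIORef r (v + 1)

main :: IO ()
main = do
  r <- newIORef 0
  incR r
  s <- readIORef r
  print s  -- prints 1
```

Writing r + 6 here is rejected by the type checker, because r is an IORef Int rather than an Int.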

forkIO :: IO () -> IO ThreadId
• forkIO spawns a thread
• It takes an action as its argument
webServer :: RequestPort -> IO ()
webServer p = do { conn <- acceptRequest p
                 ; forkIO (serviceRequest conn)
                 ; webServer p }
serviceRequest :: Connection -> IO ()
serviceRequest c = do { … interact with client … }
No event-loop spaghetti!
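The web server above depends on RequestPort, Connection and acceptRequest, which the slide leaves undefined. The sketch below keeps the thread-per-request shape but substitutes plain integers for requests; serviceRequest and serveAll here are hypothetical stand-ins, and each thread hands its reply back through its own MVar.

```haskell
import Control.Concurrent (forkIO)
import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)
import Control.Monad (forM)

-- Hypothetical stand-in for real request handling: build a reply string.
serviceRequest :: Int -> String
serviceRequest n = "reply to request " ++ show n

-- Fork one thread per request; each thread deposits its reply in its own MVar.
serveAll :: [Int] -> IO [String]
serveAll reqs = do
  slots <- forM reqs $ \n -> do
    slot <- newEmptyMVar
    _ <- forkIO (putMVar slot (serviceRequest n))
    pure slot
  -- takeMVar blocks until the corresponding thread has delivered its reply.
  mapM takeMVar slots

main :: IO ()
main = serveAll [1, 2, 3] >>= mapM_ putStrLn
```

The replies come back in request order because the slots are drained in order, even though the threads may run in any order.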

• How do threads coordinate with each other?
main = do { r <- newRef 0
          ; forkIO (incR r)
          ; incR r
          ; ... }
incR :: Ref Int -> IO ()
incR r = do { v <- readRef r
            ; writeRef r (v+1) }
Aargh! A race

A 10-second review:
• Races: due to forgotten locks
• Deadlock: locks acquired in “wrong” order
• Lost wakeups: forgotten notify to condition variable
• Diabolical error recovery: need to restore invariants and release locks in exception handlers
• These are serious problems. But even worse...

Scalable double-ended queue: one lock per cell
No interference if ends “far enough” apart
But watch out when the queue is 0, 1, or 2 elements long!

Coding style                     Difficulty of concurrent queue
Sequential code                  Undergraduate
Locks and condition variables    Publishable result at international conference
Atomic blocks                    Undergraduate

atomically { ... sequential get code ... }
• To a first approximation, just write the sequential code, and wrap atomically around it
• All-or-nothing semantics: Atomic commit (the A in ACID)
• Atomic block executes in Isolation (the I in ACID)
• Cannot deadlock (there are no locks!)
• Atomicity makes error recovery easy (e.g. exception thrown inside the get code)
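As a sketch of "write the sequential code and wrap atomically around it", here is a small double-ended queue built on GHC's stm and containers packages (both assumed to be available). The Deque type and its operations are hypothetical names for illustration. It keeps the whole queue in a single TVar, so it is not the scalable one-lock-per-cell design discussed earlier; it only shows that the empty and one-element cases need no special care.

```haskell
import Control.Concurrent.STM
  (STM, TVar, atomically, modifyTVar', newTVarIO, readTVar, writeTVar)
import Data.Sequence (Seq, ViewL (..), ViewR (..), empty, viewl, viewr, (<|), (|>))

-- Hypothetical deque: the entire sequence lives in one transactional variable.
newtype Deque a = Deque (TVar (Seq a))

newDeque :: IO (Deque a)
newDeque = Deque <$> newTVarIO empty

pushLeft, pushRight :: Deque a -> a -> STM ()
pushLeft  (Deque v) x = modifyTVar' v (x <|)
pushRight (Deque v) x = modifyTVar' v (|> x)

-- Pops return Nothing on an empty deque instead of blocking.
popLeft :: Deque a -> STM (Maybe a)
popLeft (Deque v) = do
  s <- readTVar v
  case viewl s of
    EmptyL    -> pure Nothing
    x :< rest -> writeTVar v rest >> pure (Just x)

popRight :: Deque a -> STM (Maybe a)
popRight (Deque v) = do
  s <- readTVar v
  case viewr s of
    EmptyR    -> pure Nothing
    rest :> x -> writeTVar v rest >> pure (Just x)

main :: IO ()
main = do
  d <- newDeque
  -- Three pushes committed as one atomic transaction.
  atomically (pushRight d (1 :: Int) >> pushRight d 2 >> pushLeft d 0)
  l <- atomically (popLeft d)
  r <- atomically (popRight d)
  print (l, r)  -- prints (Just 0,Just 2)
```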

atomically :: IO a -> IO a
main = do { r <- newRef 0
          ; forkIO (atomically (incR r))
          ; atomically (incR r)
          ; ... }
• atomically is a function, not a syntactic construct
• A worry: what stops you doing incR outside atomically?

Better idea:
atomically :: STM a -> IO a
newTVar    :: a -> STM (TVar a)
readTVar   :: TVar a -> STM a
writeTVar  :: TVar a -> a -> STM ()
incT :: TVar Int -> STM ()
incT r = do { v <- readTVar r; writeTVar r (v+1) }
main = do { r <- atomically (newTVar 0)
          ; forkIO (atomically (incT r))
          ; atomically (incT r)
          ; ... }
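A runnable version of this slide, assuming GHC's stm package is installed. incT is taken from the slide; the helper raceFree and the iteration count are illustrative additions. Two threads each perform n atomic increments, and an MVar is used only so that the main thread can wait for the forked one.

```haskell
import Control.Concurrent (forkIO)
import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)
import Control.Concurrent.STM (STM, TVar, atomically, newTVar, readTVar, writeTVar)
import Control.Monad (replicateM_)

-- From the slide: a read and a write that only make sense inside a transaction.
incT :: TVar Int -> STM ()
incT r = do
  v <- readTVar r
  writeTVar r (v + 1)

-- Hypothetical helper: two threads each do n atomic increments.
raceFree :: Int -> IO Int
raceFree n = do
  r <- atomically (newTVar 0)
  done <- newEmptyMVar
  _ <- forkIO (replicateM_ n (atomically (incT r)) >> putMVar done ())
  replicateM_ n (atomically (incT r))
  takeMVar done              -- wait for the forked thread to finish
  atomically (readTVar r)

main :: IO ()
main = raceFree 1000 >>= print  -- prints 2000
```

Calling incT r directly in IO, without atomically, is a type error, which answers the earlier worry about forgetting the wrapper.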

atomically :: STM a -> IO a
newTVar    :: a -> STM (TVar a)
readTVar   :: TVar a -> STM a
writeTVar  :: TVar a -> a -> STM ()
• Can’t fiddle with TVars outside atomic block [good]
• Can’t do IO inside atomic block [sad, but also good]
• No changes to the compiler (whatsoever). Only runtime system and primops.
• ...and, best of all...

incT :: TVar Int -> STM ()
incT r = do { v <- readTVar r; writeTVar r (v+1) }
incT2 :: TVar Int -> STM ()
incT2 r = do { incT r; incT r }
foo :: IO ()
foo = ...atomically (incT2 r)...
Composition is THE way we build big programs that work
• An STM computation is always executed atomically (e.g. incT2). The type tells you.
• Simply glue STMs together arbitrarily; then wrap with atomically
• No nested atomically. (What would it mean?)
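The same gluing works for larger pieces. In the sketch below, deposit, withdraw and transfer are hypothetical names (the talk only shows incT and incT2), and the stm package is assumed. transfer is built from two smaller STM actions and becomes a single transaction once wrapped in atomically, so no other thread can observe the money missing from both accounts.

```haskell
import Control.Concurrent.STM (STM, TVar, atomically, modifyTVar', newTVarIO, readTVarIO)

-- Two small STM building blocks (hypothetical names).
deposit :: TVar Int -> Int -> STM ()
deposit acc n = modifyTVar' acc (+ n)

withdraw :: TVar Int -> Int -> STM ()
withdraw acc n = modifyTVar' acc (subtract n)

-- Composition: still just an STM action, not yet run.
transfer :: TVar Int -> TVar Int -> Int -> STM ()
transfer from to n = do
  withdraw from n
  deposit to n

main :: IO ()
main = do
  a <- newTVarIO 100
  b <- newTVarIO 0
  atomically (transfer a b 30)
  balances <- (,) <$> readTVarIO a <*> readTVarIO b
  print balances  -- prints (70,30)
```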

• MVars for efficiency in (very common) special cases
• Blocking (retry) and choice (orElse) in STM
• Exceptions in STM
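A brief sketch of blocking and choice, using retry and orElse from the stm package. The account functions are hypothetical examples, not code from the talk. retry abandons the transaction until one of the TVars it read changes; orElse runs its second argument when the first one retries.

```haskell
import Control.Concurrent.STM
  (STM, TVar, atomically, newTVarIO, orElse, readTVar, readTVarIO, retry, writeTVar)

-- Block (retry) until the account holds at least n, then withdraw.
withdrawBlocking :: TVar Int -> Int -> STM ()
withdrawBlocking acc n = do
  bal <- readTVar acc
  if bal < n
    then retry
    else writeTVar acc (bal - n)

-- Choice: try the first account; if that would block, use the second.
withdrawEither :: TVar Int -> TVar Int -> Int -> STM ()
withdrawEither a b n = withdrawBlocking a n `orElse` withdrawBlocking b n

main :: IO ()
main = do
  a <- newTVarIO 5
  b <- newTVarIO 50
  atomically (withdrawEither a b 20)
  balances <- (,) <$> readTVarIO a <*> readTVarIO b
  print balances  -- prints (5,30)
```

If neither account had enough funds, the whole transaction would block until another thread changed one of the two balances.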

• A very simple web server written in Haskell
  - full HTTP 1.0 and 1.1 support,
  - handles chunked transfer encoding,
  - uses sendfile for optimized static file serving,
  - allows request bodies and response bodies to be processed in constant space
• Protection for all the basic attack vectors: overlarge request headers and slow-loris attacks
• 500 lines of Haskell (building on some amazing libraries: bytestring, blaze-builder, iteratee)

• A new thread for each user request
• Fast, fast
[Chart: pong requests/sec]

• Again, lots of threads: 400-600 is typical
• Significantly bigger program: 5000 lines of Haskell, but way smaller than the competition
• Built on STM
• Performance: roughly competitive
[Chart: lines of code, Erlang vs Haskell, with the label “80,000 loc”. Not shown: Vuse, 480k lines]

• So far everything is shared memory
• Distributed memory has a different cost model
• Think message passing…
• Think Erlang…
