
Overview Michael Isard, Mihai Budiu, Yuan Yu, Andrew Birrell, and - PDF document


  1. 11/10/2011
     Finally, let us put things into perspective by looking at alternatives to MapReduce. We start with Dryad from Microsoft.

     Overview
     • Michael Isard, Mihai Budiu, Yuan Yu, Andrew Birrell, and Dennis Fetterly. Dryad: distributed data-parallel programs from sequential building blocks. European Conference on Computer Systems (EuroSys), Lisbon, Portugal, March 21-23, 2007
     • Yuan Yu, Michael Isard, Dennis Fetterly, Mihai Budiu, Ulfar Erlingsson, Pradeep Kumar Gunda, and Jon Currey. DryadLINQ: A System for General-Purpose Distributed Data-Parallel Computing Using a High-Level Language. Symposium on Operating Systems Design and Implementation (OSDI), San Diego, CA, December 8-10, 2008
     • Presentation based on authors' slides

  2. Outline
     • Dryad Design
     • Implementation
     • Policies as Plug-ins
     • Building on Dryad

     Design Space
     [figure: axes labeled Internet / Private data center and Latency / Throughput; regions labeled Data-parallel and Shared memory]

  3. 2-D Piping
     • Unix Pipes: 1-D
       grep | sed | sort | awk | perl
     • Dryad: 2-D
       grep 1000 | sed 500 | sort 1000 | awk 500 | perl 50

     Dryad = Execution Layer
       Job (Application)  ≈  Pipeline
       Dryad              ≈  Shell
       Cluster            ≈  Machine
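The 2-D piping idea above can be sketched in a few lines of Python: each stage has a function and a width (how many copies run side by side), and data is redistributed between stages. This is a toy, single-process illustration, not Dryad's implementation. The names `repartition`, `run_2d_pipeline`, `grep_a` and `upper` are hypothetical, and round-robin redistribution is an assumption made only to keep the example short.

```python
from typing import Callable, Iterable, List, Tuple

Partition = List[str]
# A stage is (vertex function, width), e.g. "grep 4" in the slide's notation.
Stage = Tuple[Callable[[Partition], Partition], int]


def repartition(parts: List[Partition], width: int) -> List[Partition]:
    """Deal all items round-robin into `width` new partitions (assumed policy)."""
    out: List[Partition] = [[] for _ in range(width)]
    items = (item for part in parts for item in part)
    for i, item in enumerate(items):
        out[i % width].append(item)
    return out


def run_2d_pipeline(data: Iterable[str], stages: List[Stage]) -> List[str]:
    """Run every stage over its own set of partitions, one vertex per partition."""
    parts: List[Partition] = [list(data)]
    for vertex, width in stages:
        parts = repartition(parts, width)
        # Each call below stands in for one vertex; a real system would place
        # these on different machines and connect them with channels.
        parts = [vertex(p) for p in parts]
    return [item for part in parts for item in part]


def grep_a(part: Partition) -> Partition:
    return [line for line in part if "a" in line]


def upper(part: Partition) -> Partition:
    return [line.upper() for line in part]


lines = ["banana", "cherry", "apple", "kiwi", "grape"]
# Roughly "grep 4 | tr 2 | sort 1" in the slide's notation.
result = run_2d_pipeline(lines, [(grep_a, 4), (upper, 2), (sorted, 1)])
print(result)  # ['APPLE', 'BANANA', 'GRAPE']
```

The widths only change how work is split, not the result, which is the sense in which the pipeline gains a second dimension.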

  4. Outline
     • Dryad Design
     • Implementation
     • Policies as Plug-ins
     • Building on Dryad

     Virtualized 2-D Pipelines
     [figure]

  5. Virtualized 2-D Pipelines
     [figure, continued]

  6. Virtualized 2-D Pipelines
     • 2D DAG
     • multi-machine
     • virtualized

  7. Dryad Job Structure
     grep 1000 | sed 500 | sort 1000 | awk 500 | perl 50
     [figure: job graph from input files to output files; vertices (processes) such as grep, sed, sort, awk, and perl are grouped into stages and connected by channels]

     Channels
     Finite streams of items
     • distributed filesystem files (persistent)
     • SMB/NTFS files (temporary)
     • TCP pipes (inter-machine)
     • memory FIFOs (intra-machine)

  8. Architecture
     [figure labels: data plane (files, TCP, FIFO, network); control plane; job manager; job schedule; V; NS; PD; cluster]

     Staging
     1. Build
     2. Send .exe
     3. Start JM
     4. Query cluster resources
     5. Generate graph
     6. Initialize vertices
     7. Serialize vertices
     8. Monitor vertex execution
     [figure labels: JM code, vertex code, cluster services]

  9. Outline
     • Dryad Design
     • Implementation
     • Policies and Resource Management
     • Building on Dryad

     Policy Managers
     [figure: inside the job manager, a stage manager for stage R, a stage manager for stage X, and a connection manager for the R-X connection]

  10. Duplicate Execution Manager
      [figure: completed vertices X[0], X[1], X[3]; slow vertex X[2]; duplicate vertex X'[2]]
      Duplication Policy = f(running times, data volumes)

      Aggregation Manager
      [figure: static and dynamic plans; source vertices S on racks #1, #2, and #3, aggregators A #1 to #3, and consumer T]
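The slide states only that the duplication policy is some function of running times and data volumes. As one possible reading, the Python sketch below flags a running vertex once it has taken more than a chosen multiple of the typical time per byte seen on completed vertices. The heuristic, the `slowdown` threshold, and the names `VertexStats` and `vertices_to_duplicate` are all assumptions for illustration, not the policy Dryad ships.

```python
from statistics import median
from typing import Dict, List, NamedTuple


class VertexStats(NamedTuple):
    running_time: float  # seconds spent so far (or in total, if done)
    data_volume: float   # bytes of input assigned to the vertex
    done: bool


def vertices_to_duplicate(stats: Dict[str, VertexStats],
                          slowdown: float = 2.0) -> List[str]:
    """Return running vertices that look like stragglers (hypothetical heuristic)."""
    # Seconds per byte observed on vertices that have already completed.
    rates = [s.running_time / s.data_volume
             for s in stats.values() if s.done and s.data_volume > 0]
    if not rates:
        return []  # nothing to compare against yet
    typical = median(rates)
    return sorted(
        name for name, s in stats.items()
        if not s.done and s.running_time > slowdown * typical * s.data_volume
    )


stats = {
    "X[0]": VertexStats(10.0, 100.0, True),
    "X[1]": VertexStats(10.0, 100.0, True),
    "X[3]": VertexStats(10.0, 100.0, True),
    "X[2]": VertexStats(35.0, 100.0, False),  # slow vertex
}
print(vertices_to_duplicate(stats))  # ['X[2]']
```

Normalizing by data volume keeps a vertex with a large input from being mistaken for a slow one.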

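  11. Data Distribution (Group By)
      [figure: m source vertices, n destination vertices, m x n channels]

      Range-Distribution Manager
      [figure: static and dynamic plans with sources S, a histogram vertex (Hist), distributors D, and consumers T over the key range [0-100); the split point is unknown in one plan ([0-?), [?-100)) and resolved in the other ([0-30), [30-100))]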
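The m x n data distribution for a group-by can be illustrated with a short Python sketch: every source splits its records into n buckets by a hash of the key, and destination j merges bucket j from every source, so all records with the same key meet at one destination. The function names and the use of CRC32 as a stable hash are assumptions for the example.

```python
import zlib
from typing import List, Tuple

Record = Tuple[str, int]


def stable_hash(key: str) -> int:
    """Deterministic hash (Python's built-in hash() is salted per process)."""
    return zlib.crc32(key.encode("utf-8"))


def distribute(sources: List[List[Record]], n: int) -> List[List[Record]]:
    """Send each record of each source to one of n destinations by key hash."""
    # channels[i][j] is what source i sends to destination j: m x n channels.
    channels: List[List[List[Record]]] = [[[] for _ in range(n)] for _ in sources]
    for i, src in enumerate(sources):
        for key, value in src:
            channels[i][stable_hash(key) % n].append((key, value))
    # Destination j merges the j-th channel of every source.
    return [[rec for i in range(len(sources)) for rec in channels[i][j]]
            for j in range(n)]


sources = [[("a", 1), ("b", 2)], [("a", 3), ("c", 4)], [("b", 5)]]
dests = distribute(sources, 2)
for j, dest in enumerate(dests):
    print(j, dest)
```

With m = 3 sources and n = 2 destinations the sketch creates 6 channels, which is why the number of channels grows quickly as both stages widen.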

  12. Goal: Declarative Programming
      [figure: static and dynamic plans over vertices X, S, and T]

      Outline
      • Dryad Design
      • Implementation
      • Policies as Plug-ins
      • Building on Dryad

  13. Software Stack
      [figure: layered stack; labels include sed, awk, grep, etc.; Machine Learning; C#; SSIS; legacy code; PSQL; Perl; C++; Queries; Vectors; SQL server; Distributed Shell; DryadLINQ; Dryad; Distributed Filesystem; CIFS/NTFS; Cluster Services; Job queueing, monitoring; four Windows Server machines]

      Example Query: Sky Server
      • Table photoPrimary
        – All identified astronomical objects (354,254,163 records)
        – ID, color magnitude in 5 bands (u, g, r, i, z)
      • Table neighbors
        – For each object, neighbors within 30 arc seconds (2,803,165,372 records)
      • Query 18: gravitational lens effect
        – Find all objects that have neighbors whose color is similar to that object

  14. SkyServer Query 18
          select distinct U.ObjID
          into results
          from photoPrimary U, neighbors N, photoPrimary L
          where U.ObjID = N.ObjID
            and U.mode = 1
            and L.ObjID = N.NeighborObjID
            and U.ObjID < L.ObjID
            and abs((U.u-U.g)-(L.u-L.g))<0.05
            and abs((U.g-U.r)-(L.g-L.r))<0.05
            and abs((U.r-U.i)-(L.r-L.i))<0.05
            and abs((U.i-U.z)-(L.i-L.z))<0.05
      [figure: Dryad job graph for the query; vertices X, D, M, S, Y, and H over inputs U and N, with stage widths n and 4n]

      SkyServer DB query
      • Took SQL plan
      • Manually coded in Dryad
      • Manually partitioned data
      [figure: the same job graph, annotated with the work done along the way:]
      – inputs u: objid, color and n: objid, neighborobjid [partition by objid]
      – select u.color, n.neighborobjid from u join n where u.objid = n.objid
      – (u.color, n.neighborobjid) [re-partition by n.neighborobjid]
      – [order by n.neighborobjid]
      – select u.objid from u join <temp> where u.objid = <temp>.neighborobjid and |u.color - <temp>.color| < d
      – [distinct] [merge outputs]

  15. Optimization
      [figure, shown on two consecutive slides: variants of the Query 18 job graph; vertices H, Y, U, S, M, D, X, and N with stage widths n and 4n]

  16. SkyServer Q18 Performance
      [chart: speed-up (times, 0 to 16) versus number of computers (0 to 10) for Dryad In-Memory, Dryad Two-pass, and SQLServer 2005]

      DryadLINQ
      • Declarative programming
      • Integration with Visual Studio
      • Integration with .Net
      • Type safety
      • Automatic serialization
      • Job graph optimizations
        – static
        – dynamic
      • Conciseness

  17. LINQ
          Collection<T> collection;
          bool IsLegal(Key k);
          string Hash(Key k);

          var results = from c in collection
                        where IsLegal(c.key)
                        select new { hash = Hash(c.key), c.value };

      DryadLINQ = LINQ + Dryad
      [figure: the same query compiled into a query plan (Dryad job); C# vertex code runs over the partitions of the data collection and produces results]

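  18. Data Model
      [figure: a collection is divided into partitions, each holding C# objects]

      Query Providers
      [figure: on the client machine, a C# query expression passes through DryadLINQ (ToDryadTable), which produces a distributed query plan and invokes the Dryad JM in the data center; Dryad execution reads input tables and writes output tables; the results come back as a DryadTable of C# objects that the client reads with foreach]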

  19. Example: Histogram
          public static IQueryable<Pair> Histogram(
              IQueryable<LineRecord> input, int k)
          {
              var words = input.SelectMany(x => x.line.Split(' '));
              var groups = words.GroupBy(x => x);
              var counts = groups.Select(x => new Pair(x.Key, x.Count()));
              var ordered = counts.OrderByDescending(x => x.count);
              var top = ordered.Take(k);
              return top;
          }

      Intermediate values for one input line:
      input:   "A line of words of wisdom"
      words:   ["A", "line", "of", "words", "of", "wisdom"]
      groups:  [["A"], ["line"], ["of", "of"], ["words"], ["wisdom"]]
      counts:  [{"A", 1}, {"line", 1}, {"of", 2}, {"words", 1}, {"wisdom", 1}]
      ordered: [{"of", 2}, {"A", 1}, {"line", 1}, {"words", 1}, {"wisdom", 1}]
      top:     [{"of", 2}, {"A", 1}, {"line", 1}]

      Histogram Plan
      SelectMany → HashDistribute → Merge → GroupBy → Select → OrderByDescending → Take → MergeSort → Take
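The Histogram plan above can be traced with a small single-process Python sketch: words are hash-distributed so that every occurrence of a word lands in the same bucket, each bucket computes its own top k, and a final merge takes the global top k. The function name `histogram_plan`, the bucket count `n`, the CRC32 hash, and the alphabetical tie-break are assumptions added to make the example deterministic; they are not taken from the slides.

```python
import heapq
import zlib
from collections import Counter
from typing import List, Tuple


def histogram_plan(partitions: List[List[str]], k: int,
                   n: int = 2) -> List[Tuple[str, int]]:
    """Top-k word counts, following SelectMany ... MergeSort, Take."""
    # SelectMany + HashDistribute: split lines into words, route by word hash.
    buckets: List[List[str]] = [[] for _ in range(n)]
    for part in partitions:
        for line in part:
            for word in line.split(" "):
                buckets[zlib.crc32(word.encode("utf-8")) % n].append(word)

    def order(pair: Tuple[str, int]) -> Tuple[int, str]:
        # Descending count, then word, so ties are resolved the same everywhere.
        return (-pair[1], pair[0])

    # Per bucket: GroupBy, Select (count), OrderByDescending, Take(k).
    local_tops = [sorted(Counter(bucket).items(), key=order)[:k]
                  for bucket in buckets]
    # MergeSort the already sorted local lists, then Take(k) again.
    return list(heapq.merge(*local_tops, key=order))[:k]


top = histogram_plan([["A line of words"], ["of wisdom"]], 3)
print(top)  # [('of', 2), ('A', 1), ('line', 1)]
```

Taking k items per bucket before the merge is safe because each word is counted in exactly one bucket, so any word in the global top k is also in its bucket's top k.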

  20. Map-Reduce in DryadLINQ
          public static IQueryable<S> MapReduce<T,M,K,S>(
              this IQueryable<T> input,
              Expression<Func<T, IEnumerable<M>>> mapper,
              Expression<Func<M,K>> keySelector,
              Expression<Func<IGrouping<K,M>,S>> reducer)
          {
              var map = input.SelectMany(mapper);
              var group = map.GroupBy(keySelector);
              var result = group.Select(reducer);
              return result;
          }

      Map-Reduce Plan
      [figure: three versions of the plan, (1) to (3). Vertex legend: M = map, Q = sort, G1 = groupby, R = reduce, D = distribute, MS = mergesort, G2 = groupby, X = consumer. The G1 and R vertices ahead of D are labeled partial aggregation; vertices labeled S, A, and T also appear.]
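For readers less familiar with LINQ, the three-operator shape of MapReduce above (SelectMany, GroupBy, Select) can be mirrored in Python. This is a local, in-memory sketch only; it does not model distribution or partial aggregation. The name `map_reduce` and the convention that the reducer receives a key and a list of values (standing in for `IGrouping<K,M>`) are choices made for this example.

```python
from typing import Callable, Dict, Hashable, Iterable, List, TypeVar

T = TypeVar("T")
M = TypeVar("M")
K = TypeVar("K", bound=Hashable)
S = TypeVar("S")


def map_reduce(inputs: Iterable[T],
               mapper: Callable[[T], Iterable[M]],
               key_selector: Callable[[M], K],
               reducer: Callable[[K, List[M]], S]) -> List[S]:
    # SelectMany: flatten the mapper's output for every input item.
    mapped = [m for item in inputs for m in mapper(item)]
    # GroupBy: collect mapped items under their key (first-seen key order).
    groups: Dict[K, List[M]] = {}
    for m in mapped:
        groups.setdefault(key_selector(m), []).append(m)
    # Select: apply the reducer to each group.
    return [reducer(key, values) for key, values in groups.items()]


word_counts = map_reduce(
    ["a b a", "b c"],
    mapper=lambda line: line.split(),
    key_selector=lambda word: word,
    reducer=lambda word, occurrences: (word, len(occurrences)),
)
print(word_counts)  # [('a', 2), ('b', 2), ('c', 1)]
```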

  21. Distributed Sorting in DryadLINQ
          public static IQueryable<TSource> DSort<TSource, TKey>(
              this IQueryable<TSource> source,
              Expression<Func<TSource, TKey>> keySelector,
              int pcount)
          {
              var samples = source.Apply(x => Sampling(x));
              var keys = samples.Apply(x => ComputeKeys(x, pcount));
              var parts = source.RangePartition(keySelector, keys);
              return parts.OrderBy(keySelector);
          }

      Distributed Sorting Plan
      [figure: three versions of the plan, (1) to (3). Vertex legend: DS = deterministic sampling, H = histogram, D = data partitioning, M = merge, S = sort.]
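The four steps of DSort (sample, compute split keys, range-partition, sort each range) can be followed in a compact Python sketch. The slides do not show `Sampling` or `ComputeKeys`, so the sampling rule here (every `sample_every`-th record) and the way split points are picked from the sorted sample are assumptions, as is the name `dsort`.

```python
from bisect import bisect_right
from typing import Callable, List, TypeVar

T = TypeVar("T")


def dsort(partitions: List[List[T]], key: Callable[[T], int],
          pcount: int, sample_every: int = 2) -> List[List[T]]:
    """Return pcount sorted partitions whose concatenation is globally sorted."""
    # Sampling (assumed rule): take every `sample_every`-th record per partition.
    samples = sorted(key(x) for part in partitions for x in part[::sample_every])
    # ComputeKeys (assumed rule): pcount - 1 evenly spaced split points.
    bounds = ([samples[len(samples) * i // pcount] for i in range(1, pcount)]
              if samples else [])
    # RangePartition: route each record to the range that contains its key.
    ranges: List[List[T]] = [[] for _ in range(pcount)]
    for part in partitions:
        for x in part:
            ranges[bisect_right(bounds, key(x))].append(x)
    # OrderBy: sort each range independently.
    return [sorted(r, key=key) for r in ranges]


parts = dsort([[9, 1, 5], [3, 8, 2], [7, 4, 6]], key=lambda x: x, pcount=3)
print(parts)  # [[1, 2, 3, 4], [5, 6], [7, 8, 9]]
```

Because every key in range i is smaller than every key in range i + 1, no final merge is needed; the quality of the sample only affects how evenly the ranges are filled.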

  22. Outline
      • Introduction
      • Dryad
      • DryadLINQ
      • Building on DryadLINQ

      Machine Learning in DryadLINQ
      [figure: stack of layers labeled Data analysis, Machine learning, Large Vector, DryadLINQ, Dryad]

  23. Very Large Vector Library
      [figure: PartitionedVector<T> drawn as several partitions of T; Scalar<T> drawn as a single T]

      Operations on Large Vectors: Map 1
      • f: T → U
      • f preserves partitioning

  24. Map 2 (Pairwise)
      [figure: f combines a vector of T and a vector of U into a vector of V]

      Map 3 (Vector-Scalar)
      [figure: f combines a vector of T and a U into a vector of V]
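The three map operations on large vectors can be sketched with a toy partitioned vector in Python. This is an illustration of the idea that each operation works partition by partition and leaves the partitioning unchanged; it is not the API of the large vector library, and the class and method names (`PartitionedVector`, `map`, `map_pairwise`, `map_scalar`) are hypothetical.

```python
from typing import Callable, Generic, List, TypeVar

T = TypeVar("T")
U = TypeVar("U")
V = TypeVar("V")


class PartitionedVector(Generic[T]):
    """A vector stored as a list of partitions (toy, in-memory stand-in)."""

    def __init__(self, partitions: List[List[T]]) -> None:
        self.partitions = partitions

    def map(self, f: Callable[[T], U]) -> "PartitionedVector[U]":
        # Map 1: apply f element-wise; partition boundaries are preserved.
        return PartitionedVector([[f(x) for x in p] for p in self.partitions])

    def map_pairwise(self, other: "PartitionedVector[U]",
                     f: Callable[[T, U], V]) -> "PartitionedVector[V]":
        # Map 2: combine matching elements; both vectors must be partitioned alike.
        if [len(p) for p in self.partitions] != [len(q) for q in other.partitions]:
            raise ValueError("vectors must have identical partitioning")
        return PartitionedVector([[f(a, b) for a, b in zip(p, q)]
                                  for p, q in zip(self.partitions, other.partitions)])

    def map_scalar(self, scalar: U, f: Callable[[T, U], V]) -> "PartitionedVector[V]":
        # Map 3: the same scalar is made available to every partition.
        return PartitionedVector([[f(x, scalar) for x in p] for p in self.partitions])


v = PartitionedVector([[1, 2], [3]])
w = PartitionedVector([[4, 5], [6]])
print(v.map(lambda x: x * 10).partitions)              # [[10, 20], [30]]
print(v.map_pairwise(w, lambda a, b: a + b).partitions)  # [[5, 7], [9]]
print(v.map_scalar(2, lambda x, s: x * s).partitions)    # [[2, 4], [6]]
```

Requiring identical partitioning in the pairwise case is what lets the operation run without moving data between partitions.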
