Dryad Shell Dryad: 2-D grep 1000 | sed 500 | sort 1000 | awk 500 - PDF document



  1. 11/10/2011

     Overview
     Finally, let us put things into perspective by looking at alternatives to MapReduce. We start with Dryad from Microsoft.
     • Michael Isard, Mihai Budiu, Yuan Yu, Andrew Birrell, and Dennis Fetterly. Dryad: distributed data-parallel programs from sequential building blocks. European Conference on Computer Systems (EuroSys), Lisbon, Portugal, March 21-23, 2007
     • Yuan Yu, Michael Isard, Dennis Fetterly, Mihai Budiu, Ulfar Erlingsson, Pradeep Kumar Gunda, and Jon Currey. DryadLINQ: A System for General-Purpose Distributed Data-Parallel Computing Using a High-Level Language. Symposium on Operating System Design and Implementation (OSDI), San Diego, CA, December 8-10, 2008
     • Presentation based on authors’ slides

     Outline
     • Dryad Design
     • Implementation
     • Policies as Plug-ins
     • Building on Dryad

     Design Space
     [Figure: design space. Labels: Internet, Private data center, Data-parallel, Shared memory, Latency, Throughput]

     2-D Piping
     • Unix Pipes: 1-D
       grep | sed | sort | awk | perl
     • Dryad: 2-D
       grep 1000 | sed 500 | sort 1000 | awk 500 | perl 50

     Dryad = Execution Layer
     Job (Application) ≈ Pipeline
     Dryad ≈ Shell
     Cluster ≈ Machine
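The 2-D piping idea above can be illustrated with a small single-process simulation. This is only a sketch in Python, not Dryad code: the names `run_stage` and `run_pipeline`, the round-robin re-partitioning between stages, and the tiny stage widths are all assumptions made for illustration. Each stage of the pipeline runs as several independent copies ("vertices"), which is what the numbers in `grep 1000 | sed 500 | sort 1000` express.

```python
from typing import Callable, List, Tuple

# A vertex is a sequential program: it takes a list of items and returns a list.
Vertex = Callable[[List[str]], List[str]]


def run_stage(partitions: List[List[str]], vertex: Vertex, width: int) -> List[List[str]]:
    """Run `width` copies of `vertex`, one per partition.

    Assumption: data is re-partitioned round-robin between stages.
    A real system would choose the distribution per connection.
    """
    items = [x for part in partitions for x in part]
    parts = [items[i::width] for i in range(width)]
    return [vertex(part) for part in parts]


def run_pipeline(lines: List[str], stages: List[Tuple[Vertex, int]]) -> List[List[str]]:
    """Run a 2-D pipeline: a sequence of stages, each with its own width."""
    partitions = [list(lines)]
    for vertex, width in stages:
        partitions = run_stage(partitions, vertex, width)
    return partitions


def grep_like(part: List[str]) -> List[str]:
    return [line for line in part if "error" in line]


def sed_like(part: List[str]) -> List[str]:
    return [line.replace("error: ", "") for line in part]


def sort_like(part: List[str]) -> List[str]:
    return sorted(part)


# Analogue of "grep 4 | sed 2 | sort 1" on five input lines.
log = ["error: disk", "ok", "error: net", "ok", "error: cpu"]
result = run_pipeline(log, [(grep_like, 4), (sed_like, 2), (sort_like, 1)])
```

With a final stage of width 1, `result` holds a single sorted partition. With a wider final stage, each partition would be sorted only locally.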

  2. Outline
     • Dryad Design
     • Implementation
     • Policies as Plug-ins
     • Building on Dryad

     Virtualized 2-D Pipelines
     [Figure: animation built up over six slides]
     • 2D DAG
     • multi-machine
     • virtualized

  3. Dryad Job Structure
     grep 1000 | sed 500 | sort 1000 | awk 500 | perl 50
     [Figure: job graph. Labels: Input files, Stage, Vertices (processes), Channels, Output files]

     Channels
     Finite streams of items
     • distributed filesystem files (persistent)
     • SMB/NTFS files (temporary)
     • TCP pipes (inter-machine)
     • memory FIFOs (intra-machine)
     [Figure: vertex X sends items over a channel to vertex M]

     Architecture
     [Figure: Labels: data plane (Files, TCP, FIFO, Network), vertices (V), control plane, Job manager, job schedule, NS, PD, cluster]

     Staging
     1. Build
     2. Send .exe
     3. Start JM
     4. Query cluster resources
     5. Generate graph
     6. Initialize vertices
     7. Serialize vertices
     8. Monitor Vertex execution
     [Figure labels: JM code, vertex code, Cluster services]

     Policy Managers
     [Figure: Stage R, Stage X, Connection R-X. The Job Manager holds an R manager, an X Manager, and an R-X Manager]

     Outline
     • Dryad Design
     • Implementation
     • Policies and Resource Management
     • Building on Dryad
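The job structure above (vertices connected by channels, scheduled by a job manager) can be sketched as a small DAG executor. This is a hedged Python illustration, not Dryad's implementation: `run_job` and the vertex names are hypothetical, channels are modeled as plain in-memory lists, and the "job manager" simply runs vertices in topological order once all of their input channels are complete.

```python
from graphlib import TopologicalSorter
from typing import Callable, Dict, List, Tuple

# A vertex consumes a list of input channels (each a finite list of items)
# and produces one output list.
VertexFn = Callable[[List[List[str]]], List[str]]


def run_job(
    vertices: Dict[str, VertexFn],
    channels: List[Tuple[str, str]],
    inputs: Dict[str, List[str]],
) -> Dict[str, List[str]]:
    """Run every vertex after all of its predecessors have finished."""
    preds: Dict[str, List[str]] = {name: [] for name in vertices}
    for src, dst in channels:
        preds[dst].append(src)

    # TopologicalSorter takes a mapping node -> predecessors.
    order = TopologicalSorter({v: set(p) for v, p in preds.items()}).static_order()

    outputs: Dict[str, List[str]] = {}
    for name in order:
        if preds[name]:
            incoming = [outputs[p] for p in preds[name]]
        else:
            incoming = [inputs.get(name, [])]  # source vertex reads an input file
        outputs[name] = vertices[name](incoming)
    return outputs


def grep_a(chans: List[List[str]]) -> List[str]:
    return [x for chan in chans for x in chan if "a" in x]


def merge_sort(chans: List[List[str]]) -> List[str]:
    return sorted(x for chan in chans for x in chan)


# Two grep vertices feed one sort vertex.
job_outputs = run_job(
    vertices={"grep0": grep_a, "grep1": grep_a, "sort": merge_sort},
    channels=[("grep0", "sort"), ("grep1", "sort")],
    inputs={"grep0": ["banana", "kiwi"], "grep1": ["apple", "plum"]},
)
```

Because every channel is a finite stream, a vertex can start only when its inputs are available. In this sketch that means "after the producer has finished"; a pipelined channel such as a TCP pipe or FIFO would allow overlap, which is not modeled here.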

  4. Aggregation Manager
     [Figure: static plan: vertices S feed T directly. Dynamic plan: each S is tagged with its rack (#1, #2, #3), one aggregator A is inserted per rack, and the A vertices feed T]

     Duplicate Execution Manager
     [Figure: stage X with vertices X[0], X[1], X[3] completed, X[2] slow, and a duplicate vertex X’[2]]
     Duplication Policy = f(running times, data volumes)

     Data Distribution (Group By)
     [Figure: m Source vertices, n Dest vertices, m x n channels]

     Range-Distribution Manager
     [Figure: static plan: S vertices with keys in [0-100) feed D vertices, which feed two T vertices with ranges [0-?) and [?-100). Dynamic plan: a Hist vertex computes the ranges [0-30), [30-100), which are assigned to the T vertices]

     Goal: Declarative Programming
     [Figure: static and dynamic versions of a graph with X, S, and T vertices]

     Outline
     • Dryad Design
     • Implementation
     • Policies as Plug-ins
     • Building on Dryad
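The slides state only that the duplication policy is some function f of running times and data volumes. The Python sketch below shows one possible shape of such a function. It is purely an assumption for illustration: the name `should_duplicate`, the median-based estimate, and the `slowdown` threshold are not taken from Dryad.

```python
from statistics import median
from typing import List, Tuple


def should_duplicate(
    elapsed: float,
    input_bytes: int,
    completed: List[Tuple[float, int]],
    slowdown: float = 2.0,
) -> bool:
    """Decide whether a running vertex looks slow enough to duplicate.

    completed: (running_time, input_bytes) of finished vertices in the same stage.
    Hypothetical policy: estimate seconds-per-byte from the finished vertices,
    scale it by this vertex's data volume, and duplicate when the vertex has
    already run `slowdown` times longer than that estimate.
    """
    if not completed:
        return False  # nothing to compare against yet
    rates = [t / max(b, 1) for t, b in completed]
    expected = median(rates) * max(input_bytes, 1)
    return elapsed > slowdown * expected


finished = [(10.0, 100), (12.0, 100), (11.0, 100)]
slow_small = should_duplicate(elapsed=30.0, input_bytes=100, completed=finished)
slow_large = should_duplicate(elapsed=30.0, input_bytes=300, completed=finished)
```

Taking the data volume into account is the point of the second argument: a vertex that has run for 30 seconds is suspicious when it has the same input size as its peers, but not when it has three times as much input.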

  5. Software Stack
     [Figure: layered stack, bottom to top: Windows Server machines; Cluster Services; Distributed Filesystem, CIFS/NTFS; Dryad; Distributed Shell, DryadLINQ, C++, SQL server, SSIS, PSQL, Perl, C#; applications: Machine Learning, Queries, Vectors, legacy code (sed, awk, grep, etc.); alongside: Job queueing, monitoring]

     Example Query: Sky Server
     • Table photoPrimary
       – All identified astronomical objects (354,254,163 records)
       – ID, color magnitude in 5 bands (u, g, r, i, z)
     • Table neighbors
       – For each object, neighbors within 30 arc seconds (2,803,165,372 records)
     • Query 18: gravitational lens effect
       – Find all objects that have neighbors whose color is similar to that object

     SkyServer Query 18
         select distinct U.ObjID
         from photoPrimary U,
              neighbors N,
              photoPrimary L
         where U.ObjID = N.ObjID
           and U.mode = 1
           and L.ObjID = N.NeighborObjID
           and U.ObjID < L.ObjID
           and abs((U.u-U.g)-(L.u-L.g))<0.05
           and abs((U.g-U.r)-(L.g-L.r))<0.05
           and abs((U.r-U.i)-(L.r-L.i))<0.05
           and abs((U.i-U.z)-(L.i-L.z))<0.05

     SkyServer DB query
     • Took SQL plan
     • Manually coded in Dryad
     • Manually partitioned data
     [Figure: Dryad job graph with stages X, D, M, S, Y, H over inputs U and N, with stage widths n and 4n. Annotations, bottom to top:]
     u: objid, color
     n: objid, neighborobjid
     [partition by objid]
         select u.color, n.neighborobjid
         from u join n
         where u.objid = n.objid
     [re-partition by n.neighborobjid]
     [order by n.neighborobjid]
         select u.objid
         from u join <temp>
         where u.objid = <temp>.neighborobjid
           and |u.color - <temp>.color| < d
     [merge outputs]
     [distinct]

     Optimization
     [Figure, two slides: successive rewrites of the job graph with stages X, D, M, S, Y, U, H and widths n and 4n]
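The annotated plan above (partition by objid, join, re-partition by neighborobjid, join again, merge, distinct) can be traced on toy data. The Python sketch below is an illustration under stated assumptions, not the authors' Dryad job: `partition` and `q18_plan` are hypothetical names, partitioning is a simple modulo on integer ids, and "color" is reduced to a single number as in the plan annotations.

```python
from typing import Callable, List, Tuple

URow = Tuple[int, float]  # (objid, color)
NRow = Tuple[int, int]    # (objid, neighborobjid)


def partition(rows: List[tuple], key: Callable[[tuple], int], p: int) -> List[List[tuple]]:
    """Split rows into p partitions by key modulo p (an assumed scheme)."""
    parts: List[List[tuple]] = [[] for _ in range(p)]
    for row in rows:
        parts[key(row) % p].append(row)
    return parts


def q18_plan(u: List[URow], n: List[NRow], d: float, p: int = 2) -> List[int]:
    # [partition by objid]
    u_parts = partition(u, lambda r: r[0], p)
    n_parts = partition(n, lambda r: r[0], p)

    # select u.color, n.neighborobjid from u join n where u.objid = n.objid
    temp: List[Tuple[float, int]] = []
    for u_part, n_part in zip(u_parts, n_parts):
        color = dict(u_part)
        temp += [(color[obj], nb) for obj, nb in n_part if obj in color]

    # [re-partition by n.neighborobjid]; u is already partitioned by objid
    t_parts = partition(temp, lambda r: r[1], p)

    # select u.objid from u join <temp>
    # where u.objid = <temp>.neighborobjid and |u.color - <temp>.color| < d
    result = set()
    for u_part, t_part in zip(u_parts, t_parts):
        color = dict(u_part)
        result |= {nb for c, nb in t_part if nb in color and abs(color[nb] - c) < d}

    # [merge outputs], [distinct]
    return sorted(result)


objects = [(1, 0.50), (2, 0.52), (3, 0.90)]
neighbors = [(1, 2), (2, 1), (1, 3), (3, 1)]
similar = q18_plan(objects, neighbors, d=0.05)
```

Objects 1 and 2 are neighbors with nearly equal color, so both appear in `similar`; object 3 is a neighbor of 1 but its color differs by 0.4. The second join needs no shuffle of `u` because both sides end up partitioned on the same id.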

  6. SkyServer Q18 Performance
     [Figure: Speed-up (times), 0.0 to 16.0, versus Number of Computers, 0 to 10. Series: Dryad In-Memory, Dryad Two-pass, SQLServer 2005]

     DryadLINQ
     • Declarative programming
     • Integration with Visual Studio
     • Integration with .Net
     • Type safety
     • Automatic serialization
     • Job graph optimizations
        static
        dynamic
     • Conciseness

     LINQ
         Collection<T> collection;
         bool IsLegal(Key k);
         string Hash(Key);

         var results = from c in collection
                       where IsLegal(c.key)
                       select new { Hash(c.key), c.value};

     DryadLINQ = LINQ + Dryad
     [Figure: the same query is compiled into vertex code and a query plan (Dryad job); C# vertices process the partitions of the data collection and produce results]

     Data Model
     [Figure: a Collection is divided into Partitions, each holding C# objects]

     Query Providers
     [Figure: on the client machine, a C# Query Expr is passed to DryadLINQ (ToDryadTable, Invoke). In the data center, the distributed query plan is run by the Dryad JM; Dryad Execution reads the Input Tables and writes the Output Tables. The Output DryadTable is returned to the client, where foreach yields the Results as C# Objects]
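The LINQ query above is a filter followed by a projection, which is why it can run unchanged on each partition of a collection. The Python sketch below mirrors that shape. It is an analogy only: `is_legal` and `hash_key` are hypothetical stand-ins for the slide's undefined `IsLegal` and `Hash`, and `query_partitioned` is an assumed name for per-partition execution followed by concatenation.

```python
import hashlib
from typing import List, Tuple

Row = Tuple[str, int]  # (key, value)


def is_legal(key: str) -> bool:
    # Hypothetical predicate standing in for IsLegal on the slide.
    return key.isalnum()


def hash_key(key: str) -> str:
    # Hypothetical hash standing in for Hash on the slide.
    return hashlib.sha256(key.encode("utf-8")).hexdigest()[:8]


def query(collection: List[Row]) -> List[Tuple[str, int]]:
    """from c in collection where IsLegal(c.key) select {Hash(c.key), c.value}"""
    return [(hash_key(k), v) for k, v in collection if is_legal(k)]


def query_partitioned(partitions: List[List[Row]]) -> List[Tuple[str, int]]:
    """Run the same query independently on each partition, then concatenate."""
    return [row for part in partitions for row in query(part)]


records = [("alice", 1), ("bad key", 2), ("bob", 3)]
sequential = query(records)
distributed = query_partitioned([records[:2], records[2:]])
```

Because neither the filter nor the projection looks at more than one record, splitting the input changes nothing about the result. Operators such as GroupBy or OrderBy do not have this property and need data redistribution.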

  7. Example: Histogram
         public static IQueryable<Pair> Histogram(
             IQueryable<LineRecord> input, int k)
         {
             var words = input.SelectMany(x => x.line.Split(' '));
             var groups = words.GroupBy(x => x);
             var counts = groups.Select(x => new Pair(x.Key, x.Count()));
             var ordered = counts.OrderByDescending(x => x.count);
             var top = ordered.Take(k);
             return top;
         }

     Intermediate values for one input line:
         “A line of words of wisdom”
         [“A”, “line”, “of”, “words”, “of”, “wisdom”]
         [[“A”], [“line”], [“of”, “of”], [“words”], [“wisdom”]]
         [{“A”, 1}, {“line”, 1}, {“of”, 2}, {“words”, 1}, {“wisdom”, 1}]
         [{“of”, 2}, {“A”, 1}, {“line”, 1}, {“words”, 1}, {“wisdom”, 1}]
         [{“of”, 2}, {“A”, 1}, {“line”, 1}]

     Histogram Plan
     [Figure: SelectMany, HashDistribute, Merge, GroupBy, Select, OrderByDescending, Take, MergeSort, Take]

     Map-Reduce in DryadLINQ
         public static IQueryable<S> MapReduce<T,M,K,S>(
             this IQueryable<T> input,
             Expression<Func<T, IEnumerable<M>>> mapper,
             Expression<Func<M,K>> keySelector,
             Expression<Func<IGrouping<K,M>,S>> reducer)
         {
             var map = input.SelectMany(mapper);
             var group = map.GroupBy(keySelector);
             var result = group.Select(reducer);
             return result;
         }

     Map-Reduce Plan
     [Figure: three versions of the plan, (1), (2), (3). Legend: M map, Q sort, G1 groupby, R reduce, D distribute, MS mergesort, G2 groupby, R reduce, X consumer. The M, Q, G1, R, D vertices perform partial aggregation. Other labels: S, A, T, static, dynamic]

     Distributed Sorting in DryadLINQ
         public static IQueryable<TSource>
         DSort<TSource, TKey>(this IQueryable<TSource> source,
             Expression<Func<TSource, TKey>> keySelector,
             int pcount)
         {
             var samples = source.Apply(x => Sampling(x));
             var keys = samples.Apply(x => ComputeKeys(x, pcount));
             var parts = source.RangePartition(keySelector, keys);
             return parts.OrderBy(keySelector);
         }

     Distributed Sorting Plan
     [Figure: three versions of the plan, (1), (2), (3). Legend: DS Deterministic Sampling, H Histogram, D Data partitioning, M Merge, S Sort. Other label: O]
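The Histogram and MapReduce listings above compose the same three operators: SelectMany, GroupBy, and Select. The Python sketch below re-expresses both on in-memory lists to make the data flow concrete. It is not DryadLINQ: `map_reduce` and `histogram` are illustrative names, groups are kept in first-seen order (an assumption that happens to reproduce the slide's example), and nothing is distributed.

```python
from typing import Callable, Dict, Iterable, List, Tuple, TypeVar

T = TypeVar("T")
M = TypeVar("M")
K = TypeVar("K")
S = TypeVar("S")


def map_reduce(
    items: Iterable[T],
    mapper: Callable[[T], Iterable[M]],
    key_selector: Callable[[M], K],
    reducer: Callable[[K, List[M]], S],
) -> List[S]:
    """SelectMany(mapper), then GroupBy(keySelector), then Select(reducer)."""
    mapped = [m for item in items for m in mapper(item)]
    groups: Dict[K, List[M]] = {}
    for m in mapped:
        groups.setdefault(key_selector(m), []).append(m)
    return [reducer(key, group) for key, group in groups.items()]


def histogram(lines: Iterable[str], k: int) -> List[Tuple[str, int]]:
    """Top-k word counts, following the steps of the Histogram listing."""
    counts = map_reduce(
        lines,
        mapper=lambda line: line.split(" "),
        key_selector=lambda word: word,
        reducer=lambda word, group: (word, len(group)),
    )
    # Stable sort: words with equal counts keep their first-seen order.
    ordered = sorted(counts, key=lambda pair: pair[1], reverse=True)
    return ordered[:k]


top3 = histogram(["A line of words of wisdom"], 3)
```

For the slide's input line, `top3` is `[("of", 2), ("A", 1), ("line", 1)]`, matching the last row of the worked example.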
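The DSort listing has four steps: sample each partition, compute range boundaries from the samples, range-partition the data, and sort each range. The Python sketch below walks through those steps on lists of integers. The helper names echo the listing, but their bodies are assumptions: the slides do not show how `Sampling` or `ComputeKeys` are implemented, so a fixed-stride sample and evenly spaced quantiles are used here.

```python
import bisect
from typing import List


def sampling(partition: List[int], step: int = 2) -> List[int]:
    """Deterministic sample: every `step`-th element of the sorted partition."""
    return sorted(partition)[::step]


def compute_keys(samples: List[int], pcount: int) -> List[int]:
    """Pick pcount - 1 boundaries at evenly spaced positions of the samples."""
    ordered = sorted(samples)
    if not ordered:
        return []
    return [ordered[len(ordered) * i // pcount] for i in range(1, pcount)]


def range_partition(partitions: List[List[int]], boundaries: List[int]) -> List[List[int]]:
    """Send each item to the range that contains it."""
    out: List[List[int]] = [[] for _ in range(len(boundaries) + 1)]
    for part in partitions:
        for x in part:
            out[bisect.bisect_right(boundaries, x)].append(x)
    return out


def dsort(partitions: List[List[int]], pcount: int) -> List[List[int]]:
    samples = [s for part in partitions for s in sampling(part)]
    keys = compute_keys(samples, pcount)
    parts = range_partition(partitions, keys)
    return [sorted(part) for part in parts]  # OrderBy within each range


data = [[9, 1, 5], [7, 3, 8], [2, 6, 4]]
sorted_parts = dsort(data, pcount=3)
flattened = [x for part in sorted_parts for x in part]
```

Because the ranges are disjoint and ordered, concatenating the locally sorted partitions gives a globally sorted sequence without a final merge. How evenly the ranges are filled depends on how representative the samples are.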

  8. Outline
     • Introduction
     • Dryad
     • DryadLINQ
     • Building on DryadLINQ

     Machine Learning in DryadLINQ
     [Figure: stack, top to bottom: Data analysis, Machine learning, Large Vector, DryadLINQ, Dryad]

     Very Large Vector Library
     • PartitionedVector<T>
     • Scalar<T>

     Operations on Large Vectors: Map 1
     [Figure: f maps a vector of T to a vector of U, partition by partition]
     f preserves partitioning

     Map 2 (Pairwise)
     [Figure: f combines a vector of T and a vector of U into a vector of V]

     Map 3 (Vector-Scalar)
     [Figure: f combines a vector of T and a scalar U into a vector of V]
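The three map operations on large vectors can be sketched with a small Python class. This is an illustrative model, not the library from the slides: only the type name `PartitionedVector` comes from the source, while the method names `map`, `map_pairwise`, and `map_scalar` and the requirement that pairwise operands share the same partition sizes are assumptions.

```python
from typing import Callable, Generic, List, TypeVar

T = TypeVar("T")
U = TypeVar("U")
V = TypeVar("V")


class PartitionedVector(Generic[T]):
    """A vector stored as a list of partitions (plain lists in this sketch)."""

    def __init__(self, partitions: List[List[T]]) -> None:
        self.partitions = [list(p) for p in partitions]

    def map(self, f: Callable[[T], U]) -> "PartitionedVector[U]":
        """Map 1: apply f element-wise; the partitioning is preserved."""
        return PartitionedVector([[f(x) for x in p] for p in self.partitions])

    def map_pairwise(
        self, other: "PartitionedVector[U]", f: Callable[[T, U], V]
    ) -> "PartitionedVector[V]":
        """Map 2: combine matching elements of two identically partitioned vectors."""
        sizes = [len(p) for p in self.partitions]
        if sizes != [len(p) for p in other.partitions]:
            raise ValueError("vectors must have the same partitioning")
        return PartitionedVector(
            [[f(x, y) for x, y in zip(p, q)] for p, q in zip(self.partitions, other.partitions)]
        )

    def map_scalar(self, scalar: U, f: Callable[[T, U], V]) -> "PartitionedVector[V]":
        """Map 3: combine every element with one scalar sent to every partition."""
        return PartitionedVector([[f(x, scalar) for x in p] for p in self.partitions])


a = PartitionedVector([[1, 2], [3]])
b = PartitionedVector([[10, 20], [30]])
doubled = a.map(lambda x: 2 * x)
summed = a.map_pairwise(b, lambda x, y: x + y)
shifted = a.map_scalar(100, lambda x, s: x + s)
```

All three operations work partition by partition, so none of them needs to move vector data between machines. Only the scalar in the third case has to be replicated to every partition.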
