An Introduction to DryadLINQ. Christophe Poulain, Microsoft Research.



SLIDE 1

An Introduction to DryadLINQ

Christophe Poulain

Microsoft Research

Microsoft Research Virtual School of Computational Science and Engineering, Big Data For Science Course, July 28, 2010

SLIDE 2

The Fourth Paradigm: Data-Intensive Science

http://research.microsoft.com/fourthparadigm

Scientific discovery is increasingly driven by the exploration of large amounts of data from many sources. Scientific breakthroughs will be powered by advanced computing capabilities that help researchers manipulate and explore massive datasets.

SLIDE 3

Powered by powerful multi-core workstations, readily available commodity clusters, and cloud services platforms, data-intensive computing is increasingly prevalent. Programming data analyses that scale from the desktop to a large number of compute nodes remains challenging.


112 containers x 2,000 servers = 224,000 servers

SLIDE 4

Research programming models for writing distributed data-parallel applications that scale from a small cluster to a large data-center.

Dryad and DryadLINQ

A DryadLINQ programmer can use thousands of machines, each with multiple processors or cores, without prior knowledge of parallel programming.

SLIDE 5

Availability

Dryad/DryadLINQ on Windows HPC 2008 (SP1) is available as a free download from:

http://research.microsoft.com/collaboration/tools/dryad.aspx

– DryadLINQ (in source) & Dryad (in binary)
– With tutorials, programming guides, sample code, libraries, and a community site: http://connect.microsoft.com/dryad
– Windows HPC Server licenses freely available through your department’s subscription to MSDN Academic Alliance

SLIDE 6

Outline

  • DryadLINQ programming model
  • Dryad and DryadLINQ overview
  • Applications
SLIDE 7

Dryad, DryadLINQ, LINQ Experience: Use a cluster as if it were a single computer

  • Sequential, single machine programming abstraction
  • Same program runs on single-core, multi-core, or cluster
  • Familiar programming languages
– C#, VB, F#, IronPython…
  • Familiar development environment
– .NET, Visual Studio or other IDE
SLIDE 8

LINQ

  • Microsoft’s Language INtegrated Query

– Released with .NET Framework 3.5, Visual Studio optional

  • A set of operators to manipulate datasets in .NET

– Supports traditional relational operators (Select, Join, GroupBy, Aggregate, etc.)
– Integrated into .NET programming languages
  • Programs can call operators
  • Operators can invoke arbitrary .NET functions

  • Data model

– Data elements are strongly typed .NET objects
– Much more expressive than SQL tables

  • Extremely extensible

– Add new custom operators
– Add new execution providers

SLIDE 9

Example of a LINQ Query

IEnumerable<string> logs = GetLogLines();
var logentries = from line in logs
                 where !line.StartsWith("#")
                 select new LogEntry(line);
var user = from access in logentries
           where access.user.EndsWith(@"\ulfar")
           select access;
var accesses = from access in user
               group access by access.page into pages
               select new UserPageCount("ulfar", pages.Key, pages.Count());
var htmAccesses = from access in accesses
                  where access.page.EndsWith(".htm")
                  orderby access.count descending
                  select access;

Go through logs and keep only lines that are not comments. Parse each line into a LogEntry object. Go through logentries and keep only entries that are accesses by ulfar. Group ulfar’s accesses according to what page they correspond to. For each page, count the occurrences. Sort the pages ulfar has accessed according to access frequency.

SLIDE 10

DryadLINQ Data Model

Partitions of .NET objects form a PartitionedTable<T>.

PartitionedTable<T> implements IQueryable<T> and IEnumerable<T>. PartitionedTable exposes metadata information:

  • type, partition, compression scheme, etc.
SLIDE 11

A complete DryadLINQ program

public class LogEntry {
    public string user;
    public string ip;
    public string page;
    public LogEntry(string line) {
        string[] fields = line.Split(' ');
        this.user = fields[8];
        this.ip = fields[9];
        this.page = fields[5];
    }
}

public class UserPageCount {
    public string user;
    public string page;
    public int count;
    public UserPageCount(string user, string page, int count) {
        this.user = user;
        this.page = page;
        this.count = count;
    }
}

PartitionedTable<string> logs = PartitionedTable.Get<string>(
    @"file:\\MSR-SCR-DRYAD01\DryadData\cpoulain\logfile.pt");
var logentries = from line in logs
                 where !line.StartsWith("#")
                 select new LogEntry(line);
var user = from access in logentries
           where access.user.EndsWith(@"\ulfar")
           select access;
var accesses = from access in user
               group access by access.page into pages
               select new UserPageCount("ulfar", pages.Key, pages.Count());
var htmAccesses = from access in accesses
                  where access.page.EndsWith(".htm")
                  orderby access.count descending
                  select access;
htmAccesses.ToPartitionedTable(
    @"file:\\MSR-SCR-DRYAD01\DryadData\cpoulain\results.pt");

SLIDE 12
  • Executing the log query

DryadLINQ for Dryad on Windows Server 2008 HPC Cluster

SLIDE 13

MapReduce in DryadLINQ

MapReduce(source,       // sequence of Ts
          mapper,       // T -> Ms
          keySelector,  // M -> K
          reducer)      // (K, Ms) -> Rs
{
    var map = source.SelectMany(mapper);
    var group = map.GroupBy(keySelector);
    var result = group.SelectMany(reducer);
    return result;      // sequence of Rs
}
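The three-operator composition above can also be sketched outside .NET; here is a rough Python analogue of the same SelectMany / GroupBy / SelectMany pipeline (illustrative only, not part of DryadLINQ):

```python
from itertools import groupby

def map_reduce(source, mapper, key_selector, reducer):
    # SelectMany(mapper): flatten each T into a sequence of Ms
    mapped = [m for t in source for m in mapper(t)]
    # GroupBy(keySelector): itertools.groupby requires sorted input
    mapped.sort(key=key_selector)
    groups = ((k, list(ms)) for k, ms in groupby(mapped, key=key_selector))
    # SelectMany(reducer): each (K, Ms) group yields a sequence of Rs
    return [r for k, ms in groups for r in reducer(k, ms)]

# Word count expressed as MapReduce:
docs = ["the quick fox", "the lazy dog"]
counts = map_reduce(docs,
                    mapper=str.split,
                    key_selector=lambda w: w,
                    reducer=lambda k, ws: [(k, len(ws))])
# counts contains ("the", 2), ("dog", 1), ("fox", 1), ...
```

The point of the slide is exactly this: MapReduce is not a primitive in DryadLINQ, just a four-line composition of existing operators.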

SLIDE 14

Outline

  • DryadLINQ programming model
  • Dryad and DryadLINQ overview
  • Applications
SLIDE 15

Software Stack

[Diagram, layered: applications (Image Processing, Machine Learning, Graph Analysis, Data Mining, Other Applications) are written in DryadLINQ or other languages; they run on Dryad, over cluster storage (Cosmos DFS, Azure Storage, SQL Servers, CIFS/NTFS) and cluster services (Azure, HPC, or Cosmos), on Windows Server machines.]

SLIDE 16

Dryad

  • Provides a general, flexible execution layer

– Dataflow graph as the computation model
– Higher language layer supplies graph, vertex code, channel types, hints for data locality, …

  • Automatically handles execution

– Distributes code, routes data
– Schedules processes on machines near data
– Masks failures in cluster and network
– Fair scheduling of concurrent jobs

SLIDE 17

Dryad Job Structure

[Diagram: input files flow through stages of vertices (processes such as grep, sed, sort, awk, perl) connected by channels to output files.]

A channel is a finite stream of items:

  • NTFS files (temporary)
  • TCP pipes (inter-machine)
  • Memory FIFOs (intra-machine)
SLIDE 18

Dryad System Architecture

[Diagram: the job manager holds the scheduler and per-job state (Job1: v11, v12, …; Job2: v21, v22, …; Job3: …) and accepts new jobs; over the control plane it directs process daemons (PD) on cluster machines, which host vertices (V); vertices exchange data over the data plane through files, TCP, and FIFOs.]

SLIDE 19
  • Fault tolerance

DryadLINQ for Dryad on Windows Server 2008 HPC Cluster

SLIDE 20

Fault Tolerance

SLIDE 21

Consider an embarrassingly parallel problem

public static Pair<int, string> DoWork(int index)
{
    System.Threading.Thread.Sleep(200);
    return new Pair<int, string>(index, System.Environment.MachineName);
}

public static void Main(string[] args)
{
    int count = 50;
    var seeds = Enumerable.Range(1, count);
    var pairs = from seed in seeds
                select DoWork(seed);
    foreach (Pair<int, string> pair in pairs)
    {
        Console.WriteLine("{0} => {1}", pair.Key, pair.Value.ToString());
    }
}

SLIDE 22

An embarrassingly parallel problem: Many cores, one machine with PLINQ

public static Pair<int, string> DoWork(int index)
{
    System.Threading.Thread.Sleep(200);
    return new Pair<int, string>(index, System.Environment.MachineName);
}

public static void Main(string[] args)
{
    int count = 50;
    var seeds = Enumerable.Range(1, count);
    var pairs = from seed in seeds.AsParallel()
                select DoWork(seed);
    foreach (Pair<int, string> pair in pairs)
    {
        Console.WriteLine("{0} => {1}", pair.Key, pair.Value.ToString());
    }
}
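For readers without .NET at hand, the effect of the one-word AsParallel() change can be mimicked with a thread pool. A hypothetical Python sketch of the same 50-task experiment (the machine-name string is a placeholder):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def do_work(index):
    # Stand-in for the slide's DoWork: sleep briefly, then report a result pair.
    time.sleep(0.05)
    return (index, "machine-name-placeholder")

count = 50
seeds = range(1, count + 1)

start = time.time()
with ThreadPoolExecutor(max_workers=8) as pool:
    # pool.map plays the role of seeds.AsParallel():
    # same ordered results, computed concurrently.
    pairs = list(pool.map(do_work, seeds))
elapsed = time.time() - start

# Sequentially this would take 50 * 0.05 s = 2.5 s;
# with 8 workers it completes in roughly ceil(50/8) * 0.05 s.
```

As with PLINQ, the sequential and parallel versions produce the same results; only the execution strategy changes.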

SLIDE 23

An embarrassingly parallel problem: Many cores, many machines with DryadLINQ (& PLINQ)

public static Pair<int, string> DoWork(int index)
{
    System.Threading.Thread.Sleep(2000);
    return new Pair<int, string>(index, System.Environment.MachineName);
}

public static void Main(string[] args)
{
    int count = 50;
    var seeds = Enumerable.Range(1, count);
    int[] ranges = seeds.Take(count - 1).ToArray();
    var pairs = from seed in seeds.ToPartitionedTable("tmp.pt")
                                  .RangePartition(i => i, ranges)
                select DoWork(seed);
    foreach (Pair<int, string> pair in pairs)
    {
        Console.WriteLine("{0} => {1}", pair.Key, pair.Value.ToString());
    }
}

SLIDE 24

An embarrassingly parallel problem: Simulating errors on the cluster

Substitute DoWorkAndSimulateFailure for DoWork:

private static Random RANDOM = new Random();

public static Pair<int, string> DoWorkAndSimulateFailure(int index)
{
    if (RANDOM.NextDouble() < 0.1)
    {
        throw new Exception("My program failed.");
    }
    System.Threading.Thread.Sleep(200);
    return new Pair<int, string>(index, System.Environment.MachineName);
}

Will the program successfully finish when we run it? Let us see…
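The reason the run can still finish is that Dryad simply re-executes a failed vertex. A minimal sketch of that retry behavior (hypothetical Python; this is not Dryad's actual scheduler, and the machine-name string is a placeholder):

```python
import random

def do_work_and_simulate_failure(index, rng):
    # Mirrors the slide's code: fail ~10% of the time, otherwise return a pair.
    if rng.random() < 0.1:
        raise Exception("My program failed.")
    return (index, "machine-name-placeholder")

def run_vertex(task, arg, rng, max_retries=10):
    # A failed vertex is simply re-executed; this is how Dryad masks such faults.
    for _ in range(max_retries):
        try:
            return task(arg, rng)
        except Exception:
            continue
    raise RuntimeError("vertex failed after %d attempts" % max_retries)

rng = random.Random(42)  # seeded so the sketch is reproducible
pairs = [run_vertex(do_work_and_simulate_failure, i, rng) for i in range(1, 51)]
```

With a 10% failure rate per attempt, the chance that any one task fails ten times in a row is negligible, so all 50 results are produced despite the injected faults.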

SLIDE 25

DryadLINQ: Friendly programming API for Dryad

[Diagram: on the local machine, a .NET program (C#, VB, F#, etc.) builds a query; through the LINQ provider interface, query objects are routed to an execution engine: single-core evaluation, multi-core PLINQ, LINQ-to-SQL, LINQ-to-XML, or DryadLINQ on a cluster for scalability. DryadLINQ leverages LINQ’s extensibility.]

SLIDE 26

Collection<T> collection;
bool IsLegal(Key k);
string Hash(Key);

var results = from c in collection
              where IsLegal(c.key)
              select new { Hash(c.key), c.value };

[Diagram: the DryadLINQ provider compiles the query into a query plan (a Dryad job) plus C# vertex code; data flows from collection through the C# vertices to results.]

SLIDE 27

Example: Word Count

Count word frequency in a set of documents:

var docs = new PartitionedTable<Doc>("dfs://yuan/docs");
var words = docs.SelectMany(doc => doc.words);
var groups = words.GroupBy(word => word);
var counts = groups.Select(g => new WordCount(g.Key, g.Count()));
counts.ToTable("dfs://yuan/counts.txt");

[Plan: IN (metadata) → SM (doc => doc.words) → GB (word => word) → S (g => new …) → OUT (metadata)]

SLIDE 28

Distributed Execution of Word Count

[Diagram: the LINQ expression (IN → SM → GB → S → OUT) is compiled by DryadLINQ into a Dryad execution graph.]

SLIDE 29

Execution Plan for Word Count

[Plan: the logical pipeline SM → GB → S is compiled into two pipelined phases. Phase (1), within each input partition: SM (SelectMany), Q (Sort), GB (GroupBy), C (Count), then D (Distribute). Phase (2), after redistribution: MS (Mergesort), GB (GroupBy), Sum.]
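The payoff of this plan is eager aggregation: cheap partial counts are computed per partition, and only the small partial results are redistributed and summed. A small Python sketch of that two-phase aggregation (illustrative only, not DryadLINQ code):

```python
from collections import Counter

partitions = [["the quick fox", "the dog"], ["the lazy dog"]]

# Phase (1), independently on each partition: SelectMany + GroupBy + Count.
partials = [Counter(w for line in part for w in line.split())
            for part in partitions]

# Distribute + Mergesort + GroupBy + Sum: partial counts for the same word
# from different partitions are brought together and summed.
totals = Counter()
for partial in partials:
    totals.update(partial)

# totals["the"] == 3: each partition counted its own "the"s,
# then only the per-word partial sums were merged.
```

Shipping `("the", 2)` and `("the", 1)` between phases is far cheaper than shipping every occurrence of "the", which is why the optimizer inserts the per-partition Count before the Distribute.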

SLIDE 30

Execution Plan for Word Count

[Plan: the same two phases replicated across partitions. Phase (1), SM → Q → GB → C, runs independently on each input partition; each feeds D (Distribute), and phase (2), MS → GB → Sum, runs on each output partition.]

SLIDE 31

Distributed Execution Plan

<ClusterName>Arden1</ClusterName>
<Resources>
  <Resource>C:\DryadLinqDrop\lib\retail\amd64\wrappernativeinfo.dll</Resource>
  <Resource>C:\DryadLinqDrop\lib\Release\LinqToDryad.dll</Resource>
  <Resource>C:\apps\WordCount\bin\Release\DryadLinq.dll</Resource>
  <Resource>C:\apps\WordCount\bin\Release\WordCount.exe</Resource>
</Resources>
<QueryPlan>
  …
  <Vertex>
    <UniqueId>1</UniqueId>
    <Partitions>8</Partitions>
    <ChannelType>DiskFile</ChannelType>
    <ConnectionOperator>CrossProduct</ConnectionOperator>
    <Entry>
      <AssemblyName>DryadLinq.dll</AssemblyName>
      <ClassName>LinqToDryad.DryadLinq_Vertex</ClassName>
      <MethodName>Super_2</MethodName>
    </Entry>
    <Children><Child><UniqueId>0</UniqueId></Child></Children>
  </Vertex>
  ...
</QueryPlan>

SLIDE 32

Vertex code (in DryadLinq.dll)

The actual code executed at a Dryad vertex:

public static int Super_2(string args)
{
    DryadVertexEnv denv = new DryadVertexEnv(args);
    var dwriter_3 = denv.MakeWriter(DryadLinq_Extension.FactoryType_1);
    var dreader_4 = denv.MakeReader(DryadLinq_Extension.FactoryType_0);
    var source_5 = DryadLinqVertex.SelectMany(dreader_4, doc => doc.words);
    var source_6 = DryadLinqVertex.Sort(source_5, word => word);
    var source_7 = DryadLinqVertex.OrderedGroupBy(source_6, word => word);
    var source_8 = DryadLinqVertex.Select(source_7,
        g => new Pair<String, Int32>(g.Key, g.Count()));
    DryadLinqVertex.HashPartition(source_8, e => e.Key, dwriter_3);
    return 0;
}

SLIDE 33

DryadLINQ

  • Distributed execution plan generation

– Static optimizations: pipelining, eager aggregation, etc.
– Dynamic optimizations: data-dependent partitioning, dynamic aggregation, etc.

  • Vertex runtime

– Single machine (multi-core) implementation of LINQ
– Vertex code that runs on vertices
– Data serialization code
– Callback code for runtime dynamic optimizations
– Automatically distributed to cluster machines

SLIDE 34

DryadLINQ Job Browser

Artemis

SLIDE 35
Simple and Scalable Distributed File System

  • A simple fault-tolerant, distributed file system that provides the abstractions necessary for data parallel computations on HPC clusters
  • High performance, reliable, scalable service
  • Prototypical workload: high throughput, sequential IO, write once
  • Cluster machines working in parallel
  • Configurable number of replicas per dataset

http://research.microsoft.com/events/techfair2010/demos.aspx

SLIDE 36

Outline

  • DryadLINQ programming model
  • Dryad and DryadLINQ overview
  • Applications
SLIDE 37

Dryad

  • Continuously deployed since 2006
  • The execution engine for Bing analytics
  • Running on >> 10^4 machines
  • Runs on clusters of > 3,000 machines
  • Sifting through > 10 PB of data daily

SLIDE 38

Microsoft Kinect: Learning From Data

Kinect (formerly Project Natal) is using DryadLINQ to train enormous decision trees from millions of images across hundreds of cores.

[Diagram: motion capture (ground truth) provides training examples, which are rasterized; machine learning produces a classifier that recognizes players from the depth map at frame rate using a fraction of the Xbox CPU.]

slide-39
SLIDE 39
Large-scale machine learning

  • > 10^22 objects
  • Sparse, multi-dimensional data structures
  • Complex datatypes (images, video, matrices, etc.)
  • Complex application logic and dataflow

– > 35,000 lines of .NET
– 140 CPU days
– > 10^5 processes
– 30 TB data analyzed
– 140 average parallelism (235 machines)
– 300% CPU utilization (4 cores/machine)

SLIDE 40

Highly efficient parallelization

SLIDE 41

SDSS Query Q18

Most time-consuming query from the Sloan Digital Sky Survey database: find all objects within 1' of one another that have very similar colors, that is, with color ratios u-g, g-r, r-i less than 0.05m. (http://www.sdss.jhu.edu/SQL/SQLQueries.html)

  • Two tables, 11.8 GB and 41.8 GB
  • Under 2 minutes with 40 nodes
  • Hand-tuned Dryad implementation is faster (92 s vs 113 s with 40 nodes)
  • DryadLINQ code is 10x smaller

See Dryad (Eurosys’07) and DryadLINQ (OSDI’08) papers.

SLIDE 42

SDSS Query Q18

PartitionedTable<PhotoObjAll> photoObjAll =
    PartitionedTable.Get<PhotoObjAll>(@"file://\\<...>\ugriz-u9.pt");
PartitionedTable<Neighbor> neighbors =
    PartitionedTable.Get<Neighbor>(@"file://\\<...>\neighbor-u9.pt");
var j1 = from p in photoObjAll
         join n in neighbors on p.objId equals n.objId
         select new PhotoObjNeighbor(p, n);
var w1 = from pn in j1
         where pn.objId < pn.neighborObjId && pn.mode
         select pn;
var j2 = from l in photoObjAll
         join pn in w1 on l.objId equals pn.neighborObjId
         select new PhotoObjNeighborAll(l, pn);
var w2 = from lp in j2
         where lp.l.mode
            && Math.Abs((lp.p.u-lp.p.g)-(lp.l.u-lp.l.g)) < 0.05
            && Math.Abs((lp.p.g-lp.p.r)-(lp.l.g-lp.l.r)) < 0.05
            && Math.Abs((lp.p.r-lp.p.i)-(lp.l.r-lp.l.i)) < 0.05
            && Math.Abs((lp.p.i-lp.p.z)-(lp.l.i-lp.l.z)) < 0.05
         select lp.p.objId;
var q = w2.Distinct();
q.ToDryadPartitionedTable("result.pt");

SLIDE 43

Scalable clustering algorithm for N-body simulations in a shared-nothing cluster

YongChul Kwon, Dylan Nunley, Jeffrey P. Gardner, Magdalena Balazinska, Bill Howe, and Sarah Loebman. UW Tech Report UW-CSE-09-06-01, June 2009.

  • Large-scale spatial clustering

– 916M particles in 3D clustered under 70 minutes with 8 nodes.

  • Re-implemented using DryadLINQ

– Partition > Local cluster > Merge cluster > Relabel
– Faster development and good scalability
– Must ensure near-constant processing time per tuple

[Charts: speed-up and scale-up versus number of nodes (2-8); S = DryadLINQ, OS = OpenMP, 43 = sparse, 92 = dense.]

SLIDE 44

Scalable clustering algorithm for N-body simulations in a shared-nothing cluster

SLIDE 45

Terapixel Sky Image

1,791 pairs of red-light and blue-light images acquired from two telescopes, scanned into 23,040 x 23,040 or 14,000 x 14,000 images; ~4 TB of uncompressed data. Processed into 1,791 RGB color images. Stitched into one terapixel spherical image. Image seams removed with optimization. Multi-scale resolution image available in WorldWide Telescope and Bing Maps.

SLIDE 46

Computing Vignetting Corrections

Steps: creating flat fields, normalization matrix, normalizing corners.

DryadLINQ => concise code
DryadLINQ + Windows HPC => efficient and robust execution

– Elapsed time to process all flat fields: 8.7 hours
– 28 8-core compute nodes => 1,950 CPU hours
– Total input data: 417 GB compressed, 4 TB uncompressed

var pixelRows = folders.SelectMany(image => ImageToRows(image, options));
var stackedPixelRows = pixelRows.GroupBy(pixelRow => pixelRow.Position);
var finalRows = stackedPixelRows.Select(x => ReduceStackedRows(x));
var flatField = finalRows.Apply(x => SaveFlatField(x, options));

SLIDE 47

CAP3: DNA Sequence Assembly Program [1]

An EST (Expressed Sequence Tag) corresponds to messenger RNAs (mRNAs) transcribed from the genes residing on chromosomes. Each individual EST sequence represents a fragment of mRNA, and EST assembly aims to reconstruct full-length mRNA sequences for each expressed gene.

[Diagram: Dryad vertices (V) each run CAP3 over input files in FASTA format (e.g. \\GCB-K18-N01\DryadData\cap3\cluster34442.fsa), described by a partition file (\DryadData\cap3\cap3data; Cap3data.pf, Cap3data.00000000) listing 10 partitions, 0,344,CGB-K18-N01 through 9,344,CGB-K18-N01, and produce output files \\GCB-K18-N01\DryadData\cap3\cluster34442.fsa ... \\GCB-K18-N01\DryadData\cap3\cluster34467.fsa.]

IQueryable<LineRecord> inputFiles = PartitionedTable.Get<LineRecord>(uri);
IQueryable<OutputInfo> outputFiles = inputFiles.Select(x => ExecuteCAP3(x.line));

[1] X. Huang, A. Madan, “CAP3: A DNA Sequence Assembly Program,” Genome Research, vol. 9, no. 9, 1999.

SLIDE 48

CAP3: Performance

“DryadLINQ for Scientific Analyses”, Jaliya Ekanayake, Thilina Gunarathne, Geoffrey Fox, Atilla Soner Balkir, Christophe Poulain, Nelson Araujo, Roger Barga (IEEE eScience ’09).

SLIDE 49

High Energy Physics Data Analysis

  • Histogramming of events from a large (up to 1 TB) data set
  • Data analysis requires the ROOT framework (ROOT interpreted scripts)
  • Performance depends on disk access speeds
  • Hadoop implementation uses a shared parallel file system (Lustre)

– ROOT scripts cannot access data from HDFS
– On-demand data movement has significant overhead

  • Dryad stores data on local disks, giving better performance than Hadoop
SLIDE 50

Pairwise Distances: ALU Sequencing

125 million distances computed in 4 hours & 46 minutes.

  • Calculate pairwise distances for a collection of genes (used for clustering, MDS)
  • O(N^2) effect
  • Fine-grained tasks in MPI
  • Coarse-grained tasks in DryadLINQ
  • Performance close to MPI
  • Performed on 768 cores (Tempest Cluster)

[Chart: DryadLINQ vs. MPI runtimes for 35,339 and 50,000 sequences.]

Xiaohong Qiu, Jaliya Ekanayake, Scott Beason, Thilina Gunarathne, Geoffrey Fox, Roger Barga, Dennis Gannon. Cloud Technologies for Bioinformatics Applications (SuperComputing09).
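The coarse-grained decomposition described above can be sketched as tiling the upper triangle of the N x N distance matrix, one tile per task (a hypothetical Python illustration; dist here is a toy mismatch count, not the ALU alignment score used in the talk):

```python
def dist(a, b):
    # Toy stand-in for a sequence distance (the real application uses alignment).
    return sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))

def tile_tasks(seqs, block):
    # Split the upper triangle (d(i,j) == d(j,i)) into block x block tiles,
    # so each task is coarse-grained: one tile, many distances.
    n = len(seqs)
    tiles = []
    for i0 in range(0, n, block):
        for j0 in range(i0, n, block):
            tile = [(i, j, dist(seqs[i], seqs[j]))
                    for i in range(i0, min(i0 + block, n))
                    for j in range(max(j0, i + 1), min(j0 + block, n))]
            if tile:
                tiles.append(tile)
    return tiles

seqs = ["AACG", "AACT", "GGCT", "AAAA"]
tiles = tile_tasks(seqs, block=2)
all_pairs = [p for tile in tiles for p in tile]
# 4 sequences -> 4*3/2 == 6 distances in total, computed tile by tile.
```

Large tiles keep per-task overhead low, which is the contrast the slide draws between fine-grained MPI tasks and coarse-grained DryadLINQ tasks.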

SLIDE 51

Acknowledgements

MSR Silicon Valley Dryad & DryadLINQ teams

Andrew Birrell, Mihai Budiu, Jon Currey, Ulfar Erlingsson, Dennis Fetterly, Michael Isard, Pradeep Kunda, Mark Manasse, Chandu Thekkath, and Yuan Yu.

http://research.microsoft.com/en-us/projects/dryad http://research.microsoft.com/en-us/projects/dryadlinq

MSR External Research

Advanced Research Tools and Services Team

http://research.microsoft.com/en-us/collaboration/tools/dryad.aspx

MS Product Groups: HPC, Parallel Computing Platform. Academic Collaborators

Jaliya Ekanayake, Geoffrey Fox, Thilina Gunarathne, Scott Beason, Xiaohong Qiu (Indiana University Bloomington). YongChul Kwon, Magdalena Balazinska (University of Washington). Atilla Soner Balkir, Ian Foster (University of Chicago).

SLIDE 52

Dryad/DryadLINQ Papers

  • 1. Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks (EuroSys’07)
  • 2. DryadLINQ: A System for General-Purpose Distributed Data-Parallel Computing Using a High-Level Language (OSDI’08)
  • 3. Hunting for problems with Artemis (USENIX WASL, San Diego, ’08)
  • 4. Distributed Data-Parallel Computing Using a High-Level Programming Language (SIGMOD’09)
  • 5. Quincy: Fair scheduling for distributed computing clusters (SOSP’09)
  • 6. Distributed Aggregation for Data-Parallel Computing: Interfaces and Implementations (SOSP’09)
  • 7. DryadInc: Reusing work in large scale computation (HotCloud ’09)
SLIDE 53

Conclusion

DryadLINQ provides a powerful, elegant programming environment for large-scale data-parallel computing. It is still an area of active research… download it and get involved!

http://connect.microsoft.com/dryad

SLIDE 54

SLIDE 55

Dryad & DryadLINQ in Context

[Table, by layer:
Language: SQL (parallel databases) | Sawzall (MapReduce) | ≈SQL (SCOPE) | LINQ, SQL (DryadLINQ) | Pig, Hive
Execution: Parallel Databases | Map-Reduce | Dryad | Hadoop
Storage: Postgres, Oracle, DB2, Teradata, etc. | GFS, BigTable | Cosmos DFS, Azure, SQL Server, NTFS | HDFS, S3]

– Parallel databases: powerful execution engines, restricted languages
– Map-Reduce: radical simplicity; restricted execution engine and language
– Pig, Hive: better languages, layered on the MapReduce execution engine, lesser performance
– SCOPE: a custom query language for Search; uses Dryad to execute, but cannot leverage the LINQ, .NET and Visual Studio ecosystem