Big Data for Data Science The MapReduce Framework & Hadoop - PowerPoint PPT Presentation

Big Data for Data Science The MapReduce Framework & Hadoop event.cwi.nl/lsde

Key premise: divide and conquer work partition w 1 w 2 w 3 worker worker worker r 1 r 2 r 3 combine result event.cwi.nl/lsde

Parallelisation challenges • How do we assign work units to workers? • What if we have more work units than workers? • What if workers need to share partial results? • How do we know all the workers have finished? • What if workers die? • What if data gets lost while transmitted over the network? What’s the common theme of all of these problems? event.cwi.nl/lsde

Common theme? • Parallelization problems arise from: – Communication between workers (e.g., to exchange state) – Access to shared resources (e.g., data) • Thus, we need a synchronization mechanism event.cwi.nl/lsde

Managing multiple workers • Difficult because – We don’t know the order in which workers run – We don’t know when workers interrupt each other – We don’t know when workers need to communicate partial results – We don’t know the order in which workers access shared data • Thus, we need: – Semaphores (lock, unlock) – Conditional variables (wait, notify, broadcast) – Barriers • Still, lots of problems: – Deadlock, livelock, race conditions... – Dining philosophers, sleeping barbers, cigarette smokers... • Moral of the story: be careful! event.cwi.nl/lsde

Current tools shared memory message passing • Programming models memory – Shared memory (pthreads) – Message passing (MPI) • Design patterns P 1 P 2 P 3 P 4 P 5 P 1 P 2 P 3 P 4 P 5 – Master-slaves – Producer-consumer flows – Shared work queues producer consumer master work queue slaves producer consumer event.cwi.nl/lsde

Parallel programming: human bottleneck • Concurrency is difficult to reason about • Concurrency is even more difficult to reason about – At the scale of datacenters and across datacenters – In the presence of failures – In terms of multiple interacting services • Not to mention debugging… • The reality: – Lots of one-off solutions, custom code – Write you own dedicated library, then program with it – Burden on the programmer to explicitly manage everything • The MapReduce Framework alleviates this – making this easy is what gave Google the advantage event.cwi.nl/lsde

What’s the point? • It’s all about the right level of abstraction – Moving beyond the von Neumann architecture – We need better programming models • Hide system-level details from the developers – No more race conditions, lock contention, etc. • Separating the what from how – Developer specifies the computation that needs to be performed – Execution framework (aka runtime) handles actual execution The data center is the computer! event.cwi.nl/lsde

The Da he Data ta Center is Center is the Compu the Computer ter Can Can you pr ou prog ogram am it? it? event.cwi.nl/lsde Source: Google

MAPREDUCE AND HDFS event.cwi.nl/lsde

Big data needs big ideas • Scale “out”, not “up” – Limits of SMP and large shared-memory machines • Move processing to the data – Cluster has limited bandwidth, cannot waste it shipping data around • Process data sequentially, avoid random access – Seeks are expensive, disk throughput is reasonable, memory throughput is even better • Seamless scalability – From the mythical man-month to the tradable machine-hour • Computation is still big – But if efficiently scheduled and executed to solve bigger problems we can throw more hardware at the problem and use the same code – Remember, the datacenter is the computer event.cwi.nl/lsde

Typical Big Data Problem • Iterate over a large number of records • Extract something of interest from each • Shuffle and sort intermediate results • Aggregate intermediate results • Generate final output Key idea: provide a functional abstraction for these two operations event.cwi.nl/lsde

MapReduce • Programmers specify two functions: map (k 1 , v 1 ) → [<k 2 , v 2 >] reduce (k 2 , [v 2 ]) → [<k 3 , v 3 >] – All values with the same key are sent to the same reducer k 1 v 1 k 2 v 2 k 3 v 3 k 4 v 4 k 5 v 5 k 6 v 6 k 7 v 7 k 8 v 8 map map map map a 1 b 2 c 6 c 3 a 5 c 2 b 7 c 8 shuffle and sort: aggregate values by keys a 1 5 b 2 7 c 2 3 6 8 reduce reduce reduce r 1 s 1 r 2 s 2 r 3 s 3 event.cwi.nl/lsde

MapReduce runtime • Orchestration of the distributed computation • Handles scheduling – Assigns workers to map and reduce tasks • Handles data distribution – Moves processes to data • Handles synchronization – Gathers, sorts, and shuffles intermediate data • Handles errors and faults – Detects worker failures and restarts • Everything happens on top of a distributed file system (more information later) event.cwi.nl/lsde

MapReduce • Programmers specify two functions: map (k, v) → <k’, v’>* reduce (k’, v’) → <k’, v’>* – All values with the same key are reduced together • The execution framework handles everything else • This is the minimal set of information to provide • Usually, programmers also specify: partition (k’, number of partitions) → partition for k’ – Often a simple hash of the key, e.g., hash(k’) mod n – Divides up key space for parallel reduce operations combine (k’, v’) → <k’, v’>* – Mini-reducers that run in memory after the map phase – Used as an optimization to reduce network traffic event.cwi.nl/lsde

Putting it all together k 1 v 1 k 2 v 2 k 3 v 3 k 4 v 4 k 5 v 5 k 6 v 6 k 7 v 7 k 8 v 8 map map map map a 1 b 2 c 6 c 3 a 5 c 2 b 7 c 8 combine combine combine combine a 1 b 2 c 9 a 5 c 2 b 7 c 8 partition partition partition partition shuffle and sort: aggregate values by keys a 1 5 b 2 7 c 2 8 9 reduce reduce reduce r 1 s 1 r 2 s 2 r 3 s 3 event.cwi.nl/lsde

Two more details • Barrier between map and reduce phases – But we can begin copying intermediate data earlier • Keys arrive at each reducer in sorted order – No enforced ordering across reducers event.cwi.nl/lsde

“Hello World”: Word Count Map(String docid, String text): for each word w in text: Emit(w, 1); Reduce(String term, Iterator<Int> values): int sum = 0; for each v in values: sum += v; Emit(term, sum); event.cwi.nl/lsde

MapReduce Implementations • Google has a proprietary implementation in C++ – Bindings in Java, Python • Hadoop is an open-source implementation in Java – Development led by Yahoo, now an Apache project – Used in production at Yahoo, Facebook, Twitter, LinkedIn, Netflix, … – The de facto big data processing platform – Rapidly expanding software ecosystem • Lots of custom research implementations – For GPUs, cell processors, etc. event.cwi.nl/lsde

user program (1) submit master (2) schedule map (2) schedule reduce worker split 0 output (6) write (5) remote read worker split 1 file 0 (3) read (4) local write worker split 2 output split 3 worker file 1 split 4 worker Input Map Intermediate files Reduce Output files phase (on local disk) phase files event.cwi.nl/lsde Adapted from (Dean and Ghemawat, OSDI 2004)

How do we get data to the workers? client machine NAS workerworkerworker workerworker workerworker file server farm cluster Compute Nodes NAS (NAS,SAN,..) What’s the problem here ? event.cwi.nl/lsde

Distributed file system • Do not move data to workers, but move workers to the data! – Store data on the local disks of nodes in the cluster – Start up the workers on the node that has the data local real MapReduce Job  worker worker worker worker worker worker worker worker worker worker worker worker Compute Nodes HDFS (GFS) Distributed File-system virtual • Why? – Avoid network traffic if possible – Not enough RAM to hold all the data in memory – Disk access is slow, but disk throughput is reasonable • A distributed file system is the answer – GFS (Google File System) for Google’s MapReduce – HDFS (Hadoop Distributed File System) for Hadoop Note: all data is replicated for fault-tolerance (HDFS default:3x) event.cwi.nl/lsde

GFS: Assumptions • Commodity hardware over exotic hardware – Scale out, not up • High component failure rates – Inexpensive commodity components fail all the time • “Modest” number of huge files – Multi-gigabyte files are common, if not encouraged • Files are write-once, mostly appended to – Perhaps concurrently • Large streaming reads over random access – High sustained throughput over low latency event.cwi.nl/lsde GFS slides adapted from material by (Ghemawat et al., SOSP 2003)

GFS: Design Decisions • Files stored as chunks – Fixed size (64MB) • Reliability through replication – Each chunk replicated across 3+ chunkservers • Single master to coordinate access, keep metadata – Simple centralized management • No data caching – Little benefit due to large datasets, streaming reads • Simplify the API – Push some of the issues onto the client (e.g., data layout) HDFS = GFS clone (same basic ideas) event.cwi.nl/lsde

From GFS to HDFS • Terminology differences: – GFS master = Hadoop namenode – GFS chunkservers = Hadoop datanodes • Differences: – Different consistency model for file appends – Implementation – Performance For the most part, we’ll use Hadoop terminology event.cwi.nl/lsde

Big Data for Data Science The MapReduce Framework & Hadoop - PowerPoint PPT Presentation

Big Data for Data Science The MapReduce Framework & Hadoop event.cwi.nl/lsde Key premise: divide and conquer work partition w 1 w 2 w 3 worker worker worker r 1 r 2 r 3 combine result event.cwi.nl/lsde Parallelisation challenges

Machine Learning Anders Holst SICS Big Data Analytics Analysis Big Data Big Value Big Data

CS535 Big Data 1/22/2020 Sangmi Lee Pallickara CS535 Big Data | Computer Science Department

Big Data Algorithms with Medical Applications Yixin Chen Outline Challenges to big data

COMP9313: Big Data Management Introduction to Big Data Management What is big data? Tweeted by

CS535 Big Data 2/5/2020 Week 3- B Sangmi Lee Pallickara CS535 Big Data | Computer Science |

CS535 Big Data 1/29/2020 Week 2- B Sangmi Lee Pallickara CS535 Big Data | Computer Science |

CS535 Big Data 2/10/2019 Week 4-A Sangmi Lee Pallickara CS535 Big Data | Computer Science |

CS535 Big Data 4/13/2020 Week 12-A Sangmi Lee Pallickara CS535 Big Data | Computer Science |

CS535 Big Data 4/27/2020 Week 14-A Sangmi Lee Pallickara CS535 Big Data | Computer Science |

Fundamentals of Big Data BIG DATA F UN DAMEN TALS W ITH P YS PARK Upendra Devisetty Science

HOW BIG IS BIG DATA FOR AN INSURER LIKE AXA? CHALLENGES & OPPORTUNITIES Paris Big Data

From Big Data Management to Big Data Science 1 What is next? Real big data is widely available

CS535 Big Data 3/4/2020 Week 7-B Sangmi Lee Pallickara CS535 Big Data | Computer Science |

I Prefer Pi Corey Sinnamon Febuary 3, 2015 Big Day 3/14/15 Big Day 3/14/15 Themes Big

Big Data Analytics: What is Big Data? Stony Brook University CSE545, Fall 2016 the inaugural

Big Data Analytics: What is Big Data? H. Andrew Schwartz Stony Brook University CSE545, Fall

Continuous Integration & Deploying using Jenkins and uDeploy (Projects used are of

Spark architecture Spark architecture Hardware organization Hardware organization In local

Developing Materials Using some Principles from SLA Diane Schmitt More to Word Knowledge than

Dynamic Deployment and Scalability for the Cloud Jerome Bernard Director, EMEA Operations

Optimal convergence rates for distributed optimization Francis Bach Inria - Ecole Normale

DEPLOYING AND SCALING MICROSERVICES Sam Newman Goto Chicago 2016 @samnewman Building

Parallel applications in the cloud Diana Naranjo Pomalaya Computao Paralela e Distribuida

of optimized code Michael Ernst University of Washington (work done at Microsoft Research)

Big Data for Data Science The MapReduce Framework & Hadoop - PowerPoint PPT Presentation

Big Data for Data Science The MapReduce Framework & Hadoop event.cwi.nl/lsde Key premise: divide and conquer work partition w 1 w 2 w 3 worker worker worker r 1 r 2 r 3 combine result event.cwi.nl/lsde Parallelisation challenges

Machine Learning Anders Holst SICS Big Data Analytics Analysis Big Data Big Value Big Data

CS535 Big Data 1/22/2020 Sangmi Lee Pallickara CS535 Big Data | Computer Science Department

Big Data Algorithms with Medical Applications Yixin Chen Outline Challenges to big data

COMP9313: Big Data Management Introduction to Big Data Management What is big data? Tweeted by

CS535 Big Data 2/5/2020 Week 3- B Sangmi Lee Pallickara CS535 Big Data | Computer Science |

CS535 Big Data 1/29/2020 Week 2- B Sangmi Lee Pallickara CS535 Big Data | Computer Science |

CS535 Big Data 2/10/2019 Week 4-A Sangmi Lee Pallickara CS535 Big Data | Computer Science |

CS535 Big Data 4/13/2020 Week 12-A Sangmi Lee Pallickara CS535 Big Data | Computer Science |

CS535 Big Data 4/27/2020 Week 14-A Sangmi Lee Pallickara CS535 Big Data | Computer Science |

Fundamentals of Big Data BIG DATA F UN DAMEN TALS W ITH P YS PARK Upendra Devisetty Science

HOW BIG IS BIG DATA FOR AN INSURER LIKE AXA? CHALLENGES &amp; OPPORTUNITIES Paris Big Data

From Big Data Management to Big Data Science 1 What is next? Real big data is widely available

CS535 Big Data 3/4/2020 Week 7-B Sangmi Lee Pallickara CS535 Big Data | Computer Science |

I Prefer Pi Corey Sinnamon Febuary 3, 2015 Big Day 3/14/15 Big Day 3/14/15 Themes Big

Big Data Analytics: What is Big Data? Stony Brook University CSE545, Fall 2016 the inaugural

Big Data Analytics: What is Big Data? H. Andrew Schwartz Stony Brook University CSE545, Fall

Continuous Integration &amp; Deploying using Jenkins and uDeploy (Projects used are of

Spark architecture Spark architecture Hardware organization Hardware organization In local

Developing Materials Using some Principles from SLA Diane Schmitt More to Word Knowledge than

Dynamic Deployment and Scalability for the Cloud Jerome Bernard Director, EMEA Operations

Optimal convergence rates for distributed optimization Francis Bach Inria - Ecole Normale

DEPLOYING AND SCALING MICROSERVICES Sam Newman Goto Chicago 2016 @samnewman Building

Parallel applications in the cloud Diana Naranjo Pomalaya Computao Paralela e Distribuida

of optimized code Michael Ernst University of Washington (work done at Microsoft Research)

HOW BIG IS BIG DATA FOR AN INSURER LIKE AXA? CHALLENGES & OPPORTUNITIES Paris Big Data

Continuous Integration & Deploying using Jenkins and uDeploy (Projects used are of