Large Scale Data Engineering
Big Data Frameworks: Hadoop & Spark
Key premise: divide and conquer
[Figure: the work is partitioned into units w1, w2, w3; one worker processes each unit, producing partial results r1, r2, r3, which are combined into the final result]
Parallelisation challenges
- How do we assign work units to workers?
- What if we have more work units than workers?
- What if workers need to share partial results?
- How do we know all the workers have finished?
- What if workers die?
- What if data gets lost while transmitted over the network?
What’s the common theme of all of these problems?
Common theme?
- Parallelization problems arise from:
– Communication between workers (e.g., to exchange state)
– Access to shared resources (e.g., data)
- Thus, we need a synchronization mechanism
Managing multiple workers
- Difficult because
– We don’t know the order in which workers run
– We don’t know when workers interrupt each other
– We don’t know when workers need to communicate partial results
– We don’t know the order in which workers access shared data
- Thus, we need:
– Semaphores (lock, unlock)
– Condition variables (wait, notify, broadcast)
– Barriers (see the Python sketch at the end of this slide)
- Still, lots of problems:
– Deadlock, livelock, race conditions...
– Dining philosophers, sleeping barbers, cigarette smokers...
- Moral of the story: be careful!
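To make these primitives concrete, here is a minimal Python sketch (an illustration, not from the original slides) of workers guarding shared state with a lock and meeting at a barrier before the results are combined:

import random
import threading

NUM_WORKERS = 3
results = []                                 # shared resource
results_lock = threading.Lock()              # lock/unlock around shared state
barrier = threading.Barrier(NUM_WORKERS)     # wait until every worker is done

def worker(wid):
    partial = sum(random.random() for _ in range(1000))  # fake partial result
    with results_lock:                       # only one worker writes at a time
        results.append(partial)
    barrier.wait()                           # synchronize: all partials are in
    if wid == 0:                             # one worker combines the results
        print("combined result:", sum(results))

threads = [threading.Thread(target=worker, args=(i,)) for i in range(NUM_WORKERS)]
for t in threads:
    t.start()
for t in threads:
    t.join()

Even in this tiny example, forgetting the lock or the barrier silently corrupts or truncates the result, which is exactly the class of bugs the slide warns about.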
Current tools
- Programming models
– Shared memory (pthreads)
– Message passing (MPI)
- Design patterns
– Master-slaves
– Producer-consumer flows
– Shared work queues
[Figures: message passing among processes P1..P5; shared memory accessed by processes P1..P5; design patterns: a master with slaves, producer-consumer flows, and a shared work queue between producers and consumers]
Parallel programming: human bottleneck
- Concurrency is difficult to reason about
- Concurrency is even more difficult to reason about
– At the scale of datacenters and across datacenters
– In the presence of failures
– In terms of multiple interacting services
- Not to mention debugging…
- The reality:
– Lots of one-off solutions, custom code
– Write your own dedicated library, then program with it
– Burden on the programmer to explicitly manage everything
- The MapReduce Framework alleviates this
– making this easy is what gave Google the advantage
What’s the point?
- It’s all about the right level of abstraction
– Moving beyond the von Neumann architecture
– We need better programming models
- Hide system-level details from the developers
– No more race conditions, lock contention, etc.
- Separating the what from how
– Developer specifies the computation that needs to be performed
– Execution framework (aka runtime) handles actual execution
The data center is the computer!
MAPREDUCE AND HDFS
Typical Big Data Problem
- Iterate over a large number of records
- Extract something of interest from each
- Shuffle and sort intermediate results
- Aggregate intermediate results
- Generate final output
Key idea: provide a functional abstraction for two of these operations: extracting something of interest from each record (map) and aggregating intermediate results (reduce)
MapReduce
- Programmers specify two functions:
map (k1, v1) → [<k2, v2>]
reduce (k2, [v2]) → [<k3, v3>]
– All values with the same key are sent to the same reducer
[Figure: four mappers emit key-value pairs such as (a,1), (b,2), (c,6); the shuffle and sort phase aggregates values by key; three reducers each process one group of keys and emit results (r1,s1), (r2,s2), (r3,s3)]
MapReduce runtime
- Orchestration of the distributed computation
- Handles scheduling
– Assigns workers to map and reduce tasks
- Handles data distribution
– Moves processes to data
- Handles synchronization
– Gathers, sorts, and shuffles intermediate data
- Handles errors and faults
– Detects worker failures and restarts failed tasks
- Everything happens on top of a distributed file system (more information later)
MapReduce
- Programmers specify two functions:
map (k, v) → <k’, v’>*
reduce (k’, v’*) → <k’’, v’’>*
– All values with the same key are reduced together
- The execution framework handles everything else
- This is the minimal set of information to provide
- Usually, programmers also specify:
partition (k’, number of partitions) → partition for k’
– Often a simple hash of the key, e.g., hash(k’) mod n
– Divides up key space for parallel reduce operations
combine (k’, v’*) → <k’, v’’*>*
– Mini-reducers that run in memory after the map phase
– Used as an optimization to reduce network traffic
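A hedged plain-Python illustration of these two hooks (not the Hadoop API; the function names are made up for the example):

NUM_REDUCERS = 4

def partition(key, num_partitions=NUM_REDUCERS):
    # divides up the key space for parallel reduces: hash(k') mod n
    return hash(key) % num_partitions

def combine(map_output):
    # mini-reducer over one mapper's in-memory output, cutting network traffic
    local = {}
    for k, v in map_output:
        local[k] = local.get(k, 0) + v
    return sorted(local.items())

for k, v in combine([("a", 1), ("b", 2), ("a", 4)]):
    print(k, v, "-> reducer", partition(k))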
Putting it all together
[Figure: four mappers emit key-value pairs; a combiner runs after each mapper (e.g., merging (c,6) and (c,3) into (c,9)); a partitioner assigns each key to a reducer; the shuffle and sort phase aggregates values by key; three reducers emit the final results (r1,s1), (r2,s2), (r3,s3)]
“Hello World”: Word Count
Map(String docid, String text):
  for each word w in text:
    Emit(w, 1);

Reduce(String term, Iterator<Int> values):
  int sum = 0;
  for each v in values:
    sum += v;
  Emit(term, sum);
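The same program simulated end to end in plain Python (a local sketch of the map → shuffle/sort → reduce flow, not the Hadoop API):

from collections import defaultdict

def map_fn(docid, text):
    for word in text.split():
        yield (word, 1)                      # Emit(w, 1)

def reduce_fn(term, values):
    yield (term, sum(values))                # Emit(term, sum)

docs = {"d1": "the quick brown fox", "d2": "the lazy dog ate the fox"}

intermediate = [kv for docid, text in docs.items()
                   for kv in map_fn(docid, text)]   # map phase

groups = defaultdict(list)                  # shuffle and sort:
for k, v in intermediate:                   # aggregate values by key
    groups[k].append(v)

for term in sorted(groups):                 # reduce phase
    for pair in reduce_fn(term, groups[term]):
        print(pair)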
MapReduce Implementations
- Google has a proprietary implementation in C++
– Bindings in Java, Python
- Hadoop is an open-source implementation in Java
– Development led by Yahoo, now an Apache project
– Used in production at Facebook, Twitter, LinkedIn, Netflix, …
– Popular on-premise big data processing platform, but...
- Has been losing support to cloud-based platforms
Distributed file system
- Do not move data to workers, but move workers to the data!
– Store data on the local disks of nodes in the cluster
– Start up the workers on the node that has the data local
- Why?
– Avoid network traffic if possible
– Not enough RAM to hold all the data in memory
– Disk access is slow, but disk throughput is reasonable
- A distributed file system is the answer
– GFS (Google File System) for Google’s MapReduce
– HDFS (Hadoop Distributed File System) for Hadoop
Note: all data is replicated for fault tolerance (HDFS default: 3x)
[Figure: a MapReduce job runs on virtual compute nodes (workers), which map onto the real nodes of the HDFS (GFS) distributed file system]
HDFS: Assumptions
- High component failure rates
– Inexpensive commodity components fail all the time
- “Modest” number of huge files
– Multi-gigabyte files are common, if not encouraged
- Files are write-once, mostly appended to
– Perhaps concurrently
- Large streaming reads over random access
– High sustained throughput over low latency
GFS slides adapted from material by (Ghemawat et al., SOSP 2003)
HDFS: Design Decisions
- Files stored as chunks
– Fixed size (64MB)
- Reliability through replication
– Each chunk replicated across 3+ chunkservers
- Single master to coordinate access, keep metadata
– Simple centralized management
- No data caching
– Little benefit due to large datasets, streaming reads
HDFS architecture (adapted from Ghemawat et al., SOSP 2003)
[Figure: the application’s HDFS client sends (file name, block id) to the HDFS namenode and gets back (block id, block location); the client then requests (block id, byte range) from an HDFS datanode and receives the block data; the namenode keeps the file namespace (e.g., /foo/bar → block 3df2) and exchanges instructions and state reports with the datanodes, which store blocks in their local Linux file systems]
Namenode responsibilities
- Managing the file system namespace:
– Holds file/directory structure, metadata, file-to-block mapping, access permissions, etc.
- Coordinating file operations:
– Directs clients to datanodes for reads and writes
– No data is moved through the namenode
- Maintaining overall health:
– Periodic communication with the datanodes
– Block re-replication and rebalancing
– Garbage collection
Putting everything together
[Figure: each slave node runs a tasktracker and a datanode daemon on top of the local Linux file system; the namenode daemon runs on the namenode, and the jobtracker runs on the job submission node]
Basic cluster components
- One of each:
– Namenode (NN): master node for HDFS
– Jobtracker (JT): master node for job submission
- Set of each per slave machine:
– Tasktracker (TT): contains multiple task slots
– Datanode (DN): serves HDFS data blocks
Anatomy of a job
- MapReduce program in Hadoop = Hadoop job
– Jobs are divided into map and reduce tasks
– An instance of running a task is called a task attempt (occupies a slot)
– Multiple jobs can be composed into a workflow
- Job submission:
– Client (i.e., driver program) creates a job, configures it, and submits it to the jobtracker
– That’s it! The Hadoop cluster takes over
Anatomy of a job
- Behind the scenes:
– Input splits are computed (on client end)
– Job data (jar, configuration XML) are sent to JobTracker
– JobTracker puts job data in shared location, enqueues tasks
– TaskTrackers poll for tasks
– Off to the races
[Figure: InputFormat: the client splits each input file into InputSplits; a RecordReader turns each InputSplit into records, which feed a Mapper that produces intermediates]
[Figure: each Mapper’s intermediates go through a Partitioner, which routes them to the Reducers (combiners omitted here)]
[Figure: OutputFormat: each Reducer writes its output file through a RecordWriter]
Shuffle and sort in Hadoop
- Probably the most complex aspect of MapReduce
- Map side
– Map outputs are buffered in memory in a circular buffer
– When buffer reaches threshold, contents are spilled to disk
– Spills merged in a single, partitioned file (sorted within each partition); the combiner runs during the merges
- Reduce side
– First, map outputs are copied over to the reducer machine
– Sort is a multi-pass merge of map outputs (happens in memory and on disk); the combiner runs during the merges
– Final merge pass goes directly into the reducer
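The multi-pass merge of sorted runs is the same idea as Python's heapq.merge (a sketch of the principle, not Hadoop's implementation):

import heapq

# each spill file is already sorted by key when it is written to disk
spill1 = [("a", 1), ("c", 6)]
spill2 = [("a", 5), ("b", 2)]
spill3 = [("b", 7), ("c", 2)]

# merging sorted runs needs only one streaming pass per merge round,
# which is why map outputs are sorted before they are spilled
for key, value in heapq.merge(spill1, spill2, spill3):
    print(key, value)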
Shuffle and sort
[Figure: on the mapper side, map outputs fill a circular buffer in memory; spills go to disk and are merged into partitioned spill files (the combiner runs here); each reducer copies its partitions from this and other mappers, merges them into intermediate files on disk (the combiner runs again), and the final merge feeds directly into the reducer; the remaining partitions go to other reducers]
YARN: Hadoop version 2.0
- Hadoop limitations:
– Can only run MapReduce
– What if we want to run other distributed frameworks?
- YARN = Yet-Another-Resource-Negotiator
– Provides an API to develop any generic distributed application
– Handles scheduling and resource requests
– MapReduce (MR2) is one such application in YARN
The Hadoop Ecosystem
[Figure: the ecosystem on top of YARN and HDFS: fast in-memory processing (Spark), graph analysis (GraphX & GraphFrames), machine learning (MLlib), data querying (Spark SQL, Impala), and the HCatalog metadata repository]
- “Data Lakes”
– Large collections of raw data, stored cheaply in HDFS (or in the cloud)
– A zoo of tools and pipelines to clean, transform & analyze this data
- Drill, Hive and Impala are SQL systems that work in Hadoop
- HCatalog is the Hadoop metadata repository (which tables exist?)
YARN: architecture
Spark
credits: Matei Zaharia & Xiangrui Meng
What is Spark?
- Fast and expressive cluster computing system interoperable with Apache Hadoop
- Improves efficiency through:
– In-memory computing primitives
– General computation graphs
- Improves usability through:
– Rich APIs in Scala, Java, Python
– Interactive shell
Up to 100× faster (2-10× on disk)
Often 5× less code
credits: Matei Zaharia & Xiangrui Meng
The Spark Stack
- Spark is the basis of a wide set of projects in the Berkeley Data Analytics Stack (BDAS)
[Figure: the Spark core underneath Spark Streaming (real-time), GraphX (graph), Spark SQL, MLlib (machine learning), and more]
More details: amplab.berkeley.edu
credits: Matei Zaharia & Xiangrui Meng
Why a New Programming Model?
- MapReduce greatly simplified big data analysis
- But as soon as it got popular, users wanted more:
– More complex, multi-pass analytics (e.g. ML, graph)
– More interactive ad-hoc queries
– More real-time stream processing
- All 3 need faster data sharing across parallel jobs
credits: Matei Zaharia & Xiangrui Meng
Data Sharing in MapReduce
[Figure: iterative jobs: each iteration reads its input from HDFS and writes its output back (HDFS read → iter. 1 → HDFS write → HDFS read → iter. 2 → …); interactive use: query 1, 2, 3 each re-read the same input from HDFS to produce result 1, 2, 3]
Slow due to replication, serialization, and disk IO
credits: Matei Zaharia & Xiangrui Meng
Data Sharing in Spark
[Figure: iter. 1 and iter. 2 pass their data through distributed memory instead of HDFS; for queries, the input is loaded once (one-time processing) and query 1, 2, 3 all run against the in-memory data]
~10× faster than network and disk
credits: Matei Zaharia & Xiangrui Meng
Spark Programming Model
- Key idea: resilient distributed datasets (RDDs)
– Distributed collections of objects that can be cached in memory across the cluster
– Manipulated through parallel operators
– Automatically recomputed on failure
- Programming interface
– Functional APIs in Scala, Java, Python
– Interactive use from Scala shell
credits: Matei Zaharia & Xiangrui Meng
Example: Log Mining
Load error messages from a log into memory, then interactively search for various patterns
lines = spark.textFile("hdfs://...")                      # base RDD
errors = lines.filter(lambda x: x.startswith("ERROR"))    # transformed RDD
messages = errors.map(lambda x: x.split('\t')[2])
messages.cache()
[Figure: the driver distributes this computation over three workers]
credits: Matei Zaharia & Xiangrui Meng
Lambda Functions
Lambda function (functional programming!) = implicit, anonymous function definition
errors = lines.filter(lambda x: x.startswith("ERROR"))
messages = errors.map(lambda x: x.split('\t')[2])

The first lambda is equivalent to a named predicate function:

bool detect_error(string x) { return x.startswith("ERROR"); }
credits: Matei Zaharia & Xiangrui Meng
Example: Log Mining
Load error messages from a log into memory, then interactively search for various patterns
lines = spark.textFile("hdfs://...")                      # base RDD
errors = lines.filter(lambda x: x.startswith("ERROR"))    # transformed RDD
messages = errors.map(lambda x: x.split('\t')[2])
messages.cache()

messages.filter(lambda x: "foo" in x).count()             # action
messages.filter(lambda x: "bar" in x).count()
. . .
[Figure: the driver sends tasks to three workers; each worker reads one block of the file (Block 1-3), caches its partition of messages (Cache 1-3), and returns results to the driver]
Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)
Result: scaled to 1 TB data in 5-7 sec (vs 170 sec for on-disk data)
Fault Tolerance
file.map(lambda rec: (rec.type, 1))
    .reduceByKey(lambda x, y: x + y)
    .filter(lambda (type, count): count > 10)
[Figure: lineage: input file → map → reduce → filter]
RDDs track lineage info to rebuild lost data
credits: Matei Zaharia & Xiangrui Meng
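In PySpark, the lineage of an RDD can be inspected with toDebugString() (a small sketch; assumes an already-created SparkContext sc):

rdd = (sc.parallelize(range(100))            # assumes an existing SparkContext `sc`
         .map(lambda x: (x % 3, 1))
         .reduceByKey(lambda a, b: a + b))

# prints the chain of parent RDDs that Spark would use
# to recompute lost partitions
print(rdd.toDebugString())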
Example: Logistic Regression
[Chart: running time (s) vs number of iterations (1-30) for Hadoop and Spark; Hadoop takes ~110 s per iteration; Spark takes 80 s for the first iteration and ~1 s for further iterations]
credits: Matei Zaharia & Xiangrui Meng
Spark in Scala and Java
// Scala:
val lines = sc.textFile(...)
lines.filter(x => x.contains("ERROR")).count()

// Java:
JavaRDD<String> lines = sc.textFile(...);
lines.filter(new Function<String, Boolean>() {
  Boolean call(String s) { return s.contains("error"); }
}).count();
credits: Matei Zaharia & Xiangrui Meng
Supported Operators
- map
- filter
- groupBy
- sort
- union
- join
- leftOuterJoin
- rightOuterJoin
- reduce
- count
- fold
- reduceByKey
- groupByKey
- cogroup
- cross
- zip
- sample
- take
- first
- partitionBy
- mapWith
- pipe
- save
- ...
credits: Matei Zaharia & Xiangrui Meng
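A few of these operators chained together in PySpark (a hedged sketch; assumes an existing SparkContext sc):

pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 5)])
other = sc.parallelize([("a", "x"), ("c", "y")])

print(pairs.reduceByKey(lambda x, y: x + y).collect())  # [('a', 6), ('b', 2)]
print(pairs.join(other).collect())                      # [('a', (1, 'x')), ('a', (5, 'x'))]
print(pairs.filter(lambda kv: kv[1] > 1).count())       # 2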
Software Components
- Spark client is a library in the user program (1 instance per app)
- Runs tasks locally or on cluster
– Mesos, YARN, standalone mode
- Accesses storage systems via the Hadoop InputFormat API
– Can use HBase, HDFS, S3, …
[Figure: your application holds a SparkContext that runs local threads and talks to a cluster manager; each worker runs a Spark executor; executors read from HDFS or other storage]
credits: Matei Zaharia & Xiangrui Meng
Task Scheduler
- Supports general task graphs
- Automatically pipelines functions
- Data locality aware
- Partitioning aware, to avoid shuffles
[Figure: a job as a task graph: RDDs A-F combined by map, join, filter, and groupBy operators, cut into Stages 1-3; cached partitions are not recomputed]
credits: Matei Zaharia & Xiangrui Meng
Spark SQL
- Columnar SQL analytics engine for Spark
– Support both SQL and complex analytics
– Columnar storage, JIT-compiled execution, Java/Scala/Python UDFs
– Catalyst query optimizer (also for DataFrame scripts)
credits: Matei Zaharia & Xiangrui Meng
Spark SQL Architecture
[Figure: CLI and JDBC clients connect to the Driver, which contains the SQL Parser, the Catalyst Query Optimizer, the Cache Manager, Physical Plan selection, and Execution on Spark; metadata comes from the Hive Catalog, data from HDFS]
[Engle et al, SIGMOD 2012]
credits: Matei Zaharia & Xiangrui Meng
From RDD to DataFrame
- A distributed collection of rows with the same schema (RDDs suffer from type erasure)
- Can be constructed from external data sources or RDDs into essentially an RDD of Row objects (called SchemaRDDs before Spark 1.3)
- Supports relational operators (e.g. where, groupBy) as well as Spark operations
- Evaluated lazily → non-materialized logical plan
credits: Matei Zaharia & Reynold Xin
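Both construction paths in a minimal PySpark sketch (assumes a SparkSession named spark; the input file name is hypothetical):

from pyspark.sql import Row

df1 = spark.read.json("users.json")          # from an external data source (hypothetical file)

rows = spark.sparkContext.parallelize(       # from an RDD of Row objects
    [Row(name="Ann", age=34), Row(name="Bob", age=29)])
df2 = spark.createDataFrame(rows)

df2.printSchema()                            # the schema is known, unlike with a plain RDD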
DataFrame: Data Model
- Nested data model
- Supports both primitive SQL types (boolean, integer, double, decimal, string, date, timestamp) and complex types (structs, arrays, maps, and unions); also user-defined types
- First class support for complex data types
credits: Matei Zaharia & Reynold Xin
DataFrame Operations
- Relational operations (select, where, join, groupBy) via a DSL
- Operators take expression objects
- Operators build up an abstract syntax tree (AST), which is then optimized by Catalyst
- Alternatively, register the DataFrame as a temp SQL table and issue traditional SQL query strings
credits: Matei Zaharia & Reynold Xin
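Both styles side by side in PySpark (a sketch; assumes a SparkSession spark and a DataFrame df with columns dept and salary):

out = (df.where(df.salary > 1000)            # DSL: operators build an AST
         .groupBy("dept")                    # that Catalyst optimizes lazily
         .count())

df.createOrReplaceTempView("employees")      # or: register as a temp table/view
out2 = spark.sql(
    "SELECT dept, COUNT(*) AS n FROM employees "
    "WHERE salary > 1000 GROUP BY dept")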
Catalyst: Plan Optimization & Execution
[Figure: a SQL AST or DataFrame becomes an Unresolved Logical Plan; Analysis (against the Catalog) produces a Logical Plan; Logical Optimization produces an Optimized Logical Plan; Physical Planning generates candidate Physical Plans, from which a Cost Model selects one; Code Generation turns the Selected Physical Plan into RDDs]
credits: Matei Zaharia & Reynold Xin
Catalyst Optimization Rules
- Applies standard rule-based optimization in the Logical Optimization step (constant folding, predicate pushdown, projection pruning, null propagation, Boolean expression simplification, etc.), turning the Logical Plan into an Optimized Logical Plan
[Figure: constant folding rewrites the expression tree for x + (1 + 2), i.e. Add(Attribute(x), Add(Literal(1), Literal(2))), into the tree for x + 3, i.e. Add(Attribute(x), Literal(3))]
credits: Matei Zaharia & Reynold Xin
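A toy version of one such rewrite rule in Python (purely illustrative; Catalyst itself is written in Scala and matches on its own tree classes): fold additions of two literals, bottom-up.

from dataclasses import dataclass

@dataclass
class Literal:
    value: int

@dataclass
class Attribute:
    name: str

@dataclass
class Add:
    left: object
    right: object

def fold_constants(node):
    # rewrite children first, then try to fold this node
    if isinstance(node, Add):
        left = fold_constants(node.left)
        right = fold_constants(node.right)
        if isinstance(left, Literal) and isinstance(right, Literal):
            return Literal(left.value + right.value)
        return Add(left, right)
    return node

expr = Add(Attribute("x"), Add(Literal(1), Literal(2)))   # x + (1 + 2)
print(fold_constants(expr))                               # x + 3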
def add_demographics(events):
    u = sqlCtx.table("users")                     # Load partitioned Hive table
    events \
      .join(u, events.user_id == u.user_id) \     # Join on user_id
      .withColumn("city", zipToCity(df.zip))      # Run udf to add city column

events = add_demographics(sqlCtx.load("/data/events", "parquet"))
training_data = events.where(events.city == "Melbourne")
                      .select(events.timestamp).collect()

Logical Plan: filter → join over the events file and the users table
Physical Plan: join of scan (events) with a filter over scan (users)
Physical Plan with Predicate Pushdown and Column Pruning: join of optimized scan (events) with optimized scan (users)
credits: Matei Zaharia & Reynold Xin
An Example Catalyst Transformation
1. Find filters on top of projections.
2. Check that the filter can be evaluated without the result of the projection.
3. If so, switch the operators.
[Figure: Original Plan: Project(name) over Filter(id = 1) over Project(id,name) over People; after Filter Push-Down, the Filter(id = 1) is evaluated below the Project(id,name)]
credits: Matei Zaharia & Reynold Xin
Other Spark Stack Projects
We will revisit Spark SQL in the SQL on Big Data lecture
- Structured Streaming: stateful, fault-tolerant stream processing
  sc.twitterStream(...)
    .flatMap(_.getText.split(" "))
    .map(word => (word, 1))
    .reduceByWindow("5s", _ + _)
– we will revisit structured streaming in the Data Streaming lecture
Still in this lecture:
- GraphX & GraphFrames: graph-processing framework
- MLlib: Library of high-quality machine learning algorithms
credits: Matei Zaharia & Xiangrui Meng
Performance
[Chart: SQL response time (s): Impala (disk), Impala (mem), Redshift, Spark SQL (disk), Spark SQL (mem)]
[Chart: Streaming throughput (MB/s/node): Storm vs Spark]
[Chart: Graph response time (min): Hadoop, Giraph, GraphX]
credits: Matei Zaharia & Xiangrui Meng
What it Means for Users
[Figure: with separate frameworks, every stage (ETL, train, query) does its own HDFS read and HDFS write; with Spark, one HDFS read feeds ETL, train, and query in a single pipeline]
credits: Matei Zaharia & Xiangrui Meng
Summary
- Hadoop: The MapReduce Framework
– The first to simplify parallel processing on big data
- You write two functions (Map, Reduce), runtime does the rest
- Tight coupling with HDFS (distributed file system), for locality
– First generic Big Data platform
- Hadoop 2.0 split functionality into HDFS, YARN and MapReduce
- Still popular on-premise, HDFS/YARN often combined with other tools
- The Spark Framework
– Generalize Map(),Reduce() to a much larger set of operations
- Join, filter, group-by, …➔ closer to database queries
– Tight coupling with Streaming, ML and Graph APIs – High(er) performance (than MapReduce)
- In-memory caching, Catalyst query optimizer, JIT compilation, ...
- More schema knowledge: RDDs ➔ DataFrames