Big Data Programming: an Introduction
Spring 2015, X. Zhang, Fordham Univ.

Outline:
* What the course is about: scope
* Introduction to big data programming
* Opportunity and challenge of big data
* Origin of Hadoop
* High-level
* Motivation: a single disk cannot read data fast enough…
* Example application: an inverted index (to support a search engine)
* "… commodity hardware"
* "… applied to massively parallel processing"
* A search engine must find the documents containing some given words, and then rank those documents by relevance.
* To do so, it needs an index listing the documents containing each word.
Word    Documents where the word appears
the     Document 1, Document 3, Document 4, Document 5
cow     Document 2, Document 3, Document 4
says    Document 5
moo     Document 7
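An inverted index like the table above can be built with a few lines of Python; the document contents below are made up for illustration, chosen so the resulting index matches the table:

```python
# Toy corpus: document names and contents are illustrative assumptions.
docs = {
    "Document 1": "the fox",
    "Document 2": "cow jumps",
    "Document 3": "the cow",
    "Document 4": "the brown cow",
    "Document 5": "the farmer says",
    "Document 7": "moo",
}

def build_inverted_index(docs):
    # Map each word to the set of documents that contain it.
    index = {}
    for name, text in docs.items():
        for word in text.split():
            index.setdefault(word, set()).add(name)
    return index

index = build_inverted_index(docs)
print(sorted(index["the"]))
# → ['Document 1', 'Document 3', 'Document 4', 'Document 5']
```

With the index in hand, answering "which documents contain *cow*?" is a single dictionary lookup instead of a scan over every document.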
* Origin: Apache Nutch, an open-source web search engine, needed to crawl billions of web pages and index them, which could not scale on a single machine.
* Google published papers on its Google File System and MapReduce; the Nutch developers built an open-source implementation of a Distributed File System and MapReduce framework, which became Hadoop.
* Hadoop runs map and reduce jobs in a cluster environment.
* Many data-processing algorithms can be expressed as map-reduce pipelines.
HDFS is designed for:
* running on clusters of commodity hardware
* storing very large files
* streaming data access (i.e., sequential reads)
* large parallel, batch processing jobs
* fault tolerance through replication
Programming model:
* Input: a set of [key,value] pairs; Output: a set of [key,value] pairs
* Split: the input is divided among the map tasks
* Map: each map task produces intermediate [key,value] pairs: [k1, v11, v12, …], [k2, v21, v22, …], …
* Shuffle: intermediate values are grouped by key and delivered to the reduce tasks

Example: counting the number of occurrences of each word in a collection of documents.
map(String key, String value):
    // key: document name
    // value: document contents
    for each word w in value:
        EmitIntermediate(w, "1");

reduce(String key, Iterator values):
    // key: a word
    // values: a list of counts
    int result = 0;
    for each v in values:
        result += ParseInt(v);
    Emit(AsString(result));
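The pseudocode above can be simulated end to end in plain Python; this is a sketch of the map, shuffle, and reduce phases, not Hadoop code, and the sample documents are made up:

```python
from collections import defaultdict

def map_fn(doc_name, contents):
    # Emit an intermediate (word, 1) pair for every word occurrence.
    for w in contents.split():
        yield (w, 1)

def shuffle(pairs):
    # Group all intermediate values by key, as the framework would.
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    return groups

def reduce_fn(word, counts):
    # Sum the list of counts for one word.
    return (word, sum(counts))

docs = {"doc1": "the cow says moo", "doc2": "the cow"}
intermediate = [p for name, text in docs.items() for p in map_fn(name, text)]
result = dict(reduce_fn(k, vs) for k, vs in shuffle(intermediate).items())
print(result)  # → {'the': 2, 'cow': 2, 'says': 1, 'moo': 1}
```

The shuffle step is what the framework provides for free: the programmer writes only the map and reduce functions.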
Data flow: Input [k,v] pairs → intermediate [k,v] pairs → Output [k,v] pairs
A MapReduce job is a unit of work that a client/user wants performed. The Hadoop system:
* divides the job into map and reduce tasks;
* divides the input into fixed-size pieces called input splits, or simply splits;
* creates one map task for each split, which runs the user-defined map function for each record in the split.
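The split computation can be sketched as follows; the 128 MB split size is an assumption for illustration (Hadoop actually derives it from the HDFS block size and configured minimum/maximum split sizes):

```python
def input_splits(file_length, split_size=128 * 1024 * 1024):
    # Divide a file of file_length bytes into fixed-size (offset, length)
    # splits; the last split may be shorter. One map task runs per split.
    splits = []
    offset = 0
    while offset < file_length:
        length = min(split_size, file_length - offset)
        splits.append((offset, length))
        offset += length
    return splits

# A 300 MB file with 128 MB splits yields three splits: 128, 128, and 44 MB.
mb = 1024 * 1024
print(input_splits(300 * mb))
```

Splits are a logical division of the input; each map task reads only its own (offset, length) byte range, so all map tasks can run in parallel.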
The parallelism of MapReduce, combined with the very high aggregate I/O bandwidth that HDFS provides across a large cluster, makes the economics of the system extremely compelling – a key factor in Hadoop's popularity. The key is the lack of data motion: move compute to the data, rather than moving data to compute nodes over the network. Specifically, MapReduce tasks can be scheduled on the same physical nodes where the data resides in HDFS, which exposes the underlying storage layout across the cluster. Benefit: this reduces network I/O and keeps most I/O on the local disk or within the same rack.
There are two types of nodes that control the job execution process: a jobtracker and a number of tasktrackers. The jobtracker coordinates all jobs by scheduling tasks to run on tasktrackers. Tasktrackers run the tasks and send progress reports to the jobtracker, which keeps a record of the overall progress of each job. If a task fails, the jobtracker can reschedule it on a different tasktracker.
YARN splits the jobtracker's responsibilities between a global ResourceManager (RM) and a per-application ApplicationMaster (AM).
Hadoop daemons are Java processes that run in the background and communicate with each other via RPC; SSH is used by the cluster control scripts to start and stop the daemons on each node.
* The ResourceManager (RM) and the per-node NodeManager (NM) form the new, generic system for managing applications in a distributed manner.
* The ResourceManager is the ultimate authority that arbitrates resources among all applications in the system.
* A resource request incorporates resource elements such as memory, CPU, disk, network, etc.
* The per-application ApplicationMaster negotiates resources from the ResourceManager and works with the NodeManager(s) to execute and monitor the component tasks.
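As an illustration of those resource elements, the memory and CPU that a NodeManager offers to containers are configured in `yarn-site.xml`; the property names below are real YARN settings, but the values are arbitrary examples, not recommendations:

```xml
<!-- yarn-site.xml fragment: resources this NodeManager offers to containers.
     Values are illustrative only. -->
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>8192</value>   <!-- 8 GB of container memory on this node -->
</property>
<property>
  <name>yarn.nodemanager.resource.cpu-vcores</name>
  <value>4</value>      <!-- 4 virtual cores available for containers -->
</property>
```

The ResourceManager's scheduler hands out containers within these per-node limits when an ApplicationMaster requests resources.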