MapReduce, Hadoop and Amazon AWS
Yasser Ganjisaffar
http://www.ics.uci.edu/~yganjisa February 2011
What is Hadoop?
A software framework that supports data-intensive distributed applications. It enables applications to work with
applications.
data.
(GFS).
community of contributors, using the Java programming language.
extensively across its businesses.
http://wiki.apache.org/hadoop/PoweredBy
[Figure: HDFS. A Client, the Namenode, and Data Nodes; blocks 1, 2, and 3 each appear three times across the Data Nodes.]
This is the first time that a Java program has won this competition.
[Figure: word count data flow]
Input: Hello World Bye World Hello Hadoop Goodbye Hadoop
Split:
  Hello World Bye World
  Hello Hadoop Goodbye Hadoop

Node 1 (first split: Hello World Bye World)
  Mapper: Hello, <1> World, <1> Bye, <1> World, <1>
  Sort & Merge: Bye, <1> Hello, <1> World, <1, 1>
  Combiner: Bye, <1> Hello, <1> World, <2>

Sort & Merge of the combiner outputs
  Bye, <1> Hello, <1> World, <2>
  Goodbye, <1> Hadoop, <2> Hello, <1>
  Merged: Bye, <1> Goodbye, <1> Hadoop, <2> Hello, <1, 1> World, <2>
  Split among the reducers

Reducers
  Reducer on Node 1: Bye, <1> Goodbye, <1> Hadoop, <2>
  Reducer on Node 2: Hello, <2> World, <2>
  Write on Disk:
    part-00000: Bye 1, Goodbye 1, Hadoop 2
    part-00001: Hello 2, World 2
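The data flow above can be traced in a few lines of plain Java. This is only a sketch of the idea, not Hadoop code: it uses no Hadoop classes, the class and method names are made up for illustration, and it does not model how keys are partitioned between the two reducers.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java simulation of the word-count data flow; no Hadoop classes are used.
class WordCountFlowSketch {

    // Map + combine on one node: count one per token and sum locally.
    // A TreeMap keeps keys sorted, mimicking the sort & merge step.
    static Map<String, Integer> mapAndCombine(String split) {
        Map<String, Integer> combined = new TreeMap<>();
        for (String word : split.trim().split("\\s+")) {
            combined.merge(word, 1, Integer::sum);
        }
        return combined;
    }

    // Shuffle: group the per-node partial counts by word, e.g. Hello -> [1, 1].
    static Map<String, List<Integer>> shuffle(List<Map<String, Integer>> nodeOutputs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map<String, Integer> output : nodeOutputs) {
            for (Map.Entry<String, Integer> e : output.entrySet()) {
                grouped.computeIfAbsent(e.getKey(), k -> new ArrayList<>()).add(e.getValue());
            }
        }
        return grouped;
    }

    // Reduce: sum the list of partial counts for each word.
    static Map<String, Integer> reduce(Map<String, List<Integer>> grouped) {
        Map<String, Integer> totals = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            int sum = 0;
            for (int v : e.getValue()) {
                sum += v;
            }
            totals.put(e.getKey(), sum);
        }
        return totals;
    }

    // Runs the whole pipeline over a list of input splits.
    static Map<String, Integer> wordCount(List<String> splits) {
        List<Map<String, Integer>> nodeOutputs = new ArrayList<>();
        for (String split : splits) {
            nodeOutputs.add(mapAndCombine(split));
        }
        return reduce(shuffle(nodeOutputs));
    }

    public static void main(String[] args) {
        List<String> splits = Arrays.asList(
                "Hello World Bye World",
                "Hello Hadoop Goodbye Hadoop");
        System.out.println(wordCount(splits));
        // prints {Bye=1, Goodbye=1, Hadoop=2, Hello=2, World=2}
    }
}
```

Combining on each node before the shuffle is what turns World, <1, 1> into World, <2> early, so less data has to travel to the reducers.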
– http://www.apache.org/dyn/closer.cgi/hadoop/core/
Warning! Most of the sample code on the web is for older versions of Hadoop.
Source files are available at: http://www.ics.uci.edu/~yganjisa/files/2011/hadoop-presentation/WordCount-v1-src.zip
hadoop jar WordCount.jar edu.uci.hadoop.WordCount const.txt word-count-result
hadoop dfs -cat word-count-result/part-r-00000 > word-count.txt
sort -k2 -n -r word-count.txt | more
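The same ordering can be produced in Java if the result needs further processing in code. The sketch below assumes each line of the result file has the form word, tab, count, which is the usual layout of Hadoop's default text output; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Orders word-count lines by count, largest first, like the sort command above.
class TopWordsSketch {

    // Returns a sorted copy of the lines; ties keep their input order.
    static List<String> sortByCountDesc(List<String> lines) {
        List<String> sorted = new ArrayList<>(lines);
        sorted.sort((a, b) -> Long.compare(count(b), count(a)));
        return sorted;
    }

    // Extracts the numeric count from the second tab-separated field.
    static long count(String line) {
        return Long.parseLong(line.split("\t")[1].trim());
    }

    public static void main(String[] args) {
        // Sample lines invented for illustration, not real job output.
        List<String> lines = Arrays.asList("Bye\t1", "Hadoop\t2", "World\t5");
        for (String line : sortByCountDesc(lines)) {
            System.out.println(line);
        }
    }
}
```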
[Figure: cluster architecture. A Client Computer; a Master Node running the JobTracker; three Slave Nodes, each running a TaskTracker that runs Tasks.]
[Figure: MapReduce layer and HDFS layer. Master Node: JobTracker and TaskTracker in the MapReduce layer, NameNode and DataNode in the HDFS layer. Slave Nodes: TaskTracker in the MapReduce layer, DataNode in the HDFS layer.]
Source files are available at: http://www.ics.uci.edu/~yganjisa/files/2011/hadoop-presentation/WordCount-v2-src.zip
If a task does not report anything for 10 minutes, Hadoop kills it!
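One common way to avoid this timeout is to signal liveness from inside a long-running map or reduce call. In Hadoop's newer API this is typically done through the task context (for example `context.progress()`). The sketch below does not depend on Hadoop, so `ProgressSink` is a hypothetical stand-in for that callback, and the record counts are made up for illustration.

```java
// Hypothetical stand-in for the progress callback that Hadoop gives to a task.
interface ProgressSink {
    void progress();
}

class LongTaskSketch {

    // Processes `records` slow items and pings the framework after every
    // `every` items so the task is not considered hung. Returns the ping count.
    static int process(int records, int every, ProgressSink sink) {
        int pings = 0;
        for (int i = 1; i <= records; i++) {
            // ... expensive per-record work would go here ...
            if (i % every == 0) {
                sink.progress();
                pings++;
            }
        }
        return pings;
    }

    public static void main(String[] args) {
        int pings = process(1000, 100, () -> System.out.println("still alive"));
        System.out.println("progress reported " + pings + " times");
    }
}
```

Reporting every N records rather than on every record keeps the overhead low while still staying well inside the timeout.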
[Figure: Split -> LineRecordReader -> <offset1, line1> <offset2, line2> <offset3, line3>]
For more complex inputs, you should extend: