
MapReduce and Hadoop
Debapriyo Majumdar
Data Mining – Fall 2014
Indian Statistical Institute Kolkata
November 10, 2014

Let's keep the intro short
§ Modern data mining: process an immense amount of data quickly
§ Exploit parallelism
  – Traditional parallelism: bring data to compute
  – MapReduce: bring compute to data
Pictures courtesy: Glenn K. Lockwood, glennklockwood.com

The MapReduce paradigm
Stages: Split -> Map -> Shuffle and sort -> Reduce -> Final
Data flow: original input -> input chunks -> <Key,Value> pairs -> <Key,Value> pairs grouped by keys -> output chunks -> final output
§ The input may already be split in the filesystem
§ The output chunks may not need to be combined
§ The user needs to write the map() and the reduce()
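The stages above can be sketched as a small single-process driver. This is only an illustration of the data flow, not how Hadoop schedules work across machines; the names `run_mapreduce`, `map_readings`, and `reduce_max` and the sample data are hypothetical.

```python
from collections import defaultdict


def run_mapreduce(chunks, map_fn, reduce_fn):
    """Run map, shuffle-and-sort, and reduce over a list of input chunks."""
    # Map: every chunk is processed independently and yields (key, value) pairs.
    intermediate = []
    for chunk in chunks:
        intermediate.extend(map_fn(chunk))

    # Shuffle and sort: group the values by key.
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)

    # Reduce: one call per key, visited in sorted key order.
    output = []
    for key in sorted(groups):
        output.extend(reduce_fn(key, groups[key]))
    return output


# Hypothetical job: keep the largest value seen for each key.
def map_readings(chunk):
    for key, value in chunk:
        yield (key, value)


def reduce_max(key, values):
    yield (key, max(values))


chunks = [[("a", 3), ("b", 1)], [("a", 5), ("b", 0)]]
result = run_mapreduce(chunks, map_readings, reduce_max)
print(result)
```

Only `map_fn` and `reduce_fn` change from job to job, which mirrors the point that the user writes just the map() and the reduce().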

An example: word frequency counting
§ Problem: given a collection of documents, count the number of times each word occurs in the collection
§ Split: collection of documents -> subcollections of documents
§ Map: for each word w, emit pairs (w,1)
§ Shuffle and sort: the pairs (w,1) for the same words are grouped together
§ Reduce: count the number (n) of pairs for each w, make it (w,n)
§ Final output: (w,n) for each w

An example: word frequency counting
[Figure: chunks of fruit names pass through Split, Map, Shuffle and sort, and Reduce]
§ Map: for each word w, output pairs (w,1), for example (apple,1), (orange,1), (peach,1)
§ Shuffle and sort: pairs with the same word are grouped together
§ Reduce: count the number (n) of pairs for each w, make it (w,n)
§ Final output: (apple,2), (orange,3), (guava,1), (plum,2), (cherry,2), (fig,2), (peach,3)
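A minimal sketch of the word count job in plain Python, run in a single process. The function names and the three input chunks are made up for illustration and are not the exact chunks from the slide's figure.

```python
from collections import defaultdict


def map_words(chunk):
    """Map: for each word w in the chunk, emit the pair (w, 1)."""
    for word in chunk.split():
        yield (word, 1)


def shuffle(pairs):
    """Shuffle and sort: group the emitted values by word."""
    groups = defaultdict(list)
    for word, one in pairs:
        groups[word].append(one)
    return groups


def reduce_count(word, ones):
    """Reduce: count the number n of pairs for the word, giving (w, n)."""
    return (word, len(ones))


def word_count(chunks):
    pairs = [pair for chunk in chunks for pair in map_words(chunk)]
    groups = shuffle(pairs)
    return dict(reduce_count(word, ones) for word, ones in groups.items())


# Hypothetical input chunks (subcollections of documents).
chunks = ["apple orange peach", "orange apple", "peach fig peach"]
counts = word_count(chunks)
print(counts)
```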

HADOOP
Apache Hadoop: an open source MapReduce framework

Hadoop
§ Two main components
  – Hadoop Distributed File System (HDFS): to store data
  – MapReduce engine: to process data
§ Master – slave architecture using commodity servers
§ The HDFS
  – Master: Namenode
  – Slave: Datanode
§ MapReduce
  – Master: JobTracker
  – Slave: TaskTracker

HDFS: Blocks
[Figure: a big file split into Blocks 1 to 6, with copies of the blocks distributed across Datanodes 1 to 4]
§ Runs on top of the existing filesystem
§ Blocks are 64MB (128MB recommended)
§ A single file can be larger than any single disk
§ POSIX based permissions
§ Fault tolerant
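The block arithmetic can be illustrated with a short sketch. The round-robin placement below is a deliberately simple stand-in and is not HDFS's actual placement policy. The replication factor of 3 is HDFS's usual default and is not stated on the slide. The function and datanode names are hypothetical.

```python
BLOCK_SIZE = 64 * 1024 * 1024  # 64MB, the block size quoted on the slide


def num_blocks(file_size, block_size=BLOCK_SIZE):
    """Number of blocks needed for a file (the last block may be partial)."""
    return (file_size + block_size - 1) // block_size


def place_blocks(n_blocks, datanodes, replication=3):
    """Assign each block to `replication` distinct datanodes, round-robin.

    Assumption: a toy policy for illustration only.
    """
    if replication > len(datanodes):
        raise ValueError("need at least as many datanodes as replicas")
    placement = {}
    for b in range(n_blocks):
        placement[b + 1] = [
            datanodes[(b + r) % len(datanodes)] for r in range(replication)
        ]
    return placement


# A hypothetical 350MB file needs 6 blocks of 64MB.
n = num_blocks(350 * 1024 * 1024)
placement = place_blocks(n, ["dn1", "dn2", "dn3", "dn4"])
for block, nodes in placement.items():
    print(block, nodes)
```

Because every block lives on more than one datanode, losing a single node does not lose any block, which is the basis of the fault tolerance listed above.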

HDFS: Namenode and Datanode
§ Namenode
  – Only one per Hadoop cluster
  – Manages the filesystem namespace
    • The filesystem tree
    • An edit log
    • For each block i, the datanode(s) in which block i is saved
    • All the blocks residing in each datanode
§ Secondary Namenode
  – Keeps periodic checkpoints of the namespace by merging the edit log into the filesystem image; it is not a hot standby for the Namenode
§ Datanodes
  – Many per Hadoop cluster
  – Control block operations
  – Physically store the blocks on the nodes
  – Perform the physical replication

HDFS: an example
[Figure]

MapReduce: JobTracker and TaskTracker
1. JobClient submits the job to the JobTracker; the binary is copied into HDFS
2. JobTracker talks to the Namenode
3. JobTracker creates the execution plan
4. JobTracker submits work to the TaskTrackers
5. TaskTrackers report progress via heartbeat
6. JobTracker updates the status

Map, Shuffle and Reduce: internal steps
1. Splits data up to send it to the mapper
2. Transforms splits into key/value pairs
3. Key/value pairs with the same key are sent to the same reducer
4. Aggregates key/value pairs based on user-defined code
5. Determines how the results are saved
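Step 3 is usually achieved by hashing the key, so that every pair with the same key maps to the same reducer index. The sketch below assumes string keys and uses CRC32 as a stable hash; the function names are hypothetical and this is not Hadoop's own partitioner code.

```python
import zlib
from collections import defaultdict


def partition(key, num_reducers):
    """Pick a reducer index for a key; equal keys always get the same index."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def shuffle(pairs, num_reducers):
    """Route each (key, value) pair to its reducer and group values by key."""
    reducers = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in pairs:
        reducers[partition(key, num_reducers)][key].append(value)
    return reducers


pairs = [("apple", 1), ("fig", 1), ("apple", 1), ("peach", 1)]
buckets = shuffle(pairs, num_reducers=3)
for index, bucket in enumerate(buckets):
    print(index, dict(bucket))
```

A stable hash is used instead of Python's built-in `hash()` because string hashes are randomized per process, and mappers running in different processes must agree on the reducer for a key.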

Fault Tolerance
§ If the master fails
  – MapReduce would fail; the entire job has to be restarted
§ A map worker node fails
  – Master detects the failure (the periodic ping times out)
  – All the map tasks for this node have to be restarted
    • Even completed map tasks, because their output was stored on the failed node
§ A reduce worker fails
  – Master sets the status of its currently executing reduce tasks to idle
  – These tasks are rescheduled on another reduce worker
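The worker-failure rules above can be written down as a small bookkeeping routine on the master. This is a sketch under assumed data structures: the `Task` class, its state names, and `handle_worker_failure` are hypothetical and not part of Hadoop's API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Task:
    task_id: str
    kind: str                     # "map" or "reduce"
    state: str = "idle"           # "idle", "in_progress" or "completed"
    worker: Optional[str] = None  # worker the task was assigned to


def handle_worker_failure(tasks, failed_worker):
    """Reset the tasks lost with a failed worker and return their ids."""
    rescheduled = []
    for task in tasks:
        if task.worker != failed_worker:
            continue
        # Map output lives on the worker's local disk, so even completed
        # map tasks are lost and must run again.
        lost_map = task.kind == "map" and task.state in ("in_progress", "completed")
        # Only reduce tasks that were still executing are reset.
        lost_reduce = task.kind == "reduce" and task.state == "in_progress"
        if lost_map or lost_reduce:
            task.state = "idle"
            task.worker = None
            rescheduled.append(task.task_id)
    return rescheduled


tasks = [
    Task("m1", "map", "completed", "w1"),
    Task("m2", "map", "in_progress", "w2"),
    Task("r1", "reduce", "completed", "w1"),
    Task("r2", "reduce", "in_progress", "w1"),
]
lost = handle_worker_failure(tasks, "w1")
print(lost)
```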

USING MAPREDUCE
Some algorithms using MapReduce

Matrix – Vector Multiplication
§ Multiply $M = (m_{ij})$ (an $n \times n$ matrix) and $v = (v_i)$ (an $n$-vector)
§ If $n = 1000$, no need of MapReduce!
§ $Mv = (x_i)$, where $x_i = \sum_{j=1}^{n} m_{ij} v_j$
Case 1: Large $n$, $M$ does not fit into main memory, but $v$ does
§ Since $v$ fits into main memory, $v$ is available to every map task
§ Map: for each matrix element $m_{ij}$, emit the key-value pair $(i, m_{ij} v_j)$
§ Shuffle and sort: groups all $m_{ij} v_j$ values together for the same $i$
§ Reduce: sum $m_{ij} v_j$ over all $j$ for the same $i$
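Case 1 can be sketched in a few lines, assuming the matrix arrives as a stream of `(i, j, m_ij)` triples with 0-based indices and that `v` is an in-memory list. The function names and the 2 × 2 example are illustrative assumptions.

```python
from collections import defaultdict


def map_element(i, j, m_ij, v):
    """Map: for matrix element m_ij, emit the pair (i, m_ij * v_j)."""
    return (i, m_ij * v[j])


def matvec_mapreduce(entries, v):
    # Shuffle and sort: collect all products m_ij * v_j for the same row i.
    groups = defaultdict(list)
    for i, j, m_ij in entries:
        key, value = map_element(i, j, m_ij, v)
        groups[key].append(value)
    # Reduce: sum the products for each row i to get x_i.
    return {i: sum(products) for i, products in groups.items()}


# M = [[1, 2], [3, 4]] stored as (row, column, value) triples; v = [5, 6].
entries = [(0, 0, 1), (0, 1, 2), (1, 0, 3), (1, 1, 4)]
x = matvec_mapreduce(entries, [5, 6])
print(x)
```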

Matrix – Vector Multiplication
§ Multiply $M = (m_{ij})$ (an $n \times n$ matrix) and $v = (v_i)$ (an $n$-vector)
§ If $n = 1000$, no need of MapReduce!
§ $Mv = (x_i)$, where $x_i = \sum_{j=1}^{n} m_{ij} v_j$
[Figure: only a part of $v$ fits into main memory; the whole of $v$ no longer does]
Case 2: Very large $n$, even $v$ does not fit into main memory
§ For every map, many accesses to disk (for parts of $v$) are required!
§ Solution:
  – How much of $v$ will fit in?
  – Partition $v$ and the rows of $M$ so that each partition of $v$ fits into memory
  – Take the dot product of one partition of $v$ and the corresponding partition of $M$
  – Map and reduce same as before
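One way to read the partitioning step is to cut `v` into contiguous stripes and to send each matrix element to the stripe that contains its column index. The sketch below follows that reading. The stripe width, the 0-based indices, and the function names are assumptions, and the slice of `v` stands in for "the part of `v` loaded into memory".

```python
from collections import defaultdict


def stripe_of(j, width):
    """Index of the stripe of v (and of the columns of M) that holds column j."""
    return j // width


def matvec_striped(entries, v, width):
    # Partition the matrix entries by the stripe of their column index.
    by_stripe = defaultdict(list)
    for i, j, m_ij in entries:
        by_stripe[stripe_of(j, width)].append((i, j, m_ij))

    groups = defaultdict(list)
    for s, stripe_entries in by_stripe.items():
        start = s * width
        v_part = v[start:start + width]  # only this part of v is needed here
        # Map: same as before, but indexing into the local part of v.
        for i, j, m_ij in stripe_entries:
            groups[i].append(m_ij * v_part[j - start])

    # Reduce: same as before, sum the partial products for each row i.
    return {i: sum(products) for i, products in groups.items()}


M = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
entries = [(i, j, M[i][j]) for i in range(3) for j in range(3)]
x = matvec_striped(entries, [1, 0, 2], width=2)
print(x)
```

Each row's sum is assembled from partial dot products, one per stripe, so the reducer is unchanged from Case 1.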

Relational Algebra
§ $R(A_1, A_2, \ldots, A_n)$ is a relation with attributes $A_i$
§ Schema: set of attributes
§ Selection on condition $C$: apply $C$ on each tuple in $R$, output only those which satisfy $C$
§ Projection on a subset $S$ of attributes: output the components for the attributes in $S$
§ Union, Intersection, Join…

Example relation
Attr 1   Attr 2   Attr 3   Attr 4
xyz      abc      1        true
abc      xyz      1        true
xyz      def      1        false
bcd      def      2        true

Links between URLs
URL1   URL2
url1   url2
url2   url1
url3   url5
url1   url3
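Selection and projection can be tried out on the example relation with plain Python sets, before any MapReduce is involved. The helper names `select` and `project` and the attribute labels are illustrative assumptions.

```python
ATTRS = ("Attr1", "Attr2", "Attr3", "Attr4")

# The example relation from the slide, as a set of tuples.
R = {
    ("xyz", "abc", 1, True),
    ("abc", "xyz", 1, True),
    ("xyz", "def", 1, False),
    ("bcd", "def", 2, True),
}


def select(relation, condition):
    """Selection: keep only the tuples that satisfy the condition C."""
    return {t for t in relation if condition(t)}


def project(relation, attrs, subset):
    """Projection: keep only the components for the attributes in S."""
    positions = [attrs.index(a) for a in subset]
    return {tuple(t[p] for p in positions) for t in relation}


selected = select(R, lambda t: t[0] == "xyz")         # C: Attr1 = xyz
projected = project(R, ATTRS, ("Attr3", "Attr4"))     # S = {Attr3, Attr4}
print(sorted(selected))
print(sorted(projected))
```

Since a relation is a set, the projection collapses the two tuples that share the components (1, true) into one.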

Selection using MapReduce
§ Trivial
§ Map: for each tuple $t$ in $R$, test if $t$ satisfies $C$. If so, produce the key-value pair $(t, t)$.
§ Reduce: the identity function. It simply passes each key-value pair to the output.
Example relation, Links between URLs (URL1, URL2): (url1, url2), (url2, url1), (url3, url5), (url1, url3)
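A minimal single-process sketch of this job on the links relation, with the hypothetical condition C "URL1 equals url1". The function names are illustrative.

```python
from collections import defaultdict


def map_select(t, condition):
    """Map: emit (t, t) only if the tuple t satisfies the condition C."""
    if condition(t):
        yield (t, t)


def reduce_identity(key, values):
    """Reduce: the identity function, pass each key-value pair through."""
    for value in values:
        yield (key, value)


def select_mapreduce(relation, condition):
    groups = defaultdict(list)
    for t in relation:
        for key, value in map_select(t, condition):
            groups[key].append(value)
    return [
        pair
        for key, values in groups.items()
        for pair in reduce_identity(key, values)
    ]


links = [("url1", "url2"), ("url2", "url1"), ("url3", "url5"), ("url1", "url3")]
result = select_mapreduce(links, lambda t: t[0] == "url1")
print([key for key, _ in result])
```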

Union using MapReduce
§ Union of two relations $R$ and $S$
§ Suppose $R$ and $S$ have the same schema
§ Map tasks are generated from chunks of both $R$ and $S$
§ Map: for each tuple $t$, produce the key-value pair $(t, t)$
§ Reduce: only needs to remove duplicates
  – For each key $t$, there would be either one or two values
  – Output $(t, t)$ in either case
Example relation, Links between URLs (URL1, URL2): (url1, url2), (url2, url1), (url3, url5), (url1, url3)
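The union job can be sketched the same way. The two small relations below are made-up subsets of the links table, and the function names are hypothetical.

```python
from collections import defaultdict


def map_tuple(t):
    """Map: for each tuple t, from R or from S, emit (t, t)."""
    return (t, t)


def reduce_union(key, values):
    """Reduce: a key has one or two values; output (t, t) once either way."""
    return (key, key)


def union_mapreduce(r, s):
    groups = defaultdict(list)
    for t in list(r) + list(s):  # map tasks read chunks of both relations
        key, value = map_tuple(t)
        groups[key].append(value)
    return [reduce_union(key, values) for key, values in groups.items()]


R = [("url1", "url2"), ("url2", "url1")]
S = [("url2", "url1"), ("url3", "url5")]
union = union_mapreduce(R, S)
print(sorted(key for key, _ in union))
```

The tuple (url2, url1) occurs in both relations, so its key arrives at the reducer with two values and is still output only once.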
