732A54 Big Data Analytics
Lecture 10: Machine Learning with MapReduce
Jose M. Peña
IDA, Linköping University, Sweden
Contents
▸ MapReduce Framework
▸ Machine Learning with MapReduce
  ▸ Neural Networks
  ▸ Support Vector Machines
  ▸ Mixture Models
  ▸ K-Means
▸ Dean, J. and Ghemawat, S. MapReduce: Simplified Data Processing on Large Clusters. Communications of the ACM 51(1):107-113, 2008.
▸ Chu, C.-T. et al. Map-Reduce for Machine Learning on Multicore. In Advances in Neural Information Processing Systems 19, 2006.
▸ Yahoo tutorial at
▸ Slides for 732A95 Introduction to Machine Learning.
▸ Large-scale machine learning problems, e.g. clustering documents from the web.
▸ Extracting properties of web pages, e.g. web access log data.
▸ Large-scale graph computations, e.g. the web link graph.
▸ Statistical machine translation.
▸ Processing satellite images.
▸ Production of the indexing system used for Google's web search engine.
▸ Map function:
  ▸ Input: A pair (in_key, in_value).
  ▸ Output: A list list(out_key, intermediate_value).
▸ Reduce function:
  ▸ Input: A pair (out_key, list(intermediate_value)).
  ▸ Output: A list list(out_value).
▸ All intermediate values associated with the same intermediate key are grouped together by the framework and passed to the same reduce call.
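The signatures above can be illustrated with the classic word-count job. Below is a minimal sketch in plain Python (not from the lecture): the `run_job` helper stands in for the framework's shuffle phase, which groups intermediate values by key between the two phases.

```python
from collections import defaultdict

def map_fn(in_key, in_value):
    """Map: (document name, document text) -> list of (word, 1) pairs."""
    return [(word, 1) for word in in_value.split()]

def reduce_fn(out_key, intermediate_values):
    """Reduce: (word, list of counts) -> list with the total count."""
    return [sum(intermediate_values)]

def run_job(inputs):
    """Sequential stand-in for the framework: map, shuffle by key, reduce."""
    groups = defaultdict(list)
    for in_key, in_value in inputs:
        for out_key, v in map_fn(in_key, in_value):
            groups[out_key].append(v)
    return {k: reduce_fn(k, vs)[0] for k, vs in groups.items()}

counts = run_job([("d1", "big data big analytics")])
```

In a real deployment the map calls run in parallel on different nodes and the shuffle happens over the network; the user only supplies `map_fn` and `reduce_fn`.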
▸ First, nodes perform local computations (map), and
▸ then, they exchange messages (reduce).
▸ Fault tolerance is necessary since thousands of nodes may be used.
▸ The master pings the workers periodically. No answer means failure.
▸ If a worker fails, then its completed and in-progress map tasks are re-executed, since their results are stored on the failed node's local disk.
▸ Note the importance of storing several copies (typically 3) of the input data in the distributed file system.
▸ If a worker fails, then its in-progress reduce task is re-executed. The results of completed reduce tasks are already stored in the global file system.
▸ To be able to recover from the unlikely event of a master failure, the master periodically checkpoints its state.
▸ M and R are typically larger than the number of nodes available.
▸ Large M and R values benefit dynamic load balancing and fast failure recovery.
▸ Too large values may imply too many scheduling decisions for the master, and too many intermediate files.
▸ For instance, M = 200000 and R = 5000 for 2000 available nodes.
[Figure: two-layer feed-forward neural network with inputs x_0, ..., x_D, hidden units z_0, ..., z_M, outputs y_1, ..., y_K, and weights w^(1) (e.g. w^(1)_MD) between inputs and hidden units and w^(2) (e.g. w^(2)_KM, w^(2)_10) between hidden units and outputs.]
▸ Map function: Compute the gradient for the samples in the piece of input data assigned to the map task.
▸ Reduce function: Sum the partial gradients and update w accordingly.
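The scheme above can be sketched as follows. For brevity this sketch (mine, not the lecture's code) uses a single linear unit y = w·x trained with squared error instead of a full multilayer network; the map/reduce split over the data is the same in both cases.

```python
import numpy as np

def map_gradient(w, split):
    """Map task: partial squared-error gradient over one piece of the data."""
    X, t = split
    return X.T @ (X @ w - t)          # sum of per-sample gradients

def reduce_update(w, partial_grads, lr):
    """Reduce task: sum the partial gradients and update w."""
    return w - lr * sum(partial_grads)

# Synthetic regression data, "distributed" over two nodes.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
t = X @ np.array([1.0, -2.0, 0.5])    # true weights to recover
splits = [(X[:50], t[:50]), (X[50:], t[50:])]

w = np.zeros(3)
for _ in range(200):                  # one map/reduce round per gradient step
    w = reduce_update(w, [map_gradient(w, s) for s in splits], lr=0.005)
```

Each gradient-descent iteration is one MapReduce round: the data never moves, only the current w and the 3-dimensional partial gradients do.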
[Figure: linear SVM decision boundary y = 0 with margin boundaries y = 1 and y = −1.]
▸ Map function: Compute the gradient for the samples in the piece of input data assigned to the map task.
▸ Reduce function: Sum the partial gradients and update w accordingly.
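For the SVM the same pattern applies with the hinge-loss subgradient. A minimal sketch (assumed details: linear SVM without bias, objective λ/2·‖w‖² + Σ max(0, 1 − t_n·w·x_n), illustrative names):

```python
import numpy as np

def map_hinge_grad(w, split):
    """Map task: hinge-loss subgradient for the samples in this split."""
    X, t = split
    active = t * (X @ w) < 1          # samples violating the margin
    return -(X[active].T @ t[active])

def reduce_update(w, partial_grads, lr=0.01, lam=0.1):
    """Reduce task: sum partial subgradients, add the regularizer, update w."""
    return w - lr * (lam * w + sum(partial_grads))

# Two linearly separable classes, "distributed" over two nodes.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(2.0, 1.0, size=(50, 2)),
               rng.normal(-2.0, 1.0, size=(50, 2))])
t = np.r_[np.ones(50), -np.ones(50)]
splits = [(X[:60], t[:60]), (X[60:], t[60:])]

w = np.zeros(2)
for _ in range(100):
    w = reduce_update(w, [map_hinge_grad(w, s) for s in splits])

acc = np.mean(np.sign(X @ w) == t)    # training accuracy
```

Only the per-sample loss changes relative to the neural-network case; the map/reduce decomposition of the gradient is identical.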
p(x) = ∑_k π_k N(x | μ_k, Σ_k)
▸ Map function: Compute the responsibilities (1) for the samples in the piece of input data assigned to the map task.
▸ Reduce function: Sum up the results (1) of the map tasks and divide by the number of samples N.
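One EM iteration for a mixture model fits this pattern directly: the E-step runs in the map tasks, the M-step in the reduce task. A sketch for a one-dimensional Gaussian mixture (my illustration, with sufficient statistics chosen so that only small summaries cross the network):

```python
import numpy as np

def gauss(x, mu, var):
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

def map_estep(split, pi, mu, var):
    """Map task (E-step): responsibilities and sufficient statistics per split."""
    g = pi * gauss(split[:, None], mu, var)   # (n, K) unnormalized
    g /= g.sum(axis=1, keepdims=True)         # responsibilities gamma(z_nk)
    return g.sum(0), g.T @ split, g.T @ split**2, len(split)

def reduce_mstep(stats):
    """Reduce task (M-step): sum partial statistics, re-estimate parameters."""
    Nk  = sum(s[0] for s in stats)
    Sx  = sum(s[1] for s in stats)
    Sxx = sum(s[2] for s in stats)
    N   = sum(s[3] for s in stats)
    mu  = Sx / Nk
    var = Sxx / Nk - mu ** 2
    pi  = Nk / N                              # sum responsibilities, divide by N
    return pi, mu, var

# Two well-separated components, data "distributed" over two nodes.
rng = np.random.default_rng(2)
data = np.r_[rng.normal(0.0, 1.0, 200), rng.normal(5.0, 1.0, 200)]
splits = [data[:150], data[150:]]

pi, mu, var = np.array([0.5, 0.5]), np.array([data.min(), data.max()]), np.array([1.0, 1.0])
for _ in range(50):                           # one map/reduce round per EM iteration
    pi, mu, var = reduce_mstep([map_estep(s, pi, mu, var) for s in splits])
```

Note that each map task emits only K scalars per statistic, regardless of how many samples its split contains.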
▸ Map function: Choose the population with the closest mean for each sample in the piece of input data assigned to the map task.
▸ Reduce function: Recalculate the population means from the results of the map tasks.
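The K-means iteration above can be sketched as follows (my illustration; each map task returns per-cluster sums and counts, so the reduce task can recompute the means without seeing the raw data):

```python
import numpy as np

def map_assign(split, means):
    """Map task: assign each sample to the closest mean; return per-cluster
    sums and counts for this split."""
    d = np.linalg.norm(split[:, None, :] - means[None], axis=2)
    labels = d.argmin(axis=1)
    sums = np.zeros_like(means)
    counts = np.zeros(len(means))
    for k in range(len(means)):
        sums[k] = split[labels == k].sum(axis=0)
        counts[k] = (labels == k).sum()
    return sums, counts

def reduce_means(partials, old_means):
    """Reduce task: merge partial sums/counts and recompute each mean."""
    sums = sum(p[0] for p in partials)
    counts = sum(p[1] for p in partials)
    # keep the old mean if a cluster received no samples
    return np.where(counts[:, None] > 0,
                    sums / np.maximum(counts, 1)[:, None], old_means)

# Two 2-D clusters, data "distributed" over two nodes.
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0.0, 1.0, size=(100, 2)),
               rng.normal(5.0, 1.0, size=(100, 2))])
splits = [X[:120], X[120:]]

means = X[[0, 100]].copy()            # initialize from one sample per cluster
for _ in range(10):                   # one map/reduce round per iteration
    means = reduce_means([map_assign(s, means) for s in splits], means)
```

As with EM, the communication per round is O(K) vectors per map task, independent of the data size.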