MapReduce for Data Intensive Scientific Analyses
Jaliya Ekanayake Shrideep Pallickara Geoffrey Fox Department of Computer Science Indiana University Bloomington, IN, 47405
5/11/2009 1 Jaliya Ekanayake
Loosely synchronized (milliseconds) SPMDs
Tightly synchronized (microseconds)
[Figure: Architecture of CGL-MapReduce. Diagram labels: D, M, R, Worker Nodes]
[Figure: CGL-MapReduce, the flow of execution. Diagram labels: D, M, R, Worker Nodes]
MapReduce for HEP data analysis
Data: up to 1 terabyte of data, placed in the IU Data Capacitor
Processing: 12 dedicated computing nodes from Quarry (96 processing cores in total)
[Figure: HEP data analysis, execution time vs. the volume of data (fixed compute resources)]
[Figure: Execution time vs. the number of compute nodes (fixed data)]
[Figure: Speedup for 100 GB of HEP data]
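The speedup plot can be read with the usual definition, S(p) = T(1) / T(p), where T(p) is the execution time on p compute nodes for a fixed data set. The sketch below assumes that single-node baseline (the slides do not state which baseline was used), and the timings in it are made-up placeholders, not measurements from this talk. The function names `speedup` and `efficiency` are illustrative.

```python
def speedup(t_baseline: float, t_parallel: float) -> float:
    """Speedup S(p) = T(1) / T(p) for a fixed amount of data."""
    if t_baseline <= 0 or t_parallel <= 0:
        raise ValueError("execution times must be positive")
    return t_baseline / t_parallel


def efficiency(t_baseline: float, t_parallel: float, nodes: int) -> float:
    """Parallel efficiency: speedup divided by the number of nodes."""
    if nodes <= 0:
        raise ValueError("nodes must be positive")
    return speedup(t_baseline, t_parallel) / nodes


if __name__ == "__main__":
    # Hypothetical timings in seconds (placeholders, NOT data from the talk).
    timings = {1: 1200.0, 2: 620.0, 4: 330.0, 8: 180.0, 12: 130.0}
    baseline = timings[1]
    for nodes, seconds in sorted(timings.items()):
        s = speedup(baseline, seconds)
        e = efficiency(baseline, seconds, nodes)
        print(f"{nodes:2d} nodes: speedup {s:5.2f}, efficiency {e:4.2f}")
```

An efficiency close to 1 means the extra nodes are being used well; a falling efficiency points to overheads such as data transfer or task scheduling.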
MapReduce for Kmeans Clustering
[Figure: Kmeans Clustering, execution time vs. the number of 2D data points (both axes are in log scale)]
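Kmeans is a natural fit for iterative MapReduce: each iteration is one map/reduce round, and the new centroids feed the next round. The following is a minimal single-process sketch of that pattern for 2D points, not the CGL-MapReduce API. The names `kmeans_map`, `kmeans_reduce`, and `kmeans`, and the convergence threshold `tol`, are assumptions made for illustration.

```python
import math
from collections import defaultdict


def kmeans_map(points, centroids):
    """Map: for one data partition, emit partial sums per nearest centroid.

    Returns {centroid_index: [sum_x, sum_y, count]}.
    """
    partial = defaultdict(lambda: [0.0, 0.0, 0])
    for x, y in points:
        nearest = min(
            range(len(centroids)),
            key=lambda i: (x - centroids[i][0]) ** 2 + (y - centroids[i][1]) ** 2,
        )
        acc = partial[nearest]
        acc[0] += x
        acc[1] += y
        acc[2] += 1
    return dict(partial)


def kmeans_reduce(partials, centroids):
    """Reduce: combine partial sums from all map tasks into new centroids."""
    totals = defaultdict(lambda: [0.0, 0.0, 0])
    for partial in partials:
        for idx, (sx, sy, n) in partial.items():
            totals[idx][0] += sx
            totals[idx][1] += sy
            totals[idx][2] += n
    new_centroids = []
    for i, old in enumerate(centroids):
        if i in totals and totals[i][2] > 0:
            sx, sy, n = totals[i]
            new_centroids.append((sx / n, sy / n))
        else:
            new_centroids.append(old)  # keep a centroid that attracted no points
    return new_centroids


def kmeans(partitions, centroids, tol=1e-6, max_iter=100):
    """Driver: repeat map and reduce until the centroids stop moving."""
    for _ in range(max_iter):
        partials = [kmeans_map(part, centroids) for part in partitions]
        new_centroids = kmeans_reduce(partials, centroids)
        shift = max(math.dist(a, b) for a, b in zip(centroids, new_centroids))
        centroids = new_centroids
        if shift < tol:
            break
    return centroids
```

In this sketch the data partitions stay fixed across iterations and only the small centroid list changes between rounds, which is why a runtime that supports iteration directly can avoid reloading the input each time.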
MapReduce for Matrix Multiplication
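One common way to express matrix multiplication as MapReduce is to give each map task a block of rows of A together with B, and let the reduce step assemble the row blocks of the product in order. The sketch below shows that decomposition on plain Python lists; it is an assumed illustration and not necessarily the decomposition used in this work. The names `matmul_map`, `matmul_reduce`, `matmul`, and the parameter `rows_per_task` are hypothetical.

```python
def matmul_map(task):
    """Map: multiply one block of rows of A by the full matrix B."""
    row_offset, a_rows, b = task
    b_columns = list(zip(*b))  # transpose B once so columns are easy to walk
    block = [
        [sum(x * y for x, y in zip(row, col)) for col in b_columns]
        for row in a_rows
    ]
    return row_offset, block


def matmul_reduce(blocks):
    """Reduce: order the row blocks by offset and concatenate them."""
    result = []
    for _, block in sorted(blocks, key=lambda item: item[0]):
        result.extend(block)
    return result


def matmul(a, b, rows_per_task=1):
    """Driver: split A into row blocks, run the map tasks, then reduce."""
    tasks = [
        (start, a[start:start + rows_per_task], b)
        for start in range(0, len(a), rows_per_task)
    ]
    return matmul_reduce([matmul_map(task) for task in tasks])
```

For example, `matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])` returns `[[19, 22], [43, 50]]`. The map tasks are independent of each other, so they could run on separate worker nodes.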
Matrix Multiply Histogramming Words
[1] C. Ranger et al., "Evaluating MapReduce for Multi-core and Multiprocessor Systems."
[2] C. Chu et al., "Map-Reduce for Machine Learning on Multicore."
Feature | Hadoop | CGL-MapReduce
Implementation Language | Java | Java
Other Language Support | Uses Hadoop Streaming (text data only) | Requires Java wrapper classes
Distributed File System | HDFS | Currently assumes a shared file system between nodes
Accessing binary data from | Currently only a Java interface is available | Shared file system enables this functionality
Fault Tolerance | Supports failures of nodes | Currently does not support fault tolerance
Iterative Computations | Not supported | Supports Iterative MapReduce
Daemon Initialization | Requires ssh public key access | Requires ssh public key access