Overview of Cloud Computing Platforms
July 28, 2010 Big Data for Science Workshop
Judy Qiu
xqiu@indiana.edu http://salsahpc.indiana.edu
Pervasive Technology Institute School of Informatics and Computing Indiana University
- Multicore makes parallel computing important again: performance now comes from extra cores, not extra clock speed.
- Clouds offer a commercially supported data center model building on compute grids.
- The data deluge continues throughout life (e.g. the web!), changing how we access/use computing and the programming model.
- Data deluge, cloud technologies, eScience, and multicore/parallel computing converge in eResearch applications (biology, chemistry, physics, social science and humanities …).
There are several challenges to realizing this vision for data-intensive systems and to building generic tools (workflow, databases, algorithms, visualization).
Science faces a data deluge. How do we manage and analyze the information? The recommendation was that CSTB foster tools for data capture, data curation, and data analysis. (Jim Gray's talk to the Computer Science and Telecommunications Board (CSTB), Jan 11, 2007.)
Example data sets:
- 65,535 patient/GIS records, 54 dimensions each
- Several million gene sequences, at least 300-400 base pairs each
- 60 million chemical compounds (PubChem), 166 fingerprints each
- 1 terabyte of LHC data placed in the IU Data Capacitor
High volume and high dimension require new, efficient computing approaches!
- Data is too big, and keeps getting bigger, to fit into memory. For the "all pairs" problem, which is O(N²), 100,000 PubChem data points require about 480 GB of main memory (the Tempest cluster of 768 cores has 1.536 TB in total), so we need distributed memory and new algorithms.
- Communication overhead is large: the main operations include matrix multiplication (O(N²)), and moving data between nodes and within a node adds extra overhead.
- We therefore use a hybrid of MPI or MapReduce between nodes and concurrent threading within each node on multicore clusters.
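As a rough sanity check on that memory figure (the per-pair breakdown below is an assumption; the slide only quotes the 480 GB total):

$$ N = 10^{5} \;\Rightarrow\; N^{2} = 10^{10}\ \text{pairs}, \qquad \frac{480\ \text{GB}}{10^{10}\ \text{pairs}} \approx 48\ \text{bytes per pair}, $$

i.e. roughly six double-precision values per pair (for example the distance plus the weights and intermediate arrays that MDS and clustering keep), which is why the computation cannot fit on one node and must be spread across distributed memory.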
- Concurrent threading has side effects (for shared-memory models like CCR and OpenMP) that impact performance: the sub-block size must be chosen to fit data into cache, and cache-line padding is needed to avoid false sharing.
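The shared-memory runtimes named above are CCR and OpenMP; purely as an illustration of the cache-line padding point (the class and field names below are hypothetical, and 64-byte cache lines are assumed), per-thread accumulators in Java can be padded so that neighbouring threads' hot counters never share a line:

```java
// Hypothetical illustration of cache-line padding to avoid false sharing.
// Each worker thread updates its own PaddedSum; the pad fields push the
// hot 'value' field of different threads onto different cache lines.
final class PaddedSum {
    public long value;                       // hot field updated by one thread
    private long p1, p2, p3, p4, p5, p6, p7; // padding: fills out the cache line
}

public class FalseSharingDemo {
    public static void main(String[] args) throws InterruptedException {
        final int threads = Runtime.getRuntime().availableProcessors();
        final PaddedSum[] sums = new PaddedSum[threads];
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            sums[t] = new PaddedSum();
            final int id = t;
            workers[t] = new Thread(() -> {
                for (long i = 0; i < 100_000_000L; i++) {
                    sums[id].value += i;     // no false sharing with neighbours
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) w.join();
        long total = 0;
        for (PaddedSum s : sums) total += s.value;
        System.out.println("total = " + total);
    }
}
```

Without the padding fields, adjacent counters can land on the same cache line, and every increment by one thread invalidates the other threads' cached copies.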
Gartner 2009 Hype Curve (Source: Gartner, August 2009)
SaaS (Software as a Service): e.g., clustering is a service.
IaaS / HaaS (Infrastructure as a Service): get computer time with a credit card and a Web interface, like EC2.
PaaS (Platform as a Service): IaaS plus core software capabilities on which you build SaaS (e.g., Azure is a PaaS; MapReduce is a platform).
Cyberinfrastructure: "Research as a Service".
Authentication and Authorization: provide single sign-on to both FutureGrid and commercial clouds linked by workflow.
Workflow: support workflows that link job components between FutureGrid and commercial clouds; Trident from Microsoft Research is the initial candidate.
Data Transport: transport data between job components on FutureGrid and commercial clouds, respecting custom storage patterns.
Software as a Service: this concept is shared between clouds and grids and can be supported without special attention.
SQL: relational database.
Program Library: store images and other program material (a basic FutureGrid facility).
Blob: basic storage concept similar to Azure Blob or Amazon S3.
DPFS (Data Parallel File System): support for file systems like GFS (Google MapReduce), HDFS (Hadoop), or Cosmos (Dryad), with compute-data affinity optimized for data processing.
Table: support for table data structures modeled on Apache HBase (Google Bigtable) or Amazon SimpleDB / Azure Table (e.g. a scalable distributed "Excel").
Queues: publish-subscribe based queuing system.
Worker Role: this concept is implicitly used in both Amazon and TeraGrid but was first introduced as a high-level construct by Azure.
Web Role: used in Azure to describe the important link to the user; it can be supported in FutureGrid with a portal framework.
MapReduce: support for the MapReduce programming model, including Hadoop on Linux and Dryad.
[Diagram: data flows from instruments to disks, through data-parallel Map tasks (Map1, Map2, Map3) and a Reduce/communication stage, out to portals/users.]
Map = (data parallel) computation reading and writing data.
Reduce = collective/consolidation phase, e.g. forming multiple global sums as in a histogram.
MPI and iterative MapReduce: [diagram of map tasks feeding reduce tasks, repeated across iterations.]
Map(Key, Value) → Reduce(Key, List<Value>): data partitions feed the map tasks, and a hash function maps the results of the map tasks to r reduce tasks, which produce the reduce outputs.
MapReduce is a parallel runtime coming from information retrieval.
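A sketch of that hashing step, in the spirit of Hadoop's default hash partitioner (the class below is illustrative, not quoted from any runtime):

```java
// Illustrative partitioner: assigns each intermediate key to one of r reduce tasks.
public class HashPartitioner<K> {
    private final int numReduceTasks;

    public HashPartitioner(int numReduceTasks) {
        this.numReduceTasks = numReduceTasks;
    }

    /** The same key always goes to the same reduce task, so all its values meet there. */
    public int partition(K key) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}
```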
[Figure: an everyday analogy of MapReduce in which items are mapped to <key, value> pairs such as <a, …>, <o, …>, <p, …>, then grouped by key and reduced.]
The idea of MapReduce in data-intensive computing: a list of <key, value> pairs is mapped into another list of <key, value> pairs, which gets grouped by key and reduced into a list of values.
- Each input to a map is a list of <key, value> pairs; each output slice is a list of <key, value> pairs, grouped by key.
- Each input to a reduce is a <key, value-list> (possibly a list of these, depending on the grouping/hashing mechanism), e.g. <ao, (…)>, which is reduced into a list of values.
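To make the <key, value> flow concrete, here is a minimal word-count sketch against the Hadoop Java MapReduce API (word count is the textbook example, not one of the applications in this talk): the mapper emits <word, 1> pairs, the framework groups them by key, and the reducer sums each group.

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {
    /** Map: one input line -> a list of <word, 1> pairs. */
    public static class TokenMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);   // emit <word, 1>
                }
            }
        }
    }

    /** Reduce: <word, list of counts> -> <word, total>. */
    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}
```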
Apache Hadoop:
- Map/Reduce tasks are scheduled based on data locality in HDFS (replicated data blocks).
- Runs on Linux clusters and supports the standard MapReduce programming patterns.
[Diagram: a master node runs the Job Tracker and Name Node; data/compute nodes hold HDFS data blocks and run map (M) and reduce (R) tasks.]
Microsoft DryadLINQ:
- Execution is expressed as directed acyclic graph (DAG) based execution flows, where a vertex is an execution task and an edge is a communication path.
- The DryadLINQ compiler translates standard LINQ operations and DryadLINQ operations into jobs for the Dryad execution engine.
- The runtime handles job creation, resource management, and fault tolerance with re-execution of failed tasks/vertices.
[Figure: the Higgs in Monte Carlo data; the final merged histogram is delivered to the client.]
An application analyzing data from the Large Hadron Collider (1 terabyte now, but 100 petabytes eventually):
- Input to a map task: <key, value>, with key = some id and value = HEP file name.
- Output of a map task: <key, value>, with key = random # (0 <= num <= max reduce tasks) and value = a histogram as binary data.
- Input to a reduce task: <key, List<value>>, with key = random # (0 <= num <= max reduce tasks) and value = a list of histograms as binary data.
- Output from a reduce task: value = a histogram file.
- The outputs from the reduce tasks are combined to form the final histogram.
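A minimal sketch of the map-side histogramming and reduce-side merging described above, in plain Java (the bin count, value range, and method names are illustrative assumptions, not the talk's implementation):

```java
// Illustrative (non-Hadoop) sketch of the HEP pattern: each map task builds a
// partial histogram from one event file; each reduce task sums the partial
// histograms it receives bin by bin.
import java.util.List;

public class HistogramMerge {
    static final int BINS = 1000;                 // assumed number of histogram bins
    static final double MIN = 0.0, MAX = 500.0;   // assumed value range

    /** Map side: histogram the values found in one input file. */
    static long[] mapTask(double[] eventValues) {
        long[] hist = new long[BINS];
        for (double v : eventValues) {
            int bin = (int) ((v - MIN) / (MAX - MIN) * BINS);
            if (bin >= 0 && bin < BINS) hist[bin]++;
        }
        return hist;                              // emitted as the map output value
    }

    /** Reduce side: combine the partial histograms assigned to this reduce task. */
    static long[] reduceTask(List<long[]> partials) {
        long[] merged = new long[BINS];
        for (long[] h : partials) {
            for (int i = 0; i < BINS; i++) merged[i] += h[i];
        }
        return merged;                            // written out as a histogram file
    }
}
```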
Comparison of AWS/Azure, Hadoop, and DryadLINQ:
- Programming patterns: AWS/Azure: "master-worker" paradigm, independent job execution. Hadoop: MapReduce. DryadLINQ: DAG execution, MapReduce plus other patterns.
- Fault tolerance: AWS/Azure: task re-execution. Hadoop: re-execution of failed and slow tasks. DryadLINQ: re-execution of failed and slow tasks.
- Data storage: AWS/Azure: S3/Azure storage. Hadoop: HDFS parallel file system. DryadLINQ: local files.
- Environments: AWS/Azure: EC2/Azure clouds, local compute resources. Hadoop: Linux cluster, Amazon Elastic MapReduce. DryadLINQ: Windows HPCS cluster.
- Ease of programming: EC2: **, Azure: ***; Hadoop: ****; DryadLINQ: ****.
- Ease of use: EC2: ***, Azure: **; Hadoop: ***; DryadLINQ: ****.
- Scheduling and load balancing: AWS/Azure: dynamic scheduling through a global queue, good natural load balancing. Hadoop: data locality, rack-aware dynamic task scheduling through a global queue, good natural load balancing. DryadLINQ: data locality, network-topology-aware scheduling, static task partitions at the node level, suboptimal load balancing.
Some life sciences applications:
- EST (Expressed Sequence Tag) sequence assembly using the DNA assembly program software CAP3.
- Metagenomics and Alu repeat alignment using Smith-Waterman dissimilarity computations, followed by MPI applications for clustering and MDS (multidimensional scaling) for dimension reduction before visualization.
- Mapping the 60 million entries in PubChem into two or three dimensions to aid selection of related chemicals, with a convenient Google Earth-like browser. This uses either hierarchical MDS (which cannot be applied directly as it is O(N²)) or GTM (Generative Topographic Mapping).
- Combining patient records with Geographical Information data with over 100 attributes, using correlation computation, MDS, and genetic algorithms for choosing optimal environmental factors.
Gene sequence analysis pipeline: modern commercial gene sequencers (Illumina/Solexa, Roche/454 Life Sciences, Applied Biosystems/SOLiD) and the Internet supply a FASTA file of N sequences, which undergoes read alignment. Blocked pairwise sequence alignment (MapReduce) produces a dissimilarity matrix of N(N-1)/2 values, which feeds pairwise clustering and MDS (MPI), with final visualization in PlotViz.
The "all pairs" problem: the data is a collection of N sequences, each hundreds of characters long, and we need to calculate the N² dissimilarities (distances) between all pairs of sequences.
- Step 1: calculate the N² dissimilarities (distances) between sequences.
- Step 2: find families by clustering (using much better methods than K-means); since there are no vectors, use vector-free O(N²) methods.
- Step 3: map to 3D for visualization using multidimensional scaling (MDS), which is also O(N²).
Results: N = 50,000 runs in 10 hours (the complete pipeline above) on 768 cores.
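Because the distance matrix is symmetric, only N(N-1)/2 distances are distinct, so the pipeline can schedule upper-triangular block pairings rather than all N² cells. A small sketch of that decomposition (the block size and class name are illustrative assumptions, not the talk's code):

```java
// Illustrative block decomposition for the symmetric "all pairs" distance matrix:
// only blocks (bi, bj) with bi <= bj are computed; the lower triangle is mirrored.
public class BlockPairings {
    public static void main(String[] args) {
        int n = 50_000;          // number of sequences
        int blockSize = 1_000;   // assumed block size
        int numBlocks = (n + blockSize - 1) / blockSize;

        int blockTasks = 0;
        for (int bi = 0; bi < numBlocks; bi++) {
            for (int bj = bi; bj < numBlocks; bj++) {
                // Each (bi, bj) pairing becomes one map task computing
                // blockSize x blockSize pairwise distances (e.g. Smith-Waterman-Gotoh).
                blockTasks++;
            }
        }
        System.out.println("block tasks to schedule: " + blockTasks);
        System.out.println("distinct distances: " + ((long) n * (n - 1) / 2));
    }
}
```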
Alu families: this visualizes results for Alu repeats from chimpanzee and human genomes. Young families (green, yellow) appear as tight clusters. This is a projection by MDS dimension reduction to 3D of 35,399 repeats, each with about 400 base pairs.
Metagenomics: this visualizes results of dimension reduction to 3D of 30,000 gene sequences from an environmental sample. The many different genes are classified by a clustering algorithm and visualized by MDS dimension reduction.
Calculate pairwise distances (Smith-Waterman-Gotoh): 125 million distances took 4 hours and 46 minutes.
[Figure: execution time for 5,000 to 50,000 sequences, comparing DryadLINQ and MPI implementations.]
Moretti, C., Bui, H., Hollingsworth, K., Rich, B., Flynn, P., & Thain, D. (2009). All-Pairs: An Abstraction for Data Intensive Computing on Campus Grids. IEEE Transactions on Parallel and Distributed Systems, 21, 21-36.
Inhomogeneous Data I
[Figure: total time (s) vs. standard deviation of sequence length (50-300) for randomly distributed inhomogeneous data, mean length 400, dataset size 10,000; curves for DryadLINQ SWG, Hadoop SWG, and Hadoop SWG on a VM.]
Inhomogeneity of the data does not have a significant effect when the sequence lengths are randomly distributed.
Dryad with Windows HPCS compared to Hadoop with Linux RHEL on iDataplex (32 nodes).
Inhomogeneous Data II
[Figure: total time (s) vs. standard deviation of sequence length (50-300) for skewed distributed inhomogeneous data, mean length 400, dataset size 10,000; curves for DryadLINQ SWG, Hadoop SWG, and Hadoop SWG on a VM.]
This shows the natural load balancing of Hadoop MapReduce's dynamic task assignment through a global pipeline, in contrast to DryadLINQ's static assignment.
Dryad with Windows HPCS compared to Hadoop with Linux RHEL on iDataplex (32 nodes).
DryadLINQ outperforms Hadoop in other cases, where its data locality awareness helps.
Classification of parallel software/hardware use in terms of "application architecture" structures:
1. Synchronous: lockstep operation as in SIMD architectures. (SIMD)
2. Loosely synchronous: iterative compute-communication stages with independent compute (map) operations for each CPU; the heart of most MPI jobs. (MPP)
3. Asynchronous: computational chess, combinatorial search, often supported by dynamic threads. (MPP)
4. Pleasingly parallel: each component is independent. (MPP, grids, clouds)
5. Metaproblems: coarse-grain (asynchronous) combinations of classes 1-4; the preserve of workflow. (Grids, clouds)
6. MapReduce++: file (database) to file (database) operations, with subcategories: 1) pleasingly parallel, map only (e.g. CAP3); 2) map followed by reductions (e.g. HEP); 3) iterative "map followed by reductions", an extension of current technologies that supports much linear algebra and data mining. (Clouds, Hadoop/Dryad, Twister)
Four computation styles and example applications:
- Map only (input → map → output): CAP3 analysis, document conversion (PDF → HTML), brute-force searches in cryptography, parametric sweeps.
- Classic MapReduce (input → map → reduce): High Energy Physics (HEP) histograms and HEP data analysis, SWG gene alignment and pairwise distances for Alu sequences, distributed search, distributed sorting, information retrieval.
- Iterative reductions / MapReduce++ (input → map → reduce, iterated): expectation maximization algorithms, annealing clustering, linear algebra, multidimensional scaling (MDS).
- Loosely synchronous (Pij): many MPI scientific applications utilizing a wide variety of communication constructs, including local interactions, e.g. solving equations and particle dynamics with short-range forces.
The first three form the domain of MapReduce and its iterative extensions; the last is the domain of MPI.
Twister (MapReduce++):
- Intermediate results are streamed directly, transferred from the map tasks to the reduce tasks, which eliminates local files.
- The user program is the composer of MapReduce computations.
- It extends the MapReduce model to iterative computations: Configure() once with static data, then iterate Map(Key, Value) → Reduce(Key, List<Value>) → Combine(Key, List<Value>), and finally Close().
[Diagram: a user program and MR driver communicate with worker nodes over a pub/sub broker network; each worker runs an MRDaemon with map and reduce workers, reading data splits from the file system; static data is loaded once and only the variable data δ flows each iteration. Different synchronization and intercommunication mechanisms are used by the parallel runtimes.]
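To show why the configure-once / iterate pattern matters, here is a schematic iterative-MapReduce driver loop for K-means in plain Java. It illustrates the pattern in the diagram, not the actual Twister API; every class, method, and path name below is invented:

```java
import java.util.List;

// Schematic iterative-MapReduce driver in the Twister style (illustrative only):
// static data (the points) is configured once and cached by the map tasks;
// only the small variable data (the K centroids) flows on each iteration.
public class IterativeDriver {

    interface MapReduceJob {
        void configure(String staticDataPath);   // load & cache static data once
        double[][] run(double[][] centroids);    // one map -> reduce -> combine pass
        void close();                            // release cached data / workers
    }

    static double[][] kmeans(MapReduceJob job, double[][] initialCentroids,
                             int maxIterations, double tolerance) {
        job.configure("hdfs:///points");         // hypothetical static-data location
        double[][] centroids = initialCentroids;
        for (int iter = 0; iter < maxIterations; iter++) {
            double[][] updated = job.run(centroids);   // broadcast centroids, get new ones
            double shift = maxShift(centroids, updated);
            centroids = updated;
            if (shift < tolerance) break;              // converged
        }
        job.close();
        return centroids;
    }

    static double maxShift(double[][] a, double[][] b) {
        double max = 0.0;
        for (int k = 0; k < a.length; k++) {
            double d = 0.0;
            for (int j = 0; j < a[k].length; j++) {
                double diff = a[k][j] - b[k][j];
                d += diff * diff;
            }
            max = Math.max(max, Math.sqrt(d));
        }
        return max;
    }
}
```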
[Figures: performance of K-means; parallel overhead of matrix multiplication; Smith-Waterman.]
TwisterMPIReduce
[Diagram, top to bottom: MPI applications (pairwise clustering MPI, multidimensional scaling MPI, generative topographic mapping MPI, other …) run over TwisterMPIReduce, which maps onto Azure Twister (C#/C++) and Java Twister, running on Microsoft Azure, FutureGrid, local clusters, and Amazon EC2.]
Comparison of Google MapReduce, Apache Hadoop, Microsoft Dryad, Twister, and Azure Twister:
- Programming model: Google MapReduce: MapReduce. Hadoop: MapReduce. Dryad: DAG execution, extensible to MapReduce and other patterns. Twister: iterative MapReduce. Azure Twister: MapReduce, will extend to iterative MapReduce.
- Data handling: Google MapReduce: GFS (Google File System). Hadoop: HDFS (Hadoop Distributed File System). Dryad: shared directories and local disks. Twister: local disks and data management tools. Azure Twister: Azure Blob storage.
- Scheduling: Google MapReduce: data locality. Hadoop: data locality; rack-aware, dynamic task scheduling through a global queue. Dryad: data locality; network-topology-based run-time graph optimizations; static task partitions. Twister: data locality; static task partitions. Azure Twister: dynamic task scheduling through a global queue.
- Failure handling: Google MapReduce, Hadoop, and Dryad: re-execution of failed tasks; duplicate execution of slow tasks. Twister: re-execution of iterations. Azure Twister: re-execution of failed tasks; duplicate execution of slow tasks.
- High-level language support: Google MapReduce: Sawzall. Hadoop: Pig Latin. Dryad: DryadLINQ. Twister: Pregel has related features. Azure Twister: N/A.
- Environment: Google MapReduce: Linux cluster. Hadoop: Linux clusters, Amazon Elastic MapReduce on EC2. Dryad: Windows HPCS cluster. Twister: Linux cluster, EC2. Azure Twister: Windows Azure Compute, Windows Azure local development fabric.
- Intermediate data transfer: Google MapReduce: file. Hadoop: file, HTTP. Dryad: file, TCP pipes, shared-memory FIFOs. Twister: publish/subscribe messaging. Azure Twister: files, TCP.
- Large and high-dimensional data are everywhere: biology, physics, the Internet, … and visualization can help data analysis.
- Dimension reduction maps high-dimensional data into low dimensions (2D or 3D); processing large data sets needs parallel programming, so we are developing high-performance dimension reduction algorithms.
- Interactive visualization tool: PlotViz.
- Example: the PubChem database, with 166 features per compound.
Multidimensional Scaling (MDS) [1]: given pairwise proximity information among points, find a mapping of the data into the target dimension (here 3D points) based on that pairwise proximity information while minimizing the objective function (STRESS or SSTRESS).
Generative Topographic Mapping (GTM) [2]: find an optimal set of K representations of the given data (in 3D), known as the K-cluster problem (NP-hard); EM optimization is used, and deterministic annealing can be applied for finding a global solution; the objective is to maximize the log-likelihood.
[1] I. Borg and P. J. Groenen. Modern Multidimensional Scaling: Theory and Applications. Springer, New York, NY, U.S.A., 2005.
[2] C. Bishop, M. Svensén, and C. Williams. GTM: The generative topographic mapping. Neural Computation, 10(1):215-234, 1998.
GTM vs. MDS (SMACOF):
- Purpose: both map high-dimensional data into low dimensions (2D or 3D) for visualization.
- Objective function: GTM maximizes the log-likelihood; MDS (SMACOF) minimizes STRESS or SSTRESS.
- Complexity: GTM is O(KN) with K << N; MDS is O(N²).
- Optimization method: GTM uses EM; MDS uses iterative majorization (EM-like).
MDS is also soluble by viewing it as a nonlinear χ² problem with an iterative linear equation solver.
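For reference, the weighted STRESS objective minimized by SMACOF has the standard form from the MDS literature [1] (not spelled out on the slide), with $\delta_{ij}$ the original dissimilarity between points $i$ and $j$, $d_{ij}(X)$ the Euclidean distance between their mapped positions, and $w_{ij}$ a weight:

$$ \sigma(X) \;=\; \sum_{i<j\le N} w_{ij}\,\bigl(d_{ij}(X) - \delta_{ij}\bigr)^{2}. $$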
Chemical compounds reported in the literature, visualized by MDS (left) and GTM (right): 234,000 chemical compounds that may be related to a set of 5 genes of interest (ABCB1, CHRNB2, DRD2, ESR1, and F2), based on a dataset collected from major journal literature and also stored in the Chem2Bio2RDF system.
Out-of-sample interpolation: of the total N data points, n in-sample points are used for training (the full MDS/GTM computation on the trained data), and the remaining N - n points are interpolated into the trained MDS/GTM map.
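One simple way to place a single out-of-sample point against a few already-mapped in-sample anchors is gradient descent on its local STRESS. This is a simplified sketch of the interpolation idea, not the talk's MI-MDS or GTM interpolation algorithms, and all names and parameters are assumptions:

```java
// Simplified out-of-sample MDS interpolation: position a new point x in 3D so that
// its Euclidean distances to k already-mapped in-sample anchors match the original
// dissimilarities delta[i], by gradient descent on the local STRESS.
public class MdsInterpolate {
    static double[] interpolate(double[][] anchors, double[] delta,
                                int iterations, double step) {
        double[] x = anchors[0].clone();              // start at the nearest anchor
        for (int it = 0; it < iterations; it++) {
            double[] grad = new double[3];
            for (int i = 0; i < anchors.length; i++) {
                double dx = x[0] - anchors[i][0];
                double dy = x[1] - anchors[i][1];
                double dz = x[2] - anchors[i][2];
                double dist = Math.sqrt(dx * dx + dy * dy + dz * dz) + 1e-12;
                double coeff = 2.0 * (dist - delta[i]) / dist;  // d/dx of (dist - delta)^2
                grad[0] += coeff * dx;
                grad[1] += coeff * dy;
                grad[2] += coeff * dz;
            }
            for (int d = 0; d < 3; d++) x[d] -= step * grad[d];
        }
        return x;   // interpolated 3D coordinates of the out-of-sample point
    }
}
```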
Quality is compared for interpolated results up to 100k points, based on the sample data (12.5k, 25k, and 50k) and the original MDS result with 100k points, using weights
$$ w_{ij} = 1 \,/\, \textstyle\sum \delta_{ij}^{2}. $$
The interpolation result (blue) gets closer to the original (red) result as the sample size increases.
[Figure: quality vs. sample size 12.5K, 25K, 50K, 100K, run on 16 nodes of Tempest.] Note that we gain performance of over a factor of 100 for this data size; it would be more for larger data sets.
Cloud technologies work well for data-intensive science computations, and iterative extensions allow such problems to use the MapReduce model efficiently; the prototype Twister has been released.
Data-intensive paradigms: data-intensive applications involve the basic activities of capture, curation, preservation, and analysis (visualization), supported by cloud infrastructure and runtimes, and by parallel threading and processes.
Software stack (top to bottom):
- Applications: Smith-Waterman dissimilarities, CAP-3 gene assembly, PhyloD using DryadLINQ, high energy physics, clustering, multidimensional scaling, generative topographic mapping.
- Runtimes: Microsoft DryadLINQ / Twister / MPI; Apache Hadoop / Twister / MPI.
- Infrastructure software: Windows Server 2008 HPC bare-system; Linux bare-system; Linux virtual machines; Xen virtualization; XCAT infrastructure.
- Hardware: iDataplex bare-metal nodes.
- Services and workflow.
Acknowledgements SALSA Group
http://salsahpc.indiana.edu
Judy Qiu, Adam Hughes Jaliya Ekanayake, Thilina Gunarathne, Jong Youl Choi, Seung-Hee Bae, Yang Ruan, Hui Li, Bingjing Zhang, Saliya Ekanayake, Stephen Wu
Collaborators
Yves Brun, Peter Cherbas, Dennis Fortenberry, Roger Innes, David Nelson, Homer Twigg, Craig Stewart, Haixu Tang, Mina Rho, David Wild, Bin Cao, Qian Zhu, Maureen Biggers, Gilbert Liu, Neil Devadasan
Supported by
Research Technologies of UITS and School of Informatics and Computing