CloudCom2010, Indianapolis, IN
Performing Large Science Experiments on Azure: Pitfalls and Solutions
Wei Lu, Jared Jackson, Jaliya Ekanayake, Roger Barga, Nelson Araujo
Microsoft eXtreme Computing Group
Windows Azure Application Storage
…
Fabric Compute Storage
Application
Web Role: ASP.NET, WCF, etc.
Worker Role: main() { … }
Queue
2) Put work in queue
3) Get work from queue
4) Do work
To scale, add more of either
failure,
IIS
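The put-work / get-work / do-work steps above can be sketched with an in-process queue. This is only an illustration of the pattern, not Azure code: `queue.Queue` stands in for an Azure queue, threads stand in for worker role instances, and `run_job` and the squaring "work" are hypothetical placeholders.

```python
import queue
import threading


def run_job(inputs, num_workers=3):
    """Push inputs through a shared work queue consumed by several workers."""
    work_queue = queue.Queue()   # stand-in for an Azure queue (assumption)
    results = {}
    lock = threading.Lock()

    def worker():
        while True:
            item = work_queue.get()          # get work from queue
            if item is None:                 # sentinel: no more work for this worker
                work_queue.task_done()
                return
            value = item * item              # do work (placeholder computation)
            with lock:
                results[item] = value
            work_queue.task_done()

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for item in inputs:                      # put work in queue (the web role's part)
        work_queue.put(item)
    for _ in threads:                        # one sentinel per worker
        work_queue.put(None)
    work_queue.join()
    for t in threads:
        t.join()
    return results
```

Scaling in this sketch means raising `num_workers`; producers and consumers never talk to each other directly, only through the queue.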
Instance
Instance Instance
– The most important software in bioinformatics
– Identifies the similarity between bio-sequences
– Large number of pairwise alignment operations
– The size of sequence databases has been growing exponentially
– Building a local cluster
– Submitting jobs to NCBI or EBI
– Query segmentation
Splitting task → BLAST task, BLAST task, BLAST task, BLAST task, … → Merging task
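The split / BLAST / merge flow can be sketched as follows. This is a minimal sketch under assumptions: the function names are hypothetical, the input is a multi-sequence FASTA string, and `run_blast_task` is a placeholder that does not compute any alignment.

```python
def split_queries(fasta_text, sequences_per_partition):
    """Splitting task: cut a FASTA string into partitions of whole records."""
    records, current = [], []
    for line in fasta_text.splitlines():
        if line.startswith(">") and current:   # a header starts a new record
            records.append("\n".join(current))
            current = []
        if line.strip():
            current.append(line)
    if current:
        records.append("\n".join(current))
    return [
        "\n".join(records[i:i + sequences_per_partition])
        for i in range(0, len(records), sequences_per_partition)
    ]


def run_blast_task(partition):
    """Placeholder for one BLAST task: emits one line per query ID, no alignment."""
    ids = [line[1:].split()[0] for line in partition.splitlines()
           if line.startswith(">")]
    return [f"{seq_id}\tplaceholder result" for seq_id in ids]


def merge_results(per_partition_results):
    """Merging task: concatenate per-partition outputs in partition order."""
    merged = []
    for result in per_partition_results:
        merged.extend(result)
    return merged
```

Because each partition holds whole query records, the BLAST tasks are independent of one another and can run on separate workers.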
Web Role: Web Portal, Web Service (job registration)
Job Management Role: Job Scheduler
Global dispatch queue → Worker, Worker, Worker, …
Azure Table: Job Registry
Azure Blob: NCBI databases (BLAST databases, temporary data, etc.)
Database updating Role
3/31/2010 6:14 RD00155D3611B0 Executing the task 251523...
3/31/2010 6:25 RD00155D3611B0 Execution of task 251523 is done, it takes 10.9mins
3/31/2010 6:25 RD00155D3611B0 Executing the task 251553...
3/31/2010 6:44 RD00155D3611B0 Execution of task 251553 is done, it takes 19.3mins
3/31/2010 6:44 RD00155D3611B0 Executing the task 251600...
3/31/2010 7:02 RD00155D3611B0 Execution of task 251600 is done, it takes 17.27 mins
3/31/2010 8:22 RD00155D3611B0 Executing the task 251774...
3/31/2010 9:50 RD00155D3611B0 Executing the task 251895...
3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it takes 82 mins
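In the log above, task 251774 has a start entry but no completion entry. A small sketch of how such tasks could be picked out of a worker log is shown here; the two regular expressions are derived from the message wording in this log, and `find_unfinished_tasks` is a hypothetical helper.

```python
import re

# Patterns follow the wording of the log entries shown in the slide.
STARTED = re.compile(r"Executing the task (\d+)")
FINISHED = re.compile(r"Execution of task (\d+) is done")


def find_unfinished_tasks(log_text):
    """Return IDs of tasks that were started but never reported as done."""
    finished = set(FINISHED.findall(log_text))
    return [task for task in STARTED.findall(log_text) if task not in finished]


SAMPLE_LOG = """
3/31/2010 6:44 RD00155D3611B0 Executing the task 251600...
3/31/2010 7:02 RD00155D3611B0 Execution of task 251600 is done, it takes 17.27 mins
3/31/2010 8:22 RD00155D3611B0 Executing the task 251774...
3/31/2010 9:50 RD00155D3611B0 Executing the task 251895...
3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it takes 82 mins
"""

print(find_unfinished_tasks(SAMPLE_LOG))  # ['251774']
```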
Almost one day of delay. Avoid orchestrating instances with tight synchronization (e.g., barriers).
North Europe datacenter, a total of 34,265 tasks processed
Node replacement: avoid using the machine name in your program
North Europe datacenter, a total of 34,256 tasks processed
All 62 nodes lost tasks and then came back in a group
Update domain
~30 mins, ~6 nodes in one group
35 nodes experienced the blob-writing failure at the same time
West Europe datacenter; 30,976 tasks were completed, and the job was killed
A reasonable guess: the Fault Domain is working
Task 56823 needs 8 hours to complete; it was re-executed by 8 nodes because the visibilityTimeout of a message has a maximum value of 2 hours.
Two days of very low system throughput due to some long-tail tasks.
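Why a long task gets handed out again can be shown with a toy queue. This is a simplified in-memory model, not the Azure queue API: `VisibilityQueue` and `count_deliveries` are hypothetical names, time is measured in hours, and the model only counts deliveries up to the moment the first execution finishes, so it does not try to reproduce the 8 nodes observed in the run.

```python
class VisibilityQueue:
    """Toy queue: a fetched message is hidden for visibility_timeout hours."""

    def __init__(self, visibility_timeout):
        self.visibility_timeout = visibility_timeout
        self._visible_at = {}   # message -> time at which it is visible again

    def put(self, message):
        self._visible_at[message] = 0.0

    def get(self, now):
        """Return a visible message and hide it again, or None if none is visible."""
        for message, visible_at in self._visible_at.items():
            if visible_at <= now:
                self._visible_at[message] = now + self.visibility_timeout
                return message
        return None

    def delete(self, message):
        self._visible_at.pop(message, None)


def count_deliveries(task_hours, visibility_timeout, poll_interval=0.5):
    """Count how often one task is handed out before its first run finishes."""
    q = VisibilityQueue(visibility_timeout)
    q.put("task")
    deliveries = 0
    now = 0.0
    while now < task_hours:          # idle workers poll until the first run ends
        if q.get(now) is not None:
            deliveries += 1
        now += poll_interval
    q.delete("task")                 # the first worker finally deletes the message
    return deliveries


print(count_deliveries(task_hours=8, visibility_timeout=2))  # 4
print(count_deliveries(task_hours=1, visibility_timeout=2))  # 1
```

In this model any task that runs longer than the timeout is executed more than once, which wastes instances and feeds the long tail.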
North Europe datacenter, 2,058 tasks
Web Role: Web Portal, Web Service (job registration)
Job Manager Role: Job Scheduler, Scaling Engine, Parametric Engine, Sampling Filter
Dispatch Queue → Worker, Worker, Worker, …
Storage: Azure Table, Azure Blob
– Derived from Nimrod
– Each job can have
– AzureCopy
– AzureMount
– SelectBlobs
Running legacy binaries on Azure
– BLAST
– Bayesian Network Machine Learning
– Image rendering
<job name="blast">
  <prolog>
    azurecopy http://.../uniref.fasta uniref.fasta
  </prolog>
  <cmd>
    azurecopy %partition% input
    blastall.exe -p blastp -d uniref.fasta
    azurecopy output %partition%.out
  </cmd>
  <parameter name="partition">
    <selectBlobs>
      <prefix>partitions/</prefix>
    </selectBlobs>
  </parameter>
  <configure>
    <minInstances>2</minInstances>
    <maxInstances>4</maxInstances>
    <shutdownWhenDone>true</shutdownWhenDone>
    <sampling>true</sampling>
  </configure>
</job>
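How a parametric job of this shape might be expanded into concrete tasks is sketched below. This is an assumption about the mechanism, not the actual Parametric Engine: `expand_job` is a hypothetical helper, the job definition is shortened to a single command, and the blob names are made up for the example.

```python
import xml.etree.ElementTree as ET

# Shortened job definition in the same shape as the one in the slide.
JOB_XML = """
<job name="blast">
  <cmd>azurecopy %partition% input</cmd>
  <parameter name="partition">
    <selectBlobs><prefix>partitions/</prefix></selectBlobs>
  </parameter>
</job>
"""


def expand_job(job_xml, blob_names):
    """Return one command string per blob whose name matches the prefix."""
    job = ET.fromstring(job_xml.strip())
    cmd = job.findtext("cmd").strip()
    parameter = job.find("parameter")
    name = parameter.get("name")
    prefix = parameter.findtext("selectBlobs/prefix").strip()
    selected = [blob for blob in blob_names if blob.startswith(prefix)]
    # Substitute the %name% placeholder with each selected blob name.
    return [cmd.replace(f"%{name}%", blob) for blob in selected]


# Hypothetical blob listing; in a real run this would come from blob storage.
for command in expand_job(JOB_XML, ["partitions/p1", "partitions/p2", "other/x"]):
    print(command)
```

Each expanded command corresponds to one task that could be placed on the dispatch queue.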
Job Manager Role: Job Scheduler, Scaling Engine, Parametric Engine, Sampling Filter
– Checkpoint by snapshotting the task table
– A task can be incomplete
– Fix the 7-day/2-hour limitation
– Ignore the exceptions
– Retry incomplete tasks with a reduced number of instances
– Minimize the cost of failures
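The checkpoint-and-retry idea in the bullets above can be sketched as a loop over a task table. This is a minimal sketch under assumptions: `run_with_retries` is a hypothetical name, the task table is a plain dictionary rather than an Azure table, tasks run sequentially, and halving the instance count each round is an illustrative policy that is only recorded, not used for real parallelism.

```python
import copy


def run_with_retries(task_ids, execute, instances, max_rounds=3):
    """Run tasks in rounds; failed tasks stay incomplete and are retried later."""
    table = {t: {"done": False, "result": None} for t in task_ids}
    checkpoints = []   # snapshots of the task table, one per round
    plan = []          # (instances, incomplete task count) per round
    for _ in range(max_rounds):
        incomplete = [t for t, row in table.items() if not row["done"]]
        if not incomplete:
            break
        checkpoints.append(copy.deepcopy(table))   # checkpoint before the round
        plan.append((instances, len(incomplete)))
        for t in incomplete:
            try:
                table[t]["result"] = execute(t)
                table[t]["done"] = True
            except Exception:
                pass   # ignore the exception; the task stays incomplete
        instances = max(1, instances // 2)   # assumed policy: fewer instances on retry
    return table, plan, checkpoints
```

Retrying only the incomplete remainder on fewer instances keeps the cost of a failure proportional to the work that was actually lost.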
2 instances
– Sync. operation
– Async. operation
– Sync. operation
– Async. operation
– Scale out asynchronously
– Scale in synchronously
New instances join in 20–80 minutes.
Azure randomly picks the instances to shut down.
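One way to read "scale out asynchronously, scale in synchronously" is sketched below. This is an interpretation, not the actual Scaling Engine: `plan_scaling` and its sizing rule (one instance per pending or running task, clamped to the configured minimum and maximum) are assumptions. Because the platform picks the instances to shut down at random, the sketch only shrinks once no worker is busy.

```python
def plan_scaling(current, pending_tasks, busy_workers, min_instances, max_instances):
    """Return (action, instance_count) for the next scaling step."""
    # Assumed sizing rule: one instance per outstanding task, within the limits.
    desired = max(min_instances, min(max_instances, pending_tasks + busy_workers))
    if desired > current:
        # Scale out: issue the request and keep dispatching; new instances
        # simply start pulling from the queue whenever they come up.
        return ("scale-out", desired)
    if desired < current:
        if busy_workers > 0:
            # Any instance may be picked for shutdown, so wait for running
            # tasks to finish before shrinking.
            return ("wait", current)
        return ("scale-in", desired)
    return ("hold", current)
```

With the 20–80 minute join time in mind, a caller would not block on a scale-out request, while a scale-in is delayed until it cannot interrupt a running task.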