Performing Large Science Experiments on Azure: Pitfalls and Solutions

Wei Lu, Jared Jackson, Jaliya Ekanayake, Roger Barga, Nelson Araujo
Microsoft eXtreme Computing Group
CloudCom2010, Indianapolis, IN


SLIDE 1

Performing Large Science Experiments on Azure: Pitfalls and Solutions

Wei Lu, Jared Jackson, Jaliya Ekanayake, Roger Barga, Nelson Araujo
Microsoft eXtreme Computing Group

SLIDE 2

Windows Azure

Stack (diagram): Application on top of Compute and Storage, on top of the Fabric.

SLIDE 3

Suggested Application Model

Using queues for reliable messaging

Diagram: Web Role (ASP.NET, WCF, etc., hosted in IIS) → Queue → Worker Role (main() { … })

1) Receive work  2) Put work in queue  3) Get work from queue  4) Do work

To scale, add more of either

  • Decouples the system
  • Absorbs bursts
  • Resilient to instance failure
  • Easy to scale
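The put/get/do-work cycle above can be sketched with a minimal in-process stand-in for the Azure queue. This is an illustrative sketch, not the Azure storage SDK: Python's thread-safe stdlib `queue.Queue` plays the role of the Azure Queue, and threads play the role of Worker Role instances.

```python
import queue
import threading

work_queue = queue.Queue()  # stands in for the Azure Queue

def web_role(requests):
    """1) Receive work  2) Put work in queue."""
    for req in requests:
        work_queue.put(req)

def worker_role(results):
    """3) Get work from queue  4) Do work."""
    while True:
        try:
            task = work_queue.get(timeout=0.1)
        except queue.Empty:
            break                      # no more work to pull
        results.append(task * task)    # placeholder "work"
        work_queue.task_done()

results = []
web_role(range(10))
# To scale, add more worker threads (i.e., more Worker Role instances)
workers = [threading.Thread(target=worker_role, args=(results,)) for _ in range(4)]
for w in workers:
    w.start()
for w in workers:
    w.join()
print(sorted(results))  # [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
```

Because the queue decouples producer and consumer, scaling is a matter of adding instances on either side, exactly as the slide suggests.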

SLIDE 4

Azure Queue

  • Communication channel between instances

– Messages in the queue are reliable and durable

  • 7-day message lifetime

  • Fault-tolerance mechanism

– A de-queued message becomes visible again after the visibilityTimeout if it is not deleted

  • 2-hour maximum visibilityTimeout

– Requires idempotent processing
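The visibility-timeout mechanism can be illustrated with a toy simulation. This is not the real Azure API: `ToyQueue`, its fake clock, and the timeout values are all invented for illustration; only the semantics (an undeleted message reappears after the timeout) follow the slide.

```python
import heapq

class ToyQueue:
    """Simulates Azure Queue visibility semantics on a fake clock (minutes)."""
    def __init__(self):
        self.visible = []   # heap of (ready_time, message)
        self.now = 0.0

    def put(self, msg):
        heapq.heappush(self.visible, (self.now, msg))

    def get(self, visibility_timeout):
        # Pop the first message whose ready_time has passed; it becomes
        # invisible until now + visibility_timeout, then reappears
        # unless delete() removed it first.
        if self.visible and self.visible[0][0] <= self.now:
            _, msg = heapq.heappop(self.visible)
            receipt = (self.now + visibility_timeout, msg)
            heapq.heappush(self.visible, receipt)
            return msg, receipt
        return None

    def delete(self, receipt):
        self.visible.remove(receipt)
        heapq.heapify(self.visible)

q = ToyQueue()
q.put("task-1")
msg, receipt = q.get(visibility_timeout=120.0)  # worker takes the task
q.now = 130.0                    # worker crashed; 130 minutes pass
again = q.get(visibility_timeout=120.0)         # message is visible again
assert again is not None and again[0] == "task-1"
q.delete(again[1])               # idempotent processing: delete only after success
q.now = 300.0
assert q.get(120.0) is None      # deleted messages never reappear
```

This is why processing must be idempotent: after a crash the same task is handed out again, so running it twice must be harmless.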

SLIDE 5

AzureBLAST

  • BLAST (Basic Local Alignment Search Tool)

– One of the most important tools in bioinformatics
– Identifies similarity between bio-sequences

  • BLAST is highly computation-intensive

– Large number of pairwise alignment operations
– The size of sequence databases has been growing exponentially

  • Two choices for running large BLAST jobs

– Build a local cluster
– Submit jobs to NCBI or EBI

  • Long job queuing times
  • BLAST is easily parallelized

– Query segmentation

Diagram: Splitting task → BLAST task, BLAST task, …, BLAST task → Merging task
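Query segmentation along these lines can be sketched as a split step, N independent per-partition tasks, and a merge. The FASTA handling is deliberately simplified, and `run_blast` is a placeholder for invoking the real BLAST binary; the split/merge shape is what the diagram shows.

```python
def split_queries(fasta_text, n_partitions):
    """Split a FASTA query set into roughly equal partitions of sequences."""
    records = [">" + r for r in fasta_text.split(">") if r.strip()]
    parts = [[] for _ in range(n_partitions)]
    for i, rec in enumerate(records):
        parts[i % n_partitions].append(rec)   # round-robin assignment
    return ["".join(p) for p in parts]

def run_blast(partition):
    """Placeholder for a per-partition BLAST task (a blastall invocation)."""
    return f"results({partition.count('>')} queries)"

def merge(results):
    """Merging task: concatenate per-partition outputs."""
    return "\n".join(results)

fasta = ">q1\nMKV\n>q2\nGHA\n>q3\nLLQ\n>q4\nTRP\n"
parts = split_queries(fasta, 2)
report = merge([run_blast(p) for p in parts])
print(report)  # results(2 queries) on each of two lines
```

Each partition is independent, which is what lets the BLAST tasks run on separate worker instances with no coordination beyond the final merge.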

SLIDE 6

AzureBLAST

Architecture (diagram): a Web Role hosts the Web Portal and Web Service for job registration; a Job Management Role runs the Job Scheduler; Worker Roles pull from a global dispatch queue; a Database Updating Role refreshes the NCBI databases; Azure Table holds the Job Registry; Azure Blob holds BLAST databases, temporary data, etc.

SLIDE 7

All-by-All BLAST experiment

  • “All by All” query

– Compare the database against itself
– Discover homologs: inter-relationships of known protein sequences

  • Large protein database (4.2 GB)

– 9,865,668 sequences in total

  • In theory, 100 billion sequence comparisons!
  • Performance estimation

– Would require ~14 CPU-years
– One of the biggest BLAST jobs we know of

SLIDE 8

Our Solution

  • Allocated 3,776 weighted instances

– 475 extra-large instances
– From three datacenters: US South Central, West Europe, and North Europe

  • Divided the 10 million sequences into several segments

– Each segment was submitted to one datacenter as one job
– Each segment consists of smaller partitions

  • In the end the job took two weeks

– Total size of all outputs is ~230 GB

SLIDE 9

Understanding Azure by analyzing logs

  • A normal log record is an “Executing… / done” pair
  • Otherwise, something is wrong (e.g., a lost task)

3/31/2010 6:14 RD00155D3611B0 Executing the task 251523...
3/31/2010 6:25 RD00155D3611B0 Execution of task 251523 is done, it takes 10.9 mins
3/31/2010 6:25 RD00155D3611B0 Executing the task 251553...
3/31/2010 6:44 RD00155D3611B0 Execution of task 251553 is done, it takes 19.3 mins
3/31/2010 6:44 RD00155D3611B0 Executing the task 251600...
3/31/2010 7:02 RD00155D3611B0 Execution of task 251600 is done, it takes 17.27 mins
3/31/2010 8:22 RD00155D3611B0 Executing the task 251774...
3/31/2010 9:50 RD00155D3611B0 Executing the task 251895...
3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it takes 82 mins

(Note the 80-minute gap before 8:22 and the missing “done” record for task 251774.)
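Lost tasks of the kind shown above can be found mechanically by pairing “Executing” lines with “is done” lines per task ID. A sketch, with the regexes matched to the log format shown on the slide:

```python
import re

LOG = """\
3/31/2010 6:14 RD00155D3611B0 Executing the task 251523...
3/31/2010 6:25 RD00155D3611B0 Execution of task 251523 is done, it takes 10.9 mins
3/31/2010 6:44 RD00155D3611B0 Executing the task 251600...
3/31/2010 7:02 RD00155D3611B0 Execution of task 251600 is done, it takes 17.27 mins
3/31/2010 8:22 RD00155D3611B0 Executing the task 251774...
3/31/2010 9:50 RD00155D3611B0 Executing the task 251895...
3/31/2010 11:12 RD00155D3611B0 Execution of task 251895 is done, it takes 82 mins
"""

def lost_tasks(log_text):
    """Return task IDs that started but never logged completion."""
    started = set(re.findall(r"Executing the task (\d+)", log_text))
    finished = set(re.findall(r"Execution of task (\d+) is done", log_text))
    return sorted(started - finished)

print(lost_tasks(LOG))  # ['251774']
```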

SLIDE 10

Challenges & Pitfalls

  • Failures
  • Instance Idle time
  • Limitation of current Azure Queue
  • Performance/Cost Estimation
  • Minimizing the Need for Programming
SLIDE 11

Case Study 1

North Europe datacenter, 34,265 tasks processed in total

Chart annotations:
– Almost one day of delay: try not to orchestrate instances with tight synchronization (e.g., a barrier)
– Node replacement: avoid using the machine name in your program

SLIDE 12

Case Study 2

North Europe datacenter, 34,256 tasks processed in total

All 62 nodes lost tasks and then came back in a grouped fashion: this is the update domain at work (~30 mins, ~6 nodes in one group).

SLIDE 13

Case Study 3

West Europe datacenter; 30,976 tasks completed, then the job was killed

35 nodes experienced blob-write failures at the same time. A reasonable guess: the fault domain at work.

SLIDE 14

Challenges & Pitfalls

  • Failures

– Failures are to be expected, yet unpredictable

  • Design with failure in mind

– Most are automatically recovered by the cloud

  • Instance Idle time
  • Limitation of current Azure Queue
  • Performance/Cost Estimation
  • Minimizing the Need for Programming
SLIDE 15

Challenges & Pitfalls

  • Failures
  • Instance Idle time

– Gap time between two jobs
– Diversity of workload
– Load imbalance

  • Limitation of current Azure Queue
  • Performance/Cost Estimation
  • Minimizing the Need for Programming
SLIDE 16

Load imbalance

Task 56823 needed 8 hours to complete; it was re-executed by 8 nodes due to the 2-hour maximum visibilityTimeout of a message. Two days of very low system throughput due to some long-tail tasks.

North Europe datacenter, 2,058 tasks

SLIDE 17

Challenges & Pitfalls

  • Failures
  • Instance Idle time
  • Limitation of current Azure Queue

– 2-hour max value of visibilityTimeout

  • Each individual task has to be done in 2 hours

– 7-day max message life time

  • Entire experiment has to be done in less than 7 days
  • Performance/Cost Estimation
  • Minimizing the Need for Programming
SLIDE 18

Challenges & Pitfalls

  • Failures
  • Instance Idle time
  • Limitation of current Azure Queue
  • Performance/Cost Estimation

– The better you understand your application, the more money you can save
– BLAST has about 20 arguments
– VM size

  • Minimizing the Need for Programming
SLIDE 19

Cirrus: Parameter Sweeping Service on Azure

Architecture (diagram): a Web Role hosts the Web Portal and Web Service for job registration; a Job Manager Role runs the Job Scheduler together with the Scaling Engine, Parametric Engine, and Sampling Filter; Worker Roles pull from the dispatch queue; Azure Table and Azure Blob hold job state and data.

SLIDE 20

Job Definition

  • Declarative job definition

– Derived from Nimrod
– Each job can have:

  • Prolog
  • Commands
  • Parameters
  • Azure-related operators (AzureCopy, AzureMount, SelectBlobs)

  • Job configuration
  • Minimizes the programming needed to run legacy binaries on Azure

– BLAST
– Bayesian Network Machine Learning
– Image rendering

<job name="blast">
  <prolog>
    azurecopy http://.../uniref.fasta uniref.fasta
  </prolog>
  <cmd>
    azurecopy %partition% input
    blastall.exe -p blastp -d uniref.fasta -i input -o output
    azurecopy output %partition%.out
  </cmd>
  <parameter name="partition">
    <selectBlobs>
      <prefix>partitions/</prefix>
    </selectBlobs>
  </parameter>
  <configure>
    <minInstances>2</minInstances>
    <maxInstances>4</maxInstances>
    <shutdownWhenDone>true</shutdownWhenDone>
    <sampling>true</sampling>
  </configure>
</job>
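A declarative definition of this kind expands into one concrete task per parameter value. A minimal sketch of how a parametric engine could do that expansion, with a stubbed blob listing standing in for a real `selectBlobs` query against Azure storage, and a simplified XML subset:

```python
import xml.etree.ElementTree as ET

JOB_XML = """
<job name="demo">
  <cmd>azurecopy %partition% input; process input; azurecopy output %partition%.out</cmd>
  <parameter name="partition">
    <selectBlobs><prefix>partitions/</prefix></selectBlobs>
  </parameter>
</job>
"""

def select_blobs(prefix, blob_store):
    """Stub: a real implementation would list blobs from Azure storage."""
    return [b for b in blob_store if b.startswith(prefix)]

def expand(job_xml, blob_store):
    """Expand %name% placeholders in <cmd> over each parameter's values."""
    job = ET.fromstring(job_xml)
    cmd = job.findtext("cmd")
    tasks = [cmd]
    for param in job.findall("parameter"):
        name = param.get("name")
        prefix = param.findtext("selectBlobs/prefix")
        values = select_blobs(prefix, blob_store)
        # cross-product expansion: one task per value per existing task
        tasks = [t.replace(f"%{name}%", v) for t in tasks for v in values]
    return tasks

store = ["partitions/000", "partitions/001", "misc/readme"]
tasks = expand(JOB_XML, store)
print(len(tasks))  # 2, one task per matching blob
```

With several `<parameter>` elements, the same loop produces the full cross-product sweep, which is what makes the definition declarative: the user never writes the task-generation code.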


SLIDE 21

Dynamic Scaling

  • Scaling in/out for an individual job

– Fits into the [min, max] window specified in the job config
– Synchronous scaling

  • Tasks are dispatched after the scaling is done

– Asynchronous scaling

  • Task execution and the scaling operation run simultaneously

  • Scale in when load imbalance happens
  • Scale in when no new jobs arrive for a period of time

– Or when the job is configured as “shutdown-when-done”

  • Usually used for the reducing job.
SLIDE 22

Job Pause-ReConfig-Resume

  • Each job maintains a task status table

– Checkpoint by snapshotting the task table
– A task can be incomplete
– Works around the 7-day / 2-hour limitations

  • Handle exceptions optimistically

– Ignore the exceptions
– Retry incomplete tasks with a reduced number of instances
– Minimize the cost of failures

  • Handle the load imbalance
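The pause-reconfig-resume pattern amounts to snapshotting the task status table and re-queuing only the incomplete tasks on resume. A sketch; the real system would keep this table in Azure Table storage rather than a JSON string, and the task IDs here are invented:

```python
import json

def snapshot(task_table):
    """Checkpoint: persist the task status table (here, to a JSON string)."""
    return json.dumps(task_table)

def resume(checkpoint):
    """Re-queue only the tasks that were not completed before the pause."""
    table = json.loads(checkpoint)
    return [tid for tid, status in table.items() if status != "done"]

# status table as it stood when the job was paused
task_table = {"t1": "done", "t2": "incomplete", "t3": "done", "t4": "incomplete"}
ckpt = snapshot(task_table)

# ...job is paused, the instance count is reconfigured, then the job resumes...
todo = resume(ckpt)
print(todo)  # ['t2', 't4']
```

Because resume re-enqueues fresh messages, the 7-day message lifetime and 2-hour visibility timeout restart from the checkpoint instead of bounding the whole experiment.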
SLIDE 23

Performance Estimation by Sampling

  • Observation-based approach

– Randomly sample the parameter space with a sampling ratio α

  • Only dispatch the sampled tasks

– Scale in to only n′ instances to save cost

  • Assuming a uniform distribution, the estimation scales the sampled measurements up to the full run.
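The estimation formula on the original slide is an image and did not survive extraction. One plausible reconstruction under the stated uniform-distribution assumption, writing t_α for the measured makespan of the sample run, α for the sampling ratio, n′ for the sample-run instance count, and n for the full-run instance count, is:

```latex
T_{\text{full}} \;\approx\; t_{\alpha}\cdot\frac{1}{\alpha}\cdot\frac{n'}{n}
```

As a sanity check against the evaluation slide: t_α = 18 min, α = 2%, n′ = 2, n = 16 gives roughly 112 minutes, consistent with the observed 2-hour complete run.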

SLIDE 24

Evaluation

  • A complete BLAST run takes 2 hours with 16 instances
  • A 2%-sampling run, which achieves 96% estimation accuracy, takes only about 18 minutes with 2 instances
  • The overall cost of the sampling run is only 1.8% of the complete run.
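The cost figure follows from instance-time, since billing is proportional to instances × wall-clock time. A quick check of the slide's arithmetic:

```python
full_run = 16 * 120   # complete run: 16 instances x 120 min = 1920 instance-minutes
sample_run = 2 * 18   # sampling run:  2 instances x  18 min =   36 instance-minutes
ratio = 100 * sample_run / full_run
print(ratio)          # 1.875, which the slide reports as "only 1.8%"
```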
SLIDE 25

Evaluation

  • Scaling out

– Synchronous operation

  • Stalls all instances for 80 minutes

– Asynchronous operation

  • Existing instances keep working
  • New instances need 20–80 minutes to join
  • The 16-instance run is 1.4x faster

  • Scaling in

– Synchronous operation

  • Finished in 3 minutes

– Asynchronous operation

  • Causes random message loss (Azure randomly picks the instances to shut down)
  • May lead to more idle instance time

  • Best practices

– Scale out asynchronously
– Scale in synchronously

SLIDE 26

Conclusion

  • Running large-scale parameter sweeping experiments on Azure
  • Identified pitfalls

– Design with failure in mind (most failures are recoverable)
– Watch out for instance idle time
– Understand your application to save cost
– Minimize the need for programming

  • Our parameter sweeping solution

– Declarative job definition
– Dynamic scaling
– Job pause-reconfig-resume pattern
– Performance estimation