SLIDE 1

Designing Apps for Amazon Web Services

Mathias Meyer, GOTO Aarhus 2011

Monday, 10 October 2011

SLIDE 2

SLIDE 3

Me

♥ infrastructure ♥ code ♥ databases @roidrage www.paperplanes.de

SLIDE 4

The Cloud

Unlimited resources, whenever you need them

SLIDE 5

SLIDE 6

SLIDE 7

SLIDE 8

10,000 feet

SLIDE 9

Amazon Web Services

SLIDE 10

EC2

SLIDE 11

On-demand computing

Utility computing

SLIDE 12

API

It’s not a cloud if it doesn’t have an API.

SLIDE 13

Pay as you go

SLIDE 14

Multiple regions

Geographically distributed, for all web services: US East, US West, EU, Singapore, Japan.

SLIDE 15

Multiple datacenters

Called availability zones. At least two data centers in each region, in physically separate locations. The API endpoint for a region is not specific to any one availability zone.

SLIDE 16

Different instance types

SLIDE 17

High CPU vs. High memory

One CPU core is about the power of a 2007 Xeon processor.

SLIDE 18

1.7 GB to 68 GB

SLIDE 19

Elastic Block Store

Storage on instances is ephemeral: it goes away when the instance goes away. EBS allows persisting data for longer than an instance’s lifetime. An EBS volume is bound to a data center.

SLIDE 20

Mount to any instance

A large number of block volumes can be attached to a single instance.

SLIDE 21

Snapshots

A point-in-time, atomic snapshot of an EBS volume. Not a reliable backup on its own, but one of several means for backups.

SLIDE 22

More AWS Products

S3, CloudFront, CloudFormation, CloudWatch, RDS, Auto Scaling, SimpleDB, Route 53, Load Balancing, Queue Service, Notification Service, Elastic MapReduce

SLIDE 23

What’s Scalarium?

SLIDE 24

Automates: setup, configuration, one-click deploy

SLIDE 25

...for EC2 ...on EC2

Scalarium uses a couple of Amazon’s web services; EC2 is the most interesting one. It runs its own monitoring, because customers like saving money and CloudWatch costs money.

SLIDE 26

Automation

Scalarium automates so that customers don’t have to. Automation is the most important part of deploying in the cloud: every manual change is lost when an instance goes down.

SLIDE 27

The Dream: Configure a cluster. Push a button. Boom!

SLIDE 28

Two ways...

SLIDE 29

Create image, boot it.

SLIDE 30

Build once, use forever

A point-in-time snapshot of a fully configured system: app server, database, web server, etc.

SLIDE 31

How do you handle updates?

Build a new image, install updates, discard old image. Lather, rinse, repeat, with every update.

SLIDE 32

Configure from scratch

SLIDE 33

Start with a blank slate. A clean operating system installation.

SLIDE 34

Use a configuration management tool. It abstracts the installation of packages, the writing of configuration files, the handling of file systems, etc.

SLIDE 35

Configuration + Cookbooks/Manifests + Chef/Puppet/etc. = Configured cluster

Configuration describes the final result of what the system should look like.
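
A minimal sketch of what "describing the final result" can mean in practice: compare the desired state with the current state and derive only the actions that are still needed. This is a toy model in Python, not how Chef or Puppet are implemented; the function names and the package-to-version dictionaries are assumptions made for illustration.

```python
def converge(desired, installed):
    """Return the actions needed to move `installed` towards `desired`.

    Both arguments map a package name to a version string (hypothetical
    data model). Converging an already converged system returns no actions.
    """
    actions = []
    for name, version in sorted(desired.items()):
        if installed.get(name) != version:
            actions.append(("install", name, version))
    for name in sorted(set(installed) - set(desired)):
        actions.append(("remove", name, installed[name]))
    return actions


def apply_actions(actions, installed):
    """Apply the actions to a copy of `installed` and return the new state."""
    state = dict(installed)
    for verb, name, version in actions:
        if verb == "install":
            state[name] = version
        else:
            state.pop(name, None)
    return state
```

Running `converge` a second time on the resulting state yields an empty list, which is the property that makes re-running configuration on every boot safe.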

SLIDE 36

Configuration: Chef Server, RightScale, JSON, Scalarium

There’s a seemingly infinite number of infrastructure service providers, and just as many ways to store cluster configuration.

SLIDE 37

A Web UI to automate and simplify the Amazon APIs and setting up servers/clusters.

SLIDE 38

In the beginning...

SLIDE 39

Scalarium

Scalarium lasted for a few months on just one instance. That instance ran RabbitMQ, CouchDB, Redis, background workers, and the web and application servers. Bootstrapped startup = start small, iterate quickly.

SLIDE 40

Eventually became overloaded.

SLIDE 41

SLIDE 42

SLIDE 43

Automate, automate, automate!

Chef for everything. Yes, our first instance was not fully automated. Hypocritical. Vagrant is an excellent tool for testing locally. Scalarium customers can test their own cookbooks and ours locally. Automate setup, configuration, re-configuration of services, everything!

SLIDE 44

EC2 is not a traditional datacenter

SLIDE 45

It doesn’t look like this.

SLIDE 46

It still looks like this, but that’s transparent to you. You have no operational access.

SLIDE 47

Multi-tenant

Shared resources (CPU, memory, network). Your instance likely shares resources with several other EC2 customers on the same physical host.

SLIDE 48

High likelihood of failure

It’s still hardware that’s running your servers. Hardware fails.

SLIDE 49

Faulty instances

They fail. Not all the time, but if you have high turnover from scaling up and down, some will fail. Discard, boot a new instance, done.

SLIDE 50

Datacenter outage

SLIDE 51

Network partition

SLIDE 52

More instances = Higher chance of failure
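
A small sketch of why this holds, under the simplifying assumption that instances fail independently of each other and with the same probability (real failures are often correlated, as the outages later in this talk show). The function name and the numbers are illustrative only.

```python
def chance_of_any_failure(per_instance_probability, instance_count):
    """Probability that at least one of `instance_count` instances fails.

    Assumes independent failures with identical probability, which is a
    simplification made for this sketch.
    """
    chance_all_survive = (1.0 - per_instance_probability) ** instance_count
    return 1.0 - chance_all_survive
```

With a hypothetical 1% failure chance per instance over some period, a single instance gives a 1% chance of seeing a failure, while 100 instances give roughly 63%.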

SLIDE 53

MTBF

Mean Time Between Failures. On EC2 as a whole it’s pretty small. Not an important metric: just because something fails doesn’t mean you have to be affected.

SLIDE 54

21/04/2011

US-East data centers became unavailable. Cascading failures in EBS storage. Recovery took four days.

SLIDE 55

The big chance for nay-sayers and cloud haters.

SLIDE 56

The big chance for nay-sayers and cloud haters.

SLIDE 57

The big chance for nay-sayers and cloud haters.

SLIDE 58

The big chance for nay-sayers and cloud haters.

SLIDE 59

The big chance for nay-sayers and cloud haters.

SLIDE 60

7/8/2011

Downtime in EU data centers. A lightning strike caused a power outage. Again, a cascading failure in the EBS storage layer. More than three days until full recovery.

SLIDE 61

SLIDE 62

SLIDE 63

Failure is a good thing

SLIDE 64

You can ignore it

SLIDE 65

Learn from it

SLIDE 66

Design for it

SLIDE 67

Don’t fear failure

SLIDE 68

Plan for failure

Failure becomes a part of your app’s lifecycle. Deploying in the cloud has a bigger effect on culture than it does on your application.

SLIDE 69

Design for resilience

When failures happen, your app should handle them gracefully instead of breaking entirely. Serve static content instead of failure notices to the user.
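
A minimal sketch of serving static content when the dynamic path fails. The helper name and the idea of passing a pre-rendered page as a string are assumptions for illustration; a real app would also log the error and likely catch narrower exception types.

```python
def render_with_fallback(render_dynamic, static_page):
    """Return the dynamic page, or the static page if rendering fails.

    `render_dynamic` is any zero-argument callable (for example one that
    queries a database); `static_page` is content prepared ahead of time.
    """
    try:
        return render_dynamic()
    except Exception:
        # The backend is unavailable: degrade instead of showing an error.
        return static_page
```

The user still sees a usable page while the failing component recovers.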

SLIDE 70

Plan for recovery

Can you re-deploy your app into a different region quickly? If not, why not?

SLIDE 71

MTTR
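
MTTR is the mean time to recovery. A common textbook way to relate it to MTBF is steady-state availability, MTBF / (MTBF + MTTR). The sketch below uses that standard formula; it is not taken from the slides, and the hours in the usage note are made up.

```python
def availability(mtbf_hours, mttr_hours):
    """Steady-state availability as the fraction of time a service is up."""
    return mtbf_hours / (mtbf_hours + mttr_hours)
```

For the same hypothetical MTBF of 99 hours, cutting recovery time from 1 hour to 6 minutes raises availability from 99% to about 99.9%, which is why fast recovery can matter more than rare failure.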

SLIDE 72

Disaster recovery plan

What do you do when your site goes down? What do you do when you need to restore data? Plan, verify, one click.

SLIDE 73

Multi-datacenter deployments

Deploy the app across multiple data centers/availability zones. Make deploying to different data centers part of the deployment process. Staged deployments: boot a new set of instances, flip the load balancer.

SLIDE 74

Replication

Storage becomes a key part in handling failure. Everything else is usually much easier to scale and distribute. Replicate data across availability zones, across regions.

SLIDE 75

Availability zones are geographically distributed. Reading across them means increased latency. Replication ensures data is in multiple geographic locations and allows you to recover quickly by moving to a different data center. Not all databases do this well, but they do it.

SLIDE 76

Multi-region deployments

If you need to be very highly available.

SLIDE 77

$$$

Deploying in a highly distributed way is expensive. How distributed you go depends on your budget, and on how much availability is worth to your business.

SLIDE 78

Relax consistency requirements

Strong consistency increases the need for full availability. Distribute and partition data across different instances and different data centers.

SLIDE 79

Latency

Latency immediately became an issue when we scaled out. Network latency adds to EBS latency, which made for higher response times from the database. All network traffic on EC2 is firewalled, even internal traffic.

SLIDE 80

Keep data local

SLIDE 81

SLIDE 82

Keep data in memory

SLIDE 83

Cache is king

Disk is expensive because it touches the network.
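
One way to keep hot data in memory is a small read-through cache with a time-to-live. This is a sketch under assumptions: the class name, the `loader` callback standing in for a slow disk or network read, and the injectable clock are all illustrative, and a production cache would also need size limits and thread safety.

```python
import time


class TTLCache:
    """In-memory read-through cache whose entries expire after a TTL."""

    def __init__(self, loader, ttl_seconds, clock=time.monotonic):
        self._loader = loader      # called on a miss, e.g. a database read
        self._ttl = ttl_seconds
        self._clock = clock        # injectable so tests need not wait
        self._entries = {}         # key -> (expires_at, value)

    def get(self, key):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]        # fresh hit: no disk or network access
        value = self._loader(key)  # miss or expired: load and remember
        self._entries[key] = (now + self._ttl, value)
        return value
```

Repeated reads of the same key within the TTL never leave the process, so they avoid both network and EBS latency.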

SLIDE 84

Use RAIDs for EBS

Increased performance, better network utilization. EBS performance is okay, but not great. Don’t expect SATA- or SAS-like performance. RAID 0, 5 or 10. EBS is redundant, but extra redundancy with striping doesn’t hurt. Recovery is more likely when one EBS volume fails. RAIDs won’t save you from EBS unavailability.

SLIDE 85

Use local storage

Don’t use EBS at all. Local storage requires redundancy: instance storage is lost when the instance is lost. RAID across local storage. More reliable in terms of I/O than EBS. Services that used local storage were mostly unaffected by the EC2 outages.

SLIDE 86

Use bigger instances

The bigger your instance, the less it is shared on the host. Bigger instances have higher I/O throughput.

SLIDE 87

What would I do differently?

SLIDE 88

Small services

Small services can run independently of each other. Small services are easy to deploy and easy to reconfigure (Chef). They don’t have to know about all the other services upfront; leave that to CM tools. A layered system of small services allows failure handling on every layer. Failure in one layer doesn’t have to drag down the rest.

SLIDE 89

Frontend vs. Small APIs

SLIDE 90

Fail fast

When components fail, don’t block waiting for them. Time out quickly. Circuit breaker: track failures and fail operations immediately if you know they’re likely to fail. Recover when it’s safe again.
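
A minimal circuit breaker sketch along the lines described above. The class and parameter names, the default threshold, and the cool-down are assumptions for illustration; real implementations usually add per-call timeouts, locking, and metrics.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying it."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after_seconds=30.0,
                 clock=time.monotonic):
        self._threshold = failure_threshold
        self._reset_after = reset_after_seconds
        self._clock = clock          # injectable so tests need not wait
        self._failures = 0
        self._opened_at = None       # None means the circuit is closed

    def call(self, operation):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self._reset_after:
                # Open: fail immediately instead of waiting on a dead service.
                raise CircuitOpenError("circuit open, failing fast")
            # Cool-down elapsed: let one trial call through (half-open).
        try:
            result = operation()
        except Exception:
            self._failures += 1
            if self._opened_at is not None or self._failures >= self._threshold:
                self._opened_at = self._clock()  # open, or re-open after a failed trial
            raise
        # Success: it is safe again, so close the circuit.
        self._failures = 0
        self._opened_at = None
        return result
```

Callers wrap each remote call in `breaker.call(...)` and treat `CircuitOpenError` like any other failure, for example by serving cached or static content.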

SLIDE 91

Retry

Retry with an exponential backoff. Assume failure always.
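
A sketch of retrying with exponential backoff. The function name, the defaults, and the optional random jitter are assumptions, not something the slides prescribe; `sleep` and `rng` are parameters only so the behaviour can be tested without waiting.

```python
import random
import time


def retry_with_backoff(operation, max_attempts=5, base_delay=0.1,
                       max_delay=5.0, sleep=time.sleep, rng=random.random):
    """Call `operation` until it succeeds or `max_attempts` is reached.

    The wait doubles after every failure (base_delay, 2x, 4x, ...), is
    capped at `max_delay`, and is scaled by a random factor in [0, 1) so
    that many clients do not retry in lockstep.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            delay = min(max_delay, base_delay * (2 ** attempt))
            sleep(delay * rng())
```

Only retry operations that are safe to repeat; a retried write that already succeeded on the server can otherwise be applied twice.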

SLIDE 92

Don’t just assume failure

SLIDE 93

Test failure

Shut off instances randomly, see what happens. Turn on the firewall to add network timeouts, see what happens. The cloud makes it easy to bring up test environments, and to move resources quickly when necessary.

SLIDE 94

Netflix’s Chaos Monkey randomly kills instances.
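
The idea can be sketched in a few lines. This is not Netflix's implementation: the function names are made up, and `terminate` is a hypothetical callback that a real setup would wire to its own instance-termination call, ideally in a test environment first.

```python
import random


def pick_victims(instance_ids, kill_fraction, rng=None):
    """Choose a random subset of instances to shut off (at least one)."""
    if rng is None:
        rng = random.Random()
    ids = list(instance_ids)
    if not ids:
        return []
    count = min(len(ids), max(1, round(len(ids) * kill_fraction)))
    return rng.sample(ids, count)


def unleash_monkey(instance_ids, terminate, kill_fraction=0.1, rng=None):
    """Terminate the chosen instances via `terminate` and return their IDs."""
    victims = pick_victims(instance_ids, kill_fraction, rng)
    for instance_id in victims:
        terminate(instance_id)
    return victims
```

Run it on a schedule, then watch whether the app keeps serving requests and whether replacement instances come up unattended.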

SLIDE 95

“Think about your software running.”

Theo Schlossnagle, OmniTI

SLIDE 96

Understand your code’s breaking points

Use patterns like circuit breakers and bulkheads to reduce failure points. Think about outcomes and implications, not just features. Understand your code’s breaking points and how it handles unavailability, timeouts, and the like. All of these are much more likely in a cloud environment.
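
A bulkhead limits how much of your capacity one dependency can consume, so a slow service cannot tie up every worker. A minimal sketch using a semaphore; the class name and the choice to reject immediately instead of queueing are assumptions for illustration.

```python
import threading


class BulkheadFullError(Exception):
    """Raised when all slots for a dependency are already in use."""


class Bulkhead:
    def __init__(self, max_concurrent):
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, operation):
        # Do not wait for a slot: reject right away so callers fail fast.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError("too many concurrent calls")
        try:
            return operation()
        finally:
            self._slots.release()
```

Give each downstream service its own `Bulkhead` so that exhausting one compartment leaves calls to the others unaffected.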

SLIDE 97

Isn’t all that what you do at large scale?

It’s what you do at any scale where availability is a factor.

SLIDE 98

Cloud == Large scale

SLIDE 99

You’re a part of it

Prepare for the worst, plan for the worst. The cloud made failure at a large scale obvious even when you’re working at a small scale.

SLIDE 100

Scalarium today

SLIDE 101

Scalarium runs on Scalarium

SLIDE 102

Easy to boot up somewhere else, switch over DNS, done.

SLIDE 103

SLIDE 104

SLIDE 105

Lack of visibility

It still takes up to an hour for problems to be acknowledged. Amazon is not good at admitting that failure happens a lot on EC2. There is not enough education on how to build apps for EC2 and their web services, especially on how to deal with failure.

SLIDE 106

Don’t fall for SLAs

SLIDE 107

Amazon only handles infrastructure

SLIDE 108

How you build on it is up to you

SLIDE 109

Fun fact

SLIDE 110

amazon.com is served off EC2

Since 21/10/2010. Yes, they were down too, at least during the EU outage.

SLIDE 111

It’s not the cloud that matters, it’s how you use it.

EC2 doesn’t make everything harder. Quite the opposite, it makes things easier: adding capacity, automation, responding to failures.

SLIDE 112

Thank you!
