The 1 Year and 1 hour Capacity Plan in the Drupal World



slide-1
SLIDE 1

The 1 Year and 1 hour Capacity Plan in the Drupal World

slide-2
SLIDE 2

About me

  • Principal SRE @Acquia (Cloud Data Team)
  • Joined in December 2011
  • Location: Lisbon, Portugal
  • Co-authored Seeking SRE w/ Machine Learning for SRE (O’Reilly)
  • Founder and Lead of the Portuguese Drupal Association
  • Fun Facts:

○ Presented in DevOps events including DrupalCons.
○ Dedicated father of 2 kids and still manages to study and write.
○ First Linux installation: Slackware in 1994.
○ Former theatre actor.

Agenda

  • The problem
  • What is Capacity
  • Why do Capacity Planning
  • Relation to Site Reliability Engineering
  • Budget & Capacity Planning
  • Load Testing
  • Performance Tuning vs. Capacity Planning
  • What to measure
  • How to measure
  • How to track capacity
  • Forecasting
  • First Easy Steps
  • Conclusions

slide-3
SLIDE 3

The Problem

Site Launch & User Expectations

Falcon Heavy launch, Spacex

Typical Drupal Site Launch

What about Capacity Planning??

  • Disable devel
  • Configure cron
  • Check The Upload Sizes & Execution Time
  • Check Recipient Email Addresses
  • Set The File Permissions
  • Protect Your Root Account
  • Check Permissions
  • Turn Off Error Reporting
  • Handle 404 Errors Gracefully
  • Check Robots.txt
  • Combine Pathauto With Global Redirect
  • Create A Maintenance Page
  • Configure Caching
  • CSS And JavaScript Optimisation
  • Check Unpublished Content Is Not Visible
  • Configure Statistics
  • Monitor the Site
  • ** Plan for Failure **
slide-4
SLIDE 4

User Expectations

Drupal click screenshot
  • The end goal of capacity planning is a smooth and speedy experience for the users
  • Varies depending on what type of application it is and what portion of the application they interact with

No silver bullet

  • Plenty of capacity can still leave a website slow or unavailable
  • Capacity is only one part of making the end-user experience fast
  • We want to measure and track to make forecasts
  • An intolerable amount of latency should raise a flag

slide-5
SLIDE 5

What is Capacity

resources required to run your services in the context you have chosen to run them

Carbon Fiber Tank, SpaceX

Capacity in Site Reliability Engineering (SRE)

  • Capacity: The maximum amount of output a product deployment is capable of completing in a given period of time
  • Capacity planning: Process that determines the resources needed, like people, instances, CPU, memory, time and more, for the company to meet changing demands for its services

  • In the Drupal World we focus mostly on serving WEB capacity
slide-6
SLIDE 6

Resource management

The Art of Capacity Planning

Arun Kejariwal, John Allspaw "O'Reilly Media, Inc."
  • Ensure proper resources are available to handle load
  • Define procurement and an approval process
  • Justify capital needs
  • Manage resources after deployment

Why do Capacity Planning

Kroger grocery store, Lexington Kentucky, 1947, by Brett Streutket
slide-7
SLIDE 7

Quick and Dirty Math

  • Only spend as much as you actually need

  • Be ahead of sharp growth
  • Avoid emergencies

Stay Fast and Reliable

Site Reliability Engineering

Rocket Laboratory, 1952 NASA/William A. Bowles
slide-8
SLIDE 8

Ben Treynor - Google

“...an SRE team is responsible for the availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning of their service(s)...”

Demand Forecasting and Capacity Planning

  • Ensuring that there is sufficient capacity and redundancy
  • Serve projected future demand with the required availability
  • Ensure the required capacity is in place by the time it is needed
  • Take both organic and inorganic growth into account

https://unsplash.com/photos/mexeVPlTB6k
slide-9
SLIDE 9

How SRE advocates for Capacity Planning

  • Perform regular load testing
  • Incorporate SLOs on Capacity
  • Capacity is critical to availability, therefore the SRE team leads capacity planning initiatives and provisioning

https://unsplash.com/photos/DX9X0g0Cg88

Budget & Capacity Planning

Vintage Grow Your Money by Chris Potter, ccPixs.com
slide-10
SLIDE 10

Keeping the costs low

  • Meet with Finance, Engineering and Product
  • Gather Systems and Application metrics
  • Use that data to justify the investment

Three forces that impact Capacity Planning: Product, Finance, Engineering

Plan

Load Testing

“Hope is not a strategy”

St. Margrethen - Load Test by Kecko
slide-11
SLIDE 11

Load testing a Drupal stack

  • How to load test? “Hit it until it breaks”
  • Include the points of failure in the calculations
  • Determining backend limits can be tricky
  • Use those resource ceilings as a basis while predicting future growth

https://docs.acquia.com/acquia-cloud/arch/

A Few Load testing Tools

Simulate:

  • Loadrunner
○ http://bit.ly/microfocus-loadrunner
  • Iago
○ https://github.com/twitter/iago
  • JMeter
○ http://jmeter.apache.org/

Collect:

  • Prometheus
○ http://www.prometheus.io/
  • Signalfx
○ http://www.signalfx.com/
  • Cacti
○ http://cacti.net
  • Ganglia
○ http://ganglia.info
  • Nagios
○ http://nagios.org/

https://www.gocomics.com/calvinandhobbes/1986/11/26
slide-12
SLIDE 12

Performance Tuning vs. Capacity Planning

(different goals)

Top Speed by Alexander Nie

What to measure

defining the metrics

End-of-life by Dennis van Zuijlekom
slide-13
SLIDE 13

Divide & Conquer

  • Splitting nodes
  • Understand capacity demands of each node
  • Measure more distinctly
  • How requests or queries per second affect resources

Identifying the key resources to measure

  • Disk space (MB)
  • Disk throughput (IOPS)
  • CPU performance (FLOPS)
  • RAM memory (MB)
  • Network bandwidth (Mbps)
  • Network IP pool (Netmask)
  • Others
slide-14
SLIDE 14

How to measure

Living Computer Museum, Seattle

Tools to measure on Linux servers: http://www.brendangregg.com/Perf/linux_perf_tools_full.png

slide-15
SLIDE 15

Collecting resources on web servers

TODO: CODE
  • Example script that sends metrics to statsd
  • Low footprint using /proc, df and ps
  • For a constant reliable monitoring service use collectd: https://collectd.org
  • Or Telegraf: https://www.influxdata.com/time-series-platform/telegraf/
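A minimal sketch of such a collector, in Python rather than whatever language the original script on the slide used. It reads `/proc/loadavg` and `/proc/meminfo`, uses `shutil.disk_usage` in place of `df`, and sends statsd gauges over UDP. The `web01` prefix, the metric names and the `127.0.0.1:8125` address are assumptions for illustration.

```python
import shutil
import socket

# Assumption: a statsd daemon listening locally on its default UDP port.
STATSD_ADDR = ("127.0.0.1", 8125)


def parse_meminfo(text):
    """Turn the contents of /proc/meminfo into {field: value}."""
    values = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        fields = rest.split()
        if fields:
            values[key.strip()] = int(fields[0])
    return values


def gauge(name, value):
    """Format one metric as a statsd gauge line: <name>:<value>|g."""
    return f"{name}:{value}|g"


def collect(prefix):
    """Read load, memory and disk figures for this host (Linux only)."""
    with open("/proc/loadavg") as handle:
        load1 = float(handle.read().split()[0])
    with open("/proc/meminfo") as handle:
        mem = parse_meminfo(handle.read())
    disk = shutil.disk_usage("/")
    return [
        gauge(f"{prefix}.load1", load1),
        gauge(f"{prefix}.mem_available_kb", mem.get("MemAvailable", 0)),
        gauge(f"{prefix}.disk_used_mb", disk.used // (1024 * 1024)),
    ]


def send(lines, addr=STATSD_ADDR):
    """Fire-and-forget UDP send, one datagram per metric."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        for line in lines:
            sock.sendto(line.encode("ascii"), addr)
```

Calling `send(collect("web01"))` from cron once a minute would be one low-footprint way to run it. For anything long-lived, collectd or Telegraf remain the better fit.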

How to track Capacity

slide-16
SLIDE 16

Store and display time-series

  • Signalfx
  • Cacti
  • Ganglia
  • Graphite
  • Datadog
  • Ruxit
  • LogicMonitor
  • Sematext
  • CoScale
  • Riemann
  • Prometheus
  • Sensu
  • Idera
  • Bijk
  • X-Pack
  • vRealize Hyperic HQ

A couple of load testing tips

Load testing tutorials:

https://www.tutorialspoint.com/jmeter
https://www.blazemeter.com/load-testing

Docker app for Grafana:

https://github.com/kamon-io/docker-grafana-graphite

slide-17
SLIDE 17

Forecasting

(predicting trends)

Numbers And Finance by SeniorLiving.org

Predict the future?

  • Use Context & Math
  • Make educated guesses
  • Long-term view is generally steady
  • Generate estimates to sustain growth
  • Use an adjustable process
  • Forecast guides autoscaling policies

slide-18
SLIDE 18

Ceilings and Historical data

  • Daily storage consumption example
  • Metric: total available disk space
  • Cumulative total provides a historical perspective
  • We can predict future needs
  • Storage will probably be exhausted at the point where the trend line reaches the ceiling

Curve fitting

  • Curve fitting
  • Creative & Scientific
  • Stay ahead of growth
  • Use time-series data
  • Forecast by constructing new data points beyond the known
  • Reconciliation of what we know and the best fit equation
  • Consider context before math

y = mx + b
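As a rough illustration of the straight-line case, the sketch below fits y = mx + b to daily disk usage with ordinary least squares and solves for the day the line reaches the ceiling. The sample values and the 200 GB ceiling are hypothetical, not taken from the talk.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = m*x + b; returns (m, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    m = sxy / sxx
    b = mean_y - m * mean_x
    return m, b


def day_ceiling_reached(m, b, ceiling):
    """Solve m*x + b = ceiling for x; None when usage is flat or shrinking."""
    if m <= 0:
        return None
    return (ceiling - b) / m


# Hypothetical data: disk used (GB), sampled once a day for five days.
days = [0, 1, 2, 3, 4]
used_gb = [100.0, 102.0, 104.0, 106.0, 108.0]

m, b = fit_line(days, used_gb)
print(f"growth: {m:.2f} GB/day, starting point: {b:.2f} GB")
print("ceiling reached on day", day_ceiling_reached(m, b, ceiling=200.0))
```

Real usage is rarely this linear, which is why context (launches, campaigns, retention policies) should be weighed before trusting the fitted line.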
slide-19
SLIDE 19

Forecasting Peak-Driven Resource Usage

  • Track how the peaks change over time
  • Extrapolate from that data to predict future needs
  • Identify the server resource ceilings
  • Find a relation between resources and application-level work
  • Decide if we should scale vertically or horizontally
  • and perform proactive autoscaling
Automating Forecasts with fityk & cfityk

  • Fityk is an Open Source Software for nonlinear fitting of analytical functions to data.
  • Incorporate cfityk scripts into automated curve fitting, like:

cfityk ricardo-disk.fit

@0 < ricardo-disk.csv
guess Quadratic
fit
info formula
quit

Returns the formula:

4888.18 + 363.063 * x + 8.91132 + -1.55119*x + 0.0660771*x^2

Homepage: https://fityk.nieto.pl/

Small demo: https://youtube.com/watch?v=EZnyq1Hr_7I
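For readers without fityk at hand, a quadratic fit can be approximated in a few lines of plain Python. This sketch uses ordinary least squares via the normal equations, which is not the Levenberg-Marquardt method fityk applies, and the data points are hypothetical rather than the contents of ricardo-disk.csv.

```python
def solve_linear_system(a, rhs):
    """Gaussian elimination with partial pivoting on a small dense system."""
    n = len(rhs)
    rows = [list(a[i]) + [rhs[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(rows[r][col]))
        rows[col], rows[pivot] = rows[pivot], rows[col]
        for r in range(col + 1, n):
            factor = rows[r][col] / rows[col][col]
            for c in range(col, n + 1):
                rows[r][c] -= factor * rows[col][c]
    solution = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(rows[i][j] * solution[j] for j in range(i + 1, n))
        solution[i] = (rows[i][n] - tail) / rows[i][i]
    return solution


def fit_quadratic(xs, ys):
    """Least-squares fit of y = a0 + a1*x + a2*x^2; returns [a0, a1, a2]."""
    powers = [sum(x ** k for x in xs) for k in range(5)]
    matrix = [[powers[i + j] for j in range(3)] for i in range(3)]
    rhs = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(3)]
    return solve_linear_system(matrix, rhs)


def forecast(coeffs, x):
    """Evaluate the fitted quadratic at x."""
    a0, a1, a2 = coeffs
    return a0 + a1 * x + a2 * x * x


# Hypothetical daily disk usage following 5 + 2x + 0.5x^2.
xs = [0, 1, 2, 3, 4, 5]
ys = [5.0, 7.5, 11.0, 15.5, 21.0, 27.5]

coeffs = fit_quadratic(xs, ys)
print("fitted coefficients:", [round(c, 4) for c in coeffs])
print("forecast for day 10:", round(forecast(coeffs, 10), 2))
```

The normal equations are adequate for short, well-scaled series like this one. For long or noisy series a dedicated fitting tool such as fityk is the safer choice.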

slide-20
SLIDE 20

Forecasting with Machine Learning

Seeking SRE

Conversations About Running Production Systems at Scale Publisher: O'Reilly Media
  • Most popular method for curve-fitting in fityk is Levenberg-Marquardt
  • ML is also an option for forecasting (book I co-authored)
  • Code examples and guides: https://github.com/ricardoamaro/MachineLearning4SRE

Start with Easy Steps

slide-21
SLIDE 21

Get Started

  • 1. Select a process owner.
  • 2. Identify the resources to be measured.
  • 3. Measure these resources.
  • 4. Compare to maximum capacity.
  • 5. Collect workload forecasts.
  • 6. Use forecasts for IT resource requirements.
  • 7. Map requirements onto existing utilizations.
  • 8. Predict when the system will be out of capacity.
  • 9. Update forecasts and utilizations.

Set a Goal!

  • Two Classes:

○ Load: usually expressed in arrival rate or peak rate of requests hitting the service
  • e.g. target for 10,000 authenticated concurrent Drupal users

○ Performance: usually expressed in the form of Service Level Objectives
  • e.g. 99th percentile of all requests should return in less than 500 ms
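A performance goal of this kind can be checked mechanically. The sketch below computes a nearest-rank percentile over a set of request latencies and compares it with the 500 ms target. The latency samples are hypothetical, and nearest-rank is only one of several percentile definitions a monitoring system might use.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def meets_slo(latencies_ms, pct=99, threshold_ms=500):
    """True when the pct-th percentile latency is below the threshold."""
    return percentile(latencies_ms, pct) < threshold_ms


# Hypothetical samples: 98 fast requests, one slower, one very slow.
latencies_ms = [120] * 98 + [450, 900]

print("p99 latency (ms):", percentile(latencies_ms, 99))
print("SLO met:", meets_slo(latencies_ms))
```

Note how a single 900 ms outlier in 100 requests does not break a p99 objective, while a second one would.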

slide-22
SLIDE 22

Be proactive

(plan & document ahead)

Picasso drawing with Paloma and Claude at Villa la Galloise, 1953. By Edward Quinn, EdwardQuinn.com.

Capacity Planning Dashboard

  • Support your conclusions with metrics in a dashboard
  • Both manual scaling and autoscaling decisions should be based on real data
  • When to scale?

○ date and time (be alerted if needed)

  • How to scale?

○ vertical, horizontal or diagonal scaling

(Example) Drupal Cluster Dashboard

type           value  limit/node  ceiling units  limit (total)  current (peak)  peak %  Estimated days left
Varnish cache  28     1024        req/sec        2048           600             29%     830
Web            31     80          busy calls     160            145             90%     12
Database       15     60          connections    120            96              80%     36
Storage        14     30          TB             30             14              46%     21
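The peak % column can be derived from the total limit and the current peak, as this small sketch shows using the figures from the example dashboard. The 80% alert threshold and the function names are assumptions for illustration. The "Estimated days left" column is not reproduced, since it depends on a growth forecast that the table does not include.

```python
# (type, limit_total, current_peak, units) taken from the example dashboard.
ROWS = [
    ("Varnish cache", 2048, 600, "req/sec"),
    ("Web", 160, 145, "busy calls"),
    ("Database", 120, 96, "connections"),
    ("Storage", 30, 14, "TB"),
]


def peak_percent(current_peak, limit_total):
    """Share of the total ceiling consumed at peak, truncated to a whole percent."""
    return current_peak * 100 // limit_total


def needs_attention(rows, threshold=80):
    """Names of the tiers whose peak usage is at or above the threshold (assumed 80%)."""
    return [
        name
        for name, limit_total, current_peak, _units in rows
        if peak_percent(current_peak, limit_total) >= threshold
    ]


for name, limit_total, current_peak, units in ROWS:
    pct = peak_percent(current_peak, limit_total)
    print(f"{name}: {current_peak}/{limit_total} {units} = {pct}%")

print("scale soon:", needs_attention(ROWS))
```

With these figures the Web and Database tiers cross the assumed threshold, which matches the short "days left" values shown for them.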
slide-23
SLIDE 23

Conclusions

Drive the system to the appropriate level of risk for the lowest cost.

Questions?

The 1 Year and 1 hour Capacity Plan in the Drupal World

slide-24
SLIDE 24

Join us for contribution opportunities

Mentored Contribution First Time Contributor Workshop General Contribution

#DrupalContributions What did you think?

https://events.drupal.org/node/22330 https://www.surveymonkey.com/r/DrupalConSeattle