SLIDE 1
Beyond The Numbers
Baron Schwartz
SLIDE 2 Who Am I?
- baron@percona.com
- @xaprb
- linkedin.com/in/xaprb
- xaprb.com/blog
SLIDE 3 Who Am I?
- Maatkit
- Innotop
- Aspersa
- JavaScript Libraries
- Percona Toolkit
- Monitoring Plugins
- Online Tools
SLIDE 4
- Consulting
- Support
- Remote DBA
- Engineering
- Conferences & Training
- Percona Server
- Percona XtraBackup
- Percona XtraDB Cluster
- Percona Toolkit
- Many More
SLIDE 5 Today's Agenda
- Benchmarks
- Aggregation and Distributions
- Performance, Capacity & Utilization
- Rules of Thumb
- Queueing Theory and Scalability
SLIDE 6
Benchmarks
SLIDE 7 What's Missing?
- Distribution
- Time Series
- Response Times
- Parameters
- Goals
- System Specs
SLIDE 8 What's Misleading?
- Logarithmic X-Axis
- Interpolation
SLIDE 9 What's Good?
- Y-Axis Reaches 0
- No Fake-Smoothing
SLIDE 10
Behind a Single Dot
SLIDE 11
Look At All That Data...
SLIDE 12
What's With The Grid Lines?!?!?
SLIDE 13
Better Benchmarks
What does an ideal benchmark report look like?
SLIDE 14 Clear Benchmark Goals
- Validating hardware configuration
- Comparing two systems
- Checking for regressions
- Capacity planning
- Reproducing bad behavior to solve it
- Stress-testing to find bottlenecks
SLIDE 15 Hardware and Software
- Specs for CPU, disk, memory, network
- Software versions (OS, SUT, benchmark)
- Filesystem, RAID controller
- Disk queue scheduler
SLIDE 16 Presenting Results
- Ideally, make raw results available
- Include metrics from OS (CPU, RAM, IO,
network)
- Generate some plots to summarize
- This is where the rubber meets the road!
SLIDE 17 Better Aggregate Measures
- Average
- Percentiles
- 95th
- 99th
- Maximum
- Observation Duration
- Question: how bad can 95th percentile be?
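One way to explore the closing question is to compute these measures on a small sample. This is a minimal sketch: the nearest-rank percentile method and the response times are assumptions chosen for illustration, not data from the talk.

```python
def percentile(samples, p):
    """Nearest-rank percentile for an integer p in 1..100 (one of several common definitions)."""
    ordered = sorted(samples)
    # Ceiling division keeps the rank arithmetic in integers.
    rank = max(1, -(-p * len(ordered) // 100))
    return ordered[rank - 1]


def summarize(samples, duration_seconds):
    """Report the aggregate measures listed on the slide."""
    return {
        "average": sum(samples) / len(samples),
        "p95": percentile(samples, 95),
        "p99": percentile(samples, 99),
        "max": max(samples),
        "duration_seconds": duration_seconds,
    }


# Made-up response times in milliseconds: 95 fast requests and 5 very slow ones.
response_times_ms = [2.0] * 95 + [500.0] * 5
summary = summarize(response_times_ms, duration_seconds=10)
print(summary)
```

In this made-up sample the 95th percentile is 2.0 ms even though one request in twenty took 500 ms, which is why the maximum and the observation duration belong in the report too.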
SLIDE 18 More Aggregate Measures
- Median (50th Percentile)
- Standard Deviation
- Index of Dispersion
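A short sketch of these three measures, assuming the index of dispersion means the variance-to-mean ratio and using population (not sample) variance. The sample values are made up.

```python
import statistics


def dispersion_summary(samples):
    """Median, standard deviation, and variance-to-mean ratio of a sample."""
    mean = statistics.fmean(samples)
    variance = statistics.pvariance(samples, mu=mean)  # population variance (an assumption)
    return {
        "median": statistics.median(samples),
        "stddev": variance ** 0.5,
        "index_of_dispersion": variance / mean,
    }


# Made-up response times in milliseconds.
measures = dispersion_summary([1.0, 2.0, 3.0, 4.0, 10.0])
print(measures)
```

A ratio well above 1 suggests the measurements are more spread out than the mean alone implies.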
SLIDE 19
Better...
SLIDE 20
Better Still...
SLIDE 21
Keep It Coming...
SLIDE 22
Throughput AND Response Time
SLIDE 23 Performance
Two Metrics
- Response Time (time per task)
- Throughput (tasks per time)
- They're not reciprocals
- More on this later
SLIDE 24 What Performance Isn't
- CPU Usage
- Load Average
- Other metrics of resource consumption
SLIDE 25 Performance
- I often focus on response time
- It represents user experience
- Throughput indicates capacity rather than
performance
- For benchmarking, throughput is primary
SLIDE 26 Utilization
- The portion of time during which the
resource is busy
- i.e. there is at least one thing in progress
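The definition above can be sketched by merging the intervals during which requests were in progress. The function name and the request intervals are hypothetical; only the "at least one thing in progress" rule comes from the slide.

```python
def utilization(intervals, window_start, window_end):
    """Fraction of the window during which at least one request was in progress."""
    busy = 0.0
    current_start = current_end = None
    for start, end in sorted(intervals):
        # Clip each request to the observation window.
        start, end = max(start, window_start), min(end, window_end)
        if start >= end:
            continue
        if current_end is None or start > current_end:
            # A gap: close the previous busy period and open a new one.
            if current_end is not None:
                busy += current_end - current_start
            current_start, current_end = start, end
        else:
            # Overlapping requests extend the same busy period.
            current_end = max(current_end, end)
    if current_end is not None:
        busy += current_end - current_start
    return busy / (window_end - window_start)


# Made-up (start, end) times in seconds; the first two requests overlap.
u = utilization([(0, 2), (1, 3), (5, 6)], window_start=0, window_end=10)
print(u)
```

Overlapping requests count once, so utilization by this definition cannot exceed 100% no matter how many requests are in flight.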
SLIDE 27 Utilization is Confusing
- Be very careful with tools that report
utilization
- From the Linux iostat man page:
- “%util: Percentage of CPU time during which
I/O requests were issued to the device (bandwidth utilization for the device). Device saturation occurs when this value is close to 100%.”
- Can you parse that? Is it true?
SLIDE 29
Capacity
SLIDE 30
Capacity – My Definition
Capacity is the maximum throughput ... at achievable concurrency ... with acceptable performance ... as defined by response time ... meeting specified constraints ... over specified observation intervals.
SLIDE 31 Capacity Example
- What is the capacity of the system at a concurrency of 32, with 10-second 95th-percentile response time not to exceed 2ms, over a 60-minute duration?
- To determine this, we need goal-seeking benchmark software
- Most benchmark software can't do this
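As a rough sketch of what "goal-seeking" could mean, the search below looks for the highest request rate whose measured 95th percentile stays within a limit. Everything here is an assumption for illustration: `find_capacity` and `toy_p95_ms` are hypothetical names, the toy response curve is made up, and a real tool would run timed trials at the stated concurrency and observation interval instead of calling a formula.

```python
def find_capacity(measure_p95_ms, limit_ms, low_rate, high_rate, tolerance=1.0):
    """Binary-search the highest rate whose p95 response time is within limit_ms.

    Assumes p95 never decreases as the rate increases, and that high_rate
    is already past the limit.
    """
    if measure_p95_ms(low_rate) > limit_ms:
        return None  # even the lowest rate misses the goal
    while high_rate - low_rate > tolerance:
        mid = (low_rate + high_rate) / 2
        if measure_p95_ms(mid) <= limit_ms:
            low_rate = mid
        else:
            high_rate = mid
    return low_rate


def toy_p95_ms(rate_per_sec):
    """Made-up stand-in for a benchmark trial; not a model of any real system."""
    service_ms = 0.5
    busy_fraction = rate_per_sec * service_ms / 1000.0
    if busy_fraction >= 1.0:
        return float("inf")
    return 3.0 * service_ms / (1.0 - busy_fraction)


capacity = find_capacity(toy_p95_ms, limit_ms=2.0, low_rate=1.0, high_rate=2000.0)
print(capacity)
```

The binary search is only one possible strategy; it trades a handful of long trials for a throughput figure that is tied to an explicit response-time goal.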
SLIDE 32 Benchmarks, etc Recap
- Most benchmarks reveal very little
- Benchmark reports reveal even less
- It's good to go beyond the surface
SLIDE 33 Amdahl's Law
- “The speedup of a program using multiple
processors in parallel computing is limited by the time needed for the sequential fraction of the program.” - Wikipedia
- It's basically a law of diminishing returns.
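A small worked example of the diminishing returns, using the standard form of Amdahl's Law. The 5% sequential fraction is a made-up figure.

```python
def amdahl_speedup(serial_fraction, processors):
    """Speedup when only the non-sequential part is spread across processors."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / processors)


# Made-up program in which 5% of the work is sequential.
for n in (1, 2, 4, 16, 1024):
    print(n, round(amdahl_speedup(0.05, n), 2))

speedup_16 = amdahl_speedup(0.05, 16)
```

With 5% sequential work, 16 processors give roughly a 9x speedup, and no number of processors can exceed 20x (1 / 0.05).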
SLIDE 34 Should I Defragment My Disk?
- Method 1: Google “defragment”
- Method 2: Try it and see
- Method 3: Measure if the disk is a
bottleneck
SLIDE 35
Spolsky -vs- Millsap
SLIDE 36
Spolsky -vs- Millsap
SLIDE 37 Amdahl's Law
- Don't try to optimize little things.
SLIDE 38 Little's Law
- N = XR
- That is,
- Concurrency = Throughput * Response Time
- This holds regardless of queueing, arrival
rate distribution, response time distribution, etc.
SLIDE 39 Little's Law Example
- If disk IOs average 4ms...
- And there are 280 IOs per second...
- Then the disk's average concurrency is:
- N = 280 * .004
- N = 1.12
- Do you believe this?
- When might it not be true?
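The arithmetic on this slide, written out as a sketch. The function name is hypothetical; the numbers are the ones above.

```python
def littles_law_concurrency(throughput_per_sec, response_time_sec):
    """N = X * R: average number of requests in the system."""
    return throughput_per_sec * response_time_sec


# 280 IOs per second at an average of 4 ms each.
n = littles_law_concurrency(280, 0.004)
print(round(n, 2))
```

The result is an average over the observation interval, so a short burst of much higher concurrency can hide behind it.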
SLIDE 40 Little's Law Example #2
- If disk utilization is 98%
- And there are 280 IOs per second
- What do we know?
SLIDE 41 Utilization Law
- U = SX
- Also independent of distributions, etc...
- That is,
- Utilization = Service Time * Throughput
- Utilization = 98% and Throughput = 280
- S = U/X
- Service Time = .98 / 280 = .0035 seconds (3.5ms)
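The same calculation as a sketch, assuming throughput is measured in IOs per second so that service time comes out in seconds. The function name is hypothetical.

```python
def service_time(utilization, throughput_per_sec):
    """Utilization Law rearranged: S = U / X."""
    return utilization / throughput_per_sec


s = service_time(0.98, 280)
print(round(s * 1000, 2), "ms")
```

Note that this yields service time only. Response time also includes time spent queueing, which the Utilization Law does not reveal.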
SLIDE 42 Queueing Theory
- How can we predict the amount of
queueing in a system?
- How can we predict its response times?
- How can we predict capacity?
SLIDE 43 Erlang Queueing
- Erlang's formulas model the probability of
queueing for a given arrival rate, service time, and number of servers.
- A “server” is anything capable of serving
a request.
SLIDE 44 CPU -vs- Disk Queueing
- Scenario: 4-CPU, 4-disk (RAID0) server
- Thought experiment:
- How do processes queue for CPU?
- How do I/O requests queue on disks?
SLIDE 45 Notation
- Typically see something like M/M/1
- Each letter is a placeholder in A/S/n
- A = Arrival distribution
- S = Service-time distribution
- n = Number of servers
- A and S can be one of:
- Markov
- Deterministic
- General
SLIDE 46 CPUs -vs- Disks
- CPUs: M/M/4
- Disks: 4 x {M/M/1}
SLIDE 47 M/M/1 Queueing
cmg.org
SLIDE 48 M/M/n Queueing
cmg.org
SLIDE 49 Erlang C Function
- M/M/n queueing is modeled by Erlang C
- See http://en.wikipedia.org/wiki/Erlang_(unit)
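A minimal sketch of the Erlang C formula, used here to compare one shared queue feeding four servers (M/M/4) with four separate single-server queues (4 x M/M/1). The 4 ms service time, the 750 requests per second, and the even split of traffic across the four separate queues are assumptions for illustration; the function names are hypothetical.

```python
import math


def erlang_c(servers, offered_load):
    """Probability that an arriving request has to queue in an M/M/n system.

    offered_load is arrival rate times service time (in erlangs).
    """
    rho = offered_load / servers
    if rho >= 1.0:
        raise ValueError("offered load must be below the number of servers")
    tail = offered_load ** servers / math.factorial(servers) / (1.0 - rho)
    head = sum(offered_load ** k / math.factorial(k) for k in range(servers))
    return tail / (head + tail)


def mean_response_time(servers, arrival_rate, service_time):
    """Mean service time plus mean queueing delay for M/M/n."""
    offered_load = arrival_rate * service_time
    rho = offered_load / servers
    wait = erlang_c(servers, offered_load) * service_time / (servers * (1.0 - rho))
    return service_time + wait


shared = mean_response_time(servers=4, arrival_rate=750.0, service_time=0.004)
separate = mean_response_time(servers=1, arrival_rate=750.0 / 4, service_time=0.004)
print(round(shared * 1000, 2), "ms shared queue")
print(round(separate * 1000, 2), "ms separate queues")
```

Both configurations run at 75% utilization per server, yet under these assumptions the shared queue responds in about 6 ms and the separate queues in 16 ms.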
SLIDE 50 What's Wrong With Erlang C?
- You must validate your arrival times.
- You must validate your service times.
- The equation is hard to work with.
- In practice, it's hard to use Erlang C.
SLIDE 51 Scalability
- Queueing causes non-linear scaling.
- But first, let's talk about linearity.
SLIDE 52 System Scalability
[Plot: Throughput vs. Concurrency] Why?
SLIDE 53 Universal Scalability Law
[Plot: Throughput vs. Concurrency, with curves for Linear, Amdahl, and USL]
SLIDE 54
Amdahl Scalability
SLIDE 55
USL Scalability
SLIDE 56
USL Scalability Modeling
SLIDE 57
USL Performance Modeling
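The three curves named on the Universal Scalability Law slide can be sketched with Gunther's USL formula, where Amdahl scaling is the special case with no coherency penalty. The coefficient values below are made up; in practice they are fitted to measurements.

```python
import math


def usl_relative_capacity(n, sigma, kappa):
    """USL: C(N) = N / (1 + sigma*(N - 1) + kappa*N*(N - 1)).

    sigma is the contention (serialization) coefficient and kappa the
    coherency (crosstalk) coefficient. kappa = 0 gives Amdahl scaling,
    and sigma = kappa = 0 gives linear scaling.
    """
    return n / (1.0 + sigma * (n - 1) + kappa * n * (n - 1))


def usl_peak_concurrency(sigma, kappa):
    """Concurrency at which USL throughput peaks (requires kappa > 0)."""
    return math.sqrt((1.0 - sigma) / kappa)


SIGMA, KAPPA = 0.05, 0.001  # made-up coefficients for illustration
for n in (1, 8, 32, 64):
    linear = usl_relative_capacity(n, 0.0, 0.0)
    amdahl = usl_relative_capacity(n, SIGMA, 0.0)
    usl = usl_relative_capacity(n, SIGMA, KAPPA)
    print(n, linear, round(amdahl, 2), round(usl, 2))

peak = usl_peak_concurrency(SIGMA, KAPPA)
```

Amdahl scaling flattens toward a ceiling, while the USL curve peaks (here near a concurrency of 31) and then declines.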
SLIDE 58 Scalability Limitations
- Locks
- Synchronization points
- Shared resources
- Duplicated data to be kept in sync
- Weakest-link problems
SLIDE 59 RAID10 On EBS
- Which is faster?
- RAID 10 over 10 EBS volumes
- RAID 10 over 20 EBS volumes
- Hint: http://goo.gl/Xm92Y
- Also, http://goo.gl/fAEIL
SLIDE 60 Debunking “Linear”
- Ask to see the actual numbers.
- They shouldn't be rounded off suspiciously.
- They must be truly linear.
- They must intersect the point (0, 0).
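One way to apply these checks to the actual numbers: a line through (0, 0) means throughput per unit of concurrency is constant. The function names, the 2% tolerance, and the data points are all assumptions for illustration.

```python
def throughput_per_unit(points):
    """Throughput divided by concurrency for each (concurrency, throughput) pair."""
    return [throughput / concurrency for concurrency, throughput in points]


def looks_linear_through_origin(points, tolerance=0.02):
    """True if per-unit throughput varies by no more than the given fraction."""
    ratios = throughput_per_unit(points)
    return max(ratios) - min(ratios) <= tolerance * max(ratios)


# Made-up benchmark results that might be described as "linear".
claimed = [(1, 1000), (2, 1950), (4, 3700), (8, 6600)]
print(throughput_per_unit(claimed))
print(looks_linear_through_origin(claimed))
```

The made-up results drop from 1000 to 825 per unit of concurrency, so they fail the check even though a plot of them might look like a straight line.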
SLIDE 61
Debunking, Example #1
SLIDE 62
Is it Linear?
SLIDE 63
It's Not Linear
SLIDE 64 Resources
- Naomi Robbins' Blog
- http://blogs.forbes.com/naomirobbins/
- Percona White Papers
- http://www.percona.com/
- Neil J. Gunther
- Guerrilla Capacity Planning
- http://www.contextneeded.com/
SLIDE 65
Questions?
SLIDE 66
baron@percona.com @xaprb