Beyond The Numbers Baron Schwartz Who Am I? baron@percona.com - - PowerPoint PPT Presentation

Jan 25, 2024 185 likes •861 views

slide-1

SLIDE 1

Beyond The Numbers

Baron Schwartz

slide-2

SLIDE 2

Who Am I?

baron@percona.com
@xaprb
linkedin.com/in/xaprb
xaprb.com/blog

slide-3

SLIDE 3

Who Am I?

Maatkit
Innotop
Aspersa
JavaScript Libraries
Percona Toolkit
Monitoring Plugins
Online Tools

slide-4

SLIDE 4

Consulting
Support
Remote DBA
Engineering
Conferences & Training

Percona Server
Percona XtraBackup
Percona XtraDB Cluster

Percona Toolkit
Many More

slide-5

SLIDE 5

Today's Agenda

Benchmarks
Aggregation and Distributions
Performance, Capacity & Utilization
Rules of Thumb
Queueing Theory and Scalability

slide-6

SLIDE 6

Benchmarks

slide-7

SLIDE 7

What's Missing?

Distribution
Time Series
Response Times
Parameters
Goals
System Specs

slide-8

SLIDE 8

What's Misleading?

Logarithmic X-Axis
Interpolation

slide-9

SLIDE 9

What's Good?

Y-Axis Reaches 0
No Fake-Smoothing

slide-10

SLIDE 10

Behind a Single Dot

slide-11

SLIDE 11

Look At All That Data...

slide-12

SLIDE 12

What's With The Grid Lines?!?!?

slide-13

SLIDE 13

Better Benchmarks

What does an ideal benchmark report look like?

slide-14

SLIDE 14

Clear Benchmark Goals

Validating hardware configuration
Comparing two systems
Checking for regressions
Capacity planning
Reproducing bad behavior to solve it
Stress-testing to find bottlenecks

slide-15

SLIDE 15

Hardware and Software

Specs for CPU, disk, memory, network
Software versions (OS, SUT, benchmark)
Filesystem, RAID controller
Disk queue scheduler

slide-16

SLIDE 16

Presenting Results

Ideally, make raw results available
Include metrics from OS (CPU, RAM, IO, network)

Generate some plots to summarize
This is where the rubber meets the road!

slide-17

SLIDE 17

Better Aggregate Measures

Average
Percentiles
95th
99th
Maximum
Observation Duration
Question: how bad can 95th percentile be?
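One way to explore the question above is to compute these aggregates over a sample that hides a slow tail. The sketch below uses a nearest-rank percentile and made-up response times (both are assumptions for illustration, not data from the talk). It shows a 95th percentile that looks healthy while the slowest 5% of requests are thousands of times slower.

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def summarize(samples):
    """Return the aggregate measures listed on the slide."""
    return {
        "average": sum(samples) / len(samples),
        "p95": percentile(samples, 95),
        "p99": percentile(samples, 99),
        "max": max(samples),
    }

# Hypothetical response times in milliseconds: 95 fast requests, 5 very slow ones.
fast_and_slow = [2.0] * 95 + [5000.0] * 5
summary = summarize(fast_and_slow)
print(summary)
```

Here the 95th percentile is 2 ms even though one request in twenty takes 5 seconds, which is why the maximum and the observation duration belong next to the percentiles.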

slide-18

SLIDE 18

More Aggregate Measures

Median (50th Percentile)
Standard Deviation
Index of Dispersion
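These measures can be computed with Python's standard library. The sketch assumes the usual definition of the index of dispersion as the variance-to-mean ratio and uses population (not sample) variance; the function name and the sample values are illustrative only.

```python
import statistics

def dispersion_summary(samples):
    """Median, standard deviation, and index of dispersion (variance / mean)."""
    mean = statistics.mean(samples)
    variance = statistics.pvariance(samples)  # population variance
    return {
        "median": statistics.median(samples),
        "stdev": statistics.pstdev(samples),
        "index_of_dispersion": variance / mean,
    }

# Hypothetical response times in milliseconds.
result = dispersion_summary([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
print(result)
```

A larger index of dispersion means the measurements are more variable relative to their mean, which an average alone cannot show.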

slide-19

SLIDE 19

Better...

slide-20

SLIDE 20

Better Still...

slide-21

SLIDE 21

Keep It Coming...

slide-22

SLIDE 22

Throughput AND Response Time

slide-23

SLIDE 23

Performance

What is Performance?
Two Metrics

Response Time (time per task)
Throughput (tasks per time)
They're not reciprocals
More on this later

slide-24

SLIDE 24

What Performance Isn't

CPU Usage
Load Average
Other metrics of resource consumption

slide-25

SLIDE 25

Performance

I often focus on response time
It represents user experience
Throughput indicates capacity rather than performance

For benchmarking, throughput is primary

slide-26

SLIDE 26

Utilization

The portion of time during which the resource is busy

i.e. there is at least one thing in progress
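Under this definition, utilization can be computed from request start and end times by merging overlapping intervals, so that time with several requests in progress is counted once. This is a minimal sketch; the function name and the interval representation are assumptions.

```python
def utilization(intervals, window_start, window_end):
    """Fraction of the window during which at least one request is in progress.

    intervals: iterable of (start, end) times for individual requests.
    """
    busy = 0.0
    current_start = current_end = None
    for start, end in sorted(intervals):
        # Clip each request to the observation window.
        start, end = max(start, window_start), min(end, window_end)
        if start >= end:
            continue
        if current_end is None or start > current_end:
            # A gap: close the previous busy period and open a new one.
            if current_end is not None:
                busy += current_end - current_start
            current_start, current_end = start, end
        else:
            # Overlap: extend the current busy period.
            current_end = max(current_end, end)
    if current_end is not None:
        busy += current_end - current_start
    return busy / (window_end - window_start)

# Two overlapping requests and one isolated request in a 10-second window.
u = utilization([(0, 2), (1, 3), (5, 6)], 0, 10)
print(u)
```

The overlapping requests contribute 3 seconds of busy time, not 4, so the result is 0.4. Concurrency during the busy period does not raise utilization.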

slide-27

SLIDE 27

Utilization is Confusing

Be very careful with tools that report utilization

From the Linux iostat man page:
“%util: Percentage of CPU time during which I/O requests were issued to the device (bandwidth utilization for the device). Device saturation occurs when this value is close to 100%.”

Can you parse that? Is it true?

slide-28

SLIDE 28

Capacity

What is Capacity?

slide-29

SLIDE 29

Capacity

slide-30

SLIDE 30

Capacity – My Definition

Capacity is the maximum throughput ... at achievable concurrency ... with acceptable performance ... as defined by response time ... meeting specified constraints ... over specified observation intervals.

slide-31

SLIDE 31

Capacity Example

What is the capacity of the system at a concurrency of 32, with 10-second 95th-percentile response time not to exceed 2ms, over a 60-minute duration?

To determine this, we need goal-seeking benchmark software

Most benchmark software can't do this
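A goal-seeking tool would need, at minimum, a way to decide whether a run met the stated constraint. The sketch below shows only that check, under one reading of the example: response times are grouped into 10-second windows and the 95th percentile of every window must stay at or below 2 ms. The function names, the observation format, and the nearest-rank percentile are assumptions, not a description of any existing benchmark tool.

```python
import math
from collections import defaultdict

def percentile(samples, pct):
    """Nearest-rank percentile."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def meets_goal(observations, window_s=10, pct=95, limit_ms=2.0):
    """observations: iterable of (completion_time_s, response_time_ms) pairs.

    Returns True only if every window's percentile is within the limit.
    """
    windows = defaultdict(list)
    for completed_at, response_ms in observations:
        windows[int(completed_at // window_s)].append(response_ms)
    return all(percentile(times, pct) <= limit_ms for times in windows.values())

# Hypothetical runs: a steady one, and one with a slow 10-second window.
steady = [(i * 0.1, 1.0) for i in range(200)]
with_stall = steady + [(25.0 + i * 0.1, 3.0) for i in range(20)]
print(meets_goal(steady), meets_goal(with_stall))
```

A goal-seeking benchmark could then raise the offered load until this check starts failing and report the highest throughput that still passed.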

slide-32

SLIDE 32

Benchmarks, etc Recap

Most benchmarks reveal very little
Benchmark reports reveal even less
It's good to go beyond the surface

slide-33

SLIDE 33

Amdahl's Law

“The speedup of a program using multiple processors in parallel computing is limited by the time needed for the sequential fraction of the program.” - Wikipedia

It's basically a law of diminishing returns.
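The diminishing returns are easy to see numerically. This sketch uses the standard form of Amdahl's Law, speedup = 1 / ((1 - p) + p / n), where p is the fraction of the work that can be parallelized and n is the number of processors. The function name and the 95% figure are illustrative assumptions.

```python
def amdahl_speedup(parallel_fraction, processors):
    """Speedup predicted by Amdahl's Law for a given parallelizable fraction."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / processors)

# Hypothetical program that is 95% parallelizable.
for n in (1, 2, 8, 64, 1024):
    print(n, round(amdahl_speedup(0.95, n), 2))
```

However many processors are added, the speedup for this example stays below 1 / 0.05 = 20, because the sequential 5% never gets faster.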

slide-34

SLIDE 34

Should I Defragment My Disk?

Method 1: Google “defragment”
Method 2: Try it and see
Method 3: Measure if the disk is a bottleneck

slide-35

SLIDE 35

Spolsky -vs- Millsap

slide-36

SLIDE 36

Spolsky -vs- Millsap

slide-37

SLIDE 37

Amdahl's Law

Don't try to optimize little things.

slide-38

SLIDE 38

Little's Law

N = XR
That is,
Concurrency = Throughput * Response Time
This holds regardless of queueing, arrival rate distribution, response time distribution, etc.

slide-39

SLIDE 39

Little's Law Example

If disk IOs average 4ms...
And there are 280 IOs per second...
Then the disk's average concurrency is:
N = 280 * .004
N = 1.12
Do you believe this?
When might it not be true?
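The arithmetic on this slide can be reproduced directly. The function name is an assumption; the numbers are the ones from the slide, with response time converted to seconds so the units match IOs per second.

```python
def littles_law_concurrency(throughput_per_s, response_time_s):
    """Little's Law: N = X * R (average number of requests in the system)."""
    return throughput_per_s * response_time_s

# 280 IOs per second at an average of 4 ms each.
n = littles_law_concurrency(280, 0.004)
print(n)
```

The result is roughly 1.12 requests in progress on average. Both inputs must be averages over the same observation interval for the relationship to apply.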

slide-40

SLIDE 40

Little's Law Example #2

If disk utilization is 98%
And there are 280 IOs per second
What do we know?

slide-41

SLIDE 41

Utilization Law

U = SX
Also independent of distributions, etc...
That is,
Utilization = Service Time * Throughput
Utilization = 98% and Throughput = 280
S = U/X
Service Time = .98 / 280 = .0035
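The same calculation as a short sketch, with the function name as an assumption and the slide's numbers as inputs. Utilization is expressed as a fraction (0.98) rather than a percentage.

```python
def service_time(utilization, throughput_per_s):
    """Utilization Law rearranged: S = U / X, in seconds per request."""
    return utilization / throughput_per_s

# 98% utilization at 280 IOs per second.
s = service_time(0.98, 280)
print(s)
```

This gives about 0.0035 seconds, or 3.5 ms of service time per IO, which excludes any time the request spent waiting in a queue.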

slide-42

SLIDE 42

Queueing Theory

How can we predict the amount of queueing in a system?

How can we predict its response times?
How can we predict capacity?

slide-43

SLIDE 43

Erlang Queueing

Erlang's formulas model the probability of queueing for a given arrival rate, service time, and number of servers.

A “server” is anything capable of serving a request.

CPUs
Disks

slide-44

SLIDE 44

CPU -vs- Disk Queueing

Scenario: 4-CPU, 4-disk (RAID0) server
Thought experiment:
How do processes queue for CPU?
How do I/O requests queue on disks?

slide-45

SLIDE 45

Notation

Typically see something like M/M/1

Each letter is a placeholder in A/S/n
A = Arrival distribution
S = Service-time distribution
n = Number of servers
A and S can be one of:
Markov
Deterministic
General

slide-46

SLIDE 46

CPUs -vs- Disks

CPUs: M/M/4
Disks: 4 x {M/M/1}

slide-47

SLIDE 47

M/M/1 Queueing

cmg.org

slide-48

SLIDE 48

M/M/n Queueing

cmg.org

slide-49

SLIDE 49

Erlang C Function

M/M/n queueing is modeled by Erlang C
See http://en.wikipedia.org/wiki/Erlang_(unit)
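For reference, here is a minimal sketch of the standard Erlang C formula and the mean response time it implies for an M/M/n queue. The function names and the example numbers are assumptions. The offered load is the arrival rate multiplied by the service time, and must be less than the number of servers for the queue to be stable.

```python
import math

def erlang_c(servers, offered_load):
    """Probability that an arriving request has to queue in an M/M/n system."""
    if offered_load >= servers:
        raise ValueError("unstable: offered load must be less than the number of servers")
    top = (offered_load ** servers / math.factorial(servers)) * servers / (servers - offered_load)
    bottom = sum(offered_load ** k / math.factorial(k) for k in range(servers)) + top
    return top / bottom

def mean_response_time(servers, arrival_rate, service_time):
    """Mean response time (service plus queueing) for an M/M/n queue."""
    load = arrival_rate * service_time
    wait = erlang_c(servers, load) * service_time / (servers - load)
    return service_time + wait

# Hypothetical workload: 2 requests/s in total, 1 s of service time each.
shared_queue = mean_response_time(4, 2.0, 1.0)      # one M/M/4 queue
separate_queues = mean_response_time(1, 0.5, 1.0)   # each of 4 M/M/1 queues gets 0.5/s
print(shared_queue, separate_queues)
```

With the same total load, the single shared queue in front of four servers gives a lower mean response time than four separate single-server queues, which is the contrast drawn earlier between CPUs (M/M/4) and disks (4 x M/M/1).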

slide-50

SLIDE 50

What's Wrong With Erlang C?

You must validate your arrival times.
You must validate your service times.
The equation is hard to work with.
In practice, it's hard to use Erlang C.

slide-51

SLIDE 51

Scalability

Queueing causes non-linear scaling.
But first, let's talk about linearity.

slide-52

SLIDE 52

System Scalability

Plot: Throughput versus Concurrency
Why?

slide-53

SLIDE 53

Universal Scalability Law

Plot: Throughput versus Concurrency (curves: Linear, Amdahl, USL)
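The three curves can be generated from one function. This sketch uses the usual formulation of the Universal Scalability Law from Neil Gunther's work, X(N) = lambda * N / (1 + sigma * (N - 1) + kappa * N * (N - 1)), where sigma models contention and kappa models coherency delay. The parameter names and values are illustrative assumptions, not measurements from the talk.

```python
import math

def usl_throughput(n, lam, sigma, kappa):
    """Throughput at concurrency n under the Universal Scalability Law.

    lam: throughput at concurrency 1; sigma: contention; kappa: coherency delay.
    sigma = kappa = 0 gives linear scaling; kappa = 0 gives Amdahl scaling.
    """
    return lam * n / (1 + sigma * (n - 1) + kappa * n * (n - 1))

def usl_peak_concurrency(sigma, kappa):
    """Concurrency at which USL throughput peaks (requires kappa > 0)."""
    return math.sqrt((1 - sigma) / kappa)

# Hypothetical parameters.
for n in (1, 8, 32, 128):
    linear = usl_throughput(n, 100, 0.0, 0.0)
    amdahl = usl_throughput(n, 100, 0.05, 0.0)
    usl = usl_throughput(n, 100, 0.05, 0.001)
    print(n, round(linear), round(amdahl), round(usl))
```

With a nonzero coherency term, throughput does not just level off as in the Amdahl curve. It peaks and then declines as concurrency keeps increasing.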

slide-54

SLIDE 54

Amdahl Scalability

slide-55

SLIDE 55

USL Scalability

slide-56

SLIDE 56

USL Scalability Modeling

slide-57

SLIDE 57

USL Performance Modeling

slide-58

SLIDE 58

Scalability Limitations

Locks
Synchronization points
Shared resources
Duplicated data to be kept in sync
Weakest-link problems

slide-59

SLIDE 59

RAID10 On EBS

Which is faster?
RAID 10 over 10 EBS volumes
RAID 10 over 20 EBS volumes
Hint: http://goo.gl/Xm92Y
Also, http://goo.gl/fAEIL

slide-60

SLIDE 60

Debunking “Linear”

Ask to see the actual numbers.
They shouldn't be rounded off suspiciously.
They must be truly linear.
They must intersect the point (0, 0).
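Given the actual numbers, the last two checks can be automated. A line through (0, 0) has a constant ratio of throughput to concurrency, so the sketch below compares every ratio with the first one. The function name, the 2% tolerance, and the sample data are assumptions.

```python
def is_linear_through_origin(points, tolerance=0.02):
    """points: (concurrency, throughput) pairs with concurrency > 0.

    Returns True if throughput / concurrency stays within the tolerance
    of the first point's ratio, i.e. the data lie on a line through (0, 0).
    """
    ratios = [throughput / concurrency for concurrency, throughput in points]
    baseline = ratios[0]
    return all(abs(ratio - baseline) <= tolerance * abs(baseline) for ratio in ratios)

# Hypothetical benchmark results.
truly_linear = [(1, 100), (2, 200), (4, 400)]
sublinear = [(1, 100), (2, 190), (4, 340)]
offset_line = [(1, 150), (2, 250), (4, 450)]  # straight, but misses (0, 0)
print(is_linear_through_origin(truly_linear),
      is_linear_through_origin(sublinear),
      is_linear_through_origin(offset_line))
```

The third data set is a straight line, yet it fails the check because it does not pass through the origin: per-unit throughput falls as concurrency grows.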

slide-61

SLIDE 61

Debunking, Example #1

slide-62

SLIDE 62

Is it Linear?

slide-63

SLIDE 63

It's Not Linear

slide-64

SLIDE 64

Resources

Naomi Robbins' Blog
http://blogs.forbes.com/naomirobbins/
Percona White Papers
http://www.percona.com/
Neil J. Gunther
Guerrilla Capacity Planning
http://www.contextneeded.com/

slide-65

SLIDE 65

Questions?

slide-66

SLIDE 66

baron@percona.com @xaprb