When it all Goes Wrong @leinweber Will Leinweber @leinweber Citus - - PowerPoint PPT Presentation

when it all goes wrong
SMART_READER_LITE
LIVE PREVIEW

When it all Goes Wrong @leinweber Will Leinweber @leinweber Citus - - PowerPoint PPT Presentation

When it all Goes Wrong @leinweber Will Leinweber @leinweber Citus Data (Microsoft) bitfission.com (warning autoplays midi) @leinweber coming from citus cloud heroku postgres @leinweber special thanks citus cloud dan farina


slide-1
SLIDE 1

When it all Goes Wrong

slide-2
SLIDE 2

@leinweber

Will Leinweber

@leinweber Citus Data (Microsoft) bitfission.com


(warning autoplays midi)

slide-3
SLIDE 3

@leinweber

coming from

citus cloud heroku postgres

slide-4
SLIDE 4

@leinweber

special thanks

citus cloud
 — dan farina (@danfarina) heroku postgres
 — maciek sakrejda (@uhoh_itsmaciek)

slide-5
SLIDE 5

@leinweber

same sorts of problems

from pages & alerts from support tickets

slide-6
SLIDE 6

@leinweber

this talk

more app dev who uses postgres
 rather than dba

slide-7
SLIDE 7

@leinweber

the problem with Postgres

it’s pretty good you don’t get experience with how it breaks

slide-8
SLIDE 8

@leinweber

what to do for a problem

slide-9
SLIDE 9

@leinweber

what to do for a problem

slide-10
SLIDE 10

@leinweber

complicated system

network hardware

  • /s

postgres

slide-11
SLIDE 11

@leinweber

using the database (too much)

95% application 4% auto vacuum 1% everything else

slide-12
SLIDE 12

@leinweber

hard to convince

all the graphs saying DB is slow and nothing has changed …must be the database!

slide-13
SLIDE 13

@leinweber https://upload.wikimedia.org/wikipedia/commons/9/98/Survivorship-bias.png

slide-14
SLIDE 14

@leinweber

“but I didn’t change anything”

no deploys! no database migrations! no scaling!

slide-15
SLIDE 15

@leinweber

“but I didn’t change anything”

https://upload.wikimedia.org/wikipedia/commons/0/09/Redherring.gif

slide-16
SLIDE 16

@leinweber

“but I didn’t change anything”

more traffic? change in access patterns?

  • ne big user logged in?
slide-17
SLIDE 17

@leinweber

run out of a resource

slide-18
SLIDE 18

@leinweber

snowball

slide-19
SLIDE 19

@leinweber

example

manageable user 1s query => 2x expensive frequent, small queries 3ms => 12ms

slide-20
SLIDE 20

@leinweber

assumptions

hardware

maintenance

app

slide-21
SLIDE 21

@leinweber

assumptions

postgres should not crash …with overcommit off and no containers large extensions increase chance

slide-22
SLIDE 22

@leinweber

if not postgres, then what

slide-23
SLIDE 23

@leinweber

system resources

cpu memory disk parallelism / backends locks

slide-24
SLIDE 24

@leinweber

cpu mem disk parallelism cpu mem disk parallelism

slide-25
SLIDE 25

@leinweber

cpu mem disk parallelism

credentials wrong networking broken locking issue, check pg_locks idle in transaction

slide-26
SLIDE 26

@leinweber

cpu mem disk parallelism

application submitting backlogged workload connection leak pool sizes set too large pg_lock issue + application backlog

slide-27
SLIDE 27

@leinweber

cpu mem disk parallelism

workload skew causing thrashing unusual sequential scan workload failover or restart => no cache pg_prewarm

slide-28
SLIDE 28

@leinweber

cpu mem disk parallelism

same as just disk, but also the application is piling on

slide-29
SLIDE 29

@leinweber

cpu mem disk parallelism

large GROUP BYs high disk latency due to unusual page dispersion pattern in the workload

slide-30
SLIDE 30

@leinweber

cpu mem disk parallelism

workload has high mem (GROUP BY)
 + app adding backlog lock contention slowing mem release

slide-31
SLIDE 31

@leinweber

cpu mem disk parallelism

large GROUP BYs + paging in unusual data

slide-32
SLIDE 32

@leinweber

cpu mem disk parallelism

Look for what is causing disk access

slide-33
SLIDE 33

@leinweber

cpu mem disk parallelism

small, in-memory workload lots of seq scans on small table index scan w/ filter dropping lots

slide-34
SLIDE 34

@leinweber

cpu mem disk parallelism

app backlog 
 + too much processing on small data simply a lot of work

slide-35
SLIDE 35

@leinweber

cpu mem disk parallelism

large seq scans

slide-36
SLIDE 36

@leinweber

cpu mem disk parallelism

loading cold data + application backlog

slide-37
SLIDE 37

@leinweber

cpu mem disk parallelism

small # of backends doing a lot more work

slide-38
SLIDE 38

@leinweber

cpu mem disk parallelism

entity, workload, entity*workload soft deletes and non-conditional indexes

slide-39
SLIDE 39

@leinweber

cpu mem disk parallelism

reporting query

slide-40
SLIDE 40

@leinweber

cpu mem disk parallelism

app backlog, but with CPU/mem problems

slide-41
SLIDE 41

@leinweber

tools of the trade

slide-42
SLIDE 42

@leinweber

tools of the trade

C symbols

slide-43
SLIDE 43

@leinweber

tools of the trade: perf

perf record -p <pid> && perf report

slide-44
SLIDE 44

@leinweber

tools of the trade: perf

perf top

slide-45
SLIDE 45

@leinweber

tools of the trade: perf

www.brendangregg.com/perf.html

slide-46
SLIDE 46

@leinweber

tools of the trade: gdb

gdb -batch -ex 'bt' -p <pid>

slide-47
SLIDE 47

@leinweber

slide-48
SLIDE 48

@leinweber

slide-49
SLIDE 49

@leinweber

tools of the trade: iostat

iostat -xm 10

slide-50
SLIDE 50

@leinweber

tools of the trade: iotop

slide-51
SLIDE 51

@leinweber

tools of the trade: htop

slide-52
SLIDE 52

@leinweber

Tools of the trade: bwm-ng

slide-53
SLIDE 53

@leinweber

tools of the trade: backends

pgrep -lf postgres + grep + wc select * from pg_stat_activity

slide-54
SLIDE 54

@leinweber

tools of the trade: pg_s_s

select * from pg_stat_statements

slide-55
SLIDE 55

@leinweber

tools of the trade: summary

cpu mem disk parallelism network perf x gdb x iostat x iotop x htop x x bwm x pgrep x

slide-56
SLIDE 56

@leinweber

what to do

slide-57
SLIDE 57

@leinweber

what to do

configuration change

slide-58
SLIDE 58

@leinweber

what to do

db change

slide-59
SLIDE 59

@leinweber

what to do

code change

slide-60
SLIDE 60

@leinweber

flirting with disaster

Velocity NY 2013: Richard Cook
 "Resilience In Complex Adaptive Systems” Jens Rasmussen:
 Risk management in a dynamic society: a modeling problem

slide-61
SLIDE 61

@leinweber

flirting with disaster

economic boundary

slide-62
SLIDE 62

@leinweber

flirting with disaster

economic boundary workload boundary

slide-63
SLIDE 63

@leinweber

flirting with disaster

economic boundary workload boundary performance boundary

slide-64
SLIDE 64

@leinweber

flirting with disaster

economic boundary workload boundary performance boundary error margin

slide-65
SLIDE 65

@leinweber

flirting with disaster

economic boundary workload boundary performance boundary

slide-66
SLIDE 66

@leinweber

flirting with disaster

economic boundary workload boundary performance boundary error margin

slide-67
SLIDE 67

@leinweber

flirting with disaster

economic boundary workload boundary performance boundary error margin

slide-68
SLIDE 68

@leinweber

flirting with disaster

Velocity NY 2013: Richard Cook
 "Resilience In Complex Adaptive Systems” Jens Rasmussen: 
 Risk management in a dynamic society: a modeling problem

slide-69
SLIDE 69

@leinweber

thank you

Will Leinweber @leinweber citusdata.com