Everything I ever learned about JVM performance tuning @twitter (PowerPoint PPT Presentation)


SLIDE 1

Everything I ever learned about JVM performance tuning @twitter

Saturday, March 3, 2012
SLIDE 2

More than I ever wanted to learn about JVM performance tuning @twitter

SLIDE 3


http://twitter.com/asz

SLIDE 4
  • Memory tuning
  • CPU usage tuning
  • Lock contention tuning
  • I/O tuning

SLIDE 5

Twitter’s biggest enemy

SLIDE 6

Twitter’s biggest enemy: Latency

SLIDE 7

Latency contributors

  • By far the biggest contributor is the garbage collector
  • Others are, in no particular order:
  • in-process locking and thread scheduling,
  • I/O,
  • application algorithmic inefficiencies.

SLIDE 8

Areas of performance tuning

  • Memory tuning
  • Lock contention tuning
  • CPU usage tuning
  • I/O tuning

SLIDE 9

Areas of memory performance tuning

  • Memory footprint tuning
  • Allocation rate tuning
  • Garbage collection tuning

SLIDE 10

Memory footprint tuning

  • So you got an OutOfMemoryError…
  • Maybe you just have too much data!
  • Maybe your data representation is fat!
  • You can also have a genuine memory leak…

SLIDE 11

Too much data

  • Run with -verbosegc
  • Observe numbers in “Full GC” messages

[Full GC $before->$after($total), $time secs]

  • Can you give the JVM more memory?
  • Do you need all that data in memory? Consider using:
  • an LRU cache, or…
  • soft references*
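One way to realize the LRU option with only the JDK is a LinkedHashMap in access order; the class name and the eviction policy below are an illustrative sketch, not code from the deck:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Minimal LRU cache sketch using LinkedHashMap's access-order mode.
class LruCache<K, V> extends LinkedHashMap<K, V> {
    private final int maxEntries;

    LruCache(int maxEntries) {
        super(16, 0.75f, true); // accessOrder=true gives LRU iteration order
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        // Evict the least recently used entry once capacity is exceeded.
        return size() > maxEntries;
    }
}
```

The soft-reference alternative trades this explicit bound for GC-driven clearing, with the caveats discussed later in the deck.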

SLIDE 12

Fat data

  • Can be a problem when you want to do wacky things, like
  • load the full Twitter social graph in a single JVM
  • load all user metadata in a single JVM
  • Slimming internal data representation works at these economies of scale

SLIDE 13

Fat data: object header

  • JVM object header is normally two machine words.

  • That’s 16 bytes, or 128 bits on a 64-bit JVM!
  • new java.lang.Object() takes 16 bytes.
  • new byte[0] takes 24 bytes.

SLIDE 14

Fat data: padding

  • new A() takes 24 bytes.
  • new B() takes 32 bytes.

class A { byte x; }
class B extends A { byte y; }

SLIDE 15

Fat data: no inline structs

  • new C() takes 40 bytes.
  • similarly, no inline array elements.

class C { Object obj = new Object(); }
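The sizes quoted on the last three slides are all header plus fields, rounded up to 8 bytes; a small sketch of that arithmetic (the constants assume a 64-bit HotSpot JVM without compressed pointers, per the slides; this computes the quoted numbers rather than measuring a live heap):

```java
// Arithmetic behind the quoted sizes on a 64-bit JVM, uncompressed pointers.
class ObjectSizeMath {
    static final int HEADER = 16;       // two machine words
    static final int ARRAY_HEADER = 24; // header + length slot, padded

    // Round up to the JVM's 8-byte object alignment.
    static int align8(int n) { return (n + 7) & ~7; }

    static int plainObject()    { return align8(HEADER); }                 // new Object(): 16
    static int emptyByteArray() { return align8(ARRAY_HEADER); }           // new byte[0]: 24
    static int classA()         { return align8(HEADER + 1); }             // one byte field, padded: 24
    static int classB()         { return align8(align8(HEADER + 1) + 1); } // superclass pad + own byte: 32
    static int classC()         { return align8(HEADER + 8) + plainObject(); } // ref field + pointed-to Object: 40
}
```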

SLIDE 16

Slimming taken to extreme

  • A research project had to load the full follower graph in memory
  • Each vertex’s edges ended up being represented as int arrays
  • If it grows further, we can consider variable-length differential encoding in a byte array

SLIDE 17

Compressed object pointers

  • Pointers become 4 bytes long
  • Usable below 32 GB of max heap size
  • Automatically used below 30 GB of max heap

SLIDE 18

Compressed object pointers

                 Uncompressed   Compressed   32-bit
Pointer               8             4           4
Object header        16            12*          8
Array header         24            16          12
Superclass pad        8             4           4

* Object can have 4 bytes of fields and still only take up 16 bytes

SLIDE 19

Avoid instances of primitive wrappers

  • Hard-won experience with Scala 2.7.7:
  • a Seq[Int] stores java.lang.Integer
  • an Array[Int] stores int
  • first needs (24 + 32 * length) bytes
  • second needs (24 + 4 * length) bytes
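The two footprint formulas can be written down directly; at 1000 elements the boxed representation is roughly eight times larger:

```java
// Footprint formulas from the slide, as functions of element count.
class SeqFootprint {
    // Boxed: 24-byte backing-array header, then per element an 8-byte
    // pointer plus a 24-byte java.lang.Integer (~32 bytes each).
    static long boxed(int length)     { return 24L + 32L * length; }

    // Primitive int[]: 24-byte array header + 4 bytes per element.
    static long primitive(int length) { return 24L + 4L * length; }
}
```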

SLIDE 20

Avoid instances of primitive wrappers

  • This was fixed in Scala 2.8, but it shows that:
  • you often don’t know the performance characteristics of your libraries,
  • and won’t ever know them until you run your application under a profiler.

SLIDE 21

Map footprints

  • Guava MapMaker.makeMap() takes 2272 bytes!
  • MapMaker.concurrencyLevel(1).makeMap() takes 352 bytes!
  • ConcurrentMap with level 1 sometimes makes sense (e.g. when you don’t want a ConcurrentModificationException)
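MapMaker is Guava; a JDK-only equivalent of the concurrencyLevel(1) trick is the three-argument ConcurrentHashMap constructor (note that in modern JDKs the level is only a sizing hint). The factory name is illustrative:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

class SmallConcurrentMaps {
    // Stand-in for MapMaker.concurrencyLevel(1).makeMap(): initial capacity
    // 16, default load factor, a single lock stripe. Kept concurrent mainly
    // so iteration never throws ConcurrentModificationException.
    static <K, V> ConcurrentMap<K, V> singleStripe() {
        return new ConcurrentHashMap<>(16, 0.75f, 1);
    }
}
```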

SLIDE 22

Thrift can be heavy

  • Thrift generated classes are used to encapsulate a wire transfer format.
  • Using them as your domain objects: almost never a good idea.

SLIDE 23

Thrift can be heavy

  • Every Thrift class with a primitive field has a java.util.BitSet __isset_bit_vector field.
  • It adds between 52 and 72 bytes of overhead per object.

SLIDE 24

Thrift can be heavy

SLIDE 25

Thrift can be heavy

  • Thrift does not support 32-bit floats.
  • Coupling domain model with transport:
  • creates resistance to changing the domain model.
  • You also miss opportunities for interning and N-to-1 normalization.

SLIDE 26

class Location {
  public String city;
  public String region;
  public String countryCode;
  public int metro;
  public List<String> placeIds;
  public double lat;
  public double lon;
  public double confidence;
}

SLIDE 27

class SharedLocation {
  public String city;
  public String region;
  public String countryCode;
  public int metro;
  public List<String> placeIds;
}

class UniqueLocation {
  private SharedLocation sharedLocation;
  public double lat;
  public double lon;
  public double confidence;
}
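The N-to-1 normalization this slide implies can be sketched as an interner that hands out one canonical shared part per key; all class and method names below are hypothetical, not Twitter's code:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: many per-item locations share one canonical "shared"
// part per (city, region) key instead of each carrying its own copy.
class LocationInterner {
    static final class Shared {
        final String city, region;
        Shared(String city, String region) { this.city = city; this.region = region; }
    }

    private final Map<String, Shared> canonical = new ConcurrentHashMap<>();

    // Returns the same Shared instance for every call with the same key.
    Shared intern(String city, String region) {
        return canonical.computeIfAbsent(city + "|" + region,
                k -> new Shared(city, region));
    }
}
```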

SLIDE 28

Careful with thread locals

  • Thread locals stick around.
  • Particularly problematic in thread pools with m⨯n resource association.
  • 200 pooled threads using 50 connections: you end up with 10 000 connection buffers.

  • Consider using synchronized objects, or
  • just create new objects all the time.
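The multiplication happens because ThreadLocal materializes one value per thread that ever calls get(), and that value stays reachable for the life of the thread; a minimal sketch (the buffer size is arbitrary):

```java
// Each pooled thread that calls BUFFER.get() gets its own 8 KB array, which
// sticks around as long as the thread lives. With a per-connection variant
// of this pattern, live buffers scale as threads ⨯ connections.
class BufferHolder {
    static final ThreadLocal<byte[]> BUFFER =
            ThreadLocal.withInitial(() -> new byte[8192]);
}
```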

SLIDE 29

Part II: fighting latency

SLIDE 30

Performance tradeoff

Memory vs. Time: a convenient, but oversimplified view.

SLIDE 31

Performance triangle

Memory footprint / Throughput / Latency

SLIDE 32

Performance triangle

Compactness / Throughput / Responsiveness

C ⨯ T ⨯ R = a

  • Tuning: vary C, T, R for fixed a
  • Optimization: increase a

SLIDE 33

Performance triangle

  • Compactness: inverse of memory footprint
  • Responsiveness: longest pause the application will experience
  • Throughput: amount of useful application CPU work over time
  • Can trade one for the other, within limits.
  • If you have spare CPU, can be a pure win.

SLIDE 34

Responsiveness vs. throughput

SLIDE 35

Biggest threat to responsiveness in the JVM is the garbage collector

SLIDE 36

Memory pools

Eden | Survivor | Old | Permanent | Code cache

This is entirely HotSpot specific!

SLIDE 37

How does young gen work?

Eden | S1 | S2 | Old

  • All new allocation happens in eden.
  • It only costs a pointer bump.
  • When eden fills up, a stop-the-world copy-collection runs into the survivor space.
  • Dead objects cost zero to collect.
  • After several collections, survivors get tenured into the old generation.

SLIDE 38

Ideal young gen operation

  • Big enough to hold more than one set of all concurrent request-response cycle objects.
  • Each survivor space big enough to hold active request objects + tenuring ones.
  • Tenuring threshold such that long-lived objects tenure fast.

SLIDE 39

Old generation collectors

  • Throughput collectors
  • -XX:+UseSerialGC
  • -XX:+UseParallelGC
  • -XX:+UseParallelOldGC
  • Low-pause collectors
  • -XX:+UseConcMarkSweepGC
  • -XX:+UseG1GC (can’t discuss it here)

SLIDE 40

Adaptive sizing policy

  • Throughput collectors can automatically tune themselves:
  • -XX:+UseAdaptiveSizePolicy
  • -XX:MaxGCPauseMillis=… (e.g. 100)
  • -XX:GCTimeRatio=… (e.g. 19)

SLIDE 41

Adaptive sizing policy at work

SLIDE 42

Choose a collector

  • Bulk service: throughput collector, no adaptive sizing policy.
  • Everything else: try the throughput collector with adaptive sizing policy. If that doesn’t work, use concurrent mark-and-sweep (CMS).

SLIDE 43

Always start with tuning the young generation

  • Enable -XX:+PrintGCDetails, -XX:+PrintHeapAtGC, and -XX:+PrintTenuringDistribution.
  • Watch survivor sizes! You’ll need to determine the “desired survivor size”.
  • There’s no such thing as a “desired eden size”, mind you. The bigger, the better, with some responsiveness caveats.
  • Watch the tenuring threshold; you might need to tune it to tenure long-lived objects faster.

SLIDE 44
  • -XX:+PrintHeapAtGC

Heap after GC invocations=7000 (full 87):
 par new generation   total 4608000K, used 398455K
  eden space 4096000K,  0% used
  from space  512000K, 77% used
  to   space  512000K,  0% used
 concurrent mark-sweep generation total 3072000K, used 1565157K
 concurrent-mark-sweep perm gen   total   53256K, used   31889K

SLIDE 45
  • -XX:+PrintTenuringDistribution

Desired survivor size 262144000 bytes, new threshold 4 (max 4)
- age 1: 137474336 bytes, 137474336 total
- age 2:  37725496 bytes, 175199832 total
- age 3:  23551752 bytes, 198751584 total
- age 4:  14772272 bytes, 213523856 total

  • Things of interest:
  • Number of ages
  • Size distribution in ages
  • You want a strongly declining distribution.

SLIDE 46

Tuning the CMS

  • Give your app as much memory as possible.
  • CMS is speculative. More memory means less punitive miscalculations.
  • Try using CMS without tuning. Use -verbosegc and -XX:+PrintGCDetails.
  • Didn’t get any “Full GC” messages? You’re done!
  • Otherwise, tune the young generation first.

SLIDE 47

Tuning the old generation

  • Goals:
  • Keep the fragmentation low.
  • Avoid full GC stops.
  • Fortunately, the two goals are not conflicting.

SLIDE 48

Tuning the old generation

  • Find the minimum and maximum working set size (observe “Full GC” numbers under stable state and under load).
  • Overprovision the numbers by 25-33%.
  • This gives CMS a cushion to concurrently clean memory as it’s used.

SLIDE 49

Tuning the old generation

  • Set -XX:CMSInitiatingOccupancyFraction to between 75 and 80.
  • This corresponds to the overprovisioned heap ratio.
  • You can lower the initiating occupancy fraction to 0 if you have CPU to spare.

SLIDE 50

Responsiveness still not good enough?

  • Too many live objects during young gen GC:
  • Reduce NewSize, reduce survivor spaces, reduce the tenuring threshold.
  • Too many threads:
  • Find the minimal concurrency level, or
  • split the service into several JVMs.

SLIDE 51

Responsiveness still not good enough?

  • Does the CMS abortable preclean phase, well, abort?
  • It is sensitive to the number of objects in the new generation, so:
  • go for a smaller new generation
  • try to reduce the amount of short-lived garbage your app creates.

SLIDE 52

Part III: let’s take a break from GC

SLIDE 53

Thread coordination optimization

  • You don’t always have to go for synchronized.
  • Synchronization is a read barrier on entry; a write barrier on exit.
  • Sometimes you only need a half-barrier; e.g. in a producer-observer pattern.
  • Volatiles can be used as half-barriers.
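A minimal producer-observer sketch of that point: one writer thread updates a volatile counter, any thread may read it, and no synchronized block is involved. Names are illustrative:

```java
// Single-writer progress counter. The volatile write is the producer's
// "release" half-barrier; the volatile read is the observer's "acquire"
// half. Correct only because exactly one thread ever writes.
class ProgressCounter {
    private volatile long processed;

    void advanceTo(long n) { processed = n; }    // producer thread only
    long current()         { return processed; } // any observer thread
}
```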

SLIDE 54

Thread coordination optimization

  • For atomic update of a single value, you only need Atomic{Integer|Long}.compareAndSet().
  • You can use AtomicReference.compareAndSet() for atomic update of composite values represented by immutable objects.
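The composite-value pattern looks like this: bundle the related fields into an immutable snapshot and CAS the whole reference, retrying on contention. A hypothetical min/max tracker as a sketch:

```java
import java.util.concurrent.atomic.AtomicReference;

// Composite value updated atomically by swapping immutable snapshots.
class MinMax {
    // Immutable snapshot of two related values.
    static final class Range {
        final int min, max;
        Range(int min, int max) { this.min = min; this.max = max; }
    }

    private final AtomicReference<Range> range =
            new AtomicReference<>(new Range(Integer.MAX_VALUE, Integer.MIN_VALUE));

    void sample(int v) {
        Range cur, next;
        do {
            cur = range.get();
            next = new Range(Math.min(cur.min, v), Math.max(cur.max, v));
        } while (!range.compareAndSet(cur, next)); // retry on concurrent update
    }

    Range get() { return range.get(); }
}
```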

SLIDE 55

Fight CMS fragmentation with slab allocators

  • CMS doesn’t compact, so it’s prone to fragmentation, which will lead to a stop-the-world pause.
  • Apache Cassandra uses a slab allocator internally.

SLIDE 56

Cassandra slab allocator

  • 2MB slab sizes
  • copy byte[] into them using compare-and-set
  • GC before: 30-60 seconds every hour
  • GC after: 5 seconds once in 3 days and 10 hours
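In that spirit, a toy version of the allocation fast path: one fixed 2 MB slab, a CAS-bumped offset, payload bytes copied in. Cassandra's real allocator chains and recycles slabs; this sketch does neither:

```java
import java.util.concurrent.atomic.AtomicInteger;

// Minimal slab-allocator sketch: bump an offset with compare-and-set,
// then copy the payload into the reserved region.
class Slab {
    static final int SLAB_SIZE = 2 * 1024 * 1024; // 2 MB, as on the slide
    private final byte[] slab = new byte[SLAB_SIZE];
    private final AtomicInteger nextOffset = new AtomicInteger(0);

    // Returns the offset the payload was copied to, or -1 if the slab is full.
    int allocate(byte[] payload) {
        while (true) {
            int off = nextOffset.get();
            if (off + payload.length > SLAB_SIZE) return -1;
            if (nextOffset.compareAndSet(off, off + payload.length)) {
                System.arraycopy(payload, 0, slab, off, payload.length);
                return off;
            }
        }
    }
}
```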

SLIDE 57

Slab allocator constraints

  • Works for limited usage:
  • Buffers are written to linearly, flushed to disk and recycled when they fill up.
  • The objects need to be converted to binary representation anyway.
  • If you need random freeing and compaction, you’re heading in the wrong direction.
  • If you find yourself writing a full memory manager on top of byte buffers, stop!

SLIDE 58

Soft references revisited

  • Soft reference clearing is based on the amount of free memory available when GC encounters the reference.
  • By definition, throughput collectors always clear them.
  • You can use them with CMS, but they increase memory pressure and make behavior less predictable.
  • It takes two GC cycles to get rid of referenced objects.
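For reference, the shape of a memory-sensitive cache built on soft references; callers must treat a cleared entry like a miss (all names here are illustrative):

```java
import java.lang.ref.SoftReference;
import java.util.HashMap;
import java.util.Map;

// Cache sketch whose values the GC may clear under memory pressure.
class SoftCache<K, V> {
    private final Map<K, SoftReference<V>> map = new HashMap<>();

    void put(K key, V value) { map.put(key, new SoftReference<>(value)); }

    // May return null because the key is absent OR because the reference
    // was cleared by GC; either way the caller should recompute the value.
    V get(K key) {
        SoftReference<V> ref = map.get(key);
        return ref == null ? null : ref.get();
    }
}
```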

SLIDE 59

More than I ever wanted to learn about JVM performance tuning @twitter. Questions?
