The return of OpenStack Telemetry and the 10,000 Instances


SLIDE 1

The return of OpenStack Telemetry and the 10,000 Instances

Telemetry Project Update

Alex Krzos Julien Danjou 8 November 2017

SLIDE 2

The return of OpenStack Telemetry and the 10,000 Instances

Alex Krzos Julien Danjou 8 November 2017

20,000

SLIDE 3

  • Alex Krzos, Senior Performance Engineer @ Red Hat, akrzos@redhat.com, IRC: akrzos
  • Julien Danjou, Principal Software Engineer @ Red Hat, jdanjou@redhat.com, IRC: jd_

Introductions

SLIDE 4

  • Why scale test?
  • Telemetry Architecture
  • Gnocchi Architecture
  • The Road to 10,000 Instances
  • Scale and Performance Test Results
  • Conclusion

Let's talk about Telemetry and scaling...

SLIDE 5

  • Determine capacity and limits
  • Develop good defaults and recommendations
  • Characterize resource utilization

Telemetry must scale: the number of metrics collected will only increase.

Why Scale Test?

SLIDE 6

Telemetry Architecture

SLIDE 7

Gnocchi Architecture

SLIDE 8

Ocata struggled to reach 5,000 instances, even with many tuned parameters and a reduced workload.

Goal: achieve 10,000 instances with less tuning than Ocata and a more demanding workload.

Extra credit: go beyond 10,000 with the same hardware.

The Road to 10,000 Instances

SLIDE 9

Boot Persisting Instances with Network

  • 500 at a time, then quiesce

Boot Persisting Instances

  • 1000 at a time, then quiesce

Measure Gnocchi API Responsiveness

  • Metric Create/Delete
  • Resource Create/Delete
  • Get Measures

Workloads
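To make the API responsiveness checks concrete, here is a minimal sketch of the Gnocchi v1 REST calls they exercise. The endpoint URL, Keystone token, and archive policy name are placeholder assumptions, and this is illustrative rather than the actual test harness:

```python
# Minimal sketch of the Gnocchi v1 REST calls the responsiveness checks hit.
# GNOCCHI and the token are placeholders for your own deployment; this is
# illustrative, not the presenters' test harness.
import uuid
import requests

GNOCCHI = "http://controller:8041"          # assumed Gnocchi API endpoint
HEADERS = {"X-Auth-Token": "<keystone-token>"}

# Metric create, read measures, delete
metric = requests.post(f"{GNOCCHI}/v1/metric",
                       json={"archive_policy_name": "low"},
                       headers=HEADERS).json()
measures = requests.get(f"{GNOCCHI}/v1/metric/{metric['id']}/measures",
                        headers=HEADERS).json()
requests.delete(f"{GNOCCHI}/v1/metric/{metric['id']}", headers=HEADERS)

# Resource create / delete (generic resource type)
rid = str(uuid.uuid4())
requests.post(f"{GNOCCHI}/v1/resource/generic",
              json={"id": rid}, headers=HEADERS)
requests.delete(f"{GNOCCHI}/v1/resource/generic/{rid}", headers=HEADERS)
```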

SLIDE 10

3 Controllers

  • 2 x E5-2683 v3 - 28 Cores / 56 Threads
  • 128GiB Memory
  • 2 x 1TB 7.2K SATA in RAID 1

12 Ceph Storage Nodes

  • 2 x E5-2650 v3 - 20 Cores / 40 Threads
  • 128GiB Memory
  • 18 x 500GB 7.2K SAS (2 in RAID 1 for OS, 16 as OSDs), 1 NVMe journal

59 Compute Nodes

  • 2 x E5-2620 v2 - 12 Cores / 24 Threads
  • 128GiB / 64GiB Memory
  • 2 x 1TB 7.2K SATA in RAID 1

Hardware

SLIDE 11

Network Topology

SLIDE 12

Workload (20 iterations)

  • 500 instances with attached network booted every 30 minutes

Gnocchi Settings

  • metricd workers per Controller = 18
  • api workers per Controller = 24

Ceilometer Settings

  • notification_workers = 3
  • rabbit_qos_prefetch_count = 128
  • 300s polling interval

10,000 Instances with NICs Test
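The settings above map roughly onto the configuration sketch below. Section and option names are recalled from Pike-era defaults and may differ in your release; the Gnocchi API worker count is normally set on the WSGI server (httpd/mod_wsgi or uwsgi) rather than in gnocchi.conf:

```ini
# Illustrative mapping of the settings above to configuration files.
# Option/section names are from memory of Pike-era defaults and may differ.

# /etc/gnocchi/gnocchi.conf
[metricd]
workers = 18

# /etc/ceilometer/ceilometer.conf
[notification]
workers = 3                      # exposed as notification_workers in older releases

[oslo_messaging_rabbit]
rabbit_qos_prefetch_count = 128

# The 300s polling interval is set in the polling configuration
# (e.g. /etc/ceilometer/polling.yaml: sources -> interval: 300).
```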

SLIDE 13

Pike Results - 10k Test Gnocchi Backlog

SLIDE 14

Pike Results - 10k Test CPU on Controllers

SLIDE 15

Pike Results - 10k Test Memory on All Hosts

SLIDE 16

Pike Results - 10k Test Disks on Controllers

SLIDE 17

Pike Results - 10k Test Disks on CephStorage

SLIDE 18

Workload (20 iterations)

  • 1000 instances booted
  • 5000 get measures
  • 1000 metric and resource creates/deletes

Gnocchi

  • metricd workers per Controller = 36
  • api processes per Controller = 24

Ceilometer

  • notification_workers = 5
  • rabbit_qos_prefetch_count = 128
  • 300s polling interval

20,000 Instances Test
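A rough sketch of how per-request timings for the "get measures" checks could be collected is shown below; the endpoint, token, and metric ID are placeholders, and this is not the actual tooling behind the API timing results shown later:

```python
# Rough sketch of timing repeated "get measures" calls against the Gnocchi API.
# Endpoint, token, and metric ID are placeholders (assumptions).
import time
import statistics
import requests

GNOCCHI = "http://controller:8041"
HEADERS = {"X-Auth-Token": "<keystone-token>"}
METRIC_ID = "<existing-metric-uuid>"

timings = []
for _ in range(5000):                      # 5000 get-measures per iteration
    start = time.monotonic()
    requests.get(f"{GNOCCHI}/v1/metric/{METRIC_ID}/measures", headers=HEADERS)
    timings.append(time.monotonic() - start)

print(f"mean={statistics.mean(timings):.3f}s",
      f"p95={statistics.quantiles(timings, n=20)[18]:.3f}s")
```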

SLIDE 19

Ocata Results

SLIDE 20

Ocata Results

Not in Pike

SLIDE 21

Pike Results - 20k Test Gnocchi Backlog

SLIDE 22

Pike Results - 20k Test CPU on Controllers

SLIDE 23

Pike Results - 20k Test Memory on All Hosts

SLIDE 24

Pike Results - 20k Test Disks on Controllers

SLIDE 25

Pike Results - 20k Test Disks on CephStorage

SLIDE 26

Pike Results - 20k Test Network Controllers Em1

SLIDE 27

Pike Results - 20k Test Network Controllers Em2

SLIDE 28

API Get Measures - 20k Test

SLIDE 29

API Create/Delete Metrics - 20k Test

SLIDE 30

API Create/Delete Resources - 20k Test

SLIDE 31

Some differences between versions (Newton, Ocata, Pike)

Pike (Gnocchi v4)

  • metricd/api workers
  • Incoming storage driver (Redis is currently preferred)

Ocata / Newton (Gnocchi 3.1 / 3.0)

  • metricd/api workers
  • tasks_per_worker / metric_processing_delay
  • Check scheduler (Use latest version of Gnocchi)

Tuning - Gnocchi
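As an example of the Pike/Gnocchi v4 tuning above, switching the incoming storage driver to Redis is a small gnocchi.conf change. The snippet below is a hedged sketch; the Redis URL is a placeholder and the option names follow the Gnocchi 4 documentation as recalled:

```ini
# Hedged sketch: Redis incoming storage driver in Gnocchi 4 (Pike).
# The Redis URL is a placeholder for your own Redis endpoint.
# /etc/gnocchi/gnocchi.conf
[incoming]
driver = redis
redis_url = redis://controller-vip:6379
```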

SLIDE 32

Always

  • avoid overwhelming the Gnocchi backlog (collect only what you need/use)
  • check rabbit_qos_prefetch_count (monitor RabbitMQ too)

Pike

  • agent-notification workers

Ocata

  • publish directly to Gnocchi (disable collector)

Newton

  • collector workers

Tuning - Ceilometer
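For the Ocata item above, publishing directly to Gnocchi means pointing the pipeline sink's publishers at gnocchi:// so samples bypass ceilometer-collector. A hedged pipeline.yaml sketch (source and sink names are illustrative):

```yaml
# Hedged sketch of an Ocata-era /etc/ceilometer/pipeline.yaml that publishes
# samples straight to Gnocchi, bypassing ceilometer-collector.
sources:
    - name: meter_source
      meters:
          - "*"
      sinks:
          - meter_sink
sinks:
    - name: meter_sink
      transformers: []
      publishers:
          - gnocchi://
```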

SLIDE 33

OpenStack Telemetry is now proven to the 10,000-instance mark and beyond in Pike. There is minimal degradation in API response times as more and more metrics are collected. Of course, there is still room for improvement:

  • Reduce the load on the archival storage
  • Spikes in API timings (Frontend API vs Backend API)
  • Performance testing with other storage drivers (Swift, File)

Conclusion

SLIDE 34

THANK YOU

plus.google.com/+RedHat linkedin.com/company/red-hat youtube.com/user/RedHatVideos facebook.com/redhatinc twitter.com/RedHatNews