Demonstrating At Scale Monitoring Of OpenStack Cloud Using Prometheus


SLIDE 1

Open Infrastructure Summit 2019

Demonstrating At Scale Monitoring Of OpenStack Cloud Using Prometheus

Anandeep Pannu (apannu@redhat.com), Pradeep Kilambi (prad@redhat.com)


SLIDE 2

SLIDE 3

Definitions

SLIDE 4

SLIDE 5

SLIDE 6

SLIDE 7

Implications for Open Infrastructure

SLIDE 8

SLIDE 9

SLIDE 10

Critical Monitoring Features

SLIDE 11
  • Portability across different footprints
  • HA, scaling, and persistence available for free
  • Re-use of platform capabilities, e.g. Prometheus
  • Users integrate for the capabilities they want
  • Stringent SLAs can be met
  • Plug in different OSS components with the same API
  • For each API, the SLAs achieved can be optimized
    ○ e.g. fault management uses the message bus directly
  • Metrics metadata and declarative metrics for every component, so metrics can be incorporated automatically
  • Data sensing, collection, and processing
    ○ Either some or all processed at the Edge
  • Centralized access to reports and alerts
  • Integration with analytics
SLIDE 12

Service Assurance Framework Architecture

SLIDE 13

Architecture Overview

On-site infrastructure platform

SLIDE 14

SLIDE 15

[Architecture diagram: application components (VMs, containers) and all infrastructure nodes (Controller, Compute, Ceph, RHEV, OpenShift) emit metrics and events collected from kernel, network, CPU, memory, hardware, syslog, /proc, and per-process sources. A dispatch-routing message distribution bus (AMQP 1.0) carries them to a Prometheus Operator-managed MGMT cluster, which exposes APIs for 3rd-party integrations and Prometheus-based K8S monitoring.]

SLIDE 16
  • Collectd container -- host / VM metrics collection framework
    ○ Collectd 5.8 with additional OPNFV Barometer-specific plugins not yet in the collectd project:
      • Intel RDT, Intel PMU, IPMI
      • AMQP 1.0 client plugin
      • Procevent -- process state changes
      • Sysevent -- match syslog for critical errors
      • Connectivity -- fast detection of interface link status changes
    ○ Integrated as part of TripleO (OSP Director)
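Because the collectd profile is driven by TripleO, extra plugins can be switched on from a heat environment file. A minimal sketch, assuming the CollectdExtraPlugins list parameter exposed by tripleo-heat-templates; the plugin names below mirror the Barometer plugins listed above and are purely illustrative:

parameter_defaults:
  CollectdExtraPlugins:
    - procevent      # process state change events
    - sysevent       # match syslog entries for critical errors
    - connectivity   # fast interface link-status detection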

SLIDE 17

Collectd write plugins: write_syslog, write_kafka, write_prometheus, amqp_09, amqp1

SLIDE 18

AMQ 7 Interconnect - Native AMQP 1.0 Message Router

  • Large-scale message networks
    ○ Offers shortest-path (least-cost) message routing
    ○ Used without a broker
    ○ High availability through redundant path topology and re-routing (not clustering)
    ○ Automatic recovery from network partitioning failures
    ○ Reliable delivery without requiring storage
  • QDR router functionality
    ○ Apache Qpid Dispatch Router (QDR)
    ○ Dynamically learns addresses of messaging endpoints
    ○ Stateless - no message queuing, end-to-end transfer

[Diagram: clients and Servers A, B, and C connected through a mesh of routers]

High throughput, low latency; low operational costs
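The redundant-path topology above translates directly into deployment configuration: each node-local edge router can be pointed at more than one upstream router, so traffic re-routes if a connection drops. A minimal sketch reusing the MetricsQdrConnectors parameter shown later in this deck; the hostnames are hypothetical:

parameter_defaults:
  MetricsQdrConnectors:
    # Two upstream routers give each edge QDR a redundant path;
    # if one link fails, messages re-route over the other.
    - host: qdr-a.example.com
      port: 443
      role: edge
      sslProfile: tlsProfile
    - host: qdr-b.example.com
      port: 443
      role: edge
      sslProfile: tlsProfile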

SLIDE 19
  • Prometheus Operator
SLIDE 20

SLIDE 21

Evolution

SLIDE 22

[Diagram: evolution to distributed monitoring. A central site runs three controllers, Ceph nodes, and compute nodes on its own AMQP and OS networks, plus a Prometheus Operator++ cluster (Prometheus, Grafana) fed by QDR / Smart Gateway (SG) pairs. Remote sites (Site 1, Site 2, ... Site 10), each with the same controller / Ceph / compute layout on their own AMQP and OS networks, reach the central site over a Layer 3 network.]

SLIDE 23

DCN Use Case

[Diagram: DCN deployment stack, L3 routed. The primary site hosts the undercloud with container registry, the controller nodes, AZ0 compute nodes (local ephemeral), and an optional Ceph Cluster 0. DCN Sites 1 through n each host AZ1 ... AZn compute nodes (local ephemeral), connected to the primary site over L3 routed networks.]

SLIDE 24

SLIDE 25

Configuration & Deployment

SLIDE 26
  • Collectd and QDR profiles are integrated as part of TripleO
  • Collectd and QDRs run as containers on the OpenStack nodes
  • Configured via a heat environment file
  • Each node runs a Qpid Dispatch Router alongside the collectd agent
  • Collectd is configured to talk to the Qpid Dispatch Router and send metrics and events
  • Relevant collectd plugins can be configured via the heat template file

TripleO Integration of Client-Side Components

SLIDE 27

## This environment template enables the Service Assurance client-side bits
resource_registry:
  OS::TripleO::Services::MetricsQdr: ../docker/services/metrics/qdr.yaml
  OS::TripleO::Services::Collectd: ../docker/services/metrics/collectd.yaml

parameter_defaults:
  CollectdConnectionType: amqp1
  CollectdAmqpInstances:
    notify:
      notify: true
      format: JSON
      presettle: true
    telemetry:
      format: JSON
      presettle: false

TripleO Client-Side Configuration

environments/metrics-collectd-qdr.yaml

SLIDE 28

cat > params.yaml <<EOF
parameter_defaults:
  CollectdConnectionType: amqp1
  CollectdAmqpInstances:
    telemetry:
      format: JSON
      presettle: true
  MetricsQdrConnectors:
    - host: qdr-white-normal-sa-telemetry.apps.dev7.nfvpe.site
      port: 443
      role: edge
      sslProfile: tlsProfile
      verifyHostname: false
EOF

TripleO Client-Side Configuration

params.yaml

SLIDE 29

cd ~/tripleo-heat-templates
git checkout master
cd ~
cp overcloud-deploy.sh overcloud-deploy-overcloud.sh
sed -i 's/usr\/share\/openstack-/home\/stack\//g' overcloud-deploy-overcloud.sh
./overcloud-deploy-overcloud.sh \
  -e /usr/share/openstack-tripleo-heat-templates/environments/metrics-collectd-qdr.yaml \
  -e /home/stack/params.yaml

Client-Side Deployment

Using overcloud deploy with the collectd & QDR configuration and environment templates

SLIDE 30

SLIDE 31

There are three core components to the telemetry framework:

  • Prometheus (and the Alertmanager)
  • Smart Gateway
  • Qpid Dispatch Router

Each of these components has a corresponding Operator that we'll use to spin up the various application components and objects.
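As an illustration of that pattern, a Prometheus instance is declared as a custom resource and the Prometheus Operator reconciles it into running pods. A minimal sketch using the upstream monitoring.coreos.com/v1 API; the names, namespace, and label selector below are hypothetical, not taken from the telemetry-framework manifests:

apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: sa-telemetry               # hypothetical name
  namespace: sa-telemetry          # hypothetical namespace
spec:
  replicas: 2                      # HA pair managed by the Operator
  serviceMonitorSelector:          # scrape targets discovered via ServiceMonitors
    matchLabels:
      app: smart-gateway           # hypothetical label on the Smart Gateway service
  alerting:
    alertmanagers:
      - namespace: sa-telemetry
        name: alertmanager-operated
        port: web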

SLIDE 32

To deploy the telemetry framework from the script, simply run the following commands after cloning the telemetry-framework repo[1] into the directory shown:

cd ~/src/github.com/redhat-service-assurance/telemetry-framework/deploy/
./deploy.sh CREATE

[1] https://github.com/redhat-service-assurance/telemetry-framework

SLIDE 33

Deploying Service Assurance Framework: From Operator to Application

[Diagram: Operators reconcile Custom Resources into the running Service Assurance Framework components]

SLIDE 34

SLIDE 35

Demo

SLIDE 36

Warning CPU usage alert:

avg_over_time(sa_collectd_cpu_percent{type=~"system|user"}[1m]) > 75
  and
avg_over_time(sa_collectd_cpu_percent{type=~"system|user"}[1m]) < 90

Critical CPU usage alert:

avg_over_time(sa_collectd_cpu_percent{type=~"system|user"}[1m]) > 90
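Packaged as a Prometheus rule file, these expressions might look as follows. A minimal sketch in the standard alerting-rule format; the alert names, for: durations, and severity labels are illustrative, not from the demo:

groups:
  - name: cpu-usage
    rules:
      - alert: WarningCpuUsage            # illustrative name
        expr: >
          avg_over_time(sa_collectd_cpu_percent{type=~"system|user"}[1m]) > 75
          and
          avg_over_time(sa_collectd_cpu_percent{type=~"system|user"}[1m]) < 90
        for: 1m                           # illustrative hold period
        labels:
          severity: warning
      - alert: CriticalCpuUsage           # illustrative name
        expr: avg_over_time(sa_collectd_cpu_percent{type=~"system|user"}[1m]) > 90
        for: 1m
        labels:
          severity: critical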

SLIDE 37

Architecture Demo

Service Assurance Framework

SLIDE 38
  • https://telemetry-framework.readthedocs.io/en/master/
  • https://quay.io/repository/redhat-service-assurance/smart-gateway-operator?tab=info
  • https://github.com/redhat-service-assurance
SLIDE 39

SLIDE 40

SLIDE 41

SLIDE 42

SLIDE 43

SLIDE 44

SLIDE 45

SLIDE 46
[Diagram: the Prometheus pull model. The Prometheus server scrapes the /metrics endpoint of each target over HTTP; collected data is queried via PromQL over HTTP and visualized.]
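The scrape side of the diagram corresponds to a plain Prometheus configuration. A minimal sketch of a prometheus.yml; the job name and target addresses are hypothetical, with 9103 being the default port of collectd's write_prometheus plugin mentioned earlier:

global:
  scrape_interval: 15s               # how often each target's /metrics is pulled

scrape_configs:
  - job_name: openstack-nodes        # hypothetical job name
    metrics_path: /metrics           # default path exposed by targets
    static_configs:
      - targets:
          - compute-0.example.com:9103   # hypothetical collectd write_prometheus endpoints
          - compute-1.example.com:9103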

SLIDE 47