Policy-Driven Fault Management for NFV Eco System, Akhil Jain (NEC)

slide-1
SLIDE 1

Policy-Driven Fault Management for NFV Eco System

Akhil Jain (NEC) akhil.jain@india.nec.com
Eric Kao (VMware) ekcs.openstack@gmail.com

April 2019

slide-2
SLIDE 2

Definitions

  • Network Function (NF):

A functional building block in a network
      ○ packet inspection, CDNs, virus scanner, ...

  • Network Function Virtualization (NFV):

Realizing NFs as virtual appliances

  • Virtual Network Function (VNF):

A network function realized as a virtual appliance

slide-3
SLIDE 3

Fault Management

  • Basic fault recovery is standard
  • Complexities beyond the standard cases:

      ○ Diversity of fault scenarios
      ○ Diversity of VNFs
      ○ Each combination may call for a different fault management response

slide-4
SLIDE 4

Fault Scenarios

  • Sequence of fault signals over time
  • Isolated vs widespread
  • Existing or predicted
  • Fault types

      ○ Hard failure
      ○ Stability
      ○ Degraded performance

  • Fault domains

      ○ Networking, Host, Storage, Application, etc.

slide-5
SLIDE 5

Context

  • Current & anticipated loads
  • VNF capacity
  • Physical infra capacity
  • Example considerations:

      ○ If load << VNF capacity, ignore certain fault prediction signals
      ○ If load ~= VNF capacity, preemptively scale out
          ■ When physical infra is limited, may need to scale in a less loaded or less critical VNF to make room
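The considerations above can be sketched as a small decision function. This is only an illustration in Python, not how the talk implements it: the `VnfContext` fields, the `low`/`high` utilization thresholds, the `monitor` and `escalate` outcomes, and the criticality ranking are all assumptions introduced for the example.

```python
from dataclasses import dataclass


@dataclass
class VnfContext:
    name: str
    load: float       # current or anticipated load, same unit as capacity
    capacity: float   # load the VNF can serve at its current scale
    criticality: int  # higher means more critical (hypothetical ranking)


def respond_to_prediction(vnf, others, infra_has_room, low=0.5, high=0.9):
    """Return a list of (action, vnf_name) pairs for a predicted fault."""
    utilization = vnf.load / vnf.capacity
    if utilization < low:
        # Load is far below capacity: the prediction signal can be ignored.
        return [("ignore", vnf.name)]
    if utilization < high:
        # The slide gives no rule for the middle range; assume "keep watching".
        return [("monitor", vnf.name)]
    actions = []
    if not infra_has_room:
        donors = [o for o in others if o.name != vnf.name]
        if not donors:
            # Nothing can be scaled in to make room (hypothetical outcome).
            return [("escalate", vnf.name)]
        # Make room by scaling in the least critical, then least loaded, VNF.
        donor = min(donors, key=lambda o: (o.criticality, o.load / o.capacity))
        actions.append(("scale_in", donor.name))
    actions.append(("scale_out", vnf.name))
    return actions
```

With a firewall VNF at 95% of capacity, a lightly loaded CDN VNF, and no spare physical capacity, the function proposes scaling in the CDN before scaling out the firewall.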

slide-6
SLIDE 6

VNF characteristics

  • Stateful vs stateless
  • Monolithic vs microservices
  • Interactions, topology, service function chaining
  • SLAs
  • Business/user impact
slide-7
SLIDE 7

Solution: Policy-driven fault management

  • Fine-grained monitoring & alarming

○ Monasca, Prometheus, ...

  • Rich Context

      ○ Infra managers: Nova, Kubernetes, ...
      ○ NFV orchestrator: Tacker, ONAP, ...
      ○ Application-level statistics: load, latency, throughput
      ○ Arbitrary data sources

  • Expressive policy framework

○ Congress

slide-8
SLIDE 8

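[Architecture diagram: Alarm Services -> (webhook) -> Congress Policy Service, which holds the Fault Management Policies; Contextual Data -> (data) -> Congress Policy Service; Congress Policy Service -> (action) -> Infra Managers and Orchestrators]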

slide-9
SLIDE 9

Congress Architecture

  • Data

      ○ Get data from webhooks and APIs
      ○ Store data as tables and JSON

  • Policy

○ Datalog/SQL rules transform data into decisions

  • Action

○ Decisions can trigger API calls

slide-10
SLIDE 10

Advantages

  • Extensible

○ Arbitrary sources of data as needed by use case

  • Expressive

○ Not limited by fixed vocabulary or set of properties

  • Declarative

      ○ Well-understood declarative language for expressing clear and manageable policies
      ○ Avoid procedural code

slide-11
SLIDE 11

Example: preemptive scale out policy

  • Predictive fault signal
  • Possible response:

      ○ Ignore
          ■ Failure occurs
          ■ Instances go down
          ■ Load increases
          ■ Autoscaling policy adjusts

  • Drawback:

○ Degraded service for a time

slide-12
SLIDE 12

Example: preemptive scale out policy

  • Estimate service disruption/degradation
  • Preemptively scale out as appropriate
  • Minimize risk of degraded service
slide-13
SLIDE 13

Example: preemptive scale out policy

[Dataflow diagram: Alarms on hosts, Instances data]

slide-14
SLIDE 14

Example: preemptive scale out policy

[Dataflow diagram: Alarms on hosts + Instances data -> Instances affected]

slide-15
SLIDE 15

Example: preemptive scale out policy

[Dataflow diagram: Alarms on hosts + Instances data -> Instances affected; Instances affected + VNFs data -> VNFs affected]

slide-16
SLIDE 16

Example: preemptive scale out policy

[Dataflow diagram: Alarms on hosts + Instances data -> Instances affected; Instances affected + VNFs data -> VNFs affected; VNFs affected + VNFs load data -> predicted load]

slide-17
SLIDE 17

Example: preemptive scale out policy

[Dataflow diagram: Alarms on hosts + Instances data -> Instances affected; Instances affected + VNFs data -> VNFs affected; VNFs affected + VNFs load data -> predicted load -> scale out decisions]
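The slides give rules only for the first and last stages of this dataflow. The middle stages (VNFs affected, predicted load) might be sketched in Python as follows. The table shapes and the load model are assumptions made for illustration: each VNF's utilization is taken as an average over its instances, and the load of a lost instance is assumed to spread evenly over the survivors.

```python
from collections import defaultdict


def vnfs_affected(instances_affected, vnf_instances):
    """Map affected instance IDs to the VNFs that own them.

    vnf_instances: iterable of (vnf_id, instance_id) rows (assumed shape).
    """
    return {vnf for vnf, inst in vnf_instances if inst in instances_affected}


def predicted_vnf_load(instances_affected, vnf_instances, vnf_load):
    """Estimate per-VNF utilization if every affected instance goes down.

    vnf_load: dict of vnf_id -> current average utilization in [0, 1].
    Only VNFs that lose at least one instance appear in the result.
    """
    total = defaultdict(int)
    lost = defaultdict(int)
    for vnf, inst in vnf_instances:
        total[vnf] += 1
        if inst in instances_affected:
            lost[vnf] += 1
    predicted = {}
    for vnf in lost:
        surviving = total[vnf] - lost[vnf]
        if surviving == 0:
            # No instance left to serve the load.
            predicted[vnf] = float("inf")
        else:
            # Same total load, fewer instances to carry it.
            predicted[vnf] = vnf_load[vnf] * total[vnf] / surviving
    return predicted
```

Under this model, a VNF running three instances at 70% utilization that loses one instance is predicted to exceed the capacity of the remaining two, which is the kind of case a preemptive scale-out targets.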

slide-18
SLIDE 18

Example: preemptive scale out policy

[Dataflow diagram: Alarms on hosts + Instances data -> Instances affected]

instances_affected(instance_id) :-
    hosts_alarmed(alarmed_host),
    nova:servers(server_id=instance_id, host_name=alarmed_host)
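For readers less familiar with Datalog, the rule is a join on the host name: an instance is affected when its host appears in the alarmed-hosts table. A rough Python analogue follows. The dict-based row format is an assumption for the example, and Congress itself evaluates the Datalog rule directly.

```python
def instances_affected(hosts_alarmed, nova_servers):
    """Python analogue of the instances_affected rule.

    hosts_alarmed: set of host names with an active alarm.
    nova_servers: iterable of dicts with 'server_id' and 'host_name' keys,
                  mirroring the two columns the rule references.
    """
    return {
        server["server_id"]
        for server in nova_servers
        if server["host_name"] in hosts_alarmed
    }


# Example data (made up for illustration).
servers = [
    {"server_id": "i1", "host_name": "compute-1"},
    {"server_id": "i2", "host_name": "compute-2"},
    {"server_id": "i3", "host_name": "compute-1"},
]
affected = instances_affected({"compute-1"}, servers)
```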

slide-19
SLIDE 19

Example: preemptive scale out policy

[Dataflow diagram: predicted load -> scale out decisions]

scale_out(vnf_id) :-
    predicted_VNF_load(vnf_id, predicted_load),
    predicted_load > 0.9
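The rule selects every VNF whose predicted load is strictly above 0.9. A Python analogue is sketched below, assuming `predicted_VNF_load` is available as (vnf_id, predicted_load) rows. The `threshold` parameter is added for illustration only, since the slide hard-codes 0.9.

```python
def scale_out(predicted_vnf_load, threshold=0.9):
    """Python analogue of the scale_out rule.

    predicted_vnf_load: iterable of (vnf_id, predicted_load) rows.
    Returns the set of VNF IDs whose predicted load exceeds the threshold.
    """
    return {vnf_id for vnf_id, load in predicted_vnf_load if load > threshold}
```

Because the comparison is strict, a VNF predicted at exactly 0.9 is not selected.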

slide-20
SLIDE 20

Demo background

  • Demonstrate the interaction between services

      ○ Set up VNFs with Tacker
      ○ Configure Congress to receive Monasca webhook
      ○ Configure Monasca to send webhook
      ○ Raise Monasca alarm
      ○ See result of actions triggered by Congress policy
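To make the webhook step concrete, the sketch below turns a webhook body into the set of alarmed hosts that a policy could then join against. The payload shape is an assumption for illustration (a JSON object with a `state` field and a `metrics` list whose `dimensions` carry a `hostname`); the Monasca notification documentation is the authority on the exact fields. The demo itself relies on Congress's built-in webhook support, not custom code like this.

```python
import json


def hosts_alarmed_from_webhook(body):
    """Extract alarmed host names from a webhook body (assumed JSON shape)."""
    payload = json.loads(body)
    if payload.get("state") != "ALARM":
        # Ignore transitions back to OK or to an undetermined state.
        return set()
    hosts = set()
    for metric in payload.get("metrics", []):
        host = metric.get("dimensions", {}).get("hostname")
        if host:
            hosts.add(host)
    return hosts


# Made-up example body following the assumed shape.
example_body = json.dumps({
    "alarm_name": "host-failure-predicted",
    "state": "ALARM",
    "metrics": [{"name": "cpu.idle_perc", "dimensions": {"hostname": "compute-1"}}],
})
alarmed = hosts_alarmed_from_webhook(example_body)
```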

slide-21
SLIDE 21

Summary

  • Fault management is complex

○ Diversity of scenarios -> Diversity of response

  • Solution

      ○ Fine-grained monitoring
      ○ Contextual data
      ○ Expressive policy

  • Congress

      ○ Pluggable data sources
      ○ Expressive policy language
      ○ Triggers API calls

slide-22
SLIDE 22

General purpose policy triggers

  • Trigger API calls based on policy+data

      ○ Advanced fault management policies
      ○ Advanced autoscaling policies
      ○ Generic integration glue

slide-23
SLIDE 23

Feedback welcome!

Mailing lists use [congress] prefix

  • openstack-discuss@lists.openstack.org

Eric Kao <ekcs.openstack@gmail.com>

slide-24
SLIDE 24

@OpenStack

Q&A

Thank you!

  • openstack
  • openstack

OpenStackFoundation

Akhil Jain <akhil.jain@india.nec.com>
Eric Kao <ekcs.openstack@gmail.com>

slide-25
SLIDE 25

Conceptual policy dataflow

[Dataflow diagram labels: Alarms Data, Topology, VNFs Tech Data, Technical Impact, VNFs Biz Data, Business Impact, Fault Mgmt Decisions, Fault Mgmt Feasibility & Risks]