POLYGRAPH : SYSTEM FOR DYNAMIC REDUCTION OF FALSE ALERTS IN LARGE-SCALE IT SERVICE DELIVERY ENVIRONMENTS SANGKYUM KIM (UIUC) WINNIE CHENG, SHANG GUO, LAURA LUAN, DANIELA ROSU (IBM RESEARCH) ABHIJIT BOSE (GOOGLE) USENIX ATC11 (June 2011,


SLIDE 1

POLYGRAPH: SYSTEM FOR DYNAMIC REDUCTION OF FALSE ALERTS IN LARGE-SCALE IT SERVICE DELIVERY ENVIRONMENTS

USENIX ATC’11 (June 2011, Portland, OR)

SANGKYUM KIM (UIUC)
WINNIE CHENG, SHANG GUO, LAURA LUAN, DANIELA ROSU (IBM RESEARCH)
ABHIJIT BOSE (GOOGLE)

SLIDE 2

Background

• Large-scale IT service delivery systems
  • No longer confined to racks within a single data center
  • Increasing adoption of virtualization and cloud computing

• Our focus
  • Monitoring alerts
  • A significant portion of alerts are false

• Polygraph
  • Mines historical alerts to dynamically adjust monitoring policies

SLIDE 3

Basic Alert Policy Types

Type          Example
IF A          IF (System.Virtual_Memory_Percent_Used > 90)
IF A AND B    IF (NT_Physical_Disk.Disk_Time > 80) AND (NT_Physical_Disk.Disk_Time ≤ 90)
IF A OR B     IF (SMP_CPU.CPU_Status = ‘off-line’) OR (SMP_CPU.Avg_CPU_Busy_15 > 95)
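
The three policy shapes can be illustrated with a minimal Python sketch. The helper names (`cond`, `all_of`, `any_of`) and the dictionary-based metric sample are assumptions made for illustration; they are not the monitoring product's rule language or API.

```python
import operator

# Comparison operators that appear in the example policies.
OPS = {">": operator.gt, "<=": operator.le, "=": operator.eq}

def cond(attribute, op, value):
    """Build a single condition 'attribute op value' (hypothetical helper)."""
    return lambda sample: OPS[op](sample[attribute], value)

def all_of(*conds):
    """IF A AND B: every condition must hold."""
    return lambda sample: all(c(sample) for c in conds)

def any_of(*conds):
    """IF A OR B: at least one condition must hold."""
    return lambda sample: any(c(sample) for c in conds)

# IF A
memory_policy = cond("System.Virtual_Memory_Percent_Used", ">", 90)

# IF A AND B
disk_policy = all_of(
    cond("NT_Physical_Disk.Disk_Time", ">", 80),
    cond("NT_Physical_Disk.Disk_Time", "<=", 90),
)

# IF A OR B
cpu_policy = any_of(
    cond("SMP_CPU.CPU_Status", "=", "off-line"),
    cond("SMP_CPU.Avg_CPU_Busy_15", ">", 95),
)
```

Each policy is a function from a metric sample to a boolean, so an alert fires when, for example, `disk_policy({"NT_Physical_Disk.Disk_Time": 85})` evaluates to true.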

SLIDE 4

Polygraph System Architecture

[Figure: Polygraph system architecture diagram. Labels recoverable from the diagram:]

• Polygraph: alert policy generator, false alert detector, alert policy evaluator/simulator
• Data: resource utilization and performance, system configuration data, operation data (SLA, maintenance schedule, …), events
• Monitored server: CPU, disk, memory, app events; monitoring agent; monitoring rules
• Monitoring system management: monitoring rule dispatcher
• Incident management
• Source labels: system/event source, ticket source
• Flow labels: rule change, alerts, policy deployment, events, problem tickets, false alerts, proposed alert policies, expert review, new/modified alert policies, tune alert policies, n new rules

SLIDE 5

Host-based Alert Policy Threshold Adjustment

[Figure: resource values on an axis from 25 to 400 (ticks every 25), marking the min resource of real alerts, the max resource of false alerts, and the current threshold]
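
One possible reading of the figure, sketched in Python: for each host, raise the threshold as far as the host's false alerts allow without crossing the smallest resource value seen in a real alert. The function name, the `(value, is_real)` input format, and the "alert fires when value > threshold" convention are assumptions; the slides do not spell out the exact rule.

```python
def propose_host_threshold(alerts, current_threshold):
    """Propose a per-host threshold from labeled historical alerts.

    alerts: list of (resource_value, is_real) pairs for one host.
    Assumes an alert fires when resource_value > threshold.
    """
    real_values = [v for v, is_real in alerts if is_real]
    false_values = [v for v, is_real in alerts if not is_real]
    if not false_values:
        return current_threshold  # nothing to suppress

    if real_values:
        min_real = min(real_values)
        # Only false alerts strictly below every real alert can be
        # suppressed without losing a real alert.
        suppressible = [v for v in false_values if v < min_real]
    else:
        # Host with no real alerts in the history: all false alerts qualify.
        suppressible = false_values

    if not suppressible:
        return current_threshold
    # Never lower the threshold; raise it to the largest suppressible value.
    return max(current_threshold, max(suppressible))
```

When false and real alert values overlap, this sketch suppresses only the false alerts that lie below the smallest real alert, so every real alert in the history still fires.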

SLIDE 6

Time-based Alert Policy Threshold Adjustment (I)

• Finding patterns for false alerts
  • Example: periodic patterns
  • They might include true alerts

[Figure: alert timeline with date labels 2010-05-04, 2010-05-05, 2010-05-06, 2010-05-07, 2010-05-08, 2010-05-09, 2010-05-10, 2010-05-11, 2010-05-13, 2010-05-14, 2010-05-15, 2010-05-16, 2010-05-17, 2010-05-18, 2010-05-26]
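
A minimal sketch of one way to look for a periodic pattern: find the hours of the day in which false alerts recur on most observed days. The function name, the hour-of-day granularity, and the `min_day_fraction` parameter are illustrative assumptions, not the mining method used by Polygraph.

```python
from collections import defaultdict
from datetime import datetime

def recurring_false_alert_hours(false_alert_times, min_day_fraction=0.8):
    """Return hours of day (0-23) in which false alerts recur.

    false_alert_times: list of datetime objects for false alerts on one host.
    An hour qualifies if false alerts occurred in it on at least
    min_day_fraction of the days that had any false alert.
    """
    days = {t.date() for t in false_alert_times}
    if not days:
        return []
    days_by_hour = defaultdict(set)
    for t in false_alert_times:
        days_by_hour[t.hour].add(t.date())
    return sorted(
        hour
        for hour, seen_days in days_by_hour.items()
        if len(seen_days) / len(days) >= min_day_fraction
    )
```

Because such a pattern is built from false alerts only, a real alert that happens to fall in a recurring hour would be hidden as well, which is the caveat the slide raises.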

SLIDE 7

Time-based Alert Policy Threshold Adjustment (II)

• Finding patterns for true alerts
  • Mine true ranges
  • A user-specified threshold decides the width of a true range

Example: true alerts at 3pm, 4pm, and 8pm
  True range threshold: 1 hour
  True ranges: (2-5pm), (7-9pm)
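
The example is consistent with widening each true alert time by the threshold on both sides and merging overlapping intervals. A minimal Python sketch under that assumption (the function name and the minutes-since-midnight representation are illustrative):

```python
def mine_true_ranges(true_alert_minutes, threshold_minutes):
    """Merge true alert times into true ranges.

    true_alert_minutes: times of true alerts, in minutes since midnight.
    Each alert is widened by threshold_minutes on both sides; widened
    intervals that touch or overlap are merged into one range.
    Ranges are not clamped to the 0-1440 minute day in this sketch.
    """
    ranges = []
    for t in sorted(true_alert_minutes):
        start, end = t - threshold_minutes, t + threshold_minutes
        if ranges and start <= ranges[-1][1]:
            # Overlaps the previous range: extend it.
            ranges[-1] = (ranges[-1][0], max(ranges[-1][1], end))
        else:
            ranges.append((start, end))
    return ranges

# Alerts at 3pm, 4pm, and 8pm with a 1-hour threshold.
print(mine_true_ranges([15 * 60, 16 * 60, 20 * 60], 60))
# → [(840, 1020), (1140, 1260)], i.e. 2-5pm and 7-9pm
```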

SLIDE 8

Experiments

[Figure: three plots of experimental results]

• Host and time-based threshold adjustment: Rate (%) vs. train data size (days, 5 to 25)
• Host-based threshold adjustment: Rate (%) vs. train data size (days, 5 to 25)
• True range threshold effect: Rate (%) vs. true range threshold (30, 60, 120, 180 min)

Legend: P1 total detected false events; P1 detected false events from hosts with true sets; P2 total detected false events; P2 detected false events from hosts with true sets

SLIDE 9

Discussion

• Leverage operational data for alert policy tuning
  • Antivirus (20% of a customer’s alerts)

• Weighted scheme
  • Put emphasis on recent input

• Impact of change operations
  • Integration of service management data is necessary

• Leverage server similarity
  • Grouping similar servers provides a better training dataset
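
The weighted scheme is only named on the slide. One common way to put emphasis on recent input is exponential decay by age, sketched below; the half-life parameter, the function name, and the choice of a false-alert rate as the statistic are assumptions for illustration.

```python
def recency_weighted_false_rate(alerts, half_life_days=7.0):
    """Fraction of alerts that are false, weighting recent alerts more.

    alerts: list of (age_days, is_false) pairs.
    An alert's weight halves every half_life_days (assumed decay scheme).
    """
    total_weight = 0.0
    false_weight = 0.0
    for age_days, is_false in alerts:
        weight = 0.5 ** (age_days / half_life_days)
        total_weight += weight
        if is_false:
            false_weight += weight
    return false_weight / total_weight if total_weight else 0.0
```

With a 7-day half-life, a false alert from today counts twice as much as a real alert from a week ago, so `recency_weighted_false_rate([(0, True), (7, False)])` is about 0.67 rather than 0.5.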

SLIDE 10

Conclusion

• How to reduce false alerts
  • Polygraph tunes alert policies based on historical data

• To improve recall, we utilized
  • Localized feature: host
    • High recall, barely misses true events
  • Time-dependent behavior
    • Higher recall, reasonable precision
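
The slides do not define recall and precision. A minimal sketch under one plausible reading, where recall is the share of real alerts that still fire after tuning and precision is the share of firing alerts that are real (the function name and input format are assumptions):

```python
def recall_precision(outcomes):
    """Compute (recall, precision) for a tuned alert policy.

    outcomes: list of (is_real, still_fires) pairs, one per historical alert.
    recall    = real alerts that still fire / all real alerts
    precision = real alerts that still fire / all alerts that still fire
    Both default to 1.0 when their denominator is zero.
    """
    real_total = sum(1 for is_real, _ in outcomes if is_real)
    fires_total = sum(1 for _, fires in outcomes if fires)
    real_fires = sum(1 for is_real, fires in outcomes if is_real and fires)
    recall = real_fires / real_total if real_total else 1.0
    precision = real_fires / fires_total if fires_total else 1.0
    return recall, precision
```

Under this reading, "barely misses true events" corresponds to recall close to 1.0, and suppressing more false alerts raises precision.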

SLIDE 11

Questions?