Dexter: Faster Troubleshooting of Misconfiguration Problems Using System Logs. Rukma Talwadker, NetApp Inc. Presented at SYSTOR 2017 (SYSTOR'17: 10th ACM International Systems & Storage Conference), 22nd May 2017.


slide-1
SLIDE 1

Dexter: Faster Troubleshooting of Misconfiguration Problems Using System Logs

Rukma Talwadker, NetApp Inc.

22nd May 2017, presented at SYSTOR 2017

SYSTOR'17: 10th ACM International Systems & Storage Conference

slide-2
SLIDE 2

Software Misconfigurations are a big deal!

  • Configuration options sprawl: nearly 2K options in Firefox and over 36K options in LibreOffice
  • Complex interaction between configuration settings and the execution environment renders the configuration effort difficult and error-prone
  • A significant portion of the operational cost (over 70%) is made up of people costs
  • Change of hands via escalations
  • Communication logistics

slide-3
SLIDE 3

Misconfiguration troubleshooting workflow

[Diagram labels: System Logs, Problem indicator, Suspicious entry, Misconfiguration Root Cause, Resolution]

System Logs:

  • “Error: Invalid Shelf ID”
  • “Warning: Delay in I/O response”
  • “Error: Exceeded the connection limit”
  • “Warning: Node clockskew detected”

Disk shelf IDs (the list contains a suspicious entry):

  • 12.234.56.7
  • 12.345.6.7
  • AS45698_WD
  • 12.36.78.6

Resolution: “Shelf ID should be a 32-bit numeric address written as four numbers separated by periods.”

Dexter is a misconfiguration troubleshooting tool!

slide-4
SLIDE 4

Dexter Spotlights

Dexter’s role is two-fold:

  • Problem Spotlights: a ranked list of system log messages, one or more of which correlate with the ongoing misconfiguration problem.
  • Resolution Spotlights: heuristically derived possible solutions to the given misconfiguration problem, obtained by mining the command history logs of the system.

Note: AutoSupport is a digital exhaust of system operations from various customer installations, sent back to NetApp periodically.

slide-5
SLIDE 5

Problem Spotlights

slide-6
SLIDE 6

Extracting System Log features (metrics)

Dexter considers the following features:

  1. Reported timestamp
  2. Software module and/or subsystem that logged the message
  3. The actual error message
  4. Message severity

A sample Apache HTTP server log:

[Fri Sep 09 10:42:29.902022 2011] [core:error] [pid 35708:tid 4328636416] [client 72.15.99.187] File does not exist: /usr/local/apache2/htdocs/favicon.ico
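The four features can be pulled out of the sample line with a regular expression. The following is a minimal sketch, not Dexter's actual parser: the pattern, the `extract_features` name, and the returned dictionary keys are assumptions made for illustration, and the pattern is written only against the line format of the sample shown on this slide.

```python
import re
from datetime import datetime

# Assumed pattern for the sample line's layout:
# [timestamp] [module:severity] [pid ...] [client ...] message
LOG_PATTERN = re.compile(
    r"\[(?P<timestamp>[^\]]+)\] "
    r"\[(?P<module>[^:\]]*):(?P<severity>[^\]]+)\] "
    r"\[pid (?P<pid>\d+)(?::tid (?P<tid>\d+))?\] "
    r"(?:\[client (?P<client>[^\]]+)\] )?"
    r"(?P<message>.*)"
)


def extract_features(line):
    """Return the four features as a dict, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    return {
        # Feature 1: reported timestamp
        "timestamp": datetime.strptime(
            match.group("timestamp"), "%a %b %d %H:%M:%S.%f %Y"
        ),
        # Feature 2: module or subsystem that logged the message
        "module": match.group("module"),
        # Feature 3: the actual error message
        "message": match.group("message"),
        # Feature 4: message severity
        "severity": match.group("severity"),
    }


SAMPLE = (
    "[Fri Sep 09 10:42:29.902022 2011] [core:error] "
    "[pid 35708:tid 4328636416] [client 72.15.99.187] "
    "File does not exist: /usr/local/apache2/htdocs/favicon.ico"
)
features = extract_features(SAMPLE)
```

For the sample line, `features["module"]` is `core` and `features["severity"]` is `error`. Logs from other systems would need their own patterns.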

slide-7
SLIDE 7

Finding Problem Spotlights

Metrics that matter. Log metrics defined by Dexter:

  • Message Recency: messages temporally correlated with the problem are potential clues
  • Message Severity: an error message is more concerning than a warning message
  • Message Frequency: an error message that appears more frequently and is also recent is possibly correlated

A ranked list of log messages is presented to the support engineer.
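One way to combine the three metrics into a single ranking is sketched below. The slides do not give Dexter's scoring formula, so everything here is an assumption: the severity weights, the exponential recency decay with a 24-hour half-life, the multiplicative score, and the `rank_messages` name. The timestamps in the example are made up; only the message texts come from the earlier workflow slide.

```python
from datetime import datetime

# Hypothetical weights: the slides only say errors matter more than warnings.
SEVERITY_WEIGHT = {"warning": 1.0, "error": 2.0}


def rank_messages(entries, problem_time, half_life_hours=24.0, top_n=10):
    """Rank distinct messages by severity * frequency * recency (assumed formula).

    entries: iterable of (timestamp, severity, message) tuples.
    """
    stats = {}
    for timestamp, severity, message in entries:
        weight = SEVERITY_WEIGHT.get(severity, 1.0)
        count, latest, max_weight = stats.get(message, (0, timestamp, weight))
        stats[message] = (count + 1, max(latest, timestamp), max(max_weight, weight))

    scored = []
    for message, (count, latest, weight) in stats.items():
        age_hours = abs((problem_time - latest).total_seconds()) / 3600.0
        # Recency halves every half_life_hours away from the problem time.
        recency = 0.5 ** (age_hours / half_life_hours)
        scored.append((weight * count * recency, message))

    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [message for _, message in scored[:top_n]]


# Illustrative data with invented timestamps.
problem_time = datetime(2017, 5, 22, 12, 0)
entries = [
    (datetime(2017, 5, 22, 11, 0), "error", "Error: Invalid Shelf ID"),
    (datetime(2017, 5, 22, 11, 30), "error", "Error: Invalid Shelf ID"),
    (datetime(2017, 5, 20, 11, 0), "warning", "Warning: Node clockskew detected"),
    (datetime(2017, 5, 22, 10, 0), "warning", "Warning: Delay in I/O response"),
]
ranked = rank_messages(entries, problem_time)
```

With this data the repeated, recent error ranks first and the two-day-old warning ranks last. A multiplicative score is only one choice; a weighted sum would also satisfy the three stated metrics.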

slide-8
SLIDE 8

Resolution Spotlights

slide-9
SLIDE 9

Resolution Spotlights

A heuristic approach for the offline resolution prediction component:

1) Dexter checks the system logs for the disappearance of problem spotlights after case closure.
2) It correlates the disappearance of problem spotlights with:

  • Execution of a new command
  • Re-execution of a command with a new argument, or an option with a new value
  • Time correlation of the command execution with the message disappearance
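The correlation step can be sketched as a filter over the command history. This is an illustration under stated assumptions rather than Dexter's implementation: the one-hour window, the `resolution_candidates` name, the history tuple layout, and the `example-*` command names are all hypothetical, and the dates are invented.

```python
from datetime import datetime, timedelta


def resolution_candidates(history, last_seen, window=timedelta(hours=1)):
    """Return commands that may explain why a problem spotlight disappeared.

    history: chronologically sorted (timestamp, command, options) tuples,
             where options is a dict of option name to value.
    last_seen: time the problem spotlight was last logged.
    """
    seen_commands = set()
    seen_values = set()
    candidates = []
    for timestamp, command, options in history:
        option_values = {(command, name, value) for name, value in options.items()}
        is_new_command = command not in seen_commands
        has_new_value = bool(option_values - seen_values)
        # Time correlation: executed close to the message disappearance.
        is_near = abs(timestamp - last_seen) <= window
        if is_near and (is_new_command or has_new_value):
            candidates.append((timestamp, command, options))
        seen_commands.add(command)
        seen_values.update(option_values)
    return candidates


# Hypothetical command history.
history = [
    (datetime(2016, 7, 14, 16, 38), "example-modify", {"timeout": "20"}),
    (datetime(2016, 7, 15, 9, 0), "example-show", {}),
    (datetime(2016, 7, 16, 5, 16), "example-modify", {"timeout": "3600"}),
]
last_seen = datetime(2016, 7, 16, 5, 0)
candidates = resolution_candidates(history, last_seen)
```

Here only the final command qualifies: it re-executes a known command with a new option value shortly after the message was last seen.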

slide-10
SLIDE 10

Dexter Workflow

  • Online and offline components

[Workflow diagram for a misconfiguration support case]

Online workflow (case opened):

  • Process logs & extract features
  • Extract problem spotlights
  • Publish top 10 problem spotlights

Offline workflow (case closed):

  • Estimate the problem-indicating spotlights
  • Correlate commands with logs temporally
  • Signature database: {problem spotlight, command list}
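The signature database pairs each problem spotlight with a command list. A minimal in-memory sketch of that mapping follows. The `SignatureDatabase` class, its method names, and the idea of looking up stored commands for newly published spotlights are assumptions for illustration; the slides do not describe the storage or lookup interface, and the command string is hypothetical.

```python
from collections import defaultdict


class SignatureDatabase:
    """Maps a problem spotlight (log message) to a list of correlated commands."""

    def __init__(self):
        self._commands = defaultdict(list)

    def record(self, problem_spotlight, command_list):
        """Offline side: store commands correlated with a spotlight."""
        known = self._commands[problem_spotlight]
        for command in command_list:
            if command not in known:
                known.append(command)

    def lookup(self, problem_spotlights):
        """Assumed online side: fetch stored commands for published spotlights."""
        return {
            spotlight: list(self._commands[spotlight])
            for spotlight in problem_spotlights
            if spotlight in self._commands
        }


db = SignatureDatabase()
db.record("Error: Invalid Shelf ID", ["example-modify shelf-id"])
found = db.lookup(["Error: Invalid Shelf ID", "Warning: Delay in I/O response"])
```

Spotlights with no recorded signature are simply absent from the result, which keeps the lookup from inventing suggestions.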

slide-11
SLIDE 11

Results

Validation on a sample set of 600 cases.

[Figure: results chart; only the label “ranks” is recoverable]

slide-12
SLIDE 12

Q&A

slide-13
SLIDE 13

An example of resolution prediction.

[Figure: a tree of commands and options. Node labels: root, cifs-server-modify, Vserver, user, Volume, name, allowed-hosts, timeout, signing-enabled, Other Commands. Each entry records a value, a count, and first accessed and last accessed times. Values shown include Foo, 20, PC, ON, (*), 3600 and XYZ; the times shown are 14th July, 4:38 PM and 16th July, 5:16 AM.]

slide-14
SLIDE 14

Evaluation Criteria

Effectiveness of Dexter

Problem Spotlights: Dexter posts the top 10 problem spotlights to the misconfiguration support case.

  • AutoSupports were enabled and available
  • The problem-indicating log message:
      • was contained (recorded) in the logs
      • was ranked within the top 10 log messages for the system at the time Dexter was invoked

Resolution Spotlights: all (pre-filtered) correlated commands are presented as a possible solution.

  • AutoSupports were enabled and available
  • The solution-indicating command log:
      • was contained (recorded) in the command history logs