Understanding Customer Problem Troubleshooting from Storage System - PowerPoint PPT Presentation

Understanding Customer Problem Troubleshooting from Storage System Logs
Weihang Jiang (wjiang3@uiuc.edu)
Weihang Jiang*+, Chongfeng Hu*+, Shankar Pasupathy+, Arkady Kanevsky+, Zhenmin Li#, Yuanyuan Zhou*
University of Illinois*, NetApp+, Pattern Insight#

Customer problem troubleshooting is critical
- Customer problems result in costly downtime for customers
  - Costs a customer 18.35% of TCO [Crimson '07].
- Customer problems are expensive for system vendors
  - Vendors devote more than 8% of total revenue and 15% of total employee costs to customer problem support [ASP'08].
- Complex modern storage systems make problem troubleshooting more challenging

Storage system is complex
[Figure: storage system stack. Software: other layers, file system layer, RAID layer, storage layer (software protocol stack). Storage subsystem: HBA, cables, shelf enclosure 1, shelf enclosure 2, disks, fans, AC power.]

Customer problems occur in different ways
[Figure: two copies of the storage system stack diagram (other layers, file system layer, RAID layer, storage layer (software protocol stack), HBA, AC power, disk, fan, shelf enclosures 1 and 2, cables, storage subsystem).]
- Customer problems include storage failures, partial failures, and any other system misbehaviors that users observe and do not expect from a healthy system.

Customer problem management workflow
[Figure: workflow diagram with the labels Customers, Customer Problems (Human-Generated, Auto-Generated), Support Center, DB, Log, Support Engineers, Resolutions / Workaround.]
- Quantitatively understand problem troubleshooting
- Can we systematically use system logs for troubleshooting?

Outline
- Motivation
- Understanding customer problem troubleshooting
  - Problem troubleshooting time
  - Problem category
  - Problem impacts
- Use log information for problem troubleshooting
- Conclusions

Data source
Customer problem case database (636,108)

Case ID | Report Date  | Resolution/Workaround Date | Problem cause (high-level) | Auto-generated | Critical Event | Problem cause (module-level)
1       | 5/1/06 11:21 | 5/2/06 13:35               | Software Bugs              | Y              | Crash          | File System
2       | 5/2/06 11:02 | 5/7/06 9:01                | Hardware Fault             | N              | N/A            | SCSI
3       | 5/3/06 15:40 | 5/8/06 14:48               | Misconfiguration           | N              | N/A            | Shelf

Storage System Log Archive (306,624 logs)
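As a minimal sketch of how a troubleshooting time could be derived from the sample rows above, the snippet below subtracts the report date from the resolution/workaround date. The slides do not define the metric explicitly, so treating it as this difference is an assumption, as are the date format string and the names `CASES`, `DATE_FORMAT`, and `troubleshooting_time`.

```python
from datetime import datetime, timedelta

# Sample rows copied from the slide: (case ID, report date, resolution/workaround date).
CASES = [
    (1, "5/1/06 11:21", "5/2/06 13:35"),
    (2, "5/2/06 11:02", "5/7/06 9:01"),
    (3, "5/3/06 15:40", "5/8/06 14:48"),
]

# Assumption: dates are month/day/two-digit-year with a 24-hour clock.
DATE_FORMAT = "%m/%d/%y %H:%M"


def troubleshooting_time(report: str, resolution: str) -> timedelta:
    """Elapsed time between the report and the resolution/workaround."""
    reported = datetime.strptime(report, DATE_FORMAT)
    resolved = datetime.strptime(resolution, DATE_FORMAT)
    return resolved - reported


for case_id, report, resolution in CASES:
    elapsed = troubleshooting_time(report, resolution)
    hours = elapsed.total_seconds() / 3600
    print(f"case {case_id}: {hours:.1f} hours")
```

Keeping the result as a `timedelta` leaves the choice of unit (hours, days) to whoever aggregates the values later.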

Analysis dimensions
- Problem category: hardware fault, software bug, misconfiguration
- Problem impacts: System crash? Usability problem? Performance problem?
- Problem troubleshooting time: How critical is it to automate problem troubleshooting?
- Correlation between problem category and troubleshooting time
- Correlation between problem impacts and troubleshooting time

Problem troubleshooting is time-consuming

Problem category distribution
[Figure: chart of problem categories with example callouts.]
- Hardware fault: e.g., disk drive, cable, SCSI controllers, HBA, DRAM, …
- Misconfiguration: e.g., set wrong parameters for devices, connect cable to wrong ports, use incompatible components together.
- Software bug: bugs in storage system software
- User knowledge: e.g., How to take snapshot? Why am I seeing high CPU?
- Customers' own execution environment: e.g., DNS server failures, APP bugs, …

- Hardware fault (40%) and misconfiguration (21%) are the two most frequent categories; software bugs account for a small percentage (3%).
- User knowledge (11%) and customers' own execution environment (9%).

Problem category and troubleshooting time
- Software bugs take longer to troubleshoot.
- For all categories, troubleshooting is time-consuming.

Problem impact distribution
[Figure: chart of problem impacts with example callouts.]
- Usability problem: e.g., cannot access a disk volume, cannot take snapshot, …
- Hardware component: e.g., disk, link, HBA, power supply, fan.
- Unhealthy status: e.g., low spare disk count, unstable interconnects, …

- Problems are captured at early stages
  - System crash (3%)
  - Hardware component (44%), unhealthy status (20%)

Problem impact and troubleshooting time
- System crashes take longer to troubleshoot.
- For all categories, troubleshooting is time-consuming.

Outline
- Motivation
- Understanding customer problem troubleshooting
  - Problem troubleshooting time
  - Problem category
  - Problem impacts
- Use log information for problem troubleshooting
- Conclusions

Use log information for problem diagnosis
[Figure: the customer problem case database (636,108) and the Storage System Log Archive (306,624 logs), repeating the sample table from the "Data source" slide.]

What log information to use?
- Is ONE log event enough?
- Or multiple log events? Are more events better?
- The critical event is ready to use.

Log excerpt (slide annotations in parentheses):
  (Single event revealing problem root cause)
  Sat Apr 15 05:58:15 EST [busError]: SCSI adapter encountered an unexpected bus phase. Issuing SCSI bus reset.
  Sat Apr 15 05:59:10 EST [fs.warn]: volume /vol/vol1 is low on free space. 98% in use.
  Sat Apr 15 06:01:10 EST [fs.warn]: volume /vol/vol10 is low on free space. 99% in use.
  Sat Apr 15 06:02:14 EST [raidDiskRecovering]: Attempting to bring device 9a back into service.
  Sat Apr 15 06:02:14 EST [raidDiskRecovering]: Attempting to bring device 9b back into service.
  ……
  Sat Apr 15 06:07:19 EST [timeoutError]: device 9a did not respond to requested I/O. I/O will be retried.
  Sat Apr 15 06:07:19 EST [noPathsError]: No more paths to device 9a: All retries have failed.
  Sat Apr 15 06:07:19 EST [timeoutError]: device 9b did not respond to requested I/O. I/O will be retried.
  Sat Apr 15 06:07:19 EST [noPathsError]: No more paths to device 9b. All retries have failed.
  Sat Apr 15 06:08:23 EST [filerUp]: Filer is up and running.
  ……
  (Critical event)
  Sat Apr 15 06:24:07 EST [crash:ALERT]: Crash String: File system hung in process idle_thread1
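To compare single events, multiple events, and the critical event, the log lines first have to be split into fields. The snippet below is a rough sketch that assumes every event follows the layout seen in the excerpt above (timestamp, bracketed event type, message). The names `LogEvent`, `LINE_RE`, and `parse_line` are illustrative, and treating event types that start with "crash" as critical is an assumption based on the single example on the slide.

```python
import re
from typing import NamedTuple, Optional


class LogEvent(NamedTuple):
    timestamp: str   # kept as text, e.g. "Sat Apr 15 05:58:15 EST"
    event_type: str  # text inside the brackets, e.g. "busError"
    message: str


# Assumed layout: "<weekday> <month> <day> <hh:mm:ss> <tz> [<type>]: <message>"
LINE_RE = re.compile(
    r"^(?P<timestamp>\w{3} \w{3} +\d{1,2} \d{2}:\d{2}:\d{2} \w+) "
    r"\[(?P<event_type>[^\]]+)\]: (?P<message>.*)$"
)


def parse_line(line: str) -> Optional[LogEvent]:
    """Return a LogEvent, or None for lines that do not match the layout."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    return LogEvent(**match.groupdict())


SAMPLE = [
    "Sat Apr 15 05:58:15 EST [busError]: SCSI adapter encountered an unexpected bus phase. Issuing SCSI bus reset.",
    "Sat Apr 15 05:59:10 EST [fs.warn]: volume /vol/vol1 is low on free space. 98% in use.",
    "……",
    "Sat Apr 15 06:24:07 EST [crash:ALERT]: Crash String: File system hung in process idle_thread1",
]

events = [event for event in map(parse_line, SAMPLE) if event is not None]
# Assumption: an event type beginning with "crash" marks the critical event.
critical = [event for event in events if event.event_type.startswith("crash")]
print(len(events), "events parsed;", len(critical), "critical")
```

The timestamp is left as a string because the excerpt carries no year, so converting it to a date would require information the log line does not contain.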

More log events are more useful
How well can the signature uniquely identify the cause?
F-score = 2 * Precision * Recall / (Precision + Recall)
[Figure: chart comparing signature types.]
- Multiple events: 45%
- Single event: 27%
- Critical event: 15%

- Critical event alone is not enough.
- Using more log events can bring better accuracy.
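The F-score formula above can be illustrated with a short sketch. It assumes a signature is a set of event types, that a case "matches" when its log contains every event type in the signature, and that precision and recall are measured against one problem cause. The toy cases are hypothetical and are not data from the study; the names `f_score` and `score_signature` are illustrative.

```python
def f_score(precision: float, recall: float) -> float:
    """F-score = 2 * Precision * Recall / (Precision + Recall)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def score_signature(signature, cause, cases):
    """Score how well a signature identifies a cause.

    cases is a list of (set of event types in the log, diagnosed cause).
    """
    matched_causes = [c for events, c in cases if signature <= events]
    cases_with_cause = [events for events, c in cases if c == cause]
    true_positives = sum(1 for c in matched_causes if c == cause)
    precision = true_positives / len(matched_causes) if matched_causes else 0.0
    recall = true_positives / len(cases_with_cause) if cases_with_cause else 0.0
    return precision, recall, f_score(precision, recall)


# Hypothetical toy cases, for illustration only.
TOY_CASES = [
    ({"busError", "timeoutError", "noPathsError", "crash"}, "SCSI"),
    ({"busError", "fs.warn"}, "SCSI"),
    ({"crash", "fs.warn"}, "File System"),
    ({"timeoutError", "crash"}, "Shelf"),
]

precision, recall, score = score_signature({"crash"}, "SCSI", TOY_CASES)
print(f"precision={precision:.2f} recall={recall:.2f} f-score={score:.2f}")
```

In this toy data the crash event appears under three different causes, so a signature built from it alone has low precision for any single cause, which is the kind of ambiguity the F-score penalizes.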

Challenges and opportunities
- Logs are noisy

Log excerpt (slide annotations in parentheses):
  (Single event revealing problem root cause)
  Sat Apr 15 05:58:15 EST [busError]: SCSI adapter encountered an unexpected bus phase. Issuing SCSI bus reset.
  Sat Apr 15 05:59:10 EST [fs.warn]: volume /vol/vol1 is low on free space. 98% in use.
  Sat Apr 15 06:01:10 EST [fs.warn]: volume /vol/vol10 is low on free space. 99% in use.
  Sat Apr 15 06:02:14 EST [raidDiskRecovering]: Attempting to bring device 9a back into service.
  Sat Apr 15 06:02:14 EST [raidDiskRecovering]: Attempting to bring device 9b back into service.
  ……
  Sat Apr 15 06:07:19 EST [timeoutError]: device 9a did not respond to requested I/O. I/O will be retried.
  Sat Apr 15 06:07:19 EST [noPathsError]: No more paths to device 9a: All retries have failed.
  Sat Apr 15 06:07:19 EST [timeoutError]: device 9b did not respond to requested I/O. I/O will be retried.
  Sat Apr 15 06:07:19 EST [noPathsError]: No more paths to device 9b. All retries have failed.
  Sat Apr 15 06:08:23 EST [filerUp]: Filer is up and running.
  ……
  (Critical event)
  Sat Apr 15 06:24:07 EST [crash:ALERT]: Crash String: File system hung in process idle_thread1

Challenges and opportunities
- Logs are noisy
- Important log events are not easy to locate

Log excerpt (slide annotations in parentheses):
  (Single event revealing problem root cause)
  Sat Apr 15 05:58:15 EST [busError]: SCSI adapter encountered an unexpected bus phase. Issuing SCSI bus reset.
  (Total of 106 log events)
  (Critical event)
  Sat Apr 15 06:24:07 EST [crash:ALERT]: Crash String: File system hung in process idle_thread1

Challenges and opportunities
- Logs are noisy
- Important log events are not easy to locate
- Similar log patterns appear on systems experiencing the same problems

Challenges and opportunities
- Logs are noisy
- Important log events are not easy to locate
- Similar log patterns appear on systems experiencing the same problems

[Figure: the case DB and logs feed an analysis step whose output goes to the support engineer.]
- Gather more information for troubleshooting.
- Retrieve past diagnoses as reference.
- Analysis steps:
  1) Find log event dependencies
  2) Identify important log events
  3) Cluster similar logs and filter noisy ones
- A good starting point for manual log analysis.
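The slide names the analysis steps but not the algorithms behind them. As one possible illustration of filtering noise and retrieving a similar past case, the sketch below drops event types that appear in most logs and then ranks past cases by Jaccard similarity of their remaining event types. Jaccard similarity, the 0.8 threshold, the case IDs, and the function names are all assumptions chosen for the example, not the authors' method.

```python
from collections import Counter


def filter_noise(logs, max_fraction=0.8):
    """Drop event types that occur in more than max_fraction of all logs.

    logs maps a case ID to the list of event types seen in its log.
    """
    counts = Counter(event for events in logs.values() for event in set(events))
    limit = max_fraction * len(logs)
    noisy = {event for event, n in counts.items() if n > limit}
    return {case: set(events) - noisy for case, events in logs.items()}


def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def most_similar(new_events, past_logs):
    """Return (case ID, similarity) for the closest past log."""
    return max(
        ((case, jaccard(new_events, events)) for case, events in past_logs.items()),
        key=lambda pair: pair[1],
    )


# Hypothetical past cases, for illustration only.
PAST_LOGS = {
    "case-1": ["fs.warn", "busError", "timeoutError", "noPathsError"],
    "case-2": ["fs.warn", "raidDiskRecovering"],
    "case-3": ["fs.warn", "filerUp"],
}

cleaned = filter_noise(PAST_LOGS)  # fs.warn appears in every log and is dropped
case_id, similarity = most_similar({"busError", "noPathsError"}, cleaned)
print(case_id, round(similarity, 2))
```

A set-based comparison ignores event order and timing, so it would not capture the event dependencies listed as step 1; it is only meant to show how a past diagnosis could be looked up from a new log.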

Conclusions
- Problem troubleshooting is time-consuming.
  - Hardware fault and misconfiguration are common causes
  - Lack of sufficient user knowledge
  - Most problems have low impact, while high-impact problems are more difficult to troubleshoot
- Storage system logs contain useful information for problem troubleshooting
  - Critical event alone is not enough.
  - Log analysis tools that can filter noise and identify similar patterns are essential to improve troubleshooting.

Thanks
Questions?
