Behavior-Based Problem Localization for Parallel File Systems
Michael P. Kasick
Rajeev Gandhi, Priya Narasimhan
Carnegie Mellon University
Michael P. Kasick, Behavior-Based Problem Localization, October 3, 2010
Kasick et al. Black-box problem diagnosis in parallel file systems. In FAST, San Jose, CA, Feb. 2010.
[Figure: server nodes ios0, ios1, ios2, ..., iosN and mds0, ..., mdsM]
[Figure: tcp_v4_rcv samples versus elapsed time (s), 0 to 600 s]
Application   Image    Function             Samples
pvfs2-server  vmlinux  tcp_recvmsg          658
vmlinux       vmlinux  sk_run_filter        808
vmlinux       vmlinux  tcp_rcv_established  686
vmlinux       vmlinux  tcp_v4_rcv           943

Function                 Count  Time (s)
job_testcontext          58     1.04
dbpf_pwrite              9      0.75
dbpf_dspace_testcontext  118    0.99
dbpf_sync_db             11     0.33
Feature vectors:
  < 2232, 1900, 3886 >
  < 808, 686, 943 >
  < 830, 678, 977 >
  < 807, 770, 987 >
Pairwise distances between vectors: 5581 5553 5454 64 129 125
Median distance per vector: 5553 129 125 129
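The peer comparison above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation. It assumes the distance is the Manhattan (L1) distance, which reproduces five of the six listed pairwise values (the first and third vectors compute to 5533 where the slide lists 5553). The function names and the threshold value are hypothetical.

```python
from itertools import combinations
from statistics import median

# Feature vectors as listed on the slide, in the order shown.
vectors = [
    (2232, 1900, 3886),
    (808, 686, 943),
    (830, 678, 977),
    (807, 770, 987),
]

def manhattan(a, b):
    """L1 distance between two equal-length vectors (assumed metric)."""
    return sum(abs(x - y) for x, y in zip(a, b))

def median_distances(vecs):
    """For each vector, the median of its distances to all other vectors."""
    n = len(vecs)
    dist = [[0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        dist[i][j] = dist[j][i] = manhattan(vecs[i], vecs[j])
    return [median(dist[i][j] for j in range(n) if j != i) for i in range(n)]

def flag_outliers(medians, threshold):
    """Indices whose median distance to their peers exceeds the threshold."""
    return [i for i, m in enumerate(medians) if m > threshold]

medians = median_distances(vectors)
THRESHOLD = 1000  # hypothetical value, chosen only for this example
print(medians)
print(flag_outliers(medians, THRESHOLD))
```

Taking the median rather than the mean keeps a single deviant peer from inflating the score of every other vector: only the first vector sits far from the majority, so only its median distance is large.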
Application  Image    Function
socat        vmlinux  copy_user_generic_string
vmlinux      vmlinux  set_normalized_timespec
vmlinux      vmlinux  ktime_get_ts
socat        socat    /usr/bin/socat
tg3.ko       tg3.ko   tg3_poll
vmlinux      vmlinux  tcp_v4_rcv
vmlinux      vmlinux  __inet_lookup_established
vmlinux      vmlinux  sk_run_filter
vmlinux      vmlinux  tcp_rcv_established
vmlinux      vmlinux  kmem_cache_alloc_node
[Figure: true positive rate (%) per fault (diskhog, diskbusy, wnethog, rnethog, recvloss, sendloss) for the Samples, Count, Time, and Combined metrics]