Advanced Network Performance Monitoring and Troubleshooting
Richard Carlson
March 5, 2009
rcarlson@internet2.edu
Basic Premise
- Application performance should meet your expectations!
- If it doesn't, you should complain!
- But you need to complain effectively!
Why is it hard to Find/Fix Problems?
- Network infrastructure is complex
- Network infrastructure is shared
- Network infrastructure consists of multiple components
Example 1 - SCP file transfer
- Bob and Carol are collaborating on a project.
- Bob needs to send a copy of the data (50 MB) to Carol every ½ hour.
- Bob and Carol are 2,000 miles apart.
How long should each transfer take?
- 5 minutes?
- 1 minute?
- 5 seconds?
What should we expect?
Assumptions:
- 100 Mbps Fast Ethernet is the slowest link
- 50 msec round trip time
Bob & Carol calculate:
- 50 MB * 8 b/B = 400 Mb
- 400 Mb / 100 Mb/s = 4 seconds
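Bob and Carol's estimate can be reproduced with a few lines of Python. This is a minimal sketch that, like the slide, assumes decimal megabytes and a fully used bottleneck link, and ignores protocol overhead and TCP ramp-up. The function name is illustrative only.

```python
def ideal_transfer_seconds(size_megabytes: float, link_mbps: float) -> float:
    """Best-case transfer time if the slowest link is fully used."""
    size_megabits = size_megabytes * 8  # 8 bits per byte
    return size_megabits / link_mbps


# 50 MB file over a 100 Mbps Fast Ethernet bottleneck
print(ideal_transfer_seconds(50, 100))  # 4.0
```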
Initial SCP Test Results
Initial Test Results
This is unacceptable!
First look for a network infrastructure problem
- Use NDT tester to examine both hosts
Initial NDT testing shows Duplex Mismatch at one end
NDT Found Duplex Mismatch
Investigation finds that the switch port is configured for 100 Mbps full-duplex operation.
- Network administrator corrects the configuration and asks for a re-test
Duplex Mismatch Corrected
SCP results after Duplex Mismatch Corrected
Intermediate Results
Time dropped from 18 minutes to 40 seconds.
But our calculations said it should take 4 seconds!
- 400 Mb / 40 sec = 10 Mbps
- Why are we limited to 10 Mbps?
- Are you satisfied with 1/10th of the possible performance?
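To put the three transfer times from this example side by side, the sketch below converts each one to an achieved rate and a fraction of the 100 Mbps link. The labels and helper name are illustrative. The times (18 minutes, 40 seconds, 4 seconds) are the ones quoted in the slides.

```python
FILE_MEGABITS = 400  # 50 MB * 8 b/B
LINK_MBPS = 100      # slowest link on the path


def achieved_mbps(size_megabits: float, seconds: float) -> float:
    """Average rate actually delivered for one transfer."""
    return size_megabits / seconds


for label, seconds in [
    ("with duplex mismatch", 18 * 60),
    ("mismatch corrected", 40),
    ("calculated target", 4),
]:
    rate = achieved_mbps(FILE_MEGABITS, seconds)
    print(f"{label}: {rate:.2f} Mbps ({rate / LINK_MBPS:.1%} of link)")
```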
Default TCP window settings
Calculating the Window Size
Remember Bob found the round-trip time was 50 msec
Calculate window size limit
- 85.3 KB * 8 b/B = 698777 b
- 698777 b / .050 s = 13.98 Mbps
Calculate new window size
- (100 Mb/s * .050 s) / 8 b/B = 610.3 KB
- Use 1 MB as a minimum
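Both window calculations follow from the same relation: a TCP sender can have at most one window of data in flight per round trip. The sketch below reproduces the slide's arithmetic, assuming 1 KB = 1024 bytes as the slide does. The function names are illustrative.

```python
def window_limited_mbps(window_bytes: float, rtt_seconds: float) -> float:
    """Maximum rate when at most one window is sent per round trip."""
    return window_bytes * 8 / rtt_seconds / 1e6


def required_window_bytes(link_mbps: float, rtt_seconds: float) -> float:
    """Window needed to keep the link full (bandwidth-delay product)."""
    return link_mbps * 1e6 * rtt_seconds / 8


RTT = 0.050                   # 50 msec round-trip time
default_window = 85.3 * 1024  # 85.3 KB default window

print(round(window_limited_mbps(default_window, RTT), 2))  # 13.98
print(round(required_window_bytes(100, RTT) / 1024, 1))    # 610.4
```

The second value rounds to 610.4 KB; the slide's 610.3 KB is the same number truncated rather than rounded.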
Resetting Window Value
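The settings used in the talk for this step are not reproduced here. As a hedged illustration only, and not necessarily the method Bob used, an individual application can request larger socket buffers through the standard `SO_SNDBUF` and `SO_RCVBUF` options. The operating system may cap or adjust the requested value, so the sketch reads back what was actually granted. The helper name is hypothetical.

```python
import socket


def request_buffers(sock: socket.socket, size_bytes: int) -> tuple[int, int]:
    """Ask for send/receive buffers of size_bytes; return what the OS granted."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, size_bytes)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, size_bytes)
    granted_send = sock.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF)
    granted_recv = sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
    return granted_send, granted_recv


with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
    # 1 MB, the minimum suggested by the window calculation
    snd, rcv = request_buffers(s, 1024 * 1024)
    print(f"send buffer: {snd} bytes, receive buffer: {rcv} bytes")
```

If the granted values are much smaller than requested, the system-wide limits are the constraint and must be raised by an administrator.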
With TCP windows tuned
Steps so far
Found and fixed Duplex Mismatch
- Network Infrastructure problem
Found and fixed TCP window values
- Host configuration problem
Are we done yet?
SCP results with tuned windows
Intermediate Results
SCP still runs slower than expected
- Hint: SCP uses internal buffers
- Patch available from PSC
SCP Results with tuned SCP
Final Results
- Fixed infrastructure problem
- Fixed host configuration problem
- Fixed application configuration problem
- Achieved target time of 4 seconds to transfer 50 MB file over 2000 miles
Example 2 - PNNL Throughput Problem
950+ Mbps from remote sites to PNNL
Measured speeds when PNNL sends: 966 Mbps, 328 Mbps, 930 Mbps
Measured speeds show a problem when PNNL sends
PNNL Throughput Problem
950+ Mbps from remote sites to PNNL
- 966 Mbps at 6 msec RTT
- 328 Mbps at 76 msec RTT
- 930 Mbps at 23 msec RTT
Interesting: between the 23 msec and 76 msec paths, RTT increases by roughly a factor of 3 and speed decreases by roughly the same factor
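The "factor of 3" observation can be checked directly from the measured figures. This is a small sketch using only the speeds and round-trip times quoted above; the variable names are illustrative.

```python
# (throughput in Mbps, RTT in msec) for the two paths being compared
short_path = (930, 23)
long_path = (328, 76)

rtt_ratio = long_path[1] / short_path[1]    # how much longer the RTT is
speed_ratio = short_path[0] / long_path[0]  # how much slower the transfer is

print(f"RTT increases by a factor of {rtt_ratio:.2f}")      # 3.30
print(f"Speed decreases by a factor of {speed_ratio:.2f}")  # 2.84
```

The two factors are close but not identical, which is why the observation is only approximate.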
PNNL Throughput Problem
950+ Mbps from remote sites to PNNL
- 966 Mbps, 6 msec RTT, 0.0094% loss, 6.04% ooo
- 328 Mbps, 76 msec RTT, 0.0049% loss, 5.15% ooo
- 930 Mbps, 23 msec RTT, 0.0045% loss, 5.5% ooo
Finally: look at the loss rate and packet reordering (ooo) rate. The problem exists in the Seattle-PNNL metro net.
Advanced user tools
- Existing NDT tool
  - Allows users to test network path for a limited number of common problems
- Existing NPAD tool
  - Allows users to test local network infrastructure while simulating a long path
Network Diagnostic Tool (NDT)
- Measure performance to the user's desktop
- Identify real problems for real users
  - Network infrastructure is the problem
  - Host tuning issues are the problem
- Make tool simple to use and understand
- Make tool useful for users and network administrators
NDT sample Results
Finding a Server
- What? You don’t have one running at your site?
- Install the Internet2 Network Performance Toolkit Knoppix Disk
NPAD/pathdiag
- A new tool from researchers at Pittsburgh Supercomputing Center
- Finds problems that affect long network paths
- Uses a Web100-enhanced Linux-based server
- Web-based Java client
Long Path Problem
[Figure: network diagram with hosts H1, H2, H3, Switches 1 through 4, and routers R1 through R9; one element is marked with an X. RTT is 1 msec for H1 to H2 and 70 msec for H1 to H3.]
NPAD Server main page
NPAD Sample results
Finding a Server
- What? You don’t have one running at your site?
- Install the Internet2 Network Performance Toolkit Knoppix Disk
Sample BWCTL results
OWping Results
NPToolkit Knoppix Disk
Conclusions
- OSG VDT will contain client tools
- Network operators (campus, regional, national) are standing up servers
- OSG site admins need to stand up