Advanced Network Performance Monitoring and Troubleshooting

SLIDE 1

Advanced Network Performance Monitoring and Troubleshooting

Richard Carlson March 5, 2009 rcarlson@internet2.edu

SLIDE 2

Basic Premise

  • Application performance should meet your expectations!
  • If it doesn’t, you should complain!
  • But you need to complain effectively!
SLIDE 3

Why is it hard to Find/Fix Problems?

  • Network infrastructure is complex
  • Network infrastructure is shared
  • Network infrastructure consists of multiple components

SLIDE 4

Example 1 – SCP file transfer

Bob and Carol are collaborating on a project. Bob needs to send a copy of the data (50 MB) to Carol every ½ hour. Bob and Carol are 2,000 miles apart.

How long should each transfer take?

  • 5 minutes?
  • 1 minute?
  • 5 seconds?
SLIDE 5

What should we expect?

Assumptions:

  • 100 Mbps Fast Ethernet is the slowest link
  • 50 msec round trip time

Bob & Carol calculate:

  • 50 MB * 8 = 400 Mbits
  • 400 Mb / 100 Mb/sec = 4 seconds
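Bob and Carol's estimate can be sketched in a few lines of Python. The function name `ideal_transfer_seconds` is hypothetical, and the sketch assumes, as the slide does, that the file size divides evenly into megabits and that protocol overhead and TCP slow start can be ignored.

```python
def ideal_transfer_seconds(size_mb: float, link_mbps: float) -> float:
    """Best-case transfer time: file size in megabits over the bottleneck rate."""
    size_mbits = size_mb * 8  # 1 byte = 8 bits
    return size_mbits / link_mbps


# 50 MB file over a 100 Mbps bottleneck link (values from the slide)
print(ideal_transfer_seconds(50, 100), "seconds")
```

Note that the 50 msec round trip time does not enter this estimate at all; it only matters once the TCP window becomes the limit.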
SLIDE 6

Initial SCP Test Results

SLIDE 7

Initial Test Results

This is unacceptable! First, look for a network infrastructure problem.

  • Use the NDT tester to examine both hosts
SLIDE 8

Initial NDT testing shows Duplex Mismatch at one end

SLIDE 9

NDT Found Duplex Mismatch

Investigation finds that the switch port is configured for 100 Mbps Full-Duplex operation.

  • Network administrator corrects the configuration and asks for a re-test

SLIDE 10

Duplex Mismatch Corrected

SLIDE 11

SCP results after Duplex Mismatch Corrected

SLIDE 12

Intermediate Results

Time dropped from 18 minutes to 40 seconds. But our calculations said it should take 4 seconds!

  • 400 Mb / 40 sec = 10 Mbps
  • Why are we limited to 10 Mbps?
  • Are you satisfied with 1/10th of the possible performance?
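The same arithmetic, run in reverse, turns a measured transfer time into an achieved rate. This is a minimal sketch; `achieved_mbps` is a hypothetical helper, and the 18 minute and 40 second timings are the ones quoted on the slide.

```python
def achieved_mbps(size_mb: float, seconds: float) -> float:
    """Average throughput in Mbps for a transfer of size_mb megabytes."""
    return size_mb * 8 / seconds


with_mismatch = achieved_mbps(50, 18 * 60)  # before the duplex fix
after_fix = achieved_mbps(50, 40)           # after the duplex fix
link_fraction = after_fix / 100             # share of the 100 Mbps bottleneck

print(f"with duplex mismatch: {with_mismatch:.2f} Mbps")
print(f"after fix: {after_fix:.1f} Mbps ({link_fraction:.0%} of the link)")
```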

SLIDE 13

Default TCP window settings

SLIDE 14

Calculating the Window Size

Remember, Bob found the round-trip time was 50 msec.

Calculate window size limit

  • 85.3 KB * 8 b/B = 698777 b
  • 698777 b / .050 s = 13.98 Mbps

Calculate new window size

  • (100 Mb/s * .050 s) / 8 b/B = 610.3 KB
  • Use 1 MB as a minimum
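Both calculations follow from one relationship: a TCP sender can have at most one window of data in flight per round trip, so throughput is capped at window / RTT, and the window needed to fill a link is rate × RTT (the bandwidth-delay product). A minimal Python sketch, assuming 1 KB = 1024 bytes as the slide's figures imply; the function names are hypothetical.

```python
def window_limited_mbps(window_bytes: float, rtt_s: float) -> float:
    """Upper bound on throughput when one window is sent per round trip."""
    return window_bytes * 8 / rtt_s / 1e6


def bdp_bytes(link_mbps: float, rtt_s: float) -> float:
    """Bandwidth-delay product: bytes in flight needed to keep the link full."""
    return link_mbps * 1e6 * rtt_s / 8


default_window = 85.3 * 1024  # default window from the slide, in bytes
rtt = 0.050                   # 50 msec round trip time

limit = window_limited_mbps(default_window, rtt)
needed_kb = bdp_bytes(100, rtt) / 1024

print(f"limit with default window: {limit:.2f} Mbps")
print(f"window needed for 100 Mbps: about {needed_kb:.0f} KB")
```

The cap of roughly 14 Mbps is in line with the 10 Mbps Bob and Carol measured, and the required window of about 610 KB is why the slide rounds up to 1 MB.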
SLIDE 15

Resetting Window Value

SLIDE 16

With TCP windows tuned

SLIDE 17

Steps so far

Found and fixed Duplex Mismatch

  • Network Infrastructure problem

Found and fixed TCP window values

  • Host configuration problem

Are we done yet?

SLIDE 18

SCP results with tuned windows

SLIDE 19

Intermediate Results

SCP still runs slower than expected

  • Hint: SCP uses internal buffers
  • Patch available from PSC
SLIDE 20

SCP Results with tuned SCP

SLIDE 21

Final Results

Fixed infrastructure problem
Fixed host configuration problem
Fixed application configuration problem

  • Achieved target time of 4 seconds to transfer a 50 MB file over 2,000 miles

SLIDE 22

Example 2 - PNNL Throughput Problem

950+ Mbps from remote sites to PNNL

966 Mbps 328 Mbps 930 Mbps

Measured speeds show a problem when PNNL sends

SLIDE 23

PNNL Throughput Problem

950+ Mbps from remote sites to PNNL

  • 966 Mbps, 6 msec
  • 328 Mbps, 76 msec
  • 930 Mbps, 23 msec

Interesting: RTT increases by roughly a factor of 3 (23 msec to 76 msec) and speed decreases by about the same factor (930 Mbps to 328 Mbps)
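The factor-of-3 observation can be checked directly from the figures on the slide. This sketch assumes the comparison is between the 23 msec path and the 76 msec path, which is the only pairing of the three measurements that gives a factor near 3 on both axes; the variable names are hypothetical.

```python
# (throughput in Mbps, round trip time in msec), as listed on the slide
short_path = {"mbps": 930, "rtt_ms": 23}
long_path = {"mbps": 328, "rtt_ms": 76}

rtt_ratio = long_path["rtt_ms"] / short_path["rtt_ms"]
speed_ratio = short_path["mbps"] / long_path["mbps"]

print(f"RTT grows by a factor of {rtt_ratio:.1f}")
print(f"speed drops by a factor of {speed_ratio:.1f}")
```

Throughput that falls in proportion to RTT is what a sender limited by a fixed amount of data in flight would show, which is why the pattern is worth a closer look.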

SLIDE 24

PNNL Throughput Problem

950+ Mbps from remote sites to PNNL

  • 966 Mbps, 6 msec, 0.0094% loss, 6.04% ooo
  • 328 Mbps, 76 msec, 0.0049% loss, 5.15% ooo
  • 930 Mbps, 23 msec, 0.0045% loss, 5.5% ooo

Finally: look at the loss rate and the packet reordering (ooo) rate. The problem exists in the Seattle – PNNL metro net.

SLIDE 25

Advanced user tools

  • Existing NDT tool
      • Allows users to test a network path for a limited number of common problems
  • Existing NPAD tool
      • Allows users to test local network infrastructure while simulating a long path

SLIDE 26

Network Diagnostic Tool (NDT)

  • Measure performance to the user’s desktop
  • Identify real problems for real users
  • Network infrastructure is the problem
  • Host tuning issues are the problem
  • Make tool simple to use and understand
  • Make tool useful for users and network administrators

SLIDE 27

NDT sample Results

SLIDE 28

Finding a Server

  • What? You don’t have one running at your site?
  • Install the Internet2 Network Performance Toolkit Knoppix Disk

SLIDE 29

NPAD/pathdiag

  • A new tool from researchers at Pittsburgh Supercomputing Center
  • Finds problems that affect long network paths
  • Uses a Web100-enhanced Linux-based server
  • Web-based Java client
SLIDE 30

Long Path Problem

[Figure: network diagram with hosts H1, H2, H3, Switches 1 to 4, and routers R1 to R9, with one point marked X. RTT is 1 msec for H1 – H2 and 70 msec for H1 – H3.]

SLIDE 31

NPAD Server main page

SLIDE 32

NPAD Sample results

SLIDE 33

Finding a Server

  • What? You don’t have one running at your site?
  • Install the Internet2 Network Performance Toolkit Knoppix Disk

SLIDE 34

Sample BWCTL results

SLIDE 35

OWping Results

SLIDE 36

NPToolkit Knoppix Disk

SLIDE 37

Conclusions

  • OSG VDT will contain client tools
  • Network operators (campus, regional, national) are standing up servers
  • OSG site admins need to stand up a server ‘near’ the cluster