EPL606
Transport Layer
EPL606 Transport Layer Outline
- Transport Layer Services
- TCP: overview, segment structure, sequence/acknowledgement numbers, connection management, RTT estimation, ACKs, events, fast retransmit
- Flow Control
- Congestion Control: general causes, TCP congestion control (slow start, AIMD)
The underlying network may:
- drop messages
- re-order messages
- deliver duplicate copies of a given message
- limit messages to some finite size
- deliver messages after an arbitrarily long delay
A transport protocol may be required to:
- guarantee message delivery
- deliver messages in the same order they are sent
- deliver at most one copy of each message
- support arbitrarily large messages
- support synchronization
- allow the receiver to flow control the sender
- support multiple application processes on each host
servers have well-known ports see /etc/services on Unix
pseudo-header + UDP header + data
[Figure: UDP header format - SrcPort and DstPort (16 bits each, bit positions 0, 16, 31), Length and Checksum fields, followed by Data]
RFCs: 793, 1122, 1323, 2018, 2581
TCP overview:
- point-to-point: one sender, one receiver
- reliable, in-order byte stream: no "message boundaries"
- pipelined: TCP congestion and flow control set window size
- send & receive buffers: app writes bytes, TCP sends segments, app reads bytes
- full-duplex data: bi-directional data flow in same connection; MSS: maximum segment size
- connection-oriented: handshaking (exchange of control msgs) initializes sender and receiver state before data exchange
- flow controlled: keep sender from overrunning receiver
- congestion controlled: keep sender from overrunning network
[Figure: application process writes bytes into the TCP send buffer; TCP transmits segments; the peer's TCP receive buffer is read by the application process]
How do two hosts initiate a connection?
- the server registers with the TCP layer, instructing it to "accept" connections at a certain port
- the client initiates a "connect" request, which is "accept"-ed by the server
Sender and receiver establish the "connection" before exchanging data segments, initializing seq. #s and buffers/flow control info (e.g. RcvWindow).
CTL = which control bits in the TCP header are set to 1.
Closing a connection:
client closes socket: clientSocket.close();
Step 1: client end system sends TCP FIN control segment to server.
Step 2: server receives FIN, replies with ACK; closes connection, sends FIN.
Step 3: client receives FIN, replies with ACK. Enters "timed wait" - will respond with ACK to received FINs.
Step 4: server receives ACK. Connection closed.
Note: with small modification, can handle simultaneous FINs.
[Figure: client-server close sequence - FIN/ACK in each direction; client enters timed wait before closed]
All bytes sent over a TCP connection are numbered by TCP, starting from an initial sequence number.
[Figure: three-way handshake between the active participant (client) and the passive participant (server)]
Q: how to set the TCP timeout value?
- longer than RTT, but RTT varies
- too short: premature timeout, unnecessary retransmissions
- too long: slow reaction to segment loss
Q: how to estimate RTT?
- SampleRTT: measured time from segment transmission until ACK receipt (ignore retransmissions)
- SampleRTT varies; want the estimated RTT "smoother": average several recent measurements, not just the current SampleRTT
  EstimatedRTT = (1- α)*EstimatedRTT + α*SampleRTT
- exponentially weighted moving average: influence of a past sample decreases exponentially fast
- typical value: α = 0.125
RTT: gaia.cs.umass.edu to fantasia.eurecom.fr
[Figure: SampleRTT and EstimatedRTT (milliseconds) vs. time (seconds)]
Setting the timeout: EstimatedRTT plus a "safety margin"; large variation in EstimatedRTT -> larger safety margin.
First estimate how much SampleRTT deviates from EstimatedRTT:
  DevRTT = (1-β)*DevRTT + β*|SampleRTT-EstimatedRTT|   (typically, β = 0.25)
Then set the timeout interval:
  TimeoutInterval = EstimatedRTT + 4*DevRTT
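As a sketch, the two estimators and the timeout rule can be combined in a few lines of Python; seeding the estimate with the first sample is an assumption (the slides do not specify initialization):

```python
# Sketch of the slides' estimators with typical gains alpha = 0.125, beta = 0.25.
def make_rtt_estimator(alpha=0.125, beta=0.25):
    est = {"rtt": None, "dev": 0.0}

    def update(sample_rtt):
        if est["rtt"] is None:
            est["rtt"] = sample_rtt            # first sample seeds the estimate
        else:
            est["dev"] = (1 - beta) * est["dev"] + beta * abs(sample_rtt - est["rtt"])
            est["rtt"] = (1 - alpha) * est["rtt"] + alpha * sample_rtt
        return est["rtt"] + 4 * est["dev"]     # TimeoutInterval
    return update
```

Feeding SampleRTTs of 100 ms then 120 ms yields EstimatedRTT = 102.5 ms and DevRTT = 5 ms, so TimeoutInterval = 122.5 ms.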
TCP creates a reliable data transfer service on top of IP's unreliable service, using pipelined segments, cumulative acks, and a single retransmission timer.
Retransmissions are triggered by: timeout events, duplicate acks.
Initially consider a simplified TCP sender: ignore duplicate acks; ignore flow control, congestion control.
MSS is the largest segment size that can be sent without IP fragmentation
Data rcvd from app: create segment with seq # (the byte-stream number of the first data byte in the segment); start timer if not already running (think of the timer as for the oldest unacked segment); expiration interval: TimeOutInterval.
Timeout: retransmit the segment that caused the timeout; restart the timer.
Ack rcvd: if it acknowledges previously unacked segments, update what is known to be acked; start timer if there are outstanding segments.
NextSeqNum = InitialSeqNum
SendBase = InitialSeqNum
loop (forever) {
  switch(event)
  event: data received from application above
    create TCP segment with sequence number NextSeqNum
    if (timer currently not running)
      start timer
    pass segment to IP
    NextSeqNum = NextSeqNum + length(data)
  event: timer timeout
    retransmit not-yet-acknowledged segment with smallest sequence number
    start timer
  event: ACK received, with ACK field value of y
    if (y > SendBase) {
      SendBase = y
      if (there are currently not-yet-acknowledged segments)
        start timer
    }
} /* end of loop forever */
(simplified)
Comment: SendBase-1 is the last cumulatively ack'ed byte.
Example: y = 73, so the rcvr wants 73+; y > SendBase, so new data is acked.
[Figure: lost ACK scenario - Host A sends Seq=92; the ACK is lost (X); the Seq=92 timer expires and Host A retransmits]
[Figure: premature timeout - Host A's Seq=92 timer expires before the ACK arrives, causing an unnecessary retransmission; SendBase = 100, then 120]
[Figure: cumulative ACK scenario - an earlier ACK is lost (X), but a later cumulative ACK arrives before timeout; SendBase = 120 and no retransmission is needed]
Receiver-side techniques: after advertising a zero window, wait for space equal to a maximum segment size (MSS) before re-opening it; delayed acknowledgements.
The timeout period is often relatively long: a long delay before resending a lost packet.
Instead, detect lost segments via duplicate ACKs: the sender often sends many segments back-to-back, so if a segment is lost there will likely be many duplicate ACKs.
If the sender receives 3 ACKs for the same data, it supposes that the segment after the ACKed data was lost and performs a fast retransmit: resend the segment before the timer expires.
event: ACK received, with ACK field value of y
  if (y > SendBase) {
    SendBase = y
    if (there are currently not-yet-acknowledged segments)
      start timer
  }
  else {  /* a duplicate ACK for already ACKed segment */
    increment count of dup ACKs received for y
    if (count of dup ACKs received for y == 3) {
      resend segment with sequence number y   /* fast retransmit */
    }
  }
The receive side of a TCP connection has a receive buffer. Flow control is a speed-matching service: matching the send rate to the receiving app's drain rate. The app process may be slow at reading from the buffer; flow control ensures the sender won't overflow the receiver's buffer by transmitting too much, too fast.
The receiver advertises its spare room:
  spare room = RcvWindow = RcvBuffer - [LastByteRcvd - LastByteRead]
by including the value of RcvWindow in segments. The sender limits unACKed data to RcvWindow; this guarantees the receive buffer doesn't overflow.
The sending window size is thus limited by the receiver window value; it may be limited further if there is congestion in the network.
Notes:
1. The source does not have to send a full window's worth of data.
2. The size of the window can be increased or decreased by the destination.
3. The destination can send an acknowledgment at any time.
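The spare-room formula above as a one-line helper (the function and argument names are mine, not from any real TCP stack):

```python
def rcv_window(rcv_buffer, last_byte_rcvd, last_byte_read):
    """Spare room the receiver advertises: RcvBuffer - [LastByteRcvd - LastByteRead]."""
    return rcv_buffer - (last_byte_rcvd - last_byte_read)

# 64 KB buffer, 48 KB received, 16 KB read by the app -> 32 KB advertised
window = rcv_window(64 * 1024, 48 * 1024, 16 * 1024)
```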
Time until the 32-bit sequence number space wraps around:
  Bandwidth              Time Until Wrap Around
  T1 (1.5 Mbps)          6.4 hours
  Ethernet (10 Mbps)     57 minutes
  T3 (45 Mbps)           13 minutes
  FDDI (100 Mbps)        6 minutes
  STS-3 (155 Mbps)       4 minutes
  STS-12 (622 Mbps)      55 seconds
  STS-24 (1.2 Gbps)      28 seconds
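The table's entries follow from dividing the 2^32-byte sequence space by the line rate; a quick check in Python:

```python
def wrap_time_seconds(bandwidth_bps):
    """Time for a byte-numbered 32-bit sequence space to wrap at a given rate."""
    return (2 ** 32) * 8 / bandwidth_bps

# T1 (1.5 Mbps): about 6.4 hours
t1_hours = wrap_time_seconds(1.5e6) / 3600
```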
The window must cover the delay x bandwidth product to keep the pipe full (here RTT = 100 ms):
  Bandwidth              Delay x Bandwidth Product
  T1 (1.5 Mbps)          18 KB
  Ethernet (10 Mbps)     122 KB
  T3 (45 Mbps)           549 KB
  FDDI (100 Mbps)        1.2 MB
  STS-3 (155 Mbps)       1.8 MB
  STS-12 (622 Mbps)      7.4 MB
  STS-24 (1.2 Gbps)      14.8 MB
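These products are just bandwidth times RTT (100 ms here), converted to bytes; for example:

```python
def pipe_bytes(bandwidth_bps, rtt_s=0.1):
    """Delay x bandwidth product: bytes needed in flight to keep the pipe full."""
    return bandwidth_bps * rtt_s / 8

# T1: 1.5 Mbps * 100 ms = 18750 bytes (~18 KB)
t1_pipe = pipe_bytes(1.5e6)
```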
When should the sender transmit a segment?
- waiting too long: hurts interactive applications
- too short: poor network utilization
- strategies: timer-based vs self-clocking
Self-clocking rule (Nagle-style): when the application produces data to send,
  if it fills a max segment (and the window is open): send it
  else
    if there is unack'ed data in transit: buffer it until an ACK arrives
    else: send it
Event at Receiver -> TCP Receiver action:
- Arrival of in-order segment with expected seq #; all data up to expected seq # already ACKed -> Delayed ACK: wait up to 500ms for next segment; if no next segment, send ACK.
- Arrival of in-order segment with expected seq #; one other segment has ACK pending -> Immediately send single cumulative ACK, ACKing both in-order segments.
- Arrival of out-of-order segment with higher-than-expected seq #; gap detected -> Immediately send duplicate ACK, indicating seq # of next expected byte.
- Arrival of segment that partially or completely fills gap -> Immediately send ACK, provided that segment starts at lower end of gap.
Two broad approaches: pre-allocate resources so as to avoid congestion, or control congestion if (and when) it occurs.
hosts at the edges of the network (transport protocol) routers inside the network (queuing discipline)
best-effort (assume for now) multiple qualities of service (later)
[Figure: Source 1 (100-Mbps FDDI) and Source 2 (10-Mbps Ethernet) feed a router whose 1.5-Mbps T1 link leads to the destination]
sequence of packets sent between source/destination pair maintain soft state at the routers
router-centric versus host-centric reservation-based versus feedback-based window-based versus rate-based
[Figure: three sources and two destinations sharing a network of routers]
Congestion, informally: "too many sources sending too much data too fast for the network to handle". It sets in as "the amount of data transmitted approaches network capacity".
keep number of packets below level at which performance drops
lost packets (buffer overflow at routers) long delays (queueing in router buffers)
if the arrival rate exceeds the service rate, then queue size grows without bound and packet delay goes to infinity
May cause congestion to propagate throughout network
Throughput rises with load up to capacity, while delay goes to infinity at full capacity. Moderate congestion causes throughput to increase at a slower rate than load; severe congestion causes delay to increase and eventually throughput to drop to zero.
Congestion scenario 1: two senders, two receivers; one router with infinite buffers; no retransmission. When congested: large queueing delays, and a maximum achievable throughput is reached.
Congestion scenario 2: one router, finite shared output link buffers.
Host A sends λin (original data) and λ'in (original data, plus retransmitted data); Host B receives λout.
a. always: λout = λin (goodput)
b. "perfect" retransmission only when loss: λ'in > λout
c. retransmission of delayed (not lost) packets makes λ'in larger (than the perfect case) for the same λout
"Costs" of congestion:
- more work (retransmissions) for a given "goodput"
- unneeded retransmissions: the link carries multiple copies of a pkt
[Figure: goodput λout vs. offered load for cases a, b, c; throughput is bounded by R/2 and, with retransmissions, goodput falls toward R/3 and R/4]
Q: what happens as λin and λ'in increase?
Next scenario: multiple hosts share multihop paths through routers with finite shared output link buffers; λin: original data; λ'in: original data, plus retransmitted data.
Another "cost" of congestion: when a packet is dropped, any "upstream" transmission capacity used for that packet was wasted!
Implicit (end-end) congestion control: no explicit feedback from the network; congestion is inferred from end-system observed loss and delay; this is the approach taken by TCP.
Network-assisted congestion control: routers provide feedback to end systems: a single bit indicating congestion (SNA, DECbit, TCP/IP ECN, ATM), or an explicit rate the sender should send at ("backpressure").
Notification can flow backward or forward, and schemes may be binary, credit-based, or rate-based; congestion may be signalled by intermediate or egress nodes.
- Forward explicit congestion avoidance: for traffic in the same direction as the received frame: "this frame has encountered congestion".
- Backward explicit congestion avoidance: for traffic in the opposite direction of the received frame: "frames transmitted may encounter congestion". Backward notification supports quick action when it is required.
Two approaches: pre-allocate resources so as to avoid congestion, or send data and control congestion if (and when) it occurs.
hosts at the edges of the network (transport protocol) routers inside the network (queuing discipline)
Attempt to simplify routers
RSVP requires API and application changes
ATM has rate-based algorithms to specify acceptable rates for each flow. Alternatives include congestion indication, where hosts shrink their window.
assumes best-effort network (FIFO or FQ routers) each source determines network capacity for itself uses implicit feedback ACKs pace transmission (self-clocking)
determining the available capacity in the first place adjusting to changes in the available capacity
limits how much data the source has in transit:
  MaxWin = MIN(CongestionWindow, AdvertisedWindow)
  EffWin = MaxWin - (LastByteSent - LastByteAcked)
increase CongestionWindow when congestion goes down decrease CongestionWindow when congestion goes up
timeout signals that a packet was lost packets are seldom lost due to transmission error lost packet implies congestion
- additive increase: increment CongestionWindow by one packet per RTT; in practice, increment a little for each ACK:
    Increment = (MSS * MSS)/CongestionWindow
    CongestionWindow += Increment
- multiplicative decrease: divide CongestionWindow by two whenever a timeout occurs
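A sketch of the two AIMD rules in bytes (the MSS value and the 1-MSS floor are assumptions; this is not a full TCP):

```python
MSS = 1460  # assumed segment size in bytes

def on_ack(cwnd):
    """Additive increase: per-ACK increment summing to ~1 MSS per RTT."""
    return cwnd + (MSS * MSS) / cwnd

def on_timeout(cwnd):
    """Multiplicative decrease: halve the window (1-MSS floor assumed)."""
    return max(cwnd / 2, MSS)

cwnd = 10 * MSS
for _ in range(10):          # one ACK per segment in a 10-segment window
    cwnd = on_ack(cwnd)
# cwnd has grown by just under one MSS over the RTT
```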
[Figure: AIMD sawtooth - CongestionWindow (KB) vs. time (seconds) between a source and destination]
Slow start: additive increase is too slow for determining the available capacity in the first place.
Example: MSS = 500 bytes & RTT = 200 msec gives an initial rate of only 20 kbps; the available bandwidth may be much larger, so it is desirable to quickly ramp up to a respectable rate.
Increase the rate multiplicatively (fast) until the first loss event.
[Figure: slow start exchange between Host A and Host B; the number of segments per RTT doubles each round]
Available Window = MIN[window, cwnd]
Start the connection with cwnd = 1 and double CongWin every RTT, by incrementing cwnd by 1 at each ACK (cwnd = cwnd + 1), up to some max.
begin with CongestionWindow = 1 packet double CongestionWindow each RTT (increment by 1 packet for each ACK)
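The doubling above can be sketched by counting RTTs until a threshold is reached:

```python
def slow_start_rtts(ssthresh):
    """RTTs for cwnd (in packets) to grow from 1 to >= ssthresh when it
    doubles each RTT (one increment per ACK, one ACK per packet)."""
    cwnd, rtts = 1, 0
    while cwnd < ssthresh:
        cwnd += cwnd     # one ACK per in-flight packet -> cwnd doubles
        rtts += 1
    return rtts
```

A 64-packet window is reached in 6 RTTs, whereas additive increase from 1 would need 63.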
Slow start is used when first starting a connection, and when a connection goes dead waiting for a timeout.
[Figure: trace of CongestionWindow (KB) vs. time (seconds), marking the initial transmit, timeouts, CongestionThreshold, and the transmit times of retransmitted packets; timeouts lead to idle periods]
Fast retransmit: use duplicate ACKs to trigger retransmission.
[Figure: sender transmits packets 1-6; packet 3 is lost; the receiver returns ACK 1, ACK 2, then duplicate ACK 2s for packets 4-6; the sender retransmits packet 3 and receives ACK 6]
Fast recovery: skip the slow start phase and go directly to half the last successful CongestionWindow (ssthresh).
[Figure: trace of CongestionWindow (KB) vs. time (seconds) with fast retransmit and fast recovery]
TCP's strategy: control congestion once it happens; repeatedly increase load in an effort to find the point at which congestion occurs, and then back off.
Alternative strategy: predict when congestion is about to happen and reduce the rate before packets start being discarded; call this congestion avoidance, instead of congestion control.
router-centric: DECbit and RED Gateways host-centric: TCP Vegas
DECbit router: monitors average queue length over the last busy+idle cycle; sets the congestion bit if average queue length > 1; attempts to balance throughput against delay.
[Figure: queue length over time, showing the previous cycle, the current cycle, and the averaging interval up to the current time]
DECbit source: if fewer than 50% of the last window's packets had the congestion bit set, increase CongestionWindow by 1 packet; otherwise, decrease CongestionWindow to 0.875 times its previous value.
just drop the packet (TCP will timeout) could make explicit by marking the packet
rather than wait for queue to become full, drop each arriving packet with some drop probability whenever the queue length exceeds some drop level
AvgLen = (1 - Weight) * AvgLen + Weight * SampleLen
0 < Weight < 1 (usually 0.002) SampleLen is queue length each time a packet arrives
[Figure: RED gateway queue with MinThreshold and MaxThreshold marked against AvgLen]
if AvgLen <= MinThreshold then
  enqueue the packet
if MinThreshold < AvgLen < MaxThreshold then
  calculate probability P
  drop the arriving packet with probability P
if MaxThreshold <= AvgLen then
  drop the arriving packet

TempP = MaxP * (AvgLen - MinThreshold)/(MaxThreshold - MinThreshold)
P = TempP/(1 - count * TempP)
[Figure: drop probability P(drop) rising linearly from 0 at MinThresh to MaxP at MaxThresh as a function of AvgLen, then jumping to 1.0]
The probability that a flow's packet is dropped is roughly proportional to the share of the bandwidth that flow is currently getting.
For example, when the average queue size is halfway between the two thresholds, the gateway drops roughly one out of 50 packets.
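The computation can be sketched as below; the thresholds and MaxP = 0.04 are assumed values, with MaxP chosen so that a queue halfway between the thresholds yields the one-in-50 drop rate in the example above:

```python
MIN_TH, MAX_TH = 5.0, 15.0   # assumed thresholds, in packets
MAX_P = 0.04                 # assumed MaxP (halfway -> 0.02, i.e. 1 in 50)

def update_avg(avg_len, sample_len, weight=0.002):
    """EWMA of queue length, sampled at each packet arrival."""
    return (1 - weight) * avg_len + weight * sample_len

def drop_probability(avg_len, count):
    """count: packets enqueued since the last drop."""
    if avg_len <= MIN_TH:
        return 0.0
    if avg_len >= MAX_TH:
        return 1.0
    temp_p = MAX_P * (avg_len - MIN_TH) / (MAX_TH - MIN_TH)
    return temp_p / (1 - count * temp_p)
```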
MinThreshold should be sufficiently large to allow link utilization to be maintained at an acceptably high level. The difference between the two thresholds should be larger than the typical increase in the calculated average queue length in one RTT; setting MaxThreshold to twice MinThreshold is reasonable for traffic on today's Internet.
When CongWin is below Threshold, the sender is in the slow-start phase and the window grows exponentially.
When CongWin is above Threshold, the sender is in the congestion-avoidance phase and the window grows linearly.
When a triple duplicate ACK occurs, Threshold is set to CongWin/2 and CongWin is set to Threshold.
When a timeout occurs, Threshold is set to CongWin/2 and CongWin is set to 1 MSS.
Event: ACK receipt for previously unacked data; State: Slow Start (SS); Action: CongWin = CongWin + MSS, and if (CongWin > Threshold) set state to "Congestion Avoidance". Commentary: results in a doubling of CongWin every RTT.
Event: ACK receipt for previously unacked data; State: Congestion Avoidance (CA); Action: CongWin = CongWin + MSS * (MSS/CongWin). Commentary: additive increase, CongWin grows by 1 MSS every RTT.
Event: loss detected by triple duplicate ACK; State: SS or CA; Action: Threshold = CongWin/2, CongWin = Threshold, set state to "Congestion Avoidance". Commentary: fast recovery, implementing multiplicative decrease; CongWin will not drop below 1 MSS.
Event: timeout; State: SS or CA; Action: Threshold = CongWin/2, CongWin = 1 MSS, set state to "Slow Start". Commentary: enter slow start.
Event: duplicate ACK; State: SS or CA; Action: increment duplicate ACK count for segment being acked. Commentary: CongWin and Threshold not changed.
Average TCP throughput (ignoring slow start): if W is the window size when loss occurs, throughput is W/RTT just before the loss and W/2RTT just after, averaging about 0.75 W/RTT.
Throughput as a function of loss probability p takes the form:
  B(p) = (MSS/RTT) * sqrt(3/(2p))
Equivalently, to sustain a given throughput the loss rate L must satisfy:
  Throughput = 1.22 * MSS / (RTT * sqrt(L))
so Gbps throughput requires a very small loss rate.
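Inverting the loss-rate formula shows how small L must get; a sketch with illustrative 1500-byte segments and 100 ms RTT:

```python
def required_loss_rate(target_bps, mss_bytes, rtt_s):
    """Invert Throughput = 1.22*MSS/(RTT*sqrt(L)) for the loss rate L."""
    return (1.22 * mss_bytes * 8 / (rtt_s * target_bps)) ** 2

# 1500-byte segments, 100 ms RTT, 10 Gbps target: L must be ~2e-10
loss = required_loss_rate(10e9, 1500, 0.1)
```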
Incr: w ← w + a , a =1 Decr: w ← bw , b = 1/2
f1(k+1) = f1(k) + a     if f1(k)+f2(k) < B
f1(k+1) = b*f1(k)       if f1(k)+f2(k) >= B
f2(k+1) = f2(k) + a     if f1(k)+f2(k) < B
f2(k+1) = b*f2(k)       if f1(k)+f2(k) >= B
hence
f2(k+1)-f1(k+1) = f2(k)-f1(k)        if f1(k)+f2(k) < B
f2(k+1)-f1(k+1) = b*(f2(k)-f1(k))    if f1(k)+f2(k) >= B
- TCP Tahoe: W = 1 adaptation on congestion
- TCP Reno: W = W/2 adaptation on fast retransmit, W = 1 on timeout
- TCP NewReno: TCP-Reno + fast recovery
- TCP Vegas: uses round-trip time as an early-congestion-feedback mechanism; reduces losses
- Selective Acknowledgements (SACK)
In Tahoe, slow start is performed again after every loss, and this empties the pipeline. Reno instead detects most losses through DUP-ACKs and performs fast recovery: reduce ssthresh to half of the current window and set cwnd to this value; for each DUP-ACK received, increase cwnd by one; if cwnd is larger than the number of packets in transit, send new data, else wait. In this way the pipe is not emptied; slow start is performed only on a timeout.
In Reno, the first non-duplicate ACK takes the sender out of the fast recovery phase, which performs poorly when multiple packets are lost in one RTT. NewReno treats a partial ACK during fast recovery as an indication of another lost packet (which is immediately retransmitted), staying in fast recovery until all outstanding data is ACKed.
Reno can retransmit only one lost packet per round-trip time. Selective acknowledgements (SACK) inform the sender about which packets were received, allowing the sender to recover from multiple-packet losses faster: e.g., on learning that blocks 17 and 19-25 arrived, the sender detects the missing segments and retransmits them immediately.
TCP Vegas idea: the source watches for a sign that some router's queue is building up and congestion will happen soon; e.g., the RTT is growing and the sending rate flattens.
Let BaseRTT be the minimum of all measured RTTs (commonly the RTT of the first packet), and measure ActualRate once per RTT. Then:
  ExpectedRate = CongestionWindow / BaseRTT
  Diff = ExpectedRate - ActualRate
  if Diff < α        -> increase CongestionWindow linearly
  else if Diff > β   -> decrease CongestionWindow linearly
  else               -> leave CongestionWindow unchanged
with α = 1 packet and β = 3 packets.
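One per-RTT Vegas step might look like this; expressing Diff in packets by multiplying the rate difference by BaseRTT is my reading of the slide's units:

```python
ALPHA, BETA = 1, 3   # thresholds in packets

def vegas_adjust(cwnd, base_rtt, actual_rate):
    """One per-RTT adjustment; rates in packets/second, cwnd in packets."""
    expected_rate = cwnd / base_rtt
    diff_pkts = (expected_rate - actual_rate) * base_rtt  # extra packets queued
    if diff_pkts < ALPHA:
        return cwnd + 1          # increase linearly
    if diff_pkts > BETA:
        return cwnd - 1          # decrease linearly
    return cwnd
```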
keep fine-grained timestamps for each packet check for timeout on first duplicate ACK
[Figure: TCP Vegas trace - congestion window (KB) and measured throughput over time (seconds)]
Driving on Ice
[Figure: congestion window (KB), average send rate at source (KBps), and average queue length in the router, each plotted against time (seconds)]
Vegas also compares the expected rate to the current (actual) throughput during slow start, doubling the window every other RTT rather than every RTT, and using a boundary in the difference between throughputs to exit the Slow Start phase, rather than a window size value.
[Figure: link utilization as a function of link capacity, 155 Mbps to 10 Gbps]
NS-2 simulation (100 sec): link capacity = 155 Mbps, 622 Mbps, 2.5 Gbps, 5 Gbps, 10 Gbps; drop-tail routers, 0.1 BDP buffer; 5 TCP connections, 100 ms RTT, 1000-byte packet size.
Utilization of a link with 5 TCP connections: TCP cannot fully utilize the huge capacity of high-speed networks!
Standard TCP adjusts its congestion window cwnd once per RTT (Round-Trip Time):
  on each RTT without loss:  cwnd = cwnd + 1
  on packet loss:            cwnd = cwnd * (1-1/2)
[Figure: cwnd vs. time (RTT) - slow start followed by the congestion-avoidance sawtooth between packet losses]
Example: a TCP connection with 1250-byte packet size and 100 ms RTT running over a 10 Gbps link (assuming no other connections, and no buffers at routers) needs cwnd = 100,000 packets for 10 Gbps; after a loss halves it to 50,000 (5 Gbps), additive increase takes 50,000 RTTs, i.e. 1.4 hours, to fill the link again.
The problem for standard TCP at high speed: a big decrease and a slow increase. One fix is a larger increase and a smaller decrease, e.g. a multiplicative increase and a decrease of 1/8:
  cwnd = cwnd + 1          becomes  cwnd = cwnd + 0.01*cwnd
  cwnd = cwnd * (1-1/2)    becomes  cwnd = cwnd * (1-1/8)
[Figure: cwnd vs. time (RTT) with the modified increase/decrease]
Alternatively, make the increment and decrement functions of the current cwnd:
  cwnd = cwnd + 1          becomes  cwnd = cwnd + inc(cwnd)
  cwnd = cwnd * (1-1/2)    becomes  cwnd = cwnd * (1-dec(cwnd))
[Figure: cwnd vs. time (RTT) with window-dependent increment and decrement]
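The payoff of such modified rules can be seen by counting recovery RTTs after a single loss at a large window; the update rules below are from the slides, while the counting harness is my own sketch:

```python
def standard(cwnd, loss):
    """Standard TCP per-RTT update."""
    return cwnd * (1 - 1/2) if loss else cwnd + 1

def scalable_style(cwnd, loss):
    """Modified update: +1% per RTT, -1/8 on loss."""
    return cwnd * (1 - 1/8) if loss else cwnd + 0.01 * cwnd

def recovery_rtts(rule, target):
    """RTTs needed to climb back to `target` after one loss at `target`."""
    cwnd, rtts = rule(target, True), 0
    while cwnd < target:
        cwnd = rule(cwnd, False)
        rtts += 1
    return rtts

# at cwnd = 100,000 packets (the 10 Gbps example): standard TCP needs
# 50,000 RTTs (~1.4 hours at 100 ms RTT); the modified rule needs 14
```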
Some Measurements of Throughput, CERN - SARA (1 GByte file transfers; blue: data, red: TCP ACKs):
- Standard TCP (txqueuelen 100, 25 Jan 03): average throughput 167 Mbit/s; users see 5 - 50 Mbit/s!
- HighSpeed TCP (txqueuelen 2000, 26 Jan 03) and Scalable TCP (txqueuelen 2000, 27 Jan 03): average throughputs of 345 and 340 Mbit/s
(Sylvain Ravot, Caltech)
[Table: average link utilization comparing Linux TCP (txqueuelen 100), Linux TCP (txqueuelen 10000), and FAST TCP; measured utilizations include 16%, 48%, and 95% in one configuration and 19%, 27%, and 92% in another]
XCP puts a Congestion Header in each packet, carrying the sender's Round Trip Time, Congestion Window, and a Feedback field (initially the sender's desired change, e.g. + 0.1 packet). Routers along the path update Feedback, the receiver echoes it back, and the sender applies:
  Congestion Window = Congestion Window + Feedback
XCP extends ECN and CSFQ
Fairness Controller: goal: divide ∆ between flows to converge to fairness (AIMD); looks at a flow's state in the Congestion Header. Algorithm: if ∆ > 0, divide ∆ equally between flows; if ∆ < 0, divide ∆ between flows proportionally to their current rates.
Efficiency Controller: goal: match input traffic to link capacity and drain the queue (MIMD); looks at aggregate traffic and the queue. Algorithm: aggregate traffic changes by ∆, with ∆ ~ spare bandwidth and ∆ ~ - queue size, so:
  ∆ = α · davg · Spare - β · Queue
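A sketch of the efficiency controller's per-interval computation, using the commonly cited XCP gains α = 0.4 and β = 0.226 (so that β = α²·√2); the unit choices (bits for traffic and queue) are assumptions:

```python
ALPHA, BETA = 0.4, 0.226   # XCP gains with beta = alpha^2 * sqrt(2)

def aggregate_feedback(capacity_bps, input_bps, queue_bits, avg_rtt_s):
    """Desired change in aggregate traffic per control interval (bits):
    proportional to spare bandwidth, minus a term draining the queue."""
    spare = capacity_bps - input_bps
    return ALPHA * avg_rtt_s * spare - BETA * queue_bits
```

An underused link yields positive feedback to distribute among flows; a fully used link with a standing queue yields negative feedback.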
Theorem: the system converges to optimal utilization (i.e., it is stable) for any link bandwidth, delay, and number of sources if:
  0 < α < π/(4·√2)  and  β = α²·√2
(Proof based on the Nyquist Criterion.)
Need to estimate number of flows N
The router estimates N over a counting interval T from per-packet header fields:
  N = (1/T) · Σ (over pkts in T) RTTpkt / Cwndpkt
where RTTpkt is the Round Trip Time in the packet's header, Cwndpkt the Congestion Window in the header, and T the counting interval. (A flow with window Cwnd sends about Cwnd·T/RTT packets during T, so each flow contributes T to the sum.)
[Figures: utilization as a function of bottleneck bandwidth (Mb/s), and utilization as a function of round-trip delay (sec)]