Causes of packet loss on the internet – Understanding Network and Security for Near-Edge Computing
The primary cause of packet loss on the internet is congestion or throttling at a peering point between two Autonomous Systems (ASs). ASs are typically operated by a single large organization such as an ISP, a large technology company, a cloud service provider, a university, or a government agency. Every time traffic crosses the boundary between two ASs, the odds of the overall flow experiencing packet loss somewhere along the way increase:

Figure 2.3 – Example of AS traversal on the internet
Figure 2.3 elaborates upon the path taken by our hypothetical client in Dallas. From this client to the server, there are a total of 12 routing hops. Not all of the hops are equivalent, though. The first four hops are within Sprint’s network, while the fifth hop is on AS 2914, which is operated by NTT. That junction between providers is what is known as a peering point. Routing hops that traverse peering points are much more likely to introduce packet loss due to congestion than intra-AS routing hops.
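To make the effect of AS boundaries concrete, here is a minimal sketch that finds the peering points in a 12-hop path and estimates the end-to-end loss probability. Only the first five hops follow the description above. The AS number used for Sprint, the placement of every hop after the fifth, the per-hop loss rates, and the assumption that hops lose packets independently are all illustrative.

```python
from math import prod

# Hypothetical path as (hop number, AS number) pairs.
# Hops 1-4 on Sprint and hop 5 on NTT (AS 2914) follow the text; AS 1239 for
# Sprint, hops 6-12, and AS 64500 (a documentation-only ASN) are assumptions.
PATH = [
    (1, 1239), (2, 1239), (3, 1239), (4, 1239),
    (5, 2914), (6, 2914), (7, 2914), (8, 2914), (9, 2914),
    (10, 64500), (11, 64500), (12, 64500),
]


def peering_crossings(path):
    """Return the hop numbers whose AS differs from the previous hop's AS."""
    return [
        hop
        for (_, prev_asn), (hop, asn) in zip(path, path[1:])
        if asn != prev_asn
    ]


def end_to_end_loss(path, intra_as_loss, peering_loss):
    """Estimate the chance a packet is lost somewhere along the path.

    Assumes each hop drops packets independently, which is a simplification.
    """
    crossings = set(peering_crossings(path))
    per_hop = [
        peering_loss if hop in crossings else intra_as_loss
        for hop, _ in path
    ]
    # A packet survives only if it survives every hop.
    return 1 - prod(1 - p for p in per_hop)


if __name__ == "__main__":
    print("Peering points at hops:", peering_crossings(PATH))
    print("Estimated loss:", end_to_end_loss(PATH, 0.0001, 0.01))
```

With these made-up rates, the two peering hops dominate the total, which is the point the figure is making: adding one more AS boundary matters far more than adding one more intra-AS hop.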
TCP receive window (RWIN)
Even on a perfectly clean network, there’s another thing that can artificially limit throughput when the RTT gets high. In TCP, part of the ongoing conversation between the sender and the receiver is something called the TCP receive window (RWIN). It is a value that can range from zero to 1,073,725,440 bytes, or about 1 GB (see note 3).
3 Technically, the largest value the RWIN field itself can carry is 65,535 bytes, but it is multiplied by the window scaling factor, which is a power of two up to 16,384 (2^14). For example, an RWIN of 65,535 bytes with a window scaling factor of 2,048 would result in an effective window size of 134,215,680 bytes, or ~134 MB.
It represents the total amount of unacknowledged data a sender may have in flight before it must stop and wait for the receiver to send an ACK message for one or more of the already sent TCP segments. Because the receiver advertises this value based on the free space in its buffer, it is also referred to as the receive buffer.
On connections with a high RTT, this can lead to situations where the sender has to stop and wait so often that the effective throughput is noticeably impacted.
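The arithmetic behind this is simple enough to sketch. At most one window of data can be in flight per round trip, so throughput cannot exceed the window size divided by the RTT. The function names below are our own, and the sketch ignores packet loss, slow start, and protocol overhead, so treat the result as an upper bound rather than a prediction.

```python
MAX_WINDOW_FIELD = 65_535  # largest value the 16-bit window field can hold
MAX_SHIFT = 14             # largest window scale shift count (2**14 = 16,384)


def effective_window(window_field, shift):
    """Return the effective receive window in bytes.

    The scaling factor is always a power of two, expressed as a shift count.
    """
    if not 0 <= window_field <= MAX_WINDOW_FIELD:
        raise ValueError("window field must fit in 16 bits")
    if not 0 <= shift <= MAX_SHIFT:
        raise ValueError("shift count must be between 0 and 14")
    return window_field << shift


def throughput_ceiling_bps(window_bytes, rtt_seconds):
    """Upper bound on throughput in bits per second: one window per RTT."""
    if rtt_seconds <= 0:
        raise ValueError("RTT must be positive")
    return window_bytes * 8 / rtt_seconds


if __name__ == "__main__":
    unscaled = effective_window(65_535, 0)
    for rtt_ms in (10, 50, 100, 200):
        mbps = throughput_ceiling_bps(unscaled, rtt_ms / 1000) / 1_000_000
        print(f"RTT {rtt_ms:>3} ms -> at most {mbps:.1f} Mbit/s")
```

For instance, an unscaled 65,535-byte window over a 100 ms RTT tops out at roughly 5.2 Mbit/s, no matter how much bandwidth the path has. Halving the RTT or doubling the window doubles that ceiling.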
While packet loss due to congestion or throttling is possible across the transit links within a provider’s network, for a host of reasons, it is most often observed at the peering points:

Figure 2.4 – TCP receive window and latency
Why this leads to problems with long RTTs isn’t always intuitive. Therefore, we’ve drawn up an analogy in Figure 2.4. In this scenario, the sender can see that there is room on the road for more trucks, but they aren’t allowed to send more until they receive a phone call from the receiving warehouse saying it is okay to do so. There are multiple reasons this could be the case.
Here are some possibilities for any given truck:
- It arrived, but it hasn’t been unloaded yet as the dock is overwhelmed
- It is still on the way; the trip is just long and we need to be patient
- It crashed somewhere in the middle or was hijacked and is lost forever
Ideally, the receiving warehouse will call at some point and say they’ve successfully unloaded the cargo (TCP ACK) and what they unloaded matches the manifest (TCP checksum match). In this case, the sender will cross that one off the list and send the next truck waiting to go.
If enough time goes by without that phone call, the sender will decide a truck has been lost (TCP timeout) and send a replacement (TCP retransmit). If this happens enough times, the sender might decide that the problem is a traffic jam in the middle that they might be contributing to. Therefore, they need to start waiting longer and longer (TCP retransmission timeout) until the lost truck issue stops happening for a while (TCP congestion control using exponential backoff):

Figure 2.5 – TCP MSS, RWIN, and ACKs
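The "waiting longer and longer" behavior described before Figure 2.5 can be sketched as a doubling timer. This is a toy model under stated assumptions: the 1-second starting timeout and 60-second cap are illustrative values, the function name is our own, and real TCP stacks derive the timeout from measured RTT rather than from a fixed starting point.

```python
def backoff_schedule(initial_rto, retries, max_rto=60.0):
    """Return the timeout (in seconds) used before each retransmission.

    Each unanswered attempt doubles the wait, up to max_rto.
    """
    schedule = []
    rto = initial_rto
    for _ in range(retries):
        schedule.append(rto)
        rto = min(rto * 2, max_rto)  # exponential backoff with a ceiling
    return schedule


if __name__ == "__main__":
    for attempt, rto in enumerate(backoff_schedule(1.0, 8), start=1):
        print(f"retransmission {attempt}: wait {rto:.0f} s for an ACK")
```

In terms of the analogy, the sender waits one second for the phone call before sending a replacement truck, then two seconds, then four, and so on. On a high-RTT path, every one of those waits is time during which the road sits empty.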
Because the congestion and latency situation is different for every connection on the internet, modern operating systems usually do not have a set RWIN size in their TCP stack. Rather, they ramp it up or down dynamically throughout the connection.
For example, if the receiver’s buffer keeps filling up due to memory issues, it may set a lower RWIN to slow the sender down. Alternatively, the receiver may start with a low RWIN and keep increasing it until retransmissions occur, which likely indicates congestion in the middle. At this point, it will back off a little to find the sweet spot. How much of a problem a retransmission is depends on the size of the window: having to resend a megabyte of data is a different story than having to resend a gigabyte. That’s why finding and maintaining an optimal window size takes many things into account.
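If you are curious what your own operating system starts with, the following sketch reads the receive buffer size of a freshly created TCP socket using Python's standard socket module. The helper name is our own. Note that the socket buffer size is related to, but not the same as, the window advertised on the wire, and the number reported here is only the starting point before any dynamic tuning takes place.

```python
import socket


def receive_buffer_bytes(sock):
    """Return the socket's current receive buffer size in bytes."""
    return sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)


# Create an unconnected TCP socket; no network traffic is generated.
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
    default_size = receive_buffer_bytes(sock)

print(f"Initial receive buffer: {default_size} bytes")
```

The sketch only reads the value and leaves any adjustment to the operating system. The result varies by platform and system configuration.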