OpenBSD version: Actually none
Arch: Any
NSFP: Please no...
So, for the last couple of months, the mastodon instance of Digitalcourage e.V. has been suffering from network issues. These manifested themselves in users having a very low (around 100KB/s) download speed from the instance. This, of course, was rather ‘not so good’.
Over these months, and with lots of user feedback, some clear patterns started to emerge. Specifically, it seemed like mostly customers of Deutsche Telekom were affected, while others were able to retrieve files at the speeds they were expecting. The admins over at Digitalcourage had even tried to change the IP address of the system: to no avail. Interestingly, though, another system right next to the mastodon instance at digitalcourage.social performed pretty much fine, even for otherwise affected Deutsche Telekom users. Ultimately, this led to a rather extensive discussion on whether Deutsche Telekom might be treating traffic to Digitalcourage’s servers a bit less equal than it should.
Last Wednesday (9 Nov 2022), i saw the ongoing thread and was kind of intrigued by the somewhat strange problem. I chatted up a contact over at Deutsche Telekom who was already involved in debugging the issue, and who was friendly enough to introduce me to the Digitalcourage people, suggesting i might be able to help. They agreed and got into contact with me, and we could take a shot at debugging this.
As people might find it fun to read up on what really happened there (which i only got to the bottom of tonight), i thought i’d drop this into a blog article.
Starting to debug
The first thing one needs when debugging such an issue is, naturally, a way to really look at what is happening, i.e., you have to be able to reproduce the issue. For the case at hand, this was surprisingly easy. When i tried to download the test-file from my home connection, things did not look good:
% wget -O /dev/null https://digitalcourage.social/1GB.bin
--2022-11-09 18:53:17--  https://digitalcourage.social/1GB.bin
Resolving digitalcourage.social (digitalcourage.social)... 188.8.131.52
Connecting to digitalcourage.social (digitalcourage.social)|184.108.40.206|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1048576000 (1000M) [application/octet-stream]
Saving to: ‘/dev/null’

/dev/null     0%[       ]  615.70K  98.8KB/s    eta 2h 53m
This is good for multiple reasons.
- a) It gives me a system to look at while debugging.
- b) My home connection is through a tunnel ultimately leaving through AS59645, which (partially) draws upstream from IN-BERLIN, the hoster Digitalcourage uses as well.
- c) It essentially rules out Deutsche Telekom as a culprit, which means that whatever is broken can most likely be fixed over at the systems of Digitalcourage.
Hence, knowing DT is most likely out of the picture, and equipped with full control over the (at least overlay) path from my machine up to the point where packets are handed to IN-BERLIN, i went to the next step.
So, the next step was seeing ‘when’ things broke.
- Getting the file from my workstation was slow.
- Getting the file from the first router was slow.
- Getting the file from my router over at IN-BERLIN was fast. As the link between my router at IN-BERLIN and the one in front of my workstation goes via a less-than-1500 MTU link, this led to a really strong suspicion that this is an MTU-related issue.
The MTU, or Maximum Transmission Unit, is a byte value which tells networked systems how large network packets are allowed to be. It is strongly tied to the MSS (Maximum Segment Size) for TCP connections. Commonly, the MSS is MTU - 40b (for IPv4) and MTU - 60b (for IPv6), accounting for the IP and TCP headers, i.e., if the MTU is 1500 (as it is for most ethernet links, and essentially most links on the Internet), the MSS should be 1460 for IPv4 and 1440 for IPv6.
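The arithmetic behind those numbers is simply header overhead: a 20b IPv4 header (or a 40b IPv6 header) plus a 20b TCP header. A quick sanity check in shell:

```shell
# MSS = MTU - IP header - TCP header
echo "IPv4 MSS: $((1500 - 20 - 20))"   # 1460
echo "IPv6 MSS: $((1500 - 40 - 20))"   # 1440
```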
Sadly, not all links on the Internet have an MTU of 1500. My (wireguard based) tunnel connection, for example, has an MTU of 1420. Hence, the MSS for IPv4 is 1380. Similarly, DSL customers of Deutsche Telekom have an MTU of 1492, because 8b are needed for PPPoE. Cable network providers usually do not use PPPoE, and hence their customers commonly have an MTU of 1500.
Of course, with links below 1500b being common, there are mechanisms for hosts to determine the maximum MTU on a path to a server/client, so they can make sure to not send packets larger than that. This is commonly called Path MTU Discovery (PMTUD), and i had my fair share of headaches around that already. The idea behind PMTUD is, essentially, that each host on the path, as soon as it cannot forward a packet because it is too large, sends back an ICMP error message of ‘Type 3 Code 4’ (Fragmentation Needed and Don’t Fragment was Set). When the sending host gets that packet, it knows that it has to send smaller packets if it wants them to arrive.
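As an aside, you can trigger PMTUD by hand: on Linux (iputils), `ping -M do -s <payload> <host>` sends echo requests with the don’t-fragment bit set, so a too-small link on the path answers with exactly this Type 3 Code 4 error. The payload size that fills a 1500b MTU exactly is again just header arithmetic (20b IPv4 header plus 8b ICMP header):

```shell
# ICMP echo payload that makes the packet exactly 1500b on the wire:
# MTU - IPv4 header (20b) - ICMP header (8b)
echo $((1500 - 20 - 8))   # 1472
```

So `ping -M do -s 1472 <host>` should pass on a clean 1500 path, while anything larger (or a smaller path MTU) should yield the ‘need to frag’ error.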
Funnily enough, the issue being related to the MTU would also explain why Deutsche Telekom customers are affected, while others are not: The 1492b MTU due to PPPoE.
Making sure it’s the MTU
Verifying that this issue is MTU related was comparatively easy.
I fired up tcpdump on the outbound interface of my router at IN-BERLIN (where packets come in on a 1500 MTU interface and go out via a 1420 MTU interface), and set a filter for ICMP packets destined to digitalcourage.social; then, i tried to wget a file from my workstation. And sure enough, i saw:
% tcpdump -i vio0 -n host 220.127.116.11 and icmp
tcpdump: listening on vio0, link-type EN10MB
20:16:38.979978 IP 18.104.22.168 > 22.214.171.124: ICMP 126.96.36.199 unreachable - need to frag (mtu 1420), length 36
20:16:39.984627 IP 188.8.131.52 > 184.108.40.206: ICMP 220.127.116.11 unreachable - need to frag (mtu 1420), length 36
20:16:39.985268 IP 18.104.22.168 > 22.214.171.124: ICMP 126.96.36.199 unreachable - need to frag (mtu 1420), length 36
20:16:40.984956 IP 188.8.131.52 > 184.108.40.206: ICMP 220.127.116.11 unreachable - need to frag (mtu 1420), length 36
...
You might notice that this is odd. There should not be that many of these; the sender of the large packets (18.104.22.168) should rather quickly get the message that the path only supports an MTU of 1420… But the box seems to be somewhat oblivious…
Where PMTUD is lost
Together with Christian from Digitalcourage (who helped me debug this and was my /bin/instantmessangersh for commands on the Digitalcourage infrastructure, as i of course did not have a login on those machines) i now started to dig into where those packets might get lost.
Digging was complete relatively quickly, as both Digitalcourage’s router and digitalcourage.social itself only allowed ICMP type 8 (echo request, what you usually know as ping), but not type 3 (code 4).
Interestingly, the mastodon documentation suggested exactly this; based on our findings, it has since been updated, though.
Two iptables -A INPUT -p icmp -m icmp --icmp-type 3 -j ACCEPT later (integrated in the systems’ firewall frameworks), the ICMP packets finally arrived where they should.
Eagerly, i fired up a wget, looking forward to seeing the file rushing in with unseen speed and saw… well, nothing having changed, really.
Thinking about oddities
Now, with things still not working, there must have been a bit more going wrong. First of all, my systems technically do MSS clamping: the MSS is communicated to the remote host when a connection is established, and my routers should rewrite it to 1320 for packets traversing the tunnel. So, PMTUD aside, the remote server should never have sent too-large packets to begin with. With PMTUD now technically being able to work, there should really be nothing keeping these hosts from communicating as they should.
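For reference, MSS clamping on a Linux router is usually a single netfilter rule rewriting the MSS option in forwarded TCP SYNs; a sketch of how such a clamp commonly looks (i do not claim this is the exact rule on my routers):

```shell
# Rewrite the MSS option in forwarded TCP SYNs; either clamp to the
# outgoing route's PMTU...
iptables -t mangle -A FORWARD -p tcp --tcp-flags SYN,RST SYN \
         -j TCPMSS --clamp-mss-to-pmtu
# ...or set a fixed value:
# iptables -t mangle -A FORWARD -p tcp --tcp-flags SYN,RST SYN \
#          -j TCPMSS --set-mss 1320
```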
Hence, we started digging into what might be different between the working and non-working host.
Specifically, we looked at sysctl -a (nothing special), iptables -L -v -n (again, nothing), and route get $myip / route show $myip (nothing out of the ordinary).
We also held a tcpdump to the wire and verified that my client communicated the correct MSS (1320, and yes, it indicated that when starting the TCP session).
There seemed to be nothing special about digitalcourage.social, yet it was still blissfully ignorant of the existence of any MTUs < 1500.
This meant that packets would be resent, each time with a segment size 40b lower, until they would finally arrive. This, of course, causes a bit of overhead, making things… slow.
Making things work (for now)
Somewhat frustrated with how things were going (and thanks to some input from anwlx), we decided to install a route for my IP with a locked MTU on the mastodon host:
ip route add 22.214.171.124 via 126.96.36.199 mtu lock 1320
Lo and behold… things suddenly worked. I got speeds far beyond 100Mbit/s from the host.
We continued to try around for a bit, but were unable to figure out why the host was ignoring MTU and MSS, except when explicitly locked. At this point (slightly after 12AM), i suggested to apply a hotfix which would work for the majority of users, and call it a night:
ip r a 0.0.0.0/1 via 188.8.131.52 mtu lock 1320
ip r a 128.0.0.0/1 via 220.127.116.11 mtu lock 1320
This locked the MTU for two more-specific (i.e., preferred) routes covering ‘everything’ to 1320; essentially the same thing as for my single IP, but for, well, everything. And, in case of doubt, this is a more resilient approach than trying to update the default route on a production box. Christian was actually quite happy about that suggestion, and moved it in place. For now, this fixed the issue, and replies to the announcement that things should work now suggested that it actually worked.
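Why two /1 routes? Together, 0.0.0.0/1 and 128.0.0.0/1 cover the whole IPv4 space, and since a /1 is more specific than the /0 default route, longest-prefix matching makes them win without anyone touching the default route itself. Which of the two a destination falls into is decided by the top bit of its first octet; a small sketch (the address is just an example):

```shell
# Longest-prefix match: a /1 shadows the /0 default route.
# The top bit of the first octet decides which /1 matches.
first_octet=203   # e.g. a destination like 203.0.113.7 (example address)
if [ $((first_octet & 128)) -eq 0 ]; then
    echo "matches 0.0.0.0/1"
else
    echo "matches 128.0.0.0/1"
fi
# prints: matches 128.0.0.0/1
```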
The hotfix was not really satisfactory for me (and i promised you some input on the root cause). Hence, i tried to replicate the issue:
- Set up a system with the same software (Debian 10) under my control (debugging in prod is always a bit unfunny)
- Install Mastodon without docker (maybe this breaks something?!)
- Set all sysctl values and loaded kernel modules the same as on the live host
I set that up on Saturday, and was actually quite hopeful that things would not work. Well, except they did. I could not replicate the issue.
Which brings us to today. I started to wonder whether maybe there was something about the virtualization environment going on.
I checked in with Christian again, and asked them what they were using (plain libvirt+kvm on Debian).
Then, on a whim, i asked for an lspci from the box. Maybe there is something there?
Looking at the lspci output, one line immediately caught my eye:
00:03.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL-8100/8101L/8139 PCI Fast Ethernet Adapter (rev 20)
Usually, when running kvm VMs, you want to have a virtio network card.
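With libvirt, the NIC model is set in the domain XML; something along these lines (the bridge name is a placeholder) selects virtio instead of the emulated rtl8139:

```xml
<interface type='bridge'>
  <source bridge='br0'/>    <!-- placeholder bridge name -->
  <model type='virtio'/>    <!-- instead of type='rtl8139' -->
</interface>
```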
So, a quick change over on my test-VM, making the NIC ‘rtl8139’ and not ‘virtio’ aaaaaand… things broke. Finally.
A wget behind a low MTU link finally gave me only around 100KB/s.
Similarly, checking the second box from which things had worked all along, Christian found that it uses a virtio NIC:
00:03.0 Ethernet controller: Red Hat, Inc Virtio network device
Equipped with something to search for (to be fair, ‘rtl8139, broken’ is not the most unique thing to look for), i ultimately found this thread on the qemu-dev/netdev mailing lists. It appears that others actually had the same problem before. The thread rather quickly notes that the MTU for TCP segmentation offloading seems to be fixed at 1500 in qemu’s rtl8139 code. The thread even suggests a patch which should fix the issue (but runs dry afterwards). Ultimately, the code with this bug must have been written in 2006, so roughly 16 years ago (and the mailing list discussion itself is already ten years old).
Looking at the ‘Modifications:’ header in rtl8139.c, i have a strong suspicion that this ultimately boils down to this change:
* 2006-Jul-04 : Implemented TCP segmentation offloading
*               Fixed MTU=1500 for produced ethernet frames
This also fits well with the fact that disabling TSO/GSO for the rtl8139 fixes the issue as well:
ethtool -K ens18 tx off sg off tso off
But, as i am not really a C-ish person, or an overly qualified coder… i figured it might be better to just file a bug ticket and call it a day.
At least I have closure now.
And the people at Digitalcourage a reason to change the network interface type of their virtualized NIC.
So, as a rather final final thought: there is one big lesson learned.
If you compare two virtual machines on the same hypervisor, do not assume that the hardware is the same.
Or that virtual machines cannot have hardware bugs.