OpenBSD version: Actually none
Arch: Any
NSFP: Please no...
So, for the last couple of months, the mastodon instance of Digitalcourage e.V. has been suffering from network issues. These manifested themselves in users having a very low (around 100KB/s) download speed from the instance. This, of course, was rather ‘not so good’.
Over these months, and with lots of user feedback, some clear patterns started to emerge. Specifically, it seemed like mostly customers of Deutsche Telekom were affected, while others were able to retrieve files at the speeds they were expecting. The admins over at Digitalcourage had even tried to change the IP address of the system: to no avail. Interestingly, though, another system right next to the mastodon instance at digitalcourage.social performed pretty much fine, even for otherwise affected Deutsche Telekom users. Ultimately, this led to a rather extensive discussion on whether Deutsche Telekom might be treating traffic to Digitalcourage’s servers a bit less equal than it should.
Last Wednesday (9 Nov 2022), i saw the ongoing thread and was kind of intrigued by the somewhat strange problem. I chatted up a contact over at Deutsche Telekom who was already involved in debugging the issue, and who was friendly enough to introduce me to the Digitalcourage people, suggesting i might be able to help. They agreed and got into contact with me, and we could take a shot at debugging this.
As people might find it fun to read up on what really happened there (which i only got to the bottom of tonight), i thought i’d drop this into a blog article.
Starting to debug
The first thing one needs when debugging such an issue is, naturally, a way to really look at what is happening, i.e., you have to be able to reproduce the issue. For the case at hand, this was surprisingly easy. When i tried to download the test-file from my home connection, things did not look good:
% wget -O /dev/null https://digitalcourage.social/1GB.bin
--2022-11-09 18:53:17--  https://digitalcourage.social/1GB.bin
Resolving digitalcourage.social (digitalcourage.social)... 188.8.131.52
Connecting to digitalcourage.social (digitalcourage.social)|184.108.40.206|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1048576000 (1000M) [application/octet-stream]
Saving to: ‘/dev/null’

/dev/null     0%[       ]  615.70K  98.8KB/s    eta 2h 53m
This is good for multiple reasons.
- a) It gives me a system to look at while debugging.
- b) My home connection is through a tunnel ultimately leaving through AS59645, which (partially) draws upstream from IN-BERLIN, the hoster Digitalcourage uses as well.
- c) It essentially rules out Deutsche Telekom as a culprit, which means that whatever is broken can most likely be fixed over at the systems of Digitalcourage.
Hence, knowing DT is most likely out of the picture, and equipped with full control over the (at least overlay) path from my machine up to the point where packets are handed to IN-BERLIN, i went to the next step.
So, the next step was seeing ‘when’ things broke.
- Getting the file from my workstation was slow.
- Getting the file from the first router was slow.
- Getting the file from my router over at IN-BERLIN was fast. As the link between my router at IN-BERLIN and the one in front of my workstation goes via a less-than-1500 MTU link, this led to a really strong suspicion that this is an MTU-related issue.
The MTU, or Maximum Transmission Unit, is a byte value which tells networked systems how large network packets are allowed to be. It is strongly tied to the MSS (Maximum Segment Size) for TCP connections. Commonly, the MSS is MTU - 40b (for IPv4) and MTU - 60b (for IPv6), accounting for the IP and TCP headers, i.e., if the MTU is 1500 (as it is for most ethernet links, and essentially most links on the Internet), the MSS should be 1460 for IPv4 and 1440 for IPv6.
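The arithmetic behind those numbers is simply header overhead: a 20b IPv4 header (or a 40b IPv6 header) plus a 20b TCP header. A quick sanity check in shell:

```shell
# MSS = MTU - IP header - TCP header
echo "IPv4 MSS: $((1500 - 20 - 20))"   # 1460
echo "IPv6 MSS: $((1500 - 40 - 20))"   # 1440
```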
Sadly, not all links on the Internet have an MTU of 1500. My (wireguard based) tunnel connection, for example, has an MTU of 1420. Hence, the MSS for IPv4 is 1380. Similarly, DSL customers of Deutsche Telekom have an MTU of 1492, because 8b are needed for PPPoE. Cable network providers usually do not use PPPoE, and hence their customers commonly have an MTU of 1500.
Of course, with links below 1500b being common, there are mechanisms for hosts to determine the maximum MTU on a path to a server/client, so they can make sure to not send packets larger than that. This is commonly called Path MTU Discovery (PMTUD), and i had my fair share of headaches around that already. The idea behind PMTUD is, essentially, that each host on the path, as soon as it cannot forward a packet because it is too large, sends back an ICMP error message of ‘Type 3 Code 4’ (Fragmentation Needed and Don’t Fragment was Set). When the sending host gets that packet, it knows that it has to send smaller packets if it wants them to arrive.
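As an aside, you can trigger PMTUD by hand: on Linux (iputils), `ping -M do -s <payload> <host>` sends echo requests with the don’t-fragment bit set, so a too-small link on the path answers with exactly this Type 3 Code 4 error. The payload size that fills a 1500b MTU exactly is again just header arithmetic (20b IPv4 header plus 8b ICMP header):

```shell
# ICMP echo payload that makes the packet exactly 1500b on the wire:
# MTU - IPv4 header (20b) - ICMP header (8b)
echo $((1500 - 20 - 8))   # 1472
```

So `ping -M do -s 1472 <host>` should pass on a clean 1500 path, while anything larger (or a smaller path MTU) should yield the ‘need to frag’ error.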
Funnily enough, the issue being related to the MTU would also explain why Deutsche Telekom customers are affected, while others are not: The 1492b MTU due to PPPoE.
Making sure it’s the MTU
Verifying that this issue is MTU related was comparatively easy.
I fired up tcpdump on the outbound interface of my router at IN-BERLIN (where packets come in on a 1500 MTU interface and go out via a 1420 MTU interface), and set a filter for ICMP packets destined to digitalcourage.social; then, i tried to wget a file from my workstation. And sure enough, i saw:
% tcpdump -i vio0 -n host 220.127.116.11 and icmp
tcpdump: listening on vio0, link-type EN10MB
20:16:38.979978 IP 18.104.22.168 > 22.214.171.124: ICMP 126.96.36.199 unreachable - need to frag (mtu 1420), length 36
20:16:39.984627 IP 188.8.131.52 > 184.108.40.206: ICMP 220.127.116.11 unreachable - need to frag (mtu 1420), length 36
20:16:39.985268 IP 18.104.22.168 > 22.214.171.124: ICMP 126.96.36.199 unreachable - need to frag (mtu 1420), length 36
20:16:40.984956 IP 188.8.131.52 > 184.108.40.206: ICMP 220.127.116.11 unreachable - need to frag (mtu 1420), length 36
...
You might notice that this is odd. There should not be that many of these; the sender of the large packets (18.104.22.168) should rather quickly get the message that the path only supports an MTU of 1420… But the box seems to be somewhat oblivious…
Where PMTUD is lost
Together with Christian from Digitalcourage (who helped me debug this and was my /bin/instantmessangersh for commands on the Digitalcourage infrastructure, as i of course did not have a login on those machines) i now started to dig into where those packets might get lost.
Digging was complete relatively quickly, as both Digitalcourage’s router and digitalcourage.social itself only allowed ICMP type 8 (echo request, what you usually know as ping), but not type 3 (code 4).
Interestingly, the mastodon documentation suggested exactly this; based on our findings, it has since been updated, though.
Two iptables -A INPUT -p icmp -m icmp --icmp-type 3 -j ACCEPT later (integrated in the systems’ firewall frameworks), the ICMP packets finally arrived where they should.
Eagerly, i fired up a wget, looking forward to seeing the file rushing in with unseen speed and saw… well, nothing having changed, really.
Thinking about oddities
Now, with things still not working, there must have been a bit more going wrong. First of all, my systems technically do MSS clamping: the MSS is communicated to the remote host when a connection is established, and my routers should rewrite it to 1320 for packets traversing the tunnel. So, PMTUD aside, the remote server should never have sent too-large packets to begin with. With PMTUD now technically being able to work, there should really be nothing keeping these hosts from communicating as they should.
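For reference, MSS clamping on a Linux router is usually a single netfilter rule rewriting the MSS option in forwarded TCP SYNs; a sketch of how such a clamp commonly looks (i do not claim this is the exact rule on my routers):

```shell
# Rewrite the MSS option in forwarded TCP SYNs; either clamp to the
# outgoing route's PMTU...
iptables -t mangle -A FORWARD -p tcp --tcp-flags SYN,RST SYN \
         -j TCPMSS --clamp-mss-to-pmtu
# ...or set a fixed value:
# iptables -t mangle -A FORWARD -p tcp --tcp-flags SYN,RST SYN \
#          -j TCPMSS --set-mss 1320
```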
Hence, we started digging into what might be different between the working and non-working host.
Specifically, we looked at sysctl -a (nothing special), iptables -L -v -n (again, nothing), and route get $myip / route show $myip (nothing out of the ordinary).
We also held a tcpdump to the wire and verified that my client communicated the correct MSS (1320, and yes, it indicated that when starting the TCP session).
There seemed to be nothing special about digitalcourage.social, yet it was still blissfully ignorant of the existence of any MTUs < 1500.
This meant that packets would be resent, each time with a segment size 40b lower, until they would finally arrive. This, of course, causes a bit of overhead, making things… slow.
Making things work (for now)
Somewhat frustrated with how things were going (and thanks to some input from anwlx), we decided to install a route for my IP with a locked MTU on the mastodon host:
ip route add 22.214.171.124 via 126.96.36.199 mtu lock 1320
Lo and behold… things suddenly worked. I got speeds far beyond 100Mbit/s from the host.
We continued to try around for a bit, but were unable to figure out why the host was ignoring MTU and MSS, except when explicitly locked. At this point (slightly after 12AM), i suggested to apply a hotfix which would work for the majority of users, and call it a night:
ip r a 0.0.0.0/1 via 188.8.131.52 mtu lock 1320
ip r a 128.0.0.0/1 via 220.127.116.11 mtu lock 1320
This locked the MTU for two more-specific (i.e., preferred) routes covering ‘everything’ to 1320; essentially the same thing as for my single IP, but for, well, everything. And, in case of doubt, this is a more resilient approach than trying to update the default route on a production box. Christian was actually quite happy about that suggestion, and moved it in place. For now, this fixed the issue, and replies to the announcement that things should work now suggested that it actually worked.
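Why two /1 routes? Together, 0.0.0.0/1 and 128.0.0.0/1 cover the whole IPv4 space, and since a /1 is more specific than the /0 default route, longest-prefix matching makes them win without anyone touching the default route itself. Which of the two a destination falls into is decided by the top bit of its first octet; a small sketch (the address is just an example):

```shell
# Longest-prefix match: a /1 shadows the /0 default route.
# The top bit of the first octet decides which /1 matches.
first_octet=203   # e.g. a destination like 203.0.113.7 (example address)
if [ $((first_octet & 128)) -eq 0 ]; then
    echo "matches 0.0.0.0/1"
else
    echo "matches 128.0.0.0/1"
fi
# prints: matches 128.0.0.0/1
```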
The hotfix was not really satisfactory for me (and i promised you some input on the root cause). Hence, i tried to replicate the issue:
- Set up a system with the same software (Debian 10) under my control (debugging in prod is always a bit unfunny)
- Install Mastodon without docker (maybe this breaks something?!)
- Set all sysctl values and loaded kernel modules the same as on the live host
I set that up on Saturday, and was actually quite hopeful that things would not work. Well, except they did. I could not replicate the issue.
Which brings us to today. I started to wonder whether maybe there was something about the virtualization environment going on.
I checked in with Christian again, and asked them what they were using (plain libvirt+kvm on Debian).
Then, on a whim, i asked for an lspci from the box. Maybe there is something there?
Looking at the lspci output, one line immediately caught my eye:
00:03.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL-8100/8101L/8139 PCI Fast Ethernet Adapter (rev 20)
Usually, when running kvm VMs, you want to have a virtio network card.
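With libvirt, the NIC model is set in the domain XML; something along these lines (the bridge name is a placeholder) selects virtio instead of the emulated rtl8139:

```xml
<interface type='bridge'>
  <source bridge='br0'/>    <!-- placeholder bridge name -->
  <model type='virtio'/>    <!-- instead of type='rtl8139' -->
</interface>
```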
So, a quick change over on my test-VM, making the NIC ‘rtl8139’ and not ‘virtio’ aaaaaand… things broke. Finally.
A wget behind a low MTU link finally gave me only around 100KB/s.
Similarly, checking the second box from which things had worked all along, Christian found that it uses a virtio NIC:
00:03.0 Ethernet controller: Red Hat, Inc Virtio network device
Equipped with something to search for (to be fair, ‘rtl8139, broken’ is not the most unique thing to look for), i ultimately found this thread on the qemu-dev/netdev mailing lists. It appears that others actually had the same problem before. The thread rather quickly notes that the MTU for TCP segmentation offloading seems to be fixed at 1500 in qemu’s rtl8139 code. The thread even suggests a patch which should fix the issue (but runs dry afterwards). Ultimately, the code with this bug must have been written in 2006, so roughly 16 years ago (and the mailing list discussion itself is already ten years old).
Looking at the ‘Modifications:’ header in rtl8139.c, i have a strong suspicion that this ultimately boils down to this change:
* 2006-Jul-04 : Implemented TCP segmentation offloading
*               Fixed MTU=1500 for produced ethernet frames
This also fits well with the fact that disabling TSO/GSO for the rtl8139 fixes the issue as well:
ethtool -K ens18 tx off sg off tso off
But, as i am not really a C-ish person, or an overly qualified coder… i figured it might be better to just file a bug ticket and call it a day.
At least I have closure now.
And the people at Digitalcourage a reason to change the network interface type of their virtualized NIC.
So, as a rather final final thought: there is one big lesson learned.
If you compare two virtual machines on the same hypervisor, do not assume that the hardware is the same.
Or that virtual machines cannot have hardware bugs.