OpenBSD version: 7.1
Arch:            Any
NSFP:            Uhm...
Something something MTU
The Internet is a network of networks.
Networks tend to be connected over links.
Packets flow over these links.
And very much like in the real world, these packets tend to have a size.
Just as $random_parcel_service will decline parcels over a specific size, links tend to have a maximum size for packets traversing them.
This value, the MTU (Maximum Transmission Unit), is 1500 bytes for an Ethernet v2 link. However, there are also larger MTUs, for example for old stuff like FDDI, or new stuff like jumbo frames. Similarly, the MTU can also be smaller than 1500 bytes, for example when we use a tunnel (cramming additional headers in front of our packets) and do not want to fragment the tunnel packets. Along the same lines, PPPoE can snack a few bytes (usually eight) away from us when sending packets. Note that for IPv6 packets to happily flow through the Internet, the MTU must not fall below 1280 bytes.
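To make the overhead arithmetic concrete, here is a small sketch; the WireGuard figure assumes an IPv6 outer packet (40 bytes) plus UDP (8 bytes) plus the WireGuard data header and authentication tag (32 bytes), which is an assumption about a typical setup rather than a universal constant:

```python
# Back-of-the-envelope MTU arithmetic for common encapsulations.
ETHERNET_MTU = 1500           # Ethernet v2
IPV6_MIN_MTU = 1280           # minimum MTU IPv6 requires of every link

PPPOE_OVERHEAD = 8            # PPPoE header (6) + PPP protocol field (2)
WG_V6_OVERHEAD = 40 + 8 + 32  # outer IPv6 + UDP + WireGuard data header/tag (assumed setup)

def inner_mtu(link_mtu: int, overhead: int) -> int:
    """MTU left for packets carried inside the encapsulation."""
    return link_mtu - overhead

print(inner_mtu(ETHERNET_MTU, PPPOE_OVERHEAD))  # 1492
print(inner_mtu(ETHERNET_MTU, WG_V6_OVERHEAD))  # 1420, still above the IPv6 minimum
```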
Naturally, if we send packets from A to B, we want them to have the ideal size to neatly fit through all links they have to traverse. While, technically, we can just start fragmenting packets when a router can no longer forward them whole, this comes with a lot of caveats. So, our next best option is each router along the path telling the sender of a packet that it is too big, so the sender can resend smaller packets. Consider this example:
     MTU 1500         MTU 1420
A -------------- B -------------- C
When packets are too big
If A sends a packet to C via B with a total size over 1420 bytes, B cannot forward the packet without fragmenting it. B can now send a message to A, notifying A that the MTU of the next link is $size, so that A can resend a smaller packet (and adjust the size of subsequent packets).
This gives us ‘Path MTU Discovery’ (PMTUD), enabled by ‘datagram too big’ ICMP packets, for which there is of course also a version for IPv6 in ICMPv6 (‘Packet Too Big’).
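As a toy model (a sketch for illustration, not how any real network stack implements it), PMTUD amounts to the sender shrinking its packets every time a hop reports them as too big:

```python
def discover_path_mtu(link_mtus, icmp_reaches_sender=True):
    """Toy PMTUD: the sender starts at its own link's MTU and shrinks
    whenever a router reports 'packet too big'. Returns the discovered
    path MTU, or None if the reports never make it back to the sender."""
    size = link_mtus[0]
    while True:
        # First link along the path that cannot carry the packet whole:
        too_big = next((mtu for mtu in link_mtus if size > mtu), None)
        if too_big is None:
            return size      # packet fits through every link on the path
        if not icmp_reaches_sender:
            return None      # too-big message filtered: sender never learns
        size = too_big       # resend at the reported next-hop MTU

print(discover_path_mtu([1500, 1420, 1500]))        # 1420
print(discover_path_mtu([1500, 1420, 1500], False)) # None: an MTU black hole
```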
This, of course, only works if these packets from B can actually reach A.
To see why these packets may not make their way to A, let’s expand our example a bit:
     MTU 1500         MTU 1500         MTU 1420         MTU 1500
A -------------- B -------------- C -------------- D -------------- E
Here, for a 1500 byte packet from A to E, C would have to send a note to A that the packet is too big. However, B might refuse to forward said note back to A, for example because the link between B and C uses private addresses (RFC1918 for IPv4, for example) for the transit link, and B hard-filters bogons. If said packets are now responses in, say, a TCP session initiated from E to A, the experience for E will be that replies simply do not arrive, while A is confused because the packets it sends are never acknowledged by E. Essentially, the connection will die, or, in the case of a website, the user experience will be that the site just does not load. We have one of those classical “MTU problems”. (Fun sidenote: If this happens for SMTP, it is even harder to debug, because it will usually just work fine via telnet/netcat, as most SMTP commands before DATA fit well into packets waaaay smaller than 1000 bytes; only DATA will make you feel the pain.)
Enjoying PMTUD issues
I was recently hit by this problem when trying to access some websites located over at Hetzner via IPv6. This was somewhat curious, as I kind of assumed PMTUD to work. Well, let’s take a look at an example traceroute:
~ % traceroute6 -A dns2.aperture-labs.org
traceroute6 to dns2.aperture-labs.org (2a01:4f9:c010:15cc::1), 64 hops max, 60 byte packets
 1  vm (2a06:d1c0:dead:4::1) [AS59645]  0.31 ms  0.325 ms  0.279 ms
 2  wg1.gw01.dus01.as59645.net (2a06:d1c0::dead:beef:a01) [AS59645]  19.071 ms  23.52 ms  19.293 ms
 3  ipv6.decix-dusseldorf.core1.dus1.he.net (2001:7f8:9e::1b1b:0:1)  20.829 ms  22.396 ms  20.092 ms
 4  * * *
 5  * * *
 6  netnod-ix-ge-b-sth-4470.hetzner.de (2001:7f8:d:fb::71)  38.252 ms
    netnod-ix-ge-b-sth-1500.hetzner.de (2001:7f8:d:fe::71)  39.254 ms
    netnod-ix-ge-a-sth-1500.hetzner.de (2001:7f8:d:ff::71)  43.991 ms
 7  core32.hel1.hetzner.com (2a01:4f8:0:3::29) [AS24940]  46.726 ms
    core32.hel1.hetzner.com (2a01:4f8:0:3::29d) [AS24940]  41.685 ms
    core31.hel1.hetzner.com (2a01:4f8:0:3::42d) [AS24940]  49.371 ms
 8  2a01:4f9:0:c001::a072 (2a01:4f9:0:c001::a072) [AS24940]  45.493 ms
    2a01:4f9:0:c001::a076 (2a01:4f9:0:c001::a076) [AS24940]  45.578 ms
    2a01:4f9:0:c001::a002 (2a01:4f9:0:c001::a002) [AS24940]  44.024 ms
 9  * * *
10  2a01:4f9:0:c001::2127 (2a01:4f9:0:c001::2127) [AS24940]  43.235 ms  43.153 ms  49.461 ms
64  * * *
What we essentially have here is our case from the second example, with our traceroute running from E to A (and some more hosts between C and A than in the example).
The link between vm and gw01.dus01.as59645.net goes via a wireguard tunnel with an MTU of 1420. Pinging vm from dns2.aperture-labs.org and running tcpdump on the uplink interface, we also notice that, for some reason, no PMTUD packets from gw01.dus01.as59645.net arrive.
This raises the interesting question… why?
ping6 and traceroute6 are also rather non-verbose when they are run from gw01.dus01.as59645.net:

gw01.dus01.as59645.net ~ # ping6 -c 4 dns2.aperture-labs.org
PING dns2.aperture-labs.org (2a01:4f9:c010:15cc::1): 56 data bytes
--- dns2.aperture-labs.org ping statistics ---
4 packets transmitted, 0 packets received, 100.0% packet loss
gw01.dus01.as59645.net ~ # traceroute6 -A dns2.aperture-labs.org
traceroute6 to dns2.aperture-labs.org (2a01:4f9:c010:15cc::1), 64 hops max, 60 byte packets
 1  ipv6.decix-dusseldorf.core1.dus1.he.net (2001:7f8:9e::1b1b:0:1)  1.64 ms  1.563 ms  1.679 ms
64  * * *
Everyone has their limits: BCP38 (hopefully)
Well, one might notice that these packets flow via IPv6 and Hurricane Electric via DECIX in Duesseldorf.
While RFC7454 is a bit more announce-y about this, most IXPs including DECIX are pretty much fans of keeping their peering LANs out of the DFZ.
This means that, of course, gw01.dus01.as59645.net should not receive any replies from dns2.aperture-labs.org, simply because the latter should not have a route back to the peering LAN.
Still, one might hope that the packets themselves still arrive; for PMTUD to work, it would suffice if the ICMPv6 packets arrive at dns2.aperture-labs.org.
Well, the thing is, in this case, they don’t. Hetzner is pretty adamant about which packets they forward and which they do not, implementing BCP38 rather strictly; and after opening a ticket, it became clear that ‘not in the DFZ’ pretty much translates to ‘not on our network’ (neither as source nor destination) for them. This, of course, is not overly healthy for PMTUD, leaving us with essentially five options to fix this issue:
- Convince Hetzner to accept ICMP traffic with IXP peering LAN source addresses (yeah, no, not gonna happen).
- Do some traffic engineering to not see Hetzner via the IXP anymore (boring).
- Do funny NAT/routing things to make gw01.dus01.as59645.net use, e.g., its (public) loopback address for originating packets via the IXP link to non-IXP-LAN destinations (feels somewhat ugly).
- Find a way to move the MTU break away from gw01.dus01.as59645.net.
- Apply proper MSS (Maximum Segment Size) clamping on the lower MTU link.
ping load again
I decided to combine options 4 and 5. This is not reeeeaaally necessary, as MSS clamping alone would be enough… but then again, where is the fun in that?
The first part of the problem is that gw01.dus01.as59645.net is also the host with an adjacent low-MTU (tunnel) link.
If we move that tunnel to another router connected via a 1500 byte MTU link, the size exceeded messages would (naturally) originate from that transit link’s address on that newly added router.
As that router uses addresses from my networks, which are also announced to the DFZ, Hetzner will now happily forward those packets.
Hence, for all my tunneling needs, there now is a dedicated tunnel router.
Now, the more straightforward solution for fixing this issue is performing MSS clamping for TCP. This means that we overwrite the MSS that hosts reaching each other via the low(er)-MTU link communicate to each other. By ensuring that the announced MSS does not exceed the maximum MSS for the MTU we are dealing with (1420), we can make sure nobody wants to send packets that are too large for our link.
In this case, the maximum payload size signalled by the hosts behind the tunnel to dns2.aperture-labs.org over at Hetzner when initiating the connection will be overwritten with a value fitting well into the MTU of the following link.
This is MTU - 40 for IPv4 and MTU - 60 for IPv6, so an MSS of 1380 for IPv4 and 1360 for IPv6:
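The arithmetic, as a sketch (assuming plain 20-byte TCP headers and no IP options or extension headers):

```python
# Fixed header sizes, without options/extension headers.
IPV4_HEADER = 20
IPV6_HEADER = 40
TCP_HEADER = 20

def clamped_mss(mtu: int, ip_header: int) -> int:
    """Largest TCP payload that still fits into one packet on this link."""
    return mtu - ip_header - TCP_HEADER

print(clamped_mss(1420, IPV4_HEADER))  # 1380 -> max-mss for IPv4
print(clamped_mss(1420, IPV6_HEADER))  # 1360 -> max-mss for IPv6
```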
So, the pf.conf on vm also gets (a bit doubled, but well…):

...
match in on $tunnel_if from 0.0.0.0/0 scrub (no-df random-id max-mss 1380)
match out on $tunnel_if from ::/0 scrub (max-mss 1360)
...
And after doing this, websites (finally) start loading correctly again.