OpenBSD version: Would have helped; but mostly all RHEL and derivatives with NetworkManager
Arch:            Any
NSFP:            Well,... apparently... not. -.-'

So, we all noticed that Microsoft is currently in camp “How can we best motivate people to figure out if there are other OSes besides Windows for their non-Macs?” Unsurprisingly, this also affects some people for whom I do the occasional support thing.

With users kind of liking things not changing, I usually opt for installing some form of RHEL derivative with a long support cycle. Traditionally, that had me install CentOS or Scientific Linux. With history having been as it was, though, neither is really an option anymore. Instead, my poison of choice for these cases has now become Rocky Linux.

No VPN for you

Migrating a person over the past couple of days, I ran into a fun little issue: Adding a wireguard connection via NetworkManager led to no connectivity at all. For some reason, though, doing the same thing with the same configuration file with wg-quick did actually work. Now, clicking a button tends to be more fun than opening a terminal for the average user… so… time to figure out why this happens.

Hopping around the Internet, I found out that I am not alone with that funny behavior. While some articles apparently found a solution (using wireguard.ip4-auto-default-route 1 explicitly, even though that should be the default), I was not so lucky.

First hints then dropped in the article introducing wireguard support for NetworkManager. Apparently there is some ‘magic’ happening with routing tables, to ensure that more specifics for the tunnel endpoints can get where they are supposed to be.

IP crumbs left by DenverCoder9

However, similar to previous travelers looking for DenverCoder9, I found the ip rule sets installed by wg-quick and nmcli upon bringing the interfaces up to be pretty much identical (+- me having reduced the case to IPv4 for nmcli):

For wg-quick:

# ip rule show
     from all lookup local
 from all lookup main suppress_prefixlength 0
 not from all fwmark 0xca6c lookup 51820
 from all lookup main
 from all lookup default

And for NetworkManager:

     from all lookup local
 from all lookup main suppress_prefixlength 0
 not from all fwmark 0xcbdb lookup 52187
 from all lookup main
 from all lookup default

As the previous traveler noted, the routing table entries were effectively the same as well.
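
To check that two such rule listings really only differ in the connection-specific fwmark and table number, they can be normalized before diffing. The following is a minimal sketch; the function names normalize_rules and fwmark_to_dec are made up for illustration, and it assumes that rule priorities (if your ip rule show prints them) may be ignored for the comparison:

```shell
# Normalize `ip rule show` output read from stdin:
#  - drop leading whitespace and an optional "PRIORITY:" prefix
#  - replace the connection-specific fwmark and table number by placeholders
normalize_rules() {
    sed -E \
        -e 's/^[[:space:]]*([0-9]+:)?[[:space:]]*//' \
        -e 's/fwmark 0x[0-9a-fA-F]+/fwmark FWMARK/' \
        -e 's/lookup [0-9]+[[:space:]]*$/lookup TABLE/'
}

# Convert a hexadecimal fwmark, as printed by `ip rule`, to the decimal
# form that wg-quick prints in its "wg set ... fwmark" line.
fwmark_to_dec() {
    printf '%d\n' "$1"
}

# Example (file names are placeholders):
#   ip rule show | normalize_rules > /tmp/rules-wg-quick
#   ... switch to the NetworkManager connection ...
#   ip rule show | normalize_rules > /tmp/rules-nm
#   diff /tmp/rules-wg-quick /tmp/rules-nm
#   fwmark_to_dec 0xca6c
```

If the diff is empty, the policy routing rules are structurally the same, and the difference has to be somewhere else.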

Looking at the output of wg-quick (and banging my head against it for longer than I care to admit) then made it apparent what was going wrong here:

# wg-quick up wg0
[#] ip link add wg0 type wireguard
[#] wg setconf wg0 /dev/fd/63
[#] ip -4 address add 192.0.2.2 dev wg0
[#] ip -6 address add 2001:db8::2/128 dev wg0
[#] ip link set mtu 1420 up dev wg0
[#] wg set wg0 fwmark 51820
[#] ip -6 route add ::/0 dev wg0 table 51820
[#] ip -6 rule add not fwmark 51820 table 51820
[#] ip -6 rule add table main suppress_prefixlength 0
[#] nft -f /dev/fd/63                                       <<< !
[#] ip -4 route add 0.0.0.0/0 dev wg0 table 51820
[#] ip -4 rule add not fwmark 51820 table 51820
[#] ip -4 rule add table main suppress_prefixlength 0
[#] sysctl -q net.ipv4.conf.all.src_valid_mark=1
[#] nft -f /dev/fd/63                                       <<< !

A new hope…

So, there are some firewall rules being set. Not being an idiot (narrator voice: “In fact, being–in hindsight–fully aware of being an idiot, he was not only an idiot, but apparently also a liar.”) I of course had run iptables-nft -L -v -n to check if there was anything in place. Nothing was to be seen there.

I then figured that nft should, of course, also be able to tell me what rules it was setting there. So, nft list ruleset and go.

Lo and behold, there was a difference now; after starting wg0 with wg-quick, the output ended with:

table ip6 wg-quick-wg0 {
        chain preraw {
                type filter hook prerouting priority raw; policy accept;
                iifname != "wg0" ip6 daddr 2001:db8::2 fib saddr type != local drop
        }

        chain premangle {
                type filter hook prerouting priority mangle; policy accept;
                meta l4proto udp meta mark set ct mark
        }

        chain postmangle {
                type filter hook postrouting priority mangle; policy accept;
                meta l4proto udp meta mark 0x0000ca6c ct mark set meta mark
        }
}
table ip wg-quick-wg0 {
        chain preraw {
                type filter hook prerouting priority raw; policy accept;
                iifname != "wg0" ip daddr 192.0.2.2 fib saddr type != local drop
        }

        chain premangle {
                type filter hook prerouting priority mangle; policy accept;
                meta l4proto udp meta mark set ct mark
        }

        chain postmangle {
                type filter hook postrouting priority mangle; policy accept;
                meta l4proto udp meta mark 0x0000ca6c ct mark set meta mark
        }
}

For the nmcli instance, nothing of that sort was to be seen.
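
A quick way to spot this kind of difference is to compare which tables exist in two ruleset dumps, instead of reading the full output. This is only a sketch: the helper names nft_tables and nft_new_tables are hypothetical, and it assumes the dumps were saved with nft list ruleset before and after bringing the tunnel up:

```shell
# Print "FAMILY NAME" for every table in an `nft list ruleset` dump on stdin.
nft_tables() {
    awk '$1 == "table" { print $2, $3 }'
}

# Print the tables that are present in the second dump but not in the first.
# $1: dump taken before bringing the tunnel up
# $2: dump taken after bringing the tunnel up
nft_new_tables() {
    before=$(nft_tables < "$1")
    nft_tables < "$2" | while read -r line; do
        # Only print tables that have no exact match in the "before" list.
        printf '%s\n' "$before" | grep -qxF -- "$line" || printf '%s\n' "$line"
    done
}

# Example (paths are placeholders):
#   nft list ruleset > /tmp/nft-before
#   wg-quick up wg0
#   nft list ruleset > /tmp/nft-after
#   nft_new_tables /tmp/nft-before /tmp/nft-after
```

Running the same comparison around bringing the connection up via nmcli should then print nothing, which matches what I saw.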

Things now kind of made sense: The firewall setup is somewhat unique to RHEL-like OSes (including Fedora, of course), and NetworkManager did not set the corresponding rules in the right place to make sure that packets for the wireguard underlay get the right fwmark to actually find their way where they belong. My feeble attempts to find those rules before failed because they live in a dedicated nftables table (wg-quick-wg0), hooked at raw and mangle priority; iptables-nft -L -v -n only lists the filter table that iptables manages itself, so it will not show them.

Grabbing the above rules, putting them into /tmp/nft, and simply applying them with nft -f /tmp/nft then made a wireguard connection started from nmcli work. As it should.

First pull the thingamajig through the loop, then turn the crank

After figuring out what was going wrong, I now needed to fix this for the user of the machine. Now, me being a lazy person (and the number of wireguard tunnels on this machine likely never exceeding 1), I went for a simple additional if-up script. This is not the recommended solution, though; instead, you should actually figure out how to do this properly, or, even better, annoy the distro maintainers until they fix that.

However, I tend to be lazy. With the help of some blog post, this then netted me the following file in /etc/NetworkManager/dispatcher.d/00-wg-set-nft (debug code left in as a boon to the reader ;-P):

#!/bin/bash
set -u
set -e

#date --iso-8601=s >> /tmp/up_event
#printenv >> /tmp/up_event

if [ "$NM_DISPATCHER_ACTION" == "up" ] || [ "$NM_DISPATCHER_ACTION" == "down" ];
then
#        echo "Found interface '$NM_DISPATCHER_ACTION' for $DEVICE_IFACE connection $CONNECTION_ID" >> /tmp/up_event;
        if [ -f "/etc/wireguard/nft-$CONNECTION_ID-$DEVICE_IFACE-up" ];
        then
#               echo "NFT ruleset for $CONNECTION_ID ($DEVICE_IFACE) present at /etc/wireguard/nft-$CONNECTION_ID-$DEVICE_IFACE-up!" >> /tmp/up_event;
                if [ "$NM_DISPATCHER_ACTION" == "up" ];
                then
#                       echo "Setting FW rules for $CONNECTION_ID" >> /tmp/up_event;
                        nft -f "/etc/wireguard/nft-$CONNECTION_ID-$DEVICE_IFACE-up";
#                       nft list ruleset >> /tmp/up_event;
                else
#                       echo "Deleting FW rules for $CONNECTION_ID" >> /tmp/up_event;
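                        # Tolerate an already missing table, so that set -e does not skip the second delete
                        nft delete table ip wg-quick-"$DEVICE_IFACE" || true;
                        nft delete table ip6 wg-quick-"$DEVICE_IFACE" || true;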
#                       nft list ruleset >> /tmp/up_event;
                fi;
#       else
#               echo "NFT ruleset for $CONNECTION_ID ($DEVICE_IFACE) not found at /etc/wireguard/nft-$CONNECTION_ID-$DEVICE_IFACE-up!" >> /tmp/up_event;
        fi;
fi;

Now, remember, I am doing stupid things, and this is pretty much not how it should be done. However… it works; at least if there is a corresponding file in /etc/wireguard/nft-$CONNECTION_ID-$DEVICE_IFACE-up containing the very NFT rules I already pasted above. Note that I explicitly set the fwmark for the wireguard connection to make sure that this matches.
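
Since the rules file only depends on the interface name, the fwmark, and the tunnel addresses, it can also be generated instead of being copied by hand. The following is a sketch under assumptions: the function name wg_nft_table is made up, the template simply mirrors the ruleset wg-quick produced above, and the fwmark you pass in has to be the one actually configured on the connection (in NetworkManager this should be the wireguard.fwmark property, but check your version):

```shell
# Print one nftables table in the style wg-quick sets up.
# $1: address family, "ip" or "ip6"
# $2: WireGuard interface name, e.g. wg0
# $3: fwmark of the tunnel, decimal or 0x-prefixed hex
# $4: tunnel address of this host within that family
wg_nft_table() {
    local family=$1 iface=$2 fwmark=$3 addr=$4 mark
    # nft prints marks as zero-padded hex; keep the same notation.
    mark=$(printf '0x%08x' "$fwmark")
    cat <<EOF
table $family wg-quick-$iface {
        chain preraw {
                type filter hook prerouting priority raw; policy accept;
                iifname != "$iface" $family daddr $addr fib saddr type != local drop
        }

        chain premangle {
                type filter hook prerouting priority mangle; policy accept;
                meta l4proto udp meta mark set ct mark
        }

        chain postmangle {
                type filter hook postrouting priority mangle; policy accept;
                meta l4proto udp meta mark $mark ct mark set meta mark
        }
}
EOF
}

# Example, assuming connection ID and interface are both called wg0:
#   { wg_nft_table ip6 wg0 51820 2001:db8::2
#     wg_nft_table ip  wg0 51820 192.0.2.2
#   } > /etc/wireguard/nft-wg0-wg0-up
```

The table names have to stay wg-quick-$DEVICE_IFACE, because that is what the down branch of the dispatcher script deletes.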

Now what

Given that some of the rather similar-sounding issues have been around for a bit now… I do wonder why there is no publicly posted solution to this yet; I mean… not even a patch, but at least somebody noticing that the absence of mark-setting rules in the right place might have triggered this.

Also, I am a bit unsure where to drop this bug. Technically, this happens on Rocky Linux 9; But the wireguard-tools package is in EPEL. But the component that has issues is NetworkManager from base. So erm… not sure.

Luckily, I ultimately decided to file it against EPEL. This led me to the Red Hat bug tracker and this funny little bug here from 2023-09-17: Bug 2239353 - WireGuard connection doesn’t work, if endpoint is IPv6

Interestingly, this bug describes pretty much the same behavior. I am not sure why its title concludes that this is specific to IPv6 endpoints and not general; the discussion itself notes that this does not seem to be IPv6-specific, and it correctly identifies the absence of fwmarks being applied as the root cause.

The last push on that was 2024-05-31 11:49:18 UTC; So I guess it is time to drop another friendly note that this really is an issue. ;-)