SERVFAIL me one more time: Reflections on TU Delft's Downtime
OpenBSD version: might have helped
Arch: Any
NSFP: Well... obviously not.
“It has been 0 days since it was DNS.”
Last Monday I noticed some of my Nagios checks going critical. Specifically, these were the checks for the anycasted DNS recursors testing Momoka’s draft on v4 resolution for v6 resolvers, which would try to resolve tudelft.nl.
Why tudelft.nl? Well, this zone is reliably free of any IPv6 support.
At the same time, various projects I operate were unable to deliver mail to users @tudelft.nl, the reason being that the domain was not resolvable.
I initially did not really want to write about this incident. However, the number of people asking me about it has slowly reached “too many”, and given my background in a) operations and b) research on “exactly this”, I figured writing it up might be easier.
Disclaimer
I am writing this text as a system and network engineer and scientist working on the subject of digital infrastructure operations. Statements, suggestions, and conclusions presented in this document are based on my practical experience and scientific findings around running digital infrastructure in a higher-education/research network context. My statements do not necessarily represent the official position of my employer or any other affiliation I hold, and are, as such, my own.
I note that I was employed by TU Delft until 2022-03-31 and held a hospitation agreement with TU Delft up until 2024-03-31. However, I was never part of the operational staff involved with the digital infrastructure of TU Delft. At this point in time, I am in no way affiliated with TU Delft. As such, I do not have any privileged insight or information on the operational efforts in general or the incident at hand specifically.
All information provided and statements made in this document are either the result of analysis informed by public information, or conjecture based on common operational practices. This document does not contain confidential or proprietary information of TU Delft.
The purpose of this document is the open discussion of challenges in the operation of digital infrastructure for higher education institutions. Observed deviations from operational best practices are only stated when independently verifiable. Conjecture is marked as such.
Timeline of Events
At the moment, I see the following timeline in relation to the events at hand, all times UTC:
- 2024-05-28T02:19:24: Monitoring for n64v6res01.ber01.as59645.net notified that ? IN A tudelft.nl cannot be resolved. n64v6res01.ber01.as59645.net first notified at 2024-05-28T02:28:32, and finally n64v6res01.ams01.as59645.net notified at 2024-05-28T03:18:50.
- 2024-05-28T11:41:10: Monitoring notes that tudelft.nl is resolvable again on n64v6res01.dus01.as59645.net; n64v6res01.ber01.as59645.net follows at 2024-05-28T11:44:06. n64v6res01.ams01.as59645.net was already able to resolve tudelft.nl at 2024-05-28T06:12:59.
- 2024-05-28: TU Delft ICT published a notification on https://meldingen-ict.tudelft.nl/en/, noting ongoing issues with DNS which are being investigated. This message has since been removed.
- 2024-05-30: Issues with resolving tudelft.nl persist intermittently. It appears that the DNSSEC configuration of tudelft.nl has been misconfigured, i.e., new keys were added on the authoritative servers, but the DS records in .nl were not updated.
- 2024-05-31: Intermittent reachability issues continue. Meanwhile, TU Delta reported that the attack from 2024-05-28 reached 2.8 trillion requests per half hour. A former colleague also informally noted that, apparently, during the attack external connectivity was briefly interrupted by ICT to be able to work on further mitigating the attack.
- 2024-06-01: tudelft.nl seems to have partially recovered.
- 2024-06-01: Around 13:00, DNS resolution seems to return to normal for most clients. However, DNSViz still reports the zone to be bogus. Based on that data, it appears that tudelft.nl currently responds with RRSIGs signed with keyid 135, while .nl only holds DS records for keyid 47965 and keyid 28945, neither of which signs 135. Furthermore, responses from the tudelft.nl nameservers still have a high RTT, sometimes timing out.
What might be going on…
The question now is: What really happened, and why is it not yet fixed? Let’s delve into that, shall we?
The nature of the DoS
The root cause of these events seems to be the DoS attack from Monday. This has been claimed to have caused 2.8 trillion requests per 30 minutes. Now, this may mean 1,000,000,000,000 (10^12) or 1,000,000,000,000,000,000 (10^18) requests, depending on the definition of trillion.
This allows us to gauge the bandwidth that must have come in per second, on average. A request for ? IN A tudelft.nl, as sent by dig A tudelft.nl @192.0.2.1, is 79 bytes. Per 10^12 requests, this makes ca. 351.11 Gbit/s on average; per 10^18, ca. 351.11 Pbit/s. Mildly unlikely.
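As a quick back-of-the-envelope check (a sketch; the 79-byte request size and the 30-minute window are taken from above, and the 2.8x factor from the reported number is included for comparison):
% python3 -c "
reqs, size_bytes, secs = 1e12, 79, 1800
print(f'{reqs * size_bytes * 8 / secs / 1e9:.2f} Gbit/s per 10^12 requests')
print(f'{2.8 * reqs * size_bytes * 8 / secs / 1e9:.2f} Gbit/s for 2.8 * 10^12 requests')
"
351.11 Gbit/s per 10^12 requests
983.11 Gbit/s for 2.8 * 10^12 requests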
Even the average 351.11 Gbit/s is somewhat unlikely to have reached the servers, given that it would require a campus uplink of at least 800 Gbit/s (considering that such small packets come with a lot of overhead) if the statement that the servers had to process this volume of requests is to hold. Then again, these statistics might also have been collected upstream, i.e., in the SURF backbone.
In any case, this is obviously a lot of data, and nothing overly easy to handle.
The Authoritative DNS Servers of tudelft.nl
RFC2182 has some opinions on running authoritative nameservers, or rather on selecting the secondary ones for a zone, and says in Section 3.1:
When selecting secondary servers, attention should be given to the
various likely failure modes. Servers should be placed so that it is
likely that at least one server will be available to all significant
parts of the Internet, for any likely failure.
Now, how does this look for tudelft.nl:
% dig NS tudelft.nl @ns1.dns.nl
; <<>> DiG 9.16.48 <<>> NS tudelft.nl @ns1.dns.nl
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 17759
;; flags: qr rd; QUERY: 1, ANSWER: 0, AUTHORITY: 2, ADDITIONAL: 3
;; WARNING: recursion requested but not available
;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
;; QUESTION SECTION:
;tudelft.nl. IN NS
;; AUTHORITY SECTION:
tudelft.nl. 3600 IN NS ns1.tudelft.nl.
tudelft.nl. 3600 IN NS ns2.tudelft.nl.
;; ADDITIONAL SECTION:
ns1.tudelft.nl. 3600 IN A 130.161.180.1
ns2.tudelft.nl. 3600 IN A 130.161.180.65
;; Query time: 30 msec
;; SERVER: 2001:678:2c:0:194:0:28:53#53(2001:678:2c:0:194:0:28:53)
;; WHEN: Sat Jun 01 17:31:06 CEST 2024
;; MSG SIZE rcvd: 107
Well. Those two are certainly in the same /24. No matter how geographically distributed the setup is, this does not fulfill the requirements of RFC2182 or RFC3258. And this has been like that for… some… time. And, incidentally, still is like that.
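For completeness, one can also confirm that both addresses share a single covering route and origin AS, i.e., that there is no routing-level diversity hiding behind the numbering (a sketch using the RADb whois mirror; output omitted):
% for ns in 130.161.180.1 130.161.180.65; do whois -h whois.radb.net $ns | grep -E '^(route|origin):'; done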
How it all Comes Together
Based on the observations above, I make the following conjectures concerning the events that transpired in response to the DoS attack.
- I assume that the authoritative DNS servers of tudelft.nl were the primary target of the attack, given that TU Delft ICT first reported to be investigating an issue related to their DNS servers.
- As the nameservers of tudelft.nl are single-homed behind its campus network, disconnecting or otherwise severing (e.g., via the DoS) the campus network from the rest of the Internet makes tudelft.nl unreachable, also making, e.g., emails undeliverable.
- In response to the observed attack, TU Delft ICT hopefully asked SURF to ACL, e.g., via Flowspec, certain routes into their network.
- As there were still notable inbound requests hitting the authoritative NS, ICT decided to either replace or provision an additional authoritative DNS worker. Alternatively, possibly a middlebox or DNS filtering solution was brought in.
Together, the above steps restored operations, especially after the DoS attack
subsided a bit. Issues then resurfaced after roughly 48 hours when the TTL
of 172800 seconds for the DNSKEY records of tudelft.nl
expired on more and
more recursive resolvers.
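Both the published TTL of the DNSKEY RRset and the key tags the parent actually vouches for can be checked from the outside with plain dig (a sketch; no special tooling assumed):
% dig +noall +answer DNSKEY tudelft.nl @ns1.tudelft.nl   # second column is the TTL
% dig +noall +answer DS tudelft.nl @ns1.dns.nl           # key tags with a DS at the parent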
Possibly, during the change (adding or upgrading a worker/traffic filter) a new ZSK DNSSEC key (keyid=135) was introduced. Based on the DNSViz resolutions, it seems like this DNSKEY is not always delivered with an RRSIG signed by one of the keys for which a DS record is installed in .nl, even though it seems like it should be signed by keyid=47965. Nevertheless, this does not always happen, and keyid=135 was not present earlier this year, which might have benign reasons, e.g., a ZSK rollover.
Similarly, it might be that responses carrying those specific RRSIG/DNSKEY records simply get eaten somewhere on-path or enjoy just being dropped. This may be the result of a misconfiguration, the result of the ongoing DoS, or caused by the introduction of specific defense mechanisms, e.g., heavy per-client rate limits.
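Which key tag actually signs the answers a given nameserver hands out, and whether a validator accepts the chain, can be checked along these lines (a sketch; delv ships with BIND 9 and uses its built-in root trust anchor):
% dig +dnssec +multi DNSKEY tudelft.nl @ns1.tudelft.nl   # key ids are printed as comments
% dig +dnssec A tudelft.nl @ns2.tudelft.nl               # compare the key tag in the RRSIG
% delv tudelft.nl A +rtrace                              # reproduce the validation result via the local resolver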
In any case, something is still broken, and this heavily impacts reachability
for tudelft.nl
in the DNS.
Update 2024-06-02: The more I think about this, the more it feels like the result of rate limiting, which would likely also impact large resolvers (q1/q8/q9) less, because they re-resolve from different IPs. It also makes sense given that 135 is a ZSK, while the other two are KSKs.
Absence of Monitoring
Given the prolonged nature of the DNSSEC issues, I would argue that TU Delft ICT might not be fully aware of the issue, may not yet have identified the root cause, or may face challenges in getting the specific DNS server implementation they are running to present a properly signed zone.
What could have been done
Now, it is always easy (and mildly fun) to shout at the TV when a football game is going on. So, being a person that likes easy things, I will go for that. What would I have done seeing a ton of packets coming to my authoritative NS all of a sudden?
Do not run NS in a single /25
The obvious first step is not having all NS in a single /25 (yes, ‘twenty-five’), and most certainly not all of them on campus. Ideally, also more than just two. Instead, I would have tried to find multiple secondaries, likely just ingesting AXFR, across different networks. Ideally also one anycasted.
This would already have sufficed to ensure that email delivery is not impacted and that all cloud-hosted services remain accessible.
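For illustration, such a secondary is not much work to stand up. A minimal sketch for NSD (the primary address, key name, and secret are placeholders; Knot, BIND9, or PowerDNS equivalents are similarly short):
# nsd.conf -- minimal AXFR secondary for tudelft.nl; all addresses/keys are placeholders
server:
    hide-version: yes

key:
    name: "tudelft-xfr"
    algorithm: hmac-sha256
    secret: "c2VjcmV0LXBsYWNlaG9sZGVy"   # base64, shared with the primary

zone:
    name: "tudelft.nl"
    allow-notify: 192.0.2.1 tudelft-xfr   # hidden/primary master
    request-xfr: AXFR 192.0.2.1 tudelft-xfr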
DoS Defense
Having the NS also off-campus, in other netblocks with distinct routing policies, would have made it a lot easier to rather blanket-ly block, e.g., inbound packets with dport udp/53 and tcp/53 towards campus; Flowspec comes in handy there, and SURF would likely have been happy to assist.
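To give an idea of the shape of such a rule, a rough Flowspec sketch, assuming BIRD 2.x and a Flowspec-enabled session towards the upstream (the prefix is the campus NS netblock; the actual drop or rate-limit action would be attached as a traffic-rate extended community in the export filter, as agreed with the upstream):
# bird.conf excerpt -- match DNS traffic towards the on-campus authoritative NS
protocol static flowspec_dns {
    flow4;
    route flow4 {
        dst 130.161.180.0/25;   # netblock hosting ns1/ns2.tudelft.nl
        proto = 17;             # UDP
        dport = 53;
    };
}
The same match can be repeated with proto = 6 if TCP/53 should be covered as well.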
There are also some options with the current setup of ‘one’ DNS server network. Depending on what else lives in that /25 (ideally not much, but well…), one could have drawn an AXFR from the authoritative and started serving the zone from the /24, now being anycasted.
That would have required SURF to actually set a route object, and ideally configure a ROA in RPKI; furthermore, one would have had to convince SURF to lend one of their (spare) ASNs. Hence, in general, this is more of a fun thing to do with a bit more time than ‘everything burns’.
Monitor Stuff
I noted above that there is some monitoring likely… missing, given that the issue still persists. Which is ‘not really ideal’. So getting that in place would also be a top priority.
Now, something one could rather easily do would be setting up some RIPE Atlas measurements, like these ones for mpg.de:
- IPv4 UDP: https://atlas.ripe.net/measurementdetail/72293568/
- IPv4 TCP: https://atlas.ripe.net/measurementdetail/72293620/
- IPv6 UDP: https://atlas.ripe.net/measurementdetail/72293583/
- IPv6 TCP: https://atlas.ripe.net/measurementdetail/72293641/
That can then be easily polled by some off-site monitoring setup. Such a monitor can then also check a lot of other services, especially end-to-end things like “Can a user send an email, does it arrive somewhere else, and can its DKIM signature be verified then?”
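On the polling side, a simple Nagios-style check against each authoritative server would already have caught the current breakage (a sketch using check_dns from the monitoring-plugins; the plugin path and thresholds are illustrative):
% /usr/lib/nagios/plugins/check_dns -H tudelft.nl -s 130.161.180.1 -w 0.5 -c 2.0
% /usr/lib/nagios/plugins/check_dns -H tudelft.nl -s 130.161.180.65 -w 0.5 -c 2.0
Pair that with a DNSSEC-validating check (e.g., delv, as above) and the bogus state would likely have shown up within minutes rather than days.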
Conclusion
There can be broad speculation as to why this is now happening to TU Delft, and why DNS(SEC) does not really want to play nice at the moment. In my scientific work, there are some conjectures that the increasing adoption of a cloud-first strategy can lead to capability erosion, especially for basic services, as an organization realizes cost savings by not maintaining, or actively reducing, in-house capacity for exactly those basic services. Still, even though TU Delft is following a very cloud-focused approach, finding a conclusive answer there would require perspectives from inside the organization, which I do not have.
Hence, what remains is me being hopeful that ICT is aware of the underlying issue, and being convinced that the team members are doing their best to resolve the incident.
From a technical perspective, the issues should get a lot better by adding new authoritative servers, i.e., simple AXFR’ing secondaries. I am not sure why this is not being done (not only now, but in the years before). It might be, for example, the complexity of the current setup, or simply that organizational knowledge of the setup was lost over time.
However, as a person who had to emergency migrate a production setup to a PowerDNS authoritative in a few hours… I am pretty sure that it is not too difficult to:
- Get a few additional servers at different providers
- Make sure that there exists one complete copy of the zone, i.e., an AXFR that is properly signed and contains all necessary DNSKEY records
- Optional: Add a new DS for tudelft.nl via the registrar
- Throw NSD, Knot, BIND9, or PowerDNS on the servers with a configuration that can handle ~7k qps (see the sketch above)
- Make the registrar add those new NS (+glue) to .nl, and verify the delegation (see below)
- Bonus: Get additional NS under different TLDs for extra redundancy
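Once the registrar change is through, verifying the result is a couple of one-liners (a sketch; ns3.example.net stands in for one of the new secondaries):
% dig +norec NS tudelft.nl @ns1.dns.nl                 # delegation and glue as seen at the parent
% dig +norec +dnssec SOA tudelft.nl @ns3.example.net   # the new secondary answers authoritatively, with signatures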
But I guess things are just looking a bit too simple from the seat of a professional commentator: with the remote to the right and a bowl of chips to the left, giving good advice from afar. And with that, I grab another beverage and murmur at the TV: “Well, maybe you should not have tried to drive a golf cart around a race track!”; not realizing that I might be watching golf instead.