Health Checks in ExaBGP
While lurking around on the Internet I found a link to this blog post, which talks about DNS anycast using BGP. Since I operate the DNS servers at work this piqued my interest; I'm always interested in reading about how others find solutions.
The post explains the concept of anycast, the server configuration and how to set up BGP for it, so I don't really have anything to add there.
I did, however, want to expand on the ExaBGP configuration. What Mr Yeti does is query the local DNS resolver for yetiops.net and check the result. If the exit code is anything but 0 (meaning some kind of error happened), the BGP route for the DNS resolver is withdrawn from the network. Otherwise, the route is announced and the server receives traffic from users.
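In plain shell terms, the logic in the linked post boils down to something like this (a rough sketch of the idea, not their actual script; the resolver address 127.0.0.1 is a placeholder):

# Exit-code-driven health check: a zero exit keeps the route announced,
# a non-zero exit makes the health checker withdraw it.
if dig +timeout=1 +tries=1 -t A yetiops.net @127.0.0.1 > /dev/null 2>&1; then
        exit 0    # resolver answered, route stays announced
else
        exit 1    # resolver failed, route gets withdrawn
fi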
This isn't a bad thing to do, but we're putting a lot of trust in the servers that host the zone being checked. If those servers are experiencing issues and the zone being checked isn't resolving, the routes will be withdrawn, cutting off DNS for all users even though there's no real problem with the resolver.
One option could be to check some other zone, but as we saw with Facebook back in 2021, even the big giants can fail, and the way I see it, it's really just a matter of time. Make sure to read the link; it describes the problem quite well.
The option I went with instead is to always announce the IP address of the DNS resolver, but in case of a failure increase the MED (Multi-Exit Discriminator) to move traffic away from the server. The server will still accept traffic (and throw it away if there's an actual error), but the network will choose not to send any.
What this means is that if a server encounters a local error (for instance crashing software), it will be taken out of rotation, but if there's a major fault with the zone being checked, all servers will increase their MED and nothing will really change for the users.
Of course, if there's both a local fault and a fault upstream, the server in question will cause issues for the users, but hopefully that is something monitoring will catch.
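Under the hood the healthcheck process talks to ExaBGP over its text API, so a state change essentially boils down to re-announcing the same prefix with a different MED. Roughly speaking (the exact messages the healthcheck emits may differ slightly between versions), it comes down to this:

# Healthy: low metric, this server attracts traffic
announce route 192.0.2.53/32 next-hop 198.51.100.2 med 0

# Failing or disabled: still announced, but with a high metric so traffic moves elsewhere
announce route 192.0.2.53/32 next-hop 198.51.100.2 med 10000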
This is the ExaBGP configuration I'm using. There are more processes for secondary IPv4 addresses and IPv6, but in the end all checks are copies of this one:
process resolver_v4-check {
        run /usr/local/bin/python3.9 -m exabgp healthcheck \
                --cmd "/usr/local/bin/dig +timeout=1 +tries=1 -4 -t SOA google.com @192.0.2.53" \
                --interval 1 \
                --rise 30 \
                --fall 3 \
                --disable /usr/local/etc/exabgp/disable_resolver \
                --ip 192.0.2.53 \
                --next-hop 198.51.100.2 \
                --up-metric 0 \
                --down-metric 10000 \
                --disabled-metric 10000 \
                --up-execute "logger DNS Resolver 192.0.2.53 going into state UP" \
                --down-execute "logger DNS Resolver 192.0.2.53 going into state DOWN" \
                --disabled-execute "logger DNS Resolver 192.0.2.53 going into state DISABLED";
        encoder text;
}
neighbor 198.51.100.1 {
        router-id 198.51.100.2;
        local-address 198.51.100.2;
        local-as 65053;
        peer-as 64496;
        hold-time 31;
        family {
                ipv4 unicast;
        }
        api anycast_v4 {
                processes [ resolver_v4-check ];
        }
}
192.0.2.53 is the local IP address bound to the loopback interface. There's a check every second, and the --fall parameter tells ExaBGP to increase the MED after three failed checks. One check per second might be a bit excessive, but I want to react quickly to faults, and the result from the upstream DNS server is cached locally, so I don't feel bad about putting any extra load on them.
The --rise parameter instead tells ExaBGP to only consider the check UP again (and drop the MED back to 0) after 30 consecutive successful checks, which at one check per second means 30 seconds. I chose this number to make sure intermittent failures don't cause traffic to flap back and forth.
I've also added the --disable parameter. If the specified file exists, ExaBGP will use the --disabled-metric. I touch the file before any maintenance on the server, like upgrades and reboots, and remove it afterwards. This drains the server of queries, and the users will not experience any loss of service.
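The maintenance routine is nothing more than creating and removing that file (the path matches the --disable parameter in the configuration above):

# Before maintenance: drain the resolver, the check goes DISABLED and the MED rises to 10000
touch /usr/local/etc/exabgp/disable_resolver

# ...upgrade, reboot, whatever needs doing...

# After maintenance: remove the file, and once the checks pass again the MED drops back to 0
rm /usr/local/etc/exabgp/disable_resolver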
Other than that there's some logging: the --up-execute, --down-execute and --disabled-execute commands send events to the local syslog daemon, which sends them off to a remote syslog server.
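The forwarding itself is plain syslog configuration. Since logger writes to the user facility by default, a single rule along these lines is enough (loghost.example.net is a placeholder for the real remote server, and the exact file depends on whether the box runs syslogd or rsyslog):

# /etc/syslog.conf (or /etc/rsyslog.conf): forward the health check state changes
user.*          @loghost.example.net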
This configuration serves me well and has been running in production for a few years. Maybe someone else will find it useful.