July 2023

I was on vacation when Talos (the former primary DNS server) started going down. At first we only needed someone to physically reboot the server, but after a while that wasn't enough. I investigated over SSH and found that nsd.service, which provides authoritative DNS, was failing on boot. To find the configuration for nsd.service, I learned how systemd unit files for services are laid out on the system. I was able to change it so that, when powered on, Talos makes more attempts to start the nsd service. In the process, I also learned about crontab, which is used to schedule jobs on the system.

The problem was that nsd.service was failing on boot with an error saying it was "restarting too quickly". After some research into how systemd handles this, it turns out the service has an entry in /etc/systemd/system/multi-user.target.wants (normally a symlink to its unit file), and the unit file is where you adjust how systemd interacts with the service. Since it was being restarted too quickly, the relevant settings, which are allowed but not present by default, are StartLimitBurst=x, where x is an integer giving how many start attempts are allowed, and StartLimitInterval=x, where x is the length of the time window in which those attempts are counted. If the service is started more than StartLimitBurst times within that window, systemd stops trying. The delay between individual restart attempts is a separate setting, RestartSec=.
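
For reference, here is a rough sketch of what this kind of change can look like as a drop-in override instead of an edit to the unit file itself. The values are made up for illustration and are not the ones used on Talos. The drop-in path is the one "systemctl edit nsd.service" normally creates. I am also assuming a reasonably recent systemd, where the start-limit settings live in the [Unit] section and the interval is spelled StartLimitIntervalSec= (older releases use StartLimitInterval=, as above), and that the service is meant to be restarted on failure.

```ini
# /etc/systemd/system/nsd.service.d/override.conf
# Illustrative values only; adjust for the actual machine.

[Unit]
# Count start attempts over a 5 minute window...
StartLimitIntervalSec=300
# ...and allow up to 10 of them in that window before systemd gives up.
StartLimitBurst=10

[Service]
# Assumption: restart nsd whenever it exits with a failure.
Restart=on-failure
# Wait 15 seconds between restart attempts, so a dependency that is
# slow to come up at boot has time to become ready.
RestartSec=15
```

If the file is written by hand rather than through "systemctl edit", run "systemctl daemon-reload" so systemd picks up the change, then check the result with "systemctl status nsd.service".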