Auto-recovery of crashed services with systemd

We were recently faced with a service that crashed occasionally on one of our Linux servers. We had to find a way to make it recover automatically, ideally using tools that were already present on the server.

The landscape of process management in Linux is somewhat complex and in a state of flux, with different tools vying to replace the venerable SysV-style init that sysadmins have relied on for decades. Luckily, it looks like the highly capable systemd is becoming the standard, and we’ll hopefully have another long stable period soon.

In the meantime, we needed to get our service running reliably. In our case, the delayed_job service that supported one of our Rails apps was stopping occasionally. The service was controlled by a SysV-style init script in /etc/init.d, and it was started automatically when the system started up. Unfortunately, SysV-style init doesn’t include monitoring or auto-restart capability*.

* You can actually use the respawn feature in the /etc/inittab file, but that looked too scary to us!

Now that systemd is becoming a standard (present on Ubuntu 14.10+, Red Hat Enterprise Linux 7+, CentOS 7+, and Fedora 15+), we wanted to learn how to use it. systemd is controlled by .service files in a variety of locations. However, we wanted to avoid rewriting the existing startup script. Our research revealed the following:

- systemd is compatible with legacy /etc/init.d scripts: when systemd loads service definitions, the systemd-sysv-generator generates .service files on the fly from the scripts in /etc/init.d.
- You can add configuration to a service by placing “drop-in” .conf files in a directory named after the unit, such as /etc/systemd/system/delayed_job.service.d/.

So, we simply had to add a file like this at /etc/systemd/system/delayed_job.service.d/restart.conf:


[Service]
Type=forking
PIDFile=/srv/www/sites/rails_app/current/tmp/pids/delayed_job.pid
RemainAfterExit=no
Restart=on-failure
RestartSec=5s

This configuration will vary for different services. You’ll have to read the systemd.service docs to figure out the specifics. In particular, you should read the docs for the Type, Restart, GuessMainPID, PIDFile, and RemainAfterExit options.
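For repeatable setups, creating the drop-in can be scripted. The following is a minimal sketch, not something from our actual deployment: the function name write_restart_dropin and the SYSTEMD_DIR variable are hypothetical (SYSTEMD_DIR is not a systemd setting; it only lets you try the function against a scratch directory), and the option values are simply the ones shown above.

```shell
#!/bin/sh
# Write a restart drop-in for a unit.
#   $1: unit name without the .service suffix (e.g. delayed_job)
#   $2: path to the PID file written by the service
# SYSTEMD_DIR is a hypothetical override used only by this sketch;
# it defaults to the real drop-in location.
write_restart_dropin() {
    unit=$1
    pidfile=$2
    dir="${SYSTEMD_DIR:-/etc/systemd/system}/${unit}.service.d"

    # The directory name must match the unit name plus ".d".
    mkdir -p "$dir" || return 1

    cat > "$dir/restart.conf" <<EOF
[Service]
Type=forking
PIDFile=$pidfile
RemainAfterExit=no
Restart=on-failure
RestartSec=5s
EOF
}
```

Run as root, this could be called as write_restart_dropin delayed_job /srv/www/sites/rails_app/current/tmp/pids/delayed_job.pid. systemd only picks up the new file once its service definitions have been reloaded.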

Then, after reloading our service definitions with sudo systemctl daemon-reload, we could see that systemd knew what it needed to know about our service.


$ sudo systemctl status delayed_job
● delayed_job.service - LSB: Manage delayed jobs for application rails_app
   Loaded: loaded (/etc/init.d/delayed_job; bad; vendor preset: enabled)
  Drop-In: /etc/systemd/system/delayed_job.service.d
           └─restart.conf
   Active: active (running) since Mon 2017-10-23 15:33:25 CDT; 12min ago
     Docs: man:systemd-sysv-generator(8)
  Process: 6359 ExecStop=/etc/init.d/delayed_job stop (code=exited, status=0/SUCCESS)
  Process: 6389 ExecStart=/etc/init.d/delayed_job start (code=exited, status=0/SUCCESS)
 Main PID: 6412 (bundle)
   CGroup: /system.slice/delayed_job.service
           ‣ 6412 delayed_job

Notice the “Drop-In” section, which tells us that systemd knows about the new config file we added, and the Main PID line, which indicates that systemd knows which process to monitor.

We can test this config by force-killing the service’s main process: sudo kill -9 6412. Running systemctl status delayed_job immediately afterward will show something like Active: deactivating (stop) (Result: signal). After about 5 seconds (the RestartSec value in the config), it should return to saying Active: active (running).
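The manual test can also be scripted. This is a rough sketch under a few assumptions: the function names main_pid, wait_restarted, and crash_test are made up for illustration, the 15-second default timeout is arbitrary, and the script assumes the invoking user may run sudo kill. It reads the main PID with systemctl show, kills that process, then polls until the unit is active again with a different main PID.

```shell
#!/bin/sh
# Print the main PID systemd has recorded for a unit ("0" if none).
main_pid() {
    systemctl show -p MainPID "$1" | cut -d= -f2
}

# Poll once per second until the unit is active with a main PID that
# differs from the old one, or give up after the timeout (in seconds).
wait_restarted() {
    unit=$1
    old_pid=$2
    timeout=${3:-15}
    elapsed=0
    while [ "$elapsed" -lt "$timeout" ]; do
        new_pid=$(main_pid "$unit")
        if systemctl is-active --quiet "$unit" \
            && [ "$new_pid" != "0" ] && [ "$new_pid" != "$old_pid" ]; then
            return 0
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    return 1
}

# Force-kill the unit's main process and wait for systemd to restart it.
crash_test() {
    unit=$1
    pid=$(main_pid "$unit")
    if [ -z "$pid" ] || [ "$pid" = "0" ]; then
        echo "no main PID known for $unit" >&2
        return 1
    fi
    sudo kill -9 "$pid" || return 1
    wait_restarted "$unit" "$pid" "${2:-15}"
}
```

Calling crash_test delayed_job should return success once the service is back. Comparing PIDs rather than only checking is-active avoids a false pass in the brief window before systemd notices that the old process has died.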

If you’re using an older distro, or one that doesn’t have systemd for any reason, check out Digital Ocean’s exhaustive guide to automatic service recovery.