We were recently faced with a service that crashed occasionally on one of our Linux servers. We had to find a way to make it recover automatically, ideally using tools that were already present on the server.
The landscape of process management in Linux is somewhat complex and in a state of
flux, with different tools vying to replace the venerable SysV-style
init that sysadmins have relied on for decades.
Luckily, it looks like the highly-capable systemd is becoming the standard, and we’ll
hopefully have another long stable period soon.
In the meantime, we needed to get our service running reliably. In our case, the
delayed_job service that supported one of our Rails apps was stopping occasionally. The
service was controlled by a SysV-style init script in /etc/init.d, and it was started
automatically when the system started up. Unfortunately, SysV-style init doesn’t include
monitoring or auto-restart capability*.
* You can actually use the respawn feature in the /etc/inittab file, but that
looked too scary to us!
Now that systemd is becoming a standard (present on Ubuntu 14.10+, RHEL 7+, CentOS
7+, and Fedora 15+), we wanted to learn how to use it. systemd is controlled by
.service files in a variety of locations. However, we wanted to avoid rewriting the
existing startup script. Our research revealed the following:
- systemd is compatible with legacy /etc/init.d scripts in this manner: when systemd
  loads service definitions, the systemd-sysv-generator generates .service files on the
  fly from the scripts in /etc/init.d.
- You can add configuration to a service by adding “drop-in” files to a correctly-named
  folder in /etc/systemd/system/.
So, we simply had to add a file like this at /etc/systemd/system/delayed_job.service.d/restart.conf:
[Service]
Type=forking
PIDFile=/srv/www/sites/rails_app/current/tmp/pids/delayed_job.pid
RemainAfterExit=no
Restart=on-failure
RestartSec=5s
This configuration will vary for different services. You’ll have to read the
systemd.service docs to figure out the specifics. In particular, you should read the
docs for the Type, Restart, GuessMainPID, PIDFile, and RemainAfterExit options.
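Creating the drop-in can also be scripted. The following is a minimal sketch rather than something from our actual deployment: write_restart_dropin is a hypothetical helper name, and the optional third argument (a base directory other than /etc/systemd/system) is an assumption added only so the function can be tried out safely in a scratch directory first.

```shell
#!/bin/sh
# Sketch: write a restart drop-in for a systemd unit.
# write_restart_dropin is a hypothetical helper, not part of systemd.
# Usage: write_restart_dropin <unit-name> <pidfile> [systemd-dir]
write_restart_dropin() {
    unit="$1"
    pidfile="$2"
    # Default to the system-wide location used in this post.
    base="${3:-/etc/systemd/system}"
    dir="$base/$unit.service.d"

    mkdir -p "$dir" || return 1
    # Same settings as the restart.conf shown above; adjust per service.
    cat > "$dir/restart.conf" <<EOF
[Service]
Type=forking
PIDFile=$pidfile
RemainAfterExit=no
Restart=on-failure
RestartSec=5s
EOF
}

# Example (run as root):
#   write_restart_dropin delayed_job /srv/www/sites/rails_app/current/tmp/pids/delayed_job.pid
```

Writing the file by hand works just as well; the only part that matters to systemd is that the folder is named after the unit with a .service.d suffix.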
Then, after reloading our service definitions with systemctl daemon-reload, we could see
that systemd knew what it needed to know about our service.
$ sudo systemctl status delayed_job
● delayed_job.service - LSB: Manage delayed jobs for application rails_app
   Loaded: loaded (/etc/init.d/delayed_job; bad; vendor preset: enabled)
  Drop-In: /etc/systemd/system/delayed_job.service.d
           └─restart.conf
   Active: active (running) since Mon 2017-10-23 15:33:25 CDT; 12min ago
     Docs: man:systemd-sysv-generator(8)
  Process: 6359 ExecStop=/etc/init.d/delayed_job stop (code=exited, status=0/SUCCESS)
  Process: 6389 ExecStart=/etc/init.d/delayed_job start (code=exited, status=0/SUCCESS)
 Main PID: 6412 (bundle)
   CGroup: /system.slice/delayed_job.service
           ‣ 6412 delayed_job
Notice the “Drop-In” section, which tells us that systemd knows about the new config
file we added, and the Main PID line, which indicates that systemd knows which process
to monitor.
We can test this config by force-killing the service in question: sudo kill -9 6412.
Running systemctl status delayed_job immediately will show something like Active:
deactivating (stop) (Result: signal). After about 5 seconds (the RestartSec value in the
config), it should return to saying Active: active (running).
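The manual test above can also be wrapped in a small script. This is a rough sketch under a few assumptions: check_auto_restart is a hypothetical helper name, it needs to run as root, and the default 10-second wait is simply a guess that leaves some slack over the RestartSec=5s in our config.

```shell
#!/bin/sh
# Sketch: kill a unit's main process and check that systemd restarts it.
# check_auto_restart is a hypothetical helper; run it as root.
# Usage: check_auto_restart <unit-name> [seconds-to-wait]
check_auto_restart() {
    unit="$1"
    wait_secs="${2:-10}"   # a bit longer than RestartSec=5s

    # "systemctl show -p MainPID" prints a line like "MainPID=6412".
    old_pid=$(systemctl show -p MainPID "$unit" | cut -d= -f2)
    if [ -z "$old_pid" ] || [ "$old_pid" = "0" ]; then
        echo "$unit has no main PID; is it running?" >&2
        return 1
    fi

    # Simulate a crash, then give systemd time to restart the service.
    kill -9 "$old_pid"
    sleep "$wait_secs"

    new_pid=$(systemctl show -p MainPID "$unit" | cut -d= -f2)
    if systemctl is-active --quiet "$unit" && [ "$new_pid" != "$old_pid" ]; then
        echo "restarted: PID $old_pid -> $new_pid"
    else
        echo "$unit did not come back within ${wait_secs}s" >&2
        return 1
    fi
}

# Example (run as root):
#   check_auto_restart delayed_job
```

Comparing the old and new main PID guards against the case where the kill silently failed and the original process is still the one running.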
If you’re using an older distro, or one that doesn’t have systemd for any reason, check
out DigitalOcean’s exhaustive guide to automatic service recovery.