We were recently faced with a service that crashed occasionally on one of our Linux servers. We had to find a way to make it recover automatically, ideally using tools that were already present on the server.
The landscape of process management in Linux is somewhat complex and is in a state of flux, with different tools vying to replace the venerable SysV-style init that sysadmins have relied on for decades. Luckily, it looks like the highly capable systemd is becoming the standard, and we’ll hopefully have another long stable period soon.
In the meantime, we needed to get our service running reliably. In our case, the delayed_job service that supported one of our Rails apps was stopping occasionally. The service was controlled by a SysV-style init script in /etc/init.d, and it was started automatically when the system started up. Unfortunately, SysV-style init doesn’t include monitoring or auto-restart capability*.
* You can actually use the respawn feature in the /etc/inittab file, but that looked too scary to us!
Now that systemd is becoming a standard (present on Ubuntu 14.10+, Red Hat 7+, CentOS 7+, and Fedora 15+), we wanted to learn how to use it. systemd is controlled by .service files in a variety of locations. However, we wanted to avoid rewriting the existing startup script. Our research revealed the following:
- systemd is compatible with legacy /etc/init.d scripts in this manner: when systemd loads service definitions, the systemd-sysv-generator generates .service files on the fly from the scripts in /etc/init.d.
- You can add configuration to a service by adding “drop-in” files to a correctly named folder in /etc/systemd/system/.
So, we simply had to add a file like this at /etc/systemd/system/delayed_job.service.d/restart.conf:
[Service]
Type=forking
PIDFile=/srv/www/sites/rails_app/current/tmp/pids/delayed_job.pid
RemainAfterExit=no
Restart=on-failure
RestartSec=5s
This configuration will vary for different services. You’ll have to read the systemd.service docs to figure out the specifics. In particular, you should read the docs for the Type, Restart, GuessMainPID, PIDFile, and RemainAfterExit options.
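If you would rather script the creation of the drop-in than edit it by hand, something like the following might work. This is only a sketch: write_restart_dropin is a helper name we made up for illustration, and it takes the systemd configuration root as an argument so you can try it against a scratch directory before pointing it at /etc/systemd/system.

```shell
#!/bin/sh
# Sketch only: write the restart drop-in for delayed_job.
# write_restart_dropin is a hypothetical helper; $1 is the systemd
# configuration root (normally /etc/systemd/system).
write_restart_dropin() {
    dropin_dir="$1/delayed_job.service.d"
    mkdir -p "$dropin_dir" || return 1
    # Quoted heredoc delimiter, so the contents are written literally.
    cat > "$dropin_dir/restart.conf" <<'EOF'
[Service]
Type=forking
PIDFile=/srv/www/sites/rails_app/current/tmp/pids/delayed_job.pid
RemainAfterExit=no
Restart=on-failure
RestartSec=5s
EOF
}

# Example (run as root, since /etc/systemd/system is not user-writable):
#   write_restart_dropin /etc/systemd/system
```

The directory name has to match the unit name plus a .service.d suffix, and the file inside it needs a .conf extension for systemd to pick it up.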
Then, after reloading our service definitions with systemctl daemon-reload, we could see that systemd knew what it needed to know about our service.
$ sudo systemctl status delayed_job
● delayed_job.service - LSB: Manage delayed jobs for application rails_app
   Loaded: loaded (/etc/init.d/delayed_job; bad; vendor preset: enabled)
  Drop-In: /etc/systemd/system/delayed_job.service.d
           └─restart.conf
   Active: active (running) since Mon 2017-10-23 15:33:25 CDT; 12min ago
     Docs: man:systemd-sysv-generator(8)
  Process: 6359 ExecStop=/etc/init.d/delayed_job stop (code=exited, status=0/SUCCESS)
  Process: 6389 ExecStart=/etc/init.d/delayed_job start (code=exited, status=0/SUCCESS)
 Main PID: 6412 (bundle)
   CGroup: /system.slice/delayed_job.service
           ‣ 6412 delayed_job
Notice the “Drop-In” section, which tells us that systemd knows about the new config file we added, and the Main PID line, which indicates that systemd knows which process to monitor.
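The same two facts can also be checked from a script rather than by eye. The sketch below assumes you feed it the KEY=VALUE lines that systemctl show prints for the Restart and MainPID properties; check_restart_props is a name we invented for this example, not something systemd ships.

```shell
#!/bin/sh
# Sketch only: read `systemctl show`-style KEY=VALUE lines on stdin and
# succeed only if a restart policy is set and a main PID is being tracked.
# check_restart_props is a hypothetical helper name.
check_restart_props() {
    awk -F= '
        $1 == "Restart" && $2 != "" && $2 != "no" { restart = 1 }
        $1 == "MainPID" && $2 + 0 > 0             { pid = 1 }
        END { exit !(restart && pid) }
    '
}

# Example:
#   systemctl show -p Restart -p MainPID delayed_job | check_restart_props \
#       && echo "restart policy and main PID look good"
```

A MainPID of 0 would suggest that systemd could not work out which process to watch, in which case the PIDFile setting is the first thing we would double-check.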
We can test this config by force-killing the service in question: sudo kill -9 6412. Running systemctl status delayed_job immediately will show something like Active: deactivating (stop) (Result: signal). After about 5 seconds (the RestartSec value in the config), it should return to saying Active: active (running).
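If you want to repeat this test without watching the status output, a small polling loop might do. The sketch below is our own illustration: main_pid and wait_for_new_pid are made-up helper names, and the loop simply waits until systemd reports the unit as active with a main PID different from the one that was killed.

```shell
#!/bin/sh
# Sketch only: helpers for checking that a killed service came back.
# main_pid and wait_for_new_pid are hypothetical names.

# Print the main PID systemd is tracking for a unit.
main_pid() {
    systemctl show -p MainPID "$1" | cut -d= -f2
}

# wait_for_new_pid UNIT OLD_PID TIMEOUT_SECONDS
# Polls once per second; prints the new PID and returns 0 once the unit is
# active again under a different PID, or returns 1 when the timeout expires.
wait_for_new_pid() {
    unit="$1"; old_pid="$2"; timeout="$3"; elapsed=0
    while [ "$elapsed" -lt "$timeout" ]; do
        new_pid=$(main_pid "$unit")
        if [ -n "$new_pid" ] && [ "$new_pid" != "0" ] \
            && [ "$new_pid" != "$old_pid" ] \
            && systemctl is-active --quiet "$unit"; then
            echo "$new_pid"
            return 0
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    return 1
}

# Example:
#   old=$(main_pid delayed_job)
#   sudo kill -9 "$old"
#   wait_for_new_pid delayed_job "$old" 15 && echo "service recovered"
```

Comparing PIDs rather than only checking for "active" avoids a false positive if the check runs before systemd has noticed that the old process died.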
If you’re using an older distro, or one that doesn’t have systemd for any reason, check out Digital Ocean’s exhaustive guide to automatic service recovery.