Hardening systemd services via system call filter
No code is perfect, so it’s useful to limit its potential for doing harm. systemd provides a system call filter which we can use to do just that, but it’s easy to make the filter so strict that it breaks the service.
Let’s start with systemd-analyze security SERVICE to get a list of settings which can limit how much risk the service poses to the system (“exposure”). (We’re only looking into system calls in this article, but I would recommend looking into everything it lists.) Here’s an example analysing the SSH daemon with systemd-analyze security sshd:
NAME DESCRIPTION EXPOSURE
✓ AmbientCapabilities= Service process does not receive ambient capabilities
✗ CapabilityBoundingSet=~CAP_AUDIT_* Service has audit subsystem access 0.1
✗ CapabilityBoundingSet=~CAP_BLOCK_SUSPEND Service may establish wake locks 0.1
✗ CapabilityBoundingSet=~CAP_BPF Service may load BPF programs 0.1
✗ CapabilityBoundingSet=~CAP_(CHOWN|FSETID|SETFCAP) Service may change file ownership/access mode/capabilities unrestricted 0.2
✗ CapabilityBoundingSet=~CAP_(DAC_*|FOWNER|IPC_OWNER) Service may override UNIX file/IPC permission checks 0.2
✗ CapabilityBoundingSet=~CAP_IPC_LOCK Service may lock memory into RAM 0.1
✗ CapabilityBoundingSet=~CAP_KILL Service may send UNIX signals to arbitrary processes 0.1
✗ CapabilityBoundingSet=~CAP_LEASE Service may create file leases 0.1
✗ CapabilityBoundingSet=~CAP_LINUX_IMMUTABLE Service may mark files immutable 0.1
✗ CapabilityBoundingSet=~CAP_MAC_* Service may adjust SMACK MAC 0.1
✗ CapabilityBoundingSet=~CAP_MKNOD Service may create device nodes 0.1
✗ CapabilityBoundingSet=~CAP_NET_ADMIN Service has network configuration privileges 0.2
✗ CapabilityBoundingSet=~CAP_NET_(BIND_SERVICE|BROADCAST|RAW) Service has elevated networking privileges 0.1
✗ CapabilityBoundingSet=~CAP_SET(UID|GID|PCAP) Service may change UID/GID identities/capabilities 0.3
✗ CapabilityBoundingSet=~CAP_SYS_ADMIN Service has administrator privileges 0.3
✗ CapabilityBoundingSet=~CAP_SYS_BOOT Service may issue reboot() 0.1
✗ CapabilityBoundingSet=~CAP_SYS_CHROOT Service may issue chroot() 0.1
✗ CapabilityBoundingSet=~CAP_SYSLOG Service has access to kernel logging 0.1
✗ CapabilityBoundingSet=~CAP_SYS_MODULE Service may load kernel modules 0.2
✗ CapabilityBoundingSet=~CAP_SYS_(NICE|RESOURCE) Service has privileges to change resource use parameters 0.1
✗ CapabilityBoundingSet=~CAP_SYS_PACCT Service may use acct() 0.1
✗ CapabilityBoundingSet=~CAP_SYS_PTRACE Service has ptrace() debugging abilities 0.3
✗ CapabilityBoundingSet=~CAP_SYS_RAWIO Service has raw I/O access 0.2
✗ CapabilityBoundingSet=~CAP_SYS_TIME Service processes may change the system clock 0.2
✗ CapabilityBoundingSet=~CAP_SYS_TTY_CONFIG Service may issue vhangup() 0.1
✗ CapabilityBoundingSet=~CAP_WAKE_ALARM Service may program timers that wake up the system 0.1
✓ Delegate= Service does not maintain its own delegated control group subtree
✗ DeviceAllow= Service has no device ACL 0.2
✗ IPAddressDeny= Service does not define an IP address allow list 0.2
✓ KeyringMode= Service doesn't share key material with other services
✗ LockPersonality= Service may change ABI personality 0.1
✗ MemoryDenyWriteExecute= Service may create writable executable memory mappings 0.1
✗ NoNewPrivileges= Service processes may acquire new privileges 0.2
✓ NotifyAccess= Service child processes cannot alter service state
✗ PrivateDevices= Service potentially has access to hardware devices 0.2
✗ PrivateMounts= Service may install system mounts 0.2
✗ PrivateNetwork= Service has access to the host's network 0.5
✗ PrivateTmp= Service has access to other software's temporary files 0.2
✗ PrivateUsers= Service has access to other users 0.2
✗ ProcSubset= Service has full access to non-process /proc files (/proc subset=) 0.1
✗ ProtectClock= Service may write to the hardware clock or system clock 0.2
✗ ProtectControlGroups= Service may modify the control group file system 0.2
✗ ProtectHome= Service has full access to home directories 0.2
✗ ProtectHostname= Service may change system host/domainname 0.1
✗ ProtectKernelLogs= Service may read from or write to the kernel log ring buffer 0.2
✗ ProtectKernelModules= Service may load or read kernel modules 0.2
✗ ProtectKernelTunables= Service may alter kernel tunables 0.2
✗ ProtectProc= Service has full access to process tree (/proc hidepid=) 0.2
✗ ProtectSystem= Service has full access to the OS file hierarchy 0.2
RemoveIPC= Service runs as root, option does not apply
✗ RestrictAddressFamilies=~AF_(INET|INET6) Service may allocate Internet sockets 0.3
✗ RestrictAddressFamilies=~AF_NETLINK Service may allocate netlink sockets 0.1
✗ RestrictAddressFamilies=~AF_PACKET Service may allocate packet sockets 0.2
✗ RestrictAddressFamilies=~AF_UNIX Service may allocate local sockets 0.1
✗ RestrictAddressFamilies=~… Service may allocate exotic sockets 0.3
✗ RestrictNamespaces=~cgroup Service may create cgroup namespaces 0.1
✗ RestrictNamespaces=~ipc Service may create IPC namespaces 0.1
✗ RestrictNamespaces=~mnt Service may create file system namespaces 0.1
✗ RestrictNamespaces=~net Service may create network namespaces 0.1
✗ RestrictNamespaces=~pid Service may create process namespaces 0.1
✗ RestrictNamespaces=~user Service may create user namespaces 0.3
✗ RestrictNamespaces=~uts Service may create hostname namespaces 0.1
✗ RestrictRealtime= Service may acquire realtime scheduling 0.1
✗ RestrictSUIDSGID= Service may create SUID/SGID files 0.2
✗ RootDirectory=/RootImage= Service runs within the host's root directory 0.1
SupplementaryGroups= Service runs as root, option does not matter
✗ SystemCallArchitectures= Service may execute system calls with all ABIs 0.2
✗ SystemCallFilter=~@clock Service does not filter system calls 0.2
✗ SystemCallFilter=~@cpu-emulation Service does not filter system calls 0.1
✗ SystemCallFilter=~@debug Service does not filter system calls 0.2
✗ SystemCallFilter=~@module Service does not filter system calls 0.2
✗ SystemCallFilter=~@mount Service does not filter system calls 0.2
✗ SystemCallFilter=~@obsolete Service does not filter system calls 0.1
✗ SystemCallFilter=~@privileged Service does not filter system calls 0.2
✗ SystemCallFilter=~@raw-io Service does not filter system calls 0.2
✗ SystemCallFilter=~@reboot Service does not filter system calls 0.2
✗ SystemCallFilter=~@resources Service does not filter system calls 0.2
✗ SystemCallFilter=~@swap Service does not filter system calls 0.2
✗ UMask= Files created by service are world-readable by default 0.1
✗ User=/DynamicUser= Service runs as root user 0.4
→ Overall exposure level for sshd.service: 9.6 UNSAFE 😨
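Since this article only covers system calls, it can help to cut a report like this down to the relevant rows. A minimal sketch, assuming the report format shown above; syscall_rows is a hypothetical helper name, and it reads the report from standard input so it can also be tried on saved output:

```shell
#!/bin/sh
# Hypothetical helper: keep only the system call related rows of a
# `systemd-analyze security` report read from standard input.
syscall_rows() {
    grep -E 'SystemCall(Filter|Architectures)='
}

# Usage (on a machine running systemd):
#   systemd-analyze security sshd | syscall_rows
```

The rest of the report is still worth reading in full; the filter only makes the system call rows easier to compare before and after a change.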
As we can see, any issues with the SSH daemon would open up the system to a lot of scary side effects. Based on discussions it seems that limiting the capabilities of the SSH daemon without breaking any of the vast array of functionality it provides is actually quite difficult. But most services are much simpler and can be hardened a lot more, for example:
NAME DESCRIPTION EXPOSURE
✓ SystemCallFilter=~@clock System call allow list defined for service, and @clock is not included
✓ SystemCallFilter=~@cpu-emulation System call allow list defined for service, and @cpu-emulation is not included
✓ SystemCallFilter=~@debug System call allow list defined for service, and @debug is not included
✓ SystemCallFilter=~@module System call allow list defined for service, and @module is not included
✓ SystemCallFilter=~@mount System call allow list defined for service, and @mount is not included
✓ SystemCallFilter=~@obsolete System call allow list defined for service, and @obsolete is not included
✓ SystemCallFilter=~@privileged System call allow list defined for service, and @privileged is not included
✓ SystemCallFilter=~@raw-io System call allow list defined for service, and @raw-io is not included
✓ SystemCallFilter=~@reboot System call allow list defined for service, and @reboot is not included
✗ SystemCallFilter=~@resources System call allow list defined for service, and @resources is included (e.g. ioprio_set is allowed) 0.2
✓ SystemCallFilter=~@swap System call allow list defined for service, and @swap is not included
Not bad! This service only needs access to some of the system calls in the resources group, such as ioprio_set. We can use systemctl show --property=SystemCallFilter SERVICE to show the full list.
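For reference, an allow list of this kind can be set up in a drop-in file. The sketch below is an assumption for illustration, not the configuration of the service analysed above: it prints a drop-in which allows systemd's predefined @system-service group and then removes the @privileged group again. print_hardening_dropin is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: print an example drop-in for a service.
# The chosen groups are an assumption; adjust them to the service at hand.
print_hardening_dropin() {
    cat <<'EOF'
[Service]
# Start from an allow list of calls commonly needed by services,
SystemCallFilter=@system-service
# then remove the privileged calls from it again.
SystemCallFilter=~@privileged
# Only allow system calls for the native architecture.
SystemCallArchitectures=native
EOF
}

# Usage: paste the output into the editor opened by
#   systemctl edit SERVICE
# then restart the service and check the result with
#   systemctl show --property=SystemCallFilter SERVICE
```

Starting from an allow list and removing groups tends to be easier to reason about than a pure deny list, because anything not mentioned stays blocked.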
By default, any system call which is not allowed kills the process with signal 31 (SIGSYS, shown as “SYS”), and the result can be inspected in the core dump log. Below is an example from a service which I had changed without updating the hardening settings:
❯ journalctl --identifier=systemd-coredump --lines=1 --output=cat
Process 582 (sed) of user 1000 dumped core.
Module [omitted]/sed without build-id.
Stack trace of thread 582:
#0 0x00007f8f72d1051b fchown (libc.so.6 + 0x11051b)
#1 0x000000000040817b closedown ([omitted]/sed + 0x817b)
#2 0x0000000000408d70 read_pattern_space ([omitted]/sed + 0x8d70)
#3 0x000000000040ad4d process_files ([omitted]/sed + 0xad4d)
#4 0x0000000000403abf main ([omitted]/sed + 0x3abf)
#5 0x00007f8f72c2a4d8 __libc_start_call_main (libc.so.6 + 0x2a4d8)
#6 0x00007f8f72c2a59b __libc_start_main@@GLIBC_2.34 (libc.so.6 + 0x2a59b)
#7 0x0000000000403be5 _start ([omitted]/sed + 0x3be5)
ELF object binary architecture: AMD x86-64
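The frame at the top of the stack trace is usually the most interesting one. A minimal sketch for pulling it out of a log entry like the one above, assuming the frame layout shown there (frame number, address, function name); top_frame is a hypothetical helper name:

```shell
#!/bin/sh
# Hypothetical helper: print the function name of the first frame #0 found
# in a systemd-coredump log entry read from standard input.
top_frame() {
    awk '$1 == "#0" && !seen++ { print $3 }'
}

# Usage:
#   journalctl --identifier=systemd-coredump --lines=1 --output=cat | top_frame
```

The name printed is that of the libc function at the top of the stack, which often but not necessarily matches the name of the blocked system call.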
In this case we can see that sed tried to run fchown. We can use systemd-analyze syscall-filter to see which system calls (and other groups) are in each group, and searching for fchown we can see that the chown group contains fchown. So we’ll need to either
- add fchown or @chown to SystemCallFilter (easy, and low risk since the service runs as my user);
- change the sed command to not need to call fchown (potentially impossible without changing sed itself); or
- use some other command than sed in the service (potentially lots of work).
The first option is easy, and the risk is tolerable, so that’s what I ended up doing.
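The search for the group containing a system call can also be scripted. This is a minimal sketch, assuming that systemd-analyze syscall-filter prints each group name in the first column with its members indented below it; groups_containing is a hypothetical helper name:

```shell
#!/bin/sh
# Hypothetical helper: list the groups which directly contain a system call.
# Reads `systemd-analyze syscall-filter` output from standard input, assuming
# group names start in the first column and members are indented below them.
groups_containing() {
    awk -v want="$1" '
        /^@/       { group = $1; next }  # a new group header
        $1 == want { print group }       # a member line matching the call
    '
}

# Usage:
#   systemd-analyze syscall-filter | groups_containing fchown
```

Only direct membership is reported; a group which merely includes another group containing the call is not listed.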
Knowing which commands are available and how to find relevant debugging info is half the battle, so I hope this was useful.
Thanks to _Andrew on the NixOS Discourse for the journalctl tip!