No code is perfect, so it’s useful to limit its potential for doing harm. systemd provides a system call filter which we can use to do just that, but it’s easy to make the filter so strict that it breaks the service.

Let’s start with systemd-analyze security SERVICE, which lists settings that can reduce how much risk the service poses to the system (its “exposure”). (This article only looks at system calls, but I would recommend looking into everything it lists.) Here’s an example analysing the SSH daemon with systemd-analyze security sshd:

  NAME                                                        DESCRIPTION                                                             EXPOSURE
✓ AmbientCapabilities=                                        Service process does not receive ambient capabilities
✗ CapabilityBoundingSet=~CAP_AUDIT_*                          Service has audit subsystem access                                           0.1
✗ CapabilityBoundingSet=~CAP_BLOCK_SUSPEND                    Service may establish wake locks                                             0.1
✗ CapabilityBoundingSet=~CAP_BPF                              Service may load BPF programs                                                0.1
✗ CapabilityBoundingSet=~CAP_(CHOWN|FSETID|SETFCAP)           Service may change file ownership/access mode/capabilities unrestricted      0.2
✗ CapabilityBoundingSet=~CAP_(DAC_*|FOWNER|IPC_OWNER)         Service may override UNIX file/IPC permission checks                         0.2
✗ CapabilityBoundingSet=~CAP_IPC_LOCK                         Service may lock memory into RAM                                             0.1
✗ CapabilityBoundingSet=~CAP_KILL                             Service may send UNIX signals to arbitrary processes                         0.1
✗ CapabilityBoundingSet=~CAP_LEASE                            Service may create file leases                                               0.1
✗ CapabilityBoundingSet=~CAP_LINUX_IMMUTABLE                  Service may mark files immutable                                             0.1
✗ CapabilityBoundingSet=~CAP_MAC_*                            Service may adjust SMACK MAC                                                 0.1
✗ CapabilityBoundingSet=~CAP_MKNOD                            Service may create device nodes                                              0.1
✗ CapabilityBoundingSet=~CAP_NET_ADMIN                        Service has network configuration privileges                                 0.2
✗ CapabilityBoundingSet=~CAP_NET_(BIND_SERVICE|BROADCAST|RAW) Service has elevated networking privileges                                   0.1
✗ CapabilityBoundingSet=~CAP_SET(UID|GID|PCAP)                Service may change UID/GID identities/capabilities                           0.3
✗ CapabilityBoundingSet=~CAP_SYS_ADMIN                        Service has administrator privileges                                         0.3
✗ CapabilityBoundingSet=~CAP_SYS_BOOT                         Service may issue reboot()                                                   0.1
✗ CapabilityBoundingSet=~CAP_SYS_CHROOT                       Service may issue chroot()                                                   0.1
✗ CapabilityBoundingSet=~CAP_SYSLOG                           Service has access to kernel logging                                         0.1
✗ CapabilityBoundingSet=~CAP_SYS_MODULE                       Service may load kernel modules                                              0.2
✗ CapabilityBoundingSet=~CAP_SYS_(NICE|RESOURCE)              Service has privileges to change resource use parameters                     0.1
✗ CapabilityBoundingSet=~CAP_SYS_PACCT                        Service may use acct()                                                       0.1
✗ CapabilityBoundingSet=~CAP_SYS_PTRACE                       Service has ptrace() debugging abilities                                     0.3
✗ CapabilityBoundingSet=~CAP_SYS_RAWIO                        Service has raw I/O access                                                   0.2
✗ CapabilityBoundingSet=~CAP_SYS_TIME                         Service processes may change the system clock                                0.2
✗ CapabilityBoundingSet=~CAP_SYS_TTY_CONFIG                   Service may issue vhangup()                                                  0.1
✗ CapabilityBoundingSet=~CAP_WAKE_ALARM                       Service may program timers that wake up the system                           0.1
✓ Delegate=                                                   Service does not maintain its own delegated control group subtree
✗ DeviceAllow=                                                Service has no device ACL                                                    0.2
✗ IPAddressDeny=                                              Service does not define an IP address allow list                             0.2
✓ KeyringMode=                                                Service doesn't share key material with other services
✗ LockPersonality=                                            Service may change ABI personality                                           0.1
✗ MemoryDenyWriteExecute=                                     Service may create writable executable memory mappings                       0.1
✗ NoNewPrivileges=                                            Service processes may acquire new privileges                                 0.2
✓ NotifyAccess=                                               Service child processes cannot alter service state
✗ PrivateDevices=                                             Service potentially has access to hardware devices                           0.2
✗ PrivateMounts=                                              Service may install system mounts                                            0.2
✗ PrivateNetwork=                                             Service has access to the host's network                                     0.5
✗ PrivateTmp=                                                 Service has access to other software's temporary files                       0.2
✗ PrivateUsers=                                               Service has access to other users                                            0.2
✗ ProcSubset=                                                 Service has full access to non-process /proc files (/proc subset=)           0.1
✗ ProtectClock=                                               Service may write to the hardware clock or system clock                      0.2
✗ ProtectControlGroups=                                       Service may modify the control group file system                             0.2
✗ ProtectHome=                                                Service has full access to home directories                                  0.2
✗ ProtectHostname=                                            Service may change system host/domainname                                    0.1
✗ ProtectKernelLogs=                                          Service may read from or write to the kernel log ring buffer                 0.2
✗ ProtectKernelModules=                                       Service may load or read kernel modules                                      0.2
✗ ProtectKernelTunables=                                      Service may alter kernel tunables                                            0.2
✗ ProtectProc=                                                Service has full access to process tree (/proc hidepid=)                     0.2
✗ ProtectSystem=                                              Service has full access to the OS file hierarchy                             0.2
  RemoveIPC=                                                  Service runs as root, option does not apply
✗ RestrictAddressFamilies=~AF_(INET|INET6)                    Service may allocate Internet sockets                                        0.3
✗ RestrictAddressFamilies=~AF_NETLINK                         Service may allocate netlink sockets                                         0.1
✗ RestrictAddressFamilies=~AF_PACKET                          Service may allocate packet sockets                                          0.2
✗ RestrictAddressFamilies=~AF_UNIX                            Service may allocate local sockets                                           0.1
✗ RestrictAddressFamilies=~…                                  Service may allocate exotic sockets                                          0.3
✗ RestrictNamespaces=~cgroup                                  Service may create cgroup namespaces                                         0.1
✗ RestrictNamespaces=~ipc                                     Service may create IPC namespaces                                            0.1
✗ RestrictNamespaces=~mnt                                     Service may create file system namespaces                                    0.1
✗ RestrictNamespaces=~net                                     Service may create network namespaces                                        0.1
✗ RestrictNamespaces=~pid                                     Service may create process namespaces                                        0.1
✗ RestrictNamespaces=~user                                    Service may create user namespaces                                           0.3
✗ RestrictNamespaces=~uts                                     Service may create hostname namespaces                                       0.1
✗ RestrictRealtime=                                           Service may acquire realtime scheduling                                      0.1
✗ RestrictSUIDSGID=                                           Service may create SUID/SGID files                                           0.2
✗ RootDirectory=/RootImage=                                   Service runs within the host's root directory                                0.1
  SupplementaryGroups=                                        Service runs as root, option does not matter
✗ SystemCallArchitectures=                                    Service may execute system calls with all ABIs                               0.2
✗ SystemCallFilter=~@clock                                    Service does not filter system calls                                         0.2
✗ SystemCallFilter=~@cpu-emulation                            Service does not filter system calls                                         0.1
✗ SystemCallFilter=~@debug                                    Service does not filter system calls                                         0.2
✗ SystemCallFilter=~@module                                   Service does not filter system calls                                         0.2
✗ SystemCallFilter=~@mount                                    Service does not filter system calls                                         0.2
✗ SystemCallFilter=~@obsolete                                 Service does not filter system calls                                         0.1
✗ SystemCallFilter=~@privileged                               Service does not filter system calls                                         0.2
✗ SystemCallFilter=~@raw-io                                   Service does not filter system calls                                         0.2
✗ SystemCallFilter=~@reboot                                   Service does not filter system calls                                         0.2
✗ SystemCallFilter=~@resources                                Service does not filter system calls                                         0.2
✗ SystemCallFilter=~@swap                                     Service does not filter system calls                                         0.2
✗ UMask=                                                      Files created by service are world-readable by default                       0.1
✗ User=/DynamicUser=                                          Service runs as root user                                                    0.4

→ Overall exposure level for sshd.service: 9.6 UNSAFE 😨

As we can see, any issue with the SSH daemon would open up the system to a lot of scary side effects. Based on discussions, it seems that limiting the capabilities of the SSH daemon without breaking any of the vast array of functionality it provides is actually quite difficult. But most services are much simpler and can be hardened a lot more. For example, here are the system call rows for one such service:

  NAME                             DESCRIPTION                                                                                         EXPOSURE
✓ SystemCallFilter=~@clock         System call allow list defined for service, and @clock is not included
✓ SystemCallFilter=~@cpu-emulation System call allow list defined for service, and @cpu-emulation is not included
✓ SystemCallFilter=~@debug         System call allow list defined for service, and @debug is not included
✓ SystemCallFilter=~@module        System call allow list defined for service, and @module is not included
✓ SystemCallFilter=~@mount         System call allow list defined for service, and @mount is not included
✓ SystemCallFilter=~@obsolete      System call allow list defined for service, and @obsolete is not included
✓ SystemCallFilter=~@privileged    System call allow list defined for service, and @privileged is not included
✓ SystemCallFilter=~@raw-io        System call allow list defined for service, and @raw-io is not included
✓ SystemCallFilter=~@reboot        System call allow list defined for service, and @reboot is not included
✗ SystemCallFilter=~@resources     System call allow list defined for service, and @resources is included (e.g. ioprio_set is allowed)      0.2
✓ SystemCallFilter=~@swap          System call allow list defined for service, and @swap is not included

Not bad! This service only needs some of the system calls in the @resources group, such as ioprio_set. We can use systemctl show --property=SystemCallFilter SERVICE to show the full list of system calls in the filter.
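To check a single system call against that list without reading through all of it, a small shell helper can do the searching. This is only a sketch: the function name filter_lists is made up for this example, and it assumes systemctl prints the property value as a space-separated list of system call names, possibly with a ~ in front when the list is a deny list.

```shell
# Hypothetical helper: succeeds if the system call $1 appears in the
# space-separated SystemCallFilter value read from standard input.
# A "~" marker (deny list) is dropped before comparing names.
filter_lists() {
    tr -d '~' | tr -s ' ' '\n' | grep -qxF -- "$1"
}
```

It could be used as systemctl show --property=SystemCallFilter --value SERVICE | filter_lists ioprio_set && echo listed. Keep in mind that for a deny list, being listed means the call is blocked rather than allowed.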

By default, any system call which is not allowed gets the process killed with signal 31 (SIGSYS), and the resulting core dump can be inspected in the journal. Below is an example from a service which I had changed without updating the hardening settings:

❯ journalctl --identifier=systemd-coredump --lines=1 --output=cat
Process 582 (sed) of user 1000 dumped core.

Module [omitted]/sed without build-id.
Stack trace of thread 582:
#0  0x00007f8f72d1051b fchown (libc.so.6 + 0x11051b)
#1  0x000000000040817b closedown ([omitted]/sed + 0x817b)
#2  0x0000000000408d70 read_pattern_space ([omitted]/sed + 0x8d70)
#3  0x000000000040ad4d process_files ([omitted]/sed + 0xad4d)
#4  0x0000000000403abf main ([omitted]/sed + 0x3abf)
#5  0x00007f8f72c2a4d8 __libc_start_call_main (libc.so.6 + 0x2a4d8)
#6  0x00007f8f72c2a59b __libc_start_main@@GLIBC_2.34 (libc.so.6 + 0x2a59b)
#7  0x0000000000403be5 _start ([omitted]/sed + 0x3be5)
ELF object binary architecture: AMD x86-64

In this case we can see that sed tried to call fchown. We can use systemd-analyze syscall-filter to see which system calls (and other groups) are in each group. Searching its output for fchown shows that it belongs to the @chown group. So we’ll need to either

  1. add fchown or @chown to SystemCallFilter (easy, and low risk since the service runs as my user);
  2. change the sed command to not need to call fchown (potentially impossible without changing sed itself); or
  3. use a command other than sed in the service (potentially lots of work).

The first option is easy, and the risk is tolerable, so that’s what I ended up doing.
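Both the group lookup and the fix can be sketched in shell. The function names, the drop-in file name and the example.service unit mentioned afterwards are made up for illustration. The sketch assumes systemd-analyze syscall-filter prints each group name at the start of a line with its members indented below it, and it relies on systemd merging repeated SystemCallFilter= settings, so the drop-in only has to name the addition.

```shell
# Hypothetical helper: reads "systemd-analyze syscall-filter" output on
# standard input and prints every group which directly lists system call $1.
# Assumes group names start in column one and members are indented.
syscall_groups() {
    awk -v name="$1" '
        /^@/ { group = $1; next }      # unindented "@name" starts a group
        $1 == name { print group }     # indented member matching the name
    '
}

# Hypothetical helper: writes a drop-in into directory $1 which adds the
# system call or @group $2 to the existing SystemCallFilter of the unit.
allow_syscall() {
    mkdir -p "$1" &&
        printf '[Service]\nSystemCallFilter=%s\n' "$2" > "$1/allow-syscall.conf"
}
```

With these, systemd-analyze syscall-filter | syscall_groups fchown should include @chown in its output. For a user service managed by hand, something like allow_syscall ~/.config/systemd/user/example.service.d @chown, followed by systemctl --user daemon-reload and a restart of the service, would apply the change. On systems which generate unit files from configuration, the same setting would go into that configuration instead.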

Knowing which commands are available and how to find relevant debugging info is half the battle, so I hope this was useful.

Thanks to _Andrew on the NixOS Discourse for the journalctl tip!