Tutorials · Apr 09, 2026 · 23 min read

Harden Docker on a VPS - Rootless, User Namespaces, and More


A default Docker install on a VPS is convenient and dangerous in equal measure. The daemon runs as root, the docker group is a passwordless sudo, containers ship with broad capabilities, and one bad image or one mounted socket can hand an attacker the host.

Every one of those problems has a fix that ships with Docker itself. This guide walks through the changes that move the needle on a single-host VPS, in the order I'd apply them on a fresh server.

The single most dangerous thing you can do with Docker is bind-mount `/var/run/docker.sock` into a container. That socket is root-equivalent. A container with the socket can launch a new privileged container that mounts the host filesystem and reads `/etc/shadow`. Treat it like the root password.

TL;DR

  • Run the daemon rootless, or enable user namespace remapping if you can't
  • Drop all capabilities and add back only what each container actually needs
  • Run containers with --read-only, no-new-privileges, and a tmpfs for /tmp
  • Don't put humans in the docker group, ever
  • Cap log file sizes to stop a chatty container from filling the disk
  • Set mem_limit and cpus so one container can't starve the rest
  • Scan images with Trivy before you pull them into production
  • Never mount the daemon socket into a container you don't fully trust

What You Need

  • A VPS running Ubuntu 22.04 or 24.04 (Debian 12 works too)
  • Root or sudo access for the initial setup
  • A non-root user that you'll actually use for day-to-day work
  • About 30 minutes if you're starting fresh

This guide assumes Docker Engine, not Docker Desktop. The commands target a standard Linux host.

Step 1: Don't Put Yourself in the docker Group

The default installer adds your user to the docker group so you can run docker without sudo. That group membership is equivalent to root. Anyone in the docker group can run:

```bash
docker run --rm -v /:/host alpine chroot /host sh
```

That's a root shell on the host. No password prompt, no audit trail. There is no way to limit what a docker group member can do short of removing them from the group.

The fix is one of:

  1. Run rootless Docker (Step 2). The daemon runs as your user and there's no docker group at all.
  2. Use sudo docker for everything and don't add humans to the docker group.
  3. Restrict who can SSH in and accept that any admin is effectively root.

If you share a server with people you wouldn't give full sudo, options 1 and 2 are the only honest answers.
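Auditing who already has this access is a one-liner with `getent`. A minimal sketch of parsing the group line — the sample line below is hypothetical, not read from a real host:

```shell
# sample getent output for the docker group (hypothetical values);
# on a live system you'd get it with: line=$(getent group docker)
line="docker:x:999:alice,bob"

# everything after the last colon is the comma-separated member list
members=${line##*:}
echo "$members" | tr ',' '\n'

# remove someone with: sudo gpasswd -d alice docker   (needs root, so commented)
```

Anyone listed is effectively root; remove them or move them to `sudo docker`.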

Step 2: Install Rootless Docker

Rootless mode runs dockerd as a regular user. The daemon, containers, and network stack all live in a user namespace owned by your account. A container escape gives the attacker your unprivileged shell, not root.

First, install the prerequisites and the rootless extras package:

```bash
sudo apt update
sudo apt install -y uidmap dbus-user-session fuse-overlayfs slirp4netns \
  docker-ce-rootless-extras
```

If you already have rootful Docker installed, disable the system service so the rootless one can take over:

```bash
sudo systemctl disable --now docker.service docker.socket
sudo rm -f /var/run/docker.sock
```

Then, as your regular user (not root):

```bash
dockerd-rootless-setuptool.sh install
```

The script writes a systemd user unit and prints the environment variables you need. Add them to your shell profile:

```bash
cat >> ~/.bashrc <<'EOF'
export PATH=/usr/bin:$PATH
export DOCKER_HOST=unix:///run/user/$(id -u)/docker.sock
EOF
source ~/.bashrc
```

Enable lingering so the daemon survives logout:

```bash
sudo loginctl enable-linger "$USER"
systemctl --user enable --now docker
```

Verify:

```bash
docker info | grep -i rootless
```

You should see rootless listed under Security Options.

Rootless Docker has real limits. It cannot bind to ports below 1024 by default, AppArmor profiles for containers may behave differently, and some workloads that expect privileged kernel features (raw sockets, specific cgroup tweaks) will not work. For a typical web stack behind a reverse proxy on port 8080, none of that matters. For a VPN concentrator, it does.

Step 3: If You Can't Go Rootless, Use User Namespaces

Some workloads still need rootful Docker. The daemon's user namespace remap maps container UID 0 to a high unprivileged UID on the host, so a break-out leaves the attacker as dockremap rather than root.

Edit (or create) /etc/docker/daemon.json:

```json
{
  "userns-remap": "default",
  "live-restore": true,
  "no-new-privileges": true,
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "10m",
    "max-file": "5"
  },
  "default-ulimits": {
    "nofile": {
      "Name": "nofile",
      "Hard": 65536,
      "Soft": 65536
    }
  }
}
```

Then restart:

```bash
sudo systemctl restart docker
```

The default value tells Docker to create a dockremap user and map the container UID range to it. New image and container data lands under a remapped directory such as /var/lib/docker/100000.100000/ (named after the first subordinate UID and GID). Existing image and volume data lives in the old paths, so plan a migration.

A few things break under userns-remap:

  • Bind-mounts that need specific UIDs from the host fail unless you chown the host path to the remapped UID.
  • BuildKit had a long-running incompatibility with userns-remap. Use the legacy builder (DOCKER_BUILDKIT=0 docker build .) or build images on a separate host.
  • Containers that share a network or PID namespace with the host bypass the remap. Don't combine --userns=host with anything you care about.

If those constraints are too painful, prefer Step 2.
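The chown workaround for bind-mounts needs the remapped UID, which is just the subordinate-range start plus the container UID. A sketch of the arithmetic, using a sample /etc/subuid line (the values and the /srv path are illustrative):

```shell
# sample subuid entry in user:start:count form -- on a real host you'd
# read it with: grep dockremap /etc/subuid
subuid_line="dockremap:100000:65536"

base=$(echo "$subuid_line" | cut -d: -f2)   # first UID of the mapped range
container_uid=33                            # e.g. www-data inside the container
host_uid=$(( base + container_uid ))
echo "$host_uid"

# then fix the bind-mounted path (hypothetical path, needs root):
#   sudo chown -R "$host_uid" /srv/app-data
```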

Step 4: Drop Capabilities by Default

Linux capabilities are the granular pieces of root. By default, Docker grants a generous subset including NET_RAW, SYS_CHROOT, and SETUID. Most apps need none of these. Drop everything and add back only what's required:

```bash
docker run \
  --cap-drop=ALL \
  --cap-add=NET_BIND_SERVICE \
  --read-only \
  --tmpfs /tmp:rw,noexec,nosuid,size=64m \
  --security-opt no-new-privileges \
  --memory=512m \
  --cpus=1.0 \
  nginx:alpine
```

What each flag does:

  • --cap-drop=ALL strips every capability from the container.
  • --cap-add=NET_BIND_SERVICE lets the process bind to ports below 1024. nginx needs this for port 80; most other apps don't.
  • --read-only mounts the root filesystem read-only. The container can't drop a webshell on disk because there's no writable disk.
  • --tmpfs /tmp gives the app a writable scratch area in RAM. It vanishes on container restart, which is what you want.
  • --security-opt no-new-privileges prevents setuid binaries from raising privileges inside the container. Stops a whole class of escalation tricks.
  • --memory and --cpus cap resource usage so a runaway worker can't kill the host.

Capabilities you might legitimately need:

  • NET_BIND_SERVICE: bind to ports under 1024
  • CHOWN, DAC_OVERRIDE, FOWNER: writing files as different users (most package managers in entrypoints)
  • SETUID, SETGID: dropping privileges inside the container

If your container fails with a permissions error after you drop everything, strace or check the logs for the missing capability and add only that one back.
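When you're reading a CapEff line out of /proc/&lt;pid&gt;/status, remember it's a hex bitfield where capability number N is bit N. A sketch of checking one bit by hand — the mask value is a made-up sample, and CAP_NET_BIND_SERVICE is capability number 10:

```shell
# sample CapEff value with only bit 10 set (hypothetical, not from a real process)
mask=0x0000000000000400
cap=10   # CAP_NET_BIND_SERVICE

if [ $(( (mask >> cap) & 1 )) -eq 1 ]; then
  echo "NET_BIND_SERVICE present"
else
  echo "NET_BIND_SERVICE absent"
fi
```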

Step 5: Express the Same Hardening in Compose

You probably aren't running everything from the command line. The same flags translate cleanly to a Compose file:

```yaml
services:
  web:
    image: ghcr.io/example/web:1.4.2
    restart: unless-stopped
    read_only: true
    tmpfs:
      - /tmp:rw,noexec,nosuid,size=64m
      - /run:rw,noexec,nosuid,size=8m
    cap_drop:
      - ALL
    cap_add:
      - NET_BIND_SERVICE
    security_opt:
      - no-new-privileges:true
    mem_limit: 512m
    mem_reservation: 256m
    cpus: 1.0
    pids_limit: 200
    ulimits:
      nofile:
        soft: 4096
        hard: 8192
    logging:
      driver: json-file
      options:
        max-size: "10m"
        max-file: "5"
    environment:
      NODE_ENV: production
    networks:
      - frontend
    healthcheck:
      test: ["CMD", "wget", "-qO-", "http://localhost:8080/health"]
      interval: 30s
      timeout: 5s
      retries: 3

networks:
  frontend:
```

A few things to call out:

  • The image tag is pinned (1.4.2), not latest. Pinning is a security control: it stops a poisoned upstream tag from rolling into production on the next docker compose pull.
  • pids_limit stops a fork bomb inside the container from taking the host with it.
  • mem_limit and cpus are non-negotiable on a multi-tenant VPS. Without them the OOM killer becomes a lottery.
  • Per-service logging overrides ride on top of any daemon-level defaults you set in daemon.json. Belt and suspenders.

If your app legitimately needs to write somewhere persistent, add a named volume and keep the rest read-only. Don't relax read_only for the whole service just because one cache directory needs to be writable.

Step 6: Scan Every Image with Trivy

The hardening above protects against runtime exploits. It does nothing against an image that ships with a known-vulnerable OpenSSL or a backdoor in a transitive npm dependency. For that, you need a scanner.

Trivy is fast, free, and runs as a single binary:

```bash
sudo apt install -y wget gnupg
wget -qO - https://aquasecurity.github.io/trivy-repo/deb/public.key | \
  sudo gpg --dearmor -o /usr/share/keyrings/trivy.gpg
echo "deb [signed-by=/usr/share/keyrings/trivy.gpg] \
  https://aquasecurity.github.io/trivy-repo/deb generic main" | \
  sudo tee /etc/apt/sources.list.d/trivy.list
sudo apt update
sudo apt install -y trivy
```

Then scan an image before you run it:

```bash
trivy image --severity HIGH,CRITICAL --exit-code 1 ghcr.io/example/web:1.4.2
```

The --exit-code 1 flag makes Trivy fail the command if it finds anything HIGH or CRITICAL. Wire it into your deploy script and the script will refuse to roll out a vulnerable image.
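Wired into a script, the gate is just an if on the scanner's exit status. The sketch below stubs out the scanner and deploy commands with `true`/`false` so the control flow is testable; in a real deploy script you'd swap in the trivy and docker compose invocations shown in the comments:

```shell
# deploy only if the scan command exits 0; both arguments are placeholders
gate() {
  scan_cmd=$1
  deploy_cmd=$2
  if $scan_cmd; then
    $deploy_cmd
  else
    echo "refusing to deploy: scanner reported findings" >&2
    return 1
  fi
}

# stand-ins for: trivy image --severity HIGH,CRITICAL --exit-code 1 "$img"
# and:           docker compose up -d
gate true  "echo deploying"
gate false "echo deploying" || echo "blocked"
```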

For an existing host, scan everything you've already pulled:

```bash
docker images --format '{{.Repository}}:{{.Tag}}' | \
  grep -v '<none>' | \
  xargs -I{} trivy image --severity HIGH,CRITICAL --quiet {}
```

The first scan on a fresh server is usually a wake-up call. Pin newer base images, rebuild, rescan.

Step 7: Cap the Log Driver Before It Caps You

A noisy container with the default json-file log driver will quietly eat your disk. There is no rotation by default. I've seen 80 GB of logs from one misbehaving worker that nobody noticed until the host ran out of disk space.

The daemon.json snippet from Step 3 sets a sensible global default: 10 MB per file, 5 files retained, per container. That's 50 MB max per container, which is plenty for triage and small enough to be safe.
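The worst-case disk math is worth running against your own fleet: max-size × max-file × number of containers. A quick sketch with the Step 3 defaults and a hypothetical 20 containers:

```shell
max_size_mb=10   # log-opts max-size from daemon.json
max_files=5      # log-opts max-file
containers=20    # hypothetical fleet size -- substitute: docker ps -q | wc -l

total=$(( max_size_mb * max_files * containers ))
echo "worst case: ${total} MB of container logs"
```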

If you want to ship logs off-host instead, swap the driver:

```json
{
  "log-driver": "journald"
}
```

journald rotates with the rest of the system journal, and you can filter a single container's output with journalctl CONTAINER_NAME=foo (or inspect the daemon itself with journalctl -u docker.service).

Step 8: Lock Down the Daemon Socket

The Docker daemon socket lives at /var/run/docker.sock (rootful) or $XDG_RUNTIME_DIR/docker.sock (rootless). Anything that can talk to that socket controls the daemon, which means it controls the host.

Two rules I treat as non-negotiable:

  1. Never bind-mount /var/run/docker.sock into a container that runs untrusted code. That includes Watchtower, Portainer agents, CI runners, and webhook receivers. If the container is compromised, so is the host.
  2. If a tool genuinely needs Docker access (like a CI runner), put it on a separate host or use a socket proxy that exposes only the API endpoints it needs:
```yaml
services:
  socket-proxy:
    image: tecnativa/docker-socket-proxy:latest
    environment:
      CONTAINERS: 1
      IMAGES: 1
      POST: 0
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock:ro
    restart: unless-stopped

  watchtower:
    image: containrrr/watchtower
    environment:
      DOCKER_HOST: tcp://socket-proxy:2375
    depends_on:
      - socket-proxy
    restart: unless-stopped
```

The proxy turns a root-equivalent socket into a tightly scoped HTTP API. Watchtower can list containers and pull images, but it can't create privileged containers or mount the host filesystem.

The `docker` group on the host and `/var/run/docker.sock` inside a container are the same risk wearing two different hats. Treat membership in the group and access to the socket as equivalent to root, because functionally they are.

Step 9: Sanity Check Your Containers

After all of the above, walk through one of your running services and confirm the hardening is actually applied:

```bash
docker inspect myservice --format '
ReadOnly:    {{.HostConfig.ReadonlyRootfs}}
CapDrop:     {{.HostConfig.CapDrop}}
CapAdd:      {{.HostConfig.CapAdd}}
SecurityOpt: {{.HostConfig.SecurityOpt}}
Memory:      {{.HostConfig.Memory}}
NanoCPUs:    {{.HostConfig.NanoCpus}}
PidsLimit:   {{.HostConfig.PidsLimit}}'
```

You're looking for:

  • ReadOnly: true
  • CapDrop: [ALL]
  • CapAdd: only the capabilities you intentionally granted
  • SecurityOpt: includes no-new-privileges
  • Memory and NanoCPUs are non-zero
  • PidsLimit is set

If any of those are empty, the Compose file or the run command isn't doing what you think it is.

Troubleshooting

Rootless Docker can't bind to port 80 or 443. That's expected; rootless mode cannot bind to ports below 1024 without extra configuration. Either run your reverse proxy on the host (rootful, but exposed only to localhost) and proxy to a high port in the rootless namespace, or set net.ipv4.ip_unprivileged_port_start=80 in /etc/sysctl.d/99-rootless.conf and reboot. The latter has its own implications, since any user can now bind low ports.

BuildKit fails after enabling userns-remap. This is a known limitation. Build with the legacy builder using DOCKER_BUILDKIT=0 docker build ., or do your image builds on a separate host (or in CI) that doesn't have userns-remap enabled. Push the built image and pull it on the hardened host.

An AppArmor profile blocks a legitimate workload. Symptoms include processes failing to write to a path that the bind mount clearly allows, or odd EACCES errors from inside an otherwise fine container. Check with dmesg | grep DENIED. The pragmatic fix is to write a custom AppArmor profile for that one image (--security-opt apparmor=my-profile) rather than disabling AppArmor system-wide.

Can't bind-mount /etc/passwd in rootless mode. Rootless Docker maps your host UID range into the container, and the container's view of /etc/passwd is a remapped one. Mounting the host file directly produces UID mismatches. Generate a passwd-style file inside the container instead, or pre-bake the user into the image.
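Generating a passwd-style file is only a couple of lines. A sketch — the user name, UID/GID, and paths below are illustrative, not from the article's stack:

```shell
# one passwd(5) line: name:password:UID:GID:GECOS:home:shell
# (values are hypothetical; pick the UID your app actually runs as)
printf 'app:x:%s:%s::/app:/sbin/nologin\n' 1000 1000 > /tmp/passwd.app
cat /tmp/passwd.app

# mount it read-only instead of the host's /etc/passwd:
#   docker run -v /tmp/passwd.app:/etc/passwd:ro ...
```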

Trivy reports zero vulnerabilities on a six-month-old image. Almost certainly stale CVE data. Refresh the database with trivy image --download-db-only (or wipe Trivy's caches with trivy clean --all) and rescan. The DB ships separately from the binary.

Going Further

A few things to look at once the basics above are in place:

  • Sysbox is an alternative OCI runtime that gives you stronger isolation, including support for running systemd and Docker-in-Docker without --privileged. Worth it for multi-tenant setups.
  • gVisor runs containers inside a user-space kernel. Performance overhead is real but the syscall surface presented to the host is dramatically smaller.
  • Falco watches kernel syscalls and alerts on suspicious container behavior at runtime. Good complement to image scanning, which only catches known issues at build time.
  • cosign signs images so you can verify provenance before pulling. Pair with a policy engine like Kyverno or an admission controller if you eventually move to Kubernetes.
  • Network policies. Put each service on its own Docker network and don't expose ports you don't need. The host firewall (ufw or nftables) is your last line; treat it that way.

None of these are required to be safe. They're what you reach for when you've outgrown the basics.

That's the working set. Run rootless when you can, drop capabilities aggressively, scan images, and treat the docker socket and the docker group with the same respect you'd give the root password. The defaults aren't safe, but the safe configuration isn't far away.


Looking for a VPS that's ready for hardened Docker workloads? Our Linux plans ship with NVMe storage, IPv6, and full root access. See the options.