Moving Day


My home infrastructure, while functional, is a bit of a hot mess. For years I've been focused on precision of outcome rather than flexibility of implementation, to the point that side projects sat on my back burner until I finished completely arbitrary milestones or learning programs. I ended up with such a complex end-state architecture design that I couldn't move from Plex to Jellyfin until I'd mastered Kubernetes and Linux and Cilium and Istio and built a production K8s cluster across two NUCs with HA and learned HAProxy and set up iSCSI connectors for storage and deployed Ceph...

Yeah. This was not tenable.

So hot on the heels of my successes with premium AI offerings - and after my month expired - I decided to hunker down and get this fucking done. Let's worry about functional and practical first, instead of pre-optimizing for perfection at home (because I could; work does not allow you the luxury of pursuing perfection, and damn did I think my OCD wanted perfection).

The Status Quo

I've done a lot of moves in my life, and if there's one thing I've learned it's that a successful move requires proper planning and a full accounting of everything before you even think of moving.

The current infrastructure, as I stated at the outset, is fine. A Raspberry Pi 4 acts as a recursive DNS server via Pi-Hole and Unbound, but its lack of a real-time clock means I have to fuss with it after every power outage. Storage is handled by a Synology DS1019+, which does double duty as the internal DNS server, hosts the FreshRSS and RSSBridge containers I use for my news feeds, and reverse-proxies them. Two Beelink NUCs with N100 CPUs and 16GB of RAM act as compute nodes, with one running Proxmox to host Plex in a VM while the other remains...in limbo, essentially. All of this is backed by a Ubiquiti Pro Max 16 PoE switch and UniFi Dream Machine firewall/gateway, with an Eaton UPS to keep everything running when the power fails. Oh, and a smattering of other PoE kit like the Home Assistant Yellow and Hue Bridge v2, all of which sit in a little 9U rack.

Outside the rack is the rest of the apartment, where, aside from the UniFi U7 Pro Max WAP, everything is a client of various stripes. VLANs are already done...but stuff like the NAS still sits on VLAN 1, the only IPv6-enabled VLAN. Everything is single-homed as well, including the NAS. No redundancies, no failovers, just on- and off-site backups and non-networked UPSes.

Functional. Performant, even. Just...fine, and I'm quite tired of fine.

Let's lay out the move.

The Base Layer: Debian Linux

I'd done ESXi. I'd tried Proxmox and Talos. Windows and its CALs are way beyond my price range for a homelab, and I'd rummaged through various flavors of Linux including RHEL, Ubuntu, Debian, Raspbian, Fedora...you get the idea.

I decided to go with Debian for a few reasons:

  • Learning Linux administration via Debian translates into command knowledge across a wider swath of Linux distros versus RHEL, Arch, or SUSE
  • My Pi-Hole runs a flavor of Debian (Raspbian) with an uptime measured in years
  • My limited experience with Debian is that it's so bulletproof you have to actively try to fuck it up, and even then it's generally recoverable - or at least quickly rebuildable

I threw Debian onto the second NUC, deployed Cockpit and Podman, and began writing my new compose files - starting with Caddy. Things went well enough at first, though I very quickly began running into two issues with my use of Podman:

  • Containers stopped at logout and never started after boot, due to Podman's daemonless and rootless design.
  • Getting GPU passthrough into Jellyfin was simply not happening.

I spent three days going through documentation, trying different solutions, remapping GIDs, annotating the compose files, everything I could find, and all to no avail. The consensus seemed to be that this was by design, and I'd have to substantially alter the OS for Jellyfin to have hardware transcoding. This, combined with the containers only starting if rewritten as systemd units, killed the Podman approach; it was excellent for learning and testing, less so for production. Thus a reminder of lesson number one:

LESSON 1: Do not simultaneously deploy multiple new technologies you don't yet understand.
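(For anyone determined to stay on Podman: the usual answer is enabling lingering so rootless containers survive logout - `loginctl enable-linger <user>` - plus a Quadlet unit so systemd starts the container at boot. A minimal sketch; the file path, image, and port here are illustrative, not my actual config:)

```ini
# ~/.config/containers/systemd/caddy.container -- a rootless Quadlet unit
# Prerequisite: loginctl enable-linger <user>, so user services run without a login session
[Container]
Image=docker.io/caddy:latest
PublishPort=18080:80

[Service]
Restart=always

[Install]
# Start with the user's systemd instance at boot
WantedBy=default.target
```

After a `systemctl --user daemon-reload`, Quadlet generates a `caddy.service` you can manage like any other unit.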

So after blowing away Debian and reinstalling it clean again (so detritus didn't persist), I installed Docker, re-mounted my NFS shares, and was off to the races with my compose files.
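(The NFS side is just fstab entries, so the mounts survive reboots; a sketch where the NAS hostname and export path are stand-ins, not my real ones:)

```shell
# /etc/fstab entry - NAS export mounted under /mnt/nfs
# _netdev delays mounting until the network is up
nas.internal.example.net:/volume1/infra  /mnt/nfs  nfs4  rw,hard,noatime,_netdev  0 0
```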

Caddy: Fore!

Setting up Caddy is about the easiest thing I've ever done in my entire IT career, full-stop. First came the Docker compose...

services:
    caddy:
        build:
            pull: true
            context: .
            dockerfile_inline: |
              FROM docker.io/caddy:builder AS builder
              
              RUN --mount=type=cache,target=/go/pkg/mod \
                  --mount=type=cache,target=/root/.cache/go-build \
                  xcaddy build \
                  --with github.com/mholt/caddy-dynamicdns \
                  --with github.com/mholt/caddy-l4
              
              FROM docker.io/caddy:latest
              
              COPY --from=builder /usr/bin/caddy /usr/bin/caddy
            tags:
                - "caddy-ddns-l4:latest"
        image: caddy-ddns-l4:latest
        restart: unless-stopped
        container_name: caddy-rproxy
        user: 1000:1000
        ports:
            - "18080:80"
            - "18443:443"
            - "18443:443/udp"
            - "56697:56697/tcp"
        volumes:
            - /mnt/nfs/.../caddy/conf:/etc/caddy
            - /mnt/nfs/.../caddy/data:/data
            - /mnt/nfs/.../caddy/config:/config
        networks:
            - core

networks:
    core:
        external: true

...followed by the Caddyfile:

{
    layer4 {
        :56697 {
            @ircd tls sni irc.example.net
            route @ircd {
                tls
                proxy host.internal.example.net:6667
            }
        }
    }
    email mine@example.net
    default_sni example.net
}

service.example.net {
    reverse_proxy http://host.internal.example.net:55001
}

Domain names changed to deter bots from slamming my infra.

Then all I had to do was forward ports 80 and 443 to the NUC's IP on Caddy's mapped ports (18080 and 18443), and voila, certificates were provisioned for the domains added - provided they were also in the public DNS records.

Which brings me to caveat number two:

LESSON 2: Caddy is not great for Internal DNS

Caddy can handle internal-only hostnames just fine! It just involves wrangling self-signed certificates, which kinda defeats the point of automated ACME issuance. That, or DNS-01 challenges for certs that don't depend on associated public DNS records. Since I'm on home internet service with a DHCP-assigned IPv4 address, and my registrar wants static IPs for API calls, this is a non-starter for me.
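(If you do want internal-only hostnames anyway, Caddy can mint certs from its own local CA via `tls internal`; a sketch with a hypothetical internal Jellyfin host - clients then have to trust Caddy's root CA, which is exactly the wrangling I mean:)

```
jellyfin.internal.example.net {
    # Issue from Caddy's built-in local CA instead of a public ACME CA
    tls internal
    reverse_proxy http://host.internal.example.net:8096
}
```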

This was also an excellent opportunity to learn how to build my own containers - of a sort, anyway. Using Docker's buildx, I built a custom Caddy container with the caddy-dynamicdns and caddy-l4 extension modules (though at present only the Layer 4 module is used), tagged it into the local image store, and then called it for the final service build in Compose. This lets me use Caddy to terminate TLS and proxy traffic for Layer 4 services like IRC, making it a lot easier to scale services up or down without involving more complicated load balancers or routers like Traefik.

LESSON 3: Adding complexity to simple products can be more beneficial than using complicated products simply.

I'm not knocking Traefik, mind you - it was my go-to for my private cloud K8s PoC at a prior employer, and I quite liked it - but Caddy seems tailor-made for simple, scalable deployments that don't need convoluted network abstraction or routing mechanisms. Does it lack the fancy overlay features of its competitors? Yeah, of course, but I don't need those.

Seriously, Caddy is so good I'm considering moving this blog in-house onto a static site and dumping Ghost wholesale. Excellent work, Caddy team.

Seriously, thank you for making an HTTP server that doesn't suck to set up.

Identity: Ugh.

Professionally, I've been an IAM Engineer of some stripe for almost fifteen years. I have owned Active Directory for orgs with objects in the hundreds of thousands, rebuilt KCCs and site links by hand to cut latency down to sub-10ms globally for AD transactions, migrated SSO to Okta and Entra, and struggled to get users to please, dear god, use a fucking password manager. So when it came time to do Identity for my own set of services, I was very much aligned with Xe Iaso's take:

A lot of the time with homelabs and self-hosted services you end up making a new account, admin permissions flags, group memberships, and profile pictures for every service. This sucks and does not scale. Something like Pocket-ID lets you have one account to rule them all. It’s such a time-saver.

Thus, Pocket ID bubbled to the surface as my preferred identity manager for the current era of "Everything-as-HTTP" services. Except there was one motherfucking little service that didn't want to play nicely with OpenID Connect, at least not yet.

Because my migration away from Plex is an article all its own...

Jellyfin wants LDAP. Really, a lot of services still want LDAP for identity. I get it: it's old, it's something most homelab admins know from their day jobs, and it's straightforward enough to implement into your project.

I fucking hate it. I hate Active Directory's butchering of it into a mess of Group Policy objects and Microsoft's insistence that they know better than you as to how it operates. I hate that it is so old and fossilized that it's essentially its own layer of the OSI model at this point. I hate that it insists upon traditional user/pass auth that is incredibly insecure. I especially hate that in 2026, not one single fucking company can put out a container or binary that bootstraps this shit quickly and simply. Not OpenLDAP, not FreeIPA, not ApacheDS. Everyone still insists you juggle keys, services, servers, connections, and replication for something that should be as straightforward as the following Docker Compose:

services:
    ldap:
        image: lldap/lldap:stable-debian-rootless
        container_name: ldap
        user: 1000:1000
        ports:
            - "17170:17170"
        volumes:
            - "/mnt/docker/.../lldap/data:/data"
        networks:
            - core
        environment:
            #- UID=1000
            #- GID=1000
            - TZ=UTC
            - LLDAP_JWT_SECRET=YOUR_SECRET_HERE
            - LLDAP_KEY_SEED=YOUR_SEED_HERE
            - LLDAP_LDAP_BASE_DN=dc=example,dc=net
            #- LLDAP_LDAP_USER_PASS=STARTING_PASSWORD_HERE
        restart: 'unless-stopped'

networks:
    core:
        external: true

Turns out that LLDAP loves that Docker Compose. I run it for all of two services: Pocket ID (synchronizing apps via group memberships) and Jellyfin (via its LDAP plugin), and that's all. It does exactly what most folks need in 2026 from LDAP, which is just user and group management. It does so in a container that can be thrown at Kubernetes with a volume and be trusted to not fall over dead due to complexity or a failed replication. For a homelab, it is perfection.
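(A quick way to sanity-check the directory before pointing Jellyfin's LDAP plugin at it is a plain ldapsearch bind. A sketch, assuming LLDAP's default LDAP port of 3890 - not published in the compose above, since traffic stays on the `core` network - and its default admin DN under `ou=people`:)

```shell
# Bind as the LLDAP admin and list person objects; password supplied via environment
ldapsearch -x -H ldap://ldap:3890 \
    -D "uid=admin,ou=people,dc=example,dc=net" \
    -w "$LLDAP_ADMIN_PASS" \
    -b "ou=people,dc=example,dc=net" "(objectClass=person)"
```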

Between the two, Identity is a solved problem. Passkeys by default, passwords for edge-case shit that doesn't want to join the modern era.

God, that feels great to say.

Backups!

At this point I was feeling pretty jazzed. Identity was sorted between Pocket ID and LLDAP, Caddy was reverse-proxying my workloads (IRC, FreshRSS, Owncast, etc.), and everything was humming along contentedly on a Debian host with my various infra mounts in the /mnt tree.

Time to back it all up.

Normally, this is where the post would devolve into hair-pulling over Docker volumes and databases and the like. Except I'm a simple dinosaur who tries to avoid abstractions that don't provide value. Docker volumes, in general, provide no value in a space this small with applications this simple, so I bind-mounted host paths instead.

The benefit of this approach? Backups are a simple cronjob.

#!/usr/bin/bash
# Tars the contents of the source folder recursively, then copies it to the destination directory
# Timing configured via cronjob depending on needs
# Destination responsible for snapshots/versioning
set -euo pipefail

srcdir=/mnt/docker/
destdir=/mnt/nfs/.../backup/
tstamp=$(date '+%Y-%m-%d')
aname="backup-$tstamp.tar.gz"

# -C keeps archive paths relative, so a restore can land anywhere
tar -czf "/tmp/$aname" -C "$srcdir" .
cp "/tmp/$aname" "$destdir"
rm -f "/tmp/$aname"

Done. Full tarball every night, sent to the NAS (which backs up off site and to a local external disk), with all the data raw and ready to decompress into whatever new host is built during a potential recovery. Everything is just files, making it highly portable and easily testable.
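("Easily testable" is worth proving, so here's a self-contained sketch of the restore drill I mean - same tar flags, pointed at a scratch directory instead of the real mounts:)

```shell
#!/usr/bin/env bash
# Restore drill: tar up a scratch "source", extract it elsewhere, and diff the trees.
set -euo pipefail

workdir=$(mktemp -d)
srcdir="$workdir/src"
destdir="$workdir/backup"
restoredir="$workdir/restore"
mkdir -p "$srcdir" "$destdir" "$restoredir"

# Stand-in for real service data
printf 'example.net {\n}\n' > "$srcdir/Caddyfile"

tstamp=$(date '+%Y-%m-%d')
aname="backup-$tstamp.tar.gz"

# -C keeps paths relative, so the archive restores cleanly anywhere
tar -czf "$destdir/$aname" -C "$srcdir" .
tar -xzf "$destdir/$aname" -C "$restoredir"

# diff -r exits non-zero (failing the script) if the trees differ
diff -r "$srcdir" "$restoredir" && echo "restore OK"
```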

Monitoring & Observability

Or, "How to monitor Docker containers without grafting your telemetry kit to the Unix socket and thus pissing off OWASP".

This, truth be told, is still a work in progress. SRE is an area of growth for me professionally (along with databases - expect a "moving from SQLite to Postgres" post in the future), and I've rarely found benefit from the copious metrics collections done in the enterprise. Thus I initially leaned back on bad habits and installed New Relic's infrastructure agent, but stopped short of binding it to Docker (because OWASP will come to your house and murder you if you bind a container to /var/run/docker.sock). Then I went through my research phase: Prometheus, OpenTelemetry, Wazuh, Checkmate, Grafana, the usual suspects.

Thing is, I never defined what I wanted from my monitoring and observability setup, nor did I define my audience. Rookie mistake that I'd let fester for far too long, if I'm being honest. Once I thought it through, I realized I had two distinct needs - and vastly different challenges:

  • External users need to know if something is up or down, and I need to be pinged if something has fallen over
  • I, the SRE, need to have detailed metrics for capacity planning and triage

I was inadvertently foot-gunning myself by focusing on a single solution for everything, as opposed to scoping out the problem and audiences first. Thus I settled on solving the easier of the two problems (status page and pager for downtime), leaving the metrics for later.

Which brings me to Uptime Kuma, another service I'm hosting over at PikaPods. Simple, straightforward, easy to configure and pretty to look at. ~$2.50 a month for externally-sourced monitoring of services and systems, so that I know if something is down externally as opposed to internally, far away from my network. It may not be the final product I stick with, but it's more than enough for my current needs. You can check the status of my services there, if you like.

More to come...

This was a decent amount of work, and I learned a lot along the way. I didn't cover setting up my own IRC server daemon to get my friends off Discord, or spooling up an Owncast server to host game nights from, nor did I cover the migration from Plex to Jellyfin, or the struggles with moving to rootless Docker. I also haven't nailed down Uptime Kuma's final config, nor have I finished setting up telemetry for the host itself.

I still have a lot of ground to cover in future posts, but this feels like a satisfying conclusion for the time being. I have what I want: a homelab I can quickly spool up or wind down services within, that's easy to keep updated, and that brings my friends closer together through shared services.

And all it took was giving up on complexity in favor of simplicity.