The Day My Homelab Tried to Kill Me
Or: How a $15 Smart Plug Nearly Took Down an Entire Server Infrastructure
It started, as most disasters do, with a completely reasonable question: “Is there a cheaper way to monitor power consumption than using smart plugs?”
Spoiler: No. But that innocent question kicked off a twelve-hour shitstorm that included kernel panics, NFS mount death spirals, a TrueNAS NIC swap gone sideways, a Proxmox node that refused to boot for over an hour, and a ThirdReality smart plug that may or may not be possessed by a demon. Buckle up.
Act I: The Backup Reckoning
It was a quiet Monday morning. The kind where you check your server rack, notice your USB backups are still running at 9:30 AM, and think, “That can’t be right.”
It was very right.
Digging into the Proxmox backup configs revealed a scheduling nightmare I’d apparently been living with for months. PBS backups kicking off at 00:30, USB backups starting at 02:30 while PBS was still grinding, and two 634GB Immich backups from October 2025 sitting on a USB drive like forgotten leftovers in the back of a fridge. Seven months old. Never pruned. Just vibing. How the hell did I miss this?
The nightly PBS job was backing up 33 VMs to spinning disks, followed immediately by USB jobs writing to more spinning disks, all competing for the same I/O on pve1. It was like scheduling three flights to use the same runway at the same time and wondering why there are delays.
After way too much log archaeology, including manually decoding hex timestamps from Proxmox UPID filenames because apparently that’s a thing you need to know how to do, the picture became clear:
- PBS nightly: 3-6 hours
- USB backup: 4-8 hours
- Both running simultaneously on the same node
- One specific guest (Portainer, LXC 112) taking two hours at 4.9 MiB/s because apparently it wanted to savor the fucking experience
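For anyone facing the same UPID archaeology, here is a minimal Python sketch of the decoding step. It assumes the usual Proxmox layout of UPID:node:pid:pstart:starttime:type:id:user: with the three numeric fields written in hex. The sample UPID is a made-up placeholder, not one from my logs.

```python
from datetime import datetime, timezone


def decode_upid(upid: str) -> dict:
    """Split a Proxmox UPID string and decode its hex fields.

    Assumed layout: UPID:node:pid:pstart:starttime:type:id:user:
    """
    parts = upid.split(":")
    if len(parts) < 8 or parts[0] != "UPID":
        raise ValueError(f"not a UPID string: {upid!r}")
    _, node, pid_hex, _pstart_hex, start_hex, task_type, guest_id, user = parts[:8]
    start_epoch = int(start_hex, 16)  # seconds since the Unix epoch, stored as hex
    return {
        "node": node,
        "pid": int(pid_hex, 16),
        "start_epoch": start_epoch,
        "start_utc": datetime.fromtimestamp(start_epoch, tz=timezone.utc),
        "type": task_type,
        "id": guest_id,
        "user": user,
    }


# Hypothetical UPID for illustration only.
example = "UPID:pve1:0000ABCD:00123456:65000000:vzdump:112:root@pam:"
task = decode_upid(example)
print(task["type"], task["id"], task["start_utc"].isoformat())
```

Subtracting the decoded start time from the task's end time in the log gives the job duration.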
The fix was straightforward: stagger everything so nothing overlaps, cut the USB job down to only critical VMs, and stop backing up Nextcloud and Immich every single night. They get their own twice-weekly schedule now because they’re enormous and I’m not made of disk I/O.
New schedule:
- 22:00 Mon/Thu: PBS backs up the two big boys (Nextcloud + Immich)
- 22:00 Tue/Fri: TrueNAS backs up the two big boys
- 23:00 nightly: PBS backs up everything else
- 03:00 nightly: TrueNAS backs up everything else
- 04:00 nightly: USB backs up critical VMs only
Clean. Sequential. No overlap. Like a well-managed departure sequence. Should have done this months ago.
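A schedule like this is easy to sanity-check in a few lines of Python. The sketch below takes the nightly start times from the list above, but the durations are hypothetical placeholders, not measurements. Swap in your own numbers from the task logs. It only models a single night, measured from noon to noon, and ignores the Mon/Thu and Tue/Fri jobs.

```python
from itertools import combinations


def minutes_since_noon(hhmm: str) -> int:
    """Map HH:MM onto one noon-to-noon window so 23:00 sorts before 03:00."""
    hours, minutes = map(int, hhmm.split(":"))
    return (hours * 60 + minutes - 12 * 60) % (24 * 60)


def find_overlaps(jobs):
    """Return pairs of job names whose [start, start + duration) windows intersect."""
    windows = []
    for name, start, duration_min in jobs:
        begin = minutes_since_noon(start)
        windows.append((name, begin, begin + duration_min))
    overlaps = []
    for (a, a0, a1), (b, b0, b1) in combinations(windows, 2):
        if a0 < b1 and b0 < a1:
            overlaps.append((a, b))
    return overlaps


# Start times from the schedule above; durations (minutes) are HYPOTHETICAL.
nightly = [
    ("PBS everything else", "23:00", 180),
    ("TrueNAS everything else", "03:00", 55),
    ("USB critical VMs", "04:00", 60),
]
print(find_overlaps(nightly))  # an empty list means nothing collides
```

If a job runs longer than its slot, the offending pair shows up in the result, which is a lot cheaper than finding out at 9:30 AM.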
Act II: The TrueNAS Backup Target
With the schedule sorted, the next logical step was adding TrueNAS as a third backup location. NFS share, plain vzdump files, directly readable without needing PBS to restore. The escape hatch for when everything else shits the bed.
Setting it up was almost suspiciously easy:
- Created a dataset on TrueNAS: satchpool/backups/proxmox-backups
- Shared it via NFS with authorized hosts
- Mounted it on pve1
- Added it as Proxmox storage
There was a brief wrestling match with NFS root squash (TrueNAS wouldn’t let pve1 create the required dump directory), solved by setting Maproot User to root in the share config. Five minutes of work.
Then I added it to fstab. Just a simple line:
192.168.0.41:/mnt/satchpool/backups/proxmox-backups /mnt/truenas-backups nfs defaults,_netdev 0 0
This harmless-looking line would later try to burn my entire infrastructure to the ground.
Act III: The NIC Swap That Started a War
With backups sorted, it was time to install a new 2.5GbE NIC in TrueNAS. Simple hardware swap. Should take ten minutes.
“I’ll just shut down TrueNAS, swap the card, boot it back up.”
What I forgot: pve1 now had an NFS mount pointing at TrueNAS. And that mount was configured with defaults and no nofail, which in Linux fstab language means “this mount is required for the system to function, and if it’s not available, feel free to lose your entire mind.”
TrueNAS went down for the NIC swap. pve1 didn’t care. It was still running, NFS mount was just stale. No big deal.
Then I tried to reboot pve1.
Mistake.
Act IV: The Boot Loop from Hell
Attempt 1: pve1 hangs on boot. Red text: Timed out while waiting for udev queue to empty. Cursor blinks. Nothing happens. Five minutes. Ten minutes. Nothing. Are you shitting me.
Attempt 2: Hard power off via iDRAC. Reboot. Same error. Same hang. Same infuriating blinking cursor mocking me from an otherwise black screen.
Attempt 3: Edit GRUB, add init=/bin/bash to bypass systemd entirely. Same error. Wait, what? init=/bin/bash should bypass everything. How is udev still running? What the actual fuck?
Attempt 4: Edit GRUB, add systemd.unit=emergency.target. Kernel says Unknown udev kernel command line option "udev.timeout", ignoring. Cool. Very helpful. Thanks for absolutely nothing.
Attempt 5: Try to boot from Proxmox USB installer for rescue mode. BIOS ignores the USB drive. Boots from hard drive anyway. Because of fucking course it does.
At this point I discovered something fun in the BIOS: PXE boot was set as the first boot device. The server was trying to network boot before touching the local disk. Fixed that, but it wasn’t the actual problem. The udev timeout happens after GRUB, not before. Another dead end. I’m losing my mind.
Attempt 6: Boot into GRUB’s recovery mode. Kernel panic. Full stack trace. Page faults. Memory errors. The works. Fourteen screenshots of register dumps scrolling by like I’m trapped in a nightmare I can’t wake up from.
Scrolling through all of that panic output, the smoking gun finally appeared:
sd 12:0:0:0: [sdk] tag#0 timing out command, waited 180s
sd 11:0:0:0: [sdl] tag#0 timing out command, waited 180s
Two SCSI devices timing out. 180 seconds each. The kernel was waiting for USB drives that weren’t plugged in. I had unplugged them during troubleshooting, and fstab was configured to require them for boot.
Four entries in fstab. Three physical backup drives plus the NFS mount to a TrueNAS server that was in pieces on a workbench. All configured with defaults. All required. All missing.
The system was doing exactly what it was told: “These mounts are mandatory. I will wait forever for them. I don’t care about your feelings.”
You’ve got to be kidding me. All of this, the kernel panics, the boot loops, the hour of troubleshooting, because of a missing comma and six letters in fstab. nofail. That’s it. That’s the whole thing.
Act V: The Fix (Finally)
Attempt 7: Unplug every single USB drive. Boot pve1 with nothing attached. System drops to emergency mode (because fstab mounts are missing) but this time offers a root password prompt instead of kernel panicking.
Entered the root password. Got a shell. Opened fstab. Commented out every last line. Rebooted.
It booted.
Holy shit, it actually booted. I stood there staring at the login prompt like it was a mirage in the desert.
Then, without celebrating, without breathing, without even sitting down, I SSH’d in and fixed fstab properly:
UUID=aff5b236-... /media/VM_Backup xfs defaults,nofail 0 0
UUID=ca4b5447-... /media/satchcloud-backup ext4 defaults,nofail 0 0
UUID=b5ae0e63-... /media/plexbackup ext4 defaults,nofail 0 0
192.168.0.41:/mnt/satchpool/backups/proxmox-backups /mnt/truenas-backups nfs defaults,_netdev,soft,timeo=30,retrans=3,nofail 0 0
One magic word: nofail on every non-essential mount. Two more for NFS: soft and timeo.
- nofail = “If this drive isn’t here, skip it and keep booting. Don’t be a hero.”
- soft = “If NFS doesn’t respond, give up gracefully instead of hanging forever like a clingy ex.”
- timeo=30,retrans=3 = “Try three times with a 3-second timeout, then move on with your life.”
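The 3-second figure comes from timeo being measured in tenths of a second, so timeo=30 is 3 seconds. Here is a rough Python sketch of the waits involved, assuming NFS over TCP with the linear backoff described in the nfs(5) man page (each retry waits one more timeo, capped at 600 seconds). The exact number of attempts before a soft mount gives up can differ by kernel, so treat the total as a ballpark, not a guarantee.

```python
def nfs_retry_timeouts(timeo_deciseconds: int, retrans: int,
                       cap_seconds: float = 600.0) -> list[float]:
    """Estimate the per-attempt timeouts in seconds for an NFS-over-TCP mount.

    Assumes linear backoff: attempt n waits n * timeo, capped at cap_seconds.
    """
    base = timeo_deciseconds / 10  # timeo is given in tenths of a second
    return [min(base * attempt, cap_seconds) for attempt in range(1, retrans + 1)]


waits = nfs_retry_timeouts(timeo_deciseconds=30, retrans=3)
print(waits)       # per-attempt timeouts in seconds
print(sum(waits))  # rough worst-case wait before a soft mount reports an error
```

Under those assumptions the first wait is 3 seconds and the whole sequence is well under half a minute, compared with a hard mount that waits indefinitely.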
This should have been there from day one. I’ve been running this homelab for over a year with live grenades in my fstab and never knew it.
Act VI: The TrueNAS NIC Saga
Meanwhile, TrueNAS had its own adventure. The new 2.5GbE Intel i226-V went in fine. Booted fine. Network came up. Then twenty minutes later: PCI1360 — A bus fatal error was detected on a component at slot 5. System BIOS halted due to NMI. Full crash.
Are you serious right now.
Reseated the NIC. Booted fine again. Consumer NIC in an enterprise server slot, sometimes they just need a firm seating. Doesn’t exactly inspire confidence for the next reboot, but whatever. It works. Don’t touch it. Don’t breathe on it. Don’t even think about it too hard.
Then came the speed disappointment. Both pve1 and TrueNAS confirmed running at 2500 Mbps on their 2.5GbE NICs. Network was perfect end to end. Backup speed to TrueNAS? 8.3 MiB/s.
Eight. Point. Three. Megabytes per second. Over a 2.5 gigabit link.
The spinning disks don’t give a shit about your fancy network card. They write at the speed they write. The 2.5GbE upgrade bought exactly zero improvement for backup throughput. The network was never the bottleneck. It was always the rust. I basically wasted an afternoon and nearly bricked my primary server to upgrade something that didn’t matter.
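The arithmetic makes the point painfully clear. This is a back-of-the-envelope Python sketch that converts raw link speed to MiB/s and ignores protocol overhead, so real-world ceilings are somewhat lower. The function names are mine, purely for illustration.

```python
def link_capacity_mib_s(gigabits_per_second: float) -> float:
    """Raw link capacity in MiB/s, ignoring Ethernet/TCP/NFS overhead."""
    bytes_per_second = gigabits_per_second * 1e9 / 8
    return bytes_per_second / 2**20


def utilization_percent(observed_mib_s: float, gigabits_per_second: float) -> float:
    """How much of the raw link the observed throughput actually uses."""
    return observed_mib_s / link_capacity_mib_s(gigabits_per_second) * 100


observed = 8.3  # MiB/s, the backup speed measured above
for speed in (1.0, 2.5):
    print(f"{speed} GbE: {link_capacity_mib_s(speed):.1f} MiB/s raw, "
          f"{utilization_percent(observed, speed):.1f}% used")
```

At 8.3 MiB/s the backup uses under 3% of the 2.5GbE link and only about 7% of plain gigabit, so the old NIC was never close to saturated either.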
Act VII: The Smart Plug That Started It All
Remember the original question? Power monitoring? After all of this, the backup optimization, the TrueNAS target, the NIC swap, the boot loop, the kernel panics, the fstab fix, I still needed to pair a ThirdReality smart plug to ZHA.
Hold the button for 10 seconds. Nothing. Hold for 20 seconds. Nothing. Hold for 30 seconds. Nothing. Unplug, hold button, plug back in. Nothing. Press 5 times rapidly. Nothing.
The plug just sat there, silently judging me, its LED dark and unresponsive, like it knew exactly what it had put me through and simply did not care. Not even a flicker. Not a blink. Just cold, dead, plastic indifference.
Oh, and while I was holding the button trying to pair it? The fans on both Dell R730s spun up to jet engine speed for no apparent reason, and pve2 and pve3 lost power briefly. Because why the hell not. Kick me while I’m down.
After a generous pour of whiskey and some deep breaths, I had a moment of clarity: stop trying to pair the plug while it’s connected to the server rack, you idiot. Grab a different plug. Pair it to ZHA first with nothing plugged into it. Confirm it works. Then shut everything down cleanly and swap it inline.
And that’s exactly what I did. Grabbed a fresh ThirdReality plug, paired it to ZHA in about 15 seconds like it was nothing, confirmed it was reporting power data, shut down the rack in an orderly fashion, plugged the strip into the now-paired smart plug, and powered everything back up.
Worked perfectly. The server rack is now monitored. The whole point of this entire day, accomplished in the last 20 minutes.
Lessons Learned
- nofail goes on every non-essential fstab mount. Period. The only mount that should be required is root. Everything else gets nofail. This is non-negotiable. Tattoo it on your forearm. Name your firstborn child nofail.
- soft,timeo=30,retrans=3 on every NFS mount. Hard NFS mounts will hold your entire system hostage when the server disappears. That’s not resilience, that’s a liability.
- Always unmount NFS before shutting down the NFS server. Or just use nofail and soft so it doesn’t matter. See items 1 and 2. I shouldn’t have to say this but here we are.
- USB drives in fstab are ticking time bombs. They’re removable media configured as permanent infrastructure. It’s like building a house on a trailer and being surprised when someone drives away with it.
- Stagger your damn backup jobs. Running three backup targets simultaneously on spinning disks doesn’t make them go faster. It makes them all go slower. This isn’t parallelism, it’s a bar fight for disk I/O.
- 2.5GbE doesn’t help when your storage writes at 8 MiB/s. Upgrade the bottleneck, not the thing next to the bottleneck. I learned this the hard way so you don’t have to. You’re welcome.
- Keep your Proxmox USB installer handy and actually test that it boots. When everything goes sideways, you need a rescue path that works. Mine didn’t. Don’t be me.
- iDRAC is not optional for enterprise servers. Without remote management, I would have been physically power-cycling a rack-mounted server by pulling cables. In 2026. Like some kind of barbarian.
- Pair smart plugs before putting them inline with critical infrastructure. This seems obvious in hindsight. Everything seems obvious in hindsight. That’s what hindsight is for. And whiskey.
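The first two lessons can be checked mechanically. Below is a minimal Python sketch of an fstab linter that flags any non-root, non-swap entry missing nofail and any NFS entry missing soft. It is deliberately blunt: it will also flag mounts like /boot/efi that you may genuinely want to be required, so read its output as a list of things to review, not a list of things to blindly change. The function name is hypothetical. The sample lines are the original and fixed entries from this post.

```python
def risky_fstab_entries(fstab_text: str) -> list[str]:
    """Return human-readable findings for fstab entries that can stall a boot."""
    findings = []
    for raw_line in fstab_text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        fields = line.split()
        if len(fields) < 4:
            continue  # not a complete fstab entry
        _device, mountpoint, fstype, options = fields[:4]
        opts = set(options.split(","))
        if mountpoint == "/" or fstype == "swap":
            continue  # root must stay required; swap has no mount point
        if "nofail" not in opts:
            findings.append(f"{mountpoint}: missing nofail")
        if fstype.startswith("nfs") and "soft" not in opts:
            findings.append(f"{mountpoint}: NFS mount without soft")
    return findings


sample = """\
UUID=aff5b236-... /media/VM_Backup xfs defaults,nofail 0 0
192.168.0.41:/mnt/satchpool/backups/proxmox-backups /mnt/truenas-backups nfs defaults,_netdev 0 0
"""
for finding in risky_fstab_entries(sample):
    print(finding)
```

To check a real system, pass in the contents of /etc/fstab, for example with open("/etc/fstab").read().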
Final Score
- Duration: ~12 hours
- Reboots: Lost count
- Kernel panics: 2
- fstab edits: 4
- USB drives unplugged in frustration: 4
- NICs reseated: 1
- Whiskey consumed: Heavily
- Smart plugs successfully paired: 1 (eventually)
- Times I said “fuck” out loud: Stopped counting around noon
- Neighbors concerned: Probably
- Blog posts written out of spite: 1
Dave is a commercial airline pilot who runs a homelab because apparently flying 737s isn’t stressful enough. His server rack is named after Star Wars ships, his automations roast his family members, and his backup strategy now includes the word “nofail” more times than his therapy notes include the word “boundaries.”