# The Great 3 AM Network Mystery: A Homelab Horror Story

*Or: How I Spent Three Weeks Fighting a Problem I Created Six Months Ago While Drunk*

## The Problem That Ruined My Life

For three glorious weeks, my Proxmox cluster became the world’s most expensive alarm clock. Every morning at 3 AM, it would die. Not gracefully. Not with warning. Just… dead.

Network? Gone. VMs? Unreachable.
My sanity? Deteriorating rapidly. My wife’s patience? Even worse.

“It’s fine,” I told her on Day 1. “Just a fluke.” “It’ll be fixed tonight,” I promised on Day 3. By Day 10, she stopped asking and just pointed toward the garage when I woke up.

## The Symptoms (AKA: My Daily Humiliation Ritual)

Every. Single. Night.

- 2:00 AM: Automated PBS backups start (like a responsible adult configured them to)
- 3:15 AM: Network commits sudoku
- 6:00 AM: My alarm goes off
- 6:02 AM: I check my phone, see monitoring alerts
- 6:03 AM: I say words my children shouldn’t hear
- 6:15 AM: Server gets the three-finger salute
- 6:20 AM: Network magically works again
- 6:21 AM: I question my life choices

The weird part? Manual backups during the day worked perfectly. I could backup VMs all day long without issues.

It was like the network had a personal vendetta against the hours between midnight and dawn. Or maybe it just really valued its beauty sleep.

## Week 1: The Realtek Witch Hunt (Wherein I Blame Innocent Hardware)

“It’s obviously the Realtek card,” I announced confidently to my cat, who was the only being in the house still willing to listen to my server rants.

I mean, come on. Everyone knows Realtek NICs are the networking equivalent of that friend who says they’ll help you move but shows up three hours late with a Smart Car.

### Attempt 1: The Driver Switcharoo

First, I switched from the in-kernel r8169 driver to Realtek’s own r8125 driver. Compiled from source like a real Linux person. Felt very accomplished.

Next morning: Network dead at 3 AM.

Cool. Cool cool cool.

### Attempt 2: Disable ALL The Things

Maybe power management was the issue? I started disabling features with the enthusiasm of someone who has no idea what they’re doing but looks busy:

```bash
# /etc/default/grub
GRUB_CMDLINE_LINUX_DEFAULT="quiet pcie_aspm=off"
```
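(Plus the obligatory regeneration and reboot, since kernel command-line edits don't take effect until the next boot:)

```bash
update-grub   # rebuild the grub config so the new cmdline actually sticks
reboot
```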

“This will fix it,” I told the cat. The cat, being smarter than me, said nothing and walked away.

Next morning: Network dead at 3 AM.

### Attempt 3: The Kitchen Sink Approach

At this point, I was basically throwing spaghetti at the wall:

- Changed MTU from 9000 to 1500 (sad storage performance noises)
- Disabled TCP offloading
- Tweaked ring buffer sizes
- Disabled wake-on-LAN
- Considered exorcism
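For posterity, the spaghetti in command form looked roughly like this (exact values from memory, interface name mine):

```bash
ip link set enp4s0 mtu 1500                  # goodbye jumbo frames, goodbye storage performance
ethtool -K enp4s0 tso off gso off gro off    # disable TCP offloading
ethtool -G enp4s0 rx 512 tx 512              # shrink ring buffers (sizes approximate)
ethtool -s enp4s0 wol d                      # disable wake-on-LAN
# exorcism not shown
```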

Fun fact: None of this worked.

Next morning: Network dead at 3 AM.

The cat stopped making eye contact with me.

## Week 2: The Monitoring Script (Finally, Actual Science)

“You know what I need?” I asked the cat, who had relocated to a different room.

“DATA. GLORIOUS DATA.”

I spent an entire Saturday writing a monitoring script that would:

- Check connectivity every 5 minutes
- Capture full diagnostics when shit hits the fan:
  - Network interface status
  - PCIe device tree
  - Kernel messages
  - Driver info
  - Everything except my dignity
- Attempt automatic recovery
- Send me passive-aggressive alerts
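The script itself boiled down to something like this (a reconstructed sketch; the interface, gateway, and log path are placeholders for your own setup):

```bash
#!/usr/bin/env bash
# network_monitor.sh -- reconstructed sketch, not the exact original
IFACE="enp4s0"
GATEWAY="192.168.1.1"            # placeholder: your gateway here
LOG="/var/log/network_monitor.log"

# Network answers? Great. Go back to sleep.
if ping -c 3 -W 2 "$GATEWAY" > /dev/null 2>&1; then
    exit 0
fi

# Capture everything except my dignity
{
    echo "=== $(date) NETWORK FAILURE DETECTED ==="
    ip link show "$IFACE" 2>&1   # interface status (or lack thereof)
    lspci | grep -i ethernet     # is the NIC still on the PCIe bus?
    ethtool -i "$IFACE" 2>&1     # driver info
    dmesg | tail -n 50           # recent kernel messages
} >> "$LOG"

# Attempt automatic recovery (spoiler: this never worked either)
ip link set "$IFACE" down 2>/dev/null
ip link set "$IFACE" up 2>/dev/null

# Passive-aggressive alert hook goes here (mail, ntfy, carrier pigeon, whatever)
```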

Deployed with cron:

```bash
*/5 * * * * /usr/local/bin/network_monitor.sh
```

Then I waited. Not for the failure (that was guaranteed), but for the beautiful diagnostic logs that would tell me why my life had become a series of 6 AM reboots.

## The Glorious Logs Arrive (It's Not What I Expected)

Next morning, I woke up to a 70KB log file. I opened it with the excitement of a kid on Christmas morning.

**Before failure:**
```
enp4s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500
```
Everything normal. Network happy. Life good.

**After failure:**  
```
Device "enp4s0" does not exist.
```
Network: GONE. Vanished. Poofed out of existence like my hopes and dreams.

**BUT WAIT:**
```
lspci | grep Ethernet
04:00.0 Ethernet controller: Realtek RTL8125 2.5GbE Controller
```

**THE NIC WAS STILL ON THE PCIE BUS.**

It hadn’t fallen off. The hardware was fine. Linux had just… forgotten it existed? Like that awkward moment when you run into someone you vaguely know and can’t remember their name, except it’s a network interface and it happens every single night.
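That distinction is checkable, by the way: a device's sysfs entry sticks around as long as the device is enumerated on the bus, while its `net/` subdirectory only exists if a driver has actually registered a network interface for it.

```bash
# Device enumerated on the PCIe bus? This directory exists.
ls /sys/bus/pci/devices/0000:04:00.0/

# Driver registered a netdev for it? Then net/ contains the interface name.
# Device present on the bus but net/ empty or missing = my exact 3 AM symptom.
ls /sys/bus/pci/devices/0000:04:00.0/net/
```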

## The Plot Thickens (I Realize I’m an Idiot)

I needed to understand what made the automated backups different from manual ones.

Manual backup: works fine. Birds singing. Rainbows. Joy.
Automated backup: the network dies. Chaos. Despair. Sad trombone noises.

The only difference? The automated job backed up 22 VMs in sequence. My manual tests only did 1-2 VMs because I’m lazy and impatient.

So I started looking at the individual VMs. Nothing weird about 101. 102 looked fine. 103, 104, 105… all normal.

Then I got to VM 119.

```bash
qm config 119
```

And there, buried in the config like a land mine from a past drunk-configuration session:
```
hostpci0: 0000:04:00
```

**PCIe PASSTHROUGH OF THE REALTEK NIC.**

**THE SAME NIC THE HOST WAS USING. THE SAME NIC PBS WAS TRYING TO PUSH BACKUP DATA THROUGH. THE SAME NIC I’D BEEN BLAMING FOR THREE WEEKS.**

I literally said “you’ve got to be fucking kidding me” out loud. The cat, from two rooms away, meowed in what I can only assume was judgment.

## The Lightbulb Moment (Why Didn’t I Check This First?)

Let me paint you a picture of what was happening every night:

1. PBS backup job starts, happy as a clam
2. Works through VMs: 101 ✓, 102 ✓, 103 ✓… everything’s great
3. Around 3:15 AM, reaches VM 119
4. Proxmox: “Time to snapshot this VM!”
5. Also Proxmox: “Wait, this VM has PCIe passthrough of… hardware we’re currently using?”
6. Proxmox: “This seems fine. Let’s do it anyway.”
7. Linux kernel: “WHAT ARE YOU DOING”
8. Driver: dies
9. Network interface: stops existing
10. Me, six hours later: sad reboot noises

You can’t snapshot a VM that has PCIe passthrough, because there’s no way to save the state of physical hardware, and you especially can’t do it when the host is actively using that hardware. It’s like trying to photocopy your hand while simultaneously using that hand to press the copy button. Physics doesn’t work that way. Linux doesn’t work that way. Nothing works that way.

Except apparently my configuration, which had been “working that way” for six months until the backups started hitting it.

## The Fix (This Is Embarrassing)

```bash
qm set 119 --delete hostpci0
```

One command. One. Single. Command.

VM 119 didn’t even need the PCIe passthrough. It was leftover from some “brilliant idea” I’d had six months ago (probably after several beers) and then completely forgot about.
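Before re-enabling anything, I ran the paranoid double-check:

```bash
qm config 119 | grep hostpci   # no output = the passthrough entry is really gone
```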

I re-enabled PBS backups. Held my breath. Considered prayer.

That night: Backups ran. Network stayed up. Birds sang. Angels wept.

Next morning: Still working. No reboot needed.

I lay in bed staring at the ceiling, processing the fact that I’d spent three weeks, dozens of hours, countless forum posts, multiple driver recompilations, and significant amounts of coffee troubleshooting a problem I created myself while probably drunk.

The cat walked into the room, looked at me, and I swear to god she was laughing.

## The Victory Lap (Because I Can’t Leave Well Enough Alone)

With the crisis solved, a normal person would have stopped.

I am not a normal person.

“You know what would make this better?” I asked the cat, who immediately left.

“NETWORK BONDING WITH REDUNDANT NICS.”

I ordered an Intel i226-LM 2.5GbE card from eBay for $32 (which felt like a steal until I remembered I’d just wasted three weeks of my life on a self-inflicted problem).

When it arrived, I ripped out the “problematic” Realtek card (which was never actually problematic) and configured proper bonding:

```bash
# /etc/network/interfaces
auto bond0
iface bond0 inet manual
    bond-slaves eno1 enp4s0
    bond-miimon 100
    bond-mode active-backup
    bond-primary enp4s0
```

Now if either NIC dies, traffic automatically fails over. Because apparently three weeks of suffering wasn't enough education on single points of failure.
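If you want to watch the failover machinery for yourself, the kernel exposes the bond's state in procfs:

```bash
# Shows bonding mode, the currently active slave, and per-NIC link status
cat /proc/net/bonding/bond0

# Simulate a NIC failure and watch the active slave change
ip link set enp4s0 down
grep 'Currently Active Slave' /proc/net/bonding/bond0
```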

## The TrueNAS Subplot (Because I Never Learn)

Drunk on success (not actual drunk this time), I decided to set up bonding on my TrueNAS Scale server too.

"This will be easy," I thought. "I'm basically a bonding expert now."

Narrator: *It was not easy.*

I tried to bond my Realtek 2.5GbE NIC with an Intel 1GbE NIC. Different speeds, different vendors, different drivers, but hey—failover mode should handle that, right?

TrueNAS looked at my configuration and said: "Nah."

The bond would come up. Get an IP via DHCP. Show as UP. Everything looked perfect.

Then, exactly 2 seconds later:
```
bond0 (unregistering): Released all slaves
```

The bond would commit suicide faster than my will to live during week 2 of the network failures.

After several attempts, I realized: TrueNAS Scale does not appreciate my creative networking choices.

Solution? Ordered another Intel i226 card ($29.97, even cheaper than the first!). When it arrives, I’ll bond two identical Intel NICs like a responsible adult who has learned from their mistakes.

Lesson learned: When bonding NICs, match your hardware. Same vendor, same speed, same chipset. Don’t get creative. Creativity is what got me into this mess in the first place.
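A 30-second sanity check before bonding will tell you whether two NICs are actually siblings (interface names here are just examples):

```bash
# Compare drivers and negotiated speeds before trying to bond two NICs
for nic in eno1 enp4s0; do
    echo "== $nic =="
    ethtool -i "$nic" | grep -E '^(driver|firmware-version)'   # driver + firmware
    ethtool "$nic" | grep -E 'Speed|Link detected'             # negotiated speed, link state
done
```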

## Lessons Learned (Things I Should Have Known)

### 1. Check Your Configs Before Spending Three Weeks in Hell

Seriously. Just… check them. Look at your VMs. Read the configs. Don’t assume past-you made good decisions. Past-you was an idiot. Past-you had been drinking.
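On Proxmox, this audit is a one-liner, since every VM config lives in /etc/pve/qemu-server/:

```bash
# Any output here means a VM has PCIe passthrough configured
grep -H hostpci /etc/pve/qemu-server/*.conf
```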

### 2. PCIe Passthrough + Backups = Bad Time

If you’re passing hardware to VMs, they won’t snapshot cleanly. This is not a mystery. This is not a driver bug. This is physics. Or computer science. Or both.

Don’t pass through hardware the host is also using. This should be obvious. It was not obvious to me at 11 PM six months ago.

### 3. Monitoring Is Your Best Friend

That monitoring script saved me. Without it, I’d probably still be randomly swapping NICs and performing ritualistic driver recompilations under the full moon.

### 4. Manual Testing ≠ Production

“It works when I test it manually!” is the homelab equivalent of “it works on my machine!”

Test the actual automated workflow. All of it. Including the VMs you forgot you configured wrong while drunk six months ago.

### 5. The Realtek NIC Was Innocent

I’m sorry, Realtek RTL8125. You didn’t deserve what I put you through. You were working perfectly. The problem was me. It’s always been me.

### 6. Intel NICs Are Worth It Anyway

Despite the Realtek being innocent, I’m still switching to Intel NICs. They’re more expensive, but they have:

- Better Linux driver support
- Lower CPU overhead
- Excellent bonding support
- A track record of not making me question my life choices at 6 AM

### 7. Sometimes You Need to Walk Away

Around day 12, I should have stopped, taken a break, and approached the problem fresh. Instead, I descended into madness, trying increasingly desperate solutions.

Pro tip: If you find yourself considering a voodoo ritual to fix your network, take a break.

### 8. Document Everything (So You Can Laugh Later)

I kept notes throughout this ordeal. What I tried, what failed, what the logs said, what swear words I invented.

Without those notes, this blog post wouldn’t exist. And I wouldn’t have a written record of my descent into homelab madness to share with future generations.

## The Happy Ending (We Made It, Folks)

**Current Status:**

**Proxmox Cluster:**

- ✅ Bonded Intel NICs (i226 + onboard Intel)
- ✅ Automatic failover configured
- ✅ PBS backups running nightly without drama
- ✅ All 22 VMs backing up successfully
- ✅ VM 119 living its best life without unnecessary PCIe passthrough
- ✅ Zero 3 AM failures for 3+ weeks
- ✅ I sleep past 6 AM now
- ✅ My wife tolerates me again

**TrueNAS Scale:**

- ✅ Single NIC running stable (for now)
- ✅ Second Intel i226 ordered ($29.97)
- ✅ Future bonding planned with matching hardware
- ✅ Lessons learned about creative network configs

**Me:**

- ✅ Sanity partially restored
- ✅ New appreciation for checking configs
- ✅ Deep understanding of what NOT to do with PCIe passthrough
- ✅ Slightly traumatized by the experience
- ✅ Will never live this down

**The Cat:**

- ✅ Has regained some respect for me
- ✅ No longer leaves the room when I talk about networking
- ✅ Still judges me, but quietly

## The Final Tally

**Time Investment:**

- 3 weeks of intermittent troubleshooting
- ~20 hours of actual work
- Countless hours of obsessive log reading
- Several mornings of 6 AM reboots
- One entire Saturday writing a monitoring script
- One ego-destroying moment of realization

**Financial Cost:**

- 2x Intel i226-LM NICs: ~$60
- Coffee: $irreplaceable
- Ethernet cables: $0 (had spares)
- My dignity: Priceless

**Emotional Damage:**

- Frustration: Off the charts
- Embarrassment: Profound
- Relief when fixed: Overwhelming
- Satisfaction: Moderate (tempered by knowing I caused it)
- Amusement in retrospect: Growing daily

## Words of Wisdom (From Someone Who Clearly Lacks It)

To anyone currently troubleshooting mysterious homelab issues:

1. Check your VM configs first
2. Look for PCIe passthrough
3. Question every decision past-you made
4. Assume past-you was drunk
5. Don’t spend three weeks fighting a one-line config issue
6. Use monitoring tools
7. Take breaks
8. Don’t be like me

And if you’re currently experiencing 3 AM network failures with automated backups… check for PCIe passthrough. Learn from my pain. Be better than me.

Now if you’ll excuse me, I have some perfectly stable, boringly reliable backups to not worry about.

And a cat to apologize to.


The moral of this story: Always assume past-you was an idiot. Because past-you probably was.

Also, buy Intel NICs. Your future-you will thank you.

And maybe don’t configure servers after drinking.
