A Proxmox Hypervisor Down - Recovery, Postmortem, and the Hardening That Followed
The Afternoon That Started With No Internal Services
This afternoon, the primary Proxmox hypervisor in the homelab
refused to boot. The out-of-band console showed the box stuck
during init, never reaching multi-user mode. The VMs running on
it had been serving internal services that one moment were
available and the next moment were not. There was no warning. No
alert. No early indicator. The first signal that anything was
wrong was a failed connection to a service that should not fail
silently.
This post is the postmortem. It covers the failure mode that
produced the hang, the emergency boot procedure that got the box
back to a shell, the controller-driven recovery that restored
every VM and every service without data loss, the Docker subtlety
that required redeploying every container despite intact
persistent volumes, and the three hardening layers that have been
added so that the next incident in this family is caught before
the box even reboots.
Total downtime: roughly four hours. Total data lost: none.
Permanent improvements to the operational posture: more than the
incident itself was worth.
The Failure Mode
The proximate cause was a hung pve-cluster.service during early
boot. pmxcfs, the Proxmox cluster filesystem daemon, could not
resolve its own hostname to the node's interface IP. The base
configuration that had been on the host for two years populated
/etc/hosts with the Debian-convention line 127.0.1.1 <hostname>.
That convention works for everything else on a Debian system.
pmxcfs is the exception. It needs the hostname to resolve to the
actual LAN IP because the FUSE filesystem it implements uses that
IP for inter-node communication. With 127.0.1.1 in /etc/hosts,
pmxcfs would hang indefinitely at startup, blocking
pve-cluster, blocking multi-user.target, blocking everything.
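A check like the following could flag the bad mapping before a
reboot. This is a minimal sketch, not part of the original tooling:
the function name hostname_is_loopback is hypothetical, and it only
inspects a hosts file rather than querying the resolver.

```shell
# Sketch: report whether a hostname maps to a loopback address in a
# hosts file. The function name is illustrative, not a Proxmox tool.
# Usage: hostname_is_loopback <hostname> [hosts-file]
# Returns 0 if any 127.x.x.x entry lists the hostname, 1 otherwise.
hostname_is_loopback() {
    local name="$1" file="${2:-/etc/hosts}"
    awk -v name="$name" '
        { sub(/#.*/, "") }          # drop trailing comments
        $1 ~ /^127\./ {
            for (i = 2; i <= NF; i++) if ($i == name) found = 1
        }
        END { exit found ? 0 : 1 }
    ' "$file"
}

# Example use on a hypervisor:
#   hostname_is_loopback "$(hostname)" && echo "fix /etc/hosts" >&2
```

On a healthy Proxmox host the check returns non-zero, because the
hostname appears only on the line carrying the LAN address.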
The secondary cause was a /srv/local mount that failed because
the disk enumeration order had shifted between reboots. The
external USB stick that holds the PVE install had previously been
enumerated as sda and was now enumerated as sdc. The PERC
hardware RAID volume backing /srv/local shifted correspondingly.
The UUID-based fstab entry should have been resilient to this. It
was not, for reasons not fully understood, though the working
theory is that the mount was attempted before USB enumeration had
settled and the UUID lookup transiently failed at exactly the
wrong moment.
The deeper cause was that neither of these failure modes had been
caught by any monitoring layer. The hypervisor went from "running
fine" to "completely unreachable" with no intermediate alert.
There was no SMART warning. There was no temperature alert. There
was no out-of-band IPMI event. The first indication anything was
wrong was when a service that depended on the hypervisor stopped
responding.
Emergency Boot
The recovery procedure assumed an operator on the chassis console
with the ability to interact with GRUB. The sequence was
straightforward once it was clear what was happening.
At the GRUB boot menu, press e on the kernel entry to edit.
Append to the linux … line:
systemd.unit=emergency.target systemd.mask=pve-cluster.service
Press Ctrl-X to boot. The box drops into emergency mode with a
root shell, without attempting to start pve-cluster.
From there:
# Remount root rw to edit config:
mount -o remount,rw /

# Edit /etc/hosts and replace 127.0.1.1 with the LAN IP:
vim /etc/hosts

# Mask pve-cluster persistently so it stays masked on next boot:
systemctl mask pve-cluster

# Force-kill any pmxcfs that is still spinning:
pkill -9 pmxcfs
sleep 2

# Start pmxcfs in local-only mode so /etc/pve mounts:
pmxcfs -l
echo "local" > /etc/pve/.clustermount

# Exit emergency, continue boot:
exit
The box came up. pve-cluster was still masked and inactive,
/etc/pve/qemu-server/ was empty, and /etc/pve/storage.cfg was
missing. But the kernel was running, SSH was up, and a remote
shell could reach the host.
Diagnosis: Nothing Was Destroyed
The first read-only inventory pass was the moment the panic level
dropped from "rebuild from scratch" to "file restore from
backup."
ls /etc/pve/
ls /etc/pve/qemu-server/
lsblk -o NAME,SIZE,FSTYPE,MOUNTPOINT,LABEL
The qemu-server/ directory was empty, which meant the VM
definitions were gone from the live pmxcfs view. The lsblk
output showed sda as an unmounted ext4 volume of the correct
size, which was the PERC RAID-1 holding the VM images. The data
itself was on disk. It just was not mounted, and the configuration
that would have made it useful was no longer in /etc/pve.
Two things saved this. The first was that the PERC RAID volume
still contained the entire /srv/local/vz/images/ tree intact.
The second was the off-site backup. A nightly script ships the
contents of /etc/pve (which is a FUSE filesystem and not
directly backed by any block device) to a remote storage target
at 02:00 every night. The previous night's snapshot contained the
complete pre-incident state of datacenter.cfg, storage.cfg,
user.cfg, jobs.cfg, the priv/ directory, and every
qemu-server/<vmid>.conf file. Everything that needed to be
restored was already on a remote target and a scp away from
being back on the hypervisor.
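The nightly script itself is not reproduced in this post. As a rough
sketch of the idea only, the function below copies a directory into a
dated folder under a destination root. The name snapshot_dir, the
dated layout, and the cron path in the comment are assumptions for
illustration, not the real script.

```shell
# Sketch: copy a config directory into <dest-root>/<YYYY-MM-DD>/<src>.
# Usage: snapshot_dir <source-dir> <dest-root>
# Prints the directory it wrote. Reads through the mounted filesystem,
# which is the only way to get at /etc/pve since it is FUSE-backed.
snapshot_dir() {
    local src="$1" dest_root="$2" target
    target="$dest_root/$(date +%F)$src"
    mkdir -p "$target" || return 1
    # "src/." copies the contents, including dotfiles, into target.
    cp -r "$src"/. "$target"/ || return 1
    printf '%s\n' "$target"
}

# Hypothetical cron entry for a 02:00 run (script path is assumed):
#   0 2 * * * root /usr/local/sbin/pve-config-snapshot
```

Copying into a dated folder keeps each night's state separate, so a
bad snapshot never overwrites the last good one.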
Recovery
The recovery was mechanical once the inventory pass had
established what was lost and what was not.
First, re-mount /srv/local and confirm the data:
dumpe2fs -h /dev/sda 2>&1 | grep -i UUID
mkdir -p /mnt/peek
mount -o ro /dev/sda /mnt/peek
ls /mnt/peek/vz/images/
The UUID matched the commented-out fstab line. The VM images
directory contained every expected subdirectory. The data was
intact. The fstab line was uncommented and the volume re-mounted
at /srv/local.
Second, restore /etc/pve from the remote backup:
systemctl stop pvestatd pvescheduler pveproxy pvedaemon
BAK=/mnt/backup/<host>/<date>/bind-mounts/etc/pve
cp "$BAK/datacenter.cfg" /etc/pve/datacenter.cfg
cp "$BAK/storage.cfg" /etc/pve/storage.cfg
cp "$BAK/user.cfg" /etc/pve/user.cfg
cp "$BAK/jobs.cfg" /etc/pve/jobs.cfg
cp -r "$BAK"/priv/* /etc/pve/priv/
cp -r "$BAK"/nodes/<host>/qemu-server/* /etc/pve/nodes/<host>/qemu-server/
systemctl start pvestatd pvescheduler pveproxy pvedaemon
qm list
qm list returned every expected entry. pvesm status reported
every storage as active. The hypervisor knew about its VMs again.
Third, boot the VMs:
for vmid in <list of vmids>; do qm start "$vmid"; done
All VMs reached the running state within thirty seconds. Pings
to their LAN addresses confirmed connectivity.
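The loop above only issues the start. A sketch of how the "running
within thirty seconds" check could be scripted follows. The function
name start_and_wait is hypothetical, and it relies on qm status
printing a line of the form "status: running".

```shell
# Sketch: start a VM, then poll `qm status` until it reports running.
# Usage: start_and_wait <vmid> [timeout-seconds]
# Returns 0 once running, 1 on start failure or timeout.
start_and_wait() {
    local vmid="$1" timeout="${2:-30}" waited=0
    qm start "$vmid" || return 1
    while [ "$waited" -lt "$timeout" ]; do
        if qm status "$vmid" | grep -q 'status: running'; then
            return 0
        fi
        sleep 1
        waited=$((waited + 1))
    done
    echo "VM $vmid not running after ${timeout}s" >&2
    return 1
}
```

A running VM is not the same as a reachable one, so the ping check
still matters as a second step.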
The Docker Surprise
The VMs were back. The services running on them were not.
Docker on each VM started fresh with zero containers and zero
images. The persistent bind-mounted volumes under /opt/<service>/
were intact, the data was all on disk where it had been, but the
container records in /var/lib/docker/containers/ had not survived
the hard qm stop that preceded the recovery. The
restart_policy: unless-stopped directive that should have
brought them back on next boot was meaningless because there were
no containers to restart.
The fix was to re-run the deployment automation that defines the
services. Each service deployment is idempotent by design. Running
it against a VM with intact persistent volumes but no containers
produces new containers attached to the existing volumes. From the
user's perspective the services had simply been restarted.
The lesson is worth naming explicitly. Persistent volumes under
/opt/ survive hard VM termination. Container records under
/var/lib/docker/containers/ do not, reliably. Recovery requires
re-applying the deployment definition, which is fast because the
data is intact and the definitions are idempotent, but it does
require that the definitions exist and are correct. A homelab
that runs Docker without a corresponding deployment-as-code
discipline would be in real trouble at this point. With it, the
re-deployment took minutes.
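Detecting the "Docker is up but knows about nothing" state is easy to
script. The sketch below is an illustration under assumptions: the
function names are hypothetical, and the redeploy command is whatever
the deployment automation provides, passed in as arguments.

```shell
# Sketch: count every container Docker has a record of, running or not.
# Fails if the docker CLI cannot reach the daemon.
container_count() {
    local ids
    ids="$(docker ps -aq)" || return 1
    if [ -z "$ids" ]; then
        echo 0
    else
        printf '%s\n' "$ids" | wc -l | tr -d ' '
    fi
}

# Usage: redeploy_if_empty <redeploy-command> [args...]
# Runs the given command only when there are no container records.
redeploy_if_empty() {
    local n
    n="$(container_count)" || { echo "docker unreachable" >&2; return 1; }
    if [ "$n" -eq 0 ]; then
        echo "no containers found, redeploying" >&2
        "$@"
    else
        echo "$n containers present, nothing to do" >&2
    fi
}
```

Treating an unreachable daemon as an error rather than as "zero
containers" avoids triggering a redeploy while Docker is still
starting.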
The Hardening That Followed
The incident was survivable. The deeper problem was that no
monitoring layer had caught the failure modes before they
cascaded into a full hypervisor outage. The hours after the
recovery were spent building three new monitoring layers, each
deployed across the relevant hosts via the existing automation.
The first layer is smartd, configured with short weekly and
long monthly SMART self-tests, pre-failure attribute alerting via
the existing mail pipeline, temperature thresholds, and a shutdown
helper that triggers shutdown -h +5 on USAGE_FAILED or
OFFLINE_FAILED so the box shuts down gracefully instead of running
on a dying disk.
The second layer is a small temperature watchdog running as a
five-minute cron job. It reads /sys/class/thermal/* and sensors,
mails on a warning threshold (default 75 °C), and shuts the host
down at a critical threshold (default 90 °C). This prevents the
box from cooking when fans or HVAC fail.
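The core of such a watchdog fits in a few lines. This is a sketch
rather than the deployed script: the function name check_temps is
hypothetical, it reads only the thermal_zone*/temp files (which
report millidegrees Celsius) and leaves out the sensors output, and
the mail and shutdown actions are left to the caller.

```shell
# Sketch: classify the hottest thermal zone as ok, warn, or crit.
# Usage: check_temps [thermal-dir] [warn-C] [crit-C]
# Prints "<level> <degrees>". With no readable zones it prints "ok 0".
check_temps() {
    local dir="${1:-/sys/class/thermal}" warn="${2:-75}" crit="${3:-90}"
    local max=0 f milli deg
    for f in "$dir"/thermal_zone*/temp; do
        [ -r "$f" ] || continue
        milli="$(cat "$f" 2>/dev/null)" || continue
        if [ "$milli" -gt "$max" ]; then
            max="$milli"
        fi
    done
    deg=$((max / 1000))   # millidegrees to whole degrees Celsius
    if [ "$deg" -ge "$crit" ]; then
        echo "crit $deg"
    elif [ "$deg" -ge "$warn" ]; then
        echo "warn $deg"
    else
        echo "ok $deg"
    fi
}
```

A cron wrapper would then mail on a "warn" result and call shutdown
on a "crit" result.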
The third layer is ipmievd, which subscribes to the BMC's
System Event Log on Dell iDRAC hardware. PSU loss, fan failure,
disk hotswap events, and CPU thermal warnings all surface via
syslog where the existing daily log-error-digest mail picks them
up.
Separately, prometheus-node-exporter was deployed across every
Linux host on the LAN. It listens on 0.0.0.0:9100, and the scrape
sources default to the two monitoring hosts. This finally produces
fleet-wide CPU, memory, disk, network, and load metrics from a
single declarative source. A blackbox-exporter container runs on the
monitoring hosts for HTTP up/down probes against every service
endpoint that matters. A new Grafana dashboard renders the
fleet-wide overview with host up/down status, per-host CPU and
RAM and disk used percentages, per-host network throughput, and
per-host one-minute load, filterable by instance.
The composition produces three independent signal streams. smartd
warns about disk degradation before failure. ipmievd warns about
chassis-level events the OS would never see. node-exporter plus
blackbox-exporter produce continuous metrics and probes that
distinguish "host is up but service is down" from "host itself is
gone." Each layer is independently useful. The composition closes
the visibility gap that allowed this afternoon's failure to be
invisible until users hit it.
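The distinction between the two failure states can be illustrated
with two plain HTTP probes. This is only a hand-rolled sketch of the
logic, not how Prometheus or blackbox-exporter implement it, and the
function names and example URLs are hypothetical.

```shell
# Sketch: probe a URL, succeeding only on a 2xx/3xx HTTP response.
probe() {
    curl -fsS --max-time 5 -o /dev/null "$1" 2>/dev/null
}

# Usage: classify <node-exporter-url> <service-url>
# Prints host-down, service-down, or ok.
classify() {
    local host_url="$1" service_url="$2"
    if ! probe "$host_url"; then
        echo "host-down"      # the exporter itself is unreachable
    elif ! probe "$service_url"; then
        echo "service-down"   # host answers, service does not
    else
        echo "ok"
    fi
}

# Example (hypothetical host and port):
#   classify http://vm1.example.lan:9100/metrics http://vm1.example.lan:8080/
```

A "host-down" result can also mean only the exporter has stopped,
which is one reason the independent signal streams are worth having.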
What I Took From This
Four specific lessons, each turned into a permanent change.
127.0.1.1 in /etc/hosts is dangerous on Proxmox hosts. The
Debian convention works for everything except pmxcfs. The base
configuration now supports an override that skips the
127.0.1.1 line on Proxmox hosts, which is set on every
hypervisor. The hostname resolves to the actual LAN IP, pmxcfs is
happy, and the failure mode that produced this afternoon's incident
is structurally eliminated.
Disk enumeration order is not stable across reboots. USB-stick
boot devices in particular can shift between sda, sdb, and
sdc depending on USB enumeration timing. UUID-based fstab
entries are the correct mitigation, which was already in use, but
they need to be resilient to transient lookup failures during
boot. The fstab now uses nofail on non-essential mounts so a
transient lookup failure does not block the boot sequence. Any
device-name reference in backup or vzdump configuration is now
UUID-based.
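An audit for mounts that still lack nofail can be sketched as below.
The function name mounts_without_nofail is hypothetical, and skipping
the root and swap entries is an assumption about what counts as
essential.

```shell
# Sketch: list fstab mount points whose options do not include nofail.
# Usage: mounts_without_nofail [fstab-file]
mounts_without_nofail() {
    local file="${1:-/etc/fstab}"
    awk '
        $1 ~ /^#/ || NF < 4       { next }   # comments, blank, short lines
        $2 == "/" || $3 == "swap" { next }   # skip root and swap entries
        {
            n = split($4, opts, ",")
            has = 0
            for (i = 1; i <= n; i++) if (opts[i] == "nofail") has = 1
            if (!has) print $2
        }
    ' "$file"
}
```

Any mount point it prints is one that can still stall the boot if its
device is missing or slow to appear.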
Docker containers do not reliably survive qm stop + qm start.
Persistent volumes do. The recovery path is "re-apply the
deployment definition against fresh containers attached to
existing volumes." This works only if the definitions exist and
are idempotent, which they were, which is why the recovery took
minutes per service instead of hours. A boot-time systemd unit on
each VM is planned that runs the appropriate redeploy
automatically if the container count is zero, eliminating the
manual step entirely.
The off-site /etc/pve backup saved everything. pmxcfs is a
FUSE filesystem and /etc/pve is not directly backed by any
block device. Without the nightly snapshot, VM configurations and
storage config and cluster credentials would have had to be
reconstructed by hand or restored from PVE Backup Server images,
which would have been hours of additional work. The nightly
snapshot is the single most valuable line in any Proxmox backup
configuration. Verify yours runs. Verify the destination is
reachable. Verify the contents are complete.
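The completeness check can be automated. The sketch below tests a
snapshot directory for the files named earlier in this post. The
function name verify_pve_snapshot is hypothetical, and the layout it
expects (config files at the top level, VM definitions under
nodes/<host>/qemu-server/) is taken from the restore commands above.

```shell
# Sketch: verify that a snapshot of /etc/pve looks complete.
# Usage: verify_pve_snapshot <snapshot-dir>
# Prints each missing item; returns non-zero if anything is missing.
verify_pve_snapshot() {
    local snap="$1" missing=0 item
    for item in datacenter.cfg storage.cfg user.cfg jobs.cfg; do
        if [ ! -f "$snap/$item" ]; then
            echo "missing: $item"
            missing=1
        fi
    done
    if [ ! -d "$snap/priv" ]; then
        echo "missing: priv/"
        missing=1
    fi
    # Expect at least one VM definition under nodes/*/qemu-server/.
    if ! ls "$snap"/nodes/*/qemu-server/*.conf >/dev/null 2>&1; then
        echo "missing: qemu-server/*.conf"
        missing=1
    fi
    return "$missing"
}
```

Running it against the newest snapshot after each nightly job, and
mailing on a non-zero exit, turns "verify the contents" into a
standing check.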
Closing
The incident was survivable because the discipline was in place
before the incident happened. The hypervisor had a hardware RAID
that protected the VM disks. The fstab used UUIDs rather than
device names. The /etc/pve configuration was backed up off-site
every night. The container workloads were defined declaratively
in a place that could re-create them on demand. The recovery was
mechanical because the preparation had been done over the
previous two years.
The hardening that followed turned the lesson into permanent
infrastructure. The next incident in this family will not produce
a four-hour outage because smartd will warn about disk
degradation before it cascades, ipmievd will surface BMC events
before the OS even notices, and node-exporter will show the
hypervisor falling behind on its baseline metrics long before
users notice the symptoms.
The deeper lesson is the one that applies to every operational
discipline. The work to make the incident survivable is paid
forward. The cost of automation I had not yet needed, of a backup
target I had not yet restored from, of a UUID-based fstab that
had been overkill for two years, was paid before today. The
benefit was collected this afternoon. The interest compounds
quietly until the day the principal is repaid all at once.
The hypervisor is back. The hardening is deployed. The runbook
documenting the emergency boot procedure is committed. The next
time pmxcfs hangs on a Proxmox host, the recovery will take
minutes rather than hours, because the procedure has been written
down and the monitoring will catch it before it reaches the
blind-boot stage.
The incident was scary. The aftermath was useful. Both are part
of operating real systems.