The Art of Reading Support Files - Systematic Linux Diagnostics Before You Reach for strace
The Wrong First Move
It is 02:00. The pager is screaming. The dashboard is red. You SSH into the
host. What do you do?
Most engineers, in my observation, reach for a tool that produces more data:
strace, tcpdump, perf, bpftrace, dtrace. Sometimes they go straight
for an interactive shell into a misbehaving container with --entrypoint /bin/sh. Occasionally, in the truly desperate moments, I have seen people run
tcpdump -A -i any | tee /tmp/everything.pcap and hope something useful
floats to the surface.
This is the wrong first move. Not because these tools are bad — they are
extraordinary — but because they generate data faster than a human can read
it, they require a hypothesis about where the bug lives, and they assume you
can reproduce the failure on demand. None of those assumptions hold during a
real production incident.
The right first move is almost always to read what the system has already
written down about itself. Every well-engineered system, from a Fritz!Box
home router to a Proxmox cluster to a fleet of Kubernetes nodes, produces a
structured snapshot of its own internal state. The diagnostic information is
already there. The skill is knowing where to look and what patterns to
recognise.
This skill — fluent reading of support files, log archives, and self-diagnostic
output — is the highest-leverage production-operations skill that nobody
teaches systematically. It is not in any certification I have seen. It is
barely in any textbook. It is transmitted from senior engineer to junior
engineer through war stories and incident retrospectives, slowly, imperfectly,
and inconsistently. And it is the single largest determinant of whether an
incident takes thirty minutes to resolve or six hours.
This post is an attempt to write some of it down.
Why Support Files Beat Live Tracing for First-Pass Diagnosis
Before the scenarios, the principle. Live tracing tools (strace, tcpdump,
perf, bpftrace) have three characteristics that make them dangerous as a
first move during an incident:
They produce data at line rate, not at human reading speed. A single
strace -f on a busy GitLab Puma worker produces tens of megabytes per
minute. No human reads 10MB of strace output. You either have a hypothesis
that lets you grep for a specific syscall pattern, or you have hours of
post-hoc analysis ahead of you. During an active incident, you have neither.
They require the bug to be reproducible while you watch. Many of the
nastiest production bugs are intermittent. They fired at 02:13, woke you up,
and by 02:14 the symptom is gone but the consequence — a corrupted record, a
stuck job, a wedged connection — remains. Live tracing cannot capture what
already happened.
They impose runtime overhead. strace can slow a process down by 10–100x.
On a system that is already overloaded enough to page someone, this is the
opposite of helpful. You may push it from "degraded" to "completely down" by
the act of trying to debug it.
Support files have none of these problems. They are written by the system
itself, on its own schedule, into a stable format. They capture the state at
the moment they were written, including all the historical context the system has
chosen to retain. Reading them is cheap, repeatable, and does not change the
system being observed.
The Heisenberg principle of debugging applies here: the act of measurement
changes what is measured. Live tracing measures by intervention. Support
files measure by reading what was already written. The latter is almost
always strictly better for first-pass diagnosis. Live tracing earns its place
later, after the support files have narrowed the search space to a specific
hypothesis worth testing.
What Counts as a Support File
The term "support file" is mine, used loosely. It refers to any structured
representation of a system's internal state that the system itself produces
on demand or continuously. Different platforms call them different things:
support archives, diagnostic dumps, tech-support output, sosreports, debug
bundles, application backups. The underlying idea is the same.
A non-exhaustive map by platform:
systemd-managed Linux exposes its state through journalctl. The
persistent journal at /var/log/journal/ survives reboots if configured
correctly (and it should be — see the diagnostics role at the end of this
post). journalctl -k returns kernel messages. journalctl -u <unit>
filters to a specific service. journalctl -b -1 shows the previous boot's
log, which is invaluable when a host has rebooted unexpectedly. systemctl --failed lists every unit that did not start cleanly. systemd-analyze blame ranks services by their startup time, exposing the slow path through
boot.
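A minimal first pass that strings these together, using nginx.service purely as a placeholder unit name:
systemctl --failed --no-pager # anything that did not start cleanly
journalctl -u nginx.service -b -p warning --no-pager # one unit, this boot, warnings and worse
journalctl -k -b -1 --no-pager | grep -iE "oops|panic|hung" # previous boot's kernel log, panic markers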
Fritz!Box and most consumer routers produce a textual support file
through a UI option, typically buried under System → FRITZ!Box-Support → Save
Support Information. The output is a single text file, usually 5–30MB, that
contains the running configuration, current DSL parameters, recent log
entries, and internal state of every running daemon. The Fritz!Box file in
particular includes the PPPoE handshake history, VLAN configuration, and
DSLAM negotiation details that are not exposed anywhere in the GUI.
Proxmox VE exposes cluster state through pvecm status, cat /etc/pve/.members, and journalctl -u pve-cluster.service -u corosync.service. Per-VM configuration lives in /etc/pve/qemu-server/<vmid>.conf
and /etc/pve/lxc/<ctid>.conf. Storage state comes from pvesm status. The
entire /etc/pve/ directory is itself a corosync-replicated filesystem
that contains the canonical truth about the cluster's intended state — when
something is wrong, the difference between /etc/pve/ and what is actually
running tells you exactly what diverged.
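A minimal sketch of that intended-versus-running comparison for a single guest, with VM ID 101 as a placeholder:
cat /etc/pve/qemu-server/101.conf # intended state, replicated cluster-wide
qm config 101 --current # current values as applied, ignoring pending changes
qm status 101 --verbose # runtime state as this node reports it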
GitLab has gitlab-rake gitlab:check, which exercises every internal
component and reports findings in plain English; gitlab-ctl tail, which
streams logs from every service simultaneously; and the contents of
gitlab-backup create archives, which contain a structured dump of the
database, repository data, and secrets. The gitlab-ctl status output is
the first thing to read on any GitLab incident.
Cisco IOS devices produce show tech-support, which is by convention the
single most comprehensive diagnostic dump on networking gear. It contains
running-config, interface counters, routing tables, log buffers, and
hardware status, all in one paginated stream. show logging alone usually
explains a routing or VLAN problem.
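The first-pass reads, before anyone types the word debug — availability of individual show commands varies by platform and IOS version:
show logging
show interfaces counters errors
show tech-support | redirect flash:tech.txt
The log buffer alone explains most routing and VLAN problems; the counters command is the Catalyst-switch spelling; the redirect modifier captures the full tech-support dump to flash for offline reading.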
Docker has docker inspect <id>, which returns a complete JSON
description of a container's configuration and runtime state; docker logs <id>, which returns its stdout/stderr; docker events, which streams a
real-time feed of container lifecycle events across the host. The combination
of inspect and logs resolves perhaps eighty percent of "container won't
start" issues.
Kubernetes is similar but distributed: kubectl describe pod <name>
returns events and status; kubectl logs <pod> --previous returns logs from
the prior incarnation of a crashloop pod; kubectl get events --all-namespaces --sort-by='.lastTimestamp' returns the cluster-wide event
stream chronologically. For node-level issues, crictl ps, crictl logs,
and the kubelet's journal entries fill in the gap.
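The same three reads for a crashlooping pod, with the pod name and namespace as placeholders:
kubectl describe pod api-7f9c4 -n prod # events, container statuses, last termination reason
kubectl logs api-7f9c4 -n prod --previous # stdout/stderr of the crashed incarnation
kubectl get events -n prod --sort-by='.lastTimestamp' | tail -30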
PostgreSQL exposes runtime state through pg_stat_activity (active
queries), pg_locks (held and waited locks), pg_stat_database (cumulative
counters), and pg_stat_replication (replication lag and state). The
server log (under the configured log_directory, typically log/ or pg_log/)
contains the actual error stream and, when log_min_duration_statement is set,
the slow-query entries. EXPLAIN ANALYZE on a problematic query is itself a
support file in miniature.
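A worked example of that support-file-in-miniature point, against a hypothetical orders table:
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM orders WHERE customer_id = 42 AND created_at > now() - interval '7 days';
A sequential scan with a large row count and heavy buffer reads in that output is the usual sign of a missing index, read straight from the planner's own report.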
The list goes on. Every well-engineered system has equivalents. Learning the
specific incantations for the platforms you operate is a finite, learnable
investment with enormous compounding returns.
Five Scenarios from Real Engineering Work
Abstract principles do not transmit operational skill. Concrete walkthroughs
do. Five scenarios, each illustrating a different facet of the discipline.
Scenario 1: The Fritz!Box Support File and the Hidden VLAN
A freshly provisioned Init7 Copper7 line refused to authenticate against the
Init7 PPPoE server. The Fritz!Box DSL synchronised cleanly at 92,111 / 35,867
kbit/s on VDSL2 profile 17a. The credentials matched the Init7 data sheet
character for character. The log line repeated every sixty seconds: "Log in
with internet service provider failed. Authentication failure."
The instinct here is to suspect Layer 3 — wrong password, wrong realm,
account not provisioned. The discipline says: read the support file before
trusting any of those hypotheses.
System → FRITZ!Box-Support → Save Support Information produced a 5.2MB
text file. Opening it in an editor and grepping for the relevant terms — grep -i -n "AC-Name\|AC-Cookie\|PADO\|PADI\|Service-Name\|pppoe\|vlan" —
returned a small set of lines. Line 4163 read:
0: iface ptm0 PPPoE/26/dsl e8:df:70:c2:4f:37 stay online 1 vlan 7 tcom (prop: default internet)
And line 10405:
default_tcom_vlan = 7
The Fritz!Box was sending its PPPoE Initiation packets tagged with VLAN 7,
which is the German Telekom default. The Swisscom BBCS wholesale
infrastructure that carries Init7's Copper7 lines either expects untagged
frames or a different tag. The PPPoE
Discovery never reached Init7's BRAS because the frames were being silently
dropped or routed to the wrong VLAN by the Swisscom DSLAM.
The GUI's VLAN configuration page showed "VLAN deactivated". The support
file showed the firmware's internal default was active anyway. The
configuration that the GUI presented and the configuration that the system
was executing were different. The support file revealed the gap in the time
it took to grep for "vlan".
The lesson: GUI configuration is a partial view of system state. The support
file is the full view. When observed behaviour does not match what the GUI
claims is configured, the support file shows you why.
This was today's debug session, by the way. It is not a hypothetical.
Scenario 2: The Proxmox Cluster Quorum Split
A Proxmox VE cluster of three nodes appeared healthy in each individual web
UI. Each node showed the other two as online. But cluster operations —
migrating a VM, changing a storage definition, even running pvesh ls /cluster — hung indefinitely.
The instinct is to restart pve-cluster.service on the node from which you
are operating. The discipline says: read the cluster state directly before
intervening.
cat /etc/pve/.members
pvecm status
journalctl -u corosync.service -u pve-cluster.service -n 200 --no-pager
The .members file lists what each node believes about the cluster
membership. pvecm status returns the corosync view of quorum. The corosync
journal contains the heartbeat history.
Reading these in sequence: the .members file on node 1 listed all three
nodes as members. So did node 2's. Node 3's listed only itself. pvecm status
on node 3 reported "Activity blocked", with only its own vote counted against
the expected three. The corosync
log on node 3 contained a stream of "Token has not been received in X ms"
messages stretching back forty-five minutes.
Node 3 had become network-isolated from the other two. The web UI on each
node was answering from its own local cache of cluster state, which is why
they all looked healthy. Cluster operations hung because the operations
required a quorum write to /etc/pve/, which corosync could not commit
without a majority.
The fix was a network problem on the switch port serving node 3 — a flapping
interface that had begun dropping multicast frames. Restarting
pve-cluster.service would have done nothing because the underlying transport
was broken. Reading three files in three minutes located the actual fault.
A naive restart-and-pray approach would have produced a recurring incident
that returned every time the switch port hiccupped.
The lesson: distributed systems lie individually but tell the truth
collectively. Reading state from each node and comparing reveals divergences
that any single node's UI cannot show.
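A quick way to do that comparison from one terminal, assuming the nodes answer to pve1 through pve3 over SSH (hostnames are placeholders):
for n in pve1 pve2 pve3; do
  echo "=== $n ==="
  ssh "$n" 'cat /etc/pve/.members; pvecm status | grep -iE "quorate|blocked"'
done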
Scenario 3: The Silent GitLab Backup Restore Failure
A GitLab Community Edition restore from gitlab-backup.tar.gz ran for
twenty minutes, then failed. The web UI showed "Restore failed. Check the
logs." The Rails console was unresponsive. Reaching for strace -f -p $(pgrep -f gitlab) would have been catastrophic — GitLab on a typical
deployment has between forty and sixty processes, and which one is the
"failed" one is not obvious.
The discipline says: GitLab tells you what is broken if you read in the
right order.
gitlab-ctl status
gitlab-ctl tail | head -200
tar -tzf gitlab-backup.tar.gz | head -20
gitlab-ctl status returned every component as "run", but unicorn (the Rails
HTTP server) had a recent restart count of 12, which is suspicious. gitlab-ctl tail is the magic incantation: it streams logs from every service
simultaneously, with the service name as a prefix on each line. Within
seconds, the production.log stream showed:
Errno::ENOENT: No such file or directory @ rb_sysopen - /etc/gitlab/gitlab-secrets.json
repeating approximately once per second. The tar -tzf output confirmed the
backup archive contained the database dump and the repositories — but no
secrets file, and it never would: gitlab-backup deliberately excludes
/etc/gitlab/, so gitlab-secrets.json has to be carried over separately, and on
this host it had gone missing. Without it, GitLab cannot decrypt encrypted
columns in the database
— stored CI/CD variables, two-factor recovery codes, integration tokens —
and unicorn workers crash on every request that touches them.
The fix was to restore the secrets file from a separate backup (we keep
/etc/gitlab/gitlab.rb and /etc/gitlab/gitlab-secrets.json in a separate
encrypted archive precisely because of this risk). Total diagnosis time:
about ninety seconds, almost all of it in gitlab-ctl tail. Total time if I
had reached for strace first: incalculable, because I might never have
identified the missing-file pattern from a syscall trace.
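The separate archive is nothing sophisticated — a sketch of its shape, with the GPG recipient as a placeholder:
tar -czf - /etc/gitlab/gitlab.rb /etc/gitlab/gitlab-secrets.json \
  | gpg --encrypt --recipient ops@example.org \
  > gitlab-config-$(date +%F).tar.gz.gpg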
The lesson: applications with mature operations tooling almost always have
a "tail everything" command. Find it for your stack. Use it first.
Scenario 4: The Persistent-Journal Kernel Oops Recovery
A development laptop kernel-panicked under sustained load and rebooted. By
the time I logged in to investigate, the system was running normally again.
The instinct is to enable kdump, configure a vmcore destination, and wait
for the next panic to capture a usable trace.
The discipline says: the journal is persistent if you configured it that
way, and the previous boot's logs survived the reboot.
journalctl --list-boots
journalctl -k -b -1 --no-pager | tac | grep -B2 -A100 "Oops:\|kernel BUG\|general protection fault\|Call Trace:"
journalctl --list-boots returned a list of recent boot IDs with timestamps.
The previous boot ended at 14:47, three minutes before my session started.
The kernel log from that boot, read in reverse and grepped for the canonical
panic markers, returned a clean call trace pointing at nf_conntrack and
specifically at a known issue in the kernel version running on this Pop!_OS
laptop. The trace pointed at the function, which pointed at the module,
which pointed at a known upstream bug fixed in a later kernel release.
The fix was to upgrade the kernel package and reboot. The diagnosis took
about four minutes, almost all of it spent reading the call trace
carefully. Setting up kdump would have taken longer than the diagnosis
itself, and would have provided strictly less information than was already
in the persistent journal.
The lesson: configure the persistent journal everywhere, and learn to read
kernel call traces. The journal is free. Call traces are not magic — they
are stack frames, ordered from most recent to oldest, with module names and
function names. The top of the trace, together with the RIP line above it,
is usually where the fault happened; the frames below it tell you how
execution got there.
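Once the trace names a module, the follow-up is mechanical: pin down the exact kernel and module versions before searching the upstream trackers, roughly like this:
uname -r # exact running kernel version
modinfo nf_conntrack | head -n 5 # where the module lives and which kernel it was built for
journalctl -k -b -1 --no-pager | grep -E "RIP:|nf_conntrack" # the faulting frames, in isolation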
Scenario 5: The Container That Will Not Start
The classic operations problem: a developer pushes a new image, deploys it,
and the container exits within a second. docker ps -a shows it as Exited
(1). The instinct is to add --entrypoint /bin/sh -ti to the run command
and shell into the container interactively to figure out what is wrong.
The discipline says: read the logs first, because they almost certainly
already contain the answer.
docker logs <container-id> 2>&1 | head -50
docker inspect <container-id> | jq '.[].State, .[].Config.Env'
In the canonical case, the logs contain a single line:
Error: Required environment variable DATABASE_URL is not set
The container exited because it could not find a configuration value it
required. The fix is to either set the variable or to pass an env-file. Total
diagnosis: ten seconds of reading. Total diagnosis if you shell in
interactively first: at least several minutes, plus the cognitive overhead
of remembering which sub-shell you are in.
In the slightly less canonical case, the logs contain something like:
PermissionError: [Errno 13] Permission denied: '/data/cache'
Now docker inspect becomes useful: the Mounts section tells you what is
mounted at /data, the User field tells you what UID the container is
running as, and a quick ls -la /path/on/host confirms the host directory
is owned by root with mode 0700, but the container is running as UID 1000.
The fix is to chown the host directory or to override the container's user.
Total diagnosis: under a minute.
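The two inspect fields in question, pulled out directly — the container ID and host path are placeholders:
docker inspect <container-id> | jq '.[0].Mounts, .[0].Config.User'
ls -lan /path/on/host # numeric UIDs on the host side, for comparison with the container's user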
The lesson: the first ten lines of docker logs contain the answer ninety
percent of the time. Read them first. Reach for interactive debugging only when the
logs leave you with a genuine ambiguity that requires live state inspection.
The Pattern Across All Five Scenarios
Each scenario has the same structure, and that structure is the entire
content of the discipline:
First, the system has already written down what is wrong, somewhere. This is
true for almost every production failure that is not a hardware fault. The
information you need to diagnose the problem is, in the overwhelming
majority of cases, already present in some log, configuration file,
diagnostic dump, or self-reported state field on the affected system.
Second, the wrong move is to gather more data using high-overhead live
tools. These tools are not bad. They are simply the wrong instrument for
first-pass diagnosis, in the same way that a microscope is the wrong
instrument for finding your car keys. They have their place, and that place
is after the support files have narrowed the search to a specific testable
hypothesis.
Third, the right move is to read what is already there, in the right order,
with the right grep patterns. The "right order" is platform-specific and
learnable. The "right grep patterns" come from understanding the failure
modes of the system you are operating, which is itself a learnable skill.
Fourth, the skill is knowing which file to read first on each platform. This
is the pure operational expertise — the part that comes only from time on
the platform, from reading logs that turned out to be red herrings, from
seeing the same failure pattern in three slightly different forms across
two years of incidents. There is no shortcut. There is only practice and
documentation.
Fifth, fluency in this skill compounds. Every incident in which you found
the root cause through support-file reading teaches you a new pattern. The
patterns accumulate. After enough years, you reach for the right file
without thinking, and your time-to-resolution drops by orders of magnitude
relative to the engineer who is still reaching for strace at 02:00.
A First-Move Checklist by Platform
For my own homelab and the teams I work with, I keep an internal cheat sheet
of the first three commands to run on each platform when something is
wrong. A condensed version:
Generic systemd Linux host:
systemctl --failed
journalctl -p err -b --no-pager | tail -100
journalctl --list-boots # to spot unexpected reboots
Linux host, suspected hardware or kernel issue:
journalctl -k -b --no-pager | grep -iE "error|fail|oops|panic|hung|stall"
dmesg --level=err,warn --notime | tail -100
tail -200 /var/log/kern.log # if syslog-based
Network connectivity issue on Linux:
ip -br addr ; ip -br link ; ip route
ss -tlnp # who is listening on what
journalctl -u NetworkManager --since=-1h --no-pager
Docker host:
docker ps -a --format 'table {{.Names}}\t{{.Status}}\t{{.RunningFor}}'
docker logs <suspect-id> 2>&1 | head -50
docker inspect <suspect-id> | jq '.[].State, .[].Mounts'
Proxmox node:
pvecm status
journalctl -u pve-cluster.service -u corosync.service --since=-30m --no-pager
pvesm status # storage view
Proxmox VM that will not start:
qm config <vmid>
qm status <vmid>
journalctl -u pve-guests.service --since=-30m --no-pager
ls -lh /srv/local/vz/images/<vmid>/ # disk image present and sized?
GitLab incident:
gitlab-ctl status
gitlab-ctl tail | head -200 # streams ALL logs, prefixed by service
gitlab-rake gitlab:check # the official self-check
PostgreSQL slowness or hang:
SELECT pid, state, wait_event_type, wait_event, query_start, query
FROM pg_stat_activity
WHERE state != 'idle'
ORDER BY query_start;

SELECT * FROM pg_locks WHERE NOT granted;
Fritz!Box or similar consumer router PPP/DSL failure:
GUI: System → FRITZ!Box-Support → Save Support Information
Then: grep -i -n "PADO\|PADI\|AC-Name\|AC-Cookie\|vlan\|authentication" support.txt
This is not a complete list. It is a starting list. The point of having it
written down is so that during an incident, you do not need to remember the
incantations under stress. You read them off the page and start producing
useful output within seconds.
Why This Skill Has Atrophied
If support-file reading is so high-leverage, why does it not seem to be
universally taught and practised? Three reasons, in increasing order of
discomfort.
First, modern operations tooling abstracts the underlying platform. A
Kubernetes operator manages a database. A Helm chart manages the operator. A
GitOps controller manages the chart. Each layer adds value but also adds a
layer of indirection between the engineer and the system actually
experiencing the problem. When the database is slow, the engineer asks the
operator, the operator asks the chart, the chart asks the controller, and
nobody reads pg_stat_activity. The skill required to read pg_stat_activity
fluently was never built, because every layer above it was supposed to make
it unnecessary.
Second, observability tooling has shifted the centre of gravity from "what
does the system say about itself" to "what do my external probes say about
the system". Prometheus, Datadog, New Relic, and their kin are powerful, and
they catch many incidents earlier than support-file reading would. But they
also encourage a habit of looking only at metrics that were instrumented in
advance. When something fails in a way that is not on a dashboard, the
engineer who only knows dashboards is stuck. The engineer who knows how to
read journalctl, /var/log, and application self-diagnostic output keeps
moving.
Third, and most uncomfortably, the skill is not glamorous. Reading log
files is slow, methodical, and visually unimpressive. It does not produce
flashy demos. It does not lend itself to conference talks (though it
should). Conference talks are about new tools. Support-file reading is
about old habits. New tools attract attention. Old habits accumulate
expertise. The latter compounds. The former does not, on its own.
The result is a generation of engineers who can write a Helm chart but
cannot read journalctl output, who can configure Prometheus but cannot
read pg_stat_activity, who can spin up a Kubernetes cluster but freeze
when a single node misbehaves in an unexpected way. This is not their
fault. It is a gap in the education path the industry has built for them.
The fix is not to abandon modern tooling. The fix is to layer support-file
fluency underneath it, as the foundational skill on which all the higher
abstractions ultimately rest.
The Closing Argument
Production debugging is, in the end, a literacy problem. The system has
written down what is wrong with it, in a language and a format that is
specific to that system. Engineers who can read that language fluently
resolve incidents in minutes. Engineers who cannot — who reach for live
tracing, interactive debugging, or external observability tools first —
spend hours doing what amounts to translation work. They are not bad
engineers. They simply have not yet built the foundational reading skill on
which everything else depends.
Building that skill takes time. It cannot be acquired in a single sprint.
It comes from years of incidents, each of which adds a few patterns to the
mental library, each of which compounds with what came before. The path is
unglamorous, slow, and reliably effective.
The single most useful piece of advice I can give to a junior engineer
operating production systems is: before you reach for any tool, read what
the system has already written down. Open the journal. Run the
self-diagnostic. Read the support file. Grep for the obvious words. The
answer is, in the overwhelming majority of cases, already there, waiting
for someone to read it.
Be that person. The pager will thank you at 02:00.
Appendix: Persistent Journal Configuration
Several of the scenarios above relied on the journald persistent journal
surviving a reboot. This is not the default on every Linux distribution. To
ensure your hosts retain logs across reboots — which is a precondition for
post-mortem analysis of any unexpected restart — drop the following file
into place on every Linux host:
# /etc/systemd/journald.conf.d/size.conf
[Journal]
Storage=persistent
SystemMaxUse=1G
And ensure the directory exists with the right permissions:
mkdir -p /var/log/journal
chown root:systemd-journal /var/log/journal
chmod 2755 /var/log/journal
systemctl restart systemd-journald
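To confirm the change took effect:
ls -d /var/log/journal/*/ # a machine-id directory here means persistent storage is active
journalctl --list-boots # shows more than one entry once the host has rebooted again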
This is in the diagnostics role of my homelab Ansible repository. Every
managed host has it. It is the cheapest insurance you will ever buy for
production debugging — about thirty seconds of configuration in exchange
for the ability to read the previous boot's kernel log after an unexpected
reboot. There is no excuse for not having it on every machine you operate.