Zombies with ps - You Cannot Kill What Is Already Dead
A zombie process is a child that has exited but whose parent has not
yet called wait() to collect its exit status. The kernel keeps a
slot in the process table for it, holding little more than the PID,
the exit status, and some resource-usage counters, until the parent
acknowledges that the child is gone. The process has already released
its memory and uses no CPU. It occupies a row.
The first thing to know about zombies is that you cannot kill them.
They are already dead. kill -9 sends SIGKILL to a process. SIGKILL
has nothing to act on, because the process has already terminated and
there is no code left to interrupt. The zombie stays in the table.
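One way to see this first-hand is to manufacture a zombie and try to kill it. The following is a minimal sketch, assuming a POSIX shell with mktemp and ps available. The 30-second sleep standing in for a careless parent, the variable names, and the timings are all illustrative and not taken from any real service.

```shell
#!/bin/sh
# Demo: SIGKILL has no effect on a zombie.
pidfile=$(mktemp)

# The inner shell starts a 1-second child, records the child's PID, and then
# replaces itself with `sleep 30`. `sleep` never calls wait(), so once the
# child exits it stays behind as a zombie.
sh -c 'sleep 1 & echo "$!" > "$1"; exec sleep 30' sh "$pidfile" &
parent=$!

sleep 2                                  # give the child time to exit
zombie=$(cat "$pidfile")

ps -o pid,ppid,stat,comm -p "$zombie"    # STAT should show Z

kill -9 "$zombie"                        # accepted, but nothing is left to kill
sleep 1
if kill -0 "$zombie" 2>/dev/null; then   # signal 0 only tests for existence
    survived=yes
else
    survived=no
fi
echo "entry still present after SIGKILL: $survived"

# Clean up: remove the stand-in parent, and reap it ourselves.
kill "$parent"
wait "$parent" 2>/dev/null
rm -f "$pidfile"
```

The final wait is there so that the demo does not leave its own zombie behind while the script is still running.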
Finding Them
The most reliable form of ps for this purpose is a custom output
format that includes the PPID column, which shows the parent of each
process explicitly.
$ ps -eo pid,ppid,stat,comm | awk 'NR == 1 || $3 ~ /^Z/'
  PID  PPID STAT COMMAND
12847  4221 Z    myservice <defunct>
12849  4221 Z    myservice <defunct>
The Z status and the <defunct> suffix in the command name both
indicate a zombie. The PPID column is the part that matters. The
parent process is the actual subject of any remediation.
A wider view with the process tree shows the structure:
$ ps auxf | grep -B 1 defunct
This typically reveals a parent process that has spawned dozens of
short-lived children without reaping them, which is the
characteristic pattern.
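When there are many zombies, grouping them by parent makes the leaking process obvious. A small sketch, assuming an awk and sort are on the PATH; the function name zombies_by_parent is made up for this example, and it expects two columns (PPID and STAT) on standard input.

```shell
# Count zombies per parent PID.
# Input:  lines of "PPID STAT", e.g. from `ps -e -o ppid= -o stat=`.
# Output: lines of "COUNT PPID", largest count first.
zombies_by_parent() {
    awk '$2 ~ /^Z/ { n[$1]++ } END { for (p in n) print n[p], p }' |
        sort -rn
}
```

Typical use would be:

    $ ps -e -o ppid= -o stat= | zombies_by_parent

The two separate -o options are deliberate: they suppress the header for each column without relying on how a particular ps parses a comma after an empty header.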
Fixing Them
The fix is one of two actions, both directed at the parent.
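The lighter touch is to send SIGCHLD to the parent, which prompts the
parent's signal handler, if it has one, to check on its children. A
parent whose handler has fallen behind (pending SIGCHLDs coalesce, so
a handler that reaps one child per signal can miss some) will often
start reaping when nudged. A parent that never installed a handler
will not react at all, because the default action for SIGCHLD is to
ignore it.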
$ kill -CHLD 4221
The heavier touch is to kill the parent itself. When the parent
dies, the kernel reparents the zombie to init (PID 1, which is
systemd on most modern Linux systems) or to a closer subreaper if one
is registered. The init process is contractually obligated to reap
any child reparented to it. The zombies disappear almost immediately
after the parent's death.
$ kill 4221
Either action releases the process table slots. Neither action
involves the zombies themselves.
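Whichever action is taken, it is worth checking that the zombie count under the parent actually dropped. The sketch below tries the lighter touch and reports the result; the function names zombie_children and nudge_parent are hypothetical, and the one-second pause is an arbitrary guess at how long a parent needs to react.

```shell
# Count zombie children of one parent.
# $1:    the parent PID.
# stdin: lines of "PPID STAT", e.g. from `ps -e -o ppid= -o stat=`.
zombie_children() {
    awk -v parent="$1" '$1 == parent && $2 ~ /^Z/ { n++ } END { print n + 0 }'
}

# Send SIGCHLD to the parent and report the zombie count before and after.
# Returns success only if no zombies remain under that parent.
nudge_parent() {
    nudge_pid=$1
    nudge_before=$(ps -e -o ppid= -o stat= | zombie_children "$nudge_pid")
    kill -CHLD "$nudge_pid" || return 1
    sleep 1
    nudge_after=$(ps -e -o ppid= -o stat= | zombie_children "$nudge_pid")
    echo "zombies under $nudge_pid: $nudge_before before SIGCHLD, $nudge_after after"
    [ "$nudge_after" -eq 0 ]
}
```

With the PID from the earlier example this would be run as:

    $ nudge_parent 4221 || echo "still leaking, consider restarting the parent"

The function deliberately stops short of killing the parent; that decision is left to whoever is operating the service.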
Why This Matters
A zombie costs essentially nothing individually. A zombie leak
accumulating thousands of entries will eventually exhaust either the
owner's per-user process limit or the system-wide supply of PIDs
(kernel.pid_max on Linux). In the first case fork() starts failing
with EAGAIN for that user; in the second it fails for every process
on the system. The symptom is sudden and catastrophic. The cause is
often a parent process that someone wrote ten years ago and nobody
has touched since.
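A rough way to watch for this before it becomes an outage is to compare the number of entries in use against the limit. This is only a sketch: table_usage is a name invented here, the limit is passed in by the caller, and the count is an approximation because on Linux threads also consume PIDs and are not listed by a plain ps -e.

```shell
# Summarise process-table usage.
# $1:    the limit to compare against (for example the value of pid_max).
# stdin: one STAT value per line, e.g. from `ps -e -o stat=`.
table_usage() {
    awk -v limit="$1" '
        { total++ }
        $1 ~ /^Z/ { zombies++ }
        END { printf "%d of %d PIDs in use, %d zombies\n", total, limit, zombies }'
}
```

On Linux, assuming /proc is mounted, it could be fed like this:

    $ ps -e -o stat= | table_usage "$(cat /proc/sys/kernel/pid_max)"

A zombie count that keeps growing between runs is the signal to go and look at the parent.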
The parent is always the bug. Read its code. Add the missing
wait() call. Install a SIGCHLD handler. The zombies are the
symptom of a bug located exactly one level up the process tree, and
the fix lives there.