Christian Lehnert — Linux, Hacking & Faith

Dirty COW - The Nine-Year-Old Root Bug Hiding in Copy-on-Write

Christian Lehnert · 2016-10-21 · ~7 min read

On 19 October the kernel got its branded vulnerability for the year. CVE-2016-5195, christened Dirty COW, complete with a logo and a .ninja domain, because apparently that is how we ship security advisories now. Linus had already merged the fix the day before. Distributions scrambled overnight, and by the time most people read the news their apt-get upgrade was already waiting for them.

Strip away the branding and what is left is genuinely unpleasant: a race condition in the kernel's copy-on-write logic that lets any unprivileged local user obtain write access to read-only memory mappings, and from there, root. It had been sitting in the get_user_pages() path (mm/gup.c in current trees) since kernel 2.6.22, released in 2007. Nine years. Every Debian, every Ubuntu, every Android phone, every container host built on a kernel from the last near-decade was exploitable, and we found out because Phil Oester pulled the exploit out of an HTTP packet capture on a compromised server. Nobody reported the bug. Somebody was already using it.

Copy-on-write, the one-paragraph version

When you mmap a file as a private mapping (MAP_PRIVATE), the kernel does not copy anything. Your process and the page cache share the same physical pages. The copy is deferred until the first write; that is the "copy" in copy-on-write. A write to a writable private mapping triggers a page fault, the fault handler allocates a fresh private page, copies the original into it, and points your mapping at the copy. If the mapping is also read-only (PROT_READ), an ordinary write simply gets a SIGSEGV; only a forced write through the kernel can trigger the copy. Either way, the original page-cache page, the one backed by the file on disk, is never touched. That is the entire safety guarantee: your writes land on your copy, never on the shared, read-only original.
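A minimal sketch of that guarantee, in Python rather than C for brevity. It assumes a Linux box and works on a throwaway temp file; mmap.ACCESS_COPY is Python's spelling of a private, copy-on-write mapping, and the function name is made up for illustration.

```python
import mmap
import os
import tempfile


def private_write_leaves_file_untouched(original=b"this is not a test\n"):
    """Write through a private mapping; return (bytes seen in the mapping, bytes on disk)."""
    # Throwaway file standing in for "the file on disk".
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(original)
        path = tmp.name
    try:
        with open(path, "rb") as f:
            # ACCESS_COPY requests a MAP_PRIVATE mapping: writes go to a private copy.
            with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_COPY) as m:
                m[0:4] = b"MOOO"      # first write faults and breaks COW for this page
                seen = m[:]           # what this process now sees through the mapping
        with open(path, "rb") as f:
            on_disk = f.read()        # re-read the file through the normal path
        return seen, on_disk
    finally:
        os.unlink(path)


if __name__ == "__main__":
    seen, on_disk = private_write_leaves_file_untouched()
    print("mapping:", seen)
    print("file:   ", on_disk)
```

The mapping shows the modified bytes while the file still holds the original text. That separation is what the rest of this post is about.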

Dirty COW breaks that guarantee.

The race

The kernel lets you write to your own address space through /proc/self/mem. Internally that path calls get_user_pages() with the "force" flag, which is allowed to break COW even on a read-only mapping — it has to, otherwise debuggers like GDB could never patch read-only code. So the sequence for a legitimate write is roughly:

  1. Fault in the page, breaking COW: allocate a private copy, dirty it, redirect the mapping.
  2. Write your bytes into that private copy.

The bug is that you can run a second thread that, in between those two steps, calls madvise(addr, len, MADV_DONTNEED) on the same mapping. MADV_DONTNEED tells the kernel "I no longer need this private copy", so the kernel throws the private page away. The write via /proc/self/mem then has to fault the page in again, but by this point the retry loop in mm/gup.c had already dropped its write requirement (the FOLL_WRITE flag) on the assumption that COW was done. The second fault is therefore handled like a read, and it resolves straight to the original, shared, file-backed page instead of breaking COW again.

You hammer those two threads in a tight loop. Most iterations do nothing. Eventually the window lines up, and your write lands on the page-cache page of a file you only have read access to. The kernel marks it dirty and, in due course, flushes it back to disk. You just edited a root-owned, read-only file as a nobody.
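The interleaving is easier to see as a toy state machine. This is a simplification in Python, not kernel code and not an exploit: the function and variable names are invented, the second thread is reduced to a single flag, and the only real kernel names are the two flags mentioned in the comments (FOLL_WRITE, and FOLL_COW, which the upstream fix introduced).

```python
def forced_write(discard_in_window, patched=False):
    """Toy model of a forced write to a read-only private mapping.

    Returns a label for the page that ends up receiving the bytes.
    """
    mapped = None        # what the page-table entry points at right now
    want_write = True    # stands in for FOLL_WRITE
    cow_done = False     # stands in for FOLL_COW, added by the upstream fix
    raced = False        # the toy madvise thread fires at most once

    while True:
        # Lookup: is there a page we are allowed to use for this request?
        if mapped is not None:
            if not want_write:
                return mapped                      # pre-fix: any mapped page will do
            if patched and cow_done and mapped == "private copy":
                return mapped                      # post-fix: must be the COWed page

        # Lookup failed, so fault the page in.
        if want_write:
            mapped = "private copy"                # write fault: COW is broken
            if patched:
                cow_done = True                    # remember it, keep asking for write
            else:
                want_write = False                 # pre-fix: retry as if it were a read
        else:
            mapped = "page cache"                  # read fault: maps the shared file page

        # The window: the other thread's madvise(MADV_DONTNEED) drops the mapping.
        if discard_in_window and not raced:
            mapped = None
            raced = True


if __name__ == "__main__":
    print(forced_write(discard_in_window=False))               # private copy
    print(forced_write(discard_in_window=True))                # page cache
    print(forced_write(discard_in_window=True, patched=True))  # private copy
```

In the model, the only difference between the safe and unsafe outcome is whether the loop forgets that it wanted to write. The real fix is more involved, but that is the shape of it.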

# the canonical harmless demonstration — a file you can read but NOT write
$ echo "this is not a test" > foo
$ sudo chown root:root foo && sudo chmod 0404 foo
$ ls -l foo
-r-----r-- 1 root root 19 Oct 21 14:02 foo
 
# run the dirtyc0w PoC against it as an unprivileged user...
$ ./dirtyc0w foo m00000000000000000
mmap 7f... madvise 0 procselfmem 1800000000
 
$ cat foo
m00000000000000000     # read-only, root-owned, and yet overwritten

I am deliberately showing the benign version — overwriting a throwaway file to prove the primitive exists. The weaponised variants do the obvious thing: instead of foo, they target a root-owned setuid binary like /usr/bin/passwd (back it up, overwrite its executable bytes with shellcode that drops a root shell, run it, restore), or simply rewrite a line in /etc/passwd. The mechanism is identical. The target is the only thing that changes, and root-owned read-only files that an unprivileged process can mmap are not exactly scarce.

"It's only local privilege escalation"

That is the sentence people reach for to feel calm about a bug like this, and in 2016 it is more wrong than it has ever been.

A decade ago, "local" meant someone already had a shell on your box, so the game was arguably half-lost already. That threat model is dead. Today "local" includes:

  • Every shared-hosting tenant on the same kernel as a thousand strangers.
  • Every CI runner executing untrusted pull-request code.
  • Every container. This is the one that should worry you. Docker is everywhere this year, and the comforting mental model — "containers are like lightweight VMs" — is false at exactly this layer. Containers share the host kernel. The COW logic in mm/gup.c is the host's COW logic. A Dirty COW exploit that works inside an unprivileged container works against the host kernel, which is a long way toward a container escape. Namespaces and cgroups partition resources; they do not give you a second kernel. Dirty COW is a clean reminder that your isolation boundary is only as strong as the one shared kernel underneath it.

"Local" is now the default position of half the code you run. Treating local privesc as second-tier is a 2009 instinct applied to a 2016 architecture.

The uncomfortable part: nine years

The bug did not hide because it was subtle to trigger; the PoC is a couple of threads and a loop. It hid because nobody with the right eyes was looking at that path with malice. Worse: the race had been spotted before. Linus's commit message for the fix says he tried to close it about eleven years earlier, and that attempt was backed out because it caused problems on s390. So the kernel carried a known-awkward path for the better part of a decade until someone weaponised it and got caught on a packet capture.

Linus, for his part, declined to treat it as special. His take, paraphrased: the spectacular security holes are not more important than the thousands of boring, ordinary bugs — there are simply far more of the boring ones, and they matter more in aggregate. He is right about the math, and it is still cold comfort when the "boring" bug is a nine-year-old free root.

The honest lesson is not "the kernel is insecure." It is that age is not assurance. "This code has been in the tree for nine years without incident" is not evidence of correctness; it is evidence that nobody published an incident. Dirty COW is what that distinction costs.

What to actually do

This is not a configuration problem, so there is no clever mitigation that beats patching. There is no sysctl, no permission change, no mount flag that closes the race. You update the kernel and you reboot.

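# Debian Jessie: the fixed kernel shipped within about a day (DSA for the Jessie kernel)
apt-get update
# upgrade the image package of the running kernel; upgrading only the
# linux-image-amd64 metapackage may leave the installed image untouched
apt-get install --only-upgrade "linux-image-$(uname -r)"
reboot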

The reboot is the unglamorous catch: a fix this deep in the memory manager is not something you hot-swap with a systemctl restart. Live-patching frameworks (kpatch, kGraft) existed and a few shops used them, but for the rest of us "schedule the reboot" is the answer, and "we couldn't take the downtime" is how you stay exploitable for another month.
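Installing the package is not the same as running it, so it is worth checking what the kernel itself reports after the reboot. A small sketch, assuming a Debian kernel whose `uname -v` string embeds the package version (the sample string in the code only illustrates the shape; the function name is invented). Compare the result against the fixed version named in the DSA for your release.

```python
import platform
import re


def debian_kernel_package_version(uname_v=None):
    """Return the Debian package version embedded in `uname -v`, or None if absent."""
    if uname_v is None:
        uname_v = platform.version()   # on Linux this is the same string as `uname -v`
    # Debian kernels report something shaped like "#1 SMP Debian <version> (<date>)".
    match = re.search(r"\bDebian (\S+)", uname_v)
    return match.group(1) if match else None


if __name__ == "__main__":
    # Illustrative string only, to show what gets extracted.
    sample = "#1 SMP Debian 3.16.7-ckt25-2 (2016-04-08)"
    print("sample: ", debian_kernel_package_version(sample))
    print("running:", debian_kernel_package_version())
```

If the running version is still the old one after an upgrade, the box has not been rebooted into the new kernel and is still exploitable.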

Detection is close to hopeless after the fact. The exploit is a couple of syscalls in a loop; it writes no log line, touches no audit trail you have enabled by default, and the only reason this one surfaced at all is that an attacker was sloppy enough to transfer it in cleartext over a link someone was capturing. Do not plan to catch the next one. Plan to not be running the vulnerable kernel when it arrives.

Bottom line

Dirty COW is not interesting because it is exotic. It is a plain race in a well-trodden function. It is interesting because it was trivial, ancient, and invisible all at once, and because it lands in the exact year we decided to put untrusted workloads on a shared kernel and call it isolation.

Patch the kernel. Reboot the box. And the next time someone shrugs that a bug is "only local," ask them how many strangers are currently sharing their kernel. In 2016, the answer is rarely zero.

Tagged: #linux #kernel #cve #privilege-escalation