This blog post was originally published on June 5, 2011.
On May 16, 2011, Fenghua Yu submitted a series of patches to the upstream Linux kernel implementing support for a new Intel CPU feature: Supervisor Mode Execution Protection (SMEP). The feature is enabled by setting a bit in the cr4 register, after which the CPU generates a fault whenever ring 0 attempts to execute code from a page marked with the user bit.
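As a rough illustration of the check described above, here is a toy Python model of when SMEP would raise a fault. It is a sketch, not kernel or hardware logic: the bit positions (bit 20 of cr4 for SMEP, bit 2 of a page table entry for the user flag) follow Intel's documentation, while the function name `smep_faults` and its parameters are invented for this example.

```python
# Toy model of the SMEP check; not kernel code.
X86_CR4_SMEP = 1 << 20   # CR4.SMEP is bit 20
PAGE_USER = 1 << 2       # U/S flag in an x86 page table entry


def smep_faults(cr4: int, cpl: int, pte_flags: int) -> bool:
    """Return True if an instruction fetch from this page would fault under SMEP.

    Hypothetical helper. Simplification: real hardware looks at the U/S flag
    at every level of the paging hierarchy, not just the final entry.
    """
    supervisor = cpl < 3                      # rings 0-2 count as supervisor mode
    smep_enabled = bool(cr4 & X86_CR4_SMEP)
    user_page = bool(pte_flags & PAGE_USER)
    return supervisor and smep_enabled and user_page
```

In this model, only the combination of supervisor mode, SMEP enabled, and a user page produces a fault; userland executing its own pages is unaffected.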
First, some background on why this feature is useful. Like most mainstream operating systems, the vanilla Linux kernel does not leverage x86 segmentation, instead defining flat segment descriptors with limits encompassing the entire 4 GB address space. Additionally, each process has the kernel’s page table entries replicated, resulting in the kernel address space being mapped in the upper 1 GB of every user process. Both decisions were made for performance reasons: reloading segment selectors at every trap and kernel-to-user (or vice versa) copy operation introduces a non-negligible (but not necessarily unacceptable) performance hit, and having completely separate user and kernel address spaces would necessitate a TLB flush on every trap, which is even more expensive.
As a result, the kernel is free to incorrectly access data residing in userspace, as well as to execute code in the user region. In addition to enabling the exploitation of many bugs that rely on the kernel incorrectly using user data, this allows kernel exploits to simply map a suitable payload in userspace and divert kernel execution to that payload.
The PaX project solves this problem in a general way with a feature
called PAX_UDEREF. When this feature is enabled, PaX
leverages segmentation to isolate user and kernel addresses, such that a
fault will be generated when the kernel incorrectly accesses user data
or code. Unfortunately, due to the performance hit associated with
reloading segment registers and the fact that this touches
mission-critical code, it’s unlikely that this solution would be
accepted into the upstream Linux kernel.
Update: I’m told by the PaX team that recent
benchmarks have shown there is almost no measurable performance impact
for UDEREF on i386, as reloading segment registers has
become much cheaper since the initial benchmarks of this feature (on the
order of 16 cycles). However, it’s still unlikely that the upstream
kernel would find the feature suitable, since using segmentation would
be a significant departure from current kernel design principles.
Enter SMEP. Now, the mainline Linux kernel can take advantage of a
subset of this protection at essentially no performance cost, as the
functionality is presumably implemented in hardware in a way that’s
similar to existing CPL checks. With SMEP enabled, it’s no longer
possible to map exploit payloads in userland, as the CPU will trigger a
fault if it attempts to execute those user pages in kernel mode. Note
that this is still only a subset of what UDEREF protects
against, as it does nothing to prevent the kernel from incorrectly
accessing user data as opposed to code. But it’s certainly a
start.
It may take a while for the hardware to catch up – it doesn’t seem that any existing CPUs actually implement SMEP, and we all know how long adoption of hardware NX has taken (and continues to take). However, once SMEP is widespread, what are kernel exploit writers going to do? Is this the end of Linux kernel exploits?
Of course not. While SMEP is definitely a very good security feature and is a step in the right direction, no single feature is going to “win security”. Let’s go into a few ways to bypass this protection (I’m sure there are more).
The first problem is that the kernel’s page permissions aren’t yet in a
completely sane state. By compiling a kernel with
CONFIG_X86_PTDUMP (or using Kees Cook’s modularized version
of this feature), we can take a look at the permissions of kernel pages
via the /sys/kernel/debug/kernel_page_tables debugfs file.
In particular, we’re interested in pages that are both writable and
executable:
# grep RW /sys/kernel/debug/kernel_page_tables | grep -v NX
0xc009b000-0xc009f000 16K RW GLB x pte
0xc00a0000-0xc0100000 384K RW GLB x pte
0xc1400000-0xc1580000 1536K RW GLB x pte
The first two regions are especially useful, since they will appear at static addresses on many modern 32-bit kernels. The first region is reserved for the BIOS, and the second is the so-called “I/O hole” used for DMA. While it’s probably best to avoid scribbling all over the I/O hole, as it’s commonly used at runtime, there’s no reason that writing into the BIOS region would cause any stability issues after booting is complete.
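The same filtering can be done programmatically. The sketch below parses lines in the format shown above and keeps mappings that are writable and not marked NX. The sample input is copied from the output above; the regular expression and the function name `writable_executable` are assumptions made for this example, and dumps from other kernel versions may use a different layout.

```python
import re

# Sample lines copied from the dump shown above.
SAMPLE = """\
0xc009b000-0xc009f000 16K RW GLB x pte
0xc00a0000-0xc0100000 384K RW GLB x pte
0xc1400000-0xc1580000 1536K RW GLB x pte
"""

# start-end, size, then the remaining flag columns (assumed layout).
LINE_RE = re.compile(r"^0x([0-9a-f]+)-0x([0-9a-f]+)\s+(\S+)\s+(.*)$")


def writable_executable(dump: str):
    """Yield (start, end, size, flags) for mappings that are RW and not NX."""
    for line in dump.splitlines():
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # skip anything that is not a mapping line
        start, end = int(m.group(1), 16), int(m.group(2), 16)
        flags = m.group(4).split()
        if "RW" in flags and "NX" not in flags:
            yield start, end, m.group(3), flags


regions = list(writable_executable(SAMPLE))
for start, end, size, _ in regions:
    print(f"{start:#010x}-{end:#010x} {size}")
```

On a live system, the input would come from reading /sys/kernel/debug/kernel_page_tables as root instead of the embedded sample.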
So, if we have a kernel write primitive, all we have to do is write
our payload into the BIOS region and divert execution there. If the
target kernel leaks symbol locations via /proc/kallsyms or
similar, then diverting execution is a simple matter of resolving the
address of a suitable function pointer, overwriting it, and triggering
it. Otherwise, it’s trivial to issue a sidt instruction to retrieve the
address of the IDT and set up a trap handler pointing into the payload.
SMEP will have nothing to complain about, since we never cause the
kernel to attempt to execute from user pages.
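For reference, in 32-bit mode the sidt instruction stores six bytes: a 16-bit limit followed by a 32-bit base address. Python cannot execute sidt directly, so the sketch below only shows how those six bytes would be decoded once obtained (for instance from a small C or assembly helper). The byte values and the function name `decode_idtr` are hypothetical.

```python
import struct


def decode_idtr(raw: bytes):
    """Decode the 6-byte value stored by sidt in 32-bit mode: limit, then base."""
    if len(raw) != 6:
        raise ValueError("32-bit sidt stores exactly 6 bytes")
    limit, base = struct.unpack("<HI", raw)  # little-endian u16 + u32, no padding
    return limit, base


# Hypothetical bytes, as if copied out of a buffer filled by sidt.
example = struct.pack("<HI", 0x07FF, 0xC1234000)
limit, base = decode_idtr(example)
# Each 32-bit gate descriptor is 8 bytes; limit is the table size minus one.
print(f"IDT base {base:#010x}, {(limit + 1) // 8} gate descriptors")
```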
A second way to bypass this protection is to leverage the
addr_limit variable, which resides in the
thread_info structure at the base of each process’ kernel
stack.
As described in Jon Oberheide’s and my presentation on Stackjacking,
it’s possible to exploit the leakage of uninitialized stack data, a
common bug, in order to infer the address of the base of a process’
kernel stack. I developed a library called libkstack to do so
generically. Once this address is inferred, a kernel write vulnerability
can simply write ULONG_MAX (0xffffffff) into
the addr_limit variable, which is at a reliable offset from
the kernel stack base. At this point, arbitrary kernel memory can be
read from and written to, since all kernel copy functions will accept
kernel pointers as user arguments. For example, you can do a
write(pipefd, kernel_addr, len) to read the data from
kernel_addr into a pipe, to be retrieved later. Once you
have an arbitrary kernel read and write, the current process’ cred
structure can be found and written into, escalating privileges to root.
Again, this attack does not require executing any user code with kernel
privileges, so SMEP cannot stop it.
Update: it’s worth noting that grsecurity protects
against this type of attack by removing the thread_info
structure from the kernel stack.
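To make the mechanics concrete, here is a simplified Python model of the two pieces involved: locating thread_info from a kernel stack address, and the range check that addr_limit governs. It assumes an 8 KB kernel stack and a 3 GB / 1 GB split, which were typical for 32-bit x86 kernels at the time; the function names and example addresses are illustrative, and the real kernel check is written differently.

```python
THREAD_SIZE = 8192        # assumed: two 4 KB pages per kernel stack
USER_DS = 0xC0000000      # assumed default addr_limit with a 3 GB / 1 GB split
KERNEL_DS = 0xFFFFFFFF    # ULONG_MAX on a 32-bit kernel


def thread_info_base(kernel_sp: int) -> int:
    """thread_info sits at the lowest address of the THREAD_SIZE-aligned stack."""
    return kernel_sp & ~(THREAD_SIZE - 1)


def access_ok(addr: int, size: int, addr_limit: int) -> bool:
    """Simplified range check: the whole buffer must lie below addr_limit."""
    return size <= addr_limit and addr <= addr_limit - size


kernel_ptr = 0xC1000000   # hypothetical kernel address
print(access_ok(kernel_ptr, 4096, USER_DS))    # rejected under the normal limit
print(access_ok(kernel_ptr, 4096, KERNEL_DS))  # accepted once addr_limit is overwritten
```

Once the limit is raised, the copy functions treat kernel pointers like any other user-supplied buffer, which is what turns a single write into an arbitrary read and write.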
In the event that kernel symbols can be resolved on the target kernel
(especially common on distro kernels) and the attacker has a stack
overflow or another vulnerability that allows pivoting the stack pointer
into an area of attacker controlled data, kernel ROP is possible.
Fortunately, the setup_smep function, which has code to
both enable and disable the SMEP bit in the cr4 register, is marked
__init, so it’s likely to have been cleaned up by the
kernel after initialization and is not a good candidate for ROP.
However, more complex ROP payloads are certainly possible, as I hope to
demonstrate later this year. For now, I’ll leave this up to your
imagination. ;-)
Some progress on removing useful sources of information leakage has
been made with the kptr_restrict and
dmesg_restrict sysctls. Continued work on plugging similar
leaks should improve the usefulness of these features. However, it’s
still trivial to resolve the locations of kernel code and data on
distribution kernels, since they are shipped as binaries that are
identical across all machines with the same kernel version. This is
demonstrated perfectly by Jon Oberheide’s ksymhunter project.
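As a small illustration of what these sysctls change from an unprivileged point of view, the sketch below parses text in the /proc/kallsyms format (address, type, name) and reports whether the addresses have been zeroed out, which is what an unprivileged reader sees when kptr_restrict is in effect. The helper names and the sample addresses are made up for this example.

```python
def parse_kallsyms(text: str) -> dict:
    """Map symbol name -> address from text in /proc/kallsyms format."""
    symbols = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 3:
            continue  # ignore blank or malformed lines
        addr, _sym_type, name = parts[0], parts[1], parts[2]
        symbols[name] = int(addr, 16)
    return symbols


def addresses_hidden(symbols: dict) -> bool:
    """True if every address is zero, i.e. the kernel is masking pointers."""
    return bool(symbols) and all(addr == 0 for addr in symbols.values())


# Hypothetical samples: one unrestricted view, one masked view.
visible = parse_kallsyms("c1000000 T _text\nc15a2000 D idt_table\n")
masked = parse_kallsyms("00000000 T _text\n00000000 D idt_table\n")
print(addresses_hidden(visible), addresses_hidden(masked))
```

On a real host the input would be the contents of /proc/kallsyms. As the paragraph above notes, masked output does not help much when the same addresses can be recovered from the distribution's kernel binary.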
The solution I’m currently working on is to randomize the address at
which the kernel is decompressed at boot. This way, even if an attacker
can download a kernel image identical to the one on the target host,
they won’t know where kernel data and code reside in the running
kernel, assuming an absence of information leakage. In order to be
effective, this solution requires relocating the IDT - otherwise, it
will reside at the location pointed to by the idt_table
symbol, and an sidt instruction would allow an attacker to calculate the
offsets of every other kernel symbol relative to the address of the IDT.
This has its own challenges, but I’m making progress and hope to submit
a working version in the coming weeks. This will also have the useful
side effect of marking the IDT read-only, which will prevent it from
being a generic target for kernel write vulnerabilities.
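The arithmetic behind the IDT concern is simple, which is why the leak matters. The sketch below shows how a single leaked address would undo the randomization if the IDT stayed at its idt_table location: subtract the static address from the leaked one to get the slide, then apply that slide to any other symbol. All addresses and the name `target_symbol` are hypothetical.

```python
def kernel_slide(leaked_idt_base: int, static_idt_table: int) -> int:
    """Offset between where the kernel actually runs and where it was linked."""
    return leaked_idt_base - static_idt_table


def rebase(static_addr: int, slide: int) -> int:
    """Apply the slide to a link-time address (32-bit wraparound)."""
    return (static_addr + slide) & 0xFFFFFFFF


# Hypothetical link-time addresses taken from a downloaded kernel image.
static_symbols = {"idt_table": 0xC15A2000, "target_symbol": 0xC1057A10}

leaked = 0xC25A2000  # hypothetical IDT base obtained via sidt
slide = kernel_slide(leaked, static_symbols["idt_table"])
print(f"slide {slide:#x}, target at {rebase(static_symbols['target_symbol'], slide):#010x}")
```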
Next, more work needs to be done on making sure page protections in the kernel are sane. Most importantly, RWX mappings should be removed and function pointer tables should be made read-only. Fortunately, efforts are underway in this area as well, with help from Kees Cook.
Hopefully, the combination of reduced information leakage (both by plugging individual leaks and by randomizing the kernel image location), stronger page protections in the kernel, and SMEP will significantly raise the bar for exploiting the Linux kernel.