Paging in Linux on x86 – part 2

Our last post left the story of paging in Linux on x86 incomplete. Today we will cover that. This is a vast topic whose tentacles go far and wide into almost all aspects of the kernel. So we will cover only some interesting highlights. Our focus is on x86 without Physical Address Extension (PAE) enabled.

Data structures:

Although x86 uses two-level page tables, there are architectures which use three or four levels too. For example, x86 with Physical Address Extension (PAE) uses three page tables and the 64-bit x86_64 uses four page tables. Linux aims to cover all architecures and therefore uses four page tables. They are:

  • Page Global Directory
  • Page Upper Directory
  • Page Middle Directory
  • Page Table

This means, Linux divides virtual address into five parts – one for indexing into each of the tables above and one offset into the physical page frame. An entry in each of the tables above is a 32-bit unsigned int on x86 (without PAE). These data types are used to represent each of them respectively: pgd_t, pud_t, pmd_t and pte_t. There is also a set of helper macros and functions used to manipulate them.

On x86:

On x86, there are only two levels of page tables. Linux reconciles that with its four levels by nullifying the effect of PUD and PMD. It does this by keeping just one record in each of them. So practically, it is only using PGD and Page Table.

Kernel typically divides virtual address space into 3GB from 0x00000000 to 0xbfffffff for user space and 0xc0000000 to 0xffffffff for kernel space. In kernel code, the macro PAGE_OFFSET contains virtual address at which kernel starts, i.e. 0xc0000000 on a typical x86 set up.

Early on during boot, kernel learns size of RAM by querying BIOS. Then it scans physcial addresses to find those addresses which are unavailable. They can be:

  • addresses which are mapped with hardware devices’ I/O (memory-mapped I/O)
  • addresses pointing to page frames containing BIOS data

Typically the kernel lodges itself at physical address 0x00100000, i.e. 2nd MB onwards. Reasons for skipping first MB are architecture specfic – not just x86, but other machines also do some special things in that first MB of physical memory. Those page frames in which kernel sits never get swapped out to disk. Kernel also never swaps out unavailable addresses mentioned above.

Kernel address mapping:

From the 1 GB of virtual address space that kernel occupies, 896MB is directly mapped to physical addresses, i.e. there is a one-to-one mapping. Since kernel pages are never swapped out, this mapping always holds true. Macro __pa() converts kernel virtual address into physical address. It basically does simple maths like

Physical address = virtual address – PAGE_OFFSET

Another macro __va does the same thing in reverse.

Kernel sets aside the highest 128MB of its 1GB address space for non-contiguous allocations (high mem) and what is called fix-mapped linear addresses. Non-contiguous allocations is a separate topic on its own but we will quickly describe fix-mapped linear addresses before concluding this article. They are basically constant mappings from virtual to physical addresses but unlike first 896MB, they don’t follow simple offset formula above. Their mapping is arbitrary but fixed nonetheless. Kernel uses them as pointers as they are more efficient in terms of memory accesses required.


Paging in Linux on x86

In our last post we covered how x86 logical address is translated into linear address. In this one we will look at translation from linear to physical. We will use the terms ‘virtual address’ and ‘linear address’ interchangeably.

A piece of hardware called paging unit is responsible for converting virtual addresses to physical. However, the operating system needs to set it up with correct data structures – page tables. On x86, paging is enabled by setting a flag inside a special register. When that flag is zero, paging is not enabled and linear addresses are treated as physical addresses. Linux first sets up page tables and then enables paging.

Pages and page tables

For ease of management of memory, e.g. access rights, physical memory is divided into `page frames`. These are contiguous cells of RAM, usually 4KB in size. Corresponding to each physical page frame there is a `page` of virtual addresses. For instance virtual addresses 0x20300000 – 0x20301000 represent a page which corresponds to 4096 physical addresses each of which points to a cell (one byte) in RAM. A page as well as a page frame represent contiguous addresses, so inside a page, the virtual-to-physical mapping is one-to-one. Page is basic unit of memory management in Linux. A key function of paging unit is to check type of access to a virtual address (read or write) against access rights of the page to which that virtual address belongs. When access right is violated, paging unit generates a Page Fault.

Page table is an array in RAM which maps virtual address to physical address. Each user process has its own page table and when a context switch happens, the page tables are changed as part of it. Each entry inside page table points to a page frame inside RAM. So a 32-bit virtual address has two parts: page table index (20 most significant bits) and page offset (12 bits because page size is 4096). Using page tabele index, we will get page frame. Inside page frame we use page offset to get the exact memory cell, the byte that the virtual address points to.

A naive way of organising page table would be to have one page table whose indices are 20 most significant bits of virtual address and whose values contain (among other things) physical address of page frame. That would be wasteful. If each entry is 4 bytes, a page table would require (2^20 * 4) bytes = 4MB of RAM. That is for each process. x86 instead breaks single page table into two: Page Directory and Page Table. Virtual address is also divided into three parts: index inside Page Directory, to get Page Table entry, index inside Page Table entry to get page frame address, and then the same 12-bit page offset to find the cell inside page frame. This way, each process will have to have a Page Directory but there is no need to allocate all Page Tables upfront. Instead Page Tables can be set up when they are needed.

Management of Pages

Physical address of Page Directory is stored in a special register and that registered is updated when there is a context switch. Entries in Page Directory and Page Table have same format. Along with address of corresponding page frames (or Page Table in case of Page Directory’s entry), it stores privilege level needed to access that page. The privilege level is a single byte so has two possible values. It depends upon CPU Privilege Level (CPL) – a two byte value on x86 which represents four levels. In page table entry, it only checks whether a page requires supervisor mode (CPL = 0) or not (CPL = 1, 2 or 3).

Page table entry also contains access type allowed: read and write. In contrast, access rights for segments are three: read, write and execute. So a page which is read only cannot be written to.

What about Linux?

As you might have noticed, this post hasn’t really lived up to its title and only talks about paging in x86. Time and other conditions permitting, we will discuss paging in Linux in a follow-up article.