Paging in Linux on x86 – part 2

Our last post left the story of paging in Linux on x86 incomplete; today we pick it up again. This is a vast topic whose tentacles reach into almost every part of the kernel, so we will cover only some interesting highlights. Our focus is on x86 without Physical Address Extension (PAE) enabled.

Data structures:

Although x86 uses two-level page tables, other architectures use three or four levels. For example, x86 with Physical Address Extension (PAE) uses three levels and 64-bit x86_64 uses four. Linux aims to cover all architectures with a single model and therefore uses four levels of page tables. They are:

  • Page Global Directory
  • Page Upper Directory
  • Page Middle Directory
  • Page Table

This means Linux divides a virtual address into five parts: one index into each of the tables above and one offset into the physical page frame. An entry in each of these tables is a 32-bit unsigned integer on x86 (without PAE). The data types pgd_t, pud_t, pmd_t and pte_t represent entries of the respective tables. There is also a set of helper macros and functions used to manipulate them.
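To make the idea of typed table entries concrete, here is a minimal user-space sketch. It assumes the non-PAE x86 entry layout (bit 0 is Present, bit 1 is Read/Write, the top 20 bits hold the page frame address). The names sketch_pte_t, make_pte, PTE_PRESENT and so on are made up for this illustration; they only imitate the style of the kernel's own types and helpers and are not the kernel's definitions.

```c
#include <stdbool.h>
#include <stdint.h>

/* Wrapping the raw 32-bit value in a struct lets the compiler reject code
 * that accidentally mixes entries from different table levels. */
typedef struct { uint32_t pgd; } sketch_pgd_t;
typedef struct { uint32_t pte; } sketch_pte_t;

/* Accessors in the spirit of the kernel's helper macros (names are ours). */
#define sketch_pgd_val(x) ((x).pgd)
#define sketch_pte_val(x) ((x).pte)

/* Hardware flag bits of a non-PAE x86 page table entry. */
#define PTE_PRESENT 0x001u      /* page is in memory */
#define PTE_RW      0x002u      /* page is writable */
#define PTE_FRAME   0xfffff000u /* top 20 bits: page frame address */

/* Build an entry from a page-aligned physical address and flag bits. */
static inline sketch_pte_t make_pte(uint32_t frame_address, uint32_t flags)
{
    sketch_pte_t pte = { (frame_address & PTE_FRAME) | (flags & ~PTE_FRAME) };
    return pte;
}

static inline bool pte_is_present(sketch_pte_t pte)
{
    return (sketch_pte_val(pte) & PTE_PRESENT) != 0;
}

static inline uint32_t pte_frame_address(sketch_pte_t pte)
{
    return sketch_pte_val(pte) & PTE_FRAME;
}
```

For instance, make_pte(0x00100000, PTE_PRESENT | PTE_RW) yields an entry whose raw value is 0x00100003.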

On x86:

On x86 there are only two levels of page tables. Linux reconciles that with its four levels by nullifying the effect of PUD and PMD: each of them holds just one entry, so they consume no bits of the virtual address. In practice, only the PGD and the Page Table are used.
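The two-level split can be sketched in a few lines of user-space C. The sketch assumes the usual non-PAE layout of a 32-bit address: 10 bits of PGD index, 10 bits of Page Table index and a 12-bit offset into a 4KB page. The struct and function names are invented for this example.

```c
#include <stdint.h>

#define PAGE_SHIFT   12   /* 4KB pages: low 12 bits are the offset */
#define PGDIR_SHIFT  22   /* top 10 bits index the PGD */
#define PTRS_PER_PGD 1024 /* entries in the PGD */
#define PTRS_PER_PTE 1024 /* entries in one Page Table */

/* The three parts that actually carry information on non-PAE x86.
 * PUD and PMD are folded away, so they get no bits. */
struct va_parts {
    uint32_t pgd_index;
    uint32_t pte_index;
    uint32_t offset;
};

static inline struct va_parts split_va(uint32_t va)
{
    struct va_parts parts;
    parts.pgd_index = (va >> PGDIR_SHIFT) & (PTRS_PER_PGD - 1);
    parts.pte_index = (va >> PAGE_SHIFT) & (PTRS_PER_PTE - 1);
    parts.offset    = va & ((1u << PAGE_SHIFT) - 1);
    return parts;
}
```

As a worked case, the address 0xc0100000 splits into PGD index 768, Page Table index 256 and offset 0.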

The kernel typically divides the 4GB virtual address space into 3GB for user space, from 0x00000000 to 0xbfffffff, and 1GB for kernel space, from 0xc0000000 to 0xffffffff. In kernel code, the macro PAGE_OFFSET holds the virtual address at which kernel space starts, i.e. 0xc0000000 on a typical x86 setup.
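The split is easy to express in code. The sketch below assumes the typical value of PAGE_OFFSET quoted above; the helper names are made up for this example and do not exist in the kernel.

```c
#include <stdbool.h>
#include <stdint.h>

/* Typical 3GB/1GB split; other values are possible on other setups. */
#define PAGE_OFFSET 0xc0000000u

/* True if the virtual address lies in the kernel's part of the space. */
static inline bool is_kernel_va(uint32_t va)
{
    return va >= PAGE_OFFSET;
}

/* Size of the user part: everything below PAGE_OFFSET. */
static inline uint64_t user_space_bytes(void)
{
    return PAGE_OFFSET;
}

/* Size of the kernel part: from PAGE_OFFSET to the top of the 4GB space. */
static inline uint64_t kernel_space_bytes(void)
{
    return (1ull << 32) - PAGE_OFFSET;
}
```

With the assumed PAGE_OFFSET, user_space_bytes() comes to 3GB and kernel_space_bytes() to 1GB.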

Early during boot, the kernel learns the size of RAM by querying the BIOS. It then scans physical addresses to find those that are unavailable. They can be:

  • addresses which are mapped with hardware devices’ I/O (memory-mapped I/O)
  • addresses pointing to page frames containing BIOS data

Typically the kernel loads itself at physical address 0x00100000, i.e. from the second MB onwards. The reasons for skipping the first MB are architecture specific: not just x86, other machines also do special things in that first MB of physical memory. The page frames in which the kernel sits are never swapped out to disk. The kernel also never swaps out the unavailable addresses mentioned above.

Kernel address mapping:

Of the 1GB of virtual address space that the kernel occupies, 896MB is directly mapped to physical addresses, i.e. each virtual address corresponds to the physical address at a fixed offset from it. Since kernel pages are never swapped out, this mapping always holds. The macro __pa() converts a kernel virtual address into a physical address. It does simple arithmetic:

physical address = virtual address - PAGE_OFFSET

Another macro, __va(), does the same thing in reverse: it adds PAGE_OFFSET to a physical address to get the kernel virtual address.
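Both conversions can be tried out in user space. The sketch below assumes PAGE_OFFSET is 0xc0000000 and that exactly 896MB is directly mapped. The names sketch_pa and sketch_va are ours, and the range assertions are an addition of this example rather than something the kernel macros are claimed to do.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_OFFSET 0xc0000000u
#define LOWMEM_SIZE (896u << 20) /* 896MB of directly mapped memory */

/* Kernel virtual address -> physical address (directly mapped part only). */
static inline uint32_t sketch_pa(uint32_t va)
{
    assert(va >= PAGE_OFFSET && va - PAGE_OFFSET < LOWMEM_SIZE);
    return va - PAGE_OFFSET;
}

/* Physical address -> kernel virtual address (directly mapped part only). */
static inline uint32_t sketch_va(uint32_t pa)
{
    assert(pa < LOWMEM_SIZE);
    return pa + PAGE_OFFSET;
}
```

For example, the physical address 0x00100000 where the kernel typically sits corresponds to the virtual address 0xc0100000, and sketch_pa(sketch_va(x)) returns x for any x inside the first 896MB.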

The kernel sets aside the highest 128MB of its 1GB address space for non-contiguous allocations, for mapping high memory, and for what are called fix-mapped linear addresses. Non-contiguous allocation is a topic of its own, but we will quickly describe fix-mapped linear addresses before concluding this article. They are constant mappings from virtual to physical addresses but, unlike the first 896MB, they don’t follow the simple offset formula above. Their mapping is arbitrary but fixed nonetheless. Since a fix-mapped address is a constant known at compile time, the kernel can use it in place of a pointer variable, which saves the memory access otherwise needed to load the pointer’s value.
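One way to picture fix-mapped addresses is as slots counted downwards from the top of the address space, one page per slot. The sketch below follows that scheme. The top address 0xfffff000, the slot names and the function names are all assumptions made for illustration; the real values depend on kernel version and configuration.

```c
#include <stdint.h>

#define PAGE_SHIFT  12
#define FIXADDR_TOP 0xfffff000u /* assumed address of slot 0 */

/* Hypothetical slots; each index names one fixed virtual page. */
enum sketch_fixed_addresses {
    FIX_EXAMPLE_A,
    FIX_EXAMPLE_B,
    FIX_EXAMPLE_C,
    FIX_EXAMPLE_END
};

/* Slot index -> virtual address. With a constant index, the compiler can
 * fold the whole expression into a constant address. */
static inline uint32_t sketch_fix_to_virt(unsigned int idx)
{
    return FIXADDR_TOP - ((uint32_t)idx << PAGE_SHIFT);
}

/* Virtual address inside a slot -> slot index. */
static inline unsigned int sketch_virt_to_fix(uint32_t va)
{
    return (FIXADDR_TOP - (va & ~0xfffu)) >> PAGE_SHIFT;
}
```

Under these assumptions FIX_EXAMPLE_B lives at 0xffffe000. Which physical page that slot points to is decided separately, when the kernel fills in the corresponding page table entry, and that is why the mapping can be arbitrary yet fixed.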