## Why Has Heat Dissipation Slowed CPU Speeds?

Since around 2003, growth in CPU clock speeds has stagnated. The speeds have remained roughly between 3GHz and 4GHz over the past 17 years. In decades leading up to 2003, however, the CPU clock speeds had almost been doubling every two years. The main reason for the slowdown is that too much heat would be dissipated if clock speeds were higher that what they are. Why is that the case? This article tries to explain the answer in simplified mathematical terms. A proper treatment of the subject is way beyond the scope.

In CPU, heat dissipated is function of power consumed. More power means more heat.

Total power consumed by a chip is sum of static power and dynamic power. Static power is due to what is called leakage current. This is current that’s used up while the chip is not doing anything, i.e. when the transistors in CPU are not switching between logical 0 and 1 states.

P = IV

where P stands for power, I stands for current and V for voltage. In case of static power, the current is leakage current. There is not much we can do to reduce static power. A common solution is to turn off power for parts of the chip when they aren’t doing any work. This technique is called power gating. When no current flows, there can’t be leakage current which means there can’t be static power consumption.

Dynamic power is when the chip is doing something useful, i.e. when transistors are switching between 0 and 1 states. It helps our discussion to view dynamic power as energy consumed per unit of time. Energy consumed E when a transistor goes from 0 to 1 to 0 (or 1 to 0 to 1) is directly proportional to capacitive load C and square of voltage V:

E ∝ CV²

Energy used up when transistor switches just once, from 0 to 1 or from 1 to 0 is

E ∝ ¹/₂ CV²

Since power is energy consumed per unit of time, we can say

P ∝ ¹/₂ CV²f

where f is frequency with which the transistor changes its state per second.

Capacitive load is a function of how many resistors we connect to the output and things like capacitance of wires. It’s not something we can really control. More transistors means higher capacitive load. Voltage has been brought down from 5V in early chips to just under 1V now. It’s hard to optimise voltage any further. That leaves us with clock frequency. If we increase clock frequency then power goes up which means heat goes up too. In 2003 we hit that clock frequency wall at which heat dissipated was too much for the CPU. Which is why clock speeds have stagnated since.

## System Call Table in Linux

System call table is an array of function pointers. It is defined in kernel space as variable sys_call_table and it contains pointers to functions which implement system calls. Index of each function pointer in the array is the system call number for that syscall. These are denoted by NR_* macros in header files, such as /usr/include/asm/unistd_64.h for x86_64.

On x86 systems, when a user mode program makes a system call it puts the system call number in RAX register and calls sysenter assembly instruction. This instruction switches CPU from user mode into kernel mode. It sets instruction pointer RIP to the value stored in SYSENTER_EIP_MSR register and stack pointer RSP to the value stored in SYSENTER_ESP_MSR register. MSR is short for Model Specific Register.  These are registers which are present on specific models of Intel processors, such as 64-bit processors only. The above mentioned MSRs are set up by Linux kernel to contain addresses of system_call() kernel function for RIP and kernel mode stack belonging to the process which started the system call for RSP (yes each process – or more specifically thread – has a kernel mode stack in addition to user mode stack).

system_call() function is like a multiplexer for syscalls. It saves hardware context on stack, performs some checks, e.g. whether the process is being syscall-traced in which case it needs to notify the tracer, and if all checks pass, ultimately jump into function pointed to by the pointer at syscall number index inside system call table. Return from syscall happens with sysexit assembly instruction. Upon return, the hardware context is restored and execution continues in user-space code which usually is a libc wrapper routine.

## Linux Kernel Symbols

Kernel symbols are names of functions and variables. Global symbols are those
which are available outside the file they are declared in. Global symbols
in the Linux kernel currently running on a system are available through
`/proc/kallsyms` file. This includes symbols defined inside kernel modules

Global symbols are of two types:

1. those explicitly exported through EXPORT_SYMBOL_GPL and EXPORT_SYMBOL
macros, and
2. those which are not declared with `static` C keyword and hence visible to
code which is statically linked with the kernel itself and may be available
outside the kernel image.

The first type, explicitly exported ones, are denoted with capital letter in
output of `cat /proc/kallsyms` – e.g. T if the symbol is in text section, i.e.
a function name. The second type are denoted with small letter – e.g. t for a
function which isn’t exported via EXPORT_SYMBOL_GPL or EXPORT_SYMBOL.

Inside kernel code, we can access symbols which are exported explicity by
simply using them like other variables, e.g. by calling printk() function.

For global symbols which aren’t explicitly exported, but are still available,
we can attempt to access them by calling kallsyms_lookup_name() function,
defined in kernel/kallsyms.c:

`unsigned long kallsyms_lookup_name(const char *name);`

This takes symbol name as argument and returns its address in memory, i.e. a
pointer to it. The calling code can dereference the pointer to make use of that
symbol. If the symbol isn’t found, the function returns NULL.

## try/catch in Linux Kernel

In higher level languages like Java and C#, one can recover from unexpected bahaviour using try/catch like mechanism. Things are different inside Linux kernel. The code is considered trusted and reliable. Faults and exceptions have severe penalities. For example, division by zero will likely cause a kernel oops and the whole system to hang. So how would you run some code which you know can fault? Enter extable, a kernel mechanism by which one can have try/catch like facility. It is not a high-level extension to C programming language. Instead it works at assembly level.

We will take following piece of blatant division by zero and make it safe to run inside kernel using this exception handling mechanism. We will be assuming x86_64 platform in this article.

```void blatant_div_by_zero(void)
{
/* quotient and divisor */
int q, d;

d = 0;
asm ("movl \$20, %%eax;"
"movl \$0, %%edx;"
"div %1;"
"movl %%eax, %0;"
: "=r"(q)
: "b"(d)
:"%eax"
);

pr_debug("quotient is %d\n", q);
}```

## extable

extable is basically an ELF section inside Linux kernel binary image which contains mappings between potentially faulting instructions and their respective handlers. Users interface with extable through _ASM_EXTABLE* family of macros. All macros delegate to one macro:

`_ASM_EXTABLE_HANDLE(from, to, handler)`

`from`: address of faulting instruction, i.e. the ‘try’ part
`to`: address to which control will be transferred when fault occurs, i.e. the ‘catch’ part
`handler`: is the function which transfers execution to catch part

`handler` function is important here. It has following signature:

```bool handler(const struct exception_table_entry *fixup,
struct pt_regs *regs, int trapnr)```

`fixup` contains address of catch part, i.e. ‘to’ argument of _ASM_EXTABLE_HANDLE. `regs` contains registers, as they were when the fault happened and `trapnr` is the fault number. handler transfers control to catch part by simply setting instruction pointer register in regs argument to fixup address. However, before doing that, it has an opportunity to set up environment for catch part. That’s where the different wrappers around _ASM_EXTABLE_HANDLE come in. Each wrapper uses a differnt handler function. Let’s take a couple of examples from arch/x86/include/asm/asm.h.

```# define _ASM_EXTABLE(from, to) \
_ASM_EXTABLE_HANDLE(from, to, ex_handler_default)

# define _ASM_EXTABLE_FAULT(from, to) \
_ASM_EXTABLE_HANDLE(from, to, ex_handler_fault)```

ex_handler_default() is a handler function which simply sets instruction pointer to catch part. ex_handler_fault() moves fault number to rax register in addition to setting instruction pointer to catch part, so catch part can access fault number.

## Context of handler()

At this point, some might be wondering which context does handler() function execute in. Short answer is interrupt handler context. If that works for you, you may jump to the next section. Here we will do a very quick detour of how this interrupt context comes about.

After executing an instruction and before going into next instruction, the control unit checks if the just-executed instruction resulted in an interrupt or exception. If an exception did occur, then it determines vector associated with it. In case of our divide-by-zero fault, it will be vector 0. The control unit will then read the corresponding entry – zeroth entry for divide-by-zero – of Interrupt Descriptor Table (IDT). It is an array of entries inside processor’s RAM. Address of that table is stored inside the ‘idtr’ register. That entry in IDT in turn points to an entry in another table of entries called Global Descriptor Table (GDT). The entry inside GDT then info – such as base address of segment – related to handler of the exception whose vector we started out with. Processor then performs some checks and stores context when the exception occurred (e.g. register contents) on stack. When the handler() function discussed above sets instruction pointer, it actually sets instruction pointer in this saved context. After that process jumps to interrupt handler. It is this interrupt handler which then looks up extable section to find the handler() function for the faulting instruction. If such a handler function is found, the processor jumps to it. Therefore our handler function is executed inside interrupt context. Ultimately, when execution returns from interrupt context, the context that was saved on the stack before interrupt handler was executed, is restored. Since our handler function above sets instruction pointer and potentially other registers which are part of that pre-interrupt context which will be restored, execution will jump to ‘catch’ part after interrupt handler returns.

## Solution and How it Works

Using extable macro’s we can re-write our blatant_div_by_zero() function in a safe way. Let’s first see what the final solution looks like. Then we will break it down into easily understandable parts.

```void blatant_div_by_zero(void)
{
/* quotient and divisor */
int q, d;

d = 0;
asm volatile ("movl \$20, %%eax;"
"movl \$0, %%edx;"
"1: div %1;"
"movl %%eax, %0;"
"2:\t\n"
"\t.section .fixup,\"ax\"\n"
"3:\tmov\t\$-1, %0\n"
"\tjmp\t2b\n"
"\t.previous\n"
_ASM_EXTABLE(1b, 3b)
: "=r"(q)
: "b"(d)
:"%eax"
);

pr_debug("quotient is %d\n", q);
}```

There are two key differences here. First, we are adding some code to a section called .fixup. Second, _ASM_EXTABLE() macro. Code in .fixup section is the ‘catch’ part. _ASM_EXTABLE(), as we saw above, uses a default handler which merely sets instruction pointer to ‘catch’ part:

```__visible bool ex_handler_default(const struct exception_table_entry *fixup,
struct pt_regs *regs, int trapnr)
{
return true;
}```

First argument of _ASM_EXTABLE is label 1b for faulting instruction “1: div %1;”. Letter ‘b’ in 1b means backwards and doesn’t have functional significance in our context. Second argument, 3b, is the catch part which is inside .fixup section. Now let’s look at the execution flow with this set up. Division by zero instruction is executed, it faults, goes into exception handler at vector 0, which finds the correct handler – in our case ex_handler_default() – which in turn, sets instruction pointer to address at label 3, our catch part. As noted, label 3 is inside .fixup section. Inside label 3, we set quotient q to -1 and jump to label 2 which is known safe point after faulting instruction. From there, execution continues to next instruction like normal. Note that the next instruction is neither “\t.section .fixup,\”ax\”\n” (which is a directive to linker) nor “3:\tmov\t\$-1, %0\n” (which lives in a different section from where instruction at label 2 is).

Now let’s see what a real object file looks like when compiled with above extable code. Here is a hello-world kernel module, modified to contain blatant_div_by_zero() function. After compiling it, we can inspect its sections using readelf*:

```\$ readelf -S hello.ko
There are 34 section headers, starting at offset 0x10b8:

Size EntSize Flags Link Info Align
[ 0] NULL 0000000000000000 00000000
0000000000000000 0000000000000000 0 0 0
[ 1] .note.gnu.build-i NOTE 0000000000000000 00000040
0000000000000024 0000000000000000 A 0 0 4
[ 2] .text PROGBITS 0000000000000000 00000064
0000000000000000 0000000000000000 AX 0 0 1
[ 3] .init.text PROGBITS 0000000000000000 00000064
0000000000000034 0000000000000000 AX 0 0 1
[ 4] .rela.init.text RELA 0000000000000000 00000cd8
0000000000000060 0000000000000018 I 31 3 8
[ 5] .fixup PROGBITS 0000000000000000 00000098
000000000000000a 0000000000000000 AX 0 0 1
[ 6] .rela.fixup RELA 0000000000000000 00000d38
0000000000000018 0000000000000018 I 31 5 8
[ 7] .exit.text PROGBITS 0000000000000000 000000a2
000000000000000c 0000000000000000 AX 0 0 1
[ 8] .rela.exit.text RELA 0000000000000000 00000d50
0000000000000030 0000000000000018 I 31 7 8
[ 9] .rodata.str1.1 PROGBITS 0000000000000000 000000ae
000000000000001c 0000000000000001 AMS 0 0 1
[10] __ex_table PROGBITS 0000000000000000 000000cc
000000000000000c 0000000000000000 A 0 0 4
[11] .rela__ex_table RELA 0000000000000000 00000d80
0000000000000048 0000000000000018 I 31 10 8
[...]```

Above is partial output, containing the sections which are relevant to our discussion. First, there is .fixup section at number 5. Its type of PROGBITS means it will be loaded in memory and flag X means it will be executable, which is what we want since this will execute the catch part. Further down at number 10 is __ex_table section. This is loaded into memory (type PROGBITS) but not marked as executable. We will only use it to read address of handler function but won’t need to execute the section itself. There is a function in kernel/extable.c which searches extables for correct handler for a given faulting address. It is reproduced below:

```/* Given an address, look for it in the exception tables. */
const struct exception_table_entry *search_exception_tables(unsigned long addr)
{
const struct exception_table_entry *e;

e = search_extable(__start___ex_table,
if (!e)
return e;
}```

As you can see, if the search in kernel image fails, the function looks in kernel modules’ extables for a corresponding entry through the call to search_module_extables(addr). This is the function which will search our hello.ko module for entry corresponding to division-by-zero instruction.

* Apologies for the formatting. Please bear until I find a way to fix it in the wordpress editor.

Title Image: Catch taken when playing cricket in street – a popular pastime in South Asia. Taken from https://www.flickr.com/photos/flickcoolpix/

## Intel Virtualisation: How VT-x, KVM and QEMU Work Together

VT-x is name of CPU virtualisation technology by Intel. KVM is component of Linux kernel which makes use of VT-x. And QEMU is a user-space application which allows users to create virtual machines. QEMU makes use of KVM to achieve efficient virtualisation. In this article we will talk about how these three technologies work together. Don’t expect an in-depth exposition about all aspects here, although in future, I might follow this up with more focused posts about some specific parts.

Let’s first touch upon some theory before going into main discussion. Related to virtualisation is concept of emulation – in simple words, faking the hardware. When you use QEMU or VMWare to create a virtual machine that has ARM processor, but your host machine has an x86 processor, then QEMU or VMWare would emulate or fake ARM processor. When we talk about virtualisation we mean hardware assisted virtualisation where the VM’s processor matches host computer’s processor. Often conflated with virtualisation is an even more distinct concept of containerisation. Containerisation is mostly a software concept and it builds on top of operating system abstractions like process identifiers, file system and memory consumption limits. In this post we won’t discuss containers any more.

A typical VM set up looks like below:

At the lowest level is hardware which supports virtualisation. Above it, hypervisor or virtual machine monitor (VMM). In case of KVM, this is actually Linux kernel which has KVM modules loaded into it. In other words, KVM is a set of kernel modules that when loaded into Linux kernel turn the kernel into hypervisor. Above the hypervisor, and in user space, sit virtualisation applications that end users directly interact with – QEMU, VMWare etc. These applications then create virtual machines which run their own operating systems, with cooperation from hypervisor.

Finally, there is “full” vs. “para” virtualisation dichotomy. Full virtualisation is when OS that is running inside a VM is exactly the same as would be running on real hardware. Paravirtualisation is when OS inside VM is aware that it is being virtualised and thus runs in a slightly modified way than it would on real hardware.

## VT-x

VT-x is CPU virtualisation for Intel 64 and IA-32 architecture. For Intel’s Itanium, there is VT-I. For I/O virtualisation there is VT-d. AMD also has its virtualisation technology called AMD-V. We will only concern ourselves with VT-x.

Under VT-x a CPU operates in one of two modes: root and non-root. These modes are orthogonal to real, protected, long etc, and also orthogonal to privilege rings (0-3). They form a new “plane” so to speak. Hypervisor runs in root mode and VMs run in non-root mode. When in non-root mode, CPU-bound code mostly executes in the same way as it would if running in root mode, which means that VM’s CPU-bound operations run mostly at native speed. However, it doesn’t have full freedom.

Privileged instructions form a subset of all available instructions on a CPU. These are instructions that can only be executed if the CPU is in higher privileged state, e.g. current privilege level (CPL) 0 (where CPL 3 is least privileged). A subset of these privileged instructions are what we can call “global state-changing” instructions – those which affect the overall state of CPU. Examples are those instructions which modify clock or interrupt registers, or write to control registers in a way that will change the operation of root mode. This smaller subset of sensitive instructions are what the non-root mode can’t execute.

### VMX and VMCS

Virtual Machine Extensions (VMX) are instructions that were added to facilitate VT-x. Let’s look at some of them to gain a better understanding of how VT-x works.

VMXON: Before this instruction is executed, there is no concept of root vs non-root modes. The CPU operates as if there was no virtualisation. VMXON must be executed in order to enter virtualisation. Immediately after VMXON, the CPU is in root mode.

VMXOFF: Converse of VMXON, VMXOFF exits virtualisation.

VMLAUNCH: Creates an instance of a VM and enters non-root mode. We will explain what we mean by “instance of VM” in a short while, when covering VMCS. For now think of it as a particular VM created inside QEMU or VMWare.

VMRESUME: Enters non-root mode for an existing VM instance.

When a VM attempts to execute an instruction that is prohibited in non-root mode, CPU immediately switches to root mode in a trap-like way. This is called a VM exit.

Let’s synthesise the above information. CPU starts in a normal mode, executes VMXON to start virtualisation in root mode, executes VMLAUNCH to create and enter non-root mode for a VM instance, VM instance runs its own code as if running natively until it attempts something that is prohibited, that causes a VM exit and a switch to root mode. Recall that the software running in root mode is hypervisor. Hypervisor takes action to deal with the reason for VM exit and then executes VMRESUME to re-enter non-root mode for that VM instance, which lets the VM instance resume its operation. This interaction between root and non-root mode is the essence of hardware virtualisation support.

Of course the above description leaves some gaps. For example, how does hypervisor know why VM exit happened? And what makes one VM instance different from another? This is where VMCS comes in. VMCS stands for Virtual Machine Control Structure. It is basically a 4KiB part of physical memory which contains information needed for the above process to work. This information includes reasons for VM exit as well as information unique to each VM instance so that when CPU is in non-root mode, it is the VMCS which determines which instance of VM it is running.

As you may know, in QEMU or VMWare, we can decide how many CPUs a particular VM will have. Each such CPU is called a virtual CPU or vCPU. For each vCPU there is one VMCS. This means that VMCS stores information on CPU-level granularity and not VM level. To read and write a particular VMCS, VMREAD and VMWRITE instructions are used. They effectively require root mode so only hypervisor can modify VMCS. Non-root VM can perform VMWRITE but not to the actual VMCS, but a “shadow” VMCS – something that doesn’t concern us immediately.

There are also instructions that operate on whole VMCS instances rather than individual VMCSs. These are used when switching between vCPUs, where a vCPU could belong to any VM instance. VMPTRLD is used to load the address of a VMCS and VMPTRST is used to store this address to a specified memory address. There can be many VMCS instances but only one is marked as current and active at any point. VMPTRLD marks a particular VMCS as active. Then, when VMRESUME is executed, the non-root mode VM uses that active VMCS instance to know which particular VM and vCPU it is executing as.

Here it’s worth noting that all the VMX instructions above require CPL level 0, so they can only be executed from inside the Linux kernel (or other OS kernel).

VMCS basically stores two types of information:

1. Context info which contains things like CPU register values to save and restore during transitions between root and non-root.
2. Control info which determines behaviour of the VM inside non-root mode.

More specifically, VMCS is divided into six parts.

1. Guest-state stores vCPU state on VM exit. On VMRESUME, vCPU state is restored from here.
2. Host-state stores host CPU state on VMLAUNCH and VMRESUME. On VM exit, host CPU state is restored from here.
3. VM execution control fields determine the behaviour of VM in non-root mode. For example hypervisor can set a bit in a VM execution control field such that whenever VM attempts to execute RDTSC instruction to read timestamp counter, the VM exits back to hypervisor.
4. VM exit control fields determine the behaviour of VM exits. For example, when a bit in VM exit control part is set then debug register DR7 is saved whenever there is a VM exit.
5. VM entry control fields determine the behaviour of VM entries. This is counterpart of VM exit control fields. A symmetric example is that setting a bit inside this field will cause the VM to always load DR7 debug register on VM entry.
6. VM exit information fields tell hypervisor why the exit happened and provide additional information.

There are other aspects of hardware virtualisation support that we will conveniently gloss over in this post. Virtual to physical address conversion inside VM is done using a VT-x feature called Extended Page Tables (EPT). Translation Lookaside Buffer (TLB) is used to cache virtual to physical mappings in order to save page table lookups. TLB semantics also change to accommodate virtual machines. Advanced Programmable Interrupt Controller (APIC) on a real machine is responsible for managing interrupts. In VM this too is virtualised and there are virtual interrupts which can be controlled by one of the control fields in VMCS. I/O is a major part of any machine’s operations. Virtualising I/O is not covered by VT-x and is usually emulated in user space or accelerated by VT-d.

## KVM

Kernel-based Virtual Machine (KVM) is a set of Linux kernel modules that when loaded, turn Linux kernel into hypervisor. Linux continues its normal operations as OS but also provides hypervisor facilities to user space. KVM modules can be grouped into two types: core module and machine specific modules. kvm.ko is the core module which is always needed. Depending on the host machine CPU, a machine specific module, like kvm-intel.ko or kvm-amd.ko will be needed. As you can guess, kvm-intel.ko uses the functionality we described above in VT-x section. It is KVM which executes VMLAUNCH/VMRESUME, sets up VMCS, deals with VM exits etc. Let’s also mention that AMD’s virtualisation technology AMD-V also has its own instructions and they are called Secure Virtual Machine (SVM). Under `arch/x86/kvm/` you will find files named `svm.c` and `vmx.c`. These contain code which deals with virtualisation facilities of AMD and Intel respectively.

KVM interacts with user space – in our case QEMU – in two ways: through device file `/dev/kvm` and through memory mapped pages. Memory mapped pages are used for bulk transfer of data between QEMU and KVM. More specifically, there are two memory mapped pages per vCPU and they are used for high volume data transfer between QEMU and the VM in kernel.

`/dev/kvm` is the main API exposed by KVM. It supports a set of `ioctl`s which allow QEMU to manage VMs and interact with them. The lowest unit of virtualisation in KVM is a vCPU. Everything builds on top of it. The `/dev/kvm` API is a three-level hierarchy.

1. System Level: Calls this API manipulate the global state of the whole KVM subsystem. This, among other things, is used to create VMs.
2. VM Level: Calls to this API deal with a specific VM. vCPUs are created through calls to this API.
3. vCPU Level: This is lowest granularity API and deals with a specific vCPU. Since QEMU dedicates one thread to each vCPU (see QEMU section below), calls to this API are done in the same thread that was used to create the vCPU.

After creating vCPU QEMU continues interacting with it using the ioctls and memory mapped pages.

## QEMU

Quick Emulator (QEMU) is the only user space component we are considering in our VT-x/KVM/QEMU stack. With QEMU one can run a virtual machine with ARM or MIPS core but run on an Intel host. How is this possible? Basically QEMU has two modes: emulator and virtualiser. As an emulator, it can fake the hardware. So it can make itself look like a MIPS machine to the software running inside its VM. It does that through binary translation. QEMU comes with Tiny Code Generator (TCG). This can be thought if as a sort of high-level language VM, like JVM. It takes for instance, MIPS code, converts it to an intermediate bytecode which then gets executed on the host hardware.

The other mode of QEMU – as a virtualiser – is what achieves the type of virtualisation that we are discussing here. As virtualiser it gets help from KVM. It talks to KVM using ioctl’s as described above.

QEMU creates one process for every VM. For each vCPU, QEMU creates a thread. These are regular threads and they get scheduled by the OS like any other thread. As these threads get run time, QEMU creates impression of multiple CPUs for the software running inside its VM. Given QEMU’s roots in emulation, it can emulate I/O which is something that KVM may not fully support – take example of a VM with particular serial port on a host that doesn’t have it. Now, when software inside VM performs I/O, the VM exits to KVM. KVM looks at the reason and passes control to QEMU along with pointer to info about the I/O request. QEMU emulates the I/O device for that requests – thus fulfilling it for software inside VM – and passes control back to KVM. KVM executes a VMRESUME to let that VM proceed.

In the end, let us summarise the overall picture in a diagram:

## How Does an Intel Processor Boot?

When we switch on a computer, it goes through a series of steps before it is able to load the operating system. In this post we will see how a typical x86 processor boots. This is a very complex and involved process. We will only present a basic overall structure. Also what path is actually taken by the processor to reach a state where it can load an OS, is dependent on boot firmware. We will follow example of coreboot, an open source boot firmware.

## Before Power is Applied

Let us start with BIOS chip, also known as boot ROM. BIOS chip is a piece of silicon on the motherboard of a computer and it can store bytes. It has two characteristics which are of interest to us. First, it (or a part of it) is memory mapped into the CPU’s address space, which means that the CPU can access it in the same way it would access RAM. In particular, the CPU can point its instruction pointer to executed code inside BIOS chip. Second, the bytes that BIOS chip stores, represent the very first instructions that are executed by the CPU. BIOS chop also contains other pieces of code and data. A typical BIOS contains flash descriptor (a contents table for BIOS chip), BIOS region (the first instructions to be executed), Intel ME (Intel Management Engine) and GbE (gigabit ethernet). As you can see, BIOS chip is shared between serveral components of the system and not exclusive to CPU.

## When power is applied

Modern Intel chips come with what is called Intel Management Engine. As soon as power is available – through battery or from mains – Intel ME comes on. It does its own initialisations which requires it to read BIOS’s flash descriptor to find where Intel ME region is and then from Intel ME region of BIOS, read in code and config data. Next when we press power button on the computer, the CPU comes on. On a multiprocessor system, there is always a designated processor, called Bootstrap Processor (BSP), which comes on. In either case, the processor always comes on in what is called 16-bit Real Mode with insruction pointer pointing to address 0xffff.fff0, the reset vector.

EDIT: (thanks to burfog for indicating that this needs explaination)

You might be wondering how could a 16-bit system address 0xffff.fff0 which is clearly beyond 0xffff, the max 16-bit value? In 16-bit mode,  physical address is calculated by left shifting code segment (CS) selector register by 4 bits and then adding instruction pointer (IP) address. On reset, IP cotains value 0xfff0 and CS has value 0xf000 [1]. By the above formula the physical address should be:

CS << 4 + IP = 0x000f.0000 + 0xfff0 = 0x000f.fff0

which is still not what we expected. This is because on reset, the system is in a “special” Real Mode, where the first 12 address lines are asserted. So all addresses look like 0xfffx.xxxx. This means in our case, we need to set the most significant 12 bits in the address we derived, which results in our expected address 0xffff.fff0. These 12 address lines remain asserted until a long JMP is executed, after which they are de-asserted and normal Real Mode addressing calculations resume.

The BIOS chip is also set up in such a way that first instruction to be executed from the BIOS is at physical address 0xffff.fff0 of the processor. Hence processor is able to execute the first instruction from BIOS region of the BIOS chip. This region contains what is called boot firmware. Examples of boot firmware are UEFI implementations, coreboot and the classic BIOS.

One of the first things that the boot firmware does is switch to 32-bit mode. It is also “protected mode”, i.e. segmentation is turned on and various segments of processor’s address space can be managed with different access permissions. Boot firmware however would have just one segment, effectively turning off segmentation. This is called flat mode.

## Early Initialisations

It is worth noting that at this point in boot process, DRAM is not available. DRAM Initialisation is one of the main objectives of boot firmware. But before it can initialise DRAM, it needs to do some preparation.

Microcode patches are like patches for CPU to function correctly. Intel keeps publishing microcode patches for different CPUs. The boot firmware applies those patches very early on in boot process. Part of the processor is what is called south bridge or I/O controller hub (ICH) or peripheral controller hub (PCH). There are some initialisations to be performed for ICH also. For example, ICH may contain a watchdog timer which can go off which DRAM is being initialised. That watchdog timer must be turned off first.

Of course all of this is being done by firmware which is code written by someone. Now most of the code we know utilises stack. But we have mentioned that DRAM hasn’t been initialised yet so there is no memory. So how is this code written and run? Answer is that this is stackless code. Either it is hand written x86 assembly or, as in case of coreboot, it is written in C and compiled using special compiler called ROMCC which translates C to stackless assembly instructions. This of course comes with some restrictions so ROMCC compiled code is not how we want to execute everything. We need stack as soon as possible.

So, the next step is setting up what is called cache-as-RAM (CAR). Boot firmware basically sets up CPU caches so that they can be temporarily used as RAM. This way the firmware can run code which is not stackless, but still restricted in terms of stack size and general amount of memory available.

## Memory Initialisation and Intel FSP

On Intel systems, memory initialisation is performed using a blob called Intel Firmware Support Package (FSP). This is supplied by Intel in binary form. Intel FSP does a lot of heavy lifting when it comes to bootstrapping Intel processors and is not just limited to memory init. It is basically a three stage API. The way boot firmware interacts with FSP is set up some parameters and a return address, and jump into an FSP stage. The FSP stage would execute taking into account the parameters and then use the return address to jump back into boot firmware. This continues across these three FSP stages and in that order:

• TempRamInit(): This performs some init for RAM and hand control back to boot firmware. Boot firmware can kick off some actions and then go on to next stage. This is because the next step performs chipset and memory initialisation which may take some time. For example memory training is a time consuming operation. So this is an opportunity for boot firmware to kick off other initialisations, like spinning up hard drive, which can take time to stabilise.
• FspInitEntry(): This is where actual DRAM is achieved. This also performs other silicon init, like PCH and CPU itself. After this finishes, it passes control back to boot firmware. However, since this time, the memory has been initialised, the passing back of control and data is different from TempRamInit stage. After this stage, firmware does most of the rest of initialisations – described in the next section ‘After Memory Init’ – before passing control to the next stage of FSP.
• NotifyPhase(): This is where boot firmware would pass control back to FSP and set params which would tell FSP what sort of actions it needs to take before winding down. The types of things that FSP can do here are platform dependent but they include things like post PCI enumeration.

## After Memory Init

Once DRAM is ready, it breathes a new life into boot process. First that the firmware does is copy itself into DRAM. This is done with help of “memory aliasing”, which means that reads and writes to addresses below 1MB are routed to and from DRAM. Then, firmware sets up the stack and transfer control to DRAM.

Next, some platform specific inits are done, such as GPIO configuration and re-enabling the watchdog timer in ICH which was disabled before memory init, paving the way for interrupts enabling. Local Advanced Programmable Interrupt Controller (LAPIC) sites inside each processor, i.e. it is local to each CPU in a multiprocessor system. LAPIC determines how each interrupt is delivered to that particular CPU. I/O APIC (IOxAPIC) lives inside ICH and there is one IOxAPIC for all processors. There can also be a Programmable Interrupt Controller (PIC) which is for use in Real Mode as is Interrupt Vector Table which contains 256 interrupt vectors – pointers to handlers for corresponding interrupts. Interrupt Descriptor Table on the other hand, is used to hold interrupt vectors when in Protected Mode.

Firmware then sets up various timers depending upon platform and the firmware. Programmable Interrupt Timer (PIT) is the system timer and sits on IRQ0. It lives inside ICH. High Precision Event Time (HPET) also sits inside ICH but boot firmware may not initialise it, letting the OS to set it up if needed. There is also a clock, the Real Time Clock (RTC) which too resides in ICH. There are other timers too, particularly LAPIC timer which is inside each CPU. Next, the firmware sets up memory caching. This basically means setting up different cache characteristics – write-back, uncached etc – for different ranges of memory.

## Other Processors, I/O Devices and PCI

Finally, it is time to bring up other processors as all the work so far was being handled by the bootstrap processor. To find out about other application processors (AP) on the same package, BSP runs CPUID instruction. Then using its LAPIC, BSP sends an interrupt called SIPI, to each AP. Each SIPI points to the physical address at which the receiving AP should start executing. It is worth noting that each AP comes up in Real Mode, therefore the SIPI address must be less than 1MB, the maximum addressable in Real Mode. Usually soon after initialisation, each AP executes HLT instruction and gets into halt state, waiting for further instructions from BSP. However, just before OS gains control, APs are supposed to be in “waiting-for-SIPI” state. BSP achieves this by sending a couple of inter-processor interrupts to each AP.

Next come I/O devices like Embedded Controller (EC) and Super I/O, and after that PCI init. PCI init basically boils down to:

1. enumerating all PCI devices
2. allocating resources to each PCI device

This discussion here applies to PCIe also. PCI is a hierarchical bus system where for each bus, leaf is either a PCI device or a PCI bridge leading to another PCI bus. CPU communicates with PCI by reading and writing PCI registers. The resources needed by PCI devices are range inside memory address space, range inside I/O address space and IRQ assignment. CPU finds out about address ranges and their types (memory-mapped or I/O) by writing to and reading from Base Address Registers (BARs) of PCI devices. IRQs are usually set up based how the board is designed.

During PCI enumeration, firmware also reads Option ROM register. If that register is not empty then it contains address of Option ROM. This is ROM chip that is physically situated on the PCI device. For example the network card may contain Option ROM which holds iPXE firmware. When an Option ROM is encountered then it is read into DRAM and executed.

## Handing Control to OS loader

Before handing over control to next stage loader which is usually an OS loader like GRUB2 or LILO, the firmware sets up some information inside memory which is later to be consumed by the OS. This information is things like Advanced Configuration and Power Interface (ACPI) tables and memory map itself. Memory map tells the OS what address ranges have been set up for what purposes. The regions can be gerenal memory for OS use, ACPI related address ranges, reserved (i.e. not to be used by OS), IOAPIC (to be used by IOAPIC), LAPIC (to be used by LAPICs). Boot firmware also sets up handlers for System Management Mode (SMM) interrupts. SMM is an operating mode of Intel CPUs, just like Real, Protected and Long (64-bit) modes. A CPU enters SMM mode upon receipt of an SMM interrupt which can be triggered by a number of things like chip’s temperature reaching a certain level. Before handing control to OS loader, the firmware also locks down some registers and CPU capability, so that it can’t be changed afterwards by the OS.

Actual transfer of control to the OS loader usually takes form of a JMP to that part of memory. An OS loader like GRUB2 will perform actions based on its config and ultimately pass controle to an operating system like Linux. For Linux, this will usually be a bzImage (big zImage, not bz compression). It is worth noting that the OS, like Linux would enumerate PCI devices again and may have other overlap with some of the final initialisations done by boot firmware. Linux usually picks up the system in 32-bit mode with paging turned off and performs its own initialisations which include setting up page tables, enabling paging and switching to long mode, i.e. 64-bit.

[1] userbinator on Hacker News pointed out that IP hasn’t always held the value 0xfff0 on a reset. On 8086/8088 it was 0x0. Here’s what he found from Intel’s documentation:

``````8086/88:   CS:IP = FFFF:0000 first instruction at FFFF0
80186/188: CS:IP = FFFF:0000 first instruction at FFFF0
80286:     CS:IP = F000:FFF0 first instruction at FFFF0
80386:     CS:IP = 0000:0000FFF0 or F000:0000FFF0[1], first instruction at FFFFFFF0
80486+:    CS:IP = F000:0000FFF0(?) first instruction at FFFFFFF0``````

## Typical classification of sockets

Typically, sockets are classified along two orthogonal dimensions: domain and type. This is reflected in the system call used to create a socket

int socket(int domain, int type, int protocol)

In typical IPC, protocol is usually zero.

## Domain:

Domain means two things:

• range of communication (e.g. on same host or between two remote hosts)
• address format used to identify a peer (e.g. a path name or (IPv4 address, port) pair)

At least following three domains are supported by OSs:

• UNIX domain (identified by C macro AF_INET)
• IPv4 domain (AF_INET)
• IPv6 domain (AF_INET6)

Note that in above macro names, prefix PF_* can also be used instead of AF_*. Both mean same thing.

## Type:

Again typically, two types of sockets are used:

• Stream sockets (identified by C macro SOCK_STREAM)
• Datagram sockets (SOCK_DGRAM)

Stream sockets are connection-oriented. One socket is connected to only one peer. They are byte-stream based and don’t preserve message boundaries. This means that basic unit of data transfer between two SOCK_STREAM sockets is byte. If a sender sends two messages in quick succession, and then receiver does a receive then bytes from second message will follow bytes of first message as a continous stream of bytes, rather than two separate messages. In contrast, a SOCK_DGRAM socket will receive one message in each call to recvfrom().

Apart from above, stream sockets provide reliable (in-order and non-duplicate) two-way communication.

Datagram sockets are message oriented. Unit of transfer is a single message. If the message size is too big, i.e. ‘length’ parameter of recvfrom is less than actual message length, then the message is silently truncated to ‘length’. Datagram sockets are also unreliable (messages may be lost, duplicated or received out of order) and connectionless, i.e. unlike SOCK_STREAM where one socket is connected to only one peer. Therefore sender has to specify recipient address everytime when sending data – sendto() syscall does that. Similarly, recvfrom() identifies sender to receiver. Having said that, connectionlessness comes with one qualificatoin mentioned below.

## Connected datagram socket:

Stream sockets use connect() system call to connect to their peer, thus forming the one-to-one pairing mentioned above. It turns out, connect() can also be called on datagram socket. The effect is that kernel creates an association between caller and remote address specified in connect(). Then that socket can use write() or send() syscall, without specifying recipient address every time. At the same time, that socket will only
receive datagrams from the socket that it is connected to. Note that connectedness of datagram sockets is asymmetrical – the remote socket doesn’t have to be connected to local one which called connect().

Connection can be changed by calling connect again on the same datagram socket but with a different remote socket. To abolish the connection, specify address family of peer address argument of connect as AF_UNSPEC. However, abolishing of connection is Linux-specific only and thus not portable.

## Paging in Linux on x86 – part 2

Our last post left the story of paging in Linux on x86 incomplete. Today we will cover that. This is a vast topic whose tentacles go far and wide into almost all aspects of the kernel. So we will cover only some interesting highlights. Our focus is on x86 without Physical Address Extension (PAE) enabled.

## Data structures:

Although x86 uses two-level page tables, there are architectures which use three or four levels too. For example, x86 with Physical Address Extension (PAE) uses three page tables and the 64-bit x86_64 uses four page tables. Linux aims to cover all architecures and therefore uses four page tables. They are:

• Page Global Directory
• Page Upper Directory
• Page Middle Directory
• Page Table

This means, Linux divides virtual address into five parts – one for indexing into each of the tables above and one offset into the physical page frame. An entry in each of the tables above is a 32-bit unsigned int on x86 (without PAE). These data types are used to represent each of them respectively: pgd_t, pud_t, pmd_t and pte_t. There is also a set of helper macros and functions used to manipulate them.

## On x86:

On x86, there are only two levels of page tables. Linux reconciles that with its four levels by nullifying the effect of PUD and PMD. It does this by keeping just one record in each of them. So practically, it is only using PGD and Page Table.

Kernel typically divides virtual address space into 3GB from 0x00000000 to 0xbfffffff for user space and 0xc0000000 to 0xffffffff for kernel space. In kernel code, the macro PAGE_OFFSET contains virtual address at which kernel starts, i.e. 0xc0000000 on a typical x86 set up.

Early on during boot, kernel learns size of RAM by querying BIOS. Then it scans physcial addresses to find those addresses which are unavailable. They can be:

• addresses which are mapped with hardware devices’ I/O (memory-mapped I/O)
• addresses pointing to page frames containing BIOS data

Typically the kernel lodges itself at physical address 0x00100000, i.e. 2nd MB onwards. Reasons for skipping first MB are architecture specfic – not just x86, but other machines also do some special things in that first MB of physical memory. Those page frames in which kernel sits never get swapped out to disk. Kernel also never swaps out unavailable addresses mentioned above.

From the 1 GB of virtual address space that kernel occupies, 896MB is directly mapped to physical addresses, i.e. there is a one-to-one mapping. Since kernel pages are never swapped out, this mapping always holds true. Macro __pa() converts kernel virtual address into physical address. It basically does simple maths like

Another macro __va does the same thing in reverse.

Kernel sets aside the highest 128MB of its 1GB address space for non-contiguous allocations (high mem) and what is called fix-mapped linear addresses. Non-contiguous allocations is a separate topic on its own but we will quickly describe fix-mapped linear addresses before concluding this article. They are basically constant mappings from virtual to physical addresses but unlike first 896MB, they don’t follow simple offset formula above. Their mapping is arbitrary but fixed nonetheless. Kernel uses them as pointers as they are more efficient in terms of memory accesses required.

## Paging in Linux on x86

In our last post we covered how x86 logical address is translated into linear address. In this one we will look at translation from linear to physical. We will use the terms ‘virtual address’ and ‘linear address’ interchangeably.

A piece of hardware called paging unit is responsible for converting virtual addresses to physical. However, the operating system needs to set it up with correct data structures – page tables. On x86, paging is enabled by setting a flag inside a special register. When that flag is zero, paging is not enabled and linear addresses are treated as physical addresses. Linux first sets up page tables and then enables paging.

## Pages and page tables

For ease of management of memory, e.g. access rights, physical memory is divided into `page frames`. These are contiguous cells of RAM, usually 4KB in size. Corresponding to each physical page frame there is a `page` of virtual addresses. For instance virtual addresses 0x20300000 – 0x20301000 represent a page which corresponds to 4096 physical addresses each of which points to a cell (one byte) in RAM. A page as well as a page frame represent contiguous addresses, so inside a page, the virtual-to-physical mapping is one-to-one. Page is basic unit of memory management in Linux. A key function of paging unit is to check type of access to a virtual address (read or write) against access rights of the page to which that virtual address belongs. When access right is violated, paging unit generates a Page Fault.

Page table is an array in RAM which maps virtual address to physical address. Each user process has its own page table and when a context switch happens, the page tables are changed as part of it. Each entry inside page table points to a page frame inside RAM. So a 32-bit virtual address has two parts: page table index (20 most significant bits) and page offset (12 bits because page size is 4096). Using page tabele index, we will get page frame. Inside page frame we use page offset to get the exact memory cell, the byte that the virtual address points to.

A naive way of organising page table would be to have one page table whose indices are 20 most significant bits of virtual address and whose values contain (among other things) physical address of page frame. That would be wasteful. If each entry is 4 bytes, a page table would require (2^20 * 4) bytes = 4MB of RAM. That is for each process. x86 instead breaks single page table into two: Page Directory and Page Table. Virtual address is also divided into three parts: index inside Page Directory, to get Page Table entry, index inside Page Table entry to get page frame address, and then the same 12-bit page offset to find the cell inside page frame. This way, each process will have to have a Page Directory but there is no need to allocate all Page Tables upfront. Instead Page Tables can be set up when they are needed.

## Management of Pages

Physical address of Page Directory is stored in a special register and that registered is updated when there is a context switch. Entries in Page Directory and Page Table have same format. Along with address of corresponding page frames (or Page Table in case of Page Directory’s entry), it stores privilege level needed to access that page. The privilege level is a single byte so has two possible values. It depends upon CPU Privilege Level (CPL) – a two byte value on x86 which represents four levels. In page table entry, it only checks whether a page requires supervisor mode (CPL = 0) or not (CPL = 1, 2 or 3).

Page table entry also contains access type allowed: read and write. In contrast, access rights for segments are three: read, write and execute. So a page which is read only cannot be written to.

As you might have noticed, this post hasn’t really lived up to its title and only talks about paging in x86. Time and other conditions permitting, we will discuss paging in Linux in a follow-up article.

## Background:

Address space segmentation basically means dividing all possible virtual addresses into groups – segments – and applying some properties on those segments, e.g. privilege level required to access them. Segmentation applies to virtual addresses so it comes into play before virtual-to-physical address translation takes place. In x86, segmentation is a relic from past. 286 didn’t have virtual addressing so it divided address space into segments so that processes could keep themselves to addresses in their own segments. Then 386 added virtual addresses but still kept segments.

In x86, there are three different types of addresses.

• Logical
• Linear
• Physical

This requires two steps to translate from logical to physical address. Translation from logical to linear is described in this article. Translation from linear to physical is done using page tables and we might cover it in a follow-up article.

Logical address consists of two parts: segment and offset. Segment is basically an index into an array of 8-byte records (discriptors) stored in RAM. This array is called Global Discriptor Table (GDT). There is also a per-process Local Descriptor Table (LDT) but we will ignore it as it doesn’t play a significant role in this discussion.

Each entry inside GDT contains info about a segment that it represents: base address, range (max address), CPU privelege level needed to access it and some other info.