When we switch on a computer, it goes through a series of steps before it is able to load the operating system. In this post we will see how a typical x86 processor boots. This is a very complex and involved process, so we will only present its basic overall structure. Also, the path the processor actually takes to reach a state where it can load an OS depends on the boot firmware. We will follow the example of coreboot, an open source boot firmware.
Before Power is Applied
Let us start with the BIOS chip, also known as boot ROM. The BIOS chip is a piece of silicon on the motherboard of a computer and it can store bytes. It has two characteristics which are of interest to us. First, it (or a part of it) is memory mapped into the CPU’s address space, which means that the CPU can access it in the same way it would access RAM. In particular, the CPU can point its instruction pointer at code inside the BIOS chip and execute it. Second, the bytes that the BIOS chip stores represent the very first instructions that are executed by the CPU. The BIOS chip also contains other pieces of code and data. A typical BIOS chip contains a flash descriptor (a table of contents for the chip), a BIOS region (the first instructions to be executed), an Intel ME (Intel Management Engine) region and a GbE (gigabit ethernet) region. As you can see, the BIOS chip is shared between several components of the system and is not exclusive to the CPU.
When Power is Applied
Modern Intel chips come with what is called the Intel Management Engine. As soon as power is available – through battery or from mains – Intel ME comes on. It does its own initialisation, which requires it to read the BIOS chip’s flash descriptor to find where the Intel ME region is, and then read in code and config data from that region. Next, when we press the power button on the computer, the CPU comes on. On a multiprocessor system, there is always a designated processor, called the Bootstrap Processor (BSP), which comes on first. In either case, the processor always comes on in what is called 16-bit Real Mode, with the instruction pointer pointing to address 0xffff.fff0, the reset vector.
EDIT: (thanks to burfog for indicating that this needs explanation)
You might be wondering how a 16-bit system could address 0xffff.fff0, which is clearly beyond 0xffff, the max 16-bit value. In 16-bit mode, the physical address is calculated by left shifting the code segment (CS) register by 4 bits and then adding the instruction pointer (IP). On reset, IP contains the value 0xfff0 and CS has the value 0xf000. By the above formula the physical address should be:
(CS << 4) + IP = 0x000f.0000 + 0xfff0 = 0x000f.fff0
which is still not what we expected. This is because on reset, the system is in a “special” Real Mode, where the top 12 address lines are asserted. So all addresses look like 0xfffx.xxxx. This means in our case, we need to set the most significant 12 bits in the address we derived, which results in our expected address 0xffff.fff0. These 12 address lines remain asserted until a long (far) JMP is executed, after which they are de-asserted and normal Real Mode addressing calculations resume.
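The arithmetic above can be checked with a few lines of Python. This is only a sketch of the description in this post: the function names are made up for illustration, and the asserted upper address lines are modelled by simply OR-ing in the most significant 12 bits.

```python
# Illustrative model of Real Mode address generation as described above.
# Function names are hypothetical; nothing here talks to real hardware.

UPPER_12_LINES = 0xFFF00000  # most significant 12 address bits forced high


def real_mode_address(cs: int, ip: int) -> int:
    """Normal Real Mode: shift the 16-bit segment left by 4, add the offset."""
    return ((cs & 0xFFFF) << 4) + (ip & 0xFFFF)


def reset_address(cs: int, ip: int) -> int:
    """Address seen right after reset, before the first long (far) JMP."""
    return real_mode_address(cs, ip) | UPPER_12_LINES


print(hex(real_mode_address(0xF000, 0xFFF0)))  # 0xffff0
print(hex(reset_address(0xF000, 0xFFF0)))      # 0xfffffff0
```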
The BIOS chip is also set up in such a way that the first instruction to be executed from it is at physical address 0xffff.fff0 of the processor. Hence the processor is able to execute the first instruction from the BIOS region of the BIOS chip. This region contains what is called boot firmware. Examples of boot firmware are UEFI implementations, coreboot and the classic BIOS.
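One way to picture this mapping: if we assume the chip is decoded so that its last byte sits at the very top of the 32-bit address space (a common arrangement, but an assumption here, as is the 16 MiB chip size), then the reset vector corresponds to the last 16 bytes of the chip. A hypothetical helper to convert between the two views might look like this:

```python
FOUR_GIB = 1 << 32
RESET_VECTOR = 0xFFFFFFF0


def flash_offset(phys_addr: int, flash_size: int) -> int:
    """Translate a CPU physical address to an offset inside the flash chip.

    Assumes the chip occupies the top `flash_size` bytes below 4 GiB.
    """
    base = FOUR_GIB - flash_size
    if not base <= phys_addr < FOUR_GIB:
        raise ValueError("address is not inside the mapped flash window")
    return phys_addr - base


size = 16 * 1024 * 1024  # assumed 16 MiB chip
print(hex(flash_offset(RESET_VECTOR, size)))  # 0xfffff0
```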
One of the first things that the boot firmware does is switch to 32-bit mode. This is also “protected mode”, i.e. segmentation is turned on and various segments of the processor’s address space can be managed with different access permissions. Boot firmware, however, would have just one segment spanning the whole address space, effectively turning off segmentation. This is called flat mode.
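To make “flat mode” a little more concrete, here is a sketch of how a segment descriptor could be packed so that it starts at 0 and spans the full 4 GiB. In practice firmware typically loads one such descriptor for code and one for data, both covering the same range. The helper name is made up; the bit layout follows the standard x86 segment descriptor format.

```python
def gdt_descriptor(base: int, limit: int, access: int, flags: int) -> int:
    """Pack an 8-byte x86 segment descriptor into a 64-bit integer."""
    desc = limit & 0xFFFF                 # limit bits 15..0
    desc |= (base & 0xFFFFFF) << 16       # base bits 23..0
    desc |= (access & 0xFF) << 40         # present, privilege, type
    desc |= ((limit >> 16) & 0xF) << 48   # limit bits 19..16
    desc |= (flags & 0xF) << 52           # granularity, 32-bit size
    desc |= ((base >> 24) & 0xFF) << 56   # base bits 31..24
    return desc


# Base 0, limit 0xFFFFF counted in 4 KiB units (flags 0xC) covers 0..4 GiB.
flat_code = gdt_descriptor(0, 0xFFFFF, 0x9A, 0xC)  # execute/read code segment
flat_data = gdt_descriptor(0, 0xFFFFF, 0x92, 0xC)  # read/write data segment

print(f"{flat_code:#018x}")  # 0x00cf9a000000ffff
print(f"{flat_data:#018x}")  # 0x00cf92000000ffff
```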
It is worth noting that at this point in boot process, DRAM is not available. DRAM Initialisation is one of the main objectives of boot firmware. But before it can initialise DRAM, it needs to do some preparation.
Microcode patches are updates that the CPU needs in order to function correctly. Intel keeps publishing microcode patches for different CPUs, and the boot firmware applies them very early in the boot process. Another part of the platform is what is called the south bridge, I/O controller hub (ICH) or platform controller hub (PCH). There are some initialisations to be performed for the ICH as well. For example, the ICH may contain a watchdog timer which can go off while DRAM is being initialised. That watchdog timer must be turned off first.
Of course all of this is being done by firmware, which is code written by someone. Now most of the code we know utilises a stack. But we have mentioned that DRAM hasn’t been initialised yet, so there is no memory to hold one. So how is this code written and run? The answer is that this is stackless code. Either it is hand-written x86 assembly or, as in the case of coreboot, it is written in C and compiled using a special compiler called ROMCC, which translates C to stackless assembly instructions. This of course comes with some restrictions, so ROMCC-compiled code is not how we want to execute everything. We need a stack as soon as possible.
So, the next step is setting up what is called cache-as-RAM (CAR). Boot firmware basically sets up CPU caches so that they can be temporarily used as RAM. This way the firmware can run code which is not stackless, but still restricted in terms of stack size and general amount of memory available.
Memory Initialisation and Intel FSP
On Intel systems, memory initialisation is performed using a blob called the Intel Firmware Support Package (FSP). This is supplied by Intel in binary form. Intel FSP does a lot of heavy lifting when it comes to bootstrapping Intel processors and is not limited to memory init. It is basically a three stage API. The way boot firmware interacts with FSP is to set up some parameters and a return address, and jump into an FSP stage. The FSP stage executes, taking the parameters into account, and then uses the return address to jump back into boot firmware. This continues across the three FSP stages, in this order:
- TempRamInit(): This performs some init for RAM and hands control back to boot firmware. Boot firmware can kick off some actions and then go on to the next stage. This is because the next step performs chipset and memory initialisation, which may take some time. For example, memory training is a time consuming operation. So this is an opportunity for boot firmware to kick off other initialisations, like spinning up a hard drive, which can take time to stabilise.
- FspInitEntry(): This is where actual DRAM initialisation is achieved. This also performs other silicon init, like the PCH and the CPU itself. After this finishes, it passes control back to boot firmware. However, since the memory has been initialised by this time, the passing back of control and data is different from the TempRamInit stage. After this stage, firmware does most of the rest of the initialisations – described in the next section ‘After Memory Init’ – before passing control to the next stage of FSP.
- NotifyPhase(): This is where boot firmware would pass control back to FSP and set params which would tell FSP what sort of actions it needs to take before winding down. The types of things that FSP can do here are platform dependent but they include things like post PCI enumeration.
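The hand-offs between boot firmware and FSP can be sketched as a simple sequence. The Python below is purely a model of the ordering described above: the three stage names come from the list, while every other name and all of the log messages are invented for illustration and do not correspond to real FSP or coreboot interfaces.

```python
# A toy model of the call order only; no real FSP interface is used.

def boot_flow() -> list[str]:
    log: list[str] = []

    def fsp_stage(name: str) -> None:
        # Stands in for "set up parameters and a return address, jump into
        # FSP, FSP jumps back when done".
        log.append(f"FSP: {name}")

    def firmware(step: str) -> None:
        log.append(f"firmware: {step}")

    fsp_stage("TempRamInit")
    firmware("kick off slow devices")        # e.g. spin up a hard drive
    fsp_stage("FspInitEntry")                # DRAM and other silicon init
    firmware("post-memory initialisation")   # see 'After Memory Init'
    fsp_stage("NotifyPhase")
    return log


for line in boot_flow():
    print(line)
```

The point of the sketch is only that control alternates: boot firmware always regains control between FSP stages and decides when to enter the next one.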
After Memory Init
Once DRAM is ready, it breathes new life into the boot process. The first thing the firmware does is copy itself into DRAM. This is done with the help of “memory aliasing”, which means that reads and writes to addresses below 1MB are routed to and from DRAM. Then, the firmware sets up the stack and transfers control to DRAM.
Next, some platform specific inits are done, such as GPIO configuration and re-enabling the watchdog timer in the ICH which was disabled before memory init, paving the way for enabling interrupts. The Local Advanced Programmable Interrupt Controller (LAPIC) sits inside each processor, i.e. it is local to each CPU in a multiprocessor system. The LAPIC determines how each interrupt is delivered to that particular CPU. The I/O APIC (IOxAPIC) lives inside the ICH and there is one IOxAPIC for all processors. There can also be a Programmable Interrupt Controller (PIC), which is for use in Real Mode, as is the Interrupt Vector Table, which contains 256 interrupt vectors – pointers to handlers for the corresponding interrupts. The Interrupt Descriptor Table, on the other hand, is used to hold interrupt vectors when in Protected Mode.
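As a small illustration of the Real Mode side, the Interrupt Vector Table is just an array of 256 four-byte entries starting at physical address 0, each holding a 16-bit offset followed by a 16-bit segment. The sketch below (helper names are made up for this post) locates an entry and turns it into the handler’s physical address; the sample table contents are invented.

```python
import struct

IVT_ENTRIES = 256
IVT_ENTRY_SIZE = 4  # 2-byte offset followed by 2-byte segment


def ivt_entry_address(vector: int) -> int:
    """Physical address of the IVT slot for a given interrupt vector."""
    if not 0 <= vector < IVT_ENTRIES:
        raise ValueError("vector must be in 0..255")
    return vector * IVT_ENTRY_SIZE


def handler_address(ivt: bytes, vector: int) -> int:
    """Decode segment:offset from the table and return the physical address."""
    offset, segment = struct.unpack_from("<HH", ivt, ivt_entry_address(vector))
    return (segment << 4) + offset


# Made-up table: vector 8 points at F000:1234, everything else is zero.
table = bytearray(IVT_ENTRIES * IVT_ENTRY_SIZE)
struct.pack_into("<HH", table, ivt_entry_address(8), 0x1234, 0xF000)

print(hex(ivt_entry_address(8)))       # 0x20
print(hex(handler_address(table, 8)))  # 0xf1234
```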
Firmware then sets up various timers, depending upon the platform and the firmware. The Programmable Interval Timer (PIT) is the system timer and sits on IRQ0. It lives inside the ICH. The High Precision Event Timer (HPET) also sits inside the ICH, but boot firmware may not initialise it, letting the OS set it up if needed. There is also a clock, the Real Time Clock (RTC), which too resides in the ICH. There are other timers too, particularly the LAPIC timer which is inside each CPU. Next, the firmware sets up memory caching. This basically means setting up different cache characteristics – write-back, uncached etc – for different ranges of memory.
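A rough way to think about the caching step is as a table of address ranges, each tagged with a memory type. The sketch below uses the numeric memory type encodings that x86 uses for its range registers (0 for uncacheable, 6 for write-back, and so on), but the ranges themselves are invented for illustration and do not describe any particular board.

```python
from enum import IntEnum


class MemType(IntEnum):
    UNCACHEABLE = 0
    WRITE_COMBINING = 1
    WRITE_THROUGH = 4
    WRITE_PROTECTED = 5
    WRITE_BACK = 6


# (start, end_exclusive, type) -- hypothetical layout, not a real memory map.
RANGES = [
    (0x00000000, 0x80000000, MemType.WRITE_BACK),        # DRAM
    (0xE0000000, 0xF0000000, MemType.UNCACHEABLE),       # device MMIO
    (0xFF000000, 0x100000000, MemType.WRITE_PROTECTED),  # boot flash
]


def mem_type(addr: int, default: MemType = MemType.UNCACHEABLE) -> MemType:
    """Return the cache type for an address, or the default if none matches."""
    for start, end, kind in RANGES:
        if start <= addr < end:
            return kind
    return default


print(mem_type(0x00100000).name)  # WRITE_BACK
print(mem_type(0xE0001000).name)  # UNCACHEABLE
```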
Other Processors, I/O Devices and PCI
Finally, it is time to bring up the other processors, as all the work so far was being handled by the bootstrap processor. To find out about the other application processors (APs) on the same package, the BSP runs the CPUID instruction. Then, using its LAPIC, the BSP sends an interrupt called a SIPI (Startup Inter-Processor Interrupt) to each AP. Each SIPI points to the physical address at which the receiving AP should start executing. It is worth noting that each AP comes up in Real Mode, therefore the SIPI address must be less than 1MB, the maximum addressable in Real Mode. Usually, soon after initialisation, each AP executes the HLT instruction and goes into the halt state, waiting for further instructions from the BSP. However, just before the OS gains control, APs are supposed to be in the “waiting-for-SIPI” state. The BSP achieves this by sending a couple of inter-processor interrupts to each AP.
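The “less than 1MB” restriction also shows up in how the start address is encoded: a SIPI carries an 8-bit vector, and the AP begins executing at that vector multiplied by 4 KiB. A minimal sketch of the conversion, with helper names of our own choosing:

```python
PAGE = 0x1000        # SIPI start addresses are 4 KiB aligned
ONE_MIB = 0x100000


def sipi_vector(start_addr: int) -> int:
    """Convert an AP start address into the 8-bit SIPI vector."""
    if start_addr % PAGE:
        raise ValueError("start address must be 4 KiB aligned")
    if not 0 <= start_addr < ONE_MIB:
        raise ValueError("start address must be below 1 MiB")
    return start_addr // PAGE


def sipi_start_address(vector: int) -> int:
    """Address at which an AP starts executing for a given SIPI vector."""
    return (vector & 0xFF) * PAGE


print(hex(sipi_vector(0x8000)))       # 0x8
print(hex(sipi_start_address(0x9F)))  # 0x9f000
```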
Next come I/O devices like Embedded Controller (EC) and Super I/O, and after that PCI init. PCI init basically boils down to:
- enumerating all PCI devices
- allocating resources to each PCI device
This discussion applies to PCIe also. PCI is a hierarchical bus system where, for each bus, a leaf is either a PCI device or a PCI bridge leading to another PCI bus. The CPU communicates with PCI devices by reading and writing PCI registers. The resources needed by PCI devices are a range inside the memory address space, a range inside the I/O address space and an IRQ assignment. The CPU finds out about the address ranges and their types (memory-mapped or I/O) by writing to and reading from the Base Address Registers (BARs) of PCI devices. IRQs are usually set up based on how the board is designed.
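The “writing to and reading from” part refers to a sizing probe: firmware writes all ones to a BAR and reads it back, and the device returns zeros in the address bits it does not decode. The sketch below decodes such a read-back value for a 32-bit BAR. The sample values are made up and no real PCI access is performed.

```python
def decode_bar(readback: int) -> tuple[str, int]:
    """Decode the value read from a 32-bit BAR after writing 0xFFFFFFFF to it.

    Returns (kind, size_in_bytes); size 0 means the BAR is not implemented.
    """
    readback &= 0xFFFFFFFF
    if readback & 0x1:                    # bit 0 set: I/O space BAR
        kind, addr_bits = "io", readback & 0xFFFFFFFC
    else:                                 # bit 0 clear: memory space BAR
        kind, addr_bits = "memory", readback & 0xFFFFFFF0
    # The lowest writable address bit gives the size of the region.
    size = addr_bits & -addr_bits
    return kind, size


print(decode_bar(0xFFF00000))  # ('memory', 1048576)
print(decode_bar(0x0000FF01))  # ('io', 256)
```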
During PCI enumeration, firmware also reads the Option ROM register. If that register is not empty then it contains the address of an Option ROM. This is a ROM chip that is physically situated on the PCI device. For example, a network card may contain an Option ROM which holds iPXE firmware. When an Option ROM is encountered, it is read into DRAM and executed.
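Before running an Option ROM, firmware checks that what it read actually looks like one. A PCI expansion ROM image starts with the signature bytes 0x55 0xAA, and legacy x86 images store their length in 512-byte units in the third byte. A minimal sketch of that check, using a made-up image rather than a real ROM dump:

```python
from typing import Optional


def option_rom_size(image: bytes) -> Optional[int]:
    """Return the image size in bytes if it has a valid header, else None."""
    if len(image) < 3 or image[0] != 0x55 or image[1] != 0xAA:
        return None
    return image[2] * 512


# Hypothetical 1 KiB image: signature, size byte of 2, then zero padding.
fake_rom = bytes([0x55, 0xAA, 0x02]) + bytes(1021)

print(option_rom_size(fake_rom))      # 1024
print(option_rom_size(b"\x00" * 16))  # None
```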
Handing Control to OS Loader
Before handing over control to the next stage loader, which is usually an OS loader like GRUB2 or LILO, the firmware sets up some information in memory which is later consumed by the OS. This information includes things like the Advanced Configuration and Power Interface (ACPI) tables and the memory map itself. The memory map tells the OS what address ranges have been set up for what purposes. The regions can be general memory for OS use, ACPI related address ranges, reserved (i.e. not to be used by the OS), IOAPIC (to be used by the IOAPIC) and LAPIC (to be used by LAPICs). Boot firmware also sets up handlers for System Management Mode (SMM) interrupts. SMM is an operating mode of Intel CPUs, just like Real, Protected and Long (64-bit) modes. A CPU enters SMM upon receipt of a System Management Interrupt (SMI), which can be triggered by a number of things, like the chip’s temperature reaching a certain level. Before handing control to the OS loader, the firmware also locks down some registers and CPU capabilities, so that they can’t be changed afterwards by the OS.
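As an illustration of what such a memory map might look like to the OS, here is a sketch with a handful of entries. The region kinds follow the list above; the addresses and sizes are only illustrative and do not come from any real machine.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    start: int
    size: int
    kind: str  # "ram", "acpi", "reserved", "ioapic" or "lapic"


# Hypothetical map handed over by the firmware.
MEMORY_MAP = [
    Region(0x00000000, 0x0009F000, "ram"),
    Region(0x000F0000, 0x00010000, "reserved"),
    Region(0x00100000, 0x7FE00000, "ram"),
    Region(0x7FF00000, 0x00010000, "acpi"),
    Region(0xFEC00000, 0x00001000, "ioapic"),
    Region(0xFEE00000, 0x00001000, "lapic"),
]


def usable_bytes(regions: list[Region]) -> int:
    """Total memory the OS is free to use for general purposes."""
    return sum(r.size for r in regions if r.kind == "ram")


def kind_at(regions: list[Region], addr: int) -> str:
    """Look up what an address is for; unlisted addresses count as holes."""
    for r in regions:
        if r.start <= addr < r.start + r.size:
            return r.kind
    return "hole"


print(hex(usable_bytes(MEMORY_MAP)))    # 0x7fe9f000
print(kind_at(MEMORY_MAP, 0xFEE00000))  # lapic
```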
The actual transfer of control to the OS loader usually takes the form of a JMP to that part of memory. An OS loader like GRUB2 will perform actions based on its config and ultimately pass control to an operating system like Linux. For Linux, this will usually be a bzImage (big zImage, not bz compression). It is worth noting that an OS like Linux would enumerate PCI devices again and may otherwise overlap with some of the final initialisations done by boot firmware. Linux usually picks up the system in 32-bit mode with paging turned off and performs its own initialisations, which include setting up page tables, enabling paging and switching to long mode, i.e. 64-bit.
userbinator on Hacker News pointed out that IP hasn’t always held the value 0xfff0 on a reset. On 8086/8088 it was 0x0. Here’s what he found from Intel’s documentation:
- 8086/88: CS:IP = FFFF:0000, first instruction at FFFF0
- 80186/188: CS:IP = FFFF:0000, first instruction at FFFF0
- 80286: CS:IP = F000:FFF0, first instruction at FFFF0
- 80386: CS:IP = 0000:0000FFF0 or F000:0000FFF0, first instruction at FFFFFFF0
- 80486+: CS:IP = F000:0000FFF0(?), first instruction at FFFFFFF0