October 2017 – Binary Debt

Typical classification of sockets

Sockets are typically classified along two orthogonal dimensions: domain and type. This is reflected in the system call used to create a socket:

int socket(int domain, int type, int protocol)

For typical IPC, protocol is zero, which selects the default protocol for the given domain and type.

Domain:

Domain means two things:

range of communication (e.g. on same host or between two remote hosts)
address format used to identify a peer (e.g. a path name or (IPv4 address, port) pair)

Most operating systems support at least the following three domains:

UNIX domain (identified by C macro AF_UNIX)
IPv4 domain (AF_INET)
IPv6 domain (AF_INET6)

Note that in the macro names above, the prefix PF_ can be used instead of AF_. Both mean the same thing.

Type:

Two types of sockets are typically used:

Stream sockets (identified by C macro SOCK_STREAM)
Datagram sockets (SOCK_DGRAM)

Stream sockets are connection-oriented. One socket is connected to exactly one peer. They are byte-stream based and don’t preserve message boundaries. This means that the basic unit of data transfer between two SOCK_STREAM sockets is a byte. If a sender sends two messages in quick succession and the receiver then does a receive, the bytes of the second message follow the bytes of the first as one continuous stream of bytes rather than as two separate messages. In contrast, a SOCK_DGRAM socket receives one message in each call to recvfrom().

In addition, stream sockets provide reliable (in-order and non-duplicated) two-way communication.

Datagram sockets are message-oriented: the unit of transfer is a single message. If the receiver’s buffer is too small, i.e. the ‘length’ parameter of recvfrom() is less than the actual message length, the message is silently truncated to ‘length’ bytes. Datagram sockets are also unreliable (messages may be lost, duplicated or received out of order) and connectionless, unlike SOCK_STREAM sockets, where one socket is connected to exactly one peer. The sender therefore has to specify the recipient’s address every time it sends data, which is what the sendto() syscall does. Similarly, recvfrom() identifies the sender to the receiver. That said, connectionlessness comes with one qualification, described below.

Connected datagram socket:

Stream sockets use the connect() system call to connect to their peer, forming the one-to-one pairing mentioned above. It turns out that connect() can also be called on a datagram socket. The effect is that the kernel records an association between the calling socket and the remote address given to connect(). The socket can then use the write() or send() syscall without specifying the recipient’s address every time. At the same time, the socket will only receive datagrams sent from the address it is connected to. Note that the connectedness of datagram sockets is asymmetrical: the remote socket doesn’t have to be connected to the local one that called connect().

The association can be changed by calling connect() again on the same datagram socket with a different remote address. To dissolve it, call connect() with a peer address whose address family is AF_UNSPEC. Dissolving the association this way works on Linux but is not supported by every UNIX implementation, so it should not be treated as portable.

Paging in Linux on x86 – part 2

Our last post left the story of paging in Linux on x86 incomplete. Today we will cover that. This is a vast topic whose tentacles go far and wide into almost all aspects of the kernel. So we will cover only some interesting highlights. Our focus is on x86 without Physical Address Extension (PAE) enabled.

Data structures:

Although x86 uses two-level page tables, some architectures use three or four levels. For example, x86 with Physical Address Extension (PAE) uses three levels of page tables and the 64-bit x86_64 uses four. Linux aims to cover all architectures and therefore models four levels of page tables. They are:

Page Global Directory
Page Upper Directory
Page Middle Directory
Page Table

This means Linux divides a virtual address into five parts: one index into each of the tables above, plus an offset into the physical page frame. An entry in each of these tables is a 32-bit unsigned int on x86 (without PAE). The data types pgd_t, pud_t, pmd_t and pte_t represent entries of the respective tables, and there is a set of helper macros and functions to manipulate them.

On x86:

On x86, there are only two levels of page tables. Linux reconciles that with its four levels by nullifying the effect of PUD and PMD. It does this by keeping just one record in each of them. So practically, it is only using PGD and Page Table.

The kernel typically divides the 4GB virtual address space into 3GB for user space, from 0x00000000 to 0xbfffffff, and 1GB for kernel space, from 0xc0000000 to 0xffffffff. In kernel code, the macro PAGE_OFFSET contains the virtual address at which kernel space starts, i.e. 0xc0000000 on a typical x86 setup.

Early during boot, the kernel learns the size of RAM by querying the BIOS. It then scans physical addresses to find those that are unavailable. They can be:

addresses which are mapped with hardware devices’ I/O (memory-mapped I/O)
addresses pointing to page frames containing BIOS data

Typically the kernel loads itself at physical address 0x00100000, i.e. from the second MB onwards. The reasons for skipping the first MB are architecture-specific: not just x86, but other machines too do special things in that first MB of physical memory. The page frames in which the kernel sits never get swapped out to disk. The kernel also never swaps out the unavailable addresses mentioned above.

Kernel address mapping:

Of the 1GB of virtual address space that the kernel occupies, the first 896MB is directly mapped to physical addresses, i.e. there is a one-to-one mapping at a fixed offset. Since kernel pages are never swapped out, this mapping always holds. The macro __pa() converts a kernel virtual address into a physical address. It does simple arithmetic:

Physical address = virtual address – PAGE_OFFSET

Another macro, __va(), does the same thing in reverse.

The kernel sets aside the highest 128MB of its 1GB address space for non-contiguous allocations, mappings of high memory and what are called fix-mapped linear addresses. Non-contiguous allocation is a separate topic of its own, but we will quickly describe fix-mapped linear addresses before concluding this article. They are basically constant mappings from virtual to physical addresses, but unlike the first 896MB they don’t follow the simple offset formula above. Their mapping is arbitrary but fixed nonetheless. The kernel uses them in place of pointer variables because dereferencing a constant address requires fewer memory accesses.

Paging in Linux on x86

In our last post we covered how x86 logical address is translated into linear address. In this one we will look at translation from linear to physical. We will use the terms ‘virtual address’ and ‘linear address’ interchangeably.

A piece of hardware called paging unit is responsible for converting virtual addresses to physical. However, the operating system needs to set it up with correct data structures – page tables. On x86, paging is enabled by setting a flag inside a special register. When that flag is zero, paging is not enabled and linear addresses are treated as physical addresses. Linux first sets up page tables and then enables paging.

Pages and page tables

For ease of managing memory, e.g. access rights, physical memory is divided into `page frames`. These are contiguous blocks of RAM, usually 4KB in size. Corresponding to each physical page frame there is a `page` of virtual addresses. For instance, virtual addresses 0x20300000 to 0x20300fff form a page that corresponds to 4096 physical addresses, each of which points to a cell (one byte) in RAM. A page and a page frame both cover contiguous addresses, so inside a page the virtual-to-physical mapping is one-to-one. The page is the basic unit of memory management in Linux. A key function of the paging unit is to check the type of access to a virtual address (read or write) against the access rights of the page to which that address belongs. When an access right is violated, the paging unit generates a Page Fault.

A page table is an array in RAM that maps virtual addresses to physical addresses. Each user process has its own page table, and the page tables are switched as part of a context switch. Each entry in the page table points to a page frame in RAM. So a 32-bit virtual address has two parts: the page table index (the 20 most significant bits) and the page offset (12 bits, because the page size is 4096 bytes). Using the page table index we get the page frame. Inside the page frame we use the page offset to get the exact memory cell, the byte that the virtual address points to.

A naive way of organising the page table would be to have a single table whose indices are the 20 most significant bits of the virtual address and whose values contain (among other things) the physical address of a page frame. That would be wasteful. If each entry is 4 bytes, one page table would require (2^20 * 4) bytes = 4MB of RAM, and that is for each process. x86 instead breaks the single page table into two levels: the Page Directory and Page Tables. The virtual address is accordingly divided into three parts: a 10-bit index into the Page Directory, which selects a Page Table; a 10-bit index into that Page Table, which gives the page frame address; and the same 12-bit page offset, which finds the cell inside the page frame. This way each process must have a Page Directory, but there is no need to allocate all Page Tables upfront. Instead, Page Tables can be set up when they are needed.

Management of Pages

The physical address of the Page Directory is stored in a special register, and that register is updated on a context switch. Entries in the Page Directory and Page Table have the same format. Along with the address of the corresponding page frame (or Page Table, in the case of a Page Directory entry), each entry stores the privilege level needed to access that page. This privilege level is a single bit, so it has two possible values. It is checked against the CPU’s Current Privilege Level (CPL), a two-bit value on x86 that represents four levels. The page table entry only distinguishes whether a page requires supervisor mode (CPL = 0, 1 or 2) or not (CPL = 3).

A page table entry also records the type of access allowed: read or write. In contrast, segments have three access rights: read, write and execute. So a page that is read-only cannot be written to.

What about Linux?

As you might have noticed, this post hasn’t really lived up to its title and only talks about paging in x86. Time and other conditions permitting, we will discuss paging in Linux in a follow-up article.

80×86 segmentation & what Linux does with it

Background:

Address space segmentation basically means dividing all possible virtual addresses into groups, called segments, and applying some properties to those segments, e.g. the privilege level required to access them. Segmentation applies to virtual addresses, so it comes into play before virtual-to-physical address translation takes place. In x86, segmentation is a relic from the past. The 286 didn’t have virtual addressing, so it divided the address space into segments so that processes could keep to addresses in their own segments. The 386 then added virtual addresses but still kept segments.

Different types of addresses

In x86, there are three different types of addresses.

Logical
Linear
Physical

Translating a logical address into a physical one therefore takes two steps. Translation from logical to linear is described in this article. Translation from linear to physical is done using page tables, and we might cover it in a follow-up article.

A logical address consists of two parts: segment and offset. The segment is basically an index into an array of 8-byte records (descriptors) stored in RAM. This array is called the Global Descriptor Table (GDT). There is also a per-process Local Descriptor Table (LDT), but we will ignore it as it doesn’t play a significant role in this discussion.

Each entry in the GDT contains information about the segment it represents: base address, range (max address), the CPU privilege level needed to access it, and some other details.

Linear address = base address from segment entry in GDT + offset part of logical address

So to convert a logical address into linear, take base address from segment entry in GDT and add offset to it.

What Linux does with it

Linux prefers to group addresses into sections and manage them during the linear-to-physical translation phase rather than the logical-to-linear phase. It therefore pretty much nullifies the effect of the segment part of a logical address, so that the offset alone represents the linear address. It does create four segments: two (code and data) each for user space and kernel space. But each segment’s base is zero and its limit is 2^32 - 1, thereby nullifying segmentation. It does, however, use the CPU privilege level, so that the CPU has to be at the right privilege level to access the kernel space segments, i.e. kernel code and kernel data.