2 Memory Addressing

2.1 Memory Addresses

2.2 Segmentation in Hardware

2.3 Segmentation in Linux

2.4 Paging in Hardware

2.5 Paging in Linux

3 Processes

3.1 Processes, Lightweight Processes, and Threads

3.2 Process Descriptor

3.3 Process Switch

3.3.1 Hardware Context

The set of data that must be loaded into the registers before the process resumes its execution on the CPU is called the hardware context. The hardware context is a subset of the process execution context, which includes all information needed for the process execution. In Linux, a part of the hardware context of a process is stored in the process descriptor, while the remaining part is saved in the Kernel Mode stack.

We can thus define a process switch as the activity consisting of saving the hardware context of prev and replacing it with the hardware context of next.

The contents of all registers used by a process in User Mode have already been saved on the Kernel Mode stack before performing process switching (see Chapter 4). This includes the contents of the ss and esp pair that specifies the User Mode stack pointer address.

3.3.2 Task State Segment
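(Note: the structures below come from the kernel's AMD SVM support, arch/x86/include/asm/svm.h; the VMCB save area is reproduced here because it gives a concrete, modern picture of the per-task processor state—segment registers, control registers, instruction and stack pointers—that a TSS-style save area holds.)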

struct __attribute__ ((__packed__)) vmcb_seg {
    u16 selector;
    u16 attrib;
    u32 limit;
    u64 base;
};

struct __attribute__ ((__packed__)) vmcb_save_area {
    struct vmcb_seg es;
    struct vmcb_seg cs;
    struct vmcb_seg ss;
    struct vmcb_seg ds;
    struct vmcb_seg fs;
    struct vmcb_seg gs;
    struct vmcb_seg gdtr;
    struct vmcb_seg ldtr;
    struct vmcb_seg idtr;
    struct vmcb_seg tr;
    u8 reserved_1[43];
    u8 cpl;
    u8 reserved_2[4];
    u64 efer;
    u8 reserved_3[112];
    u64 cr4;
    u64 cr3;
    u64 cr0;
    u64 dr7;
    u64 dr6;
    u64 rflags;
    u64 rip;
    u8 reserved_4[88];
    u64 rsp;
    u8 reserved_5[24];
    u64 rax;
    u64 star;
    u64 lstar;
    u64 cstar;
    u64 sfmask;
    u64 kernel_gs_base;
    u64 sysenter_cs;
    u64 sysenter_esp;
    u64 sysenter_eip;
    u64 cr2;
    u8 reserved_6[32];
    u64 g_pat;
    u64 dbgctl;
    u64 br_from;
    u64 br_to;
    u64 last_excp_from;
    u64 last_excp_to;
};

struct __attribute__ ((__packed__)) vmcb {
    struct vmcb_control_area control;
    struct vmcb_save_area save;
};

3.3.2.1 The thread field

Each process descriptor includes a field called thread of type thread_struct, in which the kernel saves the hardware context whenever the process is being switched out.

This data structure includes fields for most of the CPU registers, except the general-purpose registers such as eax, ebx, etc., which are stored in the Kernel Mode stack.
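For concreteness, an abridged sketch of the 32-bit thread_struct from 2.6-era kernels (include/asm-i386/processor.h):

struct thread_struct {
    struct desc_struct tls_array[GDT_ENTRY_TLS_ENTRIES];
    unsigned long esp0;          /* Kernel Mode stack pointer used on ring-0 entry */
    unsigned long sysenter_cs;
    unsigned long eip;           /* where execution resumes after a switch */
    unsigned long esp;           /* saved Kernel Mode stack pointer */
    unsigned long fs, gs;        /* segment registers not saved on the stack */
    unsigned long debugreg[8];   /* debug registers dr0-dr7 */
    unsigned long cr2, trap_no, error_code;
    union i387_union i387;       /* FPU, MMX, and XMM registers */
    /* ... vm86 fields, I/O permission bitmap pointer, etc. ... */
};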

3.3.3 Performing the Process Switch

Essentially, every process switch consists of two steps:

  1. Switching the Page Global Directory to install a new address space (Chapter 9).
  2. Switching the Kernel Mode stack and the hardware context, which provides all the information needed by the kernel to execute the new process, including the CPU registers.

3.3.3.1 The switch_to macro

Before the process switching, the macro saves in the eax CPU register the content of the variable identified by the first input parameter prev—that is, the prev local variable allocated on the Kernel Mode stack of A. After the process switching, when A has resumed its execution, the macro writes the content of the eax CPU register in the memory location of A identified by the third output parameter last. Because the CPU register doesn’t change across the process switch, this memory location receives the address of C’s descriptor. In the current implementation of schedule(), the last parameter identifies the prev local variable of A, so prev is overwritten with the address of C.
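For reference, a lightly abridged version of the i386 switch_to macro from 2.6-era kernels (include/asm-i386/system.h); the asm operand constraints implement the prev/last trick described above:

#define switch_to(prev, next, last) do {                                   \
    unsigned long esi, edi;                                                \
    asm volatile("pushfl\n\t"             /* save eflags */                \
                 "pushl %%ebp\n\t"        /* save the frame pointer */     \
                 "movl %%esp,%0\n\t"      /* prev->thread.esp = esp */     \
                 "movl %5,%%esp\n\t"      /* esp = next->thread.esp */     \
                 "movl $1f,%1\n\t"        /* prev->thread.eip = label 1 */ \
                 "pushl %6\n\t"           /* push next->thread.eip ... */  \
                 "jmp __switch_to\n"      /* ... so __switch_to "returns" there */ \
                 "1:\t"                                                    \
                 "popl %%ebp\n\t"         /* restore the frame pointer */  \
                 "popfl"                  /* restore eflags */             \
                 : "=m" (prev->thread.esp), "=m" (prev->thread.eip),       \
                   "=a" (last), "=S" (esi), "=D" (edi)                     \
                 : "m" (next->thread.esp), "m" (next->thread.eip),         \
                   "2" (prev), "d" (next));                                \
} while (0)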

3.3.3.2 The __switch_to function

3.3.4 Saving and Loading the FPU, MMX, and XMM Registers

3.4 Creating Processes

3.5 Destroying Processes

4 Interrupts and Exceptions

Interrupts are often divided into synchronous and asynchronous interrupts:

  • Synchronous interrupts (exceptions) are produced by the CPU control unit while executing instructions; they are called synchronous because the control unit issues them only after terminating the execution of an instruction.
  • Asynchronous interrupts (interrupts proper) are generated by other hardware devices at arbitrary times with respect to the CPU clock signals.

4.1 The Role of Interrupt Signals

There is a key difference between interrupt handling and process switching: the code executed by an interrupt or by an exception handler is not a process. Rather, it is a kernel control path that runs at the expense of the same process that was running when the interrupt occurred. As a kernel control path, the interrupt handler is lighter than a process (it has less context and requires less time to set up or tear down).

Interrupt handling must satisfy the following constraints:

  • Interrupts can come anytime, when the kernel may want to finish something else it was trying to do.
  • Because interrupts can come anytime, the kernel might be handling one of them while another one (of a different type) occurs. When the last kernel control path terminates, the kernel must be able to resume execution of the interrupted process or switch to another process if the interrupt signal has caused a rescheduling activity.
  • Although the kernel may accept a new interrupt signal while handling a previous one, some critical regions exist inside the kernel code where interrupts must be disabled.

4.2 Interrupts and Exceptions

The Intel documentation classifies interrupts and exceptions as follows:

  • Interrupts
    • Maskable interrupts
    • Non-maskable interrupts
  • Exceptions
    • Processor-detected exceptions
      • Faults
      • Traps
      • Aborts
    • Programmed exceptions

Programmable Interrupt Controller performs the following actions:

  1. Monitors the IRQ lines, checking for raised signals. If two or more IRQ lines are raised, selects the one having the lowest pin number.
  2. If a raised signal occurs on an IRQ line:
    • Converts the raised signal received into a corresponding vector.
    • Stores the vector in an Interrupt Controller I/O port, thus allowing the CPU to read it via the data bus.
    • Sends a raised signal to the processor INTR pin—that is, issues an interrupt.
    • Waits until the CPU acknowledges the interrupt signal by writing into one of the Programmable Interrupt Controllers (PIC) I/O ports; when this occurs, clears the INTR line.
  3. Goes back to step 1.

Intel’s default vector associated with IRQn is n+32. As mentioned before, the mapping between IRQs and vectors can be modified by issuing suitable I/O instructions to the Interrupt Controller ports.

Disabled interrupts are not lost; the PIC sends them to the CPU as soon as they are enabled again.

4.2.1 APIC

Each local APIC has 32-bit registers, an internal clock, a local timer device, and two additional IRQ lines, LINT0 and LINT1, reserved for local APIC interrupts. All local APICs are connected to an external I/O APIC, giving rise to a multi-APIC system.

The I/O APIC consists of a set of 24 IRQ lines, a 24-entry Interrupt Redirection Table, programmable registers, and a message unit for sending and receiving APIC messages over the APIC bus.

Each entry in the Redirection Table can be individually programmed to indicate the interrupt vector and priority, the destination processor, and how the processor is selected. The information in the Redirection Table is used to translate each external IRQ signal into a message to one or more local APIC units via the APIC bus.

Interrupt requests coming from external hardware devices can be distributed among the available CPUs in two ways:

  • Static distribution
    • The IRQ signal is delivered to the local APICs listed in the corresponding Redirection Table entry.
  • Dynamic distribution
    • The IRQ signal is delivered to the local APIC of the processor that is executing the process with the lowest priority; the task priority register (TPR) of each local APIC is used to compute this priority.

Interprocessor interrupts

When a CPU wishes to send an interrupt to another CPU, it stores the interrupt vector and the identifier of the target’s local APIC in the Interrupt Command Register (ICR) of its own local APIC. A message is then sent via the APIC bus to the target’s local APIC, which in turn issues a corresponding interrupt to its own CPU.

4.2.2 Exceptions

4.2.3 Interrupt Descriptor Table

The descriptors are:

  • Task gate: Includes the TSS selector of the process that must replace the current one when an interrupt signal occurs.
  • Interrupt gate: Includes the Segment Selector and the offset inside the segment of an interrupt or exception handler. While transferring control to the proper segment, the processor clears the IF flag, thus disabling further maskable interrupts.
  • Trap gate: Similar to an interrupt gate, except that while transferring control to the proper segment, the processor does not modify the IF flag.
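For reference, a minimal sketch of the layout of a 32-bit gate descriptor (field names are illustrative, not the kernel's):

struct gate_desc {
    u16 offset_low;   /* bits 0-15 of the handler's offset within the segment */
    u16 selector;     /* Segment Selector of the handler's code segment */
    u8  reserved;
    u8  flags;        /* P bit, DPL, and gate type (task/interrupt/trap) */
    u16 offset_high;  /* bits 16-31 of the handler's offset */
};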

4.2.4 Hardware Handling of Interrupts and Exceptions

After executing an instruction, the cs and eip pair of registers contain the logical address of the next instruction to be executed. Before dealing with that instruction, the control unit checks whether an interrupt or an exception occurred while the control unit executed the previous instruction. If one occurred, the control unit does the following:

  1. Determines the vector i (0 ≤ i ≤ 255) associated with the interrupt or the exception.
  2. Reads the ith entry of the IDT referred by the idtr register (we assume in the following description that the entry contains an interrupt or a trap gate).
  3. Gets the base address of the GDT from the gdtr register and looks in the GDT to read the Segment Descriptor identified by the selector in the IDT entry. This descriptor specifies the base address of the segment that includes the interrupt or exception handler.
  4. Makes sure the interrupt was issued by an authorized source.
    • First, it compares the Current Privilege Level (CPL), which is stored in the two least significant bits of the cs register, with the Descriptor Privilege Level (DPL) of the Segment Descriptor included in the GDT. Raises a "General protection" exception if the CPL is lower than the DPL, because the interrupt handler cannot have a lower privilege than the program that caused the interrupt.
    • For programmed exceptions, makes a further security check: compares the CPL with the DPL of the gate descriptor included in the IDT and raises a "General protection" exception if the DPL is lower than the CPL. This last check makes it possible to prevent access by user applications to specific trap or interrupt gates.
  5. Checks whether a change of privilege level is taking place—that is, if CPL is different from the selected Segment Descriptor’s DPL. If so, the control unit must start using the stack that is associated with the new privilege level. It does this by performing the following steps:
    • Reads the tr register to access the TSS segment of the running process.
    • Loads the ss and esp registers with the proper values for the stack segment and stack pointer associated with the new privilege level. These values are found in the TSS (see the section "Task State Segment" in Chapter 3).
    • In the new stack, it saves the previous values of ss and esp, which define the logical address of the stack associated with the old privilege level.
  6. If a fault has occurred, it loads cs and eip with the logical address of the instruction that caused the exception so that it can be executed again.
  7. Saves the contents of eflags, cs, and eip in the stack.
  8. If the exception carries a hardware error code, it saves it on the stack.
  9. Loads cs and eip, respectively, with the Segment Selector and the Offset fields of the Gate Descriptor stored in the ith entry of the IDT. These values define the logical address of the first instruction of the interrupt or exception handler.

After the interrupt or exception is processed, the corresponding handler must relinquish control to the interrupted process by issuing the iret instruction, which forces the control unit to:

  1. Load the cs, eip, and eflags registers with the values saved on the stack. If a hardware error code has been pushed in the stack on top of the eip contents, it must be popped before executing iret.
  2. Check whether the CPL of the handler is equal to the value contained in the two least significant bits of cs (this means the interrupted process was running at the same privilege level as the handler). If so, iret concludes execution; otherwise, go to the next step.
  3. Load the ss and esp registers from the stack and return to the stack associated with the old privilege level.
  4. Examine the contents of the ds, es, fs, and gs segment registers; if any of them contains a selector that refers to a Segment Descriptor whose DPL value is lower than CPL, clear the corresponding segment register. The control unit does this to forbid User Mode programs that run with a CPL equal to 3 from using segment registers previously used by kernel routines (with a DPL equal to 0). If these registers were not cleared, malicious User Mode programs could exploit them in order to access the kernel address space.

4.3 Nested Execution of Exception and Interrupt Handlers

The last instructions of a kernel control path that is taking care of an interrupt do not always put the current process back into User Mode: if the level of nesting is greater than 1, these instructions will put into execution the kernel control path that was interrupted last, and the CPU will continue to run in Kernel Mode.

The price to pay for allowing nested kernel control paths is that an interrupt handler must never block; that is, no process switch can take place while an interrupt handler is running. In fact, all the data needed to resume a nested kernel control path is stored in the Kernel Mode stack, which is tightly bound to the current process.

In contrast to exceptions, interrupts issued by I/O devices do not refer to data structures specific to the current process, although the kernel control paths that handle them run on behalf of that process.

An interrupt handler may preempt both other interrupt handlers and exception handlers. Conversely, an exception handler never preempts an interrupt handler.

Linux interleaves kernel control paths for two major reasons:

  • To improve the throughput of programmable interrupt controllers and device controllers.
  • To implement an interrupt model without priority levels.

4.4 Initializing the Interrupt Descriptor Table

Before the kernel enables the interrupts, it must load the initial address of the IDT table into the idtr register and initialize all the entries of that table.

4.4.1 Interrupt, Trap, and System Gates

  • Interrupt gate: All Linux interrupt handlers are activated by means of interrupt gates, and all are restricted to Kernel Mode.
  • System gate: The three Linux exception handlers associated with the vectors 4, 5, and 128 are activated by means of system gates, so the three assembly language instructions into, bound, and int $0x80 can be issued in User Mode.
  • System interrupt gate: The exception handler associated with the vector 3 is activated by means of a system interrupt gate, so the assembly language instruction int3 can be issued in User Mode.
  • Trap gate: Most Linux exception handlers are activated by means of trap gates.
  • Task gate: The Linux handler for the "Double fault" exception is activated by means of a task gate.

4.4.2 Preliminary Initialization of the IDT

The IDT is initialized and used by the BIOS routines while the computer still operates in Real Mode. Once Linux takes over, however, the IDT is moved to another area of RAM and initialized a second time, because Linux does not use any BIOS routine.

4.5 Exception Handling

Exception handlers have a standard structure consisting of three steps:

  1. Save the contents of most registers in the Kernel Mode stack (this part is coded in assembly language).
  2. Handle the exception by means of a high-level C function.
  3. Exit from the handler by means of the ret_from_exception( ) function.

4.5.1 Saving the Registers for the Exception Handler

ENTRY(async_page_fault)
  ASM_CLAC                        # clear the AC flag (SMAP hardening)
  pushl $do_async_page_fault      # push the address of the high-level C handler
  jmp   common_exception          # common code: save registers, invoke the handler
END(async_page_fault)

4.5.2 Entering and Leaving the Exception Handler

4.6 Interrupt Handling

4.6.1 I/O Interrupt Handling

Interrupt handler flexibility is achieved in two distinct ways, as discussed in the following list.

  • IRQ sharing: each ISR is executed to verify whether its device needs attention; if so, the ISR performs all the operations that need to be executed when the device raises an interrupt.
  • IRQ dynamic allocation: an IRQ line is associated with a device driver at the last possible moment (for instance, the IRQ line of the floppy device is allocated only when a user accesses the floppy disk); this way, the same IRQ vector may be used by several hardware devices even if they cannot share the IRQ line.

Long noncritical operations should be deferred, because while an interrupt handler is running, the signals on the corresponding IRQ line are temporarily ignored. Therefore, interrupt handlers cannot perform any blocking procedure such as an I/O disk operation.

Linux divides the actions to be performed following an interrupt into three classes:

  • Critical: Critical actions are executed within the interrupt handler immediately, with maskable interrupts disabled.
  • Noncritical: These actions can also finish quickly, so they are executed by the interrupt handler immediately, with the interrupts enabled.
  • Noncritical deferrable: These actions are performed by means of separate functions that are discussed in the later section "Softirqs and Tasklets."

All I/O interrupt handlers perform the same four basic actions:

  1. Save the IRQ value and the registers’ contents on the Kernel Mode stack.
  2. Send an acknowledgment to the PIC that is servicing the IRQ line, thus allowing it to issue further interrupts.
  3. Execute the interrupt service routines (ISRs) associated with all the devices that share the IRQ.
  4. Terminate by jumping to the ret_from_intr( ) address.

4.6.1.1 Interrupt vectors

| Vector range | Use |
| --- | --- |
| 0–19 (0x0–0x13) | Nonmaskable interrupts and exceptions |
| 20–31 (0x14–0x1f) | Intel-reserved |
| 32–127 (0x20–0x7f) | External interrupts (IRQs) |
| 128 (0x80) | Programmed exception for system calls |
| 129–238 (0x81–0xee) | External interrupts (IRQs) |
| 239 (0xef) | Local APIC timer interrupt |
| 240 (0xf0) | Local APIC thermal interrupt |
| 241–250 (0xf1–0xfa) | Reserved by Linux for future use |
| 251–253 (0xfb–0xfd) | Interprocessor interrupts |
| 254 (0xfe) | Local APIC error interrupt |
| 255 (0xff) | Local APIC spurious interrupt |

4.6.1.2 IRQ data structures

4.6.1.3 IRQ distribution in multiprocessor systems

During system bootstrap, the booting CPU executes the setup_IO_APIC_irqs() function to initialize the I/O APIC chip. The 24 entries of the Interrupt Redirection Table of the chip are filled, so that all IRQ signals from the I/O hardware devices can be routed to each CPU in the system according to the "lowest priority" scheme (see the earlier section "IRQs and Interrupts").

During system bootstrap, moreover, all CPUs execute the setup_local_APIC() function, which takes care of initializing the local APICs.

When a hardware device raises an IRQ signal, the multi-APIC system selects one of the CPUs and delivers the signal to the corresponding local APIC, which in turn interrupts its CPU. No other CPUs are notified of the event.

4.6.1.4 Multiple Kernel Mode stacks

4.6.1.5 Saving the registers for the interrupt handler

common_interrupt:
    SAVE_ALL                # save the registers not saved by the control unit
    movl %esp,%eax          # pass the address of the saved registers to do_IRQ()
    call do_IRQ
    jmp ret_from_intr       # terminate: exit through the return-from-interrupt path

SAVE_ALL:
    cld
    push %es                # save the segment registers used in User Mode
    push %ds
    pushl %eax              # save the general-purpose registers
    pushl %ebp
    pushl %edi
    pushl %esi
    pushl %edx
    pushl %ecx
    pushl %ebx
    movl $__USER_DS,%edx    # reload ds and es with the __USER_DS selector
    movl %edx,%ds
    movl %edx,%es

SAVE_ALL saves all the CPU registers that may be used by the interrupt handler on the stack, except for eflags, cs, eip, ss, and esp, which are already saved automatically by the control unit (see the earlier section "Hardware Handling of Interrupts and Exceptions").

4.6.1.6 The do_IRQ( ) function

4.6.1.7 Reviving a lost interrupt

4.6.1.8 Dynamic allocation of IRQ lines

As noted in section "Interrupt vectors," a few vectors are reserved for specific devices, while the remaining ones are dynamically handled.

4.6.2 Interprocessor Interrupt Handling

An interprocessor interrupt (IPI) is delivered not through an IRQ line, but directly as a message on the bus that connects the local APICs of all CPUs (either a dedicated bus in older motherboards, or the system bus in Pentium 4–based motherboards).

Linux makes use of three kinds of interprocessor interrupts (see also Table 4-2):

  • CALL_FUNCTION_VECTOR (vector 0xfb): sent to all CPUs but the sender, forcing those CPUs to run a function passed by the sender; the corresponding handler is named call_function_interrupt().
  • RESCHEDULE_VECTOR (vector 0xfc): when a CPU receives this type of interrupt, the corresponding handler, named reschedule_interrupt(), limits itself to acknowledging the interrupt; rescheduling is done automatically when returning from it.
  • INVALIDATE_TLB_VECTOR (vector 0xfd): sent to all CPUs but the sender, forcing them to invalidate their Translation Lookaside Buffers; the corresponding handler is named invalidate_interrupt().

4.7 Softirqs and Tasklets

Softirqs are statically allocated (i.e., defined at compile time), while tasklets can also be allocated and initialized at runtime (for instance, when loading a kernel module).

Softirqs can run concurrently on several CPUs, even if they are of the same type. Thus, softirqs are reentrant functions and must explicitly protect their data structures with spin locks. Tasklets do not have to worry about this, because their execution is controlled more strictly by the kernel.

Tasklets of the same type are always serialized: in other words, the same type of tasklet cannot be executed by two CPUs at the same time. However, tasklets of different types can be executed concurrently on several CPUs. Serializing the tasklet simplifies the life of device driver developers, because the tasklet function needs not be reentrant.
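A minimal tasklet sketch (my_tasklet_fn and my_dev_interrupt are hypothetical names; the interrupt handler signature shown is the post-2.6.19 one):

#include <linux/interrupt.h>

static void my_tasklet_fn(unsigned long data)
{
    /* Deferred, noncritical work: runs in interrupt context, but with
     * the IRQ already acknowledged; never runs on two CPUs at once. */
}

static DECLARE_TASKLET(my_tasklet, my_tasklet_fn, 0);

static irqreturn_t my_dev_interrupt(int irq, void *dev_id)
{
    /* Critical part only; defer the rest to the tasklet. */
    tasklet_schedule(&my_tasklet);
    return IRQ_HANDLED;
}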

Checks for active (pending) softirqs should be performed periodically:

  • When the kernel invokes the local_bh_enable() function to enable softirqs on the local CPU
  • When the do_IRQ() function finishes handling an I/O interrupt and invokes the irq_exit() macro
  • If the system uses an I/O APIC, when the smp_apic_timer_interrupt() function finishes handling a local timer interrupt (see the section "Timekeeping Architecture in Multiprocessor Systems" in Chapter 6)
  • In multiprocessor systems, when a CPU finishes handling a function triggered by a CALL_FUNCTION_VECTOR interprocessor interrupt
  • When one of the special ksoftirqd/n kernel threads is awakened (see later)

4.8 Work Queues

The differences between deferrable functions and work queues (a minimal usage sketch follows this list):

  • The main difference is that deferrable functions run in interrupt context while functions in work queues run in process context. Running in process context is the only way to execute functions that can block.
  • No process switch can take place in interrupt context.
  • Neither deferrable functions nor functions in a work queue can access the User Mode address space of a process.
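A minimal sketch using the predefined work queue (my_work_fn and my_dev_interrupt are hypothetical names; very old 2.6 kernels also passed a data argument to DECLARE_WORK):

#include <linux/interrupt.h>
#include <linux/workqueue.h>

static void my_work_fn(struct work_struct *work)
{
    /* Runs in process context, so it may block, e.g., on disk I/O. */
}

static DECLARE_WORK(my_work, my_work_fn);

static irqreturn_t my_dev_interrupt(int irq, void *dev_id)
{
    schedule_work(&my_work);  /* defer the blocking part to a worker thread */
    return IRQ_HANDLED;
}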

4.9 Returning from Interrupts and Exceptions

Before resuming the execution of a program, several issues must be considered:

  • Number of kernel control paths being concurrently executed. If there is just one, the CPU must switch back to User Mode.
  • Pending process switch requests. If there is any request, the kernel must perform process scheduling; otherwise, control is returned to the current process.
  • Pending signals. If a signal is sent to the current process, it must be handled.
  • Single-step mode. If a debugger is tracing the execution of the current process, single-step mode must be restored before switching back to User Mode.
  • Virtual-8086 mode. If the CPU is in virtual-8086 mode, the current process is executing a legacy Real Mode program, thus it must be handled in a special way.

A few flags are used to keep track of pending process switch requests, of pending signals, and of single-step execution; they are stored in the flags field of the thread_info descriptor:

| Flag name | Description |
| --- | --- |
| TIF_SYSCALL_TRACE | System calls are being traced |
| TIF_NOTIFY_RESUME | Not used in the 80x86 platform |
| TIF_SIGPENDING | The process has pending signals |
| TIF_NEED_RESCHED | Scheduling must be performed |
| TIF_SINGLESTEP | Restore single step execution on return to User Mode |
| TIF_IRET | Force return from system call via iret rather than sysexit |
| TIF_SYSCALL_AUDIT | System calls are being audited |
| TIF_POLLING_NRFLAG | The idle process is polling the TIF_NEED_RESCHED flag |
| TIF_MEMDIE | The process is being destroyed to reclaim memory (see the section "The Out of Memory Killer" in Chapter 17) |

6 Timing Measurements

Two main kinds of timing measurement must be performed by the Linux kernel:

  1. Keeping the current time and date so they can be returned to user programs through the time(), ftime(), and gettimeofday() APIs and used by the kernel itself as timestamps for files and network packets
  2. Maintaining timers.

Clock and Timer Circuits

The clock circuits are used both to keep track of the current time of day and to make precise time measurements.

The timer circuits are programmed by the kernel, so that they issue interrupts at a fixed, predefined frequency.

Real Time Clock (RTC)

Linux uses the RTC only to derive the time and date;

It allows processes to program the RTC by acting on the /dev/rtc device file;

The kernel accesses the RTC through the 0x70 and 0x71 I/O ports.

Time Stamp Counter (TSC)

Linux may take advantage of this register to get much more accurate time measurements than those delivered by the Programmable Interval Timer.

Programmable Interval Timer (PIT)

The role of a PIT is similar to the alarm clock of a microwave oven: it makes the user aware that the cooking time interval has elapsed.

It issues a special interrupt called timer interrupt, which notifies the kernel that one more time interval has elapsed and goes on issuing interrupts forever at some fixed frequency established by the kernel.
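The LATCH value used in the snippet below is the divisor that makes the PIT raise HZ interrupts per second; roughly as defined in the kernel headers:

#define CLOCK_TICK_RATE 1193182                /* PIT input frequency, in Hz */
#define LATCH ((CLOCK_TICK_RATE + HZ/2) / HZ)  /* rounded divisor: one interrupt per tick */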

The PIT is initialized by setup_pit_timer() as follows:

spin_lock_irqsave(&i8253_lock, flags);
outb_p(0x34,0x43);           /* channel 0, lo/hi byte access, mode 2 (rate generator) */
udelay(10);
outb_p(LATCH & 0xff, 0x40);  /* low byte of the divisor */
udelay(10);
outb(LATCH >> 8, 0x40);      /* high byte of the divisor */
spin_unlock_irqrestore(&i8253_lock, flags);

CPU Local Timer (APIT)

The CPU local timer is a device similar to the Programmable Interval Timer just described that can issue one-shot or periodic interrupts. There are, however, a few differences:

  1. The APIC’s timer counter is 32 bits long, while the PIT’s timer counter is 16 bits long; therefore, the local timer can be programmed to issue interrupts at very low frequencies.
  2. The local APIC timer sends an interrupt only to its processor, while the PIT raises a global interrupt, which may be handled by any CPU in the system.
  3. The APIC’s timer is based on the bus clock signal (or the APIC bus signal, in older machines). It can be programmed in such a way to decrease the timer counter every 1, 2, 4, 8, 16, 32, 64, or 128 bus clock signals. Conversely, the PIT, which makes use of its own clock signals, can be programmed in a more flexible way.

High Precision Event Timer (HPET)

The HPET provides a number of hardware timers that can be exploited by the kernel.

Each HPET has one fixed-rate counter (at 10+ MHz, hence "High Precision") and up to 32 comparators. Normally three or more comparators are provided, each of which can generate one-shot interrupts and at least one of which has additional hardware to support periodic interrupts. The comparators are also called "timers", which can be misleading since usually timers are independent of each other … these share a counter, complicating resets. (From the Linux kernel documentation.)

Any counter is associated with at most 32 timers, each of which is composed of a comparator and a match register.

The comparator is a circuit that checks the value in the counter against the value in the match register, and raises a hardware interrupt if a match is found.

Some of the timers can be enabled to generate a periodic interrupt.

ACPI Power Management Timer (ACPI PMT)

The ACPI Power Management Timer is preferable to the TSC if the operating system or the BIOS may dynamically lower the frequency or voltage of the CPU to save battery power. When this happens, the frequency of the TSC changes—thus causing time warps and other unpleasant effects—while the frequency of the ACPI PMT does not.

The Linux Timekeeping Architecture

The kernel periodically:

  1. Updates the time elapsed since system startup.
  2. Updates the time and date.
  3. Determines, for every CPU, how long the current process has been running, and preempts it if it has exceeded the time allocated to it.
  4. Updates resource usage statistics.
  5. Checks whether the interval of time associated with each software timer has elapsed.

Linux’s timekeeping architecture depends also on the availability of the Time Stamp Counter (TSC), of the ACPI Power Management Timer (PMT), and of the High Precision Event Timer (HPET).

The kernel uses two basic timekeeping functions:

  1. keep the current time up-to-date
  2. count the number of nanoseconds that have elapsed within the current second.

Data Structures of the Timekeeping Architecture

The jiffies variable

The jiffies variable is a counter that stores the number of elapsed ticks since the system was started.

The kernel handles cleanly the overflow of jiffies thanks to the time_after, time_after_eq, time_before, and time_before_eq macros: they yield the correct value even if a wraparound occurred.
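For example, a wraparound-safe timeout check might look like this (two_seconds_elapsed is a hypothetical helper):

#include <linux/jiffies.h>

static int two_seconds_elapsed(unsigned long start)
{
    unsigned long timeout = start + 2 * HZ;  /* 2 seconds after start, in ticks */

    /* Correct even if jiffies wrapped around since start was sampled. */
    return time_after(jiffies, timeout);
}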

u64 get_jiffies_64(void)
{
  unsigned long seq;
  u64 ret;

  /* On 32-bit machines the 64-bit jiffies_64 cannot be read atomically:
   * the seqlock read loop retries if a writer updated the counter in
   * the meantime. */
  do {
    seq = read_seqbegin(&jiffies_lock);
    ret = jiffies_64;
  } while (read_seqretry(&jiffies_lock, seq));
  return ret;
}

The xtime variable
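The xtime variable stores the current time and date, as a structure of type timespec: tv_sec counts the seconds elapsed since midnight of January 1, 1970 (UTC), while tv_nsec counts the nanoseconds elapsed within the last second.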

struct timespec {
  __kernel_time_t   tv_sec;      /* seconds */
  long              tv_nsec;    /* nanoseconds */
};

Timekeeping Architecture in Uniprocessor Systems

Initialization phase

time_init

  1. Initializes the xtime variable by means of get_cmos_time().
  2. Initializes the wall_to_monotonic variable.
  3. If the kernel supports HPET, it invokes the hpet_enable() function to determine whether the ACPI firmware has probed the chip and mapped its registers in the memory address space.
  4. Invokes clocksource_select() to select the best timer source available in the system, and sets the cur_timer variable to the address of the corresponding timer object.
  5. Invokes setup_irq(0, &irq0) to set up the interrupt gate corresponding to IRQ0.

The timer interrupt handler

Timekeeping Architecture in Multiprocessor Systems

Multiprocessor systems can rely on two different sources of timer interrupts:

  1. those raised by the PIT or the HPET
  2. those raised by the CPU local timers.

Initialization phase

The global timer interrupt handler

Updating Local CPU Statistics

  1. Checks how long the current process has been running. 1.1

Software Timers and Delay Functions

Linux considers two types of timers called dynamic timers and interval timers. The first type is used by the kernel, while interval timers may be created by processes in User Mode.

Dynamic Timers

Dynamic timers may be dynamically created and destroyed. No limit is placed on the number of currently active dynamic timers.

Dynamic timers and race conditions

a rule of thumb is to stop the timer before releasing the resource: ...

del_timer(&t);
X_Release_Resources();

In multiprocessor systems, however, this code is not safe because the timer function might already be running on another CPU when del_timer() is invoked.
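To handle this case safely, the kernel offers the del_timer_sync() function, which deactivates the timer and returns only once the timer function is no longer running on any CPU.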

Dynamic timer handling

Despite the clever data structures, handling software timers is a time-consuming activity that should not be performed by the timer interrupt handler. In Linux 2.6 this activity is carried out by a deferrable function, namely the TIMER_SOFTIRQ softirq.

An Application of Dynamic Timers: the nanosleep( ) System Call

Delay Functions

static void udelay(int loops)
{
    while (loops--)
        io_delay();    /* Approximately 1 us */
}

static inline void io_delay(void)
{
    const u16 DELAY_PORT = 0x80;
    asm volatile("outb %%al,%0" : : "dN" (DELAY_PORT));
}

7 Process Scheduling

7.1 Scheduling Policy

Objectives: fast process response time, good throughput for background jobs, avoidance of process starvation, reconciliation of the needs of low- and high-priority processes, and so on.

Linux scheduling is based on the time sharing technique: several processes run in "time multiplexing" because the CPU time is divided into slices, one for each runnable process.

The scheduling policy is also based on ranking processes according to their priority.

When speaking about scheduling, processes are traditionally classified as I/O-bound or CPU-bound.

An alternative classification distinguishes three classes of processes: Interactive processes, Batch processes, Real-time processes.

| System call | Description |
| --- | --- |
| nice() | Change the static priority of a conventional process |
| getpriority() | Get the maximum static priority of a group of conventional processes |
| setpriority() | Set the static priority of a group of conventional processes |
| sched_getscheduler() | Get the scheduling policy of a process |
| sched_setscheduler() | Set the scheduling policy and the real-time priority of a process |
| sched_getparam() | Get the real-time priority of a process |
| sched_setparam() | Set the real-time priority of a process |
| sched_yield() | Relinquish the processor voluntarily without blocking |
| sched_get_priority_min() | Get the minimum real-time priority value for a policy |
| sched_get_priority_max() | Get the maximum real-time priority value for a policy |
| sched_rr_get_interval() | Get the time quantum value for the Round Robin policy |
| sched_setaffinity() | Set the CPU affinity mask of a process |
| sched_getaffinity() | Get the CPU affinity mask of a process |

7.1.1 Process Preemption

When a process enters the TASK_RUNNING state, the kernel checks whether its dynamic priority is greater than the priority of the currently running process. If it is, the execution of current is interrupted and the scheduler is invoked to select another process to run.

Be aware that a preempted process is not suspended, because it remains in the TASK_ RUNNING state; it simply no longer uses the CPU.

7.1.2 How Long Must a Quantum Last?

It is often believed that a long quantum duration degrades the response time of interactive applications. This is usually false. As described in the section "Process Preemption" earlier in this chapter, interactive processes have a relatively high priority, so they quickly preempt the batch processes, no matter how long the quantum duration is.

7.2 The Scheduling Algorithm

scheduling classes: SCHED_FIFO, SCHED_RR, SCHED_NORMAL

7.2.1 Scheduling of Conventional Processes

  • Base time quantum

    base time quantum (in milliseconds) = (140 − static priority) × 20 if static priority < 120; (140 − static priority) × 5 if static priority ≥ 120. (A worked numeric example follows at the end of this list.)

    | Description | Static priority | Nice value | Base time quantum | Interactive delta | Sleep time threshold |
    | --- | --- | --- | --- | --- | --- |
    | Highest static priority | 100 | -20 | 800 ms | -3 | 299 ms |
    | High static priority | 110 | -10 | 600 ms | -1 | 499 ms |
    | Default static priority | 120 | 0 | 100 ms | +2 | 799 ms |
    | Low static priority | 130 | +10 | 50 ms | +4 | 999 ms |
    | Lowest static priority | 139 | +19 | 5 ms | +6 | 1199 ms |
  • Dynamic priority and average sleep time

    dynamic priority = max(100, min(static priority − bonus + 5, 139))

    The bonus is a value ranging from 0 to 10; a value less than 5 represents a penalty that lowers the dynamic priority, while a value greater than 5 is a premium that raises the dynamic priority. Bonus is related to the average sleep time of the process

    | Average sleep time | Bonus | Granularity |
    | --- | --- | --- |
    | ≥ 0 but < 100 ms | 0 | 5120 |
    | ≥ 100 ms but < 200 ms | 1 | 2560 |
    | ≥ 200 ms but < 300 ms | 2 | 1280 |
    | ≥ 300 ms but < 400 ms | 3 | 640 |
    | ≥ 400 ms but < 500 ms | 4 | 320 |
    | ≥ 500 ms but < 600 ms | 5 | 160 |
    | ≥ 600 ms but < 700 ms | 6 | 80 |
    | ≥ 700 ms but < 800 ms | 7 | 40 |
    | ≥ 800 ms but < 900 ms | 8 | 20 |
    | ≥ 900 ms but < 1000 ms | 9 | 10 |
    | ≥ 1 second | 10 | 10 |

    interactive process: [dynamic priority ≤ 3 × static priority / 4 + 28] or [bonus - 5 ≥ static priority / 4 − 28]

    The expression static priority / 4 − 28 is called the interactive delta.

  • Active and expired processes

  • Scheduling of Real-Time Processes
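As a worked example of the formulas above: a process with the default static priority 120 gets a base time quantum of (140 − 120) × 5 = 100 ms, matching the table. If its average sleep time earns it a bonus of 8, its dynamic priority is max(100, min(120 − 8 + 5, 139)) = 117; since 117 ≤ 3 × 120 / 4 + 28 = 118 (equivalently, bonus − 5 = 3 ≥ 120 / 4 − 28 = 2), the process is classified as interactive.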

7.3 Data Structures Used by the Scheduler

7.3.4 schedule()

  • Direct invocation

    1. Inserts current in the proper wait queue.
    2. Changes the state of current either to TASK_INTERRUPTIBLE or to TASK_UNINTERRUPTIBLE.
    3. Invokes schedule( ).
    4. Checks whether the resource is available; if not, goes to step 2.
    5. Once the resource is available, removes current from the wait queue.
  • Lazy invocation

    The scheduler can also be invoked in a lazy way by setting the TIF_NEED_RESCHED flag of current to 1.

    • When current has used up its quantum of CPU time; this is done by the scheduler_tick() function.
    • When a process is woken up and its priority is higher than that of the current process; this task is performed by the try_to_wake_up() function.
    • When a sched_setscheduler() system call is issued (see the section "System Calls Related to Scheduling" later in this chapter).

7.4 Functions Used by the Scheduler

7.5 Runqueue Balancing in Multiprocessor Systems

Types of multiprocessor machines:

  • Classic multiprocessor architecture
  • Hyper-threading
    • Allows the processor to exploit the machine cycles to execute another thread while the current thread is stalled for a memory access. A hyper-threaded physical CPU is seen by Linux as several different logical CPUs.
  • NUMA

These basic kinds of multiprocessor systems are often combined. For instance, a motherboard that includes two different hyper-threaded CPUs is seen by the kernel as four logical CPUs.

Linux sports a sophisticated runqueue balancing algorithm based on the notion of "scheduling domains."

7.5.1 Scheduling Domains

A scheduling domain is a set of CPUs whose workloads should be kept balanced by the kernel.

Workload balancing is always done between groups of a scheduling domain. In other words, a process is moved from one CPU to another only if the total workload of some group in some scheduling domain is significantly lower than the workload of another group in the same scheduling domain.

TODO

7.5.2 rebalance_tick

7.6 System Calls Related to Scheduling

8 Memory Management

8.1 Page Frame Management

  1. Linux adopts the smaller 4 KB page frame size, rather than the 4 MB size, as the standard memory allocation unit. This makes things simpler for two reasons:
    • The Page Fault exceptions issued by the paging circuitry are easily interpreted.
    • Although both 4 KB and 4 MB are multiples of all disk block sizes, transfers of data between main memory and disks are in most cases more efficient when the smaller size is used.

8.1.1 Page Descriptors

8.1.2 Non-Uniform Memory Access (NUMA)

The Linux kernel makes use of NUMA even for some peculiar uniprocessor systems that have huge "holes" in the physical address space.

The kernel handles these architectures by assigning the contiguous subranges of valid physical addresses to different memory nodes.

8.1.3 Memory Zones

The Linux kernel must deal with two hardware constraints of the 80x86 architecture:

  • The Direct Memory Access (DMA) processors for old ISA buses have a strong limitation: they are able to address only the first 16 MB of RAM.
  • In modern 32-bit computers with lots of RAM, the CPU cannot directly access all physical memory because the linear address space is too small.
| Zone | Description |
| --- | --- |
| ZONE_DMA | Contains page frames of memory below 16 MB |
| ZONE_NORMAL | Contains page frames of memory at and above 16 MB and below 896 MB |
| ZONE_HIGHMEM | Contains page frames of memory at and above 896 MB |

//?? ZONE_HIGHMEM zone includes page frames that cannot be directly accessed by the kernel through the linear mapping in the fourth gigabyte of linear address space

8.1.4 The Pool of Reserved Page Frames

8.1.5 The Zoned Page Frame Allocator

8.1.6 Kernel Mappings of High-Memory Page Frames

8.1.6.1 Permanent kernel mappings

8.1.6.2 Temporary kernel mappings

8.1.7 The Buddy System Algorithm

All free page frames are grouped into 11 lists of blocks that contain groups of 1, 2, 4, 8, 16, 32, 64, 128, 256, 512, and 1024 contiguous page frames, respectively. The largest request of 1024 page frames corresponds to a chunk of 4 MB of contiguous RAM. The physical address of the first page frame of a block is a multiple of the group size.

The reverse operation, releasing blocks of page frames, gives rise to the name of this algorithm. The kernel attempts to merge pairs of free buddy blocks of size b together into a single block of size 2b. Two blocks are considered buddies if:

  • Both blocks have the same size, say b.
  • They are located in contiguous physical addresses.
  • The physical address of the first page frame of the first block is a multiple of 2 × b × 2^12.
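For example, two free 16-page-frame blocks starting at page frames 32 and 48 are buddies: the first block's physical address, 32 × 2^12 = 131072, is a multiple of 2 × 16 × 2^12 = 131072, so the kernel merges them into a single 32-frame block starting at frame 32, and then tries to merge that block with its own 32-frame buddy, and so on.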

8.1.9 The Per-CPU Page Frame Cache

8.1.10 The Zone Allocator

The __alloc_pages() function essentially executes the following steps:

  1. Performs a first scanning of the memory zones (see the block of code shown earlier). In this first scan, the min threshold value is set to z->pages_low, where z points to the zone descriptor being analyzed (the can_try_harder and gfp_high parameters are set to zero).
  2. If the function did not terminate in the previous step, there is not much free memory left: the function awakens the kswapd kernel threads to start reclaiming page frames asynchronously (see Chapter 17).
  3. Performs a second scanning of the memory zones, passing as base threshold the value z->pages_min. As explained previously, the actual threshold is determined also by the can_try_harder and gfp_high flags. This step is nearly identical to step 1, except that the function is using a lower threshold.
  4. If the function did not terminate in the previous step, the system is definitely low on memory. If the kernel control path that issued the memory allocation request is not an interrupt handler or a deferrable function and it is trying to reclaim page frames (either the PF_MEMALLOC flag or the PF_MEMDIE flag of current is set), the function then performs a third scanning of the memory zones, trying to allocate the page frames ignoring the low-on-memory thresholds—that is, without invoking zone_watermark_ok(). This is the only case where the kernel control path is allowed to deplete the low-on-memory reserve of pages specified by the lowmem_reserve field of the zone descriptor. In fact, in this case the kernel control path that issued the memory request is ultimately trying to free page frames, thus it should get what it has requested, if at all possible. If no memory zone includes enough page frames, the function returns NULL to notify the caller of the failure.
  5. Here, the invoking kernel control path is not trying to reclaim memory. If the __GFP_WAIT flag of gfp_mask is not set, the function returns NULL to notify the kernel control path of the memory allocation failure: in this case, there is no way to satisfy the request without blocking the current process.
  6. Here the current process can be blocked: invokes cond_resched() to check whether some other process needs the CPU.
  7. Sets the PF_MEMALLOC flag of current, to denote the fact that the process is ready to perform memory reclaiming.
  8. Stores in current->reclaim_state a pointer to a reclaim_state structure. This structure includes just one field, reclaimed_slab, initialized to zero (we’ll see how this field is used in the section “Interfacing the Slab Allocator with the Zoned Page Frame Allocator” later in this chapter).
  9. Invokes try_to_free_pages() to look for some page frames to be reclaimed (see the section “Low On Memory Reclaiming” in Chapter 17). The latter function may block the current process. Once that function returns, __alloc_pages() resets the PF_MEMALLOC flag of current and invokes once more cond_resched().
  10. If the previous step has freed some page frames, the function performs yet another scanning of the memory zones equal to the one performed in step 3. If the memory allocation request cannot be satisfied, the function determines whether it should continue scanning the memory zone: if the __GFP_NORETRY flag is clear and either the memory allocation request spans up to eight page frames, or one of the __GFP_REPEAT and __GFP_NOFAIL flags is set, the function invokes blk_congestion_wait() to put the process asleep for awhile (see Chapter 14), and it jumps back to step 6. Otherwise, the function returns NULL to notify the caller that the memory allocation failed.
  11. If no page frame has been freed in step 9, the kernel is in deep trouble, because free memory is dangerously low and it was not possible to reclaim any page frame. Perhaps the time has come to take a crucial decision. If the kernel control path is allowed to perform the filesystem-dependent operations needed to kill a process (the __GFP_FS flag in gfp_mask is set) and the __GFP_NORETRY flag is clear, it performs the following substeps:
    a. Scans once again the memory zones with a threshold value equal to z->pages_high.
    b. Invokes out_of_memory() to start freeing some memory by killing a victim process (see “The Out of Memory Killer” in Chapter 17).
    c. Jumps back to step 1.

8.2 Memory Area Management

Slabs can be destroyed in two cases:

  • There are too many free objects in the slab cache (see the later section “Releasing a Slab from a Cache”).
  • A timer function invoked periodically determines that there are fully unused slabs that can be released (see Chapter 17).

8.3 Noncontiguous Memory Area Management

14 Block Device Drivers

14.1 Block Devices Handling

14.1.1 Sectors

  • The controllers of the hardware block devices transfer data in chunks of fixed length called "sectors."

14.1.2 Blocks

The Virtual Filesystem, the mapping layer, and the filesystems group the disk data in logical units called "blocks."

While the sector is the basic unit of data transfer for the hardware devices, the block is the basic unit of data transfer for the VFS and, consequently, for the filesystems.

In Linux, the block size must be a power of 2 and cannot be larger than a page frame. Moreover, it must be a multiple of the sector size, because each block must include an integral number of sectors.
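For example, on the 80x86 architecture, where sectors are 512 bytes and page frames are 4 KB, the permitted block sizes are 512, 1,024, 2,048, and 4,096 bytes.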

Each block requires its own block buffer, which is a RAM memory area used by the kernel to store the block’s content.

14.1.3 Segments

As we will see shortly, block device drivers should be able to cope with "segments" of data: each segment is a memory page—or a portion of a memory page—including chunks of data that are physically adjacent on disk.

For each scatter-gather DMA transfer, the block device driver must send to the disk controller:

  • The initial disk sector number and the total number of sectors to be transferred
  • A list of descriptors of memory areas, each of which consists of an address and a length.

The disk controller takes care of the whole data transfer; for instance, in a read operation the controller fetches the data from the adjacent disk sectors and scatters it into the various memory areas.

To make use of scatter-gather DMA operations, block device drivers must handle the data in units called segments. A segment is simply a memory page—or a portion of a memory page—that includes the data of some adjacent disk sectors. Thus, a scatter-gather DMA operation may involve several segments at once.
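A sketch of the kind of memory-area descriptor list handed to the controller (sg_desc is an illustrative stand-in for the kernel's struct scatterlist):

struct sg_desc {
    void   *addr;   /* start of one memory area (a segment, or part of one) */
    size_t  len;    /* number of bytes to transfer from/to this area */
};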

14.2 The Generic Block Layer

The generic block layer is a kernel component that handles the requests for all block devices in the system.

14.3 The I/O Scheduler

To keep the block device driver from being suspended, each I/O operation is processed asynchronously. In particular, block device drivers are interrupt-driven (see the section "Monitoring I/O Operations" in the previous chapter): the generic block layer invokes the I/O scheduler to create a new block device request or to enlarge an already existing one and then terminates. The block device driver, which is activated at a later time, invokes the strategy routine to select a pending request and satisfy it by issuing suitable commands to the disk controller. When the I/O operation terminates, the disk controller raises an interrupt and the corresponding handler invokes the strategy routine again, if necessary, to process another pending request.

Each block device driver maintains its own request queue, which contains the list of pending requests for the device. If the disk controller is handling several disks, there is usually one request queue for each physical block device.

14.3.1 The "Noop" elevator

There is no ordered queue: new requests are always added either at the front or at the tail of the dispatch queue, and the next request to be processed is always the first request in the queue.

14.3.2 The "CFQ" elevator

Requests coming from the same process are always inserted in the same queue.

14.3.3 The "Deadline" elevator

Besides the dispatch queue, the "Deadline" elevator makes use of four queues.

  • Two of them—the sorted queues—include the read and write requests, respectively, ordered according to their initial sector numbers.
  • The other two—the deadline queues— include the same read and write requests sorted according to their "deadlines."

A request deadline is essentially an expire timer that starts ticking when the request is passed to the elevator. By default, the expire time of read requests is 500 milliseconds, while the expire time for write requests is 5 seconds—read requests are privileged over write requests.

14.4 Block Device Drivers

14.5 Opening a Block Device File

17 Page Frame Reclaiming

20 Program Execution

20.1 Executable Files

20.2 Executable Formats

20.3 Execution Domains

20.4 The exec Functions

  • How the dynamic linker works:
    1. sets up a basic execution context for itself, starting from the information stored by the kernel in the User Mode stack between the array of pointers to environment strings and arg_start.
    2. examines the program to be executed to identify which shared libraries must be loaded and which functions in each shared library are effectively requested.
    3. issues several mmap() system calls to create memory regions mapping the pages that will hold the library functions (text and data) actually used by the program.
    4. updates all references to the symbols of the shared library, according to the linear addresses of the library’s memory regions.
    5. terminates its execution by jumping to the main entry point of the program to be executed.