The set of data that must be loaded into the registers before the process resumes its execution on the CPU is called the hardware context. The hardware context is a subset of the process execution context, which includes all information needed for the process execution. In Linux, a part of the hardware context of a process is stored in the process descriptor, while the remaining part is saved in the Kernel Mode stack.
We can thus define a process switch as the activity consisting of saving the hardware context of prev and replacing it with the hardware context of next.
The contents of all registers used by a process in User Mode have already been saved on the Kernel Mode stack before performing process switching (see Chapter 4). This includes the contents of the ss and esp pair that specifies the User Mode stack pointer address.
struct __attribute__ ((__packed__)) vmcb_seg {
    u16 selector;
    u16 attrib;
    u32 limit;
    u64 base;
};

struct __attribute__ ((__packed__)) vmcb_save_area {
    struct vmcb_seg es;
    struct vmcb_seg cs;
    struct vmcb_seg ss;
    struct vmcb_seg ds;
    struct vmcb_seg fs;
    struct vmcb_seg gs;
    struct vmcb_seg gdtr;
    struct vmcb_seg ldtr;
    struct vmcb_seg idtr;
    struct vmcb_seg tr;
    u8 reserved_1[43];
    u8 cpl;
    u8 reserved_2[4];
    u64 efer;
    u8 reserved_3[112];
    u64 cr4;
    u64 cr3;
    u64 cr0;
    u64 dr7;
    u64 dr6;
    u64 rflags;
    u64 rip;
    u8 reserved_4[88];
    u64 rsp;
    u8 reserved_5[24];
    u64 rax;
    u64 star;
    u64 lstar;
    u64 cstar;
    u64 sfmask;
    u64 kernel_gs_base;
    u64 sysenter_cs;
    u64 sysenter_esp;
    u64 sysenter_eip;
    u64 cr2;
    u8 reserved_6[32];
    u64 g_pat;
    u64 dbgctl;
    u64 br_from;
    u64 br_to;
    u64 last_excp_from;
    u64 last_excp_to;
};

struct __attribute__ ((__packed__)) vmcb {
    struct vmcb_control_area control;
    struct vmcb_save_area save;
};
Each process descriptor includes a field called thread of type thread_struct, in which the kernel saves the hardware context whenever the process is being switched out.
This data structure includes fields for most of the CPU registers, except the general-purpose registers such as eax, ebx, etc., which are stored in the Kernel Mode stack.
Essentially, every process switch consists of two steps:
- Switching the Page Global Directory to install a new address space (Chapter 9).
- Switching the Kernel Mode stack and the hardware context, which provides all the information needed by the kernel to execute the new process, including the CPU registers.
Before the process switching, the macro saves in the eax CPU register the content of the variable identified by the first input parameter prev—that is, the prev local variable allocated on the Kernel Mode stack of A. After the process switching, when A has resumed its execution, the macro writes the content of the eax CPU register in the memory location of A identified by the third output parameter last. Because the CPU register doesn’t change across the process switch, this memory location receives the address of C’s descriptor. In the current implementation of schedule(), the last parameter identifies the prev local variable of A, so prev is overwritten with the address of C.
Interrupts are often divided into synchronous and asynchronous interrupts:
- Synchronous interrupts (exceptions) are produced by the CPU control unit while executing instructions; they are called synchronous because the control unit issues them only after terminating the execution of an instruction.
- Asynchronous interrupts (interrupts proper) are generated by other hardware devices at arbitrary times with respect to the CPU clock signals.
There is a key difference between interrupt handling and process switching: the code executed by an interrupt or by an exception handler is not a process. Rather, it is a kernel control path that runs at the expense of the same process that was running when the interrupt occurred. As a kernel control path, the interrupt handler is lighter than a process (it has less context and requires less time to set up or tear down).
Interrupt handling must satisfy the following constraints:
- Interrupts can come at any time, when the kernel may want to finish something else it was trying to do.
- Because interrupts can come anytime, the kernel might be handling one of them while another one (of a different type) occurs. When the last kernel control path terminates, the kernel must be able to resume execution of the interrupted process or switch to another process if the interrupt signal has caused a rescheduling activity.
- Although the kernel may accept a new interrupt signal while handling a previous one, some critical regions exist inside the kernel code where interrupts must be disabled.
The Intel documentation classifies interrupts and exceptions as follows:
- Interrupts
- Maskable interrupts
- Non-maskable interrupts
- Exceptions
- Processor-detected exceptions
- Faults
- Traps
- Aborts
- Programmed exceptions
Programmable Interrupt Controller performs the following actions:
- Monitors the IRQ lines, checking for raised signals. If two or more IRQ lines are raised, selects the one having the lowest pin number.
- If a raised signal occurs on an IRQ line:
- Converts the raised signal received into a corresponding vector.
- Stores the vector in an Interrupt Controller I/O port, thus allowing the CPU to read it via the data bus.
- Sends a raised signal to the processor INTR pin—that is, issues an interrupt.
- Waits until the CPU acknowledges the interrupt signal by writing into one of the Programmable Interrupt Controller (PIC) I/O ports; when this occurs, clears the INTR line.
- Goes back to step 1.
Intel’s default vector associated with IRQn is n+32. As mentioned before, the mapping between IRQs and vectors can be modified by issuing suitable I/O instructions to the Interrupt Controller ports.
Disabled interrupts are not lost; the PIC sends them to the CPU as soon as they are enabled again.
Each local APIC has 32-bit registers, an internal clock, a local timer device, and two additional IRQ lines, LINT0 and LINT1, reserved for local APIC interrupts. All local APICs are connected to an external I/O APIC, giving rise to a multi-APIC system.
The I/O APIC consists of a set of 24 IRQ lines, a 24-entry Interrupt Redirection Table, programmable registers, and a message unit for sending and receiving APIC messages over the APIC bus.
Each entry in the Redirection Table can be individually programmed to indicate the interrupt vector and priority, the destination processor, and how the processor is selected. The information in the Redirection Table is used to translate each external IRQ signal into a message to one or more local APIC units via the APIC bus.
Interrupt requests coming from external hardware devices can be distributed among the available CPUs in two ways:
- Static distribution
- The IRQ signal is delivered to the local APICs listed in the corresponding Redirection Table entry.
- Dynamic distribution
- The IRQ signal is delivered to the local APIC of the processor that is executing the process with the lowest priority. The task priority register (TPR) of each local APIC is used to compute the priority of the currently running process.
Interprocessor interrupts When a CPU wishes to send an interrupt to another CPU, it stores the interrupt vector and the identifier of the target’s local APIC in the Interrupt Command Register (ICR) of its own local APIC. A message is then sent via the APIC bus to the target’s local APIC, which therefore issues a corresponding interrupt to its own CPU.
- Task gate: Includes the TSS selector of the process that must replace the current one when an interrupt signal occurs.
- Interrupt gate: Includes the Segment Selector and the offset inside the segment of an interrupt or exception handler. While transferring control to the proper segment, the processor clears the IF flag, thus disabling further maskable interrupts.
- Trap gate: Similar to an interrupt gate, except that while transferring control to the proper segment, the processor does not modify the IF flag.
After executing an instruction, the cs and eip pair of registers contain the logical address of the next instruction to be executed. Before dealing with that instruction, the control unit checks whether an interrupt or an exception occurred while the control unit executed the previous instruction. If one occurred, the control unit does the following:
- Determines the vector i (0 ≤ i ≤ 255) associated with the interrupt or the exception.
- Reads the ith entry of the IDT referred to by the idtr register (we assume in the following description that the entry contains an interrupt or a trap gate).
- Gets the base address of the GDT from the gdtr register and looks in the GDT to read the Segment Descriptor identified by the selector in the IDT entry. This descriptor specifies the base address of the segment that includes the interrupt or exception handler.
- Makes sure the interrupt was issued by an authorized source.
- First, it compares the Current Privilege Level (CPL), which is stored in the two least significant bits of the cs register, with the Descriptor Privilege Level (DPL) of the Segment Descriptor included in the GDT. Raises a "General protection" exception if the CPL is lower than the DPL, because the interrupt handler cannot have a lower privilege than the program that caused the interrupt.
- For programmed exceptions, makes a further security check: compares the CPL with the DPL of the gate descriptor included in the IDT and raises a "General protection" exception if the DPL is lower than the CPL. This last check makes it possible to prevent access by user applications to specific trap or interrupt gates.
- Checks whether a change of privilege level is taking place—that is, if CPL is different from the selected Segment Descriptor’s DPL. If so, the control unit must start using the stack that is associated with the new privilege level. It does this by performing the following steps:
- Reads the tr register to access the TSS segment of the running process.
- Loads the ss and esp registers with the proper values for the stack segment and stack pointer associated with the new privilege level. These values are found in the TSS (see the section "Task State Segment" in Chapter 3).
- In the new stack, it saves the previous values of ss and esp, which define the logical address of the stack associated with the old privilege level.
- If a fault has occurred, it loads cs and eip with the logical address of the instruction that caused the exception so that it can be executed again.
- Saves the contents of eflags, cs, and eip in the stack.
- If the exception carries a hardware error code, it saves it on the stack.
- Loads cs and eip, respectively, with the Segment Selector and the Offset fields of the Gate Descriptor stored in the ith entry of the IDT. These values define the logical address of the first instruction of the interrupt or exception handler.
After the interrupt or exception is processed, the corresponding handler must relinquish control to the interrupted process by issuing the iret instruction, which forces the control unit to:
- Load the cs, eip, and eflags registers with the values saved on the stack. If a hardware error code has been pushed in the stack on top of the eip contents, it must be popped before executing iret.
- Check whether the CPL of the handler is equal to the value contained in the two least significant bits of cs (this means the interrupted process was running at the same privilege level as the handler). If so, iret concludes execution; otherwise, go to the next step.
- Load the ss and esp registers from the stack and return to the stack associated with the old privilege level.
- Examine the contents of the ds, es, fs, and gs segment registers; if any of them contains a selector that refers to a Segment Descriptor whose DPL value is lower than CPL, clear the corresponding segment register. The control unit does this to forbid User Mode programs that run with a CPL equal to 3 from using segment registers previously used by kernel routines (with a DPL equal to 0). If these registers were not cleared, malicious User Mode programs could exploit them in order to access the kernel address space.
The last instructions of a kernel control path that is taking care of an interrupt do not always put the current process back into User Mode: if the level of nesting is greater than 1, these instructions will put into execution the kernel control path that was interrupted last, and the CPU will continue to run in Kernel Mode.
The price to pay for allowing nested kernel control paths is that an interrupt handler must never block, that is, no process switch can take place until an interrupt handler is running. In fact, all the data needed to resume a nested kernel control path is stored in the Kernel Mode stack, which is tightly bound to the current process.
In contrast to exceptions, interrupts issued by I/O devices do not refer to data structures specific to the current process, although the kernel control paths that handle them run on behalf of that process.
An interrupt handler may preempt both other interrupt handlers and exception handlers. Conversely, an exception handler never preempts an interrupt handler.
Linux interleaves kernel control paths for two major reasons:
- To improve the throughput of programmable interrupt controllers and device controllers.
- To implement an interrupt model without priority levels.
Before the kernel enables the interrupts, it must load the initial address of the IDT table into the idtr register and initialize all the entries of that table.
- Interrupt gate: All Linux interrupt handlers are activated by means of interrupt gates, and all are restricted to Kernel Mode.
- System gate: The three Linux exception handlers associated with the vectors 4, 5, and 128 are activated by means of system gates, so the three assembly language instructions into, bound, and int $0x80 can be issued in User Mode.
- System interrupt gate: The exception handler associated with the vector 3 is activated by means of a system interrupt gate, so the assembly language instruction int3 can be issued in User Mode.
- Trap gate: Most Linux exception handlers are activated by means of trap gates.
- Task gate: The Linux handler for the "Double fault" exception is activated by means of a task gate.
The IDT is initialized and used by the BIOS routines while the computer still operates in Real Mode. Once Linux takes over, however, the IDT is moved to another area of RAM and initialized a second time, because Linux does not use any BIOS routine.
Exception handlers have a standard structure consisting of three steps:
- Save the contents of most registers in the Kernel Mode stack (this part is coded in assembly language).
- Handle the exception by means of a high-level C function.
- Exit from the handler by means of the ret_from_exception( ) function.
ENTRY(async_page_fault)
    ASM_CLAC
    pushl $do_async_page_fault
    jmp common_exception
END(async_page_fault)
Interrupt handler flexibility is achieved in two distinct ways, as discussed in the following list.
- IRQ sharing: each ISR is executed to verify whether its device needs attention; if so, the ISR performs all the operations that need to be executed when the device raises an interrupt
- IRQ dynamic allocation
Long noncritical operations should be deferred, because while an interrupt handler is running, the signals on the corresponding IRQ line are temporarily ignored. Therefore, interrupt handlers cannot perform any blocking procedure such as an I/O disk operation.
Linux divides the actions to be performed following an interrupt into three classes:
- Critical: Critical actions are executed within the interrupt handler immediately, with maskable interrupts disabled.
- Noncritical: These actions can also finish quickly, so they are executed by the interrupt handler immediately, with the interrupts enabled.
- Noncritical deferrable: These actions are performed by means of separate functions that are discussed in the later section "Softirqs and Tasklets."
All I/O interrupt handlers perform the same four basic actions:
- Save the IRQ value and the register’s contents on the Kernel Mode stack.
- Send an acknowledgment to the PIC that is servicing the IRQ line, thus allowing it to issue further interrupts.
- Execute the interrupt service routines (ISRs) associated with all the devices that share the IRQ.
- Terminate by jumping to the ret_from_intr( ) address.
Vector Range | Use |
---|---|
0–19 (0x0–0x13) | Nonmaskable interrupts and exceptions |
20–31 (0x14–0x1f) | Intel-reserved |
32–127 (0x20–0x7f) | External interrupts (IRQs) |
128 (0x80) | Programmed exception for system calls |
129–238 (0x81–0xee) | External interrupts (IRQs) |
239 (0xef) | Local APIC timer interrupt |
240 (0xf0) | Local APIC thermal interrupt |
241–250 (0xf1–0xfa) | Reserved by Linux for future use |
251–253 (0xfb–0xfd) | Interprocessor interrupts |
254 (0xfe) | Local APIC error interrupt |
255 (0xff) | Local APIC spurious interrupt |
During system bootstrap, the booting CPU executes the setup_IO_APIC_irqs() function to initialize the I/O APIC chip. The 24 entries of the Interrupt Redirection Table of the chip are filled, so that all IRQ signals from the I/O hardware devices can be routed to each CPU in the system according to the "lowest priority" scheme (see the earlier section "IRQs and Interrupts").
During system bootstrap, moreover, all CPUs execute the setup_local_APIC() function, which takes care of initializing the local APICs.
When a hardware device raises an IRQ signal, the multi-APIC system selects one of the CPUs and delivers the signal to the corresponding local APIC, which in turn interrupts its CPU. No other CPUs are notified of the event.
common_interrupt:
    SAVE_ALL
    movl %esp,%eax
    call do_IRQ
    jmp ret_from_intr

SAVE_ALL:
    cld
    push %es
    push %ds
    pushl %eax
    pushl %ebp
    pushl %edi
    pushl %esi
    pushl %edx
    pushl %ecx
    pushl %ebx
    movl $__USER_DS,%edx
    movl %edx,%ds
    movl %edx,%es
SAVE_ALL saves all the CPU registers that may be used by the interrupt handler on the stack, except for eflags, cs, eip, ss, and esp, which are already saved automatically by the control unit (see the earlier section "Hardware Handling of Interrupts and Exceptions").
As noted in section "Interrupt vectors," a few vectors are reserved for specific devices, while the remaining ones are dynamically handled.
An interprocessor interrupt (IPI) is delivered not through an IRQ line, but directly as a message on the bus that connects the local APICs of all CPUs (either a dedicated bus in older motherboards, or the system bus in Pentium 4–based motherboards).
Linux makes use of three kinds of interprocessor interrupts (see also Table 4-2):
- CALL_FUNCTION_VECTOR (vector 0xfb): Sent to all CPUs but the sender, forcing those CPUs to run a function passed by the sender. The corresponding handler is named call_function_interrupt( ).
- RESCHEDULE_VECTOR (vector 0xfc): When a CPU receives this type of interrupt, the corresponding handler, named reschedule_interrupt( ), limits itself to acknowledging the interrupt; rescheduling is done automatically when returning from the interrupt.
- INVALIDATE_TLB_VECTOR (vector 0xfd): Sent to all CPUs but the sender, forcing them to invalidate their Translation Lookaside Buffers. The corresponding handler is named invalidate_interrupt( ).
Softirqs are statically allocated (i.e., defined at compile time), while tasklets can also be allocated and initialized at runtime (for instance, when loading a kernel module).
Softirqs can run concurrently on several CPUs, even if they are of the same type. Thus, softirqs are reentrant functions and must explicitly protect their data structures with spin locks. Tasklets do not have to worry about this, because their execution is controlled more strictly by the kernel.
Tasklets of the same type are always serialized: in other words, the same type of tasklet cannot be executed by two CPUs at the same time. However, tasklets of different types can be executed concurrently on several CPUs. Serializing the tasklet simplifies the life of device driver developers, because the tasklet function needs not be reentrant.
Checks for active (pending) softirqs should be performed periodically:
- When the kernel invokes the local_bh_enable() function to enable softirqs on the local CPU
- When the do_IRQ() function finishes handling an I/O interrupt and invokes the irq_exit() macro
- If the system uses an I/O APIC, when the smp_apic_timer_interrupt() function finishes handling a local timer interrupt (see the section "Timekeeping Architecture in Multiprocessor Systems" in Chapter 6)
- In multiprocessor systems, when a CPU finishes handling a function triggered by a CALL_FUNCTION_VECTOR interprocessor interrupt
- When one of the special ksoftirqd/n kernel threads is awakened (see later)
The differences between deferrable functions and work queues:
- The main difference is that deferrable functions run in interrupt context while functions in work queues run in process context. Running in process context is the only way to execute functions that can block.
- No process switch can take place in interrupt context.
- Neither deferrable functions nor functions in a work queue can access the User Mode address space of a process.
Before resuming execution of some program, several issues must be considered:
- Number of kernel control paths being concurrently executed. If there is just one, the CPU must switch back to User Mode.
- Pending process switch requests. If there is any request, the kernel must perform process scheduling; otherwise, control is returned to the current process.
- Pending signals. If a signal is sent to the current process, it must be handled.
- Single-step mode. If a debugger is tracing the execution of the current process, single-step mode must be restored before switching back to User Mode.
- Virtual-8086 mode. If the CPU is in virtual-8086 mode, the current process is executing a legacy Real Mode program, thus it must be handled in a special way.
A few flags are used to keep track of pending process switch requests, of pending signals, and of single-step execution; they are stored in the flags field of the thread_info descriptor:
Flag name | Description |
---|---|
TIF_SYSCALL_TRACE | System calls are being traced |
TIF_NOTIFY_RESUME | Not used in the 80x86 platform |
TIF_SIGPENDING | The process has pending signals |
TIF_NEED_RESCHED | Scheduling must be performed |
TIF_SINGLESTEP | Restore single step execution on return to User Mode |
TIF_IRET | Force return from system call via iret rather than sysexit |
TIF_SYSCALL_AUDIT | System calls are being audited |
TIF_POLLING_NRFLAG | The idle process is polling the TIF_NEED_RESCHED flag |
TIF_MEMDIE | The process is being destroyed to reclaim memory (see the section "The Out of Memory Killer" in Chapter 17) |
There are two main kinds of timing measurement that must be performed by the Linux kernel:
- Keeping the current time and date so they can be returned to user programs through the time(), ftime(), and gettimeofday() APIs and used by the kernel itself as timestamps for files and network packets
- Maintaining timers.
The clock circuits are used both to keep track of the current time of day and to make precise time measurements.
The timer circuits are programmed by the kernel, so that they issue interrupts at a fixed, predefined frequency.
Linux uses the RTC only to derive the time and date;
It allows processes to program the RTC by acting on the /dev/rtc device file;
The kernel accesses the RTC through the 0x70 and 0x71 I/O ports.
Linux may take advantage of the Time Stamp Counter (TSC) register to get much more accurate time measurements than those delivered by the Programmable Interval Timer.
The role of a PIT is similar to the alarm clock of a microwave oven: it makes the user aware that the cooking time interval has elapsed.
It issues a special interrupt called timer interrupt, which notifies the kernel that one more time interval has elapsed and goes on issuing interrupts forever at some fixed frequency established by the kernel.
The PIT is initialized by setup_pit_timer() as follows:
spin_lock_irqsave(&i8253_lock, flags);
outb_p(0x34,0x43);           /* binary, mode 2, LSB/MSB, ch 0 */
udelay(10);
outb_p(LATCH & 0xff, 0x40);  /* LSB */
udelay(10);
outb(LATCH >> 8, 0x40);      /* MSB */
spin_unlock_irqrestore(&i8253_lock, flags);
The CPU local timer is a device similar to the Programmable Interval Timer just described that can issue one-shot or periodic interrupts. There are, however, a few differences:
- The APIC’s timer counter is 32 bits long, while the PIT’s timer counter is 16 bits long; therefore, the local timer can be programmed to issue interrupts at very low frequencies.
- The local APIC timer sends an interrupt only to its processor, while the PIT raises a global interrupt, which may be handled by any CPU in the system.
- The APIC’s timer is based on the bus clock signal (or the APIC bus signal, in older machines). It can be programmed in such a way to decrease the timer counter every 1, 2, 4, 8, 16, 32, 64, or 128 bus clock signals. Conversely, the PIT, which makes use of its own clock signals, can be programmed in a more flexible way.
The HPET provides a number of hardware timers that can be exploited by the kernel.
Each HPET has one fixed-rate counter (at 10+ MHz, hence "High Precision") and up to 32 comparators. Normally three or more comparators are provided, each of which can generate one-shot interrupts and at least one of which has additional hardware to support periodic interrupts. The comparators are also called "timers", which can be misleading since usually timers are independent of each other … these share a counter, complicating resets. (From the Linux Kernel Documentation)
Any counter is associated with at most 32 timers, each of which is composed of a comparator and a match register.
The comparator is a circuit that checks the value in the counter against the value in the match register, and raises a hardware interrupt if a match is found.
Some of the timers can be enabled to generate a periodic interrupt.
The ACPI Power Management Timer is preferable to the TSC if the operating system or the BIOS may dynamically lower the frequency or voltage of the CPU to save battery power. When this happens, the frequency of the TSC changes—thus causing time warps and other unpleasant effects—while the frequency of the ACPI PMT does not.
The kernel periodically:
- Updates the time elapsed since system startup.
- Updates the time and date.
- Determines, for every CPU, how long the current process has been running, and preempts it if it has exceeded the time allocated to it.
- Updates resource usage statistics.
- Checks whether the interval of time associated with each software timer has elapsed.
Linux’s timekeeping architecture depends also on the availability of the Time Stamp Counter (TSC), of the ACPI Power Management Timer (PMT), and of the High Precision Event Timer (HPET).
The kernel uses two basic timekeeping functions:
- keep the current time up-to-date
- count the number of nanoseconds that have elapsed within the current second.
The jiffies variable is a counter that stores the number of elapsed ticks since the system was started.
The kernel handles cleanly the overflow of jiffies thanks to the time_after, time_after_eq, time_before, and time_before_eq macros: they yield the correct value even if a wraparound occurred.
u64 get_jiffies_64(void)
{
    unsigned long seq;
    u64 ret;

    do {
        seq = read_seqbegin(&jiffies_lock);
        ret = jiffies_64;
    } while (read_seqretry(&jiffies_lock, seq));
    return ret;
}
struct timespec {
    __kernel_time_t tv_sec;  /* seconds */
    long tv_nsec;            /* nanoseconds */
};
time_init():
- Initializes the xtime variable by means of get_cmos_time().
- Initializes the wall_to_monotonic variable.
- If the kernel supports HPET, invokes the hpet_enable() function to determine whether the ACPI firmware has probed the chip and mapped its registers in the memory address space.
- Invokes clocksource_select() to select the best timer source available in the system, and sets the cur_timer variable to the address of the corresponding timer object.
- Invokes setup_irq(0,&irq0) to set up the interrupt gate corresponding to IRQ0.
Multiprocessor systems can rely on two different sources of timer interrupts:
- those raised by the PIT or the HPET
- those raised by the CPU local timers.
- Checks how long the current process has been running.
Linux considers two types of timers called dynamic timers and interval timers. The first type is used by the kernel, while interval timers may be created by processes in User Mode.
Dynamic timers may be dynamically created and destroyed. No limit is placed on the number of currently active dynamic timers.
A rule of thumb is to stop the timer before releasing the resource:

del_timer(&t);
X_Release_Resources( );
In multiprocessor systems, however, this code is not safe because the timer function might already be running on another CPU when del_timer() is invoked.
Despite the clever data structures, handling software timers is a time-consuming activity that should not be performed by the timer interrupt handler. In Linux 2.6 this activity is carried on by a deferrable function, namely the TIMER_SOFTIRQ softirq.
static void udelay(int loops)
{
    while (loops--)
        io_delay();  /* approximately 1 us */
}

static inline void io_delay(void)
{
    const u16 DELAY_PORT = 0x80;
    asm volatile("outb %%al,%0" : : "dN" (DELAY_PORT));
}
Objectives: fast process response time, good throughput for background jobs, avoidance of process starvation, reconciliation of the needs of low- and high-priority processes, and so on.
Linux scheduling is based on the time sharing technique: several processes run in "time multiplexing" because the CPU time is divided into slices, one for each runnable process.
The scheduling policy is also based on ranking processes according to their priority.
When speaking about scheduling, processes are traditionally classified as I/O-bound or CPU-bound.
An alternative classification distinguishes three classes of processes: Interactive processes, Batch processes, Real-time processes.
System call | Description |
---|---|
nice( ) | Change the static priority of a conventional process |
getpriority( ) | Get the maximum static priority of a group of conventional processes |
setpriority( ) | Set the static priority of a group of conventional processes |
sched_getscheduler( ) | Get the scheduling policy of a process |
sched_setscheduler( ) | Set the scheduling policy and the real-time priority of a process |
sched_getparam( ) | Get the real-time priority of a process |
sched_setparam( ) | Set the real-time priority of a process |
sched_yield( ) | Relinquish the processor voluntarily without blocking |
sched_get_priority_min( ) | Get the minimum real-time priority value for a policy |
sched_get_priority_max( ) | Get the maximum real-time priority value for a policy |
sched_rr_get_interval( ) | Get the time quantum value for the Round Robin policy |
sched_setaffinity( ) | Set the CPU affinity mask of a process |
sched_getaffinity( ) | Get the CPU affinity mask of a process |
When a process enters the TASK_RUNNING state, the kernel checks whether its dynamic priority is greater than the priority of the currently running process. If it is, the execution of current is interrupted and the scheduler is invoked to select another process to run.
Be aware that a preempted process is not suspended, because it remains in the TASK_ RUNNING state; it simply no longer uses the CPU.
It is often believed that a long quantum duration degrades the response time of interactive applications. This is usually false. As described in the section "Process Preemption" earlier in this chapter, interactive processes have a relatively high priority, so they quickly preempt the batch processes, no matter how long the quantum duration is.
Scheduling classes: SCHED_FIFO, SCHED_RR, SCHED_NORMAL
- Base time quantum

base time quantum (in milliseconds) =
- (140 − static priority) × 20 if static priority < 120
- (140 − static priority) × 5 if static priority ≥ 120
Description | Static priority | Nice value | Base time quantum | Interactive delta | Sleep time threshold |
---|---|---|---|---|---|
Highest static priority | 100 | -20 | 800 ms | -3 | 299 ms |
High static priority | 110 | -10 | 600 ms | -1 | 499 ms |
Default static priority | 120 | 0 | 100 ms | +2 | 799 ms |
Low static priority | 130 | +10 | 50 ms | +4 | 999 ms |
Lowest static priority | 139 | +19 | 5 ms | +6 | 1199 ms |

- Dynamic priority and average sleep time
dynamic priority = max(100, min(static priority − bonus + 5, 139))
The bonus is a value ranging from 0 to 10; a value less than 5 represents a penalty that lowers the dynamic priority, while a value greater than 5 is a premium that raises it. The bonus is related to the average sleep time of the process:

Average sleep time | Bonus | Granularity
---|---|---
≥ 0 but < 100 ms | 0 | 5120
≥ 100 ms but < 200 ms | 1 | 2560
≥ 200 ms but < 300 ms | 2 | 1280
≥ 300 ms but < 400 ms | 3 | 640
≥ 400 ms but < 500 ms | 4 | 320
≥ 500 ms but < 600 ms | 5 | 160
≥ 600 ms but < 700 ms | 6 | 80
≥ 700 ms but < 800 ms | 7 | 40
≥ 800 ms but < 900 ms | 8 | 20
≥ 900 ms but < 1000 ms | 9 | 10
≥ 1 second | 10 | 10

A process is considered interactive when: [dynamic priority ≤ 3 × static priority / 4 + 28], or equivalently: [bonus − 5 ≥ static priority / 4 − 28].
The expression static priority / 4 − 28 is called the interactive delta.
Active and expired processes
Scheduling of Real-Time Processes
Direct invocation
- Inserts current in the proper wait queue.
- Changes the state of current either to TASK_INTERRUPTIBLE or to TASK_UNINTERRUPTIBLE.
- Invokes schedule( ).
- Checks whether the resource is available; if not, goes to step 2.
- Once the resource is available, removes current from the wait queue.
Lazy invocation
The scheduler can also be invoked in a lazy way by setting the TIF_NEED_RESCHED flag of current to 1.
- When current has used up its quantum of CPU time; this is done by the scheduler_tick() function.
- When a process is woken up and its priority is higher than that of the current process; this task is performed by the try_to_wake_up() function.
- When a sched_setscheduler() system call is issued (see the section "System Calls Related to Scheduling" later in this chapter).
Types of multiprocessor machines:
- Classic multiprocessor architecture
- Hyper-threading
- Allows the processor to exploit the machine cycles to execute another thread while the current thread is stalled for a memory access. A hyper-threaded physical CPU is seen by Linux as several different logical CPUs.
- NUMA
These basic kinds of multiprocessor systems are often combined. For instance, a motherboard that includes two different hyper-threaded CPUs is seen by the kernel as four logical CPUs.
Linux sports a sophisticated runqueue balancing algorithm based on the notion of "scheduling domains."
A scheduling domain is a set of CPUs whose workloads should be kept balanced by the kernel.
Workload balancing is always done between groups of a scheduling domain. In other words, a process is moved from one CPU to another only if the total workload of some group in some scheduling domain is significantly lower than the workload of another group in the same scheduling domain.
TODO
- Linux adopts the smaller 4 KB page frame size as the standard memory allocation unit. This makes things simpler for two reasons:
- The Page Fault exceptions issued by the paging circuitry are easily interpreted.
- Although both 4 KB and 4 MB are multiples of all disk block sizes, transfers of data between main memory and disks are in most cases more efficient when the smaller size is used.
The Linux kernel makes use of NUMA even for some peculiar uniprocessor systems that have huge "holes" in the physical address space.
The kernel handles these architectures by assigning the contiguous subranges of valid physical addresses to different memory nodes.
The Linux kernel must deal with two hardware constraints of the 80 × 86 architecture:
- The Direct Memory Access (DMA) processors for old ISA buses have a strong limitation: they are able to address only the first 16 MB of RAM.
- In modern 32-bit computers with lots of RAM, the CPU cannot directly access all physical memory because the linear address space is too small.
Zone | Description |
---|---|
ZONE_DMA | Contains page frames of memory below 16 MB |
ZONE_NORMAL | Contains page frames of memory at and above 16 MB and below 896 MB |
ZONE_HIGHMEM | Contains page frames of memory at and above 896 MB |
The ZONE_HIGHMEM zone includes page frames that cannot be directly accessed by the kernel through the linear mapping in the fourth gigabyte of the linear address space.
All free page frames are grouped into 11 lists of blocks that contain groups of 1, 2, 4, 8, 16, 32, 64, 128, 256, 512, and 1024 contiguous page frames, respectively. The largest request of 1024 page frames corresponds to a chunk of 4 MB of contiguous RAM. The physical address of the first page frame of a block is a multiple of the group size.
The reverse operation, releasing blocks of page frames, gives rise to the name of this algorithm. The kernel attempts to merge pairs of free buddy blocks of size b together into a single block of size 2b. Two blocks are considered buddies if:
- Both blocks have the same size, say b.
- They are located in contiguous physical addresses.
- The physical address of the first page frame of the first block is a multiple of 2 × b × 2^12.
The __alloc_pages() function essentially executes the following steps:
1. Performs a first scanning of the memory zones (see the block of code shown earlier). In this first scan, the min threshold value is set to z->pages_low, where z points to the zone descriptor being analyzed (the can_try_harder and gfp_high parameters are set to zero).
2. If the function did not terminate in the previous step, there is not much free memory left: the function awakens the kswapd kernel threads to start reclaiming page frames asynchronously (see Chapter 17).
3. Performs a second scanning of the memory zones, passing as base threshold the value z->pages_min. As explained previously, the actual threshold is determined also by the can_try_harder and gfp_high flags. This step is nearly identical to step 1, except that the function is using a lower threshold.
4. If the function did not terminate in the previous step, the system is definitely low on memory. If the kernel control path that issued the memory allocation request is not an interrupt handler or a deferrable function and it is trying to reclaim page frames (either the PF_MEMALLOC flag or the PF_MEMDIE flag of current is set), the function then performs a third scanning of the memory zones, trying to allocate the page frames ignoring the low-on-memory thresholds, that is, without invoking zone_watermark_ok(). This is the only case where the kernel control path is allowed to deplete the low-on-memory reserve of pages specified by the lowmem_reserve field of the zone descriptor. In fact, in this case the kernel control path that issued the memory request is ultimately trying to free page frames, thus it should get what it has requested, if at all possible. If no memory zone includes enough page frames, the function returns NULL to notify the caller of the failure.
5. Here, the invoking kernel control path is not trying to reclaim memory. If the __GFP_WAIT flag of gfp_mask is not set, the function returns NULL to notify the kernel control path of the memory allocation failure: in this case, there is no way to satisfy the request without blocking the current process.
6. Here the current process can be blocked: invokes cond_resched() to check whether some other process needs the CPU.
7. Sets the PF_MEMALLOC flag of current, to denote the fact that the process is ready to perform memory reclaiming.
8. Stores in current->reclaim_state a pointer to a reclaim_state structure. This structure includes just one field, reclaimed_slab, initialized to zero (we'll see how this field is used in the section "Interfacing the Slab Allocator with the Zoned Page Frame Allocator" later in this chapter).
9. Invokes try_to_free_pages() to look for some page frames to be reclaimed (see the section "Low On Memory Reclaiming" in Chapter 17). The latter function may block the current process. Once that function returns, __alloc_pages() resets the PF_MEMALLOC flag of current and invokes once more cond_resched().
10. If the previous step has freed some page frames, the function performs yet another scanning of the memory zones equal to the one performed in step 3. If the memory allocation request cannot be satisfied, the function determines whether it should continue scanning the memory zone: if the __GFP_NORETRY flag is clear and either the memory allocation request spans up to eight page frames, or one of the __GFP_REPEAT and __GFP_NOFAIL flags is set, the function invokes blk_congestion_wait() to put the process asleep for a while (see Chapter 14), and it jumps back to step 6. Otherwise, the function returns NULL to notify the caller that the memory allocation failed.
11. If no page frame has been freed in step 9, the kernel is in deep trouble, because free memory is dangerously low and it was not possible to reclaim any page frame. Perhaps the time has come to take a crucial decision. If the kernel control path is allowed to perform the filesystem-dependent operations needed to kill a process (the __GFP_FS flag in gfp_mask is set) and the __GFP_NORETRY flag is clear, it performs the following substeps:
   a. Scans once again the memory zones with a threshold value equal to z->pages_high.
   b. Invokes out_of_memory() to start freeing some memory by killing a victim process (see "The Out of Memory Killer" in Chapter 17).
   c. Jumps back to step 1.
Slabs can be destroyed in two cases:
- There are too many free objects in the slab cache (see the later section "Releasing a Slab from a Cache").
- A timer function invoked periodically determines that there are fully unused slabs that can be released (see Chapter 17).
The controllers of the hardware block devices transfer data in chunks of fixed length called "sectors."
The Virtual Filesystem, the mapping layer, and the filesystems group the disk data in logical units called "blocks."
While the sector is the basic unit of data transfer for the hardware devices, the block is the basic unit of data transfer for the VFS and, consequently, for the filesystems.
In Linux, the block size must be a power of 2 and cannot be larger than a page frame. Moreover, it must be a multiple of the sector size, because each block must contain an integral number of sectors.
Each block requires its own block buffer, which is a RAM memory area used by the kernel to store the block’s content.
As we will see shortly, block device drivers should be able to cope with "segments" of data: each segment is a memory page—or a portion of a memory page—including chunks of data that are physically adjacent on disk.
For each scatter-gather DMA transfer, the block device driver must send to the disk controller:
- The initial disk sector number and the total number of sectors to be transferred
- A list of descriptors of memory areas, each of which consists of an address and a length.
The disk controller takes care of the whole data transfer; for instance, in a read operation the controller fetches the data from the adjacent disk sectors and scatters it into the various memory areas.
To make use of scatter-gather DMA operations, block device drivers must handle the data in units called segments. A segment is simply a memory page—or a portion of a memory page—that includes the data of some adjacent disk sectors. Thus, a scatter-gather DMA operation may involve several segments at once.
The generic block layer is a kernel component that handles the requests for all block devices in the system.
To keep the block device driver from being suspended, each I/O operation is processed asynchronously. In particular, block device drivers are interrupt-driven (see the section "Monitoring I/O Operations" in the previous chapter): the generic block layer invokes the I/O scheduler to create a new block device request or to enlarge an already existing one and then terminates. The block device driver, which is activated at a later time, invokes the strategy routine to select a pending request and satisfy it by issuing suitable commands to the disk controller. When the I/O operation terminates, the disk controller raises an interrupt and the corresponding handler invokes the strategy routine again, if necessary, to process another pending request.
Each block device driver maintains its own request queue, which contains the list of pending requests for the device. If the disk controller is handling several disks, there is usually one request queue for each physical block device.
In the "Noop" elevator there is no ordered queue: new requests are always added either at the front or at the tail of the dispatch queue, and the next request to be processed is always the first request in the queue.
In the "CFQ" (Complete Fairness Queueing) elevator, requests coming from the same process are always inserted in the same queue.
Besides the dispatch queue, the "Deadline" elevator makes use of four queues.
- Two of them—the sorted queues—include the read and write requests, respectively, ordered according to their initial sector numbers.
- The other two—the deadline queues— include the same read and write requests sorted according to their "deadlines."
A request deadline is essentially an expire timer that starts ticking when the request is passed to the elevator. By default, the expire time of read requests is 500 milliseconds, while the expire time for write requests is 5 seconds—read requests are privileged over write requests.
- How the dynamic linker works:
- sets up a basic execution context for itself, starting from the information stored by the kernel in the User Mode stack between the array of pointers to environment strings and arg_start.
- examines the program to be executed to identify which shared libraries must be loaded and which functions in each shared library are effectively requested.
- issues several mmap() system calls to create memory regions mapping the pages that will hold the library functions (text and data) actually used by the program.
- updates all references to the symbols of the shared library, according to the linear addresses of the library's memory regions.
- terminates its execution by jumping to the main entry point of the program to be executed.