Skip to content

Debugging and troubleshooting

Roman edited this page Jan 6, 2025 · 30 revisions

If you're developing or debugging Tempesta's code (thanks for that by the way!), then you need to make several steps to simplify the process or even save your VM in a cloud. Tempesta FW is a Linux kernel extension, so the below are typical steps helping in Linux kernel development and debugging.

Automatic reboot to safe kernel

Instructions at the below are for Ubuntu 24.

If a kernel crash occurs, then the system may hang, so it has a sense to setup automatic reboot. Normally your Linux should reboot automatically on kernel panic. You can check it by

$ cat /proc/sys/kernel/panic
1

1 means 1 second before reboot on system panic. You can access and set the setting by sysctl kernel.panic. Next, you can emulate the panic by

$ echo c > /proc/sysrq-trigger

and see that the system reboots in 1 second. One more important setting is sysctl kernel.panic_on_oops (you can find it also in /proc/sys/kernel/panic_on_oops). Usually it's set to 1, i.e. reboot on any kernel fault ("Oops") occurred. Setting this isn't necessary, but you may prefer to use it.

Next, you need to reboot to testing kernel but make the system reboot automatically to safe kernel if the first one occasionally crashes. You can do this using

GRUB_DEFAULT=saved
GRUB_CMDLINE_LINUX_DEFAULT="panic=1"

in /etc/default/grub. The settings allow you to set a kernel as safe, i.e. booted by default, and add the kernel parameter to reboot automatically in 1 second after panic. The kernel parameter automatically sets kernel.panic sysctl. To apply the changes run

## update-grub

Now let's determine the menu and submenu IDs for grub:

$ grep 'submenu' /boot/grub/grub.cfg
submenu 'Advanced options for Ubuntu' $menuentry_id_option 'gnulinux-advanced-d466b9e4-cf98-4174-a7b8-ee1a9d100741' {

$ grep 'menuentry\>' /boot/grub/grub.cfg
menuentry 'Ubuntu' --class ubuntu --class gnu-linux --class gnu --class os $menuentry_id_op tion 'gnulinux-simple-d466b9e4-cf98-4174-a7b8-ee1a9d100741' {
	menuentry 'Ubuntu, with Linux 5.15.0-79-generic' --class ubuntu --class gnu-linux --class gnu --class os $menuentry_id_option 'gnulinux-5.15.0-79-generic-advanced-d466b9e4-cf98-4174-a7b8-ee1a9d100741' {
	menuentry 'Ubuntu, with Linux 5.15.0-79-generic (recovery mode)' --class ubuntu --class gnu-linux --class gnu --class os $menuentry_id_option 'gnulinux-5.15.0-79-generic-recovery-d466b9e4-cf98-4174-a7b8-ee1a9d100741' {
	menuentry 'Ubuntu, with Linux 5.15.0-78-generic' --class ubuntu --class gnu-linux --class gnu --class os $menuentry_id_option 'gnulinux-5.15.0-78-generic-advanced-d466b9e4-cf98-4174-a7b8-ee1a9d100741' {
	menuentry 'Ubuntu, with Linux 5.15.0-78-generic (recovery mode)' --class ubuntu --class gnu-linux --class gnu --class os $menuentry_id_option 'gnulinux-5.15.0-78-generic-recovery-d466b9e4-cf98-4174-a7b8-ee1a9d100741' {
	menuentry 'Ubuntu, with Linux 5.10.35-tfw-secperf+' --class ubuntu --class gnu-linux --class gnu --class os $menuentry_id_option 'gnulinux-5.10.35-tfw-secperf+-advanced-d466b9e4-cf98-4174-a7b8-ee1a9d100741' {
	menuentry 'Ubuntu, with Linux 5.10.35-tfw-secperf+ (recovery mode)' --class ubuntu--class gnu-linux --class gnu --class os $menuentry_id_option 'gnulinux-5.10.35-tfw-secperf+-recovery-d466b9e4-cf98-4174-a7b8-ee1a9d100741' {
	menuentry 'Ubuntu, with Linux 5.10.35-tfw-secperf+.old' --class ubuntu --class gnu-linux --class gnu --class os $menuentry_id_option 'gnulinux-5.10.35-tfw-secperf+.old-advanced-d466b9e4-cf98-4174-a7b8-ee1a9d100741' {
	menuentry 'Ubuntu, with Linux 5.10.35-tfw-secperf+.old (recovery mode)' --class ubuntu --class gnu-linux --class gnu --class os $menuentry_id_option 'gnulinux-5.10.35-tfw-secperf+.old-recovery-d466b9e4-cf98-4174-a7b8-ee1a9d100741' {

To boot the Ubuntu, with Linux 5.10.35-tfw-secperf+ kernel you should specify

GRUB_DEFAULT='gnulinux-advanced-d466b9e4-cf98-4174-a7b8-ee1a9d100741>gnulinux-5.10.35-tfw-secperf+-advanced-d466b9e4-cf98-4174-a7b8-ee1a9d100741'

in /etc/default/grub.

Also you can choose the default Ubuntu kernel as safe and the Tempesta kernel as testing:

$ grub-set-default 'gnulinux-advanced-d466b9e4-cf98-4174-a7b8-ee1a9d100741>gnulinux-5.15.0-71-generic-advanced-660909d3-a156-4a05-9647-5d3f40a6543c'
$ grub-reboot 'gnulinux-advanced-d466b9e4-cf98-4174-a7b8-ee1a9d100741>gnulinux-5.10.35-tfw-secperf+-advanced-d466b9e4-cf98-4174-a7b8-ee1a9d100741'

After the system reboot it will boot into the 5.10.35-tfw-secperf+ (Tempesta) kernel, but if it fails the system will automatically reboot to the standard Ubuntu kernel.

Crash dumps

It might be useful to setup automatic storing kernel crash dumps. There are plenty of good documentation about kdump & crash utility, please explore the links from references.

Just some settings for quick start and to fix known issues:

  • kernel.panic=60 in /etc/sysctl.conf - we've found that sometimes kdump isn't in time to create a memory image and this setting fixes the problem.

  • it's better to use as small tempesta_dbmem kernel parameter as possible, e.g. tempesta_dbmem=4. Smaller TempestaDB memory leave more space for kdump.

  • adjust /etc/default/grub.d/kdump-tools.default as GRUB_CMDLINE_LINUX_DEFAULT="$GRUB_CMDLINE_LINUX_DEFAULT crashkernel=2048M-:512M"

References

Serial console

It's extremely important for the kernel debugging to get output of the serial console (dmesg), which contains call stacks and necessary system output just before a problem occurred.

Check that CONFIG_SERIAL_8250_CONSOLE is enabled in your kernel config (this is default for Ubuntu).

Add printk.synchronous=1 to the kernel cmdline to get all printk() messages. If you see messages like

** 5029 printk messages dropped **

in dmesg, then you typically need to set the option. This is extremely important for debug builds producing massive printing.

You also might need to update your /etc/sysctl.conf with

kernel.printk = 8 4 1 8

to make the kernel to send even debug messages to the console.

Virtual machines

If you use a KVM virtual machine, then it's very straightforward to enable serial console output:

  1. add -serial file:serial.txt option to qemu-system-x86_64 call so KVM will write the output of the serial console to the serial.txt file.
  2. add console=tty0 console=ttyS0,115200n8 to GRUB_CMDLINE_LINUX line in /etc/default/grub to let the kernel forward the console output to the tty device.
  3. update the grub as in the instructions above and reboot your VM.

If you use a VMware virtual machine:

  1. add serial port in virtual machine settings: Virtual Machine -> Settings (Ctrl+D) -> Hardware tab -> Add... -> select Serial port -> Finish -> select Use output file. Or use official documentation.
  2. add console=tty0 console=ttyS0,115200n8 to GRUB_CMDLINE_LINUX line in /etc/default/grub to let the kernel forward the console output to the tty device. You may need to use `console=tty1 console=ttyS1,115200n8'.
  3. update the grub as in the instructions above and reboot your VM.

There might be issues with serial console redirection into a file with libvirt of version 8 (check with libvirtd --version) and the file isn't written at all. In this case try to use log file eliment as described in the official guide, e.g.

   <serial type='pty'>
      <target type='isa-serial' port='0'>
        <model name='isa-serial'/>
      </target>
      <log file="/var/log/vm-logs/tempesta-1.1.log" append="on"/>
      <alias name='serial0'/>
    </serial>

Restart the libvirt services after the changes:

systemctl restart libvirtd; systemctl restart virtlogd

Hardware servers

If you run Tempesta FW on a hardware server and do not have access to IPMI, then a netconsole can be used to get the output of serial console. You can use the Ubuntu guide how to set it up.

Decoding crashdump

Call traces are dumped on kernel crash and on kernel warnings. Typical trace consists of entries which looks like

do_syscall_64+0x33/0x80

Here are

  • a function name (do_syscall_64 in the example),
  • an offset from function beginning (0x33 in the example),
  • a function length (/0x80 in the example).

The addr2line utility helps to convert this information into a source file name and line number. The call to utility is:

addr2line -e /PATH/TO/MODULE -i HEX_ADDRESS

Here:

  • /PATH/TO/MODULE is a path to a module where the function is located. In case of kernel built in functions, this should be a path to the kernel. Note that the kernel should not be compressed and debug info should not be stripped from it.
  • HEX_ADDRESS - an address of the instruction, '0x' prefix is required.

To convert the function name and the offset into the plain hex address, a mix from nm, grep, perl & awk could be used, so the overal command will be in the form

addr2line -e /PATH/TO/MODULE -i $(perl -e 'printf("%x\n", 0x'$(nm /PATH/TO/MODULE | grep '\<FUNCTION_NAME$' | awk '{print $1}')'+HEX_OFFSET)')

; for example above it will be

addr2line -e vmlinux -i $(perl -e 'printf("%x\n", 0x'$(nm vmlinux | grep '\<do_syscall_64$'| awk '{print $1}')'+0x33)')

There are scripts in linux kernel tree capable of automating this task. Those can be used like this:

linux-kernel-source/scripts/decode_stacktrace.sh linux-kernel-source/vmlinux linux-kernel-source tempesta-source/fw/tempesta_fw.ko < dmesg.log

You can also use a more convenient tool gdb, to locate the source code.

sym is a simple wrapper function of gdb (just append it into your ~/.bashrc):

# Usage: sym <symbol description> <binary>
# $1: the symbol description, e.g. "tfw_hpack_decode+0x1940/0x3c50"
# $2: the binary, e.g. "fw/tempesta_fw.ko", "/any/path/vmlinux"
sym ()
{
    bin=${2:-~/kernel/linux-$(uname -r |grep -Eoh '[0-9]+\.[0-9]+\.[0-9]+')-tfw/vmlinux};
    path=$(realpath $bin);
    ( cd $(dirname $path) && gdb $path -q -ex "list *(${1%/*})" -ex 'quit' )
}

It will use the binary to locate the source code, assuming the source code is in the same directory layout as the binary. With gdb, it will output color-highlighted source code with context lines.

Example:

$ sym __inet_lookup_established+0x5f/0x140 ~/kernel/kasan/linux-5.10.35-tfw/vmlinux
Reading symbols from /home/kingluo/kernel/kasan/linux-5.10.35-tfw/vmlinux...
0xffffffff820ce35f is in __inet_lookup_established (net/ipv4/inet_hashtables.c:395).
390     struct sock *__inet_lookup_established(struct net *net,
391                                       struct inet_hashinfo *hashinfo,
392                                       const __be32 saddr, const __be16 sport,
393                                       const __be32 daddr, const u16 hnum,
394                                       const int dif, const int sdif)
395     {
396             INET_ADDR_COOKIE(acookie, saddr, daddr);
397             const __portpair ports = INET_COMBINED_PORTS(sport, hnum);
398             struct sock *sk;
399             const struct hlist_nulls_node *node;

Analyze the kernel Crash Report

The crash report has many important hints:

  1. the first line of the crash will give a brief reason, e.g. BUG: kernel NULL pointer dereference, address: 0000000000000001.
  2. the stack-top statement, e.g. RIP: 0010:__memcpy_fast+0x13d/0x160 [tempesta_lib].
  3. the assembly code, note that the <xx> denotes the first byte of the crash instruction, e.g. <0f> b7 31 48 83 c0 02 48 83 c1 02 66 89 70 fe 83 e2 01 74 05 0f b6.
  4. the value of the registers, which is mostly useful to check the C function input arguments or the context data.
  5. the call trace, which denotes the crash path, i.e. the function call chain.

The KASAN dump has more context information, e.g. the memory violation type and where the object comes from. Note that due to design limitations, KASAN only works with memory managed by kmalloc/kmem_cache API, it's unaware of if we do some dirty operations on memory managed by our pool (e.g. tfw_pool_alloc. We even caches pages in per-CPU arrays), such as hpack header tables.

The most important thing is to find the RIP in the source code. To make sure the source code matches, we should confirm which git commit the crash comes from. If you can't find the right git commit, try matching the assembly code. Hopefully, the C function you are looking for hasn't changed a lot.

For example, we have a crash report:

[ 9241.758089] BUG: kernel NULL pointer dereference, address: 0000000000000008
[ 9241.759403] #PF: supervisor read access in kernel mode
[ 9241.760288] #PF: error_code(0x0000) - not-present page
[ 9241.761169] PGD 0 P4D 0 
[ 9241.761700] Oops: 0000 [#1] SMP KASAN PTI
[ 9241.762487] CPU: 6 PID: 0 Comm: swapper/6 Tainted: G    B      OE     5.10.35+ #5
[ 9241.763861] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.15.0-1 04/01/2014
[ 9241.765441] RIP: 0010:tfw_hpack_decode+0x1940/0x3c50 [tempesta_fw]
[ 9241.784017] Code: 10 00 00 44 39 65 b0 0f 83 10 10 00 00 48 89 df e8 b5 28 9e d9 4c 8b 2b 49 8d 7d 08 e8 a9 28 9e d9 ba e0 ff ff ff 48 8d 7b 1c <41> 2b 55 08 41 01 d4 e8 34 25 9e d9 f6 43 1c 01 74 b1 48 8b bd 68
[ 9241.787148] RSP: 0018:ffff888420188e50 EFLAGS: 00010282
[ 9241.788061] RAX: 0000000000000000 RBX: ffff8881cd93c010 RCX: ffffffffc1717377
[ 9241.789274] RDX: 00000000ffffffe0 RSI: 0000000000000008 RDI: ffff8881cd93c02c
[ 9241.790589] RBP: ffff888420188f48 R08: ffffffffc1716d49 R09: ffff8881e3d860c7
[ 9241.797484] R10: 00000000000000ac R11: 0000000000000001 R12: 0000000000000fcd
[ 9241.797486] R13: 0000000000000000 R14: 0000000000000041 R15: 0000000000000000
[ 9241.797491] FS:  0000000000000000(0000) GS:ffff888420180000(0000) knlGS:0000000000000000
[ 9241.834645] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 9241.834648] CR2: 0000000000000008 CR3: 00000001d941e005 CR4: 0000000000770ee0
[ 9241.834652] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 9241.834657] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400

First, let's check if we can find the C code location via gdb:

sym tfw_hpack_decode+0x1940/0x3c50 fw/tempesta_fw.ko

example output:

0x1544f is in tfw_hpack_decode (/home/tempesta/tempesta/fw/hpack.c:832).
827				       "maximum allowed decreased size: %u\n",  __func__,
828				       curr, early, count, window);
829	
830				cp = entries + early;
831				do {
832					size -= HPACK_ENTRY_OVERHEAD + cp->hdr->len; // <------ CRASH HERE
833					T_DBG3("%s: dropped index: %u\n", __func__,
834					       early);
835					if (cp->last)
836						tfw_pool_clean_single(tbl->h_pool,

It points to size -= HPACK_ENTRY_OVERHEAD + cp->hdr->len;. Is it what we want?

If the binary does not match the git commit, you will get the wrong location, and it will give you a meaningless result. But please note that there are some cases that should be considered normal, which will make the location slightly different from the correct location, for example, some functions are inlined, statements are reordered due to compiler optimizations. In these cases, you will still get the correct location, but it will point to the called function fragment, or the line number will have a certain offset.

So, if you're in doubt, check the assembly code too, i.e. <41> 2b 55 08 41 ....

objdump -rSl --disassemble=tfw_hpack_decode fw/tempesta_fw.ko | less

Okay, we get it. The assembly code matches the the C code, and according to the crash brief, it's the correct crash point.

/home/kingluo/tempesta/fw/hpack.c:833
                                size -= HPACK_ENTRY_OVERHEAD + cp->hdr->len;  <---- cp->hdr=%r13=0x0
   1b8bd:       41 2b 55 08             sub    0x8(%r13),%edx
   1b8c1:       41 01 d4                add    %edx,%r12d

Some suggestions for debugging deployment:

  1. The debugging git branch should be pushed, avoid force push, otherwise the history may be messed up.
  2. Compile and deploy binaries only on each push commit, not on a code base with uncommitted or temporary changes, because the temporarily built binary cannot be used to track the source code again.
  3. Record the commit hash in the GitHub issue comment so that each issue assignee can compile tempesta by themselves and make the correct mapping between the binary and the source code.

In addition, for official deployments, we should make a git tag, compile it into the tempesta binary, and print the tag or commit hash when tempesta starts, so that we can use it to track the source code.

References

Tracking illegal memory accesses and leaks

There exist several tools to help debug memory issues.

kmemleak

https://www.kernel.org/doc/html/v5.10/dev-tools/kmemleak.html

Kmemleak provides a way of detecting possible kernel memory leaks in a way similar to a tracing garbage collector.

In order to use kmemleak the kernel has to be built with the following options:

CONFIG_DEBUG_KMEMLEAK=y
CONFIG_DEBUG_KMEMLEAK_MEM_POOL_SIZE=16000 <- can be arbitrary

Basic flow:

  1. Check if debugfs is already mounted. Mount if necessary: # mount -t debugfs nodev /sys/kernel/debug/

  2. echo clear > /sys/kernel/debug/kmemleak

  3. Run the test

  4. echo scan > /sys/kernel/debug/kmemleak

    If there are any hypothetical leaks, one might see the following meesage in dmesg: [ 1185.562872] kmemleak: 39 new suspected memory leaks (see /sys/kernel/debug/kmemleak)

  5. Gather the report: cat /sys/kernel/debug/kmemleak

    Example report:

unreferenced object 0xffff88813997d040 (size 704):
  comm "softirq", pid 0, jiffies 4295173707 (age 614.949s)
  hex dump (first 32 bytes):
    01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    00 00 00 00 00 00 00 00 c0 98 99 26 81 88 ff ff  ...........&....
  backtrace:
    [<0000000035bfc8aa>] tfw_h2_add_stream+0xbe/0x410 [tempesta_fw]
    [<00000000ba7e15f7>] tfw_h2_frame_recv+0x3140/0x64f0 [tempesta_fw]
    [<000000004e5a3637>] ss_skb_process+0x3a3/0x5e0 [tempesta_fw]
    [<00000000aedeaca5>] tfw_h2_frame_process+0x16d/0x1200 [tempesta_fw]
    [<00000000e66f86b4>] tfw_connection_recv+0x100/0x1d0 [tempesta_fw]
    [<000000003ca45a93>] tfw_tls_connection_recv+0x538/0xbb0 [tempesta_fw]
    [<00000000b7f25f34>] ss_tcp_process_data+0x5d9/0xd90 [tempesta_fw]
    [<000000004dcf244d>] ss_tcp_data_ready+0xc4/0x1c0 [tempesta_fw]
    [<00000000832ba38a>] tcp_data_queue+0x1669/0x4cb0
    [<000000001c743974>] tcp_rcv_established+0x6ae/0x1d10
    [<00000000a6c79ebb>] tcp_v4_do_rcv+0x4fa/0x750
    [<0000000035a2b47c>] tcp_v4_rcv+0x2588/0x3530
    [<00000000158df7d6>] ip_protocol_deliver_rcu+0x6a/0x550
    [<0000000085c0e07f>] ip_local_deliver_finish+0x1a4/0x250
    [<000000006d5a9809>] ip_local_deliver+0x246/0x2a0
    [<00000000d925252d>] ip_rcv+0x15a/0x180
unreferenced object 0xffff888137f47a80 (size 704):

page_owner

https://www.kernel.org/doc/html/v5.10/vm/page_owner.html

kmemleak can't track raw pages allocations from the buddy allocator, so page_owner comes to rescue! This tool is useful to track who allocated each page and can be used to find memory leaks.

The kernel has to be built with CONFIG_PAGE_OWNER=y option and page_owner=on should be added to boot cmdline.

  1. Build user space helper: cd <linux kernel>/tools/vm && make page_owner_sort
  2. Collect allocation statistics before the test:
    cat /sys/kernel/debug/page_owner > page_owner_full_pre.txt
    ./page_owner_sort page_owner_full_pre.txt sorted_page_owner_pre.txt
    
  3. Run the workload.
  4. Collect new statistics and compare it to the original data from above:
    cat /sys/kernel/debug/page_owner > page_owner_full_post.txt
    ./page_owner_sort page_owner_full_post.txt sorted_page_owner_post.txt
    

Example output:

PRE >

57889 times:
Page allocated via order 0, mask 0x100a20(GFP_ATOMIC|__GFP_HARDWALL)
 prep_new_page+0x1ad/0x250
 get_page_from_freelist+0x22b1/0x2e80
 __alloc_pages_nodemask+0x2b4/0x630
 tfw_cache_do_action+0x2598/0x2e10 [tempesta_fw]
 tfw_cache_process+0x1aa/0x7b0 [tempesta_fw]
 tfw_http_req_process+0x94d/0x13c0 [tempesta_fw]
 tfw_http_msg_process_generic+0x42e/0x11d0 [tempesta_fw]
 tfw_h2_frame_process+0x7b0/0x11d0 [tempesta_fw]
 tfw_connection_recv+0x102/0x1e0 [tempesta_fw]
 tfw_tls_connection_recv+0x4d9/0xa30 [tempesta_fw]
 ss_tcp_process_data+0x5ea/0xd50 [tempesta_fw]
 ss_tcp_data_ready+0xb9/0x1e0 [tempesta_fw]
 tcp_data_queue+0x1669/0x4cb0
 tcp_rcv_established+0x6e8/0x1d70
 tcp_v4_do_rcv+0x4fa/0x750
 tcp_v4_rcv+0x2588/0x3530

POST >

120203 times:
Page allocated via order 0, mask 0x100a20(GFP_ATOMIC|__GFP_HARDWALL)
 prep_new_page+0x1ad/0x250
 get_page_from_freelist+0x22b1/0x2e80
 __alloc_pages_nodemask+0x2b4/0x630
 tfw_cache_do_action+0x2598/0x2e10 [tempesta_fw]
 tfw_cache_process+0x1aa/0x7b0 [tempesta_fw]
 tfw_http_req_process+0x94d/0x13c0 [tempesta_fw]
 tfw_http_msg_process_generic+0x42e/0x11d0 [tempesta_fw]
 tfw_h2_frame_process+0x7b0/0x11d0 [tempesta_fw]
 tfw_connection_recv+0x102/0x1e0 [tempesta_fw]
 tfw_tls_connection_recv+0x4d9/0xa30 [tempesta_fw]
 ss_tcp_process_data+0x5ea/0xd50 [tempesta_fw]
 ss_tcp_data_ready+0xb9/0x1e0 [tempesta_fw]
 tcp_data_queue+0x1669/0x4cb0
 tcp_rcv_established+0x6e8/0x1d70
 tcp_v4_do_rcv+0x4fa/0x750
 tcp_v4_rcv+0x2588/0x3530

KASAN

https://www.kernel.org/doc/html/v5.10/dev-tools/kasan.html

KernelAddressSANitizer (KASAN) is a dynamic memory error detector designed to find out-of-bound and use-after-free bugs.

The following build options are needed to use it:

CONFIG_KASAN=y
CONFIG_KASAN_GENERIC=y
# CONFIG_KASAN_OUTLINE is not set
CONFIG_KASAN_INLINE=y   <- works faster than CONFIG_KASAN_OUTLINE
CONFIG_KASAN_STACK=1
CONFIG_KASAN_VMALLOC=y  <- enable this option, so KASAN would also cover vmalloc space

Usage:

  1. Run the desired test
  2. Observe reports in dmesg

Example report:

BUG: KASAN: slab-out-of-bounds in sg_next+0x72/0x80
Read of size 8 at addr ffff8881292b50e8 by task ksoftirqd/3/29

CPU: 3 PID: 29 Comm: ksoftirqd/3 Tainted: G           OE     5.10.35+ #8
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
Call Trace:
 dump_stack+0x91/0xbc
 print_address_description.constprop.0+0x1c/0x210
 ? _raw_spin_lock_irqsave+0x7d/0xc0
 ? _raw_write_unlock_bh+0x60/0x60
 ? sg_next+0x72/0x80
 kasan_report.cold+0x37/0x7c
 ? kasan_poison_shadow+0x10/0x30
 ? sg_next+0x72/0x80
 sg_next+0x72/0x80
 ttls_decrypt+0x12cb/0x20b0 [tempesta_tls]
 ? receive_buf+0xb68/0x57f0 [virtio_net]
 ? virtnet_poll+0x4b0/0xc81 [virtio_net]
 ? ttls_send_alert+0x1d0/0x1d0 [tempesta_tls]
 ? smpboot_thread_fn+0x2eb/0x6a0
 ? kthread+0x31c/0x3e0
 ? ret_from_fork+0x1f/0x30
 ? console_trylock+0xb0/0xb0
 ttls_recv+0x23e/0x1e10 [tempesta_tls]
 ? memset+0x10/0x40
 ss_skb_process+0x3a3/0x5e0 [tempesta_fw]
 ? ttls_handle_alert+0x1a0/0x1a0 [tempesta_tls]
 ? __kasan_slab_free+0x124/0x150
 ? tfw_sock_srv_exit+0x30/0x30 [tempesta_fw]
 ? _raw_spin_lock+0x6a/0xb0
 ? _raw_write_lock_irqsave+0xc0/0xc0
 tfw_tls_connection_recv+0x17d/0xc00 [tempesta_fw]
 ? ss_skb_unroll+0x9b0/0xf40 [tempesta_fw]
 ? tfw_tls_connection_lost+0x30/0x30 [tempesta_fw]
 ? dynamic_emit_prefix+0x350/0x350
 ss_tcp_process_data+0x5d9/0xe30 [tempesta_fw]
 ? ss_linkerror+0x120/0x120 [tempesta_fw]
 ss_tcp_data_ready+0x13b/0x270 [tempesta_fw]
 tcp_rcv_established+0x1155/0x1d10
 ? tcp_data_queue+0x4cb0/0x4cb0
 tcp_v4_do_rcv+0x4fa/0x750
 tcp_v4_rcv+0x2588/0x3530
 ? nf_ct_deliver_cached_events+0xe4/0x410 [nf_conntrack]
 ? __inet_lookup_established+0x27e/0x4d0
 ? tcp_v4_early_demux+0x790/0x790
 ip_protocol_deliver_rcu+0x6a/0x550
 ? nf_confirm+0x340/0x340 [nf_conntrack]
 ip_local_deliver_finish+0x1a4/0x250
 ip_local_deliver+0x246/0x2a0
 ? ip_local_deliver_finish+0x250/0x250
 ? ip_rcv_finish_core.constprop.0+0x2e9/0x18a0
 ? ip_protocol_deliver_rcu+0x550/0x550
 ip_rcv+0x15a/0x180
 ? ip_local_deliver+0x2a0/0x2a0
 ? ip_rcv_finish_core.constprop.0+0x18a0/0x18a0
 ? kasan_unpoison_shadow+0x34/0x40
 __netif_receive_skb_core.constprop.0+0xdd9/0x3050
 ? get_page_from_freelist+0x1b39/0x4da0
 ? generic_xdp_tx+0x410/0x410
 ? ip_finish_output2+0x60c/0x1a60
 ? __zone_watermark_ok+0x420/0x420
 __netif_receive_skb_list_core+0x2cf/0x8b0
 ? __netif_receive_skb_core.constprop.0+0x3050/0x3050
 ? ret_from_fork+0x1f/0x30
 ? ktime_get_with_offset+0xa9/0x150
 netif_receive_skb_list_internal+0x5d7/0xc10
 ? arch_stack_walk+0x6c/0xb0
 ? __netif_receive_skb_list_core+0x8b0/0x8b0
 ? dev_gro_receive+0x229/0x2330
 ? memcpy+0x39/0x60
 gro_normal_one+0x137/0x220
 napi_gro_receive+0x23c/0x2a0
 receive_buf+0xb68/0x57f0 [virtio_net]
 ? page_to_skb+0x9f0/0x9f0 [virtio_net]
 ? tcp_schedule_loss_probe.part.0+0x383/0x510
 ? virtqueue_get_buf_ctx_split+0x273/0x630
 ? _raw_spin_unlock_bh+0x20/0x20
 virtnet_poll+0x4b0/0xc81 [virtio_net]
 ? receive_buf+0x57f0/0x57f0 [virtio_net]
 net_rx_action+0x37c/0xbc0
 ? napi_complete_done+0x5d0/0x5d0
 ? sched_clock_cpu+0x18/0x160
 __do_softirq+0x1a6/0x5ab
 ? tasklet_hi_action+0x20/0x20
 run_ksoftirqd+0x26/0x40
 smpboot_thread_fn+0x2eb/0x6a0
 ? smpboot_register_percpu_thread+0x320/0x320
 ? __kthread_parkme+0x85/0x100
 ? smpboot_register_percpu_thread+0x320/0x320
 kthread+0x31c/0x3e0
 ? __kthread_bind_mask+0x90/0x90
 ret_from_fork+0x1f/0x30

Allocated by task 29:
 kasan_save_stack+0x1b/0x40
 __kasan_kmalloc.constprop.0+0xc2/0xd0
 ttls_decrypt+0xf6f/0x20b0 [tempesta_tls]
 ttls_recv+0x23e/0x1e10 [tempesta_tls]
 ss_skb_process+0x3a3/0x5e0 [tempesta_fw]
 tfw_tls_connection_recv+0x17d/0xc00 [tempesta_fw]
 ss_tcp_process_data+0x5d9/0xe30 [tempesta_fw]
 ss_tcp_data_ready+0x13b/0x270 [tempesta_fw]
 tcp_rcv_established+0x1155/0x1d10
 tcp_v4_do_rcv+0x4fa/0x750
 tcp_v4_rcv+0x2588/0x3530
 ip_protocol_deliver_rcu+0x6a/0x550
 ip_local_deliver_finish+0x1a4/0x250
 ip_local_deliver+0x246/0x2a0
 ip_rcv+0x15a/0x180
 __netif_receive_skb_core.constprop.0+0xdd9/0x3050
 __netif_receive_skb_list_core+0x2cf/0x8b0
 netif_receive_skb_list_internal+0x5d7/0xc10
 gro_normal_one+0x137/0x220
 napi_gro_receive+0x23c/0x2a0
 receive_buf+0xb68/0x57f0 [virtio_net]
 virtnet_poll+0x4b0/0xc81 [virtio_net]
 net_rx_action+0x37c/0xbc0
 __do_softirq+0x1a6/0x5ab

The buggy address belongs to the object at ffff8881292b5000
 which belongs to the cache kmalloc-256 of size 256
The buggy address is located 232 bytes inside of
 256-byte region [ffff8881292b5000, ffff8881292b5100)
The buggy address belongs to the page:
page:00000000decafbbf refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x1292b4
head:00000000decafbbf order:1 compound_mapcount:0
flags: 0x17ffffc0010200(slab|head)
raw: 0017ffffc0010200 0000000000000000 0000000500000001 ffff888100043400
raw: 0000000000000000 0000000080100010 00000001ffffffff 0000000000000000
page dumped because: kasan: bad access detected

Memory state around the buggy address:
 ffff8881292b4f80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
 ffff8881292b5000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>ffff8881292b5080: 00 00 00 00 00 00 00 00 00 00 00 00 00 fc fc fc
                                                          ^
 ffff8881292b5100: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
 ffff8881292b5180: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
==================================================================
Disabling lock debugging due to kernel taint

SystemTap

Source code: https://sourceware.org/git/?p=systemtap.git;a=summary

SystemTap is a tracing and probing tool that allows users to study and monitor the activities of the operating system (particularly, the kernel) in fine detail

This tool is quite useful for non-intrusive debugging/tracing. One can trace nearly any part of the running kernel and/or module without any explicit modifications to the source code. Systemtap can do a lot of things ranging from simple printk() insertion at arbitrary place in code to on-the-fly data structres modification, thus allowing to perform some sort of fault injection to test the code.

Scripting language reference: https://sourceware.org/systemtap/langref/

Beginners guide: https://sourceware.org/systemtap/SystemTap_Beginners_Guide/

Tutorial: https://sourceware.org/systemtap/tutorial/

It is usually available as a package in your distro of choice.

  • Debian/Ubuntu: apt-get install -y systemtap
  • Fedora: dnf install systemtap

In order to perform instrumentation, systemtap needs some information about the running kernel:

  1. For packaged kernel, one might need to install kernel-debuginfo kernel-debuginfo-common kernel-devel additional packages.
  2. If the kernel was build from source, we would need an access to the folder containing the built kernel:
    f35tfw :: ~ » l /lib/modules/5.10.35perf+ 
    total 2.6M
    drwxr-xr-x   4 root root 4.0K Feb  1 04:42 .
    drwxr-xr-x.  6 root root 4.0K Feb  1 04:20 ..
    lrwxrwxrwx   1 root root   27 Feb  1 04:20 build -> /root/host/linux-stable-tfw
    drwxr-xr-x   2 root root 4.0K Feb  1 04:42 extra
    drwxr-xr-x  11 root root 4.0K Feb  1 04:21 kernel
    -rw-r--r--   1 root root 681K Feb  1 04:26 modules.alias
    -rw-r--r--   1 root root 658K Feb  1 04:26 modules.alias.bin
    -rw-r--r--   1 root root  11K Feb  1 04:20 modules.builtin
    -rw-r--r--   1 root root  29K Feb  1 04:26 modules.builtin.alias.bin
    -rw-r--r--   1 root root  13K Feb  1 04:26 modules.builtin.bin
    -rw-r--r--   1 root root  82K Feb  1 04:20 modules.builtin.modinfo
    -rw-r--r--   1 root root 161K Feb  1 04:26 modules.dep
    -rw-r--r--   1 root root 254K Feb  1 04:26 modules.dep.bin
    -rw-r--r--   1 root root  280 Feb  1 04:26 modules.devname
    -rw-r--r--   1 root root  82K Feb  1 04:20 modules.order
    -rw-r--r--   1 root root  572 Feb  1 04:26 modules.softdep
    -rw-r--r--   1 root root 248K Feb  1 04:26 modules.symbols
    -rw-r--r--   1 root root 300K Feb  1 04:26 modules.symbols.bin
    lrwxrwxrwx   1 root root   27 Feb  1 04:20 source -> /root/host/linux-stable-tfw
    
    build/source symlinks point to the sources
    
  3. In order to be able to probe the code in out-of-tree modules like tempesta_fw, one should create symlinks to the these modules inside extra folder in /lib/modules/<kernel>/. E.g:
    f35tfw :: ~ » l /lib/modules/5.10.35perf+/extra/         
    total 8.0K
    drwxr-xr-x 2 root root 4.0K Feb  1 04:42 .
    drwxr-xr-x 4 root root 4.0K Feb  1 04:42 ..
    lrwxrwxrwx 1 root root   42 Feb  1 04:42 tempesta_db.ko -> /root/host/tempesta/db/core/tempesta_db.ko
    lrwxrwxrwx 1 root root   37 Feb  1 04:42 tempesta_fw.ko -> /root/host/tempesta/fw/tempesta_fw.ko
    lrwxrwxrwx 1 root root   39 Feb  1 04:42 tempesta_lib.ko -> /root/host/tempesta/lib/tempesta_lib.ko
    lrwxrwxrwx 1 root root   39 Feb  1 04:42 tempesta_tls.ko -> /root/host/tempesta/tls/tempesta_tls.ko
    

Run the script: stap -g -v ~/my_script.stp

Sample output:

f35tfw :: ~ » 
Pass 1: parsed user script and 500 library scripts using 227748virt/202296res/14656shr/187828data kb, in 250usr/100sys/340real ms.
Pass 2: analyzed script: 10 probes, 17 functions, 1 embed, 1 global using 692748virt/636244res/15892shr/652828data kb, in 2720usr/180sys/3164real ms.
Pass 3: translated to C into "/tmp/stapYCj74o/stap_066522e114f6510efc1fcfdfd1c92da5_8418_src.c" using 692748virt/636432res/16080shr/652828data kb, in 550usr/20sys/572real ms.
Pass 4: compiled C into "stap_066522e114f6510efc1fcfdfd1c92da5_8418.ko" in 1280usr/370sys/1958real ms.
Pass 5: starting run. <--- success!

Here are some simple examples of what can be done inside the script.

  1. Print softIRQ NET_TX_ACTION / NET_RX_ACTION invocations count every 1s on VM with 4 vCPUs.
f35tfw :: ~ » cat tfw_softirq_rx2_tx3.stp 
global t_si_arr_rx[4]
global t_si_arr_tx[4]

probe timer.ms(1000) {
        printf("CPU:    [0]             [1]             [2]             [3]             \n")
        printf("==========================================================================\n")
        printf("RX =>   %d              |%d             |%d             |%d\n",
                t_si_arr_rx[0], t_si_arr_rx[1], t_si_arr_rx[2], t_si_arr_rx[3])
        printf("--------------------------------------------------------------------------\n")
        printf("TX =>   %d              |%d             |%d             |%d\n",
                t_si_arr_tx[0], t_si_arr_tx[1], t_si_arr_tx[2], t_si_arr_tx[3])
        printf("\n")
}

probe kernel.statement("__do_softirq@kernel/softirq.c:303") {
        if ($vec_nr == 3) {
                t_si_arr_tx[cpu()] ++
        }

        if ($vec_nr == 2) {
                t_si_arr_rx[cpu()] ++
        }
}
  1. A little bit more complicated example, which uses embedded C code to dump sk_buff. skb_dump() function has been 'enhanced' and several probes are attached in the datapath.
%{
#include <include/net/sock.h>

#include <linux/net.h>
#include <linux/netdevice.h>
#include <linux/skbuff.h>
#include <linux/smp.h>
%}

function tfw_tls_skb_type:long (sk_buff:long)
{
	dev = @cast(sk_buff, "sk_buff")->dev
	
	if (!(dev & 1))
		return 0
	return dev >> 1
}

function should_check_skb:long (sk:long, sk_buff:long)
{
	return (@cast(sk, "sock")->sk_write_xmit != NULL) && tfw_tls_skb_type(sk_buff)
}

function skb_dump_msg(sk:long, sk_buff:long, msg:string, full:long)
%{
	pr_warn("[%d] sk %px skb %px: %s\n", smp_processor_id(),
		(struct sock *)STAP_ARG_sk, (struct sk_buff *)STAP_ARG_sk_buff,
		(const char *)STAP_ARG_msg);
	skb_dump(KERN_WARNING, (struct sk_buff *)STAP_ARG_sk_buff,
		 (long)STAP_ARG_full ? true : false);
%}

function __tcp_dump(sk:long, sk_buff:long, lim:long, msg:string)
%{
	pr_warn("[%d] %s sk %px skb %px limit %lu\n", smp_processor_id(), (const char*)STAP_ARG_msg,
		(struct sock *)STAP_ARG_sk, (struct sk_buff *)STAP_ARG_sk_buff,
		(long)STAP_ARG_lim);
%}

function __tcp_tskb_info(sk:long, sk_buff:long, thdr_size:long, topt_size:long)
%{
	pr_warn("[%d] sk %px skb %px thdr_size %lu topt_size %lu\n", smp_processor_id(),
		(struct sock *)STAP_ARG_sk, (struct sk_buff *)STAP_ARG_sk_buff,
		(long)STAP_ARG_thdr_size, (long)STAP_ARG_topt_size);
%}

function __skb_dump_ext(sk_buff:long, msg:string, full:long)
%{
	bool full = (bool)STAP_ARG_full;
	struct sk_buff *skb = (struct sk_buff *)STAP_ARG_sk_buff;

	struct skb_shared_info *sh = skb_shinfo(skb);
	struct net_device *dev = skb->dev;
	struct sock *sk = skb->sk;
	struct sk_buff *list_skb;
	bool has_mac, has_trans;
	int headroom, tailroom;
	int i, len, seg_len;

	if (full)
		len = skb->len;
	else
		len = min_t(int, skb->len, MAX_HEADER + 128);

	headroom = skb_headroom(skb);
	tailroom = skb_tailroom(skb);

	has_mac = skb_mac_header_was_set(skb);
	has_trans = skb_transport_header_was_set(skb);

	pr_warn("[%d] %s skb %px\n", smp_processor_id(),
		(const char *)STAP_ARG_msg, (struct sk_buff *)STAP_ARG_sk_buff);

	printk("%sskb len=%u headroom=%u headlen=%u tailroom=%u\n"
	       "mac=(%d,%d) net=(%d,%d) trans=%d\n"
	       "shinfo(txflags=%u nr_frags=%u gso(size=%hu type=%u segs=%hu))\n"
	       "csum(0x%x ip_summed=%u complete_sw=%u valid=%u level=%u)\n"
	       "hash(0x%x sw=%u l4=%u) proto=0x%04x pkttype=%u iif=%d\n",
	       KERN_WARNING, skb->len, headroom, skb_headlen(skb), tailroom,
	       has_mac ? skb->mac_header : -1,
	       has_mac ? skb_mac_header_len(skb) : -1,
	       skb->network_header,
	       has_trans ? skb_network_header_len(skb) : -1,
	       has_trans ? skb->transport_header : -1,
	       sh->tx_flags, sh->nr_frags,
	       sh->gso_size, sh->gso_type, sh->gso_segs,
	       skb->csum, skb->ip_summed, skb->csum_complete_sw,
	       skb->csum_valid, skb->csum_level,
	       skb->hash, skb->sw_hash, skb->l4_hash,
	       ntohs(skb->protocol), skb->pkt_type, skb->skb_iif);

	if (dev)
		printk("%sdev name=%s feat=0x%pNF\n",
		       KERN_WARNING, dev->name, &dev->features);
	if (sk)
		printk("%ssk family=%hu type=%u proto=%u\n",
		       KERN_WARNING, sk->sk_family, sk->sk_type, sk->sk_protocol);

	if (full && headroom)
		print_hex_dump(KERN_WARNING, "skb headroom: ", DUMP_PREFIX_OFFSET,
			       16, 1, skb->head, headroom, true);

	seg_len = min_t(int, skb_headlen(skb), len);
	if (seg_len)
		print_hex_dump(KERN_WARNING, "skb linear:   ", DUMP_PREFIX_OFFSET,
			       16, 1, skb->data, seg_len, true);
	len -= seg_len;

	if (full && tailroom)
		print_hex_dump(KERN_WARNING, "skb tailroom: ", DUMP_PREFIX_OFFSET,
			       16, 1, skb_tail_pointer(skb), tailroom, true);

	for (i = 0; len && i < skb_shinfo(skb)->nr_frags; i++) {
		skb_frag_t *frag = &skb_shinfo(skb)->frags[i];
		u32 p_off, p_len, copied;
		struct page *p;
		u8 *vaddr;

		skb_frag_foreach_page(frag, skb_frag_off(frag),
				      skb_frag_size(frag), p, p_off, p_len,
				      copied) {
			seg_len = min_t(int, p_len, len);
			vaddr = kmap_atomic(p);
			print_hex_dump(KERN_WARNING, "skb frag:     ",
				       DUMP_PREFIX_OFFSET,
				       16, 1, vaddr + p_off, seg_len, true);
			kunmap_atomic(vaddr);
			len -= seg_len;
			if (!len)
				break;
		}
	}

	if (full && skb_has_frag_list(skb)) {
		printk("skb fraglist:\n");
		skb_walk_frags(skb, list_skb)
			skb_dump(KERN_WARNING, list_skb, true);
	}
%}

probe kernel.statement("tcp_write_xmit@net/ipv4/tcp_output.c:2663") {
	if (should_check_skb($sk, $skb)) {
		__skb_dump_ext($skb, "RIGHT AFTER WHILE", 1)
	}
}

probe kernel.statement("tcp_write_xmit@net/ipv4/tcp_output.c:2696") {
	if (should_check_skb($sk, $skb))
		__tcp_dump($sk, $skb, $limit, "FIRST LIMIT")
}

probe kernel.statement("tcp_write_xmit@net/ipv4/tcp_output.c:2703") {
	if (should_check_skb($sk, $skb))
		__tcp_dump($sk, $skb, $limit, "SECOND LIMIT")
}

probe kernel.statement("tcp_write_xmit@net/ipv4/tcp_output.c:2716") {
	if (should_check_skb($sk, $skb)) {
		__tcp_dump($sk, $skb, $limit, "AFTER ADJUST")
		__skb_dump_ext($skb, "BEFORE FRAGMENT", 1)
	}
}

probe kernel.statement("tcp_write_xmit@net/ipv4/tcp_output.c:2744") {
	__skb_dump_ext($skb, "BEFORE ENCRYPT", 1)
}

probe kernel.statement("tcp_write_xmit@net/ipv4/tcp_output.c:2745") {
	skb_dump_msg($sk, $skb, "AFTER ENCRYPT", 1)
	/* __skb_dump_ext($skb, "AFTER ENCRYPT", 1) */
}

probe kernel.statement("__tcp_transmit_skb@net/ipv4/tcp_output.c:1259") {
	skb_dump_msg($sk, $skb, "OSKB PRE CLONE", 1)
	/* __skb_dump_ext($skb, "OSKB PRE CLONE", 1) */
}

probe kernel.statement("__tcp_transmit_skb@net/ipv4/tcp_output.c:1312") {
	__tcp_tskb_info($sk, $skb, $tcp_header_size, $tcp_options_size)
	skb_dump_msg($sk, $skb, "SKB AFTER CLONE", 1)
	/* __skb_dump_ext($skb, "SKB AFTER CLONE", 1) */
}

probe kernel.function("skb_panic") {
	__skb_dump_ext($skb, "PANIC SKB", 0)
}


function __skb_alloc_trace(sk_buff:long)
%{
	pr_warn("[%d] got new skb %px\n", smp_processor_id(), (struct sk_buff *)STAP_ARG_sk_buff);
	dump_stack();
%}


probe module("tempesta_fw").statement("tfw_h2_resp_adjust_fwd@fw/http.c:5367") {
	skb_dump_msg(NULL, $sk_head, "BEFORE MSG TRANS", 1)
}

Error injection

Tempesta FW uses fault injection infrastructure provided by linux kernel. We use error injection to make tests for some difficult places in source code. How to use:

  1. Linux kernel should be build with CONFIG_BPF, CONFIG_BPF_KPROBE_OVERRIDE, CONFIG_FUNCTION_ERROR_INJECTION, CONFIG_FAULT_INJECTION, CONFIG_FAULT_INJECTION_DEBUG_FS and CONFIG_FAIL_FUNCTION.
  2. Add macro ALLOW_ERROR_INJECTION(function_name, error_code) after function definition that should return an error.
  3. Implement test based on one of the tests from t_fault_injection/test_fault_injection_base.py
Clone this wiki locally