In the third part of this guide, we will explain what system calls are and how they work in the Linux kernel. We will also write a simple syscall that greets the user. Note that this article is part of a series of guides on kernel hacking for Linux; you can find the rest of the series here:

  1. Part one: Introduction to the Linux kernel architecture;
  2. Part two: Building a device driver from scratch;
  3. Part three (this one): Introduction to syscalls and how to create a new one.

Be sure to read the previous two before starting with this one.

What is a system call?

In the first part of this guide, we gave a naive definition of system calls (a sort of API for communicating with the kernel from userspace). Let us now expand on it.

System calls are the programming interface that userspace programs use to ask the kernel to perform a task on their behalf. A system call is made, for instance, whenever a userspace process has to deal with devices (read, write), network sockets (socket) or processes (fork). Many of these system calls are “wrapped” by functions of the C library (such as glibc or uClibc), i.e., the library function may perform additional work before invoking the requested syscall (the fork function is an example of this). Other C library functions, such as strtok, do their job without interacting with the kernel at all.

One thing to remember about system calls is that they are not like other C functions: to call one, the userspace program has to ask the CPU to switch to kernel mode using a dedicated instruction (syscall, sysenter or int 0x80 on x86), after which the kernel jumps to the code that implements the requested system call. The Linux kernel makes this process very easy for system programmers: all we have to do is provide the kernel with a unique identifier (the system call number) that selects the corresponding syscall code.

System calls in C programs

Now that we have an idea of what system calls are and how they work, let us find the system calls used by the following C program:

#include <stdio.h>

int main(void) {
	printf("Hello World\n");
	return 0;
}

In order to do that, we can use the strace(1) utility:

marco@kernelvm:~$ strace ./a.out
execve("./a.out", ["./a.out"], 0xffffdcc5c320 /* 23 vars */) = 0
brk(NULL)                               = 0xaaaac9e94000
faccessat(AT_FDCWD, "/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=20031, ...}) = 0
mmap(NULL, 20031, PROT_READ, MAP_PRIVATE, 3, 0) = 0xffffbd25e000
close(3)                                = 0
openat(AT_FDCWD, "/lib/aarch64-linux-gnu/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\3\0\0\0\0\0\0\0\0\3\0\267\0\1\0\0\0`C\2\0\0\0\0\0"..., 832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=1458480, ...}) = 0
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xffffbd25c000
mmap(NULL, 1531032, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0xffffbd0bf000
mprotect(0xffffbd21c000, 65536, PROT_NONE) = 0
mmap(0xffffbd22c000, 24576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x15d000) = 0xffffbd22c000
mmap(0xffffbd232000, 11416, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0xffffbd232000
close(3)                                = 0
mprotect(0xffffbd22c000, 12288, PROT_READ) = 0
mprotect(0xaaaab6de0000, 4096, PROT_READ) = 0
mprotect(0xffffbd266000, 4096, PROT_READ) = 0
munmap(0xffffbd25e000, 20031)           = 0
fstat(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(0x88, 0), ...}) = 0
brk(NULL)                               = 0xaaaac9e94000
brk(0xaaaac9eb5000)                     = 0xaaaac9eb5000
write(1, "Hello World\n", 12Hello World
)           = 12
exit_group(0)                           = ?
+++ exited with 0 +++

Even for a small program like this, a lot of system calls are made. However, the only one we care about is the following (in the trace above, the program's own output is interleaved with the strace line):

write(1, "Hello World\n", 12) = 12

The write C function (not the syscall) takes three arguments:

  1. The file descriptor (an identifier for I/O operations: 0 = stdin, 1 = stdout, 2 = stderr);
  2. The character buffer;
  3. The number of bytes to write from the buffer.

Knowing this, we can rewrite the previous program without using the printf function:

#include <unistd.h>

int main(void) {
	//printf("Hello World\n");
	write(1, "Hello World\n", 12);

	return 0;
}

The write function here serves as a wrapper for the write system call. If we want to bypass this dedicated wrapper, we can pass the system call identifier to the generic syscall function instead:

#include <unistd.h>
#include <sys/syscall.h>

int main(void) {
        //printf("Hello World\n");
        //write(1, "Hello World\n", 12);
        syscall(SYS_write, 1, "Hello World\n", 12);

        return 0;
}

Now you may wonder if we can go a level deeper. Of course we can, but this time we have to work with assembly. The syscall function is itself part of the C library; in glibc, its x86_64 implementation is defined as follows (AT&T syntax):

.text
ENTRY (syscall)
        movq %rdi, %rax                /* Syscall number -> rax.  */
        movq %rsi, %rdi                /* shift arg1 - arg5.  */
        movq %rdx, %rsi
        movq %rcx, %rdx
        movq %r8, %r10
        movq %r9, %r8
        movq 8(%rsp),%r9               /* arg6 is on the stack.  */
        syscall                        /* Do the system call.  */
        cmpq $-4095, %rax              /* Check %rax for error.  */
        jae SYSCALL_ERROR_LABEL        /* Jump to error handler if error.  */
        ret                            /* Return to caller.  */
PSEUDO_END (syscall)

At this point, there is no need to use C. Let us rewrite the program one more time in assembly. In order to do that, we need to retrieve the system call number associated with write:

#include <stdio.h>
#include <sys/syscall.h>

int main(void) {
    printf("%d\n", SYS_write);
    return 0;
}

On Linux x86_64 it prints 1, so the system call identifier for write is 1 (the number is architecture specific). Its three parameters are the file descriptor, the char buffer and the buffer length. System call parameters are passed in registers (rdi, rsi, rdx, etc.). In our program we will use the following registers:

  • RAX: system call number (i.e., 1);
  • RDI: file descriptor (i.e., stdout = 1);
  • RSI: address of the char buffer (i.e., Hello World\n);
  • RDX: buffer length in bytes (i.e., 12).

Then we need to call the exit system call to quit our program. This syscall requires just one parameter (the return code), which in our case will be equal to zero.

; Intel syntax
; Compile with nasm -felf64 test.asm && ld test.o -o test
;

global _start        ; entrypoint for NASM
section .text
_start:
    mov rax, 0x1     ; Syscall number for write
    mov rdi, 0x1     ; Stdout file descriptor
    mov rsi, buf     ; Store address of char buffer
    mov rdx, bln     ; Length of buffer
    syscall          ; Execute the system call

    mov rax, 0x3C    ; Syscall for exit(60 = 0x3C)
    mov rdi, 0x0     ; Return code
    syscall

section .data
buf: db "Hello World", 0xA      ; Message followed by a line feed
bln: equ $ - buf                ; Length of buffer (12 bytes)

Compile it with the following commands:

marco@kernelvm:~$ nasm -felf64 test.asm
marco@kernelvm:~$ ld test.o -o test

And then execute it:

marco@kernelvm:~$ ./test
Hello World

And if we ask Bash to print out the return code, we will get the expected value 0:

marco@kernelvm:~$ echo $?
0

You can easily find a complete system calls table here. Let us see another example of a C program rewritten in assembly using Linux system calls:

#include <stdio.h>

int main(void) {
    char name[256] = "";
    printf("Insert your name: ");
    scanf("%255s", name); // At most 255 characters plus the terminating NUL

    printf("Hello, %s. Nice to meet you.\n", name);

    return 0;
}

This program asks the user for their name and then proceeds to print a formatted message. Below is an implementation using only Linux system calls:

; Intel syntax
; Compile with nasm -felf64 name.asm && ld name.o -o name
;

; Print buffer to stdout
%macro print 2
    mov rax, 0x1        ; sys_write syscall
    mov rdi, 0x1        ; stdout file descriptor
    mov rsi, %1
    mov rdx, %2
    syscall
%endmacro

; Exit with custom return code
%macro exit 1
    mov rax, 0x3C       ; sys_exit syscall
    mov rdi, %1         ; return code
    syscall
%endmacro

section .rodata
    welcome db "Insert your name: "
    lenwelcome equ $-welcome
    msgA db "Hello, "
    lenmsgA equ $-msgA
    msgB db ". Nice to meet you.", 0xA
    lenmsgB equ $-msgB
    lenbuf equ 256

section .bss
    buf resb lenbuf


global _start
section .text
_start:
    ; Print welcome message
    print welcome, lenwelcome

    ; Read from stdin
    xor rax, rax        ; sys_read syscall
    mov rdi, rax        ; stdin file descriptor
    mov rsi, buf
    mov rdx, lenbuf
    syscall             ; rax = number of bytes read
    lea r12, [rax - 1]  ; Name length without the trailing '\n'
                        ; (assumes at least one byte was read)

    ; Print formatted message
    print msgA, lenmsgA
    print buf, r12      ; Print only the bytes that were read
    print msgB, lenmsgB

    exit 0

Writing a new syscall

Now that we have a basic understanding of how to work with Linux system calls, let us try to add a new one to the kernel. To do that, we will need to recompile the whole kernel (since there is no way to dynamically load system calls at runtime), so the first thing to do is get the kernel source code from kernel.org:

[marco@kernelvm ~]$ curl -O https://cdn.kernel.org/pub/linux/kernel/v5.x/linux-5.16.3.tar.xz
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  121M  100  121M    0     0  7730k      0  0:00:16  0:00:16 --:--:-- 9530k

You can download any modern kernel version; in this guide I’m using 5.16.3 (the latest stable release at the time of writing). Once the download is complete, extract the archive and move to the source code directory:

[marco@kernelvm ~]$ tar xpf linux-5.16.3.tar.xz
[marco@kernelvm ~]$ cd linux-5.16.3/

If you have previously built a kernel from scratch, you know how many hours are needed to properly configure all your USB devices, GPU, audio card, Wi-Fi adapters and so on. Since this is outside the scope of this guide, we will just copy the configuration of the running kernel from /proc/config.gz:

[marco@kerneldev linux-5.16.3]$ zcat /proc/config.gz > .config

The .config file contains the build options of the kernel, including which drivers are built into the kernel image and which are built as loadable modules. The only thing we need to change is the local version string that is appended to the kernel release. You can choose any suffix you like; in my case I will use -syscall:

[marco@kerneldev linux-5.16.3]$ sed -i 's/CONFIG_LOCALVERSION=""/CONFIG_LOCALVERSION="-syscall"/g' .config

System calls can be placed nearly everywhere in the kernel source tree. For instance, the source code of the read and write syscalls is stored in fs/read_write.c while many others are stored in kernel/sys.c. I will put the new system call in kernel/sys.c but keep in mind that any other file is also fine.

Now edit kernel/sys.c and add the following:

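/**
 * hicall - Greet the user
 * @data: NUL-terminated user name, located in userspace memory
 */
SYSCALL_DEFINE1(hicall, const char __user *, data) {
    char name[256];
    /* Returns the string length, or a negative error code on fault */
    long len = strncpy_from_user(name, data, sizeof(name));

    if(len < 0) {
            printk(KERN_WARNING "hicall: cannot read data from userspace\n");
            return len;
    }

    /* The string did not fit: name is not NUL-terminated, so reject it */
    if(len == sizeof(name))
            return -ENAMETOOLONG;

    printk(KERN_INFO "hicall: Hi %s, nice to meet you.\n", name);
    return 0;
}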

Let us understand what is going on here. To define a system call with n arguments, the Linux kernel provides a family of macros called SYSCALL_DEFINEn (SYSCALL_DEFINE0 through SYSCALL_DEFINE6), where n is the number of arguments the system call takes. Our system call takes one argument, a pointer to a string, so we use the SYSCALL_DEFINE1 macro. The first parameter of the macro (in this case hicall) is the name of the system call; it is followed by the type and the name of each argument.

The code is pretty straightforward: we first copy the user's name out of userspace with the strncpy_from_user function (never dereference a userspace pointer directly inside the kernel, since the address may be invalid), then we check for errors, and finally we print a message to the kernel log. That’s all.

Last but not least, we need to register our system call. To do that, edit the arch/x86/entry/syscalls/syscall_64.tbl file and add the following line at the end of the common group:

450     common  hicall                  sys_hicall

Be sure to pick a number that does not collide with an existing system call. In version 5.16.3 the highest number in the common group is 449, so 450 is free; if you are using a newer kernel, you will need to adjust that number, because 450 and above may already be assigned. As for the columns: the first one is the unique system call number, the second one is the ABI (common means that the entry is shared by the 64-bit and x32 ABIs), the third one is the name of the system call and the fourth one is the name of the function implementing it.

Compiling the Kernel

We are now ready to compile the kernel. I will follow the traditional compilation process described in the Arch Wiki and only cover the essential steps here; if you want to know more about this process, please refer to the Arch Wiki article.

To start compiling the kernel, go to the root of the kernel source folder and type the following:

[marco@kerneldev linux-5.16.3]$ make -j4

The build process may ask whether to enable or disable new options that are not present in your .config file; you can accept the defaults by simply pressing the Enter key. You may also need to install the following utilities:

[marco@kerneldev linux-5.16.3]$ sudo pacman -S xmlto kmod inetutils bc libelf git cpio perl tar xz base-devel

Once the kernel is compiled, run the following command to build the modules (a plain make normally builds them already, in which case this step finishes almost immediately):

[marco@kerneldev linux-5.16.3]$ make modules -j4

Then, install the modules by typing:

[marco@kerneldev linux-5.16.3]$ sudo make modules_install

This will install the modules into /lib/modules/5.16.3-syscall. Now copy the kernel image into the boot directory:

[marco@kerneldev linux-5.16.3]$ sudo cp -v arch/x86/boot/bzImage /boot/vmlinuz-linux-syscall
'arch/x86/boot/bzImage' -> '/boot/vmlinuz-linux-syscall'

Be sure to change the kernel name according to your needs. Finally, we need to generate the initial RAM disk. On Arch Linux this is done with mkinitcpio, which needs a preset for the new kernel; we can avoid writing one from scratch by copying an existing one:

[marco@kerneldev linux-5.16.3]$ sudo cp /etc/mkinitcpio.d/linux-lts.preset /etc/mkinitcpio.d/linux-syscall.preset 

Modify the preset file like this:

# mkinitcpio preset file for the 'linux-syscall' kernel

ALL_config="/etc/mkinitcpio.conf"
ALL_kver="/boot/vmlinuz-linux-syscall"

PRESETS=('default' 'fallback')

#default_config="/etc/mkinitcpio.conf"
default_image="/boot/initramfs-linux-syscall.img"
#default_options=""

#fallback_config="/etc/mkinitcpio.conf"
fallback_image="/boot/initramfs-linux-syscall-fallback.img"
fallback_options="-S autodetect"

Generate the initcpio image using:

[marco@kerneldev linux-5.16.3]$ sudo mkinitcpio -p linux-syscall
==> Building image from preset: /etc/mkinitcpio.d/linux-syscall.preset: 'default'
  -> -k /boot/vmlinuz-linux-syscall -c /etc/mkinitcpio.conf -g /boot/initramfs-linux-syscall.img
==> Starting build: 5.16.3-syscall
  -> Running build hook: [base]
  -> Running build hook: [udev]
  -> Running build hook: [autodetect]
  -> Running build hook: [modconf]
  -> Running build hook: [block]
==> WARNING: Possibly missing firmware for module: xhci_pci
  -> Running build hook: [filesystems]
  -> Running build hook: [keyboard]
  -> Running build hook: [fsck]
==> Generating module dependencies
==> Creating zstd-compressed initcpio image: /boot/initramfs-linux-syscall.img
==> Image generation successful
==> Building image from preset: /etc/mkinitcpio.d/linux-syscall.preset: 'fallback'
  -> -k /boot/vmlinuz-linux-syscall -c /etc/mkinitcpio.conf -g /boot/initramfs-linux-syscall-fallback.img -S autodetect
==> Starting build: 5.16.3-syscall
  -> Running build hook: [base]
  -> Running build hook: [udev]
  -> Running build hook: [modconf]
  -> Running build hook: [block]
==> WARNING: Possibly missing firmware for module: qed
==> WARNING: Possibly missing firmware for module: bfa
==> WARNING: Possibly missing firmware for module: wd719x
==> WARNING: Possibly missing firmware for module: qla2xxx
==> WARNING: Possibly missing firmware for module: aic94xx
==> WARNING: Possibly missing firmware for module: qla1280
==> WARNING: Possibly missing firmware for module: xhci_pci
  -> Running build hook: [filesystems]
  -> Running build hook: [keyboard]
  -> Running build hook: [fsck]
==> Generating module dependencies
==> Creating zstd-compressed initcpio image: /boot/initramfs-linux-syscall-fallback.img
==> Image generation successful

Finally, update the GRUB configuration:

[marco@kerneldev linux-5.16.3]$ sudo grub-mkconfig -o /boot/grub/grub.cfg 
Generating grub configuration file ...
Found linux image: /boot/vmlinuz-linux-syscall
Found initrd image: /boot/initramfs-linux-syscall.img
Found fallback initrd image(s) in /boot:  initramfs-linux-syscall-fallback.img
Found linux image: /boot/vmlinuz-linux-lts
Found initrd image: /boot/initramfs-linux-lts.img
Found fallback initrd image(s) in /boot:  initramfs-linux-lts-fallback.img
done

Now reboot the system, choose the new kernel from the GRUB menu and check that it is the one running:

[marco@kerneldev ~]$ uname -a
Linux kerneldev 5.16.3-syscall #3 SMP PREEMPT Fri Jan 28 22:22:45 UTC 2022 x86_64 GNU/Linux

Testing the system call

Let us now write a simple userspace client that makes use of our new system call. Create a new file called test_hicall.c and write the following inside:

#include <unistd.h>
#include <stdio.h>
#include <sys/syscall.h>

#define SYS_hicall 450

int main(void) {
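    char name[256];
    printf("Insert your name: ");
    if (scanf("%255s", name) != 1) // At most 255 characters plus the NUL
        return 1;

    // syscall() returns -1 and sets errno on failure
    if (syscall(SYS_hicall, name) == -1) {
        perror("hicall");
        return 1;
    }

    puts("Done! Please check dmesg");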

    return 0;
}

Compile it with gcc test_hicall.c -o test_hicall and execute it:

[marco@kerneldev syscall]$ ./test_hicall
Insert your name: Marco
Done! Please check dmesg

Then check dmesg for the result (the -W flag prints only new messages, so start this command in a second terminal before running the program):

[marco@kerneldev syscall]$ sudo dmesg -WH
[Jan29 15:07] hicall: Hi Marco, nice to meet you.

Our system call works! Here’s an equivalent version of the program above written in plain x86_64 assembly:

; Intel syntax
; Compile with nasm -felf64 test_hicall.asm && ld test_hicall.o -o test_hicall
;

; Print buffer to stdout
%macro print 2
    mov rax, 0x1
    mov rdi, 0x1
    mov rsi, %1
    mov rdx, %2
    syscall
%endmacro

%macro exit 1
    mov rax, 0x3C
    mov rdi, %1
    syscall
%endmacro

section .rodata
    welcome db "Insert your name: "
    lenwelcome equ $-welcome
    done db "Done! Please check dmesg", 0xA
    lendone equ $-done
    lenbuf equ 256

section .bss
    buf resb lenbuf

global _start
section .text
_start:
    print welcome, lenwelcome
    
    ; Read user name
    xor rax, rax
    mov rdi, rax
    mov rsi, buf
    mov rdx, lenbuf
    syscall

    ; Remove trailing new line character
    mov byte [rsi + rax - 1], 0

    ; Call 'hicall' system call
    mov rax, 0x1C2 ; 450 = 1C2h
    mov rdi, buf   ; User name
    syscall

    print done, lendone

    exit 0