KERNEL HACKING - SYSTEM CALLS(3/3)
2022-01-28
In the third part of this guide we will explain what system calls actually are and how they work under the Linux kernel.
Furthermore, we will also try to write a simple syscall that will greet the user.
Note that this article is part of a series of guides on kernel hacking for Linux.
You can find the rest of the series here:
- Part one: Introduction to the Linux kernel architecture;
- Part two: Building a device driver from scratch;
- Part three(this one): Introduction to syscalls, how to create a new syscall;
What is a system call?
In the first part of this guide we gave a naive definition of system calls (a sort of API to communicate with the kernel from userspace). Let us now expand on it. System calls are the programming interface that userspace programs use to ask the kernel to perform a task on their behalf. A system call is made, for instance, whenever a userspace process has to deal with devices (read, write), network sockets (socket) or processes (fork).
Many of these system calls are "wrapped" by C library functions (in glibc or uClibc, for example), i.e., before invoking the requested syscall the wrapper may perform additional tasks (an example of this is the fork function[1]).
Other C library functions, such as strtok, do their job without interacting with the kernel at all.
One thing to remember about system calls is that they are not like other C functions: to invoke one, the userspace program has to ask the CPU to enter kernel mode using a dedicated instruction (syscall, sysenter or int 0x80, depending on the x86 ABI in use), after which the kernel jumps to the code implementing the requested system call. The Linux kernel makes this process very easy for system programmers: all we have to do is provide a unique identifier (the system call number), which the kernel uses to look up the corresponding syscall code.
System calls in C programs
Now that we have an idea of what system calls are and how they work, let us try to find the system calls used by the following C program:
#include <stdio.h>
int main(void) {
printf("Hello World\n");
return 0;
}
In order to do that, we can use the strace(1) utility:
marco@kernelvm:~$ strace ./a.out
execve("./a.out", ["./a.out"], 0xffffdcc5c320 /* 23 vars */) = 0
brk(NULL) = 0xaaaac9e94000
faccessat(AT_FDCWD, "/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=20031, ...}) = 0
mmap(NULL, 20031, PROT_READ, MAP_PRIVATE, 3, 0) = 0xffffbd25e000
close(3) = 0
openat(AT_FDCWD, "/lib/aarch64-linux-gnu/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\3\0\0\0\0\0\0\0\0\3\0\267\0\1\0\0\0`C\2\0\0\0\0\0"..., 832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=1458480, ...}) = 0
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xffffbd25c000
mmap(NULL, 1531032, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0xffffbd0bf000
mprotect(0xffffbd21c000, 65536, PROT_NONE) = 0
mmap(0xffffbd22c000, 24576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x15d000) = 0xffffbd22c000
mmap(0xffffbd232000, 11416, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0xffffbd232000
close(3) = 0
mprotect(0xffffbd22c000, 12288, PROT_READ) = 0
mprotect(0xaaaab6de0000, 4096, PROT_READ) = 0
mprotect(0xffffbd266000, 4096, PROT_READ) = 0
munmap(0xffffbd25e000, 20031) = 0
fstat(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(0x88, 0), ...}) = 0
brk(NULL) = 0xaaaac9e94000
brk(0xaaaac9eb5000) = 0xaaaac9eb5000
write(1, "Hello World\n", 12Hello World
) = 12
exit_group(0) = ?
+++ exited with 0 +++
Even for a small program like this, a lot of system calls are being used. However,
the only one we care about is the following one:
write(1, "Hello World\n", 12) = 12
The write C function (not the syscall) takes three arguments:
- The file descriptor (an identifier for I/O operations: 0 = stdin, 1 = stdout, 2 = stderr);
- The character buffer;
- The size of the buffer.
We can therefore call it directly in place of the printf function:
#include <unistd.h>
int main(void) {
//printf("Hello World\n");
write(1, "Hello World\n", 12);
return 0;
}
The write function here serves as a wrapper for the write system call.
If we want to bypass that wrapper and pass the system call number ourselves, we can use the generic syscall function:
#include <unistd.h>
#include <sys/syscall.h>
int main(void) {
//printf("Hello World\n");
//write(1, "Hello World\n", 12);
syscall(SYS_write, 1, "Hello World\n", 12);
return 0;
}
Now you may wonder whether we can go a level deeper. Of course we can, but this time we have to work with assembly.
In glibc, the syscall function is defined as follows for x86_64[2]:
.text
ENTRY (syscall)
movq %rdi, %rax /* Syscall number -> rax. */
movq %rsi, %rdi /* shift arg1 - arg5. */
movq %rdx, %rsi
movq %rcx, %rdx
movq %r8, %r10
movq %r9, %r8
movq 8(%rsp),%r9 /* arg6 is on the stack. */
syscall /* Do the system call. */
cmpq $-4095, %rax /* Check %rax for error. */
jae SYSCALL_ERROR_LABEL /* Jump to error handler if error. */
ret /* Return to caller. */
PSEUDO_END (syscall)
At this point, there is no need to use C. Let us rewrite the program one more time in assembly.
In order to do that, we need to retrieve the system call number associated with write:
#include <stdio.h>
#include <sys/syscall.h>
int main(void) {
printf("%d\n", SYS_write);
return 0;
}
On Linux x86_64 it prints 1, so the system call identifier for write is 1.
As we have seen, the first parameter of write is the file descriptor, while the second and third are, respectively, the char buffer and the buffer length.
Syscall parameters are passed in registers (rdi, rsi, rdx and so on).
In our program we will use the following registers:
- RAX: system call number (i.e., 1);
- RDI: file descriptor (i.e., stdout = 1);
- RSI: char buffer (i.e., Hello World\n);
- RDX: buffer size (i.e., 12).
We will also need the exit system call to quit our program.
This syscall requires just one parameter (the return code), which in our case will be equal to zero.
; Intel syntax
; Compile with nasm -felf64 test.asm && ld test.o -o test
;
global _start ; entrypoint for NASM
section .text
_start:
mov rax, 0x1 ; Syscall number for write
mov rdi, 0x1 ; Stdout file descriptor
mov rsi, buf ; Store address of char buffer
mov rdx, bln ; Length of buffer
syscall ; Execute the system call
mov rax, 0x3C ; Syscall for exit(60 = 0x3C)
mov rdi, 0x0 ; Return code
syscall
section .data
buf: db "Hello World", 0xA ; Message followed by a line feed
bln: equ $ - buf ; Length of buffer (12 bytes)
Compile it with the following commands:
marco@kernelvm:~$ nasm -felf64 test.asm
marco@kernelvm:~$ ld test.o -o test
And then execute it:
marco@kernelvm:~$ ./test
Hello World
And if we ask Bash to print out the return code, we will get the expected value (0):
marco@kernelvm:~$ echo $?
0
You can easily find a complete system calls table here.
Let us see another example of a C program rewritten in assembly using Linux system calls:
#include <stdio.h>
int main(void) {
char name[256];
printf("Insert your name: ");
scanf("%255s", name); /* bound the input to the buffer size */
printf("Hello, %s. Nice to meet you.\n", name);
return 0;
}
This program asks the user for their name and then prints out a formatted message.
Below is an implementation using only Linux system calls:
; Intel syntax
; Compile with nasm -felf64 name.asm && ld name.o -o name
;
; Print buffer to stdout
%macro print 2
mov rax, 0x1 ; sys_write syscall
mov rdi, 0x1 ; stdout file descriptor
mov rsi, %1
mov rdx, %2
syscall
%endmacro
; Exit with custom return code
%macro exit 1
mov rax, 0x3C ; sys_exit syscall
mov rdi, %1 ; return code
syscall
%endmacro
section .rodata
welcome db "Insert your name: "
lenwelcome equ $-welcome
msgA db "Hello, "
lenmsgA equ $-msgA
msgB db ". Nice to meet you.", 0xA, 0xD
lenmsgB equ $-msgB
lenbuf equ 256
section .bss
buf resb lenbuf
global _start
section .text
_start:
; Print welcome message
print welcome, lenwelcome
; Read from stdin
xor rax, rax ; sys_read syscall
mov rdi, rax ; stdin file descriptor
mov rsi, buf
mov rdx, lenbuf
syscall
lea r12, [rax - 1] ; Name length without the trailing '\n'
mov byte [rsi + r12], 0 ; Remove '\n' character
; Print formatted message
print msgA, lenmsgA
print buf, r12 ; Print only the bytes actually read
print msgB, lenmsgB
exit 0
Write a new syscall
Now that we have a basic understanding of how to work with Linux system calls, let us try to add a new one to the kernel. In order to do that, we will need to recompile the whole kernel (since there is no way to dynamically load system calls at runtime), so the first thing to do is to get the kernel source code from kernel.org.
[marco@kernelvm ~]$ curl -O https://cdn.kernel.org/pub/linux/kernel/v5.x/linux-5.16.3.tar.xz
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 121M 100 121M 0 0 7730k 0 0:00:16 0:00:16 --:--:-- 9530k
You can download any modern kernel version; in this guide I'm using 5.16.3 (the latest stable release at the time of writing).
Once the download is complete, extract the archive and move to the source code directory:
[marco@kernelvm ~]$ tar xpf linux-5.16.3.tar.xz
[marco@kernelvm ~]$ cd linux-5.16.3/
If you have previously built a kernel from scratch, you know how many hours it takes to properly configure all your USB devices, GPU, audio card, Wi-Fi adapters and so on.
Since this is outside the scope of this guide, we will just copy the existing kernel configuration from /proc/config.gz:
[marco@kerneldev linux-5.16.3]$ zcat /proc/config.gz > .config
The .config file lists the options selected for the build, including which drivers are built as loadable modules and which are compiled into the kernel image.
The only thing we need to change is the kernel name. You can choose any name you like; in my case I will append the suffix -syscall:
[marco@kerneldev linux-5.16.3]$ sed -i 's/CONFIG_LOCALVERSION=""/CONFIG_LOCALVERSION="-syscall"/g' .config
System calls can be placed nearly anywhere in the kernel source tree. For instance, the source code of the read and write syscalls is stored in fs/read_write.c, while many others are stored in kernel/sys.c. I will put the new system call in kernel/sys.c, but keep in mind that any other file is also fine.
Now edit kernel/sys.c and add the following:
/**
 * hicall - Greet the user
 */
SYSCALL_DEFINE1(hicall, const char __user *, data) {
    char name[256];
    /* Returns the string length, sizeof(name) if truncated, or -EFAULT */
    long err = strncpy_from_user(name, data, sizeof(name));
    if (err < 0) {
        printk(KERN_WARNING "hicall: cannot read data from userspace\n");
        return -EFAULT;
    }
    name[sizeof(name) - 1] = '\0'; /* No NUL is copied on truncation */
    printk(KERN_INFO "hicall: Hi %s, nice to meet you.\n", name);
    return 0;
}
Let us understand what is going on here. To define a system call with n arguments, the Linux kernel provides a family of macros called SYSCALL_DEFINEn, where n is the number of arguments we want to pass to our system call.
Our system call takes a single argument, a char pointer, so we use the SYSCALL_DEFINE1 macro.
The first parameter of the macro (in this case hicall) is the name of the system call; after it, we specify the type and the name of each parameter.
The code is pretty straightforward: we first copy the name of the user out of the userspace pointer with the strncpy_from_user function (never dereference a userspace pointer directly from kernel code), we check for errors, and finally we print out a message. That's all.
Last but not least, we need to register our system call. To do that, edit the arch/x86/entry/syscalls/syscall_64.tbl file and add the following line at the end of the common group:
450 common hicall sys_hicall
Be sure to pick a number that does not collide with an existing system call.
As of version 5.16.3 the last entry of the common group is number 449, so 450 is free; if you are using a newer version of the kernel, you may need to adjust that number.
The first column is the system call number, which serves as its unique identifier; the second column is the ABI (common means that the entry is shared by the 64-bit and x32 ABIs);
the third one is the name of the system call and the fourth one is the name of the function implementing it.
Compiling the Kernel
We are now ready to compile the kernel. To do that, I will follow the traditional compilation process described in the Arch Wiki. I will only cover the essential steps here; if you want to know more about this process, please refer to the Arch Wiki article. To start compiling the kernel, go to the root of the kernel source tree and type the following:
[marco@kerneldev linux-5.16.3]$ make -j4
The build process may ask you whether to enable certain new options that are not declared in your .config file; you can accept the defaults by simply pressing the Enter key.
You also may need to install the following utilities:
[marco@kerneldev linux-5.16.3]$ sudo pacman -S xmlto kmod inetutils bc libelf git cpio perl tar xz base-devel
Once the kernel is compiled, run the following command to build the modules:
[marco@kerneldev linux-5.16.3]$ make modules -j4
Then, install the modules by typing:
[marco@kerneldev linux-5.16.3]$ sudo make modules_install
This will install the modules into /lib/modules/5.16.3-syscall.
Now copy the kernel image into the boot directory:
[marco@kerneldev linux-5.16.3]$ sudo cp -v arch/x86/boot/bzImage /boot/vmlinuz-linux-syscall
'arch/x86/boot/bzImage' -> '/boot/vmlinuz-linux-syscall'
Be sure to change the kernel name according to your needs. Finally, we need to generate the initial RAM disk. To do that on Arch Linux we need a mkinitcpio preset, but we can avoid writing it by hand by copying an already existing one:
[marco@kerneldev linux-5.16.3]$ sudo cp /etc/mkinitcpio.d/linux-lts.preset /etc/mkinitcpio.d/linux-syscall.preset
Modify the preset file like this:
# mkinitcpio preset file for the 'linux-lts' package
ALL_config="/etc/mkinitcpio.conf"
ALL_kver="/boot/vmlinuz-linux-syscall"
PRESETS=('default' 'fallback')
#default_config="/etc/mkinitcpio.conf"
default_image="/boot/initramfs-linux-syscall.img"
#default_options=""
#fallback_config="/etc/mkinitcpio.conf"
fallback_image="/boot/initramfs-linux-syscall-fallback.img"
fallback_options="-S autodetect"
Generate the initcpio image using:
[marco@kerneldev linux-5.16.3]$ sudo mkinitcpio -p linux-syscall
==> Building image from preset: /etc/mkinitcpio.d/linux-syscall.preset: 'default'
-> -k /boot/vmlinuz-linux-syscall -c /etc/mkinitcpio.conf -g /boot/initramfs-linux-syscall.img
==> Starting build: 5.16.3-syscall
-> Running build hook: [base]
-> Running build hook: [udev]
-> Running build hook: [autodetect]
-> Running build hook: [modconf]
-> Running build hook: [block]
==> WARNING: Possibly missing firmware for module: xhci_pci
-> Running build hook: [filesystems]
-> Running build hook: [keyboard]
-> Running build hook: [fsck]
==> Generating module dependencies
==> Creating zstd-compressed initcpio image: /boot/initramfs-linux-syscall.img
==> Image generation successful
==> Building image from preset: /etc/mkinitcpio.d/linux-syscall.preset: 'fallback'
-> -k /boot/vmlinuz-linux-syscall -c /etc/mkinitcpio.conf -g /boot/initramfs-linux-syscall-fallback.img -S autodetect
==> Starting build: 5.16.3-syscall
-> Running build hook: [base]
-> Running build hook: [udev]
-> Running build hook: [modconf]
-> Running build hook: [block]
==> WARNING: Possibly missing firmware for module: qed
==> WARNING: Possibly missing firmware for module: bfa
==> WARNING: Possibly missing firmware for module: wd719x
==> WARNING: Possibly missing firmware for module: qla2xxx
==> WARNING: Possibly missing firmware for module: aic94xx
==> WARNING: Possibly missing firmware for module: qla1280
==> WARNING: Possibly missing firmware for module: xhci_pci
-> Running build hook: [filesystems]
-> Running build hook: [keyboard]
-> Running build hook: [fsck]
==> Generating module dependencies
==> Creating zstd-compressed initcpio image: /boot/initramfs-linux-syscall-fallback.img
==> Image generation successful
Finally, update grub:
[marco@kerneldev linux-5.16.3]$ sudo grub-mkconfig -o /boot/grub/grub.cfg
Generating grub configuration file ...
Found linux image: /boot/vmlinuz-linux-syscall
Found initrd image: /boot/initramfs-linux-syscall.img
Found fallback initrd image(s) in /boot: initramfs-linux-syscall-fallback.img
Found linux image: /boot/vmlinuz-linux-lts
Found initrd image: /boot/initramfs-linux-lts.img
Found fallback initrd image(s) in /boot: initramfs-linux-lts-fallback.img
done
Now reboot the system and choose the new kernel at the GRUB prompt:
[marco@kerneldev ~]$ uname -a
Linux kerneldev 5.16.3-syscall #3 SMP PREEMPT Fri Jan 28 22:22:45 UTC 2022 x86_64 GNU/Linux
Testing the system call
Let us now write a simple userspace client that makes use of our new system call. Create a new file called test_hicall.c and write the following inside:
#include <unistd.h>
#include <stdio.h>
#include <sys/syscall.h>
#define SYS_hicall 450
int main(void) {
char name[256];
printf("Insert your name: ");
scanf("%255s", name); /* bound the input to the buffer size */
syscall(SYS_hicall, name);
puts("Done! Please check dmesg");
return 0;
}
Compile it with gcc test_hicall.c -o test_hicall and execute it:
[marco@kerneldev syscall]$ ./test_hicall
Insert your name: Marco
Done! Please check dmesg
Once the program exits, check dmesg for the result:
[marco@kerneldev syscall]$ sudo dmesg -WH
[Jan29 15:07] hicall: Hi Marco, nice to meet you.
Our system call works! Here's an equivalent version of the program above written in plain x86_64 assembly:
; Intel syntax
; Compile with nasm -felf64 test_hicall.asm && ld test_hicall.o -o test_hicall
;
; Print buffer to stdout
%macro print 2
mov rax, 0x1
mov rdi, 0x1
mov rsi, %1
mov rdx, %2
syscall
%endmacro
%macro exit 1
mov rax, 0x3C
mov rdi, %1
syscall
%endmacro
section .rodata
welcome db "Insert your name: "
lenwelcome equ $-welcome
done db "Done! Please check dmesg", 0xA, 0xD
lendone equ $-done
lenbuf equ 256
section .bss
buf resb lenbuf
global _start
section .text
_start:
print welcome, lenwelcome
; Read user name
xor rax, rax
mov rdi, rax
mov rsi, buf
mov rdx, lenbuf
syscall
; Remove trailing new line character
mov byte [rsi + rax - 1], 0
; Call 'hicall' system call
mov rax, 0x1C2 ; 450 = 1C2h
mov rdi, buf ; User name
syscall
print done, lendone
exit 0
References
[1]: fork's source code
[2]: syscall's function source code