Syscalls

Learning Objective

Understand the story of a system call

What is a syscall?

  1. A userspace request for kernel action

  2. The mechanism to escalate privileges

  3. The terms "syscall" and "system call" are used interchangeably

demo

this arm64 program will KILL you

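    .text
    .globl _start
_start:
    // Writing VBAR_EL1 is privileged. From userspace (EL0) this traps
    // and the kernel kills the process (typically with SIGILL).
    msr VBAR_EL1, x0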

Overview

  1. Execution contexts

  2. Define kernelspace and userspace

  3. Kernel representation of a process or thread

  4. What do we want out of a system call?

  5. The five steps of a system call

Execution context

An execution context is a CPU register state

Execution context example 1

The {set,long}jmp(3) library functions

  1. setjmp: save current state

  2. longjmp: restore saved register state

demo

example use of {set,long}jmp

Execution context example 2

Threads in a program

  1. Threads share: address space (heap, code, static data)

  2. Threads have their own: stack (and stack pointer), registers, instruction pointer

Kernelspace and userspace

Kernelspace and userspace are distinct execution contexts

  • Primary difference: privilege level

Kernelspace and userspace

Normal programs run in userspace

Kernelspace is the privileged execution context

  1. Registers, stack, memory are familiar

  2. Key difference: CPU capabilities are unrestricted

Kernelspace simplification

Kernelspace execution context can be further subdivided

  1. Kernel code may be running on behalf of a particular userspace process

  2. Other kernel code runs on its own behalf

Definition of a context switch

  1. A capture of the CPU register state

  2. A load of a saved CPU register state

Context switching and the kernel

Switching to kernel context is like other context switches

  1. Main difference: privileges escalated

  2. Switching back to userspace is similar but privileges are dropped

Note on terminology

The term "context switch" is sometimes used to refer to task-switching

  1. Each task-switch involves several context switches

Definition of re-entrancy

"A computer program or subroutine is called reentrant if multiple invocations can safely run concurrently on multiple processors" (source)

  1. Important concept for kernel code

Introducing the struct task_struct

This is Linux's Process Control Block (PCB)

  1. Each kernel pid has a unique struct task_struct

demo

A quick look at include/linux/sched.h

Introducing the current macro

Refers to the struct task_struct of the process running in the current execution context

demo

A quick look at two files:

  1. arch/arm64/include/asm/current.h

  2. arch/x86/include/asm/current.h

The meaning of pid

kernelspace name    userspace name
pid                 tid
tgid                pid

demo

see get{p,t}id(2) in kernel/sys.c

Simplified getpid(2) call stack

getpid(2) calls functions

... namespaces are taken into account

... locking is done

task_pid_nr() { ... tsk->pid ... }

The pid/tgid distinction

Why do we have different names?

History of tgid: the dilemma

Before Linux 2.6, there were only pids

  1. The clone(2) call could share address space between processes

  2. This allowed thread-like behavior

  3. These processes were too independent

    1. Example: no shared signals

History of tgid: the proposal

NPTL implements threads as specified by POSIX

  1. This required both userspace and kernelspace changes

History of tgid: userspace changes

  1. The C library was hardened for concurrency

  2. The C library introduced the tid concept

  3. The tid subdivides a pid

History of tgid: kernelspace changes

  1. The kernel introduced the tgid concept

  2. The tgid groups kernel pids together

History of tgid: present day

Each pid corresponds to a unique struct task_struct

  1. The tgid and pid values are stored here

Syscalls: are they necessary?

Can a program do anything useful without making any syscalls?

Syscalls: highly necessary

All useful programs depend on system calls

demo

Let's trace a program's syscall usage with strace

demo

Syscall-free prime-number detector program

Desirable properties of syscalls

  1. speed

  2. security

  3. stability

  4. re-entrancy

Security concerns

  1. confused deputy problem

  2. One example: validate address range of any pointer arguments

Stability

Linux provides a stable syscall API

  1. This sets Linux apart from other OSes

Syscall implementation

A syscall can be broken down into 5 distinct steps

  1. Userspace invocation

  2. Hardware-assisted privilege escalation

  3. Kernel code handler

  4. Hardware-assisted privilege drop

  5. Userspace program continues

Each step boundary is a handoff of responsibility between software and hardware

Userspace invocation

All programs make system calls

  1. excluding trivial example programs

Example: a shell as an abstraction over many syscalls

demo

see /proc/PID/syscall

  1. Multi-arch syscall number table

Userspace invocation: library wrappers

The C library provides wrapper functions for many syscalls

  1. Main benefit: speed

  2. We want to minimize the high-overhead syscalls

  3. Checks like input validation avoid syscalls if possible

  4. Avoid architecture specific details

Common notation: manual page section numbers

Example: write(2) vs write(3)

  1. Number in parentheses refers to manual page section number

  2. Section 2 has system calls and section 3 has library calls

See man man for more information

demo

ltrace: like strace for library calls

Userspace invocation: architecture-specific

Common across architectures:

  1. specify the syscall and arguments

  2. give up control to the hardware

Userspace invocation: arm64

  1. Specify syscall number in x8

  2. Specify arguments 1-6 in x0, x1, x2, x3, x4, x5

  3. Return value will land in x0

  4. svc #0 gives up control to hardware

Userspace invocation: x86_64

  1. Specify syscall number in rax

  2. Args 1-6 in: rdi, rsi, rdx, r10, r8, r9

  3. syscall gives up control to hardware

  4. Return value will land in rax

Userspace invocation: x86_64 fine print

Difference from normal function calling convention

  1. The syscall instruction clobbers rcx (it receives the return address) and r11 (it receives the saved rflags)

  2. Use r10 instead of rcx

Userspace invocation: wrap up

With arguments chosen and syscall selected

  1. Give up control to the hardware

Hardware-assisted privilege escalation

This step is handled by hardware

  1. How does hardware know what to do?

Hardware-assisted privilege escalation

Rewind to boot

  1. Linux installs its syscall entry point into the CPU

Hardware-assisted privilege escalation: arm64

See __primary_switched()

  1. Set VBAR_EL1 to address of vector table

  2. Vector table defined in entry.S

Hardware-assisted privilege escalation: x86_64

See syscall_init()

  1. Set MSR_LSTAR to entry_SYSCALL_64 address

  2. LSTAR: Long mode SYSCALL Target Address Register

Hardware-assisted privilege escalation: at invocation

Back to the present

  1. The CPU is preconfigured to correctly transfer control

  2. This makes privilege escalation safe

Hardware-assisted privilege escalation

  1. On arm64: raise the exception level (EL0 to EL1)

  2. On x86_64: change to ring 0

  3. Both are recorded in CPU state: CurrentEL (part of PSTATE) on arm64, the low two bits of cs on x86_64

Kernel handles request

  1. Part architecture-specific

  2. Part architecture-generic

Kernel handles request: starting point

Execution resumes from a hardware-specified register state

  1. At bottom, mostly assembly and C macros

  2. Higher on call stack is more generic code

Kernel handles request: arm64

Start in VBAR_EL1

  1. Hardware jumps to particular offset in vectors

Kernel handles request: arm64 reaches C

A function defined by macro in entry.S calls into C code

  1. The first C function is el0t_64_sync_handler()

Kernel handles request: arm64 goes generic

Execution reaches el0_svc_common()

  1. invoke_syscall() indexes into a jump table of handlers

  2. This architecture-generic handler is defined by a SYSCALL_DEFINE* macro

Kernel handles request: x86_64 entry

Start at entry_SYSCALL_64

  1. Assembly calls into the do_syscall_64() C function

Kernel handles request: x86_64 goes generic

Using a few helper functions, index into jump table of system call handlers

  1. This architecture-generic handler is defined by a SYSCALL_DEFINE* macro

Further reading on x86_64 syscall implementation details

  1. We have another article about this available

Kernel handles request: architecture-generic

A closer look at the SYSCALL_DEFINE*() handlers

Kernel handles request: SYSCALL_DEFINE*

Defined in include/linux/syscalls.h

  1. Resolve to __SYSCALL_DEFINEx(x,...

  2. Five functions generated

  3. See __do_sys##name(...

A note on syscall arguments

No SYSCALL_DEFINE7 and above

demo

Take a look at the SYSCALL_DEFINE definition in include/linux/syscalls.h

Kernel handles request: return imminent

Indicate error using the errno macros

Return to assembly for another context switch

Kernel handles request: arm64 returns

el0t_64_sync() calls ret_to_user()

  1. ret_to_user() calls kernel_exit 0

Restore registers, including the stack pointer

Kernel handles request: arm64 returns

eret gives up control to the hardware once again

Kernel handles request: x86_64 returns

entry_SYSCALL_64() prepares to return

  1. Place the userspace return address in rcx

Kernel handles request: x86_64 returns

Prefer sysret over the slower iret

  1. Some conditions preclude usage of sysret

  2. Via either instruction we give up control to the hardware once again

Hardware-assisted privilege drop

Less dangerous operation than escalation

  1. Restore old register and stack

  2. Drop privileges

  3. Set the instruction pointer (rip on x86_64, pc on arm64) to the userspace return address

Hardware-assisted privilege drop: arm64

The svc #0 instruction saves a return address in hardware (in ELR_EL1)

The eret instruction sets the program counter to this value

Hardware-assisted privilege drop: x86_64

iret loads the return address from the stack

sysret returns to the address in rcx

Hardware-assisted privilege drop: completed

Software takes control of execution

Userspace program continues

Always check for an error

Userspace program continues: errno

Kernel functions return -errno

C library wrappers check for error

  1. Store original error in errno

  2. Convert return code to -1

  3. Example: musl syscall return

Demo:

The errno utility from moreutils package

errno further reading

See man 3 errno

Userspace program continues

The system call is complete

The story of a system call: A summary

Linux provides a stable system call API

The story of a system call: A summary

  1. Most programs run in user execution context ("userspace")

  2. Kernel code runs in several execution contexts (all "kernelspace")

The story of a system call: A summary

Hardware plays two key roles in system calls

  1. Raising privileges and entering kernel execution context

  2. Dropping privileges and entering user execution context

The story of a system call: A summary

Many syscall implementation details are architecture-specific

The story of a system call: A summary

The kernel defines the main syscall handler using a SYSCALL_DEFINE* macro

  1. These macros are used to define system call implementations

The story of a system call: A summary

The C library defines wrapper functions for many syscalls

  1. These hide architecture-specific details

  2. Provide POSIX-compatible behavior by hiding Linux eccentricities

The story of a system call: A summary

Always check for an error after making a syscall

End