The System V semaphores provided by Unix and later by Linux, are quite heavy from a performance point of view.
With the futex coming from the researches of H. Franke, R. Russel et M. Kirkwood [1], a more efficient alternative is available.
According to the manuals, Linux provides this mechanism since version 2.5.7 with some evolutions in the subsequent releases.
This article aims at presenting the implementation details of the futex in the Linux kernel and how to use them as well
as how the open sources libraries benefit from them.
A french version of this article has been published in
Gnu Linux Magazine France
numbers 173, 175, 176 and 178. The current translation is an updated version.
The article contains source code examples that can be downloaded from this
WEB site. Then, proceed
as follow to uncompress the archive file and build the examples (you need to install
cmake if not done yet):
$ mkdir examples $ cp the_futex.tgz examples $ cd examples $ tar xvfz the_futex.tgz [...] $ cmake . $ make
The executables can be run from the current directory.
The system V semaphores are a bottleneck on heavy loaded servers, multi-threaded processes,
real-time oriented applications... The futex have come to solve this.
Originally, the futex (a.k.a. "Fast User Space mutex") were introduced to implement efficient
mutex. But they are now used in several other synchronization mechanisms (sémaphores,
conditional variables...). The principle consists in reducing the number of system calls
involved in the synchronization operations. Promoted by the GLIBC designers, they are now
the cornerstone of the synchronization functions of the POSIX threads and more generally
Linux processes.
At first sight, using the futex is not straightforward. Although online manuals are available
(i.e. man 7 futex
and man 2 futex),
the usage is still tricky as the documentation is quite terse and sometimes out of date.
Moreover, the manual says "bare futexes are not intended as an easy-to-use abstraction
for end users. Implementors are expected to be assembly literate and to have read the
sources of the futex user-space library" (sic).
This article will demistify the futex.
The semget() system call returns a semaphore identifier. The operations
on the semaphore are realized thanks to the semop() system call.
The destruction is done with semctl() system call.
The following program sets up two threads. One thread (the writer) updates the global variable
data with the counter of seconds while the other (the reader)
displays the value of the counter upon user requests. Both execution flows run concurrently to access the
data variable to either read or write into it. The mutual exclusion
is solved thanks to the semid semaphore which serializes the accesses to
the global variable in the critical sections (highlighted in red color in the following source code).
Hence, the reader thread always get a coherent value of the counter of seconds.
ipc.c |
#include <unistd.h> #include <pthread.h> #include <stdio.h> #include <string.h> #include <sys/types.h> #include <sys/ipc.h> #include <sys/sem.h> #define NB_SECONDS 20 static char *str_second[NB_SECONDS] = { "zero", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fiveteen", "sixteen", "seventeen", "eighteen", "nineteen" }; char data[100]; static int semid; struct sembuf sembuf_for_P = {0, -1, SEM_UNDO}; struct sembuf sembuf_for_V = {0, +1, SEM_UNDO}; #define P(id) semop(id, &sembuf_for_P, 1) #define V(id) semop(id, &sembuf_for_V, 1) static void *thd_main(void *p) { unsigned int i = 0; (void)p; while(1) { // Update the counter in ASCII P(semid); strcpy(data, str_second[i]); V(semid); // Wait one second sleep(1); // Incrementation of the counter in the range [0, NB_SECONDS[ i = (i + 1) % NB_SECONDS; } return NULL; } // thd_main int main(void) { pthread_t tid; unsigned short val = 1; // Creation of the semaphore semid = semget(IPC_PRIVATE, 1, IPC_CREAT|0777); // Initialization of the semaphore with the value 1 semctl(semid, 0, SETALL, &val); // Creation of the thread (void)pthread_create(&tid, NULL, thd_main, NULL); // Interaction with the operator while(fgetc(stdin) != EOF) { // Display the ASCII value of the counter P(semid); printf("Current counter value: %s\n", data); V(semid); } return 0; } // main |
Once built, the example is run as follow. Each time the user types <RETURN>, the program displays the string value of the counter:
$ ./ipc Current counter value: two Current counter value: three Current counter value: four <CTRL>+<D> $
The strace utility (with the -f option to follow the child processes) points out the calls to the Linux system during the execution. Each seconds, upon return from clock_nanosleep(), the semtimedop() service is called twice to decrement (P() operation) and increment (V() operation) the semaphore by the thread#4685 to update the counter.
[...] [pid 4685] semtimedop(4, [{0, 1, SEM_UNDO}], 1, NULL) = 0 [pid 4685] clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=1, tv_nsec=0}, 0x7fb0bfdcbe80) = 0 [pid 4685] semtimedop(4, [{0, -1, SEM_UNDO}], 1, NULL) = 0 [pid 4685] semtimedop(4, [{0, 1, SEM_UNDO}], 1, NULL) = 0 [pid 4685] clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=1, tv_nsec=0}, 0x7fb0bfdcbe80) = 0 [pid 4685] semtimedop(4, [{0, -1, SEM_UNDO}], 1, NULL) = 0 [pid 4685] semtimedop(4, [{0, 1, SEM_UNDO}], 1, NULL) = 0 [pid 4685] clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=1, tv_nsec=0}, 0x7fb0bfdcbe80) = 0 [pid 4685] semtimedop(4, [{0, -1, SEM_UNDO}], 1, NULL) = 0 [pid 4685] semtimedop(4, [{0, 1, SEM_UNDO}], 1, NULL) = 0 [...]
Each time the user types <RETURN>, the thread#4684 does the same calls to read and display the current value of the counter.
[...] [pid 4684] semtimedop(4, [{0, -1, SEM_UNDO}], 1, NULL) = 0 [pid 4684] write(1, "Current counter value: five\n", 28Current counter value: five) = 28 [pid 4684] semtimedop(4, [{0, 1, SEM_UNDO}], 1, NULL) = 0 [...]
The preceding example and especially the strace output shows that both threads invoke the semop() system call eventhough the semaphore is not contented. For example, when the writer thread updates the counter, most of the time, the reader thread is waiting for the user input and is not accessing the counter. Those continuous calls to the kernel to ensure the mutual exclusion consume lots of CPU. In real-time systems or bigger applications like database managers, the costs is huge. This behaviour may become prohibitive!
To reduce the cost of system calls, some solutions like vDSO have been implemented to reduce the calling overhead. But this is limited to a handful of services. Anyway, the best way to save CPU time is to avoid calling the system when it is not necessary. [1] proposes a solution with the futex introduced in Linux 2.5.7 and enhanced in the following versions of the kernel.
For a shared resource (data variable in our example), we differenciate the situations where there is no contention (i.e. no task is accessing it) from the situations where there is contention (i.e. a task is accessing it). In the first case, the operations are done in user space. The need for kernel services is required only for the second case. This is the main difference between system V semaphores and futex from performance point of view. In non-contended situations, the futex mechanism saves the cost of a context switch between user space and kernel space to take the control on the mutual exclusion lock whereas the system V IPC systematically invokes a system call in both situations.
Actually, a futex is an integer variable shared by processes and threads. As a consequence, this can be defined as a global variable if it is shared by the threads of a single process or can be located in a shared memory segment if it is used by threads from multiple processes. The shared variable contains a state modified in user space with atomic operations [3]. When the state specifies that there is no contention, no system call is required. On the other hand, if the state specifies a contention, a futex system call is done to make the calling task sleep. This basic mechanism is used to implement several elaborated services like mutex and rwlocks [13]. For instance, the POSIX thread synchronization services provided by the GLIBC/NPTL (i.e. Native Posix Thread Library [18]) use the futex: pthread_mutex_lock(), pthread_cond_signal(), pthread_rwlock_rdlock()...
As they are not always available in the high level languages like C or C++, the atomic operations need some assembly code. Hence, this could trigger portability problems from one architecture to another. That is one of the reasons why the manuals specify that the futex are preferably dedicated to advanced users involved in the design of service libraries like the GLIBC. They hide the architecture specific details with standardized high level services. But some compilers like GCC, provide built-ins [2] for widely used atomic operations. So, as long as we keep the same compiler, the operations are portable on all the platforms supported by the compiler.
Figure 1 depicts the behaviour of the preceding ipc program based on a futex instead of a system V semaphore. The atomic operations are highlighted in red color. The sequence describes what is happening when the user types <RETURN>. The fgetc() function returns in the reader thread which calls P(). The latter is replaced by an atomic operation to set the futex variable to 1 if it is equal to 0. Meanwhile, the writer thread returns from sleep() function and calls the same P() operation. But as the futex variable is set to 1, it calls futex() system calls to wait into the kernel. The reader thread reads and displays the data and calls V() which consists into an atomic operation to set the futex variable to 0 and call futex() service to wake up the waiting threads. This triggers the wakeup of the writer thread which atomically checks if the futex variable is 0 and sets it to 1. It updates the data and calls the same V() operation to set atomically the futex variable to 0 and call the futex() service to wake up any waiting threads. At that moment, there are no waiting threads. Then, it calls sleep() function one more time. And so on...
In this sequence, the P() operation does not invoke a system call (i.e. futex()) when there is no contention. This is the big difference with the system V semaphore for which the P() operation always triggers a semop() system call. The V() operation always triggers a system call with both the futex and the system V semaphore. Below, we will see that it is possible to be more efficient in the V() operation by calling futex() only when there is a waiting thread.
Before going ahead, let's compare the performances when using a semaphore in system V mode and in futex mode (through the
POSIX services of the GLIBC). To make it, we use an open source called
systress which can be configured to set up concurrent
running threads incrementing by 1 a global shared counter in a loop. The incrementation is done in a critical section
surrounded by P() and V() operations. The
semaphore is used with a binary value to behave as a mutex. In other words, only one thread at a time enters the critical
section. The threads stop as soon as the counter reaches a configured ceiling value fixed to 100000000 here.
The results on a desktop PC (Distribution: Ubuntu 20.04, Linux: 5.4 PREEMPT_VOLUNTARY, gcc: 9.3.0, glibc: 2.31, Intel Core i7-3770K CPU @ 3.50GHz, 8 cores) are:
Number of threads (ceiling value = 100000000) |
POSIX mutex (Duration) |
System V mutex (Duration) |
---|---|---|
1 | 1''85 | 1'19''15 |
2 | 13''58 | 3'25''52 |
3 | 14''06 | 5'18''89 |
4 | 16''84 | 6'44''46 |
5 | 17''33 | 7'16''64 |
The results on a Raspberry Pi version 3 (model B+) (Distribution: Raspbian 10, Linux: 4.19 PREEMPT_VOLUNTARY, gcc: 4.9.3, glibc: 2.28, Cortex-A53 (ARMv8) 1.4GHz, 4 cores) are:
Number of threads (ceiling value = 100000000) |
POSIX semaphore (Duration) |
System V semaphore (Duration) |
---|---|---|
1 | 8''43 | 2'17''92 |
2 | 21''34 | 16'19''59 |
3 | 33''30 | 16'31''80 |
4 | 26''92 | 15'16''14 |
5 | 27''18 | 17'54''66 |
The preceding tables of results show big differences between binary semaphores in system V and futex modes from a performance point of view. Contended (more than one running thread) or not (one running thread), the futex based semaphore is tremendously more performant. That is why it is worth using futex based synchronization mechanisms!
The section 7 of the online manual provides a general description while the section 2 describes the system call itself.
#include <linux/futex.h> #include <sys/time.h> int futex(int *uaddr, int futex_op, int val, const struct timespec *timeout, /* or: uint32_t val2 */ int *uaddr2, int val3);
The op parameter determines which operation to do.
futex.c |
#include <stdio.h> #include <linux/futex.h> int main(void) { int rc; rc = futex((int *)0, 0, 0, (const struct timespec *)0, (int *)0, 0); printf("rc=%d\n", rc); return 0; } // main |
The compilation ends with an error as the futex() service does not exist in the GLIBC:
$ gcc futex.c futex.c: In function 'main': futex.c:8:8: warning: implicit declaration of function 'futex' [-Wimplicit-function-declaration] 8 | rc = futex((int *)0, 0, 0, (const struct timespec *)0, (int *)0, 0); | ^~~~~ /usr/bin/ld: /tmp/ccIJA66y.o: in function `main': futex.c:(.text+0x32): undefined reference to `futex' collect2: error: ld returned 1 exit status
But the system call exists as it is defined in the <sys/syscall.h> header file:
#ifdef __NR_futex # define SYS_futex __NR_futex #endif
So, the call to futex() is done through the syscall() service:
futex.c |
#include <stdio.h> #include <unistd.h> #include <sys/syscall.h> #define futex(a, b, c, d, e, f) syscall(SYS_futex, a, b, c, d, e, f) int main(void) { int rc; rc = futex((int *)0, 0, 0, (const struct timespec *)0, (int *)0, 0); printf("rc=%d\n", rc); return 0; } // main |
The uaddr parameter is the address of the futex variable. This is an integer. The op parameter is the code of the operation to be run by the system call. They are defined in <linux/futex.h>:
#define FUTEX_WAIT 0 #define FUTEX_WAKE 1 #define FUTEX_FD 2 #define FUTEX_REQUEUE 3 #define FUTEX_CMP_REQUEUE 4 #define FUTEX_WAKE_OP 5 #define FUTEX_LOCK_PI 6 #define FUTEX_UNLOCK_PI 7 #define FUTEX_TRYLOCK_PI 8 #define FUTEX_WAIT_BITSET 9 #define FUTEX_WAKE_BITSET 10 #define FUTEX_WAIT_REQUEUE_PI 11 #define FUTEX_CMP_REQUEUE_PI 12
For examples:
If the futex is to be shared between threads belonging to the same process, the futex variable
can be a simple global integer variable or an integer stored in a dynamically allocated memory
location.
If the futex is used by threads belonging to several processes, the futex variable must be
located in some shared memory location allocated through system V IPC (i.e.
shmget(), shmat())
or a POSIX service (i.e. shm_open(),
mmap()).
Destroying a futex requires first of all that no thread is waiting on it. This deallocates the corresponding data structures in the kernel. On user space side, the destruction is nothing else then getting rid of the futex variable. If the latter has been dynamically allocated, the user must call the corresponding deallocation services (e.g. free(), shmdt(), munmap()...). If it is a global static variable, it will be kept unused up to the end of the program.
A mutex is a lock with two states: locked or unlocked. It is typically used to permit the exclusive access to
a resource shared by multiple concurrent executing threads. Those code segments are called
"critical sections". Two operations are generally defined:
LOCK() and UNLOCK() to
respectively lock and unlock the critical section. The ipc program
previously presented contains two critical sections. Those are the code portions surrounded by
P() and V() operations
(preferred terminology when we speak about semaphores [19]).
Actually, mutex are a particular case of the semaphores. The latter decrement
(P() operation) or increment
(V() operation) a counter initialized to the number of threads
allowed to enter a critical section. For a mutex, the counter, is initialized to 1. This is a binary
semaphore.
The following mutex program is the modified version of the
ipc program above. It replaces the system V semaphore by a futex.
mutex.c |
#include <unistd.h> #include <pthread.h> #include <stdio.h> #include <string.h> #include <sys/types.h> #include <sys/syscall.h> #include <linux/futex.h> #define NB_SECONDS 20 static char *str_second[NB_SECONDS] = { "zero", "un", "deux", "trois", "quatre", "cinq", "six", "sept", "huit", "neuf", "dix", "onze", "douze", "treize", "quatorze", "quinze", "seize", "dix sept", "dix huit", "dix neuf" }; char data[100]; int futex_var; #define futex_wait(addr, val) \ syscall(SYS_futex, addr, FUTEX_WAIT, val, NULL) #define futex_wakeup(addr, nb) \ syscall(SYS_futex, addr, FUTEX_WAKE, nb) static void LOCK(int *f) { int old; while (1) { old = __sync_val_compare_and_swap(f, 0, 1); if (0 == old) { // The futex variable was free as it was equal to 0 // Now, it is set to 1 return; } else { // The futex variable is equal to 1 // ==> It is not modified futex_wait(f, 1); } } } // LOCK static void UNLOCK(int *f) { (void)__sync_fetch_and_sub(f, 1); futex_wakeup(f, 1); } // UNLOCK static void *thd_main(void *p) { unsigned int i = 0; (void)p; while(1) { // Update of the counter LOCK(&futex_var); strcpy(data, str_second[i]); UNLOCK(&futex_var); // Sleep 1 second sleep(1); // Incrementation of the counter in the range [0, NB_SECONDS[ i = (i + 1) % NB_SECONDS; } // End while return NULL; } // thd_main int main(void) { pthread_t tid; // Creation of the thread (void)pthread_create(&tid, NULL, thd_main, NULL); // Interaction with the operator while(fgetc(stdin) != EOF) { // Display the current value of the counter LOCK(&futex_var); printf("Current counter value: %s\n", data); UNLOCK(&futex_var); } // End while return 0; } // main |
The futex relies on the futex_var global variable. The
FUTEX_WAIT and
FUTEX_WAKE operations are provided as macros (respectively named
futex_wait() and
futex_wakeup()).
The LOCK() service is interesting. It first atomically checks
futex_var to set it to 1 (second parameter of
__sync_val_compare_and_swap()) only if it is currently equal to 0
(first parameter of
__sync_val_compare_and_swap()). This is a "compare and swap
operation" (i.e. CAS
[4]) provided as a built-in by GCC compiler. If it is not currently equal to 0,
the variable is left unchanged and
the calling thread is put to sleep with the FUTEX_WAIT operation.
The latter also atomically checks the
value of the variable to verify that it is still equal to 1 (fourth parameter of
futex()) before
putting the calling thread to sleep. The purpose of the comparison with the expected value is to prevent
lost wake-ups. If another thread changed the value of the futex variable after the calling thread decided
to block based on the prior value, and if the other thread executed a
FUTEX_WAKE operation after the value change and before this
FUTEX_WAIT operation, then the calling thread will observe the
value change and will not go to sleep. This does not prevent the famous ABA problem
[21] where a thread could change the value of the futex and put it back to its
original value before the thread makes the check but at least, the thread will not sleep for ever!
The UNLOCK() service, much more simple, atomically decrements
futex_var thanks the
__sync_fetch_and_sub()
built-in of GCC and calls the FUTEX_WAKE operation along with the
value 1 to wakeup at most one waiting thread. No guarantee is provided about which waiters are awoken
(e.g. a waiter with a higher scheduling priority is not guaranteed to be awoken in preference to a
waiter with a lower priority).
As we can see, the futex mechanism needs user space atomic operations helpers. Those helpers are not
available in high level languages like C or C++. The assembly language is required with the inherent
portability problems from one platform to another. Hopefully, compilers like GCC provide built-ins to
simplify the programmer's life.
On Intel platform, the CAS atomic operation is done with a cmpxchg
assembly instruction [5] prefixed by
lock for SMP system to guarantee exclusive access to the memory
location. Here is the disassembly code of the LOCK() service:
12d0: 55 push %rbp
12d1: 31 ed xor %ebp,%ebp
12d3: 53 push %rbx
12d4: 48 83 ec 08 sub $0x8,%rsp
12d8: 48 8d 1d 61 2d 00 00 lea 0x2d61(%rip),%rbx ; RBX=@futex_var
12df: eb 20 jmp 1301 ; Goto CAS operation
12e1: 0f 1f 80 00 00 00 00 nopl 0x0(%rax)
12e8: 45 31 c0 xor %r8d,%r8d ; R8D=0 ("timeout" parameter of futex())
12eb: b9 01 00 00 00 mov $0x1,%ecx ; ECX=1 ("val" parameter of futex())
12f0: 31 d2 xor %edx,%edx ; EDX=0=FUTEX_WAIT ("op" parameter of futex())
12f2: 48 89 de mov %rbx,%rsi ; RSI=@futex_var ("addr" parameter of futex())
12f5: bf ca 00 00 00 mov $0xca,%edi ; EDI=0xCA=__NR_futex (1st parameter of syscall())
12fa: 31 c0 xor %eax,%eax ; EAX=0
12fc: e8 df fd ff ff callq 10e0 ; futex(@futex_var, FUTEX_WAIT, 1, NULL) system call
1301: ba 01 00 00 00 mov $0x1,%edx ; EDX=1
1306: 89 e8 mov %ebp,%eax ; EAX=0
1308: f0 0f b1 13 lock cmpxchg %edx,(%rbx) ; CAS operation with EDX=1 and (%RBX)=futex_var
130c: 85 c0 test %eax,%eax ; EAX=old value of futex_var
130e: 75 d8 jne 12e8 ; If old != 0, call futex()
1310: 48 83 c4 08 add $0x8,%rsp
1314: 5b pop %rbx
1315: 5d pop %rbp
1316: c3 retq
The cmpxchg instruction above atomically compares the content of EAX equal to 0 with
the content of (%rbx) (i.e. futex_var). If both are equal, (%rbx) (i.e. futex_var) is set to the content of %EDX (i.e. 1).
If they are not equal, EAX is set with (%rbx) (i.e. futex_var). So, after this instruction EAX represents the "old" value
of futex_var as it is either unchanged (i.e. equal to the current content of futex_var which is 0) or it is set to the
value of futex_var.
The same code for the ARM processor of the Raspberry Pi card is based on the
"load-link and store-conditional" (LL/SC) [6] with the
ldrex (Load exclusive) and strex (Store exclusive) pair of
instructions to atomically update the futex variable into two separate steps [20].
108ec: e59f3064 ldr r3, [pc, #100] 108f0: e59f2064 ldr r2, [pc, #100] 108f4: e08f3003 add r3, pc, r3 108f8: e92d4070 push {r4, r5, r6, lr} 108fc: e3a05001 mov r5, #1 10900: e24dd008 sub sp, sp, #8 10904: e3a06000 mov r6, #0 10908: e7934002 ldr r4, [r3, r2] ; r4=@futex_var 1090c: ea000001 b 10918 ; Goto the LL/SC operation 10910: e58d6000 str r6, [sp] ; Param#6 of syscall = Param#5 of futex() = timeout (i.e. r6 = 0) 10914: ebffff70 bl 106dc10918: ee070fba mcr 15, 0, r0, cr7, cr10, {5} ; Data memory barrier to wait for completion of all memory transactions 1091c: e1943f9f ldrex r3, [r4] ; LL: r3=Content of futex_var 10920: e3530000 cmp r3, #0 ; Compare futex_var to 0 10924: 1a000002 bne 10934 ; If futex_var is not 0, call futex(@futex_var, FUTEX_WAIT, 1, 0) 10928: e1842f95 strex r2, r5, [r4] ; SC: As futex_var was 0, set it to r5=1 1092c: e3520000 cmp r2, #0 ; R2 is 0 (if store succeeded) or 1 (if store failed) 10930: 1afffff9 bne 1091c ; If store failed, try again 10934: e3530000 cmp r3, #0 ; Store succeeded, Return (after memory barrier) 10938: ee070fba mcr 15, 0, r0, cr7, cr10, {5} ; Data memory barrier to wait for completion of all memory transactions 1093c: e1a01004 mov r1, r4 ; r1=@futex_var 10940: e3a02000 mov r2, #0 ; r2=FUTEX_WAIT ("op" parameter of futex()) 10944: e3a03001 mov r3, #1 ; r3=1 ("val" parameter of futex()) 10948: e3a000f0 mov r0, #240 ; r0=__NR_futex (1st parameter of syscall()) 1094c: 1affffef bne 10910 ; Call futex(@futex_var, FUTEX_WAIT, 1, 0) 10950: e28dd008 add sp, sp, #8 10954: e8bd8070 pop {r4, r5, r6, pc} 10958: 0001033c andeq r0, r1, ip, lsr r3 1095c: 00000030 andeq r0, r0, r0, lsr r0
The preceding disassembly source codes show how much the atomic operations are touchy and require experience and hardware architecture knowledges to be implemented. Thanks to the GCC built-ins, all those tricky details are hidden.
The execution under the control of strace points out
a big difference between mutex and ipc programs:
There are much less system calls now.
The main thread is highlighted in blue and the secondary thread is in red.
[pid 22929] clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=1, tv_nsec=0}, 0x7f4685b29e90) = 0 [pid 22929] futex(0x5568f8663040, FUTEX_WAKE, 1) = 0 [pid 22929] clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=1, tv_nsec=0}, 0x7f4685b29e90) = 0 [pid 22929] futex(0x5568f8663040, FUTEX_WAKE, 1) = 0 [pid 22929] clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=1, tv_nsec=0}, 0x7f4685b29e90) = 0 [pid 22929] futex(0x5568f8663040, FUTEX_WAKE, 1) = 0 [pid 22929] clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=1, tv_nsec=0},[pid 22928] <... read resumed>"\n", 1024) = 1 [pid 22928] fstat(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(0x88, 0x1), ...}) = 0 [pid 22928] write(1, "Current counter value: cinq\n", 28Current counter value: cinq ) = 28 [pid 22928] futex(0x5568f8663040, FUTEX_WAKE, 1) = 0 [pid 22928] read(0, [pid 22929] <... clock_nanosleep resumed>0x7f4685b29e90) = 0 [pid 22929] futex(0x5568f8663040, FUTEX_WAKE, 1) = 0 [pid 22929] clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=1, tv_nsec=0}, 0x7f4685b29e90) = 0 [pid 22929] futex(0x5568f8663040, FUTEX_WAKE, 1) = 0 [pid 22929] clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=1, tv_nsec=0}, 0x7f4685b29e90) = 0 [pid 22929] futex(0x5568f8663040, FUTEX_WAKE, 1) = 0 [pid 22929] clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=1, tv_nsec=0}, 0x7f4685b29e90) = 0 [pid 22929] futex(0x5568f8663040, FUTEX_WAKE, 1) = 0 [pid 22929] clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=1, tv_nsec=0}, [pid 22928] <... read resumed>"\n", 1024) = 1 [pid 22928] write(1, "Current counter value: neuf\n", 28Current counter value: neuf ) = 28 [pid 22928] futex(0x5568f8663040, FUTEX_WAKE, 1) = 0 [pid 22928] read(0,
Upon each clock_nanosleep(), the secondary thread invokes only one system call:
futex() with the FUTEX_WAKE operation.
When the operator types <RETURN>, the main thread behaves the same. To trigger
futex() with the FUTEX_WAIT operation, a
contention is necessary.
In the ipc program, in contended and non contented situations, a system call
was systematically invoked to decrement the semaphore at the beginning of the critical section and decrement it at
the end. Here, we save a system call at the beginning of the critical section when there is no contention.
It is possible to go farther by saving a system call at the end of the critical section as well.
It is possible to go farther in the reduction of system calls. Up to now, we only enhanced the locking procedure by saving a system call in non contended situations. But we can do the same in the unlocking procedure. In [8], Ulrich Drepper proposes a additionnal state for the futex. Instead of a binary state saying locked or not, it is possible to have a thrid value:
mutex2.c |
#include <unistd.h> #include <pthread.h> #include <stdio.h> #include <string.h> #include <sys/types.h> #include <sys/syscall.h> #include <linux/futex.h> #define NB_SECONDS 20 static char *str_second[NB_SECONDS] = { "zero", "un", "deux", "trois", "quatre", "cinq", "six", "sept", "huit", "neuf", "dix", "onze", "douze", "treize", "quatorze", "quinze", "seize", "dix sept", "dix huit", "dix neuf" }; char data[100]; int futex_var; #define futex_wait(addr, val) \ syscall(SYS_futex, addr, FUTEX_WAIT, val, NULL) #define futex_wakeup(addr, nb) \ syscall(SYS_futex, addr, FUTEX_WAKE, nb) static void LOCK(int *f) { int old; old = __sync_val_compare_and_swap(f, 0, 1); switch(old) { case 0: { // The futex was free as it contained 0 // The atomic operation set it to 1 meaning // that it is busy with no waiting threads return; } break; default: { // The futex contained either 1 or 2: it is busy // Atomically set the futex to the value 2 // to specify that there is at least one waiting thread old = __sync_lock_test_and_set(f, 2); // Loop as long as the futex is busy while (old != 0) { // Wait while the futex is equal to 2 futex_wait(f, 2); // The futex is either not busy (value 0) or 1 or 2 old = __sync_lock_test_and_set(f, 2); } } break; } } // LOCK static void UNLOCK(int *f) { int old; old = __sync_fetch_and_sub(f, 1); switch(old) { case 1: { // There were no waiting threads to wake up // ==> The futex is now equal to 0 (not busy) } break; case 2: { // There were waiting threads // Atomically set the futex to 0 old = __sync_lock_test_and_set(f, 0); // Wake up one of the waiting threads futex_wakeup(f, 1); } break; default: { // Impossible } } } // UNLOCK static void *thd_main(void *p) { unsigned int i = 0; (void)p; while(1) { // Update the ASCII counter LOCK(&futex_var); strcpy(data, str_second[i]); UNLOCK(&futex_var); // Sleep one second sleep(1); // Incrementation of the counter in the range [0, NB_SECONDS[ i = (i + 1) % NB_SECONDS; } return NULL; } int main(void) { pthread_t tid; // Creation of the thread (void)pthread_create(&tid, NULL, thd_main, NULL); // Interaction with the operator while(fgetc(stdin) != EOF) { // Display the current value of the counter LOCK(&futex_var); printf("Compteur courant: %s\n", data); UNLOCK(&futex_var); } } // main |
In the LOCK() function, in case of contention, the __sync_val_compare_and_swap(f, 0, 1) call returns 1 or 2 for the current value of the mutex. In this situation, the futex variable is not set to 1 as it is different than 0. So, the
procedure makes the caller wait by calling futex() with the FUTEX_WAIT operation. But prior to that, it atomically sets the futex variable to 2 to specify
that there are waiting threads.
In the UNLOCK() function, the futex
variable is decremented as it is at least equal to 1 (i.e. the calling thread
holds the mutex). Then, its value before decrementation is checked, if it was
1, there are no waiting threads, so the futex is not busy and there is no thread to wake up. If it was 2, there are waiting threads, we must first decrement
the futex variable and then wake up a waiting thread by calling futex() with the FUTEX_WAKE operation. It is mandatory the wake up the threads after setting
the futex variable to 0 otherwise the FUTEX_WAKE operation, could wake up a thread. The latter could check the current
value of the futex and see that it is 1 or 2 and go back to sleep before the
current thread set the futex to 0.
To make it, we introduced one more user space atomic operation helper. It is still a GCC built-ins: __sync_lock_test_and_set() sets the futex variable to the specified value and returns its previous value.
The execution of this new program version under the control of strace
points out that even the unlocking part does not use system calls. The main thread is highlighted in blue and
the secondary thread is in red.
[pid 21649] clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=1, tv_nsec=0}, 0x7f4798ecae90) = 0 [pid 21649] clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=1, tv_nsec=0}, 0x7f4798ecae90) = 0 [pid 21649] clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=1, tv_nsec=0}, 0x7f4798ecae90) = 0 [pid 21649] clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=1, tv_nsec=0}, 0x7f4798ecae90) = 0 [pid 21649] clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=1, tv_nsec=0},[pid 21648] <... read resumed>"\n", 1024) = 1 [pid 21648] write(1, "Compteur courant: dix\n", 22Compteur courant: dix ) = 22 [pid 21648] read(0, [pid 21649] <... clock_nanosleep resumed>0x7f4798ecae90) = 0 [pid 21649] clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=1, tv_nsec=0}, 0x7f4798ecae90) = 0 [pid 21649] clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=1, tv_nsec=0}, 0x7f4798ecae90) = 0 [pid 21649] clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=1, tv_nsec=0}, 0x7f4798ecae90) = 0
The preceding traces show no futex() call at all as long as there is
no contention.
To go farther in those enhancements, [15] proposes a much more efficient algorithm
by adding a fourth state to the futex variable:
This use case concerns a synchronization between a kernel service creating a thread and the user space. The following clone program creates a thread with pthread_create() service. The thread displays "Hello from the secondary thread" before termination. The main thread waits for the end of the secondary thread through pthread_join() and displays "End of program" before termination.
clone.c |
#include <errno.h> #include <pthread.h> #include <unistd.h> #include "util.h" #define STR_HELLO "Hello from the secondary thread\n" #define STR_END "End of program\n" void *thd_main(void *p) { int rc; (void)p; rc = write(1, STR_HELLO, sizeof(STR_HELLO) - 1); if (rc < 0) { ERR("write(): '%m' (%d)\n", errno); return NULL; } return NULL; } // thd_main int main(void) { pthread_t tid; int rc; // Creation of the thread (void)pthread_create(&tid, NULL, thd_main, NULL); // Wait the end of the thread (void)pthread_join(tid, NULL); rc = write(1, STR_END, sizeof(STR_END) - 1); if (rc < 0) { ERR("write(): '%m' (%d)\n", errno); return 1; } return 0; } // main |
The main thread (i.e. the program) does not finish before the secondary thread is finished:
$ ./clone Hello from the secondary thread End of program
Running the same under the control of strace, we point out how the synchronization is done:
$ strace -f ./clone [...] prlimit64(0, RLIMIT_STACK, NULL, {rlim_cur=8192*1024, rlim_max=RLIM64_INFINITY}) = 0 mmap(NULL, 8392704, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0) = 0x7f5ebb80d000 mprotect(0x7f5ebb80e000, 8388608, PROT_READ|PROT_WRITE) = 0 brk(NULL) = 0x560fdffbd000 brk(0x560fdffde000) = 0x560fdffde000 clone(child_stack=0x7f5ebc00cfb0, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tid=[7024], tls=0x7f5ebc00d700, child_tidptr=0x7f5ebc00d9d0) = 7024 strace: Process 7024 attached [pid 7024] set_robust_list(0x7f5ebc00d9e0, 24[pid 7023] futex(0x7f5ebc00d9d0, FUTEX_WAIT, 7024, NULL [pid 7024] <... set_robust_list resumed>) = 0 [pid 7024] write(1, "Hello from the secondary thread\n", 32Hello from the secondary thread ) = 32 [pid 7024] madvise(0x7f5ebb80d000, 8368128, MADV_DONTNEED) = 0 [pid 7024] exit(0) = ? [pid 7023] <... futex resumed>) = 0 [pid 7024] +++ exited with 0 +++ write(1, "End of program\n", 15End of program ) = 15 exit_group(0) = ? +++ exited with 0 +++
The first lines are the system calls triggered by pthread_create():
Figure 2 depicts the context of the created thread:
In the main thread, the call to pthread_join() is actually a call to futex() with the FUTEX_WAIT
operation on the futex variable located at address 0x7f5ebc00d9d0. The father thread waits as long as the futex variable contains the child's identifier
(7024). The thread identifier is stored by clone() according to the
CLONE_PARENT_SETTID parameter. The CLONE_CHILD_CLEARTID
flag passed to clone() clears the child thread identifier at the location pointed to by
child_tid in child memory when the child exits (this is the address of the futex
variable), and triggers a call to futex() with the
FUTEX_WAKE operation. This makes pthread_join()
return in the father thread.
The source code of pthread_join() from the GLIBC
is in the file named nptl/pthread_join_common.c:
nptl/pthread_join_common.c |
int
__pthread_clockjoin_ex (pthread_t threadid, void **thread_return,
clockid_t clockid,
const struct timespec *abstime, bool block)
{
struct pthread *pd = (struct pthread *) threadid;
[...]
if (block)
{
/* During the wait we change to asynchronous cancellation. If we
are cancelled the thread we are waiting for must be marked as
un-wait-ed for again. */
pthread_cleanup_push (cleanup, &pd->joinid);
if (abstime != NULL)
result = clockwait_tid (&pd->tid, clockid, abstime);
else
{
pid_t tid;
/* We need acquire MO here so that we synchronize with the
kernel's store to 0 when the clone terminates. (see above) */
while ((tid = atomic_load_acquire (&pd->tid)) != 0)
lll_futex_wait_cancel (&pd->tid, tid, LLL_SHARED);
}
pthread_cleanup_pop (0);
}
void *pd_result = pd->result;
if (__glibc_likely (result == 0))
{
/* We mark the thread as terminated and as joined. */
pd->tid = -1;
/* Store the return value if the caller is interested. */
if (thread_return != NULL)
*thread_return = pd_result;
/* Free the TCB. */
__free_tcb (pd);
}
else
pd->joinid = NULL;
LIBC_PROBE (pthread_join_ret, 3, threadid, result, pd_result);
return result;
}
|
The call to futex() with the FUTEX_WAIT operation is highlighted in red. The function loops until the content of the futex variable is set to 0. Upon the end of the thread, the kernel sets the futex variable to 0 and calls futex() with the FUTEX_WAKE operation to wakeup the thread waiting on the futex variable.
In POSIX terminology, a barrier [9] is a "rendez-vous" where
several threads decide to suspend their execution until all the threads reach it. This is
for example used to ensure that several threads execute some actions
(e.g. initializations) before the whole program hits its stride.
The program below implements a barrier. Several threads are created. Each of them
(even the main thread) store their task identifier in a global table and wait on the
barrier until all the threads have done the same. The main thread also waits on the
barrier. When all the threads are suspended on the barrier, they are woken up. The
secondary threads terminates while the main thread displays the content of the global
table, calls pthread_join() for each secondary
thread and terminates.
barrier.c |
#include <stdlib.h>
#include <stdio.h>
#include <pthread.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/futex.h>
static int futex_var;
static int nb_barrier;
static pid_t *tab;
#define futex_wait(addr, val) \
syscall(SYS_futex, addr, FUTEX_WAIT, val, NULL)
#define futex_wakeup(addr, nb) \
syscall(SYS_futex, addr, FUTEX_WAKE, nb)
#define gettid() syscall(__NR_gettid)
void barrier_wait(void)
{
int old;
old = __sync_fetch_and_sub(&futex_var, 1);
if (1 == old) {
futex_wakeup(&futex_var, nb_barrier - 1);
return;
}
// Wait until the futex variable is equal to 0
old = old - 1;
do {
futex_wait(&futex_var, old);
old = __sync_fetch_and_sub(&futex_var, 0);
if (0 == old) {
break;
}
} while(1);
} // barrier_wait
void *thd_main(void *p)
{
int *idx = (int *)p;
tab[*idx] = gettid();
barrier_wait();
return NULL;
} // thd_main
int main(int ac, char *av[])
{
pthread_t *tid;
int i;
int *idx;
// If the number of threads is passed as parameter
if (2 == ac) {
nb_barrier = atoi(av[1]) + 1;
} else {
// Default value
nb_barrier = 3 + 1;
}
futex_var = nb_barrier;
tab = (pid_t *)malloc(nb_barrier * sizeof(pid_t));
idx = (int *)malloc(nb_barrier * sizeof(int));
tid = (pthread_t *)malloc(nb_barrier * sizeof(pthread_t));
tab[0] = gettid();
for (i = 1; i < nb_barrier; i ++) {
idx[i] = i;
pthread_create(&(tid[i]), NULL, thd_main, (void *)&(idx[i]));
}
barrier_wait();
free(idx);
for (i = 0; i < nb_barrier; i ++) {
printf("tab[%d] = %d\n", i, tab[i]);
}
free(tab);
for (i = 1; i < nb_barrier; i ++) {
pthread_join(tid[i], NULL);
}
free(tid);
return 0;
} // main
|
The execution of the program displays the identifiers of all the threads:
$ ./barrier tab[0] = 6560 tab[1] = 6561 tab[2] = 6562 tab[3] = 6563
Without the barrier, the main thread couldn't know when the table is fully populated without looping
to scan the table until all the entries are different than 0. The barrier really simplifies the life.
The barrier is implemented with a futex variable named futex_var. It
is initialized with the number of threads that will be suspended on the barrier (the secondary
threads and the main thread). In the barrier_wait() function,
the futex variable is atomically decremented thanks to the
__sync_fetch_and_sub() built-in:
$ strace -f ./barrier 4000 [...] [pid 16413] clone(child_stack=0x7fc1d926ffb0, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID[pid 16414] <... gettid resumed>) = 16414 [pid 16415] gettid( [pid 16414] futex(0x55ebf3432024, FUTEX_WAIT, 4000, NULL strace: Process 16416 attached [pid 16413] <... clone resumed>, parent_tid=[16416], tls=0x7fc1d9270700, child_tidptr=0x7fc1d92709d0) = 16416 [pid 16415] <... gettid resumed>) = 16415 [pid 16413] mmap(NULL, 8392704, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0 [pid 16416] set_robust_list(0x7fc1d92709e0, 24 [pid 16413] <... mmap resumed>) = 0x7fc1d826f000 [pid 16416] <... set_robust_list resumed>) = 0 [pid 16415] futex(0x55ebf3432024, FUTEX_WAIT, 3999, NULL [pid 16413] mprotect(0x7fc1d8270000, 8388608, PROT_READ|PROT_WRITE [pid 16416] gettid( [pid 16413] <... mprotect resumed>) = 0 [pid 16416] <... gettid resumed>) = 16416 [pid 16413] clone(child_stack=0x7fc1d8a6efb0, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID [pid 16416] futex(0x55ebf3432024, FUTEX_WAIT, 3998, NULL strace: Process 16417 attached [pid 16413] <... clone resumed>, parent_tid=[16417], tls=0x7fc1d8a6f700, child_tidptr=0x7fc1d8a6f9d0) = 16417 [pid 16417] set_robust_list(0x7fc1d8a6f9e0, 24 [pid 16413] mmap(NULL, 8392704, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0 [pid 16417] <... set_robust_list resumed>) = 0 [pid 16413] <... mmap resumed>) = 0x7fc1d7a6e000 [pid 16417] gettid( [pid 16413] mprotect(0x7fc1d7a6f000, 8388608, PROT_READ|PROT_WRITE [pid 16417] <... gettid resumed>) = 16417 [pid 16413] <... mprotect resumed>) = 0 [pid 16417] futex(0x55ebf3432024, FUTEX_WAIT, 3997, NULL [pid 16413] clone(child_stack=0x7fc1d826dfb0, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tid=[16418], tls=0x7fc1d826e700, child_tidptr=0x7fc1d826e9d0) = 16418 strace: Process 16418 attached [...] [pid 20413] futex(0x55ebf3432024, FUTEX_WAKE, 4000 [...] [pid 16669] <... futex resumed>) = -1 EAGAIN (Resource temporarily unavailable) [...] [pid 20389] <... futex resumed>) = 0 [pid 16413] <... futex resumed>) = 0 [pid 20413] <... futex resumed>) = 4000 [pid 20412] <... futex resumed>) = 0 [pid 20388] <... futex resumed>) = 0 [...]
In the preceding we can also see the call to futex()
with the FUTEX_WAKE operation returning the number
of threads which have been woken up (4000).
In GLIBC 2.31, the barrier are managed with services like
pthread_barrier_init() or
pthread_barrier_wait(). Those services, compliant with POSIX specification, are a
quite more elaborated than our preceding example. Moreover, the pthread_barrier_wait()
returns 0 in all the calling threads except the one which reached the barrier at last and for which
PTHREAD_BARRIER_SERIAL_THREAD is returned. The preceding example can be modified
to behave the same. In the following source code, the modification are highlighted in red. The special value
BARRIER_SERIAL_THREAD is returned in the thread which wakes up the other ones.
barrier2.c |
#include <stdlib.h> #include <stdio.h> #include <pthread.h> #include <unistd.h> #include <sys/syscall.h> #include <linux/futex.h> static int futex_var; static int nb_barrier; static pid_t *tab; #define futex_wait(addr, val) \ syscall(SYS_futex, addr, FUTEX_WAIT, val, NULL) #define futex_wakeup(addr, nb) \ syscall(SYS_futex, addr, FUTEX_WAKE, nb) #define gettid() syscall(__NR_gettid) #define BARRIER_SERIAL_THREAD 1 int barrier_wait(void) { int old; old = __sync_fetch_and_sub(&futex_var, 1); if (1 == old) { futex_wakeup(&futex_var, nb_barrier - 1); return BARRIER_SERIAL_THREAD; } // Wait until the futex variable is equal to 0 old = old - 1; do { futex_wait(&futex_var, old); old = __sync_fetch_and_sub(&futex_var, 0); if (0 == old) { break; } } while(1); return 0; } // barrier_wait void *thd_main(void *p) { int *idx = (int *)p; int rc; tab[*idx] = gettid(); rc = barrier_wait(); if (BARRIER_SERIAL_THREAD == rc) { printf("Thread %d was the last on the barrier\n", *idx); } return NULL; } // thd_main int main(int ac, char *av[]) { pthread_t *tid; int i; int *idx; int rc; // If the number of threads is passed as parameter if (2 == ac) { nb_barrier = atoi(av[1]) + 1; } else { // Default value nb_barrier = 3 + 1; } futex_var = nb_barrier; tab = (pid_t *)malloc(nb_barrier * sizeof(pid_t)); idx = (int *)malloc(nb_barrier * sizeof(int)); tid = (pthread_t *)malloc(nb_barrier * sizeof(pthread_t)); tab[0] = gettid(); for (i = 1; i < nb_barrier; i ++) { idx[i] = i; pthread_create(&(tid[i]), NULL, thd_main, (void *)&(idx[i])); } rc = barrier_wait(); if (BARRIER_SERIAL_THREAD == rc) { printf("Main thread was the last on the barrier\n"); } free(idx); for (i = 0; i < nb_barrier; i ++) { printf("tab[%d] = %d\n", i, tab[i]); } free(tab); for (i = 1; i < nb_barrier; i ++) { pthread_join(tid[i], NULL); } free(tid); return 0; } // main |
$ ./barrier2 200
tab[0] = 6659
tab[1] = 6660
tab[2] = 6661
tab[3] = 6662
tab[4] = 6663
tab[5] = 6664
tab[6] = 6665
[...]
tab[50] = 6709
tab[51] = 6710
Thread 190 was the last on the barrier
tab[52] = 6711
tab[53] = 6712
[...]
tab[199] = 6858
tab[200] = 6859
This chapter shows the hidden part of the futex as it presents their implementation in the Linux kernel 5.4. With this knowledge, one can use the service more efficiently.
The Fast User Space Mutexes (called "futex" by Rusty Russell from IBM company [1]) are proposed by the Linux kernel with various flags in init/Kconfig file:
init/Kconfig |
config FUTEX bool "Enable futex support" if EXPERT default y imply RT_MUTEXES help Disabling this option will cause the kernel to be built without support for "fast userspace mutexes". The resulting kernel may not run glibc-based applications correctly. config FUTEX_PI bool depends on FUTEX && RT_MUTEXES default y config HAVE_FUTEX_CMPXCHG bool depends on FUTEX help Architectures should select this if futex_atomic_cmpxchg_inatomic() is implemented and always working. This removes a couple of runtime checks. |
The heart of the implementation is located in kernel/futex.c into which the introductory comment sums up the principle:
kernel/futex.c |
* The waiter reads the futex value in user space and calls * futex_wait(). This function computes the hash bucket and acquires * the hash bucket lock. After that it reads the futex user space value * again and verifies that the data has not changed. If it has not changed * it enqueues itself into the hash bucket, releases the hash bucket lock * and schedules. * * The waker side modifies the user space value of the futex and calls * futex_wake(). This function computes the hash bucket and acquires the * hash bucket lock. Then it looks for waiters on that futex in the hash * bucket and wakes them. |
The futex are stored in a hash table named __futex_data which entries are defined as:
/*
* The base of the bucket array and its size are always used together
* (after initialization only in hash_futex()), so ensure that they
* reside in the same cacheline.
*/
static struct {
struct futex_hash_bucket *queues;
unsigned long hashsize;
} __futex_data __read_mostly __aligned(2*sizeof(long));
The entries are cache aligned for optimization purposes (we don't want several entries on the same cache line which would cause false sharing).
Each entry point on a queue of struct futex_hash_bucket records:
/*
* Hash buckets are shared by all the futex_keys that hash to the same
* location. Each key may have multiple futex_q structures, one for each task
* waiting on a futex.
*/
struct futex_hash_bucket {
atomic_t waiters;
spinlock_t lock;
struct plist_head chain;
} ____cacheline_aligned_in_smp;
The hash function is:
The hash table is allocated with the alloc_large_system_hash() defined in mm/page_alloc.c. This function displays a message on the console such as:
$ dmesg [...] [ 0.433328] futex hash table entries: 1024 (order: 4, 65536 bytes, linear) [...]
The author is an engineer in computer sciences located in France. He can be contacted here.