Detailed Approach of the futex

Author: R. Koucha
Last update: 01-Jun-2020

Detailed approach of the futex

The System V semaphores provided by Unix and later by Linux, are quite heavy from a performance point of view. With the futex coming from the researches of H. Franke, R. Russel et M. Kirkwood [1], a more efficient alternative is available. According to the manuals, Linux provides this mechanism since version 2.5.7 with some evolutions in the subsequent releases.
This article aims at presenting the implementation details of the futex in the Linux kernel and how to use them as well as how the open sources libraries benefit from them.

Foreword
1. Introduction
2. From the System V semaphores to the futex
    2.1. Example using System V semaphores
    2.2. Performance problems
    2.3. Principle of the futex
    2.4. Modification of the program with a futex
    2.5. Performance tests
3. futex() system call
    3.1. Synopsis
    3.2. Usage
    3.3. The parameters
    3.4. Creation and initialization
    3.5. Destruction
4. Application examples
    4.1. Implementation of a mutex
        4.1.1. Execution of the example program
        4.1.2. Enhancement of the performances
    4.2. Synchronization on the end of a thread
    4.3. Implementation of a barrier
5. Detailed study
    5.1. Management of the futex in the kernel
    5.2. Number of threads to wake-up
    5.3. Intra-process optimization
    5.4. Early thread termination
        5.4.1. The user data structure
        5.4.2. The head of the list
        5.4.3. Getting the futex
        5.4.4. Waiting on the futex
        5.4.5. The wake-up of the waiting thread
        5.4.6. Limitations
    5.5. Priority inheritance
        5.5.1. Priority inversion
        5.5.2. Solution
    5.6. Multiplexing
    5.7. Temporization parameter
        5.7.1. Relative temporisation
        5.7.2. Absolute temporization
        5.7.3. Temporization and priority inheritance
    5.8. Conditional wake-up
    5.9. Implicit futex change
    5.10. Miscellaneous operations
6. Conclusion
7. References
About the author

Foreword

A french version of this article has been published in Gnu Linux Magazine France numbers 173, 175, 176 and 178. The current translation is an updated version.

The article contains source code examples that can be downloaded from this WEB site. Then, proceed as follow to uncompress the archive file and build the examples (you need to install cmake if not done yet):

$ mkdir examples
$ cp the_futex.tgz examples
$ cd examples
$ tar xvfz the_futex.tgz
[...]
$ cmake .
$ make

The executables can be run from the current directory.

1. Introduction

The system V semaphores are a bottleneck on heavy loaded servers, multi-threaded processes, real-time oriented applications... The futex have come to solve this.
Originally, the futex (a.k.a. "Fast User Space mutex") were introduced to implement efficient mutex. But they are now used in several other synchronization mechanisms (sémaphores, conditional variables...). The principle consists in reducing the number of system calls involved in the synchronization operations. Promoted by the GLIBC designers, they are now the cornerstone of the synchronization functions of the POSIX threads and more generally Linux processes.
At first sight, using the futex is not straightforward. Although online manuals are available (i.e. man 7 futex and man 2 futex), the usage is still tricky as the documentation is quite terse and sometimes out of date. Moreover, the manual says "bare futexes are not intended as an easy-to-use abstraction for end users. Implementors are expected to be assembly literate and to have read the sources of the futex user-space library" (sic).

This article will demistify the futex.

2. From the System V semaphores to the futex

2.1. Example using System V semaphores

The semget() system call returns a semaphore identifier. The operations on the semaphore are realized thanks to the semop() system call. The destruction is done with semctl() system call.
The following program sets up two threads. One thread (the writer) updates the global variable data with the counter of seconds while the other (the reader) displays the value of the counter upon user requests. Both execution flows run concurrently to access the data variable to either read or write into it. The mutual exclusion is solved thanks to the semid semaphore which serializes the accesses to the global variable in the critical sections (highlighted in red color in the following source code). Hence, the reader thread always get a coherent value of the counter of seconds.

ipc.c

#include <unistd.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/sem.h>


#define NB_SECONDS 20

static char *str_second[NB_SECONDS] =
{
  "zero", "one", "two", "three", "four", "five",
  "six", "seven", "eight", "nine", "ten",
  "eleven", "twelve", "thirteen", "fourteen", "fiveteen",
  "sixteen", "seventeen", "eighteen", "nineteen"
};

char data[100];

static int semid;

struct sembuf sembuf_for_P =  {0, -1, SEM_UNDO};
struct sembuf sembuf_for_V =  {0, +1, SEM_UNDO};

#define P(id)  semop(id, &sembuf_for_P, 1)
#define V(id)  semop(id, &sembuf_for_V, 1)

static void *thd_main(void *p)
{
unsigned int i = 0;

 (void)p;

  while(1) {

    // Update the counter in ASCII
    P(semid);
    strcpy(data, str_second[i]);
    V(semid);

    // Wait one second
    sleep(1);

    // Incrementation of the counter in the range [0, NB_SECONDS[
    i = (i + 1) % NB_SECONDS;
  }

  return NULL;
} // thd_main

int main(void)
{
pthread_t      tid;
unsigned short val = 1;

  // Creation of the semaphore
  semid = semget(IPC_PRIVATE, 1, IPC_CREAT|0777);

  // Initialization of the semaphore with the value 1
  semctl(semid, 0, SETALL, &val);

  // Creation of the thread
  (void)pthread_create(&tid, NULL, thd_main, NULL);

  // Interaction with the operator
  while(fgetc(stdin) != EOF) {

    // Display the ASCII value of the counter
    P(semid);
    printf("Current counter value: %s\n", data);
    V(semid);
  }

  return 0;

} // main

Once built, the example is run as follow. Each time the user types <RETURN>, the program displays the string value of the counter:

$ ./ipc

Current counter value: two

Current counter value: three

Current counter value: four
<CTRL>+<D>
$

The strace utility (with the -f option to follow the child processes) points out the calls to the Linux system during the execution. Each seconds, upon return from clock_nanosleep(), the semtimedop() service is called twice to decrement (P() operation) and increment (V() operation) the semaphore by the thread#4685 to update the counter.

[...]
[pid  4685] semtimedop(4, [{0, 1, SEM_UNDO}], 1, NULL) = 0
[pid  4685] clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=1, tv_nsec=0}, 0x7fb0bfdcbe80) = 0
[pid  4685] semtimedop(4, [{0, -1, SEM_UNDO}], 1, NULL) = 0
[pid  4685] semtimedop(4, [{0, 1, SEM_UNDO}], 1, NULL) = 0
[pid  4685] clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=1, tv_nsec=0}, 0x7fb0bfdcbe80) = 0
[pid  4685] semtimedop(4, [{0, -1, SEM_UNDO}], 1, NULL) = 0
[pid  4685] semtimedop(4, [{0, 1, SEM_UNDO}], 1, NULL) = 0
[pid  4685] clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=1, tv_nsec=0}, 0x7fb0bfdcbe80) = 0
[pid  4685] semtimedop(4, [{0, -1, SEM_UNDO}], 1, NULL) = 0
[pid  4685] semtimedop(4, [{0, 1, SEM_UNDO}], 1, NULL) = 0
[...]

Each time the user types <RETURN>, the thread#4684 does the same calls to read and display the current value of the counter.

[...]
[pid  4684] semtimedop(4, [{0, -1, SEM_UNDO}], 1, NULL) = 0
[pid  4684] write(1, "Current counter value: five\n", 28Current counter value: five) = 28
[pid  4684] semtimedop(4, [{0, 1, SEM_UNDO}], 1, NULL) = 0
[...]

2.2. Performance problems

The preceding example and especially the strace output shows that both threads invoke the semop() system call eventhough the semaphore is not contented. For example, when the writer thread updates the counter, most of the time, the reader thread is waiting for the user input and is not accessing the counter. Those continuous calls to the kernel to ensure the mutual exclusion consume lots of CPU. In real-time systems or bigger applications like database managers, the costs is huge. This behaviour may become prohibitive!

To reduce the cost of system calls, some solutions like vDSO have been implemented to reduce the calling overhead. But this is limited to a handful of services. Anyway, the best way to save CPU time is to avoid calling the system when it is not necessary. [1] proposes a solution with the futex introduced in Linux 2.5.7 and enhanced in the following versions of the kernel.

2.3. Principle of the futex

For a shared resource (data variable in our example), we differenciate the situations where there is no contention (i.e. no task is accessing it) from the situations where there is contention (i.e. a task is accessing it). In the first case, the operations are done in user space. The need for kernel services is required only for the second case. This is the main difference between system V semaphores and futex from performance point of view. In non-contended situations, the futex mechanism saves the cost of a context switch between user space and kernel space to take the control on the mutual exclusion lock whereas the system V IPC systematically invokes a system call in both situations.
Actually, a futex is an integer variable shared by processes and threads. As a consequence, this can be defined as a global variable if it is shared by the threads of a single process or can be located in a shared memory segment if it is used by threads from multiple processes. The shared variable contains a state modified in user space with atomic operations [3]. When the state specifies that there is no contention, no system call is required. On the other hand, if the state specifies a contention, a futex system call is done to make the calling task sleep. This basic mechanism is used to implement several elaborated services like mutex and rwlocks [13]. For instance, the POSIX thread synchronization services provided by the GLIBC/NPTL (i.e. Native Posix Thread Library [18]) use the futex: pthread_mutex_lock(), pthread_cond_signal(), pthread_rwlock_rdlock()...
As they are not always available in the high level languages like C or C++, the atomic operations need some assembly code. Hence, this could trigger portability problems from one architecture to another. That is one of the reasons why the manuals specify that the futex are preferably dedicated to advanced users involved in the design of service libraries like the GLIBC. They hide the architecture specific details with standardized high level services. But some compilers like GCC, provide built-ins [2] for widely used atomic operations. So, as long as we keep the same compiler, the operations are portable on all the platforms supported by the compiler.

2.4. Modification of the program with a futex

Figure 1 depicts the behaviour of the preceding ipc program based on a futex instead of a system V semaphore. The atomic operations are highlighted in red color. The sequence describes what is happening when the user types <RETURN>. The fgetc() function returns in the reader thread which calls P(). The latter is replaced by an atomic operation to set the futex variable to 1 if it is equal to 0. Meanwhile, the writer thread returns from sleep() function and calls the same P() operation. But as the futex variable is set to 1, it calls futex() system calls to wait into the kernel. The reader thread reads and displays the data and calls V() which consists into an atomic operation to set the futex variable to 0 and call futex() service to wake up the waiting threads. This triggers the wakeup of the writer thread which atomically checks if the futex variable is 0 and sets it to 1. It updates the data and calls the same V() operation to set atomically the futex variable to 0 and call the futex() service to wake up any waiting threads. At that moment, there are no waiting threads. Then, it calls sleep() function one more time. And so on...

Figure 1: ipc program based on a futex

In this sequence, the P() operation does not invoke a system call (i.e. futex()) when there is no contention. This is the big difference with the system V semaphore for which the P() operation always triggers a semop() system call. The V() operation always triggers a system call with both the futex and the system V semaphore. Below, we will see that it is possible to be more efficient in the V() operation by calling futex() only when there is a waiting thread.

2.5. Performance tests

Before going ahead, let's compare the performances when using a semaphore in system V mode and in futex mode (through the POSIX services of the GLIBC). To make it, we use an open source called systress which can be configured to set up concurrent running threads incrementing by 1 a global shared counter in a loop. The incrementation is done in a critical section surrounded by P() and V() operations. The semaphore is used with a binary value to behave as a mutex. In other words, only one thread at a time enters the critical section. The threads stop as soon as the counter reaches a configured ceiling value fixed to 100000000 here.

The results on a desktop PC (Distribution: Ubuntu 20.04, Linux: 5.4 PREEMPT_VOLUNTARY, gcc: 9.3.0, glibc: 2.31, Intel Core i7-3770K CPU @ 3.50GHz, 8 cores) are:

  
      Number
of
threads
(ceiling value = 100000000)
      
      POSIX mutex
(Duration)
      
      System V mutex
(Duration)
      
    

      1
      1''85
      1'19''15
    

      2
      13''58
      3'25''52
    

      3
      14''06
      5'18''89
    

      4
      16''84
      6'44''46
    

      5
      17''33
      7'16''64
    

Number of threads (ceiling value = 100000000)	POSIX mutex (Duration)	System V mutex (Duration)
1	1''85	1'19''15
2	13''58	3'25''52
3	14''06	5'18''89
4	16''84	6'44''46
5	17''33	7'16''64

The results on a Raspberry Pi version 3 (model B+) (Distribution: Raspbian 10, Linux: 4.19 PREEMPT_VOLUNTARY, gcc: 4.9.3, glibc: 2.28, Cortex-A53 (ARMv8) 1.4GHz, 4 cores) are:

  
      Number
of
threads
(ceiling value = 100000000)
      
      POSIX semaphore
(Duration)
      
      System V semaphore
(Duration)
      
    

      1
      8''43
      2'17''92
    

      2
      21''34
      16'19''59
    

      3
      33''30
      16'31''80
    

      4
      26''92
      15'16''14
    

      5
      27''18
      17'54''66
    

Number of threads (ceiling value = 100000000)	POSIX semaphore (Duration)	System V semaphore (Duration)
1	8''43	2'17''92
2	21''34	16'19''59
3	33''30	16'31''80
4	26''92	15'16''14
5	27''18	17'54''66

The preceding tables of results show big differences between binary semaphores in system V and futex modes from a performance point of view. Contended (more than one running thread) or not (one running thread), the futex based semaphore is tremendously more performant. That is why it is worth using futex based synchronization mechanisms!

3. futex() system call

The section 7 of the online manual provides a general description while the section 2 describes the system call itself.

3.1. Synopsis

#include <linux/futex.h>
#include <sys/time.h>

int futex(int *uaddr, int futex_op, int val,
          const struct timespec *timeout,   /* or: uint32_t val2 */
          int *uaddr2, int val3);

The op parameter determines which operation to do.

3.2. Usage

The source code below uses the system call according to the description in the manual:

futex.c

#include <stdio.h>
#include <linux/futex.h>

int main(void)
{
int rc;

  rc = futex((int *)0, 0, 0, (const struct timespec *)0, (int *)0, 0);
  printf("rc=%d\n", rc);

  return 0;
} // main

The compilation ends with an error as the futex() service does not exist in the GLIBC:

$ gcc futex.c 
futex.c: In function 'main':
futex.c:8:8: warning: implicit declaration of function 'futex' [-Wimplicit-function-declaration]
    8 |   rc = futex((int *)0, 0, 0, (const struct timespec *)0, (int *)0, 0);
      |        ^~~~~
/usr/bin/ld: /tmp/ccIJA66y.o: in function `main':
futex.c:(.text+0x32): undefined reference to `futex'
collect2: error: ld returned 1 exit status

But the system call exists as it is defined in the <sys/syscall.h> header file:

#ifdef __NR_futex
# define SYS_futex __NR_futex
#endif

So, the call to futex() is done through the syscall() service:

futex.c

#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>

#define futex(a, b, c, d, e, f) syscall(SYS_futex, a, b, c, d, e, f)

int main(void)
{
int rc;

  rc = futex((int *)0, 0, 0, (const struct timespec *)0, (int *)0, 0);
  printf("rc=%d\n", rc);

  return 0;
} // main

3.3. The parameters

The uaddr parameter is the address of the futex variable. This is an integer. The op parameter is the code of the operation to be run by the system call. They are defined in <linux/futex.h>:

#define FUTEX_WAIT		0
#define FUTEX_WAKE		1
#define FUTEX_FD		2
#define FUTEX_REQUEUE		3
#define FUTEX_CMP_REQUEUE	4
#define FUTEX_WAKE_OP		5
#define FUTEX_LOCK_PI		6
#define FUTEX_UNLOCK_PI		7
#define FUTEX_TRYLOCK_PI	8
#define FUTEX_WAIT_BITSET	9
#define FUTEX_WAKE_BITSET	10
#define FUTEX_WAIT_REQUEUE_PI	11
#define FUTEX_CMP_REQUEUE_PI	12

For examples:

When op is equal to FUTEX_WAIT, the system call atomically checks if the memory location referenced by uaddr contains the val value. If not, futex() returns immediately with -1 as returned code and EAGAIN as errno value (synonymous of EWOULDBLOCK). Otherwise, the calling thread is put to sleep until another thread calls futex() with the FUTEX_WAKE operation (the returned code will be 0) or upon the timer expiration if the timeout parameter was not NULL (the returned code is -1 and errno is set to ETIMEDOUT).
When op is equal to FUTEX_WAKE, the val parameter specifies the maximum number of threads to wake up. The returned code is the number of threads actually woken up.

3.4. Creation and initialization

If the futex is to be shared between threads belonging to the same process, the futex variable can be a simple global integer variable or an integer stored in a dynamically allocated memory location.
If the futex is used by threads belonging to several processes, the futex variable must be located in some shared memory location allocated through system V IPC (i.e. shmget(), shmat()) or a POSIX service (i.e. shm_open(), mmap()).

3.5. Destruction

Destroying a futex requires first of all that no thread is waiting on it. This deallocates the corresponding data structures in the kernel. On user space side, the destruction is nothing else then getting rid of the futex variable. If the latter has been dynamically allocated, the user must call the corresponding deallocation services (e.g. free(), shmdt(), munmap()...). If it is a global static variable, it will be kept unused up to the end of the program.

4. Application examples

4.1. Implementation of a mutex

A mutex is a lock with two states: locked or unlocked. It is typically used to permit the exclusive access to a resource shared by multiple concurrent executing threads. Those code segments are called "critical sections". Two operations are generally defined: LOCK() and UNLOCK() to respectively lock and unlock the critical section. The ipc program previously presented contains two critical sections. Those are the code portions surrounded by P() and V() operations (preferred terminology when we speak about semaphores [19]).
Actually, mutex are a particular case of the semaphores. The latter decrement (P() operation) or increment (V() operation) a counter initialized to the number of threads allowed to enter a critical section. For a mutex, the counter, is initialized to 1. This is a binary semaphore.
The following mutex program is the modified version of the ipc program above. It replaces the system V semaphore by a futex.

mutex.c

#include <unistd.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/syscall.h>
#include <linux/futex.h>

#define NB_SECONDS 20

static char *str_second[NB_SECONDS] =
{
  "zero", "un", "deux", "trois", "quatre", "cinq",
  "six", "sept", "huit", "neuf", "dix",
  "onze", "douze", "treize", "quatorze", "quinze",
  "seize", "dix sept", "dix huit", "dix neuf"
};

char data[100];

int futex_var;

#define futex_wait(addr, val) \
           syscall(SYS_futex, addr, FUTEX_WAIT, val, NULL)
#define futex_wakeup(addr, nb) \
           syscall(SYS_futex, addr, FUTEX_WAKE, nb)

static void LOCK(int *f)
{
int old;

  while (1)
  {
    old = __sync_val_compare_and_swap(f, 0, 1);

    if (0 == old)
    {
      // The futex variable was free as it was equal to 0
      // Now, it is set to 1
      return;
    }
    else
    {
      // The futex variable is equal to 1
      // ==> It is not modified
      futex_wait(f, 1);
    }
  }
} // LOCK

static void UNLOCK(int *f)
{
  (void)__sync_fetch_and_sub(f, 1);

  futex_wakeup(f, 1);
} // UNLOCK

static void *thd_main(void *p)
{
unsigned int i = 0;

 (void)p;

  while(1)
  {
    // Update of the counter
    LOCK(&futex_var);
    strcpy(data, str_second[i]);
    UNLOCK(&futex_var);

    // Sleep 1 second
    sleep(1);

    // Incrementation of the counter in the range [0, NB_SECONDS[
    i = (i + 1) % NB_SECONDS;
  } // End while

  return NULL;
} // thd_main

int main(void)
{
pthread_t tid;

  // Creation of the thread
  (void)pthread_create(&tid, NULL, thd_main, NULL);

  // Interaction with the operator
  while(fgetc(stdin) != EOF)
  {
    // Display the current value of the counter
    LOCK(&futex_var);
    printf("Current counter value: %s\n", data);
    UNLOCK(&futex_var);
  } // End while

  return 0;
} // main

The futex relies on the futex_var global variable. The FUTEX_WAIT and FUTEX_WAKE operations are provided as macros (respectively named futex_wait() and futex_wakeup()).
The LOCK() service is interesting. It first atomically checks futex_var to set it to 1 (second parameter of __sync_val_compare_and_swap()) only if it is currently equal to 0 (first parameter of __sync_val_compare_and_swap()). This is a "compare and swap operation" (i.e. CAS [4]) provided as a built-in by GCC compiler. If it is not currently equal to 0, the variable is left unchanged and the calling thread is put to sleep with the FUTEX_WAIT operation. The latter also atomically checks the value of the variable to verify that it is still equal to 1 (fourth parameter of futex()) before putting the calling thread to sleep. The purpose of the comparison with the expected value is to prevent lost wake-ups. If another thread changed the value of the futex variable after the calling thread decided to block based on the prior value, and if the other thread executed a FUTEX_WAKE operation after the value change and before this FUTEX_WAIT operation, then the calling thread will observe the value change and will not go to sleep. This does not prevent the famous ABA problem [21] where a thread could change the value of the futex and put it back to its original value before the thread makes the check but at least, the thread will not sleep for ever!
The UNLOCK() service, much more simple, atomically decrements futex_var thanks the __sync_fetch_and_sub() built-in of GCC and calls the FUTEX_WAKE operation along with the value 1 to wakeup at most one waiting thread. No guarantee is provided about which waiters are awoken (e.g. a waiter with a higher scheduling priority is not guaranteed to be awoken in preference to a waiter with a lower priority).

As we can see, the futex mechanism needs user space atomic operations helpers. Those helpers are not available in high level languages like C or C++. The assembly language is required with the inherent portability problems from one platform to another. Hopefully, compilers like GCC provide built-ins to simplify the programmer's life.
On Intel platform, the CAS atomic operation is done with a cmpxchg assembly instruction [5] prefixed by lock for SMP system to guarantee exclusive access to the memory location. Here is the disassembly code of the LOCK() service:

    12d0:	55                   	push   %rbp
    12d1:	31 ed                	xor    %ebp,%ebp
    12d3:	53                   	push   %rbx
    12d4:	48 83 ec 08          	sub    $0x8,%rsp
    12d8:	48 8d 1d 61 2d 00 00 	lea    0x2d61(%rip),%rbx  ; RBX=@futex_var
    12df:	eb 20                	jmp    1301               ; Goto CAS operation
    12e1:	0f 1f 80 00 00 00 00 	nopl   0x0(%rax)
    12e8:	45 31 c0             	xor    %r8d,%r8d          ; R8D=0 ("timeout" parameter of futex())
    12eb:	b9 01 00 00 00       	mov    $0x1,%ecx          ; ECX=1 ("val" parameter of futex())
    12f0:	31 d2                	xor    %edx,%edx          ; EDX=0=FUTEX_WAIT ("op" parameter of futex())
    12f2:	48 89 de             	mov    %rbx,%rsi          ; RSI=@futex_var ("addr" parameter of futex())
    12f5:	bf ca 00 00 00       	mov    $0xca,%edi         ; EDI=0xCA=__NR_futex (1st parameter of syscall())
    12fa:	31 c0                	xor    %eax,%eax          ; EAX=0
    12fc:	e8 df fd ff ff       	callq  10e0               ; futex(@futex_var, FUTEX_WAIT, 1, NULL) system call
    1301:	ba 01 00 00 00       	mov    $0x1,%edx          ; EDX=1
    1306:	89 e8                	mov    %ebp,%eax          ; EAX=0
    1308:	f0 0f b1 13          	lock cmpxchg %edx,(%rbx)  ; CAS operation with EDX=1 and (%RBX)=futex_var
    130c:	85 c0                	test   %eax,%eax          ; EAX=old value of futex_var
    130e:	75 d8                	jne    12e8               ; If old != 0, call futex()
    1310:	48 83 c4 08          	add    $0x8,%rsp
    1314:	5b                   	pop    %rbx
    1315:	5d                   	pop    %rbp
    1316:	c3                   	retq

The cmpxchg instruction above atomically compares the content of EAX equal to 0 with the content of (%rbx) (i.e. futex_var). If both are equal, (%rbx) (i.e. futex_var) is set to the content of %EDX (i.e. 1). If they are not equal, EAX is set with (%rbx) (i.e. futex_var). So, after this instruction EAX represents the "old" value of futex_var as it is either unchanged (i.e. equal to the current content of futex_var which is 0) or it is set to the value of futex_var.

The same code for the ARM processor of the Raspberry Pi card is based on the "load-link and store-conditional" (LL/SC) [6] with the ldrex (Load exclusive) and strex (Store exclusive) pair of instructions to atomically update the futex variable into two separate steps [20].

   108ec:	e59f3064 	ldr	r3, [pc, #100]
   108f0:	e59f2064 	ldr	r2, [pc, #100]
   108f4:	e08f3003 	add	r3, pc, r3
   108f8:	e92d4070 	push	{r4, r5, r6, lr}
   108fc:	e3a05001 	mov	r5, #1
   10900:	e24dd008 	sub	sp, sp, #8
   10904:	e3a06000 	mov	r6, #0
   10908:	e7934002 	ldr	r4, [r3, r2]               ; r4=@futex_var
   1090c:	ea000001 	b	10918                      ; Goto the LL/SC operation
   10910:	e58d6000 	str	r6, [sp]                   ; Param#6 of syscall = Param#5 of futex() = timeout (i.e. r6 = 0)
   10914:	ebffff70 	bl	106dc 
   10918:	ee070fba 	mcr	15, 0, r0, cr7, cr10, {5}  ; Data memory barrier to wait for completion of all memory transactions
   1091c:	e1943f9f 	ldrex	r3, [r4]                   ; LL: r3=Content of futex_var
   10920:	e3530000 	cmp	r3, #0                     ; Compare futex_var to 0
   10924:	1a000002 	bne	10934                      ; If futex_var is not 0, call futex(@futex_var, FUTEX_WAIT, 1, 0)
   10928:	e1842f95 	strex	r2, r5, [r4]               ; SC: As futex_var was 0, set it to r5=1
   1092c:	e3520000 	cmp	r2, #0                     ; R2 is 0 (if store succeeded) or 1 (if store failed)
   10930:	1afffff9 	bne	1091c                      ; If store failed, try again
   10934:	e3530000 	cmp	r3, #0                     ; Store succeeded, Return (after memory barrier)
   10938:	ee070fba 	mcr	15, 0, r0, cr7, cr10, {5}  ; Data memory barrier to wait for completion of all memory transactions
   1093c:	e1a01004 	mov	r1, r4                     ; r1=@futex_var 
   10940:	e3a02000 	mov	r2, #0                     ; r2=FUTEX_WAIT ("op" parameter of futex())
   10944:	e3a03001 	mov	r3, #1                     ; r3=1 ("val" parameter of futex())
   10948:	e3a000f0 	mov	r0, #240                   ; r0=__NR_futex (1st parameter of syscall())
   1094c:	1affffef 	bne	10910                      ; Call futex(@futex_var, FUTEX_WAIT, 1, 0)
   10950:	e28dd008 	add	sp, sp, #8
   10954:	e8bd8070 	pop	{r4, r5, r6, pc}
   10958:	0001033c 	andeq	r0, r1, ip, lsr r3
   1095c:	00000030 	andeq	r0, r0, r0, lsr r0

The preceding disassembly source codes show how much the atomic operations are touchy and require experience and hardware architecture knowledges to be implemented. Thanks to the GCC built-ins, all those tricky details are hidden.

4.1.1. Execution of the example program

The execution under the control of strace points out a big difference between mutex and ipc programs: There are much less system calls now.

The main thread is highlighted in blue and the secondary thread is in red.

[pid 22929] clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=1, tv_nsec=0}, 0x7f4685b29e90) = 0
[pid 22929] futex(0x5568f8663040, FUTEX_WAKE, 1) = 0
[pid 22929] clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=1, tv_nsec=0}, 0x7f4685b29e90) = 0
[pid 22929] futex(0x5568f8663040, FUTEX_WAKE, 1) = 0
[pid 22929] clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=1, tv_nsec=0}, 0x7f4685b29e90) = 0
[pid 22929] futex(0x5568f8663040, FUTEX_WAKE, 1) = 0
[pid 22929] clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=1, tv_nsec=0}, 
 
[pid 22928] <... read resumed>"\n", 1024) = 1
[pid 22928] fstat(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(0x88, 0x1), ...}) = 0
[pid 22928] write(1, "Current counter value: cinq\n", 28Current counter value: cinq
) = 28
[pid 22928] futex(0x5568f8663040, FUTEX_WAKE, 1) = 0
[pid 22928] read(0,  
[pid 22929] <... clock_nanosleep resumed>0x7f4685b29e90) = 0
[pid 22929] futex(0x5568f8663040, FUTEX_WAKE, 1) = 0
[pid 22929] clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=1, tv_nsec=0}, 0x7f4685b29e90) = 0
[pid 22929] futex(0x5568f8663040, FUTEX_WAKE, 1) = 0
[pid 22929] clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=1, tv_nsec=0}, 0x7f4685b29e90) = 0
[pid 22929] futex(0x5568f8663040, FUTEX_WAKE, 1) = 0
[pid 22929] clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=1, tv_nsec=0}, 0x7f4685b29e90) = 0
[pid 22929] futex(0x5568f8663040, FUTEX_WAKE, 1) = 0
[pid 22929] clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=1, tv_nsec=0}, 
 
[pid 22928] <... read resumed>"\n", 1024) = 1
[pid 22928] write(1, "Current counter value: neuf\n", 28Current counter value: neuf
) = 28
[pid 22928] futex(0x5568f8663040, FUTEX_WAKE, 1) = 0
[pid 22928] read(0,

Upon each clock_nanosleep(), the secondary thread invokes only one system call: futex() with the FUTEX_WAKE operation. When the operator types <RETURN>, the main thread behaves the same. To trigger futex() with the FUTEX_WAIT operation, a contention is necessary.
In the ipc program, in contended and non contented situations, a system call was systematically invoked to decrement the semaphore at the beginning of the critical section and decrement it at the end. Here, we save a system call at the beginning of the critical section when there is no contention.

It is possible to go farther by saving a system call at the end of the critical section as well.

4.1.2. Enhancement of the performances

It is possible to go farther in the reduction of system calls. Up to now, we only enhanced the locking procedure by saving a system call in non contended situations. But we can do the same in the unlocking procedure. In [8], Ulrich Drepper proposes a additionnal state for the futex. Instead of a binary state saying locked or not, it is possible to have a thrid value:

Unlocked (0);
Locked without waiting threads (1);
Locked with waiting threads (2).

Hence, in the unlocking procedure, the call to futex() with the FUTEX_WAKE operation wll be done only when the futex variable is equal to 2 (i.e. when threads are waiting).

The following mutex2 program is the modified version of the mutex program with the introduction of the additional state for the futex variable.

mutex2.c

#include <unistd.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/syscall.h>
#include <linux/futex.h>

#define NB_SECONDS 20

static char *str_second[NB_SECONDS] =
{
  "zero", "un", "deux", "trois", "quatre", "cinq",
  "six", "sept", "huit", "neuf", "dix",
  "onze", "douze", "treize", "quatorze", "quinze",
  "seize", "dix sept", "dix huit", "dix neuf"
};

char data[100];

int futex_var;

#define futex_wait(addr, val) \
           syscall(SYS_futex, addr, FUTEX_WAIT, val, NULL)
#define futex_wakeup(addr, nb) \
           syscall(SYS_futex, addr, FUTEX_WAKE, nb)

static void LOCK(int *f)
{
int old;

  old = __sync_val_compare_and_swap(f, 0, 1);

  switch(old) {
    case 0: {
      // The futex was free as it contained 0
      // The atomic operation set it to 1 meaning
      // that it is busy with no waiting threads
      return;
    }
    break;

    default: {
      // The futex contained either 1 or 2: it is busy

      // Atomically set the futex to the value 2
      // to specify that there is at least one waiting thread
      old = __sync_lock_test_and_set(f, 2);

      // Loop as long as the futex is busy
      while (old != 0)
      {
        // Wait while the futex is equal to 2
        futex_wait(f, 2);

        // The futex is either not busy (value 0) or 1 or 2
        old = __sync_lock_test_and_set(f, 2);
      }
    }
    break;
  }
} // LOCK

static void UNLOCK(int *f)
{
int old;

  old = __sync_fetch_and_sub(f, 1);

  switch(old)
  {
    case 1:
    {
      // There were no waiting threads to wake up
      // ==> The futex is now equal to 0 (not busy)
    }
    break;

    case 2:
    {
      // There were waiting threads
      // Atomically set the futex to 0
      old = __sync_lock_test_and_set(f, 0);

      // Wake up one of the waiting threads
      futex_wakeup(f, 1);
    }
    break;

    default:
    {
      // Impossible
    }
  }
} // UNLOCK

static void *thd_main(void *p)
{
unsigned int i = 0;

  (void)p;

  while(1)
  {
    // Update the ASCII counter
    LOCK(&futex_var);
    strcpy(data, str_second[i]);
    UNLOCK(&futex_var);

    // Sleep one second
    sleep(1);

    // Incrementation of the counter in the range [0, NB_SECONDS[
    i = (i + 1) % NB_SECONDS;
  }

  return NULL;
}

int main(void)
{
pthread_t tid;

  // Creation of the thread
 (void)pthread_create(&tid, NULL, thd_main, NULL);

  // Interaction with the operator
  while(fgetc(stdin) != EOF)
  {
    // Display the current value of the counter
    LOCK(&futex_var);
    printf("Compteur courant: %s\n", data);
    UNLOCK(&futex_var);
  }
} // main

In the LOCK() function, in case of contention, the __sync_val_compare_and_swap(f, 0, 1) call returns 1 or 2 for the current value of the mutex. In this situation, the futex variable is not set to 1 as it is different than 0. So, the procedure makes the caller wait by calling futex() with the FUTEX_WAIT operation. But prior to that, it atomically sets the futex variable to 2 to specify that there are waiting threads.
In the UNLOCK() function, the futex variable is decremented as it is at least equal to 1 (i.e. the calling thread holds the mutex). Then, its value before decrementation is checked, if it was 1, there are no waiting threads, so the futex is not busy and there is no thread to wake up. If it was 2, there are waiting threads, we must first decrement the futex variable and then wake up a waiting thread by calling futex() with the FUTEX_WAKE operation. It is mandatory the wake up the threads after setting the futex variable to 0 otherwise the FUTEX_WAKE operation, could wake up a thread. The latter could check the current value of the futex and see that it is 1 or 2 and go back to sleep before the current thread set the futex to 0.

To make it, we introduced one more user space atomic operation helper. It is still a GCC built-ins: __sync_lock_test_and_set() sets the futex variable to the specified value and returns its previous value.

The execution of this new program version under the control of strace points out that even the unlocking part does not use system calls. The main thread is highlighted in blue and the secondary thread is in red.

[pid 21649] clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=1, tv_nsec=0}, 0x7f4798ecae90) = 0
[pid 21649] clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=1, tv_nsec=0}, 0x7f4798ecae90) = 0
[pid 21649] clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=1, tv_nsec=0}, 0x7f4798ecae90) = 0
[pid 21649] clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=1, tv_nsec=0}, 0x7f4798ecae90) = 0
[pid 21649] clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=1, tv_nsec=0}, 
 
[pid 21648] <... read resumed>"\n", 1024) = 1
[pid 21648] write(1, "Compteur courant: dix\n", 22Compteur courant: dix
) = 22
[pid 21648] read(0,  
[pid 21649] <... clock_nanosleep resumed>0x7f4798ecae90) = 0
[pid 21649] clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=1, tv_nsec=0}, 0x7f4798ecae90) = 0
[pid 21649] clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=1, tv_nsec=0}, 0x7f4798ecae90) = 0
[pid 21649] clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=1, tv_nsec=0}, 0x7f4798ecae90) = 0

The preceding traces show no futex() call at all as long as there is no contention.

To go farther in those enhancements, [15] proposes a much more efficient algorithm by adding a fourth state to the futex variable:

Unlocked (0);
Locked without waiting threads (1);
Locked with waiting threads (2);
Unlocked with waiting threads (3).

4.2. Synchronization on the end of a thread

This use case concerns a synchronization between a kernel service creating a thread and the user space. The following clone program creates a thread with pthread_create() service. The thread displays "Hello from the secondary thread" before termination. The main thread waits for the end of the secondary thread through pthread_join() and displays "End of program" before termination.

clone.c

#include <errno.h>
#include <pthread.h>
#include <unistd.h>

#include "util.h"


#define STR_HELLO "Hello from the secondary thread\n"
#define STR_END "End of program\n"

void *thd_main(void *p)
{
int rc;

  (void)p;

  rc = write(1, STR_HELLO, sizeof(STR_HELLO) - 1);
  if (rc < 0) {
    ERR("write(): '%m' (%d)\n", errno);
    return NULL;
  }

  return NULL;
} // thd_main

int main(void)
{
pthread_t tid;
int       rc;

  // Creation of the thread
 (void)pthread_create(&tid, NULL, thd_main, NULL);

  // Wait the end of the thread
 (void)pthread_join(tid, NULL);

  rc = write(1, STR_END, sizeof(STR_END) - 1);
  if (rc < 0) {
    ERR("write(): '%m' (%d)\n", errno);
    return 1;
  }

  return 0;
} // main

The main thread (i.e. the program) does not finish before the secondary thread is finished:

$ ./clone 
Hello from the secondary thread
End of program

Running the same under the control of strace, we point out how the synchronization is done:

$ strace -f ./clone
[...]
prlimit64(0, RLIMIT_STACK, NULL, {rlim_cur=8192*1024, rlim_max=RLIM64_INFINITY}) = 0
mmap(NULL, 8392704, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0) = 0x7f5ebb80d000
mprotect(0x7f5ebb80e000, 8388608, PROT_READ|PROT_WRITE) = 0
brk(NULL)                               = 0x560fdffbd000
brk(0x560fdffde000)                     = 0x560fdffde000
clone(child_stack=0x7f5ebc00cfb0,
flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID,
parent_tid=[7024], tls=0x7f5ebc00d700, child_tidptr=0x7f5ebc00d9d0) = 7024
strace: Process 7024 attached
[pid  7024] set_robust_list(0x7f5ebc00d9e0, 24 
[pid  7023] futex(0x7f5ebc00d9d0, FUTEX_WAIT, 7024, NULL 
[pid  7024] <... set_robust_list resumed>) = 0
[pid  7024] write(1, "Hello from the secondary thread\n", 32Hello from the secondary thread
) = 32
[pid  7024] madvise(0x7f5ebb80d000, 8368128, MADV_DONTNEED) = 0
[pid  7024] exit(0)                     = ?
[pid  7023] <... futex resumed>)        = 0
[pid  7024] +++ exited with 0 +++
write(1, "End of program\n", 15End of program
)        = 15
exit_group(0)                           = ?
+++ exited with 0 +++

The first lines are the system calls triggered by pthread_create():

The TCB (Task Control Block) of the task is allocated through mmap() with no read and write accesses (PROT_NONE);
The default stack size (8 MB) is determined with prlimit();
The red zone (i.e. the read-only 4 KB long space which permits to detect the stack overflows) is set up thanks to mprotect() by activating the read and write access on the last 8388608 (i.e. 8392704 - 4096). Since the stack grows downwards (from the high to the low addresses), the red zone is actually located at the top of the stack zone;
The thread is created through a call to clone() with the CLONE_THREAD flag.

Figure 2 depicts the context of the created thread:

Figure 2: Thread's context

In the main thread, the call to pthread_join() is actually a call to futex() with the FUTEX_WAIT operation on the futex variable located at address 0x7f5ebc00d9d0. The father thread waits as long as the futex variable contains the child's identifier (7024). The thread identifier is stored by clone() according to the CLONE_PARENT_SETTID parameter. The CLONE_CHILD_CLEARTID flag passed to clone() clears the child thread identifier at the location pointed to by child_tid in child memory when the child exits (this is the address of the futex variable), and triggers a call to futex() with the FUTEX_WAKE operation. This makes pthread_join() return in the father thread.

The source code of pthread_join() from the GLIBC is in the file named nptl/pthread_join_common.c:

nptl/pthread_join_common.c

int
__pthread_clockjoin_ex (pthread_t threadid, void **thread_return,
			clockid_t clockid,
			const struct timespec *abstime, bool block)
{
  struct pthread *pd = (struct pthread *) threadid;
[...]
  if (block)
    {
      /* During the wait we change to asynchronous cancellation.  If we
	 are cancelled the thread we are waiting for must be marked as
	 un-wait-ed for again.  */
      pthread_cleanup_push (cleanup, &pd->joinid);

      if (abstime != NULL)
	result = clockwait_tid (&pd->tid, clockid, abstime);
      else
	{
	  pid_t tid;
	  /* We need acquire MO here so that we synchronize with the
	     kernel's store to 0 when the clone terminates. (see above)  */
	  while ((tid = atomic_load_acquire (&pd->tid)) != 0)
	    lll_futex_wait_cancel (&pd->tid, tid, LLL_SHARED);
	}

      pthread_cleanup_pop (0);
    }

  void *pd_result = pd->result;
  if (__glibc_likely (result == 0))
    {
      /* We mark the thread as terminated and as joined.  */
      pd->tid = -1;

      /* Store the return value if the caller is interested.  */
      if (thread_return != NULL)
	*thread_return = pd_result;

      /* Free the TCB.  */
      __free_tcb (pd);
    }
  else
    pd->joinid = NULL;

  LIBC_PROBE (pthread_join_ret, 3, threadid, result, pd_result);

  return result;
}

The call to futex() with the FUTEX_WAIT operation is highlighted in red. The function loops until the content of the futex variable is set to 0. Upon the end of the thread, the kernel sets the futex variable to 0 and calls futex() with the FUTEX_WAKE operation to wakeup the thread waiting on the futex variable.

4.3. Implementation of a barrier

In POSIX terminology, a barrier [9] is a "rendez-vous" where several threads decide to suspend their execution until all the threads reach it. This is for example used to ensure that several threads execute some actions (e.g. initializations) before the whole program hits its stride.
The program below implements a barrier. Several threads are created. Each of them (even the main thread) store their task identifier in a global table and wait on the barrier until all the threads have done the same. The main thread also waits on the barrier. When all the threads are suspended on the barrier, they are woken up. The secondary threads terminates while the main thread displays the content of the global table, calls pthread_join() for each secondary thread and terminates.

barrier.c

#include <stdlib.h>
#include <stdio.h>
#include <pthread.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/futex.h>

static int futex_var;

static int nb_barrier;

static pid_t *tab;


#define futex_wait(addr, val) \
           syscall(SYS_futex, addr, FUTEX_WAIT, val, NULL)

#define futex_wakeup(addr, nb) \
           syscall(SYS_futex, addr, FUTEX_WAKE, nb)

#define gettid() syscall(__NR_gettid)


void barrier_wait(void)
{
int old;

  old = __sync_fetch_and_sub(&futex_var, 1);

  if (1 == old) {
    futex_wakeup(&futex_var, nb_barrier - 1);
    return;
  }

  // Wait until the futex variable is equal to 0
  old = old - 1;
  do {
    futex_wait(&futex_var, old);

    old = __sync_fetch_and_sub(&futex_var, 0);
    if (0 == old) {
      break;
    }
  } while(1);

} // barrier_wait


void *thd_main(void *p)
{
int *idx = (int *)p;

  tab[*idx] = gettid();

  barrier_wait();

  return NULL;

} // thd_main


int main(int ac, char *av[])
{
pthread_t *tid;
int        i;
int       *idx;

  // If the number of threads is passed as parameter
  if (2 == ac) {

    nb_barrier = atoi(av[1]) + 1;

  } else {

    // Default value
    nb_barrier = 3 + 1;

  }

  futex_var = nb_barrier;

  tab = (pid_t *)malloc(nb_barrier * sizeof(pid_t));
  idx = (int *)malloc(nb_barrier * sizeof(int));
  tid = (pthread_t *)malloc(nb_barrier * sizeof(pthread_t)); 

  tab[0] = gettid();

  for (i = 1; i < nb_barrier; i ++) {
    idx[i] = i;
    pthread_create(&(tid[i]), NULL, thd_main, (void *)&(idx[i]));
  }

  barrier_wait();

  free(idx);

  for (i = 0; i < nb_barrier; i ++) {
    printf("tab[%d] = %d\n", i, tab[i]);
  }

  free(tab);

  for (i = 1; i < nb_barrier; i ++) {
    pthread_join(tid[i], NULL);
  }

  free(tid);

  return 0;

} // main

The execution of the program displays the identifiers of all the threads:

$ ./barrier 
tab[0] = 6560
tab[1] = 6561
tab[2] = 6562
tab[3] = 6563

Without the barrier, the main thread couldn't know when the table is fully populated without looping to scan the table until all the entries are different than 0. The barrier really simplifies the life.

The barrier is implemented with a futex variable named futex_var. It is initialized with the number of threads that will be suspended on the barrier (the secondary threads and the main thread). In the barrier_wait() function, the futex variable is atomically decremented thanks to the __sync_fetch_and_sub() built-in:

If the previous value of the futex is not 1, this means that some threads still not have called barrier_wait(). So, the calling thread is put to sleep with a call to futex() and the FUTEX_WAIT operation until the futex variable is set to 0. As usual, it is important to pass the current value of the variable when calling futex() as the latter atomically check this value prior to suspend the calling thread. If concurrent threads succeed to decrement the futex variable, the current thread would not be suspended and miss the FUTEX_WAKE event. In this case, the futex() system call returns the EAGAIN error;
If the previous value of the futex variable was 1, this means that the caller is the last thread calling barrier_wait() and consequently the futex variable is now equal to 0. A call to futex() with the FUTEX_WAKE operation is done to wakeup all the threads suspended on the barrier. It is important to pass "nb_barrier - 1" (i.e. the total number of threads except the current one which is not suspended) to the futex() system call otherwise, the kernel would wake-up only some threads and the remaining would wait indefinitely on the barrier.
When the futex_wait() macro returns, the current thread checks the current value of the futex variable (0 is passed to __sync_fetch_and_sub() built-in). This manages the case where concurrent threads decremented the variable just before the current thread calls futex_wait() and so, the value of the futex is no longer the value got from the beginning of the barrier_wait() function. In this case, the futex variable is at least one unit less. If it is 0, we go out of the loop. As already said above, in this case, the futex() system call returns the EAGAIN error.

When the program is run under the control of strace, we can see the case where EAGAIN is returned when the number of concurrent threads is huge (it depends of course on the hardware performances of the machine). In our configuration, we begin to see some EAGAIN errors when the number of threads is bigger than 2000:

$ strace -f ./barrier 4000
[...]
[pid 16413] clone(child_stack=0x7fc1d926ffb0, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID 
[pid 16414] <... gettid resumed>)       = 16414
[pid 16415] gettid( 
[pid 16414] futex(0x55ebf3432024, FUTEX_WAIT, 4000, NULL 
strace: Process 16416 attached
[pid 16413] <... clone resumed>, parent_tid=[16416], tls=0x7fc1d9270700, child_tidptr=0x7fc1d92709d0) = 16416
[pid 16415] <... gettid resumed>)       = 16415
[pid 16413] mmap(NULL, 8392704, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0 
[pid 16416] set_robust_list(0x7fc1d92709e0, 24 
[pid 16413] <... mmap resumed>)         = 0x7fc1d826f000
[pid 16416] <... set_robust_list resumed>) = 0
[pid 16415] futex(0x55ebf3432024, FUTEX_WAIT, 3999, NULL 
[pid 16413] mprotect(0x7fc1d8270000, 8388608, PROT_READ|PROT_WRITE 
[pid 16416] gettid( 
[pid 16413] <... mprotect resumed>)     = 0
[pid 16416] <... gettid resumed>)       = 16416
[pid 16413] clone(child_stack=0x7fc1d8a6efb0, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID 
[pid 16416] futex(0x55ebf3432024, FUTEX_WAIT, 3998, NULL 
strace: Process 16417 attached
[pid 16413] <... clone resumed>, parent_tid=[16417], tls=0x7fc1d8a6f700, child_tidptr=0x7fc1d8a6f9d0) = 16417
[pid 16417] set_robust_list(0x7fc1d8a6f9e0, 24 
[pid 16413] mmap(NULL, 8392704, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0 
[pid 16417] <... set_robust_list resumed>) = 0
[pid 16413] <... mmap resumed>)         = 0x7fc1d7a6e000
[pid 16417] gettid( 
[pid 16413] mprotect(0x7fc1d7a6f000, 8388608, PROT_READ|PROT_WRITE 
[pid 16417] <... gettid resumed>)       = 16417
[pid 16413] <... mprotect resumed>)     = 0
[pid 16417] futex(0x55ebf3432024, FUTEX_WAIT, 3997, NULL 
[pid 16413] clone(child_stack=0x7fc1d826dfb0, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tid=[16418], tls=0x7fc1d826e700, child_tidptr=0x7fc1d826e9d0) = 16418
strace: Process 16418 attached
[...]
[pid 20413] futex(0x55ebf3432024, FUTEX_WAKE, 4000 
[...]
[pid 16669] <... futex resumed>)        = -1 EAGAIN (Resource temporarily unavailable)
[...]
[pid 20389] <... futex resumed>)        = 0
[pid 16413] <... futex resumed>)        = 0
[pid 20413] <... futex resumed>)        = 4000
[pid 20412] <... futex resumed>)        = 0
[pid 20388] <... futex resumed>)        = 0
[...]

In the preceding we can also see the call to futex() with the FUTEX_WAKE operation returning the number of threads which have been woken up (4000).

In GLIBC 2.31, the barrier are managed with services like pthread_barrier_init() or pthread_barrier_wait(). Those services, compliant with POSIX specification, are a quite more elaborated than our preceding example. Moreover, the pthread_barrier_wait() returns 0 in all the calling threads except the one which reached the barrier at last and for which PTHREAD_BARRIER_SERIAL_THREAD is returned. The preceding example can be modified to behave the same. In the following source code, the modification are highlighted in red. The special value BARRIER_SERIAL_THREAD is returned in the thread which wakes up the other ones.

barrier2.c

#include <stdlib.h>
#include <stdio.h>
#include <pthread.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/futex.h>

static int futex_var;

static int nb_barrier;

static pid_t *tab;


#define futex_wait(addr, val) \
           syscall(SYS_futex, addr, FUTEX_WAIT, val, NULL)

#define futex_wakeup(addr, nb) \
           syscall(SYS_futex, addr, FUTEX_WAKE, nb)

#define gettid() syscall(__NR_gettid)


#define BARRIER_SERIAL_THREAD 1



int barrier_wait(void)
{
int old;

  old = __sync_fetch_and_sub(&futex_var, 1);

  if (1 == old) {
    futex_wakeup(&futex_var, nb_barrier - 1);
    return BARRIER_SERIAL_THREAD;
  }

  // Wait until the futex variable is equal to 0
  old = old - 1;
  do {
    futex_wait(&futex_var, old);

    old = __sync_fetch_and_sub(&futex_var, 0);
    if (0 == old) {
      break;
    }
  } while(1);

  return 0;

} // barrier_wait


void *thd_main(void *p)
{
int *idx = (int *)p;
int  rc;

  tab[*idx] = gettid();

  rc = barrier_wait();
  if (BARRIER_SERIAL_THREAD == rc) {
    printf("Thread %d was the last on the barrier\n", *idx);
  }

  return NULL;

} // thd_main


int main(int ac, char *av[])
{
pthread_t *tid;
int        i;
int       *idx;
int        rc;

  // If the number of threads is passed as parameter
  if (2 == ac) {

    nb_barrier = atoi(av[1]) + 1;

  } else {

    // Default value
    nb_barrier = 3 + 1;

  }

  futex_var = nb_barrier;

  tab = (pid_t *)malloc(nb_barrier * sizeof(pid_t));
  idx = (int *)malloc(nb_barrier * sizeof(int));
  tid = (pthread_t *)malloc(nb_barrier * sizeof(pthread_t)); 

  tab[0] = gettid();

  for (i = 1; i < nb_barrier; i ++) {
    idx[i] = i;
    pthread_create(&(tid[i]), NULL, thd_main, (void *)&(idx[i]));
  }

  rc = barrier_wait();
  if (BARRIER_SERIAL_THREAD == rc) {
    printf("Main thread was the last on the barrier\n");
  }

  free(idx);

  for (i = 0; i < nb_barrier; i ++) {
    printf("tab[%d] = %d\n", i, tab[i]);
  }

  free(tab);

  for (i = 1; i < nb_barrier; i ++) {
    pthread_join(tid[i], NULL);
  }

  free(tid);

  return 0;

} // main

$ ./barrier2 200
tab[0] = 6659
tab[1] = 6660
tab[2] = 6661
tab[3] = 6662
tab[4] = 6663
tab[5] = 6664
tab[6] = 6665
[...]
tab[50] = 6709
tab[51] = 6710
Thread 190 was the last on the barrier
tab[52] = 6711
tab[53] = 6712
[...]
tab[199] = 6858
tab[200] = 6859

5. Detailed study

This chapter shows the hidden part of the futex as it presents their implementation in the Linux kernel 5.4. With this knowledge, one can use the service more efficiently.

5.1. Management of the futex in the kernel

The Fast User Space Mutexes (called "futex" by Rusty Russell from IBM company [1]) are proposed by the Linux kernel with various flags in init/Kconfig file:

init/Kconfig

config FUTEX
	bool "Enable futex support" if EXPERT
	default y
	imply RT_MUTEXES
	help
	  Disabling this option will cause the kernel to be built without
	  support for "fast userspace mutexes".  The resulting kernel may not
	  run glibc-based applications correctly.

config FUTEX_PI
	bool
	depends on FUTEX && RT_MUTEXES
	default y

config HAVE_FUTEX_CMPXCHG
	bool
	depends on FUTEX
	help
	  Architectures should select this if futex_atomic_cmpxchg_inatomic()
	  is implemented and always working. This removes a couple of runtime
	  checks.

The heart of the implementation is located in kernel/futex.c into which the introductory comment sums up the principle:

kernel/futex.c

 * The waiter reads the futex value in user space and calls
 * futex_wait(). This function computes the hash bucket and acquires
 * the hash bucket lock. After that it reads the futex user space value
 * again and verifies that the data has not changed. If it has not changed
 * it enqueues itself into the hash bucket, releases the hash bucket lock
 * and schedules.
 *
 * The waker side modifies the user space value of the futex and calls
 * futex_wake(). This function computes the hash bucket and acquires the
 * hash bucket lock. Then it looks for waiters on that futex in the hash
 * bucket and wakes them.

The futex are stored in a hash table named __futex_data which entries are defined as:

/*
 * The base of the bucket array and its size are always used together
 * (after initialization only in hash_futex()), so ensure that they
 * reside in the same cacheline.
 */
static struct {
	struct futex_hash_bucket *queues;
	unsigned long            hashsize;
} __futex_data __read_mostly __aligned(2*sizeof(long));

The entries are cache aligned for optimization purposes (we don't want several entries on the same cache line which would cause false sharing).
Each entry point on a queue of struct futex_hash_bucket records:

/*
 * Hash buckets are shared by all the futex_keys that hash to the same
 * location.  Each key may have multiple futex_q structures, one for each task
 * waiting on a futex.
 */
struct futex_hash_bucket {
	atomic_t waiters;
	spinlock_t lock;
	struct plist_head chain;
} ____cacheline_aligned_in_smp;

The hash function is:

The hash table is allocated with the alloc_large_system_hash() defined in mm/page_alloc.c. This function displays a message on the console such as:

$ dmesg
[...]
[    0.433328] futex hash table entries: 1024 (order: 4, 65536 bytes, linear)
[...]

To be continued

(French to english translation in progress)...

7. References

"Fuss, Futexes and Furwocks: Fast Userlevel Locking in Linux" par Hubertus Franke et Rusty Russel
http://www.kernel.org/doc/ols/2002/ols2002-pages-479-495.pdf
"Built-in functions for atomic memory access" extrait du manuel du compilateur GCC
http://gcc.gnu.org/onlinedocs/gcc-4.1.1/gcc/Atomic-Builtins.html
The atomic operations
https://en.wikipedia.org/wiki/Linearizability
Compare And Swap operation
http://en.wikipedia.org/wiki/Compare-and-swap
Intel 64 and IA-32 Architectures - Software Developer’s Manual - Volume 2 (2A, 2B & 2C): Instruction Set Reference, A-Z
Load-link/store-conditional
http://en.wikipedia.org/wiki/Load-link/store-conditional
Principle of the hash table
https://en.wikipedia.org/wiki/Hash_table
"Futex are tricky" par Ulrich Drepper
http://people.redhat.com/drepper/futex.pdf
Synchronization barriers
https://en.wikipedia.org/wiki/Barrier_%28computer_science%29
Thundering herd problem
http://en.wikipedia.org/wiki/Thundering_herd_problem
Priority inversion
http://en.wikipedia.org/wiki/Priority_inversion
Priority inheritance
http://en.wikipedia.org/wiki/Priority_inheritance
Readers-writer lock
http://en.wikipedia.org/wiki/Readers%E2%80%93writer_lock
"Futex Cheat Sheet"
http://locklessinc.com/articles/futex_cheat_sheet/
Mutexes and Condition Variables using Futexes
http://locklessinc.com/articles/mutex_cv_futex/
Deadlock
https://en.wikipedia.org/wiki/Deadlock
Starvation
https://en.wikipedia.org/wiki/Starvation_(computer_science)
Native Posix Thread Library
http://en.wikipedia.org/wiki/Native_POSIX_Thread_Library
The semaphores
https://en.wikipedia.org/wiki/Semaphore_(programming)
ARM synchronization primitives
http://infocenter.arm.com/help/topic/com.arm.doc.dht0008a/DHT0008A_arm_synchronization_primitives.pdf
ABA problem
https://en.wikipedia.org/wiki/ABA_problem

About the author

The author is an engineer in computer sciences located in France. He can be contacted here.