Tuesday, June 8, 2010

Examining Futexes and how it tackles THUNDERING HERD

A friend of mine had posted a futex query in a google mailing list named:apartition which mainly comprises of my friends. It was basically an strace output which had a FUTEX_WAKE returning -EAGAIN in one of his pthreads which had him concerned.

futex(0x6cebd548, FUTEX_WAKE, 1) EAGAIN (Resource temporarily unavailable

I had a theoretical idea about what futexes were in user space w.r.t NPTL glibc pthread library and how it was getting mapped in kernel space. But that wasn't sufficient to understand and answer his futex query.

Hence I used the mail as an opportunity to hack on futex source code to understand them better and in the process, also get to know the root cause for his above FUTEX-WAKE EAGAIN concerns.

Futex as the name suggests are fast user space mutexes. They were introduced primarily to avoid sleeps or system call latencies in the no-contention lock path as opposed to using system V sems which incur syscall overheads to lock/unlock. (semop)

Futex brings the atomic operations to user space on the lock variable and tries to avoid hitting the kernel in case of zero lock contention. or zero waiters. So basically its a spinlock in that case through an atomic "lock; cmpxchg" op.

These are primarily used by posix nptl threads or PTHREAD library to implement fast pthread mutex locks in user space. I am actually bypassing the internals of the user space pthread_mutex_t,pthread_cond_t management structures as thats a different blog post altogether :-).

I am also not covering the priority inheritance futex syscalls and opcodes prefixed with "pi" as those are also a separate topic altogether and come into picture with real time mutexes with different thread priorities and when PTHREAD_PRIO_INHERIT is used with pthread_mutexattr_setprotocol.

The discussions below are valid for both shared mutexes having mutexes and cond variables in shared memory with pthread_{condattr,mutexattr}_settype both having PTHREAD_PROCESS_SHARED and for process private mutexes.

So when you fire a:


pthread library does an atomic op. to determine if the mutex is LOCKED. If the mutex is locked (value 0), then the thread pertaining to the process is put to sleep by having it enter kernel to wait on the FUTEX variable associated with the mutex. This is done through a futex syscall with FUTEX_WAIT as the opcode. and using the "mutex->futex" virtual address along with the current value of the futex.

The FUTEX_WAIT call is an indicator to the kernel to have the process sleep on a futex variable till its variable changes through say a: pthread_cond_signal or broadcast for example.

The kernel as part of FUTEX_WAIT hashes the address of the futex variable by mapping the virtual address of the futex variable to the process page through its vma_struct + offset within the page to get the futex queue representation for that futex variable from the futex hash bucket. or creating a new futex key if absent with the futex queue. Thus the kernel uses the futex variable itself to create a representation of a futex object for the thread dynamically without incurring the overhead of creating global identifiers like with sem_open, semget,shmget,msgget syscalls.

It then rechecks(to avoid lost wakeup races) if the value in the futex variable (virtual address: mutex->futex for a pthread_mutex_lock(&mutex)) has changed in which case the FUTEX has been signalled through a FUTEX_WAKE else it puts the futex queue object into the hash table by scheduling itself (task) out after queuing itself into the futex queue for that futex variable.

So basically a FUTEX_WAIT is a wait on a futex variable thats signalled by FUTEX_WAKE whenever the value of futex variable being waited for changes.

Now when a :




is fired, the glibc pthread library increases the value of the futex variable associated with the cond variable or mutex (condvar->mutex OR mutex->futex) for pthread_cond_signal and pthread_mutex_unlock respectively,

sees if there were futex waiters(maintains wakeup/total sequence counters in the pthread_cond_t variable), and then fires a call to the kernel syscall: futex with opcode: FUTEX_WAKE with a value of "1" to wake up the waiters on the mutex if it finds waiters on the cond->futex or mutex->futex. The wakeup hint to the kernel being: Wake up 1 process waiting on the futex where the FUTEX_WAKE val represents the number of processes to be woken up waiting on the futex.

This is done only if there are waiters. If there are no waiters (zero-contention), then the call doesn't have to hit the kernel for FUTEX_WAKE which is the fast-path being optimized with futexes.

So the kernel futex_wake, just rehashes the futex address to the futex queue object, and then WAKES one process or whatever value was passed as hint from FUTEX_WAKE (count of processes to wake up).

Now pthread_cond_broadcast changes the equation a bit. In essence, pthread_cond_broadcast is nothing but a FUTEX_WAKE with a value of INT_MAX or ~0 that tells the kernel to wake up ALL processes (INT_MAX count of processes) sleeping on the futex queue.

But waking up all processes on a pthread_cond_broadcast would result in the famous THUNDERING HERD problem where all processes would be woken up to recheck their futexes only to have all of them requeued back again into the futex wait queue.

This THUNDERING HERD problem is prominent with process private mutexes as shared mutexes won't nearly have as much contention in shared memory as process private mutexes which could have 100/1000's of threads contending for a private mutex specific to the process.

In order to address the THUNDERING HERD problem, the futex implementation was changed to include a FUTEX_REQUEUE/FUTEX_CMP_REQUEUE op.

The trick that the REQUEUE op in futex employs in the code to get rid of the thundering herd problem is interesting.

Basically it fires a FUTEX_CMP_REQUEUE op with the address of 2 futex variables and a value of 1 indicating only 1 process to wake up from futex variable at address 1 but all (or INT_MAX remaining) processes/threads to be requeued into futex variable at address 2.

So basically it tries to wake up only 1 PROCESS at futex addres 1 to avoid the thundering herd and requeue the balance into the futex variable at address 2.

Whats this futex variable at address 2 where the processes waiting on futex at address 1 are requeued ?

Take the below sequence to understand the FUTEX_CMP_REQUEUE 2 futex operation better:

Thread 1:


pthread_cond_wait(&cond, &mutex);

< do something as MUTEX is held after return >


Thread 2:

pthread_cond_wait(&cond, &mutex);


Thread 3:


Now as you know, when pthread_cond_wait goes to sleep it releases "mutex" and acquires it again after its woken up without you having to reacquire it again after pthread_cond_wait.

So thread 1 and thread 2 are waiting/sleeping on futex 1 in the kernel which is : "condvar->futex". releasing mutex->futex variable.
But when they are woken up, they _TRY_ to reacquire : "mutex->futex"

They are effectively waiting for the value of condvar->futex to be changed in kernel while on a futex queue.

When you fire:


the pthread library tries to wake up only 1 process through a FUTEX_CMP_REQUEUE op passing:

(&condvar->futex, 1, INT_MAX, &mutex->futex);

This implies that the kernel should first try to wake up 1 process sleeping on the condvar->futex that has been changed by the broadcast from user-space and then requeue all the others sleeping on condvar->futex to now wait on "mutex->futex" .Howz that for thundering herd ?

Basically it moves the other waiting processes from condvar->futex queue into mutex->futex futex wait queue as when any process is woken up after a condwait as mentioned above, it would try to grab the mutex->futex (lock mutex after condwait). So when the mutex is unlocked again, automatically 1 process is woken up from mutex->futex as expected.

The wake all processes waiting on futex step on a broadcast step:
(FUTEX_WAKE,&mutex->futex, INT_MAX); is eliminated as all the required waiter processes were already moved to mutex->futex on a FUTEX_CMP_REQUEUE with only a 1 process wakeup to avoid the thundering herd issue with such broadcast wakeups/schedules.

However if you see the CMP_REQUEUE trick that the kernel employs to avoid a thundering herd, it can have a RACE condition when the value of "condvar->futex" or futex variable at address 1 can be CHANGED behind the back while the process enters the kernel for FUTEX_CMP_REQUEUE. with old value of futex1 through a parallel pthread_cond_wait/signal/broadcast.

So this could result in a missing signal or lost wake up if kernel doesn't recheck the value of futex1 or condvar->mutex
as effectively during that CMP_REQUEUE wake 1 window, another thread 4 could have fired a :

"pthread_cond_wait(&cond, &mutex);" changing "condvar->futex" behind the back while a CMP_REQUEUE wake 1 and requeue others is in progress.

So kernel detects this race by a recheck of the condvar->futex user space address value to see if it has changed indicating a racing pthread_cond_wait or a pthread_cond_signal/broadcast on the same cond.

When it detects such a change of the value of futex1 or a race condition, it returns -EAGAIN back to the NPTL posix pthread library FUTEX_CMP_REQUEUE call.(indirectly invoked on a pthread_cond_broadcast)

The pthread library then catches this failure (or any failure) from CMP_REQUEUE futex and then reissues the call again to the kernel with the usual FUTEX_WAKE opcode and a INT_MAX value or a wake_all tasks value to the kernel. indicating the kernel to fall back to the thundering herd wake all process step with INT_MAX processes to wake up using it as the last resort incase of a race as mentioned above with FUTEX_CMP_REQUEUE.

The addition of the FUTEX_CMP_REQUEUE futex operation in the futex syscall in the kernel is a juicy trickery to avoid thundering herd issues experienced with Operating Systems programming and hence makes for an interesting hack.


  1. Hi Karthick,

    Thanks for Informative post.
    Currently I'm trying to investigate an issue of randomly more CPU(not high) consumption of random application thread, running on windriver 2.0 64 bit.
    Timing the thread for CPU, took strace which looks like(mid-snippet):
    126sendto(65, "\t\0\0\0\0\0\0\0\305\n\0\0\0\0\0\0\0\0\0\0\0\0\0\0\377"..., 164, 0, {sa_family=AF_INET, sin_port=htons(7701), sin_addr=inet_addr("")}, 16) = 164
    futex(0x2b22b8000020, FUTEX_WAIT, 2, NULL) = 0
    futex(0x2b22b8000020, FUTEX_WAIT, 2, NULL) = -1 EAGAIN (Resource temporarily unavailable)
    futex(0x2b22b8000020, FUTEX_WAIT, 2, NULL) = 0
    futex(0x2b22b8000020, FUTEX_WAIT, 2, NULL) = 0
    futex(0x2b22b8000020, FUTEX_WAIT, 2, NULL) = 0
    futex(0x2b22b8000020, FUTEX_WAIT, 2, NULL) = 0
    futex(0x2b22b8000020, FUTEX_WAIT, 2, NULL) = -1 EAGAIN (Resource temporarily unavailable)
    futex(0x2b22b8000020, FUTEX_WAIT, 2, NULL) = 0
    futex(0x2b22b8000020, FUTEX_WAIT, 2, NULL) = 0
    futex(0x2b22b8000020, FUTEX_WAIT, 2, NULL) = 0
    futex(0x2b22b8000020, FUTEX_WAIT, 2, NULL) = 0
    futex(0x2b22b8000020, FUTEX_WAIT, 2, NULL) = 0
    futex(0x2b22b8000020, FUTEX_WAIT, 2, NULL) = 0
    futex(0x2b22b8000020, FUTEX_WAIT, 2, NULL) = 0
    futex(0x2b22b8000020, FUTEX_WAIT, 2, NULL) = 0
    futex(0x2b22b8000020, FUTEX_WAKE, 1) = 0
    futex(0x2b22b8000020, FUTEX_WAIT, 2, NULL) = 0
    futex(0x2b22b8000020, FUTEX_WAIT, 2, NULL) = 0
    futex(0x2b22b8000020, FUTEX_WAIT, 2, NULL) = 0
    futex(0x2b22b8000020, FUTEX_WAIT, 2, NULL) = 0
    futex(0x2b22b8000020, FUTEX_WAIT, 2, NULL) = -1 EAGAIN (Resource temporarily unavailable)

    And with more option:
    Process 7622 attached - interrupt to quit
    Process 7622 detached
    % time seconds usecs/call calls errors syscall
    ------ ----------- ----------- --------- --------- ----------------
    100.00 0.045077 67 676 82 futex
    0.00 0.000000 0 5 write
    0.00 0.000000 0 9 stat
    0.00 0.000000 0 183 writev
    0.00 0.000000 0 243 sendto
    0.00 0.000000 0 140 msgsnd
    0.00 0.000000 0 427 times
    0.00 0.000000 0 183 gettid
    ------ ----------- ----------- --------- --------- ----------------
    100.00 0.045077 1866 82 total

    Application does use pthread rw locks/malloc etc.
    Have to yet confirm the leap second bug.
    Is there a way to back trace futex to specific api?

    1. Strange. I never got an email notification for this comment of yours.
      Just saw it today. So apologize for the belated response.
      I don't think you can map the futex wait op shown in strace above to the pthread_cond_wait in code.
      The FUTEX_WAIT op indicates a pthread_mutex_lock or a pthread_cond_wait.
      The EWOULDBLOCK or the EAGAIN return code for the FUTEX_WAIT op only indicates that the futex value changed while trying to queue itself into the futex wait queue thereby indicating a pthread_mutex_unlock or a pthread_cond_{signal,broadcast} on the futex.

      If you are lucky and assuming that the futex that is experiencing an EGAIN on a wait or doing the futex wait is not a glibc specific internal futex for the pthread mutexes itself, then you can try printing the pthread_mutex_t and pthread_cond_t values in your code.
      Map the output values to the futex uaddr above : 0x2b22b8000020
      and see if there is a hit before mapping the mutex/cond var to the thread doing the lock/cond_wait.
      Just dump the values of the mutexes and cond in your code.
      As a quick example:
      static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
      static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
      printf("lock at [%p], cond at [%p]\n", (void*)&lock, (void*)&cond);

      You might be seeing an overlap with the futex wait uaddress dumped by strace before you can correlate it to the application thread doing the lock/wait.
      (mutex_t should map directly to the futex address. cond lock futex variable should be at offset +4 to the cond address.)