Showing posts with label kernel. Show all posts
Showing posts with label kernel. Show all posts

Sunday, January 20, 2013

Ticket Spinlocks


Spinlocks as the name suggests, is a kind of lock where you spin until you own the lock.
Traditional mutexes, semaphore implementations put you into a sleeping wait queue in the kernel until you are granted the lock.
If your critical section is small and you don't have any sleeping paths in the code, you can use spinlocks which are faster as they entail no context switches in the lock path.
Since you would be spinning trying to grab the lock, you should not use them as a coarse grained lock which could entail sleeping, system call, context switching overheads. You want the lock region to be short and updates to be fast and sweet.
Linux kernel implements ticketing spinlocks which was a rewrite from the traditional decrement and wait spinning sequence to grab the lock. It also ensured fairness ensuring that threads don't get starved for long durations.
Traditionally, spinlock implementations work by decrementing the lock value atomically.
If the value becomes less than 0, then you spin waiting for the lock to become greater than 0 in which case you try to grab the lock by decrementing it again.
So in locked state, its 0 and in unlocked state, its 1. The thread that unlocks it simply sets the value of the lock to 1.
The problem with that approach was that, it didn't ensure fairness when the lock is being contented by multiple CPUs. Since in unlocked state, you could have multiple guys trying to decrement and grab the lock. This can starve some CPU's from grabbing the lock for long durations.
Linux kernel solved this problem with a ticket spinlock. Its similar in nature to the ones you find in departments with a token system to maintain queue discipline for processing jobs.
You enter into the shop, press a button to get your ticket number in the line.  And then wait for your number to arrive by monitoring the current number being processed. 
Ticket spinlocks work the same way.
The concept is similar to a ticketing system. It has a head and a tail ticket number
The tail indicates the ticket number granted to the thread trying to own the lock to wait in the line. 
The head indicates the current ticket number being processed. 
A new thread desiring to own the lock atomically reads the current tail or its own number in the line while incrementing the ticket or token number in the line.
If the head matches the tail, then the lock can be taken by the thread as it indicates that its your turn to grab the lock. In case of a mismatch, the thread spins waiting for the head to match its position in the line (/tail).
It spins in a loop by monitoring the current value of the head and comparing it to its allotted tail ticket number.
To unlock, the thread just increments the head position in the lock to indicate the next turn that would be grabbed by the thread that matches the head with its allotted tail position.
This mechanism ensures lock fairness without resorting to randomness which would result if we have all the threads spinning for a global lock to get to 1 or unlocked state.
I have a user space implementation for spinlocks that uses ticketing principles to ensure fairness and hand out locks which is similar to the technique used by the linux kernel space implementation of spinlocks.
Its simple to use for user space code and can be used as an alternative to pthread_spin_lock api's.
Ticketing spinlocks ensure fairness but they are prone to cause cacheline misses for the contention case.
Since most of the spinlock structures are embedded within data structures that are part of a single cache line, consistent monitoring of the cache line to read the current head or ticket number adversely results in cache misses for the lock owner. Since lock owner grabs the cache line in exclusive state, the shared state of the cache line because of multiple threads reading the lock state for the current ticket number of head position results in cache line misses and ends up in slowdowns when lock contention rates are high.
The linux kernel recently had a patch from Rik Van Riel that tries to introduce a delay in the contention path while monitoring the current position of the lock based on the current position in the line.
For eg:, loop 50 times before monitoring the current ticket number for every CPU that we are behind. (tail - head -- indicates how far we are in the line)
Or delay by looping X times where X is an adaptive per-cpu delay before monitoring the current position in the line with our ticket number. 
This patch from Rik Van Riel to ticketing spinlocks is an interesting one and only makes the case stronger for lockless algorithms and functional programming which try to minimize the mutable states. But thats easier said than done. 
So till then, check this ticketing spinlock patch to get an idea about how locks can adversely impact the scalability of systems with shared cache lines. And yes, don't forget to check out my user space implementation of ticketing spinlocks and see if it can fit into your development lifestyle :)

Tuesday, June 8, 2010

Examining Futexes and how it tackles THUNDERING HERD

A friend of mine had posted a futex query in a google mailing list named:apartition which mainly comprises of my friends. It was basically an strace output which had a FUTEX_WAKE returning -EAGAIN in one of his pthreads which had him concerned.


futex(0x6cebd548, FUTEX_WAKE, 1) EAGAIN (Resource temporarily unavailable


I had a theoretical idea about what futexes were in user space w.r.t NPTL glibc pthread library and how it was getting mapped in kernel space. But that wasn't sufficient to understand and answer his futex query.

Hence I used the mail as an opportunity to hack on futex source code to understand them better and in the process, also get to know the root cause for his above FUTEX-WAKE EAGAIN concerns.

Futex as the name suggests are fast user space mutexes. They were introduced primarily to avoid sleeps or system call latencies in the no-contention lock path as opposed to using system V sems which incur syscall overheads to lock/unlock. (semop)

Futex brings the atomic operations to user space on the lock variable and tries to avoid hitting the kernel in case of zero lock contention. or zero waiters. So basically its a spinlock in that case through an atomic "lock; cmpxchg" op.

These are primarily used by posix nptl threads or PTHREAD library to implement fast pthread mutex locks in user space. I am actually bypassing the internals of the user space pthread_mutex_t,pthread_cond_t management structures as thats a different blog post altogether :-).

I am also not covering the priority inheritance futex syscalls and opcodes prefixed with "pi" as those are also a separate topic altogether and come into picture with real time mutexes with different thread priorities and when PTHREAD_PRIO_INHERIT is used with pthread_mutexattr_setprotocol.

The discussions below are valid for both shared mutexes having mutexes and cond variables in shared memory with pthread_{condattr,mutexattr}_settype both having PTHREAD_PROCESS_SHARED and for process private mutexes.

So when you fire a:

pthread_mutex_lock(&mutex);


pthread library does an atomic op. to determine if the mutex is LOCKED. If the mutex is locked (value 0), then the thread pertaining to the process is put to sleep by having it enter kernel to wait on the FUTEX variable associated with the mutex. This is done through a futex syscall with FUTEX_WAIT as the opcode. and using the "mutex->futex" virtual address along with the current value of the futex.

The FUTEX_WAIT call is an indicator to the kernel to have the process sleep on a futex variable till its variable changes through say a: pthread_cond_signal or broadcast for example.

The kernel as part of FUTEX_WAIT hashes the address of the futex variable by mapping the virtual address of the futex variable to the process page through its vma_struct + offset within the page to get the futex queue representation for that futex variable from the futex hash bucket. or creating a new futex key if absent with the futex queue. Thus the kernel uses the futex variable itself to create a representation of a futex object for the thread dynamically without incurring the overhead of creating global identifiers like with sem_open, semget,shmget,msgget syscalls.

It then rechecks(to avoid lost wakeup races) if the value in the futex variable (virtual address: mutex->futex for a pthread_mutex_lock(&mutex)) has changed in which case the FUTEX has been signalled through a FUTEX_WAKE else it puts the futex queue object into the hash table by scheduling itself (task) out after queuing itself into the futex queue for that futex variable.

So basically a FUTEX_WAIT is a wait on a futex variable thats signalled by FUTEX_WAKE whenever the value of futex variable being waited for changes.

Now when a :

pthread_cond_signal(&cond)

or

pthread_mutex_unlock(&mutex);


is fired, the glibc pthread library increases the value of the futex variable associated with the cond variable or mutex (condvar->mutex OR mutex->futex) for pthread_cond_signal and pthread_mutex_unlock respectively,

sees if there were futex waiters(maintains wakeup/total sequence counters in the pthread_cond_t variable), and then fires a call to the kernel syscall: futex with opcode: FUTEX_WAKE with a value of "1" to wake up the waiters on the mutex if it finds waiters on the cond->futex or mutex->futex. The wakeup hint to the kernel being: Wake up 1 process waiting on the futex where the FUTEX_WAKE val represents the number of processes to be woken up waiting on the futex.

This is done only if there are waiters. If there are no waiters (zero-contention), then the call doesn't have to hit the kernel for FUTEX_WAKE which is the fast-path being optimized with futexes.

So the kernel futex_wake, just rehashes the futex address to the futex queue object, and then WAKES one process or whatever value was passed as hint from FUTEX_WAKE (count of processes to wake up).

Now pthread_cond_broadcast changes the equation a bit. In essence, pthread_cond_broadcast is nothing but a FUTEX_WAKE with a value of INT_MAX or ~0 that tells the kernel to wake up ALL processes (INT_MAX count of processes) sleeping on the futex queue.

But waking up all processes on a pthread_cond_broadcast would result in the famous THUNDERING HERD problem where all processes would be woken up to recheck their futexes only to have all of them requeued back again into the futex wait queue.

This THUNDERING HERD problem is prominent with process private mutexes as shared mutexes won't nearly have as much contention in shared memory as process private mutexes which could have 100/1000's of threads contending for a private mutex specific to the process.

In order to address the THUNDERING HERD problem, the futex implementation was changed to include a FUTEX_REQUEUE/FUTEX_CMP_REQUEUE op.

The trick that the REQUEUE op in futex employs in the code to get rid of the thundering herd problem is interesting.

Basically it fires a FUTEX_CMP_REQUEUE op with the address of 2 futex variables and a value of 1 indicating only 1 process to wake up from futex variable at address 1 but all (or INT_MAX remaining) processes/threads to be requeued into futex variable at address 2.

So basically it tries to wake up only 1 PROCESS at futex addres 1 to avoid the thundering herd and requeue the balance into the futex variable at address 2.

Whats this futex variable at address 2 where the processes waiting on futex at address 1 are requeued ?

Take the below sequence to understand the FUTEX_CMP_REQUEUE 2 futex operation better:



Thread 1:

pthread_mutex_lock(&mutex);

pthread_cond_wait(&cond, &mutex);

< do something as MUTEX is held after return >

pthread_mutex_unlock(&mutex);

Thread 2:
pthread_mutex_lock(&mutex);

pthread_cond_wait(&cond, &mutex);

pthread_mutex_unlock(&mutex);

Thread 3:

pthread_cond_broadcast(&cond);



Now as you know, when pthread_cond_wait goes to sleep it releases "mutex" and acquires it again after its woken up without you having to reacquire it again after pthread_cond_wait.

So thread 1 and thread 2 are waiting/sleeping on futex 1 in the kernel which is : "condvar->futex". releasing mutex->futex variable.
But when they are woken up, they _TRY_ to reacquire : "mutex->futex"

They are effectively waiting for the value of condvar->futex to be changed in kernel while on a futex queue.

When you fire:

pthread_cond_broadcast(&cond);


the pthread library tries to wake up only 1 process through a FUTEX_CMP_REQUEUE op passing:

(&condvar->futex, 1, INT_MAX, &mutex->futex);


This implies that the kernel should first try to wake up 1 process sleeping on the condvar->futex that has been changed by the broadcast from user-space and then requeue all the others sleeping on condvar->futex to now wait on "mutex->futex" .Howz that for thundering herd ?

Basically it moves the other waiting processes from condvar->futex queue into mutex->futex futex wait queue as when any process is woken up after a condwait as mentioned above, it would try to grab the mutex->futex (lock mutex after condwait). So when the mutex is unlocked again, automatically 1 process is woken up from mutex->futex as expected.



The wake all processes waiting on futex step on a broadcast step:
(FUTEX_WAKE,&mutex->futex, INT_MAX); is eliminated as all the required waiter processes were already moved to mutex->futex on a FUTEX_CMP_REQUEUE with only a 1 process wakeup to avoid the thundering herd issue with such broadcast wakeups/schedules.



However if you see the CMP_REQUEUE trick that the kernel employs to avoid a thundering herd, it can have a RACE condition when the value of "condvar->futex" or futex variable at address 1 can be CHANGED behind the back while the process enters the kernel for FUTEX_CMP_REQUEUE. with old value of futex1 through a parallel pthread_cond_wait/signal/broadcast.

So this could result in a missing signal or lost wake up if kernel doesn't recheck the value of futex1 or condvar->mutex
as effectively during that CMP_REQUEUE wake 1 window, another thread 4 could have fired a :



"pthread_cond_wait(&cond, &mutex);" changing "condvar->futex" behind the back while a CMP_REQUEUE wake 1 and requeue others is in progress.



So kernel detects this race by a recheck of the condvar->futex user space address value to see if it has changed indicating a racing pthread_cond_wait or a pthread_cond_signal/broadcast on the same cond.

When it detects such a change of the value of futex1 or a race condition, it returns -EAGAIN back to the NPTL posix pthread library FUTEX_CMP_REQUEUE call.(indirectly invoked on a pthread_cond_broadcast)

The pthread library then catches this failure (or any failure) from CMP_REQUEUE futex and then reissues the call again to the kernel with the usual FUTEX_WAKE opcode and a INT_MAX value or a wake_all tasks value to the kernel. indicating the kernel to fall back to the thundering herd wake all process step with INT_MAX processes to wake up using it as the last resort incase of a race as mentioned above with FUTEX_CMP_REQUEUE.

The addition of the FUTEX_CMP_REQUEUE futex operation in the futex syscall in the kernel is a juicy trickery to avoid thundering herd issues experienced with Operating Systems programming and hence makes for an interesting hack.