Sunday, January 20, 2013

Ticket Spinlocks


Spinlocks, as the name suggests, are a kind of lock where you spin until you own the lock.
Traditional mutex and semaphore implementations put you on a sleeping wait queue in the kernel until you are granted the lock.
If your critical section is small and has no sleeping paths in the code, you can use spinlocks, which are faster as they entail no context switches in the lock path.
Since you would be spinning trying to grab the lock, you should not use them as a coarse-grained lock, which could entail sleeping, system call, and context-switching overheads. You want the lock region to be short and the updates to be fast and sweet.
The Linux kernel implements ticket spinlocks, a rewrite of the traditional decrement-and-spin sequence for grabbing the lock. The rewrite also adds fairness, ensuring that threads don't get starved for long durations.
Traditionally, spinlock implementations work by decrementing the lock value atomically.
If the value becomes less than 0, you spin waiting for the lock to become greater than 0, at which point you try to grab it by decrementing it again.
So in the locked state it's 0, and in the unlocked state it's 1. The thread that unlocks it simply sets the value of the lock to 1.
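Here is a minimal user space sketch of that traditional decrement-and-spin lock, using C11 atomics; the type and function names are mine and purely illustrative.

```c
#include <stdatomic.h>

/* 1 = unlocked, 0 = locked, negative = locked with waiters piled up. */
typedef struct {
    atomic_int value;
} old_spinlock_t;

static void old_spin_lock(old_spinlock_t *lock)
{
    for (;;) {
        /* Atomic decrement: if the previous value was 1, we just took
         * the lock from unlocked (1) to locked (0) and now own it. */
        if (atomic_fetch_sub(&lock->value, 1) == 1)
            return;
        /* Otherwise spin until the owner sets the value back to 1,
         * then race everyone else to decrement it again. */
        while (atomic_load(&lock->value) <= 0)
            ;   /* busy wait */
    }
}

static void old_spin_unlock(old_spinlock_t *lock)
{
    /* The owner simply sets the value back to 1 (unlocked). */
    atomic_store(&lock->value, 1);
}
```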
The problem with that approach was that it didn't ensure fairness when the lock is being contended by multiple CPUs. In the unlocked state you could have multiple CPUs racing to decrement and grab the lock, which can starve some CPUs for long durations.
The Linux kernel solved this problem with the ticket spinlock. It's similar in nature to the token systems you find at shops and offices to maintain queue discipline for processing jobs.
You enter the shop, press a button to get your ticket number in the line, and then wait for your number to come up by watching the current number being served.
Ticket spinlocks work the same way.
The lock holds a head and a tail ticket number.
The tail is the ticket number granted to the next thread that lines up waiting to own the lock.
The head indicates the current ticket number being processed. 
A new thread that wants the lock atomically reads the current tail as its own ticket number in the line while incrementing the tail for the next arrival.
If the head matches its ticket, the thread can take the lock right away, since that indicates it is its turn. On a mismatch, the thread spins waiting for the head to reach its position in the line (its tail/ticket value).
It spins in a loop by monitoring the current value of the head and comparing it to its allotted tail ticket number.
To unlock, the thread just increments the head, handing the turn to the waiter whose allotted ticket now matches the head.
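To make the mechanics concrete, here is a minimal user space sketch of a ticket spinlock using C11 atomics. The field and function names are mine; the kernel's real implementation packs the head and tail into a single word and handles memory ordering and architecture-specific details omitted here.

```c
#include <stdatomic.h>

/* tail: next ticket to hand out; head: ticket currently being served. */
typedef struct {
    atomic_uint head;
    atomic_uint tail;
} ticket_spinlock_t;

static void ticket_spin_lock(ticket_spinlock_t *lock)
{
    /* Take a ticket: atomically read the current tail and bump it by one. */
    unsigned int my_ticket = atomic_fetch_add(&lock->tail, 1);

    /* Spin until the head reaches our ticket -- then it is our turn. */
    while (atomic_load(&lock->head) != my_ticket)
        ;   /* busy wait */
}

static void ticket_spin_unlock(ticket_spinlock_t *lock)
{
    /* Serve the next waiter in line. */
    atomic_fetch_add(&lock->head, 1);
}
```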
This mechanism ensures lock fairness, without the free-for-all you get when all the threads spin waiting for a single global lock value to flip back to 1, the unlocked state.
I have a user space spinlock implementation that uses the same ticketing principle to hand out locks fairly, similar to the technique used by the Linux kernel's spinlock implementation.
It's simple to use from user space code and can be used as an alternative to the pthread_spin_lock APIs.
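As a rough idea of how such a lock slots in where you would otherwise use pthread_spin_lock, here is a hypothetical usage example built on the illustrative ticket_spinlock_t sketched above; it is not the actual API of my library.

```c
#include <pthread.h>
#include <stdio.h>

static ticket_spinlock_t counter_lock;    /* zero-initialized: head == tail == 0 */
static unsigned long counter;

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        ticket_spin_lock(&counter_lock);
        counter++;                         /* short, non-sleeping critical section */
        ticket_spin_unlock(&counter_lock);
    }
    return NULL;
}

int main(void)
{
    pthread_t threads[4];

    for (int i = 0; i < 4; i++)
        pthread_create(&threads[i], NULL, worker, NULL);
    for (int i = 0; i < 4; i++)
        pthread_join(threads[i], NULL);

    printf("counter = %lu\n", counter);    /* expect 400000 */
    return 0;
}
```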
Ticket spinlocks ensure fairness, but they are prone to cache line misses under contention.
Since the spinlock is usually embedded in a data structure that shares a single cache line with the data it protects, every waiter constantly re-reading the head or current ticket number pulls that cache line into the shared state, while the lock owner needs it in the exclusive state to update it. Under high contention this ping-ponging results in cache line misses for the owner and ends up slowing things down.
The Linux kernel recently had a patch from Rik van Riel that introduces a delay in the contention path: how long a waiter pauses between reads of the head depends on its position in the line.
For example, loop 50 times before re-reading the current ticket number for every waiter that we are behind (our ticket minus the head tells us how far back in the line we are).
Or loop X times, where X is an adaptive per-CPU delay, before re-checking the current head against our ticket number.
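To illustrate the first of those two ideas, here is a sketch of proportional backoff layered on the ticket lock from earlier; the constant, the cpu_relax() helper and the function name are placeholders of mine, not the values or code from the actual kernel patch.

```c
#define SPINS_PER_WAITER 50   /* arbitrary illustrative constant */

static inline void cpu_relax(void)
{
    /* Compiler barrier as a stand-in; on x86 this would be the "pause" insn. */
    __asm__ __volatile__("" ::: "memory");
}

static void ticket_spin_lock_backoff(ticket_spinlock_t *lock)
{
    unsigned int my_ticket = atomic_fetch_add(&lock->tail, 1);

    for (;;) {
        unsigned int head = atomic_load(&lock->head);
        if (head == my_ticket)
            return;                        /* our turn */

        /* Pause longer the further back in line we are, so waiters at the
         * end of the queue re-read the shared cache line less often. */
        unsigned int waiters_ahead = my_ticket - head;
        for (unsigned int i = 0; i < waiters_ahead * SPINS_PER_WAITER; i++)
            cpu_relax();
    }
}
```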
This patch from Rik van Riel to ticket spinlocks is an interesting one and only makes the case stronger for lockless algorithms and functional programming, which try to minimize mutable state. But that's easier said than done.
So till then, check out the ticket spinlock patch to get an idea of how locks can adversely impact the scalability of systems with shared cache lines. And yes, don't forget to check out my user space implementation of ticket spinlocks and see if it can fit into your development lifestyle :)