Tuesday, June 22, 2010

Using static branch prediction with gcc

Recently I extended the heap manager in our product with runtime fencing code, enabled through an environment variable, that detects heap overruns and reads/writes to freed memory and immediately crashes (core-dumps) the process at the faulting instruction. It is a fast, acceptable way to switch on diagnostic code in production or in the field, where one does not have the liberty of running under valgrind to catch such common bugs, which can otherwise turn into huge headaches depending on the memory corruption patterns.

The most common way to write such code is to hide it behind #ifdef macros so it is compiled out, but then you have to recompile to enable the helper code. Instead, I shoved the special code behind a branch on an environment variable: if the env variable is defined, the fencing code gets executed.
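For contrast, here is a minimal sketch of the compile-time #ifdef style this replaces (the HEAP_CHECKER macro name is just illustrative):


/* compile-time switch: the fencing path disappears entirely unless
   the binary is rebuilt with -DHEAP_CHECKER */
void *heapAllocate(int bytes)
{
#ifdef HEAP_CHECKER
    return heapAllocateFence(bytes);
#else
    /* main heap allocator fast path */
#endif
}
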

With the environment variable approach, you would obviously write something like this:


void *heapAllocate(int bytes)
{
    if (getenv("HEAP_CHECKER"))
        return heapAllocateFence(bytes);
    else
    {
        /* main heap allocator fast path */
    }
}


Well, the problem with the above code is the conditional branch on the environment variable sitting in the fast path: heapAllocate could be called millions of times. Reading such code, your mind staggers a bit and assumes it is going to perform slowly because of the branch. More often than not it won't: all modern CPUs have very good dynamic branch predictors that cache prediction hints in branch prediction buffers.

But depending on the architecture and the CPU's branch prediction algorithm, a misprediction can cause instruction fetch or pipeline stalls in the CPU's fetch -> decode -> execute -> writeback pipeline.

Branch predictors are a complex topic; you can refer to: Branch predictor to get a rough idea of how a CPU predicts branches using branch history and branch target buffers.

In order to overcome any potential performance bottleneck across architectures, I used gcc's static branch prediction hint, __builtin_expect, which is heavily used by the Linux kernel and also by glibc code.

GCC provides a builtin function, __builtin_expect(expression, value), that can be used to give GCC static branch prediction hints. (In fact, when optimizing, GCC also applies its own static prediction heuristics implicitly.) Using __builtin_expect you tell the compiler to generate code that sets up branch prediction hints for the target architecture.

Before I explain it a bit further, let's rephrase the above heapAllocate routine to use the gcc static branch prediction hint.


void *heapAllocate(int bytes)
{
    if (__builtin_expect(getenv("HEAP_CHECKER") != NULL, 0))
    {
        /* unlikely, special fencing path */
        return heapAllocateFence(bytes);
    }
    /* main allocation fast path */
}


The __builtin_expect above tells the compiler to generate static branch prediction hints by stating that the expression will almost always be FALSE, i.e. "0". Using this you can build the likely and unlikely macros used heavily by the Linux kernel, which your code can then apply, for example, to unlikely execution paths such as error paths.


/* the !! normalizes any nonzero expression to exactly 1,
   so it matches the expected value -- same trick the kernel uses */
#define likely(expr)   __builtin_expect(!!(expr), 1)
#define unlikely(expr) __builtin_expect(!!(expr), 0)


And then you can use them in your error paths like:

map = malloc(bytes);
if (unlikely(map == NULL))
{
    /* error path */
}
else
    free(map);


Now, what code does the compiler generate to enable branch prediction hints?
We can see this by disassembling the above malloc branch with and without the unlikely/__builtin_expect hint, and of course without optimization turned on, since optimization enables static prediction implicitly.

Below is the x86_64 assembly without __builtin_expect in the malloc error path. So it is effectively: map = malloc(bytes); if (map == NULL) { errorpath; } else free(map);


call    malloc
movq    %rax, -8(%rbp)      # map = malloc(...)
cmpq    $0, -8(%rbp)        # map == NULL ?
jne     .L2                 # non-NULL: jump to the free() path
movl    $0, %edi
call    exit                # NULL: error path, exit
.L2:
movq    -8(%rbp), %rax
movq    %rax, %rdi
call    free                # free(map)


So you can see above that the compiler doesn't generate anything special. The cmpq instruction after the return from malloc checks for NULL, jumps to label .L2 if the pointer is non-NULL, and exits if it is NULL. Nothing new is being done in this code generation.

Let's see the result of: if (unlikely(map == NULL)) { error; } else free(map);
Now we are giving GCC a static branch prediction hint for the malloc error path.


call    malloc
movq    %rax, -8(%rbp)      # map = malloc(...)
cmpq    $0, -8(%rbp)
sete    %al                 # %al = 1 if map == NULL, else 0
movzbl  %al, %eax           # zero-extend the flag
testq   %rax, %rax
je      .L2                 # flag is 0 (map non-NULL): likely path
movl    $0, %edi
call    exit                # unlikely error path
.L2:
movq    -8(%rbp), %rax
movq    %rax, %rdi
call    free                # free(map)


All of a sudden the instruction stream looks busier. The compiler is now using the sete instruction, which sets the target register (%al) to 1 if the ZERO flag is set and to 0 if it is clear, materializing the comparison result; it then tests that value and takes the je branch to .L2 for the likely non-NULL case. In effect the prediction values 1 and 0 are encoded for the NULL/error branch versus the .L2 branch. x86 CPUs, for example, cache branch outcomes in the branch history buffer for subsequent use, so when this code runs again the error branch is predicted not taken and execution jumps to the prefetched branch target at .L2 while the pipeline is filled in parallel. This optimizes the most likely branch by keeping the instruction cache warm with the most likely branch targets.

This could also effectively mean that the error-branch instructions are never prefetched into the CPU's instruction cache at all, even though the kernel loaded them into main memory. The static hints come with a slightly larger code footprint, since extra instructions are emitted, but they can optimize the common case across various architectures.

But it seems that the latest Intel Core 2 CPUs don't really need static hint instruction sequences like sete/setne; you can effectively do away with the __builtin_expect hints and rely on the CPU's speculative branch prediction buffers to do the work. Still, it is better to use the GCC hints and let GCC take care of generating the most efficient code for each architecture, so the programmer needn't bother himself with the branches in his code as long as he relies on the static hint to work.
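As a closing refinement (a sketch of my own, not necessarily what the production allocator does): getenv itself walks the environment on every call, so the result can be sampled once and cached, leaving only a cheap, statically hinted branch in the fast path:


#include <stdlib.h>

static int heapCheckerEnabled = -1;   /* -1: not yet sampled */

void *heapAllocate(int bytes)
{
    /* sample the environment exactly once */
    if (unlikely(heapCheckerEnabled < 0))
        heapCheckerEnabled = (getenv("HEAP_CHECKER") != NULL);

    if (unlikely(heapCheckerEnabled))
        return heapAllocateFence(bytes);

    /* main allocation fast path */
}
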

Tuesday, June 8, 2010

Examining futexes and how they tackle the THUNDERING HERD

A friend of mine posted a futex query to a Google mailing list named apartition, which consists mainly of my friends. It was basically an strace output in which a FUTEX_WAKE returned -EAGAIN in one of his pthreads, which had him concerned.


futex(0x6cebd548, FUTEX_WAKE, 1) = -1 EAGAIN (Resource temporarily unavailable)


I had a theoretical idea of what futexes are in user space w.r.t. the NPTL glibc pthread library and how they map into kernel space, but that wasn't sufficient to understand and answer his futex query.

Hence I used the mail as an opportunity to hack on the futex source code to understand it better and, in the process, root-cause his FUTEX_WAKE EAGAIN concern above.

Futexes, as the name suggests, are fast user space mutexes. They were introduced primarily to avoid sleeps and system call latencies in the no-contention lock path, as opposed to System V semaphores, which incur a syscall overhead (semop) on every lock/unlock.

A futex brings the atomic operations on the lock variable into user space and avoids hitting the kernel when there is zero lock contention, i.e. zero waiters. In that case the lock is taken entirely in user space with a single atomic "lock; cmpxchg" op.

These are primarily used by the POSIX NPTL pthread library to implement fast pthread mutex locks in user space. I am bypassing the internals of the user space pthread_mutex_t/pthread_cond_t management structures, as that's a different blog post altogether :-).

I am also not covering the priority-inheritance futex syscalls and opcodes prefixed with "pi"; those are a separate topic altogether and come into the picture with real-time mutexes of differing thread priorities, when PTHREAD_PRIO_INHERIT is used with pthread_mutexattr_setprotocol.

The discussion below is valid both for shared mutexes, with the mutexes and cond variables living in shared memory and marked PTHREAD_PROCESS_SHARED via pthread_{condattr,mutexattr}_setpshared, and for process-private mutexes.

So when you fire a:

pthread_mutex_lock(&mutex);


the pthread library does an atomic op to determine whether the mutex is LOCKED. If the mutex is already locked (a nonzero futex value), the calling thread is put to sleep by having it enter the kernel to wait on the FUTEX variable associated with the mutex. This is done through a futex syscall with the FUTEX_WAIT opcode, passing the "mutex->futex" virtual address along with the current value of the futex.

The FUTEX_WAIT call tells the kernel to put the process to sleep on the futex variable until its value changes, through, say, a pthread_cond_signal or broadcast.

As part of FUTEX_WAIT, the kernel hashes the address of the futex variable, mapping the virtual address to the process page through its vma_struct plus the offset within the page, to look up the futex queue for that variable in the futex hash bucket, or it creates a new futex key with its futex queue if one is absent. The kernel thus uses the futex variable itself to build a futex representation for the thread on the fly, without incurring the overhead of creating global identifiers as with the sem_open, semget, shmget and msgget syscalls.
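The key the kernel builds looks roughly like this (paraphrased from the 2.6 kernel's futex code; the exact field layout varies between versions): shared futexes are keyed by the backing inode and page offset, private ones by the mm_struct and virtual address:


union futex_key {
    struct {
        unsigned long pgoff;    /* page offset in the backing object */
        struct inode *inode;    /* identity of the shared mapping */
        int offset;             /* futex offset within the page */
    } shared;
    struct {
        unsigned long address;  /* virtual address of the futex */
        struct mm_struct *mm;   /* owning address space */
        int offset;
    } private;
};
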

It then rechecks (to avoid lost-wakeup races) whether the value of the futex variable (the virtual address mutex->futex for a pthread_mutex_lock(&mutex)) has changed, in which case the futex has already been signalled through a FUTEX_WAKE; otherwise it queues itself on the futex queue for that futex variable in the hash table and schedules itself (the task) out.

So basically a FUTEX_WAIT is a wait on a futex variable, signalled by a FUTEX_WAKE whenever the value being waited on changes.
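To make the fast path / slow path split concrete, here is a minimal futex-backed lock, a sketch loosely following Ulrich Drepper's "Futexes Are Tricky" (assuming Linux, C11 atomics and the raw futex syscall; NPTL's real implementation is considerably more involved). The value encodes 0 = unlocked, 1 = locked, 2 = locked with possible waiters:


#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <stdatomic.h>
#include <stddef.h>

static long futex(atomic_int *uaddr, int op, int val)
{
    return syscall(SYS_futex, uaddr, op, val, NULL, NULL, 0);
}

static void futex_lock(atomic_int *f)
{
    int c = 0;
    /* fast path: 0 -> 1 with a single atomic cmpxchg, no syscall */
    if (atomic_compare_exchange_strong(f, &c, 1))
        return;
    /* slow path: mark the lock contended (2) and sleep; the kernel
       rechecks *f == 2 before sleeping, avoiding lost wakeups */
    while (atomic_exchange(f, 2) != 0)
        futex(f, FUTEX_WAIT, 2);
}

static void futex_unlock(atomic_int *f)
{
    /* fast path: if nobody advertised contention, no syscall at all;
       otherwise wake exactly one waiter */
    if (atomic_exchange(f, 0) == 2)
        futex(f, FUTEX_WAKE, 1);
}


Note how the unlock side only enters the kernel when the value says there may be waiters, which is exactly the zero-contention fast path described above.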

Now when a:

pthread_cond_signal(&cond)

or

pthread_mutex_unlock(&mutex);


is fired, the glibc pthread library increments the value of the futex variable associated with the cond variable or mutex (condvar->futex or mutex->futex for pthread_cond_signal and pthread_mutex_unlock respectively), checks whether there are futex waiters (it maintains wakeup/total sequence counters in the pthread_cond_t variable), and if it finds waiters it fires the futex syscall with opcode FUTEX_WAKE and a value of "1". The FUTEX_WAKE value is the number of processes to wake, so the hint to the kernel is: wake up 1 process waiting on this futex.

This is done only if there are waiters. If there are no waiters (zero contention), the call never has to enter the kernel for FUTEX_WAKE, which is exactly the fast path futexes optimize.

The kernel's futex_wake then simply rehashes the futex address to find the futex queue object and WAKES one process, or however many the FUTEX_WAKE hint asked for.

Now pthread_cond_broadcast changes the equation a bit. In essence, a broadcast is nothing but a FUTEX_WAKE with a value of INT_MAX, telling the kernel to wake up ALL processes sleeping on the futex queue.

But waking up all processes on a pthread_cond_broadcast results in the famous THUNDERING HERD problem: all processes are woken up to recheck their futexes, only for most of them to be requeued straight back onto the futex wait queue.

This THUNDERING HERD problem is most prominent with process-private mutexes; shared mutexes in shared memory rarely see as much contention as a private mutex that hundreds or thousands of threads within one process may be fighting over.

In order to address the THUNDERING HERD problem, the futex implementation was extended with the FUTEX_REQUEUE/FUTEX_CMP_REQUEUE ops.

The trick that the REQUEUE op employs to get rid of the thundering herd problem is interesting.

Basically the library fires a FUTEX_CMP_REQUEUE op with the addresses of 2 futex variables and a wake count of 1, indicating that only 1 process should be woken from the futex at address 1 while all the remaining (up to INT_MAX) processes/threads are requeued onto the futex at address 2.

So basically it wakes only 1 PROCESS at futex address 1, avoiding the thundering herd, and requeues the balance onto the futex variable at address 2.

What's this futex variable at address 2 where the processes waiting on the futex at address 1 are requeued?

Take the sequence below to understand the two-futex FUTEX_CMP_REQUEUE operation better:



Thread 1:

    pthread_mutex_lock(&mutex);
    pthread_cond_wait(&cond, &mutex);
    /* do something, as MUTEX is held after the return */
    pthread_mutex_unlock(&mutex);

Thread 2:

    pthread_mutex_lock(&mutex);
    pthread_cond_wait(&cond, &mutex);
    pthread_mutex_unlock(&mutex);

Thread 3:

    pthread_cond_broadcast(&cond);



Now, as you know, when pthread_cond_wait goes to sleep it releases "mutex", and it reacquires the mutex after being woken up, without you having to relock it yourself after pthread_cond_wait.

So threads 1 and 2 are sleeping in the kernel on futex 1, which is condvar->futex, having released the mutex->futex variable.
When they are woken up, they _TRY_ to reacquire mutex->futex.

They are effectively sitting on a futex queue waiting for the value of condvar->futex to change in the kernel.

When you fire:

pthread_cond_broadcast(&cond);


the pthread library wakes only 1 process, through a FUTEX_CMP_REQUEUE op passing, in essence:

(&condvar->futex, 1, INT_MAX, &mutex->futex);


This tells the kernel to first wake 1 process sleeping on condvar->futex (whose value the broadcast changed in user space) and then requeue all the others sleeping on condvar->futex to wait on mutex->futex instead. How's that for thundering herd?

Basically it moves the other waiting processes from the condvar->futex queue onto the mutex->futex wait queue, because any process woken up after a cond wait would try to grab mutex->futex anyway (relocking the mutex after the wait, as mentioned above). So when the mutex is unlocked again, exactly 1 process is automatically woken from mutex->futex, as expected.



The wake-all step of a broadcast, (FUTEX_WAKE, &condvar->futex, INT_MAX), is thereby eliminated: all the required waiters were already moved onto mutex->futex by the FUTEX_CMP_REQUEUE, with only a single-process wakeup, avoiding the thundering herd that such broadcast wakeups/schedules would otherwise cause.
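In terms of the raw syscall, the broadcast boils down to something like the following sketch (my helper name is hypothetical; note that in futex(2) the requeue count travels in the slot normally used for the timeout, and val3 carries the expected value of the first futex):


#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <limits.h>

/* wake one waiter on cond_futex and requeue the rest onto
   mutex_futex, provided *cond_futex still equals cond_val */
static long cond_broadcast_requeue(int *cond_futex, int *mutex_futex,
                                   int cond_val)
{
    return syscall(SYS_futex, cond_futex, FUTEX_CMP_REQUEUE,
                   1,            /* nr_wake: wake a single process */
                   INT_MAX,      /* nr_requeue, in the timeout slot */
                   mutex_futex,  /* uaddr2: requeue target */
                   cond_val);    /* val3: expected *cond_futex */
}
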



However, the CMP_REQUEUE trick the kernel employs to avoid a thundering herd has a potential RACE condition: the value of condvar->futex (the futex at address 1) can be CHANGED behind its back by a parallel pthread_cond_wait/signal/broadcast while the process is entering the kernel for FUTEX_CMP_REQUEUE with the old value of futex 1.

This could result in a missed signal or lost wakeup if the kernel didn't recheck the value of futex 1 (condvar->futex): during that wake-1 CMP_REQUEUE window, another thread 4 could have fired a "pthread_cond_wait(&cond, &mutex);", changing condvar->futex behind the back of the in-progress wake-1-and-requeue.



So the kernel detects this race by rechecking the value at the condvar->futex user space address to see whether it has changed, which would indicate a racing pthread_cond_wait or pthread_cond_signal/broadcast on the same cond.

When it detects such a change in the value of futex 1, i.e. the race condition, it returns -EAGAIN to the NPTL pthread library's FUTEX_CMP_REQUEUE call (invoked indirectly by pthread_cond_broadcast).

The pthread library then catches this failure (or any failure) from the CMP_REQUEUE futex call and reissues the call to the kernel with the plain FUTEX_WAKE opcode and a value of INT_MAX, i.e. wake all tasks: the kernel falls back to the thundering-herd wake-everyone step as a last resort when the FUTEX_CMP_REQUEUE race above is hit.
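Stitched together with the requeue helper sketched earlier, the fallback logic looks roughly like this (a sketch of the idea, not glibc's actual code):


/* try the thundering-herd-free path first; on any failure
   (e.g. -EAGAIN from the value-recheck race) wake everyone */
if (cond_broadcast_requeue(&cond_futex, &mutex_futex, cond_val) < 0)
    syscall(SYS_futex, &cond_futex, FUTEX_WAKE, INT_MAX,
            NULL, NULL, 0);
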

The addition of the FUTEX_CMP_REQUEUE operation to the kernel's futex syscall is a juicy piece of trickery for avoiding the thundering-herd issues long experienced in operating systems programming, and it makes for an interesting hack.