Tuesday, June 22, 2010

Using static branch prediction with gcc

Recently I extended my heap manager for our product to have some runtime fencing code to be enabled based on an environment variable to detect memory overruns and freed memory reads/writes to the heap that would immediately result in crashing(core-dumping) of the process when the faulting instruction is executed. It was just a fast and acceptable way to enable some runtime code during production/field where one does not have the liberty to run under valgrind to detect such common occurrences which could result in huge headaches based on the memory corruption patterns.

The most common way to write such stuffs is to hide them beneath the #ifdef macros so they are compiled out. But then you have to recompile your code to enable the special helper code. I shoved the special code through a branch on an environment variable. So if the env. variable was defined, the code would get executed.

So you would obviously think this:


void *heapAllocate(int bytes)
{
if(getenv("HEAP_CHECKER"))
return heapAllocateFence(bytes);
else
{
/* main heap allocator fastpath */
}
}


Well the problem with the above code is the conditional branching for the environment
variable in the fast path: heapAllocate that could be called a million times. Most of the times, when you read such code, your mind staggers a bit and assumes that the code is going to perform slow coz of the branching. Well its more often than NOT. All modern CPU's have very good branch predictors that can cache the branch prediction hints in the branch prediction buffers.

But depending on the architecture and the CPU's branch prediction algorithms, a mis-prediction could result in instruction fetch or pipeline stalls in the CPU's: instruction fetch -> decode -> execute -> writeback phase.

Branch predictors are a complex topic and you can refer to: Branch predictor to get some rough idea about how CPU does branch prediction using branch history and target buffers.

In order to overcome any potential performance bottlenecks across all archs, I used the gcc static prediction hint in the code using the gcc: __builtin_expect which is heavily used by the linux kernel and also glibc code.

GCC provides a builtin function: __builtin_expect(expression, value) that can be used to provide static branch prediction hints to GCC. Infact in optimization mode, this is implicitly enabled. Using __builtin_expect you can tell the compiler to generate code that could set branch prediction hints for the target architecture.

Before I explain it a bit further, lets try and rephrase the above heapAllocate routine to use the gcc static branch prediction hint.


void *heapAllocate(int bytes)
{
if(__builtin_expect(getenv("HEAP_CHECKER"), 0))
{
/* trigger unlikely but special path */
return heapAllocateFence(bytes);
}
/* main allocation fast path */
}


The above __builtin_expect tells the compiler to enable static branch prediction hints by static that the expression above would be almost always FALSE or "0". So using this you can devise the linux kernel heavily used: likely and unlikely macros that could be used by your code in your error paths for example. (unlikely execution paths).


#define likely(expr) __builtin_expect((expr), 1)
#define unlikely(expr) __builtin_expect((expr), 0)


And then you could use it for your error paths like:

map = malloc(bytes);
if(unlikely(map == NULL))
{
/*error path*/
}
else free(map);


Now what code does the compiler generate to enable branch prediction hints ?
This could be seen by a disassembly of the above malloc code branch with and without the unlikely or __builtin_expect branch prediction hint and off-course without optimization turned on which enables it implicitly.

The below is x86_64 assembly without __builtin_expect in the malloc error path. So its effectively: map = malloc(); if(map == NULL) {erorpath;} else free(map);


call malloc
movq %rax, -8(%rbp)
cmpq $0, -8(%rbp)
jne .L2
movl $0, %edi
call exit
.L2:
movq -8(%rbp), %rax
movq %rax, %rdi
call free


So you can see above that the compiler doesn't generate anything special. It cmpq instruction after return from malloc checks for NULL and jumps to label L2 if its not equal to NULL else exits if NULL. So you dont see anything new being done here in this code generation.

Lets see whats the result of: if(unlikely(map == NULL)) { error; } else free(map);
Now we are using the static branch prediction hint to GCC for the malloc error path.


call malloc
movq %rax, -8(%rbp)
cmpq $0, -8(%rbp)
sete %al
movzbl %al, %eax
testq %rax, %rax
je .L2
movl $0, %edi
call exit
.L2:
movq -8(%rbp), %rax
movq %rax, %rdi
call free


Oh all of a sudden your instruction cache looks loaded. The compiler is using the "sete" instruction which sets the bit to the target register (%al) if the ZERO flag is set else loads 0 to target register if ZERO flag is clear. So effectively the code branches out again if zero flag is clear to the label L2 again which implies that the branch prediction bits 1 and 0 were used for NULL/error branch or label L2 branch. The x86 for example seems to cache the branch prediction hint results in its branch history buffer for subsequent usage. So when the same code is executed again, it would predict the error branch as not taken and jump to the prefetched branch target buffer at L2 while the pipelines could be prefetched parallely. So this optimizes the most likely case or branch by effectively keeping the instruction cache with the most likely branch target buffers.

This could also effectively imply that the error branch instruction never really was prefetched into the CPU's instruction cache albeit it was loaded into main memory by the kernel. This comes with a slightly larger memory footprint as the instruction cache has more instructions with static branch prediction but might optimize the case for various archs. through the static hints.

But it seems that the latest INTEL CORE-2's dont really expect the static branch prediction hint instructions through sete/setne,etc. and you can effectively do away without using the __builtin_expect hints and rely on the CPU's speculative branch prediction buffers to do the work. But its always better to use the GCC specific hints so GCC can always take care about generating the most efficient code based on the architecture and the programmer needn't bother himself with the branches in his code as long as he relies on the static hint to work.

No comments:

Post a Comment