Techie blog streams: 2010

Tuesday, June 22, 2010

Using static branch prediction with gcc

Recently I extended my heap manager for our product to have some runtime fencing code to be enabled based on an environment variable to detect memory overruns and freed memory reads/writes to the heap that would immediately result in crashing(core-dumping) of the process when the faulting instruction is executed. It was just a fast and acceptable way to enable some runtime code during production/field where one does not have the liberty to run under valgrind to detect such common occurrences which could result in huge headaches based on the memory corruption patterns.

The most common way to write such stuffs is to hide them beneath the #ifdef macros so they are compiled out. But then you have to recompile your code to enable the special helper code. I shoved the special code through a branch on an environment variable. So if the env. variable was defined, the code would get executed.

So you would obviously think this:


 void *heapAllocate(int bytes)
 {
     if(getenv("HEAP_CHECKER"))
         return heapAllocateFence(bytes);
     else 
     {
         /* main heap allocator fastpath */ 
     }
 }

Well the problem with the above code is the conditional branching for the environment
variable in the fast path: heapAllocate that could be called a million times. Most of the times, when you read such code, your mind staggers a bit and assumes that the code is going to perform slow coz of the branching. Well its more often than NOT. All modern CPU's have very good branch predictors that can cache the branch prediction hints in the branch prediction buffers.

But depending on the architecture and the CPU's branch prediction algorithms, a mis-prediction could result in instruction fetch or pipeline stalls in the CPU's: instruction fetch -> decode -> execute -> writeback phase.

Branch predictors are a complex topic and you can refer to: Branch predictor to get some rough idea about how CPU does branch prediction using branch history and target buffers.

In order to overcome any potential performance bottlenecks across all archs, I used the gcc static prediction hint in the code using the gcc: __builtin_expect which is heavily used by the linux kernel and also glibc code.

GCC provides a builtin function: __builtin_expect(expression, value) that can be used to provide static branch prediction hints to GCC. Infact in optimization mode, this is implicitly enabled. Using __builtin_expect you can tell the compiler to generate code that could set branch prediction hints for the target architecture.

Before I explain it a bit further, lets try and rephrase the above heapAllocate routine to use the gcc static branch prediction hint.


void *heapAllocate(int bytes)
{
    if(__builtin_expect(getenv("HEAP_CHECKER"), 0))
    {
        /* trigger unlikely but special path */
        return heapAllocateFence(bytes);
    }
    /* main allocation fast path */
 }

The above __builtin_expect tells the compiler to enable static branch prediction hints by static that the expression above would be almost always FALSE or "0". So using this you can devise the linux kernel heavily used: likely and unlikely macros that could be used by your code in your error paths for example. (unlikely execution paths).


#define likely(expr) __builtin_expect((expr), 1)
#define unlikely(expr) __builtin_expect((expr), 0)

And then you could use it for your error paths like:


 map = malloc(bytes);
 if(unlikely(map == NULL))
 {
    /*error path*/
 }
 else free(map);

Now what code does the compiler generate to enable branch prediction hints ?
This could be seen by a disassembly of the above malloc code branch with and without the unlikely or __builtin_expect branch prediction hint and off-course without optimization turned on which enables it implicitly.

The below is x86_64 assembly without __builtin_expect in the malloc error path. So its effectively: map = malloc(); if(map == NULL) {erorpath;} else free(map);


    call    malloc
    movq    %rax, -8(%rbp)
    cmpq    $0, -8(%rbp)
    jne .L2
    movl    $0, %edi
    call    exit
.L2:
    movq    -8(%rbp), %rax
    movq    %rax, %rdi
    call    free

So you can see above that the compiler doesn't generate anything special. It cmpq instruction after return from malloc checks for NULL and jumps to label L2 if its not equal to NULL else exits if NULL. So you dont see anything new being done here in this code generation.

Lets see whats the result of: if(unlikely(map == NULL)) { error; } else free(map);
Now we are using the static branch prediction hint to GCC for the malloc error path.


    call    malloc
    movq    %rax, -8(%rbp)
    cmpq    $0, -8(%rbp)
    sete    %al
    movzbl  %al, %eax
    testq   %rax, %rax
    je  .L2
    movl    $0, %edi
    call    exit
.L2:
    movq    -8(%rbp), %rax
    movq    %rax, %rdi
    call    free

Oh all of a sudden your instruction cache looks loaded. The compiler is using the "sete" instruction which sets the bit to the target register (%al) if the ZERO flag is set else loads 0 to target register if ZERO flag is clear. So effectively the code branches out again if zero flag is clear to the label L2 again which implies that the branch prediction bits 1 and 0 were used for NULL/error branch or label L2 branch. The x86 for example seems to cache the branch prediction hint results in its branch history buffer for subsequent usage. So when the same code is executed again, it would predict the error branch as not taken and jump to the prefetched branch target buffer at L2 while the pipelines could be prefetched parallely. So this optimizes the most likely case or branch by effectively keeping the instruction cache with the most likely branch target buffers.

This could also effectively imply that the error branch instruction never really was prefetched into the CPU's instruction cache albeit it was loaded into main memory by the kernel. This comes with a slightly larger memory footprint as the instruction cache has more instructions with static branch prediction but might optimize the case for various archs. through the static hints.

But it seems that the latest INTEL CORE-2's dont really expect the static branch prediction hint instructions through sete/setne,etc. and you can effectively do away without using the __builtin_expect hints and rely on the CPU's speculative branch prediction buffers to do the work. But its always better to use the GCC specific hints so GCC can always take care about generating the most efficient code based on the architecture and the programmer needn't bother himself with the branches in his code as long as he relies on the static hint to work.

Tuesday, June 8, 2010

Examining Futexes and how it tackles THUNDERING HERD

A friend of mine had posted a futex query in a google mailing list named:apartition which mainly comprises of my friends. It was basically an strace output which had a FUTEX_WAKE returning -EAGAIN in one of his pthreads which had him concerned.


 futex(0x6cebd548, FUTEX_WAKE, 1) EAGAIN (Resource temporarily unavailable

I had a theoretical idea about what futexes were in user space w.r.t NPTL glibc pthread library and how it was getting mapped in kernel space. But that wasn't sufficient to understand and answer his futex query.

Hence I used the mail as an opportunity to hack on futex source code to understand them better and in the process, also get to know the root cause for his above FUTEX-WAKE EAGAIN concerns.

Futex as the name suggests are fast user space mutexes. They were introduced primarily to avoid sleeps or system call latencies in the no-contention lock path as opposed to using system V sems which incur syscall overheads to lock/unlock. (semop)

Futex brings the atomic operations to user space on the lock variable and tries to avoid hitting the kernel in case of zero lock contention. or zero waiters. So basically its a spinlock in that case through an atomic "lock; cmpxchg" op.

These are primarily used by posix nptl threads or PTHREAD library to implement fast pthread mutex locks in user space. I am actually bypassing the internals of the user space pthread_mutex_t,pthread_cond_t management structures as thats a different blog post altogether :-).

I am also not covering the priority inheritance futex syscalls and opcodes prefixed with "pi" as those are also a separate topic altogether and come into picture with real time mutexes with different thread priorities and when PTHREAD_PRIO_INHERIT is used with pthread_mutexattr_setprotocol.

The discussions below are valid for both shared mutexes having mutexes and cond variables in shared memory with pthread_{condattr,mutexattr}_settype both having PTHREAD_PROCESS_SHARED and for process private mutexes.

So when you fire a:


 pthread_mutex_lock(&mutex);

pthread library does an atomic op. to determine if the mutex is LOCKED. If the mutex is locked (value 0), then the thread pertaining to the process is put to sleep by having it enter kernel to wait on the FUTEX variable associated with the mutex. This is done through a futex syscall with FUTEX_WAIT as the opcode. and using the "mutex->futex" virtual address along with the current value of the futex.

The FUTEX_WAIT call is an indicator to the kernel to have the process sleep on a futex variable till its variable changes through say a: pthread_cond_signal or broadcast for example.

The kernel as part of FUTEX_WAIT hashes the address of the futex variable by mapping the virtual address of the futex variable to the process page through its vma_struct + offset within the page to get the futex queue representation for that futex variable from the futex hash bucket. or creating a new futex key if absent with the futex queue. Thus the kernel uses the futex variable itself to create a representation of a futex object for the thread dynamically without incurring the overhead of creating global identifiers like with sem_open, semget,shmget,msgget syscalls.

It then rechecks(to avoid lost wakeup races) if the value in the futex variable (virtual address: mutex->futex for a pthread_mutex_lock(&mutex)) has changed in which case the FUTEX has been signalled through a FUTEX_WAKE else it puts the futex queue object into the hash table by scheduling itself (task) out after queuing itself into the futex queue for that futex variable.

So basically a FUTEX_WAIT is a wait on a futex variable thats signalled by FUTEX_WAKE whenever the value of futex variable being waited for changes.

Now when a :


pthread_cond_signal(&cond)


pthread_mutex_unlock(&mutex);

is fired, the glibc pthread library increases the value of the futex variable associated with the cond variable or mutex (condvar->mutex OR mutex->futex) for pthread_cond_signal and pthread_mutex_unlock respectively,

sees if there were futex waiters(maintains wakeup/total sequence counters in the pthread_cond_t variable), and then fires a call to the kernel syscall: futex with opcode: FUTEX_WAKE with a value of "1" to wake up the waiters on the mutex if it finds waiters on the cond->futex or mutex->futex. The wakeup hint to the kernel being: Wake up 1 process waiting on the futex where the FUTEX_WAKE val represents the number of processes to be woken up waiting on the futex.

This is done only if there are waiters. If there are no waiters (zero-contention), then the call doesn't have to hit the kernel for FUTEX_WAKE which is the fast-path being optimized with futexes.

So the kernel futex_wake, just rehashes the futex address to the futex queue object, and then WAKES one process or whatever value was passed as hint from FUTEX_WAKE (count of processes to wake up).

Now pthread_cond_broadcast changes the equation a bit. In essence, pthread_cond_broadcast is nothing but a FUTEX_WAKE with a value of INT_MAX or ~0 that tells the kernel to wake up ALL processes (INT_MAX count of processes) sleeping on the futex queue.

But waking up all processes on a pthread_cond_broadcast would result in the famous THUNDERING HERD problem where all processes would be woken up to recheck their futexes only to have all of them requeued back again into the futex wait queue.

This THUNDERING HERD problem is prominent with process private mutexes as shared mutexes won't nearly have as much contention in shared memory as process private mutexes which could have 100/1000's of threads contending for a private mutex specific to the process.

In order to address the THUNDERING HERD problem, the futex implementation was changed to include a FUTEX_REQUEUE/FUTEX_CMP_REQUEUE op.

The trick that the REQUEUE op in futex employs in the code to get rid of the thundering herd problem is interesting.

Basically it fires a FUTEX_CMP_REQUEUE op with the address of 2 futex variables and a value of 1 indicating only 1 process to wake up from futex variable at address 1 but all (or INT_MAX remaining) processes/threads to be requeued into futex variable at address 2.

So basically it tries to wake up only 1 PROCESS at futex addres 1 to avoid the thundering herd and requeue the balance into the futex variable at address 2.

Whats this futex variable at address 2 where the processes waiting on futex at address 1 are requeued ?

Take the below sequence to understand the FUTEX_CMP_REQUEUE 2 futex operation better:



Thread 1:

pthread_mutex_lock(&mutex);

pthread_cond_wait(&cond, &mutex);

< do something as MUTEX is held after return >

pthread_mutex_unlock(&mutex);

Thread 2:
pthread_mutex_lock(&mutex);

pthread_cond_wait(&cond, &mutex);

pthread_mutex_unlock(&mutex);

Thread 3:

pthread_cond_broadcast(&cond);

Now as you know, when pthread_cond_wait goes to sleep it releases "mutex" and acquires it again after its woken up without you having to reacquire it again after pthread_cond_wait.

So thread 1 and thread 2 are waiting/sleeping on futex 1 in the kernel which is : "condvar->futex". releasing mutex->futex variable.
But when they are woken up, they _TRY_ to reacquire : "mutex->futex"

They are effectively waiting for the value of condvar->futex to be changed in kernel while on a futex queue.

When you fire:


 pthread_cond_broadcast(&cond);

the pthread library tries to wake up only 1 process through a FUTEX_CMP_REQUEUE op passing:


(&condvar->futex, 1, INT_MAX, &mutex->futex);

This implies that the kernel should first try to wake up 1 process sleeping on the condvar->futex that has been changed by the broadcast from user-space and then requeue all the others sleeping on condvar->futex to now wait on "mutex->futex" .Howz that for thundering herd ?

Basically it moves the other waiting processes from condvar->futex queue into mutex->futex futex wait queue as when any process is woken up after a condwait as mentioned above, it would try to grab the mutex->futex (lock mutex after condwait). So when the mutex is unlocked again, automatically 1 process is woken up from mutex->futex as expected.



The wake all processes waiting on futex step on a broadcast step: 
(FUTEX_WAKE,&mutex->futex, INT_MAX); is eliminated as all the required waiter processes were already moved to mutex->futex on a FUTEX_CMP_REQUEUE with only a 1 process wakeup to avoid the thundering herd issue with such broadcast wakeups/schedules.

However if you see the CMP_REQUEUE trick that the kernel employs to avoid a thundering herd, it can have a RACE condition when the value of "condvar->futex" or futex variable at address 1 can be CHANGED behind the back while the process enters the kernel for FUTEX_CMP_REQUEUE. with old value of futex1 through a parallel pthread_cond_wait/signal/broadcast.

So this could result in a missing signal or lost wake up if kernel doesn't recheck the value of futex1 or condvar->mutex
as effectively during that CMP_REQUEUE wake 1 window, another thread 4 could have fired a :



 "pthread_cond_wait(&cond, &mutex);" changing "condvar->futex" behind the back while a CMP_REQUEUE wake 1 and requeue others is in progress.

So kernel detects this race by a recheck of the condvar->futex user space address value to see if it has changed indicating a racing pthread_cond_wait or a pthread_cond_signal/broadcast on the same cond.

When it detects such a change of the value of futex1 or a race condition, it returns -EAGAIN back to the NPTL posix pthread library FUTEX_CMP_REQUEUE call.(indirectly invoked on a pthread_cond_broadcast)

The pthread library then catches this failure (or any failure) from CMP_REQUEUE futex and then reissues the call again to the kernel with the usual FUTEX_WAKE opcode and a INT_MAX value or a wake_all tasks value to the kernel. indicating the kernel to fall back to the thundering herd wake all process step with INT_MAX processes to wake up using it as the last resort incase of a race as mentioned above with FUTEX_CMP_REQUEUE.

The addition of the FUTEX_CMP_REQUEUE futex operation in the futex syscall in the kernel is a juicy trickery to avoid thundering herd issues experienced with Operating Systems programming and hence makes for an interesting hack.

Friday, May 14, 2010

Allocate only memory-groups for scatter-gather storage

Recently we wrote a simple but generic memory group for scatter-gather storage which could be just freed collectively in one single shot. So there is no free of any individual objects or data-set but just a alloc and a destroy. Its basically a simple power-of-2 alloc-only memory pool but used only for a specific scatter-gather result storage and processing scenarios.

Its not such a big deal but the generic approach impressed me to blog about it. And especially when your product is going to run in space now - Openclovis in space , some development genericity makes the framework cooler even though it isn't really required :-). In this case, we could have easily had multiple scatter-gather code references in different places as the usage scenarios were not complex as some of the earlier ones for which we developed other generic solutions which made more sense.

The idea was from a colleague of mine whom I consider to be an expert in object storage,synthesis and object-oriented generic design in general. Being a "C" programmer all my life, I had learnt some useful stuffs about writing code in a fashion that could be _easily_ re-used by anyone.

Before discussing the implementation details of the memory group, ask yourself a question:
"How many times have you faced a situation where you are supposed to hunt for variable-sized data-sets and save them in scatter-gather lists for fast but short-lived inspection and deletion".

[
Off-course this question is not relevant for dynamic typed languages like python but only holds good for "C" or C++. So if you are a python,ruby,perl or maybe even a Java programmer, you don't really have to read ahead and I leave it to your discretion :-)
]

It could be as simple as creating a storage array to hold the results and freeing them.
Something like:



void **data = NULL;
next_data = 1;
int count = 0;          
while(next_data > 0)
{
    if(!(count&15)) /*allocate and expand the result-set 16 at a time*/
    {
        data = realloc(data, (count+16) * sizeof(*data)); 
        assert(data != NULL);
        memset(data+count, 0, 16*sizeof(*data));
    }
    data[count] = calloc(1, getdatasize(&your_data_pool, count, &next_data) );
    /*
     * Initialize data or fetch the result and copy it into data[count] and process data[count]
     * free it here itself or delay frees to later if processing the data-set in batches
     */
    ++count;
}

Now freeing them could be a simple one with a :free(data[count]); on each data-set after processing your results.The above result-data set to store results could even be a linked list, tree, or any data-structure that could hold these results for a short period of time.

Based on the data pool and the query of the data-set, your storage mechanism could slightly vary. But most of the times, it would represent a scatter-gather list accumulator.

Now if you want to follow this approach in many different places of the code which is supposed to accumulate and process results of different sizes,and then eliminate them immediately, you would have to effectively replicate the above code with maybe different variants.

To address this in a generic manner, we implemented a simple but generic allocate only memory group to address this scenario. We could have alternatively used our internal memory pool but the internal pool is used for fixed-size storage where you could create different pools to accomodate different storage sizes. Something like below to create a 1 MB pool that can be used to allocate objects upto size 32 bytes:

pool = pool_create(1<<20, 32, optional_flags);

and then allocate the data-set from the pool:

data = alloc_pool(pool, flags);

But most of the times, our data storage while quickly iterating through data-sets was not of fixed-size to pre-create memory cache pools and allocate objects from them. So we implemented an alternate but simple alloc-only memory group which used a simple power-of-2 allocator to allocate memory for the variable sized data-sets that could be used during scatter-gather. And this pool didn't have any individual object free but just had a collective free to destroy the entire scatter-gather memory group after usage.

The expectation is that these alloc-only memory groups would be short-lived and won't be residing in the process for longer periods holding memory. So the above scatter-gather result accumulator became something like this:



MemGroupT memGroup;
const int storageSizeBase = 8; /*just a initial base size. could be a sizeof(struct foo); */
memGroupInit(&memGroup, storageSizeBase);/*initialize a scatter-gather list with a base size*/

/*
 * The allocation of the data storage. is now from the above memGroup instead of a malloc.
 * So within the "while" loop, you would have.
 */
data[count] = memGroupAlloc(&memGroup, getdatasize(&your_data_pool, count, &next_data));

The above accumulates the resulting storage into an array for processing. You could just use them as it is after allocation. But the allocation base is from a group. And you could just base the &memGroup reference to other places in the code to allocate memory from its storage pool. And then after processing of the result-set, you could throw them all away instead of freeing each object/data individually using a collective or a 1 shot free like below:


memGroupFree(&memGroup);

which frees the memory group itself.

The above would also give you savings in iterating over your entire data-set storage in order to free the references later which could be sufficiently large. The implementation of memGroup uses a 32 slot memory pool where the memory capacity/storage in each slot is double the previous slot. So ideally you have a virtually large pool that could accomodate a max. allocation of 4 gig. (or 2 gig depending on whether you use the last slot or not. (1 << 32) or (1<<31))

I have also slightly spiced up the implementation by giving support to read-write memory groups. So you could write protect your objects or storage from the memory group whenever you want through something like:


object = memGroupAlloc(&memGroup, dataSize);  
memcpy(object, mydata, dataSize);

Then you can convert the storage into a read-only location through something like this:

  
object = memGroupProtect(&memGroup, object);

So _any_ writes to the above "object" location would trigger a SIGSEGV or a segfault resulting in process-crash/core-dump as a read-only object memory location after memGroupProtect is write protected internally.

So you could lease the object to another portion of your code that isn't aware about the source of the object storage (like memGroup, malloc, etc.) thats expected to read the contents of the object and _not_ modify the object to be safe. And then you could again unprotect the object back later to modify it if you want through a later unprotect call on the same object:


object = memGroupUnprotect(&memGroup, object);

It resembles the mprotect syscalls: mprotect(obj, objlen, PROT_READ); and a mprotect(obj,objlen, PROT_READ| PROT_WRITE); but only for a small object snapshot from a big storage space.

memGroupProtect and unprotect internally use read-write and read-only mmap mirrors of a shared resource thats given out to the user as a data storage reference. You could also comment out the blocks in the below implementation that use a temp file for shared read-write and read-only mmap to protect and unprotect storage within the memGroup, throw-away the implementation of the protect and unprotect incase you don't need them and just use anonyous mmap or malloc calls to allocate a big storage chunk to your memGroup by modifying the memAlloc routine in the implementation below.

You can download/view the complete implementation here: Allocate-only MemGroups
Feel free to use the implementation in your code if you deem it fit for your usage-scenarios mentioned above.

Tuesday, May 4, 2010

Scalable anon-vma in linux kernel 2.6.34

While getting in touch recently with an ex-colleague and friend of mine - Vikram Shenoy, he mentioned about the interesting Linux Kernel 2.6.34 anon-vma triggered kernel panic during heavy testing of suspend/resume sequence which Linus ultimately tapped in 2.6.34 rc4. The problem was triggered as a result of the new scalable anon-vma chain introduction by Rik Van Riel. He asked mejavascript:void(0) to take a look at the bug as it had generated sufficient curiosity and headaches during the rc4 release.

Yesterday I spent sometime in going through the problem. My first attempt at googling revealed Jonathan Corbets interesting tidbits on the problem at: http://lwn.net/Articles/383162
While the article was well-written, it really would be superficial just going through a writeup without looking into the code that caused the problem in the first place. So I went about going through Rik Van Riels new scalable anon-vma changeset at:
javascript:void(0)
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=5beb49305251e5669852ed541e8e2f2f7696c53e

While I had a rough understanding of how the patch worked, I had to dig into the 2.6.34-rc2 sourcetree which also had his changes in order for the patch and the bug which was addressed ultimately by Linus, to make complete sense. If you just go through the explanation above, Riks commit log and his patch, it still won't make complete sense unless you yourself take a look at the relevant code for anonymous page management in the linux kernel.

Before I explain the changes introduced by Rik Van Riel for scalable anononymous virtual memory area management, let me quickly explain the meaning of anon-vma in the context of the kernel data structure.

An anonymous VMA is a memory region for a process with no backing storage. In other words, any private allocation pertaining to the process address space thats not file-backed can be termed as an anonymous page.
malloc, mmap with MAP_ANONYMOUS are example of anonymous pages for any process. In fact its pretty common for all the processes in the system to have anonymous pages unless off-course the process is entirely shared-mem or file-mapped. with no private allocations whatsoever.

Usually management of anonymous pages are not as complicated as opposed to file backed pages considering that they are private to the process virtual memory address space. But such anon-vma mappings get interesting when the process starts forking in which case, the virtual memory area and the anon-vma mappings are cloned and forked to the childs address space resulting in shared anon-vmas in preparation for a break COW to unprotect the write-protected page table entries in both the parent and the child.(copy on write trigger)

This poses problems during page reclamation phase of the kernel when such anonymous pages are eligible to be reclaimed by the kernel to free pages or memory when under memory pressure during LRU scanning.

The problem is to trace all the virtual memory areas of the children (or shared) mappings for an anonymous page to unmap the PTE entries from all the processes who have mapped that anonymous page.

In 2.4, the kernel had a brute force page reclaim approach where an LRU scan entailed scanning the MM areas of all the processes in the system through a init_mm list head incurring an O(N) scan which wasn't scalable.

In kernel 2.6.x, reverse mapping was introduced for pages to track the virtual memory areas mapped to a given page so as to enable the kernel to quickly and easily unmap the PTE entries(page table entries) pertaining to the virtual memory area for _each_ process.

Before Rik Van Riels recent rmap changes specific to anon-vma, the anon-vma of a parent process used to contain the virtual memory areas of all the child processes. A anon-vma pointer or hint was stored in the page->mapping field for the anonymous page(this was left untouched by Rik's recent patch). So a try_to_unmap_anon_page used to traverse the list of all anon-vma VMAs linked to the page and nuke the pte entries for each of the vma (which in-turn would correspond to each process sharing that anonymous page)

To give a simple example, lets assume that process 'A' to start with has done an mmap of a pagesize like below:

char *map = mmap(0, getpagesize(), PROT_READ | PROT_WRITE, MAP_ANONYMOUS, -1 ,0);

The above mmap would result in a new "vma" struct in the kernel for process : 'A'.
But the VMA wouldn't yet have an anon-vma representation since the anonymous page is only assigned to the VMA on a page fault as its deferred to page faults for the process or in other words - when the process actually tries to access/write to that page/vma.

So lets assume that process 'A' touched the pages through say a:
map[0] = 0x7f;

Now this would incur a page fault for process 'A' in the kernel (or CPU trap 14 in x86), which would then assign an anon-vma representation in the kernel and assign the anon-vma to the vma.
So vma->anon_vma = anon_vma would be triggered through the anon_vma_prepare call in the __do_fault function in mm/memory.c

So far so good.

Now process 'A' forks and creates process 'B'. The fork of process 'B' would duplicate the mmap layout of the parent in the child through dup_mmap and this would in turn create a copy of the above "vma" area in the child. Now prior to Riks changes in 2.6.34, the anon-vma of all the children and down (grand children,etc.) all points to the oldest parent.

And childs VMA is added to the parents anon_vma list-head ->&anon_vma->head through anon_vma_link. So every anonymous VMA of the parent is duplicated in the child and the childs virtual memory area is added as a reverse map to the parents anon_vma->head. And during a page fault, the anon-vma or VMA pertaining to the anonymous page is added to the rmap of the anonymous page *page_add_anon_rmap. The anon-vma pointer is stored to the page->mapping field specific to backing storage of a page and an LSB lock bit indicates the setting of this mapping to the anonymous page.

So ideally every anonymous VMA of the child would have an anon-vma entry in the parent for the anon page to easily reverse map that page whenever that page (say) is going to be unmapped by the kernel during page reclamation/LRU scans when under memory pressure. So you just walk the anon-vma list head of VMAs of all the processes (all children, grand-children,etc.) , and nuke the PTE entry for the VMA pertaining to the process.

This technique worked fairly fell for normal workloads which were not heavily forking loads.

But because each VMA of the children is attached to the top most parent in the heirarchy, it failed for heavily forking loads like the AIM7 benchmark tests.

For forking workloads with a parent with 1000 child processes or 1000 anonyous pages (like above example of process 'A') and with each in turn forking 1000, we would end up with a million entries: 10^6 VMA entries in the anon_vma->head of the top most parent since the anon-vma pointer is just shared by all the descendants. So even though its effectively mapped for 1 in every 1000 of such processes, the rmap scan for the anon-vma unmap for an anonymous page becomes a brutal O(N) and hence hits the system CPU hard with almost ~100% on the scan with the anon-vma spinlock held during such anon-vma unmap traversals in the rmap code. So obviously the addition of the VMA's of all descendants to the top-most parent was again causing scalability issues with extremely heavy forking workloads like the AIM7 benchmarks.

In order to address this issue, Rik introduced a scalable anon-vma chain patch that distributed the anon-vma's to every process (all child processes down the line). So basically every process has an anon-vma chain with a list head of anon-vma->same_anon_vma chains and a list head of vma->same_vma chains for traversal of the anon-vma. Its basically a list of virtual memory areas stored as vma chains pertaining to a anonymous VMA and a list of anonymous virtual memory areas stored as anon-vma chains pertaining to a vma. (anon-vma->same_anon_vma chains and vma->same_vma chains back linkage)

So a parent process 'A' would start with a anon-vma chain on the above mmap with page touch causing a anon_vma_prepare.(page fault trigger)

Then a fork in process 'B', would result in anon_vma_fork which would first traverse the vma->same_vma chain of parent process 'A' to clone all the anonymous VMA entries to the child for each VMA being cloned on a fork. The vma->same_vma chain is created for the child and this vma chain is added initially to the anon_vma->same_anon_vma chain of the parent for that anonymous page.(similar to earlier anon_vma->head adds of a child VMA but only upto 1 level down the child and not grand-children and below)

Then a new anon-vma is also allocated for the child itself (not present in the earlier anon-vma rmap management code), plus a new anon vma chain is also allocated for the new anon-vma for the child. This new vma chain is then added to both the anon-vma->same_anon_vma chain and the anon-vma added to the vma->same_vma chain for the child itself.(self linkage)

This creation of per child anon-vma chains and anon-vma, distributes the anon-vma for the parents descendants on subsequent forks in its children and below. A break cow is simply delinking the anon-vma chain of the child from the parent and adding it to its own anon-vma chain that was created on a fork instead of adding it all to the top most parent. (oldest in the heirarchy as before)

A page unmap of a anonymous page would now amount to fetching the anon-vma pointer for the page from page->mapping as before, and then traversing the anon_vma->same_anon_chain list_head to iterate through the vma chains for all the children mapping that anonymous page. ( would be now 1000 in AIM7 fork load over a million earlier)

So in case of AIM7 forking workloads that would create a million processes, each process would have 1000 anon-vma chains distributed for its anonymous pages pertaining to the VMA. instead of distributing them all in the oldest parent. So instead of a million VMA entries for an anonymous page, you would have 1000 anon-vma chains for each of the 1000 children for each of the 1000 processes.

This radically improved the AIM7 benchmark with effectively no cpu utilization in the kernel as the anonymous page unmap scans for VMAs were reduced from a million to a 1000 for an anonymous page.

But the patch was triggering random kernel panics in the swap page reclamation path or trying to add in a anon-page that was unreferenced earlier, back to a process. The anon-vma mapped to the page was presumably getting corrupted as it was pointing to a dangling anon-vma pointer stored for an anonymous page in page->mapping.

Riks patch used to add a page to the anon map (page_add_anon_rmap) by trying to use the most recent anon-vma map set for that page. So if process 'A' starts with anon-vma 'X', forks process 'B' which initially also shares anon-vma 'X' in its anon-vma chain prepared/created by parent.

So far so good.

But lets say down the road, the order was inverted after the page was unmapped and referenced again, there is a potential to have a parent left with a dangling/freed anon-vma pointer for the anon-page coz of the above setting of the first created anon-vma to the anonymous page whenever such an anon-page is added to the rmap store through page_add_anon_rmap.

Lets say the following sequence happened (as clearly mentioned by Linus in the bugfix of anon-vma in rc4):

Process 'A' creates an anon-vma, forks 'B'. Initially 'A' and 'B' anon-vma chains point to anon-vma 'X' created by parent process 'A'.

Then both are unmapped lets say by the page LRU code later in an effort to swapout the page under memory pressure. Lets assume that the anonymous page is either in the swap cache or swapped out.

Now lets say that process 'B' referenced the anonymous page FIRST _before_ parent process 'A'.

So a page fault would result in anon_vma_prepare for process 'B' and swapin or swap-cache pull of the anonymous page. The page_anon_add_rmap : if(first) check ensures that the anon-vma pointer created/prepared by child process 'B' gets the anon-vma mapping into the anonymous page: page->mapping FIRST.

Then lets say parent process 'A' tries to access the same page and a page fault on the anon page would again incur a page_add_anon_rmap in which case the page->mapping anon-vma pointer for the parent would be a no-op as it was already set by the fault on child process 'B' earlier.(coz of the : if(first) or already set check).

Now parent process 'A's or the anonymous page's anon-vma mapping now points to the entry created by child process 'B'. Lets say, child process 'B' goes ahead and exits maybe through execvp and the VMA cleanup of process 'B' results in also kmem_cache_free of the anon-vma pointer created by child process 'B'. Hence this would result in a condition of leaving parent process 'A' having a dangling or freed anon-vma pointer of child process 'B' in its vma->anon_vma. Result: Spurious kernel panics as mentioned above during page reclamation or traversals of mapping of that anonymous page.

The fix that Linus suggested does a list_entry.prev to choose the OLDEST anon-vma in the list head of vma->same_vma chain which effectively just picks the oldest anon-vma in the heirarchy thus ensuring that we would always be referring to a valid anon-vma created for a process in the event of such page unmap/access reversals as above. The earlier RMAP code before Riks recent scalable anon-vma changeset also indirectly ensured this since every child always pointed to the anon-vma entry of the top most (or oldest) parent in the heirarchy.

The analysis of the anon-vma chain patch along with anon-vma management in the linux kernel, its implications in the form of random kernel panics and subsequent amazing debugging of Linus with apparently not much help from the author of the new scalable anon-vma himself: Rik Van Riel (as inferred from lkml threads of 2.6.34 rc4 ) further asserts the famous quote from Brian Kernighan:

"Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it."

Incidentally, I had seen this quote used by Rik himself in his signature earlier and this time, he himself happened to be on the receiving end of that famous quote :-)

Techie blog streams

Tuesday, June 22, 2010

Using static branch prediction with gcc

Tuesday, June 8, 2010

Examining Futexes and how it tackles THUNDERING HERD

Friday, May 14, 2010

Allocate only memory-groups for scatter-gather storage

Tuesday, May 4, 2010

Scalable anon-vma in linux kernel 2.6.34

Pages

Followers

Blog Archive

About Me