Friday, May 14, 2010

Allocate only memory-groups for scatter-gather storage

Recently we wrote a simple but generic memory group for scatter-gather storage that can be freed collectively in a single shot. There is no free of any individual object or data-set, just an alloc and a destroy. It's basically a simple power-of-2 alloc-only memory pool, but used only for specific scatter-gather result storage and processing scenarios.

It's not such a big deal, but the generic approach impressed me enough to blog about it. And especially when your product is going to run in space now - Openclovis in space - some development genericity makes the framework cooler even though it isn't really required :-). In this case, we could easily have had multiple copies of scatter-gather code in different places, since the usage scenarios were not as complex as some of the earlier ones for which we developed other generic solutions that made more sense.

The idea came from a colleague of mine whom I consider an expert in object storage, synthesis and object-oriented generic design in general. Being a "C" programmer all my life, I learnt some useful things about writing code in a fashion that could be _easily_ re-used by anyone.

Before discussing the implementation details of the memory group, ask yourself a question:
"How many times have you faced a situation where you are supposed to hunt for variable-sized data-sets and save them in scatter-gather lists for fast but short-lived inspection and deletion".

[
Of course this question is not relevant for dynamically typed languages like python but only holds good for "C" or C++. So if you are a python, ruby, perl or maybe even a Java programmer, you don't really have to read ahead and I leave it to your discretion :-)
]

It could be as simple as creating a storage array to hold the results and freeing them.
Something like:



void **data = NULL;
int next_data = 1;
int count = 0;

while(next_data > 0)
{
    if(!(count & 15)) /* allocate and expand the result-set 16 at a time */
    {
        data = realloc(data, (count + 16) * sizeof(*data));
        assert(data != NULL);
        memset(data + count, 0, 16 * sizeof(*data));
    }
    data[count] = calloc(1, getdatasize(&your_data_pool, count, &next_data));
    assert(data[count] != NULL);
    /*
     * Initialize data or fetch the result, copy it into data[count] and process data[count].
     * Free it here itself, or delay the frees to later if processing the data-set in batches.
     */
    ++count;
}


Now freeing them could be as simple as a free(data[count]); on each data-set after processing your results. The result-set storage above could even be a linked list, tree, or any data-structure that can hold these results for a short period of time.
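
For instance, the teardown of the array accumulator above is just a loop over the stored entries followed by a free of the array itself:

for(int i = 0; i < count; ++i)
    free(data[i]);
free(data);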

Based on the data pool and the query of the data-set, your storage mechanism could slightly vary. But most of the times, it would represent a scatter-gather list accumulator.

Now if you want to follow this approach in many different places in the code that accumulate and process results of different sizes and then eliminate them immediately, you would effectively have to replicate the above code in slightly different variants.

To address this in a generic manner, we implemented a simple but generic allocate-only memory group. We could have alternatively used our internal memory pool, but that pool is meant for fixed-size storage where you create different pools to accommodate different storage sizes. Something like the below creates a 1 MB pool that can be used to allocate objects up to 32 bytes in size:

pool = pool_create(1<<20, 32, optional_flags);

and then allocate the data-set from the pool:

data = alloc_pool(pool, flags);

But most of the time, the data we stored while quickly iterating through data-sets was not of a fixed size, so we couldn't pre-create memory cache pools and allocate objects from them. Instead we implemented an alternate but simple alloc-only memory group which uses a power-of-2 allocator to allocate memory for the variable-sized data-sets used during scatter-gather. This pool doesn't have any individual object free, just a collective free that destroys the entire scatter-gather memory group after usage.

The expectation is that these alloc-only memory groups would be short-lived and won't reside in the process for long periods holding memory. So the above scatter-gather result accumulator becomes something like this:



MemGroupT memGroup;
const int storageSizeBase = 8; /* just an initial base size; could be a sizeof(struct foo) */
memGroupInit(&memGroup, storageSizeBase); /* initialize a scatter-gather list with a base size */

/*
 * The allocation of the data storage is now from the above memGroup instead of a malloc/calloc.
 * So within the "while" loop, you would have:
 */
data[count] = memGroupAlloc(&memGroup, getdatasize(&your_data_pool, count, &next_data));



The above accumulates the resulting storage into an array for processing. You can use the allocations as-is, but they are now backed by the group, and you can pass the &memGroup reference to other places in the code to allocate memory from its storage pool. Then, after processing the result-set, you throw them all away with a collective, one-shot free instead of freeing each object/data individually:

memGroupFree(&memGroup);

which frees the memory group itself.

This also saves you from iterating over your entire data-set storage, which could be quite large, just to free the references later. The implementation of memGroup uses a 32-slot memory pool where the capacity of each slot is double that of the previous slot. So you effectively have a virtually large pool that can accommodate a maximum allocation of 4 GB, or 2 GB depending on whether you use the last slot or not ((1 << 32) or (1 << 31)).
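
To illustrate the power-of-2 slotting, here is a small sketch of the idea (hypothetical names, not the actual implementation linked below): a request is simply rounded up to the first slot whose capacity can hold it, with each slot's capacity doubling from a base size.

int memGroupSlotFor(size_t baseSize, size_t size)
{
    int slot = 0;
    size_t cap = baseSize;
    while(cap < size && slot < 31) /* 32 slots, capacity doubling per slot */
    {
        cap <<= 1;
        ++slot;
    }
    return slot;
}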

I have also slightly spiced up the implementation with support for read-write memory groups, so you can write-protect your objects or storage from the memory group whenever you want. Something like:


object = memGroupAlloc(&memGroup, dataSize);
memcpy(object, mydata, dataSize);


Then you can convert the storage into a read-only location through something like this:

object = memGroupProtect(&memGroup, object);


Any write to the above "object" location would now trigger a SIGSEGV or segfault, resulting in a process crash/core-dump, since the object's memory location is write-protected internally after memGroupProtect.

So you could lease the object to another portion of your code that isn't aware of the source of the object's storage (memGroup, malloc, etc.) and that is expected to read the contents of the object but _not_ modify it. You can then unprotect the object later to modify it again, through an unprotect call on the same object:


object = memGroupUnprotect(&memGroup, object);


It resembles the mprotect syscall: mprotect(obj, objlen, PROT_READ); and mprotect(obj, objlen, PROT_READ | PROT_WRITE); but only for a small object snapshot from a big storage space.

memGroupProtect and unprotect internally use read-write and read-only mmap mirrors of a shared resource that is given out to the user as a data storage reference. If you don't need protect and unprotect, you can comment out the blocks in the implementation below that use a temp file for the shared read-write and read-only mmaps, throw away the protect/unprotect code, and just use anonymous mmap or malloc calls to allocate a big storage chunk for your memGroup by modifying the memAlloc routine.
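
To make the mirror idea concrete, here is a minimal sketch of the technique (not the actual memGroup code; names like mirrorInit/mirrorProtect are made up for illustration): the same temp file is mapped twice, once writable and once read-only, and "protecting" an object simply translates its pointer from the writable mirror into the read-only one.

#include <sys/mman.h>
#include <unistd.h>
#include <stdlib.h>

static char *rwBase, *roBase;           /* the two mirrors of the same file */
static size_t mirrorSize = 1 << 20;     /* backing size for this sketch     */

int mirrorInit(void)
{
    char path[] = "/tmp/memgroup-XXXXXX";
    int fd = mkstemp(path);
    if(fd < 0) return -1;
    unlink(path);                        /* keep the fd, drop the name      */
    if(ftruncate(fd, mirrorSize) < 0) { close(fd); return -1; }
    rwBase = mmap(NULL, mirrorSize, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    roBase = mmap(NULL, mirrorSize, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);
    return (rwBase == MAP_FAILED || roBase == MAP_FAILED) ? -1 : 0;
}

void *mirrorProtect(void *obj)           /* writable view -> read-only view */
{
    return roBase + ((char*)obj - rwBase);
}

void *mirrorUnprotect(void *obj)         /* read-only view -> writable view */
{
    return rwBase + ((char*)obj - roBase);
}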

You can download/view the complete implementation here: Allocate-only MemGroups
Feel free to use the implementation in your code if you deem it fit for your usage-scenarios mentioned above.

Tuesday, May 4, 2010

Scalable anon-vma in linux kernel 2.6.34

While recently getting in touch with an ex-colleague and friend of mine - Vikram Shenoy - he mentioned the interesting Linux kernel 2.6.34 anon-vma kernel panic triggered during heavy testing of the suspend/resume sequence, which Linus ultimately nailed down in 2.6.34-rc4. The problem was a result of the new scalable anon-vma chains introduced by Rik Van Riel. He asked me to take a look at the bug as it had generated sufficient curiosity and headaches during the rc4 release.

Yesterday I spent some time going through the problem. My first attempt at googling revealed Jonathan Corbet's interesting tidbits on the problem at: http://lwn.net/Articles/383162
While the article was well-written, it would be superficial to just go through a writeup without looking into the code that caused the problem in the first place. So I went through Rik Van Riel's new scalable anon-vma changeset at:
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=5beb49305251e5669852ed541e8e2f2f7696c53e

While I had a rough understanding of how the patch worked, I had to dig into the 2.6.34-rc2 source tree, which also had his changes, for the patch and the bug ultimately addressed by Linus to make complete sense. If you just go through the explanation above, Rik's commit log and his patch, it still won't make complete sense unless you yourself take a look at the relevant code for anonymous page management in the Linux kernel.

Before I explain the changes introduced by Rik Van Riel for scalable anonymous virtual memory area management, let me quickly explain the meaning of anon-vma in the context of the kernel data structure.

An anonymous VMA is a memory region of a process with no backing storage. In other words, any private allocation pertaining to the process address space that's not file-backed can be termed an anonymous page.
malloc and mmap with MAP_ANONYMOUS are examples of anonymous pages for any process. In fact it's pretty common for all the processes in the system to have anonymous pages, unless of course the process is entirely shared-mem or file-mapped with no private allocations whatsoever.

Usually, management of anonymous pages is not as complicated as that of file-backed pages, considering that they are private to the process virtual memory address space. But such anon-vma mappings get interesting when the process starts forking, in which case the virtual memory area and the anon-vma mappings are cloned into the child's address space, resulting in shared anon-vmas in preparation for a break-COW to unprotect the write-protected page table entries in both the parent and the child (copy-on-write trigger).

This poses problems during the kernel's page reclamation phase, when such anonymous pages become eligible to be reclaimed to free memory under memory pressure during LRU scanning.

The problem is to trace all the virtual memory areas of the child (or shared) mappings for an anonymous page, so as to unmap the PTE entries from all the processes that have mapped that anonymous page.

In 2.4, the kernel had a brute-force page reclaim approach where an LRU scan entailed scanning the MM areas of all the processes in the system through the init_mm list head, incurring an O(N) scan which wasn't scalable.

In kernel 2.6.x, reverse mapping was introduced for pages to track the virtual memory areas mapping a given page, so as to enable the kernel to quickly and easily unmap the PTE entries (page table entries) pertaining to the virtual memory area of _each_ process.

Before Rik Van Riel's recent rmap changes specific to anon-vma, the anon-vma of a parent process used to contain the virtual memory areas of all the child processes. An anon-vma pointer or hint was stored in the page->mapping field of the anonymous page (this was left untouched by Rik's recent patch). So a try_to_unmap on an anonymous page used to traverse the list of all VMAs linked to the page's anon-vma and nuke the PTE entries for each VMA (which in turn corresponds to each process sharing that anonymous page).

To give a simple example, let's assume that process 'A', to start with, has done an mmap of a page size like below:

char *map = mmap(0, getpagesize(), PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

The above mmap would result in a new "vma" struct in the kernel for process 'A'.
But the VMA wouldn't yet have an anon-vma representation, since the anonymous page is only assigned to the VMA on a page fault; it's deferred until the process actually tries to access/write to that page/vma.

So let's assume that process 'A' touched the page through, say, a:
map[0] = 0x7f;

Now this would incur a page fault for process 'A' in the kernel (CPU trap 14 on x86), which would then create an anon-vma representation in the kernel and assign it to the vma.
So vma->anon_vma = anon_vma would be set up through the anon_vma_prepare call in the anonymous fault path in mm/memory.c.
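
Heavily simplified and paraphrased (this is not the literal kernel code), the anonymous-fault path does roughly the following:

/* paraphrased sketch of the anonymous page-fault path; not the exact 2.6.34 code */
static int anon_fault_sketch(struct mm_struct *mm, struct vm_area_struct *vma,
                             unsigned long address, pte_t *page_table)
{
    struct page *page;

    if (anon_vma_prepare(vma))                  /* ensure vma->anon_vma exists */
        return VM_FAULT_OOM;

    page = alloc_zeroed_user_highpage_movable(vma, address);
    if (!page)
        return VM_FAULT_OOM;

    page_add_new_anon_rmap(page, vma, address); /* page->mapping gets the anon_vma */
    set_pte_at(mm, address, page_table, mk_pte(page, vma->vm_page_prot));
    return 0;
}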

So far so good.

Now process 'A' forks and creates process 'B'. The fork would duplicate the mmap layout of the parent in the child through dup_mmap, and this would in turn create a copy of the above "vma" area in the child. Prior to Rik's changes in 2.6.34, the anon-vma of all the children and below (grand-children, etc.) pointed to the oldest parent.

The child's VMA is added to the parent's anon_vma list head (&anon_vma->head) through anon_vma_link. So every anonymous VMA of the parent is duplicated in the child, and the child's virtual memory area is added as a reverse map to the parent's anon_vma->head. During a page fault, the anon-vma pertaining to the anonymous page is added to the rmap of the page (page_add_anon_rmap): the anon-vma pointer is stored in the page->mapping field that normally refers to the backing storage of a page, with a low-order bit flag marking the mapping as anonymous.
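
For context, the pre-2.6.34 data structures behind this looked roughly like the following (paraphrased, with locking details trimmed):

/* pre-2.6.34 anon-vma layout, paraphrased */
struct anon_vma {
    spinlock_t lock;          /* serializes access to the vma list            */
    struct list_head head;    /* list of all "related" (parent + child) vmas  */
};

/* ... and inside struct vm_area_struct: */
struct list_head anon_vma_node;   /* links this vma into anon_vma->head       */
struct anon_vma *anon_vma;        /* the (shared, oldest-parent) anon_vma     */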

So every anonymous VMA of a child has an entry in the parent's anon-vma, allowing the kernel to easily reverse-map an anon page whenever, say, it is going to be unmapped during page reclamation/LRU scans under memory pressure. You just walk the anon-vma list head of VMAs of all the processes (all children, grand-children, etc.) and nuke the PTE entry of the VMA pertaining to each process.

This technique worked fairly well for normal workloads which were not heavily forking loads.

But because each child's VMA is attached to the top-most parent in the hierarchy, it failed for heavily forking loads like the AIM7 benchmark tests.

For forking workloads with a parent having 1000 child processes (or 1000 anonymous pages, as in the example of process 'A' above), each of which in turn forks 1000 more, we would end up with a million entries: 10^6 VMA entries in the anon_vma->head of the top-most parent, since the anon-vma pointer is shared by all the descendants. So even though each page is effectively mapped by only 1 in every 1000 of those processes, the rmap scan for the anon-vma unmap of an anonymous page becomes a brutal O(N), hitting the system CPU hard with almost ~100% usage on the scan, with the anon-vma spinlock held during such anon-vma unmap traversals in the rmap code. So the addition of the VMAs of all descendants to the top-most parent was causing scalability issues with extremely heavy forking workloads like the AIM7 benchmarks.

In order to address this issue, Rik introduced a scalable anon-vma chain patch that distributes the anon-vmas to every process (all child processes down the line). Basically, every process has anon-vma chains with a list head of anon_vma->same_anon_vma chains and a list head of vma->same_vma chains for traversal of the anon-vma. It's a list of virtual memory areas stored as chains pertaining to an anon-vma, and a list of anon-vmas stored as chains pertaining to a VMA (the anon_vma->same_anon_vma chains and vma->same_vma chains provide the back linkage).
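
For reference, the chain object introduced by the patch looks roughly like this (paraphrased from the 2.6.34 struct anon_vma_chain, with details trimmed):

struct anon_vma_chain {
    struct vm_area_struct *vma;     /* the VMA this link belongs to        */
    struct anon_vma *anon_vma;      /* the anon-vma this link points to    */
    struct list_head same_vma;      /* chained off the vma's chain list    */
    struct list_head same_anon_vma; /* chained off anon_vma->head          */
};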

So a parent process 'A' would start with an anon-vma chain when the above mmap'ed page is touched, causing an anon_vma_prepare (page fault trigger).

Then a fork creating process 'B' would result in anon_vma_fork, which first traverses the vma->same_vma chain of parent process 'A' to clone all the anon-vma entries to the child for each VMA being cloned on fork. A vma->same_vma chain is created for the child, and this chain is initially added to the parent's anon_vma->same_anon_vma chain for that anonymous page (similar to the earlier anon_vma->head additions of a child VMA, but only up to 1 level down the child, not grand-children and below).

Then a new anon-vma is also allocated for the child itself (not present in the earlier anon-vma rmap management code), plus a new anon-vma chain is allocated for that new anon-vma. This new chain is then added both to the new anon-vma's same_anon_vma list and to the child's vma->same_vma list (self linkage).

This creation of per-child anon-vma chains and anon-vmas distributes the anon-vmas across the parent's descendants on subsequent forks in its children and below. A break-COW simply delinks the child's anon-vma chain from the parent and adds it to the child's own anon-vma that was created on fork, instead of adding everything to the top-most parent (the oldest in the hierarchy, as before).

A page unmap of an anonymous page now amounts to fetching the anon-vma pointer for the page from page->mapping as before, and then traversing the anon_vma->same_anon_vma list head to iterate through the vma chains of all the processes mapping that anonymous page (now 1000 in the AIM7 fork load versus a million earlier).
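
A simplified, paraphrased sketch of that unmap traversal (modelled on the anonymous-unmap path in mm/rmap.c of 2.6.34, with error handling trimmed; not the literal code) looks roughly like:

static int unmap_anon_page(struct page *page, enum ttu_flags flags)
{
    struct anon_vma *anon_vma;
    struct anon_vma_chain *avc;
    int ret = SWAP_AGAIN;

    anon_vma = page_lock_anon_vma(page);      /* anon-vma from page->mapping */
    if (!anon_vma)
        return ret;

    /* one chain entry per process mapping this anonymous page */
    list_for_each_entry(avc, &anon_vma->head, same_anon_vma) {
        struct vm_area_struct *vma = avc->vma;
        ret = try_to_unmap_one(page, vma, vma_address(page, vma), flags);
        if (ret != SWAP_AGAIN || !page_mapped(page))
            break;
    }
    page_unlock_anon_vma(anon_vma);
    return ret;
}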

So for AIM7 forking workloads that create a million processes, each process would have 1000 anon-vma chains distributed for its anonymous pages pertaining to its VMAs, instead of piling them all onto the oldest parent. So instead of a million VMA entries for an anonymous page, you would have 1000 anon-vma chains for each of the 1000 children of each of the 1000 processes.

This radically improved the AIM7 benchmark, with effectively no CPU utilization in the kernel, as the unmap scans of VMAs for an anonymous page were reduced from a million to a thousand.

But the patch was triggering random kernel panics in the swap page reclamation path, or when adding an anon page that was unmapped earlier back into a process. The anon-vma mapped to the page was presumably getting corrupted, as page->mapping was left pointing to a dangling anon-vma pointer for the anonymous page.

Rik's patch used to add a page to the anon rmap (page_add_anon_rmap) using the most recent anon-vma set for that page. So say process 'A' starts with anon-vma 'X' and forks process 'B', which initially also shares anon-vma 'X' in its anon-vma chain, prepared/created by the parent.

So far so good.

But let's say that down the road the order was inverted after the page was unmapped and referenced again: there is a potential for the parent to be left with a dangling/freed anon-vma pointer for the anon page, because of the above behaviour of setting the first faulting process's anon-vma on the anonymous page whenever such a page is added to the rmap store through page_add_anon_rmap.

Let's say the following sequence happened (as clearly described by Linus in the anon-vma bugfix in rc4):

Process 'A' creates an anon-vma and forks 'B'. Initially, the anon-vma chains of 'A' and 'B' point to anon-vma 'X' created by parent process 'A'.

Then both are unmapped, let's say by the page LRU code later in an effort to swap out the page under memory pressure. Let's assume that the anonymous page is either in the swap cache or swapped out.

Now let's say that process 'B' references the anonymous page FIRST, _before_ parent process 'A'.

So a page fault results in anon_vma_prepare for process 'B' and a swap-in or swap-cache pull of the anonymous page. The if(first) check in page_add_anon_rmap ensures that the anon-vma created/prepared by child process 'B' gets stored into the anonymous page's page->mapping FIRST.

Then let's say parent process 'A' tries to access the same page. The page fault on the anon page again incurs a page_add_anon_rmap, in which case setting the page->mapping anon-vma pointer for the parent is a no-op, as it was already set by the fault on child process 'B' earlier (because of the if(first), i.e. already-set, check).

Now the anonymous page's anon-vma mapping, as seen by parent process 'A', points to the entry created by child process 'B'. Let's say child process 'B' goes ahead and exits (maybe through an execvp that tears down its old address space), and the VMA cleanup of process 'B' results in a kmem_cache_free of the anon-vma created by child process 'B'. This leaves the anonymous page, which parent process 'A' still maps, with a dangling or freed anon-vma pointer of child process 'B' in its page->mapping. Result: spurious kernel panics, as mentioned above, during page reclamation or traversals of that anonymous page's mapping.

The fix that Linus made does a list_entry on the .prev of the vma->same_vma chain's list head to choose the OLDEST anon-vma, which effectively picks the oldest anon-vma in the hierarchy, thus ensuring that we always refer to an anon-vma that remains valid even in the event of such page unmap/access reversals as above. The earlier rmap code, before Rik's recent scalable anon-vma changeset, also indirectly ensured this, since every child always pointed to the anon-vma entry of the top-most (or oldest) parent in the hierarchy.
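
Paraphrased, the gist of the fix when setting up the page's anon rmap is to pick the last (oldest) entry on the vma's chain rather than the vma's own anon_vma (a sketch of the idea, not the literal commit):

struct anon_vma_chain *avc;

/* the tail (.prev) of vma->anon_vma_chain ends up being the oldest anon-vma in the hierarchy */
avc = list_entry(vma->anon_vma_chain.prev, struct anon_vma_chain, same_vma);
anon_vma = avc->anon_vma;
/* anon_vma (with the anon flag bit set) is then what gets stored into page->mapping */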

The analysis of the anon-vma chain patch along with anon-vma management in the Linux kernel, its implications in the form of random kernel panics, and the subsequent amazing debugging by Linus - with apparently not much help from the author of the new scalable anon-vma himself, Rik Van Riel (as inferred from the lkml threads of 2.6.34-rc4) - further affirms the famous quote from Brian Kernighan:

"Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it."

Incidentally, I had seen this quote used by Rik himself in his signature earlier and this time, he himself happened to be on the receiving end of that famous quote :-)