Yesterday I spent sometime in going through the problem. My first attempt at googling revealed Jonathan Corbets interesting tidbits on the problem at: http://lwn.net/Articles/383162
While the article was well-written, it really would be superficial just going through a writeup without looking into the code that caused the problem in the first place. So I went about going through Rik Van Riels new scalable anon-vma changeset at:
While I had a rough understanding of how the patch worked, I had to dig into the 2.6.34-rc2 sourcetree which also had his changes in order for the patch and the bug which was addressed ultimately by Linus, to make complete sense. If you just go through the explanation above, Riks commit log and his patch, it still won't make complete sense unless you yourself take a look at the relevant code for anonymous page management in the linux kernel.
Before I explain the changes introduced by Rik Van Riel for scalable anononymous virtual memory area management, let me quickly explain the meaning of anon-vma in the context of the kernel data structure.
An anonymous VMA is a memory region for a process with no backing storage. In other words, any private allocation pertaining to the process address space thats not file-backed can be termed as an anonymous page.
malloc, mmap with MAP_ANONYMOUS are example of anonymous pages for any process. In fact its pretty common for all the processes in the system to have anonymous pages unless off-course the process is entirely shared-mem or file-mapped. with no private allocations whatsoever.
Usually management of anonymous pages are not as complicated as opposed to file backed pages considering that they are private to the process virtual memory address space. But such anon-vma mappings get interesting when the process starts forking in which case, the virtual memory area and the anon-vma mappings are cloned and forked to the childs address space resulting in shared anon-vmas in preparation for a break COW to unprotect the write-protected page table entries in both the parent and the child.(copy on write trigger)
This poses problems during page reclamation phase of the kernel when such anonymous pages are eligible to be reclaimed by the kernel to free pages or memory when under memory pressure during LRU scanning.
The problem is to trace all the virtual memory areas of the children (or shared) mappings for an anonymous page to unmap the PTE entries from all the processes who have mapped that anonymous page.
In 2.4, the kernel had a brute force page reclaim approach where an LRU scan entailed scanning the MM areas of all the processes in the system through a init_mm list head incurring an O(N) scan which wasn't scalable.
In kernel 2.6.x, reverse mapping was introduced for pages to track the virtual memory areas mapped to a given page so as to enable the kernel to quickly and easily unmap the PTE entries(page table entries) pertaining to the virtual memory area for _each_ process.
Before Rik Van Riels recent rmap changes specific to anon-vma, the anon-vma of a parent process used to contain the virtual memory areas of all the child processes. A anon-vma pointer or hint was stored in the page->mapping field for the anonymous page(this was left untouched by Rik's recent patch). So a try_to_unmap_anon_page used to traverse the list of all anon-vma VMAs linked to the page and nuke the pte entries for each of the vma (which in-turn would correspond to each process sharing that anonymous page)
To give a simple example, lets assume that process 'A' to start with has done an mmap of a pagesize like below:
char *map = mmap(0, getpagesize(), PROT_READ | PROT_WRITE, MAP_ANONYMOUS, -1 ,0);
The above mmap would result in a new "vma" struct in the kernel for process : 'A'.
But the VMA wouldn't yet have an anon-vma representation since the anonymous page is only assigned to the VMA on a page fault as its deferred to page faults for the process or in other words - when the process actually tries to access/write to that page/vma.
So lets assume that process 'A' touched the pages through say a:
map = 0x7f;
Now this would incur a page fault for process 'A' in the kernel (or CPU trap 14 in x86), which would then assign an anon-vma representation in the kernel and assign the anon-vma to the vma.
So vma->anon_vma = anon_vma would be triggered through the anon_vma_prepare call in the __do_fault function in mm/memory.c
So far so good.
Now process 'A' forks and creates process 'B'. The fork of process 'B' would duplicate the mmap layout of the parent in the child through dup_mmap and this would in turn create a copy of the above "vma" area in the child. Now prior to Riks changes in 2.6.34, the anon-vma of all the children and down (grand children,etc.) all points to the oldest parent.
And childs VMA is added to the parents anon_vma list-head ->&anon_vma->head through anon_vma_link. So every anonymous VMA of the parent is duplicated in the child and the childs virtual memory area is added as a reverse map to the parents anon_vma->head. And during a page fault, the anon-vma or VMA pertaining to the anonymous page is added to the rmap of the anonymous page *page_add_anon_rmap. The anon-vma pointer is stored to the page->mapping field specific to backing storage of a page and an LSB lock bit indicates the setting of this mapping to the anonymous page.
So ideally every anonymous VMA of the child would have an anon-vma entry in the parent for the anon page to easily reverse map that page whenever that page (say) is going to be unmapped by the kernel during page reclamation/LRU scans when under memory pressure. So you just walk the anon-vma list head of VMAs of all the processes (all children, grand-children,etc.) , and nuke the PTE entry for the VMA pertaining to the process.
This technique worked fairly fell for normal workloads which were not heavily forking loads.
But because each VMA of the children is attached to the top most parent in the heirarchy, it failed for heavily forking loads like the AIM7 benchmark tests.
For forking workloads with a parent with 1000 child processes or 1000 anonyous pages (like above example of process 'A') and with each in turn forking 1000, we would end up with a million entries: 10^6 VMA entries in the anon_vma->head of the top most parent since the anon-vma pointer is just shared by all the descendants. So even though its effectively mapped for 1 in every 1000 of such processes, the rmap scan for the anon-vma unmap for an anonymous page becomes a brutal O(N) and hence hits the system CPU hard with almost ~100% on the scan with the anon-vma spinlock held during such anon-vma unmap traversals in the rmap code. So obviously the addition of the VMA's of all descendants to the top-most parent was again causing scalability issues with extremely heavy forking workloads like the AIM7 benchmarks.
In order to address this issue, Rik introduced a scalable anon-vma chain patch that distributed the anon-vma's to every process (all child processes down the line). So basically every process has an anon-vma chain with a list head of anon-vma->same_anon_vma chains and a list head of vma->same_vma chains for traversal of the anon-vma. Its basically a list of virtual memory areas stored as vma chains pertaining to a anonymous VMA and a list of anonymous virtual memory areas stored as anon-vma chains pertaining to a vma. (anon-vma->same_anon_vma chains and vma->same_vma chains back linkage)
So a parent process 'A' would start with a anon-vma chain on the above mmap with page touch causing a anon_vma_prepare.(page fault trigger)
Then a fork in process 'B', would result in anon_vma_fork which would first traverse the vma->same_vma chain of parent process 'A' to clone all the anonymous VMA entries to the child for each VMA being cloned on a fork. The vma->same_vma chain is created for the child and this vma chain is added initially to the anon_vma->same_anon_vma chain of the parent for that anonymous page.(similar to earlier anon_vma->head adds of a child VMA but only upto 1 level down the child and not grand-children and below)
Then a new anon-vma is also allocated for the child itself (not present in the earlier anon-vma rmap management code), plus a new anon vma chain is also allocated for the new anon-vma for the child. This new vma chain is then added to both the anon-vma->same_anon_vma chain and the anon-vma added to the vma->same_vma chain for the child itself.(self linkage)
This creation of per child anon-vma chains and anon-vma, distributes the anon-vma for the parents descendants on subsequent forks in its children and below. A break cow is simply delinking the anon-vma chain of the child from the parent and adding it to its own anon-vma chain that was created on a fork instead of adding it all to the top most parent. (oldest in the heirarchy as before)
A page unmap of a anonymous page would now amount to fetching the anon-vma pointer for the page from page->mapping as before, and then traversing the anon_vma->same_anon_chain list_head to iterate through the vma chains for all the children mapping that anonymous page. ( would be now 1000 in AIM7 fork load over a million earlier)
So in case of AIM7 forking workloads that would create a million processes, each process would have 1000 anon-vma chains distributed for its anonymous pages pertaining to the VMA. instead of distributing them all in the oldest parent. So instead of a million VMA entries for an anonymous page, you would have 1000 anon-vma chains for each of the 1000 children for each of the 1000 processes.
This radically improved the AIM7 benchmark with effectively no cpu utilization in the kernel as the anonymous page unmap scans for VMAs were reduced from a million to a 1000 for an anonymous page.
But the patch was triggering random kernel panics in the swap page reclamation path or trying to add in a anon-page that was unreferenced earlier, back to a process. The anon-vma mapped to the page was presumably getting corrupted as it was pointing to a dangling anon-vma pointer stored for an anonymous page in page->mapping.
Riks patch used to add a page to the anon map (page_add_anon_rmap) by trying to use the most recent anon-vma map set for that page. So if process 'A' starts with anon-vma 'X', forks process 'B' which initially also shares anon-vma 'X' in its anon-vma chain prepared/created by parent.
So far so good.
But lets say down the road, the order was inverted after the page was unmapped and referenced again, there is a potential to have a parent left with a dangling/freed anon-vma pointer for the anon-page coz of the above setting of the first created anon-vma to the anonymous page whenever such an anon-page is added to the rmap store through page_add_anon_rmap.
Lets say the following sequence happened (as clearly mentioned by Linus in the bugfix of anon-vma in rc4):
Process 'A' creates an anon-vma, forks 'B'. Initially 'A' and 'B' anon-vma chains point to anon-vma 'X' created by parent process 'A'.
Then both are unmapped lets say by the page LRU code later in an effort to swapout the page under memory pressure. Lets assume that the anonymous page is either in the swap cache or swapped out.
Now lets say that process 'B' referenced the anonymous page FIRST _before_ parent process 'A'.
So a page fault would result in anon_vma_prepare for process 'B' and swapin or swap-cache pull of the anonymous page. The page_anon_add_rmap : if(first) check ensures that the anon-vma pointer created/prepared by child process 'B' gets the anon-vma mapping into the anonymous page: page->mapping FIRST.
Then lets say parent process 'A' tries to access the same page and a page fault on the anon page would again incur a page_add_anon_rmap in which case the page->mapping anon-vma pointer for the parent would be a no-op as it was already set by the fault on child process 'B' earlier.(coz of the : if(first) or already set check).
Now parent process 'A's or the anonymous page's anon-vma mapping now points to the entry created by child process 'B'. Lets say, child process 'B' goes ahead and exits maybe through execvp and the VMA cleanup of process 'B' results in also kmem_cache_free of the anon-vma pointer created by child process 'B'. Hence this would result in a condition of leaving parent process 'A' having a dangling or freed anon-vma pointer of child process 'B' in its vma->anon_vma. Result: Spurious kernel panics as mentioned above during page reclamation or traversals of mapping of that anonymous page.
The fix that Linus suggested does a list_entry.prev to choose the OLDEST anon-vma in the list head of vma->same_vma chain which effectively just picks the oldest anon-vma in the heirarchy thus ensuring that we would always be referring to a valid anon-vma created for a process in the event of such page unmap/access reversals as above. The earlier RMAP code before Riks recent scalable anon-vma changeset also indirectly ensured this since every child always pointed to the anon-vma entry of the top most (or oldest) parent in the heirarchy.
The analysis of the anon-vma chain patch along with anon-vma management in the linux kernel, its implications in the form of random kernel panics and subsequent amazing debugging of Linus with apparently not much help from the author of the new scalable anon-vma himself: Rik Van Riel (as inferred from lkml threads of 2.6.34 rc4 ) further asserts the famous quote from Brian Kernighan:
"Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it."
Incidentally, I had seen this quote used by Rik himself in his signature earlier and this time, he himself happened to be on the receiving end of that famous quote :-)