30 March 2009

This file documents Open MPI's usage of ptmalloc2.  This is perhaps our
7,208,499th iteration of ptmalloc2 support, so let's document it here so
that some future developer might spend *slightly* less time
understanding what the heck is going on.

See the glibc documentation about malloc hooks before continuing.  It is
pretty much required reading before the rest of this file; without it,
you have little hope of understanding what's going on here:

    http://www.gnu.org/software/libc/manual/html_mono/libc.html#Hooks-for-Malloc

The overall goal is that we're using the Linux glibc hooks to wholly
replace the underlying allocator.  We *used* to use horrid linker tricks
to interpose OMPI's ptmalloc2 symbols with the glibc ones -- meaning
that user apps would call our symbols rather than the glibc ones.  But
that scheme is fraught with problems, not the least of which is that
*all* MPI applications are forced to use our overridden allocator (not
just the ones that need it, such as the ones running on
OpenFabrics-based networks).

Instead, what we do here is, frankly, quite similar to what is done in
MX: we use the 4 glibc hooks to assert our own malloc, realloc, free,
and memalign functions.  This makes the decision as to whether to use
this internal ptmalloc2 allocator a run-time decision.  This is quite
important: using the internal allocator has both benefits (enabling
mpi_leave_pinned=1 behavior) and drawbacks (breaking some debuggers,
being unnecessary for non-OpenFabrics-based networks, etc.).

Here's how it works...

This component *must* be linked statically as part of libopen-pal; it
*cannot* be a DSO.  Specifically, this library must be present during
pre-main() initialization so that its __malloc_initialize_hook can be
found and executed.  Loading it as a DSO during MPI_INIT is far too
late.  In configure.m4, we define the M4 macro
MCA_memory_ptmalloc2_COMPILE_MODE to always compile this component in
static mode.  Yay flexible build system.

This component provides a munmap() function that will intercept calls
to munmap() and do the Right Thing.  That is fairly straightforward to
do.  Intercepting the malloc/free/etc. allocator is much more
complicated.

All the ptmalloc2 public symbols in this component have been
name-shifted via the rename.h file.  Hence, what used to be "malloc" is
now opal_memory_ptmalloc2_malloc.  Since all the public symbols are
name-shifted, we can safely link this component into all MPI
applications.  Specifically: just because this ptmalloc2 allocator is
present in all OMPI executables and user-level applications, it won't
necessarily be used -- whether it is used is a separate, run-time
decision.

We set the __malloc_initialize_hook variable to point to
opal_memory_ptmalloc2_malloc_init_hook (in hooks.c).  This function is
called by the underlying glibc allocator before any allocations occur
and before the memory allocation subsystem is set up.  As such, this
function is *extremely* restricted in what it can do.  It cannot call
any form of malloc, for example (which seems fairly obvious, but it's
worth mentioning :-) ).  This function is one of the determining steps
as to whether we'll use the internal ptmalloc2 allocator or not.
Several checks are performed:

- Was either of the MCA params mpi_leave_pinned or
  mpi_leave_pinned_pipeline set?
- Is a driver found to be active indicating that an OS-bypass network
  is in effect (OpenFabrics, MX, Open-MX, etc.)?
- Was an environment variable set indicating that we want to disable
  this component?
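To make the mechanism concrete, here is a minimal, self-contained
sketch of the initialize-hook dance described above.  This is *not* the
component's actual code: it assumes an older glibc where the
(since-removed) __malloc_hook family still exists, and the shifted_*
functions are hypothetical stand-ins for the name-shifted ptmalloc2
entry points (they just fall through to glibc so the sketch stays
runnable):

    #include <malloc.h>
    #include <stdlib.h>

    static void *shifted_malloc(size_t size, const void *caller)
    {
        void *p;
        (void) caller;
        __malloc_hook = NULL;           /* don't recurse into ourselves */
        p = malloc(size);               /* stand-in for opal_memory_ptmalloc2_malloc() */
        __malloc_hook = shifted_malloc;
        return p;
    }

    static void shifted_free(void *ptr, const void *caller)
    {
        (void) caller;
        __free_hook = NULL;
        free(ptr);                      /* stand-in for opal_memory_ptmalloc2_free() */
        __free_hook = shifted_free;
    }

    static void init_hook(void)
    {
        /* glibc calls this before any allocation occurs; calling
         * malloc() here is forbidden.  This simplified decision only
         * checks the disable env variables; the real component also
         * checks the leave_pinned MCA params and looks for OS-bypass
         * network drivers (see the matrix below). */
        if (NULL != getenv("OMPI_MCA_memory_ptmalloc2_disable") ||
            NULL != getenv("FAKEROOTKEY")) {
            return;                     /* leave the glibc allocator alone */
        }
        __malloc_hook = shifted_malloc;
        __free_hook = shifted_free;
        /* ...and likewise for __realloc_hook and __memalign_hook. */
    }

    /* glibc looks for this variable during pre-main() initialization. */
    void (*__malloc_initialize_hook)(void) = init_hook;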
If either the $OMPI_MCA_memory_ptmalloc2_disable or the $FAKEROOTKEY
environment variable is set, we don't enable the memory hooks.

We then use the following matrix to determine whether to enable the
memory hooks or not (explanation below):

                    lpp = yes   lpp = no   lpp = runtime   lpp = not found
    lp = yes           yes         yes         yes             yes
    lp = no            yes         no          no              no
    lp = runtime       yes         no          runtime         runtime
    lp = not found     yes         no          runtime         runtime

    lp  = leave_pinned (the rows)
    lpp = leave_pinned_pipeline (the columns)

    yes       = found that variable to be set to "yes" (i.e., 1)
    no        = found that variable to be set to "no" (i.e., 0)
    runtime   = found that variable to be set to "determine at runtime"
                (i.e., -1)
    not found = that variable was not set at all

Hence, if we end up in a "yes" cell of the matrix, we enable the hooks.
If we end up in a "no" cell, we disable the hooks.  If we end up in a
"runtime" cell, we enable the hooks *if* we can find indications that
an OS-bypass network is present and available for use (e.g.,
OpenFabrics, MX, Open-MX, etc.).

To be clear: sometime during process startup, this function will
definitely be called.  It will either set the 4 hook functions to point
to our name-shifted ptmalloc2 functions, or it won't.  If the 4 hook
functions are set, then the underlying glibc allocator will always call
our 4 functions in all the relevant places instead of calling its own
functions.  Specifically: the process calls the underlying glibc
allocator, but that allocator makes function pointer callbacks to our
name-shifted ptmalloc2 functions to do the actual work.

Note that because we know our ptmalloc will not be providing all 5 hook
variables (because we want to use the underlying glibc hook variables),
they are #if 0'ed out in our malloc.c.  This has the direct consequence
that the *_hook_ini() functions in hooks.c are never used, so to avoid
compiler/linker warnings, those are #if 0'ed out as well.  All the
public functions in malloc.c that call hook functions were modified to
#if 0 the hook function invocations.  After all, that's something that
we want the *underlying* glibc allocator to do -- but we are installing
these functions as the hooks, so we don't want to invoke ourselves in
an infinite loop!

The next thing that happens in the startup sequence is that the
ptmalloc2 memory component's "open" function is called during MPI_INIT.
But we need to test whether the glibc memory hooks were overridden
before MPI_INIT was invoked.  If so, we need to signal that our
allocator support may not be complete.  Patrick Geoffray/MX suggests a
simple test: malloc() 4MB and then free it, and watch whether our
name-shifted ptmalloc2 free() function is invoked.  If it is, then all
of our hooks are probably in place and we can proceed.  If not, then we
set flags indicating that this memory allocator only supports MUNMAP
(not FREE/CHUNK).  We actually perform this test for malloc, realloc,
and memalign.  If they all pass, then we say that the memory allocator
supports everything.  If any of them fail, then we say that the memory
allocator does not support FREE/CHUNK.  (A sketch of this probe appears
below.)

NOTE: we *used* to simply set the FREE/CHUNK support flags during our
ptmalloc2's internal ptmalloc_init() function.  This is not a good idea
because even after our ptmalloc_init() function has been invoked,
someone may come in and override our memory hooks.  Doing the tests
during the ptmalloc2 memory component's open function seems to be the
safest way to check whether we *actually* support FREE/CHUNK (this is
what MX does, too).
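For illustration, here is a minimal sketch of that probe.  The flag and
function names are hypothetical; the real test lives in the component's
open function and covers realloc and memalign as well:

    #include <stdlib.h>

    /* Hypothetical flag; our name-shifted free() replacement would set
     * it to 1 every time it runs. */
    static volatile int our_free_was_called = 0;

    static int hooks_are_still_ours(void)
    {
        void *p;

        our_free_was_called = 0;
        p = malloc(4 * 1024 * 1024);   /* 4MB, per the MX-suggested test */
        free(p);

        /* If some other library overrode the glibc hooks after our
         * initialize hook ran, our free() never fired and the flag is
         * still 0: only claim MUNMAP support, not FREE/CHUNK. */
        return (1 == our_free_was_called);
    }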
As stated above, we always intercept munmap() -- this is acceptable in
all environments.  But we test that, too, just to be sure that the
munmap intercept is working; if we verify that it is working properly,
then we record that we have MUNMAP support.  (A sketch of such an
intercept appears at the end of this file.)

Much later in the init sequence during MPI_INIT, components indicate
whether they want to use mpi_leave_pinned[_pipeline] support or not.
For example, the openib BTL queries the opal_mem_hooks_support_level()
function to see if FREE and MUNMAP are supported.  If they are, then
the openib BTL sets mpi_leave_pinned = 1.

Finally, the mpool base does a final check: if
mpi_leave_pinned[_pipeline] is set to 1 and/or use_mem_hooks is set,
but FREE/MUNMAP are not set in the supported flags, then a warning is
printed.  Otherwise, life continues (presumably using
mpi_leave_pinned[_pipeline] support).

Simple, right?
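For reference, here is a rough sketch of what a munmap() intercept of
the sort described above can look like.  This is not the component's
actual code; it assumes a dlsym(RTLD_NEXT)-based lookup (so link with
-ldl), and the callback named in the comment is only indicative:

    #define _GNU_SOURCE            /* for RTLD_NEXT */
    #include <dlfcn.h>
    #include <sys/mman.h>

    int munmap(void *addr, size_t len)
    {
        static int (*real_munmap)(void *, size_t) = NULL;

        if (NULL == real_munmap) {
            real_munmap =
                (int (*)(void *, size_t)) dlsym(RTLD_NEXT, "munmap");
        }

        /* Notify interested parties (e.g., a registration cache) that
         * this range is about to disappear, then call the real
         * munmap().  In Open MPI, the notification goes through the
         * opal/memoryhooks callbacks. */
        return real_munmap(addr, len);
    }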
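And a hedged sketch of the support-level query that a component like
the openib BTL performs.  The flag names are assumptions modeled on
opal/memoryhooks/memory.h; check that header for the actual symbols:

    #include "opal/memoryhooks/memory.h"

    static void maybe_enable_leave_pinned(void)
    {
        int support = opal_mem_hooks_support_level();

        if ((support & OPAL_MEMORY_FREE_SUPPORT) &&
            (support & OPAL_MEMORY_MUNMAP_SUPPORT)) {
            /* Our free()/munmap() callbacks will fire, so a
             * registration cache can be kept coherent and
             * mpi_leave_pinned = 1 is safe to set here. */
        }
    }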