We currently save the hostname of a proc when we create the ompi_proc_t for it. This was originally done because the only method we had for discovering the host of a proc was to include that info in the modex, and we therefore had to store it somewhere proc-local. Obviously, this carried a memory penalty for storing all those strings, so we added a "cutoff" parameter so that we wouldn't collect hostnames above a certain number of procs.
Unfortunately, this still results in an 8-byte/proc memory cost, as we keep a char* pointer in the opal_proc_t contained in the ompi_proc_t so that we can store the hostnames of the other procs if we fall below the cutoff. At scale, this can consume a fair amount of memory.
With the switch to relying on PMIx, there is no longer a need to cache the proc hostnames. Using the "optional" feature of PMIx_Get, we restrict the retrieval to be purely proc-local - i.e., we retrieve the info either via shared memory or from within the proc-internal hash storage (depending upon the active PMIx components). Thus, the retrieval of a hostname is purely a local operation involving no communication.
All RMs are required to provide a complete hostname map of all procs at startup. Thus, we have full access to all hostnames without including them in a modex or having to cache them on each proc. This allows us to remove the char* pointer from the opal_proc_t, saving us 8 bytes/proc.
Unfortunately, PMIx_Get does not currently support the return of a static pointer to memory. Thus, even though PMIx has the hostname in its memory, it can only return a malloc'd version of it. I have therefore ensured that the return from opal_get_proc_hostname is consistently malloc'd and free'd wherever used. This shouldn't be a burden as the hostname is only used in one of two circumstances:
(a) in an error message
(b) in a verbose output for debugging purposes
Thus, there should be no performance penalty associated with the malloc/free requirement. PMIx will eventually return static pointers, at which point we can simplify this method and return a "const char*" - but as noted, this really isn't an issue even today.
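
To make the lookup concrete, here is a minimal sketch of a purely proc-local hostname retrieval using PMIx_Get with the "optional" directive. The wrapper name lookup_hostname is hypothetical (it is not the actual opal_get_proc_hostname implementation), but the PMIx calls and the caller-must-free contract follow the description above:

    /* Minimal sketch (hypothetical wrapper, not the actual OPAL code) of a
     * purely proc-local hostname lookup.  The PMIX_OPTIONAL directive keeps
     * PMIx_Get from generating any communication - it only consults data
     * already available locally (shared memory or the proc-internal hash). */
    #include <stdbool.h>
    #include <stdlib.h>
    #include <string.h>
    #include <pmix.h>

    char *lookup_hostname(const pmix_proc_t *peer)
    {
        pmix_info_t info;
        pmix_value_t *val = NULL;
        char *hostname = NULL;
        bool optional = true;

        PMIX_INFO_LOAD(&info, PMIX_OPTIONAL, &optional, PMIX_BOOL);

        if (PMIX_SUCCESS == PMIx_Get(peer, PMIX_HOSTNAME, &info, 1, &val) &&
            NULL != val && PMIX_STRING == val->type) {
            /* Return a malloc'd copy - the caller owns the result and must
             * free() it, per the convention described above. */
            hostname = strdup(val->data.string);
        }

        PMIX_INFO_DESTRUCT(&info);
        if (NULL != val) {
            PMIX_VALUE_RELEASE(val);
        }
        return hostname;
    }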
Signed-off-by: Ralph Castain <rhc@pmix.org>
This commit renames the arithmetic atomic operations in opal to
indicate that they return the new value not the old value. This naming
differentiates these routines from new functions that return the old
value.
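
To illustrate the naming distinction, here is a small sketch using hypothetical wrapper names (the actual OPAL routine names differ) built on the GCC __atomic builtins:

    #include <stdint.h>

    /* Hypothetical names, for illustration only - not the actual OPAL API. */

    /* "add_fetch" style: returns the value *after* the addition (new value). */
    static inline int32_t demo_atomic_add_fetch_32(volatile int32_t *addr, int32_t delta)
    {
        return __atomic_add_fetch((int32_t *) addr, delta, __ATOMIC_SEQ_CST);
    }

    /* "fetch_add" style: returns the value *before* the addition (old value). */
    static inline int32_t demo_atomic_fetch_add_32(volatile int32_t *addr, int32_t delta)
    {
        return __atomic_fetch_add((int32_t *) addr, delta, __ATOMIC_SEQ_CST);
    }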
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This commit rewrites both the mpool and rcache frameworks. Summary of
changes:
- Before this change a significant portion of the rcache
functionality lived in mpool components. This meant that it was
impossible to add a new memory pool to use with rdma networks
(ugni, openib, etc) without duplicating the functionality of an
existing mpool component. All the registration functionality has
been removed from the mpool and placed in the rcache framework.
- All registration cache mpool components (udreg, grdma, gpusm,
rgpusm) have been changed to rcache components. rcaches are
allocated and released in the same way mpool components were.
- It is now valid to pass NULL as the resources argument when
creating an rcache. At this time the gpusm and rgpusm components
support this. All other rcache components require non-NULL
resources.
- A new mpool component has been added: hugepage. This component
supports huge page allocations on Linux.
- Memory pools are now allocated using "hints". Each mpool component
is queried with the hints and returns a priority. The current hints
supported are NULL (uses posix_memalign/malloc), page_size=x (huge
page mpool), and mpool=x. See the sketch after this list.
- The sm mpool has been moved to common/sm. This reflects that the sm
mpool is specialized and not meant for any general
allocations. This mpool may be moved back into the mpool framework
if there is any objection.
- The opal_free_list_init arguments have been updated. The unused0
argument is now used to pass in the registration cache module. The
mpool registration flags are now rcache registration flags.
- All components have been updated to make use of the new framework
interfaces.
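
As a rough sketch of the hint-driven selection described in the list above - the component query functions, priorities, and hint values shown are hypothetical, and only the hint string forms (NULL, page_size=x, mpool=x) come from the description:

    #include <stdio.h>
    #include <string.h>

    /* Hypothetical query functions: each mpool component inspects the hints
     * and returns a priority; the framework picks the highest bidder. */
    static int hugepage_query(const char *hints)
    {
        if (NULL != hints && (NULL != strstr(hints, "page_size=") ||
                              NULL != strstr(hints, "mpool=hugepage"))) {
            return 50;   /* hugepage component claims these requests */
        }
        return -1;       /* decline */
    }

    static int default_query(const char *hints)
    {
        return 0;        /* posix_memalign/malloc fallback is always available */
    }

    int main(void)
    {
        const char *hints[] = { NULL, "page_size=2M", "mpool=hugepage" };
        for (int i = 0; i < 3; ++i) {
            const char *winner =
                hugepage_query(hints[i]) > default_query(hints[i]) ? "hugepage" : "default";
            printf("hints=%-16s -> %s component\n",
                   NULL != hints[i] ? hints[i] : "(null)", winner);
        }
        return 0;
    }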
As this commit makes significant changes to both the mpool and rcache
frameworks, both versions have been bumped to 3.0.0.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
The only user of this code was coll/sm. I implemented a basic replacement
for the removed code. This gets the trunk compiling again with
--disable-dlopen.
This commit was SVN r32333.
WHAT: Open our low-level communication infrastructure by moving all necessary components (btl/rcache/allocator/mpool) down in OPAL
All the components required for inter-process communications are currently deeply integrated in the OMPI layer. Several groups/institutions have expressed interest in a more generic communication infrastructure, without all the OMPI layer dependencies. This communication layer should be made available at a different software level, accessible to all layers in the Open MPI software stack. As an example, our ORTE layer could replace the current OOB and instead use the BTLs directly, gaining access to more reactive network interfaces than TCP. Similarly, external software libraries could take advantage of our highly optimized AM (active message) communication layer for their own purposes.

UTK, with support from Sandia, developed a version of Open MPI where the entire communication infrastructure has been moved down to OPAL (btl/rcache/allocator/mpool). Most of the moved components have been updated to match the new schema, with a few exceptions (mainly BTLs where I have no way of compiling/testing them). Thus, the completion of this RFC is tied to being able to complete this move for all BTLs. For this we need help from the rest of the Open MPI community, especially those supporting some of the BTLs. A non-exhaustive list of BTLs that qualify here is: mx, portals4, scif, udapl, ugni, usnic.
This commit was SVN r32317.