So shift the cutoff param to the MPI layer, and have it solely determine whether or not we call modex_recv on the hostname. If comm_world is of size greater than the cutoff, then we don't automatically retrieve the hostname when we build the ompi_proc_t for a process - instead, we fill the hostname entry on first call to modex_recv for that process.
The param is now "ompi_hostname_cutoff=N", where N=number of procs for cutoff.
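The cutoff rule described above can be sketched as follows. This is a minimal illustration, not the actual OMPI code: sketch_proc_t, sketch_fetch_hostname, sketch_proc_init and sketch_modex_recv are hypothetical stand-ins for ompi_proc_t, the real hostname lookup, proc construction and modex_recv.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for ompi_proc_t; only the hostname matters here. */
typedef struct {
    int  rank;
    int  have_hostname;   /* 0 until the hostname has been retrieved */
    char hostname[64];
} sketch_proc_t;

/* Counts how often the (assumed expensive) lookup actually runs. */
static int sketch_fetch_count = 0;

/* Hypothetical stand-in for the real hostname retrieval. */
static void sketch_fetch_hostname(sketch_proc_t *p)
{
    snprintf(p->hostname, sizeof(p->hostname), "node%03d", p->rank);
    p->have_hostname = 1;
    sketch_fetch_count++;
}

/* Build a proc entry: retrieve the hostname eagerly only when the job
 * is not larger than the cutoff. */
static void sketch_proc_init(sketch_proc_t *p, int rank,
                             int world_size, int cutoff)
{
    p->rank = rank;
    p->have_hostname = 0;
    p->hostname[0] = '\0';
    if (world_size <= cutoff) {
        sketch_fetch_hostname(p);
    }
}

/* Stand-in for modex_recv: fill the hostname on the first call only. */
static const char *sketch_modex_recv(sketch_proc_t *p)
{
    if (!p->have_hostname) {
        sketch_fetch_hostname(p);
    }
    return p->hostname;
}
```

With world_size above the cutoff, sketch_proc_init leaves the hostname empty and the first sketch_modex_recv call fills it; later calls reuse the stored value.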
Refs trac:3729
This commit was SVN r29056.
The following Trac tickets were found above:
Ticket 3729 --> https://svn.open-mpi.org/trac/ompi/ticket/3729
Commit r27211 added ifdef checks for #define
HAVE_IBV_LINK_LAYER_ETHERNET, which is incorrect. The correct #define
is HAVE_DECL_IBV_LINK_LAYER_ETHERNET. This broke OMPI over iWARP.
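A small sketch of why the macro name matters, assuming the define comes from an Autoconf AC_CHECK_DECLS-style check (which defines HAVE_DECL_<symbol> to 1 or 0). The fallback #define below only simulates a configure-generated header and the two sketch_ functions are hypothetical:

```c
/* Simulate a configure-generated header in which the declaration check
 * succeeded (assumption for this sketch only). */
#ifndef HAVE_DECL_IBV_LINK_LAYER_ETHERNET
#define HAVE_DECL_IBV_LINK_LAYER_ETHERNET 1
#endif

/* Tests the macro that is actually defined; #if is used because an
 * AC_CHECK_DECLS-style macro is defined to 0 when the check fails. */
static int sketch_have_ethernet_link_layer(void)
{
#if HAVE_DECL_IBV_LINK_LAYER_ETHERNET
    return 1;
#else
    return 0;
#endif
}

/* Tests the wrong name: that macro is never defined, so the guarded
 * code is silently compiled out. */
static int sketch_wrong_macro_check(void)
{
#ifdef HAVE_IBV_LINK_LAYER_ETHERNET
    return 1;
#else
    return 0;
#endif
}
```

The second function always takes the #else branch, which matches the symptom described here: the Ethernet link-layer code path was never built.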
This fixes trac:3726 and should be added to cmr:v1.7.3:reviewer=jsquyres
This commit was SVN r29053.
The following SVN revision numbers were found above:
r27211 --> open-mpi/ompi@b27862e5c7
The following Trac tickets were found above:
Ticket 3726 --> https://svn.open-mpi.org/trac/ompi/ticket/3726
* add a new MCA param orte_hostname_cutoff to specify the number of nodes at which we stop including hostnames. This defaults to INT_MAX => always include hostnames. If a value is given, then we will include hostnames for any allocation smaller than the given limit.
* remove ompi_proc_get_hostname. Replace all occurrences with a direct reference to ompi_proc_t's proc_hostname, protected by appropriate "if NULL" checks
* modify the OMPI-ORTE integration component so that any call to modex_recv automatically loads the ompi_proc_t->proc_hostname field as well as returning the requested info. Thus, any process whose modex info you retrieve will automatically receive the hostname. Note that on-demand retrieval is still enabled - i.e., if we are running under direct launch with PMI, the hostname will be fetched upon first call to modex_recv, and then the ompi_proc_t->proc_hostname field will be loaded
* removed a stale MCA param "mpi_keep_peer_hostnames" that was no longer used anywhere in the code base
* added an envar lookup in ess/pmi for the number of nodes in the allocation. Sadly, PMI itself doesn't provide that info, so we have to get it a different way. Currently, we support PBS-based systems and SLURM - for any other, rank0 will emit a warning and we assume max number of daemons so we will always retain hostnames
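A rough sketch of the node-count lookup and cutoff decision from the list above. The variable names SLURM_NNODES and PBS_NUM_NODES are assumptions (this message does not say which envars ess/pmi reads), and all sketch_ functions are hypothetical:

```c
#include <limits.h>
#include <stdlib.h>

/* Parse a node count from an environment string.
 * Returns -1 when the value is missing or not a positive integer. */
static int sketch_parse_node_count(const char *val)
{
    char *end;
    long n;

    if (val == NULL) {
        return -1;
    }
    n = strtol(val, &end, 10);
    if (end == val || *end != '\0' || n <= 0 || n > INT_MAX) {
        return -1;
    }
    return (int)n;
}

/* Try the assumed SLURM and PBS variables in turn. */
static int sketch_num_nodes_from_env(void)
{
    static const char *names[] = { "SLURM_NNODES", "PBS_NUM_NODES" };
    size_t i;

    for (i = 0; i < sizeof(names) / sizeof(names[0]); i++) {
        int n = sketch_parse_node_count(getenv(names[i]));
        if (n > 0) {
            return n;
        }
    }
    return -1;   /* unknown: the caller should warn and keep hostnames */
}

/* Keep hostnames for allocations smaller than the cutoff, and also
 * when the node count could not be determined. */
static int sketch_keep_hostnames(int num_nodes, int cutoff)
{
    if (num_nodes < 0) {
        return 1;
    }
    return num_nodes < cutoff;
}
```

With the default cutoff of INT_MAX, sketch_keep_hostnames returns 1 for any allocation size, matching the "always include hostnames" default.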
This commit was SVN r29052.
Commit r27211 missed a config file change, which broke OMPI over
iWARP transports.
This fixes trac:3726 and should be added to cmr:v1.7.3:reviewer=jsquyres
This commit was SVN r29049.
The following SVN revision numbers were found above:
r27211 --> open-mpi/ompi@b27862e5c7
The following Trac tickets were found above:
Ticket 3726 --> https://svn.open-mpi.org/trac/ompi/ticket/3726
This creates really bad scaling behavior. Users have found a nearly 20% launch time differential between mpirun and PMI, with PMI being the slower method. Some of the problem is attributable to poor exchange algorithms in RMs like Slurm and Alps, but we make things worse by calling "get" so many times.
Nathan (with a tad of advice from me) has attempted to alleviate this problem by reducing the number of "get" calls. This required the following changes:
* upon first request for data, have the OPAL db pmi component fetch and decode *all* the info from a given remote proc. It turned out we weren't caching the info, so we would continually request it and only decode the piece we needed for the immediate request. We now decode all the info and push it into the db hash component for local storage - and then all subsequent retrievals are fulfilled locally
* reduced the amount of data by eliminating the exchange of the OMPI_ARCH value if heterogeneity is not enabled. This was used solely as a check so we would error out if the system wasn't actually homogeneous, which was fine when we thought there was no cost in doing the check. Unfortunately, at large scale and with direct launch, there is a non-zero cost of making this test. We are open to finding a compromise (perhaps turning the test off if requested?), if people feel strongly about performing the test
* reduced the amount of RTE data being automatically fetched, and fetched the rest only upon request. In particular, we no longer immediately fetch the hostname (which is only used for error reporting), but instead get it when needed. Likewise for the RML uri as that info is only required for some (not all) environments. In addition, we no longer fetch the locality unless required, relying instead on the PMI clique info to tell us who is on our local node (if additional info is required, the fetch is performed when a modex_recv is issued).
Again, all this only impacts direct launch - all the info is provided when launched via mpirun as there is no added cost to getting it
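The "fetch and decode everything on first request" idea can be sketched as below. This is an illustration only: the key names, sketch_remote_get_all and the fixed-size cache are hypothetical and do not reflect the actual OPAL db pmi or hash component interfaces.

```c
#include <string.h>

#define SKETCH_MAX_KEYS 4

typedef struct {
    const char *key;
    const char *value;
} sketch_kv_t;

/* Local per-peer store, standing in for the db hash component. */
typedef struct {
    int         cached;   /* 0 until the peer's data has been fetched */
    int         nkv;
    sketch_kv_t kv[SKETCH_MAX_KEYS];
} sketch_peer_cache_t;

/* Counts the (assumed expensive) remote "get" operations. */
static int sketch_remote_gets = 0;

/* Hypothetical stand-in for one remote "get" that returns and decodes
 * everything the peer published. Values are placeholders. */
static int sketch_remote_get_all(sketch_kv_t *out, int max)
{
    static const sketch_kv_t blob[] = {
        { "hostname", "node001" },
        { "rml.uri",  "uri-placeholder" },
        { "locality", "0x1" },
    };
    int n = (int)(sizeof(blob) / sizeof(blob[0]));

    if (n > max) {
        n = max;
    }
    memcpy(out, blob, (size_t)n * sizeof(blob[0]));
    sketch_remote_gets++;
    return n;
}

/* The first request pulls and stores all of the peer's data; every
 * later request is answered from the local store. */
static const char *sketch_fetch(sketch_peer_cache_t *c, const char *key)
{
    int i;

    if (!c->cached) {
        c->nkv = sketch_remote_get_all(c->kv, SKETCH_MAX_KEYS);
        c->cached = 1;
    }
    for (i = 0; i < c->nkv; i++) {
        if (strcmp(c->kv[i].key, key) == 0) {
            return c->kv[i].value;
        }
    }
    return NULL;
}
```

However many keys are requested for a given peer, sketch_remote_gets increases only once for that peer.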
Barring objections, we may move this (plus any required other pieces) to the 1.7 branch once it soaks for an appropriate time.
This commit was SVN r29040.
non-contiguous convertor. We can't "convert on the fly" because the number
of bytes requested may not divide evenly into the convertor data type.
This commit was SVN r29014.
The orte rte component checks the orte_standalone_operation to decide
if it should wait for a message from the hnp or wait on the debugger.
This variable needed to be set to true in ess/pmi to enable the
correct path when direct launching.
cmr=v1.7.3:reviewer=rhc
cmr=v1.6.6:reviewer=rhc
This commit was SVN r29013.
spawning a process using MPI_Comm_spawn. For this, the first operation has to
be collective, which we cannot guarantee outside of the MPI_File_open
operation.
This commit was SVN r29008.