* add a new MCA param orte_hostname_cutoff to specify the number of nodes at which we stop including hostnames. This defaults to INT_MAX => always include hostnames. If a value is given, then we will include hostnames for any allocation smaller than the given limit.
* remove ompi_proc_get_hostname. Replace all occurrences with direct access to ompi_proc_t's proc_hostname field, protected by appropriate NULL checks (see the sketch after this list)
* modify the OMPI-ORTE integration component so that any call to modex_recv automatically loads the ompi_proc_t->proc_hostname field as well as returning the requested info. Thus, any process whose modex info you retrieve will automatically receive the hostname. Note that on-demand retrieval is still enabled - i.e., if we are running under direct launch with PMI, the hostname will be fetched upon first call to modex_recv, and then the ompi_proc_t->proc_hostname field will be loaded
* remove a stale MCA param "mpi_keep_peer_hostnames" that was no longer used anywhere in the code base
* add an envar lookup in ess/pmi for the number of nodes in the allocation. Sadly, PMI itself doesn't provide that info, so we have to get it another way. Currently, we support PBS-based systems and SLURM; for any other, rank 0 will emit a warning and we assume the max number of daemons, so we will always retain hostnames
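A minimal sketch of the NULL-guarded replacement for ompi_proc_get_hostname mentioned above; the reporting function here is purely illustrative:

    #include "opal/util/output.h"
    #include "ompi/proc/proc.h"

    /* Illustrative only: read ompi_proc_t's proc_hostname directly,
     * guarding against the case where it has not yet been fetched. */
    static void report_peer_error(ompi_proc_t *proc)
    {
        const char *host = (NULL == proc->proc_hostname)
                           ? "<unknown>" : proc->proc_hostname;
        opal_output(0, "error on peer host %s", host);
    }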
This commit was SVN r29052.
Direct launch under PMI currently shows really bad scaling behavior: users have found a nearly 20% launch-time differential between mpirun and PMI, with PMI being the slower method. Some of the problem is attributable to poor exchange algorithms in RMs like Slurm and ALPS, but we make things worse by calling "get" so many times.
Nathan (with a tad of advice from me) has attempted to alleviate this problem by reducing the number of "get" calls. This required the following changes:
* upon first request for data, have the OPAL db pmi component fetch and decode *all* the info from a given remote proc. It turned out we weren't caching the info, so we would continually request it and only decode the piece we needed for the immediate request. We now decode all the info and push it into the db hash component for local storage - and then all subsequent retrievals are fulfilled locally (see the sketch below)
* reduced the amount of data by eliminating the exchange of the OMPI_ARCH value if heterogeneity is not enabled. This was used solely as a check so we would error out if the system wasn't actually homogeneous, which was fine when we thought there was no cost in doing the check. Unfortunately, at large scale and with direct launch, there is a non-zero cost of making this test. We are open to finding a compromise (perhaps turning the test off if requested?), if people feel strongly about performing the test
* reduced the amount of RTE data being automatically fetched, and fetched the rest only upon request. In particular, we no longer immediately fetch the hostname (which is only used for error reporting), but instead get it when needed. Likewise for the RML uri as that info is only required for some (not all) environments. In addition, we no longer fetch the locality unless required, relying instead on the PMI clique info to tell us who is on our local node (if additional info is required, the fetch is performed when a modex_recv is issued).
Again, all this only impacts direct launch - all the info is provided when launched via mpirun as there is no added cost to getting it
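A rough sketch of the fetch-all-and-cache behavior described above. All helper names here (peer_cached, pmi_get_blob, decode_and_store_all, local_hash_lookup) are hypothetical stand-ins, not the actual OPAL db component API:

    #include <stdbool.h>
    #include <stdlib.h>

    /* Hypothetical helper API -- not the actual OPAL db component calls. */
    typedef int peer_id_t;
    extern bool  peer_cached(peer_id_t peer);
    extern void  mark_peer_cached(peer_id_t peer);
    extern char *pmi_get_blob(peer_id_t peer);
    extern void  decode_and_store_all(peer_id_t peer, const char *blob);
    extern int   local_hash_lookup(peer_id_t peer, const char *key, void **value);

    /* On the first request for a peer, fetch and decode everything that
     * peer published, cache it locally, then serve this and all later
     * lookups from the local hash -- one PMI "get" per peer, total. */
    static int fetch_peer(peer_id_t peer, const char *key, void **value)
    {
        if (!peer_cached(peer)) {
            char *blob = pmi_get_blob(peer);
            decode_and_store_all(peer, blob);
            free(blob);
            mark_peer_cached(peer);
        }
        return local_hash_lookup(peer, key, value);
    }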
Barring objections, we may move this (plus any other required pieces) to the 1.7 branch once it soaks for an appropriate time.
This commit was SVN r29040.
Call ompi_show_help, because opal_show_help is replaced with an
aggregating version when using ORTE, so there's no reason to
call orte_show_help directly.
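For illustration, the call shape involved; the file and topic names here are made up, and we assume ompi_show_help() mirrors opal_show_help()'s (filename, topic, want_error_header, ...) signature:

    /* Illustrative call shape only; names below are hypothetical. */
    ompi_show_help("help-example.txt",  /* help text file (hypothetical) */
                   "example-topic",     /* topic tag (hypothetical)      */
                   true,                /* prepend the error header      */
                   hostname, errcode);  /* varargs filled into message   */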
This commit was SVN r28051.
* btl sendi(): if the message can be sent inline, try to avoid requesting a completion signal
* a signal is requested once per 64 sends, or when there are no send WQEs left
* when the message cannot be sent inline, fall back to a BTL method other than sendi()
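A hedged sketch of the signal-coalescing idea using the verbs API; the endpoint counter and threshold handling are illustrative, not the actual openib BTL code:

    #include <infiniband/verbs.h>

    #define SIGNAL_INTERVAL 64  /* request a completion once per 64 sends */

    /* Illustrative only: decide per-send whether to ask for a signal. */
    static void post_inline_send(struct ibv_qp *qp, struct ibv_send_wr *wr,
                                 unsigned *sends_since_signal, int wqes_left)
    {
        struct ibv_send_wr *bad_wr;

        wr->send_flags = IBV_SEND_INLINE;
        if (++(*sends_since_signal) >= SIGNAL_INTERVAL || 0 == wqes_left) {
            wr->send_flags |= IBV_SEND_SIGNALED;  /* force a CQE so WQEs recycle */
            *sends_since_signal = 0;
        }
        (void) ibv_post_send(qp, wr, &bad_wr);
    }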
This commit was SVN r27724.
- OMPI_SUCCESS
- OMPI_ERROR
- OMPI_ERR_RESOURCE_BUSY
If an "OMPI_ERR_OUT_OF_RESOURCE" occurs, the request is added to the pending list, and will be handled later. An error message
should not be printed to the user in this case. This is not an error, but rather a notification of a possible valid condition.
Only in the case of "OMPI_ERROR" should it be printed to the user.
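A sketch of the intended caller-side handling; try_send and append_to_pending_list are hypothetical names used only to show the contract:

    #include "ompi/constants.h"
    #include "opal/util/output.h"

    /* Hypothetical helpers, named only for illustration. */
    extern int  try_send(void *request);
    extern void append_to_pending_list(void *request);

    static void send_or_queue(void *request)
    {
        int rc = try_send(request);
        switch (rc) {
        case OMPI_SUCCESS:
            break;                            /* sent: nothing to do         */
        case OMPI_ERR_OUT_OF_RESOURCE:
            append_to_pending_list(request);  /* valid condition: stay quiet */
            break;
        default:                              /* OMPI_ERROR and friends      */
            opal_output(0, "send failed with %d", rc);
            break;
        }
    }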
This commit was SVN r27065.
Roll in the ORTE state machine. Remove last traces of opal_sos. Remove UTK epoch code.
Please see the various emails about the state machine change for details. I'll send something out later with more info on the new arch.
This commit was SVN r26242.
Extend the BTL error callback so that it allows the BTL to specify the
specific ompi_proc_t that had an error, and add an optional descriptive
string. Currently, these arguments are not used, but they will be used by the future failover PML.
Changes based on RFC. Reviewed by George Bosilca.
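The callback shape this implies, roughly; parameter names are guesses from the description above, not necessarily what appears in btl.h:

    #include <stdint.h>

    struct mca_btl_base_module_t;
    struct ompi_proc_t;

    /* Rough sketch of the extended error callback described above. */
    typedef void (*mca_btl_base_module_error_cb_fn_t)(
        struct mca_btl_base_module_t *module, /* BTL reporting the error      */
        int32_t flags,                        /* error classification flags   */
        struct ompi_proc_t *errproc,          /* peer proc that had the error */
        char *btlinfo);                       /* optional descriptive string  */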
This commit was SVN r23174.
* Change all comparisons of the form (OMPI_ERR_* == ret) to
(OMPI_ERR_* == OPAL_SOS_GET_ERR_CODE(ret)), since the return value could be a
SOS-encoded error. OPAL_SOS_GET_ERR_CODE() takes in a SOS error and returns
back the native error code (illustrated below).
* Since OPAL_SUCCESS is preserved by SOS, also change all calls of the form
(OPAL_ERROR == ret) to (OPAL_SUCCESS != ret). We thus avoid having to
decode 'ret' to get the native error code.
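The before/after pattern, for illustration; the error-code name is just an example and headers are omitted:

    /* Pattern illustration only. */
    int check(int ret)
    {
        /* before: (OMPI_ERR_OUT_OF_RESOURCE == ret) may miss a
         * SOS-encoded value; after: decode first, then compare. */
        if (OMPI_ERR_OUT_OF_RESOURCE == OPAL_SOS_GET_ERR_CODE(ret)) {
            return 1;  /* handle the specific error */
        }
        /* OPAL_SUCCESS survives SOS encoding, so test success directly
         * instead of comparing against OPAL_ERROR. */
        return (OPAL_SUCCESS != ret) ? -1 : 0;
    }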
This commit was SVN r23162.
Also includes some minor copyright header additions that were missed in previous checkins
fixes trac:2101 added cmr:v1.4
This commit was SVN r22676.
The following Trac tickets were found above:
Ticket 2101 --> https://svn.open-mpi.org/trac/ompi/ticket/2101
Rename OMPI_* to OPAL_*. This allows the opal layer to be used more independently
of the whole of ompi.
NOTE: 9 "svn mv" operations immediately follow this commit.
This commit was SVN r21180.
Adapt orte_process_info to orte_proc_info, and
change orte_proc_info() to orte_proc_info_init().
- Compiled on linux-x86-64
- Discussed with Ralph
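In caller terms, the function rename looks like this (illustrative fragment only):

    /* Illustrative call-site change implied by the rename: */
    ret = orte_proc_info();       /* before */
    ret = orte_proc_info_init();  /* after  */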
This commit was SVN r20739.
Anyway, this is blocking the move: do not include pml.h
if it is not really needed, i.e., if none of the following are used:
mca_pml
MCA_PML_CALL
OMPI_ANY_TAG
OMPI_ANY_SOURCE
OMPI_PROC_NULL
- Notable exceptions (deleting the include in one header meant adding it elsewhere):
- ompi/mca/mtl/psm/
- ompi/mca/osc/rdma/
- ompi/mca/btl/openib/btl_openib_endpoint.c depended on
pml_base_sendreq.h
- Tested on Linux/x86-64, this time including make check
(thanks Jeff and Ralph)
This commit was SVN r20725.
anyhow -- if oob functionality is needed, then orte/mca/oob/oob.h should be included.
Nevertheless, this compiles fine with -Wimplicit-function-declaration
This commit was SVN r20641.
Commit from a long-standing Mercurial tree that ended up incorporating a lot of things:
* A few fixes for CPC interface changes in all the CPCs
* Attempts (but not yet finished) to fix shutdown problems in the IB CM CPC
* #1210: add CTS support (i.e., initiator guarantees to send first message; automatically activated for iWARP over the RDMA CM CPC)
* Some variable and function renamings to make this be generic (e.g., alloc_credit_frag became alloc_control_frag)
* CPCs no longer post receive buffers; they only post a single receive buffer for the CTS if they use CTS. Instead, the main BTL now posts the main sets of receive buffers.
* CPCs allocate a CTS buffer only if they're about to make a connection
* RDMA CM improvements:
* Use threaded-mode openib fd monitoring to wait for RDMA CM events
* Synchronize endpoint finalization and disconnection between main thread and service thread to avoid/fix some race conditions
* Converted several structs to be OBJs so that we can use reference counting to know when to invoke destructors
* Make some new OBJs have opal_list_item_t as their base, thereby eliminating the need for the local list_item_t type
* Renamed many variables to be internally consistent
* Centralize the decision as to whether this process or the remote process is supposed to be the initiator in an inline function (see the sketch after this list)
* Add oodles of OPAL_OUTPUT statements for debugging (hard-wired to output stream -1; to be activated by developers if they want/need them)
* Use rdma_create_qp() instead of ibv_create_qp()
* openib fd monitoring improvements:
* Renamed a bunch of functions and variables to be a little more obvious as to their true function
* Use pipes to communicate between main thread and service thread
* Add ability for main thread to invoke a function back on the service thread
* Ensure that initiator_depth and responder_resources are set properly by putting max_qp_rd_atom and max_qp_init_rd_atom in the modex (see rdma_connect(3))
* Ensure that the source IP address is set in rdma_resolve() so that we select the correct OpenFabrics source port
* Make new MCA param: openib_btl_connect_rdmacm_resolve_timeout
* Other improvements:
* btl_openib_device_type MCA param: can be "iw" or "ib" or "all" (or "infiniband" or "iwarp")
* Somewhat improved error handling
* Bunches of spelling fixes in comments, VERBOSE, and OUTPUT statements
* Oodles of little coding style fixes
* Changed shutdown ordering of btl; the device is now an OBJ with ref counting for destruction
* Added some more show_help error messages
* Change configury to only build IBCM / RDMACM if we have threads (because we need a progress thread)
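A hedged sketch of the centralized initiator decision mentioned above; the tie-breaking inputs shown (IP address and port) are illustrative, not necessarily what the RDMA CM CPC actually compares:

    #include <stdbool.h>
    #include <stdint.h>

    /* Illustrative only: both sides must deterministically agree on who
     * initiates, so compare a totally ordered pair of peer attributes. */
    static inline bool i_am_initiator(uint32_t my_ip, uint16_t my_port,
                                      uint32_t peer_ip, uint16_t peer_port)
    {
        if (my_ip != peer_ip) {
            return my_ip > peer_ip;  /* higher address initiates        */
        }
        return my_port > peer_port;  /* same host: break the tie on port */
    }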
This commit was SVN r19686.
The following Trac tickets were found above:
Ticket 1210 --> https://svn.open-mpi.org/trac/ompi/ticket/1210