In order to have an effect, ibv_fork_init should be called in the
beginning of the verbs initialization flow - before the calls to the
ibv_create_qp and ibv_create_cq verbs.
These functions are called from the oob/ud code and by the time the
other verbs components (btl openib, pml yalla, ...) call ibv_fork_init,
it's too late. This commit forces the call to ibv_fork_init (if it's
requested) right at the beginning of all the components that are using
verbs.
(ibv_fork_init() can be safely called multiple times)
This commit also removes the btl_openib_want_fork_support mca parameter
and adds a new mca parameter instead - opal_verbs_want_fork_support.
Through this new parameter, fork support may be requested for ALL
components.
The default value for this parameter is set to 1.
Before this commit the btl_openib_want_fork_support parameter didn't
provide fork support for the openib btl if its value was set to 1.
(because when openib called ibv_fork_init, it was already after the
calls to ibv_create_* in oob/ud and thereofre it failed).
Please verify your components have been updated correctly. Keep in
mind that in terms of threading:
OPAL_FREE_LIST_GET -> opal_free_list_get_st
OPAL_FREE_LIST_RETURN -> opal_free_list_return_st
I used the opal_using_threads() variant anytime it appeared multiple
threads could be operating on the free list. If this is not the case
update to _st. If multiple threads are always in use change to _mt.
This commit adds an owner file in each of the component directories
for each framework. This allows for a simple script to parse
the contents of the files and generate, among other things, tables
to be used on the project's wiki page. Currently there are two
"fields" in the file, an owner and a status. A tool to parse
the files and generate tables for the wiki page will be added
in a subsequent commit.
The rml/oob was not doing sanity checks on the input peer
parameter for the orte_rml_oob_send_nb and orte_rml_oob_send_buffer_nd.
Owing to the fact that there are places in the ompi/orte stack
where things like orte_show_help_norender are called way before
ORTE_PROC_MY_HNP, are setup properly, all kinds of weird
startup failures can occur as the rml/oob tries to process send
requests where the peer is junk.
Rather than try to expand this kind of thing:
/* if we are the HNP, or the RML has not yet been setup,
* or ROUTED has not been setup,
* or we weren't given an HNP, or we are running in standalone
* mode, then all we can do is process this locally
*/
if (ORTE_PROC_IS_HNP || orte_standalone_operation ||
NULL == orte_rml.send_buffer_nb ||
NULL == orte_routed.get_route ||
NULL == orte_process_info.my_hnp_uri) {
rc = show_help(filename, topic, output, ORTE_PROC_MY_NAME);
}
do the right thing in the rml level and return an error rather than
eventually failing in the send owing to peer not being valid.
Need to check if the alps odls component has already
read the rdma creds from alps. Its okay to ask apshepherd
multiple times for rdma creds, but opal_setenv gets
a bit picky about this. Rather than check for the OPAL_EXISTS
return value from opal_setenv, for now just check with
a static variable whether or not orte_odls_alps_get_rdma_creds
has already been successfully called before.
Would be nice to have an opal_getenv function for checking
if an env. variable had already been set by opal_putenv.