The bug was a race condition in the barrier operation that caused the barrier in MPI_Finalize to fail on very short programs.
Scalability was improved by using the daemons to aggregate modex and barrier messages before sending them to the rank=0 proc. Improvement is proportional to ppn, of course, but there really wasn't a scaling problem at low ppn anyway. This modification also paves the way for better allgather operations, since all the data for each node now sits at the daemon level and the daemons are now aware that a collective operation on the OOB is underway (so they -can- participate in a collective of their own to support it).
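As a rough illustration of why the gain scales with ppn, the count of messages arriving at rank 0 can be sketched as below. This assumes one message per process without aggregation and one per daemon (one daemon per node) with it; the function names are hypothetical and the real message pattern may differ.

```c
/* Messages arriving at rank 0 when every process sends its own message. */
static long msgs_direct(long nodes, long ppn)
{
    return nodes * ppn;
}

/* Messages arriving at rank 0 when each node's daemon forwards a single
 * aggregated message (assumes exactly one daemon per node). */
static long msgs_aggregated(long nodes)
{
    return nodes;
}
```

Under these assumptions, 128 nodes at 8 ppn drop from 1024 inbound messages to 128, a factor equal to ppn.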
Also added better diagnostics to map out the timing associated with MPI_Init - turned on by -mca orte_timing 1.
This commit was SVN r17988.
"all", not just the first 3 chars (i.e., if someone sets the value
"allfoo", we should still error).
This commit was SVN r17981.
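The whole-string comparison described in r17981 can be sketched as follows. The helper name is hypothetical and this is not the actual parameter-checking code, only an illustration of exact matching versus a 3-character prefix match.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper: true only when the value is exactly "all". */
static bool value_is_all(const char *value)
{
    /* strncmp(value, "all", 3) would also accept "allfoo"; strcmp compares
     * the full string, including the terminating NUL. */
    return value != NULL && strcmp(value, "all") == 0;
}
```

With this check, "allfoo" is rejected and can be reported as an error by the caller.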
The following SVN revision numbers were found above:
r17980 --> open-mpi/ompi@b3ef774d46
r17956 broke the ability for the user to override the 'opal_event_include'
parameter. This commit checks to see if the user specified a value before
forcing the "all" value on the event engine.
This commit fixes Checkpoint/Restart support in the trunk which requires
this feature.
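A minimal sketch of the override check described above, assuming the user's setting is available as a string that is NULL or empty when unset. The helper name is hypothetical and is not the actual event-engine code.

```c
#include <stddef.h>

/* Hypothetical helper: pick the value to hand to the event engine. */
static const char *choose_event_include(const char *user_value)
{
    /* Honor an explicit user setting; force "all" only when none was given. */
    if (user_value != NULL && user_value[0] != '\0') {
        return user_value;
    }
    return "all";
}
```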
This commit was SVN r17980.
The following SVN revision numbers were found above:
r17956 --> open-mpi/ompi@763218e754
orte_proc_info_finalize properly so the 'init' flag is set on restart.
This is a bit cleaner anyway, especially since the GPR is gone.
This commit was SVN r17978.
The following SVN revision numbers were found above:
r17944 --> open-mpi/ompi@ec76fe4fe4
Event uninit_use: Using uninitialized value "rc"
Instead of initializing rc in the beginning, rather use return value
of opal_hash_table_set_value_uint32.
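The pattern can be sketched as below. The table and sketch_set_value are hypothetical stand-ins for the real hash table and opal_hash_table_set_value_uint32, and do not reproduce the real signature; the point is only that rc takes its value from the call instead of being read uninitialized.

```c
#include <stdint.h>

#define SKETCH_SUCCESS 0
#define SKETCH_ERROR (-1)
#define SKETCH_TABLE_SIZE 8

/* Hypothetical fixed-size table standing in for the real hash table. */
static void *sketch_table[SKETCH_TABLE_SIZE];

/* Hypothetical setter: stores value under key, or reports an error. */
static int sketch_set_value(uint32_t key, void *value)
{
    if (key >= SKETCH_TABLE_SIZE) {
        return SKETCH_ERROR;
    }
    sketch_table[key] = value;
    return SKETCH_SUCCESS;
}

static int sketch_register(uint32_t key, void *value)
{
    /* rc is assigned from the call itself, so no path returns it unset. */
    int rc = sketch_set_value(key, value);
    return rc;
}
```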
This commit was SVN r17976.
Event var_deref_model: Variable "array_of_integers" tracked as NULL was
passed to a function that dereferences it. [model]
The arrays passed down to type_get_contents may be NULL only if the corresponding max_* is 0...
If the max_* parameter does not fit, an error is returned, anyhow.
One could improve the checks of MPI_PARAM_CHECK, but to be on the
safe side, fix in dt_args.c.
This commit was SVN r17974.
Event var_deref_op: Variable "requests" tracked as NULL was
dereferenced.
Only check requests[i] for NULL if requests itself is not NULL.
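A minimal sketch of the guarded check. The request type and function name are hypothetical, and treating a NULL array as valid only when count is 0 is an assumption, not something the commit states.

```c
#include <stddef.h>

/* Hypothetical stand-in for a request handle. */
typedef struct { int id; } sketch_request_t;

/* Returns 1 when the arguments are acceptable, 0 otherwise. */
static int sketch_requests_valid(int count, sketch_request_t **requests)
{
    if (count < 0) {
        return 0;
    }
    if (requests == NULL) {
        /* Never index a NULL array; assume it is valid only when empty. */
        return count == 0;
    }
    for (int i = 0; i < count; ++i) {
        if (requests[i] == NULL) {
            return 0;
        }
    }
    return 1;
}
```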
This commit was SVN r17973.
Event uninit_use_in_call: Using uninitialized value "tag" in call to
function "(ompi_dpm).connect_accept" and others
The tag is set and used in get_rport only on root...
This commit was SVN r17972.
Specifically, add two new APIs:
1. lost_route: allows the OOB to report that a connection has failed, thereby giving the routed module an opportunity to respond appropriately to its topology. Creating the API also allows each routed component to hold its own definition of "lifeline" - in some cases, this may be a single connection, but in others it may be multiple connections. Some modules may choose to re-route messaging if the lifeline or any other connection is lost, while others may choose to abort the job.
Both the tree and unity modules retain the current behavior and abort the job if the lifeline connection is lost, while ignoring other lost connections.
2. get_wireup_info: returns (in a provided buffer) the info required to wire up connections for the specified job. Some routed modules do not need to return any info as they can wire up via alternative means, while some need to exchange data with their peers. If info is inserted into the buffer, the plm_base_launch_apps function will xcast the contents to the specified job.
The commit also removes the "lifeline" entry from the orte_process_info struct (and the associated ORTE_PROC_MY_LIFELINE definition) as the lifeline info is now contained within the respective routed module.
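The shape of the two entry points can be sketched as a table of function pointers. Every type, name, and signature below is hypothetical and chosen for illustration; none is the actual routed framework interface. The module shown follows the tree/unity behavior described above (abort on a lost lifeline, ignore other lost connections) and assumes, for the sake of the example, that it has no wireup info to return.

```c
#include <stdbool.h>
#include <stddef.h>

typedef struct { int jobid; int vpid; } sketch_name_t;       /* hypothetical */
typedef struct { char data[64]; size_t len; } sketch_buf_t;  /* hypothetical */

typedef enum { SKETCH_IGNORE, SKETCH_ABORT_JOB } sketch_action_t;

/* Hypothetical module interface holding the two new entry points. */
typedef struct {
    sketch_action_t (*lost_route)(const sketch_name_t *peer);
    int (*get_wireup_info)(int jobid, sketch_buf_t *buf);
} sketch_routed_module_t;

/* The lifeline is private to the module instead of living in a global
 * process-info struct. */
static sketch_name_t sketch_lifeline = { 0, 0 };

static sketch_action_t tree_lost_route(const sketch_name_t *peer)
{
    bool is_lifeline = peer->jobid == sketch_lifeline.jobid &&
                       peer->vpid == sketch_lifeline.vpid;
    return is_lifeline ? SKETCH_ABORT_JOB : SKETCH_IGNORE;
}

static int tree_get_wireup_info(int jobid, sketch_buf_t *buf)
{
    (void)jobid;
    buf->len = 0;  /* empty buffer: the caller has nothing to xcast */
    return 0;
}

static const sketch_routed_module_t sketch_tree_module = {
    tree_lost_route,
    tree_get_wireup_info
};
```

A module with several lifeline connections, or one that re-routes instead of aborting, would only need a different lost_route implementation behind the same table.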
This commit was SVN r17969.
mechanisms (such as epoll) if someone (ompi_mpi_init()) requests
otherwise. See big comment in opal/event/event.c for a full
explanation.
This commit was SVN r17956.
Clarify the setting of send_first in the MPI bindings (trivial, I know, but helpful).
Remove the extra xcast of child contact info to the parent job.
This commit was SVN r17952.
Reorganize the grpcomm code a little to provide support for soon-to-come new grpcomm components. The revised organization puts what will be common code elements in the base to avoid duplication, while allowing components that don't need those functions to ignore them.
This commit was SVN r17941.
initialized. For example, there is a period of time during
ompi_mpi_init when orte_initialized==true, but
ompi_mpi_initialized==false (and therefore communicators are not setup
yet, etc.).
This commit was SVN r17937.