Lenny and I went back and forth on whether we should simply register
another "mpi_paffinity_alone" MCA param and then try to figure out
which one was set in ompi_mpi_init, but there was difficulty in
figuring out what to do. So it seemed like the Right Thing to do was
to implement what was committed in r18770; then we could tell where
MCA parameters were set from and you could do Better Things (this is
also useful in the openib BTL, where parameters can be set either via
MCA parameter or via an INI file).
But after that was done, it seemed only a few steps further to
actually implement two new features in the MCA params area:
* Synonyms (where one MCA param name is a synonym for another)
* Allow MCA params and/or their synonyms to be marked as "deprecated"
(printing out warnings if they are used)
These features have actually long been discussed/desired, and I had
some time in airports and airplanes recently where I could work in
this stuff on a standalone laptop. So I did it. :-)
This commit introduces these two new features, and then uses them to
register mpi_paffinity_alone as a non-deprecated synonym for
opal_paffinity_alone. A few other random points in this commit:
* Add a few error checks for conditions that were not checked before
* Correct some comments in mca_base_params.h
* Add a few comments in strategic places
* ompi_info now prints additional information:
* for any MCA parameter that has synonyms, it lists all the
synonyms
* synonyms are also output as 1st-class MCA params, but with an
additional attribute indicating that they have a "parent"
* all MCA param name (both "real" or "synonym") will output an
attribute indicating whether it is deprecated or not. A synonym
is deprecated if it iself is marked as deprecated (via the
mca_base_param_regist_syn() or mca_base_param_register_syn_name()
functions) or if its "parent" MCA parameter is deprecated
This commit was SVN r18859.
The following SVN revision numbers were found above:
r18770 --> open-mpi/ompi@8efe67e08c
The following Trac tickets were found above:
Ticket 1383 --> https://svn.open-mpi.org/trac/ompi/ticket/1383
first to the trunk. So, here is the trunk checkin:
The call to orte_show_help() to notify truncation of the max_inline value
was missing the want_error_header boolean, which eventually results in
a SEGV. This change corrects the call with the bool set to true.
This commit was SVN r18839.
These are mostly long additions to comments to document what is going on and why, and how/where it may be revised in the future. Just a couple of small, but important, changes to the code itself.
This commit was SVN r18827.
see how the next gen panasas stuff does in terms of warnings; we can
always re-merge this later if we want to. It's just easier if we have
as little OMPI-specific code as possible (particularly when we know
that the panasas code has some big changes coming).
This commit was SVN r18823.
The following SVN revision numbers were found above:
r17543 --> open-mpi/ompi@b4ec81a9fd
already, and we're just about to do a ROMIO version refresh -- so the
less OMPI-specific code we have (e.g., indenting and whatnot), the
better.
Refs trac:1370.
This commit was SVN r18821.
The following SVN revision numbers were found above:
r16691 --> open-mpi/ompi@8dca19cb3b
r16693 --> open-mpi/ompi@037a533752
The following Trac tickets were found above:
Ticket 1370 --> https://svn.open-mpi.org/trac/ompi/ticket/1370
ROMIO in Open MPI (the new version of ROMIO will make this patch
defunct, and David Daniel has confirmed that no one at LANL is using
this functionality, anyway).
Refs trac:1370.
This commit was SVN r18819.
The following Trac tickets were found above:
Ticket 1370 --> https://svn.open-mpi.org/trac/ompi/ticket/1370
hierarch disables itself now if the pml module used is *not* ob1. The reason
is, that the multi-level hierarchy detection algorithm checks the names of the
btl modules used. In case there are no btl's, we would segfault.
Furthermore, three minor changes:
- the 2-level hierarchy detection is now the default (sm vs. everything else
in the world).
- add udapl to the list of protocols checked for by the multi-level hierarch detection
- some of the verbose statements of hierarch were inaccurate. Fixed those comments/messages.
This commit was SVN r18817.
With help from Brian, modify the ompi/proc/proc.c code to be more thread-safe. Remove the list operations from the ompi_proc_t constructor and destructor. Insert list appends to ompi_proc_init and ompi_proc_find_and_add as required, and protect those with thread locks. Let only the ompi_proc_finalize function actually remove objects from the ompi_proc_list.
Cleanup a few places where functions might return without unlocking a thread. Ensure the ompi_proc_world also does an OBJ_RETAIN so that the reference count on any subsequently released object is correct.
This commit was SVN r18816.
1. repair of the linear and direct routed modules
2. repair of the ompi/pubsub/orte module to correctly init routes to the ompi-server, and correctly handle failure to correctly parse the provided ompi-server URI
3. modification of orterun to accept both "file" and "FILE" for designating where the ompi-server URI is to be found - purely a convenience feature
4. resolution of a message ordering problem during the connect/accept handshake that allowed the "send-first" proc to attempt to send to the "recv-first" proc before the HNP had actually updated its routes.
Let this be a further reminder to all - message ordering is NOT guaranteed in the OOB
5. Repair the ompi/dpm/orte module to correctly init routes during connect/accept.
Reminder to all: messages sent to procs in another job family (i.e., started by a different mpirun) are ALWAYS routed through the respective HNPs. As per the comments in orte/routed, this is REQUIRED to maintain connect/accept (where only the root proc on each side is capable of init'ing the routes), allow communication between mpirun's using different routing modules, and to minimize connections on tools such as ompi-server. It is all taken care of "under the covers" by the OOB to ensure that a route back to the sender is maintained, even when the different mpirun's are using different routed modules.
6. corrections in the orte/odls to ensure proper identification of daemons participating in a dynamic launch
7. corrections in build/nidmap to support update of an existing nidmap during dynamic launch
8. corrected implementation of the update_arch function in the ESS, along with consolidation of a number of ESS operations into base functions for easier maintenance. The ability to support info from multiple jobs was added, although we don't currently do so - this will come later to support further fault recovery strategies
9. minor updates to several functions to remove unnecessary and/or no longer used variables and envar's, add some debugging output, etc.
10. addition of a new macro ORTE_PROC_IS_DAEMON that resolves to true if the provided proc is a daemon
There is still more cleanup to be done for efficiency, but this at least works.
Tested on single-node Mac, multi-node SLURM via odin. Tests included connect/accept, publish/lookup/unpublish, comm_spawn, comm_spawn_multiple, and singleton comm_spawn.
Fixes ticket #1256
This commit was SVN r18804.
* Move the passive side QP move to RTS to before we send the reply
(vs. sending it after we get the RTU). A lengthy comment explains
the need for this.
* Add some timers to the code for analyzing where time is spent.
* Clarify a few error messages.
* Currently have a loop around ib_cm_listen() because sometimes it
fails for seeminly no reason. Have pending e-mails in to Sean
Hefty to see if we can figure out why this is happening. Note that
the more MPI processes you add, the more likely this error is to
occur (e.g., ran 720 processes and it happens at least 50% of the
time). This makes IBCM somewhat unattractive for general use;
hopefully we can get a fix...
This commit was SVN r18766.
endpoint.c) because it's almost identical in all the CPC's. OOB,
XOOB, and IBCM now all invoke the btl error handler properly if
there's an error during wireup. RDMACM still needs to be done.
This commit was SVN r18764.
The following SVN revision numbers were found above:
r18762 --> open-mpi/ompi@3eda04578f
upper level btl (and therefore the PML) when something goes wrong
during wireup.
Refs trac:1283.
This commit was SVN r18762.
The following Trac tickets were found above:
Ticket 1283 --> https://svn.open-mpi.org/trac/ompi/ticket/1283
* Properly handle non-symmetric subnet ID's
* Be a bit more stringent when checking for the GID
* Add lots of BTL_VERBOSE's for diagnostics
This commit was SVN r18754.
The issue is that the field mca_topo_base_comm_t->mtc_periods_or_edges
has a different length, depending on whether the communicator is a
graph or a cart. One of the comm dup functions always assumed that it
was the length required by graph comms, which could lead to badness in
some cases. This commit makes the legnth of that field on a comm dup
be the proper length and copies the data over appropriately.
I also changed the syntax of the ompi_comm_copy_topo() function to use
shorter pointer notation; it made the code much easier to read and
fix.
This commit was SVN r18752.
The following Trac tickets were found above:
Ticket 1345 --> https://svn.open-mpi.org/trac/ompi/ticket/1345
as the send and the put share the same execution path and in the case of the
put the MCA_BTL_DES_SEND_ALWAYS_CALLBACK was not set.
This commit was SVN r18735.
ompi_info will crash due to a reference to a unalloced struct. Add a check for
this struct to prevent this crash.
This commit was SVN r18727.
The following SVN revision numbers were found above:
r18725 --> open-mpi/ompi@c2351d39f1
lists of flags for exec. ' ' is a magic value that means match all
white space, and trim leading / trailing whitespace. Will prevent
many spurious arguments to underlying compiler.
This commit was SVN r18726.
If rdmacm cpc is disabled and an iwarp device is forced to be used, then
mca_btl_openib_get_iwarp_subnet_id will attempt to reference an uninitalized
list. Modify the code to check for the uninitalized list and return zero.
Also, fix a memory leak caused by not freeing alloced addresses structs during
shutdown.
This commit was SVN r18725.
the fragments that failed to be send, there is no need to replicate the same
mechanism in the BTL.
Force the SM BTL to empty all ack fragments in the component progress function.
This commit was SVN r18724.
INI file use_eager_rdma value (fixes trac:1169)
* fixed a typo in a MCA param help message
* made the check for enabling short/eager RDMA more robust in the
presence of progress threads; it now emits a show_help warning
This commit was SVN r18723.
The following Trac tickets were found above:
Ticket 1169 --> https://svn.open-mpi.org/trac/ompi/ticket/1169
specified, probe for max value supported by device.
This commit was SVN r18720.
The following Trac tickets were found above:
Ticket 1355 --> https://svn.open-mpi.org/trac/ompi/ticket/1355
of strings. We mostly did the Right Things already; I simplified the
code a bit and also had us not write to more characters in the C
bindings than we're supposed to (per language in the MPI-2.1 spec).
Fixes trac:1238.
This commit was SVN r18705.
The following Trac tickets were found above:
Ticket 1238 --> https://svn.open-mpi.org/trac/ompi/ticket/1238