Provide support for four MPIR extensions that allow specification of debugger daemon executable, argv for the debugger daemon, whether or not to forward debugger daemon IO, and whether or not debugger daemon will piggy-back on ORTE OOB network. Last is not yet implemented.
No change in behavior or operation occurs unless (a) the debugger specifically utilizes the extensions and, for co-locate while running, the user specifically enables the capability via an MCA param. Two of the MPIR extensions supported here are used in a widely-used debugger for a large-scale installation. The other two extensions are new and being utilized in prototype work by several debuggers for possible future release.
This commit was SVN r19275.
* Make the creation of the build dir for the man pages a bit more
robust (thanks to suggestions from Ralf W.).
* Only distribute the .Xin files, not the .X man pages themselves.
* Make the .X files depend on opal_config.h so that if you re-run
configure and change opal_config.h (e.g., a new version), the man
pages should get rebuilt.
* Man pages are now cleaned with "distclean", not "maintainer-clean".
* Fix a typo in opal_crs.7in.
* Udpate make_dist_tarball to update "date" in the VERSION file.
* Make make_dist_tarball a bit friendlier to hg checkouts.
This commit was SVN r19219.
This needs some soak time to ensure we haven't opened any race conditions. I tried to loop everything in the shutdown procedure through that trigger event call to ensure it all goes through the one-time locks as it did before so that someone hitting ctrl-c when we are already shutting down shouldn't cause problems. Just want to let people use it for awhile to verify.
This commit was SVN r19159.
During the discussion of MPI-2 functionality, it was pointed out by Aurelien that there was an inherent race condition between startup of ompi-server and mpirun. Specifically, if someone started ompi-server to run in the background as part of a script, and then immediately executed mpirun, it was possible that an MPI proc could attempt to contact the server (or that mpirun could try to read the server's contact file before the server is running and ready.
At that time, we discussed createing a new tool "ompi-wait-server" that would wait for the server to be running, and/or probe to see if it is running and return true/false. However, rather than create yet another tool, it seemed just as effective to add the functionality to mpirun.
Thus, this commit creates two new mpirun cmd line flags (hey, you can never have too many!):
--wait-for-server : instructs mpirun to ping the server to see if it responds. This causes mpirun to execute an rml.ping to the server's URI with an appropriate timeout interval - if the ping isn't successful, mpirun attempts it again.
--server-wait-time xx : sets the ping timeout interval to xx seconds. Note that mpirun will attempt to ping the server twice with this timeout, so we actually wait for twice this time. Default is 10 seconds, which should be plenty of time.
This has only lightly been tested. It works if the server is present, and outputs a nice error message if it cannot be contacted. I have not tested the race condition case.
This commit was SVN r19152.
versions, dates and build names.
Fixes trac:1387
Big thanks to Jeff and Brian for help and oversight.
This commit was SVN r19120.
The following Trac tickets were found above:
Ticket 1387 --> https://svn.open-mpi.org/trac/ompi/ticket/1387
set when it launches under debuggers using the --debug option.
This commit was SVN r19116.
The following Trac tickets were found above:
Ticket 1361 --> https://svn.open-mpi.org/trac/ompi/ticket/1361
Standardize the handling of the orte launch agent option across PLMs. This has been a consistent complaint I have received - each PLM would register its own MCA param to get input on the launch agent for remote nodes (in fact, one or two didn't, but most did). This would then get handled in various and contradictory ways.
Some PLMs would accept only a one-word input. Others accepted multi-word args such as "valgrind orted", but then some would error by putting any prefix specified on the cmd line in front of the incorrect argument.
For example, while using the rsh launcher, if you specified "valgrind orted" as your launch agent and had "--prefix foo" on you cmd line, you would attempt to execute "ssh foo/valgrind orted" - which obviously wouldn't work.
This was all -very- confusing to users, who had to know which PLM was being used so they could even set the right mca param in the first place! And since we don't warn about non-recognized or non-used mca params, half of the time they would wind up not doing what they thought they were telling us to do.
To solve this problem, we did the following:
1. removed all mca params from the individual plms for the launch agent
2. added a new mca param "orte_launch_agent" for this purpose. To further simplify for users, this comes with a new cmd line option "--launch-agent" that can take a multi-word string argument. The value of the param defaults to "orted".
3. added a PLM base function that processes the orte_launch_agent value and adds the contents to a provided argv array. This can subsequently be harvested at-will to handle multi-word values
4. modified the PLMs to use this new function. All the PLMs except for the rsh PLM required very minor change - just called the function and moved on. The rsh PLM required much larger changes as - because of the rsh/ssh cmd line limitations - we had to correctly prepend any provided prefix to the correct argv entry.
5. added a new opal_argv_join_range function that allows the caller to "join" argv entries between two specified indices
Please let me know of any problems. I tried to make this as clean as possible, but cannot compile all PLMs to ensure all is correct.
This commit was SVN r19097.
--debug flag to help developers figure out possible future issues.
This fixes trac:1335.
This commit was SVN r18979.
The following Trac tickets were found above:
Ticket 1335 --> https://svn.open-mpi.org/trac/ompi/ticket/1335
Add comments to both orterun and orted code explaining why we take a snapshot of the local environment and apply it to the local procs when they are spawned.
This commit was SVN r18842.
Actually, the problem was that we were simply -adding- any enviro MCA params to whatever had been found on the cmd line. Thus, duplicate MCA param directives were winding up duplicated in the environment. Some shells took the first one in the environ array - others took the last! So we could get completely different behavior based on the whims of the shell.
This commit fixes trac:1373
This commit was SVN r18836.
The following Trac tickets were found above:
Ticket 1373 --> https://svn.open-mpi.org/trac/ompi/ticket/1373
1. repair of the linear and direct routed modules
2. repair of the ompi/pubsub/orte module to correctly init routes to the ompi-server, and correctly handle failure to correctly parse the provided ompi-server URI
3. modification of orterun to accept both "file" and "FILE" for designating where the ompi-server URI is to be found - purely a convenience feature
4. resolution of a message ordering problem during the connect/accept handshake that allowed the "send-first" proc to attempt to send to the "recv-first" proc before the HNP had actually updated its routes.
Let this be a further reminder to all - message ordering is NOT guaranteed in the OOB
5. Repair the ompi/dpm/orte module to correctly init routes during connect/accept.
Reminder to all: messages sent to procs in another job family (i.e., started by a different mpirun) are ALWAYS routed through the respective HNPs. As per the comments in orte/routed, this is REQUIRED to maintain connect/accept (where only the root proc on each side is capable of init'ing the routes), allow communication between mpirun's using different routing modules, and to minimize connections on tools such as ompi-server. It is all taken care of "under the covers" by the OOB to ensure that a route back to the sender is maintained, even when the different mpirun's are using different routed modules.
6. corrections in the orte/odls to ensure proper identification of daemons participating in a dynamic launch
7. corrections in build/nidmap to support update of an existing nidmap during dynamic launch
8. corrected implementation of the update_arch function in the ESS, along with consolidation of a number of ESS operations into base functions for easier maintenance. The ability to support info from multiple jobs was added, although we don't currently do so - this will come later to support further fault recovery strategies
9. minor updates to several functions to remove unnecessary and/or no longer used variables and envar's, add some debugging output, etc.
10. addition of a new macro ORTE_PROC_IS_DAEMON that resolves to true if the provided proc is a daemon
There is still more cleanup to be done for efficiency, but this at least works.
Tested on single-node Mac, multi-node SLURM via odin. Tests included connect/accept, publish/lookup/unpublish, comm_spawn, comm_spawn_multiple, and singleton comm_spawn.
Fixes ticket #1256
This commit was SVN r18804.
Add ability for sys admins to prohibit putting session directories under specified locations. Thus, they can now protect parallel file systems from foolish user mistakes.
This commit was SVN r18721.
individual orte/tools/*/Makefile.am files. This causes "make" to
travese into every directory, even if it's not going to build anything
in that directory (which is a good thing). It also helps cleanup and
dist issues.
This also affects orte-checkpoint and orte-restart, but I couldn't get
--with-ft to compile properly; I'll pass along a heads-up to Josh to
ensure that I didn't break anything.
This commit was SVN r18680.
Ensure that routes to remote procs are set on the HNP before completing launch so that the debugger message can be sent. Solves a race condition that can exist in those environments where the HNP does not have local procs.
This commit was SVN r18674.
Some minor changes to help facilitate debugger support so that both mpirun and yod can operate with it. Still to be completed.
This commit was SVN r18664.
is still to use the C based wrapper compilers (which have many more features
and are more well tested). The Perl compilers are enabled with the option
--enable-script-wrapper-compilers, which also ignores the option
--disable-binaries (ie --enable-script-wrapper-compilers --disable-binaries
will result in perl-based wrapper compilers being installed, but no other
binaries being installed).
This commit was SVN r18655.
This allows r18645 to fix the memory corruption issue, but also allows us to resolve the memory leaks cited by CID 1039
This commit was SVN r18646.
The following SVN revision numbers were found above:
r18645 --> open-mpi/ompi@53d83ba1c5
This commit repairs the debugger initialization procedure. I am not closing the ticket, however, pending Jeff's review of how it interfaces to the ompi_debugger code he implemented. There were duplicate symbols being created in that code, but not used anywhere. I replaced them with the ORTE-created symbols instead. However, since they aren't used anywhere, I have no way of checking to ensure I didn't break something.
So the ticket can be checked by Jeff when he returns from vacation... :-)
This commit was SVN r18625.
The following Trac tickets were found above:
Ticket 1255 --> https://svn.open-mpi.org/trac/ompi/ticket/1255
After much work by Jeff and myself, and quite a lot of discussion, it has become clear that we simply cannot resolve the infinite loops caused by RML-involved subsystems calling orte_output. The original rationale for the change to orte_output has also been reduced by shifting the output of XML-formatted vs human readable messages to an alternative approach.
I have globally replaced the orte_output/ORTE_OUTPUT calls in the code base, as well as the corresponding .h file name. I have test compiled and run this on the various environments within my reach, so hopefully this will prove minimally disruptive.
This commit was SVN r18619.
orte-checkpoint/orte-restart seem to not seem to totally like orte_output so revert them to opal_output for now. Since we have no need for the additional complexity of orte_output we can drop it for now and revisit this if anyone needs it later.
It seems that if you set the verbose level on an output handle then try to call a normal orte_output() on it then the message will *not* be printed. This is the same for opal_output, and seems incorrect to me because it stops some error messages from being printed out if you do not directly specify opal_output(0, ...). Maybe someone should take a look a this.
orte-checkpoint would segv if passed an incorrect PID. Fixed the return code so it errors out properly.
Thanks to Eric Roman for bringing this to my attention.
This commit was SVN r18583.
Also detect orted failed-to-start by setting timeout on launch. Currently only used in TM launcher.
Neither detection is enabled by default, but are only active if heartrate is set and/or launch timeout is set. Exception for SLURM as orted failure is always detected and reported.
More info to come on devel list.
This commit was SVN r18555.
1. it depends upon the ability of the native environment to alert us that the orted has died/failed to start. I have included that support for SLURM, but other environments need to be done.
2. for some yet-to-be-determined reason, the message that tells the remaining daemons to "die" isn't getting out of the RML, even though no obvious blockage is standing in the way. Work will continue on resolving that problem. For now, the orteds appear to be exiting on their own quite nicely when they see their HNP "lifeline" disappear.
This represents the best-available fix for ticket #221 so I am closing that ticket at this time.
This commit was SVN r18536.