Commit graph

1735 commits

Author SHA1 Message Date
Ralph Castain
0fa9d88009 Set $PWD for the application proc to match the cwd. If the user specifies a working dir via -wdir, this ensures that the environment variable matches what they get from getcwd. Note that any subsequent calls to chdir in the user's program will break that equivalence - we can only ensure it starts out matching!
This commit was SVN r18709.
2008-06-23 18:25:41 +00:00
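The equivalence described in r18709 can be illustrated outside Open MPI. The Python sketch below is not the ORTE code: `enter_workdir` is a hypothetical helper that changes directory and sets PWD to match, and the later chdir shows how the two values drift apart once the program moves on its own.

```python
import os
import tempfile


def enter_workdir(wdir):
    """Change to wdir and set PWD to what getcwd() reports (hypothetical helper)."""
    os.chdir(wdir)
    os.environ["PWD"] = os.getcwd()


if __name__ == "__main__":
    start = os.getcwd()
    with tempfile.TemporaryDirectory() as wdir:
        enter_workdir(wdir)
        # Right after setup, the environment variable and getcwd() agree.
        print(os.environ["PWD"] == os.getcwd())
        # A later chdir by the program does not update PWD, so they diverge.
        os.chdir(start)
        print(os.environ["PWD"] == os.getcwd())
```

PWD is set from getcwd() rather than from the requested path, so a symlinked working directory still compares equal.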
Ralph Castain
acbcbb81b5 Add some debugging output to the modex set_proc_attr function to see what is being added to the modex
This commit was SVN r18708.
2008-06-23 18:24:08 +00:00
Jeff Squyres
24c3aa1d77 Really fix "make dist". Really.
This commit was SVN r18704.
2008-06-21 18:04:38 +00:00
Jeff Squyres
930667ac73 Ensure that orte-checkpoint and orte-restart man pages are always
included in the distribution tarball.  This ''appears'' to be an
Automake bug -- I have submitted a bug report to the bug-automake list:

http://lists.gnu.org/archive/html/bug-automake/2008-06/msg00019.html

This commit was SVN r18696.
2008-06-20 18:19:01 +00:00
Ralph Castain
c693d3a5d1 I hadn't honestly considered before that an MPI process might attempt to call functions in the routed framework intended solely for daemons and HNPs. By design, MPI processes are not allowed to route RML/OOB messages, and hence the routed module in an MPI process has no knowledge whatsoever of how a message will reach its destination (except in the direct module). Thus, it has no way to return a valid routing tree, update a routing tree, or get wireup info.
This commit ensures that attempts to access information that is unknowable or undefined return appropriate invalid or not_supported values to avoid unexpected behavior and/or segfaults.

This commit was SVN r18692.
2008-06-20 03:26:13 +00:00
Ralph Castain
5ebe10ebf1 Fix a bad typo - need to look at the node array as the arch array hasn't been built yet
This commit was SVN r18689.
2008-06-19 21:34:39 +00:00
Ralph Castain
174b9f1482 Ensure this module works in heterogeneous environments.
Note: this module is under development, which is why it is not set as the default. Use at your own risk!

This commit was SVN r18688.
2008-06-19 19:40:47 +00:00
Ralph Castain
26c9ad5799 Clean-up the DSS API to remove two functions that are supposed to be used solely internally to the DSS. These were likely exposed because we need to call them when packing/unpacking declared types, but this means that developers may accidentally use the wrong functions, causing the DSS buffer to get confused. Instead, return the system to the way it used to work and hide those functions.
This commit was SVN r18684.
2008-06-19 18:46:25 +00:00
Josh Hursey
b78ae13bf3 Add back a missing header taken away in r18664
This commit was SVN r18682.

The following SVN revision numbers were found above:
  r18664 --> open-mpi/ompi@0532d799d6
2008-06-19 16:08:27 +00:00
Jeff Squyres
e4172a3c44 Shift the AM "if" logic down from orte/tools/Makefile.am to the
individual orte/tools/*/Makefile.am files.  This causes "make" to
traverse into every directory, even if it's not going to build anything
in that directory (which is a good thing).  It also helps cleanup and
dist issues.

This also affects orte-checkpoint and orte-restart, but I couldn't get
--with-ft to compile properly; I'll pass along a heads-up to Josh to
ensure that I didn't break anything.

This commit was SVN r18680.
2008-06-19 14:46:10 +00:00
Ralph Castain
571f483c39 Ensure that we don't breakpoint the debugger until -after- all procs have reported their contact info so we can successfully send the release message
This commit was SVN r18678.
2008-06-19 14:37:46 +00:00
Ralph Castain
3b5e80fa61 Shift responsibility for preconnecting the oob to the orte routed framework, which is the only place that knows what needs to be done. Only the direct module will actually do anything - it uses the same algo as the original preconnect function.
This commit was SVN r18677.
2008-06-19 13:48:26 +00:00
Jeff Squyres
7e45b24001 MPIR_being_debugged is an int, not a bool.
This commit was SVN r18676.
2008-06-19 13:31:34 +00:00
Ralph Castain
b56f8ced4f Ensure params are registered prior to parsing global cmd line options in orterun so that debugger options are properly captured and acted upon.
Ensure that routes to remote procs are set on the HNP before completing launch so that the debugger message can be sent. Solves a race condition that can exist in those environments where the HNP does not have local procs.

This commit was SVN r18674.
2008-06-19 02:58:14 +00:00
Ralph Castain
955d117f5e Add a new grpcomm module that mimics the old 1.2 behavior - it -always- does a modex because it always includes the architecture. Hence, we called it "blind-and-dumb" since it doesn't look to see if this is required - moniker of "bad". :-)
Update the ESS API so we can update the stored arch's should the modex include that info. Update ompi/proc to check/set the arch for remote procs, and add that function call to mpi_init right after the modex is done.

Setup to allow other grpcomm modules to decide whether or not to add the arch to the modex, and to detect if other entries have been made. If not, then the modex can just fall through. Begin setting up some logic in the "basic" module to handle different arch situations.

For now, default to the "bad" module so we will work in all situations, even though we may be sending around more info than we really require.

This fixes ticket #1340

This commit was SVN r18673.
2008-06-18 22:17:53 +00:00
Ralph Castain
282a220e7e Update the debugger interface per email thread with Jeff and Brian. Handoff to them for final test and validation
This commit was SVN r18670.
2008-06-18 15:28:46 +00:00
George Bosilca
8e7c35e76c These symbols are only available via the module/component structure, so they
don't have to be globally visible.

This commit was SVN r18666.
2008-06-18 08:20:02 +00:00
George Bosilca
0f9b9c0aff Remove a warning and add a required header (otherwise we cannot compile when
--disable-debug is specified).

This commit was SVN r18665.
2008-06-18 08:10:02 +00:00
Ralph Castain
0532d799d6 Complete implementation of the --without-rte-support configure option. Working with Brian, this has been tested on RedStorm.
Some minor changes to help facilitate debugger support so that both mpirun and yod can operate with it. Still to be completed.

This commit was SVN r18664.
2008-06-18 03:15:56 +00:00
Brian Barrett
7712b07ac4 Add perl based wrapper compilers for cross-compile environments. The default
is still to use the C based wrapper compilers (which have many more features
and are more well tested).  The Perl compilers are enabled with the option
--enable-script-wrapper-compilers, which also ignores the option
--disable-binaries (ie --enable-script-wrapper-compilers --disable-binaries
will result in perl-based wrapper compilers being installed, but no other
binaries being installed).

This commit was SVN r18655.
2008-06-13 22:52:25 +00:00
Ralph Castain
a87aa442e3 Remove last remaining reference to iof_flush - it was #if'd out anyway. The existing flush code appears to have several critical problems. Given the impending rework of the IOF subsystem, there is no point in trying to fix it here.
This commit was SVN r18649.
2008-06-11 16:25:46 +00:00
Ralph Castain
1f41069ac9 Fix CID 752 - if we can't find the daemon job object, we have to ensure we exit without attempting to dereference it
This commit was SVN r18647.
2008-06-11 14:49:58 +00:00
Ralph Castain
13ea4e4673 Be consistent - since we don't strdup the other values for param, don't strdup this one.
This allows r18645 to fix the memory corruption issue, but also allows us to resolve the memory leaks cited by CID 1039

This commit was SVN r18646.

The following SVN revision numbers were found above:
  r18645 --> open-mpi/ompi@53d83ba1c5
2008-06-11 14:42:47 +00:00
Pak Lui
53d83ba1c5 Take out a couple of free's.
This commit fixes trac:1343

This commit was SVN r18645.

The following Trac tickets were found above:
  Ticket 1343 --> https://svn.open-mpi.org/trac/ompi/ticket/1343
2008-06-11 14:02:49 +00:00
Ralph Castain
d61fe87d04 Use the opal_show_help system if orte_show_help has not been initialized
This fixes ticket #1342

This commit was SVN r18644.
2008-06-11 12:50:40 +00:00
Ralph Castain
f9d809748c Glad someone found that last error - caused me to review the code and find a couple of other cleanups! Nothing major, but just ensure that things flow smoothly since we had a "shadowed" variable.
This commit was SVN r18643.
2008-06-10 19:15:59 +00:00
Camille Coti
67cd1849f7 *map was still NULL in the else statement, causing a segmentation fault when a field of the structure was accessed.
This commit was SVN r18642.
2008-06-10 19:00:57 +00:00
Ralph Castain
1a422995ae Fix two Coverity complaints CID 813 (value defined and not used) and 1039 (resource leak). While doing so, found and fixed another less obvious memory leak.
This commit was SVN r18641.
2008-06-10 17:53:28 +00:00
Brian Barrett
4127bd0dcc fix two other mistakes in the cnos ess
This commit was SVN r18632.
2008-06-09 22:28:26 +00:00
George Bosilca
f72ab90b16 Allow xgrid to compile again.
This commit was SVN r18631.
2008-06-09 21:51:41 +00:00
Brian Barrett
11cd3a7cba Fix problem where local rank always had different architecture than remote
ranks on Red Storm

This commit was SVN r18630.
2008-06-09 21:46:03 +00:00
Ralph Castain
8d9ff44134 Add visibility required for some environments and configs
This commit was SVN r18629.
2008-06-09 21:28:19 +00:00
Ralph Castain
03ab4f5c64 Make the ifdef name mirror the change in filename
This commit was SVN r18626.
2008-06-09 20:36:55 +00:00
Ralph Castain
c13cadc3c7 Refs trac:1255
This commit repairs the debugger initialization procedure. I am not closing the ticket, however, pending Jeff's review of how it interfaces to the ompi_debugger code he implemented. There were duplicate symbols being created in that code, but not used anywhere. I replaced them with the ORTE-created symbols instead. However, since they aren't used anywhere, I have no way of checking to ensure I didn't break something.

So the ticket can be checked by Jeff when he returns from vacation... :-)

This commit was SVN r18625.

The following Trac tickets were found above:
  Ticket 1255 --> https://svn.open-mpi.org/trac/ompi/ticket/1255
2008-06-09 20:34:14 +00:00
Ralph Castain
2cc8b2c51f Add yet another test, this one for proper error behavior when someone calls an MPI function after calling MPI_Finalize.
Add a minor debug that outputs the orterun exit status to stderr when orte_debug is set.

This commit was SVN r18622.
2008-06-09 19:21:20 +00:00
Ralph Castain
bf5c34d10a The rsh launcher is one place where multi-word MCA params would have to be passed via the orted cmd line. In such a case, we have to explicitly include quote marks about the param value. Add that capability here.
This commit fixes trac:1200

This commit was SVN r18621.

The following Trac tickets were found above:
  Ticket 1200 --> https://svn.open-mpi.org/trac/ompi/ticket/1200
2008-06-09 19:07:19 +00:00
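The quoting problem behind r18621 can be sketched in a few lines. This is an illustration only, not the rsh launcher's code: `build_remote_command` and the parameter name `example_param` are hypothetical, and Python's `shlex.quote` stands in for whatever quoting the launcher applies.

```python
import shlex


def build_remote_command(mca_params):
    """Join an orted-style argv into one shell string, quoting each argument."""
    argv = ["orted"]
    for name, value in mca_params.items():
        argv += ["-mca", name, value]
    # shlex.quote leaves simple words alone and quotes anything containing spaces,
    # so a multi-word value survives as a single argument on the remote side.
    return " ".join(shlex.quote(arg) for arg in argv)


if __name__ == "__main__":
    cmd = build_remote_command({"example_param": "first second"})
    print(cmd)
    # Splitting the string the way a shell would gives back the original argv.
    print(shlex.split(cmd))
```

Without the quoting step, the remote shell would split the value into two words and the second word would be read as a separate argument.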
Ralph Castain
11692ca98e Update tests to flag that these are non-MPI apps
This commit was SVN r18620.
2008-06-09 18:48:21 +00:00
Ralph Castain
9613b3176c Effectively revert the orte_output system and return to direct use of opal_output at all levels. Retain the orte_show_help subsystem to allow aggregation of show_help messages at the HNP.
After much work by Jeff and myself, and quite a lot of discussion, it has become clear that we simply cannot resolve the infinite loops caused by RML-involved subsystems calling orte_output. The original rationale for the change to orte_output has also been reduced by shifting the output of XML-formatted vs human readable messages to an alternative approach.

I have globally replaced the orte_output/ORTE_OUTPUT calls in the code base, as well as the corresponding .h file name. I have test compiled and run this on the various environments within my reach, so hopefully this will prove minimally disruptive.

This commit was SVN r18619.
2008-06-09 14:53:58 +00:00
Ralph Castain
83dd3d8c6f Restore the ability to forcibly terminate by providing multiple ctrl-c's
This commit was SVN r18618.
2008-06-09 13:08:54 +00:00
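One common way to get the behavior restored in r18618 is to count interrupts. The sketch below assumes a threshold of two and uses a hypothetical `InterruptCounter` class; the commit message does not say how orterun implements it.

```python
import signal


class InterruptCounter:
    """The first interrupt asks for an orderly shutdown; a repeated one forces exit."""

    def __init__(self, force_after=2):
        self.force_after = force_after  # assumed threshold, not taken from orterun
        self.count = 0

    def handle(self, signum=signal.SIGINT, frame=None):
        """Signal-handler-shaped method that reports which action to take."""
        self.count += 1
        return "force" if self.count >= self.force_after else "orderly"


if __name__ == "__main__":
    counter = InterruptCounter()
    print(counter.handle())  # first ctrl-c
    print(counter.handle())  # second ctrl-c
```

In a real program the method would be registered with `signal.signal(signal.SIGINT, ...)` and would act on the decision instead of returning it.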
Pak Lui
caac0e0182 Add in a couple missing ones from r18611 for all tm users out there...
This commit was SVN r18615.

The following SVN revision numbers were found above:
  r18611 --> open-mpi/ompi@7bee71aa59
2008-06-06 22:53:43 +00:00
Ralph Castain
b65eb54ea2 Cut out a new iof pull - that capability isn't ready yet for the trunk, but will be coming shortly
Thanks to Pak for letting me know...

This commit was SVN r18614.
2008-06-06 21:24:15 +00:00
Pak Lui
7f7777a538 Check for NULL in prefix_dir.
This commit fixes trac:1337.

This commit was SVN r18612.

The following Trac tickets were found above:
  Ticket 1337 --> https://svn.open-mpi.org/trac/ompi/ticket/1337
2008-06-06 19:55:01 +00:00
Ralph Castain
7bee71aa59 Fix a potential, albeit perhaps esoteric, race condition that can occur for fast HNP's, slow orteds, and fast apps. Under those conditions, it is possible for the orted to be caught in its original send of contact info back to the HNP, and thus for the progress stack never to recover back to a high level. In those circumstances, the orted can "hang" when trying to exit.
Add a new function to opal_progress that tells us our recursion depth to support that solution.

Yes, I know this sounds picky, but good ol' Jeff managed to make it happen by driving his cluster near to death...

Also ensure that we declare "failed" for the daemon job when daemons fail instead of the application job. This is important so that orte knows that it cannot use xcast to tell daemons to "exit", nor should it expect all daemons to respond. Otherwise, it is possible to hang.

After lots of testing, decide to default (again) to slurm detecting failed orteds. This proved necessary to avoid rather annoying hangs that were difficult to recover from. There are conditions where slurm will fail to launch all daemons (slurm folks are working on it), and yet again, good ol' Jeff managed to find both of them.

Thank you, Jeff! :-/

This commit was SVN r18611.
2008-06-06 19:36:27 +00:00
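The recursion-depth query mentioned in r18611 can be pictured as a counter around the progress loop. This is a minimal sketch under that assumption; `ProgressEngine` and its method names are hypothetical and do not reflect the actual opal_progress interface.

```python
class ProgressEngine:
    """Toy progress loop that tracks how deeply it has been re-entered."""

    def __init__(self):
        self._depth = 0
        self.callbacks = []

    def recursion_depth(self):
        """Return how many progress() calls are currently on the stack."""
        return self._depth

    def progress(self):
        self._depth += 1
        try:
            for callback in list(self.callbacks):
                callback(self)
        finally:
            # Always unwind the counter, even if a callback raises.
            self._depth -= 1


if __name__ == "__main__":
    engine = ProgressEngine()
    seen = []

    def reenter_once(e):
        seen.append(e.recursion_depth())
        if e.recursion_depth() < 2:
            e.progress()  # a callback that drives progress again

    engine.callbacks.append(reenter_once)
    engine.progress()
    print(seen, engine.recursion_depth())
```

Code that must not exit while it is still nested inside an earlier send can check the depth and defer its shutdown until the counter has returned to the top level.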
Josh Hursey
1de50b523c Fix some Coverity 'Event set_but_not_used' highlights.
Thanks to Jeff for bringing them to my attention.

This commit was SVN r18606.
2008-06-06 14:38:41 +00:00
Jeff Squyres
d3795d7a34 Fix CID 987: remove unused variable.
This commit was SVN r18598.
2008-06-05 20:17:02 +00:00
Ralph Castain
332e6c89ab Modify the slurm launcher so that the kill-on-bad-exit behavior is not "on" by default. Instead, only turn it "on" if the plm_slurm_detect_failure mca param is set to something non-zero
This commit was SVN r18588.
2008-06-04 23:59:53 +00:00
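The opt-in behavior from r18588 amounts to adding one flag conditionally. The sketch below assumes the launcher builds an srun argument list; `build_srun_argv` is a hypothetical helper, while `--kill-on-bad-exit` is the srun option that provides kill-on-bad-exit behavior.

```python
def build_srun_argv(detect_failure, daemon_argv):
    """Build an srun command line, adding kill-on-bad-exit only when requested."""
    argv = ["srun"]
    if detect_failure != 0:
        # Only a non-zero parameter value turns the behavior on.
        argv.append("--kill-on-bad-exit")
    return argv + list(daemon_argv)


if __name__ == "__main__":
    print(build_srun_argv(0, ["orted"]))
    print(build_srun_argv(1, ["orted"]))
```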
Ralph Castain
0da811ce79 Initial work on xml support - allocation and job map outputs completed. More to come.
This commit was SVN r18587.
2008-06-04 20:53:12 +00:00
Ralph Castain
ca91ec525b Add a suffix to the opal_output stream descriptor object - we can now output both a prefix and a suffix for a given stream. Default the suffix to NULL.
Remove lingering references to a filtering system as this will no longer be implemented.

This commit was SVN r18586.
2008-06-04 20:52:20 +00:00
Josh Hursey
78f14b5255 Fix the none.checkpoint command.
orte-checkpoint/orte-restart do not seem to work well with orte_output, so revert them to opal_output for now. Since we have no need for the additional complexity of orte_output we can drop it for now and revisit this if anyone needs it later.

It seems that if you set the verbose level on an output handle then try to call a normal orte_output() on it then the message will *not* be printed. This is the same for opal_output, and seems incorrect to me because it stops some error messages from being printed out if you do not directly specify opal_output(0, ...). Maybe someone should take a look at this.


orte-checkpoint would segv if passed an incorrect PID. Fixed the return code so it errors out properly.

Thanks to Eric Roman for bringing this to my attention.

This commit was SVN r18583.
2008-06-04 14:44:11 +00:00
Jeff Squyres
75a97ebbf0 Many thanks to Ralf W. for finding a subtle bug in these Makefile.am's
that can *sometimes* cause problems with "make -j [N>1] install".
Be sure to create the target directory before we copy stuff into it --
read the thread starting here for more details:

    http://www.open-mpi.org/community/lists/devel/2008/06/4080.php

This commit was SVN r18570.
2008-06-04 01:28:03 +00:00
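The ordering rule from r18570 (create the target directory, then copy into it) looks like this outside of make. It is a sketch only: `install_file` is a hypothetical helper, not a translation of the Makefile.am rules.

```python
import os
import shutil
import tempfile


def install_file(src, dest_dir):
    """Create dest_dir if needed, then copy src into it and return the new path."""
    # exist_ok=True keeps this safe when several jobs create the same directory.
    os.makedirs(dest_dir, exist_ok=True)
    return shutil.copy(src, dest_dir)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        src = os.path.join(root, "help.txt")
        with open(src, "w") as handle:
            handle.write("example\n")
        print(install_file(src, os.path.join(root, "share", "example")))
```

Making the directory step part of every copy, rather than relying on some other target having run first, is what keeps a parallel install from depending on job ordering.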