Tested this functionality quite a bit more and made some fixes:
* Print far fewer help messages
* Fix one additional deadlock upon error
* Change some ORTE_LOG messages to silent (because they're not
errors)
* Some code got re-indented, sorry...
Discussed and reviewed with Ralph.
This commit was SVN r13375.
The following Trac tickets were found above:
Ticket 726 --> https://svn.open-mpi.org/trac/ompi/ticket/726
yesterday and set it to "true" in orte_init(). But ompi_mpi_init()
doesn't call orte_init() -- it calls orte_init_stage1() and
orte_init_stage2(). So orte_initialized was never set to true, and
Badness happend from there (w.r.t. ompi_mpi_abort()).
This patch moves the setting of orte_initialized to orte_init_stage2()
so that everyone will always get it set properly.
It also moves setting orte_universe_info.state to RUNNING into
stage2() as well -- Ralph confirmed that that should have been there
for the same reasons that orte_initialized needs to be there.
This commit was SVN r13374.
completed successfully, Bad Things(tm) could happen.
* Now we explicitly check orte_initialized (a new global in ORTE
indicating whether we are between orte_init() and orte_finalize()
or not), and if so, react accordingly.
* If ORTE is initialized, use orte_system_info.nodename; otherwise,
use gethostname().
* Add loop protection to ensure that ompi_mpi_abort() is not invoked
multiple times recursively.
This commit was SVN r13354.
and George on these refinements):
* Rename the static OBJ initializer macro to be
OPAL_OBJ_STATIC_INIT(class)
* Ensure that all static OBJ initializations get a refcount of 1
(doesn't ''really'' matter, since they're static, it should never
get to the point where the OBJ is DESTRUCTed, but more correct
nonetheless)
* Add a "magic number" to the OBJ when compiling with debug support.
The magic number does some rudimentary support to ensure that
you're operating on a valid OBJ (and fails an assertion if you're
not). Check to ensure that the memory contains the magic number
when performing actions of OBJ's. Also remove the magic number
when DESTRUCTing OBJs, so that if, for example, an OBJ is
DESTRUCTed more than once, we'll fail the magic number assert.
This commit was SVN r13338.
The following SVN revision numbers were found above:
r13227 --> open-mpi/ompi@96030de97b
r13228 --> open-mpi/ompi@c2e9075d29
then we modify the argv, forcing the reallocation of the array. With luck
the saved pointer still have a meaning ... without execve return with error
14 (EFAULT).
This commit was SVN r13321.
The following SVN revision numbers were found above:
r12059 --> open-mpi/ompi@ae79894bad
the ORTE_DAEMON_CMD type. Which, unfortunately, is used all over
the place. Without this, we get error:
[msc01:12341] [0,0,0] ORTE_ERROR_LOG: Data pack failed in file ../../ompi-trunk/orte/dss/dss_pack.c at line 83
[msc01:12341] [0,0,0] ORTE_ERROR_LOG: Data pack failed in file ../../ompi-trunk/orte/dss/dss_pack.c at line 58
[msc01:12341] [0,0,0] ORTE_ERROR_LOG: Data pack failed in file ../../../../ompi-trunk/orte/mca/pls/base/pls_base_orted_cmds.c at line 136
This commit was SVN r13320.
1. add a "cancel_operation" API to the pls components that allows orterun to demand that an orted operation (e.g., terminate_job) be immediately cancelled and abandoned.
2. changes the pls orted commands from blocking to non-blocking. This allows us to interrupt those operations should an orted be non-responsive. The change also adds an orte_abort_timeout that limits how long orterun will automatically wait for the orteds to respond - if the terminate command, for example, doesn't see orted response within that time, then we printout an appropriate error message and just give up.
3. modifies orterun to allow multiple ctrl-c's to simply abort the program even if the orteds have not responded
4. does some cleanup on the orte-level mca params so that their implementation looks a lot more like that of ompi - makes it easier to maintain. This change also includes the definition of an orte_abort_timeout struct and associated MCA param (can't have too many!) so you can set the time after which orterun gives up on waiting for orteds to respond
This needs more testing before migrating to 1.2.
This commit was SVN r13304.
This fix includes two parts: (a) we now initialize the keyval pointer locations to NULL after the malloc, and (b) we now OBJ_NEW the keyvals prior to storing info in them.
BTW, in case anyone reads this and wonders why we don't just OBJ_NEW the keyvals in create_value, the reason is simply that some places in the code use static keyvals and simply assign those addresses into the value object's array. So not everyone wants to OBJ_NEW keyvals - by not forcing it here in create_value, we give the user the flexibility to do whatever they want.
This commit was SVN r13300.
- Make it so the SLURM ras can handle different nodelist configurations
- Some code cleanup and better/more informative error messages and error handling
This commit was SVN r13271.
The following Trac tickets were found above:
Ticket 801 --> https://svn.open-mpi.org/trac/ompi/ticket/801
then exec the "srun..." from there. But somewhere along the line, we
switched to having a copy of environ and modifying that. It looks
like we forgot to update the stuff for --prefix behavior. So this
commit fixes the setenv's for PATH and LD_LIBRARY_PATH to modify the
environ copy (not environ itself) so that the values properly get
passed down to the srun environment via execve().
This restores --prefix behavior in the SLURM pls.
This commit was SVN r13239.