Commit graph

21 commits

Author SHA1 Message Date
Jeff Squyres
5fd742e769 Add in the standardized way to notify a debugger if the MPI job is
about to abort.  Fixes trac:1509.

This commit was SVN r19596.

The following Trac tickets were found above:
  Ticket 1509 --> https://svn.open-mpi.org/trac/ompi/ticket/1509
2008-09-20 11:34:37 +00:00
Jeff Squyres
004c3a5b09 Ensure to cover all cases when either ORTE or OMPI is not yet
initialized.  For example, there is a period of time during
ompi_mpi_init when orte_initialized==true, but
ompi_mpi_initialized==false (and therefore communicators are not set up
yet, etc.).

This commit was SVN r17937.
2008-03-24 16:25:14 +00:00
Ralph Castain
dc7f45dafd Remove the obsolete and largely unused orte_system_info structure. The only fields that were used in that struct were nodeid and nodename - these have been transferred to the orte_process_info structure.
Only one place used the user name field - session_dir, when formulating the name of the top-level directory. Accordingly, the code for getting the user's id has been moved to the session_dir code.

This commit was SVN r17926.
2008-03-23 23:10:15 +00:00
Jeff Squyres
4314609a00 * Remove a meaningless clause (it could never be true)
* Fix an error message to correctly display if we were before
   MPI_INIT or after MPI_FINALIZE (refs trac:1243)

This commit was SVN r17873.

The following Trac tickets were found above:
  Ticket 1243 --> https://svn.open-mpi.org/trac/ompi/ticket/1243
2008-03-18 22:26:43 +00:00
George Bosilca
d9937cca81 Only declare ret in the block where it is used (avoid a warning about
unused variable).

This commit was SVN r17638.
2008-02-28 06:18:57 +00:00
Ralph Castain
d70e2e8c2b Merge the ORTE devel branch into the main trunk. Details of what this means will be circulated separately.
Remains to be tested to ensure everything came over cleanly, so please continue to withhold commits a little longer

This commit was SVN r17632.
2008-02-28 01:57:57 +00:00
Jeff Squyres
e90b3e415b * Before this commit, if we called ompi_mpi_abort() before MPI_INIT
   completed successfully, Bad Things(tm) could happen.
 * Now we explicitly check orte_initialized (a new global in ORTE
   indicating whether we are between orte_init() and orte_finalize()
   or not), and if so, react accordingly.
 * If ORTE is initialized, use orte_system_info.nodename; otherwise,
   use gethostname().
 * Add loop protection to ensure that ompi_mpi_abort() is not invoked
   multiple times recursively.

This commit was SVN r13354.
2007-01-29 22:01:28 +00:00
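
The two safeguards described in e90b3e415b (choosing the host name according to whether the runtime is up, and protecting against recursive invocation) can be sketched roughly as below. This is a minimal illustration, not the actual Open MPI code: `runtime_initialized`, `runtime_nodename`, `abort_hostname()` and `guarded_abort()` are hypothetical names standing in for `orte_initialized`, `orte_system_info.nodename` and the body of `ompi_mpi_abort()`, and the sketch returns a status instead of terminating the process.

```c
#define _POSIX_C_SOURCE 200112L
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical stand-ins for the runtime's state; not the real ORTE globals. */
static bool runtime_initialized = false;
static const char *runtime_nodename = "runtime-node";

/* Choose the host name to report: the runtime's copy when the runtime is
 * initialized, otherwise whatever gethostname() returns. */
static const char *abort_hostname(char *buf, size_t len)
{
    assert(buf != NULL && len > 0);
    if (runtime_initialized) {
        return runtime_nodename;
    }
    if (gethostname(buf, len) != 0) {
        snprintf(buf, len, "unknown");
    }
    buf[len - 1] = '\0'; /* gethostname() may not terminate on truncation */
    return buf;
}

/* Returns 1 if this call ran the abort path and 0 if it was a re-entry.
 * A real abort routine would terminate the process instead of returning. */
static int guarded_abort(int errcode)
{
    static bool in_abort = false; /* loop protection flag */
    char host[256];

    if (in_abort) {
        return 0; /* already aborting: do not recurse */
    }
    in_abort = true;
    fprintf(stderr, "[%s:%ld] aborting with error code %d\n",
            abort_hostname(host, sizeof(host)), (long)getpid(), errcode);
    return 1;
}
```

The static flag is the simplest form of loop protection; it is not thread-safe, which this sketch assumes is acceptable for illustration.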
Jeff Squyres
75df4ca602 Minor fixes for MPI-level aborting:
- Fix some fprintf's in ompi_mpi_abort() that incorrectly assumed that
  we were always being invoked from MPI_ABORT (ompi_mpi_abort() may be
  invoked from a bunch of different places)
- Also try to opal_backtrace_print() if opal_backtrace_buffer() is not
  supported.
- Print a message in MPI_ABORT if we're aborting.

This commit was SVN r12998.
2007-01-04 22:30:28 +00:00
Ralph Castain
6d6cebb4a7 Bring over the update to terminate orteds that are generated by a dynamic spawn such as comm_spawn. This introduces the concept of a job "family" - i.e., jobs that have a parent/child relationship. Comm_spawn'ed jobs have a parent (the one that spawned them). We track that relationship throughout the lineage - i.e., if a comm_spawned job in turn calls comm_spawn, then it has a parent (the one that spawned it) and a "root" job (the original job that started things).
Accordingly, there are new APIs to the name service to support the ability to get a job's parent, root, immediate children, and all its descendants. In addition, the terminate_job, terminate_orted, and signal_job APIs for the PLS have been modified to accept attributes that define the extent of their actions. For example, doing a "terminate_job" with an attribute of ORTE_NS_INCLUDE_DESCENDANTS will terminate the given jobid AND all jobs that descended from it.

I have tested this capability on a MacBook under rsh, Odin under SLURM, and LANL's Flash (bproc). It worked successfully on non-MPI jobs (both simple and including a spawn), and MPI jobs (again, both simple and with a spawn).

This commit was SVN r12597.
2006-11-14 19:34:59 +00:00
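
The job "family" idea from 6d6cebb4a7 (each spawned job records its parent, the root is found by walking up, and a terminate request can optionally cover all descendants) can be illustrated with a small sketch. Everything here is hypothetical: the flat `parent[]` array, `NO_PARENT`, and the function names are assumptions made for illustration and do not reflect the real name service or PLS APIs; the `include_descendants` flag merely plays the role that ORTE_NS_INCLUDE_DESCENDANTS plays in the commit message.

```c
#include <assert.h>
#include <stddef.h>

#define NO_PARENT (-1)

/* parent[j] is the job that spawned job j, or NO_PARENT for the original job. */

/* Walk up the lineage until a job with no parent is reached. */
static int job_root(const int *parent, int job)
{
    while (parent[job] != NO_PARENT) {
        job = parent[job];
    }
    return job;
}

/* True if `ancestor` appears anywhere above `job` in the lineage. */
static int is_descendant(const int *parent, int job, int ancestor)
{
    for (int p = parent[job]; p != NO_PARENT; p = parent[p]) {
        if (p == ancestor) {
            return 1;
        }
    }
    return 0;
}

/* Fill `out` with the job ids a terminate request would cover and return
 * how many there are. `out` must have room for njobs entries. */
static size_t jobs_to_terminate(const int *parent, size_t njobs, int job,
                                int include_descendants, int *out)
{
    size_t count = 0;

    assert(job >= 0 && (size_t)job < njobs);
    out[count++] = job;
    if (include_descendants) {
        for (size_t j = 0; j < njobs; j++) {
            if (is_descendant(parent, (int)j, job)) {
                out[count++] = (int)j;
            }
        }
    }
    return count;
}
```

With `parent = { NO_PARENT, 0, 1, 0 }`, job 2 was spawned by job 1, which was spawned by job 0; terminating job 1 with descendants included covers jobs 1 and 2, while job 3 (a sibling) is left alone.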
George Bosilca
4689c56210 Always cast the return of malloc.
This commit was SVN r11990.
2006-10-05 05:07:43 +00:00
Brian Barrett
ab6cbb2359 * update ompi_mpi_abort to call abort_procs_request on the processes that
    should die, according to the MPI standard.  It's possible that the
    ORTE layer may kill additional processes, but that's beyond our
    control and seems to be allowed by the standard (ie, it might also
    end up killing all the procs in all the jobs covered by the
    communicator).

  * update the stack trace printing code to use the framework rather
    than calling execinfo directly, so that we should be able to get
    stack traces on all the platforms we support stack tracing on
    (if the user wants stack traces on abort, of course)

This commit was SVN r11753.
2006-09-22 15:04:04 +00:00
Ralph Castain
37dfdb76eb Here is the major MAD-cure commit. I have written plenty about it, so I refer you here to those messages for a description of everything that was done.
This commit was SVN r11661.
2006-09-14 21:29:51 +00:00
Jeff Squyres
e371aff9f5 Fix minor compiler warning
This commit was SVN r9514.
2006-04-01 12:41:48 +00:00
Jeff Squyres
fd61d78599 Add two MCA parameters to the MPI level to control behavior during
MPI_ABORT.  From the ompi_info output:

       MCA mpi: parameter "mpi_abort_delay" (current value: "0")
                If nonzero, print out an identifying message when
                MPI_ABORT is invoked (hostname, PID of the process
                that called MPI_ABORT) and delay for that many seconds
                before exiting (a negative delay value means to never
                abort).  This allows attaching of a debugger before
                quitting the job.
       MCA mpi: parameter "mpi_abort_print_stack" (current value: "0")
                If nonzero, print out a stack trace when MPI_ABORT is
                invoked

This commit was SVN r9487.
2006-03-31 00:31:15 +00:00
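
The three cases of "mpi_abort_delay" described in fd61d78599 (zero, positive, negative) can be sketched as follows. This is only an illustration of the documented semantics, not the Open MPI implementation; `abort_plan_t`, `plan_for_delay()` and `wait_before_abort()` are hypothetical names, and the message text is made up.

```c
#include <assert.h>
#include <stdio.h>
#include <unistd.h>

typedef enum { ABORT_IMMEDIATELY, ABORT_AFTER_DELAY, ABORT_NEVER } abort_plan_t;

/* Map the parameter value onto the behavior described in the help text. */
static abort_plan_t plan_for_delay(int delay_seconds)
{
    if (delay_seconds == 0) {
        return ABORT_IMMEDIATELY;
    }
    if (delay_seconds < 0) {
        return ABORT_NEVER;
    }
    return ABORT_AFTER_DELAY;
}

/* Print an identifying message and wait so that a debugger can attach.
 * Returns once the caller may proceed with the abort; never returns for a
 * negative delay. */
static void wait_before_abort(int delay_seconds, const char *host)
{
    abort_plan_t plan = plan_for_delay(delay_seconds);

    assert(host != NULL);
    if (plan == ABORT_IMMEDIATELY) {
        return;
    }
    fprintf(stderr, "[%s:%ld] abort requested; ", host, (long)getpid());
    if (plan == ABORT_NEVER) {
        fprintf(stderr, "waiting indefinitely for a debugger\n");
        for (;;) {
            sleep(5);
        }
    }
    fprintf(stderr, "delaying %d seconds before exiting\n", delay_seconds);
    sleep((unsigned int)delay_seconds); /* may return early if a signal arrives */
}
```

Assuming the parameter is set like any other MCA parameter, a run that should pause for 30 seconds on abort would look something like `mpirun --mca mpi_abort_delay 30 ...`, with the rest of the command line unchanged.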
Brian Barrett
566a050c23 Next step in the project split, mainly source code re-arranging
  - move files out of toplevel include/ and etc/, moving them into the
    sub-projects
  - rather than including config headers with <project>/include, 
    have them as <project>
  - require all headers to be included with a project prefix, with
    the exception of the config headers ({opal,orte,ompi}_config.h
    mpi.h, and mpif.h)

This commit was SVN r8985.
2006-02-12 01:33:29 +00:00
Jeff Squyres
42ec26e640 Update the copyright notices for IU and UTK.
This commit was SVN r7999.
2005-11-05 19:57:48 +00:00
Jeff Squyres
285ded5655 - Ensure to have !initialized || finalized test *first*
- If we have an NS error, don't return an error -- this function's
  purpose is to abort :-)
- s/abort()/exit(1)/ so that we don't drop massive corefiles

This commit was SVN r7524.
2005-09-27 20:26:38 +00:00
George Bosilca
948683215b And more fixes ...
This commit was SVN r7321.
2005-09-12 20:25:01 +00:00
Brian Barrett
170ef8af1f * rename ompi_show_help to opal_show_help
* rename ompi_stacktrace to opal_stacktrace
* rename ompi_strncpy to opal_strncpy

This commit was SVN r6336.
2005-07-04 02:38:44 +00:00
Brian Barrett
23b687b0f4 * rename ompi_event to opal_event
This commit was SVN r6328.
2005-07-03 23:09:55 +00:00
Jeff Squyres
35c141aef6 While we're moving directories around, move ompi/mpi/runtime ->
ompi/runtime, for consistency and parallel-ness with orte/runtime.
Also remove a few useless #includes along the way.

This commit was SVN r6317.
2005-07-03 12:07:29 +00:00