1915 Commits

Author SHA1 Message Date
Josh Hursey
0cd65bfaa8 Fix a SIGPIPE that may occur when checkpointing a restarted process. This was a result of calling system() in the BLCR CRS. After inspection and testing it was determined that the operation was no longer necessary. So the call was removed thus fixing the bug.
This commit was SVN r19601.
2008-09-22 16:49:56 +00:00
Jeff Squyres
8eccda391a Fix comment to match the code.
This commit was SVN r19598.
2008-09-20 12:35:48 +00:00
Ralph Castain
16e4b0b698 Ensure that a child job inherits its parent job's prefix dir during comm_spawn operations
This commit was SVN r19538.
2008-09-10 19:05:23 +00:00
Ralph Castain
f326ee356e Add some error output to the plm rsh
This commit was SVN r19532.
2008-09-10 01:59:49 +00:00
Ralph Castain
20ece3cb86 Add new test that stresses MPI send/recv
This commit was SVN r19530.
2008-09-09 15:47:31 +00:00
Ralph Castain
c0d7fbaf88 A few mapping cleanups - mostly aimed at properly balancing loads so multi app-context comm_spawns don't dump everything on one node.
This commit was SVN r19519.
2008-09-08 15:45:55 +00:00
Ralph Castain
9b8473fdbf Cleanup orted cmd line - we don't need to pass nodenames, and shouldn't pass heartbeat unless the orted is going to use it. This helps shorten the cmd line for future use.
Cleanup when an orted actually opens the PLM. Unfortunately, some unmentionable people are pushing head node environs out to remote nodes, causing the daemons to think they are the HNP. This helps prevent the confusion.

This commit was SVN r19518.
2008-09-08 15:45:11 +00:00
Shiqing Fan
04ee20a880 - Mainly type casts. Microsoft VC++ compiler is too strict.
This commit was SVN r19517.
2008-09-08 15:39:30 +00:00
Shiqing Fan
c90e6e4f6d - The correct function to close a socket. Thanks to George for noticing it.
This commit was SVN r19513.
2008-09-08 14:35:47 +00:00
Shiqing Fan
93897c87a8 - Update the orte wait function for Windows.
This commit was SVN r19512.
2008-09-08 14:11:26 +00:00
Josh Hursey
edf52e7258 This commit should fix and close #1482
The problem (outside of Odin configure issues) was that the IOF is no longer enabled by application processes.

Checkpoint/restart seems to be working once again.

Thanks to Ralph for pointing me here.

This commit was SVN r19508.
2008-09-05 18:39:17 +00:00
George Bosilca
579d70edad We should use #ifdef and not #if
This commit was SVN r19504.
2008-09-05 12:44:19 +00:00
Brian Barrett
c2c5a34cb1 Missed a symbol that needs to exist in non-full RTE case
This commit was SVN r19482.
2008-09-02 15:07:48 +00:00
Brian Barrett
79cf946bce Add header file needed for the non-full RTE, non-debug case
This commit was SVN r19475.
2008-09-01 18:02:32 +00:00
Brian Barrett
52d4be78dc Add missing header file for the full rte case caused by changes in r19471.
This commit was SVN r19474.

The following SVN revision numbers were found above:
  r19471 --> open-mpi/ompi@38eb301919
2008-09-01 17:49:31 +00:00
Brian Barrett
38eb301919 Follow-on to r19457. Rather than have #ifs in the middle of functions
(which neither Ralph nor I liked), don't allow the functions we don't
need to be visible.  Still not happy about the number of #ifs in the
code, but splitting the code further would have been a nightmare
and this was a good cutting point.

Also protected some variables that were declared but not instanced
so that users would be notified at compile time instead of link or
run time (in the case of dss constants) that things wouldn't work.

This commit was SVN r19471.

The following SVN revision numbers were found above:
  r19457 --> open-mpi/ompi@a15171e46b
2008-09-01 17:15:01 +00:00
Shiqing Fan
cd6ff74d89 Update the ccp module:
rename the get_cluster_message function for both ras/plm.
  use _umask instead of umask.
  add WIN32_DCOM definition to support Windows Vista.

This commit was SVN r19470.
2008-09-01 16:35:38 +00:00
Brian Barrett
a15171e46b Some fixes for the disabled ORTE case
* Protect an orte variable used in the orte debugger stuff
  * Initialize the datatype code in the Catamount code, as we need it
    for intercommunicators (the proc code needs it to pack the remote
    name)
  * Turn on a bunch of the orte datatype code so that ORTE_NAME is available.

This commit was SVN r19457.
2008-08-31 18:06:55 +00:00
Shiqing Fan
ce40b8a35e - Fix typo ;-)
This commit was SVN r19438.
2008-08-27 17:06:40 +00:00
Ralph Castain
a5efefe980 Ensure var is init before use
This commit was SVN r19416.
2008-08-26 13:38:11 +00:00
Ralph Castain
063837a413 Add oob and iof stress tests
This commit was SVN r19404.
2008-08-26 03:02:46 +00:00
Ralph Castain
28346b5bac Get -host to not use empty nodes called out specifically later in the -host list
This commit was SVN r19403.
2008-08-26 03:02:28 +00:00
Ralph Castain
6039e385cd Per request from Terry, make -host and -hostfile respect order when used as filters. In other words, if you specify -host host1,host3,host2, then we should use the hosts in that order. Previously, we used them in whatever order they were found in the allocation - all the -host did was tell us which nodes to use, not what order to use them in.
Relative node syntax remains supported. Also, if you specify empty nodes, but have a specific empty node called out later, we will not include that node in the empties we add. I'll provide examples in the manpage.

This commit was SVN r19402.
2008-08-26 02:56:10 +00:00
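The ordering change described in r19402 can be illustrated with a small sketch. This is not the ORTE implementation (which is written in C): the function names and node names are hypothetical, repeated names are simply collapsed, and relative node syntax and empty-node handling are not modelled.

```python
def filter_by_host_list(allocation, host_list):
    """New behaviour: keep allocated nodes named in host_list, in host_list order."""
    allocated = set(allocation)
    seen = set()
    ordered = []
    for name in host_list:
        # Skip names that are not in the allocation or were already taken.
        if name in allocated and name not in seen:
            ordered.append(name)
            seen.add(name)
    return ordered


def filter_by_allocation_order(allocation, host_list):
    """Previous behaviour: host_list only selects nodes; allocation order wins."""
    wanted = set(host_list)
    return [name for name in allocation if name in wanted]
```

With an allocation of host1, host2, host3 and a -host list of host1,host3,host2, the first function yields the nodes in the order the user wrote them, while the second falls back to allocation order.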
Shiqing Fan
94a2147e3d - make sure that the system has the header files.
This commit was SVN r19400.
2008-08-25 13:56:10 +00:00
Ralph Castain
b45029fd0e Application processes should not open/close the IOF framework - there is nothing in that framework for application procs to do.
Fix a bug in iof_base_close where we destruct a thread lock prior to unlocking it.

This commit was SVN r19392.
2008-08-22 01:28:19 +00:00
Ralph Castain
4ef9d15d97 Revamp the opal mca paffinity interface. We ran into a problem when we encountered machines that had "holes" in their physical processor layout - e.g., machines that supported "hotplugging", or that had unpopulated sockets. To solve that problem, we had to clarify at the API level where we were describing physical vs logical processor info, and then translate accordingly in the underlying implementation.
See opal/mca/paffinity/paffinity.h for explanation as to the physical vs logical nature of the params used in the API.

Fixes trac:1435

This commit was SVN r19391.

The following Trac tickets were found above:
  Ticket 1435 --> https://svn.open-mpi.org/trac/ompi/ticket/1435
2008-08-21 19:21:28 +00:00
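The physical vs logical distinction mentioned in r19391 can be sketched as follows. This is only an illustration, not the paffinity API: the function names are hypothetical, and it assumes that logical IDs are dense ranks (0, 1, 2, ...) over whichever physical processor IDs are actually present.

```python
def logical_to_physical(present_physical_ids, logical_id):
    """Map a dense logical index to the physical ID it refers to."""
    ordered = sorted(present_physical_ids)
    if not 0 <= logical_id < len(ordered):
        raise ValueError("no such logical processor: %d" % logical_id)
    return ordered[logical_id]


def physical_to_logical(present_physical_ids, physical_id):
    """Map a physical ID to its dense logical index; fail if it is a hole."""
    ordered = sorted(present_physical_ids)
    if physical_id not in ordered:
        raise ValueError("physical processor %d is not present" % physical_id)
    return ordered.index(physical_id)
```

For a machine whose present physical IDs are 0, 1, 4 and 5 (IDs 2 and 3 being unpopulated), logical processor 2 maps to physical ID 4, and asking for physical ID 2 is an error rather than a silent misbinding.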
Ralph Castain
43f8bcfe54 Update slurm plm to respect leave_session_attached
This commit was SVN r19370.
2008-08-19 18:30:30 +00:00
Ralph Castain
4e0f34a062 When we hit an error prior to actually launching daemons, it would be nice if orterun didn't bark about daemons failing to launch, mpirun detecting a job failed, etc.
Add a new job state to indicate that we never attempted to launch. Flag such a scenario and avoid hitting all the other error messages.

This commit was SVN r19366.
2008-08-19 15:19:30 +00:00
Ralph Castain
9447334749 Some comments relating to relative indexing
This commit was SVN r19365.
2008-08-19 15:17:40 +00:00
Ralph Castain
6d82efba21 Add relative indexing capabilities for hostfile and -host - we can now reference hosts using a relative syntax.
See the orte_hosts manpage for an explanation

This commit was SVN r19364.
2008-08-19 15:16:27 +00:00
Ralph Castain
49745c5f40 Provide a new option that allows a user to leave an ssh session open without getting deluged by ORTE debug output. The new option is --leave-session-attached, with a corresponding MCA param of orte_leave_session_attached.
Theoretically, any PLM could use this - but in reality, all of them except rsh/ssh already leave the session attached anyway.

This fixes trac:656 - a REALLY old ticket

This commit was SVN r19294.

The following Trac tickets were found above:
  Ticket 656 --> https://svn.open-mpi.org/trac/ompi/ticket/656
2008-08-14 18:59:01 +00:00
Ralph Castain
dd16e4e4a6 Update process component of odls.
This commit was SVN r19281.
2008-08-13 20:07:59 +00:00
Ralph Castain
913cf04633 Only co-locate debugger daemon if the orted has local children - prevents mpirun from co-locating a daemon when it has no local procs
This commit was SVN r19280.
2008-08-13 20:06:28 +00:00
Ralph Castain
3e2a3db887 Add a missing ntoh conversion when pushing a message back onto the RML progress queue.
If a message cannot be routed because the addressee isn't yet known, then the message is held on a queue in the RML for a period of time (currently set to 500 millisec). At the end of that time, we pop the message from the list and attempt to send it again. This action requires that we convert the header back to network-byte-order before calling the OOB.

If the message still cannot be routed, we put the message back on the list and reset the timer. However, since we are going to convert the header when it comes off of the list, we have to ntoh it before putting it back on the list so it all comes out right. This step was missing.

Thus, the problem only showed up relatively rarely because a message would have to be pushed onto the queue at least twice for the problem to surface.

This should fix a specific ticket (1389), but we will wait to see the results of MTT runs to verify. Note that we really don't know why a message is rattling around in the RML for so long, especially since this all seems to be happening during finalize, so this could cause mpirun to hang. Or it could simply trash the message and exit cleanly. Shall be interesting to see!

This commit was SVN r19276.
2008-08-13 17:54:15 +00:00
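The byte-order bug fixed in r19276 can be reproduced in a few lines. This is a toy model, not the RML code: `swap32`, `retry` and the `fixed` flag are hypothetical names, the header is reduced to one 32-bit integer, and the host is assumed to be little-endian so that hton and ntoh are both a byte swap.

```python
def swap32(value):
    """Byte-swap a 32-bit unsigned integer."""
    return int.from_bytes(value.to_bytes(4, "little"), "big")


# On the assumed little-endian host both conversions are the same swap.
hton = swap32
ntoh = swap32


def retry(queue, can_route, fixed=True):
    """Pop one held header, convert it for sending, requeue it if unroutable.

    Headers sit on the queue in host byte order. Returns the network-order
    header if it was sent, or None if it went back on the queue.
    """
    header = hton(queue.pop(0))  # convert before handing to the stand-in OOB
    if can_route:
        return header
    # Still unroutable: it will be hton'd again on the next pop, so restore
    # host order first. Skipping this is the bug (fixed=False).
    queue.append(ntoh(header) if fixed else header)
    return None
```

With `fixed=False`, a header that fails one retry goes back on the queue already swapped, so the next attempt swaps it again and sends it in host order. That matches the observation in the message that the problem needs at least two trips through the queue to surface.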
Ralph Castain
30f37f762d Enable co-location of debugger daemons during initial launch and when debugging a running job.
Provide support for four MPIR extensions that allow specification of debugger daemon executable, argv for the debugger daemon, whether or not to forward debugger daemon IO, and whether or not debugger daemon will piggy-back on ORTE OOB network. Last is not yet implemented.

No change in behavior or operation occurs unless (a) the debugger specifically utilizes the extensions and (b), for co-locate while running, the user specifically enables the capability via an MCA param. Two of the MPIR extensions supported here are used in a widely-used debugger for a large-scale installation. The other two extensions are new and being utilized in prototype work by several debuggers for possible future release.

This commit was SVN r19275.
2008-08-13 17:47:24 +00:00
Rainer Keller
d57ef70149 - Store the result of the 1-byte read... and assert, in case
of error checking -- we don't return errors here anyway.
   Fixes Coverity CID 981

This commit was SVN r19259.
2008-08-12 18:00:38 +00:00
Ralph Castain
f017c55bfa Close a minor memory leak - we can reuse timer events
This commit was SVN r19251.
2008-08-12 12:53:30 +00:00
Ralph Castain
baed5dcad0 Ensure contact info is placed in the job family session directory so orte-ps and other tools can find it
This commit was SVN r19245.
2008-08-11 23:48:39 +00:00
Terry Dontje
f0eec291d0 Added a couple examples and spelling corrections.
This commit was SVN r19234.
2008-08-11 12:34:02 +00:00
Dan Lacher
7ef29d4abe More fixes for #1387. Minor fixes for the orte_host.7
man page file that was missed in the initial pass.

We are using $(am_dirstamp) instead of creating our own dirstamp, since
the util/hostfile directory contains src code and so is already created.
The automake process creates the $(am_dirstamp); we found it used in the
generated util/Makefile.

This commit was SVN r19230.
2008-08-08 19:10:02 +00:00
Rainer Keller
0f8a80d81d - The intel compiler does not play nice with the
__opal_attribute_format__ on typedef defined functions and
   emits a warning once errmgr.h is included --> read: often...

This commit was SVN r19229.
2008-08-08 16:26:09 +00:00
Jeff Squyres
797ec531aa Some more work on the man pages:
* Make the creation of the build dir for the man pages a bit more
   robust (thanks to suggestions from Ralf W.).
 * Only distribute the .Xin files, not the .X man pages themselves.
 * Make the .X files depend on opal_config.h so that if you re-run
   configure and change opal_config.h (e.g., a new version), the man
   pages should get rebuilt.
 * Man pages are now cleaned with "distclean", not "maintainer-clean".
 * Fix a typo in opal_crs.7in.
 * Update make_dist_tarball to update "date" in the VERSION file.
 * Make make_dist_tarball a bit friendlier to hg checkouts.

This commit was SVN r19219.
2008-08-07 19:20:40 +00:00
Rainer Keller
ad3538ea38 - Per Discussion w/ Ralph Castain (related to CID 1051)
- Move up the __opal_attribute_noreturn__ information
 - Actually make it known outside in ess.h
 - Additionally allow printf-type checking

This commit was SVN r19210.
2008-08-07 09:36:10 +00:00
Ralph Castain
c9e53fd0d4 Add capability to notify system admins of potential problems in system communication networks and/or other system elements that are detected by Open MPI during operation. For example, failures in connections that may be indicative of connectivity problems can be reported to sys admins in addition to our current error message to the user, thus allowing more rapid correction of the problem.
This system is "off" by default and only operates upon specific directive for selection of a notifier component. At the moment, the only available component will write an error message to the syslog.

This commit was SVN r19209.
2008-08-06 21:59:21 +00:00
Ralph Castain
d7da6b3226 Just a minor cleanup of race conditions on trigger events for exit. Close the trigger pipe upon use since it is only a one-shot anyway. This removes the need to destruct the object, leaving the lock available to protect one-time termination routines throughout the life of the program.
This commit was SVN r19208.
2008-08-06 21:53:35 +00:00
Rainer Keller
c58e89e471 - Fix variable set but not used
Coverity CID1061

This commit was SVN r19197.
2008-08-06 14:42:16 +00:00
George Bosilca
d8fe05264b Fix recursion in include files (Coverity fix 156).
This commit was SVN r19181.
2008-08-06 13:50:01 +00:00
Ralph Castain
63c33a9c32 Some minor updates to the locking system changes. Remove obsolete locks. Ensure the trigger event objects do not get deconstructed until the very end to avoid possible problems due to race conditions. Route all orted abnormal term tests through the trigger.
This commit was SVN r19172.
2008-08-06 11:31:06 +00:00
Shiqing Fan
bb90ad793a - Move the entire OBJ_CLASS_INSTANCE of orte_trigger_event_t into #if blocks, so that windows can have its own destructor for socket. Thanks to Ralph.
- The modification for handling windows socket will first be applied to windows branch.

This commit was SVN r19170.
2008-08-06 09:42:48 +00:00
Ralph Castain
be02211b4f Modify the wakeup system to make it more Windows-friendly. This allows Shiqing to consolidate the Windows-specific modifications into one location, and generalizes the wakeup procedure in case we hit other system-specific requirements.
This needs some soak time to ensure we haven't opened any race conditions. I tried to loop everything in the shutdown procedure through that trigger event call to ensure it all goes through the one-time locks as it did before, so that someone hitting ctrl-c when we are already shutting down shouldn't cause problems. Just want to let people use it for a while to verify.

This commit was SVN r19159.
2008-08-05 15:09:29 +00:00