Commit graph

11826 commits

Author SHA1 Message Date
Pak Lui
924bface15 The plm env var should be set to the name of a current plm module, which is rsh.
This commit was SVN r18844.
2008-07-08 23:15:52 +00:00
Ralph Castain
8e3658b320 Remove the nodename:pid prefix from show_help output so it doesn't disrupt the formatted output
This commit was SVN r18843.
2008-07-08 22:57:50 +00:00
Ralph Castain
f1114b4144 Upgrade the ability of orterun to deal with cmd line MCA params that are passed to the orteds. Help reduce the size of the cmd line by eliminating duplicates where possible, and alert to duplicate entries that can cause problems.
Add comments to both orterun and orted code explaining why we take a snapshot of the local environment and apply it to the local procs when they are spawned.

This commit was SVN r18842.
2008-07-08 22:36:39 +00:00
Josh Hursey
c4035d848f This commit fixes runs when there is no available CRS component (BLCR is unavailable, and SELF is deactivated). Previously the run would fail out of MPI_INIT since the OPAL CRS framework could not select a component. This is because the framework did not recognize the 'none' component as a full component because it was part of crs/base.
I promoted the ''none'' component to a full component, and updated the other components to reflect this code movement. The ''none'' component is the default component unless the user requests '''-am ft-enable-cr''' to auto-select a component. There is an MCA parameter to show a warning if the application requested an FT enabled job, but the ''none'' component was selected ({{{crs_none_select_warning}}}).

This temporarily fixes the problem mentioned in r18739. The full fix will entail working on ticket #1291.

Thanks to Ethan from Sun for finding this bug.

This commit was SVN r18840.

The following SVN revision numbers were found above:
  r18739 --> open-mpi/ompi@a003fa7a50
2008-07-08 20:04:39 +00:00
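The selection behavior described in c4035d848f can be sketched as follows. This is a minimal Python illustration, not the OPAL CRS framework code: the function `select_crs`, its arguments, and the rule of taking the first non-`none` component are assumptions made for the example. Only the `none` default, the fault-tolerance request, and the optional warning come from the commit message.

```python
def select_crs(available, ft_enabled, warn_on_none=True):
    """Pick a checkpoint/restart component name (hypothetical helper).

    available: component names that could be opened, e.g. ["none", "blcr"].
    ft_enabled: True when the user asked for a fault-tolerance-enabled job.
    Returns (selected_name, warnings).
    """
    warnings = []
    if ft_enabled:
        # Assumption: the first usable component other than "none" wins.
        real = [name for name in available if name != "none"]
        if real:
            return real[0], warnings
        if warn_on_none:
            warnings.append("FT requested, but the 'none' component was selected")
    # "none" is a full component, so selection always succeeds.
    return "none", warnings


selected, notes = select_crs(["none"], ft_enabled=True)
```

With no real component available, the call falls back to `none` and records a warning instead of failing.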
Brad Benton
9f0280bd55 arghh...I inadvertently checked this in to the 1.3 branch rather than
first to the trunk.  So, here is the trunk checkin:

The call to orte_show_help() to notify truncation of the max_inline value
was missing the want_error_header boolean, which eventually results in
a SEGV.  This change corrects the call with the bool set to true.

This commit was SVN r18839.
2008-07-08 15:28:53 +00:00
Ralph Castain
51da9f2980 Properly ensure that cmd line MCA params override environmental MCA params for procs local to mpirun.
Actually, the problem was that we were simply -adding- any enviro MCA params to whatever had been found on the cmd line. Thus, duplicate MCA param directives were winding up duplicated in the environment. Some shells took the first one in the environ array - others took the last! So we could get completely different behavior based on the whims of the shell.

This commit fixes trac:1373

This commit was SVN r18836.

The following Trac tickets were found above:
  Ticket 1373 --> https://svn.open-mpi.org/trac/ompi/ticket/1373
2008-07-08 13:48:47 +00:00
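The override rule from 51da9f2980 can be sketched as follows. This is a minimal Python illustration, not the orterun code: the function name `merge_mca_params` and the dict-based representation of the environment are assumptions. The `OMPI_MCA_` prefix is the naming convention Open MPI uses for MCA parameters in the environment.

```python
def merge_mca_params(cmdline_params, environ):
    """Return an environment in which each MCA param appears exactly once.

    cmdline_params: dict of param name -> value taken from the command line.
    environ: dict holding the inherited environment (hypothetical representation).
    """
    prefix = "OMPI_MCA_"
    merged = dict(environ)  # start from the inherited environment
    for name, value in cmdline_params.items():
        # A command-line value replaces the inherited one instead of being
        # added next to it, so no duplicate entry can reach the shell.
        merged[prefix + name] = value
    return merged


env = {"OMPI_MCA_btl": "tcp,self", "PATH": "/usr/bin"}
result = merge_mca_params({"btl": "sm,self"}, env)
print(result["OMPI_MCA_btl"])  # sm,self
```

Keying the result by variable name is what removes the shell-dependent behavior: there is only one entry per parameter, so it no longer matters whether a shell reads the first or the last match.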
Lenny Verkhovsky
23da11fdcc add some documentation for mpirun's new --loadbalance option closing #1277
This commit was SVN r18832.
2008-07-08 07:56:54 +00:00
Pavel Shamis
452141bfb8 Bugfix for #1375.
- Adding configure options that allow disabling IB/RDMA-CM support.
- Code cleanup in openib section of configure

This commit was SVN r18830.
2008-07-08 06:32:54 +00:00
Ralph Castain
613d0f8017 Missed file...just some comment changes, but important ones
This commit was SVN r18828.
2008-07-08 04:02:31 +00:00
Ralph Castain
cf353a1412 Complete the revisions per Brian's email to devel list, plus lengthy discussions between Brian, Jeff, and myself.
These are mostly long additions to comments to document what is going on and why, and how/where it may be revised in the future. Just a couple of small, but important, changes to the code itself.

This commit was SVN r18827.
2008-07-08 03:56:51 +00:00
Josh Hursey
22f4c829ba cleanup BLCR configure so --without-blcr works correctly
This commit was SVN r18825.
2008-07-08 02:48:20 +00:00
Jeff Squyres
83987fea75 Next step: Back out r17543 (fixing a bunch of ROMIO warnings). Let's
see how the next gen panasas stuff does in terms of warnings; we can
always re-merge this later if we want to.  It's just easier if we have
as little OMPI-specific code as possible (particularly when we know
that the panasas code has some big changes coming).

This commit was SVN r18823.

The following SVN revision numbers were found above:
  r17543 --> open-mpi/ompi@b4ec81a9fd
2008-07-07 23:22:26 +00:00
Jeff Squyres
09ff80ff06 Back out r16691 and r16693 because the meat of them are upstream
already, and we're just about to do a ROMIO version refresh -- so the
less OMPI-specific code we have (e.g., indenting and whatnot), the
better. 

Refs trac:1370.

This commit was SVN r18821.

The following SVN revision numbers were found above:
  r16691 --> open-mpi/ompi@8dca19cb3b
  r16693 --> open-mpi/ompi@037a533752

The following Trac tickets were found above:
  Ticket 1370 --> https://svn.open-mpi.org/trac/ompi/ticket/1370
2008-07-07 22:33:49 +00:00
Jeff Squyres
a6cfe0c574 Remove LANL-specific Panasas patches. This is step 1 in upgrading the
ROMIO in Open MPI (the new version of ROMIO will make this patch
defunct, and David Daniel has confirmed that no one at LANL is using
this functionality, anyway).

Refs trac:1370.

This commit was SVN r18819.

The following Trac tickets were found above:
  Ticket 1370 --> https://svn.open-mpi.org/trac/ompi/ticket/1370
2008-07-07 22:08:26 +00:00
Edgar Gabriel
798f47b430 Fixes ticket #1334
hierarch now disables itself if the pml module used is *not* ob1. The reason
is that the multi-level hierarchy detection algorithm checks the names of the
btl modules used. If there are no btl's, we would segfault.

Furthermore, three minor changes:
 - the 2-level hierarchy detection is now the default (sm vs. everything else
 in the world).
 - add udapl to the list of protocols checked for by the multi-level hierarch detection
 - some of the verbose statements of hierarch were inaccurate. Fixed those comments/messages.

This commit was SVN r18817.
2008-07-07 18:44:48 +00:00
Ralph Castain
2a1e0a2e64 Fix ticket #1267
With help from Brian, modify the ompi/proc/proc.c code to be more thread-safe. Remove the list operations from the ompi_proc_t constructor and destructor. Insert list appends to ompi_proc_init and ompi_proc_find_and_add as required, and protect those with thread locks. Let only the ompi_proc_finalize function actually remove objects from the ompi_proc_list.

Cleanup a few places where functions might return without unlocking a thread. Ensure the ompi_proc_world also does an OBJ_RETAIN so that the reference count on any subsequently released object is correct.

This commit was SVN r18816.
2008-07-07 17:39:49 +00:00
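The locking discipline described in 2a1e0a2e64 can be sketched as follows. This is a minimal Python illustration, not the `ompi/proc/proc.c` code: the class `ProcList` and its method names are hypothetical stand-ins. It shows appends guarded by a lock, a lock that is released on every return path, and removal confined to a single finalize step.

```python
import threading


class ProcList:
    """Hypothetical stand-in for a global process list."""

    def __init__(self):
        self._lock = threading.Lock()
        self._procs = []

    def find_and_add(self, name):
        """Return True if the entry was added, False if it already existed."""
        with self._lock:  # the lock is released on every exit path
            if name in self._procs:
                return False
            self._procs.append(name)
            return True

    def finalize(self):
        """The only place where entries are removed; returns how many."""
        with self._lock:
            removed = len(self._procs)
            self._procs.clear()
            return removed


procs = ProcList()
workers = [threading.Thread(target=procs.find_and_add, args=(i % 4,))
           for i in range(16)]
for w in workers:
    w.start()
for w in workers:
    w.join()
```

Sixteen threads race to add four distinct names. Because the membership check and the append happen under the same lock, each name ends up in the list once.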
Josh Hursey
0071ed8961 Fix broken C/R build resulting from r18804
Will patch v1.3 branch shortly.

This commit was SVN r18814.

The following SVN revision numbers were found above:
  r18804 --> open-mpi/ompi@ba5498cdc6
2008-07-07 14:55:29 +00:00
Lenny Verkhovsky
489c22b6b1 Added -rf|--rankfile <arg0> option to mpirun -h + man info
This commit was SVN r18812.
2008-07-07 13:46:22 +00:00
Lenny Verkhovsky
2fae770a86 information about carto framework in mpirun man page, ticket #1372
This commit was SVN r18809.
2008-07-07 12:02:10 +00:00
Jeff Squyres
74aa9689e4 From an initial patch from George, update all the set/get errhandler
functions to use atomics in order to be thread safe.

This commit was SVN r18807.
2008-07-03 19:28:02 +00:00
Jeff Squyres
1b3b8732ca Some updates regarding iWARP support.
This commit was SVN r18805.
2008-07-03 18:47:18 +00:00
Ralph Castain
ba5498cdc6 Repair the MPI-2 dynamic operations. This includes:
1. repair of the linear and direct routed modules

2. repair of the ompi/pubsub/orte module to correctly init routes to the ompi-server, and to handle a failure to parse the provided ompi-server URI

3. modification of orterun to accept both "file" and "FILE" for designating where the ompi-server URI is to be found - purely a convenience feature

4. resolution of a message ordering problem during the connect/accept handshake that allowed the "send-first" proc to attempt to send to the "recv-first" proc before the HNP had actually updated its routes.

Let this be a further reminder to all - message ordering is NOT guaranteed in the OOB

5. Repair the ompi/dpm/orte module to correctly init routes during connect/accept.

Reminder to all: messages sent to procs in another job family (i.e., started by a different mpirun) are ALWAYS routed through the respective HNPs. As per the comments in orte/routed, this is REQUIRED to maintain connect/accept (where only the root proc on each side is capable of init'ing the routes), allow communication between mpirun's using different routing modules, and to minimize connections on tools such as ompi-server. It is all taken care of "under the covers" by the OOB to ensure that a route back to the sender is maintained, even when the different mpirun's are using different routed modules.

6. corrections in the orte/odls to ensure proper identification of daemons participating in a dynamic launch

7. corrections in build/nidmap to support update of an existing nidmap during dynamic launch

8. corrected implementation of the update_arch function in the ESS, along with consolidation of a number of ESS operations into base functions for easier maintenance. The ability to support info from multiple jobs was added, although we don't currently do so - this will come later to support further fault recovery strategies

9. minor updates to several functions to remove unnecessary and/or no longer used variables and envar's, add some debugging output, etc.

10. addition of a new macro ORTE_PROC_IS_DAEMON that resolves to true if the provided proc is a daemon

There is still more cleanup to be done for efficiency, but this at least works.

Tested on single-node Mac, multi-node SLURM via odin. Tests included connect/accept, publish/lookup/unpublish, comm_spawn, comm_spawn_multiple, and singleton comm_spawn.

Fixes ticket #1256

This commit was SVN r18804.
2008-07-03 17:53:37 +00:00
Lenny Verkhovsky
1ed465326b Change of name conventions in carto
NODE -> EDGE
CONNECTION -> BRANCH
SLOT -> SOCKET.

This commit was SVN r18799.
2008-07-03 14:19:16 +00:00
Lenny Verkhovsky
ba1fa73881 Selecting Maffinity only if Paffinity selected fix
This commit was SVN r18797.
2008-07-03 13:39:34 +00:00
Jeff Squyres
7897db314e Add in the use of MPIR_being_debugged for DDT.
This commit was SVN r18796.
2008-07-03 12:27:35 +00:00
Shiqing Fan
5d0f4dc88d - Clean up the unreferenced variables.
- Change the arguments for launch failed function according to changeset r18611.

This commit was SVN r18795.

The following SVN revision numbers were found above:
  r18611 --> open-mpi/ompi@7bee71aa59
2008-07-03 10:11:08 +00:00
George Bosilca
07cb54995b Reactivate the daemon spin from the command line.
This commit was SVN r18794.
2008-07-02 01:46:58 +00:00
Shiqing Fan
a3e1718126 Missing one argument for calling this function.
This commit was SVN r18793.
2008-07-01 18:01:22 +00:00
Lenny Verkhovsky
c143c95ff9 Partial rankfile slots allocation fix
This commit was SVN r18787.
2008-07-01 08:54:20 +00:00
Ralph Castain
6f85e34d66 Detect homo/hetero scenarios in the nidmap, setup to take appropriate actions in the basic grpcomm module.
NOT for inclusion in v1.3

This commit was SVN r18786.
2008-07-01 02:44:57 +00:00
Jeff Squyres
160ba5fe11 Set max_inline_data for the iWARP adapters to be 64
This commit was SVN r18782.
2008-06-30 14:25:32 +00:00
Matthias Jurenz
c0ea3635b6 Improved passing of OMPI configure arguments to VT's configure (Ticket #1353)
This commit was SVN r18779.
2008-06-30 13:32:04 +00:00
Ralph Castain
bbaf000db2 Singletons need to construct their own nidmap and cannot use the std function in the base
This commit was SVN r18777.
2008-06-30 13:28:56 +00:00
Pavel Shamis
eaa7676c57 Changing default maximum inline data size
from Maximum_Supported_By_Device to 128 (our original value).

This commit was SVN r18774.
2008-06-30 07:47:09 +00:00
Jeff Squyres
8efe67e08c Improvements to the MCA param system: allow querying to find out where
an MCA parameter's value came from.  Note that the actual value of the
parameter is irrelevant.  For example, if a value was specified in an
MCA parameter file that happened to have the same default value that
was specified when the parameter was registered, the returned location
will indicate that the value was set from the file.

Possible answers:

 * '''MCA_BASE_PARAM_SOURCE_DEFAULT:''' no user-specified values were
   found, so the default value was used
 * '''MCA_BASE_PARAM_SOURCE_ENV:''' the value came from the
   environment (which also means the mpirun/orterun command line!)
 * '''MCA_BASE_PARAM_SOURCE_FILE:''' the value came from a file (or the
   Windows registry)
 * '''MCA_BASE_PARAM_SOURCE_KEYVAL:''' the value came from a keyval
   (can currently never happen)
 * '''MCA_BASE_PARAM_SOURCE_OVERRIDE:''' the value came from an MCA
   param API "set" function

This commit was SVN r18770.
2008-06-28 15:13:25 +00:00
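The source query described in 8efe67e08c can be sketched as follows. This is a minimal Python illustration, not the MCA param API: the `lookup` function, the `ParamSource` enum, and the parameter name `example_param` are hypothetical, and the precedence order (override, then environment, then file, then default) is an assumption of this sketch. The keyval source is omitted because the commit message says it can currently never happen.

```python
from enum import Enum


class ParamSource(Enum):
    DEFAULT = "default"
    ENV = "env"
    FILE = "file"
    OVERRIDE = "override"


def lookup(name, default, override=None, env=None, file=None):
    """Return (value, source) for a parameter.

    The precedence order override > env > file > default is an assumption.
    """
    for source, table in ((ParamSource.OVERRIDE, override),
                          (ParamSource.ENV, env),
                          (ParamSource.FILE, file)):
        if table is not None and name in table:
            return table[name], source
    return default, ParamSource.DEFAULT


# The file supplies the same value as the registered default.
value, source = lookup("example_param", 128, file={"example_param": 128})
```

The reported source depends on where the value was found, not on the value itself, so a file entry equal to the default is still reported as coming from the file.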
Jeff Squyres
59b7665a3a Clarify and correct comments
This commit was SVN r18768.
2008-06-28 14:15:04 +00:00
Jeff Squyres
f42f55f84b Several improvements and fixes to the openib IBCM CPC:
* Move the passive side QP move to RTS to before we send the reply
   (vs. sending it after we get the RTU).  A lengthy comment explains
   the need for this.
 * Add some timers to the code for analyzing where time is spent.
 * Clarify a few error messages.
 * Currently have a loop around ib_cm_listen() because sometimes it
   fails for seemingly no reason.  Have pending e-mails in to Sean
   Hefty to see if we can figure out why this is happening.  Note that
   the more MPI processes you add, the more likely this error is to
   occur (e.g., ran 720 processes and it happens at least 50% of the
   time).  This makes IBCM somewhat unattractive for general use;
   hopefully we can get a fix...

This commit was SVN r18766.
2008-06-28 10:53:58 +00:00
Jeff Squyres
67933e743c Move some of the things I did in r18762 to the openib BTL proper (in
endpoint.c) because it's almost identical in all the CPC's.  OOB,
XOOB, and IBCM now all invoke the btl error handler properly if
there's an error during wireup.  RDMACM still needs to be done.

This commit was SVN r18764.

The following SVN revision numbers were found above:
  r18762 --> open-mpi/ompi@3eda04578f
2008-06-27 22:48:45 +00:00
Jeff Squyres
3eda04578f Clean up a lot of error handling in IBCM CPC; properly pass error to
upper level btl (and therefore the PML) when something goes wrong
during wireup.

Refs trac:1283.

This commit was SVN r18762.

The following Trac tickets were found above:
  Ticket 1283 --> https://svn.open-mpi.org/trac/ompi/ticket/1283
2008-06-27 21:02:05 +00:00
Jeff Squyres
21c7d95109 Fixes trac:1365: if we're using ^ to negate module inclusion, then don't
bother to check to see whether they exist or not.  Specifically, this
will not cause an error:

{{{
shell$ mpirun --mca btl ^does_not_exist ...
}}}

but neither will this:

{{{
shell$ mpirun --mca btl ^sm ...
}}}

(where the sm BTL ''does'' exist)

This commit was SVN r18760.

The following Trac tickets were found above:
  Ticket 1365 --> https://svn.open-mpi.org/trac/ompi/ticket/1365
2008-06-27 19:42:08 +00:00
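The negation rule from 21c7d95109 can be sketched as follows. This is a minimal Python illustration, not the MCA component selection code: the function `select_components` is hypothetical, and raising an error for an unknown name in an inclusion list is an assumption inferred from the commit message.

```python
def select_components(spec, available):
    """Return the component names selected by a comma-separated spec.

    A leading '^' turns the spec into an exclusion list.
    """
    if spec.startswith("^"):
        excluded = set(spec[1:].split(","))
        # Excluded names are deliberately not checked against `available`.
        return [name for name in available if name not in excluded]
    requested = spec.split(",")
    missing = [name for name in requested if name not in available]
    if missing:
        # Assumption: asking to include an unknown component is an error.
        raise ValueError("unknown components: " + ", ".join(missing))
    return requested


btls = ["sm", "tcp", "self"]
print(select_components("^does_not_exist", btls))  # ['sm', 'tcp', 'self']
print(select_components("^sm", btls))              # ['tcp', 'self']
```

Both calls succeed, matching the two `mpirun` examples in the commit message: excluding a component that does not exist changes nothing, and excluding one that does exist simply drops it.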
Ralph Castain
830ea9dfe6 Reconnect the opal dss debug envar with the debug output
This commit was SVN r18759.
2008-06-27 19:29:18 +00:00
Shiqing Fan
d129578694 Small fix for including unistd.h header file.
This commit was SVN r18758.
2008-06-27 16:25:31 +00:00
Jeff Squyres
ad16d8f335 Fix some issues in the ibcm CPC:
* Properly handle non-symmetric subnet ID's
 * Be a bit more stringent when checking for the GID
 * Add lots of BTL_VERBOSE's for diagnostics

This commit was SVN r18754.
2008-06-26 20:23:56 +00:00
Ralph Castain
f4621af954 Restore the route initialization to the global server, if one is specified. This enables us to publish/lookup to a global server when using the direct routed module.
This commit was SVN r18753.
2008-06-26 18:02:45 +00:00
Jeff Squyres
90576a435b Fixes trac:1345
The issue is that the field mca_topo_base_comm_t->mtc_periods_or_edges
has a different length, depending on whether the communicator is a
graph or a cart.  One of the comm dup functions always assumed that it
was the length required by graph comms, which could lead to badness in
some cases.  This commit makes the length of that field on a comm dup
be the proper length and copies the data over appropriately.

I also changed the syntax of the ompi_comm_copy_topo() function to use
shorter pointer notation; it made the code much easier to read and
fix. 

This commit was SVN r18752.

The following Trac tickets were found above:
  Ticket 1345 --> https://svn.open-mpi.org/trac/ompi/ticket/1345
2008-06-26 16:59:31 +00:00
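The length issue fixed in 90576a435b can be sketched as follows. This is a minimal Python illustration, not `ompi_comm_copy_topo()`: the function names are hypothetical. In MPI, a Cartesian topology carries one periodicity flag per dimension, while a graph topology carries `index[nnodes-1]` edges. That the shared field holds exactly those counts is an assumption of this sketch.

```python
def periods_or_edges_length(kind, ndims=0, index=()):
    """Number of entries in the shared periods-or-edges array."""
    if kind == "cart":
        return ndims  # one periodicity flag per dimension
    if kind == "graph":
        # index is cumulative, so its last entry is the total edge count
        return index[-1] if index else 0
    raise ValueError("unknown topology kind: " + kind)


def dup_topo_array(kind, data, ndims=0, index=()):
    """Copy exactly as many entries as the topology kind requires."""
    n = periods_or_edges_length(kind, ndims, index)
    return list(data[:n])


cart_copy = dup_topo_array("cart", [1, 0], ndims=2)
graph_copy = dup_topo_array("graph", [1, 3, 0, 3, 0, 2], index=[2, 3, 4, 6])
```

Computing the length from the topology kind before copying is the point: a duplicate routine that always uses the graph length would read the wrong number of entries for a Cartesian communicator.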
Ralph Castain
158040cf3b First step: be kind to Jeff's disk space - let's abort without dumping core files all over the place
This commit was SVN r18751.
2008-06-26 16:10:03 +00:00
Ralph Castain
ecfb436e96 Add something that may only exist on the Mac...but is annoying nonetheless.
This commit was SVN r18750.
2008-06-26 15:10:47 +00:00
Ralph Castain
6af8a73dc0 Modify the checking logic to look for NULL return
This commit was SVN r18749.
2008-06-26 14:08:36 +00:00
Ralph Castain
af8c167861 May be picky, but cleanup before returning in error conditions
This commit was SVN r18748.
2008-06-26 13:31:36 +00:00
Ralph Castain
3631a60181 Update the PML selection logic to detect when a modex is required, and in those cases to only have rank=0 report its selected module. This is per the email thread on the devel list:
http://www.open-mpi.org/community/lists/devel/2008/06/4223.php

This commit was SVN r18747.
2008-06-26 13:22:48 +00:00