1
1
Граф коммитов

11944 Коммитов

Автор SHA1 Сообщение Дата
Ralph Castain
01a7259a7d This fixes ticket #1426 - mpirun is cleaning up ALL session dirs
Mpirun - and the orteds - were doing their best to whack all session dirs on their nodes just in case there was something lingering due to an abnormal termination. Unfortunately, they were -too- good at it. They were whacking all session directories under the user's name, even those from other mpiruns!

This adds another layer to the session dir tree so that we can denote which jobs come from our own job family, and restricts the cleanup operation to only session dirs from within our own job family. So we'll still cleanup anything due to our own mpirun, but won't whack any other mpirun from this user.

Call it being polite...

This commit was SVN r19083.
2008-07-29 18:58:35 +00:00
Jeff Squyres
49d9f614d0 Remove errant debugging printf
This commit was SVN r19082.
2008-07-29 18:53:40 +00:00
George Bosilca
2d8cbc6ade Allow other BTL to work even if they collide with regard to the exclusivity. The problem was
that by decreasing the btl_inuse if there was already a registered BTL we basically reset
the changes for this new BTL to register it's progress function, even if it was supposed to
handle another peer.

This commit was SVN r19080.
2008-07-29 18:38:11 +00:00
Ralph Castain
d45d728e8e Allow debuggers to attach to a running mpirun by -always- setting up the MPIR_Proctable. Only wait for MPIR_Breakpoint and hold MPI proc
s if we are launching under a debugger.

This commit was SVN r19079.
2008-07-29 17:39:16 +00:00
George Bosilca
a4d905db4a Allow xgrid to compile.
This commit was SVN r19076.
2008-07-29 13:24:08 +00:00
Ralph Castain
1210a96d82 Ensure a value gets defined before used...thanks Jeff
This commit was SVN r19075.
2008-07-29 13:08:45 +00:00
Jeff Squyres
388c047579 Ensure to actually register the proper synonym index
This commit was SVN r19074.
2008-07-29 13:05:39 +00:00
Jeff Squyres
0af7ac53f2 Fixes trac:1392, #1400
* add "register" function to mca_base_component_t
   * converted coll:basic and paffinity:linux and paffinity:solaris to
     use this function
   * we'll convert the rest over time (I'll file a ticket once all
     this is committed)
 * add 32 bytes of "reserved" space to the end of mca_base_component_t
   and mca_base_component_data_2_0_0_t to make future upgrades
   [slightly] easier
   * new mca_base_component_t size: 196 bytes
   * new mca_base_component_data_2_0_0_t size: 36 bytes
 * MCA base version bumped to v2.0
   * '''We now refuse to load components that are not MCA v2.0.x'''
 * all MCA frameworks versions bumped to v2.0
 * be a little more explicit about version numbers in the MCA base
   * add big comment in mca.h about versioning philosophy

This commit was SVN r19073.

The following Trac tickets were found above:
  Ticket 1392 --> https://svn.open-mpi.org/trac/ompi/ticket/1392
2008-07-28 22:40:57 +00:00
Jeff Squyres
a7c79558ad Remove an errant printf that was causing non-fatal errors to be
displayed (e.g., when deleting files that should have been silently
ignored if they didn't exist).

This commit was SVN r19071.
2008-07-28 22:11:31 +00:00
Jeff Squyres
c93f1d8c5e Fixes trac:1419: got a patch from upstream to fix the "OS X doesn't
have/need lseek64" issue.  This fix is also included in MPICH2 after
v1.0.7.

This commit was SVN r19070.

The following Trac tickets were found above:
  Ticket 1419 --> https://svn.open-mpi.org/trac/ompi/ticket/1419
2008-07-28 17:02:53 +00:00
Ralph Castain
96c6262abd Update LANL platform files
This commit was SVN r19069.
2008-07-28 16:40:31 +00:00
Ralph Castain
a0ae63f19e Ensure we call close_port after comm_spawn[_multiple]. Cleanout the port name in close_port
This commit was SVN r19068.
2008-07-28 16:40:11 +00:00
Ralph Castain
1a77b15523 Modify the handling of hostfiles to allow them to subdivide allocations. Utilize the "slots_alloc" field of the orte_node_t object - which had previously been unused - to track the #slots allocated to a given app_context. Let the hostfile filtering action utilize the #slots field to modify the allocated slots for each app_context.
This commit was SVN r19066.
2008-07-28 15:10:40 +00:00
Lenny Verkhovsky
f278609d77 Funny warning message about rd_win size fix
This commit was SVN r19063.
2008-07-28 14:30:57 +00:00
Ralph Castain
3107545709 Ensure that ORTE processes such as mpirun and orted never inadvertently bind themselves to cores. Change the mca param name used by the rank_file mapper to get user directives on slot lists to be different from that used by MPI procs to discover their binding. Add a cmd line option to orterun to make it easier for a user to specify the slot list (basically, hide the mca param name).
Discussed and reviewed with Lenny and Jeff.

This commit was SVN r19062.
2008-07-28 14:18:36 +00:00
Adrian Knoth
5096512c3a Cosmetics, only typos.
This commit was SVN r19061.
2008-07-28 13:33:08 +00:00
Ralph Castain
96a74c2d09 Update ignore properties
This commit was SVN r19060.
2008-07-28 12:52:24 +00:00
Jeff Squyres
477cdb0b62 Had to abondon the first approach from r19040: it caused problems with
"make distclean".  It's not clear whether it's an Automake bug or
whether what I did simply is not supported (I've got pending mail into
Ralf W. asking about it).  The short version is that during "make
distclean", ompi/mpi/f77/Makefile would rm -rf ompi/mpi/f77/.deps.
But ompi/Makefile still include's some .Plo files from that directory,
so Bad Things happened when "make distclean" unrolled from the
ompi/mpi/f77 dir back up to the ompi/ dir.

So I went with George's original suggestion and moved the f77 "base"
files in question into a new directory: ompi/mpi/f77/base and put a
Makefile.include in there.  That way, this directory is not traversed
twice by distclean, and .deps is only removed when it is supposed to
be.  Maybe we'll be able to do it a little better someday, but that's
the way it is now.

I'll check this with a fresh checkout once this is committed to SVN as
well; some of these kinds of problems don't show up until you do a
build from a completely fresh SVN checkout.

This commit was SVN r19054.

The following SVN revision numbers were found above:
  r19040 --> open-mpi/ompi@9f4d4c4312
2008-07-26 20:38:30 +00:00
Jeff Squyres
505ffc6719 This file was never used
This commit was SVN r19053.
2008-07-26 19:27:18 +00:00
Tim Mattox
73b528b050 Include the base version (trunk, v1.2, etc.) in the body of a
create failure e-mail as well as on the subject line.

This commit was SVN r19052.
2008-07-26 14:13:20 +00:00
Jeff Squyres
7a4359b43f Since we no longer generate a Makefile or Makefile.in in this
directory, no longer ignore those files.

This commit was SVN r19051.
2008-07-26 14:05:57 +00:00
Tim Mattox
8753064190 Have create failure e-mails say what base version failed.
This commit was SVN r19050.
2008-07-26 13:31:06 +00:00
Jeff Squyres
78b8bac900 First part of "make distcheck" fix, but at least one more commit will
be required for a total fix.  Still looking into the problem...

This commit was SVN r19049.
2008-07-26 13:25:11 +00:00
Jeff Squyres
7c4d46a8d9 Grrr: I ''did'' remove these files on the initial commit of the new
SVN version (r19045), but I also edited the svn:ignore to ignore these
files in the same SVN commit -- I suspect that SVN got confused and
did not actually delete them.

This commit was SVN r19048.

The following SVN revision numbers were found above:
  r19045 --> open-mpi/ompi@63b63d48c3
2008-07-26 13:00:24 +00:00
Jeff Squyres
6dac4706ea Somehow this file got missed in the SVN import.
This commit was SVN r19047.
2008-07-26 12:54:15 +00:00
Jeff Squyres
4d034383d9 Apply patch from Ralf W. to remove a non-portable use of ==.
This commit was SVN r19046.
2008-07-26 12:36:24 +00:00
Jeff Squyres
63b63d48c3 Fixes trac:1370, #1147
Update the version of ROMIO to that which was contained in
MPICH2-1.0.7, plus a few patches from the upstream ROMIO maintainers
(because OMPI uses a few code paths in ROMIO that MPICH2 does not;
there were a few compile bugs in the ROMIO from MPICH2-1.0.7).

Added an info MCA param to be able to tell which version of ROMIO is
contained in OMPI: io_romio_version.

Many, many thanks to romio-maint@mcs.anl.gov for all their help in
integrating this new version of ROMIO into Open MPI.

This commit was SVN r19045.

The following Trac tickets were found above:
  Ticket 1370 --> https://svn.open-mpi.org/trac/ompi/ticket/1370
2008-07-26 12:23:30 +00:00
Ralph Castain
0735d6f1c2 This commit fixes ticket #1414
Cleanup the logic in the odls for when processes terminate. It turns out that we were only going through the kill_proc logic once instead of looping over all local children when we ordered a daemon to kill its local procs. This went unnoticed for some time as for most systems the local procs were terminated anyway when the daemon terminated due to the parent/child relationship.

Solaris is apparently different - the children are not automatically terminated when the parent dies. As a result, it acts as a detector for this bug.

Mucho thanks to Rolf V. for his help in debugging - and to IM for letting me follow his gdb progress in quasi real-time!

This commit was SVN r19044.
2008-07-26 02:54:43 +00:00
Jeff Squyres
92c10cd187 Remove some old kruft from Makefile.am's -- likely the result of
copying some old Makefile.am a long time ago.

This commit was SVN r19043.
2008-07-26 00:27:42 +00:00
Jeff Squyres
773f79f495 Follow-on to r19040; missed one file on the checkin.
This commit was SVN r19041.

The following SVN revision numbers were found above:
  r19040 --> open-mpi/ompi@9f4d4c4312
2008-07-25 21:35:29 +00:00
Jeff Squyres
9f4d4c4312 Fixes trac:1409: ensure that the C++, F77, and F90 bindings libraries
are properly linked against libmpi.la.

This required a little creative AM usage, inspired by discussion on
OMPI devel list:

 * Make a new ompi/mpi/f77/Makefile_f77base.include; effectively move
   the building of the f77 "base" glue stuff (libmpi_f77base.la) into
   this Makefile and away from ompi/mpi/f77/Makefile.am.  The sources
   in question require some specific CPPFLAGS, so we couldn't just add
   the raw sources into libmpi_la_SOURCES, unfortunately.
 * Include this new Makefile in the top-level ompi/Makefile.am
 * The libmpi_f77base.la LT convenience library was already sucked
   into libmpi.la; breaking it out into its own Makefile allows us
   to build it earlier and therefore complete buidling libmpi.la
   earlier.
 * Side effect: the ompi/mpi/Makefile.am is now mostly unnecessary; it
   no longer specifies a SUBDIRS for each of the bindings directories
   to traverse into (since they are now in the top-level SUBDIRS).  As
   such, the man pages are now also now included in the top-level
   ompi/Makefile.am.

The end of the result is that libmpi.la -- including a few sources
from mpi/f77 -- is fully built before the C++, F77, and F90 bindings
are built.  Therefore, the C++, F77, and F90 bindings libraries can
all link against libmpi.la.

This commit was SVN r19040.

The following Trac tickets were found above:
  Ticket 1409 --> https://svn.open-mpi.org/trac/ompi/ticket/1409
2008-07-25 21:18:05 +00:00
Ralph Castain
d5a916d350 Fix a problem reported by IBM: nolocal and bynode combined to map byslot. Problem actually was that any time multiple mapping policy directives were provided, we would only map byslot due to incorrect if statement conditions.
Thanks to Kris Davis for his patience while we tracked this down!

This commit was SVN r19039.
2008-07-25 17:50:46 +00:00
Ralph Castain
718cceddaa Ensure that we only launch procs on the HNP if that node is actually included in the allocation.
This commit was SVN r19038.
2008-07-25 17:13:22 +00:00
Ralph Castain
cb93775cca Just for the AR - remove unnecessary typecast
This commit was SVN r19034.
2008-07-25 15:30:37 +00:00
Thomas Herault
28dc80b67e Deal with the SIGCHLD issue in LSF.
lsb_launch tampers with SIGCHLD signal handler. We are forced to reinstall our own signal handler after a call to this function.

This commit fixes trac:1356.

This commit was SVN r19033.

The following Trac tickets were found above:
  Ticket 1356 --> https://svn.open-mpi.org/trac/ompi/ticket/1356
2008-07-25 15:23:23 +00:00
Ralph Castain
7e6e104fc3 Add more debugging to the RML when it fails to find a route - specifically, have it print a stacktrace so we can figure out where it came from.
This commit was SVN r19032.
2008-07-25 15:01:41 +00:00
Ralph Castain
42c134cb32 Silence stupid compiler warning - and a certain someone who keeps reminding me of it... :-)
This commit was SVN r19031.
2008-07-25 14:01:06 +00:00
Ralph Castain
a1d296ae03 This commit fixes ticket #1410
Fix a few bugs in the mappers:

1. Ensure that bynode with no -np fills all available slots - it just does so with the ranks set bynode instead of byslot

2. fix --nolocal behavior so it works correctly in all cases. We still have to test the host's name using opal_ifislocal in the mapper because the name returned by gethostname to orte_process_info.hostname can be an FQDN, but a hostfile may contain a non-FQDN version.

3. Add missing --nolocal logic to the seq mapper

Oversubscribed mapping seemed to be working okay without repair, so I couldn't verify my own bug report in that regard.

Also included are some preliminary changes to support the modified hostfile behavior, which will be committed shortly:

1. removed the totally useless "allocate" field in the orte_node_t object since every node is automatically allocated for use - and everything ignored the field anyway

2. correctly initialize the slots_alloc field when the allocation is read

This commit was SVN r19030.
2008-07-25 13:35:12 +00:00
Jeff Squyres
31df89ccb2 Add bullet about mpi_leave_pinned and the openib BTL
This commit was SVN r19029.
2008-07-25 11:29:37 +00:00
Jeff Squyres
4adc4a632a This option is neither documented nor implemented.
This commit was SVN r19027.
2008-07-24 23:37:16 +00:00
Jeff Squyres
e3e79c0881 Fixes trac:1379:
* Use synonym/deprecated MCA param API for some mca base params
 * In openib BTL, if we have appropriate memory hooks support, and if
   mpi_leave_pinned and mpi_leave_pinned_pipeline were not set by the
   user, set mpi_leave_pinned to 1.
 * Defer checking mpi_leave_pinned_* until as late as possible (i.e.,
   until after the btl's have had a chance to set mpi_leave_pinned to
   1):
   * in ob1 pml
   * in rdma mpool

This commit was SVN r19022.

The following Trac tickets were found above:
  Ticket 1379 --> https://svn.open-mpi.org/trac/ompi/ticket/1379
2008-07-24 22:51:26 +00:00
Josh Hursey
ca43968418 Fix a dealock scenario when registering depricated MCA parameters. The internal loop uses the 'item' variable that is used by the outer loop as well. So when the outer loop checks the value of 'item' it will never equal the end of the list since it no longer references the same list.
Kinda found by MTT. MTT calls 'ompi_info --all --parsable' and it was livelocked and had to be killed by hand.

I'm going to push this one to Jeff to push to v1.3 since he did the original implementation and should check this code.

This commit was SVN r19014.
2008-07-24 15:51:54 +00:00
George Bosilca
6c21851160 Only register the BTL progress function if there is a need for it. This require
a little bit more than "BTL was able to add some procs". The real condition to
allow the BTL progress is that we will use it to send/recv data to/from some
of the peers (this include the BTL exclusivity in the process).

This commit was SVN r19010.
2008-07-24 10:33:17 +00:00
Ralph Castain
fdb2408bf2 Rename the osx paffinity component the "posix" component since it really has nothing osx specific in it - it is just a generic posix call to determine #processors. Set the priority low so that both linux and solaris components override it if they build. It shouldn't build in Windows at all.
Modify the odls to remove a (size_t) typecast in front of the num_processors variable just in case it is returned negative. This usually is accompanied by an opal_error, so this shouldn't make any difference - but it is more technically correct.

This commit was SVN r19008.
2008-07-24 01:54:51 +00:00
Ralph Castain
d880d6282a Update LANL platform files
This commit was SVN r19005.
2008-07-23 18:57:03 +00:00
Lenny Verkhovsky
b4d54dda57 Fixed possible seqf when using RANKFILE, but not all ranks assigned
Fixed allocation of all ranks when using RANKFILE, but not all ranks assigned
Aborting if using RANKFILE, but np wasn't specified a little earlier
Clean mca_rmaps_rank_file_component.debug

This commit was SVN r19004.
2008-07-23 17:44:02 +00:00
Shiqing Fan
0646cd2491 - Move wait object instance code out of the #ifdef block, so that systems with waitpid and Windows can both use it. Thanks to Ralph.
This commit was SVN r19003.
2008-07-23 16:20:42 +00:00
Jeff Squyres
1fd5b0402a Refs trac:1250
* Fix linux paffinity component to make a "best" guess when PLPA
   can't find topology information in the Linux kernel.  That is, if
   PLPA can't tell us the max_processor_id, just assume that it's the
   same as the number of processors.  If you have a more complex
   system than that (e.g., you have holes in your available processor
   IDs), you'll likely be running a Linux kernel that supports the
   topology information, and this problem won't happen.
 * Make sure to conver the return codes from PLPA to OPAL_ERR* codes.

This commit was SVN r19001.

The following Trac tickets were found above:
  Ticket 1250 --> https://svn.open-mpi.org/trac/ompi/ticket/1250
2008-07-23 15:47:43 +00:00
Ralph Castain
e3c3d28bf1 Add some more debugging to tell us how many processors were found when setting sched_yield
This commit was SVN r18999.
2008-07-23 15:28:51 +00:00
Thomas Herault
b6affd35e9 Small typos for LSF compilation and update Makefile.am
This commit was SVN r18998.
2008-07-23 14:42:26 +00:00