Ralph Castain
55cd65b149
Don't warn about binding (process and/or memory) if the node cannot do it or if we would overload, but it wasn't specifically requested by the user (i.e., it is the result of the default policy). Instead, just don't bind and quietly move along.
...
Reset topology usage for each node as we bind as multiple nodes may be linked to the same topology object. This will need to be revisited for scale as it does take some non-zero time to reset the usage each iteration. However, storing individual topology objects for every node consumes memory, so it's a tradeoff.
cmr=v1.7.4:reviewer=jsquyres:subject=Eliminate excessive binding/memory warnings
This commit was SVN r29978.
2013-12-19 16:31:45 +00:00
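A minimal sketch of the decision described in the entry above: warn only when the binding policy was explicitly requested by the user, and otherwise skip binding silently. The type, function, and parameter names are hypothetical and do not come from the ORTE source; what happens in the user-requested case (here, a message on stderr) is also an assumption.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical outcome codes; not ORTE's actual return values. */
typedef enum { BIND_DONE, BIND_SKIPPED_QUIETLY, BIND_REPORTED } bind_result_t;

/*
 * Decide what to do for one node.
 *   can_bind       - the node supports process and/or memory binding
 *   would_overload - binding would overload the target object
 *   user_requested - the policy came from the user, not from the default
 */
bind_result_t decide_binding(bool can_bind, bool would_overload,
                             bool user_requested)
{
    if (can_bind && !would_overload) {
        return BIND_DONE;
    }
    if (user_requested) {
        /* Assumption: an explicit request that cannot be met is reported. */
        fprintf(stderr, "binding requested but %s\n",
                can_bind ? "it would overload the node"
                         : "the node cannot bind");
        return BIND_REPORTED;
    }
    /* Default policy only: do not bind and do not warn. */
    return BIND_SKIPPED_QUIETLY;
}
```

With these names, decide_binding(false, false, false) yields BIND_SKIPPED_QUIETLY, which is the "quietly move along" case from the commit message.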
Ralph Castain
9b32dacb6c
Ensure we don't abort if a tool cannot send a message - the orte/util/comm library used by tools to query mpirun knows how to handle this situation.
...
Refs trac:3992
This commit was SVN r29975.
The following Trac tickets were found above:
Ticket 3992 --> https://svn.open-mpi.org/trac/ompi/ticket/3992
2013-12-19 07:10:36 +00:00
Ralph Castain
6239e64f36
Further cleanup of orte-ps so it doesn't abort when hitting a stale HNP - only report that event once and just keep working.
...
Refs trac:3992
This commit was SVN r29974.
The following Trac tickets were found above:
Ticket 3992 --> https://svn.open-mpi.org/trac/ompi/ticket/3992
2013-12-19 03:28:05 +00:00
Ralph Castain
bf5e314f76
Tools require their own errmgr and state components so they can handle any errors that occur in, for example, communication.
...
Refs trac:3992
This commit was SVN r29972.
The following Trac tickets were found above:
Ticket 3992 --> https://svn.open-mpi.org/trac/ompi/ticket/3992
2013-12-19 01:49:33 +00:00
Ralph Castain
3aaca16faa
Silence warnings that are no longer valid
...
Refs trac:3992
This commit was SVN r29970.
The following Trac tickets were found above:
Ticket 3992 --> https://svn.open-mpi.org/trac/ompi/ticket/3992
2013-12-19 00:40:36 +00:00
Ralph Castain
c5956e7b8c
Convert debug output to opal_output_verbose
...
Thanks to Tetsuya Mishima for reporting it
cmr=v1.7.4:reviewer=jsquyres
This commit was SVN r29969.
2013-12-19 00:36:15 +00:00
Ralph Castain
39957df08e
Fixes trac:3963. Fix the tool ess procedure so it opens and selects the OOB framework, and have the OOB TCP module update the route to new connections (the routed modules know what to do).
...
Thanks to Dave Love and Ashley Pittman for pointing out the problem.
cmr=v1.7.4:reviewer=jsquyres:subject=Fix tool communications with mpirun
This commit was SVN r29959.
The following Trac tickets were found above:
Ticket 3963 --> https://svn.open-mpi.org/trac/ompi/ticket/3963
2013-12-18 23:13:46 +00:00
Ralph Castain
77553f72be
Per this email thread:
...
http://www.open-mpi.org/community/lists/devel/2013/12/13412.php
fix the backtrace function to avoid async issues. Thanks to Takahiro Kawashima for the patch
This commit was SVN r29955.
2013-12-18 17:57:37 +00:00
Ralph Castain
ab4636c47b
Per email on devel list, change the default rank-by to slot unless map-by <obj> is specified, in which case use rank-by <obj>
...
Refs trac:3977
This commit was SVN r29945.
The following Trac tickets were found above:
Ticket 3977 --> https://svn.open-mpi.org/trac/ompi/ticket/3977
2013-12-18 00:48:50 +00:00
Ralph Castain
53cd00fe16
By setting a default mapping/ranking/binding policy that wasn't "none", we introduced a problem for users of the Mac and any other machine where sockets aren't defined and/or binding is not supported. Fix that by checking to see if the user specified the failing policy - if not, then fall back to the old map/rank by slot and no binding.
...
Refs trac:3977
This commit was SVN r29933.
The following Trac tickets were found above:
Ticket 3977 --> https://svn.open-mpi.org/trac/ompi/ticket/3977
2013-12-17 14:50:10 +00:00
Adrian Reber
b42aad44a3
Trying to get the C/R code to compile again. This patch
...
includes various fixes all over the C/R code which are
hard to group like the other patches.
Changes from V1:
* explain why mca_base_component_distill_checkpoint_ready no longer works
* compare return result of opal functions with OPAL_* values
Changes from V2:
* use orte_rml_oob_ft_event() instead of referencing through the modules
* properly protect variable (thanks to --enable-picky)
This commit was SVN r29922.
2013-12-16 15:35:28 +00:00
Ralph Castain
8b6d117541
Per the OMPI devel conference that changed our default behaviors:
...
* default to bind-to core
* map-by slot if np=2
* map-by socket (balance across sockets on each node) if np > 2
* map-by <obj> will imply rank-by <obj> by default (leave default binding as above)
Fix a bug in the map-by <obj> mapper where we incorrectly compute the #procs to assign if the #slots > #procs
cmr=v1.7.4:reviewer=jsquyres:subject=Update default binding and mapping values
This commit was SVN r29919.
2013-12-15 17:25:54 +00:00
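The default rules listed in the entry above can be sketched roughly as follows. This is an illustration only, not the rmaps code: the enum and function names are hypothetical, and treating np of 1 the same as np of 2 is an assumption, since the message only mentions np=2 and np > 2.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical policy tags for this sketch. */
typedef enum { MAP_BY_SLOT, MAP_BY_SOCKET } map_policy_t;

/* Default mapping: by slot for small jobs, by socket otherwise. */
map_policy_t default_map_policy(int np)
{
    assert(np > 0);
    return (np <= 2) ? MAP_BY_SLOT : MAP_BY_SOCKET;
}

/*
 * Default ranking: if the user gave "map-by <obj>", rank by the same
 * object; otherwise rank by slot. user_map_by is NULL when the user
 * did not specify a mapping object.
 */
const char *default_rank_policy(const char *user_map_by)
{
    return (user_map_by != NULL) ? user_map_by : "slot";
}
```

For example, default_map_policy(4) gives MAP_BY_SOCKET, and default_rank_policy("numa") returns "numa". The default binding (bind-to core) is unaffected by either choice.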
Jeff Squyres
770bf77149
Fix some minor memory leaks in error code paths.
...
Many thanks to Tom Fogal for the patch.
cmr=v1.7.4:reviewer=rhc:subject=Fix minor memory leaks in error code paths
This commit was SVN r29905.
2013-12-14 00:41:21 +00:00
Jeff Squyres
0ab48ad0d2
Fix some annoying flex warnings that have been there for years.
...
Many thanks to Tom Fogal for the initial patch.
cmr=v1.7.4:reviewer=rhc:subject=Fix annoying flex warnings
This commit was SVN r29904.
2013-12-14 00:36:12 +00:00
Jeff Squyres
2e7653e4c2
Add missing argv.h includes.
...
Noticed these as part of #3694: external libevents don't cause argv.h
to automatically get included.
Refs trac:3694
This commit was SVN r29897.
The following Trac tickets were found above:
Ticket 3694 --> https://svn.open-mpi.org/trac/ompi/ticket/3694
2013-12-13 21:17:36 +00:00
Brian Barrett
121ca26c59
Per discussion at Developer's Meeting, remove Solaris threads support. Solaris
...
will just fall back to pthreads, which should be no problem.
This commit was SVN r29893.
2013-12-13 20:07:11 +00:00
Ralph Castain
0e81959aae
Cleanup mindist error messages - already patched in 1.7
...
This commit was SVN r29869.
2013-12-12 15:30:29 +00:00
Ralph Castain
1ff12362da
Cleanup merge conflict that was incorrectly committed
...
This commit was SVN r29851.
2013-12-09 20:20:14 +00:00
Ralph Castain
83e59e6761
Once again, the Slurm folks have decided to redefine their envars, reversing what they had previously told us to do. So cleanup the Slurm allocation code, and also adjust to a change in srun behavior that now aborts a job if the ntasks-per-node doesn't get specified when ORTE calls it, but the user specified it when getting an allocation. Sigh.
...
cmr=v1.7.4:reviewer=miked:subject=Update Slurm allocation and launch
This commit was SVN r29849.
2013-12-09 17:58:46 +00:00
Mike Dubman
c208b858e7
improve error messages in mindist
...
cmr=v1.7.4:reviewer=ompi-rm1.7
This commit was SVN r29846.
2013-12-09 06:34:38 +00:00
Ralph Castain
f2c49c6c19
Fix the map-by object mapper to handle cpus-per-proc by accounting for the request when computing the number of procs to put on each object. This ensures that the binding routine doesn't automatically overload the cores.
...
cmr=v1.7.4:reviewer=jsquyres
This commit was SVN r29843.
2013-12-08 16:59:25 +00:00
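A rough sketch of the accounting described in the entry above: when each process needs several cpus, the number of processes placed on an object is the object's core count divided by cpus-per-proc, so binding never has to overload a core. The function name and the clamping of values below 1 are assumptions of this sketch, not the mapper's actual code.

```c
#include <assert.h>

/*
 * Number of processes that fit on one object (e.g., a socket) when each
 * process is given cpus_per_proc cores. Integer division rounds down, so
 * leftover cores are left idle rather than shared.
 */
int procs_per_object(int cores_in_object, int cpus_per_proc)
{
    assert(cores_in_object >= 0);
    if (cpus_per_proc < 1) {
        cpus_per_proc = 1;   /* assumption: unset means one cpu per proc */
    }
    return cores_in_object / cpus_per_proc;
}
```

With 8 cores per socket and cpus-per-proc of 3, this places 2 processes on the socket and leaves 2 cores unused.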
Ralph Castain
9604f36c3b
Specify units for the job completion timeout
...
This commit was SVN r29839.
2013-12-08 04:51:58 +00:00
Ralph Castain
62c9e5c64c
Really is better if we output a message indicating that the job was aborted due to hitting the execution time limit
...
Refs trac:3960
This commit was SVN r29833.
The following Trac tickets were found above:
Ticket 3960 --> https://svn.open-mpi.org/trac/ompi/ticket/3960
2013-12-07 15:33:56 +00:00
Ralph Castain
d44e4a311f
Per request from Dave Goodell, add support for MPIEXEC_TIMEOUT - if set in the environment, terminate the job after the specified number of seconds has passed. Equivalent to MPICH functionality.
...
cmr=v1.7.4:reviewer=dgoodell:subject=add support for MPIEXEC_TIMEOUT
This commit was SVN r29831.
2013-12-07 01:58:32 +00:00
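A minimal sketch of reading MPIEXEC_TIMEOUT as described in the entry above. Only the environment variable name comes from the commit message; the function names and the return conventions (0 for "no timeout", -1 for an unusable value) are assumptions, and the actual job termination is not shown.

```c
#include <errno.h>
#include <stdlib.h>

/*
 * Parse a timeout given in seconds.
 * Returns the number of seconds, 0 if no value was given,
 * or -1 if the value is not a positive integer.
 */
long parse_timeout_seconds(const char *val)
{
    char *end = NULL;
    long secs;

    if (val == NULL || *val == '\0') {
        return 0;                       /* unset: no timeout */
    }
    errno = 0;
    secs = strtol(val, &end, 10);
    if (errno != 0 || *end != '\0' || secs <= 0) {
        return -1;                      /* garbage, trailing text, or <= 0 */
    }
    return secs;
}

/* Look the value up in the environment, as mpirun would at startup. */
long job_timeout_from_env(void)
{
    return parse_timeout_seconds(getenv("MPIEXEC_TIMEOUT"));
}
```

A caller could arm a timer only when job_timeout_from_env() returns a positive value, for instance after running with MPIEXEC_TIMEOUT=30 set in the environment.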
Jeff Squyres
ed9aba3896
This patch fixes
...
error: void value not ignored as it ought to be
in the C/R code by ignoring the return value of functions which
no longer return a value (only void).
Signed-off-by: Adrian Reber <adrian.reber@hs-esslingen.de>
This commit was SVN r29816.
2013-12-06 14:40:10 +00:00
Ralph Castain
fb59b6b875
Silence compiler warning when --disable-orte-static-ports
...
This commit was SVN r29783.
2013-12-03 01:53:31 +00:00
Ralph Castain
617a0edbb8
Fix hostfile parsing for the case where RMs count slots by listing the node multiple times. Thanks to Tetsuya Mishima for reporting the problem and providing a patch.
...
cmf=v1.7.4:reviewer=rhc
This commit was SVN r29748.
2013-11-24 16:17:52 +00:00
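To illustrate the case the entry above refers to: some resource managers express "N slots on a node" by listing the node name N times. A sketch of counting slots that way is shown below; the function name and the in-memory list of host names are hypothetical and stand in for the real hostfile parser.

```c
#include <stddef.h>
#include <string.h>

/*
 * Count how many times a node name appears in a list of hostfile entries.
 * Each occurrence is treated as one slot on that node.
 */
int count_slots(const char *const entries[], size_t n, const char *node)
{
    int slots = 0;

    for (size_t i = 0; i < n; i++) {
        if (strcmp(entries[i], node) == 0) {
            slots++;
        }
    }
    return slots;
}
```

Given the entries "node0", "node0", "node1", the node node0 ends up with 2 slots and node1 with 1.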
Ralph Castain
7c23a5ad65
Fix headers when building with ft enabled. Thanks to Adrian Reber for the patch!
...
This commit was SVN r29743.
2013-11-23 22:58:32 +00:00
Ralph Castain
7480beb7f0
Per request from Nathan, add an offset value to the job struct so we can construct a "global rank" that spans multiple jobs during dynamic launch operations. Store a new ORTE_DB_GLOBAL_RANK value for each process in the database, and ensure that we share our own value during connect_accept so both sides can see it.
...
This isn't being used yet - just enabling Nathan to do what he needs.
***** NOTE: any use of the OMPI_DB_GLOBAL_RANK database key must be protected by #ifdef OMPI_DB_GLOBAL_RANK as not all RTE's will define this key. *****
This commit was SVN r29708.
2013-11-14 17:01:43 +00:00
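The guard required by the note above might look like the following sketch. Only the macro name OMPI_DB_GLOBAL_RANK is taken from the commit message; the function, its parameters, and the idea that the global rank is the job offset plus the local rank are assumptions made for illustration. Real code would fetch the value from the database under the same guard.

```c
/*
 * Return a rank that is unique across jobs when the RTE provides the
 * global-rank key, and fall back to the per-job rank otherwise.
 */
int global_rank_or_local(int local_rank, int job_offset)
{
#ifdef OMPI_DB_GLOBAL_RANK
    /* Assumption: global rank = job offset + rank within the job. */
    return job_offset + local_rank;
#else
    /* This RTE does not define the key: use the per-job rank. */
    (void)job_offset;
    return local_rank;
#endif
}
```

Because the key is tested with #ifdef, the same source builds against RTEs that define it and those that do not.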
Ralph Castain
0f420f3676
Add a little debug
...
cmr:v1.7.4:reviewer=ompi-gk1.7
This commit was SVN r29705.
2013-11-14 04:22:59 +00:00
Ralph Castain
561c1830f7
Cleanup the radix MCA params - the max connections has no relationship to the max fd's on a node. It solely is used to improve the performance of small jobs by avoiding unnecessary inter-daemon routing.
...
Refs trac:3917
This commit was SVN r29704.
The following Trac tickets were found above:
Ticket 3917 --> https://svn.open-mpi.org/trac/ompi/ticket/3917
2013-11-14 04:19:06 +00:00
Ralph Castain
540d38bc12
Per patch from Jeff, treat the case where someone transfers an archived file of an unknown type. This can't actually happen as this is on the receive end, and the error would have been treated and rejected on the send side. Still, practice defensive programming.
...
cmr:v1.7.4:reviewer=rhc:subject=Practice defensive programming
This commit was SVN r29699.
2013-11-13 23:49:26 +00:00
Jeff Squyres
f4e647538c
mca_routed_radix_component.max_connections is unsigned; it can never be <0
...
cmr=v1.7.4:reviewer=hjelmn
This commit was SVN r29694.
2013-11-13 19:37:24 +00:00
Jeff Squyres
038116e4b8
orte_iof_base.output_limit is an unsigned type; just initialize it to INT_MAX
...
cmr=v1.7.4:reviewer=rhc
This commit was SVN r29693.
2013-11-13 19:36:43 +00:00
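The two entries above both deal with unsigned variables. A short sketch of why a "< 0" test on such a variable is dead code, and of initializing an unsigned limit to INT_MAX, follows. The function names are hypothetical; the variables in the commits are mca_routed_radix_component.max_connections and orte_iof_base.output_limit.

```c
#include <limits.h>

/*
 * Converting a negative int to unsigned wraps modulo UINT_MAX + 1, so the
 * stored value is large and positive. A later "value < 0" test can never
 * be true (gcc reports this under -Wtype-limits).
 */
unsigned int store_as_unsigned(int value)
{
    return (unsigned int)value;
}

/* Initialize an unsigned limit to INT_MAX rather than a negative sentinel. */
unsigned int default_output_limit(void)
{
    return (unsigned int)INT_MAX;
}
```

Here store_as_unsigned(-1) equals UINT_MAX, which is why a negative "unlimited" marker does not survive in an unsigned field.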
Mike Dubman
840e2cb4a2
mindist: cosmetic, use fallback to byslot if unable to read NUMA info, small fix.
...
fixed by Elena, reviewed by Ralph/Mike
cmr=v1.7.4:reviewer=ompi-gk1.7
This commit was SVN r29679.
2013-11-13 09:26:40 +00:00
Ralph Castain
5b38259264
Ouch - remove an extraneous line.
...
Thanks to Tetsuya Mishima for reporting it
cmr=v1.7.4:reviewer=rhc:subject=Remove extraneous line from OOB
This commit was SVN r29677.
2013-11-13 04:02:05 +00:00
Ralph Castain
f1e510154c
Revise the launch timeout detection so we don't mistakenly declare "failed to start". Recognize that timeout is at the per-job level, and define the timeout param as a total value instead of seconds/daemon as it otherwise can get to be an enormous (and useless) number.
...
Resolves problems in loop_spawn where the timer was incorrectly firing and killing the overall job.
cmr=v1.7.4:reviewer=hjelmn
This commit was SVN r29661.
2013-11-11 23:50:40 +00:00
Ralph Castain
46f633883b
Correct the error check on rml.send
...
cmr=v1.7.4:reviewer=jsquyres
This commit was SVN r29660.
2013-11-11 23:23:12 +00:00
Ralph Castain
e35ad23176
Correctly compute usage for dynamic spawns when binding is invoked. Ensure we correctly account for existing process usage on each node when computing bindings during dynamic spawns.
...
cmr=v1.7.4:reviewer=hjelmn:subject=Correctly compute usage for dynamic spawns when binding is invoked
This commit was SVN r29649.
2013-11-10 00:38:01 +00:00
Joshua Ladd
d594ffbfc7
Backing out Elena's patch - abstraction violation
...
This commit was SVN r29645.
2013-11-08 13:12:07 +00:00
Joshua Ladd
da3e272fdd
Adds a check in the mindist mapper for whether or not the user asks for a specific device. This patch was submitted by Elena Elkina and reviewed by Josh Ladd and should be added to
...
cmr=v1.7.4:reviewer=jladd
This commit was SVN r29644.
2013-11-08 04:28:53 +00:00
Brian Barrett
6d7a1fbb82
Move opal_portable_platform.h to opal/include/opal, which is where it really
...
should have been all along and fix one place that uses the file
Update opal_portable_platform.h with changes to mpi_portable_platform.h made
in r29608.
Make mpi_portable_platform.h a symlink to opal_portable_platform.h, so that
they won't get out of sync. I'd like to remove mpi_portable_platform.h, but
we don't automatically add -I${includedir}/openmpi/ to make that sane from
a header include point of view, so that's future work.
This commit was SVN r29618.
The following SVN revision numbers were found above:
r29608 --> open-mpi/ompi@b71bd51cdd
2013-11-06 17:12:26 +00:00
Ralph Castain
fb0940a9d9
Add a couple of useful tests
...
This commit was SVN r29539.
2013-10-28 13:24:16 +00:00
Ralph Castain
8c5c7d0db4
Correct a bug in handling of oob_tcp_if_include/exclude addresses by using the kernel index instead of the raw index of the interface.
...
Refs trac:3696
This commit was SVN r29522.
The following Trac tickets were found above:
Ticket 3696 --> https://svn.open-mpi.org/trac/ompi/ticket/3696
2013-10-26 00:47:14 +00:00
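The distinction in the entry above, between an interface's kernel index and its position in a local list, can be sketched as follows. The struct and function are hypothetical and are not the OOB TCP code; they only show that the two numbers need not match.

```c
#include <stddef.h>

/* Hypothetical record for one network interface. */
typedef struct {
    const char *name;
    int kernel_index;   /* index assigned by the kernel */
} iface_t;

/*
 * Find an interface by its kernel index.
 * Returns the position in the list (the "raw" index), or -1 if absent.
 */
int find_by_kernel_index(const iface_t *list, size_t n, int kernel_index)
{
    for (size_t i = 0; i < n; i++) {
        if (list[i].kernel_index == kernel_index) {
            return (int)i;
        }
    }
    return -1;
}
```

With a list holding "lo" (kernel index 1) and "eth0" (kernel index 3), the lookup for kernel index 3 returns position 1, while using 3 directly as a list position would read past the end of the list.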
Ralph Castain
604970a1a2
Initialize orte_coprocessors hash table to NULL. Delay coprocessor detection on HNP until after node topology final definition in case rmaps changes it. Minor spacing change.
...
Refs trac:3847
This commit was SVN r29504.
The following Trac tickets were found above:
Ticket 3847 --> https://svn.open-mpi.org/trac/ompi/ticket/3847
2013-10-24 00:08:47 +00:00
Ralph Castain
f5920e9312
Revert r29489. This function only executes in the HNP. In orte/mca/ess/hnp/ess_hnp_module.c, we already check for local coprocessors and add them to the hash table if found. Thus, r29489 simply overwrote what was already present.
...
The data for each remote daemon is added later in the daemon callback function. Only the HNP retains info in the hash table.
If it is desirable to have each daemon retain its own coprocessor info, then this must be done in orte/mca/ess/base/ess_base_std_orted.c.
This commit was SVN r29497.
The following SVN revision numbers were found above:
r29489 --> open-mpi/ompi@2e2794fa15
2013-10-23 22:35:24 +00:00
Nathan Hjelm
2e2794fa15
Fix coprocessor detection by always adding the local daemon's co-processors
...
to the hash table.
Tested and working on a system with 2 Xeon Phi co-processors.
cmr=v1.7.4:ticket=3847:reviewer=ompi-rm1.7
This commit was SVN r29489.
The following Trac tickets were found above:
Ticket 3847 --> https://svn.open-mpi.org/trac/ompi/ticket/3847
2013-10-23 15:56:23 +00:00
Ralph Castain
7c86a843c8
Silence compiler warning
...
This commit was SVN r29477.
2013-10-23 04:13:36 +00:00
Ralph Castain
960a255e7f
Do some cleanup of the --without-hwloc build - no need to work on coprocessors since we can't detect them anyway, cleanup some unused variables in the ppr mapper
...
This commit was SVN r29476.
2013-10-23 01:45:21 +00:00
Ralph Castain
25a84c7f0a
Fix build --without-hwloc
...
This commit was SVN r29453.
2013-10-19 23:12:33 +00:00