Ralph Castain
14bb7a117c
Fix bugs in the oob base - ensure we get the components in high-to-low priority, and that we correctly track reachability via all components. Adjust the priority of the tcp component to leave headroom for others
...
Refs trac:267
This commit was SVN r30740.
The following Trac tickets were found above:
Ticket 267 --> https://svn.open-mpi.org/trac/ompi/ticket/267
2014-02-16 03:19:08 +00:00
Ralph Castain
509d5d82b0
Add some verbiage requested by Jeff, change the param level to something...?
...
Refs trac:4275
This commit was SVN r30736.
The following Trac tickets were found above:
Ticket 4275 --> https://svn.open-mpi.org/trac/ompi/ticket/4275
2014-02-15 15:11:05 +00:00
Ralph Castain
3f9db36e0d
Make Jeff smile - pretty-up the indentation
...
Refs trac:4267
This commit was SVN r30733.
The following Trac tickets were found above:
Ticket 4267 --> https://svn.open-mpi.org/trac/ompi/ticket/4267
2014-02-14 23:25:48 +00:00
Ralph Castain
91f90058ce
Add missing options and cleanup the code a bit. Default to by-slot ranking if a non-hardware option isn't given. Thanks to Tetsuya Mishima for the assist.
...
cmr=v1.7.5:reviewer=ompi-gk1.7
This commit was SVN r30725.
2014-02-14 10:23:16 +00:00
Ralph Castain
fd9b301a8b
Check equality instead of bit-mask - thanks to Tetsuya Mishima for reporting it
...
cmr=v1.7.5:reviewer=ompi-gk1.7
This commit was SVN r30722.
2014-02-14 02:34:42 +00:00
Ralph Castain
4e1c07cbf2
If we are given a TCP oob address that doesn't match any active module, it is still possible that we could route to the address if a router is in the system. No harm in trying, so arbitrarily pick the first connection in the active module list and assign the peer to it. If that module can't reach it, we'll follow the usual failover mechanism until finally concluding that nobody can get there.
...
cmr=v1.7.5:reviewer=jsquyres:subject=handle non-matching addresses
This commit was SVN r30719.
2014-02-13 23:37:22 +00:00
Ralph Castain
449cd8f3d7
Update a couple of fields, add a scheduler field to proc_info
...
This commit was SVN r30718.
2014-02-13 23:30:04 +00:00
Ralph Castain
fc6101b508
Handle "localhost" better
...
Refs trac:4263
This commit was SVN r30702.
The following Trac tickets were found above:
Ticket 4263 --> https://svn.open-mpi.org/trac/ompi/ticket/4263
2014-02-12 20:30:39 +00:00
Ralph Castain
a8a9801a0b
Ensure an orted exits with non-zero status if it is unable to send a message. Add more diagnostic messages to the OOB set_addr code
...
cmr=v1.7.5:reviewer=jsquyres
This commit was SVN r30701.
2014-02-12 19:44:01 +00:00
Ralph Castain
1473dde6ea
Okay, once again caught by the blasted hwloc inability to cleanly handle caches. Protect the calls to get_depth by first checking to see if it is a "cache", then use a cache-specific function to get the stupid data. Very, very irritating.
...
cmr=v1.7.5:reviewer=jsquyres:subject=treat caches as something different yet again
This commit was SVN r30693.
2014-02-12 01:45:06 +00:00
Ralph Castain
1565816988
Do a little better job of cleaning up the session directory left by mpirun by ensuring we delete the event associated with debugger attachment and unlinking the pipe used for that purpose. Also, we no longer leave "abort" files around, so remove that check when deleting session directory trees
...
cmr=v1.7.5:reviewer=jsquyres:subject=cleanup session directories better
This commit was SVN r30689.
2014-02-11 22:16:17 +00:00
Ralph Castain
fa7b686ccc
Provide better messages when we don't find any included interfaces, and/or don't find any interfaces for use by OOB.
...
cmr=v1.7.5:reviewer=jsquyres
This commit was SVN r30675.
2014-02-11 19:29:03 +00:00
Ralph Castain
b566cd5e30
Protect against no modifiers
...
Refs trac:4117
This commit was SVN r30672.
The following Trac tickets were found above:
Ticket 4117 --> https://svn.open-mpi.org/trac/ompi/ticket/4117
2014-02-11 17:34:37 +00:00
Ralph Castain
6fa34407bf
Handle modifiers to the --map-by dist option
...
Refs trac:4117
This commit was SVN r30671.
The following Trac tickets were found above:
Ticket 4117 --> https://svn.open-mpi.org/trac/ompi/ticket/4117
2014-02-11 17:19:05 +00:00
Ralph Castain
4781ea71b6
Correct the handling of various map/bind combinations when pe=N is given. Thanks to Elena Elkina for reporting it.
...
Refs trac:4117
This commit was SVN r30663.
The following Trac tickets were found above:
Ticket 4117 --> https://svn.open-mpi.org/trac/ompi/ticket/4117
2014-02-11 03:05:26 +00:00
Ralph Castain
707e51d786
Check for --cpus-per-proc earlier, before the correct option can be processed. Thanks to Tetsuya Mishima for reporting it.
...
Refs trac:4117
This commit was SVN r30662.
The following Trac tickets were found above:
Ticket 4117 --> https://svn.open-mpi.org/trac/ompi/ticket/4117
2014-02-11 02:53:53 +00:00
Ralph Castain
d66d2f5fb3
It is just fine to map by node or slot and bind, so ensure the switch statement includes those options. Thanks to Tetsuya Mishima for pointing it out.
...
Refs trac:4240
This commit was SVN r30661.
The following Trac tickets were found above:
Ticket 4240 --> https://svn.open-mpi.org/trac/ompi/ticket/4240
2014-02-11 02:52:01 +00:00
Ralph Castain
a49e0db8dd
We haven't supported a c++ wrapper for ORTE in quite some time
...
cmr=v1.7.5:reviewer=ompi-gk1.7:subject=remove c++ cruft
This commit was SVN r30653.
2014-02-10 17:16:30 +00:00
Ralph Castain
1a12325094
Rats - need to include bydist in the mapping list
...
Refs trac:4117
This commit was SVN r30649.
The following Trac tickets were found above:
Ticket 4117 --> https://svn.open-mpi.org/trac/ompi/ticket/4117
2014-02-09 16:17:05 +00:00
Ralph Castain
0dc5f50d27
Add a plm component for local-only operation that doesn't require rsh/ssh to be installed. Requested by Fedora packagers for testing purposes.
...
cmr=v1.7.5:reviewer=jsquyres:subject=Add a plm component for local-only operation
This commit was SVN r30645.
2014-02-09 15:53:10 +00:00
Ralph Castain
ca0c806662
Resolve the problem of binding in inverted topologies - check the relative depth of the map and bind objects in the topology, and let that determine whether we bind downward or upwards.
...
cmr=v1.7.5:reviewer=jsquyres:subject=Resolve the problem of binding in inverted topologies
This commit was SVN r30643.
2014-02-09 05:30:17 +00:00
Ralph Castain
0ee38353ba
In case there are stale session directories around, do a purge of the relevant session directory tree when an orted, HNP, or singleton starts. This won't help in the case of direct-launched apps, but it's the best we can do.
...
cmr=v1.7.5:reviewer=jsquyres:subject=purge stale session dirs at startup
This commit was SVN r30642.
2014-02-09 02:10:31 +00:00
Ralph Castain
1d8c061687
Fix a race condition that could result in assert failures during finalize. Ensure we shutdown the orte progress thread prior to finalizing the rml/oob frameworks so that no async operations are executing during destruct of the base-level lists and objects.
...
cmr=v1.7.5:reviewer=jsquyres:subject=fix race condition in finalize
This commit was SVN r30641.
2014-02-08 22:04:19 +00:00
Ralph Castain
5b8e1180cf
Update a test
...
This commit was SVN r30640.
2014-02-08 22:00:12 +00:00
Ralph Castain
a94920276d
Fix singleton MPI_Abort. Singletons no longer immediately start an HNP, but only launch one when they need it for comm_spawn. So there isn't anyone to send the "abort" report to, and thus we just exit after emitting our message.
...
cmr=v1.7.5:reviewer=jsquyres:subject=Fix singleton MPI_Abort
This commit was SVN r30635.
2014-02-08 18:15:07 +00:00
Ralph Castain
bc7cc09749
After a lot of pain, I've managed to resolve the problem of conflicting mapping directives caused by mismatched MCA params - i.e., where someone has one variant of an MCA param (e.g., rmaps_base_mapping_policy) in their default MCA param file, and then specifies another variant (e.g., --npernode) on the command line. I can't fully resolve the problem as there is no way to know precisely what the user meant - we can only guess which param was really intended since the MCA param system
...
can't apply its normal precedence rules.
So...print a big "deprecated" warning for the old params and error out if a conflict is detected. I know that isn't what people really wanted, but it's the best we can do. If only the old style param is given, then process it after the warning.
Extend the current map-by param to add support for ppr and cpus-per-proc, adding the latter to the list of allowed modifiers using "pe=n" for processing elements/proc. Thus, you can map-by socket:pe=2,oversubscribe to map by socket, binding 2 processing elements/process, with oversubscription allowed. Or you can map-by ppr:2:socket:pe=4 to map two processes to every socket in the allocation, binding each process to 4 processing elements.
For those wondering, a processing element is defined as a hwthread if --use-hwthreads-as-cpus is given, or else as a core.
Refs trac:4117
This commit was SVN r30620.
The following Trac tickets were found above:
Ticket 4117 --> https://svn.open-mpi.org/trac/ompi/ticket/4117
2014-02-07 21:25:40 +00:00
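The map-by syntax described in r30620 can be illustrated with a small parser. This is only a sketch under stated assumptions: `MapBy` and `parse_map_by` are hypothetical names rather than ORTE code, and the parser handles just the two forms quoted in the commit message (`socket:pe=2,oversubscribe` and `ppr:2:socket:pe=4`).

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class MapBy:
    """Parsed form of a map-by directive (illustrative, not ORTE's own structure)."""
    obj: str                                         # object to map by, e.g. "socket"
    ppr: Optional[int] = None                        # processes per resource for "ppr:N:obj"
    pe: Optional[int] = None                         # processing elements bound per process
    flags: List[str] = field(default_factory=list)   # other modifiers, e.g. "oversubscribe"


def parse_map_by(spec: str) -> MapBy:
    """Split a directive such as 'ppr:2:socket:pe=4' into its parts."""
    parts = spec.split(":")
    if parts[0] == "ppr":
        if len(parts) < 3:
            raise ValueError("ppr needs a count and an object: ppr:N:obj")
        result = MapBy(obj=parts[2], ppr=int(parts[1]))
        modifier_chunks = parts[3:]
    else:
        result = MapBy(obj=parts[0])
        modifier_chunks = parts[1:]

    # Modifiers form a comma-separated list after the object name.
    for chunk in modifier_chunks:
        for mod in chunk.split(","):
            if not mod:
                continue
            if mod.startswith("pe="):
                result.pe = int(mod[len("pe="):])
            else:
                result.flags.append(mod)
    return result
```

With this sketch, `parse_map_by("socket:pe=2,oversubscribe")` yields an object of `socket`, `pe` of 2 and the `oversubscribe` flag. Whether a processing element is a hwthread or a core is decided elsewhere (by --use-hwthreads-as-cpus) and is not modeled here.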
Ralph Castain
c617d66d98
Paul Hargrove has pointed out that some big SMP systems (e.g., from SGI) configure Torque differently - instead of listing each node name once/slot in the nodefile, they list the node only once and set an envar to indicate the number of procs/node being allocated. Add an MCA param users can set to indicate we are in such an environment, and then use the envar to set the slots. Error out if the mode flag is given, but (a) we don't find the PBS_PPN envar, or (b) we find a node actually listed more than once in the PBS_Nodefile.
...
cmr=v1.7.5:reviewer=jsquyres:subject=Support SMP mode in Torque
This commit was SVN r30568.
2014-02-05 15:51:17 +00:00
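The slot-counting rules from r30568 can be sketched as follows. The function name `slots_from_pbs` and the `smp_mode` flag are hypothetical stand-ins (the commit message does not name the MCA param); only the PBS_PPN variable and the two error conditions come from the text above.

```python
from collections import Counter


def slots_from_pbs(nodefile_lines, environ, smp_mode=False):
    """Return {node: slots} from the contents of a PBS nodefile.

    Normal mode: each node appears once per slot, so slots = occurrences.
    SMP mode (hypothetical flag): each node appears once and PBS_PPN
    gives the number of procs allocated per node.
    """
    nodes = [line.strip() for line in nodefile_lines if line.strip()]
    counts = Counter(nodes)

    if not smp_mode:
        return dict(counts)

    # Error (a): the mode flag was given but the envar is missing.
    if "PBS_PPN" not in environ:
        raise RuntimeError("SMP mode requested but PBS_PPN is not set")

    # Error (b): a node is listed more than once in the nodefile.
    repeated = [name for name, n in counts.items() if n > 1]
    if repeated:
        raise RuntimeError("SMP mode requested but nodes are repeated: %s" % repeated)

    ppn = int(environ["PBS_PPN"])
    return {name: ppn for name in counts}
```

Passing the environment in as a dictionary rather than reading `os.environ` directly keeps the sketch easy to exercise without a Torque allocation.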
Ralph Castain
1326ed704f
Per the RFC discussed here:
...
http://www.open-mpi.org/community/lists/devel/2014/01/13789.php
add support for async modex when requested.
cmr=v1.7.5:reviewer=jsquyres:subject=Add async modex support
This commit was SVN r30565.
2014-02-05 14:39:27 +00:00
Ralph Castain
230336b6a8
Upgrade the security framework to avoid multiple hits against the global security server. Add support for future case where mpirun assigns a global security credential for a given run, though we need to work out how to handle connect-accept from other mpirun's in that case. Remove a bunch of duplicate code in the OOB by consolidating the connection handshake code.
...
Refs trac:4221
This commit was SVN r30554.
The following Trac tickets were found above:
Ticket 4221 --> https://svn.open-mpi.org/trac/ompi/ticket/4221
2014-02-04 14:47:04 +00:00
Adrian Reber
fde1040d2f
Use unique collective ids for the checkpoint/restart code
...
This commit was SVN r30552.
2014-02-04 14:03:05 +00:00
Ralph Castain
5980b7e042
Add a security framework for authenticating connections - we will add LDAP, Kerberos, and Keystone support in the next month. For now, just put a placeholder "basic" module that does the minimum.
...
Wire the security check into ORTE's OOB handshake, and add a "version" check to ensure that both ends are from the same ORTE version. If not, report the mismatch and refuse the connection
Fixes trac:4171
cmr=v1.7.5:reviewer=jsquyres:subject=Add a security framework for authenticating connections
This commit was SVN r30551.
The following Trac tickets were found above:
Ticket 4171 --> https://svn.open-mpi.org/trac/ompi/ticket/4171
2014-02-04 01:38:45 +00:00
Ralph Castain
e43589ed84
Fix warning - thanks to Paul Hargrove for reporting it
...
cmr=v1.7.4:reviewer=ompi-gk1.7
This commit was SVN r30548.
2014-02-03 23:51:45 +00:00
Ralph Castain
993198cfba
Fix lost message problem - if multiple messages are queued before the connection is formed, we lost all but the first one. Ensure that all messages get properly queued prior to completing the connection
...
cmr=v1.7.4:reviewer=jsquyres:subject=Fix lost message problem
This commit was SVN r30516.
2014-01-31 05:30:51 +00:00
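The lost-message fix in r30516 amounts to queueing every message that arrives before the connection is formed, then flushing the queue in order. A minimal model of that idea, assuming a hypothetical `Peer` class that stands in for the OOB peer object (none of these names are from ORTE):

```python
from collections import deque


class Peer:
    """Toy model of a peer whose connection may not be formed yet."""

    def __init__(self):
        self.connected = False
        self.pending = deque()   # messages waiting for the connection
        self.sent = []           # messages actually put on the wire

    def send(self, msg):
        if self.connected:
            self.sent.append(msg)
        else:
            # Queue every message, not only the first one.
            self.pending.append(msg)

    def connection_complete(self):
        """Mark the peer connected and flush the backlog in FIFO order."""
        self.connected = True
        while self.pending:
            self.sent.append(self.pending.popleft())
```

A single "pending message" slot would reproduce the bug described above: a second `send` before `connection_complete` would overwrite the first.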
Ralph Castain
2bc9fd30ee
Orcm sends heartbeats to its daemons, but ORTE needs to continue sending them to the HNP
...
This commit was SVN r30514.
2014-01-31 01:56:01 +00:00
Ralph Castain
193cceb483
Okay, since a certain other RM out there made a fuss about being able to lock their daemons to specified cores, offer the same option here. The MCA param orte_daemon_cores can be used to specify which core(s) you want the orte daemons to use. This will have no bearing on the application procs - unbound will remain unbound, and binding directives will be applied to the apps.
...
Yippee skippee...
This commit was SVN r30513.
2014-01-30 23:50:14 +00:00
Rolf vandeVaart
f7055de78e
Stop listening thread and wait for it to terminate.
...
This commit was SVN r30507.
2014-01-30 20:37:15 +00:00
Ralph Castain
83e32aadb7
Add a variant of opal_init/finalize for running unit tests
...
This commit was SVN r30497.
2014-01-30 11:14:36 +00:00
Ralph Castain
db92ac3ce1
Cleanup role of aggregator relative to daemons
...
Refs trac:4176
This commit was SVN r30495.
The following Trac tickets were found above:
Ticket 4176 --> https://svn.open-mpi.org/trac/ompi/ticket/4176
2014-01-30 00:53:30 +00:00
Ralph Castain
ed3da20672
Add unit test for opal_db
...
This commit was SVN r30494.
2014-01-30 00:51:44 +00:00
Adrian Reber
af934fc6e8
removed trailing whitespaces in snapc
...
This commit was SVN r30489.
2014-01-29 21:27:13 +00:00
Adrian Reber
7de34ea201
SNAPC/CRCP/SSTORE: remove compiler warnings
...
This commit was SVN r30488.
2014-01-29 20:52:00 +00:00
Adrian Reber
5f95db3902
SNAPC: use ORTE_WAIT_FOR_COMPLETION with non-blocking receives
...
During the commits to make the C/R code compile again the
blocking receive calls in snapc_full_app.c were
replaced by non-blocking receive calls.
This commit adds ORTE_WAIT_FOR_COMPLETION()
after each non-blocking receive to wait for the data.
This commit was SVN r30487.
2014-01-29 20:46:14 +00:00
Adrian Reber
fa1036f38c
SSTORE/CRCP: use ORTE_WAIT_FOR_COMPLETION with non-blocking receives
...
During the commits to make the C/R code compile again the
blocking receive calls were replaced by non-blocking
which broke the code. This patch uses ORTE_WAIT_FOR_COMPLETION()
to wait until the non-blocking calls have finished.
This commit was SVN r30486.
2014-01-29 20:30:35 +00:00
Adrian Reber
d5c1e33900
SSTORE: use dynamic buffers for rml.send and rml.recv
...
The sstore component was still using static buffers
for send_buffer_nb(). This patch changes opal_buffer_t buffer;
to opal_buffer_t *buffer;
This commit was SVN r30485.
2014-01-29 20:06:23 +00:00
Adrian Reber
2900f24b67
SNAPC: use dynamic buffers for rml.send and rml.recv
...
The snapc component was still using static buffers
for send_buffer_nb(). This patch changes opal_buffer_t buffer;
to opal_buffer_t *buffer;
This commit was SVN r30484.
2014-01-29 19:58:33 +00:00
Ralph Castain
4e3d12d9c1
Fix suicide operation when MPI app loses connection to its local daemon. In that scenario, we correctly callback up to the MPI layer notifying it of the lost connection. However, when the MPI layer calls back down to tell the RTE to abort, it is passing back a flag indicating we should report that error to our local daemon - which is dead. This leads to an infinite loop. Break it by checking the flag indicating an abnormal term was ordered by the RTE, and thus not attempting to send the message.
...
cmr=v1.7.4:reviewer=jsquyres
This commit was SVN r30475.
2014-01-29 16:56:54 +00:00
Ralph Castain
410a3afa7b
Fix --without-hwloc operations - must default to map-by slot in that scenario
...
cmr=v1.7.4:reviewer=jsquyres
This commit was SVN r30474.
2014-01-29 16:54:05 +00:00
Ralph Castain
42eb0bbe1b
Fix --without-hwloc builds
...
cmr=v1.7.4:reviewer=jsquyres
This commit was SVN r30462.
2014-01-28 17:10:32 +00:00
Ralph Castain
c874ce3b61
Don't look for the coretemp file when configuring as it might not be on the head node, but is available on the backend
...
Refs trac:4176
This commit was SVN r30461.
The following Trac tickets were found above:
Ticket 4176 --> https://svn.open-mpi.org/trac/ompi/ticket/4176
2014-01-28 16:15:12 +00:00
Jeff Squyres
4edeb229cc
Add MPIEXEC_TIMEOUT environment variable to the man page.
...
cmr=v1.7.4:reviewer=rhc
This commit was SVN r30455.
2014-01-28 14:40:17 +00:00
Ralph Castain
84a0ab3a75
Ah @$#!$#% - missed one last help message that needs to be corrected.
...
cmr=v1.7.4:reviewer=jsquyres:subject=correct help message
This commit was SVN r30449.
2014-01-28 04:03:24 +00:00
Ralph Castain
941bfd4604
Final cleanup of cpus-per-proc for 1.7.4 - provide better checking for cpus-per-proc and mismatched mapping/binding directives, and provide error messages telling the user what to do to get it right.
...
cmr=v1.7.4:reviewer=jsquyres
This commit was SVN r30438.
2014-01-27 22:40:51 +00:00
Ralph Castain
53b1be5067
Only report launch progress when specifically requested to do so. Thanks to Tetsuya Mishima for spotting it.
...
Reviewed by rhc and RM-approved
cmr=v1.7.4:reviewer=ompi-gk1.7
This commit was SVN r30434.
2014-01-27 15:17:42 +00:00
Ralph Castain
956aab03a7
Track the origin of a message so it can be passed across transports
...
Refs trac:4184
This commit was SVN r30433.
The following Trac tickets were found above:
Ticket 4184 --> https://svn.open-mpi.org/trac/ompi/ticket/4184
2014-01-26 21:09:26 +00:00
Ralph Castain
11562ab7cb
Ensure we build the sensor components even if the local system doesn't have the required directories and/or access permissions. Backend nodes that get the binary may have them, and aggregators need to load the component so they can log data even if they aren't locally monitoring. Detect that we can't access the required files when we first try to sample and turn the sampling portion of the plugin off at that time.
...
Refs trac:4172
This commit was SVN r30426.
The following Trac tickets were found above:
Ticket 4172 --> https://svn.open-mpi.org/trac/ompi/ticket/4172
2014-01-25 04:34:33 +00:00
Jeff Squyres
21ffddbbd0
Addendum to r30408: if we're going to remove stale cruft, let's remove
...
all of it. :-)
Refs trac:4175.
This commit was SVN r30417.
The following SVN revision numbers were found above:
r30408 --> open-mpi/ompi@31acdb15bc
The following Trac tickets were found above:
Ticket 4175 --> https://svn.open-mpi.org/trac/ompi/ticket/4175
2014-01-24 22:19:36 +00:00
Ralph Castain
f73d23e723
Correct the location of the counter when tracking process launch for reporting progress
...
cmr=v1.7.4:reviewer=hjelmn
This commit was SVN r30415.
2014-01-24 21:03:05 +00:00
Ralph Castain
e3cb4b4a5b
Grant Nathan his wish - add a --disable-getpwuid option to the configure options and protect all users of that code so it disappears if disabled.
...
cmr=v1.7.5:reviewer=hjelmn:subject=disable getpwuid if requested
This commit was SVN r30413.
2014-01-24 19:18:37 +00:00
Ralph Castain
e496e348a4
Some cleanup of the sensor system to ensure things go in the right place, avoid segfaults under abnormal conditions, etc.
...
cmr=v1.7.5:reviewer=rhc
This commit was SVN r30409.
2014-01-24 17:29:24 +00:00
Ralph Castain
31acdb15bc
We haven't really supported orteCC in a long time, so let's remove the stale cruft. Thanks to Paul Hargrove for noticing!
...
cmr=v1.7.4:reviewer=jsquyres:subject=remove stale orteCC cruft
This commit was SVN r30408.
2014-01-24 17:26:54 +00:00
Adrian Reber
0af2897c12
removed trailing whitespaces in orte-checkpoint.c
...
This commit was SVN r30407.
2014-01-24 17:23:49 +00:00
Adrian Reber
659eb1b10a
silence two compiler warnings
...
This commit was SVN r30406.
2014-01-24 17:22:28 +00:00
Adrian Reber
919260a0d2
fix communication between orte-checkpoint and orterun
...
Right after starting the communication with orterun the buffer
containing the message is deleted. This patch removes the deletion
of the buffer which is now done by orte_rml_send_callback(). This is
now also the callback function used by orte_rml.send_buffer_nb().
The previous callback hnp_receiver() was introduced by an
earlier patch which only was trying to get the code to compile again.
This commit was SVN r30405.
2014-01-24 17:18:28 +00:00
Adrian Reber
8c93ebffeb
orte_snapc_base_select() wants to know if it is an application
...
The function
int orte_snapc_base_select(bool seed, bool app);
wants to know if it is called by an application or not. Therefore
it expects as second parameter 'bool app'. It used to be
'!ORTE_PROC_IS_DAEMON' which is not always correct if it is
a tool or a HNP. This patch changes it to ORTE_PROC_IS_APP, which
has the correct information if it is an application.
This commit was SVN r30404.
2014-01-24 17:14:41 +00:00
Ralph Castain
14bf1c9463
Some minor cleanups:
...
* don't return null if someone wants to print ORTE_SUCCESS
* rename some stale process types
* keep show_help local if we are in standalone operation as there is nobody to send it to
cmr=v1.7.5:reviewer=jsquyres
This commit was SVN r30400.
2014-01-23 21:35:20 +00:00
Ralph Castain
32996cd705
Add new sensors for chip frequency and power (when permissions allow). Note that we don't support all chipsets at this time, but others are welcome to extend as desired.
...
cmr=v1.7.5:reviewer=rhc
This commit was SVN r30399.
2014-01-23 21:33:21 +00:00
Ralph Castain
886fee9367
Properly set num_procs when np is not given, but cpus-per-proc is used. Thanks to Tetsuya Mishima for pointing it out
...
cmr=v1.7.4:reviewer=jsquyres
This commit was SVN r30389.
2014-01-23 05:01:07 +00:00
Ralph Castain
a01470190d
Allow a little more flexibility - if getpwuid fails, just use the return from getuid to define the session directory
...
cmr=v1.7.5:reviewer=jsquyres
This commit was SVN r30388.
2014-01-23 05:00:05 +00:00
Ralph Castain
de07a64599
Cleanup the sensor code:
...
* use the global flags for linux and apple being found instead of re-doing the case statements
* update select procedure to ignore components that measure the same thing (e.g., resusage and sigar), taking the higher priority module
cmr=v1.7.5:reviewer=jsquyres:subject=Cleanup the sensor code
This commit was SVN r30368.
2014-01-22 21:01:09 +00:00
Jeff Squyres
7768828d2d
Addendum to r30298: tweak the wording of the help messages a bit.
...
Refs trac:4117. Please use this commit rather than the patch attached to
the ticket; the patch had a few mistakes in the tweaked wording.
This commit was SVN r30362.
The following SVN revision numbers were found above:
r30298 --> open-mpi/ompi@58479399c3
The following Trac tickets were found above:
Ticket 4117 --> https://svn.open-mpi.org/trac/ompi/ticket/4117
2014-01-22 12:17:14 +00:00
Ralph Castain
e0edc29029
Add comment on future work
...
This commit was SVN r30336.
2014-01-20 19:54:31 +00:00
Ralph Castain
9b2066cfba
Add two new sensor modules - one to monitor core temperatures, and the other to monitor resource usage using the sigar library
...
This commit was SVN r30335.
2014-01-20 19:35:48 +00:00
Ralph Castain
3e9c8497e0
Shift the verbose output a bit
...
Refs trac:4136
This commit was SVN r30332.
The following Trac tickets were found above:
Ticket 4136 --> https://svn.open-mpi.org/trac/ompi/ticket/4136
2014-01-20 14:41:37 +00:00
Ralph Castain
5ad9795bd8
Cleanup some potential memory overruns
...
cmr=v1.7.5:reviewer=jsquyres
This commit was SVN r30331.
2014-01-19 16:31:26 +00:00
Ralph Castain
9f6fd7b98d
A few corrections to hostfile parsing - thanks to Tetsuya Mishima for the review
...
Refs trac:4136
This commit was SVN r30330.
The following Trac tickets were found above:
Ticket 4136 --> https://svn.open-mpi.org/trac/ompi/ticket/4136
2014-01-19 16:26:12 +00:00
Ralph Castain
657796f9e0
Revert r30327 - turns out it isn't quite right just yet. :-(
...
Closes trac:4138
This commit was SVN r30328.
The following SVN revision numbers were found above:
r30327 --> open-mpi/ompi@87d5f86025
The following Trac tickets were found above:
Ticket 4138 --> https://svn.open-mpi.org/trac/ompi/ticket/4138
2014-01-18 23:38:39 +00:00
Ralph Castain
87d5f86025
Enable use of unix domain sockets for local OOB communications, thereby removing the requirement for an active network interface when running strictly on a single node. Update the overall OOB system to support cross-transport movement of messages so that the OOB can move a received message to another transport for transmission.
...
cmr=v1.7.5:reviewer=jsquyres:subject=Enable use of unix domain sockets for local OOB communications
This commit was SVN r30327.
2014-01-18 21:36:49 +00:00
Ralph Castain
fcdd904af4
Simplify and update hostfile handling to correctly support hostfiles that list nodes multiple times, once for each slot, and those that list a host once and include an explicit slot count. Eliminate support for mixing those two modes as this logic became just too complex when attempting to handle all the corner cases.
...
cmr=v1.7.4:reviewer=jsquyres
This commit was SVN r30325.
2014-01-18 16:08:40 +00:00
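The two hostfile styles named in r30325 can be illustrated with a short parser. This is a sketch, not the ORTE implementation: `parse_hostfile` is a hypothetical name, and rejecting a host that appears in both styles with an error is an assumption about what "eliminate support for mixing" means in practice.

```python
def parse_hostfile(lines):
    """Return {host: slots} for either supported hostfile style.

    Style 1: a host is listed once per slot.
    Style 2: a host is listed once with an explicit 'slots=N'.
    Mixing both styles for the same host raises ValueError (assumed behavior).
    """
    slots = {}
    explicit = set()   # hosts that gave an explicit slot count

    for raw in lines:
        line = raw.split("#", 1)[0].strip()   # drop comments and blank lines
        if not line:
            continue
        fields = line.split()
        host = fields[0]

        count = None
        for item in fields[1:]:
            if item.startswith("slots="):
                count = int(item[len("slots="):])

        if count is not None:
            if host in slots:
                raise ValueError("host %s mixes repeated entries and slots=N" % host)
            slots[host] = count
            explicit.add(host)
        else:
            if host in explicit:
                raise ValueError("host %s mixes repeated entries and slots=N" % host)
            slots[host] = slots.get(host, 0) + 1

    return slots
```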
Ralph Castain
87f34860fe
Protect array against crossing boundaries
...
cmr=v1.7.5:reviewer=jsquyres
This commit was SVN r30316.
2014-01-17 21:36:20 +00:00
Mike Dubman
874c4e2558
PMI2: add missing file from prev commit
...
Refs trac:4119
This commit was SVN r30301.
The following Trac tickets were found above:
Ticket 4119 --> https://svn.open-mpi.org/trac/ompi/ticket/4119
2014-01-16 13:17:08 +00:00
Mike Dubman
98234b5a69
SLURM/PMI2: Fix parsing of PMI2 process mapping
...
fixed by AlexM, reviewed by miked
cmr=v1.7.4:reviewer=rhc
This commit was SVN r30300.
2014-01-16 12:05:29 +00:00
Ralph Castain
58479399c3
As per RFC and telecon, deprecate cmd line options and their corresponding MCA params for old-style mapping and binding directives
...
cmr=v1.7.5:reviewer=jsquyres:subject=deprecate old-style mapping and binding directives
This commit was SVN r30298.
2014-01-15 14:48:39 +00:00
Ralph Castain
590a87c730
You can't pass static buffer definitions to rml.send as it will attempt to release them upon completion - you need to send dynamically allocated buffers
...
This commit was SVN r30261.
2014-01-11 19:38:11 +00:00
Ralph Castain
286ff6d552
For large scale systems, we would like to avoid doing a full modex during MPI_Init so that launch will scale a little better. At the moment, our options are somewhat limited as only a few BTLs don't immediately call modex_recv on all procs during startup. However, for those situations where someone can take advantage of it, add the ability to do a "modex on demand" retrieval of data from remote procs when we launch via mpirun.
...
NOTE: launch performance will be absolutely awful if you do this with BTLs that aren't configured to modex_recv on first message!
Even with "modex on demand", we still have to do a barrier in place of the modex - we simply don't move any data around, which does reduce the time impact. The barrier is required to ensure that the other proc has in fact registered all its BTL info and therefore is prepared to hand over a complete data package. Otherwise, you may not get the info you need. In addition, the shared memory BTL can fail to properly rendezvous as it expects the barrier to be in place.
This behavior will *only* take effect under the following conditions:
1. launched via mpirun
2. #procs is greater than ompi_hostname_cutoff, which defaults to UINT32_MAX
3. mca param rte_orte_direct_modex is set to 1. At the moment, we are having problems getting this param to register properly, so only the first two conditions are in effect. Still, the bottom line is you have to *want* this behavior to get it.
The planned next evolution of this will be to make the direct modex be non-blocking - this will require two fixes:
1. if the remote proc doesn't have the required info, then let it delay its response until it does. This means we need a way for the MPI layer to tell the RTE "I am done entering modex data".
2. adjust the SM rendezvous logic to loop until the required file has been created
Creating a placeholder to bring this over to 1.7.5 when ready.
cmr=v1.7.5:reviewer=hjelmn:subject=Enable direct modex at scale
This commit was SVN r30259.
2014-01-11 17:36:06 +00:00
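The three conditions listed in r30259 combine into a single predicate. The sketch below models the intended check with all three conditions enforced. The commit message notes that, at the time, the third one was not yet registering properly. The function and argument names are hypothetical; only `ompi_hostname_cutoff`, its UINT32_MAX default, and `rte_orte_direct_modex` come from the text.

```python
UINT32_MAX = 2**32 - 1


def use_direct_modex(launched_by_mpirun, num_procs,
                     hostname_cutoff=UINT32_MAX, direct_modex_param=False):
    """Return True only when all three stated conditions hold."""
    return bool(
        launched_by_mpirun                 # 1. launched via mpirun
        and num_procs > hostname_cutoff    # 2. #procs exceeds ompi_hostname_cutoff
        and direct_modex_param             # 3. rte_orte_direct_modex set to 1
    )
```

Because the cutoff defaults to UINT32_MAX, the predicate stays false unless the user lowers it, which matches the remark that you have to want this behavior to get it.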
Ralph Castain
fb9e427320
One last corner case - when encountering an overload condition (e.g., by comm_spawning more procs than we have cores) and we are using the default binding policy, do *not* bind the new procs to anything as this can cause major problems. Instead, let the spawn succeed since the user didn't specifically ask to be bound, and leave the new procs as unbound.
...
Refs trac:4077
This commit was SVN r30200.
The following Trac tickets were found above:
Ticket 4077 --> https://svn.open-mpi.org/trac/ompi/ticket/4077
2014-01-09 22:39:34 +00:00
Ralph Castain
24e990e747
Fix comm_spawn for oversubscribed systems by correctly computing the number of available slots
...
cmr=v1.7.4:reviewer=jsquyres:subject=Fix comm_spawn for oversubscribed systems
This commit was SVN r30197.
2014-01-09 20:33:48 +00:00
Ralph Castain
9fcb46d85a
Correctly detect and handle oversubscription for comm_spawn
...
cmr=v1.7.4:reviewer=jsquyres:subject=Correctly detect and handle oversubscription for comm_spawn
This commit was SVN r30186.
2014-01-09 18:27:51 +00:00
Ralph Castain
6e5fedeb04
Oops - add verbose output to inform the user that we cannot default bind because no cores were detected
...
Refs trac:4074
This commit was SVN r30185.
The following Trac tickets were found above:
Ticket 4074 --> https://svn.open-mpi.org/trac/ompi/ticket/4074
2014-01-09 18:17:14 +00:00
Ralph Castain
4cdc291df1
Ensure slurm properly dies on abnormal termination
...
cmr=v1.7.4:reviewer=jsquyres:subject=Ensure slurm properly dies on abnormal termination
This commit was SVN r30182.
2014-01-09 16:52:02 +00:00
Jeff Squyres
87e476ebd8
Clean up many references to "rank": usually change to "process" and/or
...
specifically delineate that we're referring to the process' rank in
MPI_COMM_WORLD.
Refs trac:4068
This commit was SVN r30181.
The following Trac tickets were found above:
Ticket 4068 --> https://svn.open-mpi.org/trac/ompi/ticket/4068
2014-01-09 16:37:49 +00:00
Ralph Castain
7e4748a0f1
Handle the case of nodes that do not report cores, where our default binding policy will fail even though binding is supported, by defaulting to not binding on those nodes.
...
Thanks to Paul Hargrove for reporting the problem on NetBSD.
cmr=v1.7.4:reviewer=jsquyres:subject=Handle the case of nodes that do not report cores
This commit was SVN r30180.
2014-01-09 16:27:58 +00:00
Ralph Castain
f179f2086b
Do a better job of reporting bindings - if someone gives a spec that binds us to all processors, then we are effectively unbound and should report it clearly instead of outputting a long line of B's.
...
cmr=v1.7.4:reviewer=jsquyres:subject=Do a better job of reporting bindings
This commit was SVN r30179.
2014-01-09 16:16:16 +00:00
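The reporting rule in r30179 (a binding that covers every processor is effectively no binding) can be sketched like this. The function name and both output strings are hypothetical; the real orterun output format is not shown in the commit message, apart from its mention of a long line of B's.

```python
def describe_binding(bound_cpus, all_cpus):
    """Return a short, human-readable description of a binding.

    bound_cpus: ids of the processors the process is bound to
    all_cpus:   ids of every processor available on the node
    """
    bound = set(bound_cpus)
    every = set(all_cpus)

    # Bound to nothing, or to everything: report it plainly as unbound.
    if not bound or bound >= every:
        return "not bound (or bound to all available processors)"

    # Otherwise draw one character per processor: B = bound, . = not bound.
    return "".join("B" if cpu in bound else "." for cpu in sorted(every))
```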
Ralph Castain
2a0e4b5e62
Update the orterun help messages and man page to reflect new map/rank/bind options and defaults. Thanks to Paul Hargrove for reporting it.
...
cmr=v1.7.4:reviewer=jsquyres
This commit was SVN r30173.
2014-01-09 04:44:28 +00:00
Ralph Castain
bf453a2575
Reference the correct variable...sigh
...
Refs trac:4059
This commit was SVN r30163.
The following Trac tickets were found above:
Ticket 4059 --> https://svn.open-mpi.org/trac/ompi/ticket/4059
2014-01-08 22:36:39 +00:00
Ralph Castain
80497d73cf
Need to mark the daemon as alive so that exit commands are properly routed during abnormal terminations. Also, remove stale references to the "selected oob component" as we no longer require only one component be selected
...
cmr=v1.7.4:reviewer=jsquyres
This commit was SVN r30162.
2014-01-08 22:35:48 +00:00
Ralph Castain
d5647394d8
Initialize variable so dash-host option gets correctly parsed
...
cmr=v1.7.4:reviewer=rolfv
This commit was SVN r30159.
2014-01-08 15:17:16 +00:00
Ralph Castain
e724d0d12d
Ensure comm_spawn'd jobs get treated the same wrt setting default mapping directives
...
Refs trac:4059
This commit was SVN r30158.
The following Trac tickets were found above:
Ticket 4059 --> https://svn.open-mpi.org/trac/ompi/ticket/4059
2014-01-08 15:16:22 +00:00
Ralph Castain
fb650aed0c
Fix how we transfer mapping directives to the job, ensuring that directives that can be given outside of a mapping policy (e.g., oversubscribe and no-use-local) are retained.
...
cmr=v1.7.4:reviewer=jsquyres:subject=Fix how we transfer mapping directives to the job
This commit was SVN r30155.
2014-01-08 04:25:43 +00:00
Ralph Castain
bc75250951
Cleanup the sensor framework close - existing code was using incorrect object type. Don't start sensors if sample rate is zero. Don't add zero-byte data from resusage as it means nothing was measured.
...
cmr=v1.7.4:reviewer=hjelmn
This commit was SVN r30150.
2014-01-08 02:38:56 +00:00
Jeff Squyres
13b29cff2c
This commit complements/completes r30140. r30140 made all the
...
configury/Makefile.am changes; this commit renames the internal
installdirs.h framework struct field names to match the configury macro
names:
* pkgdatadir -> ompidatadir
* pkglibdir -> ompilibdir
* pkgincludedir -> ompiincludedir
This commit was SVN r30145.
The following SVN revision numbers were found above:
r30140 --> open-mpi/ompi@8b778903d8
2014-01-07 23:36:33 +00:00
Brian Barrett
8b778903d8
Fix longstanding issue with our multi-project support. Rather than using
...
pkg{data,lib,includedir}, use our own ompi{data,lib,includedir}, which is
always set to {datadir,libdir,includedir}/openmpi. This will keep us from
having help files in prefix/share/open-rte when building without Open MPI,
but in prefix/share/openmpi when building with Open MPI.
This commit was SVN r30140.
2014-01-07 22:11:15 +00:00
Mike Dubman
40aadab85f
re-enable map-by dist
...
after last refactoring in rmaps, map-by dist:hca was disabled.
reverting it back
found/fixed by Elena, reviewed by miked
cmr=v1.7.4:reviewer=ompi-rm1.7
This commit was SVN r30118.
2014-01-04 20:44:41 +00:00
Ralph Castain
9a855ff58e
Update sensor component for new OOB calls
...
cmr=v1.7.4:reviewer=jsquyres
This commit was SVN r30117.
2014-01-03 22:37:15 +00:00
Ralph Castain
3f2b3c53ea
Ensure that rankfile-provided allocations are correctly handled
...
Fixes trac:4043
cmr=v1.7.4:reviewer=jsquyres:subject=Ensure that rankfile-provided allocations are correctly handled
This commit was SVN r30106.
The following Trac tickets were found above:
Ticket 4043 --> https://svn.open-mpi.org/trac/ompi/ticket/4043
2014-01-02 16:07:16 +00:00
Ralph Castain
d5a5caa7e0
Restore the bycore mpirun option for backward compatibility
...
Refs trac:4044
cmr=v1.7.4:reviewer=jsquyres
This commit was SVN r30103.
The following Trac tickets were found above:
Ticket 4044 --> https://svn.open-mpi.org/trac/ompi/ticket/4044
2014-01-02 04:16:43 +00:00
Ralph Castain
a8a91b374e
Update component-level selection comments to match latest revisions
...
cmr=v1.7.4:reviewer=rhc
This commit was SVN r30087.
2013-12-25 19:12:43 +00:00
Ralph Castain
d049731911
Add pubsub pmi component to list of components to avoid when indirect launch used
...
Refs trac:4032
This commit was SVN r30083.
The following Trac tickets were found above:
Ticket 4032 --> https://svn.open-mpi.org/trac/ompi/ticket/4032
2013-12-25 16:25:37 +00:00
Ralph Castain
85f2429819
Ensure the ipv6 lists get initialized and finalized
...
cmr=v1.7.4:reviewer=jsquyres
This commit was SVN r30081.
2013-12-24 17:24:39 +00:00
Ralph Castain
2e08219cac
Silence the valgrind report from the OOB
...
Refs trac:4033
This commit was SVN r30080.
The following Trac tickets were found above:
Ticket 4033 --> https://svn.open-mpi.org/trac/ompi/ticket/4033
2013-12-24 17:06:45 +00:00
Ralph Castain
81df8d09ca
Avoid use of PMI components when launched via mpirun as this is just unnecessary overhead that can cause confusion.
...
cmr=v1.7.4:reviewer=miked:subject=Avoid use of PMI components when launched via mpirun
This commit was SVN r30078.
2013-12-24 16:32:31 +00:00
Ralph Castain
01ee5f380b
Remove debug - problem has been identified
...
Refs trac:4026
This commit was SVN r30075.
The following Trac tickets were found above:
Ticket 4026 --> https://svn.open-mpi.org/trac/ompi/ticket/4026
2013-12-24 15:22:18 +00:00
Jeff Squyres
ce02002a5e
Free minor memory leak / squash valgrind still-reachable warning.
...
cmr=v1.7.5:reviewer=rhc
This commit was SVN r30071.
2013-12-24 11:04:38 +00:00
Ralph Castain
38f46641ce
Ensure the recv handler has been initialized
...
Refs trac:4026
This commit was SVN r30068.
The following Trac tickets were found above:
Ticket 4026 --> https://svn.open-mpi.org/trac/ompi/ticket/4026
2013-12-24 06:09:45 +00:00
Ralph Castain
bb80625a8a
Add missing var initialization
...
cmr=v1.7.4:reviewer=ompi-gk1.7
This commit was SVN r30063.
2013-12-24 00:02:22 +00:00
Ralph Castain
65228d3571
Don't use "size_t" for the nbytes field in the header - use uint32_t to ensure that ntohl/htonl correctly match it
...
Refs trac:4026
This commit was SVN r30062.
The following Trac tickets were found above:
Ticket 4026 --> https://svn.open-mpi.org/trac/ompi/ticket/4026
2013-12-23 21:39:49 +00:00
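The reasoning behind 65228d3571 is that `htonl`/`ntohl` operate on 32-bit values, whereas `size_t` is 64 bits wide on many platforms, so a `size_t` header field does not reliably match the conversion. A minimal Python sketch of the same idea, using a fixed-width 32-bit field in network byte order, is shown below. The single-field header layout and the helper names are assumptions made for illustration; the real OOB header is a C struct with more fields.

```python
import struct

# Hypothetical header layout for illustration: one unsigned 32-bit length
# field in network (big-endian) byte order. "!" selects network order and
# standard sizes, so "I" is 4 bytes on every platform.
HEADER_FORMAT = "!I"
HEADER_SIZE = struct.calcsize(HEADER_FORMAT)


def pack_header(nbytes):
    """Encode the payload length as a fixed-width, network-order field."""
    if not 0 <= nbytes <= 0xFFFFFFFF:
        raise ValueError("nbytes does not fit in an unsigned 32-bit field")
    return struct.pack(HEADER_FORMAT, nbytes)


def unpack_header(data):
    """Decode the payload length from the first HEADER_SIZE bytes."""
    (nbytes,) = struct.unpack(HEADER_FORMAT, data[:HEADER_SIZE])
    return nbytes
```

Because the width is fixed on both sides, sender and receiver agree on the field regardless of the native word size of either machine.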
Ralph Castain
7d8c0459a4
Attempt to debug hang that is hitting some environments. Posting to 1.7.4 as a placeholder for the eventual solution
...
cmr=v1.7.4:reviewer=rhc
This commit was SVN r30060.
2013-12-23 19:57:05 +00:00
Nathan Hjelm
3be4536d9b
Cleanup various leaks in ompi_info reported by valgrind.
...
cmr=v1.7.4:reviewer=jsquyres
This commit was SVN r30058.
2013-12-23 17:47:43 +00:00
George Bosilca
24879f9def
Code cleanup while chasing valgrind complaints.
...
This commit was SVN r30048.
2013-12-21 23:28:14 +00:00
George Bosilca
38cbaeaa82
Try to impose a little bit of consistency on how we parse lists of
...
modules by enforcing the use of OPAL list accessors.
This commit was SVN r30045.
2013-12-21 23:23:33 +00:00
Ralph Castain
264150872b
Add a bunch of debug output to the OOB connection completion code so we can track down a handshake problem. Available in optimized builds as well as debug ones by setting -mca oob_base_verbose 10
...
No review will be required as this is just debug code for those helping us debug the 1.7.4 release candidates
cmr-=v1.7.4:reviewer=ompi-gk1.7
This commit was SVN r30043.
2013-12-21 16:09:26 +00:00
Ralph Castain
9c768df8b8
Resolve an unexpected behavior in hostfile allocations. Now that we filter allocations to determine what will be used for mapping, let the initial global pool be the union of nodes from all sources (default hostfile, hostfiles, and dash-hosts). Each app will filter down to only those specified for it using its own hostfile and dash-host options.
...
cmr=v1.7.4:reviewer=jsquyres:subject=Resolve an unexpected behavior in hostfile allocations
This commit was SVN r30040.
2013-12-21 01:38:27 +00:00
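The two-stage allocation described in 9c768df8b8 (a global pool built as the union of all sources, then per-app filtering) might be sketched like this. It is an illustrative Python sketch, not the ORTE implementation; the function names are hypothetical, and treating an app with no hostfile or dash-host of its own as eligible for the whole pool is an assumption.

```python
def build_global_pool(*sources):
    """Union of node names from all sources, keeping first-seen order.

    Each source is an iterable of node names (for example the default
    hostfile, an explicit hostfile, or a dash-host list).
    """
    pool = []
    seen = set()
    for source in sources:
        for node in source:
            if node not in seen:
                seen.add(node)
                pool.append(node)
    return pool


def filter_for_app(pool, app_nodes):
    """Restrict the global pool to the nodes one app asked for.

    Assumption: an app that names no nodes may use the whole pool.
    """
    if not app_nodes:
        return list(pool)
    wanted = set(app_nodes)
    return [node for node in pool if node in wanted]
```

For example, a default hostfile listing `n0, n1` and a dash-host of `n1, n2` give a pool of `n0, n1, n2`, and an app that requested only `n2` is then filtered down to that single node.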
Adrian Reber
53a70fe87f
Trying to get the C/R code to compile again. (send_*_nb)
...
This patch changes all send/send_buffer occurrences in the C/R code
to send_nb/send_buffer_nb.
The new code compiles but does not work.
Changes from V1:
* #ifdef out the code (so it is preserved for later re-design)
* marked the broken C/R code with ENABLE_FT_FIXED
Changes from V2:
* just replace the blocking calls with the non-blocking calls
* all #ifdef's introduced in V1 are gone
* send_* returns error code or ORTE_SUCCESS (not the number of bytes)
This commit was SVN r30036.
2013-12-20 21:58:28 +00:00
Adrian Reber
a3813d37c7
Trying to get the C/R code to compile again. (recv_*_nb)
...
This patch changes all recv/recv_buffer occurrences in the C/R code
to recv_nb/recv_buffer_nb.
The old code is still there but disabled using ifdefs (ENABLE_FT_FIXED).
The new code compiles but does not work.
Changes from V1:
* #ifdef out the code (so it is preserved for later re-design)
* marked the broken C/R code with ENABLE_FT_FIXED
Changes from V2:
* only #ifdef out the code where the behaviour is changed
(used to be blocking; now non-blocking)
This commit was SVN r30035.
2013-12-20 21:05:40 +00:00
Ralph Castain
31248c0985
Correctly add support for the "env" MPI_Info key during comm_spawn, update the "map-by", "rank-by", and "bind-to" Info key behaviors to match the new mapping/ranking/binding system, and update all docs and comments to match.
...
Fix comm_spawn on a single host - with the new default mapping scheme, we were incorrectly computing the number of procs to put on the node.
Refs trac:4003
This commit was SVN r30033.
The following Trac tickets were found above:
Ticket 4003 --> https://svn.open-mpi.org/trac/ompi/ticket/4003
2013-12-20 20:42:39 +00:00
Ralph Castain
71b52fe861
Ensure that comm_spawn'd procs get user-specified forwarded envars
...
Thanks to Tim Miller for reporting the regression from the 1.6 series
cmr=v1.7.4:reviewer=jsquyres:subject=Ensure that comm_spawn'd procs get user-specified forwarded envars
This commit was SVN r30012.
2013-12-20 14:47:35 +00:00
Ralph Castain
d47d2569f3
We stripped the process info packing routine to minimize message size when sending the launch message, but tools still require all the info. So modify the tool-hnp handshake to explicitly add the missing info
...
Refs trac:3992
This commit was SVN r29989.
The following Trac tickets were found above:
Ticket 3992 --> https://svn.open-mpi.org/trac/ompi/ticket/3992
2013-12-19 20:42:20 +00:00
Ralph Castain
55cd65b149
Don't warn about binding (process and/or memory) if the node cannot do it or if we would overload, but it wasn't specifically requested by the user (i.e., it is the result of the default policy). Instead, just don't bind and quietly move along.
...
Reset topology usage for each node as we bind, since multiple nodes may be linked to the same topology object. This will need to be revisited for scale, as it takes some non-zero time to reset the usage on each iteration. However, storing individual topology objects for every node consumes memory, so it's a tradeoff.
cmr=v1.7.4:reviewer=jsquyres:subject=Eliminate excessive binding/memory warnings
This commit was SVN r29978.
2013-12-19 16:31:45 +00:00
Ralph Castain
9b32dacb6c
Ensure we don't abort if a tool cannot send a message - the orte/util/comm library used by tools to query mpirun knows how to handle this situation.
...
Refs trac:3992
This commit was SVN r29975.
The following Trac tickets were found above:
Ticket 3992 --> https://svn.open-mpi.org/trac/ompi/ticket/3992
2013-12-19 07:10:36 +00:00
Ralph Castain
6239e64f36
Further cleanup of orte-ps so it doesn't abort when hitting a stale HNP - only report that event once and just keep working.
...
Refs trac:3992
This commit was SVN r29974.
The following Trac tickets were found above:
Ticket 3992 --> https://svn.open-mpi.org/trac/ompi/ticket/3992
2013-12-19 03:28:05 +00:00
Ralph Castain
bf5e314f76
Tools require their own errmgr and state components so they can handle any errors that occur in, for example, communication.
...
Refs trac:3992
This commit was SVN r29972.
The following Trac tickets were found above:
Ticket 3992 --> https://svn.open-mpi.org/trac/ompi/ticket/3992
2013-12-19 01:49:33 +00:00
Ralph Castain
3aaca16faa
Silence warnings that are no longer valid
...
Refs trac:3992
This commit was SVN r29970.
The following Trac tickets were found above:
Ticket 3992 --> https://svn.open-mpi.org/trac/ompi/ticket/3992
2013-12-19 00:40:36 +00:00
Ralph Castain
c5956e7b8c
Convert debug output to opal_output_verbose
...
Thanks to Tetsuya Mishima for reporting it
cmr=v1.7.4:reviewer=jsquyres
This commit was SVN r29969.
2013-12-19 00:36:15 +00:00
Ralph Castain
39957df08e
Fixes trac:3963. Fix the tool ess procedure so it opens and selects the OOB framework, and have the OOB TCP module update the route to new connections (the routed modules know what to do).
...
Thanks to Dave Love and Ashley Pittman for pointing out the problem.
cmr=v1.7.4:reviewer=jsquyres:subject=Fix tool communications with mpirun
This commit was SVN r29959.
The following Trac tickets were found above:
Ticket 3963 --> https://svn.open-mpi.org/trac/ompi/ticket/3963
2013-12-18 23:13:46 +00:00
Ralph Castain
77553f72be
Per this email thread:
...
http://www.open-mpi.org/community/lists/devel/2013/12/13412.php
fix the backtrace function to avoid async issues. Thanks to Takahiro Kawashima for the patch
This commit was SVN r29955.
2013-12-18 17:57:37 +00:00
Ralph Castain
ab4636c47b
Per email on devel list, change the default rank-by to slot unless map-by <obj> is specified, in which case use rank-by <obj>
...
Refs trac:3977
This commit was SVN r29945.
The following Trac tickets were found above:
Ticket 3977 --> https://svn.open-mpi.org/trac/ompi/ticket/3977
2013-12-18 00:48:50 +00:00
Ralph Castain
53cd00fe16
By setting a default mapping/ranking/binding policy that wasn't "none", we introduced a problem for users of the Mac and any other machine where sockets aren't defined and/or binding is not supported. Fix that by checking to see if the user specified the failing policy - if not, then fall back to the old map/rank by slot and no binding.
...
Refs trac:3977
This commit was SVN r29933.
The following Trac tickets were found above:
Ticket 3977 --> https://svn.open-mpi.org/trac/ompi/ticket/3977
2013-12-17 14:50:10 +00:00
Adrian Reber
b42aad44a3
Trying to get the C/R code to compile again. This patch
...
includes various fixes all over the C/R code which are
hard to group like the other patches.
Changes from V1:
* explain why mca_base_component_distill_checkpoint_ready no longer works
* compare return result of opal functions with OPAL_* values
Changes from V2:
* use orte_rml_oob_ft_event() instead of referencing through the modules
* properly protect variable (thanks to --enable-picky)
This commit was SVN r29922.
2013-12-16 15:35:28 +00:00
Ralph Castain
8b6d117541
Per the OMPI devel conference that changed our default behaviors:
...
* default to bind-to core
* map-by slot if np=2
* map-by socket (balance across sockets on each node) if np > 2
* map-by <obj> will imply rank-by <obj> by default (leave default binding as above)
Fix a bug in the map-by <obj> mapper where we incorrectly compute the #procs to assign if the #slots > #procs
cmr=v1.7.4:reviewer=jsquyres:subject=Update default binding and mapping values
This commit was SVN r29919.
2013-12-15 17:25:54 +00:00
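The defaults listed in 8b6d117541 can be read as a small decision rule. The Python sketch below is one way to express it and is not the rmaps code itself. The function name and return shape are hypothetical; treating np below 2 the same as np=2 is an assumption, and the default rank-by of slot when no map-by object is given comes from the later commit ab4636c47b in this log rather than from this one.

```python
def default_policies(np, map_by=None):
    """Pick default map-by / rank-by / bind-to values for a job.

    np     -- number of processes in the job
    map_by -- object named in an explicit map-by request, or None
    """
    if map_by is not None:
        # An explicit map-by <obj> implies rank-by <obj>; the default
        # binding stays at core.
        return {"map_by": map_by, "rank_by": map_by, "bind_to": "core"}
    # No explicit mapping: slot for small jobs, socket otherwise.
    # Assumption: np < 2 is handled like np = 2.
    mapping = "slot" if np <= 2 else "socket"
    return {"map_by": mapping, "rank_by": "slot", "bind_to": "core"}
```

Under these assumptions a two-process job maps by slot, an eight-process job maps by socket, and an explicit map-by object carries over to the ranking policy.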
Jeff Squyres
770bf77149
Fix some minor memory leaks in error code paths.
...
Many thanks to Tom Fogal for the patch.
cmr=v1.7.4:reviewer=rhc:subject=Fix minor memory leaks in error code paths
This commit was SVN r29905.
2013-12-14 00:41:21 +00:00
Jeff Squyres
0ab48ad0d2
Fix some annoying flex warnings that have been there for years.
...
Many thanks to Tom Fogal for the initial patch.
cmr=v1.7.4:reviewer=rhc:subject=Fix annoying flex warnings
This commit was SVN r29904.
2013-12-14 00:36:12 +00:00
Jeff Squyres
2e7653e4c2
Add missing argv.h includes.
...
Noticed these as part of #3694: external libevents don't cause argv.h
to automatically get included.
Refs trac:3694
This commit was SVN r29897.
The following Trac tickets were found above:
Ticket 3694 --> https://svn.open-mpi.org/trac/ompi/ticket/3694
2013-12-13 21:17:36 +00:00
Brian Barrett
121ca26c59
Per discussion at Developer's Meeting, remove Solaris threads support. Solaris
...
will just fall back to pthreads, which should be no problem.
This commit was SVN r29893.
2013-12-13 20:07:11 +00:00
Ralph Castain
0e81959aae
Cleanup mindist error messages - already patched in 1.7
...
This commit was SVN r29869.
2013-12-12 15:30:29 +00:00
Ralph Castain
1ff12362da
Cleanup merge conflict that was incorrectly committed
...
This commit was SVN r29851.
2013-12-09 20:20:14 +00:00
Ralph Castain
83e59e6761
Once again, the Slurm folks have decided to redefine their envars, reversing what they had previously told us to do. So cleanup the Slurm allocation code, and also adjust to a change in srun behavior that now aborts a job if the ntasks-per-node doesn't get specified when ORTE calls it, but the user specified it when getting an allocation. Sigh.
...
cmr=v1.7.4:reviewer=miked:subject=Update Slurm allocation and launch
This commit was SVN r29849.
2013-12-09 17:58:46 +00:00
Mike Dubman
c208b858e7
improve error messages in mindist
...
cmr=v1.7.4:reviewer=ompi-rm1.7
This commit was SVN r29846.
2013-12-09 06:34:38 +00:00
Ralph Castain
f2c49c6c19
Fix the map-by object mapper to handle cpus-per-proc by accounting for the request when computing the number of procs to put on each object. This ensures that the binding routine doesn't automatically overload the cores.
...
cmr=v1.7.4:reviewer=jsquyres
This commit was SVN r29843.
2013-12-08 16:59:25 +00:00
Ralph Castain
9604f36c3b
Specify units for the job completion timeout
...
This commit was SVN r29839.
2013-12-08 04:51:58 +00:00
Ralph Castain
62c9e5c64c
Really is better if we output a message indicating that the job was aborted due to hitting the execution time limit
...
Refs trac:3960
This commit was SVN r29833.
The following Trac tickets were found above:
Ticket 3960 --> https://svn.open-mpi.org/trac/ompi/ticket/3960
2013-12-07 15:33:56 +00:00
Ralph Castain
d44e4a311f
Per request from Dave Goodell, add support for MPIEXEC_TIMEOUT - if set in the environment, terminate the job after the specified number of seconds has passed. Equivalent to MPICH functionality.
...
cmr=v1.7.4:reviewer=dgoodell:subject=add support for MPIEXEC_TIMEOUT
This commit was SVN r29831.
2013-12-07 01:58:32 +00:00
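The behavior added in d44e4a311f (read MPIEXEC_TIMEOUT from the environment and terminate the job after that many seconds) could be approximated in a launcher script as follows. This Python sketch is illustrative only. The helper name is hypothetical, and the handling of empty or non-positive values as "no timeout" is an assumption; the commit message does not say how mpirun treats such values.

```python
import os


def timeout_from_env(environ=os.environ):
    """Return the job timeout in seconds, or None if no limit is set.

    Reads MPIEXEC_TIMEOUT from the given mapping. Assumption: an unset,
    empty, or non-positive value means "no timeout". A non-integer value
    raises ValueError.
    """
    raw = environ.get("MPIEXEC_TIMEOUT")
    if raw is None or raw.strip() == "":
        return None
    seconds = int(raw)
    return seconds if seconds > 0 else None
```

The result can be passed directly as the `timeout` argument of `subprocess.run`, which accepts `None` for "no limit" and raises `subprocess.TimeoutExpired` after stopping the child when the limit is reached.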