Ralph Castain
c617d66d98
Paul Hargrove has pointed out that some big SMP systems (e.g., from SGI) configure Torque differently - instead of listing each node name once/slot in the nodefile, they list the node only once and set an envar to indicate the number of procs/node being allocated. Add an MCA param users can set to indicate we are in such an environment, and then use the envar to set the slots. Error out if the mode flag is given, but (a) we don't find the PBS_PPN envar, or (b) we find a node actually listed more than once in the PBS_Nodefile.
...
cmr=v1.7.5:reviewer=jsquyres:subject=Support SMP mode in Torque
This commit was SVN r30568.
2014-02-05 15:51:17 +00:00
Ralph Castain
1326ed704f
Per the RFC discussed here:
...
http://www.open-mpi.org/community/lists/devel/2014/01/13789.php
add support for async modex when requested.
cmr=v1.7.5:reviewer=jsquyres:subject=Add async modex support
This commit was SVN r30565.
2014-02-05 14:39:27 +00:00
Ralph Castain
230336b6a8
Upgrade the security framework to avoid multiple hits against the global security server. Add support for future case where mpirun assings a global security credential for a given run, though we need to work out how to handle connect-accept from other mpirun's in that case. Remove a bunch of duplicate code in the OOB by consolidating the connection handshake code.
...
Refs trac:4221
This commit was SVN r30554.
The following Trac tickets were found above:
Ticket 4221 --> https://svn.open-mpi.org/trac/ompi/ticket/4221
2014-02-04 14:47:04 +00:00
Adrian Reber
fde1040d2f
Use unique collective ids for the checkpoint/restart code
...
This commit was SVN r30552.
2014-02-04 14:03:05 +00:00
Ralph Castain
5980b7e042
Add a security framework for authenticating connections - we will add LDAP, Kerberos, and Keystone support in the next month. For now, just put a placeholder "basic" module that does the minimum.
...
Wire the security check into ORTE's OOB handshake, and add a "version" check to ensure that both ends are from the same ORTE version. If not, report the mismatch and refuse the connection
Fixes trac:4171
cmr=v1.7.5:reviewer=jsquyres:subject=Add a security framework for authenticating connections
This commit was SVN r30551.
The following Trac tickets were found above:
Ticket 4171 --> https://svn.open-mpi.org/trac/ompi/ticket/4171
2014-02-04 01:38:45 +00:00
Ralph Castain
e43589ed84
Fix warning - thanks to Paul Hargrove for reporting it
...
cmr=v1.7.4:reviewer=ompi-gk1.7
This commit was SVN r30548.
2014-02-03 23:51:45 +00:00
Ralph Castain
993198cfba
Fix lost message problem - if multiple messages are queued before the connection is formed, we lost all but the first one. Ensure that all messages get properly queued prior to completing the connection
...
cmr=v1.7.4:reviewer=jsquyres:subject=Fix lost message problem
This commit was SVN r30516.
2014-01-31 05:30:51 +00:00
Ralph Castain
2bc9fd30ee
Orcm sends heartbeats to its daemons, but ORTE needs to continue sending it to the HNP
...
This commit was SVN r30514.
2014-01-31 01:56:01 +00:00
Ralph Castain
193cceb483
Okay, since a certain other RM out there made a fuss about being able to lock their daemons to specified cores, offer the same option here. The MCA param orte_daemon_cores can be used to specify which core(s) you want the orte daemons to use. This will have no bearing on the application procs - unbound will remain unbound, and binding directives will be applied to the apps.
...
Yippee skippee...
This commit was SVN r30513.
2014-01-30 23:50:14 +00:00
Rolf vandeVaart
f7055de78e
Stop listening thread and wait for it to terminate.
...
This commit was SVN r30507.
2014-01-30 20:37:15 +00:00
Ralph Castain
83e32aadb7
Add a variant of opal_init/finalize for running unit tests
...
This commit was SVN r30497.
2014-01-30 11:14:36 +00:00
Ralph Castain
db92ac3ce1
Cleanup role of aggregator relative to daemons
...
Refs trac:4176
This commit was SVN r30495.
The following Trac tickets were found above:
Ticket 4176 --> https://svn.open-mpi.org/trac/ompi/ticket/4176
2014-01-30 00:53:30 +00:00
Ralph Castain
ed3da20672
Add unit test for opal_db
...
This commit was SVN r30494.
2014-01-30 00:51:44 +00:00
Adrian Reber
af934fc6e8
removed trailing whitespaces in snapc
...
This commit was SVN r30489.
2014-01-29 21:27:13 +00:00
Adrian Reber
7de34ea201
SNAPC/CRCP/SSTORE: remove compiler warnings
...
This commit was SVN r30488.
2014-01-29 20:52:00 +00:00
Adrian Reber
5f95db3902
SNAPC: use ORTE_WAIT_FOR_COMPLETION with non-blocking receives
...
During the commits to make the C/R code compile again the
blocking receive calls in snapc_full_app.c were
replaced by non-blocking receive calls.
This commit adds ORTE_WAIT_FOR_COMPLETION()
after each non-blocking receive to wait for the data.
This commit was SVN r30487.
2014-01-29 20:46:14 +00:00
Adrian Reber
fa1036f38c
SSTORE/CRCP: use ORTE_WAIT_FOR_COMPLETION with non-blocking receives
...
During the commits to make the C/R code compile again the
blocking receive calls were replaced by non-blocking
which broke the code. This patch uses ORTE_WAIT_FOR_COMPLETION()
to wait until the non-blocking calls have finished.
This commit was SVN r30486.
2014-01-29 20:30:35 +00:00
Adrian Reber
d5c1e33900
SSTORE: use dynamic buffers for rml.send and rml.recv
...
The sstore component was still using static buffers
for send_buffer_nb(). This patch changes opal_buffer_t buffer;
to opal_buffer_t *buffer;
This commit was SVN r30485.
2014-01-29 20:06:23 +00:00
Adrian Reber
2900f24b67
SNAPC: use dynamic buffers for rml.send and rml.recv
...
The snapc component was still using static buffers
for send_buffer_nb(). This patch changes opal_buffer_t buffer;
to opal_buffer_t *buffer;
This commit was SVN r30484.
2014-01-29 19:58:33 +00:00
Ralph Castain
4e3d12d9c1
Fix suicide operation when MPI app loses connection to its local daemon. In that scenario, we correctly callback up to the MPI layer notifying it of the lost connection. However, when the MPI layer calls back down to tell the RTE to abort, it is passing back a flag indicating we should report that error to our local daemon - which is dead. This leads to an infinite loop. Break it by using checking the flag indicating an abnormal term was ordered by the RTE and thus don't attempt to send the message.
...
cmr=v1.7.4:reviewer=jsquyres
This commit was SVN r30475.
2014-01-29 16:56:54 +00:00
Ralph Castain
410a3afa7b
Fix --without-hwloc operations - must default to map-by slot in that scenario
...
cmr=v1.7.4:reviewer=jsquyres
This commit was SVN r30474.
2014-01-29 16:54:05 +00:00
Ralph Castain
42eb0bbe1b
Fix --without-hwloc builds
...
cmr=v1.7.4:reviewer=jsquyres
This commit was SVN r30462.
2014-01-28 17:10:32 +00:00
Ralph Castain
c874ce3b61
Don't look for the coretemp file when configuring as it might not be on the head node, but is available on the backend
...
Refs trac:4176
This commit was SVN r30461.
The following Trac tickets were found above:
Ticket 4176 --> https://svn.open-mpi.org/trac/ompi/ticket/4176
2014-01-28 16:15:12 +00:00
Jeff Squyres
4edeb229cc
Add MPIEXEC_TIMEOUT environment variable to the man page.
...
cmr=v1.7.4:reviewer=rhc
This commit was SVN r30455.
2014-01-28 14:40:17 +00:00
Ralph Castain
84a0ab3a75
Ah @$#!$#% - missed one last help message that needs to be corrected.
...
cmr=v1.7.4:reviewer=jsquyres:subject=correct help message
This commit was SVN r30449.
2014-01-28 04:03:24 +00:00
Ralph Castain
941bfd4604
Final cleanup of cpus-per-proc for 1.7.4 - provide better checking for cpus-per-proc and mismatched mapping/binding directives, and provide error messages telling the user what to do to get it right.
...
cmr=v1.7.4:reviewer=jsquyres
This commit was SVN r30438.
2014-01-27 22:40:51 +00:00
Ralph Castain
53b1be5067
Only report launch progress when specifically requested to do so. Thanks to Tetsuya Mishima for spotting it.
...
Reviewed by rhc and RM-approved
cmr=v1.7.4:reviewer=ompi-gk1.7
This commit was SVN r30434.
2014-01-27 15:17:42 +00:00
Ralph Castain
956aab03a7
Track the origin of a message so it can be passed across transports
...
Refs trac:4184
This commit was SVN r30433.
The following Trac tickets were found above:
Ticket 4184 --> https://svn.open-mpi.org/trac/ompi/ticket/4184
2014-01-26 21:09:26 +00:00
Ralph Castain
11562ab7cb
Ensure we build the sensor components even if the local system doesn't have the required directories and/or access permissions. Backend nodes that get the binary may have them, and aggregators need to load the component so they can log data even if they aren't locally monitoring. Detect that we can't access the required files when we first try to sample and turn the sampling portion of the plugin off at that time.
...
Refs trac:4172
This commit was SVN r30426.
The following Trac tickets were found above:
Ticket 4172 --> https://svn.open-mpi.org/trac/ompi/ticket/4172
2014-01-25 04:34:33 +00:00
Jeff Squyres
21ffddbbd0
Addendum to r30408: if we're going to remove stale kruft, let's remove
...
all of it. :-)
Refs trac:4175.
This commit was SVN r30417.
The following SVN revision numbers were found above:
r30408 --> open-mpi/ompi@31acdb15bc
The following Trac tickets were found above:
Ticket 4175 --> https://svn.open-mpi.org/trac/ompi/ticket/4175
2014-01-24 22:19:36 +00:00
Ralph Castain
f73d23e723
Correct the location of the counter when tracking process launch for reporting progress
...
cmr=v1.7.4:reviewer=hjelmn
This commit was SVN r30415.
2014-01-24 21:03:05 +00:00
Ralph Castain
e3cb4b4a5b
Grant Nathan his wish - add an --disable-getpwuid to the configure options and protect all users of that code so it disappears if disabled.
...
cmr=v1.7.5:reviewer=hjelmn:subject=disable getpwuid if requested
This commit was SVN r30413.
2014-01-24 19:18:37 +00:00
Ralph Castain
e496e348a4
Some cleanup of the sensor system to ensure things go in the right place, avoid segfaults under abnormal conditions, etc.
...
cmr=v1.7.5:reviewer=rhc
This commit was SVN r30409.
2014-01-24 17:29:24 +00:00
Ralph Castain
31acdb15bc
We haven't really supported orteCC in a long time, so let's remove the stale cruft. Thanks to Paul Hargrove for noticing!
...
cmr=v1.7.4:reviewer=jsquyres:subject=remove stale orteCC cruft
This commit was SVN r30408.
2014-01-24 17:26:54 +00:00
Adrian Reber
0af2897c12
removed trailing whitespaces in orte-checkpoint.c
...
This commit was SVN r30407.
2014-01-24 17:23:49 +00:00
Adrian Reber
659eb1b10a
silence two compiler warnings
...
This commit was SVN r30406.
2014-01-24 17:22:28 +00:00
Adrian Reber
919260a0d2
fix communication between orte-checkpoint and orterun
...
Right after starting the communication with orterun the buffer
containing the message is deleted. This patch removes the deletion
of the buffer which is now done by orte_rml_send_callback(). This is
now also the callback function used by orte_rml.send_buffer_nb().
The previous callback hnp_receiver() was introduced by an
earlier patch which only was trying to get the code to compile again.
This commit was SVN r30405.
2014-01-24 17:18:28 +00:00
Adrian Reber
8c93ebffeb
orte_snapc_base_select() wants to know if it is an application
...
The function
int orte_snapc_base_select(bool seed, bool app);
wants to know if it called by an application or not. Therefore
it expects as second paremeter 'bool app'. It used to be
'!ORTE_PROC_IS_DAEMON' which is not always correct if it is
a tool or a HNP. This patch changes it to ORTE_PROC_IS_APP, which
has the correct information if it is an application.
This commit was SVN r30404.
2014-01-24 17:14:41 +00:00
Ralph Castain
14bf1c9463
Some minor cleanups:
...
* don't return null if someone wants to print ORTE_SUCCESS
* rename some stale process types
* keep show_help local if we are in standalone operation as there is nobody to send it to
cmr=v1.7.5:reviewer=jsquyres
This commit was SVN r30400.
2014-01-23 21:35:20 +00:00
Ralph Castain
32996cd705
Add new sensors for chip frequency and power (when permissions allow) Note that we don't support all chipsets at this time, but others are welcome to extend as desired.
...
cmr=v1.7.5:reviewer=rhc
This commit was SVN r30399.
2014-01-23 21:33:21 +00:00
Ralph Castain
886fee9367
Properly set num_procs when np is not given, but cpus-per-proc is used. Thanks to Tetsuya Mishima for pointing it out
...
cmr=v1.7.4:reviewer=jsquyres
This commit was SVN r30389.
2014-01-23 05:01:07 +00:00
Ralph Castain
a01470190d
Allow a little more flexibility - if getpwuid fails, just use the return from getuid to define the session directory
...
cmr=v1.7.5:reviewer=jsquyres
This commit was SVN r30388.
2014-01-23 05:00:05 +00:00
Ralph Castain
de07a64599
Cleanup the sensor code:
...
* use the global flags for linux and apple being found instead of re-doing the case statements
* update select procedure to ignore components that measure the same thing (e.g., resusage and sigar), taking the higher priority module
cmr=v1.7.5:reviewer=jsquyres:subject=Cleanup the sensor code
This commit was SVN r30368.
2014-01-22 21:01:09 +00:00
Jeff Squyres
7768828d2d
Addendum to r30298: tweak the wording of the help messages a bit.
...
Refs trac:4117. Please use this commit rather than the patch attached to
the ticket; the patch had a few mistakes in the tweaked wording.
This commit was SVN r30362.
The following SVN revision numbers were found above:
r30298 --> open-mpi/ompi@58479399c3
The following Trac tickets were found above:
Ticket 4117 --> https://svn.open-mpi.org/trac/ompi/ticket/4117
2014-01-22 12:17:14 +00:00
Ralph Castain
e0edc29029
Add comment on future work
...
This commit was SVN r30336.
2014-01-20 19:54:31 +00:00
Ralph Castain
9b2066cfba
Add two new sensor modules - one to monitor core temperatures, and the other to monitor resource usage using the sigar library
...
This commit was SVN r30335.
2014-01-20 19:35:48 +00:00
Ralph Castain
3e9c8497e0
Shift the verbose output a bit
...
Refs trac:4136
This commit was SVN r30332.
The following Trac tickets were found above:
Ticket 4136 --> https://svn.open-mpi.org/trac/ompi/ticket/4136
2014-01-20 14:41:37 +00:00
Ralph Castain
5ad9795bd8
Cleanup some potential memory overruns
...
cmr=v1.7.5:reviewer=jsquyres
This commit was SVN r30331.
2014-01-19 16:31:26 +00:00
Ralph Castain
9f6fd7b98d
A few corrections to hostfile parsing - thanks to Tetsuya Mishima for the review
...
Refs trac:4136
This commit was SVN r30330.
The following Trac tickets were found above:
Ticket 4136 --> https://svn.open-mpi.org/trac/ompi/ticket/4136
2014-01-19 16:26:12 +00:00
Ralph Castain
657796f9e0
Revert r30327 - turns out it isn't quite right just yet. :-(
...
Closes trac:4138
This commit was SVN r30328.
The following SVN revision numbers were found above:
r30327 --> open-mpi/ompi@87d5f86025
The following Trac tickets were found above:
Ticket 4138 --> https://svn.open-mpi.org/trac/ompi/ticket/4138
2014-01-18 23:38:39 +00:00