Ralph Castain
2bc9fd30ee
Orcm sends heartbeats to its daemons, but ORTE needs to continue sending them to the HNP
...
This commit was SVN r30514.
2014-01-31 01:56:01 +00:00
Ralph Castain
193cceb483
Okay, since a certain other RM out there made a fuss about being able to lock their daemons to specified cores, offer the same option here. The MCA param orte_daemon_cores can be used to specify which core(s) you want the orte daemons to use. This will have no bearing on the application procs - unbound will remain unbound, and binding directives will be applied to the apps.
...
Yippee skippee...
This commit was SVN r30513.
2014-01-30 23:50:14 +00:00
Rolf vandeVaart
f7055de78e
Stop listening thread and wait for it to terminate.
...
This commit was SVN r30507.
2014-01-30 20:37:15 +00:00
Ralph Castain
db92ac3ce1
Cleanup role of aggregator relative to daemons
...
Refs trac:4176
This commit was SVN r30495.
The following Trac tickets were found above:
Ticket 4176 --> https://svn.open-mpi.org/trac/ompi/ticket/4176
2014-01-30 00:53:30 +00:00
Adrian Reber
af934fc6e8
removed trailing whitespaces in snapc
...
This commit was SVN r30489.
2014-01-29 21:27:13 +00:00
Adrian Reber
7de34ea201
SNAPC/CRCP/SSTORE: remove compiler warnings
...
This commit was SVN r30488.
2014-01-29 20:52:00 +00:00
Adrian Reber
5f95db3902
SNAPC: use ORTE_WAIT_FOR_COMPLETION with non-blocking receives
...
During the commits to make the C/R code compile again the
blocking receive calls in snapc_full_app.c were
replaced by non-blocking receive calls.
This commit adds ORTE_WAIT_FOR_COMPLETION()
after each non-blocking receive to wait for the data.
This commit was SVN r30487.
2014-01-29 20:46:14 +00:00
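A minimal sketch of the pattern described in r30487: post a non-blocking receive, then spin the event loop until the callback clears a flag. All names here (`post_recv_nb`, `progress_once`, `WAIT_FOR_COMPLETION`, `recv_tracker`) are hypothetical stand-ins, not the RML API or the real ORTE_WAIT_FOR_COMPLETION() macro; the pretend event loop simply delivers the value 42.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical callback type; stands in for the messaging layer's. */
typedef void (*recv_cb_fn)(int value, void *cbdata);

static recv_cb_fn pending_cb = NULL;
static void *pending_cbdata = NULL;

/* Post a non-blocking receive: returns immediately, data arrives later. */
static void post_recv_nb(recv_cb_fn cb, void *cbdata)
{
    pending_cb = cb;
    pending_cbdata = cbdata;
}

/* One turn of a pretend event loop: delivers 42 if a receive is posted. */
static void progress_once(void)
{
    if (pending_cb != NULL) {
        recv_cb_fn cb = pending_cb;
        pending_cb = NULL;
        cb(42, pending_cbdata);
    }
}

/* Spin the event loop until the callback clears the flag. */
#define WAIT_FOR_COMPLETION(flag) \
    do { while (flag) { progress_once(); } } while (0)

typedef struct {
    bool active;   /* true while the receive is outstanding */
    int  value;    /* filled in by the callback */
} recv_tracker;

static void on_recv(int value, void *cbdata)
{
    recv_tracker *t = (recv_tracker *)cbdata;
    t->value = value;
    t->active = false;   /* releases the waiter */
}

/* Blocking semantics rebuilt on top of the non-blocking call. */
static int recv_blocking(void)
{
    recv_tracker t = { .active = true, .value = 0 };
    post_recv_nb(on_recv, &t);
    WAIT_FOR_COMPLETION(t.active);   /* without this, t.value is read too early */
    return t.value;
}
```

Without the wait, the caller would read the result before the callback has run, which is the breakage the commit describes.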
Adrian Reber
fa1036f38c
SSTORE/CRCP: use ORTE_WAIT_FOR_COMPLETION with non-blocking receives
...
During the commits to make the C/R code compile again, the
blocking receive calls were replaced by non-blocking calls,
which broke the code. This patch uses ORTE_WAIT_FOR_COMPLETION()
to wait until the non-blocking calls have finished.
This commit was SVN r30486.
2014-01-29 20:30:35 +00:00
Adrian Reber
d5c1e33900
SSTORE: use dynamic buffers for rml.send and rml.recv
...
The sstore component was still using static buffers
for send_buffer_nb(). This patch changes opal_buffer_t buffer;
to opal_buffer_t *buffer;
This commit was SVN r30485.
2014-01-29 20:06:23 +00:00
Adrian Reber
2900f24b67
SNAPC: use dynamic buffers for rml.send and rml.recv
...
The snapc component was still using static buffers
for send_buffer_nb(). This patch changes opal_buffer_t buffer;
to opal_buffer_t *buffer;
This commit was SVN r30484.
2014-01-29 19:58:33 +00:00
Ralph Castain
4e3d12d9c1
Fix suicide operation when an MPI app loses connection to its local daemon. In that scenario, we correctly call back up to the MPI layer, notifying it of the lost connection. However, when the MPI layer calls back down to tell the RTE to abort, it passes back a flag indicating we should report that error to our local daemon - which is dead. This leads to an infinite loop. Break it by checking the flag indicating that an abnormal termination was ordered by the RTE, and in that case don't attempt to send the message.
...
cmr=v1.7.4:reviewer=jsquyres
This commit was SVN r30475.
2014-01-29 16:56:54 +00:00
Ralph Castain
410a3afa7b
Fix --without-hwloc operations - must default to map-by slot in that scenario
...
cmr=v1.7.4:reviewer=jsquyres
This commit was SVN r30474.
2014-01-29 16:54:05 +00:00
Ralph Castain
42eb0bbe1b
Fix --without-hwloc builds
...
cmr=v1.7.4:reviewer=jsquyres
This commit was SVN r30462.
2014-01-28 17:10:32 +00:00
Ralph Castain
c874ce3b61
Don't look for the coretemp file when configuring as it might not be on the head node, but is available on the backend
...
Refs trac:4176
This commit was SVN r30461.
The following Trac tickets were found above:
Ticket 4176 --> https://svn.open-mpi.org/trac/ompi/ticket/4176
2014-01-28 16:15:12 +00:00
Ralph Castain
84a0ab3a75
Ah @$#!$#% - missed one last help message that needs to be corrected.
...
cmr=v1.7.4:reviewer=jsquyres:subject=correct help message
This commit was SVN r30449.
2014-01-28 04:03:24 +00:00
Ralph Castain
941bfd4604
Final cleanup of cpus-per-proc for 1.7.4 - provide better checking for cpus-per-proc and mismatched mapping/binding directives, and provide error messages telling the user what to do to get it right.
...
cmr=v1.7.4:reviewer=jsquyres
This commit was SVN r30438.
2014-01-27 22:40:51 +00:00
Ralph Castain
53b1be5067
Only report launch progress when specifically requested to do so. Thanks to Tetsuya Mishima for spotting it.
...
Reviewed by rhc and RM-approved
cmr=v1.7.4:reviewer=ompi-gk1.7
This commit was SVN r30434.
2014-01-27 15:17:42 +00:00
Ralph Castain
956aab03a7
Track the origin of a message so it can be passed across transports
...
Refs trac:4184
This commit was SVN r30433.
The following Trac tickets were found above:
Ticket 4184 --> https://svn.open-mpi.org/trac/ompi/ticket/4184
2014-01-26 21:09:26 +00:00
Ralph Castain
11562ab7cb
Ensure we build the sensor components even if the local system doesn't have the required directories and/or access permissions. Backend nodes that get the binary may have them, and aggregators need to load the component so they can log data even if they aren't locally monitoring. Detect that we can't access the required files when we first try to sample and turn the sampling portion of the plugin off at that time.
...
Refs trac:4172
This commit was SVN r30426.
The following Trac tickets were found above:
Ticket 4172 --> https://svn.open-mpi.org/trac/ompi/ticket/4172
2014-01-25 04:34:33 +00:00
Ralph Castain
f73d23e723
Correct the location of the counter when tracking process launch for reporting progress
...
cmr=v1.7.4:reviewer=hjelmn
This commit was SVN r30415.
2014-01-24 21:03:05 +00:00
Ralph Castain
e3cb4b4a5b
Grant Nathan his wish - add a --disable-getpwuid option to configure and protect all users of that code so it disappears if disabled.
...
cmr=v1.7.5:reviewer=hjelmn:subject=disable getpwuid if requested
This commit was SVN r30413.
2014-01-24 19:18:37 +00:00
Ralph Castain
e496e348a4
Some cleanup of the sensor system to ensure things go in the right place, avoid segfaults under abnormal conditions, etc.
...
cmr=v1.7.5:reviewer=rhc
This commit was SVN r30409.
2014-01-24 17:29:24 +00:00
Adrian Reber
8c93ebffeb
orte_snapc_base_select() wants to know if it is an application
...
The function
int orte_snapc_base_select(bool seed, bool app);
needs to know whether it is called by an application. It therefore
expects 'bool app' as its second parameter. The caller used to pass
'!ORTE_PROC_IS_DAEMON', which is not always correct if the process is
a tool or an HNP. This patch changes it to ORTE_PROC_IS_APP, which
correctly indicates whether the process is an application.
This commit was SVN r30404.
2014-01-24 17:14:41 +00:00
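To illustrate why r30404 matters, here is a small sketch comparing the two tests. The `proc_role` enum and both helper functions are hypothetical; the real code uses the ORTE_PROC_IS_* macros rather than an enum.

```c
#include <stdbool.h>

/* Hypothetical process roles, for illustration only. */
typedef enum { ROLE_APP, ROLE_DAEMON, ROLE_HNP, ROLE_TOOL } proc_role;

/* Old test: "anything that is not a daemon" counts as an application. */
static bool old_app_flag(proc_role r) { return r != ROLE_DAEMON; }

/* New test: only a real application process counts. */
static bool new_app_flag(proc_role r) { return r == ROLE_APP; }
```

The two tests agree for applications and daemons, and differ for the HNP and for tools, where the old test wrongly answered "application".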
Ralph Castain
32996cd705
Add new sensors for chip frequency and power (when permissions allow). Note that we don't support all chipsets at this time, but others are welcome to extend as desired.
...
cmr=v1.7.5:reviewer=rhc
This commit was SVN r30399.
2014-01-23 21:33:21 +00:00
Ralph Castain
886fee9367
Properly set num_procs when np is not given, but cpus-per-proc is used. Thanks to Tetsuya Mishima for pointing it out
...
cmr=v1.7.4:reviewer=jsquyres
This commit was SVN r30389.
2014-01-23 05:01:07 +00:00
Ralph Castain
de07a64599
Cleanup the sensor code:
...
* use the global flags for linux and apple being found instead of re-doing the case statements
* update select procedure to ignore components that measure the same thing (e.g., resusage and sigar), taking the higher priority module
cmr=v1.7.5:reviewer=jsquyres:subject=Cleanup the sensor code
This commit was SVN r30368.
2014-01-22 21:01:09 +00:00
Jeff Squyres
7768828d2d
Addendum to r30298: tweak the wording of the help messages a bit.
...
Refs trac:4117. Please use this commit rather than the patch attached to
the ticket; the patch had a few mistakes in the tweaked wording.
This commit was SVN r30362.
The following SVN revision numbers were found above:
r30298 --> open-mpi/ompi@58479399c3
The following Trac tickets were found above:
Ticket 4117 --> https://svn.open-mpi.org/trac/ompi/ticket/4117
2014-01-22 12:17:14 +00:00
Ralph Castain
e0edc29029
Add comment on future work
...
This commit was SVN r30336.
2014-01-20 19:54:31 +00:00
Ralph Castain
9b2066cfba
Add two new sensor modules - one to monitor core temperatures, and the other to monitor resource usage using the sigar library
...
This commit was SVN r30335.
2014-01-20 19:35:48 +00:00
Ralph Castain
5ad9795bd8
Cleanup some potential memory overruns
...
cmr=v1.7.5:reviewer=jsquyres
This commit was SVN r30331.
2014-01-19 16:31:26 +00:00
Ralph Castain
657796f9e0
Revert r30327 - turns out it isn't quite right just yet. :-(
...
Closes trac:4138
This commit was SVN r30328.
The following SVN revision numbers were found above:
r30327 --> open-mpi/ompi@87d5f86025
The following Trac tickets were found above:
Ticket 4138 --> https://svn.open-mpi.org/trac/ompi/ticket/4138
2014-01-18 23:38:39 +00:00
Ralph Castain
87d5f86025
Enable use of unix domain sockets for local OOB communications, thereby removing the requirement for an active network interface when running strictly on a single node. Update the overall OOB system to support cross-transport movement of messages so that the OOB can move a received message to another transport for transmission.
...
cmr=v1.7.5:reviewer=jsquyres:subject=Enable use of unix domain sockets for local OOB communications
This commit was SVN r30327.
2014-01-18 21:36:49 +00:00
Ralph Castain
fcdd904af4
Simplify and update hostfile handling to correctly support hostfiles that list nodes multiple times, once for each slot, and those that list a host once and include an explicit slot count. Eliminate support for mixing those two modes as this logic became just too complex when attempting to handle all the corner cases.
...
cmr=v1.7.4:reviewer=jsquyres
This commit was SVN r30325.
2014-01-18 16:08:40 +00:00
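The two hostfile styles named in r30325 can be sketched as a small slot counter. This is an illustrative parser under stated assumptions, not the ORTE hostfile code: `host_entry` and `parse_hostfile` are hypothetical names, and treating any repeat of a host that carries an explicit `slots=N` as an error is this sketch's own reading of "no mixing".

```c
#include <stdio.h>
#include <string.h>

#define MAX_HOSTS 64

/* Hypothetical record type, not an ORTE structure. */
typedef struct {
    char name[64];
    int  slots;
    int  explicit_count;   /* 1 if the entry came from "slots=N" */
} host_entry;

/* Parse hostfile lines into tab (capacity MAX_HOSTS).
 * Returns the number of distinct hosts, or -1 on error or mixed styles. */
static int parse_hostfile(const char *lines[], int nlines, host_entry *tab)
{
    int nhosts = 0;
    for (int i = 0; i < nlines; i++) {
        char name[64];
        int slots = 0;
        int fields = sscanf(lines[i], "%63s slots=%d", name, &slots);
        if (fields < 1) continue;                 /* blank line */
        int is_explicit = (fields == 2);
        if (is_explicit && slots < 1) return -1;  /* nonsensical count */

        host_entry *e = NULL;
        for (int j = 0; j < nhosts; j++) {
            if (strcmp(tab[j].name, name) == 0) { e = &tab[j]; break; }
        }

        if (e == NULL) {
            if (nhosts == MAX_HOSTS) return -1;
            e = &tab[nhosts++];
            strcpy(e->name, name);
            e->slots = is_explicit ? slots : 1;
            e->explicit_count = is_explicit;
        } else if (e->explicit_count || is_explicit) {
            return -1;     /* assumption: repeats may not involve slots=N */
        } else {
            e->slots++;    /* style 1: one line per slot */
        }
    }
    return nhosts;
}
```

Listing `n1` twice and `n2` once yields two hosts with 2 and 1 slots; a single `n1 slots=4` line yields one host with 4 slots; combining both forms for the same host is rejected.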
Ralph Castain
87f34860fe
Protect array against crossing boundaries
...
cmr=v1.7.5:reviewer=jsquyres
This commit was SVN r30316.
2014-01-17 21:36:20 +00:00
Mike Dubman
874c4e2558
PMI2: add missing file from prev commit
...
Refs trac:4119
This commit was SVN r30301.
The following Trac tickets were found above:
Ticket 4119 --> https://svn.open-mpi.org/trac/ompi/ticket/4119
2014-01-16 13:17:08 +00:00
Mike Dubman
98234b5a69
SLURM/PMI2: Fix parsing of PMI2 process mapping
...
fixed by AlexM, reviewed by miked
cmr=v1.7.4:reviewer=rhc
This commit was SVN r30300.
2014-01-16 12:05:29 +00:00
Ralph Castain
58479399c3
As per RFC and telecon, deprecate cmd line options and their corresponding MCA params for old-style mapping and binding directives
...
cmr=v1.7.5:reviewer=jsquyres:subject=deprecate old-style mapping and binding directives
This commit was SVN r30298.
2014-01-15 14:48:39 +00:00
Ralph Castain
590a87c730
You can't pass static buffer definitions to rml.send as it will attempt to release them upon completion - you need to send dynamically allocated buffers
...
This commit was SVN r30261.
2014-01-11 19:38:11 +00:00
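The ownership rule behind r30261 can be sketched with a toy send path that frees its buffer on completion. Everything here is hypothetical (`buffer_t`, `buffer_new`, `buffer_append`, `send_nb`, `send_int`); it stands in for opal_buffer_t and rml.send only in shape, and is not their API.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical buffer type; stands in for opal_buffer_t. */
typedef struct {
    unsigned char *data;
    size_t len;
} buffer_t;

static int buffers_released = 0;

/* Allocate an empty buffer on the heap. */
static buffer_t *buffer_new(void)
{
    return calloc(1, sizeof(buffer_t));
}

/* Append n bytes to the buffer, growing it as needed. */
static int buffer_append(buffer_t *b, const void *src, size_t n)
{
    unsigned char *p = realloc(b->data, b->len + n);
    if (p == NULL) return -1;
    memcpy(p + b->len, src, n);
    b->data = p;
    b->len += n;
    return 0;
}

/* The send path takes ownership and frees the buffer on completion.
 * Passing the address of a stack or static buffer_t here would hand
 * free() a pointer that never came from malloc: undefined behavior. */
static int send_nb(buffer_t *b)
{
    /* ... transmit b->data[0 .. b->len) ... */
    free(b->data);
    free(b);
    buffers_released++;
    return 0;
}

/* Caller side: allocate dynamically, then hand the buffer over. */
static int send_int(int value)
{
    buffer_t *b = buffer_new();          /* heap, not "buffer_t b;" */
    if (b == NULL) return -1;
    if (buffer_append(b, &value, sizeof value) != 0) {
        free(b->data);
        free(b);
        return -1;
    }
    return send_nb(b);                   /* b must not be touched afterwards */
}
```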
Ralph Castain
286ff6d552
For large scale systems, we would like to avoid doing a full modex during MPI_Init so that launch will scale a little better. At the moment, our options are somewhat limited as only a few BTLs don't immediately call modex_recv on all procs during startup. However, for those situations where someone can take advantage of it, add the ability to do a "modex on demand" retrieval of data from remote procs when we launch via mpirun.
...
NOTE: launch performance will be absolutely awful if you do this with BTLs that aren't configured to modex_recv on first message!
Even with "modex on demand", we still have to do a barrier in place of the modex - we simply don't move any data around, which does reduce the time impact. The barrier is required to ensure that the other proc has in fact registered all its BTL info and therefore is prepared to hand over a complete data package. Otherwise, you may not get the info you need. In addition, the shared memory BTL can fail to properly rendezvous as it expects the barrier to be in place.
This behavior will *only* take effect under the following conditions:
1. launched via mpirun
2. #procs is greater than ompi_hostname_cutoff, which defaults to UINT32_MAX
3. mca param rte_orte_direct_modex is set to 1. At the moment, we are having problems getting this param to register properly, so only the first two conditions are in effect. Still, the bottom line is you have to *want* this behavior to get it.
The planned next evolution of this will be to make the direct modex be non-blocking - this will require two fixes:
1. if the remote proc doesn't have the required info, then let it delay its response until it does. This means we need a way for the MPI layer to tell the RTE "I am done entering modex data".
2. adjust the SM rendezvous logic to loop until the required file has been created
Creating a placeholder to bring this over to 1.7.5 when ready.
cmr=v1.7.5:reviewer=hjelmn:subject=Enable direct modex at scale
This commit was SVN r30259.
2014-01-11 17:36:06 +00:00
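The three conditions listed in r30259 amount to a simple predicate. The sketch below shows the intended three-way check; per the commit message, the third condition was not yet being honored at the time. The struct and function names are hypothetical, and only `ompi_hostname_cutoff` and `rte_orte_direct_modex` come from the message itself.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical settings struct; field names are illustrative. */
typedef struct {
    bool     launched_by_mpirun;
    uint32_t hostname_cutoff;     /* mirrors ompi_hostname_cutoff (default UINT32_MAX) */
    bool     direct_modex_param;  /* mirrors rte_orte_direct_modex */
} modex_settings;

/* True only when all three conditions from the commit message hold. */
static bool use_direct_modex(const modex_settings *s, uint32_t nprocs)
{
    return s->launched_by_mpirun
        && nprocs > s->hostname_cutoff
        && s->direct_modex_param;
}
```

With the default cutoff of UINT32_MAX, no 32-bit process count can exceed it, so the feature stays off unless the cutoff is lowered explicitly.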
Ralph Castain
fb9e427320
One last corner case - when encountering an overload condition (e.g., by comm_spawning more procs than we have cores) and we are using the default binding policy, do *not* bind the new procs to anything as this can cause major problems. Instead, let the spawn succeed since the user didn't specifically ask to be bound, and leave the new procs as unbound.
...
Refs trac:4077
This commit was SVN r30200.
The following Trac tickets were found above:
Ticket 4077 --> https://svn.open-mpi.org/trac/ompi/ticket/4077
2014-01-09 22:39:34 +00:00
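The corner case in r30200 can be sketched as a small decision function. The types and names (`bind_target`, `bind_policy`, `effective_binding`) are hypothetical, and "overloaded" is taken here to mean more procs than cores on the node, following the example in the commit message.

```c
#include <stdbool.h>

/* Hypothetical binding types, for illustration only. */
typedef enum { BIND_NONE, BIND_CORE } bind_target;

typedef struct {
    bind_target target;
    bool        user_specified;   /* false when the default policy is in use */
} bind_policy;

/* Decide what the newly spawned procs should be bound to. */
static bind_target effective_binding(bind_policy policy, int nprocs, int ncores)
{
    bool overloaded = nprocs > ncores;

    /* Default policy + overload: let the spawn succeed, leave procs unbound. */
    if (overloaded && !policy.user_specified) {
        return BIND_NONE;
    }

    /* A user-specified policy under overload is not covered by this
     * commit message; the sketch simply passes the request through. */
    return policy.target;
}
```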
Ralph Castain
24e990e747
Fix comm_spawn for oversubscribed systems by correctly computing the number of available slots
...
cmr=v1.7.4:reviewer=jsquyres:subject=Fix comm_spawn for oversubscribed systems
This commit was SVN r30197.
2014-01-09 20:33:48 +00:00
Ralph Castain
9fcb46d85a
Correctly detect and handle oversubscription for comm_spawn
...
cmr=v1.7.4:reviewer=jsquyres:subject=Correctly detect and handle oversubscription for comm_spawn
This commit was SVN r30186.
2014-01-09 18:27:51 +00:00
Ralph Castain
6e5fedeb04
Oops - add verbose output to report that we cannot apply the default binding because no cores were detected
...
Refs trac:4074
This commit was SVN r30185.
The following Trac tickets were found above:
Ticket 4074 --> https://svn.open-mpi.org/trac/ompi/ticket/4074
2014-01-09 18:17:14 +00:00
Ralph Castain
4cdc291df1
Ensure slurm properly dies on abnormal termination
...
cmr=v1.7.4:reviewer=jsquyres:subject=Ensure slurm properly dies on abnormal termination
This commit was SVN r30182.
2014-01-09 16:52:02 +00:00
Ralph Castain
7e4748a0f1
Handle the case of nodes that do not report cores, where our default binding policy would fail even though binding is supported, by defaulting to not binding on those nodes.
...
Thanks to Paul Hargrove for reporting the problem on NetBSD.
cmr=v1.7.4:reviewer=jsquyres:subject=Handle the case of nodes that do not report cores
This commit was SVN r30180.
2014-01-09 16:27:58 +00:00
Ralph Castain
f179f2086b
Do a better job of reporting bindings - if someone gives a spec that binds us to all processors, then we are effectively unbound and should report it clearly instead of outputting a long line of B's.
...
cmr=v1.7.4:reviewer=jsquyres:subject=Do a better job of reporting bindings
This commit was SVN r30179.
2014-01-09 16:16:16 +00:00
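A rough sketch of the reporting rule in r30179, assuming a binding is represented as a 64-bit mask with one bit per processor. `format_binding` is a hypothetical helper, and both the message wording and the `B`/`.` layout are illustrative rather than the exact output of the real code.

```c
#include <stdint.h>
#include <stdio.h>

/* Write a human-readable binding report into out.
 * Assumes 1 <= ncpus <= 64; bit i of bound is set if we are bound to cpu i. */
static void format_binding(uint64_t bound, int ncpus, char *out, size_t outlen)
{
    uint64_t all = (ncpus >= 64) ? UINT64_MAX : ((UINT64_C(1) << ncpus) - 1);

    if (outlen == 0) return;

    /* Bound to every processor: effectively unbound, so say so. */
    if ((bound & all) == all) {
        snprintf(out, outlen, "not bound (or bound to all available processors)");
        return;
    }

    /* Otherwise print one character per processor. */
    size_t pos = 0;
    for (int i = 0; i < ncpus && i < 64 && pos + 1 < outlen; i++) {
        out[pos++] = ((bound >> i) & 1) ? 'B' : '.';
    }
    out[pos] = '\0';
}
```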
Ralph Castain
bf453a2575
Reference the correct variable...sigh
...
Refs trac:4059
This commit was SVN r30163.
The following Trac tickets were found above:
Ticket 4059 --> https://svn.open-mpi.org/trac/ompi/ticket/4059
2014-01-08 22:36:39 +00:00
Ralph Castain
80497d73cf
Need to mark the daemon as alive so that exit commands are properly routed during abnormal terminations. Also, remove stale references to the "selected oob component" as we no longer require only one component be selected
...
cmr=v1.7.4:reviewer=jsquyres
This commit was SVN r30162.
2014-01-08 22:35:48 +00:00
Ralph Castain
e724d0d12d
Ensure comm_spawn'd jobs get treated the same wrt setting default mapping directives
...
Refs trac:4059
This commit was SVN r30158.
The following Trac tickets were found above:
Ticket 4059 --> https://svn.open-mpi.org/trac/ompi/ticket/4059
2014-01-08 15:16:22 +00:00
Ralph Castain
fb650aed0c
Fix how we transfer mapping directives to the job, ensuring that directives that can be given outside of a mapping policy (e.g., oversubscribe and no-use-local) are retained.
...
cmr=v1.7.4:reviewer=jsquyres:subject=Fix how we transfer mapping directives to the job
This commit was SVN r30155.
2014-01-08 04:25:43 +00:00