4286 Commits

Author SHA1 Message Date
Ralph Castain
11562ab7cb Ensure we build the sensor components even if the local system doesn't have the required directories and/or access permissions. Backend nodes that get the binary may have them, and aggregators need to load the component so they can log data even if they aren't locally monitoring. Detect that we can't access the required files when we first try to sample and turn the sampling portion of the plugin off at that time.
Refs trac:4172

This commit was SVN r30426.

The following Trac tickets were found above:
  Ticket 4172 --> https://svn.open-mpi.org/trac/ompi/ticket/4172
2014-01-25 04:34:33 +00:00
Jeff Squyres
21ffddbbd0 Addendum to r30408: if we're going to remove stale cruft, let's remove
all of it.  :-)

Refs trac:4175.

This commit was SVN r30417.

The following SVN revision numbers were found above:
  r30408 --> open-mpi/ompi@31acdb15bc

The following Trac tickets were found above:
  Ticket 4175 --> https://svn.open-mpi.org/trac/ompi/ticket/4175
2014-01-24 22:19:36 +00:00
Ralph Castain
f73d23e723 Correct the location of the counter when tracking process launch for reporting progress
cmr=v1.7.4:reviewer=hjelmn

This commit was SVN r30415.
2014-01-24 21:03:05 +00:00
Ralph Castain
e3cb4b4a5b Grant Nathan his wish - add a --disable-getpwuid option to configure and protect all users of that code so it disappears if disabled.
cmr=v1.7.5:reviewer=hjelmn:subject=disable getpwuid if requested

This commit was SVN r30413.
2014-01-24 19:18:37 +00:00
Ralph Castain
e496e348a4 Some cleanup of the sensor system to ensure things go in the right place, avoid segfaults under abnormal conditions, etc.
cmr=v1.7.5:reviewer=rhc

This commit was SVN r30409.
2014-01-24 17:29:24 +00:00
Ralph Castain
31acdb15bc We haven't really supported orteCC in a long time, so let's remove the stale cruft. Thanks to Paul Hargrove for noticing!
cmr=v1.7.4:reviewer=jsquyres:subject=remove stale orteCC cruft

This commit was SVN r30408.
2014-01-24 17:26:54 +00:00
Adrian Reber
0af2897c12 removed trailing whitespaces in orte-checkpoint.c
This commit was SVN r30407.
2014-01-24 17:23:49 +00:00
Adrian Reber
659eb1b10a silence two compiler warnings
This commit was SVN r30406.
2014-01-24 17:22:28 +00:00
Adrian Reber
919260a0d2 fix communication between orte-checkpoint and orterun
Right after starting the communication with orterun the buffer
containing the message is deleted. This patch removes the deletion
of the buffer which is now done by orte_rml_send_callback(). This is
now also the callback function used by orte_rml.send_buffer_nb().
The previous callback hnp_receiver() was introduced by an
earlier patch that was only trying to get the code to compile again.

This commit was SVN r30405.
2014-01-24 17:18:28 +00:00
Adrian Reber
8c93ebffeb orte_snapc_base_select() wants to know if it is an application
The function

   int orte_snapc_base_select(bool seed, bool app);

wants to know whether it is called by an application. It therefore
expects 'bool app' as its second parameter. The value passed used to be
'!ORTE_PROC_IS_DAEMON', which is not always correct for a tool or an
HNP. This patch changes it to ORTE_PROC_IS_APP, which correctly
indicates whether the caller is an application.

This commit was SVN r30404.
2014-01-24 17:14:41 +00:00
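Note on 8c93ebffeb: a minimal sketch of why the old expression gave the wrong answer for tools and HNPs, assuming a simplified process-type enum. The names proc_type_t, PROC_*, old_app_flag and new_app_flag are hypothetical and are not ORTE's actual definitions.

```c
#include <stdbool.h>

/* Hypothetical, simplified process types (not ORTE's real definitions). */
typedef enum { PROC_APP, PROC_DAEMON, PROC_HNP, PROC_TOOL } proc_type_t;

/* Old behaviour, modelled on '!ORTE_PROC_IS_DAEMON': anything that is
 * not a daemon is treated as an application. */
bool old_app_flag(proc_type_t t)
{
    return t != PROC_DAEMON;
}

/* New behaviour, modelled on 'ORTE_PROC_IS_APP': only an application
 * is treated as an application. */
bool new_app_flag(proc_type_t t)
{
    return t == PROC_APP;
}
```

Under this model the two flags agree for applications and daemons and differ only for PROC_HNP and PROC_TOOL, which are the cases the commit message calls out.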
Ralph Castain
14bf1c9463 Some minor cleanups:
* don't return null if someone wants to print ORTE_SUCCESS

* rename some stale process types

* keep show_help local if we are in standalone operation as there is nobody to send it to

cmr=v1.7.5:reviewer=jsquyres

This commit was SVN r30400.
2014-01-23 21:35:20 +00:00
Ralph Castain
32996cd705 Add new sensors for chip frequency and power (when permissions allow). Note that we don't support all chipsets at this time, but others are welcome to extend as desired.
cmr=v1.7.5:reviewer=rhc

This commit was SVN r30399.
2014-01-23 21:33:21 +00:00
Ralph Castain
886fee9367 Properly set num_procs when np is not given, but cpus-per-proc is used. Thanks to Tetsuya Mishima for pointing it out
cmr=v1.7.4:reviewer=jsquyres

This commit was SVN r30389.
2014-01-23 05:01:07 +00:00
Ralph Castain
a01470190d Allow a little more flexibility - if getpwuid fails, just use the return from getuid to define the session directory
cmr=v1.7.5:reviewer=jsquyres

This commit was SVN r30388.
2014-01-23 05:00:05 +00:00
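Note on a01470190d: a minimal sketch of the fallback idea, using only the POSIX calls getuid() and getpwuid(). The helper name session_user_label and the way the label is formatted are assumptions for illustration; the actual session directory code may differ.

```c
#include <assert.h>
#include <pwd.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: write a per-user label for a session directory
 * name into buf. Uses the login name when getpwuid() succeeds and falls
 * back to the numeric uid from getuid() otherwise.
 * Returns 0 on success, -1 if buf is too small. */
int session_user_label(char *buf, size_t len)
{
    assert(buf != NULL);

    uid_t uid = getuid();
    struct passwd *pw = getpwuid(uid);
    int n;

    if (pw != NULL && pw->pw_name != NULL) {
        n = snprintf(buf, len, "%s", pw->pw_name);
    } else {
        /* getpwuid failed (e.g., no passwd entry): use the uid itself */
        n = snprintf(buf, len, "%lu", (unsigned long)uid);
    }
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

Either branch yields a non-empty label, so a caller can build a directory name without special-casing systems that lack a passwd entry for the current uid.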
Ralph Castain
de07a64599 Cleanup the sensor code:
* use the global flags for linux and apple being found instead of re-doing the case statements

* update select procedure to ignore components that measure the same thing (e.g., resusage and sigar), taking the higher priority module

cmr=v1.7.5:reviewer=jsquyres:subject=Cleanup the sensor code

This commit was SVN r30368.
2014-01-22 21:01:09 +00:00
Jeff Squyres
7768828d2d Addendum to r30298: tweak the wording of the help messages a bit.
Refs trac:4117.  Please use this commit rather than the patch attached to
the ticket; the patch had a few mistakes in the tweaked wording.

This commit was SVN r30362.

The following SVN revision numbers were found above:
  r30298 --> open-mpi/ompi@58479399c3

The following Trac tickets were found above:
  Ticket 4117 --> https://svn.open-mpi.org/trac/ompi/ticket/4117
2014-01-22 12:17:14 +00:00
Ralph Castain
e0edc29029 Add comment on future work
This commit was SVN r30336.
2014-01-20 19:54:31 +00:00
Ralph Castain
9b2066cfba Add two new sensor modules - one to monitor core temperatures, and the other to monitor resource usage using the sigar library
This commit was SVN r30335.
2014-01-20 19:35:48 +00:00
Ralph Castain
3e9c8497e0 Shift the verbose output a bit
Refs trac:4136

This commit was SVN r30332.

The following Trac tickets were found above:
  Ticket 4136 --> https://svn.open-mpi.org/trac/ompi/ticket/4136
2014-01-20 14:41:37 +00:00
Ralph Castain
5ad9795bd8 Cleanup some potential memory overruns
cmr=v1.7.5:reviewer=jsquyres

This commit was SVN r30331.
2014-01-19 16:31:26 +00:00
Ralph Castain
9f6fd7b98d A few corrections to hostfile parsing - thanks to Tetsuya Mishima for the review
Refs trac:4136

This commit was SVN r30330.

The following Trac tickets were found above:
  Ticket 4136 --> https://svn.open-mpi.org/trac/ompi/ticket/4136
2014-01-19 16:26:12 +00:00
Ralph Castain
657796f9e0 Revert r30327 - turns out it isn't quite right just yet. :-(
Closes trac:4138

This commit was SVN r30328.

The following SVN revision numbers were found above:
  r30327 --> open-mpi/ompi@87d5f86025

The following Trac tickets were found above:
  Ticket 4138 --> https://svn.open-mpi.org/trac/ompi/ticket/4138
2014-01-18 23:38:39 +00:00
Ralph Castain
87d5f86025 Enable use of unix domain sockets for local OOB communications, thereby removing the requirement for an active network interface when running strictly on a single node. Update the overall OOB system to support cross-transport movement of messages so that the OOB can move a received message to another transport for transmission.
cmr=v1.7.5:reviewer=jsquyres:subject=Enable use of unix domain sockets for local OOB communications

This commit was SVN r30327.
2014-01-18 21:36:49 +00:00
Ralph Castain
fcdd904af4 Simplify and update hostfile handling to correctly support hostfiles that list nodes multiple times, once for each slot, and those that list a host once and include an explicit slot count. Eliminate support for mixing those two modes as this logic became just too complex when attempting to handle all the corner cases.
cmr=v1.7.4:reviewer=jsquyres

This commit was SVN r30325.
2014-01-18 16:08:40 +00:00
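Note on fcdd904af4: a minimal sketch of the two hostfile styles the message describes, one line per slot versus one line with an explicit "slots=N" count. The helper count_slots is hypothetical. It handles each style on its own and does not model how the real parser treats a file that mixes them.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: count the slots that `lines` assign to `host`.
 * A line is either "name" (one slot per occurrence) or "name slots=N". */
int count_slots(const char *const *lines, size_t nlines, const char *host)
{
    assert(lines != NULL && host != NULL);

    int total = 0;
    for (size_t i = 0; i < nlines; i++) {
        char name[256];
        int slots = 0;
        /* sscanf returns 1 if only the name matched, 2 if slots=N too */
        int fields = sscanf(lines[i], "%255s slots=%d", name, &slots);
        if (fields < 1 || strcmp(name, host) != 0) {
            continue;   /* blank line or a different host */
        }
        total += (fields == 2) ? slots : 1;
    }
    return total;
}
```

With the lines "node01", "node01", "node02" the helper reports two slots for node01, and with the single line "node01 slots=4" it reports four.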
Ralph Castain
87f34860fe Protect array against crossing boundaries
cmr=v1.7.5:reviewer=jsquyres

This commit was SVN r30316.
2014-01-17 21:36:20 +00:00
Mike Dubman
874c4e2558 PMI2: add missing file from prev commit
Refs trac:4119

This commit was SVN r30301.

The following Trac tickets were found above:
  Ticket 4119 --> https://svn.open-mpi.org/trac/ompi/ticket/4119
2014-01-16 13:17:08 +00:00
Mike Dubman
98234b5a69 SLURM/PMI2: Fix parsing of PMI2 process mapping
fixed by AlexM, reviewed by miked
cmr=v1.7.4:reviewer=rhc

This commit was SVN r30300.
2014-01-16 12:05:29 +00:00
Ralph Castain
58479399c3 As per RFC and telecon, deprecate cmd line options and their corresponding MCA params for old-style mapping and binding directives
cmr=v1.7.5:reviewer=jsquyres:subject=deprecate old-style mapping and binding directives

This commit was SVN r30298.
2014-01-15 14:48:39 +00:00
Ralph Castain
590a87c730 You can't pass static buffer definitions to rml.send as it will attempt to release them upon completion - you need to send dynamically allocated buffers
This commit was SVN r30261.
2014-01-11 19:38:11 +00:00
Ralph Castain
286ff6d552 For large scale systems, we would like to avoid doing a full modex during MPI_Init so that launch will scale a little better. At the moment, our options are somewhat limited as only a few BTLs don't immediately call modex_recv on all procs during startup. However, for those situations where someone can take advantage of it, add the ability to do a "modex on demand" retrieval of data from remote procs when we launch via mpirun.
NOTE: launch performance will be absolutely awful if you do this with BTLs that aren't configured to modex_recv on first message!

Even with "modex on demand", we still have to do a barrier in place of the modex - we simply don't move any data around, which does reduce the time impact. The barrier is required to ensure that the other proc has in fact registered all its BTL info and therefore is prepared to hand over a complete data package. Otherwise, you may not get the info you need. In addition, the shared memory BTL can fail to properly rendezvous as it expects the barrier to be in place.

This behavior will *only* take effect under the following conditions:

1. launched via mpirun

2. #procs is greater than ompi_hostname_cutoff, which defaults to UINT32_MAX

3. mca param rte_orte_direct_modex is set to 1. At the moment, we are having problems getting this param to register properly, so only the first two conditions are in effect. Still, the bottom line is you have to *want* this behavior to get it.

The planned next evolution of this will be to make the direct modex be non-blocking - this will require two fixes:

1. if the remote proc doesn't have the required info, then let it delay its response until it does. This means we need a way for the MPI layer to tell the RTE "I am done entering modex data".

2. adjust the SM rendezvous logic to loop until the required file has been created

Creating a placeholder to bring this over to 1.7.5 when ready.

cmr=v1.7.5:reviewer=hjelmn:subject=Enable direct modex at scale

This commit was SVN r30259.
2014-01-11 17:36:06 +00:00
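Note on 286ff6d552: the three conditions listed in that message can be sketched as a single predicate. The function and parameter names are hypothetical, and the sketch models the intended rule with all three conditions, not the interim state in which the third parameter fails to register.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical predicate for "modex on demand":
 *  1. the job was launched via mpirun,
 *  2. the number of procs exceeds the hostname cutoff,
 *  3. the direct-modex MCA parameter is set. */
bool use_direct_modex(bool launched_by_mpirun,
                      uint32_t num_procs,
                      uint32_t hostname_cutoff,
                      bool direct_modex_param)
{
    return launched_by_mpirun
        && num_procs > hostname_cutoff
        && direct_modex_param;
}
```

If the process count is held in a 32-bit unsigned value, as assumed here, the default cutoff of UINT32_MAX can never be exceeded, so the behavior stays off unless the cutoff is lowered explicitly.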
Ralph Castain
fb9e427320 One last corner case - when encountering an overload condition (e.g., by comm_spawning more procs than we have cores) and we are using the default binding policy, do *not* bind the new procs to anything as this can cause major problems. Instead, let the spawn succeed since the user didn't specifically ask to be bound, and leave the new procs as unbound.
Refs trac:4077

This commit was SVN r30200.

The following Trac tickets were found above:
  Ticket 4077 --> https://svn.open-mpi.org/trac/ompi/ticket/4077
2014-01-09 22:39:34 +00:00
Ralph Castain
24e990e747 Fix comm_spawn for oversubscribed systems by correctly computing the number of available slots
cmr=v1.7.4:reviewer=jsquyres:subject=Fix comm_spawn for oversubscribed systems

This commit was SVN r30197.
2014-01-09 20:33:48 +00:00
Ralph Castain
9fcb46d85a Correctly detect and handle oversubscription for comm_spawn
cmr=v1.7.4:reviewer=jsquyres:subject=Correctly detect and handle oversubscription for comm_spawn

This commit was SVN r30186.
2014-01-09 18:27:51 +00:00
Ralph Castain
6e5fedeb04 Oops - add verbose output to report that we cannot apply the default binding because no cores were detected
Refs trac:4074

This commit was SVN r30185.

The following Trac tickets were found above:
  Ticket 4074 --> https://svn.open-mpi.org/trac/ompi/ticket/4074
2014-01-09 18:17:14 +00:00
Ralph Castain
4cdc291df1 Ensure slurm properly dies on abnormal termination
cmr=v1.7.4:reviewer=jsquyres:subject=Ensure slurm properly dies on abnormal termination

This commit was SVN r30182.
2014-01-09 16:52:02 +00:00
Jeff Squyres
87e476ebd8 Clean up many references to "rank": usually change to "process" and/or
specifically delineate that we're referring to the process' rank in
MPI_COMM_WORLD.

Refs trac:4068

This commit was SVN r30181.

The following Trac tickets were found above:
  Ticket 4068 --> https://svn.open-mpi.org/trac/ompi/ticket/4068
2014-01-09 16:37:49 +00:00
Ralph Castain
7e4748a0f1 Handle nodes that do not report cores, where our default binding policy would fail even though binding is supported, by defaulting to not binding on those nodes.
Thanks to Paul Hargrove for reporting the problem on NetBSD.

cmr=v1.7.4:reviewer=jsquyres:subject=Handle the case of nodes that do not report cores

This commit was SVN r30180.
2014-01-09 16:27:58 +00:00
Ralph Castain
f179f2086b Do a better job of reporting bindings - if someone gives a spec that binds us to all processors, then we are effectively unbound and should report it clearly instead of outputting a long line of B's.
cmr=v1.7.4:reviewer=jsquyres:subject=Do a better job of reporting bindings

This commit was SVN r30179.
2014-01-09 16:16:16 +00:00
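Note on f179f2086b: a minimal sketch of the reporting rule, assuming a simple one-character-per-processor map. The helper format_binding, the "UNBOUND" label and the 'B'/'.' characters are illustrative choices, not the exact output format of the real code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical helper: bound[i] is true if the process may run on
 * processor i. Writes "UNBOUND" when every processor is allowed,
 * otherwise a map with 'B' for bound and '.' for not bound.
 * Returns 0 on success, -1 if out is too small. */
int format_binding(const bool *bound, size_t nprocs, char *out, size_t len)
{
    assert(bound != NULL && out != NULL);

    size_t nbound = 0;
    for (size_t i = 0; i < nprocs; i++) {
        if (bound[i]) {
            nbound++;
        }
    }

    if (nbound == nprocs) {
        /* bound to everything is effectively not bound at all */
        int n = snprintf(out, len, "UNBOUND");
        return (n < 0 || (size_t)n >= len) ? -1 : 0;
    }

    if (len < nprocs + 1) {
        return -1;
    }
    for (size_t i = 0; i < nprocs; i++) {
        out[i] = bound[i] ? 'B' : '.';
    }
    out[nprocs] = '\0';
    return 0;
}
```

The point of the special case is readability: on a node with many processors, a full row of B's carries no more information than the single word.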
Ralph Castain
2a0e4b5e62 Update the orterun help messages and man page to reflect new map/rank/bind options and defaults. Thanks to Paul Hargrove for reporting it.
cmr=v1.7.4:reviewer=jsquyres

This commit was SVN r30173.
2014-01-09 04:44:28 +00:00
Ralph Castain
bf453a2575 Reference the correct variable...sigh
Refs trac:4059

This commit was SVN r30163.

The following Trac tickets were found above:
  Ticket 4059 --> https://svn.open-mpi.org/trac/ompi/ticket/4059
2014-01-08 22:36:39 +00:00
Ralph Castain
80497d73cf Need to mark the daemon as alive so that exit commands are properly routed during abnormal terminations. Also, remove stale references to the "selected oob component" as we no longer require only one component be selected
cmr=v1.7.4:reviewer=jsquyres

This commit was SVN r30162.
2014-01-08 22:35:48 +00:00
Ralph Castain
d5647394d8 Initialize variable so dash-host option gets correctly parsed
cmr=v1.7.4:reviewer=rolfv

This commit was SVN r30159.
2014-01-08 15:17:16 +00:00
Ralph Castain
e724d0d12d Ensure comm_spawn'd jobs get treated the same wrt setting default mapping directives
Refs trac:4059

This commit was SVN r30158.

The following Trac tickets were found above:
  Ticket 4059 --> https://svn.open-mpi.org/trac/ompi/ticket/4059
2014-01-08 15:16:22 +00:00
Ralph Castain
fb650aed0c Fix how we transfer mapping directives to the job, ensuring that directives that can be given outside of a mapping policy (e.g., oversubscribe and no-use-local) are retained.
cmr=v1.7.4:reviewer=jsquyres:subject=Fix how we transfer mapping directives to the job

This commit was SVN r30155.
2014-01-08 04:25:43 +00:00
Ralph Castain
bc75250951 Cleanup the sensor framework close - existing code was using incorrect object type. Don't start sensors if sample rate is zero. Don't add zero-byte data from resusage as it means nothing was measured.
cmr=v1.7.4:reviewer=hjelmn

This commit was SVN r30150.
2014-01-08 02:38:56 +00:00
Jeff Squyres
13b29cff2c This commit complements/completes r30140. r30140 made all the
configury/Makefile.am changes; this commit renames the internal
installdirs.h framework struct field names to match the configury macro
names:

 * pkgdatadir -> ompidatadir
 * pkglibdir -> ompilibdir
 * pkgincludedir -> ompiincludedir

This commit was SVN r30145.

The following SVN revision numbers were found above:
  r30140 --> open-mpi/ompi@8b778903d8
2014-01-07 23:36:33 +00:00
Brian Barrett
8b778903d8 Fix longstanding issue with our multi-project support. Rather than using
pkg{data,lib,includedir}, use our own ompi{data,lib,includedir}, which is
always set to {datadir,libdir,includedir}/openmpi.  This will keep us from
having help files in prefix/share/open-rte when building without Open MPI,
but in prefix/share/openmpi when building with Open MPI.

This commit was SVN r30140.
2014-01-07 22:11:15 +00:00
Mike Dubman
40aadab85f re-enable map-by dist
After the last refactoring in rmaps, map-by dist:hca was disabled.
This commit re-enables it.

found/fixed by Elena, reviewed by miked

cmr=v1.7.4:reviewer=ompi-rm1.7

This commit was SVN r30118.
2014-01-04 20:44:41 +00:00
Ralph Castain
9a855ff58e Update sensor component for new OOB calls
cmr=v1.7.4:reviewer=jsquyres

This commit was SVN r30117.
2014-01-03 22:37:15 +00:00
Ralph Castain
3f2b3c53ea Ensure that rankfile-provided allocations are correctly handled
Fixes trac:4043

cmr=v1.7.4:reviewer=jsquyres:subject=Ensure that rankfile-provided allocations are correctly handled

This commit was SVN r30106.

The following Trac tickets were found above:
  Ticket 4043 --> https://svn.open-mpi.org/trac/ompi/ticket/4043
2014-01-02 16:07:16 +00:00