Commit graph

4733 commits

Author SHA1 Message Date
Mike Dubman
f83d6045aa ORTE: undeprecate -x var=val in mpirun
mpirun -x var=val is back; it is actually a useful alias for -mca mca_base_env_list "var=val"
2014-11-12 10:51:15 +02:00
Ralph Castain
780c93ee57 Per the PR and discussion on today's telecon, extend the process name definition as a two-field struct of uint32_t's down to the OPAL layer. This resolves issues created by prior commits that impacted both heterogeneous and SPARC support. This also simplifies the OMPI code base by removing the need for frequent memcpy's when transitioning between the OMPI/ORTE layers and OPAL.
We recognize that this means other users of OPAL will need to "wrap" the opal_process_name_t if they desire to abstract it in some fashion. This is regrettable, and we are looking at possible alternatives that might mitigate that requirement. Meantime, however, we have to put the needs of the OMPI community first, and are taking this step to restore hetero and SPARC support.
2014-11-11 17:00:42 -08:00
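The idea in the commit above (a process name made of two 32-bit fields that can be converted field by field between host and network byte order) can be sketched as follows. This is an illustrative Python sketch, not Open MPI's C code; the field names `jobid` and `vpid` and the helper names are assumptions made for the example.

```python
import struct

# Assumed layout for illustration: two unsigned 32-bit fields.
# "!II" packs both fields in network (big-endian) byte order.
NAME_FORMAT = "!II"


def pack_name(jobid, vpid):
    """Serialize a (jobid, vpid) pair field by field in network byte order."""
    return struct.pack(NAME_FORMAT, jobid, vpid)


def unpack_name(data):
    """Recover the (jobid, vpid) pair, independent of the host's endianness."""
    return struct.unpack(NAME_FORMAT, data)


wire = pack_name(0x12340001, 7)
print(wire.hex())
print(unpack_name(wire))
```

Because each field is converted on its own, a big-endian and a little-endian host agree on the wire format without treating the name as a single 64-bit value.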
Ralph Castain
d0704ef118 Restore handling of physical processors in rankfiles. Note that the prior implementation was likely incorrect as it falsely assumed that physical core indices were unique, which isn't always true. Stipulate that physical rankfiles can only include PU numbers, and bind the result to the core that contains that physical PU. Update the mpirun man page to cover the new use-case. 2014-11-10 14:00:40 -08:00
Ralph Castain
2a90788724 Support physical processor ids in rankfile 2014-11-10 14:00:40 -08:00
Ralph Castain
8c837d3cb3 Doh - if we can't output an entire block, then we need to adjust the number of bytes remaining to be output or else we will output duplicate bytes when next we are able to write. 2014-11-07 13:13:13 -08:00
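The bug class described above (forgetting to adjust the remaining byte count after a partial write) can be sketched like this. It is a minimal Python illustration, not the ORTE code; `write_some` and `limited_writer` are hypothetical stand-ins for a non-blocking write call.

```python
def drain(buffer, write_some):
    """Write all of `buffer` using a writer that may accept only part of it.

    `write_some(chunk)` is a hypothetical callback that returns how many
    bytes it actually accepted (possibly fewer than len(chunk)).
    """
    offset = 0
    remaining = len(buffer)
    while remaining > 0:
        written = write_some(buffer[offset:offset + remaining])
        if written == 0:
            break  # writer is not ready; a caller would retry later from `offset`
        # Advance past what was written so the next attempt does not repeat it.
        offset += written
        remaining -= written
    return offset


sink = bytearray()


def limited_writer(chunk, limit=3):
    """Toy writer that accepts at most `limit` bytes per call."""
    accepted = chunk[:limit]
    sink.extend(accepted)
    return len(accepted)


drain(b"hello world", limited_writer)
```

If `offset` and `remaining` were not updated after each partial write, the same leading bytes would be sent again on the next attempt.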
Ralph Castain
b56b744041 Silence some warnings and remove debug output 2014-11-07 07:54:01 -08:00
Elena
03fc809bc9 This commit adds a new dstore component, sm, which is used for communication between the pmix server and clients on the same node via shared memory. 2014-11-06 16:01:19 +02:00
Ralph Castain
738c3e1d72 Ensure that mpirun correctly selects the HNP ess component without attempting to init the PMI subsystem as mpirun won't be supported anyway, so let's avoid the error message. Also, daemons launched by the plm/slurm component must use the ess/slurm module as we cannot trust the Slurm PMI_Init functions to correctly tell us when PMI support is available. 2014-11-03 21:35:42 -08:00
Ralph Castain
6fbc68c830 Update the grpcomm direct component's priority so it sits at the bottom of the list, as it should now that the other components are active. Clean up the signature print function a touch to make it more readable. Remove the unneeded xcast functions in brks and rcd components as we will just fall thru to using the "direct" one 2014-11-03 14:43:17 -08:00
Gilles Gouaillardet
652ecdb888 oob/tcp: always include a missing header file
improve open-mpi/ompi@c9d1e16a9e
2014-10-29 13:39:23 +09:00
Gilles Gouaillardet
eef7590e58 wrappers: add the $(EXEEXT) extension to the installed symbolic links 2014-10-28 16:42:51 +09:00
Gilles Gouaillardet
c9d1e16a9e oob/tcp: include a missing header file
warning can be seen under cygwin without the missing header file
2014-10-28 13:56:25 +09:00
Ralph Castain
64fae47d85 Ensure that the proxy "pull" of output gets directed to the requesting tool, which is no longer just the sender since the HNP may be making the request on behalf of someone else 2014-10-23 10:21:21 -07:00
Ralph Castain
526682e2f9 Add the ability for a tool that requests spawn of a job to also request forwarding of all output to the tool. The tool is responsible for its own call to push its stdin to the new job. The push request can come -after- the job is started, but the pull request has to be done during the spawn procedure or else output can be lost. 2014-10-23 08:16:49 -07:00
Ralph Castain
894acb0aa8 configury: new OPAL_SET_MCA_PREFIX/ORTE_SET_MCA_CMD_LINE_ID macros
These two macros set the MCA prefix and MCA cmd line id,
   respectively.  Specifically, MCA parameters will be named
   PREFIX<foo> in the environment, and the cmd line will use
   -ID foo bar.

   These macros must be called during configure.ac and a value
   supplied. In the case of Open MPI, the values given are
   PREFIX=OMPI_MCA_ and ID=mca.

   Other projects (such as ORCM) will call these macros with
   their own unique values.  For example, ORCM uses PREFIX=ORCM_MCA_
   and ID=omca

   This scheme is necessary to allow running Open MPI applications under
   systems that use their own versions of ORTE and OPAL.  For example,
   when running OMPI applications under ORCM, we need the MCA params passed
   to the ORCM daemons to be separated from those recognized by the OMPI application.
2014-10-22 18:57:40 -07:00
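The separation described above, where each project only recognizes environment variables carrying its own MCA prefix, might look like the sketch below. The prefixes `OMPI_MCA_` and `ORCM_MCA_` come from the commit message; the function name and the sample environment are assumptions for illustration, and this is Python rather than the project's configure/C code.

```python
def collect_params(environ, prefix):
    """Return {param_name: value} for variables that start with `prefix`."""
    return {
        key[len(prefix):]: value
        for key, value in environ.items()
        if key.startswith(prefix)
    }


# Hypothetical environment seen by both an ORCM daemon and an OMPI application.
env = {"OMPI_MCA_btl": "tcp", "ORCM_MCA_btl": "sm", "PATH": "/bin"}

print(collect_params(env, "OMPI_MCA_"))
print(collect_params(env, "ORCM_MCA_"))
```

With distinct prefixes, the same parameter name can carry different values for the daemon and for the application without either one picking up the other's setting.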
Jeff Squyres
c22e1ae33b configury: new OPAL_SET_LIB_PREFIX/ORTE_SET_LIB_PREFIX macros
These two macros set the prefix for the OPAL and ORTE libraries,
respectively.  Specifically, the OPAL library will be named
libPREFIXopen-pal.la and the ORTE library will be named
libPREFIXopen-rte.la.

These macros must be called, even if the prefix argument is empty.

The intent is that Open MPI will call these macros with an empty
prefix, but other projects (such as ORCM) will call these macros with
a non-empty prefix.  For example, ORCM libraries can be named
liborcm-open-pal.la and liborcm-open-rte.la.

This scheme is necessary to allow running Open MPI applications under
systems that use their own versions of ORTE and OPAL.  For example,
when running MPI applications under ORTE, if the ORTE and OPAL
libraries between OMPI and ORCM are not identical (which, because they
are released at different times, are likely to be different), we need
to ensure that the OMPI applications link against their ORTE and OPAL
libraries, but the ORCM executables link against their ORTE and OPAL
libraries.
2014-10-22 10:32:19 -07:00
Jeff Squyres
01fd96bfa5 Revert "Provide a mechanism by which an upstream project can rename
the OPAL and ORTE libraries. This is required by projects such as ORCM
that have their own ORTE and OPAL libraries in order to avoid library
confusion. By renaming their version of the libraries, the OMPI
applications can correctly dynamically load the correct one for their
build."

This reverts commit 63f619f871.
2014-10-22 10:32:11 -07:00
Jeff Squyres
206eade32c mpirun.1in: whitespace cleanup
Whitespace cleanup only; no content changes.
2014-10-20 05:18:25 -07:00
Jeff Squyres
9529289319 mpirun.1in: more updates about binding/etc.
Follow on to 91e9686 and f9d620e.
2014-10-20 05:17:49 -07:00
Ralph Castain
91e96861dd Cleanup the orterun man page per review by Gus Correa 2014-10-19 10:21:50 -07:00
Ralph Castain
f9d620e3a7 Update the orterun man page 2014-10-16 21:05:04 -07:00
Ralph Castain
ecbae03009 Fix typo 2014-10-16 13:30:06 -07:00
Ralph Castain
b6aa691e0a Fix incorrect implementation of new MCA param mca_base_env_list - it was not picking up envars and forwarding them, but only worked if you explicitly set a value for the envar. Ensure it works for both direct and indirect launch modes. Remove stale code as this replaced orte_forward_envars. Ensure it doesn't get passed to the ORTE daemons. 2014-10-16 12:58:56 -07:00
Gilles Gouaillardet
b5aea782ce Revert "Fix heterogeneous support"
Per the discussion at http://www.open-mpi.org/community/lists/devel/2014/10/16050.php

This reverts commit c9c5d4011b.
2014-10-16 12:24:38 +09:00
Gilles Gouaillardet
c9c5d4011b Fix heterogeneous support
* redefine orte_process_name_t so it can be converted
  between host and network format as an opal_identifier_t
  aka uint64_t by the OPAL layer.
* correctly send OPAL_DSTORE_ARCH key
2014-10-15 17:19:13 +09:00
Ralph Castain
1ae34da5e5 Add an attributes parameter to the dstore.open function so we can pass directives to the active storage component. This can, for example, include the backing file info for a new shared memory segment. 2014-10-10 12:13:25 -07:00
Ralph Castain
63f619f871 Provide a mechanism by which an upstream project can rename the OPAL and ORTE libraries. This is required by projects such as ORCM that have their own ORTE and OPAL libraries in order to avoid library confusion. By renaming their version of the libraries, the OMPI applications can correctly dynamically load the correct one for their build. 2014-10-10 11:39:08 -07:00
Ralph Castain
1be1654e5f Correctly identify the synonym for orte_direct_modex_cutoff as ompi_hostname_cutoff 2014-10-10 06:05:06 -07:00
Ralph Castain
4fc4a8346b Fix a couple of minor issues. Ensure usock isn't used if the session dirs aren't setup. Protect an oddball case where orte_xml_fp is NULL. 2014-10-09 20:58:46 -07:00
Elena
b937b31693 fix for multiple spawn test 2014-10-09 06:18:16 +02:00
Elena
3d65799236 pmix: fixed ugly bug which caused many strange hangs 2014-10-09 06:17:03 +02:00
Elena
c905fe9b78 pmix: removed pmix_base_direct modex mca parameter, renamed orte_full_modex_cutoff and ompi_hostname_cutoff to direct_modex_cutoff 2014-10-09 06:15:31 +02:00
Elena
e319c95267 fixes for grpcomm rcd/brucks algorithms 2014-10-09 06:12:26 +02:00
Ralph Castain
fd6a044b7f Cleanup some cruft resulting from the move of the btl's to opal. We had created the ability to delay modex operations, which included a need to delay retrieving hostname info for remote procs. This allowed us to not retrieve the modex info until first message unless required - the hostname is generally only required for debug and error messages.
Properly setup the opal_process_info structure early in the initialization procedure. Define the local hostname right at the beginning of opal_init so all parts of opal can use it. Overlay that during orte_init as the user may choose to remove fqdn and strip prefixes during that time. Setup the job_session_dir and other such info immediately when it becomes available during orte_init.
2014-10-03 16:02:57 -06:00
Howard Pritchard
d2bb8d8829 remove alps ess component
The alps ess component is obsolete.  It relies on header
files only present in very old CLE (Cray Linux)  3.X for
the Cray XT series.  As support for these systems is being
dropped starting with release 1.9, this code is being removed.
2014-10-03 13:17:33 -06:00
Jeff Squyres
413e775dbf version configury: make dist now works
Update the VERSION file scheme:

* Remove "want_repo_rev".
* Add "tarball_version".

All values are now always included (major, minor, release, greek,
repo_rev).  However, configure.ac now runs "opal_get_version.sh
... --tarball", which will return the value of tarball_version (if it
is non-empty) or the "full" version string (i.e.,
"major.minor.releasegreek").
2014-10-02 11:32:54 -07:00
hpp
8ded59ce0f fix alps plm to allow explicit host placement
It turns out that the alps plm code was developed only
    on cray systems that were running batch schedulers.

    However, for bring up and development systems, it's not
    at all uncommon for there to be no batch scheduler, and
    thus to orte it appears that orte_num_allocated_nodes
    is always zero.  This forces a user using mpirun on such
    a system to always specify a host list:

    mpirun -n 4 -N 1 -host 32,45,68 ....

    just to get the job to run, but then since the -L argument for aprun
    is never built, the app always runs on the first batch of nodes that
    aprun finds available.
2014-10-02 10:42:01 -06:00
Jeff Squyres
72704441a2 URLs: update URLs for GitHub 2014-10-01 14:44:09 -07:00
Ralph Castain
84810b80fd Cover the remaining code paths for Java apps to define class path
Refs trac:4926

This commit was SVN r32823.

The following Trac tickets were found above:
  Ticket 4926 --> https://svn.open-mpi.org/trac/ompi/ticket/4926
2014-09-30 22:27:03 +00:00
Ralph Castain
040a69c38b Correct the classpath to correctly include the local directory so Java programs find the application class
cmr=v1.8.4:reviewer=jsquyres

This commit was SVN r32817.
2014-09-30 16:35:12 +00:00
Ralph Castain
4320457394 Fix the debug output - you can't print the cpuset pointer using the %p format without generating warnings
This commit was SVN r32811.
2014-09-29 17:10:38 +00:00
Howard Pritchard
f8ac8bb6b0 remove improper use of hwloc_bitmap_free
When using the native aprun launcher, it was observed that
there were frequent memory corruption errors occurring either
during a PMI kvs-fence operation, or at mpi termination during
opal cleanup of allocated objects.  This was especially bad
when using

aprun --c none

In some cases, the application would even just hang in finalize
if using ptmalloc, owing to some kind of infinite loop in
cleanup of small blocks, etc.

It turns out that the problem was in orte_ess_base_proc_binding's
improper use of opal_hwloc_base_get_available_cpus.  The cpuset
(bitmap) returned from that function is not meant to be freed
by the caller.

This problem is likely never observed when using the mpirun launcher
as there's an early exit if the OMPI_MCA_orte_bound_at_launch
environment variable is set.

This commit was SVN r32809.
2014-09-29 16:10:37 +00:00
Gilles Gouaillardet
9661e4537f oob/tcp: fix a race condition
Mimic the btl/tcp protocol to solve the race condition that happens
when two peers try to connect to each other at the same time

cmr=v1.8.4:reviewer=rhc

This commit was SVN r32799.
2014-09-26 06:54:30 +00:00
Ralph Castain
17846411c3 Now that we have an ORTE thread running in apps, we can't just call "exit"
during RTE abort as that is happening in a thread, and (at least in some
environments) doesn't result in the main thread being immediately
terminated. Instead, we wind up going thru orte_finalize in the main
thread, which isn't what we want.

So replace the call to "exit" with the "quick exit" variant "_exit", which
causes the entire process to exit immediately.

(custom patch has been posted for 1.8.3)

This commit was SVN r32780.
2014-09-23 22:51:10 +00:00
Howard Pritchard
1508a01325 Fixes to enable mpirun to work again on Cray
The ess pmi module was not handling aprun launched
daemons.  All daemons were thinking they were vpid 1.

Also, turns out that on cray systems using MOM nodes
for launched jobs, just detecting whether or not a
process is in a PAGG container is not sufficient.

Crank up the priority of the alps PLM component in the
event that the configure detected the presence of both
slurm and alps.

Have the ESS pmi component open the pmix framework and
select a pmix component.

This commit was SVN r32773.
2014-09-23 15:37:26 +00:00
Gilles Gouaillardet
5fa2b6c59c oob/tcp: fix a race condition
Refs trac:4909

This commit was SVN r32754.

The following Trac tickets were found above:
  Ticket 4909 --> https://svn.open-mpi.org/trac/ompi/ticket/4909
2014-09-18 08:17:25 +00:00
Ralph Castain
3a437cbdb3 Silence set-but-not-used warning when timing isn't enabled
This commit was SVN r32749.
2014-09-17 00:40:10 +00:00
Ralph Castain
414f4e9783 Try to provide a real hostname for the remote host to aid in debugging
Refs trac:4908

This commit was SVN r32748.

The following Trac tickets were found above:
  Ticket 4908 --> https://svn.open-mpi.org/trac/ompi/ticket/4908
2014-09-17 00:39:49 +00:00
Jeff Squyres
9dc49c5f92 oob_tcp_connection: print "<unknown>" instead of "NULL"
"NULL" doesn't mean anything to the user, and is somewhat confusing
to see in an error message.  "<unknown>" at least indicates that
there's an error, and we know who the peer is.

This commit was SVN r32747.
2014-09-16 22:47:57 +00:00
Ralph Castain
09aecea55a Can't use show_help as the RML has already been enabled, but we haven't successfully connected back to the HNP. So use opal_output instead and hardwire the message.
Refs trac:4908

This commit was SVN r32746.

The following Trac tickets were found above:
  Ticket 4908 --> https://svn.open-mpi.org/trac/ompi/ticket/4908
2014-09-16 22:21:02 +00:00
Ralph Castain
4bbc9a28d6 Try to resolve the simultaneous connection problem by being a little more careful about the choice of returned status when a connection is refused. As before, have the higher vpid of the two peers retry the connection, while the lower one waits. This can happen in a couple of places, so try to hit them all. Since this is hard to test, will ask Gilles to give it a try since he's the one who is seeing it.
cmr=v1.8.3:reviewer=rhc

This commit was SVN r32744.
2014-09-16 18:59:36 +00:00
Ralph Castain
a74428513d Provide a better help message when we are unable to complete a connection due to a firewall.
cmr=v1.8.3:reviewer=jsquyres

This commit was SVN r32743.
2014-09-16 16:28:29 +00:00
Ralph Castain
dfb952fa78 [Contribution from Artem - moved it to svn from git for him]
Replace our old, clunky timing setup with a much nicer one that is only available if configured with --enable-timing. Add a tool for profiling clock differences between the nodes so you can get more precise timing measurements. I'll ask Artem to update the Github wiki with full instructions on how to use this setup.

This commit was SVN r32738.
2014-09-15 18:00:46 +00:00
Jeff Squyres
e95ed94a94 plm_rsh_module.c: output to the framework output
Trivial fix from r32686: don't output to stream 0, but rather to
orte_plm_base_framework.framework_output (this is the way it was
before r32686).  In reality, this is going to end up being stream 0,
anyway, but we might as well be pedantically correct...

Refs trac:4897.

This commit was SVN r32726.

The following SVN revision numbers were found above:
  r32686 --> open-mpi/ompi@4df1aa63f7

The following Trac tickets were found above:
  Ticket 4897 --> https://svn.open-mpi.org/trac/ompi/ticket/4897
2014-09-13 00:46:35 +00:00
Ralph Castain
0445052a1c Check for multiple declarations of a given MCA param and error out if detected as that can create an ambiguous definition of the param value.
Refs trac:4897

This commit was SVN r32719.

The following Trac tickets were found above:
  Ticket 4897 --> https://svn.open-mpi.org/trac/ompi/ticket/4897
2014-09-12 22:21:30 +00:00
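A check of the kind described above could be sketched as follows. This assumes, for illustration only, that parameters arrive as `-mca <name> <value>` triples on a command line; the function name is hypothetical and the real check in Open MPI may cover other sources as well.

```python
def find_duplicate_params(argv, flag="-mca"):
    """Return names of parameters that appear more than once after `flag`."""
    seen = set()
    duplicates = []
    i = 0
    while i < len(argv):
        # A declaration needs the flag plus a name and a value after it.
        if argv[i] == flag and i + 2 < len(argv):
            name = argv[i + 1]
            if name in seen and name not in duplicates:
                duplicates.append(name)
            seen.add(name)
            i += 3
        else:
            i += 1
    return duplicates


cmd = ["mpirun", "-mca", "btl", "tcp", "-mca", "btl", "sm",
       "-mca", "oob", "tcp", "./a.out"]
dups = find_duplicate_params(cmd)
if dups:
    print("ambiguous MCA params: " + ", ".join(dups))
```

A launcher using such a check would report the duplicates and stop, rather than silently choosing one of the conflicting values.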
Ralph Castain
9e7e90265f Temporarily make the direct grpcomm component the default until we can debug the other modules
This commit was SVN r32707.
2014-09-11 14:47:54 +00:00
Ralph Castain
4eb6291334 Avoid conflicts when multiple collectives are underway in ORTE by giving each grpcomm component its own RML tag and posting persistent receives. We use the signature anyway to determine which collective the received message is addressing, so there is no need to post non-persistent receives.
This commit was SVN r32703.
2014-09-10 17:36:16 +00:00
Ralph Castain
ea11e63f59 Per patch from Tetsuya, allow the user to bind-to none when specifying multiple pe's/rank as requested by Reuti. This allows the user to reserve multiple "slots" in the allocation for each process while mapping, but not to bind the process to specific processing elements on the node.
Reviewed by rhc, so RM-approved to go across to v1.8.3

cmr=v1.8.3:reviewer=ompi-gk1.8

This commit was SVN r32701.
2014-09-10 15:52:18 +00:00
Ralph Castain
e671620ac7 Per request from Jeff: tune up the help messages for binding options
Refs trac:4898

This commit was SVN r32691.

The following Trac tickets were found above:
  Ticket 4898 --> https://svn.open-mpi.org/trac/ompi/ticket/4898
2014-09-09 22:39:22 +00:00
Gilles Gouaillardet
63209eac5b orte/util: use ORTE_JOB_FAMILY and ORTE_LOCAL_JOBID macros
This commit was SVN r32688.
2014-09-09 05:13:00 +00:00
Ralph Castain
4207b4c4ad Improve the --bind-to help message to better indicate the default options under various values of np. Remove the warning message if the user doesn't specify a binding policy and we are overloaded
cmr=v1.8.3:reviewer=jsquyres

This commit was SVN r32687.
2014-09-08 21:03:51 +00:00
Ralph Castain
4df1aa63f7 Since we've run into the situation where someone puts a script wrapper around a launcher such as srun, we need to always protect MCA cmd line params with quotes. This means we also need to protect the backend from quotes coming into the system as part of a value, or else the parser gets confused.
So add a new function for wrapping MCA arguments, and tell the backend parser to ignore/remove leading/trailing quotes.

cmr=v1.8.3:reviewer=jsquyres

This commit was SVN r32686.
2014-09-08 20:38:46 +00:00
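The two halves of the change above (always quote values on the way out, and tolerate quotes on the way in) can be illustrated with a small sketch. Both function names are hypothetical and the exact quoting rules used by Open MPI are not stated in the log, so treat this as one plausible reading.

```python
def wrap_mca_value(value):
    """Front end: protect the value with double quotes before passing it on."""
    return '"' + value + '"'


def strip_outer_quotes(value):
    """Back end: remove one matching pair of leading/trailing quotes, if any."""
    if len(value) >= 2 and value[0] == value[-1] and value[0] in ('"', "'"):
        return value[1:-1]
    return value


quoted = wrap_mca_value("eth0,ib0 extra")
print(quoted)
print(strip_outer_quotes(quoted))
```

The round trip leaves the original value unchanged, and values that were never quoted pass through the back end untouched.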
Ralph Castain
6323b226c7 Bring over some updates from the PMIx branch - mostly just minor cleanups. Make the direct grpcomm component no longer be the default. For now, we seem to be having problems with non-blocking fence operations, so make them not be the default under any scenario (e.g., when sm is the only btl in operation).
This commit was SVN r32673.
2014-09-06 19:19:44 +00:00
Ralph Castain
94ffca4901 Correct the cutoff point for full modex operation as it is based on the number of nodes in the system, not the number of procs in the signature.
This commit was SVN r32666.
2014-09-03 17:28:12 +00:00
Ralph Castain
2bfb18e004 Resolve some race conditions when async pmix modex modes are invoked. Since calls to "get" data can come both locally and remotely before data for a given proc has actually been received, we have to track all requests that cannot be immediately fulfilled and provide the data once it has been received.
This commit was SVN r32664.
2014-09-02 20:04:17 +00:00
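The bookkeeping described above (remember every "get" that cannot be answered yet, then answer it once the data arrives) can be sketched as below. The class and method names are assumptions for this example, and the sketch ignores threading, which the real code has to handle.

```python
class ModexStore:
    """Toy store that defers lookups until the requested data has arrived."""

    def __init__(self):
        self.data = {}      # proc -> received data
        self.pending = {}   # proc -> callbacks waiting for that data

    def get(self, proc, callback):
        if proc in self.data:
            callback(self.data[proc])  # data already here: answer immediately
        else:
            self.pending.setdefault(proc, []).append(callback)

    def put(self, proc, value):
        self.data[proc] = value
        # Satisfy every request that was waiting on this proc.
        for callback in self.pending.pop(proc, []):
            callback(value)


results = []
store = ModexStore()
store.get("proc-1", results.append)   # arrives before the data: queued
store.put("proc-1", "endpoint-info")  # data arrives: queued request is served
```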
Ralph Castain
4d186e6402 Properly protect the MCA parameters being registered by the OOB/TCP component when IPv6 is enabled
cmr=v1.8.3:reviewer=jsquyres

This commit was SVN r32662.
2014-09-02 14:53:00 +00:00
Ralph Castain
f2b26bde4c Resolve a race condition that could cause us to hang during abnormal terminations due to multi-counting num_terminated
This commit was SVN r32660.
2014-09-02 00:32:52 +00:00
Ralph Castain
e49ca05f11 Remove unused variable
This commit was SVN r32651.
2014-08-31 03:11:50 +00:00
Ralph Castain
5cdbc00136 Re-enable the usock oob component. Ensure the TCP component promotes messages for other procs to the OOB base so that other components have a chance to send the relay. Seems to be passing MTT, so let's see how it works for others.
This commit was SVN r32650.
2014-08-30 19:33:46 +00:00
Ralph Castain
a2085a5916 Fix the PSM transport key generator to match prior releases
This commit was SVN r32649.
2014-08-30 00:48:25 +00:00
Ralph Castain
cb0739dfd4 Update the regex to resolve a bug
This commit was SVN r32647.
2014-08-29 22:24:20 +00:00
Ralph Castain
8faabed2cd Add some further initialization and protection for zero-byte messages
This commit was SVN r32644.
2014-08-29 17:24:55 +00:00
Ralph Castain
2b225e3776 Cleanup a race condition regarding marking that waitpid_fired. We should always mark it as fired when we enter the wait_local_proc routine, and also mark it as no longer alive if iof_complete has also been found. If other places in the code also update those flags, there is no harm done.
This commit was SVN r32643.
2014-08-29 17:03:31 +00:00
Ralph Castain
730e28349e Some minor uninitialized variable cleanups
This commit was SVN r32629.
2014-08-29 02:21:13 +00:00
Ralph Castain
fafdbeec0c Cleanup and enable the new daemon collective modules for more scalable operations. Thanks to Nadezhda Kogteva (Mellanox) for doing them.
This commit was SVN r32624.
2014-08-28 20:35:35 +00:00
Ralph Castain
731a878ff3 Add a bunch of debug to help track down the problem, and eventually find another place where comparison of signatures was incorrectly performed - use the dss compare operation to be consistent and safe
This commit was SVN r32620.
2014-08-27 19:52:20 +00:00
Ralph Castain
5fb7c7d23b Don't explicitly add the hostname to the data fetch when we already cached a remote blob
This commit was SVN r32619.
2014-08-27 16:18:05 +00:00
Ralph Castain
3c24770bce Protect debug printing on backend nodes
This commit was SVN r32618.
2014-08-27 16:17:28 +00:00
Ralph Castain
b87b69e977 Ensure the nodes get added to the job map on the remote nodes, add some debug to grpcomm daemon array construction
This commit was SVN r32617.
2014-08-27 16:16:46 +00:00
Ralph Castain
842aaf6167 Correctly end mapping oversubscribed nodes round-robin byslot
cmr=v1.8.3:reviewer=rhc

This commit was SVN r32616.
2014-08-27 16:15:18 +00:00
Gilles Gouaillardet
2679629a12 pmix: fix compilation when configured with --without-hwloc
This commit was SVN r32604.
2014-08-26 08:31:05 +00:00
Ralph Castain
1221e8a96f Compare the full signature - thanks to Gilles for identifying the problem
This commit was SVN r32595.
2014-08-25 14:52:06 +00:00
Ralph Castain
5a13cdb739 Fix a race condition caused by a bad attribute flag that created an OR instead of an AND condition check
This commit was SVN r32587.
2014-08-22 22:48:16 +00:00
Ralph Castain
039b7acfb5 Fix the quoting algorithm so only rsh command lines get quoted values
cmr=v1.8.2:reviewer=jsquyres

This commit was SVN r32586.
2014-08-22 22:47:38 +00:00
Ralph Castain
f00af81c1d Little more cleanup under the abort cases cited by Gilles. All seem to be working now
This commit was SVN r32585.
2014-08-22 19:57:57 +00:00
Ralph Castain
b1a7375192 Fix the "unreachable" message so it outputs the correct hostname for the remote proc. Cleanup some of the pmix stuff when running corner cases of errors
This commit was SVN r32584.
2014-08-22 19:20:45 +00:00
Ralph Castain
6ff2a60829 Handle the non-blocking fence case correctly, and ensure we always at least pass back the hostname of the process whose info is being requested so that the ompi_proc_t can correctly initialize it when we are in a non-blocking fence with np < cutoff scenario
This commit was SVN r32578.
2014-08-22 14:26:24 +00:00
Ralph Castain
8f1b9b463e Fix shared memory operations - need to pass the local topology and cpusets of all local peers so we can properly compute relative locality for them. Also need to set default locality to "on node" in case where cpusets are not passed because procs are not bound.
This commit was SVN r32577.
2014-08-22 05:17:51 +00:00
Ralph Castain
c6f78d6e54 The PMI ess component now gets used for more than direct launch, so only set standalone_operation flag if no daemon uri is available so we aggregate show_help messages
This commit was SVN r32574.
2014-08-22 03:00:56 +00:00
Ralph Castain
aec5cd08bd Per the PMIx RFC:
WHAT:    Merge the PMIx branch into the devel repo, creating a new
               OPAL “pmix” framework to abstract PMI support for all RTEs.
               Replace the ORTE daemon-level collectives with a new PMIx
               server and update the ORTE grpcomm framework to support
               server-to-server collectives

WHY:      We’ve had problems dealing with variations in PMI implementations,
               and need to extend the existing PMI definitions to meet exascale
               requirements.

WHEN:   Mon, Aug 25

WHERE:  https://github.com/rhc54/ompi-svn-mirror.git

Several community members have been working on a refactoring of the current PMI support within OMPI. Although the APIs are common, Slurm and Cray implement a different range of capabilities, and package them differently. For example, Cray provides an integrated PMI-1/2 library, while Slurm separates the two and requires the user to specify the one to be used at runtime. In addition, several bugs in the Slurm implementations have caused problems requiring extra coding.

All this has led to a slew of #if’s in the PMI code and bugs when the corner-case logic for one implementation accidentally traps the other. Extending this support to other implementations would have increased this complexity to an unacceptable level.

Accordingly, we have:

* created a new OPAL “pmix” framework to abstract the PMI support, with separate components for Cray, Slurm PMI-1, and Slurm PMI-2 implementations.

* Replaced the current ORTE grpcomm daemon-based collective operation with an integrated PMIx server, and updated the grpcomm APIs to provide more flexible, multi-algorithm support for collective operations. At this time, only the xcast and allgather operations are supported.

* Replaced the current global collective id with a signature based on the names of the participating procs. This allows an unlimited number of collectives to be executed by any group of processes, subject to the requirement that only one collective can be active at a time for a unique combination of procs. Note that a proc can be involved in any number of simultaneous collectives - it is the specific combination of procs that is subject to the constraint.

* removed the prior OMPI/OPAL modex code

* added new macros for executing modex send/recv to simplify use of the new APIs. The send macros allow the caller to specify whether or not the BTL supports async modex operations - if so, then the non-blocking “fence” operation is used, if the active PMIx component supports it. Otherwise, the default is a full blocking modex exchange as we currently perform.

* retained the current flag that directs us to use a blocking fence operation, but only to retrieve data upon demand

This commit was SVN r32570.
2014-08-21 18:56:47 +00:00
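The signature scheme described in the commit above, where a collective is identified by the set of participating procs rather than by a global id, can be sketched as follows. Representing a proc name as a `(jobid, vpid)` tuple and making the signature order-independent are assumptions of this sketch, not statements about the actual implementation.

```python
def make_signature(procs):
    """Build a key from the participating process names (order-independent here)."""
    return tuple(sorted(procs))


class CollectiveTracker:
    """Allows one active collective per unique combination of procs."""

    def __init__(self):
        self.active = set()

    def start(self, procs):
        sig = make_signature(procs)
        if sig in self.active:
            raise RuntimeError("a collective is already active for this group")
        self.active.add(sig)
        return sig

    def finish(self, sig):
        self.active.discard(sig)


tracker = CollectiveTracker()
sig_a = tracker.start([(1, 0), (1, 1)])
sig_b = tracker.start([(1, 0), (1, 2)])  # proc (1, 0) takes part in two collectives
```

Proc `(1, 0)` participates in both collectives at once; only a second collective over exactly the same group would be rejected.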
Ralph Castain
b4511913f6 Remove an unnecessary optimization that can cause more trouble than it's worth - just try all the addresses that are given to us.
Refs trac:4870

This commit was SVN r32558.

The following Trac tickets were found above:
  Ticket 4870 --> https://svn.open-mpi.org/trac/ompi/ticket/4870
2014-08-20 20:58:07 +00:00
Ralph Castain
fa28710d53 Track down the last piece of the connection problem. It appears that
providing a netmask of 0 to opal_net_samenetwork results in everything
looking like it is on the same network. Hence, we were not retaining any
of the alternative addresses, so we had no other way to check them.

Refs trac:4870

This commit was SVN r32556.

The following Trac tickets were found above:
  Ticket 4870 --> https://svn.open-mpi.org/trac/ompi/ticket/4870
2014-08-20 16:55:36 +00:00
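The effect described above, where a netmask of 0 makes every pair of addresses look like the same network, follows from how a mask comparison works. The sketch below is a generic IPv4 prefix comparison in Python, not the implementation of `opal_net_samenetwork`; the function name and prefix-length parameter are assumptions.

```python
import ipaddress


def same_network(addr_a, addr_b, prefix_len):
    """Return True if both IPv4 addresses share the first `prefix_len` bits."""
    a = int(ipaddress.IPv4Address(addr_a))
    b = int(ipaddress.IPv4Address(addr_b))
    # A prefix length of 0 yields an all-zero mask, so every bit is ignored.
    mask = (0xFFFFFFFF << (32 - prefix_len)) & 0xFFFFFFFF
    return (a & mask) == (b & mask)


print(same_network("10.0.0.1", "192.168.1.5", 0))
print(same_network("10.0.0.1", "192.168.1.5", 8))
```

With a zero mask both sides of the comparison reduce to 0, so unrelated addresses compare equal and alternative addresses would be treated as redundant.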
Ralph Castain
343038af7b Frazzle-frump! Missed that we reset the peer state just before the new check.
Refs trac:4870

This commit was SVN r32554.

The following Trac tickets were found above:
  Ticket 4870 --> https://svn.open-mpi.org/trac/ompi/ticket/4870
2014-08-19 22:34:49 +00:00
Ralph Castain
0a91fdf85f If an initial address fails to connect, record that fact and attempt the next address for that proc. If nothing succeeds, then declare failure.
cmr=v1.8.2:reviewer=edgar

This commit was SVN r32553.
2014-08-19 19:48:24 +00:00
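The fallback behavior described above (record a failed address, move to the next one, and declare failure only when none succeeds) can be sketched like this. `try_connect` is a hypothetical callback standing in for the real connection attempt, and the addresses are made up for the example.

```python
def connect_to_peer(addresses, try_connect):
    """Try each known address in order; return the first that connects.

    `try_connect(addr)` is a hypothetical callback returning True on success.
    Returns (connected_address, list_of_failed_addresses).
    """
    failed = []
    for addr in addresses:
        if try_connect(addr):
            return addr, failed
        failed.append(addr)  # record the failure and move on to the next address
    raise ConnectionError("all addresses failed: " + ", ".join(failed))


peer_addrs = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]
addr, failed = connect_to_peer(peer_addrs, lambda a: a == "10.0.0.2")
```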
Ralph Castain
024572cb6c Sigh - I promised to remove these deprecation warnings back in June. My apologies to Dave Goodell and others who requested it.
cmr=v1.8.2:reviewer=dgoodell:subject=remove deprecation warnings for pernode, npernode, and npersocket

This commit was SVN r32552.
2014-08-19 19:40:20 +00:00
Jeff Squyres
1551339eba rsh: revert part of r32517: keep the quoting
As part of reviewing CMR #4860, I talked through r32517 with Ralph.

In an attempt to fix various rsh quoting problems, r32517 removed all the
quoting from the main code path and then only added it back in at the
end in some cases.

This commit puts back the quoting parts that were removed in r32517
(r32517 fixed 2 other important bugs: a) change "--<foo>" to "--mca
<foo_equivalent> 1" so that de-duplication works, and b) change a !=
to ==).

refs trac:4860

This commit was SVN r32524.

The following SVN revision numbers were found above:
  r32517 --> open-mpi/ompi@7342bce58f

The following Trac tickets were found above:
  Ticket 4860 --> https://svn.open-mpi.org/trac/ompi/ticket/4860
2014-08-13 19:27:10 +00:00
Gilles Gouaillardet
f96d382d1d Fix typo.
Thanks to Christopher Samuel for reporting it

This commit was SVN r32520.
2014-08-13 05:54:59 +00:00
Ralph Castain
7342bce58f Cleanup the over-aggressive quoting of params on the orted cmd line. Remove duplicates caused by passing on both cmd line shortcuts and the mca param version of the same thing.
Fixes trac:4857

cmr=v1.8.2:reviewer=jsquyres

This commit was SVN r32517.

The following Trac tickets were found above:
  Ticket 4857 --> https://svn.open-mpi.org/trac/ompi/ticket/4857
2014-08-13 03:51:04 +00:00
George Bosilca
de7191132d Remove few warnings.
This commit was SVN r32506.
2014-08-11 13:34:44 +00:00
Gilles Gouaillardet
a873f45a90 Fix r32460 race condition resolution when procs call MPI_Abort.
Do not invoke orte_session_dir_finalize(...) so that
orte_ess_base_app_abort(...) can successfully create
<orte_process_info.proc_session_dir>/aborted

cmr=v1.8.2:reviewer=rhc

This commit was SVN r32498.

The following SVN revision numbers were found above:
  r32460 --> open-mpi/ompi@abedb97be4
2014-08-11 05:50:32 +00:00
Gilles Gouaillardet
e184733ef6 check-help-strings cleanup
This commit was SVN r32496.
2014-08-11 03:26:21 +00:00
Gilles Gouaillardet
f24699623f check-help-strings cleanup
This commit was SVN r32495.
2014-08-11 03:25:22 +00:00
Gilles Gouaillardet
c3c364a262 check-help-strings cleanup
This commit was SVN r32494.
2014-08-11 03:22:05 +00:00
Gilles Gouaillardet
d9e0212e0e check-help-strings cleanup
This commit was SVN r32493.
2014-08-11 03:21:08 +00:00
Gilles Gouaillardet
d139f75db4 check-help-strings cleanup
This commit was SVN r32492.
2014-08-11 03:20:37 +00:00
Ralph Castain
abedb97be4 Resolve race condition when procs call MPI_Abort. Since we go thru the errmgr instead of the normal proc termination routines, we need to ensure we mark that the proc has fired its waitpid and is no longer alive. Otherwise, the local daemon won't terminate because it thinks there is still a local proc alive and we hang.
Thanks to Gilles for tracking it down.

cmr=v1.8.2:reviewer=rhc

This commit was SVN r32460.
2014-08-08 15:58:49 +00:00
Ralph Castain
8ea576c870 I have no idea how they did it, but someone managed to write a test that circled around and around and eventually reached this point with a NULL pointer. So protect against that possibility.
This commit was SVN r32434.
2014-08-05 16:20:46 +00:00
Ralph Castain
42c5073aa3 Safely cleanup the opal_proc_t structure for non-MPI procs.
This commit was SVN r32402.
2014-08-01 16:38:49 +00:00
Ralph Castain
7758528d72 Apparently, someone else is destructing the opal_proc_t, so don't destruct it ourselves
This commit was SVN r32400.
2014-08-01 14:54:22 +00:00
Ralph Castain
d29a5ab69d Okay, now handle the non-MPI apps
This commit was SVN r32399.
2014-08-01 14:49:25 +00:00
Ralph Castain
daeb9b6c4f Some more cleanups. Remove direct references to ORTE by changing OMPI_CAST_ORTE_NAME -> OMPI_CAST_RTE_NAME. Ensure that ORTE tools (mpirun, orted, tools) set the OPAL proc structure fields so OPAL knows what is going on and uses the correct print functions (still need to fix the problem for non-MPI apps). Properly return uint32_t from the opal utilities instead of int32_t as that is what the ORTE process name fields contain.
Thanks to Gilles for pointing out some of the discrepancies.

This commit was SVN r32398.
2014-08-01 14:44:11 +00:00
Ralph Castain
8cfadd1842 Per Paul Hargrove, add missing include to fix build under FreeBSD
RM-approved

cmr=v1.8.2:reviewer=ompi-gk1.8

This commit was SVN r32397.
2014-08-01 13:37:41 +00:00
George Bosilca
daa076995a orte_rmaps_numa_node_t -> opal_rmaps_numa_node_t
This commit was SVN r32380.
2014-07-31 19:58:47 +00:00
Ralph Castain
5db717f090 Some small leak cleanups
cmr=v1.8.3:reviewer=artpol

This commit was SVN r32358.
2014-07-30 15:46:02 +00:00
Ralph Castain
98b5a86a58 Now that we are using the radix routed module, teach it how to behave nicely with singletons
This commit was SVN r32312.
2014-07-24 22:46:17 +00:00
Ralph Castain
0cad281a92 Single-word cmd line values for orted are dealt with in orte_plm_base_orted_append_basic_args, so protect against special characters there. Have the rsh module only deal with multi-word arguments as those were skipped by orte_plm_base_orted_append_basic_args.
Refs trac:4802

This commit was SVN r32293.

The following Trac tickets were found above:
  Ticket 4802 --> https://svn.open-mpi.org/trac/ompi/ticket/4802
2014-07-23 17:06:51 +00:00
Jeff Squyres
4da3c85b54 fortran: revert Absoft-based fixes
Revert r32246, r32254, and r32255 -- they were fixing side-effects of
the real bug.  Real fix coming after this one.

This commit was SVN r32286.

The following SVN revision numbers were found above:
  r32246 --> open-mpi/ompi@08d2a1a48d
  r32254 --> open-mpi/ompi@232d4dbb7b
2014-07-22 21:49:22 +00:00
Ralph Castain
a94a97bd50 Cleanup the passing of MCA params on the orted cmd line in ssh by ensuring that we quote all values since they could be multi-word and/or contain special characters. Thanks to Dirk Schubert for pointing it out.
cmr=v1.8.2:reviewer=jsquyres

This commit was SVN r32280.
2014-07-22 18:22:06 +00:00
Ralph Castain
2f579806ae Make the radix routed component the default pending repair/completion of debruijn option
This commit was SVN r32276.
2014-07-22 16:48:51 +00:00
Ralph Castain
6c5e592785 Revert r32222, r32210, and r32203 as they created a problem when daemon collectives did not involve app procs on every node. Instead, modify the ompi/mca/rte/orte/rte_orte.h to add a new function that allows apps to request new daemon collective ids for use in barrier and modex operations. This will only appear in ORTE-based installations, but it is only being used by a couple of researchers at the moment.
Update the orte/test/mpi/coll_test.c test to show the revised example.

This commit was SVN r32234.

The following SVN revision numbers were found above:
  r32203 --> open-mpi/ompi@a523dba41d
  r32210 --> open-mpi/ompi@2ce11ed5c4
  r32222 --> open-mpi/ompi@d55f16db50
2014-07-15 03:48:00 +00:00
Ralph Castain
1feaffbb15 Get the blasted singleton comm_spawn working again. There remain problems with the Slurm interaction in this use-case as the PMI components (if configured to build) try to run even when a Slurm allocation hasn't been made, but I leave that to someone else to resolve. I did, however, tell the Slurm ess to quit interfering with applications launched in this use-case by ORTE daemons, so things do work when inside a Slurm allocation.
Also discovered that the rsh launcher is not picking up --enable-orterun-prefix-by-default when invoked during singleton comm_spawn, but I was unable to see why that was happening and ran out of time.

cmr=v1.8.2:reviewer=rhc

This commit was SVN r32229.
2014-07-13 14:47:22 +00:00
Ralph Castain
d55f16db50 Fix a hang in daemon collectives when run on multinode systems
This commit was SVN r32222.
2014-07-12 00:43:12 +00:00
Ralph Castain
2ce11ed5c4 Fix one spot missed by recent commit
This commit was SVN r32210.
2014-07-10 20:08:57 +00:00
Ralph Castain
a523dba41d NOTE: this modifies the MPI-RTE interface
We have been getting several requests for new collectives that need to be inserted in various places of the MPI layer, all in support of either checkpoint/restart or various research efforts. Until now, this would require that the collective id's be generated at launch, which required modifications to ORTE and other places. We chose not to make collectives reusable as the race conditions associated with resetting collective counters are daunting.

This commit extends the collective system to allow self-generation of collective id's that the daemons need to support, thereby allowing developers to request any number of collectives for their work. There is one restriction: RTE collectives must occur at the process level - i.e., we don't currently have a way of tagging the collective to a specific thread. From the comment in the code:

 * In order to allow scalable
 * generation of collective id's, they are formed as:
 *
 * top 32-bits are the jobid of the procs involved in
 * the collective. For collectives across multiple jobs
 * (e.g., in a connect_accept), the daemon jobid will
 * be used as the id will be issued by mpirun. This
 * won't cause problems because daemons don't use the
 * collective_id
 *
 * bottom 32-bits are a rolling counter that recycles
 * when the max is hit. The daemon will cleanup each
 * collective upon completion, so this means a job can
 * never have more than 2**32 collectives going on at
 * a time. If someone needs more than that - they've got
 * a problem.
 *
 * Note that this means (for now) that RTE-level collectives
 * cannot be done by individual threads - they must be
 * done at the overall process level. This is required as
 * there is no guaranteed ordering for the collective id's,
 * and all the participants must agree on the id of the
 * collective they are executing. So if thread A on one
 * process asks for a collective id before thread B does,
 * but B asks before A on another process, the collectives will
 * be mixed and not result in the expected behavior. We may
 * find a way to relax this requirement in the future by
 * adding a thread context id to the jobid field (maybe taking the
 * lower 16-bits of that field).

This commit includes a test program (orte/test/mpi/coll_test.c) that cycles 100 times across barrier and modex collectives.

This commit was SVN r32203.
2014-07-10 18:53:12 +00:00
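The id layout quoted in commit a523dba41d above (jobid in the top 32 bits, a rolling counter in the bottom 32 bits) can be sketched as below. The type and function names are hypothetical and chosen for this illustration; only the bit layout comes from the commit message.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 64-bit collective id: [ jobid : 32 | counter : 32 ]. */
typedef uint64_t coll_id_t;

/* Pack a jobid into the top half and a counter into the bottom half. */
static coll_id_t coll_id_make(uint32_t jobid, uint32_t counter)
{
    return ((coll_id_t)jobid << 32) | (coll_id_t)counter;
}

/* Recover the two fields from a packed id. */
static uint32_t coll_id_jobid(coll_id_t id)
{
    return (uint32_t)(id >> 32);
}

static uint32_t coll_id_counter(coll_id_t id)
{
    return (uint32_t)(id & 0xffffffffu);
}

/* Hypothetical per-job generator. The counter is unsigned, so it recycles
 * to 0 after reaching UINT32_MAX, matching the "rolling counter" wording. */
typedef struct {
    uint32_t jobid;
    uint32_t next;
} coll_id_gen_t;

static coll_id_t coll_id_next(coll_id_gen_t *gen)
{
    assert(gen != NULL);
    return coll_id_make(gen->jobid, gen->next++);
}
```

Note that the generator holds one counter per process with no thread context, which is why, as the quoted comment explains, all participants must request ids in the same order.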
Ralph Castain
8c85ca350e Remove debug
This commit was SVN r32200.
2014-07-10 18:28:24 +00:00
Jeff Squyres
6cc538ae16 help-orterun.txt: wrap long messages, clarify new messages
Clarify the new -x/mca_base_env_list help messages.

This commit was SVN r32199.
2014-07-10 17:24:52 +00:00
Ralph Castain
796f57f709 Protect against problems if someone passes us thru a pipe and then abnormally terminates the pipe early
This commit was SVN r32189.
2014-07-09 22:41:53 +00:00
Ralph Castain
7beb8f6799 Silence used-before-set var warning
This commit was SVN r32188.
2014-07-09 22:37:47 +00:00
Joshua Ladd
801e2cb544 Fix error and warning messages after reverting
the mca_base_env_list to being semicolon delimited.

This commit was SVN r32179.
2014-07-09 14:46:19 +00:00
Joshua Ladd
30da6d3a17 Opal: add a new MCA parameter that allows the user to specify a list of environment variables. This parameter will become the standard mechanism by which environment variables are set for OMPI applications replacing the -x option.
mpirun ... -x env_foo1=val1 -x env_foo2 -x env_foo3=val3  should now be expressed as

mpirun ... -mca mca_base_env_list env_foo1=val1+env_foo2+env_foo3=val3. 

The motivation for doing this is so that a list of environment variables may be set via standard MCA mechanisms such as mca parameter files, amca lists, etc. 

This feature was developed by Elena Shipunova and was reviewed by Josh Ladd.

This commit was SVN r32163.
2014-07-09 00:38:25 +00:00
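As a rough illustration of the list format introduced in commit 30da6d3a17 above, the sketch below splits a delimited string into name/value items. It is not the OPAL parser: `env_item_t`, `env_list_parse` and `env_list_free` are hypothetical names, and the delimiter is a parameter because the example in that commit uses `+` while a later entry in this log mentions a semicolon-delimited form.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical parsed item; value is NULL for a bare name without '='. */
typedef struct {
    char *name;   /* owns the allocation for the whole item */
    char *value;  /* points into the same allocation, or NULL */
} env_item_t;

/* Release the items produced by env_list_parse(). */
static void env_list_free(env_item_t *items, int n)
{
    for (int i = 0; i < n; i++) {
        free(items[i].name);
    }
}

/* Split `list` on `delim` into at most max_items entries.
 * Returns the number of items stored, or -1 if an allocation fails.
 * Empty fields (e.g. two delimiters in a row) are skipped. */
static int env_list_parse(const char *list, char delim,
                          env_item_t *items, int max_items)
{
    int n = 0;
    const char *p = list;

    assert(list != NULL && items != NULL && delim != '\0');

    while (*p != '\0' && n < max_items) {
        const char *end = strchr(p, delim);
        size_t len = (end != NULL) ? (size_t)(end - p) : strlen(p);

        if (len > 0) {
            char *item = malloc(len + 1);
            char *eq;
            if (item == NULL) {
                env_list_free(items, n);
                return -1;
            }
            memcpy(item, p, len);
            item[len] = '\0';

            eq = strchr(item, '=');      /* split at the first '=' only */
            if (eq != NULL) {
                *eq = '\0';
                items[n].value = eq + 1;
            } else {
                items[n].value = NULL;
            }
            items[n].name = item;
            n++;
        }
        if (end == NULL) {
            break;                       /* last field consumed */
        }
        p = end + 1;
    }
    return n;
}
```

Parsing `"env_foo1=val1;env_foo2;env_foo3=val3"` with `';'` yields three items, the second of which has a NULL value.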
Ralph Castain
5bb5b22573 When a user asks for cpus/rank > 1 and only has one slot, we need to ensure we always map at least one process when they don't tell us -np
cmr=v1.8.2:reviewer=rhc:subject=correct num_procs in corner case

This commit was SVN r32142.
2014-07-04 17:00:35 +00:00
Ralph Castain
9f97c74ba3 Silence warning
This commit was SVN r32136.
2014-07-03 17:29:04 +00:00
Ralph Castain
356e7ea904 Move all collective id's into the attributes and let the job pack/unpack take care of them instead of singling them out. Add the envars just prior to forking the children instead of into the launch message itself. Remove a few #if CR as the attributes functionality can handle this condition now.
This commit was SVN r32133.
2014-07-03 15:58:13 +00:00
Ralph Castain
0a4639308e Remove a potential race condition - we'll cleanup the local children when we are all done
This commit was SVN r32132.
2014-07-03 14:13:43 +00:00
George Bosilca
2883adcdf3 Remove useless variables.
This commit was SVN r32123.
2014-07-03 00:30:54 +00:00
Ralph Castain
149810f02c Per request from Jeff, slightly modify the show_help message as the precise name of the NUMA-containing packages differs based on OS and distro
cmr=v1.8.2:reviewer=jsquyres:subject=modify show_help message

This commit was SVN r32122.
2014-07-02 14:46:00 +00:00
Ralph Castain
e9d69ca370 Remove stale test
This commit was SVN r32104.
2014-06-29 16:37:19 +00:00
Adrian Reber
47b118c0ae fix FT compilation
This commit was SVN r32094.
2014-06-26 03:40:07 +00:00
Adrian Reber
cabf1d4e68 use the orte attributes in the FT code to fix compile errors
This commit was SVN r32093.
2014-06-26 03:19:17 +00:00
Adrian Reber
10c1a50705 "handle" removal of opal_db.remove() in the FT code
This commit was SVN r32092.
2014-06-26 03:11:37 +00:00
Ralph Castain
f3cb124e50 Revert r32082 and r32070 - the developer's conference has decided to go a different direction on the threaded progress effort. This will involve some degree of prototyping to understand the tradeoffs prior to making a final design decision, and so we'll hold off on the final change until that is completed.
This commit was SVN r32089.

The following SVN revision numbers were found above:
  r32070 --> open-mpi/ompi@12d92d0c22
  r32082 --> open-mpi/ompi@aa6438ef7a
2014-06-25 20:43:28 +00:00
Adrian Reber
9f73e79d91 also change the callback function prototype (to get the FT code to compile again)
This commit was SVN r32088.
2014-06-25 20:37:02 +00:00
Adrian Reber
4aca7095dc fix a syntax error in the FT code
This commit was SVN r32087.
2014-06-25 20:35:50 +00:00
Adrian Reber
4b25e92194 get the FT code to compile again by adding/removing #includes
This commit was SVN r32086.
2014-06-25 18:42:17 +00:00
Ralph Castain
8fca77c3d3 Protect the binding policy setting so it builds when --without-hwloc
Refs trac:4742

This commit was SVN r32085.

The following Trac tickets were found above:
  Ticket 4742 --> https://svn.open-mpi.org/trac/ompi/ticket/4742
2014-06-25 18:13:54 +00:00
Adrian Reber
72f1c7941f use a consistent naming scheme for the SNAPSHOT attributes
This commit was SVN r32083.
2014-06-25 15:26:24 +00:00
Nathan Hjelm
563eaf0726 Fix support for Cray alps
The alps ras and plm components were broken by recent changes in ORTE. This
commit resolves those issues.

Changes:

 - Define PMI2_SUCCESS if it isn't defined. This fixes a problem with Cray's
   PMI implementation which does not define (for some reason) PMI2_SUCCESS. We
   had previously just used PMI_SUCCESS.

 - Add missing definition and fix a typo in pml_alps_module.

 - launch_id is no longer available in the orte_node_t structure. Use the
   attribute lookup to get the value.

 - Do not use an O(n^2) sorting algorithm when putting alps nodes in order. Use
   opal_list_sort instead (O(nlogn)).

This commit was SVN r32076.
2014-06-24 21:29:04 +00:00
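The last change listed in commit 563eaf0726 above replaces a hand-written quadratic ordering loop with a library sort. The sketch below shows the same idea using the standard C `qsort` as a stand-in for `opal_list_sort`. The `node_t` layout and the `order_key` field are assumptions made for this example, and the C standard does not itself promise a complexity bound for `qsort`, although common implementations are O(n log n) on average.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical node record; order_key stands in for whatever value the
 * real code orders the alps nodes by. */
typedef struct {
    int         order_key;
    const char *name;
} node_t;

/* Comparator in the form qsort expects: negative, zero, or positive. */
static int node_cmp(const void *a, const void *b)
{
    const node_t *x = (const node_t *)a;
    const node_t *y = (const node_t *)b;
    /* Avoids the overflow that "x - y" could cause for extreme values. */
    return (x->order_key > y->order_key) - (x->order_key < y->order_key);
}

/* Sort the array in place by order_key using the library sort. */
static void sort_nodes(node_t *nodes, size_t count)
{
    assert(nodes != NULL || count == 0);
    if (count > 1) {
        qsort(nodes, count, sizeof(nodes[0]), node_cmp);
    }
}
```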
Ralph Castain
5f6be06b54 Per request from Gilles and discussion at devel conference, have the --oversubscribe option automatically set both oversubscribe and overload-allowed properties as this is likely what the user intended.
cmr=v1.8.2:reviewer=rhc:subject=automatically set oversub/load

This commit was SVN r32072.
2014-06-24 18:11:39 +00:00
Ralph Castain
12d92d0c22 Per the OMPI developer conference, remove the last vestiges of OMPI_USE_PROGRESS_THREADS
This commit was SVN r32070.
2014-06-24 17:05:11 +00:00
Ralph Castain
34e5573988 Resolve the MTT timeout problem. This appears to have largely been caused by missing sigchld notifications, thus causing the daemons to believe that not all procs had exited. Let comm failure also serve as notification of process termination, and add appropriate flags/attributes to avoid multiple reporting of proc termination.
This won't transition cleanly to the 1.8 series, and may represent too much change, so we'll have to (a) evaluate whether or not to bring it over (once it demonstrates that it does indeed solve the problem), and (b) develop a custom patch for that purpose.

Refs trac:4717

This commit was SVN r32063.

The following Trac tickets were found above:
  Ticket 4717 --> https://svn.open-mpi.org/trac/ompi/ticket/4717
2014-06-21 17:09:02 +00:00
Ralph Castain
9cfc408fd4 Little more debug - getting close to figuring this one out
Refs trac:4717

This commit was SVN r32060.

The following Trac tickets were found above:
  Ticket 4717 --> https://svn.open-mpi.org/trac/ompi/ticket/4717
2014-06-20 16:24:06 +00:00
Ralph Castain
f9da295682 Add some additional debug
Refs trac:4717

This commit was SVN r32059.

The following Trac tickets were found above:
  Ticket 4717 --> https://svn.open-mpi.org/trac/ompi/ticket/4717
2014-06-20 14:14:36 +00:00
Ralph Castain
645df5e823 Don't release the node_name field as it gets used in the slots parsing - will be released at newline detection
This commit was SVN r32058.
2014-06-20 13:18:46 +00:00
Ralph Castain
9a47e45a09 <laugh> ensure we really compare the things we want to compare
This commit was SVN r32055.
2014-06-19 20:54:25 +00:00
Ralph Castain
e65538e91b Add some defensive programming, fix a typo
This commit was SVN r32054.
2014-06-19 20:52:13 +00:00
Ralph Castain
b43f760f93 If you don't specify all the rank-file mapping for all procs, then you'll segfault - which is probably a bad idea. I can't see an easy workaround, so just error out for now and let's see if anyone really cares.
cmr=v1.8.2:reviewer=jsquyres

This commit was SVN r32053.
2014-06-19 20:30:06 +00:00
Ralph Castain
b618b36a2f Fix potential issue if opal_hwloc_topology is NULL
cmr=v1.8.2:reviewer=jsquyres

This commit was SVN r32050.
2014-06-19 18:52:41 +00:00
Ralph Castain
61fe4daa33 Add some further debug
Refs trac:4717

This commit was SVN r32047.

The following Trac tickets were found above:
  Ticket 4717 --> https://svn.open-mpi.org/trac/ompi/ticket/4717
2014-06-19 15:59:51 +00:00
Ralph Castain
65275d6326 Add a little more info to the warning message - i.e., that the likely cause of the problem is missing libnumactl and/or libnumactl-devel
cmr=v1.8.2:reviewer=miked:subject=improve memory binding failure message

This commit was SVN r32030.
2014-06-18 19:20:28 +00:00
Ralph Castain
3f032d39e8 Mark the proc as alive so waitpid callback system doesn't immediately activate the callback
Refs trac:4717

This commit was SVN r32026.

The following Trac tickets were found above:
  Ticket 4717 --> https://svn.open-mpi.org/trac/ompi/ticket/4717
2014-06-18 14:04:55 +00:00
Ralph Castain
8e7c0257f0 Cleanup some missed updates to orte_wait_cb as params have changed
Refs trac:4717

This commit was SVN r32025.

The following Trac tickets were found above:
  Ticket 4717 --> https://svn.open-mpi.org/trac/ompi/ticket/4717
2014-06-17 23:40:31 +00:00
Ralph Castain
5dbf4a62c4 Cleanup: we were accidentally killing ourselves (bad idea)
Refs trac:4717

This commit was SVN r32022.

The following Trac tickets were found above:
  Ticket 4717 --> https://svn.open-mpi.org/trac/ompi/ticket/4717
2014-06-17 20:38:42 +00:00
Ralph Castain
5216bd5558 Multiple sigchld reports can occur within a single event callback, so have to reap them until none remain. Also, need to ensure the daemon is flagged as alive prior to calling wait_cb
Refs trac:4717

This commit was SVN r32020.

The following Trac tickets were found above:
  Ticket 4717 --> https://svn.open-mpi.org/trac/ompi/ticket/4717
2014-06-17 18:46:40 +00:00
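Commit 5216bd5558 above notes that a single SIGCHLD event can stand for several exited children, so they must be reaped until none remain. A minimal sketch of that pattern using POSIX `waitpid` with `WNOHANG` follows; `reap_all_children` and the callback type are hypothetical names, and this is not the orte_wait code itself.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>

/* Hypothetical hook invoked once per reaped child. */
typedef void (*child_exit_cb_t)(pid_t pid, int status);

/* Reap every child that has already exited and return how many were
 * reaped. Intended to be called from the handler for one SIGCHLD event. */
static int reap_all_children(child_exit_cb_t on_exit_cb)
{
    int reaped = 0;

    for (;;) {
        int status = 0;
        pid_t pid = waitpid(-1, &status, WNOHANG);

        if (pid > 0) {
            if (on_exit_cb != NULL) {
                on_exit_cb(pid, status);
            }
            reaped++;
            continue;                 /* more children may be waiting */
        }
        if (pid == -1 && errno == EINTR) {
            continue;                 /* interrupted, retry */
        }
        /* pid == 0: children exist but none has exited yet.
         * pid == -1 with ECHILD: there are no children at all.
         * POSIX lists only ECHILD, EINTR and EINVAL for waitpid, and the
         * options used here are valid, so nothing else is expected. */
        assert(pid == 0 || errno == ECHILD);
        break;
    }
    return reaped;
}
```

Calling `waitpid` only once per signal would leave zombies behind whenever signals coalesce, which is the situation the commit describes.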
Ralph Castain
42bf7466fc This isn't as big a change as it appears - a change in one place caused a whole bunch of files to require updated #include's due to some arcane linkage. Rework the orte_wait code to reflect the introduction of the state machine. If we are in cleanup mode and just want to kill all our local children, then there is no reason to be polite about it as that introduces *very* long delays at scale. Just kill the procs and move on.
Refs trac:4717

This commit was SVN r32019.

The following Trac tickets were found above:
  Ticket 4717 --> https://svn.open-mpi.org/trac/ompi/ticket/4717
2014-06-17 17:57:51 +00:00
Ralph Castain
ab52f16100 Attempt to cleanup the race condition Rolf keeps encountering in MTT by adding some protection to ensure orted's try to terminate once their local procs die. Also, fix a problem whereby a failure to comm_spawn would result in a hang of the parent process.
cmr=v1.8.2:reviewer=rhc:subject=cache termination cleanups

This commit was SVN r32008.
2014-06-16 20:46:35 +00:00
Ralph Castain
561983ae52 Fix static builds by renaming conflicting type
This commit was SVN r32006.
2014-06-14 17:39:28 +00:00
Ralph Castain
3f04d50cb0 Per the ticket, resolve our handling of overload conditions to provide a more consistent response. If we are overloaded (i.e., attempting to bind more processes to a location than the number of cpus under that location), then we consider the following conditions:
(a) default binding policy is in effect. In this case, we will emit a
warning and default to not binding unless the user provided the
"oversubscribe" or "overload" modifier to the "bind-to" option.

(b) user-specified binding policy is in effect. In this case, we will
error out unless the user provided the "oversubscribe" or "overload"
modifier to the "bind-to" option as we cannot meet the directive.

Either "bind-to" modifier (oversubscribe or overload) will be accepted for
now - in 1.9, we will deprecate the "overload" term in favor of
"oversubscribe".

Also added the ability to accept a --bind-to modifier without specifying the binding policy itself so a user can specify overload-allowed with the default policy.

Closes trac:4345

cmr=v1.8.2:reviewer=rhc:subject=resolve handling of overload conditions

This commit was SVN r32005.

The following Trac tickets were found above:
  Ticket 4345 --> https://svn.open-mpi.org/trac/ompi/ticket/4345
2014-06-14 15:38:32 +00:00
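The two cases spelled out in commit 3f04d50cb0 above reduce to a small decision table. The sketch below encodes it; the enum and function names are hypothetical, and the real binding code considers far more state than these four inputs.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical outcomes of the overload check. */
typedef enum {
    BIND_OK,                 /* proceed with binding */
    BIND_SKIP_WITH_WARNING,  /* case (a): warn and do not bind */
    BIND_ERROR               /* case (b): cannot meet the directive */
} bind_action_t;

/* nprocs: processes to bind at a location; ncpus: cpus under it.
 * user_specified_policy: true if the user gave a binding policy.
 * overload_allowed: true if the "oversubscribe" or "overload" modifier
 * was given to the "bind-to" option. */
static bind_action_t resolve_overload(int nprocs, int ncpus,
                                      bool user_specified_policy,
                                      bool overload_allowed)
{
    assert(nprocs >= 0 && ncpus >= 0);

    if (nprocs <= ncpus) {
        return BIND_OK;      /* not overloaded at all */
    }
    if (overload_allowed) {
        return BIND_OK;      /* the modifier permits overloading */
    }
    return user_specified_policy ? BIND_ERROR : BIND_SKIP_WITH_WARNING;
}
```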
Ralph Castain
56c3575c0e Can't emit an error for an unrecognized mapping policy modifier as the ppr policy relies on not doing so.
This commit was SVN r31998.
2014-06-13 20:10:09 +00:00
Ralph Castain
3ed282bf44 Per patch from Tetsuya, correct the cpus-per-proc logic so we correctly detect when the user is attempting to bind too low for that option
Refs trac:4702

This commit was SVN r31988.

The following Trac tickets were found above:
  Ticket 4702 --> https://svn.open-mpi.org/trac/ompi/ticket/4702
2014-06-13 16:32:52 +00:00
Ralph Castain
ba926d8635 The TCP component will have set the hash table entry to NULL, but that doesn't remove the key. So the hash_table retrieval function will return success, but with a NULL pointer - protect against that scenario
Patch provided by Gilles - reviewed by rhc.

RM-approved

cmr=v1.8.2:reviewer=ompi-gk1.8

This commit was SVN r31971.
2014-06-09 17:46:22 +00:00
Ralph Castain
5d5ae41ea5 Cleanup a memory leak in the daemons - thanks to Artem for spotting it
This commit was SVN r31970.
2014-06-09 17:14:02 +00:00
Ralph Castain
06dbfa3098 Make the cpus-per-proc equivalent a little more intuitive:
* allow users to specify just a modifier for map-by instead of requiring that they also specify a policy. Thus, we now accept --map-by :pe=3 as indicating that we should use the default mapping policy, but bind 3 cpus/proc.

* if users specify a pe's/proc but no policy, default to --map-by NUMA to ensure we have access to multiple cpus for the request. This won't guarantee we have access to enough to meet the request, but gives us a chance. In addition, we know that binding a proc to multiple cpus will work best if those cpus are all in the same NUMA, so this provides some degree of optimized behavior.

Per a request from Jeff, define "oversubscribe" for binding as a synonym for the "overload" modifier.

cmr=v1.8.2:reviewer=rhc

This commit was SVN r31967.
2014-06-08 20:26:59 +00:00
Ralph Castain
8db76e9c6f Ensure that we change to the session dir if we preload binaries so we'll use the loaded one
Special patch created for v1.8 and CMR filed

This commit was SVN r31963.
2014-06-06 21:43:23 +00:00
Ralph Castain
b7c08582ba Add new tag to avoid conflicts
This commit was SVN r31960.
2014-06-06 17:23:35 +00:00
Ralph Castain
638c24f655 Correct the bind-in-place algorithm to better handle comm_spawn. If the location identified by the mapper is already occupied by procs from another job, then we need to shift either right or left until we find an unoccupied location where we can be bound. If nothing is available, then check for the overload flag (and bind us in the original location if provided), or see if this was the default binding policy instead of one specified by the user - if so, then just don't bind this process.
cmr=v1.8.2:reviewer=rhc

This commit was SVN r31959.
2014-06-06 12:36:14 +00:00
Gilles Gouaillardet
d26ac02b4a #if OPAL_HAVE_HWLOC protect access to orte_proc_info_t.cpuset
Fix a bug when trunk is configured with --without-hwloc

v1.8 is safe so no cmr

This commit was SVN r31957.
2014-06-06 07:25:39 +00:00
Ralph Castain
e21bfeadcd Now that the BTLs are moving down to OPAL and becoming available to ORTE, there no longer is a need/desire to push performance in the OOB/TCP component. So we don't need multiple modules driving NICs in parallel, and can drop all the complicated distribution logic. Fall back to the simplified single module model, but retain the ability to run that module in its own progress thread if so directed.
This should eliminate the connectivity issues that have been reported, and will make maintenance of this component much easier.

cmr=v1.8.2:reviewer=jsquyres:subject=simplify the OOB/TCP component

This commit was SVN r31956.
2014-06-06 02:24:17 +00:00
Ralph Castain
34cb137314 Add another attribute to the orte_proc_t area
This commit was SVN r31953.
2014-06-05 14:48:19 +00:00
Ralph Castain
b2413a6b88 Cannot update the proc state prior to activating the state machine as some callback functions need to compare the prior proc state against the new one.
cmr=v1.8.2:reviewer=jsquyres

This commit was SVN r31949.
2014-06-04 03:40:08 +00:00
Ralph Castain
b771388fa7 We really need to send *all* the daemon info whenever the daemon job has changed as new daemons need a full nidmap
cmr=v1.8.2:reviewer=jsquyres

This commit was SVN r31948.
2014-06-04 03:38:54 +00:00
Ralph Castain
c5384d44d7 Protect against NULL result in get_attr
This commit was SVN r31947.
2014-06-04 03:09:37 +00:00
Ralph Castain
8e768de317 Really should check our own node for oversubscription, not the HNP
cmr=v1.8.2:reviewer=jsquyres

This commit was SVN r31946.
2014-06-04 03:09:02 +00:00
Ralph Castain
5398158d5a Move fclose inside bracket to protect against NULL fp
This commit was SVN r31945.
2014-06-04 03:08:18 +00:00
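The fix in commit 5398158d5a above is a common C pattern: `fclose` must only be reached when `fopen` succeeded, since passing a NULL stream to it is undefined behavior. A small sketch with a hypothetical helper, `count_lines`, unrelated to the actual ORTE function that was changed:

```c
#include <assert.h>
#include <stdio.h>

/* Count newline characters in a file. Returns -1 if the file cannot be
 * opened. Hypothetical helper used only to show where fclose belongs. */
static long count_lines(const char *path)
{
    long lines = -1;
    FILE *fp;

    assert(path != NULL);

    fp = fopen(path, "r");
    if (NULL != fp) {
        int c;
        lines = 0;
        while ((c = fgetc(fp)) != EOF) {
            if ('\n' == c) {
                lines++;
            }
        }
        fclose(fp);  /* inside the bracket: fp is known to be non-NULL */
    }
    return lines;
}
```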
Ralph Castain
f1978fba7c Cleanup a set of typos on the orte_get_attribute call
This commit was SVN r31942.
2014-06-03 20:36:38 +00:00
Ralph Castain
5668f085a3 Silence some useless warnings, and fix a missed update in the tm plm
This commit was SVN r31930.
2014-06-02 17:57:56 +00:00
Ralph Castain
742c0d2284 Fix typo that would cause a segfault if orte_startup_timeout was set
This commit was SVN r31929.
2014-06-02 15:59:18 +00:00
Ralph Castain
7df500ecf5 Break the loop caused by retrying to send a message to a hop that is unknown by the TCP oob component. We attempt to provide a way for other components to try, but need to mark that the TCP component is not able to reach that process so the OOB base will know to give up.
This commit was SVN r31928.
2014-06-02 15:00:33 +00:00
Ralph Castain
03234f2a33 Revert r31926 and replace it with a more complete checking of availability and accessibility of the required freq control paths.
This commit was SVN r31927.

The following SVN revision numbers were found above:
  r31926 --> open-mpi/ompi@9779084352
2014-06-02 14:34:00 +00:00
Gilles Gouaillardet
9779084352 orte_rtc_base_select: skip a RTC module if it has a zero priority
This commit was SVN r31926.
2014-06-02 07:16:41 +00:00
Ralph Castain
65a35d92ef Cleanup compile issues - missing updates to some plm components and the slurm ras component
This commit was SVN r31921.
2014-06-01 17:59:06 +00:00
Ralph Castain
2c3d07db24 Cleanup the test so it is MPI correct
This commit was SVN r31919.
2014-06-01 17:57:36 +00:00
Ralph Castain
8736a1c138 Per RFC:
http://www.open-mpi.org/community/lists/devel/2014/05/14822.php

Revamp the ORTE global data structures to reduce memory footprint and add new features. Add ability to control/set cpu frequency, though this can only be done if the sys admin has setup the system to support it (or you run as root).

This commit was SVN r31916.
2014-06-01 16:14:10 +00:00
Ralph Castain
cf2c7381d0 Replace the PML barrier with an RTE barrier for now until we can come up with a better solution for connectionless BTLs.
Refs trac:4643

This commit was SVN r31915.

The following Trac tickets were found above:
  Ticket 4643 --> https://svn.open-mpi.org/trac/ompi/ticket/4643
2014-06-01 16:08:56 +00:00
Ralph Castain
1107f9099e Per the RFC issued here:
http://www.open-mpi.org/community/lists/devel/2014/05/14827.php

Refactor PMI support

This commit was SVN r31907.
2014-06-01 04:28:17 +00:00
Nathan Hjelm
041b72b0cc plm/alps: better workaround for the noisy cray pmi implementation
This commit is a slightly better workaround to prevent messages of
the form:
[unset]:_pmi_alps_get_apid:alps_app_lli_put_request failed
[unset]:_pmi_alps_get_appLayout:pmi_alps_get_apid returned with error: Bad file descriptor

It works by completely disabling PMI in the application process when using
mpirun. This should not be an issue for any apps.

cmr=v1.8.2:reviewer=rhc

This commit was SVN r31882.
2014-05-22 16:04:36 +00:00
Oscar Vega-Gisbert
83bdebbf81 Java bindings for OSHMEM.
This commit was SVN r31810.
2014-05-18 21:48:09 +00:00
Nathan Hjelm
73bfecd650 More leak fixes.
Two leaks are fixed in this commit:

 - Do not leak btl component list items.

 - Do not leak the nodename when decoding the pidmap.

cmr=v1.8.2:reviewer=rhc

This commit was SVN r31779.
2014-05-15 16:38:13 +00:00
Nathan Hjelm
59d09ad9de orte: fix several small memory leaks
grpcomm: fix memory leaks

We were leaking the caddy object used to pass data to the callback
function. This commit fixes these leaks.

oob,rml: fix memory leaks

This commit fixes several leaks:

 - Both the oob/base and oob/tcp were leaking objects on their peer
   hash tables. Iterate on the hash tables and free any objects.

 - Leaked sent messages because of missing OBJ_RELEASE. I placed the
   release in ORTE_RML_SEND_COMPLETE to catch all the possible
   paths.

ess/base: close the state framework

cmr=v1.8.2:reviewer=rhc

This commit was SVN r31776.
2014-05-15 15:06:27 +00:00
Gilles Gouaillardet
5b9364fc12 Fix a memory leak in orte_register_params()
mca_base_var_register (..., MCA_BASE_VAR_TYPE_STRING, ...)
will dup() the orte_set_slots string, so there is no need
to do this in the first place.

cmr=v1.8.2:reviewer=rhc

This commit was SVN r31773.
2014-05-15 10:31:19 +00:00
Gilles Gouaillardet
5f82c391a6 Fix memory leaks in orte/util/nidmap.c
This patch fixes four memory leaks in orte/util/nidmap.c :
 - hwloc_get_root_obj(opal_hwloc_topology)->userdata was never freed
 - even if bo->bytes is freed in the decode, bo was not freed
 - a job list is populated but never used nor freed

cmr=v1.8.2:reviewer=rhc

This commit was SVN r31770.
2014-05-15 08:28:53 +00:00
Ralph Castain
ad0e8f841d Just pick a module to handle the incoming connection if no direct interface is identified. Siegmar hit it because his IP/netmask is disjoint, but a router was able to make the connection.
Refs trac:4627

This commit was SVN r31763.

The following Trac tickets were found above:
  Ticket 4627 --> https://svn.open-mpi.org/trac/ompi/ticket/4627
2014-05-14 19:23:02 +00:00
Ralph Castain
e605e73379 Close the incoming socket if we aren't going to accept it
cmr=v1.8.2:reviewer=rhc

This commit was SVN r31759.
2014-05-14 16:51:59 +00:00
Ralph Castain
3a1c2fff3e Correct a misplaced bracket - daemons shouldn't be doing app-related operations
This may need a patch for 1.8.2, but we can try to directly apply it

cmr=v1.8.2:reviewer=hjelmn

This commit was SVN r31754.
2014-05-14 15:23:30 +00:00
Nathan Hjelm
2a57e71a47 plm/alps: fix typo introduced in r31589
This commit was SVN r31747.

The following SVN revision numbers were found above:
  r31589 --> open-mpi/ompi@445b552d3a
2014-05-13 22:36:54 +00:00
Ralph Castain
f55c587a74 Per patch from Tetsuya Mishima, ensure the rank_file mapper accurately tracks number of nodes in the map
Refs trac:4594

This commit was SVN r31725.

The following Trac tickets were found above:
  Ticket 4594 --> https://svn.open-mpi.org/trac/ompi/ticket/4594
2014-05-13 14:36:25 +00:00
Ralph Castain
5388347511 Per Jeff's suggestion, remove function that has duplicate functionality and just use one to check if session_dir directory should be removed.
Refs trac:4584

This commit was SVN r31691.

The following Trac tickets were found above:
  Ticket 4584 --> https://svn.open-mpi.org/trac/ompi/ticket/4584
2014-05-08 17:22:43 +00:00
Ralph Castain
aaae4841e9 Flush the show_help system on our way out - this also restores the opal_show_help function pointer to the OPAL layer for any subsequent processing.
cmr=v1.8.2:reviewer=jsquyres

This commit was SVN r31685.
2014-05-08 14:37:47 +00:00
Ralph Castain
5602156a1c Use the correct abstraction layer name for the data dirs
This commit was SVN r31684.
2014-05-08 14:32:24 +00:00
Ralph Castain
11faab1091 The final step of the RFC: convert the <foo>libdir and friends to fit their respective code areas, and equate them all at the top. Note that we can't entirely separate things as the opal_install_dirs framework can't handle separated locations for the various trees.
This commit was SVN r31679.
2014-05-08 02:01:35 +00:00
Ralph Castain
a8e2d6c3a6 The bulk of the remaining renaming changes, in one final glorious "blob". Thanks to Jeff for some help chasing down a few spots. Per chat with Jeff, we decided to clean up a few things that were historical in nature:
top_ompi_srcdir  ->  OMPI_TOP_SRCDIR
top_ompi_builddir -> OMPI_TOP_BUILDDIR

We also split the srcdir/builddir flags according to their local tree (e.g., OPAL_TOP_SRCDIR), and tied them all together in configure.ac. Renamed ompi_ignore and ompi_unignore to be opal_<foo> as these are agnostic markers.

Only thing left is ompilibdir being treated similarly to what we did for srcdir/builddir. Coming soon.

This commit was SVN r31678.
2014-05-07 21:48:53 +00:00
Ralph Castain
05590b6a8c Correct the datastore containing the coprocessor info
This commit was SVN r31677.
2014-05-07 19:29:12 +00:00
Ralph Castain
4def94900a Per RFC: OMPI_INSTALL_BINARIES -> OPAL_INSTALL_BINARIES
This commit was SVN r31634.
2014-05-05 21:43:05 +00:00
Ralph Castain
87d809eefe Add a new "run-time controls" framework for setting controls on processes. Initially, just move the process binding code there under a new "hwloc" component. Additional components to support cgroups, power settings, etc. to follow
This commit was SVN r31633.
2014-05-05 19:22:06 +00:00
Ralph Castain
fae39a658d Add third flag for open when using O_CREAT. Thanks to "robi" for reporting it and providing a patch.
Fixes trac:4596
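
As a hedged illustration of the issue named above: POSIX `open()` takes a third `mode` argument whenever `O_CREAT` is set, and omitting it leaves the new file's permissions undefined. The helper name `create_output_file()` and the chosen permission bits are assumptions for this sketch, not taken from the patch.

```c
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Create (or truncate) a file for writing. Because O_CREAT is set, the
 * third argument supplies the permission bits for a newly created file;
 * the process umask is still applied on top of them. */
static int create_output_file(const char *path)
{
    return open(path, O_WRONLY | O_CREAT | O_TRUNC, S_IRUSR | S_IWUSR);
}
```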

Reviewed by rhc, RM-approved

cmr=v1.8.2:reviewer=ompi-gk1.8

This commit was SVN r31626.

The following Trac tickets were found above:
  Ticket 4596 --> https://svn.open-mpi.org/trac/ompi/ticket/4596
2014-05-02 21:58:38 +00:00
Ralph Castain
60c554e097 Ugh - protect that --display-devel print with some NULL checks
This commit was SVN r31604.
2014-05-02 14:28:45 +00:00
Ralph Castain
c7f55be387 Per a user request, add binding info to the simple --display-map option
This commit was SVN r31603.
2014-05-02 14:25:59 +00:00
Ralph Castain
ccd33a17b8 Since we cannot block when calling abort, and we want to ensure any "show_help" message at least has a chance to get out before we exit, introduce a slight delay into the abort procedure.
Refs trac:4576

This commit was SVN r31601.

The following Trac tickets were found above:
  Ticket 4576 --> https://svn.open-mpi.org/trac/ompi/ticket/4576
2014-05-02 10:46:25 +00:00
Ralph Castain
c1383ca1f3 Protect against NULL cpuset when not bound
This commit was SVN r31600.
2014-05-02 10:45:11 +00:00
Ralph Castain
0209cddb5b Revert r31596 and r31595 as they recreate the "abort" problem - all they did was move the blocking send to another point in the code. An alternative solution to the "show_help and abort" problem will come in another commit.
Refs trac:4576

This commit was SVN r31599.

The following SVN revision numbers were found above:
  r31595 --> open-mpi/ompi@2b61f22973
  r31596 --> open-mpi/ompi@712634efd3

The following Trac tickets were found above:
  Ticket 4576 --> https://svn.open-mpi.org/trac/ompi/ticket/4576
2014-05-02 10:38:30 +00:00
Ralph Castain
6545e6e9a8 Add one more check for failed mapping that rarely occurs, but results in a hang when it does
cmr=v1.8.2:reviewer=rhc

This commit was SVN r31598.
2014-05-02 10:35:14 +00:00
Ralph Castain
712634efd3 Silence warning
Refs trac:4576

This commit was SVN r31596.

The following Trac tickets were found above:
  Ticket 4576 --> https://svn.open-mpi.org/trac/ompi/ticket/4576
2014-05-01 23:58:03 +00:00
Ralph Castain
2b61f22973 Now that the abort code no longer involves a blocking rml send section, apps that call show_help followed by abort are not printing their error message. So block them in show_help until that message gets out.
This commit was SVN r31595.
2014-05-01 22:57:17 +00:00
Ralph Castain
445b552d3a Try again to get an error message printed when a daemon fails to successfully report back to mpirun. In this case, there is no guaranteed way for the daemon to output the error report itself - we don't have a connection back to the HNP, and we have tied stderr off to /dev/null (for good reasons). So the HNP has to detect the failure itself and report it.
The HNP can't know the precise reason, of course - all it knows is that the daemon failed. So output a generic error message that provides guidance on probable causes.

Refs trac:4571

This commit was SVN r31589.

The following Trac tickets were found above:
  Ticket 4571 --> https://svn.open-mpi.org/trac/ompi/ticket/4571
2014-05-01 19:48:21 +00:00
Ralph Castain
567ed25938 As per the earlier RFC, move the DB framework to orcm, thus removing it from the OMPI code repo
This commit was SVN r31586.
2014-05-01 15:43:32 +00:00
Ralph Castain
3b64c603b4 First stage of RFC to rename OMPI_foo build system support: change OMPI_CHECK_PACKAGE -> OPAL_CHECK_PACKAGE
This commit was SVN r31582.
2014-05-01 14:24:56 +00:00
Ralph Castain
238ecea311 When we comm_spawn, we really want to respect the original -host directives and not expand the daemon virtual machine unless directed to do so in the comm_spawn command. Otherwise, we will automatically launch daemons on every node in the allocation.
cmr=v1.8.2:reviewer=rhc:subject=respect vm boundaries during comm_spawn

This commit was SVN r31578.
2014-04-30 22:26:18 +00:00
Ralph Castain
d04a102ab8 Silence warnings
This commit was SVN r31573.
2014-04-30 20:55:46 +00:00
Ralph Castain
087b84b0ef Add some further debug to the dstore framework. When doing comm_spawn, we have to exchange any provided cpu bitmaps to ensure both sides compute the same locality, else various mpi frameworks can go bonkers.
This commit was SVN r31572.
2014-04-30 19:29:00 +00:00
Ralph Castain
8cda1b3dc6 Don't store cpu_bitmap unless it is non-NULL
This commit was SVN r31570.
2014-04-30 18:12:48 +00:00
Ralph Castain
7a79b25577 Ensure we clean up some files so session dirs can be rolled up
cmr=v1.8.2:reviewer=jsquyres

This commit was SVN r31569.
2014-04-30 17:52:10 +00:00
Ralph Castain
34988ba2a2 Clean up the MPI_Abort detection
Refs trac:4576

This commit was SVN r31561.

The following Trac tickets were found above:
  Ticket 4576 --> https://svn.open-mpi.org/trac/ompi/ticket/4576
2014-04-30 00:51:59 +00:00
Ralph Castain
3c9d877c1b Remove debug
This commit was SVN r31560.
2014-04-30 00:08:43 +00:00
Ralph Castain
9402380e1f Fix some errors in transition
This commit was SVN r31559.
2014-04-30 00:07:53 +00:00
Ralph Castain
c4c9bc1573 As per the RFC:
http://www.open-mpi.org/community/lists/devel/2014/04/14496.php

Revamp the opal database framework, including renaming it to "dstore" to reflect that it isn't a "database". Move the "db" framework to ORTE for now, soon to move to ORCM

This commit was SVN r31557.
2014-04-29 21:49:23 +00:00
Ralph Castain
1f0efe62a4 Minor cleanup - remove unused RML tag
Refs trac:4576

This commit was SVN r31545.

The following Trac tickets were found above:
  Ticket 4576 --> https://svn.open-mpi.org/trac/ompi/ticket/4576
2014-04-29 17:34:17 +00:00
Ralph Castain
e05b88fd18 Take another stab at resolving the "called-abort" requirement without getting stuck. Return to "drop a turd" mode, perhaps with a little more intelligence behind it. Don't worry about catching it if session dirs weren't created
cmr=v1.8.2:reviewer=jsquyres:subject=cleanup MPI_Abort hangs

This commit was SVN r31543.
2014-04-29 17:29:46 +00:00
Ralph Castain
2c6234698e Fix the tarball build - need to include the orte_config.h header
This commit was SVN r31540.
2014-04-29 00:05:19 +00:00
Ralph Castain
3723b39f30 Ensure we don't silently fail when unable to make a connection - bark pleasantly first.
Refs trac:4571

This commit was SVN r31537.

The following Trac tickets were found above:
  Ticket 4571 --> https://svn.open-mpi.org/trac/ompi/ticket/4571
2014-04-28 19:16:32 +00:00
Ralph Castain
d642babff6 Derived from patch provided by Artem, clean up the "abnormal" code path for selecting TCP OOB modules to connect to a remote process. If we can't find a direct interface-to-address match, then assign all the provided addresses to the first available TCP module and let the normal failure process determine if the remote proc is truly reachable.
cmr=v1.8.2:reviewer=artpol:subject=fix abnormal code connection path in tcp oob

This commit was SVN r31536.
2014-04-28 19:05:14 +00:00
Ralph Castain
fb61a94804 Follow the lead set by Jeff: no need to run AC_CONFIG_HEADERS on orte_config.h. However, unlike the MPI layer, we don't run that macro on another file in orte/include, so ensure we add that -I path back!
This commit was SVN r31534.
2014-04-28 17:12:15 +00:00
Jeff Squyres
d8715f1e3a Close 3 more fd's that were leaking into child processes.
Child processes now look clean; I can't find any more fd's that are
leaking from the parent to children.

Refs trac:4550

This commit was SVN r31515.

The following Trac tickets were found above:
  Ticket 4550 --> https://svn.open-mpi.org/trac/ompi/ticket/4550
2014-04-24 15:36:24 +00:00
Jeff Squyres
e1655ae68d opal/util/fd.c: add new convenience function for setting FD_CLOEXEC
Paul Hargrove pointed out that Stevens tells us that we should
F_GETFD before F_SETFD.  And so we shall.

Make a new convenience function to do this (opal_fd_set_cloexec()),
just so that we don't have to litter this 2-step process throughout
the code.
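
The two-step process can be sketched roughly as follows, using the POSIX `fcntl()` commands `F_GETFD` and `F_SETFD`. `set_cloexec()` is a hypothetical helper modelled on the description of `opal_fd_set_cloexec()`; the real function's signature and return codes may differ.

```c
#include <fcntl.h>

/* Mark fd as close-on-exec without clobbering any other descriptor flags:
 * read the current flags first, then write them back with FD_CLOEXEC added.
 * Returns 0 on success and -1 if either fcntl() call fails. */
static int set_cloexec(int fd)
{
    int flags = fcntl(fd, F_GETFD, 0);
    if (flags < 0) {
        return -1;
    }
    if (fcntl(fd, F_SETFD, flags | FD_CLOEXEC) < 0) {
        return -1;
    }
    return 0;
}
```

Reading the flags first matters because `F_SETFD` replaces the whole descriptor-flag set rather than adding a single bit.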

Refs trac:4550

This commit was SVN r31513.

The following Trac tickets were found above:
  Ticket 4550 --> https://svn.open-mpi.org/trac/ompi/ticket/4550
2014-04-24 13:04:49 +00:00
Jeff Squyres
410f5bfb91 oob_tcp_listener.c: set both ends of this pipe to be close-on-exec
This pipe is used to communicate between threads in this process.
Mark both fd as close-on-exec so that children don't inherit this
pipe.

Refs trac:4550

This commit was SVN r31512.

The following Trac tickets were found above:
  Ticket 4550 --> https://svn.open-mpi.org/trac/ompi/ticket/4550
2014-04-23 21:46:41 +00:00
Jeff Squyres
87e6232e67 orterun.c: set an fd to be close-on-exec
Make sure the debugger attach fifo is marked as close-on-exec so that
children procs don't inherit it.  For example, if you salloc a SLURM
allocation and run "mpirun ..." in there (i.e., mpirun is running on
the head node, and launching on to back-end nodes), the forked srun's
will inherit this fd if it is still open.

Refs trac:4550

This commit was SVN r31499.

The following Trac tickets were found above:
  Ticket 4550 --> https://svn.open-mpi.org/trac/ompi/ticket/4550
2014-04-22 21:55:09 +00:00
Jeff Squyres
63b7ef4103 orterun.1in: Document --allow-run-as-root option
Add some verbiage about how mpirun now defaults to disallowing running
as root, but you can use the --allow-run-as-root option to override
this default behavior.

Refs trac:4536

This commit was SVN r31477.

The following Trac tickets were found above:
  Ticket 4536 --> https://svn.open-mpi.org/trac/ompi/ticket/4536
2014-04-22 14:34:32 +00:00
Jeff Squyres
ea4c916096 plm_slurm_module.c: don't leave the extra fd to /dev/null open
Prior to r29058, this same logic was in place (i.e., ensure that the
extra fd to /dev/null is closed).  It looks like it was accidentally
removed in the ORTE conversion to the state machine in r29058.

This ''might'' have something to do with many hangs that we're seeing
in Cisco MTT with jobs that exhibit failure (e.g., call MPI_ABORT)...?

cmr=v1.8.2:reviewer=rhc

This commit was SVN r31469.

The following SVN revision numbers were found above:
  r29058 --> open-mpi/ompi@a200e4f865
2014-04-21 20:09:15 +00:00
Jeff Squyres
38a27b858d Protect for the CLEANUP case where tmp hasn't been set yet
Refs trac:4536

This commit was SVN r31438.

The following Trac tickets were found above:
  Ticket 4536 --> https://svn.open-mpi.org/trac/ompi/ticket/4536
2014-04-18 23:34:53 +00:00
Jeff Squyres
482b465c05 Trivial format change: use the same length of lines and \n offsets as
opal_show_help().

Refs trac:4536

This commit was SVN r31437.

The following Trac tickets were found above:
  Ticket 4536 --> https://svn.open-mpi.org/trac/ompi/ticket/4536
2014-04-18 23:14:45 +00:00
Jeff Squyres
530f22c403 proc_info.c: uncomment C99 struct member initialization usage
The C99 usage to initialize via struct member names was already there,
but commented out.  This commit doesn't fix any known problem; it
simply uncomments the C99 code, because it's safer/better.
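
For readers unfamiliar with the C99 feature mentioned above, here is a small sketch of initialization by struct member name. The structure and its fields are hypothetical and are not the actual layout used in proc_info.c.

```c
#include <stddef.h>

/* Hypothetical structure for illustration only. */
struct proc_info_sketch {
    const char *nodename;
    int pid;
    int num_procs;
    int bound;
};

/* C99 designated initializers name each member explicitly, so the
 * initializer stays correct even if members are later reordered.
 * Members that are not named (here: bound) are zero-initialized. */
static struct proc_info_sketch proc_info_default(void)
{
    struct proc_info_sketch info = {
        .nodename = NULL,
        .pid = 0,
        .num_procs = 1,
    };
    return info;
}
```

The positional form `{ NULL, 0, 1 }` would silently assign the wrong values if a new member were inserted ahead of `num_procs`, which is one reason the named form can be considered safer.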

This commit was SVN r31425.
2014-04-18 17:26:07 +00:00
Ralph Castain
8594f5d738 Correctly set a non-zero exit status when mpirun is terminated by signal
Fixes trac:4537

cmr=v1.8.1:reviewer=jsquyres

This commit was SVN r31423.

The following Trac tickets were found above:
  Ticket 4537 --> https://svn.open-mpi.org/trac/ompi/ticket/4537
2014-04-18 16:39:08 +00:00