
2589 Commits

Author SHA1 Message Date
Ralph Castain
391074cde6 Add a tag
This commit was SVN r24813.
2011-06-23 15:12:25 +00:00
Samuel Gutierrez
81f38b258a commit of new shared memory backing facility framework (shmem) and its components.
This commit was SVN r24795.
2011-06-21 15:41:57 +00:00
Ralph Castain
9491fbb60c Remove two stale modules
This commit was SVN r24794.
2011-06-21 05:57:39 +00:00
Ralph Castain
b95ede99d5 Ensure we use the pmi grpcomm when using pmi
This commit was SVN r24793.
2011-06-20 21:57:47 +00:00
Ralph Castain
92a65f21bf Restore slurm pmi support from long, long ago. Since we already have the ability to directly srun an MPI job, just conditionally add the PMI support for key values and provide a grpcomm module that uses PMI for barriers and modex.
Currently ompi_ignored, and unignored only for me (others to soon follow).

This commit was SVN r24792.
2011-06-20 21:04:46 +00:00
Ralph Castain
042ee3ec48 Support the option of outputting error_log messages with something other than the process name
This commit was SVN r24784.
2011-06-17 14:50:00 +00:00
Ralph Castain
61dd7f4588 Need to pass identification of input/output channels
This commit was SVN r24783.
2011-06-17 14:48:59 +00:00
Ralph Castain
93dcfc15d0 Let the upper layer set the channels to be opened
This commit was SVN r24779.
2011-06-16 20:31:52 +00:00
Ralph Castain
7f2d2e3de7 Track the app_context rank - will equal overall rank for single app_context jobs
This commit was SVN r24778.
2011-06-16 20:31:30 +00:00
Josh Hursey
0eb3b3b7b0 Fix missing functionality in MPI_Abort so that the abort of the group of peers defined by the communicator is requested from the runtime before the local process exits.
Per RFC:
  http://www.open-mpi.org/community/lists/devel/2011/06/9335.php

This commit was SVN r24775.
2011-06-15 13:10:13 +00:00
Ralph Castain
033cbbed31 Don't automatically assign group channels if not given - let the layer above figure it out.
This commit was SVN r24771.
2011-06-10 16:28:18 +00:00
Ralph Castain
e039c7b7ea Avoid crashing when debugging rmaps and a non-string resource constraint is given
This commit was SVN r24770.
2011-06-10 16:27:30 +00:00
Josh Hursey
9080eaedf3 Fully initialize the orte_errmgr_base_component_t structure for the app.
This commit was SVN r24765.
2011-06-09 14:22:25 +00:00
Josh Hursey
20339a7900 Minor coding style and indentation fixes.
This commit was SVN r24764.
2011-06-09 14:16:06 +00:00
Ralph Castain
fc8d920c56 Don't monitor resource usage unless requested, even if sensors are enabled during configure
This commit was SVN r24760.
2011-06-08 10:47:25 +00:00
Ralph Castain
906eb925f1 Update the resource usage sensor to initiate support for time analysis of measurements
This commit was SVN r24759.
2011-06-07 23:22:51 +00:00
Ralph Castain
f3cae3d6f3 Clean up the handling of if_include and if_exclude arguments based on CIDR notation.
Fix a bug in the new code that prevented the system from correctly matching addresses.

Remove comments in the show-help text indicating that we would continue in the face of incorrect specifications - leave that to the calling layer to decide.

Modify the new opal_ifmatches so it returns error codes letting the caller better understand the result.

Modify the oob to ensure we abort if we don't find interfaces matching specified constraints, and that we do so without multiple error messages.

NOTE: we have a conflict in our standards. We have been using comma-delimited lists of interfaces for all our params. However, one param - opal_net_private_ipv4 - now uses semicolons instead of comma separators. No idea why, but it is confusing.

This commit was SVN r24755.
2011-06-07 02:09:11 +00:00
George Bosilca
6b52d8f519 The paffinity is apparently needed.
This commit was SVN r24749.
2011-06-06 01:20:01 +00:00
Ralph Castain
bd8d9a943a Add diagnostics
This commit was SVN r24748.
2011-06-05 19:17:56 +00:00
Ralph Castain
1491d52bd7 Extend the parsing capability of the oob tcp module's if_include and if_exclude options to support subnet+mask notation, and to handle virtual IP addresses (it was previously having problems distinguishing between "eth1" and "eth1.3").
This commit was SVN r24747.
2011-06-05 19:16:42 +00:00
George Bosilca
454519842e Report bindings if requested.
This commit was SVN r24743.
2011-06-02 17:17:10 +00:00
George Bosilca
1eccadbd87 No need for the paffinity here.
This commit was SVN r24742.
2011-06-02 17:16:25 +00:00
Ralph Castain
8f401a0563 Enable the ability to constrain applications to hosts on the basis of resources.
This commit was SVN r24736.
2011-05-28 22:18:19 +00:00
Brian Barrett
beb1bc70b2 * Add support for using modex to exchange NID/PID pairs when using Portals4.
Rather than try to support a bunch of lightweight environments like I did
  with the Portals3 code, always use the "modex" and hack the grpcomm for
  the SHMEM implementation to return the right nid/pid for a remote
  process by "magic".

This commit was SVN r24733.
2011-05-25 22:10:27 +00:00
Ralph Castain
81b6c50daa Correct stale typo
This commit was SVN r24725.
2011-05-23 17:34:22 +00:00
Ralph Castain
661f508e62 Fix typo
This commit was SVN r24723.
2011-05-22 00:20:42 +00:00
Ralph Castain
b2331113a5 Add some debug
This commit was SVN r24722.
2011-05-21 21:09:47 +00:00
Ralph Castain
1b5ca323c6 Always follow up with sigkill when killing local procs, as procs can trap sigterm and get stuck
This commit was SVN r24719.
2011-05-20 22:40:10 +00:00
Ralph Castain
c5686ecfca Don't sample stats for pid=0 children
This commit was SVN r24717.
2011-05-20 14:33:23 +00:00
Ralph Castain
b03e4481a3 Plug a couple of additional places in case orte_iof is not opened
This commit was SVN r24716.
2011-05-20 13:42:53 +00:00
Ralph Castain
dc0bb0571b Record the number of heartbeats received each period for diagnostic purposes
This commit was SVN r24714.
2011-05-20 00:21:33 +00:00
Ralph Castain
69dce0ec10 Minor heartbeat cleanups
This commit was SVN r24713.
2011-05-19 21:27:44 +00:00
Ralph Castain
c3df95dd13 Prevent failure due to race condition during abnormal termination
This commit was SVN r24712.
2011-05-19 21:27:05 +00:00
Ralph Castain
b0f47e6f59 Allow orte_iof to not be opened
This commit was SVN r24711.
2011-05-19 21:26:30 +00:00
Ralph Castain
1f3911cc8b Add a new proc state
This commit was SVN r24710.
2011-05-19 21:25:58 +00:00
Ralph Castain
b47ec2ee87 Remove lingering references to opal_profile option
This commit was SVN r24709.
2011-05-18 18:27:29 +00:00
Ralph Castain
d34bab541d Remove the ompi-profiler tool and its attendant ompi-probe program. Also remove the grpcomm basic component since its only function was to support profiled clusters, which nobody was doing. :-(
This commit was SVN r24704.
2011-05-17 03:30:25 +00:00
Ralph Castain
a3e43594a4 Extend node stats to include additional memory info. Change "darwin" pstat module to "test" as we don't really know how to get all the stat info for darwin.
Add a new OPAL_ERROR_LOG macro similar to the ORTE_ERROR_LOG one.

This commit was SVN r24692.
2011-05-08 14:45:16 +00:00
Ralph Castain
c160f5d5a2 Add ability to specify mcast interfaces by name
This commit was SVN r24691.
2011-05-08 14:42:48 +00:00
Thomas Herault
fb3fd8fd0e Items belonging to peer_send_queue are mca_oob_tcp_msg_t *, which are obtained through an opal_freelist.
They shouldn't be released, but returned to the freelist.

This commit was SVN r24679.
2011-05-03 21:03:09 +00:00
Ralph Castain
138928fcf4 Use ports as multicast channels instead of networks so we avoid stepping into reserved spaces.
This commit was SVN r24666.
2011-04-29 18:46:40 +00:00
Shiqing Fan
4490fdbd34 Add the initial support for MinGW and MSYS.
Correctly check the dependencies of MSYS env.
Set up configure include and lib path for building the package.
Update a few more CMake scripts.

This commit was SVN r24663.
2011-04-29 14:42:07 +00:00
Ralph Castain
c78531ce8a Don't free the envar that gets putenv'd as that messes up the environ
This commit was SVN r24660.
2011-04-29 08:50:29 +00:00
Ralph Castain
b586f2952e Arggg...revert r24645. I knew those fields were there for a reason...sigh.
This commit was SVN r24647.

The following SVN revision numbers were found above:
  r24645 --> open-mpi/ompi@e4732110da
2011-04-28 15:07:00 +00:00
Ralph Castain
859aaab93d In the case of direct-launched processes running under slurm, psm requires that the pre_condition_transports MCA param be set. This is normally computed by mpirun and inserted into each proc's environ, but that doesn't work here.
So separate out the printing of that key, and let the individual procs generate it in a way that ensures they all get the same result.

This commit was SVN r24646.
2011-04-28 13:54:33 +00:00
Ralph Castain
e4732110da Remove a couple more stale fields
This commit was SVN r24645.
2011-04-28 00:26:38 +00:00
Ralph Castain
9988b97b97 Extend/update how we handle process stats. Add the ability to collect node-level stats separate from the process stats. Update the process stat memory fields to report in MBytes instead of KBytes as I can't find any process that runs in KBytes nowadays.
Rename the memusage sensor plugin to "resusage" as it will soon be updated to include full process stat monitoring.

Extend the heartbeat sensor to report node and process stats in the heartbeat.

Store the process and node stats in their respective orte_xxx_t object.

This commit was SVN r24629.
2011-04-21 22:55:45 +00:00
Ralph Castain
5f64b830f9 Ensure we only kill threads once
This commit was SVN r24620.
2011-04-18 14:47:09 +00:00
Ralph Castain
89501e6e24 Don't try to politely end threads when abnormally terminating as we can hang if the thread is in a stuck callback.
This commit was SVN r24618.
2011-04-18 12:21:47 +00:00
Ralph Castain
3a28556472 Expand our handling of non-zero exit status. If a process exits with non-zero status, pass that info along to the user in case it means something to them, even if the process also exited without calling MPI_Finalize. If the process calls MPI_Abort, that trumps the exit status question.
Provide a new MCA param that allows the user to direct that we abort the job once a process exits with non-zero status. No recovery is allowed in such cases to avoid trying to restart a process that has already exited MPI.

This commit was SVN r24614.
2011-04-14 15:04:21 +00:00