Ralph Castain
0a4639308e
Remove a potential race condition - we'll cleanup the local children when we are all done
...
This commit was SVN r32132.
2014-07-03 14:13:43 +00:00
George Bosilca
2883adcdf3
Remove useless variables.
...
This commit was SVN r32123.
2014-07-03 00:30:54 +00:00
Ralph Castain
149810f02c
Per request from Jeff, slightly modify the show_help message as the precise name of the NUMA-containing packages differs based on OS and distro
...
cmr=v1.8.2:reviewer=jsquyres:subject=modify show_help message
This commit was SVN r32122.
2014-07-02 14:46:00 +00:00
Ralph Castain
e9d69ca370
Remove stale test
...
This commit was SVN r32104.
2014-06-29 16:37:19 +00:00
Adrian Reber
47b118c0ae
fix FT compilation
...
This commit was SVN r32094.
2014-06-26 03:40:07 +00:00
Adrian Reber
cabf1d4e68
use the orte attributes in the FT code to fix compile errors
...
This commit was SVN r32093.
2014-06-26 03:19:17 +00:00
Adrian Reber
10c1a50705
"handle" removal of opal_db.remove() in the FT code
...
This commit was SVN r32092.
2014-06-26 03:11:37 +00:00
Ralph Castain
f3cb124e50
Revert r32082 and r32070 - the developer's conference has decided to go a different direction on the threaded progress effort. This will involve some degree of prototyping to understand the tradeoffs prior to making a final design decision, and so we'll hold off on the final change until that is completed.
...
This commit was SVN r32089.
The following SVN revision numbers were found above:
r32070 --> open-mpi/ompi@12d92d0c22
r32082 --> open-mpi/ompi@aa6438ef7a
2014-06-25 20:43:28 +00:00
Adrian Reber
9f73e79d91
also change the callback function prototype (to get the FT code to compile again)
...
This commit was SVN r32088.
2014-06-25 20:37:02 +00:00
Adrian Reber
4aca7095dc
fix a syntax error in the FT code
...
This commit was SVN r32087.
2014-06-25 20:35:50 +00:00
Adrian Reber
4b25e92194
get the FT code to compile again by adding/removing #includes
...
This commit was SVN r32086.
2014-06-25 18:42:17 +00:00
Ralph Castain
8fca77c3d3
Protect the binding policy setting so it builds when --without-hwloc
...
Refs trac:4742
This commit was SVN r32085.
The following Trac tickets were found above:
Ticket 4742 --> https://svn.open-mpi.org/trac/ompi/ticket/4742
2014-06-25 18:13:54 +00:00
Adrian Reber
72f1c7941f
use a consistent naming scheme for the SNAPSHOT attributes
...
This commit was SVN r32083.
2014-06-25 15:26:24 +00:00
Nathan Hjelm
563eaf0726
Fix support for Cray alps
...
The alps ras and plm components were broken by recent changes in ORTE. This
commit resolves those issues.
Changes:
- Define PMI2_SUCCESS if it isn't defined. This fixes a problem with Cray's
PMI implementation which does not define (for some reason) PMI2_SUCCESS. We
had previously just used PMI_SUCCESS.
- Add missing definition and a typo in pml_alps_module.
- launch_id is no longer available in the orte_node_t structure. Use the
attribute lookup to get the value.
- Do not use an O(n^2) sorting algorithm when putting alps nodes in order. Use
opal_list_sort instead (O(nlogn)).
This commit was SVN r32076.
2014-06-24 21:29:04 +00:00
Ralph Castain
5f6be06b54
Per request from Gilles and discussion at devel conference, have the --oversubscribe option automatically set both oversubscribe and overload-allowed properties as this is likely what the user intended.
...
cmr=v1.8.2:reviewer=rhc:subject=automatically set oversub/load
This commit was SVN r32072.
2014-06-24 18:11:39 +00:00
Ralph Castain
12d92d0c22
Per the OMPI developer conference, remove the last vestiges of OMPI_USE_PROGRESS_THREADS
...
This commit was SVN r32070.
2014-06-24 17:05:11 +00:00
Ralph Castain
34e5573988
Resolve the MTT timeout problem. This appears to have largely been caused by missing sigchld notifications, thus causing the daemons to believe that not all procs had exited. Let comm failure also serve as notification of process termination, and add appropriate flags/attributes to avoid multiple reporting of proc termination.
...
This won't transition cleanly to the 1.8 series, and may represent too much change, so we'll have to (a) evaluate whether or not to bring it over (once it demonstrates that it does indeed solve the problem), and (b) develop a custom patch for that purpose.
Refs trac:4717
This commit was SVN r32063.
The following Trac tickets were found above:
Ticket 4717 --> https://svn.open-mpi.org/trac/ompi/ticket/4717
2014-06-21 17:09:02 +00:00
Ralph Castain
9cfc408fd4
Little more debug - getting close to figuring this one out
...
Refs trac:4717
This commit was SVN r32060.
The following Trac tickets were found above:
Ticket 4717 --> https://svn.open-mpi.org/trac/ompi/ticket/4717
2014-06-20 16:24:06 +00:00
Ralph Castain
f9da295682
Add some additional debug
...
Refs trac:4717
This commit was SVN r32059.
The following Trac tickets were found above:
Ticket 4717 --> https://svn.open-mpi.org/trac/ompi/ticket/4717
2014-06-20 14:14:36 +00:00
Ralph Castain
645df5e823
Don't release the node_name field as it gets used in the slots parsing - will be released at newline detection
...
This commit was SVN r32058.
2014-06-20 13:18:46 +00:00
Ralph Castain
9a47e45a09
<laugh> ensure we really compare the things we want to compare
...
This commit was SVN r32055.
2014-06-19 20:54:25 +00:00
Ralph Castain
e65538e91b
Add some defensive programming, fix a typo
...
This commit was SVN r32054.
2014-06-19 20:52:13 +00:00
Ralph Castain
b43f760f93
If you don't specify all the rank-file mapping for all procs, then you'll segfault - which is probably a bad idea. I can't see an easy workaround, so just error out for now and let's see if anyone really cares.
...
cmr=v1.8.2:reviewer=jsquyres
This commit was SVN r32053.
2014-06-19 20:30:06 +00:00
Ralph Castain
b618b36a2f
Fix potential issue if opal_hwloc_topology is NULL
...
cmr=v1.8.2:reviewer=jsquyres
This commit was SVN r32050.
2014-06-19 18:52:41 +00:00
Ralph Castain
61fe4daa33
Add some further debug
...
Refs trac:4717
This commit was SVN r32047.
The following Trac tickets were found above:
Ticket 4717 --> https://svn.open-mpi.org/trac/ompi/ticket/4717
2014-06-19 15:59:51 +00:00
Ralph Castain
65275d6326
Add a little more info to the warning message - i.e., that the likely cause of the problem is missing libnumactl and/or libnumactl-devel
...
cmr=v1.8.2:reviewer=miked:subject=improve memory binding failure message
This commit was SVN r32030.
2014-06-18 19:20:28 +00:00
Ralph Castain
3f032d39e8
Mark the proc as alive so waitpid callback system doesn't immediately activate the callback
...
Refs trac:4717
This commit was SVN r32026.
The following Trac tickets were found above:
Ticket 4717 --> https://svn.open-mpi.org/trac/ompi/ticket/4717
2014-06-18 14:04:55 +00:00
Ralph Castain
8e7c0257f0
Cleanup some missed updates to orte_wait_cb as params have changed
...
Refs trac:4717
This commit was SVN r32025.
The following Trac tickets were found above:
Ticket 4717 --> https://svn.open-mpi.org/trac/ompi/ticket/4717
2014-06-17 23:40:31 +00:00
Ralph Castain
5dbf4a62c4
Cleanup: we were accidentally killing ourselves (bad idea)
...
Refs trac:4717
This commit was SVN r32022.
The following Trac tickets were found above:
Ticket 4717 --> https://svn.open-mpi.org/trac/ompi/ticket/4717
2014-06-17 20:38:42 +00:00
Ralph Castain
5216bd5558
Multiple sigchld reports can occur within a single event callback, so have to reap them until none remain. Also, need to ensure the daemon is flagged as alive prior to calling wait_cb
...
Refs trac:4717
This commit was SVN r32020.
The following Trac tickets were found above:
Ticket 4717 --> https://svn.open-mpi.org/trac/ompi/ticket/4717
2014-06-17 18:46:40 +00:00
Ralph Castain
42bf7466fc
This isn't as big a change as it appears - a change in one place caused a whole bunch of files to require updated #include's due to some arcane linkage. Rework the orte_wait code to reflect the introduction of the state machine. If we are in cleanup mode and just want to kill all our local children, then there is no reason to be polite about it as that introduces *very* long delays at scale. Just kill the procs and move on.
...
Refs trac:4717
This commit was SVN r32019.
The following Trac tickets were found above:
Ticket 4717 --> https://svn.open-mpi.org/trac/ompi/ticket/4717
2014-06-17 17:57:51 +00:00
Ralph Castain
ab52f16100
Attempt to cleanup the race condition Rolf keeps encountering in MTT by adding some protection to ensure orted's try to terminate once their local procs die. Also, fix a problem whereby a failure to comm_spawn would result in a hang of the parent process.
...
cmr=v1.8.2:reviewer=rhc:subject=cache termination cleanups
This commit was SVN r32008.
2014-06-16 20:46:35 +00:00
Ralph Castain
561983ae52
Fix static builds by renaming conflicting type
...
This commit was SVN r32006.
2014-06-14 17:39:28 +00:00
Ralph Castain
3f04d50cb0
Per the ticket, resolve our handling of overload conditions to provide a more consistent response. If we are overloaded (i.e., attempting to bind more processes to a location than the number of cpus under that location), then we consider the following conditions:
...
(a) default binding policy is in effect. In this case, we will emit a
warning and default to not binding unless the user provided the
"oversubscribe" or "overload" modifier to the "bind-to" option.
(b) user-specified binding policy is in effect. In this case, we will
error out unless the user provided the "oversubscribe" or "overload"
modifier to the "bind-to" option as we cannot meet the directive.
Either "bind-to" modifier (oversubscribe or overload) will be accepted for
now - in 1.9, we will deprecate the "overload" term in favor of
"oversubscribe".
Also added the ability to accept a --bind-to modifier without specifying the binding policy itself so a user can specify overload-allowed with the default policy.
Closes trac:4345
cmr=v1.8.2:reviewer=rhc:subject=resolve handling of overload conditions
This commit was SVN r32005.
The following Trac tickets were found above:
Ticket 4345 --> https://svn.open-mpi.org/trac/ompi/ticket/4345
2014-06-14 15:38:32 +00:00
Ralph Castain
56c3575c0e
Can't emit an error for an unrecognized mapping policy modifier as the ppr policy relies on not doing so.
...
This commit was SVN r31998.
2014-06-13 20:10:09 +00:00
Ralph Castain
3ed282bf44
Per patch from Tetsuya, correct the cpus-per-proc logic so we correctly detect when the user is attempting to bind too low for that option
...
Refs trac:4702
This commit was SVN r31988.
The following Trac tickets were found above:
Ticket 4702 --> https://svn.open-mpi.org/trac/ompi/ticket/4702
2014-06-13 16:32:52 +00:00
Ralph Castain
ba926d8635
The TCP component will have set the hash table entry to NULL, but that doesn't remove the key. So the hash_table retrieval function will return success, but with a NULL pointer - protect against that scenario
...
Patch provided by Gilles - reviewed by rhc.
RM-approved
cmr=v1.8.2:reviewer=ompi-gk1.8
This commit was SVN r31971.
2014-06-09 17:46:22 +00:00
Ralph Castain
5d5ae41ea5
Cleanup a memory leak in the daemons - thanks to Artem for spotting it
...
This commit was SVN r31970.
2014-06-09 17:14:02 +00:00
Ralph Castain
06dbfa3098
Make the cpus-per-proc equivalent a little more intuitive:
...
* allow users to specify just a modifier for map-by instead of requiring that they also specify a policy. Thus, we now accept --map-by :pe=3 as indicating that we should use the default mapping policy, but bind 3 cpus/proc.
* if users specify a pe's/proc but no policy, default to --map-by NUMA to ensure we have access to multiple cpus for the request. This won't guarantee we have access to enough to meet the request, but gives us a chance. In addition, we know that binding a proc to multiple cpus will work best if those cpus are all in the same NUMA, so this provides some degree of optimized behavior.
Per a request from Jeff, define "oversubscribe" for binding as a synonym for the "overload" modifier.
cmr=v1.8.2:reviewer=rhc
This commit was SVN r31967.
2014-06-08 20:26:59 +00:00
Ralph Castain
8db76e9c6f
Ensure that we change to the session dir if we preload binaries so we'll use the loaded one
...
Special patch created for v1.8 and CMR filed
This commit was SVN r31963.
2014-06-06 21:43:23 +00:00
Ralph Castain
b7c08582ba
Add new tag to avoid conflicts
...
This commit was SVN r31960.
2014-06-06 17:23:35 +00:00
Ralph Castain
638c24f655
Correct the bind-in-place algorithm to better handle comm_spawn. If the location identified by the mapper is already occupied by procs from another job, then we need to shift either right or left until we find an unoccupied location where we can be bound. If nothing is available, then check for the overload flag (and bind us in the original location if provided), or see if this was the default binding policy instead of one specified by the user - if so, then just don't bind this process.
...
cmr=v1.8.2:reviewer=rhc
This commit was SVN r31959.
2014-06-06 12:36:14 +00:00
Gilles Gouaillardet
d26ac02b4a
#if OPAL_HAVE_HWLOC protect access to orte_proc_info_t.cpuset
...
Fix a bug when trunk is configured with --without-hwloc
v1.8 is safe so no cmr
This commit was SVN r31957.
2014-06-06 07:25:39 +00:00
Ralph Castain
e21bfeadcd
Now that the BTLs are moving down to OPAL and becoming available to ORTE, there no longer is a need/desire to push performance in the OOB/TCP component. So we don't need multiple modules driving NICs in parallel, and can drop all the complicated distribution logic. Fall back to the simplified single module model, but retain the ability to run that module in its own progress thread if so directed.
...
This should eliminate the connectivity issues that have been reported, and will make maintenance of this component much easier.
cmr=v1.8.2:reviewer=jsquyres:subject=simplify the OOB/TCP component
This commit was SVN r31956.
2014-06-06 02:24:17 +00:00
Ralph Castain
34cb137314
Add another attribute to the orte_proc_t area
...
This commit was SVN r31953.
2014-06-05 14:48:19 +00:00
Ralph Castain
b2413a6b88
Cannot update the proc state prior to activating the state machine as some callback functions need to compare the prior proc state against the new one.
...
cmr=v1.8.2:reviewer=jsquyres
This commit was SVN r31949.
2014-06-04 03:40:08 +00:00
Ralph Castain
b771388fa7
We really need to send *all* the daemon info whenever the daemon job has changed as new daemons need a full nidmap
...
cmr=v1.8.2:reviewer=jsquyres
This commit was SVN r31948.
2014-06-04 03:38:54 +00:00
Ralph Castain
c5384d44d7
Protect against NULL result in get_attr
...
This commit was SVN r31947.
2014-06-04 03:09:37 +00:00
Ralph Castain
8e768de317
Really should check our own node for oversubscription, not the HNP
...
cmr=v1.8.2:reviewer=jsquyres
This commit was SVN r31946.
2014-06-04 03:09:02 +00:00
Ralph Castain
5398158d5a
Move fclose inside bracket to protect against NULL fp
...
This commit was SVN r31945.
2014-06-04 03:08:18 +00:00
Ralph Castain
f1978fba7c
Cleanup a set of typos on the orte_get_attribute call
...
This commit was SVN r31942.
2014-06-03 20:36:38 +00:00
Ralph Castain
5668f085a3
Silence some useless warnings, and fix a missed updated in the tm plm
...
This commit was SVN r31930.
2014-06-02 17:57:56 +00:00
Ralph Castain
742c0d2284
Fix typo that would cause a segfault if orte_startup_timeout was set
...
This commit was SVN r31929.
2014-06-02 15:59:18 +00:00
Ralph Castain
7df500ecf5
Break the loop caused by retrying to send a message to a hop that is unknown by the TCP oob component. We attempt to provide a way for other components to try, but need to mark that the TCP component is not able to reach that process so the OOB base will know to give up.
...
This commit was SVN r31928.
2014-06-02 15:00:33 +00:00
Ralph Castain
03234f2a33
Revert r31926 and replace it with a more complete checking of availability and accessibility of the required freq control paths.
...
This commit was SVN r31927.
The following SVN revision numbers were found above:
r31926 --> open-mpi/ompi@9779084352
2014-06-02 14:34:00 +00:00
Gilles Gouaillardet
9779084352
orte_rtc_base_select: skip a RTC module if it has a zero priority
...
This commit was SVN r31926.
2014-06-02 07:16:41 +00:00
Ralph Castain
65a35d92ef
Cleanup compile issues - missing updates to some plm components and the slurm ras component
...
This commit was SVN r31921.
2014-06-01 17:59:06 +00:00
Ralph Castain
2c3d07db24
Cleanup the test so it is MPI correct
...
This commit was SVN r31919.
2014-06-01 17:57:36 +00:00
Ralph Castain
8736a1c138
Per RFC:
...
http://www.open-mpi.org/community/lists/devel/2014/05/14822.php
Revamp the ORTE global data structures to reduce memory footprint and add new features. Add ability to control/set cpu frequency, though this can only be done if the sys admin has setup the system to support it (or you run as root).
This commit was SVN r31916.
2014-06-01 16:14:10 +00:00
Ralph Castain
cf2c7381d0
Replace the PML barrier with an RTE barrier for now until we can come up with a better solution for connectionless BTLs.
...
Refs trac:4643
This commit was SVN r31915.
The following Trac tickets were found above:
Ticket 4643 --> https://svn.open-mpi.org/trac/ompi/ticket/4643
2014-06-01 16:08:56 +00:00
Ralph Castain
1107f9099e
Per the RFC issued here:
...
http://www.open-mpi.org/community/lists/devel/2014/05/14827.php
Refactor PMI support
This commit was SVN r31907.
2014-06-01 04:28:17 +00:00
Nathan Hjelm
041b72b0cc
plm/alps: better workaround for the noisy cray pmi implementation
...
This commit is a slightly better workaround to prevent mesages of
the form:
[unset]:_pmi_alps_get_apid:alps_app_lli_put_request failed
[unset]:_pmi_alps_get_appLayout:pmi_alps_get_apid returned with error: Bad file descriptor
It works by completely disabling PMI in the application process when using
mpirun. This should not be an issue for any apps.
cmr=v1.8.2:reviewer=rhc
This commit was SVN r31882.
2014-05-22 16:04:36 +00:00
Oscar Vega-Gisbert
83bdebbf81
Java bindings for OSHMEM.
...
This commit was SVN r31810.
2014-05-18 21:48:09 +00:00
Nathan Hjelm
73bfecd650
More leak fixes.
...
Two leaks are fixed in this commit:
- Do not leak btl component list items.
- Do not leak the nodename when decoding the pidmap.
cmr=v1.8.2:reviewer=rhc
This commit was SVN r31779.
2014-05-15 16:38:13 +00:00
Nathan Hjelm
59d09ad9de
orte: fix several small memory leaks
...
grpcomm: fix memory leaks
We were leaking the caddy object used to pass data to the callback
function. This commit fixes these leaks.
oob,rml: fix memory leaks
This commit fixes several leaks:
- Both the oob/base and oob/tcp were leaking objects on their peer
hash tables. Iterate on the hash tables and free any objects.
- Leaked sent messages because of missing OBJ_RELEASE. I placed the
release in ORTE_RML_SEND_COMPLETE to catch all the possible
paths.
ess/base: close the state framework
cmr=v1.8.2:reviewer=rhc
This commit was SVN r31776.
2014-05-15 15:06:27 +00:00
Gilles Gouaillardet
5b9364fc12
Fix a memory leak in orte_register_params()
...
mca_base_var_register (..., MCA_BASE_VAR_TYPE_STRING, ...)
will dup() the orte_set_slots string, so there is no need
to do this in the first place.
cmr=v1.8.2:reviewer=rhc
This commit was SVN r31773.
2014-05-15 10:31:19 +00:00
Gilles Gouaillardet
5f82c391a6
Fix memory leaks in orte/util/nidmap.c
...
This patch fixes four memory leaks in orte/util/nidmap.c :
- hwloc_get_root_obj(opal_hwloc_topology)->userdata was never freed
- even if bo->bytes is freed in the decode, bo was not freed
- a job list is populated but never used nor freed
cmr=v1.8.2:reviewer=rhc
This commit was SVN r31770.
2014-05-15 08:28:53 +00:00
Ralph Castain
ad0e8f841d
Just pick a module to handle the incoming connection if no direct interface is identified. Siegmar hit it because his IP/netmask is disjoint, but a router was able to make the connection.
...
Refs trac:4627
This commit was SVN r31763.
The following Trac tickets were found above:
Ticket 4627 --> https://svn.open-mpi.org/trac/ompi/ticket/4627
2014-05-14 19:23:02 +00:00
Ralph Castain
e605e73379
Close the incoming socket if we aren't going to accept it
...
cmr=v1.8.2:reviewer=rhc
This commit was SVN r31759.
2014-05-14 16:51:59 +00:00
Ralph Castain
3a1c2fff3e
Correct a misplaced bracket - daemons shouldn't be doing app-related operations
...
This may need a patch for 1.8.2, but we can try to directly apply it
cmr=v1.8.2:reviewer=hjelmn
This commit was SVN r31754.
2014-05-14 15:23:30 +00:00
Nathan Hjelm
2a57e71a47
plm/alps: fix typo introduced in r31589
...
This commit was SVN r31747.
The following SVN revision numbers were found above:
r31589 --> open-mpi/ompi@445b552d3a
2014-05-13 22:36:54 +00:00
Ralph Castain
f55c587a74
Per patch from Tetsuya Mishima, ensure the rank_file mapper accurately tracks number of nodes in the map
...
Refs trac:4594
This commit was SVN r31725.
The following Trac tickets were found above:
Ticket 4594 --> https://svn.open-mpi.org/trac/ompi/ticket/4594
2014-05-13 14:36:25 +00:00
Ralph Castain
5388347511
Per Jeff's suggestion, remove function that has duplicate functionality and just use one to check if session_dir directory should be removed.
...
Refs trac:4584
This commit was SVN r31691.
The following Trac tickets were found above:
Ticket 4584 --> https://svn.open-mpi.org/trac/ompi/ticket/4584
2014-05-08 17:22:43 +00:00
Ralph Castain
aaae4841e9
Flush the show_help system on our way out - this also restores the opal_show_help function pointer to the OPAL layer for any subsequent processing.
...
cmr=v1.8.2:reviewer=jsquyres
This commit was SVN r31685.
2014-05-08 14:37:47 +00:00
Ralph Castain
5602156a1c
Use the correct abstraction layer name for the data dirs
...
This commit was SVN r31684.
2014-05-08 14:32:24 +00:00
Ralph Castain
11faab1091
The final step of the RFC: convert the <foo>libdir and friends to fit their respective code areas, and equate them all at the top. Note that we can't entirely separate things as the opal_install_dirs framework can't handle separated locations for the various trees.
...
This commit was SVN r31679.
2014-05-08 02:01:35 +00:00
Ralph Castain
a8e2d6c3a6
The bulk of the remaining renaming changes, in one final glorious "blob". Thanks to Jeff for some help chasing down a few spots. Per chat with Jeff, we decided to cleanup a few things that were historical in nature:
...
top_ompi_srcdir -> OMPI_TOP_SRCDIR
top_ompi_builddir -> OMPI_TOP_BUILDDIR
We also split the srcdir/builddir flags according to their local tree (e.g., OPAL_TOP_SRCDIR), and tied them all together in configure.ac. Renamed ompi_ignore and ompi_unignore to be opal_<foo> as these are agnostic markers.
Only thing left is ompilibdir being treated similar to what we dif for srcdir/builddir. Coming soon.
This commit was SVN r31678.
2014-05-07 21:48:53 +00:00
Ralph Castain
05590b6a8c
Correct the datastore containing the coprocessor info
...
This commit was SVN r31677.
2014-05-07 19:29:12 +00:00
Ralph Castain
4def94900a
Per RFC: OMPI_INSTALL_BINARIES -> OPAL_INSTALL_BINARIES
...
This commit was SVN r31634.
2014-05-05 21:43:05 +00:00
Ralph Castain
87d809eefe
Add a new "run-time controls" framework for setting controls on processes. Initially, just move the process binding code there under a new "hwloc" component. Additional components to support cgroups, power settings, etc. to follow
...
This commit was SVN r31633.
2014-05-05 19:22:06 +00:00
Ralph Castain
fae39a658d
Add third flag for open when using O_CREAT. Thanks to "robi" for reporting it and providing a patch.
...
Fixes trac:4596
Reviewed by rhc, RM-approved
cmr=v1.8.2:reviewer=ompi-gk1.8
This commit was SVN r31626.
The following Trac tickets were found above:
Ticket 4596 --> https://svn.open-mpi.org/trac/ompi/ticket/4596
2014-05-02 21:58:38 +00:00
Ralph Castain
60c554e097
Ugh - protect that --display-devel print with some NULL checks
...
This commit was SVN r31604.
2014-05-02 14:28:45 +00:00
Ralph Castain
c7f55be387
Per a user request, add binding info to the simple --diplay-map option
...
This commit was SVN r31603.
2014-05-02 14:25:59 +00:00
Ralph Castain
ccd33a17b8
Since we cannot block when calling abort, and we want to ensure any "show_help" message at least has a chance to get out before we exit, introduce a slight delay into the abort procedure.
...
Refs trac:4576
This commit was SVN r31601.
The following Trac tickets were found above:
Ticket 4576 --> https://svn.open-mpi.org/trac/ompi/ticket/4576
2014-05-02 10:46:25 +00:00
Ralph Castain
c1383ca1f3
Protect against NULL cpuset when not bound
...
This commit was SVN r31600.
2014-05-02 10:45:11 +00:00
Ralph Castain
0209cddb5b
Revert r31596 and r31595 as they recreate the "abort" problem - all they did was move the blocking send to another point in the code. An alternative solution to the "show_help and abort" problem. will come in another commit
...
Refs trac:4576
This commit was SVN r31599.
The following SVN revision numbers were found above:
r31595 --> open-mpi/ompi@2b61f22973
r31596 --> open-mpi/ompi@712634efd3
The following Trac tickets were found above:
Ticket 4576 --> https://svn.open-mpi.org/trac/ompi/ticket/4576
2014-05-02 10:38:30 +00:00
Ralph Castain
6545e6e9a8
Add one more check for failed mapping that rarely occurs, but results in a hang when it does
...
cmr=v1.8.2:reviewer=rhc
This commit was SVN r31598.
2014-05-02 10:35:14 +00:00
Ralph Castain
712634efd3
Silence warning
...
Refs trac:4576
This commit was SVN r31596.
The following Trac tickets were found above:
Ticket 4576 --> https://svn.open-mpi.org/trac/ompi/ticket/4576
2014-05-01 23:58:03 +00:00
Ralph Castain
2b61f22973
Now that the abort code no longer involves a blocking rml send section, apps that call show_help followed by abort are not printing their error message. So block them in show_help until that message gets out.
...
This commit was SVN r31595.
2014-05-01 22:57:17 +00:00
Ralph Castain
445b552d3a
Try again to get an error message printed when a daemon fails to successfully report back to mpirun. In this case, there is no guaranteed way for the daemon to output the error report itself - we don't have a connection back to the HNP, and we have tied stderr off to /dev/null (for good reasons). So the HNP has to detect the failure itself and report it.
...
The HNP can't know the precise reason, of course - all it knows is that the daemon failed. So output a generic error message that provides guidance on probable causes.
Refs trac:4571
This commit was SVN r31589.
The following Trac tickets were found above:
Ticket 4571 --> https://svn.open-mpi.org/trac/ompi/ticket/4571
2014-05-01 19:48:21 +00:00
Ralph Castain
567ed25938
As per the earlier RFC, move the DB framework to orcm, thus removing it from the OMPI code repo
...
This commit was SVN r31586.
2014-05-01 15:43:32 +00:00
Ralph Castain
3b64c603b4
First stage of RFC to rename OMPI_foo build system support: change OMPI_CHECK_PACKAGE -> OPAL_CHECK_PACKAGE
...
This commit was SVN r31582.
2014-05-01 14:24:56 +00:00
Ralph Castain
238ecea311
When we comm_spawn, we really want to respect the original -host directives and not expand the daemon virtual machine unless directed to do so in the comm_spawn command. Otherwise, we will automatically launch daemons on every node in the allocation.
...
cmr=v1.8.2:reviewer=rhc:subject=respect vm boundaries during comm_spawn
This commit was SVN r31578.
2014-04-30 22:26:18 +00:00
Ralph Castain
d04a102ab8
Silence warnings
...
This commit was SVN r31573.
2014-04-30 20:55:46 +00:00
Ralph Castain
087b84b0ef
Add some further debug to the dstore framework. When doing comm_spawn, we have to exchange any provided cpu bitmaps to ensure both sides compute the same locality, else various mpi frameworks can go bonkers.
...
This commit was SVN r31572.
2014-04-30 19:29:00 +00:00
Ralph Castain
8cda1b3dc6
Don't store cpu_bitmap unless it is non-NULL
...
This commit was SVN r31570.
2014-04-30 18:12:48 +00:00
Ralph Castain
7a79b25577
Ensure we cleanup some files so session dirs can be rolled up
...
cmr=v1.8.2:reviewer=jsquyres
This commit was SVN r31569.
2014-04-30 17:52:10 +00:00
Ralph Castain
34988ba2a2
Cleanup the MPI_Abort detection
...
Refs trac:4576
This commit was SVN r31561.
The following Trac tickets were found above:
Ticket 4576 --> https://svn.open-mpi.org/trac/ompi/ticket/4576
2014-04-30 00:51:59 +00:00
Ralph Castain
3c9d877c1b
Remove debug
...
This commit was SVN r31560.
2014-04-30 00:08:43 +00:00
Ralph Castain
9402380e1f
Fix some errors in transition
...
This commit was SVN r31559.
2014-04-30 00:07:53 +00:00