Commit graph

5556 Commits

Author SHA1 Message Date
Ralph Castain
38636f4f0a Ensure we properly cleanup on termination, including when terminating due to ctrl-c
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-06-21 06:33:37 -07:00
Ralph Castain
2aa286c9d0 Update orte-clean so it cleans legacy session directories as well as pmix artifacts
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-06-20 17:46:39 -07:00
Ralph Castain
501ba8faad Merge pull request #3704 from rhc54/topic/signal
Control distribution of signals to children vs grandchildren
2017-06-20 11:11:43 -07:00
Ralph Castain
952726c121 Update to latest PMIx master - equivalent to 2.0rc2. Update the thread support in the opal/pmix framework to protect the framework-level structures.
This now passes the loop test, and so we believe it resolves the random hangs in finalize.

Changes in PMIx master that are included here:

* Fixed a bug in the PMIx_Get logic
* Fixed self-notification procedure
* Made pmix_output functions thread safe
* Fixed a number of thread safety issues
* Updated configury to use 'uname -n' when hostname is unavailable

Work on cleaning up the event handler thread safety problem
Rarely used functions, but protect them anyway
Fix the last part of the intercomm problem
Ensure we don't cover any PMIx calls with the framework-level lock.
Protect against NULL argv comm_spawn

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-06-20 09:02:15 -07:00
Ralph Castain
206aec6083 By default, apply signals to all direct children _and_ any children they might have spawned (so long as they remain in the same process group). Provide an MCA param (odls_base_signal_direct_children_only) to indicate that the signal is to go _only_ to our direct children, and not be delivered to any children spawned by those procs.
Refs https://www.mail-archive.com/users@lists.open-mpi.org/msg31221.html

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-06-15 12:26:11 -07:00
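
The delivery rule described in 206aec6083 can be sketched as follows. This is an illustrative Python sketch, not the ORTE code: it assumes each direct child was made the leader of its own process group (so its process group ID equals its pid), and the names `kill_target` and `deliver_signal` are hypothetical. It relies only on the POSIX rule that kill(2) with a negative pid signals every process in that group.

```python
import os
import signal


def kill_target(child_pid: int, direct_children_only: bool) -> int:
    """Return the first argument to hand to kill(2) for one direct child.

    Assumption (not taken from the commit): the child is the leader of its
    own process group, so its pgid is the same number as its pid.
    """
    if direct_children_only:
        # Positive pid: only the direct child receives the signal.
        return child_pid
    # Negative pid: the whole process group receives it, i.e. the child and
    # any descendants that have stayed in the child's process group.
    return -child_pid


def deliver_signal(child_pid: int, signum: int = signal.SIGTERM,
                   direct_children_only: bool = False) -> None:
    """Send signum to the child, or to its process group by default."""
    os.kill(kill_target(child_pid, direct_children_only), signum)
```

With the default setting the signal also reaches grandchildren that kept the child's process group; passing `direct_children_only=True` mirrors the effect the commit attributes to the odls_base_signal_direct_children_only MCA param.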
Ralph Castain
8f09929469 Fix rank-file mapper launch by correctly setting up the remote map from the provided data
Put a simple protection for the case where procs fail while we are trying to deregister handlers

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-06-15 08:33:29 -07:00
Ralph Castain
8afa1433b8 Only set the "bound" flag if we were actually bound
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-06-14 13:22:01 -07:00
Gilles Gouaillardet
72c7329462 configury: use 'uname -n' when 'hostname' is not available
the 'hostname' command might not be available on some platforms
such as Fedora Core 26, so mimic config/libtool.m4 and fall back
to 'uname -n' if needed

Refs. #3680

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2017-06-12 15:04:32 +09:00
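
The fallback described in 72c7329462 amounts to a simple command choice. The actual change lives in the configure scripts; the Python sketch below only illustrates the logic, and the function names `pick_hostname_command` and `node_name` are hypothetical.

```python
import shutil
import subprocess


def pick_hostname_command(which=shutil.which):
    """Choose the command used to obtain the node name.

    `which` is injectable so the choice can be exercised without changing
    the PATH; by default it is shutil.which.
    """
    if which("hostname"):
        return ["hostname"]
    # 'hostname' is not installed: fall back to 'uname -n', which prints
    # the node name on POSIX systems.
    return ["uname", "-n"]


def node_name():
    """Run the chosen command and return its output without the newline."""
    result = subprocess.run(pick_hostname_command(), capture_output=True,
                            text=True, check=True)
    return result.stdout.strip()
```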
Ralph Castain
1f0f03b45b Print a better error message when srun isn't found in the path. Ensure we don't segfault if -host specifies a node not included in the allocation
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-06-09 07:46:47 -07:00
Ralph Castain
00ba6a1be6 Protect against NULL topology
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-06-08 20:56:44 -07:00
Ralph Castain
7b39f19f60 Fix the backend mapper algorithm for comm_spawn. The front and back ends need to get the nodes into the job map in the same order so that the ranking algorithms will reach the same results
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-06-08 08:00:52 -07:00
Ralph Castain
81ab79f311 Ensure the orted doesn't go into an infinite loop during force-terminate
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-06-07 21:44:49 -07:00
Ralph Castain
7002535059 Merge pull request #3671 from rhc54/topic/ofi
We cannot use OFI to determine when daemons can finalize as we don't …
2017-06-07 15:08:56 -07:00
George Bosilca
484004b03d simple_spawn should be independent of ORTE.
2017-06-07 17:51:46 -04:00
Ralph Castain
919d7fcf49 We cannot use OFI to determine when daemons can finalize as we don't see the "sockets" go away. So always use the OOB for the mgmt conduit - this provides the necessary termination signal AND ensures that IOF and other mgmt messages go solely across TCP.
Cleanup the way we look for matching OFI addresses by using the opal_net_samenetwork helper function. This now works for multi-network environments, but only using the socket provider

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-06-07 13:51:30 -07:00
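
Commit 919d7fcf49 matches OFI addresses with the opal_net_samenetwork helper. The kind of check such a helper performs can be sketched in Python with the standard ipaddress module. This is an assumption-laden illustration rather than the OPAL implementation: the function name `same_network` and the explicit prefix-length argument are chosen here for the example.

```python
import ipaddress


def same_network(addr_a: str, addr_b: str, prefixlen: int) -> bool:
    """Return True if both addresses fall in the same network.

    The network is derived from addr_a and prefixlen; strict=False lets
    addr_a carry host bits. Addresses of different IP versions never match.
    """
    network = ipaddress.ip_network(f"{addr_a}/{prefixlen}", strict=False)
    return ipaddress.ip_address(addr_b) in network
```

For example, `same_network("192.168.1.10", "192.168.1.200", 24)` is true, while the same call with "192.168.2.10" as the second address is false.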
Ralph Castain
bd1793ad17 Get the pmix/ext2x component to work. Fix a minor problem in the libevent external component.
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-06-06 20:06:28 -07:00
Ralph Castain
93cf3c7203 Update OPAL and ORTE for thread safety
(I swear, if I look this over one more time, I'll puke)

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-06-06 12:30:57 -07:00
Ralph Castain
a28eaf914a Silence warnings when terminating
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-06-05 13:53:07 -07:00
Ralph Castain
8f526968c2 Do not hang if we cannot relay messages. Eliminate extra error log message
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-06-05 06:35:19 -07:00
Ralph Castain
51b4078b70 Merge pull request #3648 from rhc54/topic/ofi
Clean up the conduit open code so we return detectable errors when co…
2017-06-02 18:08:55 -07:00
Ralph Castain
e884cbf5f5 Even though the ofi component doesn't do any routing itself, the rest of the code base (e.g., grpcomm) needs to know what routing module this component is using. So set it to the "direct" module, and don't allow ofi to be used if that module isn't available.
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-06-02 15:47:25 -07:00
Ralph Castain
ba9a6078c2 Add ability to select transport, and only compare the first one in the conduit list for a match. This lets you select which conduit to use for OFI - if you set "-mca rml_ofi_transports ethernet" you'll pickup the mgmt conduit. If you set "-mca rml_ofi_transports fabric", you'll get the coll conduit
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-06-02 14:31:23 -07:00
Jeff Squyres
af9565ec25 ess: add missing <signal.h> header
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2017-06-02 14:11:40 -07:00
Ralph Castain
066d5eedce Shift the signal forwarding code to ess/base so it can be available to more than just the hnp component. Extend the slurm component to use it so that any signals given directly to the daemons by their slurmstepd get forwarded to their local clients
Check for NULL

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-06-02 10:59:14 -07:00
Ralph Castain
6b3bbd30c5 Clean up the conduit open code so we return detectable errors when conduit not opened.
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-06-02 10:40:51 -07:00
Ralph Castain
2ab4f93f6a Instead of "forced_terminate" just quietly causing the daemon to disappear, let's at least attempt to let the user know where the problem occurred.
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-06-02 08:28:16 -07:00
anandhi
6ddb487744 Cleaned up send_msg(), moved the check for send-to-self into send_nb() and send_buffer_nb()
	modified:   orte/mca/rml/ofi/rml_ofi_send.c

Signed-off-by: Anandhi Jayakumar <anandhi.s.jayakumar@intel.com>
2017-06-01 17:50:54 -07:00
Ralph Castain
9d6b929894 Fix uninitialized variable. Set exit codes for failed launch so we get pretty error messages
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-05-31 07:38:37 -07:00
Ralph Castain
26e7515a5e Don't sweat the "sync" settings on file descriptors as those flags apparently aren't fully portable
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-05-30 20:37:26 -07:00
Ralph Castain
5d990b557c Reorg ordering so that bare executable names also are found
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-05-30 15:58:55 -07:00
Ralph Castain
321abfc8c6 Fix cwd and preload-binary options
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-05-30 14:07:22 -07:00
Ralph Castain
ad108ba44d Fix the DVM
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-05-30 11:42:42 -07:00
Ralph Castain
9a8811a246 Ensure that data from a job that was stored in ompi-server is purged once that job completes. Cleanup a few typos. Silence a Coverity warning
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-05-30 09:43:01 -07:00
Ralph Castain
e8759ca66b Add minor test to ORTE test suite
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-05-29 15:43:52 -07:00
Ralph Castain
f3ab326b4a Add some debug code for detecting leaking file descriptors. At the end of each job (and if MCA param is set), have each daemon compute the number of open fds and their characteristics and print a summary
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-05-29 11:25:20 -07:00
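
The per-daemon summary described in f3ab326b4a (count the open file descriptors and report their characteristics) can be approximated as below. This is a hedged Python sketch, not the ORTE debug code: the names `classify_fd` and `summarize_open_fds`, the category labels, and the scan limit of 1024 descriptors are all assumptions made for the example.

```python
import os
import stat


def classify_fd(mode: int) -> str:
    """Map an st_mode value to a coarse descriptor type."""
    if stat.S_ISSOCK(mode):
        return "socket"
    if stat.S_ISFIFO(mode):
        return "pipe"
    if stat.S_ISREG(mode):
        return "file"
    if stat.S_ISCHR(mode):
        return "chardev"
    if stat.S_ISDIR(mode):
        return "dir"
    return "other"


def summarize_open_fds(max_fd: int = 1024) -> dict:
    """Count the open descriptors below max_fd, grouped by type."""
    summary = {}
    for fd in range(max_fd):
        try:
            mode = os.fstat(fd).st_mode
        except OSError:
            # fstat fails with EBADF when the descriptor is not open.
            continue
        kind = classify_fd(mode)
        summary[kind] = summary.get(kind, 0) + 1
    return summary
```

Comparing the summary taken at the end of one job with the summary from the next makes a leak visible as a count that keeps growing.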
Ralph Castain
87201a80ff Silence coverity warnings
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-05-27 11:45:53 -07:00
Ralph Castain
9f60cd0fe7 Update the connect/accept support so we check to see if we have the proper infrastructure and RTE support, including whether we have ompi-server available if the connect/accept spans multiple applications. Print pretty help messages in all cases where we do not have support
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-05-27 10:47:08 -07:00
Ralph Castain
8c2a06477c Fix ompi-server operations
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-05-26 08:57:55 -07:00
Ralph Castain
657e701c65 Add debug verbosity to the orte data server and pmix pub/lookup functions
Start updating the various mappers to the new procedure. Remove the stale lama component as it is now very out-of-date. Bring round_robin and PPR online, and modify the mindist component (but cannot test/debug it).

Remove unneeded test

Fix memory corruption by re-initializing variable to NULL in loop

Resolve the race condition identified by @ggouaillardet by resetting the
mapped flag within the same event where it was set. There is no need to
retain the flag beyond that point as it isn't used again.

Add a new job attribute ORTE_JOB_FULLY_DESCRIBED to indicate that all the job information (including locations and binding) is included in the launch message. Thus, the backend daemons do not need to do any map computation for the job. Use this for the seq, rankfile, and mindist mappers until someone decides to update them.

Note that this will maintain functionality, but means that users of those three mappers will see large launch messages and less performant scaling than those using the other mappers.

Have the mindist module add procs to the job's proc array as it is a fully described module

Protect the hnp-not-in-allocation case

Per path suggested by Gilles - protect the HNP node when it gets added in the absence of any other allocation or hostfile

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-05-25 18:41:27 -07:00
Gilles Gouaillardet
22ab73cb1a Merge pull request #3471 from ggouaillardet/topic/execve_cmd
odls: fix handling of the orte fork agent
2017-05-15 15:07:39 +09:00
Ralph Castain
b527c40dae Remove debug
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-05-12 12:41:36 -07:00
Ralph Castain
23af6c9d02 Merge pull request #3519 from rhc54/topic/nolocal
Fix --nolocal
2017-05-12 09:57:52 -07:00
Ralph Castain
45bbd598c1 Fix --nolocal
Fix the --nolocal option by ensuring we always check/remove the HNP from the list of available nodes if the flag is set
Ensure that the HNP node is included as available when nothing else is given

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-05-12 09:03:26 -07:00
Ralph Castain
29e083bffd Fix total_slots_allocated computation
On unmanaged allocations, we need to update the total_slots_allocated once the daemons have been launched and "discovered" their topology

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-05-12 08:21:52 -07:00
Ralph Castain
9164afbb08 When a daemon force-terminates, we don't get the show_help message it was trying to send because the message is at a lower priority than the termination event. Resolve this by putting the oob in its own progress thread. Also, use only that one thread by default - if someone needs more progress threads in the OOB, they can use the MCA param to get them.
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-05-11 06:52:55 -07:00
Ralph Castain
f47124e4d3 Finally fix the problem - the key was knowing there were more than 2 topologies involved, and that the HNP is not allocated. Give up on being cute and just search the darned list of topologies - there won't be that many, and if there are (so the scan takes awhile), then too bad.
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-05-10 16:44:19 -07:00
Ralph Castain
55f4b825af Add verbose output to nidmap code for debugging as this is a new, and sometimes fragile, feature
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-05-10 12:40:02 -07:00
Ralph Castain
911961ee21 Sigh - remove debug
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-05-10 11:26:42 -07:00
Ralph Castain
2d93d15aa7 Merge pull request #3502 from rhc54/topic/cisco
Fix nidmap computation to deal with hetero nodes
2017-05-10 11:21:12 -07:00
Ralph Castain
50646b07ce Update the RML OFI by copying the updated files from @anandhis branch
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-05-10 09:17:06 -07:00