1
1
openmpi/orte/runtime
Ralph Castain ea35e47228 Fat SMPs (i.e., systems with nodes containing large numbers of cpus) were failing to start due to connection failures of the opal/pmix support. Root cause was that (a) we were setting the client socket to non-blocking before calling connect, and (b) the server was using the event library to harvest the accepts, and also did the handshake while in that event. So the server would backup beyond the connection backlog limit, and we would fail.
Changing the client to leave its socket as blocking during the connect doesn't solve the problem by itself - you also have to introduce a sleep delay once the backlog is hit to avoid simply machine-gunning your way thru retries. This gets somewhat difficult to adjust as you don't want to unnecessarily prolong startup time.

We've solved this before by adding a listening thread that simply reaps accepts and shoves them into the event library for subsequent processing. This would resolve the problem, but meant yet another daemon-level thread. So I centralized the listening thread support and let multiple elements register listeners on it. Thus, each daemon now has a single listening thread that reaps accepts from multiple sources - for now, the orte/pmix server and the oob/usock support are using it. I'll add in the oob/tcp component later.

This still didn't fully resolve the SMP problem, especially on coprocessor cards (e.g., KNC). Removing the shared memory dstore support helped further improve the behavior - it looks like there is some kind of memory paging issue there that needs further understanding. Given that the shared memory support was about to be lost when I bring over the PMIx integration (until it is restored in that library), it seemed like a reasonable thing to just remove it at this point.
2015-05-29 14:37:14 -07:00
..
data_type_support Better support automated tests for map, rank, and bind options 2015-04-30 14:01:13 -07:00
help-orte-runtime.txt As per the email discussion, revise the sparse handling of hostnames so that we avoid potential infinite loops while allowing large-scale users to improve their startup time: 2013-08-20 18:59:36 +00:00
Makefile.am configury: new OPAL_SET_LIB_PREFIX/ORTE_SET_LIB_PREFIX macros 2014-10-22 10:32:19 -07:00
orte_cr.c FT: fix compilation using --with-ft (1/5) 2015-03-11 14:23:33 +01:00
orte_cr.h FT: fix compilation using --with-ft (1/5) 2015-03-11 14:23:33 +01:00
orte_data_server.c As per the RFC, bring in the ORTE async progress code and the rewrite of OOB: 2013-08-22 16:37:40 +00:00
orte_data_server.h Merge the ORTE devel branch into the main trunk. Details of what this means will be circulated separately. 2008-02-28 01:57:57 +00:00
orte_finalize.c Fat SMPs (i.e., systems with nodes containing large numbers of cpus) were failing to start due to connection failures of the opal/pmix support. Root cause was that (a) we were setting the client socket to non-blocking before calling connect, and (b) the server was using the event library to harvest the accepts, and also did the handshake while in that event. So the server would backup beyond the connection backlog limit, and we would fail. 2015-05-29 14:37:14 -07:00
orte_globals.c initialize common symbols from orte 2015-05-08 10:11:58 +09:00
orte_globals.h Complete implementation of the schizo framework to support OMPI component 2015-01-27 09:29:42 -06:00
orte_info_support.c opal: fix multiple bugs in MCA and opal 2015-04-07 19:13:20 -06:00
orte_info_support.h Update OMPI frameworks to use the MCA framework system. 2013-03-27 21:17:31 +00:00
orte_init.c Fat SMPs (i.e., systems with nodes containing large numbers of cpus) were failing to start due to connection failures of the opal/pmix support. Root cause was that (a) we were setting the client socket to non-blocking before calling connect, and (b) the server was using the event library to harvest the accepts, and also did the handshake while in that event. So the server would backup beyond the connection backlog limit, and we would fail. 2015-05-29 14:37:14 -07:00
orte_locks.c initialize common symbols from orte 2015-05-08 10:11:58 +09:00
orte_locks.h Start reducing our dependency on the event library by removing at least one instance where we use it to redirect the program counter. Rolf reported occasional hangs of mpirun in very specific circumstances after all daemons were done. A review of MTT results indicates this may have been happening more generally in a small fraction of cases. 2010-07-17 21:03:27 +00:00
orte_mca_params.c Fix incorrect implementation of new MCA param mca_base_env_list - it was not picking up envars and forwarding them, but only worked if you explicitly set a value for the envar. Ensure it works for both direct and indirect launch modes. Remove stale code as this replaced orte_forward_envars. Ensure it doesn't get passed to the ORTE daemons. 2014-10-16 12:58:56 -07:00
orte_quit.c orte_quit: Remove logically dead code 2015-05-26 12:16:12 -06:00
orte_quit.h Sorry for mid-day commit, but I had promised on the call to do this upon my return. 2012-04-06 14:23:13 +00:00
orte_wait.c Remove useless variables. 2014-07-03 00:30:54 +00:00
orte_wait.h Revert r32222, r32210, and r32203 as they created a problem when daemon collectives did not involve app procs on every node. Instead, modify the ompi/mca/rte/orte/rte_orte.h to add a new function that allows apps to request new daemon collective ids for use in barrier and modex operations. This will only appear in ORTE-based installations, but it is only being used by a couple of researchers at the moment. 2014-07-15 03:48:00 +00:00
runtime_internals.h Modify the accounting system to recycle jobids. Properly recover resources from nodes and jobs upon completion. Adjustments in several places were required to deal with sparsely populated job, node, and proc arrays as a result of this change. 2009-03-03 16:39:13 +00:00
runtime.h As per the RFC, bring in the ORTE async progress code and the rewrite of OOB: 2013-08-22 16:37:40 +00:00