ea35e47228
Changing the client to leave its socket blocking during the connect doesn't solve the problem by itself - you also have to introduce a sleep delay once the backlog is hit to avoid simply machine-gunning your way through retries. That delay is difficult to tune, as you don't want to unnecessarily prolong startup time. We've solved this before by adding a listening thread that simply reaps accepts and shoves them into the event library for subsequent processing. That would resolve the problem, but would mean yet another daemon-level thread. So I centralized the listening thread support and let multiple elements register listeners on it. Thus, each daemon now has a single listening thread that reaps accepts from multiple sources - for now, the orte/pmix server and the oob/usock support are using it. I'll add the oob/tcp component later. This still didn't fully resolve the SMP problem, especially on coprocessor cards (e.g., KNC). Removing the shared memory dstore support further improved the behavior - it looks like there is some kind of memory paging issue there that needs further investigation. Given that the shared memory support would be lost anyway once I bring over the PMIx integration (until it is restored in that library), it seemed reasonable to just remove it at this point. |
||
---|---|---|
.. | ||
data_type_support | ||
help-orte-runtime.txt | ||
Makefile.am | ||
orte_cr.c | ||
orte_cr.h | ||
orte_data_server.c | ||
orte_data_server.h | ||
orte_finalize.c | ||
orte_globals.c | ||
orte_globals.h | ||
orte_info_support.c | ||
orte_info_support.h | ||
orte_init.c | ||
orte_locks.c | ||
orte_locks.h | ||
orte_mca_params.c | ||
orte_quit.c | ||
orte_quit.h | ||
orte_wait.c | ||
orte_wait.h | ||
runtime_internals.h | ||
runtime.h |