Fix slow startup issue with the MX MTL. The problem is caused by mx_connect() being a one-sided operation at the API level, but not an interrupting call: it cannot complete while the target is outside the MX library. So if most of the processes exit mtl_mx_add_procs() and enter the stage gate 2 barrier, the remaining processes can only progress their mx_connect() calls when the targets enter the MX library. Because the event library is in EV_ONELOOP mode, this only happens once a second. The MX progress thread (hidden in the MX library) also only wakes up once a second, so mx_connect() calls can take a second to complete.

The temporary solution is to switch into EV_NONBLOCK mode earlier (right after the mx_connect() loop) so that there isn't a giant slowdown when some processes enter the stage gate 2 barrier before other processes. They will no longer block in the event library for any period of time, which appears to give a 50% speedup when running at > 64 procs.

Refs trac:645

This commit was SVN r12713.

The following Trac tickets were found above:
Ticket 645 --> https://svn.open-mpi.org/trac/ompi/ticket/645
Parent
3e11c76d4c
Commit
bfbc281e93
@@ -151,6 +151,16 @@ ompi_mtl_mx_add_procs(struct mca_mtl_base_module_t *mtl,
         mtl_peer_data[i] = (struct mca_mtl_base_endpoint_t*)
             mtl_mx_endpoint;
     }
 
+    /* because mx_connect isn't an interupting function, need to
+       progress MX as often as possible during the stage gate 2. This
+       would have happened after the stage gate anyway, so we're just
+       speeding things up a bit. */
+#if OMPI_ENABLE_PROGRESS_THREADS == 0
+    /* switch from letting us sit in the event library for a bit each
+       time through opal_progress() to completely non-blocking */
+    opal_progress_set_event_flag(OPAL_EVLOOP_NONBLOCK);
+#endif
+
     return OMPI_SUCCESS;
 }
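
To make the timing argument in the commit message concrete, here is a rough toy model of the effect. It is not Open MPI or MX code: every name in it (toy_evloop_mode, toy_progress_cost_ms, toy_connect_latency_ms) is hypothetical, and the 1000 ms block time is only an assumption taken from the "once a second" remark above. It assumes a pending mx_connect() completes after the target has passed through its progress function a given number of times, and that each pass either sits in the event library first (the EV_ONELOOP-style mode) or polls immediately (the EV_NONBLOCK-style mode).

```c
#include <assert.h>

/* Hypothetical stand-ins for the two event-loop modes discussed in the
   commit message; these are NOT the real OPAL constants. */
enum toy_evloop_mode { TOY_EVLOOP_ONELOOP, TOY_EVLOOP_NONBLOCK };

/* Simulated wall-clock cost, in milliseconds, of one pass through the
   target's progress function.  In the blocking mode the process first
   sits in the event library for block_ms before it polls the network
   library; in the non-blocking mode it polls right away. */
static long toy_progress_cost_ms(enum toy_evloop_mode mode,
                                 long block_ms, long poll_ms)
{
    return (mode == TOY_EVLOOP_ONELOOP) ? block_ms + poll_ms : poll_ms;
}

/* Simulated time until a pending connect is answered, assuming the
   target must enter the network library `entries` times for the
   handshake to finish. */
long toy_connect_latency_ms(enum toy_evloop_mode mode, long entries,
                            long block_ms, long poll_ms)
{
    long elapsed = 0;
    long i;

    assert(entries >= 0 && block_ms >= 0 && poll_ms >= 0);
    for (i = 0; i < entries; i++) {
        elapsed += toy_progress_cost_ms(mode, block_ms, poll_ms);
    }
    return elapsed;
}
```

Under these assumptions, toy_connect_latency_ms(TOY_EVLOOP_ONELOOP, 1, 1000, 1) yields 1001 ms, while the same call with TOY_EVLOOP_NONBLOCK yields 1 ms. The model ignores the MX progress thread and any real network latency, so it only illustrates why leaving the blocking mode earlier removes the per-connect stall; it does not predict the measured speedup.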