Address a race condition in libevent select.

This is not really a fix for the race condition, because I could not
figure out how it happens, but it does address the problem the race
causes. If we do not remove a bad fd from the select set, we keep
getting the same error from select and stop making any progress on
the communication side. Thus, we forcefully disable all bad fds as soon
as select fails. We are then back on track, progress is ensured, and
everything seems to work as expected (no leftover events in the event
base).

Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
George Bosilca, 2019-02-21 19:52:52 -05:00, committed by Aurelien Bouteiller
Parent: f1ae036466
Commit: c39fb5758a
No key found matching this signature
GPG key ID: 08F60797C5941DB2

@@ -42,6 +42,7 @@
 #include <string.h>
 #include <unistd.h>
 #include <errno.h>
+#include <fcntl.h>
 
 #include "event-internal.h"
 #include "evsignal-internal.h"
@@ -166,12 +167,30 @@ select_dispatch(struct event_base *base, struct timeval *tv)
 	check_selectop(sop);
 
 	if (res == -1) {
-		if (errno != EINTR) {
-			event_warn("select");
-			return (-1);
+		if (errno == EINTR) {
+			return (0);
 		}
-
-		return (0);
+		/* There seems to be a very subtle race condition between the
+		 * event_del and the select, where the fd is still active on the
+		 * event_readset_in but no libevent structure make reference
+		 * to it so it. Thus, any call to progress will no nothing more
+		 * than print a warning and do nothing, leading to deadlocks.
+		 * If we force remove the problematic fd, we get the warning only
+		 * once, and things work as expected.
+		 */
+		event_warn("select");
+		for (j = 0; j < nfds; ++j) {
+			if (FD_ISSET(j, sop->event_readset_in) ||
+			    FD_ISSET(j, sop->event_writeset_in)) {
+				res = fcntl(j, F_GETFL);
+				if( res == -1 ) {
+					event_warn("bad file descriptor %d/%d\n", j, nfds);
+					FD_CLR(j, sop->event_readset_in);
+					FD_CLR(j, sop->event_writeset_in);
+				}
+			}
+		}
+		return (-1);
 	}
 
 	event_debug(("%s: select reports %d", __func__, res));
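The recovery strategy in this patch can be reproduced outside libevent. It works in three steps: on a non-EINTR select failure, probe every registered fd with fcntl(F_GETFL), then clear the ones that fail from the persistent "in" sets. The following is a minimal standalone sketch that assumes a POSIX system. prune_bad_fds and demo_stale_fd_recovery are hypothetical names, not libevent functions. The stale descriptor is created on purpose with dup/close rather than by the event_del/select race, whose root cause the commit leaves open.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/select.h>
#include <sys/time.h>
#include <unistd.h>

/* Hypothetical helper (not a libevent API): remove from both "in" sets
 * every descriptor below nfds that fcntl(F_GETFL) reports as invalid,
 * the same probe the patch uses. Returns the number of fds removed. */
static int prune_bad_fds(int nfds, fd_set *readset_in, fd_set *writeset_in)
{
    int removed = 0;
    for (int fd = 0; fd < nfds; ++fd) {
        if (!FD_ISSET(fd, readset_in) && !FD_ISSET(fd, writeset_in))
            continue;
        if (fcntl(fd, F_GETFL) == -1) {
            fprintf(stderr, "bad file descriptor %d/%d\n", fd, nfds);
            FD_CLR(fd, readset_in);
            FD_CLR(fd, writeset_in);
            ++removed;
        }
    }
    return removed;
}

/* Hypothetical demo: register a descriptor, close it while it is still
 * watched, and show that one pruning pass lets select() succeed again.
 * Returns select()'s final result (-1 if every attempt failed). */
static int demo_stale_fd_recovery(void)
{
    int p[2];
    int rc = pipe(p);
    assert(rc == 0);
    int stale = dup(p[0]);
    assert(stale != -1);

    fd_set read_in, write_in;
    FD_ZERO(&read_in);
    FD_ZERO(&write_in);
    FD_SET(p[0], &read_in);   /* empty pipe: never readable */
    FD_SET(p[1], &write_in);  /* empty pipe: always writable */
    FD_SET(stale, &read_in);
    int nfds = p[0];
    if (p[1] > nfds) nfds = p[1];
    if (stale > nfds) nfds = stale;
    nfds += 1;

    close(stale);             /* closed, but still in read_in */

    int res = -1;
    for (int attempt = 0; attempt < 3; ++attempt) {
        /* select() overwrites its arguments, so pass copies, as libevent
         * does with its *_in and *_out sets. */
        fd_set read_out = read_in, write_out = write_in;
        struct timeval tv = { 0, 0 };  /* poll without blocking */
        res = select(nfds, &read_out, &write_out, NULL, &tv);
        if (res != -1)
            break;                     /* progress: stop retrying */
        if (errno == EINTR)
            continue;
        perror("select");              /* EBADF on the first attempt */
        prune_bad_fds(nfds, &read_in, &write_in);
    }

    close(p[0]);
    close(p[1]);
    return res;
}
```

On a typical POSIX system, calling demo_stale_fd_recovery() should print one select error and one "bad file descriptor" line to stderr and then return 1, because after pruning only the pipe's write end is ready. If the prune_bad_fds call is removed, every attempt fails with EBADF, which is the stall the commit message describes. The patch itself returns -1 after pruning and lets the next dispatch call retry. The loop here stands in for those repeated progress calls.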