openmpi/opal
Nathan Hjelm eaa98af52c opal/free_list: fix race condition
There was a race condition in opal_free_list_get. Code throughout the
Open MPI codebase was assuming that a NULL return from this function
was due to an out-of-memory condition. In some cases this can lead to
a fatal condition (MPI_Irecv and MPI_Isend in pml/ob1 for
example). Before this commit opal_free_list_get_mt looked like this:

```c
static inline opal_free_list_item_t *opal_free_list_get_mt (opal_free_list_t *flist)
{
    opal_free_list_item_t *item =
        (opal_free_list_item_t*) opal_lifo_pop_atomic (&flist->super);

    if (OPAL_UNLIKELY(NULL == item)) {
        opal_mutex_lock (&flist->fl_lock);
        opal_free_list_grow_st (flist, flist->fl_num_per_alloc);
        opal_mutex_unlock (&flist->fl_lock);
        item = (opal_free_list_item_t *) opal_lifo_pop_atomic (&flist->super);
    }

    return item;
}
```

The problem is that in a multithreaded environment it *is* possible for
the free list to be grown successfully but for the thread calling
opal_free_list_get_mt to be left without an item. This happens if,
between the calls to opal_lifo_push_atomic in opal_free_list_grow_st
and the call to opal_lifo_pop_atomic, other threads pop all the items
added to the free list.
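
The interleaving described above can be replayed deterministically in a single thread. This is only an illustrative sketch: the `demo_*` types and functions are hypothetical stand-ins for the OPAL LIFO and free list items, not the real API, and the "threads" are just labelled steps executed in the problematic order.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for opal_free_list_item_t and opal_lifo_t. */
typedef struct demo_item { struct demo_item *next; } demo_item_t;
typedef struct { demo_item_t *head; } demo_lifo_t;

static void demo_push(demo_lifo_t *lifo, demo_item_t *item)
{
    item->next = lifo->head;
    lifo->head = item;
}

static demo_item_t *demo_pop(demo_lifo_t *lifo)
{
    demo_item_t *item = lifo->head;
    if (NULL != item) {
        lifo->head = item->next;
    }
    return item;
}

/* Replays the bad interleaving and returns what "thread A" ends up with. */
static demo_item_t *demo_replay_race(void)
{
    static demo_item_t storage[2];
    demo_lifo_t lifo = { NULL };

    /* Thread A: the first pop finds the list empty ... */
    demo_item_t *a = demo_pop(&lifo);
    assert(NULL == a);

    /* ... so thread A grows the list by two items (the grow succeeds). */
    demo_push(&lifo, &storage[0]);
    demo_push(&lifo, &storage[1]);

    /* Threads B and C: scheduled before A's second pop, they take both items. */
    (void) demo_pop(&lifo);
    (void) demo_pop(&lifo);

    /* Thread A: the second pop comes back empty even though the grow worked. */
    return demo_pop(&lifo);
}
```

In this replay `demo_replay_race()` returns NULL, which a caller would misread as an out-of-memory condition even though the allocation succeeded.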

This commit fixes the issue by ensuring the thread that successfully
grew the free list **always** gets a free list item.
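
One way to get this guarantee is to have the grow routine hand one of the newly allocated items straight back to its caller instead of pushing it onto the shared list. The sketch below shows that idea under stated assumptions: the `sk_*` names, the mutex-protected LIFO (in place of the atomic one), and the `item_out` parameter are all hypothetical, and the actual change in the commit may differ.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical simplified free list; not the real opal_free_list_t. */
typedef struct sk_item { struct sk_item *next; } sk_item_t;

typedef struct {
    sk_item_t *head;
    pthread_mutex_t lifo_lock;  /* stands in for the atomic LIFO operations */
    pthread_mutex_t grow_lock;  /* plays the role of fl_lock */
    size_t num_per_alloc;
} sk_free_list_t;

static void sk_push(sk_free_list_t *fl, sk_item_t *item)
{
    pthread_mutex_lock(&fl->lifo_lock);
    item->next = fl->head;
    fl->head = item;
    pthread_mutex_unlock(&fl->lifo_lock);
}

static sk_item_t *sk_pop(sk_free_list_t *fl)
{
    pthread_mutex_lock(&fl->lifo_lock);
    sk_item_t *item = fl->head;
    if (NULL != item) {
        fl->head = item->next;
    }
    pthread_mutex_unlock(&fl->lifo_lock);
    return item;
}

/* Grow the list by n items. If item_out is non-NULL, the first new item is
 * reserved for the caller and never becomes visible to other threads.
 * The caller must hold grow_lock. Blocks are never freed in this sketch. */
static int sk_grow(sk_free_list_t *fl, size_t n, sk_item_t **item_out)
{
    assert(n > 0);
    sk_item_t *block = calloc(n, sizeof *block);
    if (NULL == block) {
        return -1;
    }

    size_t first = 0;
    if (NULL != item_out) {
        *item_out = &block[0];
        first = 1;
    }
    for (size_t i = first; i < n; ++i) {
        sk_push(fl, &block[i]);
    }
    return 0;
}

static sk_item_t *sk_get(sk_free_list_t *fl)
{
    sk_item_t *item = sk_pop(fl);

    if (NULL == item) {
        pthread_mutex_lock(&fl->grow_lock);
        /* item stays NULL only if the allocation itself failed */
        (void) sk_grow(fl, fl->num_per_alloc, &item);
        pthread_mutex_unlock(&fl->grow_lock);
    }

    return item;
}
```

With this shape, a NULL return from `sk_get` really does mean the allocation failed, which matches the assumption callers were already making.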

Fixes #2921

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
(cherry picked from commit 5c770a7bec)
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2018-10-16 15:28:20 -06:00
| Name | Last commit | Date |
| --- | --- | --- |
| class | opal/free_list: fix race condition | 2018-10-16 15:28:20 -06:00 |
| datatype | opal/dataype: add additional interface to retrieve more details about | 2018-06-21 09:25:50 -05:00 |
| dss | Update ORTE to support PMIx v3 | 2018-03-02 02:00:31 -08:00 |
| etc | Correct the comment in the default MCA param template - we do not support a param called "component_path". The correct syntax is "mca_base_component_path" | 2018-01-05 08:46:44 -08:00 |
| include | Complete job control integration | 2018-08-20 16:08:54 -07:00 |
| mca | Merge pull request #5874 from rhc54/cmr40/config | 2018-10-10 10:47:26 -05:00 |
| memoryhooks | opal: rename opal_atomic_init to opal_atomic_lock_init | 2017-08-07 14:15:11 -06:00 |
| runtime | opal/progress: protect against multiple threads in event base | 2018-09-21 14:40:08 -05:00 |
| test/reachable | reachable: add tests | 2017-09-19 19:42:54 -07:00 |
| threads | opal/thread: Added keyword opal_thread_local for TLS. | 2018-06-14 13:25:04 -07:00 |
| tools | Revert "Update to sync with OMPI master and cleanup to build" | 2016-11-22 15:03:20 -08:00 |
| util | snprintf() length fix for info | 2018-09-21 14:47:11 -05:00 |
| win32 | opal: standardize on max hostname length | 2016-04-24 08:19:47 +02:00 |
| common_sym_whitelist.txt | opal: add code patcher framework | 2016-04-13 17:16:13 -06:00 |
| Makefile.am | opal: remove generated asm code | 2017-08-03 09:18:58 -06:00 |
| win_makefile | Purge whitespace from the repo | 2015-06-23 20:59:57 -07:00 |