eaa98af52c
There was a race condition in opal_free_list_get. Code throughout the
Open MPI codebase was assuming that a NULL return from this function
was due to an out-of-memory condition. In some cases this can lead to
a fatal condition (MPI_Irecv and MPI_Isend in pml/ob1 for
example). Before this commit opal_free_list_get_mt looked like this:
```c
static inline opal_free_list_item_t *opal_free_list_get_mt (opal_free_list_t *flist)
{
    opal_free_list_item_t *item =
        (opal_free_list_item_t*) opal_lifo_pop_atomic (&flist->super);

    if (OPAL_UNLIKELY(NULL == item)) {
        opal_mutex_lock (&flist->fl_lock);
        opal_free_list_grow_st (flist, flist->fl_num_per_alloc);
        opal_mutex_unlock (&flist->fl_lock);
        item = (opal_free_list_item_t *) opal_lifo_pop_atomic (&flist->super);
    }

    return item;
}
```
The problem is that in a multithreaded environment it *is* possible for
the free list to be grown successfully but the thread calling
opal_free_list_get_mt to be left without an item. This happens if,
between the calls to opal_lifo_push_atomic in opal_free_list_grow_st
and the call to opal_lifo_pop_atomic, other threads pop all the items
added to the free list.
This commit fixes the issue by ensuring the thread that successfully
grew the free list **always** gets a free list item.
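As a rough illustration of the difference, here is a minimal single-threaded sketch. It is not the Open MPI code: every `toy_*` name is hypothetical, locking and atomics are left out, and the `after_grow` hook is an assumed stand-in for other threads that pop items in the window between the grow and the pop. The "fixed" variant shows one plausible way to guarantee an item to the growing thread, by having the grow step hold one new item back for its caller.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Toy stand-ins only; none of these names are Open MPI API. */
typedef struct toy_item { struct toy_item *next; } toy_item_t;

typedef struct toy_free_list {
    toy_item_t *head;        /* LIFO of free items */
    size_t num_per_alloc;    /* items added per grow */
    /* Hook run right after a grow. It simulates other threads popping
     * items before the growing thread reaches its own pop. */
    void (*after_grow)(struct toy_free_list *);
} toy_free_list_t;

static toy_item_t *toy_pop(toy_free_list_t *fl)
{
    toy_item_t *item = fl->head;
    if (NULL != item) {
        fl->head = item->next;
    }
    return item;
}

static void toy_push(toy_free_list_t *fl, toy_item_t *item)
{
    assert(NULL != item);
    item->next = fl->head;
    fl->head = item;
}

/* Simulated "other threads": take every item currently on the list. */
static void toy_drain(toy_free_list_t *fl)
{
    toy_item_t *item;
    while (NULL != (item = toy_pop(fl))) {
        free(item);
    }
}

/* Grow the list by up to n items. If item_out is non-NULL, the first new
 * item is handed to the caller instead of being pushed onto the list.
 * Returns 0 if at least one item was allocated, -1 otherwise. */
static int toy_grow(toy_free_list_t *fl, size_t n, toy_item_t **item_out)
{
    size_t allocated = 0;
    for (size_t i = 0; i < n; ++i) {
        toy_item_t *item = malloc(sizeof(*item));
        if (NULL == item) {
            break;
        }
        ++allocated;
        if (NULL != item_out && NULL == *item_out) {
            item->next = NULL;
            *item_out = item;      /* reserved for the growing thread */
        } else {
            toy_push(fl, item);
        }
    }
    if (NULL != fl->after_grow) {
        fl->after_grow(fl);
    }
    return allocated > 0 ? 0 : -1;
}

/* Old pattern: grow, then pop again. The second pop can come back empty. */
static toy_item_t *toy_get_racy(toy_free_list_t *fl)
{
    toy_item_t *item = toy_pop(fl);
    if (NULL == item) {
        toy_grow(fl, fl->num_per_alloc, NULL);
        item = toy_pop(fl);
    }
    return item;
}

/* Sketch of the fix: the grow step itself hands one item to the caller. */
static toy_item_t *toy_get_fixed(toy_free_list_t *fl)
{
    toy_item_t *item = toy_pop(fl);
    if (NULL == item) {
        toy_grow(fl, fl->num_per_alloc, &item);
    }
    return item;
}
```

With `after_grow` set to `toy_drain`, `toy_get_racy` returns NULL even though the grow succeeded, while `toy_get_fixed` still returns an item because its item never appears on the shared list.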
Fixes #2921
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
(cherry picked from commit
| File |
|---|
| Makefile.am |
| opal_bitmap.c |
| opal_bitmap.h |
| opal_fifo.c |
| opal_fifo.h |
| opal_free_list.c |
| opal_free_list.h |
| opal_graph.c |
| opal_graph.h |
| opal_hash_table.c |
| opal_hash_table.h |
| opal_hotel.c |
| opal_hotel.h |
| opal_interval_tree.c |
| opal_interval_tree.h |
| opal_lifo.c |
| opal_lifo.h |
| opal_list.c |
| opal_list.h |
| opal_object.c |
| opal_object.h |
| opal_pointer_array.c |
| opal_pointer_array.h |
| opal_rb_tree.c |
| opal_rb_tree.h |
| opal_ring_buffer.c |
| opal_ring_buffer.h |
| opal_tree.c |
| opal_tree.h |
| opal_value_array.c |
| opal_value_array.h |