coll/ml: add support for blocking and non-blocking allreduce, reduce, and allgather.

The new collectives provide a significant performance increase over the tuned
component for small and medium messages. We are initially setting the priority
lower than tuned until this has had some time to soak in the trunk. Please set
coll_ml_priority to 90 for MTT runs.

Credit for this work goes to Manjunath Gorentla Venkata (ORNL), Pavel Shamis
(ORNL), and Nathan Hjelm (LANL).

Commit details (for reference):

Import ORNL's collectives for MPI_Allreduce, MPI_Reduce, and MPI_Allgather.

We need to take the basesmuma header into account when calculating the ptpcoll
small-message thresholds. Add a define to bcol.h indicating the maximum header
size so we can take the header into account while not making ptpcoll dependent
on information from basesmuma. This resolves an issue with allreduce where
ptpcoll overwrites the header of the next buffer in the basesmuma bank.

Fix reduce and make a sequential collective launcher in coll_ml_inlines.h. The
root calculation for reduce was wrong for any root != 0. There are four
possibilities for the root (see the sketch after this list):

 - The root is not the current process but is in the current hierarchy. In
   this case the root is the index of the global root as specified in the
   root vector.
 - The root is not the current process and is not in the next level of the
   hierarchy. In this case 0 must be the local root, since this process will
   never communicate with the real root.
 - The root is not the current process but will be in the next level of the
   hierarchy. In this case the current process must be the root.
 - I am the root. The root is my index.

Tested with IMB, which rotates the root on every call to MPI_Reduce. Consider
IMB the reproducer for the issue this commit solves.

Make the bcast algorithm decision an enumerated variable.

Resolve various assert failures when destructing coll ml requests. Two issues:

 - Always reset the request to be invalid before returning it to the free
   list. This will avoid an assert in ompi_request_t's destructor.
   OMPI_REQUEST_FINI does this (and also releases the fortran handle index).
 - Never explicitly construct or destruct the superclass of an opal object.
   This screws up the class function tables and will cause either an assert
   failure or a segmentation fault when destructing coll ml requests.

Clean up allgather. I removed the duplicate non-blocking and blocking
functions and modeled the cleanup after what I found in allreduce. Also
cleaned up the code somewhat.

Don't bother copying from the send to the receive buffer in
bcol_basesmuma_allreduce_intra_fanin_fanout if the pointers are the same. This
eliminates a warning about memcpy and aliasing and avoids an unnecessary call
to memcpy.

Always call CHECK_AND_RELEASE on memsync collectives. There was a call to
OBJ_RELEASE on the collective communicator, but because CHECK_AND_RECYCLE was
never called there was no matching call to OBJ_RELEASE. This caused coll ml to
leak communicators.

Make allreduce use the sequential collective launcher in coll_ml_inlines.h.

Just launch the next collective in the component progress. I am a little
unsure about this patch. There appears to be some sort of race between
collectives that causes buffer exhaustion in some cases (IMB Allreduce is a
reproducer). Changing progress to only launch the next bcol seems to resolve
the issue but might not be the best fix. Note that I see little to no
performance penalty for this change.

Fix allreduce when there are extra sources. There was an issue with the buffer
offset calculation when there are extra sources. In the case of extra
sources == 1, the offset was set to buffer_size (just past the header of the
next buffer). I adjusted the buffer size to take into account the maximum
header size (see the earlier commit that added this) and simplified the offset
calculation.

Make reduce/allreduce non-blocking. This is required for MPI_Comm_idup to work
correctly. This has been tested with various layouts using the IBM test suite
and IMB, and appears to have the same performance as the old blocking version.

Fix allgather for non-contiguous layouts and simplify parsing the topology.
Some things in this patch:

 - There were several comments to the effect that level 0 of the hierarchy
   MUST contain all of the ranks. At least one function made this assumption,
   but it was not true. I changed the sbgp components and the coll ml
   initialization code to enforce this requirement.
 - Ensure that hierarchy level 0 has the ranks in the correct scatter/gather
   order. This removes the need for a separate sort list and fixes the offset
   calculation for allgather.
 - There were several passes over the hierarchy to determine properties of
   the hierarchy. I eliminated these extra passes and the memory allocation
   associated with them and calculate the tree properties on the fly. The
   same DFS recursion also handles the re-order of level 0.

All these changes have been verified with MPI_Allreduce, MPI_Reduce, and
MPI_Allgather. All functions now pass all IBM/Open MPI and IMB tests.

coll/ml: correct pointer usage for MPI_BOTTOM. Since contiguous datatypes are
copied via memcpy (bypassing the convertor), we need to adjust for the lb of
the datatype. This corrects problems found testing code that uses MPI_BOTTOM
(NULL) as the send pointer.

Add fallback collectives for allreduce and reduce.

cmr=v1.7.5:reviewer=pasha

This commit was SVN r30363.
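For illustration, here is a minimal sketch of the four root cases listed
above. All names are hypothetical; the actual logic lives in coll/ml's reduce
setup and uses its own structures.

/* Hypothetical sketch of the root-index selection described above; the real
 * coll/ml code derives these flags from the hierarchy and root vector. */
static int local_root_index (int am_root, int my_index, int root_in_level,
                             int root_in_next_level, int global_root_index)
{
    if (am_root || root_in_next_level) {
        /* I am the root, or I will carry the data to the root at the next
         * level of the hierarchy: use my own index */
        return my_index;
    }
    if (root_in_level) {
        /* the global root is in this hierarchy: use its index as given in
         * the root vector */
        return global_root_index;
    }
    /* the real root is unreachable from this level: rank 0 acts as the
     * local root */
    return 0;
}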
This commit is contained in:
parent 7768828d2d
commit 1a021b8f2d
@@ -1,7 +1,10 @@
/* -*- Mode: C; c-basic-offset:4 ; indent-tabs-mode:nil -*- */
/*
 * Copyright (c) 2009-2012 Oak Ridge National Laboratory. All rights reserved.
 * Copyright (c) 2009-2012 Mellanox Technologies. All rights reserved.
 * Copyright (c) 2013      Cisco Systems, Inc. All rights reserved.
 * Copyright (c) 2013      Los Alamos National Security, LLC. All rights
 *                         reserved.
 * $COPYRIGHT$
 *
 * Additional copyrights may follow
@@ -125,7 +128,6 @@ static int mca_bcol_base_set_components_to_use(opal_list_t *bcol_components_avail
                                               opal_list_t *bcol_components_in_use)
{
    /* local variables */
    opal_list_item_t *b_item;
    const mca_base_component_t *b_component;

    mca_base_component_list_item_t *b_cli;
@@ -156,13 +158,10 @@ static int mca_bcol_base_set_components_to_use(opal_list_t *bcol_components_avail
    /* loop over list of components requested */
    for (i = 0; i < cnt; i++) {
        /* loop over discovered components */
        for (b_item = opal_list_get_first(bcol_components_avail);
             opal_list_get_end(bcol_components_avail) != b_item;
             b_item = opal_list_get_next(b_item)) {

            b_cli = (mca_base_component_list_item_t *) b_item;
        OPAL_LIST_FOREACH(b_cli, bcol_components_avail, mca_base_component_list_item_t) {
            b_component = b_cli->cli_component;

            b_component_name = b_component->mca_component_name;
            b_str_len = strlen(b_component_name);
@@ -1,6 +1,9 @@
/* -*- Mode: C; c-basic-offset:4 ; indent-tabs-mode:nil -*- */
/*
 * Copyright (c) 2009-2012 Oak Ridge National Laboratory. All rights reserved.
 * Copyright (c) 2009-2012 Mellanox Technologies. All rights reserved.
 * Copyright (c) 2013      Los Alamos National Security, LLC. All rights
 *                         reserved.
 * $COPYRIGHT$
 *
 * Additional copyrights may follow
@@ -20,13 +23,9 @@ int mca_bcol_base_init(bool enable_progress_threads, bool enable_mpi_threads)
{
    mca_bcol_base_component_t *bcol_component;
    mca_base_component_list_item_t *cli;
    opal_list_item_t *item;
    int ret;

    for (item = opal_list_get_first((opal_list_t *) &mca_bcol_base_components_in_use);
         opal_list_get_end((opal_list_t *) &mca_bcol_base_components_in_use) != item;
         item = opal_list_get_next(item)) {
        cli = (mca_base_component_list_item_t *) item;
    OPAL_LIST_FOREACH(cli, &mca_bcol_base_components_in_use, mca_base_component_list_item_t) {
        bcol_component = (mca_bcol_base_component_t *) cli->cli_component;

        if (false == bcol_component->init_done) {
@@ -18,14 +18,22 @@ sources = \
	bcol_basesmuma_module.c \
	bcol_basesmuma_buf_mgmt.c \
	bcol_basesmuma_mem_mgmt.c \
	bcol_basesmuma_fanin.c \
	bcol_basesmuma_fanout.c \
	bcol_basesmuma_progress.c \
	bcol_basesmuma_reduce.h \
	bcol_basesmuma_reduce.c \
	bcol_basesmuma_allreduce.c \
	bcol_basesmuma_setup.c \
	bcol_basesmuma_rk_barrier.c \
	bcol_basesmuma_rd_barrier.c \
	bcol_basesmuma_rd_nb_barrier.c \
	bcol_basesmuma_utils.c \
	bcol_basesmuma_bcast_prime.c \
	bcol_basesmuma_lmsg_knomial_bcast.c \
	bcol_basesmuma_lmsg_bcast.c \
	bcol_basesmuma_gather.c \
	bcol_basesmuma_allgather.c \
	bcol_basesmuma_smcm.h \
	bcol_basesmuma_smcm.c
The diff for some files is not shown because of its large size.

ompi/mca/bcol/basesmuma/bcol_basesmuma_allgather.c (new file, 538 lines)
@@ -0,0 +1,538 @@
/* -*- Mode: C; c-basic-offset:4 ; indent-tabs-mode:nil -*- */
/*
 * Copyright (c) 2009-2013 Oak Ridge National Laboratory. All rights reserved.
 * Copyright (c) 2009-2012 Mellanox Technologies. All rights reserved.
 * Copyright (c) 2013      Los Alamos National Security, LLC. All rights
 *                         reserved.
 * $COPYRIGHT$
 *
 * Additional copyrights may follow
 *
 * $HEADER$
 */

#include "ompi_config.h"

#include "ompi/include/ompi/constants.h"
#include "ompi/mca/coll/ml/coll_ml.h"
#include "ompi/mca/bcol/bcol.h"
#include "ompi/mca/bcol/basesmuma/bcol_basesmuma.h"
/*
#define IS_AGDATA_READY(peer, my_flag, my_sequence_number)\
    (((peer)->sequence_number == (my_sequence_number) && \
      (peer)->flags[ALLGATHER_FLAG][bcol_id] >= (my_flag) \
     )? true : false )
*/

#define CALC_ACTIVE_REQUESTS(active_requests,peers, tree_order) \
do{ \
    for( j = 0; j < (tree_order - 1); j++){ \
        if( 0 > peers[j] ) { \
            /* set the bit */ \
            *active_requests ^= (1<<j); \
        } \
    } \
}while(0)


/*
 * Recursive K-ing allgather
 */

/*
 * Recursive k-ing algorithm
 * Example k=3 n=9
 *
 * Number of exchange steps = log (base k) n
 * Number of steps in exchange step = k (radix)
 */
int bcol_basesmuma_k_nomial_allgather_init(bcol_function_args_t *input_args,
                                           struct coll_ml_function_t *const_args)
{
    /* local variables */
    /* XXX -- FIXME -- This is never set */
    int8_t flag_offset;
    volatile int8_t ready_flag;
    mca_bcol_basesmuma_module_t *bcol_module = (mca_bcol_basesmuma_module_t *) const_args->bcol_module;
    netpatterns_k_exchange_node_t *exchange_node = &bcol_module->knomial_allgather_tree;
    int group_size = bcol_module->colls_no_user_data.size_of_group;
    int *list_connected = bcol_module->super.list_n_connected; /* critical for hierarchical colls */
    int bcol_id = (int) bcol_module->super.bcol_id;
    mca_bcol_basesmuma_component_t *cm = &mca_bcol_basesmuma_component;
    uint32_t buffer_index = input_args->buffer_index;
    int *active_requests =
        &(bcol_module->ml_mem.nb_coll_desc[buffer_index].active_requests);

    int *iteration = &bcol_module->ml_mem.nb_coll_desc[buffer_index].iteration;
    int *status = &bcol_module->ml_mem.nb_coll_desc[buffer_index].status;
    int leading_dim, buff_idx, idx;

    int i, j, probe;
    int knt;
    int src;
    int recv_offset, recv_len;

    int pow_k, tree_order;
    int max_requests = 0; /* important to initialize this */

    int matched = 0;
    int64_t sequence_number=input_args->sequence_num;
    int my_rank = bcol_module->super.sbgp_partner_module->my_index;
    int buff_offset = bcol_module->super.hier_scather_offset;

    int pack_len = input_args->count * input_args->dtype->super.size;

    void *data_addr = (void*)(
        (unsigned char *) input_args->sbuf +
        (size_t) input_args->sbuf_offset);
    volatile mca_bcol_basesmuma_payload_t *data_buffs;
    volatile char *peer_data_pointer;

    /* control structures */
    volatile mca_bcol_basesmuma_header_t *my_ctl_pointer;
    volatile mca_bcol_basesmuma_header_t *peer_ctl_pointer;

#if 0
    fprintf(stderr,"entering p2p allgather pack_len %d\n",pack_len);
#endif
    /* initialize the iteration counter */
    buff_idx = input_args->src_desc->buffer_index;
    leading_dim = bcol_module->colls_no_user_data.size_of_group;
    idx=SM_ARRAY_INDEX(leading_dim,buff_idx,0);
    data_buffs=(volatile mca_bcol_basesmuma_payload_t *)
        bcol_module->colls_with_user_data.data_buffs+idx;

    /* Set pointer to current proc ctrl region */
    my_ctl_pointer = data_buffs[my_rank].ctl_struct;

    /* initialize headers and ready flag */
    BASESMUMA_HEADER_INIT(my_ctl_pointer, ready_flag, sequence_number, bcol_id);

    /* initialize these */
    *iteration = 0;
    *active_requests = 0;
    *status = 0;

    /* k-nomial parameters */
    tree_order = exchange_node->tree_order;
    pow_k = exchange_node->log_tree_order;

    /* calculate the maximum number of requests
     * at each level each rank communicates with
     * at most (k - 1) peers
     * so if we set k - 1 bit fields in "max_requests", then
     * we have max_request == 2^(k - 1) -1
     */
    for(i = 0; i < (tree_order - 1); i++){
        max_requests ^= (1<<i);
    }
    /* let's begin the collective, starting with extra ranks and their
     * respective proxies
     */

    if( EXTRA_NODE == exchange_node->node_type ) {

        /* then I will signal to my proxy rank */
        my_ctl_pointer->flags[ALLGATHER_FLAG][bcol_id] = ready_flag;
        ready_flag = flag_offset + 1 + pow_k + 2;
        /* now, poll for completion */
        src = exchange_node->rank_extra_sources_array[0];
        peer_data_pointer = data_buffs[src].payload;
        peer_ctl_pointer = data_buffs[src].ctl_struct;

        /* calculate the offset */
        knt = 0;
        for(i = 0; i < group_size; i++){
            knt += list_connected[i];
        }
        for( i = 0; i < cm->num_to_probe && (0 == matched); i++ ) {
            if(IS_PEER_READY(peer_ctl_pointer, ready_flag, sequence_number, ALLGATHER_FLAG, bcol_id)){
                matched = 1;
                /* we receive the entire message */
                memcpy((void *)((unsigned char *) data_addr + buff_offset),
                       (void *) ((unsigned char *) peer_data_pointer + buff_offset),
                       knt * pack_len);

                goto FINISHED;
            }
        }

        /* save state and bail */
        *iteration = -1;
        return BCOL_FN_STARTED;

    }else if ( 0 < exchange_node->n_extra_sources ) {

        /* I am a proxy for someone */
        src = exchange_node->rank_extra_sources_array[0];
        peer_data_pointer = data_buffs[src].payload;
        peer_ctl_pointer = data_buffs[src].ctl_struct;

        knt = 0;
        for(i = 0; i < src; i++){
            knt += list_connected[i];
        }

        /* probe for extra rank's arrival */
        for( i = 0; i < cm->num_to_probe && ( 0 == matched); i++) {
            if(IS_PEER_READY(peer_ctl_pointer,ready_flag, sequence_number, ALLGATHER_FLAG, bcol_id)){
                matched = 1;
                /* copy it in */
                memcpy((void *)((unsigned char *) data_addr + knt*pack_len),
                       (void *) ((unsigned char *) peer_data_pointer + knt*pack_len),
                       pack_len * list_connected[src]);
                goto MAIN_PHASE;
            }
        }
        *status = ready_flag;
        *iteration = -1;
        return BCOL_FN_STARTED;

    }

MAIN_PHASE:
    /* bump the ready flag */
    ready_flag++;

    /* we start the recursive k - ing phase */
    for( *iteration = 0; *iteration < pow_k; (*iteration)++) {
        /* announce my arrival */
        MB();
        my_ctl_pointer->flags[ALLGATHER_FLAG][bcol_id] = ready_flag;
        /* calculate the number of active requests */
        CALC_ACTIVE_REQUESTS(active_requests,exchange_node->rank_exchanges[*iteration],tree_order);
        /* Now post the recv's */
        for( j = 0; j < (tree_order - 1); j++ ) {

            /* recv phase */
            src = exchange_node->rank_exchanges[*iteration][j];

            if( src < 0 ) {
                /* then not a valid rank, continue */
                continue;
            }

            peer_data_pointer = data_buffs[src].payload;
            peer_ctl_pointer = data_buffs[src].ctl_struct;
            if( !(*active_requests&(1<<j))) {
                /* then the bit hasn't been set, thus this peer
                 * hasn't been processed at this level
                 */
                recv_offset = exchange_node->payload_info[*iteration][j].r_offset * pack_len;
                recv_len = exchange_node->payload_info[*iteration][j].r_len * pack_len;
                /* post the receive */
                /* I am putting the probe loop as the inner most loop to achieve
                 * better temporal locality
                 */
                matched = 0;
                for( probe = 0; probe < cm->num_to_probe && (0 == matched); probe++){
                    if(IS_PEER_READY(peer_ctl_pointer,ready_flag, sequence_number, ALLGATHER_FLAG, bcol_id)){
                        matched = 1;
                        /* set this request's bit */
                        *active_requests ^= (1<<j);
                        /* get the data */
                        memcpy((void *)((unsigned char *) data_addr + recv_offset),
                               (void *)((unsigned char *) peer_data_pointer + recv_offset),
                               recv_len);
                    }
                }
            }

        }
        if( max_requests == *active_requests ){
            /* bump the ready flag */
            ready_flag++;
            /* reset the active requests */
            *active_requests = 0;
        } else {
            /* save state and hop out
             * only the iteration needs to be tracked
             */
            *status = my_ctl_pointer->flags[ALLGATHER_FLAG][bcol_id];
            return BCOL_FN_STARTED;
        }
    }

    /* bump the flag one more time for the extra rank */
    ready_flag = flag_offset + 1 + pow_k + 2;

    /* finish off the last piece, send the data back to the extra */
    if( 0 < exchange_node->n_extra_sources ) {
        /* simply announce my arrival */
        MB();
        my_ctl_pointer->flags[ALLGATHER_FLAG][bcol_id] = ready_flag;
    }

FINISHED:
    /* bump this up */
    my_ctl_pointer->starting_flag_value[bcol_id]++;
    return BCOL_FN_COMPLETE;
}


/* allgather progress function */

int bcol_basesmuma_k_nomial_allgather_progress(bcol_function_args_t *input_args,
                                               struct coll_ml_function_t *const_args)
{
    /* local variables */
    int8_t flag_offset;
    uint32_t buffer_index = input_args->buffer_index;
    volatile int8_t ready_flag;
    mca_bcol_basesmuma_module_t *bcol_module = (mca_bcol_basesmuma_module_t *) const_args->bcol_module;
    netpatterns_k_exchange_node_t *exchange_node = &bcol_module->knomial_allgather_tree;
    int group_size = bcol_module->colls_no_user_data.size_of_group;
    int *list_connected = bcol_module->super.list_n_connected; /* critical for hierarchical colls */
    int bcol_id = (int) bcol_module->super.bcol_id;
    mca_bcol_basesmuma_component_t *cm = &mca_bcol_basesmuma_component;
    int *active_requests =
        &(bcol_module->ml_mem.nb_coll_desc[buffer_index].active_requests);

    int *iteration = &bcol_module->ml_mem.nb_coll_desc[buffer_index].iteration;
    int *iter = iteration; /* double alias */
    int *status = &bcol_module->ml_mem.nb_coll_desc[buffer_index].status;
    int leading_dim, idx, buff_idx;

    int i, j, probe;
    int knt;
    int src;
    int recv_offset, recv_len;
    int max_requests = 0; /* critical to set this */
    int pow_k, tree_order;

    int matched = 0;
    int64_t sequence_number=input_args->sequence_num;
    int my_rank = bcol_module->super.sbgp_partner_module->my_index;
    int buff_offset = bcol_module->super.hier_scather_offset;

    int pack_len = input_args->count * input_args->dtype->super.size;

    void *data_addr = (void*)(
        (unsigned char *) input_args->sbuf +
        (size_t) input_args->sbuf_offset);
    volatile mca_bcol_basesmuma_payload_t *data_buffs;
    volatile char *peer_data_pointer;

    /* control structures */
    volatile mca_bcol_basesmuma_header_t *my_ctl_pointer;
    volatile mca_bcol_basesmuma_header_t *peer_ctl_pointer;

#if 0
    fprintf(stderr,"%d: entering sm allgather progress active requests %d iter %d ready_flag %d\n",my_rank,
            *active_requests,*iter,*status);
#endif
    buff_idx = input_args->src_desc->buffer_index;
    leading_dim=bcol_module->colls_no_user_data.size_of_group;
    idx=SM_ARRAY_INDEX(leading_dim,buff_idx,0);
    data_buffs=(volatile mca_bcol_basesmuma_payload_t *)
        bcol_module->colls_with_user_data.data_buffs+idx;

    /* Set pointer to current proc ctrl region */
    my_ctl_pointer = data_buffs[my_rank].ctl_struct;

    /* increment the starting flag by one and return */
    /* flag offset seems unnecessary here */
    flag_offset = my_ctl_pointer->starting_flag_value[bcol_id];
    ready_flag = *status;
    my_ctl_pointer->sequence_number = sequence_number;
    /* k-nomial parameters */
    tree_order = exchange_node->tree_order;
    pow_k = exchange_node->log_tree_order;

    /* calculate the maximum number of requests
     * at each level each rank communicates with
     * at most (k - 1) peers
     * so if we set k - 1 bit fields in "max_requests", then
     * we have max_request == 2^(k - 1) -1
     */
    for(i = 0; i < (tree_order - 1); i++){
        max_requests ^= (1<<i);
    }

    /* let's begin the collective, starting with extra ranks and their
     * respective proxies
     */

    if( EXTRA_NODE == exchange_node->node_type ) {

        /* If I'm in here, then I must be looking for data */
        ready_flag = flag_offset + 1 + pow_k + 2;

        src = exchange_node->rank_extra_sources_array[0];
        peer_data_pointer = data_buffs[src].payload;
        peer_ctl_pointer = data_buffs[src].ctl_struct;

        /* calculate the offset */
        knt = 0;
        for(i = 0; i < group_size; i++){
            knt += list_connected[i];
        }
        for( i = 0; i < cm->num_to_probe && (0 == matched); i++ ) {
            if(IS_PEER_READY(peer_ctl_pointer, ready_flag, sequence_number, ALLGATHER_FLAG, bcol_id)){
                matched = 1;
                /* we receive the entire message */
                memcpy((void *)((unsigned char *) data_addr + buff_offset),
                       (void *) ((unsigned char *) peer_data_pointer + buff_offset),
                       knt * pack_len);

                goto FINISHED;
            }
        }

        /* haven't found it, state is saved, bail out */
        return BCOL_FN_STARTED;

    }else if ( ( -1 == *iteration ) && (0 < exchange_node->n_extra_sources) ) {

        /* I am a proxy for someone */
        src = exchange_node->rank_extra_sources_array[0];
        peer_data_pointer = data_buffs[src].payload;
        peer_ctl_pointer = data_buffs[src].ctl_struct;

        knt = 0;
        for(i = 0; i < src; i++){
            knt += list_connected[i];
        }

        /* probe for extra rank's arrival */
        for( i = 0; i < cm->num_to_probe && ( 0 == matched); i++) {
            if(IS_PEER_READY(peer_ctl_pointer,ready_flag, sequence_number, ALLGATHER_FLAG, bcol_id)){
                matched = 1;
                /* copy it in */
                memcpy((void *)((unsigned char *) data_addr + knt*pack_len),
                       (void *) ((unsigned char *) peer_data_pointer + knt*pack_len),
                       pack_len * list_connected[src]);

                ready_flag++;
                *iteration = 0;
                goto MAIN_PHASE;
            }
        }
        return BCOL_FN_STARTED;

    }

MAIN_PHASE:

    /* start the recursive k - ing phase */
    for( *iter=*iteration; *iter < pow_k; (*iter)++) {
        /* I am ready at this level */
        MB();
        my_ctl_pointer->flags[ALLGATHER_FLAG][bcol_id] = ready_flag;
        if( 0 == *active_requests ) {
            /* flip some bits, if we don't have active requests from a previous visit */
            CALC_ACTIVE_REQUESTS(active_requests,exchange_node->rank_exchanges[*iter],tree_order);
        }
        for( j = 0; j < (tree_order - 1); j++ ) {

            /* recv phase */
            src = exchange_node->rank_exchanges[*iter][j];

            if( src < 0 ) {
                /* then not a valid rank, continue */
                continue;
            }

            peer_data_pointer = data_buffs[src].payload;
            peer_ctl_pointer = data_buffs[src].ctl_struct;
            if( !(*active_requests&(1<<j))){

                /* then this peer hasn't been processed at this level */
                recv_offset = exchange_node->payload_info[*iter][j].r_offset * pack_len;
                recv_len = exchange_node->payload_info[*iter][j].r_len * pack_len;
                /* I am putting the probe loop as the inner most loop to achieve
                 * better temporal locality
                 */
                matched = 0;
                for( probe = 0; probe < cm->num_to_probe && (0 == matched); probe++){
                    if(IS_PEER_READY(peer_ctl_pointer,ready_flag, sequence_number, ALLGATHER_FLAG, bcol_id)){
                        matched = 1;
                        /* flip the request's bit */
                        *active_requests ^= (1<<j);
                        /* copy the data */
                        memcpy((void *)((unsigned char *) data_addr + recv_offset),
                               (void *)((unsigned char *) peer_data_pointer + recv_offset),
                               recv_len);
                    }
                }
            }

        }
        if( max_requests == *active_requests ){
            /* bump the ready flag */
            ready_flag++;
            /* reset the active requests for the next level */
            *active_requests = 0;
            /* calculate the number of active requests
             * logically makes sense to do it here. We don't
             * want to inadvertently flip a bit to zero that we
             * set previously
             */
        } else {
            /* state is saved, hop out */
            *status = my_ctl_pointer->flags[ALLGATHER_FLAG][bcol_id];
            return BCOL_FN_STARTED;
        }
    }
    /* bump the flag one more time for the extra rank */
    ready_flag = flag_offset + 1 + pow_k + 2;

    /* finish off the last piece, send the data back to the extra */
    if( 0 < exchange_node->n_extra_sources ) {
        /* simply announce my arrival */
        MB();
        my_ctl_pointer->flags[ALLGATHER_FLAG][bcol_id] = ready_flag;
    }

FINISHED:
    /* bump this up for others to see */
    my_ctl_pointer->starting_flag_value[bcol_id]++;
    return BCOL_FN_COMPLETE;
}

/* Register allgather functions to the BCOL function table,
 * so they can be selected
 */
int bcol_basesmuma_allgather_init(mca_bcol_base_module_t *super)
{
    mca_bcol_base_coll_fn_comm_attributes_t comm_attribs;
    mca_bcol_base_coll_fn_invoke_attributes_t inv_attribs;

    comm_attribs.bcoll_type = BCOL_ALLGATHER;
    comm_attribs.comm_size_min = 0;
    comm_attribs.comm_size_max = 1024 * 1024;
    comm_attribs.waiting_semantics = NON_BLOCKING;

    inv_attribs.bcol_msg_min = 0;
    inv_attribs.bcol_msg_max = 20000; /* range 1 */

    inv_attribs.datatype_bitmap = 0xffffffff;
    inv_attribs.op_types_bitmap = 0xffffffff;

    comm_attribs.data_src = DATA_SRC_KNOWN;

    mca_bcol_base_set_attributes(super, &comm_attribs, &inv_attribs,
                                 bcol_basesmuma_k_nomial_allgather_init,
                                 bcol_basesmuma_k_nomial_allgather_progress);

    return OMPI_SUCCESS;
}
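For reference, here is a small standalone model of the active-request bitmask
bookkeeping used in the allgather above: with radix k, each level has at most
k - 1 partners, so a completed level corresponds to the mask 2^(k-1) - 1, the
same value the XOR loops above accumulate. The names are illustrative only.

#include <stdio.h>

/* Standalone model of the active-request bitmask: bit j is set once partner
 * j at the current level has been matched; the level is complete when all
 * k - 1 bits are set. */
int main (void)
{
    int tree_order = 3;                              /* radix k */
    int max_requests = (1 << (tree_order - 1)) - 1;  /* 2^(k-1)-1, as built by the XOR loop */
    int active_requests = 0;

    for (int j = 0; j < tree_order - 1; j++) {
        active_requests ^= (1 << j);                 /* partner j matched */
        printf("after partner %d: mask 0x%x complete=%s\n", j, active_requests,
               max_requests == active_requests ? "yes" : "no");
    }
    return 0;
}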
ompi/mca/bcol/basesmuma/bcol_basesmuma_allreduce.c (new file, 602 lines)
@@ -0,0 +1,602 @@
/* -*- Mode: C; c-basic-offset:4 ; indent-tabs-mode:nil -*- */
/*
 * Copyright (c) 2009-2013 Oak Ridge National Laboratory. All rights reserved.
 * Copyright (c) 2009-2012 Mellanox Technologies. All rights reserved.
 * Copyright (c) 2013      Los Alamos National Security, LLC. All rights
 *                         reserved.
 * $COPYRIGHT$
 *
 * Additional copyrights may follow
 *
 * $HEADER$
 */

#include "ompi_config.h"

#include "ompi/constants.h"
#include "ompi/op/op.h"
#include "ompi/datatype/ompi_datatype.h"
#include "ompi/communicator/communicator.h"

#include "opal/include/opal_stdint.h"

#include "bcol_basesmuma.h"

static int bcol_basesmuma_allreduce_intra_fanin_fanout_progress (bcol_function_args_t *input_args, coll_ml_function_t *c_input_args);

int bcol_basesmuma_allreduce_init(mca_bcol_base_module_t *super)
{
    mca_bcol_base_coll_fn_comm_attributes_t comm_attribs;
    mca_bcol_base_coll_fn_invoke_attributes_t inv_attribs;

    comm_attribs.bcoll_type = BCOL_ALLREDUCE;
    comm_attribs.comm_size_min = 0;
    comm_attribs.comm_size_max = 16;
    comm_attribs.data_src = DATA_SRC_KNOWN;

    /* selection logic at the ml level specifies a
     * request for a non-blocking algorithm
     * however, these algorithms are blocking
     * following what was done at the p2p level
     * we will specify non-blocking, but beware,
     * these algorithms are blocking and will not make use
     * of the progress engine
     */
    comm_attribs.waiting_semantics = NON_BLOCKING;

    inv_attribs.bcol_msg_min = 0;
    inv_attribs.bcol_msg_max = 20000;
    inv_attribs.datatype_bitmap = 0xffffffff;
    inv_attribs.op_types_bitmap = 0xffffffff;

    /* Set attributes for fanin fanout algorithm */
    mca_bcol_base_set_attributes(super, &comm_attribs, &inv_attribs,
                                 bcol_basesmuma_allreduce_intra_fanin_fanout,
                                 bcol_basesmuma_allreduce_intra_fanin_fanout_progress);
    /* Differs only in comm size */

    comm_attribs.data_src = DATA_SRC_UNKNOWN;
    comm_attribs.waiting_semantics = BLOCKING;

    comm_attribs.comm_size_min = 0;
    comm_attribs.comm_size_max = 8;

    /* Set attributes for recursive doubling algorithm */
    mca_bcol_base_set_attributes(super, &comm_attribs, &inv_attribs,
                                 bcol_basesmuma_allreduce_intra_recursive_doubling,
                                 NULL);

    return OMPI_SUCCESS;
}

/*
 * Small data fanin reduce
 * ML buffers are used for both payload and control structures
 * This function works with hierarchical allreduce and the
 * progress engine
 */
static inline int reduce_children (mca_bcol_basesmuma_module_t *bcol_module, volatile void *rbuf, netpatterns_tree_node_t *my_reduction_node,
                                   int *iteration, volatile mca_bcol_basesmuma_header_t *my_ctl_pointer, ompi_datatype_t *dtype,
                                   volatile mca_bcol_basesmuma_payload_t *data_buffs, int count, struct ompi_op_t *op, int process_shift)
{
    volatile mca_bcol_basesmuma_header_t *child_ctl_pointer;
    int bcol_id = (int) bcol_module->super.bcol_id;
    int64_t sequence_number = my_ctl_pointer->sequence_number;
    int8_t ready_flag = my_ctl_pointer->ready_flag;
    int group_size = bcol_module->colls_no_user_data.size_of_group;

    if (LEAF_NODE != my_reduction_node->my_node_type) {
        volatile char *child_data_pointer;
        volatile void *child_rbuf;

        /* for each child */
        /* my_result_data = child_result_data (op) my_source_data */

        for (int child = *iteration ; child < my_reduction_node->n_children ; ++child) {
            int child_rank = my_reduction_node->children_ranks[child] + process_shift;

            if (group_size <= child_rank){
                child_rank -= group_size;
            }

            child_ctl_pointer = data_buffs[child_rank].ctl_struct;

            if (!IS_PEER_READY(child_ctl_pointer, ready_flag, sequence_number, ALLREDUCE_FLAG, bcol_id)) {
                *iteration = child;
                return BCOL_FN_STARTED;
            }

            child_data_pointer = data_buffs[child_rank].payload;
            child_rbuf = child_data_pointer + child_ctl_pointer->roffsets[bcol_id];

            ompi_op_reduce(op, (void *)child_rbuf, (void *)rbuf, count, dtype);
        } /* end child loop */
    }

    if (ROOT_NODE != my_reduction_node->my_node_type) {
        opal_atomic_wmb ();
        my_ctl_pointer->flags[ALLREDUCE_FLAG][bcol_id] = ready_flag;
    }

    /* done with this step. move on to fan out */
    *iteration = -1;

    return BCOL_FN_COMPLETE;
}

static int allreduce_fanout (mca_bcol_basesmuma_module_t *bcol_module, volatile mca_bcol_basesmuma_header_t *my_ctl_pointer,
                             volatile void *my_data_pointer, int process_shift, volatile mca_bcol_basesmuma_payload_t *data_buffs,
                             int sequence_number, int group_size, int rbuf_offset, size_t pack_len)
{
    volatile mca_bcol_basesmuma_header_t *parent_ctl_pointer;
    int bcol_id = (int) bcol_module->super.bcol_id;
    int8_t ready_flag = my_ctl_pointer->ready_flag + 1;
    netpatterns_tree_node_t *my_fanout_read_tree;
    volatile void *parent_data_pointer;
    int my_fanout_parent, my_rank;
    void *parent_rbuf, *rbuf;

    my_rank = bcol_module->super.sbgp_partner_module->my_index;
    my_fanout_read_tree = &(bcol_module->fanout_read_tree[my_rank]);

    if (ROOT_NODE != my_fanout_read_tree->my_node_type) {
        my_fanout_parent = my_fanout_read_tree->parent_rank + process_shift;
        if (group_size <= my_fanout_parent) {
            my_fanout_parent -= group_size;
        }

        rbuf = (void *)((char *) my_data_pointer + rbuf_offset);

        /*
         * Get parent payload data and control data.
         * Get the pointer to the base address of the parent's payload buffer.
         * Get the parent's control buffer.
         */
        parent_data_pointer = data_buffs[my_fanout_parent].payload;
        parent_ctl_pointer = data_buffs[my_fanout_parent].ctl_struct;

        parent_rbuf = (void *) ((char *) parent_data_pointer + rbuf_offset);

        /* Wait until parent signals that data is ready */
        /* The order of conditions checked in this loop is important, as it can
         * result in a race condition.
         */
        if (!IS_PEER_READY(parent_ctl_pointer, ready_flag, sequence_number, ALLREDUCE_FLAG, bcol_id)) {
            return BCOL_FN_STARTED;
        }

        assert (parent_ctl_pointer->flags[ALLREDUCE_FLAG][bcol_id] == ready_flag);

        /* Copy the data into a shared buffer writable by the current rank */
        memcpy ((void *) rbuf, (const void*) parent_rbuf, pack_len);
    }

    if (LEAF_NODE != my_fanout_read_tree->my_node_type) {
        opal_atomic_wmb ();

        /* Signal to children that they may read the data from my shared buffer (bump the ready flag) */
        my_ctl_pointer->flags[ALLREDUCE_FLAG][bcol_id] = ready_flag;
    }

    my_ctl_pointer->starting_flag_value[bcol_id] += 1;

    return BCOL_FN_COMPLETE;
}

static int bcol_basesmuma_allreduce_intra_fanin_fanout_progress (bcol_function_args_t *input_args, coll_ml_function_t *c_input_args)
{
    mca_bcol_basesmuma_module_t *bcol_module = (mca_bcol_basesmuma_module_t *) c_input_args->bcol_module;
    int buff_idx = input_args->src_desc->buffer_index;
    int *iteration = &bcol_module->ml_mem.nb_coll_desc[buff_idx].iteration;
    void *data_addr = (void *) input_args->src_desc->data_addr;
    int my_node_index, my_rank, group_size, leading_dim, idx;
    volatile mca_bcol_basesmuma_header_t *my_ctl_pointer;
    int64_t sequence_number = input_args->sequence_num;
    volatile mca_bcol_basesmuma_payload_t *data_buffs;
    struct ompi_datatype_t *dtype = input_args->dtype;
    netpatterns_tree_node_t *my_reduction_node;
    struct ompi_op_t *op = input_args->op;
    volatile void *my_data_pointer;
    int count = input_args->count;
    int rc, process_shift;
    ptrdiff_t lb, extent;
    volatile void *rbuf;

    /* get addressing information */
    my_rank = bcol_module->super.sbgp_partner_module->my_index;
    group_size = bcol_module->colls_no_user_data.size_of_group;
    leading_dim = bcol_module->colls_no_user_data.size_of_group;
    idx = SM_ARRAY_INDEX(leading_dim,buff_idx,0);

    /* Align node index to around sbgp root */
    process_shift = input_args->root;
    my_node_index = my_rank - input_args->root;
    if (0 > my_node_index ) {
        my_node_index += group_size;
    }

    data_buffs = (volatile mca_bcol_basesmuma_payload_t *) bcol_module->colls_with_user_data.data_buffs + idx;
    /* Get control structure and payload buffer */
    my_ctl_pointer = data_buffs[my_rank].ctl_struct;
    my_data_pointer = (volatile char *) data_addr;
    rbuf = (volatile void *)((char *) my_data_pointer + input_args->rbuf_offset);

    /***************************
     * Fan into root phase
     ***************************/

    my_reduction_node = &(bcol_module->reduction_tree[my_node_index]);
    if (-1 != *iteration) {
        rc = reduce_children (bcol_module, rbuf, my_reduction_node, iteration, my_ctl_pointer,
                              dtype, data_buffs, count, op, process_shift);
        if (BCOL_FN_COMPLETE != rc) {
            return rc;
        }
    }

    /* there might be non-contig dtype - so compute the length with get_extent */
    ompi_datatype_get_extent(dtype, &lb, &extent);

    /***************************
     * Fan out from root
     ***************************/

    /* all nodes will have the result after fanout */
    input_args->result_in_rbuf = true;

    /* Signal that you are ready for fanout phase */
    return allreduce_fanout (bcol_module, my_ctl_pointer, my_data_pointer, process_shift, data_buffs,
                             sequence_number, group_size, input_args->rbuf_offset, count * (size_t) extent);
}

/**
 * Shared memory blocking allreduce.
 */
int bcol_basesmuma_allreduce_intra_fanin_fanout(bcol_function_args_t *input_args, coll_ml_function_t *c_input_args)
{
    /* local variables */
    mca_bcol_basesmuma_module_t *bcol_module = (mca_bcol_basesmuma_module_t *) c_input_args->bcol_module;
    int buff_idx = input_args->src_desc->buffer_index;
    int *iteration = &bcol_module->ml_mem.nb_coll_desc[buff_idx].iteration;
    void *data_addr = (void *) input_args->src_desc->data_addr;
    volatile mca_bcol_basesmuma_header_t *my_ctl_pointer;
    volatile mca_bcol_basesmuma_payload_t *data_buffs;
    struct ompi_datatype_t *dtype = input_args->dtype;
    int bcol_id = (int) bcol_module->super.bcol_id;
    int rc, my_rank, leading_dim, idx;
    volatile void *my_data_pointer;
    volatile void *sbuf, *rbuf;
    int8_t ready_flag;

    /* get addressing information */
    my_rank = bcol_module->super.sbgp_partner_module->my_index;
    leading_dim = bcol_module->colls_no_user_data.size_of_group;
    idx = SM_ARRAY_INDEX(leading_dim, buff_idx, 0);

    data_buffs = (volatile mca_bcol_basesmuma_payload_t *) bcol_module->colls_with_user_data.data_buffs + idx;
    /* Get control structure */
    my_ctl_pointer = data_buffs[my_rank].ctl_struct;

    my_data_pointer = (volatile char *) data_addr;
    rbuf = (volatile void *)((char *) my_data_pointer + input_args->rbuf_offset);
    sbuf = (volatile void *)((char *) my_data_pointer + input_args->sbuf_offset);

    /* Setup resource recycling */
    /* Set for multiple instances of bcols */
    BASESMUMA_HEADER_INIT(my_ctl_pointer, ready_flag, input_args->sequence_num, bcol_id);

    if (sbuf != rbuf) {
        rc = ompi_datatype_copy_content_same_ddt (dtype, input_args->count, (char *)rbuf,
                                                  (char *)sbuf);
        if( 0 != rc ) {
            return OMPI_ERROR;
        }
    }

    *iteration = 0;
    my_ctl_pointer->ready_flag = ready_flag;

    return bcol_basesmuma_allreduce_intra_fanin_fanout_progress (input_args, c_input_args);
}



/* this thing uses the old bcol private control structures */
int bcol_basesmuma_allreduce_intra_recursive_doubling(bcol_function_args_t *input_args,
                                                      coll_ml_function_t *c_input_args)
{
    int my_rank,group_size,my_node_index;
    int pair_rank, exchange, extra_rank, payload_len;
    size_t dt_size;
    int read_offset, write_offset;
    volatile void *my_data_pointer;
    volatile mca_bcol_basesmuma_ctl_struct_t *my_ctl_pointer = NULL,
        *partner_ctl_pointer = NULL,
        *extra_ctl_pointer = NULL;
    volatile void *my_read_pointer, *my_write_pointer, *partner_read_pointer,
        *extra_rank_readwrite_data_pointer,*extra_rank_read_data_pointer;
    mca_bcol_basesmuma_module_t* bcol_module =
        (mca_bcol_basesmuma_module_t *)c_input_args->bcol_module;

    int8_t ready_flag;
    int sbuf_offset,rbuf_offset,flag_offset;
    int root,count;
    struct ompi_op_t *op;
    int64_t sequence_number=input_args->sequence_num;
    struct ompi_datatype_t *dtype;
    int first_instance;
    int leading_dim,idx;
    int buff_idx;
    mca_bcol_basesmuma_ctl_struct_t **ctl_structs;
    /*volatile void **data_buffs;*/
    volatile mca_bcol_basesmuma_payload_t *data_buffs;
    netpatterns_pair_exchange_node_t *my_exchange_node;

    /*
     * Get addressing information
     */
    buff_idx = input_args->src_desc->buffer_index;

    my_rank = bcol_module->super.sbgp_partner_module->my_index;
    group_size = bcol_module->colls_no_user_data.size_of_group;
    leading_dim = bcol_module->colls_no_user_data.size_of_group;
    idx = SM_ARRAY_INDEX(leading_dim,buff_idx,0);

    /*
     * Get SM control structures and payload buffers
     */
    ctl_structs = (mca_bcol_basesmuma_ctl_struct_t **)
        bcol_module->colls_with_user_data.ctl_buffs+idx;
    /*data_buffs = (volatile void **)
      bcol_module->colls_with_user_data.data_buffs+idx;*/

    data_buffs = (volatile mca_bcol_basesmuma_payload_t *)
        bcol_module->colls_with_user_data.data_buffs + idx;

    /*
     * Get control structure and payload buffer
     */
    my_ctl_pointer = ctl_structs[my_rank];
    if (my_ctl_pointer->sequence_number < sequence_number) {
        first_instance=1;
    }
    my_data_pointer = data_buffs[my_rank].payload;

    /*
     * Align node index to around sbgp root
     */
    root = input_args->root;
    my_node_index = my_rank - root;
    if (0 > my_node_index) {
        my_node_index += group_size;
    }

    /*
     * Get data from arguments
     */
    sbuf_offset = input_args->sbuf_offset;
    rbuf_offset = input_args->rbuf_offset;
    op = input_args->op;
    count = input_args->count;
    dtype = input_args->dtype;

    /*
     * Get my node for the reduction tree
     */
    my_exchange_node = &(bcol_module->recursive_doubling_tree);

    if (first_instance) {
        my_ctl_pointer->index = 1;
        my_ctl_pointer->starting_flag_value = 0;
        flag_offset = 0;
        my_ctl_pointer->flag = -1;
        /*
          for( i = 0; i < NUM_SIGNAL_FLAGS; i++){
              my_ctl_pointer->flags[ALLREDUCE_FLAG] = -1;
          }
        */
    } else {
        my_ctl_pointer->index++;
        flag_offset = my_ctl_pointer->starting_flag_value;
    }

    /* signal that I have arrived */
    /* MB(); */
    my_ctl_pointer->sequence_number = sequence_number;

    /* If we use this buffer more than once by an sm module in
     * a given collective, will need to distinguish between instances, so
     * we pick up the right data.
     */
    ready_flag = flag_offset + sequence_number + 1;

    /*
     * Set up pointers for use during recursive doubling phase
     */
    read_offset = sbuf_offset;
    write_offset = rbuf_offset;
    fprintf(stderr,"read offset %d write offset %d\n",read_offset,write_offset);
    my_read_pointer = (volatile void *)((char *) my_data_pointer + read_offset);
    my_write_pointer = (volatile void *)((char *) my_data_pointer + write_offset);

    /*
     * When there are non-power-of-2 nodes, the extra nodes' data is copied and
     * reduced by partner exchange nodes.
     * Extra nodes: Nodes with rank greater than the nearest power of 2
     * Exchange nodes: Nodes with rank less than the nearest power of 2 that
     * partner with extra nodes during reduction
     */

    if (0 < my_exchange_node->n_extra_sources) {
        /*
         * Signal extra node that data is ready
         */
        opal_atomic_wmb ();

        my_ctl_pointer->flag = ready_flag;

        if (EXCHANGE_NODE == my_exchange_node->node_type) {
            extra_rank = my_exchange_node->rank_extra_source;
            extra_ctl_pointer = ctl_structs[extra_rank];
            extra_rank_readwrite_data_pointer = (void *) ((char *) data_buffs[extra_rank].payload +
                                                          read_offset);

            /*
             * Wait for data to get ready
             */
            while (!((sequence_number == extra_ctl_pointer->sequence_number) &&
                     (extra_ctl_pointer->flag >= ready_flag))){
            }

            ompi_op_reduce(op,(void *)extra_rank_readwrite_data_pointer,
                           (void *)my_read_pointer, count, dtype);
        }
    }

    /* --Exchange node that reduces with extra node --: Signal to extra node that data is read
     * --Exchange node that doesn't reduce data with extra node --: This assignment
     * is used so it can sync with other nodes during exchange phase
     * --Extra node--: It can pass to next phase
     */
    ready_flag++;
    /*my_ctl_pointer->flags[ALLREDUCE_FLAG] = ready_flag;*/
    my_ctl_pointer->flag = ready_flag;

    /*
     * Exchange data with all the nodes that are less than max_power_2
     */
    for (exchange=0 ; exchange < my_exchange_node->n_exchanges ; exchange++) {
        int tmp=0;

        /*my_ctl_pointer->flags[ALLREDUCE_FLAG] = ready_flag;*/
        my_ctl_pointer->flag = ready_flag;
        pair_rank=my_exchange_node->rank_exchanges[exchange];
        partner_ctl_pointer = ctl_structs[pair_rank];
        partner_read_pointer = (volatile void *) ((char *)data_buffs[pair_rank].payload + read_offset);

        my_read_pointer = (volatile void *)((char *) my_data_pointer + read_offset);
        my_write_pointer = (volatile void *)((char *) my_data_pointer + write_offset);

        /*
         * Wait for partner to be ready, so we can read
         */
        /*
          JSL ---- FIX ME !!!!! MAKE ME COMPLIANT WITH NEW BUFFERS
          while (!IS_ALLREDUCE_PEER_READY(partner_ctl_pointer,
                                          ready_flag, sequence_number)) {
          }
        */

        /*
         * Perform reduction operation
         */
        ompi_3buff_op_reduce(op,(void *)my_read_pointer, (void *)partner_read_pointer,
                             (void *)my_write_pointer, count, dtype);

        /*
         * Signal that I am done reading my partner's data
         */
        ready_flag++;
        /*my_ctl_pointer->flags[ALLREDUCE_FLAG] = ready_flag;*/
        my_ctl_pointer->flag = ready_flag;

        while (ready_flag > partner_ctl_pointer->flag){
            opal_progress();
        }

        /*
         * Swap read and write offsets
         */
        tmp = read_offset;
        read_offset = write_offset;
        write_offset = tmp;
    }

    /*
     * Copy data in from the "extra" source, if need be
     */

    if (0 < my_exchange_node->n_extra_sources) {

        if (EXTRA_NODE == my_exchange_node->node_type) {

            int extra_rank_read_offset=-1,my_write_offset=-1;

            /* Offset the ready flag to sync with the
             * exchange node, which might be going through exchange phases
             * unlike the extra node
             */
            ready_flag = ready_flag + my_exchange_node->log_2;

            if (my_exchange_node->log_2%2) {
                extra_rank_read_offset = rbuf_offset;
                my_write_offset = rbuf_offset;
            } else {
                extra_rank_read_offset = sbuf_offset;
                my_write_offset = sbuf_offset;
            }

            my_write_pointer = (volatile void*)((char *)my_data_pointer + my_write_offset);
            extra_rank = my_exchange_node->rank_extra_source;
            extra_ctl_pointer = ctl_structs[extra_rank];

            extra_rank_read_data_pointer = (volatile void *) ((char *)data_buffs[extra_rank].payload +
                                                              extra_rank_read_offset);

            /*
             * Wait for the exchange node to be ready
             */
            ompi_datatype_type_size(dtype, &dt_size);
            payload_len = count*dt_size;
#if 0
            fix me JSL !!!!!
            while (!IS_DATA_READY(extra_ctl_pointer, ready_flag, sequence_number)){
            }
#endif
            memcpy((void *)my_write_pointer,(const void *)
                   extra_rank_read_data_pointer, payload_len);

            ready_flag++;
            /*my_ctl_pointer->flags[ALLREDUCE_FLAG] = ready_flag;*/
            my_ctl_pointer->flag = ready_flag;

        } else {

            /*
             * Signal parent that data is ready
             */
            MB();
            /*my_ctl_pointer->flags[ALLREDUCE_FLAG] = ready_flag;*/
            my_ctl_pointer->flag = ready_flag;

            /* wait until child is done to move on - this buffer will
             * be reused for the next stripe, so don't want to move
             * on too quick.
             */
            extra_rank = my_exchange_node->rank_extra_source;
            extra_ctl_pointer = ctl_structs[extra_rank];
        }
    }

    input_args->result_in_rbuf = my_exchange_node->log_2 & 1;

    my_ctl_pointer->starting_flag_value += 1;

    return BCOL_FN_COMPLETE;
}
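As context for the routine above, here is a minimal standalone model of how
recursive doubling pairs ranks when the group size is a power of two: at
exchange step s, rank r pairs with r XOR 2^s. This is an illustration only;
the basesmuma code above takes the same pairing from its precomputed
netpatterns exchange tree, which also handles the extra (non-power-of-two)
ranks.

#include <stdio.h>

/* Model of recursive-doubling partner selection for a power-of-two group. */
int main (void)
{
    int group_size = 8; /* must be a power of two for this sketch */

    for (int rank = 0; rank < group_size; rank++) {
        printf("rank %d partners:", rank);
        for (int step = 1; step < group_size; step <<= 1) {
            printf(" %d", rank ^ step); /* partner at this exchange step */
        }
        printf("\n");
    }
    return 0;
}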
The diff for some files is not shown because of its large size.
@ -1,7 +1,8 @@
|
||||
/* -*- Mode: C; c-basic-offset:4 ; indent-tabs-mode:nil -*- */
|
||||
/*
|
||||
* Copyright (c) 2009-2012 Oak Ridge National Laboratory. All rights reserved.
|
||||
* Copyright (c) 2009-2013 Oak Ridge National Laboratory. All rights reserved.
|
||||
* Copyright (c) 2009-2012 Mellanox Technologies. All rights reserved.
|
||||
* Copyright (c) 2012 Los Alamos National Security, LLC.
|
||||
* Copyright (c) 2013-2014 Los Alamos National Security, LLC.
|
||||
* All rights reserved.
|
||||
* Copyright (c) 2014 Cisco Systems, Inc. All rights reserved.
|
||||
* $COPYRIGHT$
|
||||
@ -40,7 +41,7 @@
|
||||
* correct generation of the bank is ready for use.
|
||||
*/
|
||||
int bcol_basesmuma_get_buff_index( sm_buffer_mgmt *buff_block,
|
||||
uint64_t buff_id )
|
||||
uint64_t buff_id )
|
||||
{
|
||||
/* local variables */
|
||||
int memory_bank;
|
||||
@ -74,11 +75,11 @@ int bcol_basesmuma_get_buff_index( sm_buffer_mgmt *buff_block,
|
||||
return index;
|
||||
}
|
||||
|
||||
/* release the shared memory buffers
|
||||
/* release the shared memory buffers
|
||||
* buf_id is the unique ID assigned to the particular buffer
|
||||
*/
|
||||
int bcol_basesmuma_free_buff( sm_buffer_mgmt * buff_block,
|
||||
uint64_t buff_id )
|
||||
uint64_t buff_id )
|
||||
{
|
||||
/* local variables */
|
||||
int ret=OMPI_SUCCESS;
|
||||
@ -99,98 +100,98 @@ int bcol_basesmuma_free_buff( sm_buffer_mgmt * buff_block,
|
||||
assert(generation == buff_block->ctl_buffs_mgmt[memory_bank].bank_gen_counter);
|
||||
|
||||
/*
|
||||
* increment counter of completed buffers
|
||||
* increment counter of completed buffers
|
||||
*/
|
||||
OPAL_THREAD_ADD32(&(buff_block->ctl_buffs_mgmt[memory_bank].n_buffs_freed),
|
||||
1);
|
||||
1);
|
||||
|
||||
/*
|
||||
* If I am the last to checkin - initiate resource recycling
|
||||
*/
|
||||
if( buff_block->ctl_buffs_mgmt[memory_bank].n_buffs_freed ==
|
||||
buff_block->ctl_buffs_mgmt[memory_bank].number_of_buffers ) {
|
||||
if( buff_block->ctl_buffs_mgmt[memory_bank].n_buffs_freed ==
|
||||
        buff_block->ctl_buffs_mgmt[memory_bank].number_of_buffers ) {

        /* Lock to ensure atomic recycling of resources */
        OPAL_THREAD_LOCK(&(buff_block->ctl_buffs_mgmt[memory_bank].mutex));

        /* make sure someone else did not already get to this */
        if( buff_block->ctl_buffs_mgmt[memory_bank].n_buffs_freed !=
            buff_block->ctl_buffs_mgmt[memory_bank].number_of_buffers ) {
            /* release lock and exit */
            OPAL_THREAD_UNLOCK(&(buff_block->ctl_buffs_mgmt[memory_bank].mutex));
        } else {
            sm_nbbar_desc_t *p_sm_nb_desc = NULL;
            opal_list_t *list = &(cs->nb_admin_barriers);
            opal_list_item_t *append_item;

            /* initiate the freeing of resources. Need to make sure the other
             * ranks in the group are also done with their resources before this
             * block is made available for use again.
             * No one else will try to allocate from this block or free back to
             * this block until the next generation counter has been incremented,
             * so we just reset the number of freed buffers to 0, so no one else
             * will also try to initialize the recycling of these resources.
             */
            buff_block->ctl_buffs_mgmt[memory_bank].n_buffs_freed = 0;

            /* Start the nonblocking barrier */
            p_sm_nb_desc = &(buff_block->ctl_buffs_mgmt[memory_bank].nb_barrier_desc);
            p_sm_nb_desc->coll_buff = buff_block;
            bcol_basesmuma_rd_nb_barrier_init_admin(p_sm_nb_desc);

            if( NB_BARRIER_DONE !=
                buff_block->ctl_buffs_mgmt[memory_bank].
                nb_barrier_desc.collective_phase) {

                /* put this onto the progression list */
                OPAL_THREAD_LOCK(&(cs->nb_admin_barriers_mutex));
                append_item = (opal_list_item_t *)
                    &(buff_block->ctl_buffs_mgmt[memory_bank].nb_barrier_desc);
                opal_list_append(list, append_item);
                OPAL_THREAD_UNLOCK(&(cs->nb_admin_barriers_mutex));

                /* progress communications so that resources can be freed up */
                opal_progress();
            } else {
                /* mark the block as available */
                (buff_block->ctl_buffs_mgmt[memory_bank].bank_gen_counter)++;
            }

            /* get out of here */
            OPAL_THREAD_UNLOCK(&(buff_block->ctl_buffs_mgmt[memory_bank].mutex));
        }
    }

    /* return */
    return ret;
}
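The locking details above can obscure the underlying pattern, so here is a minimal, self-contained sketch of it (hypothetical names and simplified types, not the OMPI API): the last rank to return its buffers resets the freed-count under the bank mutex and kicks off a non-blocking barrier, and only the completion of that barrier advances the generation counter that gates reuse of the bank.

#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

typedef struct {
    pthread_mutex_t mutex;
    int n_buffs_freed;
    int number_of_buffers;
    int bank_gen_counter;   /* bumped only when recycling completes */
    bool barrier_started;
} bank_t;

/* stand-in for the non-blocking barrier: here it "completes" at once */
static bool nb_barrier_done(bank_t *b) { (void)b; return true; }

static void free_buffer(bank_t *b)
{
    pthread_mutex_lock(&b->mutex);
    if (++b->n_buffs_freed == b->number_of_buffers) {
        /* last freer: reset the count so nobody re-initiates recycling,
         * then start the (non-blocking) barrier with the other ranks */
        b->n_buffs_freed = 0;
        b->barrier_started = true;
        if (nb_barrier_done(b)) {
            b->bank_gen_counter++;   /* bank may now be reused */
        } /* else: it would go onto a progression list */
    }
    pthread_mutex_unlock(&b->mutex);
}

int main(void)
{
    bank_t b = { PTHREAD_MUTEX_INITIALIZER, 0, 4, 0, false };
    for (int i = 0; i < 4; i++) free_buffer(&b);
    printf("bank generation counter: %d\n", b.bank_gen_counter); /* 1 */
    return 0;
}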

#if 0
/* Basesmuma interface function used for buffer bank resource recycling and
   bcol specific registration information
*/
int bcol_basesmuma_bank_init(struct mca_coll_ml_module_t *ml_module,
                             mca_bcol_base_module_t *bcol_module,
                             void *reg_data)
{
    /* assumption here is that the block has been registered with
     * sm bcol hence has been mapped by each process, need to be
     * sure that memory is mapped amongst sm peers
     */

    /* local variables */
    int ret = OMPI_SUCCESS, i;
    uint32_t j;
    sm_buffer_mgmt *pload_mgmt;
    mca_bcol_basesmuma_component_t *cs = &mca_bcol_basesmuma_component;
    bcol_basesmuma_registration_data_t *sm_reg_data =
        (bcol_basesmuma_registration_data_t *) reg_data;
    mca_bcol_basesmuma_module_t *sm_bcol =
        (mca_bcol_basesmuma_module_t *) bcol_module;
    ml_memory_block_desc_t *ml_block =
        ml_module->payload_block;
    size_t malloc_size;
    ompi_common_sm_file_t input_file;
    uint64_t mem_offset;
    int leading_dim,loop_limit,buf_id;
    unsigned char *base_ptr;
@@ -199,56 +200,56 @@ int bcol_basesmuma_bank_init(struct mca_coll_ml_module_t *ml_module,

    fprintf(stderr,"test opti test\n");

    /* first, we get a pointer to the payload buffer management struct */
    pload_mgmt = &(sm_bcol->colls_with_user_data);

    /* allocate memory for pointers to mine and my peers' payload buffers
     */
    malloc_size = ml_block->num_banks*ml_block->num_buffers_per_bank*
        pload_mgmt->size_of_group *sizeof(void *);
    pload_mgmt->data_buffs = malloc(malloc_size);
    if( !pload_mgmt->data_buffs) {
        ret = OMPI_ERR_OUT_OF_RESOURCE;
        goto exit_ERROR;
    }

    /* setup the input file for the shared memory connection manager */
    input_file.file_name = sm_reg_data->file_name;
    input_file.size = sm_reg_data->size;
    input_file.size_ctl_structure = 0;
    input_file.data_seg_alignment = BASESMUMA_CACHE_LINE_SIZE;
    input_file.mpool_size = sm_reg_data->size;

    /* call the connection manager and map my shared memory peers' file
     */
    ret = ompi_common_smcm_allgather_connection(
        sm_bcol,
        sm_bcol->super.sbgp_partner_module,
        &(cs->sm_connections_list),
        &(sm_bcol->payload_backing_files_info),
        sm_bcol->super.sbgp_partner_module->group_comm,
        input_file,
        false);
    if( OMPI_SUCCESS != ret ) {
        goto exit_ERROR;
    }

    /* now we exchange offset info - don't assume symmetric virtual memory
     */
    mem_offset = (uint64_t)(ml_block->block->base_addr) -
        (uint64_t)(cs->sm_payload_structs->data_addr);

    /* call into the exchange offsets function */
    ret = base_bcol_basesmuma_exchange_offsets(sm_bcol_module,
        (void **)pload_mgmt->data_buffs, mem_offset, 0,
        pload_mgmt->size_of_group);
    if( OMPI_SUCCESS != ret ) {
        goto exit_ERROR;
    }

    /* convert memory offset to virtual address in current rank */
    leading_dim = pload_mgmt->size_of_group;
    loop_limit = ml_block->num_banks*ml_block->num_buffers_per_bank;
    for (i=0;i< sm_bcol_module->super.sbgp_partner_module->group_size;i++) {

        /* get the base pointer */
@@ -266,7 +267,7 @@ int bcol_basesmuma_bank_init(struct mca_coll_ml_module_t *ml_module,
            int array_id_m1=SM_ARRAY_INDEX(leading_dim,(buf_id-1),i);
            array_id=SM_ARRAY_INDEX(leading_dim,buf_id,i);
            pload_mgmt->data_buffs[array_id]=(void *) ((uint64_t)(pload_mgmt->data_buffs[array_id_m1])+
                (uint64_t)ml_block->size_buffer);
        }
    }
@@ -287,17 +288,17 @@ exit_ERROR:
#endif

/*
 * Allocate buffers for storing non-blocking collective descriptions, required
 * for making code re-entrant
 *
 */
static int init_nb_coll_buff_desc(mca_bcol_basesmuma_nb_coll_buff_desc_t **desc,
                                  void *base_addr, uint32_t num_banks,
                                  uint32_t num_buffers_per_bank,
                                  uint32_t size_buffer,
                                  uint32_t header_size,
                                  int group_size,
                                  int pow_k)
{
    uint32_t i, j, ci;
    mca_bcol_basesmuma_nb_coll_buff_desc_t *tmp_desc = NULL;
@@ -333,104 +334,103 @@ static int init_nb_coll_buff_desc(mca_bcol_basesmuma_nb_coll_buff_desc_t **desc,


#if 1
/* New init function used for new control scheme where we put the control
 * struct at the top of the payload buffer
 */
int bcol_basesmuma_bank_init_opti(struct mca_coll_ml_module_t *ml_module,
                                  mca_bcol_base_module_t *bcol_module,
                                  void *reg_data)
{
    /* assumption here is that the block has been registered with
     * sm bcol hence has been mapped by each process, need to be
     * sure that memory is mapped amongst sm peers
     */

    /* local variables */
    int ret = OMPI_SUCCESS, i, j, k, l;
    sm_buffer_mgmt *pload_mgmt;
    mca_bcol_basesmuma_component_t *cs = &mca_bcol_basesmuma_component;
    bcol_basesmuma_registration_data_t *sm_reg_data =
        (bcol_basesmuma_registration_data_t *) reg_data;
    mca_bcol_basesmuma_module_t *sm_bcol =
        (mca_bcol_basesmuma_module_t *) bcol_module;
    ml_memory_block_desc_t *ml_block =
        ml_module->payload_block;
    size_t malloc_size;
    bcol_basesmuma_smcm_file_t input_file;
    uint64_t mem_offset;
    int leading_dim,loop_limit,buf_id;
    unsigned char *base_ptr;
    mca_bcol_basesmuma_module_t *sm_bcol_module=
        (mca_bcol_basesmuma_module_t *)bcol_module;
    int my_idx, array_id;
    mca_bcol_basesmuma_header_t *ctl_ptr;
    void **results_array;

    mca_bcol_basesmuma_local_mlmem_desc_t *ml_mem = &sm_bcol_module->ml_mem;

    /* first, we get a pointer to the payload buffer management struct */
    pload_mgmt = &(sm_bcol->colls_with_user_data);

    /* go ahead and get the header size that is cached on the payload block
     */
    sm_bcol->total_header_size = ml_module->data_offset;

    /* allocate memory for pointers to mine and my peers' payload buffers
     * difference here is that now we use our new data struct
     */
    malloc_size = ml_block->num_banks*ml_block->num_buffers_per_bank*
        pload_mgmt->size_of_group *sizeof(mca_bcol_basesmuma_payload_t);
    pload_mgmt->data_buffs = (mca_bcol_basesmuma_payload_t *) malloc(malloc_size);
    if( !pload_mgmt->data_buffs) {
        ret = OMPI_ERR_OUT_OF_RESOURCE;
        goto exit_ERROR;
    }

    /* allocate some memory to hold the offsets */
    results_array = (void **) malloc(pload_mgmt->size_of_group*sizeof(void *));

    /* setup the input file for the shared memory connection manager */
    input_file.file_name = sm_reg_data->file_name;
    input_file.size = sm_reg_data->size;
    input_file.size_ctl_structure = 0;
    input_file.data_seg_alignment = BASESMUMA_CACHE_LINE_SIZE;
    input_file.mpool_size = sm_reg_data->size;

    /* call the connection manager and map my shared memory peers' file
     */
    ret = bcol_basesmuma_smcm_allgather_connection(
        sm_bcol,
        sm_bcol->super.sbgp_partner_module,
        &(cs->sm_connections_list),
        &(sm_bcol->payload_backing_files_info),
        sm_bcol->super.sbgp_partner_module->group_comm,
        input_file,cs->payload_base_fname,
        false);
    if( OMPI_SUCCESS != ret ) {
        goto exit_ERROR;
    }

    /* now we exchange offset info - don't assume symmetric virtual memory
     */
    mem_offset = (uint64_t)(uintptr_t)(ml_block->block->base_addr) -
        (uint64_t)(uintptr_t)(cs->sm_payload_structs->data_addr);

    /* call into the exchange offsets function */
    ret=comm_allgather_pml(&mem_offset,results_array,1,MPI_LONG_LONG_INT,
        sm_bcol_module->super.sbgp_partner_module->my_index,
        sm_bcol_module->super.sbgp_partner_module->group_size,
        sm_bcol_module->super.sbgp_partner_module->group_list,
        sm_bcol_module->super.sbgp_partner_module->group_comm);
    if( OMPI_SUCCESS != ret ) {
        goto exit_ERROR;
    }

    /* convert memory offset to virtual address in current rank */
    leading_dim = pload_mgmt->size_of_group;
    loop_limit = ml_block->num_banks*ml_block->num_buffers_per_bank;
    for (i=0;i< sm_bcol_module->super.sbgp_partner_module->group_size;i++) {

        /* get the base pointer */
@@ -447,29 +447,31 @@ int bcol_basesmuma_bank_init_opti(struct mca_coll_ml_module_t *ml_module,
        pload_mgmt->data_buffs[array_id].ctl_struct=(mca_bcol_basesmuma_header_t *)
            (uintptr_t)(((uint64_t)(uintptr_t)results_array[array_id])+(uint64_t)(uintptr_t)base_ptr);
        /* second, calculate where to set the data pointer */
        pload_mgmt->data_buffs[array_id].payload=(void *)
            (uintptr_t)((uint64_t)(uintptr_t) pload_mgmt->data_buffs[array_id].ctl_struct +
            (uint64_t)(uintptr_t) ml_module->data_offset);

        for( buf_id = 1 ; buf_id < loop_limit ; buf_id++ ) {
            int array_id_m1=SM_ARRAY_INDEX(leading_dim,(buf_id-1),i);
            array_id=SM_ARRAY_INDEX(leading_dim,buf_id,i);
            /* now, play the same game as above
             *
             * first, set the control struct's position */
            pload_mgmt->data_buffs[array_id].ctl_struct=(mca_bcol_basesmuma_header_t *)
                (uintptr_t)(((uint64_t)(uintptr_t)(pload_mgmt->data_buffs[array_id_m1].ctl_struct) +
                (uint64_t)(uintptr_t)ml_block->size_buffer));

            /* second, set the payload pointer */
            pload_mgmt->data_buffs[array_id].payload =(void *)
                (uintptr_t)((uint64_t)(uintptr_t) pload_mgmt->data_buffs[array_id].ctl_struct +
                (uint64_t)(uintptr_t) ml_module->data_offset);
        }

    }

    /* done with the index array */
    free (results_array);

    /* initialize my control structures!! */
    my_idx = sm_bcol_module->super.sbgp_partner_module->my_index;
    leading_dim = sm_bcol_module->super.sbgp_partner_module->group_size;
@@ -499,30 +501,30 @@ int bcol_basesmuma_bank_init_opti(struct mca_coll_ml_module_t *ml_module,
            ml_block;
    }

    ml_mem->num_banks = ml_block->num_banks;
    ml_mem->bank_release_counter = calloc(ml_block->num_banks, sizeof(uint32_t));
    ml_mem->num_buffers_per_bank = ml_block->num_buffers_per_bank;
    ml_mem->size_buffer = ml_block->size_buffer;
    /* pointer to ml level descriptor */
    ml_mem->ml_mem_desc = ml_block;

    if (OMPI_SUCCESS != init_nb_coll_buff_desc(&ml_mem->nb_coll_desc,
                                               ml_block->block->base_addr,
                                               ml_mem->num_banks,
                                               ml_mem->num_buffers_per_bank,
                                               ml_mem->size_buffer,
                                               ml_module->data_offset,
                                               sm_bcol_module->super.sbgp_partner_module->group_size,
                                               sm_bcol_module->pow_k)) {

        BASESMUMA_VERBOSE(10, ("Failed to allocate memory descriptors for storing state of non-blocking collectives\n"));
        return OMPI_ERROR;
    }

    return OMPI_SUCCESS;

exit_ERROR:
    return ret;
}

#endif
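Both bank-init variants hinge on the same trick, so a small sketch may help: virtual addresses are never exchanged directly, because each process may map the shared file at a different base address. Only offsets travel over the wire, and each peer rebuilds pointers against its own mapping. The names below are hypothetical; the real code gathers the offsets with comm_allgather_pml().

#include <stdint.h>
#include <stdio.h>

/* Rebuild a peer's buffer address from an exchanged offset.
 * my_map_of_peer_base is where *this* process mapped the peer's
 * shared-memory file; peer_offset is the offset the peer advertised. */
static void *peer_buffer(void *my_map_of_peer_base, uint64_t peer_offset)
{
    return (void *)((uintptr_t)my_map_of_peer_base + (uintptr_t)peer_offset);
}

int main(void)
{
    char segment[4096];        /* stands in for the mmap'ed backing file */
    uint64_t offset = 512;     /* offset the peer sent in the allgather */
    printf("peer buffer at %p\n", peer_buffer(segment, offset));
    return 0;
}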
@@ -531,48 +533,45 @@ exit_ERROR:

/* Basesmuma interface function used for buffer release */
#if 0
/* gvm
 * A collective operation calls this routine to release the payload buffer.
 * All processes in the shared memory sub-group of a bcol should call the non-blocking
 * barrier on the last payload buffer of a memory bank. On the completion
 * of the non-blocking barrier, the ML callback is called which is responsible
 * for recycling the memory bank.
 */
int bcol_basesmuma_free_payload_buff(
    struct ml_memory_block_desc_t *block,
    sm_buffer_mgmt *ctl_mgmt,
    uint64_t buff_id)
{
    /* local variables */
    int ret = OMPI_SUCCESS;

    memory_bank = BANK_FROM_BUFFER_IDX(buff_id);
    ctl_mgmt->ctl_buffs_mgmt[memory_bank].n_buffs_freed++;

    OPAL_THREAD_ADD32(&(ctl_mgmt->ctl_buffs_mgmt[memory_bank].n_buffs_freed),1);

    if (ctl_mgmt->ctl_buffs_mgmt[memory_bank].n_buffs_freed == block->size_buffers_bank){

        /* start non-blocking barrier */
        bcol_basesmuma_rd_nb_barrier_init_admin(
            &(ctl_mgmt->ctl_buffs_mgmt[memory_bank].nb_barrier_desc));

        if (NB_BARRIER_DONE !=
            ctl_mgmt->ctl_buffs_mgmt[memory_bank].
            nb_barrier_desc.collective_phase){

            /* progress the barrier */
            opal_progress();
        }
        else{
            /* free the buffer - i.e. initiate callback to ml level */
            block->ml_release_cb(block,memory_bank);
        }
    }
    return ret;
}
#endif
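As an aside, the disabled routine above increments n_buffs_freed twice, once with a plain ++ and once through OPAL_THREAD_ADD32; a single atomic add is what the logic needs. A minimal sketch of the intended counting pattern, using C11 atomics in place of the OPAL macro (names hypothetical):

#include <stdatomic.h>
#include <stdio.h>

#define BANK_SIZE 4

static atomic_int n_buffs_freed;   /* one counter per memory bank */

static void release_buffer(void)
{
    /* fetch_add returns the previous value, so +1 gives the new count */
    if (atomic_fetch_add(&n_buffs_freed, 1) + 1 == BANK_SIZE) {
        /* last release for this bank: this is where the non-blocking
         * barrier would be started and, on completion, the ML release
         * callback invoked */
        printf("bank fully freed; start nb barrier\n");
    }
}

int main(void)
{
    for (int i = 0; i < BANK_SIZE; i++) release_buffer();
    return 0;
}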

 1107  ompi/mca/bcol/basesmuma/bcol_basesmuma_gather.c  (new file; diff not shown because of its size)
 1878  ompi/mca/bcol/basesmuma/bcol_basesmuma_lmsg_bcast.c  (new file; diff not shown because of its size)
  626  ompi/mca/bcol/basesmuma/bcol_basesmuma_lmsg_bcast.h  (new file)
@@ -0,0 +1,626 @@
#ifdef __PORTALS_AVAIL__
#define __PORTALS_ENABLE__

#include <unistd.h>

#include "ompi_config.h"
#include "ompi/constants.h"
#include "ompi/datatype/ompi_datatype.h"
#include "ompi/communicator/communicator.h"

#include "bcol_basesmuma_utils.h"
#include "bcol_basesmuma_portals.h"
#include "bcol_basesmuma.h"

#if 0
struct scatter_allgather_nb_bcast_state_t
{
    /* local variables */
    uint64_t length;
    int my_rank, src, matched;
    int *src_list;
    int group_size;
    int64_t ready_flag;
    int pow_2, pow_2_levels;
    int src_list_index;
    uint64_t fragment_size; /* user buffer size */

    /* Input argument variables */
    void *my_userbuf;
    int64_t sequence_number;

    /* Extra source variables */
    bool secondary_root;
    int partner , extra_partner;

    /* Scatter Allgather offsets */
    uint64_t local_sg_offset , global_sg_offset , partner_offset ;

    /* Portals messaging relevant variables */
    ptl_handle_eq_t allgather_eq_h;
    ptl_handle_eq_t read_eq;
    ptl_event_t allgather_event;
    bool msg_posted;

    /* OMPI module and component variables */
    mca_bcol_basesmuma_component_t *cs;
    mca_bcol_basesmuma_module_t *bcol_module;

    /* Control structure and payload variables */
    volatile mca_bcol_basesmuma_ctl_struct_t **ctl_structs;
    volatile mca_bcol_basesmuma_ctl_struct_t *my_ctl_pointer;
    volatile mca_bcol_basesmuma_ctl_struct_t *parent_ctl_pointer;        /* scatter source */
    volatile mca_bcol_basesmuma_ctl_struct_t *extra_partner_ctl_pointer; /* scatter source */

    int phase;
};

typedef struct scatter_allgather_nb_bcast_state_t sg_state_t;
#endif

bool blocked_post = false;

#define IS_SG_DATA_READY(peer, my_flag, my_sequence_number)      \
    (((peer)->sequence_number == (my_sequence_number) &&         \
      (peer)->flags[BCAST_FLAG] >= (my_flag)                     \
     )? true : false )


#define SG_LARGE_MSG_PROBE(src_list, n_src, src_list_index, matched,            \
                           src, data_buffs, data_src_ctl_pointer,               \
                           data_src_lmsg_ctl_pointer, ready_flag,               \
                           sequence_number)                                     \
do {                                                                            \
    int j;                                                                      \
    for( j = 0; j < n_src; j++) {                                               \
        if(src_list[j] != -1) {                                                 \
            data_src_ctl_pointer = data_buffs[src_list[j]].ctl_struct;          \
            data_src_lmsg_ctl_pointer = (mca_bcol_basesmuma_portal_buf_addr_t*) \
                data_buffs[src_list[j]].payload;                                \
            if( IS_SG_DATA_READY(data_src_ctl_pointer,ready_flag,sequence_number)) { \
                src = src_list[j];                                              \
                matched = 1;                                                    \
                src_list_index = j;                                             \
                break;                                                          \
            }                                                                   \
        }                                                                       \
    }                                                                           \
} while(0)

#define SG_LARGE_MSG_NB_PROBE(src_list, n_src, src_list_index, matched,        \
                              src, ctl_structs, data_src_ctl_pointer,          \
                              ready_flag, sequence_number)                     \
do {                                                                           \
    int j;                                                                     \
    for( j = 0; j < n_src; j++) {                                              \
        if(src_list[j] != -1) {                                                \
            data_src_ctl_pointer = ctl_structs[src_list[j]];                   \
            if( IS_SG_DATA_READY(data_src_ctl_pointer,ready_flag,sequence_number)) { \
                src = src_list[j];                                             \
                matched = 1;                                                   \
                src_list_index = j;                                            \
                break;                                                         \
            }                                                                  \
        }                                                                      \
    }                                                                          \
} while(0)
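Both probe macros reduce to the same readiness predicate, sketched below as a plain function (hypothetical struct; the real control structure carries a flags array indexed by collective type): a peer's buffer may be consumed once its sequence number matches the caller's and its phase flag has advanced at least to the phase being waited on.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct peer_ctl { int64_t sequence_number; int64_t flag; };

static bool sg_data_ready(const struct peer_ctl *peer,
                          int64_t my_flag, int64_t my_sequence_number)
{
    return peer->sequence_number == my_sequence_number &&
           peer->flag >= my_flag;
}

int main(void)
{
    struct peer_ctl peer = { .sequence_number = 7, .flag = 3 };
    printf("%d\n", sg_data_ready(&peer, 3, 7));  /* 1: same sn, flag reached */
    printf("%d\n", sg_data_ready(&peer, 4, 7));  /* 0: peer not there yet */
    return 0;
}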


static inline __opal_attribute_always_inline__
int wait_for_peers(int my_rank, int npeers, volatile mca_bcol_basesmuma_payload_t *data_buffs,
                   int flag_value, int sn)
{
    int *peers_list = NULL;
    int counter = 0, diter = 0;
    volatile mca_bcol_basesmuma_header_t *peer_ctl_pointer = NULL;

    peers_list = (int *)malloc(sizeof(int) * npeers);

    for (diter = 0; diter < npeers; diter++ ){
        peers_list[diter] = my_rank ^ (1<<diter);
        assert(peers_list[diter] != -1);
    }

    counter = 0;
    while (counter < npeers) {
        for (diter = 0; diter < npeers; diter++){
            if (-1 != peers_list[diter]) {
                peer_ctl_pointer = data_buffs[peers_list[diter]].ctl_struct;

                if (IS_SG_DATA_READY(peer_ctl_pointer, flag_value, sn)) {
                    counter++;
                    peers_list[diter] = -1;
                }
            }
        }
        opal_progress();
    }

    /* release the scratch list of peers */
    free(peers_list);

    return 0;
}
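The peer list built at the top of wait_for_peers() is the standard recursive-doubling schedule: the partner at level d is my_rank XOR (1 << d). A tiny standalone check:

#include <stdio.h>

int main(void)
{
    int my_rank = 5, npeers = 3;   /* power-of-two group of 8 ranks */
    for (int d = 0; d < npeers; d++)
        printf("level %d partner: %d\n", d, my_rank ^ (1 << d));
    /* prints 4, 7, 1: every rank meets every other through log2(N) levels */
    return 0;
}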

static inline __opal_attribute_always_inline__
int wait_for_peers_nb(int my_rank, int npeers,
                      volatile mca_bcol_basesmuma_ctl_struct_t **ctl_structs,
                      volatile int flag_value, int sn)
{
    int *peers_list = NULL;
    int counter = 0, diter = 0;
    volatile mca_bcol_basesmuma_ctl_struct_t *peer_ctl_pointer = NULL;

    peers_list = (int *)malloc(sizeof(int) * npeers);

    for (diter = 0; diter < npeers; diter++ ){
        peers_list[diter] = my_rank ^ (1<<diter);
        assert(peers_list[diter] != -1);
    }

    counter = 0;
    while (counter < npeers) {
        for (diter = 0; diter < npeers; diter++){
            if (-1 != peers_list[diter]) {
                peer_ctl_pointer = ctl_structs[peers_list[diter]];

                if (IS_SG_DATA_READY(peer_ctl_pointer, flag_value, sn)) {
                    counter++;
                    peers_list[diter] = -1;
                }
            }
        }
        opal_progress();
    }

    /* release the scratch list of peers */
    free(peers_list);

    return 0;
}

static inline __opal_attribute_always_inline__
int wait_for_post_complete_nb(int my_rank, int npeers,
                              volatile mca_bcol_basesmuma_ctl_struct_t **ctl_structs,
                              int flag_value, int sn)
{
    /* int *peers_list = NULL; */
    int peers_list[MAX_SM_GROUP_SIZE];
    int counter = 0, diter = 0;
    volatile mca_bcol_basesmuma_ctl_struct_t *peer_ctl_pointer = NULL;

    /* peers_list = (int *)malloc(sizeof(int) * npeers); */

    assert(npeers < MAX_SM_GROUP_SIZE);

    for (diter = 0; diter < npeers; diter++ ){
        peers_list[diter] = my_rank ^ (1<<diter);
        assert(peers_list[diter] != -1);
    }

    counter = 0;
    for (diter = 0; diter < npeers; diter++){
        peer_ctl_pointer = ctl_structs[peers_list[diter]];

        if (IS_SG_DATA_READY(peer_ctl_pointer, flag_value, sn)) {
            counter++;
        }
    }

    /* free(peers_list); */
    return counter;
}

static inline __opal_attribute_always_inline__
int sg_large_msg_probe(sg_state_t *sg_state)
{
    int j,n_src = sg_state->pow_2_levels+1;

    for( j = 0; j < n_src; j++) {
        if(sg_state->src_list[j] != -1) {
            sg_state->parent_ctl_pointer = sg_state->ctl_structs[sg_state->src_list[j]];

            BASESMUMA_VERBOSE(5,("Parent %d ctl pointer (parent=%x, my ctl=%x) flag %d",
                        sg_state->src_list[j],sg_state->parent_ctl_pointer,
                        sg_state->my_ctl_pointer,
                        sg_state->parent_ctl_pointer->flag));

            if (IS_SG_DATA_READY(sg_state->parent_ctl_pointer,
                        sg_state->ready_flag, sg_state->sequence_number)) {
                sg_state->src = sg_state->src_list[j];
                sg_state->matched = 1;
                sg_state->src_list_index = j;
                break;
            }
        }
    }

    return 0;
}

/*
 * I will post the message for all my children
 */
static inline __opal_attribute_always_inline__
int sm_portals_root_scatter(sg_state_t *sg_state)
{
    int extra_src_posts = -1, scatter_posts = -1, allgather_posts = -1,
        total_msg_posts = -1;

    BASESMUMA_VERBOSE(10,("I am the root of the data"));
    sg_state->my_ctl_pointer->offset = 0;
    sg_state->my_ctl_pointer->n_sends = sg_state->pow_2_levels;
    sg_state->my_ctl_pointer->length = sg_state->fragment_size;

    extra_src_posts = (sg_state->my_rank + sg_state->pow_2 < sg_state->group_size ) ? 1: 0;
    scatter_posts = sg_state->my_ctl_pointer->n_sends;
    allgather_posts = sg_state->pow_2_levels - 1;

    total_msg_posts = scatter_posts + allgather_posts + extra_src_posts ;

    if ( total_msg_posts <= 0) {
        BASESMUMA_VERBOSE(10,("No need to post the data "));
        return OMPI_SUCCESS;
    }

    mca_bcol_basesmuma_portals_post_msg(sg_state->cs,
            &sg_state->my_ctl_pointer->portals_buf_addr,
            sg_state->my_userbuf, sg_state->fragment_size,
            PTL_EQ_NONE,
            total_msg_posts,
            blocked_post,
            PTL_MD_EVENT_START_DISABLE| PTL_MD_EVENT_END_DISABLE |
            PTL_MD_OP_GET | PTL_MD_MANAGE_REMOTE | PTL_MD_TRUNCATE | PTL_MD_EVENT_AUTO_UNLINK_ENABLE);

    /*
    mca_bcol_basesmuma_portals_post_msg(sg_state->cs,
            &sg_state->my_ctl_pointer->portals_buf_addr,
            sg_state->my_userbuf, sg_state->fragment_size,
            sg_state->allgather_eq_h,
            total_msg_posts,
            blocked_post,
            PTL_MD_EVENT_START_DISABLE| PTL_MD_EVENT_END_DISABLE |
            PTL_MD_OP_GET | PTL_MD_MANAGE_REMOTE | PTL_MD_TRUNCATE | PTL_MD_EVENT_AUTO_UNLINK_ENABLE);
    */

    sg_state->msg_posted = true ;

    /*
    MB();
    */
    sg_state->my_ctl_pointer->flag = sg_state->ready_flag;

    return OMPI_SUCCESS;
}

/*
 * I'm the root but my rank is beyond the power-of-two group size, so I
 * copy to the partner who will act as the (secondary) root
 */
static inline __opal_attribute_always_inline__
int sm_portals_extra_root_scatter(sg_state_t *sg_state)
{
    int scatter_partner = -1;
    volatile mca_bcol_basesmuma_ctl_struct_t *scatter_partner_ctl_pointer = NULL;

    int total_msg_posts = 1;

    if ( total_msg_posts <= 0) {
        BASESMUMA_VERBOSE(10,("No need to post the data "));
    }
    else {
        mca_bcol_basesmuma_portals_post_msg(sg_state->cs,
                &sg_state->my_ctl_pointer->portals_buf_addr,
                sg_state->my_userbuf, sg_state->fragment_size,
                PTL_EQ_NONE,
                total_msg_posts,
                blocked_post,
                PTL_MD_EVENT_START_DISABLE| PTL_MD_EVENT_END_DISABLE | PTL_MD_OP_GET
                | PTL_MD_MANAGE_REMOTE | PTL_MD_TRUNCATE | PTL_MD_EVENT_AUTO_UNLINK_ENABLE);
        sg_state->msg_posted = true ;

    }

    MB();
    sg_state->my_ctl_pointer->flag = sg_state->ready_flag;

    scatter_partner = sg_state->my_rank - sg_state->pow_2;
    scatter_partner_ctl_pointer =
        sg_state->ctl_structs[scatter_partner];

    while(!IS_SG_DATA_READY(scatter_partner_ctl_pointer, sg_state->ready_flag,
                sg_state->sequence_number)){
        opal_progress();
    }

    return OMPI_SUCCESS;
}

/*
 * Gets msg from the partner (> pow2_groupsize) and posts the
 * message acting as root
 */
static inline __opal_attribute_always_inline__
int sm_portals_secondary_root_scatter(sg_state_t *sg_state)
{

    volatile mca_bcol_basesmuma_ctl_struct_t *extra_src_ctl_pointer = NULL;
    int scatter_posts, allgather_posts, extra_src_posts, total_msg_posts;

    sg_state->secondary_root = true;
    BASESMUMA_VERBOSE(10,("I am the secondary root for the data"));
    sg_state->my_ctl_pointer->offset = 0;
    sg_state->my_ctl_pointer->n_sends = sg_state->pow_2_levels;
    sg_state->my_ctl_pointer->length = sg_state->fragment_size;

    extra_src_ctl_pointer = sg_state->ctl_structs[sg_state->src];

    mca_bcol_basesmuma_portals_get_msg_fragment(sg_state->cs,
            sg_state->read_eq,
            &sg_state->my_ctl_pointer->portals_buf_addr,
            &extra_src_ctl_pointer->portals_buf_addr, 0,
            0, sg_state->fragment_size);

    extra_src_posts = 0;
    scatter_posts = sg_state->my_ctl_pointer->n_sends;
    allgather_posts = sg_state->pow_2_levels - 1;

    total_msg_posts = scatter_posts + allgather_posts + extra_src_posts ;

    if (total_msg_posts > 0) {
        mca_bcol_basesmuma_portals_post_msg(sg_state->cs,
                &sg_state->my_ctl_pointer->portals_buf_addr,
                sg_state->my_userbuf, sg_state->fragment_size,
                PTL_EQ_NONE,
                total_msg_posts,
                blocked_post,
                PTL_MD_EVENT_START_DISABLE| PTL_MD_EVENT_END_DISABLE | PTL_MD_OP_GET
                | PTL_MD_MANAGE_REMOTE | PTL_MD_TRUNCATE | PTL_MD_EVENT_AUTO_UNLINK_ENABLE);
        sg_state->msg_posted = true ;
    }
    MB();
    sg_state->my_ctl_pointer->flag = sg_state->ready_flag;

    return OMPI_SUCCESS;
}

/*
 * Internode Scatter: Get data from my parent and post for my children
 */

static inline __opal_attribute_always_inline__
int sm_portals_internode_scatter(sg_state_t *sg_state)
{

    int scatter_posts, allgather_posts, extra_src_posts,
        total_msg_posts;
    uint64_t local_offset, remote_offset;

    /* compute the size of the chunk to copy */
    sg_state->length = (sg_state->parent_ctl_pointer->length)/
        (1<<(sg_state->parent_ctl_pointer->n_sends - sg_state->my_ctl_pointer->n_sends));
    sg_state->my_ctl_pointer->length = sg_state->length;
    sg_state->my_ctl_pointer->offset =
        sg_state->parent_ctl_pointer->offset + sg_state->length;

    local_offset = sg_state->my_ctl_pointer->offset;
    remote_offset = sg_state->parent_ctl_pointer->offset +
        sg_state->length;

    mca_bcol_basesmuma_portals_get_msg_fragment(sg_state->cs,
            sg_state->read_eq,
            &sg_state->my_ctl_pointer->portals_buf_addr,
            &sg_state->parent_ctl_pointer->portals_buf_addr,local_offset,
            remote_offset,sg_state->length);

    /* Now post the message for other children to read */
    extra_src_posts = (sg_state->my_rank + sg_state->pow_2 <
            sg_state->group_size ) ? 1: 0;
    scatter_posts = sg_state->my_ctl_pointer->n_sends;
    allgather_posts = sg_state->pow_2_levels - 1;

    total_msg_posts = scatter_posts + allgather_posts + extra_src_posts ;

    if (total_msg_posts > 0) {
        mca_bcol_basesmuma_portals_post_msg(sg_state->cs, &sg_state->my_ctl_pointer->portals_buf_addr,
                sg_state->my_userbuf, sg_state->my_ctl_pointer->portals_buf_addr.userbuf_length,
                PTL_EQ_NONE,
                total_msg_posts,
                blocked_post,
                PTL_MD_EVENT_START_DISABLE| PTL_MD_EVENT_END_DISABLE
                | PTL_MD_OP_GET | PTL_MD_MANAGE_REMOTE | PTL_MD_TRUNCATE | PTL_MD_EVENT_AUTO_UNLINK_ENABLE);

        sg_state->msg_posted = true;
    }
    /*
    MB();
    */
    sg_state->my_ctl_pointer->flag = sg_state->ready_flag;

    return OMPI_SUCCESS;
}

/*
 * Bcast's Allgather Phase:
 * Combines data from all processes using recursive doubling algorithm
 */
static inline __opal_attribute_always_inline__
int sm_portals_bcasts_allgather_phase(sg_state_t *sg_state)
{
    int ag_loop, partner;
    volatile mca_bcol_basesmuma_ctl_struct_t *partner_ctl_pointer = NULL; /* recursive double */

    for( ag_loop = 1; ag_loop < sg_state->pow_2_levels; ag_loop++) {
        /* get my partner for this level */
        partner = sg_state->my_rank^(1<<ag_loop);
        partner_ctl_pointer = sg_state->ctl_structs[partner];

        /* Block until partner is at this level of recursive-doubling stage */
        while(!IS_SG_DATA_READY(partner_ctl_pointer, sg_state->ready_flag,
                    sg_state->sequence_number)) {
            opal_progress();
        }
        assert(partner_ctl_pointer->flag >= sg_state->ready_flag);

        if (partner_ctl_pointer->offset < sg_state->my_ctl_pointer->offset) {
            sg_state->global_sg_offset -= sg_state->length;
            sg_state->local_sg_offset = sg_state->global_sg_offset;
        } else {
            sg_state->local_sg_offset = sg_state->global_sg_offset + sg_state->length;
        }

        BASESMUMA_VERBOSE(10,("Allgather Phase: Get message from process %d, length %d",
                    partner, sg_state->length));
        mca_bcol_basesmuma_portals_get_msg_fragment(sg_state->cs,
                sg_state->read_eq,
                &sg_state->my_ctl_pointer->portals_buf_addr,
                &partner_ctl_pointer->portals_buf_addr,sg_state->local_sg_offset,
                sg_state->local_sg_offset, sg_state->length);

        sg_state->ready_flag++;
        MB();
        sg_state->my_ctl_pointer->flag = sg_state->ready_flag;

        /* Block until partner is at this level of recursive-doubling stage */
        while(!IS_SG_DATA_READY(partner_ctl_pointer, sg_state->ready_flag,
                    sg_state->sequence_number)) {
            opal_progress();
        }

        /* double the length */
        sg_state->length *= 2;
    }

    return OMPI_SUCCESS;

}

static inline __opal_attribute_always_inline__
int init_sm_group_info(sg_state_t *sg_state, int buff_idx)
{
    int idx, leading_dim;
    int first_instance=0;
    int flag_offset;

    /* Get addressing information */
    sg_state->group_size = sg_state->bcol_module->colls_no_user_data.size_of_group;
    leading_dim = sg_state->bcol_module->colls_no_user_data.size_of_group;
    idx=SM_ARRAY_INDEX(leading_dim,buff_idx,0);

    BASESMUMA_VERBOSE(1,("My buffer idx %d group size %d, leading dim %d, idx %d",
                buff_idx,sg_state->group_size,leading_dim,idx));
    /* grab the ctl buffs */
    sg_state->ctl_structs = (volatile mca_bcol_basesmuma_ctl_struct_t **)
        sg_state->bcol_module->colls_with_user_data.ctl_buffs+idx;

    sg_state->my_rank = sg_state->bcol_module->super.sbgp_partner_module->my_index;
    sg_state->my_ctl_pointer = sg_state->ctl_structs[sg_state->my_rank];

    if (sg_state->my_ctl_pointer->sequence_number < sg_state->sequence_number) {
        first_instance = 1;
    }

    if(first_instance) {
        sg_state->my_ctl_pointer->flag = -1;
        sg_state->my_ctl_pointer->index = 1;

        sg_state->my_ctl_pointer->starting_flag_value = 0;
        flag_offset = 0;

    } else {
        sg_state->my_ctl_pointer->index++;
    }

    /* For bcast we should have only one entry to this bcol
       assert(sg_state->my_ctl_pointer->flag == -1);
     */

    /* increment the starting flag by one and return */
    flag_offset = sg_state->my_ctl_pointer->starting_flag_value;
    sg_state->ready_flag = flag_offset + sg_state->sequence_number + 1;

    sg_state->my_ctl_pointer->sequence_number = sg_state->sequence_number;

    return OMPI_SUCCESS;

}
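The flag arithmetic in init_sm_group_info() is worth spelling out: each instance of a collective derives its ready_flag from the bank's starting flag value plus the collective's sequence number plus one, so a flag left behind by an earlier instance can never satisfy IS_SG_DATA_READY for a later one. A trivial sketch:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    int64_t starting_flag_value = 0;  /* bumped when a bank is recycled */
    int64_t sequence_number = 41;     /* per-collective, monotonically rising */
    int64_t ready_flag = starting_flag_value + sequence_number + 1;
    /* a peer stuck at an older sequence number publishes a smaller flag,
     * so the readiness test cannot match a stale instance */
    printf("ready_flag = %lld\n", (long long)ready_flag);  /* 42 */
    return 0;
}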

static inline __opal_attribute_always_inline__
int init_sm_portals_sg_info(sg_state_t *sg_state)
{
    /* Get portals info*/
    mca_bcol_basesmuma_portal_proc_info_t *portals_info;
    int rc = OMPI_SUCCESS;
    int sg_matchbits;

    portals_info = (mca_bcol_basesmuma_portal_proc_info_t*)sg_state->cs->portals_info;

    sg_matchbits = sg_state->sequence_number ;

    /* Construct my portal buffer address and copy to payload buffer */
    mca_bcol_basesmuma_construct_portal_address(&sg_state->my_ctl_pointer->portals_buf_addr,
            portals_info->portal_id.nid,
            portals_info->portal_id.pid,
            sg_matchbits,
            sg_state->bcol_module->super.sbgp_partner_module->group_comm->c_contextid);

    sg_state->my_ctl_pointer->portals_buf_addr.userbuf = sg_state->my_userbuf;
    sg_state->my_ctl_pointer->portals_buf_addr.userbuf_length = sg_state->fragment_size;

    return OMPI_SUCCESS;
}

static inline __opal_attribute_always_inline__
int compute_src_from_root(int group_root, int my_group_rank, int pow2, int
        group_size)
{

    int root, relative_rank, src, i;

    if (group_root < pow2) {
        root = group_root;
    } else {
        /* the source of the data is an extra node;
           the real root is represented by some rank from
           the pow2 group */
        root = group_root - pow2;
        /* shortcut for the case when my rank is root for the group */
        if (my_group_rank == root) {
            return group_root;
        }
    }

    relative_rank = (my_group_rank - root) < 0 ? my_group_rank - root + pow2 :
        my_group_rank - root;

    for (i = 1; i < pow2; i<<=1) {
        if (relative_rank & i) {
            src = my_group_rank ^ i;
            if (src >= pow2)
                src -= pow2;

            return src;
        }
    }

    return -1;
}
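compute_src_from_root() above is the usual binomial-tree parent computation: ignoring the extra-rank handling, a rank receives from the peer found by clearing the lowest set bit of its root-relative rank. A simplified, self-contained version for the root-0, power-of-two case (a hypothetical driver, not the production code):

#include <stdio.h>

static int src_from_root(int root, int my_rank, int pow2)
{
    int rel = (my_rank - root + pow2) % pow2;   /* root-relative rank */
    for (int i = 1; i < pow2; i <<= 1)
        if (rel & i)                            /* lowest set bit found */
            return (my_rank ^ i) >= pow2 ? (my_rank ^ i) - pow2
                                         : (my_rank ^ i);
    return -1;                                  /* I am the root */
}

int main(void)
{
    /* root 0, group of 8: rank 5 (101b) reads from 4, rank 6 (110b)
     * from 4, rank 7 (111b) from 6; the root itself reports -1 */
    for (int r = 0; r < 8; r++)
        printf("rank %d <- %d\n", r, src_from_root(0, r, 8));
    return 0;
}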

int bcol_basesmuma_lmsg_scatter_allgather_portals_bcast(bcol_function_args_t *input_args,
        coll_ml_function_t *c_input_args);

int bcol_basesmuma_lmsg_scatter_allgather_portals_nb_bcast(bcol_function_args_t *input_args,
        coll_ml_function_t *c_input_args);

int bcol_basesmuma_lmsg_scatter_allgather_portals_nb_knownroot_bcast(bcol_function_args_t *input_args,
        coll_ml_function_t *c_input_args);

#endif
  450  ompi/mca/bcol/basesmuma/bcol_basesmuma_lmsg_knomial_bcast.c  (new file)
@@ -0,0 +1,450 @@
/*
 * Copyright (c) 2009-2012 Oak Ridge National Laboratory.  All rights reserved.
 * Copyright (c) 2009-2012 Mellanox Technologies.  All rights reserved.
 * $COPYRIGHT$
 *
 * Additional copyrights may follow
 *
 * $HEADER$
 */

#include "ompi_config.h"

/* #define __PORTALS_AVAIL__ */
#ifdef __PORTALS_AVAIL__

#define __PORTALS_ENABLE__
#include "ompi/mca/bcol/basesmuma/bcol_basesmuma.h"
#include "ompi/constants.h"
#include "ompi/datatype/ompi_datatype.h"
#include "ompi/communicator/communicator.h"
#include "bcol_basesmuma_utils.h"

#include "bcol_basesmuma_portals.h"

/* debug */
#include <unistd.h>
/* end debug */


/**
 * Shared memory non-blocking Broadcast - K-nomial fan-out for small data buffers.
 * This routine assumes that buf (the input buffer) is a single writer
 * multi reader (SWMR) shared memory buffer owned by the calling rank,
 * which is the only rank that can write to these buffers.
 * It is also assumed that the buffers are registered and fragmented
 * at the ML level and that buf is sufficiently large to hold the data.
 *
 *
 * @param buf - SWMR shared buffer within a sbgp that the
 *              executing rank can write to.
 * @param count - the number of elements in the shared buffer.
 * @param dtype - the datatype of a shared buffer element.
 * @param root - the index within the sbgp of the root.
 * @param module - basesmuma module.
 */
int bcol_basesmuma_lmsg_bcast_k_nomial_anyroot(bcol_function_args_t *input_args,
        coll_ml_function_t *c_input_args)
{
#if 0
    /* local variables */
    mca_bcol_basesmuma_module_t* bcol_module=
        (mca_bcol_basesmuma_module_t *)c_input_args->bcol_module;
    mca_bcol_basesmuma_component_t *cs = &mca_bcol_basesmuma_component;
    int i, matched = 0;
    int src=-1;
    int group_size;
    int my_rank, first_instance=0, flag_offset;
    int rc = OMPI_SUCCESS;
    int leading_dim, buff_idx, idx;
    int count=input_args->count;
    struct ompi_datatype_t* dtype=input_args->dtype;
    int64_t sequence_number=input_args->sequence_num;

    volatile int64_t ready_flag;
    volatile mca_bcol_basesmuma_payload_t *data_buffs;
    volatile char* parent_data_pointer;
    volatile mca_bcol_basesmuma_header_t *parent_ctl_pointer;
    volatile mca_bcol_basesmuma_header_t *my_ctl_pointer;
    void *userbuf = (void *)((unsigned char *)input_args->userbuf);

    size_t pack_len = 0, dt_size;

    struct mca_bcol_basesmuma_portal_buf_addr_t *my_lmsg_ctl_pointer = NULL;
    struct mca_bcol_basesmuma_portal_buf_addr_t *parent_lmsg_ctl_pointer = NULL;
    mca_bcol_basesmuma_portal_proc_info_t *portals_info;
    portals_info = (mca_bcol_basesmuma_portal_proc_info_t*)cs->portals_info;

    /* we will work only on packed data - so compute the length*/
    ompi_datatype_type_size(dtype, &dt_size);
    pack_len=count*dt_size;
    buff_idx = input_args->src_desc->buffer_index;

    /* Get addressing information */
    my_rank = bcol_module->super.sbgp_partner_module->my_index;
    group_size = bcol_module->colls_no_user_data.size_of_group;
    leading_dim=bcol_module->colls_no_user_data.size_of_group;
    idx=SM_ARRAY_INDEX(leading_dim,buff_idx,0);

    data_buffs=(volatile mca_bcol_basesmuma_payload_t *)
        bcol_module->colls_with_user_data.data_buffs+idx;

    /* Set pointer to current proc ctrl region */
    my_ctl_pointer = data_buffs[my_rank].ctl_struct;
    my_lmsg_ctl_pointer = (mca_bcol_basesmuma_portal_buf_addr_t*) data_buffs[my_rank].payload;

    /* setup resource recycling */
    if( my_ctl_pointer->sequence_number < sequence_number ) {
        first_instance=1;
    }

    if( first_instance ) {
        /* Signal arrival */
        my_ctl_pointer->flag = -1;
        my_ctl_pointer->index=1;
        /* this does not need to use any flag values , so only need to
         * set the value for subsequent values that may need this */
        my_ctl_pointer->starting_flag_value=0;
        flag_offset=0;

    } else {
        /* only one thread at a time will be making progress on this
         * collective, so no need to make this atomic */
        my_ctl_pointer->index++;
    }


    /* increment the starting flag by one and return */
    flag_offset = my_ctl_pointer->starting_flag_value;
    ready_flag = flag_offset + sequence_number + 1;
    my_ctl_pointer->sequence_number = sequence_number;


    /* Construct my portal buffer address and copy to payload buffer */
    mca_bcol_basesmuma_construct_portal_address(my_lmsg_ctl_pointer,
            portals_info->portal_id.nid,
            portals_info->portal_id.pid,
            sequence_number,
            bcol_module->super.sbgp_partner_module->group_comm->c_contextid);

    /* non-blocking broadcast algorithm */

    /* If I am the root, then signal ready flag */
    if(input_args->root_flag) {
        ptl_handle_eq_t eq_h;
        ptl_event_t event;
        int ret;

        BASESMUMA_VERBOSE(10,("I am the root of the data"));

        /* create an event queue for the incoming buffer */
        ret = PtlEQAlloc(((mca_bcol_basesmuma_portal_proc_info_t*)
                    cs->portals_info)->ni_h, MAX_PORTAL_EVENTS_IN_Q, PTL_EQ_HANDLER_NONE, &eq_h);

        if (ret != PTL_OK) {
            fprintf(stderr, "PtlEQAlloc() failed: %d \n",ret);
            return OMPI_ERR_OUT_OF_RESOURCE;
        }

        /* Post the message using portal copy */

        mca_bcol_basesmuma_portals_post_msg_nb_nopers(cs, my_lmsg_ctl_pointer, userbuf,
                pack_len, eq_h, my_lmsg_ctl_pointer->nsends);

        /*
         * signal ready flag
         */
        my_ctl_pointer->flag = ready_flag;

        /* wait for a response from the client */
        mca_bcol_basesmuma_portals_wait_event_nopers(eq_h, POST_MSG_EVENT,
                &event, my_lmsg_ctl_pointer->nsends);

        /* free the event queue */
        ret = PtlEQFree(eq_h);
        if (ret != PTL_OK) {
            fprintf(stderr, "PtlEQFree() failed: %d )\n",ret);
        }

        /* root is finished */
        goto Release;
    }

    /* If I am not the root, then poll on possible "senders'" control structs */
    for( i = 0; i < cs->num_to_probe && 0 == matched; i++) {

        /* Shared memory iprobe */
        /*
        BCOL_BASESMUMA_SM_PROBE(bcol_module->src, bcol_module->src_size,
                my_rank, matched, src);
        */
        do {
            int j, n_src, my_index;
            n_src = bcol_module->src_size;

            for( j = 0; j < n_src; j++) {
                parent_ctl_pointer = data_buffs[bcol_module->src[j]].ctl_struct;
                parent_lmsg_ctl_pointer = (mca_bcol_basesmuma_portal_buf_addr_t *)
                    data_buffs[bcol_module->src[j]].payload;
                if (IS_DATA_READY(parent_ctl_pointer,ready_flag,sequence_number)) {

                    src = bcol_module->src[j];
                    matched = 1;
                    break;
                }
            }
        } while(0);

    }

    /* If not matched, then hop out and put me on progress list */
    if(0 == matched ) {
        BASESMUMA_VERBOSE(10,("Shared memory probe didn't find a match"));
        return BCOL_FN_NOT_STARTED;
    }

    /* else, we found our root within the group ... */
    BASESMUMA_VERBOSE(10,("Shared memory probe was matched, the root is %d", src));

    /* receive the data from sender */
    /* get the data buff */
    /* taken care of in the macro */
    /*parent_data_pointer = data_buffs[src].payload;*/
    /* copy the data */
    mca_bcol_basesmuma_portals_get_msg(cs, parent_lmsg_ctl_pointer, userbuf, pack_len);

    /* set the memory barrier to ensure completion */
    MB();
    /* signal that I am done */
    my_ctl_pointer->flag = ready_flag;

    /* am I the last one? If so, release buffer */

Release:
    my_ctl_pointer->starting_flag_value++;

    return BCOL_FN_COMPLETE;
#endif
}

#if 0

#define BASESMUMA_K_NOMIAL_SEND_SIGNAL(radix_mask, radix, my_relative_index,     \
        my_group_index, group_size,sm_data_buffs,sender_ready_flag,              \
        num_pending_sends)                                                       \
{                                                                                \
    int k, rc;                                                                   \
    int dst;                                                                     \
    int comm_dst;                                                                \
    volatile mca_bcol_basesmuma_header_t *recv_ctl_pointer = NULL;               \
    volatile mca_bcol_basesmuma_portal_buf_addr_t *recv_lmsg_ctl_pointer = NULL; \
                                                                                 \
    num_pending_sends = 0;                                                       \
    while(radix_mask > 0) {                                                      \
        /* For each level of tree, do sends */                                   \
        for (k = 1;                                                              \
             k < radix && my_relative_index + radix_mask * k < group_size;       \
             ++k) {                                                              \
                                                                                 \
            dst = my_group_index + radix_mask * k;                               \
            if (dst >= group_size) {                                             \
                dst -= group_size;                                               \
            }                                                                    \
            /* Signal the children to get data */                                \
            recv_ctl_pointer = data_buffs[dst].ctl;                              \
            recv_lmsg_ctl_pointer = (mca_bcol_basesmuma_portal_buf_addr_t *)     \
                data_buffs[dst].payload;                                         \
            recv_lmsg_ctl_pointer->src_index = my_group_index;                   \
            recv_lmsg_ctl_pointer->flag = sender_ready_flag;                     \
            ++num_pending_sends;                                                 \
        }                                                                        \
        radix_mask /= radix;                                                     \
    }                                                                            \
                                                                                 \
}
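The macro encodes a k-nomial fan-out: starting from the largest radix step, each level signals up to radix - 1 children at distance radix_mask * k. A standalone sketch (hypothetical driver) of the destinations it enumerates for the root of a radix-3 group of nine:

#include <stdio.h>

int main(void)
{
    int radix = 3, group_size = 9;
    int my_relative_index = 0, my_group_index = 0;  /* the root */

    /* largest power of radix not exceeding the group size */
    int radix_mask = 1;
    while (radix_mask * radix <= group_size) radix_mask *= radix;

    while (radix_mask > 0) {
        for (int k = 1;
             k < radix && my_relative_index + radix_mask * k < group_size;
             ++k) {
            int dst = (my_group_index + radix_mask * k) % group_size;
            printf("signal child %d (step %d)\n", dst, radix_mask);
        }
        radix_mask /= radix;
    }
    /* prints children 3, 6 (step 3) and 1, 2 (step 1) */
    return 0;
}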


int bcol_basesmuma_lmsg_bcast_k_nomial_anyroot(bcol_function_args_t *input_args,
        coll_ml_function_t *c_input_args)
{
    /* local variables */
    mca_bcol_basesmuma_module_t* bcol_module=
        (mca_bcol_basesmuma_module_t *)c_input_args->bcol_module;
    mca_bcol_basesmuma_component_t *cs = &mca_bcol_basesmuma_component;
    int i, matched = 0;
    int src=-1;
    int group_size;
    int my_rank, first_instance=0, flag_offset;
    int rc = OMPI_SUCCESS;
    int leading_dim, buff_idx, idx;
    int count=input_args->count;
    struct ompi_datatype_t* dtype=input_args->dtype;
    int64_t sequence_number=input_args->sequence_num;

    volatile int64_t ready_flag;
    volatile mca_bcol_basesmuma_payload_t *data_buffs;
    volatile char* parent_data_pointer;
    volatile mca_bcol_basesmuma_header_t *parent_ctl_pointer;
    volatile mca_bcol_basesmuma_header_t *my_ctl_pointer;
    void *userbuf = (void *)((unsigned char *)input_args->userbuf);

    size_t pack_len = 0, dt_size;

    struct mca_bcol_basesmuma_portal_buf_addr_t *my_lmsg_ctl_pointer = NULL;
    struct mca_bcol_basesmuma_portal_buf_addr_t *parent_lmsg_ctl_pointer = NULL;
    mca_bcol_basesmuma_portal_proc_info_t *portals_info;
    portals_info = (mca_bcol_basesmuma_portal_proc_info_t*)cs->portals_info;

    /* we will work only on packed data - so compute the length*/
    ompi_datatype_type_size(dtype, &dt_size);
    pack_len=count*dt_size;
    buff_idx = input_args->src_desc->buffer_index;

    /* Get addressing information */
    my_rank = bcol_module->super.sbgp_partner_module->my_index;
    group_size = bcol_module->colls_no_user_data.size_of_group;
    leading_dim=bcol_module->colls_no_user_data.size_of_group;
    idx=SM_ARRAY_INDEX(leading_dim,buff_idx,0);

    data_buffs=(volatile mca_bcol_basesmuma_payload_t *)
        bcol_module->colls_with_user_data.data_buffs+idx;

    /* Set pointer to current proc ctrl region */
    my_ctl_pointer = data_buffs[my_rank].ctl_struct;
    my_lmsg_ctl_pointer = (mca_bcol_basesmuma_portal_buf_addr_t*) data_buffs[my_rank].payload;

    /* setup resource recycling */
    if( my_ctl_pointer->sequence_number < sequence_number ) {
        first_instance=1;
    }

    if( first_instance ) {
        /* Signal arrival */
        my_ctl_pointer->flag = -1;
        my_ctl_pointer->index=1;
        /* this does not need to use any flag values , so only need to
         * set the value for subsequent values that may need this */
        my_ctl_pointer->starting_flag_value=0;
        flag_offset=0;

    } else {
        /* only one thread at a time will be making progress on this
         * collective, so no need to make this atomic */
        my_ctl_pointer->index++;
    }


    /* increment the starting flag by one and return */
    flag_offset = my_ctl_pointer->starting_flag_value;
    ready_flag = flag_offset + sequence_number + 1;
    my_ctl_pointer->sequence_number = sequence_number;


    /* Construct my portal buffer address and copy to payload buffer */
    mca_bcol_basesmuma_construct_portal_address(my_lmsg_ctl_pointer,
            portals_info->portal_id.nid,
            portals_info->portal_id.pid,
            sequence_number,
            bcol_module->super.sbgp_partner_module->group_comm->c_contextid);

    my_lmsg_ctl_pointer->userbuf = userbuf;
    my_lmsg_ctl_pointer->userbuf_length = fragment_length;
    /* create an event queue */
    ret = PtlEQAlloc(((mca_bcol_basesmuma_portal_proc_info_t*)
                cs->portals_info)->ni_h, MAX_PORTAL_EVENTS_IN_Q, PTL_EQ_HANDLER_NONE, &eq_h);

    /* non-blocking broadcast algorithm */

    /* If I am the root, then signal ready flag */
    if(input_args->root_flag) {
        ptl_handle_eq_t eq_h;
        ptl_event_t event;
        int ret;
        int root_radix_mask = sm_module->pow_knum;

        BASESMUMA_VERBOSE(10,("I am the root of the data"));

        if (ret != PTL_OK) {
            fprintf(stderr, "PtlEQAlloc() failed: %d \n",ret);
            return OMPI_ERR_OUT_OF_RESOURCE;
        }

        BASESMUMA_K_NOMIAL_SEND_SIGNAL(root_radix_mask, radix, 0,
                my_rank, group_size, data_buffs, ready_flag, nsends) ;

        mca_bcol_basesmuma_portals_post_msg_nb_nopers(cs, my_lmsg_ctl_pointer, userbuf,
                pack_len, eq_h, nsends);

        /* wait for a response from the client */
        mca_bcol_basesmuma_portals_wait_event_nopers(eq_h, POST_MSG_EVENT,
                &event, nsends);

        /* root is finished */
        goto Release;
    }

    /* I'm not a root so wait until someone puts data and
     * compute where to get data from */

    while (my_ctl_pointer->flag != ready_flag) ;

    my_data_source_index = lmsg_ctl_pointer->src_index;

    parent_lmsg_ctl_pointer = (mca_bcol_basesmuma_portal_buf_addr_t *)
        data_buffs[my_data_source_index].payload;

    mca_bcol_basesmuma_portals_get_msg(cs, parent_lmsg_ctl_pointer, userbuf, pack_len);


    /* I am done getting data, should I send the data to someone */

    my_relative_index = (my_rank - my_data_source_index) < 0 ? my_rank -
        my_data_source_index + group_size : my_rank - my_data_source_index;

    /*
     * 2. Locate myself in the tree:
     * calculate the number of radix steps that we should take
     */
    radix_mask = 1;
    while (radix_mask < group_size) {
        if (0 != my_relative_index % (radix * radix_mask)) {
            /* I found my level in tree */
            break;
        }
        radix_mask *= radix;
|
||||
}
|
||||
|
||||
/* go one step back */
|
||||
radix_mask /=radix;
|
||||
|
||||
BASESMUMA_K_NOMIAL_SEND_SIGNAL(radix_mask, radix, my_relative_index,
|
||||
my_rank, group_size,data_buffs,ready_flag,nsends)
|
||||
|
||||
mca_bcol_basesmuma_portals_post_msg_nb_nopers(cs, my_lmsg_ctl_pointer, userbuf,
|
||||
pack_len, eq_h, nsends);
|
||||
|
||||
/* wait for childrens to read */
|
||||
mca_bcol_basesmuma_portals_wait_event_nopers(eq_h, POST_MSG_EVENT,
|
||||
&event, nsends);
|
||||
|
||||
|
||||
|
||||
Release:
|
||||
/* free the event queue */
|
||||
ret = PtlEQFree(eq_h);
|
||||
if (ret != PTL_OK) {
|
||||
fprintf(stderr, "PtlEQFree() failed: %d )\n",ret);
|
||||
}
|
||||
|
||||
|
||||
my_ctl_pointer->starting_flag_value++;
|
||||
|
||||
return BCOL_FN_COMPLETE;
|
||||
}
|
||||
|
||||
#endif
|
||||
#endif
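The level-location loop in the anyroot broadcast above is the heart of the k-nomial schedule: a rank's distance from its data source determines the largest power of the radix that divides its relative index, and that power fixes which children it must signal. A minimal standalone sketch of just that calculation (the helper name and the driver are hypothetical, not part of this commit):

#include <stdio.h>

/* Return the starting radix mask (a power of `radix`) at which
 * `relative_index` sits in a k-nomial tree of `group_size` ranks;
 * 0 means the rank is a leaf with nobody left to signal. */
static int knomial_start_mask(int relative_index, int radix, int group_size)
{
    int radix_mask = 1;
    while (radix_mask < group_size) {
        if (0 != relative_index % (radix * radix_mask)) {
            break; /* found my level in the tree */
        }
        radix_mask *= radix;
    }
    return radix_mask / radix; /* go one step back, as the bcast does */
}

int main(void)
{
    /* radix 2, 8 ranks: relative rank 4 starts at mask 2, so it signals
     * ranks 4+2=6 and 4+1=5; relative rank 0 (the data source) starts
     * at mask 4 and covers masks 4, 2 and 1. Prints "2 4". */
    printf("%d %d\n", knomial_start_mask(4, 2, 8), knomial_start_mask(0, 2, 8));
    return 0;
}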
(Diff for one file not shown because of its size.)
218 ompi/mca/bcol/basesmuma/bcol_basesmuma_rd_barrier.c (new file)
@@ -0,0 +1,218 @@
/*
 * Copyright (c) 2009-2013 Oak Ridge National Laboratory. All rights reserved.
 * Copyright (c) 2009-2012 Mellanox Technologies. All rights reserved.
 * Copyright (c) 2013      Los Alamos National Security, LLC. All rights
 *                         reserved.
 * $COPYRIGHT$
 *
 * Additional copyrights may follow
 *
 * $HEADER$
 */

/* Recursive doubling blocking barrier */

#include "ompi_config.h"
#include "ompi/constants.h"
#include "ompi/communicator/communicator.h"
#include "ompi/mca/bcol/bcol.h"
#include "ompi/patterns/net/netpatterns.h"

#include "opal/sys/atomic.h"

#include "bcol_basesmuma.h"

#if 0
int bcol_basesmuma_recursive_double_barrier(bcol_function_args_t *input_args,
        coll_ml_function_t *c_input_args)
{

    /* local variables */
    int ret = OMPI_SUCCESS, idx, leading_dim, loop_cnt, exchange, flag_to_set;
    int pair_rank, flag_offset;
    mca_bcol_basesmuma_ctl_struct_t **ctl_structs;
    netpatterns_pair_exchange_node_t *my_exchange_node;
    int extra_rank, my_rank, pow_2;
    volatile mca_bcol_basesmuma_ctl_struct_t *partner_ctl;
    volatile mca_bcol_basesmuma_ctl_struct_t *my_ctl;
    int64_t sequence_number;
    bool found;
    int buff_index, first_instance = 0;
    mca_bcol_basesmuma_module_t *bcol_module =
        (mca_bcol_basesmuma_module_t *) c_input_args->bcol_module;
#if 0
    fprintf(stderr, "Entering the sm rd barrier\n");
    fflush(stderr);
#endif

    /* get the pointer to the segment of control structures */
    my_exchange_node = &(bcol_module->recursive_doubling_tree);
    my_rank = bcol_module->super.sbgp_partner_module->my_index;
    pow_2 = bcol_module->super.sbgp_partner_module->pow_2;

    /* figure out what instance of the basesmuma bcol I am */
    leading_dim = bcol_module->colls_no_user_data.size_of_group;
    sequence_number = input_args->sequence_num - c_input_args->bcol_module->squence_number_offset;

    buff_index = sequence_number & (bcol_module->colls_no_user_data.mask);

    idx = SM_ARRAY_INDEX(leading_dim, buff_index, 0);
    ctl_structs = (mca_bcol_basesmuma_ctl_struct_t **)
        bcol_module->colls_no_user_data.ctl_buffs + idx;
    my_ctl = ctl_structs[my_rank];
    if (my_ctl->sequence_number < sequence_number) {
        first_instance = 1;
    }

    /* get the pool index */
    if (first_instance) {
        idx = -1;
        while (idx == -1) {
            idx = bcol_basesmuma_get_buff_index(
                    &(bcol_module->colls_no_user_data), sequence_number);
        }
        if (-1 == idx) {
            return ORTE_ERR_TEMP_OUT_OF_RESOURCE;
        }
        my_ctl->index = 1;
        /* this does not need to use any flag values, so only need to
         * set the value for subsequent values that may need this */
        my_ctl->starting_flag_value = 0;
        flag_offset = 0;
    } else {
        /* only one thread at a time will be making progress on this
         * collective, so no need to make this atomic */
        my_ctl->index++;
        flag_offset = my_ctl->starting_flag_value;
    }

    /* signal that I have arrived */
    my_ctl->flag = -1;
    /* don't need to set this flag anymore */
    my_ctl->sequence_number = sequence_number;
    /* MB(); */

    if (0 < my_exchange_node->n_extra_sources) {
        if (EXCHANGE_NODE == my_exchange_node->node_type) {
            volatile int64_t *partner_sn;
            int cnt = 0;

            /* I will participate in the exchange - wait for the signal from
             ** the extra process */
            extra_rank = my_exchange_node->rank_extra_source;
            partner_ctl = (volatile mca_bcol_basesmuma_ctl_struct_t *) ctl_structs[extra_rank];

            /* partner_ctl = ctl_structs[extra_rank]; */
            partner_sn = (volatile int64_t *) &(partner_ctl->sequence_number);

            /* spin n iterations until the partner registers */
            loop_cnt = 0;
            found = false;
            while (!found) {
                if (*partner_sn >= sequence_number) {
                    found = true;
                }
                cnt++;
                if (cnt == 1000) {
                    opal_progress();
                    cnt = 0;
                }
            }
        } else {
            /* Nothing to do, I have already registered that I am here */
        }
    }

    for (exchange = 0; exchange < my_exchange_node->n_exchanges; exchange++) {

        volatile int64_t *partner_sn;
        volatile int *partner_flag;
        int cnt = 0;

        /* rank of the exchange partner */
        pair_rank = my_rank ^ (1 << exchange);
        partner_ctl = ctl_structs[pair_rank];
        partner_sn = (volatile int64_t *) &(partner_ctl->sequence_number);
        partner_flag = (volatile int *) &(partner_ctl->flag);

        /* signal that I am at iteration "exchange" of the algorithm */
        flag_to_set = flag_offset + exchange;
        my_ctl->flag = flag_to_set;

        /* check to see if the partner has arrived */

        /* spin n iterations until the partner registers */
        found = false;
        while (!found) {
            if ((*partner_sn > sequence_number) ||
                (*partner_sn == sequence_number &&
                 *partner_flag >= flag_to_set)) {
                found = true;
            } else {
                cnt++;
                if (cnt == 1000) {
                    opal_progress();
                    cnt = 0;
                }
            }
        }
    }

    if (0 < my_exchange_node->n_extra_sources) {
        if (EXTRA_NODE == my_exchange_node->node_type) {
            int cnt = 0;

            /* I will not participate in the exchange -
             * wait for the signal from the extra partner */
            extra_rank = my_exchange_node->rank_extra_source;
            partner_ctl = ctl_structs[extra_rank];
            flag_to_set = flag_offset + my_exchange_node->log_2;

            /* spin n iterations until the partner registers */
            found = false;
            while (!found) {
                if (IS_PEER_READY(partner_ctl, flag_to_set, sequence_number)) {
                    found = true;
                } else {
                    cnt++;
                    if (cnt == 1000) {
                        opal_progress();
                        cnt = 0;
                    }
                }
            }
        } else {
            /* signal the extra rank that I am done with the recursive
             * doubling phase.
             */
            flag_to_set = flag_offset + my_exchange_node->log_2;
            my_ctl->flag = flag_to_set;
        }
    }

    /* if I am the last instance of a basesmuma function in this collective,
     * release the resources */
    if (IS_LAST_BCOL_FUNC(c_input_args)) {
        idx = bcol_basesmuma_free_buff(
                &(bcol_module->colls_no_user_data),
                sequence_number);
    } else {
        /* increment the flag value - so the next sm collective in the
         * hierarchy will not collide with the current one, as they share
         * the control structure */
        my_ctl->starting_flag_value += (my_exchange_node->log_2 + 1);
    }

    /* return */
    return ret;
}
#endif
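The pairing rule in the exchange loop above is what makes recursive doubling work: partners differ in exactly one bit per round, so after log2(n) rounds every rank has (transitively) synchronized with every other. A small standalone illustration of the schedule (assumed group size of 8, no Open MPI types involved):

#include <stdio.h>

int main(void)
{
    const int n = 8; /* must be a power of two, as with pow_2 above */
    for (int exchange = 0; (1 << exchange) < n; ++exchange) {
        printf("exchange %d:", exchange);
        for (int rank = 0; rank < n; ++rank) {
            /* same rule as pair_rank = my_rank ^ (1 << exchange) */
            printf(" %d<->%d", rank, rank ^ (1 << exchange));
        }
        printf("\n");
    }
    return 0;
}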
@@ -1,6 +1,6 @@
/*
 * Copyright (c) 2009-2012 UT-Battelle, LLC. All rights reserved.
 * Copyright (c) 2009-2012 Mellanox Technologies. All rights reserved.
 * Copyright (c) 2013      Cisco Systems, Inc. All rights reserved.
 * $COPYRIGHT$
 *
387 ompi/mca/bcol/basesmuma/bcol_basesmuma_reduce.c (new file)
@@ -0,0 +1,387 @@
/* -*- Mode: C; c-basic-offset:4 ; indent-tabs-mode:nil -*- */
/*
 * Copyright (c) 2009-2013 Oak Ridge National Laboratory. All rights reserved.
 * Copyright (c) 2009-2012 Mellanox Technologies. All rights reserved.
 * Copyright (c) 2013      Los Alamos National Security, LLC. All rights
 *                         reserved.
 * $COPYRIGHT$
 *
 * Additional copyrights may follow
 *
 * $HEADER$
 */

#include "ompi_config.h"

#include "ompi/constants.h"
#include "ompi/op/op.h"
#include "ompi/datatype/ompi_datatype.h"
#include "ompi/communicator/communicator.h"
#include "ompi/mca/bcol/base/base.h"
#include "ompi/mca/bcol/bcol.h"

#include "opal/include/opal_stdint.h"

#include "bcol_basesmuma.h"
#include "bcol_basesmuma_reduce.h"
/**
 * gvm - Shared memory reduce
 */

static int bcol_basesmuma_reduce_intra_fanin_progress(bcol_function_args_t *input_args,
        coll_ml_function_t *c_input_args);

int bcol_basesmuma_reduce_init(mca_bcol_base_module_t *super)
{
    mca_bcol_base_coll_fn_comm_attributes_t comm_attribs;
    mca_bcol_base_coll_fn_invoke_attributes_t inv_attribs;

    mca_bcol_basesmuma_module_t *basesmuma_module =
        (mca_bcol_basesmuma_module_t *) super;

    int group_size = basesmuma_module->colls_no_user_data.size_of_group;

    comm_attribs.bcoll_type = BCOL_REDUCE;
    comm_attribs.comm_size_min = 0;
    comm_attribs.comm_size_max = 16;
    comm_attribs.data_src = DATA_SRC_KNOWN;
    comm_attribs.waiting_semantics = NON_BLOCKING;

    inv_attribs.bcol_msg_min = 0;
    inv_attribs.bcol_msg_max = 20000;
    inv_attribs.datatype_bitmap = 0x11111111;
    inv_attribs.op_types_bitmap = 0x11111111;

    /* Set attributes for the fanin fanout algorithm */
    mca_bcol_base_set_attributes(super, &comm_attribs, &inv_attribs, bcol_basesmuma_reduce_intra_fanin,
            bcol_basesmuma_reduce_intra_fanin_progress);

    inv_attribs.bcol_msg_min = 10000000;
    inv_attribs.bcol_msg_max = 10485760; /* range 4 */

    mca_bcol_base_set_attributes(super, &comm_attribs, &inv_attribs, NULL, NULL);

    return OMPI_SUCCESS;
}

/*
 * Small data fanin reduce
 * ML buffers are used for both the payload and the control structures.
 * This function works with the hierarchical allreduce and the
 * progress engine.
 */
static inline int reduce_children (mca_bcol_basesmuma_module_t *bcol_module, volatile void *rbuf, netpatterns_tree_node_t *my_reduction_node,
        int *iteration, volatile mca_bcol_basesmuma_header_t *my_ctl_pointer, ompi_datatype_t *dtype,
        volatile mca_bcol_basesmuma_payload_t *data_buffs, int count, struct ompi_op_t *op, int process_shift) {
    volatile mca_bcol_basesmuma_header_t *child_ctl_pointer;
    int bcol_id = (int) bcol_module->super.bcol_id;
    int64_t sequence_number = my_ctl_pointer->sequence_number;
    int8_t ready_flag = my_ctl_pointer->ready_flag;
    int group_size = bcol_module->colls_no_user_data.size_of_group;

    if (LEAF_NODE != my_reduction_node->my_node_type) {
        volatile char *child_data_pointer;
        volatile void *child_rbuf;

        /* for each child */
        /* my_result_data = child_result_data (op) my_source_data */

        for (int child = *iteration ; child < my_reduction_node->n_children ; ++child) {
            int child_rank = my_reduction_node->children_ranks[child] + process_shift;

            if (group_size <= child_rank) {
                child_rank -= group_size;
            }

            child_ctl_pointer = data_buffs[child_rank].ctl_struct;
            child_data_pointer = data_buffs[child_rank].payload;

            if (!IS_PEER_READY(child_ctl_pointer, ready_flag, sequence_number, REDUCE_FLAG, bcol_id)) {
                *iteration = child;
                return BCOL_FN_STARTED;
            }

            child_rbuf = child_data_pointer + child_ctl_pointer->roffsets[bcol_id];

            ompi_op_reduce(op, (void *) child_rbuf, (void *) rbuf, count, dtype);
        } /* end child loop */
    }

    if (ROOT_NODE != my_reduction_node->my_node_type) {
        opal_atomic_wmb ();
        my_ctl_pointer->flags[REDUCE_FLAG][bcol_id] = ready_flag;
    }

    return BCOL_FN_COMPLETE;
}

static int bcol_basesmuma_reduce_intra_fanin_progress(bcol_function_args_t *input_args,
        coll_ml_function_t *c_input_args)
{
    mca_bcol_basesmuma_module_t *bcol_module =
        (mca_bcol_basesmuma_module_t *) c_input_args->bcol_module;

    netpatterns_tree_node_t *my_reduction_node;
    int my_rank, my_node_index;
    struct ompi_datatype_t *dtype = input_args->dtype;
    int leading_dim, idx;

    /* Buffer index */
    int buff_idx = input_args->src_desc->buffer_index;

    int *iteration = &bcol_module->ml_mem.nb_coll_desc[buff_idx].iteration;

    volatile mca_bcol_basesmuma_payload_t *data_buffs;
    volatile mca_bcol_basesmuma_header_t *my_ctl_pointer;
    void *data_addr = (void *) input_args->src_desc->data_addr;
    volatile void *rbuf;

    /* get addressing information */
    my_rank = bcol_module->super.sbgp_partner_module->my_index;
    leading_dim = bcol_module->colls_no_user_data.size_of_group;
    idx = SM_ARRAY_INDEX(leading_dim, buff_idx, 0);

    data_buffs = (volatile mca_bcol_basesmuma_payload_t *)
        bcol_module->colls_with_user_data.data_buffs + idx;

    /* Get the control structure and payload buffer */
    my_ctl_pointer = data_buffs[my_rank].ctl_struct;

    my_node_index = my_rank - input_args->root;
    if (0 > my_node_index) {
        int group_size = bcol_module->colls_no_user_data.size_of_group;
        my_node_index += group_size;
    }

    my_reduction_node = bcol_module->reduction_tree + my_node_index;
    rbuf = (volatile void *)((uintptr_t) data_addr + input_args->rbuf_offset);

    return reduce_children (bcol_module, rbuf, my_reduction_node, iteration, my_ctl_pointer, dtype,
            data_buffs, input_args->count, input_args->op, input_args->root);
}

int bcol_basesmuma_reduce_intra_fanin(bcol_function_args_t *input_args,
        coll_ml_function_t *c_input_args)
{
    /* local variables */
    int rc = BCOL_FN_COMPLETE;
    int my_rank, group_size, my_node_index;
    mca_bcol_basesmuma_module_t *bcol_module =
        (mca_bcol_basesmuma_module_t *) c_input_args->bcol_module;

    netpatterns_tree_node_t *my_reduction_node;
    volatile int8_t ready_flag;
    int bcol_id = (int) bcol_module->super.bcol_id;
    volatile void *sbuf, *rbuf;
    int sbuf_offset, rbuf_offset;
    int root, count;
    int64_t sequence_number = input_args->sequence_num;
    struct ompi_datatype_t *dtype;
    int leading_dim, idx;

    /* Buffer index */
    int buff_idx = input_args->src_desc->buffer_index;

    int *iteration = &bcol_module->ml_mem.nb_coll_desc[buff_idx].iteration;

    volatile mca_bcol_basesmuma_payload_t *data_buffs;
    volatile char *my_data_pointer;
    volatile mca_bcol_basesmuma_header_t *my_ctl_pointer;
    void *data_addr = (void *) input_args->src_desc->data_addr;

#if 0
    fprintf(stderr, "777 entering sm reduce \n");
#endif

    /* get addressing information */
    my_rank = bcol_module->super.sbgp_partner_module->my_index;
    group_size = bcol_module->colls_no_user_data.size_of_group;
    leading_dim = bcol_module->colls_no_user_data.size_of_group;
    idx = SM_ARRAY_INDEX(leading_dim, buff_idx, 0);

    data_buffs = (volatile mca_bcol_basesmuma_payload_t *)
        bcol_module->colls_with_user_data.data_buffs + idx;
    /* fprintf(stderr,"AAA the devil!!\n"); */
    /* Get the control structure and payload buffer */
    my_ctl_pointer = data_buffs[my_rank].ctl_struct;
    my_data_pointer = (volatile char *) data_addr;

    /* Align the node index around the sbgp root */
    root = input_args->root;
    my_node_index = my_rank - root;
    if (0 > my_node_index) {
        my_node_index += group_size;
    }

    /* get the arguments */
    sbuf_offset = input_args->sbuf_offset;
    rbuf_offset = input_args->rbuf_offset;
    sbuf = (volatile void *)(my_data_pointer + sbuf_offset);
    data_buffs[my_rank].payload = (void *) sbuf;
    rbuf = (volatile void *)(my_data_pointer + rbuf_offset);
    count = input_args->count;
    dtype = input_args->dtype;

    /* Cache my rbuf_offset */
    my_ctl_pointer->roffsets[bcol_id] = rbuf_offset;

    /* get my node in the reduction tree */
    my_reduction_node = &(bcol_module->reduction_tree[my_node_index]);

    /* init the header */
    BASESMUMA_HEADER_INIT(my_ctl_pointer, ready_flag, sequence_number, bcol_id);

    input_args->result_in_rbuf = (ROOT_NODE == my_reduction_node->my_node_type);

    /* set the starting point for the progress loop */
    *iteration = 0;
    my_ctl_pointer->ready_flag = ready_flag;

    if (sbuf != rbuf) {
        rc = ompi_datatype_copy_content_same_ddt(dtype, count, (char *) rbuf,
                (char *) sbuf);
        if (0 != rc) {
            return OMPI_ERROR;
        }
    }

    rc = reduce_children (bcol_module, rbuf, my_reduction_node, iteration, my_ctl_pointer, dtype,
            data_buffs, count, input_args->op, root);

    /* Flag value if other bcols are called */
    my_ctl_pointer->starting_flag_value[bcol_id]++;

    /* Recycle the payload buffers */

    return rc;
}

/* Small data fanin reduce
 * Uses an SM buffer (backed by an SM file) for both the control
 * structures and the payload.
 *
 * NTH: How does this differ from the new one? Can we replace this
 * with a call to the new init then a call to the new progress until
 * complete?
 */
int bcol_basesmuma_reduce_intra_fanin_old(bcol_function_args_t *input_args,
        coll_ml_function_t *c_input_args)
{
    /* local variables */
    int rc = OMPI_SUCCESS;
    int my_rank, group_size, process_shift, my_node_index;
    int n_children, child;
    mca_bcol_basesmuma_module_t *bcol_module =
        (mca_bcol_basesmuma_module_t *) c_input_args->bcol_module;

    netpatterns_tree_node_t *my_reduction_node;
    volatile int8_t ready_flag;
    volatile void *sbuf, *rbuf;
    int sbuf_offset, rbuf_offset;
    int root, count;
    struct ompi_op_t *op;
    int64_t sequence_number = input_args->sequence_num;
    struct ompi_datatype_t *dtype;
    int leading_dim, idx;
    int buff_idx;
    int child_rank;
    int bcol_id = (int) bcol_module->super.bcol_id;

    volatile mca_bcol_basesmuma_payload_t *data_buffs;
    volatile char *my_data_pointer;
    volatile char *child_data_pointer;
    volatile mca_bcol_basesmuma_header_t *my_ctl_pointer;
    volatile mca_bcol_basesmuma_header_t *child_ctl_pointer;

#if 0
    fprintf(stderr, "Entering fanin reduce \n");
#endif

    /* Buffer index */
    buff_idx = input_args->src_desc->buffer_index;
    /* get addressing information */
    my_rank = bcol_module->super.sbgp_partner_module->my_index;
    group_size = bcol_module->colls_no_user_data.size_of_group;
    leading_dim = bcol_module->colls_no_user_data.size_of_group;
    idx = SM_ARRAY_INDEX(leading_dim, buff_idx, 0);

    /* ctl_structs = (mca_bcol_basesmuma_ctl_struct_t **)
       bcol_module->colls_with_user_data.ctl_buffs + idx; */
    data_buffs = (volatile mca_bcol_basesmuma_payload_t *)
        bcol_module->colls_with_user_data.data_buffs + idx;

    /* Get the control structure and payload buffer */
    my_ctl_pointer = data_buffs[my_rank].ctl_struct;
    my_data_pointer = (volatile char *) data_buffs[my_rank].payload;

    /* Align the node index around the sbgp root */
    root = input_args->root;
    process_shift = root;
    my_node_index = my_rank - root;
    if (0 > my_node_index) {
        my_node_index += group_size;
    }

    /* get the arguments */
    sbuf_offset = input_args->sbuf_offset;
    rbuf_offset = input_args->rbuf_offset;
    sbuf = (volatile void *)(my_data_pointer + sbuf_offset);
    rbuf = (volatile void *)(my_data_pointer + rbuf_offset);
    op = input_args->op;
    count = input_args->count;
    dtype = input_args->dtype;

    /* get my node in the reduction tree */
    my_reduction_node = &(bcol_module->reduction_tree[my_node_index]);
    n_children = my_reduction_node->n_children;

    /* init the header */
    BASESMUMA_HEADER_INIT(my_ctl_pointer, ready_flag, sequence_number, bcol_id);

    input_args->result_in_rbuf = (ROOT_NODE == my_reduction_node->my_node_type);

    rc = ompi_datatype_copy_content_same_ddt(dtype, count, (char *) rbuf,
            (char *) sbuf);
    if (0 != rc) {
        return OMPI_ERROR;
    }

    if (LEAF_NODE != my_reduction_node->my_node_type) {
        volatile void *child_rbuf;
        /* for each child */
        /* my_result_data = child_result_data (op) my_source_data */

        for (child = 0 ; child < n_children ; ++child) {
            child_rank = my_reduction_node->children_ranks[child];
            child_rank += process_shift;

            /* wrap around */
            if (group_size <= child_rank) {
                child_rank -= group_size;
            }

            /* child_ctl_pointer = ctl_structs[child_rank]; */
            child_ctl_pointer = data_buffs[child_rank].ctl_struct;
            child_data_pointer = data_buffs[child_rank].payload;

            child_rbuf = child_data_pointer + rbuf_offset;
            /* wait until the child's data is ready for use */
            while (!IS_PEER_READY(child_ctl_pointer, ready_flag, sequence_number, REDUCE_FLAG, bcol_id)) {
                opal_progress();
            }

            /* apply the collective operation */
            ompi_op_reduce(op, (void *) child_rbuf, (void *) rbuf, count, dtype);
        } /* end child loop */
    }

    if (ROOT_NODE != my_reduction_node->my_node_type) {
        opal_atomic_wmb ();
        my_ctl_pointer->flags[REDUCE_FLAG][bcol_id] = ready_flag;
    }

    my_ctl_pointer->starting_flag_value[bcol_id]++;

    return rc;
}
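All three fanin variants above rotate the reduction tree around the sbgp root instead of rebuilding it: node index = (my_rank - root) mod group_size, so the root always lands on tree node 0. A compact standalone check of that remapping (the helper name and the values are made up for illustration):

#include <stdio.h>

/* Same alignment the fanin code performs before walking the tree. */
static int node_index_for(int my_rank, int root, int group_size)
{
    int my_node_index = my_rank - root;
    if (0 > my_node_index) {
        my_node_index += group_size;
    }
    return my_node_index;
}

int main(void)
{
    /* group of 4 with root 2: ranks 2,3,0,1 map to nodes 0,1,2,3 */
    for (int r = 0; r < 4; ++r) {
        printf("rank %d -> node %d\n", r, node_index_for(r, 2, 4));
    }
    return 0;
}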
92 ompi/mca/bcol/basesmuma/bcol_basesmuma_reduce.h (new file)
@@ -0,0 +1,92 @@
#ifndef __BASESMUMA_REDUCE_H_
#define __BASESMUMA_REDUCE_H_

#include "ompi_config.h"
#include "ompi/mca/bcol/basesmuma/bcol_basesmuma.h"
#include "ompi/constants.h"
#include "ompi/datatype/ompi_datatype.h"
#include "ompi/communicator/communicator.h"
#include "bcol_basesmuma_utils.h"
#include <unistd.h>

enum {
    BLOCK_OFFSET = 0,
    LOCAL_REDUCE_SEG_OFFSET,
    BLOCK_COUNT,
    SEG_SIZE,
    NOFFSETS
};

int compute_knomial_reduce_offsets(int group_index, int count, struct
        ompi_datatype_t *dtype, int k_radix, int n_exchanges,
        int **offsets);

int compute_knomial_reduce_offsets_reverse(int group_index, int count, struct
        ompi_datatype_t *dtype, int k_radix, int n_exchanges,
        int **offsets);

int bcol_basesmuma_lmsg_reduce_recursivek_scatter_reduce(mca_bcol_basesmuma_module_t *sm_module,
        const int buffer_index, void *sbuf,
        void *rbuf,
        struct ompi_op_t *op,
        const int count, struct ompi_datatype_t *dtype,
        const int relative_group_index,
        const int padded_start_byte,
        volatile int8_t ready_flag,
        volatile mca_bcol_basesmuma_payload_t *data_buffs);

int bcol_basesmuma_lmsg_reduce_knomial_gather(mca_bcol_basesmuma_module_t *basesmuma_module,
        const int buffer_index,
        void *sbuf, void *rbuf, int count, struct
        ompi_datatype_t *dtype,
        const int my_group_index,
        const int padded_start_byte,
        volatile int8_t rflag,
        volatile mca_bcol_basesmuma_payload_t *data_buffs);

int bcol_basesmuma_lmsg_reduce_extra_root(mca_bcol_basesmuma_module_t *sm_module,
        const int buffer_index, void *sbuf,
        void *rbuf,
        struct ompi_op_t *op,
        const int count, struct ompi_datatype_t *dtype,
        const int relative_group_index,
        const int padded_start_byte,
        volatile int8_t rflag,
        volatile mca_bcol_basesmuma_payload_t *data_buffs);

int bcol_basesmuma_lmsg_reduce_extra_non_root(mca_bcol_basesmuma_module_t *sm_module,
        const int buffer_index, void *sbuf,
        void *rbuf,
        int root,
        struct ompi_op_t *op,
        const int count, struct ompi_datatype_t *dtype,
        const int relative_group_index,
        const int group_size,
        const int padded_start_byte,
        volatile int8_t rflag,
        volatile mca_bcol_basesmuma_payload_t *data_buffs);

int bcol_basesmuma_lmsg_reduce(bcol_function_args_t *input_args,
        coll_ml_function_t *c_input_args);

int bcol_basesmuma_lmsg_reduce_extra(bcol_function_args_t *input_args,
        coll_ml_function_t *c_input_args);

void basesmuma_reduce_recv(int my_group_index, int peer,
        void *recv_buffer,
        int recv_size,
        volatile int8_t ready_flag_val,
        volatile mca_bcol_basesmuma_payload_t *data_buffs);

void basesmuma_reduce_send(int my_group_index,
        int peer,
        void *send_buffer,
        int snd_size,
        int send_offset,
        volatile int8_t ready_flag_val,
        volatile mca_bcol_basesmuma_payload_t *data_buffs);

#endif
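The large-message entry points above take a radix and an exchange count as separate parameters. Assuming the usual k-nomial convention (an inference, not something this header states), n_exchanges is the smallest e with k_radix^e >= group_size, which a caller could derive like this:

#include <stdio.h>

/* Hypothetical helper: rounds of a k-nomial schedule covering the group. */
static int knomial_exchange_count(int k_radix, int group_size)
{
    int n_exchanges = 0;
    for (int span = 1; span < group_size; span *= k_radix) {
        ++n_exchanges;
    }
    return n_exchanges;
}

int main(void)
{
    printf("%d\n", knomial_exchange_count(3, 9));  /* 2: 3^2 = 9 */
    printf("%d\n", knomial_exchange_count(2, 10)); /* 4: 2^4 >= 10 */
    return 0;
}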
@@ -1,7 +1,8 @@
/* -*- Mode: C; c-basic-offset:4 ; indent-tabs-mode:nil -*- */
/*
 * Copyright (c) 2009-2012 Oak Ridge National Laboratory. All rights reserved.
 * Copyright (c) 2009-2012 Mellanox Technologies. All rights reserved.
 * Copyright (c) 2013-2014 Los Alamos National Security, LLC.
 *                         All rights reserved.
 * Copyright (c) 2014      Cisco Systems, Inc. All rights reserved.
 * $COPYRIGHT$
@@ -31,14 +32,14 @@
#include "bcol_basesmuma.h"

int base_bcol_basesmuma_setup_ctl_struct(
        mca_bcol_basesmuma_module_t *sm_bcol_module,
        mca_bcol_basesmuma_component_t *cs,
        sm_buffer_mgmt *ctl_mgmt);

/* this is the new one, uses the pml allgather */
int base_bcol_basesmuma_exchange_offsets(
        mca_bcol_basesmuma_module_t *sm_bcol_module,
        void **result_array, uint64_t mem_offset, int loop_limit,
        int leading_dim)
{
    int ret = OMPI_SUCCESS, i;
@@ -53,7 +54,7 @@ int base_bcol_basesmuma_exchange_offsets(
    send_buff = (char *) malloc(count);
    recv_buff = (char *) malloc(count *
            sm_bcol_module->super.sbgp_partner_module->group_size);
    /* exchange the base pointer for the control structures - gather
     * everyone else's information.
     */

@@ -65,7 +66,7 @@ int base_bcol_basesmuma_exchange_offsets(
    /* get the offsets from all procs, so we can set up the control data
     * structures.
     */

    ret = comm_allgather_pml((void *) send_buff, (void *) recv_buff, count,
            MPI_BYTE,
            sm_bcol_module->super.sbgp_partner_module->my_index,
@@ -113,8 +114,8 @@ exit_ERROR:

#if 0
int base_bcol_basesmuma_exchange_offsets(
        mca_bcol_basesmuma_module_t *sm_bcol_module,
        void **result_array, uint64_t mem_offset, int loop_limit,
        int leading_dim)
{
    int ret = OMPI_SUCCESS, i, dummy;
@@ -126,7 +127,7 @@ int base_bcol_basesmuma_exchange_offsets(
    opal_buffer_t *recv_buffer = OBJ_NEW(opal_buffer_t);
    uint64_t rem_mem_offset;

    /* exchange the base pointer for the control structures - gather
     * everyone else's information.
     */
    /* get the list of procs that will participate in the communication */
@@ -157,8 +158,8 @@ int base_bcol_basesmuma_exchange_offsets(
        &(sm_bcol_module->super.sbgp_partner_module->my_index), 1, OPAL_UINT32);

    if (OMPI_SUCCESS != ret) {
        fprintf(stderr, "Error packing my_index!!\n");
        fflush(stderr);
        goto exit_ERROR;
    }

@@ -253,7 +254,7 @@ exit_ERROR:

static int base_bcol_basesmuma_exchange_ctl_params(
        mca_bcol_basesmuma_module_t *sm_bcol_module,
        mca_bcol_basesmuma_component_t *cs,
        sm_buffer_mgmt *ctl_mgmt, list_data_t *data_blk)
{
@@ -281,7 +282,7 @@ static int base_bcol_basesmuma_exchange_ctl_params(
}

#if 0
    ret = base_bcol_basesmuma_exchange_offsets(sm_bcol_module,
            (void **) ctl_mgmt->ctl_buffs, mem_offset, loop_limit, leading_dim);
    if (OMPI_SUCCESS != ret) {
        goto exit_ERROR;
@@ -332,7 +333,7 @@ exit_ERROR:
}

int base_bcol_basesmuma_setup_ctl_struct(
        mca_bcol_basesmuma_module_t *sm_bcol_module,
        mca_bcol_basesmuma_component_t *cs,
        sm_buffer_mgmt *ctl_mgmt)
{
@@ -351,8 +352,8 @@ int base_bcol_basesmuma_setup_ctl_struct(

    /* initialize the control structure management struct -
     * for collectives without user data
     *---------------------------------------------------------------
     */

    ctl_mgmt->number_of_buffs = n_ctl_structs;
    ctl_mgmt->num_mem_banks =
@@ -391,7 +392,7 @@ int base_bcol_basesmuma_setup_ctl_struct(
            sm_bcol_module,
            sm_bcol_module->super.sbgp_partner_module,
            &(cs->sm_connections_list),
            &(sm_bcol_module->ctl_backing_files_info),
            sm_bcol_module->super.sbgp_partner_module->group_comm,
            input_file, cs->clt_base_fname,
            false);
@@ -406,23 +407,21 @@ int base_bcol_basesmuma_setup_ctl_struct(
        ret = OMPI_ERR_OUT_OF_RESOURCE;
        goto exit_ERROR;
    }
    for (i = 0 ; i < sm_bcol_module->super.sbgp_partner_module->group_size ; i++) {
        if (i == sm_bcol_module->super.sbgp_partner_module->my_index) {
            /* local file data is not cached in this list */
            continue;
        }
        sm_bcol_module->shared_memory_scratch_space[i] = (void *)(
            (char *)(sm_bcol_module->ctl_backing_files_info[i]->sm_mmap) +
            cs->scratch_offset_from_base_ctl_file);
    }
    i = sm_bcol_module->super.sbgp_partner_module->my_index;
    sm_bcol_module->shared_memory_scratch_space[i] = (void *)(
        (char *)(cs->sm_ctl_structs->map_addr) + cs->scratch_offset_from_base_ctl_file);

    /*
     * setup the no-data buffer management data
     */
    n_ctl = ctl_mgmt->num_mem_banks;
    ctl_mgmt->ctl_buffs_mgmt = (mem_bank_management_t *)
@@ -448,13 +447,13 @@ int base_bcol_basesmuma_setup_ctl_struct(
        mutex_ptr = &(ctl_mgmt->ctl_buffs_mgmt[i].mutex);
        OBJ_CONSTRUCT(mutex_ptr, opal_mutex_t);
        ctl_mgmt->ctl_buffs_mgmt[i].index_shared_mem_ctl_structs = i;

        item = (opal_list_item_t *) &(ctl_mgmt->ctl_buffs_mgmt[i].nb_barrier_desc);
        OBJ_CONSTRUCT(item, opal_list_item_t);
        ctl_mgmt->ctl_buffs_mgmt[i].nb_barrier_desc.sm_module =
            sm_bcol_module;
        ctl_mgmt->ctl_buffs_mgmt[i].nb_barrier_desc.pool_index = i;
        /* get the sm_buffer_mgmt pointer for the control structures */
        ctl_mgmt->ctl_buffs_mgmt[i].nb_barrier_desc.coll_buff = ctl_mgmt;
        ctl_mgmt->ctl_buffs_mgmt[i].nb_barrier_desc.ml_memory_block_descriptor =
            NULL;
@@ -473,10 +472,10 @@ exit_ERROR:
/*
 * this function initializes the internal scratch buffers and control
 * structures that will be used by the module. It also initializes
 * the payload buffer management structures.
 */
int base_bcol_basesmuma_setup_library_buffers(
        mca_bcol_basesmuma_module_t *sm_bcol_module,
        mca_bcol_basesmuma_component_t *cs)
{
    int ret = OMPI_SUCCESS, i;
@@ -179,7 +179,7 @@ int bcol_basesmuma_smcm_allgather_connection(
        if (!backing_files) {
            rc = OMPI_ERR_OUT_OF_RESOURCE;
            goto Error;
        }
    }
    *back_files = backing_files;

    /* check to see if we have already mapped all the files, if we have
@@ -1,6 +1,9 @@
/* -*- Mode: C; c-basic-offset:4 ; indent-tabs-mode:nil -*- */
/*
 * Copyright (c) 2009-2012 Oak Ridge National Laboratory. All rights reserved.
 * Copyright (c) 2009-2012 Mellanox Technologies. All rights reserved.
 * Copyright (c) 2013      Los Alamos National Security, LLC. All rights
 *                         reserved.
 * $COPYRIGHT$
 *
 * Additional copyrights may follow
@@ -43,6 +46,10 @@ struct mca_bcol_base_coll_fn_desc_t;
#define MSG_RANGE_INITIAL (1024)*12
#define MSG_RANGE_INC 10
#define BCOL_THRESHOLD_UNLIMITED (INT_MAX)
/* Maximum size of a bcol's header. This allows us to correctly calculate the message
 * thresholds. If the header of any bcol exceeds this value then increase this one
 * to match. */
#define BCOL_HEADER_MAX 96

#define BCOL_HEAD_ALIGN 32 /* will turn into an MCA parameter after debug */

@@ -115,30 +122,6 @@ enum {
    BCOL_FN_COMPLETE = (OMPI_ERR_MAX - 3)
};

/* Originally this enum was placed in the ompi/op/op.h file. It should be moved back
 * when we are ready to lobby for its inclusion. Since we are releasing only the
 * bcast and barrier initially and this struct supports the allreduce, we are not
 * going to worry about it now. Note that in the same h-file, op.h, the struct "ompi_op_t"
 * also has a field that we introduced called "enum ompi_op_type op_type"; this needs to
 * be resolved also.
 */
enum ompi_op_type {
    OMPI_OP_NULL,
    OMPI_OP_MAX,
    OMPI_OP_MIN,
    OMPI_OP_SUM,
    OMPI_OP_PROD,
    OMPI_OP_LAND,
    OMPI_OP_BAND,
    OMPI_OP_LOR,
    OMPI_OP_BOR,
    OMPI_OP_LXOR,
    OMPI_OP_BXOR,
    OMPI_OP_MAXLOC,
    OMPI_OP_MINLOC,
    OMPI_OP_REPLACE,
    OMPI_OP_NUM_OF_TYPES
};

/**
@@ -344,6 +327,24 @@ typedef struct {
    int n_fns_need_ordering; /* The number of functions called for bcols that need ordering */
} mca_bcol_base_order_info_t;

/* structure that encapsulates information propagated amongst multiple
 * fragments whereby completing the entire ensemble of fragments is
 * necessary in order to complete the entire collective
 */
struct bcol_fragment_descriptor_t {
    /* start iterator */
    int head;
    /* end iterator */
    int tail;
    /* current iteration */
    int start_iter;
    /* number of full iterations this frag */
    int num_iter;
    /* end iter */
    int end_iter;
};
typedef struct bcol_fragment_descriptor_t bcol_fragment_descriptor_t;

struct bcol_function_args_t {
    /* full message sequence number */
    int64_t sequence_num;
@@ -373,16 +374,19 @@ struct bcol_function_args_t {
    int rbuf_offset;
    /* for bcol opaque data */
    void *bcol_opaque_data;
    /* An output argument that will be used by the BCOL function to tell ML that the result of the BCOL is in rbuf */
    bool result_in_rbuf;
    bool root_flag;  /* True if the rank is the root of the operation */
    bool need_dt_support; /* will trigger an alternate code path for some colls */
    int status;      /* Used for non-blocking collective completion */
    uint32_t frag_size; /* fragment size for large messages */
    int hier_factor; /* factor used when bcast is invoked as a service function back down
                      * the tree in allgather, for example; the pacl_len is not the actual
                      * len of the data needing bcasting
                      */
    mca_bcol_base_order_info_t order_info;
    bcol_fragment_descriptor_t frag_info;

};

typedef struct bcol_function_args_t bcol_function_args_t;
@@ -658,6 +662,15 @@ struct mca_bcol_base_descriptor_t {
};
typedef struct mca_bcol_base_descriptor_t mca_bcol_base_descriptor_t;

static inline __opal_attribute_always_inline__ size_t
mca_bcol_base_get_buff_length(ompi_datatype_t *dtype, int count)
{
    ptrdiff_t lb, extent;
    ompi_datatype_get_extent(dtype, &lb, &extent);

    return (size_t) (extent * count);
}

#define MCA_BCOL_CHECK_ORDER(module, bcol_function_args) \
    do { \
        if (*((module)->next_inorder) != \
@@ -31,12 +31,14 @@ sources = \
    bcol_iboffload_barrier.c \
    bcol_iboffload_bcast.h \
    bcol_iboffload_bcast.c \
    bcol_iboffload_allgather.c \
    bcol_iboffload_collreq.h \
    bcol_iboffload_collreq.c \
    bcol_iboffload_qp_info.c \
    bcol_iboffload_qp_info.h \
    bcol_iboffload_fanin.c \
    bcol_iboffload_fanout.c \
    bcol_iboffload_allreduce.c

# Make the output library in this directory, and name it either
# mca_<type>_<name>.la (for DSO builds) or libmca_<type>_<name>.la
1385 ompi/mca/bcol/iboffload/bcol_iboffload_allgather.c (new file)
(Diff not shown because of its size.)
1415 ompi/mca/bcol/iboffload/bcol_iboffload_allreduce.c (new file)
(Diff not shown because of its size.)
@@ -90,6 +90,10 @@ struct mca_bcol_iboffload_collfrag_t {
     * there isn't any wait in the coll request
     */
    int32_t last_wait_num;
    /* fragment descriptor for non-contiguous data */
    bcol_fragment_descriptor_t *bcol_frag_info;
    /* frag-len of the ml buffer */
    int frag_len;
};
typedef struct mca_bcol_iboffload_collfrag_t mca_bcol_iboffload_collfrag_t;
OBJ_CLASS_DECLARATION(mca_bcol_iboffload_collfrag_t);

@@ -48,7 +48,7 @@ OBJ_CLASS_INSTANCE(
    frag_constructor,
    NULL);

#if 0
static mca_bcol_iboffload_frag_t*
mca_bcol_iboffload_get_ml_frag_calc(mca_bcol_iboffload_module_t *iboffload,
        mca_bcol_iboffload_collreq_t *coll_request,
@@ -85,8 +85,6 @@ static mca_bcol_iboffload_frag_t*

    return fragment;
}
#endif

static mca_bcol_iboffload_frag_t *
mca_bcol_iboffload_get_packed_frag(mca_bcol_iboffload_module_t *iboffload,
@@ -130,7 +128,6 @@ mca_bcol_iboffload_get_packed_frag(mca_bcol_iboffload_module_t *iboffload,
    return frag;
}

#if 0
static mca_bcol_iboffload_frag_t *
mca_bcol_iboffload_get_calc_frag(mca_bcol_iboffload_module_t *iboffload, int qp_index,
        struct mca_bcol_iboffload_collreq_t *coll_request)
@@ -169,7 +166,6 @@ mca_bcol_iboffload_get_calc_frag(mca_bcol_iboffload_module_t *iboffload, int qp_

    return frag;
}
#endif

mca_bcol_iboffload_frag_t*
mca_bcol_iboffload_get_send_frag(mca_bcol_iboffload_collreq_t *coll_request,
@@ -219,24 +215,24 @@ mca_bcol_iboffload_get_send_frag(mca_bcol_iboffload_collreq_t *coll_request,
        IBOFFLOAD_VERBOSE(10, ("Getting MCA_BCOL_IBOFFLOAD_SEND_FRAG_CONVERT"));
        frag = mca_bcol_iboffload_get_packed_frag(iboffload, destination,
                qp_index, len, &coll_request->send_convertor);
#if 0
        break;
    case MCA_BCOL_IBOFFLOAD_SEND_FRAG_CALC:
        IBOFFLOAD_VERBOSE(10, ("Getting MCA_BCOL_IBOFFLOAD_SEND_FRAG_CALC"));
        frag = mca_bcol_iboffload_get_calc_frag(iboffload, qp_index, coll_request);
#endif
        break;
    case MCA_BCOL_IBOFFLOAD_SEND_FRAG_ML:
        IBOFFLOAD_VERBOSE(10, ("Getting MCA_BCOL_IBOFFLOAD_SEND_FRAG_ML"));
        frag = mca_bcol_iboffload_get_ml_frag(
                iboffload, qp_index, len, coll_request->buffer_info[buf_index].lkey,
                (uint64_t)(uintptr_t) coll_request->buffer_info[buf_index].buf + src_offset);
#if 0
        break;
    case MCA_BCOL_IBOFFLOAD_SEND_FRAG_ML_CALC:
        frag = mca_bcol_iboffload_get_ml_frag_calc(iboffload, coll_request, len, src_offset);
        IBOFFLOAD_VERBOSE(10, ("Getting MCA_BCOL_IBOFFLOAD_SEND_FRAG_ML_CALC"));
#endif
        break;
    default:
        IBOFFLOAD_VERBOSE(10, ("Getting default"));

@@ -267,10 +267,19 @@ int mca_bcol_iboffload_register_params(void)
            "Increment size of free lists (must be >= 1)",
            32, &mca_bcol_iboffload_component.free_list_inc,
            REGINT_GE_ONE));
    /* the rdma mpool no longer exists - we must use the grdma mpool component;
     * this should resolve errors in mtt testing
     */
    /*
    CHECK(reg_string("mpool", NULL,
            "Name of the memory pool to be used (it is unlikely that you will ever want to change this)",
            "rdma", &mca_bcol_iboffload_component.mpool_name,
            0));
    */
    CHECK(reg_string("mpool", NULL,
            "Name of the memory pool to be used (it is unlikely that you will ever want to change this)",
            "grdma", &mca_bcol_iboffload_component.mpool_name,
            0));
    CHECK(reg_int("cq_size", "cq_size",
            "Size of the OpenFabrics completion "
            "queue (will automatically be set to a minimum of "

@@ -731,9 +731,9 @@ static void load_func(mca_bcol_base_module_t *super)
    super->bcol_function_init_table[BCOL_BARRIER] = mca_bcol_iboffload_barrier_register;
    super->bcol_function_init_table[BCOL_BCAST] = mca_bcol_iboffload_bcast_register;
    /*super->bcol_function_init_table[BCOL_ALLTOALL] = mca_bcol_iboffload_alltoall_register;*/
    super->bcol_function_init_table[BCOL_ALLGATHER] = mca_bcol_iboffload_allgather_register;
    super->bcol_function_init_table[BCOL_SYNC] = mca_bcol_iboffload_memsync_register;
    super->bcol_function_init_table[BCOL_ALLREDUCE] = mca_bcol_iboffload_allreduce_register;

    super->bcol_memory_init = mca_bcol_iboffload_init_buffer_memory;

@@ -1523,7 +1523,7 @@ int mca_bcol_iboffload_exchange_rem_addr(mca_bcol_iboffload_endpoint_t *ep)
    coll_request->user_handle_freed = true;
    /* complete the exchange - progress releases full request descriptors */
    while (!BCOL_IS_COMPLETED(coll_request)) {
        mca_bcol_iboffload_component_progress();
    }

    IBOFFLOAD_VERBOSE(10, ("RDMA addr exchange with comm rank: %d was finished.\n",

@@ -1,6 +1,8 @@
#
# Copyright (c) 2009-2012 Oak Ridge National Laboratory. All rights reserved.
# Copyright (c) 2009-2013 Mellanox Technologies. All rights reserved.
# Copyright (c) 2013      Los Alamos National Security, LLC. All rights
#                         reserved.
# $COPYRIGHT$
#
# Additional copyrights may follow
@@ -22,9 +24,14 @@ sources = \
    bcol_ptpcoll_component.c \
    bcol_ptpcoll_fanin.c \
    bcol_ptpcoll_fanout.c \
    bcol_ptpcoll_module.c \
    bcol_ptpcoll_allreduce.h \
    bcol_ptpcoll_allreduce.c \
    bcol_ptpcoll_reduce.h \
    bcol_ptpcoll_reduce.c \
    bcol_ptpcoll_allgather.c

# Make the output library in this directory, and name it either
# mca_<type>_<name>.la (for DSO builds) or libmca_<type>_<name>.la
# (for static builds).

@@ -230,6 +230,12 @@ struct mca_bcol_ptpcoll_ml_buffer_desc_t {
    int iteration; /* buffer iteration in knomial, binomial, etc. algorithms */
    int tag;       /* tag number that is attached to this operation */
    int status;    /* operation status */
    /* Fixme: Probably we can get rid of these fields by redesigning
     * the reduce implementation
     */
    int reduction_status; /* used for reduce to cache the internal
                             reduction status */
    bool reduce_init_called;
};
typedef struct mca_bcol_ptpcoll_ml_buffer_desc_t mca_bcol_ptpcoll_ml_buffer_desc_t;
607 ompi/mca/bcol/ptpcoll/bcol_ptpcoll_allgather.c (new file)
@@ -0,0 +1,607 @@
/*
 * Copyright (c) 2009-2012 Oak Ridge National Laboratory. All rights reserved.
 * Copyright (c) 2009-2012 Mellanox Technologies. All rights reserved.
 * $COPYRIGHT$
 *
 * Additional copyrights may follow
 *
 * $HEADER$
 */

#include "ompi_config.h"

#include "ompi/include/ompi/constants.h"
#include "ompi/mca/coll/ml/coll_ml.h"
#include "ompi/mca/bcol/bcol.h"
#include "bcol_ptpcoll_allreduce.h"
#include "ompi/mca/coll/base/coll_tags.h" /* debug */
/*
 * Recursive K-ing allgather
 */

/*
 * Recursive k-ing algorithm
 * Example: k=3, n=9
 *
 * Number of exchange steps = log_k(n)
 * Number of sends in each exchange step = k - 1 (radix - 1)
 */
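/*
 * Worked example of the counts above (an illustration, not part of the
 * algorithm itself): with k=3 and n=9 there are log_3(9) = 2 exchange
 * steps, each consisting of k-1 = 2 sends and 2 receives per rank. In
 * step 0 a rank trades with the two peers at distance 1 and 2 inside
 * its triple; in step 1 with the peers at distance 3 and 6. The data a
 * rank holds grows by a factor of k per step: 1 -> 3 -> 9 blocks.
 */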

int bcol_ptpcoll_k_nomial_allgather_init(bcol_function_args_t *input_args,
        struct coll_ml_function_t *const_args)
{
    /* local variables */

    mca_bcol_ptpcoll_module_t *ptpcoll_module = (mca_bcol_ptpcoll_module_t *) const_args->bcol_module;
    int *group_list = ptpcoll_module->super.sbgp_partner_module->group_list;
    netpatterns_k_exchange_node_t *exchange_node = &ptpcoll_module->knomial_allgather_tree;
    int my_group_index = ptpcoll_module->super.sbgp_partner_module->my_index;
    int group_size = ptpcoll_module->group_size;
    int *list_connected = ptpcoll_module->super.list_n_connected; /* critical for hierarchical colls */

    int tag;
    int i, j;
    int knt;
    int comm_src, comm_dst, src, dst;
    int recv_offset, recv_len;
    int send_offset, send_len;

    uint32_t buffer_index = input_args->buffer_index;
    int pow_k, tree_order;
    int rc = OMPI_SUCCESS;
    ompi_communicator_t *comm = ptpcoll_module->super.sbgp_partner_module->group_comm;
    ompi_request_t **requests =
        ptpcoll_module->ml_mem.ml_buf_desc[buffer_index].requests;
    int *active_requests =
        &(ptpcoll_module->ml_mem.ml_buf_desc[buffer_index].active_requests);
    int completed = 0; /* initialized */
    void *data_buffer = (void *)(
        (unsigned char *) input_args->sbuf +
        (size_t) input_args->sbuf_offset);
    int pack_len = input_args->count * input_args->dtype->super.size;

#if 0
    fprintf(stderr, "entering p2p allgather pack_len %d. exchange node: %p\n", pack_len, exchange_node);
#endif
    /* initialize the iteration counter */
    int *iteration = &ptpcoll_module->ml_mem.ml_buf_desc[buffer_index].iteration;
    *iteration = 0;

    /* reset the active request counter */
    *active_requests = 0;

    /* keep the tag within the limit supported by the pml */
    tag = (PTPCOLL_TAG_OFFSET + input_args->sequence_num * PTPCOLL_TAG_FACTOR) & (ptpcoll_module->tag_mask);
    /* mark this as a collective tag, to avoid conflict with user-level flags */
    tag = -tag;

    /* k-nomial parameters */
    tree_order = exchange_node->tree_order;
    pow_k = exchange_node->log_tree_order;

    /* let's begin the collective, starting with the extra ranks and their
     * respective proxies
     */
    if (EXTRA_NODE == exchange_node->node_type) {

        /* then I will send to my proxy rank */
        dst = exchange_node->rank_extra_sources_array[0];
        /* find the rank in the communicator */
        comm_dst = group_list[dst];
        /* now I need to calculate my own offset */
        knt = 0;
        for (i = 0 ; i < my_group_index; i++) {
            knt += list_connected[i];
        }

        /* send the data to my proxy */
        rc = MCA_PML_CALL(isend((void *) ((unsigned char *) data_buffer +
                        knt * pack_len),
                    pack_len * list_connected[my_group_index],
                    MPI_BYTE,
                    comm_dst, tag,
                    MCA_PML_BASE_SEND_STANDARD, comm,
                    &(requests[*active_requests])));

        if (OMPI_SUCCESS != rc) {
            PTPCOLL_VERBOSE(10, ("Failed to isend data"));
            return OMPI_ERROR;
        }
        ++(*active_requests);

        /* now I go ahead and post the receive from my proxy */
        comm_src = comm_dst;
        knt = 0;
        for (i = 0; i < group_size; i++) {
            knt += list_connected[i];
        }
        rc = MCA_PML_CALL(irecv(data_buffer,
                    knt * pack_len,
                    MPI_BYTE,
                    comm_src,
                    tag, comm, &(requests[*active_requests])));
        if (OMPI_SUCCESS != rc) {
            PTPCOLL_VERBOSE(10, ("Failed to post ireceive "));
            return OMPI_ERROR;
        }

        ++(*active_requests);
        /* poll for completion */
        /* this polls internally */
        completed = mca_bcol_ptpcoll_test_all_for_match(active_requests, requests, &rc);
        if (completed) {
            /* go to buffer release */
            goto FINISHED;
        } else {
            /* save the state and hop out;
             * nothing to save here
             */
            return ((OMPI_SUCCESS != rc) ? OMPI_ERROR : BCOL_FN_STARTED);
        }
    } else if (0 < exchange_node->n_extra_sources) {

        /* I am a proxy for someone */
        src = exchange_node->rank_extra_sources_array[0];
        /* find the rank in the communicator */
        comm_src = group_list[src];
        knt = 0;
        for (i = 0; i < src; i++) {
            knt += list_connected[i];
        }
        /* post the receive */
        rc = MCA_PML_CALL(irecv((void *) ((unsigned char *) data_buffer
                        + knt * pack_len),
                    pack_len * list_connected[src],
                    MPI_BYTE,
                    comm_src,
                    tag, comm, &(requests[*active_requests])));
        if (OMPI_SUCCESS != rc) {
            PTPCOLL_VERBOSE(10, ("Failed to post ireceive "));
            return OMPI_ERROR;
        }

        ++(*active_requests);
        /* poll for completion */
        /* this routine polls internally */
        completed = mca_bcol_ptpcoll_test_all_for_match(active_requests, requests, &rc);
        if (!completed) {
            /* save the state and hop out.
             * We really do need to block here, so set
             * the iteration to -1, indicating we need to
             * finish this part first
             */
            *iteration = -1;
            return ((OMPI_SUCCESS != rc) ? OMPI_ERROR : BCOL_FN_STARTED);
        }
    }

    /* we start the recursive k-ing phase */
    /* fprintf(stderr,"tree order %d pow_k %d \n",tree_order,pow_k); */
    for (i = 0; i < pow_k; i++) {
        for (j = 0; j < (tree_order - 1); j++) {

            /* send phase */
            dst = exchange_node->rank_exchanges[i][j];
            if (dst < 0) {
                continue;
            }
            comm_dst = group_list[dst];
            send_offset = exchange_node->payload_info[i][j].s_offset * pack_len;
            send_len = exchange_node->payload_info[i][j].s_len * pack_len;
            /* debug print */
            /* fprintf(stderr,"sending %d bytes to rank %d at offset %d\n",send_len, */
            /* comm_dst,send_offset); */
            rc = MCA_PML_CALL(isend((void *)((unsigned char *) data_buffer +
                            send_offset),
                        send_len,
                        MPI_BYTE,
                        comm_dst, tag,
                        MCA_PML_BASE_SEND_STANDARD, comm,
                        &(requests[*active_requests])));

            if (OMPI_SUCCESS != rc) {
                PTPCOLL_VERBOSE(10, ("Failed to isend data"));
                return OMPI_ERROR;
            }
            ++(*active_requests);

            /* sends are posted */
        }

        /* Now post the recv's */
        for (j = 0; j < (tree_order - 1); j++) {

            /* recv phase */
            src = exchange_node->rank_exchanges[i][j];
            if (src < 0) {
                continue;
            }
            comm_src = group_list[src];
            recv_offset = exchange_node->payload_info[i][j].r_offset * pack_len;
            recv_len = exchange_node->payload_info[i][j].r_len * pack_len;
            /* debug print */
            /* fprintf(stderr,"recving %d bytes to rank %d at offset %d\n",recv_len, */
            /* comm_src,recv_offset); */
            /* post the receive */
            rc = MCA_PML_CALL(irecv((void *) ((unsigned char *) data_buffer +
                            recv_offset),
                        recv_len,
                        MPI_BYTE,
                        comm_src,
                        tag, comm, &(requests[*active_requests])));
            if (OMPI_SUCCESS != rc) {
                PTPCOLL_VERBOSE(10, ("Failed to post ireceive "));
                return OMPI_ERROR;
            }

            ++(*active_requests);
        }
        /* finished posting all send/recv's; now poll for completion before
         * continuing to the next iteration
         */
        completed = 0;
        /* polling internally on 2*(k - 1) requests */
        completed = mca_bcol_ptpcoll_test_all_for_match(active_requests, requests, &rc);

        if (!completed) {
            /* save the state and hop out;
             * only the iteration needs to be tracked
             */
            *iteration = i; /* need to pick up here */

            return ((OMPI_SUCCESS != rc) ? OMPI_ERROR : BCOL_FN_STARTED);
        }
    }

    /* finish off the last piece; send the data back to the extra */
    if (0 < exchange_node->n_extra_sources) {
        dst = exchange_node->rank_extra_sources_array[0];
|
||||
comm_dst = group_list[dst];
|
||||
knt = 0;
|
||||
for( i = 0; i < group_size; i++){
|
||||
knt += list_connected[i];
|
||||
}
|
||||
/* debug print */
|
||||
/*
|
||||
fprintf(stderr,"sending %d bytes to extra %d \n",pack_len*knt,comm_dst);
|
||||
*/
|
||||
rc = MCA_PML_CALL(isend(data_buffer,
|
||||
pack_len * knt,
|
||||
MPI_BYTE,
|
||||
comm_dst, tag,
|
||||
MCA_PML_BASE_SEND_STANDARD, comm,
|
||||
&(requests[*active_requests])));
|
||||
|
||||
if( OMPI_SUCCESS != rc ) {
|
||||
PTPCOLL_VERBOSE(10,("Failed to isend data"));
|
||||
return OMPI_ERROR;
|
||||
}
|
||||
++(*active_requests);
|
||||
|
||||
/* probe for send completion */
|
||||
completed = 0;
|
||||
/* polling internally */
|
||||
completed = mca_bcol_ptpcoll_test_all_for_match(active_requests, requests, &rc);
|
||||
if(!completed){
|
||||
/* save state and hop out
|
||||
* We really do need to block here so set
|
||||
* the iteration to pow_k +1 indicating we need to
|
||||
* finish progressing the last part
|
||||
*/
|
||||
*iteration = pow_k + 1;
|
||||
|
||||
return (OMPI_SUCCESS != rc ? OMPI_ERROR : BCOL_FN_STARTED);
|
||||
}
|
||||
}
|
||||
|
||||
FINISHED:
|
||||
/* recycle buffer if need be */
|
||||
return BCOL_FN_COMPLETE;
|
||||
}
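The init routine above returns BCOL_FN_STARTED whenever it must yield before all posted requests complete, and BCOL_FN_COMPLETE once the buffer may be recycled. For illustration only (not part of this change), a minimal sketch of the contract between this init routine and the progress routine that follows; the launch_allgather wrapper name is hypothetical, since in reality the coll/ml launcher drives these entry points:

/* Illustration only: a naive polling driver for the init/progress pair.
 * Assumes the BCOL_FN_* return-code contract shown in the functions above. */
static int launch_allgather(bcol_function_args_t *args,
                            struct coll_ml_function_t *const_args)
{
    int ret = bcol_ptpcoll_k_nomial_allgather_init(args, const_args);
    while (BCOL_FN_STARTED == ret) {
        /* saved state (iteration, active_requests) is picked up here */
        ret = bcol_ptpcoll_k_nomial_allgather_progress(args, const_args);
    }
    return (BCOL_FN_COMPLETE == ret) ? OMPI_SUCCESS : OMPI_ERROR;
}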

/* allgather progress function */
int bcol_ptpcoll_k_nomial_allgather_progress(bcol_function_args_t *input_args,
                                             struct coll_ml_function_t *const_args)
{
    /* local variables */
    mca_bcol_ptpcoll_module_t *ptpcoll_module = (mca_bcol_ptpcoll_module_t *) const_args->bcol_module;
    int *group_list = ptpcoll_module->super.sbgp_partner_module->group_list;
    netpatterns_k_exchange_node_t *exchange_node = &ptpcoll_module->knomial_allgather_tree;
    int group_size = ptpcoll_module->group_size;
    int *list_connected = ptpcoll_module->super.list_n_connected; /* critical for hierarchical colls */

    int tag;
    int i, j;
    int knt;
    int comm_src, comm_dst, src, dst;
    int recv_offset, recv_len;
    int send_offset, send_len;
    uint32_t buffer_index = input_args->buffer_index;

    int pow_k, tree_order;
    int rc = OMPI_SUCCESS;
    ompi_communicator_t* comm = ptpcoll_module->super.sbgp_partner_module->group_comm;
    ompi_request_t **requests =
        ptpcoll_module->ml_mem.ml_buf_desc[buffer_index].requests;
    int *active_requests =
        &(ptpcoll_module->ml_mem.ml_buf_desc[buffer_index].active_requests);
    int completed = 0; /* initialized */
    void *data_buffer = (void*)(
        (unsigned char *) input_args->sbuf +
        (size_t) input_args->sbuf_offset);
    int pack_len = input_args->count * input_args->dtype->super.size;
    /* initialize the iteration counter */
    int *iteration = &ptpcoll_module->ml_mem.ml_buf_desc[buffer_index].iteration;

#if 0
    fprintf(stderr,"%d: entering p2p allgather progress AR: %d iter: %d\n",my_group_index,*active_requests,
            *iteration);
#endif
    /* keep tag within the limit supported by the pml */
    tag = (PTPCOLL_TAG_OFFSET + input_args->sequence_num * PTPCOLL_TAG_FACTOR) & (ptpcoll_module->tag_mask);
    /* mark this as a collective tag, to avoid conflict with user-level flags */
    tag = -tag;

    /* k-nomial tree parameters */
    tree_order = exchange_node->tree_order;
    pow_k = exchange_node->log_tree_order;

    /* let's begin the collective, starting with extra ranks and their
     * respective proxies
     */
    if( EXTRA_NODE == exchange_node->node_type ) {

        /* simply poll for completion */
        completed = 0;
        /* polling internally */
        completed = mca_bcol_ptpcoll_test_all_for_match(active_requests, requests, &rc);
        if (completed) {
            /* go to buffer release */
            goto FINISHED;
        } else {
            /* save state and hop out; nothing to save here */
            return ((OMPI_SUCCESS != rc) ? OMPI_ERROR : BCOL_FN_STARTED);
        }
    } else if ( 0 < exchange_node->n_extra_sources && (-1 == *iteration)) {

        /* I am a proxy for someone; simply poll for completion */
        completed = 0;
        /* polling internally */
        assert( 1 == *active_requests);
        completed = mca_bcol_ptpcoll_test_all_for_match(active_requests, requests, &rc);
        if (!completed) {
            /* save state and hop out.
             * We really do need to block here, so keep the iteration at -1
             * indicating we need to finish this part first.
             */
            (*iteration) = -1;
            return ((OMPI_SUCCESS != rc) ? OMPI_ERROR : BCOL_FN_STARTED);
        }
        /* I may now proceed to the recursive k-ing phase */
        *iteration = 0;
    }

    /* the ordering here between the extra rank and progressing active
     * requests is critical
     */
    /* extra rank */
    if ((pow_k + 1) == *iteration) {
        /* finish off the last one */
        goto PROGRESS_EXTRA;
    }

    /* active requests must be completed before continuing on to the
     * recursive k-ing step
     * CAREFUL HERE, IS THIS REALLY WHAT YOU WANT??
     */
    if ( 0 < (*active_requests) ) {
        /* then we have something to progress from the last step */
        /* debug print */
        /*
        fprintf(stderr,"%d: entering progress AR: %d iter: %d\n",my_group_index,*active_requests,
                *iteration);
        */
        completed = 0;
        completed = mca_bcol_ptpcoll_test_all_for_match(active_requests, requests, &rc);
        if (!completed) {
            /* save state and hop out; state hasn't changed */
            return ((OMPI_SUCCESS != rc) ? OMPI_ERROR : BCOL_FN_STARTED);
        }
        ++(*iteration);
    }

    /* we start the recursive k-ing phase */
    for (i = *iteration; i < pow_k; i++) {
        /* nothing changes here */
        for (j = 0; j < (tree_order - 1); j++) {

            /* send phase */
            dst = exchange_node->rank_exchanges[i][j];
            if (dst < 0) {
                continue;
            }
            comm_dst = group_list[dst];
            send_offset = exchange_node->payload_info[i][j].s_offset * pack_len;
            send_len = exchange_node->payload_info[i][j].s_len * pack_len;
            rc = MCA_PML_CALL(isend((void*)((unsigned char *) data_buffer +
                                            send_offset),
                                    send_len,
                                    MPI_BYTE,
                                    comm_dst, tag,
                                    MCA_PML_BASE_SEND_STANDARD, comm,
                                    &(requests[*active_requests])));
            if( OMPI_SUCCESS != rc ) {
                PTPCOLL_VERBOSE(10,("Failed to isend data"));
                return OMPI_ERROR;
            }
            ++(*active_requests);

            /* sends are posted */
        }

        /* now post the recvs */
        for (j = 0; j < (tree_order - 1); j++) {

            /* recv phase */
            src = exchange_node->rank_exchanges[i][j];
            if (src < 0) {
                continue;
            }
            comm_src = group_list[src];
            recv_offset = exchange_node->payload_info[i][j].r_offset * pack_len;
            recv_len = exchange_node->payload_info[i][j].r_len * pack_len;
            /* post the receive */
            rc = MCA_PML_CALL(irecv((void *) ((unsigned char *) data_buffer +
                                              recv_offset),
                                    recv_len,
                                    MPI_BYTE,
                                    comm_src,
                                    tag, comm, &(requests[*active_requests])));
            if( OMPI_SUCCESS != rc ) {
                PTPCOLL_VERBOSE(10, ("Failed to post ireceive "));
                return OMPI_ERROR;
            }
            ++(*active_requests);
        }

        /* finished all send/recvs, now poll for completion before
         * continuing to the next iteration
         */
        completed = 0;
        /* make this non-blocking */
        completed = mca_bcol_ptpcoll_test_all_for_match(active_requests, requests, &rc);
        if (!completed) {
            /* save state and hop out; only the iteration needs to be tracked */
            *iteration = i; /* need to pick up here */
            return ((OMPI_SUCCESS != rc) ? OMPI_ERROR : BCOL_FN_STARTED);
        }
    }

    /* finish off the last piece, send the data back to the extra */
    if ( 0 < exchange_node->n_extra_sources ) {
        dst = exchange_node->rank_extra_sources_array[0];
        comm_dst = group_list[dst];
        knt = 0;
        for (i = 0; i < group_size; i++) {
            knt += list_connected[i];
        }
        rc = MCA_PML_CALL(isend(data_buffer,
                                pack_len * knt,
                                MPI_BYTE,
                                comm_dst, tag,
                                MCA_PML_BASE_SEND_STANDARD, comm,
                                &(requests[*active_requests])));
        if( OMPI_SUCCESS != rc ) {
            PTPCOLL_VERBOSE(10,("Failed to isend data"));
            return OMPI_ERROR;
        }
        ++(*active_requests);

        /* probe for send completion */
        completed = 0;
        /* make this non-blocking */
        completed = mca_bcol_ptpcoll_test_all_for_match(active_requests, requests, &rc);
        if (!completed) {
            /* save state and hop out.
             * We really do need to block here, so set the iteration to
             * pow_k + 1 indicating we need to finish progressing the last part.
             */
            *iteration = pow_k + 1;
            return ((OMPI_SUCCESS != rc) ? OMPI_ERROR : BCOL_FN_STARTED);
        }
    }
    /* folks need to skip this unless they really are the proxy
     * reentering with the intent of progressing the final send
     */
    goto FINISHED;

PROGRESS_EXTRA:

    /* probe for send completion */
    completed = 0;
    /* make this non-blocking */
    completed = mca_bcol_ptpcoll_test_all_for_match(active_requests, requests, &rc);
    if (!completed) {
        /* save state and hop out; the iteration is already pow_k + 1,
         * indicating we need to finish progressing the last part
         */
        return ((OMPI_SUCCESS != rc) ? OMPI_ERROR : BCOL_FN_STARTED);
    }

FINISHED:
    /* recycle buffer if need be */
    return BCOL_FN_COMPLETE;
}
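Both entry points compute the collective tag the same way before posting any traffic. For illustration only (not part of this change), the standalone sketch below uses made-up offset and factor constants to show the two properties the code relies on: the mask keeps the tag inside the PML's supported range, and the negation keeps collective tags disjoint from user-level MPI tags, which are always non-negative:

/* Illustration only: PTPCOLL_TAG_OFFSET/PTPCOLL_TAG_FACTOR stand-ins below
 * are hypothetical values; tag_mask is assumed to be 2^n - 1 and below the
 * PML's maximum tag, as set up in the module constructor. */
static int collective_tag(int sequence_num, int tag_mask)
{
    const int tag_offset = 16, tag_factor = 2; /* made-up constants */
    /* wrap the monotonically growing sequence number into the legal range */
    int tag = (tag_offset + sequence_num * tag_factor) & tag_mask;
    /* user-level MPI tags are non-negative, so the negated value cannot
     * collide with application traffic on the same communicator */
    return -tag;
}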

/*
 * Register allgather functions to the BCOL function table,
 * so they can be selected
 */
int bcol_ptpcoll_allgather_init(mca_bcol_base_module_t *super)
{
    mca_bcol_base_coll_fn_comm_attributes_t comm_attribs;
    mca_bcol_base_coll_fn_invoke_attributes_t inv_attribs;

    comm_attribs.bcoll_type = BCOL_ALLGATHER;
    comm_attribs.comm_size_min = 0;
    comm_attribs.comm_size_max = 1024 * 1024;
    comm_attribs.waiting_semantics = NON_BLOCKING;

    inv_attribs.bcol_msg_min = 0;
    inv_attribs.bcol_msg_max = 20000; /* range 1 */

    inv_attribs.datatype_bitmap = 0xffffffff;
    inv_attribs.op_types_bitmap = 0xffffffff;

    comm_attribs.data_src = DATA_SRC_KNOWN;

    mca_bcol_base_set_attributes(super, &comm_attribs, &inv_attribs,
                                 bcol_ptpcoll_k_nomial_allgather_init,
                                 bcol_ptpcoll_k_nomial_allgather_progress);

    comm_attribs.data_src = DATA_SRC_KNOWN;
    inv_attribs.bcol_msg_min = 10000000;
    inv_attribs.bcol_msg_max = 10485760; /* range 4 */

    mca_bcol_base_set_attributes(super, &comm_attribs, &inv_attribs,
                                 bcol_ptpcoll_k_nomial_allgather_init,
                                 bcol_ptpcoll_k_nomial_allgather_progress);

    return OMPI_SUCCESS;
}

ompi/mca/bcol/ptpcoll/bcol_ptpcoll_allreduce.c (new file, 1030 lines; the diff is not shown because of its size)

ompi/mca/bcol/ptpcoll/bcol_ptpcoll_allreduce.h (new file, 95 lines)
@@ -0,0 +1,95 @@
/*
 * Copyright (c) 2009-2012 Oak Ridge National Laboratory.  All rights reserved.
 * Copyright (c) 2009-2012 Mellanox Technologies.  All rights reserved.
 * $COPYRIGHT$
 *
 * Additional copyrights may follow
 *
 * $HEADER$
 */

#ifndef MCA_BCOL_PTPCOLL_ALLREDUCE_H
#define MCA_BCOL_PTPCOLL_ALLREDUCE_H

#include "ompi_config.h"
#include "ompi/op/op.h"
#include "ompi/datatype/ompi_datatype.h"
#include "bcol_ptpcoll.h"
#include "bcol_ptpcoll_utils.h"

enum {
    BLOCK_OFFSET = 0,
    LOCAL_REDUCE_SEG_OFFSET,
    BLOCK_COUNT,
    SEG_SIZE,
    NOFFSETS
};

BEGIN_C_DECLS

int bcol_ptpcoll_allreduce_narraying(mca_bcol_ptpcoll_module_t *ptpcoll_module,
                                     const int buffer_index, void *data_buffer,
                                     struct ompi_op_t *op,
                                     const int count, struct ompi_datatype_t *dtype,
                                     const int buffer_size, const int relative_group_index);

int bcol_ptpcoll_allreduce_narraying_init(bcol_function_args_t *input_args,
                                          struct coll_ml_function_t *const_args);

int bcol_ptpcoll_allreduce_recursivek_scatter_reduce(mca_bcol_ptpcoll_module_t *ptpcoll_module,
                                                     const int buffer_index, void *sbuf,
                                                     void *rbuf,
                                                     struct ompi_op_t *op,
                                                     const int count, struct ompi_datatype_t *dtype,
                                                     const int relative_group_index,
                                                     const int padded_start_byte);

int bcol_ptpcoll_allreduce_knomial_allgather(mca_bcol_ptpcoll_module_t *ptpcoll_module,
                                             const int buffer_index,
                                             void *sbuf, void *rbuf, int count,
                                             struct ompi_datatype_t *dtype,
                                             const int relative_group_index,
                                             const int padded_start_byte);

int bcol_ptpcoll_allreduce_recursivek_scatter_reduce_allgather_init(bcol_function_args_t *input_args,
                                                                    struct coll_ml_function_t *const_args);

int compute_knomial_allgather_offsets(int group_index, int count,
                                      struct ompi_datatype_t *dtype,
                                      int k_radix, int n_exchanges,
                                      int **offsets);

int bcol_ptpcoll_allreduce_recursivek_scatter_reduce_extra(mca_bcol_ptpcoll_module_t *ptpcoll_module,
                                                           int buffer_index,
                                                           void *sbuf,
                                                           void *rbuf,
                                                           struct ompi_op_t *op,
                                                           const int count, struct ompi_datatype_t *dtype);

int bcol_ptpcoll_allreduce_knomial_allgather_extra(mca_bcol_ptpcoll_module_t *ptpcoll_module,
                                                   int buffer_index,
                                                   void *sbuf,
                                                   void *rbuf,
                                                   const int count, struct ompi_datatype_t *dtype);

int bcol_ptpcoll_allreduce_recursivek_scatter_reduce_allgather_extra_init(bcol_function_args_t *input_args,
                                                                          struct coll_ml_function_t *const_args);

int bcol_ptpcoll_allreduce_init(mca_bcol_base_module_t *super);

#if 0
int knomial_reduce_scatter_offsets(int group_index, int count, struct ompi_datatype_t *dtype, int k_radix,
                                   int n_exchanges, int nth_exchange, size_t *recv_offset,
                                   size_t *block_offset, size_t *block_count, size_t *block_size,
                                   size_t *seg_size);

int allgather_offsets(int group_index, int count, struct ompi_datatype_t *dtype, int k_radix,
                      int n_exchanges, int nth_exchange, size_t *send_offset,
                      size_t *block_offset, size_t *block_count, size_t *block_size,
                      size_t *seg_size);
#endif

END_C_DECLS

#endif
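compute_knomial_allgather_offsets() fills one int[NOFFSETS] row per k-nomial exchange, indexed with the enum above. For illustration only (not part of this change), a minimal usage sketch mirroring the allocation scheme of alloc_allreduce_offsets_array() later in this commit; the radix and exchange count are made-up example values, <stdlib.h> is assumed, and error handling is abbreviated:

/* Illustration only: hypothetical tree shape; rows leak on a failed malloc,
 * which is acceptable for a sketch but not for production code. */
static int fill_offsets_example(int my_group_index, int count,
                                struct ompi_datatype_t *dtype)
{
    const int k_radix = 2, n_exchanges = 3; /* made-up values */
    int i, rc;
    int **offsets = (int **) malloc(n_exchanges * sizeof(int *));
    if (NULL == offsets) {
        return OMPI_ERROR;
    }
    for (i = 0; i < n_exchanges; i++) {
        offsets[i] = (int *) malloc(NOFFSETS * sizeof(int));
        if (NULL == offsets[i]) {
            return OMPI_ERROR;
        }
    }
    rc = compute_knomial_allgather_offsets(my_group_index, count, dtype,
                                           k_radix, n_exchanges, offsets);
    /* offsets[i][BLOCK_OFFSET] and offsets[i][BLOCK_COUNT] now describe
     * the block this rank exchanges in the i-th k-nomial step */
    return rc;
}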

@@ -453,7 +453,7 @@ int bcol_ptpcoll_bcast_binomial_probe_and_scatter_anyroot(mca_bcol_ptpcoll_modul
     int pow2_distance;
     int my_left_boundary_rank;
     int my_group_index = ptpcoll_module->super.sbgp_partner_module->my_index;
-    int group_root_index;
+    int group_root_index = 0;
     void *curr_data_buffer = NULL;
     int tag =
         ptpcoll_module->ml_mem.ml_buf_desc[buffer_index].tag;
@@ -69,6 +69,7 @@ mca_bcol_ptpcoll_component_t mca_bcol_ptpcoll_component = {

         ptpcoll_open,
         ptpcoll_close,
+        .mca_register_component_params = mca_bcol_ptpcoll_register_mca_params
     },

     /* Initialization / querying functions */

@@ -109,14 +110,6 @@ OBJ_CLASS_INSTANCE(mca_bcol_ptpcoll_collreq_t,
  */
 static int ptpcoll_open(void)
 {
-    int rc;
-
-    rc = mca_bcol_ptpcoll_register_mca_params();
-    if (OMPI_SUCCESS != rc) {
-        PTPCOLL_VERBOSE(10, ("Failed to register parameters for the component"));
-        return OMPI_ERROR;
-    }
-
     return OMPI_SUCCESS;
 }
@@ -1,6 +1,9 @@
+/* -*- Mode: C; c-basic-offset:4 ; indent-tabs-mode:nil -*- */
 /*
  * Copyright (c) 2009-2012 Oak Ridge National Laboratory.  All rights reserved.
  * Copyright (c) 2009-2012 Mellanox Technologies.  All rights reserved.
+ * Copyright (c) 2013      Los Alamos National Security, LLC. All rights
+ *                         reserved.
  * $COPYRIGHT$
  *
  * Additional copyrights may follow

@@ -141,12 +144,15 @@ int mca_bcol_ptpcoll_register_mca_params(void)
     CHECK(reg_int("priority", NULL,
                   "PTPCOLL component priority"
                   "(from 0(low) to 90 (high))", 90, &cm->super.priority, 0));

+    CHECK(reg_int("verbose", NULL,
+                  "Output some verbose PTPCOLL information "
+                  "(0 = no output, nonzero = output)", 0, &cm->verbose, REGINT_GE_ZERO));
+
     CHECK(reg_int("k_nomial_radix", NULL,
                   "The radix of K-Nomial Tree "
                   "(starts from 2)", 2, &cm->k_nomial_radix, REGINT_GE_ONE));

     CHECK(reg_int("narray_radix", NULL,
                   "The radix of Narray Tree "
                   "(starts from 2)", 2, &cm->narray_radix, REGINT_GE_ONE));

@@ -160,9 +166,9 @@ int mca_bcol_ptpcoll_register_mca_params(void)
                   "(starts from 8)", 8, &cm->num_to_probe, REGINT_GE_ONE));

     CHECK(reg_int("bcast_small_msg_known_root_alg", NULL,
                   "Algorithm selection for bcast small messages known root "
                   "(1 - K-nomial, 2 - N-array)", 1, &cm->bcast_small_messages_known_root_alg,
                   REGINT_GE_ZERO));

     CHECK(reg_int("bcast_large_msg_known_root_alg", NULL,
                   "Algorithm selection for bcast large messages known root"

@@ -187,10 +193,5 @@ int mca_bcol_ptpcoll_register_mca_params(void)
                   "User memory can be used by the collective algorithms",
                   1, &cm->super.can_use_user_buffers));

-    CHECK(reg_int("use_brucks_smsg_alltoall_rdma", NULL,
-                  "Use brucks algorithm for smsg alltoall and RDMA semantics: "
-                  "1 = Alg with no Temp Buffer Recycling (faster), 2 = Alg with temp Buffer Recycling (slower)",
-                  0, &cm->use_brucks_smsg_alltoall_rdma, 0));
-
     return ret;
 }
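For reference, the registration pattern above is straightforward to extend. For illustration only (not part of this change), a fragment that would sit inside mca_bcol_ptpcoll_register_mca_params() next to the calls shown; the parameter name and the cm->example_radix field are hypothetical, while reg_int() and CHECK() are the file-local helpers this commit already uses:

    /* Illustration only: hypothetical parameter and target field */
    CHECK(reg_int("example_radix", NULL,
                  "Radix of a hypothetical example tree "
                  "(starts from 2)", 2, &cm->example_radix, REGINT_GE_ONE));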

@@ -1,8 +1,9 @@
 /* -*- Mode: C; c-basic-offset:4 ; indent-tabs-mode:nil -*- */
 /*
- * Copyright (c) 2009-2012 Oak Ridge National Laboratory.  All rights reserved.
+ * Copyright (c) 2009-2013 Oak Ridge National Laboratory.  All rights reserved.
  * Copyright (c) 2009-2012 Mellanox Technologies.  All rights reserved.
- * Copyright (c) 2012      Los Alamos National Security, LLC.
- *                         All rights reserved.
+ * Copyright (c) 2012-2013 Los Alamos National Security, LLC. All rights
+ *                         reserved.
  * $COPYRIGHT$
  *
  * Additional copyrights may follow

@@ -34,17 +35,61 @@
 #include "bcol_ptpcoll.h"
 #include "bcol_ptpcoll_utils.h"
 #include "bcol_ptpcoll_bcast.h"
+#include "bcol_ptpcoll_allreduce.h"
+#include "bcol_ptpcoll_reduce.h"

 #define BCOL_PTP_CACHE_LINE_SIZE 128

 /*
  * Local functions
  */
+static int alloc_allreduce_offsets_array(mca_bcol_ptpcoll_module_t *ptpcoll_module)
+{
+    int rc = OMPI_SUCCESS, i = 0;
+    netpatterns_k_exchange_node_t *k_node = &ptpcoll_module->knomial_exchange_tree;
+    int n_exchanges = k_node->n_exchanges;
+
+    /* Precalculate the allreduce offsets */
+    if (0 < k_node->n_exchanges) {
+        ptpcoll_module->allgather_offsets = (int **) malloc(n_exchanges * sizeof(int *));
+
+        if (!ptpcoll_module->allgather_offsets) {
+            rc = OMPI_ERROR;
+            return rc;
+        }
+
+        for (i = 0; i < n_exchanges; i++) {
+            ptpcoll_module->allgather_offsets[i] = (int *) malloc(sizeof(int) * NOFFSETS);
+
+            if (!ptpcoll_module->allgather_offsets[i]) {
+                rc = OMPI_ERROR;
+                return rc;
+            }
+        }
+    }
+    return rc;
+}
+
+static int free_allreduce_offsets_array(mca_bcol_ptpcoll_module_t *ptpcoll_module)
+{
+    int rc = OMPI_SUCCESS, i = 0;
+    netpatterns_k_exchange_node_t *k_node = &ptpcoll_module->knomial_exchange_tree;
+    int n_exchanges = k_node->n_exchanges;
+
+    if (ptpcoll_module->allgather_offsets) {
+        for (i = 0; i < n_exchanges; i++) {
+            free(ptpcoll_module->allgather_offsets[i]);
+        }
+    }
+
+    free(ptpcoll_module->allgather_offsets);
+    return rc;
+}
+
 static void
 mca_bcol_ptpcoll_module_construct(mca_bcol_ptpcoll_module_t *ptpcoll_module)
 {
-    int i;
+    uint64_t i;
     /* Pointer to component */
     ptpcoll_module->super.bcol_component = (mca_bcol_base_component_t *) &mca_bcol_ptpcoll_component;
     ptpcoll_module->super.list_n_connected = NULL;

@@ -56,7 +101,7 @@ mca_bcol_ptpcoll_module_construct(mca_bcol_ptpcoll_module_t *ptpcoll_module)
     /* set the upper limit on the tag */
     i = 2;
     ptpcoll_module->tag_mask = 1;
-    while ( i <= mca_pml.pml_max_tag && i > 0) {
+    while ( i <= (uint64_t) mca_pml.pml_max_tag && i > 0) {
         i <<= 1;
     }
     ptpcoll_module->ml_mem.ml_buf_desc = NULL;

@@ -84,6 +129,9 @@ mca_bcol_ptpcoll_module_destruct(mca_bcol_ptpcoll_module_t *ptpcoll_module)
         free(ml_mem->ml_buf_desc);
     }

+    if (NULL != ptpcoll_module->allgather_offsets) {
+        free_allreduce_offsets_array(ptpcoll_module);
+    }
+
     if (NULL != ptpcoll_module->narray_node) {
         for (i = 0; i < ptpcoll_module->group_size; i++) {

@@ -111,7 +159,7 @@ OBJ_CLASS_INSTANCE(mca_bcol_ptpcoll_module_t,
                    mca_bcol_ptpcoll_module_destruct);

 static int init_ml_buf_desc(mca_bcol_ptpcoll_ml_buffer_desc_t **desc, void *base_addr, uint32_t num_banks,
                             uint32_t num_buffers_per_bank, uint32_t size_buffer, uint32_t header_size, int group_size, int pow_k)
 {
     uint32_t i, j, ci;
     mca_bcol_ptpcoll_ml_buffer_desc_t *tmp_desc = NULL;

@@ -124,7 +172,7 @@ static int init_ml_buf_desc(mca_bcol_ptpcoll_ml_buffer_desc_t **desc, void *base

     *desc = (mca_bcol_ptpcoll_ml_buffer_desc_t *) calloc(num_banks * num_buffers_per_bank,
                                                          sizeof(mca_bcol_ptpcoll_ml_buffer_desc_t));
     if (NULL == *desc) {
         PTPCOLL_ERROR(("Failed to allocate memory"));
         return OMPI_ERROR;

@@ -151,6 +199,10 @@ static int init_ml_buf_desc(mca_bcol_ptpcoll_ml_buffer_desc_t **desc, void *base
             tmp_desc[ci].data_addr = (void *)
                 ((unsigned char *) base_addr + ci * size_buffer + header_size);
             PTPCOLL_VERBOSE(10, ("ml memory cache setup %d %d - %p", i, j, tmp_desc[ci].data_addr));
+
+            /* init reduce implementation flags */
+            tmp_desc[ci].reduce_init_called = false;
+            tmp_desc[ci].reduction_status = 0;
         }
     }

@@ -160,24 +212,33 @@ static int init_ml_buf_desc(mca_bcol_ptpcoll_ml_buffer_desc_t **desc, void *base
 static void mca_bcol_ptpcoll_set_small_msg_thresholds(struct mca_bcol_base_module_t *super)
 {
     mca_bcol_ptpcoll_module_t *ptpcoll_module =
         (mca_bcol_ptpcoll_module_t *) super;
+    mca_bcol_ptpcoll_component_t *cm = &mca_bcol_ptpcoll_component;
+
+    /* Subtract out the maximum header size when calculating the thresholds. This
+     * will account for the headers used by the basesmuma component. If we do not
+     * take these headers into account we may overrun our buffer. */

     /* Set the Allgather threshold equal to the ML buffer size */
     super->small_message_thresholds[BCOL_ALLGATHER] =
-        ptpcoll_module->ml_mem.size_buffer /
+        (ptpcoll_module->ml_mem.size_buffer - BCOL_HEADER_MAX) /
         ompi_comm_size(ptpcoll_module->super.sbgp_partner_module->group_comm);

     /* Set the Bcast threshold; all Bcast algorithms have the same threshold */
     super->small_message_thresholds[BCOL_BCAST] =
-        ptpcoll_module->ml_mem.size_buffer;
+        (ptpcoll_module->ml_mem.size_buffer - BCOL_HEADER_MAX);

     /* Set the Alltoall threshold; the Ring algorithm sets some limitation */
     super->small_message_thresholds[BCOL_ALLTOALL] =
-        ptpcoll_module->ml_mem.size_buffer / 2;
+        (ptpcoll_module->ml_mem.size_buffer - BCOL_HEADER_MAX) / 2;

     /* Set the Allreduce threshold; the NARRAY algorithm sets some limitation */
     super->small_message_thresholds[BCOL_ALLREDUCE] =
-        ptpcoll_module->ml_mem.size_buffer / ptpcoll_module->k_nomial_radix;
+        (ptpcoll_module->ml_mem.size_buffer - BCOL_HEADER_MAX) / ptpcoll_module->k_nomial_radix;
+
+    /* Set the Reduce threshold; the NARRAY algorithm sets some limitation */
+    super->small_message_thresholds[BCOL_REDUCE] =
+        (ptpcoll_module->ml_mem.size_buffer - BCOL_HEADER_MAX) / cm->narray_radix;
 }

 /*

@@ -200,7 +261,7 @@ static int mca_bcol_ptpcoll_cache_ml_memory_info(struct mca_coll_ml_module_t *ml
     ml_mem->size_buffer = desc->size_buffer;

     PTPCOLL_VERBOSE(10, ("ML buffer configuration num banks %d num_per_bank %d size %d base addr %p",
                          desc->num_banks, desc->num_buffers_per_bank, desc->size_buffer, desc->block->base_addr));

     /* pointer to ml level descriptor */
     ml_mem->ml_mem_desc = desc;

@@ -209,19 +270,19 @@ static int mca_bcol_ptpcoll_cache_ml_memory_info(struct mca_coll_ml_module_t *ml
     ml_mem->bank_index_for_release = 0;

     if (OMPI_SUCCESS != init_ml_buf_desc(&ml_mem->ml_buf_desc,
                                          desc->block->base_addr,
                                          ml_mem->num_banks,
                                          ml_mem->num_buffers_per_bank,
                                          ml_mem->size_buffer,
                                          ml_module->data_offset,
                                          group_size,
                                          ptpcoll_module->pow_k)) {
         PTPCOLL_VERBOSE(10, ("Failed to allocate rdma memory descriptor\n"));
         return OMPI_ERROR;
     }

     PTPCOLL_VERBOSE(10, ("ml_module = %p, ptpcoll_module = %p, ml_mem_desc = %p.\n",
                          ml_module, ptpcoll_module, ml_mem->ml_mem_desc));

     return OMPI_SUCCESS;
 }

@@ -244,11 +305,12 @@ static void load_func(mca_bcol_ptpcoll_module_t *ptpcoll_module)
     ptpcoll_module->super.bcol_function_init_table[BCOL_BARRIER] = bcol_ptpcoll_barrier_init;

     ptpcoll_module->super.bcol_function_init_table[BCOL_BCAST] = bcol_ptpcoll_bcast_init;
-    ptpcoll_module->super.bcol_function_init_table[BCOL_ALLREDUCE] = NULL;
-    ptpcoll_module->super.bcol_function_init_table[BCOL_ALLGATHER] = NULL;
+    ptpcoll_module->super.bcol_function_init_table[BCOL_ALLREDUCE] = bcol_ptpcoll_allreduce_init;
+    ptpcoll_module->super.bcol_function_init_table[BCOL_ALLGATHER] = bcol_ptpcoll_allgather_init;
     ptpcoll_module->super.bcol_function_table[BCOL_BCAST] = bcol_ptpcoll_bcast_k_nomial_anyroot;
     ptpcoll_module->super.bcol_function_init_table[BCOL_ALLTOALL] = NULL;
     ptpcoll_module->super.bcol_function_init_table[BCOL_SYNC] = mca_bcol_ptpcoll_memsync_init;
+    ptpcoll_module->super.bcol_function_init_table[BCOL_REDUCE] = bcol_ptpcoll_reduce_init;

     /* ML memory cacher */
     ptpcoll_module->super.bcol_memory_init = mca_bcol_ptpcoll_cache_ml_memory_info;

@@ -262,16 +324,17 @@ static void load_func(mca_bcol_ptpcoll_module_t *ptpcoll_module)

 int mca_bcol_ptpcoll_setup_knomial_tree(mca_bcol_base_module_t *super)
 {
     mca_bcol_ptpcoll_module_t *p2p_module = (mca_bcol_ptpcoll_module_t *) super;
     int rc = 0;

     rc = netpatterns_setup_recursive_knomial_allgather_tree_node(
             p2p_module->super.sbgp_partner_module->group_size,
             p2p_module->super.sbgp_partner_module->my_index,
             mca_bcol_ptpcoll_component.k_nomial_radix,
             super->list_n_connected,
             &p2p_module->knomial_allgather_tree);

     return rc;
 }

 /* The function used to calculate size */

@@ -301,9 +364,9 @@ static int load_narray_knomial_tree (mca_bcol_ptpcoll_module_t *ptpcoll_module)
     mca_bcol_ptpcoll_component_t *cm = &mca_bcol_ptpcoll_component;

     ptpcoll_module->full_narray_tree_size = calc_full_tree_size(
             cm->narray_knomial_radix,
             ptpcoll_module->group_size,
             &ptpcoll_module->full_narray_tree_num_leafs);

     ptpcoll_module->narray_knomial_proxy_extra_index = (int *)
         malloc(sizeof(int) * (cm->narray_knomial_radix));

@@ -313,21 +376,21 @@ static int load_narray_knomial_tree (mca_bcol_ptpcoll_module_t *ptpcoll_module)
     }

     ptpcoll_module->narray_knomial_node = calloc(
             ptpcoll_module->full_narray_tree_size,
             sizeof(netpatterns_narray_knomial_tree_node_t));
     if (NULL == ptpcoll_module->narray_knomial_node) {
         goto Error;
     }

     PTPCOLL_VERBOSE(10, ("My type is proxy, full tree size = %d [%d]",
                          ptpcoll_module->full_narray_tree_size,
                          cm->narray_knomial_radix));

     if (ptpcoll_module->super.sbgp_partner_module->my_index <
             ptpcoll_module->full_narray_tree_size) {
         if (ptpcoll_module->super.sbgp_partner_module->my_index <
                 ptpcoll_module->group_size - ptpcoll_module->full_narray_tree_size) {
             ptpcoll_module->narray_type = PTPCOLL_PROXY;
             for (i = 0; i < cm->narray_knomial_radix; i++) {
                 peer =

@@ -346,10 +409,10 @@ static int load_narray_knomial_tree (mca_bcol_ptpcoll_module_t *ptpcoll_module)
     /* Setting node info */
     for (i = 0; i < ptpcoll_module->full_narray_tree_size; i++) {
         rc = netpatterns_setup_narray_knomial_tree(
                 cm->narray_knomial_radix,
                 i,
                 ptpcoll_module->full_narray_tree_size,
                 &ptpcoll_module->narray_knomial_node[i]);
         if (OMPI_SUCCESS != rc) {
             goto Error;
         }

@@ -381,17 +444,17 @@ static int load_narray_tree(mca_bcol_ptpcoll_module_t *ptpcoll_module)
     mca_bcol_ptpcoll_component_t *cm = &mca_bcol_ptpcoll_component;

     ptpcoll_module->narray_node = calloc(ptpcoll_module->group_size,
                                          sizeof(netpatterns_tree_node_t));
     if (NULL == ptpcoll_module->narray_node) {
         goto Error;
     }

     for (i = 0; i < ptpcoll_module->group_size; i++) {
         rc = netpatterns_setup_narray_tree(
                 cm->narray_radix,
                 i,
                 ptpcoll_module->group_size,
                 &ptpcoll_module->narray_node[i]);
         if (OMPI_SUCCESS != rc) {
             goto Error;
         }

@@ -417,8 +480,8 @@ static int load_knomial_info(mca_bcol_ptpcoll_module_t *ptpcoll_module)
         cm->k_nomial_radix;

     ptpcoll_module->pow_k = pow_k_calc(ptpcoll_module->k_nomial_radix,
                                        ptpcoll_module->group_size,
                                        &ptpcoll_module->pow_knum);

     ptpcoll_module->kn_proxy_extra_index = (int *)
         malloc(sizeof(int) * (ptpcoll_module->k_nomial_radix - 1));

@@ -430,37 +493,37 @@ static int load_knomial_info(mca_bcol_ptpcoll_module_t *ptpcoll_module)
     /* Setting peer type for K-nomial algorithm */
     if (ptpcoll_module->super.sbgp_partner_module->my_index < ptpcoll_module->pow_knum) {
         if (ptpcoll_module->super.sbgp_partner_module->my_index <
                 ptpcoll_module->group_size - ptpcoll_module->pow_knum) {
             for (i = 0;
                     i < (ptpcoll_module->k_nomial_radix - 1) &&
                     ptpcoll_module->super.sbgp_partner_module->my_index *
                     (ptpcoll_module->k_nomial_radix - 1) +
                     i + ptpcoll_module->pow_knum < ptpcoll_module->group_size
                     ; i++) {
                 ptpcoll_module->pow_ktype = PTPCOLL_KN_PROXY;
                 ptpcoll_module->kn_proxy_extra_index[i] =
                     ptpcoll_module->super.sbgp_partner_module->my_index *
                     (ptpcoll_module->k_nomial_radix - 1) +
                     i + ptpcoll_module->pow_knum;
                 PTPCOLL_VERBOSE(10, ("My type is proxy, pow_knum = %d [%d] my extra %d",
                                      ptpcoll_module->pow_knum,
                                      ptpcoll_module->pow_k,
                                      ptpcoll_module->kn_proxy_extra_index[i]));
             }
             ptpcoll_module->kn_proxy_extra_num = i;
         } else {
             PTPCOLL_VERBOSE(10, ("My type is in group, pow_knum = %d [%d]", ptpcoll_module->pow_knum,
                                  ptpcoll_module->pow_k));
             ptpcoll_module->pow_ktype = PTPCOLL_KN_IN_GROUP;
         }
     } else {
         ptpcoll_module->pow_ktype = PTPCOLL_KN_EXTRA;
         ptpcoll_module->kn_proxy_extra_index[0] = (ptpcoll_module->super.sbgp_partner_module->my_index -
                 ptpcoll_module->pow_knum) / (ptpcoll_module->k_nomial_radix - 1);
         PTPCOLL_VERBOSE(10, ("My type is extra, pow_knum = %d [%d] my proxy %d",
                              ptpcoll_module->pow_knum,
                              ptpcoll_module->pow_k,
                              ptpcoll_module->kn_proxy_extra_index[0]));
     }

     return OMPI_SUCCESS;

@@ -476,8 +539,8 @@ Error:
 static int load_binomial_info(mca_bcol_ptpcoll_module_t *ptpcoll_module)
 {
     ptpcoll_module->pow_2 = pow_k_calc(2,
                                        ptpcoll_module->group_size,
                                        &ptpcoll_module->pow_2num);

     assert(ptpcoll_module->pow_2num == 1 << ptpcoll_module->pow_2);
     assert(ptpcoll_module->pow_2num <= ptpcoll_module->group_size);

@@ -485,20 +548,20 @@ static int load_binomial_info(mca_bcol_ptpcoll_module_t *ptpcoll_module)
     /* Setting peer type for binary algorithm */
     if (ptpcoll_module->super.sbgp_partner_module->my_index < ptpcoll_module->pow_2num) {
         if (ptpcoll_module->super.sbgp_partner_module->my_index <
                 ptpcoll_module->group_size - ptpcoll_module->pow_2num) {
             PTPCOLL_VERBOSE(10, ("My type is proxy, pow_2num = %d [%d]", ptpcoll_module->pow_2num,
                                  ptpcoll_module->pow_2));
             ptpcoll_module->pow_2type = PTPCOLL_PROXY;
             ptpcoll_module->proxy_extra_index = ptpcoll_module->super.sbgp_partner_module->my_index +
                 ptpcoll_module->pow_2num;
         } else {
             PTPCOLL_VERBOSE(10, ("My type is in group, pow_2num = %d [%d]", ptpcoll_module->pow_2num,
                                  ptpcoll_module->pow_2));
             ptpcoll_module->pow_2type = PTPCOLL_IN_GROUP;
         }
     } else {
         PTPCOLL_VERBOSE(10, ("My type is extra, pow_2num = %d [%d]", ptpcoll_module->pow_2num,
                              ptpcoll_module->pow_2));
         ptpcoll_module->pow_2type = PTPCOLL_EXTRA;
         ptpcoll_module->proxy_extra_index = ptpcoll_module->super.sbgp_partner_module->my_index -
             ptpcoll_module->pow_2num;

@@ -510,10 +573,10 @@ static int load_recursive_knomial_info(mca_bcol_ptpcoll_module_t *ptpcoll_module
 {
     int rc = OMPI_SUCCESS;
     rc = netpatterns_setup_recursive_knomial_tree_node(
             ptpcoll_module->group_size,
             ptpcoll_module->super.sbgp_partner_module->my_index,
             mca_bcol_ptpcoll_component.k_nomial_radix,
             &ptpcoll_module->knomial_exchange_tree);
     return rc;
 }

@@ -523,14 +586,14 @@ static void bcol_ptpcoll_collreq_init(ompi_free_list_item_t *item, void* ctx)
     mca_bcol_ptpcoll_collreq_t *collreq = (mca_bcol_ptpcoll_collreq_t *) item;

     switch (mca_bcol_ptpcoll_component.barrier_alg) {
     case 1:
         collreq->requests = (ompi_request_t **)
             calloc(2, sizeof(ompi_request_t *));
         break;
     case 2:
         collreq->requests = (ompi_request_t **)
             calloc(2 * ptpcoll_module->k_nomial_radix, sizeof(ompi_request_t *));
         break;
     }
 }

@@ -539,7 +602,7 @@ static void bcol_ptpcoll_collreq_init(ompi_free_list_item_t *item, void* ctx)
  * the backing shared-memory file is created.
  */
 mca_bcol_base_module_t **mca_bcol_ptpcoll_comm_query(mca_sbgp_base_module_t *sbgp,
                                                      int *num_modules)
 {
     int rc;
     /* local variables */

@@ -612,33 +675,32 @@ mca_bcol_base_module_t **mca_bcol_ptpcoll_comm_query(mca_sbgp_base_module_t *sbg
     /* creating collfrag free list */
     OBJ_CONSTRUCT(&ptpcoll_module->collreqs_free, ompi_free_list_t);
     rc = ompi_free_list_init_ex_new(&ptpcoll_module->collreqs_free,
                                     sizeof(mca_bcol_ptpcoll_collreq_t),
                                     BCOL_PTP_CACHE_LINE_SIZE,
                                     OBJ_CLASS(mca_bcol_ptpcoll_collreq_t),
                                     0, BCOL_PTP_CACHE_LINE_SIZE,
                                     256 /* free_list_num */,
                                     -1 /* free_list_max, -1 = infinite */,
                                     32 /* free_list_inc */,
                                     NULL,
                                     bcol_ptpcoll_collreq_init,
                                     ptpcoll_module);
     if (OMPI_SUCCESS != rc) {
         goto CLEANUP;
     }

     load_func(ptpcoll_module);
+    /*
+    rc = alloc_allreduce_offsets_array(ptpcoll_module);
+    if (OMPI_SUCCESS != rc) {
+        goto CLEANUP;
+    }
+    */

     /* Allocating iovec for PTP alltoall */
     iovec_size = ptpcoll_module->group_size / 2 + ptpcoll_module->group_size % 2;
     ptpcoll_module->alltoall_iovec = (struct iovec *) malloc(sizeof(struct iovec)
                                                              * iovec_size);
     ptpcoll_module->log_group_size = lognum(ptpcoll_module->group_size);

     rc = mca_bcol_base_bcol_fns_table_init(&(ptpcoll_module->super));
     if (OMPI_SUCCESS != rc) {
ompi/mca/bcol/ptpcoll/bcol_ptpcoll_reduce.c (new file, 406 lines)
@@ -0,0 +1,406 @@
/* -*- Mode: C; c-basic-offset:4 ; indent-tabs-mode:nil -*- */
/*
 * Copyright (c) 2009-2013 Oak Ridge National Laboratory.  All rights reserved.
 * Copyright (c) 2009-2012 Mellanox Technologies.  All rights reserved.
 * Copyright (c) 2013      Los Alamos National Security, LLC. All rights
 *                         reserved.
 * $COPYRIGHT$
 *
 * Additional copyrights may follow
 *
 * $HEADER$
 */

#include "ompi_config.h"

#include "ompi/include/ompi/constants.h"
#include "ompi/mca/coll/ml/coll_ml.h"
#include "ompi/mca/bcol/bcol.h"
#include "bcol_ptpcoll_reduce.h"
#include "bcol_ptpcoll_utils.h"

static int bcol_ptpcoll_reduce_narray_progress(bcol_function_args_t *input_args,
                                               struct coll_ml_function_t *const_args);

static int bcol_ptpcoll_reduce_narray(bcol_function_args_t *input_args,
                                      struct coll_ml_function_t *const_args);

#define NARRAY_RECV_NB(narray_node, process_shift, group_size,              \
                       recv_buffer, pack_len, tag, comm, recv_requests,     \
                       num_pending_recvs)                                   \
do {                                                                        \
    int n, rc = OMPI_SUCCESS;                                               \
    int dst;                                                                \
    int comm_dst;                                                           \
    int offset = 0;                                                         \
                                                                            \
    /* Receive data from all relevant children */                           \
    for (n = 0; n < narray_node->n_children; n++) {                         \
                                                                            \
        dst = narray_node->children_ranks[n] + process_shift;               \
        if (dst >= group_size) {                                            \
            dst -= group_size;                                              \
        }                                                                   \
        comm_dst = group_list[dst];                                         \
                                                                            \
        /* Non-blocking receive .... */                                     \
        PTPCOLL_VERBOSE(1, ("Reduce, irecv data from %d[%d], count %d, tag %d, addr %p", \
                            dst, comm_dst, pack_len, tag,                   \
                            data_buffer));                                  \
        rc = MCA_PML_CALL(irecv((void *)((unsigned char *) recv_buffer + offset), pack_len, MPI_BYTE, \
                                comm_dst, tag, comm,                        \
                                &(recv_requests[*num_pending_recvs])));     \
        if( OMPI_SUCCESS != rc ) {                                          \
            PTPCOLL_VERBOSE(10, ("Failed to start non-blocking receive"));  \
            return OMPI_ERROR;                                              \
        }                                                                   \
        ++(*num_pending_recvs);                                             \
        offset += pack_len;                                                 \
    }                                                                       \
} while(0)


static inline int narray_reduce(void *data_buffer, void *recv_buffer,
                                int nrecvs, int count,
                                struct ompi_datatype_t *dtype, struct ompi_op_t *op,
                                int *reduction_status)
{
    int pack_len = count * dtype->super.size;
    int i = 0;
    void *source_buffer = NULL, *result_buffer = NULL;

    source_buffer = data_buffer;
    result_buffer = recv_buffer;

    for (i = 0; i < nrecvs; i++) {
        ompi_op_reduce(op, (void *)((unsigned char *) source_buffer),
                       (void *)((unsigned char *) result_buffer),
                       count, dtype);

        source_buffer = (void *)((unsigned char *) recv_buffer
                                 + (i + 1) * pack_len);
    }

    *reduction_status = 1;
    return OMPI_SUCCESS;
}
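narray_reduce() folds the local contribution into the first child's block and then folds each remaining child block into that running result, so the final value ends up at the start of the receive buffer. For illustration only (not part of this change), a self-contained sketch of the same fold using plain integers and a sum operator; the values are made up:

/* Illustration only: the same accumulation order with a sum op,
 * three children, and one int per rank. */
#include <stdio.h>

int main(void)
{
    int local = 1;                /* this rank's contribution */
    int recv[3] = {2, 3, 4};      /* one block per child, packed contiguously */
    int nrecvs = 3, i;

    recv[0] += local;             /* step 0: result lives in recv[0] */
    for (i = 1; i < nrecvs; i++) {
        recv[0] += recv[i];       /* fold each remaining child block in */
    }
    printf("reduced value: %d\n", recv[0]); /* prints 10 */
    return 0;
}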
|
||||
static int bcol_ptpcoll_reduce_narray_progress(bcol_function_args_t *input_args,
|
||||
struct coll_ml_function_t *const_args)
|
||||
{
|
||||
mca_bcol_ptpcoll_module_t *ptpcoll_module = (mca_bcol_ptpcoll_module_t *)const_args->bcol_module;
|
||||
|
||||
int tag = -1;
|
||||
int rc;
|
||||
int group_size = ptpcoll_module->group_size;
|
||||
int *group_list = ptpcoll_module->super.sbgp_partner_module->group_list;
|
||||
uint32_t buffer_index = input_args->buffer_index;
|
||||
struct ompi_op_t *op = input_args->op;
|
||||
ompi_communicator_t* comm = ptpcoll_module->super.sbgp_partner_module->group_comm;
|
||||
ompi_request_t **send_request =
|
||||
&ptpcoll_module->ml_mem.ml_buf_desc[buffer_index].requests[0];
|
||||
ompi_request_t **recv_requests =
|
||||
&ptpcoll_module->ml_mem.ml_buf_desc[buffer_index].requests[1];
|
||||
void *data_buffer = NULL;
|
||||
void *src_buffer = (void *) (
|
||||
(unsigned char *)input_args->sbuf +
|
||||
(size_t)input_args->sbuf_offset);
|
||||
void *recv_buffer = (void *) (
|
||||
(unsigned char *)input_args->rbuf +
|
||||
(size_t)input_args->rbuf_offset);
|
||||
int count = input_args->count;
|
||||
struct ompi_datatype_t *dtype = input_args->dtype;
|
||||
int pack_len = input_args->count * input_args->dtype->super.size;
|
||||
int *active_requests =
|
||||
&(ptpcoll_module->ml_mem.ml_buf_desc[buffer_index].active_requests);
|
||||
int matched = false;
|
||||
int my_group_index = ptpcoll_module->super.sbgp_partner_module->my_index;
|
||||
int relative_group_index = 0;
|
||||
netpatterns_tree_node_t *narray_node = NULL;
|
||||
bool not_sent = false;
|
||||
int parent_rank = -1, comm_parent_rank = -1;
|
||||
int group_root_index = input_args->root;
|
||||
|
||||
if (!ptpcoll_module->ml_mem.ml_buf_desc[buffer_index].reduce_init_called) {
|
||||
bcol_ptpcoll_reduce_narray(input_args, const_args);
|
||||
}
|
||||
/*
|
||||
* By default the src buffer is the data buffer,
|
||||
* only after reduction, the recv buffer becomes the
|
||||
* data buffer
|
||||
*/
|
||||
data_buffer = src_buffer;
|
||||
|
||||
relative_group_index = my_group_index - group_root_index;
|
||||
if (relative_group_index < 0) {
|
||||
relative_group_index +=group_size;
|
||||
}
|
||||
|
||||
/* keep tag within the limit support by the pml */
|
||||
tag = (PTPCOLL_TAG_OFFSET + input_args->sequence_num * PTPCOLL_TAG_FACTOR) & (ptpcoll_module->tag_mask);
|
||||
/* mark this as a collective tag, to avoid conflict with user-level tags */
|
||||
tag = -tag;
|
||||
|
||||
narray_node = &ptpcoll_module->narray_node[relative_group_index];
|
||||
|
||||
PTPCOLL_VERBOSE(3, ("reduce, Narray tree Progress"));
|
||||
|
||||
PTPCOLL_VERBOSE(8, ("bcol_ptpcoll_reduce_narray, buffer index: %d "
|
||||
"tag: %d "
|
||||
"tag_mask: %d "
|
||||
"sn: %d "
|
||||
"root: %d [%d]"
|
||||
"buff: %p ",
|
||||
buffer_index, tag,
|
||||
ptpcoll_module->tag_mask, input_args->sequence_num,
|
||||
input_args->root_flag, input_args->root_route->rank,
|
||||
data_buffer));
|
||||
|
||||
/*
|
||||
Check if the data was received
|
||||
*/
|
||||
if (0 != *active_requests) {
|
||||
matched = mca_bcol_ptpcoll_test_all_for_match
|
||||
(active_requests, recv_requests, &rc);
|
||||
if (OMPI_SUCCESS != rc) {
|
||||
return OMPI_ERROR;
|
||||
}
|
||||
|
||||
|
||||
/* All data was received, then do a reduction*/
|
||||
if(matched) {
|
||||
narray_reduce(data_buffer, recv_buffer, narray_node->n_children, count, dtype, op,
|
||||
&ptpcoll_module->ml_mem.ml_buf_desc[buffer_index].reduction_status);
|
||||
|
||||
/*
|
||||
* The reduction result is in the recv buffer, so it is the new data
|
||||
* buffer
|
||||
*/
|
||||
data_buffer = recv_buffer;
|
||||
|
||||
/* If not reduced, means also, you might not posted a send */
|
||||
not_sent = true;
|
||||
} else {
|
||||
PTPCOLL_VERBOSE(10, ("reduce root is started"));
|
||||
return BCOL_FN_STARTED;
|
||||
}
|
||||
}
|
||||
|
||||
/* I'm root, I'm done */
|
||||
if (input_args->root_flag) {
|
||||
return BCOL_FN_COMPLETE;
|
||||
}
|
||||
|
||||
PTPCOLL_VERBOSE(1,("Testing Sending Match"));
|
||||
|
||||
/* If send was not posted */
|
||||
/* Manju: Leaf node should never post in the progress logic */
|
||||
if (not_sent) {
|
||||
parent_rank =
|
||||
ptpcoll_module->narray_node[relative_group_index].parent_rank +
|
||||
group_root_index;
|
||||
if (parent_rank >= group_size) {
|
||||
parent_rank -= group_size;
|
||||
}
|
||||
|
||||
comm_parent_rank = group_list[parent_rank];
|
||||
PTPCOLL_VERBOSE(1,("Sending data to %d ",comm_parent_rank));
|
||||
|
||||
rc = MCA_PML_CALL(isend(data_buffer, pack_len, MPI_BYTE,
|
||||
comm_parent_rank,
|
||||
tag, MCA_PML_BASE_SEND_STANDARD, comm, send_request));
|
||||
if( OMPI_SUCCESS != rc ) {
|
||||
PTPCOLL_VERBOSE(10, ("Failed to send data"));
|
||||
return OMPI_ERROR;
|
||||
}
|
||||
}
|
||||
|
||||
if (0 == mca_bcol_ptpcoll_test_for_match(send_request, &rc)) {
|
||||
PTPCOLL_VERBOSE(10, ("Test was not matched - %d", rc));
|
||||
/* Data has not been sent. Return that the collective has been stated
|
||||
* because we MUST call test on this request once it is finished to
|
||||
* ensure that it is properly freed. */
|
||||
return (OMPI_SUCCESS != rc) ? rc : BCOL_FN_STARTED;
|
||||
}
|
||||
|
||||
return BCOL_FN_COMPLETE;
|
||||
}
|
||||
|
||||
static int bcol_ptpcoll_reduce_narray(bcol_function_args_t *input_args,
|
||||
struct coll_ml_function_t *const_args)
|
||||
{
|
||||
    mca_bcol_ptpcoll_module_t *ptpcoll_module = (mca_bcol_ptpcoll_module_t *)const_args->bcol_module;
    int tag;
    int rc;
    int group_size = ptpcoll_module->group_size;
    int *group_list = ptpcoll_module->super.sbgp_partner_module->group_list;
    uint32_t buffer_index = input_args->buffer_index;

    struct ompi_op_t *op = input_args->op;
    ompi_communicator_t *comm = ptpcoll_module->super.sbgp_partner_module->group_comm;
    ompi_request_t **recv_requests =
        &ptpcoll_module->ml_mem.ml_buf_desc[buffer_index].requests[1];
    ompi_request_t **send_request =
        &ptpcoll_module->ml_mem.ml_buf_desc[buffer_index].requests[0];

    void *data_buffer = NULL;
    void *src_buffer = (void *) (
        (unsigned char *)input_args->sbuf +
        (size_t)input_args->sbuf_offset);
    void *recv_buffer = (void *) (
        (unsigned char *)input_args->rbuf +
        (size_t)input_args->rbuf_offset);
    int count = input_args->count;
    struct ompi_datatype_t *dtype = input_args->dtype;
    int pack_len = input_args->count * input_args->dtype->super.size;
    int *active_requests =
        &(ptpcoll_module->ml_mem.ml_buf_desc[buffer_index].active_requests);
    int matched = true;
    int my_group_index = ptpcoll_module->super.sbgp_partner_module->my_index;
    int group_root_index = -1;
    int relative_group_index = 0;
    netpatterns_tree_node_t *narray_node = NULL;
    int parent_rank = -1, comm_parent_rank = -1;

    /* This is the first function that should be called, not the progress
     * function. The fragmentation code relies on this, so switch from
     * progress to here. The flag records that we have entered this code.
     */
    ptpcoll_module->ml_mem.ml_buf_desc[buffer_index].reduce_init_called = true;

    PTPCOLL_VERBOSE(1, ("Reduce, Narray tree"));
    /* reset the active request counter */
    (*active_requests) = 0;
    /* keep the tag within the limit supported by the pml */
    tag = (PTPCOLL_TAG_OFFSET + input_args->sequence_num * PTPCOLL_TAG_FACTOR) & (ptpcoll_module->tag_mask);
    /* mark this as a collective tag, to avoid conflicts with user-level tags */
    tag = -tag;

    PTPCOLL_VERBOSE(1, ("bcol_ptpcoll_reduce_narray, buffer index: %d "
                        "tag: %d "
                        "tag_mask: %d "
                        "sn: %d "
                        "root: %d "
                        "buff: %p ",
                        buffer_index, tag,
                        ptpcoll_module->tag_mask, input_args->sequence_num,
                        input_args->root_flag,
                        src_buffer));

    /* Compute the root index shift */
    group_root_index = input_args->root;
    relative_group_index = my_group_index - group_root_index;
    if (relative_group_index < 0) {
        relative_group_index += group_size;
    }

    narray_node = &ptpcoll_module->narray_node[relative_group_index];

    if (0 == narray_node->n_children) {
        PTPCOLL_VERBOSE(10, ("I'm a leaf of the tree"));
        /*
         * Leaf of the tree: there is no local reduction to do,
         * just send the source data up to the parent.
         */
        data_buffer = src_buffer;
        goto NARRAY_SEND_DATA;
    }

    /* Not a leaf: either an internal node or the root */
    NARRAY_RECV_NB(narray_node, group_root_index, group_size,
                   recv_buffer, pack_len, tag, comm, recv_requests,
                   active_requests);

    /* We have not done the reduction yet */
    ptpcoll_module->ml_mem.ml_buf_desc[buffer_index].reduction_status = 0;

    /* We can not block, so run a couple of tests for data arrival */
    matched = mca_bcol_ptpcoll_test_all_for_match
        (active_requests, recv_requests, &rc);

    /* Check if we received the data */
    if (matched) {
        narray_reduce(src_buffer, recv_buffer, narray_node->n_children,
                      count, dtype, op, &ptpcoll_module->ml_mem.ml_buf_desc[buffer_index].reduction_status);
        PTPCOLL_VERBOSE(1, ("Reduce, received data from all children"));
        data_buffer = recv_buffer;
    } else {
        PTPCOLL_VERBOSE(1, ("reduce was started"));
        return BCOL_FN_STARTED;
    }

    /* I'm the root, I'm done */
    if (input_args->root_flag) {
        return BCOL_FN_COMPLETE;
    }

NARRAY_SEND_DATA:

    /*
     * Send the data (the reduction result in the case of internal nodes,
     * or just the source data in the case of leaf nodes) to the parent
     */
    narray_node = &ptpcoll_module->narray_node[relative_group_index];

    parent_rank =
        ptpcoll_module->narray_node[relative_group_index].parent_rank +
        group_root_index;
    if (parent_rank >= group_size) {
        parent_rank -= group_size;
    }

    comm_parent_rank = group_list[parent_rank];
    PTPCOLL_VERBOSE(1, ("Sending data to %d ", comm_parent_rank));

    rc = MCA_PML_CALL(isend(data_buffer, pack_len, MPI_BYTE,
                            comm_parent_rank,
                            tag, MCA_PML_BASE_SEND_STANDARD, comm, send_request));
    if (OMPI_SUCCESS != rc) {
        PTPCOLL_VERBOSE(10, ("Failed to send data"));
        return OMPI_ERROR;
    }

    /* We can not block, so run a couple of tests for send completion */
    if (0 == mca_bcol_ptpcoll_test_for_match(send_request, &rc)) {
        PTPCOLL_VERBOSE(10, ("The send was not matched - %d", rc));
        /* The send has not completed yet; report the collective as started */
        return (OMPI_SUCCESS != rc) ? rc : BCOL_FN_STARTED;
    }

    return BCOL_FN_COMPLETE;
}


int bcol_ptpcoll_reduce_init(mca_bcol_base_module_t *super)
{
    mca_bcol_base_coll_fn_comm_attributes_t comm_attribs;
    mca_bcol_base_coll_fn_invoke_attributes_t inv_attribs;

    PTPCOLL_VERBOSE(1, ("Initialization Reduce - Narray"));
    comm_attribs.bcoll_type = BCOL_REDUCE;
    comm_attribs.comm_size_min = 0;
    comm_attribs.comm_size_max = 1024 * 1024;
    comm_attribs.waiting_semantics = NON_BLOCKING;

    inv_attribs.bcol_msg_min = 0;
    inv_attribs.bcol_msg_max = 20000; /* range 1 */

    inv_attribs.datatype_bitmap = 0xffffffff;
    inv_attribs.op_types_bitmap = 0xffffffff;

    comm_attribs.data_src = DATA_SRC_KNOWN;
    mca_bcol_base_set_attributes(super, &comm_attribs, &inv_attribs,
                                 bcol_ptpcoll_reduce_narray,
                                 bcol_ptpcoll_reduce_narray_progress);

    return OMPI_SUCCESS;
}
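
The index rotation above appears twice (once for my own position, once for the
parent), so it is worth spelling out; a minimal standalone sketch, with
hypothetical helper names that are not part of the commit:

    /* Every process views the narray tree as if the root sat at index 0,
     * so group indices are rotated by the root's index. The branches
     * avoid taking '%' of a possibly negative value. */
    static inline int narray_relative_index(int my_index, int root_index,
                                            int group_size)
    {
        int rel = my_index - root_index;
        return (rel < 0) ? rel + group_size : rel;
    }

    /* Inverting the rotation maps a tree position (e.g. parent_rank) back
     * to a real group index. */
    static inline int narray_absolute_index(int rel_index, int root_index,
                                            int group_size)
    {
        int idx = rel_index + root_index;
        return (idx >= group_size) ? idx - group_size : idx;
    }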

ompi/mca/bcol/ptpcoll/bcol_ptpcoll_reduce.h (new file, 25 lines)
@@ -0,0 +1,25 @@
/* -*- Mode: C; c-basic-offset:4 ; indent-tabs-mode:nil -*- */
/*
 * Copyright (c) 2009-2013 Oak Ridge National Laboratory.  All rights reserved.
 * Copyright (c) 2009-2012 Mellanox Technologies.  All rights reserved.
 * $COPYRIGHT$
 *
 * Additional copyrights may follow
 *
 * $HEADER$
 */

#ifndef MCA_BCOL_PTPCOLL_REDUCE_H
#define MCA_BCOL_PTPCOLL_REDUCE_H

#include "ompi_config.h"
#include "bcol_ptpcoll.h"
#include "bcol_ptpcoll_utils.h"

BEGIN_C_DECLS

int bcol_ptpcoll_reduce_init(mca_bcol_base_module_t *super);

END_C_DECLS

#endif /* MCA_BCOL_PTPCOLL_REDUCE_H */

@@ -1,6 +1,8 @@
#
# Copyright (c) 2009-2012 Oak Ridge National Laboratory. All rights reserved.
# Copyright (c) 2009-2012 Mellanox Technologies.  All rights reserved.
# Copyright (c) 2013      Los Alamos National Security, LLC. All rights
#                         reserved.
# $COPYRIGHT$
#
# Additional copyrights may follow
@@ -30,12 +32,19 @@ sources = coll_ml.h \
        coll_ml_hier_algorithms.c \
        coll_ml_hier_algorithms_setup.c \
        coll_ml_hier_algorithms_bcast_setup.c \
        coll_ml_hier_algorithms_allreduce_setup.c \
        coll_ml_hier_algorithms_reduce_setup.c \
        coll_ml_hier_algorithms_common_setup.c \
        coll_ml_hier_algorithms_common_setup.h \
        coll_ml_hier_algorithms_allgather_setup.c \
        coll_ml_hier_algorithm_memsync_setup.c \
        coll_ml_custom_utils.h \
        coll_ml_custom_utils.c \
        coll_ml_hier_algorithms_ibarrier.c \
        coll_ml_progress.c \
        coll_ml_reduce.c \
        coll_ml_allreduce.c \
        coll_ml_allgather.c \
        coll_ml_mca.h \
        coll_ml_mca.c \
        coll_ml_lmngr.h \
@@ -1,6 +1,9 @@
/* -*- Mode: C; c-basic-offset:4 ; indent-tabs-mode:nil -*- */
/*
 * Copyright (c) 2009-2012 Oak Ridge National Laboratory.  All rights reserved.
 * Copyright (c) 2009-2012 Mellanox Technologies.  All rights reserved.
 * Copyright (c) 2013      Los Alamos National Security, LLC. All rights
 *                         reserved.
 * $COPYRIGHT$
 *
 * Additional copyrights may follow
@@ -90,6 +93,13 @@ enum {
    ML_NUM_OF_FUNCTIONS
};

/* ML broadcast algorithms */
enum {
    COLL_ML_STATIC_BCAST,
    COLL_ML_SEQ_BCAST,
    COLL_ML_UNKNOWN_BCAST,
};
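
A hedged sketch of how an enum like this is typically exposed as an MCA
variable (the registration below is an assumption for illustration, not code
from this commit; mca_base_var_enum_create and mca_base_component_var_register
are the standard MCA variable APIs):

    static mca_base_var_enum_value_t bcast_algorithms[] = {
        {COLL_ML_STATIC_BCAST, "static"},
        {COLL_ML_SEQ_BCAST,    "sequential"},
        {0, NULL}   /* sentinel */
    };

    /* ... inside the component's register function ... */
    mca_base_var_enum_t *new_enum = NULL;
    mca_coll_ml_component.bcast_algorithm = COLL_ML_STATIC_BCAST;  /* default */
    (void) mca_base_var_enum_create("coll_ml_bcast_algorithm",
                                    bcast_algorithms, &new_enum);
    (void) mca_base_component_var_register(&mca_coll_ml_component.super.collm_version,
                                           "bcast_algorithm",
                                           "Broadcast algorithm to use",
                                           MCA_BASE_VAR_TYPE_INT, new_enum,
                                           0, 0, OPAL_INFO_LVL_9,
                                           MCA_BASE_VAR_SCOPE_READONLY,
                                           &mca_coll_ml_component.bcast_algorithm);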

struct mca_bcol_base_module_t;
/* function description */
struct coll_ml_function_t {
@@ -210,8 +220,6 @@ typedef struct coll_ml_collective_description_t coll_ml_collective_description_t
struct rank_properties_t {
    int rank;
    int leaf;
    int n_connected_subgroups;
    int *list_connected_subgroups;
    int num_of_ranks_represented;
}; typedef struct rank_properties_t rank_properties_t;

@@ -237,17 +245,11 @@ struct sub_group_params_t {

    /*
     * level in the hierarchy - subgroups at the same
     * level don't overlap. May not be the same as the
     * sbgp level.
     */
    int level_in_hierarchy;

    /*
     * Connected nodes
     */
    int n_connected_nodes;
    int *list_connected_nodes;

    /*
     * Information on the ranks in the subgroup. This includes
     * the rank, and whether or not the rank is a source/sink of
@@ -255,11 +257,6 @@ struct sub_group_params_t {
     */
    rank_properties_t *rank_data;

    /*
     * Temp list of ranks
     */
    int *list_ranks;

    /* level one index - for example,
       for (i = 0; i < level_one_index; i++) will loop
       through all level one subgroups; this is significant
@@ -267,7 +264,6 @@ struct sub_group_params_t {
       i.e. all ranks appear once and only once at level one
     */
    int level_one_index;
};
typedef struct sub_group_params_t sub_group_params_t;

@@ -397,6 +393,7 @@ do { \
    /* pasha - why we duplicate it ? */                     \
    (coll_op)->variable_fn_params.src_desc = desc;          \
    (coll_op)->variable_fn_params.hier_factor = 1;          \
    (coll_op)->variable_fn_params.need_dt_support = false;  \
} while (0)

/* Full message descriptor */
@@ -486,9 +483,6 @@ struct mca_coll_ml_component_t {
    /** MCA parameter: Priority of this component */
    int ml_priority;

    /** MCA parameter: Number of levels */
    int ml_n_levels;

    /** MCA parameter: subgrouping components to use */
    char *subgroups_string;

@@ -519,17 +513,14 @@ struct mca_coll_ml_component_t {

    int use_knomial_allreduce;

    /* Use the global-knowledge bcast algorithm */
    bool use_static_bcast;

    /* use the hdl framework */
    bool use_hdl_bcast;

    /* Enable / Disable fragmentation (0 - off, 1 - on, 2 - auto) */
    int enable_fragmentation;

    /* Use the sequential bcast algorithm */
    bool use_sequential_bcast;
    /* Broadcast algorithm */
    int bcast_algorithm;

    /* frag size that is used by the list memory_manager */
    size_t lmngr_block_size;
@@ -584,6 +575,9 @@ struct mca_coll_ml_component_t {
    /* Temporary hack for the IMB test - not all bcols have alltoall */
    bool disable_alltoall;

    /* Disable Reduce */
    bool disable_reduce;

    /* Brucks alltoall mca and other params */
    int use_brucks_smsg_alltoall;

@@ -727,6 +721,11 @@ struct mca_coll_ml_module_t {
    mca_coll_ml_collective_operation_description_t *
        coll_ml_allreduce_functions[ML_NUM_ALLREDUCE_FUNCTIONS];

    /** Reduce functions */
    mca_coll_ml_collective_operation_description_t *
        coll_ml_reduce_functions[ML_NUM_REDUCE_FUNCTIONS];

    /** scatter */
    mca_coll_ml_collective_operation_description_t *
        coll_ml_scatter_functions[ML_NUM_SCATTER_FUNCTIONS];
@@ -764,9 +763,6 @@ struct mca_coll_ml_module_t {
    uint64_t fragment_size;
    uint32_t ml_fragment_size;

    /* For the carto graph */
    /* opal_carto_graph_t *sm_graph; */
    /* opal_carto_graph_t *ib_graph; */
    /* Bcast index table. Pasha: Do we need to define something more generic ?
       the table x 2 (large/small) */
    int bcast_fn_index_table[2];
@@ -784,6 +780,9 @@ struct mca_coll_ml_module_t {
    /* On this list we keep coll_op descriptors that were not
     * able to start, since no ml buffers were available */
    opal_list_t waiting_for_memory_list;

    /* fallback collectives */
    mca_coll_base_comm_coll_t fallback;
};

typedef struct mca_coll_ml_module_t mca_coll_ml_module_t;
@@ -812,25 +811,46 @@ int mca_coll_ml_ibarrier_intra(struct ompi_communicator_t *comm,
                               ompi_request_t **req,
                               mca_coll_base_module_t *module);

/* Allreduce with EXTRA TOPO - blocking */
int mca_coll_ml_allreduce_dispatch(void *sbuf, void *rbuf, int count,
        struct ompi_datatype_t *dtype, struct ompi_op_t *op,
        struct ompi_communicator_t *comm, mca_coll_base_module_t *module);

/* Allreduce with EXTRA TOPO - non-blocking */
int mca_coll_ml_allreduce_dispatch_nb(void *sbuf, void *rbuf, int count,
        ompi_datatype_t *dtype, ompi_op_t *op,
        ompi_communicator_t *comm,
        ompi_request_t **req,
        mca_coll_base_module_t *module);

/* Allreduce - blocking */
int mca_coll_ml_allreduce_intra(void *sbuf, void *rbuf, int count,
int mca_coll_ml_allreduce(void *sbuf, void *rbuf, int count,
        struct ompi_datatype_t *dtype, struct ompi_op_t *op,
        struct ompi_communicator_t *comm,
        mca_coll_base_module_t *module);

int mca_coll_ml_memsync_intra(mca_coll_ml_module_t *module, int bank_index);
/* Allreduce - non-blocking */
int mca_coll_ml_allreduce_nb(void *sbuf, void *rbuf, int count,
        struct ompi_datatype_t *dtype, struct ompi_op_t *op,
        struct ompi_communicator_t *comm,
        ompi_request_t **req,
        mca_coll_base_module_t *module);

/* Reduce - blocking */
int mca_coll_ml_reduce(void *sbuf, void *rbuf, int count,
        struct ompi_datatype_t *dtype, struct ompi_op_t *op,
        int root, struct ompi_communicator_t *comm,
        mca_coll_base_module_t *module);

int mca_coll_ml_reduce_nb(void *sbuf, void *rbuf, int count,
        struct ompi_datatype_t *dtype, struct ompi_op_t *op,
        int root, struct ompi_communicator_t *comm,
        ompi_request_t **req,
        mca_coll_base_module_t *module);

int mca_coll_ml_memsync_intra(mca_coll_ml_module_t *module, int bank_index);

int coll_ml_progress_individual_message(mca_coll_ml_fragment_t *frag_descriptor);

/*
@@ -902,6 +922,17 @@ int mca_coll_ml_fulltree_iboffload_only_hierarchy_discovery(mca_coll_ml_module_t

void mca_coll_ml_allreduce_matrix_init(mca_coll_ml_module_t *ml_module,
        const mca_bcol_base_component_2_0_0_t *bcol_component);

static inline int mca_coll_ml_err(const char* fmt, ...)
{
    va_list list;
    int ret;

    va_start(list, fmt);
    ret = vfprintf(stderr, fmt, list);
    va_end(list);
    return ret;
}

#define ML_ERROR(args) \
do { \

ompi/mca/coll/ml/coll_ml_allgather.c (new file, 631 lines)
@@ -0,0 +1,631 @@
/* -*- Mode: C; c-basic-offset:4 ; indent-tabs-mode:nil -*- */
/*
 * Copyright (c) 2009-2012 Oak Ridge National Laboratory.  All rights reserved.
 * Copyright (c) 2009-2012 Mellanox Technologies.  All rights reserved.
 * Copyright (c) 2013      Los Alamos National Security, LLC. All rights
 *                         reserved.
 * $COPYRIGHT$
 *
 * Additional copyrights may follow
 *
 * $HEADER$
 */
/** @file */

#include "ompi_config.h"

#include <stdlib.h>

#include "ompi/constants.h"
#include "opal/threads/mutex.h"
#include "ompi/communicator/communicator.h"
#include "ompi/mca/coll/coll.h"
#include "ompi/mca/bcol/bcol.h"
#include "opal/sys/atomic.h"
#include "coll_ml.h"
#include "coll_ml_select.h"
#include "coll_ml_allocation.h"

static int mca_coll_ml_allgather_small_unpack_data(mca_coll_ml_collective_operation_progress_t *coll_op)
{
    bool rcontig = coll_op->full_message.recv_data_continguous;
    int n_ranks_in_comm = ompi_comm_size(OP_ML_MODULE(coll_op)->comm);

    void *dest = (void *)((uintptr_t)coll_op->full_message.dest_user_addr +
                          (uintptr_t)coll_op->full_message.n_bytes_delivered);
    void *src = (void *)((uintptr_t)coll_op->fragment_data.buffer_desc->data_addr +
                         (size_t)coll_op->variable_fn_params.rbuf_offset);

    if (rcontig) {
        memcpy(dest, src, n_ranks_in_comm * coll_op->full_message.n_bytes_scheduled);
    } else {
        mca_coll_ml_convertor_unpack(src, n_ranks_in_comm * coll_op->full_message.n_bytes_scheduled,
                                     &coll_op->fragment_data.message_descriptor->recv_convertor);
    }

    return OMPI_SUCCESS;
}

static inline void copy_data (mca_coll_ml_collective_operation_progress_t *coll_op, rank_properties_t *rank_props, int soffset) {
    bool rcontig = coll_op->fragment_data.message_descriptor->recv_data_continguous;
    size_t total_bytes = coll_op->fragment_data.message_descriptor->n_bytes_total;
    size_t pack_len = coll_op->fragment_data.fragment_size;
    int doffset = rank_props->rank;
    void *dest, *src;

    src = (void *) ((uintptr_t)coll_op->fragment_data.buffer_desc->data_addr +
                    (size_t)coll_op->variable_fn_params.rbuf_offset + soffset * pack_len);

    if (rcontig) {
        dest = (void *) ((uintptr_t) coll_op->full_message.dest_user_addr +
                         (uintptr_t) coll_op->fragment_data.offset_into_user_buffer +
                         doffset * total_bytes);

        memcpy(dest, src, pack_len);
    } else {
        size_t position;
        opal_convertor_t *recv_convertor =
            &coll_op->fragment_data.message_descriptor->recv_convertor;

        position = (size_t) coll_op->fragment_data.offset_into_user_buffer +
            doffset * total_bytes;

        opal_convertor_set_position(recv_convertor, &position);
        mca_coll_ml_convertor_unpack(src, pack_len, recv_convertor);
    }
}

static int mca_coll_ml_allgather_noncontiguous_unpack_data(mca_coll_ml_collective_operation_progress_t *coll_op)
{
    int i, j, n_level_one_sbgps;
    size_t soffset;

    mca_coll_ml_topology_t *topo_info = coll_op->coll_schedule->topo_info;
    sub_group_params_t *array_of_all_subgroup_ranks = topo_info->array_of_all_subgroups;

    n_level_one_sbgps = array_of_all_subgroup_ranks->level_one_index;

    for (i = 0; i < n_level_one_sbgps; i++) {
        /* determine where in the source buffer the data can be found */
        soffset = array_of_all_subgroup_ranks[i].index_of_first_element;
        for (j = 0; j < array_of_all_subgroup_ranks[i].n_ranks; j++, ++soffset) {
            copy_data (coll_op, array_of_all_subgroup_ranks[i].rank_data + j, soffset);
        }
    }

    return OMPI_SUCCESS;
}

/* Allgather dependencies seem easy: everyone needs to work from the "bottom up".
 * Following Pasha, I too will put in the simplest dependency graph and change it later
 * when we add hierarchy. Basically, allgather has the same dependency profile as the
 * sequential broadcast, except that there is only a single ordering of tasks.
 */
static int mca_coll_ml_allgather_task_setup(mca_coll_ml_collective_operation_progress_t *coll_op)
{
    int fn_idx, h_level, my_index, root;
    mca_sbgp_base_module_t *sbgp;
    mca_coll_ml_topology_t *topo = coll_op->coll_schedule->topo_info;

    fn_idx = coll_op->sequential_routine.current_active_bcol_fn;
    h_level = coll_op->coll_schedule->component_functions[fn_idx].h_level;
    sbgp = topo->component_pairs[h_level].subgroup_module;
    my_index = sbgp->my_index;

    /* In the case of allgather, the local leader is always the root */
    root = 0;
    if (my_index == root) {
        coll_op->variable_fn_params.root_flag = true;
        coll_op->variable_fn_params.root_route = NULL;
    } else {
        coll_op->variable_fn_params.root_flag = false;
        coll_op->variable_fn_params.root_route = &topo->route_vector[root];
    }

    return OMPI_SUCCESS;
}

static int mca_coll_ml_allgather_frag_progress(mca_coll_ml_collective_operation_progress_t *coll_op)
{
    /* local variables */
    int ret;
    size_t frag_len, dt_size;

    void *buf;
    ml_payload_buffer_desc_t *src_buffer_desc;
    mca_coll_ml_collective_operation_progress_t *new_op;

    mca_coll_ml_module_t *ml_module = OP_ML_MODULE(coll_op);
    bool scontig = coll_op->fragment_data.message_descriptor->send_data_continguous;

    ompi_datatype_type_size(coll_op->variable_fn_params.dtype, &dt_size);
    /* Keep the pipeline filled with fragments */
    while (coll_op->fragment_data.message_descriptor->n_active <
           coll_op->fragment_data.message_descriptor->pipeline_depth) {
        /* If an active fragment happens to have completed the collective during
         * a hop into the progress engine, then don't launch a new fragment;
         * instead break and return.
         */
        if (coll_op->fragment_data.message_descriptor->n_bytes_scheduled
            == coll_op->fragment_data.message_descriptor->n_bytes_total) {
            break;
        }
        /* Get an ml buffer */
        src_buffer_desc = mca_coll_ml_alloc_buffer(ml_module);
        if (NULL == src_buffer_desc) {
            /* If there exist outstanding fragments, then break out
             * and let an active fragment deal with this later;
             * there are no buffers available.
             */
            if (0 < coll_op->fragment_data.message_descriptor->n_active) {
                return OMPI_SUCCESS;
            } else {
                /* The fragment is already on the list and
                 * we still have no ml resources.
                 * Return busy. */
                if (coll_op->pending & REQ_OUT_OF_MEMORY) {
                    ML_VERBOSE(10, ("Out of resources %p", coll_op));
                    return OMPI_ERR_TEMP_OUT_OF_RESOURCE;
                }

                coll_op->pending |= REQ_OUT_OF_MEMORY;
                opal_list_append(&((OP_ML_MODULE(coll_op))->waiting_for_memory_list),
                                 (opal_list_item_t *)coll_op);
                ML_VERBOSE(10, ("Out of resources %p adding to pending queue", coll_op));
                return OMPI_ERR_TEMP_OUT_OF_RESOURCE;
            }
        }

        /* Get a new collective descriptor and initialize it */
        new_op = mca_coll_ml_alloc_op_prog_single_frag_dag(ml_module,
                     ml_module->coll_ml_allgather_functions[ML_SMALL_DATA_ALLGATHER],
                     coll_op->fragment_data.message_descriptor->src_user_addr,
                     coll_op->fragment_data.message_descriptor->dest_user_addr,
                     coll_op->fragment_data.message_descriptor->n_bytes_total,
                     coll_op->fragment_data.message_descriptor->n_bytes_scheduled);

        new_op->fragment_data.current_coll_op = coll_op->fragment_data.current_coll_op;
        new_op->fragment_data.message_descriptor = coll_op->fragment_data.message_descriptor;

        /* set the task setup callback */
        new_op->sequential_routine.seq_task_setup = mca_coll_ml_allgather_task_setup;

        /*
        MCA_COLL_IBOFFLOAD_SET_ML_BUFFER_INFO(new_op,
                src_buffer_desc->buffer_index, src_buffer_desc);
        */

        /* We need this address for pointer arithmetic in memcpy */
        buf = coll_op->fragment_data.message_descriptor->src_user_addr;

        if (!scontig) {
            frag_len = ml_module->small_message_thresholds[BCOL_ALLGATHER];
            mca_coll_ml_convertor_get_send_frag_size(
                    ml_module, &frag_len,
                    coll_op->fragment_data.message_descriptor);

            mca_coll_ml_convertor_pack(
                    (void *) ((uintptr_t) src_buffer_desc->data_addr +
                        frag_len * coll_op->coll_schedule->topo_info->hier_layout_info[0].offset +
                        frag_len * coll_op->coll_schedule->topo_info->hier_layout_info[0].level_one_index),
                    frag_len, &coll_op->fragment_data.message_descriptor->send_convertor);
        } else {
            /* calculate the new frag length; there are some issues here */
            frag_len = (coll_op->fragment_data.message_descriptor->n_bytes_total -
                        coll_op->fragment_data.message_descriptor->n_bytes_scheduled <
                        coll_op->fragment_data.fragment_size ?
                        coll_op->fragment_data.message_descriptor->n_bytes_total -
                        coll_op->fragment_data.message_descriptor->n_bytes_scheduled :
                        coll_op->fragment_data.fragment_size);

            /* everybody copies in, based on the new values */
            memcpy((void *) ((uintptr_t)src_buffer_desc->data_addr +
                        frag_len * new_op->coll_schedule->topo_info->hier_layout_info[0].offset +
                        frag_len * new_op->coll_schedule->topo_info->hier_layout_info[0].level_one_index),
                   (void *) ((uintptr_t) buf + (uintptr_t)
                        coll_op->fragment_data.message_descriptor->n_bytes_scheduled), frag_len);
        }

        new_op->variable_fn_params.sbuf = (void *) src_buffer_desc->data_addr;
        new_op->variable_fn_params.rbuf = (void *) src_buffer_desc->data_addr;

        /* update the number of bytes scheduled */
        new_op->fragment_data.message_descriptor->n_bytes_scheduled += frag_len;
        /* everyone needs an unpack function */
        new_op->process_fn = mca_coll_ml_allgather_noncontiguous_unpack_data;

        new_op->fragment_data.fragment_size = frag_len;
        new_op->fragment_data.buffer_desc = src_buffer_desc;

        /* Setup fragment specific data */
        ++(new_op->fragment_data.message_descriptor->n_active);

        ML_VERBOSE(10, ("Start more, My index %d ",
                        new_op->fragment_data.buffer_desc->buffer_index));

        /* this is a bit buggy */
        ML_SET_VARIABLE_PARAMS_BCAST(
                new_op,
                OP_ML_MODULE(new_op),
                frag_len /* yes, we have consistent units, so this makes sense */,
                MPI_BYTE /* we fragment according to buffer size;
                          * we don't reduce the data, thus we needn't
                          * keep "whole" datatypes and may freely
                          * fragment without regard for multiples
                          * of any specific datatype
                          */,
                src_buffer_desc,
                0,
                0,
                frag_len,
                src_buffer_desc->data_addr);
        /* initialize the first coll */
        ret = new_op->sequential_routine.seq_task_setup(new_op);
        if (OMPI_SUCCESS != ret) {
            ML_VERBOSE(3, ("Fragment failed to initialize itself"));
            return ret;
        }

        new_op->variable_fn_params.buffer_size = frag_len;
        new_op->variable_fn_params.hier_factor = coll_op->variable_fn_params.hier_factor;
        new_op->variable_fn_params.root = 0;

        MCA_COLL_ML_SET_NEW_FRAG_ORDER_INFO(new_op);

        /* append this collective !! */
        OPAL_THREAD_LOCK(&(mca_coll_ml_component.sequential_collectives_mutex));
        opal_list_append(&mca_coll_ml_component.sequential_collectives,
                         (opal_list_item_t *)new_op);
        OPAL_THREAD_UNLOCK(&(mca_coll_ml_component.sequential_collectives_mutex));
    }

    return OMPI_SUCCESS;
}

static inline __opal_attribute_always_inline__
int mca_coll_ml_allgather_start (void *sbuf, int scount,
                                 struct ompi_datatype_t *sdtype,
                                 void* rbuf, int rcount,
                                 struct ompi_datatype_t *rdtype,
                                 struct ompi_communicator_t *comm,
                                 mca_coll_base_module_t *module,
                                 ompi_request_t **req)
{
    size_t pack_len, sdt_size;
    int ret, n_fragments = 1, comm_size;

    mca_coll_ml_topology_t *topo_info;
    ml_payload_buffer_desc_t *src_buffer_desc;

    mca_coll_ml_component_t *cm = &mca_coll_ml_component;

    mca_coll_ml_collective_operation_progress_t *coll_op;
    mca_coll_ml_module_t *ml_module = (mca_coll_ml_module_t *) module;

    ptrdiff_t lb, extent;
    bool scontig, rcontig, in_place = false;

    /* check for the in-place setting */
    if (MPI_IN_PLACE == sbuf) {
        in_place = true;
        sdtype = rdtype;
        scount = rcount;
    }

    /* scontig could be != rcontig */
    scontig = ompi_datatype_is_contiguous_memory_layout(sdtype, scount);
    rcontig = ompi_datatype_is_contiguous_memory_layout(rdtype, rcount);

    comm_size = ompi_comm_size(comm);

    ML_VERBOSE(10, ("Starting allgather"));

    assert(NULL != sdtype);
    /* Calculate the size of the data;
     * at this stage, only contiguous data is supported */

    /* this is valid for allgather */
    ompi_datatype_type_size(sdtype, &sdt_size);
    pack_len = scount * sdt_size;

    if (in_place) {
        sbuf = (char *) rbuf + ompi_comm_rank(comm) * pack_len;
    }

    /* Allocate the collective schedule and pack the message */
    /* this is the total final message size that will need to fit in the ml-buffer */
    if (pack_len <= (size_t) ml_module->small_message_thresholds[BCOL_ALLGATHER]) {
        /* The length of the message can not be larger than the ML buffer size */
        ML_VERBOSE(10, ("Single frag %d %d %d", pack_len, comm_size, ml_module->payload_block->size_buffer));
        assert(pack_len * comm_size <= ml_module->payload_block->size_buffer);

        src_buffer_desc = mca_coll_ml_alloc_buffer(ml_module);
        while (NULL == src_buffer_desc) {
            opal_progress();
            src_buffer_desc = mca_coll_ml_alloc_buffer(ml_module);
        }

        /* change 1 */
        coll_op = mca_coll_ml_alloc_op_prog_single_frag_dag(ml_module,
                      ml_module->coll_ml_allgather_functions[ML_SMALL_DATA_ALLGATHER],
                      sbuf, rbuf, pack_len, 0 /* offset for the first pack */);

        MCA_COLL_IBOFFLOAD_SET_ML_BUFFER_INFO(coll_op,
                src_buffer_desc->buffer_index, src_buffer_desc);

        coll_op->fragment_data.current_coll_op = ML_SMALL_DATA_ALLGATHER;
        /* task setup callback function */
        coll_op->sequential_routine.seq_task_setup = mca_coll_ml_allgather_task_setup;

        /* change 2 */
        if (!scontig) {
            coll_op->full_message.n_bytes_scheduled =
                mca_coll_ml_convertor_prepare(sdtype, scount, sbuf,
                        &coll_op->full_message.send_convertor, MCA_COLL_ML_NET_STREAM_SEND);

            mca_coll_ml_convertor_pack(
                    (void *) ((uintptr_t) src_buffer_desc->data_addr + pack_len *
                        (coll_op->coll_schedule->topo_info->hier_layout_info[0].offset +
                         coll_op->coll_schedule->topo_info->hier_layout_info[0].level_one_index)),
                    pack_len, &coll_op->full_message.send_convertor);
        } else {
            /* change 3 */
            memcpy((void *)((uintptr_t) src_buffer_desc->data_addr + pack_len *
                        (coll_op->coll_schedule->topo_info->hier_layout_info[0].offset +
                         coll_op->coll_schedule->topo_info->hier_layout_info[0].level_one_index)),
                   sbuf, pack_len);

            coll_op->full_message.n_bytes_scheduled = pack_len;
        }

        if (!rcontig) {
            mca_coll_ml_convertor_prepare(rdtype, rcount * comm_size, rbuf,
                    &coll_op->full_message.recv_convertor, MCA_COLL_ML_NET_STREAM_RECV);
        }

        if (coll_op->coll_schedule->topo_info->ranks_contiguous) {
            coll_op->process_fn = mca_coll_ml_allgather_small_unpack_data;
        } else {
            coll_op->process_fn = mca_coll_ml_allgather_noncontiguous_unpack_data;
        }

        /* the whole ml-buffer is used to send AND receive */
        coll_op->variable_fn_params.sbuf = (void *) src_buffer_desc->data_addr;
        coll_op->variable_fn_params.rbuf = (void *) src_buffer_desc->data_addr;

        /* we can set the initial offset here */
        coll_op->variable_fn_params.sbuf_offset = 0;
        coll_op->variable_fn_params.rbuf_offset = 0;

        coll_op->variable_fn_params.count = scount;
        coll_op->fragment_data.fragment_size =
            coll_op->full_message.n_bytes_scheduled;

        /* For small CINCO, we may use the native datatype */
        coll_op->variable_fn_params.dtype = sdtype;
        coll_op->variable_fn_params.buffer_size = pack_len;
        coll_op->variable_fn_params.root = 0;
    } else if (cm->enable_fragmentation || pack_len * comm_size < (1 << 20)) {
        /* calculate the number of fragments and the size of each frag */
        size_t n_dts_per_frag, frag_len;
        int pipeline_depth = mca_coll_ml_component.pipeline_depth;

        /* Calculate the number of fragments required for this message;
         * careful: watch the integer division! */
        frag_len = (pack_len <= (size_t) ml_module->small_message_thresholds[BCOL_ALLGATHER] ?
                    pack_len : (size_t) ml_module->small_message_thresholds[BCOL_ALLGATHER]);

        n_dts_per_frag = frag_len / sdt_size;
        n_fragments = (pack_len + sdt_size * n_dts_per_frag - 1) / (sdt_size * n_dts_per_frag);
        pipeline_depth = (n_fragments < pipeline_depth ? n_fragments : pipeline_depth);
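
        /*
         * Worked example with illustrative numbers (assumed, not from the
         * commit): pack_len = 10000 bytes, sdt_size = 8, threshold = 4096.
         * Then frag_len = 4096, n_dts_per_frag = 512,
         * n_fragments = (10000 + 4096 - 1) / 4096 = 3 (integer division),
         * and the pipeline depth is capped at min(3, configured depth).
         */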

        src_buffer_desc = mca_coll_ml_alloc_buffer(ml_module);
        while (NULL == src_buffer_desc) {
            opal_progress();
            src_buffer_desc = mca_coll_ml_alloc_buffer(ml_module);
        }

        /* change 4 */
        coll_op = mca_coll_ml_alloc_op_prog_single_frag_dag(ml_module,
                      ml_module->coll_ml_allgather_functions[ML_SMALL_DATA_ALLGATHER],
                      sbuf, rbuf, pack_len,
                      0 /* offset for the first pack */);

        MCA_COLL_IBOFFLOAD_SET_ML_BUFFER_INFO(coll_op,
                src_buffer_desc->buffer_index, src_buffer_desc);
        topo_info = coll_op->coll_schedule->topo_info;

        /* task setup callback function */
        coll_op->sequential_routine.seq_task_setup = mca_coll_ml_allgather_task_setup;

        if (!scontig) {
            coll_op->full_message.send_converter_bytes_packed =
                mca_coll_ml_convertor_prepare(
                        sdtype, scount, NULL,
                        &coll_op->full_message.dummy_convertor,
                        MCA_COLL_ML_NET_STREAM_SEND);

            coll_op->full_message.dummy_conv_position = 0;
            mca_coll_ml_convertor_get_send_frag_size(
                    ml_module, &frag_len,
                    &coll_op->full_message);

            /* change 5 */
            mca_coll_ml_convertor_prepare(sdtype, scount, sbuf,
                    &coll_op->full_message.send_convertor, MCA_COLL_ML_NET_STREAM_SEND);

            mca_coll_ml_convertor_pack(
                    (void *) ((uintptr_t) src_buffer_desc->data_addr + frag_len *
                        (coll_op->coll_schedule->topo_info->hier_layout_info[0].offset +
                         coll_op->coll_schedule->topo_info->hier_layout_info[0].level_one_index)),
                    frag_len, &coll_op->full_message.send_convertor);
        } else {
            /* change 6 */
            memcpy((void *)((uintptr_t)src_buffer_desc->data_addr + frag_len *
                        (coll_op->coll_schedule->topo_info->hier_layout_info[0].offset +
                         coll_op->coll_schedule->topo_info->hier_layout_info[0].level_one_index)),
                   sbuf, frag_len);
        }

        if (!rcontig) {
            mca_coll_ml_convertor_prepare(rdtype, rcount * comm_size, rbuf,
                    &coll_op->full_message.recv_convertor, MCA_COLL_ML_NET_STREAM_RECV);
        }

        coll_op->process_fn = mca_coll_ml_allgather_noncontiguous_unpack_data;

        /* hopefully this doesn't royally screw things up; the idea behind this
         * is that the whole ml-buffer is used to send and receive
         */
        coll_op->variable_fn_params.sbuf = (void *) src_buffer_desc->data_addr;
        coll_op->variable_fn_params.rbuf = (void *) src_buffer_desc->data_addr;

        /* we can set the initial offset here */
        coll_op->variable_fn_params.sbuf_offset = 0;
        coll_op->variable_fn_params.rbuf_offset = 0;

        coll_op->fragment_data.buffer_desc = src_buffer_desc;

        coll_op->fragment_data.fragment_size = frag_len;
        coll_op->fragment_data.message_descriptor->n_active = 1;

        coll_op->full_message.n_bytes_scheduled = frag_len;
        coll_op->full_message.fragment_launcher = mca_coll_ml_allgather_frag_progress;

        coll_op->full_message.pipeline_depth = pipeline_depth;
        coll_op->fragment_data.current_coll_op = ML_SMALL_DATA_ALLGATHER;

        /* remember, this is different for frags!! It caused data corruption
         * when not properly set. You need to be sure you have consistent units.
         */
        coll_op->variable_fn_params.count = frag_len;
        coll_op->variable_fn_params.dtype = MPI_BYTE; /* for fragmented data, we work in
                                                       * units of bytes. This means that
                                                       * all of our arithmetic is done
                                                       * in terms of bytes
                                                       */

        coll_op->variable_fn_params.root = 0;
        coll_op->variable_fn_params.frag_size = frag_len;
        coll_op->variable_fn_params.buffer_size = frag_len;
    } else {
        /* change 7 */
        ML_VERBOSE(10, ("ML_ALLGATHER_LARGE_DATA_KNOWN case."));
        coll_op = mca_coll_ml_alloc_op_prog_single_frag_dag(ml_module,
                      ml_module->coll_ml_allgather_functions[ML_LARGE_DATA_ALLGATHER],
                      sbuf, rbuf, pack_len, 0 /* offset for the first pack */);
        topo_info = coll_op->coll_schedule->topo_info;
        if (MCA_BCOL_BASE_NO_ML_BUFFER_FOR_LARGE_MSG & topo_info->all_bcols_mode) {
            MCA_COLL_IBOFFLOAD_SET_ML_BUFFER_INFO(coll_op, MCA_COLL_ML_NO_BUFFER, NULL);
        } else {
            src_buffer_desc = mca_coll_ml_alloc_buffer(ml_module);
            while (NULL == src_buffer_desc) {
                opal_progress();
                src_buffer_desc = mca_coll_ml_alloc_buffer(ml_module);
            }

            MCA_COLL_IBOFFLOAD_SET_ML_BUFFER_INFO(coll_op, src_buffer_desc->buffer_index, src_buffer_desc);
        }

        /* not sure if I really need this here */
        coll_op->sequential_routine.seq_task_setup = mca_coll_ml_allgather_task_setup;
        coll_op->process_fn = NULL;
        /* probably the most important piece */
        coll_op->variable_fn_params.sbuf = sbuf;
        coll_op->variable_fn_params.rbuf = rbuf;
        coll_op->variable_fn_params.sbuf_offset = 0;
        coll_op->variable_fn_params.rbuf_offset = 0;
        coll_op->variable_fn_params.count = scount;
        coll_op->variable_fn_params.dtype = sdtype; /* for zero copy, we want the
                                                     * native datatype and the actual count
                                                     */
        coll_op->variable_fn_params.root = 0;

        /* you still need to copy your own data into the rbuf */
        /* you don't need to do this if you have in-place data */
        if (!in_place) {
            memcpy((char *) rbuf + ompi_comm_rank(comm) * pack_len, sbuf, pack_len);
        }
    }

    coll_op->full_message.send_count = scount;
    coll_op->full_message.recv_count = rcount;

    coll_op->full_message.send_data_continguous = scontig;
    coll_op->full_message.recv_data_continguous = rcontig;

    ompi_datatype_get_extent(sdtype, &lb, &extent);
    coll_op->full_message.send_extent = (size_t) extent;

    ompi_datatype_get_extent(rdtype, &lb, &extent);
    coll_op->full_message.recv_extent = (size_t) extent;

    /* Fill in the function arguments */
    coll_op->variable_fn_params.sequence_num =
        OPAL_THREAD_ADD64(&(ml_module->collective_sequence_num), 1);
    coll_op->variable_fn_params.hier_factor = comm_size;

    MCA_COLL_ML_SET_ORDER_INFO(coll_op, n_fragments);

    ret = mca_coll_ml_launch_sequential_collective (coll_op);
    if (OMPI_SUCCESS != ret) {
        ML_VERBOSE(10, ("Failed to launch"));
        return ret;
    }

    *req = &coll_op->full_message.super;

    return OMPI_SUCCESS;
}

int mca_coll_ml_allgather(void *sbuf, int scount,
                          struct ompi_datatype_t *sdtype,
                          void* rbuf, int rcount,
                          struct ompi_datatype_t *rdtype,
                          struct ompi_communicator_t *comm,
                          mca_coll_base_module_t *module)
{
    ompi_request_t *req;
    int ret;

    ML_VERBOSE(10, ("Starting blocking allgather"));

    ret = mca_coll_ml_allgather_start (sbuf, scount, sdtype,
                                       rbuf, rcount, rdtype,
                                       comm, module, &req);
    if (OPAL_UNLIKELY(OMPI_SUCCESS != ret)) {
        return ret;
    }

    ret = ompi_request_wait (&req, MPI_STATUS_IGNORE);

    ML_VERBOSE(10, ("Blocking allgather is complete"));

    return ret;
}

int mca_coll_ml_allgather_nb(void *sbuf, int scount,
                             struct ompi_datatype_t *sdtype,
                             void* rbuf, int rcount,
                             struct ompi_datatype_t *rdtype,
                             struct ompi_communicator_t *comm,
                             ompi_request_t **req,
                             mca_coll_base_module_t *module)
{
    int ret;

    ML_VERBOSE(10, ("Starting non-blocking allgather"));

    ret = mca_coll_ml_allgather_start (sbuf, scount, sdtype,
                                       rbuf, rcount, rdtype,
                                       comm, module, req);
    if (OPAL_UNLIKELY(OMPI_SUCCESS != ret)) {
        return ret;
    }

    ML_VERBOSE(10, ("Non-blocking allgather started"));

    return ret;
}

@@ -1,3 +1,4 @@
/* -*- Mode: C; c-basic-offset:4 ; indent-tabs-mode:nil -*- */
/*
 * Copyright (c) 2009-2012 Oak Ridge National Laboratory.  All rights reserved.
 * Copyright (c) 2009-2012 Mellanox Technologies.  All rights reserved.
@@ -20,10 +21,8 @@

long memory_buffer_index;

ml_memory_block_desc_t *mca_coll_ml_allocate_block(
                struct mca_coll_ml_component_t *ml_component,
                ml_memory_block_desc_t *ml_memblock
                )
ml_memory_block_desc_t *mca_coll_ml_allocate_block(struct mca_coll_ml_component_t *ml_component,
                                                   ml_memory_block_desc_t *ml_memblock)
{
    ml_memory_block_desc_t *ret = NULL;
    ml_memory_block_desc_t *memory_block = NULL;
@@ -33,7 +32,7 @@ ml_memory_block_desc_t *mca_coll_ml_allocate_block(
        ML_ERROR(("Memory already allocated - expecting NULL pointer"));
        return ret;
    }
    memory_block = (ml_memory_block_desc_t*) malloc(sizeof(ml_memory_block_desc_t));
    memory_block = (ml_memory_block_desc_t*) calloc(1, sizeof(ml_memory_block_desc_t));

    if (NULL == memory_block){
        ML_ERROR(("Couldn't allocate memory for ml_memblock"));
@@ -61,38 +60,31 @@ exit_ERROR:
    return ret;
}

void mca_coll_ml_free_block(
                ml_memory_block_desc_t *ml_memblock
                )
void mca_coll_ml_free_block (ml_memory_block_desc_t *ml_memblock)
{
    ml_payload_buffer_desc_t **pbuff_descs = NULL;

    if (!ml_memblock)
        return;

    if (pbuff_descs) {
        free(pbuff_descs);
    if (ml_memblock->buffer_descs) {
        free(ml_memblock->buffer_descs);
    }

    mca_coll_ml_lmngr_free(ml_memblock->block);
    free(ml_memblock->buffer_descs);
    free(ml_memblock->bank_release_counters);
    free(ml_memblock->ready_for_memsync);
    free(ml_memblock->bank_is_busy);
    free(ml_memblock);
}

int mca_coll_ml_initialize_block(
                ml_memory_block_desc_t *ml_memblock,
                uint32_t num_buffers,
                uint32_t num_banks,
                uint32_t buffer_size,
                int32_t data_offset,
                opal_list_t *bcols_in_use
                )
int mca_coll_ml_initialize_block(ml_memory_block_desc_t *ml_memblock,
                                 uint32_t num_buffers,
                                 uint32_t num_banks,
                                 uint32_t buffer_size,
                                 int32_t data_offset,
                                 opal_list_t *bcols_in_use)
{
    int ret = OMPI_SUCCESS;
    uint32_t loop, bank_loop, buff_loop;
    uint32_t bank_loop, buff_loop;
    uint64_t addr_offset = 0;
    ml_payload_buffer_desc_t *pbuff_descs = NULL, *pbuff_desc = NULL;

@@ -122,30 +114,31 @@ int mca_coll_ml_initialize_block(

        addr_offset += buffer_size;
        pbuff_desc->buffer_index = BUFFER_INDEX(bank_loop, num_buffers, buff_loop);
#if 0
        ML_ERROR(("Setting buffer_index %lld %d %d", pbuff_desc->buffer_index,
                  BANK_FROM_BUFFER_IDX(pbuff_desc->buffer_index),
                  BUFFER_FROM_BUFFER_IDX(pbuff_desc->buffer_index)));
#endif

        pbuff_desc->bank_index = bank_loop;
        pbuff_desc->generation_number = 0;
    }

    /* Initialize the ml memory block */
    ml_memblock->bank_release_counters = (uint32_t *) malloc(sizeof(uint32_t) *
                                                             num_banks);
    /* gvm FIX: This counter, when zero, indicates that the bank is ready for
     * recycling. It is initialized to the number of bcol components, as each bcol
     * is responsible for releasing the buffers of a bank. This initialization will
     * have faulty behavior, for example in the case of multiple interfaces, when
     * more than one bcol module of the component type is in use.
     */
    ml_memblock->bank_release_counters = (uint32_t *) calloc(num_banks, sizeof(uint32_t));
    if (NULL == ml_memblock->bank_release_counters) {
        ret = OMPI_ERR_OUT_OF_RESOURCE;
        goto exit_ERROR;
    }

    ml_memblock->ready_for_memsync = (bool *)malloc(sizeof(bool) * num_banks);
    ml_memblock->ready_for_memsync = (bool *) calloc(num_banks, sizeof(bool));
    if (NULL == ml_memblock->ready_for_memsync) {
        ret = OMPI_ERR_OUT_OF_RESOURCE;
        goto exit_ERROR;
    }

    ml_memblock->bank_is_busy = (bool *)malloc(sizeof(bool) * num_banks);
    ml_memblock->bank_is_busy = (bool *) calloc(num_banks, sizeof(bool));
    if (NULL == ml_memblock->bank_is_busy) {
        ret = OMPI_ERR_OUT_OF_RESOURCE;
        goto exit_ERROR;
@@ -154,18 +147,6 @@ int mca_coll_ml_initialize_block(
    /* Set the index of the first bank to sync */
    ml_memblock->memsync_counter = 0;

    /* gvm FIX: This counter, when zero, indicates that the bank is ready for
     * recycling. It is initialized to the number of bcol components, as each bcol
     * is responsible for releasing the buffers of a bank. This initialization will
     * have faulty behavior, for example in the case of multiple interfaces, when
     * more than one bcol module of the component type is in use.
     */
    for (loop = 0; loop < num_banks; loop++) {
        ml_memblock->bank_release_counters[loop] = 0;
        ml_memblock->ready_for_memsync[loop] = false;
        ml_memblock->bank_is_busy[loop] = false;
    }

    /* use the first bank and the first buffer */
    ml_memblock->next_free_buffer = 0;

@@ -186,22 +167,20 @@ exit_ERROR:
    return ret;
}

ml_payload_buffer_desc_t *mca_coll_ml_alloc_buffer(
        mca_coll_ml_module_t *module){
ml_payload_buffer_desc_t *mca_coll_ml_alloc_buffer (mca_coll_ml_module_t *module)
{
    uint64_t bindex;
    uint32_t bank, buffer, num_buffers;
    ml_memory_block_desc_t *ml_memblock = module->payload_block;
    ml_payload_buffer_desc_t *pbuff_descs = NULL,
        *ml_membuffer = NULL;

    /* Return a buffer */
    num_buffers = ml_memblock->num_buffers_per_bank;
    pbuff_descs = ml_memblock->buffer_descs;
    bindex = ml_memblock->next_free_buffer;

    buffer = bindex % num_buffers;
    bank = bindex / num_buffers;

    ML_VERBOSE(10, ("ML allocator: allocating buffer index %d, bank index %d", buffer, bank));

@@ -210,13 +189,12 @@ ml_payload_buffer_desc_t *mca_coll_ml_alloc_buffer(
    if (!ml_memblock->bank_is_busy[bank]) {
        /* the bank is free, mark it busy */
        ml_memblock->bank_is_busy[bank] = true;
        ML_VERBOSE(10, ("ML allocator: reset bank %d to value %d", bank,
                        ml_memblock->bank_release_counters[bank]));
    } else {
        /* the bank is busy; return NULL and the upper layer will handle it */
        ML_VERBOSE(10, ("No free payload buffers are available for use."
                        " The next memory bank is still used by one of the bcols"));
        return NULL;
    }
}
@@ -227,11 +205,9 @@ ml_payload_buffer_desc_t *mca_coll_ml_alloc_buffer(
    ML_VERBOSE(10, ("ML allocator: ml buffer index %d", bindex));

    /* Compute the next free buffer */
    ++buffer;
    buffer %= num_buffers;
    buffer = (buffer == num_buffers - 1) ? 0 : buffer + 1;
    if (0 == buffer) {
        ++bank;
        bank %= ml_memblock->num_banks;
        bank = (bank == ml_memblock->num_banks - 1) ? 0 : bank + 1;
    }

    ml_memblock->next_free_buffer = BUFFER_INDEX(bank, num_buffers, buffer);
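
The bank/buffer encoding used by the allocator is plain row-major arithmetic;
a small sketch with illustrative numbers (BUFFER_INDEX is assumed to be
defined alongside the allocator as bank * num_buffers + buffer):

    /* With num_buffers = 4 buffers per bank:
     *   bindex 0..3 -> bank 0, buffers 0..3
     *   bindex 9    -> bank 9 / 4 = 2, buffer 9 % 4 = 1
     */
    uint64_t bindex = BUFFER_INDEX(2, 4, 1);  /* encodes bank 2, buffer 1 */
    uint32_t buffer = bindex % 4;             /* == 1 */
    uint32_t bank   = bindex / 4;             /* == 2 */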

ompi/mca/coll/ml/coll_ml_allreduce.c (new file, 553 lines)
@@ -0,0 +1,553 @@
/* -*- Mode: C; c-basic-offset:4 ; indent-tabs-mode:nil -*- */
/*
 * Copyright (c) 2009-2012 Oak Ridge National Laboratory.  All rights reserved.
 * Copyright (c) 2009-2012 Mellanox Technologies.  All rights reserved.
 * Copyright (c) 2013      Los Alamos National Security, LLC. All rights
 *                         reserved.
 * $COPYRIGHT$
 *
 * Additional copyrights may follow
 *
 * $HEADER$
 */
/** @file */

#include "ompi_config.h"

#include <stdlib.h>

#include "ompi/constants.h"
#include "opal/threads/mutex.h"
#include "ompi/communicator/communicator.h"
#include "ompi/mca/coll/coll.h"
#include "ompi/mca/bcol/bcol.h"
#include "opal/sys/atomic.h"
#include "coll_ml.h"
#include "coll_ml_select.h"
#include "coll_ml_allocation.h"

static int mca_coll_ml_allreduce_small_unpack(mca_coll_ml_collective_operation_progress_t *coll_op)
{
    int ret;
    /* need to put in more */
    int count = coll_op->variable_fn_params.count;
    ompi_datatype_t *dtype = coll_op->variable_fn_params.dtype;

    void *dest = (void *)((uintptr_t)coll_op->full_message.dest_user_addr +
                          (uintptr_t)coll_op->fragment_data.offset_into_user_buffer);
    void *src = (void *)((uintptr_t)coll_op->fragment_data.buffer_desc->data_addr +
                         (size_t)coll_op->variable_fn_params.rbuf_offset);

    ret = ompi_datatype_copy_content_same_ddt(dtype, (int32_t) count, (char *) dest,
                                              (char *) src);
    if (ret < 0) {
        return OMPI_ERROR;
    }

    ML_VERBOSE(10, ("sbuf addr %p, sbuf offset %d, rbuf addr %p, rbuf offset %d.",
                    src, coll_op->variable_fn_params.sbuf_offset, dest,
                    coll_op->variable_fn_params.rbuf_offset));

    return OMPI_SUCCESS;
}

static int mca_coll_ml_allreduce_task_setup(mca_coll_ml_collective_operation_progress_t *coll_op)
{
    int fn_idx, h_level, my_index, root;
    mca_sbgp_base_module_t *sbgp;
    mca_coll_ml_topology_t *topo = coll_op->coll_schedule->topo_info;

    fn_idx = coll_op->sequential_routine.current_active_bcol_fn;
    h_level = coll_op->coll_schedule->component_functions[fn_idx].h_level;
    sbgp = topo->component_pairs[h_level].subgroup_module;
    my_index = sbgp->my_index;

    /* In the case of allreduce, the local leader is always the root */
    root = 0;
    if (my_index == root) {
        coll_op->variable_fn_params.root_flag = true;
        coll_op->variable_fn_params.root_route = NULL;
    } else {
        coll_op->variable_fn_params.root_flag = false;
        coll_op->variable_fn_params.root_route = &topo->route_vector[root];
    }

    /* NTH: This was copied from the old allreduce launcher. */
    if (0 < fn_idx) {
        coll_op->variable_fn_params.sbuf = coll_op->variable_fn_params.rbuf;
        coll_op->variable_fn_params.userbuf = coll_op->variable_fn_params.rbuf;
    }

    return OMPI_SUCCESS;
}

static int mca_coll_ml_allreduce_frag_progress(mca_coll_ml_collective_operation_progress_t *coll_op)
{
    /* local variables */
    void *buf;

    size_t dt_size;
    int ret, frag_len, count;

    ptrdiff_t lb, extent;

    ml_payload_buffer_desc_t *src_buffer_desc;
    mca_coll_ml_collective_operation_progress_t *new_op;

    mca_coll_ml_module_t *ml_module = OP_ML_MODULE(coll_op);

    ret = ompi_datatype_get_extent(coll_op->variable_fn_params.dtype, &lb, &extent);
    if (ret < 0) {
        return OMPI_ERROR;
    }

    dt_size = (size_t) extent;

    /* Keep the pipeline filled with fragments */
    while (coll_op->fragment_data.message_descriptor->n_active <
           coll_op->fragment_data.message_descriptor->pipeline_depth) {
        /* If an active fragment happens to have completed the collective during
         * a hop into the progress engine, then don't launch a new fragment;
         * instead break and return.
         */
        if (coll_op->fragment_data.message_descriptor->n_bytes_scheduled
            == coll_op->fragment_data.message_descriptor->n_bytes_total) {
            break;
        }

        /* Get an ml buffer */
        src_buffer_desc = mca_coll_ml_alloc_buffer(OP_ML_MODULE(coll_op));
        if (NULL == src_buffer_desc) {
            /* If there exist outstanding fragments, then break out
             * and let an active fragment deal with this later;
             * there are no buffers available.
             */
            if (0 < coll_op->fragment_data.message_descriptor->n_active) {
                return OMPI_SUCCESS;
            }

            /* It is useless to call progress from here, since
             * ml progress can't be executed; as a result the ml memsync
             * call will not be completed and no memory will be
             * recycled. So we put the element on the list, and we will
             * progress it later when memsync recycles some memory. */

            /* The fragment is already on the list and
             * we still have no ml resources.
             * Return busy. */
            if (!(coll_op->pending & REQ_OUT_OF_MEMORY)) {
                coll_op->pending |= REQ_OUT_OF_MEMORY;
                opal_list_append(&((OP_ML_MODULE(coll_op))->waiting_for_memory_list),
                                 (opal_list_item_t *)coll_op);
                ML_VERBOSE(10, ("Out of resources %p adding to pending queue", coll_op));
            } else {
                ML_VERBOSE(10, ("Out of resources %p", coll_op));
            }

            return OMPI_ERR_TEMP_OUT_OF_RESOURCE;
        }

        /* Get a new collective descriptor and initialize it */
        new_op = mca_coll_ml_alloc_op_prog_single_frag_dag(ml_module,
                     ml_module->coll_ml_allreduce_functions[coll_op->fragment_data.current_coll_op],
                     coll_op->fragment_data.message_descriptor->src_user_addr,
                     coll_op->fragment_data.message_descriptor->dest_user_addr,
                     coll_op->fragment_data.message_descriptor->n_bytes_total,
                     coll_op->fragment_data.message_descriptor->n_bytes_scheduled);

        MCA_COLL_IBOFFLOAD_SET_ML_BUFFER_INFO(new_op,
                src_buffer_desc->buffer_index, src_buffer_desc);

        new_op->fragment_data.current_coll_op = coll_op->fragment_data.current_coll_op;
        new_op->fragment_data.message_descriptor = coll_op->fragment_data.message_descriptor;

        /* set the task setup callback */
        new_op->sequential_routine.seq_task_setup = mca_coll_ml_allreduce_task_setup;
        /* We need this address for pointer arithmetic in memcpy */
        buf = coll_op->fragment_data.message_descriptor->src_user_addr;
        /* calculate the number of data types in this packet */
        count = (coll_op->fragment_data.message_descriptor->n_bytes_total -
                 coll_op->fragment_data.message_descriptor->n_bytes_scheduled <
                 (size_t) OP_ML_MODULE(coll_op)->small_message_thresholds[BCOL_ALLREDUCE] ?
                 (coll_op->fragment_data.message_descriptor->n_bytes_total -
                  coll_op->fragment_data.message_descriptor->n_bytes_scheduled) / dt_size :
                 (size_t) coll_op->variable_fn_params.count);

        /* calculate the fragment length */
        frag_len = count * dt_size;
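
        /*
         * Illustrative check of the count selection (numbers are assumed):
         * with n_bytes_total = 1000, n_bytes_scheduled = 896, dt_size = 8 and
         * a 512-byte threshold, the remaining 104 bytes fall below the
         * threshold, so count = 104 / 8 = 13 and frag_len = 104; otherwise
         * the full per-fragment count is reused for another maximal fragment.
         */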

        ret = ompi_datatype_copy_content_same_ddt(coll_op->variable_fn_params.dtype, count,
                  (char *) src_buffer_desc->data_addr, (char *) ((uintptr_t) buf + (uintptr_t)
                  coll_op->fragment_data.message_descriptor->n_bytes_scheduled));
        if (ret < 0) {
            return OMPI_ERROR;
        }

        /* No unpack for root */
        new_op->process_fn = mca_coll_ml_allreduce_small_unpack;

        /* Setup fragment specific data */
        new_op->fragment_data.message_descriptor->n_bytes_scheduled += frag_len;
        new_op->fragment_data.buffer_desc = src_buffer_desc;
        new_op->fragment_data.fragment_size = frag_len;
        (new_op->fragment_data.message_descriptor->n_active)++;

        ML_SET_VARIABLE_PARAMS_BCAST(
                new_op,
                OP_ML_MODULE(new_op),
                count,
                MPI_BYTE,
                src_buffer_desc,
                0,
                0,
                frag_len,
                src_buffer_desc->data_addr);
        /* Fill in bcast specific arguments */
        /* TBD: remove buffer_size */
        new_op->variable_fn_params.buffer_size = frag_len;
        new_op->variable_fn_params.count = count;
        new_op->variable_fn_params.hier_factor = coll_op->variable_fn_params.hier_factor;
        new_op->variable_fn_params.op = coll_op->variable_fn_params.op;
        new_op->variable_fn_params.dtype = coll_op->variable_fn_params.dtype;
        new_op->variable_fn_params.root = 0;
        new_op->variable_fn_params.sbuf = src_buffer_desc->data_addr;
        new_op->variable_fn_params.rbuf = src_buffer_desc->data_addr;
        new_op->sequential_routine.current_bcol_status = SEQ_TASK_PENDING;

        MCA_COLL_ML_SET_NEW_FRAG_ORDER_INFO(new_op);

        ML_VERBOSE(10, ("FFFF Contig + fragmentation [0-sk, 1-lk, 3-su, 4-lu] %d %d %d\n",
                        new_op->variable_fn_params.buffer_size,
                        new_op->fragment_data.fragment_size,
                        new_op->fragment_data.message_descriptor->n_bytes_scheduled));
        /* initialize the first coll */
        ret = new_op->sequential_routine.seq_task_setup(new_op);
        if (OMPI_SUCCESS != ret) {
            ML_VERBOSE(3, ("Fragment failed to initialize itself"));
            return ret;
        }

        /* append this collective !! */
        OPAL_THREAD_LOCK(&(mca_coll_ml_component.sequential_collectives_mutex));
        opal_list_append(&mca_coll_ml_component.sequential_collectives,
                         (opal_list_item_t *)new_op);
        OPAL_THREAD_UNLOCK(&(mca_coll_ml_component.sequential_collectives_mutex));
    }

    return OMPI_SUCCESS;
}

static inline __opal_attribute_always_inline__
int parallel_allreduce_start(void *sbuf, void *rbuf, int count,
                             struct ompi_datatype_t *dtype, struct ompi_op_t *op,
                             struct ompi_communicator_t *comm,
                             mca_coll_ml_module_t *ml_module,
                             ompi_request_t **req,
                             int small_data_allreduce,
                             int large_data_allreduce)
{
    int ret, n_fragments = 1, frag_len,
        pipeline_depth, n_dts_per_frag;

    ptrdiff_t lb, extent;
    size_t pack_len, dt_size;

    ml_payload_buffer_desc_t *src_buffer_desc;
    mca_coll_ml_collective_operation_progress_t *coll_op;

    mca_coll_ml_component_t *cm = &mca_coll_ml_component;

    bool contiguous = ompi_datatype_is_contiguous_memory_layout(dtype, count);

    if (MPI_IN_PLACE == sbuf) {
        sbuf = rbuf;
    }

    ret = ompi_datatype_get_extent(dtype, &lb, &extent);
    if (ret < 0) {
        return OMPI_ERROR;
    }

    dt_size = (size_t) extent;
    pack_len = count * dt_size;

    ML_VERBOSE(1, ("The allreduce requested %d bytes; enable fragmentation %d ",
                   pack_len,
                   cm->enable_fragmentation));
    if (pack_len <= (size_t) ml_module->small_message_thresholds[BCOL_ALLREDUCE]) {
        /* The length of the message can not be larger than the ML buffer size */
        assert(pack_len <= ml_module->payload_block->size_buffer);

        ML_VERBOSE(1, ("Using small data allreduce (threshold = %d)",
                       ml_module->small_message_thresholds[BCOL_ALLREDUCE]));

        src_buffer_desc = mca_coll_ml_alloc_buffer(ml_module);
        while (OPAL_UNLIKELY(NULL == src_buffer_desc)) {
            opal_progress();
            src_buffer_desc = mca_coll_ml_alloc_buffer(ml_module);
        }

        coll_op = mca_coll_ml_alloc_op_prog_single_frag_dag(ml_module,
                      ml_module->coll_ml_allreduce_functions[small_data_allreduce],
                      sbuf, rbuf, pack_len, 0);

        coll_op->variable_fn_params.rbuf = src_buffer_desc->data_addr;
        coll_op->variable_fn_params.sbuf = src_buffer_desc->data_addr;
        coll_op->variable_fn_params.count = count;

        ret = ompi_datatype_copy_content_same_ddt(dtype, count,
                  (void *) (uintptr_t) src_buffer_desc->data_addr, (char *) sbuf);
        if (ret < 0) {
            return OMPI_ERROR;
        }

        /* unpack function */
        coll_op->process_fn = mca_coll_ml_allreduce_small_unpack;
    } else if (cm->enable_fragmentation || !contiguous) {
        ML_VERBOSE(1, ("Using fragmented allreduce"));

        /* fragment the data */
        /* check for a datatype too large to fit in a single fragment */
        if (dt_size > (size_t) ml_module->small_message_thresholds[BCOL_ALLREDUCE]) {
            ML_ERROR(("Sorry, but we don't support datatypes that large"));
            return OMPI_ERROR;
        }

        /* calculate the number of data types that can fit per ml-buffer */
        n_dts_per_frag = ml_module->small_message_thresholds[BCOL_ALLREDUCE] / dt_size;

        /* calculate the number of fragments */
        n_fragments = (count + n_dts_per_frag - 1) / n_dts_per_frag;  /* round up */

        /* calculate the actual pipeline depth */
        pipeline_depth = n_fragments < cm->pipeline_depth ? n_fragments : cm->pipeline_depth;

        /* calculate the fragment size */
        frag_len = n_dts_per_frag * dt_size;

        /* allocate an ml buffer */
        src_buffer_desc = mca_coll_ml_alloc_buffer(ml_module);
        while (NULL == src_buffer_desc) {
            opal_progress();
            src_buffer_desc = mca_coll_ml_alloc_buffer(ml_module);
        }

        coll_op = mca_coll_ml_alloc_op_prog_single_frag_dag(ml_module,
                      ml_module->coll_ml_allreduce_functions[small_data_allreduce],
                      sbuf, rbuf, pack_len, 0 /* offset for the first pack */);
|
||||
|
||||
/* task setup callback function */
|
||||
coll_op->sequential_routine.seq_task_setup = mca_coll_ml_allreduce_task_setup;
|
||||
|
||||
coll_op->process_fn = mca_coll_ml_allreduce_small_unpack;
|
||||
|
||||
coll_op->variable_fn_params.sbuf = (void *) src_buffer_desc->data_addr;
|
||||
coll_op->variable_fn_params.rbuf = (void *) src_buffer_desc->data_addr;
|
||||
|
||||
coll_op->fragment_data.message_descriptor->n_active = 1;
|
||||
coll_op->full_message.n_bytes_scheduled = frag_len;
|
||||
coll_op->full_message.fragment_launcher = mca_coll_ml_allreduce_frag_progress;
|
||||
coll_op->full_message.pipeline_depth = pipeline_depth;
|
||||
coll_op->fragment_data.current_coll_op = small_data_allreduce;
|
||||
coll_op->fragment_data.fragment_size = frag_len;
|
||||
|
||||
coll_op->variable_fn_params.count = n_dts_per_frag; /* seems fishy */
|
||||
coll_op->variable_fn_params.buffer_size = frag_len;
|
||||
|
||||
/* copy into the ml-buffer */
|
||||
ret = ompi_datatype_copy_content_same_ddt(dtype, n_dts_per_frag,
|
||||
(char *) src_buffer_desc->data_addr, (char *) sbuf);
|
||||
if (ret < 0) {
|
||||
return OMPI_ERROR;
|
||||
}
|
||||
} else {
|
||||
ML_VERBOSE(1,("Using zero-copy ptp allreduce"));
|
||||
coll_op = mca_coll_ml_alloc_op_prog_single_frag_dag(ml_module,
|
||||
ml_module->coll_ml_allreduce_functions[large_data_allreduce],
|
||||
sbuf, rbuf, pack_len, 0);
|
||||
|
||||
coll_op->variable_fn_params.userbuf =
|
||||
coll_op->variable_fn_params.sbuf = sbuf;
|
||||
|
||||
coll_op->variable_fn_params.rbuf = rbuf;
|
||||
|
||||
/* The ML buffer is used for testing. Later, when we
|
||||
* switch to use knem/mmap/portals this should be replaced
|
||||
* appropriately
|
||||
*/
|
||||
src_buffer_desc = mca_coll_ml_alloc_buffer(ml_module);
|
||||
while (NULL == src_buffer_desc) {
|
||||
opal_progress();
|
||||
src_buffer_desc = mca_coll_ml_alloc_buffer(ml_module);
|
||||
}
|
||||
|
||||
coll_op->variable_fn_params.count = count;
|
||||
}
|
||||
|
||||
MCA_COLL_IBOFFLOAD_SET_ML_BUFFER_INFO(coll_op, src_buffer_desc->buffer_index,
|
||||
src_buffer_desc);
|
||||
|
||||
/* set the offset */
|
||||
coll_op->variable_fn_params.sbuf_offset = 0;
|
||||
coll_op->variable_fn_params.rbuf_offset = 0;
|
||||
|
||||
/* Fill in the function arguments */
|
||||
coll_op->variable_fn_params.sequence_num =
|
||||
OPAL_THREAD_ADD64(&(ml_module->collective_sequence_num), 1);
|
||||
coll_op->sequential_routine.current_active_bcol_fn = 0;
|
||||
coll_op->variable_fn_params.dtype = dtype;
|
||||
coll_op->variable_fn_params.op = op;
|
||||
coll_op->variable_fn_params.root = 0;
|
||||
coll_op->sequential_routine.seq_task_setup = mca_coll_ml_allreduce_task_setup; /* invoked after each level in sequential
|
||||
* progress call
|
||||
*/
|
||||
MCA_COLL_ML_SET_ORDER_INFO(coll_op, n_fragments);
|
||||
|
||||
ret = mca_coll_ml_launch_sequential_collective (coll_op);
|
||||
if (ret != OMPI_SUCCESS) {
|
||||
ML_VERBOSE(10, ("Failed to launch"));
|
||||
return ret;
|
||||
}
|
||||
|
||||
*req = &coll_op->full_message.super;
|
||||
|
||||
return OMPI_SUCCESS;
|
||||
}
|
||||
|
||||
int mca_coll_ml_allreduce(void *sbuf, void *rbuf, int count,
|
||||
struct ompi_datatype_t *dtype, struct ompi_op_t *op,
|
||||
struct ompi_communicator_t *comm,
|
||||
mca_coll_base_module_t *module)
|
||||
{
|
||||
mca_coll_ml_module_t *ml_module = (mca_coll_ml_module_t*)module;
|
||||
ompi_request_t *req;
|
||||
int ret;
|
||||
|
||||
if (OPAL_UNLIKELY(!ompi_op_is_commute(op))) {
|
||||
fprintf (stderr, "Falling back for allreduce\n");
|
||||
/* coll/ml does not handle non-communative operations at this time. fallback
|
||||
* on another collective module */
|
||||
return ml_module->fallback.coll_allreduce (sbuf, rbuf, count, dtype, op, comm,
|
||||
ml_module->fallback.coll_allreduce_module);
|
||||
}
|
||||
|
||||
ret = parallel_allreduce_start(sbuf, rbuf, count, dtype, op, comm,
|
||||
(mca_coll_ml_module_t *) module, &req,
|
||||
ML_SMALL_DATA_ALLREDUCE,
|
||||
ML_LARGE_DATA_ALLREDUCE);
|
||||
if (OPAL_UNLIKELY(OMPI_SUCCESS != ret)) {
|
||||
ML_ERROR(("Failed to launch"));
|
||||
return ret;
|
||||
}
|
||||
|
||||
ompi_request_wait_completion(req);
|
||||
ompi_request_free(&req);
|
||||
|
||||
ML_VERBOSE(10, ("Blocking NB allreduce is done"));
|
||||
|
||||
return OMPI_SUCCESS;
|
||||
}
|
||||
|
||||
int mca_coll_ml_allreduce_nb(void *sbuf, void *rbuf, int count,
|
||||
struct ompi_datatype_t *dtype, struct ompi_op_t *op,
|
||||
struct ompi_communicator_t *comm,
|
||||
ompi_request_t **req,
|
||||
mca_coll_base_module_t *module)
|
||||
{
|
||||
mca_coll_ml_module_t *ml_module = (mca_coll_ml_module_t*)module;
|
||||
int ret;
|
||||
|
||||
if (OPAL_UNLIKELY(!ompi_op_is_commute(op))) {
|
||||
fprintf (stderr, "Falling back for iallreduce\n");
|
||||
/* coll/ml does not handle non-communative operations at this time. fallback
|
||||
* on another collective module */
|
||||
return ml_module->fallback.coll_iallreduce (sbuf, rbuf, count, dtype, op, comm, req,
|
||||
ml_module->fallback.coll_iallreduce_module);
|
||||
}
|
||||
|
||||
ret = parallel_allreduce_start(sbuf, rbuf, count, dtype, op, comm,
|
||||
(mca_coll_ml_module_t *) module, req,
|
||||
ML_SMALL_DATA_ALLREDUCE,
|
||||
ML_LARGE_DATA_ALLREDUCE);
|
||||
if (OPAL_UNLIKELY(OMPI_SUCCESS != ret)) {
|
||||
ML_ERROR(("Failed to launch"));
|
||||
return ret;
|
||||
}
|
||||
|
||||
ML_VERBOSE(10, ("Blocking NB allreduce is done"));
|
||||
|
||||
return OMPI_SUCCESS;
|
||||
}
|
||||
|
||||
int mca_coll_ml_allreduce_dispatch(void *sbuf, void *rbuf, int count,
|
||||
struct ompi_datatype_t *dtype, struct ompi_op_t *op,
|
||||
struct ompi_communicator_t *comm, mca_coll_base_module_t *module)
|
||||
{
|
||||
int rc;
|
||||
bool use_extra_topo;
|
||||
ompi_request_t *req;
|
||||
|
||||
mca_coll_ml_module_t *ml_module = (mca_coll_ml_module_t *) module;
|
||||
|
||||
use_extra_topo = (count > 1) ?
|
||||
!ml_module->allreduce_matrix[op->op_type][dtype->id][BCOL_MULTI_ELEM_TYPE] :
|
||||
!ml_module->allreduce_matrix[op->op_type][dtype->id][BCOL_SINGLE_ELEM_TYPE];
|
||||
|
||||
if (use_extra_topo) {
|
||||
rc = parallel_allreduce_start(sbuf, rbuf, count, dtype,
|
||||
op, comm, ml_module, &req,
|
||||
ML_SMALL_DATA_EXTRA_TOPO_ALLREDUCE,
|
||||
ML_LARGE_DATA_EXTRA_TOPO_ALLREDUCE);
|
||||
} else {
|
||||
rc = parallel_allreduce_start(sbuf, rbuf, count, dtype,
|
||||
op, comm, ml_module, &req,
|
||||
ML_SMALL_DATA_ALLREDUCE,
|
||||
ML_LARGE_DATA_ALLREDUCE);
|
||||
}
|
||||
|
||||
if (OPAL_UNLIKELY(OMPI_SUCCESS != rc)) {
|
||||
ML_ERROR(("Failed to launch"));
|
||||
return rc;
|
||||
}
|
||||
|
||||
ompi_request_wait_completion(req);
|
||||
ompi_request_free(&req);
|
||||
|
||||
return OMPI_SUCCESS;
|
||||
}
|
||||
|
||||
int mca_coll_ml_allreduce_dispatch_nb(void *sbuf, void *rbuf, int count,
|
||||
ompi_datatype_t *dtype, ompi_op_t *op,
|
||||
ompi_communicator_t *comm,
|
||||
ompi_request_t **req,
|
||||
mca_coll_base_module_t *module)
|
||||
{
|
||||
int rc;
|
||||
bool use_extra_topo;
|
||||
|
||||
mca_coll_ml_module_t *ml_module = (mca_coll_ml_module_t *) module;
|
||||
|
||||
use_extra_topo = (count > 1) ?
|
||||
!ml_module->allreduce_matrix[op->op_type][dtype->id][BCOL_MULTI_ELEM_TYPE] :
|
||||
!ml_module->allreduce_matrix[op->op_type][dtype->id][BCOL_SINGLE_ELEM_TYPE];
|
||||
|
||||
if (use_extra_topo) {
|
||||
rc = parallel_allreduce_start(sbuf, rbuf, count, dtype,
|
||||
op, comm, ml_module, req,
|
||||
ML_SMALL_DATA_EXTRA_TOPO_ALLREDUCE,
|
||||
ML_LARGE_DATA_EXTRA_TOPO_ALLREDUCE);
|
||||
} else {
|
||||
rc = parallel_allreduce_start(sbuf, rbuf, count, dtype,
|
||||
op, comm, ml_module, req,
|
||||
ML_SMALL_DATA_ALLREDUCE,
|
||||
ML_LARGE_DATA_ALLREDUCE);
|
||||
}
|
||||
|
||||
if (OPAL_UNLIKELY(OMPI_SUCCESS != rc)) {
|
||||
ML_ERROR(("Failed to launch"));
|
||||
return rc;
|
||||
}
|
||||
|
||||
return OMPI_SUCCESS;
|
||||
}
|
@ -1,6 +1,9 @@
|
||||
/* -*- Mode: C; c-basic-offset:4 ; indent-tabs-mode:nil -*- */
|
||||
/*
|
||||
* Copyright (c) 2009-2012 Oak Ridge National Laboratory. All rights reserved.
|
||||
* Copyright (c) 2009-2012 Mellanox Technologies. All rights reserved.
|
||||
* Copyright (c) 2013 Los Alamos National Security, LLC. All rights
|
||||
* reserved.
|
||||
* $COPYRIGHT$
|
||||
*
|
||||
* Additional copyrights may follow
|
||||
@ -193,35 +196,33 @@ static int mca_coll_ml_bcast_frag_converter_progress(mca_coll_ml_collective_oper
|
||||
|
||||
/* Get an ml buffer */
|
||||
src_buffer_desc = mca_coll_ml_alloc_buffer(ml_module);
|
||||
if (NULL == src_buffer_desc) {
|
||||
if (OPAL_UNLIKELY(NULL == src_buffer_desc)) {
|
||||
/* If there exist outstanding fragments, then break out
|
||||
* and let an active fragment deal with this later,
|
||||
* there are no buffers available.
|
||||
*/
|
||||
if (0 < coll_op->fragment_data.message_descriptor->n_active) {
|
||||
return OMPI_SUCCESS;
|
||||
} else {
|
||||
/* It is useless to call progress from here, since
|
||||
* ml progress can't be executed as result ml memsync
|
||||
* call will not be completed and no memory will be
|
||||
* recycled. So we put the element on the list, and we will
|
||||
* progress it later when memsync will recycle some memory*/
|
||||
if (NULL == src_buffer_desc) {
|
||||
/* The fragment is already on list and
|
||||
* the we still have no ml resources
|
||||
* Return busy */
|
||||
if (coll_op->pending & REQ_OUT_OF_MEMORY) {
|
||||
return OMPI_ERR_TEMP_OUT_OF_RESOURCE;
|
||||
}
|
||||
|
||||
coll_op->pending |= REQ_OUT_OF_MEMORY;
|
||||
opal_list_append(&ml_module->waiting_for_memory_list,
|
||||
(opal_list_item_t *)coll_op);
|
||||
|
||||
return OMPI_ERR_TEMP_OUT_OF_RESOURCE;
|
||||
}
|
||||
}
|
||||
|
||||
/* It is useless to call progress from here, since
|
||||
* ml progress can't be executed as result ml memsync
|
||||
* call will not be completed and no memory will be
|
||||
* recycled. So we put the element on the list, and we will
|
||||
* progress it later when memsync will recycle some memory*/
|
||||
|
||||
/* The fragment is already on list and
|
||||
* the we still have no ml resources
|
||||
* Return busy */
|
||||
if (!(coll_op->pending & REQ_OUT_OF_MEMORY)) {
|
||||
coll_op->pending |= REQ_OUT_OF_MEMORY;
|
||||
opal_list_append(&ml_module->waiting_for_memory_list,
|
||||
(opal_list_item_t *)coll_op);
|
||||
}
|
||||
|
||||
return OMPI_ERR_TEMP_OUT_OF_RESOURCE;
|
||||
}
|
||||
|
||||
/* Get a new collective descriptor and initialize it */
|
||||
new_op = mca_coll_ml_duplicate_op_prog_single_frag_dag
|
||||
(ml_module, coll_op);
|
||||
@ -237,6 +238,7 @@ static int mca_coll_ml_bcast_frag_converter_progress(mca_coll_ml_collective_oper
|
||||
iov.iov_len = ml_module->ml_fragment_size;
|
||||
assert(0 != iov.iov_len);
|
||||
|
||||
max_data = ml_module->small_message_thresholds[BCOL_BCAST];
|
||||
opal_convertor_pack(&new_op->fragment_data.message_descriptor->send_convertor,
|
||||
&iov, &iov_count, &max_data);
|
||||
|
||||
@ -256,7 +258,8 @@ static int mca_coll_ml_bcast_frag_converter_progress(mca_coll_ml_collective_oper
|
||||
coll_ml_bcast_functions[new_op->fragment_data.current_coll_op]->
|
||||
task_setup_fn[COLL_ML_GENERAL_TASK_FN];
|
||||
|
||||
mca_coll_ml_convertor_get_send_frag_size(
|
||||
max_data = ml_module->small_message_thresholds[BCOL_BCAST];
|
||||
mca_coll_ml_convertor_get_send_frag_size(
|
||||
ml_module, &max_data,
|
||||
new_op->fragment_data.message_descriptor);
|
||||
}
|
||||
@ -339,28 +342,27 @@ static int mca_coll_ml_bcast_frag_progress(mca_coll_ml_collective_operation_prog
|
||||
*/
|
||||
if (0 < coll_op->fragment_data.message_descriptor->n_active) {
|
||||
return OMPI_SUCCESS;
|
||||
} else {
|
||||
/* It is useless to call progress from here, since
|
||||
* ml progress can't be executed as result ml memsync
|
||||
* call will not be completed and no memory will be
|
||||
* recycled. So we put the element on the list, and we will
|
||||
* progress it later when memsync will recycle some memory*/
|
||||
}
|
||||
|
||||
/* The fragment is already on list and
|
||||
* the we still have no ml resources
|
||||
* Return busy */
|
||||
if (coll_op->pending & REQ_OUT_OF_MEMORY) {
|
||||
ML_VERBOSE(10,("Out of resources %p", coll_op));
|
||||
return OMPI_ERR_TEMP_OUT_OF_RESOURCE;
|
||||
}
|
||||
/* It is useless to call progress from here, since
|
||||
* ml progress can't be executed as result ml memsync
|
||||
* call will not be completed and no memory will be
|
||||
* recycled. So we put the element on the list, and we will
|
||||
* progress it later when memsync will recycle some memory*/
|
||||
|
||||
/* The fragment is already on list and
|
||||
* the we still have no ml resources
|
||||
* Return busy */
|
||||
if (!(coll_op->pending & REQ_OUT_OF_MEMORY)) {
|
||||
ML_VERBOSE(10,("Out of resources %p adding to pending queue", coll_op));
|
||||
coll_op->pending |= REQ_OUT_OF_MEMORY;
|
||||
opal_list_append(&((OP_ML_MODULE(coll_op))->waiting_for_memory_list),
|
||||
(opal_list_item_t *) coll_op);
|
||||
|
||||
ML_VERBOSE(10,("Out of resources %p adding to pending queue", coll_op));
|
||||
return OMPI_ERR_TEMP_OUT_OF_RESOURCE;
|
||||
} else {
|
||||
ML_VERBOSE(10,("Out of resources %p", coll_op));
|
||||
}
|
||||
|
||||
return OMPI_ERR_TEMP_OUT_OF_RESOURCE;
|
||||
}
|
||||
|
||||
/* Get a new collective descriptor and initialize it */
|
||||
@ -448,8 +450,12 @@ static inline __opal_attribute_always_inline__
|
||||
mca_coll_ml_module_t *ml_module = (mca_coll_ml_module_t *) module;
|
||||
ml_payload_buffer_desc_t *src_buffer_desc = NULL;
|
||||
mca_coll_ml_task_setup_fn_t task_setup;
|
||||
OPAL_PTRDIFF_TYPE lb, extent;
|
||||
|
||||
ML_VERBOSE(10, ("Starting bcast, mca_coll_ml_bcast_uknown_root"));
|
||||
/* actual starting place of the user buffer (lb added) */
|
||||
void *actual_buf;
|
||||
|
||||
ML_VERBOSE(10, ("Starting bcast, mca_coll_ml_bcast_uknown_root buf: %p", buf));
|
||||
|
||||
ompi_datatype_type_size(dtype, &dt_size);
|
||||
pack_len = count * dt_size;
|
||||
@ -459,6 +465,10 @@ static inline __opal_attribute_always_inline__
|
||||
/* Get information about memory layout */
|
||||
contig = opal_datatype_is_contiguous_memory_layout((opal_datatype_t *)dtype, count);
|
||||
|
||||
ompi_datatype_get_extent (dtype, &lb, &extent);
|
||||
|
||||
actual_buf = (void *) ((uintptr_t) buf + lb);
|
||||
|
||||
/* Allocate collective schedule and pack message */
|
||||
if (contig) {
|
||||
if (pack_len <= (size_t) ml_module->small_message_thresholds[BCOL_BCAST]) {
|
||||
@ -466,9 +476,8 @@ static inline __opal_attribute_always_inline__
|
||||
bcast_index = ml_module->bcast_fn_index_table[SMALL_BCAST];
|
||||
|
||||
ML_VERBOSE(10, ("Contig + small message %d [0-sk, 1-lk, 3-su, 4-lu]\n", bcast_index));
|
||||
ALLOCATE_AND_PACK_CONTIG_BCAST_FRAG(ml_module, coll_op,
|
||||
bcast_index, root,
|
||||
pack_len, pack_len, buf, src_buffer_desc);
|
||||
ALLOCATE_AND_PACK_CONTIG_BCAST_FRAG(ml_module, coll_op, bcast_index, root, pack_len,
|
||||
pack_len, actual_buf, src_buffer_desc);
|
||||
|
||||
ML_SET_VARIABLE_PARAMS_BCAST(coll_op, ml_module, count, dtype,
|
||||
src_buffer_desc, 0, 0, ml_module->payload_block->size_buffer,
|
||||
@ -490,9 +499,8 @@ static inline __opal_attribute_always_inline__
|
||||
n_fragments = (pack_len + dt_size*n_dts_per_frag - 1)/(dt_size*n_dts_per_frag);
|
||||
pipeline_depth = (n_fragments < pipeline_depth ? n_fragments : pipeline_depth);
|
||||
|
||||
ALLOCATE_AND_PACK_CONTIG_BCAST_FRAG(ml_module, coll_op,
|
||||
bcast_index, root,
|
||||
pack_len, frag_len, buf, src_buffer_desc);
|
||||
ALLOCATE_AND_PACK_CONTIG_BCAST_FRAG(ml_module, coll_op, bcast_index, root, pack_len,
|
||||
frag_len, actual_buf, src_buffer_desc);
|
||||
ML_SET_VARIABLE_PARAMS_BCAST(coll_op, ml_module, (frag_len/dt_size), dtype,
|
||||
src_buffer_desc, 0, 0, frag_len, (src_buffer_desc->data_addr));
|
||||
|
||||
@ -500,7 +508,6 @@ static inline __opal_attribute_always_inline__
|
||||
coll_op->full_message.pipeline_depth = pipeline_depth;
|
||||
/* Initialize fragment specific information */
|
||||
coll_op->fragment_data.current_coll_op = bcast_index;
|
||||
coll_op->fragment_data.buffer_desc = src_buffer_desc;
|
||||
/* coll_op->fragment_data.message_descriptor->n_bytes_scheduled += frag_len; */
|
||||
coll_op->fragment_data.fragment_size = frag_len;
|
||||
coll_op->fragment_data.message_descriptor->n_active++;
|
||||
@ -515,10 +522,9 @@ static inline __opal_attribute_always_inline__
|
||||
ML_VERBOSE(10, ("Contig + zero copy %d [0-sk, 1-lk, 3-su, 4-lu]\n", bcast_index));
|
||||
|
||||
coll_op = mca_coll_ml_alloc_op_prog_single_frag_dag(ml_module,
|
||||
ml_module->coll_ml_bcast_functions[bcast_index],
|
||||
buf, buf,
|
||||
pack_len,
|
||||
0 /* offset for first pack */);
|
||||
ml_module->coll_ml_bcast_functions[bcast_index],
|
||||
actual_buf, actual_buf, pack_len,
|
||||
0 /* offset for first pack */);
|
||||
/* For large messages (bcast) this points to userbuf */
|
||||
/* Pasha: temporary work around for basesmuma, userbuf should
|
||||
be removed */
|
||||
@ -536,10 +542,9 @@ static inline __opal_attribute_always_inline__
|
||||
ML_VERBOSE(10, ("NON Contig + fragmentation %d [0-sk, 1-lk, 3-su, 4-lu]\n", bcast_index));
|
||||
|
||||
coll_op = mca_coll_ml_alloc_op_prog_single_frag_dag(ml_module,
|
||||
ml_module->coll_ml_bcast_functions[bcast_index],
|
||||
buf, buf,
|
||||
pack_len,
|
||||
0 /* offset for first pack */);
|
||||
ml_module->coll_ml_bcast_functions[bcast_index],
|
||||
actual_buf, actual_buf, pack_len,
|
||||
0 /* offset for first pack */);
|
||||
if (OPAL_LIKELY(pack_len > 0)) {
|
||||
size_t max_data = 0;
|
||||
|
||||
@ -560,10 +565,9 @@ static inline __opal_attribute_always_inline__
|
||||
|
||||
iov.iov_base = (IOVBASE_TYPE*) src_buffer_desc->data_addr;
|
||||
iov.iov_len = ml_module->ml_fragment_size;
|
||||
|
||||
max_data = ml_module->small_message_thresholds[BCOL_BCAST];
|
||||
opal_convertor_pack(&coll_op->full_message.send_convertor,
|
||||
&iov, &iov_count, &max_data);
|
||||
|
||||
coll_op->process_fn = NULL;
|
||||
coll_op->full_message.n_bytes_scheduled = max_data;
|
||||
|
||||
@ -571,6 +575,7 @@ static inline __opal_attribute_always_inline__
|
||||
coll_op->full_message.fragment_launcher = mca_coll_ml_bcast_frag_converter_progress;
|
||||
coll_op->full_message.pipeline_depth = mca_coll_ml_component.pipeline_depth;
|
||||
coll_op->full_message.root = true;
|
||||
|
||||
} else {
|
||||
opal_convertor_copy_and_prepare_for_send(
|
||||
ompi_mpi_local_convertor,
|
||||
@ -597,6 +602,7 @@ static inline __opal_attribute_always_inline__
|
||||
coll_op->full_message.fragment_launcher = mca_coll_ml_bcast_frag_converter_progress;
|
||||
coll_op->full_message.pipeline_depth = mca_coll_ml_component.pipeline_depth;
|
||||
|
||||
max_data = ml_module->small_message_thresholds[BCOL_BCAST];
|
||||
coll_op->full_message.dummy_conv_position = 0;
|
||||
mca_coll_ml_convertor_get_send_frag_size(
|
||||
ml_module, &max_data,
|
||||
@ -605,7 +611,6 @@ static inline __opal_attribute_always_inline__
|
||||
coll_op->full_message.n_bytes_scheduled = max_data;
|
||||
}
|
||||
}
|
||||
|
||||
coll_op->fragment_data.current_coll_op = bcast_index;
|
||||
coll_op->fragment_data.message_descriptor->n_active++;
|
||||
coll_op->fragment_data.fragment_size = coll_op->full_message.n_bytes_scheduled;
|
||||
@ -675,9 +680,9 @@ int mca_coll_ml_parallel_bcast(void *buf, int count, struct ompi_datatype_t *dty
|
||||
}
|
||||
|
||||
int mca_coll_ml_parallel_bcast_nb(void *buf, int count, struct ompi_datatype_t *dtype,
|
||||
int root, struct ompi_communicator_t *comm,
|
||||
ompi_request_t **req,
|
||||
mca_coll_base_module_t *module)
|
||||
int root, struct ompi_communicator_t *comm,
|
||||
ompi_request_t **req,
|
||||
mca_coll_base_module_t *module)
|
||||
{
|
||||
int ret;
|
||||
|
||||
@ -693,8 +698,8 @@ int mca_coll_ml_parallel_bcast_nb(void *buf, int count, struct ompi_datatype_t *
|
||||
}
|
||||
|
||||
int mca_coll_ml_bcast_sequential_root(void *buf, int count, struct ompi_datatype_t *dtype,
|
||||
int root, struct ompi_communicator_t *comm,
|
||||
mca_coll_base_module_t *module)
|
||||
int root, struct ompi_communicator_t *comm,
|
||||
mca_coll_base_module_t *module)
|
||||
{
|
||||
|
||||
/* local variables */
|
||||
@ -707,6 +712,10 @@ int mca_coll_ml_bcast_sequential_root(void *buf, int count, struct ompi_datatype
|
||||
mca_coll_ml_module_t *ml_module = (mca_coll_ml_module_t *) module;
|
||||
ml_payload_buffer_desc_t *src_buffer_desc = NULL;
|
||||
mca_bcol_base_coll_fn_desc_t *func;
|
||||
OPAL_PTRDIFF_TYPE lb, extent;
|
||||
|
||||
/* actual starting place of the user buffer (lb added) */
|
||||
void *actual_buf;
|
||||
|
||||
ML_VERBOSE(10, ("Starting static bcast, small messages"));
|
||||
|
||||
@ -716,6 +725,8 @@ int mca_coll_ml_bcast_sequential_root(void *buf, int count, struct ompi_datatype
|
||||
ompi_datatype_type_size(dtype, &dt_size);
|
||||
pack_len = count * dt_size;
|
||||
|
||||
actual_buf = (void *) ((uintptr_t) buf + lb);
|
||||
|
||||
/* Setup data buffer */
|
||||
src_buffer_desc = mca_coll_ml_alloc_buffer(ml_module);
|
||||
while (NULL == src_buffer_desc) {
|
||||
@ -729,10 +740,9 @@ int mca_coll_ml_bcast_sequential_root(void *buf, int count, struct ompi_datatype
|
||||
assert(pack_len <= ml_module->payload_block->size_buffer);
|
||||
|
||||
coll_op = mca_coll_ml_alloc_op_prog_single_frag_dag(ml_module,
|
||||
ml_module->coll_ml_bcast_functions[ML_BCAST_SMALL_DATA_SEQUENTIAL],
|
||||
buf, buf,
|
||||
pack_len,
|
||||
0 /* offset for first pack */);
|
||||
ml_module->coll_ml_bcast_functions[ML_BCAST_SMALL_DATA_SEQUENTIAL],
|
||||
actual_buf, actual_buf, pack_len,
|
||||
0 /* offset for first pack */);
|
||||
if (ompi_comm_rank(comm) == root) {
|
||||
/* single frag, pack the data */
|
||||
memcpy((void *)(uintptr_t)src_buffer_desc->data_addr,
|
||||
@ -748,15 +758,14 @@ int mca_coll_ml_bcast_sequential_root(void *buf, int count, struct ompi_datatype
|
||||
} else {
|
||||
ML_VERBOSE(10, ("ML_BCAST_LARGE_DATA_KNOWN case."));
|
||||
coll_op = mca_coll_ml_alloc_op_prog_single_frag_dag(ml_module,
|
||||
ml_module->coll_ml_bcast_functions[ML_BCAST_LARGE_DATA_SEQUENTIAL],
|
||||
buf, buf,
|
||||
pack_len,
|
||||
0 /* offset for first pack */);
|
||||
ml_module->coll_ml_bcast_functions[ML_BCAST_LARGE_DATA_SEQUENTIAL],
|
||||
actual_buf, actual_buf, pack_len,
|
||||
0 /* offset for first pack */);
|
||||
/* For large messages (bcast) this points to userbuf */
|
||||
/* Pasha: temporary work around for basesmuma, userbuf should
|
||||
be removed */
|
||||
coll_op->variable_fn_params.userbuf =
|
||||
coll_op->variable_fn_params.sbuf = buf;
|
||||
coll_op->variable_fn_params.sbuf = actual_buf;
|
||||
|
||||
coll_op->process_fn = NULL;
|
||||
}
|
||||
|
@ -288,8 +288,8 @@ struct mca_coll_ml_collective_operation_progress_t {
|
||||
int64_t send_count;
|
||||
int64_t recv_count;
|
||||
/* extent of the data types */
|
||||
int32_t send_extent;
|
||||
int32_t recv_extent;
|
||||
size_t send_extent;
|
||||
size_t recv_extent;
|
||||
/* send data type */
|
||||
struct ompi_datatype_t * send_data_type;
|
||||
/* needed for non-contigous buffers */
|
||||
@ -309,9 +309,11 @@ struct mca_coll_ml_collective_operation_progress_t {
|
||||
size_t recv_converter_bytes_packed;
|
||||
/* In case if ordering is needed: order num for next frag */
|
||||
int next_frag_num;
|
||||
/* The variable is used by non-blocking memory synchronization code
|
||||
/* The variable is used by non-blocking memory synchronization code
|
||||
* for caching bank index */
|
||||
int bank_index_to_recycle;
|
||||
/* need a handle for collective progress e.g. alltoall*/
|
||||
bcol_fragment_descriptor_t frag_info;
|
||||
} full_message;
|
||||
|
||||
/* collective operation being progressed */
|
||||
@ -347,6 +349,8 @@ struct mca_coll_ml_collective_operation_progress_t {
|
||||
|
||||
/* ML buffer descriptor attached to this buffer */
|
||||
struct ml_payload_buffer_desc_t *buffer_desc;
|
||||
/* handle for collective progress, e.g. alltoall */
|
||||
bcol_fragment_descriptor_t bcol_fragment_desc;
|
||||
|
||||
/* Which collective algorithm */
|
||||
int current_coll_op;
|
||||
@ -359,6 +363,7 @@ struct mca_coll_ml_collective_operation_progress_t {
|
||||
* function is exited, and for nonblocking collective functions this
|
||||
* is until test or wait completes the collective.
|
||||
*/
|
||||
int global_root;
|
||||
bcol_function_args_t variable_fn_params;
|
||||
|
||||
struct{
|
||||
@ -407,9 +412,8 @@ do {
|
||||
/* back to the free list (free list may release memory on distruct )*/ \
|
||||
struct ompi_communicator_t *comm = GET_COMM(op); \
|
||||
bool is_coll_sync = IS_COLL_SYNCMEM(op); \
|
||||
assert(&(op)->full_message != \
|
||||
(op)->fragment_data.message_descriptor); \
|
||||
ML_VERBOSE(10, ("Releasing %p", op)); \
|
||||
OMPI_REQUEST_FINI(&(op)->full_message.super); \
|
||||
OMPI_FREE_LIST_RETURN_MT(&(((mca_coll_ml_module_t *)(op)->coll_module)-> \
|
||||
coll_ml_collective_descriptors), \
|
||||
(ompi_free_list_item_t *)op); \
|
||||
@ -468,5 +472,66 @@ do {
|
||||
} \
|
||||
} while (0)
|
||||
|
||||
enum {
|
||||
MCA_COLL_ML_NET_STREAM_SEND,
|
||||
MCA_COLL_ML_NET_STREAM_RECV
|
||||
};
|
||||
|
||||
static inline __opal_attribute_always_inline__
|
||||
int mca_coll_ml_convertor_prepare(ompi_datatype_t *dtype, int count, void *buff,
|
||||
opal_convertor_t *convertor, int stream)
|
||||
{
|
||||
size_t bytes_packed;
|
||||
|
||||
if (MCA_COLL_ML_NET_STREAM_SEND == stream) {
|
||||
opal_convertor_copy_and_prepare_for_send(
|
||||
ompi_mpi_local_convertor,
|
||||
&dtype->super, count, buff, 0,
|
||||
convertor);
|
||||
} else {
|
||||
opal_convertor_copy_and_prepare_for_recv(
|
||||
ompi_mpi_local_convertor,
|
||||
&dtype->super, count, buff, 0,
|
||||
convertor);
|
||||
}
|
||||
|
||||
opal_convertor_get_packed_size(convertor, &bytes_packed);
|
||||
|
||||
return bytes_packed;
|
||||
}
|
||||
|
||||
static inline __opal_attribute_always_inline__
|
||||
int mca_coll_ml_convertor_pack(void *data_addr, size_t buff_size,
|
||||
opal_convertor_t *convertor)
|
||||
{
|
||||
struct iovec iov;
|
||||
|
||||
size_t max_data = 0;
|
||||
uint32_t iov_count = 1;
|
||||
|
||||
iov.iov_base = (IOVBASE_TYPE*) data_addr;
|
||||
iov.iov_len = buff_size;
|
||||
|
||||
opal_convertor_pack(convertor, &iov, &iov_count, &max_data);
|
||||
|
||||
return max_data;
|
||||
}
|
||||
|
||||
static inline __opal_attribute_always_inline__
|
||||
int mca_coll_ml_convertor_unpack(void *data_addr, size_t buff_size,
|
||||
opal_convertor_t *convertor)
|
||||
{
|
||||
struct iovec iov;
|
||||
|
||||
size_t max_data = 0;
|
||||
uint32_t iov_count = 1;
|
||||
|
||||
iov.iov_base = (void *) (uintptr_t) data_addr;
|
||||
iov.iov_len = buff_size;
|
||||
|
||||
opal_convertor_unpack(convertor, &iov, &iov_count, &max_data);
|
||||
|
||||
return max_data;
|
||||
}
|
||||
#endif /* MCA_COLL_ML_COLLS_H */
|
||||
|
||||
|
@ -2,6 +2,8 @@
|
||||
/*
|
||||
* Copyright (c) 2009-2012 Oak Ridge National Laboratory. All rights reserved.
|
||||
* Copyright (c) 2009-2012 Mellanox Technologies. All rights reserved.
|
||||
* Copyright (c) 2013 Los Alamos National Security, LLC. All rights
|
||||
* reserved.
|
||||
* $COPYRIGHT$
|
||||
*
|
||||
* Additional copyrights may follow
|
||||
@ -60,54 +62,35 @@ mca_coll_ml_component_t mca_coll_ml_component = {
|
||||
|
||||
/* First, fill in the super */
|
||||
|
||||
{
|
||||
.super = {
|
||||
/* First, the mca_component_t struct containing meta
|
||||
information about the component itself */
|
||||
|
||||
{
|
||||
.collm_version = {
|
||||
MCA_COLL_BASE_VERSION_2_0_0,
|
||||
|
||||
/* Component name and version */
|
||||
|
||||
"ml",
|
||||
OMPI_MAJOR_VERSION,
|
||||
OMPI_MINOR_VERSION,
|
||||
OMPI_RELEASE_VERSION,
|
||||
.mca_component_name = "ml",
|
||||
.mca_component_major_version = OMPI_MAJOR_VERSION,
|
||||
.mca_component_minor_version = OMPI_MINOR_VERSION,
|
||||
.mca_component_release_version = OMPI_RELEASE_VERSION,
|
||||
|
||||
/* Component open and close functions */
|
||||
/* Component open, close, and register functions */
|
||||
|
||||
ml_open,
|
||||
ml_close,
|
||||
.mca_open_component = ml_open,
|
||||
.mca_close_component = ml_close,
|
||||
.mca_register_component_params = mca_coll_ml_register_params
|
||||
},
|
||||
{
|
||||
.collm_data = {
|
||||
/* The component is not checkpoint ready */
|
||||
MCA_BASE_METADATA_PARAM_NONE
|
||||
},
|
||||
|
||||
/* Initialization / querying functions */
|
||||
mca_coll_ml_init_query,
|
||||
mca_coll_ml_comm_query,
|
||||
.collm_init_query = mca_coll_ml_init_query,
|
||||
.collm_comm_query = mca_coll_ml_comm_query,
|
||||
},
|
||||
|
||||
/* ml-component specifc information */
|
||||
|
||||
/* (default) priority */
|
||||
0,
|
||||
/* Number of levels */
|
||||
0,
|
||||
/* subgrouping components to use */
|
||||
NULL,
|
||||
/* basic collectives components to use */
|
||||
NULL,
|
||||
/* verbose */
|
||||
0,
|
||||
/* max of communicators */
|
||||
0,
|
||||
/* min size of comm */
|
||||
0,
|
||||
/* base_sequence_number */
|
||||
0,
|
||||
};
|
||||
|
||||
void mca_coll_ml_abort_ml(char *message)
|
||||
@ -160,124 +143,63 @@ static int coll_ml_progress()
|
||||
/* progress sequential collective operations */
|
||||
/* RLG - need to do better here for parallel progress */
|
||||
OPAL_THREAD_LOCK(&(cm->sequential_collectives_mutex));
|
||||
for (seq_coll_op = (mca_coll_ml_collective_operation_progress_t *)opal_list_get_first(SEQ_L);
|
||||
seq_coll_op != (mca_coll_ml_collective_operation_progress_t *)opal_list_get_end(SEQ_L);
|
||||
seq_coll_op = (mca_coll_ml_collective_operation_progress_t *)opal_list_get_next((opal_list_item_t *)seq_coll_op)){
|
||||
|
||||
fn_idx = seq_coll_op->sequential_routine.current_active_bcol_fn;
|
||||
/* initialize the task */
|
||||
|
||||
if(SEQ_TASK_IN_PROG == seq_coll_op->sequential_routine.current_bcol_status){
|
||||
progress_fn = seq_coll_op->coll_schedule->
|
||||
component_functions[fn_idx].bcol_function->progress_fn;
|
||||
}else{
|
||||
/* PPP Pasha - apparently task setup should be called only here. see linr 190 */
|
||||
progress_fn = seq_coll_op->coll_schedule->
|
||||
component_functions[fn_idx].bcol_function->coll_fn;
|
||||
seq_coll_op->sequential_routine.current_bcol_status =
|
||||
SEQ_TASK_IN_PROG;
|
||||
}
|
||||
|
||||
const_args = &seq_coll_op->coll_schedule->component_functions[fn_idx].constant_group_data;
|
||||
/* RLG - note need to move to useing coll_ml_utility_data_t as
|
||||
* collective argument, rather than coll_ml_function_t
|
||||
*/
|
||||
rc = progress_fn(&(seq_coll_op->variable_fn_params), (coll_ml_function_t *)const_args);
|
||||
if (BCOL_FN_COMPLETE == rc) {
|
||||
/* done with this routine */
|
||||
seq_coll_op->sequential_routine.current_active_bcol_fn++;
|
||||
/* this is totally hardwired for bcast, need a general call-back */
|
||||
/*seq_coll_op->variable_fn_params.root_flag = true;*/
|
||||
OPAL_LIST_FOREACH_SAFE(seq_coll_op, seq_coll_op_tmp, SEQ_L, mca_coll_ml_collective_operation_progress_t) {
|
||||
do {
|
||||
fn_idx = seq_coll_op->sequential_routine.current_active_bcol_fn;
|
||||
if (seq_coll_op->sequential_routine.current_active_bcol_fn ==
|
||||
seq_coll_op->coll_schedule->n_fns) {
|
||||
/* done with this collective - recycle descriptor */
|
||||
/* initialize the task */
|
||||
|
||||
/* remove from the progress list */
|
||||
seq_coll_op_tmp = (mca_coll_ml_collective_operation_progress_t *)
|
||||
opal_list_remove_item(SEQ_L, (opal_list_item_t *)seq_coll_op);
|
||||
|
||||
/* handle fragment completion */
|
||||
rc = coll_ml_fragment_completion_processing(seq_coll_op);
|
||||
|
||||
if (OMPI_SUCCESS != rc) {
|
||||
mca_coll_ml_abort_ml("Failed to run coll_ml_fragment_completion_processing");
|
||||
}
|
||||
/* make sure that for will pick up right one */
|
||||
seq_coll_op = seq_coll_op_tmp;
|
||||
}else {
|
||||
/* task setup */
|
||||
/* Pasha - Another call for task setup ? Why ?*/
|
||||
rc = seq_coll_op->sequential_routine.seq_task_setup(seq_coll_op);
|
||||
/* else, start firing bcol functions */
|
||||
while(true) {
|
||||
|
||||
fn_idx = seq_coll_op->sequential_routine.current_active_bcol_fn;
|
||||
const_args = &seq_coll_op->coll_schedule->
|
||||
component_functions[fn_idx].constant_group_data;
|
||||
coll_fn = seq_coll_op->coll_schedule->
|
||||
component_functions[fn_idx].bcol_function->coll_fn;
|
||||
rc = coll_fn(&seq_coll_op->variable_fn_params,
|
||||
(coll_ml_function_t *) const_args);
|
||||
|
||||
if (BCOL_FN_COMPLETE == rc) {
|
||||
|
||||
seq_coll_op->sequential_routine.current_active_bcol_fn++;
|
||||
fn_idx = seq_coll_op->sequential_routine.current_active_bcol_fn;
|
||||
|
||||
/* done with this routine,
|
||||
* check for collective completion */
|
||||
if (seq_coll_op->sequential_routine.current_active_bcol_fn ==
|
||||
seq_coll_op->coll_schedule->n_fns) {
|
||||
/* remove from the progress list */
|
||||
seq_coll_op_tmp = (mca_coll_ml_collective_operation_progress_t *)
|
||||
opal_list_remove_item(SEQ_L, (opal_list_item_t *)seq_coll_op);
|
||||
|
||||
/* handle fragment completion */
|
||||
rc = coll_ml_fragment_completion_processing(seq_coll_op);
|
||||
if (OMPI_SUCCESS != rc) {
|
||||
mca_coll_ml_abort_ml("Failed to run coll_ml_fragment_completion_processing");
|
||||
}
|
||||
|
||||
/* make sure that for will pick up right one */
|
||||
seq_coll_op = seq_coll_op_tmp;
|
||||
|
||||
/* break out of while loop */
|
||||
break;
|
||||
}else {
|
||||
/* setup the next task */
|
||||
/* sequential task setup */
|
||||
seq_coll_op->sequential_routine.seq_task_setup(seq_coll_op);
|
||||
}
|
||||
|
||||
}else if (BCOL_FN_NOT_STARTED == rc) {
|
||||
|
||||
seq_coll_op->sequential_routine.current_bcol_status = SEQ_TASK_PENDING;
|
||||
|
||||
break;
|
||||
} else {
|
||||
break;
|
||||
}
|
||||
|
||||
}
|
||||
if (SEQ_TASK_IN_PROG == seq_coll_op->sequential_routine.current_bcol_status){
|
||||
progress_fn = seq_coll_op->coll_schedule->
|
||||
component_functions[fn_idx].bcol_function->progress_fn;
|
||||
} else {
|
||||
/* PPP Pasha - apparently task setup should be called only here. see linr 190 */
|
||||
progress_fn = seq_coll_op->coll_schedule->
|
||||
component_functions[fn_idx].bcol_function->coll_fn;
|
||||
}
|
||||
|
||||
const_args = &seq_coll_op->coll_schedule->component_functions[fn_idx].constant_group_data;
|
||||
/* RLG - note need to move to useing coll_ml_utility_data_t as
|
||||
* collective argument, rather than coll_ml_function_t
|
||||
*/
|
||||
rc = progress_fn(&(seq_coll_op->variable_fn_params), (coll_ml_function_t *)const_args);
|
||||
if (BCOL_FN_COMPLETE == rc) {
|
||||
/* done with this routine */
|
||||
seq_coll_op->sequential_routine.current_active_bcol_fn++;
|
||||
/* this is totally hardwired for bcast, need a general call-back */
|
||||
|
||||
} else if (BCOL_FN_NOT_STARTED == rc ){
|
||||
seq_coll_op->sequential_routine.current_bcol_status = SEQ_TASK_PENDING;
|
||||
}
|
||||
fn_idx = seq_coll_op->sequential_routine.current_active_bcol_fn;
|
||||
if (fn_idx == seq_coll_op->coll_schedule->n_fns) {
|
||||
/* done with this collective - recycle descriptor */
|
||||
|
||||
/* remove from the progress list */
|
||||
(void) opal_list_remove_item(SEQ_L, (opal_list_item_t *)seq_coll_op);
|
||||
|
||||
/* handle fragment completion */
|
||||
rc = coll_ml_fragment_completion_processing(seq_coll_op);
|
||||
|
||||
if (OMPI_SUCCESS != rc) {
|
||||
mca_coll_ml_abort_ml("Failed to run coll_ml_fragment_completion_processing");
|
||||
}
|
||||
} else {
|
||||
rc = seq_coll_op->sequential_routine.seq_task_setup(seq_coll_op);
|
||||
seq_coll_op->sequential_routine.current_bcol_status = SEQ_TASK_PENDING;
|
||||
continue;
|
||||
}
|
||||
} else if (BCOL_FN_NOT_STARTED == rc) {
|
||||
seq_coll_op->sequential_routine.current_bcol_status = SEQ_TASK_PENDING;
|
||||
} else if (BCOL_FN_STARTED == rc) {
|
||||
seq_coll_op->sequential_routine.current_bcol_status = SEQ_TASK_IN_PROG;
|
||||
}
|
||||
|
||||
break;
|
||||
} while (true);
|
||||
}
|
||||
OPAL_THREAD_UNLOCK(&(cm->sequential_collectives_mutex));
|
||||
|
||||
/* general dag's */
|
||||
/* see if active tasks can be progressed */
|
||||
OPAL_THREAD_LOCK(&(cm->active_tasks_mutex));
|
||||
for (task_status = (mca_coll_ml_task_status_t *)opal_list_get_first(ACTIVE_L);
|
||||
task_status != (mca_coll_ml_task_status_t *)opal_list_get_end(ACTIVE_L);
|
||||
task_status = (mca_coll_ml_task_status_t *)opal_list_get_next(task_status)
|
||||
)
|
||||
{
|
||||
OPAL_LIST_FOREACH(task_status, ACTIVE_L, mca_coll_ml_task_status_t) {
|
||||
/* progress task */
|
||||
progress_fn = task_status->bcol_fn->progress_fn;
|
||||
const_args = &task_status->ml_coll_operation->coll_schedule->
|
||||
@ -300,10 +222,7 @@ static int coll_ml_progress()
|
||||
|
||||
/* see if new tasks can be initiated */
|
||||
OPAL_THREAD_LOCK(&(cm->pending_tasks_mutex));
|
||||
for (task_status = (mca_coll_ml_task_status_t *) opal_list_get_first(PENDING_L);
|
||||
task_status != (mca_coll_ml_task_status_t *) opal_list_get_end(PENDING_L);
|
||||
task_status = (mca_coll_ml_task_status_t *) opal_list_get_next(task_status))
|
||||
{
|
||||
OPAL_LIST_FOREACH_SAFE(task_status, task_status_tmp, PENDING_L, mca_coll_ml_task_status_t) {
|
||||
/* check to see if dependencies are satisfied */
|
||||
int n_dependencies = task_status->rt_num_dependencies;
|
||||
int n_dependencies_satisfied = task_status->n_dep_satisfied;
|
||||
@ -323,15 +242,13 @@ static int coll_ml_progress()
|
||||
}
|
||||
} else if ( BCOL_FN_STARTED == rc ) {
|
||||
ML_VERBOSE(3, ("GOT BCOL_STARTED!"));
|
||||
task_status_tmp = (mca_coll_ml_task_status_t *)
|
||||
opal_list_remove_item(PENDING_L, (opal_list_item_t *)task_status);
|
||||
(void) opal_list_remove_item(PENDING_L, (opal_list_item_t *)task_status);
|
||||
/* RLG - is there potential for deadlock here ? Need to
|
||||
* look at this closely
|
||||
*/
|
||||
OPAL_THREAD_LOCK(&(cm->active_tasks_mutex));
|
||||
opal_list_append(ACTIVE_L, (opal_list_item_t *)task_status);
|
||||
OPAL_THREAD_UNLOCK(&(cm->active_tasks_mutex));
|
||||
task_status = task_status_tmp;
|
||||
} else if( BCOL_FN_NOT_STARTED == rc ) {
|
||||
/* nothing to do */
|
||||
ML_VERBOSE(10, ("GOT BCOL_FN_NOT_STARTED!"));
|
||||
@ -342,6 +259,7 @@ static int coll_ml_progress()
|
||||
* the way the code is implemented now */
|
||||
ML_VERBOSE(3, ("GOT error !"));
|
||||
rc = OMPI_ERROR;
|
||||
OMPI_ERRHANDLER_RETURN(rc,MPI_COMM_WORLD,rc,"Error returned from bcol function: aborting");
|
||||
break;
|
||||
}
|
||||
}
|
||||
@ -357,14 +275,11 @@ static int coll_ml_progress()
|
||||
|
||||
static void adjust_coll_config_by_mca_param(void)
|
||||
{
|
||||
assert(false == mca_coll_ml_component.use_static_bcast ||
|
||||
false == mca_coll_ml_component.use_sequential_bcast);
|
||||
|
||||
/* setting bcast mca params */
|
||||
if (mca_coll_ml_component.use_static_bcast) {
|
||||
if (COLL_ML_STATIC_BCAST == mca_coll_ml_component.bcast_algorithm) {
|
||||
mca_coll_ml_component.coll_config[ML_BCAST][ML_SMALL_MSG].algorithm_id = ML_BCAST_SMALL_DATA_KNOWN;
|
||||
mca_coll_ml_component.coll_config[ML_BCAST][ML_LARGE_MSG].algorithm_id = ML_BCAST_LARGE_DATA_KNOWN;
|
||||
} else if (mca_coll_ml_component.use_sequential_bcast){
|
||||
} else if (COLL_ML_SEQ_BCAST == mca_coll_ml_component.bcast_algorithm) {
|
||||
mca_coll_ml_component.coll_config[ML_BCAST][ML_SMALL_MSG].algorithm_id = ML_BCAST_SMALL_DATA_SEQUENTIAL;
|
||||
mca_coll_ml_component.coll_config[ML_BCAST][ML_LARGE_MSG].algorithm_id = ML_BCAST_LARGE_DATA_SEQUENTIAL;
|
||||
} else { /* Unknown root */
|
||||
@ -468,7 +383,7 @@ static int ml_close(void)
|
||||
|
||||
mca_coll_ml_component_t *cs = &mca_coll_ml_component;
|
||||
|
||||
/* There is not need to release/close resource if the
|
||||
/* There is not need to release/close resource if the
|
||||
* priority was set to zero */
|
||||
if (cs->ml_priority <= 0) {
|
||||
return OMPI_SUCCESS;
|
||||
|
@ -73,6 +73,14 @@ static int algorithm_name_to_id(char *name)
|
||||
return ML_SMALL_DATA_ALLREDUCE;
|
||||
if (!strcasecmp(name,"ML_LARGE_DATA_ALLREDUCE"))
|
||||
return ML_LARGE_DATA_ALLREDUCE;
|
||||
if (!strcasecmp(name,"ML_SMALL_DATA_REDUCE"))
|
||||
return ML_SMALL_DATA_ALLREDUCE;
|
||||
if (!strcasecmp(name,"ML_LARGE_DATA_REDUCE"))
|
||||
return ML_LARGE_DATA_ALLREDUCE;
|
||||
if (!strcasecmp(name,"ML_SMALL_DATA_REDUCE"))
|
||||
return ML_SMALL_DATA_REDUCE;
|
||||
if (!strcasecmp(name,"ML_LARGE_DATA_REDUCE"))
|
||||
return ML_LARGE_DATA_REDUCE;
|
||||
if (!strcasecmp(name,"ML_NUM_ALLREDUCE_FUNCTIONS"))
|
||||
return ML_NUM_ALLREDUCE_FUNCTIONS;
|
||||
if (!strcasecmp(name,"ML_SMALL_DATA_ALLTOALL"))
|
||||
|
@ -86,6 +86,17 @@ enum {
|
||||
ML_NUM_ALLREDUCE_FUNCTIONS
|
||||
};
|
||||
|
||||
/* Reduce functions */
|
||||
enum {
|
||||
/* small data algorithm */
|
||||
ML_SMALL_DATA_REDUCE,
|
||||
|
||||
/* Large data algorithm */
|
||||
ML_LARGE_DATA_REDUCE,
|
||||
|
||||
/* number of functions */
|
||||
ML_NUM_REDUCE_FUNCTIONS
|
||||
};
|
||||
/* Alltoall functions */
|
||||
enum {
|
||||
/* small data algorithm */
|
||||
|
@ -78,37 +78,32 @@ int ml_coll_schedule_setup(mca_coll_ml_module_t *ml_module)
|
||||
}
|
||||
|
||||
/* Allreduce */
|
||||
/*
|
||||
if (!mca_coll_ml_component.use_knomial_allreduce) {
|
||||
ret = ml_coll_hier_allreduce_setup(ml_module);
|
||||
} else {
|
||||
ret = ml_coll_hier_allreduce_setup_new(ml_module);
|
||||
}
|
||||
*/
|
||||
|
||||
if( OMPI_SUCCESS != ret ) {
|
||||
return ret;
|
||||
}
|
||||
|
||||
|
||||
|
||||
/* Alltoall */
|
||||
/*
|
||||
/*
|
||||
ret = ml_coll_hier_alltoall_setup_new(ml_module);
|
||||
|
||||
if( OMPI_SUCCESS != ret ) {
|
||||
return ret;
|
||||
}
|
||||
*/
|
||||
|
||||
|
||||
|
||||
/* Allgather */
|
||||
/*
|
||||
ret = ml_coll_hier_allgather_setup(ml_module);
|
||||
|
||||
if( OMPI_SUCCESS != ret ) {
|
||||
return ret;
|
||||
}
|
||||
*/
|
||||
|
||||
/* Gather */
|
||||
/*
|
||||
@ -120,12 +115,10 @@ int ml_coll_schedule_setup(mca_coll_ml_module_t *ml_module)
|
||||
*/
|
||||
|
||||
/* Reduce */
|
||||
|
||||
ret = ml_coll_hier_reduce_setup(ml_module);
|
||||
if( OMPI_SUCCESS != ret ) {
|
||||
return ret;
|
||||
}
|
||||
|
||||
|
||||
/* Scatter */
|
||||
/*
|
||||
@ -134,6 +127,7 @@ int ml_coll_schedule_setup(mca_coll_ml_module_t *ml_module)
|
||||
return ret;
|
||||
}
|
||||
*/
|
||||
|
||||
ret = ml_coll_memsync_setup(ml_module);
|
||||
if( OMPI_SUCCESS != ret ) {
|
||||
return ret;
|
||||
@ -150,7 +144,9 @@ int ml_coll_schedule_setup(mca_coll_ml_module_t *ml_module)
|
||||
/* Pasha: Do we have to keep the max_dag_size ?
|
||||
In most generic case, it will be equal to max_fn_calls */
|
||||
ml_module->max_dag_size = ml_module->max_fn_calls;
|
||||
|
||||
assert(ml_module->max_dag_size > 0);
|
||||
|
||||
/* initialize the mca_coll_ml_collective_operation_progress_t free list */
|
||||
/* NOTE: as part of initialization each routine needs to make sure that
|
||||
* the module element max_dag_size is set large enough - space for
|
||||
|
193
ompi/mca/coll/ml/coll_ml_hier_algorithms_allgather_setup.c
Обычный файл
193
ompi/mca/coll/ml/coll_ml_hier_algorithms_allgather_setup.c
Обычный файл
@ -0,0 +1,193 @@
|
||||
/* -*- Mode: C; c-basic-offset:4 ; indent-tabs-mode:nil -*- */
|
||||
/*
|
||||
* Copyright (c) 2009-2012 Oak Ridge National Laboratory. All rights reserved.
|
||||
* Copyright (c) 2009-2012 Mellanox Technologies. All rights reserved.
|
||||
* Copyright (c) 2013 Los Alamos National Security, LLC. All rights
|
||||
* reserved.
|
||||
* $COPYRIGHT$
|
||||
*
|
||||
* Additional copyrights may follow
|
||||
*
|
||||
* $HEADER$
|
||||
*/
|
||||
|
||||
#include "ompi_config.h"
|
||||
|
||||
#include "ompi/mca/coll/ml/coll_ml.h"
|
||||
#include "ompi/include/ompi/constants.h"
|
||||
#include "ompi/mca/coll/ml/coll_ml_functions.h"
|
||||
#include "ompi/mca/coll/ml/coll_ml_hier_algorithms_common_setup.h"
|
||||
#include "ompi/patterns/net/netpatterns_knomial_tree.h"
|
||||
|
||||
#define SMALL_MSG_RANGE 1
|
||||
#define LARGE_MSG_RANGE 5
|
||||
|
||||
static int mca_coll_ml_build_allgather_schedule(mca_coll_ml_topology_t *topo_info,
|
||||
mca_coll_ml_collective_operation_description_t **coll_desc, int bcol_func_index)
|
||||
{
|
||||
int ret; /* exit code in case of error */
|
||||
int nfn = 0;
|
||||
int i;
|
||||
int *scratch_indx = NULL,
|
||||
*scratch_num = NULL;
|
||||
|
||||
mca_coll_ml_collective_operation_description_t *schedule = NULL;
|
||||
mca_coll_ml_compound_functions_t *comp_fn;
|
||||
mca_coll_ml_schedule_hier_info_t h_info;
|
||||
|
||||
ML_VERBOSE(9, ("Setting hierarchy, inputs : n_levels %d, hiest %d ",
|
||||
topo_info->n_levels, topo_info->global_highest_hier_group_index));
|
||||
MCA_COLL_ML_INIT_HIER_INFO(h_info, topo_info->n_levels,
|
||||
topo_info->global_highest_hier_group_index, topo_info);
|
||||
|
||||
ret = mca_coll_ml_schedule_init_scratch(topo_info, &h_info,
|
||||
&scratch_indx, &scratch_num);
|
||||
if (OMPI_SUCCESS != ret) {
|
||||
ML_ERROR(("Can't mca_coll_ml_schedule_init_scratch.\n"));
|
||||
goto Error;
|
||||
}
|
||||
assert(NULL != scratch_indx);
|
||||
assert(NULL != scratch_num);
|
||||
|
||||
schedule = *coll_desc =
|
||||
mca_coll_ml_schedule_alloc(&h_info);
|
||||
if (NULL == schedule) {
|
||||
ML_ERROR(("Can't allocate memory.\n"));
|
||||
ret = OMPI_ERR_OUT_OF_RESOURCE;
|
||||
goto Error;
|
||||
}
|
||||
/* Setting topology information */
|
||||
schedule->topo_info = topo_info;
|
||||
|
||||
/* Set dependencies equal to number of hierarchies */
|
||||
for (i = 0; i < h_info.num_up_levels; i++) {
|
||||
int query_conf[MCA_COLL_ML_QUERY_SIZE];
|
||||
MCA_COLL_ML_SET_QUERY(query_conf, DATA_SRC_KNOWN, BLOCKING, BCOL_GATHER, bcol_func_index, 0, 0);
|
||||
comp_fn = &schedule->component_functions[i];
|
||||
MCA_COLL_ML_SET_COMP_FN(comp_fn, i, topo_info,
|
||||
i, scratch_indx, scratch_num, query_conf, "GATHER_DATA");
|
||||
}
|
||||
|
||||
nfn = i;
|
||||
if (h_info.call_for_top_function) {
|
||||
int query_conf[MCA_COLL_ML_QUERY_SIZE];
|
||||
MCA_COLL_ML_SET_QUERY(query_conf, DATA_SRC_KNOWN, NON_BLOCKING, BCOL_ALLGATHER, bcol_func_index, 0, 0);
|
||||
comp_fn = &schedule->component_functions[nfn];
|
||||
MCA_COLL_ML_SET_COMP_FN(comp_fn, nfn, topo_info,
|
||||
nfn, scratch_indx, scratch_num, query_conf, "ALLGATHER_DATA");
|
||||
++nfn;
|
||||
}
|
||||
|
||||
/* coming down the hierarchy */
|
||||
for (i = h_info.num_up_levels - 1; i >= 0; i--, nfn++) {
|
||||
int query_conf[MCA_COLL_ML_QUERY_SIZE];
|
||||
MCA_COLL_ML_SET_QUERY(query_conf, DATA_SRC_KNOWN, NON_BLOCKING, BCOL_BCAST, bcol_func_index, 0, 0);
|
||||
comp_fn = &schedule->component_functions[nfn];
|
||||
MCA_COLL_ML_SET_COMP_FN(comp_fn, i, topo_info,
|
||||
nfn, scratch_indx, scratch_num, query_conf, "BCAST_DATA");
|
||||
}
|
||||
|
||||
/* Fill the rest of constant data */
|
||||
mca_coll_ml_call_types(&h_info, schedule);
|
||||
|
||||
MCA_COLL_ML_SET_SCHEDULE_ORDER_INFO(schedule);
|
||||
|
||||
free(scratch_num);
|
||||
free(scratch_indx);
|
||||
|
||||
return OMPI_SUCCESS;
|
||||
|
||||
Error:
|
||||
if (NULL != scratch_indx) {
|
||||
free(scratch_indx);
|
||||
}
|
||||
if (NULL != scratch_num) {
|
||||
free(scratch_num);
|
||||
}
|
||||
if (NULL != schedule->component_functions) {
|
||||
free(schedule->component_functions);
|
||||
}
|
||||
return ret;
|
||||
}
|
||||
|
||||
int ml_coll_hier_allgather_setup(mca_coll_ml_module_t *ml_module)
|
||||
{
|
||||
/* Hierarchy Setup */
|
||||
int ret, topo_index, alg;
|
||||
mca_coll_ml_topology_t *topo_info = ml_module->topo_list;
|
||||
|
||||
ML_VERBOSE(10,("entering allgather setup\n"));
|
||||
|
||||
#if 0
|
||||
/* used to validate the recursive k - ing allgather tree */
|
||||
{
|
||||
/* debug print */
|
||||
int ii, jj;
|
||||
netpatterns_k_exchange_node_t exchange_node;
|
||||
|
||||
ret = netpatterns_setup_recursive_knomial_allgather_tree_node(8, 3, 3, &exchange_node);
|
||||
fprintf(stderr,"log tree order %d tree_order %d\n", exchange_node.log_tree_order,exchange_node.tree_order);
|
||||
if( EXCHANGE_NODE == exchange_node.node_type){
|
||||
if( exchange_node.n_extra_sources > 0){
|
||||
fprintf(stderr,"Receiving data from extra rank %d\n",exchange_node.rank_extra_sources_array[0]);
|
||||
}
|
||||
for( ii = 0; ii < exchange_node.log_tree_order; ii++){
|
||||
for( jj = 0; jj < (exchange_node.tree_order-1); jj++) {
|
||||
if( exchange_node.rank_exchanges[ii][jj] >= 0){
|
||||
fprintf(stderr,"level %d I send %d bytes to %d from offset %d \n",ii+1,
|
||||
exchange_node.payload_info[ii][jj].s_len,
|
||||
exchange_node.rank_exchanges[ii][jj],
|
||||
exchange_node.payload_info[ii][jj].s_offset);
|
||||
fprintf(stderr,"level %d I receive %d bytes from %d at offset %d\n",ii+1,
|
||||
exchange_node.payload_info[ii][jj].r_len,
|
||||
exchange_node.rank_exchanges[ii][jj],
|
||||
exchange_node.payload_info[ii][jj].r_offset);
|
||||
}
|
||||
}
|
||||
}
|
||||
fprintf(stderr,"exchange_node.n_extra_sources %d\n",exchange_node.n_extra_sources);
|
||||
fprintf(stderr,"exchange_node.myid_reindex %d\n",exchange_node.reindex_myid);
|
||||
if( exchange_node.n_extra_sources > 0){
|
||||
fprintf(stderr,"Sending back data to extra rank %d\n",exchange_node.rank_extra_sources_array[0]);
|
||||
}
|
||||
} else {
|
||||
fprintf(stderr,"I am an extra and send to proxy %d\n",
|
||||
exchange_node.rank_extra_sources_array[0]);
|
||||
}
|
||||
}
|
||||
#endif
|
||||
|
||||
alg = mca_coll_ml_component.coll_config[ML_ALLGATHER][ML_SMALL_MSG].algorithm_id;
|
||||
topo_index = ml_module->collectives_topology_map[ML_ALLGATHER][alg];
|
||||
if (ML_UNDEFINED == alg || ML_UNDEFINED == topo_index) {
|
||||
ML_ERROR(("No topology index or algorithm was defined"));
|
||||
topo_info->hierarchical_algorithms[ML_ALLGATHER] = NULL;
|
||||
return OMPI_ERROR;
|
||||
}
|
||||
|
||||
ret = mca_coll_ml_build_allgather_schedule(&ml_module->topo_list[topo_index],
|
||||
&ml_module->coll_ml_allgather_functions[alg],
|
||||
SMALL_MSG_RANGE);
|
||||
if (OPAL_UNLIKELY(OMPI_SUCCESS != ret)) {
|
||||
ML_VERBOSE(10, ("Failed to setup static alltoall"));
|
||||
return ret;
|
||||
}
|
||||
|
||||
alg = mca_coll_ml_component.coll_config[ML_ALLGATHER][ML_LARGE_MSG].algorithm_id;
|
||||
topo_index = ml_module->collectives_topology_map[ML_ALLGATHER][alg];
|
||||
if (ML_UNDEFINED == alg || ML_UNDEFINED == topo_index) {
|
||||
ML_ERROR(("No topology index or algorithm was defined"));
|
||||
topo_info->hierarchical_algorithms[ML_ALLGATHER] = NULL;
|
||||
return OMPI_ERROR;
|
||||
}
|
||||
|
||||
ret = mca_coll_ml_build_allgather_schedule(&ml_module->topo_list[topo_index],
|
||||
&ml_module->coll_ml_allgather_functions[alg],
|
||||
LARGE_MSG_RANGE);
|
||||
if (OPAL_UNLIKELY(OMPI_SUCCESS != ret)) {
|
||||
ML_VERBOSE(10, ("Failed to setup static alltoall"));
|
||||
return ret;
|
||||
}
|
||||
|
||||
return OMPI_SUCCESS;
|
||||
}
|
347
ompi/mca/coll/ml/coll_ml_hier_algorithms_allreduce_setup.c
Обычный файл
347
ompi/mca/coll/ml/coll_ml_hier_algorithms_allreduce_setup.c
Обычный файл
@ -0,0 +1,347 @@
/*
 * Copyright (c) 2009-2012 Oak Ridge National Laboratory.  All rights reserved.
 * Copyright (c) 2009-2012 Mellanox Technologies.  All rights reserved.
 * $COPYRIGHT$
 *
 * Additional copyrights may follow
 *
 * $HEADER$
 */

#include "ompi_config.h"

#include "ompi/mca/coll/ml/coll_ml.h"
#include "ompi/include/ompi/constants.h"
#include "ompi/mca/coll/ml/coll_ml_functions.h"

#define ALLREDUCE_SMALL 1
#define ALLREDUCE_LARGE 5
#define SMALL_MSG_RANGE 1
#define LARGE_MSG_RANGE 5

static int mca_coll_ml_build_allreduce_schedule(
    mca_coll_ml_topology_t *topo_info,
    mca_coll_ml_collective_operation_description_t **coll_desc, int bcol_func_index)
{
    bool call_for_top_function, prev_is_zero;
    int n_hiers = topo_info->n_levels;
    int i_hier, j_hier;
    int cnt, value_to_set = 0;
    int ret; /* exit code in case of error */
    int nfn = 0;
    int *scratch_indx = NULL,
        *scratch_num = NULL;
    int global_high_hierarchy_index =
        topo_info->global_highest_hier_group_index;

    mca_coll_ml_collective_operation_description_t *schedule;
    mca_coll_ml_compound_functions_t *comp_fn;
    mca_bcol_base_module_t *prev_bcol,
                           *bcol_module;
    int num_up_levels, nbcol_functions, i;

    if (global_high_hierarchy_index ==
          topo_info->component_pairs[n_hiers - 1].bcol_index) {
        /* A process that is a member of the highest-level subgroup
           calls the top algorithm in addition to the fan-in/out steps */
        call_for_top_function = true;
        /* the top level runs only the top algorithm, so deduct 1 */
        num_up_levels = n_hiers - 1;
        /* the top algorithm is called only once, so deduct 1 */
        nbcol_functions = 2 * n_hiers - 1;
    } else {
        /* The process is not a member of the highest-level subgroup;
           as a result it does not call the top algorithm,
           but it calls all fan-in/out steps */
        call_for_top_function = false;
        num_up_levels = n_hiers;
        nbcol_functions = 2 * n_hiers;
    }

    *coll_desc = (mca_coll_ml_collective_operation_description_t *)
                 malloc(sizeof(mca_coll_ml_collective_operation_description_t));
    schedule = *coll_desc;
    if (NULL == schedule) {
        ML_ERROR(("Can't allocate memory.\n"));
        ret = OMPI_ERR_OUT_OF_RESOURCE;
        goto Allreduce_Setup_Error;
    }

    scratch_indx = (int *) malloc(sizeof(int) * (n_hiers * 2));
    if (NULL == scratch_indx) {
        ML_ERROR(("Can't allocate memory.\n"));
        ret = OMPI_ERR_OUT_OF_RESOURCE;
        goto Allreduce_Setup_Error;
    }

    scratch_num = (int *) malloc(sizeof(int) * (n_hiers * 2));
    if (NULL == scratch_num) {
        ML_ERROR(("Can't allocate memory.\n"));
        ret = OMPI_ERR_OUT_OF_RESOURCE;
        goto Allreduce_Setup_Error;
    }

    prev_bcol = NULL;

    /* going up */
    for (i = 0, cnt = 0; i < num_up_levels; ++i, ++cnt) {
        if (IS_BCOL_TYPE_IDENTICAL(prev_bcol, GET_BCOL(topo_info, i))) {
            scratch_indx[cnt] = scratch_indx[cnt - 1] + 1;
        } else {
            scratch_indx[cnt] = 0;
            prev_bcol = GET_BCOL(topo_info, i);
        }
    }

    /* top - only if this process made it to the globally highest level of the hierarchy */
    if (call_for_top_function) {
        if (IS_BCOL_TYPE_IDENTICAL(prev_bcol, GET_BCOL(topo_info, n_hiers - 1))) {
            scratch_indx[cnt] = scratch_indx[cnt - 1] + 1;
        } else {
            scratch_indx[cnt] = 0;
            prev_bcol = GET_BCOL(topo_info, n_hiers - 1);
        }

        ++cnt;
    }

    /* going down */
    for (i = num_up_levels - 1; i >= 0; --i, ++cnt) {
        if (IS_BCOL_TYPE_IDENTICAL(prev_bcol, GET_BCOL(topo_info, i))) {
            scratch_indx[cnt] = scratch_indx[cnt - 1] + 1;
        } else {
            scratch_indx[cnt] = 0;
            prev_bcol = GET_BCOL(topo_info, i);
        }
    }

    i = cnt - 1;
    prev_is_zero = true;

    do {
        if (prev_is_zero) {
            value_to_set = scratch_indx[i] + 1;
            prev_is_zero = false;
        }

        if (0 == scratch_indx[i]) {
            prev_is_zero = true;
        }

        scratch_num[i] = value_to_set;
        --i;
    } while (i >= 0);
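
    /* Worked example of the two scratch arrays (hypothetical three-level
     * topology whose levels use bcol types A, A, B, so the up/top/down
     * schedule visits A, A, B, A, A):
     *
     *   step k       : 0  1  2  3  4
     *   bcol type    : A  A  B  A  A
     *   scratch_indx : 0  1  0  0  1   (position within the current run)
     *   scratch_num  : 2  2  1  2  2   (length of that run, filled backward)
     *
     * The backward do-while visits the last element of each run first; its
     * scratch_indx + 1 is the run length, which is copied to every earlier
     * member until a run boundary (scratch_indx == 0) is crossed.
     */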

    /* Set dependencies equal to number of hierarchies */
    schedule->n_fns = nbcol_functions;
    schedule->topo_info = topo_info;
    schedule->progress_type = 0;

    /* Allocate the component functions */
    schedule->component_functions = (struct mca_coll_ml_compound_functions_t *)
        calloc(nbcol_functions, sizeof(struct mca_coll_ml_compound_functions_t));

    if (NULL == schedule->component_functions) {
        ML_ERROR(("Can't allocate memory.\n"));
        ret = OMPI_ERR_OUT_OF_RESOURCE;
        goto Allreduce_Setup_Error;
    }

    /* up phase: reduce at each level below the top */
    for (i = 0; i < num_up_levels; i++) {
        comp_fn = &schedule->component_functions[i];
        comp_fn->h_level = i; /* hierarchy level */
        bcol_module = GET_BCOL(topo_info, i);

        /* strcpy (comp_fn->fn_name, "ALLREDUCE_SMALL_DATA"); */

        comp_fn->num_dependent_tasks = 0;
        comp_fn->num_dependencies = 0;

        comp_fn->bcol_function =
            bcol_module->filtered_fns_table[DATA_SRC_KNOWN][NON_BLOCKING][BCOL_REDUCE][bcol_func_index][0][0];

        comp_fn->task_comp_fn = NULL;

        comp_fn->constant_group_data.bcol_module = bcol_module;
        comp_fn->constant_group_data.index_in_consecutive_same_bcol_calls = scratch_indx[i];
        comp_fn->constant_group_data.n_of_this_type_in_a_row = scratch_num[i];
        comp_fn->constant_group_data.n_of_this_type_in_collective = 0;
        comp_fn->constant_group_data.index_of_this_type_in_collective = 0;
    }

    nfn = i;
    /* top phase: allreduce inside the highest-level subgroup */
    if (call_for_top_function) {
        comp_fn = &schedule->component_functions[nfn];
        comp_fn->h_level = nfn; /* hierarchy level */
        bcol_module = GET_BCOL(topo_info, nfn);

        /* strcpy (comp_fn->fn_name, "ALLREDUCE_SMALL_DATA"); */

        /* The allreduce should depend on the reduce */
        comp_fn->num_dependent_tasks = 0;
        comp_fn->num_dependencies = 0;
        comp_fn->bcol_function =
            bcol_module->filtered_fns_table[DATA_SRC_KNOWN][NON_BLOCKING][BCOL_ALLREDUCE][bcol_func_index][0][0];

        comp_fn->task_comp_fn = NULL;

        comp_fn->constant_group_data.bcol_module = bcol_module;
        comp_fn->constant_group_data.index_in_consecutive_same_bcol_calls = scratch_indx[nfn];
        comp_fn->constant_group_data.n_of_this_type_in_a_row = scratch_num[nfn];
        comp_fn->constant_group_data.n_of_this_type_in_collective = 0;
        comp_fn->constant_group_data.index_of_this_type_in_collective = 0;

        ++nfn;
    }

    /* down phase: broadcast the result back through the lower levels */
    for (i = num_up_levels - 1; i >= 0; i--) {
        comp_fn = &schedule->component_functions[nfn];
        comp_fn->h_level = i; /* hierarchy level */
        bcol_module = GET_BCOL(topo_info, i);

        /* strcpy (comp_fn->fn_name, "ALLREDUCE_SMALL_DATA"); */

        comp_fn->num_dependent_tasks = 0;
        comp_fn->num_dependencies = 0;

        comp_fn->bcol_function =
            bcol_module->filtered_fns_table[DATA_SRC_KNOWN][NON_BLOCKING][BCOL_BCAST][bcol_func_index][0][0];

        comp_fn->task_comp_fn = NULL;

        comp_fn->constant_group_data.bcol_module = bcol_module;
        comp_fn->constant_group_data.index_in_consecutive_same_bcol_calls = scratch_indx[nfn];
        comp_fn->constant_group_data.n_of_this_type_in_a_row = scratch_num[nfn];
        comp_fn->constant_group_data.n_of_this_type_in_collective = 0;
        comp_fn->constant_group_data.index_of_this_type_in_collective = 0;

        ++nfn;
    }

    /* Fill in the rest of the constant data */
    for (i_hier = 0; i_hier < n_hiers; i_hier++) {
        mca_bcol_base_module_t *current_bcol =
            schedule->component_functions[i_hier].
            constant_group_data.bcol_module;
        cnt = 0;
        for (j_hier = 0; j_hier < n_hiers; j_hier++) {
            if (current_bcol ==
                  schedule->component_functions[j_hier].
                  constant_group_data.bcol_module) {
                schedule->component_functions[j_hier].
                    constant_group_data.index_of_this_type_in_collective = cnt;
                cnt++;
            }
        }

        schedule->component_functions[i_hier].
            constant_group_data.n_of_this_type_in_collective = cnt;
    }

    MCA_COLL_ML_SET_SCHEDULE_ORDER_INFO(schedule);

    free(scratch_num);
    free(scratch_indx);

    return OMPI_SUCCESS;

Allreduce_Setup_Error:

    if (NULL != scratch_indx) {
        free(scratch_indx);
    }

    if (NULL != scratch_num) {
        free(scratch_num);
    }

    /* schedule is still NULL if the first allocation failed */
    if (NULL != schedule && NULL != schedule->component_functions) {
        free(schedule->component_functions);
    }

    return ret;
}
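
/* Shape of the schedule built above, for n_hiers = 3 and a process that
 * is a member of the top-level subgroup (num_up_levels = 2,
 * nbcol_functions = 5):
 *
 *   fn 0: BCOL_REDUCE    at level 0   (fan-in)
 *   fn 1: BCOL_REDUCE    at level 1   (fan-in)
 *   fn 2: BCOL_ALLREDUCE at level 2   (top subgroup only)
 *   fn 3: BCOL_BCAST     at level 1   (fan-out)
 *   fn 4: BCOL_BCAST     at level 0   (fan-out)
 *
 * A process outside the top subgroup runs 2 * n_hiers functions instead:
 * a reduce at every level going up and a bcast at every level coming down.
 */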

int ml_coll_hier_allreduce_setup_new(mca_coll_ml_module_t *ml_module)
{
    /* Hierarchy Setup */
    int ret;
    int topo_index;
    int alg;
    mca_coll_ml_topology_t *topo_info = ml_module->topo_list;

    alg = mca_coll_ml_component.coll_config[ML_ALLREDUCE][ML_SMALL_MSG].algorithm_id;
    topo_index = ml_module->collectives_topology_map[ML_ALLREDUCE][alg];
    if (ML_UNDEFINED == alg || ML_UNDEFINED == topo_index) {
        ML_ERROR(("No topology index or algorithm was defined"));
        topo_info->hierarchical_algorithms[ML_ALLREDUCE] = NULL;
        return OMPI_ERROR;
    }

    ret = mca_coll_ml_build_allreduce_schedule(
        &ml_module->topo_list[topo_index],
        &ml_module->coll_ml_allreduce_functions[alg],
        SMALL_MSG_RANGE);

    if (OPAL_UNLIKELY(OMPI_SUCCESS != ret)) {
        ML_VERBOSE(10, ("Failed to setup Small Message Allreduce"));
        return ret;
    }

    alg = mca_coll_ml_component.coll_config[ML_ALLREDUCE][ML_LARGE_MSG].algorithm_id;
    topo_index = ml_module->collectives_topology_map[ML_ALLREDUCE][alg];
    if (ML_UNDEFINED == alg || ML_UNDEFINED == topo_index) {
        ML_ERROR(("No topology index or algorithm was defined"));
        topo_info->hierarchical_algorithms[ML_ALLREDUCE] = NULL;
        return OMPI_ERROR;
    }

    ret = mca_coll_ml_build_allreduce_schedule(
        &ml_module->topo_list[topo_index],
        &ml_module->coll_ml_allreduce_functions[alg],
        LARGE_MSG_RANGE);

    if (OPAL_UNLIKELY(OMPI_SUCCESS != ret)) {
        ML_VERBOSE(10, ("Failed to setup Large Message Allreduce"));
        return ret;
    }

    if (true == mca_coll_ml_component.need_allreduce_support) {
        topo_index = ml_module->collectives_topology_map[ML_ALLREDUCE][ML_SMALL_DATA_EXTRA_TOPO_ALLREDUCE];
        if (ML_UNDEFINED == topo_index) {
            ML_ERROR(("No topology index was defined"));
            topo_info->hierarchical_algorithms[ML_ALLREDUCE] = NULL;
            return OMPI_ERROR;
        }

        ret = mca_coll_ml_build_allreduce_schedule(
            &ml_module->topo_list[topo_index],
            &ml_module->coll_ml_allreduce_functions[ML_SMALL_DATA_EXTRA_TOPO_ALLREDUCE],
            SMALL_MSG_RANGE);

        if (OPAL_UNLIKELY(OMPI_SUCCESS != ret)) {
            ML_VERBOSE(10, ("Failed to setup Extra Small Message Allreduce"));
            return ret;
        }

        topo_index = ml_module->collectives_topology_map[ML_ALLREDUCE][ML_LARGE_DATA_EXTRA_TOPO_ALLREDUCE];
        if (ML_UNDEFINED == topo_index) {
            ML_ERROR(("No topology index was defined"));
            topo_info->hierarchical_algorithms[ML_ALLREDUCE] = NULL;
            return OMPI_ERROR;
        }

        ret = mca_coll_ml_build_allreduce_schedule(
            &ml_module->topo_list[topo_index],
            &ml_module->coll_ml_allreduce_functions[ML_LARGE_DATA_EXTRA_TOPO_ALLREDUCE],
            LARGE_MSG_RANGE);

        if (OPAL_UNLIKELY(OMPI_SUCCESS != ret)) {
            ML_VERBOSE(10, ("Failed to setup Extra Large Message Allreduce"));
            return ret;
        }
    }

    return OMPI_SUCCESS;
}
@@ -17,7 +17,9 @@

 static int mca_coll_ml_build_barrier_schedule(
                                mca_coll_ml_topology_t *topo_info,
-                               mca_coll_ml_collective_operation_description_t **coll_desc)
+                               mca_coll_ml_collective_operation_description_t **coll_desc,
+                               mca_coll_ml_module_t *ml_module)
 {
     int i_hier, rc, i_fn, n_fcns, i,
         n_hiers = topo_info->n_levels;
@@ -52,6 +54,10 @@ static int mca_coll_ml_build_barrier_schedule(
         n_fcns = 2 * n_hiers;
     }

+    if (ml_module->max_fn_calls < n_fcns) {
+        ml_module->max_fn_calls = n_fcns;
+    }
+
     /* Set dependencies equal to number of hierarchies */
     schedule->n_fns = n_fcns;
     schedule->topo_info = topo_info;
@@ -170,7 +176,7 @@ int ml_coll_hier_barrier_setup(mca_coll_ml_module_t *ml_module)
         &ml_module->topo_list[ml_module->collectives_topology_map[ML_BARRIER][ML_SMALL_MSG]];

     rc = mca_coll_ml_build_barrier_schedule(topo_info,
-                                            &ml_module->coll_ml_barrier_function);
+                                            &ml_module->coll_ml_barrier_function, ml_module);
     if (OPAL_UNLIKELY(OMPI_SUCCESS != rc)) {
         /* Make sure to reset the barrier pointer to NULL */
         topo_info->hierarchical_algorithms[BCOL_BARRIER] = NULL;
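
The max_fn_calls update added here matches the bookkeeping in the new ibarrier setup below: the module tracks the length of its longest schedule so per-operation argument storage can be sized once. A one-line sketch of the consumer side, with the element type borrowed from coll_ml_ibarrier.c and the allocation site itself hypothetical:

    /* Hypothetical sizing site: one fn_args slot per function of the longest
     * schedule (cf. msg_desc->fragment.fn_args[] in coll_ml_ibarrier.c). */
    size_t fn_args_bytes = (size_t) ml_module->max_fn_calls * sizeof(bcol_fn_args_t);
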
@@ -409,12 +409,11 @@ static int mca_coll_ml_build_bcast_sequential_schedule_no_attributes(
         /*
         if(comp_fn->coll_fn_started){
             fprintf(stderr,"this statement is true\n");
-        } else{
+        } else {
             fprintf(stderr,"done setting to false \n");
         }
         */

-
         comp_fn->task_comp_fn = mca_coll_ml_task_comp_dynamic_root_small_message;
         /* assert(NULL != comp_fn->bcol_function); */
         /* Constants */
ompi/mca/coll/ml/coll_ml_hier_algorithms_ibarrier.c (new file, 148 lines)
@@ -0,0 +1,148 @@
/*
 * Copyright (c) 2009-2012 Oak Ridge National Laboratory.  All rights reserved.
 * Copyright (c) 2009-2012 Mellanox Technologies.  All rights reserved.
 * $COPYRIGHT$
 *
 * Additional copyrights may follow
 *
 * $HEADER$
 */

#include "ompi/include/ompi/constants.h"
#include "ompi/mca/coll/ml/coll_ml.h"

int ml_coll_hier_nonblocking_barrier_setup(mca_coll_ml_module_t *ml_module,
                                           mca_coll_ml_topology_t *topo_info)
{
    /* local variables */
    int ret = OMPI_SUCCESS;
    /* int global_low_hierarchy_index = topo_info->global_lowest_hier_group_index; */
    int global_high_hierarchy_index = topo_info->global_highest_hier_group_index;
    int my_n_hierarchies = topo_info->n_levels;
    int hier_loop_range, i, cnt;
    mca_bcol_base_module_t *bcol_module;
    /* collective algorithms */
    coll_ml_collective_description_t **hierarchical_algorithms = topo_info->hierarchical_algorithms;

    /* RLG: one non-blocking barrier collective algorithm - this is really a hack,
     * we need to figure out how to do this in a bit more extensible
     * manner.
     */
    hierarchical_algorithms[BCOL_IBARRIER] = (coll_ml_collective_description_t *)
        malloc(sizeof(coll_ml_collective_description_t));
    if (NULL == hierarchical_algorithms[BCOL_IBARRIER]) {
        ret = OMPI_ERROR;
        goto Error;
    }

    /* am I a member of the highest-level subgroup? */
    if (global_high_hierarchy_index ==
            topo_info->component_pairs[my_n_hierarchies - 1].bcol_index) {
        hier_loop_range = my_n_hierarchies - 1;
        /* allocate the function description */
        hierarchical_algorithms[BCOL_IBARRIER][0].n_functions = 2 * my_n_hierarchies - 1;
        /* collective description */
        hierarchical_algorithms[BCOL_IBARRIER][0].alg_params.coll_fn.
            ibarrier_recursive_doubling.n_fanin_steps = my_n_hierarchies - 1;
        hierarchical_algorithms[BCOL_IBARRIER][0].alg_params.coll_fn.
            ibarrier_recursive_doubling.n_fanout_steps = my_n_hierarchies - 1;
        hierarchical_algorithms[BCOL_IBARRIER][0].alg_params.coll_fn.
            ibarrier_recursive_doubling.n_recursive_doubling_steps = 1;
    } else {
        hier_loop_range = my_n_hierarchies;
        /* allocate the function description */
        hierarchical_algorithms[BCOL_IBARRIER][0].n_functions = 2 * my_n_hierarchies;
        /* collective description */
        hierarchical_algorithms[BCOL_IBARRIER][0].alg_params.coll_fn.
            ibarrier_recursive_doubling.n_fanin_steps = my_n_hierarchies;
        hierarchical_algorithms[BCOL_IBARRIER][0].alg_params.coll_fn.
            ibarrier_recursive_doubling.n_fanout_steps = my_n_hierarchies;
        hierarchical_algorithms[BCOL_IBARRIER][0].alg_params.coll_fn.
            ibarrier_recursive_doubling.n_recursive_doubling_steps = 0;
    }

    /* number of temp buffers */
    hierarchical_algorithms[BCOL_IBARRIER]->n_buffers = 0;

    /* allocate space for the functions */
    hierarchical_algorithms[BCOL_IBARRIER][0].functions = (coll_ml_function_t *)
        malloc(sizeof(coll_ml_function_t) *
               hierarchical_algorithms[BCOL_IBARRIER][0].n_functions);
    if (NULL == hierarchical_algorithms[BCOL_IBARRIER][0].functions) {
        ret = OMPI_ERROR;
        goto Error;
    }

    /*
     * The algorithm used here for an N level system
     *  - up to level N-2, inclusive: fan in
     *  - level N-1: barrier
     *  - level N-2 down to level 0: fan out
     */

    cnt = 0;
    /* fan-in phase */
    for (i = 0; i < hier_loop_range; i++) {
        bcol_module = topo_info->component_pairs[i].bcol_modules[0];
        hierarchical_algorithms[BCOL_IBARRIER][0].functions[cnt].fn_idx = BCOL_IFANIN;
        /* RLG NOTE - is this a bug? We do not set the bcol module pointer in the functions array ... */
        cnt++;
    }

    /* barrier */
    if (hier_loop_range != my_n_hierarchies) {
        bcol_module = topo_info->component_pairs[my_n_hierarchies - 1].bcol_modules[0];
        hierarchical_algorithms[BCOL_IBARRIER][0].functions[cnt].fn_idx = BCOL_IBARRIER;
        hierarchical_algorithms[BCOL_IBARRIER][0].functions[cnt].bcol_module = bcol_module;
        cnt++;
    }

    /* fan-out */
    for (i = hier_loop_range - 1; i >= 0; i--) {
        bcol_module = topo_info->component_pairs[i].bcol_modules[0];
        hierarchical_algorithms[BCOL_IBARRIER][0].functions[cnt].fn_idx = BCOL_IFANOUT;
        cnt++;
    }

    /* set the number of function pointers needed by this routine - this is
     * used to figure out how large to allocate the function argument
     * array.
     */
    if (ml_module->max_fn_calls < cnt) {
        ml_module->max_fn_calls = cnt;
    }

    /* set algorithm parameters */
    hierarchical_algorithms[BCOL_IBARRIER][0].
        alg_params.coll_fn.ibarrier_recursive_doubling.n_fanin_steps = hier_loop_range;
    hierarchical_algorithms[BCOL_IBARRIER][0].
        alg_params.coll_fn.ibarrier_recursive_doubling.n_fanout_steps = hier_loop_range;
    if (hier_loop_range != my_n_hierarchies) {
        hierarchical_algorithms[BCOL_IBARRIER][0].
            alg_params.coll_fn.ibarrier_recursive_doubling.n_recursive_doubling_steps = 1;
    } else {
        hierarchical_algorithms[BCOL_IBARRIER][0].
            alg_params.coll_fn.ibarrier_recursive_doubling.n_recursive_doubling_steps = 0;
    }

    /* done */
    return ret;

Error:

    /* the outer descriptor may not have been allocated at all */
    if (NULL != hierarchical_algorithms[BCOL_IBARRIER]) {
        if (NULL != hierarchical_algorithms[BCOL_IBARRIER][0].functions) {
            free(hierarchical_algorithms[BCOL_IBARRIER][0].functions);
            hierarchical_algorithms[BCOL_IBARRIER][0].functions = NULL;
        }

        free(hierarchical_algorithms[BCOL_IBARRIER]);
        hierarchical_algorithms[BCOL_IBARRIER] = NULL;
    }

    return ret;
}
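
/* Plugging numbers into the setup above for my_n_hierarchies = 3:
 *
 *   member of the top subgroup:   hier_loop_range = 2, n_functions = 5
 *       IFANIN(0), IFANIN(1), IBARRIER(2), IFANOUT(1), IFANOUT(0)
 *   not a member of it:           hier_loop_range = 3, n_functions = 6
 *       IFANIN(0..2), IFANOUT(2..0), no recursive-doubling step
 */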
ompi/mca/coll/ml/coll_ml_hier_algorithms_reduce_setup.c (new file, 308 lines)
@@ -0,0 +1,308 @@
/*
 * Copyright (c) 2009-2012 Oak Ridge National Laboratory.  All rights reserved.
 * Copyright (c) 2009-2012 Mellanox Technologies.  All rights reserved.
 * $COPYRIGHT$
 *
 * Additional copyrights may follow
 *
 * $HEADER$
 */

#include "ompi_config.h"

#include "ompi/mca/coll/ml/coll_ml.h"
#include "ompi/include/ompi/constants.h"
#include "ompi/mca/coll/ml/coll_ml_functions.h"

static int mca_coll_ml_task_comp_static_reduce(struct mca_coll_ml_task_status_t *task)
{
    task->ml_coll_operation->variable_fn_params.root_flag = true;

    return OMPI_SUCCESS;
}

static void mca_coll_ml_static_reduce_non_root(mca_coll_ml_task_status_t *task_status, int index,
                                               mca_coll_ml_compound_functions_t *func)
{
    /* I am not a root rank, but someone in my group is a root */
    if (task_status->ml_coll_operation->variable_fn_params.root_route->level == index) {
        task_status->rt_num_dependencies = func->num_dependencies;
        task_status->rt_num_dependent_tasks = 0;
        task_status->rt_dependent_task_indecies = NULL;
        task_status->ml_coll_operation->variable_fn_params.root =
            task_status->ml_coll_operation->variable_fn_params.root_route->rank;
    } else {
        task_status->rt_num_dependencies = 0;
        task_status->rt_num_dependent_tasks = 1;
        task_status->rt_dependent_task_indecies = task_status->ml_coll_operation->variable_fn_params.root_route->level;
    }
}

static void mca_coll_ml_static_reduce_root(mca_coll_ml_task_status_t *task_status, int index,
                                           mca_coll_ml_compound_functions_t *func)
{
    task_status->rt_num_dependencies = func->num_dependencies;
    task_status->rt_num_dependent_tasks = 0;
    task_status->rt_dependent_task_indecies = NULL;
}

/*
 * Fill up the collective descriptor
 */
static int mca_coll_ml_build_static_reduce_schedule(
    mca_coll_ml_topology_t *topo_info,
    mca_coll_ml_collective_operation_description_t **coll_desc)
{
    int i_hier, j_hier, n_fcns,
        n_hiers = topo_info->n_levels;
    int *scratch_indx = NULL,
        *scratch_num = NULL;
    int cnt, value_to_set = 0;
    int ret = OMPI_SUCCESS;
    bool prev_is_zero;
    mca_coll_ml_compound_functions_t *comp_fns_temp;
    mca_bcol_base_module_t *prev_bcol,
                           *bcol_module;
    mca_coll_ml_compound_functions_t *comp_fn;
    mca_coll_ml_collective_operation_description_t *schedule = NULL;

    *coll_desc = (mca_coll_ml_collective_operation_description_t *)
        malloc(sizeof(mca_coll_ml_collective_operation_description_t));

    schedule = *coll_desc;
    if (OPAL_UNLIKELY(NULL == schedule)) {
        ML_ERROR(("Can't allocate memory.\n"));
        ret = OMPI_ERR_OUT_OF_RESOURCE;
        goto Error;
    }

    scratch_indx = (int *) malloc(sizeof(int) * (n_hiers));
    if (NULL == scratch_indx) {
        ML_ERROR(("Can't allocate memory.\n"));
        ret = OMPI_ERR_OUT_OF_RESOURCE;
        goto Error;
    }

    scratch_num = (int *) malloc(sizeof(int) * (n_hiers));
    if (NULL == scratch_num) {
        ML_ERROR(("Can't allocate memory.\n"));
        ret = OMPI_ERR_OUT_OF_RESOURCE;
        goto Error;
    }

    prev_bcol = NULL;

    /* Calculate scratch numbers */
    for (i_hier = 0; i_hier < n_hiers; i_hier++) {
        if (IS_BCOL_TYPE_IDENTICAL(prev_bcol, GET_BCOL(topo_info, i_hier))) {
            scratch_indx[i_hier] = scratch_indx[i_hier - 1] + 1;
        } else {
            scratch_indx[i_hier] = 0;
            prev_bcol = GET_BCOL(topo_info, i_hier);
        }
    }

    --i_hier;
    prev_is_zero = true;

    do {
        if (prev_is_zero) {
            value_to_set = scratch_indx[i_hier] + 1;
            prev_is_zero = false;
        }

        if (0 == scratch_indx[i_hier]) {
            prev_is_zero = true;
        }

        scratch_num[i_hier] = value_to_set;
        --i_hier;
    } while (i_hier >= 0);

    /* All hierarchies call one function, unlike other collectives */
    n_fcns = n_hiers;

    /* Set dependencies equal to number of hierarchies */
    schedule->n_fns = n_fcns;
    schedule->topo_info = topo_info;
    schedule->progress_type = 0;
    /* Allocate the component functions */
    schedule->component_functions = (struct mca_coll_ml_compound_functions_t *)
        calloc(n_fcns, sizeof(struct mca_coll_ml_compound_functions_t));

    if (OPAL_UNLIKELY(NULL == schedule->component_functions)) {
        ML_ERROR(("Can't allocate memory.\n"));
        ret = OMPI_ERR_OUT_OF_RESOURCE;
        goto Error;
    }

    for (i_hier = 0; i_hier < n_hiers; ++i_hier) {
        comp_fn = &schedule->component_functions[i_hier];

        /* The hierarchical level */
        comp_fn->h_level = i_hier;
        bcol_module = GET_BCOL(topo_info, i_hier);

        comp_fn->bcol_function =
            bcol_module->filtered_fns_table[DATA_SRC_KNOWN][NON_BLOCKING][BCOL_REDUCE][1][0][0];

        strcpy(comp_fn->fn_name, "REDUCE");
        ML_VERBOSE(10, ("func indx %d set to %p", i_hier, comp_fn->bcol_function));

        ML_VERBOSE(1, ("In ML_REDUCE_SETUP .. looks fine here"));
        /* completion function for the static reduce task */
        comp_fn->task_comp_fn = mca_coll_ml_task_comp_static_reduce;

        /* Constants */
        comp_fn->constant_group_data.bcol_module = bcol_module;
        comp_fn->constant_group_data.index_in_consecutive_same_bcol_calls = scratch_indx[i_hier];
        comp_fn->constant_group_data.n_of_this_type_in_a_row = scratch_num[i_hier];
        comp_fn->constant_group_data.n_of_this_type_in_collective = 0;
        comp_fn->constant_group_data.index_of_this_type_in_collective = 0;

        ML_VERBOSE(10, ("Setting collective [reduce] fn_idx %d, n_of_this_type_in_a_row %d, "
                        "index_in_consecutive_same_bcol_calls %d.",
                        i_hier, comp_fn->constant_group_data.n_of_this_type_in_a_row,
                        comp_fn->constant_group_data.index_in_consecutive_same_bcol_calls));
    }

    /* Fill in the rest of the constant data */
    for (i_hier = 0; i_hier < n_hiers; i_hier++) {
        mca_bcol_base_module_t *current_bcol =
            schedule->component_functions[i_hier].
            constant_group_data.bcol_module;
        cnt = 0;
        for (j_hier = 0; j_hier < n_hiers; j_hier++) {
            if (current_bcol ==
                  schedule->component_functions[j_hier].
                  constant_group_data.bcol_module) {
                schedule->component_functions[j_hier].
                    constant_group_data.index_of_this_type_in_collective = cnt;
                cnt++;
            }
        }
        schedule->component_functions[i_hier].
            constant_group_data.n_of_this_type_in_collective = cnt;
    }

    /* Manju: Reduction should always use the fixed schedule.
     * The subgroups where this process is a leader should be executed first,
     * then the subgroups where this process is not a leader, and
     * finally the subgroup that includes the root.
     */

    /* Allocate the schedule list */
    schedule->comp_fn_arr = (struct mca_coll_ml_compound_functions_t **)
        calloc(n_hiers, sizeof(struct mca_coll_ml_compound_functions_t *));
    if (NULL == schedule->comp_fn_arr) {
        ML_ERROR(("Can't allocate memory.\n"));
        ret = OMPI_ERR_OUT_OF_RESOURCE;
        goto Error;
    }

    /* Now that the functions have been set up properly, we can simply permute the ordering a bit */

    for (i_hier = 0; i_hier < n_hiers; i_hier++) {
        /* first one is trivial */
        int leader_hierarchy = 0;
        int non_leader_hierarchy = 0;
        int func_index;

        comp_fns_temp = (struct mca_coll_ml_compound_functions_t *)
            calloc(n_hiers, sizeof(struct mca_coll_ml_compound_functions_t));

        leader_hierarchy = 0;
        non_leader_hierarchy = n_hiers - 2;

        for (j_hier = 0; j_hier < n_hiers - 1; j_hier++) {

            func_index = j_hier < i_hier ? j_hier : j_hier + 1;
            /* I'm a leader for this group */
            if (0 == topo_info->component_pairs->subgroup_module->my_index) {
                comp_fns_temp[leader_hierarchy++] =
                    schedule->component_functions[func_index];
            } else {
                comp_fns_temp[non_leader_hierarchy--] =
                    schedule->component_functions[func_index];
            }
        }

        comp_fns_temp[j_hier] = schedule->component_functions[i_hier];
        /* now let's attach this list to our array of lists */
        schedule->comp_fn_arr[i_hier] = comp_fns_temp;
    }

    /* Manju: Do we need this ? */

    /* I'm going to just loop over each schedule and
     * set up the scratch indices, scratch numbers
     * and other constant data
     */
    /*
    for (i_hier = 1; i_hier < n_hiers; i_hier++) {
        ret = mca_coll_ml_setup_scratch_vals(schedule->comp_fn_arr[i_hier], scratch_indx,
                scratch_num, n_hiers);
        if (OMPI_SUCCESS != ret) {
            ret = OMPI_ERROR;
            goto Error;
        }
    }
    */

    /* Do I need this ? */
    schedule->task_setup_fn[COLL_ML_ROOT_TASK_FN] = mca_coll_ml_static_reduce_root;
    schedule->task_setup_fn[COLL_ML_GENERAL_TASK_FN] = mca_coll_ml_static_reduce_non_root;

    MCA_COLL_ML_SET_SCHEDULE_ORDER_INFO(schedule);

    free(scratch_num);
    free(scratch_indx);

    return OMPI_SUCCESS;

Error:
    if (NULL != scratch_indx) {
        free(scratch_indx);
    }
    if (NULL != scratch_num) {
        free(scratch_num);
    }
    if (NULL != schedule && NULL != schedule->component_functions) {
        free(schedule->component_functions);
        schedule->component_functions = NULL;
    }

    return ret;
}
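
/* Worked example of the permutation above for a hypothetical n_hiers = 3
 * (entries are the h_level each slot carries):
 *
 *   comp_fn_arr[i]   leader (my_index == 0)   non-leader
 *     i = 0          { 1, 2, 0 }              { 2, 1, 0 }
 *     i = 1          { 0, 2, 1 }              { 2, 0, 1 }
 *     i = 2          { 0, 1, 2 }              { 1, 0, 2 }
 *
 * In every ordering the subgroup i that contains the root runs last;
 * leaders keep the remaining levels in ascending order, while non-leaders
 * see them reversed because their slots are filled from the far end.
 */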

int ml_coll_hier_reduce_setup(mca_coll_ml_module_t *ml_module)
{
    int alg, ret, topo_index = 0;
    mca_coll_ml_topology_t *topo_info =
        &ml_module->topo_list[ml_module->collectives_topology_map[ML_REDUCE][ML_SMALL_MSG]];

    if (ml_module->max_fn_calls < topo_info->n_levels) {
        ml_module->max_fn_calls = topo_info->n_levels;
    }

    alg = mca_coll_ml_component.coll_config[ML_REDUCE][ML_SMALL_MSG].algorithm_id;
    topo_index = ml_module->collectives_topology_map[ML_REDUCE][alg];
    if (ML_UNDEFINED == alg || ML_UNDEFINED == topo_index) {
        ML_ERROR(("No topology index or algorithm was defined"));
        topo_info->hierarchical_algorithms[ML_REDUCE] = NULL;
        return OMPI_ERROR;
    }

    ret = mca_coll_ml_build_static_reduce_schedule(&ml_module->topo_list[topo_index],
                                                   &ml_module->coll_ml_reduce_functions[alg]);
    if (OPAL_UNLIKELY(OMPI_SUCCESS != ret)) {
        ML_VERBOSE(10, ("Failed to setup static reduce"));
        return ret;
    }

    return OMPI_SUCCESS;
}
@@ -300,7 +300,10 @@ int ml_coll_hier_allreduce_setup(mca_coll_ml_module_t *ml_module)
     return ret;
 }

-
+#if 0
+/*
+ * Manju: New setup function in coll_ml_hier_algorithms_reduce_setup.c
+ */
 /* Ishai: Reduce is not an hier algorithm (it is rooted) - it needs a different ML algorithm */
 /* Need to rewrite */
 int ml_coll_hier_reduce_setup(mca_coll_ml_module_t *ml_module)
@@ -320,6 +323,7 @@ int ml_coll_hier_reduce_setup(mca_coll_ml_module_t *ml_module)
     ml_module->topo_list[topo_index].hierarchical_algorithms[BCOL_BCAST] = NULL;
     return ret;
 }
+#endif

 int ml_coll_barrier_constant_group_data_setup(
     mca_coll_ml_topology_t *topo_info,
ompi/mca/coll/ml/coll_ml_ibarrier.c (new file, 145 lines)
@@ -0,0 +1,145 @@
/*
 * Copyright (c) 2009-2012 Oak Ridge National Laboratory.  All rights reserved.
 * Copyright (c) 2009-2012 Mellanox Technologies.  All rights reserved.
 * $COPYRIGHT$
 *
 * Additional copyrights may follow
 *
 * $HEADER$
 */
/** @file */

#include "ompi_config.h"

#include "ompi/constants.h"
#include "opal/threads/mutex.h"
#include "ompi/communicator/communicator.h"
#include "ompi/mca/coll/coll.h"
#include "ompi/mca/bcol/bcol.h"
#include "opal/sys/atomic.h"
#include "ompi/mca/coll/ml/coll_ml.h"
#include "ompi/mca/coll/ml/coll_ml_hier_algorithms.h"

coll_ml_collective_description_t *collective_op;
bcol_fn_args_t fn_arguments;
mca_coll_ml_descriptor_t *msg_desc;

static int coll_ml_setup_ibarrier_instance_recursive_doubling(
    mca_coll_ml_descriptor_t *msg_desc,
    coll_ml_collective_description_t *collective_op)
{
    /* local variables */
    int ret = OMPI_SUCCESS;
    int i_fn, cnt;

    /* initialize function arguments */

    /* mark all routines as not yet started - need this, so that
     * when we try to progress the barrier, we know where to pick up
     * when a function is called - MOVE this into the setup function.
     */
    for (i_fn = 0; i_fn < collective_op->n_functions; i_fn++) {
        msg_desc->fragment.fn_args[i_fn].function_status = FUNCTION_NOT_STARTED;
    }

    /* setup the fan-in root */
    for (i_fn = 0;
         i_fn < collective_op->alg_params.coll_fn.ibarrier_recursive_doubling.n_fanin_steps;
         i_fn++) {
        mca_bcol_base_module_t *bcol_mod =
            msg_desc->local_comm_description->functions[i_fn].bcol_module;
        /* the lowest rank in the group will be the root */
        msg_desc->fragment.fn_args[i_fn].root = 0;
    }

    /* setup the fan-out root */
    cnt = collective_op->alg_params.coll_fn.ibarrier_recursive_doubling.n_fanin_steps +
          collective_op->alg_params.coll_fn.ibarrier_recursive_doubling.n_recursive_doubling_steps;
    for (i_fn = cnt;
         i_fn < cnt + collective_op->alg_params.coll_fn.ibarrier_recursive_doubling.n_fanin_steps;
         i_fn++) {
        mca_bcol_base_module_t *bcol_mod =
            msg_desc->local_comm_description->functions[i_fn].bcol_module;
        /* the lowest rank in the group will be the root */
        msg_desc->fragment.fn_args[i_fn].root = 0;
    }

    /* successful completion */
    return ret;
}

/**
 * Hierarchical non-blocking barrier
 */
int mca_coll_ml_nb_barrier_intra(struct ompi_communicator_t *comm,
                                 ompi_request_t **request, mca_coll_base_module_t *module)
{
    /* local variables */
    int ret = OMPI_SUCCESS;
    mca_coll_ml_module_t *ml_module;
    uint64_t sequence_number;
    int i_fn;
    coll_ml_collective_description_t *collective_op;
    bcol_fn_args_t fn_arguments;
    mca_coll_ml_descriptor_t *msg_desc;

    ml_module = (mca_coll_ml_module_t *) module;
    /* debug */
    fprintf(stderr, " mca_coll_ml_nb_barrier_intra called \n");
    fflush(stderr);
    /* end debug */

    /* grab full message descriptor - RLG: this is really not optimal,
     * as we may be doing too much initialization if the collective
     * routine completes on the first call to progress which is called
     * within this routine.  Need to see if we can be more efficient
     * here.  The current issue is that the only way that I can think
     * to do this now is with two separate code paths, which I want to
     * avoid at this stage.
     */
    OMPI_FREE_LIST_GET(&(ml_module->message_descriptors),
                       msg_desc, ret);
    if (OMPI_SUCCESS != ret) {
        goto Error;
    }

    /* get message sequence number */
    sequence_number = OPAL_THREAD_ADD64(
        &(ml_module->no_data_collective_sequence_num), 1);
    fn_arguments.sequence_num = sequence_number;

    /* get pointer to schedule - only one algorithm at this stage */
    collective_op = &(ml_module->hierarchical_algorithms[BCOL_NB_BARRIER][0]);

    /* call setup function - RLG: right now this is totally extra,
     * but if we are going to have more than one algorithm,
     * this is a better way to do this. */
    coll_ml_setup_ibarrier_instance_recursive_doubling(
        msg_desc, collective_op);

    /* call the progress function to actually start the barrier */

    /* recycle resources - RLG: need to think about this one */

#if 0
    /* run barrier */
    /* need to add bcol context for the call */
    for (i_fn = 0; i_fn < collective_op->n_functions; i_fn++) {
        mca_bcol_base_module_t *bcol_module =
            collective_op->functions[i_fn].bcol_module;
        /* for barrier, all we need is the group information that is
         * captured in the bcol module
         */
        ret = collective_op->functions[i_fn].fn(&fn_arguments,
                                                NULL, NULL, bcol_module);
        if (OMPI_SUCCESS != ret) {
            goto Error;
        }
    }
#endif

    return OMPI_SUCCESS;

Error:
    return ret;
}
@@ -1,3 +1,4 @@
+/* -*- Mode: C; c-basic-offset:4 ; indent-tabs-mode:nil -*- */
 /*
  * Copyright (c) 2009-2012 Oak Ridge National Laboratory.  All rights reserved.
  * Copyright (c) 2009-2012 Mellanox Technologies.  All rights reserved.
@@ -19,17 +20,6 @@

 BEGIN_C_DECLS

-static inline int mca_coll_ml_err(const char* fmt, ...)
-{
-    va_list list;
-    int ret;
-
-    va_start(list, fmt);
-    ret = vfprintf(stderr, fmt, list);
-    va_end(list);
-    return ret;
-}
-
 static inline __opal_attribute_always_inline__ int ml_fls(int num)
 {
     int i = 1;
@@ -186,15 +176,23 @@ static inline __opal_attribute_always_inline__ int coll_ml_fragment_completion_
      * the MPI level request object, which is the first element
      * in the fragment descriptor.
      */
-    /* I contend that this is a bug. This is not the right way to check
-     * for the first fragment as it assumes that the first fragment would always
-     * for any collective have zero as the first offset or that other subsequent
-     * fragments would not. It is not safe to assume this. The correct check is
-     * the following one
-     */
-
-    ML_VERBOSE(10, ("Master ? %p %d", coll_op, coll_op->fragment_data.offset_into_user_buffer));
-    /* offset_into_user_buffer == 0 is not a valid definition for the first frag.
-     * if (0 != coll_op->fragment_data.offset_into_user_buffer &&
-     *     !out_of_resource) {
-     * A well posed definition of the first frag is the following
-     */
-    if ((&coll_op->full_message !=
-         coll_op->fragment_data.message_descriptor) &&
-        !out_of_resource) {
+    /* This check is in fact a bug. Not the correct definition of first
+     * fragment. First fragment is the only fragment that satisfies the
+     * following criteria
+     */
+    ML_VERBOSE(10, ("Master ? %p %d", coll_op, coll_op->fragment_data.offset_into_user_buffer));
+    /* if (0 != coll_op->fragment_data.offset_into_user_buffer &&
+           !out_of_resource) {
+    */
+    if (((&coll_op->full_message != coll_op->fragment_data.message_descriptor) &&
+         !out_of_resource) || IS_COLL_SYNCMEM(coll_op)) {
         /* non-zero offset ==> this is not fragment 0 */
         CHECK_AND_RECYCLE(coll_op);
     }
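
The rewritten test identifies fragment 0 structurally instead of by buffer offset: only the first fragment's message descriptor is the full_message embedded in the operation itself. Restated as a hypothetical helper:

    /* Hypothetical helper naming the corrected first-fragment test. */
    static inline bool coll_ml_is_first_fragment(mca_coll_ml_collective_operation_progress_t *op)
    {
        /* fragment 0 embeds the full-message descriptor; later fragments
         * only point back at fragment 0's copy */
        return &op->full_message == op->fragment_data.message_descriptor;
    }
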
@@ -414,7 +412,7 @@ static inline __opal_attribute_always_inline__ int mca_coll_ml_generic_collectiv
     for (fn_index = 0; fn_index < op_desc->n_fns; fn_index++) {
         func = &op_desc->component_functions[fn_index];
         task_status = &op_prog->dag_description.status_array[fn_index];
-        /* fire the collective imidiate if it has no dependencies */
+        /* fire the collective immediately if it has no dependencies */
         if (0 == task_status->rt_num_dependencies) {
             rc = func->bcol_function->coll_fn(&op_prog->variable_fn_params,
             /* Pasha: Need to update the prototype of the func,
@@ -436,7 +434,7 @@ static inline __opal_attribute_always_inline__ int mca_coll_ml_generic_collectiv
         OPAL_THREAD_UNLOCK(&(mca_coll_ml_component.active_tasks_mutex));
         break;
     case BCOL_FN_COMPLETE:
-        /* the tast is done ! lets start relevant dependencies */
+        /* the task is done ! let's start relevant dependencies */
         ML_VERBOSE(9, ("Call to bcol collective returned BCOL_FN_COMPLETE"));
         /* the task does not belong to any list yet, so passing NULL */
         ret = mca_coll_ml_task_completion_processing(&task_status, NULL);
@@ -555,20 +553,24 @@ static inline __opal_attribute_always_inline__
 void mca_coll_ml_convertor_get_send_frag_size(mca_coll_ml_module_t *ml_module,
         size_t *frag_size, struct full_message_t *message_descriptor)
 {
-    size_t ml_fragment_size = ml_module->ml_fragment_size;
+    size_t fragment_size = *frag_size;
     opal_convertor_t *dummy_convertor = &message_descriptor->dummy_convertor;

     /* The last frag needs special service */
-    if (ml_fragment_size >
-        message_descriptor->send_converter_bytes_packed) {
+    if (fragment_size >
+        (size_t) message_descriptor->send_converter_bytes_packed) {
         *frag_size = message_descriptor->send_converter_bytes_packed;
         message_descriptor->send_converter_bytes_packed = 0;

         return;
     }

-    *frag_size = ml_fragment_size;
-    message_descriptor->dummy_conv_position += ml_fragment_size;
+    if ((message_descriptor->dummy_conv_position + fragment_size) >
+        message_descriptor->n_bytes_total) {
+        message_descriptor->dummy_conv_position = (message_descriptor->dummy_conv_position + fragment_size)
+            - message_descriptor->n_bytes_total;
+    } else {
+        message_descriptor->dummy_conv_position += fragment_size;
+    }

     opal_convertor_generic_simple_position(dummy_convertor, &message_descriptor->dummy_conv_position);
     *frag_size -= dummy_convertor->partial_length;
@@ -576,6 +578,60 @@ static inline __opal_attribute_always_inline__
     message_descriptor->send_converter_bytes_packed -= (*frag_size);
 }

+static inline __opal_attribute_always_inline__ int
+mca_coll_ml_launch_sequential_collective (mca_coll_ml_collective_operation_progress_t *coll_op)
+{
+    mca_bcol_base_coll_fn_desc_t *bcol_func;
+    int ifunc, n_fn, ih, ret;
+    mca_coll_ml_collective_operation_description_t *sched =
+        coll_op->coll_schedule;
+
+    n_fn = sched->n_fns;
+    ih = coll_op->sequential_routine.current_active_bcol_fn;
+
+    /* if collectives are already pending just add this one to the list */
+    if (opal_list_get_size (&mca_coll_ml_component.sequential_collectives)) {
+        opal_list_append(&mca_coll_ml_component.sequential_collectives, (opal_list_item_t *) coll_op);
+
+        return OMPI_SUCCESS;
+    }
+
+    for (ifunc = ih; ifunc < n_fn; ifunc++, coll_op->sequential_routine.current_active_bcol_fn++) {
+        ret = coll_op->sequential_routine.seq_task_setup(coll_op);
+        if (OMPI_SUCCESS != ret) {
+            return ret;
+        }
+
+        bcol_func = (sched->component_functions[ifunc].bcol_function);
+        ret = bcol_func->coll_fn(&coll_op->variable_fn_params,
+                                 (struct coll_ml_function_t *) &sched->component_functions[ifunc].constant_group_data);
+
+        if (BCOL_FN_COMPLETE == ret) {
+            if (ifunc == n_fn - 1) {
+                ret = coll_ml_fragment_completion_processing(coll_op);
+                if (OPAL_UNLIKELY(OMPI_SUCCESS != ret)) {
+                    mca_coll_ml_abort_ml("Failed to run coll_ml_fragment_completion_processing");
+                }
+
+                return OMPI_SUCCESS;
+            }
+        } else {
+            if (BCOL_FN_STARTED == ret) {
+                coll_op->sequential_routine.current_bcol_status = SEQ_TASK_IN_PROG;
+            } else {
+                coll_op->sequential_routine.current_bcol_status = SEQ_TASK_PENDING;
+            }
+
+            ML_VERBOSE(10, ("Adding pending bcol to the progress list to access by ml_progress func-id %d", ifunc));
+            opal_list_append(&mca_coll_ml_component.sequential_collectives, (opal_list_item_t *) coll_op);
+
+            break;
+        }
+    }
+
+    return OMPI_SUCCESS;
+}
+
 END_C_DECLS

 #endif
@@ -54,6 +54,13 @@ mca_base_var_enum_value_t fragmentation_enable_enum[] = {
     {-1, NULL}
 };

+mca_base_var_enum_value_t bcast_algorithms[] = {
+    {COLL_ML_STATIC_BCAST, "static"},
+    {COLL_ML_SEQ_BCAST, "sequential"},
+    {COLL_ML_UNKNOWN_BCAST, "unknown-root"},
+    {-1, NULL}
+};
+
 /*
  * utility routine for string parameter registration
  */
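
This enum backs the new coll_ml_bcast_algorithm MCA variable registered in the next hunk, replacing the two booleans it deletes. A hedged sketch of the kind of dispatch the module can now perform; the enum values and component field come from this commit, while the helper and its semantics comments are assumptions:

    /* Hypothetical dispatch on the new enumerated variable. */
    static int ml_bcast_needs_known_root(void)
    {
        switch (mca_coll_ml_component.bcast_algorithm) {
        case COLL_ML_STATIC_BCAST:  return 1;  /* static schedule, root assumed known at setup */
        case COLL_ML_SEQ_BCAST:     return 1;  /* sequential launcher, root assumed known */
        case COLL_ML_UNKNOWN_BCAST: return 0;  /* root not known in advance, per the value's name */
        default:                    return 1;
        }
    }

At run time the value is selected the usual MCA way, e.g. mpirun --mca coll_ml_bcast_algorithm sequential.
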
@@ -197,85 +204,76 @@ int mca_coll_ml_register_params(void)

     /* register openib component parameters */

-    CHECK(reg_int("priority", NULL,
-                  "ML component priority"
-                  "(from 0(low) to 90 (high))", 0, &mca_coll_ml_component.ml_priority, 0));
+    CHECK(reg_int("priority", NULL, "ML component priority"
+                  "(from 0(low) to 90 (high))", 27, &mca_coll_ml_component.ml_priority, 0));

-    CHECK(reg_int("verbose", NULL,
-                  "Output some verbose ML information "
+    CHECK(reg_int("verbose", NULL, "Output some verbose ML information "
                   "(0 = no output, nonzero = output)", 0, &mca_coll_ml_component.verbose, 0));

-    CHECK(reg_int("n_levels", NULL,
-                  "number of levels in the hierarchy ", 1, &mca_coll_ml_component.ml_n_levels, 0));
-
-    CHECK(reg_int("max_comm", NULL,
-                  "max of communicators available to run ML", 12, (int *) &mca_coll_ml_component.max_comm, 0));
+    CHECK(reg_int("max_comm", NULL, "Maximum number of communicators that can use coll/ml", 24,
+                  (int *) &mca_coll_ml_component.max_comm, 0));

-    CHECK(reg_int("min_comm_size", NULL,
-                  " min size of comm to be available to run ML", 0, &mca_coll_ml_component.min_comm_size, 0));
+    CHECK(reg_int("min_comm_size", NULL, "Minimum size of communicator to use coll/ml", 0,
+                  &mca_coll_ml_component.min_comm_size, 0));

-    CHECK(reg_int("n_payload_mem_banks", NULL,
-                  "number of payload memory banks", 2, &mca_coll_ml_component.n_payload_mem_banks, 0));
+    CHECK(reg_int("n_payload_mem_banks", NULL, "Number of payload memory banks", 2,
+                  &mca_coll_ml_component.n_payload_mem_banks, 0));

-    CHECK(reg_int("n_payload_buffs_per_bank", NULL,
-                  "number of payload buffers per bank", 16, &mca_coll_ml_component.n_payload_buffs_per_bank, 0));
+    CHECK(reg_int("n_payload_buffs_per_bank", NULL, "Number of payload buffers per bank", 16,
+                  &mca_coll_ml_component.n_payload_buffs_per_bank, 0));

     /* RLG: need to handle alignment and size */
-    CHECK(reg_ullint("payload_buffer_size", NULL,
-                     "size of payload buffer", 4*1024, &mca_coll_ml_component.payload_buffer_size, 0));
+    CHECK(reg_ullint("payload_buffer_size", NULL, "Size of payload buffers", 4*1024,
+                     &mca_coll_ml_component.payload_buffer_size, 0));

     /* get the pipeline depth, default is 2 */
-    CHECK(reg_int("pipeline_depth", NULL,
-                  "size of fragmentation pipeline", 2, &mca_coll_ml_component.pipeline_depth, 0));
+    CHECK(reg_int("pipeline_depth", NULL, "Size of fragmentation pipeline", 2,
+                  &mca_coll_ml_component.pipeline_depth, 0));

-    CHECK(reg_int("free_list_init_size", NULL,
-                  " Initial size for free lists in ML", 128, &mca_coll_ml_component.free_list_init_size, 0));
+    CHECK(reg_int("free_list_init_size", NULL, "Initial size of free lists in coll/ml", 128,
+                  &mca_coll_ml_component.free_list_init_size, 0));

-    CHECK(reg_int("free_list_grow_size", NULL,
-                  " Initial size for free lists in ML", 64, &mca_coll_ml_component.free_list_grow_size, 0));
+    CHECK(reg_int("free_list_grow_size", NULL, "Grow size of free lists in coll/ml", 64,
+                  &mca_coll_ml_component.free_list_grow_size, 0));

-    CHECK(reg_int("free_list_max_size", NULL,
-                  " Initial size for free lists in ML", -1, &mca_coll_ml_component.free_list_max_size, 0));
+    CHECK(reg_int("free_list_max_size", NULL, "Maximum size of free lists in coll/ml", -1,
+                  &mca_coll_ml_component.free_list_max_size, 0));

-    CHECK(reg_int("use_knomial_allreduce", NULL,
-                  "Use k-nomial Allreduce supports only p2p currently"
-                  , 1, &mca_coll_ml_component.use_knomial_allreduce, 0));
+    mca_coll_ml_component.use_knomial_allreduce = 1;

-    CHECK(reg_bool("use_static_bcast", NULL,
-                   "Use new bcast static algorithm", true, &mca_coll_ml_component.use_static_bcast));
+    tmp = mca_base_var_enum_create ("coll_ml_bcast_algorithm", bcast_algorithms, &new_enum);
+    if (OPAL_SUCCESS != tmp) {
+        return tmp;
+    }

-    CHECK(reg_bool("use_sequential_bcast", NULL,
-                   "Use new bcast static algorithm", false, &mca_coll_ml_component.use_sequential_bcast));
+    mca_coll_ml_component.bcast_algorithm = COLL_ML_STATIC_BCAST;
+    tmp = mca_base_component_var_register (&mca_coll_ml_component.super.collm_version, "bcast_algorithm",
+                                           "Algorithm to use for broadcast", MCA_BASE_VAR_TYPE_INT,
+                                           new_enum, 0, 0, OPAL_INFO_LVL_9, MCA_BASE_VAR_SCOPE_READONLY,
+                                           &mca_coll_ml_component.bcast_algorithm);

-    CHECK(reg_bool("disable_allgather", NULL,
-                   "Allgather disabling",
-                   false, &mca_coll_ml_component.disable_allgather));
+    CHECK(reg_bool("disable_allgather", NULL, "Disable Allgather", false,
+                   &mca_coll_ml_component.disable_allgather));

-    CHECK(reg_bool("disable_alltoall", NULL,
-                   "Alltoall disabling",
-                   false, &mca_coll_ml_component.disable_alltoall));
+    CHECK(reg_bool("disable_reduce", NULL, "Disable Reduce", false,
+                   &mca_coll_ml_component.disable_reduce));

     tmp = mca_base_var_enum_create ("coll_ml_enable_fragmentation_enum", fragmentation_enable_enum, &new_enum);
-    if (OPAL_SUCCESS != ret) {
-        return tmp;
+    if (OPAL_SUCCESS != tmp) {
+        return tmp;
     }

     /* default to auto-enable fragmentation */
     mca_coll_ml_component.enable_fragmentation = 2;
-    tmp = mca_base_component_var_register (&mca_coll_ml_component.super.collm_version, "enable_fragmentation",
-            "Disable/Enable fragmentation for large messages", MCA_BASE_VAR_TYPE_INT,
-            new_enum, 0, 0, OPAL_INFO_LVL_9, MCA_BASE_VAR_SCOPE_READONLY,
-            &mca_coll_ml_component.enable_fragmentation);
+    tmp = mca_base_component_var_register (&mca_coll_ml_component.super.collm_version, "enable_fragmentation",
+                                           "Disable/Enable fragmentation for large messages", MCA_BASE_VAR_TYPE_INT,
+                                           new_enum, 0, 0, OPAL_INFO_LVL_9, MCA_BASE_VAR_SCOPE_READONLY,
+                                           &mca_coll_ml_component.enable_fragmentation);
     if (0 > tmp) {
         ret = tmp;
     }
     OBJ_RELEASE(new_enum);

     CHECK(reg_int("use_brucks_smsg_alltoall", NULL,
                   "Use Bruck's Algo for Small Msg Alltoall"
                   "1 - Bruck's Algo with RDMA; 2 - Bruck's with Send Recv"
                   , 0, &mca_coll_ml_component.use_brucks_smsg_alltoall, 0));

     asprintf(&str, "%s/mca-coll-ml.config",
              opal_install_dirs.ompidatadir);
     if (NULL == str) {
@@ -1,6 +1,9 @@
+/* -*- Mode: C; c-basic-offset:4 ; indent-tabs-mode:nil -*- */
 /*
  * Copyright (c) 2009-2012 Oak Ridge National Laboratory.  All rights reserved.
  * Copyright (c) 2009-2012 Mellanox Technologies.  All rights reserved.
+ * Copyright (c) 2013      Los Alamos National Security, LLC. All rights
+ *                         reserved.
  * $COPYRIGHT$
  *
  * Additional copyrights may follow
@@ -35,7 +38,7 @@ static int mca_coll_ml_memsync_recycle_memory(mca_coll_ml_collective_operation_p
             ML_MEMSYNC == coll_op->fragment_data.current_coll_op);

     ML_VERBOSE(10,("MEMSYNC: bank %d was recycled coll_op %p", bank, coll_op));

     /* set the bank as free */
     ml_memblock->bank_is_busy[bank] = false;
@@ -61,11 +64,11 @@ static int mca_coll_ml_memsync_recycle_memory(mca_coll_ml_collective_operation_p
             }
             break;
         case OMPI_ERR_TEMP_OUT_OF_RESOURCE:
-            ML_VERBOSE(10, ("Already on hte list %p", pending_op));
+            ML_VERBOSE(10, ("Already on the list %p", pending_op));
             have_resources = false;
             break;
         default:
-            ML_ERROR(("Error happend %d", rc));
+            ML_ERROR(("Error happened %d", rc));
             return rc;
         }
     }
@@ -133,8 +136,6 @@ int mca_coll_ml_memsync_intra(mca_coll_ml_module_t *ml_module, int bank_index)

     ML_VERBOSE(8, ("MEMSYNC start"));

-    OBJ_RETAIN(ml_module->comm);
-
     if (OPAL_UNLIKELY(0 == opal_list_get_size(&ml_module->active_bcols_list))) {
         /* Josh's change: In the case where only p2p is active, we have no way
          * to reset the bank release counters to zero, I am doing that here since it
@@ -147,12 +148,11 @@ int mca_coll_ml_memsync_intra(mca_coll_ml_module_t *ml_module, int bank_index)
          * ptp case.
          */
         mca_coll_ml_collective_operation_progress_t dummy_coll;

         dummy_coll.coll_module = (mca_coll_base_module_t *) ml_module;
         dummy_coll.fragment_data.current_coll_op = ML_MEMSYNC;
         dummy_coll.full_message.bank_index_to_recycle = bank_index;
-        dummy_coll.fragment_data.offset_into_user_buffer = 100; /* must be non-zero,
-                                                                 * else assert fails in recycle flow
-                                                                 */

         /* Handling special case when memory synchronization is not required */
         rc = mca_coll_ml_memsync_recycle_memory(&dummy_coll);
         if (OPAL_UNLIKELY(rc != OMPI_SUCCESS)) {
@@ -160,6 +160,9 @@ int mca_coll_ml_memsync_intra(mca_coll_ml_module_t *ml_module, int bank_index)
             return rc;
         }
     } else {
+        /* retain the communicator until the operation is finished. the communicator
+         * will be released by CHECK_AND_RECYCLE */
+        OBJ_RETAIN(ml_module->comm);

         rc = mca_coll_ml_memsync_launch(ml_module, &req, bank_index);
         if (OPAL_UNLIKELY(rc != OMPI_SUCCESS)) {
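
The moved OBJ_RETAIN pairs the reference with the operation that will eventually drop it: the retain is now taken only when a memsync operation is actually launched, and the added comment notes that CHECK_AND_RECYCLE performs the matching release on completion. A minimal sketch of the invariant; the error-path release is an assumption, not shown in this hunk:

    OBJ_RETAIN(ml_module->comm);          /* owned by the in-flight memsync op */
    rc = mca_coll_ml_memsync_launch(ml_module, &req, bank_index);
    if (OPAL_UNLIKELY(OMPI_SUCCESS != rc)) {
        OBJ_RELEASE(ml_module->comm);     /* assumed: balance the retain if launch fails */
    }
    /* on success, CHECK_AND_RECYCLE releases the reference when the op completes */
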
(The diff for one file is not shown because it is too large.)
ompi/mca/coll/ml/coll_ml_reduce.c (new file, 528 lines)
@@ -0,0 +1,528 @@
|
||||
/* -*- Mode: C; c-basic-offset:4 ; indent-tabs-mode:nil -*- */
|
||||
/*
|
||||
* Copyright (c) 2009-2012 Oak Ridge National Laboratory. All rights reserved.
|
||||
* Copyright (c) 2009-2012 Mellanox Technologies. All rights reserved.
|
||||
* $COPYRIGHT$
|
||||
*
|
||||
* Additional copyrights may follow
|
||||
*
|
||||
* $HEADER$
|
||||
*/
|
||||
/** @file */
|
||||
|
||||
#include "ompi_config.h"
|
||||
|
||||
#include "ompi/constants.h"
|
||||
#include "opal/threads/mutex.h"
|
||||
#include "ompi/communicator/communicator.h"
|
||||
#include "ompi/mca/coll/coll.h"
|
||||
#include "ompi/mca/bcol/bcol.h"
|
||||
#include "opal/sys/atomic.h"
|
||||
#include "ompi/mca/coll/ml/coll_ml.h"
|
||||
#include "ompi/mca/coll/ml/coll_ml_allocation.h"
|
||||
#include "ompi/mca/coll/ml/coll_ml_inlines.h"
|
||||
#define REDUCE_SMALL_MESSAGE_THRESHOLD 2048
|
||||
|
||||
static int mca_coll_ml_reduce_unpack(mca_coll_ml_collective_operation_progress_t *coll_op)
|
||||
{
|
||||
int ret;
|
||||
/* need to put in more */
|
||||
int count = coll_op->variable_fn_params.count;
|
||||
ompi_datatype_t *dtype = coll_op->variable_fn_params.dtype;
|
||||
|
||||
void *dest = (void *)((uintptr_t)coll_op->full_message.dest_user_addr +
|
||||
(uintptr_t)coll_op->fragment_data.offset_into_user_buffer);
|
||||
void *src = (void *)((uintptr_t)coll_op->fragment_data.buffer_desc->data_addr +
|
||||
(size_t)coll_op->variable_fn_params.rbuf_offset);
|
||||
|
||||
ret = ompi_datatype_copy_content_same_ddt(dtype, (int32_t) count, (char *) dest,
|
||||
(char *) src);
|
||||
if (ret < 0) {
|
||||
return OMPI_ERROR;
|
||||
}
|
||||
|
||||
if (coll_op->variable_fn_params.root_flag) {
|
||||
ML_VERBOSE(1,("In reduce unpack %d",
|
||||
*(int *)((unsigned char*) src)));
|
||||
}
|
||||
|
||||
ML_VERBOSE(10, ("sbuf addr %p, sbuf offset %d, sbuf val %lf, rbuf addr %p, rbuf offset %d, rbuf val %lf.",
|
||||
coll_op->variable_fn_params.sbuf, coll_op->variable_fn_params.sbuf_offset,
|
||||
*(double *) ((unsigned char *) coll_op->variable_fn_params.sbuf +
|
||||
(size_t) coll_op->variable_fn_params.sbuf_offset),
|
||||
coll_op->variable_fn_params.rbuf, coll_op->variable_fn_params.rbuf_offset,
|
||||
*(double *) ((unsigned char *) coll_op->variable_fn_params.rbuf +
|
||||
(size_t) coll_op->variable_fn_params.rbuf_offset)));
|
||||
|
||||
return OMPI_SUCCESS;
|
||||
}
static int
mca_coll_ml_reduce_task_setup (mca_coll_ml_collective_operation_progress_t *coll_op)
{
    int fn_idx, h_level, next_h_level, my_index;
    mca_sbgp_base_module_t *sbgp;
    mca_coll_ml_topology_t *topo = coll_op->coll_schedule->topo_info;

    fn_idx       = coll_op->sequential_routine.current_active_bcol_fn;
    h_level      = coll_op->coll_schedule->component_functions[fn_idx].h_level;
    next_h_level = (fn_idx < coll_op->coll_schedule->n_fns - 1) ?
        coll_op->coll_schedule->component_functions[fn_idx+1].h_level : -1;
    sbgp = topo->component_pairs[h_level].subgroup_module;
    my_index = sbgp->my_index;

    if (coll_op->variable_fn_params.root_flag) {
        ML_VERBOSE(1,("In task completion Data in receiver buffer %d ",
                      *(int *)((unsigned char*) coll_op->variable_fn_params.rbuf +
                               coll_op->variable_fn_params.rbuf_offset)));
    }

    /* determine the root for this level of the hierarchy */
    if (coll_op->coll_schedule->topo_info->route_vector[coll_op->global_root].level == next_h_level ||
        coll_op->global_root == sbgp->group_list[my_index]) {
        /* I am the global root or I will be talking to the global root in the next round. */
        coll_op->variable_fn_params.root = my_index;
    } else if (coll_op->coll_schedule->topo_info->route_vector[coll_op->global_root].level == h_level) {
        /* the root is in this level of my hierarchy */
        coll_op->variable_fn_params.root = coll_op->coll_schedule->topo_info->route_vector[coll_op->global_root].rank;
    } else {
        coll_op->variable_fn_params.root = 0;
    }

    /* set the route vector for this root */
    coll_op->variable_fn_params.root_route =
        &coll_op->coll_schedule->topo_info->route_vector[sbgp->group_list[coll_op->variable_fn_params.root]];

    /* am I the root of this level of the hierarchy? */
    coll_op->variable_fn_params.root_flag = (my_index == coll_op->variable_fn_params.root);

    /* For each hierarchy level past the first, switch the source and
     * destination buffers so the output of one level feeds the next.
     * No switch is needed for the first call. */
    if (0 < fn_idx) {
        int tmp_offset = coll_op->variable_fn_params.sbuf_offset;
        coll_op->variable_fn_params.sbuf_offset =
            coll_op->variable_fn_params.rbuf_offset;
        coll_op->variable_fn_params.rbuf_offset = tmp_offset;
    }

    return OMPI_SUCCESS;
}
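A note on the offset swap at the end of the task setup: the two halves of the ML payload buffer are ping-ponged between hierarchy levels, so the output of one level becomes the input of the next. Below is a minimal standalone sketch of that scheme; the names and sizes are hypothetical, not the coll/ml API.

#include <stdio.h>

/* Ping-pong two buffer halves across hierarchy levels: each level reads
 * at sbuf_off and writes at rbuf_off, then the roles swap before the
 * next level runs. */
static void run_levels(int n_levels, size_t half)
{
    size_t sbuf_off = 0, rbuf_off = half;

    for (int level = 0; level < n_levels; ++level) {
        printf("level %d: read at %zu, write at %zu\n",
               level, sbuf_off, rbuf_off);
        if (level + 1 < n_levels) { /* no swap after the last level */
            size_t tmp = sbuf_off;
            sbuf_off = rbuf_off;
            rbuf_off = tmp;
        }
    }
}

int main(void)
{
    run_levels(3, 4096); /* e.g. three hierarchy levels, 4 KiB halves */
    return 0;
}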
static int mca_coll_ml_reduce_frag_progress(mca_coll_ml_collective_operation_progress_t *coll_op)
{
    /* local variables */
    void *buf;
    size_t dt_size;
    int ret, frag_len, count;
    ptrdiff_t lb, extent;
    ml_payload_buffer_desc_t *src_buffer_desc;
    mca_coll_ml_collective_operation_progress_t *new_op;
    mca_coll_ml_module_t *ml_module = OP_ML_MODULE(coll_op);

    ret = ompi_datatype_get_extent(coll_op->variable_fn_params.dtype, &lb, &extent);
    if (ret < 0) {
        return OMPI_ERROR;
    }

    dt_size = (size_t) extent;

    /* keep the pipeline filled with fragments */
    while (coll_op->fragment_data.message_descriptor->n_active <
           coll_op->fragment_data.message_descriptor->pipeline_depth) {
        /* If an active fragment happens to have completed the collective
         * during a hop into the progress engine, don't launch a new
         * fragment; break and return instead. */
        if (coll_op->fragment_data.message_descriptor->n_bytes_scheduled
            == coll_op->fragment_data.message_descriptor->n_bytes_total) {
            break;
        }

        /* get an ml buffer */
        src_buffer_desc = mca_coll_ml_alloc_buffer(OP_ML_MODULE(coll_op));
        if (NULL == src_buffer_desc) {
            /* If there are outstanding fragments, break out and let an
             * active fragment deal with this later; no buffers are
             * available right now. */
            if (0 < coll_op->fragment_data.message_descriptor->n_active) {
                return OMPI_SUCCESS;
            } else {
                /* It is useless to call progress from here: ml progress
                 * can't be executed, so the ml memsync call will not
                 * complete and no memory will be recycled. Instead, put
                 * the element on the list and progress it later, when
                 * memsync recycles some memory. */

                /* The fragment is already on the list and we still have
                 * no ml resources. Return busy. */
                if (coll_op->pending & REQ_OUT_OF_MEMORY) {
                    ML_VERBOSE(10,("Out of resources %p", coll_op));
                    return OMPI_ERR_TEMP_OUT_OF_RESOURCE;
                }

                coll_op->pending |= REQ_OUT_OF_MEMORY;
                opal_list_append(&((OP_ML_MODULE(coll_op))->waiting_for_memory_list),
                                 (opal_list_item_t *)coll_op);
                ML_VERBOSE(10,("Out of resources %p adding to pending queue", coll_op));
                return OMPI_ERR_TEMP_OUT_OF_RESOURCE;
            }
        }

        /* get a new collective descriptor and initialize it */
        new_op = mca_coll_ml_alloc_op_prog_single_frag_dag(ml_module,
                     ml_module->coll_ml_reduce_functions[ML_SMALL_DATA_REDUCE],
                     coll_op->fragment_data.message_descriptor->src_user_addr,
                     coll_op->fragment_data.message_descriptor->dest_user_addr,
                     coll_op->fragment_data.message_descriptor->n_bytes_total,
                     coll_op->fragment_data.message_descriptor->n_bytes_scheduled);

        ML_VERBOSE(1,(" In Reduce fragment progress %d %d ",
                      coll_op->fragment_data.message_descriptor->n_bytes_total,
                      coll_op->fragment_data.message_descriptor->n_bytes_scheduled));
        MCA_COLL_IBOFFLOAD_SET_ML_BUFFER_INFO(new_op,
                                              src_buffer_desc->buffer_index, src_buffer_desc);

        new_op->fragment_data.current_coll_op = coll_op->fragment_data.current_coll_op;
        new_op->fragment_data.message_descriptor = coll_op->fragment_data.message_descriptor;

        /* set the task setup callback */
        new_op->sequential_routine.seq_task_setup = mca_coll_ml_reduce_task_setup;
        /* we need this address for pointer arithmetic in memcpy */
        buf = (void*)coll_op->fragment_data.message_descriptor->src_user_addr;
        /* calculate the number of data types in this packet */
        count = (coll_op->fragment_data.message_descriptor->n_bytes_total -
                 coll_op->fragment_data.message_descriptor->n_bytes_scheduled <
                 ((size_t) OP_ML_MODULE(coll_op)->small_message_thresholds[BCOL_REDUCE]/4 )?
                 (coll_op->fragment_data.message_descriptor->n_bytes_total -
                  coll_op->fragment_data.message_descriptor->n_bytes_scheduled) / dt_size :
                 (size_t) coll_op->variable_fn_params.count);

        /* calculate the fragment length */
        frag_len = count * dt_size;

        ret = ompi_datatype_copy_content_same_ddt(coll_op->variable_fn_params.dtype, count,
                  (char *) src_buffer_desc->data_addr, (char *) ((uintptr_t) buf + (uintptr_t)
                  coll_op->fragment_data.message_descriptor->n_bytes_scheduled));
        if (ret < 0) {
            return OMPI_ERROR;
        }

        /* if root, unpack the data */
        if (ompi_comm_rank(ml_module->comm) == coll_op->global_root ) {
            new_op->process_fn = mca_coll_ml_reduce_unpack;
            new_op->variable_fn_params.root_flag = true;
        } else {
            new_op->process_fn = NULL;
            new_op->variable_fn_params.root_flag = false;
        }

        new_op->variable_fn_params.root_route = coll_op->variable_fn_params.root_route;

        /* set up fragment-specific data */
        new_op->fragment_data.message_descriptor->n_bytes_scheduled += frag_len;
        new_op->fragment_data.buffer_desc = src_buffer_desc;
        new_op->fragment_data.fragment_size = frag_len;
        (new_op->fragment_data.message_descriptor->n_active)++;

        /* set the reduce buffer arguments */
        ML_SET_VARIABLE_PARAMS_BCAST(new_op, OP_ML_MODULE(new_op), count,
                                     coll_op->variable_fn_params.dtype, src_buffer_desc,
                                     0, (ml_module->payload_block->size_buffer -
                                         ml_module->data_offset)/2, frag_len,
                                     src_buffer_desc->data_addr);

        new_op->variable_fn_params.buffer_size = frag_len;
        new_op->variable_fn_params.sbuf = src_buffer_desc->data_addr;
        new_op->variable_fn_params.rbuf = src_buffer_desc->data_addr;
        new_op->variable_fn_params.root = coll_op->variable_fn_params.root;
        new_op->global_root = coll_op->global_root;
        new_op->variable_fn_params.op = coll_op->variable_fn_params.op;
        new_op->variable_fn_params.hier_factor = coll_op->variable_fn_params.hier_factor;
        new_op->sequential_routine.current_bcol_status = SEQ_TASK_PENDING;
        MCA_COLL_ML_SET_NEW_FRAG_ORDER_INFO(new_op);

        ML_VERBOSE(10,("FFFF Contig + fragmentation [0-sk, 1-lk, 3-su, 4-lu] %d %d %d\n",
                       new_op->variable_fn_params.buffer_size,
                       new_op->fragment_data.fragment_size,
                       new_op->fragment_data.message_descriptor->n_bytes_scheduled));
        /* initialize the first coll */
        new_op->sequential_routine.seq_task_setup(new_op);

        /* append this collective */
        OPAL_THREAD_LOCK(&(mca_coll_ml_component.sequential_collectives_mutex));
        opal_list_append(&mca_coll_ml_component.sequential_collectives,
                         (opal_list_item_t *)new_op);
        OPAL_THREAD_UNLOCK(&(mca_coll_ml_component.sequential_collectives_mutex));
    }

    return OMPI_SUCCESS;
}
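The conditional `count` expression in the fragment launcher above is dense: it picks either the remaining element count (for the short final fragment) or the regular per-fragment count. For reference, here is a standalone restatement under the same quarter-of-threshold convention this file uses; the helper name and byte-sized arguments are assumptions for illustration, not part of coll/ml.

#include <stddef.h>
#include <stdio.h>

/* Elements to pack into the next fragment: the regular per-fragment
 * count, or whatever remains when the tail is smaller than one
 * fragment. The threshold/4 limit mirrors the small-message check in
 * parallel_reduce_start, where send and receive halves share one ML
 * buffer. */
static size_t next_frag_count(size_t total_bytes, size_t scheduled_bytes,
                              size_t threshold, size_t dt_size,
                              size_t frag_count)
{
    size_t remaining = total_bytes - scheduled_bytes;

    if (remaining < threshold / 4) {
        return remaining / dt_size;  /* short final fragment */
    }
    return frag_count;               /* full-sized fragment */
}

int main(void)
{
    /* 80000 total bytes, 77824 already scheduled, 16 KiB threshold,
     * 8-byte elements, 512 elements per full fragment: the 2176-byte
     * tail is below threshold/4 = 4096, so 272 elements remain. */
    printf("%zu\n", next_frag_count(80000, 77824, 16384, 8, 512));
    return 0;
}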
static inline __opal_attribute_always_inline__
int parallel_reduce_start (void *sbuf, void *rbuf, int count,
                           struct ompi_datatype_t *dtype, struct ompi_op_t *op,
                           int root,
                           struct ompi_communicator_t *comm,
                           mca_coll_ml_module_t *ml_module,
                           ompi_request_t **req,
                           int small_data_reduce,
                           int large_data_reduce) {
    ptrdiff_t lb, extent;
    size_t pack_len, dt_size;
    ml_payload_buffer_desc_t *src_buffer_desc = NULL;
    mca_coll_ml_collective_operation_progress_t *coll_op = NULL;
    bool contiguous = ompi_datatype_is_contiguous_memory_layout(dtype, count);
    mca_coll_ml_component_t *cm = &mca_coll_ml_component;
    int ret, n_fragments = 1, frag_len,
        pipeline_depth, n_dts_per_frag, rank;

    if (MPI_IN_PLACE == sbuf) {
        sbuf = rbuf;
    }

    ret = ompi_datatype_get_extent(dtype, &lb, &extent);
    if (ret < 0) {
        return OMPI_ERROR;
    }

    rank = ompi_comm_rank (comm);

    dt_size = (size_t) extent;
    pack_len = count * dt_size;

    /* We use separate receive and send buffers, so only half of each ML buffer is usable. */
    if (pack_len < (size_t) ml_module->small_message_thresholds[BCOL_REDUCE] / 4) {
        /* the message cannot be larger than the ML buffer size */
        assert(pack_len <= ml_module->payload_block->size_buffer);

        src_buffer_desc = mca_coll_ml_alloc_buffer(ml_module);

        ML_VERBOSE(10,("Using small data reduce (threshold = %d)",
                       REDUCE_SMALL_MESSAGE_THRESHOLD));
        while (NULL == src_buffer_desc) {
            opal_progress();
            src_buffer_desc = mca_coll_ml_alloc_buffer(ml_module);
        }

        coll_op = mca_coll_ml_alloc_op_prog_single_frag_dag(ml_module,
                      ml_module->coll_ml_reduce_functions[small_data_reduce],
                      sbuf, rbuf, pack_len, 0);

        MCA_COLL_IBOFFLOAD_SET_ML_BUFFER_INFO(coll_op,
                                              src_buffer_desc->buffer_index, src_buffer_desc);

        coll_op->variable_fn_params.rbuf = src_buffer_desc->data_addr;
        coll_op->variable_fn_params.sbuf = src_buffer_desc->data_addr;
        coll_op->variable_fn_params.buffer_index = src_buffer_desc->buffer_index;
        coll_op->variable_fn_params.src_desc = src_buffer_desc;
        coll_op->variable_fn_params.count = count;

        ret = ompi_datatype_copy_content_same_ddt(dtype, count,
                  (void *) (uintptr_t) src_buffer_desc->data_addr, (char *) sbuf);
        if (ret < 0){
            return OMPI_ERROR;
        }
    } else if (cm->enable_fragmentation || !contiguous) {
        ML_VERBOSE(1,("Using fragmented reduce"));

        /* fragment the data */
        /* reject datatypes that are too large to fragment */
        if (dt_size > (size_t) ml_module->small_message_thresholds[BCOL_REDUCE] / 4) {
            ML_ERROR(("Datatypes larger than the small message threshold are not supported"));
            return OMPI_ERROR;
        }

        /* calculate the number of data types that can fit per ml buffer */
        n_dts_per_frag = ml_module->small_message_thresholds[BCOL_REDUCE] / (4 * dt_size);

        /* calculate the number of fragments (round up) */
        n_fragments = (count + n_dts_per_frag - 1) / n_dts_per_frag;

        /* calculate the actual pipeline depth */
        pipeline_depth = n_fragments < cm->pipeline_depth ? n_fragments : cm->pipeline_depth;

        /* calculate the fragment size */
        frag_len = n_dts_per_frag * dt_size;

        /* allocate an ml buffer */
        src_buffer_desc = mca_coll_ml_alloc_buffer(ml_module);
        while (NULL == src_buffer_desc) {
            opal_progress();
            src_buffer_desc = mca_coll_ml_alloc_buffer(ml_module);
        }

        coll_op = mca_coll_ml_alloc_op_prog_single_frag_dag(ml_module,
                      ml_module->coll_ml_reduce_functions[small_data_reduce],
                      sbuf, rbuf,
                      pack_len,
                      0 /* offset for first pack */);

        MCA_COLL_IBOFFLOAD_SET_ML_BUFFER_INFO(coll_op,
                                              src_buffer_desc->buffer_index, src_buffer_desc);

        coll_op->variable_fn_params.sbuf = (void *) src_buffer_desc->data_addr;
        coll_op->variable_fn_params.rbuf = (void *) src_buffer_desc->data_addr;

        coll_op->fragment_data.message_descriptor->n_active = 1;
        coll_op->full_message.n_bytes_scheduled = frag_len;
        coll_op->full_message.fragment_launcher = mca_coll_ml_reduce_frag_progress;
        coll_op->full_message.pipeline_depth = pipeline_depth;
        coll_op->fragment_data.current_coll_op = small_data_reduce;
        coll_op->fragment_data.fragment_size = frag_len;

        coll_op->variable_fn_params.count = n_dts_per_frag; /* seems fishy */
        coll_op->variable_fn_params.buffer_size = frag_len;
        coll_op->variable_fn_params.src_desc = src_buffer_desc;
        /* copy into the ml buffer */
        ret = ompi_datatype_copy_content_same_ddt(dtype, n_dts_per_frag,
                  (char *) src_buffer_desc->data_addr, (char *) sbuf);
        if (ret < 0) {
            return OMPI_ERROR;
        }
    } else {
        ML_VERBOSE(1,("Using zero-copy ptp reduce"));
        coll_op = mca_coll_ml_alloc_op_prog_single_frag_dag(ml_module,
                      ml_module->coll_ml_reduce_functions[large_data_reduce],
                      sbuf, rbuf, pack_len, 0);

        coll_op->variable_fn_params.userbuf =
            coll_op->variable_fn_params.sbuf = sbuf;

        coll_op->variable_fn_params.rbuf = rbuf;

        /* The ML buffer is used for testing. Later, when we switch to
         * knem/mmap/portals, this should be replaced appropriately. */
        src_buffer_desc = mca_coll_ml_alloc_buffer(ml_module);
        while (NULL == src_buffer_desc) {
            opal_progress();
            src_buffer_desc = mca_coll_ml_alloc_buffer(ml_module);
        }

        coll_op->variable_fn_params.buffer_index = src_buffer_desc->buffer_index;
        coll_op->variable_fn_params.src_desc = src_buffer_desc;
        coll_op->variable_fn_params.count = count;
    }

    coll_op->process_fn = (rank != root) ? NULL : mca_coll_ml_reduce_unpack;

    /* set the common parts */
    coll_op->fragment_data.buffer_desc = src_buffer_desc;
    coll_op->variable_fn_params.dtype = dtype;
    coll_op->variable_fn_params.op = op;

    /* NTH: the root, root route, and root flag are set in the task setup */

    /* fill in the function arguments */
    coll_op->variable_fn_params.sbuf_offset = 0;
    coll_op->variable_fn_params.rbuf_offset = (ml_module->payload_block->size_buffer -
                                               ml_module->data_offset)/2;

    /* keep track of the global root of this operation */
    coll_op->global_root = root;

    coll_op->variable_fn_params.sequence_num =
        OPAL_THREAD_ADD64(&(ml_module->collective_sequence_num), 1);
    coll_op->sequential_routine.current_active_bcol_fn = 0;
    /* set the task setup callback */
    coll_op->sequential_routine.seq_task_setup = mca_coll_ml_reduce_task_setup;

    /* Reduce requires the schedule to be fixed. If we were to use a
     * different (changing) schedule, the operation might produce a
     * different result. */
    coll_op->coll_schedule->component_functions = coll_op->coll_schedule->
        comp_fn_arr[coll_op->coll_schedule->topo_info->route_vector[root].level];

    /* launch the collective */
    ret = mca_coll_ml_launch_sequential_collective (coll_op);
    if (OMPI_SUCCESS != ret) {
        ML_VERBOSE(10, ("Failed to launch reduce collective"));
        return ret;
    }

    *req = &coll_op->full_message.super;

    return OMPI_SUCCESS;
}
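To make the fragmentation arithmetic in parallel_reduce_start concrete, here is a worked example as a small program; the numbers are illustrative only, the real values come from the component configuration.

#include <stdio.h>

int main(void)
{
    /* Illustrative values: 8-byte doubles, a 16 KiB small-message
     * threshold, a configured pipeline depth of 4, and 10000 elements. */
    int dt_size = 8, threshold = 16384, cm_pipeline_depth = 4, count = 10000;

    int n_dts_per_frag = threshold / (4 * dt_size);                  /* 512 elements */
    int n_fragments = (count + n_dts_per_frag - 1) / n_dts_per_frag; /* 20, rounded up */
    int pipeline_depth = n_fragments < cm_pipeline_depth ?
        n_fragments : cm_pipeline_depth;                             /* 4 */
    int frag_len = n_dts_per_frag * dt_size;                         /* 4096 bytes */

    printf("%d elems/frag, %d frags, depth %d, %d bytes/frag\n",
           n_dts_per_frag, n_fragments, pipeline_depth, frag_len);
    return 0;
}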
int mca_coll_ml_reduce(void *sbuf, void *rbuf, int count,
                       struct ompi_datatype_t *dtype, struct ompi_op_t *op,
                       int root, struct ompi_communicator_t *comm,
                       mca_coll_base_module_t *module) {

    mca_coll_ml_module_t *ml_module = (mca_coll_ml_module_t*)module;
    int ret = OMPI_SUCCESS;
    ompi_request_t *req;

    if (OPAL_UNLIKELY(!ompi_op_is_commute(op))) {
        fprintf (stderr, "Falling back for reduce\n");
        /* coll/ml does not handle non-commutative operations at this time.
         * Fall back on another collective module. */
        return ml_module->fallback.coll_reduce (sbuf, rbuf, count, dtype, op, root, comm,
                                                ml_module->fallback.coll_reduce_module);
    }

    ML_VERBOSE(10,("Calling ML Reduce"));
    ret = parallel_reduce_start(sbuf, rbuf, count, dtype, op,
                                root, comm, (mca_coll_ml_module_t *)module,
                                &req, ML_SMALL_DATA_REDUCE,
                                ML_LARGE_DATA_REDUCE);
    if (OPAL_UNLIKELY(ret != OMPI_SUCCESS)) {
        ML_VERBOSE(10, ("Failed to launch"));
        return ret;
    }

    /* blocking reduce: wait until the request completes */
    ret = ompi_request_wait(&req, MPI_STATUS_IGNORE);

    ML_VERBOSE(10, ("Blocking Reduce is done"));

    return ret;
}
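From an application's point of view, this entry point is reached through an ordinary MPI_Reduce call once coll/ml is the selected module. A minimal test program, using only the standard MPI API:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* MPI_SUM is commutative, so coll/ml handles it directly instead
     * of falling back to another collective module. */
    int in = rank, out = -1;
    MPI_Reduce(&in, &out, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    if (0 == rank) {
        printf("sum of ranks = %d (expected %d)\n", out, size * (size - 1) / 2);
    }

    MPI_Finalize();
    return 0;
}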
int mca_coll_ml_reduce_nb(void *sbuf, void *rbuf, int count,
                          struct ompi_datatype_t *dtype, struct ompi_op_t *op,
                          int root, struct ompi_communicator_t *comm,
                          ompi_request_t **req,
                          mca_coll_base_module_t *module) {

    int ret = OMPI_SUCCESS;
    mca_coll_ml_module_t *ml_module = (mca_coll_ml_module_t*)module;

    if (OPAL_UNLIKELY(!ompi_op_is_commute(op))) {
        fprintf (stderr, "Falling back for ireduce\n");
        /* coll/ml does not handle non-commutative operations at this time.
         * Fall back on another collective module. */
        return ml_module->fallback.coll_ireduce (sbuf, rbuf, count, dtype, op, root, comm, req,
                                                 ml_module->fallback.coll_ireduce_module);
    }

    ML_VERBOSE(10,("Calling ML Ireduce"));
    ret = parallel_reduce_start(sbuf, rbuf, count, dtype, op,
                                root, comm, ml_module,
                                req, ML_SMALL_DATA_REDUCE,
                                ML_LARGE_DATA_REDUCE);
    if (OPAL_UNLIKELY(ret != OMPI_SUCCESS)) {
        ML_VERBOSE(10, ("Failed to launch"));
        return ret;
    }

    ML_VERBOSE(10, ("Non-blocking Reduce is started"));

    return OMPI_SUCCESS;
}
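The non-blocking variant returns after launching the first fragment; the progress engine drives the remaining fragments until the request completes. A standard-MPI usage sketch showing the intended communication/computation overlap:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, in, out = 0;
    MPI_Request req;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    in = rank + 1;
    MPI_Ireduce(&in, &out, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD, &req);

    /* ... overlap independent computation with the reduction here ... */

    MPI_Wait(&req, MPI_STATUS_IGNORE);
    if (0 == rank) {
        printf("reduced value: %d\n", out);
    }

    MPI_Finalize();
    return 0;
}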
ompi/mca/coll/ml/coll_ml_resource_affinity.c (new file, 147 lines)
@@ -0,0 +1,147 @@
#include "opal/mca/carto/carto.h"
#include "opal/mca/carto/base/base.h"
#include "opal/util/output.h"
#include "opal/class/opal_graph.h"
#include "opal/mca/paffinity/base/base.h"
#include "ompi/constants.h"

#include "orte/mca/ess/ess.h"
#include "coll_ml_resource_affinity.h"

int get_dev_distance_for_all_procs(opal_carto_graph_t *graph, const char *device)
{
    opal_paffinity_base_cpu_set_t cpus;
    opal_carto_base_node_t *device_node;
    int min_distance = -1, i, num_processors;

    if(opal_paffinity_base_get_processor_info(&num_processors) != OMPI_SUCCESS) {
        num_processors = 100; /* choose something big enough */
    }

    device_node = opal_carto_base_find_node(graph, device);

    /* no topology info found for the device; assume that it is close */
    if(NULL == device_node)
        return 0;

    OPAL_PAFFINITY_CPU_ZERO(cpus);
    opal_paffinity_base_get(&cpus);

    for (i = 0; i < num_processors; i++) {
        opal_carto_base_node_t *slot_node;
        int distance, socket, core;
        char *slot;

        if(!OPAL_PAFFINITY_CPU_ISSET(i, cpus))
            continue;

        opal_paffinity_base_get_map_to_socket_core(i, &socket, &core);
        asprintf(&slot, "socket%d", socket);

        slot_node = opal_carto_base_find_node(graph, slot);

        free(slot);

        if(NULL == slot_node)
            return 0;

        distance = opal_carto_base_spf(graph, slot_node, device_node);

        if(distance < 0)
            return 0;

        if(min_distance < 0 || min_distance > distance)
            min_distance = distance;
    }

    return min_distance;
}

int get_dev_distance_proc(opal_carto_graph_t *graph,
                          const char *device, int rank, struct ompi_proc_t *proc){
    opal_paffinity_base_cpu_set_t cpus;
    opal_carto_base_node_t *device_node;
    opal_carto_base_node_t *slot_node;
    int distance, socket, core;
    char *slot;
    int process_id;
    int nrank;

    nrank = orte_ess.get_node_rank(&(proc->proc_name));

    opal_paffinity_base_get_physical_processor_id(nrank, &process_id);

    device_node = opal_carto_base_find_node(graph, device);

    /* no topology info found for the device; assume that it is close */
    if(NULL == device_node)
        return 0;

    OPAL_PAFFINITY_CPU_ZERO(cpus);
    opal_paffinity_base_get(&cpus);

    opal_paffinity_base_get_map_to_socket_core(process_id, &socket, &core);
    asprintf(&slot, "socket%d", socket);
    ML_VERBOSE(10,("The socket address is %d\n",socket));

    slot_node = opal_carto_base_find_node(graph, slot);

    free(slot);

    if(NULL == slot_node)
        return -1;

    distance = opal_carto_base_spf(graph, slot_node, device_node);

    if(distance < 0)
        return -1;

    return distance;
}
int coll_ml_select_leader(mca_coll_ml_module_t *ml_module,
                          mca_sbgp_base_module_t *sbgp_module,
                          int *rank_in_comm,
                          struct ompi_proc_t **procs,
                          int nprocs){

    int rank, dist1, dist2, dist;
    int min_dist = 10000;
    int i, leader = 10000;
    struct ompi_proc_t *proc = NULL;

    for (i = 0; i < nprocs; i++) {
        /* if local process */
        rank = rank_in_comm[sbgp_module->group_list[i]];
        proc = procs[sbgp_module->group_list[i]];
        dist1 = get_dev_distance_proc(ml_module->sm_graph,"mem0",rank,proc);
        dist2 = get_dev_distance_proc(ml_module->ib_graph,"mthca0",rank,proc);

        dist = dist1 + dist2;

        ML_VERBOSE(10,("The distance for proc %d dist1 %d, dist2 %d \n",i,dist1,dist2));
        if ((dist < min_dist) || ((dist == min_dist) && (i < leader))) {
            leader = i;
            min_dist = dist;
        }
    }

    return leader;
}
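The tie-break in coll_ml_select_leader keeps the lowest group index among processes at the minimal combined distance. A small self-contained illustration of that selection rule, with made-up distances rather than real carto output:

#include <stdio.h>

int main(void)
{
    /* Combined memory+HCA distances for four hypothetical group members. */
    int dist[4] = { 5, 3, 3, 7 };
    int leader = 10000, min_dist = 10000;

    for (int i = 0; i < 4; i++) {
        if (dist[i] < min_dist || (dist[i] == min_dist && i < leader)) {
            leader = i;
            min_dist = dist[i];
        }
    }

    /* Members 1 and 2 tie at distance 3; the lower index wins. */
    printf("leader = %d (expected 1)\n", leader);
    return 0;
}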
int coll_ml_construct_resource_graphs(mca_coll_ml_module_t *ml_module){

    opal_carto_base_get_host_graph(&ml_module->sm_graph,"Memory");
    opal_carto_base_get_host_graph(&ml_module->ib_graph,"Infiniband");

    /* debug
    opal_graph_print(ml_module->sm_graph);
    */
    return 0;
}
ompi/mca/coll/ml/coll_ml_resource_affinity.h (new file, 19 lines)
@@ -0,0 +1,19 @@
#include "opal/mca/carto/carto.h"
#include "opal/mca/carto/base/base.h"
#include "opal/util/output.h"
#include "opal/class/opal_graph.h"
#include "coll_ml.h"

/* Get the host graph for SM and Infiniband */
int discover_on_node_resources(const char device);
int get_dev_distance_for_all_procs(opal_carto_graph_t *graph,
                                   const char *device);
int get_dev_distance_proc(opal_carto_graph_t *graph,
                          const char *device, int rank, struct ompi_proc_t *proc);
int coll_ml_select_leader(mca_coll_ml_module_t *ml_module,
                          mca_sbgp_base_module_t *sbgp_module,
                          int *rank_in_comm,
                          struct ompi_proc_t **procs,
                          int nprocs);
int coll_ml_construct_resource_graphs(mca_coll_ml_module_t *ml_module);
@@ -138,7 +138,7 @@ static int add_to_invoke_table(mca_bcol_base_module_t *bcol_module,
             for (i=range_min; i<=range_max; i++) {
                 bcol_module->filtered_fns_table[data_src_type][waiting_semantic][bcoll_type][i][j][k]
                     = fn_filtered;
-                ML_VERBOSE(11, ("Putting functions %d %d %d %d %p", bcoll_type, i, j, k, fn_filtered));
+                ML_VERBOSE(21, ("Putting functions %d %d %d %d %p", bcoll_type, i, j, k, fn_filtered));
             }
         }
     }
@@ -323,10 +323,8 @@ int mca_select_bcol_function(mca_bcol_base_module_t *bcol_module,
                              bcol_fn_arguments->dtype);
     if ((BCOL_ALLREDUCE == bcoll_type) || (BCOL_REDUCE == bcoll_type)) {
-        /* needs to be resolved, the op structure has changed, there is no field called "op_type" */
-        /*
         fn_filtered =
             bcol_module->filtered_fns_table[data_src_type][waiting_type][bcoll_type][msg_range][bcol_fn_arguments->dtype->id][bcol_fn_arguments->op->op_type];
-        */
     }
     else {
         fn_filtered =
@@ -127,6 +127,28 @@ hierarchy = full_hr
 algorithm = ML_LARGE_DATA_ALLREDUCE
 hierarchy = full_hr

+[REDUCE]
+<small>
+# scatter supports: ML_SCATTER_SMALL_DATA_SEQUENTIAL
+algorithm = ML_SMALL_DATA_REDUCE
+hierarchy = full_hr
+<large>
+# scatter supports: ML_SCATTER_SMALL_DATA_SEQUENTIAL
+algorithm = ML_LARGE_DATA_REDUCE
+hierarchy = full_hr
+
+[IREDUCE]
+<small>
+# scatter supports: ML_SCATTER_SMALL_DATA_SEQUENTIAL
+algorithm = ML_SMALL_DATA_REDUCE
+hierarchy = full_hr
+<large>
+# scatter supports: ML_SCATTER_SMALL_DATA_SEQUENTIAL
+algorithm = ML_LARGE_DATA_REDUCE
+hierarchy = full_hr
+
+
 [SCATTER]
 <small>
 # scatter supports: ML_SCATTER_SMALL_DATA_SEQUENTIAL
@@ -282,6 +282,10 @@ static mca_sbgp_base_module_t *mca_sbgp_basesmsocket_select_procs(struct ompi_pr
     module->super.group_net = OMPI_SBGP_SOCKET;

     /* test to see if process is bound */
+    /* this needs to change. The process may have been bound by some entity
+     * other than OPAL. Just because the binding policy isn't set doesn't
+     * mean the process isn't bound. */
     if( OPAL_BIND_TO_NONE == OPAL_GET_BINDING_POLICY(opal_hwloc_binding_policy) ) {

         /* pa affinity not set, so socket index will be set to -1 */
@@ -331,7 +335,7 @@ static mca_sbgp_base_module_t *mca_sbgp_basesmsocket_select_procs(struct ompi_pr
     end debug print*/

     /* if no other local procs found skip to end */
-    if( 1 >= cnt ) {
+    if( 1 > cnt ) {
         goto NoLocalPeers;
     }

@@ -419,7 +423,7 @@ static mca_sbgp_base_module_t *mca_sbgp_basesmsocket_select_procs(struct ompi_pr

 #if 0
-    /*debug print*/
+    /*
     {
         int ii;
         fprintf(stderr,"Ranks per socket: %d\n",cnt);
@@ -429,8 +433,8 @@ static mca_sbgp_base_module_t *mca_sbgp_basesmsocket_select_procs(struct ompi_pr
         fprintf(stderr,"\n");
         fflush(stderr);
     }
+    */
 #endif

     /* end debug*/
@@ -1,6 +1,9 @@
 /* -*- Mode: C; c-basic-offset:4 ; indent-tabs-mode:nil -*- */
 /*
  * Copyright (c) 2009-2012 Oak Ridge National Laboratory. All rights reserved.
  * Copyright (c) 2009-2012 Mellanox Technologies. All rights reserved.
+ * Copyright (c) 2013      Los Alamos National Security, LLC. All rights
+ *                         reserved.
  * $COPYRIGHT$
  *
  * Additional copyrights may follow
@@ -146,7 +149,7 @@ static mca_sbgp_base_module_t *mca_sbgp_basesmuma_select_procs(struct ompi_proc_
 )
 {
     /* local variables */
-    int cnt,proc,local;
+    int cnt,proc,local,last_local_proc;
     mca_sbgp_basesmuma_module_t *module;

     module=OBJ_NEW(mca_sbgp_basesmuma_module_t);
@@ -157,27 +160,21 @@ static mca_sbgp_base_module_t *mca_sbgp_basesmuma_select_procs(struct ompi_proc_
     module->super.group_comm = comm;
     module->super.group_list = NULL;
     module->super.group_net = OMPI_SBGP_MUMA;
-    cnt=0;
-    for( proc=0 ; proc < n_procs_in ; proc++) {
-        /* debug
-        if( odd ) {
-            if( !(1&proc) )
-                continue;
-        } else {
-            if( (1&proc) )
-                continue;
-        }
-        end debug */
-        local=OPAL_PROC_ON_LOCAL_NODE(procs[proc]->proc_flags);
-        if( local ) {
+    for (proc = 0, cnt = 0, last_local_proc = 0 ; proc < n_procs_in ; ++proc) {
+        local = OPAL_PROC_ON_LOCAL_NODE(procs[proc]->proc_flags);
+        if (local) {
+            last_local_proc = proc;
             cnt++;
         }
     }
     /* if no other local procs found skip to end */

     if( 2 > cnt ) {
         /* There's always at least one - namely myself */
-        assert( 1 == cnt);
-        module->super.group_size=1;
+        assert(1 == cnt);
+        module->super.group_size = 1;
+        module->super.group_list = (int *) malloc (sizeof (int));
+        module->super.group_list[0] = last_local_proc;
         /* let ml handle this case */
         goto OneLocalPeer;
     }
@@ -190,21 +187,11 @@ static mca_sbgp_base_module_t *mca_sbgp_basesmuma_select_procs(struct ompi_proc_
             goto Error;
         }
     }
-    cnt=0;
-    for( proc=0 ; proc < n_procs_in ; proc++) {
-        /* debug
-        if( odd ) {
-            if( !(1&proc) )
-                continue;
-        } else {
-            if( (1&proc) )
-                continue;
-        }
-        end debug */
-        local=OPAL_PROC_ON_LOCAL_NODE(procs[proc]->proc_flags);
-        if( local ) {
-            module->super.group_list[cnt]=proc;
-            cnt++;
+    for (proc = 0, cnt = 0 ; proc < n_procs_in ; ++proc) {
+        local = OPAL_PROC_ON_LOCAL_NODE(procs[proc]->proc_flags);
+        if( local ) {
+            module->super.group_list[cnt++] = proc;
         }
     }
 OneLocalPeer:
@@ -1,6 +1,9 @@
 /* -*- Mode: C; c-basic-offset:4 ; indent-tabs-mode:nil -*- */
 /*
  * Copyright (c) 2009-2012 Oak Ridge National Laboratory. All rights reserved.
  * Copyright (c) 2009-2012 Mellanox Technologies. All rights reserved.
+ * Copyright (c) 2014      Los Alamos National Security, LLC. All rights
+ *                         reserved.
  * $COPYRIGHT$
  *
  * Additional copyrights may follow
@@ -141,103 +144,73 @@ static mca_sbgp_base_module_t * mca_sbgp_p2p_select_procs(struct ompi_proc_t **
 )
 {
     /* local variables */
-    int cnt,proc;
+    int cnt, proc, my_rank;
     mca_sbgp_p2p_module_t *module;
-    int my_rank,i_btl;
-
-    module=OBJ_NEW(mca_sbgp_p2p_module_t);
-    if (!module ) {
-        return NULL;
-    }
-    module->super.group_size=0;
-    module->super.group_comm = comm;
-    module->super.group_net = OMPI_SBGP_P2P;

     /* find my rank in the group */
-    my_rank=-1;
-    for( proc=0 ; proc < n_procs_in ; proc++) {
-        if(ompi_proc_local() == procs[proc]) {
-            my_rank=proc;
+    for (my_rank = -1, proc = 0 ; proc < n_procs_in ; ++proc) {
+        if (ompi_proc_local() == procs[proc]) {
+            my_rank = proc;
         }
     }

     /* I am not in the list - so will form no local subgroup */
-    if( 0 > my_rank ){
-        return NULL;
-    }
-
-    /* count the number of ranks in the group */
-    cnt=0;
-    for( proc=0 ; proc < n_procs_in ; proc++) {
-        mca_bml_base_endpoint_t* endpoint =
-            (mca_bml_base_endpoint_t*) procs[proc]->proc_endpoints[OMPI_PROC_ENDPOINT_TAG_BML];
-
-        if(my_rank == proc ) {
-            cnt++;
-            continue;
-        }
-        /* loop over btls */
-        for( i_btl=0 ; i_btl < (int) mca_bml_base_btl_array_get_size(&(endpoint->btl_eager)) ; i_btl++ ) {
-            if(key) {
-                /* I am checking for specific btl */
-                if( strcmp(endpoint->btl_eager.bml_btls[i_btl].btl->btl_component->btl_version.mca_component_name,key)) {
-                    cnt++;
-                    break;
-                }
-            } else {
-                /* I will take any btl */
-                cnt++;
-                break;
-            }
-        }
+    if (0 > my_rank) {
+        return NULL;
     }
-    /* debug
-    fprintf(stderr," AAA cnt %d n_procs_in %d \n",cnt,n_procs_in);
-    fflush(stderr);
-    end debug */

+    module = OBJ_NEW(mca_sbgp_p2p_module_t);
+    if (!module ) {
+        return NULL;
+    }
+
+    module->super.group_size = 0;
+    module->super.group_comm = comm;
+    module->super.group_net = OMPI_SBGP_P2P;

     /* allocate resources */
-    module->super.group_size=cnt;
-    if( cnt > 0 ) {
-        module->super.group_list=(int *)malloc(sizeof(int)*
-                                               module->super.group_size);
-        if( NULL == module->super.group_list ) {
-            goto Error;
-        }
+    module->super.group_list = (int *) calloc (n_procs_in, sizeof (int));
+    if (NULL == module->super.group_list) {
+        goto Error;
     }

-    cnt=0;
-    for( proc=0 ; proc < n_procs_in ; proc++) {
-        mca_bml_base_endpoint_t* endpoint =
+    for (cnt = 0, proc = 0 ; proc < n_procs_in ; ++proc) {
+#if defined(OMPI_PROC_ENDPOINT_TAG_BML)
+        mca_bml_base_endpoint_t* endpoint =
             (mca_bml_base_endpoint_t*) procs[proc]->proc_endpoints[OMPI_PROC_ENDPOINT_TAG_BML];
+#endif

-        if(my_rank == proc ) {
-            module->super.group_list[cnt]=proc;
-            cnt++;
+        if (my_rank == proc || !key) {
+            module->super.group_list[cnt++] = proc;
             continue;
         }
-        /* loop over btls */
-        for( i_btl=0 ;
-             i_btl < (int) mca_bml_base_btl_array_get_size(&(endpoint->btl_eager)) ;
-             i_btl++ ) {
-            if(key) {
+#if defined(OMPI_PROC_ENDPOINT_TAG_BML)
+        if (NULL != endpoint) {
+            int num_btls = mca_bml_base_btl_array_get_size(&(endpoint->btl_eager));
+            /* loop over btls */
+            for (int i_btl = 0 ; i_btl < num_btls ; ++i_btl) {
                 /* I am checking for specific btl */
-                if( strcmp(endpoint->btl_eager.bml_btls[i_btl].btl->
-                           btl_component->btl_version.mca_component_name,key)) {
-                    module->super.group_list[cnt]=proc;
-                    cnt++;
+                if (strcmp(endpoint->btl_eager.bml_btls[i_btl].btl->
+                           btl_component->btl_version.mca_component_name, key)) {
+                    module->super.group_list[cnt++] = proc;
                     break;
                 }
-            } else {
-                /* I will take any btl */
-                module->super.group_list[cnt]=proc;
-                cnt++;
-                break;
-            }
+            }
         }
+#endif
     }

+    if (0 == cnt) {
+        goto Error;
+    }
+
+    module->super.group_size = cnt;
+    module->super.group_list = (int *) realloc (module->super.group_list, sizeof (int) * cnt);
+    if (NULL == module->super.group_list) {
+        /* Shouldn't ever happen */
+        goto Error;
+    }

     /* successful return */
@@ -246,9 +219,9 @@ static mca_sbgp_base_module_t * mca_sbgp_p2p_select_procs(struct ompi_proc_t **
     /* return with error */
 Error:
     /* clean up */
-    if( NULL != module->super.group_list ) {
-        free(module->super.group_list);
-        module->super.group_list=NULL;
+    if (NULL != module->super.group_list) {
+        free (module->super.group_list);
+        module->super.group_list = NULL;
     }
     OBJ_RELEASE(module);
ompi/op/op.c (+19 lines)
@@ -257,6 +257,25 @@ int ompi_op_init(void)
         add_intrinsic(&ompi_mpi_op_replace.op, OMPI_OP_BASE_FORTRAN_REPLACE,
                       FLAGS, "MPI_REPLACE")) {
         return OMPI_ERROR;
+    } else {
+        /* This code is placed back here to support HCOL allreduce at the
+         * moment. It is part of the bgate repository only. This conflict
+         * with OMPI v1.7 is to be resolved some other way. */
+        ompi_mpi_op_null.op.op_type = OMPI_OP_NULL;
+        ompi_mpi_op_max.op.op_type = OMPI_OP_MAX;
+        ompi_mpi_op_min.op.op_type = OMPI_OP_MIN;
+        ompi_mpi_op_sum.op.op_type = OMPI_OP_SUM;
+        ompi_mpi_op_prod.op.op_type = OMPI_OP_PROD;
+        ompi_mpi_op_land.op.op_type = OMPI_OP_LAND;
+        ompi_mpi_op_band.op.op_type = OMPI_OP_BAND;
+        ompi_mpi_op_lor.op.op_type = OMPI_OP_LOR;
+        ompi_mpi_op_bor.op.op_type = OMPI_OP_BOR;
+        ompi_mpi_op_lxor.op.op_type = OMPI_OP_LXOR;
+        ompi_mpi_op_bxor.op.op_type = OMPI_OP_BXOR;
+        ompi_mpi_op_maxloc.op.op_type = OMPI_OP_MAXLOC;
+        ompi_mpi_op_minloc.op.op_type = OMPI_OP_MINLOC;
+        ompi_mpi_op_replace.op.op_type = OMPI_OP_REPLACE;
+    }

     /* All done */
ompi/op/op.h (+25 lines)
@@ -109,6 +109,29 @@ typedef void (ompi_op_java_handler_fn_t)(void *, void *, int *,
 #define OMPI_OP_FLAGS_COMMUTE        0x0040

+/* This enum, as well as the op_type field in ompi_op_t, is placed back
+ * here to support HCOL allreduce at the moment. It is part of the bgate
+ * repository only. This conflict with OMPI v1.7 is to be resolved some
+ * other way. */
+enum ompi_op_type {
+    OMPI_OP_NULL,
+    OMPI_OP_MAX,
+    OMPI_OP_MIN,
+    OMPI_OP_SUM,
+    OMPI_OP_PROD,
+    OMPI_OP_LAND,
+    OMPI_OP_BAND,
+    OMPI_OP_LOR,
+    OMPI_OP_BOR,
+    OMPI_OP_LXOR,
+    OMPI_OP_BXOR,
+    OMPI_OP_MAXLOC,
+    OMPI_OP_MINLOC,
+    OMPI_OP_REPLACE,
+    OMPI_OP_NUM_OF_TYPES
+};
+
 /**
  * Back-end type of MPI_Op
  */
@@ -119,6 +142,8 @@ struct ompi_op_t {
     /** Name, for debugging purposes */
     char o_name[MPI_MAX_OBJECT_NAME];

+    enum ompi_op_type op_type;
+
     /** Flags about the op */
     uint32_t o_flags;