/* -*- Mode: C; c-basic-offset:4 ; indent-tabs-mode:nil -*- */
/*
 * Copyright (c) 2004-2010 The Trustees of Indiana University and Indiana
 *                         University Research and Technology
 *                         Corporation.  All rights reserved.
 * Copyright (c) 2004-2005 The University of Tennessee and The University
 *                         of Tennessee Research Foundation.  All rights
 *                         reserved.
 * Copyright (c) 2004-2005 High Performance Computing Center Stuttgart,
 *                         University of Stuttgart.  All rights reserved.
 * Copyright (c) 2004-2005 The Regents of the University of California.
 *                         All rights reserved.
Per RFC, bring in the following changes:
* Remove paffinity, maffinity, and carto frameworks -- they've been
wholly replaced by hwloc.
* Move ompi_mpi_init() affinity-setting/checking code down to ORTE.
* Update sm, smcuda, wv, and openib components to no longer use carto.
Instead, use hwloc data. There are still optimizations possible in
the sm/smcuda BTLs (i.e., making multiple mpools). Also, the old
carto-based code found out how many NUMA nodes were ''available''
-- not how many were used ''in this job''. The new hwloc-using
code computes the same value -- it was not updated to calculate how
many NUMA nodes are used ''by this job.''
* Note that I cannot compile the smcuda and wv BTLs -- I ''think''
they're right, but they need to be verified by their owners.
* The openib component now does a bunch of stuff to figure out where
"near" OpenFabrics devices are. '''THIS IS A CHANGE IN DEFAULT
BEHAVIOR!!''' and still needs to be verified by OpenFabrics vendors
(I do not have a NUMA machine with an OpenFabrics device that is a
non-uniform distance from multiple different NUMA nodes).
* Completely rewrite the OMPI_Affinity_str() routine from the
"affinity" mpiext extension. This extension now understands
hyperthreads; the output format of it has changed a bit to reflect
this new information.
* Bunches of minor changes around the code base to update names/types
from maffinity/paffinity-based names to hwloc-based names.
* Add some helper functions into the hwloc base, mainly having to do
with the fact that we have the hwloc data reporting ''all''
topology information, but sometimes you really only want the
(online | available) data.
This commit was SVN r26391.
2012-05-07 18:52:54 +04:00
 * Copyright (c) 2006-2012 Cisco Systems, Inc.  All rights reserved.
 * Copyright (c) 2010-2012 Los Alamos National Security, LLC.
 *                         All rights reserved.
 * $COPYRIGHT$
 *
 * Additional copyrights may follow
 *
 * $HEADER$
 */

#include "orte_config.h"

#include <stdlib.h>
#include <string.h>

#include "orte/runtime/runtime.h"

#include "opal/util/argv.h"
Refs trac:3275.
We ran into a case where the OMPI SVN trunk grew a new acceptable MCA
parameter value, but this new value was not accepted on the v1.6
branch (hwloc_base_mem_bind_failure_action -- on the trunk it accepts
the value "silent", but on the older v1.6 branch, it doesn't). If you
set "hwloc_base_mem_bind_failure_action=silent" in the default MCA
params file and then accidentally ran with the v1.6 branch, every OMPI
executable (including ompi_info) just failed because hwloc_base_open()
would say "hey, 'silent' is not a valid value for
hwloc_base_mem_bind_failure_action!". Kaboom.
The only problem is that it didn't give you any indication of where
this value was being set. Quite maddening, from a user perspective.
So we changed how ompi_info handles this case. If any framework open
function returns OMPI_ERR_BAD_PARAM (either because its base MCA params
got a bad value or because one of its component register/open
functions returns OMPI_ERR_BAD_PARAM), ompi_info will stop, print out
a warning that it received an error, and then dump out the parameters
that it has received so far in the framework that had a problem.
At a minimum, this will show the user the MCA param that had an error
(it's usually the last one), and ''where it was set from'' (so that
they can go fix it).
We updated ompi_info to check for O???_ERR_BAD_PARAM from each of
the framework opens. Also updated the doxygen docs in mca.h for this
O???_BAD_PARAM behavior. And we noticed that mca.h had MCA_SUCCESS
and MCA_ERR_??? codes. Why? I think we used them in exactly one
place in the code base (mca_base_components_open.c). So we deleted
those and just used the normal OPAL_* codes instead.
While we were doing this, we also cleaned up a little memory
management during ompi_info/orte-info/opal-info finalization.
Valgrind still reports a truckload of memory in use at ompi_info
termination, but it mostly looks to be components not freeing
memory/resources properly (and outside the scope of this fix).
This commit was SVN r27306.
The following Trac tickets were found above:
Ticket 3275 --> https://svn.open-mpi.org/trac/ompi/ticket/3275
2012-09-12 00:47:24 +04:00
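The error-reporting behavior described in the commit message above can be sketched in isolation. This is a minimal illustration only: every name prefixed with `sketch_` or `SKETCH_`, the two example frameworks, and the parameter names and sources are hypothetical stand-ins, not the real OPAL/ORTE types, error codes, or MCA parameter API. It shows the general idea of stopping at the first framework whose open function reports a bad parameter and dumping what was registered so far, including where each value came from.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-ins; these are NOT the real OPAL definitions. */
enum { SKETCH_SUCCESS = 0, SKETCH_ERR_BAD_PARAM = -1 };

typedef struct {
    const char *name;    /* parameter name */
    const char *value;   /* value as read */
    const char *source;  /* where the value was set from */
} sketch_param_t;

typedef struct {
    const char *framework;
    int (*open_fn)(sketch_param_t *seen, int *nseen);
} sketch_framework_t;

/* A well-behaved framework: registers one parameter and succeeds. */
static int open_good(sketch_param_t *seen, int *nseen)
{
    seen[(*nseen)++] = (sketch_param_t){ "good_verbose", "0", "default" };
    return SKETCH_SUCCESS;
}

/* A framework that only accepts "warn" and rejects the value it was given. */
static int open_picky(sketch_param_t *seen, int *nseen)
{
    seen[(*nseen)++] =
        (sketch_param_t){ "picky_action", "silent", "file: mca-params.conf" };
    return (0 == strcmp(seen[*nseen - 1].value, "warn"))
               ? SKETCH_SUCCESS : SKETCH_ERR_BAD_PARAM;
}

/* Open each framework in turn.  On a bad-parameter error, stop, print a
 * warning, dump the parameters seen so far for that framework, and return
 * its index.  Returns -1 if every framework opened cleanly. */
static int sketch_open_all(const sketch_framework_t *fw, int nfw, FILE *out)
{
    for (int i = 0; i < nfw; ++i) {
        sketch_param_t seen[8];
        int nseen = 0;
        int rc = fw[i].open_fn(seen, &nseen);

        if (SKETCH_ERR_BAD_PARAM == rc) {
            fprintf(out, "warning: framework \"%s\" reported a bad parameter\n",
                    fw[i].framework);
            for (int j = 0; j < nseen; ++j) {
                fprintf(out, "  %s = %s (set from %s)\n",
                        seen[j].name, seen[j].value, seen[j].source);
            }
            return i;
        }
    }
    return -1;
}
```

Calling `sketch_open_all()` on a list that contains the picky framework returns that framework's index after printing its parameters; the last parameter printed is usually the offending one, as the commit message notes.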
#include "opal/runtime/opal_info_support.h"
Update libevent to the 2.0 series, currently at 2.0.7rc. We will update to their final release when it becomes available. Currently known errors exist in unused portions of the libevent code. This revision passes the IBM test suite on a Linux machine and on a standalone Mac.
This is a fairly intrusive change, but outside of the moving of opal/event to opal/mca/event, the only changes involved (a) changing all calls to opal_event functions to reflect the new framework instead, and (b) ensuring that all opal_event_t objects are properly constructed since they are now true opal_objects.
Note: Shiqing has just returned from vacation and has not yet had a chance to complete the Windows integration. Thus, this commit almost certainly breaks Windows support on the trunk. However, I want this to have a chance to soak for as long as possible before I become less available a week from today (going to be at a class for 5 days, and thus will only be sparingly available) so we can find and fix any problems.
Biggest change is moving the libevent code from opal/event to a new opal/mca/event framework. This was done to make it much easier to update libevent in the future. New versions can be inserted as a new component and tested in parallel with the current version until validated, then we can remove the earlier version if we so choose. This is a statically built framework a la installdirs, so only one component will build at a time. There is no selection logic - the sole compiled component simply loads its function pointers into the opal_event struct.
I have gone through the code base and converted all the libevent calls I could find. However, I can neither compile nor test every environment. It is therefore quite likely that errors remain in the system. Please keep an eye open for two things:
1. compile-time errors: these will be obvious as calls to the old functions (e.g., opal_evtimer_new) must be replaced by the new framework APIs (e.g., opal_event.evtimer_new)
2. run-time errors: these will likely show up as segfaults due to missing constructors on opal_event_t objects. It appears that it became a typical practice for people to "init" an opal_event_t by simply using memset to zero it out. This will no longer work - you must either OBJ_NEW or OBJ_CONSTRUCT an opal_event_t. I tried to catch these cases, but may have missed some. Believe me, you'll know when you hit it.
There is also the issue of the new libevent "no recursion" behavior. As I described in a recent email, we will have to discuss this and figure out what, if anything, we need to do.
This commit was SVN r23925.
2010-10-24 22:35:54 +04:00
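The run-time failure mode warned about in the commit message above (zeroing an object with memset instead of constructing it) can be illustrated with a small stand-alone sketch. The `sketch_event_t` type and all `sketch_` functions are hypothetical and only mimic the idea behind OBJ_CONSTRUCT and OBJ_NEW; they are not the real opal_event_t or the OPAL object system. The point shown is that a constructor can install state (here a function pointer) that zero-filling cannot.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical miniature of a constructed object; not the real opal_event_t. */
typedef struct sketch_event {
    bool constructed;                       /* set only by the constructor */
    int  (*fire)(struct sketch_event *ev);  /* installed by the constructor */
    int  fired;                             /* number of times fired */
} sketch_event_t;

static int sketch_fire_impl(sketch_event_t *ev)
{
    return ++ev->fired;
}

/* Plays the role of OBJ_CONSTRUCT: zero the fields AND install the state
 * that zeroing alone cannot provide. */
static void sketch_event_construct(sketch_event_t *ev)
{
    memset(ev, 0, sizeof(*ev));
    ev->fire = sketch_fire_impl;
    ev->constructed = true;
}

/* Plays the role of OBJ_NEW: allocate, then construct.  Caller frees. */
static sketch_event_t *sketch_event_new(void)
{
    sketch_event_t *ev = malloc(sizeof(*ev));
    if (NULL != ev) {
        sketch_event_construct(ev);
    }
    return ev;
}

/* Refuse objects that were never constructed rather than calling through
 * an unset function pointer (the segfault scenario).  Returns -1 for an
 * unconstructed object, otherwise the updated fire count. */
static int sketch_event_fire(sketch_event_t *ev)
{
    if (!ev->constructed || NULL == ev->fire) {
        return -1;
    }
    return ev->fire(ev);
}
```

An object that was only zeroed with memset is rejected by `sketch_event_fire()`, while one set up through `sketch_event_construct()` or `sketch_event_new()` works. Real code without such a guard would call through the unset pointer and crash, which is the segfault the commit message describes.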
#include "opal/mca/event/base/base.h"
#include "opal/util/output.h"

#include "orte/runtime/orte_info_support.h"
#include "orte/tools/orte-info/orte-info.h"
/*
 * Public variables
 */

static void component_map_construct(orte_info_component_map_t *map)
{
    map->type = NULL;
}
static void component_map_destruct(orte_info_component_map_t *map)
{
    if (NULL != map->type) {
        free(map->type);
    }
    /* the type close functions will release the
     * list of components
     */
}
OBJ_CLASS_INSTANCE(orte_info_component_map_t,
                   opal_list_item_t,
                   component_map_construct,
                   component_map_destruct);

opal_pointer_array_t component_map;

/*
 * Private variables
 */

static bool opened_components = false;


void orte_info_components_open(void)
{
    if (opened_components) {
        return;
    }

    opened_components = true;

    /* init the map */
    OBJ_CONSTRUCT(&component_map, opal_pointer_array_t);
    opal_pointer_array_init(&component_map, 256, INT_MAX, 128);

    opal_info_register_framework_params(&component_map);
    orte_info_register_framework_params(&component_map);
}

/*
 * Not to be confused with orte_info_close_components.
 */
void orte_info_components_close(void)
{
    int i;
    orte_info_component_map_t *map;

    if (!opened_components) {
        return;
    }

    orte_info_close_components ();
    opal_info_close_components ();

    for (i=0; i < component_map.size; i++) {
        if (NULL != (map = (orte_info_component_map_t*)opal_pointer_array_get_item(&component_map, i))) {
            OBJ_RELEASE(map);
        }
    }

    OBJ_DESTRUCT(&component_map);

    opened_components = false;
}