openmpi

Author	SHA1	Message	Date
Jeff Squyres	3bd9c603ff	Clean up variables used in configure with OPAL_VAR_SCOPE. This is helpful in the work for #3694: ensure that many places that eventually end up in configure don't overly-pollute the global shell variable space (because debugging accidental shell variable pollution can be a real pain). Refs trac:3694 This commit was SVN r29830. The following Trac tickets were found above: Ticket 3694 --> https://svn.open-mpi.org/trac/ompi/ticket/3694	2013-12-06 23:40:34 +00:00
Vasily Filipov	ae8c826527	"If" statement wrapping with #if MEMORY_LINUX_UMMUNOTIFY in order to prevent ptmalloc2 hooks disabling in case if OMPI was not configured with ummunotify support. This commit was SVN r29720.	2013-11-19 07:00:21 +00:00
Jeff Squyres	abeef55a55	Fix a few compiler warnings reported by clang: * Ensure "cnt" is always initialized * Ensure we don't buffer overflow on strncat() -- need to ensure we account for the terminating \0 character * hwloc_get_type_depth() returns an int (not unsigned), and HWLOC_TYPE_DEPTH_UNKNOWN if it's unknown (which is probably <0, but still, might as well check what the official hwloc docs say to check for) cmr=v1.7.4:reviewer=rhc:subject=fix hwloc base compiler warnings This commit was SVN r29686.	2013-11-13 15:54:01 +00:00
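The strncat() point in the entry above can be illustrated with a minimal sketch. The helper name `append_bounded` is hypothetical and is not taken from the Open MPI sources; it only shows the usual way to reserve one byte for the terminating \0 when computing how much strncat() may append.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not from Open MPI): append src to the NUL-terminated
 * string already in dst, whose total capacity is dst_size bytes.
 * Returns 0 if all of src fit, -1 if it had to be truncated. */
static int append_bounded(char *dst, size_t dst_size, const char *src)
{
    assert(dst != NULL && src != NULL && dst_size > 0);

    size_t used = strlen(dst);
    assert(used < dst_size);

    /* strncat() writes up to `room` characters PLUS a terminating '\0',
     * so the limit must leave one byte free for that terminator. */
    size_t room = dst_size - used - 1;
    strncat(dst, src, room);

    return (strlen(src) <= room) ? 0 : -1;
}
```

Passing `dst_size - used` instead of `dst_size - used - 1` is the classic off-by-one that this kind of warning points at.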
Jeff Squyres	ad51705891	Fix compiler warnings about signed/unsigned comparisons Change static opal_setlimit() function to return its value in an OUT parameter and return the usual int error code indicating success or failure. The OUT param and return code need to be separated because the OUT param is an unsigned type, but opal_setlimit() was returning -1 upon failure. Hence, the caller could not know that it had failed because the return type was previously an unsigned type. cmr=v1.7.4:reviewer=rhc:subject=Fix opal sys_limits.c signed/unsigned warnings This commit was SVN r29685.	2013-11-13 15:40:34 +00:00
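The pattern described in the entry above (an unsigned result returned through an OUT parameter, with a separate int status code) might look like the following sketch. `parse_limit` is a hypothetical function written for illustration; it is not the actual opal_setlimit() code.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical illustration (not opal_setlimit() itself): the parsed value
 * goes into *out, and the int return value carries success (0) or
 * failure (-1). An unsigned return type alone could not signal -1. */
static int parse_limit(const char *text, unsigned long *out)
{
    assert(text != NULL && out != NULL);

    /* Reject empty strings, signs, and leading whitespace up front;
     * strtoul() would otherwise silently accept "-1" and wrap it. */
    if (text[0] < '0' || text[0] > '9') {
        return -1;
    }

    char *end = NULL;
    errno = 0;
    unsigned long value = strtoul(text, &end, 10);
    if (errno != 0 || *end != '\0') {
        return -1;              /* overflow or trailing garbage */
    }

    *out = value;               /* only written on success */
    return 0;
}
```

With this split, a caller checks the return code first and only then reads the OUT value, so a failure can never be mistaken for a very large limit.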
Jeff Squyres	f1bff698a4	Fix compiler warning: event is unsigned; it can't be negative cmr=v1.7.4:reviewer=rhc This commit was SVN r29684.	2013-11-13 15:35:37 +00:00
Jeff Squyres	0749919127	Ensure is_tar is always initialized. cmr=v1.7.4:reviewer=rhc This commit was SVN r29683.	2013-11-13 15:34:33 +00:00
Jeff Squyres	750e6f6895	Fix compiler warning. cmr=v1.7.4:reviewer=hjelmn This commit was SVN r29682.	2013-11-13 15:33:55 +00:00
Mike Dubman	840e2cb4a2	mindist: cosmetic, use fallback to byslot if unable to read NUMA info, small fix. fixed by Elena, reviewed by Ralph/Mike cmr=v1.7.4:reviewer=ompi-gk1.7 This commit was SVN r29679.	2013-11-13 09:26:40 +00:00
Jeff Squyres	71c8b471d0	Add comment: strings in values[] can be free()'d after mca_base_var_enum_create() returns This commit was SVN r29655.	2013-11-11 22:20:58 +00:00
Ralph Castain	6ef7dc1f42	We previously weren't checking all the bits in locality to ensure we had a complete match - instead, we would report "local" to the specified level if only one bit matched. Ensure that a test for locality tests local to the specified level by checking that all bits match. cmr=v1.7.4:reviewer=hjelmn:subject=Ensure locality is properly tested This commit was SVN r29643.	2013-11-08 04:21:05 +00:00
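The difference between "one bit matched" and "all bits match" in the entry above can be sketched as follows. The flag names and values here are hypothetical stand-ins, not the real OPAL_PROC_* definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag values for illustration only. */
#define LOC_ON_CLUSTER 0x01u
#define LOC_ON_NODE    0x02u
#define LOC_ON_SOCKET  0x04u

/* The kind of test the entry describes as wrong: true if ANY requested
 * bit is set. */
static bool matches_any(uint16_t locality, uint16_t level)
{
    return (locality & level) != 0;
}

/* The stricter test: true only if EVERY requested bit is set. */
static bool matches_all(uint16_t locality, uint16_t level)
{
    assert(level != 0);
    return (locality & level) == level;
}
```

For a process that shares the cluster and node but not the socket, a query for `LOC_ON_NODE | LOC_ON_SOCKET` passes the first test and fails the second.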
Brian Barrett	6d7a1fbb82	Move opal_portable_platform.h to opal/include/opal, which is where it really should have been all along and fix one place that uses the file Update opal_portable_platform.h with changes to mpi_portable_platform.h made in r29608. Make mpi_portable_platform.h a symlink to opal_portable_platform.h, so that they won't get out of sync. I'd like to remove mpi_portable_platform.h, but we don't automatically add -I${includedir}/openmpi/ to make that sane from a header include point of view, so that's future work. This commit was SVN r29618. The following SVN revision numbers were found above: r29608 --> open-mpi/ompi@b71bd51cdd	2013-11-06 17:12:26 +00:00
Jeff Squyres	01118fcfb9	Add an Open MPI-specific comment here so that we hopefully don't lose this change the next time we update libevent. cmr=v1.7.4:ticket=3882 This commit was SVN r29597. The following Trac tickets were found above: Ticket 3882 --> https://svn.open-mpi.org/trac/ompi/ticket/3882	2013-11-05 03:40:00 +00:00
Brian Barrett	a45d5603a3	Fix installation of libevent header files when --with-devel-headers isn't specified cmr=v1.7.4:reviewer=rhc This commit was SVN r29588.	2013-11-04 16:54:00 +00:00
Nathan Hjelm	9cd18f926c	Add missing OSX builtin define This commit was SVN r29576.	2013-10-31 02:06:39 +00:00
Nathan Hjelm	b922cd1583	Add support for OSX builtin atomics. OSX atomic support is disabled by default. Enable with --enable-osx-builtin-atomics. Fixes trac:2120 This commit was SVN r29568. The following Trac tickets were found above: Ticket 2120 --> https://svn.open-mpi.org/trac/ompi/ticket/2120	2013-10-30 17:48:15 +00:00
Ralph Castain	25385590e6	Silence warning This commit was SVN r29528.	2013-10-26 19:41:35 +00:00
Ralph Castain	75c306994e	Add some debug This commit was SVN r29523.	2013-10-26 02:26:21 +00:00
Ralph Castain	8c5c7d0db4	Correct a bug in handling of oob_tcp_if_include/exclude addresses by using the kernel index instead of the raw index of the interface. Refs trac:3696 This commit was SVN r29522. The following Trac tickets were found above: Ticket 3696 --> https://svn.open-mpi.org/trac/ompi/ticket/3696	2013-10-26 00:47:14 +00:00
Nathan Hjelm	f7428fb6a9	Small fixes for the MCA variable interface. - Make a copy of enumerator data for default enumerators. This will allow the caller to free their data once the enumerator has been created. This is a change from just referencing the values array. - Make mca_base_pvar_notify check if the pvar is valid before calling the notify callback. This fixes a segmentation fault when destroying handles after MPI_Finalize(). cmr=v1.7.4:ticket=trac:3861 This commit was SVN r29512. The following Trac tickets were found above: Ticket 3861 --> https://svn.open-mpi.org/trac/ompi/ticket/3861	2013-10-24 19:27:06 +00:00
Jeff Squyres	f45144aed0	Add a little more to the docs for mca_base_var_enum_create(). This commit was SVN r29496.	2013-10-23 22:11:19 +00:00
Dave Goodell	25dd719d4d	opal: support __attribute__((__noinline__)) First cut does not attempt any "cross-check". As we discover compilers which complain about __noinline__, we will add specific cross checks to handle those cases. Reviewed-by: Jeff Squyres <jsquyres@cisco.com> This commit was SVN r29488.	2013-10-23 15:52:05 +00:00
Nathan Hjelm	d34a4300b8	Fix various bugs in mca_base_pvar. Fixes: - Segmentation fault when using watermark variables. - Segmentation fault when using a handle bound to a no longer valid performance variable. - Incorrect return codes from MPI_T_pvar_* functions. cmr=v1.7.4:reviewer=jsquyres This commit was SVN r29481.	2013-10-23 15:47:15 +00:00
Ralph Castain	772a376d73	Correct location of elog file Refs trac:3847 This commit was SVN r29438. The following Trac tickets were found above: Ticket 3847 --> https://svn.open-mpi.org/trac/ompi/ticket/3847	2013-10-14 19:21:45 +00:00
Ralph Castain	24c811805f	************************************************************** This change contains a non-mandatory modification of the MPI-RTE interface. Anyone wishing to support coprocessors such as the Xeon Phi may wish to add the required definition and underlying support ************************************************************** Add locality support for coprocessors such as the Intel Xeon Phi. Detecting that we are on a coprocessor inside of a host node isn't straightforward. There are no good "hooks" provided for programmatically detecting that "we are on a coprocessor running its own OS", and the ORTE daemon just thinks it is on another node. However, in order to properly use the Phi's public interface for MPI transport, it is necessary that the daemon detect that it is colocated with procs on the host. So we have to split the locality to separately record "on the same host" vs "on the same board". We already have the board-level locality flag, but not quite enough flexibility to handle this use-case. Thus, do the following: 1. add OPAL_PROC_ON_HOST flag to indicate we share a host, but not necessarily the same board 2. modify OPAL_PROC_ON_NODE to indicate we share both a host AND the same board. Note that we have to modify the OPAL_PROC_ON_LOCAL_NODE macro to explicitly check both conditions 3. add support in opal/mca/hwloc/base/hwloc_base_util.c for the host to check for coprocessors, and for daemons to check to see if they are on a coprocessor. The former is done via hwloc, but support for the latter is not yet provided by hwloc. So the code for detecting we are on a coprocessor currently is Xeon Phi specific - hopefully, we will find more generic methods in the future. 4. modify the orted and the hnp startup so they check for coprocessors and to see if they are on a coprocessor, and have the orteds pass that info back in their callback message. 
Automatically detect that coprocessors have been found and identify which coprocessors are on which hosts. Note that this algo isn't scalable at the moment - this will hopefully be improved over time. 5. modify the ompi proc locality detection function to look for coprocessor host info IF the OMPI_RTE_HOST_ID database key has been defined. RTE's that choose not to provide this support do not have to do anything - the associated code will simply be ignored. 6. include some cleanup of the hwloc open/close code so it conforms to how we did things in other frameworks (e.g., having a single "frame" file instead of open/close). Also, fix the locality flags - e.g., being on the same node means you must also be on the same cluster/cu, so ensure those flags are also set. cmr:v1.7.4:reviewer=hjelmn This commit was SVN r29435.	2013-10-14 16:52:58 +00:00
Nathan Hjelm	50b4b92758	hostname may not NULL-terminate the string if the buffer is too small. Thanks to Kevin M. Hildebrand for catching this. cmr=v1.7.3:reviewer=jsquyres This commit was SVN r29412.	2013-10-09 15:49:18 +00:00
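The hostname issue in the entry above is a general property of POSIX gethostname(): if the name is truncated, the result is not guaranteed to be NUL-terminated. A minimal defensive wrapper, assuming a POSIX system, could look like this; `hostname_terminated` is a hypothetical name and not the code from the commit.

```c
#define _POSIX_C_SOURCE 200112L

#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical wrapper (not the Open MPI code): fetch the hostname and
 * guarantee that buf holds a NUL-terminated string afterwards, even if
 * the name was truncated or the call failed.
 * Returns whatever gethostname() returned. */
static int hostname_terminated(char *buf, size_t size)
{
    assert(buf != NULL && size > 0);

    memset(buf, 0, size);            /* defined contents on failure */
    int rc = gethostname(buf, size);
    buf[size - 1] = '\0';            /* force termination unconditionally */
    return rc;
}
```

The unconditional write to the last byte costs nothing and removes the dependence on platform-specific truncation behavior.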
Ralph Castain	9902748108	*** THIS INCLUDES A SMALL CHANGE IN THE MPI-RTE INTERFACE *** Fix two problems that surfaced when using direct launch under SLURM: 1. locally store our own data because some BTLs want to retrieve it during add_procs rather than use what they have internally 2. cleanup MPI_Abort so it correctly passes the error status all the way down to the actual exit. When someone implemented the "abort_peers" API, they left out the error status. So we lost it at that point and always exited with a status of 1. This forces a change to the API to include the status. cmr:v1.7.3:reviewer=jsquyres:subject=Fix MPI_Abort and modex_recv for direct launch This commit was SVN r29405.	2013-10-08 18:37:59 +00:00
Ralph Castain	6951976bc4	Update struct member name - this is why we put such things in the trunk before moving them to a branch, especially when coming from outside :-) Refs trac:3830 This commit was SVN r29390. The following Trac tickets were found above: Ticket 3830 --> https://svn.open-mpi.org/trac/ompi/ticket/3830	2013-10-07 15:43:43 +00:00
Ralph Castain	13cd112fb4	Avoid use of interface in struct because cygwin compilers apparently object (go figure) This commit was SVN r29388.	2013-10-06 23:55:38 +00:00
Ralph Castain	2d2307b6eb	Modify libevent to support cygwin - patch will be pushed upstream This commit was SVN r29387.	2013-10-06 23:53:31 +00:00
Ralph Castain	2121e9c01b	Fix an issue regarding use of PMI when running processes and tools that don't need or want to use it. We build PMI support based on configuration settings and library availability. However, tools such as mpirun don't need it, and definitely shouldn't be using it. Ditto for procs launched by mpirun. We used to have a way of dealing with this - we had the PMI component check to see if the process was the HNP or was launched by an HNP. Sadly, moving the OPAL db framework removed that ability as OPAL has no notion of HNPs or proc type. So add a boolean flag to the db_base_select API that allows us to restrict selection to "local" components. This gives the PMI component the ability to reject itself as required. We then need to pass that param into the ess_base_std_app call so it can pass it all down. This commit was SVN r29341.	2013-10-02 19:03:46 +00:00
George Bosilca	43b4d76913	Fix a corner case for a non-contiguous send convertor where the convertor accepted being set to a position in the middle of a predefined datatype. Once set there, it was unable to provide the second part of the datatype. This fix forces the convertor to be aligned on predefined datatype boundaries for any non-contiguous send convertor. This commit was SVN r29285.	2013-09-28 16:46:21 +00:00
Ralph Castain	d565a76814	Do some cleanup of the way we handle modex data. Identify data that needs to be shared with peers in my job vs data that needs to be shared with non-peers - no point in sharing extra data. When we share data with some process(es) from another job, we cannot know in advance what info they have or lack, so we have to share everything just in case. This limits the optimization we can do for things like comm_spawn. Create a new required key in the OMPI layer for retrieving a "node id" from the database. ALL RTE'S MUST DEFINE THIS KEY. This allows us to compute locality in the MPI layer, which is necessary when we do things like intercomm_create. cmr:v1.7.4:reviewer=rhc:subject=Cleanup handling of modex data This commit was SVN r29274.	2013-09-27 00:37:49 +00:00
Ralph Castain	9aeba777fa	Ensure we don't enter into an infinite loop looking for the PML modex key if it isn't present. The PMI implementation will load ALL modex keys when the first key is queried, so the hash db component can safely return "not found" if a subsequent key isn't present. The PML modex_recv needs to assume everything is okay if the modex recv fails to return a value. cmr:v1.7.3:reviewer=jladd:subject=Prevent infinite loop when PML modex not found This commit was SVN r29243.	2013-09-25 16:04:00 +00:00
Ralph Castain	63da76ad5f	Silence warnings about pointer casting This commit was SVN r29226.	2013-09-22 19:21:29 +00:00
Nathan Hjelm	01839db11b	MCA/base: When encounter a duplicate file value don't free the filename. Stale code. cmr=v1.7.3:reviewer=rhc This commit was SVN r29224.	2013-09-21 18:53:36 +00:00
Nathan Hjelm	bc31773523	Fix bug in db/pmi when a stored byte object has a NULL pointer. cmr=v1.7.3:reviewer=samuel This commit was SVN r29215.	2013-09-20 15:38:36 +00:00
Ralph Castain	7bc20866fd	C standard stipulates that we have to cast the function to another of the same type to avoid unexpected behavior. We aren't using the function in this case, but Nick correctly points out that we should follow the standard regardless. Refs trac:3755 This commit was SVN r29210. The following Trac tickets were found above: Ticket 3755 --> https://svn.open-mpi.org/trac/ompi/ticket/3755	2013-09-19 18:42:21 +00:00
Ralph Castain	7de493fc02	Silence a warning about an address that can never be NULL - libevent needs to deal with the situation where the user may have compiled the code on a system where this function is present, but executes it on one where it isn't. Thus, a compile-time test isn't adequate. Pushed upstream. cmr:v1.7.3:reviewer=jsquyres This commit was SVN r29201.	2013-09-18 02:03:01 +00:00
George Bosilca	55273f1c98	Cleanup spaces, nothing else. This commit was SVN r29197.	2013-09-18 00:07:58 +00:00
Nathan Hjelm	7929fb9dea	Cleanup complex datatypes and update datatypes and operator code to use C99. This commit changes the underlying opal complex datatypes to match the C99 types: float _Complex, double _Complex, and long double _Complex. The fortran and C++ types now are aliases to these basic types instead of structure types. The operators in ompi/mca/op/base now work on only the C99 types and the fortran types use these operators if the fortran type matches a C complex type (this should almost always be the case.) C99 is now in use in both the datatype and operator code and should make the code both cleaner and much less fragile. This commit was SVN r29193.	2013-09-17 17:49:42 +00:00
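As a rough illustration of why native C99 complex types simplify operator code, a sum reduction over `double _Complex` can use the built-in `+=` directly instead of adding real and imaginary struct members by hand. The function below is a hypothetical sketch, not the code in ompi/mca/op/base.

```c
#include <assert.h>
#include <complex.h>
#include <stddef.h>

/* Hypothetical element-wise sum in the style of a reduction operator:
 * inout[i] = inout[i] + in[i]. With C99 complex types the compiler
 * handles the real and imaginary parts. */
static void sum_double_complex(const double _Complex *in,
                               double _Complex *inout,
                               size_t count)
{
    assert(in != NULL && inout != NULL);

    for (size_t i = 0; i < count; ++i) {
        inout[i] += in[i];
    }
}
```

The same loop body works unchanged for `float _Complex` and `long double _Complex`, which is the kind of uniformity the entry refers to.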
Ralph Castain	2245ac0e7e	Don't error log the return from setup_pmi as it can indicate that the process wasn't launched via srun or its equivalent. cmr:v1.7.3:reviewer=jladd This commit was SVN r29180.	2013-09-17 02:26:46 +00:00
Ralph Castain	e01953b440	Per Brice, silence warning on old Linux kernels Refs trac:3744 This commit was SVN r29179. The following Trac tickets were found above: Ticket 3744 --> https://svn.open-mpi.org/trac/ompi/ticket/3744	2013-09-16 15:43:33 +00:00
Ralph Castain	845e92bc5d	Remove the old version of hwloc. Update the new one to reflect the official release dates. Refs trac:3744 This commit was SVN r29154. The following Trac tickets were found above: Ticket 3744 --> https://svn.open-mpi.org/trac/ompi/ticket/3744	2013-09-10 16:30:13 +00:00
Joshua Ladd	b3f88c4a1d	Per the RFC schedule, this commit adds Mellanox OpenSHMEM to the trunk. It does not yet run on OSX or with CM PML for an MTL other than MXM. Mellanox is aware of these issues and is in the process of resolving them. This should be added to \ncmr=v1.7.4:subject=Move OSHMEM to 1.7.4:reviewer=rhc This commit was SVN r29153.	2013-09-10 15:34:09 +00:00
Ralph Castain	46ed907003	Correctly handle list of cores specified in the rankfile - i.e., a rankfile entry such as: rank 0=foo slot=0:0-1;1:0,1 cmr:v1.7.3:reviewer=jsquyres This commit was SVN r29152.	2013-09-08 02:04:29 +00:00
Alex Margolin	50a3c01a0f	fixed build without thread support This commit was SVN r29145.	2013-09-06 19:03:19 +00:00
Ralph Castain	0d7fb932f1	Remove build product file Refs trac:3744 This commit was SVN r29120. The following Trac tickets were found above: Ticket 3744 --> https://svn.open-mpi.org/trac/ompi/ticket/3744	2013-09-04 16:38:22 +00:00
Ralph Castain	6011a4d29c	As per the telecon, update hwloc to v1.7.2 so we can add MIC support. Ignore hwloc1.5.2 component for now until this tests out - will remove it then. cmr:v1.7.4:reviewer=jsquyres This commit was SVN r29107.	2013-09-03 16:23:42 +00:00
Ralph Castain	7a7cfdd519	A little cleanup - the base function to sort numa lists must return something or you get a warning about non-void function returning without value, so cleanup the return values. Ensure the mindist module actually checks for a return of "error" so it won't segfault, and have it emit a polite message when that happens. cmr:v1.7.3:reviewer=jladd This commit was SVN r29089.	2013-08-29 20:01:06 +00:00
Ralph Castain	3516348aad	We don't need to report errors in pmi_setup as it is possible that PMI is available, but that we weren't launched under it (e.g., we launched via mpirun). cmr:v1.7.3:reviewer=hjelmn:subject="Silence unnecessary PMI error msgs" This commit was SVN r29086.	2013-08-29 16:35:20 +00:00
Joshua Ladd	1802aabf1a	Add support for autodetecting a MLNX HCA in the rmaps min distance feature. In this way, .ini files distributed with software stacks need not specify a particular HCA but instead may select the key word auto which will automatically select the discovered device. To use this feature, simply pass the keyword auto instead of a specific device name, --mca rmaps_base_dist_hca auto. If more than one card is installed, the mapper will inform the user of this and, at this point, the user will then need to specify which card via the normal route, e.g. --mca rmaps_base_dist_hca <dev_name>. This should be added to \ncmr=v1.7.4:reviewer=rhc:subject=Autodetect logic for min dist mapping This commit was SVN r29079.	2013-08-28 16:23:33 +00:00
Nathan Hjelm	77a41e1ca9	ompi_info: mark the variables from disabled components as disabled in the output of ompi_info. A variable is disabled if its component will never be selected due to a component selection parameter (eg. -mca btl self). The old behavior of ompi_info was to not print these parameters at all. Now we print the parameters. After some discussion with George it was decided that there needed to be some way to see what parameters will not be used. This was the compromise. This commit also fixes a bug and a typo in the pvar system. The enum_count value in mca_base_pvar_dump was being used without being set. The full_name in mca_base_pvar_t was not being used. cmr=v1.7.3:ticket=trac:3734 This commit was SVN r29078. The following Trac tickets were found above: Ticket 3734 --> https://svn.open-mpi.org/trac/ompi/ticket/3734	2013-08-28 16:03:23 +00:00
Nathan Hjelm	3744c5e0be	Also check for /dev/mic/scif when deciding whether to enable the Linux memory hooks. The MIC has a /dev/scif device and the host has /dev/mic/scif. I do not know if this device exists when no MIC is connected. cmr=v1.7.4:ticket=trac:3733:reviewer=jsquyres This commit was SVN r29071. The following Trac tickets were found above: Ticket 3733 --> https://svn.open-mpi.org/trac/ompi/ticket/3733	2013-08-27 19:40:02 +00:00
Nathan Hjelm	c699ee7812	Update the ompi_info man page with information about variable levels and improve the behavior of ompi_info. This commit changes the default behavior of ompi_info --all when a level is not specified. Instead of assuming level 1 in this case we now assume level 9. This change is due to feedback from the community after the introduction of the --level option. I also added a new option: --selected-only. This option will limit the displayed variables to components that can be selected (ie. if there is a selection parameter set-- btl self,sm) cmr=v1.7.3:reviewer=jsquyres This commit was SVN r29070.	2013-08-27 19:11:37 +00:00
Nathan Hjelm	6e1656279e	Enable the use of the Linux memory hooks on Intel MIC. cmr=v1.7.3:reviewer=jsquyres This commit was SVN r29069.	2013-08-27 18:25:18 +00:00
Nathan Hjelm	2da64eb719	Fix compilation of the MPI tools information interface when profiling is enabled and fix a bug in the handling of watermark performance variables. cmr=v1.7.3:ticket=trac:3725:reviewer=jsquyres This commit was SVN r29068. The following Trac tickets were found above: Ticket 3725 --> https://svn.open-mpi.org/trac/ompi/ticket/3725	2013-08-27 18:19:18 +00:00
Ralph Castain	a200e4f865	As per the RFC, bring in the ORTE async progress code and the rewrite of OOB: * THIS RFC INCLUDES A MINOR CHANGE TO THE MPI-RTE INTERFACE * Note: during the course of this work, it was necessary to completely separate the MPI and RTE progress engines. There were multiple places in the MPI layer where ORTE_WAIT_FOR_COMPLETION was being used. A new OMPI_WAIT_FOR_COMPLETION macro was created (defined in ompi/mca/rte/rte.h) that simply cycles across opal_progress until the provided flag becomes false. Places where the MPI layer blocked waiting for RTE to complete an event have been modified to use this macro. *************************************************************************************** I am reissuing this RFC because of the time that has passed since its original release. Since its initial release and review, I have debugged it further to ensure it fully supports tests like loop_spawn. It therefore seems ready for merge back to the trunk. Given its prior review, I have set the timeout for one week. The code is in https://bitbucket.org/rhc/ompi-oob2 WHAT: Rewrite of ORTE OOB WHY: Support asynchronous progress and a host of other features WHEN: Wed, August 21 SYNOPSIS: The current OOB has served us well, but a number of limitations have been identified over the years. Specifically: * it is only progressed when called via opal_progress, which can lead to hangs or recursive calls into libevent (which is not supported by that code) * we've had issues when multiple NICs are available as the code doesn't "shift" messages between transports - thus, all nodes had to be available via the same TCP interface. 
* the OOB "unloads" incoming opal_buffer_t objects during the transmission, thus preventing use of OBJ_RETAIN in the code when repeatedly sending the same message to multiple recipients * there is no failover mechanism across NICs - if the selected NIC (or its attached switch) fails, we are forced to abort * only one transport (i.e., component) can be "active" The revised OOB resolves these problems: * async progress is used for all application processes, with the progress thread blocking in the event library * each available TCP NIC is supported by its own TCP module. The ability to asynchronously progress each module independently is provided, but not enabled by default (a runtime MCA parameter turns it "on") * multi-address TCP NICs (e.g., a NIC with both an IPv4 and IPv6 address, or with virtual interfaces) are supported - reachability is determined by comparing the contact info for a peer against all addresses within the range covered by the address/mask pairs for the NIC. * a message that arrives on one TCP NIC is automatically shifted to whatever NIC that is connected to the next "hop" if that peer cannot be reached by the incoming NIC. If no TCP module will reach the peer, then the OOB attempts to send the message via all other available components - if none can reach the peer, then an "error" is reported back to the RML, which then calls the errmgr for instructions. * opal_buffer_t now conforms to standard object rules re OBJ_RETAIN as we no longer "unload" the incoming object * NIC failure is reported to the TCP component, which then tries to resend the message across any other available TCP NIC. If that doesn't work, then the message is given back to the OOB base to try using other components. 
If all that fails, then the error is reported to the RML, which reports to the errmgr for instructions * obviously from the above, multiple OOB components (e.g., TCP and UD) can be active in parallel * the matching code has been moved to the RML (and out of the OOB/TCP component) so it is independent of transport * routing is done by the individual OOB modules (as opposed to the RML). Thus, both routed and non-routed transports can simultaneously be active * all blocking send/recv APIs have been removed. Everything operates asynchronously. KNOWN LIMITATIONS: * although provision is made for component failover as described above, the code for doing so has not been fully implemented yet. At the moment, if all connections for a given peer fail, the errmgr is notified of a "lost connection", which by default results in termination of the job if it was a lifeline * the IPv6 code is present and compiles, but is not complete. Since the current IPv6 support in the OOB doesn't work anyway, I don't consider this a blocker * routing is performed at the individual module level, yet the active routed component is selected on a global basis. We probably should update that to reflect that different transports may need/choose to route in different ways * obviously, not every error path has been tested nor necessarily covered * determining abnormal termination is more challenging than in the old code as we now potentially have multiple ways of connecting to a process. Ideally, we would declare "connection failed" when all transports can no longer reach the process, but that requires some additional (possibly complex) code. For now, the code replicates the old behavior only somewhat modified - i.e., if a module sees its connection fail, it checks to see if it is a lifeline. If so, it notifies the errmgr that the lifeline is lost - otherwise, it notifies the errmgr that a non-lifeline connection was lost. 
* reachability is determined solely on the basis of a shared subnet address/mask - more sophisticated algorithms (e.g., the one used in the tcp btl) are required to handle routing via gateways * the RML needs to assign sequence numbers to each message on a per-peer basis. The receiving RML will then deliver messages in order, thus preventing out-of-order messaging in the case where messages travel across different transports or a message needs to be redirected/resent due to failure of a NIC This commit was SVN r29058.	2013-08-22 16:37:40 +00:00
Ralph Castain	611d7f9f6b	When we direct launch an application, we rely on PMI for wireup support. In doing so, we lose the de facto data compression we get from the ORTE modex since we no longer get all the wireup info from every proc in a single blob. Instead, we have to iterate over all the procs, calling PMI_KVS_get for every value we require. This creates a really bad scaling behavior. Users have found a nearly 20% launch time differential between mpirun and PMI, with PMI being the slower method. Some of the problem is attributable to poor exchange algorithms in RM's like Slurm and Alps, but we make things worse by calling "get" so many times. Nathan (with a tad advice from me) has attempted to alleviate this problem by reducing the number of "get" calls. This required the following changes: * upon first request for data, have the OPAL db pmi component fetch and decode all the info from a given remote proc. It turned out we weren't caching the info, so we would continually request it and only decode the piece we needed for the immediate request. We now decode all the info and push it into the db hash component for local storage - and then all subsequent retrievals are fulfilled locally * reduced the amount of data by eliminating the exchange of the OMPI_ARCH value if heterogeneity is not enabled. This was used solely as a check so we would error out if the system wasn't actually homogeneous, which was fine when we thought there was no cost in doing the check. Unfortunately, at large scale and with direct launch, there is a non-zero cost of making this test. We are open to finding a compromise (perhaps turning the test off if requested?), if people feel strongly about performing the test * reduced the amount of RTE data being automatically fetched, and fetched the rest only upon request. In particular, we no longer immediately fetch the hostname (which is only used for error reporting), but instead get it when needed. 
Likewise for the RML uri as that info is only required for some (not all) environments. In addition, we no longer fetch the locality unless required, relying instead on the PMI clique info to tell us who is on our local node (if additional info is required, the fetch is performed when a modex_recv is issued). Again, all this only impacts direct launch - all the info is provided when launched via mpirun as there is no added cost to getting it Barring objections, we may move this (plus any required other pieces) to the 1.7 branch once it soaks for an appropriate time. This commit was SVN r29040.	2013-08-17 00:49:18 +00:00
Ralph Castain	11a3743b21	Cleanup uninitialized var warnings This commit was SVN r29038.	2013-08-16 21:49:17 +00:00
Ralph Castain	7947cec8fa	Cleanup warning This commit was SVN r29031.	2013-08-16 21:13:40 +00:00
Ralph Castain	8a4c5f4957	Attempt to plug a few memory leaks by ensuring we finalize all things opened during init. However, we are still leaking memory like a sieve in param registration and hwloc. This commit was SVN r29026.	2013-08-14 02:03:00 +00:00
Ralph Castain	2c286bccca	Fix typo - thanks to Michael Schlottke for pointing it out cmr:v1.7.3:reviewer=brbarret This commit was SVN r29015.	2013-08-11 18:16:21 +00:00
Nathan Hjelm	524e9b148b	MCA/base: add a function to unload a component without closing it for components that have been registered but not opened This commit was SVN r29012.	2013-08-09 20:16:08 +00:00
Nathan Hjelm	841ed962f6	fix MCA variable and component system leaks cmr=v1.7.3:reviewer=rhc This commit was SVN r29011.	2013-08-09 19:50:28 +00:00
George Bosilca	30b910b54d	More info in the debug mode. This commit was SVN r29002.	2013-08-06 09:08:43 +00:00
Nathan Hjelm	be1bd4661c	db/pmi: speed up modex by caching pmi data internally This commit was SVN r29001.	2013-08-05 22:31:50 +00:00
Nathan Hjelm	88cadc552d	Make opal/db/pmi use as few PMI keys as possible. This commit reintroduces key compression into the pmi db. This feature compresses the keys stored into the component into a small number of PMI keys by serializing the data and base64 encoding the result. This will avoid issues with Cray PMI which restricts us to ~ 3 PMI keys per rank. This commit was SVN r28993.	2013-08-03 01:06:59 +00:00
Ralph Castain	3c8aa7c296	Don't just hardcode the max length of the PMI name as it could be wrong. PMI2 installations seem to be retaining at least some of the PMI functions, so use the one to get the max name length. This commit was SVN r28962.	2013-07-30 14:13:15 +00:00
Nathan Hjelm	99adeb7f6e	Fix support for complex datatypes when fortran is not available but _Complex is This commit was SVN r28951.	2013-07-25 19:08:21 +00:00
Nathan Hjelm	ebbb32120a	MCA/base: variable system updates - Use an enumerator to handle bool values. - Fix a leak in the variable enumerator. - Fix a leak in an orte parameter. This commit was SVN r28949.	2013-07-25 15:42:01 +00:00
Ralph Castain	41f97931e9	Need to include module-level CPPFLAGS so it can build This commit was SVN r28947.	2013-07-24 23:07:43 +00:00
Nathan Hjelm	c4c69b4ddf	MPI-3: add support for large counts using derived datatypes Add support for MPI_Count type and MPI_COUNT datatype and add the required MPI-3 functions MPI_Get_elements_x, MPI_Status_set_elements_x, MPI_Type_get_extent_x, MPI_Type_get_true_extent_x, and MPI_Type_size_x. This commit adds only the C bindings. Fortran bindings will be added in another commit. For now the MPI_Count type is defined to have the same size as MPI_Offset. The type is required to be at least as large as MPI_Offset and MPI_Aint. The type was initially intended to be a ssize_t (if it was the same size as a long long) but there were issues compiling romio with that definition (despite the inclusion of stddef.h). I updated the datatype engine to use size_t instead of uint32_t to support large datatypes. This will require some review to make sure that 1) the changes are beneficial, 2) nothing was broken by the change (I doubt anything was), and 3) there are no performance regressions due to this change. Increase the maximum number of predefined datatypes to support MPI_Count. Put common get_elements code in ompi/datatype/ompi_datatype_get_elements.c. Update MPI_Get_count to reflect changes in MPI-3 (return MPI_UNDEFINED when the count is too large for an int). This commit was SVN r28932.	2013-07-23 15:35:14 +00:00
Ralph Castain	6c1a140e99	Per request from Nathan, add a "commit" API to the opal db framework. This allows him to aggregate keys to work around the Cray's severe PMI limitations This commit was SVN r28917.	2013-07-22 22:57:16 +00:00
Jeff Squyres	49b5342130	After talking with Nathan, update some comments/documentation about the new MCA var and pvar systems. This commit was SVN r28913.	2013-07-22 20:34:42 +00:00
Nathan Hjelm	61d331d5b5	MCA/base: fix some warnings and an error in the MCA variable system This commit was SVN r28909.	2013-07-22 17:52:39 +00:00
Brian Barrett	0d8b57211a	add missing include This commit was SVN r28900.	2013-07-21 20:18:17 +00:00
Nathan Hjelm	1e8ba2b8cf	fix condition in common/pmi init that caused pmi to fail if PMI2_Init succeeds This commit was SVN r28856.	2013-07-19 02:43:42 +00:00
Ralph Castain	4eb0dfa039	This has apparently been wrong for some time! Fix the common/pmi libraries so we build them dynamic so they can be properly linked into the components that use them. Define required library version numbers and do some other cuteness to make it all work. cmr:v1.7.3:reviewer=jsquyres This commit was SVN r28842.	2013-07-18 18:42:42 +00:00
Ralph Castain	92c6b806b9	Based on a patch submitted by Piotr Lesnicki of Bull, cleanup the PMI2 support. This has not been tested yet on multiple environments (e.g., Cray), so it needs more evaluation prior to moving to the 1.7 branch. cmr:v1.7.3:reviewer=rhc This commit was SVN r28837.	2013-07-18 14:46:07 +00:00
Nathan Hjelm	b88509af36	don't close components that failed to register. cmr:v1.7:reviewer=rhc This commit was SVN r28823.	2013-07-17 19:49:05 +00:00
Nathan Hjelm	456de007a8	ignore unavailable components when registering This commit was SVN r28802.	2013-07-16 16:02:33 +00:00
Nathan Hjelm	d446675526	MCA: Per-RFC, add support for performance variables This commit adds an API for registering and querying performance variables (mca_base_pvar) in the MCA base. The existing MCA variable system API has been updated to reflect the new API: MCA variable groups have performance variables, and new types have been added (double, unsigned long long) to reflect what is required by the MPI_T interface. Additionally, the MCA variable group code has been split into its own set of files: mca_base_var_group.[ch]. Details of the new API can be found in doxygen comments in the header: mca_base_pvar.h. Other changes to the variable system: - Use an opal_hash_table to speed up variable/group lookup. - Clean up code associated with MCA variable types. - Registered performance variables are printed by ompi_info -a. In the future an option should be added to control this behavior. Changes to OMPI: - Added full support for the MPI_T performance variable interface. This commit was SVN r28800.	2013-07-16 16:02:13 +00:00
Ralph Castain	10ca1c1b04	Turns out that there was exactly ONE place in all of the OMPI code base that still referred to OPAL_TRACE, though a few places retained the include file for no reason. So no point in letting this sit as it is clearly an unused "feature". This commit was SVN r28789.	2013-07-14 18:57:20 +00:00
Jeff Squyres	14424daf4c	Remove auto-generated file This commit was SVN r28784.	2013-07-13 20:55:09 +00:00
Nathan Hjelm	8f9b7926ec	mca/base: fix component selection negation. cmr:v1.7:reviewer=jsquyres This commit was SVN r28770.	2013-07-12 17:55:20 +00:00
Ralph Castain	b001d31c27	Per RFC, remove libevent 2.0.19 and leave 2.0.21 as the default This commit was SVN r28767.	2013-07-12 16:37:15 +00:00
Jeff Squyres	9252afdcd9	Updates and tweaks to the documentation of the new MCA parameter system (written in conjunction with Nathan). This commit was SVN r28758.	2013-07-11 20:04:51 +00:00
Jeff Squyres	bdb45a2e4f	Add an oh-so-slightly faster variant of the hotel "checkin" action (since this is used in the fast path) for when you ''know'' that there will be a room available: * Don't do the last_unoccupied_room check * Return void This commit was SVN r28757.	2013-07-11 20:00:37 +00:00
Nathan Hjelm	a694bcb6b6	Add support for the MCA variable information level to ompi_info. Add an option to ompi_info (-l, --level) that takes a number in the interval [1,9]. Only MCA variables up to this level will be printed. The default level is 1. Print the level as part of both the parsable and readable output. This commit was SVN r28750.	2013-07-10 18:52:36 +00:00
Ralph Castain	028f5ee7a6	Cleanup some bitrot from moving the db framework to opal and from the new mca param system This commit was SVN r28741.	2013-07-09 14:37:08 +00:00
Ralph Castain	315da8125d	Remove stale headers cmr:v1.7.3:reviewer=jsquyres This commit was SVN r28732.	2013-07-08 18:26:58 +00:00
Ralph Castain	2ccc0438af	On some systems, pthread_kill is actually in the "signals.h" header, so include it This commit was SVN r28731.	2013-07-08 17:40:38 +00:00
Ralph Castain	eac174e624	For purposes of testing the RFC, make libevent2021 the default for now so it gets tested by MTT This commit was SVN r28730.	2013-07-05 23:14:22 +00:00
Brian Barrett	ea9cee73c1	Per RFC, remove darwin backtrace, since OS X since 10.5 has supported the execinfo() interface (which has been the default for OMPI to use on Darwin) This commit was SVN r28727.	2013-07-05 19:06:27 +00:00
Ralph Castain	21c8041a40	Update libevent 2021 component so it also only warns once when detecting reentrant behavior This commit was SVN r28721.	2013-07-04 04:41:04 +00:00
Ralph Castain	bd65937bf3	If we enable ipv6, we resolve a host's addresses and check them all against our local interfaces to determine if the given host is us. However, if we don't enable ipv6, we only checked the first address returned. This can cause us to incorrectly identify a hostname as "not us". Make --disable-ipv6 behave the same as --enable-ipv6 by checking all the returned addresses. This commit was SVN r28716.	2013-07-03 21:41:36 +00:00
Ralph Castain	45fad1ddcc	We really should be closing the event framework when told to do so. cmr:v1.7.3,reviewer=jsquyres This commit was SVN r28714.	2013-07-03 16:57:14 +00:00
Ralph Castain	9166a8cc95	Per telecon today, add a flag so we only warn once about reentrant libevent loops - this will allow developers to better diagnose the problem as we won't swamp filesystems with warning messages. This commit was SVN r28712.	2013-07-03 04:51:36 +00:00
Jeff Squyres	ad16bcd6d1	Followup from Justin Bronder: Looks like I spoke too soon. The sandbox team has informed me that they are getting rid of SANDBOX_PID in the future and that using SANDBOX_ON would be preferred. This commit was SVN r28708.	2013-07-03 01:38:26 +00:00
Jeff Squyres	fea15ec34e	Add memory hooks override for Gentoo sandbox v2.5, too. Thanks to Justin Bronder for the patch. This commit was SVN r28702.	2013-07-02 12:34:51 +00:00
George Bosilca	a5bda43cfc	Small typo. This commit was SVN r28689.	2013-07-01 16:48:45 +00:00
Ralph Castain	446e33a5d8	There are cases where we want to use the novm state machine, but the backend node topology differs from that where mpirun is executing. In those cases, we can wind up thinking we are oversubscribed because the head node has fewer cores than the compute nodes. To resolve this situation, add the ability to specify a backend topology file that mpirun shall use for its mapping operations. Create a new "set_topology" function in opal hwloc to support it. This commit was SVN r28682.	2013-06-27 03:04:50 +00:00
Jeff Squyres	dd25421d48	Convert strcpy() to strncpy(), and just to be extra-super paranoid, use memset(0) for extra bonus points. This commit was SVN r28668.	2013-06-22 12:21:18 +00:00
Joshua Ladd	0b5c1f2ea8	Add 'generic' support for PMI2 (previously, we checked for PMI2 only on Cray systems). If your resource manager (e.g. SLURM) has support for PMI2, then the --with-pmi configure flag will enable its usage. If you don't have PMI2, then you will fall back to regular old PMI1. This patch was submitted by Ralph Castain and reviewed and pushed by Josh Ladd. This should be added to cmr:v1.7:reviewer=jladd This commit was SVN r28666.	2013-06-21 15:28:14 +00:00
George Bosilca	f5a55ccb39	Various cleanups. This commit was SVN r28647.	2013-06-15 16:23:11 +00:00
George Bosilca	a6c3477e89	Remove useless include. This commit was SVN r28646.	2013-06-15 16:07:30 +00:00
Nathan Hjelm	8924140916	Per RFC: use a better hash algorithm for the opal_hash_table_*_ptr functions. Chose the crc32 function present in opal/util/crc.c as the hash function. The performance should be sufficient for most cases. If not we can always change the function again. This commit was SVN r28629.	2013-06-13 17:11:04 +00:00
Nathan Hjelm	518d1fe200	Fix two typos that prevented alps direct launch from working This commit was SVN r28628.	2013-06-13 17:04:08 +00:00
Joshua Ladd	46362d2761	Stomps compiler warnings in HCA min-dist calculation. This should be added to cmr:v1.7:reviewer=jladd This commit was SVN r28620.	2013-06-12 16:25:25 +00:00
Tom Naughton	d86c3ce669	+ remove autogenerated 'install-sh' This commit was SVN r28602.	2013-06-07 20:40:24 +00:00
Rolf vandeVaart	62ab008017	Fix SEGV because missing CUDA initialization. This commit was SVN r28601.	2013-06-07 18:31:36 +00:00
Rolf vandeVaart	1230029aa1	The debug messages were swapped. Fixed. This commit was SVN r28600.	2013-06-07 17:23:41 +00:00
George Bosilca	72877f078f	Per MPI 3.0, a count equal to zero has a clear meaning: no modifications of the original datatype are allowed (neither in the type map nor in the extent). Make this clear in the code. Allow 0-count cases in the contiguous memory check. This commit was SVN r28568.	2013-05-29 16:02:54 +00:00
Jeff Squyres	6d173af329	This commit introduces a new "mindist" ORTE RMAPS mapper, as well as some relevant updates/new functionality in the opal/mca/hwloc and orte/mca/rmaps bases. This work was mainly developed by Mellanox, with a bunch of advice from Ralph Castain, and some minor advice from Brice Goglin and Jeff Squyres. Even though this is mainly Mellanox's work, Jeff is committing only for logistical reasons (he holds the hg+svn combo tree, and can therefore commit it directly back to SVN). ----- Implemented distance-based mapping algorithm as a new "mindist" component in the rmaps framework. It allows mapping processes by NUMA due to PCI locality information as reported by the BIOS - from the closest to device to furthest. To use this algorithm, specify: {{{mpirun --map-by dist:<device_name>}}} where <device_name> can be mlx5_0, ib0, etc. There are two modes provided: 1. bynode: load-balancing across nodes 1. byslot: go through slots sequentially (i.e., the first nodes are more loaded) These options are regulated by the optional ''span'' modifier; the command line parameter looks like: {{{mpirun --map-by dist:<device_name>,span}}} So, for example, if there are 2 nodes, each with 8 cores, and we'd like to run 10 processes, the mindist algorithm will place 8 processes to the first node and 2 to the second by default. But if you want to place 5 processes to each node, you can add a span modifier in your command line to do that. If there are two NUMA nodes on the node, each with 4 cores, and we run 6 processes, the mindist algorithm will try to find the NUMA closest to the specified device, and if successful, it will place 4 processes on that NUMA but leaving the remaining two to the next NUMA node. You can also specify the number of cpus per MPI process. This option is handled so that we map as many processes to the closest NUMA as we can (number of available processors at the NUMA divided by number of cpus per rank) and then go on with the next closest NUMA. 
The default binding option for this mapping is bind-to-numa, which applies if you don't specify any binding policy. If you specify a binding level "lower" than NUMA (i.e., hwthread, core, socket), it binds to whatever level you specify. This commit was SVN r28552.	2013-05-22 13:04:40 +00:00
Jeff Squyres	55382c1bf8	Bring over upstream hwloc trunk commit https://svn.open-mpi.org/trac/hwloc/changeset/5592 to fix the merging of groups when they are I/O objects. This commit was SVN r28551.	2013-05-22 12:34:59 +00:00
Nathan Hjelm	721779d7ab	Per RFC: remove old MCA parameter system. This commit was SVN r28541.	2013-05-20 15:36:13 +00:00
Jeff Squyres	089c632cce	Remove a bunch of dead code: gcc 4.7 warns of set-but-unused variables. So get rid of them. This commit was SVN r28538.	2013-05-17 21:45:49 +00:00
Ralph Castain	1ec13d530c	Allow simple way to request comparison to full address regardless of addr family This commit was SVN r28519.	2013-05-14 22:08:39 +00:00
Ralph Castain	eb2edb4b2b	Silence warning This commit was SVN r28516.	2013-05-14 22:00:01 +00:00
Rolf vandeVaart	8a8ea9ba1b	Fix compile error in optimize build for CUDA-aware code. This commit was SVN r28512.	2013-05-14 21:07:27 +00:00
Ralph Castain	37088f23d8	When ipv6 disabled, we still have getaddrinfo, so use it when checking common networks for resolving to kindex This commit was SVN r28496.	2013-05-14 15:54:46 +00:00
Ralph Castain	3fc1bafd82	fix typo This commit was SVN r28490.	2013-05-14 12:36:45 +00:00
Ralph Castain	f4f07bdb21	Ensure the opal_ifaddrtokindex function considers the full range of address space by using the netmask This commit was SVN r28487.	2013-05-14 03:37:44 +00:00
Ralph Castain	c33219a51b	Extend the bitmap API a bit to provide a test if all bits zero This commit was SVN r28486.	2013-05-14 03:34:57 +00:00
Jeff Squyres	4d9da92e60	Fixes trac:376: by default the wrapper compilers will enable rpath support in generated executables on systems that support it. Use --disable-wrapper-rpath to disable this behavior. See text in README about --disable-wrapper-rpath for more details. This commit was SVN r28479. The following Trac tickets were found above: Ticket 376 --> https://svn.open-mpi.org/trac/ompi/ticket/376	2013-05-11 00:49:17 +00:00
Jeff Squyres	cad1d920b2	Check to ensure that we have struct ifreq.ifr_mtu before we try to use it, because although Solaris has SIOCGIFMTU, it curiously does not have ifreq.ifr_mtu. This commit was SVN r28460.	2013-05-07 13:51:50 +00:00
Jeff Squyres	4b9b3a81ff	Update the list of post-1.5.2 r numbers from hwloc that we have committed here. This commit was SVN r28458.	2013-05-07 01:22:06 +00:00
Jeff Squyres	ee0cdf86fd	Fix issue raised by Stefan Friedel: remove an extraneous -L that is added by hwloc's embedding so that it doesn't appear in libhwloc_embedded.la (and therefore propagate all the way up to libmpi.la). Committed upstream in hwloc SVN r5588. This commit was SVN r28457. The following SVN revisions from the original message are invalid or inconsistent and therefore were not cross-referenced: r5588	2013-05-07 01:21:18 +00:00
Ralph Castain	527ea1d090	Per the RFC, always enable libevent thread support. This commit was SVN r28443.	2013-05-03 15:39:05 +00:00
George Bosilca	1169ebdff8	Indentation. This commit was SVN r28426.	2013-04-30 23:26:23 +00:00
Ralph Castain	4c0dcb1aa2	Update ignores and remove build product This commit was SVN r28412.	2013-04-29 19:02:03 +00:00
Ralph Castain	5d7a93c032	Add the ability to use an external version of libevent. Clearly not recommended at this time. I've verified that it works in limited scenarios, but more thorough testing and performance impacts need to be assessed. Interesting how many includes had to be fixed here and there to fill in missing dependencies :-) This commit was SVN r28411.	2013-04-29 17:02:37 +00:00
Ralph Castain	3052acd968	Fix minor typo This commit was SVN r28410.	2013-04-29 17:02:11 +00:00
Ralph Castain	3818e88365	Remove and ignore build products This commit was SVN r28404.	2013-04-27 00:07:18 +00:00
Jeff Squyres	5bf9fffacd	As initially reported by Eric Chamberland in http://www.open-mpi.org/community/lists/users/2013/04/21689.php, the assert in opal_datatype_is_contiguous_memory_layout() is not always correct -- he supplied a test case where it was not valid, essentially: 1. Call MPI_Type_create_indexed_block(0, ..., &newtype) and commit newtype. 2. Call MPI_Type_create_resized(newtype, 0, nonzero_value, &resized) and commit resized. 3. Call MPI_File_set_view with resized. This will eventually call opal_datatype_is_contiguous_memory_layout(), and the assert will fail. After some consultation with George, it was determined that the assert() is basically good, but it needs to also check for (count != 0). This commit was SVN r28398.	2013-04-25 20:54:25 +00:00
Ralph Castain	b73f25e839	Add a function to return the kernel index of the corresponding interface from an IPv4/6 string or hostname This commit was SVN r28397.	2013-04-25 19:40:34 +00:00
Ralph Castain	c081a520a3	Fix --without-hwloc This commit was SVN r28396.	2013-04-25 19:13:56 +00:00
Ralph Castain	cef639f578	Ahem....cleanup a copy/paste error in naming of these functions This commit was SVN r28395.	2013-04-25 15:21:53 +00:00
Nathan Hjelm	c50b99005d	fix typo in opal_info_show_component_version and clean up more from ompi_info This commit was SVN r28389.	2013-04-24 22:07:06 +00:00
Nathan Hjelm	4896b3bc4b	clean up some ompi_info code This commit was SVN r28388.	2013-04-24 21:37:24 +00:00
Ralph Castain	4fae24f2f1	Crud - missed this file, needs to go with prior commit, will add to cmr This commit was SVN r28382.	2013-04-24 17:47:18 +00:00
Nathan Hjelm	bccf8c657a	Per RFC add initial support for the MPI 3.0 tools interface. Current MPI_T support: - Full cvar interface. - Full categories interface. - No pvar support at this time. This commit was SVN r28376.	2013-04-24 15:59:23 +00:00
Ralph Castain	d721437c8d	Somebody (accidentally) removed the instructions for updating libevent releases in OMPI, so replace them with at least an outline on how to do it. This commit was SVN r28349.	2013-04-22 17:05:56 +00:00
Ralph Castain	1dc65b5fd7	Update libevent to 2.0.21-stable, but currently ignore it for all but those testing it This commit was SVN r28348.	2013-04-22 17:01:07 +00:00
Ralph Castain	6c6681e880	Fix an error in a test in the libevent configure.ac that we introduced - there are two brackets around the entire test code, so no need for double-brackets around the array indices within it cmr:v1.7.2 This commit was SVN r28347.	2013-04-22 15:29:44 +00:00
Rolf vandeVaart	5e1dde419c	Fix some compile errors in CUDA-aware code that has crept in. This commit was SVN r28346.	2013-04-18 15:34:16 +00:00
Jeff Squyres	c722440411	Add public functions for retrieving the MAC and MTU (paired with r28344). This commit was SVN r28345. The following SVN revision numbers were found above: r28344 --> open-mpi/ompi@e88881c25f	2013-04-17 22:32:32 +00:00
Jeff Squyres	e88881c25f	Also support getting the MAC and MTU. This commit was SVN r28344.	2013-04-17 22:17:42 +00:00
Jeff Squyres	eb012c2aad	Defensive programming: add a constructor for opal_if_t that zeros everything out before using it. This is not in response to any known bug, but rather just a pre-emptive, defensive move to help prevent bugs in code that forgets to initialize a field. This commit was SVN r28343.	2013-04-17 22:09:02 +00:00
Ralph Castain	d0e34adacb	Add debug This commit was SVN r28331.	2013-04-15 13:09:43 +00:00
Jeff Squyres	349ee654c1	Fix some --without-hwloc compile errors. Also remove one assigned-but-not-used variable assignment. This commit was SVN r28321.	2013-04-10 15:08:31 +00:00
Jeff Squyres	aef371c8f6	Fix bug introduced by r28236: make declaration and instantiation agree on "const". This commit was SVN r28320. The following SVN revision numbers were found above: r28236 --> open-mpi/ompi@cf377db823	2013-04-10 14:10:47 +00:00
George Bosilca	43e4d3654e	Fix an issue identified by Thomas Jahns and his colleague when the data representation is not correctly optimized (it is off by the extent). During the data representation process, if the opportunity to merge several items appears, we replace them with the new merged element. However, if one of the components of this merged element came from a "loop representation", then the new first element of this loop must have its displacement moved by the extent of the loop. This commit was SVN r28319.	2013-04-09 23:01:54 +00:00
Ralph Castain	45af6cf59e	The move of the orte_db framework to opal required that we create an opaque opal_identifier_t type as OPAL cannot know anything about the ORTE process name. However, passing a value down to opal and then having the db components reference it causes alignment issues on Solaris Sparc platforms. So pass the pointer instead and do the old "memcpy" trick to avoid the problem. This commit was SVN r28308.	2013-04-08 23:34:16 +00:00
George Bosilca	c5909bffe8	Make the opal_convertor_raw similar to opal_convertor_pack and _unpack, by allowing it to handle completed convertors. In this case it will return a length of zero and an iov_count set to zero. This commit was SVN r28305.	2013-04-08 13:49:14 +00:00
Ralph Castain	4dbc468c3c	Remove stale file This commit was SVN r28299.	2013-04-07 13:52:48 +00:00
Ralph Castain	c121a784ae	Remove some weird code around opal_db_close and cleanup that framework's open/close operation This commit was SVN r28298.	2013-04-07 13:52:28 +00:00
Ralph Castain	10257b8b43	Add missing include This commit was SVN r28297.	2013-04-07 01:32:08 +00:00
Ralph Castain	1067b1f5ee	Add a little debug This commit was SVN r28295.	2013-04-06 15:24:35 +00:00
Ralph Castain	3bfa53eb91	Cleanup (again) the solaris topology code in hwloc...sigh. This commit was SVN r28294.	2013-04-06 14:45:32 +00:00
Ralph Castain	ec00fa3132	Fix missing variable declaration in hwloc 1.5.2 This commit was SVN r28293.	2013-04-05 17:43:34 +00:00
Ralph Castain	1f011bef99	Cleanup the updated sys limits capability. Fix a few copy/paste bugs (my bad). Shift the limit set to the ODLS default module so that we set the limits for all apps, even those that don't call opal_init. Leave it in opal_init as well to support direct-launch apps, but ensure we only set the limits once by removing the envar after launch by ODLS. Provide some nice error messages if we fail to set the limits. Since the user had to specifically request we set the limit, treat failure as an error-out situation. This commit was SVN r28288.	2013-04-04 16:00:17 +00:00
Ralph Castain	d09a9e8096	Upgrade the system limit code to support a broader range of parameters. For now, we support stack size, #open files, #children, and file size we can create. Continue to support the old "1" or "0" options for backward compatibility. This commit was SVN r28282.	2013-04-03 18:57:53 +00:00
Ralph Castain	39a4e93e44	Correct the includes so that compiling with devel headers works This commit was SVN r28267.	2013-03-30 16:25:24 +00:00
Ralph Castain	a9dc5a31f2	Fix verbosity setting This commit was SVN r28251.	2013-03-27 22:12:01 +00:00
Ralph Castain	d12eed0703	Silence warning This commit was SVN r28249.	2013-03-27 22:07:29 +00:00
Nathan Hjelm	3b3506717e	de-deprecate mca_base_param_init and mca_base_param_finalize as they will be needed until the mca_base_param shim layer goes away This commit was SVN r28248.	2013-03-27 22:07:23 +00:00
Ralph Castain	95cf39b224	Fix non-updated opal_output channel This commit was SVN r28245.	2013-03-27 21:57:24 +00:00
Nathan Hjelm	17315bf360	Now that the entire codebase has been updated to use the MCA framework system remove the last calls to the MCA parameter system. This commit was SVN r28242.	2013-03-27 21:17:53 +00:00
Nathan Hjelm	9d4a26f47d	Update OMPI frameworks to use the MCA framework system. Notes: - This commit also eliminates the need for an available components list in use in several frameworks. None of the code in question was making use of the priority field of the priority component list item so these extra lists were removed. - Cleaned up selection code in several frameworks to sort lists using opal_list_sort. - Cleans up the ompi/orte-info functions. Expose the functions that construct the list of params so they can be used elsewhere. patches for mtl/portals4 from brian missed a few output variables in openib This commit was SVN r28241.	2013-03-27 21:17:31 +00:00
Nathan Hjelm	c041156f60	Update ORTE frameworks to use the MCA framework system. This commit was SVN r28240.	2013-03-27 21:14:43 +00:00
Nathan Hjelm	365cf48db5	Update OPAL frameworks to use the MCA framework system. This commit was SVN r28239.	2013-03-27 21:11:47 +00:00
Nathan Hjelm	c3b67d0187	Automatically generate a list of installed frameworks in project/include/project/frameworks.h This commit was SVN r28238.	2013-03-27 21:10:32 +00:00
Nathan Hjelm	020b9991a4	Introduce the MCA framework system. This formalizes the interface frameworks must provide. Other changes: - Added a flag to the MCA variable system to indicate a variable should go away when its group does. Both mca_base_framework_var_register() and mca_base_component_var_register() set this flag. Notes: - mca_base_components_open is deprecated. It will be removed in a future commit. - All frameworks should use MCA_BASE_FRAMEWORK_DECLARE to declare their framework structure. - All calls to framework open/close functions should be changed to use the mca_base_framework_* functions. - Instead of special-casing installdirs a flag was added to prevent calling into the variable system when opening a framework. - Ralph: Clarify the functional definition of the "register" function in the MCA framework object - it had the same name as another function that does a totally different thing. - As per discussion with Ralph the behavior of mca_base_framework_register() is to always call mca_base_framework_components_register() if the framework's register function was successful. This removed the need for frameworks to have to call this function directly. This commit was SVN r28237.	2013-03-27 21:10:18 +00:00
Nathan Hjelm	cf377db823	MCA/base: Add new MCA variable system Features: - Support for an override parameter file (openmpi-mca-param-override.conf). Variable values in this file cannot be overridden by any file or environment value. - Support for boolean, unsigned, and unsigned long long variables. - Support for true/false values. - Support for enumerations on integer variables. - Support for MPIT scope, verbosity, and binding. - Support for command line source. - Support for setting variable source via the environment using OMPI_MCA_SOURCE_<var name>=source (either command or file:filename) - Cleaner API. - Support for variable groups (equivalent to MPIT categories). Notes: - Variables must be created with a backing store (char *, int, or bool *) that must live at least as long as the variable. - Creating a variable with the MCA_BASE_VAR_FLAG_SETTABLE flag enables the use of mca_base_var_set_value() to change the value. - String values are duplicated when the variable is registered. It is up to the caller to free the original value if necessary. The new value will be freed by the mca_base_var system and must not be freed by the user. - Variables with constant scope may not be settable. - Variable groups (and all associated variables) are deregistered when the component is closed or the component repository item is freed. This prevents a segmentation fault from accessing a variable after its component is unloaded. - After some discussion we decided we should remove the automatic registration of component priority variables. Few components actually made use of this feature. - The enumerator interface was updated to be general enough to handle future uses of the interface. - The code to generate ompi_info output has been moved into the MCA variable system. See mca_base_var_dump().
opal: update core and components to mca_base_var system orte: update core and components to mca_base_var system ompi: update core and components to mca_base_var system This commit also modifies the rmaps framework. The following variables were moved from ppr and lama: rmaps_base_pernode, rmaps_base_n_pernode, rmaps_base_n_persocket. Both lama and ppr create synonyms for these variables. This commit was SVN r28236.	2013-03-27 21:09:41 +00:00
Jeff Squyres	1a048d6ee6	Remove a duplicate variable declaration. This commit was SVN r28224.	2013-03-27 01:15:27 +00:00
Ralph Castain	317915225c	Finish the binding cleanup by removing the no-longer-used binding level scheme. This proved to be fallible as there is no guarantee that the hierarchy it used matched physical reality of the machine (e.g., is L3 "above" the socket or not). Still have to complete the ppr update, but get the rest of it correct. This commit was SVN r28223.	2013-03-26 20:09:49 +00:00
Ralph Castain	6ee32767d4	Restore the cpus-per-proc option for byslot and bynode mapping. Remove the bind_idx (which recorded the index of the hwloc object where the proc was bound) as this would no longer be unique, and just use the bitmap as the standard reference for location. Update the relative locality computation to take bitmaps as its argument. This commit was SVN r28219.	2013-03-26 18:27:50 +00:00
Jeff Squyres	6c8d0450a3	Update the post-hwloc-1.5.2 patch list. This commit was SVN r28218.	2013-03-26 16:18:52 +00:00
Jeff Squyres	f79716dfd4	Include <hwloc.h> so that the symbols in this file are subject to the <hwloc/rename.h> renaming. This commit was SVN r28215.	2013-03-26 15:49:52 +00:00
George Bosilca	a856f926de	Remove a bunch of unused variables. This commit was SVN r28213.	2013-03-26 14:34:29 +00:00
Jeff Squyres	6695b5e17a	Re-apply r28040 from Eugene: a post-hwloc release fix for Solaris binding. This fix was included in the upstream 1.6 series, but not the upstream 1.5 series, and was therefore missed when we brought 1.5.2 to OMPI. This commit was SVN r28212. The following SVN revision numbers were found above: r28040 --> open-mpi/ompi@3d44f97572	2013-03-26 13:27:23 +00:00
Ralph Castain	8a79d37ac2	Fix a few bugs in the hwloc integration code. The "set binding policy" macro should flag that the policy was indeed set. Some systems don't report sockets, so the print functions need to check for that condition. cmr:v1.7 This commit was SVN r28209.	2013-03-25 17:51:45 +00:00
Brian Barrett	bc3ca9e009	Make the linux memory component do the failure path if it was disabled. This commit was SVN r28206.	2013-03-22 16:56:09 +00:00
Brian Barrett	6c3f986d79	* Fix issue with duplicate symbol for the initialize hook due to it existing in both libmpi and libopen-pal by removing the one for libopen-pal. This won't work if we eventually need registration caching in opal/orte, but I'm hoping that by that point, OFED will have gotten off its butt and properly integrated ummunotify into the verbs layer so that this code can go away. At the same time, fix a minor issue where the init hook was being called twice, once by the libc malloc and once by our malloc by removing the call from our malloc. This commit was SVN r28202.	2013-03-21 23:05:54 +00:00
Jeff Squyres	3938b85182	Fix CID 752007: missing break statements. This commit was SVN r28191.	2013-03-21 11:04:36 +00:00
Ralph Castain	b7f0e46319	Provide a nicer error message when someone gives a bad signal number to opal_signal cmr:v1.7.1 This commit was SVN r28188.	2013-03-20 15:30:59 +00:00
Jeff Squyres	e5838e6121	Don't mandate PCI support, because this will make builds on platforms that don't have libpciaccess fail (e.g., OS X, or any machine without libpciaccess). This commit was SVN r28181.	2013-03-19 16:20:08 +00:00
Jeff Squyres	7f34dc266b	Add missing unlocks. Fixes CID 967022 (which covers the unlock on line 627; there's probably another CID for the unlock added on line 537). This commit was SVN r28179.	2013-03-18 23:19:25 +00:00
Jeff Squyres	90802410a8	Update hwloc from 1.5.1 to 1.5.2. Re-enable hwloc PCI support by default, since it will now use libpciaccess (if available). This commit was SVN r28178.	2013-03-18 23:02:56 +00:00
Jeff Squyres	f8bbfacf65	Fix CID 967922: minor memory leak possibility. This commit was SVN r28175.	2013-03-15 17:59:00 +00:00
Brian Barrett	fc2b3b8d46	Ugh. Work around an issue with memory hooks and the change from one big library to multiple libraries that are implicitly sucked into the executable as a dependency of libmpi. The initialize hook isn't visible to libc on some linux distributions when it's in libopal and libopal isn't explicitly linked into the executable. The fix is to have a duplicate initialize hook in libmpi as well as libopal. sigh. This commit was SVN r28164.	2013-03-11 19:22:24 +00:00
George Bosilca	6a933e7593	Use the libs not some weird path. This commit was SVN r28139.	2013-02-28 22:34:47 +00:00
Ralph Castain	a4b6fb241f	Remove all remaining vestiges of the Windows integration This commit was SVN r28137.	2013-02-28 17:31:47 +00:00
Ralph Castain	e71b40fdcb	If we are redirecting to files, ensure we don't create duplicate file descriptors for output streams going to the same file. If we do, then the output gets completely jumbled - best to avoid that problem. This commit was SVN r28136.	2013-02-28 17:21:53 +00:00
Ralph Castain	8d2fa3693b	First cut at removing the native Windows support. Remove all the Windows-specific components, and the .windows files sprinkled around. Remove the Windows platform files and MTT scripts. Update the NEWS to point Windows users to the cygwin package. This commit was SVN r28116.	2013-02-26 20:44:56 +00:00
Ralph Castain	9479635e31	Missing include here too... This commit was SVN r28115.	2013-02-26 20:21:10 +00:00
Ralph Castain	8b8333da3e	Add missing include This commit was SVN r28114.	2013-02-26 19:56:05 +00:00
Ralph Castain	e413596705	Add the loopexit API to the opal_event definitions This commit was SVN r28113.	2013-02-26 19:27:26 +00:00
Ralph Castain	bd9265c560	Per the meeting on moving the BTLs to OPAL, move the ORTE database "db" framework to OPAL so the relocated BTLs can access it. Because the data is indexed by process, this requires that we define a new "opal_identifier_t" that corresponds to the orte_process_name_t struct. In order to support multiple run-times, this is defined in opal/mca/db/db_types.h as a uint64_t without identifying the meaning of any part of that data. A few changes were required to support this move: 1. the PMI component used to identify rte-related data (e.g., host name, bind level) and package them as a unit to reduce the number of PMI keys. This code was moved up to the ORTE layer as the OPAL layer has no understanding of these concepts. In addition, the component locally stored data based on process jobid/vpid - this could no longer be supported (see below for the solution). 2. the hash component was updated to use the new opal_identifier_t instead of orte_process_name_t as its index for storing data in the hash tables. Previously, we did a hash on the vpid and stored the data in a 32-bit hash table. In the revised system, we don't see a separate "vpid" field - we only have a 64-bit opaque value. The orte_process_name_t hash turned out to do nothing useful, so we now store the data in a 64-bit hash table. Preliminary tests didn't show any identifiable change in behavior or performance, but we'll have to see if a move back to the 32-bit table is required at some later time. 3. the db framework was a "select one" system. However, since the PMI component could no longer use its internal storage system, the framework has now been changed to a "select many" mode of operation. This allows the hash component to handle all internal storage, while the PMI component only handles pushing/pulling things from the PMI system. 
This was something we had planned for some time - when fetching data, we first check internal storage to see if we already have it, and then automatically go to the global system to look for it if we don't. Accordingly, the framework was provided with a custom query function used during "select" that lets you separately specify the "store" and "fetch" ordering. 4. the ORTE grpcomm and ess/pmi components, and the nidmap code, were updated to work with the new db framework and to specify internal/global storage options. No changes were made to the MPI layer, except for modifying the ORTE component of the OMPI/rte framework to support the new db framework. This commit was SVN r28112.	2013-02-26 17:50:04 +00:00


2494 commits