Define OPAL_MAXHOSTNAMELEN to be either:
(MAXHOSTNAMELEN + 1), or
(HOST_NAME_MAX + 1) from <limits.h>, or
(255 + 1)
For pmix code, define the above using PMIX_MAXHOSTNAMELEN.
Fix up the opal layer to use the new max.
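A minimal sketch of the fallback chain described above (the header names and guard ordering are assumptions; the actual opal header may differ):

    #include <limits.h>        /* HOST_NAME_MAX, where available */
    #include <sys/param.h>     /* MAXHOSTNAMELEN, where available */

    #if defined(MAXHOSTNAMELEN)
    #define OPAL_MAXHOSTNAMELEN (MAXHOSTNAMELEN + 1)
    #elif defined(HOST_NAME_MAX)
    #define OPAL_MAXHOSTNAMELEN (HOST_NAME_MAX + 1)
    #else
    /* fallback from the commit message: 255 bytes plus the terminating NUL */
    #define OPAL_MAXHOSTNAMELEN (255 + 1)
    #endif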
Signed-off-by: Karol Mroz <mroz.karol@gmail.com>
When SMT is enabled, a core must be counted as long as at least one of its hwthreads is allowed.
Thanks to Ben Menadue for the report.
This fixes a regression from open-mpi/ompi@6d149554a7
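A minimal hwloc sketch of the counting rule above (illustration only; the function name is invented and this is not the actual OPAL code):

    #include <hwloc.h>

    /* Count a core as available as soon as at least one of its
     * hwthreads (PUs) is present in the allowed cpuset. */
    static int count_allowed_cores(hwloc_topology_t topo,
                                   hwloc_const_cpuset_t allowed)
    {
        int i, count = 0;
        int ncores = hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_CORE);

        for (i = 0; i < ncores; i++) {
            hwloc_obj_t core = hwloc_get_obj_by_type(topo, HWLOC_OBJ_CORE, i);
            /* core->cpuset covers every PU (hwthread) of this core */
            if (hwloc_bitmap_intersects(core->cpuset, allowed)) {
                count++;
            }
        }
        return count;
    }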
The OPAL_PROC_ON_* definitions have been changed from values to
flags. This should not cause any problems as these values were already
used as flags throughout the code base. Note, there will be a
difference between localities produced by the new code and the
old. For example, if a machine does not have a level-3 cache but two cores
share a level-1 or level-2 cache, the level-3 bit will not be set
in the locality and OPAL_PROC_ON_LOCAL_L3CACHE will return 0. Before
this change it would have returned 1.
In addition the OPAL_PROC_ON_LOCAL_* macros have been simplified.
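For illustration, a simplified sketch of what the flag-style locality tests look like (the bit values below are invented; the real definitions live in the opal hwloc headers):

    /* invented bit values, for illustration only */
    #define OPAL_PROC_ON_L3CACHE  0x0010
    #define OPAL_PROC_ON_NUMA     0x0100
    #define OPAL_PROC_ON_NODE     0x1000

    /* with flags, a locality test reduces to a mask check: it returns 0
     * unless the corresponding bit was set when the relative locality
     * was computed, e.g. no shared L3 means no L3 bit */
    #define OPAL_PROC_ON_LOCAL_L3CACHE(f)  (!!((f) & OPAL_PROC_ON_L3CACHE))
    #define OPAL_PROC_ON_LOCAL_NUMA(f)     (!!((f) & OPAL_PROC_ON_NUMA))
    #define OPAL_PROC_ON_LOCAL_NODE(f)     (!!((f) & OPAL_PROC_ON_NODE))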
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
Don't filter the topology by cpuset in mpirun until we know that no other compute nodes are involved. This handles the corner case where mpirun is executing on a node with a different topology from the compute nodes.
Simplify - don't mandate that all cpus in the given cpuset be present on every node. We can then run everything through the filter as before, which ensures that any procs running on the mpirun node are also contained within the specified cpuset.
Correctly count the number of available PUs under each object when given a cpuset (see the sketch below)
Fix the default binding settings, and correctly count PUs when no cpuset is given
Ensure the binding policy gets set in all cases
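A rough sketch of the PU counting mentioned above, using hwloc bitmaps (illustrative only; the function name is invented):

    #include <hwloc.h>

    /* Number of PUs under 'obj' that are also in the given cpuset. */
    static int count_available_pus(hwloc_obj_t obj, hwloc_const_cpuset_t given)
    {
        int npus;
        hwloc_cpuset_t avail = hwloc_bitmap_alloc();

        /* only PUs that are both beneath this object and inside the
         * user-specified cpuset are available for mapping/binding */
        hwloc_bitmap_and(avail, obj->cpuset, given);
        npus = hwloc_bitmap_weight(avail);
        hwloc_bitmap_free(avail);
        return npus;
    }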
Due to the nature of the cache architecture on Power,
coherency_line_size is not exported for the L2 cache in sysfs.
If we are unable to get the L2 cache line size, try L1.
See open-mpi/ompi#383 for more information.
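A sketch of that fallback (hwloc 2.x object names are assumed here; the actual code of that era used the older hwloc cache objects and may differ):

    #include <hwloc.h>

    /* Prefer the L2 line size; on Power the kernel does not export
     * coherency_line_size for L2, so fall back to L1. */
    static unsigned get_cache_line_size(hwloc_topology_t topo)
    {
        hwloc_obj_t cache = hwloc_get_obj_by_type(topo, HWLOC_OBJ_L2CACHE, 0);

        if (NULL == cache || 0 == cache->attr->cache.linesize) {
            cache = hwloc_get_obj_by_type(topo, HWLOC_OBJ_L1CACHE, 0);
        }
        if (NULL != cache && 0 != cache->attr->cache.linesize) {
            return cache->attr->cache.linesize;
        }
        return 0;   /* unknown: let the caller apply its own default */
    }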
This commit adds an owner file in each of the component directories
for each framework. This allows for a simple script to parse
the contents of the files and generate, among other things, tables
to be used on the project's wiki page. Currently there are two
"fields" in the file, an owner and a status. A tool to parse
the files and generate tables for the wiki page will be added
in a subsequent commit.
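For reference, a hypothetical owner file showing the two fields described above (the file name, field spelling, and values are assumptions, not the committed format):

    # component owner/status file
    owner: SOME-ORGANIZATION
    status: active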
Retain the hetero-nodes flag for those cases where the user *knows* that there are differences and our automated system isn't good enough to see it.
Will obviously require further refinement as we find out which variances it can detect, and which it cannot.
WHAT: Merge the PMIx branch into the devel repo, creating a new
OPAL “pmix” framework to abstract PMI support for all RTEs.
Replace the ORTE daemon-level collectives with a new PMIx
server and update the ORTE grpcomm framework to support
server-to-server collectives
WHY: We’ve had problems dealing with variations in PMI implementations,
and need to extend the existing PMI definitions to meet exascale
requirements.
WHEN: Mon, Aug 25
WHERE: https://github.com/rhc54/ompi-svn-mirror.git
Several community members have been working on a refactoring of the current PMI support within OMPI. Although the APIs are common, Slurm and Cray implement a different range of capabilities, and package them differently. For example, Cray provides an integrated PMI-1/2 library, while Slurm separates the two and requires the user to specify the one to be used at runtime. In addition, several bugs in the Slurm implementations have caused problems requiring extra coding.
All this has led to a slew of #if’s in the PMI code and bugs when the corner-case logic for one implementation accidentally traps the other. Extending this support to other implementations would have increased this complexity to an unacceptable level.
Accordingly, we have:
* created a new OPAL “pmix” framework to abstract the PMI support, with separate components for Cray, Slurm PMI-1, and Slurm PMI-2 implementations.
* Replaced the current ORTE grpcomm daemon-based collective operation with an integrated PMIx server, and updated the grpcomm APIs to provide more flexible, multi-algorithm support for collective operations. At this time, only the xcast and allgather operations are supported.
* Replaced the current global collective id with a signature based on the names of the participating procs. This allows an unlimited number of collectives to be executed by any group of processes, subject to the requirement that only one collective can be active at a time for a unique combination of procs. Note that a proc can be involved in any number of simultaneous collectives - it is the specific combination of procs that is subject to the constraint.
* removed the prior OMPI/OPAL modex code
* added new macros for executing modex send/recv to simplify use of the new APIs. The send macros allow the caller to specify whether or not the BTL supports async modex operations - if so, then the non-blocking “fence” operation is used, if the active PMIx component supports it. Otherwise, the default is a full blocking modex exchange as we currently perform (a sketch of this decision follows the list).
* retained the current flag that directs us to use a blocking fence operation, but only to retrieve data upon demand
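A self-contained, purely hypothetical sketch of the decision the send macros make; none of these names are the real OPAL API:

    #include <stdbool.h>
    #include <stddef.h>

    /* hypothetical stand-ins for the active pmix component's fence entry
     * points -- not the real OPAL interfaces */
    typedef int (*fence_fn_t)(void);
    typedef int (*fence_nb_fn_t)(void (*cbfunc)(void *), void *cbdata);

    struct pmix_module_sketch {
        fence_fn_t    fence;     /* blocking exchange */
        fence_nb_fn_t fence_nb;  /* non-blocking "fence", may be NULL */
    };

    static int modex_send_sketch(struct pmix_module_sketch *pmix,
                                 bool btl_supports_async_modex,
                                 void (*release_cb)(void *), void *cbdata)
    {
        if (btl_supports_async_modex && NULL != pmix->fence_nb) {
            /* the BTL can tolerate data arriving later: use the
             * non-blocking fence and return to the caller immediately */
            return pmix->fence_nb(release_cb, cbdata);
        }
        /* default: full blocking modex exchange, as currently performed */
        return pmix->fence();
    }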
This commit was SVN r32570.
Resolve the handling of overload conditions when binding processes. Two cases arise:
(a) default binding policy is in effect. In this case, we will emit a
warning and default to not binding unless the user provided the
"oversubscribe" or "overload" modifier to the "bind-to" option.
(b) user-specified binding policy is in effect. In this case, we will
error out unless the user provided the "oversubscribe" or "overload"
modifier to the "bind-to" option as we cannot meet the directive.
Either "bind-to" modifier (oversubscribe or overload) will be accepted for
now - in 1.9, we will deprecate the "overload" term in favor of
"oversubscribe".
Also added the ability to accept a --bind-to modifier without specifying the binding policy itself so a user can specify overload-allowed with the default policy.
Closes trac:4345
cmr=v1.8.2:reviewer=rhc:subject=resolve handling of overload conditions
This commit was SVN r32005.
The following Trac tickets were found above:
Ticket 4345 --> https://svn.open-mpi.org/trac/ompi/ticket/4345
* allow users to specify just a modifier for map-by instead of requiring that they also specify a policy. Thus, we now accept --map-by :pe=3 as indicating that we should use the default mapping policy, but bind 3 cpus/proc.
* if users specify a pes/proc count but no policy, default to --map-by NUMA to ensure we have access to multiple cpus for the request. This won't guarantee we have access to enough to meet the request, but gives us a chance. In addition, we know that binding a proc to multiple cpus will work best if those cpus are all in the same NUMA domain, so this provides some degree of optimized behavior.
Per a request from Jeff, define "oversubscribe" for binding as a synonym for the "overload" modifier.
cmr=v1.8.2:reviewer=rhc
This commit was SVN r31967.
Also, add missing ORTE_ERROR_LOG in the other case where this error
message is used (i.e., ORTE_ERROR_LOG was used in the one place, so
let's also use it in the other place).
This commit was SVN r31321.
Refs trac:4117. Please use this commit rather than the patch attached to
the ticket; the patch had a few mistakes in the tweaked wording.
This commit was SVN r30362.
The following SVN revision numbers were found above:
r30298 --> open-mpi/ompi@58479399c3
The following Trac tickets were found above:
Ticket 4117 --> https://svn.open-mpi.org/trac/ompi/ticket/4117
Rather than using pkg{data,lib,includedir}, use our own ompi{data,lib,includedir}, which is
always set to {datadir,libdir,includedir}/openmpi. This will keep us from
having help files in prefix/share/open-rte when building without Open MPI,
but in prefix/share/openmpi when building with Open MPI.
This commit was SVN r30140.
Fix comm_spawn on a single host - with the new default mapping scheme, we were incorrectly computing the number of procs to put on the node.
Refs trac:4003
This commit was SVN r30033.
The following Trac tickets were found above:
Ticket 4003 --> https://svn.open-mpi.org/trac/ompi/ticket/4003