Commit Graph

2769 Commits

Author SHA1 Message Date
Nathan Hjelm
13643f5b6e btl/vader: improved single-copy support
This commit makes the following changes:

 - Add support for the knem single-copy mechanism. Initially vader will only
   support the synchronous copy mode. Asynchronous copy support may be added
   in the future.

 - Improve Linux cross memory attach (CMA) when using restrictive ptrace
   settings. This will allow Open MPI to use CMA without modifying the system
   settings to support ptrace attach (see /etc/sysctl.d/10-ptrace.conf).

 - Allow runtime selection of the single copy mechanism. The default behavior
   is to use the best available. The priority list of single-copy mechanisms is
   as follows: xpmem, cma, and knem.

 - Allow disabling support for kernel-assisted single copy.

 - Some tuning and bug fixes.
2014-10-20 11:44:52 -06:00
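The runtime selection described in this commit can be pictured with a small sketch. It is not vader's code: the function and parameter names are hypothetical, and only the priority order (xpmem, cma, knem) and the option to disable kernel-assisted single copy come from the commit message.

```python
# Priority order stated in the commit message: xpmem, then cma, then knem.
SINGLE_COPY_PRIORITY = ("xpmem", "cma", "knem")


def select_single_copy(available, requested=None):
    """Return the mechanism to use, or None for no kernel-assisted single copy.

    `available` is the set of mechanisms usable on this node; `requested` is an
    optional user choice, where "none" disables single copy entirely.
    Both parameter names are hypothetical.
    """
    if requested == "none":
        return None
    if requested is not None:
        # Honor an explicit request only if it can actually be used.
        return requested if requested in available else None
    for name in SINGLE_COPY_PRIORITY:
        if name in available:
            return name
    return None


print(select_single_copy({"cma", "knem"}))          # cma
print(select_single_copy({"cma", "knem"}, "knem"))  # knem
print(select_single_copy({"xpmem"}, "none"))        # None
```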
Nadezhda Kogteva
2bce929330 MTL MXM cleanup: unnecessary OMPI_MTL_MXM_CONNECT_ON_FIRST_COMM variable removed 2014-10-20 10:29:47 +03:00
Aurélien Bouteiller
e3be1fb9a5 Quick pass over the sm-knem code, indent fixes 2014-10-17 10:38:35 -04:00
Jeff Squyres
43aff4d8b3 btl sm: error if knem support is requested and cannot be activated
Restore the functionality to error out (and show a helpful message) if
knem support is requested but is either not compiled in or cannot be
activated.

Thanks to Gus Correa for bringing the matter to our attention.
2014-10-16 20:01:26 -07:00
Jeff Squyres
b04a2634c6 btl sm: restore btl_sm_have_knem_support MCA param
Somehow, this MCA param was accidentally dropped after v1.6.5.  Thanks
to Gus Correa for bringing this matter to our attention.

Also moving some MCA params down from level 9 to levels 4/5.
2014-10-16 19:48:21 -07:00
Ralph Castain
b6aa691e0a Fix incorrect implementation of new MCA param mca_base_env_list - it was not picking up envars and forwarding them, but only worked if you explicitly set a value for the envar. Ensure it works for both direct and indirect launch modes. Remove stale code as this replaced orte_forward_envars. Ensure it doesn't get passed to the ORTE daemons. 2014-10-16 12:58:56 -07:00
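The two cases this fix distinguishes (an envar picked up from the launching environment versus one given an explicit value) might look like the sketch below. It is an illustration only: the function name is hypothetical, and the ";" delimiter and the choice to skip unset variables are assumptions rather than details taken from the commit.

```python
import os


def resolve_env_list(spec, environ=None, delimiter=";"):
    """Expand an env-list string into the variables to forward.

    An entry "NAME=value" sets the value explicitly; a bare "NAME" is picked
    up from the launching environment and skipped if it is not set there.
    """
    environ = os.environ if environ is None else environ
    forwarded = {}
    for entry in spec.split(delimiter):
        entry = entry.strip()
        if not entry:
            continue
        name, sep, value = entry.partition("=")
        if sep:
            forwarded[name] = value            # explicit value wins
        elif name in environ:
            forwarded[name] = environ[name]    # picked up from the environment
    return forwarded


launch_env = {"FOO": "from-shell", "PATH": "/usr/bin"}
print(resolve_env_list("FOO;BAR=explicit;MISSING", launch_env))
# {'FOO': 'from-shell', 'BAR': 'explicit'}
```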
Gilles Gouaillardet
27dcca0bb2 pmi/s1: fix large keys
do not overwrite the PMI key when pushing a message that does
not fit within 255 bytes
2014-10-16 13:29:32 +09:00
Gilles Gouaillardet
b5aea782ce Revert "Fix heterogeneous support"
Per the discussion at http://www.open-mpi.org/community/lists/devel/2014/10/16050.php

This reverts commit c9c5d4011b.
2014-10-16 12:24:38 +09:00
George Bosilca
63ba754f3f Remove unnecessary includes from the datatype 2014-10-15 21:49:32 -04:00
George Bosilca
7541c03b4c Mark all instances where atomic operations are used but their return value is unnecessary 2014-10-15 21:47:32 -04:00
Jeff Squyres
dc66e197cc var: fix segv in deprecated file var show_help()
Ensure to include the new variable filename in the show_help() output
when we load a deprecated MCA param from a file.

Fixes #236
2014-10-15 08:07:31 -07:00
Jeff Squyres
51027a6635 usnic: fix minor typo
Change harmless-but-weird comma to semicolon.  Found during code
review.
2014-10-15 05:32:36 -07:00
Gilles Gouaillardet
c9c5d4011b Fix heterogeneous support
* redefine orte_process_name_t so it can be converted
  between host and network format as an opal_identifier_t
  aka uint64_t by the OPAL layer.
* correctly send OPAL_DSTORE_ARCH key
2014-10-15 17:19:13 +09:00
Gilles Gouaillardet
5c81658d58 pmix: fix big endian arch
use the appropriate 64 bits type otherwise data gets incorrectly
truncated on big endian arch
2014-10-15 17:17:09 +09:00
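Why a too-narrow type can go unnoticed on little-endian hosts yet lose data on big-endian ones can be shown with a short sketch. This illustrates the general effect, not the pmix code: the helper name is hypothetical, and the assumption is that a 64-bit field is read back through a 32-bit type at the same address.

```python
import struct


def read_as_32bit(value, order):
    """Store `value` in a 64-bit field, then read its first 4 bytes as 32 bits.

    `order` is "<" for little endian or ">" for big endian.
    """
    raw = struct.pack(order + "Q", value)          # 8 bytes as laid out in memory
    return struct.unpack(order + "I", raw[:4])[0]  # narrow read at the same address


print(read_as_32bit(42, "<"))  # 42: the low-order bytes come first
print(read_as_32bit(42, ">"))  # 0: only the high-order bytes are read
```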
Ralph Castain
3ef94a0675 Per email thread on devel list:
Revert "OPAL: drop dead with core on bad flow. rarely happens with helloworld on large scale."

This reverts commit 86f1d5af3e.

Will be reconsidered via RFC as it represents a significant change in behavior
2014-10-12 21:13:42 -07:00
Ralph Castain
4d27eb70f2 Extend the dstore framework to include a new "update_handle" API so the attributes of an existing handle can be changed. We can't just open a new handle as the upper layers won't know where to find the info. :-( 2014-10-10 12:40:32 -07:00
Ralph Castain
1ae34da5e5 Add an attributes parameter to the dstore.open function so we can pass directives to the active storage component. This can, for example, include the backing file info for a new shared memory segment. 2014-10-10 12:13:25 -07:00
Ralph Castain
63f619f871 Provide a mechanism by which an upstream project can rename the OPAL and ORTE libraries. This is required by projects such as ORCM that have their own ORTE and OPAL libraries in order to avoid library confusion. By renaming their version of the libraries, the OMPI applications can correctly dynamically load the correct one for their build. 2014-10-10 11:39:08 -07:00
Nathan Hjelm
a31cf3b740 btl/vader: missing include 2014-10-09 13:57:21 -06:00
Nathan Hjelm
9e0c07e4ce btl/ugni: improve the handling of eager get fragments when the btl runs out
of preregistered buffers

Before this change eager gets were retried on each progress loop. This commit
modifies the protocol to only retry eager gets when another eager get has
completed. This commit also cleans up some callback code that is no longer
needed.
2014-10-09 13:57:21 -06:00
Howard Pritchard
ebc368d26b remove GNI_RDMAMODE_FENCE bit in GNI_PostRdma
The GNI_RDMAMODE_FENCE bit was a left over from
async progress work that is not needed at this point
in the gni BTL.  Removing the bit also allows
for the removal of the GNI_CDM_MODE_BTE_SINGLE_CHANNEL
bit from the GNI_CdmCreate call.
2014-10-09 12:41:19 -06:00
Ralph Castain
ce8e33447f Silence warning 2014-10-09 10:45:25 -07:00
Joshua Ladd
1cabd73522 Adding a new OPAL hash table routine. Please read the algorithm description in opal/class/opal_hash_table.c for more precise details on the design and implementation. This algorithm was contributed by David Linden of H.P. in partnership with Mellanox Technologies. This contribution achieves two objectives:
1. It's actually hashing now, whereas the old OPAL hash table was not. Thus, it is a bug fix and, as such, should be included in the 1.8 series.

2. It is dynamic and can grow and shrink the number of buckets in accordance with job size, whereas the old OPAL hash table had a fixed number of buckets which resulted in poor retrieval performance at large scale.

This scheme has been deployed in the field on very large H.P./Mellanox systems and has been demonstrated to significantly decrease job start-up time (~ 20% improvement) when launching applications directly with srun in SLURM environments. However, neither SLURM nor direct launch are prerequisites to take advantage of this change as any entity that utilizes OPAL hash table objects can benefit (at least partially) from this contribution.
2014-10-09 17:24:23 +02:00
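The idea of a bucket count that follows the number of stored entries can be sketched as follows. This is not the contributed algorithm (which is described in opal/class/opal_hash_table.c); it is a plain chained table with hypothetical names and arbitrarily chosen grow and shrink thresholds.

```python
class ResizingTable:
    """Chained hash table whose bucket count follows the number of entries."""

    MIN_BUCKETS = 4  # hypothetical floor

    def __init__(self):
        self._buckets = [[] for _ in range(self.MIN_BUCKETS)]
        self._count = 0

    def _index(self, key, nbuckets):
        return hash(key) % nbuckets

    def _resize(self, nbuckets):
        # Rehash every entry into a fresh bucket array of the new size.
        old = self._buckets
        self._buckets = [[] for _ in range(nbuckets)]
        for bucket in old:
            for key, value in bucket:
                self._buckets[self._index(key, nbuckets)].append((key, value))

    def set(self, key, value):
        bucket = self._buckets[self._index(key, len(self._buckets))]
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)
                return
        bucket.append((key, value))
        self._count += 1
        if self._count > 2 * len(self._buckets):  # grow when chains get long
            self._resize(2 * len(self._buckets))

    def get(self, key, default=None):
        for k, v in self._buckets[self._index(key, len(self._buckets))]:
            if k == key:
                return v
        return default

    def remove(self, key):
        bucket = self._buckets[self._index(key, len(self._buckets))]
        for i, (k, _) in enumerate(bucket):
            if k == key:
                del bucket[i]
                self._count -= 1
                nbuckets = len(self._buckets)
                if nbuckets > self.MIN_BUCKETS and self._count < nbuckets // 2:
                    self._resize(nbuckets // 2)  # shrink when mostly empty
                return True
        return False

    def bucket_count(self):
        return len(self._buckets)
```

With a fixed bucket count, chains lengthen as the job grows and every lookup walks a longer list; resizing keeps the average chain length bounded.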
Elena
c905fe9b78 pmix: removed pmix_base_direct modex mca parameter, renamed orte_full_modex_cutoff and ompi_hostname_cutoff to direct_modex_cutoff 2014-10-09 06:15:31 +02:00
Howard Pritchard
9947758d98 initial thread safety for ugni btl
This commit adds initial ugni thread safety support.
With this commit, sun thread tests (excepting MPI-2 RMA)
pass with various process counts and threads/process.
Also osu_latency_mt passes.
2014-10-08 10:13:22 -06:00
Jeff Squyres
a422d893b8 memchecker: per RFC, use calloc for OBJ_NEW
With --enable-memchecker builds, use calloc(3) for OBJ_NEW instead of
malloc(3).  This cuts down on a lot of valgrind/memory checker false
positive output.

Also make a minor change in the valgrind configure.m4; have it assign
0xf to a char.  The prior assignment (of 0xff) was warning about an
overflow.  This didn't really matter, but we might as well make the
test not have a gratuitous warning in it.
2014-10-07 09:55:54 -07:00
Mike Dubman
86f1d5af3e OPAL: drop dead with core on bad flow. rarely happens with helloworld on large scale. 2014-10-07 14:07:41 +03:00
Ralph Castain
fd6a044b7f Cleanup some cruft resulting from the move of the btl's to opal. We had created the ability to delay modex operations, which included a need to delay retrieving hostname info for remote procs. This allowed us to not retrieve the modex info until first message unless required - the hostname is generally only required for debug and error messages.
Properly setup the opal_process_info structure early in the initialization procedure. Define the local hostname right at the beginning of opal_init so all parts of opal can use it. Overlay that during orte_init as the user may choose to remove fqdn and strip prefixes during that time. Setup the job_session_dir and other such info immediately when it becomes available during orte_init.
2014-10-03 16:02:57 -06:00
Howard Pritchard
5428301c81 Remove catamount timer support
With the 1.9 release, support for catamount is being
dropped. Hence,  removing catamount timer support.
2014-10-03 14:53:09 -06:00
rolfv
697b18db63 Making async copy the default 2014-10-03 06:42:18 -07:00
Gilles Gouaillardet
5c5453b8b1 pmix: fix test in native_get_attr 2014-10-03 11:54:08 +09:00
Jeff Squyres
413e775dbf version configury: make dist now works
Update the VERSION file scheme:

* Remove "want_repo_rev".
* Add "tarball_version".

All values are now always included (major, minor, release, greek,
repo_rev).  However, configure.ac now runs "opal_get_version.sh
... --tarball", which will return the value of tarball_version (if it
is non-empty) or the "full" version string (i.e.,
"major.minor.releasegreek").
2014-10-02 11:32:54 -07:00
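The fallback rule described for the "--tarball" option can be sketched in a few lines. The function below is hypothetical and does not reproduce opal_get_version.sh; it only mirrors the stated behavior: return tarball_version when it is non-empty, otherwise the full "major.minor.releasegreek" string.

```python
def version_string(major, minor, release, greek="", tarball_version=""):
    """Prefer tarball_version if set, else build major.minor.release + greek."""
    if tarball_version:
        return tarball_version
    return f"{major}.{minor}.{release}{greek}"


print(version_string(1, 9, 0, "a1"))                             # 1.9.0a1
print(version_string(1, 9, 0, "a1", tarball_version="nightly"))  # nightly
```

The values "a1" and "nightly" are made-up inputs for illustration.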
Jeff Squyres
8468424f45 distscript: remove configure.params and autogen.subdirs kruft
Remove configure.params support: configure.params hasn't been used in
years.

Also remove autogen.subdirs support; those should really be handled by
their respective Makefile.am's.
2014-10-02 11:32:54 -07:00
Jeff Squyres
54544f64b3 wrappers: update URLs for GitHub 2014-10-01 14:37:17 -07:00
Ralph Castain
9e35f80ab6 Don't multiply define WANT_PMI_SUPPORT and friends. Turns out they weren't being used anywhere anyway, so no point in defining them at all
This commit was SVN r32822.
2014-09-30 20:43:25 +00:00
Howard Pritchard
8da51fab81 cray pmi equivalent to commit 5eb65b24
This commit was SVN r32820.
2014-09-30 19:25:00 +00:00
Ralph Castain
8d0b4f222a The pmix.get functions should not be returning "success" if the requested info isn't found. Fix the macros and the component functions so they correctly return "not found" in that situation, and set the data regions and size to NULL and 0, respectively.
This commit was SVN r32818.
2014-09-30 18:03:12 +00:00
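The corrected contract (a distinct "not found" status, with the data and size outputs reset) can be illustrated with a minimal sketch. The names and the numeric status values are hypothetical and are not the pmix API.

```python
SUCCESS = 0
ERR_NOT_FOUND = -13  # hypothetical numeric value, for illustration only


def get(store, key):
    """Return (status, data, size); data is None and size is 0 if key is absent."""
    if key not in store:
        return ERR_NOT_FOUND, None, 0
    data = store[key]
    return SUCCESS, data, len(data)


store = {"hostname": b"node01"}
print(get(store, "hostname"))  # (0, b'node01', 6)
print(get(store, "missing"))   # (-13, None, 0)
```

Returning "success" for a missing key would force every caller to inspect the data pointer to learn whether the lookup worked, which is the ambiguity this commit removes.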
Jeff Squyres
d4e2809531 version: always use all 3 version numbers
In all previous releases, the version number would be "A.B.C" unless C
was 0, in which case it would be "A.B".  This commit changes that
scheme to always be "A.B.C", even if C==0.

Hence, v1.9.0 will be the first release where this new scheme is evident.

This commit was SVN r32816.
2014-09-30 15:54:18 +00:00
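The difference between the two schemes is easiest to see side by side. Both helpers below are hypothetical and exist only to restate the rule from the commit message.

```python
def old_scheme(a, b, c):
    """Previous releases: drop the third number when it is 0."""
    return f"{a}.{b}" if c == 0 else f"{a}.{b}.{c}"


def new_scheme(a, b, c):
    """After this commit: always print all three numbers."""
    return f"{a}.{b}.{c}"


print(old_scheme(1, 9, 0), new_scheme(1, 9, 0))  # 1.9 1.9.0
print(old_scheme(1, 8, 3), new_scheme(1, 8, 3))  # 1.8.3 1.8.3
```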
Howard Pritchard
1df933ea27 remove ompi/runtime/params.h include in ugni btl
This commit was SVN r32813.
2014-09-29 19:26:33 +00:00
Howard Pritchard
201d4ec3ad fix setting of PMIX_NODE_RANK in cray pmix comp.
Per discussions with pmix folks, it was determined that
the way the cray pmi pmix component was computing the
PMIX_NODE_RANK attribute for a process was incorrect.
This commit fixes the problem.

This commit was SVN r32810.
2014-09-29 16:55:31 +00:00
Rolf vandeVaart
399dc3db43 Code to check for managed memory. Configure support also.
This commit was SVN r32801.
2014-09-26 16:24:45 +00:00
Rolf vandeVaart
35858f837a Revert r32713. Have different code for this.
This commit was SVN r32800.

The following SVN revision numbers were found above:
  r32713 --> open-mpi/ompi@9a2bab0e27
2014-09-26 14:56:18 +00:00
Nathan Hjelm
e0eb1f2e73 btl/vader: make vader registration lookup/caching thread safe
This commit was SVN r32798.
2014-09-25 22:24:06 +00:00
George Bosilca
53e012ae97 Fix typo.
This commit was SVN r32795.
2014-09-25 17:18:27 +00:00
Nathan Hjelm
aba87f3776 btl/vader: silence warning
This commit was SVN r32788.
2014-09-24 22:10:23 +00:00
Nathan Hjelm
79881ca892 btl/vader: prevent double-destruction of endpoints and move endpoint teardown code into destructor
This commit was SVN r32779.
2014-09-23 21:51:15 +00:00
Nathan Hjelm
2d8fba0861 btl/vader: silence warning
This commit was SVN r32778.
2014-09-23 21:33:45 +00:00
Nathan Hjelm
8bd3160432 btl/vader: fix several typos in vader update
This commit was SVN r32775.
2014-09-23 20:25:36 +00:00
Nathan Hjelm
12bfd13150 btl/vader: improve performance for both single and multiple threads
This is a large update that does the following:

 - Only allocate fast boxes for a peer if a send count threshold
   has been reached (default: 16). This will greatly reduce the memory
   usage with large numbers of local peers.

 - Improve performance by limiting the number of fast boxes that can
   be allocated per peer (default: 32). This will reduce the amount
   of time spent polling for fast box messages.

 - Provide new MCA variables to configure the size, maximum count,
   and send count thresholds for fast boxes allocations.

 - Updated buffer design to increase the range of message sizes that
   can be sent with a fast box.

 - Add thread protection around fast box allocation (locks). When
   spin locks are available this should be updated to use spin locks.

 - Various fixes and cleanup.

This commit was SVN r32774.
2014-09-23 18:11:22 +00:00
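The allocation policy in the first two bullets can be pictured roughly as below. This is a sketch under assumptions, not vader's implementation: the class and method names are hypothetical, and reading the limit of 32 as a cap on how many fast boxes one process hands out is an interpretation of the commit message. Only the default values 16 and 32 come from the text.

```python
class FastBoxPolicy:
    """Decide when a peer gets a fast box, based on how often we send to it."""

    def __init__(self, send_threshold=16, max_boxes=32):
        self.send_threshold = send_threshold
        self.max_boxes = max_boxes
        self.send_counts = {}   # peer -> sends seen before a box was allocated
        self.boxes = set()      # peers that currently own a fast box

    def on_send(self, peer):
        """Record one send to `peer`; return True if it may use a fast box."""
        if peer in self.boxes:
            return True
        count = self.send_counts.get(peer, 0) + 1
        self.send_counts[peer] = count
        if count >= self.send_threshold and len(self.boxes) < self.max_boxes:
            self.boxes.add(peer)  # threshold reached and the cap is not hit
            return True
        return False


policy = FastBoxPolicy()
results = [policy.on_send("peer0") for _ in range(17)]
print(results.index(True))  # 15, i.e. the 16th send is the first with a box
```

Peers that exchange only a few messages never cross the threshold, so no memory is reserved for them.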
Howard Pritchard
1508a01325 Fixes to enable mpirun to work again on Cray
The ess pmi module was not handling aprun launched
daemons.  All daemons were thinking they were vpid 1.

Also, turns out that on cray systems using MOM nodes
for launched jobs, just detecting whether or not a
process is in a PAGG container is not sufficient.

Crank up the priority of the alps PLM component in the
event that the configure detected the presence of both
slurm and alps.

Have the ESS pmi component open the pmix framework and
select a pmix component.

This commit was SVN r32773.
2014-09-23 15:37:26 +00:00