1
1
Граф коммитов

5005 Коммитов

Автор SHA1 Сообщение Дата
Ralph Castain
fc6b260146 Protect against PMIx-based requests that don't come thru the MPI comm_spawn interface 2016-01-16 13:36:06 -08:00
Ralph Castain
4dad5de8ff Silence a couple of warnings - strncpy returns a char*, not an int 2016-01-16 09:44:52 -08:00
Jeff Squyres
60ffe713b8 common syms: whitelist bison-generated common symbols
Bison generates some common symbols that we can't do anything about,
so whitelist them.
2016-01-16 03:53:14 -08:00
Gilles Gouaillardet
1d38430e43 opal: replace opal_convert_jobid_to_string with opal_snprintf_jobid 2016-01-14 10:39:03 +09:00
Gilles Gouaillardet
4c43fb2a50 orte_rmaps_base_map_job: set OPAL_BIND_ALLOW_OVERLOAD when needed 2016-01-13 17:13:36 +09:00
Ralph Castain
332019b43a Silence warning 2016-01-10 09:59:36 -08:00
Nathan Hjelm
fab1eca536 grpcomm: fix bugs in grpcomm algorithms
This commit fixes multiple issues in the bruck's and recursive
doubling grpcomm algorithms. The following changes are included:

 - Use the existing bitmap implementation instead of implementing a
   new one. There were bugs in the implementation that caused an
   overrun of the bitmap array.

 - Clean up the algorithms to eliminate errors.

 - Send as little extra data as possible in the bruck's
   algorithm.

The changes were testest with various numbers of ortes varying from 1
to 4096.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-01-07 10:12:08 -07:00
Ralph Castain
f53d3c7a18 Silence warning 2015-12-30 10:16:58 -08:00
Ralph Castain
0a6b8d2c14 Correctly handle connection terminations during finalize so mpirun doesn't hang. Cleanup some corner cases in the error notification system 2015-12-30 07:16:43 -08:00
Ralph Castain
1cdc1c121c Revert "Standardize the handling of shutdown in the OOB TCP component"
This reverts commit open-mpi/ompi@12dccaa911.
2015-12-30 07:05:40 -08:00
Ralph Castain
12dccaa911 Standardize the handling of shutdown in the OOB TCP component 2015-12-29 07:57:22 -08:00
rhc54
5dfb7ac396 Merge pull request #1266 from ggouaillardet/topic/misc_pmix_fixes
Topic/misc pmix fixes
2015-12-29 07:02:44 -08:00
Ralph Castain
810f2446b7 Add pmix120 component, update the error handling functions in the PMIx API.
Update the configure logic for the new pmix120 component

ckpt

Get the pmix120 component to work - still not really registering or handling notifications, but infrastructure now operates

Cleanup some of the symbol scopes, and provide a more comprehensive rename.h file. Will pretty it up later - let's see how this works

Cleanup the rename files to use the pretty macros
2015-12-28 23:15:44 +09:00
Gilles Gouaillardet
352b05a552 rmaps: warn if oversubscribing when manually setting the number of hosts
This is a port of the v1.10 series one-off open-mpi/ompi-release@8c5ce45ab6
2015-12-28 10:38:57 +09:00
Ralph Castain
8ab28cdc82 Fix a typo that causes segfaults on multi-node executions 2015-12-24 08:43:47 -08:00
rhc54
d7199dc75b Merge pull request #1255 from annu13/fixup
Fixup
2015-12-22 20:54:48 -08:00
annu13
43f44f31c1 moved code to job setup first before enabling comm 2015-12-22 14:30:59 -08:00
Howard Pritchard
39367ca0bf plm/alps: only use srun for Native SLURM
Turns out that the way the SLURM plm works
is not compatible with the way MPI processes
on Cray XC obtain RDMA credentials to use
the high speed network.  Unlike with ALPS,
the mpirun process is on the first compute
node in the job.  With the current PLM launch
system, mpirun (HNP daemon) launches the MPI
ranks on that node rather than relying on
srun.

This will probably require a significant amount
of effort to rework to support Native SLURM
on Cray XC's.  As a short term alternative,
have the alps plm (which gets selected by default
again on Cray systems regardless of the launch system)
check whether or not srun or alps is being used on the
system.  If alps is not being used, print a helpful
message for the user and abort the job launch.

Signed-off-by: Howard Pritchard <howardp@lanl.gov>
2015-12-22 11:03:42 -08:00
rhc54
d9cd451a16 Merge pull request #1250 from rhc54/topic/rf
Fix the default slot mapping in rank file mapper
2015-12-21 10:57:52 -08:00
Ralph Castain
7cc5879bdd Fix the default slot mapping in rank file mapper 2015-12-21 09:47:27 -08:00
Ralph Castain
94ffe10808 Do not override any external settings for PMIx component selection 2015-12-21 08:36:12 -08:00
Jeff Squyres
53ca721ff4 configury: clean up .so version numbers
Move .so version numbers to their appropriate project in the top-level
VERSION file.  Also add the project name to all .so version number
names.  Remove no-longer-used .so names.
2015-12-18 12:50:23 -05:00
Ralph Castain
64b695669a Cleanup warnings in opal and orte layers when building optimized on Mac 2015-12-17 07:51:24 -08:00
Ralph Castain
3a56f0d34b Create the pmix external component. Fix a few places where opal/util/argv.h were required when building with an external pmix (go figure).
NOTE: Building with external pmix *requires* that you also build with external libevent and hwloc libraries. Detect this at configure and error out with large message if this requirement is violated.

Closes #1204  (replaces it)
Fixes #1064
2015-12-15 15:26:13 -08:00
Howard Pritchard
7a82174747 Merge pull request #1195 from hppritcha/topic/wlm_detect
support Cray nativized slurm environment
2015-12-15 07:58:53 -07:00
Jeff Squyres
3e308f41f7 rmaps base help: update binding error messages
Due to user confusion, update the show-help messages displayed when
processor and/or memory binding fails.  Thanks to Dave Love
(@loveshack) for the initial suggestion.

Fixes open-mpi/ompi#1087
2015-12-14 13:02:41 -05:00
Ralph Castain
03eb1a80bf Update the PMIx native component to release v1.1.1, with addition of one bug-fix commit beyond the official release
Rename the pmix1xx component to pmix111 so it reflects the actual release it includes

Resolve the problem of PMIx being passed a bogus --with-platform argument when configuring the PMIx tarball code. There is no reason we should be passing --with-platform arguments to any internal subdirectory, so just leave that out when constructing the opal_subdir_args variable.

Update the PMIx code and continue attempting to debug direct modex

Fix a problem in the ORTE PMIx server - there was an early intent to optimize the direct modex by fetching data for all procs from the target job on the remote node, instead of fetching the data one proc at a time. However, this was never completely implemented, and so we would hang if we had multiple overlapping requests for data from more than one proc on the node.

Update PMIx to v1.1.2
2015-12-12 18:46:38 -08:00
Ralph Castain
5e5adebf8e Port the changes from #782 to the master. Not everything applies here as the code in the 1.10 series is a little different. In addition, we asked for a few changes (e.g., using MPI_ERR_ARG instead of "13") that are incorporated here.
Thanks to @jsharpe for the PR
2015-12-12 12:40:34 -08:00
Ralph Castain
1db3db022a Don't be so prescriptive about the ess component to be used - we just need to protect against the proc incorrectly taking the singleton component, so rule that one out. Ensure that the other components understand that they are only for use by daemons. 2015-12-09 19:54:44 -08:00
Jeff Squyres
00c5dc9449 rml oob: C99-ification of structure member assignment 2015-12-08 17:05:16 -08:00
Howard Pritchard
cb7c26ce96 plm/slurm: add support for cray native slurm
Cray has added plugins to slurm to support
the Cray programming env (alpslli, cray pmi, etc).
Some of the workarounds needed with plm/alps
to avoid issues with Cray PMI getting mixed up
with orte launch system are also required in
a cray native slurm environment.

Signed-off-by: Howard Pritchard <howardp@lanl.gov>
2015-12-08 13:47:20 -06:00
Howard Pritchard
9548b8a9e8 plm/alps: add wlm detect infrastructure
Signed-off-by: Howard Pritchard <howardp@lanl.gov>
2015-12-07 07:43:20 -08:00
Ralph Castain
8823069fe9 Provide a mechanism by which a tool can request async progress thread support for ORTE 2015-12-04 08:26:57 -08:00
Mark Santcroos
3119bc14b2 Merge branch 'master' into fix/alpsinfov3 2015-11-13 08:53:06 -05:00
Ralph Castain
986a8c1d48 If an executable isn't found, it's possible for the state machine to hit the grpcomm with a zero-node map before we actually terminate with error. Silence the annoying malloc warning about zero-byte requests.
In a novm operation that only has the HNP, ensure the #nodes gets set

Clean up the error reporting
2015-11-11 14:24:13 -08:00
Jeff Squyres
8bd356549a orte proc_info.h: use symbolic names
This fix was actually applied in the v2.x branch first (as commit
open-mpi/ompi-release@a9b22afc1a).
2015-11-10 13:39:21 -08:00
Mark Santcroos
299fd69c6d Merge branch 'master' into fix/alpsinfov3 2015-11-10 15:40:19 -05:00
rhc54
474a869b8d Merge pull request #1121 from dmt4/orterun-manpage-typos
change -0bind-to and -bind-to to --bind-to in the manpages
2015-11-10 11:24:08 -08:00
Dimitar Pashov
9f6e306064 change -0bind-to and -bind-to to --bind-to in the manpages 2015-11-10 17:44:53 +00:00
Ralph Castain
6a607d42a6 Prevent a segfault on tools if a connection attempt fails - tools don't open the opal/pmix framework and thus have no way of looking up a proc hostname 2015-11-10 09:11:34 -08:00
Mark Santcroos
5ec2b4d98c Fix some messages in the process. 2015-11-09 18:03:26 -05:00
Mark Santcroos
8ec89001b3 Merge branch 'master' into fix/alpsinfov3 2015-11-09 02:45:23 -05:00
Ralph Castain
9b0cdc0de2 Add support for -pernode and -npernode options to orte-submit 2015-11-08 11:34:18 -08:00
Ralph Castain
f1483eb2dc Need to delay registration of the waitpid callback until after the fork/exec of the child process. Fix the bit testing of process type so that the proper state component gets selected for HNP. 2015-11-06 21:35:24 -08:00
Ralph Castain
5f446570d8 Work on cleaning up memory leaks that are causing orte-dvm to eventually run out of memory. Still don't have everything plugged, but getting better. Sync to the PMIx master that includes removal of the pmix_common.h.in file that really didn't need to be generated, and update to the PMIx_server_init API. 2015-11-06 14:15:30 -08:00
Mark Santcroos
a40b4eb2ee Support ALPS_APPINFO_VERSION 3. 2015-11-06 09:53:41 -05:00
Ralph Castain
ec0cc4bf21 Ensure that we completely register an nspace prior to launching local procs as otherwise we may attempt to send it down before it is registered, leading to data corruption 2015-11-05 20:51:56 -08:00
Ralph Castain
68996d6858 Move the argv_free back to the correct place - I blame Jeff for suggesting it was wrong to begin with 2015-11-05 07:57:54 -08:00
Ralph Castain
169c44258d Fix missing check 2015-11-03 19:00:28 -08:00
Ralph Castain
fe0c995f6b Fix a couple of minor issues identified by Jeff 2015-11-03 17:30:51 -08:00