Commit Graph

4988 Commits

Author SHA1 Message Date
Howard Pritchard
39367ca0bf plm/alps: only use srun for Native SLURM
Turns out that the way the SLURM plm works
is not compatible with the way MPI processes
on Cray XC obtain RDMA credentials to use
the high speed network.  Unlike with ALPS,
the mpirun process is on the first compute
node in the job.  With the current PLM launch
system, mpirun (HNP daemon) launches the MPI
ranks on that node rather than relying on
srun.

This will probably require a significant amount
of effort to rework to support Native SLURM
on Cray XC's.  As a short term alternative,
have the alps plm (which gets selected by default
again on Cray systems regardless of the launch system)
check whether or not srun or alps is being used on the
system.  If alps is not being used, print a helpful
message for the user and abort the job launch.

Signed-off-by: Howard Pritchard <howardp@lanl.gov>
2015-12-22 11:03:42 -08:00
rhc54
d9cd451a16 Merge pull request #1250 from rhc54/topic/rf
Fix the default slot mapping in rank file mapper
2015-12-21 10:57:52 -08:00
Ralph Castain
7cc5879bdd Fix the default slot mapping in rank file mapper 2015-12-21 09:47:27 -08:00
Ralph Castain
94ffe10808 Do not override any external settings for PMIx component selection 2015-12-21 08:36:12 -08:00
Jeff Squyres
53ca721ff4 configury: clean up .so version numbers
Move .so version numbers to their appropriate project in the top-level
VERSION file.  Also add the project name to all .so version number
names.  Remove no-longer-used .so names.
2015-12-18 12:50:23 -05:00
Ralph Castain
64b695669a Cleanup warnings in opal and orte layers when building optimized on Mac 2015-12-17 07:51:24 -08:00
Ralph Castain
3a56f0d34b Create the pmix external component. Fix a few places where opal/util/argv.h were required when building with an external pmix (go figure).
NOTE: Building with external pmix *requires* that you also build with external libevent and hwloc libraries. Detect this at configure and error out with large message if this requirement is violated.

Closes #1204  (replaces it)
Fixes #1064
2015-12-15 15:26:13 -08:00
Howard Pritchard
7a82174747 Merge pull request #1195 from hppritcha/topic/wlm_detect
support Cray nativized slurm environment
2015-12-15 07:58:53 -07:00
Jeff Squyres
3e308f41f7 rmaps base help: update binding error messages
Due to user confusion, update the show-help messages displayed when
processor and/or memory binding fails.  Thanks to Dave Love
(@loveshack) for the initial suggestion.

Fixes open-mpi/ompi#1087
2015-12-14 13:02:41 -05:00
Ralph Castain
03eb1a80bf Update the PMIx native component to release v1.1.1, with addition of one bug-fix commit beyond the official release
Rename the pmix1xx component to pmix111 so it reflects the actual release it includes

Resolve the problem of PMIx being passed a bogus --with-platform argument when configuring the PMIx tarball code. There is no reason we should be passing --with-platform arguments to any internal subdirectory, so just leave that out when constructing the opal_subdir_args variable.

Update the PMIx code and continue attempting to debug direct modex

Fix a problem in the ORTE PMIx server - there was an early intent to optimize the direct modex by fetching data for all procs from the target job on the remote node, instead of fetching the data one proc at a time. However, this was never completely implemented, and so we would hang if we had multiple overlapping requests for data from more than one proc on the node.

Update PMIx to v1.1.2
2015-12-12 18:46:38 -08:00
Ralph Castain
5e5adebf8e Port the changes from #782 to the master. Not everything applies here as the code in the 1.10 series is a little different. In addition, we asked for a few changes (e.g., using MPI_ERR_ARG instead of "13") that are incorporated here.
Thanks to @jsharpe for the PR
2015-12-12 12:40:34 -08:00
Ralph Castain
1db3db022a Don't be so prescriptive about the ess component to be used - we just need to protect against the proc incorrectly taking the singleton component, so rule that one out. Ensure that the other components understand that they are only for use by daemons. 2015-12-09 19:54:44 -08:00
Jeff Squyres
00c5dc9449 rml oob: C99-ification of structure member assignment 2015-12-08 17:05:16 -08:00
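The "C99-ification of structure member assignment" in commit 00c5dc9449 presumably refers to C99 designated initializers, which name each member instead of relying on declaration order. The sketch below contrasts the two styles. The struct, its members, and `example_send` are hypothetical stand-ins for illustration, not the actual rml/oob module definition.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical function-pointer module; NOT the real rml/oob structure. */
typedef int (*send_fn_t)(const char *msg);

struct example_module {
    send_fn_t send;
    send_fn_t send_urgent;
    void (*finalize)(void);
};

/* Hypothetical send routine used only to populate the module. */
int example_send(const char *msg)
{
    assert(msg != NULL);
    return 0;
}

/* Positional (C89-style) initialization: values must follow the
 * declaration order, so reordering the struct silently breaks it. */
struct example_module positional_module = {
    example_send,
    NULL,
    NULL
};

/* C99 designated initializers: each value is bound to a named member.
 * Members that are not listed (send_urgent here) are zero-initialized,
 * which for pointers means NULL. */
struct example_module designated_module = {
    .send = example_send,
    .finalize = NULL
};
```

With designated initializers, adding or reordering members in the struct does not change which function each slot receives, and omitted slots are guaranteed to be NULL.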
Howard Pritchard
cb7c26ce96 plm/slurm: add support for cray native slurm
Cray has added plugins to slurm to support
the Cray programming env (alpslli, cray pmi, etc).
Some of the workarounds needed with plm/alps
to avoid issues with Cray PMI getting mixed up
with orte launch system are also required in
a cray native slurm environment.

Signed-off-by: Howard Pritchard <howardp@lanl.gov>
2015-12-08 13:47:20 -06:00
Howard Pritchard
9548b8a9e8 plm/alps: add wlm detect infrastructure
Signed-off-by: Howard Pritchard <howardp@lanl.gov>
2015-12-07 07:43:20 -08:00
Ralph Castain
8823069fe9 Provide a mechanism by which a tool can request async progress thread support for ORTE 2015-12-04 08:26:57 -08:00
Mark Santcroos
3119bc14b2 Merge branch 'master' into fix/alpsinfov3 2015-11-13 08:53:06 -05:00
Ralph Castain
986a8c1d48 If an executable isn't found, it's possible for the state machine to hit the grpcomm with a zero-node map before we actually terminate with error. Silence the annoying malloc warning about zero-byte requests.
In a novm operation that only has the HNP, ensure the #nodes gets set

Clean up the error reporting
2015-11-11 14:24:13 -08:00
Jeff Squyres
8bd356549a orte proc_info.h: use symbolic names
This fix was actually applied in the v2.x branch first (as commit
open-mpi/ompi-release@a9b22afc1a).
2015-11-10 13:39:21 -08:00
Mark Santcroos
299fd69c6d Merge branch 'master' into fix/alpsinfov3 2015-11-10 15:40:19 -05:00
rhc54
474a869b8d Merge pull request #1121 from dmt4/orterun-manpage-typos
change -0bind-to and -bind-to to --bind-to in the manpages
2015-11-10 11:24:08 -08:00
Dimitar Pashov
9f6e306064 change -0bind-to and -bind-to to --bind-to in the manpages 2015-11-10 17:44:53 +00:00
Ralph Castain
6a607d42a6 Prevent a segfault on tools if a connection attempt fails - tools don't open the opal/pmix framework and thus have no way of looking up a proc hostname 2015-11-10 09:11:34 -08:00
Mark Santcroos
5ec2b4d98c Fix some messages in the process. 2015-11-09 18:03:26 -05:00
Mark Santcroos
8ec89001b3 Merge branch 'master' into fix/alpsinfov3 2015-11-09 02:45:23 -05:00
Ralph Castain
9b0cdc0de2 Add support for -pernode and -npernode options to orte-submit 2015-11-08 11:34:18 -08:00
Ralph Castain
f1483eb2dc Need to delay registration of the waitpid callback until after the fork/exec of the child process. Fix the bit testing of process type so that the proper state component gets selected for HNP. 2015-11-06 21:35:24 -08:00
Ralph Castain
5f446570d8 Work on cleaning up memory leaks that are causing orte-dvm to eventually run out of memory. Still don't have everything plugged, but getting better. Sync to the PMIx master that includes removal of the pmix_common.h.in file that really didn't need to be generated, and update to the PMIx_server_init API. 2015-11-06 14:15:30 -08:00
Mark Santcroos
a40b4eb2ee Support ALPS_APPINFO_VERSION 3. 2015-11-06 09:53:41 -05:00
Ralph Castain
ec0cc4bf21 Ensure that we completely register an nspace prior to launching local procs as otherwise we may attempt to send it down before it is registered, leading to data corruption 2015-11-05 20:51:56 -08:00
Ralph Castain
68996d6858 Move the argv_free back to the correct place - I blame Jeff for suggesting it was wrong to begin with 2015-11-05 07:57:54 -08:00
Ralph Castain
169c44258d Fix missing check 2015-11-03 19:00:28 -08:00
Ralph Castain
fe0c995f6b Fix a couple of minor issues identified by Jeff 2015-11-03 17:30:51 -08:00
Ralph Castain
186c18be0e Add missing cmd line options to mpirun man page, update NEWS to contain that change 2015-11-01 09:19:08 -08:00
Ralph Castain
0523f60479 Remove debug from orte-submit help output 2015-11-01 09:19:07 -08:00
rhc54
1fe27bf1dd Merge pull request #1084 from rhc54/topic/dashhost
Fix relative node syntax for dash-host option
2015-10-31 21:24:39 -07:00
Ralph Castain
8bfbe7f16c Add a new MCA parameter for default_dash_host to offer a mirror of the default_hostfile 2015-10-31 19:09:54 -07:00
Ralph Castain
24419b6523 Fix relative node syntax for dash-host option 2015-10-31 19:00:46 -07:00
rhc54
b23f1f3578 Merge pull request #1080 from federeghe/bugfixes
oob_tcp: fix peer->state wrong check
2015-10-31 16:09:23 -07:00
Ralph Castain
22dc05194e Minor cleanup - explicitly NULL the last member of a function pointer module. Should default to that anyway, but this is cosmetically nicer. 2015-10-30 08:19:55 -07:00
Federico Reghenzani
6536a6a9f5 oob_tcp: fix peer->state wrong check 2015-10-29 16:43:58 +01:00
Ralph Castain
267ca8fcd3 Cleanup the PMIx direct modex support. Add an MCA parameter pmix_base_async_modex that will cause the async modex to be used when set to 1. Default it to 0 for now to continue current default behavior.

Also add an MCA param pmix_base_collect_data to direct that the blocking fence shall return all data to each process. Obviously, this param has no effect if async_modex is used.
2015-10-27 17:31:56 -07:00
rhc54
3ffbf08283 Merge pull request #1068 from marksantcroos/master
Make odsl debug message consistent.
2015-10-24 08:11:11 -07:00
Mark Santcroos
30aab75b86 Make message consistent. 2015-10-24 13:40:03 +02:00
Ralph Castain
6506b0a5e5 Resolve a race condition that prevented the sigchild callback from being registered before short-lived apps terminated
Thanks to Mark Santcroos for the assistance in tracking it down.
2015-10-23 21:02:31 -07:00
Nathan Hjelm
9602484568 Merge pull request #1040 from hjelmn/mtl_priority
Change how cm's priority is calculated
2015-10-19 14:18:36 -06:00
Nathan Hjelm
8b5810f7f7 mca/base: add priority output to mca_base_select
The mca_base_select function uses returned priorities to select the
best component/module. This priority may be of use to the caller so
pass that information back in an optional argument. If the priority is
not needed pass NULL.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-10-19 12:32:41 -06:00
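Commit 8b5810f7f7 describes returning the winning priority through an optional argument that callers may set to NULL. A minimal sketch of that optional out-parameter pattern follows. The `candidate` type and `select_best` function are hypothetical illustrations and do not reproduce the real `mca_base_select` signature.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical candidate record; NOT the actual MCA component type. */
struct candidate {
    const char *name;
    int priority;
};

/*
 * Return the highest-priority candidate, or NULL if the list is empty.
 * If priority_out is non-NULL, the winning priority is stored there.
 * Callers that do not need the priority pass NULL.
 */
const struct candidate *
select_best(const struct candidate *items, size_t count, int *priority_out)
{
    const struct candidate *best = NULL;

    assert(items != NULL || count == 0);

    for (size_t i = 0; i < count; ++i) {
        if (best == NULL || items[i].priority > best->priority) {
            best = &items[i];
        }
    }

    /* Write the priority only when a winner exists and the caller asked. */
    if (best != NULL && priority_out != NULL) {
        *priority_out = best->priority;
    }
    return best;
}
```

A caller interested in the priority passes the address of an `int`; one that is not passes `NULL` and the function skips the write. When no candidate is selected, the caller's variable is left untouched.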
Ralph Castain
363f62a506 Fix singleton operations when running under a SLURM allocation. Sadly, SLURM's PMI will return success even if the PMI server isn't actually available. This leads to erroneous selection of pmix and ess components. So add a further requirement (namely, that we see a job_step envar) to the SLURM pmix components along with some modification of ess selection code to avoid the problem 2015-10-17 20:24:03 -07:00
Jeff Squyres
62351f442a help: remove stale help messages and files
Found by contrib/check-help-strings.pl.
2015-10-13 16:50:20 -04:00
Jeff Squyres
f9e9b69d93 Merge pull request #1001 from igor-ivanov/master
orte/mca/rmaps: Improve orte_rmaps_dist_device help message
2015-10-09 14:07:47 -04:00