1
1

3878 Коммитов

Автор SHA1 Сообщение Дата
Nathan Hjelm
fab1eca536 grpcomm: fix bugs in grpcomm algorithms
This commit fixes multiple issues in the bruck's and recursive
doubling grpcomm algorithms. The following changes are included:

 - Use the existing bitmap implementation instead of implementing a
   new one. There were bugs in the implementation that caused an
   overrun of the bitmap array.

 - Clean up the algorithms to eliminate errors.

 - Send as little extra data as possible in the bruck's
   algorithm.

The changes were testest with various numbers of ortes varying from 1
to 4096.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-01-07 10:12:08 -07:00
Ralph Castain
f53d3c7a18 Silence warning 2015-12-30 10:16:58 -08:00
Ralph Castain
0a6b8d2c14 Correctly handle connection terminations during finalize so mpirun doesn't hang. Cleanup some corner cases in the error notification system 2015-12-30 07:16:43 -08:00
Ralph Castain
1cdc1c121c Revert "Standardize the handling of shutdown in the OOB TCP component"
This reverts commit open-mpi/ompi@12dccaa911.
2015-12-30 07:05:40 -08:00
Ralph Castain
12dccaa911 Standardize the handling of shutdown in the OOB TCP component 2015-12-29 07:57:22 -08:00
Gilles Gouaillardet
352b05a552 rmaps: warn if oversubscribing when manually setting the number of hosts
This is a port of the v1.10 series one-off open-mpi/ompi-release@8c5ce45ab6
2015-12-28 10:38:57 +09:00
Ralph Castain
8ab28cdc82 Fix a typo that causes segfaults on multi-node executions 2015-12-24 08:43:47 -08:00
rhc54
d7199dc75b Merge pull request #1255 from annu13/fixup
Fixup
2015-12-22 20:54:48 -08:00
annu13
43f44f31c1 moved code to job setup first before enabling comm 2015-12-22 14:30:59 -08:00
Howard Pritchard
39367ca0bf plm/alps: only use srun for Native SLURM
Turns out that the way the SLURM plm works
is not compatible with the way MPI processes
on Cray XC obtain RDMA credentials to use
the high speed network.  Unlike with ALPS,
the mpirun process is on the first compute
node in the job.  With the current PLM launch
system, mpirun (HNP daemon) launches the MPI
ranks on that node rather than relying on
srun.

This will probably require a significant amount
of effort to rework to support Native SLURM
on Cray XC's.  As a short term alternative,
have the alps plm (which gets selected by default
again on Cray systems regardless of the launch system)
check whether or not srun or alps is being used on the
system.  If alps is not being used, print a helpful
message for the user and abort the job launch.

Signed-off-by: Howard Pritchard <howardp@lanl.gov>
2015-12-22 11:03:42 -08:00
rhc54
d9cd451a16 Merge pull request #1250 from rhc54/topic/rf
Fix the default slot mapping in rank file mapper
2015-12-21 10:57:52 -08:00
Ralph Castain
7cc5879bdd Fix the default slot mapping in rank file mapper 2015-12-21 09:47:27 -08:00
Ralph Castain
94ffe10808 Do not override any external settings for PMIx component selection 2015-12-21 08:36:12 -08:00
Jeff Squyres
53ca721ff4 configury: clean up .so version numbers
Move .so version numbers to their appropriate project in the top-level
VERSION file.  Also add the project name to all .so version number
names.  Remove no-longer-used .so names.
2015-12-18 12:50:23 -05:00
Ralph Castain
64b695669a Cleanup warnings in opal and orte layers when building optimized on Mac 2015-12-17 07:51:24 -08:00
Howard Pritchard
7a82174747 Merge pull request #1195 from hppritcha/topic/wlm_detect
support Cray nativized slurm environment
2015-12-15 07:58:53 -07:00
Jeff Squyres
3e308f41f7 rmaps base help: update binding error messages
Due to user confusion, update the show-help messages displayed when
processor and/or memory binding fails.  Thanks to Dave Love
(@loveshack) for the initial suggestion.

Fixes open-mpi/ompi#1087
2015-12-14 13:02:41 -05:00
Ralph Castain
1db3db022a Don't be so prescriptive about the ess component to be used - we just need to protect against the proc incorrectly taking the singleton component, so rule that one out. Ensure that the other components understand that they are only for use by daemons. 2015-12-09 19:54:44 -08:00
Jeff Squyres
00c5dc9449 rml oob: C99-ification of structure member assignment 2015-12-08 17:05:16 -08:00
Howard Pritchard
cb7c26ce96 plm/slurm: add support for cray native slurm
Cray has added plugins to slurm to support
the Cray programming env (alpslli, cray pmi, etc).
Some of the workarounds needed with plm/alps
to avoid issues with Cray PMI getting mixed up
with orte launch system are also required in
a cray native slurm environment.

Signed-off-by: Howard Pritchard <howardp@lanl.gov>
2015-12-08 13:47:20 -06:00
Howard Pritchard
9548b8a9e8 plm/alps: add wlm detect infrastructure
Signed-off-by: Howard Pritchard <howardp@lanl.gov>
2015-12-07 07:43:20 -08:00
Ralph Castain
8823069fe9 Provide a mechanism by which a tool can request async progress thread support for ORTE 2015-12-04 08:26:57 -08:00
Mark Santcroos
3119bc14b2 Merge branch 'master' into fix/alpsinfov3 2015-11-13 08:53:06 -05:00
Ralph Castain
986a8c1d48 If an executable isn't found, it's possible for the state machine to hit the grpcomm with a zero-node map before we actually terminate with error. Silence the annoying malloc warning about zero-byte requests.
In a novm operation that only has the HNP, ensure the #nodes gets set

Clean up the error reporting
2015-11-11 14:24:13 -08:00
Mark Santcroos
5ec2b4d98c Fix some messages in the process. 2015-11-09 18:03:26 -05:00
Mark Santcroos
8ec89001b3 Merge branch 'master' into fix/alpsinfov3 2015-11-09 02:45:23 -05:00
Ralph Castain
f1483eb2dc Need to delay registration of the waitpid callback until after the fork/exec of the child process. Fix the bit testing of process type so that the proper state component gets selected for HNP. 2015-11-06 21:35:24 -08:00
Ralph Castain
5f446570d8 Work on cleaning up memory leaks that are causing orte-dvm to eventually run out of memory. Still don't have everything plugged, but getting better. Sync to the PMIx master that includes removal of the pmix_common.h.in file that really didn't need to be generated, and update to the PMIx_server_init API. 2015-11-06 14:15:30 -08:00
Mark Santcroos
a40b4eb2ee Support ALPS_APPINFO_VERSION 3. 2015-11-06 09:53:41 -05:00
rhc54
1fe27bf1dd Merge pull request #1084 from rhc54/topic/dashhost
Fix relative node syntax for dash-host option
2015-10-31 21:24:39 -07:00
Ralph Castain
24419b6523 Fix relative node syntax for dash-host option 2015-10-31 19:00:46 -07:00
Federico Reghenzani
6536a6a9f5 oob_tcp: fix peer->state wrong check 2015-10-29 16:43:58 +01:00
Mark Santcroos
30aab75b86 Make message consistent. 2015-10-24 13:40:03 +02:00
Nathan Hjelm
9602484568 Merge pull request #1040 from hjelmn/mtl_priority
Change how cm's priority is calculated
2015-10-19 14:18:36 -06:00
Nathan Hjelm
8b5810f7f7 mca/base: add priority output to mca_base_select
The mca_base_select function uses returned priorities to select the
best component/module. This priority may be of use to the caller so
pass that information back in an optional argument. If the priority is
not needed pass NULL.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-10-19 12:32:41 -06:00
Ralph Castain
363f62a506 Fix singleton operations when running under a SLURM allocation. Sadly, SLURM's PMI will return success even if the PMI server isn't actually available. This leads to erroneous selection of pmix and ess components. So add a further requirement (namely, that we see a job_step envar) to the SLURM pmix components along with some modification of ess selection code to avoid the problem 2015-10-17 20:24:03 -07:00
Jeff Squyres
62351f442a help: remove stale help messages and files
Found by contrib/check-help-strings.pl.
2015-10-13 16:50:20 -04:00
Jeff Squyres
f9e9b69d93 Merge pull request #1001 from igor-ivanov/master
orte/mca/rmaps: Improve orte_rmaps_dist_device help message
2015-10-09 14:07:47 -04:00
Igor Ivanov
489f27f8e9 orte/mca/rmaps: Improve orte_rmaps_dist_device help message
See: https://github.com/open-mpi/ompi/issues/953
2015-10-09 17:58:07 +03:00
rhc54
232f97a80c Merge pull request #968 from JohnWestlund/master
simplify use of sockaddr* structs to work around buffer overflow warning
2015-10-07 17:42:19 -07:00
Howard Pritchard
d899320574 odls/alps: close the directory
Close the /proc/self/fd dir after checking for open fds.

Signed-off-by: Howard Pritchard <howardp@lanl.gov>
2015-10-06 11:13:44 -07:00
John Westlund
044fea8df7 re-order != comparison, OBJ_RELEASE mca_oob_tcp_addr_t on failure 2015-10-02 15:59:48 -07:00
John Westlund
6bfaa925ec simplify use of sockaddr* structs to work around buffer overflow warning 2015-10-02 14:26:52 -07:00
Gilles Gouaillardet
0445484820 ras: remove orte_ras_proc_t and associated code 2015-09-30 08:52:52 +09:00
Gilles Gouaillardet
7cc14ee6f6 orte/rmaps: silence warning 2015-09-29 16:05:52 +09:00
Ralph Castain
fad5638596 Resolve the naming issue when direct-launched by PMIx-enabled RMs using a minimal-impact approach. Detect if we were launched via ORTE - if so, then use our standard methods for computing the jobid. If not, then just hash the nspace to create the jobid, and track the jobid <-> nspace correspondece down in the opal/mca/pmix/pmix1xx component. We then do the translation any time a function that passes process names is invoked. 2015-09-27 09:57:59 -07:00
Ralph Castain
0140ff048d Now that we have an "isolated" PLM component, we cannot just let rsh silently decline to run when it cannot find a launch agent - if we do, then we will -always- run on the local node. So if the user specifies a launch agent and we can't find it, then generate a pretty error message, report a fatal error back to the component select, and exit out.
This required modifying the mca_component_select function to actually check the return code on a component query - it was blissfully ignoring it.

Also do a little cleanup to avoid bombarding the user with multiple error messages.

Thanks to Patrick Begou for reporting the problem
2015-09-24 07:16:48 -07:00
Ralph Castain
749bd4e6fe Plug a few memory leaks identified by valgrind 2015-09-23 15:21:04 -07:00
Ralph Castain
f28448702a Eliminate malloc by utilizing /proc/self/fd - optimization 2015-09-22 07:24:54 -07:00
Ralph Castain
f872e99315 Fix orte-submit so it allows application procs to select the correct ess component. Protect orte_data_server from multiple calls to finalize. 2015-09-21 20:31:57 -07:00