Ralph Castain
4dad5de8ff
Silence a couple of warnings - strncpy returns a char*, not an int
2016-01-16 09:44:52 -08:00
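For context on the warning silenced above: the C standard declares `strncpy` as returning its destination pointer (`char *`), so assigning or comparing the result as an `int` draws a diagnostic. The following is a minimal sketch of a call that uses the return value correctly. The helper name `copy_truncated` is hypothetical and is not taken from the Open MPI tree.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not from the Open MPI source): copy src into dst,
 * truncating if necessary, and always NUL-terminate the result. */
char *copy_truncated(char *dst, size_t dstlen, const char *src)
{
    if (NULL == dst || NULL == src || 0 == dstlen) {
        return NULL;
    }
    /* strncpy returns dst (a char *), never an error code or a length. */
    char *ret = strncpy(dst, src, dstlen - 1);
    /* strncpy leaves dst unterminated when src has dstlen-1 or more chars. */
    ret[dstlen - 1] = '\0';
    return ret;
}
```

Calling `copy_truncated(buf, sizeof(buf), "abcdef")` with a 4-byte `buf` returns `buf` itself, holding the string "abc".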
Jeff Squyres
60ffe713b8
common syms: whitelist bison-generated common symbols
...
Bison generates some common symbols that we can't do anything about,
so whitelist them.
2016-01-16 03:53:14 -08:00
Gilles Gouaillardet
1d38430e43
opal: replace opal_convert_jobid_to_string with opal_snprintf_jobid
2016-01-14 10:39:03 +09:00
Gilles Gouaillardet
4c43fb2a50
orte_rmaps_base_map_job: set OPAL_BIND_ALLOW_OVERLOAD when needed
2016-01-13 17:13:36 +09:00
Ralph Castain
332019b43a
Silence warning
2016-01-10 09:59:36 -08:00
Nathan Hjelm
fab1eca536
grpcomm: fix bugs in grpcomm algorithms
...
This commit fixes multiple issues in the bruck's and recursive
doubling grpcomm algorithms. The following changes are included:
- Use the existing bitmap implementation instead of implementing a
new one. There were bugs in the implementation that caused an
overrun of the bitmap array.
- Clean up the algorithms to eliminate errors.
- Send as little extra data as possible in the bruck's
algorithm.
The changes were tested with various numbers of ortes varying from 1
to 4096.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-01-07 10:12:08 -07:00
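For readers unfamiliar with recursive doubling: in the textbook form of the algorithm, with a power-of-two number of participants, each participant exchanges data in round k with the peer whose rank differs from its own in bit k. The sketch below illustrates only that textbook pairing rule. It is not the grpcomm code, and the function names are made up for the example.

```c
#include <stdio.h>

/* Textbook recursive doubling: in round k, rank r pairs with r XOR 2^k. */
unsigned rd_partner(unsigned rank, unsigned round)
{
    return rank ^ (1u << round);
}

/* Number of rounds needed for nprocs participants (nprocs a power of two). */
unsigned rd_rounds(unsigned nprocs)
{
    unsigned rounds = 0;
    while ((1u << rounds) < nprocs) {
        ++rounds;
    }
    return rounds;
}

/* Print the exchange schedule for one rank. */
void rd_print_schedule(unsigned rank, unsigned nprocs)
{
    unsigned total = rd_rounds(nprocs);
    for (unsigned k = 0; k < total; ++k) {
        printf("round %u: rank %u <-> rank %u\n", k, rank, rd_partner(rank, k));
    }
}
```

The pairing is symmetric: applying `rd_partner` twice with the same round returns the original rank, which is why both sides of an exchange agree on who their peer is.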
Ralph Castain
f53d3c7a18
Silence warning
2015-12-30 10:16:58 -08:00
Ralph Castain
0a6b8d2c14
Correctly handle connection terminations during finalize so mpirun doesn't hang. Cleanup some corner cases in the error notification system
2015-12-30 07:16:43 -08:00
Ralph Castain
1cdc1c121c
Revert "Standardize the handling of shutdown in the OOB TCP component"
...
This reverts commit open-mpi/ompi@12dccaa911.
2015-12-30 07:05:40 -08:00
Ralph Castain
12dccaa911
Standardize the handling of shutdown in the OOB TCP component
2015-12-29 07:57:22 -08:00
rhc54
5dfb7ac396
Merge pull request #1266 from ggouaillardet/topic/misc_pmix_fixes
...
Topic/misc pmix fixes
2015-12-29 07:02:44 -08:00
Ralph Castain
810f2446b7
Add pmix120 component, update the error handling functions in the PMIx API.
...
Update the configure logic for the new pmix120 component
ckpt
Get the pmix120 component to work - still not really registering or handling notifications, but infrastructure now operates
Cleanup some of the symbol scopes, and provide a more comprehensive rename.h file. Will pretty it up later - let's see how this works
Cleanup the rename files to use the pretty macros
2015-12-28 23:15:44 +09:00
Gilles Gouaillardet
352b05a552
rmaps: warn if oversubscribing when manually setting the number of hosts
...
This is a port of the v1.10 series one-off open-mpi/ompi-release@8c5ce45ab6
2015-12-28 10:38:57 +09:00
Ralph Castain
8ab28cdc82
Fix a typo that causes segfaults on multi-node executions
2015-12-24 08:43:47 -08:00
rhc54
d7199dc75b
Merge pull request #1255 from annu13/fixup
...
Fixup
2015-12-22 20:54:48 -08:00
annu13
43f44f31c1
moved code to job setup first before enabling comm
2015-12-22 14:30:59 -08:00
Howard Pritchard
39367ca0bf
plm/alps: only use srun for Native SLURM
...
Turns out that the way the SLURM plm works
is not compatible with the way MPI processes
on Cray XC obtain RDMA credentials to use
the high speed network. Unlike with ALPS,
the mpirun process is on the first compute
node in the job. With the current PLM launch
system, mpirun (HNP daemon) launches the MPI
ranks on that node rather than relying on
srun.
This will probably require a significant amount
of effort to rework to support Native SLURM
on Cray XC's. As a short term alternative,
have the alps plm (which gets selected by default
again on Cray systems regardless of the launch system)
check whether or not srun or alps is being used on the
system. If alps is not being used, print a helpful
message for the user and abort the job launch.
Signed-off-by: Howard Pritchard <howardp@lanl.gov>
2015-12-22 11:03:42 -08:00
rhc54
d9cd451a16
Merge pull request #1250 from rhc54/topic/rf
...
Fix the default slot mapping in rank file mapper
2015-12-21 10:57:52 -08:00
Ralph Castain
7cc5879bdd
Fix the default slot mapping in rank file mapper
2015-12-21 09:47:27 -08:00
Ralph Castain
94ffe10808
Do not override any external settings for PMIx component selection
2015-12-21 08:36:12 -08:00
Jeff Squyres
53ca721ff4
configury: clean up .so version numbers
...
Move .so version numbers to their appropriate project in the top-level
VERSION file. Also add the project name to all .so version number
names. Remove no-longer-used .so names.
2015-12-18 12:50:23 -05:00
Ralph Castain
64b695669a
Cleanup warnings in opal and orte layers when building optimized on Mac
2015-12-17 07:51:24 -08:00
Ralph Castain
3a56f0d34b
Create the pmix external component. Fix a few places where opal/util/argv.h was required when building with an external pmix (go figure).
...
NOTE: Building with external pmix *requires* that you also build with external libevent and hwloc libraries. Detect this at configure and error out with large message if this requirement is violated.
Closes #1204 (replaces it)
Fixes #1064
2015-12-15 15:26:13 -08:00
Howard Pritchard
7a82174747
Merge pull request #1195 from hppritcha/topic/wlm_detect
...
support Cray nativized slurm environment
2015-12-15 07:58:53 -07:00
Jeff Squyres
3e308f41f7
rmaps base help: update binding error messages
...
Due to user confusion, update the show-help messages displayed when
processor and/or memory binding fails. Thanks to Dave Love
(@loveshack) for the initial suggestion.
Fixes open-mpi/ompi#1087
2015-12-14 13:02:41 -05:00
Ralph Castain
03eb1a80bf
Update the PMIx native component to release v1.1.1, with addition of one bug-fix commit beyond the official release
...
Rename the pmix1xx component to pmix111 so it reflects the actual release it includes
Resolve the problem of PMIx being passed a bogus --with-platform argument when configuring the PMIx tarball code. There is no reason we should be passing --with-platform arguments to any internal subdirectory, so just leave that out when constructing the opal_subdir_args variable.
Update the PMIx code and continue attempting to debug direct modex
Fix a problem in the ORTE PMIx server - there was an early intent to optimize the direct modex by fetching data for all procs from the target job on the remote node, instead of fetching the data one proc at a time. However, this was never completely implemented, and so we would hang if we had multiple overlapping requests for data from more than one proc on the node.
Update PMIx to v1.1.2
2015-12-12 18:46:38 -08:00
Ralph Castain
5e5adebf8e
Port the changes from #782 to the master. Not everything applies here as the code in the 1.10 series is a little different. In addition, we asked for a few changes (e.g., using MPI_ERR_ARG instead of "13") that are incorporated here.
...
Thanks to @jsharpe for the PR
2015-12-12 12:40:34 -08:00
Ralph Castain
1db3db022a
Don't be so prescriptive about the ess component to be used - we just need to protect against the proc incorrectly taking the singleton component, so rule that one out. Ensure that the other components understand that they are only for use by daemons.
2015-12-09 19:54:44 -08:00
Jeff Squyres
00c5dc9449
rml oob: C99-ification of structure member assignment
2015-12-08 17:05:16 -08:00
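As an illustration of what "C99-ification of structure member assignment" typically means, the sketch below uses C99 designated initializers on a struct of function pointers. The type and function names are hypothetical and do not come from the rml/oob sources.

```c
#include <stddef.h>

/* Hypothetical module of function pointers, for illustration only. */
typedef struct {
    int  (*init)(void);
    int  (*send)(const void *buf, size_t len);
    void (*finalize)(void);
} demo_module_t;

int demo_init(void) { return 0; }
int demo_send(const void *buf, size_t len) { (void)buf; return (int)len; }

/* C99 designated initializers name each member, so the initialization no
 * longer depends on declaration order. Members left out (finalize here)
 * are zero-initialized, which means NULL for pointers. */
demo_module_t demo_module = {
    .init = demo_init,
    .send = demo_send,
};
```

Because omitted members are zero-initialized, explicitly writing `.finalize = NULL` changes nothing for the compiler. It only documents the intent for readers.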
Howard Pritchard
cb7c26ce96
plm/slurm: add support for cray native slurm
...
Cray has added plugins to slurm to support
the Cray programming env (alpslli, cray pmi, etc).
Some of the workarounds needed with plm/alps
to avoid issues with Cray PMI getting mixed up
with orte launch system are also required in
a cray native slurm environment.
Signed-off-by: Howard Pritchard <howardp@lanl.gov>
2015-12-08 13:47:20 -06:00
Howard Pritchard
9548b8a9e8
plm/alps: add wlm detect infrastructure
...
Signed-off-by: Howard Pritchard <howardp@lanl.gov>
2015-12-07 07:43:20 -08:00
Ralph Castain
8823069fe9
Provide a mechanism by which a tool can request async progress thread support for ORTE
2015-12-04 08:26:57 -08:00
Mark Santcroos
3119bc14b2
Merge branch 'master' into fix/alpsinfov3
2015-11-13 08:53:06 -05:00
Ralph Castain
986a8c1d48
If an executable isn't found, it's possible for the state machine to hit the grpcomm with a zero-node map before we actually terminate with error. Silence the annoying malloc warning about zero-byte requests.
...
In a novm operation that only has the HNP, ensure the #nodes gets set
Clean up the error reporting
2015-11-11 14:24:13 -08:00
Jeff Squyres
8bd356549a
orte proc_info.h: use symbolic names
...
This fix was actually applied in the v2.x branch first (as commit
open-mpi/ompi-release@a9b22afc1a).
2015-11-10 13:39:21 -08:00
Mark Santcroos
299fd69c6d
Merge branch 'master' into fix/alpsinfov3
2015-11-10 15:40:19 -05:00
rhc54
474a869b8d
Merge pull request #1121 from dmt4/orterun-manpage-typos
...
change -0bind-to and -bind-to to --bind-to in the manpages
2015-11-10 11:24:08 -08:00
Dimitar Pashov
9f6e306064
change -0bind-to and -bind-to to --bind-to in the manpages
2015-11-10 17:44:53 +00:00
Ralph Castain
6a607d42a6
Prevent a segfault on tools if a connection attempt fails - tools don't open the opal/pmix framework and thus have no way of looking up a proc hostname
2015-11-10 09:11:34 -08:00
Mark Santcroos
5ec2b4d98c
Fix some messages in the process.
2015-11-09 18:03:26 -05:00
Mark Santcroos
8ec89001b3
Merge branch 'master' into fix/alpsinfov3
2015-11-09 02:45:23 -05:00
Ralph Castain
9b0cdc0de2
Add support for -pernode and -npernode options to orte-submit
2015-11-08 11:34:18 -08:00
Ralph Castain
f1483eb2dc
Need to delay registration of the waitpid callback until after the fork/exec of the child process. Fix the bit testing of process type so that the proper state component gets selected for HNP.
2015-11-06 21:35:24 -08:00
Ralph Castain
5f446570d8
Work on cleaning up memory leaks that are causing orte-dvm to eventually run out of memory. Still don't have everything plugged, but getting better. Sync to the PMIx master that includes removal of the pmix_common.h.in file that really didn't need to be generated, and update to the PMIx_server_init API.
2015-11-06 14:15:30 -08:00
Mark Santcroos
a40b4eb2ee
Support ALPS_APPINFO_VERSION 3.
2015-11-06 09:53:41 -05:00
Ralph Castain
ec0cc4bf21
Ensure that we completely register an nspace prior to launching local procs as otherwise we may attempt to send it down before it is registered, leading to data corruption
2015-11-05 20:51:56 -08:00
Ralph Castain
68996d6858
Move the argv_free back to the correct place - I blame Jeff for suggesting it was wrong to begin with
2015-11-05 07:57:54 -08:00
Ralph Castain
169c44258d
Fix missing check
2015-11-03 19:00:28 -08:00
Ralph Castain
fe0c995f6b
Fix a couple of minor issues identified by Jeff
2015-11-03 17:30:51 -08:00
Ralph Castain
186c18be0e
Add missing cmd line options to mpirun man page, update NEWS to contain that change
2015-11-01 09:19:08 -08:00
Ralph Castain
0523f60479
Remove debug from orte-submit help output
2015-11-01 09:19:07 -08:00
rhc54
1fe27bf1dd
Merge pull request #1084 from rhc54/topic/dashhost
...
Fix relative node syntax for dash-host option
2015-10-31 21:24:39 -07:00
Ralph Castain
8bfbe7f16c
Add a new MCA parameter for default_dash_host to offer a mirror of the default_hostfile
2015-10-31 19:09:54 -07:00
Ralph Castain
24419b6523
Fix relative node syntax for dash-host option
2015-10-31 19:00:46 -07:00
rhc54
b23f1f3578
Merge pull request #1080 from federeghe/bugfixes
...
oob_tcp: fix peer->state wrong check
2015-10-31 16:09:23 -07:00
Ralph Castain
22dc05194e
Minor cleanup - explicitly NULL the last member of a function pointer module. Should default to that anyway, but this is cosmetically nicer.
2015-10-30 08:19:55 -07:00
Federico Reghenzani
6536a6a9f5
oob_tcp: fix peer->state wrong check
2015-10-29 16:43:58 +01:00
Ralph Castain
267ca8fcd3
Cleanup the PMIx direct modex support. Add an MCA parameter pmix_base_async_modex that will cause the async modex to be used when set to 1. Default it to 0 for now
...
to continue current default behavior.
Also add an MCA param pmix_base_collect_data to direct that the blocking fence shall return all data to each process. Obviously, this param has no effect if async_modex is used.
2015-10-27 17:31:56 -07:00
rhc54
3ffbf08283
Merge pull request #1068 from marksantcroos/master
...
Make odls debug message consistent.
2015-10-24 08:11:11 -07:00
Mark Santcroos
30aab75b86
Make message consistent.
2015-10-24 13:40:03 +02:00
Ralph Castain
6506b0a5e5
Resolve a race condition that prevented the sigchild callback from being registered before short-lived apps terminated
...
Thanks to Mark Santcroos for the assistance in tracking it down.
2015-10-23 21:02:31 -07:00
Nathan Hjelm
9602484568
Merge pull request #1040 from hjelmn/mtl_priority
...
Change how cm's priority is calculated
2015-10-19 14:18:36 -06:00
Nathan Hjelm
8b5810f7f7
mca/base: add priority output to mca_base_select
...
The mca_base_select function uses returned priorities to select the
best component/module. This priority may be of use to the caller so
pass that information back in an optional argument. If the priority is
not needed pass NULL.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-10-19 12:32:41 -06:00
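The optional out-argument pattern described in this commit (pass a pointer to receive the winning priority, or NULL if it is not needed) can be sketched as follows. This is an illustration under stated assumptions: `select_best` and its parameters are hypothetical, and this is not the real `mca_base_select` signature.

```c
#include <stddef.h>

/* Hypothetical selection routine; not the real mca_base_select signature.
 * Returns the index of the highest-priority entry, or -1 if there is none.
 * If priority_out is non-NULL, the winning priority is stored through it. */
int select_best(const int *priorities, size_t n, int *priority_out)
{
    if (NULL == priorities || 0 == n) {
        return -1;
    }
    size_t best = 0;
    for (size_t i = 1; i < n; ++i) {
        if (priorities[i] > priorities[best]) {
            best = i;
        }
    }
    /* Callers that do not care about the priority simply pass NULL. */
    if (NULL != priority_out) {
        *priority_out = priorities[best];
    }
    return (int)best;
}
```

Making the extra argument optional keeps existing call sites simple: they pass NULL and behave as before, while new callers can read the priority back.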
Ralph Castain
363f62a506
Fix singleton operations when running under a SLURM allocation. Sadly, SLURM's PMI will return success even if the PMI server isn't actually available. This leads to erroneous selection of pmix and ess components. So add a further requirement (namely, that we see a job_step envar) to the SLURM pmix components along with some modification of ess selection code to avoid the problem
2015-10-17 20:24:03 -07:00
Jeff Squyres
62351f442a
help: remove stale help messages and files
...
Found by contrib/check-help-strings.pl.
2015-10-13 16:50:20 -04:00
Jeff Squyres
f9e9b69d93
Merge pull request #1001 from igor-ivanov/master
...
orte/mca/rmaps: Improve orte_rmaps_dist_device help message
2015-10-09 14:07:47 -04:00
Igor Ivanov
489f27f8e9
orte/mca/rmaps: Improve orte_rmaps_dist_device help message
...
See: https://github.com/open-mpi/ompi/issues/953
2015-10-09 17:58:07 +03:00
rhc54
232f97a80c
Merge pull request #968 from JohnWestlund/master
...
simplify use of sockaddr* structs to work around buffer overflow warning
2015-10-07 17:42:19 -07:00
Howard Pritchard
d899320574
odls/alps: close the directory
...
Close the /proc/self/fd dir after checking for open fds.
Signed-off-by: Howard Pritchard <howardp@lanl.gov>
2015-10-06 11:13:44 -07:00
Igor Ivanov
d379873443
oshmem: Add man.1 pages for oshmem tools
...
This change adds man pages for oshrun, oshcc and oshfort as well as
the deprecated shmemrun, shmemcc and shmemfort.
2015-10-05 15:41:28 +03:00
John Westlund
044fea8df7
re-order != comparison, OBJ_RELEASE mca_oob_tcp_addr_t on failure
2015-10-02 15:59:48 -07:00
John Westlund
6bfaa925ec
simplify use of sockaddr* structs to work around buffer overflow warning
2015-10-02 14:26:52 -07:00
Ralph Castain
8f6855459d
Cleanup some coverity warnings
2015-09-30 10:33:53 -07:00
Gilles Gouaillardet
0445484820
ras: remove orte_ras_proc_t and associated code
2015-09-30 08:52:52 +09:00
Gilles Gouaillardet
7cc14ee6f6
orte/rmaps: silence warning
2015-09-29 16:05:52 +09:00
Ralph Castain
fad5638596
Resolve the naming issue when direct-launched by PMIx-enabled RMs using a minimal-impact approach. Detect if we were launched via ORTE - if so, then use our standard methods for computing the jobid. If not, then just hash the nspace to create the jobid, and track the jobid <-> nspace correspondence down in the opal/mca/pmix/pmix1xx component. We then do the translation any time a function that passes process names is invoked.
2015-09-27 09:57:59 -07:00
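A rough sketch of the "hash the nspace to create the jobid" idea follows. The commit does not say which hash function is used, so the djb2 string hash below is an assumption chosen purely for illustration, and `nspace_to_jobid` is a hypothetical name.

```c
#include <stdint.h>

/* djb2 string hash, chosen here purely for illustration; the commit does
 * not say which hash the pmix1xx component actually uses. */
uint32_t nspace_to_jobid(const char *nspace)
{
    uint32_t hash = 5381;
    for (const unsigned char *p = (const unsigned char *)nspace; *p != '\0'; ++p) {
        hash = hash * 33u + *p;
    }
    return hash;
}
```

A hash like this is one-way, which is consistent with the commit's note that the jobid <-> nspace correspondence has to be tracked separately so that names can be translated in both directions.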
Ralph Castain
0140ff048d
Now that we have an "isolated" PLM component, we cannot just let rsh silently decline to run when it cannot find a launch agent - if we do, then we will -always- run on the local node. So if the user specifies a launch agent and we can't find it, then generate a pretty error message, report a fatal error back to the component select, and exit out.
...
This required modifying the mca_component_select function to actually check the return code on a component query - it was blissfully ignoring it.
Also do a little cleanup to avoid bombarding the user with multiple error messages.
Thanks to Patrick Begou for reporting the problem
2015-09-24 07:16:48 -07:00
Ralph Castain
749bd4e6fe
Plug a few memory leaks identified by valgrind
2015-09-23 15:21:04 -07:00
Ralph Castain
f28448702a
Eliminate malloc by utilizing /proc/self/fd - optimization
2015-09-22 07:24:54 -07:00
Ralph Castain
f872e99315
Fix orte-submit so it allows application procs to select the correct ess component. Protect orte_data_server from multiple calls to finalize.
2015-09-21 20:31:57 -07:00
Howard Pritchard
ef6cf50687
Merge pull request #917 from hppritcha/topic/alps_warning_swat
...
oob/alps: swat compiler warning
2015-09-21 16:17:30 -06:00
Howard Pritchard
8d7e759b85
oob/alps: swat compiler warning
...
swat some alps related compiler warnings when using --enable-picky
Signed-off-by: Howard Pritchard <howardp@lanl.gov>
2015-09-21 14:24:26 -07:00
Ralph Castain
92ae386a34
As Jeff proposed, change the check to looking for the filename's first character to be a digit
2015-09-21 08:22:58 -07:00
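The descriptor cleanup discussed in this group of odls commits (walk /proc/self/fd and treat only entries whose name starts with a digit as descriptors) can be sketched as below. This is a Linux-oriented illustration, not the odls code: `close_fds_from` is a hypothetical name, and the sketch assumes a mounted /proc and the POSIX.1-2008 `dirfd` call.

```c
#include <ctype.h>
#include <dirent.h>
#include <stdlib.h>
#include <unistd.h>

/* Close every open descriptor >= min_fd by walking /proc/self/fd.
 * Returns the number of descriptors closed, or -1 if the directory cannot
 * be opened (e.g. no /proc), in which case the caller needs a fallback. */
int close_fds_from(int min_fd)
{
    DIR *dir = opendir("/proc/self/fd");
    if (NULL == dir) {
        return -1;
    }
    int dir_fd = dirfd(dir);   /* must stay open while we iterate */
    int closed = 0;
    struct dirent *entry;
    while (NULL != (entry = readdir(dir))) {
        /* Skip "." and ".." and anything else not starting with a digit. */
        if (!isdigit((unsigned char)entry->d_name[0])) {
            continue;
        }
        int fd = (int)strtol(entry->d_name, NULL, 10);
        if (fd >= min_fd && fd != dir_fd && 0 == close(fd)) {
            ++closed;
        }
    }
    closedir(dir);   /* also releases dir_fd */
    return closed;
}
```

Compared with looping over every possible descriptor number up to the process limit, this touches only descriptors that are actually open, which is the optimization the commits describe. A child would typically call something like `close_fds_from(3)` between fork and exec to keep stdin, stdout and stderr.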
rhc54
13def2a69b
Merge pull request #911 from rhc54/topic/cleanup
...
Cleanup the odls "close file descriptor" commit to conform to OMPI co…
2015-09-20 07:01:39 -07:00
Howard Pritchard
1367a442b6
Merge pull request #910 from hppritcha/topic/odls_alps_use_907_stuff
...
odls/alps: do smarter close of fds in child
2015-09-20 07:37:55 -06:00
Ralph Castain
c167acc5a7
Cleanup the odls "close file descriptor" commit to conform to OMPI coding standards and remove memory leaks
2015-09-19 20:46:36 -07:00
Howard Pritchard
a31cc21bea
odls/alps: do smarter close of fds in child
...
Use a modified variant of #907. Thanks to plesn
for noticing this.
Signed-off-by: Howard Pritchard <howardp@lanl.gov>
2015-09-19 14:17:05 -07:00
Piotr Lesnicki
1dd5487fae
odls: close only used file descriptors at fork/exec
2015-09-18 16:44:57 +02:00
Ralph Castain
1b7930ad52
Silence some warnings and address Coverity issues
2015-09-16 07:58:22 -07:00
Ralph Castain
8b88ea9b13
Fix singletons by removing stale code
2015-09-16 00:58:05 -07:00
Ralph Castain
c1bbbb5e2f
Remove the last involvement of the OOB system from the MPI layer, remove the no-longer-needed usock/oob component, and have procs no longer open the RML, OOB, ROUTED, and GRPCOMM frameworks as PMIx now provides all required app-mpirun cmds
2015-09-15 13:08:35 -07:00
Ralph Castain
22d7c0081a
Fix the no-disconnect test by resolving a segfault on free - opal_dss.unload will return the remaining unpacked portion of a buffer. As such, it cannot return the pointer to that info as it might be partway inside of a malloc'd region. So copy the data out of the buffer.
2015-09-11 13:01:35 -07:00
Ralph Castain
dc5796b8a1
Revert "Revert "Fix the handling of cpusets so we get the correct cpuset for each local peer. Add the ability to indicate that a modex request is "optional" so we don't call the server if we don't find the value. Take advantage of that to allow the MPI layer to decide that the lack of locality info indicates non-local""
...
Fix the locality computation by correctly computing the vpid of the local peer
This reverts commit open-mpi/ompi@6a8fad49e5.
2015-09-11 08:29:51 -07:00
Ralph Castain
6a8fad49e5
Revert "Fix the handling of cpusets so we get the correct cpuset for each local peer. Add the ability to indicate that a modex request is "optional" so we don't call the server if we don't find the value. Take advantage of that to allow the MPI layer to decide that the lack of locality info indicates non-local"
...
This reverts commit f94f3cda21.
2015-09-11 02:01:25 -07:00
Ralph Castain
f94f3cda21
Fix the handling of cpusets so we get the correct cpuset for each local peer. Add the ability to indicate that a modex request is "optional" so we don't call the server if we don't find the value. Take advantage of that to allow the MPI layer to decide that the lack of locality info indicates non-local
2015-09-10 10:25:30 -07:00
rhc54
f6b6b9a9ca
Merge pull request #877 from rhc54/topic/s1s2
...
Cleanup s1 and s2 components
2015-09-08 19:20:59 -07:00
Ralph Castain
1cdb86b8c7
Cleanup s1 and s2 components, and ensure that mpirun and orteds only use non-direct-launch pmix components.
2015-09-08 18:37:09 -07:00
Ralph Castain
459f169e06
Fix segfault upon job error
...
Silence some unnecessary error-logs
2015-09-08 14:03:06 -07:00
Jeff Squyres
bc9e5652ff
whitespace: purge whitespace at end of lines
...
Generated by running "./contrib/whitespace-purge.sh".
2015-09-08 09:47:17 -07:00
Ralph Castain
e6add86e4f
Deal with connect/accept between two jobs from different mpirun's. Somewhat optimize connect/accept by using MPI bcast to distribute the participants instead of another PMIx lookup. Cleanup some Coverity issues.
2015-09-07 09:19:24 -07:00