1
1

24482 Коммитов

Автор SHA1 Сообщение Дата
Nathan Hjelm
4a8fbb5ff3 Merge pull request #1373 from hjelmn/xrc_fixes
btl/openib/udcm: fix local XRC connections
2016-02-17 17:07:15 -07:00
Nathan Hjelm
4f4ea96940 btl/openib/udcm: fix local XRC connections
This commit ensures ib_addr->remote_xrc_rcv_qp_num value is set when
creating the loopback queue pair. This is needed when communicating
with any other local peer.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-02-17 14:54:19 -07:00
Howard Pritchard
72c7558176 Merge pull request #1371 from hppritcha/topic/alps_common_syms
ras/alps: squelch common symbol warnings
2016-02-17 14:35:43 -07:00
Howard Pritchard
31841b4367 ras/alps: squelch common symbol warnings
squelch a couple of warnings from the common symbols
script.

Signed-off-by: Howard Pritchard <howardp@lanl.gov>
2016-02-17 13:27:29 -06:00
Jeff Squyres
d544e0e6e0 Merge pull request #1347 from ggouaillardet/topic/dss_tests
test/dss: update tests to make them usable again, and run them
2016-02-17 09:01:36 -05:00
Nathan Hjelm
2a728f3cbc Merge pull request #1367 from hjelmn/xrc_fixes
Fix XRC support
2016-02-16 23:43:56 -07:00
Ralph Castain
e0de4423ba Remove debug 2016-02-16 20:58:53 -08:00
rhc54
7a0605f6c9 Merge pull request #1368 from rhc54/topic/iof
Modify the IOF subsystem to handle per-job directives for…
2016-02-16 20:53:49 -08:00
Ralph Castain
50431001a3 Modify the IOF subsystem to handle per-job directives for redirecting IO to files, tagging IO, and timestamping IO.
Fix stdin reader
2016-02-16 18:54:38 -08:00
Gilles Gouaillardet
1e26f9cda4 test/dss: update tests to make them usable again, and run them 2016-02-17 11:27:00 +09:00
Nathan Hjelm
bf8360388f btl/openib/udcm: fix XRC support
This commit fixes two bugs in XRC support

 - When dynamic add_procs support was added to master the remote
   process name was added to the non-XRC request structure. The same
   value was not added to the XRC xconnect structure. This error was
   not caught because the send/recv code was incorrectly using the
   wrong structure member. This commmit adds the member and ensure the
   xconnect code uses the correct structure.

 - XRC loopback QP support has been fixed. It was 1) not setting the
   correct fields on the endpoint structure, 2) calling
   udcm_xrc_recv_qp_connect, and 3) was not initializing the endpoint
   data.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-02-16 16:49:04 -07:00
Nathan Hjelm
201c280e6c btl/openib: fix error in param check in mca_btl_openib_put
mca_btl_openib_put incorrectly checks the qp inline max before
allowing an inline put. This check will always fail for an endpoint
that has not been connected. The commit changes the check to use the
btl_put_local_registration_threshold instead.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-02-16 16:46:32 -07:00
Nathan Hjelm
123a39ac3c btl/openib: fix regression in XRC support
Commit open-mpi/ompi@400af6c52d
introduced a regression in XRC support. The commit reversed the
ordering of shared receive queue (SRQ) and completion queue (CQ)
completion. CQ creation must always preceed SRQ creation when using
XRC as the CQs are needed to create the SRQs. This commit fixes the
ordering so that CQs are always created before SRQs.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-02-16 16:46:20 -07:00
yohann
7fe395c82a mtl/ofi: cleanup 2016-02-16 09:57:57 -08:00
yohann
22eddfee10 mtl/ofi: update copyright dates. 2016-02-16 09:56:09 -08:00
Gilles Gouaillardet
7de01b347c ompi/init: fix abstraction violation
This fixes open-mpi/ompi@8b05f308f9

libmpi.so cannot be built (unresolved symbols) with configure'd with
--disable-mem-debug --disable-mem-profile --disable-memchecker --without-memory-manager
2016-02-16 16:39:21 +09:00
rhc54
7c20d3deb8 Merge pull request #1365 from rhc54/topic/dvm
Plug some memory leaks
2016-02-15 23:26:08 -08:00
Mark Santcroos
14f0390b7d Release child object when we are recording someone's relatives.
(Thanks to Mark Santcroos!)

Release routing list entries.
(Thanks to Mark Santcroos!)

Address some Coverity concerns
2016-02-15 20:50:42 -08:00
George Bosilca
9dc79f4bc1 Initialize these 2 common symbols. 2016-02-15 12:27:24 -05:00
igor-ivanov
d9eefefa74 Merge pull request #1351 from igor-ivanov/pr/issue-1336
opal/memory: Move Memory Allocation Hooks usage from openib
2016-02-15 14:07:36 +04:00
rhc54
656de558db Merge pull request #1361 from rhc54/topic/oob2
Correct ordering when checking for privileged ports
2016-02-14 11:55:42 -08:00
Ralph Castain
351070659e Correct ordering when checking for privileged ports 2016-02-14 09:43:01 -08:00
George Bosilca
56425a5d48 Fix issue identified by Lisandro Dalcin regarding the lack
of support for NULL value in MPI_Type_set_attr.
Provides a fix for issue #1359.
2016-02-14 00:07:08 -05:00
George Bosilca
68c36ea9dc Fix two annoying warnings in our UCX support. 2016-02-14 00:02:16 -05:00
rhc54
59cc1f0a96 Merge pull request #1357 from rhc54/topic/oob
Protect against a non-privileged port connecting to us when we are running as root
2016-02-13 08:12:29 -08:00
rhc54
1f00a1112c Merge pull request #1334 from rhc54/topic/dvm
Refactor the ORTE DVM code
2016-02-13 08:12:01 -08:00
Ralph Castain
06c3dfc052 Refactor the ORTE DVM code so that external codes can submit multiple jobs using only a single connection to the HNP.
* Clean up the DVM so it continues to run even when applications error out and we would ordinarily abort the daemons.
* Create a new errmgr component for the DVM to handle the differences.
* Cleanup the DVM state component.
* Add ORTE bindings directory and brief README
* Pass a local tool index around to match jobs.
* Pass the jobid on job completion.
* Fix initialization logic.
* Add framework for python wrapper.
* Fix terminate-with-non-zero-exit behavior so it properly terminates only the indicated procs, notifies orte-submit, and orte-dvm continues executing.
* Add some missing options to orte-dvm
* Fix a bug in -host processing that caused us to ignore the #slots designator. Add a new attribute to indicate "do not expand the DVM" when submitting job spawn requests.
* It actually makes no sense that we treat the termination of all children differently than terminating the children of a specific job - it only creates confusion over the difference in behavior. So terminate children the same way regardless.

Extend the cmd_line utility to easily allow layering of command line definitions

Catch up with ORTE interface change and make build more generic.

Disable "fixed dvm" logic for now.

Add another cmd_line function to merge a table of cmd line options with another one, reporting as errors any duplicate entries. Use this to allow orterun to reuse the orted_submit code

Fix the "fixed_dvm" logic by ensuring we reset num_new_daemons to zero. Also ensure that the nidmap is sent with the first job so the downstream daemons get the node info. Remove a duplicate cmd line entry in orterun.

Revise the DVM startup procedure to pass the nidmap only once, at the startup of the DVM. This reduces the overhead on each job launch and ensures that the nidmap doesn't get overwritten.

Add new commands to get_orted_comm_cmd_str().

Move ORTE command line options to orte_globals.[ch].

Catch up with extra orte_submit_init parameter.

Add example code.

Add documentation.

Bump version.

The nidmap and routing data must be updated prior to propagating the xcast or else the xcast will fail.

Fix the return code so it is something more expected when an error occurs. Ensure we get an error returned to us when we fail to launch for some reason. In this case, we will always get a launch_cb as we did indeed attempt to spawn it. The error code will be returned in the complete_cb.

Fix the return code from orte_submit_job - it was returning the tracker index instead of "success". Take advantage of ORTE's pretty-print capabilities to provide a nice error output explaining why we failed to launch. Ensure we always get a launch_cb when we fail to launch, but no complete_cb as the job never launched.

Extend the error reporting capability to job completion as well.

Add index parameter to orte_submit_job().

Add orte_job_cancel and implement ORTE_DAEMON_TERMINATE_JOB_CMD.

Factor out dvm termination.

Parse the terminate option at tool level.

Add error string for ORTE_ERR_JOB_CANCELLED.

Add some safeguards.

Cleanup and/of comments.

Enable the return.

Properly ORTE_DECLSPEC orte_submit_halt.

Add orte_submit_halt and orte_submit_cancel to interface.

Use the plm interface to terminate the job
2016-02-13 08:10:44 -08:00
Ralph Castain
233bd085ca Protect against a non-privileged port connecting to us when we are running as root
Don't close the listener socket upon error unless we are giving up

Cleanup the incoming socket
2016-02-13 08:07:27 -08:00
rhc54
52acd5b7ee Merge pull request #1354 from rhc54/topic/sing
Add support for Singularity containers
2016-02-13 06:57:47 -08:00
Jeff Squyres
7bc62e8f4c Merge pull request #1356 from hjelmn/get_address
Fix MPI_Get_address (MPI_BOTTOM, ...)
2016-02-13 08:27:18 -05:00
Ralph Castain
aa9e5a1a27 Add support for Singularity containers, including a .m4 file for checking if Singularity is available and an orte/schizo component for setting the proper support if a container was given as the executable
Cleanup the configury so we properly check for Singularity under the various typical use-cases

Bring the Singularity support online. We have to turn "off" the sm BTL as it segfaults from inside the container - root cause remains unclear. Also turned "off" the various OPAL shmem components in case they are involved and someone else tries to use them. Happily, the vader BTL works just fine!
2016-02-13 04:40:22 -08:00
Nathan Hjelm
064a67f5b9 Fix MPI_Get_address (MPI_BOTTOM, ...)
Nowhere in the standard does it say that it is invalid to pass
MPI_BOTTOM to MPI_Get_address yet we were returning an error. This
commit removes the error check on NULL == location.

Fixes open-mpi/ompi#1355.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-02-12 16:34:21 -07:00
yohann
67ce4a080a mtl/ofi: FI_AV_MAP support only. 2016-02-12 10:06:52 -08:00
yohann
b3d8ead76e mtl/ofi: Fix dynamic add_procs. 2016-02-12 10:05:52 -08:00
Jeff Squyres
d98616b9ed Merge pull request #1337 from ggouaillardet/poc/f08_fn
mpi_f08: correctly implements MPI_{COMM,TYPE,WIN}_{DUP,NULL_{COPY,DEL…
2016-02-11 12:27:29 -05:00
Igor Ivanov
8b05f308f9 opal/memory: Move Memory Allocation Hooks usage from openib
These changes fix issue https://github.com/open-mpi/ompi/issues/1336

- improve abstractions: opal/memory/linux component should be single place that opeartes with
Memory Allocation Hooks.
- avoid collisions in case dynamic component open/close: it is safe because it is linked statically.
- does not change original behaivour.
2016-02-11 14:46:35 +02:00
Nathan Hjelm
39b44d0652 Merge pull request #1345 from ggouaillardet/topic/sentinel_proc_name_conversion
Topic/sentinel proc name conversion
2016-02-10 19:08:33 -07:00
Gilles Gouaillardet
96310f439b sentinel: fix 32 bits arch
since a sentinel is only made from the current job, only store the
first 31 bits of the vpid into the sentinel.
2016-02-10 15:44:07 +09:00
Gilles Gouaillardet
b55b9e6aee sentinel: fix sentinel to proc_name conversion
converting an opal_process_name_t means the loss of one bit,
it was decided to restrict the local job id to 15 bits, so the
useful information of an opal_process_name_t can fit in 63 bits.
2016-02-10 15:44:07 +09:00
Gilles Gouaillardet
030a5f2054 sentinel: use type uintptr_t for sentinel
MSB is now automatically cleared when right shifting
Thanks George for pointing this
2016-02-10 11:28:56 +09:00
Jeff Squyres
7850517215 brucks: rename the "brks" component to be "brucks"
After hearing the 3rd person ask what "brks" stood for, I'm renaming
this component to be "brucks" (because it uses a Bruck-based algorithm).
2016-02-09 13:17:11 -08:00
Jeff Squyres
d537ee9f26 Merge pull request #1340 from jsquyres/pr/decrease-mpi_add_procs_cutoff
RFC: ompi_mpi_params.c: set mpi_add_procs_cutoff default to 0
2016-02-09 13:36:43 -05:00
Jeff Squyres
902b477aac ompi_mpi_params.c: set mpi_add_procs_cutoff default to 0
Decrease the default value of the "mpi_add_procs_cutoff" MCA param
from 1024 to 0.
2016-02-09 09:41:36 -08:00
Gilles Gouaillardet
7c99115a94 opal/dss: fix comparison of OPAL_VALUE types 2016-02-09 16:21:57 +09:00
Jeff Squyres
8558def858 opal.pc.in: fix typo; use the write AC_SUBST'ed variable
As reported by @marksantcroos, this substitution in opal.pc was
incorrect -- it left @{libdir} in the string (vs. ${libdir}).  The fix
is simple: use the proper substitution variable in opal.pc (it was
never updated to reflect the new/correct name that was created just
for the pkg-config files).

Fixes open-mpi/ompi#1343.
2016-02-08 10:55:18 -08:00
Ralph Castain
3fbad2e2bd Transfer across the -host number of slots 2016-02-08 10:38:03 -08:00
George Bosilca
7c574a3530 Typo. 2016-02-07 07:22:22 +02:00
Jeff Squyres
8d0a592563 usnic: update a few verbose reachability messages 2016-02-06 03:28:48 -08:00
Jeff Squyres
87dbe6ce01 usnic: add high-verbose reachability messages 2016-02-06 03:28:47 -08:00
Jeff Squyres
dac2fe1589 usnic: ensure to use ntohl() for network-order values 2016-02-06 03:28:47 -08:00