Ralph Castain
c1bbbb5e2f
Remove the last involvement of the OOB system from the MPI layer, remove the no-longer-needed usock/oob component, and have procs no longer open the RML, OOB, ROUTED, and GRPCOMM frameworks as PMIx now provides all required app-mpirun cmds
2015-09-15 13:08:35 -07:00
Ralph Castain
22d7c0081a
Fix the no-disconnect test by resolving a segfault on free. opal_dss.unload returns the remaining unpacked portion of a buffer, so it cannot return a pointer to that data: the pointer might fall partway inside a malloc'd region. Copy the data out of the buffer instead.
2015-09-11 13:01:35 -07:00
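The free-on-interior-pointer problem described in commit 22d7c0081a can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the actual opal_dss code: `demo_buffer_t`, `demo_unload_copy`, and `demo_unload_check` are hypothetical names invented for this sketch, and the real buffer type has a different layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a pack/unpack buffer; NOT the real opal_buffer_t. */
typedef struct {
    char  *base;        /* start of the malloc'd region */
    size_t bytes_used;  /* number of packed bytes */
    size_t unpack_off;  /* bytes already unpacked from the front */
} demo_buffer_t;

/* Copy the not-yet-unpacked bytes into a fresh allocation the caller can free.
 * Returning base + unpack_off directly would hand back an interior pointer,
 * and calling free() on an interior pointer is undefined behavior. */
char *demo_unload_copy(const demo_buffer_t *buf, size_t *len)
{
    size_t remaining = buf->bytes_used - buf->unpack_off;
    char *out;

    *len = 0;
    if (0 == remaining) {
        return NULL;
    }
    out = malloc(remaining);
    if (NULL == out) {
        return NULL;
    }
    memcpy(out, buf->base + buf->unpack_off, remaining);
    *len = remaining;
    return out;
}

/* Small self-check: returns 1 when the copy holds exactly the unread bytes. */
int demo_unload_check(void)
{
    demo_buffer_t buf;
    size_t len = 0;
    char *copy;
    int ok;

    buf.bytes_used = 13;                 /* "header" (6) + "payload" (7) */
    buf.base = malloc(buf.bytes_used);
    if (NULL == buf.base) {
        return 0;
    }
    memcpy(buf.base, "headerpayload", buf.bytes_used);
    buf.unpack_off = 6;                  /* pretend "header" was already unpacked */

    copy = demo_unload_copy(&buf, &len);
    ok = (NULL != copy) && (7 == len) && (0 == memcmp(copy, "payload", 7));

    free(copy);      /* safe: copy is the start of its own allocation */
    free(buf.base);  /* safe: base is the start of the original allocation */
    return ok;
}
```

The copy costs one extra allocation, but the caller receives a pointer it can pass to free() regardless of how far unpacking had advanced.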
Ralph Castain
dc5796b8a1
Revert "Revert "Fix the handling of cpusets so we get the correct cpuset for each local peer. Add the ability to indicate that a modex request is "optional" so we don't call the server if we don't find the value. Take advantage of that to allow the MPI layer to decide that the lack of locality info indicates non-local""
...
Fix the locality computation by correctly computing the vpid of the local peer
This reverts commit open-mpi/ompi@6a8fad49e5.
2015-09-11 08:29:51 -07:00
Ralph Castain
6a8fad49e5
Revert "Fix the handling of cpusets so we get the correct cpuset for each local peer. Add the ability to indicate that a modex request is "optional" so we don't call the server if we don't find the value. Take advantage of that to allow the MPI layer to decide that the lack of locality info indicates non-local"
...
This reverts commit f94f3cda21.
2015-09-11 02:01:25 -07:00
Ralph Castain
f94f3cda21
Fix the handling of cpusets so we get the correct cpuset for each local peer. Add the ability to indicate that a modex request is "optional" so we don't call the server if we don't find the value. Take advantage of that to allow the MPI layer to decide that the lack of locality info indicates non-local
2015-09-10 10:25:30 -07:00
rhc54
f6b6b9a9ca
Merge pull request #877 from rhc54/topic/s1s2
...
Cleanup s1 and s2 components
2015-09-08 19:20:59 -07:00
Ralph Castain
1cdb86b8c7
Cleanup s1 and s2 components, and ensure that mpirun and orteds only use non-direct-launch pmix components.
2015-09-08 18:37:09 -07:00
Ralph Castain
459f169e06
Fix segfault upon job error
...
Silence some unnecessary error-logs
2015-09-08 14:03:06 -07:00
Jeff Squyres
bc9e5652ff
whitespace: purge whitespace at end of lines
...
Generated by running "./contrib/whitespace-purge.sh".
2015-09-08 09:47:17 -07:00
Ralph Castain
e6add86e4f
Deal with connect/accept between two jobs from different mpiruns. Somewhat optimize connect/accept by using MPI bcast to distribute the participants instead of another PMIx lookup. Cleanup some Coverity issues.
2015-09-07 09:19:24 -07:00
Ralph Castain
37c3ed68e7
Cleanup connect/disconnect and bring comm_spawn back online!
2015-09-06 10:27:39 -07:00
rhc54
665b30376a
Merge pull request #868 from rhc54/topic/hwloc
...
Remove OPAL_HAVE_HWLOC qualifier and error out if --without-hwloc is given
2015-09-04 17:58:07 -07:00
Ralph Castain
d97bc29102
Remove OPAL_HAVE_HWLOC qualifier and error out if --without-hwloc is given
2015-09-04 16:54:40 -07:00
Ralph Castain
f6948c2bb4
Sync with PMIx master 43e45c3. Get multi-node publish/lookup/unpublish working
2015-09-04 10:07:17 -07:00
Ralph Castain
a772b46c15
Bring the MPI_Publish and friends online
2015-09-02 12:04:07 -07:00
Ralph Castain
38ba54366c
Fix shared memory operations by resolving local peers
2015-08-30 12:07:14 -07:00
Ralph Castain
0d5814b5ca
Cleanup Coverity issues
2015-08-29 21:19:27 -07:00
Ralph Castain
cf6137b530
Integrate PMIx 1.0 with OMPI.
...
Bring Slurm PMI-1 component online
Bring the s2 component online
Little cleanup - let the various PMIx modules set the process name during init, and then just raise it up to the ORTE level. Required as the different PMI environments all pass the jobid in different ways.
Bring the OMPI pubsub/pmi component online
Get comm_spawn working again
Ensure we always provide a cpuset, even if it is NULL
pmix/cray: adjust cray pmix component for pmix
Make changes so cray pmix can work within the integrated
ompi/pmix framework.
Bring singletons back online. Implement the comm_spawn operation using pmix - not tested yet
Cleanup comm_spawn - procs now starting, error in connect_accept
Complete integration
2015-08-29 16:04:10 -07:00
Ralph Castain
89c80b2294
Only start a listener for processes that will actually receive connection requests. Tools such as orte-submit always initiate connections and thus do not need to start a listener.
2015-08-27 16:41:00 -07:00
Nathan Hjelm
156ce6af21
periodic whitespace purge
...
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-08-24 09:32:33 -06:00
Ralph Castain
bc7815e178
Adjust the process type flags to remove confusion between orted and dvm state machines
2015-08-21 07:50:08 -07:00
Ralph Castain
5040f47ef3
Use the correct verbosity in an output_verbose
2015-08-13 22:33:25 -07:00
Ralph Castain
a2a049a612
Update test to match the one in MTT
2015-08-13 11:12:34 -07:00
Ralph Castain
0b1d4b62be
Cleanup some cruft and update to coordinate with CM operations:
...
* don't pass --tree-spawn to the orted cmd line. If someone doesn't want tree-spawn, it shows up as an MCA param anyway
* ensure state/orted component disqualifies itself from CM operations
* clarify the DVM proc_type definitions
* ensure we stop littering the tmp dir with session directories
2015-08-12 10:32:14 -07:00
Jeff Squyres
31b329e585
odls default: ensure to initialize opts
...
This fixes CID 71127.
2015-08-12 05:27:37 -07:00
Howard Pritchard
8e7e4ca7f4
Merge pull request #780 from hppritcha/topic/plm_alps_minor_cleanup
...
plm/alps: remove unneeded env. variable setting
2015-08-07 15:03:45 -06:00
Jeff Squyres
09f7434491
ORTE: update for the new opal_progress_thread API
2015-08-07 10:13:40 -07:00
Howard Pritchard
1b55d14dff
plm/alps: remove unneeded env. variable setting
...
In order to address issue #741, the orteds are now
always launched with the Cray PMI environment variables
PMI_NO_FORK
PMI_NO_PREINITIALIZE
set to disable running of the library's ctor.
So there's no longer a need to set these for the
application(s) being launched by the orteds.
Signed-off-by: Howard Pritchard <howardp@lanl.gov>
2015-08-05 13:27:18 -07:00
Ralph Castain
9bc384282a
Fix an annoying segfault caused by incorrect indentation in a loop, which left the buffer uncreated prior to packing.
2015-08-01 10:01:47 -07:00
Ralph Castain
023936e84b
Silence coverity warnings
2015-07-29 07:28:08 -07:00
Gilles Gouaillardet
429bdf1af7
oob/tcp: fix a race condition when finalizing the oob/tcp component
2015-07-28 09:16:13 +09:00
Ralph Castain
93f7a51275
Update the orte/system/opal_hotel test
2015-07-24 07:34:59 -07:00
Howard Pritchard
70096d3753
plm/alps: fix orted based launch failures.
...
Turns out that when one builds Open MPI with --disable-dlopen
for Cray, a whole bunch of Cray-specific libraries get linked
into the orted executable. One of these is Cray PMI. The
Cray PMI has a ctor which, if run, causes job launches using
mpirun to fail. This commit suppresses the running of the
ctor and thus prevents failure to launch.
Signed-off-by: Howard Pritchard <howardp@lanl.gov>
2015-07-23 15:07:57 -07:00
Jeff Squyres
60609cbb79
orte/test/system: fix compiler warnings
...
Note that the opal_hotel test still doesn't compile; it looks like it
needs to be updated to the new requirement to pass an event base.
2015-07-23 06:19:33 -07:00
Ralph Castain
4853457b93
The RML posted recvs are controlled by the async progress thread when in an application process. The call to finalize and close the RML is done from the main thread, and so we need to shift the actual destruct of the posted recv list to the async thread for handling or else we encounter a race condition when accessing the posted recvs.
...
Thanks to Gilles for providing the required debug info
2015-07-21 08:44:23 -07:00
Ralph Castain
219c4dfba5
Create a new opal_async_event_base and have the pmix/native and ORTE level use it. This reduces our thread count by one.
2015-07-12 08:23:34 -07:00
rhc54
bd91225cb5
Merge pull request #716 from rhc54/topic/alloc
...
Default allocated nodes to the UP state
2015-07-11 12:30:32 -07:00
Ralph Castain
2c896c5a2d
Default allocated nodes to the UP state
2015-07-11 10:43:11 -07:00
Ralph Castain
683efcb850
Rename the current opal_event_base to opal_sync_event_base in preparation for adding an async progress thread to opal. No functional changes made here - just a simple rename.
2015-07-11 10:08:19 -07:00
rhc54
053d9b2a7c
Merge pull request #713 from rhc54/topic/errhandler
...
Add an opal/errhandler so opal-level errors can be up-leveled
2015-07-11 07:58:57 -07:00
Ralph Castain
a2243dcddd
Add an opal/errhandler so opal-level errors can be up-leveled
2015-07-11 07:09:11 -07:00
Ralph Castain
61fb067f14
Update the opal_hotel class to support a given event base instead of defaulting to using opal_event_base
2015-07-11 06:42:23 -07:00
rhc54
c6bb227073
Merge pull request #692 from rhc54/topic/mapper
...
Fix hetero operations. An error in the hwloc utilities only allocated…
2015-07-07 13:33:42 -07:00
Ralph Castain
ed93154e43
Fix hetero operations. An error in the hwloc utilities only allocated memory for the first display of a binding map, and then assumed that all nodes had the same number of cores in them. This resulted in memory corruption whenever someone displayed a binding pattern for a hetero cluster, and a smaller node was first in line.
2015-07-07 12:52:16 -07:00
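The heterogeneous-cluster bug described in commit ed93154e43 comes down to sizing a display string once and reusing that size for nodes with a different core count. A minimal sketch of the safer pattern, assuming a simplified per-core "bound" flag array rather than real hwloc objects, might look like this. `demo_binding_map` and `demo_hetero_check` are hypothetical names for illustration and are not part of the Open MPI or hwloc APIs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Render one node's binding as a string: 'B' for a bound core, '.' otherwise.
 * The string is sized from THIS node's core count on every call, so a small
 * node displayed first cannot fix the size used for a larger node later.
 * Caller frees the result. */
char *demo_binding_map(const unsigned char *bound, size_t ncores)
{
    char *map = malloc(ncores + 1);
    size_t i;

    if (NULL == map) {
        return NULL;
    }
    for (i = 0; i < ncores; i++) {
        map[i] = bound[i] ? 'B' : '.';
    }
    map[ncores] = '\0';
    return map;
}

/* Self-check: display a 2-core node first, then a 4-core node.
 * Returns 1 when both maps come out with the right length and content. */
int demo_hetero_check(void)
{
    const unsigned char small_node[2] = {1, 0};
    const unsigned char large_node[4] = {0, 1, 1, 0};
    char *a = demo_binding_map(small_node, 2);
    char *b = demo_binding_map(large_node, 4);
    int ok = (NULL != a) && (NULL != b)
             && (0 == strcmp(a, "B.")) && (0 == strcmp(b, ".BB."));

    free(a);
    free(b);
    return ok;
}
```

Allocating per call trades a small amount of allocation overhead for correctness when nodes differ in size; a cached buffer would need to be regrown whenever a larger node appears.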
rhc54
a4aff5e3d9
Merge pull request #691 from rhc54/topic/mapper
...
Add a bunch of debug, and correct an error that caused us to use the …
2015-07-07 11:08:01 -07:00
Ralph Castain
7455802a36
Add a bunch of debug, and correct an error that caused us to use the wrong mapping policy when determining the default binding policy
2015-07-07 10:13:10 -07:00
Gilles Gouaillardet
409874eb47
remove trigraph '??)' from comment
...
Fujitsu compilers issue way too many warnings because of this trigraph
2015-07-07 11:00:13 +09:00
Ralph Castain
eb582b8276
Minor whitespace cleanups
2015-07-06 09:38:33 -07:00
Ralph Castain
836f49597d
There is no reason for tools to have an async progress thread as they can loop the event library themselves. This has the added benefit of causing the tool to "block" while waiting for events so they don't use cpu.
...
Also, fix orte-submit so it appropriately handles --help option
2015-07-05 10:45:28 -07:00
Ralph Castain
6829e192ad
Okay, that's it - trash it
2015-07-01 05:27:30 -05:00