Ralph Castain
fe68f23099
Only instantiate the HWLOC topology in an MPI process if it actually will be used.
...
There are only five places in the non-daemon code paths where opal_hwloc_topology is currently referenced:
* shared memory BTLs (sm, smcuda). I have added a code path to those components that uses the location string
instead of the topology itself, if available, thus avoiding instantiating the topology.
* openib BTL. This uses the distance matrix. At present, I haven't developed a method
for replacing that reference, so this component will instantiate the topology.
* usnic BTL. Also uses the distance matrix.
* treematch TOPO component. This runs a complex tree-based algorithm, so it will instantiate
the topology.
* ess base functions. If a process is direct launched and not bound at launch, this
code attempts to bind it. Thus, procs in this scenario will instantiate the
topology.
Note that instantiating the topology on complex chips such as KNL can consume
megabytes of memory.
Fix pernode binding policy
Properly handle the unbound case
Correct pointer usage
Do not free static error messages!
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2016-12-29 10:33:29 -08:00
Ralph Castain
52533f755e
Remove debug
...
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2016-12-28 13:24:39 -08:00
Ralph Castain
3a2d6a5ab6
Begin to reduce reliance of application procs on the topology tree itself by having the daemon provide more detailed info. In this case, provide the topology description string so that procs can readily determine the number of types of objects on the node, and a "locality" string that describes which objects this process is executing upon. The latter allows a process to compute the objects of overlap between itself and another proc without consulting the topology tree.
...
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2016-12-28 09:14:26 -08:00
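The overlap computation described in commit 3a2d6a5ab6 can be illustrated with a minimal sketch. The string format used here (colon-separated fields, each a two-character type tag followed by an index list such as `HT4-5`) and the function names are assumptions made for illustration only; they are not taken from the commit or from the actual ORTE/OPAL code.

```python
def parse_locality(s):
    """Parse a hypothetical locality string into {tag: set_of_indices}.

    Assumed format (illustrative only): 'SK0:L30:CR2:HT4-5', where each
    field is a two-character object-type tag followed by indices given
    as single numbers, comma lists, or inclusive ranges.
    """
    result = {}
    for field in s.split(":"):
        if not field:
            continue
        tag, spec = field[:2], field[2:]
        ids = set()
        for part in spec.split(","):
            if "-" in part:
                lo, hi = part.split("-")
                ids.update(range(int(lo), int(hi) + 1))
            else:
                ids.add(int(part))
        result.setdefault(tag, set()).update(ids)
    return result


def shared_objects(mine, peer):
    """Return the object types (and indices) that two processes share.

    Only the two strings are compared; no topology tree is consulted.
    """
    a, b = parse_locality(mine), parse_locality(peer)
    return {tag: a[tag] & b[tag] for tag in a if tag in b and a[tag] & b[tag]}


if __name__ == "__main__":
    # Two procs on the same socket and L3 cache, but on different cores.
    print(shared_objects("SK0:L30:CR2:HT4-5", "SK0:L30:CR3:HT6-7"))
```

The point of the design is that a process needs only its own string and its peer's string to decide, for example, whether the two share a socket or a cache.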
Ralph Castain
ea133206ec
Sync the internal OMPI component to PMIx master
...
Update external PMIx v2.x component
Add missing Makefile
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2016-12-19 19:14:16 -08:00
Ralph Castain
c6f6f40529
Transfer debugger support changes
...
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2016-12-17 18:14:46 -08:00
Ralph Castain
269753f5c1
Transfer back changes from debugger attach work
...
Silence warning
Remove debug
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2016-12-17 10:00:52 -08:00
Ralph Castain
884fb7fcf2
Update the PMIx2 support to include the latest shared memory optimizations
...
Update ORTE support for dynamic PMIx operations e.g., PMIx_Spawn
Update to track master
Ensure that --disable-pmix-dstore actually disables the dstore. Sync to a few debugger updates
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2016-12-14 15:00:10 -08:00
rhc54
341ab683de
Merge pull request #2532 from rhc54/topic/pmixptl
...
Update to latest PMIx master + PTL branch
2016-12-07 17:28:22 -08:00
Ralph Castain
e1aa7939ef
Correctly clean up the local children and node map info on remote orteds upon job completion. Ensure that register_nspace only includes procs from that job in the proc map
...
Thanks to Ashley Pittman for the report
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2016-12-07 13:53:00 -08:00
Ralph Castain
fbed2d794a
Update to latest PMIx master + PTL branch
...
Update the usock component to disable it
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2016-12-06 20:47:44 -08:00
Ralph Castain
79cde184ad
Allow a PMIx tool to spawn a job
...
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2016-12-03 16:00:47 -08:00
Ralph Castain
dd491db21f
Fix IOF when outputting to files - the remote orteds were failing to output stdout/err from their procs.
...
Silence a warning in orted_submit
Protect against a free'd value in an error path when forming oob tcp connections
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2016-12-01 14:12:47 -08:00
Ralph Castain
d5fd635efe
Bring forward the debugger-related changes
...
Refs https://github.com/open-mpi/ompi/pull/2425
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2016-11-29 13:15:20 -08:00
Ralph Castain
9c6c2fa61d
Bring the v2.0.x debugger patch up to the master branch
...
Ensure the personality gets set as specified by user, or defaults to
"ompi"
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2016-11-18 12:45:45 -08:00
Ralph Castain
649301a3a2
Revise the routed framework to be multi-select so it can support the new conduit system. Update all calls to rml.send* to the new syntax. Define an orte_mgmt_conduit for admin and IOF messages, and an orte_coll_conduit for all collective operations (e.g., xcast, modex, and barrier).
...
Still not completely done as we need a better way of tracking the routed module being used down in the OOB - e.g., when a peer drops connection, we want to remove that route from all conduits that (a) use the OOB and (b) are routed, but we don't want to remove it from an OFI conduit.
2016-10-23 21:52:39 -07:00
Ralph Castain
57114a09ae
Pick up the npernode and npersocket options and include them in the job object
2016-10-17 12:26:21 -07:00
Ralph Castain
5b1484a836
Implement the backend support for process-generated event notification
2016-10-08 09:24:28 -07:00
Ralph Castain
e773c17cf3
Put show_help through the PMIx "log" API. This pushes the show_help output from apps into the pmix thread, thus avoiding conflicts in the RML thread, which should help with thread lock situations.
2016-10-02 16:02:23 -07:00
Gilles Gouaillardet
c7bf9a0ec9
ess/singleton: fix read on the pipe to spawn'ed orted
...
and close the pipe on both ends when it is no longer needed
2016-09-22 14:21:52 +09:00
Gilles Gouaillardet
83399adb3f
singleton: "safe" read/write to the pipe between (spawn'ed) orted and singleton
2016-09-20 14:56:58 +09:00
Gilles Gouaillardet
e7ae6975d0
orted: fix spawn in singleton mode
...
in singleton mode, have the spawn'ed orted invoke orte_pre_condition_transports()
and send the transport key back to the singleton
2016-09-20 14:39:22 +09:00
Gilles Gouaillardet
d84ac9bdc5
orted: remove debug
...
remove debug code that was added by mistake in open-mpi/ompi@eae9d31784
2016-09-19 19:15:42 +09:00
Gilles Gouaillardet
eae9d31784
pre_condition_transports: code cleanup
...
replace hard coded "OMPI_MCA_orte_precondition_transports" environment variable name
with macro'ed OPAL_MCA_PREFIX"orte_precondition_transports"
2016-09-19 13:31:47 +09:00
Artem Polyakov
9eba1b0b75
Merge pull request #2042 from artpol84/pmix_sdirs
...
Several fixes related to session directories:
2016-09-07 14:15:47 +07:00
Gilles Gouaillardet
be41b120d0
orted: plug misc memory leaks
...
as reported by Coverity with CID 1362603 and 1362606
2016-09-07 10:08:44 +09:00
Artem Polyakov
81195ab724
Several fixes related to session directories:
...
* enable OMPI to retrieve paths from RM through PMIx
* cleanups related to tempdirs.
2016-09-05 07:48:44 +03:00
Ralph Castain
4e0788e9ad
Enable PSM to support dynamic processes
...
Fix comm_spawn to correctly reference the actual parent process that requested the spawn when looking for the parent job object
2016-09-02 10:22:04 -07:00
Ralph Castain
0ea1cff733
Implement notification of completion on comm_spawn'd child jobs. Add a configure flag to enable PMIx 3's shared memory datastore, and set it to disabled by default so that comm_spawn functions again. Will reverse the default once that feature is fully functional
2016-09-01 13:10:10 -07:00
Ralph Castain
c1050bc01e
Provide a mechanism for obtaining memory profiles of daemons and application profiles for use in studying our memory footprint. Setting OMPI_MEMPROFILE=N causes mpirun to set a timer for N seconds. When the timer fires, mpirun will query each daemon in the job to report its own memory usage plus the average memory usage of its child processes. The Proportional Set Size (PSS) is used for this purpose.
2016-08-31 09:32:07 -07:00
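The measurement described in commit c1050bc01e (a daemon's own PSS plus the average PSS of its child processes) can be sketched as follows. This is a minimal illustration, assuming a Linux `/proc/<pid>/smaps`-style text as input; the function names and the report layout are hypothetical and are not the ORTE implementation.

```python
def pss_kb_from_smaps(text):
    """Sum the 'Pss:' entries (in kB) found in smaps-formatted text.

    Lines such as 'Pss_Anon:' are deliberately not matched, because
    only the exact 'Pss:' key is counted.
    """
    total = 0
    for line in text.splitlines():
        if line.startswith("Pss:"):
            total += int(line.split()[1])
    return total


def daemon_report(daemon_smaps, child_smaps_list):
    """Return a hypothetical report: daemon PSS and average child PSS."""
    children = [pss_kb_from_smaps(t) for t in child_smaps_list]
    avg = sum(children) / len(children) if children else 0.0
    return {"daemon_pss_kb": pss_kb_from_smaps(daemon_smaps),
            "avg_child_pss_kb": avg}


if __name__ == "__main__":
    sample = "Size: 100 kB\nPss: 12 kB\nPss_Anon: 4 kB\nPss: 30 kB\n"
    print(daemon_report(sample, [sample, "Pss: 8 kB\n"]))
```

PSS divides each shared page among the processes that map it, so summing it across processes avoids counting shared libraries multiple times.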
Ralph Castain
2f6e0fec90
Provide the number of nodes in the job
2016-08-26 14:50:41 -07:00
Ralph Castain
ae2af61ee3
Update the session dir structure. Restore the creation of a top-level dir based on userid so that everything is contained under the user's top-level dir. Make the next level down (the "job family" level) be either the pid (indicated by a name of "pid.N") or the job family if not launched by mpirun. This allows for proper rendezvous by direct-launched procs.
2016-08-15 22:46:46 -05:00
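The two-level layout described in commit ae2af61ee3 can be sketched as below. Only the `pid.N` naming comes from the commit message; the `ompi.<uid>` top-level name, the function name, and its parameters are assumptions chosen for illustration.

```python
import posixpath


def session_dir(tmp_base, uid, mpirun_pid=None, job_family=None):
    """Build a session directory path following the described layout.

    Top level: one directory per user (the name 'ompi.<uid>' is an
    assumption). Next level: 'pid.N' when launched by mpirun, otherwise
    the job family, so direct-launched procs can rendezvous.
    """
    top = posixpath.join(tmp_base, f"ompi.{uid}")
    if mpirun_pid is not None:
        return posixpath.join(top, f"pid.{mpirun_pid}")
    if job_family is None:
        raise ValueError("need either mpirun_pid or job_family")
    return posixpath.join(top, str(job_family))


if __name__ == "__main__":
    print(session_dir("/tmp", 1000, mpirun_pid=4242))
    print(session_dir("/tmp", 1000, job_family=51234))
```

Because a direct-launched process has no mpirun pid to consult, keying that case on the job family gives every process in the job the same path without extra communication.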
Ralph Castain
be8424b691
Provide backward compatible keys so that the non-PMIx components in the opal/pmix framework don't have to adjust as we continue to work on finalizing the PMIx reference scheme. Activate and utilize the new PMIx show_help capability to provide more meaningful error output when the server cannot start.
...
Add a contrib script to clean up permissions incorrectly modified due to things like smb mounts
2016-08-13 12:13:04 -07:00
Ralph Castain
08a0644df5
Fix shared memory rendezvous
2016-08-13 08:14:50 -07:00
Ralph Castain
d4327fd973
The node index isn't normally passed with the packed node object, so we need to set it on the remote end as the orted needs to pass it down to the procs. Refactor the registration code to better package proc-level info - we will separate out the node and app levels in a subsequent change.
2016-08-12 12:06:23 -07:00
Ralph Castain
527b5c692a
Update to include extended tool support, new datatypes
2016-08-08 13:39:46 -07:00
Ralph Castain
16fccd4964
Establish a way for ORTE to tell PMIx the base tmpdir to use, and update PMIx to understand such directives
2016-07-29 09:52:36 -07:00
Ralph Castain
b748afceb1
Fix copy/paste error
2016-07-29 06:41:30 -07:00
Gilles Gouaillardet
e67c3d0a14
orted/pmix: protect against NULL node in orte_pmix_server_register_nspace()
2016-07-29 16:20:31 +09:00
Ralph Castain
cacb582ecd
Support timeout values when performing connect/accept operations. Bump default timeout to 10 minutes so folks have time to start the partnering application
2016-07-28 14:09:06 -07:00
Ralph Castain
9ab20cafe3
Pass the nodeid for each proc in the job. Fix a mistaken error output message
2016-07-25 15:41:15 -07:00
Ralph Castain
99f7096031
Fix permissions
2016-07-16 21:03:55 -07:00
Ralph Castain
d4071fbd1c
Fix dynamic operations by ensuring that we only fire the debugger release if the debugger is attached, and that the OPAL pmix key for directing events to non-default handlers matches the PMIx spelling
2016-07-16 13:20:41 -07:00
Ralph Castain
20a91c2baf
Add a new --continuous flag to mpirun that directs ORTE to let a job continue running as app procs terminate. Don't attempt to restart them. Add event notification of abnormally terminating procs, and demonstrate that in the mpi_spin test program.
...
Clean up debug message
2016-07-13 15:28:33 -07:00
Ralph Castain
ee56d9dc1a
Shorten the session directory name, as some OSes now provide unusually long temp directory names, causing us to overflow the sockaddr field
2016-07-05 14:59:50 -07:00
Ralph Castain
c9ada8e095
Silence Coverity warnings
2016-07-03 20:45:08 -07:00
Ralph Castain
6e434d6785
Add support for PMIx tool connections and queries. Initially only support a request to list all known namespaces (jobids) from ORTE, but other folks will extend that support to include additional information
...
Update to match PMIx RFC
Fix configury to point to correct libevent and hwloc locations
2016-06-29 19:19:19 -07:00
Gilles Gouaillardet
5d32282230
orted/pmix_server_pub: fix packing type in pmix_server_lookup_fn()
...
and make it match the one used when unpacking in orte_data_server()
2016-06-27 14:37:08 +09:00
Ralph Castain
e3e4d73986
Need to be a little more careful when checking the range on a publish/lookup operation. If the range was constrained at publish, then we need to check that the lookup fits within that constraint. Otherwise, we should provide the data. More detailed constraint checking will be provided later.
2016-06-24 17:01:49 -07:00
Ralph Castain
0ba02821e6
Add requested key and job-level info
2016-06-19 18:22:31 -07:00
Ralph Castain
5d330d5220
Enable the PMIx event notification capability and use that for all error notifications, including debugger release. This capability requires use of PMIx 2.0 or above as the features are not available with earlier PMIx releases. When OMPI master is built against an earlier external version, it will fall back to the prior behavior - i.e., debugger will be released via RML and all notifications will go strictly to the default error handler.
...
Add PMIx 2.0
Remove PMIx 1.1.4
Clean up copying of component
Add missing file
Touchup a typo in the Makefile.am
Update the pmix ext114 component
Minor cleanups and resync to master
Update to latest PMIx 2.x
Update to the PMIx event notification branch latest changes
2016-06-14 13:08:41 -07:00