Ralph Castain
9888615e75
Restore the coll/sync module and provide a test to verify its operation
2016-08-20 10:14:52 -07:00
Jeff Squyres
fb894e6e3e
pkg-config: fix static linking
...
We need to list all major project libraries in the private libraries
line to enable static linking to work properly.
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2016-08-17 20:37:51 -05:00
Jeff Squyres
71ec5cfb43
rsh: robustify the check for plm_rsh_agent default value
...
Don't strcmp against the default value -- the default value may change
over time. Instead, check to see if the MCA var source is not
DEFAULT.
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2016-08-16 06:58:20 -05:00
rhc54
d7cd802426
Merge pull request #1971 from rhc54/topic/sesdir
...
Update the session dir structure. Restore the creation of a top-level…
2016-08-16 03:14:08 -05:00
Ralph Castain
ae2af61ee3
Update the session dir structure. Restore the creation of a top-level dir based on userid so that everything is contained under the user's top-level dir. Make the next level down (the "job family" level) be either the pid (indicated by a name of "pid.N") or the job family if not launched by mpirun. This allows for proper rendezvous by direct-launched procs.
2016-08-15 22:46:46 -05:00
Ralph Castain
9f43db7303
Further clean up getpwuid usage - try it first (unless completely disabled), and then silently fail over to try other methods.
2016-08-15 07:51:36 -07:00
Ralph Castain
be8424b691
Provide backward compatible keys so that the non-PMIx components in the opal/pmix framework don't have to adjust as we continue to work on finalizing the PMIx reference scheme. Activate and utilize the new PMIx show_help capability to provide more meaningful error output when the server cannot start.
...
Add a contrib script to cleanup permissions incorrectly modified due to things like smb mounts
2016-08-13 12:13:04 -07:00
Ralph Castain
08a0644df5
Fix shared memory rendezvous
2016-08-13 08:14:50 -07:00
rhc54
ddde154d28
Merge pull request #1962 from rhc54/topic/notify
...
Ensure we properly convert pmix status to ORTE state before activatin…
2016-08-13 06:59:50 -07:00
Ralph Castain
48d35a9627
Ensure we properly convert pmix status to ORTE state before activating an error state upon notification. Cleanup some conversion issues on notification info. Add a new orte_notify.c test program
2016-08-12 21:14:29 -07:00
rhc54
9eed451916
Merge pull request #1960 from rhc54/topic/rsh
...
Restore the rsh template creation code
2016-08-12 13:38:43 -07:00
rhc54
8d67f753ca
Merge pull request #1959 from rhc54/topic/nodeid
...
The node index isn't normally passed with the packed node object, so …
2016-08-12 13:30:10 -07:00
rhc54
1ef3c86d44
Merge pull request #1931 from hjelmn/ess_fix
...
ess/base: set up nidmap after pmix
2016-08-12 13:10:30 -07:00
Ralph Castain
5717b75b45
Restore the rsh template creation code
2016-08-12 12:43:40 -07:00
Ralph Castain
d4327fd973
The node index isn't normally passed with the packed node object, so we need to set it on the remote end as the orted needs to pass it down to the procs. Refactor the registration code to better package proc-level info - we will separate out the node and app levels in a subsequent change.
2016-08-12 12:06:23 -07:00
Ralph Castain
1c44543854
If the ssh agent hasn't been given, then check for qrsh and friends
2016-08-12 07:46:39 -07:00
Ralph Castain
527b5c692a
Update to include extended tool support, new datatypes
2016-08-08 13:39:46 -07:00
Artem Polyakov
1351a7065c
ess/pmi: minor code readability cleanup.
...
Split the process name variable "name" into:
- "wildcard_rank" for the cases where wildcard is used.
- "pname" for the case where reference to particular process is needed.
2016-08-06 15:45:19 +06:00
Howard Pritchard
ff669e7b15
code cleanup: clang is now a happier panda
...
Clang 5.1 on my mac was a sad panda compiling a couple
of files, complaining about uninitialized stack variables.
This commit makes clang a happier panda (or at least not so sad).
Signed-off-by: Howard Pritchard <howardp@lanl.gov>
2016-08-04 19:34:44 -06:00
Nathan Hjelm
3c23502dfe
ess/base: set up nidmap after pmix
...
This fixes a SEGV when the nidmap code attempts to use
opal_pmix.store_local before pmix is set up.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-08-02 09:50:00 -06:00
Ralph Castain
16fccd4964
Establish a way for ORTE to tell PMIx the base tmpdir to use, and update PMIx to understand such directives
2016-07-29 09:52:36 -07:00
Ralph Castain
b748afceb1
Fix copy/paste error
2016-07-29 06:41:30 -07:00
Gilles Gouaillardet
e67c3d0a14
orted/pmix: protect against NULL node in orte_pmix_server_register_nspace()
2016-07-29 16:20:31 +09:00
Gilles Gouaillardet
273e56096b
configury: capture configury command line
...
The configury command line is quoted and made available via the OPAL_CONFIGURE_CLI macro.
It can be retrieved via {orte-info,ompi_info,oshmem_info} -c, or
{orte-info,ompi_info,oshmem_info} --all --parseable | grep ^config:cli:
2016-07-29 09:14:09 +09:00
rhc54
19a2dbb04f
Merge pull request #1915 from rhc54/topic/connect
...
Support timeout values when performing connect/accept operations. Bum…
2016-07-28 15:51:06 -07:00
Jeff Squyres
cc651408dc
help-orterun: remove blank line at end of help message
...
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2016-07-28 14:53:34 -07:00
Ralph Castain
cacb582ecd
Support timeout values when performing connect/accept operations. Bump default timeout to 10 minutes so folks have time to start the partnering application
2016-07-28 14:09:06 -07:00
Ralph Castain
9ab20cafe3
Pass the nodeid for each proc in the job. Fix a mistaken error output message
2016-07-25 15:41:15 -07:00
Ralph Castain
71de03fc67
Cleanup the new naming requirements to ensure that info is correctly retrieved
...
Cleanup permissions
Restore singleton operations
2016-07-21 09:46:03 -07:00
Ralph Castain
01a653d50a
Remove a debug print in comm_cid.c. Update PMIx2 to include the revised PMIx_Get logic for higher performance by reducing the number of hash table lookups. Fix a bug where requests for data from a proc in another nspace could hang, or result in "not found".
...
Remove stale file reference
Restore autogen pass thru pmix
Remove generated file
2016-07-20 00:58:19 -07:00
Ralph Castain
99f7096031
Fix permissions
2016-07-16 21:03:55 -07:00
Ralph Castain
d4071fbd1c
Fix dynamic operations by ensuring that we only fire the debugger release if the debugger is attached, and that the OPAL pmix key for directing events to non-default handlers matches the PMIx spelling
2016-07-16 13:20:41 -07:00
rhc54
2414244171
Merge pull request #1872 from rhc54/topic/continuous
...
Add support for continuously operating applications
2016-07-13 15:29:31 -07:00
Ralph Castain
20a91c2baf
Add a new --continuous flag to mpirun that directs ORTE to let a job continue running as app procs terminate. Don't attempt to restart them. Add event notification of abnormally terminating procs, and demonstrate that in the mpi_spin test program.
...
Cleanup debug message
2016-07-13 15:28:33 -07:00
rhc54
cc2a648124
Merge pull request #1862 from rhc54/topic/mapping
...
Fix a bug in the handling of nper<foo> when -host or -hostfile was gi…
2016-07-12 10:40:28 -07:00
Ralph Castain
aa78f902f2
Add some missing info to the job map so remote procs get their app_rank
2016-07-12 09:50:12 -07:00
Ralph Castain
ddd0d05de3
Fix a bug in the handling of nper<foo> when -host or -hostfile was given. Correctly mark slots as "given" when we auto-assign them. Ensure we don't set the number of procs when using nper<foo> so the PPR mapper can correctly assign them.
2016-07-12 09:27:02 -07:00
Ralph Castain
ae8444682f
Remove stale variable
2016-07-05 20:07:16 -07:00
Ralph Castain
ee56d9dc1a
Shorten the session directory name as some OSes are now providing unusually long temp directory names, causing us to overflow the sockaddr field
2016-07-05 14:59:50 -07:00
Ralph Castain
c9ada8e095
Silence Coverity warnings
2016-07-03 20:45:08 -07:00
Ralph Castain
6e434d6785
Add support for PMIx tool connections and queries. Initially only support a request to list all known namespaces (jobids) from ORTE, but other folks will extend that support to include additional information
...
Update to match PMIx RFC
Fix configury to point to correct libevent and hwloc locations
2016-06-29 19:19:19 -07:00
Gilles Gouaillardet
5d32282230
orted/pmix_server_pub: fix packing type in pmix_server_lookup_fn()
...
and make it match the one used when unpacking in orte_data_server()
2016-06-27 14:37:08 +09:00
Ralph Castain
e3e4d73986
Need to be a little more careful when checking the range on a publish/lookup operation. If the range was constrained at publish, then we need to check that the lookup fits within that constraint. Otherwise, we should provide the data. More detailed constraint checking will be provided later.
2016-06-24 17:01:49 -07:00
Ralph Castain
380cc8f040
Add a test program to help diagnose binding issues
2016-06-23 06:27:18 -07:00
Ralph Castain
0ba02821e6
Add requested key and job-level info
2016-06-19 18:22:31 -07:00
Jeff Squyres
98a2f5248d
orte: add missing break statement
...
This seems like an obvious typo: insert a missing "break" statement so
that we don't fall through to the next case.
Fixes CIDs 1362756 and 1362764.
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2016-06-18 07:48:45 -07:00
Ralph Castain
5d330d5220
Enable the PMIx event notification capability and use that for all error notifications, including debugger release. This capability requires use of PMIx 2.0 or above as the features are not available with earlier PMIx releases. When OMPI master is built against an earlier external version, it will fall back to the prior behavior - i.e., the debugger will be released via RML and all notifications will go strictly to the default error handler.
...
Add PMIx 2.0
Remove PMIx 1.1.4
Cleanup copying of component
Add missing file
Touchup a typo in the Makefile.am
Update the pmix ext114 component
Minor cleanups and resync to master
Update to latest PMIx 2.x
Update to the PMIx event notification branch latest changes
2016-06-14 13:08:41 -07:00
Ralph Castain
a6e6c37484
Remove stale map-reduce support
2016-06-12 07:41:57 -07:00
Ralph Castain
dd0f843843
Fix rare hangs observed on OS-X by properly thread-shifting upcalls from the PMIx server into ORTE
2016-06-05 21:39:44 -07:00
Ralph Castain
0ba9572f9f
Cleanup the forced termination a bit by restoring the delay before issuing the sigkill, and eliminating the large time loss spent checking if the proc died. The latter is responsible for a large number of test timeouts in MTT
...
Update alps component
2016-06-02 17:48:21 -07:00