Ralph Castain
227d4d9609
Open the conduits for application procs - we probably can remove all the
RML-related frameworks from MPI applications now, but let's wait a bit
to ensure we have cleaned up all the points where messaging might occur.
2016-10-24 16:53:19 -07:00
Ralph Castain
649301a3a2
Revise the routed framework to be multi-select so it can support the new conduit system. Update all calls to rml.send* to the new syntax. Define an orte_mgmt_conduit for admin and IOF messages, and an orte_coll_conduit for all collective operations (e.g., xcast, modex, and barrier).
...
Still not completely done as we need a better way of tracking the routed module being used down in the OOB - e.g., when a peer drops connection, we want to remove that route from all conduits that (a) use the OOB and (b) are routed, but we don't want to remove it from an OFI conduit.
2016-10-23 21:52:39 -07:00
Ralph Castain
df8ac7b747
Properly mark a node as down and decrease the number of daemons so any
subsequent grpcomm collectives can correctly operate. Note that only the
direct grpcomm component knows how to deal with down nodes.
2016-10-21 09:53:37 -07:00
Gilles Gouaillardet
1846c2d8ad
plm/rsh: use an alternate port if the ORTE_NODE_PORT attribute is set
2016-10-19 16:18:52 +09:00
Gilles Gouaillardet
40424c9d0f
orte/util/hostfile: add the port=<port> option
...
add the option to pass an alternate port to plm
for example
node0 port=2222
directs the plm (via the ORTE_NODE_PORT attribute) to use
the non-default port 2222 (e.g. ssh -p 2222 node0 ...)
2016-10-19 15:04:01 +09:00
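A minimal sketch of how a `port=<port>` hostfile token such as the `node0 port=2222` example could be validated. The function name `parse_port_option` and the -1 error convention are hypothetical, chosen for illustration; this is not the parser in orte/util/hostfile.

```c
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper (not the ORTE hostfile parser): extract the port
 * from a "port=<port>" token. Returns the port (1..65535), or -1 if the
 * token is not a well-formed port option.
 */
static int parse_port_option(const char *token)
{
    static const char key[] = "port=";
    const size_t key_len = sizeof(key) - 1;
    const char *value;
    char *end = NULL;
    long port;

    if (token == NULL || strncmp(token, key, key_len) != 0) {
        return -1;                      /* some other option */
    }

    value = token + key_len;
    errno = 0;
    port = strtol(value, &end, 10);

    /* Reject empty values, trailing junk, and out-of-range numbers. */
    if (errno != 0 || end == value || *end != '\0' ||
        port < 1 || port > 65535) {
        return -1;
    }
    return (int)port;
}
```

Under these assumptions, `parse_port_option("port=2222")` yields 2222, the value that would then be passed to the launcher as in `ssh -p 2222 node0 ...`.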
Gilles Gouaillardet
73ea87800b
orte/util: add the ORTE_NODE_PORT attribute
...
this can be used to direct the plm component to use an alternate port
(e.g. ssh -p 2222 ...)
2016-10-19 15:04:01 +09:00
Ralph Castain
16540c7422
Properly report failure to launch when someone mis-types the name of the application
...
Fixes #2233
2016-10-18 10:09:30 -07:00
Ralph Castain
7be607582e
ORTE applications need to commit any modex sends prior to calling fence
2016-10-18 09:22:56 -07:00
Ralph Castain
57114a09ae
Pick up the npernode and npersocket options and include them in the job object
2016-10-17 12:26:21 -07:00
Gilles Gouaillardet
bd1b6fe661
rml/oob: add a missing include file
2016-10-16 10:25:00 +09:00
Ralph Castain
6f65d0a173
Repair event notification support. Cleanup the long-suffering "epoll: warning" coming out of libevent whenever a process abnormally terminated.
...
Add changes to test program
Sync to PMIx master
2016-10-13 16:27:39 -07:00
Gilles Gouaillardet
451b9dc467
ess: tear down pmix (if any) before oob
2016-10-13 14:08:02 +09:00
Ralph Castain
fca1556787
Some compilers apparently complain about this, so modify the typedef statements
2016-10-12 08:44:03 -07:00
Ralph Castain
a2919174d0
Bring the RML modifications across. This is the first step in a revamp of the ORTE messaging subsystem to support fabric-based communications during launch and wireup phases. When completed, the grpcomm and plm frameworks will each have their own "conduit" for communication - each conduit corresponds to a particular RML messaging transport. This can be the active OOB-based component, or a provider from within the RML/OFI component. Messages sent down the conduit will flow across the associated transport.
...
Multiple conduits can exist at the same time, and can even point to the same base transport. Each conduit can have its own characteristics (e.g., flow control) based on the info keys provided to the "open_conduit" call. For ease during the transition period, the "legacy" RML interfaces remain as wrappers over the new conduit-based APIs using a default conduit opened during orte_init - this default conduit is tied to the OOB framework so that current behaviors are preserved. Once the transition has been completed, a one-time cleanup will be done to update all RML calls to the new APIs and the "legacy" interfaces will be deleted.
While we are at it: Remove oob/usock component to eliminate the TMPDIR length problem - get all working, including oob_stress
2016-10-11 16:01:02 -07:00
Ralph Castain
5b1484a836
Implement the backend support for process-generated event notification
2016-10-08 09:24:28 -07:00
Gilles Gouaillardet
c92e9a5406
use the new OPAL_HASH_TABLE_FOREACH convenience macro
2016-10-08 16:58:20 +09:00
Ralph Castain
51b2bb1d41
Send show_help out thru stderr
2016-10-07 19:23:52 -07:00
Ralph Castain
e773c17cf3
Put show_help thru the PMIx "log" API. This pushes the show_help output from apps into the pmix thread, thus avoiding conflicts in the RML thread, which should help with thread lock situations.
2016-10-02 16:02:23 -07:00
Gilles Gouaillardet
0931d09afa
ess/singleton: silence a valgrind warning
...
initialize a pointer and keep valgrind happy about it
2016-09-27 15:22:39 +09:00
Gilles Gouaillardet
f9ebba4668
ess/singleton: only realloc() when required in fork_hnp()
2016-09-23 16:35:59 +09:00
rhc54
63ba088d09
Merge pull request #2108 from rhc54/topic/reorder
...
Mucho thanks to Gilles - his patch to reorder the CPPFLAGS solves the…
2016-09-22 11:04:21 -05:00
Ralph Castain
a14ec3bdbc
Mucho thanks to Gilles - his patch to reorder the CPPFLAGS solves the problem of inadvertently picking up hwloc and libevent headers from locations in CPPFLAGS while continuing to build the embedded versions. Also silence a minor warning about an uninitialized var.
2016-09-22 07:39:22 -07:00
Gilles Gouaillardet
c7bf9a0ec9
ess/singleton: fix read on the pipe to spawn'ed orted
...
and close the pipe on both ends when it is no longer needed
2016-09-22 14:21:52 +09:00
Ralph Castain
de7b1494d9
Clean out old cruft from the ORCM project
2016-09-21 00:13:30 -07:00
Gilles Gouaillardet
83399adb3f
singleton: "safe" read/write to the pipe between (spawn'ed) orted and singleton
2016-09-20 14:56:58 +09:00
Gilles Gouaillardet
e7ae6975d0
orted: fix spawn in singleton mode
...
in singleton mode, have the spawn'ed orted invoke orte_pre_condition_transports()
and send the transport key back to the singleton
2016-09-20 14:39:22 +09:00
Gilles Gouaillardet
d84ac9bdc5
orted: remove debug
...
remove debug code that was added by mistake in open-mpi/ompi@eae9d31784
2016-09-19 19:15:42 +09:00
Gilles Gouaillardet
eae9d31784
pre_condition_transports: code cleanup
...
replace hard coded "OMPI_MCA_orte_precondition_transports" environment variable name
with macro'ed OPAL_MCA_PREFIX"orte_precondition_transports"
2016-09-19 13:31:47 +09:00
Ralph Castain
e55cc63da9
Remove debug
2016-09-16 07:06:58 -07:00
Ralph Castain
a16b3cc33d
Fix some minor complaints - missing "void" in function parameters
2016-09-15 15:18:42 -07:00
Ralph Castain
6f086189e6
Fix trivial typo
2016-09-15 13:10:55 -07:00
Gregory M. Kurtzer
16794cc260
Updates to support Singularity containers v2.2
2016-09-15 09:52:06 -07:00
Gilles Gouaillardet
11ebf3ab23
ess/singleton: when forking hnp, use the PMIX_NAMESPACE sent by the hnp
as the jobid
2016-09-15 13:57:23 +09:00
Gilles Gouaillardet
628c730196
pkgconfig: define the pkgincludedir variable in *.pc files
...
this has been made necessary with open-mpi/ompi@12e796dcaf
Refs open-mpi/ompi#2069
2016-09-13 09:50:14 +09:00
Gilles Gouaillardet
e84b35217f
oob/tcp: plug a memory leak
...
as reported by Coverity with CID 1196711
2016-09-08 18:50:18 +09:00
Gilles Gouaillardet
b2a2be0e5a
odls: fix memory leak plug
...
This fixes commit open-mpi/ompi@e2c343cdfc .
2016-09-08 10:02:52 +09:00
Jeff Squyres
fd829ac389
Merge pull request #1982 from jsquyres/pr/fix-pkg-config-static
...
pkg-config: fix static linking
2016-09-07 14:55:50 -04:00
Jeff Squyres
b811b0a15c
Merge pull request #2060 from jsquyres/pr/remove-unused-var
...
orte proc_info.c: remove unused variable
2016-09-07 06:33:26 -04:00
Artem Polyakov
9eba1b0b75
Merge pull request #2042 from artpol84/pmix_sdirs
...
Several fixes related to session directories:
2016-09-07 14:15:47 +07:00
Artem Polyakov
a9a7f39773
ess/pmi: fix the comments about MCA/PMIx setting conflict resolution.
2016-09-07 07:47:35 +03:00
Gilles Gouaillardet
be41b120d0
orted: plug misc memory leaks
...
as reported by Coverity with CID 1362603 and 1362606
2016-09-07 10:08:44 +09:00
Gilles Gouaillardet
e2c343cdfc
odls: plug memory leak
...
as reported by Coverity with CID 710645
2016-09-07 10:08:44 +09:00
Gilles Gouaillardet
c09899f6af
plm: plug resource leaks
...
as reported by Coverity with CIDs 72274 and 1196733
2016-09-07 10:08:44 +09:00
Jeff Squyres
722d5eecf1
orte proc_info.c: remove unused variable
...
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2016-09-06 16:38:15 -07:00
Josh Hursey
f6337f9eae
Merge pull request #2047 from jjhursey/topic/mixed-host2
...
orte: !FQDN implementation to use opal_net_isaddr
2016-09-06 13:08:54 -05:00
Ralph Castain
f85dcaee2a
Fixes CID 1369067 and CID 1196684
...
Fixes CID 1369648
Fixes CID 1372409
2016-09-06 08:43:15 -07:00
Artem Polyakov
74a11d7832
Fix session dir cleanup code.
2016-09-05 07:53:55 +03:00
Artem Polyakov
dc0ab674de
Add PMIx key to provide RM with ability to indicate that it will cleanup
session directories provided through OPAL_PMIX_TMPDIR,
OPAL_PMIX_NSDIR, OPAL_PMIX_PROCDIR
2016-09-05 07:48:44 +03:00
Artem Polyakov
81195ab724
Several fixes related to session directories:
...
* enable OMPI to retrieve paths from RM through PMIx
* cleanups related to tempdirs.
2016-09-05 07:48:44 +03:00
Ralph Castain
fb51d65049
Minor change: check for NULL before using the job map to avoid segfault when erroring out prior to creating the map
2016-09-04 07:53:12 -07:00
Joshua Hursey
fe937d1e82
orte: !FQDN implementation to use opal_net_isaddr
...
* Switch to use opal_net_isaddr() for checking if a name is an IP
address - as it is a bit cleaner, and uses common functionality.
2016-09-02 13:31:49 -05:00
Ralph Castain
4e0788e9ad
Enable PSM to support dynamic processes
...
Fix comm_spawn to correctly reference the actual parent process that requested the spawn when looking for the parent job object
2016-09-02 10:22:04 -07:00
Ralph Castain
0ea1cff733
Implement notification of completion on comm_spawn'd child jobs. Add a configure flag to enable PMIx 3's shared memory datastore, and set it disabled by default so that comm_spawn functions again. Will reverse the default once that feature is fully functional.
2016-09-01 13:10:10 -07:00
Gilles Gouaillardet
0b8c58298d
oob/usock: fix handling of orte_process_name_t *
...
orte_process_name_t is aligned on 32 bits, so it cannot simply be cast
to an int64_t. Use memcpy() instead.
Thanks Paul Hargrove for the report
2016-09-01 13:18:02 +09:00
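The alignment problem described above can be sketched as follows. The struct `example_name_t` is a stand-in assumed for illustration (two 32-bit fields, hence 4-byte alignment); it is not the real `orte_process_name_t` definition, and the helper names are hypothetical. Dereferencing a 4-byte-aligned pointer as an `int64_t *` can fault on strict-alignment platforms, whereas `memcpy()` has no alignment requirement.

```c
#include <stdint.h>
#include <string.h>

/* Stand-in for orte_process_name_t (assumption): only 32-bit aligned. */
typedef struct {
    uint32_t jobid;
    uint32_t vpid;
} example_name_t;

_Static_assert(sizeof(example_name_t) == sizeof(int64_t),
               "name must be exactly 64 bits wide for this sketch");

/* Unsafe alternative (do not do this): *(const int64_t *)name */

/* Copy the name into a 64-bit integer without assuming 64-bit alignment. */
static int64_t name_to_int64(const example_name_t *name)
{
    int64_t packed;
    memcpy(&packed, name, sizeof(packed));
    return packed;
}

/* Inverse operation: rebuild the name from the packed 64-bit value. */
static example_name_t int64_to_name(int64_t packed)
{
    example_name_t name;
    memcpy(&name, &packed, sizeof(name));
    return name;
}
```

Compilers typically lower a fixed-size `memcpy()` like this to one or two plain loads, so the safe form should not cost anything on platforms where the cast happened to work.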
Ralph Castain
c1050bc01e
Provide a mechanism for obtaining memory profiles of daemons and application processes for use in studying our memory footprint. Setting OMPI_MEMPROFILE=N causes mpirun to set a timer for N seconds. When the timer fires, mpirun will query each daemon in the job to report its own memory usage plus the average memory usage of its child processes. The Proportional Set Size (PSS) is used for this purpose.
2016-08-31 09:32:07 -07:00
Ralph Castain
9b991bd1f5
Ensure that the "running" state is correctly updated
...
It is possible that one or more procs could get thru PMIx_Init, and thus be marked as in state "registered", before all local procs have been started. If that happens, then we would report some of the procs in state "running", and the others in state "registered" - which means that the HNP would miss the "running" stage of the state machine.
Thanks to Jingchao Zhang for his patience in tracking this down on the 2.0 branch
2016-08-30 19:24:39 -07:00
Ralph Castain
cfa784c9a6
Since we changed storage to pointers in pmix_value_t, we need to allocate space for those values when unpacking
2016-08-29 20:22:24 -07:00
Josh Hursey
b0d8638824
Merge pull request #2015 from jjhursey/topic/mixed-hostnames
...
orte: Expand use of !orte_keep_fqdn_hostnames MCA parameter
2016-08-29 09:14:54 -05:00
Ralph Castain
2f6e0fec90
Provide the number of nodes in the job
2016-08-26 14:50:41 -07:00
Joshua Hursey
d26dd2c20e
orte: Expand the application of !orte_keep_fqdn_hostnames
...
* Expand the use of the `orte_keep_fqdn_hostnames` MCA parameter when
it is set to false.
* If that parameter is set to false (default) then short hostnames
(e.g., `node01`) will match with the long hostnames (e.g.,
`node01.mycluster.org`). This allows a user (or resource manager)
to mix the use of short and long hostnames.
- Note that this mechanism does _not_ perform a DNS lookup, but
instead strips off the FQDN by truncating the hostname string at
the first `.` character (when not an IP address).
- By default (`false`) the following is true:
`node01 == node01.mycluster.org == node01.bogus.com`
since we use `node01` as the hostname.
2016-08-26 16:09:04 -05:00
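The matching rule described in this commit (truncate at the first `.` unless the name is an IP address, no DNS lookup) can be sketched as below. The helper names are hypothetical and `inet_pton()` is used here as a stand-in IP check; this is not the ORTE implementation.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: true if `name` is an IPv4 or IPv6 literal. */
static bool looks_like_ip(const char *name)
{
    unsigned char buf[sizeof(struct in6_addr)];
    return inet_pton(AF_INET, name, buf) == 1 ||
           inet_pton(AF_INET6, name, buf) == 1;
}

/* Length of the part of `name` that takes part in the comparison:
 * everything before the first '.', or the whole string for an IP. */
static size_t short_len(const char *name)
{
    const char *dot;

    if (looks_like_ip(name)) {
        return strlen(name);
    }
    dot = strchr(name, '.');
    return dot != NULL ? (size_t)(dot - name) : strlen(name);
}

/* Compare two host names on their short forms only (no DNS lookup). */
static bool hosts_match(const char *a, const char *b)
{
    size_t la = short_len(a);
    size_t lb = short_len(b);
    return la == lb && strncmp(a, b, la) == 0;
}
```

With this sketch, `node01`, `node01.mycluster.org` and `node01.bogus.com` all compare equal, while `192.168.1.1` and `192.168.1.2` stay distinct because IP literals are never truncated.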
Artem Polyakov
55ac3b0be3
orte/schizo: fix binding detection in slurm component
...
in SLURM 16.05 the SLURM_CPU_BIND_TYPE is equal to "mask_cpu:"
instead of "mask_cpu". Account for that.
2016-08-26 09:55:52 +03:00
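One way to accept both spellings of the value is to compare only the `mask_cpu` prefix and allow an optional trailing `:`. This is a sketch under that assumption; the function name is hypothetical and the code is not taken from the schizo/slurm component.

```c
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical check: true if the SLURM_CPU_BIND_TYPE value is
 * "mask_cpu" (older SLURM) or "mask_cpu:" (SLURM 16.05).
 */
static bool is_mask_cpu(const char *bind_type)
{
    static const char prefix[] = "mask_cpu";
    const size_t len = sizeof(prefix) - 1;

    if (bind_type == NULL || strncmp(bind_type, prefix, len) != 0) {
        return false;
    }
    /* Accept exactly the prefix, or the prefix followed by ':'. */
    return bind_type[len] == '\0' || bind_type[len] == ':';
}
```

The argument would typically come from `getenv("SLURM_CPU_BIND_TYPE")`, which returns NULL when the variable is unset; the NULL check covers that case.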
rhc54
19b0f4db9f
Merge pull request #1995 from rhc54/topic/pe-per-rank
...
Change the behavior of cpus-per-rank.
2016-08-25 14:38:12 -05:00
Ralph Castain
440eae90ec
Correct the binding algorithm to decouple it from oversubscribe.
...
Oversubscribe stipulates that we allow more procs on the node than assigned slots - it has nothing to do with the number of available pe's. Let overload directives handle the pe situation.
2016-08-24 21:17:22 -07:00
Ralph Castain
92102304b6
Minor typo - init the job_data stdin_target field to 0 for default behavior. Add test.
2016-08-22 21:03:45 -07:00
Gilles Gouaillardet
93e73841f9
ess/singleton: push all PMIX_* environment variables, regardless how many there are
2016-08-23 09:46:55 +09:00
Gilles Gouaillardet
a1e8e58a8a
ess/singleton: expects 4 PMIX_* environment variables or more
2016-08-23 09:34:03 +09:00
Ralph Castain
7de4d6922b
Change the behavior of cpus-per-rank. We previously counted each cpu against the #slots. However, IBM has pointed out that "slot" is equated to the number of processes allowed to run on each node, and not the number of cpus on the node. This has been a continuing source of confusion, so make the distinction a "hard" one.
...
Each process occupies a "slot". We automatically set #slots = #cpus if nothing else is told to us. If you want to run more procs than slots, you must tell us to allow oversubscription.
A process can utilize multiple pe's if that option is given. If you try to bind more than one proc to a given pe, then we will error out unless you tell us to allow overloading.
2016-08-22 15:54:41 -07:00
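The distinction drawn above between slots (oversubscription) and pe's (overloading) can be sketched as two independent checks. All names here are hypothetical, and the overload test assumes each proc is bound to `pes_per_proc` distinct pe's, so that exceeding the cpu count forces two procs onto one pe; the actual mapper logic may differ.

```c
#include <stdbool.h>

typedef enum {
    MAP_OK,
    MAP_ERR_OVERSUBSCRIBED,  /* more procs than slots */
    MAP_ERR_OVERLOADED       /* more than one proc bound to a pe */
} map_status_t;

/*
 * Hypothetical per-node check. `slots <= 0` means "not specified",
 * in which case #slots defaults to #cpus as the commit describes.
 */
static map_status_t check_node(int nprocs, int slots, int ncpus,
                               int pes_per_proc,
                               bool allow_oversubscribe,
                               bool allow_overload)
{
    if (slots <= 0) {
        slots = ncpus;
    }

    /* Each proc occupies one slot, however many pe's it uses. */
    if (nprocs > slots && !allow_oversubscribe) {
        return MAP_ERR_OVERSUBSCRIBED;
    }

    /* Assumption: needing more pe's than cpus means some pe is shared. */
    if (nprocs * pes_per_proc > ncpus && !allow_overload) {
        return MAP_ERR_OVERLOADED;
    }

    return MAP_OK;
}
```

In this sketch, two procs with four pe's each on a four-cpu node fit within the slot count but still fail as overloaded unless overloading is explicitly allowed.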
Ralph Castain
9888615e75
Restore the coll/sync module and provide a test to verify its operation
2016-08-20 10:14:52 -07:00
Jeff Squyres
fb894e6e3e
pkg-config: fix static linking
...
We need to list all major project libraries in the private libraries
line to enable static linking to work properly.
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2016-08-17 20:37:51 -05:00
Jeff Squyres
71ec5cfb43
rsh: robustify the check for plm_rsh_agent default value
...
Don't strcmp against the default value -- the default value may change
over time. Instead, check to see if the MCA var source is not
DEFAULT.
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2016-08-16 06:58:20 -05:00
rhc54
d7cd802426
Merge pull request #1971 from rhc54/topic/sesdir
...
Update the session dir structure. Restore the creation of a top-level…
2016-08-16 03:14:08 -05:00
Ralph Castain
ae2af61ee3
Update the session dir structure. Restore the creation of a top-level dir based on userid so that everything is contained under the user's top-level dir. Make the next level down (the "job family" level) be either the pid (indicated by a name of "pid.N") or the job family if not launched by mpirun. This allows for proper rendezvous by direct-launched procs.
2016-08-15 22:46:46 -05:00
Ralph Castain
9f43db7303
Further cleanup getpwuid usage - try it first (unless completely disabled), and then silently failover to try other methods.
2016-08-15 07:51:36 -07:00
Ralph Castain
be8424b691
Provide backward compatible keys so that the non-PMIx components in the opal/pmix framework don't have to adjust as we continue to work on finalizing the PMIx reference scheme. Activate and utilize the new PMIx show_help capability to provide more meaningful error output when the server cannot start.
...
Add a contrib script to cleanup permissions incorrectly modified due to things like smb mounts
2016-08-13 12:13:04 -07:00
Ralph Castain
08a0644df5
Fix shared memory rendezvous
2016-08-13 08:14:50 -07:00
rhc54
ddde154d28
Merge pull request #1962 from rhc54/topic/notify
...
Ensure we properly convert pmix status to ORTE state before activatin…
2016-08-13 06:59:50 -07:00
Ralph Castain
48d35a9627
Ensure we properly convert pmix status to ORTE state before activating an error state upon notification. Cleanup some conversion issues on notification info. Add a new orte_notify.c test program
2016-08-12 21:14:29 -07:00
rhc54
9eed451916
Merge pull request #1960 from rhc54/topic/rsh
...
Restore the rsh template creation code
2016-08-12 13:38:43 -07:00
rhc54
8d67f753ca
Merge pull request #1959 from rhc54/topic/nodeid
...
The node index isn't normally passed with the packed node object, so …
2016-08-12 13:30:10 -07:00
rhc54
1ef3c86d44
Merge pull request #1931 from hjelmn/ess_fix
...
ess/base: set up nidmap after pmix
2016-08-12 13:10:30 -07:00
Ralph Castain
5717b75b45
Restore the rsh template creation code
2016-08-12 12:43:40 -07:00
Ralph Castain
d4327fd973
The node index isn't normally passed with the packed node object, so we need to set it on the remote end as the orted needs to pass it down to the procs. Refactor the registration code to better package proc-level info - we will separate out the node and app levels in a subsequent change.
2016-08-12 12:06:23 -07:00
Ralph Castain
1c44543854
If the ssh agent hasn't been given, then check for qrsh and friends
2016-08-12 07:46:39 -07:00
Ralph Castain
527b5c692a
Update to include extended tool support, new datatypes
2016-08-08 13:39:46 -07:00
Artem Polyakov
1351a7065c
ess/pmi: minor code readability cleanup.
...
Split the process name variable "name" into
- "wildcard_rank" for the cases where wildcard is used.
- "pname" for the case where reference to particular process is needed.
2016-08-06 15:45:19 +06:00
Howard Pritchard
ff669e7b15
code cleanup: clang is now a happier panda
...
Clang 5.1 on my mac was a sad panda compiling a couple
of files, complaining about uninitialized stack variables.
This commit makes clang a happier panda (or at least not so sad).
Signed-off-by: Howard Pritchard <howardp@lanl.gov>
2016-08-04 19:34:44 -06:00
Nathan Hjelm
3c23502dfe
ess/base: set up nidmap after pmix
...
This fixes a SEGV when the nidmap code attempts to use
opal_pmix.store_local before pmix is set up.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-08-02 09:50:00 -06:00
Ralph Castain
16fccd4964
Establish a way for ORTE to tell PMIx the base tmpdir to use, and update PMIx to understand such directives
2016-07-29 09:52:36 -07:00
Ralph Castain
b748afceb1
Fix copy/paste error
2016-07-29 06:41:30 -07:00
Gilles Gouaillardet
e67c3d0a14
orted/pmix: protect against NULL node in orte_pmix_server_register_nspace()
2016-07-29 16:20:31 +09:00
Gilles Gouaillardet
273e56096b
configury: capture configury command line
...
configury command line is quoted and made available via the OPAL_CONFIGURE_CLI macro.
it can be retrieved via {orte-info,ompi_info,oshmem_info} -c, or
{orte-info,ompi_info,oshmem_info} --all --parseable | grep ^config:cli:
2016-07-29 09:14:09 +09:00
rhc54
19a2dbb04f
Merge pull request #1915 from rhc54/topic/connect
...
Support timeout values when performing connect/accept operations. Bum…
2016-07-28 15:51:06 -07:00
Jeff Squyres
cc651408dc
help-orterun: remove blank line at end of help message
...
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2016-07-28 14:53:34 -07:00
Ralph Castain
cacb582ecd
Support timeout values when performing connect/accept operations. Bump default timeout to 10 minutes so folks have time to start the partnering application
2016-07-28 14:09:06 -07:00
Ralph Castain
9ab20cafe3
Pass the nodeid for each proc in the job. Fix a mistaken error output message
2016-07-25 15:41:15 -07:00
Ralph Castain
71de03fc67
Cleanup the new naming requirements to ensure that info is correctly retrieved
...
Cleanup permissions
Restore singleton operations
2016-07-21 09:46:03 -07:00
Ralph Castain
01a653d50a
Remove a debug print in comm_cid.c. Update PMIx2 to include the revised PMIx_Get logic for higher performance by reducing the number of hash table lookups. Fix a bug where requests for data from a proc in another nspace could hang, or result in "not found".
...
Remove stale file reference
Restore autogen pass thru pmix
Remove generated file
2016-07-20 00:58:19 -07:00
Ralph Castain
99f7096031
Fix permissions
2016-07-16 21:03:55 -07:00
Ralph Castain
d4071fbd1c
Fix dynamic operations by ensuring that we only fire the debugger release if the debugger is attached, and that the OPAL pmix key for directing events to non-default handlers matches the PMIx spelling
2016-07-16 13:20:41 -07:00
rhc54
2414244171
Merge pull request #1872 from rhc54/topic/continuous
...
Add support for continuously operating applications
2016-07-13 15:29:31 -07:00