Ralph Castain
3478def791
Ensure that nodes get included in the nidmap when spawning a new DVM job - we really only need to do this once, but for now we do it for every job until we work out how to avoid the duplication. Remove debug from orte-dvm tool
2015-02-09 23:47:46 -05:00
Ralph Castain
ef13ba7db3
Add debug-daemons option to orte-dvm
2015-02-09 11:08:45 -05:00
Ralph Castain
a3275aa867
Once again, fix the blasted singleton comm_spawn
2015-02-05 17:34:25 -08:00
Ralph Castain
f28238af59
Fix a race condition seen by Absoft during finalize. Stop the orte progress thread without cleaning it up, thus allowing the frameworks to still cancel their posted recv's. Then cleanup the memory footprint afterwards.
2015-02-05 11:41:37 -08:00
Jeff Squyres
938b8e1dad
schizo: fix free of uninitialized value
...
The "param" value is not assigned before this free() statement. So
remove it.
(yay clang compiler warnings)
2015-02-04 15:50:24 -05:00
Ralph Castain
251084a2da
When a tool requests the spawn of a new job, then exclusively forward output to that tool - the DVM should not output its own copy as well.
2015-02-04 07:59:47 -08:00
Ralph Castain
2b0b012460
Continue refinement of the DVM operations. Send the spawn request to the right place (it helps) as it isn't a comm_spawn request and has to be treated a little differently. Ensure IO gets forwarded back to the tool. Ensure the tool outputs show_help locally as there is no place to send it.
2015-02-04 06:21:54 -08:00
Ralph Castain
7299cc3ab9
Cleanup the communications handshake so that orte-submit properly terminates upon job completion, and properly sends the terminate command to orte-dvm
2015-02-03 07:25:43 -08:00
Elena
5919b636e1
changed output format in pmix unit test
2015-02-02 14:22:51 +02:00
Ralph Castain
4dba298e6e
Update orte-submit manpage, add the ompi-* versions of orte-dvm and orte-submit manpages
2015-02-01 15:46:40 -08:00
Ralph Castain
e303a9b1d6
Provide an orte-dvm man page. Provide an option to orte-submit for terminating the DVM
2015-02-01 12:14:44 -08:00
Ralph Castain
ec5ccb76cf
Enable persistent ORTE DVM so users can execute multiple OMPI jobs within an allocation without restarting the DVM every time.
2015-01-30 11:00:43 -08:00
rhc54
e7fa600d85
Merge pull request #360 from elenash/master
...
added unit test for pmix functionality
2015-01-28 06:18:57 -06:00
Elena
472baa1284
added unit test for pmix functionality
2015-01-28 13:18:26 +02:00
Ralph Castain
b838df9eb8
Get slurm to stay out of the way on singletons
2015-01-27 09:29:43 -06:00
Ralph Castain
294ebc907a
Fix singleton operations so they can work inside a slurm environment
2015-01-27 09:29:42 -06:00
Ralph Castain
3eca55caec
Continue fixing singletons in slurm environments
2015-01-27 09:29:42 -06:00
Ralph Castain
fcec24b2a4
Minor cleanups to handle comm_spawn and singletons
2015-01-27 09:29:42 -06:00
Ralph Castain
74385302c0
Add the personality to the orte_job_t datatype support
2015-01-27 09:29:42 -06:00
Ralph Castain
88c38f87d2
Get the orteds to use schizo as well
2015-01-27 09:29:42 -06:00
Ralph Castain
028b00154d
Complete implementation of the schizo framework to support OMPI component
2015-01-27 09:29:42 -06:00
Ralph Castain
11c92eefe6
ckpt
2015-01-27 09:29:42 -06:00
rhc54
a1707326bf
Merge pull request #359 from hppritcha/topic/better_help
...
orte/util: minor improvement to show_help
2015-01-25 08:13:49 -08:00
Howard Pritchard
1e94d84ae6
orte/util: minor improvement to show_help
...
Make sure show_help makes a good attempt to
print an error message locally if the
send_buffer_nb method returns an error.
2015-01-23 13:54:03 -08:00
Howard Pritchard
2809c21e0f
rml/oob: check peer param in send methods
...
The rml/oob was not doing sanity checks on the input peer
parameter for orte_rml_oob_send_nb and orte_rml_oob_send_buffer_nb.
Owing to the fact that there are places in the ompi/orte stack
where things like orte_show_help_norender are called way before
ORTE_PROC_MY_HNP is set up properly, all kinds of weird
startup failures can occur as the rml/oob tries to process send
requests where the peer is junk.
Rather than try to expand this kind of thing:
    /* if we are the HNP, or the RML has not yet been setup,
     * or ROUTED has not been setup,
     * or we weren't given an HNP, or we are running in standalone
     * mode, then all we can do is process this locally
     */
    if (ORTE_PROC_IS_HNP || orte_standalone_operation ||
        NULL == orte_rml.send_buffer_nb ||
        NULL == orte_routed.get_route ||
        NULL == orte_process_info.my_hnp_uri) {
        rc = show_help(filename, topic, output, ORTE_PROC_MY_NAME);
    }
do the right thing at the rml level and return an error rather than
eventually failing in the send owing to the peer not being valid.
2015-01-22 06:12:39 -08:00
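The peer check described in the commit above can be sketched roughly as follows. This is a minimal illustration, not the ORTE code: `peer_name_t`, `PEER_INVALID`, the return codes, and `send_nb` are hypothetical stand-ins for the real process-name type, invalid-name sentinels, error constants, and send entry points.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for illustration only; not the ORTE types. */
typedef struct {
    uint32_t jobid;
    uint32_t vpid;
} peer_name_t;

#define PEER_INVALID UINT32_MAX

enum { SEND_OK = 0, SEND_ERR_BAD_PARAM = -1 };

/* Reject a missing or obviously invalid peer before attempting to send,
 * so the caller gets an error immediately instead of a late failure. */
static int send_nb(const peer_name_t *peer, const void *buf, size_t len)
{
    if (NULL == peer ||
        PEER_INVALID == peer->jobid ||
        PEER_INVALID == peer->vpid) {
        return SEND_ERR_BAD_PARAM;
    }
    (void)buf;   /* a real implementation would queue the message here */
    (void)len;
    return SEND_OK;
}
```

With a check like this at the send layer, a caller such as show_help can fall back to local processing whenever the send returns an error, rather than enumerating every "not yet set up" condition itself.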
Howard Pritchard
06d3b57c07
Merge pull request #351 from hppritcha/topic/alps_odls_spawn_bug
...
odls/alps: check if PMI gni rdma creds already set
2015-01-19 11:48:24 -07:00
Howard Pritchard
fd807aee69
odls/alps: check if PMI gni rdma creds already set
...
Need to check if the alps odls component has already
read the rdma creds from alps. It's okay to ask apshepherd
multiple times for rdma creds, but opal_setenv gets
a bit picky about this. Rather than check for the OPAL_EXISTS
return value from opal_setenv, for now just check with
a static variable whether or not orte_odls_alps_get_rdma_creds
has already been successfully called before.
Would be nice to have an opal_getenv function for checking
if an env. variable had already been set by opal_putenv.
2015-01-19 10:12:38 -08:00
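The static-variable guard mentioned in the commit above might look something like this sketch. It uses the POSIX `setenv` rather than `opal_setenv`, and `set_creds_once` is a hypothetical name; the actual component logic is not shown in this log.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical helper: export a credential variable at most once per
 * process. Later calls return success without touching the environment. */
static int set_creds_once(const char *name, const char *value)
{
    static bool already_set = false;   /* remembers the first success */

    if (already_set) {
        return 0;
    }
    /* overwrite=1 so that the guard alone prevents a second write */
    if (setenv(name, value, 1) != 0) {
        return -1;
    }
    already_set = true;
    return 0;
}
```

The flag is only set after a successful call, so a failed first attempt can still be retried later.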
Gilles Gouaillardet
661c35ca67
cleanup dead code caused by the removal of the --with-threads configure option
2015-01-16 19:13:59 +09:00
Ralph Castain
e7ff21b3aa
The opal_stop_progress_thread function releases the event base, so don't do it again
2015-01-15 10:48:40 -08:00
Ralph Castain
9ac39b63cc
Use the opal_progress_threads support for the ORTE progress thread in applications
2015-01-15 07:55:19 -08:00
Ralph Castain
d2938a144f
Use the proper interface index. Thanks to Mark Kettenis for spotting the problem and providing a patch
2015-01-12 05:31:02 -08:00
Howard Pritchard
f34dd5f5fd
plm/alps: update copyright
2015-01-07 12:33:38 -07:00
Howard Pritchard
c454d11b01
plm/alps: fix orted abort hang problem
...
Turns out the alps plm component wasn't changing the state
of the job upon terminating the orted's in the case of
an abnormal termination. This caused mpirun to hang
with a zombie'd aprun process if an orted on a node
in the job was killed via signal.
2015-01-07 12:31:41 -07:00
Howard Pritchard
f0f98f13b6
odls/base: fix an edge case with signals
...
In the course of doing some testing with how orted's
handle signaled child processes, found out that very
often doing a kill -9 on a process on a node just
results in the job hanging. The problem was that the
orted odls/errmgr was not properly handling the exit_code
being returned from waitpid. Now mark the proc state
as ORTE_PROC_STATE_ABORTED_BY_SIG if the exit_code
from waitpid indicates the process exited owing to
a signal.
2015-01-06 15:42:38 -07:00
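As a rough illustration of the waitpid handling described in the commit above: the status word can be inspected with the standard `WIFSIGNALED` / `WIFEXITED` macros. The `child_state_t` values and both function names below are hypothetical; they are not the ORTE proc-state constants or the odls code.

```c
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical state names for illustration; not the ORTE constants. */
typedef enum {
    CHILD_EXITED_NORMALLY,
    CHILD_ABORTED_BY_SIGNAL,
    CHILD_STATE_UNKNOWN
} child_state_t;

/* Classify a status word as filled in by waitpid(). */
static child_state_t classify_wait_status(int status)
{
    if (WIFSIGNALED(status)) {
        return CHILD_ABORTED_BY_SIGNAL;   /* e.g. kill -9 */
    }
    if (WIFEXITED(status)) {
        return CHILD_EXITED_NORMALLY;
    }
    return CHILD_STATE_UNKNOWN;
}

/* Fork a child that raises `sig` (or exits cleanly if sig == 0),
 * reap it, and classify how it terminated. */
static child_state_t run_child_and_classify(int sig)
{
    pid_t pid = fork();
    if (pid < 0) {
        return CHILD_STATE_UNKNOWN;
    }
    if (pid == 0) {
        if (sig != 0) {
            raise(sig);
        }
        _exit(0);
    }
    int status = 0;
    if (waitpid(pid, &status, 0) != pid) {
        return CHILD_STATE_UNKNOWN;
    }
    return classify_wait_status(status);
}
```

Checking `WIFSIGNALED` before interpreting the status as an exit code is what distinguishes a killed process from one that returned normally.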
Nadezhda Kogteva
05af80b302
Fix commit bffb2b7a4b which broke pmix server functionality
2014-12-24 13:25:23 +02:00
Ralph Castain
43a40f8aac
LSF expresses its affinity file in hwthreads and expects those to be used as cpus, so set things accordingly
2014-12-19 12:06:05 -08:00
Ralph Castain
b314bfb5e9
If someone specifies the bitmap for hwthreads and wants hwthread cpus, then don't parse the slot list as it expects cores - just copy the provided bitmap across as it already has the required info
2014-12-19 10:56:14 -08:00
Jeff Squyres
7b43bdc984
plm base: move flag inside the #if in which it is used
...
Avoid a compiler warning by declaring the tflag only inside the #if in
which it is used (i.e., if hwloc support is built).
2014-12-18 10:56:23 -08:00
Ralph Castain
2581b41d08
Continue refactoring code by splitting the msg processing from the sendrecv code
2014-12-17 19:57:14 -08:00
Ralph Castain
f489e871c2
Take first step towards refactoring the PMIx server code by splitting out the proc_map function into its own file. Update ignore to include .DS_Store from the Mac
2014-12-17 19:08:52 -08:00
Artem Polyakov
01601f3284
Merge pull request #305 from artpol84/timing
...
Timing framework improvement
2014-12-16 15:13:48 +06:00
Ralph Castain
573a574a3c
Remove an unused dstore type that was redundant with another one. Define a corresponding PMIX_NODE_ID type (contains the vpid of the daemon hosting the proc) and ensure that the PMIx server includes that info in its process map
2014-12-15 12:11:13 -08:00
Ralph Castain
a22cc45769
Close the pmix server sockets on exec
2014-12-13 20:30:21 -08:00
Ralph Castain
f4ff791335
Close oob/usock connections upon exec
2014-12-13 20:24:09 -08:00
Ralph Castain
6c4d5a51c4
Close tcp sockets upon exec
2014-12-13 20:23:53 -08:00
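The three close-on-exec commits above do not show their code here. One common POSIX way to get this behavior is to set the `FD_CLOEXEC` descriptor flag with `fcntl`, sketched below; `set_close_on_exec` is a hypothetical helper name, and whether the actual commits use exactly this approach is an assumption.

```c
#include <fcntl.h>

/* Mark a descriptor close-on-exec while preserving its other descriptor
 * flags. Returns 0 on success, -1 on error (errno set by fcntl). */
static int set_close_on_exec(int fd)
{
    int flags = fcntl(fd, F_GETFD);
    if (flags < 0) {
        return -1;
    }
    if (fcntl(fd, F_SETFD, flags | FD_CLOEXEC) < 0) {
        return -1;
    }
    return 0;
}
```

A descriptor marked this way is closed automatically in a child after a successful exec, so launched application processes do not inherit the daemon's sockets.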
Ralph Castain
9658256a98
Restore the passing of the complete job map to the local proc on first get_attr so the info can be used by the MPI layer without continual calls back to the server. We'll find a more memory efficient method later.
2014-12-13 18:44:09 -08:00
Ralph Castain
bffb2b7a4b
Correct some issues with variables used before being set
2014-12-12 17:23:32 -08:00
Ralph Castain
0630680f36
Two cleanups required for transfer to 1.8.4:
...
* Use %d format for the topo signature as some systems apparently have problems with %u
* Use correct variable in show_help message
2014-12-12 17:23:32 -08:00
Rolf vandeVaart
f4aecdbfd2
Change logging function name from log to logfn. Fixes issue with PGI compile
2014-12-12 09:46:44 -05:00
Artem Polyakov
8ffad75a0a
Introduce timing interval measurement facility in timing framework
2014-12-10 16:47:49 +06:00
Ralph Castain
9d5135e6cd
Function definition should use the correct type
2014-12-09 01:04:31 -08:00
Ralph Castain
bb529ebd8e
Revise the way we handle hetero nodes as users are finding this (a) a significant surprise, and (b) confusing as to when it is required. So try to automate it a bit by creating a topology "signature" that mpirun can share on the cmd line with the remote daemons, thus allowing them to check to see if they match. This isn't comprehensive of course - for now, it only checks the number of each type of hwloc object on the node. This is good enough to pick up major differences (e.g., where we have different numbers of sockets or assigned core bindings).
...
Retain the hetero-nodes flag for those cases where the user *knows* that there are differences and our automated system isn't good enough to see it.
Will obviously require further refinement as we find out which variances it can detect, and which it cannot.
2014-12-08 15:38:14 -08:00
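The idea of a count-based topology signature from the commit above can be sketched as follows. The struct, the three object types chosen, the `"<n>S:<n>C:<n>H"` string layout, and both function names are assumptions made for illustration; the real signature format is not given in this log, and a real implementation would obtain the counts from hwloc.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical per-node object counts. */
typedef struct {
    int sockets;
    int cores;
    int hwthreads;
} topo_counts_t;

/* Render the counts as a short string that can be passed on a command line.
 * Returns 0 on success, -1 if the buffer is too small. */
static int make_signature(const topo_counts_t *t, char *out, size_t outlen)
{
    int n = snprintf(out, outlen, "%dS:%dC:%dH",
                     t->sockets, t->cores, t->hwthreads);
    return (n < 0 || (size_t)n >= outlen) ? -1 : 0;
}

/* A daemon compares its own signature against the one it was given. */
static int signature_matches(const topo_counts_t *local, const char *remote_sig)
{
    char mine[64];
    if (make_signature(local, mine, sizeof(mine)) != 0) {
        return 0;
    }
    return 0 == strcmp(mine, remote_sig);
}
```

Because only counts are compared, two nodes with the same number of objects but a different layout would still match, which is the kind of gap the retained hetero-nodes flag covers.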
elenash
baf32fe480
Merge pull request #308 from elenash/master
...
restored _process_name_print_for_opal function in orte_init: it's requir...
2014-12-08 19:14:36 +03:00
Ralph Castain
b757b3f452
Ensure that the #nodes in the job map gets properly updated when using the sequential mapper. Provide some further diagnostic info to help understand the problem when encountered.
2014-12-08 08:03:53 -08:00
Elena
6cf3925b09
restored _process_name_print_for_opal function in orte_init: it's required for opal output from daemons which never called ompi_init so didn't set opal_process_name_print pointer
2014-12-08 13:13:35 +02:00
Ralph Castain
d6d69e2b13
Get the direct routed component to work with both TCP and USOCK OOB components. We previously had setup the direct component so it would only support direct-launched applications. Thus, all routes went direct between processes. However, if the job had been launched by mpirun, this made no sense - what you wanted instead was to have each app proc talk directly to its daemon, but have the daemons all directly connect to each other.
...
So we need all the routing code for dealing with cross-job communications, lifelines, etc. The HNP will be directly connected to all daemons as they must callback at startup, and so we need to track those children correctly so we know when it is okay to terminate.
We still have to support direct launch, though, as this is the only component we can use in that scenario. So if the app doesn't have daemon URI info, then it must fall back to directly connecting to everything.
2014-12-07 09:11:48 -08:00
Ralph Castain
b1bf557024
Fix the hostfile parser so it correctly ignores binding directives that are just integers. Fix the create_dmns function so we don't hang if we can't get an error before creating the job map for an application.
2014-12-05 15:47:09 -08:00
Elena
af38a762a2
these changes fix direct routed component under mpirun; oob tcp and oob ud are working with direct routed component, but usock doesn't work with direct routed component yet.
2014-12-05 12:38:59 +02:00
Ralph Castain
c4fd6d1cde
Fix typo
2014-12-04 12:24:35 -08:00
Ralph Castain
c4002a8485
Further cleanups on the LSF integration - the affinity file is apparently always present, but simply empty if affinity wasn't set.
2014-12-04 12:24:35 -08:00
Ralph Castain
c88f181efe
Fix singleton comm-spawn, yet again. The new grpcomm collectives require a complete knowledge of every active proc in the system in case they participate in a collective. So ensure we pass the required job info when we spawn new daemons, and construct the necessary connections to allow grpcomm to operate.
2014-12-03 18:11:17 -08:00
Howard Pritchard
c67afadcfc
Merge pull request #289 from hppritcha/topic/remove_pmi
...
Topic/remove pmi
2014-12-03 16:58:35 -07:00
Jeff Squyres
a3af7d6dbb
Revert "lsf configury: add dependent libraries for static linking"
...
This reverts commit 56cfa90dda.
2014-12-03 13:32:56 -08:00
Jeff Squyres
92c2ff91ec
Revert "Cleanup static build requirements by adding the wrapper flags back to the component configure.m4's. Minor cleanup of the lsf configure logic."
...
This reverts commit open-mpi/ompi@32bf0e7b7e.
2014-12-03 13:15:20 -08:00
Ralph Castain
54c955c92d
Fix a race condition that only appears to be affecting certain setups. The pmix.finalize function closes the file descriptor to the server, which then triggers the errhandler callback. Since the errmgr is about to be unloaded, it might be getting hit.
2014-12-03 12:19:00 -08:00
Howard Pritchard
666344a081
orte/mca/common/alps: fix configure file
...
Fix configure file for alps to actually check for
alps being available.
Also include stdio.h explicitly in common_alps.c
2014-12-03 09:44:18 -07:00
Howard Pritchard
ec38aa3732
orte/mca/common: add missing Makefile.am
2014-12-03 09:44:18 -07:00
Howard Pritchard
191fe0f949
alps configury changes
...
Clean up the orte_check_alps.m4. There was a little
unnecessary stuff for handling CLE 5, since it wasn't actually
doing the right thing, which would be to use pkg-config to
find dependencies both for dynamic and static linking.
Decouple the searching for alps libs, etc. from cray pmi.
Switch the alps ess and alps odls components' config files
to use the ALPS m4 macro.
alps configury fixes
Improve a check for detecting CLE release.
Improve an error message.
2014-12-03 09:44:17 -07:00
Howard Pritchard
d749077e1e
odls/alps: make sure PMI env. variables set up
...
Add call to orte_odls_alps_get_rdma_creds in the
local proc launch step to obtain the Cray Rdma
credentials from the apshepherd, and to set
the PMI env. variables expected by uGNI BTL, etc.
2014-12-03 09:44:17 -07:00
Howard Pritchard
e0487e7702
orte/common/alps: add an alps common lib to orte
...
Add an alps common lib to orte. Add a function
to determine whether or not a process is in a
PAGG container.
Note: we need a better naming convention for
common libs, since right now they use a "flat"
naming convention.
2014-12-03 09:44:17 -07:00
Howard Pritchard
a753c3ece0
ess/alps: add initial alps ess component
...
Note this alps ess component has nothing to do
with the old CNOS alps component used on
Cray Seastar/Portals3 (Cray XT) systems.
To work properly, changes need to be made to the
open method of the ess/pmi component to keep it
from selecting, and thus initializing, the opal/pmix/cray
component.
2014-12-03 09:44:17 -07:00
Ralph Castain
32bf0e7b7e
Cleanup static build requirements by adding the wrapper flags back to the component configure.m4's. Minor cleanup of the lsf configure logic.
2014-12-03 07:14:06 -08:00
Ralph Castain
cb15cc06e1
Minor changes per Jeff's request on PR for 1.8.4
2014-12-02 19:54:10 -08:00
Ralph Castain
6294ed991b
Fix singletons - still working on singleton comm_spawn
2014-12-02 14:12:24 -08:00
Ralph Castain
14cdb04327
Revise the ess/pmi selection logic as all APPs must select it, and no daemons. Cleanup some of the mca param levels in ess so we don't printout the topology quite as easily.
2014-12-01 21:19:11 -08:00
Jeff Squyres
56cfa90dda
lsf configury: add dependent libraries for static linking
...
Ensure to add the LSF dependent libraries and LD flags for the wrapper
compiler static linking case.
2014-12-01 14:59:10 -08:00
Ralph Castain
f92ccaf0f9
Add missing var declarations
2014-12-01 09:36:28 -08:00
Ralph Castain
960ef34988
Ensure the LSF ras adds the hosts to the allocation. Correctly handle the semi-colon vs comma situation in hwloc slot_lists
2014-11-30 14:37:37 -08:00
Ralph Castain
3f9d9ae8b6
Provide tighter LSF integration by correctly handling scenarios where the user has asked LSF to assign bindings. Fix a couple of typos in lex parser definitions. Tell hostfile parser to ignore binding designations in hostfiles. Add an attribute to indicate that cpusets were provided as physical cpu ids.
...
Once validated, a version of this will be backported to the v1.8.4 release.
2014-11-30 11:50:31 -08:00
Elena
b17ea23ce0
fixes for direct routed component under mpirun
2014-11-26 13:36:49 +02:00
Ralph Castain
f48b9012cb
Some minor cleanup. We really don't need another peer error constant to indicate that a peer closed as we already have one for "connection failed", and that's all we really know. Update the orte constants to track their opal equivalents.
2014-11-25 08:02:29 -08:00
Nadezhda Kogteva
8dd21c7736
OOB UD: fix case when multiple oob components were specified in command line (checking of uri).
2014-11-25 11:48:11 +02:00
Gilles Gouaillardet
578fe41788
fix hangs introduced by previous commit a6744b8177
2014-11-25 17:50:44 +09:00
Gilles Gouaillardet
a6744b8177
fix misc memory leaks specific to the master
2014-11-25 13:52:10 +09:00
Gilles Gouaillardet
38879cf682
fix misc memory leaks
2014-11-25 11:32:43 +09:00
Ralph Castain
48f702827e
First part of memory leak cleanups from Gilles
2014-11-24 16:53:33 -08:00
Ralph Castain
2e00e335b9
Add missing header to tarball. Remove stale opal_unignore
2014-11-21 17:35:11 -08:00
Howard Pritchard
6e807c4e8a
odls/alps: minor config cleanup
2014-11-21 11:09:28 -07:00
Nadezhda Kogteva
05b2eb1270
OOB UD: opal_ignore removed from oob ud component: component is compilable. Added support of new RML API, support of opal_buffer as input data. Added usage of routed component.
2014-11-20 10:20:35 +02:00
rhc54
7c0273ecb3
Merge pull request #276 from teng-lin/master
...
Fixed a bug that fails to parse hostname starting with numbers.
2014-11-19 16:39:00 -08:00
Teng Lin
07ff51f43f
Fixed a bug that fails to parse hostname starting with numbers.
...
According to RFC 1123, hostnames that begin with numbers are valid.
2014-11-19 16:03:55 -08:00
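For reference, the RFC 1123 rule cited in the commit above (labels may begin with a digit) can be expressed as a small validator. This is a standalone sketch with a hypothetical function name; it is not the hostfile parser that the commit fixed, and it deliberately rejects a trailing dot.

```c
#include <ctype.h>
#include <stdbool.h>
#include <string.h>

/* Check a hostname: dot-separated labels of 1..63 letters, digits, or
 * hyphens; a label may start with a digit but may not start or end with
 * a hyphen; total length at most 253 characters. */
static bool is_valid_hostname(const char *name)
{
    size_t total = strlen(name);
    if (total == 0 || total > 253) {
        return false;
    }

    size_t label_len = 0;
    for (size_t i = 0; i < total; i++) {
        unsigned char c = (unsigned char)name[i];
        if (c == '.') {
            /* a label may not be empty or end with a hyphen */
            if (label_len == 0 || name[i - 1] == '-') {
                return false;
            }
            label_len = 0;
        } else if (isalnum(c)) {
            label_len++;
        } else if (c == '-') {
            if (label_len == 0) {
                return false;          /* no leading hyphen */
            }
            label_len++;
        } else {
            return false;              /* any other character is invalid */
        }
        if (label_len > 63) {
            return false;
        }
    }
    /* the final label must be non-empty and must not end with a hyphen */
    return label_len > 0 && name[total - 1] != '-';
}
```

For example, `is_valid_hostname("10node.example")` is accepted, while `is_valid_hostname("-node")` is rejected.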
Howard Pritchard
9425ebefae
Be more selective about closing fd's for alps/odls
...
Be more selective about closing fd's for the alps odls
component. Don't close fd's of pipes set up by the
apshepherd for providing RDMA credentials, etc.
Add an entry to the help file in case
alps_app_lli_pipes returns an error.
2014-11-19 11:21:30 -07:00
Ralph Castain
bb91517349
Allow other layers to register their own print-attribute functions so we can maintain pretty-print capabilities as the attributes are extended.
2014-11-19 09:37:59 -08:00
Ralph Castain
37593b232d
Add a marker for the max attr value being used by ORTE so that other, higher-levels can also use the attribute system
2014-11-19 09:37:59 -08:00
Howard Pritchard
34c156759e
fix some compiler warnings in ras/alps
2014-11-18 11:32:37 -07:00
Howard Pritchard
4df3447d96
fix compare_nodes bug in alps ras component
...
There was an obvious bug in the alps/ras component compare_nodes method
which resulted in the function always evaluating the nodes
as being equivalent.
2014-11-18 11:15:02 -07:00
Howard Pritchard
ff362c16ce
add/update copyrights for alps odls component
2014-11-18 10:16:11 -07:00
Howard Pritchard
dc98b62070
add initial support for an alps odls component
...
It turns out that the support for Open MPI apps on
Cray was hanging on a thin thread of support when
using the mpirun job launcher. It just happened that
with a certain set of configuration options things would
work. This is bound to backfire at some point.
To fix this weakness, as well as to allow for mpirun launched
jobs to benefit from many of the advanced placement features
provided by the Cray Linux Environment (as opposed to the hwloc
only default env of orte), a new odls alps component is introduced.
2014-11-17 14:00:09 -07:00
Ralph Castain
d9ceb5aea4
Fix C++ builds by removing no-longer-needed type declaration
2014-11-14 11:44:24 -08:00
Gilles Gouaillardet
f3b36fdf6e
orted/pmix: fix pmix_server_release when several jobids are running on the same node
2014-11-14 16:17:28 +09:00