1
1
Граф коммитов

3515 Коммитов

Автор SHA1 Сообщение Дата
Ralph Castain
78245e8a33 Continue massaging of the notifier framework. Convert it to an event-driven interface. Add the ability to report job state if requested. Cleanup object declarations. 2015-02-17 12:51:11 -08:00
Ralph Castain
22f1d29b82 Re-introduce the ORTE notifier framework for logging errors that would otherwise result in abort for persistent systems. Thanks to L. Rajeshnarayanan of Intel for the contribution
Subsequent commits will integrate this capability with the state and errmgr frameworks.
2015-02-16 12:46:58 -08:00
Ralph Castain
116fcaff2c Start adding support for cmd line options to orte-submit 2015-02-10 12:13:21 -08:00
Ralph Castain
063e4c9989 Cleanup the pretty-print of odls cmds as some were missing. Add a new cmd to terminate the DVM, which the HNP will use to trun around and issue an xcast to the DVM. 2015-02-10 08:27:13 -08:00
Ralph Castain
3ae3b96c17 Fix master compilation - a buried header dependency must have been removed. 2015-02-10 07:22:10 -08:00
Howard Pritchard
b62d9c2c70 ess/alps: fix compile issue for pgi
remote -fi-noident cflag option.  Wasn't helping anyway
and caused pgi compiles to break.
2015-02-09 20:49:04 -08:00
Ralph Castain
a3275aa867 Once again, fix the blasted singleton comm_spawn 2015-02-05 17:34:25 -08:00
Ralph Castain
f28238af59 Fix a race condition seen by Absoft during finalize. Stop the orte progress thread without cleaning it up, thus allowing the frameworks to still cancel their posted recv's. Then cleanup the memory footprint afterwards. 2015-02-05 11:41:37 -08:00
Jeff Squyres
938b8e1dad schitzo: fix free of uninitialized value
The "param" value is not assigned before this free() statement.  So
remove it.

(yay clang compiler warnings)
2015-02-04 15:50:24 -05:00
Ralph Castain
251084a2da When a tool requests the spawn of a new job, then exclusively forward output to that tool - the DVM should not output its own copy as well. 2015-02-04 07:59:47 -08:00
Ralph Castain
2b0b012460 Continue refinement of the DVM operations. Send the spawn request to the right place (it helps) as it isn't a comm_spawn request and has to be treated a little differently. Ensure IO gets forwarded back to the tool. Ensure the tool outputs show_help locally as there is no place to send it. 2015-02-04 06:21:54 -08:00
Ralph Castain
7299cc3ab9 Cleanup the communications handshake so that orte-submit properly terminates upon job completion, and properly sends the terminate command to orte-dvm 2015-02-03 07:25:43 -08:00
Ralph Castain
ec5ccb76cf Enable persistent ORTE DVM so users can execute multiple OMPI jobs within an allocation without restarting the DVM every time. 2015-01-30 11:00:43 -08:00
Ralph Castain
b838df9eb8 Get slurm to stay out of the way on singletons 2015-01-27 09:29:43 -06:00
Ralph Castain
294ebc907a Fix singleton operations so they can work inside a slurm environment 2015-01-27 09:29:42 -06:00
Ralph Castain
3eca55caec Continue fixing singletons in slurm environments 2015-01-27 09:29:42 -06:00
Ralph Castain
88c38f87d2 Get the orteds to use schizo as well 2015-01-27 09:29:42 -06:00
Ralph Castain
028b00154d Complete implementation of the schizo framework to support OMPI component 2015-01-27 09:29:42 -06:00
Ralph Castain
11c92eefe6 ckpt 2015-01-27 09:29:42 -06:00
Howard Pritchard
2809c21e0f rml/oob: check peer param in send methods
The rml/oob was not doing sanity checks on the input peer
parameter for the orte_rml_oob_send_nb and orte_rml_oob_send_buffer_nd.
Owing to the fact that there are places in the ompi/orte stack
where things like orte_show_help_norender are called way before
ORTE_PROC_MY_HNP, are setup properly, all kinds of weird
startup failures can occur as the rml/oob tries to process send
requests where the peer is junk.

Rather than try to expand this kind of thing:

    /* if we are the HNP, or the RML has not yet been setup,
     * or ROUTED has not been setup,
     * or we weren't given an HNP, or we are running in standalone
     * mode, then all we can do is process this locally
     */
    if (ORTE_PROC_IS_HNP || orte_standalone_operation ||
        NULL == orte_rml.send_buffer_nb ||
        NULL == orte_routed.get_route ||
        NULL == orte_process_info.my_hnp_uri) {
        rc = show_help(filename, topic, output, ORTE_PROC_MY_NAME);
    }

do the right thing in the rml level and return an error rather than
eventually failing in the send owing to peer not being valid.
2015-01-22 06:12:39 -08:00
Howard Pritchard
06d3b57c07 Merge pull request #351 from hppritcha/topic/alps_odls_spawn_bug
odls/alps: check if PMI gni rdma creds already set
2015-01-19 11:48:24 -07:00
Howard Pritchard
fd807aee69 odls/alps: check if PMI gni rdma creds already set
Need to check if the alps odls component has already
read the rdma creds from alps.  Its okay to ask apshepherd
multiple times for rdma creds, but opal_setenv gets
a bit picky about this.  Rather than check for the OPAL_EXISTS
return value from opal_setenv, for now just check with
a static variable whether or not orte_odls_alps_get_rdma_creds
has already been successfully called before.

Would be nice to have an opal_getenv function for checking
if an env. variable had already been set by opal_putenv.
2015-01-19 10:12:38 -08:00
Ralph Castain
e7ff21b3aa The opal_stop_progress_thread function releases the event base, so don't do it again 2015-01-15 10:48:40 -08:00
Ralph Castain
9ac39b63cc Use the opal_progress_threads support for the ORTE progress thread in applications 2015-01-15 07:55:19 -08:00
Ralph Castain
d2938a144f Use the proper interface index. Thanks to Mark Kettenis for spotting the problem and providing a patch 2015-01-12 05:31:02 -08:00
Howard Pritchard
f34dd5f5fd plm/alps: update copyright 2015-01-07 12:33:38 -07:00
Howard Pritchard
c454d11b01 plm/alps: fix orted abort hang problem
Turns out the alps plm component wasn't changing the state
of the job upon terminating the orted's in the case of
an abnormal termination.  This caused mpirun to hang
with a zommbie'd aprun process if an orted on a node
in the job was killed via signal.
2015-01-07 12:31:41 -07:00
Howard Pritchard
f0f98f13b6 odls/base: fix an edge case with signals
In the course of doing some testing with how orted's
handle signaled child processes, found out that very
often doing a kill -9 on a process on a node just
results in the job hanging. The problem was that the
orted odls/errmgr was not properly handling the exit_code
being returned from waitpid.  Now mark the proc state
as ORTE_PROC_STATE_ABORTED_BY_SIG if the exit_code
from waitpid indicates the process exited owing to
a signal.
2015-01-06 15:42:38 -07:00
Ralph Castain
43a40f8aac LSF expresses its affinity file in hwthreads and expects those to be used as cpus, so set things accordingly 2014-12-19 12:06:05 -08:00
Ralph Castain
b314bfb5e9 If someone specifies the bitmap for hwthreads and wants hwthread cpus, then don't parse the slot list as it expects cores - just copy the provided bitmap across as it already has the required info 2014-12-19 10:56:14 -08:00
Jeff Squyres
7b43bdc984 plm base: move flag inside the #if in which it is used
Avoid a compiler warning by declaring the tflag only inside the #if in
which it is used (i.e., if hwloc support is built).
2014-12-18 10:56:23 -08:00
Artem Polyakov
01601f3284 Merge pull request #305 from artpol84/timing
Timing framework improvement
2014-12-16 15:13:48 +06:00
Ralph Castain
f4ff791335 Close oob/usock connections upon exec 2014-12-13 20:24:09 -08:00
Ralph Castain
6c4d5a51c4 Close tcp sockets upon exec 2014-12-13 20:23:53 -08:00
Ralph Castain
0630680f36 Two cleanups required for transfer to 1.8.4:
* Use %d format for the topo signature as some systems apparently have problems with %u
* Use correct variable in show_help message
2014-12-12 17:23:32 -08:00
Rolf vandeVaart
f4aecdbfd2 Change logging function name from log to logfn. Fixes issue with PGI compile 2014-12-12 09:46:44 -05:00
Artem Polyakov
8ffad75a0a Introduce timing interval measurement facility in timing framework 2014-12-10 16:47:49 +06:00
Ralph Castain
bb529ebd8e Revise the way we handle hetero nodes as users are finding this (a) a significant surprise, and (b) confusing as to when it is required. So try to automate it a bit by creating a topology "signature" that mpirun can share on the cmd line with the remote daemons, thus allowing them to check to see if they match. This isn't comprehensive of course - for now, it only checks the number of each type of hwloc object on the node. This is good enough to pickup major differences (e.g., where we have different numbers of sockets or assigned core bindings).
Retain the hetero-nodes flag for those cases where the user *knows* that there are differences and our automated system isn't good enough to see it.

Will obviously require further refinement as we find out which variances it can detect, and which it cannot.
2014-12-08 15:38:14 -08:00
Ralph Castain
b757b3f452 Ensure that the #nodes in the job map gets properly updated when using the sequential mapper. Provide some further diagnostic info to help understand the problem when encountered. 2014-12-08 08:03:53 -08:00
Ralph Castain
d6d69e2b13 Get the direct routed component to work with both TCP and USOCK OOB components. We previously had setup the direct component so it would only support direct-launched applications. Thus, all routes went direct between processes. However, if the job had been launched by mpirun, this made no sense - what you wanted instead was to have each app proc talk directly to its daemon, but have the daemons all directly connect to each other.
So we need all the routing code for dealing with cross-job communications, lifelines, etc. The HNP will be directly connected to all daemons as they must callback at startup, and so we need to track those children correctly so we know when it is okay to terminate.

We still have to support direct launch, though, as this is the only component we can use in that scenario. So if the app doesn't have daemon URI info, then it must fall back to directly connecting to everything.
2014-12-07 09:11:48 -08:00
Ralph Castain
b1bf557024 Fix the hostfile parser so it correctly ignores binding directives that are just integers. Fix the create_dmns function so we don't hang if we can't get an error before creating the job map for an application. 2014-12-05 15:47:09 -08:00
Elena
af38a762a2 these changes fix direct routed component under mpirun; oob tcp and oob ud are working with direct routed component, but usock doesn't work with direct routed component yet. 2014-12-05 12:38:59 +02:00
Ralph Castain
c4fd6d1cde Fix typo 2014-12-04 12:24:35 -08:00
Ralph Castain
c4002a8485 Further cleanups on the LSF integration - the affinity file is apparently always present, but simply empty if affinity wasn't set. 2014-12-04 12:24:35 -08:00
Ralph Castain
c88f181efe Fix singleton comm-spawn, yet again. The new grpcomm collectives require a complete knowledge of every active proc in the system in case they participate in a collective. So ensure we pass the required job info when we spawn new daemons, and construct the necessary connections to allow grpcomm to operate. 2014-12-03 18:11:17 -08:00
Howard Pritchard
c67afadcfc Merge pull request #289 from hppritcha/topic/remove_pmi
Topic/remove pmi
2014-12-03 16:58:35 -07:00
Jeff Squyres
a3af7d6dbb Revert "lsf configury: add dependent libraries for static linking"
This reverts commit 56cfa90dda.
2014-12-03 13:32:56 -08:00
Jeff Squyres
92c2ff91ec Revert "Cleanup static build requirements by adding the wrapper flags back to the component configure.m4's. Minor cleanup of the lsf configure logic."
This reverts commit open-mpi/ompi@32bf0e7b7e.
2014-12-03 13:15:20 -08:00
Ralph Castain
54c955c92d Fix a race condition that only appears to be affecting certain setups. The pmix.finalize function closes the file descriptor to the server, which then triggers the errhandler callback. Since the errmgr is about to be unloaded, it might be getting hit. 2014-12-03 12:19:00 -08:00
Howard Pritchard
666344a081 orte/mca/common/alps: fix configure file
Fix configure file for alps to actually check for
alps being available.

Also include stdio.h explicitly in common_alps.c
2014-12-03 09:44:18 -07:00