1
1
Граф коммитов

4737 Коммитов

Автор SHA1 Сообщение Дата
Gilles Gouaillardet
d8f3b378b3 orte/oob: fix misc memory leaks
as reported by Coverity as CIDs 1196748, 1196749 and 1269895
2015-03-02 15:31:11 +09:00
Jeff Squyres
336626dafe spelling: trivial spelling fix
s/interupted/interrupted/gi
2015-02-27 18:30:43 -08:00
Gilles Gouaillardet
ab78c7f54a orted/pmix: fix misc resource leak
as reported by Coverity with CID 1269844
2015-02-27 19:25:55 +09:00
Mike Dubman
dbc15009b6 Merge pull request #415 from alinask/topic/fix_fork_support_flow
Fix the calls to ibv_fork_init and remove btl_openib_want_fork_support.
2015-02-26 21:50:11 +02:00
Nathan Hjelm
883d09376f Fix coverity #1271536 2015-02-25 11:35:45 -07:00
rhc54
efbb57430b Merge pull request #419 from nkogteva/master
grpcomm brcks: fix copy-paste bug which affects performance
2015-02-25 07:39:55 -08:00
Alina Sklarevich
e4c4e7df5e Fix the calls to ibv_fork_init and remove btl_openib_want_fork_support.
In order to have an effect, ibv_fork_init should be called in the
beginning of the verbs initialization flow - before the calls to the
ibv_create_qp and ibv_create_cq verbs.
These functions are called from the oob/ud code and by the time the
other verbs components (btl openib, pml yalla, ...) call ibv_fork_init,
it's too late. This commit forces the call to ibv_fork_init (if it's
requested) right at the beginning of all the components that are using
verbs.
(ibv_fork_init() can be safely called multiple times)

This commit also removes the btl_openib_want_fork_support mca parameter
and adds a new mca parameter instead - opal_verbs_want_fork_support.
Through this new parameter, fork support may be requested for ALL
components.
The default value for this parameter is set to 1.

Before this commit the btl_openib_want_fork_support parameter didn't
provide fork support for the openib btl if its value was set to 1.
(because when openib called ibv_fork_init, it was already after the
calls to ibv_create_* in oob/ud and thereofre it failed).
2015-02-25 10:58:50 +02:00
Jeff Squyres
a85a392896 Merge pull request #422 from jsquyres/topic/coverity-fixes
Some Coverity fixes
2015-02-24 17:00:10 -05:00
Jeff Squyres
05f00aface plm base: ensure mca_base_var_get_value() and mca_base_var_find() succeed
This was CID 993712
2015-02-24 15:48:50 -05:00
Ralph Castain
451bd16a10 Remove dead code 2015-02-24 12:41:12 -08:00
Jeff Squyres
4f54fedf05 orterun: ensure to set used_num_procs=true after finding that token
This was CID 71687.
2015-02-24 15:25:39 -05:00
Jeff Squyres
398ae15533 rmaps_base_frame: remove dead code
This was CID 1196641
2015-02-24 15:24:11 -05:00
Jeff Squyres
71ae0ad5ec oob_tcp_component: add #if OPAL_ENABLE_IPV6 around IPv6-specific code
This was CID 1196629
2015-02-24 15:24:11 -05:00
Jeff Squyres
0bd2783b91 oob_usock: don't try to close the socket if it didn't open
This was CID 1196663
2015-02-24 15:24:09 -05:00
Jeff Squyres
e2223cd9bf plm_rsh: ensure cwd array is \0-terminated
This was CID 72257
2015-02-24 15:24:08 -05:00
Ralph Castain
332e4fa7aa Minor fix - relative host name syntax cannot support usernames as you can't know which hosts will be selected 2015-02-24 12:15:28 -08:00
Nathan Hjelm
ed78553512 Update opal_free_list_t usage to reflect new class interface.
Please verify your components have been updated correctly. Keep in
mind that in terms of threading:

OPAL_FREE_LIST_GET -> opal_free_list_get_st
OPAL_FREE_LIST_RETURN -> opal_free_list_return_st

I used the opal_using_threads() variant anytime it appeared multiple
threads could be operating on the free list. If this is not the case
update to _st. If multiple threads are always in use change to _mt.
2015-02-24 10:05:44 -07:00
Nadezhda Kogteva
c4d6ca6468 grpcomm brcks: fix copy-paste bug which affects performance 2015-02-24 17:06:39 +02:00
Jeff Squyres
226a814c9d grpcomm_brks: fix minor compiler warning (rc used before set)
Also check for OBJ_NEW returning NULL.
2015-02-23 09:04:45 -08:00
Jeff Squyres
600858609e grpcomm_rcd: fix minor compiler warning (rc used before set)
Also check for OBJ_NEW returning NULL.
2015-02-23 09:03:07 -08:00
Howard Pritchard
bf89131f9e add owner files to opa/ompi/orte mca directories
This commit adds an owner file in each of the component directories
for each framework.  This allows for a simple script to parse
the contents of the files and generate, among other things, tables
to be used on the project's wiki page.  Currently there are two
"fields" in the file, an owner and a status.  A tool to parse
the files and generate tables for the wiki page will be added
in a subsequent commit.
2015-02-22 15:10:23 -07:00
Jeff Squyres
15be948d79 wrappers: *_EXTRA_INCLUDES does not exist any more
There were a few places where *_EXTRA_INCLUDES (and derivates) were
still being used.  This commit removes all of them.
2015-02-20 08:43:25 -08:00
Jeff Squyres
9b716d946e wrappers: fix errant @{libdir} reference in pkg-config files
The RPATH support added a @{libdir} token into
<package>_WRAPPER_EXTRA_LDFLAGS.  However, these flags are also
substituted into the pkg-config data files, and they don't understand
the @{foo} notation.  So convert @{libdir} into ${libdir}, which
pkg-config *does* understand.

Thanks to Christoph Junghans (@junghans) for notifying us of the issue.

Fixes #406.
2015-02-20 08:43:19 -08:00
Jeff Squyres
ec62766a71 notifier base: remove unused variables 2015-02-20 07:06:13 -08:00
Elena
48eae25b8f fixed issue with grpcomm rcd and brks algorithms which led to performance issues: data just for part of processes was unpacked and stored locally during fence, therefore clients were forced to ask daemons for data directly during get request 2015-02-20 16:41:25 +02:00
Ralph Castain
f7c28ea706 Fix bad test - opal_buffer and opal_ptr can support NULL locations 2015-02-17 21:46:23 -08:00
Ralph Castain
852fbca020 Shut coverity up 2015-02-17 21:17:23 -08:00
Ralph Castain
c1282d5b99 The opal_buffer type also generates its own alloc, so need to let it pass thru the check 2015-02-17 21:06:19 -08:00
Ralph Castain
207cc74f87 Correct name of help file 2015-02-17 16:03:20 -08:00
Ralph Castain
624b16e070 Protect the unload attribute function 2015-02-17 14:21:23 -08:00
Ralph Castain
78245e8a33 Continue massaging of the notifier framework. Convert it to an event-driven interface. Add the ability to report job state if requested. Cleanup object declarations. 2015-02-17 12:51:11 -08:00
Gilles Gouaillardet
8dc4f30fae orte/tools: fix NULL pointer dereference
as reported by Coverity with CIDs 1196671 and 1196824
2015-02-17 15:45:06 +09:00
Gilles Gouaillardet
b762766969 orte/util: fix misc memory leaks
as reported by Coverity with CIDS 70314, 710653-710657 and 1196741-1196744
2015-02-17 12:27:23 +09:00
Ralph Castain
22f1d29b82 Re-introduce the ORTE notifier framework for logging errors that would otherwise result in abort for persistent systems. Thanks to L. Rajeshnarayanan of Intel for the contribution
Subsequent commits will integrate this capability with the state and errmgr frameworks.
2015-02-16 12:46:58 -08:00
Gilles Gouaillardet
8fe8079080 Fix a build failure when configure'd with --without-hwloc
see http://mtt.open-mpi.org/index.php?do_redir=2235
2015-02-16 10:31:09 +09:00
Jeff Squyres
3ac1d0dae5 *-info: add "lt_dladvise support" lines 2015-02-11 12:25:20 -08:00
Ralph Castain
2a83d2613a Cleanup the orte/test/system directory 2015-02-11 10:42:38 -08:00
Ralph Castain
d5775bf9de Cleanup orte MPI test directory so it all builds again 2015-02-11 10:14:06 -08:00
Ralph Castain
ce56c0a2cf Oops - remove debug/exit 2015-02-11 10:14:06 -08:00
Jeff Squyres
c9e3f22933 orte mpi tests: fix a bunch of compiler warnings 2015-02-11 12:28:10 -05:00
Jeff Squyres
07179ef669 orte mpi tests: don't use deprecated MPI functions
Change MPI_Errhandler_set -> MPI_Comm_set_errhandler
2015-02-11 12:28:10 -05:00
Jeff Squyres
cc7f433c0f Makefile: this file should not be executable 2015-02-11 07:33:56 -08:00
Ralph Castain
3de8c5c7c6 Cleanup the munge support - the credential cannot be reused for multiple connections 2015-02-10 20:34:35 -08:00
Ralph Castain
46fb850bb0 Continue adding support for options on orte-submit - still need to shift some of the MCA params to job object attributes 2015-02-10 13:56:14 -08:00
Ralph Castain
116fcaff2c Start adding support for cmd line options to orte-submit 2015-02-10 12:13:21 -08:00
rhc54
cf3f4def48 Merge pull request #386 from marksantcroos/master
Add debug option to orte-dvm.

Looks fine - thanks
2015-02-10 11:38:52 -08:00
Ralph Castain
df2cd96772 Display the local/global attribute flag more prominently. Mark the attributes as global in orte-submit so they will be communicated 2015-02-10 10:47:32 -08:00
Mark Santcroos
ff6a69a68d Add debug option to orte-dvm. 2015-02-10 13:02:23 -05:00
Ralph Castain
063e4c9989 Cleanup the pretty-print of odls cmds as some were missing. Add a new cmd to terminate the DVM, which the HNP will use to trun around and issue an xcast to the DVM. 2015-02-10 08:27:13 -08:00
Ralph Castain
3ae3b96c17 Fix master compilation - a buried header dependency must have been removed. 2015-02-10 07:22:10 -08:00
Elena
948c20d862 added pmix unit test to tarball 2015-02-10 13:41:15 +02:00
Howard Pritchard
b62d9c2c70 ess/alps: fix compile issue for pgi
remote -fi-noident cflag option.  Wasn't helping anyway
and caused pgi compiles to break.
2015-02-09 20:49:04 -08:00
Ralph Castain
3478def791 Ensure that nodes get included in the nidmap when spawning a new DVM job - we really only need to do this once, but for now we do it for every job until we work out how to avoid the duplication. Remove debug from orte-dvm tool 2015-02-09 23:47:46 -05:00
Ralph Castain
ef13ba7db3 Add debug-daemons option to orte-dvm 2015-02-09 11:08:45 -05:00
Ralph Castain
a3275aa867 Once again, fix the blasted singleton comm_spawn 2015-02-05 17:34:25 -08:00
Ralph Castain
f28238af59 Fix a race condition seen by Absoft during finalize. Stop the orte progress thread without cleaning it up, thus allowing the frameworks to still cancel their posted recv's. Then cleanup the memory footprint afterwards. 2015-02-05 11:41:37 -08:00
Jeff Squyres
938b8e1dad schitzo: fix free of uninitialized value
The "param" value is not assigned before this free() statement.  So
remove it.

(yay clang compiler warnings)
2015-02-04 15:50:24 -05:00
Ralph Castain
251084a2da When a tool requests the spawn of a new job, then exclusively forward output to that tool - the DVM should not output its own copy as well. 2015-02-04 07:59:47 -08:00
Ralph Castain
2b0b012460 Continue refinement of the DVM operations. Send the spawn request to the right place (it helps) as it isn't a comm_spawn request and has to be treated a little differently. Ensure IO gets forwarded back to the tool. Ensure the tool outputs show_help locally as there is no place to send it. 2015-02-04 06:21:54 -08:00
Ralph Castain
7299cc3ab9 Cleanup the communications handshake so that orte-submit properly terminates upon job completion, and properly sends the terminate command to orte-dvm 2015-02-03 07:25:43 -08:00
Elena
5919b636e1 changed output format in pmix unit test 2015-02-02 14:22:51 +02:00
Ralph Castain
4dba298e6e Update orte-submit manpage, add the ompi-* versions of orte-dvm and orte-submit manpages 2015-02-01 15:46:40 -08:00
Ralph Castain
e303a9b1d6 Provide an orte-dvm man page. Provide an option to orte-submit for terminating the DVM 2015-02-01 12:14:44 -08:00
Ralph Castain
ec5ccb76cf Enable persistent ORTE DVM so users can execute multiple OMPI jobs within an allocation without restarting the DVM every time. 2015-01-30 11:00:43 -08:00
rhc54
e7fa600d85 Merge pull request #360 from elenash/master
added unit test for pmix functionality
2015-01-28 06:18:57 -06:00
Elena
472baa1284 added unit test for pmix functionality 2015-01-28 13:18:26 +02:00
Ralph Castain
b838df9eb8 Get slurm to stay out of the way on singletons 2015-01-27 09:29:43 -06:00
Ralph Castain
294ebc907a Fix singleton operations so they can work inside a slurm environment 2015-01-27 09:29:42 -06:00
Ralph Castain
3eca55caec Continue fixing singletons in slurm environments 2015-01-27 09:29:42 -06:00
Ralph Castain
fcec24b2a4 Minor cleanups to handle comm_spawn and singletons 2015-01-27 09:29:42 -06:00
Ralph Castain
74385302c0 Add the personality to the orte_job_t datatype support 2015-01-27 09:29:42 -06:00
Ralph Castain
88c38f87d2 Get the orteds to use schizo as well 2015-01-27 09:29:42 -06:00
Ralph Castain
028b00154d Complete implementation of the schizo framework to support OMPI component 2015-01-27 09:29:42 -06:00
Ralph Castain
11c92eefe6 ckpt 2015-01-27 09:29:42 -06:00
rhc54
a1707326bf Merge pull request #359 from hppritcha/topic/better_help
orte/util: minor improvement to show_help
2015-01-25 08:13:49 -08:00
Howard Pritchard
1e94d84ae6 orte/util: minor improvement to show_help
Make sure the show help gives it a good try to
print an error message locally if the
send_buffer_nb method returns an error.
2015-01-23 13:54:03 -08:00
Howard Pritchard
2809c21e0f rml/oob: check peer param in send methods
The rml/oob was not doing sanity checks on the input peer
parameter for the orte_rml_oob_send_nb and orte_rml_oob_send_buffer_nd.
Owing to the fact that there are places in the ompi/orte stack
where things like orte_show_help_norender are called way before
ORTE_PROC_MY_HNP, are setup properly, all kinds of weird
startup failures can occur as the rml/oob tries to process send
requests where the peer is junk.

Rather than try to expand this kind of thing:

    /* if we are the HNP, or the RML has not yet been setup,
     * or ROUTED has not been setup,
     * or we weren't given an HNP, or we are running in standalone
     * mode, then all we can do is process this locally
     */
    if (ORTE_PROC_IS_HNP || orte_standalone_operation ||
        NULL == orte_rml.send_buffer_nb ||
        NULL == orte_routed.get_route ||
        NULL == orte_process_info.my_hnp_uri) {
        rc = show_help(filename, topic, output, ORTE_PROC_MY_NAME);
    }

do the right thing in the rml level and return an error rather than
eventually failing in the send owing to peer not being valid.
2015-01-22 06:12:39 -08:00
Howard Pritchard
06d3b57c07 Merge pull request #351 from hppritcha/topic/alps_odls_spawn_bug
odls/alps: check if PMI gni rdma creds already set
2015-01-19 11:48:24 -07:00
Howard Pritchard
fd807aee69 odls/alps: check if PMI gni rdma creds already set
Need to check if the alps odls component has already
read the rdma creds from alps.  Its okay to ask apshepherd
multiple times for rdma creds, but opal_setenv gets
a bit picky about this.  Rather than check for the OPAL_EXISTS
return value from opal_setenv, for now just check with
a static variable whether or not orte_odls_alps_get_rdma_creds
has already been successfully called before.

Would be nice to have an opal_getenv function for checking
if an env. variable had already been set by opal_putenv.
2015-01-19 10:12:38 -08:00
Gilles Gouaillardet
661c35ca67 cleanup dead code caused by the removal of the --with-threads configure option 2015-01-16 19:13:59 +09:00
Ralph Castain
e7ff21b3aa The opal_stop_progress_thread function releases the event base, so don't do it again 2015-01-15 10:48:40 -08:00
Ralph Castain
9ac39b63cc Use the opal_progress_threads support for the ORTE progress thread in applications 2015-01-15 07:55:19 -08:00
Ralph Castain
d2938a144f Use the proper interface index. Thanks to Mark Kettenis for spotting the problem and providing a patch 2015-01-12 05:31:02 -08:00
Howard Pritchard
f34dd5f5fd plm/alps: update copyright 2015-01-07 12:33:38 -07:00
Howard Pritchard
c454d11b01 plm/alps: fix orted abort hang problem
Turns out the alps plm component wasn't changing the state
of the job upon terminating the orted's in the case of
an abnormal termination.  This caused mpirun to hang
with a zommbie'd aprun process if an orted on a node
in the job was killed via signal.
2015-01-07 12:31:41 -07:00
Howard Pritchard
f0f98f13b6 odls/base: fix an edge case with signals
In the course of doing some testing with how orted's
handle signaled child processes, found out that very
often doing a kill -9 on a process on a node just
results in the job hanging. The problem was that the
orted odls/errmgr was not properly handling the exit_code
being returned from waitpid.  Now mark the proc state
as ORTE_PROC_STATE_ABORTED_BY_SIG if the exit_code
from waitpid indicates the process exited owing to
a signal.
2015-01-06 15:42:38 -07:00
Nadezhda Kogteva
05af80b302 Fix commit bffb2b7a4b which broke pmix server functionality 2014-12-24 13:25:23 +02:00
Ralph Castain
43a40f8aac LSF expresses its affinity file in hwthreads and expects those to be used as cpus, so set things accordingly 2014-12-19 12:06:05 -08:00
Ralph Castain
b314bfb5e9 If someone specifies the bitmap for hwthreads and wants hwthread cpus, then don't parse the slot list as it expects cores - just copy the provided bitmap across as it already has the required info 2014-12-19 10:56:14 -08:00
Jeff Squyres
7b43bdc984 plm base: move flag inside the #if in which it is used
Avoid a compiler warning by declaring the tflag only inside the #if in
which it is used (i.e., if hwloc support is built).
2014-12-18 10:56:23 -08:00
Ralph Castain
2581b41d08 Continue refactoring code by splitting the msg processing from the sendrecv code 2014-12-17 19:57:14 -08:00
Ralph Castain
f489e871c2 Take first step towards refactoring the PMIx server code by splitting out the proc_map function into its own file. Update ignore to include .DS_Store from the Mac 2014-12-17 19:08:52 -08:00
Artem Polyakov
01601f3284 Merge pull request #305 from artpol84/timing
Timing framework improvement
2014-12-16 15:13:48 +06:00
Ralph Castain
573a574a3c Remove an unused dstore type that was redundant with another one. Define a corresponding PMIX_NODE_ID type (contains the vpid of the daemon hosting the proc) and ensure that the PMIx server includes that info in its process map 2014-12-15 12:11:13 -08:00
Ralph Castain
a22cc45769 Close the pmix server sockets on exec 2014-12-13 20:30:21 -08:00
Ralph Castain
f4ff791335 Close oob/usock connections upon exec 2014-12-13 20:24:09 -08:00
Ralph Castain
6c4d5a51c4 Close tcp sockets upon exec 2014-12-13 20:23:53 -08:00
Ralph Castain
9658256a98 Restore the passing of the complete job map to the local proc on first get_attr so the info can be used by the MPI layer without continual calls back to the server. We'll find a more memory efficient method later. 2014-12-13 18:44:09 -08:00
Ralph Castain
bffb2b7a4b Correct some issues with variables used before being set 2014-12-12 17:23:32 -08:00
Ralph Castain
0630680f36 Two cleanups required for transfer to 1.8.4:
* Use %d format for the topo signature as some systems apparently have problems with %u
* Use correct variable in show_help message
2014-12-12 17:23:32 -08:00