openmpi

Author	SHA1	Message	Date
Ralph Castain	d7d8ae46ed	We no longer pass the RML URI for procs launched via mpirun as the daemon has no need for that info.	2015-03-17 06:10:20 -07:00
Ralph Castain	3e32c360c7	Add new MCA parameter to support edge case with debugger at LLNL	2015-03-16 20:04:05 -07:00
Ralph Castain	a0487e014c	Further reduce the RARP load by removing getaddrinfo for IPv6 connections. Correct typo when checking return on inet_pton. Don't consider the TCP component for apps that are launched via mpirun as it will never be used.	2015-03-16 19:42:05 -07:00
Ralph Castain	5ae42c816e	Attempt to reduce the RARP traffic during definition of allocations	2015-03-16 16:26:40 -07:00
Ralph Castain	64d11f170a	Adjust the default keepalive interval. Refactor the code when setting keepalive options	2015-03-16 12:32:58 -07:00
Ralph Castain	4ded049cbc	Modify MCA param description	2015-03-16 11:57:32 -07:00
Ralph Castain	019bba5caf	Cleanup a bit - don't need to lookup the protocol number if we just use the right define	2015-03-16 11:54:51 -07:00
Ralph Castain	69ac25bf55	Add support for TCP keepalive on inter-node sockets	2015-03-16 09:59:44 -07:00
adrianreber	714d9aa67e	Merge pull request #348 from adrianreber/topic/orte_cr_continue_like_restart Topic/orte cr continue like restart	2015-03-12 14:54:02 +01:00
Nathan Hjelm	695dcd5a28	oob/ud: fix compiler warning	2015-03-11 10:53:32 -06:00
Adrian Reber	c08e234af7	FT: fix compilation using --with-ft (5/5) Enabling the FT code breaks compilation (again). This series tries to fix the compiler errors. This is again only fixing the compiler errors without any warranty that the result might actually support FT again. With the changes introduced in the previous patches in this series some goto constructs for cleanup are no longer necessary and removed.	2015-03-11 14:23:33 +01:00
Adrian Reber	8ba41a834a	FT: fix compilation using --with-ft (4/5) Enabling the FT code breaks compilation (again). This series tries to fix the compiler errors. This is again only fixing the compiler errors without any warranty that the result might actually support FT again. This patch tries to handle the new xcast semantic.	2015-03-11 14:23:33 +01:00
Adrian Reber	1c5a8df724	FT: fix compilation using --with-ft (2/5) Enabling the FT code breaks compilation (again). This series tries to fix the compiler errors. This is again only fixing the compiler errors without any warranty that the result might actually support FT again. The FT code used barrier mechanisms which have been removed with `aec5cd08bd`. This patch replaces all those different barriers with opal_pmix.fence(NULL, 0); I am not sure this is completely correct but at least a starting point for a review.	2015-03-11 14:23:33 +01:00
Adrian Reber	f45dd069bd	FT: fix compilation using --with-ft (1/5) Enabling the FT code breaks compilation (again). This series tries to fix the compiler errors. This is again only fixing the compiler errors without any warranty that the result might actually support FT again. This first patch moves orte_cr_continue_like_restart from ORTE to opal_cr_continue_like_restart in OPAL. This only leaves three calls from OPAL to ORTE in the FT code. As it is not yet 100% clear how to handle these calls the code orte_sstore.set_attr() has been #ifdef'd out for now.	2015-03-11 14:23:33 +01:00
Gilles Gouaillardet	a69d935d55	oob/tcp: fix misc issues as reported by Coverity with CIDs 70726, 710564, 1196630, 1269805, 1269803, 1269932	2015-03-10 19:32:01 +09:00
Gilles Gouaillardet	dc0bc756dc	iof/base: fix misc memory leak as reported by Coverity with CID 1196732	2015-03-10 14:37:53 +09:00
Jeff Squyres	a026456bef	(orte|ompi|oshmem)info tools: convert to opal_dl interface Note that this commit removes option:lt_dladvise from the various "info" tools output. This technically breaks our CLI "ABI" because we're not deprecating it / replacing it with an alias to some other "info" tool output. Although the dl/libltdl component contains a "have_lt_dladvise" MCA var that contains the same information, the "option:lt_dladvise" output from the various "info" tools is *not* an MCA var, and therefore we can't alias it. So it just has to die.	2015-03-09 08:18:13 -07:00
Gilles Gouaillardet	59be12b260	filem/raw: fix misc memory leaks as reported by Coverity with CIDs 716815, 716817, 720760, 1196703, 1196704, 1196746	2015-03-09 19:56:20 +09:00
Gilles Gouaillardet	2ab9a411f8	plm/base: fix misc memory leaks as reported by Coverity with CIDs 1196733 and 1196745	2015-03-09 16:25:07 +09:00
Gilles Gouaillardet	fa10025843	ras/slurm: fix misc memory leaks as reported by Coverity with CIDs 968580 and 1196723-1196727	2015-03-09 15:58:51 +09:00
Gilles Gouaillardet	eae39bd948	ras/simulator: fix misc memory leaks as reported by Coverity with CIDs 710647, 714133 and 714134	2015-03-09 15:52:29 +09:00
Gilles Gouaillardet	4c0eb11e08	orterun: fix misc errors as reported by Coverity with CIDs 70700, 71039, 710651	2015-03-09 11:57:18 +09:00
Gilles Gouaillardet	33841361c0	orte-clean: use pclose instead of fclose as reported by Coverity with CID 1287029	2015-03-09 11:17:59 +09:00
Elena	6c6fe75c7b	added one more time interval for barrier to pmix unit test	2015-03-06 10:33:14 +02:00
Ralph Castain	64ec498a20	Add a declspec	2015-03-05 19:48:27 -08:00
Ralph Castain	eaa666bd57	Instantiate debug output variable	2015-03-05 12:25:49 -08:00
Ralph Castain	7ce0a9931c	Updates to the notifier interfaces to support system events	2015-03-05 10:39:25 -08:00
Gilles Gouaillardet	7de3f35b90	plm/rsh: fix misc memory leaks as reported by Coverity with CIDs 71091, 71230, 71231, 72274, 72389, 1196718 and 1196719	2015-03-05 20:03:37 +09:00
Gilles Gouaillardet	33352e9506	schizo: fix misc memory leak as reported by Coverity with CID 1196722	2015-03-05 14:06:18 +09:00
Gilles Gouaillardet	89806c6261	orte/util: fix memory leaks as reported by Coverity with CIDs 70845, 71855, 710652, 1196738, 1196739, 1196757, 1196758, 1269863 and 1269883	2015-03-05 14:06:18 +09:00
Gilles Gouaillardet	4e7b5240e4	orte/tools: fix misc memory leaks as reported by Coverity with CIDs 70700, 71039, 71854, 72384 and 710651	2015-03-05 14:06:18 +09:00
Gilles Gouaillardet	d1b2f043ff	fix misc memory leaks as already reported by Coverity with CIDs 71818, 71819, 72250, 715767, 1196749 and 1274002	2015-03-05 13:58:05 +09:00
Gilles Gouaillardet	42f5a36ee3	rmaps/seq: fix misc memory leaks as reported by Coverity with CIDs 1269886 and 1269887	2015-03-02 15:31:11 +09:00
Gilles Gouaillardet	0c7a2846d1	rmaps/rank_file: fix misc memory leaks as reported by Coverity with CIDs 72250 and 1196774	2015-03-02 15:31:11 +09:00
Gilles Gouaillardet	c15b919635	rmaps/lama: fix misc memory leaks as reported by Coverity with CIDs 719263, 719264, 1196712 and 1269842	2015-03-02 15:31:11 +09:00
Gilles Gouaillardet	456baeb71b	rmaps/base: fix misc memory leaks as reported by Coverity with CIDs 1196751, 1196754, 1196755 and 1269866	2015-03-02 15:31:11 +09:00
Gilles Gouaillardet	d8f3b378b3	orte/oob: fix misc memory leaks as reported by Coverity with CIDs 1196748, 1196749 and 1269895	2015-03-02 15:31:11 +09:00
Jeff Squyres	336626dafe	spelling: trivial spelling fix s/interupted/interrupted/gi	2015-02-27 18:30:43 -08:00
Gilles Gouaillardet	ab78c7f54a	orted/pmix: fix misc resource leak as reported by Coverity with CID 1269844	2015-02-27 19:25:55 +09:00
Mike Dubman	dbc15009b6	Merge pull request #415 from alinask/topic/fix_fork_support_flow Fix the calls to ibv_fork_init and remove btl_openib_want_fork_support.	2015-02-26 21:50:11 +02:00
Nathan Hjelm	883d09376f	Fix coverity #1271536	2015-02-25 11:35:45 -07:00
rhc54	efbb57430b	Merge pull request #419 from nkogteva/master grpcomm brks: fix copy-paste bug which affects performance	2015-02-25 07:39:55 -08:00
Alina Sklarevich	e4c4e7df5e	Fix the calls to ibv_fork_init and remove btl_openib_want_fork_support. In order to have an effect, ibv_fork_init should be called in the beginning of the verbs initialization flow - before the calls to the ibv_create_qp and ibv_create_cq verbs. These functions are called from the oob/ud code and by the time the other verbs components (btl openib, pml yalla, ...) call ibv_fork_init, it's too late. This commit forces the call to ibv_fork_init (if it's requested) right at the beginning of all the components that are using verbs. (ibv_fork_init() can be safely called multiple times) This commit also removes the btl_openib_want_fork_support mca parameter and adds a new mca parameter instead - opal_verbs_want_fork_support. Through this new parameter, fork support may be requested for ALL components. The default value for this parameter is set to 1. Before this commit the btl_openib_want_fork_support parameter didn't provide fork support for the openib btl if its value was set to 1. (because when openib called ibv_fork_init, it was already after the calls to ibv_create_* in oob/ud and therefore it failed).	2015-02-25 10:58:50 +02:00
Jeff Squyres	a85a392896	Merge pull request #422 from jsquyres/topic/coverity-fixes Some Coverity fixes	2015-02-24 17:00:10 -05:00
Jeff Squyres	05f00aface	plm base: ensure mca_base_var_get_value() and mca_base_var_find() succeed This was CID 993712	2015-02-24 15:48:50 -05:00
Ralph Castain	451bd16a10	Remove dead code	2015-02-24 12:41:12 -08:00
Jeff Squyres	4f54fedf05	orterun: ensure to set used_num_procs=true after finding that token This was CID 71687.	2015-02-24 15:25:39 -05:00
Jeff Squyres	398ae15533	rmaps_base_frame: remove dead code This was CID 1196641	2015-02-24 15:24:11 -05:00
Jeff Squyres	71ae0ad5ec	oob_tcp_component: add #if OPAL_ENABLE_IPV6 around IPv6-specific code This was CID 1196629	2015-02-24 15:24:11 -05:00
Jeff Squyres	0bd2783b91	oob_usock: don't try to close the socket if it didn't open This was CID 1196663	2015-02-24 15:24:09 -05:00
Jeff Squyres	e2223cd9bf	plm_rsh: ensure cwd array is \0-terminated This was CID 72257	2015-02-24 15:24:08 -05:00
Ralph Castain	332e4fa7aa	Minor fix - relative host name syntax cannot support usernames as you can't know which hosts will be selected	2015-02-24 12:15:28 -08:00
Nathan Hjelm	ed78553512	Update opal_free_list_t usage to reflect new class interface. Please verify your components have been updated correctly. Keep in mind that in terms of threading: OPAL_FREE_LIST_GET -> opal_free_list_get_st OPAL_FREE_LIST_RETURN -> opal_free_list_return_st I used the opal_using_threads() variant anytime it appeared multiple threads could be operating on the free list. If this is not the case update to _st. If multiple threads are always in use change to _mt.	2015-02-24 10:05:44 -07:00
Nadezhda Kogteva	c4d6ca6468	grpcomm brks: fix copy-paste bug which affects performance	2015-02-24 17:06:39 +02:00
Jeff Squyres	226a814c9d	grpcomm_brks: fix minor compiler warning (rc used before set) Also check for OBJ_NEW returning NULL.	2015-02-23 09:04:45 -08:00
Jeff Squyres	600858609e	grpcomm_rcd: fix minor compiler warning (rc used before set) Also check for OBJ_NEW returning NULL.	2015-02-23 09:03:07 -08:00
Howard Pritchard	bf89131f9e	add owner files to opal/ompi/orte mca directories This commit adds an owner file in each of the component directories for each framework. This allows for a simple script to parse the contents of the files and generate, among other things, tables to be used on the project's wiki page. Currently there are two "fields" in the file, an owner and a status. A tool to parse the files and generate tables for the wiki page will be added in a subsequent commit.	2015-02-22 15:10:23 -07:00
Jeff Squyres	15be948d79	wrappers: _EXTRA_INCLUDES does not exist any more There were a few places where _EXTRA_INCLUDES (and derivatives) were still being used. This commit removes all of them.	2015-02-20 08:43:25 -08:00
Jeff Squyres	9b716d946e	wrappers: fix errant @{libdir} reference in pkg-config files The RPATH support added a @{libdir} token into <package>_WRAPPER_EXTRA_LDFLAGS. However, these flags are also substituted into the pkg-config data files, and they don't understand the @{foo} notation. So convert @{libdir} into ${libdir}, which pkg-config does understand. Thanks to Christoph Junghans (@junghans) for notifying us of the issue. Fixes #406.	2015-02-20 08:43:19 -08:00
Jeff Squyres	ec62766a71	notifier base: remove unused variables	2015-02-20 07:06:13 -08:00
Elena	48eae25b8f	fixed issue with grpcomm rcd and brks algorithms which led to performance issues: data just for part of processes was unpacked and stored locally during fence, therefore clients were forced to ask daemons for data directly during get request	2015-02-20 16:41:25 +02:00
Ralph Castain	f7c28ea706	Fix bad test - opal_buffer and opal_ptr can support NULL locations	2015-02-17 21:46:23 -08:00
Ralph Castain	852fbca020	Shut coverity up	2015-02-17 21:17:23 -08:00
Ralph Castain	c1282d5b99	The opal_buffer type also generates its own alloc, so need to let it pass thru the check	2015-02-17 21:06:19 -08:00
Ralph Castain	207cc74f87	Correct name of help file	2015-02-17 16:03:20 -08:00
Ralph Castain	624b16e070	Protect the unload attribute function	2015-02-17 14:21:23 -08:00
Ralph Castain	78245e8a33	Continue massaging of the notifier framework. Convert it to an event-driven interface. Add the ability to report job state if requested. Cleanup object declarations.	2015-02-17 12:51:11 -08:00
Gilles Gouaillardet	8dc4f30fae	orte/tools: fix NULL pointer dereference as reported by Coverity with CIDs 1196671 and 1196824	2015-02-17 15:45:06 +09:00
Gilles Gouaillardet	b762766969	orte/util: fix misc memory leaks as reported by Coverity with CIDs 70314, 710653-710657 and 1196741-1196744	2015-02-17 12:27:23 +09:00
Ralph Castain	22f1d29b82	Re-introduce the ORTE notifier framework for logging errors that would otherwise result in abort for persistent systems. Thanks to L. Rajeshnarayanan of Intel for the contribution Subsequent commits will integrate this capability with the state and errmgr frameworks.	2015-02-16 12:46:58 -08:00
Gilles Gouaillardet	8fe8079080	Fix a build failure when configure'd with --without-hwloc see http://mtt.open-mpi.org/index.php?do_redir=2235	2015-02-16 10:31:09 +09:00
Jeff Squyres	3ac1d0dae5	*-info: add "lt_dladvise support" lines	2015-02-11 12:25:20 -08:00
Ralph Castain	2a83d2613a	Cleanup the orte/test/system directory	2015-02-11 10:42:38 -08:00
Ralph Castain	d5775bf9de	Cleanup orte MPI test directory so it all builds again	2015-02-11 10:14:06 -08:00
Ralph Castain	ce56c0a2cf	Oops - remove debug/exit	2015-02-11 10:14:06 -08:00
Jeff Squyres	c9e3f22933	orte mpi tests: fix a bunch of compiler warnings	2015-02-11 12:28:10 -05:00
Jeff Squyres	07179ef669	orte mpi tests: don't use deprecated MPI functions Change MPI_Errhandler_set -> MPI_Comm_set_errhandler	2015-02-11 12:28:10 -05:00
Jeff Squyres	cc7f433c0f	Makefile: this file should not be executable	2015-02-11 07:33:56 -08:00
Ralph Castain	3de8c5c7c6	Cleanup the munge support - the credential cannot be reused for multiple connections	2015-02-10 20:34:35 -08:00
Ralph Castain	46fb850bb0	Continue adding support for options on orte-submit - still need to shift some of the MCA params to job object attributes	2015-02-10 13:56:14 -08:00
Ralph Castain	116fcaff2c	Start adding support for cmd line options to orte-submit	2015-02-10 12:13:21 -08:00
rhc54	cf3f4def48	Merge pull request #386 from marksantcroos/master Add debug option to orte-dvm. Looks fine - thanks	2015-02-10 11:38:52 -08:00
Ralph Castain	df2cd96772	Display the local/global attribute flag more prominently. Mark the attributes as global in orte-submit so they will be communicated	2015-02-10 10:47:32 -08:00
Mark Santcroos	ff6a69a68d	Add debug option to orte-dvm.	2015-02-10 13:02:23 -05:00
Ralph Castain	063e4c9989	Cleanup the pretty-print of odls cmds as some were missing. Add a new cmd to terminate the DVM, which the HNP will use to turn around and issue an xcast to the DVM.	2015-02-10 08:27:13 -08:00
Ralph Castain	3ae3b96c17	Fix master compilation - a buried header dependency must have been removed.	2015-02-10 07:22:10 -08:00
Elena	948c20d862	added pmix unit test to tarball	2015-02-10 13:41:15 +02:00
Howard Pritchard	b62d9c2c70	ess/alps: fix compile issue for pgi: remove -fi-noident cflag option. Wasn't helping anyway and caused pgi compiles to break.	2015-02-09 20:49:04 -08:00
Ralph Castain	3478def791	Ensure that nodes get included in the nidmap when spawning a new DVM job - we really only need to do this once, but for now we do it for every job until we work out how to avoid the duplication. Remove debug from orte-dvm tool	2015-02-09 23:47:46 -05:00
Ralph Castain	ef13ba7db3	Add debug-daemons option to orte-dvm	2015-02-09 11:08:45 -05:00
Ralph Castain	a3275aa867	Once again, fix the blasted singleton comm_spawn	2015-02-05 17:34:25 -08:00
Ralph Castain	f28238af59	Fix a race condition seen by Absoft during finalize. Stop the orte progress thread without cleaning it up, thus allowing the frameworks to still cancel their posted recv's. Then cleanup the memory footprint afterwards.	2015-02-05 11:41:37 -08:00
Jeff Squyres	938b8e1dad	schizo: fix free of uninitialized value The "param" value is not assigned before this free() statement. So remove it. (yay clang compiler warnings)	2015-02-04 15:50:24 -05:00
Ralph Castain	251084a2da	When a tool requests the spawn of a new job, then exclusively forward output to that tool - the DVM should not output its own copy as well.	2015-02-04 07:59:47 -08:00
Ralph Castain	2b0b012460	Continue refinement of the DVM operations. Send the spawn request to the right place (it helps) as it isn't a comm_spawn request and has to be treated a little differently. Ensure IO gets forwarded back to the tool. Ensure the tool outputs show_help locally as there is no place to send it.	2015-02-04 06:21:54 -08:00
Ralph Castain	7299cc3ab9	Cleanup the communications handshake so that orte-submit properly terminates upon job completion, and properly sends the terminate command to orte-dvm	2015-02-03 07:25:43 -08:00
Elena	5919b636e1	changed output format in pmix unit test	2015-02-02 14:22:51 +02:00
Ralph Castain	4dba298e6e	Update orte-submit manpage, add the ompi-* versions of orte-dvm and orte-submit manpages	2015-02-01 15:46:40 -08:00
Ralph Castain	e303a9b1d6	Provide an orte-dvm man page. Provide an option to orte-submit for terminating the DVM	2015-02-01 12:14:44 -08:00
Ralph Castain	ec5ccb76cf	Enable persistent ORTE DVM so users can execute multiple OMPI jobs within an allocation without restarting the DVM every time.	2015-01-30 11:00:43 -08:00
rhc54	e7fa600d85	Merge pull request #360 from elenash/master added unit test for pmix functionality	2015-01-28 06:18:57 -06:00
Elena	472baa1284	added unit test for pmix functionality	2015-01-28 13:18:26 +02:00
Ralph Castain	b838df9eb8	Get slurm to stay out of the way on singletons	2015-01-27 09:29:43 -06:00
Ralph Castain	294ebc907a	Fix singleton operations so they can work inside a slurm environment	2015-01-27 09:29:42 -06:00
Ralph Castain	3eca55caec	Continue fixing singletons in slurm environments	2015-01-27 09:29:42 -06:00
Ralph Castain	fcec24b2a4	Minor cleanups to handle comm_spawn and singletons	2015-01-27 09:29:42 -06:00
Ralph Castain	74385302c0	Add the personality to the orte_job_t datatype support	2015-01-27 09:29:42 -06:00
Ralph Castain	88c38f87d2	Get the orteds to use schizo as well	2015-01-27 09:29:42 -06:00
Ralph Castain	028b00154d	Complete implementation of the schizo framework to support OMPI component	2015-01-27 09:29:42 -06:00
Ralph Castain	11c92eefe6	ckpt	2015-01-27 09:29:42 -06:00
rhc54	a1707326bf	Merge pull request #359 from hppritcha/topic/better_help orte/util: minor improvement to show_help	2015-01-25 08:13:49 -08:00
Howard Pritchard	1e94d84ae6	orte/util: minor improvement to show_help Make sure the show help gives it a good try to print an error message locally if the send_buffer_nb method returns an error.	2015-01-23 13:54:03 -08:00
Howard Pritchard	2809c21e0f	rml/oob: check peer param in send methods The rml/oob was not doing sanity checks on the input peer parameter for the orte_rml_oob_send_nb and orte_rml_oob_send_buffer_nb. Owing to the fact that there are places in the ompi/orte stack where things like orte_show_help_norender are called way before ORTE_PROC_MY_HNP is set up properly, all kinds of weird startup failures can occur as the rml/oob tries to process send requests where the peer is junk. Rather than try to expand this kind of thing: /* if we are the HNP, or the RML has not yet been setup, * or ROUTED has not been setup, * or we weren't given an HNP, or we are running in standalone * mode, then all we can do is process this locally */ if (ORTE_PROC_IS_HNP || orte_standalone_operation || NULL == orte_rml.send_buffer_nb || NULL == orte_routed.get_route || NULL == orte_process_info.my_hnp_uri) { rc = show_help(filename, topic, output, ORTE_PROC_MY_NAME); } do the right thing in the rml level and return an error rather than eventually failing in the send owing to peer not being valid.	2015-01-22 06:12:39 -08:00
Howard Pritchard	06d3b57c07	Merge pull request #351 from hppritcha/topic/alps_odls_spawn_bug odls/alps: check if PMI gni rdma creds already set	2015-01-19 11:48:24 -07:00
Howard Pritchard	fd807aee69	odls/alps: check if PMI gni rdma creds already set Need to check if the alps odls component has already read the rdma creds from alps. It's okay to ask apshepherd multiple times for rdma creds, but opal_setenv gets a bit picky about this. Rather than check for the OPAL_EXISTS return value from opal_setenv, for now just check with a static variable whether or not orte_odls_alps_get_rdma_creds has already been successfully called before. Would be nice to have an opal_getenv function for checking if an env. variable had already been set by opal_putenv.	2015-01-19 10:12:38 -08:00
Gilles Gouaillardet	661c35ca67	cleanup dead code caused by the removal of the --with-threads configure option	2015-01-16 19:13:59 +09:00
Ralph Castain	e7ff21b3aa	The opal_stop_progress_thread function releases the event base, so don't do it again	2015-01-15 10:48:40 -08:00
Ralph Castain	9ac39b63cc	Use the opal_progress_threads support for the ORTE progress thread in applications	2015-01-15 07:55:19 -08:00
Ralph Castain	d2938a144f	Use the proper interface index. Thanks to Mark Kettenis for spotting the problem and providing a patch	2015-01-12 05:31:02 -08:00
Howard Pritchard	f34dd5f5fd	plm/alps: update copyright	2015-01-07 12:33:38 -07:00
Howard Pritchard	c454d11b01	plm/alps: fix orted abort hang problem Turns out the alps plm component wasn't changing the state of the job upon terminating the orted's in the case of an abnormal termination. This caused mpirun to hang with a zombie'd aprun process if an orted on a node in the job was killed via signal.	2015-01-07 12:31:41 -07:00
Howard Pritchard	f0f98f13b6	odls/base: fix an edge case with signals In the course of doing some testing with how orted's handle signaled child processes, found out that very often doing a kill -9 on a process on a node just results in the job hanging. The problem was that the orted odls/errmgr was not properly handling the exit_code being returned from waitpid. Now mark the proc state as ORTE_PROC_STATE_ABORTED_BY_SIG if the exit_code from waitpid indicates the process exited owing to a signal.	2015-01-06 15:42:38 -07:00
Nadezhda Kogteva	05af80b302	Fix commit `bffb2b7a4b` which broke pmix server functionality	2014-12-24 13:25:23 +02:00
Ralph Castain	43a40f8aac	LSF expresses its affinity file in hwthreads and expects those to be used as cpus, so set things accordingly	2014-12-19 12:06:05 -08:00
Ralph Castain	b314bfb5e9	If someone specifies the bitmap for hwthreads and wants hwthread cpus, then don't parse the slot list as it expects cores - just copy the provided bitmap across as it already has the required info	2014-12-19 10:56:14 -08:00
Jeff Squyres	7b43bdc984	plm base: move flag inside the #if in which it is used Avoid a compiler warning by declaring the tflag only inside the #if in which it is used (i.e., if hwloc support is built).	2014-12-18 10:56:23 -08:00
Ralph Castain	2581b41d08	Continue refactoring code by splitting the msg processing from the sendrecv code	2014-12-17 19:57:14 -08:00
Ralph Castain	f489e871c2	Take first step towards refactoring the PMIx server code by splitting out the proc_map function into its own file. Update ignore to include .DS_Store from the Mac	2014-12-17 19:08:52 -08:00
Artem Polyakov	01601f3284	Merge pull request #305 from artpol84/timing Timing framework improvement	2014-12-16 15:13:48 +06:00
Ralph Castain	573a574a3c	Remove an unused dstore type that was redundant with another one. Define a corresponding PMIX_NODE_ID type (contains the vpid of the daemon hosting the proc) and ensure that the PMIx server includes that info in its process map	2014-12-15 12:11:13 -08:00
Ralph Castain	a22cc45769	Close the pmix server sockets on exec	2014-12-13 20:30:21 -08:00
Ralph Castain	f4ff791335	Close oob/usock connections upon exec	2014-12-13 20:24:09 -08:00
Ralph Castain	6c4d5a51c4	Close tcp sockets upon exec	2014-12-13 20:23:53 -08:00
Ralph Castain	9658256a98	Restore the passing of the complete job map to the local proc on first get_attr so the info can be used by the MPI layer without continual calls back to the server. We'll find a more memory efficient method later.	2014-12-13 18:44:09 -08:00
Ralph Castain	bffb2b7a4b	Correct some issues with variables used before being set	2014-12-12 17:23:32 -08:00
Ralph Castain	0630680f36	Two cleanups required for transfer to 1.8.4: * Use %d format for the topo signature as some systems apparently have problems with %u * Use correct variable in show_help message	2014-12-12 17:23:32 -08:00
Rolf vandeVaart	f4aecdbfd2	Change logging function name from log to logfn. Fixes issue with PGI compile	2014-12-12 09:46:44 -05:00
Artem Polyakov	8ffad75a0a	Introduce timing interval measurement facility in timing framework	2014-12-10 16:47:49 +06:00
Ralph Castain	9d5135e6cd	Function definition should use the correct type	2014-12-09 01:04:31 -08:00
Ralph Castain	bb529ebd8e	Revise the way we handle hetero nodes as users are finding this (a) a significant surprise, and (b) confusing as to when it is required. So try to automate it a bit by creating a topology "signature" that mpirun can share on the cmd line with the remote daemons, thus allowing them to check to see if they match. This isn't comprehensive of course - for now, it only checks the number of each type of hwloc object on the node. This is good enough to pickup major differences (e.g., where we have different numbers of sockets or assigned core bindings). Retain the hetero-nodes flag for those cases where the user knows that there are differences and our automated system isn't good enough to see it. Will obviously require further refinement as we find out which variances it can detect, and which it cannot.	2014-12-08 15:38:14 -08:00
elenash	baf32fe480	Merge pull request #308 from elenash/master restored _process_name_print_for_opal function in orte_init: it's requir...	2014-12-08 19:14:36 +03:00
Ralph Castain	b757b3f452	Ensure that the #nodes in the job map gets properly updated when using the sequential mapper. Provide some further diagnostic info to help understand the problem when encountered.	2014-12-08 08:03:53 -08:00
Elena	6cf3925b09	restored _process_name_print_for_opal function in orte_init: it's required for opal output from daemons which never called ompi_init so didn't set opal_process_name_print pointer	2014-12-08 13:13:35 +02:00
Ralph Castain	d6d69e2b13	Get the direct routed component to work with both TCP and USOCK OOB components. We previously had setup the direct component so it would only support direct-launched applications. Thus, all routes went direct between processes. However, if the job had been launched by mpirun, this made no sense - what you wanted instead was to have each app proc talk directly to its daemon, but have the daemons all directly connect to each other. So we need all the routing code for dealing with cross-job communications, lifelines, etc. The HNP will be directly connected to all daemons as they must callback at startup, and so we need to track those children correctly so we know when it is okay to terminate. We still have to support direct launch, though, as this is the only component we can use in that scenario. So if the app doesn't have daemon URI info, then it must fall back to directly connecting to everything.	2014-12-07 09:11:48 -08:00
Ralph Castain	b1bf557024	Fix the hostfile parser so it correctly ignores binding directives that are just integers. Fix the create_dmns function so we don't hang if we can't get an error before creating the job map for an application.	2014-12-05 15:47:09 -08:00
Elena	af38a762a2	these changes fix direct routed component under mpirun; oob tcp and oob ud are working with direct routed component, but usock doesn't work with direct routed component yet.	2014-12-05 12:38:59 +02:00
Ralph Castain	c4fd6d1cde	Fix typo	2014-12-04 12:24:35 -08:00
Ralph Castain	c4002a8485	Further cleanups on the LSF integration - the affinity file is apparently always present, but simply empty if affinity wasn't set.	2014-12-04 12:24:35 -08:00
Ralph Castain	c88f181efe	Fix singleton comm-spawn, yet again. The new grpcomm collectives require a complete knowledge of every active proc in the system in case they participate in a collective. So ensure we pass the required job info when we spawn new daemons, and construct the necessary connections to allow grpcomm to operate.	2014-12-03 18:11:17 -08:00
Howard Pritchard	c67afadcfc	Merge pull request #289 from hppritcha/topic/remove_pmi Topic/remove pmi	2014-12-03 16:58:35 -07:00
Jeff Squyres	a3af7d6dbb	Revert "lsf configury: add dependent libraries for static linking" This reverts commit `56cfa90dda`.	2014-12-03 13:32:56 -08:00
Jeff Squyres	92c2ff91ec	Revert "Cleanup static build requirements by adding the wrapper flags back to the component configure.m4's. Minor cleanup of the lsf configure logic." This reverts commit open-mpi/ompi@32bf0e7b7e.	2014-12-03 13:15:20 -08:00
Ralph Castain	54c955c92d	Fix a race condition that only appears to be affecting certain setups. The pmix.finalize function closes the file descriptor to the server, which then triggers the errhandler callback. Since the errmgr is about to be unloaded, it might be getting hit.	2014-12-03 12:19:00 -08:00
Howard Pritchard	666344a081	orte/mca/common/alps: fix configure file Fix configure file for alps to actually check for alps being available. Also include stdio.h explicitly in common_alps.c	2014-12-03 09:44:18 -07:00
Howard Pritchard	ec38aa3732	orte/mca/common: add missing Makefile.am	2014-12-03 09:44:18 -07:00
Howard Pritchard	191fe0f949	alps configury changes Clean up the orte_check_alps.m4. There was a little unnecessary stuff for handling cle 5, since it wasn't actually doing the right thing, which would be to use pkg-config to find dependencies both for dynamic and static linking. Decouple the searching for alps libs, etc. from cray pmi. Switch the alps ess and alps odls components' config files to use the ALPS m4 macro. alps configury fixes Improve a check for detecting CLE release. Improve an error message.	2014-12-03 09:44:17 -07:00
Howard Pritchard	d749077e1e	odls/alps: make sure PMI env. variables set up Add call to orte_odls_alps_get_rdma_creds in the local proc launch step to obtain the Cray Rdma credentials from the apshepherd, and to set the PMI env. variables expected by uGNI BTL, etc.	2014-12-03 09:44:17 -07:00
Howard Pritchard	e0487e7702	orte/common/alps: add an alps common lib to orte Add an alps common lib to orte. Add a function to determine whether or not a process is in a PAGG container. Note: we need a better naming convention for common libs, since right now they use a "flat" naming convention.	2014-12-03 09:44:17 -07:00
Howard Pritchard	a753c3ece0	ess/alps: add initial alps ess component Note this alps ess component has nothing to do with the old CNOS alps component used on Cray Seastar/Portals3 (Cray XT) systems. To work properly, changes need to be made to the open method of the ess/pmi component to keep it from selecting, and thus initializing, the opal/pmix/cray component.	2014-12-03 09:44:17 -07:00
Ralph Castain	32bf0e7b7e	Cleanup static build requirements by adding the wrapper flags back to the component configure.m4's. Minor cleanup of the lsf configure logic.	2014-12-03 07:14:06 -08:00
Ralph Castain	cb15cc06e1	Minor changes per Jeff's request on PR for 1.8.4	2014-12-02 19:54:10 -08:00
Ralph Castain	6294ed991b	Fix singletons - still working on singleton comm_spawn	2014-12-02 14:12:24 -08:00
Ralph Castain	14cdb04327	Revise the ess/pmi selection logic as all APPs must select it, and no daemons. Cleanup some of the mca param levels in ess so we don't printout the topology quite as easily.	2014-12-01 21:19:11 -08:00
Jeff Squyres	56cfa90dda	lsf configury: add dependent libraries for static linking Ensure to add the LSF dependent libraries and LD flags for the wrapper compiler static linking case.	2014-12-01 14:59:10 -08:00
Ralph Castain	f92ccaf0f9	Add missing var declarations	2014-12-01 09:36:28 -08:00
Ralph Castain	960ef34988	Ensure the LSF ras adds the hosts to the allocation. Correctly handle the semi-colon vs comma situation in hwloc slot_lists	2014-11-30 14:37:37 -08:00
Ralph Castain	3f9d9ae8b6	Provide tighter LSF integration by correctly handling scenarios where the user has asked LSF to assign bindings. Fix a couple of typos in lex parser definitions. Tell hostfile parser to ignore binding designations in hostfiles. Add an attribute to indicate that cpusets were provided as physical cpu ids. Once validated, a version of this will be backported to the v1.8.4 release.	2014-11-30 11:50:31 -08:00
Elena	b17ea23ce0	fixes for direct routed component under mpirun	2014-11-26 13:36:49 +02:00
Ralph Castain	f48b9012cb	Some minor cleanup. We really don't need another peer error constant to indicate that a peer closed as we already have one for "connection failed", and that's all we really know. Update the orte constants to track their opal equivalents.	2014-11-25 08:02:29 -08:00
Nadezhda Kogteva	8dd21c7736	OOB UD: fix case when multiple oob components were specified in command line (checking of uri).	2014-11-25 11:48:11 +02:00
Gilles Gouaillardet	578fe41788	fix hangs introduced by previous commit `a6744b8177`	2014-11-25 17:50:44 +09:00
Gilles Gouaillardet	a6744b8177	fix misc memory leaks specific to the master	2014-11-25 13:52:10 +09:00
Gilles Gouaillardet	38879cf682	fix misc memory leaks	2014-11-25 11:32:43 +09:00
Ralph Castain	48f702827e	First part of memory leak cleanups from Gilles	2014-11-24 16:53:33 -08:00
Ralph Castain	2e00e335b9	Add missing header to tarball. Remove stale opal_unignore	2014-11-21 17:35:11 -08:00
Howard Pritchard	6e807c4e8a	odls/alps: minor config cleanup	2014-11-21 11:09:28 -07:00
Nadezhda Kogteva	05b2eb1270	OOB UD: opal_ignore removed from oob ud component: component is compilable. Added support of new RML API, support of opal_buffer as input data. Added usage of routed component.	2014-11-20 10:20:35 +02:00
rhc54	7c0273ecb3	Merge pull request #276 from teng-lin/master Fixed a bug that fails to parse hostname starting with numbers.	2014-11-19 16:39:00 -08:00
Teng Lin	07ff51f43f	Fixed a bug that fails to parse hostname starting with numbers. According to RFC 1123, hostnames that begin with numbers are valid.	2014-11-19 16:03:55 -08:00
Howard Pritchard	9425ebefae	Be more selective about closing fd's for alps/odls Be more selective about closing fd's for the alps odls component. Don't close fd's of pipes set up by the apshepherd for providing RDMA credentials, etc. Add an entry to the help file in case alps_app_lli_pipes returns an error.	2014-11-19 11:21:30 -07:00
Ralph Castain	bb91517349	Allow other layers to register their own print-attribute functions so we can maintain pretty-print capabilities as the attributes are extended.	2014-11-19 09:37:59 -08:00
Ralph Castain	37593b232d	Add a marker for the max attr value being used by ORTE so that other, higher-levels can also use the attribute system	2014-11-19 09:37:59 -08:00
Howard Pritchard	34c156759e	fix some compiler warnings in ras/alps	2014-11-18 11:32:37 -07:00
Howard Pritchard	4df3447d96	fix compare_nodes bug in alps ras component There was an obvious bug in the alps/ras component compare_nodes method which resulted in the function always evaluating the nodes as being equivalent.	2014-11-18 11:15:02 -07:00
Howard Pritchard	ff362c16ce	add/update copyrights for alps odls component	2014-11-18 10:16:11 -07:00
Howard Pritchard	dc98b62070	add initial support for an alps odls component It turns out that the support for Open MPI apps on Cray was hanging on a thin thread of support when using the mpirun job launcher. It just happened that with a certain set of configuration options things would work. This is bound to backfire at some point. To fix this weakness, as well as to allow for mpirun launched jobs to benefit from many of the advanced placement features provided by the Cray Linux Environment (as opposed to the hwloc only default env of orte), a new odls alps component is introduced.	2014-11-17 14:00:09 -07:00
Ralph Castain	d9ceb5aea4	Fix C++ builds by removing no-longer-needed type declaration	2014-11-14 11:44:24 -08:00
Gilles Gouaillardet	f3b36fdf6e	orted/pmix: fix pmix_server_release when several jobids are running on the same node	2014-11-14 16:17:28 +09:00
Gilles Gouaillardet	84b21d726e	orte/util: add OPAL_{VPID,JOBID} types to orte_attr_{load,unload}	2014-11-14 15:55:25 +09:00
rhc54	1fdb6a62d3	Merge pull request #265 from miked-mellanox/topic/undeprecate_env_x ORTE: undeprecate -x var=val in mpirun Looks okay to me - thanks!	2014-11-12 08:46:09 -08:00
Mike Dubman	f83d6045aa	ORTE: undeprecate -x var=val in mpirun mpirun -x var=val is back, actually it is useful alias for -mca mca_base_env_list "var=val"	2014-11-12 10:51:15 +02:00
Ralph Castain	780c93ee57	Per the PR and discussion on today's telecon, extend the process name definition as a two-field struct of uint32_t's down to the OPAL layer. This resolves issues created by prior commits that impacted both heterogeneous and SPARC support. This also simplifies the OMPI code base by removing the need for frequent memcpy's when transitioning between the OMPI/ORTE layers and OPAL. We recognize that this means other users of OPAL will need to "wrap" the opal_process_name_t if they desire to abstract it in some fashion. This is regrettable, and we are looking at possible alternatives that might mitigate that requirement. Meantime, however, we have to put the needs of the OMPI community first, and are taking this step to restore hetero and SPARC support.	2014-11-11 17:00:42 -08:00
Ralph Castain	d0704ef118	Restore handling of physical processors in rankfiles. Note that the prior implementation was likely incorrect as it falsely assumed that physical core indices were unique, which isn't always true. Stipulate that physical rankfiles can only include PU numbers, and bind the result to the core that contains that physical PU. Update the mpirun man page to cover the new use-case.	2014-11-10 14:00:40 -08:00
Ralph Castain	2a90788724	Support physical processor ids in rankfile	2014-11-10 14:00:40 -08:00
Ralph Castain	8c837d3cb3	Doh - if we can't output an entire block, then we need to adjust the number of bytes remaining to be output or else we will output duplicate bytes when next we are able to write.	2014-11-07 13:13:13 -08:00
Ralph Castain	b56b744041	Silence some warnings and remove debug output	2014-11-07 07:54:01 -08:00
Elena	03fc809bc9	This commit contains new dstore component sm which is used for communication between pmix server and clients at the same node via shared memory.	2014-11-06 16:01:19 +02:00
Ralph Castain	738c3e1d72	Ensure that mpirun correctly selects the HNP ess component without attempting to init the PMI subsystem as mpirun won't be supported anyway, so let's avoid the error message. Also, daemons launched by the plm/slurm component must use the ess/slurm module as we cannot trust the Slurm PMI_Init functions to correctly tell us when PMI support is available.	2014-11-03 21:35:42 -08:00
Ralph Castain	6fbc68c830	Update the grpcomm direct component's priority so it sits at the bottom of the list, as it should now that the other components are active. Clean up the signature print function a touch to make it more readable. Remove the unneeded xcast functions in brks and rcd components as we will just fall thru to using the "direct" one	2014-11-03 14:43:17 -08:00
Gilles Gouaillardet	652ecdb888	oob/tcp: always include a missing header file improve open-mpi/ompi@c9d1e16a9e	2014-10-29 13:39:23 +09:00
Gilles Gouaillardet	eef7590e58	wrappers: add the $(EXEEXT) extension to the installed symbolic links	2014-10-28 16:42:51 +09:00
Gilles Gouaillardet	c9d1e16a9e	oob/tcp: include a missing header file warning can be seen under cygwin without the missing header file	2014-10-28 13:56:25 +09:00
Ralph Castain	64fae47d85	Ensure that the proxy "pull" of output gets directed to the requesting tool, which is no longer just the sender since the HNP may be making the request on behalf of someone else	2014-10-23 10:21:21 -07:00
Ralph Castain	526682e2f9	Add the ability for a tool that requests spawn of a job to also request forwarding of all output to the tool. The tool is responsible for its own call to push its stdin to the new job. The push request can come -after- the job is started, but the pull request has to be done during the spawn procedure or else output can be lost.	2014-10-23 08:16:49 -07:00
Ralph Castain	894acb0aa8	configury: new OPAL_SET_MCA_PREFIX/ORTE_SET_MCA_CMD_LINE_ID macros These two macros set the MCA prefix and MCA cmd line id, respectively. Specifically, MCA parameters will be named PREFIX<foo> in the environment, and the cmd line will use -ID foo bar. These macros must be called during configure.ac and a value supplied. In the case of Open MPI, the values given are PREFIX=OMPI_MCA_ and ID=mca. Other projects (such as ORCM) will call these macros with their own unique values. For example, ORCM uses PREFIX=ORCM_MCA_ and ID=omca This scheme is necessary to allow running Open MPI applications under systems that use their own versions of ORTE and OPAL. For example, when running OMPI applications under ORCM, we need the MCA params passed to the ORCM daemons to be separated from those recognized by the OMPI application.	2014-10-22 18:57:40 -07:00
Jeff Squyres	c22e1ae33b	configury: new OPAL_SET_LIB_PREFIX/ORTE_SET_LIB_PREFIX macros These two macros set the prefix for the OPAL and ORTE libraries, respectively. Specifically, the OPAL library will be named libPREFIXopen-pal.la and the ORTE library will be named libPREFIXopen-rte.la. These macros must be called, even if the prefix argument is empty. The intent is that Open MPI will call these macros with an empty prefix, but other projects (such as ORCM) will call these macros with a non-empty prefix. For example, ORCM libraries can be named liborcm-open-pal.la and liborcm-open-rte.la. This scheme is necessary to allow running Open MPI applications under systems that use their own versions of ORTE and OPAL. For example, when running MPI applications under ORTE, if the ORTE and OPAL libraries between OMPI and ORCM are not identical (which, because they are released at different times, are likely to be different), we need to ensure that the OMPI applications link against their ORTE and OPAL libraries, but the ORCM executables link against their ORTE and OPAL libraries.	2014-10-22 10:32:19 -07:00
Jeff Squyres	01fd96bfa5	Revert "Provide a mechanism by which an upstream project can rename the OPAL and ORTE libraries. This is required by projects such as ORCM that have their own ORTE and OPAL libraries in order to avoid library confusion. By renaming their version of the libraries, the OMPI applications can correctly dynamically load the correct one for their build." This reverts commit `63f619f871`.	2014-10-22 10:32:11 -07:00
Jeff Squyres	206eade32c	mpirun.1in: whitespace cleanup Whitespace cleanup only; no content changes.	2014-10-20 05:18:25 -07:00
Jeff Squyres	9529289319	mpirun.1in: more updates about binding/etc. Follow on to `91e9686` and `f9d620e`.	2014-10-20 05:17:49 -07:00
Ralph Castain	91e96861dd	Cleanup the orterun man page per review by Gus Correa	2014-10-19 10:21:50 -07:00
Ralph Castain	f9d620e3a7	Update the orterun man page	2014-10-16 21:05:04 -07:00
Ralph Castain	ecbae03009	Fix typo	2014-10-16 13:30:06 -07:00
Ralph Castain	b6aa691e0a	Fix incorrect implementation of new MCA param mca_base_env_list - it was not picking up envars and forwarding them, but only worked if you explicitly set a value for the envar. Ensure it works for both direct and indirect launch modes. Remove stale code as this replaced orte_forward_envars. Ensure it doesn't get passed to the ORTE daemons.	2014-10-16 12:58:56 -07:00
Gilles Gouaillardet	b5aea782ce	Revert "Fix heterogeneous support" Per the discussion at http://www.open-mpi.org/community/lists/devel/2014/10/16050.php This reverts commit `c9c5d4011b`.	2014-10-16 12:24:38 +09:00
Gilles Gouaillardet	c9c5d4011b	Fix heterogeneous support * redefine orte_process_name_t so it can be converted between host and network format as an opal_identifier_t aka uint64_t by the OPAL layer. * correctly send OPAL_DSTORE_ARCH key	2014-10-15 17:19:13 +09:00
Ralph Castain	1ae34da5e5	Add an attributes parameter to the dstore.open function so we can pass directives to the active storage component. This can, for example, include the backing file info for a new shared memory segment.	2014-10-10 12:13:25 -07:00
Ralph Castain	63f619f871	Provide a mechanism by which an upstream project can rename the OPAL and ORTE libraries. This is required by projects such as ORCM that have their own ORTE and OPAL libraries in order to avoid library confusion. By renaming their version of the libraries, the OMPI applications can correctly dynamically load the correct one for their build.	2014-10-10 11:39:08 -07:00
Ralph Castain	1be1654e5f	Correctly identify the synonym for orte_direct_modex_cutoff as ompi_hostname_cutoff	2014-10-10 06:05:06 -07:00
Ralph Castain	4fc4a8346b	Fix a couple of minor issues. Ensure usock isn't used if the session dirs aren't setup. Protect an oddball case where orte_xml_fp is NULL.	2014-10-09 20:58:46 -07:00
Elena	b937b31693	fix for multiple spawn test	2014-10-09 06:18:16 +02:00
Elena	3d65799236	pmix: fixed ugly bug which caused many strange hangs	2014-10-09 06:17:03 +02:00
Elena	c905fe9b78	pmix: removed pmix_base_direct modex mca parameter, renamed orte_full_modex_cutoff and ompi_hostname_cutoff to direct_modex_cutoff	2014-10-09 06:15:31 +02:00
Elena	e319c95267	fixes for grpcomm rcd/brucks algorithms	2014-10-09 06:12:26 +02:00
Ralph Castain	fd6a044b7f	Cleanup some cruft resulting from the move of the btl's to opal. We had created the ability to delay modex operations, which included a need to delay retrieving hostname info for remote procs. This allowed us to not retrieve the modex info until first message unless required - the hostname is generally only required for debug and error messages. Properly setup the opal_process_info structure early in the initialization procedure. Define the local hostname right at the beginning of opal_init so all parts of opal can use it. Overlay that during orte_init as the user may choose to remove fqdn and strip prefixes during that time. Setup the job_session_dir and other such info immediately when it becomes available during orte_init.	2014-10-03 16:02:57 -06:00
Howard Pritchard	d2bb8d8829	remove alps ess component The alps ess component is obsolete. It relies on header files only present in very old CLE (Cray Linux) 3.X for the Cray XT series. As support for these systems is being dropped starting with release 1.9, this code is being removed.	2014-10-03 13:17:33 -06:00
Jeff Squyres	413e775dbf	version configury: make dist now works Update the VERSION file scheme: * Remove "want_repo_rev". * Add "tarball_version". All values are now always included (major, minor, release, greek, repo_rev). However, configure.ac now runs "opal_get_version.sh ... --tarball", which will return the value of tarball_version (if it is non-empty) or the "full" version string (i.e., "major.minor.releasegreek").	2014-10-02 11:32:54 -07:00
hpp	8ded59ce0f	fix alps plm to allow explicit host placement It turns out that the alps plm code was developed only on cray systems that were running batch schedulers. However, for bring up and development systems, it's not at all uncommon for there to be no batch scheduler, and thus to orte it appears that orte_num_allocated_nodes is always zero. This forces a user using mpirun on such a system to always specify a host list: mpirun -n 4 -N 1 -host 32,45,68 .... just to get the job to run, but then since the -L argument for aprun is never built, the app always runs on the first batch of nodes that aprun finds available.	2014-10-02 10:42:01 -06:00
Jeff Squyres	72704441a2	URLs: update URLs for GitHub	2014-10-01 14:44:09 -07:00
Ralph Castain	84810b80fd	Cover the remaining code paths for Java apps to define class path Refs trac:4926 This commit was SVN r32823. The following Trac tickets were found above: Ticket 4926 --> https://svn.open-mpi.org/trac/ompi/ticket/4926	2014-09-30 22:27:03 +00:00
Ralph Castain	040a69c38b	Correct the classpath to correctly include the local directory so Java programs find the application class cmr=v1.8.4:reviewer=jsquyres This commit was SVN r32817.	2014-09-30 16:35:12 +00:00
Ralph Castain	4320457394	Fix the debug output - you can't print the cpuset pointer using the %p format without generating warnings This commit was SVN r32811.	2014-09-29 17:10:38 +00:00
Howard Pritchard	f8ac8bb6b0	remove improper use of hwloc_bitmap_free When using the native aprun launcher, it was observed that there were frequent memory corruption errors occurring either during a PMI kvs-fence operation, or at mpi termination during opal cleanup of allocated objects. This was especially bad when using aprun --c none. In some cases, the application would even just hang in finalize if using ptmalloc, owing to some kind of infinite loop in cleanup of small blocks, etc. It turns out that the problem was in orte_ess_base_proc_binding's improper use of opal_hwloc_base_get_available_cpus. The cpuset (bitmap) returned from that function is not meant to be freed by the caller. This problem is likely never observed when using the mpirun launcher as there's an early exit if the OMPI_MCA_orte_bound_at_launch environment variable is set. This commit was SVN r32809.	2014-09-29 16:10:37 +00:00
Gilles Gouaillardet	9661e4537f	oob/tcp: fix a race condition Mimic the btl/tcp protocol to solve the race condition that happens when two peers try to connect to each other at the same time cmr=v1.8.4:reviewer=rhc This commit was SVN r32799.	2014-09-26 06:54:30 +00:00
Ralph Castain	17846411c3	Now that we have an ORTE thread running in apps, we can't just call "exit" during RTE abort as that is happening in a thread, and (at least in some environments) doesn't result in the main thread being immediately terminated. Instead, we wind up going thru orte_finalize in the main thread, which isn't what we want. So replace the call to "exit" with the "quick exit" variant "_exit", which causes the entire process to exit immediately. (custom patch has been posted for 1.8.3) This commit was SVN r32780.	2014-09-23 22:51:10 +00:00
Howard Pritchard	1508a01325	Fixes to enable mpirun to work again on Cray The ess pmi module was not handling aprun launched daemons. All daemons were thinking they were vpid 1. Also, turns out that on cray systems using MOM nodes for launched jobs, just detecting whether or not a process is in a PAGG container is not sufficient. Crank up the priority of the alps PLM component in the event that the configure detected the presence of both slurm and alps. Have the ESS pmi component open the pmix framework and select a pmix component. This commit was SVN r32773.	2014-09-23 15:37:26 +00:00
Gilles Gouaillardet	5fa2b6c59c	oob/tcp: fix a race condition Refs trac:4909 This commit was SVN r32754. The following Trac tickets were found above: Ticket 4909 --> https://svn.open-mpi.org/trac/ompi/ticket/4909	2014-09-18 08:17:25 +00:00
Ralph Castain	3a437cbdb3	Silence set-but-not-used warning when timing isn't enabled This commit was SVN r32749.	2014-09-17 00:40:10 +00:00
Ralph Castain	414f4e9783	Try to provide a real hostname for the remote host to aid in debugging Refs trac:4908 This commit was SVN r32748. The following Trac tickets were found above: Ticket 4908 --> https://svn.open-mpi.org/trac/ompi/ticket/4908	2014-09-17 00:39:49 +00:00
Jeff Squyres	9dc49c5f92	oob_tcp_connection: print "<unknown>" instead of "NULL" "NULL" doesn't mean anything to the user, and is somewhat confusing to see in an error message. "<unknown>" at least indicates that there's an error, and we know who the peer is. This commit was SVN r32747.	2014-09-16 22:47:57 +00:00
Ralph Castain	09aecea55a	Can't use show_help as the RML has already been enabled, but we haven't successfully connected back to the HNP. So use opal_output instead and hardwire the message. Refs trac:4908 This commit was SVN r32746. The following Trac tickets were found above: Ticket 4908 --> https://svn.open-mpi.org/trac/ompi/ticket/4908	2014-09-16 22:21:02 +00:00
Ralph Castain	4bbc9a28d6	Try to resolve the simultaneous connection problem by being a little more careful about the choice of returned status when a connection is refused. As before, have the higher vpid of the two peers retry the connection, while the lower one waits. This can happen in a couple of places, so try to hit them all. Since this is hard to test, will ask Gilles to give it a try since he's the one who is seeing it. cmr=v1.8.3:reviewer=rhc This commit was SVN r32744.	2014-09-16 18:59:36 +00:00
Ralph Castain	a74428513d	Provide a better help message when we are unable to complete a connection due to a firewall. cmr=v1.8.3:reviewer=jsquyres This commit was SVN r32743.	2014-09-16 16:28:29 +00:00
Ralph Castain	dfb952fa78	[Contribution from Artem - moved it to svn from git for him] Replace our old, clunky timing setup with a much nicer one that is only available if configured with --enable-timing. Add a tool for profiling clock differences between the nodes so you can get more precise timing measurements. I'll ask Artem to update the Github wiki with full instructions on how to use this setup. This commit was SVN r32738.	2014-09-15 18:00:46 +00:00
Jeff Squyres	e95ed94a94	plm_rsh_module.c: output to the framework output Trivial fix from r32686: don't output to stream 0, but rather to orte_plm_base_framework.framework_output (this is the way it was before r32686). In reality, this is going to end up being stream 0, anyway, but we might as well be pedantically correct... Refs trac:4897. This commit was SVN r32726. The following SVN revision numbers were found above: r32686 --> open-mpi/ompi@4df1aa63f7 The following Trac tickets were found above: Ticket 4897 --> https://svn.open-mpi.org/trac/ompi/ticket/4897	2014-09-13 00:46:35 +00:00
Ralph Castain	0445052a1c	Check for multiple declarations of a given MCA param and error out if detected as that can create an ambiguous definition of the param value. Refs trac:4897 This commit was SVN r32719. The following Trac tickets were found above: Ticket 4897 --> https://svn.open-mpi.org/trac/ompi/ticket/4897	2014-09-12 22:21:30 +00:00
Ralph Castain	9e7e90265f	Temporarily make the direct grpcomm component the default until we can debug the other modules This commit was SVN r32707.	2014-09-11 14:47:54 +00:00
Ralph Castain	4eb6291334	Avoid conflicts when multiple collectives are underway in ORTE by giving each grpcomm component its own RML tag and posting persistent receives. We use the signature anyway to determine which collective the received message is addressing, so there is no need to post non-persistent receives. This commit was SVN r32703.	2014-09-10 17:36:16 +00:00
Ralph Castain	ea11e63f59	Per patch from Tetsuya, allow the user to bind-to none when specifying multiple pe's/rank as requested by Reuti. This allows the user to reserve multiple "slots" in the allocation for each process while mapping, but not to bind the process to specific processing elements on the node. Reviewed by rhc, so RM-approved to go across to v1.8.3 cmr=v1.8.3:reviewer=ompi-gk1.8 This commit was SVN r32701.	2014-09-10 15:52:18 +00:00
Ralph Castain	e671620ac7	Per request from Jeff: tune up the help messages for binding options Refs trac:4898 This commit was SVN r32691. The following Trac tickets were found above: Ticket 4898 --> https://svn.open-mpi.org/trac/ompi/ticket/4898	2014-09-09 22:39:22 +00:00
Gilles Gouaillardet	63209eac5b	orte/util: use ORTE_JOB_FAMILY and ORTE_LOCAL_JOBID macros This commit was SVN r32688.	2014-09-09 05:13:00 +00:00
Ralph Castain	4207b4c4ad	Improve the --bind-to help message to better indicate the default options under various values of np. Remove the warning message if the user doesn't specify a binding policy and we are overloaded cmr=v1.8.3:reviewer=jsquyres This commit was SVN r32687.	2014-09-08 21:03:51 +00:00
Ralph Castain	4df1aa63f7	Since we've run into the situation where someone puts a script wrapper around a launcher such as srun, we need to always protect MCA cmd line params with quotes. This means we also need to protect the backend from quotes coming into the system as part of a value, or else the parser gets confused. So add a new function for wrapping MCA arguments, and tell the backend parser to ignore/remove leading/trailing quotes. cmr=v1.8.3:reviewer=jsquyres This commit was SVN r32686.	2014-09-08 20:38:46 +00:00
Ralph Castain	6323b226c7	Bring over some updates from the PMIx branch - mostly just minor cleanups. Make the direct grpcomm component no longer be the default. For now, we seem to be having problems with non-blocking fence operations, so make them not be the default under any scenario (e.g., when sm is the only btl in operation). This commit was SVN r32673.	2014-09-06 19:19:44 +00:00
Ralph Castain	94ffca4901	Correct the cutoff point for full modex operation as it is based on the number of nodes in the system, not the number of procs in the signature. This commit was SVN r32666.	2014-09-03 17:28:12 +00:00
Ralph Castain	2bfb18e004	Resolve some race conditions when async pmix modex modes are invoked. Since calls to "get" data can come both locally and remotely before data for a given proc has actually been received, we have to track all requests that cannot be immediately fulfilled and provide the data once it has been received. This commit was SVN r32664.	2014-09-02 20:04:17 +00:00
Ralph Castain	4d186e6402	Properly protect the MCA parameters being registered by the OOB/TCP component when IPv6 is enabled cmr=v1.8.3:reviewer=jsquyres This commit was SVN r32662.	2014-09-02 14:53:00 +00:00
Ralph Castain	f2b26bde4c	Resolve a race condition that could cause us to hang during abnormal terminations due to multi-counting num_terminated This commit was SVN r32660.	2014-09-02 00:32:52 +00:00
Ralph Castain	e49ca05f11	Remove unused variable This commit was SVN r32651.	2014-08-31 03:11:50 +00:00
Ralph Castain	5cdbc00136	Re-enable the usock oob component. Ensure the TCP component promotes messages for other procs to the OOB base so that other components have a chance to send the relay. Seems to be passing MTT, so let's see how it works for others. This commit was SVN r32650.	2014-08-30 19:33:46 +00:00
Ralph Castain	a2085a5916	Fix the PSM transport key generator to match prior releases This commit was SVN r32649.	2014-08-30 00:48:25 +00:00
Ralph Castain	cb0739dfd4	Update the regex to resolve a bug This commit was SVN r32647.	2014-08-29 22:24:20 +00:00
Ralph Castain	8faabed2cd	Add some further initialization and protection for zero-byte messages This commit was SVN r32644.	2014-08-29 17:24:55 +00:00
Ralph Castain	2b225e3776	Cleanup a race condition regarding marking that waitpid_fired. We should always mark it as fired when we enter the wait_local_proc routine, and also mark it as no longer alive if iof_complete has also been found. If other places in the code also update those flags, there is no harm done. This commit was SVN r32643.	2014-08-29 17:03:31 +00:00
Ralph Castain	730e28349e	Some minor uninitialized variable cleanups This commit was SVN r32629.	2014-08-29 02:21:13 +00:00
Ralph Castain	fafdbeec0c	Cleanup and enable the new daemon collective modules for more scalable operations. Thanks to Nadezhda Kogteva (Mellanox) for doing them. This commit was SVN r32624.	2014-08-28 20:35:35 +00:00
Ralph Castain	731a878ff3	Add a bunch of debug to help track down the problem, and eventually find another place where comparison of signatures was incorrectly performed - use the dss compare operation to be consistent and safe This commit was SVN r32620.	2014-08-27 19:52:20 +00:00
Ralph Castain	5fb7c7d23b	Don't explicitly add the hostname to the data fetch when we already cached a remote blob This commit was SVN r32619.	2014-08-27 16:18:05 +00:00
Ralph Castain	3c24770bce	Protect debug printing on backend nodes This commit was SVN r32618.	2014-08-27 16:17:28 +00:00
Ralph Castain	b87b69e977	Ensure the nodes get added to the job map on the remote nodes, add some debug to grpcomm daemon array construction This commit was SVN r32617.	2014-08-27 16:16:46 +00:00
Ralph Castain	842aaf6167	Correctly end mapping oversubscribed nodes round-robin byslot cmr=v1.8.3:reviewer=rhc This commit was SVN r32616.	2014-08-27 16:15:18 +00:00
Gilles Gouaillardet	2679629a12	pmix: fix compilation when configured with --without-hwloc This commit was SVN r32604.	2014-08-26 08:31:05 +00:00
Ralph Castain	1221e8a96f	Compare the full signature - thanks to Gilles for identifying the problem This commit was SVN r32595.	2014-08-25 14:52:06 +00:00
Ralph Castain	5a13cdb739	Fix a race condition caused by a bad attribute flag that created an OR instead of an AND condition check This commit was SVN r32587.	2014-08-22 22:48:16 +00:00
Ralph Castain	039b7acfb5	Fix the quoting algorithm so only rsh command lines get quoted values cmr=v1.8.2:reviewer=jsquyres This commit was SVN r32586.	2014-08-22 22:47:38 +00:00
Ralph Castain	f00af81c1d	Little more cleanup under the abort cases cited by Gilles. All seem to be working now This commit was SVN r32585.	2014-08-22 19:57:57 +00:00
Ralph Castain	b1a7375192	Fix the "unreachable" message so it outputs the correct hostname for the remote proc. Cleanup some of the pmix stuff when running corner cases of errors This commit was SVN r32584.	2014-08-22 19:20:45 +00:00
Ralph Castain	6ff2a60829	Handle the non-blocking fence case correctly, and ensure we always at least pass back the hostname of the process whose info is being requested so that the ompi_proc_t can correctly initialize it when we are in a non-blocking fence with np < cutoff scenario This commit was SVN r32578.	2014-08-22 14:26:24 +00:00
Ralph Castain	8f1b9b463e	Fix shared memory operations - need to pass the local topology and cpusets of all local peers so we can properly compute relative locality for them. Also need to set default locality to "on node" in case where cpusets are not passed because procs are not bound. This commit was SVN r32577.	2014-08-22 05:17:51 +00:00
Ralph Castain	c6f78d6e54	The PMI ess component now gets used for more than direct launch, so only set standalone_operation flag if no daemon uri is available so we aggregate show_help messages This commit was SVN r32574.	2014-08-22 03:00:56 +00:00
Ralph Castain	aec5cd08bd	Per the PMIx RFC: WHAT: Merge the PMIx branch into the devel repo, creating a new OPAL “pmix” framework to abstract PMI support for all RTEs. Replace the ORTE daemon-level collectives with a new PMIx server and update the ORTE grpcomm framework to support server-to-server collectives WHY: We’ve had problems dealing with variations in PMI implementations, and need to extend the existing PMI definitions to meet exascale requirements. WHEN: Mon, Aug 25 WHERE: https://github.com/rhc54/ompi-svn-mirror.git Several community members have been working on a refactoring of the current PMI support within OMPI. Although the APIs are common, Slurm and Cray implement a different range of capabilities, and package them differently. For example, Cray provides an integrated PMI-1/2 library, while Slurm separates the two and requires the user to specify the one to be used at runtime. In addition, several bugs in the Slurm implementations have caused problems requiring extra coding. All this has led to a slew of #if’s in the PMI code and bugs when the corner-case logic for one implementation accidentally traps the other. Extending this support to other implementations would have increased this complexity to an unacceptable level. Accordingly, we have: * created a new OPAL “pmix” framework to abstract the PMI support, with separate components for Cray, Slurm PMI-1, and Slurm PMI-2 implementations. * Replaced the current ORTE grpcomm daemon-based collective operation with an integrated PMIx server, and updated the grpcomm APIs to provide more flexible, multi-algorithm support for collective operations. At this time, only the xcast and allgather operations are supported. * Replaced the current global collective id with a signature based on the names of the participating procs. This allows an unlimited number of collectives to be executed by any group of processes, subject to the requirement that only one collective can be active at a time for a unique combination of procs. Note that a proc can be involved in any number of simultaneous collectives - it is the specific combination of procs that is subject to the constraint * removed the prior OMPI/OPAL modex code * added new macros for executing modex send/recv to simplify use of the new APIs. The send macros allow the caller to specify whether or not the BTL supports async modex operations - if so, then the non-blocking “fence” operation is used, if the active PMIx component supports it. Otherwise, the default is a full blocking modex exchange as we currently perform. * retained the current flag that directs us to use a blocking fence operation, but only to retrieve data upon demand This commit was SVN r32570.	2014-08-21 18:56:47 +00:00
Ralph Castain	b4511913f6	Remove an unnecessary optimization that can cause more trouble than it's worth - just try all the addresses that are given to us. Refs trac:4870 This commit was SVN r32558. The following Trac tickets were found above: Ticket 4870 --> https://svn.open-mpi.org/trac/ompi/ticket/4870	2014-08-20 20:58:07 +00:00
Ralph Castain	fa28710d53	Track down the last piece of the connection problem. It appears that providing a netmask of 0 to opal_net_samenetwork results in everything looking like it is on the same network. Hence, we were not retaining any of the alternative addresses, so we had no other way to check them. Refs trac:4870 This commit was SVN r32556. The following Trac tickets were found above: Ticket 4870 --> https://svn.open-mpi.org/trac/ompi/ticket/4870	2014-08-20 16:55:36 +00:00
Ralph Castain	343038af7b	Frazzle-frump! Missed that we reset the peer state just before the new check. Refs trac:4870 This commit was SVN r32554. The following Trac tickets were found above: Ticket 4870 --> https://svn.open-mpi.org/trac/ompi/ticket/4870	2014-08-19 22:34:49 +00:00
Ralph Castain	0a91fdf85f	If an initial address fails to connect, record that fact and attempt the next address for that proc. If nothing succeeds, then declare failure. cmr=v1.8.2:reviewer=edgar This commit was SVN r32553.	2014-08-19 19:48:24 +00:00
Ralph Castain	024572cb6c	Sigh - I promised to remove these deprecation warnings back in June. My apologies to Dave Goodell and others who requested it. cmr=v1.8.2:reviewer=dgoodell:subject=remove deprecation warnings for pernode, npernode, and npersocket This commit was SVN r32552.	2014-08-19 19:40:20 +00:00
Jeff Squyres	1551339eba	rsh: revert part of r32517: keep the quoting As part of reviewing CMR #4860, I talked through r32517 with Ralph. In attempt to fix various rsh quoting problems, r32517 removed all the quoting from the main code path and then only added it back in at the end in some cases. This commit puts back the quoting parts that were removed in r32517 (r32517 fixed 2 other important bugs: a) change "--<foo>" to "--mca <foo_equivalent> 1" so that de-duplication works, and b) change a != to ==). refs trac:4860 This commit was SVN r32524. The following SVN revision numbers were found above: r32517 --> open-mpi/ompi@7342bce58f The following Trac tickets were found above: Ticket 4860 --> https://svn.open-mpi.org/trac/ompi/ticket/4860	2014-08-13 19:27:10 +00:00
Gilles Gouaillardet	f96d382d1d	Fix typo. Thanks to Christopher Samuel for reporting it This commit was SVN r32520.	2014-08-13 05:54:59 +00:00
Ralph Castain	7342bce58f	Cleanup the over-aggressive quoting of params on the orted cmd line. Remove duplicates caused by passing on both cmd line shortcuts and the mca param version of the same thing. Fixes trac:4857 cmr=v1.8.2:reviewer=jsquyres This commit was SVN r32517. The following Trac tickets were found above: Ticket 4857 --> https://svn.open-mpi.org/trac/ompi/ticket/4857	2014-08-13 03:51:04 +00:00
George Bosilca	de7191132d	Remove few warnings. This commit was SVN r32506.	2014-08-11 13:34:44 +00:00
Gilles Gouaillardet	a873f45a90	Fix r32460 race condition resolution when procs call MPI_Abort. Do not invoke orte_session_dir_finalize(...) so orte_ess_base_app_abort(...) can successfully create <orte_process_info.proc_session_dir>/aborted cmr=v1.8.2:reviewer=rhc This commit was SVN r32498. The following SVN revision numbers were found above: r32460 --> open-mpi/ompi@abedb97be4	2014-08-11 05:50:32 +00:00
Gilles Gouaillardet	e184733ef6	check-help-strings cleanup This commit was SVN r32496.	2014-08-11 03:26:21 +00:00
Gilles Gouaillardet	f24699623f	check-help-strings cleanup This commit was SVN r32495.	2014-08-11 03:25:22 +00:00
Gilles Gouaillardet	c3c364a262	check-help-strings cleanup This commit was SVN r32494.	2014-08-11 03:22:05 +00:00
Gilles Gouaillardet	d9e0212e0e	check-help-strings cleanup This commit was SVN r32493.	2014-08-11 03:21:08 +00:00
Gilles Gouaillardet	d139f75db4	check-help-strings cleanup This commit was SVN r32492.	2014-08-11 03:20:37 +00:00
Ralph Castain	abedb97be4	Resolve race condition when procs call MPI_Abort. Since we go thru the errmgr instead of the normal proc termination routines, we need to ensure we mark that the proc has fired its waitpid and is no longer alive. Otherwise, the local daemon won't terminate because it thinks there is still a local proc alive and we hang. Thanks to Gilles for tracking it down. cmr=v1.8.2:reviewer=rhc This commit was SVN r32460.	2014-08-08 15:58:49 +00:00
Ralph Castain	8ea576c870	I have no idea how they did it, but someone managed to write a test that circled around and around and eventually reached this point with a NULL pointer. So protect against that possibility. This commit was SVN r32434.	2014-08-05 16:20:46 +00:00
Ralph Castain	42c5073aa3	Safely cleanup the opal_proc_t structure for non-MPI procs. This commit was SVN r32402.	2014-08-01 16:38:49 +00:00
Ralph Castain	7758528d72	Apparently, someone else is destructing the opal_proc_t, so don't destruct it ourselves This commit was SVN r32400.	2014-08-01 14:54:22 +00:00
Ralph Castain	d29a5ab69d	Okay, now handle the non-MPI apps This commit was SVN r32399.	2014-08-01 14:49:25 +00:00
Ralph Castain	daeb9b6c4f	Some more cleanups. Remove direct references to ORTE by changing OMPI_CAST_ORTE_NAME -> OMPI_CAST_RTE_NAME. Ensure that ORTE tools (mpirun, orted, tools) set the OPAL proc structure fields so OPAL knows what is going on and uses the correct print functions (still need to fix the problem for non-MPI apps). Properly return uint32_t from the opal utilities instead of int32_t as that is what the ORTE process name fields contain. Thanks to Gilles for pointing out some of the discrepancies. This commit was SVN r32398.	2014-08-01 14:44:11 +00:00
Ralph Castain	8cfadd1842	Per Paul Hargrove, add missing include to fix build under FreeBSD RM-approved cmr=v1.8.2:reviewer=ompi-gk1.8 This commit was SVN r32397.	2014-08-01 13:37:41 +00:00
George Bosilca	daa076995a	orte_rmaps_numa_node_t -> opal_rmaps_numa_node_t This commit was SVN r32380.	2014-07-31 19:58:47 +00:00
Ralph Castain	5db717f090	Some small leak cleanups cmr=v1.8.3:reviewer=artpol This commit was SVN r32358.	2014-07-30 15:46:02 +00:00
Ralph Castain	98b5a86a58	Now that we are using the radix routed module, teach it how to behave nicely with singletons This commit was SVN r32312.	2014-07-24 22:46:17 +00:00
Ralph Castain	0cad281a92	Single-word cmd line values for orted are dealt with in orte_plm_base_orted_append_basic_args, so protect against special characters there. Have the rsh module only deal with multi-word arguments as those were skipped by orte_plm_base_orted_append_basic_args. Refs trac:4802 This commit was SVN r32293. The following Trac tickets were found above: Ticket 4802 --> https://svn.open-mpi.org/trac/ompi/ticket/4802	2014-07-23 17:06:51 +00:00
Jeff Squyres	4da3c85b54	fortran: revert Absoft-based fixes Revert r32246, r32254, and r32255 -- they were fixing side-effects of the real bug. Real fix coming after this one. This commit was SVN r32286. The following SVN revision numbers were found above: r32246 --> open-mpi/ompi@08d2a1a48d r32254 --> open-mpi/ompi@232d4dbb7b	2014-07-22 21:49:22 +00:00
Ralph Castain	a94a97bd50	Cleanup the passing of MCA params on the orted cmd line in ssh by ensuring that we quote all values since they could be multi-word and/or contain special characters. Thanks to Dirk Schubert for pointing it out. cmr=v1.8.2:reviewer=jsquyres This commit was SVN r32280.	2014-07-22 18:22:06 +00:00
Ralph Castain	2f579806ae	Make the radix routed component the default pending repair/completion of debruijn option This commit was SVN r32276.	2014-07-22 16:48:51 +00:00
Ralph Castain	6c5e592785	Revert r32222, r32210, and r32203 as they created a problem when daemon collectives did not involve app procs on every node. Instead, modify the ompi/mca/rte/orte/rte_orte.h to add a new function that allows apps to request new daemon collective ids for use in barrier and modex operations. This will only appear in ORTE-based installations, but it is only being used by a couple of researchers at the moment. Update the orte/test/mpi/coll_test.c test to show the revised example. This commit was SVN r32234. The following SVN revision numbers were found above: r32203 --> open-mpi/ompi@a523dba41d r32210 --> open-mpi/ompi@2ce11ed5c4 r32222 --> open-mpi/ompi@d55f16db50	2014-07-15 03:48:00 +00:00
Ralph Castain	1feaffbb15	Get the blasted singleton comm_spawn working again. There remain problems with the Slurm interaction in this use-case as the PMI components (if configured to build) try to run even when a Slurm allocation hasn't been made, but I leave that to someone else to resolve. I did, however, tell the Slurm ess to quit interfering with applications launched in this use-case by ORTE daemons, so things do work when inside a Slurm allocation. Also discovered that the rsh launcher is not picking up --enable-orterun-prefix-by-default when invoked during singleton comm_spawn, but I was unable to see why that was happening and ran out of time. cmr=v1.8.2:reviewer=rhc This commit was SVN r32229.	2014-07-13 14:47:22 +00:00
Ralph Castain	d55f16db50	Fix a hang in daemon collectives when run on multinode systems This commit was SVN r32222.	2014-07-12 00:43:12 +00:00
Ralph Castain	2ce11ed5c4	Fix one spot missed by recent commit This commit was SVN r32210.	2014-07-10 20:08:57 +00:00
Ralph Castain	a523dba41d	NOTE: this modifies the MPI-RTE interface. We have been getting several requests for new collectives that need to be inserted in various places of the MPI layer, all in support of either checkpoint/restart or various research efforts. Until now, this would require that the collective id's be generated at launch, which required modifications to ORTE and other places. We chose not to make collectives reusable as the race conditions associated with resetting collective counters are daunting. This commit extends the collective system to allow self-generation of collective id's that the daemons need to support, thereby allowing developers to request any number of collectives for their work. There is one restriction: RTE collectives must occur at the process level - i.e., we don't currently have a way of tagging the collective to a specific thread. From the comment in the code: In order to allow scalable generation of collective id's, they are formed as: (a) top 32-bits are the jobid of the procs involved in the collective. For collectives across multiple jobs (e.g., in a connect_accept), the daemon jobid will be used as the id will be issued by mpirun. This won't cause problems because daemons don't use the collective_id; (b) bottom 32-bits are a rolling counter that recycles when the max is hit. The daemon will cleanup each collective upon completion, so this means a job can never have more than 2^32 collectives going on at a time. If someone needs more than that - they've got a problem. Note that this means (for now) that RTE-level collectives cannot be done by individual threads - they must be done at the overall process level. This is required as there is no guaranteed ordering for the collective id's, and all the participants must agree on the id of the collective they are executing. So if thread A on one process asks for a collective id before thread B does, but B asks before A on another process, the collectives will be mixed and not result in the expected behavior. We may find a way to relax this requirement in the future by adding a thread context id to the jobid field (maybe taking the lower 16-bits of that field). This commit includes a test program (orte/test/mpi/coll_test.c) that cycles 100 times across barrier and modex collectives. This commit was SVN r32203.	2014-07-10 18:53:12 +00:00
Ralph Castain	8c85ca350e	Remove debug This commit was SVN r32200.	2014-07-10 18:28:24 +00:00
Jeff Squyres	6cc538ae16	help-orterun.txt: wrap long messages, clarify new messages Clarify the new -x/mca_base_env_list help messages. This commit was SVN r32199.	2014-07-10 17:24:52 +00:00
Ralph Castain	796f57f709	Protect against problems if someone passes us thru a pipe and then abnormally terminates the pipe early This commit was SVN r32189.	2014-07-09 22:41:53 +00:00
Ralph Castain	7beb8f6799	Silence used-before-set var warning This commit was SVN r32188.	2014-07-09 22:37:47 +00:00
Joshua Ladd	801e2cb544	Fix error and warning messages after reverting the mca_base_env_list to being semicolon delimited. This commit was SVN r32179.	2014-07-09 14:46:19 +00:00
Joshua Ladd	30da6d3a17	Opal: add a new MCA parameter that allows the user to specify a list of environment variables. This parameter will become the standard mechanism by which environment variables are set for OMPI applications replacing the -x option. mpirun ... -x env_foo1=val1 -x env_foo2 -x env_foo3=val3 should now be expressed as mpirun ... -mca mca_base_env_list env_foo1=val1+env_foo2+env_foo3=val3. The motivation for doing this is so that a list of environment variables may be set via standard MCA mechanisms such as mca parameter files, amca lists, etc. This feature was developed by Elena Shipunova and was reviewed by Josh Ladd. This commit was SVN r32163.	2014-07-09 00:38:25 +00:00
Ralph Castain	5bb5b22573	When a user asks for cpus/rank > 1 and only has one slot, we need to ensure we always map at least one process when they don't tell us -np cmr=v1.8.2:reviewer=rhc:subject=correct num_procs in corner case This commit was SVN r32142.	2014-07-04 17:00:35 +00:00
Ralph Castain	9f97c74ba3	Silence warning This commit was SVN r32136.	2014-07-03 17:29:04 +00:00
Ralph Castain	356e7ea904	Move all collective id's into the attributes and let the job pack/unpack take care of them instead of singling them out. Add the envars just prior to forking the children instead of into the launch message itself. Remove a few #if CR as the attributes functionality can handle this condition now. This commit was SVN r32133.	2014-07-03 15:58:13 +00:00
Ralph Castain	0a4639308e	Remove a potential race condition - we'll cleanup the local children when we are all done This commit was SVN r32132.	2014-07-03 14:13:43 +00:00
George Bosilca	2883adcdf3	Remove useless variables. This commit was SVN r32123.	2014-07-03 00:30:54 +00:00
Ralph Castain	149810f02c	Per request from Jeff, slightly modify the show_help message as the precise name of the NUMA-containing packages differs based on OS and distro cmr=v1.8.2:reviewer=jsquyres:subject=modify show_help message This commit was SVN r32122.	2014-07-02 14:46:00 +00:00
Ralph Castain	e9d69ca370	Remove stale test This commit was SVN r32104.	2014-06-29 16:37:19 +00:00
Adrian Reber	47b118c0ae	fix FT compilation This commit was SVN r32094.	2014-06-26 03:40:07 +00:00
Adrian Reber	cabf1d4e68	use the orte attributes in the FT code to fix compile errors This commit was SVN r32093.	2014-06-26 03:19:17 +00:00
Adrian Reber	10c1a50705	"handle" removal of opal_db.remove() in the FT code This commit was SVN r32092.	2014-06-26 03:11:37 +00:00
Ralph Castain	f3cb124e50	Revert r32082 and r32070 - the developer's conference has decided to go a different direction on the threaded progress effort. This will involve some degree of prototyping to understand the tradeoffs prior to making a final design decision, and so we'll hold off on the final change until that is completed. This commit was SVN r32089. The following SVN revision numbers were found above: r32070 --> open-mpi/ompi@12d92d0c22 r32082 --> open-mpi/ompi@aa6438ef7a	2014-06-25 20:43:28 +00:00
Adrian Reber	9f73e79d91	also change the callback function prototype (to get the FT code to compile again) This commit was SVN r32088.	2014-06-25 20:37:02 +00:00
Adrian Reber	4aca7095dc	fix a syntax error in the FT code This commit was SVN r32087.	2014-06-25 20:35:50 +00:00
Adrian Reber	4b25e92194	get the FT code to compile again by adding/removing #includes This commit was SVN r32086.	2014-06-25 18:42:17 +00:00
Ralph Castain	8fca77c3d3	Protect the binding policy setting so it builds when --without-hwloc Refs trac:4742 This commit was SVN r32085. The following Trac tickets were found above: Ticket 4742 --> https://svn.open-mpi.org/trac/ompi/ticket/4742	2014-06-25 18:13:54 +00:00
Adrian Reber	72f1c7941f	use a consistent naming scheme for the SNAPSHOT attributes This commit was SVN r32083.	2014-06-25 15:26:24 +00:00
Nathan Hjelm	563eaf0726	Fix support for Cray alps The alps ras and plm components were broken by recent changes in ORTE. This commit resolves those issues. Changes: - Define PMI2_SUCCESS if it isn't defined. This fixes a problem with Cray's PMI implementation which does not define (for some reason) PMI2_SUCCESS. We had previously just used PMI_SUCCESS. - Add missing definition and fix a typo in pml_alps_module. - launch_id is no longer available in the orte_node_t structure. Use the attribute lookup to get the value. - Do not use an O(n^2) sorting algorithm when putting alps nodes in order. Use opal_list_sort instead (O(nlogn)). This commit was SVN r32076.	2014-06-24 21:29:04 +00:00
Ralph Castain	5f6be06b54	Per request from Gilles and discussion at devel conference, have the --oversubscribe option automatically set both oversubscribe and overload-allowed properties as this is likely what the user intended. cmr=v1.8.2:reviewer=rhc:subject=automatically set oversub/load This commit was SVN r32072.	2014-06-24 18:11:39 +00:00
Ralph Castain	12d92d0c22	Per the OMPI developer conference, remove the last vestiges of OMPI_USE_PROGRESS_THREADS This commit was SVN r32070.	2014-06-24 17:05:11 +00:00
Ralph Castain	34e5573988	Resolve the MTT timeout problem. This appears to have largely been caused by missing sigchld notifications, thus causing the daemons to believe that not all procs had exited. Let comm failure also serve as notification of process termination, and add appropriate flags/attributes to avoid multiple reporting of proc termination. This won't transition cleanly to the 1.8 series, and may represent too much change, so we'll have to (a) evaluate whether or not to bring it over (once it demonstrates that it does indeed solve the problem), and (b) develop a custom patch for that purpose. Refs trac:4717 This commit was SVN r32063. The following Trac tickets were found above: Ticket 4717 --> https://svn.open-mpi.org/trac/ompi/ticket/4717	2014-06-21 17:09:02 +00:00
Ralph Castain	9cfc408fd4	Little more debug - getting close to figuring this one out Refs trac:4717 This commit was SVN r32060. The following Trac tickets were found above: Ticket 4717 --> https://svn.open-mpi.org/trac/ompi/ticket/4717	2014-06-20 16:24:06 +00:00
Ralph Castain	f9da295682	Add some additional debug Refs trac:4717 This commit was SVN r32059. The following Trac tickets were found above: Ticket 4717 --> https://svn.open-mpi.org/trac/ompi/ticket/4717	2014-06-20 14:14:36 +00:00
Ralph Castain	645df5e823	Don't release the node_name field as it gets used in the slots parsing - will be released at newline detection This commit was SVN r32058.	2014-06-20 13:18:46 +00:00
Ralph Castain	9a47e45a09	<laugh> ensure we really compare the things we want to compare This commit was SVN r32055.	2014-06-19 20:54:25 +00:00
Ralph Castain	e65538e91b	Add some defensive programming, fix a typo This commit was SVN r32054.	2014-06-19 20:52:13 +00:00
Ralph Castain	b43f760f93	If you don't specify all the rank-file mapping for all procs, then you'll segfault - which is probably a bad idea. I can't see an easy workaround, so just error out for now and let's see if anyone really cares. cmr=v1.8.2:reviewer=jsquyres This commit was SVN r32053.	2014-06-19 20:30:06 +00:00
Ralph Castain	b618b36a2f	Fix potential issue if opal_hwloc_topology is NULL cmr=v1.8.2:reviewer=jsquyres This commit was SVN r32050.	2014-06-19 18:52:41 +00:00
Ralph Castain	61fe4daa33	Add some further debug Refs trac:4717 This commit was SVN r32047. The following Trac tickets were found above: Ticket 4717 --> https://svn.open-mpi.org/trac/ompi/ticket/4717	2014-06-19 15:59:51 +00:00
Ralph Castain	65275d6326	Add a little more info to the warning message - i.e., that the likely cause of the problem is missing libnumactl and/or libnumactl-devel cmr=v1.8.2:reviewer=miked:subject=improve memory binding failure message This commit was SVN r32030.	2014-06-18 19:20:28 +00:00
Ralph Castain	3f032d39e8	Mark the proc as alive so waitpid callback system doesn't immediately activate the callback Refs trac:4717 This commit was SVN r32026. The following Trac tickets were found above: Ticket 4717 --> https://svn.open-mpi.org/trac/ompi/ticket/4717	2014-06-18 14:04:55 +00:00
Ralph Castain	8e7c0257f0	Cleanup some missed updates to orte_wait_cb as params have changed Refs trac:4717 This commit was SVN r32025. The following Trac tickets were found above: Ticket 4717 --> https://svn.open-mpi.org/trac/ompi/ticket/4717	2014-06-17 23:40:31 +00:00
Ralph Castain	5dbf4a62c4	Cleanup: we were accidentally killing ourselves (bad idea) Refs trac:4717 This commit was SVN r32022. The following Trac tickets were found above: Ticket 4717 --> https://svn.open-mpi.org/trac/ompi/ticket/4717	2014-06-17 20:38:42 +00:00
Ralph Castain	5216bd5558	Multiple sigchld reports can occur within a single event callback, so have to reap them until none remain. Also, need to ensure the daemon is flagged as alive prior to calling wait_cb Refs trac:4717 This commit was SVN r32020. The following Trac tickets were found above: Ticket 4717 --> https://svn.open-mpi.org/trac/ompi/ticket/4717	2014-06-17 18:46:40 +00:00
Ralph Castain	42bf7466fc	This isn't as big a change as it appears - a change in one place caused a whole bunch of files to require updated #include's due to some arcane linkage. Rework the orte_wait code to reflect the introduction of the state machine. If we are in cleanup mode and just want to kill all our local children, then there is no reason to be polite about it as that introduces very long delays at scale. Just kill the procs and move on. Refs trac:4717 This commit was SVN r32019. The following Trac tickets were found above: Ticket 4717 --> https://svn.open-mpi.org/trac/ompi/ticket/4717	2014-06-17 17:57:51 +00:00
Ralph Castain	ab52f16100	Attempt to cleanup the race condition Rolf keeps encountering in MTT by adding some protection to ensure orted's try to terminate once their local procs die. Also, fix a problem whereby a failure to comm_spawn would result in a hang of the parent process. cmr=v1.8.2:reviewer=rhc:subject=cache termination cleanups This commit was SVN r32008.	2014-06-16 20:46:35 +00:00
Ralph Castain	561983ae52	Fix static builds by renaming conflicting type This commit was SVN r32006.	2014-06-14 17:39:28 +00:00
Ralph Castain	3f04d50cb0	Per the ticket, resolve our handling of overload conditions to provide a more consistent response. If we are overloaded (i.e., attempting to bind more processes to a location than the number of cpus under that location), then we consider the following conditions: (a) default binding policy is in effect. In this case, we will emit a warning and default to not binding unless the user provided the "oversubscribe" or "overload" modifier to the "bind-to" option. (b) user-specified binding policy is in effect. In this case, we will error out unless the user provided the "oversubscribe" or "overload" modifier to the "bind-to" option as we cannot meet the directive. Either "bind-to" modifier (oversubscribe or overload) will be accepted for now - in 1.9, we will deprecate the "overload" term in favor of "oversubscribe". Also added the ability to accept a --bind-to modifier without specifying the binding policy itself so a user can specify overload-allowed with the default policy. Closes trac:4345 cmr=v1.8.2:reviewer=rhc:subject=resolve handling of overload conditions This commit was SVN r32005. The following Trac tickets were found above: Ticket 4345 --> https://svn.open-mpi.org/trac/ompi/ticket/4345	2014-06-14 15:38:32 +00:00
Ralph Castain	56c3575c0e	Can't emit an error for an unrecognized mapping policy modifier as the ppr policy relies on not doing so. This commit was SVN r31998.	2014-06-13 20:10:09 +00:00
Ralph Castain	3ed282bf44	Per patch from Tetsuya, correct the cpus-per-proc logic so we correctly detect when the user is attempting to bind too low for that option Refs trac:4702 This commit was SVN r31988. The following Trac tickets were found above: Ticket 4702 --> https://svn.open-mpi.org/trac/ompi/ticket/4702	2014-06-13 16:32:52 +00:00
Ralph Castain	ba926d8635	The TCP component will have set the hash table entry to NULL, but that doesn't remove the key. So the hash_table retrieval function will return success, but with a NULL pointer - protect against that scenario Patch provided by Gilles - reviewed by rhc. RM-approved cmr=v1.8.2:reviewer=ompi-gk1.8 This commit was SVN r31971.	2014-06-09 17:46:22 +00:00
Ralph Castain	5d5ae41ea5	Cleanup a memory leak in the daemons - thanks to Artem for spotting it This commit was SVN r31970.	2014-06-09 17:14:02 +00:00
Ralph Castain	06dbfa3098	Make the cpus-per-proc equivalent a little more intuitive: * allow users to specify just a modifier for map-by instead of requiring that they also specify a policy. Thus, we now accept --map-by :pe=3 as indicating that we should use the default mapping policy, but bind 3 cpus/proc. * if users specify a pe's/proc but no policy, default to --map-by NUMA to ensure we have access to multiple cpus for the request. This won't guarantee we have access to enough to meet the request, but gives us a chance. In addition, we know that binding a proc to multiple cpus will work best if those cpus are all in the same NUMA, so this provides some degree of optimized behavior. Per a request from Jeff, define "oversubscribe" for binding as a synonym for the "overload" modifier. cmr=v1.8.2:reviewer=rhc This commit was SVN r31967.	2014-06-08 20:26:59 +00:00
Ralph Castain	8db76e9c6f	Ensure that we change to the session dir if we preload binaries so we'll use the loaded one Special patch created for v1.8 and CMR filed This commit was SVN r31963.	2014-06-06 21:43:23 +00:00
Ralph Castain	b7c08582ba	Add new tag to avoid conflicts This commit was SVN r31960.	2014-06-06 17:23:35 +00:00
Ralph Castain	638c24f655	Correct the bind-in-place algorithm to better handle comm_spawn. If the location identified by the mapper is already occupied by procs from another job, then we need to shift either right or left until we find an unoccupied location where we can be bound. If nothing is available, then check for the overload flag (and bind us in the original location if provided), or see if this was the default binding policy instead of one specified by the user - if so, then just don't bind this process. cmr=v1.8.2:reviewer=rhc This commit was SVN r31959.	2014-06-06 12:36:14 +00:00
Gilles Gouaillardet	d26ac02b4a	#if OPAL_HAVE_HWLOC protect access to orte_proc_info_t.cpuset Fix a bug when trunk is configured with --without-hwloc v1.8 is safe so no cmr This commit was SVN r31957.	2014-06-06 07:25:39 +00:00
Ralph Castain	e21bfeadcd	Now that the BTLs are moving down to OPAL and becoming available to ORTE, there no longer is a need/desire to push performance in the OOB/TCP component. So we don't need multiple modules driving NICs in parallel, and can drop all the complicated distribution logic. Fall back to the simplified single module model, but retain the ability to run that module in its own progress thread if so directed. This should eliminate the connectivity issues that have been reported, and will make maintenance of this component much easier. cmr=v1.8.2:reviewer=jsquyres:subject=simplify the OOB/TCP component This commit was SVN r31956.	2014-06-06 02:24:17 +00:00
Ralph Castain	34cb137314	Add another attribute to the orte_proc_t area This commit was SVN r31953.	2014-06-05 14:48:19 +00:00
Ralph Castain	b2413a6b88	Cannot update the proc state prior to activating the state machine as some callback functions need to compare the prior proc state against the new one. cmr=v1.8.2:reviewer=jsquyres This commit was SVN r31949.	2014-06-04 03:40:08 +00:00
Ralph Castain	b771388fa7	We really need to send all the daemon info whenever the daemon job has changed as new daemons need a full nidmap cmr=v1.8.2:reviewer=jsquyres This commit was SVN r31948.	2014-06-04 03:38:54 +00:00
Ralph Castain	c5384d44d7	Protect against NULL result in get_attr This commit was SVN r31947.	2014-06-04 03:09:37 +00:00
Ralph Castain	8e768de317	Really should check our own node for oversubscription, not the HNP cmr=v1.8.2:reviewer=jsquyres This commit was SVN r31946.	2014-06-04 03:09:02 +00:00
Ralph Castain	5398158d5a	Move fclose inside bracket to protect against NULL fp This commit was SVN r31945.	2014-06-04 03:08:18 +00:00
Ralph Castain	f1978fba7c	Cleanup a set of typos on the orte_get_attribute call This commit was SVN r31942.	2014-06-03 20:36:38 +00:00
Ralph Castain	5668f085a3	Silence some useless warnings, and fix a missed update in the tm plm This commit was SVN r31930.	2014-06-02 17:57:56 +00:00
Ralph Castain	742c0d2284	Fix typo that would cause a segfault if orte_startup_timeout was set This commit was SVN r31929.	2014-06-02 15:59:18 +00:00
Ralph Castain	7df500ecf5	Break the loop caused by retrying to send a message to a hop that is unknown by the TCP oob component. We attempt to provide a way for other components to try, but need to mark that the TCP component is not able to reach that process so the OOB base will know to give up. This commit was SVN r31928.	2014-06-02 15:00:33 +00:00
Ralph Castain	03234f2a33	Revert r31926 and replace it with a more complete checking of availability and accessibility of the required freq control paths. This commit was SVN r31927. The following SVN revision numbers were found above: r31926 --> open-mpi/ompi@9779084352	2014-06-02 14:34:00 +00:00
Gilles Gouaillardet	9779084352	orte_rtc_base_select: skip a RTC module if it has a zero priority This commit was SVN r31926.	2014-06-02 07:16:41 +00:00
Ralph Castain	65a35d92ef	Cleanup compile issues - missing updates to some plm components and the slurm ras component This commit was SVN r31921.	2014-06-01 17:59:06 +00:00
Ralph Castain	2c3d07db24	Cleanup the test so it is MPI correct This commit was SVN r31919.	2014-06-01 17:57:36 +00:00
Ralph Castain	8736a1c138	Per RFC: http://www.open-mpi.org/community/lists/devel/2014/05/14822.php Revamp the ORTE global data structures to reduce memory footprint and add new features. Add ability to control/set cpu frequency, though this can only be done if the sys admin has setup the system to support it (or you run as root). This commit was SVN r31916.	2014-06-01 16:14:10 +00:00
Ralph Castain	cf2c7381d0	Replace the PML barrier with an RTE barrier for now until we can come up with a better solution for connectionless BTLs. Refs trac:4643 This commit was SVN r31915. The following Trac tickets were found above: Ticket 4643 --> https://svn.open-mpi.org/trac/ompi/ticket/4643	2014-06-01 16:08:56 +00:00
Ralph Castain	1107f9099e	Per the RFC issued here: http://www.open-mpi.org/community/lists/devel/2014/05/14827.php Refactor PMI support This commit was SVN r31907.	2014-06-01 04:28:17 +00:00
Nathan Hjelm	041b72b0cc	plm/alps: better workaround for the noisy cray pmi implementation This commit is a slightly better workaround to prevent messages of the form: [unset]:_pmi_alps_get_apid:alps_app_lli_put_request failed [unset]:_pmi_alps_get_appLayout:pmi_alps_get_apid returned with error: Bad file descriptor It works by completely disabling PMI in the application process when using mpirun. This should not be an issue for any apps. cmr=v1.8.2:reviewer=rhc This commit was SVN r31882.	2014-05-22 16:04:36 +00:00
Oscar Vega-Gisbert	83bdebbf81	Java bindings for OSHMEM. This commit was SVN r31810.	2014-05-18 21:48:09 +00:00
Nathan Hjelm	73bfecd650	More leak fixes. Two leaks are fixed in this commit: - Do not leak btl component list items. - Do not leak the nodename when decoding the pidmap. cmr=v1.8.2:reviewer=rhc This commit was SVN r31779.	2014-05-15 16:38:13 +00:00
Nathan Hjelm	59d09ad9de	orte: fix several small memory leaks grpcomm: fix memory leaks We were leaking the caddy object used to pass data to the callback function. This commit fixes these leaks. oob,rml: fix memory leaks This commit fixes several leaks: - Both the oob/base and oob/tcp were leaking objects on their peer hash tables. Iterate on the hash tables and free any objects. - Leaked sent messages because of missing OBJ_RELEASE. I placed the release in ORTE_RML_SEND_COMPLETE to catch all the possible paths. ess/base: close the state framework cmr=v1.8.2:reviewer=rhc This commit was SVN r31776.	2014-05-15 15:06:27 +00:00
Gilles Gouaillardet	5b9364fc12	Fix a memory leak in orte_register_params() mca_base_var_register (..., MCA_BASE_VAR_TYPE_STRING, ...) will dup() the orte_set_slots string, so there is no need to do this in the first place. cmr=v1.8.2:reviewer=rhc This commit was SVN r31773.	2014-05-15 10:31:19 +00:00
Gilles Gouaillardet	5f82c391a6	Fix memory leaks in orte/util/nidmap.c This patch fixes four memory leaks in orte/util/nidmap.c : - hwloc_get_root_obj(opal_hwloc_topology)->userdata was never freed - even if bo->bytes is freed in the decode, bo was not freed - a job list is populated but never used nor freed cmr=v1.8.2:reviewer=rhc This commit was SVN r31770.	2014-05-15 08:28:53 +00:00
Ralph Castain	ad0e8f841d	Just pick a module to handle the incoming connection if no direct interface is identified. Siegmar hit it because his IP/netmask is disjoint, but a router was able to make the connection. Refs trac:4627 This commit was SVN r31763. The following Trac tickets were found above: Ticket 4627 --> https://svn.open-mpi.org/trac/ompi/ticket/4627	2014-05-14 19:23:02 +00:00
Ralph Castain	e605e73379	Close the incoming socket if we aren't going to accept it cmr=v1.8.2:reviewer=rhc This commit was SVN r31759.	2014-05-14 16:51:59 +00:00
Ralph Castain	3a1c2fff3e	Correct a misplaced bracket - daemons shouldn't be doing app-related operations This may need a patch for 1.8.2, but we can try to directly apply it cmr=v1.8.2:reviewer=hjelmn This commit was SVN r31754.	2014-05-14 15:23:30 +00:00
Nathan Hjelm	2a57e71a47	plm/alps: fix typo introduced in r31589 This commit was SVN r31747. The following SVN revision numbers were found above: r31589 --> open-mpi/ompi@445b552d3a	2014-05-13 22:36:54 +00:00
Ralph Castain	f55c587a74	Per patch from Tetsuya Mishima, ensure the rank_file mapper accurately tracks number of nodes in the map Refs trac:4594 This commit was SVN r31725. The following Trac tickets were found above: Ticket 4594 --> https://svn.open-mpi.org/trac/ompi/ticket/4594	2014-05-13 14:36:25 +00:00
Ralph Castain	5388347511	Per Jeff's suggestion, remove function that has duplicate functionality and just use one to check if session_dir directory should be removed. Refs trac:4584 This commit was SVN r31691. The following Trac tickets were found above: Ticket 4584 --> https://svn.open-mpi.org/trac/ompi/ticket/4584	2014-05-08 17:22:43 +00:00
Ralph Castain	aaae4841e9	Flush the show_help system on our way out - this also restores the opal_show_help function pointer to the OPAL layer for any subsequent processing. cmr=v1.8.2:reviewer=jsquyres This commit was SVN r31685.	2014-05-08 14:37:47 +00:00
Ralph Castain	5602156a1c	Use the correct abstraction layer name for the data dirs This commit was SVN r31684.	2014-05-08 14:32:24 +00:00
Ralph Castain	11faab1091	The final step of the RFC: convert the <foo>libdir and friends to fit their respective code areas, and equate them all at the top. Note that we can't entirely separate things as the opal_install_dirs framework can't handle separated locations for the various trees. This commit was SVN r31679.	2014-05-08 02:01:35 +00:00
Ralph Castain	a8e2d6c3a6	The bulk of the remaining renaming changes, in one final glorious "blob". Thanks to Jeff for some help chasing down a few spots. Per chat with Jeff, we decided to cleanup a few things that were historical in nature: top_ompi_srcdir -> OMPI_TOP_SRCDIR top_ompi_builddir -> OMPI_TOP_BUILDDIR We also split the srcdir/builddir flags according to their local tree (e.g., OPAL_TOP_SRCDIR), and tied them all together in configure.ac. Renamed ompi_ignore and ompi_unignore to be opal_<foo> as these are agnostic markers. Only thing left is ompilibdir being treated similar to what we did for srcdir/builddir. Coming soon. This commit was SVN r31678.	2014-05-07 21:48:53 +00:00

5073 commits