openmpi

Author	SHA1	Message	Date
Ralph Castain	fcec24b2a4	Minor cleanups to handle comm_spawn and singletons	2015-01-27 09:29:42 -06:00
Ralph Castain	74385302c0	Add the personality to the orte_job_t datatype support	2015-01-27 09:29:42 -06:00
Ralph Castain	88c38f87d2	Get the orteds to use schizo as well	2015-01-27 09:29:42 -06:00
Ralph Castain	028b00154d	Complete implementation of the schizo framework to support OMPI component	2015-01-27 09:29:42 -06:00
Ralph Castain	11c92eefe6	ckpt	2015-01-27 09:29:42 -06:00
rhc54	a1707326bf	Merge pull request #359 from hppritcha/topic/better_help orte/util: minor improvement to show_help	2015-01-25 08:13:49 -08:00
Howard Pritchard	1e94d84ae6	orte/util: minor improvement to show_help Make sure show_help makes a good attempt to print an error message locally if the send_buffer_nb method returns an error.	2015-01-23 13:54:03 -08:00
Howard Pritchard	2809c21e0f	rml/oob: check peer param in send methods The rml/oob was not doing sanity checks on the input peer parameter for orte_rml_oob_send_nb and orte_rml_oob_send_buffer_nb. Because there are places in the ompi/orte stack where things like orte_show_help_norender are called well before ORTE_PROC_MY_HNP is set up properly, all kinds of weird startup failures can occur as the rml/oob tries to process send requests where the peer is junk. Rather than try to expand this kind of thing: /* if we are the HNP, or the RML has not yet been setup, * or ROUTED has not been setup, * or we weren't given an HNP, or we are running in standalone * mode, then all we can do is process this locally */ if (ORTE_PROC_IS_HNP || orte_standalone_operation || NULL == orte_rml.send_buffer_nb || NULL == orte_routed.get_route || NULL == orte_process_info.my_hnp_uri) { rc = show_help(filename, topic, output, ORTE_PROC_MY_NAME); } do the right thing at the rml level and return an error rather than eventually failing in the send owing to the peer not being valid.	2015-01-22 06:12:39 -08:00
Howard Pritchard	06d3b57c07	Merge pull request #351 from hppritcha/topic/alps_odls_spawn_bug odls/alps: check if PMI gni rdma creds already set	2015-01-19 11:48:24 -07:00
Howard Pritchard	fd807aee69	odls/alps: check if PMI gni rdma creds already set Need to check if the alps odls component has already read the rdma creds from alps. It's okay to ask apshepherd multiple times for rdma creds, but opal_setenv gets a bit picky about this. Rather than check for the OPAL_EXISTS return value from opal_setenv, for now just check with a static variable whether or not orte_odls_alps_get_rdma_creds has already been successfully called before. Would be nice to have an opal_getenv function for checking if an env. variable had already been set by opal_putenv.	2015-01-19 10:12:38 -08:00
Gilles Gouaillardet	661c35ca67	cleanup dead code caused by the removal of the --with-threads configure option	2015-01-16 19:13:59 +09:00
Ralph Castain	e7ff21b3aa	The opal_stop_progress_thread function releases the event base, so don't do it again	2015-01-15 10:48:40 -08:00
Ralph Castain	9ac39b63cc	Use the opal_progress_threads support for the ORTE progress thread in applications	2015-01-15 07:55:19 -08:00
Ralph Castain	d2938a144f	Use the proper interface index. Thanks to Mark Kettenis for spotting the problem and providing a patch	2015-01-12 05:31:02 -08:00
Howard Pritchard	f34dd5f5fd	plm/alps: update copyright	2015-01-07 12:33:38 -07:00
Howard Pritchard	c454d11b01	plm/alps: fix orted abort hang problem Turns out the alps plm component wasn't changing the state of the job upon terminating the orted's in the case of an abnormal termination. This caused mpirun to hang with a zombie'd aprun process if an orted on a node in the job was killed via signal.	2015-01-07 12:31:41 -07:00
Howard Pritchard	f0f98f13b6	odls/base: fix an edge case with signals In the course of doing some testing with how orted's handle signaled child processes, found out that very often doing a kill -9 on a process on a node just results in the job hanging. The problem was that the orted odls/errmgr was not properly handling the exit_code being returned from waitpid. Now mark the proc state as ORTE_PROC_STATE_ABORTED_BY_SIG if the exit_code from waitpid indicates the process exited owing to a signal.	2015-01-06 15:42:38 -07:00
Nadezhda Kogteva	05af80b302	Fix commit `bffb2b7a4b` which broke pmix server functionality	2014-12-24 13:25:23 +02:00
Ralph Castain	43a40f8aac	LSF expresses its affinity file in hwthreads and expects those to be used as cpus, so set things accordingly	2014-12-19 12:06:05 -08:00
Ralph Castain	b314bfb5e9	If someone specifies the bitmap for hwthreads and wants hwthread cpus, then don't parse the slot list as it expects cores - just copy the provided bitmap across as it already has the required info	2014-12-19 10:56:14 -08:00
Jeff Squyres	7b43bdc984	plm base: move flag inside the #if in which it is used Avoid a compiler warning by declaring the tflag only inside the #if in which it is used (i.e., if hwloc support is built).	2014-12-18 10:56:23 -08:00
Ralph Castain	2581b41d08	Continue refactoring code by splitting the msg processing from the sendrecv code	2014-12-17 19:57:14 -08:00
Ralph Castain	f489e871c2	Take first step towards refactoring the PMIx server code by splitting out the proc_map function into its own file. Update ignore to include .DS_Store from the Mac	2014-12-17 19:08:52 -08:00
Artem Polyakov	01601f3284	Merge pull request #305 from artpol84/timing Timing framework improvement	2014-12-16 15:13:48 +06:00
Ralph Castain	573a574a3c	Remove an unused dstore type that was redundant with another one. Define a corresponding PMIX_NODE_ID type (contains the vpid of the daemon hosting the proc) and ensure that the PMIx server includes that info in its process map	2014-12-15 12:11:13 -08:00
Ralph Castain	a22cc45769	Close the pmix server sockets on exec	2014-12-13 20:30:21 -08:00
Ralph Castain	f4ff791335	Close oob/usock connections upon exec	2014-12-13 20:24:09 -08:00
Ralph Castain	6c4d5a51c4	Close tcp sockets upon exec	2014-12-13 20:23:53 -08:00
Ralph Castain	9658256a98	Restore the passing of the complete job map to the local proc on first get_attr so the info can be used by the MPI layer without continual calls back to the server. We'll find a more memory efficient method later.	2014-12-13 18:44:09 -08:00
Ralph Castain	bffb2b7a4b	Correct some issues with variables used before being set	2014-12-12 17:23:32 -08:00
Ralph Castain	0630680f36	Two cleanups required for transfer to 1.8.4: * Use %d format for the topo signature as some systems apparently have problems with %u * Use correct variable in show_help message	2014-12-12 17:23:32 -08:00
Rolf vandeVaart	f4aecdbfd2	Change logging function name from log to logfn. Fixes issue with PGI compile	2014-12-12 09:46:44 -05:00
Artem Polyakov	8ffad75a0a	Introduce timing interval measurement facility in timing framework	2014-12-10 16:47:49 +06:00
Ralph Castain	9d5135e6cd	Function definition should use the correct type	2014-12-09 01:04:31 -08:00
Ralph Castain	bb529ebd8e	Revise the way we handle hetero nodes as users are finding this (a) a significant surprise, and (b) confusing as to when it is required. So try to automate it a bit by creating a topology "signature" that mpirun can share on the cmd line with the remote daemons, thus allowing them to check to see if they match. This isn't comprehensive of course - for now, it only checks the number of each type of hwloc object on the node. This is good enough to pick up major differences (e.g., where we have different numbers of sockets or assigned core bindings). Retain the hetero-nodes flag for those cases where the user knows that there are differences and our automated system isn't good enough to see them. Will obviously require further refinement as we find out which variances it can detect, and which it cannot.	2014-12-08 15:38:14 -08:00
elenash	baf32fe480	Merge pull request #308 from elenash/master restored _process_name_print_for_opal function in orte_init: it's requir...	2014-12-08 19:14:36 +03:00
Ralph Castain	b757b3f452	Ensure that the #nodes in the job map gets properly updated when using the sequential mapper. Provide some further diagnostic info to help understand the problem when encountered.	2014-12-08 08:03:53 -08:00
Elena	6cf3925b09	restored _process_name_print_for_opal function in orte_init: it's required for opal output from daemons which never called ompi_init so didn't set opal_process_name_print pointer	2014-12-08 13:13:35 +02:00
Ralph Castain	d6d69e2b13	Get the direct routed component to work with both TCP and USOCK OOB components. We previously had setup the direct component so it would only support direct-launched applications. Thus, all routes went direct between processes. However, if the job had been launched by mpirun, this made no sense - what you wanted instead was to have each app proc talk directly to its daemon, but have the daemons all directly connect to each other. So we need all the routing code for dealing with cross-job communications, lifelines, etc. The HNP will be directly connected to all daemons as they must callback at startup, and so we need to track those children correctly so we know when it is okay to terminate. We still have to support direct launch, though, as this is the only component we can use in that scenario. So if the app doesn't have daemon URI info, then it must fall back to directly connecting to everything.	2014-12-07 09:11:48 -08:00
Ralph Castain	b1bf557024	Fix the hostfile parser so it correctly ignores binding directives that are just integers. Fix the create_dmns function so we don't hang if we can't get an error before creating the job map for an application.	2014-12-05 15:47:09 -08:00
Elena	af38a762a2	these changes fix direct routed component under mpirun; oob tcp and oob ud are working with direct routed component, but usock doesn't work with direct routed component yet.	2014-12-05 12:38:59 +02:00
Ralph Castain	c4fd6d1cde	Fix typo	2014-12-04 12:24:35 -08:00
Ralph Castain	c4002a8485	Further cleanups on the LSF integration - the affinity file is apparently always present, but simply empty if affinity wasn't set.	2014-12-04 12:24:35 -08:00
Ralph Castain	c88f181efe	Fix singleton comm-spawn, yet again. The new grpcomm collectives require a complete knowledge of every active proc in the system in case they participate in a collective. So ensure we pass the required job info when we spawn new daemons, and construct the necessary connections to allow grpcomm to operate.	2014-12-03 18:11:17 -08:00
Howard Pritchard	c67afadcfc	Merge pull request #289 from hppritcha/topic/remove_pmi Topic/remove pmi	2014-12-03 16:58:35 -07:00
Jeff Squyres	a3af7d6dbb	Revert "lsf configury: add dependent libraries for static linking" This reverts commit `56cfa90dda`.	2014-12-03 13:32:56 -08:00
Jeff Squyres	92c2ff91ec	Revert "Cleanup static build requirements by adding the wrapper flags back to the component configure.m4's. Minor cleanup of the lsf configure logic." This reverts commit open-mpi/ompi@32bf0e7b7e.	2014-12-03 13:15:20 -08:00
Ralph Castain	54c955c92d	Fix a race condition that only appears to be affecting certain setups. The pmix.finalize function closes the file descriptor to the server, which then triggers the errhandler callback. Since the errmgr is about to be unloaded, it might be getting hit.	2014-12-03 12:19:00 -08:00
Howard Pritchard	666344a081	orte/mca/common/alps: fix configure file Fix configure file for alps to actually check for alps being available. Also include stdio.h explicitly in common_alps.c	2014-12-03 09:44:18 -07:00
Howard Pritchard	ec38aa3732	orte/mca/common: add missing Makefile.am	2014-12-03 09:44:18 -07:00

4618 Commits