The rml/oob was not doing sanity checks on the input peer
parameter for orte_rml_oob_send_nb and orte_rml_oob_send_buffer_nb.
Because there are places in the ompi/orte stack where things like
orte_show_help_norender are called well before ORTE_PROC_MY_HNP is
set up properly, all kinds of weird startup failures can occur when
the rml/oob tries to process send requests whose peer is junk.
Rather than try to expand this kind of thing:
/* if we are the HNP, or the RML has not yet been setup,
 * or ROUTED has not been setup,
 * or we weren't given an HNP, or we are running in standalone
 * mode, then all we can do is process this locally
 */
if (ORTE_PROC_IS_HNP || orte_standalone_operation ||
    NULL == orte_rml.send_buffer_nb ||
    NULL == orte_routed.get_route ||
    NULL == orte_process_info.my_hnp_uri) {
    rc = show_help(filename, topic, output, ORTE_PROC_MY_NAME);
}
do the right thing at the rml level and return an error rather than
eventually failing in the send because the peer is not valid.
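
A minimal sketch of the kind of guard this adds at the top of the send
entry points (the helper name and the choice of ORTE_ERR_BAD_PARAM are
illustrative, not the literal patch):

#include <stdbool.h>

/* reject obviously-bogus peers before the send is queued */
static bool peer_looks_valid(const orte_process_name_t *peer)
{
    if (NULL == peer) {
        return false;
    }
    if (ORTE_JOBID_INVALID == peer->jobid ||
        ORTE_VPID_INVALID == peer->vpid) {
        return false;
    }
    return true;
}

/* at the top of orte_rml_oob_send_nb / orte_rml_oob_send_buffer_nb */
if (!peer_looks_valid(peer)) {
    return ORTE_ERR_BAD_PARAM;
}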
Need to check whether the alps odls component has already
read the rdma creds from alps. It's okay to ask apshepherd
multiple times for rdma creds, but opal_setenv gets
a bit picky about this. Rather than check for the OPAL_EXISTS
return value from opal_setenv, for now just use a static
variable to track whether orte_odls_alps_get_rdma_creds
has already been called successfully.
It would be nice to have an opal_getenv function for checking
whether an env. variable had already been set by opal_setenv.
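
A sketch of the static-variable guard described above (the helper
fetch_creds_from_apshepherd is hypothetical and stands in for the real
alps_app_lli/opal_setenv sequence):

#include <stdbool.h>

int orte_odls_alps_get_rdma_creds(void)
{
    static bool already_got_creds = false;
    int rc;

    if (already_got_creds) {
        /* the PMI env. variables were already set via opal_setenv;
         * asking apshepherd again would just make opal_setenv
         * complain with OPAL_EXISTS */
        return ORTE_SUCCESS;
    }

    rc = fetch_creds_from_apshepherd();  /* hypothetical helper */
    if (ORTE_SUCCESS == rc) {
        already_got_creds = true;
    }
    return rc;
}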
It turns out the alps plm component wasn't changing the state
of the job upon terminating the orteds in the case of
an abnormal termination. This caused mpirun to hang
with a zombied aprun process if an orted on a node
in the job was killed via signal.
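
A rough sketch of the shape of the fix (the callback signature and the
exact job state constants are assumptions, not the literal patch):

/* invoked when the aprun process that launched the orteds exits */
static void alps_daemons_exited(int exit_status, void *cbdata)
{
    orte_job_t *jdata = (orte_job_t *)cbdata;

    if (0 != exit_status) {
        /* abnormal termination: move the job to an error state so
         * mpirun tears down instead of waiting forever */
        ORTE_ACTIVATE_JOB_STATE(jdata, ORTE_JOB_STATE_ABORTED);
    } else {
        ORTE_ACTIVATE_JOB_STATE(jdata, ORTE_JOB_STATE_DAEMONS_TERMINATED);
    }
}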
In the course of testing how orteds handle signaled child
processes, we found that doing a kill -9 on a process on a node
very often just results in the job hanging. The problem was that the
orted odls/errmgr was not properly handling the exit_code
returned from waitpid. Now mark the proc state
as ORTE_PROC_STATE_ABORTED_BY_SIG if the exit_code
from waitpid indicates the process exited owing to
a signal.
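
A minimal sketch of the relevant check (the function name, surrounding
plumbing, and the 128+signal convention are illustrative; the state
macros are the ones named above):

#include <sys/wait.h>

static void update_proc_state(orte_proc_t *child, int status)
{
    if (WIFSIGNALED(status)) {
        /* e.g. kill -9: record a signal-based exit code and tell the
         * errmgr the proc was aborted by signal so the job is cleaned
         * up instead of hanging */
        child->exit_code = 128 + WTERMSIG(status);
        ORTE_ACTIVATE_PROC_STATE(&child->name,
                                 ORTE_PROC_STATE_ABORTED_BY_SIG);
        return;
    }
    /* normal exit path: record the plain exit status */
    child->exit_code = WEXITSTATUS(status);
}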
Retain the hetero-nodes flag for those cases where the user *knows* that there are differences and our automated system isn't good enough to see it.
This will obviously require further refinement as we find out which variances it can detect and which it cannot.
So we need all the routing code for dealing with cross-job communications, lifelines, etc. The HNP will be directly connected to all daemons as they must callback at startup, and so we need to track those children correctly so we know when it is okay to terminate.
We still have to support direct launch, though, as this is the only component we can use in that scenario. So if the app doesn't have daemon URI info, then it must fall back to directly connecting to everything.
Clean up orte_check_alps.m4. There was a bit of
unnecessary stuff for handling CLE 5, since it wasn't actually
doing the right thing, which would be to use pkg-config to
find dependencies both for dynamic and static linking.
Decouple the searching for alps libs, etc. from cray pmi.
Switch the alps ess and alps odls components' config files
to use the ALPS m4 macro.
alps configury fixes
Improve a check for detecting CLE release.
Improve an error message.
Add a call to orte_odls_alps_get_rdma_creds in the
local proc launch step to obtain the Cray RDMA
credentials from the apshepherd, and to set
the PMI env. variables expected by the uGNI BTL, etc.
Add an alps common lib to orte. Add a function
to determine whether or not a process is in a
PAGG container.
Note: we need a better naming convention for
common libs, since right now they use a "flat"
naming convention.
Note this alps ess component has nothing to do
with the old CNOS alps component used on
Cray Seastar/Portals3 (Cray XT) systems.
To work properly, changes need to be made to the
open method of the ess/pmi component to keep it
from selecting, and thus initializing, the opal/pmix/cray
component.
Be more selective about closing fd's for the alps odls
component. Don't close fd's of pipes set up by the
apshepherd for providing RDMA credentials, etc.
Add an entry to the help file in case
alps_app_lli_pipes returns an error.
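
A sketch of the selective close loop, assuming a hypothetical predicate
that knows which descriptors belong to the apshepherd pipes (max_fd is
the usual descriptor limit):

for (int fd = 3; fd < max_fd; fd++) {
    if (fd_is_apshepherd_pipe(fd)) {   /* hypothetical helper */
        /* keep the pipes apshepherd uses to hand us RDMA creds */
        continue;
    }
    close(fd);
}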
There was an obvious bug in the alps/ras component compare_nodes method
which resulted in the function always evaluating the nodes
as being equivalent.
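
For illustration only, a comparator of the right shape (the actual
signature and the fields compared by the component may differ); the
buggy version effectively answered "equivalent" for every pair:

#include <string.h>

static int compare_nodes(orte_node_t *n1, orte_node_t *n2)
{
    /* return 0 only when the node names really match */
    return strcmp(n1->name, n2->name);
}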
It turns out that support for Open MPI apps on
Cray was hanging by a thin thread when
using the mpirun job launcher. It just happened that
with a certain set of configuration options things would
work. This was bound to backfire at some point.
To fix this weakness, as well as to allow mpirun-launched
jobs to benefit from many of the advanced placement features
provided by the Cray Linux Environment (as opposed to the hwloc-only
default env of orte), a new odls alps component is introduced.
We recognize that this means other users of OPAL will need to "wrap" the opal_process_name_t if they desire to abstract it in some fashion. This is regrettable, and we are looking at possible alternatives that might mitigate that requirement. Meantime, however, we have to put the needs of the OMPI community first, and are taking this step to restore hetero and SPARC support.
These two macros set the MCA prefix and MCA cmd line id,
respectively. Specifically, MCA parameters will be named
PREFIX<foo> in the environment, and the cmd line will use
-ID foo bar.
These macros must be called during configure.ac and a value
supplied. In the case of Open MPI, the values given are
PREFIX=OMPI_MCA_ and ID=mca.
Other projects (such as ORCM) will call these macros with
their own unique values. For example, ORCM uses PREFIX=ORCM_MCA_
and ID=omca.
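
Concretely, with the Open MPI values a parameter named "foo" shows up
in the environment as OMPI_MCA_foo and on the command line as
"-mca foo bar"; with the ORCM values the same parameter becomes
ORCM_MCA_foo and "-omca foo bar".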
This scheme is necessary to allow running Open MPI applications under
systems that use their own versions of ORTE and OPAL. For example,
when running OMPI applications under ORCM, we need the MCA params passed
to the ORCM daemons to be separated from those recognized by the OMPI application.
These two macros set the prefix for the OPAL and ORTE libraries,
respectively. Specifically, the OPAL library will be named
libPREFIXopen-pal.la and the ORTE library will be named
libPREFIXopen-rte.la.
These macros must be called, even if the prefix argument is empty.
The intent is that Open MPI will call these macros with an empty
prefix, but other projects (such as ORCM) will call these macros with
a non-empty prefix. For example, ORCM libraries can be named
liborcm-open-pal.la and liborcm-open-rte.la.
This scheme is necessary to allow running Open MPI applications under
systems that use their own versions of ORTE and OPAL. For example,
when running MPI applications under ORCM, if the ORTE and OPAL
libraries between OMPI and ORCM are not identical (which, because they
are released at different times, is likely), we need
to ensure that the OMPI applications link against their ORTE and OPAL
libraries, but the ORCM executables link against their ORTE and OPAL
libraries.
the OPAL and ORTE libraries. This is required by projects such as ORCM
that have their own ORTE and OPAL libraries in order to avoid library
confusion. By renaming their version of the libraries, the OMPI
applications can dynamically load the correct one for their
build."
This reverts commit 63f619f871.
* redefine orte_process_name_t so it can be converted
between host and network format as an opal_identifier_t
aka uint64_t by the OPAL layer (see the sketch after this list).
* correctly send OPAL_DSTORE_ARCH key
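
A minimal sketch of the per-field conversion described in the first
bullet (the struct here is a stand-in for orte_process_name_t; the real
code lives in ORTE's name support):

#include <stdint.h>
#include <arpa/inet.h>

typedef struct {
    uint32_t jobid;
    uint32_t vpid;
} sketch_name_t;            /* stand-in for orte_process_name_t */

/* convert each 32-bit field so the pair can be handed to OPAL as an
 * opaque opal_identifier_t (uint64_t) in network byte order */
static void sketch_name_hton(sketch_name_t *n)
{
    n->jobid = htonl(n->jobid);
    n->vpid  = htonl(n->vpid);
}

static void sketch_name_ntoh(sketch_name_t *n)
{
    n->jobid = ntohl(n->jobid);
    n->vpid  = ntohl(n->vpid);
}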
Properly set up the opal_process_info structure early in the initialization procedure. Define the local hostname right at the beginning of opal_init so all parts of opal can use it. Overlay that during orte_init, as the user may choose to remove fqdn and strip prefixes during that time. Set up the job_session_dir and other such info immediately when it becomes available during orte_init.
The alps ess component is obsolete. It relies on header
files only present in very old CLE (Cray Linux) 3.X for
the Cray XT series. As support for these systems is being
dropped starting with release 1.9, this code is being removed.
Update the VERSION file scheme:
* Remove "want_repo_rev".
* Add "tarball_version".
All values are now always included (major, minor, release, greek,
repo_rev). However, configure.ac now runs "opal_get_version.sh
... --tarball", which will return the value of tarball_version (if it
is non-empty) or the "full" version string (i.e.,
"major.minor.releasegreek").
It turns out that the alps plm code was developed only
on Cray systems that were running batch schedulers.
However, for bring-up and development systems, it's not
at all uncommon for there to be no batch scheduler, and
thus to orte it appears that orte_num_allocated_nodes
is always zero. This forces a user using mpirun on such
a system to always specify a host list:
mpirun -n 4 -N 1 -host 32,45,68 ....
just to get the job to run, but then since the -L argument for aprun
is never built, the app always runs on the first batch of nodes that
aprun finds available.
When using the native aprun launcher, it was observed that
there were frequent memory corruption errors occurring either
during a PMI kvs-fence operation, or at MPI termination during
opal cleanup of allocated objects. This was especially bad
when using
aprun --c none
In some cases, the application would even just hang in finalize
if using ptmalloc, owing to some kind of infinite loop in
cleanup of small blocks, etc.
It turns out that the problem was in orte_ess_base_proc_binding's
improper use of opal_hwloc_base_get_available_cpus. The cpuset
(bitmap) returned from that function is not meant to be freed
by the caller.
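
A sketch of the corrected usage (the function's signature is recalled
from the tree and may differ slightly):

hwloc_cpuset_t avail;

avail = opal_hwloc_base_get_available_cpus(topo, obj);
/* ... use 'avail' to compute the binding ... */

/* do NOT hwloc_bitmap_free(avail): the bitmap is owned/cached by the
 * hwloc base, and freeing it here is what corrupted the heap */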
This problem is likely never observed when using the mpirun launcher
as there's an early exit if the OMPI_MCA_orte_bound_at_launch
environment variable is set.
This commit was SVN r32809.
Mimic the btl/tcp protocol to solve the race condition that happens
when two peers try to connect to each other at the same time
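
Roughly, the btl/tcp-style tie break looks like this (the comparison
helper usage and the peer/socket fields shown are assumptions):

if (OPAL_VALUE1_GREATER ==
    orte_util_compare_name_fields(ORTE_NS_CMP_ALL,
                                  ORTE_PROC_MY_NAME, &peer->name)) {
    /* we out-rank the peer: keep the connection we initiated and
     * drop the one the peer just opened to us */
    close(incoming_sd);
} else {
    /* the peer out-ranks us: abandon our attempt and adopt theirs */
    close(outgoing_sd);
    peer->sd = incoming_sd;
}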
cmr=v1.8.4:reviewer=rhc
This commit was SVN r32799.
during RTE abort as that is happening in a thread, and (at least in some
environments) doesn't result in the main thread being immediately
terminated. Instead, we wind up going thru orte_finalize in the main
thread, which isn't what we want.
So replace the call to "exit" with the "quick exit" variant "_exit", which
causes the entire process to exit immediately.
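
In sketch form (the wrapper function is illustrative; the point is the
call itself):

#include <unistd.h>

static void abort_exit(int exit_status)
{
    /* _exit() ends the whole process immediately, without running
     * atexit handlers or winding through orte_finalize in the main
     * thread */
    _exit(exit_status);
}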
(custom patch has been posted for 1.8.3)
This commit was SVN r32780.
The ess pmi module was not handling aprun-launched
daemons. All daemons were thinking they were vpid 1.
Also, it turns out that on Cray systems using MOM nodes
for launched jobs, just detecting whether or not a
process is in a PAGG container is not sufficient.
Crank up the priority of the alps PLM component in the
event that the configure detected the presence of both
slurm and alps.
Have the ESS pmi component open the pmix framework and
select a pmix component.
This commit was SVN r32773.