openmpi

Author	SHA1	Message	Date
Howard Pritchard	d749077e1e	odls/alps: make sure PMI env. variables set up Add call to orte_odls_alps_get_rdma_creds in the local proc launch step to obtain the Cray Rdma credentials from the apshepherd, and to set the PMI env. variables expected by uGNI BTL, etc.	2014-12-03 09:44:17 -07:00
Howard Pritchard	e0487e7702	orte/common/alps: add an alps common lib to orte Add an alps common lib to orte. Add a function to determine whether or not a process is in a PAGG container. Note: we need a better naming convention for common libs, since right now they use a "flat" naming convention.	2014-12-03 09:44:17 -07:00
Howard Pritchard	a753c3ece0	ess/alps: add initial alps ess component Note this alps ess component has nothing to do with the old CNOS alps component used on Cray Seastar/Portals3 (Cray XT) systems. To work properly, changes need to be made to the open method of the ess/pmi component to keep it from selecting, and thus initializing, the opal/pmix/cray component.	2014-12-03 09:44:17 -07:00
Ralph Castain	32bf0e7b7e	Cleanup static build requirements by adding the wrapper flags back to the component configure.m4's. Minor cleanup of the lsf configure logic.	2014-12-03 07:14:06 -08:00
Ralph Castain	cb15cc06e1	Minor changes per Jeff's request on PR for 1.8.4	2014-12-02 19:54:10 -08:00
Ralph Castain	6294ed991b	Fix singletons - still working on singleton comm_spawn	2014-12-02 14:12:24 -08:00
Ralph Castain	14cdb04327	Revise the ess/pmi selection logic as all APPs must select it, and no daemons. Cleanup some of the mca param levels in ess so we don't printout the topology quite as easily.	2014-12-01 21:19:11 -08:00
Jeff Squyres	56cfa90dda	lsf configury: add dependent libraries for static linking Ensure to add the LSF dependent libraries and LD flags for the wrapper compiler static linking case.	2014-12-01 14:59:10 -08:00
Ralph Castain	f92ccaf0f9	Add missing var declarations	2014-12-01 09:36:28 -08:00
Ralph Castain	960ef34988	Ensure the LSF ras adds the hosts to the allocation. Correctly handle the semi-colon vs comma situation in hwloc slot_lists	2014-11-30 14:37:37 -08:00
Ralph Castain	3f9d9ae8b6	Provide tighter LSF integration by correctly handling scenarios where the user has asked LSF to assign bindings. Fix a couple of typos in lex parser definitions. Tell hostfile parser to ignore binding designations in hostfiles. Add an attribute to indicate that cpusets were provided as physical cpu ids. Once validated, a version of this will be backported to the v1.8.4 release.	2014-11-30 11:50:31 -08:00
Elena	b17ea23ce0	fixes for direct routed component under mpirun	2014-11-26 13:36:49 +02:00
Nadezhda Kogteva	8dd21c7736	OOB UD: fix case when multiple oob components were specified in command line (checking of uri).	2014-11-25 11:48:11 +02:00
Gilles Gouaillardet	578fe41788	fix hangs introduced by previous commit a6744b81777ab8247908350bd15cca49bedf5208	2014-11-25 17:50:44 +09:00
Gilles Gouaillardet	a6744b8177	fix misc memory leaks specific to the master	2014-11-25 13:52:10 +09:00
Ralph Castain	48f702827e	First part of memory leak cleanups from Gilles	2014-11-24 16:53:33 -08:00
Ralph Castain	2e00e335b9	Add missing header to tarball. Remove stale opal_unignore	2014-11-21 17:35:11 -08:00
Howard Pritchard	6e807c4e8a	odls/alps: minor config cleanup	2014-11-21 11:09:28 -07:00
Nadezhda Kogteva	05b2eb1270	OOB UD: opal_ignore removed from oob ud component: component is compilable. Added support of new RML API, support of opal_buffer as input data. Added usage of routed component.	2014-11-20 10:20:35 +02:00
Howard Pritchard	9425ebefae	Be more selective about closing fd's for alps/odls Be more selective about closing fd's for the alps odls component. Don't close fd's of pipes set up by the apshepherd for providing RDMA credentials, etc. Add an entry to the help file in case alps_app_lli_pipes returns an error.	2014-11-19 11:21:30 -07:00
Howard Pritchard	34c156759e	fix some compiler warnings in ras/alps	2014-11-18 11:32:37 -07:00
Howard Pritchard	4df3447d96	fix compare_nodes bug in alps ras component There was an obvious bug in the alps/ras component compare_nodes method which resulted in the function always evaluating the nodes as being equivalent.	2014-11-18 11:15:02 -07:00
Howard Pritchard	ff362c16ce	add/update copyrights for alps odls component	2014-11-18 10:16:11 -07:00
Howard Pritchard	dc98b62070	add initial support for an alps odls component It turns out that the support for Open MPI apps on Cray was hanging on a thin thread of support when using the mpirun job launcher. It just happened that with a certain set of configuration options things would work. This is bound to backfire at some point. To fix this weakness, as well as to allow for mpirun launched jobs to benefit from many of the advanced placement features provided by the Cray Linux Environment (as opposed to the hwloc only default env of orte), a new odls alps component is introduced.	2014-11-17 14:00:09 -07:00
Ralph Castain	d9ceb5aea4	Fix C++ builds by removing no-longer-needed type declaration	2014-11-14 11:44:24 -08:00
Ralph Castain	780c93ee57	Per the PR and discussion on today's telecon, extend the process name definition as a two-field struct of uint32_t's down to the OPAL layer. This resolves issues created by prior commits that impacted both heterogeneous and SPARC support. This also simplifies the OMPI code base by removing the need for frequent memcpy's when transitioning between the OMPI/ORTE layers and OPAL. We recognize that this means other users of OPAL will need to "wrap" the opal_process_name_t if they desire to abstract it in some fashion. This is regrettable, and we are looking at possible alternatives that might mitigate that requirement. Meantime, however, we have to put the needs of the OMPI community first, and are taking this step to restore hetero and SPARC support.	2014-11-11 17:00:42 -08:00
Ralph Castain	2a90788724	Support physical processor ids in rankfile	2014-11-10 14:00:40 -08:00
Ralph Castain	8c837d3cb3	Doh - if we can't output an entire block, then we need to adjust the number of bytes remaining to be output or else we will output duplicate bytes when next we are able to write.	2014-11-07 13:13:13 -08:00
Elena	03fc809bc9	This commit contains new dstore component sm which is used for communication between pmix server and clients at the same node via shared memory.	2014-11-06 16:01:19 +02:00
Ralph Castain	738c3e1d72	Ensure that mpirun correctly selects the HNP ess component without attempting to init the PMI subsystem as mpirun won't be supported anyway, so let's avoid the error message. Also, daemons launched by the plm/slurm component must use the ess/slurm module as we cannot trust the Slurm PMI_Init functions to correctly tell us when PMI support is available.	2014-11-03 21:35:42 -08:00
Ralph Castain	6fbc68c830	Update the grpcomm direct component's priority so it sits at the bottom of the list, as it should now that the other components are active. Clean up the signature print function a touch to make it more readable. Remove the unneeded xcast functions in brks and rcd components as we will just fall thru to using the "direct" one	2014-11-03 14:43:17 -08:00
Gilles Gouaillardet	652ecdb888	oob/tcp: always include a missing header file improve open-mpi/ompi@c9d1e16a9e	2014-10-29 13:39:23 +09:00
Gilles Gouaillardet	c9d1e16a9e	oob/tcp: include a missing header file warning can be seen under cygwin without the missing header file	2014-10-28 13:56:25 +09:00
Ralph Castain	64fae47d85	Ensure that the proxy "pull" of output gets directed to the requesting tool, which is no longer just the sender since the HNP may be making the request on behalf of someone else	2014-10-23 10:21:21 -07:00
Ralph Castain	526682e2f9	Add the ability for a tool that requests spawn of a job to also request forwarding of all output to the tool. The tool is responsible for its own call to push its stdin to the new job. The push request can come -after- the job is started, but the pull request has to be done during the spawn procedure or else output can be lost.	2014-10-23 08:16:49 -07:00
Ralph Castain	894acb0aa8	configury: new OPAL_SET_MCA_PREFIX/ORTE_SET_MCA_CMD_LINE_ID macros These two macros set the MCA prefix and MCA cmd line id, respectively. Specifically, MCA parameters will be named PREFIX<foo> in the environment, and the cmd line will use -ID foo bar. These macros must be called during configure.ac and a value supplied. In the case of Open MPI, the values given are PREFIX=OMPI_MCA_ and ID=mca. Other projects (such as ORCM) will call these macros with their own unique values. For example, ORCM uses PREFIX=ORCM_MCA_ and ID=omca This scheme is necessary to allow running Open MPI applications under systems that use their own versions of ORTE and OPAL. For example, when running OMPI applications under ORCM, we need the MCA params passed to the ORCM daemons to be separated from those recognized by the OMPI application.	2014-10-22 18:57:40 -07:00
Ralph Castain	b6aa691e0a	Fix incorrect implementation of new MCA param mca_base_env_list - it was not picking up envars and forwarding them, but only worked if you explicitly set a value for the envar. Ensure it works for both direct and indirect launch modes. Remove stale code as this replaced orte_forward_envars. Ensure it doesn't get passed to the ORTE daemons.	2014-10-16 12:58:56 -07:00
Ralph Castain	4fc4a8346b	Fix a couple of minor issues. Ensure usock isn't used if the session dirs aren't setup. Protect an oddball case where orte_xml_fp is NULL.	2014-10-09 20:58:46 -07:00
Elena	e319c95267	fixes for grpcomm rcd/brucks algorithms	2014-10-09 06:12:26 +02:00
Howard Pritchard	d2bb8d8829	remove alps ess component The alps ess component is obsolete. It relies on header files only present in very old CLE (Cray Linux) 3.X for the Cray XT series. As support for these systems is being dropped starting with release 1.9, this code is being removed.	2014-10-03 13:17:33 -06:00
hpp	8ded59ce0f	fix alps plm to allow explicit host placement It turns out that the alps plm code was developed only on cray systems that were running batch schedulers. However, for bring up and development systems, it's not at all uncommon for there to be no batch scheduler, and thus to orte it appears that orte_num_allocated_nodes is always zero. This forces a user using mpirun on such a system to always specify a host list: mpirun -n 4 -N 1 -host 32,45,68 .... just to get the job to run, but then since the -L argument for aprun is never built, the app always runs on the first batch of nodes that aprun finds available.	2014-10-02 10:42:01 -06:00
Ralph Castain	4320457394	Fix the debug output - you can't print the cpuset pointer using the %p format without generating warnings This commit was SVN r32811.	2014-09-29 17:10:38 +00:00
Howard Pritchard	f8ac8bb6b0	remove improper use of hwloc_bitmap_free When using the native aprun launcher, it was observed that there were frequent memory corruption errors occurring either during a PMI kvs-fence operation, or at mpi termination during opal cleanup of allocated objects. This was especially bad when using aprun --c none In some cases, the application would even just hang in finalize if using ptmalloc, owing to some kind of infinite loop in cleanup of small blocks, etc. It turns out that the problem was in orte_ess_base_proc_binding's improper use of opal_hwloc_base_get_available_cpus. The cpuset (bitmap) returned from that function is not meant to be freed by the caller. This problem is likely never observed when using the mpirun launcher as there's an early exit if the OMPI_MCA_orte_bound_at_launch environment variable is set. This commit was SVN r32809.	2014-09-29 16:10:37 +00:00
Gilles Gouaillardet	9661e4537f	oob/tcp: fix a race condition Mimic the btl/tcp protocol to solve the race condition that happens when two peers try to connect to each other at the same time cmr=v1.8.4:reviewer=rhc This commit was SVN r32799.	2014-09-26 06:54:30 +00:00
Ralph Castain	17846411c3	Now that we have an ORTE thread running in apps, we can't just call "exit" during RTE abort as that is happening in a thread, and (at least in some environments) doesn't result in the main thread being immediately terminated. Instead, we wind up going thru orte_finalize in the main thread, which isn't what we want. So replace the call to "exit" with the "quick exit" variant "_exit", which causes the entire process to exit immediately. (custom patch has been posted for 1.8.3) This commit was SVN r32780.	2014-09-23 22:51:10 +00:00
Howard Pritchard	1508a01325	Fixes to enable mpirun to work again on Cray The ess pmi module was not handling aprun launched daemons. All daemons were thinking they were vpid 1. Also, turns out that on cray systems using MOM nodes for launched jobs, just detecting whether or not a process is in a PAGG container is not sufficient. Crank up the priority of the alps PLM component in the event that the configure detected the presence of both slurm and alps. Have the ESS pmi component open the pmix framework and select a pmix component. This commit was SVN r32773.	2014-09-23 15:37:26 +00:00
Gilles Gouaillardet	5fa2b6c59c	oob/tcp: fix a race condition Refs trac:4909 This commit was SVN r32754. The following Trac tickets were found above: Ticket 4909 --> https://svn.open-mpi.org/trac/ompi/ticket/4909	2014-09-18 08:17:25 +00:00
Ralph Castain	3a437cbdb3	Silence set-but-not-used warning when timing isn't enabled This commit was SVN r32749.	2014-09-17 00:40:10 +00:00
Ralph Castain	414f4e9783	Try to provide a real hostname for the remote host to aid in debugging Refs trac:4908 This commit was SVN r32748. The following Trac tickets were found above: Ticket 4908 --> https://svn.open-mpi.org/trac/ompi/ticket/4908	2014-09-17 00:39:49 +00:00
Jeff Squyres	9dc49c5f92	oob_tcp_connection: print "<unknown>" instead of "NULL" "NULL" doesn't mean anything to the user, and is somewhat confusing to see in an error message. "<unknown>" at least indicates that there's an error, and we know who the peer is. This commit was SVN r32747.	2014-09-16 22:47:57 +00:00


3513 commits