openmpi

Author	SHA1	Message	Date
Ralph Castain	1911d74095	Prevent segfault when -debug given to mpirun	2016-05-08 10:19:05 -07:00
Ralph Castain	58dd41facf	Repair the processing of cmd line options that mapped to MCA params. This was responsible for breaking things like map-by <foo>. Remove debug, let orterun send terminate cmd to DVM Recover the DVM support	2016-05-06 13:14:03 -07:00
Ralph Castain	6ac7929bd0	Extend the schizo framework to allow definition of CLI options by environment. Refactor orterun to mesh with the orted_submit code, thus improving code reuse. Eliminate the orte-submit tool as orterun can now meet that need. Cleanups per @jjhursey review	2016-05-01 11:30:25 -07:00
Ralph Castain	fac409d094	Ensure the personality gets set for the debugger job launch when attaching	2016-04-28 15:28:55 -07:00
Ralph Castain	4a623778a9	Fix the debugger attach - previous commit had fixed one instance of a check prior to sending the release message, but there was a second code path that included a similar check that was missed. Thanks to John DelSignore for spotting it!	2016-03-23 08:25:25 -07:00
Ralph Castain	c146c4969b	Revert part of open-mpi/ompi@c1bbbb5e2f to restore the usock component, thus fixing show_help aggregation. Fixes #1467 Restore debugger attach operations Fixes #1225	2016-03-18 21:49:04 -07:00
Ralph Castain	8f410d7897	Revert one part of open-mpi/ompi@4d0cc27eb7	2016-03-18 07:23:30 -07:00
Ralph Castain	4d0cc27eb7	Update the singularity support to match that of the latest singularity master. Remove the restriction on shared memory components by instructing singularity to not isolate the PID space. Add a new schizo API to allow setting up the original app_context. Ensure the container is installed prior to execution.	2016-03-05 21:47:42 -08:00
Ralph Castain	4a55fba414	Fix registration of error handlers thru the pmix120 component. A thread-shift operation was hanging on the sync_event_base, which made it dependent on someone calling opal_progress. Unfortunately, a process in "sleep" or spinning outside the MPI library won't do that, and so we never complete errhandler registration.	2016-03-02 15:01:01 -08:00
Ralph Castain	011403c04a	Fix a number of issues, some of which have lingered for a long time: * provide a more reliable way of determining that a process is a singleton by leveraging the schizo framework. Add new components for slurm, alps, and orte to detect when we are in a managed environment, and if we have been launched by mpirun or a native launcher. Set the correct envars to control ess and pmix selection in each case. * change the relative priority of the pmix120 and pmix112 components to make pmix120 the default * fix singleton comm-spawn by correctly setting the num_apps field of the orte_job_t created by the daemon - this fixes a segfault in register_nspace on newly created daemons * ensure orterun doesn't propagate any ess or pmix directives in its environment * Cleanup a few valgrind issues and memory leaks * Fix a race condition that prevented the client from completing notification registrations (missing thread shift) * Ensure the schizo/alps component detects launch by mpirun	2016-03-01 06:53:00 -08:00
Ralph Castain	d653cf2847	Convert the orte_job_data pointer array to a hash table so it doesn't grow forever as we run lots and lots of jobs in the persistent DVM.	2016-02-21 11:55:49 -08:00
Nathan Hjelm	69de442136	orterun: allow DDT if options contain :'s There is a bug in MPMD detection that disables totalview if a : is found anywhere on the command line. This includes inside an argument option or MCA variable value. This commit changes the check to look for the string " : " instead of the character : which should eliminate the issue in most cases. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2016-02-18 16:56:08 -07:00
Ralph Castain	60a7bc2e50	Enable the PMIx notification callback system. This currently is only supported by the pmix120 component, which is not selected by default. All other components will ignore error registration requests, and thus do not support debugger attach when launched via mpirun. Note that direct launched applications will support such attachment, but may not do so in a scalable fashion. Fixes #1225	2016-02-18 09:29:12 -08:00
Ralph Castain	8f9508cace	Further enhance the support for Singularity containers. Extend the "personality" command-line option to allow specifying both model (e.g., "ompi") and container (e.g., "singularity"), and add the necessary logic to support multiple options. Add a new pmix "isolated" component to handle singletons where no HNP is available since containers cannot launch the HNP.	2016-02-17 13:33:06 -08:00
Ralph Castain	50431001a3	Modify the IOF subsystem to handle per-job directives for redirecting IO to files, tagging IO, and timestamping IO. Fix stdin reader	2016-02-16 18:54:38 -08:00
Ralph Castain	06c3dfc052	Refactor the ORTE DVM code so that external codes can submit multiple jobs using only a single connection to the HNP. * Clean up the DVM so it continues to run even when applications error out and we would ordinarily abort the daemons. * Create a new errmgr component for the DVM to handle the differences. * Cleanup the DVM state component. * Add ORTE bindings directory and brief README * Pass a local tool index around to match jobs. * Pass the jobid on job completion. * Fix initialization logic. * Add framework for python wrapper. * Fix terminate-with-non-zero-exit behavior so it properly terminates only the indicated procs, notifies orte-submit, and orte-dvm continues executing. * Add some missing options to orte-dvm * Fix a bug in -host processing that caused us to ignore the #slots designator. Add a new attribute to indicate "do not expand the DVM" when submitting job spawn requests. * It actually makes no sense that we treat the termination of all children differently than terminating the children of a specific job - it only creates confusion over the difference in behavior. So terminate children the same way regardless. Extend the cmd_line utility to easily allow layering of command line definitions Catch up with ORTE interface change and make build more generic. Disable "fixed dvm" logic for now. Add another cmd_line function to merge a table of cmd line options with another one, reporting as errors any duplicate entries. Use this to allow orterun to reuse the orted_submit code Fix the "fixed_dvm" logic by ensuring we reset num_new_daemons to zero. Also ensure that the nidmap is sent with the first job so the downstream daemons get the node info. Remove a duplicate cmd line entry in orterun. Revise the DVM startup procedure to pass the nidmap only once, at the startup of the DVM. This reduces the overhead on each job launch and ensures that the nidmap doesn't get overwritten. Add new commands to get_orted_comm_cmd_str(). 
Move ORTE command line options to orte_globals.[ch]. Catch up with extra orte_submit_init parameter. Add example code. Add documentation. Bump version. The nidmap and routing data must be updated prior to propagating the xcast or else the xcast will fail. Fix the return code so it is something more expected when an error occurs. Ensure we get an error returned to us when we fail to launch for some reason. In this case, we will always get a launch_cb as we did indeed attempt to spawn it. The error code will be returned in the complete_cb. Fix the return code from orte_submit_job - it was returning the tracker index instead of "success". Take advantage of ORTE's pretty-print capabilities to provide a nice error output explaining why we failed to launch. Ensure we always get a launch_cb when we fail to launch, but no complete_cb as the job never launched. Extend the error reporting capability to job completion as well. Add index parameter to orte_submit_job(). Add orte_job_cancel and implement ORTE_DAEMON_TERMINATE_JOB_CMD. Factor out dvm termination. Parse the terminate option at tool level. Add error string for ORTE_ERR_JOB_CANCELLED. Add some safeguards. Cleanup and/of comments. Enable the return. Properly ORTE_DECLSPEC orte_submit_halt. Add orte_submit_halt and orte_submit_cancel to interface. Use the plm interface to terminate the job	2016-02-13 08:10:44 -08:00
Ralph Castain	5e5adebf8e	Port the changes from #782 to the master. Not everything applies here as the code in the 1.10 series is a little different. In addition, we asked for a few changes (e.g., using MPI_ERR_ARG instead of "13") that are incorporated here. Thanks to @jsharpe for the PR	2015-12-12 12:40:34 -08:00
Ralph Castain	8bfbe7f16c	Add a new MCA parameter for default_dash_host to offer a mirror of the default_hostfile	2015-10-31 19:09:54 -07:00
Ralph Castain	f872e99315	Fix orte-submit so it allows application procs to select the correct ess component. Protect orte_data_server from multiple calls to finalize.	2015-09-21 20:31:57 -07:00
rhc54	665b30376a	Merge pull request #868 from rhc54/topic/hwloc Remove OPAL_HAVE_HWLOC qualifier and error out if --without-hwloc is given	2015-09-04 17:58:07 -07:00
Ralph Castain	d97bc29102	Remove OPAL_HAVE_HWLOC qualifier and error out if --without-hwloc is given	2015-09-04 16:54:40 -07:00
Ralph Castain	f6948c2bb4	Sync with PMIx master 43e45c3. Get multi-node publish/lookup/unpublish working	2015-09-04 10:07:17 -07:00
Nathan Hjelm	4d92c9989e	more c99 updates This commit does two things. It removes checks for C99 required headers (stdlib.h, string.h, signal.h, etc). Additionally it removes definitions for required C99 types (intptr_t, int64_t, int32_t, etc). Signed-off-by: Nathan Hjelm <hjelmn@me.com>	2015-06-25 10:14:13 -06:00
Ralph Castain	869041f770	Purge whitespace from the repo	2015-06-23 20:59:57 -07:00
Gilles Gouaillardet	2e384a3b65	initialize common symbols from orte A few uninitialized common symbols are remaining (generated by flex) : * orte/mca/rmaps/rank_file/rmaps_rank_file_lex.c: orte_rmaps_rank_file_leng * orte/mca/rmaps/rank_file/rmaps_rank_file_lex.c: orte_rmaps_rank_file_text * orte/util/hostfile/hostfile_lex.c: orte_util_hostfile_leng * orte/util/hostfile/hostfile_lex.c: orte_util_hostfile_text	2015-05-08 10:11:58 +09:00
Ralph Castain	0e23f76eee	Fix comment	2015-04-15 20:09:14 -06:00
Jeff Squyres	99754afd25	orterun.c: re-justify the output message text The type-A personality / English lit major in me compels me to re-justify the text. :-)	2015-04-01 10:57:23 -07:00
Ralph Castain	41dd65d6cd	Per Jeff's request, tone down the comments and "standardize" the warning	2015-03-31 17:54:54 -07:00
Ralph Castain	f863147b05	Per the telecon and chat with Jeff, let root only do the version option without warning. Otherwise, require that the user specifically indicate allow-use-as-root	2015-03-31 10:34:35 -07:00
Ralph Castain	2f365720b0	Allow root to request the version and help from mpirun without having to override the run-as-root protection. Thanks to Robert McLay for pointing this out	2015-03-28 08:17:44 -07:00
Gilles Gouaillardet	4c0eb11e08	orterun: fix misc errors as reported by Coverity with CIDs 70700, 71039, 710651	2015-03-09 11:57:18 +09:00
Gilles Gouaillardet	4e7b5240e4	orte/tools: fix misc memory leaks as reported by Coverity with CIDs 70700, 71039, 71854, 72384 and 710651	2015-03-05 14:06:18 +09:00
Jeff Squyres	4f54fedf05	orterun: ensure to set used_num_procs=true after finding that token This was CID 71687.	2015-02-24 15:25:39 -05:00
Ralph Castain	ec5ccb76cf	Enable persistent ORTE DVM so users can execute multiple OMPI jobs within an allocation without restarting the DVM every time.	2015-01-30 11:00:43 -08:00
Ralph Castain	028b00154d	Complete implementation of the schizo framework to support OMPI component	2015-01-27 09:29:42 -06:00
Mike Dubman	f83d6045aa	ORTE: undeprecate -x var=val in mpirun. mpirun -x var=val is back; it is actually a useful alias for -mca mca_base_env_list "var=val"	2014-11-12 10:51:15 +02:00
Ralph Castain	738c3e1d72	Ensure that mpirun correctly selects the HNP ess component without attempting to init the PMI subsystem as mpirun won't be supported anyway, so let's avoid the error message. Also, daemons launched by the plm/slurm component must use the ess/slurm module as we cannot trust the Slurm PMI_Init functions to correctly tell us when PMI support is available.	2014-11-03 21:35:42 -08:00
Ralph Castain	894acb0aa8	configury: new OPAL_SET_MCA_PREFIX/ORTE_SET_MCA_CMD_LINE_ID macros These two macros set the MCA prefix and MCA cmd line id, respectively. Specifically, MCA parameters will be named PREFIX<foo> in the environment, and the cmd line will use -ID foo bar. These macros must be called during configure.ac and a value supplied. In the case of Open MPI, the values given are PREFIX=OMPI_MCA_ and ID=mca. Other projects (such as ORCM) will call these macros with their own unique values. For example, ORCM uses PREFIX=ORCM_MCA_ and ID=omca This scheme is necessary to allow running Open MPI applications under systems that use their own versions of ORTE and OPAL. For example, when running OMPI applications under ORCM, we need the MCA params passed to the ORCM daemons to be separated from those recognized by the OMPI application.	2014-10-22 18:57:40 -07:00
Ralph Castain	ecbae03009	Fix typo	2014-10-16 13:30:06 -07:00
Ralph Castain	b6aa691e0a	Fix incorrect implementation of new MCA param mca_base_env_list - it was not picking up envars and forwarding them, but only worked if you explicitly set a value for the envar. Ensure it works for both direct and indirect launch modes. Remove stale code as this replaced orte_forward_envars. Ensure it doesn't get passed to the ORTE daemons.	2014-10-16 12:58:56 -07:00
Ralph Castain	84810b80fd	Cover the remaining code paths for Java apps to define class path Refs trac:4926 This commit was SVN r32823. The following Trac tickets were found above: Ticket 4926 --> https://svn.open-mpi.org/trac/ompi/ticket/4926	2014-09-30 22:27:03 +00:00
Ralph Castain	040a69c38b	Correct the classpath to correctly include the local directory so Java programs find the application class cmr=v1.8.4:reviewer=jsquyres This commit was SVN r32817.	2014-09-30 16:35:12 +00:00
Ralph Castain	0445052a1c	Check for multiple declarations of a given MCA param and error out if detected as that can create an ambiguous definition of the param value. Refs trac:4897 This commit was SVN r32719. The following Trac tickets were found above: Ticket 4897 --> https://svn.open-mpi.org/trac/ompi/ticket/4897	2014-09-12 22:21:30 +00:00
Ralph Castain	e671620ac7	Per request from Jeff: tune up the help messages for binding options Refs trac:4898 This commit was SVN r32691. The following Trac tickets were found above: Ticket 4898 --> https://svn.open-mpi.org/trac/ompi/ticket/4898	2014-09-09 22:39:22 +00:00
Ralph Castain	4207b4c4ad	Improve the --bind-to help message to better indicate the default options under various values of np. Remove the warning message if the user doesn't specify a binding policy and we are overloaded cmr=v1.8.3:reviewer=jsquyres This commit was SVN r32687.	2014-09-08 21:03:51 +00:00
Ralph Castain	4df1aa63f7	Since we've run into the situation where someone puts a script wrapper around a launcher such as srun, we need to always protect MCA cmd line params with quotes. This means we also need to protect the backend from quotes coming into the system as part of a value, or else the parser gets confused. So add a new function for wrapping MCA arguments, and tell the backend parser to ignore/remove leading/trailing quotes. cmr=v1.8.3:reviewer=jsquyres This commit was SVN r32686.	2014-09-08 20:38:46 +00:00
Ralph Castain	aec5cd08bd	Per the PMIx RFC: WHAT: Merge the PMIx branch into the devel repo, creating a new OPAL “pmix” framework to abstract PMI support for all RTEs. Replace the ORTE daemon-level collectives with a new PMIx server and update the ORTE grpcomm framework to support server-to-server collectives WHY: We’ve had problems dealing with variations in PMI implementations, and need to extend the existing PMI definitions to meet exascale requirements. WHEN: Mon, Aug 25 WHERE: https://github.com/rhc54/ompi-svn-mirror.git Several community members have been working on a refactoring of the current PMI support within OMPI. Although the APIs are common, Slurm and Cray implement a different range of capabilities, and package them differently. For example, Cray provides an integrated PMI-1/2 library, while Slurm separates the two and requires the user to specify the one to be used at runtime. In addition, several bugs in the Slurm implementations have caused problems requiring extra coding. All this has led to a slew of #if’s in the PMI code and bugs when the corner-case logic for one implementation accidentally traps the other. Extending this support to other implementations would have increased this complexity to an unacceptable level. Accordingly, we have: * created a new OPAL “pmix” framework to abstract the PMI support, with separate components for Cray, Slurm PMI-1, and Slurm PMI-2 implementations. * Replaced the current ORTE grpcomm daemon-based collective operation with an integrated PMIx server, and updated the grpcomm APIs to provide more flexible, multi-algorithm support for collective operations. At this time, only the xcast and allgather operations are supported. * Replaced the current global collective id with a signature based on the names of the participating procs.
This allows an unlimited number of collectives to be executed by any group of processes, subject to the requirement that only one collective can be active at a time for a unique combination of procs. Note that a proc can be involved in any number of simultaneous collectives - it is the specific combination of procs that is subject to the constraint * removed the prior OMPI/OPAL modex code * added new macros for executing modex send/recv to simplify use of the new APIs. The send macros allow the caller to specify whether or not the BTL supports async modex operations - if so, then the non-blocking “fence” operation is used, if the active PMIx component supports it. Otherwise, the default is a full blocking modex exchange as we currently perform. * retained the current flag that directs us to use a blocking fence operation, but only to retrieve data upon demand This commit was SVN r32570.	2014-08-21 18:56:47 +00:00
Joshua Ladd	30da6d3a17	Opal: add a new MCA parameter that allows the user to specify a list of environment variables. This parameter will become the standard mechanism by which environment variables are set for OMPI applications replacing the -x option. mpirun ... -x env_foo1=val1 -x env_foo2 -x env_foo3=val3 should now be expressed as mpirun ... -mca mca_base_env_list env_foo1=val1+env_foo2+env_foo3=val3. The motivation for doing this is so that a list of environment variables may be set via standard MCA mechanisms such as mca parameter files, amca lists, etc. This feature was developed by Elena Shipunova and was reviewed by Josh Ladd. This commit was SVN r32163.	2014-07-09 00:38:25 +00:00
Adrian Reber	cabf1d4e68	use the orte attributes in the FT code to fix compile errors This commit was SVN r32093.	2014-06-26 03:19:17 +00:00
Ralph Castain	5f6be06b54	Per request from Gilles and discussion at devel conference, have the --oversubscribe option automatically set both oversubscribe and overload-allowed properties as this is likely what the user intended. cmr=v1.8.2:reviewer=rhc:subject=automatically set oversub/load This commit was SVN r32072.	2014-06-24 18:11:39 +00:00
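
Commit 30da6d3a17 above describes replacing repeated `-x` options with a single `mca_base_env_list` value. The rewrite it describes can be sketched roughly as follows. This is only an illustration: the function name `x_options_to_env_list` is hypothetical, the `+` separator is taken from the commit message itself (other versions may use a different delimiter), and the real option handling lives in Open MPI's C sources.

```python
def x_options_to_env_list(argv, delimiter="+"):
    """Collapse repeated `-x VAR[=VAL]` options into one env-list argument.

    Hypothetical helper; the delimiter default follows the commit message.
    """
    entries = []  # values collected from each `-x` option, in order
    rest = []     # every other argument, left untouched
    i = 0
    while i < len(argv):
        if argv[i] == "-x" and i + 1 < len(argv):
            entries.append(argv[i + 1])
            i += 2
        else:
            rest.append(argv[i])
            i += 1
    if entries:
        rest += ["-mca", "mca_base_env_list", delimiter.join(entries)]
    return rest


old = ["-np", "2", "-x", "env_foo1=val1", "-x", "env_foo2", "-x", "env_foo3=val3"]
new = x_options_to_env_list(old)
print(" ".join(new))
# -np 2 -mca mca_base_env_list env_foo1=val1+env_foo2+env_foo3=val3
```

Entries without `=` (such as `env_foo2`) are passed through by name only, matching the example in the commit message.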

433 Commits
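
Commit 69de442136 in the list above changes MPMD detection in orterun from matching the character `:` to matching the string `" : "`. The difference between the two checks can be illustrated with a small sketch. The function names and the flattened command-line strings here are assumptions made for the example, not the actual orterun code, which is written in C.

```python
def looks_like_mpmd_old(cmdline: str) -> bool:
    # Check as described before the fix: any ':' anywhere counts.
    return ":" in cmdline


def looks_like_mpmd_new(cmdline: str) -> bool:
    # Check as described after the fix: only a space-delimited " : " counts.
    return " : " in cmdline


# A single-app launch whose option value happens to contain a colon.
single = "mpirun -np 4 -x PATH=/opt/bin:/usr/bin ./app"
# A genuine MPMD launch with two app contexts separated by " : ".
mpmd = "mpirun -np 2 ./a : -np 2 ./b"

print(looks_like_mpmd_old(single), looks_like_mpmd_new(single))  # True False
print(looks_like_mpmd_old(mpmd), looks_like_mpmd_new(mpmd))      # True True
```

As the commit message notes, the stricter check removes the false positive "in most cases"; a value that itself contains `" : "` would still match.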