openmpi

Автор	SHA1	Сообщение	Дата
Jeff Squyres	3f4d4886f2	Minor update for something that has been bugging me for quite a while: OMPI supports multiple different repository systems (SVN, hg, git). But the VERSION file has listed "want_svn" and "svn_r" as fields, even though the actual repo system and version may not be SVN. So search/replace those fields (and derrivative values that come from those fields) with "want_repo_rev" and "repo_rev", respectively. This commit was SVN r24405.	2011-02-16 22:53:23 +00:00
Ralph Castain	5120e6aec3	Redefine the rmaps framework to allow multiple mapper modules to be active at the same time. This allows users to map the primary job one way, and map any comm_spawn'd job in a different way. Modules are given the opportunity to map a job in priority order, with the round-robin mapper having the highest default priority. Priority of each module can be defined using mca param. When called, each mapper checks to see if it can map the job. If npernode is provided, for example, then the loadbalance mapper accepts the assignment and performs the operation - all mappers before it will "pass" as they can't map npernode requests. Also remove the stale and never completed topo mapper. This commit was SVN r24393.	2011-02-15 23:24:31 +00:00
Ralph Castain	b5de068533	Clean up an error in r24371 - can't use a const parameter as target in asprintf as it changes the value of the address. Add some new proc/job states Rename a constant to reflect coming change - remove the arbitrary difference between restarting a proc locally and relocating it to another node in terms of the number of restarts allowed. Add pretty-print of signals for "proc aborted due to signal" reports. This commit was SVN r24378. The following SVN revision numbers were found above: r24371 --> open-mpi/ompi@93d28a5792	2011-02-14 19:29:09 +00:00
Ralph Castain	b09f57b03d	Update the multicast subsystem - ported from Cisco branch This commit was SVN r24246.	2011-01-13 01:54:05 +00:00
Ralph Castain	eba65e97f3	Extend the rmcast APIs to allow enable/disable of comm, required for clean termination by upper layer users. Point the recv thread event base to the right place so it can wakeup when required. Add a new error code for "comm disabled" when attempting to communicate after disabling comm. This commit was SVN r24129.	2010-12-01 13:41:19 +00:00
Shiqing Fan	482a621e31	Change the behavior of exporting/importing symbols on Windows, so that to fit the new build procedure, i.e. import statically linked opal/orte libraries for other libraries/binaries. There are several use cases when creating dll libraries: 1. create DLL A, export symbols of A, import nothing (A normally is OPAL) should define _USRDLL , A_EXPORT 2. create DLL B, export symbols of B, import A.lib (B could be ORTE, OMPI or other ompi tools) should define _USRDLL, B_EXPORT 3. create DLL C, import B.dll (C could be external libs or apps) should define B_IMPORT This commit was SVN r24016.	2010-11-09 16:13:30 +00:00
Jeff Squyres	73bcc4a36b	Fix mistake that came in via the ompi-agen tree in r23764. The mistake wasn't part of the core autogen upgrade; it was an additional 'bonus' cleanup. Oops. The mistake will always create a set of directories under installdir, even if you do not --with-devel-headers. The set of directories will be empty, but still -- they should not be there at all. This commit fixes that -- the directories are not created at all if you do not --with-devel-headers This commit was SVN r23801. The following SVN revision numbers were found above: r23764 --> open-mpi/ompi@40a2bfa238	2010-09-24 22:53:28 +00:00
Ralph Castain	b76278085e	Sigh. Since 0,0 is a valid process, this just doesn't work (duh) This commit was SVN r23795.	2010-09-23 15:03:05 +00:00
Ralph Castain	7aed412798	Add definitions to tell us when jobid and vpid haven't been defined This commit was SVN r23793.	2010-09-23 01:33:42 +00:00
Ralph Castain	40a2bfa238	WARNING: Work on the temp branch being merged here encountered problems with bugs in subversion. Considerable effort has gone into validating the branch. However, not all conditions can be checked, so users are cautioned that it may be advisable to not update from the trunk for a few days to allow MTT to identify platform-specific issues. This merges the branch containing the revamped build system based around converting autogen from a bash script to a Perl program. Jeff has provided emails explaining the features contained in the change. Please note that configure requirements on components HAVE CHANGED. For example. a configure.params file is no longer required in each component directory. See Jeff's emails for an explanation. This commit was SVN r23764.	2010-09-17 23:04:06 +00:00
Jeff Squyres	f1a7b5cc33	Make "processor affinity not supported" error message a little better: * Remove OPAL_ERR_PAFFINITY_NOT_SUPPORTED; fit it into the generic OPAL_ERR_NOT_SUPPORTED case. * When odls_default detects that processor affinity is not supported, it prints a specific message about it, and then it suppressed a generic HNP help message that would normally follow it (i.e., it's easier to have the "processor affinity is not supported" show_help message last). * Use some symbolic names in odls_default instead of fixed int's, just for slight readability improvements in the code. * Introduce orte_show_help_suppress(), which gives the ability to suppress any future showings of any arbitrary show_help() message. This is useful if you display message X and want to suppress message Y. This suppression only works in environments where orte_show_help() does coalescing. This commit was SVN r23249.	2010-06-08 20:16:07 +00:00
Ralph Castain	7ce34223f1	Per off-list discussion, implement the new OMPI exit status policy (soon to be on wiki) and further cleanup error reporting to cover new cases. Implement process migration when failed nodes are detected. Some testing still required This commit was SVN r23121.	2010-05-12 18:11:58 +00:00
Ralph Castain	4bd25f587c	Begin handling the case of lost connections by having the OOB report it to the errmgr instead of the routed framework. Add an "app" component to t he errmgr framework so that it can decide how to respond - which for now at least is just to check for lifeline and abort if so. Add a new error constant to indicate that the error is "unrecoverable" so the oob can know it needs to abort. This commit was SVN r23112.	2010-05-11 00:34:12 +00:00
Ralph Castain	d6a1d7a082	Little more cleanup on paffinity. Provide a specific error code for affinity not supported so we can better report the problem. Move the error reporting to orterun so we only get one error message. Update the darwin paffinity module to return the correct new error codes. This commit was SVN r23107.	2010-05-07 14:04:55 +00:00
Ralph Castain	d4f56cff61	More cleanup on paffinity....groan It is okay to not have a paffinity module IF you aren't using paffinity anyway. So don't error out of MPI_Init because a paffinity module wasn't selected. Cleanup error reporting in the odls default module to (once and for all!) eliminate messages originating in the fork'd process. Create some new error codes to allow us to pass enough info back to the parent process to provide useful error messages. This commit was SVN r23106.	2010-05-06 20:57:17 +00:00
Ralph Castain	f4ae2885e2	Add new error constant This commit was SVN r23090.	2010-05-04 13:44:33 +00:00
Ralph Castain	e3164d2ac1	For the autogen-challenged (i.e., Jeff), create a new ORTE constant that tells the user that a required module was not found. Update the errmgr select function to output the error if no module is found. This commit was SVN r23032.	2010-04-24 01:39:26 +00:00
Ralph Castain	efbb5c9b7c	Revamp the errmgr framework to provide a greater range of optional behaviors, including different behaviors for daemons, and remove several looping messages across the code base: * add hnp and orted modules to the errmgr framework. The HNP module contains much of the code that was in the errmgr base since that code could only be executed by the HNP anyway. * update the odls to report process states directly into the active errmgr module, thus removing the need to send messages looped back into the odls cmd processor. Let the active errmgr module decide what to do at various states. * remove the code to track application state progress from the plm_base_launch_support.c code. Update the plm modules to call the errmgr directly when a launch fails. * update the plm_base_receive.c code to call the errmgr with state updates from remote daemons * update the routed modules to reflect that process state is updated in the errmgr * ensure that the orted's open the errmgr and select their appropriate module * add new pretty-print utilities to print process and job state. Move the pretty-print of time info to a globally-accessible place * define a global orte_comm function to send messages from orted's to the HNP so that others can overlay the standard RML methods, if desired. * update the orterun help output to reflect that the "term w/o sync" error message can result from three, not two, scenarios This commit was SVN r23023.	2010-04-23 04:44:41 +00:00
Ralph Castain	8fb71c0fe6	Add some helpful defined values This commit was SVN r22850.	2010-03-19 03:59:29 +00:00
Ralph Castain	2541aa98ab	Change the app_idx type to uint32_t to support users who use large numbers of app_contexts. Set it up as a new typedef so we can change it later without as much effort. This commit was SVN r22727.	2010-02-27 17:37:34 +00:00
Ralph Castain	51f64aaf96	Add a new ras module to support bootstrap operations. Additional functionality may eventually be required in the component, but for now all it does is provide a mechanism for ensuring that other allocations don't confuse the system. Only active if specifically directed to use it This commit was SVN r22040.	2009-09-30 23:30:24 +00:00
Ralph Castain	7370235c3e	Create a more specific error code for when specific sockets are not available. Ensure that slurm 2.0 gets the expected error return if the process can't start for that reason so it can take corrective action. This commit was SVN r21867.	2009-08-21 21:28:15 +00:00
Iain Bason	e7ff2368d6	This fixes trac:1930. Emit a more informative error message when the file descriptor limit is reached during an accept() call. Also, abort when the accept fails to avoid an infinite loop. Emit a more informative error message when the help file can't be opened. This commit was SVN r21271. The following Trac tickets were found above: Ticket 1930 --> https://svn.open-mpi.org/trac/ompi/ticket/1930	2009-05-26 20:03:21 +00:00
Greg Koenig	60485ff95f	This is a very large change to rename several #define values from OMPI_* to OPAL_*. This allows opal layer to be used more independent from the whole of ompi. NOTE: 9 "svn mv" operations immediately follow this commit. This commit was SVN r21180.	2009-05-06 20:11:28 +00:00
Ralph Castain	a6a7335694	Catch a potential bug spanning several ESS modules. The node_rank and local_rank types were changed to uint16_t, however the modules returned UINT8_MAX as an "invalid" value. To clean this up, define an INVALID value for these types, and change the various modules so they return this value to indicate an invalid response. This commit was SVN r20303.	2009-01-21 00:19:37 +00:00
Ralph Castain	6e5d844c36	Roll in the revamped IOF subsystem. Per the devel mailing list email, this is a complete rewrite of the iof framework designed to simplify the code for maintainability, and to support features we had planned to do, but were too difficult to implement in the old code. Specifically, the new code: 1. completely and cleanly separates responsibilities between the HNP, orted, and tool components. 2. removes all wireup messaging during launch and shutdown. 3. maintains flow control for stdin to avoid large-scale consumption of memory by orteds when large input files are forwarded. This is done using an xon/xoff protocol. 4. enables specification of stdin recipients on the mpirun cmd line. Allowed options include rank, "all", or "none". Default is rank 0. 5. creates a new MPI_Info key "ompi_stdin_target" that supports the above options for child jobs. Default is "none". 6. adds a new tool "orte-iof" that can connect to a running mpirun and display the output. Cmd line options allow selection of any combination of stdout, stderr, and stddiag. Default is stdout. 7. adds a new mpirun and orte-iof cmd line option "tag-output" that will tag each line of output with process name and stream ident. For example, "[1,0]<stdout>this is output" This is not intended for the 1.3 release as it is a major change requiring considerable soak time. This commit was SVN r19767.	2008-10-18 00:00:49 +00:00
Ralph Castain	037231fbcb	MOdify the node_rank and local_rank fields to be uint16_t so we can handle more than 256 procs/node. Change the type to a defined one so that any future change can be easily done, if required. This commit was SVN r19637.	2008-09-25 13:39:08 +00:00
Brian Barrett	38eb301919	Follow-on to r19457. Rather than have #ifs in the middle of functions (which neither Ralph nor I liked), don't allow the functions we don't need to be visible. Still not happy about the number of #ifs in the code, but splitting the code further would have been a nightmare and this was a good cutting point. Also protected some variables that were declared but not instanced so that users would be notified at compile time instead of link or run time (in the case of dss constants) that things wouldn't work. This commit was SVN r19471. The following SVN revision numbers were found above: r19457 --> open-mpi/ompi@a15171e46b	2008-09-01 17:15:01 +00:00
Ralph Castain	3e55fe6f6d	Fold in the revised modex scheme. Move the ompi_proc_t modex portions to the RTE level since the daemons already have that info. Provide each process with the equivalent of a "nidmap" - both a map of what nodes are in the job, and a map of which node each process is on. This enables the use of static ports, though that hasn't been turned "on" in this commit. Update the rsh tree spawn capability so we spawn the next wave of daemons before launching our own local procs. Add an ability to encode nodenames for large clusters with contiguous node name numbering schemes - this allows communication of all node names in a few bytes instead of tens-of-bytes/node. This commit was SVN r18338.	2008-04-30 19:49:53 +00:00
Ralph Castain	ebea4d04e4	Remove defunct error constant - we no longer have a GPR that can hold corrupt data! This commit was SVN r17942.	2008-03-24 21:05:14 +00:00
Tim Prins	f9916811ae	Make it so we do not mangle the options the user passes to their executeable. Fixes trac:1124 The change also: - cleans up and simplifies the command line processing code - adds an error output if more than one hostfile passed for a single app context - gets rid of the superfluous orte_app_context_map_t type, and instead use a simple argv of -host options This commit was SVN r17750. The following Trac tickets were found above: Ticket 1124 --> https://svn.open-mpi.org/trac/ompi/ticket/1124	2008-03-05 22:12:27 +00:00
Ralph Castain	d70e2e8c2b	Merge the ORTE devel branch into the main trunk. Details of what this means will be circulated separately. Remains to be tested to ensure everything came over cleanly, so please continue to withhold commits a little longer This commit was SVN r17632.	2008-02-28 01:57:57 +00:00
Ethan Mallove	005652c9d4	* Embed ident strings into the Open MPI libraries using one of the following methods (in order of precedence): 1. #pragma ident <ident string> (e.g., Intel and Sun) 1. #ident <ident string> (e.g., GCC) 1. static const char ident[] = <ident string> (all others) By default, the ident string used is the standard Open MPI version string. Only the following libraries will get the embedded version strings (e.g., DSOs will not): * libmpi.so * libmpi_cxx.so * libmpi_f77.so * libopen-pal.so * libopen-rte.so * Added two new configure options: * `--with-package-name="STRING"` (defaults to "Open MPI username@hostname Distribution"). `STRING` is displayed by `ompi_info` next to the "Package" heading. * `--with-ident-string="STRING"` (defaults to the standard Open MPI version string - e.g., X.Y.Zr######). `%VERSION%` will expand to the Open MPI version string if it is supplied to this configure option. This commit was SVN r16644.	2007-11-03 02:40:22 +00:00
Brian Barrett	5b9fa7e998	reapply r15517 and r15520, which were removed in r15527 so that I could get the RML/OOB merge in slightly easier This commit was SVN r15530. The following SVN revision numbers were found above: r15517 --> open-mpi/ompi@41977fcc95 r15520 --> open-mpi/ompi@9cbc9df1b8 r15527 --> open-mpi/ompi@2d17dd9516	2007-07-20 02:34:29 +00:00
Brian Barrett	2d17dd9516	temporarily back our r15517 and 15520 so that I can get the RML / OOB changes to cleanly apply This commit was SVN r15527. The following SVN revision numbers were found above: r15517 --> open-mpi/ompi@41977fcc95	2007-07-20 01:10:34 +00:00
Ralph Castain	41977fcc95	Remove the cellid field from the orte_process_name_t structure. This only affects a handful of files in itself, but... Cleanup ALL instances of output involving the printing of orte_process_name_t structures using the ORTE_NAME_ARGS macro so that the number of fields and type of data match. Replace those values with a new macro/function pair ORTE_NAME_PRINT that outputs a string (using the new thread safe data capability) so that any future changes to the printing of those structures can be accomplished with a change to a single point. Note that I could not possibly find outputs that directly print the orte_process_name_t fields, but only dealt with those that used ORTE_NAME_ARGS. Hence, you may still have a few outputs that bark during compilation. Also, I could only verify those that fall within environments I can compile on, so other environments may yield some minor warnings. This commit was SVN r15517.	2007-07-19 20:56:46 +00:00
Ralph Castain	d109e9a6f4	Roll in the Voltaire core/socket/etc process mapping implementation. Only change I made was to cleanup some of the diagnostic output in the odls_default component so it uses the -mca odls_base_verbose parameter. You will not see any impact from this change unless you use the syntax described in ticket #1023. I've tried as many of the RAS components as possible and saw no problem - there may be issues with other RAS components that would not compile on any of my systems. Anything that appears should be trivial to fix. This commit was SVN r15427.	2007-07-14 15:14:07 +00:00
Ralph Castain	bd65f8ba88	Bring in an updated launch system for the orteds. This commit restores the ability to execute singletons and singleton comm_spawn, both in single node and multi-node environments. Short description: major changes include - 1. singletons now fork/exec a local daemon to manage their operations. 2. the orte daemon code now resides in libopen-rte 3. daemons no longer use the orte triggering system during startup. Instead, they directly call back to their parent pls component to report ready to operate. A base function to count the callbacks has been provided. I have modified all the pls components except xcpu and poe (don't understand either well enough to do it). Full functionality has been verified for rsh, SLURM, and TM systems. Compile has been verified for xgrid and gridengine. This commit was SVN r15390.	2007-07-12 19:53:18 +00:00
Ralph Castain	4fff584a68	Commit the orted-failed-to-start code. This correctly causes the system to detect the failure of an orted to start and allows the system to terminate all procs/orteds that did start. The primary change that underlies all this is in the OOB. Specifically, the problem in the code until now has been that the OOB attempts to resolve an address when we call the "send" to an unknown recipient. The OOB would then wait forever if that recipient never actually started (and hence, never reported back its OOB contact info). In the case of an orted that failed to start, we would correctly detect that the orted hadn't started, but then we would attempt to order all orteds (including the one that failed to start) to die. This would cause the OOB to "hang" the system. Unfortunately, revising how the OOB resolves addresses introduced a number of additional problems. Specifically, and most troublesome, was the fact that comm_spawn involved the immediate transmission of the rendezvous point from parent-to-child after the child was spawned. The current code used the OOB address resolution as a "barrier" - basically, the parent would attempt to send the info to the child, and then "hold" there until the child's contact info had arrived (meaning the child had started) and the send could be completed. Note that this also caused comm_spawn to "hang" the entire system if the child never started... The app-failed-to-start helped improve that behavior - this code provides additional relief. With this change, the OOB will return an ADDRESSEE_UNKNOWN error if you attempt to send to a recipient whose contact info isn't already in the OOB's hash tables. To resolve comm_spawn issues, we also now force the cross-sharing of connection info between parent and child jobs during spawn. Finally, to aid in setting triggers to the right values, we introduce the "arith" API for the GPR. This function allows you to atomically change the value in a registry location (either divide, multiply, add, or subtract) by the provided operand. It is equivalent to first fetching the value using a "get", then modifying it, and then putting the result back into the registry via a "put". This commit was SVN r14711.	2007-05-21 18:31:28 +00:00
Ralph Castain	75d51812a3	Fix the app-failed-to-start capability that was broken by r14554 (holding the caller in rmgr.spawn until the application - as opposed to just the orteds - have started). Allow the rmgr.spawn function to return if the app terminates, correctly handling its return status code to show abnormal termination. Modify orterun to correctly handle the returned status code so it doesn't enter a conditioned wait if the app fails to start since it will never wakeup if it does. This commit was SVN r14693. The following SVN revision numbers were found above: r14554 --> open-mpi/ompi@4510b42638	2007-05-18 13:29:11 +00:00
Sven Stork	a04c8eb39a	- Bring over the visibility feature, for a finer symbol export control via the visibility feature that is provided by some compilers. Per default this feature is disabled, to enable it you need to configure with --enable-visibility and obviously you need a compiler with visibility support. Please refer to the wiki for more information. https://svn.open-mpi.org/trac/ompi/wiki/Visibility This commit was SVN r14582.	2007-05-04 09:03:37 +00:00
Ralph Castain	9cd85ef55a	Add a few more error constants that will help provide more definitive output to the user This commit was SVN r14471.	2007-04-23 18:25:03 +00:00
Josh Hursey	dadca7da88	Merging in the jjhursey-ft-cr-stable branch (r13912 : HEAD). This merge adds Checkpoint/Restart support to Open MPI. The initial frameworks and components support a LAM/MPI-like implementation. This commit follows the risk assessment presented to the Open MPI core development group on Feb. 22, 2007. This commit closes trac:158 More details to follow. This commit was SVN r14051. The following SVN revisions from the original message are invalid or inconsistent and therefore were not cross-referenced: r13912 The following Trac tickets were found above: Ticket 158 --> https://svn.open-mpi.org/trac/ompi/ticket/158	2007-03-16 23:11:45 +00:00
Ralph Castain	8314e8dbb9	Modify the pernode option so it can accept a request for the number of processes to be launched. We now check three use-cases for pernode: 1. no -np provided - put one proc/node across all allocated nodes 2. -np N provided, N > #nodes - we print a pretty error message and exit 3. -np N provided, N <= #nodes - put one proc/node across N nodes I also added a new orte constant (ORTE_ERR_SILENT) that allows us to pass up the chain that an error was encountered, but NOT print ORTE_ERROR_LOG messages. This is intended to be used for cases where the error we encounter is NOT an orte error, but rather is one associated with incorrect user input (e.g., the preceding case 2). In such cases, there is no point in printing an ORTE_ERROR_LOG chain of messages as it isn't an orte error. This commit was SVN r12821.	2006-12-11 18:07:07 +00:00
Ralph Castain	6d6cebb4a7	Bring over the update to terminate orteds that are generated by a dynamic spawn such as comm_spawn. This introduces the concept of a job "family" - i.e., jobs that have a parent/child relationship. Comm_spawn'ed jobs have a parent (the one that spawned them). We track that relationship throughout the lineage - i.e., if a comm_spawned job in turn calls comm_spawn, then it has a parent (the one that spawned it) and a "root" job (the original job that started things). Accordingly, there are new APIs to the name service to support the ability to get a job's parent, root, immediate children, and all its descendants. In addition, the terminate_job, terminate_orted, and signal_job APIs for the PLS have been modified to accept attributes that define the extent of their actions. For example, doing a "terminate_job" with an attribute of ORTE_NS_INCLUDE_DESCENDANTS will terminate the given jobid AND all jobs that descended from it. I have tested this capability on a MacBook under rsh, Odin under SLURM, and LANL's Flash (bproc). It worked successfully on non-MPI jobs (both simple and including a spawn), and MPI jobs (again, both simple and with a spawn). This commit was SVN r12597.	2006-11-14 19:34:59 +00:00
Ralph Castain	13227e36ab	This commit looks a lot bigger than it is, so relax :-) Fix the problem observed by multiple people that comm_spawned children were (once again) being mapped onto the same nodes as their parents. This was caused by going through the RAS a second time, thus overwriting the mapper's bookkeeping that told RMAPS where it had left off. To solve this - and to continue moving forward on the ORTE development - we introduce the concept of attributes to control the behavior of the RM frameworks. I defined the attributes and a list of attributes as new ORTE data types to make it easier for people to pass them around (since they are now fundamental to the system, and therefore we will be packing and unpacking them frequently). Thus, all the functions to manipulate attributes can be implemented and debugged in one place. I used those capabilities in two places: 1. Added an attribute list to the rmgr.spawn interface. 2. Added an attribute list to the ras.allocate interface. At the moment, the only attribute I modified the various RAS components to recognize is the USE_PARENT_ALLOCATION one (as defined in rmgr_types.h). So the RAS components now know how to reuse an allocation. I have debugged this under rsh, but it now needs to be tested on a wider set of platforms. This commit was SVN r12138.	2006-10-17 16:06:17 +00:00
George Bosilca	b2b2ade6c9	Remove the last underscore. Look like a typo ... This commit was SVN r12006.	2006-10-05 05:41:57 +00:00
Ralph Castain	cd7d87aa7b	Define the map data types for dss compatibility. Setup to debug bproc This commit was SVN r11955.	2006-10-03 17:40:00 +00:00
Ralph Castain	37dfdb76eb	Here is the major MAD-cure commit. I have written plenty about it, so I refer you here to those messages for a description of everything that was done. This commit was SVN r11661.	2006-09-14 21:29:51 +00:00
George Bosilca	07b8d3c72c	On Windows we can now deliver Open MPI on several flavors: - everything statically built (dynamically opened). - OPAL, ORTE and OMPI static libraries and all the components as dynamic files(DLL). - everything as dynamic files (DLL). This commit was SVN r11461.	2006-08-28 04:19:42 +00:00

1 2

76 Коммитов