openmpi

Author	SHA1	Message	Date
Ralph Castain	26cfac94e6	Fix a formatting problem with xml output of map This commit was SVN r18976.	2008-07-22 13:14:02 +00:00
Ralph Castain	a4f0fa6e3a	Update the routed framework to: 1. add a new API delete_route(orte_process_name_t*) to delete the specified proc from the routing table 2. modify update_route so that it actually updates pre-existing routes instead of only adding routing info the end of the hash table This fixes ticket #1403 This commit was SVN r18970.	2008-07-21 21:37:09 +00:00
Jeff Squyres	54dbd95243	Fix some component version numbers to be the same as the OMPI release This commit was SVN r18965.	2008-07-21 20:05:29 +00:00
Ralph Castain	3137ed9255	Update the manpages for comm_spawn(_multiple) - add man page to explain host/hostfile behavior This commit was SVN r18961.	2008-07-21 17:58:12 +00:00
George Bosilca	bcac9a0540	Remove a warning about using map when it is not initialized. This commit was SVN r18957.	2008-07-21 14:35:05 +00:00
Jeff Squyres	750ea30961	So apparently my clever fix in r18873 was not good -- apparently, we can have a pub_endpoint and a sub_endpoint that are not equal but go to the same place (fd). I didn't think that that was possible. :-\ So just use a bool to track whether we have forwarded the fragment at all; if we have, then don't forward to the sub_endpoint. IOF is going to be re-written for v1.4. This commit was SVN r18950. The following SVN revision numbers were found above: r18873 --> open-mpi/ompi@773c92a6eb	2008-07-18 20:04:26 +00:00
Ralph Castain	6135943382	Update the paffinity call in the ODLS so we retrieve the number of processors on the local node, thus allowing us to correctly set the sched_yield parameter. This commit was SVN r18946.	2008-07-18 19:19:16 +00:00
Ralph Castain	09db4c3a60	Add target tag to diagnostic output This commit was SVN r18918.	2008-07-16 01:57:01 +00:00
Ralph Castain	07841808ee	Add some debugging - provide a gentler abort when routes cannot be found This commit was SVN r18915.	2008-07-15 15:48:46 +00:00
Rolf vandeVaart	84ede78e40	Clarify use of --path option to mpirun. This commit fixes trac:1231. This commit was SVN r18909. The following Trac tickets were found above: Ticket 1231 --> https://svn.open-mpi.org/trac/ompi/ticket/1231	2008-07-14 21:27:23 +00:00
Lenny Verkhovsky	a812324963	Fixing "paffinity_base_slot_list" environment This commit was SVN r18900.	2008-07-14 07:10:50 +00:00
Ralph Castain	b1f367563c	Few minor code cleanups This commit was SVN r18890.	2008-07-11 15:40:41 +00:00
Ralph Castain	58964b2bf8	Make lsf support compile This commit was SVN r18889.	2008-07-11 15:40:25 +00:00
Jeff Squyres	583bf425c0	Fixes trac:1383: Short version: remove opal_paffinity_alone and restore mpi_paffinity_alone. ORTE makes various information available for the MPI layer to decide what it wants to do in terms of processor affinity. Details: * remove opal_paffinity_alone MCA param; restore mpi_paffinity_alone MCA param * move opal_paffinity_slot_list param registration to paffinity base * ompi_mpi_init() calls opal_paffinity_base_slot_list_set(); if that succeeds use that. If no slot list was set, see if mpi_paffinity_alone was set. If so, bind this process to its Node Local Rank (NLR). The NLR is the ORTE-maintained slot ID; if you COMM_SPAWN to a host in this ORTE universe that already has procs on it, the NLR for the new job will start at N (not 0). So this is slightly better than mpi_paffinity_alone in the v1.2 series. * If a slot list is specified and mpi_paffinity_alone is set, we display an error and abort. * Remove calls from rmaps/rank_file component to register and lookup opal_paffinity mca params. * Remove code in orte/odls that set affinities - instead, have them just pass a slot_list if it exists. * Cleanup the orte/odls code that determined oversubscribed/want_processor as these were just opposites of each other. This commit was SVN r18874. The following Trac tickets were found above: Ticket 1383 --> https://svn.open-mpi.org/trac/ompi/ticket/1383	2008-07-10 21:12:45 +00:00
Jeff Squyres	773c92a6eb	Fixes trac:1135. Short version: when the HNP launches VPID 0 on the same node as itself, the STDIN IOF endpoint will have both a pub and a sub on it. We need to ensure to only forward incoming messages ''once'' (not twice, as was happening). A lengthy comment in the code explains in more detail. This commit was SVN r18873. The following Trac tickets were found above: Ticket 1135 --> https://svn.open-mpi.org/trac/ompi/ticket/1135	2008-07-10 18:18:56 +00:00
Lenny Verkhovsky	30f0b33274	Priority of rmaps_rank_file_component changed to 100 when it is selected This commit was SVN r18866.	2008-07-10 13:57:40 +00:00
Shiqing Fan	67a842fd17	Too many actual parameters for this function, remove the wrong one in order to get rid of the compiler warnings. This commit was SVN r18863.	2008-07-10 08:06:53 +00:00
Ralph Castain	1edcfa970e	Remove debug This commit was SVN r18845.	2008-07-09 00:12:11 +00:00
Pak Lui	924bface15	The plm env var should be set to the name of a current plm module, which is rsh. This commit was SVN r18844.	2008-07-08 23:15:52 +00:00
Ralph Castain	8e3658b320	Remove the nodename:pid prefix from show_help output so it doesn't disrupt the formatted output This commit was SVN r18843.	2008-07-08 22:57:50 +00:00
Ralph Castain	f1114b4144	Upgrade the ability of orterun to deal with cmd line MCA params that are passed to the orteds. Help reduce the size of the cmd line by eliminating duplicates where possible, and alert to duplicate entries that can cause problems. Add comments to both orterun and orted code explaining why we take a snapshot of the local environment and apply it to the local procs when they are spawned. This commit was SVN r18842.	2008-07-08 22:36:39 +00:00
Ralph Castain	51da9f2980	Properly ensure that cmd line MCA params override environmental MCA params for procs local to mpirun. Actually, the problem was that we were simply -adding- any enviro MCA params to whatever had been found on the cmd line. Thus, duplicate MCA param directives were winding up duplicated in the environment. Some shells took the first one in the environ array - others took the last! So we could get completely different behavior based on the whims of the shell. This commit fixes trac:1373 This commit was SVN r18836. The following Trac tickets were found above: Ticket 1373 --> https://svn.open-mpi.org/trac/ompi/ticket/1373	2008-07-08 13:48:47 +00:00
Lenny Verkhovsky	23da11fdcc	add some documentation for mpirun's new --loadbalance option closing #1277 This commit was SVN r18832.	2008-07-08 07:56:54 +00:00
Josh Hursey	0071ed8961	Fix broken C/R build resulting from r18804 Will patch v1.3 branch shortly. This commit was SVN r18814. The following SVN revision numbers were found above: r18804 --> open-mpi/ompi@ba5498cdc6	2008-07-07 14:55:29 +00:00
Lenny Verkhovsky	489c22b6b1	Added -rf|--rankfile <arg0> option to mpirun -h + man info This commit was SVN r18812.	2008-07-07 13:46:22 +00:00
Lenny Verkhovsky	2fae770a86	information about carto framework in mpirun man page, ticket #1372 This commit was SVN r18809.	2008-07-07 12:02:10 +00:00
Ralph Castain	ba5498cdc6	Repair the MPI-2 dynamic operations. This includes: 1. repair of the linear and direct routed modules 2. repair of the ompi/pubsub/orte module to correctly init routes to the ompi-server, and correctly handle failure to correctly parse the provided ompi-server URI 3. modification of orterun to accept both "file" and "FILE" for designating where the ompi-server URI is to be found - purely a convenience feature 4. resolution of a message ordering problem during the connect/accept handshake that allowed the "send-first" proc to attempt to send to the "recv-first" proc before the HNP had actually updated its routes. Let this be a further reminder to all - message ordering is NOT guaranteed in the OOB 5. Repair the ompi/dpm/orte module to correctly init routes during connect/accept. Reminder to all: messages sent to procs in another job family (i.e., started by a different mpirun) are ALWAYS routed through the respective HNPs. As per the comments in orte/routed, this is REQUIRED to maintain connect/accept (where only the root proc on each side is capable of init'ing the routes), allow communication between mpirun's using different routing modules, and to minimize connections on tools such as ompi-server. It is all taken care of "under the covers" by the OOB to ensure that a route back to the sender is maintained, even when the different mpirun's are using different routed modules. 6. corrections in the orte/odls to ensure proper identification of daemons participating in a dynamic launch 7. corrections in build/nidmap to support update of an existing nidmap during dynamic launch 8. corrected implementation of the update_arch function in the ESS, along with consolidation of a number of ESS operations into base functions for easier maintenance. The ability to support info from multiple jobs was added, although we don't currently do so - this will come later to support further fault recovery strategies 9. minor updates to several functions to remove unnecessary and/or no longer used variables and envar's, add some debugging output, etc. 10. addition of a new macro ORTE_PROC_IS_DAEMON that resolves to true if the provided proc is a daemon There is still more cleanup to be done for efficiency, but this at least works. Tested on single-node Mac, multi-node SLURM via odin. Tests included connect/accept, publish/lookup/unpublish, comm_spawn, comm_spawn_multiple, and singleton comm_spawn. Fixes ticket #1256 This commit was SVN r18804.	2008-07-03 17:53:37 +00:00
Shiqing Fan	5d0f4dc88d	- Clean up the unreferenced variables. - Change the arguments for launch failed function according to changeset r18611. This commit was SVN r18795. The following SVN revision numbers were found above: r18611 --> open-mpi/ompi@7bee71aa59	2008-07-03 10:11:08 +00:00
George Bosilca	07cb54995b	Reactivate the daemon spin from the command line. This commit was SVN r18794.	2008-07-02 01:46:58 +00:00
Shiqing Fan	a3e1718126	Missing one argument for calling this function. This commit was SVN r18793.	2008-07-01 18:01:22 +00:00
Lenny Verkhovsky	c143c95ff9	Partial rankfile slots allocation fix This commit was SVN r18787.	2008-07-01 08:54:20 +00:00
Ralph Castain	6f85e34d66	Detect homo/hetero scenarios in the nidmap, setup to take appropriate actions in the basic grpcomm module. NOT for inclusion in v1.3 This commit was SVN r18786.	2008-07-01 02:44:57 +00:00
Ralph Castain	bbaf000db2	Singletons need to construct their own nidmap and cannot use the std function in the base This commit was SVN r18777.	2008-06-30 13:28:56 +00:00
Ralph Castain	158040cf3b	First step: be kind to Jeff's disk space - let's abort without dumping core files all over the place This commit was SVN r18751.	2008-06-26 16:10:03 +00:00
Ralph Castain	9cebe0ca96	Ckpt the bproc support. All compiles now except for PLM module This commit was SVN r18744.	2008-06-26 03:48:22 +00:00
Brian Barrett	cbd6749c22	Move the lock initialization back to orte_init so that the finalize lock is properly initialized and available in all cases (like ompi_info, where the ess is never actually initialized). Fixes trac:1364. This commit was SVN r18733. The following Trac tickets were found above: Ticket 1364 --> https://svn.open-mpi.org/trac/ompi/ticket/1364	2008-06-25 03:18:37 +00:00
Ralph Castain	578d1c15c6	Allow the ESS to return the hostname and arch for a specified daemon instead of just for application procs. Uses the same API - just need to detect that the specified proc is a daemon and lookup its corresponding node in the nidmap. This commit was SVN r18722.	2008-06-24 17:53:10 +00:00
Ralph Castain	b118779c08	It is okay for us to init the ORTE mca params multiple times. Indeed, it is absolutely required by orterun as the first time has to be done prior to parsing the command line, which means that the mca values haven't been parsed yet! Add ability for sys admins to prohibit putting session directories under specified locations. Thus, they can now protect parallel file systems from foolish user mistakes. This commit was SVN r18721.	2008-06-24 17:50:56 +00:00
Ralph Castain	17fcd72b5d	Restore bproc code - if someone wants to maintain it, then more power to them...but it would definitely be easier if the old code is in the trunk. This is all .ompi_ignore'd except for me so I can play with making it compile again in my copious free time. This commit was SVN r18716.	2008-06-24 01:27:22 +00:00
Ralph Castain	3e61a3f92e	Sandbox for next-gen launch This commit was SVN r18715.	2008-06-24 01:25:51 +00:00
Ralph Castain	f799ea225f	Orterun creates a "clean" copy of its environment for use in launching procs. This includes properly setting LD_LIBRARY_PATH and PATH, among other things. Unfortunately, our PLM modules were using the local environ instead of the saved copy, thus missing a number of things that really should have been included. From what I see, we got away with the error because the PLM's were duplicating all that setup logic themselves - I'll clean this up over the next few days. Meantime, correct the PLM's so they use the correct environ for launching. This commit was SVN r18713.	2008-06-23 22:39:36 +00:00
Ralph Castain	0fa9d88009	Set $PWD for the application proc to match the cwd. If the user specifies a working dir via -wdir, this ensures that the enviro variable matches what they get from getcwd. Note that any subsequent calls to chdir in the user's program will break that equivalence - we can only ensure it starts out matching! This commit was SVN r18709.	2008-06-23 18:25:41 +00:00
Ralph Castain	acbcbb81b5	Add some debugging output to the modex set_proc_attr function to see what is being added to the modex This commit was SVN r18708.	2008-06-23 18:24:08 +00:00
Jeff Squyres	24c3aa1d77	Really fix "make dist". Really. This commit was SVN r18704.	2008-06-21 18:04:38 +00:00
Jeff Squyres	930667ac73	Ensure that orte-checkpoint and orte-restart man pages are always included in the distribution tarball. This ''appears'' to be an Automake bug -- I have submitted a bug report to the bug-automake list: http://lists.gnu.org/archive/html/bug-automake/2008-06/msg00019.html This commit was SVN r18696.	2008-06-20 18:19:01 +00:00
Ralph Castain	c693d3a5d1	I hadn't honestly considered before that an MPI process might attempt to call functions in the routed framework intended solely for daemons and HNPs. By design, MPI processes are not allowed to route RML/OOB messages, and hence the routed module in an MPI process has no knowledge whatsoever of how a message will reach its destination (except in the direct module). Thus, it has no way to return a valid routing tree, update a routing tree, or get wireup info. This commit ensures that attempts to access information that is unknowable or undefined returns appropriate invalid or not_supported values to avoid unexpected behavior and/or segfaults. This commit was SVN r18692.	2008-06-20 03:26:13 +00:00
Ralph Castain	5ebe10ebf1	Fix a bad typo - need to look at the node array as the arch array hasn't been built yet This commit was SVN r18689.	2008-06-19 21:34:39 +00:00
Ralph Castain	174b9f1482	Ensure this module works in heterogeneous environments. Note: this module is under development, which is why it is not set as the default. Use at your own risk! This commit was SVN r18688.	2008-06-19 19:40:47 +00:00
Ralph Castain	26c9ad5799	Clean-up the DSS API to remove two functions that are supposed to be used solely internally to the DSS. These were likely exposed because we need to call them when packing/unpacking declared types, but this means that developers may accidentally use the wrong functions, causing the DSS buffer to get confused. Instead, return the system to the way it used to work and hide those functions. This commit was SVN r18684.	2008-06-19 18:46:25 +00:00
Josh Hursey	b78ae13bf3	Add back a missing header taken away in r18664 This commit was SVN r18682. The following SVN revision numbers were found above: r18664 --> open-mpi/ompi@0532d799d6	2008-06-19 16:08:27 +00:00
Jeff Squyres	e4172a3c44	Shift the AM "if" logic down from orte/tools/Makefile.am to the individual orte/tools/*/Makefile.am files. This causes "make" to traverse into every directory, even if it's not going to build anything in that directory (which is a good thing). It also helps cleanup and dist issues. This also affects orte-checkpoint and orte-restart, but I couldn't get --with-ft to compile properly; I'll pass along a heads-up to Josh to ensure that I didn't break anything. This commit was SVN r18680.	2008-06-19 14:46:10 +00:00
Ralph Castain	571f483c39	Ensure that we don't breakpoint the debugger until -after- all procs have reported their contact info so we can successfully send the release message This commit was SVN r18678.	2008-06-19 14:37:46 +00:00
Ralph Castain	3b5e80fa61	Shift responsibility for preconnecting the oob to the orte routed framework, which is the only place that knows what needs to be done. Only the direct module will actually do anything - it uses the same algo as the original preconnect function. This commit was SVN r18677.	2008-06-19 13:48:26 +00:00
Jeff Squyres	7e45b24001	MPIR_being_debugged is an int, not a bool. This commit was SVN r18676.	2008-06-19 13:31:34 +00:00
Ralph Castain	b56f8ced4f	Ensure params are registered prior to parsing global cmd line options in orterun so that debugger options are properly captured and acted upon. Ensure that routes to remote procs are set on the HNP before completing launch so that the debugger message can be sent. Solves a race condition that can exist in those environments where the HNP does not have local procs. This commit was SVN r18674.	2008-06-19 02:58:14 +00:00
Ralph Castain	955d117f5e	Add a new grpcomm module that mimics the old 1.2 behavior - it -always- does a modex because it always includes the architecture. Hence, we called it "blind-and-dumb" since it doesn't look to see if this is required - moniker of "bad". :-) Update the ESS API so we can update the stored arch's should the modex include that info. Update ompi/proc to check/set the arch for remote procs, and add that function call to mpi_init right after the modex is done. Setup to allow other grpcomm modules to decide whether or not to add the arch to the modex, and to detect if other entries have been made. If not, then the modex can just fall through. Begin setting up some logic in the "basic" module to handle different arch situations. For now, default to the "bad" module so we will work in all situations, even though we may be sending around more info than we really require. This fixes ticket #1340 This commit was SVN r18673.	2008-06-18 22:17:53 +00:00
Ralph Castain	282a220e7e	Update the debugger interface per email thread with Jeff and Brian. Handoff to them for final test and validation This commit was SVN r18670.	2008-06-18 15:28:46 +00:00
George Bosilca	8e7c35e76c	These symbols are only available via the module/component structure, so they don't have to be globally visible. This commit was SVN r18666.	2008-06-18 08:20:02 +00:00
George Bosilca	0f9b9c0aff	Remove a warning and add a required header (otherwise we cannot compile when --disable-debug is specified). This commit was SVN r18665.	2008-06-18 08:10:02 +00:00
Ralph Castain	0532d799d6	Complete implementation of the --without-rte-support configure option. Working with Brian, this has been tested on RedStorm. Some minor changes to help facilitate debugger support so that both mpirun and yod can operate with it. Still to be completed. This commit was SVN r18664.	2008-06-18 03:15:56 +00:00
Brian Barrett	7712b07ac4	Add perl based wrapper compilers for cross-compile environments. The default is still to use the C based wrapper compilers (which have many more features and are more well tested). The Perl compilers are enabled with the option --enable-script-wrapper-compilers, which also ignores the option --disable-binaries (ie --enable-script-wrapper-compilers --disable-binaries will result in perl-based wrapper compilers being installed, but no other binaries being installed). This commit was SVN r18655.	2008-06-13 22:52:25 +00:00
Ralph Castain	a87aa442e3	Remove last remaining reference to iof_flush - it was #if'd out anyway. The existing flush code appears to have several critical problems. Given the impending rework of the IOF subsystem, there is no point in trying to fix it here. This commit was SVN r18649.	2008-06-11 16:25:46 +00:00
Ralph Castain	1f41069ac9	Fix CID 752 - if we can't find the daemon job object, we have to ensure we exit without attempting to dereference it This commit was SVN r18647.	2008-06-11 14:49:58 +00:00
Ralph Castain	13ea4e4673	Be consistent - since we don't strdup the other values for param, don't strdup this one. This allows r18645 to fix the memory corruption issue, but also allows us to resolve the memory leaks cited by CID 1039 This commit was SVN r18646. The following SVN revision numbers were found above: r18645 --> open-mpi/ompi@53d83ba1c5	2008-06-11 14:42:47 +00:00
Pak Lui	53d83ba1c5	Take out a couple of free's. This commit fixes trac:1343 This commit was SVN r18645. The following Trac tickets were found above: Ticket 1343 --> https://svn.open-mpi.org/trac/ompi/ticket/1343	2008-06-11 14:02:49 +00:00
Ralph Castain	d61fe87d04	Use the opal_show_help system if orte_show_help has not been initialized This fixes ticket #1342 This commit was SVN r18644.	2008-06-11 12:50:40 +00:00
Ralph Castain	f9d809748c	Glad someone found that last error - caused me to review the code and find a couple of other cleanups! Nothing major, but just ensure that things flow smoothly since we had a "shadowed" variable. This commit was SVN r18643.	2008-06-10 19:15:59 +00:00
Camille Coti	67cd1849f7	*map was still NULL in the else statement, inducing a segmentation fault when a field of the structure was accessed. This commit was SVN r18642.	2008-06-10 19:00:57 +00:00
Ralph Castain	1a422995ae	Fix two Coverity complaints CID 813 (value defined and not used) and 1039 (resource leak). While doing so, found and fixed another less obvious memory leak. This commit was SVN r18641.	2008-06-10 17:53:28 +00:00
Brian Barrett	4127bd0dcc	fix two other mistakes in the cnos ess This commit was SVN r18632.	2008-06-09 22:28:26 +00:00
George Bosilca	f72ab90b16	Allow xgrid to compile again. This commit was SVN r18631.	2008-06-09 21:51:41 +00:00
Brian Barrett	11cd3a7cba	Fix problem where local rank always had different architecture than remote ranks on Red Storm This commit was SVN r18630.	2008-06-09 21:46:03 +00:00
Ralph Castain	8d9ff44134	Add visibility required for some environments and configs This commit was SVN r18629.	2008-06-09 21:28:19 +00:00
Ralph Castain	03ab4f5c64	Make the ifdef name mirror the change in filename This commit was SVN r18626.	2008-06-09 20:36:55 +00:00
Ralph Castain	c13cadc3c7	Refs trac:1255 This commit repairs the debugger initialization procedure. I am not closing the ticket, however, pending Jeff's review of how it interfaces to the ompi_debugger code he implemented. There were duplicate symbols being created in that code, but not used anywhere. I replaced them with the ORTE-created symbols instead. However, since they aren't used anywhere, I have no way of checking to ensure I didn't break something. So the ticket can be checked by Jeff when he returns from vacation... :-) This commit was SVN r18625. The following Trac tickets were found above: Ticket 1255 --> https://svn.open-mpi.org/trac/ompi/ticket/1255	2008-06-09 20:34:14 +00:00
Ralph Castain	2cc8b2c51f	Add yet another test, this one for proper error behavior when someone calls an MPI function after calling MPI_Finalize. Add a minor debug that outputs the orterun exit status to stderr when orte_debug is set. This commit was SVN r18622.	2008-06-09 19:21:20 +00:00
Ralph Castain	bf5c34d10a	The rsh launcher is one place where multi-word MCA params would have to be passed via the orted cmd line. In such a case, we have to explicitly include quote marks about the param value. Add that capability here. This commit fixes trac:1200 This commit was SVN r18621. The following Trac tickets were found above: Ticket 1200 --> https://svn.open-mpi.org/trac/ompi/ticket/1200	2008-06-09 19:07:19 +00:00
Ralph Castain	11692ca98e	Update tests to flag that these are non-MPI apps This commit was SVN r18620.	2008-06-09 18:48:21 +00:00
Ralph Castain	9613b3176c	Effectively revert the orte_output system and return to direct use of opal_output at all levels. Retain the orte_show_help subsystem to allow aggregation of show_help messages at the HNP. After much work by Jeff and myself, and quite a lot of discussion, it has become clear that we simply cannot resolve the infinite loops caused by RML-involved subsystems calling orte_output. The original rationale for the change to orte_output has also been reduced by shifting the output of XML-formatted vs human readable messages to an alternative approach. I have globally replaced the orte_output/ORTE_OUTPUT calls in the code base, as well as the corresponding .h file name. I have test compiled and run this on the various environments within my reach, so hopefully this will prove minimally disruptive. This commit was SVN r18619.	2008-06-09 14:53:58 +00:00
Ralph Castain	83dd3d8c6f	Restore the ability to forcibly terminate by providing multiple ctrl-c's This commit was SVN r18618.	2008-06-09 13:08:54 +00:00
Pak Lui	caac0e0182	Add in a couple missing ones from r18611 for all tm users out there... This commit was SVN r18615. The following SVN revision numbers were found above: r18611 --> open-mpi/ompi@7bee71aa59	2008-06-06 22:53:43 +00:00
Ralph Castain	b65eb54ea2	Cut out a new iof pull - that capability isn't ready yet for the trunk, but will be coming shortly Thanks to Pak for letting me know... This commit was SVN r18614.	2008-06-06 21:24:15 +00:00
Pak Lui	7f7777a538	Check for NULL in prefix_dir. This commit fixes trac:1337. This commit was SVN r18612. The following Trac tickets were found above: Ticket 1337 --> https://svn.open-mpi.org/trac/ompi/ticket/1337	2008-06-06 19:55:01 +00:00
Ralph Castain	7bee71aa59	Fix a potential, albeit perhaps esoteric, race condition that can occur for fast HNP's, slow orteds, and fast apps. Under those conditions, it is possible for the orted to be caught in its original send of contact info back to the HNP, and thus for the progress stack never to recover back to a high level. In those circumstances, the orted can "hang" when trying to exit. Add a new function to opal_progress that tells us our recursion depth to support that solution. Yes, I know this sounds picky, but good ol' Jeff managed to make it happen by driving his cluster near to death... Also ensure that we declare "failed" for the daemon job when daemons fail instead of the application job. This is important so that orte knows that it cannot use xcast to tell daemons to "exit", nor should it expect all daemons to respond. Otherwise, it is possible to hang. After lots of testing, decide to default (again) to slurm detecting failed orteds. This proved necessary to avoid rather annoying hangs that were difficult to recover from. There are conditions where slurm will fail to launch all daemons (slurm folks are working on it), and yet again, good ol' Jeff managed to find both of them. Thank you Jeff! :-/ This commit was SVN r18611.	2008-06-06 19:36:27 +00:00
Josh Hursey	1de50b523c	Fix some Coverity 'Event set_but_not_used' highlights. Thanks to Jeff for bringing them to my attention. This commit was SVN r18606.	2008-06-06 14:38:41 +00:00
Jeff Squyres	d3795d7a34	Fix CID 987: remove unused variable. This commit was SVN r18598.	2008-06-05 20:17:02 +00:00
Ralph Castain	332e6c89ab	Modify the slurm launcher so that the kill-on-bad-exit behavior is not "on" by default. Instead, only turn it "on" if the plm_slurm_detect_failure mca param is set to something non-zero This commit was SVN r18588.	2008-06-04 23:59:53 +00:00
Ralph Castain	0da811ce79	Initial work on xml support - allocation and job map outputs completed. More to come. This commit was SVN r18587.	2008-06-04 20:53:12 +00:00
Ralph Castain	ca91ec525b	Add a suffix to the opal_output stream descriptor object - we can now output both a prefix and a suffix for a given stream. Default the suffix to NULL. Remove lingering references to a filtering system as this will no longer be implemented. This commit was SVN r18586.	2008-06-04 20:52:20 +00:00
Josh Hursey	78f14b5255	Fix the none.checkpoint command. orte-checkpoint/orte-restart do not seem to totally like orte_output so revert them to opal_output for now. Since we have no need for the additional complexity of orte_output we can drop it for now and revisit this if anyone needs it later. It seems that if you set the verbose level on an output handle then try to call a normal orte_output() on it then the message will not be printed. This is the same for opal_output, and seems incorrect to me because it stops some error messages from being printed out if you do not directly specify opal_output(0, ...). Maybe someone should take a look at this. orte-checkpoint would segv if passed an incorrect PID. Fixed the return code so it errors out properly. Thanks to Eric Roman for bringing this to my attention. This commit was SVN r18583.	2008-06-04 14:44:11 +00:00
Jeff Squyres	75a97ebbf0	Many thanks to Ralf W. for finding a subtle bug in these Makefile.am's that can sometimes cause problems with "make -j [N>1] install". Ensure to make the target directory before we copy stuff into it -- read the thread starting here for more details: http://www.open-mpi.org/community/lists/devel/2008/06/4080.php This commit was SVN r18570.	2008-06-04 01:28:03 +00:00
Ralph Castain	8ce4b64b5a	Ensure we don't go past the end of the array This commit was SVN r18569.	2008-06-03 21:31:02 +00:00
George Bosilca	25ae9c12e6	Silence few warnings. This commit was SVN r18568.	2008-06-03 19:58:40 +00:00
George Bosilca	fa89d299bf	Silence the Obj-C compiler. This commit was SVN r18567.	2008-06-03 19:24:17 +00:00
Jeff Squyres	a4db97c213	More man pages fixes from our Debian Open MPI package maintainer friends. Woo hoo! This commit was SVN r18559.	2008-06-03 16:44:40 +00:00
Ralph Castain	c992e99035	Remove the tags from orte_output_open and the filtering operation from orte_output - this will be handled differently to improve the XML output interface This commit was SVN r18557.	2008-06-03 14:24:01 +00:00
Ralph Castain	95578b0528	Fix single-node operations so that the HNP correctly exits when the job completes This commit was SVN r18556.	2008-06-03 14:23:04 +00:00
Ralph Castain	b456fb2d42	Upgrade the node/orted failure detection code to cover all environments. Use the native environment's capabilities where possible - e.g., SLURM detects orted failure and can report it. Elsewhere, use a heartbeat system to detect orted failure - e.g., for TM and rsh. Heart rate is set via mca param. The HNP checks for callback every 2*heartrate, declares orted failure if not seen in last 2*heartrate time. Also detect orted failed-to-start by setting timeout on launch. Currently only used in TM launcher. Neither detection is enabled by default, but are only active if heartrate is set and/or launch timeout is set. Exception for SLURM as orted failure is always detected and reported. More info to come on devel list. This commit was SVN r18555.	2008-06-02 21:46:34 +00:00
Shiqing Fan	af656b2b3d	Fix some typing mistakes, make the sources compile again for Windows Visual Studio. This commit was SVN r18542.	2008-05-29 15:27:43 +00:00
Ralph Castain	2772221ae0	Add the -xml option to mpirun to indicate that xml output is desired This commit was SVN r18539.	2008-05-29 14:11:31 +00:00
Ralph Castain	2b28bef15a	Provide a "nicer" indication that we don't know the pid of the failed orted This commit was SVN r18538.	2008-05-29 14:10:58 +00:00
Ralph Castain	72530f8fed	Cleanly handle the failed start of an orted, or its unexpected failure after start. This commit will allow mpirun to exit cleanly when this occurs, and does a best-effort attempt to cleanup the mess. However, it still has two unresolved issues that need to be eventually addressed: 1. it depends upon the ability of the native environment to alert us that the orted has died/failed to start. I have included that support for SLURM, but other environments need to be done. 2. for some yet-to-be-determined reason, the message that tells the remaining daemons to "die" isn't getting out of the RML, even though no obvious blockage is standing in the way. Work will continue on resolving that problem. For now, the orteds appear to be exiting on their own quite nicely when they see their HNP "lifeline" disappear. This represents the best-available fix for ticket #221 so I am closing that ticket at this time. This commit was SVN r18536.	2008-05-29 13:38:27 +00:00
Ralph Castain	52fb773c6c	Tell slurm to kill the job if an orted abnormally exits This commit was SVN r18535.	2008-05-29 12:26:58 +00:00
Ralph Castain	b2a566b610	Break an infinite loop in orte_output caused by debugging of ORTE comm subsystems This commit was SVN r18534.	2008-05-29 12:21:17 +00:00
Ralph Castain	e5e542ddcf	Clarify an error message This commit was SVN r18533.	2008-05-29 12:20:24 +00:00
Josh Hursey	4ac7016200	Make sure to check "opal_list_get_last" instead of "opal_list_get_end". The former will return a valid item in the list, the latter will return an invalid item that marks the end of the list. It was happening that when oversubscribing by way of an appfile we would cause a segv because we tried to interpret the invalid item returned by "opal_list_get_end" instead of a valid item. We would then try to write to unallocated memory. This commit fixes trac:1279 This commit was SVN r18529. The following Trac tickets were found above: Ticket 1279 --> https://svn.open-mpi.org/trac/ompi/ticket/1279	2008-05-28 19:37:20 +00:00
Ralph Castain	f76240e7cc	Modify the nidmap utility to pass daemon vpids for nodes. In some mapping algo's, it is possible for nodes to be skipped. This results in daemon vpids that differ from the index of their respective node in the node array, causing the daemon to not recognize procs that it is supposed to launch. This commit was SVN r18528.	2008-05-28 18:38:47 +00:00
Ralph Castain	347752e40b	One more corner case (caught by Jeff) occurs when someone requests usage help - e.g., with mpirun --help. In this case, we do a show_help prior to orte_init, but it is okay to do so. To allow this, we let show_help just operate correctly without any warning about pre-orte_output_init. This commit was SVN r18525.	2008-05-28 13:58:03 +00:00
Ralph Castain	828ae26d90	ORTE-level MCA params are defined in several places. Ompi_info cannot call orte_init due to an issue with the memory allocator, thus making it impossible for ompi_info to display all of the ORTE-level MCA params. By consolidating them all into one function, ompi_info can call that function and register the desired variables. This also requires, however, that ompi_info call orte_output_init to avoid generating tons of error messages, so make that adjustment too. Fixes ticket #1314. In addition, orte_output has a race condition issue whereby calls to orte_output/verbose can occur prior to either the RML being defined/setup, or the HNP being defined. This latter occurs during the initialization of the orte_process_info structure. In both cases, there is no way orte_output can send the output to the HNP. Hence, the message must be simply output locally. Fixes ticket #1315. This commit was SVN r18524.	2008-05-28 13:29:58 +00:00
Ralph Castain	5a2992dea2	After some discussion with Jeff, we have determined that the only time orte_output functions can be accessed prior to calling orte_output_init is if someone calls them prior to calling orte_init. This is clearly an error, so we now report that fact to the caller so it can be fixed. It is still possible that someone can call an orte_output function during orte_finalize - this is not an error. Prior commits ensured that this is correctly handled. This commit only deals with improper calls prior to calling orte_init. This commit was SVN r18513.	2008-05-27 20:13:04 +00:00
Ralph Castain	be193ead83	Ensure that any use of orte_output prior to calling orte_output_init gets a properly initiated stream tracking object to avoid later segfaults This commit was SVN r18512.	2008-05-27 19:07:31 +00:00
George Bosilca	1eb1742225	Remove this left over dependency. This commit was SVN r18508.	2008-05-27 16:57:40 +00:00
Ralph Castain	93d932aa0c	Ensure that the display-map and display-allocation outputs get processed through the new OPAL filter framework by passing them through orte_output instead of using the opal_dss.dump function. This commit was SVN r18507.	2008-05-27 15:46:21 +00:00
Ralph Castain	e190a990ba	Do not re-init the orte_output system if we have already finalized it as part of orte_finalize. Instead, default to routing the output to the std opal_output system so that the message still gets out. Of course, such messages cannot be filtered, but they are only for debug purposes by ORTE developers, so this should be a minimal issue. This commit was SVN r18506.	2008-05-27 15:15:55 +00:00
Ralph Castain	0b2b655de5	Initialize a variable so it can correctly be dealt with at shutdown - fixes trac:1312 This commit was SVN r18505. The following Trac tickets were found above: Ticket 1312 --> https://svn.open-mpi.org/trac/ompi/ticket/1312	2008-05-27 14:53:24 +00:00
Pak Lui	695c158192	silence some intel and pgcc compiler warnings. This commit was SVN r18501.	2008-05-26 20:35:13 +00:00
Pak Lui	7b3d7dcac4	This commit closes trac:1300. This commit was SVN r18473. The following Trac tickets were found above: Ticket 1300 --> https://svn.open-mpi.org/trac/ompi/ticket/1300	2008-05-21 22:35:04 +00:00
Terry Dontje	ef7ac86929	created opal_version_string and orte_version_string to match the ompi changes made in r18345 for ompi_version_string. This was done per request from Jeff Squyres to maintain consistency and to remove some warnings caused by the non-use of some static const char. This commit was SVN r18461. The following SVN revision numbers were found above: r18345 --> open-mpi/ompi@8dd0421015	2008-05-20 12:13:19 +00:00
Jeff Squyres	c546d7bda9	Ensure that the last few duplicates can be shown -- don't shut down everything until those duplicates are shown This commit was SVN r18460.	2008-05-20 01:34:55 +00:00
Jeff Squyres	af1a9290cb	Ensure to check that opal_init_util() didn't fail. This commit was SVN r18453.	2008-05-19 11:58:48 +00:00
Jeff Squyres	e88ac13e53	Fixes trac:1290: ensure that we setup the orte_init subsystem before using it. This commit was SVN r18448. The following Trac tickets were found above: Ticket 1290 --> https://svn.open-mpi.org/trac/ompi/ticket/1290	2008-05-16 14:32:42 +00:00
Jeff Squyres	28d5f762ca	Fixes trac:1289: ensure that if we haven't initialized the orte_output system, we don't try to use it (e.g., if orte_output or orte_show_help is called before orte_init). This commit was SVN r18442. The following Trac tickets were found above: Ticket 1289 --> https://svn.open-mpi.org/trac/ompi/ticket/1289	2008-05-16 02:03:42 +00:00
Josh Hursey	7e8cd20a0a	a fix for C/R support This commit was SVN r18438.	2008-05-14 16:57:37 +00:00
Jeff Squyres	671f0c379d	Remove a whole pile of orte/util/show_help.h's that I missed. :-( This commit was SVN r18437.	2008-05-14 11:32:33 +00:00
Pak Lui	4c8d79d907	Silence the compiler warnings/errors. There is no orte/util/show_help.h This commit was SVN r18436.	2008-05-13 22:07:38 +00:00
Jeff Squyres	e7ecd56bd2	This commit represents a bunch of work on a Mercurial side branch. As such, the commit message back to the master SVN repository is fairly long. = ORTE Job-Level Output Messages = Add two new interfaces that should be used for all new code throughout the ORTE and OMPI layers (we already made the search-and-replace on the existing ORTE / OMPI layers): * orte_output(): (and corresponding friends ORTE_OUTPUT, orte_output_verbose, etc.) This function sends the output directly to the HNP for processing as part of a job-specific output channel. It supports all the same outputs as opal_output() (syslog, file, stdout, stderr), but for stdout/stderr, the output is sent to the HNP for processing and output. More on this below. * orte_show_help(): This function is a drop-in-replacement for opal_show_help(), with two differences in functionality: 1. the rendered text help message output is sent to the HNP for display (rather than outputting directly into the process' stderr stream) 2. the HNP detects duplicate help messages and does not display them (so that you don't see the same error message N times, once from each of your N MPI processes); instead, it counts "new" instances of the help message and displays a message every ~5 seconds when there are new ones ("I got X new copies of the help message...") opal_show_help and opal_output still exist, but they only output in the current process. The intent for the new orte_* functions is that they can apply job-level intelligence to the output. As such, we recommend that all new ORTE and OMPI code use the new orte_* functions, not the opal_* functions. === New code === For ORTE and OMPI programmers, here's what you need to do differently in new code: * Do not include opal/util/show_help.h or opal/util/output.h. Instead, include orte/util/output.h (this one header file has declarations for both the orte_output() series of functions and orte_show_help()).
* Effectively s/opal_output/orte_output/gi throughout your code. Note that orte_output_open() takes a slightly different argument list (as a way to pass data to the filtering stream -- see below), so if you explicitly call opal_output_open(), you'll need to slightly adapt to the new signature of orte_output_open(). * Literally s/opal_show_help/orte_show_help/. The function signature is identical. === Notes === * orte_output'ing to stream 0 will do similar to what opal_output'ing did, so leaving a hard-coded "0" as the first argument is safe. * For systems that do not use ORTE's RML or the HNP, the effect of orte_output_* and orte_show_help will be identical to their opal counterparts (the additional information passed to orte_output_open() will be lost!). Indeed, the orte_* functions simply become trivial wrappers to their opal_* counterparts. Note that we have not tested this; the code is simple but it is quite possible that we mucked something up. = Filter Framework = Messages sent via the new orte_* functions described above and messages output via the IOF on the HNP will now optionally be passed through a new "filter" framework before being output to stdout/stderr. The "filter" OPAL MCA framework is intended to allow preprocessing of messages before they are sent to their final destinations. The first component that was written in the filter framework was to create an XML stream, segregating all the messages into different XML tags, etc. This will allow 3rd party tools to read the stdout/stderr from the HNP and be able to know exactly what each text message is (e.g., a help message, another OMPI infrastructure message, stdout from the user process, stderr from the user process, etc.). Filtering is not active by default. Filter components must be specifically requested, such as: {{{ $ mpirun --mca filter xml ... }}} There can only be one filter component active.
= New MCA Parameters = The new functionality described above introduces two new MCA parameters: * '''orte_base_help_aggregate''': Defaults to 1 (true), meaning that help messages will be aggregated, as described above. If set to 0, all help messages will be displayed, even if they are duplicates (i.e., the original behavior). * '''orte_base_show_output_recursions''': An MCA parameter to help debug one of the known issues, described below. It is likely that this MCA parameter will disappear before v1.3 final. = Known Issues = * The XML filter component is not complete. The current output from this component is preliminary and not real XML. A bit more work needs to be done to configure.m4 search for an appropriate XML library/link it in/use it at run time. * There are possible recursion loops in the orte_output() and orte_show_help() functions -- e.g., if RML send calls orte_output() or orte_show_help(). We have some ideas how to fix these, but figured that it was ok to commit before feature freeze with known issues. The code currently contains sub-optimal workarounds so that this will not be a problem, but it would be good to actually solve the problem rather than have hackish workarounds before v1.3 final. This commit was SVN r18434.	2008-05-13 20:00:55 +00:00
Ralph Castain	313240f2b6	Pass the pointer to a string pointer to the packing function This commit was SVN r18429.	2008-05-13 01:31:49 +00:00
Shiqing Fan	7ff440f628	Add quotation marks for windows path. This commit was SVN r18420.	2008-05-09 14:12:09 +00:00
Josh Hursey	da2f1c58e2	Some checkpoint/restart cleanup. * Remove the opal_only option. This was suffering from bit rot, and no one uses it. It can be added back fairly easily if wanted. * Cleanup metadata interactions at the local level. * Touch up some of the INC functionality (fix typos and a minor ordering issue) This commit was SVN r18416.	2008-05-08 18:47:47 +00:00
Ralph Castain	64ef4102c4	Add the topo mapper module - requires some work in carto for completion. Little cleanup in round-robin mapper. This commit was SVN r18412.	2008-05-08 05:09:13 +00:00
Ralph Castain	ac5263613c	Fix stupid singletons yet again This commit was SVN r18408.	2008-05-07 20:26:31 +00:00
George Bosilca	dbea3e070e	Correct some copy/paste errors. This commit was SVN r18396.	2008-05-07 04:04:42 +00:00
Ralph Castain	ff70636024	Allgather_list needs its own tag to avoid conflicting with the allgather modex operation. All spawned procs must decode the port of the spawning process so they can communicate in direct routed mode. This fixes comm_spawn for all routing modes. This commit was SVN r18395.	2008-05-07 03:03:56 +00:00
Josh Hursey	bc67f40936	whoops typo This commit was SVN r18390.	2008-05-06 22:00:24 +00:00
Josh Hursey	50c909a23d	Fix a bit of selection logic. Filem should not fail select if the user decided not to build with any filem components. This matches the logic before the mca_base_select() change. This commit was SVN r18389.	2008-05-06 21:57:45 +00:00
Pak Lui	108921c020	typo This commit was SVN r18387.	2008-05-06 21:37:35 +00:00
Pak Lui	0302c098be	minor typo This commit was SVN r18386.	2008-05-06 21:26:17 +00:00
Ralph Castain	d97a4f880d	Shift the daemon collective operation to the ODLS framework. Ensure we track the collectives per job to avoid race conditions. Take advantage of the new capabilities of the routed framework to define aggregating trees for the daemon collective, and to track which daemons are participating to handle the case of sparse participation. Make it all work with comm_spawn in the case of all procs on previously occupied nodes, some new procs on new nodes, and mixtures of the two. Note: comm_spawn now works with both binomial and linear routed modules. There remains a problem of spawned procs not properly getting updated contact info for the parent proc when run in the direct routed mode...but that's for another day. This commit was SVN r18385.	2008-05-06 20:16:17 +00:00
Josh Hursey	c47406810e	Fix AMCA orted command line. If no AMCA parameters are passed then do not send across the path information. Only place it on the command line if the AMCA parameter is set. This commit was SVN r18382.	2008-05-06 18:27:31 +00:00
Josh Hursey	9971bc9d95	Merge in the mca_base_select changes per RFC: http://www.open-mpi.org/community/lists/devel/2008/04/3779.php {{{ svn merge -r 18276:18380 https://svn.open-mpi.org/svn/ompi/tmp-public/jjh-mca-play . }}} Any components not in the trunk, but in one of the affected frameworks must be updated. Contact the list, look at the RFC, or look at the diff for how to do this. Sorry for the early commit of this, but I wanted to get it in today (per RFC) and didn't know if I would have a chance later today. This commit was SVN r18381.	2008-05-06 18:08:45 +00:00
Ralph Castain	40904dd152	Add a binomial routed module - for now, still completely wires up the daemons, but that will be changed later. Modify grpcomm xcast so it now uses the selected routed module - eliminates cross-wiring of xcast and routing paths. Suboptimal at the moment, but better implementation is on its way. Cleanup ignore properties on the new routed components. This commit was SVN r18377.	2008-05-05 22:32:25 +00:00
Aurelien Bouteiller	5ba62469a0	Add a route_is_defined implementation for the linear oob routing. This commit was SVN r18375.	2008-05-05 19:12:41 +00:00
Aurelien Bouteiller	2ae30fe126	Implementation of the route_is_defined stub for direct oob routing. This commit was SVN r18373.	2008-05-05 18:23:26 +00:00
Ralph Castain	b8bb990acf	Rename the routed modules to more accurately reflect what they do and the role they will play in soon-to-come updates. Add two new API's to the routed framework - stub them out so that collaborators can work on them in various components without conflicts. Remove a "finalize" from the select function that could cause problems as the component had not had its initialize called yet. This commit was SVN r18369.	2008-05-05 02:59:09 +00:00
Ralph Castain	519c15f8af	Fix direct and linear xcast modes This commit was SVN r18359.	2008-05-02 14:30:07 +00:00
Ralph Castain	8e846bf7f2	Separate the gathering of collective data by jobid This commit was SVN r18357.	2008-05-02 12:00:08 +00:00
Ralph Castain	432d441b3e	Cleanup a bug found by Josh that caused multiple app_contexts to keep mapping onto the first node in an allocation. Continue work on loadbalancing. Cleanup code organization in rmaps_base. This commit was SVN r18353.	2008-05-01 21:07:49 +00:00
Ralph Castain	b2c73f6e11	Fix tree-spawn to work within the new modex system This commit was SVN r18349.	2008-05-01 19:19:34 +00:00
Josh Hursey	dcd21d7d07	Some checkpoint/restart fixes in response to r18338 (changes in modex). Things should be working now. This commit was SVN r18348. The following SVN revision numbers were found above: r18338 --> open-mpi/ompi@3e55fe6f6d	2008-05-01 17:48:13 +00:00
Ralph Castain	ad894b050b	Set the bookmark so the first process of a comm_spawn'd job will be mapped to the same node as the spawning proc, assuming it has space. If not, then the mapper will automatically move to the next node. This commit was SVN r18346.	2008-05-01 15:24:03 +00:00


1876 commits