openmpi

Author	SHA1	Message	Date
Andrew Friedley	798c19d395	Blah.. we should always return after try_connect() here, not just when we have an error. Another fix for ticket #362. This commit was SVN r11756.	2006-09-22 15:51:11 +00:00
Tim Prins	567676f3c1	- Formatting and minor cleanup - made it so we now set the architecture of each node we discover - remove debugging output This commit was SVN r11751.	2006-09-22 13:24:32 +00:00
Tim Prins	83a7f6e4de	Fix for bug #369. LoadLeveler only sets LOADL_PROCESSOR_LIST when there are 128 or fewer tasks allocated to a job. The POE RAS relied on this variable, so I created a new RAS which uses the LoadLeveler API instead of relying on the environment variable. This still needs some testing, so for now we use the POE RAS whenever LOADL_PROCESSOR_LIST is set; otherwise we fall back on this component. Unfortunately, this will require an autogen... This commit was SVN r11732.	2006-09-21 00:08:49 +00:00
Andrew Friedley	8895bf7369	Fix the fix (r11718) for bug #362 . We were still waiting the entire duration of the timeout before we figured out that a connect() was successful. Re-introduce adding the peer_send_event so that we detect immediately when a connect() completes. Also make sure to delete the timeout event in complete_connect(). Fixed a struct timeval initialization warning reported by Jeff. Remove an erroneous opal_output(). This commit was SVN r11724. The following SVN revision numbers were found above: r11718 --> open-mpi/ompi@1b6231a9b5	2006-09-20 14:29:37 +00:00
Andrew Friedley	1b6231a9b5	Fix for running jobs that span multiple 's' partitions on IU BigRed. Each 's' partition has its own TCP network. It's fine to use this network for jobs that fit inside the partition, but the TCP OOB errors when trying to connect across two partitions, because there are two disjoint networks. Each node also has another TCP network connecting ALL nodes together. So the solution is to actually try all the available TCP interfaces on a node, instead of erroring when the first one fails. Also, the default TCP connect() timeout is way too long (5 minutes) - use our own timeout mechanism, with the timeout value expressed as an MCA parameter. This commit was SVN r11718.	2006-09-19 19:33:49 +00:00
Tim Prins	c4db5654fa	Fix for bug #370 The POE ras did not correctly enter the number of slots per node. This fixes that. This commit was SVN r11716.	2006-09-19 16:27:15 +00:00
Ralph Castain	977e3c5ca1	Let's see if Cyrador understands this version a little better... This commit was SVN r11709.	2006-09-19 13:05:40 +00:00
Ralph Castain	0ad0d84afd	Add two new API functions to the RMGR, and modify the "spawn" API to support the enhanced MPI-2 functionality. No implementation backs these new APIs - just placeholders for now. This commit was SVN r11699.	2006-09-19 01:45:05 +00:00
Ralph Castain	d7e61e40fc	Quiet a few warnings from Cyrador This commit was SVN r11686.	2006-09-18 12:40:42 +00:00
Ralph Castain	8a291afda6	Ensure the rds_private.h file gets included in the distribution This commit was SVN r11682.	2006-09-16 11:45:02 +00:00
Ralph Castain	f906af983a	Forgot to change the silly Makefile.am names - sorry Cyrador! This commit was SVN r11670.	2006-09-15 04:52:20 +00:00
Jeff Squyres	3e239f4532	Add a missing .ompi_ignore This commit was SVN r11666.	2006-09-15 02:36:22 +00:00
George Bosilca	4fe39a4e7d	The old PLS is now called an ODLS. However, the real name is not windows but process. This change will follow shortly... This commit was SVN r11663.	2006-09-14 22:22:34 +00:00
Ralph Castain	37dfdb76eb	Here is the major MAD-cure commit. I have written plenty about it, so I refer you here to those messages for a description of everything that was done. This commit was SVN r11661.	2006-09-14 21:29:51 +00:00
George Bosilca	17afe7dc9f	Do it the correct way, as this is normally compiled as a module. This commit was SVN r11660.	2006-09-14 21:22:41 +00:00
George Bosilca	01c5a115b2	Don't export the POE module. Only the component has to be exported (visible). This commit was SVN r11659.	2006-09-14 21:20:31 +00:00
Josh Hursey	908f31fe9f	Fix a code clarity issue in the POE PLS. Allow the POE RAS to be compiled for Linux as well as AIX. The POE RAS is really a LoadLeveler RAS, and IU now has a cluster that uses LoadLeveler in a Linux environment (BigRed). This seems to be the only thing we need to do so far to run Open MPI on BigRed. Yay :) This commit was SVN r11600.	2006-09-09 05:13:15 +00:00
Josh Hursey	160120b4c5	Fix a cut-n-paste error that causes the 'num_concurrent' to be set to 1 or 0 instead of the user defined number or default (128). This caused the PLS to deadlock when using '--debug-daemons' with more than 2 processes. :( svn blame says that it was broken in r11347 It is not a problem on v1.1 or v1.2 branches. Bug spotted by Tim Mattox and myself. This commit was SVN r11575. The following SVN revision numbers were found above: r11347 --> open-mpi/ompi@f52c10d18e	2006-09-08 15:17:17 +00:00
Jeff Squyres	0f11584a6c	* Update svn:ignore * Remove svn:executable from non-executable files This commit was SVN r11555.	2006-09-07 17:17:40 +00:00
Ralph Castain	9e6e9b8619	Fix a couple of variable declarations This commit was SVN r11467.	2006-08-28 13:28:10 +00:00
George Bosilca	c2311f6e42	Don't define the yywrap function. This commit was SVN r11459.	2006-08-28 04:11:25 +00:00
George Bosilca	693c835137	No need to cast as the returned value is already in the expected type. This commit was SVN r11458.	2006-08-28 04:10:43 +00:00
George Bosilca	ba1514f2e7	A slightly more Windows friendly version. Unfortunately there is no support for SGE on Windows. This commit was SVN r11436.	2006-08-27 04:46:43 +00:00
Pak Lui	131f0eff04	fix the verbose value. This commit was SVN r11418.	2006-08-24 21:30:08 +00:00
Pak Lui	65a524dd0d	- need to provide option for showing the grid engine's JOB_ID in case the grid engine job needs to be killed - clean up the orted_path and debug message This commit was SVN r11413.	2006-08-24 20:27:19 +00:00
Pak Lui	4f75dfd353	- missed the opal_os_path() for LD_LIBRARY_PATH This commit was SVN r11410.	2006-08-24 18:58:50 +00:00
George Bosilca	9110ea2b80	Add the Windows fork component. As fork is not available on Windows, I create a process component which use CreateProcess to spawn the child. Special care should be taken in order to correctly redirect the stdin, stdout and stderr of the child process. This commit was SVN r11405.	2006-08-24 17:51:20 +00:00
George Bosilca	0d607c1346	Use opal_os_path and OPAL_PATH_SEP to build the file path. I don't have any machine to test, so I hope I get it right. This commit was SVN r11398.	2006-08-24 16:20:32 +00:00
Pak Lui	5220c1ca42	- converted some tabs into spaces This commit was SVN r11384.	2006-08-23 23:21:08 +00:00
Pak Lui	9dda057f05	- Do the changes as in r11347 for gridengine to use opal_os_path(). - Remove extra NULL argument from rsh module. This commit was SVN r11377. The following SVN revision numbers were found above: r11347 --> open-mpi/ompi@f52c10d18e	2006-08-23 20:40:01 +00:00
Jeff Squyres	715bae369c	Remove extra argument - now obsoleted by the use of opal_os_path(). This commit was SVN r11366.	2006-08-23 14:32:06 +00:00
Brian Barrett	e39f0096a0	* add header file to sources list so make dist works This commit was SVN r11357.	2006-08-23 13:31:56 +00:00
George Bosilca	c03ef692c1	And the missing header. This commit was SVN r11348.	2006-08-23 03:33:35 +00:00
George Bosilca	f52c10d18e	And ORTE is ready for prime-time. All Windows tricks are in: - use the OPAL functions for PATH and environment variables - make all headers C++ friendly - no unnamed structures - no implicit cast. Plus a full implementation for the orte_wait functions. This commit was SVN r11347.	2006-08-23 03:32:36 +00:00
George Bosilca	aecdfc80eb	Don't forget to release the object if we detect an error. This commit was SVN r11346.	2006-08-23 02:43:05 +00:00
Ralph Castain	c3ba1c1cc1	Fix a pack/unpack mismatch This commit was SVN r11315.	2006-08-22 13:50:59 +00:00
Ralph Castain	73a7916946	For Ollie...fix a few names. Should help the Bproc SMR component compile. This commit was SVN r11284.	2006-08-21 15:11:20 +00:00
George Bosilca	6afa4c6c64	Windows friendly version. We have to split the OMPI_DECLSPEC in at least 3 different macros, one for each project. Therefore, now we have OPAL_DECLSPEC, ORTE_DECLSPEC and OMPI_DECLSPEC. Please use them based on the sub-project. This commit was SVN r11270.	2006-08-20 15:54:04 +00:00
Ralph Castain	ee04e04dd0	Attempt to cleanup the xgrid pls module This commit was SVN r11261.	2006-08-18 21:21:31 +00:00
Ralph Castain	6bf06d4602	Fix connect-accept by cleaning up two minor bugs. This commit was SVN r11260.	2006-08-18 21:12:03 +00:00
Ralph Castain	517d6fda49	Add the smr_private include file so it gets put in tarballs This commit was SVN r11243.	2006-08-17 12:24:44 +00:00
Ralph Castain	8c7f0ed9ae	Change the SOH to the new State Monitoring and Reporting (SMR) framework. New API's will be appearing in the new framework shortly - this just gets the name change into the system. Other changes: 1. Remove the old xcpu components as they are not functional. 2. Fix a "bug" in orterun whereby we called dump_aborted_procs even when we normally terminated. There is still some kind of bug in this procedure, however, as we appear to be calling the orterun job_state_callback function every time a process terminates (instead of only once when they have all terminated). I'll continue digging into that one. This will require an autogen/configure, I'm afraid. This commit was SVN r11228.	2006-08-16 16:35:09 +00:00
Ralph Castain	5dfd54c778	With the branch to 1.2 made.... Clean up the remainder of the size_t references in the runtime itself. Convert to orte_std_cntr_t wherever it makes sense (only avoid those places where the actual memory size is referenced). Remove the obsolete oob barrier function (we actually obsoleted it a long time ago - just never bothered to clean it up). I have done my best to go through all the components and catch everything, even if I couldn't test compile them since I wasn't on that type of system. Still, I cannot guarantee that problems won't show up when you test this on specific systems. Usually, these will just show as "warning: comparison between signed and unsigned" notes which are easily fixed (just change a size_t to orte_std_cntr_t). In some places, people didn't use size_t, but instead used some other variant (e.g., I found several places with uint32_t). I tried to catch all of them, but... Once we get all the instances caught and fixed, this should once and for all resolve many of the heterogeneity problems. This commit was SVN r11204.	2006-08-15 19:54:10 +00:00
Brian Barrett	cd7b138d74	Propagate up errors when setting up standard input forwarding This commit was SVN r11187.	2006-08-14 21:09:05 +00:00
Ralph Castain	d2912f03e0	Cleanup a historical naming convention problem. Move the socket_errno definitions to the OPAL layer and change the name accordingly. This cleans up some interrelationship issues as well as removing a name confusion. This commit was SVN r11186.	2006-08-14 20:14:44 +00:00
Ralph Castain	663e25f7cb	Finalize the Bproc vpid algorithm. Bproc is now fully operational and supports oversubscribed conditions for both bynode and byslot mapping procedures. This commit was SVN r11180.	2006-08-14 19:16:11 +00:00
Ralph Castain	285aea1c0c	Update to bproc algorithm to support oversubscription - committing to move to another test environment. Note that this may break bproc for the moment. This commit was SVN r11178.	2006-08-14 18:34:13 +00:00
Ralph Castain	de9156552b	I have confirmed that the later version of the bproc launcher does support Bproc 3, so it appears that the outdated bproc_seed launcher truly is no longer required. This commit was SVN r11164.	2006-08-12 07:47:21 +00:00
Ralph Castain	0ccc910485	Fix the Bproc vpid computation so that, when we map by slot, adjacent processes have vpids differing by only one. I will amend the documentation in the files shortly to explain why this was previously broken. This commit was SVN r11162.	2006-08-11 19:41:33 +00:00
Pak Lui	8fab3d5b82	* Inadvertently removed a wrong variable during the last change. This commit was SVN r11157.	2006-08-11 16:00:39 +00:00
Ralph Castain	59d6f1e2eb	Remove ompi_ignores on gridengine components as this seems resolved - thanks Pak for quick response! Fixed a few very minor compiler complaints in the pls_gridengine_module.c file. ISO C is less forgiving about where variables get declared. This commit was SVN r11156.	2006-08-11 15:32:17 +00:00
Pak Lui	99a0521e44	* Fix the issue that Ralph observed in MacOS X with an invalid header file and other warnings. This commit was SVN r11155.	2006-08-11 15:04:51 +00:00
Ralph Castain	5fd6306c2f	Add ompi_ignores until the configuration can be fixed This commit was SVN r11154.	2006-08-11 14:11:41 +00:00
Pak Lui	08352878cc	* Added in new ras and pls components to support Sun N1 Grid Engine (N1GE) 6 and its open source version as the job launchers for ORTE. This commit was SVN r11153.	2006-08-10 21:46:52 +00:00
Ralph Castain	bd937b219d	Tell xcast not to send to processes that have "aborted". One of those fixes that has been sitting on another branch for awhile...sigh. This commit was SVN r11142.	2006-08-09 18:23:43 +00:00
Ralph Castain	8496b6aff4	When a "fork" launch cannot find the executable, the system used to just return an error. This meant that the state of that process was never updated in the registry, leaving the counters at the incorrect levels. As a result, the triggers would never fire to indicate that the job had been aborted. This left orterun and other orteds/processes hanging. This fix should fix the problem. I will test it on a broader range of systems forsooth... This commit was SVN r11140.	2006-08-09 15:29:08 +00:00
Ralph Castain	ddd575d126	Ensure that the localhost gets placed on the registry with the same name as found in the system_info structure. Otherwise, we wind up with confusion in the session directory names. This commit was SVN r11139.	2006-08-09 15:26:37 +00:00
Brian Barrett	59844f2119	Galen noticed that the soh component wasn't linking against the bproc libraries. Fix that issue. This commit was SVN r11119.	2006-08-07 16:20:33 +00:00
Brian Barrett	16186978bb	- Fix some compile issues in r11109 - indent / whitespace cleanup - don't set --daemon-debug when pls debug is given, as it seems to make the daemons abort. This commit was SVN r11113. The following SVN revision numbers were found above: r11109 --> open-mpi/ompi@da7df6d257	2006-08-03 18:51:42 +00:00
Galen Shipman	da7df6d257	monitor bproc node state and terminate the job if a node in our job goes down.. This commit was SVN r11109.	2006-08-03 05:29:49 +00:00
Josh Hursey	d1e1a68645	This commit contains the necessary changes to get "mpirun a.out" working correctly with MPI_Comm_spawn. The problem with MPI_Comm_spawn was that the 'parent' process was rmgr.create'ing and then rmgr.launch'ing the children via the rmgr proxy component. The HNP saw these commands and processed them normally, but since we never went through the HNP's rmgr (urm component) spawn() logic the triggers and key/value pairs were never created. So the children were launched correctly, but since the HNP did not have any triggers set up, it never triggered the xcast for the children to finish orte_init(). This fix puts the trigger and key/value pair initialization in rmgr_urm_spawn() for the 'mpirun a.out' case, and in the rmgr_base_unpack routine that deals with the creation of the job for the child as requested by the proxy component. This will allow the triggers to be registered for the proxy's request which only happens during MPI_Comm_spawn* Small change for a lot of debugging. Notice that this reverts r11037 to its previous version, and adds a newline to handle the spawn cases. This commit was SVN r11046. The following SVN revision numbers were found above: r11037 --> open-mpi/ompi@5813fb7d2a	2006-07-28 17:17:31 +00:00
Josh Hursey	5813fb7d2a	It seems that MPI_Comm_spawn{_multiple} has been broken since r10708 By reverting this file (changeset from commit r10708) to its previous version fixes the problem. This should be moved to the v1.1 branch where it is also broken. This commit was SVN r11037. The following SVN revision numbers were found above: r10708 --> open-mpi/ompi@febc143d8c	2006-07-27 21:21:10 +00:00
Brian Barrett	c744f650ba	* really didn't mean for this patch (the threaded accept() code) to come in with r10841, so revert it (and its fixes) out. Will bring back once cleaned up from the code used in the tbird experiment This commit was SVN r10991. The following SVN revision numbers were found above: r10841 --> open-mpi/ompi@dfa1221c3b	2006-07-25 22:32:01 +00:00
Jeff Squyres	c2d4dfce78	Remove unused variable This commit was SVN r10985.	2006-07-25 21:43:21 +00:00
Jeff Squyres	bdab8d744c	Send a pointer to the data, not the data itself. Otherwise, we could get a segv in some cases. This commit was SVN r10984.	2006-07-25 21:42:44 +00:00
Ralph Castain	65acc9325a	Fix a bug that crept in during the last change to support "mpirun a.out" operations. Since we now reserve a range of vpids for each app_context, we no longer need to track the rank and offset the starting vpid each time through the mapper - the name service automatically accounts for the offset when allocating the next starting vpid for the job. This should be shifted to v1.1. This commit was SVN r10916.	2006-07-20 21:06:15 +00:00
Ralph Castain	8bec270f90	Fix a bug noted by Jeff - we were no longer accurately recording in the registry that a process had been terminated when the user initiated the "kill" process (via ctrl-c). Added another system-level test function for ORTE that just spins until terminated by a ctrl-c signal. Modified orterun - added a couple of newlines to the output when abnormally terminating so the prompt always is on a new line. This commit was SVN r10866.	2006-07-18 14:42:27 +00:00
Gleb Natapov	f15fc4ef2f	include signal.h for SIGPIPE definition This commit was SVN r10863.	2006-07-18 09:07:53 +00:00
Brian Barrett	2185c059e8	* use opal_free_list_item_t as the type of items stored in an opal_free_list_t, rather than assuming it's an opal_list_item_t. This commit was SVN r10860.	2006-07-17 21:51:50 +00:00
Jeff Squyres	82161d20ca	Catch a SIGPIPE and allow it to be harmless. Register a no-op SIGPIPE handler before the write() and de-register it afterwards. Determine if the write() succeeded or failed by the return of write(). This commit was SVN r10858.	2006-07-17 21:15:56 +00:00
George Bosilca	33a7634009	Silence the compiler. This commit was SVN r10851.	2006-07-17 17:13:28 +00:00
Ralph Castain	404acc9f65	It's okay to call index prior to anything being put in the registry... This commit was SVN r10848.	2006-07-17 14:31:42 +00:00
Ralph Castain	574a6f7896	Fix a bug that caused the system to crash when asked for an index of the segment names. Such a request required passing a NULL value for the segment name, but the find_seg function didn't protect itself from that value. Thanks to James Kennedy (UCC-Ireland) for finding it. This commit was SVN r10847.	2006-07-17 13:51:07 +00:00
Brian Barrett	dfa1221c3b	* AC_CONFIG_LINKS has a minor problem in that it always uses ln -s, rather than $(LN_S). This causes problems with Windows and probably elsewhere (re: #200). So use a slightly different trick to get the right header selected for the MEMCPY and TIMER components. * Using the same trick used to solve the AC_CONFIG_LINKS problem, stop using a separate header file for direct calling in the PML and MTL. This lets me remove some icky code in ompi_mca.m4 that was more fragile than I really liked. This commit was SVN r10841.	2006-07-16 04:23:52 +00:00
Jeff Squyres	ffddfc5629	Turns out that it's a really Bad Idea(tm) to tm_spawn() and then not keep the resulting tm_event_t that is generated because the back-end TM library actually caches a bunch of stuff on it for internal processing, and doesn't let go of it until tm_poll(). tm_event_t's are similar to (but slightly different than) MPI_Requests: you can't do a million MPI_Isend()'s on a single MPI_Request -- a) you need an array of MPI_Request's to fill and b) you need to keep them around until all the requests have completed. This commit was SVN r10820.	2006-07-14 22:04:41 +00:00
Rainer Keller	50b5791969	- Release best_item - Reformat This commit was SVN r10814.	2006-07-14 19:55:14 +00:00
Ralph Castain	7b3ced80e8	Fix a bug that has been causing inconsistent behavior on a number of platforms. Will explain more on the core-devel list. Jeff: this needs to be back-patched to our supported prior releases. I'll try to verify how far back we need to go - my initial guess is probably all of them This commit was SVN r10801.	2006-07-14 14:16:20 +00:00
Ralph Castain	cef1ce19d6	Restore the "sleep" delay during startup. Since Jeff and I are going to a branch for T-bird, we have restored the trunk to its prior state to avoid any possibility of disturbing it. This commit was SVN r10774.	2006-07-12 22:18:53 +00:00
Jeff Squyres	ef8433a60b	After more discussion on the phone, it seems easier to not muck around in special components but rather go down to a /tmp branch. So removing these components and I'll branch next. This commit was SVN r10771.	2006-07-12 22:12:29 +00:00
Jeff Squyres	62c189ea1c	Fix a few blanket search/replaces This commit was SVN r10768.	2006-07-12 21:54:05 +00:00
Ralph Castain	badd3f4acb	Clean up a few lingering references to "urm". This commit was SVN r10765.	2006-07-12 21:01:21 +00:00
Jeff Squyres	36ca7497d1	Update m4 and configure files This commit was SVN r10764.	2006-07-12 20:55:39 +00:00
Ralph Castain	9102b5af3b	Remove the "sleep" delay in the oob connection procedure. This shouldn't cause any problems, especially for launches of less than 1000 processes. Please report any abnormal behavior during launch, though, as we would like to understand what (if any) impact is seen. I couldn't see any on small jobs (the modulo functions render this number down pretty low). This commit was SVN r10763.	2006-07-12 20:31:30 +00:00
Ralph Castain	a84898316c	Create new components to support Thunderbird scalability development This commit was SVN r10762.	2006-07-12 20:28:23 +00:00
Brian Barrett	4b70bb92db	* Per ticket #112 , localhost checks should check against 127.0.0.1/8, rather than just 127.0.0.1. This commit was SVN r10750.	2006-07-11 20:54:49 +00:00
Ralph Castain	11125dd67a	George has a retarded compiler - but that's okay. This will quiet its warning system. This commit was SVN r10736.	2006-07-11 15:27:02 +00:00
George Bosilca	3daa063772	Make the format and the arguments match. This commit was SVN r10734.	2006-07-11 15:10:44 +00:00
Josh Hursey	9a31060b6d	Fix r10725 so that the trunk builds again. This commit was SVN r10733. The following SVN revision numbers were found above: r10725 --> open-mpi/ompi@ae222cca5b	2006-07-11 14:48:31 +00:00
Ralph Castain	ae222cca5b	Include the help file so it can be accessed This commit was SVN r10725.	2006-07-11 12:15:25 +00:00
Ralph Castain	6129a5a887	Enable -host support for "mpirun a.out". You can now execute on all slots on specified nodes within your overall allocation. This commit was SVN r10713.	2006-07-11 02:59:23 +00:00
George Bosilca	a9df5035f9	Remove unused variable. This commit was SVN r10712.	2006-07-11 00:30:51 +00:00
Ralph Castain	febc143d8c	Per LANL's stated need, add functionality that runs a.out across ALL available process slots if no num_proc is specified on the command line. However, please note the following limitation: we ONLY allow ONE application to be specified on the command line when this feature is invoked. If multiple apps are specified, the user MUST also specify the number to be launched for each and every one of them. Update the help text to report errors when not following that rule. Also updated the RMAPS help text to reflect the reorganization of some of the round-robin code into the base. The new functionality has been tested under Mac OS-X and on Odin using an MPI program. Both byslot and bynode mapping have been checked and verified. Operational support for other systems needs to be verified - I respectfully request people's help in doing so. This commit was SVN r10708.	2006-07-10 21:25:33 +00:00
Ralph Castain	3d220cbd48	This patch fixes several issues relating to comm_spawn and N1GE. In particular, it does the following: 1. Modifies the RAS framework so it correctly stores and retrieves the actual slots in use, not just those that were allocated. Although the RAS node structure had storage for the number of slots in use, it turned out that the base function for storing and retrieving that information ignored what was in the field and simply set it equal to the number of slots allocated. This has now been fixed. 2. Modified the RMAPS framework so it updates the registry with the actual number of slots used by the mapping. Note that daemons are still NOT counted in this process as daemons are NOT mapped at this time. This will be fixed in 2.0, but will not be addressed in 1.x. 3. Added a new MCA parameter "rmaps_base_no_oversubscribe" that tells the system not to oversubscribe nodes even if the underlying environment permits it. The default is to oversubscribe if needed and the underlying environment permits it. I'm sure someone may argue "why would a user do that?", but it turns out that (looking ahead to dynamic resource reservations) sometimes users won't know how many nodes or slots they've been given in advance - this just allows them to say "hey, I'd rather not run if I didn't get enough". 4. Reorganizes the RMAPS framework to more easily support multiple components. A lot of the logic in the round_robin mapper was very valuable to any component - this has been moved to the base so others can take advantage of it. 5. Added a new test program "hello_nodename" - just does "hello_world" but also prints out the name of the node it is on. 6. Made the orte_ras_node_t object a full ORTE data type so it can more easily be copied, packed, etc. This proved helpful for the RMAPS code reorganization and might be of use elsewhere too. This commit was SVN r10697.	2006-07-10 14:10:21 +00:00
Brian Barrett	41e144c879	Fix for ticket #92 , bproc stdin being borked. The problem was that we were using a pty for everything, which drops all buffered data on the floor when close() is called on the daemon side, meaning EOF has some issues. Instead, do the same thing we do for other starters that use the fork() pls -- use a pipe/fifo for stdin and stderr and a pty for stdout. This is good enough for what we need and avoids most of the issues with ptys. This commit was SVN r10692.	2006-07-08 21:18:24 +00:00
Ralph Castain	bc7690bcb0	Fix the bproc allocator. This is just a bandaid for 1.x that will be fixed more thoroughly in 2.0. Basically, the problem was that the allocator was grabbing everything on the cluster for which the user had access privilege. Thus, if a user had two sessions operable, each with its own allocation, mpirun in each session would grab both sets of nodes and use them. Not very polite. This commit was SVN r10683.	2006-07-06 18:31:14 +00:00
Jeff Squyres	3d5d0959fa	Remove unused variable, and therefore silence a compiler warning. This commit was SVN r10673.	2006-07-06 10:44:04 +00:00
Josh Hursey	b1da6f8bc4	A bit more cleanup for that last patch. * num_children should really be an int instead of size_t since 'size_t' is not signed and num_children can (in rare cases) drop below 0, and don't want it to roll around to MAX_INT or some such. * I figured out that this problem only happened to me because I use the pls_fork_reap_timeout MCA parameter and thus the only time that the code in pls_fork_module.c to waitpid is executed is if this is not set to 0 (I had it set to 1 to give my procs time to exit). I adjusted the loop from while{...} to do{...}while; so that it is executed at least once for consistency. * de-register the SIGCHLD callback for the pid before we attempt to kill it, so that we don't leave the door open for both the waitpids (the one in the callback, and the one in this function) to race to see who can wait on the child. * Move the 'thread release' to outside the for loop for a bit of an optimization, and always set the value to 0 since we want to finish after this function. * Added a help message for the case when we can't send a kill() signal to the process. Should never happen, but all is possible in the wild wild west of HPC. This commit was SVN r10666.	2006-07-05 21:38:23 +00:00
Josh Hursey	696bb4a0c0	A partial fix for the hanging orted bugs (Ticket #177 ) When we force an application to terminate (via CTRL-C to mpirun) we send an out-of-band message to the orted to reap its children. the fork PLS was doing an internal waitpid but never releasing or updating the information and signaling the condition variable. So the fork PLS callback for SIGCHLD registered with the event library and this waitpid are in a bit of a race to 'waitpid' for the children. Since the PLS callback was the only one that handled the signal properly when it 'won' then things were great -- as in the normal termination case. But when it 'lost' -- as in the abnormal termination case -- the orted never received the proper signal that its children had gone away. We want to preserve the internal fork PLS callback since it allows for a timeout while waiting for the child, which the event library won't do. This allows both to exist, and behave properly. This was introduced in r9068. The ticket is still open since the orted's hang in other situations still. This is a fix for one of the causes. This commit was SVN r10662. The following SVN revision numbers were found above: r9068 --> open-mpi/ompi@c2c2daa966	2006-07-05 19:37:29 +00:00
Jeff Squyres	538965aeb0	Final merge of stuff from /tmp/tm-stuff tree (merged through /tmp/tm-merge). Validated by RHC. Summary: - Add --nolocal (and -nolocal) options to orterun - Make some scalability improvements to the tm pls This commit was SVN r10651.	2006-07-04 20:12:35 +00:00
Josh Hursey	5c5ce7e051	When 'mca_oob_send_callback' accesses the callback 'orte_pls_rsh_terminate_job_cb' with an error status (< 0) then the req buffer is NULL. Put checks around the OBJ_RELEASE(req) calls so that we don't try to release NULL :/ This commit was SVN r10641.	2006-07-03 22:44:54 +00:00
Josh Hursey	d082a63734	Add some new OPAL functionality. After seeing the ugliness that is removing directories in the codebase I decided to push down this to the OPAL by extending the opal/os_create_dirpath.(c\|h) to contain some more functionality. In this process I renamed 'os_create_dirpath' to 'os_dirpath' since it is a bit more general now. Added a few functions to: - check if a directory is empty - check to see if the access permissions are set correctly - destroy the directory at the end of the dirpath - By using a caller callback function (a la Perl, I believe) for every file, the caller can have fine grained control over whether a specific file is deleted or not. This simplifies things a bit for orte_session_dir_(finalize\|cleanup) as it should no longer contain any of this functionality, but uses these functions to do the work. From the external perspective nothing has changed, from the developer point of view we have some cleaner, more generic code. This commit was SVN r10640.	2006-07-03 22:23:07 +00:00
Brian Barrett	0bd5acc51f	* Fix for bus error in XGrid starter This commit was SVN r10615.	2006-07-01 16:16:46 +00:00
Josh Hursey	0a931f9fad	Bringing over the session directory and universe changes from the tmp/jjhursey-ft-cr branch. In this commit we change the way universe names are created. Before we by default first created "default-universe" then if there was a conflict we created "default-universe-PID" where PID is the PID of the HNP. Now we create "default-universe-PID" all the time (when a default universe name is used). This makes it much easier when trying to find a HNP from an outside app (e.g. orte-ps, orteconsole, ...) This also adds a "search" function to find all of the universes on the machine. This is useful in many contexts when trying to find a persistent daemon or when trying to connect to a HNP. This commit also makes orte_universe_t an opal_object_t, which is something that needed to happen, and only affected the SDS in one of its base functions. I was asked to bring this over to aid in fixing orteconsole and orteprobe. Due to the change of orte_universe_t to an object orteprobe may need to be updated to reflect this change. Since orteprobe needs to be looked at anyway I'll leave this to Ralph to take care of. Note: These changes do not depend upon any of the FT work (but the FT work does depend upon them). These were brought over to help in fixing some of the ORTE tool set that require the functionality laid out in this patch. Testing: Ran the 'ibm' tests before and after this change, and all was as well as before the change. If anyone notices additional irregularities in the system let me know. But none are expected. This commit was SVN r10550.	2006-06-28 21:03:31 +00:00
Brian Barrett	2cf73912e2	* fix for signal forwarding additions in bproc_orted code This commit was SVN r10529.	2006-06-27 19:59:07 +00:00
Sushant Sharma	76926756d0	variable ntid not being assigned any value was resulting in errors This commit was SVN r10480.	2006-06-22 18:00:54 +00:00
Josh Hursey	58110f9fc9	Fixes Ticket #125 for both the trunk and v1.1 branch. This commit will apply cleanly to the v1.1 branch, and should be moved over once I get someone to verify it. The problem is outlined in the bug. The fix was to move the setting of the app context index (idx) before we put it in the GPR so that it is propagated to the GPR. The reason this hasn't bitten us before is because we init app->idx to 0, which is correct most of the time. The exception is MPI_Comm_spawn_multiple, where we put in more than one app context and therefore care about correct indexing. This was causing memory corruption down the line by overrunning the mapping array. This commit also puts in a check to make sure that we error out if we ever try to do that again. This commit was SVN r10380.	2006-06-15 22:14:07 +00:00
Sushant Sharma	ca01291aea	Updated soh-xcpu component. Not going to be used for time being. This commit was SVN r10343.	2006-06-13 23:25:46 +00:00
Sushant Sharma	b5a16b6515	Updated xcpu launcher. open-mpi no longer needs xcpu library. Launcher code is now moved within xcpu. This commit was SVN r10342.	2006-06-13 23:21:56 +00:00
Brian Barrett	17a8ccef89	* update XGrid API to match recent signal changes This commit was SVN r10262.	2006-06-08 21:15:35 +00:00
Ralph Castain	ee5a626d25	Add ability to trap and propagate SIGUSR1/2 to remote processes. There are a number of small changes that hit a bunch of files: 1. Changed the RMGR and PLS APIs to add "signal_job" and "signal_proc" entry points. Only the "signal_job" entries are implemented - none of the components have implementations for "signal_proc" at this time. Thus, you can signal all of the procs in a job, but cannot currently signal only one specific proc. 2. Implemented those new API functions in all components except xgrid (Brian will do so very soon). Only the rsh/ssh and fork modules have been tested, however, and only under OS-X. 3. Added signal traps and callback functions for SIGUSR1/2 to orterun/mpirun that catch those signals and call the appropriate commands to propagate them out to all processes in the job. 4. Added a new test directory under the orte branch to (eventually) hold unit and system level tests for just the run-time. Since our test branch of the repository is under restricted access, people working on the RTE were continually developing their own system-level tests - thus making it hard to help diagnose problems. I have moved the more commonly-used functions here, and added one specifically for testing the SIGUSR1/2 functionality. I will be contacting people directly to seek help with testing the changes on more environments. Other than compile issues, you should see absolutely no change in behavior on any of your systems - this additional functionality is transparent to anyone who does not issue a SIGUSR1/2 to mpirun. Ralph This commit was SVN r10258.	2006-06-08 18:27:17 +00:00
Jeff Squyres	4882dc0e2c	Addendum to r9930: missed a chunk of the rsh pls to use the basename of $libdir and $bindir (i.e., was correctly doing local launches, but was still using $prefix/lib and $prefix/bin for remote launches). [Re-]Fixes OFED bug 59. This commit was SVN r10207. The following SVN revision numbers were found above: r9930 --> open-mpi/ompi@1d6902296c	2006-06-05 21:12:36 +00:00
Brian Barrett	22cd78abb5	* add header required when debugging is not enabled This commit was SVN r10155.	2006-06-01 01:26:52 +00:00
Josh Hursey	bb95df9bf2	Added some user friendly output to the hostfile RDS component. This is more of a usability feature, but a very useful one. So I suggest that it go into the release branches. This commit was SVN r10153.	2006-05-31 20:07:59 +00:00
Josh Hursey	2f20a38c98	This is a fix for bug Ticket #27. We were stuck in an infinite loop inside the rmaps round_robin component when the user specified a host, then oversubscribed it. Instead of returning an error, we looped forever. For example: $ cat hostfile A slots=2 max-slots=2 B slots=2 max-slots=2 $ mpirun -np 3 --hostfile hostfile --host B <hang> The loop would not terminate because both host A and B are in the 'nodes' structure as they are both allocated to the job. However, after allocating 2 slots to host B, we remove it from the node list leaving us with a 'nodes' structure with just A in it. Since we can't use host A, we keep looping here until we find a node that we can use. This patch checks to make sure that if we get into this situation where rmaps is looping over the list a second time without finding a node during the first pass then we know that there are no nodes left to use, so we have a resource allocation error, and should return to the user. This patch should be moved to all of the release branches. This commit was SVN r10131.	2006-05-31 03:42:01 +00:00
Brian Barrett	7000cecf78	Fix for standard output / standard error truncation issue when in a shell pipeline. See lengthy comment in iof_base_endpoint.c for the details, but the short version is that we shouldn't set O_NONBLOCK on standard I/O file descriptors, so we no longer do. Closes ticket:9 This commit was SVN r9966.	2006-05-18 15:43:32 +00:00
Jeff Squyres	1d6902296c	Additions to the tm, slurm, and rsh pls modules to handle the --prefix option as discussed on the devel-core mailing list. The Big Difference is that instead of hard-coding the strings "/lib" and "/bin" in to append to the prefix, we append the basename of the local libdir and bindir. Hence, if your libdir is $prefix/lib64, we'll append /lib64 to construct the remote node's LD_LIBRARY_PATH (etc.). Also appended the orterun.1 man page to include a description of --prefix, how it is constructed, what it handles / what it does not, etc. This commit was SVN r9930.	2006-05-16 14:14:12 +00:00
Gleb Natapov	80dfe7e39b	remove newline from environment This commit was SVN r9892.	2006-05-11 13:15:48 +00:00
Brian Barrett	1c0c84cf67	If the urm gets a request to kill itself and it's a singleton, just exit out, rather than trying to have the pls exit. Since singletons weren't started with a pls, there's no way the pls is going to be able to kill the process. So just exit and save the error message. This commit was SVN r9859.	2006-05-09 13:40:41 +00:00
Brian Barrett	b76b46bcec	* fix some compile issues on Red Storm This commit was SVN r9812.	2006-05-04 14:08:36 +00:00
Brian Barrett	9276127c0d	* add some extra sauce to make sure we close down our processes properly This commit was SVN r9807.	2006-05-04 00:38:49 +00:00
Brian Barrett	5fed99c2c2	Sending SIZE_MAX from machines with different sizeof(size_t) causes big problems, as the smaller machine's SIZE_MAX won't be SIZE_MAX on the bigger machine, which can lead to failures along the way -- in this case, with GPR triggers being improperly fired. This commit was SVN r9776.	2006-04-28 21:09:42 +00:00
George Bosilca	1ea3a39372	The condition was wrong. The fact that it accepts 0 length messages is interpreted as a shutdown of the io channel on the next iteration. Definitely not the right approach. The correct condition is bigger than 0. This commit was SVN r9770.	2006-04-28 04:57:07 +00:00
Jeff Squyres	bfcf3867fc	Back out George's commit from earlier today; it seems to break stdout forwarding. More detailed mail coming to devel-core shortly that explains. This commit was SVN r9769.	2006-04-28 03:32:27 +00:00
Sushant Sharma	7a6e0c9ebf	Fixed remote environment setup. Submitted by: Tim Woodall This commit was SVN r9759.	2006-04-27 20:07:56 +00:00
George Bosilca	bafc16f724	We don't need the len anymore as everything is now attached to the fragment. This commit was SVN r9758.	2006-04-27 17:35:05 +00:00
George Bosilca	5df94f812e	Aren't we supposed to release the value on all possible execution paths? This commit was SVN r9757.	2006-04-27 17:31:01 +00:00
Tim Woodall	0a56067509	Correction to resolve a problem related to partial reads. We were making a copy of the receive buffer based on the iovec struct that may have been updated during partial reads to reflect the current offset. Need to make the copy using the base address of the buffer. Thanks to Sven Stork for finding this. This should be backported to 1.0.X and 1.1.X branches. This commit was SVN r9749.	2006-04-27 14:27:02 +00:00
Tim Woodall	7a139d6cc8	- corrections to I/O forwarding - handling of incomplete writes THESE CHANGES SHOULD BE PROPAGATED TO BOTH 1.0 and 1.1 BRANCHES This commit was SVN r9734.	2006-04-26 15:36:06 +00:00
Tim Woodall	3e57a4ec48	remove debug code - not required This commit was SVN r9715.	2006-04-25 19:05:57 +00:00
Brian Barrett	e737b0a106	Fix a bunch of warnings the Sun compilers find: - The constant 1 is a signed int by default. Explicitly say that it is an unsigned value so we can't overflow - Fix unreachable statement warnings in dss_arith by breaking out of switch statements instead of returning - this should have no impact on performance, since it's a non-conditional jump - A couple of the GPR files had carriage returns and were in DOS mode - put them in unix mode... These should all probably go to the v1.1 branch... This commit was SVN r9664.	2006-04-20 15:35:58 +00:00
Ralph Castain	95c4795157	Try a different tack... This commit was SVN r9658.	2006-04-19 15:33:34 +00:00
Ralph Castain	93115fdaea	Try again with passing the right enviro variables. This commit was SVN r9629.	2006-04-13 18:07:22 +00:00
Ralph Castain	480af1c150	Add the missing enviro variables This commit was SVN r9627.	2006-04-13 16:41:47 +00:00
Sushant Sharma	642e33fb3e	xcpu launcher updated to setup the environment on remote nodes before launching jobs. This commit was SVN r9622.	2006-04-12 22:42:41 +00:00
Ralph Castain	424900068f	Update the xcpu launcher to setup the environment This commit was SVN r9620.	2006-04-12 15:41:54 +00:00
Sushant Sharma	9fe5870862	xcpu pls component fixed so that it will compile correctly. This commit was SVN r9617.	2006-04-11 20:27:13 +00:00
Ralph Castain	9adc16130e	Proposed revision of the xcpu launcher to correctly incorporate the OpenRTE and Open MPI environment This commit was SVN r9612.	2006-04-11 14:33:17 +00:00
Brian Barrett	f37a77dd08	* Fix potential deadlock when mpi threads are enabled and progress threads are not. See lengthy comment in the body of commit. This commit was SVN r9573.	2006-04-07 18:13:35 +00:00
Sushant Sharma	26d51d5041	Cleaned lots of dead code in xcpu soh component (soh_xcpu.c). Checked the fix submitted by Ralph Castain for completing processes in soh_xcpu. It's working fine now. This commit was SVN r9554.	2006-04-06 16:26:25 +00:00
George Bosilca	ca75ff2569	In the case we have support for threads, then the opal library has its own thread, which will do progress independently of MPI. So in this case we have to call opal_event_loop instead of opal_progress. This commit was SVN r9551.	2006-04-06 14:31:38 +00:00
Brian Barrett	7408de0bfb	When progress threads are enabled, opal_progress() doesn't call the event library (since the event library has its own thread). So when we are using progress threads, we really want to call opal_event_loop() and not opal_progress(). This commit was SVN r9549.	2006-04-06 12:58:09 +00:00
Ralph Castain	895c2ade8b	Proposed fix for completing processes This commit was SVN r9543.	2006-04-06 08:18:42 +00:00
Ralph Castain	b9bdb2125e	Fix and upgrade the console to support better debugging. Activate "dump" commands to display registry content. Remove the blasted opal_output default prefix that made the dump output illegible. Properly connect to existing daemons and/or start new ones. This commit was SVN r9528.	2006-04-04 11:05:52 +00:00
Sushant Sharma	8d5289b2b8	Corrected Makefile.am files for pls and soh xcpu-components as per Brian's suggestion. This commit was SVN r9519.	2006-04-03 17:14:47 +00:00
Brian Barrett	4ea8790342	* Don't try to call tcgetpgrp on platforms that don't have that function * Some more stuff to ignore / do in Red Storm build This commit was SVN r9511.	2006-04-01 05:46:15 +00:00
Brian Barrett	2c64ab562e	More fixes to try to get Red Storm port going again.... * Add a platform spec for using the portals reference implementation's RTE instead of our own to make local testing easier. * Add a cnos rmgr component so that 1) we don't have to build nearly as many components (no need for ras,rds,pls,etc.) and 2) calls to MPI_ABORT() won't print error messages about not being able to contact the daemon. Still need to fill in some of the terminate stuff with calls from cnos, but will come in time. * Make gpr_null use the base code for creating value and keyval structures so that we don't segfault in ompi_mpi_init(). This commit was SVN r9510.	2006-04-01 04:54:46 +00:00
Jeff Squyres	858612fd06	Face the possibility that the child may have already died. This commit was SVN r9508.	2006-04-01 02:23:10 +00:00
Sushant Sharma	46f84b1e8e	Added xcpu component in pls and soh. This commit was SVN r9491.	2006-03-31 02:19:52 +00:00
Ralph Castain	8ba453b866	Modify the rmgr_proxy component so it includes the automatic wire-up of stdio. This commit was SVN r9483.	2006-03-30 19:44:28 +00:00
George Bosilca	2b3779cd6e	Correct some of the casting issues. By default the compilers attach a signed type to the defines. As our internal types (job_id and co.) are unsigned, this generates several errors (integer overflow in expression and comparison between signed and unsigned). Casting the defines to the correct type solves these problems. This commit was SVN r9481.	2006-03-30 19:28:17 +00:00

655 commits