openmpi

Author	SHA1	Message	Date
Ralph Castain	a9af219ba7	Fix CID 723: a pointless whine about not checking a return code This commit was SVN r20274.	2009-01-14 19:06:36 +00:00
Josh Hursey	a9da2dada1	Remove some unused variables. This commit was SVN r20270.	2009-01-14 17:28:40 +00:00
Tim Mattox	5b70160626	For two error conditions in the ras_loadleveler_module, output the error code reported by loadleveler. Also, clean up a few more internal error messages. This commit was SVN r20255.	2009-01-13 15:44:26 +00:00
Brian Barrett	d3310a5ad1	fixes to get compiling on Red Storm again This commit was SVN r20252.	2009-01-12 22:30:00 +00:00
Ralph Castain	694008e9bb	Fix a reported bug whereby keyboard entry to a remote proc was being lost after the first iteration. In other words, if an application has a proc reading stdin from the keyboard, and that proc is not co-located with mpirun, then the system would hang. The problem was eventually traced to two bugs in the code: 1. the orted wasn't resetting the write event flag, thus preventing itself from turning it on again. 2. the HNP needed to check if the stdin was attached to tty or not before adding the delay for fairness. If it is attached to a tty, there is no need for the delay. This prevents some strangely slow typing response. This patch needs to move to 1.3 This commit was SVN r20246.	2009-01-12 20:12:58 +00:00
Josh Hursey	1420c32a5d	Update SnapC Local Coordinator in reaction to structure changes in r20228. The list of local children became more globalized so I needed to update the loop invariants appropriately. This commit was SVN r20245. The following SVN revision numbers were found above: r20228 --> open-mpi/ompi@007d68becc	2009-01-12 19:45:48 +00:00
Ralph Castain	2778c13fac	Continue to refine the timing instrumentation to identify where launch time is being spent This commit was SVN r20244.	2009-01-12 19:12:58 +00:00
Jeff Squyres	d1c6f3f89a	* Fix a truckload of Cisco copyrights to be the same as the rest of the code base. * Fix a few misspellings in other copyrights. This commit was SVN r20241.	2009-01-11 02:30:00 +00:00
Tim Mattox	820b209564	Oops, forgot to update the copyright date range... This commit was SVN r20239.	2009-01-09 19:04:52 +00:00
Tim Mattox	af45569366	Clean up some debugging output in the loadleveler ras module. Error output strings were changed to be unique per code site. They are still pretty meaningless to the user, but at least now developers might be able to find which unique place in the code reported which error. This commit was SVN r20238.	2009-01-09 19:03:52 +00:00
Ralph Castain	c009b51ad3	Silence warning about signed vs unsigned comparisons This commit was SVN r20237.	2009-01-09 16:01:03 +00:00
George Bosilca	78d856e04c	Release resources when a job is completed. This allows us to correctly count and load balance MPI-2 dynamic type of applications. This commit was SVN r20236.	2009-01-08 21:21:54 +00:00
Ralph Castain	25f578a7d2	Continue to improve timing instrumentation. Add ability to store timing data directly to a file instead of just to stdout. This commit was SVN r20229.	2009-01-08 14:27:52 +00:00
Ralph Castain	007d68becc	Make the data on local children and their jobs available globally on both daemons and the HNP. This simply shifts the data structures from the ODLS base to the orte globals area to support subsequent movement of the daemon collective operations from the odls to the grpcomm framework. As that will be a larger change, it will be implemented on a branch and rolled over separately. This commit was SVN r20228.	2009-01-08 14:25:56 +00:00
Ralph Castain	80fb98ae32	Cleanup the modex-less operations for efficiency. Have the component default to normal modex operations if modex-less isn't specified. This commit was SVN r20220.	2009-01-07 15:00:26 +00:00
Ralph Castain	7818779760	Expose the nidmap and pidmap as orte globals so that components in other frameworks can access and/or manipulate them without forcing API modifications - modify the individual ess components that were affected so they use the global variables. Add a list of attributes to the nids for storing node-related data (e.g., modex attrs), and define a new object for that purpose. Consolidate the nid/pid lookup code with the rest of the nid/pid code so that changes are easier to track. Add the ability to send cluster profile info as part of the nidmap. Cleanup the setup and teardown of the new global nidmap and pidmap objects. This commit was SVN r20219.	2009-01-07 14:58:38 +00:00
Ralph Castain	9dbcee9110	Increase efficiency for modex-less launch by storing byte objects in the profile file This commit was SVN r20206.	2009-01-05 21:46:12 +00:00
Ralph Castain	5f5d8ad231	CID 1139-1141: remove outdated variable from the various routed components This commit was SVN r20201.	2009-01-05 15:09:54 +00:00
Ralph Castain	1bc125c0a7	CID 1131: cleanup a minor memory leak This commit was SVN r20200.	2009-01-05 15:05:05 +00:00
Jeff Squyres	6d0d8848ac	Fix CID 1129: Remove variable that is set but never used. This commit was SVN r20194.	2009-01-03 15:39:51 +00:00
Jeff Squyres	e52ac6da40	Fix CID 1130: remove variable that is set but never used. This commit was SVN r20193.	2009-01-03 15:37:00 +00:00
Ralph Castain	7787f84540	Per the earlier RFC and some discussion at the Dec ORTE design meeting, add the ompi-top tool and all its supporting infrastructure. This includes a new OPAL pstat framework and data type, currently with rather weak support for Mac OSX and pretty complete support for Linux. The Sun team promised to add Solaris support as well. Also, per chat with Jeff, modified the Makefile.am's of a few orte tools so that they were consistent in the way we generate the ompi-equivalent cmds. This commit was SVN r20165.	2008-12-22 20:23:05 +00:00
Ralph Castain	caa5771908	Don't force tools to dump core files when they abort This commit was SVN r20159.	2008-12-20 23:24:36 +00:00
Ralph Castain	9f6c1b9d07	Per discussion at the Dec ORTE design meeting, add an "set_lifeline" API to the orte_routed framework. This allows the caller to define a "lifeline" process so that, if the connection to that lifeline is subsequently lost, the process will be terminated. This helps tools that connect to an mpirun to know when that mpirun completes and terminates. This commit was SVN r20158.	2008-12-20 23:23:11 +00:00
Shiqing Fan	5ae5f0e173	- 4/4 commit for Windows Visual Studio and CCP support: unnecessary clean up to non windows related files (within ifdef __WINDOWS__). This commit was SVN r20111.	2008-12-10 21:13:27 +00:00
Shiqing Fan	8673f19f50	- 2/4 commit for Windows Visual Studio and CCP support: changes to the already existing ccp components event/win32.c: merge old FD handling into new opal_installdirs_windows.c:fix the registry handling This commit was SVN r20109.	2008-12-10 21:01:54 +00:00
Shiqing Fan	a5281f0434	- 1/4 commit for Windows Visual Studio and CCP support: CMakeLists and .windows files. In contribs preconfigured and precompiled parts. This commit was SVN r20108.	2008-12-10 20:59:20 +00:00
Ralph Castain	728a24c8ec	After considerable patience and help with debugging/testing from Tim M and Jeff S, return a completed and pretty well tested patch of the IOF to the trunk. This commit includes the previously reverted r20074, r20068, and r20064, as well as changes to fix those commits. Basically, the remaining problem turned out to be: 1. closing stdout/stderr during orte_finalize of mpirun 2. inadvertently setting up a write event on fd = -1 3. devising a scheme to more accurately track when the stdin write event was active vs closed so it only got released once This passed prelim MTT testing by Jeff and Tim, but should soak for awhile before migrating to 1.3. This commit was SVN r20106. The following SVN revision numbers were found above: r20064 --> open-mpi/ompi@a07660aea8 r20068 --> open-mpi/ompi@ec930d14a9 r20074 --> open-mpi/ompi@2940309613	2008-12-10 20:40:47 +00:00
Ralph Castain	9d7cb82bba	Modify the daemon cmd processor to relay and then process the cmd locally. We couldn't do this before due to the daemon's needing to update contact info prior to doing the relay. However, the new routed system plus the inclusion of the nidmap in the launch message now makes this possible. It is a small launch performance improvement as now we relay the launch cmd across to the next daemon before taking the time to launch our own local procs. Still, it does allow more parallel operations during the launch procedure. This commit was SVN r20104.	2008-12-10 19:18:36 +00:00
Josh Hursey	67ae66326c	remove unused variable This commit was SVN r20103.	2008-12-10 18:08:46 +00:00
Ralph Castain	1ace83c470	Enable modex-less launch. Consists of: 1. minor modification to include two new opal MCA params: (a) opal_profile: outputs what components were selected by each framework currently enabled for most, but not all, frameworks (b) opal_profile_file: name of file that contains profile info required for modex 2. introduction of two new tools: (a) ompi-probe: MPI process that simply calls MPI_Init/Finalize with opal_profile set. Also reports back the rml IP address for all interfaces on the node (b) ompi-profiler: uses ompi-probe to create the profile_file, also reports out a summary of what framework components are actually being used to help with configuration options 3. modification of the grpcomm basic component to utilize the profile file in place of the modex where possible 4. modification of orterun so it properly sees opal mca params and handles opal_profile correctly to ensure we don't get its profile 5. similar mod to orted as for orterun 6. addition of new test that calls orte_init followed by calls to grpcomm.barrier This is all completely benign unless actively selected. At the moment, it only supports modex-less launch for openib-based systems. Minor mod to the TCP btl would be required to enable it as well, if people are interested. Similarly, anyone interested in enabling other BTL's for modex-less operation should let me know and I'll give you the magic details. This seems to significantly improve scalability provided the file can be locally located on the nodes. I'm looking at an alternative means of disseminating the info (perhaps in launch message) as an option for removing that constraint. This commit was SVN r20098.	2008-12-09 23:49:02 +00:00
Ralph Castain	e28210d0dc	Revert r20074, r20068, and r20064: remove the IOF proc completion code pending further off-trunk work. This commit was SVN r20089. The following SVN revision numbers were found above: r20064 --> open-mpi/ompi@a07660aea8 r20068 --> open-mpi/ompi@ec930d14a9 r20074 --> open-mpi/ompi@2940309613	2008-12-09 17:11:59 +00:00
Ralph Castain	61c21d787d	Add missing param in tm launcher This commit was SVN r20087.	2008-12-09 13:31:33 +00:00
Ralph Castain	6e050bc78c	Update the route when it comes from a different job family. This fixes ticket #1699 This commit was SVN r20085.	2008-12-09 01:16:18 +00:00
Ralph Castain	ce4018efeb	Take a step back on the slurm and tm launchers. Problems were occurring in the MTT runs, although not under non-MTT scenarios. Preserve the modified plm versions in new components that are ompi_ignored until we can resolve the problems. This will allow for better MTT coverage until the problem can be better understood. This commit was SVN r20083.	2008-12-09 00:32:04 +00:00
Ralph Castain	89792bbc72	May as well have the other "clean" outputs use the same channel This commit was SVN r20082.	2008-12-08 19:37:22 +00:00
Ralph Castain	c2b18b363d	Initialize a variable before use This commit was SVN r20080.	2008-12-08 16:16:40 +00:00
Ralph Castain	2940309613	Attempt to solve a race condition showing up in some MTT runs. There were three entry points for proc termination info into the ODLS: 1. a direct callback from waitpid - this set the waitpid_fired flag 2. a notify event callback from the IOF - this set the iof complete flag 3. a message via the daemon cmd processor from the proc "de-registering" the sync, thus indicating it was going through MPI_Finalize. The problem is that these could overlap, with the first two allowing the orted to declare the proc complete before the daemon had responded to #3. This change forces all three events to flow through the daemon cmd processor, thus ensuring an ordered handling. I'm not certain this will solve the problem, but will await further MTT reports to see. Unfortunately, the problem doesn't show up on any manual or script-based tests I have been able to run, even when I duplicate the exact cmd that fails under MTT. This commit was SVN r20074.	2008-12-05 04:20:00 +00:00
Ralph Castain	ec930d14a9	Ensure IOF tags are properly assigned to sinks and read events This commit was SVN r20068.	2008-12-04 01:09:20 +00:00
Ralph Castain	a07660aea8	Bring over the IOF completion changes. This commit fixes the long-occurring problem whereby application procs could, under some circumstances, lose their final prints to stdout/err. The commit includes: 1. coordination of job completion notification to include a requirement for both waitpid detection AND notification that all iof pipes have been closed by the app 2. change of all IOF read and write events to be non-persistent so they can properly be shutdown and restarted only when required 3. addition of a delay (currently set to 10ms) before restarting the stdin read event. This was required to ensure that the stdout, stderr, and stddiag read events had an opportunity to be serviced in scenarios where large files are attached to stdin. This commit was SVN r20064.	2008-12-03 17:45:42 +00:00
Josh Hursey	44109e0084	Fix the ft_event function in response to r20022. Also make the structure cleanup match the finalize() function a bit more closely. This seems to fix the segv seen on process restart. This commit was SVN r20051. The following SVN revision numbers were found above: r20022 --> open-mpi/ompi@9a57db4a81	2008-12-02 21:18:32 +00:00
Ralph Castain	ff8e83ff3b	Per request from IBM/Eclipse, provide MCA param to request output when nodes are resolved to a different nodename. This really only happens for the node that mpirun executes on, but they need the alert so they can do string matching of node names. This commit was SVN r20032.	2008-11-24 19:57:08 +00:00
George Bosilca	7a30a98a89	Use the generic cast. This commit was SVN r20028.	2008-11-24 15:52:36 +00:00
Ralph Castain	7213c109ac	Revamp the TM plm module so that we detect orted termination without requiring a callback message by using the TM native capabilities. This allows TM to function with fully routed OOB comm, and to tell us what node failed to spawn a daemon. This commit was SVN r20027.	2008-11-20 18:57:35 +00:00
Ralph Castain	9a57db4a81	To support comm_spawn in fully routed environments, daemons need to know the route to all procs in their job family. They already had this information, but were not retaining it. The infrastructure to do so has existed for some time - just never had the time to complete it. This commit does that by ensuring that daemons retain knowledge of proc location for all procs in their job family. It required a minor change to the ESS API to allow the daemons to update their pidmaps as data was received. In addition, the routed modules have been updated to take advantage of the newly available info, and the encode/decode pidmap utilities have been updated to communicate the required info in the launch message. This commit was SVN r20022.	2008-11-18 15:35:50 +00:00
Ralph Castain	89559396ea	Resolve a race condition when running under a SLURM environment. The slurm plm fork/exec's a call to srun to launch its daemons. When mpirun terminates, it then sends out a "terminate" command to those daemons. The daemons respond back to mpirun, and then exit. If slurm itself is running on a slow network, and mpirun is running the OOB across a fast network, then it is possible for mpirun to receive notification of daemon termination and exit -before- the srun can complete its bookkeeping and declare the job as complete. When this happens, slurm becomes confused and loses state. Mucho bad. :-/ This commit changes the termination logic so that mpirun will wait for srun to report complete before exiting. It also enables fully routed communications since it no longer requires daemons to report back that they are terminating, thus allowing the daemons to terminate asynchronously (thereby breaking routing paths). This commit was SVN r20018.	2008-11-18 13:59:23 +00:00
Ralph Castain	68423f7544	Partially restore the iof changes - this repairs the initial observation of inconsistent and incomplete output This commit was SVN r19999.	2008-11-14 20:36:18 +00:00
Ralph Castain	586334d1c8	Per discussion with Tim Mattox, reset the trunk to pre-19991 level for the iof only. I will shortly add a changeset that will repair the one known error where we were incorrectly closing the stdout/err/diag file descriptors when all we wanted to do was close stdin. I will leave out the changes associated with coordinating proc termination due to race conditions IU encounted during MTT testing. I have been unable to replicate those so far, but we hope to resolve it in the near future. This commit was SVN r19998.	2008-11-14 20:22:36 +00:00
Ralph Castain	891630ae85	Handle a race condition between mpirun detecting stdin closed (and releasing the read event), and receiving an xon/xoff notice from a remote orted that detects proc termination and tells mpirun "don't send any more input - the proc is gone". This latter was necessary since we might have hung an infinite source of input on mpirun, while the proc terminated after some point in time. This commit was SVN r19997.	2008-11-14 15:19:53 +00:00
Ralph Castain	101b6fdeb8	Cleanup a little on how we handle the stdin write when we encounter end-of-input. Ensure that mpirun handles it correctly if the proc receiving stdin is local to mpirun This commit was SVN r19996.	2008-11-14 14:31:33 +00:00

1521 commits