openmpi

Author	SHA1	Message	Date
Shiqing Fan	8673f19f50	- 2/4 commit for Windows Visual Studio and CCP support: changes to the already existing ccp components; event/win32.c: merge old FD handling into new; opal_installdirs_windows.c: fix the registry handling. This commit was SVN r20109.	2008-12-10 21:01:54 +00:00
Shiqing Fan	a5281f0434	- 1/4 commit for Windows Visual Studio and CCP support: CMakeLists and .windows files. In contribs preconfigured and precompiled parts. This commit was SVN r20108.	2008-12-10 20:59:20 +00:00
Ralph Castain	728a24c8ec	After considerable patience and help with debugging/testing from Tim M and Jeff S, return a completed and pretty well tested patch of the IOF to the trunk. This commit includes the previously reverted r20074, r20068, and r20064, as well as changes to fix those commits. Basically, the remaining problem turned out to be: 1. closing stdout/stderr during orte_finalize of mpirun 2. inadvertently setting up a write event on fd = -1 3. devising a scheme to more accurately track when the stdin write event was active vs closed so it only got released once This passed prelim MTT testing by Jeff and Tim, but should soak for awhile before migrating to 1.3. This commit was SVN r20106. The following SVN revision numbers were found above: r20064 --> open-mpi/ompi@a07660aea8 r20068 --> open-mpi/ompi@ec930d14a9 r20074 --> open-mpi/ompi@2940309613	2008-12-10 20:40:47 +00:00
Ralph Castain	9d7cb82bba	Modify the daemon cmd processor to relay and then process the cmd locally. We couldn't do this before due to the daemon's needing to update contact info prior to doing the relay. However, the new routed system plus the inclusion of the nidmap in the launch message now makes this possible. It is a small launch performance improvement as now we relay the launch cmd across to the next daemon before taking the time to launch our own local procs. Still, it does allow more parallel operations during the launch procedure. This commit was SVN r20104.	2008-12-10 19:18:36 +00:00
Josh Hursey	67ae66326c	remove unused variable This commit was SVN r20103.	2008-12-10 18:08:46 +00:00
Ralph Castain	1ace83c470	Enable modex-less launch. Consists of: 1. minor modification to include two new opal MCA params: (a) opal_profile: outputs what components were selected by each framework currently enabled for most, but not all, frameworks (b) opal_profile_file: name of file that contains profile info required for modex 2. introduction of two new tools: (a) ompi-probe: MPI process that simply calls MPI_Init/Finalize with opal_profile set. Also reports back the rml IP address for all interfaces on the node (b) ompi-profiler: uses ompi-probe to create the profile_file, also reports out a summary of what framework components are actually being used to help with configuration options 3. modification of the grpcomm basic component to utilize the profile file in place of the modex where possible 4. modification of orterun so it properly sees opal mca params and handles opal_profile correctly to ensure we don't get its profile 5. similar mod to orted as for orterun 6. addition of new test that calls orte_init followed by calls to grpcomm.barrier This is all completely benign unless actively selected. At the moment, it only supports modex-less launch for openib-based systems. Minor mod to the TCP btl would be required to enable it as well, if people are interested. Similarly, anyone interested in enabling other BTL's for modex-less operation should let me know and I'll give you the magic details. This seems to significantly improve scalability provided the file can be locally located on the nodes. I'm looking at an alternative means of disseminating the info (perhaps in launch message) as an option for removing that constraint. This commit was SVN r20098.	2008-12-09 23:49:02 +00:00
Ralph Castain	e28210d0dc	Revert r20074, r20068, and r20064: remove the IOF proc completion code pending further off-trunk work. This commit was SVN r20089. The following SVN revision numbers were found above: r20064 --> open-mpi/ompi@a07660aea8 r20068 --> open-mpi/ompi@ec930d14a9 r20074 --> open-mpi/ompi@2940309613	2008-12-09 17:11:59 +00:00
Ralph Castain	61c21d787d	Add missing param in tm launcher This commit was SVN r20087.	2008-12-09 13:31:33 +00:00
Ralph Castain	6e050bc78c	Update the route when it comes from a different job family. This fixes ticket #1699 This commit was SVN r20085.	2008-12-09 01:16:18 +00:00
Ralph Castain	ce4018efeb	Take a step back on the slurm and tm launchers. Problems were occurring in the MTT runs, although not under non-MTT scenarios. Preserve the modified plm versions in new components that are ompi_ignored until we can resolve the problems. This will allow for better MTT coverage until the problem can be better understood. This commit was SVN r20083.	2008-12-09 00:32:04 +00:00
Ralph Castain	89792bbc72	May as well have the other "clean" outputs use the same channel This commit was SVN r20082.	2008-12-08 19:37:22 +00:00
Ralph Castain	c2b18b363d	Initialize a variable before use This commit was SVN r20080.	2008-12-08 16:16:40 +00:00
Ralph Castain	2940309613	Attempt to solve a race condition showing up in some MTT runs. There were three entry points for proc termination info into the ODLS: 1. a direct callback from waitpid - this set the waitpid_fired flag 2. a notify event callback from the IOF - this set the iof complete flag 3. a message via the daemon cmd processor from the proc "de-registering" the sync, thus indicating it was going through MPI_Finalize. The problem is that these could overlap, with the first two allowing the orted to declare the proc complete before the daemon had responded to #3. This change forces all three events to flow through the daemon cmd processor, thus ensuring an ordered handling. I'm not certain this will solve the problem, but will await further MTT reports to see. Unfortunately, the problem doesn't show up on any manual or script-based tests I have been able to run, even when I duplicate the exact cmd that fails under MTT. This commit was SVN r20074.	2008-12-05 04:20:00 +00:00
Ralph Castain	ec930d14a9	Ensure IOF tags are properly assigned to sinks and read events This commit was SVN r20068.	2008-12-04 01:09:20 +00:00
Ralph Castain	a07660aea8	Bring over the IOF completion changes. This commit fixes the long-occurring problem whereby application procs could, under some circumstances, lose their final prints to stdout/err. The commit includes: 1. coordination of job completion notification to include a requirement for both waitpid detection AND notification that all iof pipes have been closed by the app 2. change of all IOF read and write events to be non-persistent so they can properly be shutdown and restarted only when required 3. addition of a delay (currently set to 10ms) before restarting the stdin read event. This was required to ensure that the stdout, stderr, and stddiag read events had an opportunity to be serviced in scenarios where large files are attached to stdin. This commit was SVN r20064.	2008-12-03 17:45:42 +00:00
Josh Hursey	44109e0084	Fix the ft_event function in response to r20022. Also make the structure cleanup match the finalize() function a bit more closely. This seems to fix the segv seen on process restart. This commit was SVN r20051. The following SVN revision numbers were found above: r20022 --> open-mpi/ompi@9a57db4a81	2008-12-02 21:18:32 +00:00
Ralph Castain	ff8e83ff3b	Per request from IBM/Eclipse, provide MCA param to request output when nodes are resolved to a different nodename. This really only happens for the node that mpirun executes on, but they need the alert so they can do string matching of node names. This commit was SVN r20032.	2008-11-24 19:57:08 +00:00
George Bosilca	7a30a98a89	Use the generic cast. This commit was SVN r20028.	2008-11-24 15:52:36 +00:00
Ralph Castain	7213c109ac	Revamp the TM plm module so that we detect orted termination without requiring a callback message by using the TM native capabilities. This allows TM to function with fully routed OOB comm, and to tell us what node failed to spawn a daemon. This commit was SVN r20027.	2008-11-20 18:57:35 +00:00
Ralph Castain	9a57db4a81	To support comm_spawn in fully routed environments, daemons need to know the route to all procs in their job family. They already had this information, but were not retaining it. The infrastructure to do so has existed for some time - just never had the time to complete it. This commit does that by ensuring that daemons retain knowledge of proc location for all procs in their job family. It required a minor change to the ESS API to allow the daemons to update their pidmaps as data was received. In addition, the routed modules have been updated to take advantage of the newly available info, and the encode/decode pidmap utilities have been updated to communicate the required info in the launch message. This commit was SVN r20022.	2008-11-18 15:35:50 +00:00
Ralph Castain	89559396ea	Resolve a race condition when running under a SLURM environment. The slurm plm fork/exec's a call to srun to launch its daemons. When mpirun terminates, it then sends out a "terminate" command to those daemons. The daemons respond back to mpirun, and then exit. If slurm itself is running on a slow network, and mpirun is running the OOB across a fast network, then it is possible for mpirun to receive notification of daemon termination and exit -before- the srun can complete its bookkeeping and declare the job as complete. When this happens, slurm becomes confused and loses state. Mucho bad. :-/ This commit changes the termination logic so that mpirun will wait for srun to report complete before exiting. It also enables fully routed communications since it no longer requires daemons to report back that they are terminating, thus allowing the daemons to terminate asynchronously (thereby breaking routing paths). This commit was SVN r20018.	2008-11-18 13:59:23 +00:00
Ralph Castain	68423f7544	Partially restore the iof changes - this repairs the initial observation of inconsistent and incomplete output This commit was SVN r19999.	2008-11-14 20:36:18 +00:00
Ralph Castain	586334d1c8	Per discussion with Tim Mattox, reset the trunk to pre-19991 level for the iof only. I will shortly add a changeset that will repair the one known error where we were incorrectly closing the stdout/err/diag file descriptors when all we wanted to do was close stdin. I will leave out the changes associated with coordinating proc termination due to race conditions IU encountered during MTT testing. I have been unable to replicate those so far, but we hope to resolve it in the near future. This commit was SVN r19998.	2008-11-14 20:22:36 +00:00
Ralph Castain	891630ae85	Handle a race condition between mpirun detecting stdin closed (and releasing the read event), and receiving an xon/xoff notice from a remote orted that detects proc termination and tells mpirun "don't send any more input - the proc is gone". This latter was necessary since we might have hung an infinite source of input on mpirun, while the proc terminated after some point in time. This commit was SVN r19997.	2008-11-14 15:19:53 +00:00
Ralph Castain	101b6fdeb8	Cleanup a little on how we handle the stdin write when we encounter end-of-input. Ensure that mpirun handles it correctly if the proc receiving stdin is local to mpirun This commit was SVN r19996.	2008-11-14 14:31:33 +00:00
Ralph Castain	875741a5e3	Don't set the stdin fd to -1 before calling the object destructor as that function calls event delete, which uses the fd as an index into the event array. This commit was SVN r19994.	2008-11-13 19:34:29 +00:00
Ralph Castain	b8ae4604ed	Correct the notifier default module to include the new added API This commit was SVN r19993.	2008-11-13 18:03:41 +00:00
Ralph Castain	702fc7154c	Remove stale function definition This commit was SVN r19992.	2008-11-13 05:07:11 +00:00
Ralph Castain	555bbf0c02	Fix the iof race conditions wrt proc termination. This is comprised of two sections: 1. modify the iof to track when a proc actually closes all of its open iof output pipes. When this occurs, notify the odls that the proc's iof is complete. This is done via a zero-time event so that we can step out of the read event before processing the notification. 2. in the odls, modify the waitpid callback so it only flags that it was called. Add a function to receive the iof-complete notification, and a function that checks for both iof complete and waitpid callback before declaring a proc fully terminated. This ensures that we read and deliver -all- of the IO prior to declaring the job complete. Also modified the odls call to orte_iof.close (and the component's implementation) so it only closes stdin, leaving the other io channels alone. This fixes the other half of the known problem. This should fix the ticket on this subject, but I'll wait to close it pending further testing in the trunk. This commit was SVN r19991.	2008-11-12 23:32:01 +00:00
Ralph Castain	26cd1c1955	Fix a typo and some formatting This commit was SVN r19990.	2008-11-12 22:01:40 +00:00
Ralph Castain	ce26e3a2fb	Update the notifier framework in prep for move to v1.3. Add an API to handle the case where error messages have been expressed via "show_help" so they can look similar to what was presented to users. Add three key calls in the openib btl to drop messages into syslog. This will sit in trunk for a few days - would like to actually see some errors reported to syslog before moving the code to 1.3 This commit was SVN r19986.	2008-11-12 18:03:51 +00:00
Josh Hursey	d5c38c2601	fix some typos. should be moved to v1.3 This commit was SVN r19964.	2008-11-10 19:05:26 +00:00
Josh Hursey	077b3df7cc	Fix C/R restart case by passing the correct address to the orte_ess_base_build_nidmap() function. This cropped up from r19866. It does not look like this affects the v1.3 branch since r19866 has not moved to the release branch. Thanks to Leonardo Fialho for reporting this and supplying a patch. This commit was SVN r19961. The following SVN revision numbers were found above: r19866 --> open-mpi/ompi@f54fda489e	2008-11-10 15:19:28 +00:00
Ralph Castain	5889dcd30b	Fix a warning reported by Jeff that actually could cause singleton operations to fail. Ensure that the byte object used to init the job map for singleton's is properly initialized. This commit was SVN r19957.	2008-11-08 01:09:06 +00:00
Jeff Squyres	f4ba25cf3c	Remove linking components against ORTE and OPAL libs. This was removed from all other components long ago; I'm not sure how these survived. This commit was SVN r19956.	2008-11-08 00:56:57 +00:00
Ralph Castain	25491628b8	Discovered while documenting the "preconnect" mca params that several of them didn't make sense any more. After chatting with Jeff, we agreed to the following: 1. register "mpi_preconnect_all" as a deprecated synonym for "mpi_preconnect_mpi" 2. remove "mpi_preconnect_oob" and "mpi_preconnect_oob_simultaneous" as these are no longer valid. 3. remove the routed framework's "warmup_routes" API. With the removal of the direct routed component, this function at best only wasted communications. The daemon routes are completely "warmed up" during launch, so having MPI procs order the sending of additional messages is simply wasteful. 4. remove the call to orte_routed.warmup_routes from MPI_Init. This was the only place it was used anyway. The FAQs will be updated to reflect this changed situation, and a CMR filed to move this to the 1.3 branch. This commit was SVN r19933.	2008-11-05 19:41:16 +00:00
Ralph Castain	b48bbec366	Cleanup modex logic to allow modex-less launch: 1. minor change in base_modex to only set modex_reqd when it -is- reqd 2. cleanup logic in grpcomm-basic module This commit was SVN r19903.	2008-11-03 21:48:52 +00:00
Ralph Castain	6db5737779	Remove a couple of mutex vars that were defined and used - but never initialized. No clear way to initialize them, and that area of the code should never see threads anyway. This commit was SVN r19889.	2008-11-03 17:23:10 +00:00
Kenneth Matney	c650ef58c5	Build requires prototypes, defined by "orte/util/nidmap.h". This commit was SVN r19887.	2008-11-03 16:23:42 +00:00
Ralph Castain	55f52d7a4b	Ensure we know how to route to a different job family when it connects to us This commit was SVN r19885.	2008-11-03 14:25:14 +00:00
Ralph Castain	85bc7bb26a	Minor cleanups: * fix an if condition so that we do the right thing when procs local to mpirun output to stderr * ensure that tools can handle relays of 0-byte output, indicating that a process closed that io channel This commit was SVN r19884.	2008-11-03 14:03:08 +00:00
Ralph Castain	58fe779388	Remove double destruct to fix segv when ctrl-c is used to terminate job This commit was SVN r19875.	2008-11-02 02:25:20 +00:00
George Bosilca	d23fe1bb10	Include Ralph's suggestions, i.e. keep the hnp and orted management in sync. This commit was SVN r19872.	2008-11-01 00:39:46 +00:00
George Bosilca	9528d33e90	Nothing relevant, few indentations and replace tab by spaces. This commit was SVN r19870.	2008-10-31 22:24:52 +00:00
George Bosilca	ebe87d1842	Apply some suggestions from Ralph and avoid a pretty nasty race condition on the close of the fd. The problem was that we closed the same fd twice, and in the meantime the fd could have been reassigned to some other file or socket. This commit was SVN r19869.	2008-10-31 22:23:53 +00:00
George Bosilca	9f17d1d67d	Allow xgrid to compile with the changes from 19866. This commit was SVN r19868.	2008-10-31 21:56:53 +00:00
Ralph Castain	f54fda489e	This is a first step towards supporting fully-routed OOB communications: 1. remove direct routed module (hooray!) 2. add radix tree routed module (binomial remains default) 3. remove duplicate data storage - orteds were storing nidmap and pidmap data in odls, everyone else in ess 4. add ess APIs to update nidmap, add new pidmap - used only by orteds for MPI-2 support 5. modify code to eliminate multiple calls to orte_routed.update_route that recreated info already in ess pidmap. Add ess API to lookup that info instead. Modify routed modules to utilize that capability 6. setup new ability to shutdown orteds without sending back an "ack" message to mpirun - not utilized yet, will require some changes to plm terminate_orteds functions in managed environments (coming soon) Initial tests indicating that fully routing comm via defined routing trees may not actually have a significant cost for operations like IB QP setup. More tests required to confirm. This will require an autogen... This commit was SVN r19866.	2008-10-31 21:10:00 +00:00
George Bosilca	0ce76248e8	Close the file descriptors used to push or pull the data to the children. Without this patch, doing spawn in a loop ended up by exhausting all available file descriptors pretty quickly. There were about 5 file descriptors opened per spawned process. Now the number of file descriptors managed by the process (orted or HNP) is a lot smaller. This commit was SVN r19864.	2008-10-31 18:05:28 +00:00
Ralph Castain	30b3bc6761	Minor update - provide one more helpful hint regarding stdin target out-of-range, ensure we exit cleanly since daemons won't have been launched. This commit was SVN r19847.	2008-10-29 16:00:48 +00:00
Ralph Castain	82ece176d5	Sanity check needs to allow vpid_invalid as this indicates the "none" scenario This commit was SVN r19820.	2008-10-28 14:50:26 +00:00

1696 commits