openmpi

Author	SHA1	Message	Date
Ralph Castain	c749fefbd0	Instead of an odls-base mca param, make report_bindings a global param so that we can (a) detect it was set in the plm, and then (b) ensure it gets passed along to remote orteds so they will comply with the request. This commit was SVN r22021.	2009-09-28 03:17:15 +00:00
Ralph Castain	47c9a5409e	Ensure that tools init the multicast channel correctly This commit was SVN r22020.	2009-09-28 03:15:51 +00:00
Ralph Castain	ef0fd8b8d1	Return an error code if the job failed to start This commit was SVN r22019.	2009-09-26 03:34:58 +00:00
Ralph Castain	e337fa686e	Correct handling of pointer array indexing This commit was SVN r22018.	2009-09-26 03:33:55 +00:00
Ralph Castain	6fa2e81491	Correct handling of pointer array indexing This commit was SVN r22017.	2009-09-26 03:33:26 +00:00
Jeff Squyres	bc3060d668	Fixes trac:2028. George and I found this via some collaborative debugging (yay cisco webex!). Make sure we only go up to OPAL max datatype, not OMPI max datatype. This commit was SVN r22016. The following Trac tickets were found above: Ticket 2028 --> https://svn.open-mpi.org/trac/ompi/ticket/2028	2009-09-25 21:52:42 +00:00
Ethan Mallove	e9014ff20e	Adjust comment according to r22014 This commit was SVN r22015. The following SVN revision numbers were found above: r22014 --> open-mpi/ompi@9e951ce664	2009-09-25 19:53:16 +00:00
Ethan Mallove	9e951ce664	Remove `static` from `MPIR_Breakpoint` so Intel compilers will not inline it This commit was SVN r22014.	2009-09-25 19:14:19 +00:00
Jeff Squyres	1886d5a004	Remove the libopenmpi_malloc library; it is only necessary for backwards compatibility in the v1.3 series. This commit was SVN r22013.	2009-09-25 17:09:54 +00:00
Ralph Castain	709b36efb4	Cleanup auto-wireup and enable tools to "discover" the HNP via multicast This commit was SVN r22012.	2009-09-25 01:00:09 +00:00
Abhishek Kulkarni	2af7657db1	A few changes to the FTB notifier interface: - add an orte ftb notifier help file for more verbose error messages - check if we can connect to the FTB during component->query and close the component, if we cannot. - make the ftb component interface methods static. - add mca parameters to override the default subscription style and priority. This commit was SVN r22011.	2009-09-24 23:56:41 +00:00
Jeff Squyres	3340f62e5f	This directory has not been used in years; no one has contributed anything useful to it. So let's ditch it. It can always be brought back if someone wants to put something useful in here. This commit was SVN r22010.	2009-09-24 14:09:59 +00:00
George Bosilca	56c653ebcd	Add some comments. This commit was SVN r22008.	2009-09-24 00:08:28 +00:00
George Bosilca	74d56d51ac	Fix a typo. This commit was SVN r22007.	2009-09-23 23:36:44 +00:00
Ralph Castain	3167f0a0a0	Complete the next round of the multicast framework development. Needs further polish, upgrade to handle message fragmentation - but good enough for auto-bootstrap of orteds. Teach the ess cm module to bootstrap orted launch This commit was SVN r22006.	2009-09-23 20:57:49 +00:00
Jeff Squyres	152bc14079	Rename the help file to be consistent with others; add it to the Makefile.am. This commit was SVN r22005.	2009-09-23 20:28:49 +00:00
Josh Hursey	c9bd045cff	move {{{ess_env_ft_event_update_process_info}}} into SnapC {{{snapc_full_app_ft_event_update_process_info}}} where it should have been all along. This commit was SVN r22004.	2009-09-23 18:29:13 +00:00
Josh Hursey	a6ee73156c	Add verbose debug options. And add some error prints in the ESS' ft_event code. This commit was SVN r22003.	2009-09-23 17:05:49 +00:00
Josh Hursey	2769091261	Fix for the stalled scenario in which 'options' might be reset to NULL inadvertently. Thanks to MTT for picking this up. This commit was SVN r22002.	2009-09-23 13:26:48 +00:00
Ralph Castain	26bb6e8f79	Add a couple of non-orte multicast tests This commit was SVN r22001.	2009-09-23 05:24:22 +00:00
Jeff Squyres	ef338602ef	Arrgh -- effectively revert r21997. We ''do'' need that header file... This commit was SVN r21998. The following SVN revision numbers were found above: r21997 --> open-mpi/ompi@bf5f14ab32	2009-09-22 21:19:38 +00:00
Jeff Squyres	bf5f14ab32	Remove some debugging stuff. This commit was SVN r21997.	2009-09-22 19:39:01 +00:00
Ralph Castain	dff0d01673	Yet another paffinity cleanup...sigh. 1. ensure that orte_rmaps_base_schedule_policy does not override cmd line settings 2. when you try to bind to more cores than we have, generate a not-enough-processors error message 3. allow npersocket -bind-to-core combination - because, yes, somebody actually wants to do it. This commit was SVN r21996.	2009-09-22 18:44:53 +00:00
Josh Hursey	5406fdfb80	Add support for sending SIGSTOP the MPI job after the checkpoint is taken (uses a BLCR feature for the option). This commit looks larger than it really is since it includes a fair amount of code cleanup. The SIGSTOP/SIGCONT+checkpointing work uses some of the functionality in r20391. Basic use case below (note that the checkpoint generated is useable as usual if the stopped application is terminated). {{{ shell 1) mpirun -np 2 -am ft-enable-cr my-app ... running ... shell 2) ompi-checkpoint --stop -v MPIRUN_PID [localhost:001300] [ 0.00 / 0.20] Requested - ... [localhost:001300] [ 0.00 / 0.20] Pending - ... [localhost:001300] [ 0.01 / 0.21] Running - ... [localhost:001300] [ 1.01 / 1.22] Stopped - ompi_global_snapshot_1234.ckpt Snapshot Ref.: 0 ompi_global_snapshot_1234.ckpt shell 2) killall -CONT mpirun ... Application Continues execution in shell 1 ... }}} Other items in this commit are mostly cleanup that has been sitting off-trunk for too long: * Add a new {{{opal_crs_base_ckpt_options_t}}} type that encapsulates the various options that could be passed to the CRS. Currently only TERM and STOP, but this makes adding others ''much'' easier. * Eliminate ORTE_SNAPC_CKPT_STATE_PENDING_TERM, since it served a redundant purpose with the new options type. * Lay some basic ground work for some future features. This commit was SVN r21995. The following SVN revision numbers were found above: r20391 --> open-mpi/ompi@0704b98668	2009-09-22 18:26:12 +00:00
Jeff Squyres	bb69bf22c0	Fix dumb logic in common sm setup that determines which nodes are local and who has the lowest name. This commit was SVN r21994.	2009-09-22 17:54:43 +00:00
Eugene Loh	67bac2fe31	Fix paffinity_linux_module.c. The set and get functions transferred cpu masks between the mask argument and a local PLPA mask. There were three problems: 1) The "get" function computed the number of bits as sizeof(mask), which is the size of the pointer to the mask rather than the mask itself. So, only 4 bits were copied with m32 and 8 bits with m64. There are actually 1024 bits. 2) The "get" and "set" functions both copied a number of bits computed from the sizeof() mask, but sizeof() reports the number of bytes. We have to multiply by 8 to get the number of bits. 3) These two functions check to make sure that the mask argument is not bigger than the PLPA mask. But, the set function copies a number of bits in the PLPA mask, which is conceivably greater than the number of bits in the mask argument. So, accesses to the mask argument may overrun that argument. Problems 1 and 2 meant that one would encounter errors when the number of cores exceeded 4 (with -m32) or 8 (with -m64). Problem 3 probably caused no errors. This commit was SVN r21993.	2009-09-22 16:00:37 +00:00
Ralph Castain	8da3aa8d5c	Some (hopefully final!) adjustments and corrections to the paffinity support: 1. default -npersocket to force -bind-to-socket 2. if we cannot get a value for cores/socket, try using #logical cpus. otherwise, default to 1 core 3. add missing error message for not-enough-processors 4. since we no longer loop through orte_register_params twice, put the auto-detect of topology info in the rte_init for hnp and std_orted 5. fix bind-to-core, bysocket combination This commit was SVN r21992.	2009-09-22 15:41:03 +00:00
Jeff Squyres	b91e7ba91f	This is no longer necessary. This commit was SVN r21991.	2009-09-22 15:01:00 +00:00
Ralph Castain	12613352eb	Add missing header file This commit was SVN r21990.	2009-09-22 13:07:57 +00:00
Ralph Castain	2210989e2d	Update the cm ess module to support orted bootstrap. Continue work towards bootstrap capability. This commit was SVN r21989.	2009-09-22 02:16:40 +00:00
Ralph Castain	c3f9096fd9	Add a reliable multicast framework, with an initial basic module. This is configured out unless specifically requested via --enable-multicast. This commit was SVN r21988.	2009-09-22 00:58:29 +00:00
Ralph Castain	82af6ee940	Update test This commit was SVN r21987.	2009-09-22 00:55:02 +00:00
Ralph Castain	247977fe70	Update cisco platform files This commit was SVN r21986.	2009-09-22 00:54:22 +00:00
Ralph Castain	7765c71428	Add a macro for formatting IP addresses for printing This commit was SVN r21985.	2009-09-22 00:53:54 +00:00
Jeff Squyres	1ef988c3d9	A slight optimization: no longer call sched_yield() when polling for shmem progress (or the Windows equiv). Instead, poll hard on the condition, but periodically call opal_progress(). This allows badly-formed apps (e.g., the ibm test communicator/bsend_free) to actually complete. To be clear, there are far too many apps out there that assume that MPI collectives will actually progress the rest of MPI. I don't like putting in a feature to enable broken apps, but I have a dim recollection of this issue coming up before (apps "hanging" when testing the sm coll because they assumed that calling collectives would trigger other MPI progress). Rather than have people claim that OMPI is broken, I prefer to put in this "workaround". :-( Indeed, the bsend_free test ''may'' be coded that way for exactly that reason...? I don't remember offhand... This commit was SVN r21984.	2009-09-21 22:20:44 +00:00
Jeff Squyres	64e3689a52	Grr -- test ''before'' committing! Sorry for all the noise folks; this one really fixes the problem. One more optimization coming later (separately). This commit was SVN r21983.	2009-09-21 21:32:26 +00:00
Jeff Squyres	bc43b6a085	Arrgh -- there was an extra assignment in there. Additionally, clean it up a little to drive the point home that the lowest named proc goes into array position [0]. This commit was SVN r21982.	2009-09-21 21:15:32 +00:00
Jeff Squyres	f9dfa03fde	Fix a potential ordering issue with the names and RML exchange during sm coll setup. This commit was SVN r21981.	2009-09-21 21:10:45 +00:00
Terry Dontje	0ccf2d87b6	rename do-not-bind to bind-to-none and clean up an error message This commit was SVN r21980.	2009-09-21 17:00:02 +00:00
Terry Dontje	13be2d2a00	correct mistype: odle should be odls in call to orte_show_help This commit was SVN r21979.	2009-09-21 13:22:37 +00:00
Ralph Castain	7138fd131f	Final cleanup on new paffinity "if-avail" messages, plus fix one bug reported by Terry This commit was SVN r21978.	2009-09-19 17:43:21 +00:00
George Bosilca	b18ca686ae	Correct the pointer math when we copy the opal_datatype_t object. In addition don't set the ref count to 1, it has been already set by the call to OBJ_NEW when the type was allocated. This fixes ticket #2014. This commit was SVN r21976.	2009-09-18 20:05:22 +00:00
Ralph Castain	2028017554	Modify the paffinity system to handle binding directives that are "soft" - i.e., when someone directs that we bind if the system supports it. This allows community members to distribute OMPI with default MCA param files that direct general binding policies, without having the distributed software fail if the system cannot support those policies. The new options work by adding an ":if-avail" qualifier to the "bind-to-socket" and "bind-to-core" MCA params. If the system does not support this capability, the job will launch anyway. Without the qualifier, the job will abort with an error message indicating that the required functionality is not supported on this system. This commit was SVN r21975.	2009-09-18 19:48:42 +00:00
Josh Hursey	7ac8d89f12	Since r21967 converted the mpool sm module into a real module, it broke some of the C/R logic in the ft_event function (actually it wouldn't build after that patch). This commit fixes the ft_event logic so that it uses the normal destroy functionality instead of the workaround with the component that was previously there. All in all it made for cleaner code, which is always good. If r21967 moves to v1.3, this patch will need to be moved as well. This commit was SVN r21972. The following SVN revision numbers were found above: r21967 --> open-mpi/ompi@533633b8cb	2009-09-17 14:45:17 +00:00
Josh Hursey	59143be39d	Fix a minor C/R bug related to cleaning up session directories when sm is present. Before this, we would restore the topmost old session directory. This commit makes sure that we remove it when we are done with it. This commit was SVN r21971.	2009-09-17 14:43:06 +00:00
Edgar Gabriel	9abeaad6e2	so here is what happens: in the v1.2 series the cid's could never go above the max. allowed for a particular pml. Because of that, pml_add_comm never checked for the cid, and in fact pml_add_comm was called in comm_set, which is before we knew the cid. in the v1.3 series (and trunk) we check now the cid to detect overflow, and because of that pml_add_comm has been moved after the cid allocation routine, namely into the comm_activate routine. in the v1.2 series, the comm_activate contained a synchronization step of the old communicator in order to prevent incoming fragments on the new communicator, with the main problem being that the allreduce in the communicator allocation finished at different times on different processes, and thus, this scenario could and did really occur. in the v1.3 series, the comm_activate does not contain the synchronization step anymore, since we introduced the new queue for fragments with unknown cid. The problem is however, that whether a fragment is known or not is decided by using ompi_comm_lookup(), which will return something useful as soon as the cid allocation finished, even before pml_add_comm has been called. So there is a small time gap where we will not post a message into queue for unknown cid's, but we can also not look up the process structure belonging to the rank in that comm ( that is in pml_ob1_match_recv_frag or something like that). The current fix reintroduces the synchronization step in comm_activate, and ensures that no fragment can be received for a new communicator before the synchronization occurs , and thus comm_nextcid() and pml_add_comm has been called. It seems to be the safest and easiest way for now. Welcome back, v1.2. This commit was SVN r21970.	2009-09-17 14:37:02 +00:00
Ralph Castain	98a4450df6	Fix the seq mapper by initializing the proc object to NULL before claiming a slot for it This commit was SVN r21969.	2009-09-17 05:18:37 +00:00
Jeff Squyres	4a40be650e	Improve the MCA param help messages for btl_tcp_if_in|exclude. This commit was SVN r21968.	2009-09-15 17:19:57 +00:00
Jeff Squyres	533633b8cb	Fixes trac:1988. The little bug that turned out to be huge. Yoinks. * Various cosmetic/style updates in the btl sm * Clean up concept of mpool module (I think that code was written way back when the concept of "modules" was fuzzy) * Bring over some old fixes from the /tmp/timattox-sm-coll/ tree to fix potential segv's when mmap'ed regions were at different addresses in different processes (thanks Tim!). * Change sm coll to no longer use mpool as its main source of shmem; rather, just mmap its own segment (because it's fixed size -- there was nothing to be gained by using mpool; shedding the use of mpool saved a lot of complexity in the sm coll setup). This effectively made Tim's fixes moot (because now everything is an offset into the mmap that is computed locally; there are no global pointers). :-) * Slightly updated common/sm to allow making mmap's for a specific set of procs (vs. ''all'' procs in the process). This potentially allows for same-host-inter-proc mmaps -- yay! * Fixed many, many things in the coll sm (particularly in reduce): * Fixed handling of MPI_IN_PLACE in reduce and allreduce * Fixed handling of non-contiguous datatypes in reduce * Changed the order of reductions to go from process (n-1)'s data to process 0's data, because that's how all other OMPI coll components work * Fixed lots of usage of ddt functions * When using a non-contiguous datatype, if the root process is not (n-1), now we used a 2nd convertor to copy from shmem to the rbuf (saves a memory copy vs. what was done before) * Lots and lots of little cleanups, clarifications, and minor optimizations (although still more could be done -- e.g., I think the use of write memory barriers is fairly sub-optimal; they could be ganged together at the root, for example) I'm marking this as "fixes trac:1988" and closing the ticket; if something is still broken, we can re-open the ticket. This commit was SVN r21967. 
The following Trac tickets were found above: Ticket 1988 --> https://svn.open-mpi.org/trac/ompi/ticket/1988	2009-09-15 00:25:21 +00:00
Jeff Squyres	cf6b71e8a2	Helper -- ensure to revert VERSION when we're done. This commit was SVN r21966.	2009-09-11 15:49:15 +00:00

13870 Commits