openmpi

Author	SHA1	Message	Date
Nathan Hjelm	e10afcd354	udcm: fix bugs This commit fixes the following bugs: - On send failure release newly allocated message. - In the destructor for udcm_message_sent_t always remove the send timeout event from the event base. Failure to do this can lead to memory corruption since the destructor may be called from an event callback. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2015-10-21 12:54:14 -06:00
Nathan Hjelm	55d24ee7a3	btl/openib: fix argument type for internal atomic function This was fixed on my btl 3.0 branch but the changeset got lost in a rebase. Fixes issues with lock ups when using osc/rdma. References open-mpi/ompi#1010 Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2015-10-20 13:47:28 -06:00
Nathan Hjelm	90db00e37f	Merge pull request #996 from hjelmn/openib_progress_thread btl/openib: remove extra threads	2015-10-08 07:31:27 -06:00
Nathan Hjelm	b8af310efa	btl/openib: remove extra threads This commit removes the service and async event threads from the openib btl. Both threads are replaced by opal progress thread support. The run_in_main function is now supported by allocating an event and adding it to the sync event base. This ensures that the requested function is called as part of opal_progress. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2015-10-07 12:30:41 -06:00
Nathan Hjelm	59aa93e1b6	opal/mpool: add support for passing access flags to register This commit adds an access_flags argument to the mpool registration function. This flag indicates what kind of access is being requested: local write, remote read, remote write, and remote atomic. The values of the registration access flags in the btl are tied to the new flags in the mpool. All mpools have been updated to include the new argument but only the grdma and udreg mpools have been updated to make use of the access flags. In both mpools existing registrations are checked for sufficient access before being returned. If a registration does not contain sufficient access it is marked as invalid and a new registration is generated. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2015-10-05 13:53:55 -06:00
Nathan Hjelm	60f3dbd160	btl/openib: fix udcm coverity errors Fix CID 1312120: Uninitialized scalar variable The response type will always be set unless a message of another type is passed to this function. To make sure that error is caught I am adding an assert. Fix CID 1312116: Dereference after null check This is a potential bug. If there is no endpoint data for an incoming connection a rejection should be sent. In this case we would just SEGV. Fix CID 1312115: Dereference after null check Clear error in the error message. Use the queue pair number that was passed in. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2015-09-23 17:08:27 -06:00
Nathan Hjelm	2041aac4e4	btl/openib: add support for dynamic add_procs Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2015-09-10 08:55:55 -06:00
Nathan Hjelm	6f8f2325ed	btl: btls are now required to set the send flag if supported This commit updates each non-compliant btl to set the MCA_BTL_FLAGS_SEND flag in the btl_flags field if send is supported. This fixes a problem identified after the latest bml/r2 update which explicitly checks for the send flag. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2015-09-10 08:55:54 -06:00
Matias Cabral	f360eebfeb	Merge pull request #855 from matcabral/btl_openib_mtu Fix for openib btl mca command line parameter btl_openib_mtu being ignored	2015-09-09 11:22:00 -07:00
Rolf vandeVaart	2e64a69fa9	Add some verbosity to help debug hwloc issues	2015-09-08 10:50:22 -07:00
Ralph Castain	d97bc29102	Remove OPAL_HAVE_HWLOC qualifier and error out if --without-hwloc is given	2015-09-04 16:54:40 -07:00
matcabral	1f9218a0bc	Fix for openib btl mca command line parameter btl_openib_mtu being ignored.	2015-09-02 02:22:30 -07:00
Nathan Hjelm	f926796e57	Merge pull request #828 from hjelmn/openib_thread_fix openib thread fixes	2015-09-01 09:12:50 -06:00
Ralph Castain	cf6137b530	Integrate PMIx 1.0 with OMPI. Bring Slurm PMI-1 component online Bring the s2 component online Little cleanup - let the various PMIx modules set the process name during init, and then just raise it up to the ORTE level. Required as the different PMI environments all pass the jobid in different ways. Bring the OMPI pubsub/pmi component online Get comm_spawn working again Ensure we always provide a cpuset, even if it is NULL pmix/cray: adjust cray pmix component for pmix Make changes so cray pmix can work within the integrated ompi/pmix framework. Bring singletons back online. Implement the comm_spawn operation using pmix - not tested yet Cleanup comm_spawn - procs now starting, error in connect_accept Complete integration	2015-08-29 16:04:10 -07:00
Nathan Hjelm	64e4419d76	btl/openib: allow the use of the openib btl in thread multiple There were several issues preventing the openib btl from running in thread multiple mode: - Missing locks in UDCM when generating a loopback endpoint. Fixed in open-mpi/ompi@8205d79819. - Incorrect sequence numbers generated in debug mode. This did not prevent the openib btl from running but instead produced incorrect error messages in debug builds. - Recursive locking of the rcache lock caused by the malloc hooks. This is fixed by open-mpi/ompi#827 Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2015-08-24 16:04:52 -06:00
Nathan Hjelm	c101385f64	btl/openib: fix sequence number generation for debug mode When using eager RDMA in debug builds the openib btl generates a sequence number for each send. The code independently updated the head index and the sequence number for the eager rdma transaction. If multiple threads enter this code at the same time and run in the following order: thread 1: update sequence (0 -> 1) thread 2: update sequence (1 -> 2) thread 2: update head (0 -> 1) thread 1: update head (1 -> 2) the sequence number for head[0] gets 1 and the sequence number for head[1] gets 0. The fix is to generate the sequence number from the head index. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2015-08-24 16:00:06 -06:00
Nathan Hjelm	8205d79819	btl/openib: add missing lock calls Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2015-08-24 12:21:49 -06:00
Gilles Gouaillardet	d02ccd67de	btl/openib: remove OFED version runtime check when XRC is used this test seems broken : - some false positives were reported - it fails to detect some OFED version mismatches this commit simply removes this test, which means the application will likely fail if XRC is used and the OFED version is different between compile time and runtime	2015-08-14 09:10:03 +09:00
Gilles Gouaillardet	f7cf7d5070	configury: fix XRC detection on OFED < 3.12 since ibv_create_xrc_rcv_qp is now deprecated, and in order to be "future-proof", we have to consider the case in which only XRC Domains are supported. also, correctly handle distros that ship broken ibverbs devel headers Thanks Paul Hargrove for the detailed report.	2015-07-13 10:43:22 +09:00
Ralph Castain	683efcb850	Rename the current opal_event_base to opal_sync_event_base in preparation for adding an async progress thread to opal. No functional changes made here - just a simple rename.	2015-07-11 10:08:19 -07:00
Jeff Squyres	4341639a66	Revert "configury: fix (again) XRC detection on OFED < 3.12" @ggouaillardet is likely offline for the weekend, but master is broken on RHEL 6.5 systems that do not have MOFED installed. So I'm taking the liberty of reverting this commit; I'm guessing Gilles will fixup and re-commit next week. This reverts commit 77f8282d51d8f40f6ae988ef84c9c852de75c625.	2015-07-10 06:45:33 -07:00
Gilles Gouaillardet	77f8282d51	configury: fix (again) XRC detection on OFED < 3.12 since ibv_create_xrc_rcv_qp is now deprecated, and in order to be "future-proof", we have to consider the case in which only XRC Domains are supported. Thanks Paul Hargrove for the detailed report.	2015-07-10 15:31:45 +09:00
Rolf vandeVaart	ae0f3cfee7	Make explicit call to initialize MCA parameters in common CUDA code. This allows us to view them with ompi_info and possibly modify with tools interface	2015-07-09 12:51:55 -04:00
Gilles Gouaillardet	9f171de412	btl/openib: queue pending fragments once only when running out of credit Fixes open-mpi/ompi#640	2015-07-06 09:45:01 +09:00
bosilca	77367ca02c	Merge pull request #687 from rolfv/pr/fix-smcuda-perfprob Add the ability use different size buffers for host and CUDA buffers	2015-07-02 18:42:41 -04:00
Rolf vandeVaart	30a872b478	Add the ability to send host buffers through one sized staging buffers and CUDA buffers through different sized buffers. Fixes performance issues	2015-07-02 11:11:15 -04:00
Alina Sklarevich	27797654db	openib btl: added a new vendor_part_id for Mellanox ConnectX4-LX.	2015-06-29 13:50:43 +03:00
Howard Pritchard	e49a37c034	ownership: update ownership files per discussions at OMPI devel workshop Signed-off-by: Howard Pritchard <howardp@lanl.gov>	2015-06-25 10:04:42 -06:00
Ralph Castain	869041f770	Purge whitespace from the repo	2015-06-23 20:59:57 -07:00
Jeff Squyres	8ab2b11f88	btl_openib.c: fix another compiler warning Remove this unused variable	2015-06-17 09:00:12 -07:00
Jeff Squyres	f688289aaf	btl_openib.c: fix compiler warning This return code is not used; tell the compiler we're not going to use it.	2015-06-17 08:56:56 -07:00
Jeff Squyres	4384131e65	openib: minor style and defensive programming fixes Minor comment/whitespace fixes. Also some minor logic changes that are mainly for defensive programming purposes (i.e., ensure to always set malloc_hook_set to true or false, and then check it before we try to actually invoke it).	2015-06-12 20:11:47 -07:00
Jeff Squyres	2f137ff151	openib: reset memalign threshold properly Now that open-mpi/ompi#638 is fixed, reset the openib BTL memalign threshold properly. This effectively re-instates commit open-mpi/ompi@ce915b5757.	2015-06-12 20:11:47 -07:00
Jeff Squyres	88c13adc8c	openib: only set the memory hook if it is enabled Instead of unconditionally setting the memory hook, only set it when the memory hooks are both available and have been enabled (e.g., opal/mca/memory/linux has decided that it can be enabled, and when the mpi_leave_pinned MCA param is set to 1, or is set to -1 and some component requested the memory hooks be enabled). If we set the memory hook when memory hooks are not enabled, __malloc_hook will be NULL, which will cause problems when btl_openib_malloc_hook() tries to invoke it. Fixes open-mpi/ompi#638.	2015-06-12 20:11:47 -07:00
Ralph Castain	12d3c9ca22	Revert "Fix a typo that incorrectly set the alignment threshold in the openib BTL." This reverts commit ce915b5757d428d3e914dcef50bd4b2636561bca.	2015-06-10 14:02:49 -07:00
Rolf vandeVaart	8622b34664	Check for GPU Direct RDMA and leave pinned turned off	2015-06-04 14:25:24 -04:00
Nathan Hjelm	5e2bc2c662	btl/openib: fix coverity issue CID 1269821 Dereference null return value (NULL_RETURNS) This is another false positive that can be silenced by looping on opal_list_remove_first instead of using both opal_list_is_empty and opal_list_remove_first. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2015-05-29 08:44:03 -06:00
Nathan Hjelm	b038eb6434	btl/openib: more coverity fixes CID 1301390 Dereference before null check (REVERSE_INULL) endpoint can not be NULL here. Remove NULL check. CID 1269836 Unintentional integer overflow (OVERFLOW_BEFORE_WIDEN) CID 1301388 Bad bit shift operation (BAD_SHIFT) Add ull to integer constants to ensure the math is done in 64-bits not 32. CID 715749 Explicit null dereferenced (FORWARD_NULL) As far as I can tell this parser function does not accept a line that does match key = value. If that is the case then value should never be NULL. If it is it is a parse error. Updated the code to reflect this. Also modified the intify function to do something more sane (strtol vs atoi with hex detection). CID 1269820 Dereference null return value (NULL_RETURNS) This is a false positive as strchr will never return NULL here. It makes sense, though, to quiet the warning by changing the do {} while () loop to a while () loop. CID 1269780 Dereference after null check (FORWARD_NULL) Just return an error if the endpoint's cpc data is NULL. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2015-05-28 11:58:17 -06:00
Nathan Hjelm	ceb319170a	btl/openib: fix more coverity issues CID 1269674 Ignoring number of bytes read (CHECKED_RETURN) Check that we read enough bytes to get a complete async command. CID 1269793 Missing break in switch (MISSING_BREAK) Added comment to indicate fall through was intentional. CID 1269702: Constant variable guards dead code (DEADCODE) Remove an unused argument to opal_show_help. This will quiet the coverity issue. CID 1269675 Ignoring number of bytes read (CHECKED_RETURN) Check that at least sizeof(int) bytes are read. If this is not the case then it is an error. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2015-05-28 08:38:10 -06:00
Nathan Hjelm	43d678e7ca	btl/openib: fix more coverity issues CID 1269931 Uninitialized scalar variable (UNINIT) Initialize complete async message. This was not a bug but the fix contributes to valgrind cleanness (uninitialized write). CID 1269915 Unintended sign extension (SIGN_EXTENSION) Should never happen. Quieting this by explicitly casting to uint64_t. CID 1269824 Dereference null return value (NULL_RETURNS) It is impossible for opal_list_remove_first to return NULL if opal_list_is_empty returns false. I refactored the code in question to not use opal_list_is_empty but loop until NULL is returned by opal_list_remove_first. That will quiet the issue. CID 1269913 Dereference before null check (REVERSE_INULL) The storage parameter should never be NULL. The check intended to check if *storage was NULL not storage. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2015-05-28 08:38:10 -06:00
Nathan Hjelm	6b86e74218	btl/openib: fix coverity issues CID 1269933 Uninitialized scalar variable (UNINIT) This CID isn't really an error but it is best for both valgrind and coverity cleanness to not write uninitialized data. Added an initializer for async_command in btl_openib_component_close. CID 1269930 Uninitialized scalar variable (UNINIT) Same as above. Best not to write uninitialized data. Added an initializer for async_command. CID 1269701 Logically dead code (DEADCODE) Coverity is correct. The smallest_pp_qp will always be 0. Changed the initial value so that the smallest_pp_qp is set as intended. If no per-peer queue pair exists then use the last shared queue pair. This queue pair should have the smallest message size. This will reduce buffer waste. CID 1269713 Logically dead code (DEADCODE) False positive but easy to silence. The two checks are meaningless if HAVE_XRC is 0 so protect them with #if HAVE_XRC. CID 1269726 Division or modulo by zero (DIVIDE_BY_ZERO) Indeed an issue. If we get an invalid value for rd_win then this will cause a divide-by-zero exception. Added a check to ensure rd_win is > 0. Also updated the help message to reflect this requirement. CID 1269672 Ignoring number of bytes read (CHECKED_RETURN) This error was somewhat intentional. Linux parameter files are probably not empty but it is safer to check the return code of read to make sure we got something. If 0 bytes are read this code could SEGV when running strtoull. CID 1269836 Unintentional integer overflow (OVERFLOW_BEFORE_WIDEN) Add a range check to read_module_param to ensure we do not overflow. In the future it might be worthwhile to report an error because these parameters should never cause overflow in this calculation. CID 1269692 Calling risky function (DC.WEAK_CRYPTO) ??? This call was added in 2006 but I see no calls to the rest of the rand48 family of functions. Anyway, we SHOULD NEVER be calling seed48, srand, etc because it messes with user code. Removed the call to seed48. CID 1269823 Dereference null return value (NULL_RETURNS) This is likely a false positive. The endpoint lock is being held so no other thread should be able to remove fragments from the list. Also, mca_btl_openib_endpoint_post_send should not be removing items from the list. If a NULL fragment is ever returned it will likely be a coding error on the part of an Open MPI developer. Added an assert() to catch this and quiet the coverity error. CID 1269671 Unchecked return value (CHECKED_RETURN) Added a check for the return code of mca_btl_openib_endpoint_post_send to quiet the coverity error. It is unlikely this error path will be traversed. CID 1270229 Missing break in switch (MISSING_BREAK) Add a comment to indicate that the fall-through is intentional. CID 1269735 Dereference after null check (FORWARD_NULL) There should always be an endpoint when handling a work completion. The endpoint is either stored on the fragment or can be looked up using the immediate data. Move the immediate data code up and add an assert for a NULL endpoint. CID 1269740 Dereference after null check (FORWARD_NULL) CID 1269741 Explicit null dereferenced (FORWARD_NULL) Similar to CID 1269735 fix. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2015-05-28 08:38:09 -06:00
Ralph Castain	ce915b5757	Fix a typo that incorrectly set the alignment threshold in the openib BTL. Thanks to Xavier Besseron for pointing it out	2015-05-25 07:12:57 -07:00
Jeff Squyres	76222f462e	btl openib: if ibv_open_device() returns NULL, it's not supported When a libibverbs driver returns NULL for its context, it's the Open MPI libibverbs fake driver. Hence, this device is simply not supported -- ignore it.	2015-04-29 18:07:12 -07:00
Jeff Squyres	df6f7597a4	btl openib: only initialize CPCs if there are devices to use Defer initializing the CPCs until we know that we have devices/ports to use. This both prevents some useless work at startup when there are no devices/ports to use, and also prevents librdmacm complaining that there are no verbs-capable RDMA devices available (e.g., if a Cisco usNIC device is present, but does not present a verbs RDMA interface).	2015-04-29 17:52:41 -07:00
Jeff Squyres	a50ad505e7	There were corner cases that allowed max_reg to be uninitialized. Set a default value so that those corner cases would still have an initialized value in max_reg.	2015-04-28 14:34:17 -07:00
Nathan Hjelm	b06b8584f7	btl/openib/udcm: OBJ_CONSTRUCT UDCM module member before attempting to initialize This commit fixes an assert when trying to cleanup a module we failed to initialize. There is no protection around the OBJ_DESTRUCT calls so they will always be called so similarly we should always call OBJ_CONSTRUCT at init. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2015-04-22 13:25:12 -06:00
Mike Dubman	00784ae3ba	btl/openib: fix compiler warning, by HalR	2015-03-13 13:17:23 +02:00
adrianreber	714d9aa67e	Merge pull request #348 from adrianreber/topic/orte_cr_continue_like_restart Topic/orte cr continue like restart	2015-03-12 14:54:02 +01:00
Nathan Hjelm	ce6caab2a7	Merge pull request #463 from hjelmn/cuda_async btl/openib: cuda: fix CUDA-aware support with async copy	2015-03-11 09:52:48 -06:00
Adrian Reber	f45dd069bd	FT: fix compilation using --with-ft (1/5) Enabling the FT code breaks compilation (again). This series tries to fix the compiler errors. This is again only fixing the compiler errors without any warranty that the result might actually support FT again. This first patch moves orte_cr_continue_like_restart from ORTE to opal_cr_continue_like_restart in OPAL. This only leaves three calls from OPAL to ORTE in the FT code. As it is not yet 100% clear how to handle these calls the code orte_sstore.set_attr() has been #ifdef'd out for now.	2015-03-11 14:23:33 +01:00

238 Commits