openmpi

Author	SHA1	Message	Date
Ralph Castain	657e701c65	Add debug verbosity to the orte data server and pmix pub/lookup functions Start updating the various mappers to the new procedure. Remove the stale lama component as it is now very out-of-date. Bring round_robin and PPR online, and modify the mindist component (but cannot test/debug it). Remove unneeded test Fix memory corruption by re-initializing variable to NULL in loop Resolve the race condition identified by @ggouaillardet by resetting the mapped flag within the same event where it was set. There is no need to retain the flag beyond that point as it isn't used again. Add a new job attribute ORTE_JOB_FULLY_DESCRIBED to indicate that all the job information (including locations and binding) is included in the launch message. Thus, the backend daemons do not need to do any map computation for the job. Use this for the seq, rankfile, and mindist mappers until someone decides to update them. Note that this will maintain functionality, but means that users of those three mappers will see large launch messages and less performant scaling than those using the other mappers. Have the mindist module add procs to the job's proc array as it is a fully described module Protect the hnp-not-in-allocation case Per path suggested by Gilles - protect the HNP node when it gets added in the absence of any other allocation or hostfile Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-05-25 18:41:27 -07:00
Joshua Ladd	02c288c853	Merge pull request #3567 from markalle/pr/yalla_datatypes yalla with irregular contig datatype -- Fixes 3566	2017-05-24 10:48:16 -04:00
Mark Allen	36f51bca26	yalla with irregular contig datatype -- Fixes 3566 Yalla has a macro PML_YALLA_INIT_MXM_REQ_DATA that checks if a datatype is contiguous via opal_datatype_is_contiguous_memory_layout(dt,count) and if so it selects a size and lb that presumably is what will rdma, as ompi_datatype_type_size(_dtype, &size); \ ompi_datatype_type_lb(_dtype, &lb); \ This failed when I gave it a datatype constructed as [ ...] with extent 4. What I mean by that datatype is lens[0] = 3; disps[0] = 1; types[0] = MPI_CHAR; MPI_Type_struct(1, lens, disps, types, &tmpdt); MPI_Type_create_resized(tmpdt, 0, 4, &mydt); So there are 3 chars at offset 1, and the LB is 0 and the UB is 4. So that macro decides that size=4 and lb=0 and later I suppose size is getting updated to 3 for the final rdma, and so a send of a buffer [ 0 1 2 3 ] gets recved as [ 0 1 2 _ ]. I think it should use the true lb and the true extent. For "regular" contig datatypes it would be the same, and for the irregular ones that are still deemed contiguous by that utility function it should still be the right thing to use. Signed-off-by: Mark Allen <markalle@us.ibm.com>	2017-05-23 20:56:12 -04:00
Josh Hursey	facd6d6b33	Merge pull request #3563 from jjhursey/fix/opal-stderr opal/stacktrace: Fix stderr target for opal_stacktrace_output	2017-05-23 08:53:31 -05:00
Joshua Hursey	fce28c31d0	opal/stacktrace: Fix stderr target for opal_stacktrace_output Signed-off-by: Joshua Hursey <jhursey@us.ibm.com>	2017-05-22 13:46:02 -05:00
Jeff Squyres	13bd776d8f	Merge pull request #3560 from thananon/pr/fi_ep_bind btl/usnic : changed fi_ep_bind flags for AV from NULL to 0.	2017-05-22 14:01:27 -04:00
Thananon Patinyasakdikul	bf7534d32c	btl/usnic: changed fi_ep_bind flags for AV from NULL to 0 due to compiler warning. This commit fixed compiler warning generated from earlier commit : ddbe1726c5d19cddbb5754a6d4a20bf2a5966654 Signed-off-by: Thananon Patinyasakdikul <apatinya@cisco.com>	2017-05-22 10:09:43 -07:00
Geoff Paulsen	50f9287c03	Merge pull request #2941 from markalle/pr/mpi-info-update2 Finally Merging this in. MPI_*_get_info/set_info(). Targeting v3.1 release. @hjelmn were you interested in switching some internal pieces to begin using this? Should we target v3.1 (or whatever we call the Oct 15th release?)	2017-05-22 09:22:04 -05:00
Ryan Grant	b59eb76fcf	Merge pull request #3528 from tkordenbrock/topic/mtl-portals4.mtl.rndv.get.race mtl-portals4: in rendezvous, reissue PtlGet() if it fails	2017-05-17 12:57:45 -06:00
Jeff Squyres	ddbe1726c5	Merge pull request #3539 from thananon/usNIC_fi_ep_bind usNIC: fix fi_ep_bind flag. FI_RECV should not be associated with address vector.	2017-05-17 06:25:23 -04:00
Gilles Gouaillardet	c4f64c39d1	Merge pull request #3526 from ggouaillardet/topic/unpack_hetero opal/datatype: do not compute ptypes for OPAL predefined datatypes	2017-05-17 14:35:49 +09:00
Mark Allen	482d84b6e5	fixes for Dave's get/set info code The expected sequence of events for processing info during object creation is that if there's an incoming info arg, it is opal_info_dup()ed into the obj at obj->s_info first. Then interested components register callbacks for keys they want to know about using opal_infosubscribe_infosubscribe(). Inside info_subscribe_subscribe() the specified callback() is called with whatever matching k/v is in the object's info, or with the default. The return string from the callback goes into the new k/v stored in info, and the input k/v is saved as __IN_<key>/<val>. It's saved the same way whether the input came from info or whether it was a default. A null return from the callback indicates an ignored key/val, and no k/v is stored for it, but an __IN_<key>/<val> is still kept so we still have access to the original. At MPI_*_set_info() time, opal_infosubscribe_change_info() is used. That function calls the registered callbacks for each item in the provided info. If the callback returns non-null, the info is updated with that k/v, or if the callback returns null, that key is deleted from info. An __IN_<key>/<val> is saved either way, and overwrites any previously saved value. When MPI_*_get_info() is called, opal_info_dup_mpistandard() is used, which allows relatively easy changes in interpretation of the standard, by looking at both the <key>/<val> and __IN_<key>/<val> in info. Right now it does the following: 1. includes system extras, e.g. k/v defaults not explicitly set by the user; 2. omits ignored keys; 3. shows input values, not callback modifications, e.g. not the internal values. Currently the callbacks are doing things like return some_condition ? "true" : "false"; that is, returning static strings that are not to be freed.
If the return strings start becoming more dynamic in the future I don't see how unallocated strings could support that, so I'd propose a change for the future that the callback()s registered with info_subscribe_subscribe() do a strdup on their return, and we change the callers of callback() to free the strings it returns (there are only two callers). Rough outline of the smaller changes spread over the less central files: comm.c initialize comm->super.s_info to NULL copy into comm->super.s_info in comm creation calls that provide info OBJ_RELEASE comm->super.s_info at free time comm_init.c initialize comm->super.s_info to NULL file.c copy into file->super.s_info if file creation provides info OBJ_RELEASE file->super.s_info at free time win.c copy into win->super.s_info if win creation provides info OBJ_RELEASE win->super.s_info at free time comm_get_info.c file_get_info.c win_get_info.c change_info() if there's no info attached (shouldn't happen if callbacks are registered) copy the info for the user The other category of change is generally addressing compiler warnings where ompi_info_t and opal_info_t were being used a little too interchangeably. An ompi_info_t* contains an opal_info_t*, at &(ompi_info->super). Also this commit updates the copyrights. Signed-off-by: Mark Allen <markalle@us.ibm.com>	2017-05-17 01:12:49 -04:00
Gilles Gouaillardet	384387bb53	Merge pull request #3411 from ggouaillardet/topic/mpi_f08_interfaces_callbacks f08: make procedure(MPI_User_function) type available from mpi_f08	2017-05-17 09:02:26 +09:00
Thananon Patinyasakdikul	a705f2cf7b	usNIC: fix fi_ep_bind flag. FI_RECV should not be associated with av. Signed-off-by: Thananon Patinyasakdikul <tpatinya@utk.edu>	2017-05-16 18:22:28 -04:00
Jeff Squyres	23325c31d3	Merge pull request #3338 from jjhursey/topic/ompi_info_show_failed `ompi_info --show-failed` feature	2017-05-16 17:08:43 -04:00
Jeff Squyres	39fa1d5c05	Merge pull request #3500 from bosilca/topic/any_source Allow MPI_ANY_SOURCE in MPI_Sendrecv_replace.	2017-05-16 16:36:00 -04:00
Jeff Squyres	3e7ce4c034	Merge pull request #3537 from rhc54/topic/news Add 1.10.7 NEWS	2017-05-16 16:17:37 -04:00
Ralph Castain	efb0795ce2	Add 1.10.7 NEWS Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-05-16 08:48:51 -07:00
Todd Kordenbrock	27ee862964	mtl-portals4: in rendezvous, reissue PtlGet() if it fails This commit fixes a race condition in the rendezvous protocol. The race occurs because the sender does not wait for the link event on the send buffer. Even though this has not been seen in the wild, it is possible for the receiver to issue the PtlGet() before the ME is linked which causes a NAK at the receiver. This commit resolves this race by reissuing the PtlGet() when a NAK occurs. Signed-off-by: Todd Kordenbrock <thkgcode@gmail.com>	2017-05-15 13:11:13 -05:00
Gilles Gouaillardet	22ab73cb1a	Merge pull request #3471 from ggouaillardet/topic/execve_cmd odls: fix handling of the orte fork agent	2017-05-15 15:07:39 +09:00
Gilles Gouaillardet	5a35a8e82c	opal/datatype: do not compute ptypes for OPAL predefined datatypes Fixes open-mpi/ompi#3522 Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2017-05-15 11:43:48 +09:00
Ralph Castain	e682b5d7d8	Merge pull request #3523 from rhc54/topic/cleanup Remove debug	2017-05-12 13:45:55 -07:00
Ralph Castain	b527c40dae	Remove debug Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-05-12 12:41:36 -07:00
David Solt	50aa143ab6	Major structural changes to data types: .super infosubscriber ompi_communicator_t, ompi_win_t, ompi_file_t all have a super class of type opal_infosubscriber_t instead of a base/super type of opal_object_t (in previous code comm used c_base, but file used super). It may be a bit bold to say that being a subscriber of MPI_Info is the foundational piece that ties these three things together, but if you object, then I would prefer to turn infosubscriber into a more general name that encompasses other common features rather than create a different super class. The key here is that we want to be able to pass comm, win and file objects as if they were opal_infosubscriber_t, so that one routine can handle all 3 types of objects being passed to it. MPI_INFO_NULL is still an ompi_predefined_info_t type since an MPI_Info is part of ompi but the internal details of the underlying information concept are part of opal. An ompi_info_t type still exists for exposure to the user, but it is simply a wrapper for the opal object. Routines such as ompi_info_dup, etc have all been moved to opal_info_dup and related to the opal directory. Fortran to C translation tables are only used for MPI_Info that is exposed to the application and are therefore part of the ompi_info_t and not the opal_info_t. The data structure changes are primarily in the following files: communicator/communicator.h ompi/info/info.h ompi/win/win.h ompi/file/file.h The following new files were created: opal/util/info.h opal/util/info.c opal/util/info_subscriber.h opal/util/info_subscriber.c This infosubscriber concept is that communicators, files and windows can have subscribers that subscribe to any changes in the info associated with the comm/file/window. When xxx_set_info is called, the new info is presented to each subscriber who can modify the info in any way they want. The new value is presented to the next subscriber and so on until all subscribers have had a chance to modify the value.
Therefore, the order of subscribers can make a difference but we hope that there is generally only one subscriber that cares about or modifies any given key/value pair. The final info is then stored and returned by a call to xxx_get_info. The new model can be seen in the following files: ompi/mpi/c/comm_get_info.c ompi/mpi/c/comm_set_info.c ompi/mpi/c/file_get_info.c ompi/mpi/c/file_set_info.c ompi/mpi/c/win_get_info.c ompi/mpi/c/win_set_info.c The current subscribers were changed as follows: mca/io/ompio/io_ompio_file_open.c mca/io/ompio/io_ompio_module.c mca/osc/rmda/osc_rdma_component.c (This one actually subscribes to "no_locks") mca/osc/sm/osc_sm_component.c (This one actually subscribes to "blocking_fence" and "alloc_shared_contig") Signed-off-by: Mark Allen <markalle@us.ibm.com> Conflicts: AUTHORS ompi/communicator/comm.c ompi/debuggers/ompi_mpihandles_dll.c ompi/file/file.c ompi/file/file.h ompi/info/info.c ompi/mca/io/ompio/io_ompio.h ompi/mca/io/ompio/io_ompio_file_open.c ompi/mca/io/ompio/io_ompio_file_set_view.c ompi/mca/osc/pt2pt/osc_pt2pt.h ompi/mca/sharedfp/addproc/sharedfp_addproc.h ompi/mca/sharedfp/addproc/sharedfp_addproc_file_open.c ompi/mca/topo/treematch/topo_treematch_dist_graph_create.c ompi/mpi/c/lookup_name.c ompi/mpi/c/publish_name.c ompi/mpi/c/unpublish_name.c opal/mca/mpool/base/mpool_base_alloc.c opal/util/Makefile.am	2017-05-12 14:41:05 -04:00
Ralph Castain	23af6c9d02	Merge pull request #3519 from rhc54/topic/nolocal Fix --nolocal	2017-05-12 09:57:52 -07:00
Ralph Castain	4e5e8be85e	Merge pull request #3520 from rhc54/topic/slotsalloc Fix total_slots_allocated computation	2017-05-12 09:38:00 -07:00
Ralph Castain	45bbd598c1	Fix --nolocal Fix the --nolocal option by ensuring we always check/remove the HNP from the list of available nodes if the flag is set Ensure that the HNP node is included as available when nothing else is given Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-05-12 09:03:26 -07:00
Ralph Castain	29e083bffd	Fix total_slots_allocated computation On unmanaged allocations, we need to update the total_slots_allocated once the daemons have been launched and "discovered" their topology Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-05-12 08:21:52 -07:00
Jeff Squyres	9f317f0a5c	Merge pull request #1390 from jsquyres/pr/minor-monitoring-library-cleanups monitoring lib: rename to ompi_monitoring_prof.so	2017-05-11 11:11:02 -04:00
Ralph Castain	2f507a1113	Merge pull request #3517 from rhc54/topic/cisco2 When a daemon force-terminates, we don't get the show_help message it…	2017-05-11 07:47:25 -07:00
Ralph Castain	9164afbb08	When a daemon force-terminates, we don't get the show_help message it was trying to send because the message is at a lower priority than the termination event. Resolve this by putting the oob in its own progress thread. Also, use only that one thread by default - if someone needs more progress threads in the OOB, they can use the MCA param to get them. Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-05-11 06:52:55 -07:00
KAWASHIMA Takahiro	0650d4141f	Merge pull request #3401 from kawashima-fj/pr/fortran-argv-null fortran: Fix `MPI_ARGV(S)_NULL` compilation error	2017-05-11 11:23:12 +09:00
KAWASHIMA Takahiro	854fa5fc55	Merge pull request #3489 from kawashima-fj/pr/group-remote-peers-2nd group: Fix `ompi_group_have_remote_peers` (2nd try)	2017-05-11 11:22:15 +09:00
Ralph Castain	987339ed76	Merge pull request #3515 from rhc54/topic/cisco2 Add some more debug output	2017-05-10 17:29:18 -07:00
Ralph Castain	f47124e4d3	Finally fix the problem - the key was knowing there were more than 2 topologies involved, and that the HNP is not allocated. Give up on being cute and just search the darned list of topologies - there won't be that many, and if there are (so the scan takes awhile), then too bad. Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-05-10 16:44:19 -07:00
Ralph Castain	3b29b78a19	Merge pull request #3507 from rhc54/topic/cleanup Sigh - remove debug	2017-05-10 13:36:27 -07:00
Matias A Cabral	644641d06f	PSM and PSM2 MTLs check on the max message size allowed by API. OMPI send and receive messages use size_t for the length while PSM and PSM2 psm(2)mq_send/receive use uint32_t. Type size_t is 64 bits on 64-bit architectures. Therefore, this patch adds a sanity check on the length of the message and fails gracefully. Signed-off-by: Matias Cabral <matias.a.cabral@intel.com>	2017-05-10 12:45:11 -07:00
Ralph Castain	55f4b825af	Add verbose output to nidmap code for debugging as this is a new, and sometimes fragile, feature Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-05-10 12:40:02 -07:00
Ralph Castain	911961ee21	Sigh - remove debug Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-05-10 11:26:42 -07:00
Ralph Castain	2d93d15aa7	Merge pull request #3502 from rhc54/topic/cisco Fix nidmap computation to deal with hetero nodes	2017-05-10 11:21:12 -07:00
Ralph Castain	c42ce3eeea	Merge pull request #3505 from rhc54/topic/rmlofi Update the RML OFI by copying the updated files from @anandhis branch	2017-05-10 11:20:46 -07:00
Ralph Castain	50646b07ce	Update the RML OFI by copying the updated files from @anandhis branch Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-05-10 09:17:06 -07:00
Jeff Squyres	c34ba88b22	monitoring lib: fix some Makefile.am macros * Use the proper lib prefix name * Use the proper extra LDFLAGS Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2017-05-10 09:03:59 -07:00
Jeff Squyres	626167f2a9	monitoring lib: rename to ompi_monitoring_prof.so The library that is installed is specific to Open MPI, so put an "ompi_" prefix on it. Also do some minor line wrappings and cleanups of text. Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2017-05-10 09:03:56 -07:00
Ralph Castain	442e307a6e	Fix the nidmap computation to deal with hetero nodes Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-05-10 08:43:28 -07:00
Gilles Gouaillardet	026f3dd2dd	pmix2x: plug a misc memory leak Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2017-05-10 14:57:44 +09:00
Gilles Gouaillardet	3c6631ff6c	opal: fix FIND_FIRST_ZERO macro for opal_pointer_array internal handling Thanks George for the patch. Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2017-05-10 14:57:44 +09:00
George Bosilca	86a7b317a5	Allow MPI_ANY_SOURCE in MPI_Sendrecv_replace. Signed-off-by: George Bosilca <bosilca@icl.utk.edu>	2017-05-09 16:57:15 -04:00
bosilca	d7ebcca93f	Add volatile to the pointer in the list_item structure. (#3468) This change has the side effect of improving the performance of all atomic data structures (in addition to making the code correct under a certain interpretation of the volatile usage). This commit fixes #3450. Signed-off-by: George Bosilca <bosilca@icl.utk.edu>	2017-05-09 10:12:20 -04:00
bosilca	cbf03b3113	Topic/datatype (#3441) * Don't overflow the internal datatype count. Change the type of the count to be a size_t (it does not alter the total size of the internal structures, so has no impact on the ABI). Signed-off-by: George Bosilca <bosilca@icl.utk.edu> * Optimize the datatype creation. The internal array of counts of predefined types is now only created when needed, which is either in a heterogeneous environment, or when one calls get_elements. It saves space and makes the convertor creation a little faster in some cases. Rearrange the fields in the datatype description structs. The macro OPAL_DATATYPE_INIT_PTYPES_ARRAY had a bug, and the static array was only partially created. All predefined types should have the ptypes array created and initialized. Signed-off-by: George Bosilca <bosilca@icl.utk.edu> * Fix the boundary computation. Signed-off-by: George Bosilca <bosilca@icl.utk.edu> * test/datatype: add test for short unpack on heterogeneous cluster Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp> Signed-off-by: George Bosilca <bosilca@icl.utk.edu> * Trying to reduce the cost of creating a convertor. Signed-off-by: George Bosilca <bosilca@icl.utk.edu> * Respect the unpack boundaries. As Gilles suggested on #2535 the opal_unpack_general_function was unpacking based on the requested count and not on the amount of packed data provided. Fixes #2535. Signed-off-by: George Bosilca <bosilca@icl.utk.edu>	2017-05-09 09:31:40 -04:00
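
Commit 644641d06f above describes a sanity check for message lengths that arrive as size_t but must be handed to an API taking uint32_t. A minimal sketch of that kind of check follows; it is not the code from the commit, and the names sketch_checked_len, SKETCH_OK and SKETCH_ERR_TOO_LARGE are hypothetical, introduced only for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical return codes for this sketch; not Open MPI's constants. */
enum { SKETCH_OK = 0, SKETCH_ERR_TOO_LARGE = -1 };

/*
 * Narrow a size_t length to the uint32_t that the transport accepts.
 * Returns SKETCH_ERR_TOO_LARGE instead of silently truncating the value;
 * *out is written only on success.
 */
static int sketch_checked_len(size_t len, uint32_t *out)
{
    if (len > (size_t)UINT32_MAX) {
        return SKETCH_ERR_TOO_LARGE;
    }
    *out = (uint32_t)len;
    return SKETCH_OK;
}
```

A caller would run this check before issuing the send or receive and return an error to the user when the length does not fit, rather than letting the cast drop the upper bits.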


27090 commits
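
Commit 36f51bca26 above argues that a contiguous but irregular datatype (3 chars at offset 1, resized to LB 0 and extent 4) should be transferred using its true lower bound and true extent. The sketch below is a plain C model of that reasoning, not Open MPI or MPI code: struct sketch_type and the sketch_* functions are hypothetical, and the model assumes a datatype with a single contiguous data block. In a real MPI program the corresponding values would come from MPI_Type_get_true_extent.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a one-block datatype; not an Open MPI struct. */
struct sketch_type {
    ptrdiff_t lb;     /* resized lower bound (0 in the commit's example) */
    size_t extent;    /* resized extent (4 in the commit's example) */
    ptrdiff_t disp;   /* offset of the first data byte (1) */
    size_t len;       /* number of contiguous data bytes (3) */
};

/* For a single block, the true bounds cover only the bytes that hold data. */
static ptrdiff_t sketch_true_lb(const struct sketch_type *t) { return t->disp; }
static size_t sketch_true_extent(const struct sketch_type *t) { return t->len; }

/* Copy the data bytes of one element from buf into out; return the count. */
static size_t sketch_pack_one(const struct sketch_type *t,
                              const char *buf, char *out)
{
    size_t n = sketch_true_extent(t);
    memcpy(out, buf + sketch_true_lb(t), n);
    return n;
}
```

With the commit's example type and a buffer holding 0 1 2 3, this model copies the three bytes 1 2 3. Starting at the resized LB of 0 instead would pick up 0 1 2, which matches the wrong result the commit message reports.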