openmpi

Author	SHA1	Message	Date
Elena	c905fe9b78	pmix: removed pmix_base_direct modex mca parameter, renamed orte_full_modex_cutoff and ompi_hostname_cutoff to direct_modex_cutoff	2014-10-09 06:15:31 +02:00
Ralph Castain	fd6a044b7f	Cleanup some cruft resulting from the move of the btl's to opal. We had created the ability to delay modex operations, which included a need to delay retrieving hostname info for remote procs. This allowed us to not retrieve the modex info until first message unless required - the hostname is generally only required for debug and error messages. Properly setup the opal_process_info structure early in the initialization procedure. Define the local hostname right at the beginning of opal_init so all parts of opal can use it. Overlay that during orte_init as the user may choose to remove fqdn and strip prefixes during that time. Setup the job_session_dir and other such info immediately when it becomes available during orte_init.	2014-10-03 16:02:57 -06:00
Jeff Squyres	413e775dbf	version configury: make dist now works Update the VERSION file scheme: * Remove "want_repo_rev". * Add "tarball_version". All values are now always included (major, minor, release, greek, repo_rev). However, configure.ac now runs "opal_get_version.sh ... --tarball", which will return the value of tarball_version (if it is non-empty) or the "full" version string (i.e., "major.minor.releasegreek").	2014-10-02 11:32:54 -07:00
Ralph Castain	69328c30f5	Simplify the check for abort_print_stack by removing stale #ifdefined cmr=v1.8.4:reviewer=jsquyres This commit was SVN r32821.	2014-09-30 19:38:29 +00:00
Ralph Castain	dfb952fa78	[Contribution from Artem - moved it to svn from git for him] Replace our old, clunky timing setup with a much nicer one that is only available if configured with --enable-timing. Add a tool for profiling clock differences between the nodes so you can get more precise timing measurements. I'll ask Artem to update the Github wiki with full instructions on how to use this setup. This commit was SVN r32738.	2014-09-15 18:00:46 +00:00
Ralph Castain	aec5cd08bd	Per the PMIx RFC: WHAT: Merge the PMIx branch into the devel repo, creating a new OPAL “pmix” framework to abstract PMI support for all RTEs. Replace the ORTE daemon-level collectives with a new PMIx server and update the ORTE grpcomm framework to support server-to-server collectives WHY: We’ve had problems dealing with variations in PMI implementations, and need to extend the existing PMI definitions to meet exascale requirements. WHEN: Mon, Aug 25 WHERE: https://github.com/rhc54/ompi-svn-mirror.git Several community members have been working on a refactoring of the current PMI support within OMPI. Although the APIs are common, Slurm and Cray implement a different range of capabilities, and package them differently. For example, Cray provides an integrated PMI-1/2 library, while Slurm separates the two and requires the user to specify the one to be used at runtime. In addition, several bugs in the Slurm implementations have caused problems requiring extra coding. All this has led to a slew of #if’s in the PMI code and bugs when the corner-case logic for one implementation accidentally traps the other. Extending this support to other implementations would have increased this complexity to an unacceptable level. Accordingly, we have: * created a new OPAL “pmix” framework to abstract the PMI support, with separate components for Cray, Slurm PMI-1, and Slurm PMI-2 implementations. * Replaced the current ORTE grpcomm daemon-based collective operation with an integrated PMIx server, and updated the grpcomm APIs to provide more flexible, multi-algorithm support for collective operations. At this time, only the xcast and allgather operations are supported. * Replaced the current global collective id with a signature based on the names of the participating procs. 
This allows an unlimited number of collectives to be executed by any group of processes, subject to the requirement that only one collective can be active at a time for a unique combination of procs. Note that a proc can be involved in any number of simultaneous collectives - it is the specific combination of procs that is subject to the constraint * removed the prior OMPI/OPAL modex code * added new macros for executing modex send/recv to simplify use of the new APIs. The send macros allow the caller to specify whether or not the BTL supports async modex operations - if so, then the non-blocking “fence” operation is used, if the active PMIx component supports it. Otherwise, the default is a full blocking modex exchange as we currently perform. * retained the current flag that directs us to use a blocking fence operation, but only to retrieve data upon demand This commit was SVN r32570.	2014-08-21 18:56:47 +00:00
Jeff Squyres	0a398c155f	opal MCA params: Move (and adapt) help message to opal help file This commit was SVN r32547.	2014-08-16 11:54:41 +00:00
Jeff Squyres	132375f07f	helpfiles: fix filenames referenced by calls to show_help() This commit was SVN r32453.	2014-08-08 13:34:15 +00:00
Gilles Gouaillardet	f7b13d1126	Fix missing ampersand. Also replace the OMPI_CAST_RTE_NAME macro with an inline function if OPAL_ENABLE_DEBUG, so we can get warnings from the compiler if ampersand is missing. Thanks to Paul Hargrove for reporting the bugs This commit was SVN r32408.	2014-08-04 02:52:56 +00:00
Ralph Castain	daeb9b6c4f	Some more cleanups. Remove direct references to ORTE by changing OMPI_CAST_ORTE_NAME -> OMPI_CAST_RTE_NAME. Ensure that ORTE tools (mpirun, orted, tools) set the OPAL proc structure fields so OPAL knows what is going on and uses the correct print functions (still need to fix the problem for non-MPI apps). Properly return uint32_t from the opal utilities instead of int32_t as that is what the ORTE process name fields contain. Thanks to Gilles for pointing out some of the discrepancies. This commit was SVN r32398.	2014-08-01 14:44:11 +00:00
George Bosilca	9b2fcd898e	No more ORTE specifics in this file. This commit was SVN r32384.	2014-07-31 22:34:16 +00:00
George Bosilca	f39abb9e69	Reverting r32355: a number of processes is not a notion that a low level communication library should use to initialize itself. Ralph will champion this change back with an RFC if there is a realistic need/use case from the community. This commit was SVN r32361. The following SVN revision numbers were found above: r32355 --> open-mpi/ompi@c903917f47	2014-07-30 20:11:35 +00:00
Ralph Castain	c903917f47	Expose the num_procs information to the opal layer as the info is needed in several BTLs This commit was SVN r32355.	2014-07-30 09:33:41 +00:00
Ralph Castain	bcade48e27	Move the opal_process_info initialization to the right place - all that info is known (for ourselves only) immediately after rte_init. This commit was SVN r32329.	2014-07-28 19:19:35 +00:00
George Bosilca	a3feb627cf	Move some of the ompi_process_info down in OPAL. This commit was SVN r32324.	2014-07-26 21:43:34 +00:00
Ralph Castain	552c9ca5a0	George did the work and deserves all the credit for it. Ralph did the merge, and deserves whatever blame results from errors in it :-) WHAT: Open our low-level communication infrastructure by moving all necessary components (btl/rcache/allocator/mpool) down in OPAL All the components required for inter-process communications are currently deeply integrated in the OMPI layer. Several groups/institutions have expressed interest in having a more generic communication infrastructure, without all the OMPI layer dependencies. This communication layer should be made available at a different software level, available to all layers in the Open MPI software stack. As an example, our ORTE layer could replace the current OOB and instead use the BTL directly, gaining access to more reactive network interfaces than TCP. Similarly, external software libraries could take advantage of our highly optimized AM (active message) communication layer for their own purpose. UTK, with support from Sandia, developed a version of Open MPI where the entire communication infrastructure has been moved down to OPAL (btl/rcache/allocator/mpool). Most of the moved components have been updated to match the new schema, with few exceptions (mainly BTLs where I have no way of compiling/testing them). Thus, the completion of this RFC is tied to being able to complete this move for all BTLs. For this we need help from the rest of the Open MPI community, especially those supporting some of the BTLs. A non-exhaustive list of BTLs that qualify here is: mx, portals4, scif, udapl, ugni, usnic. This commit was SVN r32317.	2014-07-26 00:47:28 +00:00
Ralph Castain	6c5e592785	Revert r32222, r32210, and r32203 as they created a problem when daemon collectives did not involve app procs on every node. Instead, modify the ompi/mca/rte/orte/rte_orte.h to add a new function that allows apps to request new daemon collective ids for use in barrier and modex operations. This will only appear in ORTE-based installations, but it is only being used by a couple of researchers at the moment. Update the orte/test/mpi/coll_test.c test to show the revised example. This commit was SVN r32234. The following SVN revision numbers were found above: r32203 --> open-mpi/ompi@a523dba41d r32210 --> open-mpi/ompi@2ce11ed5c4 r32222 --> open-mpi/ompi@d55f16db50	2014-07-15 03:48:00 +00:00
Ralph Castain	a523dba41d	NOTE: this modifies the MPI-RTE interface We have been getting several requests for new collectives that need to be inserted in various places of the MPI layer, all in support of either checkpoint/restart or various research efforts. Until now, this would require that the collective id's be generated at launch, which required modifications to ORTE and other places. We chose not to make collectives reusable as the race conditions associated with resetting collective counters are daunting. This commit extends the collective system to allow self-generation of collective id's that the daemons need to support, thereby allowing developers to request any number of collectives for their work. There is one restriction: RTE collectives must occur at the process level - i.e., we don't currently have a way of tagging the collective to a specific thread. From the comment in the code: * In order to allow scalable * generation of collective id's, they are formed as: * * top 32-bits are the jobid of the procs involved in * the collective. For collectives across multiple jobs * (e.g., in a connect_accept), the daemon jobid will * be used as the id will be issued by mpirun. This * won't cause problems because daemons don't use the * collective_id * * bottom 32-bits are a rolling counter that recycles * when the max is hit. The daemon will cleanup each * collective upon completion, so this means a job can * never have more than 2^32 collectives going on at a time. If someone needs more than that - they've got * a problem. * * Note that this means (for now) that RTE-level collectives * cannot be done by individual threads - they must be * done at the overall process level. This is required as * there is no guaranteed ordering for the collective id's, * and all the participants must agree on the id of the * collective they are executing. 
So if thread A on one * process asks for a collective id before thread B does, * but B asks before A on another process, the collectives will * be mixed and not result in the expected behavior. We may * find a way to relax this requirement in the future by * adding a thread context id to the jobid field (maybe taking the * lower 16-bits of that field). This commit includes a test program (orte/test/mpi/coll_test.c) that cycles 100 times across barrier and modex collectives. This commit was SVN r32203.	2014-07-10 18:53:12 +00:00
Jeff Squyres	852af8b834	ompi_mpi_abort: fix corner cases, simplify logic I recently found a case where ompi_mpi_abort() segv's: {{{ $ mpirun --mca btl non_existent_btl_name ... }}} In this case, the BML init fails because we have no paths to any peers. It calls ompi_mpi_abort(), but this is before ompi_comm_self has been setup. ompi_mpi_abort() assumes that if the comm parameter is != NULL, it can be used. But since we aborted so early in MPI_INIT, that's a false assumption. (note that this isn't happening on v1.8 because the check for INIT/FINALIZE in ompi_mpi_abort() is a little different. Hence: this is a trunk issue -- at least for now) When fixing this problem, I noticed a few other problems in ompi_mpi_abort(): * the group access was incorrect (it didn't use accessor functions) * it wasn't clear that ORTE's ompi_rte_abort_peers() returns NOT_IMPLEMENTED and falls through down to ompi_rte_abort() * the check for my proc in the communicator was a little more complicated than necessary * the logic for checking for aborts early in MPI_INIT wasn't right * some comments were stale * the hostname output in error messages would be NULL if MPI_FINALIZE had been invoked * it was possible to abort, but still exit with a 0 status This commit fixes all of the above problems, and makes the logic a little more straightforward. Thanks to Ralph Castain and George Bosilca for the assists with this patch. This commit was SVN r32125.	2014-07-03 02:38:27 +00:00
George Bosilca	843ef1fcb0	ompi_mpi_abort had one extra argument that was never used. Clean it up. This commit was SVN r32124.	2014-07-03 00:34:44 +00:00
Jeff Squyres	8e52ba423f	finalize/disconnect: add explicit comment about why we use an RTE barrier Based on extensive discussions before/at the June 2014 developer's meeting, put a lengthy comment explaining a second reason why we ''must'' use an RTE barrier during MPI_FINALIZE and MPI_COMM_DISCONNECT (i.e., unreliable transports). Slightly explain more the original reason why we do this, too (BTLs can lie/buffer a message without actually injecting it on the network). This commit was SVN r32095.	2014-06-26 14:31:40 +00:00
Ralph Castain	f3cb124e50	Revert r32082 and r32070 - the developer's conference has decided to go a different direction on the threaded progress effort. This will involve some degree of prototyping to understand the tradeoffs prior to making a final design decision, and so we'll hold off on the final change until that is completed. This commit was SVN r32089. The following SVN revision numbers were found above: r32070 --> open-mpi/ompi@12d92d0c22 r32082 --> open-mpi/ompi@aa6438ef7a	2014-06-25 20:43:28 +00:00
Ralph Castain	f70b4a33ec	Per the developer conference, let's be a little nicer during MPI_Finalize and ease up on the cpu by inserting usleep into the loop over opal_progress while waiting for the RTE barrier to complete. This is a non-performant area of the code, and while most codes may call finalize at close-to-similar times, there are some that may choose to have one or more procs continue to perform some work prior to finalizing. So save a little power while we are waiting. cmr=v1.8.2:reviewer=jladd:subject=save power during finalize This commit was SVN r32077.	2014-06-24 21:59:50 +00:00
Ralph Castain	12d92d0c22	Per the OMPI developer conference, remove the last vestiges of OMPI_USE_PROGRESS_THREADS This commit was SVN r32070.	2014-06-24 17:05:11 +00:00
Ralph Castain	248a4b100f	Per Artem, we don't know our VPID at the time of getting the initial timing mark, so just get it if timing is requested This commit was SVN r31951.	2014-06-04 16:28:41 +00:00
George Bosilca	750c6c7861	Update the UTK copyright on the topology related files. This commit was SVN r31805.	2014-05-16 22:23:52 +00:00
Nathan Hjelm	e97e4cf924	Add missing include. cmr=v1.8.2:ticket=trac:4639 This commit was SVN r31784. The following Trac tickets were found above: Ticket 4639 --> https://svn.open-mpi.org/trac/ompi/ticket/4639	2014-05-15 19:52:06 +00:00
Nathan Hjelm	faf008f527	Fix bugs that were causing leaks in finalize. This commit fixes leaks of bml endpoints in finalize. A summary of the bugs/fixes is below. 1) ompi_mpi_finalize used ompi_proc_all to get the list of procs but never released the reference to them (ompi_proc_all called OBJ_RETAIN on all the procs returned). When calling del_procs at finalize it should suffice to call ompi_proc_world which does not increment the reference count. 2) del_procs is called BEFORE ompi_comm_finalize. This leaves the references to the procs from calling the pml_add_comm function. The fix is to reorder the calls to do ompi_comm_finalize, del_procs, pml_finalize instead of del_procs, pml_finalize, ompi_comm_finalize. 3) The check in del_procs in r2 checked for a reference count of 1. This is incorrect. At this point there should be 2 references: 1 from ompi_proc, and another from the add_procs. The fix is to change this check to look for a reference count of 2. This check makes me extremely uncomfortable as nothing will call del_procs if the reference count of a proc is not 2 when del_procs is called. Maybe there should be an assert since this is a developer error IMHO. cmr=v1.8.2:reviewer=bosilca This commit was SVN r31782. The following SVN revision numbers were found above: r2 --> open-mpi/ompi@58fdc18855	2014-05-15 18:28:03 +00:00
Nathan Hjelm	e4db2c3ebb	ompi: fix various small leaks This commit fixes three leaks: - bml/r2: fix leak of del_procs in mca_bml_r2_del_procs - Release the modex data in btl/scif, btl/ugni, and btl/vader - ompi_mpi_finalize: close the allocator framework cmr=v1.8.2:reviewer=jsquyres This commit was SVN r31778. The following SVN revision numbers were found above: r2 --> open-mpi/ompi@58fdc18855	2014-05-15 15:59:51 +00:00
Ralph Castain	ab4f8585b0	When we abort during MPI_Init, we currently emit a totally incorrect error message stating that we were unable to aggregate error messages and cannot guarantee all other processes were killed. This simply isn't true IF the rte has been initialized. So track that the rte has reached that point, and only emit the new message if it is accurate. Note that we still generate a TON of output for a minor error: Ralphs-iMac:examples rhc$ mpirun -n 3 -mca btl sm ./hello_c -------------------------------------------------------------------------- At least one pair of MPI processes are unable to reach each other for MPI communications. This means that no Open MPI device has indicated that it can be used to communicate between these processes. This is an error; Open MPI requires that all MPI processes be able to reach each other. This error can sometimes be the result of forgetting to specify the "self" BTL. Process 1 ([[50239,1],2]) is on host: Ralphs-iMac Process 2 ([[50239,1],2]) is on host: Ralphs-iMac BTLs attempted: sm Your MPI job is now going to abort; sorry. -------------------------------------------------------------------------- * An error occurred in MPI_Init * on a NULL communicator * MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort, * and potentially your MPI job) * An error occurred in MPI_Init * on a NULL communicator * MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort, * and potentially your MPI job) * An error occurred in MPI_Init * on a NULL communicator * MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort, * and potentially your MPI job) -------------------------------------------------------------------------- MPI_INIT has failed because at least one MPI process is unreachable from another. This usually means that an underlying communication plugin -- such as a BTL or an MTL -- has either not loaded or not allowed itself to be used. Your MPI job will now abort. 
You may wish to try to narrow down the problem; * Check the output of ompi_info to see which BTL/MTL plugins are available. * Run your application with MPI_THREAD_SINGLE. * Set the MCA parameter btl_base_verbose to 100 (or mtl_base_verbose, if using MTL-based communications) to see exactly which communication plugins were considered and/or discarded. -------------------------------------------------------------------------- ------------------------------------------------------- Primary job terminated normally, but 1 process returned a non-zero exit code.. Per user-direction, the job has been aborted. ------------------------------------------------------- -------------------------------------------------------------------------- mpirun detected that one or more processes exited with non-zero status, thus causing the job to be terminated. The first process to do so was: Process name: [[50239,1],2] Exit code: 1 -------------------------------------------------------------------------- [Ralphs-iMac.local:23227] 2 more processes have sent help message help-mca-bml-r2.txt / unreachable proc [Ralphs-iMac.local:23227] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages [Ralphs-iMac.local:23227] 2 more processes have sent help message help-mpi-runtime / mpi_init:startup:pml-add-procs-fail Ralphs-iMac:examples rhc$ Hopefully, we can agree on a way to reduce this verbiage! This commit was SVN r31686. The following SVN revision numbers were found above: r2 --> open-mpi/ompi@58fdc18855	2014-05-08 15:48:16 +00:00
Ralph Castain	c4c9bc1573	As per the RFC: http://www.open-mpi.org/community/lists/devel/2014/04/14496.php Revamp the opal database framework, including renaming it to "dstore" to reflect that it isn't a "database". Move the "db" framework to ORTE for now, soon to move to ORCM This commit was SVN r31557.	2014-04-29 21:49:23 +00:00
Nathan Hjelm	595a6e94e6	Fix typos in r31260 Also added some missing values and sentinels. cmr=v1.8:ticket=trac:4470 This commit was SVN r31263. The following SVN revision numbers were found above: r31260 --> open-mpi/ompi@69036437b7 The following Trac tickets were found above: Ticket 4470 --> https://svn.open-mpi.org/trac/ompi/ticket/4470	2014-03-27 22:34:28 +00:00
Jeff Squyres	12a4d1a27f	Minor update to r30430: put the variables at the top of the function instead of making an inner block. Refs trac:4185 This commit was SVN r30588. The following SVN revision numbers were found above: r30430 --> open-mpi/ompi@ea3cb1e110 The following Trac tickets were found above: Ticket 4185 --> https://svn.open-mpi.org/trac/ompi/ticket/4185	2014-02-06 18:37:19 +00:00
Jeff Squyres	fad3cbf639	Revert r30571. This commit was SVN r30587. The following SVN revision numbers were found above: r30571 --> open-mpi/ompi@081b679881	2014-02-06 18:35:30 +00:00
Mike Dubman	081b679881	OMPI: add call to del_procs fixed by AlexM, reviewed by miked cmr=v1.7.5:reviewer=ompi-rm1.7 This commit was SVN r30571.	2014-02-06 08:38:32 +00:00
George Bosilca	bde9619386	Various minor cleanups. This commit was SVN r30431.	2014-01-26 17:27:12 +00:00
George Bosilca	ea3cb1e110	Don't forget to call del_procs. This commit was SVN r30430.	2014-01-26 17:26:40 +00:00
Mike Dubman	2af0f878bc	remove bml_init call, called from btl add_proc. Refs trac:3763 This commit was SVN r30310. The following Trac tickets were found above: Ticket 3763 --> https://svn.open-mpi.org/trac/ompi/ticket/3763	2014-01-17 16:52:20 +00:00
Mike Dubman	b7750ccbf4	OSHMEM: bml initialization is moved into ompi_init it fixes race of mca_var segfault in finalization of shmem based on this thread: http://www.open-mpi.org/community/lists/devel/2014/01/13778.php Refs trac:3763 fixed by Igor, reviewed by Brian This commit was SVN r30304. The following Trac tickets were found above: Ticket 3763 --> https://svn.open-mpi.org/trac/ompi/ticket/3763	2014-01-17 06:09:29 +00:00
Ralph Castain	e7710873a1	Open/close the RTE framework cmr=v1.7.4:reviewer=hjelmn This commit was SVN r30270.	2014-01-13 17:43:24 +00:00
Ralph Castain	286ff6d552	For large scale systems, we would like to avoid doing a full modex during MPI_Init so that launch will scale a little better. At the moment, our options are somewhat limited as only a few BTLs don't immediately call modex_recv on all procs during startup. However, for those situations where someone can take advantage of it, add the ability to do a "modex on demand" retrieval of data from remote procs when we launch via mpirun. NOTE: launch performance will be absolutely awful if you do this with BTLs that aren't configured to modex_recv on first message! Even with "modex on demand", we still have to do a barrier in place of the modex - we simply don't move any data around, which does reduce the time impact. The barrier is required to ensure that the other proc has in fact registered all its BTL info and therefore is prepared to hand over a complete data package. Otherwise, you may not get the info you need. In addition, the shared memory BTL can fail to properly rendezvous as it expects the barrier to be in place. This behavior will only take effect under the following conditions: 1. launched via mpirun 2. #procs is greater than ompi_hostname_cutoff, which defaults to UINT32_MAX 3. mca param rte_orte_direct_modex is set to 1. At the moment, we are having problems getting this param to register properly, so only the first two conditions are in effect. Still, the bottom line is you have to want this behavior to get it. The planned next evolution of this will be to make the direct modex be non-blocking - this will require two fixes: 1. if the remote proc doesn't have the required info, then let it delay its response until it does. This means we need a way for the MPI layer to tell the RTE "I am done entering modex data". 2. adjust the SM rendezvous logic to loop until the required file has been created Creating a placeholder to bring this over to 1.7.5 when ready. 
cmr=v1.7.5:reviewer=hjelmn:subject=Enable direct modex at scale This commit was SVN r30259.	2014-01-11 17:36:06 +00:00
Brian Barrett	8b778903d8	Fix longstanding issue with our multi-project support. Rather than using pkg{data,lib,includedir}, use our own ompi{data,lib,includedir}, which is always set to {datadir,libdir,includedir}/openmpi. This will keep us from having help files in prefix/share/open-rte when building without Open MPI, but in prefix/share/openmpi when building with Open MPI. This commit was SVN r30140.	2014-01-07 22:11:15 +00:00
Jeff Squyres	365ce2cd03	Fix minor MPI thread memory leak / fix valgrind still-reachable warning. cmr=v1.7.5:reviewer=brbarret:subject=Fix minor MPI thread memory leak This commit was SVN r30072.	2013-12-24 11:05:51 +00:00
Rolf vandeVaart	4cd1958deb	Fix so we do not get warnings when running on system without CUDA software installed and CUDA-aware compiled in. This commit was SVN r30032.	2013-12-20 20:39:25 +00:00
Ralph Castain	77553f72be	Per this email thread: http://www.open-mpi.org/community/lists/devel/2013/12/13412.php fix the backtrace function to avoid async issues. Thanks to Takahiro Kawashima for the patch This commit was SVN r29955.	2013-12-18 17:57:37 +00:00
Ralph Castain	0995a6f3b9	Revert r29917 and replace it with a fix that resolves the thread deadlock while retaining the desired debug info. In an earlier commit, we had changed the modex accordingly: * automatically retrieve the hostname (and all RTE info) for all procs during MPI_Init if nprocs < cutoff * if nprocs > cutoff, retrieve the hostname (and all RTE info) for a proc upon the first call to modex_recv for that proc. This would provide the hostname for debugging purposes as we only report errors on messages, and so we must have called modex_recv to get the endpoint info * BTLs are not to call modex_recv until they need the endpoint info for first message - i.e., not during add_procs so we don't call it for every process in the job, but only those with whom we communicate My understanding is that only some BTLs have been modified to meet that third requirement, but those include the Cray ones where jobs are big enough that launch times were becoming an issue. Other BTLs would hopefully be modified as time went on and interest in using them at scale arose. Meantime, those BTLs would call modex_recv on every proc, and we would therefore be no worse than the prior behavior. This commit revises the MPI-RTE interface to pass the ompi_proc_t instead of the ompi_process_name_t for the proc so that the hostname can be easily inserted. I have advised the ORNL folks of the change. cmr=v1.7.4:reviewer=jsquyres:subject=Fix thread deadlock This commit was SVN r29931. The following SVN revision numbers were found above: r29917 --> open-mpi/ompi@1a972e2c9d	2013-12-17 03:26:00 +00:00
Jeff Squyres	16c63c5bbe	Fix conditional: don't just check the constant (thanks to clang for an excellent warning message!) cmr=v1.7.4:reviewer=hjelmn:subject=fix MCA_BASE_VAR_SOURCE_OVERRIDE test This commit was SVN r29773.	2013-12-02 19:41:59 +00:00
Rolf vandeVaart	ee7510b025	Remove redundant macro. This came from the review of an earlier ticket. Fixes trac:3878. Reviewed by jsquyres. This commit was SVN r29581. The following Trac tickets were found above: Ticket 3878 --> https://svn.open-mpi.org/trac/ompi/ticket/3878	2013-11-01 12:19:40 +00:00
Mike Dubman	2141e9e6b4	tools: Add oshmem_info utility Reworked ompi_info tool to be close to the orte_info implementation. ompi_info_register_types(), ompi_info_close_components() and ompi_info_show_ompi_version() are moved to runtime/ompi_info_support.c. Added runtime/oshmem_info_support layer that exports the following API for use in the oshmem_info tool: oshmem_info_register_types() oshmem_info_register_framework_params() oshmem_info_close_components() oshmem_info_show_oshmem_version() These functions call ompi_info_support related interfaces as long as Oshmem supports Open MPI/SHMEM combination. Now orte_info/ompi_info/oshmem_info have identical implementation approach. Possible improvement: OSHMEM processing of --config option is the same as OMPI's (code is duplicated). Probably list of info_support interfaces can be extended by xxx_info_do_config(). developed by Igor, reviewed by miked This commit was SVN r29429.	2013-10-12 19:03:32 +00:00
Ralph Castain	9902748108	*** THIS INCLUDES A SMALL CHANGE IN THE MPI-RTE INTERFACE *** Fix two problems that surfaced when using direct launch under SLURM: 1. locally store our own data because some BTLs want to retrieve it during add_procs rather than use what they have internally 2. cleanup MPI_Abort so it correctly passes the error status all the way down to the actual exit. When someone implemented the "abort_peers" API, they left out the error status. So we lost it at that point and always exited with a status of 1. This forces a change to the API to include the status. cmr:v1.7.3:reviewer=jsquyres:subject=Fix MPI_Abort and modex_recv for direct launch This commit was SVN r29405.	2013-10-08 18:37:59 +00:00
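
The collective id layout quoted in the a523dba41d message (SVN r32203, later reverted by r32234) puts the jobid in the top 32 bits and a rolling counter in the bottom 32 bits. The following is a minimal sketch of that packing only. The type and function names (`coll_id_t`, `coll_id_make`, `coll_id_next`, and so on) are hypothetical and are not taken from the ORTE sources.

```c
#include <stdint.h>

/* Hypothetical 64-bit collective id: [ jobid : 32 | counter : 32 ]. */
typedef uint64_t coll_id_t;

/* Pack a jobid (top 32 bits) and a counter value (bottom 32 bits). */
static inline coll_id_t coll_id_make(uint32_t jobid, uint32_t counter)
{
    return ((coll_id_t)jobid << 32) | (coll_id_t)counter;
}

/* Recover the jobid from the top 32 bits. */
static inline uint32_t coll_id_jobid(coll_id_t id)
{
    return (uint32_t)(id >> 32);
}

/* Recover the counter from the bottom 32 bits. */
static inline uint32_t coll_id_counter(coll_id_t id)
{
    return (uint32_t)(id & 0xFFFFFFFFu);
}

/* Issue the next id for a job and advance its rolling counter.
 * Unsigned arithmetic wraps to 0 after UINT32_MAX, which is the
 * "recycles when the max is hit" behavior the message describes. */
static inline coll_id_t coll_id_next(uint32_t jobid, uint32_t *counter)
{
    coll_id_t id = coll_id_make(jobid, *counter);
    *counter += 1u;
    return id;
}
```

Because the counter is 32 bits wide, at most 2^32 distinct ids exist per jobid before values repeat, which matches the limit stated in the commit message. The sketch keeps one counter per job and does nothing about the thread-ordering restriction the message calls out.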


381 Commits