openmpi

Author	SHA1	Message	Date
Ralph Castain	731a878ff3	Add a bunch of debug to help track down the problem, and eventually find another place where comparison of signatures was incorrectly performed - use the dss compare operation to be consistent and safe This commit was SVN r32620.	2014-08-27 19:52:20 +00:00
Ralph Castain	5fb7c7d23b	Don't explicitly add the hostname to the data fetch when we already cached a remote blob This commit was SVN r32619.	2014-08-27 16:18:05 +00:00
Ralph Castain	3c24770bce	Protect debug printing on backend nodes This commit was SVN r32618.	2014-08-27 16:17:28 +00:00
Ralph Castain	b87b69e977	Ensure the nodes get added to the job map on the remote nodes, add some debug to grpcomm daemon array construction This commit was SVN r32617.	2014-08-27 16:16:46 +00:00
Ralph Castain	842aaf6167	Correctly end mapping oversubscribed nodes round-robin byslot cmr=v1.8.3:reviewer=rhc This commit was SVN r32616.	2014-08-27 16:15:18 +00:00
Gilles Gouaillardet	2679629a12	pmix: fix compilation when configured with --without-hwloc This commit was SVN r32604.	2014-08-26 08:31:05 +00:00
Ralph Castain	1221e8a96f	Compare the full signature - thanks to Gilles for identifying the problem This commit was SVN r32595.	2014-08-25 14:52:06 +00:00
Ralph Castain	5a13cdb739	Fix a race condition caused by a bad attribute flag that created an OR instead of an AND condition check This commit was SVN r32587.	2014-08-22 22:48:16 +00:00
Ralph Castain	039b7acfb5	Fix the quoting algorithm so only rsh command lines get quoted values cmr=v1.8.2:reviewer=jsquyres This commit was SVN r32586.	2014-08-22 22:47:38 +00:00
Ralph Castain	f00af81c1d	Little more cleanup under the abort cases cited by Gilles. All seem to be working now This commit was SVN r32585.	2014-08-22 19:57:57 +00:00
Ralph Castain	b1a7375192	Fix the "unreachable" message so it outputs the correct hostname for the remote proc. Cleanup some of the pmix stuff when running corner cases of errors This commit was SVN r32584.	2014-08-22 19:20:45 +00:00
Ralph Castain	6ff2a60829	Handle the non-blocking fence case correctly, and ensure we always at least pass back the hostname of the process whose info is being requested so that the ompi_proc_t can correctly initialize it when we are in a non-blocking fence with np < cutoff scenario This commit was SVN r32578.	2014-08-22 14:26:24 +00:00
Ralph Castain	8f1b9b463e	Fix shared memory operations - need to pass the local topology and cpusets of all local peers so we can properly compute relative locality for them. Also need to set default locality to "on node" in case where cpusets are not passed because procs are not bound. This commit was SVN r32577.	2014-08-22 05:17:51 +00:00
Ralph Castain	c6f78d6e54	The PMI ess component now gets used for more than direct launch, so only set standalone_operation flag if no daemon uri is available so we aggregate show_help messages This commit was SVN r32574.	2014-08-22 03:00:56 +00:00
Ralph Castain	aec5cd08bd	Per the PMIx RFC: WHAT: Merge the PMIx branch into the devel repo, creating a new OPAL “pmix” framework to abstract PMI support for all RTEs. Replace the ORTE daemon-level collectives with a new PMIx server and update the ORTE grpcomm framework to support server-to-server collectives WHY: We’ve had problems dealing with variations in PMI implementations, and need to extend the existing PMI definitions to meet exascale requirements. WHEN: Mon, Aug 25 WHERE: https://github.com/rhc54/ompi-svn-mirror.git Several community members have been working on a refactoring of the current PMI support within OMPI. Although the APIs are common, Slurm and Cray implement a different range of capabilities, and package them differently. For example, Cray provides an integrated PMI-1/2 library, while Slurm separates the two and requires the user to specify the one to be used at runtime. In addition, several bugs in the Slurm implementations have caused problems requiring extra coding. All this has led to a slew of #if’s in the PMI code and bugs when the corner-case logic for one implementation accidentally traps the other. Extending this support to other implementations would have increased this complexity to an unacceptable level. Accordingly, we have: * created a new OPAL “pmix” framework to abstract the PMI support, with separate components for Cray, Slurm PMI-1, and Slurm PMI-2 implementations. * Replaced the current ORTE grpcomm daemon-based collective operation with an integrated PMIx server, and updated the grpcomm APIs to provide more flexible, multi-algorithm support for collective operations. At this time, only the xcast and allgather operations are supported. * Replaced the current global collective id with a signature based on the names of the participating procs. 
This allows an unlimited number of collectives to be executed by any group of processes, subject to the requirement that only one collective can be active at a time for a unique combination of procs. Note that a proc can be involved in any number of simultaneous collectives - it is the specific combination of procs that is subject to the constraint * removed the prior OMPI/OPAL modex code * added new macros for executing modex send/recv to simplify use of the new APIs. The send macros allow the caller to specify whether or not the BTL supports async modex operations - if so, then the non-blocking “fence” operation is used, if the active PMIx component supports it. Otherwise, the default is a full blocking modex exchange as we currently perform. * retained the current flag that directs us to use a blocking fence operation, but only to retrieve data upon demand This commit was SVN r32570.	2014-08-21 18:56:47 +00:00
Ralph Castain	b4511913f6	Remove an unnecessary optimization that can cause more trouble than it's worth - just try all the addresses that are given to us. Refs trac:4870 This commit was SVN r32558. The following Trac tickets were found above: Ticket 4870 --> https://svn.open-mpi.org/trac/ompi/ticket/4870	2014-08-20 20:58:07 +00:00
Ralph Castain	fa28710d53	Track down the last piece of the connection problem. It appears that providing a netmask of 0 to opal_net_samenetwork results in everything looking like it is on the same network. Hence, we were not retaining any of the alternative addresses, so we had no other way to check them. Refs trac:4870 This commit was SVN r32556. The following Trac tickets were found above: Ticket 4870 --> https://svn.open-mpi.org/trac/ompi/ticket/4870	2014-08-20 16:55:36 +00:00
Ralph Castain	343038af7b	Frazzle-frump! Missed that we reset the peer state just before the new check. Refs trac:4870 This commit was SVN r32554. The following Trac tickets were found above: Ticket 4870 --> https://svn.open-mpi.org/trac/ompi/ticket/4870	2014-08-19 22:34:49 +00:00
Ralph Castain	0a91fdf85f	If an initial address fails to connect, record that fact and attempt the next address for that proc. If nothing succeeds, then declare failure. cmr=v1.8.2:reviewer=edgar This commit was SVN r32553.	2014-08-19 19:48:24 +00:00
Ralph Castain	024572cb6c	Sigh - I promised to remove these deprecation warnings back in June. My apologies to Dave Goodell and others who requested it. cmr=v1.8.2:reviewer=dgoodell:subject=remove deprecation warnings for pernode, npernode, and npersocket This commit was SVN r32552.	2014-08-19 19:40:20 +00:00
Jeff Squyres	1551339eba	rsh: revert part of r32517: keep the quoting As part of reviewing CMR #4860, I talked through r32517 with Ralph. In attempt to fix various rsh quoting problems, r32517 removed all the quoting from the main code path and then only added it back in at the end in some cases. This commit puts back the quoting parts that were removed in r32517 (r32517 fixed 2 other important bugs: a) change "--<foo>" to "--mca <foo_equivalent> 1" so that de-duplication works, and b) change a != to ==). refs trac:4860 This commit was SVN r32524. The following SVN revision numbers were found above: r32517 --> open-mpi/ompi@7342bce58f The following Trac tickets were found above: Ticket 4860 --> https://svn.open-mpi.org/trac/ompi/ticket/4860	2014-08-13 19:27:10 +00:00
Gilles Gouaillardet	f96d382d1d	Fix typo. Thanks to Christopher Samuel for reporting it This commit was SVN r32520.	2014-08-13 05:54:59 +00:00
Ralph Castain	7342bce58f	Cleanup the over-aggressive quoting of params on the orted cmd line. Remove duplicates caused by passing on both cmd line shortcuts and the mca param version of the same thing. Fixes trac:4857 cmr=v1.8.2:reviewer=jsquyres This commit was SVN r32517. The following Trac tickets were found above: Ticket 4857 --> https://svn.open-mpi.org/trac/ompi/ticket/4857	2014-08-13 03:51:04 +00:00
George Bosilca	de7191132d	Remove few warnings. This commit was SVN r32506.	2014-08-11 13:34:44 +00:00
Gilles Gouaillardet	a873f45a90	Fix r32460 race condition resolution when procs call MPI_Abort. Do not invoke orte_session_dir_finalize(...) so orte_ess_base_app_abort(...) can successfully create <orte_process_info.proc_session_dir>/aborted cmr=v1.8.2:reviewer=rhc This commit was SVN r32498. The following SVN revision numbers were found above: r32460 --> open-mpi/ompi@abedb97be4	2014-08-11 05:50:32 +00:00
Gilles Gouaillardet	e184733ef6	check-help-strings cleanup This commit was SVN r32496.	2014-08-11 03:26:21 +00:00
Gilles Gouaillardet	f24699623f	check-help-strings cleanup This commit was SVN r32495.	2014-08-11 03:25:22 +00:00
Gilles Gouaillardet	c3c364a262	check-help-strings cleanup This commit was SVN r32494.	2014-08-11 03:22:05 +00:00
Gilles Gouaillardet	d9e0212e0e	check-help-strings cleanup This commit was SVN r32493.	2014-08-11 03:21:08 +00:00
Gilles Gouaillardet	d139f75db4	check-help-strings cleanup This commit was SVN r32492.	2014-08-11 03:20:37 +00:00
Ralph Castain	abedb97be4	Resolve race condition when procs call MPI_Abort. Since we go thru the errmgr instead of the normal proc termination routines, we need to ensure we mark that the proc has fired its waitpid and is no longer alive. Otherwise, the local daemon won't terminate because it thinks there is still a local proc alive and we hang. Thanks to Gilles for tracking it down. cmr=v1.8.2:reviewer=rhc This commit was SVN r32460.	2014-08-08 15:58:49 +00:00
Ralph Castain	8ea576c870	I have no idea how they did it, but someone managed to write a test that circled around and around and eventually reached this point with a NULL pointer. So protect against that possibility. This commit was SVN r32434.	2014-08-05 16:20:46 +00:00
Ralph Castain	42c5073aa3	Safely cleanup the opal_proc_t structure for non-MPI procs. This commit was SVN r32402.	2014-08-01 16:38:49 +00:00
Ralph Castain	7758528d72	Apparently, someone else is destructing the opal_proc_t, so don't destruct it ourselves This commit was SVN r32400.	2014-08-01 14:54:22 +00:00
Ralph Castain	d29a5ab69d	Okay, now handle the non-MPI apps This commit was SVN r32399.	2014-08-01 14:49:25 +00:00
Ralph Castain	daeb9b6c4f	Some more cleanups. Remove direct references to ORTE by changing OMPI_CAST_ORTE_NAME -> OMPI_CAST_RTE_NAME. Ensure that ORTE tools (mpirun, orted, tools) set the OPAL proc structure fields so OPAL knows what is going on and uses the correct print functions (still need to fix the problem for non-MPI apps). Properly return uint32_t from the opal utilities instead of int32_t as that is what the ORTE process name fields contain. Thanks to Gilles for pointing out some of the discrepancies. This commit was SVN r32398.	2014-08-01 14:44:11 +00:00
Ralph Castain	8cfadd1842	Per Paul Hargrove, add missing include to fix build under FreeBSD RM-approved cmr=v1.8.2:reviewer=ompi-gk1.8 This commit was SVN r32397.	2014-08-01 13:37:41 +00:00
George Bosilca	daa076995a	orte_rmaps_numa_node_t -> opal_rmaps_numa_node_t This commit was SVN r32380.	2014-07-31 19:58:47 +00:00
Ralph Castain	5db717f090	Some small leak cleanups cmr=v1.8.3:reviewer=artpol This commit was SVN r32358.	2014-07-30 15:46:02 +00:00
Ralph Castain	98b5a86a58	Now that we are using the radix routed module, teach it how to behave nicely with singletons This commit was SVN r32312.	2014-07-24 22:46:17 +00:00
Ralph Castain	0cad281a92	Single-word cmd line values for orted are dealt with in orte_plm_base_orted_append_basic_args, so protect against special characters there. Have the rsh module only deal with multi-word arguments as those were skipped by orte_plm_base_orted_append_basic_args. Refs trac:4802 This commit was SVN r32293. The following Trac tickets were found above: Ticket 4802 --> https://svn.open-mpi.org/trac/ompi/ticket/4802	2014-07-23 17:06:51 +00:00
Jeff Squyres	4da3c85b54	fortran: revert Absoft-based fixes Revert r32246, r32254, and r32255 -- they were fixing side-effects of the real bug. Real fix coming after this one. This commit was SVN r32286. The following SVN revision numbers were found above: r32246 --> open-mpi/ompi@08d2a1a48d r32254 --> open-mpi/ompi@232d4dbb7b	2014-07-22 21:49:22 +00:00
Ralph Castain	a94a97bd50	Cleanup the passing of MCA params on the orted cmd line in ssh by ensuring that we quote all values since they could be multi-word and/or contain special characters. Thanks to Dirk Schubert for pointing it out. cmr=v1.8.2:reviewer=jsquyres This commit was SVN r32280.	2014-07-22 18:22:06 +00:00
Ralph Castain	2f579806ae	Make the radix routed component the default pending repair/completion of debruijn option This commit was SVN r32276.	2014-07-22 16:48:51 +00:00
Ralph Castain	6c5e592785	Revert r32222, r32210, and r32203 as they created a problem when daemon collectives did not involve app procs on every node. Instead, modify the ompi/mca/rte/orte/rte_orte.h to add a new function that allows apps to request new daemon collective ids for use in barrier and modex operations. This will only appear in ORTE-based installations, but it is only being used by a couple of researchers at the moment. Update the orte/test/mpi/coll_test.c test to show the revised example. This commit was SVN r32234. The following SVN revision numbers were found above: r32203 --> open-mpi/ompi@a523dba41d r32210 --> open-mpi/ompi@2ce11ed5c4 r32222 --> open-mpi/ompi@d55f16db50	2014-07-15 03:48:00 +00:00
Ralph Castain	1feaffbb15	Get the blasted singleton comm_spawn working again. There remain problems with the Slurm interaction in this use-case as the PMI components (if configured to build) try to run even when a Slurm allocation hasn't been made, but I leave that to someone else to resolve. I did, however, tell the Slurm ess to quit interfering with applications launched in this use-case by ORTE daemons, so things do work when inside a Slurm allocation. Also discovered that the rsh launcher is not picking up --enable-orterun-prefix-by-default when invoked during singleton comm_spawn, but I was unable to see why that was happening and ran out of time. cmr=v1.8.2:reviewer=rhc This commit was SVN r32229.	2014-07-13 14:47:22 +00:00
Ralph Castain	d55f16db50	Fix a hang in daemon collectives when run on multinode systems This commit was SVN r32222.	2014-07-12 00:43:12 +00:00
Ralph Castain	2ce11ed5c4	Fix one spot missed by recent commit This commit was SVN r32210.	2014-07-10 20:08:57 +00:00
Ralph Castain	a523dba41d	NOTE: this modifies the MPI-RTE interface We have been getting several requests for new collectives that need to be inserted in various places of the MPI layer, all in support of either checkpoint/restart or various research efforts. Until now, this would require that the collective id's be generated at launch, which required modifications to ORTE and other places. We chose not to make collectives reusable as the race conditions associated with resetting collective counters are daunting. This commit extends the collective system to allow self-generation of collective id's that the daemons need to support, thereby allowing developers to request any number of collectives for their work. There is one restriction: RTE collectives must occur at the process level - i.e., we don't currently have a way of tagging the collective to a specific thread. From the comment in the code: * In order to allow scalable * generation of collective id's, they are formed as: * * top 32-bits are the jobid of the procs involved in * the collective. For collectives across multiple jobs * (e.g., in a connect_accept), the daemon jobid will * be used as the id will be issued by mpirun. This * won't cause problems because daemons don't use the * collective_id * * bottom 32-bits are a rolling counter that recycles * when the max is hit. The daemon will cleanup each * collective upon completion, so this means a job can * never have more than 2^32 collectives going on at a time. If someone needs more than that - they've got * a problem. * * Note that this means (for now) that RTE-level collectives * cannot be done by individual threads - they must be * done at the overall process level. This is required as * there is no guaranteed ordering for the collective id's, * and all the participants must agree on the id of the * collective they are executing. 
So if thread A on one * process asks for a collective id before thread B does, * but B asks before A on another process, the collectives will * be mixed and not result in the expected behavior. We may * find a way to relax this requirement in the future by * adding a thread context id to the jobid field (maybe taking the * lower 16-bits of that field). This commit includes a test program (orte/test/mpi/coll_test.c) that cycles 100 times across barrier and modex collectives. This commit was SVN r32203.	2014-07-10 18:53:12 +00:00
Ralph Castain	8c85ca350e	Remove debug This commit was SVN r32200.	2014-07-10 18:28:24 +00:00
Jeff Squyres	6cc538ae16	help-orterun.txt: wrap long messages, clarify new messages Clarify the new -x/mca_base_env_list help messages. This commit was SVN r32199.	2014-07-10 17:24:52 +00:00
Ralph Castain	796f57f709	Protect against problems if someone passes us thru a pipe and then abnormally terminates the pipe early This commit was SVN r32189.	2014-07-09 22:41:53 +00:00
Ralph Castain	7beb8f6799	Silence used-before-set var warning This commit was SVN r32188.	2014-07-09 22:37:47 +00:00
Joshua Ladd	801e2cb544	Fix error and warning messages after reverting the mca_base_env_list to being semicolon delimited. This commit was SVN r32179.	2014-07-09 14:46:19 +00:00
Joshua Ladd	30da6d3a17	Opal: add a new MCA parameter that allows the user to specify a list of environment variables. This parameter will become the standard mechanism by which environment variables are set for OMPI applications replacing the -x option. mpirun ... -x env_foo1=val1 -x env_foo2 -x env_foo3=val3 should now be expressed as mpirun ... -mca mca_base_env_list env_foo1=val1+env_foo2+env_foo3=val3. The motivation for doing this is so that a list of environment variables may be set via standard MCA mechanisms such as mca parameter files, amca lists, etc. This feature was developed by Elena Shipunova and was reviewed by Josh Ladd. This commit was SVN r32163.	2014-07-09 00:38:25 +00:00
Ralph Castain	5bb5b22573	When a user asks for cpus/rank > 1 and only has one slot, we need to ensure we always map at least one process when they don't tell us -np cmr=v1.8.2:reviewer=rhc:subject=correct num_procs in corner case This commit was SVN r32142.	2014-07-04 17:00:35 +00:00
Ralph Castain	9f97c74ba3	Silence warning This commit was SVN r32136.	2014-07-03 17:29:04 +00:00
Ralph Castain	356e7ea904	Move all collective id's into the attributes and let the job pack/unpack take care of them instead of singling them out. Add the envars just prior to forking the children instead of into the launch message itself. Remove a few #if CR as the attributes functionality can handle this condition now. This commit was SVN r32133.	2014-07-03 15:58:13 +00:00
Ralph Castain	0a4639308e	Remove a potential race condition - we'll cleanup the local children when we are all done This commit was SVN r32132.	2014-07-03 14:13:43 +00:00
George Bosilca	2883adcdf3	Remove useless variables. This commit was SVN r32123.	2014-07-03 00:30:54 +00:00
Ralph Castain	149810f02c	Per request from Jeff, slightly modify the show_help message as the precise name of the NUMA-containing packages differs based on OS and distro cmr=v1.8.2:reviewer=jsquyres:subject=modify show_help message This commit was SVN r32122.	2014-07-02 14:46:00 +00:00
Ralph Castain	e9d69ca370	Remove stale test This commit was SVN r32104.	2014-06-29 16:37:19 +00:00
Adrian Reber	47b118c0ae	fix FT compilation This commit was SVN r32094.	2014-06-26 03:40:07 +00:00
Adrian Reber	cabf1d4e68	use the orte attributes in the FT code to fix compile errors This commit was SVN r32093.	2014-06-26 03:19:17 +00:00
Adrian Reber	10c1a50705	"handle" removal of opal_db.remove() in the FT code This commit was SVN r32092.	2014-06-26 03:11:37 +00:00
Ralph Castain	f3cb124e50	Revert r32082 and r32070 - the developer's conference has decided to go a different direction on the threaded progress effort. This will involve some degree of prototyping to understand the tradeoffs prior to making a final design decision, and so we'll hold off on the final change until that is completed. This commit was SVN r32089. The following SVN revision numbers were found above: r32070 --> open-mpi/ompi@12d92d0c22 r32082 --> open-mpi/ompi@aa6438ef7a	2014-06-25 20:43:28 +00:00
Adrian Reber	9f73e79d91	also change the callback function prototype (to get the FT code to compile again) This commit was SVN r32088.	2014-06-25 20:37:02 +00:00
Adrian Reber	4aca7095dc	fix a syntax error in the FT code This commit was SVN r32087.	2014-06-25 20:35:50 +00:00
Adrian Reber	4b25e92194	get the FT code to compile again by adding/removing #includes This commit was SVN r32086.	2014-06-25 18:42:17 +00:00
Ralph Castain	8fca77c3d3	Protect the binding policy setting so it builds when --without-hwloc Refs trac:4742 This commit was SVN r32085. The following Trac tickets were found above: Ticket 4742 --> https://svn.open-mpi.org/trac/ompi/ticket/4742	2014-06-25 18:13:54 +00:00
Adrian Reber	72f1c7941f	use a consistent naming scheme for the SNAPSHOT attributes This commit was SVN r32083.	2014-06-25 15:26:24 +00:00
Nathan Hjelm	563eaf0726	Fix support for Cray alps The alps ras and plm components were broken by recent changes in ORTE. This commit resolves those issues. Changes: - Define PMI2_SUCCESS if it isn't defined. This fixes a problem with Cray's PMI implementation which does not define (for some reason) PMI2_SUCCESS. We had previously just used PMI_SUCCESS. - Add a missing definition and fix a typo in plm_alps_module. - launch_id is no longer available in the orte_node_t structure. Use the attribute lookup to get the value. - Do not use an O(n^2) sorting algorithm when putting alps nodes in order. Use opal_list_sort instead (O(n log n)). This commit was SVN r32076.	2014-06-24 21:29:04 +00:00
Ralph Castain	5f6be06b54	Per request from Gilles and discussion at devel conference, have the --oversubscribe option automatically set both oversubscribe and overload-allowed properties as this is likely what the user intended. cmr=v1.8.2:reviewer=rhc:subject=automatically set oversub/load This commit was SVN r32072.	2014-06-24 18:11:39 +00:00
Ralph Castain	12d92d0c22	Per the OMPI developer conference, remove the last vestiges of OMPI_USE_PROGRESS_THREADS This commit was SVN r32070.	2014-06-24 17:05:11 +00:00
Ralph Castain	34e5573988	Resolve the MTT timeout problem. This appears to have largely been caused by missing sigchld notifications, thus causing the daemons to believe that not all procs had exited. Let comm failure also serve as notification of process termination, and add appropriate flags/attributes to avoid multiple reporting of proc termination. This won't transition cleanly to the 1.8 series, and may represent too much change, so we'll have to (a) evaluate whether or not to bring it over (once it demonstrates that it does indeed solve the problem), and (b) develop a custom patch for that purpose. Refs trac:4717 This commit was SVN r32063. The following Trac tickets were found above: Ticket 4717 --> https://svn.open-mpi.org/trac/ompi/ticket/4717	2014-06-21 17:09:02 +00:00
Ralph Castain	9cfc408fd4	Little more debug - getting close to figuring this one out Refs trac:4717 This commit was SVN r32060. The following Trac tickets were found above: Ticket 4717 --> https://svn.open-mpi.org/trac/ompi/ticket/4717	2014-06-20 16:24:06 +00:00
Ralph Castain	f9da295682	Add some additional debug Refs trac:4717 This commit was SVN r32059. The following Trac tickets were found above: Ticket 4717 --> https://svn.open-mpi.org/trac/ompi/ticket/4717	2014-06-20 14:14:36 +00:00
Ralph Castain	645df5e823	Don't release the node_name field as it gets used in the slots parsing - will be released at newline detection This commit was SVN r32058.	2014-06-20 13:18:46 +00:00
Ralph Castain	9a47e45a09	<laugh> ensure we really compare the things we want to compare This commit was SVN r32055.	2014-06-19 20:54:25 +00:00
Ralph Castain	e65538e91b	Add some defensive programming, fix a typo This commit was SVN r32054.	2014-06-19 20:52:13 +00:00
Ralph Castain	b43f760f93	If you don't specify all the rank-file mapping for all procs, then you'll segfault - which is probably a bad idea. I can't see an easy workaround, so just error out for now and let's see if anyone really cares. cmr=v1.8.2:reviewer=jsquyres This commit was SVN r32053.	2014-06-19 20:30:06 +00:00
Ralph Castain	b618b36a2f	Fix potential issue if opal_hwloc_topology is NULL cmr=v1.8.2:reviewer=jsquyres This commit was SVN r32050.	2014-06-19 18:52:41 +00:00
Ralph Castain	61fe4daa33	Add some further debug Refs trac:4717 This commit was SVN r32047. The following Trac tickets were found above: Ticket 4717 --> https://svn.open-mpi.org/trac/ompi/ticket/4717	2014-06-19 15:59:51 +00:00
Ralph Castain	65275d6326	Add a little more info to the warning message - i.e., that the likely cause of the problem is missing libnumactl and/or libnumactl-devel cmr=v1.8.2:reviewer=miked:subject=improve memory binding failure message This commit was SVN r32030.	2014-06-18 19:20:28 +00:00
Ralph Castain	3f032d39e8	Mark the proc as alive so waitpid callback system doesn't immediately activate the callback Refs trac:4717 This commit was SVN r32026. The following Trac tickets were found above: Ticket 4717 --> https://svn.open-mpi.org/trac/ompi/ticket/4717	2014-06-18 14:04:55 +00:00
Ralph Castain	8e7c0257f0	Cleanup some missed updates to orte_wait_cb as params have changed Refs trac:4717 This commit was SVN r32025. The following Trac tickets were found above: Ticket 4717 --> https://svn.open-mpi.org/trac/ompi/ticket/4717	2014-06-17 23:40:31 +00:00
Ralph Castain	5dbf4a62c4	Cleanup: we were accidentally killing ourselves (bad idea) Refs trac:4717 This commit was SVN r32022. The following Trac tickets were found above: Ticket 4717 --> https://svn.open-mpi.org/trac/ompi/ticket/4717	2014-06-17 20:38:42 +00:00
Ralph Castain	5216bd5558	Multiple sigchld reports can occur within a single event callback, so have to reap them until none remain. Also, need to ensure the daemon is flagged as alive prior to calling wait_cb Refs trac:4717 This commit was SVN r32020. The following Trac tickets were found above: Ticket 4717 --> https://svn.open-mpi.org/trac/ompi/ticket/4717	2014-06-17 18:46:40 +00:00
Ralph Castain	42bf7466fc	This isn't as big a change as it appears - a change in one place caused a whole bunch of files to require updated #include's due to some arcane linkage. Rework the orte_wait code to reflect the introduction of the state machine. If we are in cleanup mode and just want to kill all our local children, then there is no reason to be polite about it as that introduces very long delays at scale. Just kill the procs and move on. Refs trac:4717 This commit was SVN r32019. The following Trac tickets were found above: Ticket 4717 --> https://svn.open-mpi.org/trac/ompi/ticket/4717	2014-06-17 17:57:51 +00:00
Ralph Castain	ab52f16100	Attempt to cleanup the race condition Rolf keeps encountering in MTT by adding some protection to ensure orted's try to terminate once their local procs die. Also, fix a problem whereby a failure to comm_spawn would result in a hang of the parent process. cmr=v1.8.2:reviewer=rhc:subject=cache termination cleanups This commit was SVN r32008.	2014-06-16 20:46:35 +00:00
Ralph Castain	561983ae52	Fix static builds by renaming conflicting type This commit was SVN r32006.	2014-06-14 17:39:28 +00:00
Ralph Castain	3f04d50cb0	Per the ticket, resolve our handling of overload conditions to provide a more consistent response. If we are overloaded (i.e., attempting to bind more processes to a location than the number of cpus under that location), then we consider the following conditions: (a) default binding policy is in effect. In this case, we will emit a warning and default to not binding unless the user provided the "oversubscribe" or "overload" modifier to the "bind-to" option. (b) user-specified binding policy is in effect. In this case, we will error out unless the user provided the "oversubscribe" or "overload" modifier to the "bind-to" option as we cannot meet the directive. Either "bind-to" modifier (oversubscribe or overload) will be accepted for now - in 1.9, we will deprecate the "overload" term in favor of "oversubscribe". Also added the ability to accept a --bind-to modifier without specifying the binding policy itself so a user can specify overload-allowed with the default policy. Closes trac:4345 cmr=v1.8.2:reviewer=rhc:subject=resolve handling of overload conditions This commit was SVN r32005. The following Trac tickets were found above: Ticket 4345 --> https://svn.open-mpi.org/trac/ompi/ticket/4345	2014-06-14 15:38:32 +00:00
Ralph Castain	56c3575c0e	Can't emit an error for an unrecognized mapping policy modifier as the ppr policy relies on not doing so. This commit was SVN r31998.	2014-06-13 20:10:09 +00:00
Ralph Castain	3ed282bf44	Per patch from Tetsuya, correct the cpus-per-proc logic so we correctly detect when the user is attempting to bind too low for that option Refs trac:4702 This commit was SVN r31988. The following Trac tickets were found above: Ticket 4702 --> https://svn.open-mpi.org/trac/ompi/ticket/4702	2014-06-13 16:32:52 +00:00
Ralph Castain	ba926d8635	The TCP component will have set the hash table entry to NULL, but that doesn't remove the key. So the hash_table retrieval function will return success, but with a NULL pointer - protect against that scenario Patch provided by Gilles - reviewed by rhc. RM-approved cmr=v1.8.2:reviewer=ompi-gk1.8 This commit was SVN r31971.	2014-06-09 17:46:22 +00:00
Ralph Castain	5d5ae41ea5	Cleanup a memory leak in the daemons - thanks to Artem for spotting it This commit was SVN r31970.	2014-06-09 17:14:02 +00:00
Ralph Castain	06dbfa3098	Make the cpus-per-proc equivalent a little more intuitive: * allow users to specify just a modifier for map-by instead of requiring that they also specify a policy. Thus, we now accept --map-by :pe=3 as indicating that we should use the default mapping policy, but bind 3 cpus/proc. * if users specify a pe's/proc but no policy, default to --map-by NUMA to ensure we have access to multiple cpus for the request. This won't guarantee we have access to enough to meet the request, but gives us a chance. In addition, we know that binding a proc to multiple cpus will work best if those cpus are all in the same NUMA, so this provides some degree of optimized behavior. Per a request from Jeff, define "oversubscribe" for binding as a synonym for the "overload" modifier. cmr=v1.8.2:reviewer=rhc This commit was SVN r31967.	2014-06-08 20:26:59 +00:00
Ralph Castain	8db76e9c6f	Ensure that we change to the session dir if we preload binaries so we'll use the loaded one Special patch created for v1.8 and CMR filed This commit was SVN r31963.	2014-06-06 21:43:23 +00:00
Ralph Castain	b7c08582ba	Add new tag to avoid conflicts This commit was SVN r31960.	2014-06-06 17:23:35 +00:00
Ralph Castain	638c24f655	Correct the bind-in-place algorithm to better handle comm_spawn. If the location identified by the mapper is already occupied by procs from another job, then we need to shift either right or left until we find an unoccupied location where we can be bound. If nothing is available, then check for the overload flag (and bind us in the original location if provided), or see if this was the default binding policy instead of one specified by the user - if so, then just don't bind this process. cmr=v1.8.2:reviewer=rhc This commit was SVN r31959.	2014-06-06 12:36:14 +00:00
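
The collective id layout quoted in commit a523dba41d (SVN r32203) can be illustrated with a small sketch. It is written in Python for brevity and is not the ORTE code: the names `make_collective_id`, `split_collective_id`, and `CollectiveIdAllocator` are hypothetical, and the only details taken from the log are the split into a jobid in the top 32 bits and a rolling counter in the bottom 32 bits. Note that the log itself records this scheme being reverted in r32234 and later replaced by proc-name signatures in r32570.

```python
# Illustrative sketch only; all names here are hypothetical, not ORTE APIs.
MASK32 = 0xFFFFFFFF


def make_collective_id(jobid: int, counter: int) -> int:
    """Pack a 32-bit jobid (top half) and a 32-bit counter (bottom half)."""
    return ((jobid & MASK32) << 32) | (counter & MASK32)


def split_collective_id(coll_id: int) -> tuple:
    """Recover (jobid, counter) from a packed 64-bit collective id."""
    return (coll_id >> 32) & MASK32, coll_id & MASK32


class CollectiveIdAllocator:
    """Hands out ids for one job; the counter recycles at the 32-bit max."""

    def __init__(self, jobid: int) -> None:
        self.jobid = jobid
        self.counter = 0

    def next_id(self) -> int:
        coll_id = make_collective_id(self.jobid, self.counter)
        # Wrap around instead of overflowing into the jobid field.
        self.counter = (self.counter + 1) & MASK32
        return coll_id
```

Because every participant has to derive the same id independently, each process must request ids in the same order, which is the reason the commit message restricts these collectives to the process level rather than individual threads.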


4508 commits