
4416 Commits

Author SHA1 Message Date
Ralph Castain
a94a97bd50 Clean up the passing of MCA params on the orted cmd line in ssh by ensuring that we quote all values, since they could be multi-word and/or contain special characters. Thanks to Dirk Schubert for pointing it out.
cmr=v1.8.2:reviewer=jsquyres

This commit was SVN r32280.
2014-07-22 18:22:06 +00:00
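The quoting concern in a94a97bd50 can be illustrated with a minimal sketch. This is not the code from the commit: `shell_quote` is a hypothetical helper, and single-quote wrapping is only one common way to make a multi-word value survive a POSIX shell on the remote side of ssh.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not part of ORTE): return a malloc'd copy of
 * `value` wrapped in single quotes so a POSIX shell treats it as one
 * word. Embedded single quotes are emitted as '\'' (close the quote,
 * escaped quote, reopen the quote). The caller frees the result.
 * Returns NULL if allocation fails. */
static char *shell_quote(const char *value)
{
    assert(value != NULL);

    /* Two bytes for the surrounding quotes, four per embedded quote. */
    size_t len = 2;
    for (const char *p = value; *p != '\0'; p++) {
        len += (*p == '\'') ? 4 : 1;
    }

    char *out = malloc(len + 1);
    if (out == NULL) {
        return NULL;
    }

    char *q = out;
    *q++ = '\'';
    for (const char *p = value; *p != '\0'; p++) {
        if (*p == '\'') {
            memcpy(q, "'\\''", 4);
            q += 4;
        } else {
            *q++ = *p;
        }
    }
    *q++ = '\'';
    *q = '\0';
    return out;
}
```

Under these assumptions, a value such as `a b;c` would be passed as `'a b;c'`, so the remote shell sees neither the space nor the semicolon as syntax.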
Ralph Castain
2f579806ae Make the radix routed component the default pending repair/completion of debruijn option
This commit was SVN r32276.
2014-07-22 16:48:51 +00:00
Ralph Castain
6c5e592785 Revert r32222, r32210, and r32203 as they created a problem when daemon collectives did not involve app procs on every node. Instead, modify the ompi/mca/rte/orte/rte_orte.h to add a new function that allows apps to request new daemon collective ids for use in barrier and modex operations. This will only appear in ORTE-based installations, but it is only being used by a couple of researchers at the moment.
Update the orte/test/mpi/coll_test.c test to show the revised example.

This commit was SVN r32234.

The following SVN revision numbers were found above:
  r32203 --> open-mpi/ompi@a523dba41d
  r32210 --> open-mpi/ompi@2ce11ed5c4
  r32222 --> open-mpi/ompi@d55f16db50
2014-07-15 03:48:00 +00:00
Ralph Castain
1feaffbb15 Get the blasted singleton comm_spawn working again. There remain problems with the Slurm interaction in this use-case as the PMI components (if configured to build) try to run even when a Slurm allocation hasn't been made, but I leave that to someone else to resolve. I did, however, tell the Slurm ess to quit interfering with applications launched in this use-case by ORTE daemons, so things do work when inside a Slurm allocation.
Also discovered that the rsh launcher is not picking up --enable-orterun-prefix-by-default when invoked during singleton comm_spawn, but I was unable to see why that was happening and ran out of time.

cmr=v1.8.2:reviewer=rhc

This commit was SVN r32229.
2014-07-13 14:47:22 +00:00
Ralph Castain
d55f16db50 Fix a hang in daemon collectives when run on multinode systems
This commit was SVN r32222.
2014-07-12 00:43:12 +00:00
Ralph Castain
2ce11ed5c4 Fix one spot missed by recent commit
This commit was SVN r32210.
2014-07-10 20:08:57 +00:00
Ralph Castain
a523dba41d NOTE: this modifies the MPI-RTE interface
We have been getting several requests for new collectives that need to be inserted in various places of the MPI layer, all in support of either checkpoint/restart or various research efforts. Until now, this would require that the collective id's be generated at launch, which required modifications to ORTE and other places. We chose not to make collectives reusable as the race conditions associated with resetting collective counters are daunting.

This commit extends the collective system to allow self-generation of collective id's that the daemons need to support, thereby allowing developers to request any number of collectives for their work. There is one restriction: RTE collectives must occur at the process level - i.e., we don't currently have a way of tagging the collective to a specific thread. From the comment in the code:

 * In order to allow scalable
 * generation of collective id's, they are formed as:
 *
 * top 32-bits are the jobid of the procs involved in
 * the collective. For collectives across multiple jobs
 * (e.g., in a connect_accept), the daemon jobid will
 * be used as the id will be issued by mpirun. This
 * won't cause problems because daemons don't use the
 * collective_id
 *
 * bottom 32-bits are a rolling counter that recycles
 * when the max is hit. The daemon will cleanup each
 * collective upon completion, so this means a job can
 * never have more than 2**32 collectives going on at
 * a time. If someone needs more than that - they've got
 * a problem.
 *
 * Note that this means (for now) that RTE-level collectives
 * cannot be done by individual threads - they must be
 * done at the overall process level. This is required as
 * there is no guaranteed ordering for the collective id's,
 * and all the participants must agree on the id of the
 * collective they are executing. So if thread A on one
 * process asks for a collective id before thread B does,
 * but B asks before A on another process, the collectives will
 * be mixed and not result in the expected behavior. We may
 * find a way to relax this requirement in the future by
 * adding a thread context id to the jobid field (maybe taking the
 * lower 16-bits of that field).

This commit includes a test program (orte/test/mpi/coll_test.c) that cycles 100 times across barrier and modex collectives.

This commit was SVN r32203.
2014-07-10 18:53:12 +00:00
Ralph Castain
8c85ca350e Remove debug
This commit was SVN r32200.
2014-07-10 18:28:24 +00:00
Jeff Squyres
6cc538ae16 help-orterun.txt: wrap long messages, clarify new messages
Clarify the new -x/mca_base_env_list help messages.

This commit was SVN r32199.
2014-07-10 17:24:52 +00:00
Ralph Castain
796f57f709 Protect against problems if someone passes us thru a pipe and then abnormally terminates the pipe early
This commit was SVN r32189.
2014-07-09 22:41:53 +00:00
Ralph Castain
7beb8f6799 Silence used-before-set var warning
This commit was SVN r32188.
2014-07-09 22:37:47 +00:00
Joshua Ladd
801e2cb544 Fix error and warning messages after reverting
the mca_base_env_list to being semicolon delimited.

This commit was SVN r32179.
2014-07-09 14:46:19 +00:00
Joshua Ladd
30da6d3a17 Opal: add a new MCA parameter that allows the user to specify a list of environment variables. This parameter will become the standard mechanism by which environment variables are set for OMPI applications replacing the -x option.
mpirun ... -x env_foo1=val1 -x env_foo2 -x env_foo3=val3  should now be expressed as

mpirun ... -mca mca_base_env_list env_foo1=val1+env_foo2+env_foo3=val3. 

The motivation for doing this is so that a list of environment variables may be set via standard MCA mechanisms such as mca parameter files, amca lists, etc. 

This feature was developed by Elena Shipunova and was reviewed by Josh Ladd.

This commit was SVN r32163.
2014-07-09 00:38:25 +00:00
Ralph Castain
5bb5b22573 When a user asks for cpus/rank > 1 and only has one slot, we need to ensure we always map at least one process when they don't tell us -np
cmr=v1.8.2:reviewer=rhc:subject=correct num_procs in corner case

This commit was SVN r32142.
2014-07-04 17:00:35 +00:00
Ralph Castain
9f97c74ba3 Silence warning
This commit was SVN r32136.
2014-07-03 17:29:04 +00:00
Ralph Castain
356e7ea904 Move all collective id's into the attributes and let the job pack/unpack take care of them instead of singling them out. Add the envars just prior to forking the children instead of into the launch message itself. Remove a few #if CR as the attributes functionality can handle this condition now.
This commit was SVN r32133.
2014-07-03 15:58:13 +00:00
Ralph Castain
0a4639308e Remove a potential race condition - we'll cleanup the local children when we are all done
This commit was SVN r32132.
2014-07-03 14:13:43 +00:00
George Bosilca
2883adcdf3 Remove useless variables.
This commit was SVN r32123.
2014-07-03 00:30:54 +00:00
Ralph Castain
149810f02c Per request from Jeff, slightly modify the show_help message as the precise name of the NUMA-containing packages differs based on OS and distro
cmr=v1.8.2:reviewer=jsquyres:subject=modify show_help message

This commit was SVN r32122.
2014-07-02 14:46:00 +00:00
Ralph Castain
e9d69ca370 Remove stale test
This commit was SVN r32104.
2014-06-29 16:37:19 +00:00
Adrian Reber
47b118c0ae fix FT compilation
This commit was SVN r32094.
2014-06-26 03:40:07 +00:00
Adrian Reber
cabf1d4e68 use the orte attributes in the FT code to fix compile errors
This commit was SVN r32093.
2014-06-26 03:19:17 +00:00
Adrian Reber
10c1a50705 "handle" removal of opal_db.remove() in the FT code
This commit was SVN r32092.
2014-06-26 03:11:37 +00:00
Ralph Castain
f3cb124e50 Revert r32082 and r32070 - the developer's conference has decided to go a different direction on the threaded progress effort. This will involve some degree of prototyping to understand the tradeoffs prior to making a final design decision, and so we'll hold off on the final change until that is completed.
This commit was SVN r32089.

The following SVN revision numbers were found above:
  r32070 --> open-mpi/ompi@12d92d0c22
  r32082 --> open-mpi/ompi@aa6438ef7a
2014-06-25 20:43:28 +00:00
Adrian Reber
9f73e79d91 also change the callback function prototype (to get the FT code to compile again)
This commit was SVN r32088.
2014-06-25 20:37:02 +00:00
Adrian Reber
4aca7095dc fix a syntax error in the FT code
This commit was SVN r32087.
2014-06-25 20:35:50 +00:00
Adrian Reber
4b25e92194 get the FT code to compile again by adding/removing #includes
This commit was SVN r32086.
2014-06-25 18:42:17 +00:00
Ralph Castain
8fca77c3d3 Protect the binding policy setting so it builds when --without-hwloc
Refs trac:4742

This commit was SVN r32085.

The following Trac tickets were found above:
  Ticket 4742 --> https://svn.open-mpi.org/trac/ompi/ticket/4742
2014-06-25 18:13:54 +00:00
Adrian Reber
72f1c7941f use a consistent naming scheme for the SNAPSHOT attributes
This commit was SVN r32083.
2014-06-25 15:26:24 +00:00
Nathan Hjelm
563eaf0726 Fix support for Cray alps
The alps ras and plm components were broken by recent changes in ORTE. This
commit resolves those issues.

Changes:

 - Define PMI2_SUCCESS if it isn't defined. This fixes a problem with Cray's
   PMI implementation which does not define (for some reason) PMI2_SUCCESS. We
   had previously just used PMI_SUCCESS.

 - Add missing definition and fix a typo in pml_alps_module.

 - launch_id is no longer available in the orte_node_t structure. Use the
   attribute lookup to get the value.

 - Do not use an O(n^2) sorting algorithm when putting alps nodes in order. Use
   opal_list_sort instead (O(n log n)).

This commit was SVN r32076.
2014-06-24 21:29:04 +00:00
Ralph Castain
5f6be06b54 Per request from Gilles and discussion at devel conference, have the --oversubscribe option automatically set both oversubscribe and overload-allowed properties as this is likely what the user intended.
cmr=v1.8.2:reviewer=rhc:subject=automatically set oversub/load

This commit was SVN r32072.
2014-06-24 18:11:39 +00:00
Ralph Castain
12d92d0c22 Per the OMPI developer conference, remove the last vestiges of OMPI_USE_PROGRESS_THREADS
This commit was SVN r32070.
2014-06-24 17:05:11 +00:00
Ralph Castain
34e5573988 Resolve the MTT timeout problem. This appears to have largely been caused by missing sigchld notifications, thus causing the daemons to believe that not all procs had exited. Let comm failure also serve as notification of process termination, and add appropriate flags/attributes to avoid multiple reporting of proc termination.
This won't transition cleanly to the 1.8 series, and may represent too much change, so we'll have to (a) evaluate whether or not to bring it over (once it demonstrates that it does indeed solve the problem), and (b) develop a custom patch for that purpose.

Refs trac:4717

This commit was SVN r32063.

The following Trac tickets were found above:
  Ticket 4717 --> https://svn.open-mpi.org/trac/ompi/ticket/4717
2014-06-21 17:09:02 +00:00
Ralph Castain
9cfc408fd4 Little more debug - getting close to figuring this one out
Refs trac:4717

This commit was SVN r32060.

The following Trac tickets were found above:
  Ticket 4717 --> https://svn.open-mpi.org/trac/ompi/ticket/4717
2014-06-20 16:24:06 +00:00
Ralph Castain
f9da295682 Add some additional debug
Refs trac:4717

This commit was SVN r32059.

The following Trac tickets were found above:
  Ticket 4717 --> https://svn.open-mpi.org/trac/ompi/ticket/4717
2014-06-20 14:14:36 +00:00
Ralph Castain
645df5e823 Don't release the node_name field as it gets used in the slots parsing - will be released at newline detection
This commit was SVN r32058.
2014-06-20 13:18:46 +00:00
Ralph Castain
9a47e45a09 <laugh> ensure we really compare the things we want to compare
This commit was SVN r32055.
2014-06-19 20:54:25 +00:00
Ralph Castain
e65538e91b Add some defensive programming, fix a typo
This commit was SVN r32054.
2014-06-19 20:52:13 +00:00
Ralph Castain
b43f760f93 If you don't specify all the rank-file mapping for all procs, then you'll segfault - which is probably a bad idea. I can't see an easy workaround, so just error out for now and let's see if anyone really cares.
cmr=v1.8.2:reviewer=jsquyres

This commit was SVN r32053.
2014-06-19 20:30:06 +00:00
Ralph Castain
b618b36a2f Fix potential issue if opal_hwloc_topology is NULL
cmr=v1.8.2:reviewer=jsquyres

This commit was SVN r32050.
2014-06-19 18:52:41 +00:00
Ralph Castain
61fe4daa33 Add some further debug
Refs trac:4717

This commit was SVN r32047.

The following Trac tickets were found above:
  Ticket 4717 --> https://svn.open-mpi.org/trac/ompi/ticket/4717
2014-06-19 15:59:51 +00:00
Ralph Castain
65275d6326 Add a little more info to the warning message - i.e., that the likely cause of the problem is missing libnumactl and/or libnumactl-devel
cmr=v1.8.2:reviewer=miked:subject=improve memory binding failure message

This commit was SVN r32030.
2014-06-18 19:20:28 +00:00
Ralph Castain
3f032d39e8 Mark the proc as alive so waitpid callback system doesn't immediately activate the callback
Refs trac:4717

This commit was SVN r32026.

The following Trac tickets were found above:
  Ticket 4717 --> https://svn.open-mpi.org/trac/ompi/ticket/4717
2014-06-18 14:04:55 +00:00
Ralph Castain
8e7c0257f0 Cleanup some missed updates to orte_wait_cb as params have changed
Refs trac:4717

This commit was SVN r32025.

The following Trac tickets were found above:
  Ticket 4717 --> https://svn.open-mpi.org/trac/ompi/ticket/4717
2014-06-17 23:40:31 +00:00
Ralph Castain
5dbf4a62c4 Cleanup: we were accidentally killing ourselves (bad idea)
Refs trac:4717

This commit was SVN r32022.

The following Trac tickets were found above:
  Ticket 4717 --> https://svn.open-mpi.org/trac/ompi/ticket/4717
2014-06-17 20:38:42 +00:00
Ralph Castain
5216bd5558 Multiple sigchld reports can occur within a single event callback, so have to reap them until none remain. Also, need to ensure the daemon is flagged as alive prior to calling wait_cb
Refs trac:4717

This commit was SVN r32020.

The following Trac tickets were found above:
  Ticket 4717 --> https://svn.open-mpi.org/trac/ompi/ticket/4717
2014-06-17 18:46:40 +00:00
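The reaping pattern that 5216bd5558 describes can be sketched with plain POSIX calls. This is not the orte_wait code: `reap_all_children` and `reap_cb_t` are hypothetical names, and the sketch assumes the caller invokes it from its SIGCHLD event callback.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>

/* Hypothetical per-child callback type. */
typedef void (*reap_cb_t)(pid_t pid, int status);

/* Several children can exit before a single SIGCHLD notification is
 * processed, so keep calling waitpid() until it reports that nothing
 * more is ready. Returns the number of children reaped. */
static int reap_all_children(reap_cb_t cb)
{
    int count = 0;

    for (;;) {
        int status = 0;
        pid_t pid = waitpid(-1, &status, WNOHANG);

        if (pid > 0) {
            /* Without WUNTRACED/WCONTINUED only terminated children
             * are reported. */
            assert(WIFEXITED(status) || WIFSIGNALED(status));
            if (cb != NULL) {
                cb(pid, status);
            }
            count++;
            continue;
        }
        if (pid == -1 && errno == EINTR) {
            continue; /* interrupted by a signal: retry */
        }
        /* pid == 0: children exist but none has exited yet;
         * pid == -1 with ECHILD: no children remain. */
        break;
    }
    return count;
}
```

Calling `waitpid` only once per notification is what leaves exited children unreported, since pending SIGCHLD signals are not queued individually.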
Ralph Castain
42bf7466fc This isn't as big a change as it appears - a change in one place caused a whole bunch of files to require updated #include's due to some arcane linkage. Rework the orte_wait code to reflect the introduction of the state machine. If we are in cleanup mode and just want to kill all our local children, then there is no reason to be polite about it as that introduces *very* long delays at scale. Just kill the procs and move on.
Refs trac:4717

This commit was SVN r32019.

The following Trac tickets were found above:
  Ticket 4717 --> https://svn.open-mpi.org/trac/ompi/ticket/4717
2014-06-17 17:57:51 +00:00
Ralph Castain
ab52f16100 Attempt to cleanup the race condition Rolf keeps encountering in MTT by adding some protection to ensure orted's try to terminate once their local procs die. Also, fix a problem whereby a failure to comm_spawn would result in a hang of the parent process.
cmr=v1.8.2:reviewer=rhc:subject=cache termination cleanups

This commit was SVN r32008.
2014-06-16 20:46:35 +00:00
Ralph Castain
561983ae52 Fix static builds by renaming conflicting type
This commit was SVN r32006.
2014-06-14 17:39:28 +00:00
Ralph Castain
3f04d50cb0 Per the ticket, resolve our handling of overload conditions to provide a more consistent response. If we are overloaded (i.e., attempting to bind more processes to a location than the number of cpus under that location), then we consider the following conditions:
(a) default binding policy is in effect. In this case, we will emit a
warning and default to not binding unless the user provided the
"oversubscribe" or "overload" modifier to the "bind-to" option.

(b) user-specified binding policy is in effect. In this case, we will
error out unless the user provided the "oversubscribe" or "overload"
modifier to the "bind-to" option as we cannot meet the directive.

Either "bind-to" modifier (oversubscribe or overload) will be accepted for
now - in 1.9, we will deprecate the "overload" term in favor of
"oversubscribe".

Also added the ability to accept a --bind-to modifier without specifying the binding policy itself so a user can specify overload-allowed with the default policy.

Closes trac:4345

cmr=v1.8.2:reviewer=rhc:subject=resolve handling of overload conditions

This commit was SVN r32005.

The following Trac tickets were found above:
  Ticket 4345 --> https://svn.open-mpi.org/trac/ompi/ticket/4345
2014-06-14 15:38:32 +00:00
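The two overload cases in 3f04d50cb0 amount to a small decision table, sketched below. The enum and function names are hypothetical and do not appear in the ORTE sources; `overload_allowed` stands for the user having given either the "oversubscribe" or the "overload" modifier to "bind-to".

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical outcomes of the overload check. */
typedef enum {
    BIND_PROCEED,      /* bind as requested */
    BIND_WARN_NO_BIND, /* case (a): warn and fall back to not binding */
    BIND_ERROR         /* case (b): cannot meet the directive */
} bind_action_t;

static bind_action_t resolve_overload(int nprocs, int ncpus,
                                      bool policy_user_specified,
                                      bool overload_allowed)
{
    assert(nprocs >= 0 && ncpus >= 0);

    /* Overloaded means more processes bound to a location than there
     * are cpus under that location. */
    if (nprocs <= ncpus) {
        return BIND_PROCEED;
    }
    /* Either modifier lifts the restriction in both cases. */
    if (overload_allowed) {
        return BIND_PROCEED;
    }
    return policy_user_specified ? BIND_ERROR : BIND_WARN_NO_BIND;
}
```

For example, three processes on a two-cpu location under the default policy and without a modifier would yield `BIND_WARN_NO_BIND`, while the same request with a user-specified policy would yield `BIND_ERROR`.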