
533 Commits

Author SHA1 Message Date
Ralph Castain
f2c49c6c19 Fix the map-by object mapper to handle cpus-per-proc by accounting for the request when computing the number of procs to put on each object. This ensures that the binding routine doesn't automatically overload the cores.
cmr=v1.7.4:reviewer=jsquyres

This commit was SVN r29843.
2013-12-08 16:59:25 +00:00
Ralph Castain
7480beb7f0 Per request from Nathan, add an offset value to the job struct so we can construct a "global rank" that spans multiple jobs during dynamic launch operations. Store a new ORTE_DB_GLOBAL_RANK value for each process in the database, and ensure that we share our own value during connect_accept so both sides can see it.
This isn't being used yet - just enabling Nathan to do what he needs.

***** NOTE: any use of the OMPI_DB_GLOBAL_RANK database key must be protected by #ifdef OMPI_DB_GLOBAL_RANK as not all RTE's will define this key. *****

This commit was SVN r29708.
2013-11-14 17:01:43 +00:00
Mike Dubman
840e2cb4a2 mindist: cosmetic, use fallback to byslot if unable to read NUMA info, small fix.
fixed by Elena, reviewed by Ralph/Mike
cmr=v1.7.4:reviewer=ompi-gk1.7

This commit was SVN r29679.
2013-11-13 09:26:40 +00:00
Ralph Castain
e35ad23176 Correctly compute usage for dynamic spawns when binding is invoked. Ensure we correctly account for existing process usage on each node when computing bindings during dynamic spawns.
cmr=v1.7.4:reviewer=hjelmn:subject=Correctly compute usage for dynamic spawns when binding is invoked

This commit was SVN r29649.
2013-11-10 00:38:01 +00:00
Joshua Ladd
d594ffbfc7 Backing out Elena's patch - abstraction violation
This commit was SVN r29645.
2013-11-08 13:12:07 +00:00
Joshua Ladd
da3e272fdd Adds a check in the mindist mapper for whether or not the user asks for a specific device. This patch was submitted by Elena Elkina and reviewed by Josh Ladd and should be added to
cmr=v1.7.4:reviewer=jladd

This commit was SVN r29644.
2013-11-08 04:28:53 +00:00
Ralph Castain
960a255e7f Do some cleanup of the --without-hwloc build - no need to work on coprocessors since we can't detect them anyway, cleanup some unused variables in the ppr mapper
This commit was SVN r29476.
2013-10-23 01:45:21 +00:00
Jeff Squyres
758cd25fff Move the MCA / MPI_T level of the LAMA component down to 5 (from 9).
This commit was SVN r29214.
2013-09-20 15:23:27 +00:00
Ralph Castain
d9f0505952 Fix the lama verbose outputs so they don't segfault if someone asks for verbose output, but isn't using lama
cmr:v1.7.3:reviewer=jsquyres

This commit was SVN r29108.
2013-09-03 17:55:35 +00:00
Ralph Castain
2bfa99e945 If a rankfile is given and the number of procs not specified in the mpirun cmd line, then set the number of procs to the number of ranks in the rankfile
cmr:v1.7.3:reviewer=jsquyres

This commit was SVN r29104.
2013-09-02 15:04:40 +00:00
Ralph Castain
7a7cfdd519 A little cleanup - the base function to sort numa lists must return something or you get a warning about non-void function returning without value, so cleanup the return values. Ensure the mindist module actually checks for a return of "error" so it won't segfault, and have it emit a polite message when that happens.
cmr:v1.7.3:reviewer=jladd

This commit was SVN r29089.
2013-08-29 20:01:06 +00:00
Joshua Ladd
1802aabf1a Add support for autodetecting a MLNX HCA in the rmaps min distance feature. In this way, .ini files distributed with software stacks need not specify a particular HCA but instead may select the key word auto which will automatically select the discovered device. To use this feature, simply pass the keyword auto instead of a specific device name, --mca rmaps_base_dist_hca auto. If more than one card is installed, the mapper will inform the user of this and, at this point, the user will then need to specify which card via the normal route, e.g. --mca rmaps_base_dist_hca <dev_name>. This should be added to
cmr=v1.7.4:reviewer=rhc:subject=Autodetect logic for min dist mapping
This commit was SVN r29079.
2013-08-28 16:23:33 +00:00
Ralph Castain
a200e4f865 As per the RFC, bring in the ORTE async progress code and the rewrite of OOB:
*** THIS RFC INCLUDES A MINOR CHANGE TO THE MPI-RTE INTERFACE ***

Note: during the course of this work, it was necessary to completely separate the MPI and RTE progress engines. There were multiple places in the MPI layer where ORTE_WAIT_FOR_COMPLETION was being used. A new OMPI_WAIT_FOR_COMPLETION macro was created (defined in ompi/mca/rte/rte.h) that simply cycles across opal_progress until the provided flag becomes false. Places where the MPI layer blocked waiting for RTE to complete an event have been modified to use this macro.
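
The wait pattern described above amounts to a loop that keeps calling the progress function until a flag is cleared. The following is a minimal Python sketch of that pattern, not the actual C macro: `progress` is a hypothetical stand-in for opal_progress, and the flag is modelled as a mutable dict entry purely for illustration.

```python
def wait_for_completion(flag, progress):
    """Call progress() repeatedly until flag["active"] becomes False.

    Returns the number of progress() calls that were needed.
    """
    calls = 0
    while flag["active"]:
        progress()
        calls += 1
    return calls


def make_progress(flag, events_needed):
    """Hypothetical stand-in for opal_progress.

    Clears the flag once it has been called events_needed times.
    """
    state = {"remaining": events_needed}

    def progress():
        state["remaining"] -= 1
        if state["remaining"] <= 0:
            flag["active"] = False

    return progress


flag = {"active": True}
calls = wait_for_completion(flag, make_progress(flag, 3))
```

In this sketch the caller blocks only in the sense of spinning over the progress function; nothing here models the separate RTE progress thread.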

***************************************************************************************

I am reissuing this RFC because of the time that has passed since its original release. Since its initial release and review, I have debugged it further to ensure it fully supports tests like loop_spawn. It therefore seems ready for merge back to the trunk. Given its prior review, I have set the timeout for one week.

The code is in  https://bitbucket.org/rhc/ompi-oob2


WHAT:    Rewrite of ORTE OOB

WHY:       Support asynchronous progress and a host of other features

WHEN:    Wed, August 21

SYNOPSIS:
The current OOB has served us well, but a number of limitations have been identified over the years. Specifically:

* it is only progressed when called via opal_progress, which can lead to hangs or recursive calls into libevent (which is not supported by that code)

* we've had issues when multiple NICs are available as the code doesn't "shift" messages between transports - thus, all nodes had to be available via the same TCP interface.

* the OOB "unloads" incoming opal_buffer_t objects during the transmission, thus preventing use of OBJ_RETAIN in the code when repeatedly sending the same message to multiple recipients

* there is no failover mechanism across NICs - if the selected NIC (or its attached switch) fails, we are forced to abort

* only one transport (i.e., component) can be "active"


The revised OOB resolves these problems:

* async progress is used for all application processes, with the progress thread blocking in the event library

* each available TCP NIC is supported by its own TCP module. The ability to asynchronously progress each module independently is provided, but not enabled by default (a runtime MCA parameter turns it "on")

* multi-address TCP NICs (e.g., a NIC with both an IPv4 and IPv6 address, or with virtual interfaces) are supported - reachability is determined by comparing the contact info for a peer against all addresses within the range covered by the address/mask pairs for the NIC.

* a message that arrives on one TCP NIC is automatically shifted to whatever NIC is connected to the next "hop" if that peer cannot be reached by the incoming NIC. If no TCP module will reach the peer, then the OOB attempts to send the message via all other available components - if none can reach the peer, then an "error" is reported back to the RML, which then calls the errmgr for instructions.

* opal_buffer_t now conforms to standard object rules re OBJ_RETAIN as we no longer "unload" the incoming object

* NIC failure is reported to the TCP component, which then tries to resend the message across any other available TCP NIC. If that doesn't work, then the message is given back to the OOB base to try using other components. If all that fails, then the error is reported to the RML, which reports to the errmgr for instructions

* obviously from the above, multiple OOB components (e.g., TCP and UD) can be active in parallel

* the matching code has been moved to the RML (and out of the OOB/TCP component) so it is independent of transport

* routing is done by the individual OOB modules (as opposed to the RML). Thus, both routed and non-routed transports can simultaneously be active

* all blocking send/recv APIs have been removed. Everything operates asynchronously.
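
The reachability test and module selection described in the list above can be sketched roughly as follows. This is an illustrative Python model only: the NIC names, the address ranges, and the `select_module` helper are assumptions made for the example, not the OOB/TCP code, and the fallback to other components is reduced to returning `None`.

```python
import ipaddress


def reachable(nic_networks, peer_addr):
    """True if peer_addr falls inside any address/mask pair of the NIC."""
    addr = ipaddress.ip_address(peer_addr)
    # Membership is False when the address and network differ in IP version.
    return any(addr in net for net in nic_networks)


def select_module(modules, peer_addr):
    """Return the name of the first module whose NIC can reach the peer.

    Returns None when no module can reach it; the caller would then try
    other components or report an error.
    """
    for name, networks in modules:
        if reachable(networks, peer_addr):
            return name
    return None


# Hypothetical NICs: eth0 carries an IPv4 and an IPv6 address, eth1 one IPv4.
modules = [
    ("eth0", [ipaddress.ip_network("10.0.0.0/24"),
              ipaddress.ip_network("fd00::/64")]),
    ("eth1", [ipaddress.ip_network("192.168.1.0/24")]),
]
```

For example, `select_module(modules, "192.168.1.20")` picks `eth1`, while an address outside every listed range yields `None`.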


KNOWN LIMITATIONS:

* although provision is made for component failover as described above, the code for doing so has not been fully implemented yet. At the moment, if all connections for a given peer fail, the errmgr is notified of a "lost connection", which by default results in termination of the job if it was a lifeline

* the IPv6 code is present and compiles, but is not complete. Since the current IPv6 support in the OOB doesn't work anyway, I don't consider this a blocker

* routing is performed at the individual module level, yet the active routed component is selected on a global basis. We probably should update that to reflect that different transports may need/choose to route in different ways

* obviously, not every error path has been tested nor necessarily covered

* determining abnormal termination is more challenging than in the old code as we now potentially have multiple ways of connecting to a process. Ideally, we would declare "connection failed" when *all* transports can no longer reach the process, but that requires some additional (possibly complex) code. For now, the code replicates the old behavior only somewhat modified - i.e., if a module sees its connection fail, it checks to see if it is a lifeline. If so, it notifies the errmgr that the lifeline is lost - otherwise, it notifies the errmgr that a non-lifeline connection was lost.

* reachability is determined solely on the basis of a shared subnet address/mask - more sophisticated algorithms (e.g., the one used in the tcp btl) are required to handle routing via gateways

* the RML needs to assign sequence numbers to each message on a per-peer basis. The receiving RML will then deliver messages in order, thus preventing out-of-order messaging in the case where messages travel across different transports or a message needs to be redirected/resent due to failure of a NIC
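
The per-peer sequence numbering mentioned in the last item could work along these lines. This is a hedged Python sketch of in-order delivery with a hold-back buffer; the `InOrderReceiver` class and its method names are hypothetical and do not reflect the RML implementation, which the text above lists as still to be done.

```python
class InOrderReceiver:
    """Deliver messages from each peer in sequence-number order."""

    def __init__(self):
        self.next_seq = {}   # peer -> next sequence number expected
        self.pending = {}    # peer -> {seq: message} held back for later

    def receive(self, peer, seq, message):
        """Accept one message; return the messages that are now deliverable."""
        expected = self.next_seq.get(peer, 0)
        if seq < expected:
            # Already delivered, e.g. a resend after a NIC failure: drop it.
            return []
        held = self.pending.setdefault(peer, {})
        held[seq] = message
        delivered = []
        # Release every consecutive message starting at the expected number.
        while expected in held:
            delivered.append(held.pop(expected))
            expected += 1
        self.next_seq[peer] = expected
        return delivered
```

A message that arrives early over a faster transport is simply held until the gap before it is filled.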

This commit was SVN r29058.
2013-08-22 16:37:40 +00:00
Ralph Castain
b2d86e1857 Silence uninitialized var warning
This commit was SVN r29034.
2013-08-16 21:35:51 +00:00
Ralph Castain
7a21661785 Silence a warning when --without-hwloc is used
This commit was SVN r28783.
2013-07-13 17:17:17 +00:00
Dave Goodell
3741d62308 fix --without-hwloc build failure
All builds since r28682 configured with '--without-hwloc' fail at "make"
time without this fix.

Reviewed by rhc@

This commit was SVN r28769.

The following SVN revision numbers were found above:
  r28682 --> open-mpi/ompi@446e33a5d8
2013-07-12 17:21:14 +00:00
Ralph Castain
62378209f0 Even if we don't find the default hostfile, and nothing else was provided, then use all the known nodes.
cmr:v1.7.3:#3653:reviewer=jsquyres
cmr:v1.6.6:#3654:reviewer=jsquyres

This commit was SVN r28718.
2013-07-03 22:31:32 +00:00
Ralph Castain
443a6802b9 If the default hostfile is empty, we need to pickup all the known nodes, not just the head node.
cmr:v1.7.3:reviewer=jsquyres
cmr:v1.6.6:reviewer=jsquyres

This commit was SVN r28717.
2013-07-03 22:25:51 +00:00
Ralph Castain
446e33a5d8 There are cases where we want to use the novm state machine, but the backend node topology differs from that where mpirun is executing. In those cases, we can wind up thinking we are oversubscribed because the head node has fewer cores than the compute nodes.
To resolve this situation, add the ability to specify a backend topology file that mpirun shall use for its mapping operations. Create a new "set_topology" function in opal hwloc to support it.

This commit was SVN r28682.
2013-06-27 03:04:50 +00:00
Ralph Castain
a51a0a8c48 Fix uninitialized var
This commit was SVN r28652.
2013-06-18 22:41:47 +00:00
Joshua Ladd
61ffb47573 Minor fix for the min-dist mapping algorithm: we need to call 'get_nbobjs_by_type' first, before we get the sorted list of nodes - we need to add node objects and fill them in the summary object for the current topology. This patch was submitted by Elena Elkina and pushed by Josh Ladd. This should be added to cmr:v1.7:reviewer=jladd
This commit was SVN r28578.
2013-05-31 15:19:59 +00:00
Jeff Squyres
6d173af329 This commit introduces a new "mindist" ORTE RMAPS mapper, as well as
some relevant updates/new functionality in the opal/mca/hwloc and
orte/mca/rmaps bases.  This work was mainly developed by Mellanox,
with a bunch of advice from Ralph Castain, and some minor advice from
Brice Goglin and Jeff Squyres.

Even though this is mainly Mellanox's work, Jeff is committing only
for logistical reasons (he holds the hg+svn combo tree, and can
therefore commit it directly back to SVN).

-----

Implemented distance-based mapping algorithm as a new "mindist"
component in the rmaps framework.  It maps processes by NUMA node
according to PCI locality information as reported by the BIOS, from
the closest to the device to the furthest.

To use this algorithm, specify:

   {{{mpirun --map-by dist:<device_name>}}}

where <device_name> can be mlx5_0, ib0, etc.

There are two modes provided:

 1. bynode: load-balancing across nodes
 1. byslot: go through slots sequentially (i.e., the first nodes are
     more loaded)

These options are regulated by the optional ''span'' modifier; the
command line parameter looks like:

    {{{mpirun --map-by dist:<device_name>,span}}}

So, for example, if there are 2 nodes, each with 8 cores, and we'd
like to run 10 processes, the mindist algorithm will place 8 processes
on the first node and 2 on the second by default. But if you want to
place 5 processes on each node, you can add a span modifier in your
command line to do that.
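
The two-node example above can be reproduced with a small model. This is a simplified Python sketch, not the mindist code: the `distribute` function is hypothetical, it ignores oversubscription, and it models span as plain round-robin placement.

```python
def distribute(num_procs, slots_per_node, span=False):
    """Return how many processes land on each node.

    span=False fills nodes in order (byslot); span=True balances the
    processes across nodes (bynode).
    """
    counts = [0] * len(slots_per_node)
    remaining = num_procs
    if span:
        # Round-robin: one process per node per pass, while room remains.
        while remaining > 0:
            placed = False
            for i, slots in enumerate(slots_per_node):
                if remaining > 0 and counts[i] < slots:
                    counts[i] += 1
                    remaining -= 1
                    placed = True
            if not placed:
                raise ValueError("not enough slots")
    else:
        # Sequential: fill each node completely before moving to the next.
        for i, slots in enumerate(slots_per_node):
            take = min(slots, remaining)
            counts[i] = take
            remaining -= take
        if remaining > 0:
            raise ValueError("not enough slots")
    return counts
```

With two 8-core nodes and 10 processes, the default call gives 8 and 2, and `span=True` gives 5 and 5.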

If there are two NUMA nodes on the node, each with 4 cores, and we run
6 processes, the mindist algorithm will try to find the NUMA node
closest to the specified device, and if successful, it will place 4
processes on that NUMA node and leave the remaining two for the next
NUMA node.

You can also specify the number of cpus per MPI process. This option
is handled so that we map as many processes to the closest NUMA as we
can (number of available processors at the NUMA divided by number of
cpus per rank) and then go on with the next closest NUMA.
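
The capacity rule in the two paragraphs above (available processors on a NUMA node divided by cpus per rank, closest NUMA node first) can be written out as a short calculation. This Python sketch assumes the NUMA nodes are already sorted by distance to the device; `map_to_numas` is a hypothetical name, not a function in the mapper.

```python
def map_to_numas(num_procs, numa_cpus, cpus_per_rank=1):
    """Fill NUMA nodes closest-first.

    numa_cpus lists the available cpus per NUMA node, ordered from the
    closest to the device to the furthest. Returns a pair: the number of
    processes mapped to each NUMA node, and how many could not be placed.
    """
    counts = []
    remaining = num_procs
    for cpus in numa_cpus:
        capacity = cpus // cpus_per_rank   # ranks that fit on this NUMA node
        take = min(capacity, remaining)
        counts.append(take)
        remaining -= take
    return counts, remaining
```

For two 4-core NUMA nodes and 6 processes this yields 4 on the closest node and 2 on the next; with 2 cpus per rank, each NUMA node holds at most 2 ranks.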

The default binding option for this mapping is bind-to-numa, which
applies if you don't specify any binding policy. If you specify a
binding level "lower" than NUMA (i.e., hwthread, core, socket), it
will bind to whatever level you specify.

This commit was SVN r28552.
2013-05-22 13:04:40 +00:00
Jeff Squyres
089c632cce Remove a bunch of dead code: gcc 4.7 warns of set-but-unused
variables.  So get rid of them.

This commit was SVN r28538.
2013-05-17 21:45:49 +00:00
Ralph Castain
e100b8d165 don't need the return value, but should check for error
This commit was SVN r28534.
2013-05-16 15:15:02 +00:00
Jeff Squyres
128cc27417 Minor type fix (they're both enums/ints, so the compiler previously
silently cast them).

This commit was SVN r28532.
2013-05-16 00:47:37 +00:00
Ralph Castain
3a372a65b8 Mapping policies must be tested as equalities as they are values, not bitmasks
This commit was SVN r28526.
2013-05-15 13:45:00 +00:00
Ralph Castain
29e4b0cc50 Cannot test equality on mapping directives as it is a bitmask
This commit was SVN r28525.
2013-05-15 13:41:49 +00:00
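
The two commits above turn on the difference between a field that holds one value and a field that holds independent flags. The Python sketch below illustrates why each needs a different test. The bit layout and every constant name in it are assumptions chosen for the example, not the ORTE definitions.

```python
# Hypothetical layout, assumed for illustration only: the low byte holds
# the policy *value*, the high byte holds directive *flags*.
POLICY_MASK = 0x00FF
DIRECTIVE_MASK = 0xFF00

BYNODE = 1
BYSLOT = 2
BYSOCKET = 3   # 3 == BYNODE | BYSLOT, so a bit test on policies misfires

NO_USE_LOCAL = 0x0100
NO_OVERSUBSCRIBE = 0x0200


def policy_is(mapping, policy):
    """Policies are values: mask out the field and compare for equality."""
    return (mapping & POLICY_MASK) == policy


def directive_set(mapping, directive):
    """Directives are flags: test the bit, since several may be set at once."""
    return (mapping & directive) != 0


mapping = BYSOCKET | NO_USE_LOCAL | NO_OVERSUBSCRIBE
```

Here `mapping & BYNODE` is non-zero even though the policy is `BYSOCKET`, and comparing the directive field for equality with `NO_USE_LOCAL` fails because a second flag is also set.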
Ralph Castain
5296099ecb Fix the cpus-per-rank when binding to hwthreads. Add cpus-per-rank to diag printout
Thanks to Elena for reporting the problem

This commit was SVN r28508.
2013-05-14 20:17:50 +00:00
Ralph Castain
427b6b0b47 Fix the verbosity of yet another framework...sigh.
This commit was SVN r28481.
2013-05-13 14:36:32 +00:00
Jeff Squyres
456df1c9f7 Remove redundant opal_output() messages from the module; the called
functions will now show_help() their own error messages if something
goes wrong (per r28470).

This commit was SVN r28471.

The following SVN revision numbers were found above:
  r28470 --> open-mpi/ompi@2ff95a7739
2013-05-10 15:12:07 +00:00
Jeff Squyres
2ff95a7739 Proper show_help error messages for LAMA.
This commit was SVN r28470.
2013-05-10 15:06:25 +00:00
Ralph Castain
707d0e653a Must use equal and not & comparison for mapping directives
This commit was SVN r28451.
2013-05-06 15:07:12 +00:00
Ralph Castain
5d7a93c032 Add the ability to use an external version of libevent. Clearly not recommended at this time. I've verified that it works in limited scenarios, but more thorough testing and performance impacts need to be assessed.
Interesting how many includes had to be fixed here and there to fill in missing dependencies :-)

This commit was SVN r28411.
2013-04-29 17:02:37 +00:00
Ralph Castain
252147fba6 Cleanup error message if unknown host is given in -host and -hostfile options
This commit was SVN r28262.
2013-03-28 16:52:10 +00:00
Nathan Hjelm
c041156f60 Update ORTE frameworks to use the MCA framework system.
This commit was SVN r28240.
2013-03-27 21:14:43 +00:00
Nathan Hjelm
cf377db823 MCA/base: Add new MCA variable system
Features:
 - Support for an override parameter file (openmpi-mca-param-override.conf).
   Variable values in this file can not be overridden by any file or environment
   value.
 - Support for boolean, unsigned, and unsigned long long variables.
 - Support for true/false values.
 - Support for enumerations on integer variables.
 - Support for MPIT scope, verbosity, and binding.
 - Support for command line source.
 - Support for setting variable source via the environment using
   OMPI_MCA_SOURCE_<var name>=source (either command or file:filename)
 - Cleaner API.
 - Support for variable groups (equivalent to MPIT categories).

Notes:
 - Variables must be created with a backing store (char **, int *, or bool *)
   that must live at least as long as the variable.
 - Creating a variable with the MCA_BASE_VAR_FLAG_SETTABLE enables the use of
   mca_base_var_set_value() to change the value.
 - String values are duplicated when the variable is registered. It is up to
   the caller to free the original value if necessary. The new value will be
   freed by the mca_base_var system and must not be freed by the user.
 - Variables with constant scope may not be settable.
 - Variable groups (and all associated variables) are deregistered when the
   component is closed or the component repository item is freed. This
   prevents a segmentation fault from accessing a variable after its component
   is unloaded.
 - After some discussion we decided we should remove the automatic registration
   of component priority variables. Few components actually made use of this
   feature.
 - The enumerator interface was updated to be general enough to handle
   future uses of the interface.
 - The code to generate ompi_info output has been moved into the MCA variable
   system. See mca_base_var_dump().

opal: update core and components to mca_base_var system
orte: update core and components to mca_base_var system
ompi: update core and components to mca_base_var system

This commit also modifies the rmaps framework. The following variables were
moved from ppr and lama: rmaps_base_pernode, rmaps_base_n_pernode,
rmaps_base_n_persocket. Both lama and ppr create synonyms for these variables.

This commit was SVN r28236.
2013-03-27 21:09:41 +00:00
Ralph Castain
e7ac6c9bde Don't build rank_file if you can't use it anyway
This commit was SVN r28233.
2013-03-27 15:12:40 +00:00
Ralph Castain
256414121e Protect the cpus-per-rank MCA param registration so that --without-hwloc will build
This commit was SVN r28232.
2013-03-27 14:53:30 +00:00
Ralph Castain
317915225c Finish the binding cleanup by removing the no-longer-used binding level scheme. This proved to be fallible as there is no guarantee that the hierarchy it used matched physical reality of the machine (e.g., is L3 "above" the socket or not). Still have to complete the ppr update, but get the rest of it correct.
This commit was SVN r28223.
2013-03-26 20:09:49 +00:00
Ralph Castain
6ee32767d4 Restore the cpus-per-proc option for byslot and bynode mapping. Remove the bind_idx (which recorded the index of the hwloc object where the proc was bound) as this would no longer be unique, and just use the bitmap as the standard reference for location. Update the relative locality computation to take bitmaps as its argument.
This commit was SVN r28219.
2013-03-26 18:27:50 +00:00
Ralph Castain
2f43989d22 Add debug and handle the use-case where someone (a) uses a hostfile while in a managed allocation to sub-allocate runs, and (b) includes the HNP's node in one of those hostfiles.
cmr:v1.7

This commit was SVN r28203.
2013-03-22 00:53:33 +00:00
Ralph Castain
cf9796accd Remove the old configure option for disabling full rte support - we now use the OMPI rte framework for such purposes
This commit was SVN r28134.
2013-02-28 01:35:55 +00:00
Ralph Castain
8d2fa3693b First cut at removing the native Windows support. Remove all the Windows-specific components, and the .windows files sprinkled around. Remove the Windows platform files and MTT scripts. Update the NEWS to point Windows users to the cygwin package.
This commit was SVN r28116.
2013-02-26 20:44:56 +00:00
Jeff Squyres
8e25b927ab Clean some minor warnings: remove variables that were set but never
used.

This commit was SVN r27974.
2013-01-29 23:35:42 +00:00
Ralph Castain
112f8eedb1 Handle the case where rankfile is providing the allocation
This commit was SVN r27971.
2013-01-29 20:37:58 +00:00
Ralph Castain
f6b4db0b79 Fix rank_file operations. We changed the syntax to use semi-colons between multiple slot assignments so that we could use the comma to separate specific cores, but somehow the flex definitions didn't get updated to accept that character. We also incorrectly zeroed the bitmap between slot assignment sections, so with multiple slot assignments only the last one in the list took effect.
This commit was SVN r27908.
2013-01-25 18:33:25 +00:00
Nathan Hjelm
3e1b13b13a Re-add support for old flex (2.5.4a and earlier) while still cleaning up properly in new flex.
This commit was SVN r27657.
2012-12-07 00:12:43 +00:00
Nathan Hjelm
e0f5137e46 add prototypes for lex destroy functions
This commit was SVN r27580.
2012-11-09 22:00:27 +00:00
Nathan Hjelm
8658bbc902 instead of relying on yyterminate to clean up the lex context call the destroy functions directly (after closing the file)
This commit was SVN r27577.
2012-11-09 16:10:55 +00:00
Ralph Castain
9b729794f2 A prior commit apparently broke the trunk when something was inadvertently left behind - so remove a reference to a no-longer-existing function
This commit was SVN r27574.
2012-11-07 11:11:05 +00:00