openmpi

Автор	SHA1	Сообщение	Дата
Dave Goodell	dd82bd3c19	usnic: fix invalid rfstart initialization endpoint_rfstart was being initialized from a value which was not yet set. Also ensure that rfstart is a valid index in the range 0..WINDOW_SIZE-1, since it is used as the index into endpoint_rcvd_segs, which has WINDOW_SIZE elements. Without this change there is significant risk of memory corruption or segfaults, resulting in hangs or crashes, if malloc ever returns us a value >=WINDOW_SIZE (4096). Right now we seem to be getting lucky that the malloc is returning zero-pages to us when we are allocating endpoint structures (possibly because the freelist performs a single large allocation for all endpoints). Fixes Cisco bug CSCui88781. Reviewed-by: rfaucett@cisco.com Reviewed-by: jsquyres@cisco.com cmr=v1.7.3:reviewer=jsquyres This commit was SVN r29075.	2013-08-27 22:43:20 +00:00
Ralph Castain	45e695928f	As per the email discussion, revise the sparse handling of hostnames so that we avoid potential infinite loops while allowing large-scale users to improve their startup time: * add a new MCA param orte_hostname_cutoff to specify the number of nodes at which we stop including hostnames. This defaults to INT_MAX => always include hostnames. If a value is given, then we will include hostnames for any allocation smaller than the given limit. * remove ompi_proc_get_hostname. Replace all occurrences with a direct link to ompi_proc_t's proc_hostname, protected by appropriate "if NULL" * modify the OMPI-ORTE integration component so that any call to modex_recv automatically loads the ompi_proc_t->proc_hostname field as well as returning the requested info. Thus, any process whose modex info you retrieve will automatically receive the hostname. Note that on-demand retrieval is still enabled - i.e., if we are running under direct launch with PMI, the hostname will be fetched upon first call to modex_recv, and then the ompi_proc_t->proc_hostname field will be loaded * removed a stale MCA param "mpi_keep_peer_hostnames" that was no longer used anywhere in the code base * added an envar lookup in ess/pmi for the number of nodes in the allocation. Sadly, PMI itself doesn't provide that info, so we have to get it a different way. Currently, we support PBS-based systems and SLURM - for any other, rank0 will emit a warning and we assume max number of daemons so we will always retain hostnames This commit was SVN r29052.	2013-08-20 18:59:36 +00:00
Ralph Castain	611d7f9f6b	When we direct launch an application, we rely on PMI for wireup support. In doing so, we lose the de facto data compression we get from the ORTE modex since we no longer get all the wireup info from every proc in a single blob. Instead, we have to iterate over all the procs, calling PMI_KVS_get for every value we require. This creates a really bad scaling behavior. Users have found a nearly 20% launch time differential between mpirun and PMI, with PMI being the slower method. Some of the problem is attributable to poor exchange algorithms in RM's like Slurm and Alps, but we make things worse by calling "get" so many times. Nathan (with a tad advice from me) has attempted to alleviate this problem by reducing the number of "get" calls. This required the following changes: * upon first request for data, have the OPAL db pmi component fetch and decode all the info from a given remote proc. It turned out we weren't caching the info, so we would continually request it and only decode the piece we needed for the immediate request. We now decode all the info and push it into the db hash component for local storage - and then all subsequent retrievals are fulfilled locally * reduced the amount of data by eliminating the exchange of the OMPI_ARCH value if heterogeneity is not enabled. This was used solely as a check so we would error out if the system wasn't actually homogeneous, which was fine when we thought there was no cost in doing the check. Unfortunately, at large scale and with direct launch, there is a non-zero cost of making this test. We are open to finding a compromise (perhaps turning the test off if requested?), if people feel strongly about performing the test * reduced the amount of RTE data being automatically fetched, and fetched the rest only upon request. In particular, we no longer immediately fetch the hostname (which is only used for error reporting), but instead get it when needed. Likewise for the RML uri as that info is only required for some (not all) environments. In addition, we no longer fetch the locality unless required, relying instead on the PMI clique info to tell us who is on our local node (if additional info is required, the fetch is performed when a modex_recv is issued). Again, all this only impacts direct launch - all the info is provided when launched via mpirun as there is no added cost to getting it Barring objections, we may move this (plus any required other pieces) to the 1.7 branch once it soaks for an appropriate time. This commit was SVN r29040.	2013-08-17 00:49:18 +00:00
Jeff Squyres	87910daf51	Fix a collection of bugs found by QA and Coverity, and make some minor improvements: * Fix minor memory leaks during component_init * Ensure that an initialization loop does not underflow an unsigned int * Improve mlock limit checking * Fix set of BTL modules created during component_init when failing to get QP resources or otherwise excluding some (but not all) usnic verbs devices * Fix/improve error messages to be consistent with other Cisco documentation * Randomize the initial sliding window sequence number so that we silently drop incoming frames from previous jobs that still have existant processes in the middle of dying (and are still transmitting) * Ensure we don't break out of add_procs too soon and create an asymetrical view of what interfaces are available This commit was SVN r28975.	2013-08-01 16:56:15 +00:00
Jeff Squyres	4b6006402d	Use the RTE framework instead of calling ORTE directly. Brian (rightfully) hit me on the head with the don't-use-ORTE-use-the-rte-framework clue bat; the usnic BTL now nicely plays with the RTE framework. This commit was SVN r28907.	2013-07-22 17:28:23 +00:00
Jeff Squyres	194b285447	First commit of the Cisco usNIC BTL. This BTL accesses the Cisco usNIC Linux device via the Linux verbs API via Unreliable Datagram queue pairs. A few noteworthy points: * This BTL does most of its own fragmentation; it tells the PML that it has a very high max_send_size (much higher than the network MTU). * Since UD fragments are, by definition, unreliable, the usnic BTL handles all of its own reliability via a sliding window approach using the opal_hotel construct and many tricks stolen from the corpus of knowledge surrounding efficient TCP. * There is a fun PML latency-metric based optimization for NUMA awareness of short messages. * Note that this is ''not'' a generic UD verbs BTL; it is specific to the Cisco usNIC device. This commit was SVN r28879.	2013-07-19 22:13:58 +00:00

6 Коммитов