Might as well save a few bytes when sending this struct across the
network via the __opal_attribute_packed__ attribute.
That being said, also re-order the elements in this struct so that
there's no holes to begin with. Do this so that the compiler/runtime
won't effect (slow) unaligned reads/writes because of the
__opal_attribute_packed__ attribute.
The "packed" attribute is really more about defensive programming
(e.g., if we make a mistake and have a hole, "packed" will remove it
for us).
*** Do not bring this commit back to existing/already-released release
branches: it will cause incompatibility, since it effectively changes
the usNIC BTL wire protocol.
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
* The user can set `-mca odls_base_sigkill_timeout 30` to have ORTE wait
30 seconds before sending SIGTERM then another 30 seconds before sending
SIGKILL to remaining processes. This usually happens on an abnormal
termination. Sometimes the user wants to delay the cleanup to give the
system time to write out corefile or run other diagnostics.
* The problem is that child processes may be completing while ORTE is
in this loop. The SIGCHLD will interrupt the `sleep` system call.
Without the loop the sleep could effectively be ignored in this case.
- Sleep returns the amount of time remaining to sleep. If it was
interrupted by a signal then it is a positive number less than or
equal to the parameter passed to it. If it slept the whole time
then it returns 0.
Signed-off-by: Joshua Hursey <jhursey@us.ibm.com>
one off patch for v4.0.x. for some reason commit on master
didn't have this problem.
Signed-off-by: Howard Pritchard <howardp@lanl.gov>
(cherry picked from commit 5f3dbdb5c8)
Note that this commit is actually a cherry-pick from the v4.0.x
branch. This is the opposite direction than what we nornmally do: we
usually commit to master first and then cherry-pick to the release
branches (vs. the other way around).
As is probably evident from the original commit message above, through
a comedy of errors, this commit was actually applied to the v4.0.x
branch first and then cherry-picked back to master (i.e., the problem
*did* exist in the original master commit
3aca4af548, but it was not recongized at
the time).
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
Remove code for multiple OOB progress threads as it is an optimization
nobody uses. Also turns out to have a race condition that can cause
segfault on finalize, so maybe good that nobody is using it.
Signed-off-by: Ralph Castain <rhc@pmix.org>
INTERNAL: STL-59403
The OFI (libfabric) MTL does not respect the maximum message size
parameter that OFI provides in the fi_info data.
This patch adds this missing max_msg_size field to the mca_ofi_module_t
structure and adds a length check to the low-level send routines.
Change-Id: I05aa71d332f2df897133b30c28bf37d98f061996
Signed-off-by: Michael Heinz <michael.william.heinz@intel.com>
Reviewed-by: Adam Goldman <adam.goldman@intel.com>
Reviewed-by: Brendan Cunningham <brendan.cunningham@intel.com>
- due to some refactoring and adding new functionality compilation
of ikrit module was broken
- this commit restores compilation
Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>
Currently, there is no function that allows the user to retrieve the
data they have stored in a vertex easily. Using the internal macros and
knowledge of the structures, the new function will return a pointer to
the user provided vertex data.
Signed-off-by: William Zhang <wilzhang@amazon.com>
If both types of interfaces are enabled, don't error out if one of them
isn't able to open listener sockets. Only one interface family may be
available on some machines, but someone might want to build the code to
run more generally.
Refs https://github.com/pmix/prrte/pull/249
Signed-off-by: Ralph Castain <rhc@pmix.org>
This commit changes how the single-copy emulation in the vader btl
operates. Before this change the BTL set its put and get limits
based on the max send size. After this change the limits are unset
and the put or get operation is fragmented internally.
References #6568
Signed-off-by: Nathan Hjelm <hjelmn@google.com>
Trying out to run processes via mpirun in Podman containers has shown
that the CMA btl_vader_single_copy_mechanism does not work when user
namespaces are involved.
Creating containers with Podman requires at least user namespaces to be
able to do unprivileged mounts in a container
Even if running the container with user namespace user ID mappings which
result in the same user ID on the inside and outside of all involved
containers, the check in the kernel to allow ptrace (and thus
process_vm_{read,write}v()), fails if the same IDs are not in the same
user namespace.
One workaround is to specify '--mca btl_vader_single_copy_mechanism none'
and this commit adds code to automatically skip CMA if user namespaces
are detected and fall back to MCA_BTL_VADER_EMUL.
Signed-off-by: Adrian Reber <areber@redhat.com>
I'm restoring the info function pointers to the IO module
but allowing the function pointers to be NULL (eg in ompio).
And letting romio321 set its function pointers for those
routines.
This means the info system uses the new OMPI-level info
system for most things, but skips it and uses the pre-existing
romio info system just for the romio module.
It's possible to convert romio, but I went a ways down that
path and found it kind of convoluted. Having pointers from
the lower level ADIO_File back to the higher level ompi_file_t
wasn't too bad, but I got stuck trying to figure out where/how
to register the infosubscribe_subscribe callbacks vs the way
initial k/v values are scattered around the romio code currently.
Signed-off-by: Mark Allen <markalle@us.ibm.com>
This patch fixes the merge of contiguous elements into larger but more
compact datatypes, and allows for contiguous elements to have thir
blocklen increasing instead of the count. The idea is to always maximize
the blocklen, aka. the contiguous part of the datatype.
Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
These variables were renamed in
904276bb44caec207638247f23139bc21bc6a09e; update them to use the new
names.
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>