openmpi

Author	SHA1	Message	Date
Artem Polyakov	d9ad918a14	orte/iof: Address the case when output is a regular file Regular files are always write-ready, so non-blocking I/O does not give any benefits for them. More than that - if libevent is using "epoll" to track fd events, epoll_ctl will refuse attempt to add an fd pointing to a regular file descriptor with EPERM. This fix checks the object referenced by fd and avoids event_add using event_active instead. In the original configuration that uncovered this issue "epoll" was used in libevent, it was triggering the following warning message: "[warn] Epoll ADD(1) on fd 0 failed. Old events were 0; read change was 1 (add); write change was 0 (none): Operation not permitted" And the side effect was accumulation of all output in mpirun memory and actually writing it only at mpirun exit. Signed-off-by: Artem Polyakov <artpol84@gmail.com>	2017-07-01 02:24:14 +07:00
Joshua Hursey	3b780ac137	opal/mca: Fix mca_base_verbose file suffix processing * `-mca mca_base_verbose file:foo` should create an output file with the suffix `foo`. But since we free the pointer at the end of this function then by the time we use it it is pointing to invalid memory. * This commit fixes that corruption * This commit also fixes the behavior of `file:` with no suffix. Makes it the same as `file` without the colon. Signed-off-by: Joshua Hursey <jhursey@us.ibm.com>	2017-06-27 16:52:56 -05:00
Ralph Castain	ecacde0cd5	Purge whitespace errors Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-06-23 11:12:14 -07:00
Nathan Hjelm	9c621ad5a4	opal/info: fix abstraction break The new info infrastructure introduced an abstraction break by including mpi.h and using MPI_ constants in opal. This commit fixes the break by changing the constants to their opal equivalents. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2017-06-23 08:03:01 -06:00
Nathan Hjelm	ffd8ee2dfd	opal: use opal_list_t convenience macros This commit cleans up code in opal to use OPAL_LIST_FOREACH(_SAFE), OPAL_LIST_DESTRUCT, and OPAL_LIST_RELEASE. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2017-06-20 12:37:12 -06:00
KAWASHIMA Takahiro	3afc61644d	opal/util: Get rid of `\0` from abort delay message My recent commit `6b91edd` had this bug. Signed-off-by: KAWASHIMA Takahiro <t-kawashima@jp.fujitsu.com>	2017-06-19 20:08:34 +09:00
KAWASHIMA Takahiro	6b91eddc8b	Apply `opal_abort_delay` to the signal handler This commit expands the effect of the MCA parameter `opal_abort_delay` to the OPAL signal handler. This allows attaching of a debugger on segmentation fault etc. before quitting the job. The sleep code is moved to the `opal_delay_abort` function from the `ompi_mpi_abort` and `oshmem_shmem_abort` functions for code cleanup. Signed-off-by: KAWASHIMA Takahiro <t-kawashima@jp.fujitsu.com>	2017-06-08 19:34:48 +09:00
Joshua Hursey	fce28c31d0	opal/stacktrace: Fix stderr target for opal_stacktrace_output Signed-off-by: Joshua Hursey <jhursey@us.ibm.com>	2017-05-22 13:46:02 -05:00
Mark Allen	482d84b6e5	fixes for Dave's get/set info code The expected sequence of events for processing info during object creation is that if there's an incoming info arg, it is opal_info_dup()ed into the obj at obj->s_info first. Then interested components register callbacks for keys they want to know about using opal_infosubscribe_infosubscribe(). Inside info_subscribe_subscribe() the specified callback() is called with whatever matching k/v is in the object's info, or with the default. The return string from the callback goes into the new k/v stored in info, and the input k/v is saved as __IN_<key>/<val>. It's saved the same way whether the input came from info or whether it was a default. A null return from the callback indicates an ignored key/val, and no k/v is stored for it, but an __IN_<key>/<val> is still kept so we still have access to the original. At MPI__set_info() time, opal_infosubscribe_change_info() is used. That function calls the registered callbacks for each item in the provided info. If the callback returns non-null, the info is updated with that k/v, or if the callback returns null, that key is deleted from info. An __IN_<key>/<val> is saved either way, and overwrites any previously saved value. When MPI__get_info() is called, opal_info_dup_mpistandard() is used, which allows relatively easy changes in interpretation of the standard, by looking at both the <key>/<val> and __IN_<key>/<val> in info. Right now it does 1. includes system extras, eg k/v defaults not explicitly set by the user 2. omits ignored keys 3. shows input values, not callback modifications, eg not the internal values Currently the callbacks are doing things like return some_condition ? "true" : "false" that is, returning static strings that are not to be freed. 
If the return strings start becoming more dynamic in the future I don't see how unallocated strings could support that, so I'd propose a change for the future that the callback()s registered with info_subscribe_subscribe() do a strdup on their return, and we change the callers of callback() to free the strings it returns (there are only two callers). Rough outline of the smaller changes spread over the less central files: comm.c initialize comm->super.s_info to NULL copy into comm->super.s_info in comm creation calls that provide info OBJ_RELEASE comm->super.s_info at free time comm_init.c initialize comm->super.s_info to NULL file.c copy into file->super.s_info if file creation provides info OBJ_RELEASE file->super.s_info at free time win.c copy into win->super.s_info if win creation provides info OBJ_RELEASE win->super.s_info at free time comm_get_info.c file_get_info.c win_get_info.c change_info() if there's no info attached (shouldn't happen if callbacks are registered) copy the info for the user The other category of change is generally addressing compiler warnings where ompi_info_t and opal_info_t were being used a little too interchangeably. An ompi_info_t* contains an opal_info_t*, at &(ompi_info->super) Also this commit updates the copyrights. Signed-off-by: Mark Allen <markalle@us.ibm.com>	2017-05-17 01:12:49 -04:00
David Solt	50aa143ab6	Major structural changes to data types: .super infosubscriber ompi_communicator_t, ompi_win_t, ompi_file_t all have a super class of type opal_infosubscriber_t instead of a base/super type of opal_object_t (in previous code comm used c_base, but file used super). It may be a bit bold to say that being a subscriber of MPI_Info is the foundational piece that ties these three things together, but if you object, then I would prefer to turn infosubscriber into a more general name that encompasses other common features rather than create a different super class. The key here is that we want to be able to pass comm, win and file objects as if they were opal_infosubscriber_t, so that one routine can handle all 3 types of objects being passed to it. MPI_INFO_NULL is still an ompi_predefined_info_t type since an MPI_Info is part of ompi but the internal details of the underlying information concept is part of opal. An ompi_info_t type still exists for exposure to the user, but it is simply a wrapper for the opal object. Routines such as ompi_info_dup, etc have all been moved to opal_info_dup and related to the opal directory. Fortran to C translation tables are only used for MPI_Info that is exposed to the application and are therefore part of the ompi_info_t and not the opal_info_t The data structure changes are primarily in the following files: communicator/communicator.h ompi/info/info.h ompi/win/win.h ompi/file/file.h The following new files were created: opal/util/info.h opal/util/info.c opal/util/info_subscriber.h opal/util/info_subscriber.c This infosubscriber concept is that communicators, files and windows can have subscribers that subscribe to any changes in the info associated with the comm/file/window. When xxx_set_info is called, the new info is presented to each subscriber who can modify the info in any way they want. The new value is presented to the next subscriber and so on until all subscribers have had a chance to modify the value. 
Therefore, the order of subscribers can make a difference but we hope that there is generally only one subscriber that cares or modifies any given key/value pair. The final info is then stored and returned by a call to xxx_get_info. The new model can be seen in the following files: ompi/mpi/c/comm_get_info.c ompi/mpi/c/comm_set_info.c ompi/mpi/c/file_get_info.c ompi/mpi/c/file_set_info.c ompi/mpi/c/win_get_info.c ompi/mpi/c/win_set_info.c The current subscribers were changed as follows: mca/io/ompio/io_ompio_file_open.c mca/io/ompio/io_ompio_module.c mca/osc/rmda/osc_rdma_component.c (This one actually subscribes to "no_locks") mca/osc/sm/osc_sm_component.c (This one actually subscribes to "blocking_fence" and "alloc_shared_contig") Signed-off-by: Mark Allen <markalle@us.ibm.com> Conflicts: AUTHORS ompi/communicator/comm.c ompi/debuggers/ompi_mpihandles_dll.c ompi/file/file.c ompi/file/file.h ompi/info/info.c ompi/mca/io/ompio/io_ompio.h ompi/mca/io/ompio/io_ompio_file_open.c ompi/mca/io/ompio/io_ompio_file_set_view.c ompi/mca/osc/pt2pt/osc_pt2pt.h ompi/mca/sharedfp/addproc/sharedfp_addproc.h ompi/mca/sharedfp/addproc/sharedfp_addproc_file_open.c ompi/mca/topo/treematch/topo_treematch_dist_graph_create.c ompi/mpi/c/lookup_name.c ompi/mpi/c/publish_name.c ompi/mpi/c/unpublish_name.c opal/mca/mpool/base/mpool_base_alloc.c opal/util/Makefile.am	2017-05-12 14:41:05 -04:00
Nathaniel Graham	01312b2f90	Additional mpirun --help changes This commit recategorizes several mpirun arguments, and moves the information for mpirun --help arguments to the bottom of the general help message. I also added the OPAL_CMD_LINE_OTYPE field to two commands that were missed initially because they were not in the same area as the others. Signed-off-by: Nathaniel Graham <ngraham@lanl.gov>	2017-04-19 11:43:45 -06:00
Ralph Castain	dadc924cde	Cleanup warnings when timing is not enabled Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-04-11 17:29:27 -07:00
Artem Polyakov	4477b87e1d	Merge pull request #3303 from karasevb/timing2/master OMPI timings	2017-04-11 07:52:40 -07:00
Boris Karasev	d132eab4a5	ompi/timings: fixed the error of opal timings env import Signed-off-by: Boris Karasev <karasev.b@gmail.com>	2017-04-11 12:08:48 +06:00
Ralph Castain	95ae0d1df3	Cleanup timing macros for portability across compilers. Rename the --enable-timing configure option to be --enable-pmix-timing so it doesn't pickup external timing requests. Remove a stale function reference in PMIx so it can compile with timing enabled. Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-04-10 12:56:38 +06:00
Boris Karasev	36a0e71f2d	ompi/timings: preparing for production state Adds: - enabling/disabling of timings through the environment variable `OMPI_TIMING_ENABLE` - output format: [file name]:[function name]:[description]: avg/min/max - dynamically extending the array of results for the case when the initial size is exhausted - catching and collecting errors - cleanup Note: To use this feature, configure with `--enable-timings` and set env `OMPI_TIMING_ENABLE = 1` Signed-off-by: Boris Karasev <karasev.b@gmail.com>	2017-04-07 21:16:57 +06:00
Artem Polyakov	45898a9c65	opal/timing: add the draft of env-based timings This commit adds a new timing feature that uses environment variables to expose timing information. This allows easy access to this data (if timing is enabled) from any other part of the application for the subsequent postprocessing. In particular this will be integrated with the OMPI-level timing framework that will use MPI_Reduce functionality to provide more compact and easy-to-use information. This commit also adds an example of usage of this framework by annotating the rte_init function. The result is not used anywhere for now. It will be postprocessed in subsequent commits. NOTE: this functionality is currently disabled until it is verified at runtime Signed-off-by: Artem Polyakov <artpol84@gmail.com>	2017-04-07 21:16:22 +06:00
Artem Polyakov	88ed79ea25	opal/timing: remove old framework Signed-off-by: Artem Polyakov <artpol84@gmail.com>	2017-04-07 21:16:22 +06:00
Nathaniel Graham	36d660e07a	Add parsable option to help arguments This commit adds a "parsable" option to the help arguments, which prints out a machine readable list of all the mpirun options. Fixes #3279 Signed-off-by: Nathaniel Graham <ngraham@lanl.gov>	2017-04-05 17:01:43 -06:00
Mark Allen	655a06f559	IB fork The key change was in btl_openib_connect_udcm.c where a buffer was being pinned with size 65664 (whether openib was being used or not). The start of the buffer was page aligned, but because of the size the end wasn't. That makes it too easy for a forked child to accidentally touch pinned memory on the same page as the end of that buffer. So this change increases the size of the allocated buffer to use the rest of the page. I inspected the rest of the ibv_reg_mr() calls and changed one other place to page align its buffer too, although I think the above is the one that really matters. Signed-off-by: Mark Allen <markalle@us.ibm.com>	2017-04-05 17:35:52 -04:00
Nathaniel Graham	19e5d15491	mpirun --help output revamp This commit modifies the output from the mpirun --help command. The options have been split into groups, to make the output smaller and more readable. The groups are: general, debug, output, input, mapping, ranking, binding, devel, compatibility, launch, dvm, and unsupported. There is also a special "full" command that can be used to get the old behaviour of printing out all of the options. Unsupported options may only be seen with this full output. This commit also adds a special case for the help argument. It makes it possible for the user to enter 0 or 1 arguments instead of having to always enter an argument. This defaults to printing out the "general" help options so the user can then see what help arguments there are. Signed-off-by: Nathaniel Graham <ngraham@lanl.gov>	2017-04-04 10:59:32 -06:00
George Bosilca	b0f8d2c460	Never free the statically allocated buffer. Signed-off-by: George Bosilca <bosilca@icl.utk.edu>	2017-03-01 13:21:03 -05:00
Jeff Squyres	fec519a793	hwloc: rename opal/mca/hwloc/hwloc.h -> hwloc-internal.h Per a prior commit, the presence of "hwloc.h" can cause ambiguity when using --with-hwloc=external (i.e., whether to include opal/mca/hwloc/hwloc.h or whether to include the system-installed hwloc.h). This commit: 1. Renames opal/mca/hwloc/hwloc.h to hwloc-internal.h. 2. Adds opal/mca/hwloc/autogen.options to tell autogen.pl to expect to find hwloc-internal.h (instead of hwloc.h) in opal/mca/hwloc. 3. s@opal/mca/hwloc/hwloc.h@opal/mca/hwloc/hwloc-internal.h@g in the rest of the code base. Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2017-02-28 07:48:42 -08:00
Jeff Squyres	45b791542c	Merge pull request #2809 from jjhursey/fix/ibm/opal-verbose opal/output: Make sure verbose gets updated when id 0 gets updated.	2017-01-31 12:18:38 -05:00
Josh Hursey	2e64bf42fb	Merge pull request #2810 from jjhursey/fix/ibm/stdiag-to-stdout Extend options for stddiag routing	2017-01-26 14:29:16 -06:00
Jeff Squyres	2c277a66fd	Merge pull request #2772 from jjhursey/topic/stacktrace-improv master: opal/stacktrace improvements	2017-01-26 10:48:41 -08:00
Joshua Hursey	6d98559be9	stacktrace: Add flexibility in stacktrace output - New MCA option: opal_stacktrace_output - Specifies where the stack trace output stream goes. - Accepts: none, stdout, stderr, file[:filename] - Default filename 'stacktrace' - Filename will be `stacktrace.PID`, or if VPID is available, then the filename will be `stacktrace.VPID.PID` - Update util/stacktrace to allow for different output avenues including files. Previously this was hardcoded to 'stderr'. - Since opal_backtrace_print needs to be signal safe, passing it a FILE object that actually represents a file stream is difficult. This is because we cannot open the file in the signal handler using `fopen` (not safe), but have to use `open` (safe). Additionally, we cannot use `fdopen` to convert the `int fd` to a `FILE fh` since it is also not signal safe. - I did not want to break the backtrace.h API so I introduced a new rule (documented in `backtrace.c`) that if the `FILE file` argument is `NULL` then look for the `opal_stacktrace_output_fileno` variable to tell you which file descriptor to use for output. Signed-off-by: Joshua Hursey <jhursey@us.ibm.com>	2017-01-26 11:55:32 -06:00
Joshua Hursey	f8918e37a9	opal/stacktrace: Raise the signal after processing - This prevents us from accidentally masking a signal that was meant to terminate the application. Signed-off-by: Joshua Hursey <jhursey@us.ibm.com>	2017-01-26 11:55:28 -06:00
Joshua Hursey	dcd9801f7c	orte/iof: Add orte_map_stddiag_to_stdout option * Similar to `orte_map_stddiag_to_stderr` except it redirects `stddiag` to `stdout` instead of `stderr`. * Add protection so that the user cannot supply both: - `orte_map_stddiag_to_stderr` - `orte_map_stddiag_to_stdout` Signed-off-by: Joshua Hursey <jhursey@us.ibm.com>	2017-01-24 16:22:59 -06:00
Joshua Hursey	2596983593	opal/output: Make sure verbose gets updated when id 0 gets updated. - This allows the following MCA option to have an impact on the framework verbose output as well. * `-mca mca_base_verbose stdout` Signed-off-by: Joshua Hursey <jhursey@us.ibm.com>	2017-01-24 16:14:11 -06:00
Gilles Gouaillardet	1a6c17ec7d	opal/util: plug a memory leak by using opal_setenv() instead of putenv() Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2017-01-24 09:12:47 +09:00
Gilles Gouaillardet	dffaad9de2	opal/util: fix a race condition in opal_os_dirpath_create() always check the permissions of the created directory, in case some one else created the very same directory but with incompatible permissions Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2017-01-19 14:02:47 +09:00
Ralph Castain	6da4dbbb33	Quick fix: save the errno from the mkdir call as the call to stat will likely overwrite it Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-01-18 15:42:31 -08:00
Ralph Castain	b257c32d2c	Cleanup the os_dirpath logic so it doesn't error out if the directory actually gets created (regardless of what mkdir returns), and pretty-prints the error if it does error out. Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-01-18 12:05:47 -08:00
Gilles Gouaillardet	a3f21fb2aa	opal_os_dirpath_create: fix TOCTOU as reported by Coverity with CID 70396 (cherry picked from commit `58d1b3f4d0`)	2017-01-18 11:48:30 -08:00
Ralph Castain	9eab9a1ed3	Remove stale global variables Revamp the event notification integration to rely on the PMIx event chaining and remove the duplicate chaining in OPAL. This ensures we get system-level events that target non-default handlers. Restore the hostname entries for MPI-level error messages, but provide an MCA param (orte_hostname_cutoff) to remove them for large clusters where the memory footprint is problematic. Set the default at 1000 nodes in the job (not the allocation). Begin first cut at memory profiler Some minor cleanups of memprobe Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-01-02 14:04:24 -08:00
Gilles Gouaillardet	c9aeccb84e	opal/if: open the if framework once in opal_init_util the if framework is no longer opened in opal_if*, which plugs several memory leaks Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2016-12-01 14:24:30 +09:00
Gilles Gouaillardet	8fd1c3f0df	opal/util: handle a race condition in opal_os_dirpath_destroy A file might have been destroyed by another task between readdir() and stat(), so simply ignore stat() failure. That typically occurs when one task is removing the job_session_dir and another task is still removing its proc_session_dir. Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2016-11-24 10:45:48 +09:00
Gilles Gouaillardet	eaee1332e1	opal/util/ethtool: add missing headers and get Open MPI build on OpenBSD 6.0	2016-09-23 11:22:19 +09:00
Gilles Gouaillardet	e6f7facd7d	opal/util: improve error message in opal_os_dirpath_create()	2016-09-18 17:10:47 +09:00
Gilles Gouaillardet	4b47daeeb0	opal/util: improve return status of opal_os_dirpath_create()	2016-09-18 12:32:42 +09:00
Gilles Gouaillardet	277c319389	opal/util: fix (again and again) incorrect type casting in opal_path_df and silence CID 1371767 this fixes previous commits : - open-mpi/ompi@2eec8970ff - open-mpi/ompi@a439afce5b	2016-08-26 09:42:45 +09:00
Gilles Gouaillardet	2eec8970ff	opal/util: fix (again) incorrect type casting in opal_path_df this fixes previous commit open-mpi/ompi@a439afce5b	2016-08-24 12:50:15 +09:00
Gilles Gouaillardet	a439afce5b	opal/util: fix incorrect type casting in opal_path_df	2016-08-24 10:26:13 +09:00
Ralph Castain	0e58609327	Fix a bug where we were requiring that all paths in $PATH be absolute. Some users provide relative paths in their environment, and we should respect those.	2016-08-12 11:28:57 -07:00
Gilles Gouaillardet	13009aa290	opal/alfg: have opal_random() wrapper always return a positive int	2016-08-09 17:12:30 +09:00
Thananon Patinyasakdikul	b3e9dadff2	libevent: use opal_random() instead of rand(3) This commit changes rand(3) and family in libevent to use the internal random function provided in opal, to avoid perturbing the user's random seed. Fixes open-mpi/ompi#1877	2016-08-03 09:18:12 -07:00
Gilles Gouaillardet	1f651d17c1	opal/util/ethtool: fix (infamous) strncpy usage the infamous strncpy does not NUL-terminate the destination when the buffer is truncated, so do it ourselves. fix CID 1362576	2016-06-09 09:54:50 +09:00
Gilles Gouaillardet	5f565dfec3	configury: clean the flex generated .c files	2016-06-01 11:13:31 +09:00
Ralph Castain	42ecffb6d0	Move the registration of MCA params out of the init of the var system - put them in with the rest of the OPAL MCA param registrations Take another shot at untangling the spaghetti orterun: fix for command line parsing orte-submit calls opal_init_util () before parsing out MCA command line options (-mca, -am, etc). This prevents mpirun from setting opal MCA variables for some frameworks as well as the MCA base. This is because when a framework is opened all of its variables are set to read-only. Eventually we want to lift this restriction on some MCA variables but since -mca is affected we must parse out the MCA command line options before opal_init_util(). This commit fixes the bug by adding a new option to opal_cmd_line_parse (ignore unknown option) so orte-submit can pre-parse the command line for MCA options. Signed-off-by: Nathan Hjelm <hjelmn@me.com> Minor cleanups to avoid releasing/recreating the cmd line	2016-05-20 09:59:50 -07:00
Gilles Gouaillardet	a01a5487a8	opal/util/ethtool: use system ethtool_cmd_speed when available Refs: open-mpi/ompi#1679	2016-05-20 09:05:09 +09:00
Jeff Squyres	87233aae49	ethtool: better handle portability Be sure to handle the case where we don't have ethtool support at all. Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2016-05-19 10:57:14 -07:00
Gilles Gouaillardet	fd93d236b1	opal/util/ethtool: fix compilation on older Linux when struct ethtool_cmd has no speed_hi field Refs: open-mpi/ompi#1628	2016-05-19 11:58:04 +09:00
Karol Mroz	31e33a64f9	opal/util: add function to obtain interface speed If kernel ethtool_cmd_speed() is not available, use copies if possible. Signed-off-by: Karol Mroz <mroz.karol@gmail.com>	2016-05-18 16:25:51 +02:00
Jeff Squyres	265e5b9795	Merge pull request #1552 from kmroz/wip-hostname-len-cleanup-1 ompi/opal/orte/oshmem/test: max hostname length cleanup	2016-05-02 09:44:18 -04:00
Ralph Castain	6ac7929bd0	Extend the schizo framework to allow definition of CLI options by environment. Refactor orterun to mesh with the orted_submit code, thus improving code reuse. Eliminate the orte-submit tool as orterun can now meet that need. Cleanups per @jjhursey review	2016-05-01 11:30:25 -07:00
Karol Mroz	e1c64e6e59	opal: standardize on max hostname length Define OPAL_MAXHOSTNAMELEN to be either: (MAXHOSTNAMELEN + 1) or (limits.h:HOST_NAME_MAX + 1) or (255 + 1) For pmix code, define above using PMIX_MAXHOSTNAMELEN. Fixup opal layer to use the new max. Signed-off-by: Karol Mroz <mroz.karol@gmail.com>	2016-04-24 08:19:47 +02:00
Nathan Hjelm	27f8a4e806	opal: add code patcher framework This commit adds a framework to abstract runtime code patching. Components in the new framework can provide functions for either patching a named function or a function pointer. The latter functionality is not being used but may provide a way to allow memory hooks when dlopen functionality is disabled. This commit adds two different flavors of code patching. The first is provided by the overwrite component. This component overwrites the first several instructions of the target function with code to jump to the provided hook function. The hook is expected to provide the full functionality of the hooked function. The linux patcher component is based on the memory hooks in ucx. It only works on linux and operates by overwriting function pointers in the symbol table. In this case the hook is free to call the original function using the function pointer returned by dlsym. Both components restore the original functions when the patcher framework closes. Changes had to be made to support Power/PowerPC with the Linux dynamic loader patcher. Some of the changes: - Move code necessary for powerpc/power support to the patcher base. The code is needed by both the overwrite and linux components. - Move patch structure down to base and move the patch list to mca_patcher_base_module_t. The structure has been modified to include a function pointer to the function that will unapply the patch. This allows the mixing of multiple different types of patches in the patch_list. - Update linux patching code to keep track of the matching between got entry and original (unpatched) address. This allows us to completely clean up the patch on finalize. All patchers keep track of the changes they made so that they can be reversed when the patcher framework is closed. At this time there are bugs in the Linux dynamic loader patcher so its priority is lower than the overwrite patcher. 
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2016-04-13 17:16:13 -06:00
Nathan Hjelm	4cac623aeb	opal/patch: add call to check if binary patching is supported Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2016-04-13 17:16:12 -06:00
Nathan Hjelm	7aa03d66b3	opal/memory: add support for patch based memory hooks This commit adds support for runtime binary patching. The support is broken down into two parts: util/opal_patcher.[ch] which contains the functionality for runtime patching of symbols, and mca/memory/patcher which patches the various symbols needed to provide support for memory hooks. This work is preliminary and is based off work donated by IBM. The patcher code is disabled if dlopen is disabled. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2016-04-13 17:14:31 -06:00
Ralph Castain	8c14df2328	Revert "Modify singularity support per patch from Greg Kurtzer" This reverts commit open-mpi/ompi@f7257a8310. Ensure that we properly cleanup the session directory tree. Prior code had issues with symlinks, especially if the file that the link points to was already removed as we traverse the tree. Also found that the dirent checks for directory type weren't fully portable, and so fall back to the stat-based approach which is known to be portable. Fix singularity singletons by detecting we are in a container and properly setting the pmix selection to pick the isolated component. Remove a stale restriction blocking use of the sm btl	2016-03-24 11:27:18 -07:00
Nathan Hjelm	607be72de9	opal/keval_parse: fix conditional ordering Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2016-03-08 10:06:14 -07:00
Nathan Hjelm	63bac9a4e0	opal/util: fix bug in key value parser This commit fixes a bug in the opal key value parser that might cause the filename parser to go past the beginning of the string. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2016-03-07 14:51:29 -07:00
Nathan Hjelm	32236736a4	Fix parsing of envvars in MCA files This commit fixes a memory corruption bug when parsing lines of the form: -x FOO=bar The code was making changes to the size of the buffer allocated for key_buffer without making the appropriate changes to key_buffer_len. This was causing subsequent calls to save_param_name to write to invalid memory. This commit makes the following changes: - Fix the above bug by modifying trim_name to move the string within the buffer instead of re-allocating space for the trimmed string. - Cleaned up both trim_name and save_param_name. Both functions took a prefix and suffix to trim. Problem was the prefix was not treated like a prefix. Instead the "prefix" was located inside the string using strstr then the trimmed value started after the substring (even in the middle of the string). To allow trimming both -x and --x (as well as -mca and --mca) trim_name is now called with each prefix. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2016-02-17 14:58:05 -07:00
Ralph Castain	06c3dfc052	Refactor the ORTE DVM code so that external codes can submit multiple jobs using only a single connection to the HNP. * Clean up the DVM so it continues to run even when applications error out and we would ordinarily abort the daemons. * Create a new errmgr component for the DVM to handle the differences. * Cleanup the DVM state component. * Add ORTE bindings directory and brief README * Pass a local tool index around to match jobs. * Pass the jobid on job completion. * Fix initialization logic. * Add framework for python wrapper. * Fix terminate-with-non-zero-exit behavior so it properly terminates only the indicated procs, notifies orte-submit, and orte-dvm continues executing. * Add some missing options to orte-dvm * Fix a bug in -host processing that caused us to ignore the #slots designator. Add a new attribute to indicate "do not expand the DVM" when submitting job spawn requests. * It actually makes no sense that we treat the termination of all children differently than terminating the children of a specific job - it only creates confusion over the difference in behavior. So terminate children the same way regardless. Extend the cmd_line utility to easily allow layering of command line definitions Catch up with ORTE interface change and make build more generic. Disable "fixed dvm" logic for now. Add another cmd_line function to merge a table of cmd line options with another one, reporting as errors any duplicate entries. Use this to allow orterun to reuse the orted_submit code Fix the "fixed_dvm" logic by ensuring we reset num_new_daemons to zero. Also ensure that the nidmap is sent with the first job so the downstream daemons get the node info. Remove a duplicate cmd line entry in orterun. Revise the DVM startup procedure to pass the nidmap only once, at the startup of the DVM. This reduces the overhead on each job launch and ensures that the nidmap doesn't get overwritten. Add new commands to get_orted_comm_cmd_str(). 
Move ORTE command line options to orte_globals.[ch]. Catch up with extra orte_submit_init parameter. Add example code. Add documentation. Bump version. The nidmap and routing data must be updated prior to propagating the xcast or else the xcast will fail. Fix the return code so it is something more expected when an error occurs. Ensure we get an error returned to us when we fail to launch for some reason. In this case, we will always get a launch_cb as we did indeed attempt to spawn it. The error code will be returned in the complete_cb. Fix the return code from orte_submit_job - it was returning the tracker index instead of "success". Take advantage of ORTE's pretty-print capabilities to provide a nice error output explaining why we failed to launch. Ensure we always get a launch_cb when we fail to launch, but no complete_cb as the job never launched. Extend the error reporting capability to job completion as well. Add index parameter to orte_submit_job(). Add orte_job_cancel and implement ORTE_DAEMON_TERMINATE_JOB_CMD. Factor out dvm termination. Parse the terminate option at tool level. Add error string for ORTE_ERR_JOB_CANCELLED. Add some safeguards. Cleanup and/of comments. Enable the return. Properly ORTE_DECLSPEC orte_submit_halt. Add orte_submit_halt and orte_submit_cancel to interface. Use the plm interface to terminate the job	2016-02-13 08:10:44 -08:00
Edgar Gabriel	722aab92e6	- extend opal_path_nfs to retrieve the file system type - use opal_path_nfs in the fs_base function to avoid code duplication.	2016-01-26 13:36:21 -06:00
Ralph Castain	4dad5de8ff	Silence a couple of warnings - strncpy returns a char*, not an int	2016-01-16 09:44:52 -08:00
Gilles Gouaillardet	1d38430e43	opal: replace opal_convert_jobid_to_string with opal_snprintf_jobid	2016-01-14 10:39:03 +09:00
KAWASHIMA Takahiro	2dcb2d711b	Makefile: Move fd.c to `SOURCES` from `headers`. And reorder fd.h and few.h in alphabetical order.	2015-11-04 11:28:43 +09:00
Nathan Hjelm	6d3041335f	opal/keyval: reset buffer pointer/size in finalize Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2015-10-20 13:10:44 -06:00
Gilles Gouaillardet	dc883cff8d	opal/util: fix parse_ipv4_dots prototype	2015-10-01 14:03:08 +09:00
Nathan Hjelm	408da16d50	ompi/proc: add proc hash table for ompi_proc_t objects This commit adds an opal hash table to keep track of mapping between process identifiers and ompi_proc_t's. This hash table is used by the ompi_proc_by_name() function to lookup (in O(1) time) a given process. This can be used by a BTL or other component to get a ompi_proc_t when handling an incoming message from an as yet unknown peer. Additionally, this commit adds a new MCA variable to control the new add_procs behavior: mpi_add_procs_cutoff. If the number of ranks in the process falls below the threshold a ompi_proc_t is created for every process. If the number of ranks is above the threshold then a ompi_proc_t is only created for the local rank. The code needed to generate additional ompi_proc_t's for a communicator is not yet complete. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2015-09-10 08:55:54 -06:00
Ralph Castain	d97bc29102	Remove OPAL_HAVE_HWLOC qualifier and error out if --without-hwloc is given	2015-09-04 16:54:40 -07:00
Ralph Castain	0d5814b5ca	Cleanup Coverity issues	2015-08-29 21:19:27 -07:00
Ralph Castain	cf6137b530	Integrate PMIx 1.0 with OMPI. Bring Slurm PMI-1 component online Bring the s2 component online Little cleanup - let the various PMIx modules set the process name during init, and then just raise it up to the ORTE level. Required as the different PMI environments all pass the jobid in different ways. Bring the OMPI pubsub/pmi component online Get comm_spawn working again Ensure we always provide a cpuset, even if it is NULL pmix/cray: adjust cray pmix component for pmix Make changes so cray pmix can work within the integrated ompi/pmix framework. Bring singletons back online. Implement the comm_spawn operation using pmix - not tested yet Cleanup comm_spawn - procs now starting, error in connect_accept Complete integration	2015-08-29 16:04:10 -07:00
Ralph Castain	023936e84b	Silence coverity warnings	2015-07-29 07:28:08 -07:00
Ralph Castain	8d128fe090	Remove the non-null attributes from the cmd_line parser as this isn't something we can guarantee, and the optimization isn't worth the potential for error	2015-06-25 13:26:20 -07:00
Nathan Hjelm	4d92c9989e	more c99 updates This commit does two things. It removes checks for C99 required headers (stdlib.h, string.h, signal.h, etc). Additionally it removes definitions for required C99 types (intptr_t, int64_t, int32_t, etc). Signed-off-by: Nathan Hjelm <hjelmn@me.com>	2015-06-25 10:14:13 -06:00
Ralph Castain	869041f770	Purge whitespace from the repo	2015-06-23 20:59:57 -07:00
Gilles Gouaillardet	58d1b3f4d0	opal_os_dirpath_create: fix TOCTOU as reported by Coverity with CID 70396	2015-06-17 11:17:54 +09:00
Gilles Gouaillardet	de66447ebb	opal_cmd_line_get_usage_msg: silence warning as reported by Coverity with CID 1269967	2015-06-17 11:17:54 +09:00
Gilles Gouaillardet	f2f66e6e63	opal_daemon_init: silence warning as reported by Coverity with CID 710642	2015-06-17 11:17:53 +09:00
Gilles Gouaillardet	8427e87ee9	opal_argv_delete: silence warning as reported by Coverity with CID 71914	2015-06-17 11:17:53 +09:00
Gilles Gouaillardet	bcdb2d1380	add missing #include sscanf requires stdio.h fixes commit open-mpi/ompi@6ca57724c4	2015-06-08 09:13:11 +09:00
Jeff Squyres	0acec2b676	opal/util/net.c: remove stale comment Also wrap a long "if" statement -- but make no code logic changes.	2015-06-06 10:17:20 -07:00
Jeff Squyres	6ca57724c4	opal/util/net.c: remove superfluous #include	2015-06-06 10:17:20 -07:00
Nathan Hjelm	0e3c32a98a	opal/sys_limits: fix coverity issue CID 996175 Dereference before null check (REVERSE_NULL) If lims is NULL then we ran out of memory. Return an error and remove the NULL check at cleanup. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2015-05-28 08:38:10 -06:00
Nathan Hjelm	f5389cbb03	opal/keyval: fix coverity issues CID 1292738 Dereference after null check (FORWARD_NULL) It is an error if NULL is passed for val in add_to_env_str. Removed the NULL-check @ keyval_parse.c:253 and added a NULL check and an error return. CID 1292737 Logically dead code (DEADCODE) Coverity is correct, the error code at the end of parse_line_new is never reached. This means we fail to report parsing errors when parsing -x and -mca lines in keyval files. I moved the error code into the loop and removed the checks @ keyval_parse.c:314. I also named the parse state enum type and updated parse_line_new to use this type. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2015-05-28 08:38:09 -06:00
Nathan Hjelm	9caffa5dd8	mca/base: fix source file name bug for synonyms This commit fixes synonyms so the source file is correctly printed out by ompi_info. This commit also adds support for printing out the line number where the variable is set. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2015-05-12 09:52:31 -06:00
Gilles Gouaillardet	c809aace47	initialize common symbols from opal A few uninitialized common symbols are remaining: common symbols generated by flex : * opal/util/keyval/keyval_lex.l: opal_util_keyval_yyleng * opal/util/keyval/keyval_lex.o: opal_util_keyval_yytext * opal/util/show_help_lex.l: opal_show_help_yyleng * opal/util/show_help_lex.l: opal_show_help_yytext common symbol generated by "external" hwloc library: * opal/mca/hwloc/hwloc191/hwloc/src/components.o: component_map	2015-05-08 09:48:51 +09:00
Ralph Castain	9cb2fcfa5c	Cleanup the qos code when --enable-timings is given	2015-05-06 20:24:27 -07:00
Nadezhda Kogteva	116169c38a	opal timing: added ability to choose the timer type	2015-04-17 11:15:55 +03:00
Nathan Hjelm	75f210fdb9	opal/util/error: check for existing convertor for error range This commit fixes a bug when opal_error_init is called with the same values multiple times. If opal_error_init is called too many times it will start failing with OPAL_ERR_OUT_OF_RESOURCE. To fix the problem, check for an existing convertor matching the requested one and return that one instead. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2015-04-09 11:51:36 -06:00
Nathan Hjelm	9cd955badf	opal: fix multiple bugs in MCA and opal This commit fixes the following bugs: - opal_output_finalize did not properly set internal state. This caused problems when calling the sequence opal_output_init (), opal_output_finalize (), opal_output_init (). - opal_info support called mca_base_open () but never called the matching mca_base_close (). mca_base_open () and mca_base_close () have been updated to use a open count instead of an open flag to allow mca_base_open to be called through multiple paths (as may be the case when MPI_T is in use). - orte_info support did not register opal variables. This can cause orte-info to not return opal variables. - opal_info, orte_info, and ompi_info support have been updated to use a register count. - When opening the dl framework the reference count was added to ensure the framework stuck around. The framework being closed prematurely was a bug in the MCA base that has since been corrected. The increment (and associated decrement) have been removed. - dl/dlopen did not set the value of mca_dl_dlopen_component.filename_suffixes_mca_storage on each call to register. Instead the value was set in the component structure. This caused the value to be lost when re-loading the component. Fixed by setting the default value in register. - Reset shmem framework state on close to avoid returning a stale component after reloading opal/shmem. - MCA base parameters were not properly deregistered when the MCA base was closed. This commit may fix #374. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2015-04-07 19:13:20 -06:00
Elena	90f5b2bb84	Introduce -tune command line option to set env vars and mca params from file	2015-03-26 18:33:53 +02:00
Gilles Gouaillardet	dc0bc756dc	iof/base: fix misc memory leak as reported by Coverity with CID 1196732	2015-03-10 14:37:53 +09:00
Jeff Squyres	0a2767a5d3	opal lt_interface: remove in favor of opal_dl interface	2015-03-09 08:18:13 -07:00
Gilles Gouaillardet	3511475e29	opal/util: fix misc memory leak as reported by Coverity with CID 996174	2015-02-27 19:19:46 +09:00
Jeff Squyres	9d7171e8f1	convert: remove unnecessary/unused opal_size2int() function The comments in the file even said "This file will hopefully not last long in the tree...".	2015-02-16 07:17:33 -08:00
Gilles Gouaillardet	ccbdf64de4	opal/util: fix memory leak in opal_util_init_sys_limits as reported by Coverity with CID 996174 previous commit (open-mpi/ompi@ca3a275823) did not fix this CID	2015-02-16 11:05:35 +09:00

666 commits