Gilles Gouaillardet
189da7fdab
pmix2x: plug a memory leak in _event_hdlr()
...
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2017-01-24 09:13:30 +09:00
Ralph Castain
e25e69dc2f
Resolve Coverity issues
...
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-01-07 10:45:52 -08:00
Ralph Castain
6509f60929
Complete the memprobe support. This provides a new scaling tool called "mpi_memprobe" that samples the memory footprint of the local daemon and the client procs, and then reports the results. The output contains the footprint of the daemon on each node, plus the average footprint of the client procs on that node.
...
Samples are taken after MPI_Init, and then again after MPI_Barrier. This allows the user to see memory consumption caused by add_procs, as well as any modex contribution from forming connections if pmix_base_async_modex is given.
Using the probe simply involves executing it via mpirun, with however many copies you want per node. Example:
$ mpirun -npernode 2 ./mpi_memprobe
Sampling memory usage after MPI_Init
Data for node rhc001
Daemon: 12.483398
Client: 6.514648
Data for node rhc002
Daemon: 11.865234
Client: 4.643555
Sampling memory usage after MPI_Barrier
Data for node rhc001
Daemon: 12.520508
Client: 6.576660
Data for node rhc002
Daemon: 11.879883
Client: 4.703125
Note that the client value on node rhc001 is larger - this is where rank=0 is housed, and apparently it gets a larger footprint for some reason.
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-01-05 10:32:17 -08:00
Ralph Castain
9eab9a1ed3
Remove stale global variables
...
Revamp the event notification integration to rely on the PMIx event chaining and remove the duplicate chaining in OPAL. This ensures we get system-level events that target non-default handlers.
Restore the hostname entries for MPI-level error messages, but provide an MCA param (orte_hostname_cutoff) to remove them for large clusters where the memory footprint is problematic. Set the default at 1000 nodes in the job (not the allocation).
Begin first cut at memory profiler
Some minor cleanups of memprobe
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-01-02 14:04:24 -08:00
Gilles Gouaillardet
22db1d36b6
pmix2x: silence misc warnings
...
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2016-12-26 13:35:17 +09:00
Ralph Castain
256b5adac5
Transfer across final fixes from debugger attach work
...
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2016-12-19 00:34:27 -08:00
Ralph Castain
f91f8ce494
Protect against NULL param
...
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2016-12-03 21:19:32 -08:00
Ralph Castain
f633d5c1b1
Initialize var
...
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2016-12-03 21:08:45 -08:00
Ralph Castain
a43fae74a5
Avoid hanging in show_help if PMIx was unable to initialize
...
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2016-12-03 10:12:58 -08:00
Ralph Castain
1a0bccb536
Now that PMIx has settled on its release strategy and numbering, update the OPAL pmix framework to track
...
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2016-12-02 15:44:43 -08:00
Ralph Castain
af67f16422
Update configury to support multiple PMIx versions, rename pmix2x component to pmix3x for support of PMIx master
...
Update support for external v1.1.x and v2.x libraries. Minor corrections to the v3.x component
2016-08-25 18:19:05 -07:00
Ralph Castain
61ffba668b
Roll in the latest PMIx version - includes shared memory datastore and reduced memory footprint
2016-08-20 07:53:06 -07:00
Ralph Castain
48d35a9627
Ensure we properly convert pmix status to ORTE state before activating an error state upon notification. Cleanup some conversion issues on notification info. Add a new orte_notify.c test program
2016-08-12 21:14:29 -07:00
Ralph Castain
b0cc9b0bc8
Update to latest PMIx toolext branch
...
Fix indentations
Update the ext20 component to match latest PMIx master.
Cleanup name conflicts and uninit vars
2016-08-11 12:29:48 -07:00
Ralph Castain
527b5c692a
Update to include extended tool support, new datatypes
2016-08-08 13:39:46 -07:00
Ralph Castain
01a653d50a
Remove a debug print in comm_cid.c. Update PMIx2 to include the revised PMIx_Get logic for higher performance by reducing the number of hash table lookups. Fix a bug where requests for data from a proc in another nspace could hang, or result in "not found".
...
Remove stale file reference
Restore autogen pass thru pmix
Remove generated file
2016-07-20 00:58:19 -07:00
Jeff Squyres
72f41d4490
pmix: replace all tabs with spaces
...
No code or logic changes
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2016-07-17 15:08:33 -04:00
Ralph Castain
99f7096031
Fix permissions
2016-07-16 21:03:55 -07:00
Ralph Castain
d4071fbd1c
Fix dynamic operations by ensuring that we only fire the debugger release if the debugger is attached, and that the OPAL pmix key for directing events to non-default handlers matches the PMIx spelling
2016-07-16 13:20:41 -07:00
Ralph Castain
20a91c2baf
Add a new --continuous flag to mpirun that directs ORTE to let a job continue running as app procs terminate. Don't attempt to restart them. Add event notification of abnormally terminating procs, and demonstrate that in the mpi_spin test program.
...
Cleanup debug message
2016-07-13 15:28:33 -07:00
Ralph Castain
ee56d9dc1a
Shorten the session directory name as some OS's are now providing unusually long temp directory names, causing us to overflow the sockaddr field
2016-07-05 14:59:50 -07:00
Ralph Castain
6e434d6785
Add support for PMIx tool connections and queries. Initially only support a request to list all known namespaces (jobids) from ORTE, but other folks will extend that support to include additional information
...
Update to match PMIx RFC
Fix configury to point to correct libevent and hwloc locations
2016-06-29 19:19:19 -07:00
Ralph Castain
dde69e1be2
Cleanup CIDs 1362763, 1362762, 1362760, 1362759, 1362758, 1362757, 1362756, 1362755, 1362754. Unsure how to resolve 1362761.
...
Fixes #1792
2016-06-18 12:28:46 -07:00
Ralph Castain
5d330d5220
Enable the PMIx event notification capability and use that for all error notifications, including debugger release. This capability requires use of PMIx 2.0 or above as the features are not available with earlier PMIx releases. When OMPI master is built against an earlier external version, it will fallback to the prior behavior - i.e., debugger will be released via RML and all notifications will go strictly to the default error handler.
...
Add PMIx 2.0
Remove PMIx 1.1.4
Cleanup copying of component
Add missing file
Touchup a typo in the Makefile.am
Update the pmix ext114 component
Minor cleanups and resync to master
Update to latest PMIx 2.x
Update to the PMIx event notification branch latest changes
2016-06-14 13:08:41 -07:00