Gilles Gouaillardet
7d6b75f3b2
orte_util_snprintf_jobid: return ORTE_SUCCESS or ORTE_ERROR
2016-01-18 09:44:33 +09:00
Ralph Castain
4dad5de8ff
Silence a couple of warnings - strncpy returns a char*, not an int
2016-01-16 09:44:52 -08:00
Gilles Gouaillardet
1d38430e43
opal: replace opal_convert_jobid_to_string with opal_snprintf_jobid
2016-01-14 10:39:03 +09:00
Ralph Castain
64b695669a
Cleanup warnings in opal and orte layers when building optimized on Mac
2015-12-17 07:51:24 -08:00
Jeff Squyres
8bd356549a
orte proc_info.h: use symbolic names
...
This fix was actually applied in the v2.x branch first (as commit
open-mpi/ompi-release@a9b22afc1a ).
2015-11-10 13:39:21 -08:00
Ralph Castain
f1483eb2dc
Need to delay registration of the waitpid callback until after the fork/exec of the child process. Fix the bit testing of process type so that the proper state component gets selected for HNP.
2015-11-06 21:35:24 -08:00
Ralph Castain
68996d6858
Move the argv_free back to the correct place - I blame Jeff for suggesting it was wrong to begin with
2015-11-05 07:57:54 -08:00
Ralph Castain
fe0c995f6b
Fix a couple of minor issues identified by Jeff
2015-11-03 17:30:51 -08:00
Ralph Castain
24419b6523
Fix relative node syntax for dash-host option
2015-10-31 19:00:46 -07:00
Ralph Castain
0140ff048d
Now that we have an "isolated" PLM component, we cannot just let rsh silently decline to run when it cannot find a launch agent - if we do, then we will -always- run on the local node. So if the user specifies a launch agent and we can't find it, then generate a pretty error message, report a fatal error back to the component select, and exit out.
...
This required modifying the mca_component_select function to actually check the return code on a component query - it was blissfully ignoring it.
Also do a little cleanup to avoid bombarding the user with multiple error messages.
Thanks to Patrick Begou for reporting the problem
2015-09-24 07:16:48 -07:00
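The selection change described in this commit can be illustrated with a minimal sketch. It is not the actual mca_component_select code: the status constants, the `FatalComponentError` exception, and the `(status, priority)` query convention are all hypothetical, chosen only to show the difference between a component that declines and one that reports a fatal error.

```python
# Hypothetical status codes; the real framework uses its own constants.
SUCCESS, DECLINED, FATAL = 0, 1, 2


class FatalComponentError(Exception):
    """Raised when a component reports that selection must be aborted."""


def select_component(components):
    """Return the name of the highest-priority usable component.

    `components` is an iterable of (name, query) pairs, where query()
    returns a (status, priority) tuple. This shape is an assumption
    made for illustration.
    """
    best = None
    for name, query in components:
        status, priority = query()
        if status == FATAL:
            # Check the return code instead of ignoring it: a fatal
            # error must not fall through to a lower-priority component.
            raise FatalComponentError(name)
        if status != SUCCESS:
            continue  # the component declined; try the next one
        if best is None or priority > best[1]:
            best = (name, priority)
    return best[0] if best else None
```

If the query result is ignored, a component that cannot find its launch agent looks the same as one that declined, and selection quietly falls back to whatever else is available. Raising on a fatal status makes that case visible to the caller.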
Ralph Castain
749bd4e6fe
Plug a few memory leaks identified by valgrind
2015-09-23 15:21:04 -07:00
Ralph Castain
e6add86e4f
Deal with connect/accept between two jobs from different mpirun's. Somewhat optimize connect/accept by using MPI bcast to distribute the participants instead of another PMIx lookup. Cleanup some Coverity issues.
2015-09-07 09:19:24 -07:00
Ralph Castain
d97bc29102
Remove OPAL_HAVE_HWLOC qualifier and error out if --without-hwloc is given
2015-09-04 16:54:40 -07:00
Ralph Castain
cf6137b530
Integrate PMIx 1.0 with OMPI.
...
Bring Slurm PMI-1 component online
Bring the s2 component online
Little cleanup - let the various PMIx modules set the process name during init, and then just raise it up to the ORTE level. Required as the different PMI environments all pass the jobid in different ways.
Bring the OMPI pubsub/pmi component online
Get comm_spawn working again
Ensure we always provide a cpuset, even if it is NULL
pmix/cray: adjust cray pmix component for pmix
Make changes so cray pmix can work within the integrated
ompi/pmix framework.
Bring singletons back online. Implement the comm_spawn operation using pmix - not tested yet
Cleanup comm_spawn - procs now starting, error in connect_accept
Complete integration
2015-08-29 16:04:10 -07:00
Ralph Castain
bc7815e178
Adjust the process type flags to remove confusion between orted and dvm state machines
2015-08-21 07:50:08 -07:00
Ralph Castain
0b1d4b62be
Cleanup some cruft and update to coordinate with CM operations:
...
* don't pass --tree-spawn to the orted cmd line. If someone doesn't want tree-spawn, it shows up as an MCA param anyway
* ensure state/orted component disqualifies itself from CM operations
* clarify the DVM proc_type definitions
* ensure we stop littering the tmp dir with session directories
2015-08-12 10:32:14 -07:00
Nathan Hjelm
4d92c9989e
more c99 updates
...
This commit does two things. It removes checks for C99 required
headers (stdlib.h, string.h, signal.h, etc). Additionally it removes
definitions for required C99 types (intptr_t, int64_t, int32_t, etc).
Signed-off-by: Nathan Hjelm <hjelmn@me.com>
2015-06-25 10:14:13 -06:00
Ralph Castain
869041f770
Purge whitespace from the repo
2015-06-23 20:59:57 -07:00
Gilles Gouaillardet
ac5921d7da
orte/util: fix misc memory leak
...
as reported by Coverity with CID 1196738-1196739
2015-06-17 11:17:55 +09:00
Gilles Gouaillardet
67638690ea
orte/util: fix a misc memory leak
...
as reported by Coverity with CID 710652
2015-06-17 11:17:54 +09:00
Ralph Castain
ea35e47228
Fat SMPs (i.e., systems with nodes containing large numbers of CPUs) were failing to start due to connection failures of the opal/pmix support. Root cause was that (a) we were setting the client socket to non-blocking before calling connect, and (b) the server was using the event library to harvest the accepts, and also did the handshake while in that event. So the server would back up beyond the connection backlog limit, and we would fail.
...
Changing the client to leave its socket as blocking during the connect doesn't solve the problem by itself - you also have to introduce a sleep delay once the backlog is hit to avoid simply machine-gunning your way thru retries. This gets somewhat difficult to adjust as you don't want to unnecessarily prolong startup time.
We've solved this before by adding a listening thread that simply reaps accepts and shoves them into the event library for subsequent processing. This would resolve the problem, but meant yet another daemon-level thread. So I centralized the listening thread support and let multiple elements register listeners on it. Thus, each daemon now has a single listening thread that reaps accepts from multiple sources - for now, the orte/pmix server and the oob/usock support are using it. I'll add in the oob/tcp component later.
This still didn't fully resolve the SMP problem, especially on coprocessor cards (e.g., KNC). Removing the shared memory dstore support helped further improve the behavior - it looks like there is some kind of memory paging issue there that needs further understanding. Given that the shared memory support was about to be lost when I bring over the PMIx integration (until it is restored in that library), it seemed like a reasonable thing to just remove it at this point.
2015-05-29 14:37:14 -07:00
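The retry behavior mentioned in this commit (a sleep delay once the backlog is hit, rather than retrying immediately) can be sketched as follows. This is an illustration only, not the ORTE code: the function name, the doubling delay, and the default values are assumptions, and the connect operation is passed in as a callable so the sketch stays independent of any socket setup.

```python
import time


def connect_with_backoff(connect, max_retries=5, base_delay=0.01,
                         max_delay=1.0, sleep=time.sleep):
    """Call connect() until it succeeds, sleeping between failed attempts.

    The delay doubles after each failure and is capped at max_delay so
    that startup time is not prolonged more than necessary. All default
    values here are illustrative assumptions.
    """
    delay = base_delay
    for attempt in range(max_retries + 1):
        try:
            return connect()
        except ConnectionError:
            if attempt == max_retries:
                raise  # out of retries; report the failure to the caller
            sleep(delay)  # back off instead of retrying immediately
            delay = min(delay * 2, max_delay)
```

As the commit notes, tuning such a delay is awkward, which is part of why the listening-thread approach was chosen instead; the sketch only shows the retry side of the trade-off.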
Ralph Castain
b5382c9bf9
Rework the OOB selection logic to allow a component (e.g., usock) to direct that it be the sole active component. Remove prior disqualifying code in the oob/tcp component as it was too restrictive - if usock wasn't able to run, it left apps with no way to communicate to their daemon. Have the local daemon check the global modex for the RML URI info of the local procs so it can route messages between them when tcp is the primary channel.
...
A few other minor cleanups included.
2015-05-08 11:15:21 -07:00
Gilles Gouaillardet
2e384a3b65
initialize common symbols from orte
...
A few uninitialized common symbols (generated by flex) remain:
* orte/mca/rmaps/rank_file/rmaps_rank_file_lex.c: orte_rmaps_rank_file_leng
* orte/mca/rmaps/rank_file/rmaps_rank_file_lex.c: orte_rmaps_rank_file_text
* orte/util/hostfile/hostfile_lex.c: orte_util_hostfile_leng
* orte/util/hostfile/hostfile_lex.c: orte_util_hostfile_text
2015-05-08 10:11:58 +09:00
Ralph Castain
1f8de276de
Consolidate all the QOS changes into one clean commit
2015-05-06 19:48:42 -07:00
Ralph Castain
7d1980ba83
Add the ability to specify the number of desired slots in the --host option. Just giving a host name => one slot (multiple copies of the name yield one slot per copy). Giving "foo:3" indicates you want three slots - a shorthand notation for saying "foo" three times. Giving "foo:*" indicates you want the topology to set the number of slots based on the orte_set_slots param.
2015-04-30 20:35:23 -07:00
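The slot-count syntax described in this commit can be illustrated with a small parser sketch. It is not the ORTE implementation: the function name, the `SLOTS_FROM_TOPOLOGY` sentinel, and the rule that a `*` entry overrides explicit counts for the same host are assumptions made for the example. Addresses that themselves contain colons (e.g., IPv6) are not handled.

```python
# Hypothetical sentinel for the "foo:*" form, meaning the slot count is
# left to the topology (governed by the orte_set_slots param).
SLOTS_FROM_TOPOLOGY = -1


def parse_host_list(spec):
    """Parse a comma-separated --host value into {hostname: slots}."""
    slots = {}
    for entry in spec.split(","):
        entry = entry.strip()
        if not entry:
            continue
        name, sep, count = entry.partition(":")
        if not sep:
            requested = 1  # bare name: one slot per occurrence
        elif count == "*":
            requested = SLOTS_FROM_TOPOLOGY
        else:
            requested = int(count)  # "foo:3" is shorthand for foo,foo,foo
            if requested < 1:
                raise ValueError("slot count must be positive: %r" % entry)
        if (requested == SLOTS_FROM_TOPOLOGY
                or slots.get(name) == SLOTS_FROM_TOPOLOGY):
            # Assumption: once a host is marked "*", that marking wins.
            slots[name] = SLOTS_FROM_TOPOLOGY
        else:
            slots[name] = slots.get(name, 0) + requested
    return slots
```

For example, `parse_host_list("foo:3")` and `parse_host_list("foo,foo,foo")` produce the same mapping, which matches the shorthand described in the commit message.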
Ralph Castain
a013f3059f
For scalability reasons, and to make life easier for the poor Cray-ites, don't bang on the system for the username - we'll just use the uid.
2015-03-19 21:24:13 -07:00
Ralph Castain
b01e8c1063
Include the FQDN version and non-stripped version of the hostname in our list of aliases as these (plus localhost) are the most common aliases we see.
2015-03-17 06:26:26 -07:00
Ralph Castain
a0487e014c
Further reduce the RARP load by removing getaddrinfo for IPv6 connections. Correct typo when checking return on inet_pton. Don't consider the TCP component for apps that are launched via mpirun as it will never be used.
2015-03-16 19:42:05 -07:00
Ralph Castain
5ae42c816e
Attempt to reduce the RARP traffic during definition of allocations
2015-03-16 16:26:40 -07:00
Gilles Gouaillardet
89806c6261
orte/util: fix memory leaks
...
as reported by Coverity with CIDs 70845, 71855, 710652,
1196738, 1196739, 1196757, 1196758, 1269863 and 1269883
2015-03-05 14:06:18 +09:00
Ralph Castain
332e4fa7aa
Minor fix - relative host name syntax cannot support usernames as you can't know which hosts will be selected
2015-02-24 12:15:28 -08:00
Ralph Castain
f7c28ea706
Fix bad test - opal_buffer and opal_ptr can support NULL locations
2015-02-17 21:46:23 -08:00
Ralph Castain
c1282d5b99
The opal_buffer type also generates its own alloc, so need to let it pass thru the check
2015-02-17 21:06:19 -08:00
Ralph Castain
624b16e070
Protect the unload attribute function
2015-02-17 14:21:23 -08:00
Ralph Castain
78245e8a33
Continue massaging of the notifier framework. Convert it to an event-driven interface. Add the ability to report job state if requested. Cleanup object declarations.
2015-02-17 12:51:11 -08:00
Gilles Gouaillardet
b762766969
orte/util: fix misc memory leaks
...
as reported by Coverity with CIDs 70314, 710653-710657 and 1196741-1196744
2015-02-17 12:27:23 +09:00
Ralph Castain
46fb850bb0
Continue adding support for options on orte-submit - still need to shift some of the MCA params to job object attributes
2015-02-10 13:56:14 -08:00
Ralph Castain
2b0b012460
Continue refinement of the DVM operations. Send the spawn request to the right place (it helps) as it isn't a comm_spawn request and has to be treated a little differently. Ensure IO gets forwarded back to the tool. Ensure the tool outputs show_help locally as there is no place to send it.
2015-02-04 06:21:54 -08:00
Howard Pritchard
1e94d84ae6
orte/util: minor improvement to show_help
...
Make sure show_help makes a good attempt to
print an error message locally if the
send_buffer_nb method returns an error.
2015-01-23 13:54:03 -08:00
Ralph Castain
9d5135e6cd
Function definition should use the correct type
2014-12-09 01:04:31 -08:00
Ralph Castain
b1bf557024
Fix the hostfile parser so it correctly ignores binding directives that are just integers. Fix the create_dmns function so we don't hang if we can't get an error before creating the job map for an application.
2014-12-05 15:47:09 -08:00
Ralph Castain
c88f181efe
Fix singleton comm-spawn, yet again. The new grpcomm collectives require a complete knowledge of every active proc in the system in case they participate in a collective. So ensure we pass the required job info when we spawn new daemons, and construct the necessary connections to allow grpcomm to operate.
2014-12-03 18:11:17 -08:00
Ralph Castain
3f9d9ae8b6
Provide tighter LSF integration by correctly handling scenarios where the user has asked LSF to assign bindings. Fix a couple of typos in lex parser definitions. Tell hostfile parser to ignore binding designations in hostfiles. Add an attribute to indicate that cpusets were provided as physical cpu ids.
...
Once validated, a version of this will be backported to the v1.8.4 release.
2014-11-30 11:50:31 -08:00
Ralph Castain
f48b9012cb
Some minor cleanup. We really don't need another peer error constant to indicate that a peer closed as we already have one for "connection failed", and that's all we really know. Update the orte constants to track their opal equivalents.
2014-11-25 08:02:29 -08:00
Ralph Castain
48f702827e
First part of memory leak cleanups from Gilles
2014-11-24 16:53:33 -08:00
rhc54
7c0273ecb3
Merge pull request #276 from teng-lin/master
...
Fixed a bug that fails to parse hostname starting with numbers.
2014-11-19 16:39:00 -08:00
Teng Lin
07ff51f43f
Fixed a bug that fails to parse hostname starting with numbers.
...
According to RFC 1123, hostnames that begin with numbers are valid.
2014-11-19 16:03:55 -08:00
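The RFC 1123 rule cited in this commit (a hostname label may begin with a digit) can be illustrated with a short validity check. This is a sketch, not the hostfile lexer change itself: the function name is hypothetical, and the 253-character overall limit is a commonly used bound assumed here for the example.

```python
import re

# One label: letters, digits and hyphens, 1 to 63 characters, beginning
# and ending with a letter or digit. A leading digit is allowed.
_LABEL = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?")


def is_valid_hostname(name):
    """Return True if every dot-separated label of `name` is well formed."""
    if not name or len(name) > 253:
        return False
    if name.endswith("."):
        name = name[:-1]  # tolerate a single trailing dot
    return all(_LABEL.fullmatch(label) for label in name.split("."))
```

Under this check, names such as `3com.example.org` or `10gnode7` are accepted, while labels that begin or end with a hyphen are rejected.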
Ralph Castain
bb91517349
Allow other layers to register their own print-attribute functions so we can maintain pretty-print capabilities as the attributes are extended.
2014-11-19 09:37:59 -08:00
Ralph Castain
37593b232d
Add a marker for the max attr value being used by ORTE so that other, higher-level layers can also use the attribute system
2014-11-19 09:37:59 -08:00
Gilles Gouaillardet
84b21d726e
orte/util: add OPAL_{VPID,JOBID} types to orte_attr_{load,unload}
2014-11-14 15:55:25 +09:00