
mtl/ofi: Print descriptive error message on modex failure

With MTLs, there's no "other transport" when the remote side
does not have an active NIC, so we should print a useful error
message when the modex failed (indicating lack of a NIC on
the remote side).

Signed-off-by: Brian Barrett <bbarrett@amazon.com>
Brian Barrett 2019-01-18 14:00:10 -08:00
parent 352b667323
commit fe25097194
2 changed files: 14 additions and 3 deletions


@@ -65,3 +65,13 @@ are more threads than the available contexts.
   Local host: %s
   Location: %s:%d
+[modex failed]
+The OFI MTL was not able to find endpoint information for a remote
+endpoint. Most likely, this means that the remote process was unable
+to initialize the Libfabric NIC correctly. This error is not
+recoverable and your application is likely to abort.
+  Local host: %s
+  Remote host: %s
+  Error: %s (%d)


@@ -98,9 +98,10 @@ ompi_mtl_ofi_add_procs(struct mca_mtl_base_module_t *mtl,
                              (void**)&ep_name,
                              &size);
         if (OMPI_SUCCESS != ret) {
-            opal_output_verbose(1, ompi_mtl_base_framework.framework_output,
-                                "%s:%d: modex_recv failed: %d\n",
-                                __FILE__, __LINE__, ret);
+            opal_show_help("help-mtl-ofi.txt", "modex failed",
+                           true, ompi_process_info.nodename,
+                           procs[i]->super.proc_hostname,
+                           opal_strerror(ret), ret);
             goto bail;
         }
         memcpy(&ep_names[i*namelen], ep_name, namelen);
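
For readers without an Open MPI tree at hand, the reporting pattern in this change can be sketched in standalone C. This is only an illustration under stated assumptions: demo_modex_recv, demo_strerror, demo_add_proc, the DEMO_* codes and the shortened message template are all hypothetical stand-ins, not Open MPI or Libfabric APIs. The sketch shows the same control flow as the patched code: when the endpoint lookup fails, format a message naming the local host, the remote host and the error, then bail out.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical error codes; these are not Open MPI's values. */
enum { DEMO_SUCCESS = 0, DEMO_ERR_NOT_FOUND = -13 };

/* Shortened, hypothetical stand-in for the "[modex failed]" help topic. */
static const char *MODEX_FAILED_FMT =
    "The OFI MTL was not able to find endpoint information for a remote\n"
    "endpoint.\n"
    "  Local host: %s\n"
    "  Remote host: %s\n"
    "  Error: %s (%d)\n";

/* Hypothetical strerror-like helper mapping a code to a short string. */
static const char *demo_strerror(int rc)
{
    switch (rc) {
    case DEMO_SUCCESS:       return "Success";
    case DEMO_ERR_NOT_FOUND: return "Not found";
    default:                 return "Unknown error";
    }
}

/* Simulated lookup: succeeds only if the peer published an endpoint. */
static int demo_modex_recv(int peer_has_nic)
{
    return peer_has_nic ? DEMO_SUCCESS : DEMO_ERR_NOT_FOUND;
}

/*
 * Mirrors the control flow of the patched code: on lookup failure,
 * render a descriptive message into msg, print it to stderr, and
 * return the error. On success msg is set to the empty string.
 */
static int demo_add_proc(const char *local, const char *remote,
                         int peer_has_nic, char *msg, size_t len)
{
    assert(msg != NULL && len > 0);

    int ret = demo_modex_recv(peer_has_nic);
    if (DEMO_SUCCESS != ret) {
        snprintf(msg, len, MODEX_FAILED_FMT,
                 local, remote, demo_strerror(ret), ret);
        fputs(msg, stderr);
        return ret;
    }
    msg[0] = '\0';
    return DEMO_SUCCESS;
}
```

Calling demo_add_proc("node-a", "node-b", 0, buf, sizeof(buf)) takes the failure path and leaves the rendered message in buf. Passing 1 as the third argument takes the success path and leaves buf empty.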