The two HCAs cannot talk to each other because of how routing is configured in your subnet. The fact that a third machine can reach both of the "problematic" machines indicates that this is a subnet problem, not a host problem.
InfiniBand routing is a complicated topic, and from your description alone I can't tell how to fix it.
In general, the Subnet Manager calculates and configures routing on all switches. Which Subnet Manager are you using? Is it OpenSM running on one of the hosts, or QLogic's SM running embedded on one of the switches?
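If you are not sure which Subnet Manager is the master, a rough way to check from any host is sketched below. It assumes the infiniband-diags tools (sminfo, smpquery) are installed and that you run it as root; SM_LID and the have helper are just names used in this sketch. The node description of the SM's LID usually shows whether it is a host or a switch.

```sh
# Helper for this sketch: true if a command is available on this host.
have() { command -v "$1" >/dev/null 2>&1; }

# Set SM_LID to the "sm lid" value that sminfo reports, then re-run.
SM_LID="${SM_LID:-}"

if have sminfo; then
    # Prints the master SM's LID, GUID, priority and state.
    sminfo
else
    echo "sminfo not found (is infiniband-diags installed?)" >&2
fi

if [ -n "$SM_LID" ] && have smpquery; then
    # Node description of the node hosting the SM (host name or switch name).
    smpquery nodedesc "$SM_LID"
fi
```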
If it's QLogic, you need to go to their management UI and change/fix the routing algorithm there.
If it's OpenSM, you can run it with "minhop" routing (run "opensm -h" to see usage). This should fix the problem.
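As a sketch, selecting the routing engine on the command line looks roughly like this. It assumes your OpenSM build accepts -R (long form --routing_engine), that you run it as root on the host that should act as SM, and that any already-running opensm instance has been stopped first; confirm the option name with "opensm -h" on your version.

```sh
# Routing engine to request from OpenSM.
ROUTING_ENGINE="minhop"

if command -v opensm >/dev/null 2>&1; then
    # Runs in the foreground; -R selects the routing engine.
    opensm -R "$ROUTING_ENGINE"
else
    echo "opensm not found on this host" >&2
fi
```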
However, this won't really FIX the underlying problem: you probably have something wrong in the subnet topology, and that is where you need to focus if/once minhop routing solves the issue.
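To start looking at the topology, something along these lines may help. It is only a sketch: it assumes the infiniband-diags tools are installed (the names can differ between versions, for example older releases ship some of them with a .pl suffix) and that you run it as root. Look for links that are down, links running at reduced width or speed, and ports with growing error counters.

```sh
# Helper for this sketch: true if a command is available on this host.
have() { command -v "$1" >/dev/null 2>&1; }

# ibnetdiscover: dump the discovered topology
# iblinkinfo:    state, width and speed of every link
# ibqueryerrors: ports whose error counters are non-zero
for tool in ibnetdiscover iblinkinfo ibqueryerrors; do
    if have "$tool"; then
        echo "=== $tool ==="
        "$tool" || echo "$tool exited with an error" >&2
    else
        echo "$tool not found, skipping" >&2
    fi
done
```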