Question

I have a cluster whose nodes are connected in a fat-tree InfiniBand (IB) fabric. The switches are QLogic 12300.

The problem is that certain nodes can't talk to each other, even though there are other nodes that can talk to both of the affected nodes.

I used ibtracert to diagnose the problem. The surprising thing is that if I run the command on a separate node that can talk to both of the affected nodes, it works and reports a feasible route.

However, the ibtracert command fails with an error if I run it from either of the two affected nodes.

What is the likely reason for this?

Thanks.


Solution

The two HCAs cannot talk to each other because of how routing in your subnet is configured. The fact that a third machine can talk to both of the "problematic" machines indicates that this is not a problem with the hosts, but with the subnet.
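To illustrate how a third node can reach both hosts while the two hosts cannot reach each other, here is a minimal toy model in Python. It assumes each switch holds a forwarding table keyed by destination, and that the tables on the two leaf switches are simply missing the entry for the other host. The switch names, host names, and the `trace` helper are all hypothetical; this is a sketch of the idea, not a description of your fabric.

```python
# Toy model: each switch has a forwarding table mapping a destination
# to the next hop. All names here are hypothetical.

def trace(lfts, attach, src, dst, max_hops=8):
    """Follow per-switch forwarding entries from src to dst.

    Returns the list of switches traversed, or None if some switch has no
    entry for dst (or the walk exceeds max_hops).
    """
    switch = attach[src]          # leaf switch the source host hangs off
    path = [switch]
    for _ in range(max_hops):
        nxt = lfts[switch].get(dst)
        if nxt is None:           # no route programmed for this destination
            return None
        if nxt == dst:            # delivered to the destination host
            return path
        switch = nxt
        path.append(switch)
    return None


# Hosts A, B, C hang off leaf switches; one spine connects the leaves.
attach = {"A": "leaf1", "B": "leaf2", "C": "leaf3"}
lfts = {
    "leaf1": {"A": "A", "C": "spine"},               # entry for B is missing
    "leaf2": {"B": "B", "C": "spine"},               # entry for A is missing
    "leaf3": {"C": "C", "A": "spine", "B": "spine"},
    "spine": {"A": "leaf1", "B": "leaf2", "C": "leaf3"},
}

for src, dst in [("C", "A"), ("C", "B"), ("A", "B"), ("B", "A")]:
    print(src, "->", dst, trace(lfts, attach, src, dst))
```

In this model C reaches A and B through the spine, while A and B fail at their own leaf switch, so both hosts are healthy and the fault lies in what was programmed into the switches.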

InfiniBand routing is a complicated topic, and from your description alone I can't tell how to fix it.

In general, the Subnet Manager calculates and configures the routing on all switches. Which Subnet Manager are you using? Is it OpenSM running on one of the hosts, or QLogic's SM running embedded on one of the switches?

If it's QLogic's, you need to go to its management UI and change or fix the routing algorithm. If it's OpenSM, you can run it with "minhop" routing (for example "opensm -R minhop"; run "opensm -h" to see usage), which should get the nodes talking again. However, this won't really FIX the underlying problem: you probably have something bad in the subnet topology, and that is where you need to focus if and once minhop routing solves the issue.
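When looking for the bad spot in the topology, it can help to know exactly which pairs of nodes fail rather than just the two you noticed. The following is a rough Python sketch that runs ibtracert for every ordered pair of LIDs and collects the failures. It assumes that ibtracert is on the PATH, that it is invoked as "ibtracert <src-lid> <dst-lid>", and that a non-zero exit status means the trace failed; the function names and the LIDs in the usage note are made up for illustration.

```python
import itertools
import subprocess


def run_ibtracert(src_lid, dst_lid):
    """Run `ibtracert <src-lid> <dst-lid>`; True if it exits with status 0.

    Assumption: a non-zero exit status means the trace failed.
    """
    result = subprocess.run(
        ["ibtracert", str(src_lid), str(dst_lid)],
        capture_output=True,
        text=True,
    )
    return result.returncode == 0


def failing_pairs(lids, tracer=run_ibtracert):
    """Return every ordered (src, dst) LID pair whose trace fails."""
    return [
        (src, dst)
        for src, dst in itertools.permutations(lids, 2)
        if not tracer(src, dst)
    ]
```

For example, failing_pairs([3, 7, 12]) with your real LIDs in place of these placeholders would return the list of pairs to investigate. If the failing pairs all share a leaf or spine switch, that switch or its cabling is a reasonable place to start.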

Licensed under: CC-BY-SA with attribution