Basically, I am trying to understand how an ESXi host selects a path during a path state change event, i.e. how ESXi weights a path based on all the RTPG (REPORT TARGET PORT GROUPS) responses it receives from the available target ports.
I have an ESXi 6.x host with clustered storage (NetApp cDOT).
The scenario is a takeover event in a 4-node cluster where:
The cluster consists of two pairs: (Node1, Node2) and (Node3, Node4).
Node3 and Node4 are out of quorum (OOQ), meaning they cannot sync with the other nodes in the cluster.
Node1 takes over Node2, i.e. Node2 goes down after transferring LUN ownership to Node1.
TPG IDs (decimal/hex):
Node1 (1000/0x03E8)
Node2 (1001/0x03E9)
Node3 (1010/0x03F2)
Node4 (1011/0x03F3)
For a given LUN on Node2, the initial ALUA states in the RTPG data look as below (AO = Active/Optimized, ANO = Active/Non-Optimized):
RTPG to Node1
RTPG Data:
Node1 (1000/0x03E8) - ANO
Node2 (1001/0x03E9) - AO
Node3 (1010/0x03F2) - ANO
Node4 (1011/0x03F3) - ANO
RTPG to Node2
RTPG Data:
Node1 (1000/0x03E8) - ANO
Node2 (1001/0x03E9) - AO
Node3 (1010/0x03F2) - ANO
Node4 (1011/0x03F3) - ANO
RTPG to Node3
RTPG Data:
Node1 (1000/0x03E8) - ANO
Node2 (1001/0x03E9) - AO
Node3 (1010/0x03F2) - ANO
Node4 (1011/0x03F3) - ANO
RTPG to Node4
RTPG Data:
Node1 (1000/0x03E8) - ANO
Node2 (1001/0x03E9) - AO
Node3 (1010/0x03F2) - ANO
Node4 (1011/0x03F3) - ANO
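To make the baseline explicit, here is a minimal Python sketch that checks whether all RTPG responses agree on the state of every TPG. The data is copied from the initial responses above; the names `INITIAL` and `disagreeing_tpgs` are made up for illustration and have nothing to do with ESXi internals.

```python
# Illustrative only: a consistency check over the RTPG responses listed above.
# Outer key = target port the RTPG was sent to; inner key = TPG ID it reported on.
BASELINE = {1000: "ANO", 1001: "AO", 1010: "ANO", 1011: "ANO"}
INITIAL = {node: dict(BASELINE) for node in ("Node1", "Node2", "Node3", "Node4")}

def disagreeing_tpgs(responses):
    """Return {tpg_id: set_of_states} for TPGs reported with more than one state."""
    seen = {}
    for states in responses.values():
        for tpg, state in states.items():
            seen.setdefault(tpg, set()).add(state)
    return {tpg: found for tpg, found in seen.items() if len(found) > 1}

conflicts = disagreeing_tpgs(INITIAL)
print(conflicts)  # empty dict: every port reports the same view
```

In the initial state there is nothing to reconcile, so any path selection rule would land on the Node2 port as the only AO path.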
Questions:
----------------
1. During the transition stage, after a CHECK CONDITION on an I/O command followed by RTPG responses reporting the new AO and ANO paths as below, why does ESXi continue to route I/O through the same path, which is now marked ANO?
Is it because the last reported RTPG response, from Node4, says the Node2 port (1001/0x03E9) is AO?
The ALUA states in the RTPG data are shown below, in the order in which they were sent and received in the trace I analyzed:
RTPG to Node1
RTPG Data:
Node1 (1000/0x03E8) - AO (Changed from ANO due to takeover)
Node2 (1001/0x03E9) - ANO (Changed from AO due to takeover)
Node3 (1010/0x03F2) - Unavailable
Node4 (1011/0x03F3) - Unavailable
RTPG to Node2
RTPG Data:
Node1 (1000/0x03E8) - AO (Changed from ANO due to takeover)
Node2 (1001/0x03E9) - ANO (Changed from AO due to takeover)
Node3 (1010/0x03F2) - Unavailable
Node4 (1011/0x03F3) - Unavailable
RTPG to Node3
RTPG Data:
Node1 (1000/0x03E8) - Unavailable
Node2 (1001/0x03E9) - AO (No Change in path states because Node3 is out of quorum)
Node3 (1010/0x03F2) - Unavailable
Node4 (1011/0x03F3) - Unavailable
RTPG to Node4
RTPG Data:
Node1 (1000/0x03E8) - Unavailable
Node2 (1001/0x03E9) - AO (No Change in path states because Node4 is out of quorum)
Node3 (1010/0x03F2) - Unavailable
Node4 (1011/0x03F3) - Unavailable
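To make the hypothesis in question 1 concrete, here is a minimal Python sketch of a "last RTPG response wins" merge over the four responses above. This only models my guess; it is not ESXi's documented behaviour, and the names `RTPG_TRANSITION` and `merge_last_wins` are hypothetical.

```python
# Hypothetical model of the "last response wins" guess; NOT ESXi's actual algorithm.
# Outer keys are the target ports the RTPG was sent to, in trace order.
RTPG_TRANSITION = {
    "Node1": {1000: "AO", 1001: "ANO", 1010: "Unavailable", 1011: "Unavailable"},
    "Node2": {1000: "AO", 1001: "ANO", 1010: "Unavailable", 1011: "Unavailable"},
    "Node3": {1000: "Unavailable", 1001: "AO", 1010: "Unavailable", 1011: "Unavailable"},
    "Node4": {1000: "Unavailable", 1001: "AO", 1010: "Unavailable", 1011: "Unavailable"},
}

def merge_last_wins(responses):
    """Apply each response in order; later responses overwrite earlier ones."""
    merged = {}
    for states in responses.values():  # dicts keep insertion order (Python 3.7+)
        merged.update(states)
    return merged

merged = merge_last_wins(RTPG_TRANSITION)
print(merged)
```

Under this assumption the host would end up with TPG 1001 (Node2) as AO and TPG 1000 (Node1) as Unavailable, which would match I/O staying on the Node2 path even though Node1 and Node2 themselves report it as ANO.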
2. After the takeover of Node2 completed (i.e. Node2 completely down), RSCNs for Node2 were received from the switch, followed by RTPGs with the states below reported by the target. Why does ESXi go into an endless loop of path probing/RTPGs? Is it because the last reported RTPG response, from Node4, says the Node2 port (1001/0x03E9) is AO, when it is really down and the host knows this from the RSCN it received?
The ALUA states in the RTPG data are shown below, in the order in which they were sent and received in the trace I analyzed:
No RTPG was sent to Node2, as it was down after the takeover.
RTPG to Node1
RTPG Data:
Node1 (1000/0x03E8) - AO
Node2 (1001/0x03E9) - Unavailable
Node3 (1010/0x03F2) - Unavailable
Node4 (1011/0x03F3) - Unavailable
RTPG to Node3
RTPG Data:
Node1 (1000/0x03E8) - Unavailable
Node2 (1001/0x03E9) - AO (No Change in path states because Node3 is out of quorum)
Node3 (1010/0x03F2) - Unavailable
Node4 (1011/0x03F3) - Unavailable
RTPG to Node4
RTPG Data:
Node1 (1000/0x03E8) - Unavailable
Node2 (1001/0x03E9) - AO (No Change in path states because Node4 is out of quorum)
Node3 (1010/0x03F2) - Unavailable
Node4 (1011/0x03F3) - Unavailable
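The situation in question 2 can be sketched the same way. The Python below compares two assumed ways of interpreting the three responses above: taking the last response as the truth, versus trusting only what each port reports about its own TPG. Both rules are my assumptions for illustration, not ESXi's confirmed logic, and all names (`TPG_OF`, `LIVE_PORTS`, `usable_ao_ports`) are hypothetical.

```python
# Hypothetical model of question 2; NOT ESXi's actual algorithm.
TPG_OF = {"Node1": 1000, "Node2": 1001, "Node3": 1010, "Node4": 1011}
LIVE_PORTS = ["Node1", "Node3", "Node4"]  # Node2 port is gone after the RSCN

STALE = {1000: "Unavailable", 1001: "AO", 1010: "Unavailable", 1011: "Unavailable"}
RTPG_AFTER_TAKEOVER = {
    "Node1": {1000: "AO", 1001: "Unavailable", 1010: "Unavailable", 1011: "Unavailable"},
    "Node3": dict(STALE),  # out of quorum, still reports Node2 as AO
    "Node4": dict(STALE),  # out of quorum, still reports Node2 as AO
}

def usable_ao_ports(view, live_ports):
    """Live ports whose own TPG is AO according to the given view."""
    return [port for port in live_ports if view.get(TPG_OF[port]) == "AO"]

# Rule A (assumed): the last response received overwrites everything before it.
last_view = {}
for port in LIVE_PORTS:
    last_view.update(RTPG_AFTER_TAKEOVER[port])

# Rule B (assumed): believe each port only about its own TPG.
self_view = {TPG_OF[p]: RTPG_AFTER_TAKEOVER[p][TPG_OF[p]] for p in LIVE_PORTS}

ao_last = usable_ao_ports(last_view, LIVE_PORTS)
ao_self = usable_ao_ports(self_view, LIVE_PORTS)
print(ao_last, ao_self)
```

With rule A the only AO TPG is 1001, which has no live port, so no usable AO path is ever found; that would be consistent with continuous re-probing. With rule B the Node1 port would be picked as AO.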
3. And why does ESXi always probe, or send RTPGs to, the paths in only the order below?
Node1 (1000/0x03E8)
Node2 (1001/0x03E9)
Node3 (1010/0x03F2)
Node4 (1011/0x03F3)