XenServer host stays in Disabled state after reboot and is later disconnected from the pool

Article ID: CTX489512

Description

A XenServer host stays in the Disabled state after rebooting, and after a while it is disconnected from the pool.

Trying to enable the host from the CLI fails with the error "Pool Master is unreachable":
[root@test-xs ~]# xe host-enable host=test-xs

/var/log/xensource.log contains entries like the following:

  • Mar 16 11:28:48 test-xs xapi: [debug||0 |bringing up management interface D:b81ee3b0d9e3|xapi] Management IP address is: 192.168.172.2
  • Mar 16 11:28:48 test-xs xapi: [error||0 |bringing up management interface D:b81ee3b0d9e3|master_connection] Caught Master_connection.Goto_handler
  • Mar 16 11:28:48 test-xs xapi: [debug||11 ||dummytaskhelper] task dom0 networking update D:0980145a5128 created by task D:d04c84090a7b
  • Mar 16 11:28:48 test-xs xapi: [debug||0 |bringing up management interface D:b81ee3b0d9e3|master_connection] Connection to master died. I will continue to retry indefinitely (supressing future logging of this message).
  • Mar 16 11:28:48 test-xs xapi: [debug||11 |dom0 networking update D:0980145a5128|xapi_mgmt_iface] Checking to see if hostname or management IP has changed
  • Mar 16 11:28:48 test-xs xapi: [error||0 |bringing up management interface D:b81ee3b0d9e3|master_connection] Connection to master died. I will continue to retry indefinitely (supressing future logging of this message).
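To check whether another host shows the same symptom, the log can be searched for the master-connection message from the excerpt above. This is a minimal sketch, assuming the log path given in this article; the function name count_master_conn_errors is hypothetical:

```shell
#!/bin/bash
# Count xapi log lines that report a lost connection to the pool master.
# The search string is taken from the log excerpt above and matched literally.
count_master_conn_errors() {
    local logfile="$1"
    grep -cF 'master_connection] Connection to master died' "$logfile"
}

# Example usage on the affected host:
#   count_master_conn_errors /var/log/xensource.log
```

Because xapi suppresses repeats of this message, a small count does not mean the connection recovered; any non-zero count after a reboot is worth investigating.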
To verify that eth1 works, shut down eth0 and confirm that ping still succeeds and the host stays online:
[root@test-xs ~]# ip link set eth0 down
For an active/passive bond:
[root@test-xs ~]# ovs-appctl bond/show
For an LACP bond:
[root@test-xs ~]# ovs-appctl lacp/show

To verify that eth0 works, bring eth0 back up, shut down eth1, and confirm that ping still succeeds and the host stays online:
[root@test-xs ~]# ip link set eth0 up
[root@test-xs ~]# ip link set eth1 down
For an active/passive bond:
[root@test-xs ~]# ovs-appctl bond/show
For an LACP bond:
[root@test-xs ~]# ovs-appctl lacp/show

Check the size of the xapi database with ls -lh /var/xapi/state.db. In this case the file is over 35 MB.
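The size check can also be scripted. The sketch below is illustrative only: the function name check_db_size is hypothetical, and the 35 MB default is the size observed in this case, not a documented limit.

```shell
#!/bin/bash
# Print "large" if the file is bigger than the given number of MiB,
# otherwise print "ok". The limit defaults to 35 (the size seen in this case).
check_db_size() {
    local path="$1" limit_mb="${2:-35}"
    local bytes
    [ -f "$path" ] || return 1
    # wc -c reports the size in bytes; tr strips any padding some systems add.
    bytes=$(wc -c < "$path" | tr -d ' ')
    if [ "$bytes" -gt $((limit_mb * 1024 * 1024)) ]; then
        echo "large"
    else
        echo "ok"
    fi
}

# Example usage on the host:
#   check_db_size /var/xapi/state.db
```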

A tcpdump capture of the traffic between the slave and master hosts shows many TCP ZeroWindow packets sent by the slave, indicating that the master is sending data faster than the slave can receive it:

17:29:31.300871    0.003255    192.168.172.2    192.168.172.1    TCP    [TCP ZeroWindow] 48742 > https [ACK] Seq=677 Ack=8135441 Win=0 Len=0 TSV=2591394380 TSER=1953369206
17:29:31.301006    0.000135    192.168.172.2    192.168.172.1    TCP    [TCP Window Update] 48742 > https [ACK] Seq=677 Ack=8135441 Win=85760 Len=0 TSV=2591394381 TSER=1953369206
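To gauge how often this happens, the ZeroWindow packets in a text export of the capture can be counted per source address. This is a rough sketch that assumes one packet per line in the column layout shown above (time, delta, source, destination, protocol, info); the function name count_zero_window and the file name capture.txt are hypothetical.

```shell
#!/bin/bash
# Read a text export of a capture on stdin and print how many packets from
# the given source IP are flagged "[TCP ZeroWindow]".
# Assumes the source address is the third whitespace-separated column.
count_zero_window() {
    local src_ip="$1"
    awk -v ip="$src_ip" '
        $3 == ip && index($0, "[TCP ZeroWindow]") > 0 { n++ }
        END { print n + 0 }
    '
}

# Example usage with the slave address from this article:
#   count_zero_window 192.168.172.2 < capture.txt
```

A high count for the slave address supports the diagnosis that the slave repeatedly advertises a zero receive window to the master.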

Resolution

On each host, disable the TSO and GSO offload options on the management interface:

[root@test-xs ~]# ethtool -K eth0 tso off gso off 

Confirm that the change took effect immediately:
[root@test-xs ~]# ethtool -k eth0 |egrep "tcp-segmentation-offload|generic-segmentation-offload"
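When several hosts need checking, the verification can be turned into a pass/fail test. This is a minimal sketch that parses the output of ethtool -k; the function name offloads_disabled is hypothetical.

```shell
#!/bin/bash
# Read `ethtool -k <iface>` output on stdin. Exit status is 0 only if both
# tcp-segmentation-offload and generic-segmentation-offload are reported off.
offloads_disabled() {
    awk -F': *' '
        $1 == "tcp-segmentation-offload"     { tso = $2 }
        $1 == "generic-segmentation-offload" { gso = $2 }
        END { exit !(tso ~ /^off/ && gso ~ /^off/) }
    '
}

# Example usage on the host:
#   ethtool -k eth0 | offloads_disabled && echo "TSO and GSO are off on eth0"
```

Matching on the start of the value means that annotated values such as "off [fixed]" are also treated as off.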

To make the settings persist across reboots, set them on the PIF of the management interface:

[root@test-xs ~]# xe host-list params=name-label,uuid
uuid ( RO)          : 0f0f622a-49bf-4de6-9317-3a1299b3c4e9
    name-label ( RW): test-xs

[root@test-xs ~]# xe pif-list device=eth0 host-uuid=0f0f622a-49bf-4de6-9317-3a1299b3c4e9 params=uuid,device,other-config
uuid ( RO)            : 0fd6ca3a-c38c-54a5-dcb3-15bd94b8d790
          device ( RO): eth0
    other-config (MRW):

[root@test-xs ~]# xe pif-param-set uuid=0fd6ca3a-c38c-54a5-dcb3-15bd94b8d790 other-config:ethtool-gso="off"
[root@test-xs ~]# xe pif-param-set uuid=0fd6ca3a-c38c-54a5-dcb3-15bd94b8d790 other-config:ethtool-tso="off"

Repeat the steps above to apply the same settings to eth1, using the UUID of the eth1 PIF:
[root@test-xs ~]# xe pif-param-set uuid=9ddffa3f-d535-4cc4-b4fc-312e1ef178c2 other-config:ethtool-gso="off"
[root@test-xs ~]# xe pif-param-set uuid=9ddffa3f-d535-4cc4-b4fc-312e1ef178c2 other-config:ethtool-tso="off"
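The per-interface steps above can be combined into a loop. The sketch below is one way to do it and has not been validated on every XenServer release: the function name set_pif_offloads_off is hypothetical, and it relies on the --minimal option of xe to return only the PIF UUID.

```shell
#!/bin/bash
# For each listed device on the given host, look up the PIF UUID and set
# other-config:ethtool-gso and other-config:ethtool-tso to "off".
set_pif_offloads_off() {
    local host_uuid="$1"; shift
    local dev pif_uuid
    for dev in "$@"; do
        # --minimal prints only the value of the requested parameter.
        pif_uuid=$(xe pif-list device="$dev" host-uuid="$host_uuid" params=uuid --minimal)
        if [ -z "$pif_uuid" ]; then
            echo "no PIF found for $dev on host $host_uuid" >&2
            return 1
        fi
        xe pif-param-set uuid="$pif_uuid" other-config:ethtool-gso="off"
        xe pif-param-set uuid="$pif_uuid" other-config:ethtool-tso="off"
    done
}

# Example usage with the host UUID from this article:
#   set_pif_offloads_off 0f0f622a-49bf-4de6-9317-3a1299b3c4e9 eth0 eth1
```

The function stops at the first device without a matching PIF so that a mistyped device name does not go unnoticed.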