XenServer uses GFS2 to make thin provisioning available on block-based storage devices that are accessed through an iSCSI software initiator or a hardware HBA.
This article provides a guide for troubleshooting common issues when a GFS2 SR (thin provisioning) is being used in XenServer or Citrix Hypervisor.
Instructions
Problem scenario 1: All hosts can ping each other, but creating a cluster is not possible.
- The clustering mechanism uses the following ports. Check whether any firewalls or network configurations between the hosts in the pool are blocking these ports, and ensure that they are open (a quick connectivity check is sketched after this list).
- TCP: 8892, 21064
- UDP: 5404, 5405 (not multicast)
- If you have configured HA in the pool, disable HA before enabling clustering.
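As a quick sanity check, the following sketch can be run from dom0 on one host, with <other-host-IP> replaced by the address of another pool member; it probes the TCP ports using bash only (UDP reachability cannot be reliably verified this way), then checks and disables HA with standard xe commands:
# Probe the clustering TCP ports on another pool member
for port in 8892 21064; do
  timeout 3 bash -c "cat < /dev/null > /dev/tcp/<other-host-IP>/$port" && echo "TCP $port reachable" || echo "TCP $port blocked"
done
# Check whether HA is enabled, and disable it before enabling clustering
xe pool-list params=ha-enabled
xe pool-ha-disable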
Problem scenario 2: Cannot add a new host to the clustered pool.
- Make sure the new host has the following ports open:
- TCP: 8892, 21064
- UDP: 5404, 5405 (not multicast)
- Make sure the new host can ping all hosts in the clustered pool.
- Ensure that no host is offline while the new host is trying to join the clustered pool.
- Ensure that the host has an IP address allocated on the NIC that will join the cluster network of the pool (see the sketch after this list).
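A rough check, run from the new host before it joins the pool; the <pool-host-IP-…> placeholders are illustrative and stand for the addresses of the existing pool members:
# Confirm the new host can reach every existing pool member
for ip in <pool-host-IP-1> <pool-host-IP-2> <pool-host-IP-3>; do ping -c 3 $ip; done
# Confirm the NIC that will carry cluster traffic has an IP address configured
xe pif-list params=device,IP,network-name-label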
Problem scenario 3: A host in the clustered pool is offline and cannot be recovered. How do I remove the host from the cluster?
You can forcefully mark a host as dead with the following command:
xe host-declare-dead uuid=<host uuid>
If you need to mark multiple hosts as dead, include all of their <host uuid> values in a single CLI invocation.
The above command permanently removes the host from the cluster and decreases the number of live hosts required for quorum.
Note that once a host is marked as dead, it cannot be added back into the cluster. To add the host back into the cluster, you must perform a fresh installation of XenServer on it.
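If you are unsure which UUID belongs to the unrecoverable host, a minimal lookup sketch using standard xe commands:
# List host UUIDs, names, and enabled state to identify the offline host
xe host-list params=uuid,name-label,enabled
# Permanently remove the unrecoverable host from the cluster
xe host-declare-dead uuid=<host uuid>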
Problem scenario 4: Some members of the clustered pool are not joining the cluster automatically.
- You can use the following command to resync the members of the clustered pool:
xe cluster-pool-resync cluster-uuid=<cluster_uuid>
- You can run xcli diagnostic dbg on the problematic hosts and on other hosts to confirm whether the cluster information is consistent across those hosts.
Items to check in the command output:
- id: node ID
- addr: IP address used to communicate with other hosts
- cluster_token_timeout_ms: cluster token timeout
- config_valid: whether the configuration is valid
Command output example:
xcli diagnostic dbg
{
  is_running: true
  is_quorate: true
  num_times_booted: 1
  token: (Value filtered)
  node_id: 1
  all_members: [
    { id: 3 addr: [IPv4 192.168.180.222] }
    { id: 2 addr: [IPv4 192.168.180.221] }
    { id: 1 addr: [IPv4 192.168.180.220] }
  ]
  is_enabled: true
  saved_cluster_config: {
    cluster_token_coefficient_ms: 1000
    cluster_token_timeout_ms: 20000
    config_version: 1
    authkey: (Value filtered)
    …etc…
    config_valid: true
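Both xe cluster-pool-resync and the recovery steps below take the cluster UUID. Assuming the cluster and cluster-host object classes are exposed by the xe CLI in your release, a lookup sketch:
# List the cluster object to obtain <cluster_uuid>
xe cluster-list
# List the per-host cluster membership records and their state
xe cluster-host-list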
- If the above actions do not help, you can try re-attaching the GFS2 SR by following the steps below (these steps can also be used to recover from a situation where you end up with an invalid cluster configuration; a CLI sketch of the whole sequence follows the steps):
1) Detach the GFS2 SR from XenCenter, or with the xe CLI run xe pbd-unplug uuid=<UUID of PBD> on each host.
2) Disable the clustered pool from XenCenter, or with the xe CLI run xe cluster-pool-destroy cluster-uuid=<cluster_uuid>.
Alternatively, forcefully disable the clustered pool by running xe cluster-host-force-destroy uuid=<cluster_host> on each host.
3) Re-enable the clustered pool from XenCenter, or with the xe CLI run:
xe cluster-pool-create network-uuid=<network_uuid> [cluster-stack=cluster_stack] [token-timeout=token_timeout] [token-timeout-coefficient=token_timeout_coefficient]
4) Re-attach the GFS2 SR from XenCenter, or with the xe CLI run xe pbd-plug uuid=<UUID of PBD> on each host.
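The same sequence expressed purely as xe CLI calls, as a sketch; <sr-uuid> is the UUID of the GFS2 SR, the other placeholders must be substituted, and the unplug/plug commands are repeated for every PBD returned:
# 1) Find and unplug the PBDs of the GFS2 SR on each host
xe pbd-list sr-uuid=<sr-uuid> params=uuid,host-uuid
xe pbd-unplug uuid=<UUID of PBD>
# 2) Destroy the cluster (or use xe cluster-host-force-destroy per host if this fails)
xe cluster-pool-destroy cluster-uuid=<cluster_uuid>
# 3) Re-create the cluster on the chosen network
xe cluster-pool-create network-uuid=<network_uuid>
# 4) Re-attach the SR by plugging the PBDs back in on each host
xe pbd-plug uuid=<UUID of PBD>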
Problem scenario 5: A host in the clustered pool is caught in a self-fencing loop.
In this case, you can start the host by adding the "nocluster" option. To do this, connect to the host's physical or serial console and edit the boot arguments in grub.
Example:
/boot/grub/grub.cfg
menuentry 'XenServer' {
search --label --set root root-oyftuj
multiboot2 /boot/xen.gz dom0_mem=4096M,max:4096M watchdog ucode=scan dom0_max_vcpus=1-16 crashkernel=192M,below=4G console=vga vga=mode-0x0311
module2 /boot/vmlinuz-4.4-xen root=LABEL=root-oyftuj ro nolvm hpet=disable xencons=hvc console=hvc0 console=tty0 quiet vga=785 splash plymouth.ignore-serial-consoles nocluster
module2 /boot/initrd-4.4-xen.img
}
menuentry 'XenServer (Serial)' {
search --label --set root root-oyftuj
multiboot2 /boot/xen.gz com1=115200,8n1 console=com1,vga dom0_mem=4096M,max:4096M watchdog ucode=scan dom0_max_vcpus=1-16 crashkernel=192M,below=4G
module2 /boot/vmlinuz-4.4-xen root=LABEL=root-oyftuj ro nolvm hpet=disable console=tty0 xencons=hvc console=hvc0 nocluster
module2 /boot/initrd-4.4-xen.img
}
Problem scenario 6: The pool master gets restarted in a clustered pool.
For a cluster to have quorum, at least 50% of the hosts (rounding up) must be able to reach each other. For example, in a 5-host pool, 3 hosts are required for quorum. In a 4-host pool, 2 hosts are required for quorum.
In a situation where the pool is evenly split (for example, 2 groups of 2 hosts that can reach each other), the segment containing the host with the lowest node ID stays up while the other half fences. You can find the node IDs of hosts by using the command xcli diagnostic dbg. Note that the pool master might not have the lowest node ID.
Problem scenario 7: After a host is forcibly shut down in the clustered pool, the whole pool has vanished.
If a host is shut down non-forcefully, it is temporarily removed from quorum calculations until it is turned back on. However, if you force shutdown a host or it loses power, it still counts towards quorum calculations. For example, if you had a pool of 3 hosts and forcefully shut down 2 of them, the remaining host would fence because it would no longer have quorum.
Problem scenario 8: All of the hosts within the clustered pool get restarted at the same time.
If the number of contactable hosts in the pool is less than:
- n/2 for an even total number of hosts (n)
- (n+1)/2 for an odd total number of hosts (n)
then all hosts are considered not to have quorum, so all hosts self-fence and you see all hosts restart.
You can check the following to get more information:
- /var/opt/xapi-clusterd/boot-times to see if any boot occurred at an unexpected time.
- crit.log, to check whether any self-fencing messages were output.
- XenCenter notifications around the time you encountered the issue, to see whether self-fencing occurred.
- The dlm_tool status command output; check the fence x at x x entries.
Example of a working case of dlm_tool status output:
dlm_tool status
cluster nodeid 1 quorate 1 ring seq 4 4
daemon now 436 fence_pid 0
node 1 M add 23 rem 0 fail 0 fence 0 at 0 0
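A sketch of the corresponding checks from dom0; the crit.log location is assumed to be /var/log/crit.log and may differ in your release:
# Boot times recorded by the clustering daemon
cat /var/opt/xapi-clusterd/boot-times
# Look for fencing messages (log path is an assumption)
grep -i fenc /var/log/crit.log
# Current DLM view of membership and fencing
dlm_tool status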
When collecting logs for debugging, collect diagnostic information from all hosts in the cluster. In the case where a single host has self-fenced, the other hosts in the cluster are more likely to have useful information.
If the host is connected to XenCenter, from the menu select Tools > Server Status Report. Choose all hosts to collect diagnostics from and click Next. Choose to collect all available diagnostics and click Next. After the diagnostics have been collected, you can save them to your local system.
Or you can connect to the host console and use the xen-bugtool command to collect diagnostics.
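A minimal console sketch, assuming your xen-bugtool version supports the --yestoall flag to accept all prompts non-interactively:
# Collect all available diagnostics; the tool prints the path of the output archive when it finishes
xen-bugtool --yestoall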
Problem scenario 9: Error when changing the cluster settings
You might receive the following error message about an invalid token ("[[\"InternalError\",\"Invalid token\"]]") when updating the configuration of your cluster.
Resolve this issue by completing the following steps:
- (Optional) Back up the current cluster configuration by collecting an SSR with the xapi-clusterd and system log boxes ticked
- Use XenCenter to detach the GFS2 SR from the clustered pool
- From the CLI of any host in the cluster, force destroy the cluster: xe cluster-pool-force-destroy cluster-uuid=<uuid>
- Use XenCenter to re-enable clustering on your pool
- Use XenCenter to reattach the GFS2 SR to the pool