This article describes how to distinguish High Availability (HA) storage heartbeat fencing from real host crashes.
Most HA fencing events occur as regular reboots and leave clear, easily found log statements identifying the event as a fence. One particular type of fence, however, is “performed” by way of a forced hypervisor crash: the storage heartbeat timeout fence.
These fences do not leave log statements specifying that they were fence events, and they cause the hosts to come back up with the red “!”, indicating that the server crashed. Because of this, such fence events tend to cause undue worry and are often reported to Citrix Technical Support as “host crashes” requiring root cause analysis.
This article shows you how to easily identify if a “crash” was actually a storage heartbeat fence event.
Complete the following procedure:
Ensure that the CPU stack trace entries in /var/crash/crash.log look similar to the following. The addresses (the hexadecimal values in the first column) do not have to match, but the sequence of function calls on the stack should be similar (example commands for confirming this from the command line follow the excerpts below):
PCPU0:Call Trace:
[ffff82c48010fb30] elf_core_save_regs+0/0xb0
[ffff82c480110c57] kexec_crash+0x57/0x70
[ffff82c480124103] dom0_shutdown+0x73/0xa0
[ffff82c480103bb6] domain_shutdown+0xb6/0xd0
[ffff82c480120150] domain_watchdog_timeout+0/0x40
[ffff82c4801263fc] execute_timer+0x4c/0x70
[ffff82c4801268a5] timer_softirq_action+0x85/0x220
[ffff82c480124215] __do_softirq+0x65/0x90
[ffff82c480154905] idle_loop+0x25/0x50
xen.log shows:
(XEN) Watchdog timer fired for domain 0
(XEN) Domain 0 shutdown: watchdog rebooting machine.
(XEN) Executing crash image
[ffff828c8010e310] dump_domains+0x4d0
[ffff828c80175c1c] crash_nmi_callback+0x2c
[ffff828c8015f2e9] do_nmi+0x39
[ffff828c801d6877] handle_ist_exception+0x52
[ffff828c80178852] acpi_safe_halt+0x2
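To confirm this quickly from the command line, commands along the following lines can be used. This is only a sketch: the location of xen.log is an assumption and can vary between XenServer versions, so adjust the path to wherever xen.log is kept on your host.

# Look for the watchdog-driven shutdown functions in the crash dump log:
grep -E "dom0_shutdown|domain_watchdog_timeout" /var/crash/crash.log
# Look for the hypervisor watchdog messages (adjust the xen.log path as needed):
grep -E "Watchdog timer fired for domain 0|watchdog rebooting machine" /var/log/xen.log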
Check the timestamp of the /var/crash-* directory to determine when the crash occurred.
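For example (a sketch; the crash directory naming may differ between XenServer versions):

ls -ld /var/crash*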
Examine /var/log/xha.log and find the entries immediately preceding the crash timestamp.
Search for a series of log entries similar to the following (a grep sketch for pulling these entries out of xha.log appears after the notes below):
Jan 18 09:29:06 EST 2011 [warn] SC: (script_service_do_query_liveset) reporting "State-File approaching timeout". host[3].time_since_last_update_on_sf=35037.
The number following “time_since_last_update_on_sf” ticks upward toward the actual HA timeout. The value is in milliseconds; in the example above, 35037 corresponds to roughly 35 seconds. The timeout itself defaults to 45 seconds on some older versions of XenServer and to 75 seconds in XenServer 5.5 Update 2 and later, and many HA installations that use multipathing have this value tuned up to 120 seconds. Sometimes, but not always, the very last few lines will look similar to the following:
Feb 02 13:26:11 EST 2011 [debug] WD: now - settime = 61729 timeout = 75000.
Feb 02 13:26:11 EST 2011 [info] WD: watchdog has not been updated at least for half timeout id=1 label=statefile.
Feb 02 13:26:18 EST 2011 [debug] WD: now - settime = 68740 timeout = 75000.
Feb 02 13:26:18 EST 2011 [notice] Watchdog is expiring soon id=1 label=statefile.
Feb 02 13:32:11 EST 2011 [notice] HA daemon started - built at Nov 30 10:17:06 EST 2010 - be2db7784f42
The last few log statements are not always captured: because this type of fence is performed by an abrupt hypervisor crash, the dom0 kernel does not get a chance to flush its write buffers to disk, so the last bit of logging might still be in the buffers at the time of the event and is never written to disk.
The “Watchdog is expiring soon” entry is the very last line before the fence; the following “HA daemon started” line is actually the first log statement written after the server comes back up.
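To pull these entries out of xha.log quickly, a command along these lines can be used (a sketch; the exact message strings can vary slightly between XenServer versions, so adjust the patterns as needed):

grep -E "approaching timeout|watchdog has not been updated|Watchdog is expiring soon" /var/log/xha.log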
Storage heartbeat loss of this type is almost always caused by SAN problems. It is generally a good idea to examine xha.log after the fence; you might find earlier or later indications of temporary loss of the storage heartbeat, which point to performance issues such as a SAN too busy to keep up with the heartbeat, or to other SAN problems causing intermittent glitches.
Before troubleshooting the SAN, however, check /var/log/kern.log for errors from the drivers related to the SAN (the NICs assigned to the storage network in the case of software iSCSI, or the HBAs in the case of FibreChannel, SAS, or iSCSI HBAs). Driver stack traces or other errors and warnings could indicate that the issue is on the XenServer side rather than on the SAN.
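A coarse first pass over kern.log can be done with something like the following (a sketch; the keywords are only examples, and older entries may have been rotated into /var/log/kern.log.1 and similar files):

grep -iE "error|warn|fail|timeout|abort" /var/log/kern.log

Focus on entries logged in the minutes leading up to the crash timestamp identified earlier.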