Processes, Logs And Configuration Files Participating in Citrix ADM High Availability


Article ID: CTX261729


Description

The following processes participate in Citrix ADM HA operations:

/usr/local/bin/python /mps/mas_hb_monit.py
This process runs on both nodes. It is responsible for sending and receiving heartbeats and health checks. By default, it uses UDP port 5005.
The configuration file for this process is: /mpsconfig/mas_hb_monit.conf
The log file for this process is: /var/mps/mas_hb_monit.py.log
If the configuration file does not exist or is corrupted, the process will not start.
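The internals of mas_hb_monit.py are not documented here; the following is only a minimal Python sketch of a UDP heartbeat exchange of the kind described above, assuming a JSON status payload like the one visible in the log excerpts later in this article. The peer address and the payload values are illustrative.

import json
import socket
import time

PEER_IP = "192.168.200.212"   # illustrative peer_ip value from the sample configuration
PEER_PORT = 5005              # default UDP port used for ADM HA heartbeats

def send_heartbeat(sock, status):
    # Serialize the node status and send it to the peer node over UDP.
    sock.sendto(json.dumps(status).encode(), (PEER_IP, PEER_PORT))

def receive_heartbeat(sock):
    # Block until a heartbeat arrives and return the sender IP and its status.
    data, sender = sock.recvfrom(4096)
    return sender[0], json.loads(data.decode())

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.bind(("0.0.0.0", PEER_PORT))
# Send one heartbeat per second (heartbeat_frequency: 1 in the configuration).
for _ in range(5):
    send_heartbeat(sock, {"readyForFailover": "T", "uptime": 879278,
                          "H": "T", "P": "T", "role": "T", "V": "T"})
    time.sleep(1)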

 

/mps/pgxl_node_client_handler.py
This process runs on both nodes. It is responsible for various OS and filesystem level operations.
It does not have a dedicated configuration file; however, the files /mpsconfig/multinode.conf and /mpsconfig/cluster.conf must be present.
 
 
PostgreSQL database
Both nodes run a PostgreSQL database; however, the processes differ between the two nodes:

Only the primary node runs multiple processes like:
postgres: mpsroot mpsdb 127.0.0.1(nnnnnn) idle (postgres)
Only the primary node runs the following processes (note that these are examples, so the exact output may differ slightly):
postgres: wal writer process    (postgres)
postgres: wal sender process masrepuser ip_of_standby_ha_node(number) streaming AF/EC1D96C8 (postgres)
Only the secondary node runs the following process:
postgres: wal receiver process   streaming 97/C4BAF798 (postgres)
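This difference in the WAL processes gives a quick way to check which role a node currently has. The snippet below is only an illustrative Python 3 sketch, not an ADM tool: it shells out to ps and classifies the node by the presence of the wal sender or wal receiver process.

import subprocess

def detect_ha_role():
    # List all processes and look for the PostgreSQL WAL streaming processes.
    ps_output = subprocess.run(["ps", "ax"], capture_output=True, text=True).stdout
    if "wal sender process" in ps_output:
        return "primary"      # the primary streams WAL to the standby node
    if "wal receiver process" in ps_output:
        return "secondary"    # the secondary receives the WAL stream
    return "unknown"          # no WAL streaming process found

print(detect_ha_role())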


The following configuration files are necessary for HA to work:

/mpsconfig/mas_hb_monit.conf
Below is a fragment of the configuration file related to HA:
  
"config_parameters": {
        "app_failover_timeout": 180,
        "db_slave_conf_file": "/var/mps/db_pgsql/data/recovery.conf",
        "ha_conf_file": "/mpsconfig/cluster.conf",
        "ha_recovery_time": 300,
        "healthcheck_frequency": 10,
        "heartbeat_frequency": 1,
        "node_failover_timeout": 180,
        "peer_ip": "192.168.200.212",
        "peer_port": 5005,
        "stay_as_secondary": "false",
        "virtual_ip_alias": "192.168.200.213",
        "wal_file_sync_lag": 10,
        "wal_xlog_poll_interval": 300

The most important parameters are:
- peer_ip: the IP address of the second node in the HA pair
- virtual_ip_alias: the virtual (floating) IP address that should be configured for the HA pair
- app_failover_timeout: 180 - used for failover decisions; the HA pair waits to see whether all processes come up by themselves within 3 minutes. For example, if the primary node is rebooted and comes back within 3 minutes, the failover is not triggered by default.
- ha_recovery_time: 300 - after a failover, another failover should not be triggered during this period. Internally, it is checked whether the pair is still within the recovery period when a forced failover is requested.
- healthcheck_frequency: 10 - every 10 seconds the subsystems and postgres are checked
- heartbeat_frequency: 1 - every 1 second a heartbeat message is sent to the other node
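To summarize how these timers interact, the following is only an illustrative Python sketch of the decision logic described above; it is not the actual mas_hb_monit.py code, and the function and variable names are assumptions.

APP_FAILOVER_TIMEOUT = 180   # seconds the processes get to come up by themselves
HA_RECOVERY_TIME = 300       # seconds after a failover during which no new failover runs

def failover_allowed(seconds_primary_unhealthy, seconds_since_last_failover):
    # A failover is only considered when the primary has been unhealthy for
    # longer than app_failover_timeout (giving a rebooted node time to come
    # back) and the pair is no longer inside the post-failover recovery window.
    if seconds_since_last_failover < HA_RECOVERY_TIME:
        return False
    return seconds_primary_unhealthy > APP_FAILOVER_TIMEOUT

# Example: primary became unhealthy 2 minutes ago, last failover was an hour ago.
print(failover_allowed(120, 3600))   # False - still within app_failover_timeout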


/mpsconfig/multinode.conf
This file must be present and must have a size of 0 bytes

/mpsconfig/cluster.conf
This file must be present
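A quick way to verify these prerequisites on a node is a short Python check such as the one below (a hedged sketch; the paths are the ones listed in this article):

import os

# Configuration files that must be present for ADM HA to work.
required = [
    "/mpsconfig/mas_hb_monit.conf",
    "/mpsconfig/multinode.conf",
    "/mpsconfig/cluster.conf",
]

for path in required:
    if not os.path.exists(path):
        print("MISSING:", path)
    elif path.endswith("multinode.conf") and os.path.getsize(path) != 0:
        # multinode.conf is expected to be an empty (0 byte) file.
        print("UNEXPECTED SIZE:", path, "-", os.path.getsize(path), "bytes")
    else:
        print("OK:", path)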


The log file /var/mps/mas_hb_monit.py.log
Example from Primary
2019-06-03 10:20:05:receiveHeartbeat:312: [DEBUG] Received proc health status. Sender: '192.168.200.212', Status: {"readyForFailover": "T", "uptime": 879645, "H": "T", "P": "F", "role": "F", "V": "F"}
2019-06-03 10:20:05:monitorHeartbeat:535: [DEBUG] Last HB received was 0.0603921413422 seconds ago
2019-06-03 10:20:05:sendHeartbeat:179: [DEBUG] Current process status from cache: {u'mas_abdp': True, u'mas_afdecoder': True, u'postgres': True, u'mas_sysop': True, u'mas_config': True, u'mas_event': True, u'mas_perf': True, u'mas_service': True, u'mas_control': True, u'mas_agentmsgrouter': True, u'mas_afanalytics': True, u'mas_inventory': True}
2019-06-03 10:20:05:sendHeartbeat:265: [DEBUG] No information for interface: pflog0
2019-06-03 10:20:05:sendHeartbeat:265: [DEBUG] No information for interface: lo0
2019-06-03 10:20:05:sendHeartbeat:282: [DEBUG] Current status to send: {'readyForFailover': 'T', 'uptime': 879278, 'H': 'T', 'P': 'T', 'role': 'T', 'V': 'T'}


Example from Secondary
2019-05-27 19:47:37:monitorHeartbeat:535: [DEBUG] Last HB received was 0.869044065475 seconds ago
2019-05-27 19:47:37:sendHeartbeat:240: [DEBUG] Current process status from cache: {u'wal receiver process': True, u'postgres': True}
2019-05-27 19:47:37:sendHeartbeat:265: [DEBUG] No information for interface: pflog0
2019-05-27 19:47:37:sendHeartbeat:265: [DEBUG] No information for interface: lo0
2019-05-27 19:47:37:sendHeartbeat:265: [DEBUG] No information for interface: 1/1
2019-05-27 19:47:37:sendHeartbeat:282: [DEBUG] Current status to send: {'readyForFailover': 'T', 'uptime': 863495, 'H': 'T', 'P': 'F', 'role': 'F', 'V': 'F'}
2019-05-27 19:47:37:receiveHeartbeat:312: [DEBUG] Received proc health status. Sender: '192.168.200.211', Status: {"readyForFailover": "T", "uptime": 863128, "H": "T", "P": "T", "role": "T", "V": "T"}


Log Reference:
  • On the primary node, the processes are tracked by mas_hb_monit.py - postgres as well as the ADM services
  • On the secondary node, postgres and the WAL (replication) processes are monitored
  • The following information is exchanged between the nodes with each heartbeat:
    • readyForFailover - T/F; if the value is F, the failover cannot happen
    • uptime - uptime in seconds
    • H - health - T/F
    • P - primary - T/F
    • role - T/F; F for the secondary node; can also be staysecondary, in which case the failover cannot happen (for example, after a failover the next failover cannot be triggered for a while)
    • V - T/F; whether the virtual IP is set
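When troubleshooting, the status dictionaries embedded in the "Received proc health status" log lines can be extracted and interpreted according to the fields above. A minimal Python sketch, based only on the log format visible in the examples in this article:

import json
import re

line = ("2019-06-03 10:20:05:receiveHeartbeat:312: [DEBUG] Received proc health "
        "status. Sender: '192.168.200.212', Status: {\"readyForFailover\": \"T\", "
        "\"uptime\": 879645, \"H\": \"T\", \"P\": \"F\", \"role\": \"F\", \"V\": \"F\"}")

match = re.search(r"Sender: '([^']+)', Status: (\{.*\})", line)
if match:
    sender, status = match.group(1), json.loads(match.group(2))
    print("peer:                ", sender)
    print("ready for failover:  ", status["readyForFailover"] == "T")
    print("peer reports primary:", status["P"] == "T")
    print("virtual IP set:      ", status["V"] == "T")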

Issue/Introduction

This document describes the processes and configuration files related to the Citrix ADM 12.1 high availability mechanism.