HA Failover Hold Timers
Issue
After a failover in an HA active/passive cluster, the newly active device does not fail over again even if one of its monitored interfaces goes down during the first minute.
Resolution
The one-minute "monitor hold timer" that starts right after a failover is a preset timer that prevents unnecessary failover flaps. If the device detects a traffic link down while this timer is running, it does not fail over again. A link that is down once the timer expires then causes a failover. This timer is not configurable.
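The timer behavior described above can be sketched as a small model. This is an illustration only, not device code: the function name `failure_acted_on_at`, the helper `t`, and the sample times are hypothetical, and the sketch assumes the link stays down until the hold period ends.

```python
from datetime import datetime, timedelta

# Assumption: the hold period is a fixed one minute, as stated in the text.
MONITOR_HOLD = timedelta(minutes=1)


def t(clock: str) -> datetime:
    """Parse an HH:MM:SS clock time (the date part is irrelevant here)."""
    return datetime.strptime(clock, "%H:%M:%S")


def failure_acted_on_at(became_active: datetime, link_down: datetime) -> datetime:
    """Return the earliest time a link-down event can trigger a failover.

    Assumes the link is still down when the monitor hold period ends.
    """
    hold_ends = became_active + MONITOR_HOLD
    # A failure inside the hold window is deferred until the window closes;
    # a failure after the window is acted on immediately.
    return max(link_down, hold_ends)


# Link fails 10 seconds after the device becomes active: deferred to hold end.
deferred = failure_acted_on_at(t("21:53:00"), t("21:53:10"))
# Link fails well after the hold period: acted on at the time of failure.
immediate = failure_acted_on_at(t("21:53:00"), t("21:55:30"))
print(deferred.time(), immediate.time())
```

In this model, a link failure at any point inside the first minute is acted on at the same moment, when the hold period ends.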
In the following scenario, ethernet1/2 was disconnected at 21:53:10, ten seconds after the device became Active at 21:53:00.
The link down was not detected at that point because of the monitor hold timer. At 21:54:00, when the monitor hold timer ended, link monitoring detected the interface down.
- ha_agent.log
Nov 21 21:53:00 HA Group 15: Moved from state Passive to state Active <--- this box became active!!
Nov 21 21:53:00 ha_sysd_dev_state_update(ha_sysd.c:1402): Set dev state to Active
Nov 21 21:53:00 ha_state_start_preemption_hold(ha_state.c:1705): Group 15: no need for preemption waiting
Nov 21 21:53:00 ha_state_start_monitor_hold(ha_state.c:940): Starting monitor hold for group 15; linkmon not monitored <---- monitor hold timer started!!!
<-- around 21:53:10 ethernet1/2 went down, but this was not detected because of the monitor hold timer.
Nov 21 21:54:00 ha_state_monitor_hold_callback(ha_state.c:1539): Group 15: ending monitor hold <--- ending monitor hold timer!!!
Nov 21 21:54:00 Warning: ha_event_log(ha_event.c:47): HA Group 15: Link group 'VW-monitor' link 'ethernet1/2' is down
Nov 21 21:54:00 Warning: ha_event_log(ha_event.c:47): HA Group 15: Link group 'VW-monitor' failure; one or more links are down
<-- Link monitor (VW-monitor) detected link down just after monitor hold timer.
Nov 21 21:54:00 ha_state_transition(ha_state.c:982): Group 15: transition to state Non-Functional
Nov 21 21:54:30 ha_state_start_nonfunc_hold(ha_state.c:2021): Starting NonFunc holdtime for group 15
<--- then "monitor fail hold timer" started!!!
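To confirm the hold window from a saved copy of ha_agent.log, a short script along these lines may help. It is a minimal sketch: the message substrings are taken from the excerpt above, while the function names and `ASSUMED_YEAR` are hypothetical, and the log format may differ between releases.

```python
import re
from datetime import datetime

# Assumption: the log timestamps carry no year, so a fixed one is supplied.
ASSUMED_YEAR = 2000

# Sample lines copied from the excerpt above (annotations removed).
SAMPLE_LOG = """\
Nov 21 21:53:00 ha_state_start_monitor_hold(ha_state.c:940): Starting monitor hold for group 15; linkmon not monitored
Nov 21 21:54:00 ha_state_monitor_hold_callback(ha_state.c:1539): Group 15: ending monitor hold
Nov 21 21:54:00 Warning: ha_event_log(ha_event.c:47): HA Group 15: Link group 'VW-monitor' link 'ethernet1/2' is down
"""

STAMP = re.compile(r"^(\w{3} +\d{1,2} \d{2}:\d{2}:\d{2}) ")


def stamp_of(line):
    """Return the timestamp at the start of a log line, or None if absent."""
    match = STAMP.match(line)
    if match is None:
        return None
    return datetime.strptime(f"{ASSUMED_YEAR} {match.group(1)}", "%Y %b %d %H:%M:%S")


def monitor_hold_window(log_text):
    """Return (start, end) of the last monitor hold period found in the text."""
    start = end = None
    for line in log_text.splitlines():
        when = stamp_of(line)
        if when is None:
            continue  # annotation or blank line without a timestamp
        if "Starting monitor hold" in line:
            start = when
        elif "ending monitor hold" in line:
            end = when
    return start, end


start, end = monitor_hold_window(SAMPLE_LOG)
hold_seconds = (end - start).total_seconds()
print(f"monitor hold lasted {hold_seconds:.0f} seconds")
```

Any link-down event logged with the same timestamp as the "ending monitor hold" line may have occurred earlier, during the hold window.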
A second timer, the NonFunc hold time, is also known as the "monitor fail hold timer".
It is the amount of time a device stays in the non-functional state after it is downgraded from the active state because of a link or path monitor failure.
CLI command:
# set deviceconfig high-availability group xx mode active-passive monitor-fail-hold-down-time
<value> <1-60> Interval in minutes to stay in non-functional state following a link/path monitor failure, default 1
owner: yogihara