Intermittent HA2 link flaps on HSCI when using Ethernet transport
Symptom
After a software upgrade, the HA2 link on HSCI starts flapping.
HA1 remains stable and overall device resource usage looks normal, but HA2 state sync is repeatedly lost and re-established. This results in periodic HA2 keepalive down events and a temporary HA2-unavailable status.
You may see some or all of the following:
System log examples
Repeated HA2 keepalive failures while the physical HSCI link remains up, for example:
    critical ha ha2-keep-alive 0 HA Group 1: All HA2 keep-alives are down
    critical ha ha2-keep-alive 0 HA Group 1: Local HA2 keep-alive down
    critical ha ha2-keep-alive 0 HA Group 1: Peer HA2 keep-alive down
    high ha session-synch 0 HA Group 1: Ignoring session synchronization due to HA2-unavailable
These are followed by recovery messages such as:
    info ha ha2-keep-alive 0 HA Group 1: Local HA2 keep-alive up
    info ha ha2-keep-alive 0 HA Group 1: Peer HA2 keep-alive up
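To gauge how often the link is flapping, the down and recovery messages can be counted from an exported system log. The following is a minimal sketch, assuming the log has been saved as plain text with one message per line. The function name `count_ha2_flaps` is hypothetical, and the matched strings are taken from the examples in this article, so adjust them if your release words the messages differently.

```python
import re

# Message fragments copied from the system log examples in this article.
ALL_DOWN = "All HA2 keep-alives are down"
UP_PATTERN = re.compile(r"(Local|Peer) HA2 keep-alive up")


def count_ha2_flaps(lines):
    """Count down/up cycles: an 'all down' message followed later by an 'up' message."""
    flaps = 0
    is_down = False
    for line in lines:
        if ALL_DOWN in line:
            is_down = True
        elif is_down and UP_PATTERN.search(line):
            # The first recovery message after a down event closes one flap.
            flaps += 1
            is_down = False
    return flaps


sample = [
    "critical ha ha2-keep-alive 0 HA Group 1: All HA2 keep-alives are down",
    "critical ha ha2-keep-alive 0 HA Group 1: Local HA2 keep-alive down",
    "critical ha ha2-keep-alive 0 HA Group 1: Peer HA2 keep-alive down",
    "high ha session-synch 0 HA Group 1: Ignoring session synchronization due to HA2-unavailable",
    "info ha ha2-keep-alive 0 HA Group 1: Local HA2 keep-alive up",
    "info ha ha2-keep-alive 0 HA Group 1: Peer HA2 keep-alive up",
]
print(count_ha2_flaps(sample))
```

Running the same count before and after the transport change gives a simple way to compare flap frequency over equal time windows.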
HA agent log examples
HA agent logs show HA2 keepalive and status transitions, for example:
    Warning: ha_event_log(src/ha_event.c:59): HA Group 1: Peer HA2 keep-alive down
    Warning: ha_event_log(src/ha_event.c:59): HA Group 1: All HA2 keep-alives are down
    Warning: ha_event_log(src/ha_event.c:59): HA Group 1: Ignoring session synchronization due to HA2-unavailable
You may also see HA2 monitor failures such as:
    ha2 monitoring fails: rcv_bitmap fffffffffffc0000, probe_seq 4
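The `rcv_bitmap` value is easier to compare across log entries once it is broken down into bit counts. The sketch below only does the arithmetic. This article does not document what a set or clear bit means, so the 64-bit width and any reading of the bits as individual probe slots are assumptions, and `describe_bitmap` is a hypothetical helper name.

```python
def describe_bitmap(hex_value, width=64):
    """Return counts of set, clear, and trailing zero bits for a hex bitmap.

    The width of 64 bits is an assumption based on the 16 hex digits
    shown in the log example.
    """
    value = int(hex_value, 16)
    set_bits = bin(value).count("1")
    # Isolate the lowest set bit to find how many zero bits sit below it.
    trailing_zeros = (value & -value).bit_length() - 1 if value else width
    return {
        "set": set_bits,
        "clear": width - set_bits,
        "trailing_zeros": trailing_zeros,
    }


print(describe_bitmap("fffffffffffc0000"))
```

For the value in the example, this reports 46 set bits and 18 clear bits, all of the clear bits being contiguous at the low end.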
Counters and dataplane indicators
- HA2 monitor sent and received counters do not match during the impact window
- Per-core HA2 monitor counters show most HA2 monitor messages handled by a small number of cores
- NIC statistics for HSCI show RX missed counts increasing over time
- Dataplane CPU shows one or a few cores at higher load, while overall dataplane CPU is not saturated
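The per-core concentration described above can be expressed as a single number: the share of HA2 monitor messages handled by the busiest few cores. This is a rough sketch, assuming you have already copied the per-core counts into a dictionary by hand. The function name `top_core_share` and the sample values are hypothetical and are not taken from any device output.

```python
def top_core_share(per_core_counts, top_n=2):
    """Fraction of all HA2 monitor messages handled by the top_n busiest cores."""
    total = sum(per_core_counts.values())
    if total == 0:
        return 0.0
    busiest = sorted(per_core_counts.values(), reverse=True)[:top_n]
    return sum(busiest) / total


# Hypothetical counts keyed by core number, for illustration only.
counts = {1: 4800, 2: 4700, 3: 120, 4: 130, 5: 125, 6: 125}
print(f"Top 2 cores handle {top_core_share(counts):.0%} of HA2 monitor messages")
```

A share close to 1.0 for a small `top_n` matches the pattern in this symptom list, while an even distribution across six cores would give roughly 0.33 for the top two.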
Environment
- Active/passive HA deployment
- PA-3400 Series firewalls
- A PAN-OS release on which the HA2 configuration already offers UDP as a selectable transport option for HA2 on HSCI
- HA2 configured on the HSCI interface
- HA2 transport type currently set to Ethernet
- No HA2 backup link in use
Cause
- With HA2 transport set to Ethernet on HSCI, HA2 packets can be scheduled to a limited set of receive cores.
- During periods of heavy HA2 activity, these cores handle both HA2 update traffic and HA2 monitor messages. The extra load can delay processing of HA2 keepalive packets.
- The HA process then detects missed HA2 keepalives and marks HA2 as unavailable even though the physical HSCI interface stays up. Once packets are processed, HA2 recovers and the logs show periodic HA2 down and up events.
- Using UDP transport for HA2 changes how the packets are handled and lets them be distributed more evenly across cores, which removes the delay that causes the flaps.
Resolution
Note: The procedure in this section applies only if the HA2 configuration on your device already shows UDP as a selectable transport option for HA2 on HSCI. If the HA2 transport field does not include UDP, your current PAN-OS release does not support it on this platform. In that case, upgrade to a PAN-OS version that adds HA2 UDP transport support before using this workaround. Refer to the release notes for your target PAN-OS version for feature availability.
If UDP is available for HA2 on HSCI, change HA2 transport from Ethernet to UDP on both peers.
Steps
- Open Device.
- Open High Availability.
- Edit the HA2 configuration that uses HSCI.
- Change the transport type from Ethernet to UDP.
- Make the same change on the peer firewall, then commit on both peers.
Planning
- If you have an HA2 backup link, keep it active during the change so that HA2 state sync is preserved.
- If you do not have a backup link, perform the change in a short maintenance window because HA2 state sync will restart when HA2 is reconfigured.
Verification
After switching HA2 to UDP on both peers:
- Confirm in the HA dashboard that both peers are in the expected HA state and are synchronized.
- Review system logs for at least the length of the previous impact window and confirm that the "HA2 keep-alive down" and "All HA2 keep-alives are down" messages no longer recur.
- Review HA agent logs and confirm that HA2 no longer transitions to HA2-unavailable during normal load.
- Optionally, monitor HA2 monitor counters and NIC statistics and verify that:
  - HA2 monitor sent and received counters stay aligned
  - RX missed counters on the interface carrying HA2 traffic are not increasing in correlation with HA events
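The optional counter checks can be made repeatable by comparing successive samples rather than reading raw totals. The following is a minimal sketch, assuming you record the cumulative HA2 monitor sent and received counters and the RX missed counter at regular intervals. The dictionary keys, the `check_samples` name, and the sample values are all hypothetical.

```python
def check_samples(samples, tolerance=0):
    """Compare consecutive cumulative counter samples taken in time order.

    Each sample is a dict with 'sent', 'received', and 'rx_missed' keys.
    Returns (healthy, findings), where findings has one entry per interval.
    """
    findings = []
    for prev, cur in zip(samples, samples[1:]):
        sent_delta = cur["sent"] - prev["sent"]
        recv_delta = cur["received"] - prev["received"]
        findings.append({
            # Gap between monitor messages sent and received in this interval.
            "monitor_gap": sent_delta - recv_delta,
            # Growth of the NIC RX missed counter in this interval.
            "rx_missed_delta": cur["rx_missed"] - prev["rx_missed"],
        })
    healthy = all(
        abs(f["monitor_gap"]) <= tolerance and f["rx_missed_delta"] == 0
        for f in findings
    )
    return healthy, findings


# Hypothetical samples taken after the change, for illustration only.
after_change = [
    {"sent": 1000, "received": 1000, "rx_missed": 52},
    {"sent": 1600, "received": 1600, "rx_missed": 52},
    {"sent": 2200, "received": 2200, "rx_missed": 52},
]
print(check_samples(after_change))
```

A small nonzero `tolerance` may be reasonable if the two counters are not sampled at exactly the same instant. Intervals that show a gap or RX missed growth should be compared against the timestamps of any HA events in the system log.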
Additional Information
This behavior has been observed on PA-3400 Series devices using HSCI for HA2 with Ethernet transport in specific PAN-OS releases.
HA2 UDP transport encapsulation for the PA-3400 Series was introduced in later maintenance releases and is tracked internally by engineering. On older releases where HA2 UDP is not available in the configuration, this mitigation cannot be used and an upgrade is required.
Customers should review the release notes for their target PAN-OS version and plan an upgrade path that provides both HA2 UDP transport on their platform and any related fixes that improve HA2 handling on HSCI.