High Availability - "HA Peer Connection Status"

Created On 04/28/22 20:18 PM - Last Modified 08/23/23 22:21 PM

Symptom

Peer Connection Status shows 'down' in the output of the CLI command > show high-availability all

Environment

PAN-OS
High Availability (Active-Passive or Active-Active)

Cause

If the output of > show high-availability all shows 'Connection status: down' under Peer Information on the Active or Active-Primary firewall in the HA pair, users may experience failovers or a degraded network environment.

The most common reason for the HA peer not being detected is the HA links going down, but there can be other reasons, such as:
• Peer firewall unable to receive or process HA heartbeats at that time (for example: high CPU, high memory, resource issue, overutilization/DDoS, link issue)
• Peer firewall unable to send or respond to HA heartbeats at that time (for example: high CPU, high memory, resource issue, overutilization/DDoS, link issue)
• HA link hardware issue (faulty cable, faulty SFP, faulty port, firewall backplane issue, electrical issue)
• Latency or packet loss on the HA links
• Distance between the two firewalls in the HA pair exceeds what the HA cable/SFP type is specified for
• Other system or process issues

Resolution

Identify the exact date and timestamp the HA failover / HA failure occurred

In the Firewall Web GUI, navigate to Monitor > System Logs
Navigate to the date and timestamp the HA failure occurred, and identify if there are any other System Logs around that time which could indicate an issue with the firewall health overall (any interfaces going down, processes exiting, high CPU/memory utilization, Link and Path Monitoring going down, etc.)
If other events are found which could have contributed to the HA connection being down, find that event's root cause and resolve it
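The correlation step above can be sketched in Python. This is only an illustration, assuming the System Log entries have already been exported into (timestamp, description) pairs by some other means; the function name events_near, the 10-minute window, and the sample entries are hypothetical and not taken from a real firewall.

```python
from datetime import datetime, timedelta

def events_near(events, failure_time, window_minutes=10):
    """Return the events whose timestamp lies within +/- window_minutes
    of the HA failure time, in their original order."""
    window = timedelta(minutes=window_minutes)
    return [(ts, desc) for ts, desc in events if abs(ts - failure_time) <= window]

# Illustrative entries only; descriptions are placeholders, not real log text.
sample_events = [
    (datetime(2022, 4, 28, 19, 40, 0), "example: configuration committed"),
    (datetime(2022, 4, 28, 20, 14, 30), "example: data interface went down"),
    (datetime(2022, 4, 28, 20, 17, 5), "example: HA link went down"),
    (datetime(2022, 4, 28, 21, 30, 0), "example: scheduled update check"),
]

failure_time = datetime(2022, 4, 28, 20, 18, 0)
nearby = events_near(sample_events, failure_time)

for ts, desc in nearby:
    print(ts.isoformat(sep=" "), desc)
```

Widening or narrowing window_minutes changes how many surrounding events are pulled in for review; the right value depends on how quickly the underlying problem developed.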

Verify both firewalls meet the requirements for HA

Using the documents below, verify that both firewalls match exactly in model, PAN-OS version, interfaces, licenses, vsys capabilities, etc.
Prerequisites for Active/Passive HA
Prerequisites for Active/Active HA
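As a rough aid for this comparison, the attributes of each firewall can be written down side by side and checked for differences. The Python sketch below assumes the values have been collected manually; the attribute names and values are placeholders chosen for illustration and do not represent the full prerequisite list.

```python
def find_mismatches(local, peer):
    """Return {attribute: (local_value, peer_value)} for every attribute
    that differs between the two firewalls or is missing on one side."""
    keys = sorted(set(local) | set(peer))
    return {k: (local.get(k), peer.get(k)) for k in keys if local.get(k) != peer.get(k)}

# Placeholder values for illustration only.
fw_a = {"model": "EXAMPLE-MODEL", "panos_version": "10.2.4", "multi_vsys": False}
fw_b = {"model": "EXAMPLE-MODEL", "panos_version": "10.2.3", "multi_vsys": False}

mismatches = find_mismatches(fw_a, fw_b)
for attribute, (a_value, b_value) in mismatches.items():
    print(f"{attribute}: firewall A = {a_value}, firewall B = {b_value}")
```

An empty result means the recorded attributes match; it does not replace checking the official prerequisite documents.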

Identify HA Peer Connection down reason

Review the output of the below CLI command to identify the cause for the HA Peer Connection down on both firewalls:

>show high-availability all
Group 1: 
 Mode: Active-Active
Local Information:
    Mode: Active-Active
    State: active-primary (last 1 hours)
    Last non-functional state reason: Dataplane down: brdagent exiting
Peer Information:
    Connection status: down
    Connection down reason: HA1 link went down
    Last non-functional state reason: Dataplane down: user triggered

Other possible Connection down reasons include:

Heartbeat ping failure
Never able to connect to peer
Error in connection detected
Peer HA agent exiting
Hello protocol failure
Capability exchange with peer failed
HA1 encryption configuration mismatch
SSH Tunnel reset

Tip: It is also a good idea to take note of 'Last non-functional state reason' as it can often help you find the root cause of the failure
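When the same check has to be repeated on both firewalls, the relevant fields can be pulled out of saved command output with a small script. The following Python sketch assumes the output has the "Local Information:" / "Peer Information:" layout shown in the example above; the function name parse_ha_output is hypothetical, and real output contains many more fields than this sample.

```python
import re

# Copied from the example output shown earlier in this article.
SAMPLE_OUTPUT = """\
Group 1:
 Mode: Active-Active
Local Information:
    Mode: Active-Active
    State: active-primary (last 1 hours)
    Last non-functional state reason: Dataplane down: brdagent exiting
Peer Information:
    Connection status: down
    Connection down reason: HA1 link went down
    Last non-functional state reason: Dataplane down: user triggered
"""

def parse_ha_output(text):
    """Split the output into 'local' and 'peer' sections and return
    each section as a dict of field name -> value."""
    sections = {}
    current = None
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        header = re.match(r"^(Local|Peer) Information:$", stripped)
        if header:
            current = header.group(1).lower()
            sections[current] = {}
            continue
        if current and ":" in stripped:
            # Split on the first colon only, so values that contain
            # colons (such as the state reasons) stay intact.
            key, _, value = stripped.partition(":")
            sections[current][key.strip()] = value.strip()
    return sections

ha = parse_ha_output(SAMPLE_OUTPUT)
if ha["peer"].get("Connection status") == "down":
    print("Peer down reason:", ha["peer"].get("Connection down reason"))
    print("Peer last non-functional reason:",
          ha["peer"].get("Last non-functional state reason"))
```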

Verify the status of the HA interfaces and resolve any hardware or software interface/link issues on both firewalls

Dashboard > click Widgets > System > click High Availability

>show high-availability interface < ha1 | ha2 | ha3 >

Verify a supported SFP is being used

Always use SFPs from the Palo Alto Networks list of supported SFPs for the HA ports. Unsupported SFPs have not been tested and validated for use in Palo Alto Networks devices. If an unsupported SFP is used, the interface may never come up, may flap, or may show other issues. Palo Alto Networks TAC may refuse support if an unsupported SFP is used. If you are currently using an unsupported SFP, replace it with an SFP from the list of supported SFPs below before proceeding.

List of Supported SFPs/Transceivers
https://live.paloaltonetworks.com/t5/operations-documentation/transceiver-history-reference-810-000096-00y-updated-on-03-23/ta-p/227987?attachment-id=10684
https://live.paloaltonetworks.com/t5/operations-documentation/hw-accessory-cross-reference-810-000077-0av-updated-on-03-23/ta-p/63422?attachment-id=10683
How to see currently installed SFPs

Resolve any hardware/physical link issues by trying known-good/working hardware components

Reseat the HA cable in both firewalls
Reseat the HA port SFP in both firewalls
Replace the HA cable with a known-good, working HA cable of the same type
Replace the HA port SFP with a known-good, working HA port SFP of the same type

After performing each of the above steps, check if the HA Link issue is still occurring

Resolve any Management Plane or Dataplane Performance Issues (high CPU, high memory, high Packet Buffers/Packet Descriptors)

If the Management Plane or Dataplane gets too busy, the firewall may not be able to reliably receive, process, or send HA heartbeat messages. Use the steps below to identify, troubleshoot, and resolve the high Management Plane or Dataplane utilization.

Review Monitor > System Logs around the time of the HA failure occurring to identify if there was any high CPU / memory / Packet Buffer / Packet Descriptor utilization during that time
Check the output of the following CLI commands:

>show system resources follow - shows current MP CPU/Memory usage

Look for any high CPU or high Memory on a certain process - identify which process that is (Ex: mgmtsrvr, useridd, ha-agent, logrcvr, routed, authd etc.), troubleshoot why that process has high CPU/memory, and resolve it

In the example below, excessive logging was configured in Security Policy rules, which caused the logrcvr process to use 100% of the Management Plane CPU. Other processes were starved as a result, and the firewall's ha_agent could not respond to HA heartbeats at that moment. Once the amount of logging in Security Policy rules was reduced, the issue went away and HA became stable again.

[Image: High CPU on logrcvr process on Management Plane - show system resources follow]
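Picking out the busy process from a saved copy of this output can also be scripted. The Python sketch below assumes the process table resembles Linux top output, with a header row containing %CPU and COMMAND columns; that layout, the 50% threshold, and the sample rows are assumptions made for illustration, not a captured firewall output.

```python
def busy_processes(top_text, cpu_threshold=50.0):
    """Return (command, cpu_percent) pairs at or above cpu_threshold,
    sorted from highest to lowest CPU usage."""
    lines = top_text.splitlines()
    header_idx = next(
        (i for i, line in enumerate(lines) if "%CPU" in line and "COMMAND" in line),
        None,
    )
    if header_idx is None:
        return []
    columns = lines[header_idx].split()
    cpu_col = columns.index("%CPU")
    cmd_col = columns.index("COMMAND")
    result = []
    for line in lines[header_idx + 1:]:
        fields = line.split()
        if len(fields) <= max(cpu_col, cmd_col):
            continue  # skip blank or truncated rows
        try:
            cpu = float(fields[cpu_col])
        except ValueError:
            continue  # skip rows where the CPU column is not numeric
        if cpu >= cpu_threshold:
            result.append((fields[cmd_col], cpu))
    return sorted(result, key=lambda item: item[1], reverse=True)

# Illustrative rows only, laid out in the assumed top-style format.
SAMPLE_TOP = """\
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 4321 root      20   0 1200m 300m  10m R 99.7  3.8 100:00.00 logrcvr
 4400 root      20   0  800m 100m   8m S  1.3  1.2   5:00.00 ha_agent
"""

hot = busy_processes(SAMPLE_TOP)
for command, cpu in hot:
    print(f"{command}: {cpu}% CPU")
```

The threshold is arbitrary; a process that is briefly busy is normal, so sustained high values across several samples are the more useful signal.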

>show running resource-monitor - shows current DP CPU/Memory/Packet Buffer/Packet Descriptor usage

Look for any high utilization of CPU, Packet Buffers, Packet Descriptors, or Memory - identify which resource has high utilization and resolve it

In the example below, a large volume of traffic (similar to a DDoS) was passing through the firewall at that time. As a result, the Dataplane CPU, packet buffers, and packet descriptors became heavily utilized, and the firewall interfaces could not process HA heartbeats properly. Once the offending traffic flows were identified and blocked, Dataplane utilization returned to normal levels and HA became stable again.

[Image: High CPU, Packet Buffers, and Packet Descriptors on Dataplane - show running resource-monitor]

You can use the commands below to check these log files for MP/DP usage values in the past at the date + timestamp of the recent HA failure:
>less mp-log mp-monitor.log
>less dp0-log dp-monitor.log

Use the resources in the Additional Information section below to further identify, troubleshoot and resolve the high Management Plane or Data Plane utilization

Verify HA status is healthy

Once the issue that caused HA Peer Connection Status to be down has been identified and resolved (HA link issue, MP/DP resource issue, system process issue, etc.), un-suspend the previously unhealthy unit if needed from Device > High Availability > Operational Commands > click Make local device functional for high availability

Verify HA shows healthy again in both firewalls

Dashboard > click Widgets > System > click High Availability

[Image: Dashboard Widget - HA - Active-Secondary]

>show high-availability all

Additional Information

Management Plane
Example: How to Identify Management Plane high utilization
Management Plane vs Dataplane Processes
How to Interpret Output of "show system resources"
Resource List: Performance and Stability

Dataplane
How to Troubleshoot High Dataplane Utilization
How to Troubleshoot DoS Attacks
How to Troubleshoot High Packet Buffer and Packet Descriptor Issues
How to Troubleshoot High Packet Descriptors (on-chip)
Resource List: Performance and Stability

Other Resources
How to Troubleshoot Palo Alto Networks Firewalls (Video Course)
Resource List: Troubleshooting Performance Issues
Resource List: High Availability Configuration and Troubleshooting
Resource List: Troubleshooting High Availability Issues
