High Availability - "HA Peer Connection Status"

Created On 04/28/22 20:18 PM - Last Modified 08/23/23 22:21 PM


Symptom


Peer Connection Status shows 'down' in the output of the CLI command >show high-availability all

Environment


  • PAN-OS
  • High Availability (Active-Passive or Active-Active)


Cause


If the output of >show high-availability all shows 'Connection status: down' under Peer Information on the Active or Active-Primary firewall in the HA pair, the user may experience failovers or a degraded network environment.

While the most common reason for an HA peer not being detected is the HA links going down, there can be other reasons, such as:
• Peer firewall unable to receive or process HA heartbeats at that time (for example: high CPU, high memory, resource issue, overutilization/DDoS, link issue)
• Peer firewall unable to respond to or send HA heartbeats at that time (for example: high CPU, high memory, resource issue, overutilization/DDoS, link issue)
• HA link hardware issue (faulty cable, faulty SFP, faulty port, firewall backplane issue, electrical issue)
• Latency or packet loss on the HA links
• Distance between the two firewalls in the HA pair exceeds what the HA cable/SFP type is specified to support
• Other system or process issues on either firewall


Resolution


  1. Identify the exact date and timestamp at which the HA failover / HA failure occurred
     1. In the firewall web GUI, navigate to Monitor > System Logs
     2. Go to the date and timestamp of the HA failure and check whether any other System Logs around that time indicate a problem with overall firewall health (interfaces going down, processes exiting, high CPU/memory utilization, Link and Path Monitoring going down, etc.)
     3. If other events are found that could have contributed to the HA connection being down, find the root cause of each event and resolve it
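The time-window review described in the steps above can also be done offline on exported logs. The following is a minimal Python sketch, assuming the logs have been exported into simple comma-separated lines of timestamp, severity, and description. That layout, the sample entries, and the function name events_near are all hypothetical and do not represent the actual PAN-OS export format.

```python
from datetime import datetime, timedelta

# Hypothetical sample data: "YYYY/MM/DD HH:MM:SS,severity,description".
# This is NOT the real PAN-OS system log layout; adapt the parsing to your export.
SAMPLE_LOGS = """\
2023/08/23 21:40:02,informational,placeholder: routine event
2023/08/23 22:19:41,critical,placeholder: interface went down
2023/08/23 22:20:05,critical,placeholder: HA peer connection down
2023/08/23 22:58:30,informational,placeholder: routine event
"""


def events_near(log_text, failure_time, window_minutes=5):
    """Return (timestamp, severity, description) tuples within the window."""
    window = timedelta(minutes=window_minutes)
    matches = []
    for line in log_text.splitlines():
        if not line.strip():
            continue
        # Split only twice so commas inside the description are preserved
        stamp, severity, description = line.split(",", 2)
        when = datetime.strptime(stamp, "%Y/%m/%d %H:%M:%S")
        if abs(when - failure_time) <= window:
            matches.append((when, severity, description))
    return matches


failure = datetime(2023, 8, 23, 22, 20, 5)
nearby = events_near(SAMPLE_LOGS, failure)
for when, severity, description in nearby:
    print(when, severity, description)
```

Widening window_minutes is a reasonable next step when nothing relevant appears in the default five-minute window.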
  2. Verify both firewalls meet the requirements for HA
Using the documents below, verify that both firewalls have the same model, PAN-OS version, interfaces, licenses, vsys capabilities, etc.
Prerequisites for Active/Passive HA
Prerequisites for Active/Active HA
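The prerequisite comparison can be made less error-prone by writing down each firewall's attributes and comparing them programmatically. The following is a minimal Python sketch under the assumption that the values were collected by hand. The attribute names, the placeholder values, and the function find_mismatches are hypothetical and are not taken from any PAN-OS API.

```python
# Hypothetical, hand-entered inventory for each firewall in the HA pair.
# Keys and values are placeholders chosen for illustration only.
local_fw = {
    "model": "MODEL-A",
    "panos_version": "X.Y.Z",
    "licenses": ["license-1", "license-2"],
    "multi_vsys": False,
}
peer_fw = {
    "model": "MODEL-A",
    "panos_version": "X.Y.W",
    "licenses": ["license-2", "license-1"],
    "multi_vsys": False,
}


def find_mismatches(local, peer):
    """Return {attribute: (local_value, peer_value)} for every difference."""
    mismatches = {}
    for key in sorted(set(local) | set(peer)):
        a, b = local.get(key), peer.get(key)
        if isinstance(a, list) and isinstance(b, list):
            # Compare lists as sets so ordering differences are not reported
            equal = set(a) == set(b)
        else:
            equal = a == b
        if not equal:
            mismatches[key] = (a, b)
    return mismatches


mismatches = find_mismatches(local_fw, peer_fw)
for attribute, (mine, theirs) in mismatches.items():
    print(f"{attribute}: local={mine} peer={theirs}")
```

An empty result only means the recorded attributes match; the prerequisite documents remain the authoritative checklist.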
 
  3. Identify the HA Peer Connection down reason
Review the output of the CLI command below on both firewalls to identify why the HA Peer Connection is down:
>show high-availability all
Group 1: 
 Mode: Active-Active
Local Information:
    Mode: Active-Active
    State: active-primary (last 1 hours)
    Last non-functional state reason: Dataplane down: brdagent exiting
Peer Information:
    Connection status: down
    Connection down reason: HA1 link went down
    Last non-functional state reason: Dataplane down: user triggered

Other possible Connection down reasons include:
  • Heartbeat ping failure
  • Never able to connect to peer
  • Error in connection detected
  • Peer HA agent exiting
  • Hello protocol failure
  • Capability exchange with peer failed
  • HA1 encryption configuration mismatch
  • SSH Tunnel reset
 
Tip: Also take note of 'Last non-functional state reason', as it can often help you find the root cause of the failure.
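When the command output is collected from both firewalls, the relevant fields can be pulled out of the text automatically. The following is a minimal Python sketch that parses the excerpt shown above into sections and reads the peer's connection status and down reason. The function name parse_ha_sections is hypothetical, and the parser has only been reasoned through against this excerpt; the full command output contains many more fields and may need additional handling.

```python
# Excerpt of ">show high-availability all" output, copied from this article.
SAMPLE_OUTPUT = """\
Group 1:
 Mode: Active-Active
Local Information:
    Mode: Active-Active
    State: active-primary (last 1 hours)
    Last non-functional state reason: Dataplane down: brdagent exiting
Peer Information:
    Connection status: down
    Connection down reason: HA1 link went down
    Last non-functional state reason: Dataplane down: user triggered
"""


def parse_ha_sections(text):
    """Group 'key: value' lines under the unindented 'Section:' line above them."""
    sections = {}
    current = None
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        # Split on the first colon only; values may contain colons themselves
        key, _, value = stripped.partition(":")
        value = value.strip()
        if not line.startswith(" ") and value == "":
            # An unindented line that ends in a colon starts a new section
            current = sections.setdefault(key, {})
        elif current is not None:
            current[key.strip()] = value
    return sections


sections = parse_ha_sections(SAMPLE_OUTPUT)
peer = sections["Peer Information"]
print("Peer connection status:", peer["Connection status"])
print("Down reason:", peer.get("Connection down reason", "n/a"))
print("Peer last non-functional reason:", peer["Last non-functional state reason"])
```

Running the same parser on the output from each firewall makes it easy to compare the reported reasons side by side.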
 
  4. Verify the status of the HA interfaces and resolve any hardware or software interface/link issues on both firewalls
     1. Dashboard > click Widgets > System > click High Availability
[Screenshot: HA healthy dashboard]
 
     2. >show high-availability interface < ha1 | ha2 | ha3 >
[Screenshot: show high-availability interface ha1]
[Screenshot: show high-availability interface ha3]
     3. Verify a supported SFP is being used
Always use SFPs from the Palo Alto Networks list of supported SFPs for the HA ports. Unsupported SFPs have not been tested and validated for use in Palo Alto Networks devices. If an unsupported SFP is used, the interface may never come up, may flap, or may show other issues. Palo Alto Networks TAC may refuse support if an unsupported SFP is used. If you are currently using an unsupported SFP, replace it with an SFP from the list of supported SFPs below before proceeding.

List of Supported SFPs/Transceivers
https://live.paloaltonetworks.com/t5/operations-documentation/transceiver-history-reference-810-000096-00y-updated-on-03-23/ta-p/227987?attachment-id=10684
https://live.paloaltonetworks.com/t5/operations-documentation/hw-accessory-cross-reference-810-000077-0av-updated-on-03-23/ta-p/63422?attachment-id=10683
How to see currently installed SFPs
 
     4. Resolve any hardware/physical link issues by swapping in known-good hardware components
  • Reseat the HA cable in both firewalls
  • Reseat the HA port SFP in both firewalls
  • Replace the HA cable with a known-good HA cable of the same type
  • Replace the HA port SFP with a known-good HA port SFP of the same type
After performing each of the above steps, check whether the HA link issue is still occurring
 
  5. Resolve any Management Plane or Dataplane performance issues (high CPU, high memory, high Packet Buffers/Packet Descriptors)
If the Management Plane or Dataplane gets too busy, the firewall may not be able to reliably receive, process, or send HA heartbeat messages. Use the steps below to identify, troubleshoot, and resolve high Management Plane or Dataplane utilization.
 
     1. Review Monitor > System Logs around the time of the HA failure to identify whether there was any high CPU / memory / Packet Buffer / Packet Descriptor utilization during that time
     2. Check the output of the following CLI commands:
>show system resources follow - shows current MP CPU/memory usage
Look for high CPU or high memory usage by a particular process. Identify which process it is (for example: mgmtsrvr, useridd, ha_agent, logrcvr, routed, authd), troubleshoot why that process has high CPU/memory usage, and resolve it.
 
In the example below, excessive logging was configured in Security Policy rules, which caused the logrcvr process to use 100% of the Management Plane CPU. This starved other processes on the firewall, so that ha_agent could not respond to HA heartbeats at that moment. Once the amount of logging in Security Policy rules was reduced, the issue went away and HA became stable again.
[Screenshot: High CPU on logrcvr process on Management Plane - show system resources follow]
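Spotting the busy process in a captured snapshot can be sketched in Python as follows. This assumes the process table in the captured output uses the column layout of the Linux top utility (a header row containing %CPU and COMMAND). The sample rows are invented for illustration, modeled on the logrcvr scenario described above, and the function name busy_processes and the 80% threshold are hypothetical choices, not Palo Alto Networks guidance.

```python
# Hypothetical snapshot in the column layout of Linux "top".
# PIDs, sizes, and percentages are made up for illustration.
SAMPLE_SNAPSHOT = """\
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 1234 root      20   0  500m 120m  10m R 99.9  3.1  10:00.00 logrcvr
 2345 root      20   0  300m  80m   8m S  2.0  2.0   1:00.00 mgmtsrvr
 3456 root      20   0  200m  40m   6m S  0.3  1.0   0:30.00 ha_agent
"""


def busy_processes(text, cpu_threshold=80.0):
    """Return [(command, cpu_percent)] at or above the threshold, busiest first."""
    lines = text.splitlines()
    # Locate the header row so the parser does not depend on fixed positions
    header_index = next(
        i for i, line in enumerate(lines) if "%CPU" in line and "COMMAND" in line
    )
    columns = lines[header_index].split()
    cpu_col = columns.index("%CPU")
    cmd_col = columns.index("COMMAND")

    result = []
    for line in lines[header_index + 1:]:
        fields = line.split()
        if len(fields) < len(columns):
            continue  # skip blank or truncated rows
        cpu = float(fields[cpu_col])
        if cpu >= cpu_threshold:
            result.append((fields[cmd_col], cpu))
    return sorted(result, key=lambda item: item[1], reverse=True)


hot = busy_processes(SAMPLE_SNAPSHOT)
for command, cpu in hot:
    print(f"{command} is using {cpu}% CPU")
```

A single snapshot can be misleading, so it is worth checking several samples over time before concluding that one process is the cause.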
 
>show running resource-monitor - shows current DP CPU/Memory/Packet Buffer/Packet Descriptor usage

Look for any high utilization of CPU, Packet Buffers, Packet Descriptors, or Memory - identify which resource has high utilization and resolve it
 
In the example below, a large volume of traffic (similar to a DDoS) was passing through the firewall at that time. As a result, the Dataplane CPU, packet buffers, and packet descriptors became heavily utilized, and the firewall could not process HA heartbeats properly. Once the offending traffic flows were identified and blocked, Dataplane utilization returned to normal levels and HA became stable again.
[Screenshot: High CPU, Packet Buffers, and Packet Descriptors on Dataplane - show running resource-monitor]
You can use the commands below to check the log files for past MP/DP usage values at the date and timestamp of the recent HA failure:
>less mp-log mp-monitor.log
>less dp0-log dp-monitor.log
     3. Use the resources in the Additional Information section below to further identify, troubleshoot, and resolve the high Management Plane or Dataplane utilization
 
  6. Verify HA status is healthy
     1. Once the issue that caused HA Peer Connection Status to go down has been identified and resolved (HA link issue, MP/DP resource issue, system process issue, etc.), un-suspend the previously unhealthy unit if needed from Device > High Availability > Operational Commands > click Make local device functional for high availability
How to Unsuspend HA
     2. Verify HA shows healthy again on both firewalls
Dashboard > click Widgets > System > click High Availability
[Screenshot: Dashboard Widget - HA - Active-Primary]
[Screenshot: Dashboard Widget - HA - Active-Secondary]
 
>show high-availability all
[Screenshot: HA healthy CLI]
 
 
 


Additional Information


Management Plane
Example: How to Identify Management Plane high utilization
Management Plane vs Dataplane Processes
How to Interpret Output of  "show system resources"
Resource List: Performance and Stability

Dataplane
How to Troubleshoot High Dataplane Utilization
How to Troubleshoot DoS Attacks
How to Troubleshoot High Packet Buffer and Packet Descriptor Issues
How to Troubleshoot High Packet Descriptors (on-chip)
Resource List: Performance and Stability

Other Resources
How to Troubleshoot Palo Alto Networks Firewalls (Video Course)
Resource List: Troubleshooting Performance Issues
Resource List: High Availability Configuration and Troubleshooting
Resource List: Troubleshooting High Availability Issues

