High Availability - "HA Backup"

Created On 05/04/22 16:19 PM - Last Modified 08/23/23 22:13 PM


Symptom


The HA1-Backup link is in a down state
The HA2-Backup link is in a down state


Environment


PAN-OS

Cause


If the output of > show high-availability all shows the HA1 Backup Control Link or the HA2 Backup Data Link as 'Link state: down' on the Active or Active-Primary firewall in the HA pair, users may experience failovers and/or a degraded network state.

While the most common reason for an HA link going down is the physical link itself going down, there can be other reasons, such as:
• Either firewall is unable to receive or process HA heartbeats over that link at that time (for example: high CPU, high memory, resource issue, overutilization/DDoS, link issue)
• Either firewall is unable to send or respond to HA heartbeats over that link at that time (for example: high CPU, high memory, resource issue, overutilization/DDoS, link issue)
• HA link hardware issue (faulty cable, faulty SFP, faulty port, firewall backplane issue, electrical issue)
• The geographic distance between the two firewalls in the HA pair exceeds what the HA cable/SFP type is specified for
• Other system or process issues


Resolution


  1. Identify the exact date and timestamp at which the HA1 Backup Link or HA2 Backup Link went down
    1. Firewall GUI: Monitor > Logs > System
    2. Filter by the date and timestamp at which the HA1 Backup or HA2 Backup link went down
    3. Review the log entries around that time for any other interface, process, resource, or system event that could indicate an overall health issue with the firewall or a related event
    4. If other events are found that could have contributed to the HA1-Backup or HA2-Backup link(s) going down, find the root cause of those events and resolve it
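The correlation in this step (looking for other events close to the time the HA Backup link went down) can be sketched offline in Python. This is a minimal illustration only: the function name events_near, the 5-minute window, and the sample entries are assumptions made for the example, and the event descriptions are placeholders rather than real PAN-OS System log messages.

```python
from datetime import datetime, timedelta

def events_near(events, failure_time, window_minutes=5):
    """Return the (timestamp, description) pairs that fall within
    +/- window_minutes of failure_time."""
    window = timedelta(minutes=window_minutes)
    return [(ts, text) for ts, text in events
            if abs(ts - failure_time) <= window]

# Hypothetical entries, e.g. transcribed by hand from the System log view.
events = [
    (datetime(2023, 8, 23, 21, 40, 0), "placeholder: unrelated event"),
    (datetime(2023, 8, 23, 22, 11, 30), "placeholder: resource event"),
    (datetime(2023, 8, 23, 22, 13, 0), "placeholder: HA backup link event"),
    (datetime(2023, 8, 23, 22, 30, 0), "placeholder: unrelated event"),
]

# Assumed time at which the HA Backup link went down.
failure = datetime(2023, 8, 23, 22, 13, 0)

for ts, text in events_near(events, failure):
    print(ts.isoformat(), text)
```

A narrow window keeps the review focused on events that plausibly relate to the link failure; widen window_minutes if nothing relevant turns up.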
  2. Verify the HA Backup link interface status
    1. Firewall GUI: Dashboard > Widgets > System > click High Availability
       (Screenshot: HA1-Backup link down in the web GUI)
    2. CLI: use the command show high-availability interface < ha1-backup | ha2-backup >
       (Screenshot: HA2-Backup link failure in the CLI)
    Note: Always use SFPs from the Palo Alto Networks list of supported SFPs for the HA ports. Unsupported SFPs have not been tested or validated for use in Palo Alto Networks devices. With an unsupported SFP, the interface may never come up, may flap, or may show other issues. Palo Alto Networks TAC may refuse support if an unsupported SFP is used. If an unsupported SFP is currently in use, replace it with an SFP from the list of supported SFPs below before proceeding.

List of Supported SFP's/Transceivers
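When the CLI output has been saved to a text file, the check for a Backup link reported as 'Link state: down' can be sketched as below. This is a rough illustration under stated assumptions: it assumes each link section starts with a heading line that ends in a colon and is followed by "Key: value" lines. The real output layout may differ between PAN-OS versions, and both the function name backup_links_down and the sample text are hypothetical.

```python
def backup_links_down(cli_output):
    """Return the headings of Backup-link sections whose link state is 'down'.

    Assumed layout (not verified against any specific PAN-OS version):
    a heading line ending in ':' followed by indented 'Key: value' lines.
    """
    down = []
    heading = None
    for raw in cli_output.splitlines():
        line = raw.strip()
        if not line:
            continue
        key, sep, value = line.partition(":")
        if sep and not value.strip():
            # A line with nothing after the colon is treated as a section heading.
            heading = key.strip()
            continue
        if (heading is not None
                and "backup" in heading.lower()
                and key.strip().lower() == "link state"
                and value.strip().lower() == "down"):
            down.append(heading)
    return down

# Hypothetical, heavily shortened sample; not captured from a real firewall.
SAMPLE = """
HA1 Control Link Information:
  Link state: up
HA1 Backup Control Link Information:
  Link state: down
HA2 Backup Data Link Information:
  Link state: up
"""

print(backup_links_down(SAMPLE))
```

Matching is case-insensitive so that small differences in capitalization between versions do not hide a down link.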
  3. Resolve any hardware/physical link issues by trying known-good, working hardware components
    1. Reseat the HA cable in both firewalls
    2. Reseat the HA port SFP in both firewalls
    3. Replace the HA cable with a known-good, working HA cable of the same type
    4. Replace the HA port SFP with a known-good, working HA port SFP of the same type
  4. Resolve any Management Plane or Data Plane performance issues (high CPU, high memory, high Packet Buffers/Packet Descriptors)
    If the Management Plane or Data Plane becomes too busy, the firewall may not be able to reliably receive, process, or send HA heartbeat messages over the HA link(s). Use the steps below to identify, troubleshoot, and resolve high Management Plane or Data Plane utilization.

    1. Review Monitor > Logs > System around the time of the HA failure to identify any high CPU, memory, Packet Buffer, or Packet Descriptor utilization during that time
    2. Check the output of the following CLI commands:
       > show system resources follow  (shows current MP CPU/memory usage)
       Look for high CPU or high memory on a specific process (for example: mgmtsrvr, useridd, ha_agent, logrcvr, routed, authd), troubleshoot why that process is consuming it, and resolve the cause
 
       In the example below, excessive logging was configured in Security Policy rules, which caused the logrcvr process to use 100% of the Management Plane CPU. As a result, other processes on the firewall had issues; for example, ha_agent could not respond to HA heartbeats in that moment. Once the amount of logging in the Security Policy rules was reduced, the issue went away and HA became stable again.
       (Screenshot: high CPU on the logrcvr process on the Management Plane, from show system resources follow)
 
       > show running resource-monitor  (shows current DP CPU/memory/Packet Buffer/Packet Descriptor usage)
       Look for high utilization of CPU, Packet Buffers, Packet Descriptors, or memory, identify which resource is affected, and resolve the cause

       In the example below, a large volume of traffic (similar to a DDoS) was passing through the firewall at that time. As a result, the Data Plane CPU, packet buffers, and packet descriptors became heavily utilized, and the firewall interfaces could not process HA heartbeats properly. Once the offending traffic flows were identified and stopped from coming through the firewall, Data Plane utilization returned to normal levels and HA became stable again.
       (Screenshot: high CPU, Packet Buffers, and Packet Descriptors on the Data Plane, from show running resource-monitor)
    3. Use the commands below to check the log files for past MP/DP usage values at the date and timestamp of the recent HA failure:
       > less mp-log mp-monitor.log
       > less dp0-log dp-monitor.log
    4. Use the resources in the Additional Information section below to further identify, troubleshoot, and resolve the high Management Plane or Data Plane utilization
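Once utilization values have been read from the command output or the monitor logs, deciding which resource is "high" can be sketched as a simple threshold check. This is only an illustration: the function name high_utilization, the 80% threshold, and the sample numbers are assumptions for the example, not values recommended by Palo Alto Networks, and the values are assumed to have been copied by hand rather than parsed from the CLI.

```python
def high_utilization(samples, threshold=80.0):
    """Return {resource: peak_value} for every resource whose peak
    utilization sample (in percent) exceeds the threshold."""
    return {name: max(values)
            for name, values in samples.items()
            if values and max(values) > threshold}

# Hypothetical utilization samples in percent, keyed by resource name.
samples = {
    "cpu": [35, 92, 97],
    "packet buffer": [10, 12, 11],
    "packet descriptor": [85, 60, 40],
}

for resource, peak in high_utilization(samples).items():
    print(f"{resource}: peak {peak}%")
```

Reporting the peak rather than the average keeps short spikes visible, which matters here because a brief burst can be enough for HA heartbeats to be missed.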
  5. Verify HA status is healthy
    1. Once the issue that caused the HA1 Backup or HA2 Backup link to go down has been identified and resolved (physical issue, HA link issue, MP/DP resource issue, system process issue, etc.), un-suspend the previously unhealthy unit if needed: Device > High Availability > Operational Commands > click Make local device functional for high availability
       How to Unsuspend HA

    2. Verify that HA shows healthy again on both firewalls
       Firewall GUI: Dashboard > click Widgets > System > click High Availability
       (Screenshot: HA1-Backup healthy in the Dashboard web GUI)
       CLI: use the command show high-availability all
       (Screenshots: HA1-Backup healthy in the CLI; HA2-Backup healthy in the CLI)
 


Additional Information


Management Plane
Example: How to Identify Management Plane high utilization
Management Plane vs Dataplane Processes
How to Interpret Output of  "show system resources"
Resource List: Performance and Stability

Data Plane
How to Troubleshoot High Dataplane Utilization
How to Troubleshoot DoS Attacks
How to Troubleshoot High Packet Buffer and Packet Descriptor Issues
How to Troubleshoot High Packet Descriptors (on-chip)
Resource List: Performance and Stability

Other Resources
How to Troubleshoot Palo Alto Networks Firewalls (Video Course)
Resource List: Troubleshooting Performance Issues
Resource List: High Availability Configuration and Troubleshooting
Resource List: Troubleshooting High Availability Issues

