How to Identify the Root Cause of HA Firewall Non-Functional States and Failovers
Objective
- Identify the root cause of HA firewall non-functional states.
- Restore the HA firewalls to a healthy, redundant state.
Environment
- Palo Alto Networks Firewalls
- High Availability (HA) active/passive or active/active
Procedure
- Find the reason for the non-functional state of a firewall in HA by accessing its peer:
- Check the UI: high-availability dashboard. Navigate to DASHBOARD > High-Availability widget.
- Check the output of the CLI command:
> show high-availability all
- Look under the "Peer Information" for the State Reason.
Peer Information:
  Connection status: up
  Version: 1
  Mode: Active-Passive
  State: non-functional (last 28 minutes)
  State Reason: State synchronization mismatch <<<<<
- The various reasons why a firewall in HA goes into non-functional State are listed here:
- Dataplane down: dataplane exit failure
- Dataplane down: brdagent exiting
- Slot x: slot down: brdagent exiting
- Dataplane down: path monitor failure
- Link down
- Path down
- Policy push to dataplane failed
- Mode mismatch between peers
- State synchronization mismatch
- A/A mode device-id overlap
- A/A mode packet-forward mismatch
- A/A mode session load share mismatch
- A/A mode QOS config sync mismatch
- A/A mode router config sync mismatch
- Peer version not compatible only seen for Panorama covered under
- URL vendor mismatch
- HA3 link is down
- HA2 IPv4/IPv6 mismatch with peer
- HA2-backup IPv4/IPv6 mismatch with peer
- HA2 port mismatches with peer
- HA2-backup port mismatches with peer
- Local and peer HA1 IP mismatches
- Group ID mismatch between peers
- System monitor failure
- Waiting for tentative-hold-time
- Waiting for policy push to dataplane
- Waiting for state synchronization completion
- VM License mismatches with peer
- Version mismatches with peer for VMS
- GTP enable mismatches with peer
- SCTP enable mismatches with peer
- NAT oversubscription mismatch
- Drive error detected
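The State Reason check described above can also be scripted once the CLI output has been saved as text. The following is a minimal Python sketch, not an official tool: the function name `parse_peer_information` is hypothetical, and the sample text is modeled only on the excerpt shown in this article. Real `show high-availability all` output contains more sections and fields, so the layout assumed here may need adjusting.

```python
# Sample text modeled on the "Peer Information" excerpt in this article.
# Assumption: one "key: value" pair per line, as in that excerpt.
SAMPLE_OUTPUT = """\
Peer Information:
  Connection status: up
  Version: 1
  Mode: Active-Passive
  State: non-functional (last 28 minutes)
  State Reason: State synchronization mismatch
"""


def parse_peer_information(cli_output):
    """Collect the key/value pairs listed under 'Peer Information'."""
    fields = {}
    in_peer_section = False
    for raw_line in cli_output.splitlines():
        line = raw_line.strip()
        if line.startswith("Peer Information"):
            in_peer_section = True
            continue
        if not in_peer_section or ":" not in line:
            continue
        key, _, value = line.partition(":")
        if not value.strip():
            # A bare "Something:" line is treated as the next section header.
            break
        fields[key.strip()] = value.strip()
    return fields


peer = parse_peer_information(SAMPLE_OUTPUT)
print(peer.get("State"))
print(peer.get("State Reason"))
```

The returned State Reason string can then be compared against the list of reasons above.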
- The remediation steps for each of these causes are listed below:
- Dataplane down: dataplane exit failure : To start the investigation on this issue open a support case.
- Dataplane down: brdagent exiting : To start the investigation on this issue open a support case.
- Slot x: slot down: brdagent exiting : To start the investigation on this issue open a support case.
- Dataplane down: path monitor failure : To start the investigation on this issue open a support case.
- Link down :
The firewall went into the non-functional state due to a link group 'link-detection' failure. Refer to VIDEO TUTORIAL: WHAT IS LINK GROUP MONITORING IN HA? and How to troubleshoot physical port flap or link down issue.
- Path down :
The firewall went into the non-functional state due to a path down condition. Check the HA path-monitoring configuration and troubleshoot the issue.
Refer to Link Monitoring and Path Monitoring Behavior and HA link and path monitoring.
> show high-availability path-monitoring
- Policy push to dataplane failed : To start the investigation on this issue open a support case.
- Mode mismatch between peers :
Ensure that both firewalls in HA are configured with a matching HA mode.
1- If your HA is A/P, both firewalls must have the HA mode under Device > High Availability > General > Setup set to "Active Passive".
2- If your HA is A/A, both firewalls must have the HA mode under Device > High Availability > General > Setup set to "Active Active".
Make sure that you commit the firewall configuration after the config change.
- State synchronization mismatch :
Refer to HA Active / Passive firewall Non Functional reason "State synchronization mismatch".
- A/A mode device-id overlap :
Ensure that you set the Device ID to different values (0 or 1) on each peer:
Step 1: In Device > High Availability > General, edit Setup.
Step 2: Select Device ID as follows:
When configuring the first peer, set the Device ID to 0.
When configuring the second peer, set the Device ID to 1.
- A/A mode packet-forward mismatch :
- A/A mode session load share mismatch :
To remedy both the A/A mode packet-forward mismatch and the session load share mismatch:
Step 1: Configure Session Owner and Session Setup.
In Device > High Availability > HA Communications, edit Packet Forwarding.
1- For Session Owner Selection, select one of the following:
First Packet: The firewall that receives the first packet of a new session is the session owner (recommended setting). This setting minimizes traffic across HA3 and load shares traffic across peers.
Primary Device: The firewall that is in active-primary state is the session owner.
2- For Session Setup, select one of the following:
IP Modulo: The firewall performs an XOR operation on the source and destination IP addresses from the packet and, based on the result, chooses which HA peer will set up the session.
IP Hash: The firewall uses a hash of either the source IP address or a combination of the source and destination IP addresses to distribute session setup responsibilities.
Primary Device: The active-primary firewall sets up all sessions.
First Packet: The firewall that receives the first packet of a new session performs session setup (recommended setting).
Note: Start with First Packet for Session Owner and Session Setup, and then, based on load distribution, you can change to one of the other options.
Click OK.
Reference: https://docs.paloaltonetworks.com/pan-os/11-1/pan-os-admin/high-availability/set-up-activeactive-ha/configure-activeactive-ha
- A/A mode QOS config sync mismatch :
Different QoS sync settings on the two HA peers will trigger the non-functional state (due to A/A mode QoS config sync mismatch) on one device. The recommended steps are:
Step 1: Suspend the Active-secondary device.
Go to Device > High-Availability > Operational Commands and click "Suspend local device".
Step 2: Enable the QoS sync setting on both HA peers.
Go to Device > High-Availability > Active/Active Config > Packet Forwarding, select "QoS sync", then commit.
Step 3: Manually perform a "Sync to peer" from the Active-primary device.
Go to Dashboard and click "Sync to peer" in the High Availability widget.
Step 4: Make the Active-secondary device functional.
Go to Device > High-Availability > Operational Commands and click "Make local device functional".
Step 5: Commit the changes on the Active-primary device and the error will go away.
- A/A mode router config sync mismatch : Ensure that the routers on both firewalls in HA Active/Active use the same mode; for example, check that advanced routing mode is not enabled on one firewall and disabled on the other.
- URL vendor mismatch :
Refer to Mismatched URL Vendor on High Availability Pair.
- HA3 link is down :
Note: If the HA3 link fails, one of the firewalls in HA A/A will transition to the non-functional state. To prevent this condition, configure a Link Aggregation Group (LAG) interface with two or more physical interfaces as the HA3 link. The firewall does not support an HA3 Backup link. An aggregate interface with multiple interfaces will provide additional capacity and link redundancy to support packet forwarding between HA peers.
For further details on how to troubleshoot HA3 link failure, refer to High-Availability - HA links status.
Additional information: The HA3 link is a Layer 2 link that uses MAC-in-MAC encapsulation. It does not support Layer 3 addressing or encryption. PA-7000 Series firewalls synchronize sessions across the NPCs one-for-one. On PA-800 Series, PA-3200 Series, and PA-5200 Series firewalls, you can configure aggregate interfaces as an HA3 link. The aggregate interfaces can also provide redundancy for the HA3 link; you cannot configure backup links for the HA3 link. On PA-3200 Series, PA-5200 Series, and PA-7000 Series firewalls, the dedicated HSCI ports support the HA3 link. The firewall adds a proprietary packet header to packets traversing the HA3 link, so the MTU over this link must be greater than the maximum packet length forwarded.
- HA2 IPv4/IPv6 mismatch with peer : Ensure that the HA2 configuration on both firewalls in HA has matching settings and the same IP address version, and differs only in the assigned IP addresses. Refer to HA2 configuration.
- HA2-backup IPv4/IPv6 mismatch with peer : Ensure that the HA2-backup configuration on both firewalls in HA has matching settings and the same IP address version, and differs only in the assigned IP addresses. Refer to HA2-backup configuration.
- HA2 port mismatches with peer : Ensure that the HA2 configuration on both firewalls in HA has matching settings and differs only in the assigned IP addresses. Refer to HA2 configuration.
- HA2-backup port mismatches with peer : Ensure that the HA2-backup configuration on both firewalls in HA has matching settings and differs only in the assigned IP addresses. Refer to HA2-backup configuration.
- Local and peer HA1 IP mismatches :
Ensure that the HA1 IP address is properly configured and is not the same on both firewalls in HA. If the firewalls in HA have their configuration pushed from Panorama, ensure that the template pushed from Panorama doesn't configure HA1 with the same IP address on both firewalls. You can opt for local HA settings configuration if needed, or use template variables.
References:
How to configure template Variables for High Availability Active/Passive
Migrate a Firewall HA Pair to Panorama Management and Reuse Existing Configuration
Configure Active/Passive HA
Configure Active/Active HA
- Group ID mismatch between peers :
Ensure that the Group ID is the same for both firewalls in HA.
For information on how to change the Group ID, refer to How to Change The Group ID in a HA Environment.
- System monitor failure : Very rare state.
- Waiting for tentative-hold-time : Transitional state.
- Waiting for policy push to dataplane : Transitional state.
- Waiting for state synchronization completion : Transitional state.
- VM License mismatches with peer :
Refer to HA moved to non-functional due to vm license mismatches with peer even when both firewalls have identical licenses.
- Version mismatches with peer for VMS :
Because of PAN-244673, the solution might also require an upgrade or intervention from TAC for root access.
- GTP enable mismatches with peer :
Ensure that GTP is either enabled on both firewalls in HA or disabled on both.
Enabling or disabling GTP Stateful Inspection requires a commit and a reboot.
- SCTP enable mismatches with peer :
Ensure that SCTP is either enabled on both firewalls in HA or disabled on both.
Enabling or disabling SCTP Security requires a commit.
- NAT oversubscription mismatch :
Refer to HA Non functional with Error message: "Nat oversubscription mismatch" after upgrade.
- Drive error detected :
Reseat the logging drive. If the drive is still down, it may need to be replaced. Open a support case and refer support to the internal KB Strata Cloud Manager "Logging drive failure" Alert for more details.
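Several of the remediation steps above reduce to comparing a handful of HA settings between the two peers. As an illustration only, the Python sketch below applies those comparisons to two dictionaries. The dictionary keys (`mode`, `group_id`, `device_id`, `ha1_ip`, `gtp_enabled`, `sctp_enabled`) and the function name are hypothetical and do not correspond to a PAN-OS API or configuration schema; the values would have to be gathered manually or by other means.

```python
def find_ha_mismatches(local, peer):
    """Return the state reasons from this article that the settings would explain."""
    findings = []
    if local["mode"] != peer["mode"]:
        findings.append("Mode mismatch between peers")
    if local["group_id"] != peer["group_id"]:
        findings.append("Group ID mismatch between peers")
    if local["ha1_ip"] == peer["ha1_ip"]:
        # The HA1 address must differ between the two firewalls.
        findings.append("Local and peer HA1 IP mismatches")
    both_active_active = local["mode"] == peer["mode"] == "active-active"
    if both_active_active and local["device_id"] == peer["device_id"]:
        # In A/A, one peer uses Device ID 0 and the other uses 1.
        findings.append("A/A mode device-id overlap")
    if local["gtp_enabled"] != peer["gtp_enabled"]:
        findings.append("GTP enable mismatches with peer")
    if local["sctp_enabled"] != peer["sctp_enabled"]:
        findings.append("SCTP enable mismatches with peer")
    return findings


# Hypothetical example values (documentation-range IP addresses).
fw_a = {"mode": "active-active", "group_id": 1, "device_id": 0,
        "ha1_ip": "192.0.2.1", "gtp_enabled": False, "sctp_enabled": False}
fw_b = {"mode": "active-active", "group_id": 1, "device_id": 0,
        "ha1_ip": "192.0.2.2", "gtp_enabled": True, "sctp_enabled": False}

for finding in find_ha_mismatches(fw_a, fw_b):
    print(finding)
```

With these example values the sketch reports the device-id overlap and the GTP mismatch. It covers only the checks that this article describes as simple equal-or-different comparisons.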
Additional Information
For additional help, or if you are unable to find the proper remediation steps, open a support case.
Some of those non-functional states require opening a support case because the problem could be related to a software issue or a hardware problem. These types of cases need troubleshooting and, most of the time, involve the engineering team to debug the problem further.
Some of the logs and show commands that the support team will need to check:
masterd.log (md.log), mprelay-def-hb-fail.log, masterd_detail.log/DP, controlplane-down.log, mpreplay.log, brdagent.log, ha_agent.log, path_monitor_hb_fail_s<slot>.log, messages, etc.
> show system packet-path-test status
> show system files
It is therefore recommended to collect the tech support file from both firewalls in HA as soon as the issue happens and attach the files to the support case.
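The split between reasons that go straight to a support case, reasons that are transitional, and reasons with their own remediation steps can be written down as a small lookup. The Python sketch below encodes only what the Procedure section of this article states; the function name `triage` and the returned labels are hypothetical, and it assumes the State Reason string matches the wording of the list in this article exactly.

```python
import re

# Reasons for which the Procedure section says to open a support case.
SUPPORT_CASE_REASONS = {
    "Dataplane down: dataplane exit failure",
    "Dataplane down: brdagent exiting",
    "Dataplane down: path monitor failure",
    "Policy push to dataplane failed",
}
# "Slot x: slot down: brdagent exiting" carries a variable slot identifier.
SLOT_DOWN_PATTERN = re.compile(r"^Slot \S+: slot down: brdagent exiting$")

# Reasons described as transitional states.
TRANSITIONAL_REASONS = {
    "Waiting for tentative-hold-time",
    "Waiting for policy push to dataplane",
    "Waiting for state synchronization completion",
}


def triage(state_reason):
    """Map a State Reason string to the kind of follow-up this article gives."""
    reason = state_reason.strip()
    if reason in SUPPORT_CASE_REASONS or SLOT_DOWN_PATTERN.match(reason):
        return "open a support case"
    if reason in TRANSITIONAL_REASONS:
        return "transitional state"
    return "see the remediation steps for this reason"


print(triage("Slot 3: slot down: brdagent exiting"))
print(triage("Waiting for tentative-hold-time"))
print(triage("Link down"))
```

Any reason not listed in the two sets falls through to the remediation steps, which keeps the lookup conservative if new state reasons appear.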