How to troubleshoot SD-WAN link down


Created On 01/20/23 02:49 AM - Last Modified 09/15/23 05:10 AM


Objective


We received a notification that one or more SD-WAN tunnel interfaces are down. This article explores several possible scenarios to help isolate and resolve the issue.

Environment


  • PAN-OS 10.0 and above
  • SD-WAN


Procedure


  1. Identify possible resource depletion on the Palo Alto Networks firewall.
    a. If the firewall is monitored by Strata Cloud Manager, use How to identify high CPU, Packet Buffer, and Packet Descriptor in the firewall with Strata Cloud Manager.
    b. For firewalls not monitored by Strata Cloud Manager, use the following steps:
      i. Use How To Troubleshoot High Packet Buffer Or Packet Descriptors Usage to check whether the firewall has high dataplane resource usage.
      ii. Determine whether data plane CPU utilization is high.
        1. In the firewall GUI, go to DASHBOARD > Widgets > System > System Resources.
        2. If it is high, use How to Troubleshoot High DataPlane CPU to resolve the issue.
      iii. Determine whether management plane CPU utilization is high.
        1. In the firewall GUI, go to DASHBOARD > Widgets > System > System Resources.
        2. If it is high, use TIPS & TRICKS: Reducing Management Plane Load to resolve the issue.
NOTE: Packet Descriptor and/or Packet Buffer utilization of 60% or higher may be a sign of a Denial of Service (DoS) attack.
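The 60% rule of thumb in the NOTE can be applied mechanically to a set of readings. The following Python sketch is a minimal illustration, assuming you have copied packet buffer and packet descriptor percentages from the firewall (for example, from > show running resource-monitor) into lists. The resource labels and the choice to flag on the peak value are assumptions, not part of the original procedure.

```python
from statistics import mean

# Rule of thumb from the NOTE: 60% or more may indicate a DoS condition.
DOS_SUSPECT_THRESHOLD_PCT = 60.0


def check_dataplane_resources(samples: dict[str, list[float]],
                              threshold: float = DOS_SUSPECT_THRESHOLD_PCT) -> dict[str, dict]:
    """Summarize utilization samples and flag resources at or above threshold.

    `samples` maps a resource label (hypothetical, e.g. "packet_buffer") to
    percentage readings copied by hand from the firewall.
    """
    report = {}
    for resource, values in samples.items():
        if not values:
            continue  # nothing recorded for this resource
        peak = max(values)
        report[resource] = {
            "peak": peak,
            "average": round(mean(values), 1),
            # Flag on the peak: even a short burst at or above the
            # threshold is worth investigating (an assumed policy).
            "suspect": peak >= threshold,
        }
    return report


if __name__ == "__main__":
    readings = {"packet_buffer": [12, 15, 64, 70], "packet_descriptor": [8, 9, 11, 10]}
    for name, info in check_dataplane_resources(readings).items():
        print(name, info)
```

Flagging on the peak rather than the average is deliberately conservative; if your readings are noisy, comparing the average against the threshold instead is an equally reasonable choice.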
  2. Identify possible bandwidth depletion.
    a. For firewalls not monitored by Strata Cloud Manager, use the following steps:
      i. Collect the interface-id of the affected tunnel by running > show sdwan connection all
(Screenshot: sample output of > show sdwan connection all)
      ii. Determine when the SD-WAN tunnel went down by running > grep mp-log sdwand.log pattern "if_idx X", where X is the interface-id collected in step 2.a.i.
(Screenshot: sample output of the grep command)

      iii. In your network bandwidth monitoring tool (e.g., PRTG, WhatsUp Gold, Nagios), check the bit rate of the affected interface around the time obtained above. If the graph shows a flat, leveled-off trend, this confirms that more bandwidth is needed to support the current demand; escalate to the service provider if the SLA is not being met.
      iv. Using the same command as in step 2.a.i, check whether other tunnel interfaces mapped to the same physical interface are experiencing the same issue.
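Steps 2.a.ii and 2.a.iii can be approximated offline in Python. This is a minimal sketch under stated assumptions: log_lines is a saved copy of the grep output (or of sdwand.log), and samples is a list of (timestamp, bits-per-second) pairs exported from your monitoring tool. The exact-match filter guards against a plain pattern such as "if_idx 1" also matching "if_idx 12". The saturation test, including its window, plateau_ratio, and min_fraction defaults, is a hypothetical heuristic for "flat and leveled off near the link's capacity", not a Palo Alto Networks algorithm.

```python
import re
from datetime import datetime, timedelta


def lines_for_if_idx(log_lines: list[str], if_idx: int) -> list[str]:
    """Return only the lines that mention exactly 'if_idx <if_idx>'.

    The negative lookahead for a digit keeps 'if_idx 1' from also
    matching 'if_idx 12' or 'if_idx 100'.
    """
    pattern = re.compile(r"if_idx\s+" + re.escape(str(if_idx)) + r"(?!\d)")
    return [line for line in log_lines if pattern.search(line)]


def looks_saturated(samples: list[tuple[datetime, float]],
                    event_time: datetime,
                    link_capacity_bps: float,
                    window: timedelta = timedelta(minutes=10),
                    plateau_ratio: float = 0.95,
                    min_fraction: float = 0.8) -> bool:
    """Hypothetical heuristic for a 'flat, leveled-off' bit rate.

    Returns True when at least `min_fraction` of the samples within
    +/- `window` of `event_time` run at or above
    `plateau_ratio` * `link_capacity_bps`, i.e. the graph is pinned near
    the link's ceiling around the time the tunnel went down.
    """
    nearby = [bps for ts, bps in samples if abs(ts - event_time) <= window]
    if not nearby:
        return False  # no data around the event, so nothing to conclude
    ceiling = plateau_ratio * link_capacity_bps
    near_ceiling = sum(1 for bps in nearby if bps >= ceiling)
    return near_ceiling / len(nearby) >= min_fraction
```

For example, lines_for_if_idx(saved_lines, 12) keeps only the entries for interface-id 12; the time you read from those lines becomes event_time, and link_capacity_bps is the circuit's contracted bandwidth.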
  3. Identify a possible IPSec configuration mismatch.
    a. In most topologies, Panorama sits behind the hub firewall, and the branch firewalls reach Panorama through the SD-WAN/IPSec tunnel.
      A problem arises when SD-WAN/IPSec-related configuration is pushed to the hub and branch firewalls at the same time, because the change may trigger a tunnel renegotiation. If the commit completes on the hub before the branch, the hub waits for the branch to initiate the IKE/IPSec negotiation. If the tunnel does not exist once the branch commit completes, the branch firewall loses connectivity to Panorama and reverts the changes because of Enable Automated Commit Recovery. The two devices are then left with different SD-WAN/IPSec parameters.
      i. Go to Panorama > Managed Devices > Summary to check for any out-of-sync firewalls.
      ii. Use How to troubleshoot IPSec VPN Tunnel Down to look for any IPSec configuration discrepancy on the branch, then temporarily edit the branch firewall configuration to match the hub's configuration.
      iii. Once the tunnel is up, push the Panorama configuration to the branch firewall to bring both configurations back in sync.
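When comparing the hub and branch in step 3.a.ii, a side-by-side diff of the IKE/IPSec parameters makes discrepancies easier to spot. The Python sketch below assumes you have transcribed each side's settings into a dictionary by hand; the parameter names and values in the example are hypothetical placeholders, not PAN-OS configuration keys.

```python
def find_param_mismatches(hub: dict, branch: dict) -> dict:
    """Return {parameter: (hub_value, branch_value)} for every parameter
    that differs, or that exists on only one side (shown as None)."""
    mismatches = {}
    for key in sorted(set(hub) | set(branch)):
        hub_value, branch_value = hub.get(key), branch.get(key)
        if hub_value != branch_value:
            mismatches[key] = (hub_value, branch_value)
    return mismatches


if __name__ == "__main__":
    # Hypothetical values transcribed from the hub and branch GUIs.
    hub = {"ike_version": "ikev2", "dh_group": "group20", "encryption": "aes-256-gcm"}
    branch = {"ike_version": "ikev2", "dh_group": "group14", "encryption": "aes-256-gcm"}
    for param, (hub_val, branch_val) in find_param_mismatches(hub, branch).items():
        print(f"{param}: hub={hub_val} branch={branch_val}")
```

Reporting parameters that exist on only one side matters here, because a setting left behind by an automated commit recovery may be missing entirely rather than just different.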
  4. Identify a possible configuration issue that prevents the ICMP tunnel monitor from functioning.
    a. Zone Protection with Strict IP Address Check applied to the tunnel interface's zone
      i. Run > show counter global filter delta yes | match drop and look for drops caused by the strict IP address check.

(Screenshot: sample output of > show counter global filter delta yes | match drop)
      ii. Determine the zone applied to the tunnel interface: on the branch, go to NETWORK > Interfaces > Tunnel. In most cases, this is zone-to-hub.
      iii. On Panorama, go to NETWORK > Template > [select the appropriate template] > Network Profiles > Zone Protection > [select the Zone Protection profile of the zone identified in step 4.a.ii] > Packet Based Attack Protection > IP Drop > clear Strict IP Address Check > Commit and Push.
      iv. If the topology matches the one described in step 3.a and there is no reachability between Panorama and the branch, perform step 4.a.iii directly on the branch instead: in the branch GUI, go to NETWORK > Network Profiles > Zone Protection > [select the Zone Protection profile of the zone identified in step 4.a.ii] > Packet Based Attack Protection > IP Drop > clear Strict IP Address Check > Commit.
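The counter output from step 4.a.i can be filtered with a short Python sketch. It assumes that each counter line begins with the counter name followed by its integer value, and ignores the remaining columns; verify this layout against the output of your PAN-OS version. The keyword you filter on (for example "strict") is also an assumption: confirm the actual counter name in your own output rather than relying on this filter.

```python
def parse_drop_counters(cli_output: str) -> dict[str, int]:
    """Map counter name -> value from pasted 'show counter global' output.

    Assumes each counter line starts with the counter name followed by its
    integer value. Header, separator, and blank lines have no integer in
    the second column and are skipped.
    """
    counters = {}
    for line in cli_output.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[1].isdigit():
            counters[fields[0]] = int(fields[1])
    return counters


def matching_counters(counters: dict[str, int], keyword: str) -> dict[str, int]:
    """Keep only non-zero counters whose name contains `keyword` (case-insensitive)."""
    return {name: value for name, value in counters.items()
            if value > 0 and keyword.lower() in name.lower()}
```

Because delta yes reports only the change since the previous run, running the command a few times while the tunnel monitor is failing and checking that the matching counter keeps increasing is a stronger signal than a single reading.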
 


Additional Information


  1. How to interpret a tunnel name such as tl_0102_007951000299644_0104:
tl = tunnel
0102 = ethernet1/2 on the local firewall (the local end of the tunnel)
007951000299644 = serial number of the remote peer
0104 = ethernet1/4 on the remote firewall with the above serial number
 
(Screenshot: example of the tunnel name format described above)
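The naming convention above can be decoded programmatically, which also helps with step 2.a.iv (finding other tunnels that share a physical interface). This Python sketch assumes the four-digit codes are always a two-digit slot followed by a two-digit port on an ethernet interface, as in the 0102 and 0104 examples; other interface types are not handled. The function and type names are hypothetical.

```python
import re
from collections import defaultdict
from typing import NamedTuple

# Layout from this article: tl_<local if>_<peer serial>_<remote if>.
TUNNEL_NAME = re.compile(r"^tl_(\d{4})_([^_]+)_(\d{4})$")


class TunnelName(NamedTuple):
    local_interface: str
    peer_serial: str
    remote_interface: str


def decode_interface(code: str) -> str:
    """Turn '0102' into 'ethernet1/2' (assumed: slot 01, port 02)."""
    return f"ethernet{int(code[:2])}/{int(code[2:])}"


def parse_tunnel_name(name: str) -> TunnelName:
    match = TUNNEL_NAME.match(name)
    if match is None:
        raise ValueError(f"not an SD-WAN tunnel name in the expected format: {name!r}")
    local_code, serial, remote_code = match.groups()
    return TunnelName(decode_interface(local_code), serial, decode_interface(remote_code))


def group_by_local_interface(tunnel_names: list[str]) -> dict[str, list[str]]:
    """Group tunnel names by the local physical interface they ride on."""
    groups = defaultdict(list)
    for name in tunnel_names:
        groups[parse_tunnel_name(name).local_interface].append(name)
    return dict(groups)


if __name__ == "__main__":
    print(parse_tunnel_name("tl_0102_007951000299644_0104"))
```

Feeding the names of the down tunnels into group_by_local_interface shows at a glance whether the failures cluster on one physical interface, which points back toward the bandwidth checks in step 2.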
 


    https://knowledgebase.paloaltonetworks.com/KCSArticleDetail?id=kA14u000000kGOvCAM&refURL=http%3A%2F%2Fknowledgebase.paloaltonetworks.com%2FKCSArticleDetail
