How to Troubleshoot SD-WAN Latency, Jitter, and Packet Loss
Objective
The PAN-OS firewall SD-WAN feature uses SD-WAN Probes to monitor SD-WAN Virtual Interfaces (which contain physical ISP interfaces, VPN Tunnel interfaces, or both). If the probes experience Latency (ms), Jitter (ms), or Packet Loss (%) above the Thresholds configured in the Path Quality Profile, the affected metric is marked as unhealthy, and as a result, traffic may be moved to a different path.
This document gives you the network troubleshooting steps to narrow down, verify, and resolve the root cause of this latency, jitter, or packet loss being detected in your network over the affected link(s).
Tip: The SD-WAN feature measures the Latency, Jitter, and Packet Loss of the link using its ICMP probe packets, not the actual traffic packets.
Keep in mind that the firewall does not check whether the actual traffic is experiencing Latency, Jitter, or Packet Loss. Instead, it checks whether the SD-WAN probe packets are experiencing them over that link (the same link the actual traffic takes), and then compares those values against the thresholds configured in the Path Quality Profile for that application.
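The comparison described above can be illustrated with a minimal Python sketch. It is not the firewall's implementation: the class names, the function name, and the threshold values are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class PathQualityThresholds:
    """Hypothetical stand-in for the thresholds in a Path Quality Profile."""
    latency_ms: float
    jitter_ms: float
    packet_loss_pct: float


@dataclass
class ProbeMetrics:
    """Values measured from the SD-WAN probe packets, not from real traffic."""
    latency_ms: float
    jitter_ms: float
    packet_loss_pct: float


def unhealthy_metrics(probe, thresholds):
    """Return the names of the metrics whose probe value is above its threshold."""
    names = ("latency_ms", "jitter_ms", "packet_loss_pct")
    return [n for n in names if getattr(probe, n) > getattr(thresholds, n)]


# Illustrative values only; use the thresholds from your own profile.
profile = PathQualityThresholds(latency_ms=100, jitter_ms=30, packet_loss_pct=1)
probe = ProbeMetrics(latency_ms=16, jitter_ms=45, packet_loss_pct=0)
print(unhealthy_metrics(probe, profile))  # ['jitter_ms']
```

In this example only Jitter is above its threshold, so only that metric would be considered unhealthy for the link.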
Environment
- PAN-OS
- SD-WAN
Procedure
- Use the Web UI and CLI to investigate which SD-WAN link is experiencing Latency, Jitter, or Packet Loss, and identify which Application traversing that link is exceeding the configured Path Quality Profile threshold(s) as a result
Panorama > Objects > SD-WAN Link Management > Path Quality Profile
CLI Commands
> show sdwan path-monitor stats vif sdwan.1
===slot1 dp0 health_ver:(High sensitive) 15 (Medium sensitive) 11 (Low sensitive) 2 ===
----------------------------------------------------------------
ethernet1/1 idx: 16 Probing: Enabled Monitor-mode: Aggressive
----------------------------------------------------------------
probe-req-send:30920 State: up
probe-reply-recv:30919
packet loss :  real-time  crt-use  change
per 1000 pkt:  0          0        0
                latency  jitter  pkt_loss  health_ver
3000ms average
  real time:    16       1       0
  current use:  0        0       0         2
10000ms average
  real time:    16       1       0
  current use:  0        1       0         8
25000ms average
  real time:    16       1       0
  current use:  0        0       0         2
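If you collect this output from several firewalls or at regular intervals, a small script can pull out the real-time readings for comparison against your thresholds. The sketch below is Python and assumes the output has the same layout as the sample above; the function name `parse_path_monitor` is hypothetical, and the layout may differ between PAN-OS versions.

```python
import re


def parse_path_monitor(text):
    """Map each interface to its real-time readings per averaging window (ms)."""
    results = {}
    iface = window = None
    for raw in text.splitlines():
        line = raw.strip()
        # Interface header, e.g. "ethernet1/1 idx: 16 Probing: Enabled ..."
        m = re.match(r"(\S+)\s+idx:\s*\d+", line)
        if m:
            iface = m.group(1)
            results[iface] = {}
            continue
        # Averaging window, e.g. "3000ms average"
        m = re.match(r"(\d+)ms average", line)
        if m:
            window = int(m.group(1))
            continue
        # Real-time row: latency, jitter, pkt_loss (same order as the header)
        m = re.match(r"real time:\s*(\d+)\s+(\d+)\s+(\d+)", line)
        if m and iface is not None and window is not None:
            latency, jitter, loss = (int(v) for v in m.groups())
            results[iface][window] = {
                "latency": latency, "jitter": jitter, "pkt_loss": loss,
            }
    return results


sample = """\
ethernet1/1 idx: 16 Probing: Enabled Monitor-mode: Aggressive
latency jitter pkt_loss health_ver
3000ms average
real time: 16 1 0
current use: 0 0 0 2
10000ms average
real time: 16 1 0
current use: 0 1 0 8
"""
stats = parse_path_monitor(sample)
print(stats["ethernet1/1"][3000])
```

The parser deliberately ignores the "current use" rows and keeps only the "real time" rows, since those are the freshest readings for each window.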
- Identify the path (and ISP) that the traffic experiencing the Latency, Jitter, or Packet Loss is taking to reach its destination
- Call your ISP and have them resolve the Latency, Jitter, or Packet Loss
If there are other devices (routers, switches, etc.) in the path of these SD-WAN Probes aside from the ISP, proceed with the steps below.
- Identify which device or interface in the path of the traffic (or whether the ISP) is causing the latency
- Review any third-party traffic or network monitoring tools to identify the point at which the latency is occurring in the path of that traffic flow
- Identify if there have been any configuration changes or new devices introduced into your network in the path of this traffic that may have caused this latency, jitter, or packet loss
- Log in to and check the devices in the path of the traffic to see if they show any sign of issues processing traffic. Start with the device(s) you suspect are most likely to cause this type of slowness, or with any new devices introduced since the traffic was last working as expected. Tip: Use that vendor's built-in traffic diagnostic tools (packet-trace, packet capture, performance logs, traffic logs, etc.) to diagnose why that traffic flow (by Source IP and Destination IP) is traversing that device slowly
- Take Packet Captures at various points in the network along the path of this traffic flow. Compare the timestamps in the captures side-by-side to identify at which capture point (i.e. device) the packet takes longer to arrive, ingress, or egress. Repeat this until you have narrowed the latency down to a single device or link, then troubleshoot, resolve, or make the needed configuration changes on that device.
- Check with the ISP and ask them to provide proof/evidence that there is no latency, packet loss, or jitter on the path that traffic takes
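The timestamp comparison in the packet capture step can be sketched in Python as follows. This assumes you have already extracted, for one and the same packet, the time it was seen at each capture point, and that the clocks of all capture devices are synchronized (for example via NTP). The capture point names and timestamps are hypothetical.

```python
def segment_delays(observations):
    """Given an ordered list of (capture_point, timestamp_seconds) for one
    packet, return (from_point, to_point, delay_ms) for each segment."""
    delays = []
    for (a, t_a), (b, t_b) in zip(observations, observations[1:]):
        delays.append((a, b, round((t_b - t_a) * 1000, 3)))
    return delays


def slowest_segment(observations):
    """Return the segment that added the most delay."""
    return max(segment_delays(observations), key=lambda d: d[2])


# Hypothetical timestamps for a single packet along its path.
obs = [
    ("branch-fw egress", 10.000),
    ("core-switch", 10.002),
    ("edge-router", 10.047),
    ("hub-fw ingress", 10.052),
]
for src, dst, ms in segment_delays(obs):
    print(f"{src} -> {dst}: {ms} ms")
print("slowest:", slowest_segment(obs))
```

Here most of the delay is added between the core switch and the edge router, so that segment is where to look next. If the clocks are not synchronized, the per-segment numbers are not meaningful.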
- Identify and reduce or eliminate any heavy load, utilization, or congestion on any devices or links in the path
Common culprits include:
- Any device under a DDoS attack/traffic flood of any kind
- Any device which redirects or proxies that traffic flow
- Any device (such as a decryption device) that inspects that traffic more heavily than necessary
- Any device experiencing high resource issues such as high CPU, memory, buffers, etc.
- Configure QoS on whichever device(s) necessary to prioritize that traffic's packets along the traffic path
- Optimize routing and the path
Common causes of a sub-optimal path include:
- Firewalls doing heavy inspections
- Decryption devices
- Proxy devices
- Unnecessary / sub-optimal VPN tunnel routing
- (Optional) Lower application settings or use a lighter, faster protocol / technology to transmit that traffic
- Limiting video quality (for example, from 4K to 1080p)
- Lowering the audio codec quality (for example, from G.722 to G.711 or G.729)
- Assess if there is a lighter version or implementation of the protocol/application you are using that has lower bandwidth requirements if needed
- Create a Path Quality Profile with less strict requirements for Packet Loss, Latency, or Jitter
If you or your ISP are unable to get the application/path to perform within the threshold levels specified in its current Path Quality Profile, you may need to relax the Packet Loss, Latency, or Jitter thresholds in the Path Quality Profile, that is, set them to higher (more tolerant) values.
- If you are seeing the "NGFW SD-WAN Link Performance" Alert in Strata Cloud Manager, the conditions for that alert are:
- SD-WAN Link Jitter
- Critical alert will be triggered if the jitter value is greater than 30 for 50% of datapoints, persisting for at least 2 hours
- Warning alert will be triggered if the jitter value is greater than 2 for 50% of datapoints, persisting for at least 2 hours
- Alert will be cleared if all the datapoints are below the warning threshold for at least 2 hours
- SD-WAN Link Packet Loss
- Critical alert will be triggered if the loss value is greater than 9 for 50% of datapoints, persisting for at least 2 hours
- Warning alert will be triggered if the loss value is greater than 1 for 50% of datapoints, persisting for at least 2 hours
- Alert will be cleared if all the datapoints are below the warning threshold for at least 2 hours
- SD-WAN Link Latency
- Critical alert will be triggered if the latency value is greater than 300 for 75% of datapoints, persisting for at least 80 mins
- Warning alert will be triggered if the latency value is greater than 20 for 75% of datapoints, persisting for at least 80 mins
- Alert will be cleared if all the datapoints are below the warning threshold for at least 2 hours
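To get a feel for how the alert conditions above behave, the Jitter rule can be sketched in Python. This is only an illustration, not how Strata Cloud Manager evaluates alerts: it assumes that "for 50% of datapoints" means at least half of the datapoints in the evaluation window exceed the value, that the list passed in already covers the required 2-hour window, and the function names are hypothetical.

```python
def exceeds(datapoints, threshold, fraction):
    """True if at least `fraction` of the datapoints are greater than threshold."""
    if not datapoints:
        return False
    over = sum(1 for v in datapoints if v > threshold)
    return over / len(datapoints) >= fraction


def jitter_alert_level(datapoints):
    """Classify a window of jitter datapoints using the values listed above.

    Assumes `datapoints` spans the full 2-hour persistence window.
    Returns "none" when neither trigger condition is met.
    """
    if exceeds(datapoints, 30, 0.5):
        return "critical"
    if exceeds(datapoints, 2, 0.5):
        return "warning"
    return "none"


print(jitter_alert_level([35, 40, 1, 1]))  # critical
print(jitter_alert_level([3, 4, 1, 1]))    # warning
print(jitter_alert_level([1, 1, 1, 3]))    # none
```

The same helper can be reused for Packet Loss and Latency by passing the corresponding values and percentage from the list above.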