High on-chip descriptor and packet buffer usage due to policy deny resulting in traffic latency and drops

High on-chip descriptor and packet buffer usage due to policy deny resulting in traffic latency and drops

108159
Created On 11/23/20 10:43 AM - Last Modified 06/24/22 21:22 PM


Symptom


  • Latency/slowness observed, in some cases packet loss for traffic traversing through the Firewall.
  • Usually the DataPlane (DP) CPU usage remains within the expected range and no huge deviation from the usual or "working condition".
  • Spike in flow_policy_deny global counters.
  • The top session in show running resource-monitor ingress-backlogs has grp id 2 or grp id 16
Note : Group IDs could be other than 16. Refer additional information section for more details. 
Example:
> show running resource-monitor ingress-backlogs 
<snip>
-- SLOT: s1, DP: dp0 --
USAGE - ATOMIC: 88% TOTAL: 89%

TOP SESSIONS:
SESS-ID         PCT     GRP-ID  COUNT
2022536315      88%     16      3640

<snip>
 
  • show session id shows "Bad Key", and its consumption is taking a majority % of on-chip.
Example:
> show session id 2022536315

Session      2022536315

            Bad Key: c2s: 'c2s'
            Bad Key: s2c: 's2c'
        index(local):                        : 9270395

​​​​
  • Spike in on-chip packet descriptor and packet buffer usage, Example (from a live debug output):
> show running resource-monitor second last 5

DP s1dp0:

Resource monitoring sampling data (per second):

CPU load sampling by group:
flow_lookup                    :     3%
flow_fastpath                  :     3%
flow_slowpath                  :     3%
flow_forwarding                :     3%
flow_mgmt                      :     0%
flow_ctrl                      :     3%
nac_result                     :     3%
flow_np                        :     3%
dfa_result                     :     3%
module_internal                :     3%
aho_result                     :     3%
zip_result                     :     3%
pktlog_forwarding              :     9%
lwm                            :     0%
flow_host                      :     3%

CPU load (%) during last 5 seconds:
core   0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
       *   0   3   2   2   3   3   3   4   4   2   3   2   3   3   2
       *   0   3   2   2   3   3   3   3   4   2   3   2   3   3   2
       *   0   3   3   2   4   4   3   3   3   2   2   2   3   3   3
       *   0   3   2   2   3   4   3   3   3   2   3   2   3   4   2
       *   0   3   3   2   3   3   3   3   3   2   2   2   3   4   2
core  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31
       2   2   2   2   2   3   3   3   3   3   3   3   3   3   3   3
       2   2   2   2   2   3   3   3   3   3   3   3   3   3   3   3
       2   3   2   2   2   3   3   3   2   4   3   3   2   3   3   2
       2   3   2   2   2   3   2   3   2   3   3   3   2   3   3   3
       2   2   2   2   2   2   2   3   3   3   3   3   3   3   3   3
core  32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47
       3   2   2   2   2   2   3   2   2   2   2   2   3   2   4   9
       3   2   2   2   2   2   2   2   2   2   2   2   3   2   4   9
       3   2   2   3   2   2   2   2   2   2   2   2   3   2   4   9
       3   2   2   2   3   2   2   2   2   2   2   2   3   2   4   9
       3   2   3   2   2   2   3   2   2   2   2   2   3   2   4   9

Resource utilization (%) during last 5 seconds:
session:
  0   0   0   0   0 
packet buffer:
 14  13  13  13  12 
packet descriptor:
  0   0   0   0   0 
packet descriptor (on-chip):
100 100 100 100 100
<snip>

 


Environment


  • PAN-OS
  • Zone Protection
  • DOS Protection


Cause


NOTE:
High on-chip descriptor and packet buffer usage can occur under various conditions.
This article covers a specific case   of slowpath policy deny, that results in high on-chip descriptors and packet buffers.


 
Please note that this article applies only if the following conditions are met:

1. The top session in show running resource-monitor ingress-backlogs has grp id 2 or grp id 16
More information about grp id 2 and grp id 16 in "Additional Information" section

Example:
> show running resource-monitor ingress-backlogs 
<snip>
-- SLOT: s1, DP: dp0 --
USAGE - ATOMIC: 88% TOTAL: 89%

TOP SESSIONS:
SESS-ID         PCT     GRP-ID  COUNT
2022536315      88%     16      3640

<snip>

2.  There is Spike in flow_policy_deny global counters.
3. show session id <id> of the session obtained from ingress-backlogs, shows "Bad Key"
Example:
> show session id 2022536315

Session      2022536315

            Bad Key: c2s: 'c2s'
            Bad Key: s2c: 's2c'
        index(local):                        : 9270395
Note: In most cases, it is same 6 tuples UDP syslog traffic, that causes the issue.
UDP syslog traffic is usually high volume and uses the same source-port (socket).


4. The denied traffic has the same 6 tuples (source/destination IP, source/destination port, protocol (L3 header), ingress zone) ​​.

When/Why does policy deny cause spike/high on-chip descriptor usage?
  • If an incoming packet does not match an existing session it is subjected to slowpath.
  • If the packet matches a deny policy in slowpath (with session logging enabled), the packet is dropped and a traffic log entry is created, but a session is not installed.
  • The next packet with the same 6 tuples would go through the same path as the previous packet.
  • slowpath, as the name suggests, can take a greater number of processing cycles, because it is in this step that all the tasks associated with establishing a session are done. 
  • Time taken to complete the slowpath depends on Firewall configuration and traffic pattern.
  • For instance, if there are a large number of security or nat policies, the time taken to complete slowpath would be higher.
  • All the packets with the same 6 tuples are subjected to ATOMIC or serial packet processing in the incoming order (one-at-a-time).
  • As these packets have to be serially processed, they can not be sent to different cores or threads for parallel processing.
  • As the packets are waiting to be processed by the DP CPU, depending on the incoming rate of packets and time taken to complete slowpath for each packet, there can be accumulation of packets.
  • If there is a significant rate/amount of such same 6 tuples traffic hitting the Firewall, getting denied on slowpath, the on-chip descriptors and eventually the packet buffers can get filled up.
  • At this point you would start to see traffic issues.
For more information refer to the "Additional Information" section.


Resolution


NOTE
  • Once you have determined that the policy deny is significantly high or is higher than usual, the next step is to Identify the source of this Traffic.
  • If "Log at session-end" is enabled on the policy denying the traffic, the "offending" traffic can be found by filtering the Traffic logs for policy-deny.
Note: In most cases, it is same 6 tuples UDP syslog traffic, that causes the issue.
UDP syslog traffic is usually high volume and uses the same source-port (socket).

Mitigation

1.
Once the source of the denied traffic is identified, check if it is feasible to stop this traffic at source or closer to source.
Example: If there is a device that is flooding syslog messages to a particular destination, you can remove the syslog server destination from that device to stop the flood.

2. Allowing traffic with security policies 
  • Before moving into mitigation techniques, we should make sure the traffic that is supposed to be allowed is indeed allowed by security policies.
  • If the traffic needs to be allowed, create the required security policy. Once the traffic is allowed, a session would be installed and the traffic is not subjected to slowpath.
  • If traffic pattern in the network is not known, a security policy can be created to allow high volume traffic like syslog from internal/trusted zones(apply security profiles as required).
3. Protection against known offenders (DoS policy)
  • If exact IPs of the hosts causing the issue are known, creating a DoS policy with the action “Deny” will help.
  • DoS policy rules are specific (source/destination zone, IPs, service port), and they can replace similar security policy with action deny.
  • Since DoS policies are evaluated before security policy lookup and do not have large number of entries, packets are blocked earlier thereby saving firewall resources. 
4. Protection against unknown offenders (DoS policy)
  • Creating a DoS classified policy with the action “Protect” 
  • A classified DoS policy can be applied with action “Protect” and address matched to “source-ip only” or “src-dest-ip-both”.
User-added image
***Note:  Configured threshold values are only examples, these need to be tweaked based on customer environment

User-added image
  • Once the configured threshold is reached, the DoS policy would create a DoS ip-block-table, which would start dropping packets without being subjected to slowpath.
  • In devices that have an offload processor, the block table would be installed in the offload hardware to further reduce the load on the DP CPU.
  • For further reading refer : Monitor Block List and DoS Protection Profiles and Policy Rules
 Caveat:
  • Thresholds for the classified DoS object would change based on customer traffic pattern and network, the default values may not be applicable to all environments.
  • For slowpath deny attacks only  “source-ip only” or “src-dest-ip-both” would work, using "destination-ip only" does not help.
  • For internet facing zones, since source ips could be potentially huge, the firewall doesn’t have the capacity to store counters for every possible IP address on the internet.
  • Refer: Classified vs Aggregate DoS profiles
5.  Packet Buffer Protection (PBP) 
  • Packet Buffer Protection (PBP) is a feature available starting with PAN-OS 8.0.
  • PBP is preferred, as it is automatic and is triggered based on actual resource utilization, when compared to DoS policy which is triggered on pre-configured connections per second threshold
  • PBP protects the firewall both from slowpath and fastpath (existing session) buffer depletion.
  • Firewall monitors buffer abusers automatically.
  • After reaching the configured activate threshold (default 50%), the firewall starts dropping offending traffic (RED).
  • If buffer utilization is above 80% ( this threshold is internally hardcoded and not configurable) for a duration of block hold time a dos block table entry is created.
  • Refer: Packet Buffer Protection
In this specific case of slowpath deny usually a combination of PBP + DoS classified policy with the action “Protect” provides better results.

Monitoring:

SNMP can be leveraged to monitor buffer utilization among other things. DP resources are part of  HOST-RESOURCES-MIB. More information can be found here: 
SNMP for Monitoring Palo Alto Networks Devices
snmp-mibs

List of useful OIDs:
1. Description - .1.3.6.1.2.1.25.2.3.1.3.xxxx
Example:
.1.3.6.1.2.1.25.2.3.1.3.1011 = STRING: "Slot-1 Data Processor-0 Hardware Packet Buffers"
.1.3.6.1.2.1.25.2.3.1.3.1111 = STRING: "Slot-1 Data Processor-1 Hardware Packet Buffers"

2.  Hardware Packet Buffer pool size - .1.3.6.1.2.1.25.2.3.1.5.xxxx
Example:
.1.3.6.1.2.1.25.2.3.1.5.1011 = INTEGER: 17203 
.1.3.6.1.2.1.25.2.3.1.5.1111 = INTEGER: 17203

3. Current Buffer Utilization - .1.3.6.1.2.1.25.2.3.1.6.xxxx
Example:
.1.3.6.1.2.1.25.2.3.1.6.1011 = INTEGER: 122 
.1.3.6.1.2.1.25.2.3.1.6.1111 = INTEGER: 128 
 

DoS related counters via SNMP (part of PAN-COMMON-MIB): 
 

MIB IdentityCounter DescriptionOID
panFlowPolicyDenyflow_policy_denySession setup: denied by policy.1.3.6.1.4.1.25461.2.1.2.1.19.8.10
panFlowDosBlkNumEntriesflow_dos_blk_num_entriesNumber of entries in DOS block table.1.3.6.1.4.1.25461.2.1.2.1.19.8.2
 
panFlowDosBlkSwEntriesflow_dos_blk_sw_entriesNumber of entries in DOS Software block table.1.3.6.1.4.1.25461.2.1.2.1.19.8.33
panFlowDosBlkHwEntries flow_dos_blk_hw_entries Number of entries in DOS Hardware block table.1.3.6.1.4.1.25461.2.1.2.1.19.8.34
panFlowDosDropIpBlocked flow_dos_drop_ip_blockedPackets dropped: Flagged for blocking and under block duration by  DoS or other modules.1.3.6.1.4.1.25461.2.1.2.1.19.8.13
panFlowDosRuleDrop flow_dos_rule_drop Packets dropped: Rate limited or IP blocked.1.3.6.1.4.1.25461.2.1.2.1.19.8.23


 


Additional Information




Why do we see 'Bad Key' when running show session id <id> of the session obtained from ingress-backlogs?
Since at this stage the session is not yet installed, the value under SESS-ID is an internal tag value and not the actual session id.
In this specifc case of slowpath deny, if you carefully look at the valued under SESS-ID, the value is much higher than the session index range. (Which by itself is a clue. that it is nota session id)
Hence when you use the show session id <idx> since the value is not an actual session id you see 'Bad Key'.

Why do we need ATOMIC or serial packet processing (on-at-a-time)?

  • Each packet processing stage has different requirements in terms of whether it should be subjected to serial packet processing (one-at-a-time in the incoming order) or parallel packet processing for a given 6 tuple/session.
  • The goal of serial packet processing is to avoid multiple cores/threads performing the same operation or accessing/writing on the same data. 
  • For instance, if we receive a burst of new traffic that should belong to the same session, we don’t want multiple cores/threads to perform session installation, hence we don't process the packets in parallel but subject them to serial processing.
Why is the DP CPU low, but the on-chip descriptor and packet buffer usage is high?
  • As explained earlier, since the same 6 tuples deny traffic is subjected to serial processing, it is possible that most of the cores are just waiting for work/packets to be processed.
  • Hence the CPU usage would be low, but the packet buffer and on-chip descriptor resources could fill up.
When would show running resource-monitor ingress-backlogs display grp id 2 and when would it display grp id 16?
  • Each group viz. flow_fastpath, flow_slowpath etc, is assigned a group id (grp id)
  • grp id 2 and 16 both stand for slowpath/ flow_slowpath
  • On older hardware models (Ex: PA-3000, PA-5000 etc) and in case of PA-7000 series older NPC models you would see grp id 2
  • On newer hardware models (Ex: PA-5200, PA-3200)  and in case of PA-7000 series newer NPC models you would see grp id 16 or grp id 17 or grp id 18 etc.
  • Please note that the grp id values can also be 17 or 18, but still stand for Slowpath.
  • This is because on Newer Platforms to get the actual grp id a right shift of 3 bits is needed. 
  • For values like 16,17,18 etc, they all will result in the value of 2 after right shift of 3 bits is done. 
  • So any grp id that results in the value of 2 after a 3 Bit right shift will represent slowpath/flow_slowpath.
 
 


Actions
  • Print
  • Copy Link

    https://knowledgebase.paloaltonetworks.com/KCSArticleDetail?id=kA14u000000HBjNCAW&lang=en_US%E2%80%A9&refURL=http%3A%2F%2Fknowledgebase.paloaltonetworks.com%2FKCSArticleDetail

Choose Language