How to Troubleshoot High Dataplane CPU

How to Troubleshoot High Dataplane CPU

250041
Created On 09/25/18 18:55 PM - Last Modified 07/23/20 23:10 PM


Symptom

A number of factors can cause the dataplane's CPUs to spike or continuously run high: a sudden increase due to the implementation of a new service or resource, or a buildup over time due to added connected networks, segments and hosts. These factors and more can cause increases in packet rate, packet buffer utilization, or a huge amount of new connections per second that lead to the dataplane CPU reaching a critical level.

Learn to troubleshoot high dataplane CPU. This article shows a couple of ways to verify the actual load of the dataplane and help determine the potential impact on your network.

Getting an understanding of the impact of the high dataplane CPU is important: high dataplane CPU usage may not present an immediate cause for concern, but being aware of what is causing high CPU and possible consequences will help you respond appropriately. 

Several factors that need to be identified before taking action:

  • Is some, all, or no traffic through the firewall affected?
  • Are users experiencing high latency, and if so, does latency affect all traffic or just certain applications?
  • If issues are being reported, are these reports coinciding with specific peak times of the day, at regular intervals, or at totally random moments?


The first step is to isolate where the performance issue is occurring:

  • Data Plane (DP) CPU
  • Packet Buffers
  • Session
  • Management Plane (MP)


Resolution
How to identify high dataplane CPU

When a customer reports a performance issue, generate a tech support file while the issue is occurring. Look for the "---panio" string in the dp-monitor log (this information is logged every 10 minutes) or run the show running resource-monitor command from the CLI to view DP resource usage.

show running resource-monitor


This command can be used to review dataplane CPU usage. Add a time operator to reflect a timeframe you would like to review.

  • 'second'  shows the last 60 seconds of CPU usage in per second increments
  • 'minute' shows  the last 60 minutes in minute increments and so on
  • If no time operator is used, all views will be listed in one long output

> show running resource-monitor
  > day Per-day monitoring statistics
  > hour Per-hour monitoring statistics
  > ingress-backlogs show statistics
  > minute Per-minute monitoring statistics
  > second Per-second monitoring statistics
  > week Per-week monitoring statistics


The actual output will look similar to this but may differ slightly depending on the amount of dataplanes and cores your system has.

> show running resource-monitor 

DP dp0:

Resource monitoring sampling data (per second):

This first section shows CPU utilization by processes running on the dataplane

CPU load sampling by group:
flow_lookup : 0%
flow_fastpath : 0%
flow_slowpath : 0%
flow_forwarding : 0%
flow_mgmt : 0%
flow_ctrl : 1%
nac_result : 0%
flow_np : 0%
dfa_result : 0%
module_internal : 0%
aho_result : 0%
zip_result : 0%
pktlog_forwarding : 0%
lwm : 0%
flow_host : 1%

This part of the output shows utilization of individual cores on the dataplane.
Many cores regularly surpassing 75% warrants an investigation.


CPU load (%) during last 60 seconds:
core  0  1  2  3  4  5  6  7  8  9  10  11
      0  39 47 38 46 73 81 73 81 82 88  83
      0  39 46 38 45 73 82 73 82 82 89  82
      0  39 46 39 45 73 81 73 81 83 88  83
      0  38 52 37 50 71 81 71 80 82 98  82
      0  40 46 39 45 74 82 74 82 83 89  83
      0  40 50 39 49 74 85 74 84 83 90  83
      0  39 46 38 45 72 81 72 81 81 88  81
      0  42 52 41 50 75 86 75 86 84 91  84
      0  42 76 79 85 75 10075 100100100 85  >>> indicates actual high CPU
      0  44 51 43 50 78 85 78 86 86 91  86
      0  42 51 41 50 76 86 76 86 85 92  85
      0  43 53 43 52 77 89 77 89 86 93  86
      0  38 46 37 45 71 81 72 81 81 88  81
      0  42 50 41 50 76 85 76 85 85 90  85
... cut for brevity

                  

This section shows session table utilization, which is the maximum number of active sessions supported by the platform. Any value above 80% warrants investigation.

Resource utilization (%) during last 60 seconds:
session:
  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0


The next 3 sections show packet buffer utilization.  Packet buffers are used to ensure no packets are lost while a previous packet is still being processed by a core or process. Any value above 80% needs to be investigated.

packet buffer:
  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
 
packet descriptor:
  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
 

packet descriptor (on-chip):
  2   2   2   2   2   2   2   2   2   2   2   2   2   2   2
  2   2   2   2   2   2   2   2   2   2   2   2   2   2   2
  2   2   2   2   2   2   2   2   2   2   2   2   2   2   2
  2   2   2   2   2   2   2   2   2   2   2   2   2   2   2


Things to check in the output:

  1. Check the CPU load during the last 60 seconds. If any number is at or close to 100, then high CPU is likely the cause of the performance issue.
  2. Check the "packet buffer" and "packet descriptor" sections. If any number is at or close to 100, then the issue is likely caused by running out of packet buffers.
  3. Check the session section. If any number is close to or above 80, then the performance issue is most likely session related.


Several commands are available that can help identify why certain measurements are above the threshold in the resource monitor:

show session info


Outputs all data relevant to how many sessions are supported, how many are currently active and will inform you about the packet rate, new connection establish rate and throughput. If any of these parameters are exceptionally high, they can cause the dataplane CPU to rise.

> show session info

target-dp: *.dp0
--------------------------------------------------------------------------------
Number of sessions supported: 262142
Number of active sessions: 3   < If this figure rises to the level of the supported sessions,
Number of active TCP sessions: 0       the platform is reaching maximum capacity.
Number of active UDP sessions: 1
Number of active ICMP sessions: 0
Number of active BCAST sessions: 0
Number of active MCAST sessions: 0
Number of active predict sessions: 0
Session table utilization: 0%
Number of sessions created since bootup: 332481
Packet rate: 28/s  An unusual and sudden high packet rate could indicate a DoS attack.
Throughput: 13 kbps
New connection establish rate: 0 cps
--------------------------------------------------------------------------------
Session timeout 
If timeouts values are too aggressive or too relaxed, the system could run out of resources.
TCP default timeout: 3600 secs
TCP session timeout before SYN-ACK received: 5 secs
TCP session timeout before 3-way handshaking: 10 secs
TCP half-closed session timeout: 120 secs
TCP session timeout in TIME_WAIT: 15 secs
TCP session timeout for unverified RST: 30 secs
UDP default timeout: 30 secs
ICMP default timeout: 6 secs
other IP default timeout: 30 secs
Captive Portal session timeout: 30 secs
Session timeout in discard state:
TCP: 90 secs, UDP: 60 secs, other IP protocols: 60 secs
--------------------------------------------------------------------------------
Session accelerated aging: True
If accelerated aging has been turned off, some sessions may remain active in the session table
longer than needed and take up resources.

Accelerated aging threshold: 80% of utilization
Scaling factor: 2 X
--------------------------------------------------------------------------------
Session setup
If any of these have been turned off for troubleshooting and not turned back on, they can
consume additional resources to inspect malformed packets that would otherwise
be discarded.

TCP - reject non-SYN first packet: True
Hardware session offloading: True
IPv6 firewalling: True
Strict TCP/IP checksum: True
ICMP Unreachable Packet Rate: 200 pps
--------------------------------------------------------------------------------
Application trickling scan parameters:
Timeout to determine application trickling: 10 secs
Resource utilization threshold to start scan: 80%
Scan scaling factor over regular aging: 8
--------------------------------------------------------------------------------
Session behavior when resource limit is reached: drop
This indicates the behavior of the firewall when it has reached its extreme limit of
resource depletion.

--------------------------------------------------------------------------------
Pcap token bucket rate : 10485760
--------------------------------------------------------------------------------
Max pending queued mcast packets per session : 0
--------------------------------------------------------------------------------

 

debug dataplane pool statistics

Returns the status of all the buffers being used by the system and their status.

  • The number on the left indicates how much buffer is still available
  • The number on the right indicates the total size
  • If the number on the left drops to 0, the buffer is depleted

> debug dataplane pool statistics

Hardware Pools
[ 0] Packet Buffers : 57223/57344 0x8000000030c00000
[ 1] Work Queue Entries : 229315/229376 0x8000000037c00000
[ 2] Output Buffers : 1007/1024 0x800000000fc00000
[ 3] DFA Result : 3995/4000 0x8000000039800000
[ 4] Timer Buffers : 4096/4096 0x8000000039be8000
[ 5] PAN_FPA_LWM_POOL : 1024/1024 0x800000000fd00000
[ 6] ZIP Commands : 1023/1024 0x800000000fd40000
[ 7] PAN_FPA_BLAST_PO : 1024/1024 0x800000000ff40000

Software Pools
[ 0] software packet buffer 0 ( 512): 32767/32768 0x8000000043b2a680
[ 1] software packet buffer 1 ( 1024): 32768/32768 0x8000000044b4a780
[ 2] software packet buffer 2 ( 2048): 81920/81920 0x8000000046b6a880
[ 3] software packet buffer 3 (33280): 20480/20480 0x8000000050bba980
[ 4] software packet buffer 4 (66048): 304/304 0x80000000795cea80
[ 5] Shared Pool 24 ( 24): 560000/560000 0x800000007a8f6780
[ 6] Shared Pool 32 ( 32): 530000/530000 0x800000007b7eaa80
[ 7] Shared Pool 40 ( 40): 70000/70000 0x800000007ca1cf00
[ 8] Shared Pool 192 ( 192): 1269999/1270000 0x800000007cd0cf80
[ 9] Shared Pool 256 ( 256): 140000/140000 0x800000008ba70880
[10] ZIP Results ( 184): 1024/1024 0x80000000a1827300
[11] CTD AV Block ( 1024): 32/32 0x80000000be43d380
[12] Regex Results (11544): 8000/8000 0x80000000be466100
[13] SSH Handshake State ( 6512): 64/64 0x80000000c580c680
[14] SSH State ( 3248): 512/512 0x80000000c5872480
[15] TCP host connections ( 176): 15/16 0x80000000c5a08e80

 

show counter global filter delta yes

Returns, from the second time the command is issued onward, a snapshot of the global counters triggered during the timeframe between the current and previous iteration, and could expose unusual numbers of discarded packets.


 

> show counter global filter delta yes

Global counters:
Elapsed time since last sampling: 2.981 seconds

name                value  rate severity category aspect    description
--------------------------------------------------------------------------------
pkt_recv               89    29 info     packet   pktproc   Packets received
pkt_recv_zero          87    29 info     packet   pktproc   Packets received from QoS 0
pkt_sent                3     1 info     packet   pktproc   Packets transmitted
session_allocated       1     0 info     session  resource  Sessions allocated
session_freed           1     0 info     session  resource  Sessions freed
session_installed       1     0 info     session  resource  Sessions installed
flow_arp_pkt_rcv       83    27 info     flow     arp       ARP packets received
flow_host_pkt_rcv       2     0 info     flow     mgmt      Packets received from control plane
flow_host_pkt_xmt       2     0 info     flow     mgmt      Packets transmitted to control plane
flow_host_service_allow 2     0 info     flow     mgmt      Device management session allowed
appid_ident_by_icmp     1     0 info     appid    pktproc   Application identified by icmp type
dfa_sw                  2     0 info     dfa      pktproc   The total number of dfa match using software
aho_sw                  1     0 info     aho      pktproc   The total usage of software for AHO
ctd_pkt_slowpath        2     0 info     ctd      pktproc   Packets processed by slowpath
pkt_flow_np            87    29 info     packet   resource  Packets entered module flow stage np
pkt_flow_host           2     0 info     packet   resource  Packets entered module flow stage host
--------------------------------------------------------------------------------
Total counters shown: 16
--------------------------------------------------------------------------------


Some common counters explained:

CounterDescription
pkt_recvpacket received
session_allocatedsession allocate
flow_ipfrag_recvipfrag receive rate
dfa_swdfa matching by sw
dfa_fpgadfa matching by fpga
ctd_pkt_slowpathpkt processed by ctd slowpath
aho_swaho by sw
aho_fpgaaho by fpga
log_traffic_cnttraffic log count
flow_fwd_l3_ttl_zeropackets dropped: IP TTL
fpga_pktpkt hold by fpga request.
Note: this is not a cumulative counter but represents how many packets are being currently held waiting on FPGA.


    show system statistics session

    Displays aggregated packet rate and throughput values of all dataplanes in a live dashboard fashion, which refreshes automatically


    System Statistics: ('q' to quit, 'h' for help) 

    Device is up : 0 day 2 hours 34 mins 57 sec
    Packet rate : 32/s
    Throughput : 92 Kbps
    Total active sessions : 14
    Active TCP sessions : 8
    Active UDP sessions : 4
    Active ICMP sessions : 2

    Toggle to view 'top 20 applications' by pressing 'a' and back by pressing 's'

    Top 20 Application Statistics: ('q' to quit, 'h' for help)

    Virtual System: vsys1
    application                      sessions   packets      bytes
    -------------------------------- ---------- ------------ ------------
    ssl                              349        47051        53368023
    firefox-update                   4          32503        31679603
    google-base                      57         19879        17687367
    ms-update                        30         2513         2413399
    tor                              1          715          710161
    web-browsing                     60         1140         588357
    ping                             2920       5840         572320
    windows-push-notifications       57         819          266737
    apt-get                          6          662          260149
    dns                              728        1484         185056
    dhcp                             124        248          86800
    ms-spynet                        3          71           45382
    ocsp                             9          215          33311
    google-update                    2          28           8624
    ntp                              45         90           8100
    ipv6                             2          64           5248
    sip                              5          5            2283
    ldap                             8          9            2241
    insufficient-data                5          13           780
    icmp                             6          6            520


    It is also important to note that type of traffic and complexity in configuration can also add to CPU utilization. Hence it is important to look at the traffic pattern to understand the potential application causing the high CPU utilization.

    Management Plane

    Check management plane resource usage by either searching for "--- top" in the mp-monitor.log or by running the show system resources command from the CLI. Below is an example output of this command:

    >show system resources

    top - 03:40:57 up 20 min,  0 users,  load average: 0.00, 0.01, 0.03

    Tasks: 116 total,   1 running, 115 sleeping,   0 stopped,   0 zombie

    Cpu(s):  1.6%us,  1.3%sy,  0.0%ni, 96.2%id,  0.5%wa,  0.0%hi,  0.4%si,  0.0%st

    Mem:   3635080k total,  2327864k used,  1307216k free,    37772k buffers

    Swap:  2007992k total,        0k used,  2007992k free,  1871932k cached

     

      PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND

    1837 root      15  -5 18036 3508 2032 S    8  0.1   0:35.23 sysd

        1 root      20   0  1768  560  488 S    0  0.0   0:00.75 init

        2 root      20   0     0    0    0 S    0  0.0   0:00.00 kthreadd

        3 root      RT   0     0    0    0 S    0  0.0   0:00.00 migration/0


    The following are indications that resource usage on the management plane may be causing issues.

    • %wa on the CPU line is high, this indicates that the management plane is waiting at io (in swap and disk-bound).
    • Free memory on the Swap line is high.
    • CPU for the devsrvr, mgmtsvr, and appweb processes are high.


    Traffic Pattern Analysis

    If traffic patterns are the suspected cause of the performance issue, request external packet-captures of network traffic from customer. It is not recommended to perform capture from firewall itself as it may not show all traffic especially if the traffic is offloaded.



    Attachments
    Actions
    • Print
    • Copy Link

      https://knowledgebase.paloaltonetworks.com/KCSArticleDetail?id=kA10g000000ClRTCA0&refURL=http%3A%2F%2Fknowledgebase.paloaltonetworks.com%2FKCSArticleDetail

    Attachments
    Choose Language