How to troubleshoot log throttling in Panorama
Objective
- Logs not showing in Panorama
- "Throttle ON" seen in the logs
> less mp-log logd.log
Throttle ON. Disable ms conn read: tb_hash_nitems:6 tb_hash_max_size:1024 notifier_nitems:18 notifier_qsize:256 wb_cache_nitems:64 wb_cache_max_size:64
Throttle OFF. Enable ms conn read: tb_hash_nitems:0 tb_hash_max_size:1024 notifier_nitems:55 notifier_qsize:256 wb_cache_nitems:38 wb_cache_max_size:64
Throttle ON. Disable ms conn read: tb_hash_nitems:5 tb_hash_max_size:1024 notifier_nitems:77 notifier_qsize:256 wb_cache_nitems:64 wb_cache_max_size:64
Throttle OFF. Enable ms conn read: tb_hash_nitems:5 tb_hash_max_size:1024 notifier_nitems:120 notifier_qsize:256 wb_cache_nitems:51 wb_cache_max_size:64
> less mp-log vldmgr.log
Turning OFF throttling for cs:logd in vldmgr:vldmgr
Turning ON throttling for cs:logd in vldmgr:vldmgr
Turning OFF throttling for cs:logd in vldmgr:vldmgr
Turning ON throttling for cs:logd in vldmgr:vldmgr
> less mp-log vld-2-0.log
Signal read throttle on to main thread. Size=200
Signal read throttle on to main thread. Size=200
Signal read throttle on to main thread. Size=200
Signal read throttle on to main thread. Size=201
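For quick triage, the throttle messages above can be summarized with a short script. This is an illustrative sketch, not a PAN-OS tool: the regular expression simply matches the logd.log format shown above, and `summarize_throttle` is a hypothetical helper name.

```python
import re

# Count Throttle ON/OFF events in logd.log output and report how full the
# notifier queue got; a high fill ratio means logd is close to throttling.
THROTTLE_RE = re.compile(
    r"Throttle (ON|OFF)\..*notifier_nitems:(\d+) notifier_qsize:(\d+)"
)

def summarize_throttle(lines):
    on = off = 0
    worst_fill = 0.0
    for line in lines:
        m = THROTTLE_RE.search(line)
        if not m:
            continue
        state, nitems, qsize = m.group(1), int(m.group(2)), int(m.group(3))
        if state == "ON":
            on += 1
        else:
            off += 1
        worst_fill = max(worst_fill, nitems / qsize)
    return {"on": on, "off": off, "worst_notifier_fill": worst_fill}

sample = [
    "Throttle ON. Disable ms conn read: tb_hash_nitems:6 tb_hash_max_size:1024 "
    "notifier_nitems:18 notifier_qsize:256 wb_cache_nitems:64 wb_cache_max_size:64",
    "Throttle OFF. Enable ms conn read: tb_hash_nitems:0 tb_hash_max_size:1024 "
    "notifier_nitems:55 notifier_qsize:256 wb_cache_nitems:38 wb_cache_max_size:64",
]
print(summarize_throttle(sample))
```

Frequent ON/OFF flapping in the summary indicates sustained backpressure rather than a one-off spike.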
- This article focuses on troubleshooting the Log Collector components responsible for processing logs.
- It follows a log's journey from the firewall until it is written to disk.
- The goal is to identify potential issues or bottlenecks and offer effective solutions to improve performance.
Environment
- Panorama in Panorama mode
- Panorama in Log Collector mode
- PAN-OS version 10.0 and above
Procedure
- Ensure that all Panoramas in Panorama mode and Log Collector mode can communicate with each other; for guidance, refer to How to troubleshoot inter Log Collector connection issue.
- Verify that the number of active shards created by Elasticsearch is within supported limits.
- Issue the command 'show log-collector-es-cluster health' from the CLI.
- Confirm that "active_shards" is within the supported limit; refer to How to calculate maximum primary shard supported?
> show log-collector-es-cluster health
{
"cluster_name" : "__pan_cluster__",
"status" : "green",
"timed_out" : false,
"number_of_nodes" : 3,
"number_of_data_nodes" : 2,
"active_primary_shards" : 594,
"active_shards" : 596,
"relocating_shards" : 0,
"initializing_shards" : 0,
"unassigned_shards" : 0,
"delayed_unassigned_shards" : 0,
"number_of_pending_tasks" : 0,
"number_of_in_flight_fetch" : 0,
"task_max_waiting_in_queue_millis" : 0,
"active_shards_percent_as_number" : 100.0
}
- If the number of active shards exceeds the computed maximum:
- Go to the GUI: Panorama > Collector Group > (collector group name) > Log Storage > click on the link > set "Max Days" in the "Log Storage Settings"
- Set "Max Days" to a value lower than its current value
- Commit to Panorama
- Push to the Collector Group.
- If "Max Days" is not configured, set it to a value lower than the "Retention" days shown by the CLI command "show system search-engine-quota".
- This purges some of the older data, which in turn reduces the number of "active_shards" that Elasticsearch maintains.
- Allow a day or two for the purging to take place.
- If the number of "active_shards" remains high, reduce "Max Days" further.
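As a sanity check, the shard comparison above can be sketched in Python. This is illustrative only: `MAX_SHARDS` is a placeholder, not a published limit, and must be replaced with the maximum you compute per the shard-sizing article; `check_shards` is a hypothetical helper name.

```python
import json

# Compare "active_shards" from 'show log-collector-es-cluster health'
# against a separately computed maximum.
MAX_SHARDS = 1000  # placeholder; compute per "How to calculate maximum primary shard supported?"

health = json.loads("""
{
  "cluster_name": "__pan_cluster__",
  "status": "green",
  "active_primary_shards": 594,
  "active_shards": 596,
  "unassigned_shards": 0
}
""")

def check_shards(health, max_shards):
    if health["status"] != "green":
        return "cluster unhealthy: status=" + health["status"]
    if health["active_shards"] > max_shards:
        return "too many shards: reduce Max Days to purge older indices"
    return "shard count OK"

print(check_shards(health, MAX_SHARDS))
```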
- Verify that each Log Collector's Elasticsearch is healthy, including the Log Collector local to the Panorama.
- Below is a sample of an unhealthy Elasticsearch; note that the status is "red".
> debug elasticsearch es-state option health
epoch timestamp cluster status node.total node.data shards pri relo init unassign pending_tasks max_task_wait_time active_shards_percent
1683877500 07:45:00 __pan_cluster__ red 1 1 461 461 0 0 115 0 - 80.03472222222221
- Open a support case to resolve the above.
- Below is a sample of a healthy Elasticsearch; the status is "green".
> debug elasticsearch es-state option health
epoch timestamp cluster status node.total node.data shards pri relo init unassign pending_tasks max_task_wait_time active_shards_percent
1683877131 07:38:51 __pan_cluster__ green 1 1 32 32 0 0 0 0 - 100.0%
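The tabular health output above can be checked programmatically when collecting data from many Log Collectors. A minimal sketch, assuming the column layout shown in the samples above; `es_status` is a hypothetical helper name.

```python
# Pull the cluster status from the tabular output of
# 'debug elasticsearch es-state option health'.
def es_status(output):
    lines = [l for l in output.strip().splitlines() if l.strip()]
    header = lines[0].split()          # column names, e.g. epoch, timestamp, status...
    row = lines[1].split()             # first data row
    fields = dict(zip(header, row))
    return fields["status"]

sample = """\
epoch timestamp cluster status node.total node.data shards pri relo init unassign pending_tasks max_task_wait_time active_shards_percent
1683877131 07:38:51 __pan_cluster__ green 1 1 32 32 0 0 0 0 - 100.0%
"""
print(es_status(sample))  # green
```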
- Check whether the logging rate is within the capacity of the Panorama or Log Collector.
- Extract the device's logging rate using the command below.
> grep mp-log mp-monitor.log pattern "Incoming log rate"
Incoming log rate = 19610.18
Incoming log rate = 19723.54
Incoming log rate = 24201.00
- For the VM series, use VM Panorama Logging Rate as a reference.
- If the Incoming log rate exceeds the published capacity, upgrade to a higher platform or add another Log Collector in the Collector Group.
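The rate comparison above can be sketched as follows. `CAPACITY_LPS` is a placeholder, not an official figure: look up the published logs-per-second capacity for your specific platform.

```python
import re

# Parse "Incoming log rate" samples from mp-monitor.log and flag when the
# peak exceeds the platform capacity you supply.
CAPACITY_LPS = 25000.0  # placeholder; use your model's published capacity

samples = """\
Incoming log rate = 19610.18
Incoming log rate = 19723.54
Incoming log rate = 24201.00
"""

rates = [float(m) for m in re.findall(r"Incoming log rate = ([\d.]+)", samples)]
peak, avg = max(rates), sum(rates) / len(rates)
print(f"peak={peak:.2f} avg={avg:.2f} over_capacity={peak > CAPACITY_LPS}")
```

Comparing the peak (not just the average) against capacity matters, because short bursts above capacity are enough to trigger throttling.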
- Add more logging disks.
- Adding disks enables parallel logging: logs can be written to multiple disks simultaneously, which increases overall logging capacity and the speed at which logs are written. Distributing the write workload across disks helps prevent bottlenecks and reduces the risk of log data loss or delays in logging critical events.
- For the Virtual Appliance, use Expand Log Storage Capacity on the Panorama Virtual Appliance as a reference.
- For the M-Series Appliance, use Add Additional Drives to an M-Series Appliance as a reference.
- Relocate Device Management, Device Log Collection, and Collector Group Communication away from the Management interface.
- By default, all of these services use the management interface (shown in the first snapshot of the original article).
- The recommended settings move Device Log Collection and Collector Group Communication to dedicated interfaces (shown in the second snapshot).
- Confirm that logd is able to properly communicate with vldmgr
- Look for any of the patterns below in logd.log:
> grep mp-log logd.log pattern "Error enque'ing into notifier in:logd"
2023-03-03 17:10:32.518 +0000 Error: logd_send_iovec(pkt.c:5369): Error enque'ing into notifier in:log
> grep mp-log logd.log pattern "Error sending iovec to notifier in:logd"
2023-03-03 17:10:32.518 +0000 Error: logd_send_pkt(pkt.c:5432): Error sending iovec to notifier in:logd
> grep mp-log logd.log pattern "block to vldmgr failed"
2023-03-03 17:10:32.518 +0000 Error: _handle_blk_redist(pkt.c:4755): sending pkt type 0 block to vldmgr failed
- Any of the above log entries indicates that 'logd' and 'vldmgr' are not communicating properly. To restore the link, restart the 'logd' and 'vldmgr' processes using the commands below.
> debug software restart process logd
> debug software restart process vldmgr
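When reviewing an exported copy of logd.log offline, the three error signatures above can be scanned for in one pass. An illustrative sketch; `find_comm_errors` is a hypothetical helper name, and the patterns are taken verbatim from the grep commands above.

```python
# Scan logd.log lines for the error signatures that indicate logd and
# vldmgr are not communicating properly.
PATTERNS = (
    "Error enque'ing into notifier in:log",
    "Error sending iovec to notifier in:logd",
    "block to vldmgr failed",
)

def find_comm_errors(lines):
    return [l for l in lines if any(p in l for p in PATTERNS)]

log = [
    "2023-03-03 17:10:32.518 +0000 Error: logd_send_pkt(pkt.c:5432): "
    "Error sending iovec to notifier in:logd",
    "2023-03-03 17:10:33.001 +0000 Info: normal operation",
]
hits = find_comm_errors(log)
print(len(hits))  # 1
```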
- Follow the steps in How to Restart the Management server "mgmtsrvr" Process
- Restart the Log Collector/Panorama
> request restart system