How to troubleshoot log throttling in Panorama
Objective
- Logs not showing in Panorama
- "Throttle ON" seen in the logs
> less mp-log logd.log
Throttle ON. Disable ms conn read: tb_hash_nitems:6 tb_hash_max_size:1024 notifier_nitems:18 notifier_qsize:256 wb_cache_nitems:64 wb_cache_max_size:64
Throttle OFF. Enable ms conn read: tb_hash_nitems:0 tb_hash_max_size:1024 notifier_nitems:55 notifier_qsize:256 wb_cache_nitems:38 wb_cache_max_size:64
Throttle ON. Disable ms conn read: tb_hash_nitems:5 tb_hash_max_size:1024 notifier_nitems:77 notifier_qsize:256 wb_cache_nitems:64 wb_cache_max_size:64
Throttle OFF. Enable ms conn read: tb_hash_nitems:5 tb_hash_max_size:1024 notifier_nitems:120 notifier_qsize:256 wb_cache_nitems:51 wb_cache_max_size:64
> less mp-log vldmgr.log
Turning OFF throttling for cs:logd in vldmgr:vldmgr
Turning ON throttling for cs:logd in vldmgr:vldmgr
Turning OFF throttling for cs:logd in vldmgr:vldmgr
Turning ON throttling for cs:logd in vldmgr:vldmgr
> less mp-log vld-2-0.log
Signal read throttle on to main thread. Size=200
Signal read throttle on to main thread. Size=200
Signal read throttle on to main thread. Size=200
Signal read throttle on to main thread. Size=201
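For quick triage, the throttle messages above can be summarized with a short script. This is an illustrative sketch, not a PAN-OS tool: the regular expression simply matches the logd.log format shown above, and `summarize_throttle` is a hypothetical helper name.

```python
import re

# Count Throttle ON/OFF events in logd.log output and report how full the
# notifier queue got; a high fill ratio means logd is close to throttling.
THROTTLE_RE = re.compile(
    r"Throttle (ON|OFF)\..*notifier_nitems:(\d+) notifier_qsize:(\d+)"
)

def summarize_throttle(lines):
    on = off = 0
    worst_fill = 0.0
    for line in lines:
        m = THROTTLE_RE.search(line)
        if not m:
            continue
        state, nitems, qsize = m.group(1), int(m.group(2)), int(m.group(3))
        if state == "ON":
            on += 1
        else:
            off += 1
        worst_fill = max(worst_fill, nitems / qsize)
    return {"on": on, "off": off, "worst_notifier_fill": worst_fill}

sample = [
    "Throttle ON. Disable ms conn read: tb_hash_nitems:6 tb_hash_max_size:1024 "
    "notifier_nitems:18 notifier_qsize:256 wb_cache_nitems:64 wb_cache_max_size:64",
    "Throttle OFF. Enable ms conn read: tb_hash_nitems:0 tb_hash_max_size:1024 "
    "notifier_nitems:55 notifier_qsize:256 wb_cache_nitems:38 wb_cache_max_size:64",
]
print(summarize_throttle(sample))
```

Frequent ON/OFF flapping in the summary indicates sustained backpressure rather than a one-off spike.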
- This article focuses on troubleshooting the Log Collector components responsible for processing logs.
- It follows a log's journey from the firewall until it is written to disk.
- The goal is to identify potential issues or bottlenecks and offer effective solutions to improve performance.
Environment
- Panorama in Panorama mode
- Panorama in Log Collector mode
- PAN-OS version 10.0 and above
Procedure
- Ensure that all Panoramas in Panorama mode and Log Collector mode can communicate with each other; for guidance, refer to How to troubleshoot inter Log Collector connection issue.
- Verify that the number of active shards created by Elasticsearch is within supported limits.
- Issue the command 'show log-collector-es-cluster health' from the CLI.
- Confirm that "active_shards" is within the supported limit; refer to How to calculate maximum primary shard supported?
> show log-collector-es-cluster health
{
"cluster_name" : "__pan_cluster__",
"status" : "green",
"timed_out" : false,
"number_of_nodes" : 3,
"number_of_data_nodes" : 2,
"active_primary_shards" : 594,
"active_shards" : 596,
"relocating_shards" : 0,
"initializing_shards" : 0,
"unassigned_shards" : 0,
"delayed_unassigned_shards" : 0,
"number_of_pending_tasks" : 0,
"number_of_in_flight_fetch" : 0,
"task_max_waiting_in_queue_millis" : 0,
"active_shards_percent_as_number" : 100.0
}
- If the number of active shards exceeds the computed maximum:
- Go to the GUI: Panorama > Collector Group > (collector group name) > Log Storage > click on the link > set "Max Days" in the "Log Storage Settings"
- Set "Max Days" to a value lower than its current value
- Commit to Panorama
- Push to the Collector Group.
- If "Max Days" is not configured, set it to a value lower than the "Retention" days shown by the CLI command "show system search-engine-quota".
- This purges some of the older data, which in turn reduces the number of "active_shards" that Elasticsearch maintains.
- Allow a day or two for the purging to take place.
- If the number of "active_shards" remains high, reduce "Max Days" further.
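As a sanity check, the shard comparison above can be sketched in Python. This is illustrative only: `MAX_SHARDS` is a placeholder, not a published limit, and must be replaced with the maximum you compute per the shard-sizing article; `check_shards` is a hypothetical helper name.

```python
import json

# Compare "active_shards" from 'show log-collector-es-cluster health'
# against a separately computed maximum.
MAX_SHARDS = 1000  # placeholder; compute per "How to calculate maximum primary shard supported?"

health = json.loads("""
{
  "cluster_name": "__pan_cluster__",
  "status": "green",
  "active_primary_shards": 594,
  "active_shards": 596,
  "unassigned_shards": 0
}
""")

def check_shards(health, max_shards):
    if health["status"] != "green":
        return "cluster unhealthy: status=" + health["status"]
    if health["active_shards"] > max_shards:
        return "too many shards: reduce Max Days to purge older indices"
    return "shard count OK"

print(check_shards(health, MAX_SHARDS))
```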
- Verify that each Log Collector's Elasticsearch is healthy, including the Log Collector local to the Panorama.
- Below is a sample of an unhealthy Elasticsearch; note that the status is "red".
> debug elasticsearch es-state option health
epoch timestamp cluster status node.total node.data shards pri relo init unassign pending_tasks max_task_wait_time active_shards_percent
1683877500 07:45:00 __pan_cluster__ red 1 1 461 461 0 0 115 0 - 80.03472222222221
- Open a support case to resolve the above.
- Below is a sample of a healthy Elasticsearch; the status is "green".
> debug elasticsearch es-state option health
epoch timestamp cluster status node.total node.data shards pri relo init unassign pending_tasks max_task_wait_time active_shards_percent
1683877131 07:38:51 __pan_cluster__ green 1 1 32 32 0 0 0 0 - 100.0%
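The tabular health output above can be checked programmatically when collecting data from many Log Collectors. A minimal sketch, assuming the column layout shown in the samples above; `es_status` is a hypothetical helper name.

```python
# Pull the cluster status from the tabular output of
# 'debug elasticsearch es-state option health'.
def es_status(output):
    lines = [l for l in output.strip().splitlines() if l.strip()]
    header = lines[0].split()          # column names, e.g. epoch, timestamp, status...
    row = lines[1].split()             # first data row
    fields = dict(zip(header, row))
    return fields["status"]

sample = """\
epoch timestamp cluster status node.total node.data shards pri relo init unassign pending_tasks max_task_wait_time active_shards_percent
1683877131 07:38:51 __pan_cluster__ green 1 1 32 32 0 0 0 0 - 100.0%
"""
print(es_status(sample))  # green
```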
- Check whether the logging rate is within the capacity of the Panorama or Log Collector.
- Extract the device's logging rate using the command below.
> grep mp-log mp-monitor.log pattern "Incoming log rate"
Incoming log rate = 19610.18
Incoming log rate = 19723.54
Incoming log rate = 24201.00
- For the VM series, use VM Panorama Logging Rate as a reference.
- If the Incoming log rate exceeds the published capacity, upgrade to a higher platform or add another Log Collector in the Collector Group.
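The rate comparison above can be sketched as follows. `CAPACITY_LPS` is a placeholder, not an official figure: look up the published logs-per-second capacity for your specific platform.

```python
import re

# Parse "Incoming log rate" samples from mp-monitor.log and flag when the
# peak exceeds the platform capacity you supply.
CAPACITY_LPS = 25000.0  # placeholder; use your model's published capacity

samples = """\
Incoming log rate = 19610.18
Incoming log rate = 19723.54
Incoming log rate = 24201.00
"""

rates = [float(m) for m in re.findall(r"Incoming log rate = ([\d.]+)", samples)]
peak, avg = max(rates), sum(rates) / len(rates)
print(f"peak={peak:.2f} avg={avg:.2f} over_capacity={peak > CAPACITY_LPS}")
```

Comparing the peak (not just the average) against capacity matters, because short bursts above capacity are enough to trigger throttling.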
- Add more logging disks.
- Adding disks enables parallel logging: logs can be written to multiple disks simultaneously, which increases overall logging capacity and the speed at which logs are written. Distributing the write workload across disks helps prevent bottlenecks and reduces the risk of log data loss or delays in logging critical events.
- For the Virtual Appliance, use Expand Log Storage Capacity on the Panorama Virtual Appliance as a reference.
- For the M-Series Appliance, use Add Additional Drives to an M-Series Appliance as a reference.
- Relocate Device Management, Device Log Collection, and Collector Group Communication away from the Management interface.
- By default, all of these services use the management interface (shown in the first snapshot of the original article).
- The recommended settings move Device Log Collection and Collector Group Communication to dedicated interfaces (shown in the second snapshot).
- Confirm that logd is able to properly communicate with vldmgr
- Look for any of the patterns below in logd.log:
> grep mp-log logd.log pattern "Error enque'ing into notifier in:logd"
2023-03-03 17:10:32.518 +0000 Error: logd_send_iovec(pkt.c:5369): Error enque'ing into notifier in:log
> grep mp-log logd.log pattern "Error sending iovec to notifier in:logd"
2023-03-03 17:10:32.518 +0000 Error: logd_send_pkt(pkt.c:5432): Error sending iovec to notifier in:logd
> grep mp-log logd.log pattern "block to vldmgr failed"
2023-03-03 17:10:32.518 +0000 Error: _handle_blk_redist(pkt.c:4755): sending pkt type 0 block to vldmgr failed
- Any of the above log entries indicates that 'logd' and 'vldmgr' are not communicating properly. To restore the link, restart the 'logd' and 'vldmgr' processes using the commands below.
> debug software restart process logd
> debug software restart process vldmgr
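When reviewing an exported copy of logd.log offline, the three error signatures above can be scanned for in one pass. An illustrative sketch; `find_comm_errors` is a hypothetical helper name, and the patterns are taken verbatim from the grep commands above.

```python
# Scan logd.log lines for the error signatures that indicate logd and
# vldmgr are not communicating properly.
PATTERNS = (
    "Error enque'ing into notifier in:log",
    "Error sending iovec to notifier in:logd",
    "block to vldmgr failed",
)

def find_comm_errors(lines):
    return [l for l in lines if any(p in l for p in PATTERNS)]

log = [
    "2023-03-03 17:10:32.518 +0000 Error: logd_send_pkt(pkt.c:5432): "
    "Error sending iovec to notifier in:logd",
    "2023-03-03 17:10:33.001 +0000 Info: normal operation",
]
hits = find_comm_errors(log)
print(len(hits))  # 1
```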
- Follow the steps in How to Restart the Management server "mgmtsrvr" Process
- Restart the Log Collector/Panorama
> request restart system