Log forwarding delays or Missing Logs due to high latency between log collectors in a collector group

Created On 12/28/18 08:30 AM - Last Modified 10/15/19 18:21 PM


Symptom
A Collector Group's ability to handle logs can degrade significantly when the latency between log collectors in the group exceeds 10 ms, when the logging rate is high, or both. Under such conditions, log forwarding may slow down or be delayed. In some instances, logs may even be lost.

Environment
Environments where this issue is more likely to occur:
  • High latency between log collectors (LCs): a latency greater than 10 ms can trigger the problem.
  • High logging rate: high-end firewalls (PA-7K, PA-5200) forwarding logs to an LC, or many firewalls forwarding logs.
  • Log redundancy is enabled.


Cause

In a functioning system, a firewall forwards logs to a single log collector in a collector group based on the configuration of its log forwarding preference list. The log collector that receives the logs then distributes them equally among the other log collectors in the group for storage on disk. The receiving log collector buffers the logs until it receives an acknowledgement from the peer log collector(s), so that the logs can be sent again if a communication failure occurs. If this buffer fills up, the log collector can no longer receive logs from the firewall.

In a system under stress (for example, under high logging rate and high latency between log collectors), the acknowledgement packets can get delayed. This in turn causes buffers on the receiving log collector to reach maximum capacity. While buffers are at max capacity, the log collector won't accept additional logs from firewalls. This introduces delays in storage of logs on disk and in extreme cases a loss of logs. For example, in the case of a PA-5200 or a PA-7K which can have very high logging rates, the firewall's log buffer could roll over resulting in a loss of logs.
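The relationship between logging rate, acknowledgement delay, and buffer pressure can be sketched with Little's law: the average number of logs waiting for an acknowledgement is roughly the forwarding rate multiplied by the time it takes an acknowledgement to arrive. The function names, rates, and buffer capacity below are hypothetical and chosen only for illustration. They are not values taken from the log collector.

```python
def unacked_logs(logs_per_sec, ack_delay_ms):
    """Little's law: logs awaiting acknowledgement = arrival rate x ack delay."""
    return logs_per_sec * ack_delay_ms / 1000.0


def buffer_saturated(logs_per_sec, ack_delay_ms, buffer_capacity_logs):
    """True when the logs awaiting acknowledgement would fill the buffer.

    buffer_capacity_logs is an assumed figure, not a documented limit.
    """
    return unacked_logs(logs_per_sec, ack_delay_ms) >= buffer_capacity_logs


# Hypothetical numbers: 50,000 logs/s and an assumed buffer of 1,000 logs.
print(unacked_logs(50000, 10))            # 500.0
print(buffer_saturated(50000, 10, 1000))  # False
print(buffer_saturated(50000, 40, 1000))  # True
```

In this simplified model, quadrupling the acknowledgement delay at the same logging rate is enough to move the buffer from half full to saturated, which is the failure mode described above.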

Detection of a System under Stress

  1. Run the CLI command debug log-collector log-collection-stats show incoming-logs

On the log collector that is receiving logs from the firewall, issue the following command:

> debug log-collector log-collection-stats show incoming-logs

Check the Fails field in the output:
blk shipping stats per destination LC:
        007307001057(Successes:3156, Fails:156543)
        007307001044(Successes:166874, Fails:0)


When the problem exists, the Fails count increases constantly for the remote log collector.
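Because the symptom is a Fails count that keeps increasing, it can help to capture the command output twice and compare the counters. The following is a minimal sketch, assuming the output has been saved as text. The helper names are hypothetical, the first sample is the output shown above, and the second sample contains made-up values for illustration.

```python
import re

# Matches entries such as "007307001057(Successes:3156, Fails:156543)".
ENTRY = re.compile(r"(\w+)\(Successes:(\d+),\s*Fails:(\d+)\)")


def parse_shipping_stats(output):
    """Map each destination LC serial to its (successes, fails) counters."""
    return {serial: (int(ok), int(fails))
            for serial, ok, fails in ENTRY.findall(output)}


def lcs_with_growing_fails(earlier, later):
    """Serials whose Fails counter increased between two samples."""
    return [serial for serial, (_, fails) in later.items()
            if fails > earlier.get(serial, (0, 0))[1]]


# First sample: the output shown above.
first = parse_shipping_stats("""
blk shipping stats per destination LC:
        007307001057(Successes:3156, Fails:156543)
        007307001044(Successes:166874, Fails:0)
""")
# Second sample: hypothetical values, for illustration only.
second = parse_shipping_stats("""
blk shipping stats per destination LC:
        007307001057(Successes:3160, Fails:158901)
        007307001044(Successes:170112, Fails:0)
""")
print(lcs_with_growing_fails(first, second))  # ['007307001057']
```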
 
  2. Run the netstat CLI commands to see if the communication channel is congested

On the log collector, run the following two netstat commands:

> show netstat numeric yes program yes | match logd

If the recv-q (the second column) shows a huge number, the system most likely has run into this problem.

tcp   6100410      0 127.0.0.1:41742             127.0.0.1:pan-mgmtsrv       ESTABLISHED

 

> show netstat numeric yes program yes | match <ip address of the other LC>

If the recv-q (the second column) and/or the send-q (the third column) for the connection shows a large number, the system has most likely run into the problem.

tcp   4306978 162176 ::ffff:10.:pan-mgmt-interlc ::ffff:172.25.0.14:50471    ESTABLISHED
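The queue columns can also be pulled out of saved netstat output programmatically. The sketch below assumes the column order shown above (protocol, recv-q, send-q, addresses, state). The function names are hypothetical, and the threshold of 1,000,000 is an assumed cutoff for "large", since no specific value is documented here.

```python
def queue_sizes(netstat_line):
    """Return (recv_q, send_q) from one netstat connection line."""
    fields = netstat_line.split()
    # Column order: protocol, recv-q, send-q, local address, remote address, state.
    return int(fields[1]), int(fields[2])


def looks_congested(netstat_line, threshold=1_000_000):
    """True when either queue is at or above the threshold.

    The default threshold is an assumption, not a documented value.
    """
    recv_q, send_q = queue_sizes(netstat_line)
    return recv_q >= threshold or send_q >= threshold


# The inter-LC connection line shown above.
line = "tcp   4306978 162176 ::ffff:10.:pan-mgmt-interlc ::ffff:172.25.0.14:50471    ESTABLISHED"
print(queue_sizes(line), looks_congested(line))  # (4306978, 162176) True
```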



Resolution

Solution

Ensure that the latency between log collectors is no more than 10 ms.

Workaround

  1. Turn off Log Redundancy for the Collector Group (CG)
Redundancy doubles the traffic volume between the LCs. Reducing the traffic may help ease the pressure.
  2. Turn on Inter-LC Data Compression

Data compression for inter-LC communication is enabled by default in 8.1. In 8.0, run the following commands on all the LCs in the group to turn on data compression for inter-LC communication:
debug log-collector inter-log-collector data-compression
debug software restart process logd

This has no operational impact, but it can result in a minimal increase in the CPU usage of logd.

  3. Split a single Collector Group into multiple Collector Groups

Splitting the group reduces, or in some cases completely eliminates, inter-log collector communication and therefore lowers the likelihood of this problem. It has no operational impact as long as all the log collectors are up and running. If one of the log collectors goes down briefly, firewalls (other than PA-7K and PA-5200) will re-forward the logs and logs will not be lost.
Note: This workaround will impact log forwarding from PA-7K and PA-5200 firewalls. Because of the buffer capacity and extremely high logging rate of these firewalls, their buffers can fill up, and they may not be able to re-forward the logs to the log collector once it comes back up.

