Log forwarding delays or missing logs due to high latency between log collectors in a Collector Group
Symptom
A Collector Group's ability to handle logs can degrade significantly when the latency between the log collectors in the group exceeds 10 ms and/or when the logging rate is high. Under these conditions, log forwarding may be slow or delayed, and in some instances logs may be lost.
Environment
Environments where this issue is more likely to occur:
- High latency between log collectors – a latency greater than 10 ms can trigger the problem.
- High logging rate – high-end firewalls (PA-7K, PA-5200) forwarding logs to a log collector, or many firewalls forwarding logs.
- Log redundancy is enabled.
Cause
In a functioning system, a firewall forwards logs to a single log collector in the Collector Group based on its log forwarding preference list. The receiving log collector then distributes these logs evenly to the other log collectors in the group for storage on disk. It buffers the logs until it receives an acknowledgement from the peer log collector(s), so that the logs can be resent if a communication failure occurs. If this buffer fills up, the log collector can no longer accept logs from the firewall.
In a system under stress (for example, a high logging rate combined with high latency between log collectors), the acknowledgement packets can be delayed. This causes the buffers on the receiving log collector to reach maximum capacity. While the buffers are full, the log collector does not accept additional logs from firewalls, which delays the storage of logs on disk and, in extreme cases, leads to log loss. For example, on a PA-5200 or PA-7K, which can generate very high logging rates, the firewall's log buffer can roll over, resulting in lost logs.
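The impact of delayed acknowledgements can be illustrated with a rough back-of-the-envelope calculation. The Python sketch below is illustrative only: the buffer size, logging rate, batch size, and round-trip times are assumed values (not published PAN-OS limits), and the one-outstanding-block-per-peer model is a simplification.

# Rough illustration of how acknowledgement latency limits how fast the
# receiving log collector can drain its inter-LC forwarding buffer.
# All numbers are assumptions chosen for illustration, not PAN-OS limits.

BUFFER_LOGS  = 2_000_000   # assumed inter-LC forwarding buffer size (logs)
INCOMING_LPS = 60_000      # assumed incoming log rate from firewalls (logs/sec)
BATCH_LOGS   = 2_000       # assumed logs shipped per block/acknowledgement cycle

def seconds_until_full(rtt_seconds):
    """Time until the buffer fills, assuming one outstanding block per peer,
    so the drain rate is bounded by one batch per round trip."""
    drain_lps = BATCH_LOGS / rtt_seconds
    if drain_lps >= INCOMING_LPS:
        return None  # buffer drains at least as fast as it fills
    return BUFFER_LOGS / (INCOMING_LPS - drain_lps)

for rtt_ms in (5, 50):
    t = seconds_until_full(rtt_ms / 1000.0)
    if t is None:
        print(f"{rtt_ms} ms RTT: no backlog builds up")
    else:
        print(f"{rtt_ms} ms RTT: buffer fills in ~{t:,.0f} s; "
              "new logs from firewalls are then refused")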
Detection of a System under Stress
- Run the CLI command debug log-collector log-collection-stats show incoming-logs (a parsing sketch follows this list):
> debug log-collector log-collection-stats show incoming-logs
blk shipping stats per destination LC:
007307001057(Successes:3156, Fails:156543)
007307001044(Successes:166874, Fails:0)
When the problem exists, the Fails count keeps increasing for the remote log collector.
- Run the netstat CLI commands to check whether the communication channel is congested (a queue-check sketch also follows this list):
> show netstat numeric yes program yes | match logd
tcp 6100410 0 127.0.0.1:41742 127.0.0.1:pan-mgmtsrv ESTABLISHED
> show netstat numeric yes program yes | match <ip address of the other LC>
tcp 4306978 162176 ::ffff:10.:pan-mgmt-interlc ::ffff:172.25.0.14:50471 ESTABLISHED
Large values in the Recv-Q and Send-Q columns (the second and third columns) indicate that the channel is backed up.
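To confirm that the Fails count is actually climbing rather than merely non-zero, the counters can be sampled twice and compared. The Python sketch below assumes the output format shown above (destination LC serial number followed by Successes and Fails in parentheses); the two sample strings are hypothetical values standing in for output captured from the CLI a short interval apart.

import re

# Parses lines of the form:
#   007307001057(Successes:3156, Fails:156543)
# from "debug log-collector log-collection-stats show incoming-logs".
STATS_RE = re.compile(r"(\d+)\(Successes:(\d+), Fails:(\d+)\)")

def parse_stats(output):
    """Return {serial: (successes, fails)} for each destination LC."""
    return {m.group(1): (int(m.group(2)), int(m.group(3)))
            for m in STATS_RE.finditer(output)}

def growing_fails(sample1, sample2):
    """Return serials whose Fails counter increased between two samples."""
    before, after = parse_stats(sample1), parse_stats(sample2)
    return [sn for sn, (_, fails) in after.items()
            if sn in before and fails > before[sn][1]]

# Hypothetical samples taken a short interval apart (values are illustrative).
first = ("007307001057(Successes:3156, Fails:156543)\n"
         "007307001044(Successes:166874, Fails:0)")
second = ("007307001057(Successes:3160, Fails:158012)\n"
          "007307001044(Successes:170101, Fails:0)")

print("LCs with increasing Fails:", growing_fails(first, second))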
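Similarly, the Recv-Q and Send-Q columns of the netstat output show whether the inter-LC channel is congested. The sketch below assumes the column layout shown above; the congestion threshold is an arbitrary illustrative value, not a documented limit.

# Flags a netstat line whose Recv-Q or Send-Q is large, which suggests
# the inter-LC communication channel is congested.
QUEUE_THRESHOLD = 1_000_000  # bytes; arbitrary illustrative value

# Example line copied from the output above.
netstat_line = ("tcp 4306978 162176 ::ffff:10.:pan-mgmt-interlc "
                "::ffff:172.25.0.14:50471 ESTABLISHED")

fields = netstat_line.split()
recv_q, send_q = int(fields[1]), int(fields[2])  # Recv-Q and Send-Q columns

if recv_q > QUEUE_THRESHOLD or send_q > QUEUE_THRESHOLD:
    print(f"Congested: Recv-Q={recv_q}, Send-Q={send_q}")
else:
    print("Queues look normal.")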
Resolution
Solution
Ensure that the latency between log collectors is no more than 10 ms.
Workaround
- Turn off Log Redundancy for the Collector Group.
- Turn on inter-LC data compression
Inter-LC data compression is enabled by default in PAN-OS 8.1. In PAN-OS 8.0, run the following commands on all log collectors in the group to turn on data compression for inter-LC communication:
> debug log-collector inter-log-collector data-compression
> debug software restart process logd
There is no operational impact apart from a minimal increase in logd CPU usage.
- Split a single Collector Group into multiple Collector Groups
Note: This workaround can impact log forwarding from PA-7K and PA-5200 firewalls. Because of the limited buffer capacity and extremely high logging rates of these firewalls, their buffers can fill up, and they may not be able to re-forward the buffered logs to the log collector once it comes back up.