Every couple of days we receive an alert that one of our agents, located on a server, has not checked in. The notification repeats every 5 minutes, lasts for 30 minutes, and then it goes back to normal. When I first receive this message, I locate the server and see it shown as online in the Agents status screen, along with its last check-in time. (For example, the alert says the server hasn't checked in at 1:00 PM, while Agent status shows the last check-in time as 1:00 PM.)
Any recommendations on what to look at to resolve what appears to be a false positive?
That's a pretty odd one. The only thing I can think of is that VSS snapshots generally freeze the system entirely, preventing an agent from checking in momentarily, but they (usually) only last a few moments and then things go back to normal. When the freeze starts exactly on the hour, the first things to look for are usually backups, VSS snapshots, restore point creation, or Windows 'Previous Versions' taking a long time to process.
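One way to test the on-the-hour theory is to export the agent's check-in times and look for silences that begin near the top of an hour. The following is only a rough sketch: the sample timestamps, the function names `find_gaps` and `starts_near_hour`, and the two-minute tolerance are all made up for illustration, and how you export check-in times depends on your monitoring product.

```python
from datetime import datetime, timedelta


def find_gaps(checkins, max_gap):
    """Return (start, end, gap) for consecutive check-ins further apart than max_gap."""
    ordered = sorted(checkins)
    gaps = []
    for prev, curr in zip(ordered, ordered[1:]):
        if curr - prev > max_gap:
            gaps.append((prev, curr, curr - prev))
    return gaps


def starts_near_hour(moment, tolerance=timedelta(minutes=2)):
    """True when moment falls within tolerance of the top of an hour."""
    since_hour = timedelta(minutes=moment.minute, seconds=moment.second)
    return since_hour <= tolerance or timedelta(hours=1) - since_hour <= tolerance


# Hypothetical sample data: one check-in per minute, with a silence after 13:00.
checkins = [
    datetime(2024, 1, 1, 12, 58),
    datetime(2024, 1, 1, 12, 59),
    datetime(2024, 1, 1, 13, 0),
    datetime(2024, 1, 1, 13, 4),
    datetime(2024, 1, 1, 13, 5),
]

gaps = find_gaps(checkins, max_gap=timedelta(minutes=1))
for start, end, gap in gaps:
    print(f"silent from {start} to {end} ({gap}); near the hour: {starts_near_hour(start)}")
```

If most of the flagged gaps start near the hour, a scheduled job such as a backup or snapshot becomes a more likely suspect.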
Is your missed-check-in alarm set too sensitively (and what monitor are you using, exactly)? How frequently do agents check in?
We check in every 30 seconds but only alarm if an agent misses checking in for 3 minutes, which avoids most false triggers and also generally avoids false-triggering on a simple reboot (long reboots, e.g. due to patching processes, typically show up as a short outage).
Currently it is set to check in every 1 minute, with an alarm set to go out if the agent is offline for more than 1 minute and a rearm time of 1 hour. This is the first time I have seen this message since I started working at this location, and when I talked to the senior tech, he mentioned that he hasn't seen this type of false positive before.
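To compare the two configurations mentioned in this thread, here is a small sketch of how much slack each one leaves. It assumes the alarm fires as soon as the time since the last check-in strictly exceeds the threshold; the actual monitor may evaluate this differently, and the function name `missed_checkins_tolerated` is made up for illustration.

```python
from datetime import timedelta


def missed_checkins_tolerated(interval, threshold):
    """Consecutive check-ins that can be skipped before the alarm fires.

    Assumes the alarm fires once silence strictly exceeds the threshold.
    If k check-ins are skipped, the next one arrives (k + 1) intervals
    after the last successful one, so we need (k + 1) * interval <= threshold.
    """
    # Floor division of two timedeltas yields a plain int.
    return max(0, threshold // interval - 1)


configs = {
    "30 s interval, 3 min threshold": (timedelta(seconds=30), timedelta(minutes=3)),
    "1 min interval, 1 min threshold": (timedelta(minutes=1), timedelta(minutes=1)),
}

for label, (interval, threshold) in configs.items():
    print(f"{label}: tolerates {missed_checkins_tolerated(interval, threshold)} missed check-in(s)")
```

Under these assumptions, the 30-second/3-minute setup tolerates five consecutive missed check-ins, while the 1-minute/1-minute setup tolerates none, so even a brief delay to a single check-in could be enough to trip the alarm.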