I have an agent installed on a VEEAM Windows 2016 server. We are currently experiencing issues with this server where the VM becomes completely unresponsive. I can't console to the server and have to hard power it to get it back online. As far as I can tell, the VM is NOT sending heartbeats to the VSA, but the VSA is reporting that it is online and checking in. No matter how long the server is in this state, the VSA never alerts. I checked the Kaseya agent logs, and they seem to stop recording around the time the VM enters this state.
I have opened a ticket with Kaseya Support and they said that the server must still be sending heartbeats, but I can't find any evidence of this. I asked if there is a log of this, and they said it should be in the agent log. When I pointed out the lack of entries, they said that it is most likely still sending heartbeats but not recording them.
My questions to you guys are:
Is there another way that you guys can think of to monitor for this state? The NIC still replies to pings, so I cannot simply ping from another server.
Is there another place for me to check if the heartbeats are (or are not) coming in?
We have also experienced this on Windows 10 Enterprise endpoints. When this happens for us, we need to manually power the machine off, then back on again.
Here's how I have diagnosed these kinds of issues:
I wrote a script that writes a timestamp to the registry every minute. Before writing to the registry, it reads the existing value and determines whether more than 1 minute has elapsed since the last write (use cTime values for easy calculation). If it has been longer than 1 minute, it converts the stored cTime value to a DateTime string and writes it, along with the current time, to a log. I use a difference of 65 seconds in the calculation to provide a little leeway.
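For anyone who wants to try the same idea without Kixtart, here is a rough Python sketch of the check-then-write loop described above. It is not the original script: it stores the timestamp in a small file rather than the registry, and the names `check_and_update`, `run_forever`, `heartbeat.stamp` and `heartbeat_gaps.log` are made up for illustration.

```python
import time
from datetime import datetime
from pathlib import Path

# 60-second interval plus 5 seconds of leeway, as described in the post.
MAX_GAP_SECONDS = 65


def check_and_update(stamp_file, log_file, now=None):
    """Record `now` as the latest timestamp; log a gap if the previous one is too old.

    Returns the gap in seconds if one was logged, otherwise None.
    Assumption: the timestamp lives in a plain file, not the registry.
    """
    now = int(time.time()) if now is None else int(now)
    stamp_path = Path(stamp_file)
    gap = None
    if stamp_path.exists():
        last = int(stamp_path.read_text().strip())
        elapsed = now - last
        if elapsed > MAX_GAP_SECONDS:
            gap = elapsed
            # Log both the last time the system was alive and the current time.
            with open(log_file, "a") as log:
                log.write(
                    f"gap of {elapsed}s: last alive {datetime.fromtimestamp(last)}, "
                    f"resumed {datetime.fromtimestamp(now)}\n"
                )
    stamp_path.write_text(str(now))
    return gap


def run_forever(stamp_file="heartbeat.stamp", log_file="heartbeat_gaps.log"):
    """Check once a minute; meant to be wrapped as an auto-start service."""
    while True:
        check_and_update(stamp_file, log_file)
        time.sleep(60)
```

Calling `run_forever()` from whatever wrapper runs it as a service gives the every-minute behaviour; `check_and_update` takes an optional `now` value so the gap logic can be exercised without waiting.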
I've used SrvAny to run this as a service - auto start.
The log reports any gaps, such as when the system sleeps or hibernates. If the system crashes and I then manually restart it, the log reports the last time that the system was functional enough to update the timestamp. If the last good time corresponds roughly with when we power-cycled the system, then the system was operational but the console interface had failed. That would explain why VSA doesn't show it offline. If, on the other hand, you can prove that the system stopped responding at hh:mm but the agent didn't report offline, you have some hard evidence to take to Kaseya support.
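The comparison in the previous paragraph can be written down as a tiny check. This is only an illustration: `classify_hang` is a made-up name, both times are assumed to be in epoch seconds, and the 120-second tolerance for "corresponds roughly" is a guess you would tune yourself.

```python
def classify_hang(last_alive, power_cycled, tolerance=120):
    """Compare the last logged timestamp with the time of the power cycle.

    Both arguments are epoch seconds. `tolerance` (an assumed value) is how
    close the two must be to count as "roughly the same time".
    """
    if abs(power_cycled - last_alive) <= tolerance:
        # Timestamps kept updating until the power cycle: the OS was still
        # running and only the console/interactive side had failed.
        return "os_running_until_power_cycle"
    # Timestamps stopped well before the power cycle: the OS itself had
    # stopped, so the agent should have been reported offline.
    return "os_stopped_before_power_cycle"
```

The second result is the case worth taking to support, since it gives a concrete time at which the system stopped while the VSA still showed it online.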
I used Kixtart (www.kixtart.org) to create this script, and I've published some time-calc functions there in the online forums - TimeDiff() [calculates elapsed seconds between two timestamps] and TimeConvert() [converts a timestamp to cTime and vice-versa]. cTime is just the number of seconds since a standard date (epoch), so it's easy to calculate elapsed times. With these two functions and Kixtart's ability to directly read/write the registry, it should be pretty easy to put something together.
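To show why epoch seconds make the arithmetic easy, here is a small Python sketch of the same two ideas. These are not the Kixtart TimeDiff() and TimeConvert() functions and do not copy their signatures; the names `to_ctime`, `from_ctime` and `time_diff`, the timestamp format, and the choice of the Unix epoch in UTC are all assumptions for the example.

```python
from datetime import datetime, timezone

# Assumed timestamp layout for this example, e.g. "2024/01/01 00:01:05".
FORMAT = "%Y/%m/%d %H:%M:%S"


def to_ctime(stamp):
    """Convert a timestamp string to seconds since 1970-01-01 00:00:00 UTC."""
    dt = datetime.strptime(stamp, FORMAT).replace(tzinfo=timezone.utc)
    return int(dt.timestamp())


def from_ctime(seconds):
    """Convert epoch seconds back to a timestamp string (UTC)."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc).strftime(FORMAT)


def time_diff(start, end):
    """Elapsed seconds between two timestamp strings."""
    return to_ctime(end) - to_ctime(start)
```

With both times as plain integers, the "more than 65 seconds" test is a single subtraction and comparison.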
Thanks for that gbarnas, that all makes sense. When I get some time I might try to leverage this.