tl;dr (for those who don't like reading essays): the Agentmon.exe service carks it and stops reporting in to the VSA, but is still noted as running. The server shows as "offline" until the agent is restarted. Ideas?
Firstly, this matter is with support, but it's an ongoing issue I'm pushing to resolve. Secondly, I believe someone in the world has had this problem, or currently has it, which is why I'm asking the forums.
The Kaseya Agent service drops offline at ALWAYS random times; the server remains operational. IMPORTANT: in the services.msc console the service is still RUNNING even while it is not checking in to the VSA. I have to restart the service to get it to report in to the VSA again. This happens across all customers and can affect any server.
Support said the agent is fine and that the customers may be experiencing network contention; then they washed their hands of it.
My question is: if network contention causes a service on the server to fail, how do I go about simulating the crash? If I can reproduce the crash by pumping the agent full of traffic, then I can monitor it and anticipate when the agent is going to fail.
While it is with support, I'd like to hear from anyone who has theories:
1. What causes a service on a server to stop (other than not being in use)?
2. What tool can I use (if it is network load) to crash a service on purpose, to simulate the error and work towards prevention?
Other things I will be doing:
1. Historically, Agentmon.exe page faults A LOT; I will check what the page faulting looks like at the times of the event.
2. I will monitor Event Viewer > Application; I have seen many event log entries for Agentmon.exe, but nothing that alludes to the problem (to my knowledge).
3. I will run Wireshark captures to see if there's anything "obvious", but I am not 100 percent sure what to look for.
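For the page-fault check in step 1, counters exported from perfmon/typeperf can be post-processed with a small script to line spikes up against the drop-off times. A rough sketch, assuming the samples are already loaded as (timestamp, value) pairs; the 3x-median spike threshold is my own guess and would need tuning:

```python
# Flag sample windows where a counter spikes far above its typical level.
# Input: (timestamp, value) pairs, e.g. "Page Faults/sec" for Agentmon.exe
# exported from perfmon or typeperf. factor=3.0 is an assumed threshold.
from statistics import median


def spike_windows(samples, factor=3.0, floor=1.0):
    """Return timestamps whose value exceeds factor * median (at least floor)."""
    values = [v for _, v in samples]
    threshold = max(factor * median(values), floor)
    return [t for t, v in samples if v > threshold]
```

Cross-referencing the returned timestamps with the agent's offline alerts would show whether page faulting actually correlates with the failures.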
The impact is massive for us; we guarantee uptime (to a degree). When customers see server uptime of 50 percent across random servers, they ask why. Telling them the agent is flaky isn't good enough for them, and rightly so.
This is my biggest complaint with Kaseya. Every time I complain, they want me to upgrade everything to agentmon.exe 6.x.x.x, but they have never fixed this problem. I see this with agents that we are actually doing "stuff" on (heavily scripted), and I see this on agents that are simply monitoring up/down (no KES, no updates, nothing except agent status), so I doubt it is caused by an overload of the agent. I don't really have any good theories, but I know that there was once upon a time a KSVC Remote Restart script released (by Ben L, I think) in a K2 script pack. Sometimes it's good for restarting the service from another agent, sometimes it's not. I've tried sc.exe scripts to no avail. I've added 4GB RAM to both the frontend and backend of the VSA (total of 8GB on the front, 16GB on the back, 13GB for the db). It's extremely frustrating to get "server offline" alerts in the middle of the night, only to discover that everything is running fine and only Kaseya is broken. Sorry, but I really can't contribute anything worthwhile other than "I feel your pain". Hopefully Kaseya will get this fixed SOON.
I'll throw my 2 cents in, and an "I feel your sleepless-night pains" too. I hear, or read rather, much about the Kaseya setup, and after having done a couple of hardware upgrades and an OS upgrade I started to look elsewhere. What I've noticed is that this happens (for me at least) when the server in question is under a lot of stress, be it high CPU or RAM usage. So normally I'll get these when servers are doing backups or some heavy number crunching; things like database and Exchange servers that are on the cusp of being underpowered tend to do it more often than the more robust machines out there.
This is just what I've noticed; it may not be correct, but it's something to look at. If you notice something like that, chime in and please let me know that I'm not as crazy as everyone thinks I am...
Just getting back to you all; thanks for taking the time to reply.
The issue is still ongoing for us. Kaseya asked us to add exceptions to the antivirus across all customers, and it looks like it might be doing the trick: I did it at one site and haven't had a fall-over since. I rolled it out at another customer's site and they are still having agents drop offline.
Just as a follow up / 2 cents worth:
Servers with Sophos on them haven't had the problem.
Servers with McAfee have definitely had the problem.
Servers with Microsoft Forefront protection have had the problem.
If you guys / gals are experiencing this issue, could you post back with the AV on the servers / workstations you are taking care of? Perhaps we can get community consensus on: 1. whether it is the cause, and 2. if it is, which AV does it and how / where to create the exclusions.
This is a very important thread and we need to keep this alive as long as possible. Keep posting everything you might find.
I get this as well, but it's more an annoyance than anything for us. Interestingly, I have one server that seems to fail like this almost every day. I'm not sure why this server is worse than others, but at least I have a machine that will reasonably reliably fail. I got around it by restarting the service every morning at 6am. Ugly, yes. But it hasn't come back.
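For what it's worth, that daily restart can also be driven by a small watchdog instead of a fixed 6am task. This is only a sketch: the service name "KaseyaAgent" is a placeholder (check services.msc for the real one on your build), and the parsing assumes sc.exe's usual `sc query` output format.

```python
# Watchdog sketch for the restart workaround: query the service state with
# sc.exe and bounce it if needed. "KaseyaAgent" is a hypothetical name --
# substitute whatever services.msc shows for your agent install.
import subprocess


def parse_sc_state(sc_output):
    """Pull the STATE name (e.g. 'RUNNING', 'STOPPED') from `sc query` output."""
    for line in sc_output.splitlines():
        if line.strip().startswith("STATE"):
            return line.split()[-1]
    return None


def restart_service(name="KaseyaAgent"):
    """Stop (best-effort) and start the service via sc.exe (Windows only)."""
    subprocess.run(["sc", "stop", name], check=False)
    subprocess.run(["sc", "start", name], check=True)
```

Since the whole problem here is that the service shows RUNNING while not checking in, the state check alone isn't enough; you'd pair it with your own staleness signal (e.g. last check-in time from the VSA) before deciding to restart.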
Kaseya REALLY needs to find the reason for this. It's a difficult problem for them to solve, but it clearly affects many, many people.
Keep the info coming.
A couple of things. Thanks for the input, jdvuyk.
I have put this thread as part of my ticket to Kaseya. I would have thought, with Kaseya's connections to Microsoft, Sophos, etc., that it wouldn't be an issue OOB (out of the box).
My other thought is that if the server is experiencing high network load, it shuts the service down (backups running at the same time Kaseya is running, etc.).
I am really not keen on band-aid fixes for something that's pretty clearly an issue. Whether the agent has a bug or there are other things beyond Kaseya's control, it's something that's affecting everyone, so Kaseya should be concerned.
I noticed that everyone on this thread is among the people who are most vocal in the forums, and who often give the most informed replies to any issue.
I mentioned in a previous thread that a current developer blog / current-issues log would be good to have available, so we know what is going on. Maybe take selected people from the forum to feed back the most current issues users are experiencing to a dev group / developers' blog, similar to the user groups here in Australia, but at a forum level.
The whole "submit a feature request, and if enough people do it we will put it on the list" doesn't work; people are way too apathetic to do that, myself included sometimes.
If the top 5 or top 10 forum contributors had some form of legitimate clout to get these "top issues" looked at, it'd be nice.
I am well aware of most everything the community suffers pain with; I spend probably 2 to 3 hours on these forums a day, and they are invaluable. The next step is access to Kaseya staff to get the pain across succinctly, separate from raising a ticket.
Anyway, another rant over, let me know your thoughts.
I've had this as well. Agents will show off-line but are up and running. Checking the Trend Micro logs, I do see that they're being blocked by Trend. I can also cause this issue by making a change in, say ... event sets for monitoring. The agent will go off-line and the Trend logs will show it being blocked. I've set exclusions and the agent stays up now.
Here's a list of my current exclusions if it helps anyone:
Note that we use 'c:\support' as our agent temp folder. If you don't exclude that, a vbscript may trigger a virus alert.
Have you set any perfmon counters on the servers in question? I'm wondering if there's a way to prove that high CPU is the cause. I've got a few servers that trigger high-utilization load alerts for 2 hours at a time, but my agents don't drop offline. I get the high load alerts during backups and SQL maintenance (when the customer runs HIS SQL cleanup scripts).
Yes, I do have perfmon watching CPU (individual as well as _Total), same with RAM, on all the servers. I'm not sure I would go to court with what I'm finding; I just notice the trend on some of the systems that we watch. And I would agree with you, the higher loads do show up during backups and SQL/Exchange maintenance cycles. The trend I'm seeing is for servers that are ~5 years of age and tend to live in the 60%-load-or-higher category.
@thirteentwenty, would setting the agentmon.exe priority higher on those servers allow that process to stay active and connected during the high-use times? Agentmon.exe doesn't really use a ton of resources, especially just for check-ins.
Just to mix it up a bit. We have only 1 machine that uses Trend Micro, and it's pretty much rock solid. All our machines use NOD32 and appear to be about as reliable as everyone else's here. I think the AV route might be a bit of a red herring. The CPU load (or something similar) is probably more likely, IMO.
@jdvuyk - in my case I have virus logs showing agentmon.exe being blocked. After adding exclusions I now get a full night's sleep without interruptions. Obviously you would know more about your products and systems than I. If you believe the CPU load is the issue, I'd recommend checking perfmon and matching that up with your agent logs for off-line status. Also, I don't think it would hurt anything to add AV exclusions for Kaseya; it would rule out AV blocking your agents.
@danrche, I haven't tried that yet as we're working on getting those servers replaced.
@danrche, valid point. I agree that we should all look at doing this as a matter of course to see if we see any changes.
Just tested another AV: McAfee, version 2009. (Don't ask; I'm trying to upsell. Yes, I know what is on there is like having no AV at all.)
Anyway, I don't know if 2010 / 2011 does this, but the version my customer has has no central console.
I can only exclude EXEs, not folders, so excluding Agentmon.exe will just have to do for the time being.
Good list going here. I am slightly worried that there are factors other than AV at play, like danrche mentioned.
I will look at setting agentmon.exe's priority soon, after seeing whether this latest set of exceptions has helped.