Kaseya Community

Agent service failing: Server guru advice required.

  • jdvuyk

    @danrche, valid point. I agree that we should all look at doing this as a matter of course and see whether we notice any changes.

    Make sure you set ALL exclusions. I know Trend also has a behavior monitoring section that's a big issue, blocking AgentMon.exe when it tries to access certain parts of the system or perform certain actions. AV in general is a pain in the butt for me.
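    For anyone not sure exactly which file and folder to exclude, here's a rough Python sketch (not from Trend or Kaseya documentation, just stock Windows tooling) that asks WMI where the running agent process lives. It assumes the process is named AgentMon.exe - check Task Manager if your build differs.

        # Print the full path of the running Kaseya agent executable so you
        # know which file and folder to add to your AV exclusion lists.
        # Assumes the process name is AgentMon.exe.
        import subprocess

        result = subprocess.run(
            ["wmic", "process", "where", "name='AgentMon.exe'", "get", "ExecutablePath"],
            capture_output=True, text=True, check=False)
        print(result.stdout)  # exclude this file and its parent folder in the AV console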

  • I just opened a ticket with K support on this very same issue (and then I saw this thread). One client is using Sophos, and two of their ten servers are giving us false alerts. One of the servers that constantly gives us false alerts is an Edge Transport server and the other is just a plain file/print server. The other client has 12 servers running AVG, and only one server is giving these alerts - it's an old box that only serves printing for one network printer with a few users remaining. We tried the route of AV exclusions, but to no avail. CPU/memory utilization is not a problem for us either.

    I think there is already a consensus that this issue is rampant and that server role/environment doesn't really point to the real cause. Why doesn't Kaseya start really digging into this, since it has been going on since previous versions? Just my thoughts (frustrations).

  • I just wanted to add that I see this problem too, but it doesn't seem to be as rampant as it is for other folks.  I would estimate seeing it about 2-3 times a month.

    It is quite disappointing to have to call a client and schedule a time to manually restart their server because the Kaseya agent stops reporting and the service will not stop and restart.

  • I just scheduled one this morning (a reboot of an otherwise perfectly operational server because the Kaseya Agent service crashed).  2003stdSP2.  Xeon dual core, 3.5GB RAM.  AVG 9 (KES) with Kaseya exceptions.  Hosting Exchange and SQL (SAGE).  When the agent went "offline", I remoted in from another server and tried to restart the service, but now it's hung in a "stopping" state.  Reboot scheduled for 5pm.

  • Just for the record, you don't need to reboot the server. Just stop the "agentmon" process on the server and restart the service. This at least makes it painless for the client, though no less painful for the MSP!
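    If anyone wants to script that workaround rather than remote in each time, here's a minimal Python sketch (run as an administrator; the process name AgentMon.exe and the service name "Kaseya Agent" are assumptions - check what your install actually uses in Task Manager and services.msc):

        # Kill the hung agent process, then start the service again.
        import subprocess

        # Force-kill the hung agent process (same effect as TSKILL / Task Manager).
        subprocess.run(["taskkill", "/F", "/IM", "AgentMon.exe"], check=False)

        # Start the agent service back up; the service name varies per install.
        subprocess.run(["net", "start", "Kaseya Agent"], check=False)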

  • My tuppence worth.

    This is a problem we've just learned to live with using Kaseya. If I look at my list of offline servers just now, there are two 'awaiting customer reboot' because the techs simply can't get them back online without resorting to this.

    @jdvuyk - You've been lucky if you've gotten away with just restarting the service. It's the first thing the guys try; then they try TSKILL, which can give the odd success. But if those don't work, the only resolution is to bounce the server, which, as has been suggested above, is a total pain.

    We have a fair mixture of AV products, mostly Sophos, then Trend, and a mix of everything else. We've looked into which is the most likely culprit, but there is no discernible pattern as to which AV product is most likely to be on the problem machine.

    I also don't think high CPU usage is a factor, as the server we had the worst issues with was a terminal server that saw only occasional use - I'd be surprised if that CPU hit 2% on a busy day. That particular machine stopped checking in, so it was rebooted; it still wouldn't check in, so the agent was uninstalled and reinstalled. It would stay on for the day, then overnight it disconnected and the process started again. The customer ended up agreeing that, as this machine was used so little, we could leave it off. Then one day it started checking back in and it's never been offline since <sigh>.

    Doesn't really shed any light on the situation but thought I'd share my woes - darn I feel so much better about it now - no, not really, it's a pain we've learned to live with.

  • Hi All,

    I had been on this with a good support staff member from Kaseya (they are all good, this guy's just ace).

    He suggested a few things, some of which were noted at Kaseya conferences in the past, but many of us (myself included) never took note of them.

    1. Check your agent procedures and when they are going to run. AUDITING is intense, even if it's a minor daily audit. By default, a basic audit is run every 24 hours. Perhaps coincidentally, we had a lot of drop-outs at the times these audits were running. Instead of the audits being run daily at 9:00 AM, I changed them to once a week at 10:00 PM. Think carefully about your auditing. At this point in time I couldn't care less about the information auditing brings, because we only monitor servers, and nothing in the auditing information changes or concerns us.

    2. When you are next due to restart the agent service, before you do, check and make a note of any agent procedure that is set to run. If it is the same one each time, there could be something wrong with it, or Kaseya may be unable to process the command. Make a note and send it to a specialist.

    3. Firewalls and network cards. Kaseya needs a persistent TCP connection (or something to that effect). Apparently different firewalls can drop the connection if it has been idle for a while (that's my understanding - KASEYA STAFF, WHERE ARE YOU ON THIS?). Anyway, apparently you can increase the TCP connection timeout value on most good firewalls - see the sketch after this list. I am in the process of checking this.

    4. Update VMware Tools if you are running VMware.

    5. Disable power management on your network cards (see the sketch after this list). Some vendors' NICs will shut the card down to save power, and that's never going to play nice with the agent service.
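    For items 3 and 5, here's a hedged Python sketch of what the host-side changes can look like (these are standard Windows registry locations, nothing Kaseya-specific; run as an administrator, reboot afterwards, and treat the numbers as starting points - the adapter subkey "0001" in particular is just an example and differs per machine):

        import winreg

        # Item 3: send TCP keep-alives more often than a typical firewall idle
        # timeout so the agent's connection to the KServer never sits silent long
        # enough to be dropped. Windows default is 2 hours; this sets 5 minutes.
        tcp_params = r"SYSTEM\CurrentControlSet\Services\Tcpip\Parameters"
        with winreg.CreateKeyEx(winreg.HKEY_LOCAL_MACHINE, tcp_params, 0,
                                winreg.KEY_SET_VALUE) as key:
            winreg.SetValueEx(key, "KeepAliveTime", 0, winreg.REG_DWORD, 5 * 60 * 1000)  # ms

        # Item 5: PnPCapabilities = 0x18 on the adapter's class subkey is the
        # documented way to untick "Allow the computer to turn off this device
        # to save power". Find your adapter's subkey (0000, 0001, ...) first.
        nic_key = (r"SYSTEM\CurrentControlSet\Control\Class"
                   r"\{4D36E972-E325-11CE-BFC1-08002BE10318}\0001")
        with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, nic_key, 0,
                            winreg.KEY_SET_VALUE) as key:
            winreg.SetValueEx(key, "PnPCapabilities", 0, winreg.REG_DWORD, 0x18)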

    Hope these things help define a comprehensive list of things that affect the agent service.

    So far I have gone 24 hours without a drop-out; that is a record for us :) The only things I really actioned were items 1, 4, and 5.

    Let me know your thoughts

  • I've always known that the auditing procedures were intense, as I've pretty much brought down networks by messing up the audit times. I didn't think it would be enough to take the agent offline, so this is good to know. And it also goes pretty much hand in hand with seeing agents go offline on older servers that are ready for (read: scheduled for) replacement.

    In your point number 2, you state "restart the agent service" - do you mean just the service or the machine itself? In either case, if I'm understanding it correctly, you state that the scheduled procedure times should change. Is this the case for all scheduled procedures, just those that are tardy so to speak, or just the next to run?

  • Sorry about point 2. To clarify, next time the agent drops offline but the server is still running, you either:

    A: Restart the service or
    B: Restart the server

    Basically, before you do either A or B, check in Kaseya which procedures were next due to run.
    I am just following what the Kaseya guy told me. All you can really do for point 2 is note what the procedure is and follow it up with Kaseya.

    I think what Kaseya is telling me is that we should schedule any and all procedures at different times; even the built-in Kaseya procedures look like they have the potential to crash the service.
    I think we are on the same page, I just tried to clarify point 2 better :)

    Hope that explained things a little better :) 

    I am definitely not letting this whole thing slide; every piece of info I get from Kaseya will be going here :)

  • Ahh, thank you for clearing that up. I know that things sometimes don't come out as well as they should the first time...

    As for what I did to clear things up on our side, it pretty much coincides with what you've stated and learned from Kaseya, but I've also stepped back on some of the monitoring. I still get an occasional offline, maybe once a week or so, but again, I cannot stress this part enough: these are servers that are very old and very underpowered, and they have been slated for replacement.

    Oh, there is one agent that I can drop almost on demand - I would say 90% of the time that I try, it'll go down. It's a virtual 2k8 SBS machine that's pretty short on horsepower, but I suspect there are other things ailing that machine as well... If you can find one that you can down on demand... just remember April Fools is coming up... the techs you work with will love you for it!

    Now I'm off to try and figure out why there are some agents on workstations that don't check in... I suspect it's for the same, or similar reasons.

  • Just had a false alarm on a beefy domain controller that's not doing anything other than file server and Active Directory. AVG (KES) with appropriate Kaseya exceptions. Absolutely no scripts scheduled to run at the time of the crash. No backed-up scripts pending. No scripts running at the time. 32GB of RAM, 2003R2x64SP2, 8-core Xeon. I remoted in using my backup plan (a pinned session in Bomgar), restarted the Kaseya Agent service, and now it's back online. This server is not underpowered, it's overpowered. No scripts are even remotely timed close to the time of the crash. The AV is KES with correct exceptions. I think this might put a couple of holes in the theories above. I do have the Kaseya-recommended monitoring sets on the frontend and backend of the KServer and have not gotten any SQL or MessageSys alerts recently.

  • I just wanted to point out that the comment "This server is not underpowered, it's overpowered." is the biggest understatement I have seen so far this year! :)
    Seriously, good stuff. This is the sort of thing we need to see.

  • There's always gotta be that one guy... /joke...

    I guess it's back to the drawing board... and if you've got some extra overpowered servers please send them my way!

  • LOL.  Customer bought them before they had us as an MSP, but who's to complain if you've got the $$$ to buy something like that...

  • I love spanners in the works! That's pretty upsetting. I think any combination of the above events could cause it to fail.
    Usually, when any number of events can cause an application to crash, it's a bad program.

    Enabling exceptions didn't really work for me; Forefront Online Protection was what we were using on one site that was problematic.
    Were there any backups happening at that site at that time?

    Right now I am at a loss. Stopping daily auditing worked for us, but I'm still not satisfied it won't happen again.

    The only things that remain untested are network congestion and the TCP keep-alive settings in the various firewalls out there.

    It's a frivolous exercise for me because my sites have moved back to being reliable, but is anyone here familiar with Wireshark?
    If you could set up a Wireshark capture to run for 24 hours, only capturing traffic on port 5721, then save and attach the capture here, we could all look at what happened when the service stopped working.
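    If it helps, here's a rough sketch that drives that capture with tshark (installed alongside Wireshark) from Python; the interface name and output path are just examples, so run "tshark -D" first to see which interfaces the box actually has:

        # Capture 24 hours of Kaseya agent check-in traffic (TCP 5721) to a file
        # we can attach here and compare against the time the service died.
        import subprocess

        subprocess.run([
            "tshark",
            "-i", "Local Area Connection",    # capture interface (example name)
            "-f", "tcp port 5721",            # capture filter: agent check-in port
            "-a", "duration:86400",           # stop automatically after 24 hours
            "-w", r"C:\temp\kaseya-5721.pcapng",
        ], check=True)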

    Kaseya doesn't seem massively willing to do that.
    Lastly, could I get a quick gauge of what time of day this usually happens for people, or is it random?

    I think if I had a Wireshark capture plus MAYBE some stats on memory use at the time of failure, we'd finally have something solid to work with.

    I also have a suspicion that Data Execution Prevention may have something to do with it. Has anyone here created a DEP exception for the Kaseya Agent before?
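    For anyone who wants to test the DEP theory, here's a hedged sketch of the first steps (the WMI property and bcdedit switch are standard Windows; bcdedit applies to 2008 and later, on 2003 the equivalent is the /noexecute switch in boot.ini, and the per-executable exemption for AgentMon.exe itself is then added under System Properties > Performance Options > Data Execution Prevention):

        # Check the current DEP policy, then switch to OptOut so individual
        # programs such as the agent can be exempted. Run as administrator;
        # the bcdedit change needs a reboot to take effect.
        import subprocess

        # 0 = AlwaysOff, 1 = AlwaysOn, 2 = OptIn (default), 3 = OptOut
        subprocess.run(["wmic", "OS", "Get", "DataExecutionPrevention_SupportPolicy"],
                       check=False)

        subprocess.run(["bcdedit", "/set", "nx", "OptOut"], check=False)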