Kaseya Community

Issues with KServer stopping

  • Hi guys,
    We have been having an issue for the past 2 weeks with the following:
    Losing the ability to remote control
    All agents going offline
    KServer stopping

    It has happened every evening for the last 4 evenings and has also happened during business hours.

    We have logged an incident with Kaseya, and after a very long 2 weeks of emails looking for updates and escalating the issue, we finally got a decent response yesterday: they are putting it down to our server being under-specced. That is fine, as we are getting a new server in the next few days, but I am still concerned that the issue will remain after we get the new server.

    It's not like we have increased our agents dramatically or increased the load on the server dramatically in the last 2 weeks. I would understand if we had put a pile of agents on and then started having issues, but everything was working fine up to 2 weeks ago. Our issues started happening around the same time that we added a client with a couple of Windows 2008 servers. We now have 2 clients with Windows 2008, and one of our engineers has noticed that he has been remoted into one of these servers a few minutes before all our problems happen. A reboot of the server resolves the issues, but I'm a bit concerned that we're not getting all of our alerts while the issue is occurring and we could miss important alerts. Has anyone else had any similar problems?

    Our spec is
    KServer
    Windows 2003 Standard Edition SP2
    single quad-core (4cores) 1.8ghz
    3gb RAM
    1x136gb Disk

    We are running 3341 agents

    Legacy Forum Name: Issues with KServer stopping,
    Legacy Posted By Username: lornacummins
  • Wow... for 3000 plus agents that's a light config of hardware.

    My first thoughts are to check and see if someone rescheduled your scans to run around that time of day (look at the distribution under the scripts tab).

    My second thought is that with 3000+ agents I'd recommend you look at upgrading your server to the following:

    2 procs (QC)
    15K SAS drives (raid 5 or better)
    16GB of RAM (meaning Win 2K3 64-bit)

    Then you should see more consistent performance over the next 3 years (at least). Frankly, I'm really surprised you haven't bumped your head before.
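
    The first suggestion above (checking how scheduled scans are distributed across the day) can be sketched roughly as follows. This is only an illustration: the `schedule` list is made-up sample data, and how you export run times from your own server is not covered here.

```python
from collections import Counter
from datetime import datetime


def busiest_hours(run_times, top=3):
    """Count scheduled runs per hour of day and return the busiest hours."""
    per_hour = Counter(t.hour for t in run_times)
    return per_hour.most_common(top)


# Hypothetical schedule export: one datetime per scheduled scan or script.
schedule = [
    datetime(2010, 1, 4, 18, 5),
    datetime(2010, 1, 4, 18, 20),
    datetime(2010, 1, 4, 18, 45),
    datetime(2010, 1, 4, 2, 0),
    datetime(2010, 1, 4, 13, 30),
]

for hour, count in busiest_hours(schedule):
    print(f"{hour:02d}:00  {count} scheduled runs")
```

    If one hour holds most of the runs, that is the window to spread out first.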

    Legacy Forum Name: How-To,
    Legacy Posted By Username: boudj
  • I would guess it's the load as well. As boudj mentioned, I would check the distribution of scripts, etc., and at least look at more disk I/O.

    Legacy Forum Name: How-To,
    Legacy Posted By Username: Coldfirex
  • Our server also stopped for the first time ever the other day. Not sure if it is related, but it was very strange.

    We only have 1500 agents and never had an issue before.

    I am monitoring the situation now to see what happens.

    Legacy Forum Name: How-To,
    Legacy Posted By Username: mmartin
  • I got an alert email earlier this week saying our server had stopped. Got it at about 3 in the morning. Have never had an issue before and the server was running just fine when we got in at 7. I have been monitoring it and haven't seen anything abnormal since.

    Legacy Forum Name: How-To,
    Legacy Posted By Username: timbuktech
  • Stay away from RAID 5. Go RAID 1+0 for performance, so at least 4 x 15K SAS drives.
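
    A back-of-the-envelope sketch of why RAID 1+0 tends to beat RAID 5 on write-heavy loads, using the commonly quoted write penalties (4 back-end I/Os per write for RAID 5, 2 for RAID 1+0). The per-drive IOPS figure and the 50% write mix are assumptions for illustration only, not measurements from any Kaseya server.

```python
def effective_iops(drives, iops_per_drive, write_fraction, write_penalty):
    """Estimate host-visible IOPS for an array under a mixed read/write load."""
    raw = drives * iops_per_drive
    read_fraction = 1.0 - write_fraction
    # Each host write costs `write_penalty` back-end I/Os; each read costs one.
    return raw / (read_fraction + write_fraction * write_penalty)


DRIVES = 4
IOPS_PER_DRIVE = 175   # assumption: rough figure for one 15K SAS drive
WRITE_FRACTION = 0.5   # assumption: half of all I/O is writes

raid5 = effective_iops(DRIVES, IOPS_PER_DRIVE, WRITE_FRACTION, write_penalty=4)
raid10 = effective_iops(DRIVES, IOPS_PER_DRIVE, WRITE_FRACTION, write_penalty=2)

print(f"RAID 5   : ~{raid5:.0f} IOPS")
print(f"RAID 1+0 : ~{raid10:.0f} IOPS")
```

    The gap widens as the write fraction goes up, which matters for a database that is written to constantly.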

    Legacy Forum Name: How-To,
    Legacy Posted By Username: smbtechnology
  • So did anyone else have the issue continue? I am still concerned that it's not just the spec of the server. I've stopped nearly everything now except a few event log monitors for the backups: all my AV scripts, LAN watches, audits and patch scans are stopped, and the KServer service still just keeps increasing in memory until it gets up to about 2GB, then the server runs out. I suppose we just have to wait for the new server to come in and see if it keeps happening.

    Legacy Forum Name: How-To,
    Legacy Posted By Username: lornacummins
  • If the kserver process is actually using 2GB of RAM then yes, it will cause a problem, because unless you're using the /3GB switch in boot.ini, 2GB is the largest address space available to an application on a 32-bit system.
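
    The 2GB ceiling is just arithmetic on a 32-bit address space, sketched below. The function name is made up for illustration. Note that on 32-bit Windows the /3GB switch only helps processes built as large-address-aware, and whether the KServer binary is built that way is not something this thread confirms.

```python
GIB = 2 ** 30


def user_address_space_gib(pointer_bits=32, kernel_reserved_gib=2):
    """User-mode virtual address space left after the kernel's reservation."""
    total_bytes = 2 ** pointer_bits
    return (total_bytes - kernel_reserved_gib * GIB) / GIB


# Default 32-bit Windows split: 2 GiB for the kernel, 2 GiB per process.
default_split = user_address_space_gib()

# With /3GB the kernel keeps 1 GiB, leaving up to 3 GiB per process.
with_3gb_switch = user_address_space_gib(kernel_reserved_gib=1)

print(f"default: {default_split} GiB, with /3GB: {with_3gb_switch} GiB")
```

    Either way, a process that keeps growing will hit the ceiling eventually; the switch only moves it.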

    We had a bad patch of reliability for a while which baffled us until SQL managed to produce an error indicating it was starved for RAM. We upped the server from 4GB to 12GB and have just had the odd intermittent problem since. We have around 3000 users as well.

    regards

    Nick

    Legacy Forum Name: How-To,
    Legacy Posted By Username: nviner
  • Since upgrading to KES2.5 we've had many instances lately of our KServer crashing between 6:00 pm and 7:00 pm. We too have opened a ticket with Kaseya support and were told it was the load on our server. However, they were pointing to daily audits etc. putting too much load on our database server, not the actual server config. I've changed this now so the audit runs only every 7 days, and spaced out all scripts etc., and we are still having problems. I'll reference this thread in my next communication with support.


    I doubt very much it's our server config: 2000 agents, separate KServer (Win32, 4GB) and SQL servers (Win64 / SQL64 / 8GB of RAM) running on high-end virtual infrastructure with fibre-attached SAN.

    Legacy Forum Name: How-To,
    Legacy Posted By Username: camorton
  • Support did try to say it was scripts and audits as well for me, until I stopped them all just to prove it wasn't. I still don't think it's hardware related either. If it were, it would have gradually got slower, not just one day had the process start using up to 2GB of RAM. I must say I am totally annoyed with their support!

    Legacy Forum Name: How-To,
    Legacy Posted By Username: lornacummins
  • This happened again for us last night, and it's now been 7 days since I last heard from support.

    I'm emailing an escalation request to Kaseya top management now. I'm also including a link to this thread and indicating that this problem is affecting at least both of our servers and possibly others as well.

    Legacy Forum Name: How-To,
    Legacy Posted By Username: camorton
  • Same thing happening to me as well. For the past month or so I have been receiving KServer stopped messages. Support checked my specs and they are fine. I've also ensured that all of my scripts and processes are spread out.

    I assume it has something to do with the upcoming Kaseya 6 that will be pushed out some day....

    Legacy Forum Name: How-To,
    Legacy Posted By Username: jahlberg@waident.com
  • Camorton - did you get anywhere with Kaseya? We still haven't implemented our new servers, so I can't really keep on at them until we do that. We should be implementing them this Friday though, and I'll be severely unimpressed if the same issue happens then.

    Legacy Forum Name: How-To,
    Legacy Posted By Username: lornacummins
  • Kaseya is an I/O pig. With the number of agents you have and the hard drive you mentioned, I can guarantee that I/O is an issue. You have only a few solutions:

    1) Get a better disk subsystem. 15K RPM SAS drives in a RAID 10 config is probably a good start (the more the merrier).
    2) Figure out what is going on when the agent stops. Are you running the backup? Bunch of scripts firing off? Defrag?
    3) Try increasing your check-in times. We pushed workstations out to 90 seconds, and I swear it made a difference!

    You can set up some monitor sets to try to see what is going on. Look for SQLServer:Locks: Average Wait Time (ms). Here is a monitor set that we use on the Kaseya SQL box (modify it to reflect your drives):

    <?xml version="1.0" encoding="ISO-8859-1" ?>
    <monitor_set_definition version="1.0">
    <MonitorSet name="Kaseya SQL Server" description='Monitors the Performance of the SQL Server'>
    <Counters>
    <Counter name='SQLServer:General Statistics:User Connections'  description='Displays the number of users connected to the SQL server' counterObject='SQLServer:General Statistics'  counter='User Connections'  counterSampleInterval='1800' collectionOperator='Over'  collectionThreshold='-1' trendTimeSpan='14' trendReArm='1' thresholdOperator='Over'  thresholdAmount='100' thresholdDuration='3600' thresholdWarning='0' thresholdReArm='1'/>
    <Counter name='SQLServer:Buffer Manager: Buffer cache hit ratio'  description='Displays the Cache hit ratio' counterObject='SQLServer:Buffer Manager'  counter='Buffer cache hit ratio'  counterSampleInterval='1800' collectionOperator='Over'  collectionThreshold='-1' trendTimeSpan='14' trendReArm='1' thresholdOperator='Over'  thresholdAmount='99.9' thresholdDuration='3600' thresholdWarning='0' thresholdReArm='1'/>
    <Counter name='SQLServer:Databases: Active Transactions'  description='Displays the number of currently active transactions in the system.' counterObject='SQLServer:Databases'  counter='Active Transactions'  counterInstance='_Total'  counterSampleInterval='1800' collectionOperator='Over'  collectionThreshold='-1' trendTimeSpan='14' trendReArm='4' thresholdOperator='Over'  thresholdAmount='100000' thresholdDuration='3600' thresholdWarning='0' thresholdReArm='4'/>
    <Counter name='SQLServer:Databases: Transactions/sec'  description='Value indicates how active the SQL Server is.' counterObject='SQLServer:Databases'  counter='Transactions/sec'  counterInstance='_Total'  counterSampleInterval='1800' collectionOperator='Over'  collectionThreshold='-1' trendTimeSpan='14' trendReArm='4' thresholdOperator='Over'  thresholdAmount='500' thresholdDuration='3600' thresholdWarning='10' thresholdReArm='4'/>
    <Counter name='SQLServer:Locks: Average Wait Time (ms)'  description='This is the average wait time in milliseconds to acquire a lock. The lower the value, the better. If the value goes higher than 500, there may be blocking going on; we need to run a blocker script to identify blocking.' counterObject='SQLServer:Locks'  counter='Average Wait Time (ms)'  counterInstance='_Total'  counterSampleInterval='1800' collectionOperator='Over'  collectionThreshold='-1' trendTimeSpan='14' trendReArm='4' thresholdOperator='Over'  thresholdAmount='1500' thresholdDuration='3600' thresholdWarning='0' thresholdReArm='4'/>
    <Counter name='SQLServer:Access Methods'  description='null' counterObject='SQLServer:Memory Manager'  counter='Maximum Workspace Memory (KB)'  counterSampleInterval='1800' collectionOperator='Over'  collectionThreshold='-1' trendTimeSpan='14' trendReArm='1' thresholdOperator='Over'  thresholdAmount='10485760' thresholdDuration='3600' thresholdWarning='0' thresholdReArm='14400'/>
    <Counter name='LD CDQL D'  description='null' counterObject='LogicalDisk'  counter='Current Disk Queue Length'  counterInstance='D:'  counterSampleInterval='1800' collectionOperator='Over'  collectionThreshold='0' trendTimeSpan='1209600' trendReArm='3600' thresholdOperator='Equal'  thresholdAmount='-1' thresholdDuration='25' thresholdWarning='10' thresholdReArm='3600'/>
    <Counter name='LD ADQL D'  description='null' counterObject='LogicalDisk'  counter='Avg. Disk Queue Length'  counterInstance='D:'  counterSampleInterval='1800' collectionOperator='Over'  collectionThreshold='0' trendTimeSpan='1209600' trendReArm='3600' thresholdOperator='Equal'  thresholdAmount='-1' thresholdDuration='25' thresholdWarning='10' thresholdReArm='3600'/>
    <Counter name='LD ADWQL D'  description='null' counterObject='LogicalDisk'  counter='Avg. Disk Write Queue Length'  counterInstance='D:'  counterSampleInterval='1800' collectionOperator='Over'  collectionThreshold='0' trendTimeSpan='1209600' trendReArm='3600' thresholdOperator='Equal'  thresholdAmount='-1' thresholdDuration='25' thresholdWarning='10' thresholdReArm='3600'/>
    <Counter name='LD ADRQL D'  description='null' counterObject='LogicalDisk'  counter='Avg. Disk Read Queue Length'  counterInstance='D:'  counterSampleInterval='1800' collectionOperator='Over'  collectionThreshold='0' trendTimeSpan='1209600' trendReArm='3600' thresholdOperator='Equal'  thresholdAmount='-1' thresholdDuration='25' thresholdWarning='10' thresholdReArm='3600'/>
    <Counter name='LogicalDisk C:'  description='LD AD sec/Transfer' counterObject='LogicalDisk'  counter='Avg. Disk sec/Transfer'  counterInstance='C:'  counterSampleInterval='1800' collectionOperator='Over'  collectionThreshold='-1' trendTimeSpan='1209600' trendReArm='3600' thresholdOperator='Under'  thresholdAmount='-1' thresholdDuration='25' thresholdWarning='10' thresholdReArm='86400'/>
    <Counter name='LogicalDisk D:'  description='LD AD sec/Transfer' counterObject='LogicalDisk'  counter='Avg. Disk sec/Transfer'  counterInstance='D:'  counterSampleInterval='1800' collectionOperator='Over'  collectionThreshold='-1' trendTimeSpan='1209600' trendReArm='3600' thresholdOperator='Under'  thresholdAmount='-1' thresholdDuration='25' thresholdWarning='10' thresholdReArm='86400'/>
    <Counter name='LogicalDisk E:'  description='LD AD sec/Transfer' counterObject='LogicalDisk'  counter='Avg. Disk sec/Transfer'  counterInstance='E:'  counterSampleInterval='1800' collectionOperator='Over'  collectionThreshold='-1' trendTimeSpan='1209600' trendReArm='3600' thresholdOperator='Under'  thresholdAmount='-1' thresholdDuration='25' thresholdWarning='10' thresholdReArm='86400'/>
    <Counter name='LogicalDisk F:'  description='LD AD sec/Transfer' counterObject='LogicalDisk'  counter='Avg. Disk sec/Transfer'  counterInstance='F:'  counterSampleInterval='1800' collectionOperator='Over'  collectionThreshold='-1' trendTimeSpan='1209600' trendReArm='3600' thresholdOperator='Under'  thresholdAmount='-1' thresholdDuration='25' thresholdWarning='10' thresholdReArm='86400'/>
    </Counters>
    <Services>
    <Service name='MSSQLSERVER'  serviceDescription='MSSQLSERVER' description='SQL2-M6' restartAttempts='3' restartInterval='60' reArm='43200'/>
    </Services>
    <Processes>
    </Processes>
    </MonitorSet>
    </monitor_set_definition>
    Anyone having problems with agents just going off-line can probably trace it back to IO. Good luck!

    Chris Amori
    Virtual Administrator
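
    On the check-in suggestion above, the effect of a longer interval is easy to estimate. The sketch below uses the 3341 agents from the first post. The 30-second baseline is an assumption, since the thread does not say what the interval was before it was pushed to 90 seconds, and it treats every agent as a workstation.

```python
def checkins_per_second(agent_count, interval_seconds):
    """Average check-ins per second if agents are spread evenly over the interval."""
    return agent_count / interval_seconds


AGENTS = 3341  # agent count quoted in the original post

for interval in (30, 60, 90):  # 30 s is an assumed baseline, 90 s is from the post
    rate = checkins_per_second(AGENTS, interval)
    print(f"{interval:>3} s interval: ~{rate:.1f} check-ins/sec")
```

    Tripling the interval cuts the average check-in rate to a third, which is load the server and database no longer have to absorb.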

    Legacy Forum Name: How-To,
    Legacy Posted By Username: chris@networkdepot.com



    [edited by: Brendan Cosgrove at 5:36 PM (GMT -8) on 12-20-2010] .
  • Hi guys,
    Just to let you know, if anyone is interested: we upgraded our hardware and still had the same issue, so we went back to Kaseya to get a resolution. It turns out that we had 200 workstations logging in with the same computer name (someone had deployed an image and managed to create 200 workstations with the same SID and the same computer name), so when these machines connected to Kaseya it couldn't handle it. The developers have now rewritten the code so that the KServer does not stop if this happens anywhere else. Since the new code went in last week we've had no more crashes. Happy days.
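
    For anyone wanting to check for the same condition, a minimal sketch of spotting duplicate computer names in an exported list. The `agent_names` list is invented sample data, and the comparison is case-insensitive because Windows computer names are.

```python
from collections import Counter


def duplicate_names(names):
    """Return {name: count} for every computer name that appears more than once."""
    counts = Counter(n.strip().lower() for n in names)
    return {name: count for name, count in counts.items() if count > 1}


# Hypothetical export of computer names, one entry per checked-in machine.
agent_names = ["WKS-IMAGE01", "wks-image01", "WKS-IMAGE01 ", "SRV-DC01", "WKS-0042"]

for name, count in duplicate_names(agent_names).items():
    print(f"{name}: {count} machines share this name")
```

    Any name that shows up here is worth tracing back to the image it was deployed from.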

    Legacy Forum Name: How-To,
    Legacy Posted By Username: lornacummins