We recently experience what was essentially an outage caused by run away automation we created. The database was locked up with virtually all CPU resources utilized and tons of deadlocks. The Kserver CPU was also being heavily used. We resolved the outage by cancelling procedures and waiting 24 hours or more, now we are stable again. I have questions about CPU resources for both the Kserver and our external Microsoft SQL DB server. We don’t want to just throw MS SQL CPU core licenses at the problem if we don’t have to because of their cost.
Our environment consists of:
Here is an image showing resources over the last month. The arrows indicate when things were non-functional
Here is an image showing the last week, with the environment stable. Note there are still tasks waiting for CPU.
My questions revolve around:
I don't have a direct answer to your questions, but I think the VSA server specs call for at least quad core? For 2500 endpoints, 2 cores seems underpowered to me.
Are the Kaseya server VM and the SQL server both VMs on the same physical host?
What kind of deployment schedule are you doing for your automation?
- We learned early on in our Kaseya deployment to stagger mass operations over at least one day, if not more. Our ~3000 endpoints are sorted out into many machine groups with different schedules for mass operations.
If you're trying to run operations on 2500 machines at once, I'd expect it to kill performance no matter how many CPU cores and RAM you throw at it.
Thank you Eric. The Kserver and SQL server are separate VMs. I could not find documentation beyond help.kaseya.com/.../reqs which I didn't really find helpful. The agents are all laptops, so we have to target business hours, but do select to have the procedures run on the agent time zone rather than the server, so to help spread the load, but we may have 1500 or more in one time zone. We are spreading the procedures as much as we can while completing them between 9am-4pm. The procedures run remote powershell commands and write back entries to the procedure log, and to 2 or 3 custom fields.
To give you an idea we have maxed out our VSA at 20K endpoints with our SQL server using 36 cores and 72GB of RAM. This also required separating all of the database & log files into individual provisioned IOPS SSDD drives with 64KB allocation units.
Even with all of the optimizations that have been done physically and logically (extra indexes, process profiling, etc..) we still get periods where there are excessive locks & blocks. PSE has worked on our servers extensively and the bottom line is that some of those locks & blocks are by design and you just can't eliminate them.
Follow the best practices for your SQL server, allocate your drives with 64KB blocks on SSD drives. Have provisioned IOPS on the drives holding the ksubscribers database & log files. Split ksubscribers over two drives, split the logs over two drives. Assign separate drives for 8 tempdb database files & logs and make sure your SQL server is dedicated to your VSA. It's not the number of cores and memory that causes the bottlenecks, it's the disk IO.
VSA is, by nature, connection & transaction-intensive, for this reason, your SQL server needs to be dedicated and even then you may see problems with transactions being dropped and performance hits due to the front-end passing too many IP connections to the SQL server.
We also have about a 2500 agent VSA, and utilize two servers (App and DB, both VMs). Our database is licensed for 6-cores (32gb ram, DB size is about 75gb), and is utilizing SSD hybrid storage for the database itself on our SAN. The performance is quite acceptable, but there are times when the 6-cores are at full tilt during days where patching or other larger scans/policy adjusts/etc. are being handled. The app server has 4 cores and 8gb ram, and generally never hits those specs very hard at all. It's really all in the DB. Hope this is helpful, I think at your scale, you probably do need at least 2 more cores if I were to guess.
I would recommend that you grab some of the diagnostics and performance tools for SQL from HTTP://www.sqlskills.com. in particular look at the "ShortPeriodIOLatencies.sql" query.
You can also use the following to check the SQL broker service which is responsible for managing the queries and results:
SELECT SUBSTRING(far_service, 1, 50) 'conversation_endpoints FarServiceName',
GROUP BY far_service,
ORDER BY cnt DESC;