Kaseya Community

Monitor a service that slows

I'm trying to figure out how to use a Monitor Set to keep track of a service running on one of our servers that tends to slow to a crawl and stop responding.  The service doesn't actually stop, which is why a standard Monitor Set won't work.  The hard part for me right now is that it's a service I don't know much about.  It's a socket service for our RF guns in the warehouse, and what happens is the service slows down and becomes unresponsive.  When that happens we have to log into the server and restart the service.  I don't know what's causing the service to degrade, just that it tends to do it at inopportune times like 2am on a weekday.

Verified Answer
  • My first thought is to see whether there's any way to approach this from another direction. For example, if the service writes an error to a log when it degrades, you can use log parsing to detect the error and then an agent procedure to restart the service.  You could also schedule restarts at regular intervals if there's regular downtime, perhaps at a nightly shift change or once a week or so.  But if it dies after 2 days one time and after 2 weeks the next, scheduled restarts get a bit difficult.

    There are also monitor sets that detect CPU and memory usage. Do you have an idea of what's normal for the service and what it's doing when it's exhibiting the slowdown?  For example, does the CPU spike, or does it drop to nothing when it's normally very high?  Those are both things you should be able to detect with a Monitor Set.

    Also, if you have an application actually stop responding, you can monitor for an "Application Hang" Windows event log entry, usually with an Event ID of 1000 or 1002.  I have some programs that won't outright crash, but Windows will report that the program stopped responding and bring up a dialog box; the Windows event log entry is created at that point, and I can watch for it and restart the application/service.

    Realistically, it sounds like you need a way for the computer to detect when the slowdown is happening - be it a log entry, a performance counter, or an event - and once you discover what that is, you can build a Monitor Set in Kaseya to detect it and an agent procedure to restart the service automatically.
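
    For what it's worth, here's a minimal sketch of that last idea, assuming (and these names are pure placeholders) the service is called RFSocketService and its hang shows up as an Event ID 1002 "Application Hang" entry in the Application log. An agent procedure could run something along these lines on the server:

    ```python
    # Sketch: look for a recent "Application Hang" (Event ID 1002) entry for a
    # given executable and restart its service if one is found. Windows-only;
    # the service and executable names below are hypothetical placeholders.
    import subprocess

    SERVICE_NAME = "RFSocketService"      # hypothetical service name
    EXE_NAME = "RFSocketService.exe"      # hypothetical executable named in the event

    def recent_hang_event() -> bool:
        """Check the newest Event ID 1002 entries in the Application log."""
        query = "*[System[(EventID=1002)]]"
        result = subprocess.run(
            ["wevtutil", "qe", "Application", f"/q:{query}", "/c:5", "/rd:true", "/f:text"],
            capture_output=True, text=True, check=False,
        )
        # Crude check: did any of the newest hang events mention our executable?
        return EXE_NAME.lower() in result.stdout.lower()

    def restart_service() -> None:
        subprocess.run(["net", "stop", SERVICE_NAME], check=False)
        subprocess.run(["net", "start", SERVICE_NAME], check=True)

    if __name__ == "__main__":
        if recent_hang_event():
            restart_service()
    ```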

All Replies
  • Thanks for giving me some ideas.  I'll have to take some of these up with the Director who put the system together.  I'm basically trying to leverage Kaseya to make our lives much easier on the L2 side of things.

    Unfortunately reboots get a bit tricky.  It's a system for our warehouse, which runs just shy of 24/7 (everything except a Sunday shift).  We actually have the server in question reboot weekly on Sunday at midnight, but for some reason the service still lags over time and stops responding during the week.  So looking for a log to verify against could be a direction to go, or seeing if there are any events or errors to home in on to kick off a procedure to restart the service.

    Thanks for the advice it definitely gives me a direction to start in!

  • This is an all-too-common situation - an application has a performance issue and management says "check the logs". You are going to have to dive into PerfMon to get anything useful by creating your own logs.

    Here's the general process:

    • Identify the application run by the service and any secondary applications that it runs.
    • Launch PerfMon, define a log collection, and under the "Process" counters find the application(s) identified above. You'll want to focus on memory at first - this sounds like the service is either consuming memory (a leak) or failing to properly manage the memory it has. You don't say the server itself becomes unresponsive, so it doesn't sound like CPU. You can add that application's processor time (and the _Total instance) as well if you see the server load is high. Resist the temptation to monitor everything - it clouds the issue and places undue load on the system! (A rough command-line sketch of this collection follows the list.)
    • Allow the logs to collect until the system/service is unresponsive. Close PerfMon (if possible).
    • Copy the generated log to your workstation. Download PAL (Performance Analysis of Logs) and feed it the data. This tool is awesome when analyzing PerfMon logs - it will explain the results and offer remediation suggestions. If you identify a memory leak, it will have to be addressed by the software developer.
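
    Not a full recipe, but here's roughly what that collection setup could look like from the command line using the built-in logman utility - the process name and output path are made up for illustration, so adjust the counters to your environment:

    ```python
    # Sketch: create and start a PerfMon counter log focused on the suspect
    # process's memory, using the built-in logman utility.
    # The process name and output path are assumptions for illustration.
    import subprocess

    PROCESS = "RFSocketService"           # hypothetical process instance name (no .exe)
    COLLECTOR = "RFServiceMemLog"
    OUTPUT = r"C:\PerfLogs\RFServiceMem"

    counters = [
        rf"\Process({PROCESS})\Private Bytes",
        rf"\Process({PROCESS})\Working Set",
        rf"\Process({PROCESS})\Handle Count",
    ]

    # Define the collector: sample every 15 seconds, write a binary .blg log.
    subprocess.run(
        ["logman", "create", "counter", COLLECTOR, "-c", *counters,
         "-si", "00:00:15", "-o", OUTPUT, "-f", "bin"],
        check=True,
    )
    # Start collecting; stop it later with: logman stop RFServiceMemLog
    subprocess.run(["logman", "start", COLLECTOR], check=True)
    ```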

    Kaseya can monitor performance and trigger alerts, but not to the level of a targeted PerfMon collection and PAL. You CAN set up PerfMon to generate an event log alert for specific conditions, which Kaseya can monitor so you know when the issue starts or reaches a critical point. You can even respond to the monitor by recycling the service, or schedule that for the next shift change.
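
    As a rough illustration of that alert piece (the names and the ~1.5 GB threshold below are made up; as I recall, on recent Windows versions the triggered alert lands in the Microsoft-Windows-Diagnosis-PLA/Operational event log, which a Kaseya event-log monitor can watch):

    ```python
    # Sketch: define a PerfMon alert that fires when the process's Private Bytes
    # crosses a threshold. Collector name, process name, and the limit are
    # placeholders; adjust to whatever "degraded" looks like for this service.
    import subprocess

    ALERT_NAME = "RFServiceMemAlert"
    THRESHOLD = r"\Process(RFSocketService)\Private Bytes>1610612736"  # ~1.5 GB

    subprocess.run(
        ["logman", "create", "alert", ALERT_NAME, "-th", THRESHOLD, "-si", "00:00:30"],
        check=True,
    )
    subprocess.run(["logman", "start", ALERT_NAME], check=True)
    ```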

    I have some tools that work with PerfMon that will define a collection, start/stop the monitor, cycle it at midnight, and consolidate the day's logs. I've used this in production environments to solve these kinds of problems without having to manually set up the collection each day.

    PM me here or contact me through the mspbuilder.com website if you'd like to try this PerfMon manager tool. It's production ready, but not commercial ready, if that makes sense.

    Glenn

  • If you can identify a process associated with this service, you can apply high CPU/RAM usage alerts with the required thresholds.
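
    For example, a quick sketch of that kind of check using psutil (pip install psutil) - the process name and thresholds here are placeholders:

    ```python
    # Sketch: threshold check on the service's process using psutil.
    # Process name and limits are assumptions for illustration.
    import psutil

    PROCESS_NAME = "RFSocketService.exe"   # hypothetical
    CPU_LIMIT = 90.0                       # percent
    RAM_LIMIT = 1_500 * 1024 * 1024        # ~1.5 GB resident memory

    def check_process() -> list[str]:
        problems = []
        for proc in psutil.process_iter(["name"]):
            if (proc.info["name"] or "").lower() != PROCESS_NAME.lower():
                continue
            cpu = proc.cpu_percent(interval=1.0)   # sample over one second
            rss = proc.memory_info().rss
            if cpu > CPU_LIMIT:
                problems.append(f"CPU at {cpu:.0f}%")
            if rss > RAM_LIMIT:
                problems.append(f"RSS at {rss / 1024**2:.0f} MB")
        return problems

    if __name__ == "__main__":
        issues = check_process()
        if issues:
            print("ALERT:", ", ".join(issues))   # output a Kaseya monitor could pick up
    ```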

  • Guys, the problem with alerting for performance issues is that you need to be prepared to respond immediately to investigate the problem. I've seen very few organizations in my 43 years in IT that have that kind of resource available to respond. In a typical MSP environment, by the time the alert arrives, is processed, assigned, and an engineer experienced with performance analysis actually gets to the machine, the situation has often changed dramatically and the event that caused the performance issue has passed. This is where detailed logging is needed to capture the events that lead up to the performance issue.

    The PerfMon logging can trigger specific alerts, but there is no rush to fight this fire. The data has been collected leading up to the event, the alert triggered, and now the tech can simply collect the performance log, send it to the performance engineer for analysis, and restart the service to resolve the situation. This process could even be automated with Kaseya procedures, and my PerfMonMgr tool can accept such command-line args to stop the collection, allow VSA to perform a GetFiles on the log data, restart the service, and then restart the data collection. This resolves the performance issue quickly without losing any essential information needed to identify the root cause.
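
    Without PerfMonMgr, the same stop/collect/restart/resume sequence could be roughed out with plain logman and net calls - this is only a sketch with placeholder names, not the tool itself:

    ```python
    # Sketch of the stop -> grab log -> restart service -> resume sequence,
    # using plain logman/net rather than PerfMonMgr. Names are placeholders.
    import shutil
    import subprocess
    from datetime import datetime
    from pathlib import Path

    COLLECTOR = "RFServiceMemLog"          # hypothetical collector from earlier
    SERVICE = "RFSocketService"            # hypothetical service name
    LOG_DIR = Path(r"C:\PerfLogs")
    STAGING = Path(r"C:\kworking\perf")    # somewhere a VSA GetFiles could collect from

    def run(*cmd: str) -> None:
        subprocess.run(list(cmd), check=False)

    # 1. Stop the counter collection so the .blg file is closed out.
    run("logman", "stop", COLLECTOR)

    # 2. Stage the newest log where the VSA procedure can pick it up.
    STAGING.mkdir(parents=True, exist_ok=True)
    logs = sorted(LOG_DIR.glob("*.blg"), key=lambda p: p.stat().st_mtime)
    if logs:
        stamped = STAGING / f"{datetime.now():%Y%m%d_%H%M%S}_{logs[-1].name}"
        shutil.copy2(logs[-1], stamped)

    # 3. Recycle the degraded service.
    run("net", "stop", SERVICE)
    run("net", "start", SERVICE)

    # 4. Start a fresh collection for the next occurrence.
    run("logman", "start", COLLECTOR)
    ```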

    Glenn

  • In this specific case it's a bit easier: once I have a way to detect that the service has degraded to the point where it's unavailable to our RF guns, I can have it kick off a procedure to restart the service.  Our current solution is the warehouse calls L1, L1 calls the on-call phone, and one of the three L2 folks (me and my two co-workers) logs into the server and restarts the service.

    My simple solution right now will be to just write the restart-service procedure so L1 can kick it off on their own, since they have access to Kaseya.  Ultimately I want to fully automate it so the downtime goes from 10-15 minutes to 5-10.

  • So how do the RF guns talk to the service?  Is it something listening over TCP where you can simply check with a quick connection on that TCP port?  Basically it sounds like what you need is something that can simulate that RF gun's connection to the service to detect when it's not responding.
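
    Something along these lines might do as a probe - the host, port, and probe bytes are guesses since I don't know the RF gun protocol, and note that a hung-but-running service can still accept a TCP connection, so you may need to wait for an actual response rather than just a successful connect:

    ```python
    # Sketch: probe the service the way an RF gun would, with a short connect
    # and read timeout. Host, port, and probe bytes are hypothetical.
    import socket

    HOST = "warehouse-srv01"   # hypothetical server name
    PORT = 6000                # hypothetical RF gun listener port
    TIMEOUT = 5                # seconds before we call it unresponsive

    def service_responds() -> bool:
        try:
            with socket.create_connection((HOST, PORT), timeout=TIMEOUT) as sock:
                sock.settimeout(TIMEOUT)
                sock.sendall(b"\r\n")          # placeholder probe; real protocol unknown
                return len(sock.recv(1)) > 0   # any byte back counts as "alive"
        except OSError:
            return False

    if __name__ == "__main__":
        # Exit non-zero so a Kaseya monitor/agent procedure can act on it.
        raise SystemExit(0 if service_responds() else 1)
    ```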