Hi, I'm relatively new to Kaseya. We're using it to monitor processes on Linux hosts. I have a monitor for a certain daemon that should be running. If it isn't, Kaseya throws an alert. I’m testing the idea of a scripted remediation when the monitor detects a problem. I have that working, but I’m wondering if there’s a way to have the monitor try the remediation script first and, only if that fails, escalate to an alert?
Basically, what I'm trying to do is:
1. daemon 'foo' should be running
2. daemon 'foo' crashes
3. Kaseya monitor notices that 'foo' is not running
4. Kaseya agent procedure 'recover foo' is run
5. If daemon 'foo' is now running, go back to monitoring.
6. If daemon 'foo' is not running, create alarm, send email, wake up the helpdesk, etc.
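The six steps above can be sketched as a single check-remediate-verify function. This is only an illustration, assuming the check runs as an external script on the Linux host: the function names, the `settle_seconds` delay, and the use of `pgrep` are assumptions made for the sketch, not anything Kaseya provides.

```python
import subprocess
import time
from typing import Callable


def daemon_is_running(name: str) -> bool:
    """Return True if a process with exactly this name exists (via pgrep -x)."""
    result = subprocess.run(["pgrep", "-x", name], stdout=subprocess.DEVNULL)
    return result.returncode == 0


def check_and_remediate(
    is_running: Callable[[], bool],
    remediate: Callable[[], None],
    settle_seconds: float = 10.0,  # hypothetical grace period after the restart
) -> str:
    """Return 'ok', 'remediated', or 'failed'."""
    if is_running():
        return "ok"                # steps 1 and 7: nothing to do, keep monitoring
    remediate()                    # step 4: run the recovery action
    time.sleep(settle_seconds)     # give the daemon time to come up
    # steps 5 and 6: verify, and report failure only if recovery did not work
    return "remediated" if is_running() else "failed"
```

A wrapper script could call `check_and_remediate(lambda: daemon_is_running("foo"), restart_foo)` with a hypothetical `restart_foo` function, and signal a problem to the monitor only when the result is `"failed"`.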
Our service desk handles this exact thing.
1. We monitor something.
2. An alert arrives at service desk, where it is parsed and a remediation task is potentially identified.
3. The remediation procedure is invoked on the agent
4a. If the remediation is successful, the service desk ticket is closed and a "Complete" status ticket is optionally sent to the PSA.
4b. If the remediation fails, a "New" status ticket is sent to the PSA.
We also developed what we call "smart monitors" - applications that monitor a condition, can self-remediate the condition; auto-adjust the threshold based on the environment where the alert is running; and can suppress transient conditions, generating an alert only when the condition persists for a specific period. Kaseya initiates the smart monitor daily, where it runs once or for 24 hours, alerting Kaseya to an issue only when it can't resolve it by itself.
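The "suppress transient conditions" part of a smart monitor can be illustrated with a small persistence filter. This is a minimal sketch of the general idea, not the actual smart monitor code; the class name and the `hold_seconds` parameter are hypothetical.

```python
from typing import Optional


class PersistenceFilter:
    """Raise an alert only when a bad condition persists for hold_seconds."""

    def __init__(self, hold_seconds: float):
        self.hold_seconds = hold_seconds
        self._bad_since: Optional[float] = None  # when the bad state began

    def update(self, condition_bad: bool, now: float) -> bool:
        """Feed one sample; return True if an alert should be raised."""
        if not condition_bad:
            self._bad_since = None       # condition cleared, reset the timer
            return False
        if self._bad_since is None:
            self._bad_since = now        # first bad sample starts the timer
        return now - self._bad_since >= self.hold_seconds
```

With `hold_seconds=300`, a condition that clears within five minutes never produces an alert, because each good sample resets the timer.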
The key is that "Service Desk" is not the ticketing system, it's an "Alert Triage & Response" system. After a year of using this kind of automation through Service Desk, we've seen 62% fewer tickets on our help desk ticketing system, and the staff has spent nearly 50% less time on tickets because auto- and self-remediation have taken care of the most common issues, eliminating the need for baseline troubleshooting.
Thanks Glenn. I'm trying to do this without the service desk (people) being involved at all at the first level. After first reinventing the process monitor as an external script, I used that to fire the remediation Script. The remediation script produces a log file, indicating success or failure. I then configured a second external monitor to check the log file, and that one fires the Alarm and Email options to alert us that something isn't working and that auto-remediation has failed.
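The second monitor's log check might look something like the sketch below. The log format is an assumption made for illustration: the remediation script is taken to append one line per attempt, ending in `SUCCESS` or `FAILURE`, and the function name is hypothetical.

```python
from pathlib import Path


def remediation_failed(log_path: str) -> bool:
    """Return True if the most recent remediation attempt was logged as a failure.

    Assumes each attempt appends one line ending in SUCCESS or FAILURE
    (a hypothetical format, not one defined by Kaseya).
    """
    path = Path(log_path)
    if not path.exists():
        return False  # no remediation has been attempted yet
    lines = [ln.strip() for ln in path.read_text().splitlines() if ln.strip()]
    if not lines:
        return False  # empty log, nothing to escalate
    return lines[-1].upper().endswith("FAILURE")
```

Only the last non-blank line is inspected, so an earlier failure followed by a successful retry does not trigger the second-level alarm.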
In our case, the ticketing system is reached via email, so I can use Email on either of these monitors to generate a ticket. We'll probably do an auto-closed ticket for the first level, just to track it, and an open ticket assigned to somebody for the second level.
It seems a bit hokey to have to build two monitors to do the escalation, but it works and I guess that'll have to be good enough. What I'd prefer to see is a feature where the four actions can be selected, with an escalation order. In this case I'd want it to be Script, Alarm, Email. In other cases, maybe Email, Alarm would be more appropriate.
That'd also require that Script could provide feedback, so as to prevent the escalation from going to the next level, and some kind of a timeout between the levels. So it'd have to be something like Script - wait 5 minutes, then Alarm - wait 30 minutes, then Email - wait 30 minutes, then the escalation ends, because there are no more configured steps for it to take. If Script was successful, Alarm never fires. If Alarm is acknowledged, then Email never fires.
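The escalation order described above can be sketched as an ordered list of steps, each with its own wait. This illustrates the wished-for feature, which Kaseya does not offer as written here; `run_escalation`, the step tuples, and the `is_resolved` callback (covering both "the script fixed it" and "the alarm was acknowledged") are all hypothetical.

```python
import time
from typing import Callable, List, Sequence, Tuple

# (step name, action to fire, seconds to wait before escalating further)
Step = Tuple[str, Callable[[], None], float]


def run_escalation(
    steps: Sequence[Step],
    is_resolved: Callable[[], bool],
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Fire steps in order; stop as soon as the problem is resolved.

    Returns the names of the steps that actually fired.
    """
    fired = []
    for name, action, wait_seconds in steps:
        action()
        fired.append(name)
        sleep(wait_seconds)   # timeout between escalation levels
        if is_resolved():     # fixed or acknowledged: stop escalating
            break
    return fired              # chain also ends when no steps remain
```

The example in the post would be `[("script", run_script, 300), ("alarm", raise_alarm, 1800), ("email", send_email, 1800)]`, where the three actions are placeholders. Passing `sleep` as a parameter keeps the waits testable without real delays.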
We're doing something like your "smart monitors" here as well, via the external scripts. There are actually very few monitors (the stuff under Monitor / Agent Monitoring / Assign Monitoring) that are doing anything at all, and that number is probably going to go down in the future. Almost all of the useful monitoring is under External Monitor / System Check, so that it can be scripted (these are Linux hosts) to do what we actually need. I'm adding some under Log Monitor as well, in concert with the external scripts. There are some bugs in the Linux agent and log monitoring, but we've found ways to work around them.
People? Service Desk is a Kaseya module used for automated alert triage - sorry if that wasn't clear.
We've eliminated more than 60% of the tickets that hit the help desk (people) by utilizing Service Desk ("Kaseya robots") automation to identify the alert, see if it's associated with a known remediation procedure, invoke that procedure, and then verify its success. If the auto-remediation succeeds, a completed ticket is sent to the help desk system for tracking, but never seen by a technician. The HD Manager will give them a quick review before setting their status to Closed.
Our Service Desk software can get alerts via integrated ticketing or from external devices via email. We filter the email at input to discard stuff like "the UPS started/passed a self-test" and such. We also reformat some message headers to be parsable so that the automation can take effect. In some cases, the automation can evaluate the condition and simply change the priority even if an auto-remediation isn't available.
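Input filtering of this kind can be sketched as a list of patterns for messages to discard. The pattern and function name below are made up for illustration; the real filter rules are not shown in this thread.

```python
import re
from typing import Iterable, List

# Hypothetical noise patterns; real rules would be site-specific.
NOISE_PATTERNS = [
    re.compile(r"UPS .*self-test", re.IGNORECASE),
]


def drop_noise(subjects: Iterable[str]) -> List[str]:
    """Keep only the subjects that match none of the noise patterns."""
    return [
        s for s in subjects
        if not any(p.search(s) for p in NOISE_PATTERNS)
    ]
```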
Service Desk remains in control of the remediation process from beginning to end, and even sets a limit on the remediation time (such as 10 minutes for critical alerts and 30 minutes for non-critical). There's no "looping" of emails in and out of Kaseya. The remediation procedure reports success/fail status directly to Kaseya, so we don't need multiple procedure steps. Also - we use the same software to remediate every alert. The alert ID is parsed in Service Desk, then it's used to perform a SQL query to determine which, if any, remediation procedure is available. The remediation procedures all start and end with the same code, and have custom remediation steps in the middle for consistency.
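The "parse the alert ID, then query for a remediation procedure" step can be sketched as follows. Everything specific here is an assumption for illustration: the `[ALERT:...]` subject format, the `remediation_map` table, and the use of an in-memory SQLite database in place of whatever database Service Desk actually queries. Only the 10 and 30 minute limits come from the post above.

```python
import re
import sqlite3
from typing import Optional

# Hypothetical subject format, e.g. "[ALERT:FOO_DOWN] foo is not running"
ALERT_ID_PATTERN = re.compile(r"\[ALERT:(\w+)\]")

# Remediation time limits mentioned in the post, in minutes
TIME_LIMIT_MINUTES = {"critical": 10, "non-critical": 30}


def parse_alert_id(subject: str) -> Optional[str]:
    """Extract the alert ID from a message subject, or None if absent."""
    match = ALERT_ID_PATTERN.search(subject)
    return match.group(1) if match else None


def find_remediation(conn: sqlite3.Connection, alert_id: str) -> Optional[str]:
    """Look up the remediation procedure for an alert ID, if one exists."""
    row = conn.execute(
        "SELECT procedure_name FROM remediation_map WHERE alert_id = ?",
        (alert_id,),
    ).fetchone()
    return row[0] if row else None


def make_demo_db() -> sqlite3.Connection:
    """Build a throwaway in-memory table with one hypothetical mapping."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE remediation_map "
        "(alert_id TEXT PRIMARY KEY, procedure_name TEXT)"
    )
    conn.execute("INSERT INTO remediation_map VALUES ('FOO_DOWN', 'recover foo')")
    return conn
```

When `find_remediation` returns `None`, there is no known procedure and the alert would go straight to a technician; otherwise the named procedure would be invoked with the time limit that matches the alert's severity.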
We handle live calls and alerts from over 3000 endpoints with no more than 3 engineers on the help desk at any time. Their goal is to have at least 6 hours of "on-ticket" time per day, so sometimes they do some project ticket work like workstation build or hardware configuration tasks. All the engineers are L3 or above for a "one and done" customer experience and rotate between help-desk and project work.
Ah, thanks Glenn. Sorry, I was unaware of Service Desk. We don't seem to be using it here. We're doing something else, and it looks like I'm reinventing some of the functionality. I'll have to ask why we're not doing Service Desk, it sounds like we should be looking at it.
Yeah - unfortunately, Kaseya has been marketing this gem as a ticketing system. It can be used as such, but its strength is the automation that can be applied to every alert BEFORE it gets to the ticketing system that your techs use.
Several of us have been campaigning for a new name for this product. The Service Desk in BMS doesn't have the automation yet, so again, bad choice of naming.
Well, if it helps any, you can add my name to the list of people that think they've mis-marketed this one. It's not up to me, but who knows, maybe we'd be interested in buying a remediation automation system. We're not interested in buying a (another) ticketing system.