If anyone knows how to do this, please help. I wrote a janky open-source app to handle it, but I'm open to "the right way" if there is one.
We want an alert when an "office" goes down, so we set up the agents to email us when they go offline. Then we get those cool emails like:
"10 machines in group hq.company went offline"
But every time an employee shuts down their computer (no matter how much we plead with them to leave it running), we get alerts. We get probably 50+ of these a day, and always have. I got sick of it.
What I want:
Email me when 5+ machines in a group go offline within a few minutes, or all at once. Better yet, email Opsgenie so we get voice calls, because this usually means a whole office is offline.
I made a quick open-source app you can run. Send your "machine offline" alerts to it, and it will handle knowing when to email you and when to page you (a secondary email that goes to Opsgenie, etc.).
Add features, send me a PR. Happy to add stuff for people too.
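For anyone who wants the gist of the rule, here is a rough sketch of the "5+ machines in a group within a few minutes" check. This is not the app's actual code: the class name, the 5-minute window, and the True/False return are all assumptions made for illustration, and it assumes alerts arrive in time order.

```python
from collections import defaultdict, deque
from datetime import datetime, timedelta

THRESHOLD = 5                  # "5+ machines" from the post
WINDOW = timedelta(minutes=5)  # assumed value for "a few minutes"


class OfflineBurstDetector:
    """Hypothetical helper: tracks recent offline alerts per machine group."""

    def __init__(self, threshold=THRESHOLD, window=WINDOW):
        self.threshold = threshold
        self.window = window
        self._events = defaultdict(deque)  # group -> deque of (time, machine)

    def record(self, group, machine, when):
        """Record one offline alert; return True if the group should be paged."""
        events = self._events[group]
        events.append((when, machine))
        # Forget alerts that have aged out of the sliding window.
        while events and when - events[0][0] > self.window:
            events.popleft()
        # Count distinct machines so a flapping PC is not counted twice.
        distinct = {name for _, name in events}
        return len(distinct) >= self.threshold
```

A single shutdown returns False and can be dropped or sent to the low-priority address; once record() returns True, forward the alert to the paging address instead.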
(just built it today, haven't got all the quirks out yet - but it's usable at this point)
The right way to do this is to use Service Desk. When an offline alert comes in (we require 10 minutes of offline time before alerting, to limit false positives), use a SQL Read step to check the status of all machines sharing the same public IP address. If they are all offline, switch the ticket from Server Offline to Site Offline (we only alert on offline servers and other critical systems, not workstations). This works very well. We also use the public IP to look up the ISP and add that to the ticket, and we have org custom fields for alert and maintenance contacts so that all relevant parties are alerted (if they opt in).
Actually, we check the network address, not the specific public IP address. We found this to be a more accurate differentiator between ISP/power issues and local network/equipment issues.
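To make the idea concrete, here is a minimal sketch of the Server Offline vs. Site Offline decision, written in Python rather than as a Service Desk procedure. It is only one possible reading: it guesses that "network address" means the subnet containing the public IP, the /24 prefix is an arbitrary assumption, and the function names and the (public_ip, is_online) input shape are made up for illustration. In practice that list would come from the SQL Read step.

```python
import ipaddress


def network_of(public_ip, prefix=24):
    """Return the network containing public_ip (prefix length is an assumption)."""
    return ipaddress.ip_network(f"{public_ip}/{prefix}", strict=False)


def classify_offline_alert(alerting_ip, machines, prefix=24):
    """Decide the ticket type for an offline alert.

    machines: iterable of (public_ip, is_online) pairs for known agents.
    """
    site = network_of(alerting_ip, prefix)
    # Status of every agent whose public IP falls in the same network.
    peers = [online for ip, online in machines
             if ipaddress.ip_address(ip) in site]
    if peers and not any(peers):
        return "Site Offline"    # nothing behind that network is reachable
    return "Server Offline"      # at least one neighbour is still online
```

Passing prefix=32 falls back to matching on the exact public IP, as in the first description.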
ghanssen what do you mean by "check the network address" exactly? What network address are you talking about?
Can you explain how this works for you, please? I can't see any way to differentiate between an external client's Internet link being down and just their server being down, other than looking at the VSA and checking whether all machines are showing as offline. That doesn't always work, because for some clients we only have the agent loaded on their server. However, if we can't ping their public IP address after seeing their server offline, it's a pretty good bet it's either an internet outage or a power failure at their end, and we eliminate power failures by monitoring the UPS.
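The reasoning in that last sentence can be written down as a small decision function. This is just a sketch of the logic as described: the function name, the three boolean inputs, and the result strings are assumptions, and the ping and UPS checks themselves are left to whatever monitoring already provides them.

```python
def diagnose(server_online, public_ip_pingable, ups_on_battery):
    """Guess the most likely cause when a client's only agent goes offline.

    server_online:      agent status as reported by the VSA
    public_ip_pingable: result of pinging the client's public IP
    ups_on_battery:     last state reported by UPS monitoring
    """
    if server_online:
        return "no outage"
    if public_ip_pingable:
        # The link is up, so the problem is the server or the local network.
        return "server down"
    if ups_on_battery:
        return "probable power failure"
    return "probable internet outage"
```

For example, a server that is offline with an unreachable public IP and a UPS that never reported going on battery comes out as "probable internet outage".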