Over the last several years we've found incidents where event logs simply weren't alerting -- when we compared the events in the DB against the event log on the endpoint, the events didn't exist in the DB. Generally this was due to a high volume of events at the time, or performance issues. We've corrected that as much as possible, but the challenge we face is making sure we know when events are being skipped BEFORE we actually need them.
Does anyone have a working solution for this -- some sort of outside-the-box way of making sure we know when an event log on an endpoint is not being written/logged by Kaseya? This is most important for security logs for us.
We're super SQL savvy, but I'm not entirely sure how that will help here. Effectively we need to know when the count of events on the machine doesn't equal the count of events in Kaseya.
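To show roughly what I mean by that comparison -- a quick sketch, assuming both sides can be dumped to CSV with an event ID column (the column name and file layout here are just placeholders, not anything Kaseya actually produces):

```python
import csv
from collections import Counter

def load_counts(path, id_field="EventID"):
    """Count occurrences of each event ID in a CSV export.
    The "EventID" column name is an assumption -- adjust to the real export."""
    with open(path, newline="") as f:
        return Counter(row[id_field] for row in csv.DictReader(f))

def find_gaps(endpoint_csv, kaseya_csv):
    """Return {event_id: (endpoint_count, kaseya_count)} for every event ID
    where the endpoint saw more events than Kaseya recorded."""
    endpoint = load_counts(endpoint_csv)
    kaseya = load_counts(kaseya_csv)
    return {eid: (n, kaseya.get(eid, 0))
            for eid, n in endpoint.items()
            if n > kaseya.get(eid, 0)}
```

Anything that function returns is a potential missed event worth investigating.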
One of our tools might help, or at least provide some ideas...
We have a "what don't you know" tool that can be run monthly. It runs on servers and dumps the last 30 days of event logs, filtering on warnings and errors. Ignoring dates, it gets a list of unique events that occurred.
It would be a simple change to re-scan the data for each unique event to obtain counts - it doesn't do this currently. This could easily create two CSV files - an event list with counts (and the status below - Monitored, Don't Care, or New), and a second CSV that maps the dates to each event.
Once we have the list of unique events, each is compared against a table of events that we either do monitor or don't care to monitor. What's left are events that are new and unseen. We can then decide to add these to either the Monitored or Don't Care tables, and create or update a monitor set if we choose to monitor the new event. When we onboard new clients, we run this with a parameter that dumps up to 365 days of events. This helps identify customer LOB alerts that we can then evaluate and monitor if appropriate.
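That classification step is simple enough to sketch -- this is just an illustration of the logic, not our actual tool; the status names come from the description above:

```python
def classify_events(unique_event_ids, monitored, dont_care):
    """Tag each unique event ID as Monitored, Don't Care, or New.
    monitored / dont_care are sets of IDs from the two lookup tables."""
    status = {}
    for eid in unique_event_ids:
        if eid in monitored:
            status[eid] = "Monitored"
        elif eid in dont_care:
            status[eid] = "Don't Care"
        else:
            status[eid] = "New"
    return status
```

The "New" bucket is the interesting output -- those are the events nobody has made a decision about yet.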
The files can be collected, combined, and imported into SQL, from where you can create the reports you need. Simply comparing alert counts between this data and VSA will give you a good idea of any disparity. You'll need to be on-prem with good SQL skills here -- or downright awesome with the reporting tools if you're on SaaS.
BTW - We're writing to CSV and not SQL because these tools run directly on each endpoint for maximum performance. Collecting, concatenating the file pairs, and importing into SQL should be pretty easy.
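If it helps, the import-and-compare step is easy to prototype in SQLite before pointing it at the real database -- the table and column names below are placeholders, not the VSA schema:

```python
import sqlite3

def load_and_compare(endpoint_rows, vsa_rows):
    """endpoint_rows / vsa_rows: iterables of (machine, event_id, count) tuples,
    e.g. parsed from the concatenated CSV pairs and a VSA-side export.
    Returns rows where the endpoint count exceeds the VSA count."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE endpoint (machine TEXT, event_id TEXT, cnt INTEGER)")
    con.execute("CREATE TABLE vsa (machine TEXT, event_id TEXT, cnt INTEGER)")
    con.executemany("INSERT INTO endpoint VALUES (?,?,?)", endpoint_rows)
    con.executemany("INSERT INTO vsa VALUES (?,?,?)", vsa_rows)
    # LEFT JOIN so events Kaseya never saw at all still show up (vsa_cnt = 0)
    return con.execute("""
        SELECT e.machine, e.event_id, e.cnt AS endpoint_cnt,
               COALESCE(v.cnt, 0) AS vsa_cnt
        FROM endpoint e
        LEFT JOIN vsa v ON v.machine = e.machine AND v.event_id = e.event_id
        WHERE e.cnt > COALESCE(v.cnt, 0)
    """).fetchall()
```

The same LEFT JOIN / COALESCE pattern carries straight over to the real SQL Server database once the CSVs are loaded.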
PM me or contact me via the mspbuilder.com website if you want to discuss this further.
With the event log monitors there is a rearm value; the minimum is one minute. I've found that if two or more of the same monitored event occur in under one minute, Kaseya will miss the second and subsequent events (until the one-minute rearm expires).
We monitor shadowprotect events and regularly miss backup events due to this.
The logic Kaseya uses is that it will not alert on a different event if it falls within the rearm period and belongs to the same event set. So, effectively, Kaseya is blind to any event from the same set during the rearm period. That's an important 'feature'. Apart from creating single-event sets, there's not much you can do.
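To make that behavior concrete, here's a toy model of the rearm logic as described -- my reading of it, not Kaseya's actual code:

```python
def simulate_rearm(event_times, rearm_seconds=60):
    """Given sorted event timestamps (in seconds) for events in ONE event set,
    return (alerted, suppressed) lists under a simple rearm model:
    after an alert fires, every event from the same set arriving within
    rearm_seconds is dropped."""
    alerted, suppressed = [], []
    rearm_until = None
    for t in event_times:
        if rearm_until is not None and t < rearm_until:
            suppressed.append(t)
        else:
            alerted.append(t)
            rearm_until = t + rearm_seconds
    return alerted, suppressed
```

With the one-minute minimum, three events at 0s, 30s, and 61s produce alerts at 0s and 61s -- the 30s event is silently dropped, which matches the missed-backup-event behavior described above.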
Not sure at what number of different event sets Kaseya runs into performance issues. You could single out the most destructive events, but given the sheer number of options, it doesn't feel like a good approach. Kaseya could get around this by rewriting the rearm function to act on a per-line basis for every line in an event set. I'm not sure that's a good plan...
I'd just like to turn rearm off. I want *every* instance of an event I'm monitoring for. I'm not silly enough to monitor an event that triggers hundreds of times a minute (and if there ever was one, I'd surely want to know about it).
Thanks Glenn. I'll definitely follow up with you later in the week on this. For our purposes, what we're seeing isn't just rearm-related; it's super random. So a catch-all would be helpful.
Chris, something else to keep in mind from your first statement... The events not existing in the DB could simply be due to the event log collection configuration in Kaseya. They made a change several years ago so that you don't necessarily have to be *collecting* specific events to be able to alert on them... In other words, they moved the event log alerting out to the endpoint. So you may be running into rearm issues without really realizing it if, as Eric pointed out, you have multiple events combined in one event set.
The events in the DB will only show the specific event logs and categories that you have defined in Agent->Agents->Event Log Settings, and it's honestly recommended that you keep those settings to a minimum. From what I've been told, the only thing you really *need* to bring back from the agent to the db are events that you might want to generate reports on later. The eventlog *monitoring* uses separate logic that all takes place on the agent end.
I know from several years ago that configuring the Event Log Settings to collect *all* event categories for *all* event log types is a good way to bring your server to its knees and flood your available bandwidth :)
To add to Jonathan's response... in order to prevent this mistaken "distributed denial of service attack" from your Agents against your server, once a machine sends up ~2,000 events within an hour (the exact number has changed from time to time), we stop collecting from that machine for one hour, which can result in missed events.
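For anyone wanting to check their own endpoint dumps against this throttle, a quick sketch that flags any hour where a machine's event volume hits that ceiling -- the 2,000 figure is approximate per the above and has changed over time:

```python
from collections import Counter

def hours_over_threshold(timestamps, threshold=2000):
    """Bucket event timestamps (epoch seconds) into hour-long windows and
    return {hour_bucket: count} for any hour at or above the threshold --
    i.e., hours where the collection throttle may have kicked in."""
    per_hour = Counter(int(t) // 3600 for t in timestamps)
    return {hour: n for hour, n in per_hour.items() if n >= threshold}
```

Any flagged hour is a window where missed events would be expected behavior rather than a bug.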
Thanks Jonathan/Kirk -- both are great information. On DB collection vs. alerting, though: I've heard both that it is and that it isn't required for alerting, from members of support and at the Connect conference as well, so we collect for 30 days regardless, to CYA.
In the few examples we had of it definitely not alerting (and not collecting -- they always coincide), it was not a 2,000-events-within-an-hour scenario. That would totally be understandable.
I mean, it's not something we've noticed happening a whole lot, but it's definitely happened -- we provided proof of it happening -- and it's not the sort of thing we want to happen without knowing about it. Without firm data comparing the actual set of events to what is collected and alerted on, it's super hard to tell how prevalent the problem is.
I definitely appreciate all the responses on this so far!