Introduction:

Lately a number of customers have asked about having the guard guard the guard: they would like Traverse to monitor the Traverse application itself more robustly. In particular, people have requested the ability to set up alerts based on the information provided on the 'Superusers' -> 'Health' page. This post attempts to provide a resolution to that request.

While the problem posed is tricky in nature, it is possible to leverage some of the design principles of Traverse to come up with a partial solution using the plugin framework.

Overview:

Traverse is designed to be extensible. Each installation has one central web application and object store, collectively called a BVE (Business Visibility Engine). Connected to the BVE are a number (1 ... n) of monitoring components that poll and store results in a local data store; these are referred to as DGEs (Data Gathering Engines). New to the family of components is the DGE's lightweight younger brother, a pure monitoring component with no storage, referred to as the DGE eXtension (DGEx). Any number (0 ... n) of DGE eXtensions can be connected to a DGE.

From this, take away the following points:

  • BVEs do not perform monitoring (by design). A consequence of this is the next point.
  • BVEs do not send actionable alerts for tests, but they do alert when results from multiple DGEs are required to determine state (Aggregated Results / Containers).
  • DGEs do perform monitoring and send email alerts on tests.



Additional points of interest:

  • DGEs are not aware of another DGE's connectivity to the BVE. Each DGE is logically isolated from the next, and quite often physically isolated as well.
  • DGE and DGEx components require connectivity to the BVE at startup in order to gather information about which devices / tests they own. That connection can be dropped after startup and the DGE or DGEx will continue to monitor and record results without any further intervention.
  • Each critical component sends 'heartbeats' to the BVE at regular intervals in order to validate that connectivity is available and the component is running as expected. When you log into the product as a superuser and open the 'Superusers' -> 'Health' page, this is the information you are viewing.



Sample Environment:

We will be using my lab environment as a sample. It's simple: we have 1 BVE and 2 DGEs. We will talk about DGEx considerations later, but we can consider them the same as a DGE for now.

We have:
Machine (A) - n.n.n.128 - BVE
Machine (B) - n.n.n.116 - DGE
Machine (C) - n.n.n.117 - DGE

Where B and C are attached to A.

Solution:

Keeping with the idea of the guard guarding the guard, and keeping in mind that A is unable to monitor anything, the following configuration provides the best available solution:

B MONITORS A, C
C MONITORS A, B

Now if A goes down we receive alerts from B and C.
If B goes down we receive an alert from C.
If C goes down we receive an alert from B.
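
The alerting behaviour described above can be sketched as a small lookup. This is only an illustration of the sample environment, not part of Traverse or the plugin; the `MONITORS` table and the `alerting_components` function are names made up for this example.

```python
# Hypothetical model of the sample environment: who monitors whom.
MONITORS = {
    "B": {"A", "C"},  # DGE B watches the BVE (A) and DGE C
    "C": {"A", "B"},  # DGE C watches the BVE (A) and DGE B
}


def alerting_components(failed, monitors=MONITORS):
    """Return the components that would alert if `failed` goes down."""
    return sorted(
        watcher
        for watcher, targets in monitors.items()
        if failed in targets and watcher != failed
    )


for node in ("A", "B", "C"):
    print(node, "down -> alerts from", alerting_components(node))
```

A has no entry of its own in the table because the BVE does not perform monitoring.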

If B and C were able to reach each other, we could simply provision each machine within Traverse using existing tests. Since we cannot guarantee that they can reach each other, but we can guarantee that they can both reach the BVE, we need to create tests where B can monitor C via A. This is where the heartbeat messages come into play.

We need to create a plugin that allows a DGE to query the BVE (A) for the last heartbeat the other component sent to A. If that heartbeat is too old, we are in trouble. If we cannot connect to A at all, we are also in trouble (treat it as down and send an alert).
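
The check described above can be sketched as follows. This is a minimal illustration of the logic, not the attached Perl plugin. The `fetch_last_heartbeat` callable is hypothetical: it stands in for whatever query the plugin makes against the BVE, and it is assumed to raise `OSError` when the BVE cannot be reached.

```python
import time


def check_component(fetch_last_heartbeat, max_age_seconds, now=None):
    """Return (ok, message) for one monitored component.

    fetch_last_heartbeat is a hypothetical callable that returns the epoch
    time of the component's last heartbeat as recorded by the BVE, and is
    assumed to raise OSError if the BVE is unreachable.
    """
    now = time.time() if now is None else now
    try:
        last_seen = fetch_last_heartbeat()
    except OSError as exc:
        # Cannot reach the BVE (A): treat this as a failure and alert.
        return False, f"cannot reach BVE: {exc}"
    age = now - last_seen
    if age > max_age_seconds:
        # The component has not reported in recently enough.
        return False, f"last heartbeat is {age:.0f}s old"
    return True, f"last heartbeat is {age:.0f}s old"
```

Passing the query in as a callable keeps the sketch independent of how the BVE is actually queried, which this post does not describe.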

Solution Plugin:

Attached to this post is an archive containing a plugin written in Perl which provides the described functionality.

Please download this plugin and unzip it in your [Traverse Home] directory (the files should be expanded automatically into your [Traverse Home]\plugin\monitors directory).

If you are running this plugin in a Linux environment, you will need to make the plugin executable before it will run correctly.

This can be done by running the following commands from the command prompt:

Code:
cd [Traverse Home]/plugin/monitors/trv_comp_check
sudo chmod +x comp_check.pl

After you finish installing the plugin on all BVE, DGE, and DGEx components, you will need to restart the web application (BVE) and the data gathering engine on every component (you must do this any time you add a test type definition to Traverse).

On Windows you can do this using the Traverse Services Controller. On Linux you can enter the following commands from the command line:

Code:
cd [Traverse Home]/etc
sudo ./webapp.init restart
sudo ./monitor.init restart

Plugin Monitor Configuration:

Going back to our simple network layout, we now want to create a test on a device provisioned on B which monitors the heartbeats sent in by C. To do this, follow the steps below.

Find an appropriate device already provisioned on the B DGE to which we can add a custom test that tracks C's heartbeats. If you don't have an appropriate device, consider making one.

Once the device has been selected, navigate to 'Administration' -> 'Devices' and then click 'Tests' in the right-hand column across from the selected device. On the next screen, choose 'Create New Standard Tests' and then select the radio button labeled 'Create new tests by selecting specific monitors'.



If the plugin was installed correctly, you should now see a monitor option for 'trv_comp_check'. Check this box and choose 'Add Tests'. On the following screen, choose 'Continue'.



On the next screen you will be presented with an opportunity to configure DGE and Message Service self tests. The only fields 99% of you will need to modify are the two address fields. The BVE IP address must be the address or FQDN of the server hosting your BVE. The next field, 'Component IP Address', will also need to be filled out. This is the address of the component in question as it is known by Traverse (the address from the perspective of the BVE). In our example you will want to provide the address of DGE C.

Unless you've been tinkering, the other fields should not need to be configured manually. The Warning and Critical thresholds specify how much time may pass since the last heartbeat was received before the test changes state. These values match the values configured within Traverse by default.
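
As a rough illustration of how the two thresholds relate to the heartbeat age, consider the sketch below. It is not taken from the plugin: the function name is made up, the numbers in the usage lines are placeholders rather than the Traverse defaults, and treating an age exactly equal to a threshold as crossing it is an assumption.

```python
def classify_heartbeat_age(age_seconds, warning_seconds, critical_seconds):
    """Map the time since the last heartbeat to a severity label."""
    if warning_seconds > critical_seconds:
        raise ValueError("warning threshold must not exceed critical threshold")
    if age_seconds >= critical_seconds:
        return "CRITICAL"
    if age_seconds >= warning_seconds:
        return "WARNING"
    return "OK"


# Placeholder thresholds (NOT the Traverse defaults): warn at 5 min, critical at 10 min.
for age in (60, 420, 900):
    print(age, classify_heartbeat_age(age, warning_seconds=300, critical_seconds=600))
```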



When you are done, select 'Provision Selected Tests'. You should now be monitoring the heartbeat interval in a way that matches the statistics found on the 'Superusers' -> 'Health' page. You can add actions against these tests which will alert you if C stops sending heartbeats or B is unable to connect to A. Repeat the configuration steps for the C device against B and you are all set.

Special Consideration for DGE eXtensions:

DGE eXtensions are capable of performing monitoring but do not store historical data locally or send out alerts. Because of these limitations they are unsuitable for the role of monitoring the component heartbeats. They do, however, send heartbeats of their own which can be monitored.

It should be safe to have an upstream DGE monitor its own extensions, because the DGE will still correctly identify a failure that affects only the DGE eXtension, and an outage at the DGE level would imply an outage of the eXtension anyway.

It may very well make sense to have the extensions monitored from the same machine (device) monitoring the upstream DGE for logical reasons.
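
The two placement options for extension heartbeat tests can be sketched as a small planning helper. Everything here is hypothetical (the function name, the `extensions_by_dge` mapping, and the component labels such as "B-x1"); it only restates the reasoning above and is not something Traverse provides.

```python
def build_monitoring_plan(bve, extensions_by_dge, watched_by_upstream=True):
    """Return {monitoring DGE: [components whose heartbeats it checks]}.

    extensions_by_dge maps each DGE to the DGE eXtensions attached to it.
    Extensions never appear as keys, since they cannot send alerts.
    """
    dges = sorted(extensions_by_dge)
    # Every DGE watches the BVE and every other DGE.
    plan = {dge: [bve] + [other for other in dges if other != dge] for dge in dges}
    for dge in dges:
        if watched_by_upstream:
            watchers = [dge]  # option 1: the upstream DGE watches its own extensions
        else:
            watchers = [other for other in dges if other != dge]  # option 2: whoever watches the upstream DGE
        for watcher in watchers:
            plan[watcher].extend(extensions_by_dge[dge])
    return plan


print(build_monitoring_plan("A", {"B": ["B-x1"], "C": []}))
print(build_monitoring_plan("A", {"B": ["B-x1"], "C": []}, watched_by_upstream=False))
```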

Conclusion:

If you require any additional assistance, please feel free to contact us through the usual support channels. Questions and comments are always appreciated.