Kaseya Community

Disk Performance Counter Monitoring Set

Hi;

Has anybody used the Disk Drive Performance monitoring sample set successfully yet?

I have been using it for a while and I have noticed a few things about it.

The first thing I have noticed is that the alarm thresholds are higher than the ones you would normally configure in MS Perfmon. According to some documentation I found, Logical Disk usage time (Disk Time by %) of 55% or greater for 10 minutes or more indicates an I/O bottleneck, and so does an average Disk Queue Length above 2 over a period of 30 minutes or more. However, in the sample set the Logical Disk usage time threshold is set to 80% and the Disk Queue Length threshold is 10.
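To make the comparison concrete, here is a minimal Python sketch of a sustained-threshold check, where an alarm fires only if every sample in the trailing window is over the limit. The function name, the readings, and the one-sample-per-minute interval are assumptions for illustration; this is not a description of how Kaseya or Perfmon evaluate thresholds internally.

```python
def breaches_for(samples, threshold, duration_s, interval_s):
    """Return True if the most recent `duration_s` seconds of samples
    are all strictly above `threshold`.

    samples     -- readings in chronological order, one per interval
    interval_s  -- seconds between readings (assumed constant)
    """
    needed = duration_s // interval_s  # consecutive samples required
    if needed <= 0 or len(samples) < needed:
        return False
    return all(value > threshold for value in samples[-needed:])


# Hypothetical disk time readings (%), one per minute for 10 minutes.
disk_time = [60, 62, 58, 71, 65, 59, 66, 70, 63, 61]

# Guideline quoted above: 55% sustained for 10 minutes.
guideline_alarm = breaches_for(disk_time, 55, 600, 60)
# Threshold used by the sample set: 80%.
sample_set_alarm = breaches_for(disk_time, 80, 600, 60)
```

With these made-up readings the 55% guideline would raise an alarm while the 80% sample-set threshold would stay quiet, which is the gap described above.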

The second thing I have noticed is that I have found 3-4 servers that exceed these thresholds by 10-100 times what is expected, which makes me think the value scale might be wrong for these servers.

Verified Answer
  • Hi there,

    We use these performance monitors quite extensively. I'd advise using them in conjunction with other perfmon alerts as a sign of good or poor performance.

    I am about to ramble, so be patient. I hope there is something of value here.

    I'd suggest mainly using them alongside other performance alerts, RAM and so on. Things that affect the values you are asking about include RAID type and disk type, and it takes a bit of maths and research to know whether the values are bad. What is bad for one server might not be so bad for another. The maths can go right down to the spindle, unfortunately, so we never really pay much attention to disk stuff.

    Would I be on the right track to suggest those 3-4 servers are busy servers? Exchange? SQL? File servers?
    Are they virtual servers in VMware or Hyper-V?
    What disks are being used for the OS?
    Are they connecting to online storage (fast disk, SAS etc.) or slow disks (SATA etc.)?

    If they are virtual and connected to a SAN, check the SAN's performance logs.
    If it is a Dell EqualLogic, for example, you can see what the DQL is on the individual drives in any of the disk groups with Dell's SAN management tools.
    I can only guess that the same data is available on an HP MSA or an EMC.

    If it is an Exchange box or a SQL box, are they clustered?
    A common setup for clustered Exchange is to have 2 mailbox stores on each server: 1 primary store and 1 secondary store.
    If one of the servers falls over, its mailboxes resume normal operation on the secondary store of the second server.
    That secondary store is usually slower disk, and your DQL will be <5 all the time, certainly <10 on occasion.

    Perhaps all my reply highlights is that it's not a great measurement of performance, and you can really just use it as an indication.
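As a rough illustration of why the same queue length can be fine on one server and bad on another, the Python sketch below divides a volume's queue length by the number of physical disks behind it and applies the "about 2 per drive" rule of thumb mentioned in this thread. The function names and the spindle counts are assumptions, and the sketch deliberately ignores RAID level, caching and SAN behaviour, all of which change the real picture.

```python
def queue_per_spindle(avg_queue_length, spindles):
    """Divide a volume's queue length by the number of physical disks behind it."""
    if spindles < 1:
        raise ValueError("spindles must be at least 1")
    return avg_queue_length / spindles


def looks_congested(avg_queue_length, spindles, per_spindle_limit=2.0):
    """Apply the 'about 2 per drive' rule of thumb to a whole array."""
    return queue_per_spindle(avg_queue_length, spindles) > per_spindle_limit


# Hypothetical servers: the same queue length of 10 reads very differently.
single_disk = looks_congested(10, spindles=1)       # 10 per spindle
eight_disk_array = looks_congested(10, spindles=8)  # 1.25 per spindle
```

On the single disk a queue of 10 is well over the rule of thumb, while on the eight-disk array it is under it, so a fixed threshold cannot be right for both.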

  • I'd pay close attention to the SAN. There are Hyper-V counters as well that you should look at. I don't know where I got it from, but below is a Hyper-V set. The DQL threshold in the Hyper-V monitor set is 12!!! That's astronomical. It might be like the Page Fault counter for VMware being a false positive. I don't know. Monitor set below.

    <?xml version="1.0" encoding="UTF-16" ?>
    <monitor_set_definition version="1.0">
      <MonitorSet name="_003_008_Hyper-V Monitor Set (New)" description='Microsoft Recommended Hyper-V Monitoring Set'>
        <Counters>
          <Counter name='Hyper-V Virtual Machine Health Summary' description='null' counterObject='Hyper-V Virtual Machine Health Summary' counter='Health Critical' counterSampleInterval='300' collectionOperator='Over' collectionThreshold='-1' trendTimeSpan='1209600' trendReArm='3600' thresholdOperator='Over' thresholdAmount='0' thresholdDuration='300' thresholdWarning='10' thresholdReArm='3600'/>
          <Counter name='Hyper-V Virtual Machine Health Summary' description='null' counterObject='Hyper-V Virtual Machine Health Summary' counter='Health Ok' counterSampleInterval='300' collectionOperator='Over' collectionThreshold='-1' trendTimeSpan='1209600' trendReArm='3600' thresholdOperator='Under' thresholdAmount='4' thresholdDuration='300' thresholdWarning='10' thresholdReArm='3600'/>
          <Counter name='Hyper-V Hypervisor Logical Processor Guest Run Time Total' description='null' counterObject='Hyper-V Hypervisor Logical Processor' counter='% Guest Run Time' counterInstance='_Total' counterSampleInterval='300' collectionOperator='Over' collectionThreshold='-1' trendTimeSpan='1209600' trendReArm='3600' thresholdOperator='Over' thresholdAmount='80' thresholdDuration='180' thresholdWarning='10' thresholdReArm='3600'/>
          <Counter name='Hyper-V Hypervisor Logical Processor % HyperVisor Run Time Total' description='null' counterObject='Hyper-V Hypervisor Logical Processor' counter='% Hypervisor Run Time' counterInstance='_Total' counterSampleInterval='300' collectionOperator='Over' collectionThreshold='-1' trendTimeSpan='1209600' trendReArm='3600' thresholdOperator='Over' thresholdAmount='50' thresholdDuration='1200' thresholdWarning='10' thresholdReArm='3600'/>
          <Counter name='Hyper-V Hypervisor Logical Processor % Total Run Time' description='null' counterObject='Hyper-V Hypervisor Logical Processor' counter='% Total Run Time' counterInstance='_Total' counterSampleInterval='60' collectionOperator='Over' collectionThreshold='-1' trendTimeSpan='1209600' trendReArm='3600' thresholdOperator='Over' thresholdAmount='85' thresholdDuration='300' thresholdWarning='10' thresholdReArm='3600'/>
          <Counter name='PhysicalDisk Current Queue Length Total' description='Current Disk Queue Length Total (Should be around 2 per Drive Max)' counterObject='PhysicalDisk' counter='Current Disk Queue Length' counterInstance='_Total' counterSampleInterval='300' collectionOperator='Over' collectionThreshold='-1' trendTimeSpan='1209600' trendReArm='3600' thresholdOperator='Over' thresholdAmount='12' thresholdDuration='300' thresholdWarning='10' thresholdReArm='3600'/>
          <Counter name='PhysicalDisk Bytes / Second' description='Disk Bytes / Second Total' counterObject='PhysicalDisk' counter='Disk Bytes/sec' counterInstance='_Total' counterSampleInterval='300' collectionOperator='Over' collectionThreshold='-1' trendTimeSpan='1209600' trendReArm='3600' thresholdOperator='Over' thresholdAmount='40000000' thresholdDuration='900' thresholdWarning='10' thresholdReArm='3600'/>
          <Counter name='Memory Pages / Second' description='Pages per Second' counterObject='Memory' counter='Pages/sec' counterSampleInterval='300' collectionOperator='Over' collectionThreshold='-1' trendTimeSpan='1209600' trendReArm='3600' thresholdOperator='Under' thresholdAmount='-1' thresholdDuration='900' thresholdWarning='10' thresholdReArm='3600'/>
          <Counter name='Memory Available MBytes' description='Memory Available MBytes' counterObject='Memory' counter='Available MBytes' counterSampleInterval='300' collectionOperator='Over' collectionThreshold='-1' trendTimeSpan='1209600' trendReArm='3600' thresholdOperator='Under' thresholdAmount='500' thresholdDuration='3600' thresholdWarning='10' thresholdReArm='86400'/>
        </Counters>
        <Services>
          <Service name='vmms' serviceDescription='Hyper-V Virtual Machine Management' description='Management Service for Hyper-V' restartAttempts='3' restartInterval='60' reArm='3600'/>
        </Services>
        <Processes>
        </Processes>
      </MonitorSet>
    </monitor_set_definition>
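For anyone who wants to read the thresholds out of an export like this rather than import it, here is a small Python sketch using the standard library's `xml.etree.ElementTree`. It embeds two counters copied from the set above, builds a Perfmon-style path for each, and prints the alarm condition. Treating `thresholdDuration` as seconds is an assumption (the values are consistent with it but the thread does not say so), and the `summarize` helper is purely illustrative.

```python
import xml.etree.ElementTree as ET

# Two counters copied from the set above; the XML declaration is left out
# so the string parses regardless of how this file itself is encoded.
SAMPLE = """
<monitor_set_definition version="1.0">
  <MonitorSet name="_003_008_Hyper-V Monitor Set (New)">
    <Counters>
      <Counter name='PhysicalDisk Current Queue Length Total' counterObject='PhysicalDisk' counter='Current Disk Queue Length' counterInstance='_Total' thresholdOperator='Over' thresholdAmount='12' thresholdDuration='300'/>
      <Counter name='Memory Available MBytes' counterObject='Memory' counter='Available MBytes' thresholdOperator='Under' thresholdAmount='500' thresholdDuration='3600'/>
    </Counters>
  </MonitorSet>
</monitor_set_definition>
"""


def summarize(xml_text):
    """Return one (perfmon path, operator, amount, duration) tuple per counter."""
    root = ET.fromstring(xml_text)
    rows = []
    for c in root.iter("Counter"):
        instance = c.get("counterInstance")
        # Build a Perfmon-style path; the instance part is optional.
        obj = c.get("counterObject") + (f"({instance})" if instance else "")
        path = f"\\{obj}\\{c.get('counter')}"
        rows.append((path, c.get("thresholdOperator"),
                     float(c.get("thresholdAmount")),
                     int(c.get("thresholdDuration"))))
    return rows


rows = summarize(SAMPLE)
for path, op, amount, duration in rows:
    # Assumption: thresholdDuration is expressed in seconds.
    print(f"{path}: alarm if {op.lower()} {amount:g} for {duration // 60} min")
```

The resulting list is also a convenient starting point if the set has to be rebuilt by hand in the monitor set editor.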

All Replies
  • These 4 servers are virtual and, yes, 2 of them are Exchange/SBS, 1 is an IIS server and the other is a SQL server.

    So what you are saying is that I should monitor the SAN or the virtual host server's physical disks and ignore the virtual disks/volumes. I suppose that makes sense. I just found out that Hyper-V has its own perf counters that I should be using. The downside is that you need to make a custom monitor set for each Hyper-V host.

  • Thanks for the info, I will have a look at it :D

    Just a note about posting code: if you turn on "use rich formatting" and wrap the code in code and /code commands, each enclosed in square brackets [ and ], then you get something like this:

    your code here

    [edited by: HardKnoX at 4:47 PM (GMT -7) on 6-9-2011] blah
  • Ah sweet ! Learn something new every single day :) Cheers

  • Hey guys. It's been a little more than one month since you discussed this.

    I now have some time to get into better monitoring my virtual environments.

    I must be a little rusty or something because I can't even properly import the code for the monitor set. It says "The Monitor Set was not a valid Kaseya Monitor Set xml format. Error: -1072896657"

    Any help with this and other Hyper-V monitoring would be helpful.

    Thanks for your time.

  • Yeah you might have to write your own from the values posted in the exports above or ask Mark to repost it in a better format :)
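One possible cause of the import failure reported above, offered as a guess rather than a confirmed diagnosis: the export declares `encoding="UTF-16"`, but text copied out of a forum post is usually saved as UTF-8 or ANSI, and XML parsers commonly reject a file whose bytes do not match its declared encoding. The error number, read as an MSXML code (0xC00CE56F), is usually described as a failed switch to the declared encoding, which would fit. The Python sketch below strips the indentation and blank lines picked up from the post and rewrites the file as UTF-16 with a byte order mark. The file names are hypothetical, and whether Kaseya then accepts the file is untested.

```python
import codecs
import tempfile
from pathlib import Path


def resave_as_utf16(src, dst, src_encoding="utf-8-sig"):
    """Read a pasted monitor set and write it back as UTF-16 with a BOM.

    src_encoding is an assumption about how the pasted file was saved;
    'utf-8-sig' reads UTF-8 with or without a leading BOM.
    """
    text = Path(src).read_text(encoding=src_encoding)
    # Drop the indentation and blank lines picked up from the forum post,
    # so the XML declaration is the very first thing in the file.
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    # Python's 'utf-16' codec writes a BOM followed by native-endian data.
    Path(dst).write_bytes("\r\n".join(lines).encode("utf-16"))


# Small demonstration on a throwaway copy (hypothetical file names).
pasted = (
    '    <?xml version="1.0" encoding="UTF-16" ?>\n\n'
    '    <monitor_set_definition version="1.0">\n\n'
    '    </monitor_set_definition>\n'
)
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "pasted.xml"
    dst = Path(tmp) / "hyperv_set.xml"
    src.write_text(pasted, encoding="utf-8")
    resave_as_utf16(src, dst)
    data = dst.read_bytes()

has_bom = data[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)
```

If the import still fails after this, rebuilding the set by hand from the counter values, as suggested above, remains the fallback.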