Ops Man missed a low disk space alert until the disk hit zero

We lost a server over the weekend, when a local application problem crammed the system disk. The server is being monitored by Ops Man 2012 R2, and we got an email alert when the disk hit 0% free, but nothing up until then.

Referring to the screencaps below, it looks like Ops Man was getting performance data from the system, and knew the disk space was low. The override thresholds for the system disk are set to warn at 5,000 MB free and error at 3,000 MB free, but the state change event clearly shows the monitor as healthy, right up until it hit 0. (First graph shows % disk free, second shows free MB)

I'm new to Ops Man, but I'm pretty sure I've dug into this thoroughly.  What am I missing?  Is there any way to know why the alert didn't trigger?  

May 19th, 2015 4:34pm

What is the logical disk free space percentage threshold configured as?

Both the megabytes and percentage threshold must be breached before an alert will be triggered.
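
A minimal sketch of that two-condition logic, in Python. The function name and state labels are hypothetical and not part of Ops Man, and the strict "less than" comparison is an assumption. The sample thresholds are the 5,000/3,000 MB override values and the 10%/5% defaults mentioned in this thread.

```python
def disk_state(free_mb, free_pct, warn_mb, warn_pct, error_mb, error_pct):
    """Return a health state; BOTH the MB and the % threshold must be breached."""
    if free_mb < error_mb and free_pct < error_pct:
        return "error"
    if free_mb < warn_mb and free_pct < warn_pct:
        return "warning"
    return "healthy"

# 4,000 MB free on a disk where that is still 20% free: only the MB threshold is breached
print(disk_state(4000, 20.0, warn_mb=5000, warn_pct=10, error_mb=3000, error_pct=5))

# 265 MB free at 0.8% free: both thresholds are breached
print(disk_state(265, 0.8, warn_mb=5000, warn_pct=10, error_mb=3000, error_pct=5))
```

The first call shows why a large disk can sit below the MB threshold without alerting: the percentage condition is not met yet.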

May 20th, 2015 12:40am

By default, the Windows XXX Logical Disk Free Space monitor triggers an alert only when both the free disk space in MB and the percentage of free disk space fall below their thresholds.

Roger

May 20th, 2015 2:18am

Well I tried to find the percentage for you, but I don't see an override for it, so I was about to say it should be using the default (10% Warn, 5% Error)...

However, while digging around, I found the XML for that monitor.  Now the % numbers in the XML are what I expected to see, but look at the Warning/Error MBytes.  Looks like 500 MB Warn and 300 MB Error.  This is where I have to plead some Ops Man ignorance, but if the override is configured to warn at 5000 and error at 3000, why does the XML show up like it does?

Unfortunately, this doesn't answer the missed alert issue either.  I checked the performance log, and Ops Man recorded the system disk at 265 MB free at 6:48 PM, and 88 MB free at 6:58 PM.  Either of those values should have tripped the alert and notification. (% Free at the time was .8%)

 
 <Configuration>
  <IntervalSeconds>3600</IntervalSeconds> 
  <TargetComputerName>$Target/Host/Property[Type="Windows!Microsoft.Windows.Computer"]/NetworkName$</TargetComputerName> 
  <SystemDriveWarningMBytesThreshold>500</SystemDriveWarningMBytesThreshold> 
  <SystemDriveWarningPercentThreshold>10</SystemDriveWarningPercentThreshold> 
  <SystemDriveErrorMBytesThreshold>300</SystemDriveErrorMBytesThreshold> 
  <SystemDriveErrorPercentThreshold>5</SystemDriveErrorPercentThreshold> 
  <NonSystemDriveWarningMBytesThreshold>2000</NonSystemDriveWarningMBytesThreshold> 
  <NonSystemDriveWarningPercentThreshold>10</NonSystemDriveWarningPercentThreshold> 
  <NonSystemDriveErrorMBytesThreshold>1000</NonSystemDriveErrorMBytesThreshold> 
  <NonSystemDriveErrorPercentThreshold>5</NonSystemDriveErrorPercentThreshold> 
  <DiskLabel>$Target/Property[Type="Windows!Microsoft.Windows.LogicalDevice"]/DeviceID$</DiskLabel> 
  <TimeoutSeconds>360</TimeoutSeconds> 
  <DebugFlag>false</DebugFlag> 
  </Configuration>
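
For anyone who wants to pull the numbers out of a pasted configuration like this one, here is a rough Python sketch using only the standard library. The XML string is a trimmed copy of the block above, and the helper name is hypothetical.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the monitor configuration pasted above (numeric elements only)
CONFIG_XML = """
<Configuration>
  <IntervalSeconds>3600</IntervalSeconds>
  <SystemDriveWarningMBytesThreshold>500</SystemDriveWarningMBytesThreshold>
  <SystemDriveWarningPercentThreshold>10</SystemDriveWarningPercentThreshold>
  <SystemDriveErrorMBytesThreshold>300</SystemDriveErrorMBytesThreshold>
  <SystemDriveErrorPercentThreshold>5</SystemDriveErrorPercentThreshold>
</Configuration>
"""

def read_numeric_settings(xml_text):
    """Map each child element holding a whole number to its integer value."""
    root = ET.fromstring(xml_text.strip())
    settings = {}
    for child in root:
        text = (child.text or "").strip()
        if text.isdigit():
            settings[child.tag] = int(text)
    return settings

settings = read_numeric_settings(CONFIG_XML)
for name, value in settings.items():
    print(f"{name}: {value}")
```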

May 20th, 2015 3:46pm

Based on the config you provided, the polling cycle is 3600 seconds, which is 1 hour.

If the disk fills up within the hour after the last polling cycle, the alert will only be triggered when the next cycle runs.
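
As a rough worked example in Python, using the two free-space samples reported earlier in the thread (265 MB at 6:48 PM, 88 MB at 6:58 PM) and the 500/300 MB thresholds from the XML. It assumes the disk filled at a constant rate, which is only an approximation, and the helper name is hypothetical.

```python
# Free-space samples reported earlier in this thread
mb_at_1848 = 265
mb_at_1858 = 88
minutes_between = 10

# Assumption: the disk filled at a constant rate between and around the samples
rate_mb_per_min = (mb_at_1848 - mb_at_1858) / minutes_between

def minutes_from_threshold_to_empty(threshold_mb, mb_per_minute):
    """Time the disk spends between crossing a threshold and reaching 0 MB free."""
    return threshold_mb / mb_per_minute

poll_minutes = 3600 / 60  # IntervalSeconds from the monitor configuration
warn_window = minutes_from_threshold_to_empty(500, rate_mb_per_min)
error_window = minutes_from_threshold_to_empty(300, rate_mb_per_min)

print(f"fill rate: {rate_mb_per_min:.1f} MB/min")
print(f"warning window: {warn_window:.1f} min, error window: {error_window:.1f} min")
print(f"polling interval: {poll_minutes:.0f} min")
```

Under that assumption both windows are shorter than one polling interval, so a single hourly check could land entirely before the threshold is crossed and the next one after the disk is already full.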

May 20th, 2015 11:13pm

The most likely reason is that your override (thresholds for the system disk set to warn at 5,000 MB free and error at 3,000 MB free) is not taking effect. Use Export-SCOMEffectiveMonitoringConfiguration to export the effective configuration of the disk instance and check which settings the monitor is actually using.
https://technet.microsoft.com/library/hh527837.aspx
Roger
May 21st, 2015 4:30am

  • Marked as answer by Jester4kicks Wednesday, June 03, 2015 12:01 AM
May 21st, 2015 8:28am

Sorry for the delay. Took a much-needed vacation.

I'm trying to figure out the Export-SCOMEffectiveMonitoringConfiguration command. I wasn't having a whole lot of luck, so I ended up finding a script that's supposed to dump it out a bit more intelligently. https://gallery.technet.microsoft.com/scriptcenter/SCOM-Powershell-Export-3849e612

That's going to take a while to run.  In the meantime, there have been a few developments.

1. I created some garbage data on a different monitored server.  First I brought its system drive down to <10% free and < 5,000 MB free.  I watched to make sure Ops Man detected it, but it did not throw the warning alert.  I then took it down further to <5% and < 3,000 MB free.  Still no alert, no error.  Health Explorer still shows it as "healthy."

2. I ran the sample command from the Export-SCOMEffectiveMonitoringConfiguration cmdlet help file, filtered to instance names that started with the server name.  What's interesting is that I can't find where the system disk is monitored in the resulting output.  I see disk free monitor info for the D: drive, but not the C: drive.  Not sure what to make of that.

If anyone knows a better way to run that cmdlet, I'm all ears.

May 29th, 2015 3:59pm

I would suggest creating a custom unit monitor targeting Windows Server Operating System with the required thresholds, rather than waiting to get the original one working.

That way you will not miss any other low disk space alerts in the meantime, until you find the root cause of this issue.

May 31st, 2015 1:46pm

I may have found the issue. Unfortunately, it's probably going to get chalked up to my SCOM newbish-ness, but I wanna run it by all of you just in case.

The overrides that I posted all refer to the Logical Disk Free Space monitor.  I thought class-overrides affected multiple monitors, but thinking about that more, it doesn't make much sense.  With that in mind, I took a second look at the monitors.  If my understanding (now) is correct, the Logical Disk Free Space monitor is actually not turned on.  It's the Windows 2008 Logical Disk Free Space Monitor which is actually active.  Looking at just that monitor, there do not appear to be any applied overrides, which I think means the default of "warn @ 500 MB and error @ 300 MB" is in effect.  I'm testing this now to confirm.  What do you all think?
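
To illustrate the idea (this is a toy model in Python, not how Ops Man stores or resolves overrides): if overrides are looked up per monitor, an override saved against one monitor does nothing for another, which then falls back to its defaults. The dictionary layout and function name are hypothetical; the monitor names and numbers are the ones from this thread.

```python
# Defaults quoted in this thread for the system drive
DEFAULT_THRESHOLDS = {"warn_mb": 500, "error_mb": 300}

# Toy model: overrides are stored per monitor name
overrides = {
    "Logical Disk Free Space": {"warn_mb": 5000, "error_mb": 3000},
}

def effective_thresholds(monitor_name, overrides, defaults=DEFAULT_THRESHOLDS):
    """Start from the defaults and apply only the overrides saved for this monitor."""
    merged = dict(defaults)
    merged.update(overrides.get(monitor_name, {}))
    return merged

print(effective_thresholds("Logical Disk Free Space", overrides))
print(effective_thresholds("Windows 2008 Logical Disk Free Space Monitor", overrides))
```

The second lookup finds no override for that monitor name and returns the 500/300 MB defaults unchanged.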

June 1st, 2015 6:11pm

  • Marked as answer by Jester4kicks Tuesday, June 02, 2015 11:59 PM
June 1st, 2015 10:08pm

Hi There,

Can you open the overrides for the monitor that is disabled and check whether the Enabled option is set to False or True?

If it is False then there lies the issue. You will need to override the monitor to a true state and save the override to a custom management pack.

June 2nd, 2015 12:31am

Yup, that was it.  All those nice threshold overrides not doing a dang thing with the monitor turned off.  Thanks for putting up with a SCOM newb. ;)  
June 2nd, 2015 7:59pm

I'm marking this as an answer too, since this is a HUGE problem with the default monitor config.  (This and the ridiculously-low free space thresholds.)
June 2nd, 2015 8:01pm
