Better way to detect failed disks #25

Closed
zerebubuth opened this issue Mar 9, 2015 · 12 comments

@zerebubuth
Collaborator

Recently we've lost disks in ramoth and yevaud without notifications (at least, none that I can find). However, failed-disk notifications appear to be working fine on orm and pummelzacken.

@tomhughes
Member

It's basically a function of which RAID controller is in use: if it's soft RAID, or a hardware RAID that http://hwraid.le-vert.net/ has support for, then we get notifications. If it's a hardware RAID with only proprietary tools and no wrapper script, then we don't.
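
For reference, a minimal sketch of what those wrapper checks boil down to (illustrative only, not the shipped hwraid daemon): run megaclisas-status, parse the array table it prints (format as quoted in the next comment), and exit non-zero when anything is not Optimal so cron or the monitoring can mail about it.

#!/usr/bin/env python
# check_megaraid_arrays.py -- illustrative sketch, not the shipped hwraid tool.
# Parses the table printed by megaclisas-status and exits 2 if any array
# reports a status other than "Optimal".
import subprocess
import sys

def main():
    out = subprocess.check_output(["megaclisas-status"]).decode("utf-8", "replace")
    problems = []
    for line in out.splitlines():
        # Array rows look like: "c0u1 | RAID6 | 7633G | Optimal | None"
        fields = [f.strip() for f in line.split("|")]
        if len(fields) >= 4 and fields[0].startswith("c") and "u" in fields[0]:
            if fields[3] != "Optimal":
                problems.append("%s is %s" % (fields[0], fields[3]))
    if problems:
        print("CRITICAL: " + ", ".join(problems))
        return 2
    print("OK: all arrays optimal")
    return 0

if __name__ == "__main__":
    sys.exit(main())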

@tomhughes
Member

Looks like ramoth should be able to detect them, but the status command is bombing out:

ramoth [~] % sudo megaclisas-status 
[sudo] password for tomh: 
-- Controller informations --
-- ID | Model
c0 | LSI MegaRAID SAS 9271-8iCC

-- Arrays informations --
-- ID | Type | Size | Status | InProgress
c0u0 | RAID1 | 465G | Optimal | None
c0u1 | RAID6 | 7633G | Optimal | None
Traceback (most recent call last):
  File "/usr/sbin/megaclisas-status", line 164, in <module>
    arrayinfo = returnArrayInfo(output,controllerid,arrayid)
  File "/usr/sbin/megaclisas-status", line 101, in returnArrayInfo
    if ldpdcount and (int(spandepth) > 1):
UnboundLocalError: local variable 'ldpdcount' referenced before assignment
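
For anyone else hitting this, a quick illustrative way to narrow down which logical drive the parser chokes on (assuming the same -LDInfo invocation shown further down) is to dump each array and look for one that is missing the "Number Of Drives" line the script expects:

#!/usr/bin/env python
# Illustrative diagnostic, not part of the hwraid tools: dump -LDInfo for
# each array id on adapter 0 and flag any that lack a "Number Of Drives"
# line, which is the field megaclisas-status trips over.
import subprocess

for arrayid in range(4):   # adapter 0 claims four arrays here
    out = subprocess.check_output(
        ["megacli", "-LDInfo", "-l%d" % arrayid, "-a0", "-NoLog"]
    ).decode("utf-8", "replace")
    has_count = any(l.startswith("Number Of Drives") for l in out.splitlines())
    print("array %d: %s" % (arrayid, "ok" if has_count else "no drive count"))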

@tomhughes
Member

The yevaud card is an Areca, which hwraid doesn't have a status daemon for.

@tomhughes
Member

The error on ramoth seems to be because megacli claims there are four arrays but will only return information about the first one.

@tomhughes
Member

Ah, I was wrong about that: the actual problem is array 2, which doesn't report a number of disks:

ramoth [~] % sudo megacli -LDInfo -l2 -a0 -NoLog


Adapter 0 -- Virtual Drive Information:
CacheCade Virtual Drive: 2 (Target Id: 2)
Virtual Drive Type    : CacheCade 
Name          : 
RAID Level        : Primary-1, Secondary-0
State             : Optimal
Size          : 558.406 GB
Target Id of the Associated LDs : 0,3,1
Default Cache Policy  : WriteBack, ReadAdaptive, Direct, No Write Cache if Bad BBU
Current Cache Policy  : WriteBack, ReadAdaptive, Direct, No Write Cache if Bad BBU



Exit Code: 0x00

@tomhughes
Member

I've patched it now, but it seems to be reporting everything as optimal anyway.
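
For the record, one possible shape for such a patch (an assumption about the fix, not necessarily the change that went in): initialise the per-array fields before parsing and skip CacheCade volumes, whose -LDInfo output has no "Number Of Drives" line, which is what left ldpdcount unset and caused the UnboundLocalError above.

# Illustrative reconstruction, not the real megaclisas-status source.
# The crash happens because ldpdcount is only assigned when a
# "Number Of Drives" line is seen, and CacheCade volumes never print one.
def return_array_info_sketch(output):
    ldpdcount = None        # initialise so the later test can't blow up
    spandepth = 1
    cachecade = False
    for line in output.splitlines():
        if line.startswith("Virtual Drive Type") and "CacheCade" in line:
            cachecade = True
        elif line.startswith("Number Of Drives"):
            ldpdcount = int(line.split(":")[1].strip())
        elif line.startswith("Span Depth"):
            spandepth = int(line.split(":")[1].strip())
    if cachecade:
        return None         # caller can skip CacheCade volumes entirely
    if ldpdcount and spandepth > 1:
        ldpdcount *= spandepth   # total drives across all spans
    return ldpdcount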

@zerebubuth
Collaborator Author

Indeed, the arrays are optimal. But there's a global hot spare, which has been pulled into the array to stand in for the failed drive. So there's a failed physical disk which isn't part of any array... but we still want to know about it.
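
Something like the sketch below would catch that case (illustrative only; it assumes the usual MegaCli -PDList output with "Slot Number" and "Firmware state" lines): walk the physical drives rather than the arrays, and alert on any drive whose firmware state isn't one we consider healthy, so a dead disk that has already been replaced by the hot spare still gets reported.

#!/usr/bin/env python
# check_megaraid_pd.py -- illustrative sketch only. Walks the physical
# drives (which the array-level status misses) and complains about any
# drive whose firmware state looks unhealthy, so a failed disk that has
# been swapped out for a hot spare still raises an alert.
import subprocess
import sys

# Assumed-healthy firmware states; anything else is reported.
OK_STATES = ("Online", "Hotspare", "Unconfigured(good)")

def main():
    out = subprocess.check_output(
        ["megacli", "-PDList", "-aALL", "-NoLog"]).decode("utf-8", "replace")
    slot, problems = None, []
    for line in out.splitlines():
        if line.startswith("Slot Number"):
            slot = line.split(":")[1].strip()
        elif line.startswith("Firmware state"):
            state = line.split(":", 1)[1].strip()
            if not state.startswith(OK_STATES):
                problems.append("slot %s: %s" % (slot, state))
    if problems:
        print("CRITICAL: " + "; ".join(problems))
        return 2
    print("OK: all physical drives healthy")
    return 0

if __name__ == "__main__":
    sys.exit(main())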

@tomhughes
Member

So send a patch to https://github.com/eLvErDe/hwraid then ;-)

@tomhughes
Member

eLvErDe/hwraid#13 may actually be what we want?

@tomhughes
Member

I've opened eLvErDe/hwraid#17 for the CacheCade issue.

@zerebubuth
Collaborator Author

Is this still an issue, or are we getting much better alerting these days?

@Firefishy
Member

Closing as very old. If it's still an issue it can be reopened.
