Better way to detect failed disks #25

Closed
zerebubuth opened this issue Mar 9, 2015 · 12 comments

@zerebubuth
Collaborator

Recently we've lost disks in ramoth and yevaud without notifications (at least, none that I can find). However, failed-disk notifications appear to be working fine on orm and pummelzacken.

@tomhughes
Member

It's basically a function of which RAID controller is in use: if it's soft RAID, or a hardware RAID that http://hwraid.le-vert.net/ has support for, then we get notifications. If it's a hardware RAID with only proprietary tools and no wrapper script, then we don't.
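
For reference, a minimal sketch of what those wrapper checks boil down to (illustrative only, not the shipped hwraid daemon): run megaclisas-status, parse the array table it prints (format as quoted in the next comment), and exit non-zero when anything is not Optimal so cron or the monitoring can mail about it.

#!/usr/bin/env python
# check_megaraid_arrays.py -- illustrative sketch, not the shipped hwraid tool.
# Parses the table printed by megaclisas-status and exits 2 if any array
# reports a status other than "Optimal".
import subprocess
import sys

def main():
    out = subprocess.check_output(["megaclisas-status"]).decode("utf-8", "replace")
    problems = []
    for line in out.splitlines():
        # Array rows look like: "c0u1 | RAID6 | 7633G | Optimal | None"
        fields = [f.strip() for f in line.split("|")]
        if len(fields) >= 4 and fields[0].startswith("c") and "u" in fields[0]:
            if fields[3] != "Optimal":
                problems.append("%s is %s" % (fields[0], fields[3]))
    if problems:
        print("CRITICAL: " + ", ".join(problems))
        return 2
    print("OK: all arrays optimal")
    return 0

if __name__ == "__main__":
    sys.exit(main())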

@tomhughes
Member

Looks like ramoth should be able to detect them, but the status command is bombing out:

ramoth [~] % sudo megaclisas-status 
[sudo] password for tomh: 
-- Controller informations --
-- ID | Model
c0 | LSI MegaRAID SAS 9271-8iCC

-- Arrays informations --
-- ID | Type | Size | Status | InProgress
c0u0 | RAID1 | 465G | Optimal | None
c0u1 | RAID6 | 7633G | Optimal | None
Traceback (most recent call last):
  File "/usr/sbin/megaclisas-status", line 164, in <module>
    arrayinfo = returnArrayInfo(output,controllerid,arrayid)
  File "/usr/sbin/megaclisas-status", line 101, in returnArrayInfo
    if ldpdcount and (int(spandepth) > 1):
UnboundLocalError: local variable 'ldpdcount' referenced before assignment
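
For anyone else hitting this, a quick illustrative way to narrow down which logical drive the parser chokes on (assuming the same -LDInfo invocation shown further down) is to dump each array and look for one that is missing the "Number Of Drives" line the script expects:

#!/usr/bin/env python
# Illustrative diagnostic, not part of the hwraid tools: dump -LDInfo for
# each array id on adapter 0 and flag any that lack a "Number Of Drives"
# line, which is the field megaclisas-status trips over.
import subprocess

for arrayid in range(4):   # adapter 0 claims four arrays here
    out = subprocess.check_output(
        ["megacli", "-LDInfo", "-l%d" % arrayid, "-a0", "-NoLog"]
    ).decode("utf-8", "replace")
    has_count = any(l.startswith("Number Of Drives") for l in out.splitlines())
    print("array %d: %s" % (arrayid, "ok" if has_count else "no drive count"))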

@tomhughes
Member

The yevaud card is an Areca, which hwraid doesn't have a status daemon for.

@tomhughes
Member

The error on ramoth seems to be because megacli claims there are four arrays but will only return information about the first one.

@tomhughes
Member

Ah, I was wrong about that: the actual problem is array 2, which doesn't report a number of disks:

ramoth [~] % sudo megacli -LDInfo -l2 -a0 -NoLog


Adapter 0 -- Virtual Drive Information:
CacheCade Virtual Drive: 2 (Target Id: 2)
Virtual Drive Type    : CacheCade 
Name          : 
RAID Level        : Primary-1, Secondary-0
State             : Optimal
Size          : 558.406 GB
Target Id of the Associated LDs : 0,3,1
Default Cache Policy  : WriteBack, ReadAdaptive, Direct, No Write Cache if Bad BBU
Current Cache Policy  : WriteBack, ReadAdaptive, Direct, No Write Cache if Bad BBU



Exit Code: 0x00

@tomhughes
Member

I've patched it now, but it seems to be reporting everything as optimal anyway.
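
For the record, one possible shape for such a patch (an assumption about the fix, not necessarily the change that went in): initialise the per-array fields before parsing and skip CacheCade volumes, whose -LDInfo output has no "Number Of Drives" line, which is what left ldpdcount unset and caused the UnboundLocalError above.

# Illustrative reconstruction, not the real megaclisas-status source.
# The crash happens because ldpdcount is only assigned when a
# "Number Of Drives" line is seen, and CacheCade volumes never print one.
def return_array_info_sketch(output):
    ldpdcount = None        # initialise so the later test can't blow up
    spandepth = 1
    cachecade = False
    for line in output.splitlines():
        if line.startswith("Virtual Drive Type") and "CacheCade" in line:
            cachecade = True
        elif line.startswith("Number Of Drives"):
            ldpdcount = int(line.split(":")[1].strip())
        elif line.startswith("Span Depth"):
            spandepth = int(line.split(":")[1].strip())
    if cachecade:
        return None         # caller can skip CacheCade volumes entirely
    if ldpdcount and spandepth > 1:
        ldpdcount *= spandepth   # total drives across all spans
    return ldpdcount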

@zerebubuth
Collaborator Author

Indeed, the arrays are optimal. But there's a global hot spare, which has been pulled into the array to stand in for the failed drive. So there's a failed physical disk which isn't part of any array... but we still want to know about it.
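
Something like the sketch below would catch that case (illustrative only; it assumes the usual MegaCli -PDList output with "Slot Number" and "Firmware state" lines): walk the physical drives rather than the arrays, and alert on any drive whose firmware state isn't one we consider healthy, so a dead disk that has already been replaced by the hot spare still gets reported.

#!/usr/bin/env python
# check_megaraid_pd.py -- illustrative sketch only. Walks the physical
# drives (which the array-level status misses) and complains about any
# drive whose firmware state looks unhealthy, so a failed disk that has
# been swapped out for a hot spare still raises an alert.
import subprocess
import sys

# Assumed-healthy firmware states; anything else is reported.
OK_STATES = ("Online", "Hotspare", "Unconfigured(good)")

def main():
    out = subprocess.check_output(
        ["megacli", "-PDList", "-aALL", "-NoLog"]).decode("utf-8", "replace")
    slot, problems = None, []
    for line in out.splitlines():
        if line.startswith("Slot Number"):
            slot = line.split(":")[1].strip()
        elif line.startswith("Firmware state"):
            state = line.split(":", 1)[1].strip()
            if not state.startswith(OK_STATES):
                problems.append("slot %s: %s" % (slot, state))
    if problems:
        print("CRITICAL: " + "; ".join(problems))
        return 2
    print("OK: all physical drives healthy")
    return 0

if __name__ == "__main__":
    sys.exit(main())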

@tomhughes
Member

So send a patch to https://github.com/eLvErDe/hwraid then ;-)

@tomhughes
Member

eLvErDe/hwraid#13 may actually be what we want?

@tomhughes
Member

I've opened eLvErDe/hwraid#17 for the CacheCade issue.

@zerebubuth
Collaborator Author

Is this still an issue, or are we getting much better alerting these days?

@Firefishy
Member

Closing as very old. If it's still an issue it can be reopened.
