RAID Card Management

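Servers from the major vendors, including IBM, DELL, HP, Huawei, Lenovo, PowerLeader, Inspur, and Sugon, generally use LSI-branded RAID cards as their storage controller.

Besides the graphical BIOS interface, LSI RAID cards can also be managed with command-line utilities.

The SAS/SATA RAID controller models (chips) that LSI has released so far include LSI1064, LSI1068, LSI1078, LSI2008, LSI2108, LSI2208, LSI2308, LSI3008, and LSI3108.

Command-line management utility for each chip:

  • 1064: cfggen
  • 1068: lsiutil, cfggen
  • 2108: MegaCli
  • 2208: MegaCli
  • 2308: SAS2IRCU
  • 3008: SAS3IRCU
  • 3108: storcli
  • HP-specific: hpacucli, hpssacli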

Checking the LSI card model

# lspci | grep -i "lsi"
02:00.0 RAID bus controller: LSI Logic / Symbios Logic MegaRAID SAS 2208 [Thunderbolt] (rev 05)

# lspci | grep -i "lsi"
03:00.0 Serial Attached SCSI controller: LSI Logic / Symbios Logic SAS2308 PCI-Express Fusion-MPT SAS-2 (rev 05)
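
The chip-to-utility list above can be turned into a small lookup. The following is only a sketch: the function name suggest_raid_tool is made up for illustration, and it assumes the chip number appears somewhere in the lspci description line, as it does in the two samples above.

```shell
# Hypothetical helper: given one line of `lspci` output, print the management
# utility that the list above associates with the chip number found in it.
suggest_raid_tool() {
    case "$1" in
        *3108*)        echo "storcli" ;;
        *3008*)        echo "SAS3IRCU" ;;
        *2308*)        echo "SAS2IRCU" ;;
        *2208*|*2108*) echo "MegaCli" ;;
        *1068*)        echo "lsiutil or cfggen" ;;
        *1064*)        echo "cfggen" ;;
        *)             echo "unknown" ;;
    esac
}

# Example usage (requires lspci on the host):
#   lspci | grep -i "lsi" | while read -r line; do suggest_raid_tool "$line"; done
```

Chips that the list does not cover (for example 1078 or 2008) fall through to "unknown" rather than being guessed.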

RAID card management utilities

MegaCli

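Two counters in the MegaCli output, Media Error Count and Other Error Count, are the usual indicators of whether a disk in the array has a problem. Media Error Count points to errors on the disk media itself, typically bad sectors; any non-zero value deserves attention, and the larger the value, the higher the risk. Other Error Count usually points to a loose connection, so the disk may need to be reseated.

Key fields:

  • Adapter #0: RAID adapter number
  • Enclosure Device ID: 32: enclosure ID of the physical disk
  • Slot Number: 0: slot position of the physical disk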
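
As a rough sketch of how the two counters can be checked automatically, the function below reads -PDList output on standard input and prints only the disks with a non-zero counter. The name report_disk_errors is hypothetical, and the parsing assumes the field order that MegaCli prints for each disk (Enclosure Device ID, Slot Number, Media Error Count, Other Error Count).

```shell
# Hypothetical helper: read `MegaCli64 -PDList -aALL` output on stdin and print
# one line for every disk whose Media or Other Error Count is non-zero.
report_disk_errors() {
    awk -F': *' '
        /^Enclosure Device ID:/ { encl = $2 }
        /^Slot Number:/         { slot = $2 }
        /^Media Error Count:/   { media = $2 }
        /^Other Error Count:/   {
            other = $2
            if (media + 0 > 0 || other + 0 > 0)
                printf "[%s:%s] media=%s other=%s\n", encl, slot, media, other
        }
    '
}

# Example usage:
#   /opt/MegaRAID/MegaCli/MegaCli64 -PDList -aALL | report_disk_errors
```

The [enclosure:slot] pair in each output line is in the same form that -PhysDrv expects, so it can be pasted into a follow-up -pdInfo command.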

Common commands

/opt/MegaRAID/MegaCli/MegaCli64 -PDList -aALL  # show information for all physical disks
/opt/MegaRAID/MegaCli/MegaCli64 -pdInfo -PhysDrv[32:3] -aALL  # show details for one disk
/opt/MegaRAID/MegaCli/MegaCli64 -LDInfo -Lall -aALL  # show the logical drives and their RAID levels
/opt/MegaRAID/MegaCli/MegaCli64 -AdpAllInfo -aALL  # show adapter details, including the supported RAID levels
/opt/MegaRAID/MegaCli/MegaCli64 -AdpBbuCmd -aAll  # show battery (BBU) information
/opt/MegaRAID/MegaCli/MegaCli64 -FwTermLog -Dsply -aALL  # show the RAID card log
/opt/MegaRAID/MegaCli/MegaCli64 -adpCount  # show the number of adapters
/opt/MegaRAID/MegaCli/MegaCli64 -AdpGetTime -aALL  # show the adapter time
/opt/MegaRAID/MegaCli/MegaCli64 -AdpBbuCmd -GetBbuStatus -aALL | grep "Charger Status"  # show the charging status
/opt/MegaRAID/MegaCli/MegaCli64 -AdpBbuCmd -GetBbuStatus -aALL  # show BBU status information
/opt/MegaRAID/MegaCli/MegaCli64 -AdpBbuCmd -GetBbuCapacityInfo -aALL  # show BBU capacity information
/opt/MegaRAID/MegaCli/MegaCli64 -AdpBbuCmd -GetBbuDesignInfo -aALL  # show BBU design parameters
/opt/MegaRAID/MegaCli/MegaCli64 -AdpBbuCmd -GetBbuProperties -aALL  # show the current BBU properties
/opt/MegaRAID/MegaCli/MegaCli64 -cfgdsply -aALL  # show the RAID card model, RAID configuration, and disk information
/opt/MegaRAID/MegaCli/MegaCli64 -cfgclr -a0  # clear the configuration of ALL RAID groups on adapter 0
/opt/MegaRAID/MegaCli/MegaCli64 -cfglddel -L0 -a0  # delete the RAID group with Target Id 0; the Target Id is shown in the -LDInfo output above
/opt/MegaRAID/MegaCli/MegaCli64 -PDRbld -ProgDsply -PhysDrv [12:7] -aALL  # show the rebuild progress of a disk

Common issues

  1. Error:

The specified physical disk does not have the appropriate attributes to complete 
the requested command.

Exit Code: 0x26

Fix:

/opt/MegaRAID/MegaCli/MegaCli64 -cfgforeign -scan -a0  # scan for foreign configurations
/opt/MegaRAID/MegaCli/MegaCli64 -cfgforeign -clear -a0 # clear the foreign configurations

Reference: https://forum.huawei.com/enterprise/zh/thread-430333.html

  2. Error:

The current operation is not allowed because the controller has data in cache for offline or missing virtual drives.

Fix:

/opt/MegaRAID/MegaCli/MegaCli64 -GetPreservedCacheList -aALL
Adapter #1
Virtual Drive(Target ID 11): Missing.
Exit Code: 0x00

/opt/MegaRAID/MegaCli/MegaCli64 -DiscardPreservedCache -L11 -a0
Adapter #0
Virtual Drive(Target ID 11): Preserved Cache Data Cleared.
Exit Code: 0x00

  3. Error:

FW error description: 
The current operation is not allowed because the controller has data in cache for offline or missing virtual drives.

Exit Code: 0x54

Fix:

/opt/MegaRAID/MegaCli/MegaCli64 -GetPreservedCacheList -aALL  # find the Target ID in this output; it is the value to pass to -L
/opt/MegaRAID/MegaCli/MegaCli64 -DiscardPreservedCache -L1 -a0
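
Picking the Target ID out of the -GetPreservedCacheList output by hand is error-prone, so here is a minimal sketch that does it with sed. Both function names are hypothetical, the parsing assumes lines of the form "Virtual Drive(Target ID 11): Missing." as in the output shown earlier, and the second function only prints the discard commands instead of running them, because discarding preserved cache drops the cached data.

```shell
# Hypothetical helper: read -GetPreservedCacheList output on stdin and print
# each Target ID on its own line.
preserved_cache_targets() {
    sed -n 's/^Virtual Drive(Target ID \([0-9][0-9]*\)).*/\1/p'
}

# Hypothetical helper: print (not execute) one discard command per Target ID.
# $1 is the adapter number to use for -a.
print_discard_commands() {
    preserved_cache_targets | while read -r id; do
        echo "/opt/MegaRAID/MegaCli/MegaCli64 -DiscardPreservedCache -L${id} -a$1"
    done
}

# Example usage:
#   /opt/MegaRAID/MegaCli/MegaCli64 -GetPreservedCacheList -a0 | print_discard_commands 0
```

Review the printed commands before running any of them.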

  4. Abnormal disk state:

Firmware state: Unconfigured(bad)

Fix:

/opt/MegaRAID/MegaCli/MegaCli64 -PDMakeGood -PhysDrv[0:9] -a0

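Viewing the cache policy

/opt/MegaRAID/MegaCli/MegaCli64 -LDGetProp -Cache -L0 -a0  # cache policy of logical drive 0 on adapter 0
/opt/MegaRAID/MegaCli/MegaCli64 -LDGetProp -Cache -L1 -a0  # cache policy of logical drive 1 on adapter 0
/opt/MegaRAID/MegaCli/MegaCli64 -LDGetProp -Cache -LALL -a0  # all logical drives on adapter 0
/opt/MegaRAID/MegaCli/MegaCli64 -LDGetProp -Cache -LALL -aALL  # all logical drives on all adapters
/opt/MegaRAID/MegaCli/MegaCli64 -LDGetProp -DskCache -LALL -aALL  # disk cache setting of all logical drives on all adapters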

Creating an array

# Create a RAID 5 array from physical disks 2, 3, and 4, with physical disk 5 as its hot spare
/opt/MegaRAID/MegaCli/MegaCli64 -CfgLdAdd -r5 [1:2,1:3,1:4] WB Direct -Hsp[1:5] -a0
# Create the array without a hot spare
/opt/MegaRAID/MegaCli/MegaCli64 -CfgLdAdd -r5 [1:2,1:3,1:4] WB Direct -a0
# Show the Target Id and size of each logical drive
/opt/MegaRAID/MegaCli/MegaCli64 -LDInfo -Lall -aALL | grep -e 'Target Id' -e '^Size'
# Delete an array; the number after -L is the Target Id shown by the previous command
/opt/MegaRAID/MegaCli/MegaCli64 -CfgLdDel -L1 -a0

Managing disks and hot spares in a RAID

# Add a disk to an existing array online
/opt/MegaRAID/MegaCli/MegaCli64 -LDRecon -Start -r5 -Add -PhysDrv[1:4] -L1 -a0

# Set disk 5 as a global hot spare (arguments in square brackets are optional; do not type the brackets)
/opt/MegaRAID/MegaCli/MegaCli64 -PDHSP -Set [-EnclAffinity] [-nonRevertible] -PhysDrv[1:5] -a0

# Set disk 5 as a dedicated hot spare for one array (Array1 here)
/opt/MegaRAID/MegaCli/MegaCli64 -PDHSP -Set -Dedicated -Array1 [-EnclAffinity] [-nonRevertible] -PhysDrv[1:5] -a0

# Remove the hot spare
/opt/MegaRAID/MegaCli/MegaCli64 -PDHSP -Rmv -PhysDrv[1:5] -a0

# Take a physical disk offline, then bring it back online
/opt/MegaRAID/MegaCli/MegaCli64 -PDOffline -PhysDrv [1:4] -a0
/opt/MegaRAID/MegaCli/MegaCli64 -PDOnline -PhysDrv [1:4] -a0

Converting RAID 0 to RAID 1:

/opt/MegaRAID/MegaCli/MegaCli64 -PDList -aALL
/opt/MegaRAID/MegaCli/MegaCli64 -LDInfo -Lall -aALL
/opt/MegaRAID/MegaCli/MegaCli64 -ldrecon -start -r1 -add -physdrv[252:1] -l0 -a0 # -physdrv is the [Enclosure Device ID:Slot Number] of the new disk; -l is the Target Id of the existing RAID 0 drive
/opt/MegaRAID/MegaCli/MegaCli64 -ldrecon -showprog -l0 -a0 # show the progress

Checking progress

# Show array initialization progress
/opt/MegaRAID/MegaCli/MegaCli64 -LDInit -ShowProg -LALL -aALL
# Show initialization progress in a live text display
/opt/MegaRAID/MegaCli/MegaCli64 -LDInit -ProgDsply -LALL -aALL

# Show background initialization progress
/opt/MegaRAID/MegaCli/MegaCli64 -LDBI -ShowProg -LALL -aALL
# Show background initialization progress in a live text display
/opt/MegaRAID/MegaCli/MegaCli64 -LDBI -ProgDsply -LALL -aALL

# Show the rebuild progress of a physical disk
/opt/MegaRAID/MegaCli/MegaCli64 -PDRbld -ShowProg -PhysDrv [1:5] -a0
# Show the rebuild progress in a live text display
/opt/MegaRAID/MegaCli/MegaCli64 -PDRbld -ProgDsply -PhysDrv [1:5] -a0

Examples

Show information for all disks:

# /opt/MegaRAID/MegaCli/MegaCli64 -PDList -aALL | grep -E "Enclosure Device ID:|Slot Number:|Raw Size:|Media Error Count:|Other Error Count|Inquiry Data:"
Enclosure Device ID: 32
Slot Number: 0
Media Error Count: 0
Other Error Count: 2
Raw Size: 745.211 GB [0x5d26ceb0 Sectors]
Inquiry Data: BTHC552502N7800NGNINTEL SSDSC2BX800G4R G201DL29
Enclosure Device ID: 32
Slot Number: 1
Media Error Count: 0
Other Error Count: 0
Raw Size: 3.638 TB [0x1d1c0beb0 Sectors]
Inquiry Data: SEAGATE ST4000NM0023 GS11S1Z1SBYV
... (remaining output omitted)

Show the details of one disk. In [32:1], 32 is the Enclosure Device ID and 1 is the Slot Number:

# /opt/MegaRAID/MegaCli/MegaCli64 -pdInfo -PhysDrv[32:1] -aALL

Enclosure Device ID: 32
Slot Number: 1
Enclosure position: 1
Device Id: 1
WWN: 5000C5009609B430
Sequence Number: 2
Media Error Count: 0
Other Error Count: 0
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SAS

Raw Size: 3.638 TB [0x1d1c0beb0 Sectors]
Non Coerced Size: 3.637 TB [0x1d1b0beb0 Sectors]
Coerced Size: 3.637 TB [0x1d1b00000 Sectors]
Sector Size: 512
Logical Sector Size: 512
Physical Sector Size: 512
Firmware state: JBOD
Device Firmware Level: GS11
Shield Counter: 0
Successful diagnostics completion on : N/A
SAS Address(0): 0x5000c5009609b431
SAS Address(1): 0x0
Connected Port Number: 0(path0)
Inquiry Data: SEAGATE ST4000NM0023 GS11S1Z1SBYV
FDE Capable: Not Capable
FDE Enable: Disable
Secured: Unsecured
Locked: Unlocked
Needs EKM Attention: No
Foreign State: None
Device Speed: 6.0Gb/s
Link Speed: 6.0Gb/s
Media Type: Hard Disk Device
Drive Temperature :33C (91.40 F)
PI Eligibility: No
Drive is formatted for PI information: No
PI: No PI
Port-0 :
Port status: Active
Port's Linkspeed: 6.0Gb/s
Port-1 :
Port status: Active
Port's Linkspeed: 6.0Gb/s
Drive has flagged a S.M.A.R.T alert : No

Exit Code: 0x00

Steps for recreating the RAID after replacing a disk:

# Scan for foreign configurations and report how many there are
/opt/MegaRAID/MegaCli/MegaCli64 -cfgforeign -scan -a0
# Clear the foreign configurations
/opt/MegaRAID/MegaCli/MegaCli64 -cfgforeign -clear -a0
# Scan again to confirm that none remain
/opt/MegaRAID/MegaCli/MegaCli64 -cfgforeign -scan -a0
# Create a RAID 0
/opt/MegaRAID/MegaCli/MegaCli64 -CfgLdAdd -r0[32:2] WB Direct -a0

SAS2IRCU

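sas2ircu is a standalone executable that can be used directly without installation. It manages the LSI2308 RAID card. Usage:

./sas2ircu -h  # show help
./sas2ircu list  # list all RAID controllers
./sas2ircu 0 display  # show details of controller 0, its physical disks, and its logical volumes; controllers are numbered from 0, so a host with a single controller uses 0
./sas2ircu 0 status  # show all logical volumes on controller 0
./sas2ircu 0 delete noprompt  # delete ALL RAID configuration on controller 0
./sas2ircu 0 create raid1 max 2:0 2:1 noprompt  # on controller 0, build a RAID 1 from the first and second physical disks, using the maximum available size
./sas2ircu 0 create raid10 max 2:2 2:3 2:4 2:5 2:6 2:7 2:8 2:9 2:10 2:11 noprompt  # on controller 0, build a RAID 10 from the third through twelfth physical disks, using the maximum available size
# Note: the LSI SAS2308 supports at most 2 RAID volumes, with at most 10 disks in a single volume.
# All volumes together can contain at most 14 disks; any remaining disks can only be managed by the LSI SAS2308 as standalone "Physical drive" devices.
./sas2ircu 0 bootir 286  # set the volume with Volume ID 286 as the preferred boot volume
./sas2ircu 0 hotspare 2:10  # set the 11th physical disk as a hot spare
./sas2ircu 0 hotspare delete 2:10  # remove the hot spare
./sas2ircu 0 logir  # upload or clear the log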

hpssacli

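hpssacli download: https://downloads.linux.hpe.com/SDR/repo/mcp/centos/7/x86_64/current/

Command examples: https://www.kvm.la/1032.html

After a RAID is deleted, the device names shift. For example, if the RAID behind sdb is deleted, sdc becomes sdb, sdd becomes sdc, and so on.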

Mapping Linux device names to physical slots

First get the serial number (SN) of the device, usually on the line that starts with Serial Number:

# smartctl -a /dev/sdg | grep "^Serial Number:"
Serial Number: WD-WMC130E5F1LZ

Or:

# ls -l /dev/disk/by-id/ | grep sdg
lrwxrwxrwx 1 root root 9 Sep 24 13:24 ata-WDC_WD4000FYYZ-01UL1B2_WD-WMC130E5F1LZ -> ../../sdg
lrwxrwxrwx 1 root root 9 Sep 24 13:24 scsi-SATA_WDC_WD4000FYYZ-_WD-WMC130E5F1LZ -> ../../sdg
lrwxrwxrwx 1 root root 9 Sep 24 13:24 wwn-0x50014ee0aec22ba1 -> ../../sdg

Then use the RAID card utility to find which physical slot holds that SN.

The command depends on the RAID card; the LSI 2308 is used as the example here.

# ./sas2ircu list
LSI Corporation SAS2 IR Configuration Utility.
Version 13.00.00.00 (2012.02.17)
Copyright (c) 2009-2012 LSI Corporation. All rights reserved.


         Adapter      Vendor  Device                       SubSys  SubSys
 Index    Type          ID      ID    Pci Address          Ven ID  Dev ID
 -----  ------------  ------  ------  -----------------    ------  ------
   0     SAS2308_2     1000h    87h   00h:03h:00h:00h      1000h   3050h
SAS2IRCU: Utility Completed Successfully.

The output shows a single RAID card, with Index 0.

# ./sas2ircu 0 display | grep -C 10 "WMC130E5F1LZ"

Device is a Hard disk
Enclosure # : 2
Slot # : 5
SAS Address : 5001c45-0-2825-da85
State : Ready (RDY)
Size (in MB)/(in sectors) : 3815447/7814037167
Manufacturer : ATA
Model Number : WDC WD4000FYYZ-0
Firmware Revision : 1K03
Serial No : WDWMC130E5F1LZ
GUID : 50014ee0aec22ba1
Protocol : SATA
Drive Type : SATA_HDD

Device is a Hard disk
Enclosure # : 2
Slot # : 6
SAS Address : 5001c45-0-2825-da86
State : Ready (RDY)
Size (in MB)/(in sectors) : 3815447/7814037167

The disk whose SN contains WMC130E5F1LZ is in slot 5. The following command gives the same answer directly:

# ./sas2ircu 0 display | awk '{if($0 ~ " Slot"){slot=$0}; if($0 ~ "WMC130E5F1LZ"){print slot}}'
Slot # : 5
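
The lookup can also be wrapped in two small functions, sketched below. The names list_slot_serials and find_slot_by_serial are hypothetical. The hyphen stripping is an assumption based on the one sample above, where smartctl reports WD-WMC130E5F1LZ but sas2ircu prints WDWMC130E5F1LZ; other drive models may differ.

```shell
# Hypothetical helper: read `sas2ircu <index> display` output on stdin and
# print "enclosure:slot serial" for every disk.
list_slot_serials() {
    awk -F' *: *' '
        $1 ~ /Enclosure/ && NF >= 2 { encl = $2 }
        $1 ~ /Slot/                 { slot = $2 }
        $1 ~ /Serial No/            { print encl ":" slot, $2 }
    '
}

# Hypothetical helper: print the enclosure:slot of the disk with serial $1.
# Hyphens are removed from $1 first (assumption: sas2ircu drops them).
find_slot_by_serial() {
    list_slot_serials |
        awk -v sn="$(printf '%s' "$1" | sed 's/-//g')" '$2 == sn { print $1 }'
}

# Example usage:
#   ./sas2ircu 0 display | find_slot_by_serial "WD-WMC130E5F1LZ"
```

Unlike the grep approach, this matches the whole serial number, so a partial match against another disk cannot produce a wrong slot.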

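Binding device names to slots

When a disk is hot-swapped in the same slot, the system gives the new disk the next free device name, so device names end up out of order.

The device name that Linux assigns to a disk has nothing to do with its slot; it depends only on the order in which disks are inserted.

There are three main ways to deal with this:

  1. Patch the kernel; see http://ilinuxkernel.com/?p=462
  2. Mount disks by UUID or LABEL. This does not bind device names to slots, but it keeps mounts working regardless of the name.
  3. Bind names through system configuration files (a method was only found for CentOS 7).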
On CentOS 7, the following method may bind a device name to a slot. It has not been verified.

udevadm info -q path -n /dev/sda
/devices/pci0000:00/0000:00:10.0/host0/target0:0:0/0:0:0:0/block/sda

Take the device path from the output and add the following rule to /etc/udev/rules.d/80-mydisk.rules:

DEVPATH=="/devices/pci0000:00/0000:00:10.0/host0/target0:0:0/0:0:0:0/block/sda", NAME="sda", MODE="0660"

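With this rule, the disk in that slot is meant to always get the device name sda instead of jumping to another name. Note that recent udev versions may ignore NAME= for block devices, in which case adding a stable symlink with SYMLINK+= is the usual alternative.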

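Disk health checks

smartctl is the usual tool for checking disks.

smartctl -H /dev/sda  # check the overall health status
smartctl -A /dev/sda  # show the SMART attributes of the disk
smartctl -s on /dev/sda  # enable SMART if it is not already enabled
smartctl -t short /dev/sda  # run the short self-test in the background
smartctl -t long /dev/sda  # run the long self-test in the background
smartctl -C -t short /dev/sda  # run the short self-test in the foreground
smartctl -C -t long /dev/sda  # run the long self-test in the foreground; these tests use the disk's built-in SMART self-test routines
smartctl -X /dev/sda  # abort a running background self-test
smartctl -l selftest /dev/sda  # show the self-test log
smartctl -l error /dev/sda  # show the error log summary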
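
For scripted checks, the result of smartctl -H can be reduced to a pass/fail exit status. This is only a sketch: the function name smart_health_ok is hypothetical, and the two strings it looks for are the ones smartctl typically prints for ATA disks ("test result: PASSED") and SAS/SCSI disks ("SMART Health Status: OK"), which is an assumption about the output format on a given system.

```shell
# Hypothetical helper: read `smartctl -H <device>` output on stdin.
# Returns 0 if a healthy verdict is found, non-zero otherwise.
smart_health_ok() {
    grep -Eq 'test result: PASSED|SMART Health Status: OK'
}

# Example usage:
#   if smartctl -H /dev/sda | smart_health_ok; then
#       echo "sda looks healthy"
#   else
#       echo "sda needs attention"
#   fi
```

Any output without one of the two healthy strings, including an error from smartctl itself, is treated as "needs attention".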

References:
Main source:
https://blog.csdn.net/xtggbmdk/article/details/82817784
Mapping device names to slots:
https://blog.csdn.net/zhaominpro/article/details/81359348
MegaCli64:
http://blog.sina.com.cn/s/blog_57c70e190101ebl9.html
https://www.jianshu.com/p/0b4e0f5ffe93
https://segmentfault.com/a/1190000011514147
Binding device names to slots:
https://www.cnblogs.com/lianggn/p/6913925.html
https://www.cnblogs.com/bldly1989/p/7117367.html
http://ilinuxkernel.com/?p=462
Disk health checks:
https://blog.csdn.net/beckdon/article/details/12441245