Hello,
This morning my raid-6 array failed after less than three years.
The computer is only booted up twice a month for about 10 minutes.
I'm just glad I have the same data backed up on a USB hard drive,
and after this, I'm getting a second one.
Whether the raid array can be repaired or not, the array
will no longer be used for the time being.
The raid-6 system:
OS : Arch Linux 6.18.7-arch1-1 x86_64
CPU : Intel Xeon E-2388G
Mainboard : Asus P12R-M-10G-2T
RAM : 16GB DDR4-3200 ECC
Controller : Broadcom HBA 9500-8i
SSD : 8x Crucial MX500 4TB connected to Broadcom controller
Power supply : be quiet! Straight Power 12, 1000 W
The motherboard, controller, and SSDs have the latest firmware.
Apart from the controller,
no other hardware is installed in the computer. (Headless; access via SSH and NFS.)
The raid-6 volume uses the BTRFS file system.
The mount parameters of the raid volume on the raid system are:
defaults,compress=zstd:1,ssd,discard=async
The mount parameters on the client computers are:
defaults,noauto,users
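For reference, in /etc/fstab these correspond roughly to entries like the following (the UUID, export path, and mount points here are placeholders, not my actual values):
UUID=<raid-fs-uuid>  /srv/raid6  btrfs  defaults,compress=zstd:1,ssd,discard=async  0 0
fileserver:/srv/raid6  /mnt/raid6  nfs  defaults,noauto,users  0 0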
What I find strange is that all 8 SSDs are present and accessible.
And according to the SMART values, as far as I understand,
they are all error-free. But according to mdadm, something seems to be wrong.
sda1 and sdc1 are completely missing, which can't be right,
because otherwise they wouldn't be recognized by the system or smartmontools at all.
sdg1 and sdh1 are declared as faulty in mdadm.
Or is the controller defective?
lsblk -f
NAME FSTYPE FSVER LABEL UUID
sda
└─sda1 linux_raid_member 1.2 fileserver:RAID6 ...
sdb
└─sdb1 linux_raid_member 1.2 fileserver:RAID6 ...
└─md127
sdc
└─sdc1 linux_raid_member 1.2 fileserver:RAID6 ...
sdd
└─sdd1 linux_raid_member 1.2 fileserver:RAID6 ...
└─md127
sde
└─sde1 linux_raid_member 1.2 fileserver:RAID6 ...
└─md127
sdf
└─sdf1 linux_raid_member 1.2 fileserver:RAID6 ...
└─md127
sdg
└─sdg1 linux_raid_member 1.2 fileserver:RAID6 ...
└─md127
sdh
└─sdh1 linux_raid_member 1.2 fileserver:RAID6 ...
└─md127
sudo mdadm --detail /dev/md127
/dev/md127:
Version : 1.2
Creation Time : Sat Sep 30 18:19:57 2023
Raid Level : raid6
Used Dev Size : 18446744073709551615
Raid Devices : 8
Total Devices : 6
Persistence : Superblock is persistent
Update Time : Sat Feb 7 13:38:02 2026
State : active, FAILED, Not Started
Active Devices : 4
Working Devices : 4
Failed Devices : 2
Spare Devices : 0
Layout : left-symmetric
Chunk Size : 512K
Consistency Policy : unknown
Name : fileserver:RAID6 (local to host fileserver)
UUID : ...
Events : 4275
Number Major Minor RaidDevice State
- 0 0 0 removed
- 0 0 1 removed
- 0 0 2 removed
- 0 0 3 removed
- 0 0 4 removed
- 0 0 5 removed
- 0 0 6 removed
- 0 0 7 removed
- 8 113 - faulty /dev/sdh1
- 8 97 - faulty /dev/sdg1
- 8 81 5 sync /dev/sdf1
- 8 65 4 sync /dev/sde1
- 8 49 3 sync /dev/sdd1
- 8 17 0 sync /dev/sdb1
cat /proc/mdstat
Personalities : [raid4] [raid5] [raid6]
md127 : inactive sdf1[5] sdd1[3] sdh1[7](F) sdg1[6](F) sdb1[0] sde1[4]
15627014144 blocks super 1.2
unused devices: <none>
sudo mdadm --detail /dev/sdg1
mdadm: /dev/sdg1 does not appear to be an md device
sudo mdadm --detail /dev/sdh1
mdadm: /dev/sdh1 does not appear to be an md device
sudo smartctl -a /dev/sdg1
smartctl 7.5 2025-04-30 r5714 [x86_64-linux-6.18.7-arch1-1] (local build)
Copyright (C) 2002-25, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Family: Crucial/Micron Client SSDs
Device Model: CT4000MX500SSD1
Serial Number: ...
LU WWN Device Id: 5 ...
Firmware Version: M3CR046
User Capacity: 4.000.787.030.016 bytes [4,00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: Solid State Device
Form Factor: 2.5 inches
TRIM Command: Available
Device is: In smartctl database 7.5/6083
ATA Version is: ACS-3 T13/2161-D revision 5
SATA Version is: SATA 3.3, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Sat Feb 7 17:42:39 2026 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x80) Offline data collection activity
was never started.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 0) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 30) minutes.
Conveyance self-test routine
recommended polling time: ( 2) minutes.
SCT capabilities: (0x0031) SCT Status supported.
SCT Feature Control supported.
SCT Data Table supported.
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 100 100 000 Pre-fail Always - 0
5 Reallocate_NAND_Blk_Cnt 0x0032 100 100 010 Old_age Always - 0
9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 3265
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 183
171 Program_Fail_Count 0x0032 100 100 000 Old_age Always - 0
172 Erase_Fail_Count 0x0032 100 100 000 Old_age Always - 0
173 Ave_Block-Erase_Count 0x0032 100 100 000 Old_age Always - 2
174 Unexpect_Power_Loss_Ct 0x0032 100 100 000 Old_age Always - 156
180 Unused_Reserve_NAND_Blk 0x0033 000 000 000 Pre-fail Always - 240
183 SATA_Interfac_Downshift 0x0032 100 100 000 Old_age Always - 0
184 Error_Correction_Count 0x0032 100 100 000 Old_age Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
194 Temperature_Celsius 0x0022 071 044 000 Old_age Always - 29 (Min/Max 0/56)
196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 0
197 Current_Pending_ECC_Cnt 0x0032 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 100 100 000 Old_age Always - 0
202 Percent_Lifetime_Remain 0x0030 100 100 001 Old_age Offline - 0
206 Write_Error_Rate 0x000e 100 100 000 Old_age Always - 0
210 Success_RAIN_Recov_Cnt 0x0032 100 100 000 Old_age Always - 0
246 Total_LBAs_Written 0x0032 100 100 000 Old_age Always - 9424242918
247 Host_Program_Page_Count 0x0032 100 100 000 Old_age Always - 75709343
248 FTL_Program_Page_Count 0x0032 100 100 000 Old_age Always - 70653098
SMART Error Log Version: 1
No Errors Logged
SMART Self-test log structure revision number 1
No self-tests have been logged. [To run self-tests, use: smartctl -t]
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Completed [00% left] (0-65535)
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
The above only provides legacy SMART information - try 'smartctl -x' for more
sudo smartctl -a /dev/sdh1
smartctl 7.5 2025-04-30 r5714 [x86_64-linux-6.18.7-arch1-1] (local build)
Copyright (C) 2002-25, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Family: Crucial/Micron Client SSDs
Device Model: CT4000MX500SSD1
Serial Number: ...
LU WWN Device Id: 5 ...
Firmware Version: M3CR046
User Capacity: 4.000.787.030.016 bytes [4,00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: Solid State Device
Form Factor: 2.5 inches
TRIM Command: Available
Device is: In smartctl database 7.5/6083
ATA Version is: ACS-3 T13/2161-D revision 5
SATA Version is: SATA 3.3, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Sat Feb 7 17:46:58 2026 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x80) Offline data collection activity
was never started.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 0) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 30) minutes.
Conveyance self-test routine
recommended polling time: ( 2) minutes.
SCT capabilities: (0x0031) SCT Status supported.
SCT Feature Control supported.
SCT Data Table supported.
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 100 100 000 Pre-fail Always - 0
5 Reallocate_NAND_Blk_Cnt 0x0032 100 100 010 Old_age Always - 0
9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 3308
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 183
171 Program_Fail_Count 0x0032 100 100 000 Old_age Always - 0
172 Erase_Fail_Count 0x0032 100 100 000 Old_age Always - 0
173 Ave_Block-Erase_Count 0x0032 100 100 000 Old_age Always - 2
174 Unexpect_Power_Loss_Ct 0x0032 100 100 000 Old_age Always - 156
180 Unused_Reserve_NAND_Blk 0x0033 000 000 000 Pre-fail Always - 273
183 SATA_Interfac_Downshift 0x0032 100 100 000 Old_age Always - 0
184 Error_Correction_Count 0x0032 100 100 000 Old_age Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
194 Temperature_Celsius 0x0022 070 046 000 Old_age Always - 30 (Min/Max 0/54)
196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 0
197 Current_Pending_ECC_Cnt 0x0032 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 100 100 000 Old_age Always - 0
202 Percent_Lifetime_Remain 0x0030 100 100 001 Old_age Offline - 0
206 Write_Error_Rate 0x000e 100 100 000 Old_age Always - 0
210 Success_RAIN_Recov_Cnt 0x0032 100 100 000 Old_age Always - 0
246 Total_LBAs_Written 0x0032 100 100 000 Old_age Always - 9517752593
247 Host_Program_Page_Count 0x0032 100 100 000 Old_age Always - 76567085
248 FTL_Program_Page_Count 0x0032 100 100 000 Old_age Always - 71469014
SMART Error Log Version: 1
No Errors Logged
SMART Self-test log structure revision number 1
No self-tests have been logged. [To run self-tests, use: smartctl -t]
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Completed [00% left] (0-65535)
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
The above only provides legacy SMART information - try 'smartctl -x' for more
sudo smartctl -a /dev/sda1
smartctl 7.5 2025-04-30 r5714 [x86_64-linux-6.18.7-arch1-1] (local build)
Copyright (C) 2002-25, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Family: Crucial/Micron Client SSDs
Device Model: CT4000MX500SSD1
Serial Number: ...
LU WWN Device Id: 5 ...
Firmware Version: M3CR046
User Capacity: 4.000.787.030.016 bytes [4,00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: Solid State Device
Form Factor: 2.5 inches
TRIM Command: Available
Device is: In smartctl database 7.5/6083
ATA Version is: ACS-3 T13/2161-D revision 5
SATA Version is: SATA 3.3, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Sat Feb 7 17:49:08 2026 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x80) Offline data collection activity
was never started.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 0) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 30) minutes.
Conveyance self-test routine
recommended polling time: ( 2) minutes.
SCT capabilities: (0x0031) SCT Status supported.
SCT Feature Control supported.
SCT Data Table supported.
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 100 100 000 Pre-fail Always - 0
5 Reallocate_NAND_Blk_Cnt 0x0032 100 100 010 Old_age Always - 0
9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 3306
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 184
171 Program_Fail_Count 0x0032 100 100 000 Old_age Always - 0
172 Erase_Fail_Count 0x0032 100 100 000 Old_age Always - 0
173 Ave_Block-Erase_Count 0x0032 100 100 000 Old_age Always - 3
174 Unexpect_Power_Loss_Ct 0x0032 100 100 000 Old_age Always - 156
180 Unused_Reserve_NAND_Blk 0x0033 000 000 000 Pre-fail Always - 259
183 SATA_Interfac_Downshift 0x0032 100 100 000 Old_age Always - 0
184 Error_Correction_Count 0x0032 100 100 000 Old_age Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
194 Temperature_Celsius 0x0022 073 050 000 Old_age Always - 27 (Min/Max 0/50)
196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 0
197 Current_Pending_ECC_Cnt 0x0032 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 100 100 000 Old_age Always - 0
202 Percent_Lifetime_Remain 0x0030 100 100 001 Old_age Offline - 0
206 Write_Error_Rate 0x000e 100 100 000 Old_age Always - 0
210 Success_RAIN_Recov_Cnt 0x0032 100 100 000 Old_age Always - 0
246 Total_LBAs_Written 0x0032 100 100 000 Old_age Always - 12194506750
247 Host_Program_Page_Count 0x0032 100 100 000 Old_age Always - 97695686
248 FTL_Program_Page_Count 0x0032 100 100 000 Old_age Always - 92699913
SMART Error Log Version: 1
No Errors Logged
SMART Self-test log structure revision number 1
No self-tests have been logged. [To run self-tests, use: smartctl -t]
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Completed [00% left] (0-65535)
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
The above only provides legacy SMART information - try 'smartctl -x' for more
sudo smartctl -a /dev/sdc1
smartctl 7.5 2025-04-30 r5714 [x86_64-linux-6.18.7-arch1-1] (local build)
Copyright (C) 2002-25, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Family: Crucial/Micron Client SSDs
Device Model: CT4000MX500SSD1
Serial Number: ...
LU WWN Device Id: 5 ...
Firmware Version: M3CR046
User Capacity: 4.000.787.030.016 bytes [4,00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: Solid State Device
Form Factor: 2.5 inches
TRIM Command: Available
Device is: In smartctl database 7.5/6083
ATA Version is: ACS-3 T13/2161-D revision 5
SATA Version is: SATA 3.3, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Sat Feb 7 17:51:17 2026 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x80) Offline data collection activity
was never started.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 0) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 30) minutes.
Conveyance self-test routine
recommended polling time: ( 2) minutes.
SCT capabilities: (0x0031) SCT Status supported.
SCT Feature Control supported.
SCT Data Table supported.
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 100 100 000 Pre-fail Always - 0
5 Reallocate_NAND_Blk_Cnt 0x0032 100 100 010 Old_age Always - 0
9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 3289
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 184
171 Program_Fail_Count 0x0032 100 100 000 Old_age Always - 0
172 Erase_Fail_Count 0x0032 100 100 000 Old_age Always - 0
173 Ave_Block-Erase_Count 0x0032 100 100 000 Old_age Always - 3
174 Unexpect_Power_Loss_Ct 0x0032 100 100 000 Old_age Always - 156
180 Unused_Reserve_NAND_Blk 0x0033 000 000 000 Pre-fail Always - 276
183 SATA_Interfac_Downshift 0x0032 100 100 000 Old_age Always - 0
184 Error_Correction_Count 0x0032 100 100 000 Old_age Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
194 Temperature_Celsius 0x0022 074 047 000 Old_age Always - 26 (Min/Max 0/53)
196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 0
197 Current_Pending_ECC_Cnt 0x0032 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 100 100 000 Old_age Always - 0
202 Percent_Lifetime_Remain 0x0030 100 100 001 Old_age Offline - 0
206 Write_Error_Rate 0x000e 100 100 000 Old_age Always - 0
210 Success_RAIN_Recov_Cnt 0x0032 100 100 000 Old_age Always - 0
246 Total_LBAs_Written 0x0032 100 100 000 Old_age Always - 12195276537
247 Host_Program_Page_Count 0x0032 100 100 000 Old_age Always - 97698924
248 FTL_Program_Page_Count 0x0032 100 100 000 Old_age Always - 94815348
SMART Error Log Version: 1
No Errors Logged
SMART Self-test log structure revision number 1
No self-tests have been logged. [To run self-tests, use: smartctl -t]
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Completed [00% left] (0-65535)
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
The above only provides legacy SMART information - try 'smartctl -x' for more
Offline
mdadm --examine for all? dmesg of failed assembly (syslogs of failure event if you have them, but I guess they're on the raid)?
Offline
mdadm --examine for all?
sudo mdadm --examine
mdadm: No devices to examine
dmesg of failed assembly
I haven't tried any recovery or anything else yet.
sudo dmesg | grep "md127"
[ 3.863271] md/raid:md127: device sde1 operational as raid disk 4
[ 3.863274] md/raid:md127: device sda1 operational as raid disk 1
[ 3.863274] md/raid:md127: device sdd1 operational as raid disk 3
[ 3.863275] md/raid:md127: device sdb1 operational as raid disk 0
[ 3.863275] md/raid:md127: device sdf1 operational as raid disk 5
[ 3.863981] md/raid:md127: not enough operational devices (3/8 failed)
[ 3.888515] md/raid:md127: failed to run raid set.
[ 3.888522] md127: ADD_NEW_DISK not supported
[ 4.086511] md127: ADD_NEW_DISK not supported
(syslogs of failure event if you have them, but I guess they're on the raid)?
The raid is not a boot/system drive.
Offline
sudo mdadm --examine /dev/sd{a,c}1
Also, the full output of
sudo journalctl -b
would be useful. It does seem like a hardware issue of some sort.
BTW, I think a journal device would be a good idea with this setup. If you crash for any reason, btrfs metadata might get corrupted due to the write hole.
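If I read the man page right, a write journal is normally set up at array creation with --write-journal, but mdadm also has an --add-journal option; check man mdadm for the exact constraints. A rough sketch (device name hypothetical): sudo mdadm /dev/md127 --add-journal /dev/sdX1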
Offline
In case you need to rebuild from scratch, you might consider just using ZFS RAIDz2, which is rock solid, with all the data integrity goodies. I use BTRFS RAID for Single/0/1, and ZFS for 5/6 (RAIDz). I don't really use mdadm anymore.
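A rough sketch of what that looks like (pool name, ashift, and device list are just examples to adapt): sudo zpool create -o ashift=12 tank raidz2 /dev/sd[a-h]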
Offline
although I agree with topcat in recommending zfs over md, there has to be a reason why a system that's pretty much a cold vault suddenly comes up with multiple drives failing at once
to me this smells like an issue with the power supply: if this is really all the hardware that's in that system, the psu is way overkill
even maxing out with synthetic stress tests you'll hardly get over 200 W total - if you even get that high
it's likely that the psu is somewhat unstable when only hit with 10% of its rated power - and that unstable power can cause issues with all components
modern switch-mode psus are designed for a load between 20% and 80%
my guess at what happened here: due to the over-spec the psu runs unstable at low power, which caused the multi-failure - either throw in a 200 W gpu and keep it, along with the cpu, at a constant load to keep the psu above 20%, or scale down to a psu of about 500 W or even smaller for better utilization and thereby stable power output
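rough numbers (my estimates, not measured): the E-2388G is a 95 W TDP part and an MX500 draws maybe 4-5 W when active - so even flat out you're somewhere around 150 W, well under 20% of a 1000 W unit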
yes, power issues are often overlooked - and the root cause here is something not even zfs could have helped with
also: don't rely on a single external drive for your backup - get at least a second one!
and ditch btrfs!
Online
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 183
174 Unexpect_Power_Loss_Ct 0x0032 100 100 000 Old_age Always - 156
I think these two values indicate something. This shows that 85% of the shutdowns were unexpected.
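(That is, 156 unexpected power losses out of 183 power cycles ≈ 0.85.)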
Will the value increase even when using the motherboard's SATA ports?
Offline
This shows that 85% of the shutdowns were unexpected.
agreed - this, combined with "only comes up twice a month for 10 minutes", sounds like "yup, and when I'm done I just kill the power instead of properly shutting down the system"
something doesn't add up here ...
Online
Before the SSDs, HDDs were used here.
I kept the power supply.
I always shut down the system with sudo shutdown -h now,
and I can't recall a power outage that could have damaged the RAID volume.
It's quite possible that the power supply is oversized,
and that's why the problems are occurring. I never paid attention to that.
I'm going to install a be quiet! System Power 11 450 W power supply instead.
I also considered using ZFS back then.
But I didn't because, as far as I understand,
everything becomes more complicated. For example,
updating with sudo pacman -Syu is no longer straightforward,
since you have to manually adjust the kernel and other things
with every update.
Is that still the case?
sudo mdadm --examine /dev/sd{a,b,c,d,e,f,g,h}1
/dev/sda1:
Magic : a92b4efc
Version : 1.2
Feature Map : 0x1
Array UUID : ...
Name : fileserver:RAID6 (local to host fileserver)
Creation Time : Sat Sep 30 18:19:57 2023
Raid Level : raid6
Raid Devices : 8
Avail Dev Size : 7813507072 sectors (3.64 TiB 4.00 TB)
Array Size : 23440521216 KiB (21.83 TiB 24.00 TB)
Data Offset : 264192 sectors
Super Offset : 8 sectors
Unused Space : before=264112 sectors, after=0 sectors
State : clean
Device UUID : ...
Internal Bitmap : 8 sectors from superblock
Update Time : Sat Feb 7 13:38:02 2026
Bad Block Log : 512 entries available at offset 24 sectors
Checksum : 8db59ff6 - correct
Events : 4275
Layout : left-symmetric
Chunk Size : 512K
Device Role : Active device 1
Array State : AAAAAA.. ('A' == active, '.' == missing, 'R' == replacing)
/dev/sdb1:
Magic : a92b4efc
Version : 1.2
Feature Map : 0x1
Array UUID : ...
Name : fileserver:RAID6 (local to host fileserver)
Creation Time : Sat Sep 30 18:19:57 2023
Raid Level : raid6
Raid Devices : 8
Avail Dev Size : 7813507072 sectors (3.64 TiB 4.00 TB)
Array Size : 23440521216 KiB (21.83 TiB 24.00 TB)
Data Offset : 264192 sectors
Super Offset : 8 sectors
Unused Space : before=264112 sectors, after=0 sectors
State : clean
Device UUID : ...
Internal Bitmap : 8 sectors from superblock
Update Time : Sat Feb 7 13:38:02 2026
Bad Block Log : 512 entries available at offset 24 sectors
Checksum : 92128170 - correct
Events : 4275
Layout : left-symmetric
Chunk Size : 512K
Device Role : Active device 0
Array State : AAAAAA.. ('A' == active, '.' == missing, 'R' == replacing)
/dev/sdc1:
Magic : a92b4efc
Version : 1.2
Feature Map : 0x1
Array UUID : ...
Name : fileserver:RAID6 (local to host fileserver)
Creation Time : Sat Sep 30 18:19:57 2023
Raid Level : raid6
Raid Devices : 8
Avail Dev Size : 7813507072 sectors (3.64 TiB 4.00 TB)
Array Size : 23440521216 KiB (21.83 TiB 24.00 TB)
Data Offset : 264192 sectors
Super Offset : 8 sectors
Unused Space : before=264112 sectors, after=0 sectors
State : clean
Device UUID : ...
Internal Bitmap : 8 sectors from superblock
Update Time : Sat Feb 7 13:38:02 2026
Bad Block Log : 512 entries available at offset 24 sectors
Checksum : 279a1a0b - correct
Events : 4275
Layout : left-symmetric
Chunk Size : 512K
Device Role : Active device 2
Array State : AAAAAA.. ('A' == active, '.' == missing, 'R' == replacing)
/dev/sdd1:
Magic : a92b4efc
Version : 1.2
Feature Map : 0x1
Array UUID : ...
Name : fileserver:RAID6 (local to host fileserver)
Creation Time : Sat Sep 30 18:19:57 2023
Raid Level : raid6
Raid Devices : 8
Avail Dev Size : 7813507072 sectors (3.64 TiB 4.00 TB)
Array Size : 23440521216 KiB (21.83 TiB 24.00 TB)
Data Offset : 264192 sectors
Super Offset : 8 sectors
Unused Space : before=264112 sectors, after=0 sectors
State : clean
Device UUID : ...
Internal Bitmap : 8 sectors from superblock
Update Time : Sat Feb 7 13:38:02 2026
Bad Block Log : 512 entries available at offset 24 sectors
Checksum : 83782fe7 - correct
Events : 4275
Layout : left-symmetric
Chunk Size : 512K
Device Role : Active device 3
Array State : AAAAAA.. ('A' == active, '.' == missing, 'R' == replacing)
/dev/sde1:
Magic : a92b4efc
Version : 1.2
Feature Map : 0x1
Array UUID : ...
Name : fileserver:RAID6 (local to host fileserver)
Creation Time : Sat Sep 30 18:19:57 2023
Raid Level : raid6
Raid Devices : 8
Avail Dev Size : 7813507072 sectors (3.64 TiB 4.00 TB)
Array Size : 23440521216 KiB (21.83 TiB 24.00 TB)
Data Offset : 264192 sectors
Super Offset : 8 sectors
Unused Space : before=264112 sectors, after=0 sectors
State : clean
Device UUID : ...
Internal Bitmap : 8 sectors from superblock
Update Time : Sat Feb 7 13:38:02 2026
Bad Block Log : 512 entries available at offset 24 sectors
Checksum : 6b107ef0 - correct
Events : 4275
Layout : left-symmetric
Chunk Size : 512K
Device Role : Active device 4
Array State : AAAAAA.. ('A' == active, '.' == missing, 'R' == replacing)
/dev/sdf1:
Magic : a92b4efc
Version : 1.2
Feature Map : 0x1
Array UUID : ...
Name : fileserver:RAID6 (local to host fileserver)
Creation Time : Sat Sep 30 18:19:57 2023
Raid Level : raid6
Raid Devices : 8
Avail Dev Size : 7813507072 sectors (3.64 TiB 4.00 TB)
Array Size : 23440521216 KiB (21.83 TiB 24.00 TB)
Data Offset : 264192 sectors
Super Offset : 8 sectors
Unused Space : before=264112 sectors, after=0 sectors
State : clean
Device UUID : ...
Internal Bitmap : 8 sectors from superblock
Update Time : Sat Feb 7 13:38:02 2026
Bad Block Log : 512 entries available at offset 24 sectors
Checksum : 748acddf - correct
Events : 4275
Layout : left-symmetric
Chunk Size : 512K
Device Role : Active device 5
Array State : AAAAAA.. ('A' == active, '.' == missing, 'R' == replacing)
/dev/sdg1:
Magic : a92b4efc
Version : 1.2
Feature Map : 0x1
Array UUID : ...
Name : fileserver:RAID6 (local to host fileserver)
Creation Time : Sat Sep 30 18:19:57 2023
Raid Level : raid6
Raid Devices : 8
Avail Dev Size : 7813507072 sectors (3.64 TiB 4.00 TB)
Array Size : 23440521216 KiB (21.83 TiB 24.00 TB)
Data Offset : 264192 sectors
Super Offset : 8 sectors
Unused Space : before=264112 sectors, after=0 sectors
State : clean
Device UUID : ...
Internal Bitmap : 8 sectors from superblock
Update Time : Sat Feb 7 12:43:02 2026
Bad Block Log : 512 entries available at offset 24 sectors
Checksum : 5965089 - correct
Events : 4274
Layout : left-symmetric
Chunk Size : 512K
Device Role : Active device 6
Array State : AAAAAAAA ('A' == active, '.' == missing, 'R' == replacing)
/dev/sdh1:
Magic : a92b4efc
Version : 1.2
Feature Map : 0x1
Array UUID : ...
Name : fileserver:RAID6 (local to host fileserver)
Creation Time : Sat Sep 30 18:19:57 2023
Raid Level : raid6
Raid Devices : 8
Avail Dev Size : 7813507072 sectors (3.64 TiB 4.00 TB)
Array Size : 23440521216 KiB (21.83 TiB 24.00 TB)
Data Offset : 264192 sectors
Super Offset : 8 sectors
Unused Space : before=264112 sectors, after=0 sectors
State : clean
Device UUID : ...
Internal Bitmap : 8 sectors from superblock
Update Time : Sat Feb 7 12:43:02 2026
Bad Block Log : 512 entries available at offset 24 sectors
Checksum : 4604721c - correct
Events : 4274
Layout : left-symmetric
Chunk Size : 512K
Device Role : Active device 7
Array State : AAAAAAAA ('A' == active, '.' == missing, 'R' == replacing)
Last edited by samtk0225 (2026-02-10 18:18:31)
Offline
I also considered using ZFS back then.
But I didn't because, as far as I understand,
everything becomes more complicated. For example,
updating with sudo pacman -Syu is no longer straightforward,
since you have to manually adjust the kernel and other things
with every update.
Is that still the case?
have a look at: https://github.com/archzfs/archzfs
be aware: the project's gist is to support the latest kernel supported by the latest stable zfs - which usually results in "we target linux-lts"
in the past there were a few kernel updates which were safe to take with an older zfs version by patching an --experimental flag into the build or DKMS - but lately every kernel update has required a zfs update to support it - hence I dropped that again in my fork
yes, the project had some slack at some point - but currently it's driven by github actions which run once every day
you can also go my route - fork it and build your own packages - it became a bit of a meme in the repo who's first to open the PR for the next zfs update for the next kernel - but unless it lands on a weekend you can expect it to be up to date within 24 h
the official archzfs is a bit defensive (which I actually do support) so they don't ship RC builds (which I do on my fork) - but even from ZFS upstream: "the next version is released when it's done" - pretty much the same as for every arch package
TL;DR: if you're fine living on the edge with potential data loss, you can use zfs on arch on the default kernel - otherwise stick to lts
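for reference, with the archzfs repo enabled the install itself is roughly: sudo pacman -S zfs-dkms zfs-utils (package names as per the archzfs README - check there first)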
Online
Quite odd, the last two drives sdg/sdh (device role 6/7) somehow fell off the array (event count off by one, update time off, and marked failed by all other drives), but it's RAID 6, so it should still be running with only two drives missing.
Yet in assembly it claims there are 3 failed drives. Not entirely clear to me why that is: either I'm overlooking something in the examine or there's a bug... there was a similar issue fairly recently but I don't remember how that went.
What does your mdadm.conf look like? If it's overspecific (listing devices) then it could be part of the issue. Or are these drives perhaps detected late? Those ADD_NEW_DISK messages could be related but I haven't seen that before.
The drives are not even listed at all in that dmesg output. Was there anything else in dmesg? (Don't use | grep, there's a chance of missing some non-matching but related lines.) Also check your journal (if you have it) around Sat Feb 7 12:43:02 2026 what was going on there.
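Something like journalctl --since "2026-02-07 12:30" --until "2026-02-07 13:00" should narrow that down.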
Try mdadm --stop /dev/md127 then --assemble it manually using only the 6 drives that should be good - what does it say (mdadm itself and dmesg)?
If that does not work, and you think there was not much write activity around the update time, you can also try your luck with --assemble --force.
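(For the dmesg part, keeping sudo dmesg -w running in a second terminal while assembling catches the md messages live.)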
Last edited by frostschutz (2026-02-10 19:51:47)
Offline
It's quite possible that the power supply is oversized,
???
Also, according to the SMART data, not only were 85% of all power-offs sketchy, but the drives also ran ~18 h on average per power cycle - not 10 minutes.
A lot of things here don't quite add up.
Offline
It's quite possible that the power supply is oversized,
???
Also, according to the SMART data, not only were 85% of all power-offs sketchy, but the drives also ran ~18 h on average per power cycle - not 10 minutes.
A lot of things here don't quite add up.
not my words:
The computer is only booted up twice a month for about 10 minutes.
as for the psu being "oversized": as mentioned, modern switch-mode psus have a sweet spot and, similar to old linear ones, still need at least some load - and to my knowledge the 80 Plus certification only starts at around 20% load
so a system with the given inventory doesn't even come close to the lower 20% under full load - and hence it's quite possible that the psu is just unstable when the system is idle and maybe draws less than 100 W total
have a look over https://github.com/openzfs/zfs/issues - there are countless issues which turned out to be caused by some faulty or otherwise misbehaving psu - hence I think it's only fair to point out: 1000 W is WAY oversized for the given inventory - even when it was previously built with regular HDDs (in a big cluster one enables either staggered spin-up or power-up-in-standby to limit the inrush current and let the drives spin up one by one)
from personal testing, for 8 HDDs a psu of 300 W is powerful enough to survive the inrush current without giving up on switch-on - although by the maths it's quite tight
Online
not my words:
Didn't say so either
Interesting point about the overdimensioned PSU, but the outcome seems overly specific (you'd expect more CPU/RAM-related issues?) - and wouldn't that render most (Optimus) gaming systems unstable, which come with 750 W supplies but typically idle at around 10% of that?
Offline
Have you tried swapping a working drive with a failing drive, or moving a failing drive to connect directly to the mainboard, assuming it has SATA ports?
Offline
I think this is a compatibility issue, so the M/B's SATA ports should work fine.
here is the compatibility list for your device:
https://techdocs.broadcom.com/us/en/sto … icron.html
Compatibility lists are there for a reason.
https://www.broadcom.com/support/knowle … ontrollers
This is the SAS3018, not the SAS3808, but I think it still has similar limitations.
And the manual and the spec never say a word about supporting TRIM.
If it's for cold storage, it may be worth disabling TRIM for a while.
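To see what the kernel thinks about discard support, lsblk -D shows the DISC-GRAN/DISC-MAX columns; and dropping discard=async from the mount options (or remounting with nodiscard) turns it off. Mount point here is hypothetical: sudo mount -o remount,nodiscard /srv/raid6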
Offline
Quite odd, the last two drives sdg/sdh (device role 6/7) somehow fell off the array (event count off by one, update time off, and marked failed by all other drives), but it's RAID 6, so it should still be running with only two drives missing.
Yet in assembly it claims there are 3 failed drives. Not entirely clear to me why that is: either I'm overlooking something in the examine or there's a bug... there was a similar issue fairly recently but I don't remember how that went.
What does your mdadm.conf look like? If it's overspecific (listing devices) then it could be part of the issue. Or are these drives perhaps detected late? Those ADD_NEW_DISK messages could be related but I haven't seen that before.
The drives are not even listed at all in that dmesg output. Was there anything else in dmesg? (Don't use | grep, there's a chance of missing some non-matching but related lines.) Also check your journal (if you have it) around Sat Feb 7 12:43:02 2026 what was going on there.
Try mdadm --stop /dev/md127 then --assemble it manually using only the 6 drives that should be good - what does it say (mdadm itself and dmesg)?
If that does not work, and you think there was not much write activity around the update time, you can also try your luck with --assemble --force.
sudo mdadm --assemble /dev/md127
mdadm: /dev/md127 not identified in config file.
I have never edited /etc/mdadm.conf,
and I am not a Linux expert.
# mdadm configuration file
#
# mdadm will function properly without the use of a configuration file,
# but this file is useful for keeping track of arrays and member disks.
# In general, a mdadm.conf file is created, and updated, after arrays
# are created. This is the opposite behavior of /etc/raidtab which is
# created prior to array construction.
#
#
# the config file takes two types of lines:
#
# DEVICE lines specify a list of devices of where to look for
# potential member disks
#
# ARRAY lines specify information about how to identify arrays so
# so that they can be activated
#
# You can have more than one device line and use wild cards. The first
# example includes SCSI the first partition of SCSI disks /dev/sdb,
# /dev/sdc, /dev/sdd, /dev/sdj, /dev/sdk, and /dev/sdl. The second
# line looks for array slices on IDE disks.
#
#DEVICE /dev/sd[bcdjkl]1
#DEVICE /dev/hda1 /dev/hdb1
#
# If you mount devfs on /dev, then a suitable way to list all devices is:
#DEVICE /dev/discs/*/*
#
# The designation "partitions" will scan all partitions found in
# /proc/partitions
DEVICE partitions
# The AUTO line can control which arrays get assembled by auto-assembly,
# meaing either "mdadm -As" when there are no 'ARRAY' lines in this file,
# or "mdadm --incremental" when the array found is not listed in this file.
# By default, all arrays that are found are assembled.
# If you want to ignore all DDF arrays (maybe they are managed by dmraid),
# and only assemble 1.x arrays if which are marked for 'this' homehost,
# but assemble all others, then use
#AUTO -ddf homehost -1.x +all
#
# ARRAY lines specify an array to assemble and a method of identification.
# Arrays can currently be identified by using a UUID, superblock minor number,
# or a listing of devices.
#
# super-minor is usually the minor number of the metadevice
# UUID is the Universally Unique Identifier for the array
# Each can be obtained using
#
# mdadm -D <md>
#
#ARRAY /dev/md0 UUID=3aaa0122:29827cfa:5331ad66:ca767371
#ARRAY /dev/md1 super-minor=1
#ARRAY /dev/md2 devices=/dev/hda1,/dev/hdb1
#
# ARRAY lines can also specify a "spare-group" for each array. mdadm --monitor
# will then move a spare between arrays in a spare-group if one array has a failed
# drive but no spare
#ARRAY /dev/md4 uuid=b23f3c6d:aec43a9f:fd65db85:369432df spare-group=group1
#ARRAY /dev/md5 uuid=19464854:03f71b1b:e0df2edd:246cc977 spare-group=group1
#
# When used in --follow (aka --monitor) mode, mdadm needs a
# mail address and/or a program. This can be given with "mailaddr"
# and "program" lines to that monitoring can be started using
# mdadm --follow --scan & echo $! > /run/mdadm/mon.pid
# If the lines are not found, mdadm will exit quietly
#MAILADDR root@mydomain.tld
#PROGRAM /usr/sbin/handle-mdadm-events
I think I'll replace the SATA power adapters; they look cheap, and I don't like them anymore.
Offline
Try `mdadm --assemble --scan`, or `mdadm --assemble --run /dev/md127 /dev/sd[abcdef]1`, or finally `mdadm --assemble --force /dev/md127 /dev/sd[abcdefgh]1`.
Needs `mdadm --stop /dev/md127` before each attempt (if it was previously assembled) since you can't assemble an array that's already listed in /proc/mdstat even if it's not running.
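If you want a plain `mdadm --assemble /dev/md127` (no device list) to work, you can append the ARRAY line mdadm derives from the superblocks, e.g. `sudo mdadm --examine --scan | sudo tee -a /etc/mdadm.conf` - but that's cosmetic; auto-assembly doesn't need it.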
I think I'll replace the SATA power adapters
If you KNOW there are hardware issues, you should fix those first, sure. It's impossible for me to say remotely with nothing to go on, so not my place to speculate randomly.
However, it won't change the fact that the metadata has already recorded these drives as failed, so changing cables or whatever, by itself, most likely won't fix things. I'm still curious why assembly fails for you, when it should work with two drives missing, but it seems more like a software issue for now. Also curious how this failure originally played out, but without logs, nobody knows. md does not log these things in its metadata despite >100M of unused data offset space... did you check journalctl?
Last edited by frostschutz (2026-02-11 15:28:18)
Offline
did you check journalctl?
Please post the kernel messages for the boot where the assembly failed.
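For example, journalctl -k -b -1 shows the kernel messages of the previous boot (adjust the boot offset as needed).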
Offline
not my words:
Didn't say so either
Interesting point about the overdimensioned PSU, but the outcome seems overly specific (you'd expect more CPU/RAM-related issues?) - and wouldn't that render most (Optimus) gaming systems unstable, which come with 750 W supplies but typically idle at around 10% of that?
as for unstable power: from the issues I read, it seems to affect both "spinning rust" and solid state equally - both are affected by unclean power
solutions range from using an uninterruptible power supply, over replacing the psu, to just switching connectors around (hindsight: as the OP now mentions "cheap looking" sata power adapters, that could be a lead)
I'm not an (electrical) engineer - but take a look at the crippled design of the 12VHPWR connector: just the connector sitting at a slightly off angle can cause bad connections, raising temps so high that entire computers went up in flames
for hard drive power connectors I suspect similar issues: loss of spring tension inside the connector causes bad contact with high resistance, which causes trouble during power spikes, as the capacitors on a drive can only hold so much charge
so, yea, for whatever reason power issues seem to rank high among causes of data issues in an array
Online