
#1 2024-05-07 11:21:40

sp4ke
Member
Registered: 2016-05-14
Posts: 7

raid1 extreme random slowdowns - crucial bx500

Hi,

I migrated my single-disk ext4 filesystem to an md RAID1 array (ThinkPad W530) using two Crucial BX500 2TB SSDs.

Since the migration I have been experiencing extreme random slowdowns every few hours. They seem to be related to disk activity. When they start, neither the kernel messages nor journalctl show any errors.

I use dwm. When a slowdown starts the UI does not freeze, but anything that touches the disk becomes very slow, including web browsing; an `ls` command takes 2-5 seconds to execute. The slowdown lasts 10-15 minutes, then everything is back to normal.

mdadm -D shows no issues, and neither does smartctl on the disks. I tried multiple I/O schedulers (none, bfq, mq-deadline) as well as different process schedulers and kernels. I run linux-ck, but the problems started on the stable kernel.

I can still return the disks, and I was wondering whether it is worth trying the RAID under Btrfs or whether I should just switch to another brand.

What other logs or benchmarks can I use to troubleshoot this issue?
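One thing I could run myself in the meantime is a small latency probe, so that each slowdown gets a timestamp I can compare against the logs. The sketch below is only an illustration: the function names, the 4 KiB payload and the 0.5 s threshold are my own assumptions, not taken from any tool. It creates a small file in a directory on the array, fsyncs it, and records how long that took.

```python
import os
import tempfile
import time


def timed_fsync_write(directory, size=4096):
    """Create a temp file in `directory`, write `size` bytes, fsync it,
    and return the elapsed time in seconds (file creation included)."""
    payload = os.urandom(size)
    start = time.monotonic()
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        os.write(fd, payload)
        os.fsync(fd)  # force the data out to the underlying device(s)
    finally:
        os.close(fd)
        os.unlink(path)
    return time.monotonic() - start


def probe(directory, samples=5, interval=1.0, threshold=0.5):
    """Take `samples` measurements, `interval` seconds apart.

    Returns a list of (wall_clock_time, elapsed_seconds, is_slow) tuples,
    where is_slow marks writes slower than `threshold` seconds
    (an arbitrary cut-off chosen for this sketch)."""
    results = []
    for _ in range(samples):
        elapsed = timed_fsync_write(directory)
        results.append((time.time(), elapsed, elapsed > threshold))
        time.sleep(interval)
    return results
```

Calling `probe()` on a directory that lives on /dev/md0 with a large `samples` value and printing the entries flagged as slow should show when a slowdown begins and ends, and whether small synced writes are affected at all.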

Update: I am getting an average write speed of 13 MB/s in KDiskMark (hBa8srN.png)

Some hardware info:
Linux w530 6.8.1-1-ck-generic-v2 #1 SMP PREEMPT_DYNAMIC
/dev/md0:
           Version : 1.2
     Creation Time : Sat Mar 23 22:43:32 2024
        Raid Level : raid1
        Array Size : 1953380352 (1862.89 GiB 2000.26 GB)
     Used Dev Size : 1953380352 (1862.89 GiB 2000.26 GB)
      Raid Devices : 2
     Total Devices : 2
       Persistence : Superblock is persistent

     Intent Bitmap : Internal

       Update Time : Tue May  7 13:18:46 2024
             State : active
    Active Devices : 2
   Working Devices : 2
    Failed Devices : 0
     Spare Devices : 0

Consistency Policy : bitmap

              Name : any:0
              UUID : 69cc98c9:1344ebc1:dddfc34d:1fdccb8c
            Events : 44519

    Number   Major   Minor   RaidDevice State
       0       8        2        0      active sync   /dev/sda2
       1       8       18        1      active sync   /dev/sdb2

smartctl disk1 :

Model Family:     Crucial/Micron Client SSDs
Device Model:     CT2000BX500SSD1
Serial Number:    2342E8816905
LU WWN Device Id: 5 00a075 1e8816905
Firmware Version: M6CR061
User Capacity:    2,000,398,934,016 bytes [2.00 TB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
Form Factor:      2.5 inches
TRIM Command:     Available
Device is:        In smartctl database 7.3/5528
ATA Version is:   ACS-3 T13/2161-D revision 4
SATA Version is:  SATA 3.3, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Tue May  7 13:17:20 2024 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00)	Offline data collection activity
					was never started.
					Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0)	The previous self-test routine completed
					without error or no self-test has ever
					been run.
Total time to complete Offline
data collection: 		(  120) seconds.
Offline data collection
capabilities: 			 (0x11) SMART execute Offline immediate.
					No Auto Offline data collection support.
					Suspend Offline collection upon new
					command.
					No Offline surface scan supported.
					Self-test supported.
					No Conveyance Self-test supported.
					No Selective Self-test supported.
SMART capabilities:            (0x0002)	Does not save SMART data before
					entering power-saving mode.
					Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine
recommended polling time: 	 (   2) minutes.
Extended self-test routine
recommended polling time: 	 (  10) minutes.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   100   100   000    Pre-fail  Always       -       0
  5 Reallocate_NAND_Blk_Cnt 0x0032   100   100   010    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       1193
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       74
171 Program_Fail_Count      0x0032   100   100   000    Old_age   Always       -       0
172 Erase_Fail_Count        0x0032   100   100   000    Old_age   Always       -       0
173 Ave_Block-Erase_Count   0x0032   099   099   000    Old_age   Always       -       15
174 Unexpect_Power_Loss_Ct  0x0032   100   100   000    Old_age   Always       -       23
180 Unused_Reserve_NAND_Blk 0x0033   100   100   000    Pre-fail  Always       -       53
183 SATA_Interfac_Downshift 0x0032   100   100   000    Old_age   Always       -       0
184 Error_Correction_Count  0x0032   100   100   000    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
194 Temperature_Celsius     0x0022   062   053   000    Old_age   Always       -       38 (Min/Max 29/47)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_ECC_Cnt 0x0032   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   100   100   000    Old_age   Always       -       0
202 Percent_Lifetime_Remain 0x0030   099   099   001    Old_age   Offline      -       1
206 Write_Error_Rate        0x000e   100   100   000    Old_age   Always       -       0
210 Success_RAIN_Recov_Cnt  0x0032   100   100   000    Old_age   Always       -       0
246 Total_LBAs_Written      0x0032   100   100   000    Old_age   Always       -       12620413040
247 Host_Program_Page_Count 0x0032   100   100   000    Old_age   Always       -       394387907
248 FTL_Program_Page_Count  0x0032   100   100   000    Old_age   Always       -       527480832
249 Unkn_CrucialMicron_Attr 0x0032   100   100   000    Old_age   Always       -       0
251 Unkn_CrucialMicron_Attr 0x0032   100   100   000    Old_age   Always       -       2986038327
252 Unkn_CrucialMicron_Attr 0x0032   100   100   000    Old_age   Always       -       3
253 Unkn_CrucialMicron_Attr 0x0032   100   100   000    Old_age   Always       -       0

SMART Error Log not supported

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%      1023         -

Selective Self-tests/Logging not supported

The above only provides legacy SMART information - try 'smartctl -x' for more

smartctl disk2 :

Model Family:     Crucial/Micron Client SSDs
Device Model:     CT2000BX500SSD1
Serial Number:    2342E881581E
LU WWN Device Id: 5 00a075 1e881581e
Firmware Version: M6CR061
User Capacity:    2,000,398,934,016 bytes [2.00 TB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
Form Factor:      2.5 inches
TRIM Command:     Available
Device is:        In smartctl database 7.3/5528
ATA Version is:   ACS-3 T13/2161-D revision 4
SATA Version is:  SATA 3.3, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Tue May  7 13:18:05 2024 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00)	Offline data collection activity
					was never started.
					Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0)	The previous self-test routine completed
					without error or no self-test has ever
					been run.
Total time to complete Offline
data collection: 		(  120) seconds.
Offline data collection
capabilities: 			 (0x11) SMART execute Offline immediate.
					No Auto Offline data collection support.
					Suspend Offline collection upon new
					command.
					No Offline surface scan supported.
					Self-test supported.
					No Conveyance Self-test supported.
					No Selective Self-test supported.
SMART capabilities:            (0x0002)	Does not save SMART data before
					entering power-saving mode.
					Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine
recommended polling time: 	 (   2) minutes.
Extended self-test routine
recommended polling time: 	 (  10) minutes.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   100   100   000    Pre-fail  Always       -       0
  5 Reallocate_NAND_Blk_Cnt 0x0032   100   100   010    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       1071
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       72
171 Program_Fail_Count      0x0032   100   100   000    Old_age   Always       -       0
172 Erase_Fail_Count        0x0032   100   100   000    Old_age   Always       -       0
173 Ave_Block-Erase_Count   0x0032   099   099   000    Old_age   Always       -       9
174 Unexpect_Power_Loss_Ct  0x0032   100   100   000    Old_age   Always       -       20
180 Unused_Reserve_NAND_Blk 0x0033   100   100   000    Pre-fail  Always       -       43
183 SATA_Interfac_Downshift 0x0032   100   100   000    Old_age   Always       -       0
184 Error_Correction_Count  0x0032   100   100   000    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
194 Temperature_Celsius     0x0022   065   044   000    Old_age   Always       -       35 (Min/Max 27/56)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_ECC_Cnt 0x0032   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   100   100   000    Old_age   Always       -       0
202 Percent_Lifetime_Remain 0x0030   099   099   001    Old_age   Offline      -       1
206 Write_Error_Rate        0x000e   100   100   000    Old_age   Always       -       0
210 Success_RAIN_Recov_Cnt  0x0032   100   100   000    Old_age   Always       -       0
246 Total_LBAs_Written      0x0032   100   100   000    Old_age   Always       -       6473132044
247 Host_Program_Page_Count 0x0032   100   100   000    Old_age   Always       -       202285376
248 FTL_Program_Page_Count  0x0032   100   100   000    Old_age   Always       -       899848192
249 Unkn_CrucialMicron_Attr 0x0032   100   100   000    Old_age   Always       -       0
251 Unkn_CrucialMicron_Attr 0x0032   100   100   000    Old_age   Always       -       2768195272
252 Unkn_CrucialMicron_Attr 0x0032   100   100   000    Old_age   Always       -       5
253 Unkn_CrucialMicron_Attr 0x0032   100   100   000    Old_age   Always       -       0

SMART Error Log not supported

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%       902         -

Selective Self-tests/Logging not supported

The above only provides legacy SMART information - try 'smartctl -x' for more
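To put the raw counters from both reports side by side, here is a rough calculation. It rests on two assumptions that I have not confirmed for this firmware: that Total_LBAs_Written (246) counts 512-byte logical sectors, and that write amplification can be estimated as (attribute 247 + attribute 248) / attribute 247, the formula commonly quoted for Crucial/Micron drives. The labels "disk1" and "disk2" simply follow the order of the reports above.

```python
SECTOR_BYTES = 512  # logical sector size reported by smartctl above

# Raw values copied from the two SMART reports in this post.
disks = {
    "disk1": {"lbas_written": 12620413040,
              "host_pages": 394387907,
              "ftl_pages": 527480832},
    "disk2": {"lbas_written": 6473132044,
              "host_pages": 202285376,
              "ftl_pages": 899848192},
}


def write_amplification(host_pages, ftl_pages):
    """Estimated write amplification: all NAND page programs (host + FTL)
    divided by the page programs caused directly by host writes."""
    return (host_pages + ftl_pages) / host_pages


def host_terabytes_written(lbas):
    """Host writes in decimal terabytes, assuming 512-byte LBAs."""
    return lbas * SECTOR_BYTES / 1e12


for name, d in disks.items():
    waf = write_amplification(d["host_pages"], d["ftl_pages"])
    tb = host_terabytes_written(d["lbas_written"])
    print(f"{name}: {tb:.2f} TB written by host, WAF ~ {waf:.1f}")
```

Under those assumptions, disk1 comes out at roughly 6.5 TB of host writes with an amplification factor of about 2.3, and disk2 at roughly 3.3 TB with a factor of about 5.4. I do not know whether those numbers are unusual for this model.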

Last edited by sp4ke (2024-05-07 15:31:38)
