[Repaired] Bad blocks cause kernel blocking to the device

b6fan · 2010-05-10 02:26:57

I got

May 10 10:08:13 qslap kernel: sd 4:0:0:0: [sdb] Unhandled sense code
May 10 10:08:13 qslap kernel: sd 4:0:0:0: [sdb] Result: hostbyte=0x00 driverbyte=0x08
May 10 10:08:13 qslap kernel: sd 4:0:0:0: [sdb] Sense Key : 0x3 [current]
May 10 10:08:13 qslap kernel: sd 4:0:0:0: [sdb] ASC=0x14 ASCQ=0x0
May 10 10:08:13 qslap kernel: sd 4:0:0:0: [sdb] CDB: cdb[0]=0x28: 28 00 25 42 ea af 00 00 01 00
May 10 10:08:13 qslap kernel: end_request: I/O error, dev sdb, sector 625142447
May 10 10:08:13 qslap kernel: Buffer I/O error on device sdb, logical block 78142805

in the system log whenever I access /dev/sdb in some way (for example, plugging it in, or running fdisk or gparted, but not palimpsest).

These messages repeat several times, and all access to the device is blocked for tens of seconds (the kernel seems to keep retrying rather than giving up after the first failure), which is annoying.
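
As a cross-check on the log above, the CDB bytes can be decoded by hand. Opcode 0x28 is READ(10), whose bytes 2-5 carry the big-endian LBA and bytes 7-8 the transfer length. The sketch below (the function name decode_read10 is only illustrative) recovers the sector number from the logged CDB. The final division assumes the buffer layer counts in 4096-byte blocks (8 sectors of 512 bytes); that is an inference from the numbers, not something the log states.

```python
def decode_read10(cdb_hex: str) -> dict:
    """Decode a SCSI READ(10) CDB given as space-separated hex bytes."""
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 10 or cdb[0] != 0x28:
        raise ValueError("not a 10-byte READ(10) CDB")
    return {
        # Bytes 2..5: logical block address, big-endian.
        "lba": int.from_bytes(cdb[2:6], "big"),
        # Bytes 7..8: number of blocks to transfer, big-endian.
        "blocks": int.from_bytes(cdb[7:9], "big"),
    }


cdb = decode_read10("28 00 25 42 ea af 00 00 01 00")
print(cdb)              # {'lba': 625142447, 'blocks': 1}

# Assumption: the "logical block" in the Buffer I/O error line is counted
# in 4096-byte units, i.e. 8 sectors of 512 bytes each.
print(cdb["lba"] // 8)  # 78142805
```

Both results match the "sector 625142447" and "logical block 78142805" lines in the log, so all the messages point at the same single sector.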

From palimpsest, I can see:
Current Pending Sector Count: Value: 1 sector
Uncorrectable Sector Count: Value: 1 sector

It says that when a write fails, a "Current Pending Sector" will be remapped automatically by the hardware.

The sector size is 512 bytes:

# fdisk -lu /dev/sdb

Disk /dev/sdb: 320.1 GB, 320072933376 bytes
255 heads, 63 sectors/track, 38913 cylinders, total 625142448 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0xaaaaaaaa

Disk /dev/sdb doesn't contain a valid partition table
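
Putting the fdisk figures next to the kernel log, the failing sector 625142447 appears to be the very last sector of the disk (total sectors minus one). A small sketch of that arithmetic, with constant and function names chosen only for illustration:

```python
SECTOR_SIZE = 512             # bytes, from the fdisk output
DISK_BYTES = 320072933376     # from the fdisk output
TOTAL_SECTORS = 625142448     # from the fdisk output
BAD_SECTOR = 625142447        # from the kernel log


def is_last_sector(sector: int, total_sectors: int) -> bool:
    """Sectors are numbered from 0, so the last one is total - 1."""
    return sector == total_sectors - 1


def byte_offset(sector: int, sector_size: int = SECTOR_SIZE) -> int:
    """Byte offset of the start of a sector on the raw device."""
    return sector * sector_size


print(DISK_BYTES // SECTOR_SIZE == TOTAL_SECTORS)  # True
print(is_last_sector(BAD_SECTOR, TOTAL_SECTORS))   # True
print(byte_offset(BAD_SECTOR))                     # 320072932864
```

If that reading is right, anything that probes the end of the disk (for example, a partition scan looking for a backup GPT header, which lives in the last LBA) would touch this sector without any explicit I/O from the user. That is a guess, not something verified here.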

badblocks detects the bad sector well:

# badblocks -svw -b 512 /dev/sdb 625142447 625142447
Checking for bad blocks in read-write mode
From block 625142447 to 625142447
Testing with pattern 0xaa: 625142447one, 0:20 elapsed
done                                
Reading and comparing: done                                
Testing with pattern 0x55: done                                
Reading and comparing: done                                
Testing with pattern 0xff: done                                
Reading and comparing: done                                
Testing with pattern 0x00: done                                
Reading and comparing: done                                
Pass completed, 1 bad blocks found.

As the output above shows, a single write to the block takes about 20 seconds because of the kernel blocking.
badblocks writes four patterns, so about 80 seconds in total.
Note: badblocks doesn't find any bad blocks when performing a full-disk read-only test.

However, the sector wasn't automatically remapped (even though badblocks has already written to it).
The kernel is still generating these log messages and blocking, which is very annoying.

I also tried writing to that sector directly, with no luck:

# dd if=/dev/zero of=/dev/sdb bs=512 count=1 seek=625142447
dd: writing `/dev/sdb': Input/output error
1+0 records in
0+0 records out
0 bytes (0 B) copied, 7.26951 s, 0.0 kB/s

What should I do to make the hardware remap that sector?
If that isn't possible because of a hardware limitation, how can I silence the log messages and stop the kernel from blocking?

Additional: I am looking for a way to make the kernel give up as soon as possible instead of blocking, or to make the drive's SMART stop marking that sector as 'Pending'. I am not looking for a way to create a filesystem with the bad blocks marked.

I know that if I pass a list of bad blocks to mkfs.*** when creating a filesystem, those blocks will not be used.
However, when I plug in the removable hard disk, BEFORE I perform ANY read/write operation, the kernel starts generating these log messages and /dev/sdb is not visible for tens of seconds. The same thing happens when I run fdisk or gparted (these programs are unresponsive for tens of seconds because of the kernel blocking) ...
I guess that SMART performs these checks automatically and causes the kernel to block, while SMART can't handle the situation well.

This is the output of smartctl -a /dev/sdb -d sat, which may be helpful:

smartctl 5.39.1 2010-01-28 r3054 [i686-pc-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Momentus 5400.5 series
Device Model:     ST9320320AS
Serial Number:    5SX3YFQ8
Firmware Version: SD03
User Capacity:    320,072,933,376 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 4
Local Time is:    Mon May 10 11:25:42 2010 CST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
See vendor-specific Attribute list for marginal Attributes.

General SMART Values:
Offline data collection status:  (0x00)    Offline data collection activity
                    was never started.
                    Auto Offline Data Collection: Disabled.
Self-test execution status:      ( 121)    The previous self-test completed having
                    the read element of the test failed.
Total time to complete Offline 
data collection:          ( 700) seconds.
Offline data collection
capabilities:              (0x73) SMART execute Offline immediate.
                    Auto Offline data collection on/off support.
                    Suspend Offline collection upon new
                    command.
                    No Offline surface scan supported.
                    Self-test supported.
                    Conveyance Self-test supported.
                    Selective Self-test supported.
SMART capabilities:            (0x0003)    Saves SMART data before entering
                    power-saving mode.
                    Supports SMART auto save timer.
Error logging capability:        (0x01)    Error logging supported.
                    General Purpose Logging supported.
Short self-test routine 
recommended polling time:      (   1) minutes.
Extended self-test routine
recommended polling time:      ( 114) minutes.
Conveyance self-test routine
recommended polling time:      (   2) minutes.
SCT capabilities:            (0x103f)    SCT Status supported.
                    SCT Feature Control supported.
                    SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   094   088   006    Pre-fail  Always       -       182650280
  3 Spin_Up_Time            0x0003   099   099   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       595
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   075   060   030    Pre-fail  Always       -       30942693
  9 Power_On_Hours          0x0032   095   095   000    Old_age   Always       -       4482
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       1
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       579
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   001   001   000    Old_age   Always       -       1812
188 Command_Timeout         0x0032   100   099   000    Old_age   Always       -       2
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   067   039   045    Old_age   Always   In_the_past 33 (0 166 39 23)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       98
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       48
193 Load_Cycle_Count        0x0032   011   011   000    Old_age   Always       -       178621
194 Temperature_Celsius     0x0022   033   061   000    Old_age   Always       -       33 (0 12 0 0)
195 Hardware_ECC_Recovered  0x001a   060   039   000    Old_age   Always       -       182650280
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       1
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       1
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
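
For anyone comparing their own drive, the attribute table above can be checked mechanically. The sketch below is a rough parser written for this layout only (parse_attributes and raw_count are made-up helper names, and the ten-column split is an assumption based on the table as pasted here); it pulls out the raw counts that matter for this problem.

```python
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
190 Airflow_Temperature_Cel 0x0022   067   039   045    Old_age   Always   In_the_past 33 (0 166 39 23)
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       1
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       1
"""


def parse_attributes(text: str) -> dict:
    """Map attribute name -> id, normalized value, threshold, raw string."""
    attrs = {}
    for line in text.splitlines():
        # Split into at most 10 columns so a raw value that contains
        # spaces, e.g. "33 (0 166 39 23)", stays in one piece.
        fields = line.split(None, 9)
        if len(fields) == 10 and fields[0].isdigit():
            attrs[fields[1]] = {
                "id": int(fields[0]),
                "value": int(fields[3]),
                "thresh": int(fields[5]),
                "raw": fields[9].strip(),
            }
    return attrs


def raw_count(attrs: dict, name: str) -> int:
    """Leading integer of the raw value column."""
    return int(attrs[name]["raw"].split()[0])


attrs = parse_attributes(SAMPLE)
pending = raw_count(attrs, "Current_Pending_Sector")
reallocated = raw_count(attrs, "Reallocated_Sector_Ct")
print(pending, reallocated)  # 1 0
```

One pending sector with zero reallocated sectors is consistent with the drive not having remapped anything so far.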

SMART Error Log Version: 1
ATA Error Count: 1979 (device log contains only the most recent five errors)
    CR = Command Register [HEX]
    FR = Features Register [HEX]
    SC = Sector Count Register [HEX]
    SN = Sector Number Register [HEX]
    CL = Cylinder Low Register [HEX]
    CH = Cylinder High Register [HEX]
    DH = Device/Head Register [HEX]
    DC = Device Command Register [HEX]
    ER = Error register [HEX]
    ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 1979 occurred at disk power-on lifetime: 4480 hours (186 days + 16 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 da 01 ff ff ff 4f 00      13:43:15.498  READ DMA EXT
  25 da 01 ff ff ff 4f 00      13:43:13.155  READ DMA EXT
  25 da 01 ff ff ff 4f 00      13:43:10.887  READ DMA EXT
  25 da 01 ff ff ff 4f 00      13:43:10.887  READ DMA EXT
  25 da 01 ff ff ff 4f 00      13:43:10.886  READ DMA EXT

Error 1978 occurred at disk power-on lifetime: 4480 hours (186 days + 16 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 da 01 ff ff ff 4f 00      13:43:13.155  READ DMA EXT
  25 da 01 ff ff ff 4f 00      13:43:10.887  READ DMA EXT
  25 da 01 ff ff ff 4f 00      13:43:10.887  READ DMA EXT
  25 da 01 ff ff ff 4f 00      13:43:10.886  READ DMA EXT
  25 da 01 ff ff ff 4f 00      13:43:10.886  READ DMA EXT

Error 1977 occurred at disk power-on lifetime: 4480 hours (186 days + 16 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 da 01 ff ff ff 4f 00      13:43:10.887  READ DMA EXT
  25 da 01 ff ff ff 4f 00      13:43:10.887  READ DMA EXT
  25 da 01 ff ff ff 4f 00      13:43:10.886  READ DMA EXT
  25 da 01 ff ff ff 4f 00      13:43:10.886  READ DMA EXT
  25 da 01 ff ff ff 4f 00      13:43:10.885  READ DMA EXT

Error 1976 occurred at disk power-on lifetime: 4480 hours (186 days + 16 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 da 01 ff ff ff 4f 00      13:43:08.457  READ DMA EXT
  25 da 01 ff ff ff 4f 00      13:43:06.082  READ DMA EXT
  25 da 01 ff ff ff 4f 00      13:43:03.814  READ DMA EXT
  25 da 01 ff ff ff 4f 00      13:43:03.813  READ DMA EXT
  25 da 01 ff ff ff 4f 00      13:43:03.813  READ DMA EXT

Error 1975 occurred at disk power-on lifetime: 4480 hours (186 days + 16 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 da 01 ff ff ff 4f 00      13:43:06.082  READ DMA EXT
  25 da 01 ff ff ff 4f 00      13:43:03.814  READ DMA EXT
  25 da 01 ff ff ff 4f 00      13:43:03.813  READ DMA EXT
  25 da 01 ff ff ff 4f 00      13:43:03.813  READ DMA EXT
  25 da 01 ff ff ff 4f 00      13:43:03.813  READ DMA EXT

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       90%      4480         625142447
# 2  Short offline       Completed: read failure       90%      4474         625142447
# 3  Extended offline    Completed: read failure       90%      4474         625142447
# 4  Short offline       Completed: read failure       90%      4474         625142447
# 5  Conveyance offline  Completed: read failure       90%      4473         625142447
# 6  Short offline       Completed: read failure       90%      4473         625142447

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
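
One thing that looks odd in the output above: the error log reports every failure at LBA 0x0fffffff = 268435455, while the self-test log reports 625142447. 0x0fffffff is the largest value a 28-bit LBA can hold, and the register dump only has the low nibble of DH plus SN/CL/CH to build an address from. So the logged address is probably a placeholder for a 48-bit command (READ DMA EXT) rather than a second bad sector. That interpretation is an assumption; the sketch below only reproduces the arithmetic, and lba28_from_registers is an illustrative name.

```python
def lba28_from_registers(sn: int, cl: int, ch: int, dh: int) -> int:
    """Assemble a 28-bit LBA from ATA task-file registers.

    Bits 0-7 come from SN, 8-15 from CL, 16-23 from CH, and
    bits 24-27 from the low nibble of DH.
    """
    return ((dh & 0x0F) << 24) | (ch << 16) | (cl << 8) | sn


MAX_LBA28 = (1 << 28) - 1

# Register dump from the error log, in the order ER ST SC SN CL CH DH.
er, st, sc, sn, cl, ch, dh = bytes.fromhex("40 51 00 ff ff ff 0f")

lba = lba28_from_registers(sn, cl, ch, dh)
print(hex(lba), lba)          # 0xfffffff 268435455
print(lba == MAX_LBA28)       # True

# The sector from the self-test log does not fit in 28 bits at all.
print(625142447 > MAX_LBA28)  # True
```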

Last edited by b6fan (2010-05-10 07:18:46)

b6fan · 2010-05-10 07:18:23

The bad block was repaired by SeaTools for DOS.

It seems that only SeaTools for DOS can repair this issue.
