Hi all,
Since the last kernel upgrade on a headless server
[2013-04-04 09:56] upgraded linux (3.8.4-1 -> 3.8.5-1)
I get loads of these errors:
3w-xxxx: scsi2: Unknown scsi opcode: 0x41
[ 4356.826902] sd 2:0:0:0: [sda] Unhandled error code
[ 4356.826905] sd 2:0:0:0: [sda]
[ 4356.826907] Result: hostbyte=0x04 driverbyte=0x00
[ 4356.826909] sd 2:0:0:0: [sda] CDB:
[ 4356.826911] cdb[0]=0x41: 41 00 00 93 28 ee 00 00 08 00
[ 4356.826919] end_request: I/O error, dev sda, sector 9644270
[ 4356.827002] sda3: WRITE SAME failed. Manually zeroing.
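For anyone reading the log above: opcode 0x41 is the SCSI WRITE SAME(10) command, and the CDB bytes can be decoded by hand. The sketch below assumes the standard WRITE SAME(10) field layout (LBA in bytes 2-5, block count in bytes 7-8); the function name is made up for illustration. The decoded LBA matches the sector in the end_request line, and the last log line suggests the kernel fell back to zeroing the blocks with ordinary writes.

```python
# Sketch: decode the 10-byte CDB printed in the kernel log.
# Field offsets are assumed from the SCSI WRITE SAME(10) layout.
CDB_HEX = "41 00 00 93 28 ee 00 00 08 00"


def decode_write_same_10(cdb_hex):
    """Return opcode, starting LBA and block count of a WRITE SAME(10) CDB."""
    cdb = bytes.fromhex(cdb_hex)  # fromhex ignores the spaces
    if len(cdb) != 10:
        raise ValueError("expected a 10-byte CDB")
    return {
        "opcode": cdb[0],
        "lba": int.from_bytes(cdb[2:6], "big"),    # bytes 2..5, big-endian
        "blocks": int.from_bytes(cdb[7:9], "big"),  # bytes 7..8, big-endian
    }


fields = decode_write_same_10(CDB_HEX)
print("opcode 0x%02x, LBA %d, %d blocks"
      % (fields["opcode"], fields["lba"], fields["blocks"]))
```

So the request was to write the same pattern to 8 blocks starting at sector 9644270, and it was the 3w-xxxx driver that did not recognise the opcode.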
sda3 is an ext4 / partition on a 3ware hardware controller (RAID1). The awful tw_cli tool says everything is OK:
Unit UnitType Status %RCmpl %V/I/M Stripe Size(GB) Cache AVrfy
------------------------------------------------------------------------------
u0 RAID-1 OK - - - 69.2482 ON -
Port Status Unit Size Blocks Serial
---------------------------------------------------------------
p0 OK u0 69.25 GB 145226112 WD-WMANS1886705
p1 OK u0 69.25 GB 145226112 WD-WMANS1886623
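A side note on the sizes in this table: the "GB" that tw_cli prints appear to be binary gigabytes. The quick check below assumes 512-byte blocks, which matches the sector size smartctl reports later in the thread; the variable names are only illustrative.

```python
# Sketch: relate the tw_cli "Blocks" column to the sizes shown.
BLOCKS = 145226112      # "Blocks" column for p0/p1
SECTOR_BYTES = 512      # assumption: 512-byte blocks

size_bytes = BLOCKS * SECTOR_BYTES
size_gib = size_bytes / 2**30   # binary gigabytes
size_gb = size_bytes / 10**9    # decimal gigabytes

print(size_bytes)
print("%.2f GiB, %.2f GB" % (size_gib, size_gb))
```

The byte count comes out identical to the "User Capacity" line in the smartctl output further down, so the 69.25 figure and the drive's nominal 74 GB describe the same disk.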
I did not find relevant clues by googling.
Do any of you guys know whether this issue is harmful or can be ignored? Since otherwise everything seems fine...
Cheers,
domanov
code tags as opposed to quote tags please -- Inxsible
Last edited by Inxsible (2013-04-09 16:49:22)
Offline
Did you try a manual verify?
But whether the Constitution really be one thing, or another, this much is certain - that it has either authorized such a government as we have had, or has been powerless to prevent it. In either case, it is unfit to exist.
-Lysander Spooner
Offline
domanov, use code tags, not quote tags. Quote tags are reserved for quoting another person's post or comment. Code tags provide a monospaced font, which makes it easier to distinguish between your post text and snippets. They also provide scrollbars in case the snippets are long.
There's no such thing as a stupid question, but there sure are a lot of inquisitive idiots !
Offline
domanov, use code tags, not quote tags. Quote tags are reserved for quoting another person's post or comment.
Sorry mate, thanks for pointing that out and for the correction in the OP.
Did you try a manual verify?
It's running right now, but it does not seem to be the right tw_cli client (I repeat: awful support from 3ware):
[root@server ~] ./tw_cli /c2/u0 show all
/c2/u0 show all
/c2/u0 status = VERIFYING
/c2/u0 is not rebuilding, its current state is VERIFYING
/c2/u0 is verifying with percent completion = 54%
/c2/u0 is not initializing. Its current state is VERIFYING
/c2/u0 Write Cache = on
Error: (CLI:149) Feature not supported.
I did not find any clue about that error yet, not even in the manual.
Cheers,
domanov
Last edited by domanov (2013-04-09 17:11:06)
Offline
Apparently, WRITE SAME is an optional drive feature that allows writing the same data to consecutive sectors of the drive. A search for "WRITE SAME 3ware" turned up an offsite (i.e. not hosted by 3ware) 3ware manual that suggests it's only or primarily used during RAID initialization to speed things up.
WRITE SAME seems to be part of the "SMART Command Transport" feature set, so a good first step would be to make sure SMART is enabled on your drives. You could also try checking out results for "WD740ADFD sct" (surprised I figured out your drive model?) to see if there's anything related to your particular drive. If you're feeling particularly froggy (and you have backups), you can apparently send WRITE SAME commands manually using the sg_write_same command, which is in the sg3_utils package.
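Before sending any commands by hand, it may be worth checking what the kernel itself believes about WRITE SAME support on the device. The sketch below reads the block-queue attribute write_same_max_bytes from sysfs; it assumes a kernel that exposes that file (not every version does), and the helper name and the sys_root parameter are made up for illustration. A value of 0 would mean the block layer no longer issues WRITE SAME to that disk.

```python
from pathlib import Path


def write_same_limit(disk, sys_root="/sys"):
    """Return write_same_max_bytes for `disk`, or None if it cannot be read.

    Assumes the attribute lives at <sys_root>/block/<disk>/queue/.
    """
    path = Path(sys_root) / "block" / disk / "queue" / "write_same_max_bytes"
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        # File missing, unreadable, or not a number on this kernel.
        return None


if __name__ == "__main__":
    limit = write_same_limit("sda")
    if limit is None:
        print("sda: write_same_max_bytes not exposed on this system")
    elif limit == 0:
        print("sda: kernel will not issue WRITE SAME")
    else:
        print("sda: WRITE SAME allowed, up to %d bytes per request" % limit)
```

This only reads a value and changes nothing, so it should be safe to run on the live server.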
Offline
Apparently, WRITE SAME is an optional drive feature that allows writing the same data to consecutive sectors of the drive. A search for "WRITE SAME 3ware" turned up an offsite (i.e. not hosted by 3ware) 3ware manual that suggests it's only or primarily used during RAID initialization to speed things up.
WRITE SAME seems to be part of the "SMART Command Transport" feature set, so a good first step would be to make sure SMART is enabled on your drives.
Thanks alphaniner for pointing me there. This is the output of smartctl on one of the two disks in RAID1 (the other gave no error whatsoever):
[root@server ~]# smartctl --all -T permissive --device=3ware,0 /dev/twe0
smartctl 6.1 2013-03-16 r3800 [x86_64-linux-3.8.6-1-ARCH] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Family: Western Digital Raptor
Device Model: WDC WD740ADFD-00NLR5
Serial Number: WD-WMANS1886705
Firmware Version: 21.07QR5
User Capacity: 74,355,769,344 bytes [74.3 GB]
Sector Size: 512 bytes logical/physical
Device is: In smartctl database [for details use: -P show]
ATA Version is: ATA/ATAPI-7 published, ANSI INCITS 397-2005
Local Time is: Wed Apr 10 10:59:26 2013 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x82) Offline data collection activity
was completed without error.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 2391) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 39) minutes.
Conveyance self-test routine
recommended polling time: ( 5) minutes.
SCT capabilities: (0x103f) SCT Status supported.
SCT Error Recovery Control supported.
SCT Feature Control supported.
SCT Data Table supported.
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000b 200 196 051 Pre-fail Always - 1
3 Spin_Up_Time 0x0007 169 169 021 Pre-fail Always - 2566
4 Start_Stop_Count 0x0032 100 100 040 Old_age Always - 67
5 Reallocated_Sector_Ct 0x0033 199 199 140 Pre-fail Always - 13
7 Seek_Error_Rate 0x000a 200 200 051 Old_age Always - 0
9 Power_On_Hours 0x0032 045 045 000 Old_age Always - 40164
10 Spin_Retry_Count 0x0012 100 253 051 Old_age Always - 0
11 Calibration_Retry_Count 0x0012 100 253 051 Old_age Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 62
194 Temperature_Celsius 0x0022 116 106 000 Old_age Always - 27
196 Reallocated_Event_Count 0x0032 187 187 000 Old_age Always - 13
197 Current_Pending_Sector 0x0012 191 188 000 Old_age Always - 113
198 Offline_Uncorrectable 0x0012 200 197 000 Old_age Always - 3
199 UDMA_CRC_Error_Count 0x000a 200 253 000 Old_age Always - 0
200 Multi_Zone_Error_Rate 0x0008 198 193 051 Old_age Offline - 60
SMART Error Log Version: 1
ATA Error Count: 24 (device log contains only the most recent five errors)
CR = Command Register [HEX]
FR = Features Register [HEX]
SC = Sector Count Register [HEX]
SN = Sector Number Register [HEX]
CL = Cylinder Low Register [HEX]
CH = Cylinder High Register [HEX]
DH = Device/Head Register [HEX]
DC = Device Command Register [HEX]
ER = Error register [HEX]
ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.
Error 24 occurred at disk power-on lifetime: 40148 hours (1672 days + 20 hours)
When the command that caused the error occurred, the device was doing SMART Offline or Self-test.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 40 df 87 c3 e0 Error: UNC 64 sectors at LBA = 0x00c387df = 12814303
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
25 00 40 c0 87 c3 41 00 2d+04:24:43.728 READ DMA EXT
25 00 40 80 87 c3 41 00 2d+04:24:43.728 READ DMA EXT
25 00 40 40 87 c3 41 00 2d+04:24:43.728 READ DMA EXT
25 00 40 00 87 c3 41 00 2d+04:24:43.728 READ DMA EXT
25 00 40 c0 86 c3 41 00 2d+04:24:43.728 READ DMA EXT
Error 23 occurred at disk power-on lifetime: 40148 hours (1672 days + 20 hours)
[...]
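The register dump for "Error 24" can be decoded by hand to see where smartctl gets its summary from. The sketch below assumes the usual ATA meaning of the error-register bits and builds the LBA from the SN/CL/CH registers plus the low nibble of DH. Since the failing command is a 48-bit READ DMA EXT, treat this as the low-order part of the address only; the names are illustrative.

```python
# Register values copied from the "Error 24" dump (hex).
REGS = {"ER": 0x40, "ST": 0x51, "SC": 0x40,
        "SN": 0xDF, "CL": 0x87, "CH": 0xC3, "DH": 0xE0}

# Assumed ATA error-register bit names.
ERROR_BITS = {0x80: "ICRC", 0x40: "UNC", 0x10: "IDNF", 0x04: "ABRT"}


def decode_error(regs):
    """Return (lba, sector_count, error_flags) for one logged ATA error."""
    lba = ((regs["DH"] & 0x0F) << 24) | (regs["CH"] << 16) \
        | (regs["CL"] << 8) | regs["SN"]
    flags = [name for bit, name in ERROR_BITS.items() if regs["ER"] & bit]
    return lba, regs["SC"], flags


lba, count, flags = decode_error(REGS)
print("LBA %d (0x%08x), %d sectors, flags: %s"
      % (lba, lba, count, ", ".join(flags)))
```

The result agrees with the line smartctl prints: UNC, 64 sectors, LBA 0x00c387df = 12814303.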
All five errors in the SMART log are "only" UNC errors on read. If I understand correctly from googling around, this could be a bad cable connection rather than the drive really failing (a couple of days ago someone worked on the rack where this server is hosted...).
Anyway, the other drive seems fine. Since this 2-disk RAID1 unit holds only the / partition (it's a plain Arch installation; the server is a "number cruncher" and all actual data are on a separate 4-disk RAID5 unit), I just backed up /etc out of an excess of zeal.
For now I think I'll wait until the weekend to run long SMART tests, badblocks and such.
Any further hints or comments are still appreciated! Do you think it's time to replace the drive?
Thanks,
domanov
Offline
I'd say replace that disk. You already have reallocated sectors, and more are probably waiting to be reallocated; that drive _may_ be starting to fail.
5 Reallocated_Sector_Ct 0x0033 199 199 140 Pre-fail Always - 13
196 Reallocated_Event_Count 0x0032 187 187 000 Old_age Always - 13
197 Current_Pending_Sector 0x0012 191 188 000 Old_age Always - 113
198 Offline_Uncorrectable 0x0012 200 197 000 Old_age Always - 3
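To keep an eye on these four counters without reading the whole smartctl report each time, something like the sketch below could be used. It parses attribute rows in the column order shown in this thread (ID, name, flag, value, worst, thresh, type, updated, when_failed, raw) and reports the watched attributes whose raw value is nonzero. The set of watched IDs and the function names are choices made for this example, and rows with a non-integer raw value are not handled.

```python
# Attribute rows copied from the smartctl output in this thread.
SMART_LINES = """\
5 Reallocated_Sector_Ct 0x0033 199 199 140 Pre-fail Always - 13
196 Reallocated_Event_Count 0x0032 187 187 000 Old_age Always - 13
197 Current_Pending_Sector 0x0012 191 188 000 Old_age Always - 113
198 Offline_Uncorrectable 0x0012 200 197 000 Old_age Always - 3
"""

WATCHED_IDS = {5, 196, 197, 198}  # illustrative choice of attributes


def parse_attributes(text):
    """Map attribute ID -> dict with name, value, thresh and raw count."""
    rows = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 10 or not parts[0].isdigit():
            continue  # skip headers and anything that is not an attribute row
        rows[int(parts[0])] = {
            "name": parts[1],
            "value": int(parts[3]),
            "thresh": int(parts[5]),
            "raw": int(parts[9]),
        }
    return rows


def suspicious(rows):
    """Return {name: raw} for watched attributes with a nonzero raw value."""
    return {r["name"]: r["raw"]
            for attr_id, r in sorted(rows.items())
            if attr_id in WATCHED_IDS and r["raw"] > 0}


attrs = parse_attributes(SMART_LINES)
for name, raw in suspicious(attrs).items():
    print("%-24s raw=%d" % (name, raw))
```

Note that every normalized VALUE here is still above its THRESH, which would explain why the overall self-assessment says PASSED even though the raw counts are climbing.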
R00KIE
Tm90aGluZyB0byBzZWUgaGVyZSwgbW92ZSBhbG9uZy4K
Offline
I'd say replace that disk. You already have reallocated sectors, and more are probably waiting to be reallocated; that drive _may_ be starting to fail.
Yes, thank you, I agree.
5 Reallocated_Sector_Ct 0x0033 199 199 140 Pre-fail Always - 13
196 Reallocated_Event_Count 0x0032 187 187 000 Old_age Always - 13
197 Current_Pending_Sector 0x0012 191 188 000 Old_age Always - 113
198 Offline_Uncorrectable 0x0012 200 197 000 Old_age Always - 3
Right now Current_Pending_Sector is at 114. The disk IS going to fail. Can I just wait until it happens (since it is a 2-disk RAID-1 and the other disk is fine), or could a failed disk cause downtime on the server? Any experience?
Cheers,
domanov
Offline
In theory RAID1 should keep working even when n-1 disks fail. However, rebuilding the array after adding the new disks will stress the remaining disks, which might lead them to fail as well (you can find this in almost every corner of the internet that talks about RAID).
I also suspect that you may get some performance degradation when using a disk that is not working well. The drive will take some time to report an error when it tries to read one of the Current_Pending_Sector/Offline_Uncorrectable sectors, and after that the data has to be retrieved from the other disks. How much performance you lose depends on how often you read from those sectors, and on how often new sectors become unreadable and turn into pending sectors.
R00KIE
Offline