Comment #15 suggested that a bug in 4.16 hid I/O failures. With those failures being reported again as of 4.16.3, you now get coredumps rather than silent failures in the HW.
How does the LTS kernel behave?
What about some 4.15 kernel?
Possibly you're right.
I hit the same issue after installing linux-lts and 4.15-1.
DBUS is not related to the kernel, does not cause IO errors in dmesg and that disc is not reported as healthy by any stretch.
You're free to believe whatever you want, but the most likely explanation is that the inodes that hold some dbus binary or library are affected by the disc damage.
Sure, I agree that my disc has bad blocks. But how does that explain that when I revert to linux package 4.16.2-2 or earlier, I don't get any errors (related to dbus) or problems launching some programs?
This dbus issue always starts after updating to a linux version >4.16.2-2.
Although I did get a dbus error in Geany.
220 Disk_Shift 0x0002 100 100 000 Old_age Always - 8256
Never seen before, had to google:
"Distance of the disk has shifted relative to the spindle. Incorrect disk spin can be cause by mechanical shock or high temperature."
5 Reallocated_Sector_Ct 0x0033 061 061 050 Pre-fail Always - 15400
196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 1028
197 Current_Pending_Sector 0x0032 100 100 000 Old_age Always - 976
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Completed: read failure 00% 20770 488296080
Before anything else, save all valuable data from that disk.
Then get a live-disk system w/ a maintenance focus (e.g. grml) and run https://wiki.archlinux.org/index.php/Badblocks (you can go w/ the non-destructive test) to see whether this is an isolated error or the disk is toast.
Then you'll have to make up your mind about the further fate of the disk, but I would cease to "trust" it (though it might still be good as a temporary media tank attached to your TV or so).
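For anyone who wants to check those counters without eyeballing the table, here is a minimal sketch that parses attribute lines as smartctl prints them and flags the non-zero ones. The attribute lines are copied from the output in this thread; the `WATCHED` set and the function names are my own choices for illustration, not anything smartctl defines.

```python
# Attribute lines copied from the smartctl output posted in this thread.
LINES = """\
5 Reallocated_Sector_Ct 0x0033 061 061 050 Pre-fail Always - 15400
196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 1028
197 Current_Pending_Sector 0x0032 100 100 000 Old_age Always - 976
198 Offline_Uncorrectable 0x0030 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 0
"""

# Assumption: these are the counters worth watching for a dying disk.
# The selection is illustrative, not an official list.
WATCHED = {
    "Reallocated_Sector_Ct",
    "Reallocated_Event_Count",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
}


def parse_attributes(text):
    """Return {attribute_name: raw_value} for each attribute row in text."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        # An attribute row has at least 10 columns and starts with a numeric ID.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        # Column 1 is the name, column 9 is the first token of RAW_VALUE.
        attrs[fields[1]] = int(fields[9])
    return attrs


def suspicious(attrs):
    """Keep only the watched attributes whose raw value is above zero."""
    return {name: raw for name, raw in attrs.items() if name in WATCHED and raw > 0}


if __name__ == "__main__":
    for name, raw in suspicious(parse_attributes(LINES)).items():
        print(f"{name}: {raw}")
```

A non-zero raw value here does not say how fast the disk is dying, only that it has already remapped or failed to read sectors, which is why the backup comes first.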
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Completed: read failure 00% 20770 488296080
The drive appears to have issues independent of any kernel issue.
I suppose so, yes, but now everything works fine ¯\_(ツ)_/¯ despite the drive issue.
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Completed: read failure 00% 20770 488296080
The drive appears to have issues independent of any kernel issue.
Edit:
From 4.16.3 Release Notes
commit 80dc97f7e1e1b90ab62dc120ec9d09d69c8e03e8
Author: Bart Van Assche <bart.vanassche@wdc.com>
Date: Thu Apr 5 10:32:59 2018 -0700
Revert "scsi: core: return BLK_STS_OK for DID_OK in __scsi_error_from_host_byte()"
commit cbe095e2b584623b882ebaf6c18e0b9077baa3f7 upstream.
The description of commit e39a97353e53 is wrong: it mentions that commit
2a842acab109 introduced a bug in __scsi_error_from_host_byte() although that
commit did not change the behavior of that function. Additionally, commit
e39a97353e53 introduced a bug: it causes commands that fail with
hostbyte=DID_OK and driverbyte=DRIVER_SENSE to be completed with
BLK_STS_OK. Hence revert that commit.
Fixes: e39a97353e53 ("scsi: core: return BLK_STS_OK for DID_OK in __scsi_error_from_host_byte()")
Reported-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Douglas Gilbert <dgilbert@interlog.com>
Cc: Damien Le Moal <damien.lemoal@wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Lee Duncan <lduncan@suse.com>
Cc: stable@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Could explain why you were not seeing errors before 4.16.3
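To make the effect of that revert concrete, here is a deliberately simplified toy model in Python. It is not the kernel's code: the function names are made up, only the DID_OK path from the commit message is modelled, and it assumes the mapping is only consulted for commands that have already failed. `DID_OK` and `DRIVER_SENSE` use the values from the kernel's SCSI headers.

```python
DID_OK = 0x00        # host byte: the transport itself reported no error
DRIVER_SENSE = 0x08  # driver byte: the device returned sense data (error details)


def status_with_e39a97353e53(host_byte, driver_byte):
    """Toy model of the behaviour the revert message describes as buggy.

    The driver byte is ignored, so a failed command with hostbyte=DID_OK
    and driverbyte=DRIVER_SENSE is completed as if it had succeeded.
    """
    if host_byte == DID_OK:
        return "BLK_STS_OK"
    return "BLK_STS_IOERR"


def status_after_revert(host_byte, driver_byte):
    """Toy model after the revert: a failed command is reported as an
    I/O error even when the host byte is DID_OK."""
    return "BLK_STS_IOERR"


if __name__ == "__main__":
    # A read that hit a bad sector: transport fine, device reports an error.
    print(status_with_e39a97353e53(DID_OK, DRIVER_SENSE))  # error hidden
    print(status_after_revert(DID_OK, DRIVER_SENSE))       # error reported
```

Under these assumptions the same failing read is completed silently with the buggy mapping and surfaces as an I/O error once the revert is in, which would match errors only showing up from 4.16.3 on.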
smartctl -a /dev/sda
smartctl 6.6 2017-11-05 r4594 [x86_64-linux-4.16.2-2-ARCH] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Family: Toshiba 2.5" HDD MK..75GSX
Device Model: TOSHIBA MK5075GSX
Serial Number: 81MRD284B
LU WWN Device Id: 5 000039 37b289f9b
Firmware Version: GT001M
User Capacity: 500107862016 bytes [500 GB]
Sector Size: 512 bytes logical/physical
Rotation Rate: 5400 rpm
Form Factor: 2.5 inches
Device is: In smartctl database [for details use: -P show]
ATA Version is: ATA8-ACS (minor revision not indicated)
SATA Version is: SATA 2.6, 3.0 Gb/s (current: 3.0 Gb/s)
Local Time is: Sun May 6 18:43:22 2018 +03
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x00) Offline data collection activity
was never started.
Auto Offline Data Collection: Disabled.
Self-test execution status: ( 112) The previous self-test completed having
the read element of the test failed.
Total time to complete Offline
data collection: ( 120) seconds.
Offline data collection
capabilities: (0x5b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
No Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 163) minutes.
SCT capabilities: (0x003d) SCT Status supported.
SCT Error Recovery Control supported.
SCT Feature Control supported.
SCT Data Table supported.
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000b 100 100 050 Pre-fail Always - 0
2 Throughput_Performance 0x0005 100 100 050 Pre-fail Offline - 0
3 Spin_Up_Time 0x0027 100 100 001 Pre-fail Always - 2013
4 Start_Stop_Count 0x0032 100 100 000 Old_age Always - 8938
5 Reallocated_Sector_Ct 0x0033 061 061 050 Pre-fail Always - 15400
7 Seek_Error_Rate 0x000b 100 100 050 Pre-fail Always - 0
8 Seek_Time_Performance 0x0005 100 100 050 Pre-fail Offline - 0
9 Power_On_Hours 0x0032 049 049 000 Old_age Always - 20771
10 Spin_Retry_Count 0x0033 253 100 030 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 8802
191 G-Sense_Error_Rate 0x0032 100 100 000 Old_age Always - 277
192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always - 441
193 Load_Cycle_Count 0x0032 075 075 000 Old_age Always - 255166
194 Temperature_Celsius 0x0022 100 100 000 Old_age Always - 37 (Min/Max 13/65)
196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 1028
197 Current_Pending_Sector 0x0032 100 100 000 Old_age Always - 976
198 Offline_Uncorrectable 0x0030 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 0
220 Disk_Shift 0x0002 100 100 000 Old_age Always - 8256
222 Loaded_Hours 0x0032 061 061 000 Old_age Always - 15728
223 Load_Retry_Count 0x0032 100 100 000 Old_age Always - 0
224 Load_Friction 0x0022 100 100 000 Old_age Always - 0
226 Load-in_Time 0x0026 100 100 000 Old_age Always - 311
240 Head_Flying_Hours 0x0001 100 100 001 Pre-fail Offline - 0
SMART Error Log Version: 1
ATA Error Count: 20799 (device log contains only the most recent five errors)
CR = Command Register [HEX]
FR = Features Register [HEX]
SC = Sector Count Register [HEX]
SN = Sector Number Register [HEX]
CL = Cylinder Low Register [HEX]
CH = Cylinder High Register [HEX]
DH = Device/Head Register [HEX]
DC = Device Command Register [HEX]
ER = Error register [HEX]
ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.
Error 20799 occurred at disk power-on lifetime: 20771 hours (865 days + 11 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 41 5a a0 0b 24 6d Error: WP at LBA = 0x0d240ba0 = 220466080
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
61 08 60 78 81 45 40 00 01:55:26.546 WRITE FPDMA QUEUED
60 10 58 a0 0b 24 40 00 01:55:23.756 READ FPDMA QUEUED
60 88 50 08 44 22 40 00 01:55:23.756 READ FPDMA QUEUED
60 20 48 98 40 22 40 00 01:55:23.741 READ FPDMA QUEUED
60 e0 40 50 8c 08 40 00 01:55:23.734 READ FPDMA QUEUED
Error 20798 occurred at disk power-on lifetime: 20769 hours (865 days + 9 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 41 ca a0 0b 24 6d Error: UNC at LBA = 0x0d240ba0 = 220466080
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
60 08 e0 58 0d 44 40 00 00:00:48.071 READ FPDMA QUEUED
60 18 d8 68 0a 29 40 00 00:00:48.071 READ FPDMA QUEUED
61 40 d0 98 08 84 40 00 00:00:48.071 WRITE FPDMA QUEUED
60 10 c8 a0 0b 24 40 00 00:00:46.581 READ FPDMA QUEUED
60 08 c0 98 0b c1 40 00 00:00:46.562 READ FPDMA QUEUED
Error 20797 occurred at disk power-on lifetime: 20769 hours (865 days + 9 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 41 92 18 1f 24 6d Error: WP at LBA = 0x0d241f18 = 220471064
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
61 00 98 00 b8 57 40 00 06:59:09.700 WRITE FPDMA QUEUED
60 08 90 18 1f 24 40 00 06:59:09.556 READ FPDMA QUEUED
60 08 88 60 0c c1 40 00 06:59:09.540 READ FPDMA QUEUED
61 78 80 38 2a 84 40 00 06:59:09.538 WRITE FPDMA QUEUED
61 00 78 00 b2 57 40 00 06:59:08.632 WRITE FPDMA QUEUED
Error 20796 occurred at disk power-on lifetime: 20768 hours (865 days + 8 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 41 12 a0 0b 24 6d Error: WP at LBA = 0x0d240ba0 = 220466080
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
61 38 18 98 08 84 40 00 06:52:05.822 WRITE FPDMA QUEUED
60 10 10 a0 0b 24 40 00 06:52:05.318 READ FPDMA QUEUED
60 20 08 00 48 93 40 00 06:52:05.318 READ FPDMA QUEUED
60 08 00 98 0b c1 40 00 06:52:05.299 READ FPDMA QUEUED
60 30 f0 90 0e c8 40 00 06:52:05.298 READ FPDMA QUEUED
Error 20795 occurred at disk power-on lifetime: 20768 hours (865 days + 8 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 41 5a a0 0b 24 6d Error: WP at LBA = 0x0d240ba0 = 220466080
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
61 08 60 b0 43 f3 40 00 06:39:53.717 WRITE FPDMA QUEUED
60 08 58 a0 0b 24 40 00 06:39:53.717 READ FPDMA QUEUED
60 00 50 30 f8 88 40 00 06:39:53.717 READ FPDMA QUEUED
61 f8 48 80 83 84 40 00 06:39:53.717 WRITE FPDMA QUEUED
61 08 40 00 0a 55 40 00 06:39:53.717 WRITE FPDMA QUEUED
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Completed: read failure 00% 20770 488296080
# 2 Short offline Completed without error 00% 1 -
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
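The LBA that smartctl prints next to each error in the log above can be recomputed from the register dump, which is a quick way to confirm that several errors hit the same sector. A minimal sketch, assuming the 28-bit layout the summary error log uses (low nibble of DH, then CH, CL, SN); the function name is mine:

```python
def lba28_from_registers(sn, cl, ch, dh):
    """Combine the SN, CL, CH and DH registers into a 28-bit LBA.

    Only the low four bits of DH carry address bits; the high bits are flags.
    """
    return ((dh & 0x0F) << 24) | (ch << 16) | (cl << 8) | sn


if __name__ == "__main__":
    # Registers from error 20799 in the log: SN=a0 CL=0b CH=24 DH=6d
    lba = lba28_from_registers(0xA0, 0x0B, 0x24, 0x6D)
    print(hex(lba), lba)

    # Registers from error 20797: SN=18 CL=1f CH=24 DH=6d
    lba = lba28_from_registers(0x18, 0x1F, 0x24, 0x6D)
    print(hex(lba), lba)
```

Four of the five logged errors decode to the same LBA (220466080), and the fifth is only about 5000 sectors away, so the failures in this log cluster in one small region of the disk.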
Can you determine which kernel the issue started with between 4.16.2-2 and 4.16.6-1? You can obtain kernel versions you do not have cached from the ALA.
Running a SMART test / checking the SMART status of devices will rule out a possibility.
The issue started with linux-4.16.3-1-x86_64.pkg.
I'll provide SMART test results later (but the idea that the system depends on some HDD block only with the newer version confuses me a little).
In any case: YOU WANT TO BE SURE YOUR HDD IS OK, and the pile of SIGBUSes plus the IO errors should scare the shit out of you.
Once you've settled that it's NOT the HDD, it's time to check whether this might relate to drive power management, swap handling, a single bad block (does dmesg always point to the same problematic sector?), ...
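A rough way to answer the "same sector every time?" question is to count how often each sector shows up in the kernel log. The sketch below is only that, a sketch: the sample lines are hypothetical (the sector numbers are borrowed from the SMART error log in this thread), and the exact message prefix differs between kernel versions, so the regex only keys on the "I/O error, dev ..., sector ..." part.

```python
import re
from collections import Counter

# Matches the common "I/O error, dev sda, sector 12345" part of block-layer
# error messages; the text before it varies between kernel versions.
SECTOR_RE = re.compile(r"I/O error, dev (\w+), sector (\d+)")

# Hypothetical sample, NOT taken from the reporter's dmesg.
SAMPLE = """\
print_req_error: I/O error, dev sda, sector 220466080
print_req_error: I/O error, dev sda, sector 220466080
print_req_error: I/O error, dev sda, sector 220471064
"""


def count_bad_sectors(log_text):
    """Count I/O errors per (device, sector) pair found in log_text."""
    return Counter(
        (m.group(1), int(m.group(2))) for m in SECTOR_RE.finditer(log_text)
    )


if __name__ == "__main__":
    for (dev, sector), hits in count_bad_sectors(SAMPLE).most_common():
        print(f"{dev} sector {sector}: {hits} error(s)")
```

To run it against a real log, feed it the saved output of `dmesg` or `journalctl -k` instead of `SAMPLE`. One or two sectors dominating points at isolated bad blocks; errors spread all over the disk point at something worse.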
If your harddisk is starting to fail, this will have all sorts of wide-reaching consequences. Back up what you can immediately. Run a SMART test and post the -a results after the mentioned time elapses.
But I don't have much hope of that coming out positive
But how does that relate to the fact that everything works OK if I downgrade my linux version to 4.16.2-2-ARCH?