My setup: separate /home partition formatted with btrfs, with eCryptFS encryption of my home directory (say it's /home/myuser), using Ubuntu tools.
Recently, I've been having random read/write errors. Sometimes I read some random file on my FS, then I hit a read error, then the entire filesystem becomes read-only because errors were encountered. I kind of figured out which files were problematic by trial-and-error, and I copied all the files I could read and backed them up.
Write errors were more problematic. Sometimes, when I'm randomly executing a command (like rvm install), boom, a write error occurs and my FS turns read-only. This is completely unpredictable, and I can't do anything about it.
I have tried the following:
btrfs scrub, which discovered a handful of errors but could not correct any of them;
btrfs check --repair, which attempted a fix to no avail and unfortunately ended with a segfault;
btrfs check --repair --init-csum-tree, which attempted a fix but gave me a write error;
SMART testing of the drive, which returned no errors.
Eventually, I went for the nuke, so I wiped the filesystem, remade the file system, and copied the backed-up files back to it. Fortunately I did not lose anything that matters. However, I'm now very confused about what is going wrong. How did my filesystem get this broken? Here are my guesses:
forced shutdown broke stuff: could be, but quite unlikely;
reading/writing/copying files with errors: again, unlikely situation;
aging disk, no significant problems with the disk but can cause random R/W errors: this is my guess, but the problem is that I've only been using the disk for, like, 1 year;
btrfs and eCryptFS incompatibility: also seems possible; can anyone suggest an alternative in this case?
What are other possible causes? Which hypothesis seems most plausible?
Last edited by xuanrui (2020-06-01 18:56:27)
Can you share the output of "sudo smartctl -A ..." for the drive?
Maybe it's not the drive; instead, your PC might be unstable and causing the data corruption. Have you tested general stability, for example with memtest86 or stressapptest? Did you check the CPU temperature?
Here's the output:
smartctl 7.1 2019-12-30 r5022 [x86_64-linux-5.6.15-3-clear] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x0032 100 100 050 Old_age Always - 0
5 Reallocated_Sector_Ct 0x0032 100 100 050 Old_age Always - 20
9 Power_On_Hours 0x0032 100 100 050 Old_age Always - 3245
12 Power_Cycle_Count 0x0032 100 100 050 Old_age Always - 1664
160 Unknown_Attribute 0x0032 100 100 050 Old_age Always - 48
161 Unknown_Attribute 0x0033 100 100 050 Pre-fail Always - 75
163 Unknown_Attribute 0x0032 100 100 050 Old_age Always - 17
164 Unknown_Attribute 0x0032 100 100 050 Old_age Always - 78694
165 Unknown_Attribute 0x0032 100 100 050 Old_age Always - 198
166 Unknown_Attribute 0x0032 100 100 050 Old_age Always - 97
167 Unknown_Attribute 0x0032 100 100 050 Old_age Always - 154
168 Unknown_Attribute 0x0032 100 100 050 Old_age Always - 7000
169 Unknown_Attribute 0x0032 100 100 050 Old_age Always - 98
175 Program_Fail_Count_Chip 0x0032 100 100 050 Old_age Always - 0
176 Erase_Fail_Count_Chip 0x0032 100 100 050 Old_age Always - 0
177 Wear_Leveling_Count 0x0032 100 100 050 Old_age Always - 0
178 Used_Rsvd_Blk_Cnt_Chip 0x0032 100 100 050 Old_age Always - 20
181 Program_Fail_Cnt_Total 0x0032 100 100 050 Old_age Always - 0
182 Erase_Fail_Count_Total 0x0032 100 100 050 Old_age Always - 0
192 Power-Off_Retract_Count 0x0032 100 100 050 Old_age Always - 118
194 Temperature_Celsius 0x0022 100 100 050 Old_age Always - 50
195 Hardware_ECC_Recovered 0x0032 100 100 050 Old_age Always - 100129
196 Reallocated_Event_Count 0x0032 100 100 050 Old_age Always - 48
197 Current_Pending_Sector 0x0032 100 100 050 Old_age Always - 20
198 Offline_Uncorrectable 0x0032 100 100 050 Old_age Always - 48
199 UDMA_CRC_Error_Count 0x0032 100 100 050 Old_age Always - 1
232 Available_Reservd_Space 0x0032 100 100 050 Old_age Always - 75
241 Total_LBAs_Written 0x0030 100 100 050 Old_age Offline - 862657
242 Total_LBAs_Read 0x0030 100 100 050 Old_age Offline - 186688
245 Unknown_Attribute 0x0032 100 100 050 Old_age Always - 1149289
Good point on PC instability, I ought to test for that.
I think your drive is bad. I bet it's the drive causing the corruption problems and not your PC.
These SMART record entries here are bad:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
...
5 Reallocated_Sector_Ct 0x0032 100 100 050 Old_age Always - 20
...
196 Reallocated_Event_Count 0x0032 100 100 050 Old_age Always - 48
197 Current_Pending_Sector 0x0032 100 100 050 Old_age Always - 20
...
The "raw value" column shows how often those events have happened in the past. These "reallocated sector" events happen when the drive tries and fails to read/write an area of the drive.
A certain amount can sometimes already be there when a drive is brand new. The numbers you have are too high for that, I'm guessing. You also say that you lost data, so that would be a sign that those events happened at that time.
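To make that check repeatable, here is a minimal sketch that pulls the raw values out of "smartctl -A" style text and lists the attributes named above when they are nonzero. The names WATCHED_IDS, parse_smart_attributes and suspicious are made up for this illustration, the watch list only covers the three IDs mentioned in this post, and the sample rows are copied from the output earlier in the thread. Treat it as a rough aid, not a verdict on drive health.

```python
import re

# Assumed watch list for this sketch: the attribute IDs called out in the post.
WATCHED_IDS = {5, 196, 197}

# One attribute row: ID, name, flag, value, worst, thresh, type, updated,
# when_failed, then the raw value (only the leading integer is captured).
ROW = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+0x[0-9a-fA-F]+\s+\d+\s+\d+\s+\d+"
    r"\s+\S+\s+\S+\s+\S+\s+(\d+)"
)


def parse_smart_attributes(text):
    """Return {id: (name, raw_value)} for each attribute row found in text."""
    attrs = {}
    for line in text.splitlines():
        match = ROW.match(line)
        if match:
            attrs[int(match.group(1))] = (match.group(2), int(match.group(3)))
    return attrs


def suspicious(attrs):
    """Return the watched attributes whose raw value is above zero."""
    return {
        attr_id: attrs[attr_id]
        for attr_id in sorted(WATCHED_IDS)
        if attr_id in attrs and attrs[attr_id][1] > 0
    }


# Rows copied from the output posted earlier in this thread.
SAMPLE = """\
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x0032 100 100 050 Old_age Always - 0
5 Reallocated_Sector_Ct 0x0032 100 100 050 Old_age Always - 20
196 Reallocated_Event_Count 0x0032 100 100 050 Old_age Always - 48
197 Current_Pending_Sector 0x0032 100 100 050 Old_age Always - 20
"""

if __name__ == "__main__":
    for attr_id, (name, raw) in suspicious(parse_smart_attributes(SAMPLE)).items():
        print(f"{attr_id:>3} {name}: raw value {raw}")
```

Which raw values actually matter varies by vendor, so a nonzero count here is a reason to look closer rather than proof of failure.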
The normal recommendation would be to immediately stop using the drive and replace it. You should not use it at all, except for reading all data you want to keep.
Personally, I kept using these kinds of drives for fun for things that don't matter, like for example an extra backup drive or for installing games or downloading videos. I was interested to see how the errors would develop. I've only seen these kinds of errors on HDDs. Most of the HDDs I had with those kinds of issues died very fast, within days. There was one HDD that after a while stopped counting up those events. That one HDD then kept working fine for another five years or so and then one day it was suddenly dead.
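Watching whether those counters keep climbing can be sketched roughly as below. The function name counter_deltas is invented for this example, the "before" numbers are the raw values from this thread, and the "later" numbers are purely hypothetical, made up to show the comparison.

```python
def counter_deltas(before, after):
    """Return {attribute_id: increase} for counters that rose between snapshots."""
    shared_ids = sorted(before.keys() & after.keys())
    return {
        attr_id: after[attr_id] - before[attr_id]
        for attr_id in shared_ids
        if after[attr_id] > before[attr_id]
    }


# Raw values from this thread (IDs 5, 196, 197).
before = {5: 20, 196: 48, 197: 20}
# Hypothetical later snapshot, invented for illustration only.
later = {5: 23, 196: 48, 197: 25}

if __name__ == "__main__":
    for attr_id, increase in counter_deltas(before, later).items():
        print(f"attribute {attr_id} rose by {increase}")
```

An empty result over a long stretch would correspond to the case described above where the drive stopped counting up those events.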
Hmmm, my drive is an SSD though. I know that HDDs tend to randomly die, and I have a few HDDs that died that way, but SSDs seem to only wear out due to "old age" and tend not to suddenly die due to physical setup. I have ordered a replacement drive that will arrive tomorrow, and the plan is to migrate to that drive when it arrives. At least it can't hurt!