#1 2020-06-01 18:56:08

xuanrui
Member
Registered: 2018-09-27
Posts: 52

Strange trouble with eCryptFS-on-btrfs; what could be the cause?

My setup: separate /home partition formatted with btrfs, with eCryptFS encryption of my home directory (say it's /home/myuser), using Ubuntu tools.

Recently, I've been having random read/write errors. Sometimes I read some random file on my FS, then I hit a read error, then the entire filesystem becomes read-only because errors were encountered. I kind of figured out which files were problematic by trial-and-error, and I copied all the files I could read and backed them up.

Write errors were more problematic. Sometimes, when I'm randomly executing a command (like rvm install), boom, a write error occurs and my FS turns read-only. This is completely unpredictable, and I can't do anything about it.

I have tried the following:

  • btrfs scrub

    which discovered a handful of errors but could not correct any of them;

  • btrfs check --repair

    which attempted a repair to no avail and unfortunately ended with a segfault;

  • btrfs check --repair --init-csum-tree

    which attempted a repair but gave me a write error;

  • SMART testing of the drive, which returned no errors.

Eventually, I went for the nuke, so I wiped the filesystem, remade the file system, and copied the backed-up files back to it. Fortunately I did not lose anything that matters. However, I'm now very confused about what is going wrong. How did my filesystem get this broken? Here are my guesses:

  • forced shutdown broke stuff: could be, but quite unlikely;

  • reading/writing/copying files with errors: again, unlikely situation;

  • aging disk, no significant problems with the disk but can cause random R/W errors: this is my guess, but the problem is that I've only been using the disk for, like, 1 year;

  • btrfs and eCryptFS incompatibility: also seems possible; can anyone suggest an alternative in this case?

What are other possible causes? Which hypothesis seems most plausible?

Last edited by xuanrui (2020-06-01 18:56:27)

#2 2020-06-01 20:36:29

Ropid
Member
Registered: 2015-03-09
Posts: 1,069

Re: Strange trouble with eCryptFS-on-btrfs; what could be the cause?

Can you share the output of "sudo smartctl -A ..." for the drive?

Maybe it's not the drive, and instead your PC isn't running stably and is causing the data corruption? Did you try testing general stability, for example using memtest86 or stressapptest? Did you check the CPU temperature?

#3 2020-06-01 22:35:20

xuanrui
Member
Registered: 2018-09-27
Posts: 52

Re: Strange trouble with eCryptFS-on-btrfs; what could be the cause?

Here's the output:

smartctl 7.1 2019-12-30 r5022 [x86_64-linux-5.6.15-3-clear] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x0032   100   100   050    Old_age   Always       -       0
  5 Reallocated_Sector_Ct   0x0032   100   100   050    Old_age   Always       -       20
  9 Power_On_Hours          0x0032   100   100   050    Old_age   Always       -       3245
 12 Power_Cycle_Count       0x0032   100   100   050    Old_age   Always       -       1664
160 Unknown_Attribute       0x0032   100   100   050    Old_age   Always       -       48
161 Unknown_Attribute       0x0033   100   100   050    Pre-fail  Always       -       75
163 Unknown_Attribute       0x0032   100   100   050    Old_age   Always       -       17
164 Unknown_Attribute       0x0032   100   100   050    Old_age   Always       -       78694
165 Unknown_Attribute       0x0032   100   100   050    Old_age   Always       -       198
166 Unknown_Attribute       0x0032   100   100   050    Old_age   Always       -       97
167 Unknown_Attribute       0x0032   100   100   050    Old_age   Always       -       154
168 Unknown_Attribute       0x0032   100   100   050    Old_age   Always       -       7000
169 Unknown_Attribute       0x0032   100   100   050    Old_age   Always       -       98
175 Program_Fail_Count_Chip 0x0032   100   100   050    Old_age   Always       -       0
176 Erase_Fail_Count_Chip   0x0032   100   100   050    Old_age   Always       -       0
177 Wear_Leveling_Count     0x0032   100   100   050    Old_age   Always       -       0
178 Used_Rsvd_Blk_Cnt_Chip  0x0032   100   100   050    Old_age   Always       -       20
181 Program_Fail_Cnt_Total  0x0032   100   100   050    Old_age   Always       -       0
182 Erase_Fail_Count_Total  0x0032   100   100   050    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   050    Old_age   Always       -       118
194 Temperature_Celsius     0x0022   100   100   050    Old_age   Always       -       50
195 Hardware_ECC_Recovered  0x0032   100   100   050    Old_age   Always       -       100129
196 Reallocated_Event_Count 0x0032   100   100   050    Old_age   Always       -       48
197 Current_Pending_Sector  0x0032   100   100   050    Old_age   Always       -       20
198 Offline_Uncorrectable   0x0032   100   100   050    Old_age   Always       -       48
199 UDMA_CRC_Error_Count    0x0032   100   100   050    Old_age   Always       -       1
232 Available_Reservd_Space 0x0032   100   100   050    Old_age   Always       -       75
241 Total_LBAs_Written      0x0030   100   100   050    Old_age   Offline      -       862657
242 Total_LBAs_Read         0x0030   100   100   050    Old_age   Offline      -       186688
245 Unknown_Attribute       0x0032   100   100   050    Old_age   Always       -       1149289

Good point on PC instability; I ought to test for that.

#4 2020-06-01 23:06:07

Ropid
Member
Registered: 2015-03-09
Posts: 1,069

Re: Strange trouble with eCryptFS-on-btrfs; what could be the cause?

I think your drive is bad. I bet it's the drive causing the corruption problems and not your PC.

These SMART record entries here are bad:

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
...
  5 Reallocated_Sector_Ct   0x0032   100   100   050    Old_age   Always       -       20
...
196 Reallocated_Event_Count 0x0032   100   100   050    Old_age   Always       -       48
197 Current_Pending_Sector  0x0032   100   100   050    Old_age   Always       -       20
...

The "raw value" column shows how often those events have happened in the past. These "reallocated sector" events happen when the drive tries and fails to read/write an area of the drive.

A certain amount can sometimes already be there when a drive is brand new. The numbers you have are too high for that, I'm guessing. You also say that you lost data, so that would be a sign that those events happened at that time.
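
If you'd rather not eyeball the table, here is a rough Python sketch that pulls the raw values out of "smartctl -A" text and flags the three attributes quoted above when they are nonzero. The names flag_attributes, WATCHED_IDS and ROW are made up for this example, and the column layout is assumed to match the output posted in this thread; other drives may print the table differently.

```python
import re

# Attribute IDs singled out in this thread. Which IDs to watch is an assumption
# for this sketch, not a rule from smartmontools.
WATCHED_IDS = {5, 196, 197}

# One table row: ID, name, flag, value, worst, thresh, type, updated,
# when_failed, raw value (layout assumed from the output posted above).
ROW = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+0x[0-9a-fA-F]+\s+\d+\s+\d+\s+\d+"
    r"\s+\S+\s+\S+\s+\S+\s+(\d+)"
)


def flag_attributes(smartctl_output, watched=WATCHED_IDS):
    """Return {id: (name, raw)} for watched attributes with a nonzero raw value."""
    flagged = {}
    for line in smartctl_output.splitlines():
        m = ROW.match(line)
        if not m:
            continue  # header lines, blank lines, anything that is not a table row
        attr_id, name, raw = int(m.group(1)), m.group(2), int(m.group(3))
        if attr_id in watched and raw > 0:
            flagged[attr_id] = (name, raw)
    return flagged


# Rows copied from the output posted in this thread.
sample = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x0032   100   100   050    Old_age   Always       -       0
  5 Reallocated_Sector_Ct   0x0032   100   100   050    Old_age   Always       -       20
196 Reallocated_Event_Count 0x0032   100   100   050    Old_age   Always       -       48
197 Current_Pending_Sector  0x0032   100   100   050    Old_age   Always       -       20
"""

for attr_id, (name, raw) in sorted(flag_attributes(sample).items()):
    print(f"{attr_id:>3} {name}: raw value {raw}")
```

A nonzero raw value is only a hint to look closer, not a verdict; how high is too high depends on the drive.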

The normal recommendation would be to immediately stop using the drive and replace it. You should not use it at all, except for reading all data you want to keep.

Personally, I kept using these kinds of drives for fun for things that don't matter, like for example an extra backup drive or for installing games or downloading videos. I was interested to see how the errors would develop. I've only seen these kinds of errors on HDDs. Most of the HDDs I had with those kinds of issues died very fast, within days. There was one HDD that after a while stopped counting up those events. That one HDD then kept working fine for another five years or so and then one day it was suddenly dead.
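
To see whether the counters are still going up, one option is to note the raw values now and compare them with a later reading. A minimal Python sketch, assuming you have already collected two snapshots as {attribute ID: raw value} dictionaries; counter_changes is a made-up helper name, and the "later" numbers are invented purely for illustration:

```python
def counter_changes(before, after):
    """Return {id: (old, new)} for attributes whose raw value differs between snapshots."""
    changes = {}
    for attr_id, new in after.items():
        old = before.get(attr_id)
        # Skip attributes missing from the earlier snapshot and unchanged ones.
        if old is not None and new != old:
            changes[attr_id] = (old, new)
    return changes


# Raw values taken from the output posted in this thread.
earlier = {5: 20, 196: 48, 197: 20}
# Hypothetical later reading, invented for this example only.
later = {5: 26, 196: 48, 197: 23}

for attr_id, (old, new) in sorted(counter_changes(earlier, later).items()):
    print(f"attribute {attr_id}: {old} -> {new}")
```

If the result stays empty over several weeks, the drive may have settled like the one HDD described above; if the numbers keep climbing, that matches the drives that died quickly.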

#5 2020-06-02 04:07:09

xuanrui
Member
Registered: 2018-09-27
Posts: 52

Re: Strange trouble with eCryptFS-on-btrfs; what could be the cause?

Hmmm, my drive is an SSD though. I know that HDDs tend to randomly die, and I have a few HDDs that died that way, but SSDs seem to only wear out due to "old age" and tend not to suddenly die due to physical setup. I have ordered a replacement drive that will arrive tomorrow, and the plan is to migrate to that drive when it arrives. At least it can't hurt!
