[SOLVED] Uncorrectable btrfs errors on brand new Samsung SSD 990 PRO

mavrix · 2025-03-04 18:09:15

I keep getting uncorrectable errors reported by btrfs scrub within a few days of using a fresh Arch installation on a brand new system with a Samsung NVMe SSD 990 PRO. Very basic setup, no RAID or anything, just a plain btrfs partition sitting directly on the drive. I have now done several installations from scratch, using a thumb drive verified with sha256sum, and erased and recreated the partitions each time (gdisk).

Latest status:

UUID:             fbf9aad1-8c66-4603-8527-239485e14712
Scrub started:    Tue Mar  4 18:18:25 2025
Status:           finished
Duration:         0:00:04
Total to scrub:   22.96GiB
Rate:             5.72GiB/s
Error summary:    csum=32
  Corrected:      0
  Uncorrectable:  32
  Unverified:     0

Smartctl:

smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.13.5-arch1-1] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       Samsung SSD 990 PRO 2TB
Serial Number:                      [***]
Firmware Version:                   4B2QJXD7
PCI Vendor/Subsystem ID:            0x144d
IEEE OUI Identifier:                0x002538
Total NVM Capacity:                 2’000’398’934’016 [2.00 TB]
Unallocated NVM Capacity:           0
Controller ID:                      1
NVMe Version:                       2.0
Number of Namespaces:               1
Namespace 1 Size/Capacity:          2’000’398’934’016 [2.00 TB]
Namespace 1 Utilization:            64’679’645’184 [64.6 GB]
Namespace 1 Formatted LBA Size:     512
Namespace 1 IEEE EUI-64:            002538 4b41a0fc21
Local Time is:                      Tue Mar  4 18:31:55 2025 CET
Firmware Updates (0x16):            3 Slots, no Reset required
Optional Admin Commands (0x0017):   Security Format Frmw_DL Self_Test
Optional NVM Commands (0x0055):     Comp DS_Mngmt Sav/Sel_Feat Timestmp
Log Page Attributes (0x2f):         S/H_per_NS Cmd_Eff_Lg Ext_Get_Lg Telmtry_Lg Log0_FISE_MI
Maximum Data Transfer Size:         512 Pages
Warning  Comp. Temp. Threshold:     82 Celsius
Critical Comp. Temp. Threshold:     85 Celsius

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     9.39W       -        -    0  0  0  0        0       0
 1 +     9.39W       -        -    1  1  1  1        0       0
 2 +     9.39W       -        -    2  2  2  2        0       0
 3 -   0.0400W       -        -    3  3  3  3     4200    6800
 4 -   0.0050W       -        -    4  4  4  4     2700   21800

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        30 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    0%
Data Units Read:                    4’204’600 [2.15 TB]
Data Units Written:                 1’255’254 [642 GB]
Host Read Commands:                 26’865’557
Host Write Commands:                14’360’251
Controller Busy Time:               20
Power Cycles:                       74
Power On Hours:                     43
Unsafe Shutdowns:                   10
Media and Data Integrity Errors:    0
Error Information Log Entries:      2
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               30 Celsius
Temperature Sensor 2:               32 Celsius

Error Information (NVMe Log 0x01, 16 of 64 entries)
No Errors Logged

Read Self-test Log failed: Invalid Field in Command (0x002)

Google produced one similar story here https://www.reddit.com/r/linuxquestions … _new_nvme/

Any ideas?

Last edited by mavrix (2025-03-09 14:37:27)

agapito · 2025-03-05 16:17:23

Faulty RAM.

ua4000 · 2025-03-05 18:05:30

Here is a link describing how it can be checked:
MemTest86+
https://wiki.archlinux.org/title/Stress … MemTest86+

mavrix · 2025-03-05 21:06:56

Oh, right, forgot to mention that. That was one of my guesses as well. MemTest86+ completed just fine.

frostschutz · 2025-03-05 21:16:52

Could be a misbehaving drive. Or a bug in btrfs. Or still a memory error that memtest86+ is bad at detecting. (You could also run memtester in Linux, or just fill a zram device with random data, then see if its checksum remains stable for a few hours).
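
A rough user-space approximation of the checksum idea, as a minimal Python sketch. It fills a buffer in ordinary process memory (not a zram device) with random bytes and re-hashes it at intervals. The function names, buffer size and interval are made up for illustration, and a dedicated tool like memtester will cover far more memory and access patterns than this does.

```python
import hashlib
import os
import time


def fill_buffer(size_bytes: int) -> bytearray:
    """Allocate a buffer in RAM and fill it with random bytes."""
    return bytearray(os.urandom(size_bytes))


def checksum(buf: bytearray) -> str:
    """Return the SHA-256 hex digest of the buffer contents."""
    return hashlib.sha256(buf).hexdigest()


def watch_buffer(buf: bytearray, rounds: int, pause_seconds: float) -> int:
    """Re-hash the buffer `rounds` times, sleeping between rounds.

    Returns how many rounds produced a digest different from the
    one taken at the start. Nothing writes to the buffer in between,
    so any non-zero count means the contents changed on their own.
    """
    reference = checksum(buf)
    mismatches = 0
    for _ in range(rounds):
        time.sleep(pause_seconds)
        if checksum(buf) != reference:
            mismatches += 1
    return mismatches
```

For example, `watch_buffer(fill_buffer(1 << 30), rounds=360, pause_seconds=60)` would hold 1 GiB for roughly six hours. A result of 0 does not prove the RAM is good, it only means this particular buffer stayed stable.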

Otherwise it could be a kernel/btrfs bug, or a more obscure issue, like this https://lore.kernel.org/regressions/210 … dc14f2c540

If you have another drive available, you could create an mdadm raid1 with your SSD on one side and the other drive on the other side. Set the other drive to write-mostly (a raid1 feature that keeps reads off that drive whenever possible, so it is effectively used for writing only and reads are served by the SSD).

Then, if btrfs reports checksum errors, you could check whether the data on the other drive matches. If your SSD returns wrong data of its own accord, it should be correct on the write-mostly mirror.

If it's wrong on both sides, then most likely it was written wrong in the first place...
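
The comparison step could be sketched in Python roughly as follows. It reads the same byte range from two paths and reports the first offset where they differ. This assumes both raid1 members store their data at the same offset (mdadm --examine shows the data offset), that you already know the byte offset of the damaged block on the member devices, and that the reads actually reach the devices rather than the page cache. The function names are hypothetical.

```python
from typing import Optional


def read_range(path: str, offset: int, length: int) -> bytes:
    """Read `length` bytes starting at byte `offset` from a file or device."""
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(length)


def first_difference(path_a: str, path_b: str,
                     offset: int, length: int) -> Optional[int]:
    """Compare the same byte range on two paths.

    Returns the absolute byte offset of the first mismatch,
    or None if both ranges are identical.
    """
    a = read_range(path_a, offset, length)
    b = read_range(path_b, offset, length)
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return offset + i
    # One side was shorter (e.g. end of device): report where it ended.
    if len(a) != len(b):
        return offset + min(len(a), len(b))
    return None
```

Called with the two member devices and the offset of a block btrfs complained about, a return value of None would mean both sides hold the same bytes, which points towards the data having been written wrong in the first place.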

Last edited by frostschutz (2025-03-05 21:22:13)

mavrix · 2025-03-06 20:02:07

Oh, wow. That's exactly my configuration: single SSD in the 1st M.2 socket of an ASRock X600 + Ryzen 8700G.

frostschutz · 2025-03-06 20:23:30

Oof

There seems to be a BIOS update in the works https://bugzilla.kernel.org/show_bug.cgi?id=219609#c109

Not sure if you want to download an unreleased BIOS from an unknown source, though. If there is another slot and it's unaffected, you might just wait it out.

mavrix · 2025-03-09 14:46:36

Not sure yet what the best course of action will be for me; I'm kind of hoping they'll resolve it soon with an official BIOS update. I marked it as solved, though, as the cause is fairly clearly identified. I also have to say, you guys are geniuses: @frostschutz for remembering this issue and recognizing it as possibly related without even knowing the details of the CPU etc., and whoever discovered in the first place that this specific configuration leads to random, sporadic file corruption. Kudos! And thanks for your help, you just saved me a lot of nerves!
