Hi!
I have been using btrfs since 2018; my current version is btrfs-progs 5.14.2-1.
Suddenly, purely by accident, I discovered corruption during a copy:
rsync: [sender] read errors mapping "/data/SDK/x64/setup.zip": Input/output error
I thought: well, that's a surprise!
# btrfs device usage /data
/dev/sdd, ID: 1
Device size: 2.73TiB
Device slack: 0.00B
Data,RAID5/3: 1.06TiB
Metadata,RAID5/3: 3.00GiB
System,RAID5/3: 8.00MiB
Unallocated: 1.67TiB
/dev/sde, ID: 2
Device size: 2.73TiB
Device slack: 0.00B
Data,RAID5/3: 1.06TiB
Metadata,RAID5/3: 3.00GiB
System,RAID5/3: 8.00MiB
Unallocated: 1.67TiB
/dev/sdf, ID: 3
Device size: 2.73TiB
Device slack: 0.00B
Data,RAID5/3: 1.06TiB
Metadata,RAID5/3: 3.00GiB
System,RAID5/3: 8.00MiB
Unallocated: 1.67TiB
I never saw such behaviour with ZFS!
# btrfs device stat /data
[/dev/sdd].write_io_errs 0
[/dev/sdd].read_io_errs 0
[/dev/sdd].flush_io_errs 0
[/dev/sdd].corruption_errs 6
[/dev/sdd].generation_errs 0
[/dev/sde].write_io_errs 0
[/dev/sde].read_io_errs 0
[/dev/sde].flush_io_errs 0
[/dev/sde].corruption_errs 0
[/dev/sde].generation_errs 0
[/dev/sdf].write_io_errs 0
[/dev/sdf].read_io_errs 0
[/dev/sdf].flush_io_errs 0
[/dev/sdf].corruption_errs 22
[/dev/sdf].generation_errs 0
Wow! My data is in danger!
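(Worth noting: these counters are cumulative. They are stored in the filesystem and keep growing on every failed read until they are explicitly reset, so an increase does not necessarily mean fresh damage. They can be zeroed, for example, with:)
# btrfs device stats -z /data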
# btrfs scrub start -Bdr /data
Corrected: 1
Uncorrectable: 1
Unverified: 0
Error summary: no errors found
Corrected: 0
Uncorrectable: 17
Unverified: 1
ERROR: there are uncorrectable errors
Ohh! How does it come up with numbers like that?
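(A note on these numbers: -B keeps the scrub in the foreground, -d prints a separate statistics block for each device, and -r runs the scrub read-only, so nothing is actually repaired on disk. The blocks above are therefore per-device summaries, not one total being recomputed. The per-device view can be re-checked afterwards with, for example:)
# btrfs scrub status -d /data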
# btrfs check --check-data-csum -p /dev/disk/by-label/RAID5
# journalctl --dmesg --grep 'checksum error'
kernel: BTRFS warning (device sdd): checksum error at logical 1081319096320
kernel: BTRFS warning (device sdd): checksum error at logical 1522947457024
# btrfs inspect-internal logical-resolve 1081319096320 /data
# btrfs inspect-internal logical-resolve 1522947457024 /data
/data/VIDEO_TS/VTS_16_1.VO
/data/00000000000000059072.xml
Wow! But these are completely different files!
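(A small helper to map every logged checksum error to a path in one pass, assuming the journal lines keep the "checksum error at logical N" wording shown above:)
# journalctl --dmesg --grep 'checksum error' | grep -oE 'logical [0-9]+' | awk '{print $2}' | sort -u | while read -r addr; do btrfs inspect-internal logical-resolve "$addr" /data; done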
# btrfs rescue chunk-recover /dev/disk/by-label/RAID5
It hangs! A complete freeze! Reboot... It's driving me crazy! SOS!
# smartctl -t long /dev/sdd ; smartctl -t long /dev/sde ; smartctl -t long /dev/sdf
No errors! What?!!
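(Keep in mind that smartctl -t long only asks the drive to start its self-test and returns right away; the verdict has to be read back later, once the test has had time to finish, e.g.:)
# smartctl -l selftest /dev/sdd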
# btrfs restore /dev/disk/by-label/RAID5 /mnt/usbdisk
(very long time here)
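(If only a handful of known files are needed, btrfs restore can be narrowed down with --path-regex; the pattern below is just an illustration of its nested syntax, using the first corrupted file from this thread:)
# btrfs restore -v --path-regex '^/(|SDK(|/x64(|/setup.zip)))$' /dev/disk/by-label/RAID5 /mnt/usbdisk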
# btrfs device stats /data
[/dev/sdd].write_io_errs 0
[/dev/sdd].read_io_errs 0
[/dev/sdd].flush_io_errs 0
[/dev/sdd].corruption_errs 90
[/dev/sdd].generation_errs 0
[/dev/sde].write_io_errs 0
[/dev/sde].read_io_errs 0
[/dev/sde].flush_io_errs 0
[/dev/sde].corruption_errs 165
[/dev/sde].generation_errs 0
[/dev/sdf].write_io_errs 0
[/dev/sdf].read_io_errs 0
[/dev/sdf].flush_io_errs 0
[/dev/sdf].corruption_errs 202
[/dev/sdf].generation_errs 0
The number of errors has increased wildly!
# rsync --dry-run -ri --delete --checksum /data/ /mnt/usbdisk/ 2>&1 | tee rsync.log
# cat rsync.log | grep -v skipping | grep ERROR | nl
7 errors in the end! 7 files lost. 7 log records:
1 ERROR: 1tb/work/2021Q3.txt failed verification -- update discarded
...
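(Another way to cross-check the copy, sketched here with example paths: hash the whole source tree once, then verify the hashes against the USB copy; anything unreadable or mismatched shows up immediately:)
# (cd /data && find . -type f -print0 | xargs -0 sha256sum) > /root/data.sha256
# (cd /mnt/usbdisk && sha256sum -c /root/data.sha256 | grep -v 'OK$')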
As a result I feel like an idiot for using BTRFS.
Where are my 7 files? What can I do to restore the data?
Five working days down the drain - it's hellishly unstable!
Last edited by SunDoctor (2021-10-26 08:47:50)
Offline
Please update your post and use code tags for commands and outputs. Also please update your title: the "really?" phrase adds no value and may cause people to ignore your post.
As for your problem: can you post the output of smartctl? Maybe there are warnings that some folks here can interpret, even if no errors are reported. It could still be a hardware failure.
Where are my 7 files? What can I do to restore the data?
As the famous saying goes: "there are two kinds of people: those that plan to make backups and those that make backups".
Offline
# smartctl -a /dev/sde | more
Model Family: Seagate IronWolf
Device Model: ST3000VN007-2E4166
Serial Number: Z731BBQR
LU WWN Device Id: 5 000c50 0b44ea38f
Firmware Version: SC60
User Capacity: 3000592982016 bytes [3,00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: 5900 rpm
Form Factor: 3.5 inches
Device is: In smartctl database [for details use: -P show]
ATA Version is: ACS-2, ACS-3 T13/2161-D revision 3b
SATA Version is: SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000f 111 099 006 Pre-fail Always - 39938736
4 Start_Stop_Count 0x0032 099 099 020 Old_age Always - 1154
5 Reallocated_Sector_Ct 0x0033 100 100 010 Pre-fail Always - 0
7 Seek_Error_Rate 0x000f 067 060 030 Pre-fail Always - 5290592
9 Power_On_Hours 0x0032 088 088 000 Old_age Always - 10973
12 Power_Cycle_Count 0x0032 099 099 020 Old_age Always - 1154
184 End-to-End_Error 0x0032 100 100 099 Old_age Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
188 Command_Timeout 0x0032 100 100 000 Old_age Always - 0
189 High_Fly_Writes 0x003a 098 098 000 Old_age Always - 2
191 G-Sense_Error_Rate 0x0032 100 100 000 Old_age Always - 0
192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always - 21
193 Load_Cycle_Count 0x0032 100 100 000 Old_age Always - 1154
SMART Error Log Version: 1
No Errors Logged
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Completed without error 00% 10821 -
3 disks... A simple raid5... Where is the data reliability that raid is supposed to provide?
Why does BTRFS work silently - writing and reading without any notification of corruption?
The very first file "/data/SDK/x64/setup.zip" was written a week ago.
Offtopic: I hate Seagate.
Offtopic 2: in a case like this ZFS would have taken the RAID5 down long ago (stopping writes and preserving the data), but not BTRFS.
Last edited by SunDoctor (2021-10-22 16:45:32)
Offline
One important note:
I remounted the btrfs "/data" filesystem read-only straight after the scrub and started the recovery.
# mount -o recovery,ro,nospace_cache,clear_cache /dev/disk/by-label/RAID5 /data
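(Side note: on recent kernels the "recovery" mount option was renamed to "usebackuproot"; the equivalent there should be something like:)
# mount -o ro,usebackuproot /dev/disk/by-label/RAID5 /data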
But after the rsync/copy check the error counters had increased dramatically.
Besides, "btrfs restore" and "rsync" give me different results - rsync found many files that were not restored to "/mnt/usbdisk".
I think btrfs is distorting the real situation, incorrectly overwriting data with already-destroyed checksums, and it is not clear what it is actually reporting in my case.
I also made a mistake with the metadata: putting the metadata on raid5 was not optimal:
Metadata,RAID5/3: 3.00GiB
System,RAID5/3: 8.00MiB
But this should not affect the integrity of the raid5
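(For what it's worth, the usual advice is to keep only data on raid5 and put metadata/system chunks on raid1. On a healthy filesystem the conversion would be a balance along these lines; as far as I remember, -f is needed because balance refuses to operate on system chunks without --force:)
# btrfs balance start -mconvert=raid1 -sconvert=raid1 -f /data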
Last edited by SunDoctor (2021-10-22 17:20:46)
Offline
The official info (probably current only up to v4.12 - 4.14):
https://btrfs.wiki.kernel.org/index.php/Status
Currently the raid5 and raid6 profiles have flaws that make them strongly not recommended, as per the Status page.
In less recent releases the parity of resynchronized blocks was not calculated correctly, this has been fixed in recent releases (TBD).
If a crash happens while a raid5/raid6 volume is being written this can result in a "transid" mismatch as in transid verify failed.
The resulting corruption cannot be currently fixed. Some fixes went to 4.12, namely scrub and auto-repair fixes. Feature marked as mostly OK for now.
Further fixes to raid56 related code are applied each release. The write hole is the last missing part, preliminary patches have been posted but needed to be reworked. The parity not checksummed note has been removed.
You should not ship a product that loses data! Features like this should not leave the testing and development branches!
Last edited by SunDoctor (2021-10-26 08:44:44)
Offline
My last resort:
# btrfs check --repair -p /dev/disk/by-label/RAID5
Result:
Checking filesystem on /dev/disk/by-label/RAID5
UUID: 6de9f5ee-1888-43d9-9ca7-60bb532170a2
[1/7] checking root items (0:00:50 elapsed, 2651972 items checked)
Fixed 0 roots.
No device size related problem found (0:01:35 elapsed, 333914 items checked)
[2/7] checking extents (0:01:35 elapsed, 333914 items checked)
cache and super generation don't match, space cache will be invalidated
[3/7] checking free space cache (0:00:00 elapsed)
[4/7] checking fs roots (0:07:59 elapsed, 158652 items checked)
[5/7] checking csums (without verifying data) (0:01:02 elapsed, 2711727 items checked)
[6/7] checking root refs (0:00:00 elapsed, 3 items checked)
[7/7] checking quota groups skipped (not enabled on this FS)
found 2323713069056 bytes used, no error found
total csum bytes: 2263352004
total tree bytes: 5470715904
total fs tree bytes: 2607480832
total extent tree bytes: 228573184
btree space waste bytes: 735118001
file data blocks allocated: 51597669175296
referenced 2317529624576
Am I blind???
found 2323713069056 bytes used, no error found
# cat /data/work/2021Q3.txt > /dev/null
cat: /data/work/2021Q3.txt: Input/output error
BTRFS is completely screwed up!
# btrfs device stat /data/
[/dev/sdd].write_io_errs 0
[/dev/sdd].read_io_errs 0
[/dev/sdd].flush_io_errs 0
[/dev/sdd].corruption_errs 7
[/dev/sdd].generation_errs 0
[/dev/sde].write_io_errs 0
[/dev/sde].read_io_errs 0
[/dev/sde].flush_io_errs 0
[/dev/sde].corruption_errs 6
[/dev/sde].generation_errs 0
[/dev/sdf].write_io_errs 0
[/dev/sdf].read_io_errs 0
[/dev/sdf].flush_io_errs 0
[/dev/sdf].corruption_errs 51
[/dev/sdf].generation_errs 0
It's hell...
Last edited by SunDoctor (2021-10-22 18:43:11)
Offline
For two days now I have been watching the output of
btrfs check --init-csum-tree /dev/disk/by-label/RAID5
Something is happening:
owner ref check failed [1955053568 16384]
repair deleting extent record: key [1955053568,169,0]
Repaired extent references for 1955053568
ref mismatch on [1955069952 16384] extent item 1, found 0
backref 1955069952 root 7 not referenced back 0x5575aa7e0c60
incorrect global backref count on 1955069952 found 1 wanted 0
backpointer mismatch on [1955069952 16384]
owner ref check failed [1955069952 16384]
repair deleting extent record: key [1955069952,169,0]
But to be honest, the entire disk was copied in a day. Why this has been grinding along for two days is a mystery...
In my soul there are only the worst words for my btrfs...
Last edited by SunDoctor (2021-10-25 14:41:53)
Offline
Some important analysis of btrfs from around the web:
https://lore.kernel.org/linux-btrfs/202 … ycats.org/
https://arstechnica.com/gadgets/2021/09 … ilesystem/
In my case the array was not degraded and it continued to operate read-write with an ever-increasing number of errors.
I keep watching it, trying to rectify the situation with btrfs's own tools, without resorting to reformatting.
Although I am sure that reformatting the disks and writing the rescued data back in place would be much faster.
Last edited by SunDoctor (2021-10-26 08:42:09)
Offline
Almost 3 days wasted:
......
backref 2322541641728 root 7 not referenced back 0x5575fb840450
incorrect global backref count on 2322541641728 found 1 wanted 0
backpointer mismatch on [2322541641728 16384]
owner ref check failed [2322541641728 16384]
repair deleting extent record: key [2322541641728,169,0]
Repaired extent references for 2322541641728
super bytes used 2322811310080 mismatches actual used 2325452951552
super bytes used 2322811244544 mismatches actual used 2322811277312
super bytes used 2322811228160 mismatches actual used 2322811244544
No device size related problem found
[3/7] checking free space cache
[4/7] checking fs roots
[5/7] checking only csums items (without verifying data)
[6/7] checking root refs
[7/7] checking quota groups skipped (not enabled on this FS)
found 9293886717952 bytes used, no error found
total csum bytes: 9053408016
total tree bytes: 20563165184
total fs tree bytes: 10429923328
total extent tree bytes: 923009024
btree space waste bytes: 1883349854
file data blocks allocated: 206388397096960
referenced 9267838894080
It seems like the errors were corrected. But no.
# btrfs device stat /data
[/dev/sdd].write_io_errs 0
[/dev/sdd].read_io_errs 0
[/dev/sdd].flush_io_errs 0
[/dev/sdd].corruption_errs 9
[/dev/sdd].generation_errs 0
[/dev/sde].write_io_errs 0
[/dev/sde].read_io_errs 0
[/dev/sde].flush_io_errs 0
[/dev/sde].corruption_errs 6
[/dev/sde].generation_errs 0
[/dev/sdf].write_io_errs 0
[/dev/sdf].read_io_errs 0
[/dev/sdf].flush_io_errs 0
[/dev/sdf].corruption_errs 47
[/dev/sdf].generation_errs 0
Nothing has changed since the very beginning.
Apparently I will be forced to reformat the disks and recreate the array from scratch on the old drives.
Offline
I am happy to report: the root cause of this entire wild situation has been found!
The reason for everything: a memory fault (memtest86, tests 7 and 8) at high addresses.
So there was a failure in the fourth memory bank, in the upper memory addresses. It did not show up at all while at least a little memory was free. But when I filled up all the memory with virtual machines, then, by chance and only occasionally, there were rare write errors that btrfs swallowed without hesitation.
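(For anyone who wants to run a similar check without rebooting into memtest86: a userspace pass with memtester may already catch this kind of fault, although it cannot test memory reserved by the kernel. The size and loop count below are only examples:)
# memtester 2G 3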
This corruption was caught on a new blank disk with btrfs just at the moment when I once again filled all the memory with applications, wrote one huge file (50 GB) and read it back (and received an "Input/output error").
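(A rough sketch of that write-then-verify test, with a hypothetical file name and a smaller example size:)
dd if=/dev/urandom of=/data/bigtest bs=1M count=20480 status=progress
sha256sum /data/bigtest > /root/bigtest.sum   # hash right after writing (still mostly in page cache)
echo 3 > /proc/sys/vm/drop_caches             # make the next read come from the disks
sha256sum -c /root/bigtest.sum                # re-read from disk; FAILED or Input/output error here means trouble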
I note that btrfs seriously confused me with its "indicators" and, in general, added fuel to the fire: its behaviour was not ideal. But all the data is intact at this point, and now, finally, I'm starting to put it back in place!
I spent two weeks of my life studying the behaviour of btrfs, but now I am even proud that I know it like my own girlfriend.
Finally, if you are really interested in how professionals view BTRFS, translate (from Russian to English) and read the latest article by Eduard Shishkin (ReiserFS):
https://habr.com/ru/post/559014/
The worst thing is that an unprivileged user can almost always deprive everyone of free space in a fairly short time, bypassing any disk quotas. It looks like this (tested on Linux kernel 5.12). On a freshly installed system, a script is launched that, in a loop, creates files with certain names in the home directory, writes data to them at certain offsets, and then deletes these files. After a minute of this script, nothing unusual happens. After five minutes, the share of occupied space on the partition increases slightly. After two to three hours it reaches 50% (with an initial value of 15%). And after five or six hours of work the script fails with the error "there is no free space on the partition".
...
How do I position Btrfs for myself? As something that categorically cannot be called a file system, let alone used. For, by definition, a file system is the OS subsystem responsible for efficient management of the "disk space" resource, and that is exactly what we do not see in the case of Btrfs.
...
For them (RHEL), Btrfs only ever had "technology preview" status, and this very "preview" did not pass. They cannot ship a flawed-by-design product with full support... RHEL is an enterprise product, that is, a contractual commodity-money relationship, and Red Hat cannot mock its users the way it does on the Btrfs mailing list.
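(To make the space-exhaustion scenario quoted above more concrete, the loop Shishkin describes would look roughly like this; the names, sizes and offsets are purely illustrative:)
while true; do
    for i in $(seq 1 100); do
        f="$HOME/frag_$i"
        # write a single 4K block at a random offset, then delete the file
        dd if=/dev/zero of="$f" bs=4K count=1 seek=$((RANDOM % 8192)) 2>/dev/null
        rm -f "$f"
    done
done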
Last edited by SunDoctor (2021-10-29 08:04:18)
Offline