[SOLVED] cp+rsync to /dev/md3 results in wrong md5sum, for huge file

newsboost · 2020-12-11 06:59:38

Hi,

I have a weird problem I've never run into before, so I just wanted to hear if anyone has seen it and has any experience with it: I have a 3.5" SATA hard disk drive connected via USB3 to my PC, using something that is pretty close to this device: https://www.thermaltakeusa.com/thermalt … ation.html - the volume was formatted inside my Synology NAS, so I have it mounted on an Arch Linux PC using "mount /dev/md3 /mnt/mountpoint" and the filesystem is ext4. Some details (not sure if they matter, but now you have them):

# mdadm /dev/md3
/dev/md3: 7447.44GiB raid1 1 devices, 0 spares. Use mdadm --detail for more detail.
# mdadm --detail /dev/md3
/dev/md3:
           Version : 1.2
     Creation Time : Mon Jul  4 20:58:07 2016
        Raid Level : raid1
        Array Size : 7809204544 (7447.44 GiB 7996.63 GB)
     Used Dev Size : 7809204544 (7447.44 GiB 7996.63 GB)
      Raid Devices : 1
     Total Devices : 1
       Persistence : Superblock is persistent

       Update Time : Fri Dec 11 06:19:58 2020
             State : clean 
    Active Devices : 1
   Working Devices : 1
    Failed Devices : 0
     Spare Devices : 0

Consistency Policy : resync

              Name : Syn:3
              UUID : 14b58f99:13b5e0fa:4455458a:9041eb50
            Events : 1132235

    Number   Major   Minor   RaidDevice State
       0       8       51        0      active sync   /dev/sdd3

I'm transferring a 3 TB file, and the problem is that md5sum shows the copied file differs from the source file, e.g. the source file's md5sum is 9ac057361d8c725dfef90df4d37b7fb7 and the destination's is 8b3e2da08e3128e24d068827ef0e7b15. This is really bad. I think I had the same problem using rsync a few weeks ago (this time I used cp), but at that time I didn't think it would happen again. Could it be that the https://www.thermaltakeusa.com/thermalt … ation.html just doesn't work? I'm currently doing a:

badblocks -b 4096 -v /dev/md3 > badblocks_b4096_v_md3.txt

But other than that I have no idea about the cause of this problem, I've never seen it before. I found a single Google hit: https://serverfault.com/questions/33094 … -sometimes about a guy having the same problem, but I don't think he found the cause or a solution either, so I hope some of you have good ideas or relevant experience to share? The disk drive is around 8 TB, only 2 years old and hasn't been used much (only for offline backups). I now suspect the HDD docking station is the problem, and I might want to repeat the experiment later. The obvious things to try after the "badblocks" command finishes are to run memtest for maybe 24 hours, and to turn off the PC and repeat the copy using a SATA cable inside the computer - but the whole idea of a USB3 docking station is to avoid having to open and close the computer case... I never thought a docking station could be so bad - if that is the problem... Has anyone tried something similar?

Last edited by newsboost (2020-12-16 00:07:17)

seth · 2020-12-11 07:36:48

root@proxmox:/hugeZFS#

So the target is ext4, but what's the source and is it what that path suggests?

frostschutz · 2020-12-11 08:40:05

Is the file size identical?

You could compare files with 'cmp' instead of 'md5sum', that would tell you exactly which byte differs.
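If 'cmp' is not at hand, the same idea can be sketched in a few lines of Python. This is only a rough illustration of a chunk-wise comparison, not what 'cmp' does internally; the function name and the 1 MiB chunk size are my own assumptions.

```python
def first_difference(path_a, path_b, chunk_size=1 << 20):
    """Return the 0-based offset of the first differing byte, or None if identical.

    Hypothetical helper: reads both files in chunks so a multi-TB file
    never has to fit in memory.
    """
    offset = 0
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            a = fa.read(chunk_size)
            b = fb.read(chunk_size)
            if a != b:
                # Locate the exact byte inside the differing chunk.
                for i, (x, y) in enumerate(zip(a, b)):
                    if x != y:
                        return offset + i
                # One file is a prefix of the other: they differ where the shorter one ends.
                return offset + min(len(a), len(b))
            if not a:
                return None  # both files ended without any difference
            offset += len(a)
```

Note that this reports a 0-based offset, while 'cmp' prints 1-based byte numbers.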

newsboost · 2020-12-11 09:14:47

seth wrote:

root@proxmox:/hugeZFS#
So the target is ext4, but what's the source and is it what that path suggests?

Right, the source FS is also ext4. The path from which the command was issued is not really relevant, sorry for the confusion; I've removed it from the post to avoid further confusion. So the source is also ext4, but on a different physical disk.

frostschutz wrote:

Is the file size identical?
You could compare files with 'cmp' instead of 'md5sum', that would tell you exactly which byte differs.

Yes, I should've written that the file size appears to be the same. "cmp" is an excellent suggestion; I assume it works fine with binary files too. The "badblocks" command is running now, and I expect it to run for around 12 more hours. After that, I'll see if "cmp" can perhaps give me the offset and tell me "how bad" the copy operation went. Excellent idea, thanks! I'll update later when I've got some more info...

newsboost · 2020-12-13 04:22:45

UPDATE:

The file sizes are the same. The "badblocks" command showed no errors; it indicated all was fine. After around 10 minutes, "cmp -l srcfile dstfile" had created a 22 MB file, whose contents look something like:

   7837057025 365  32
   7837057026 117 131
   7837057027 147   4
   7837057028 244 125
   7837057029 333 207
   7837057030 156 334
   7837057031 302 372
   7837057032 367 372
   7837057033 323 144
   7837057034 162  26
   7837057035  40 353
...
...
etc etc etc etc - page up and down (don't see any reason to show more of this)
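A 22 MB listing like this is easier to read once it is collapsed into runs of consecutive byte numbers. Below is a minimal sketch, assuming the usual "cmp -l" layout (1-based decimal byte number, then the two differing byte values in octal); the function name is made up for illustration.

```python
def diff_runs(lines):
    """Group `cmp -l` lines into (first_byte, last_byte) runs of consecutive byte numbers."""
    runs = []
    for line in lines:
        fields = line.split()
        if len(fields) != 3 or not fields[0].isdigit():
            continue  # skip blank lines or anything that is not cmp -l output
        pos = int(fields[0])  # 1-based byte number, decimal
        if runs and pos == runs[-1][1] + 1:
            runs[-1][1] = pos  # extends the current run
        else:
            runs.append([pos, pos])  # starts a new run
    return [tuple(r) for r in runs]


# The lines quoted in the post above.
sample = """
   7837057025 365  32
   7837057026 117 131
   7837057027 147   4
   7837057028 244 125
   7837057029 333 207
   7837057030 156 334
   7837057031 302 372
   7837057032 367 372
   7837057033 323 144
   7837057034 162  26
   7837057035  40 353
"""

for first, last in diff_runs(sample.splitlines()):
    aligned = (first - 1) % 4096 == 0  # is the 0-based start offset a multiple of 4 KiB?
    print(f"bytes {first}-{last}: {last - first + 1} differing, 4 KiB aligned start: {aligned}")
```

For the lines quoted above, the run starts at byte 7837057025, i.e. 0-based offset 7837057024, which is a multiple of 4096. That would be consistent with whole blocks going missing rather than single bits flipping, but a few lines of output are not enough to conclude that.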

I think I used rsync the first time. Next I repeated the copy, this time using "cp". Still, the same problem happened - a different md5sum and different "cmp -l" output. I've never experienced weird things before, so I don't expect the RAM (or a RAM test) to show anything. However, I came up with the excellent (if I may say so) idea of running "smartctl -a /dev/sdd", which revealed:

=== START OF INFORMATION SECTION ===                     
Model Family:     Western Digital Red                     
Device Model:     WDC WD80EFZX-68UW8N0                    
Serial Number:    VKJNEDUX
LU WWN Device Id: 5 000cca 254e578c8                                                                                                        
Firmware Version: 83.H0A83                                                                                                                  
User Capacity:    8,001,563,222,016 bytes [8.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database [for details use: -P show]                                                                           
ATA Version is:   ACS-2, ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)     
Local Time is:    Sun Dec 13 03:49:59 2020 UTC                   
SMART support is: Available - device has SMART capability.       
SMART support is: Enabled                                 
                                                         
=== START OF READ SMART DATA SECTION ===                  
SMART overall-health self-assessment test result: PASSED  
                                                          
General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity                                                                    
                                        was completed without error.                                                                        
                                        Auto Offline Data Collection: Enabled.                                                              
Self-test execution status:      (   0) The previous self-test routine completed                                                            
                                        without error or no self-test has ever                                                              
                                        been run.
Total time to complete Offline                                                                                                              
data collection:                (  101) seconds.
Offline data collection                                      
capabilities:                    (0x5b) SMART execute Offline immediate.                                                                    
                                        Auto Offline data collection on/off support.                                                        
                                        Suspend Offline collection upon new                                                                 
                                        command.          
                                        Offline surface scan supported.                                                                     
                                        Self-test supported.
                                        No Conveyance Self-test supported.                                                                  
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering                                                                    
                                        power-saving mode.                                                                                  
                                        Supports SMART auto save timer.                                                                     
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.                                                                  
Short self-test routine
recommended polling time:        (   2) minutes.                                                                                            
Extended self-test routine
recommended polling time:        (1083) minutes.             
SCT capabilities:              (0x003d) SCT Status supported.    
                                        SCT Error Recovery Control supported.                                                               
                                        SCT Feature Control supported.
                                        SCT Data Table supported.
                                                          
SMART Attributes Data Structure revision number: 16      
Vendor Specific SMART Attributes with Thresholds:         
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE                                            
  1 Raw_Read_Error_Rate     0x000b   100   100   016    Pre-fail  Always       -       0                                                    
  2 Throughput_Performance  0x0005   129   129   054    Pre-fail  Offline      -       124                                                  
  3 Spin_Up_Time            0x0007   146   146   024    Pre-fail  Always       -       453 (Average 450)                                    
  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       86                                                   
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0                                                    
  7 Seek_Error_Rate         0x000b   100   100   067    Pre-fail  Always       -       0                                                    
  8 Seek_Time_Performance   0x0005   128   128   020    Pre-fail  Offline      -       18                                                   
  9 Power_On_Hours          0x0012   100   100   000    Old_age   Always       -       4124                                                 
 10 Spin_Retry_Count        0x0013   100   100   060    Pre-fail  Always       -       0                                                    
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       86                                                   
 22 Helium_Level            0x0023   100   100   025    Pre-fail  Always       -       100                                                  
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       841                                                  
193 Load_Cycle_Count        0x0012   100   100   000    Old_age   Always       -       841                                                  
194 Temperature_Celsius     0x0002   139   139   000    Old_age   Always       -       43 (Min/Max 23/51)                                   
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0                                                    
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0                                                    
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0                                                    
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       2539564                                              
                                                                                                                                            
SMART Error Log Version: 1                                                                                                                  
ATA Error Count: 65535 (device log contains only the most recent five errors)                                                               
        CR = Command Register [HEX]                                                                                                         
        FR = Features Register [HEX]                                                                                                        
        SC = Sector Count Register [HEX]                                                                                                    
        SN = Sector Number Register [HEX]
        CL = Cylinder Low Register [HEX]                      
        CH = Cylinder High Register [HEX]   
        DH = Device/Head Register [HEX]
        DC = Device Command Register [HEX]
        ER = Error register [HEX]   
        ST = Status register [HEX]  
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.             
                                                                                                                                            
Error 65535 occurred at disk power-on lifetime: 4118 hours (171 days + 14 hours)                                                            
  When the command that caused the error occurred, the device was active or idle.                                                           

  After command completion occurred, registers were:                                                                                        
  ER ST SC SN CL CH DH                                                                                                                      
  -- -- -- -- -- -- --
  84 51 a0 a7 70 65 40  Error: ICRC, ABRT 160 sectors at LBA = 0x006570a7 = 6647975
                                                                 
  Commands leading to the command that caused the error were:                                                                               
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name                                                                           
  -- -- -- -- -- -- -- --  ----------------  --------------------
  35 03 00 48 70 65 40 00   5d+05:08:38.816  WRITE DMA EXT
  35 03 00 48 70 65 40 00   5d+05:08:38.796  WRITE DMA EXT
  25 03 08 60 08 10 40 00   5d+05:08:38.796  READ DMA EXT 
  35 03 00 48 68 65 40 00   5d+05:08:38.792  WRITE DMA EXT
  35 03 00 48 68 65 40 00   5d+05:08:38.772  WRITE DMA EXT                                                                                  
                                                                                                                                            
Error 65534 occurred at disk power-on lifetime: 4118 hours (171 days + 14 hours)
  When the command that caused the error occurred, the device was active or idle.
                          
  After command completion occurred, registers were:
  ER ST SC SN CL CH DH                                                                                                                      
  -- -- -- -- -- -- --                                 
  84 51 20 27 78 65 40  Error: ICRC, ABRT 32 sectors at LBA = 0x00657827 = 6649895
                                                                 
  Commands leading to the command that caused the error were:    
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  35 03 00 48 70 65 40 00   5d+05:08:38.800  WRITE DMA EXT
  25 03 08 60 08 10 40 00   5d+05:08:38.796  READ DMA EXT 
  35 03 00 48 68 65 40 00   5d+05:08:38.792  WRITE DMA EXT
  35 03 00 48 68 65 40 00   5d+05:08:38.772  WRITE DMA EXT
  35 03 00 48 68 65 40 00   5d+05:08:38.747  WRITE DMA EXT                                                                                  
                                                                                                                                            
Error 65533 occurred at disk power-on lifetime: 4118 hours (171 days + 14 hours)                                                            
  When the command that caused the error occurred, the device was active or idle.                                                           
                                                                                                                                            
  After command completion occurred, registers were:
  ER ST SC SN CL CH DH                                                                                                                      
  -- -- -- -- -- -- --                          
  84 51 80 c7 69 65 40  Error: ICRC, ABRT 128 sectors at LBA = 0x006569c7 = 6646215
                                                                                                                                            
  Commands leading to the command that caused the error were:                                                                               
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name                                                                           
  -- -- -- -- -- -- -- --  ----------------  --------------------
  35 03 00 48 68 65 40 00   5d+05:08:38.773  WRITE DMA EXT                                                                                  
  35 03 00 48 68 65 40 00   5d+05:08:38.747  WRITE DMA EXT  
  25 03 08 58 08 10 40 00   5d+05:08:38.747  READ DMA EXT                                                                                   
  35 03 48 00 68 65 40 00   5d+05:08:38.745  WRITE DMA EXT            
  25 03 08 50 08 10 40 00   5d+05:08:38.745  READ DMA EXT                                                                                   
                                                                                                                                            
Error 65532 occurred at disk power-on lifetime: 4118 hours (171 days + 14 hours)                                                            
  When the command that caused the error occurred, the device was active or idle.
                                                                                                                                            
  After command completion occurred, registers were:
  ER ST SC SN CL CH DH                                                                                                                      
  -- -- -- -- -- -- --    
  84 51 b0 97 69 65 40  Error: ICRC, ABRT 176 sectors at LBA = 0x00656997 = 6646167
                                                                 
  Commands leading to the command that caused the error were:                                                                               
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name     
  -- -- -- -- -- -- -- --  ----------------  --------------------
  35 03 00 48 68 65 40 00   5d+05:08:38.748  WRITE DMA EXT
  25 03 08 58 08 10 40 00   5d+05:08:38.747  READ DMA EXT
  35 03 48 00 68 65 40 00   5d+05:08:38.745  WRITE DMA EXT
  25 03 08 50 08 10 40 00   5d+05:08:38.745  READ DMA EXT                                                                                   
  35 03 00 00 60 65 40 00   5d+05:08:38.741  WRITE DMA EXT                                                                                  
                                                                                                                                            
Error 65531 occurred at disk power-on lifetime: 4118 hours (171 days + 14 hours)                                                            
  When the command that caused the error occurred, the device was active or idle.                                                           
                                                                                                                                            
  After command completion occurred, registers were:                                                                                        
  ER ST SC SN CL CH DH                                                                                                                      
  -- -- -- -- -- -- --                                                                                                                      
  84 51 40 bf 54 65 40  Error: ICRC, ABRT 64 sectors at LBA = 0x006554bf = 6640831                                                          
                                                                                                                                            
  Commands leading to the command that caused the error were:                                                                               
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name                                                                           
  -- -- -- -- -- -- -- --  ----------------  --------------------                                                                           
  35 03 00 00 50 65 40 00   5d+05:08:38.715  WRITE DMA EXT                                                                                  
  25 03 08 38 08 10 40 00   5d+05:08:38.711  READ DMA EXT                                                                                   
  35 03 00 00 48 65 40 00   5d+05:08:38.707  WRITE DMA EXT                                                                                  
  25 03 08 30 08 10 40 00   5d+05:08:38.707  READ DMA EXT                                                                                   
  35 03 08 f8 44 65 40 00   5d+05:08:38.704  WRITE DMA EXT                                                                                  
                                                                                                                                            
SMART Self-test log structure revision number 1                                                                                             
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error                                             
# 1  Short offline       Completed without error       00%      3261         -                                                              
# 2  Short offline       Completed without error       00%      2538         -                                                              
# 3  Short offline       Completed without error       00%      1995         -                                                              
# 4  Short offline       Completed without error       00%      1604         -
# 5  Short offline       Completed without error       00%       814         -
                                            
SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing                     
    4        0        0  Not_testing                 
    5        0        0  Not_testing                                
Selective self-test flags (0x0):                                                                                                            
  After scanning selected spans, do NOT read-scan remainder of disk.                                                                        
If Selective self-test is pending on power-up, resume after 0 minute delay.

I think the line "Error 65535 occurred at disk power-on lifetime" tells me something is wrong. It's a WD Red hard disk, so it should be able to stay powered on 24/7/365. I googled a bit and it seems I've got some of these errors: "const char *icrc = "ICRC"; // INTERFACE CRC ERROR"? From Google:

One important error is ICRC – the interface CRC error. This means that there are errors being detected on the IDE/SATA or PCIe bus the hard drive is connected to. Although this is rare and might be caused by the HDD itself, it might mean that your chipset (the hardware controlling e.g. SATA) is damaged – in this case, replacing the hard drive would not fix the issue. Possibly there is also an intermittent cable connection.
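To make the register dump in the smartctl error log a bit less cryptic, here is a small sketch that decodes the ER byte and reassembles the LBA from the SN/CL/CH/DH registers. The bit table is my understanding of the ATA error register (it matches smartctl's own "ICRC, ABRT" decoding of 0x84) and only lists a few bits; treat it, and the function names, as assumptions rather than a reference.

```python
# A few bits of the ATA error register (assumed from the ATA spec, not exhaustive).
ERROR_BITS = {
    0x80: "ICRC",  # interface CRC error
    0x40: "UNC",   # uncorrectable data error
    0x10: "IDNF",  # sector ID not found
    0x04: "ABRT",  # command aborted
}


def decode_error(er):
    """Return the names of the known bits that are set in the error register."""
    return [name for bit, name in ERROR_BITS.items() if er & bit]


def lba28(sn, cl, ch, dh):
    """Assemble a 28-bit LBA from SN, CL, CH and the low nibble of DH."""
    return ((dh & 0x0F) << 24) | (ch << 16) | (cl << 8) | sn


# First logged error: ER=84, SN=a7, CL=70, CH=65, DH=40
print(decode_error(0x84), lba28(0xA7, 0x70, 0x65, 0x40))
```

For the first logged error this reproduces what smartctl already printed: ICRC plus ABRT at LBA 0x006570a7 = 6647975. In other words, the drive detected a CRC mismatch on the link and aborted the command.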

The command "smartctl -H /dev/sdd" says: "SMART overall-health self-assessment test result: PASSED". Googling even more led me to https://bbs.archlinux.org/viewtopic.php?id=246686 which says (quoting from the last post/the conclusion):

Amongst all the different things I tried, moving the two HDDs sdc and sdd from USB to SATA fixed the issue.
I guess the USB protocol does not handle well some kind of file transfert (I guess it's very frequent tiny modifications). Even though it's USB3.

I don't have any spare SATA cables inside the PC, but I guess the next step is to (temporarily) remove an SSD I have, then repeat everything and see if the problem persists, just to get to the bottom of this... Unbelievable! I'm also using USB3 - but if the USB3 docking station cannot transfer files reliably, that's completely stupid!?! I've never heard of such a problem before, but it would mean that these hard disk docking stations that transfer via USB cable cannot be used for anything serious?!? If this is really the problem (which I suspect, as the disk hasn't been used that much), how can manufacturers put crap like that (Thermaltake BlacX) on the market??? Does any of you know anything about this hypothesis and can confirm that it's true (or not), or is there some other Linux command I could use to get more insight? Or maybe you understand the smartctl output better than I do and can share your thoughts on it? I'd appreciate it, thanks!

Last edited by newsboost (2020-12-13 04:24:16)

frostschutz · 2020-12-13 10:06:45

Wow, that's a crazy high udma crc error count.

UDMA CRC is a 16-bit checksum, so there should only be a 1:65536 chance for bad data to get through. For the occasional random blip on the SATA line, this is usually good enough.

And if the cable is so bad that you get tens of thousands of these errors, transfer speed would effectively drop to zero. You should notice.
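As a back-of-the-envelope illustration of the 1:65536 figure: assuming (and this is a big assumption) that every corrupted transfer has an independent 2^-16 chance of slipping past the CRC, the counter value from the smartctl output (2539564) gives a rough idea of how many bad transfers could have gone undetected over the drive's lifetime.

```python
CRC_BITS = 16
udma_crc_errors = 2_539_564  # raw value of attribute 199 in the smartctl output

# If N transfers were corrupted and each passes the CRC with probability p,
# then detected = N * (1 - p) and undetected = N * p = detected * p / (1 - p).
p_pass = 1 / 2**CRC_BITS
expected_undetected = udma_crc_errors * p_pass / (1 - p_pass)

print(f"expected undetected corrupted transfers: {expected_undetected:.1f}")
```

Under that simplified model the answer is a few dozen transfers, which is small but not zero. Real link errors are not uniformly random, so this is only an order-of-magnitude sketch.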

Most likely the drive aborted these commands and never wrote the data it was supposed to write, so what is returned on read is old data that was there before.

The drive only logs the most recent errors. If you have journalctl/syslog output from the time of the copy operation, you could check that.

Until you determine the cause you can not trust the drive, sata cable, sata port/controller, usb enclosure (if applicable).

Also run a long selftest (smartctl -t long), as well as a memtest on the PC itself.

Good luck,

frostschutz · 2020-12-13 10:17:26

The syslog would be interesting in particular since you have that all running on top of a single-disk RAID 1. Normally when a drive fails, it would be kicked from RAID. It could be there is a problem with that when it's the only remaining drive in a RAID. It would be unfortunate if that in turn caused errors to not be reported to the userland copy utility.

I use single disk RAID myself, on my SSD, to allow for on-the-fly mirroring. It hasn't failed me yet, but this setup is after all a bit unusual, so I can't rule it out either. I'll have to do some testing, I guess...

seth · 2020-12-13 10:23:58

https://www.amazon.de/Thermaltake-SATA- … geNumber=1
Seems a pattern…

newsboost · 2020-12-14 02:06:07

frostschutz wrote:

Wow, that's a crazy high udma crc error count.
UDMA CRC is a 16 bit hash, so there should only be 1:65536 chance for bad data to go through. For the occasional random blip on the sata line this should be good enough.
Usually this should be good enough. And if the cable is so bad you get tens of thousands of these errors, transfer speed would effectively drop to zero. You should notice.

I'm transferring a 3 TB file, so in any case it'll take anywhere from half a day to several days; the PC is maybe 3-4 years old. It's the first time I've experienced something like this, where everything apparently seems OK. But since the file I was transferring seemed to have become corrupt around 2 weeks ago, I was suspicious this time, and the md5sum (+ cmp command) showed that the data written wasn't the same as the data read... No obvious errors during the copy - shit... What a piece of shit, that USB docking station... OMFG, it should be forbidden to sell such crap on the market, where people think they can trust the equipment... Damn, it'll be thrown out with the garbage very soon... By the way, I didn't notice the transfer speed dropping to zero, but I'm not watching the speed from beginning to end of that 3 TB copy operation... For some reason the speed with both cp and rsync varies a lot, from around 5-10 MB/s to around 140-150 MB/s.

I can add that yesterday I replaced a 120 GB SSD with this 8 TB disk, which was previously connected via the USB docking station - so now it's connected via SATA cables, internally in the machine (unfortunately I cannot use the SSD now as there aren't any more cables, but at least this seems better). When I started a new copy of the file yesterday (using both cp + rsync), it started out at around 5 MB/s and stayed like that for at least 10-20 minutes; then I stopped watching and went to bed (maybe it was doing some auto-correction due to the many write errors there must have been, I don't know). When I woke up, it was more like 90-140 MB/s, and the speed seemed to stay like that until it finished. I think the 110-130 MB/s it settled at for a long time is rather "normal" for a mechanical HDD (not an SSD unfortunately, because mechanical drives are much cheaper for the space I need).

By the way, do you know if I can reset that high UDMA CRC error value, now that I'm throwing out that ridiculous USB docking station?

frostschutz wrote:

Most likely the drive aborted these commands and never wrote the data it was supposed to write, so what is returned on read is old data that was there before.
The drive only logs the most recent errors, if you have journalctl syslogs of the time you did the copy operation, you could check those.
Until you determine the cause you can not trust the drive, sata cable, sata port/controller, usb enclosure (if applicable).
Also run a long selftest (smartctl -t long), as well as a memtest on the PC itself.
Good luck,

1) I think it just wrote random data from time to time - no error messages, no warnings... Holy shit, I cannot believe companies put crap like that on the market. Note to self: NEVER EVER BUY ANYTHING FROM THERMALTAKE AGAIN... Even for consumer products, this is completely unacceptable. I wouldn't pay a cent for that piece of shit if I had known how bad it is when you really need it.

2) About syslog: Sorry, I switched to using the internal SATA cable (actually took it from an SSD drive which used the SATA cable from the cdrom drive, which therefore doesn't work right now). After I rebooted, it seems I can only see the messages from the current boot (also tried to look in /var/log, couldn't see anything useful):

# journalctl --list-boots
 0 b14e2c3de3fc44edb66014d2a747e64f Sun 2020-12-13 07:07:48 UTC_Sun 2020-12-13 15:43:00 UTC
 
# journalctl -b -1
Specifying boot ID or boot offset has no effect, no persistent journal was found.

3) As written, I'm now using a SATA cable and the harddisk is now inside the pc. I've copied the 3 TB file over again and am in the process of doing a "cmp" on the 2 files. Currently it has verified 44% (1.3 TiB / 2.9 TiB), running at around 150 MiB/sec, 3 hours remaining. No errors so far - with the USB dock, when I did the same, I think I started seeing errors after 20-30 minutes, which is a lot less than 44% of the file size. So everything looks MUCH better now - fortunately - and I'm pretty sure the drive itself is ok (it also hasn't been used much). I also haven't noticed any problems with memory before - but I might test when I'm done, just to be sure, as suggested. I'll try both the long test (smartctl -t long) and a memtest, but I'm relatively optimistic now...
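For reference, the cmp check can be wrapped roughly like this. It is only a sketch: `verify_copy` is a made-up helper name, and the paths in the usage comment are placeholders.

```shell
#!/bin/sh
# verify_copy SRC DST: succeed only if both files are byte-for-byte identical.
verify_copy() {
    src=$1
    dst=$2
    # cmp stops at the first difference and reports its byte offset;
    # its exit status is 0 for identical files and non-zero otherwise.
    if cmp -- "$src" "$dst"; then
        echo "OK: $dst matches $src"
    else
        echo "MISMATCH: $dst differs from $src" >&2
        return 1
    fi
}

# Usage (placeholder paths):
#   verify_copy /data/bigfile /mnt/mountpoint/bigfile
```

Compared with running md5sum on both files, cmp gives up at the first bad byte and prints where it is, which helps when a corrupted copy fails early in a multi-hour run.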

frostschutz wrote:

The syslog would be interesting in particular since you have that all running on top of a single-disk RAID 1. Normally when a drive fails, it would be kicked from RAID. It could be there is a problem with that when it's the only remaining drive in a RAID. It would be unfortunate if that in turn caused errors to not be reported to the userland copy utility.
I use single disk RAID myself, on my SSD, to allow for on-the-fly mirroring. It hasn't failed me yet, but this setup is after all a bit unusual, so I can't rule it out either. I'll have to do some testing, I guess...
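Side note on the missing syslog: the "no persistent journal was found" message above means journald only kept its logs in memory, so they were lost on reboot. A minimal sketch of switching to persistent storage with a drop-in file; `Storage=persistent` is a documented journald.conf setting, while the function name, the optional root prefix argument and the file name `persistent.conf` are just choices made for this example:

```shell
#!/bin/sh
# enable_persistent_journal [ROOT]: write a journald drop-in that requests
# on-disk storage. ROOT is an optional prefix (hypothetical helper argument)
# so the function can be tried against a scratch directory first.
enable_persistent_journal() {
    root=${1:-}
    dir="$root/etc/systemd/journald.conf.d"
    mkdir -p "$dir" || return 1
    printf '[Journal]\nStorage=persistent\n' > "$dir/persistent.conf"
}

# On the real system (as root):
#   enable_persistent_journal && systemctl restart systemd-journald
# After the next reboot, `journalctl --list-boots` should show more than one
# boot, and `journalctl -b -1` should show the previous one.
```

This doesn't bring back the logs from the USB copy, but it would keep them around the next time something like this happens.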

I have an old Synology DiskStation, which is where the disk was originally formatted. I then discovered that I can mount it directly in Linux, which is really convenient. Inside the Synology I have a volume consisting of 3 disks, but I don't have any machine with 3 available SATA cables, so the fourth disk (the one I placed in the docking station, which caused all of this) is formatted as a single-disk RAID. Could you tell a bit about how you use this "single disk RAID" for doing "on-the-fly mirroring"? Sounds interesting...

seth wrote:

https://www.amazon.de/Thermaltake-SATA- … geNumber=1
Seems a pattern…

Oh, damn, didn't see that... Thanks a lot - very interesting that I'm not the only one... I'm betting a LOT of people have lost a LOT of data because of those morons who made that docking station... And if you just look at the reviews, it has 4.4 out of 5 stars, which doesn't sound bad... There are always people who are unhappy about products, but this time it's really crazy that this product even is or has been on the market... I think I'll never ever again buy a product from Thermaltake if I can avoid it. Actually, I think some part of the 3 TB file is already corrupt (that happened around 2-3 weeks ago), but it is so big that I wouldn't notice if e.g. only a tiny fraction is corrupt. Well, what's done is done. I'm happy that I think I've found a conclusion, cause and explanation and know how to avoid the problem... Tomorrow I'll give a - probably final - update and I think everything will be fine by then (currently 48% of the new rsync/copy operation has been verified ok). I don't even know whether it's only Thermaltake products or all USB3 harddisk docking stations that could be malfunctioning, but hopefully the drive can last at least 2-3 years more before beginning to fail. Thanks all for now!

newsboost · 2020-12-15 23:48:26

Final update: Sorry, almost forgot this - marking the thread as solved. Completed a new 3 TB transfer and md5sum matches (using SATA-cable). Threw away that crappy usb hd docking station. The long smartctl test completed as below - everything is fine - thanks for all the hints/feedback.

# smartctl -a /dev/sdb3 | grep -C 2 -i 'self-test execution'
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.

cmurf · 2020-12-16 01:21:37

There are a variety of sources for corruption. This case pretty conclusively shows it's the cable or connector. In other cases it could just as well be logic board memory, drive cache memory, or drive firmware.

rsync --checksum --dry-run can be used to compare. It'll only tell you whether they're the same or different, not which one is correct. You'd need another source of truth. Both ZFS and Btrfs checksum data, and the checksums themselves are checksummed.
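A minimal sketch of that comparison for two directory trees. The function name `compare_trees` and the paths are placeholders; the rsync options themselves are standard ones.

```shell
#!/bin/sh
# compare_trees SRC_DIR DST_DIR: list files whose content differs,
# without modifying anything on either side.
compare_trees() {
    # --recursive        descend into subdirectories
    # --checksum         decide by file content, not by size and mtime
    # --dry-run          only report what would be transferred
    # --itemize-changes  print one line per file that would be updated
    rsync --recursive --checksum --dry-run --itemize-changes -- "$1"/ "$2"/
}

# Usage (placeholder paths):
#   compare_trees /data/source /mnt/mountpoint/copy
```

A file that shows up in the output differs between the two sides; files that match produce no line. For a single file, the same options work with two file paths instead of directories.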

Additionally, by default on spinning drives Btrfs uses "dup" (duplicate) profile for metadata, the file system itself. Upon detection of corruption of metadata, it can self-heal, by locating the good copy and overwriting the bad copy with the good.

XFS and ext4 have somewhat recently (in the last few years) added metadata-only checksumming, which will detect errors in the metadata but can't self-heal, though their repair utilities are quite good. Both ZFS and Btrfs offer a scrub feature that checks the integrity of every block, metadata and data, against its checksum, and reports mismatches. In such a case you can consider it a reliable indication of corruption.
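As a rough illustration of how a scrub is started on each, here is a small sketch that only prints the command for a given filesystem type. `scrub_command` is a made-up helper, and `/mnt/data` and `tank` in the usage comment are placeholder names; `btrfs scrub start` and `zpool scrub` are the actual tools.

```shell
#!/bin/sh
# scrub_command FSTYPE TARGET: print the scrub command for a filesystem type.
# TARGET is a mount point for btrfs and a pool name for zfs.
scrub_command() {
    case "$1" in
        # -B keeps the scrub in the foreground and prints statistics at the end
        btrfs) echo "btrfs scrub start -B $2" ;;
        # runs in the background; check progress and errors with: zpool status POOL
        zfs)   echo "zpool scrub $2" ;;
        # ext4 and XFS carry no data checksums to scrub against
        *)     echo "no checksum-based data scrub for $1" >&2; return 1 ;;
    esac
}

# Usage (placeholder names):
#   scrub_command btrfs /mnt/data    # prints: btrfs scrub start -B /mnt/data
#   scrub_command zfs tank           # prints: zpool scrub tank
```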

Memory bitflips can be quite pernicious. We see them in btrfs land frequently. I don't exactly recommend using Btrfs to try and track down memory problems, but it does seem to have that feature as a side effect. As metadata is only a small (~4%) portion of what's on disk and data makes up the other 96%, data is a much larger target for any sort of corruption.
