You are not logged in.
Pages: 1
Hi,
I have had an issue I tried to solve for the past couple of months and it's really making me go nuts.
I have a home server on which all services are dockerized.
I made a script to install everything on the server from scratch.
about two months ago I decided everything was ready for a first "release", do wiped out my test server and ran the script on it.
I face an issue I had never encountered while working on the script (and having the server running for days in a row): System hangs after running for a while (10min to hours, so very annoying to test)
I start the server in two steps:
* Mount all the drives (2*2TB data drives (LUKS+ext4), 2*500GB MDADM RAID1 (LUKS+ext4), 1*240GB proxy drive (ext4)
* Docker-compose up
Problem seems to only occure after both steps are completed.
My issue is I have no idea what to look for. I've been going from dead-end to dead-end for the past two months, it's driving me crazy.
Only leads I have are it seems to come from a (several ?) process constantly writing on the MDADM RAID1 (md0):
ps auxf | awk '{if($8=="D") print $0;}':
root 633 0.0 0.0 0 0 ? D 13:06 0:00 \_ [md0_raid1]
root 4545 0.0 0.0 0 0 ? D 13:09 0:00 \_ [dmcrypt_write/2]
root 4558 0.0 0.0 0 0 ? D 13:09 0:00 \_ [jbd2/dm-0-8]
root 5953 0.0 0.0 0 0 ? D 15:03 0:00 \_ [kworker/u32:1+flush-253:0]
(lsof on these four processes does not show anything relevant)
Even tho jbd2 is constantly here, there is not journal written in /var/log/*log
ps -ef|grep jbd2:
root 364 2 0 17:34 ? 00:00:00 [jbd2/sda3-8]
root 4478 2 0 17:37 ? 00:00:00 [jbd2/dm-0-8]
root 4493 2 0 17:37 ? 00:00:00 [jbd2/dm-1-8]
root 4506 2 0 17:37 ? 00:00:00 [jbd2/dm-2-8]
root 4510 2 0 17:37 ? 00:00:00 [jbd2/sdf1-8]
iotop does not show any write on the drives (or very very reasonable occasionnal writes)
md0 is composed of sdb and sdc
Hdparm benchmarks:
hdparm -T -t /dev/sdb
/dev/sdb:
Timing cached reads: 21836 MB in 2.00 seconds = 10938.52 MB/sec
Timing buffered disk reads: 260 MB in 3.01 seconds = 86.48 MB/sec
hdparm -T -t /dev/sdc
/dev/sdc:
Timing cached reads: 20402 MB in 2.00 seconds = 10216.23 MB/sec
Timing buffered disk reads: 338 MB in 3.00 seconds = 112.61 MB/sec
hdparm -T -t /dev/sdc
/dev/md0:
Timing cached reads: 2 MB in 3.72 seconds = 550.43 kB/sec
HDIO_DRIVE_CMD(identify) failed: Inappropriate ioctl for device
Timing buffered disk reads: 258 MB in 3.01 seconds = 85.69 MB/sec
Here the 550.43kB/sec is very very low
Here are the sleep delay of these two drives:
hdparm -B /dev/sdc
/dev/sdc:
APM_level = 128
hdparm -B /dev/sdb
/dev/sdb:
APM_level = 96
(I know they are not the same, but it shouldn't be a problem, right ?)
When the problem occurs, the OS itself is still usable, just any command acting on the md0 device (hdparm, docker srervices having mountpoints on this device, etc.) are totally unresponsive and hang straight.
From there the only solution is to physically power-off the machine (poweroff/shutdown put the machine in a weird state: I can't ssh anymore, but it does not turn off).
I could really appreciate any help right now. starting to loose hope I'll ever fix this issue...
Thanks
Last edited by saponace (2019-05-27 15:26:26)
Offline
No-one can help me ?
I could really use any piece of advice now. I have strictly no lead on how to start tackling the problem and start being desperate ...
Offline
process constantly writing on the MDADM RAID1 (md0)
What's the status of the RAID array?
$ cat /proc/mdstat
--
saint_abroad
Offline
Thanks for the answer.
mdstats is as follow:
Personalities : [raid1]
md0 : active raid1 sdc[1] sdb[0]
624999744 blocks super 1.2 [2/2] [UU]
bitmap: 1/5 pages [4KB], 65536KB chunk
unused devices: <none>
bitmap is 1/5 when system is frozen, rebooted it, and before it froze, bitmap was at 0/5 (I don't think this piece of information is relevant, but never know ...)
Offline
Do either of sdb/sdc report smart errors?
smartctl -H -l error /dev/sdb
Last edited by sabroad (2019-06-05 09:28:58)
--
saint_abroad
Offline
/dev/sdb shows no error, /dev/sdc shows a few errors. Do you think these could be the reason it freezes the whole system the way it does ? The reason I'm surprised is it used to not do this error (before I re-install everything), and it started at the exact same day I reinstalled the whole system.
Here is the output of the command for /dev/sdc:
sudo smartctl -H -l error /dev/sdc
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-5.1.4-arch1-1-ARCH] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
SMART Error Log Version: 1
ATA Error Count: 5
CR = Command Register [HEX]
FR = Features Register [HEX]
SC = Sector Count Register [HEX]
SN = Sector Number Register [HEX]
CL = Cylinder Low Register [HEX]
CH = Cylinder High Register [HEX]
DH = Device/Head Register [HEX]
DC = Device Command Register [HEX]
ER = Error register [HEX]
ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.
Error 5 occurred at disk power-on lifetime: 11336 hours (472 days + 8 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
84 51 a0 bf 8b 0d 00 Error: ICRC, ABRT 160 sectors at LBA = 0x000d8bbf = 887743
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
35 00 c0 9f 8b 0d 40 00 5d+06:14:02.494 WRITE DMA EXT
35 00 10 c7 40 01 40 00 5d+06:13:56.099 WRITE DMA EXT
35 00 10 e7 3b 01 40 00 5d+06:13:55.809 WRITE DMA EXT
25 00 20 9f f3 0c 40 00 5d+06:13:49.617 READ DMA EXT
25 00 08 27 60 e9 40 00 5d+06:13:49.494 READ DMA EXT
Error 4 occurred at disk power-on lifetime: 8989 hours (374 days + 13 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
84 51 08 a7 58 15 09 Error: ICRC, ABRT 8 sectors at LBA = 0x091558a7 = 152393895
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
35 00 18 97 58 15 40 00 5d+23:52:05.705 WRITE DMA EXT
35 00 40 3f a9 06 40 00 5d+23:51:57.741 WRITE DMA EXT
35 00 20 7f bb 06 40 00 5d+23:51:57.739 WRITE DMA EXT
35 00 20 7f b1 06 40 00 5d+23:51:57.737 WRITE DMA EXT
35 00 20 bf aa 06 40 00 5d+23:51:57.735 WRITE DMA EXT
Error 3 occurred at disk power-on lifetime: 8418 hours (350 days + 18 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
84 51 68 17 56 e4 08 Error: ICRC, ABRT 104 sectors at LBA = 0x08e45617 = 149181975
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
35 00 78 07 56 e4 40 00 1d+20:44:39.274 WRITE DMA EXT
35 00 40 7f 30 1e 40 00 1d+20:44:39.256 WRITE DMA EXT
35 00 60 1f 30 1e 40 00 1d+20:44:39.241 WRITE DMA EXT
25 00 20 5f 2a 1e 40 00 1d+20:44:39.215 READ DMA EXT
35 00 20 5f 2f 1e 40 00 1d+20:44:38.976 WRITE DMA EXT
Error 2 occurred at disk power-on lifetime: 1818 hours (75 days + 18 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 08 88 f2 04 00 Error: WP at LBA = 0x0004f288 = 324232
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
61 08 18 38 22 92 40 00 00:00:22.582 WRITE FPDMA QUEUED
61 08 10 88 d0 96 40 00 00:00:22.269 WRITE FPDMA QUEUED
60 40 08 00 e9 fc 40 00 00:00:21.632 READ FPDMA QUEUED
60 08 00 88 f2 04 40 00 00:00:21.613 READ FPDMA QUEUED
61 08 18 48 cb 96 40 00 00:00:21.613 WRITE FPDMA QUEUED
Error 1 occurred at disk power-on lifetime: 1818 hours (75 days + 18 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 08 88 f2 04 00 Error: WP at LBA = 0x0004f288 = 324232
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
61 08 18 48 cb 96 40 00 00:00:19.420 WRITE FPDMA QUEUED
60 08 10 a0 49 72 40 00 00:00:19.045 READ FPDMA QUEUED
60 18 08 78 49 72 40 00 00:00:19.045 READ FPDMA QUEUED
60 08 00 88 f2 04 40 00 00:00:19.007 READ FPDMA QUEUED
60 08 00 08 f3 04 40 00 00:00:18.990 READ FPDMA QUEUED
Offline
The SMART errors aren't frequent enough to explain your issue.
What's the output of iostat during an episode?
iostat sdb sdc md0 1 2
--
saint_abroad
Offline
I really don't get it ...
I thought root cause of the issue could be aging/damaged drives. So I tried to replace these disks to make another array.
Only thing I had available was two pen drives. I did with these pen drives the exact same LUKS/MDADM config as my original setup.
I expected to either experience the exact same thing (100% iowait, nothing responds) or working smoothly. I got something in between: 45% iowait and everything working but very slow (like emby homepage taking 10-15 sec to load) ...
Since then, I rolled back to the orignial config (two hard drives in an external USB enclosure). Now I only have 15% iowait, but the services that used to not respond don't respond anymore.
I understand nothing about what's happening.
Here is the output of "iostat sdd sdg md0 1 2":
Linux 5.1.4-arch1-1-ARCH (tartiflette) 06/09/2019 _x86_64_ (12 CPU)
avg-cpu: %user %nice %system %iowait %steal %idle
0.31 0.07 1.16 44.70 0.00 53.76
Device tps kB_read/s kB_wrtn/s kB_dscd/s kB_read kB_wrtn kB_dscd
sdg 0.33 3.24 4.56 0.00 830022 1169627 0
sdd 0.40 7.35 4.56 0.00 1885216 1169627 0
md0 0.59 10.59 4.55 0.00 2717607 1166348 0
avg-cpu: %user %nice %system %iowait %steal %idle
0.08 0.08 0.42 16.57 0.00 82.85
Device tps kB_read/s kB_wrtn/s kB_dscd/s kB_read kB_wrtn kB_dscd
sdg 0.00 0.00 0.00 0.00 0 0 0
sdd 0.00 0.00 0.00 0.00 0 0 0
md0 0.00 0.00 0.00 0.00 0 0 0
I'm starting to be pretty desesperate.
Tomorrow I'll try the same with only one drive, not encrypted, not raided and see what happens
EDIT: I moves drives around, so the md0 drives are now sdc and sdg. But they stilll are the same drives as earlier
Last edited by saponace (2019-06-08 23:24:08)
Offline
Could this be related to swapfile on LUKS?
What's the output of vmstat during an episode?
vmstat 1 5
--
saint_abroad
Offline
I actually forgot to give an update.
Amongst all the different things I tried, moving the two HDDs sdc and sdd from USB to SATA fixed the issue.
I guess the USB protocol does not handle well some kind of file transfert (I guess it's very frequent tiny modifications). Even though it's USB3.
Sorry for the delay, and thank you for the help
Offline
Pages: 1