You are not logged in.

#1 2016-05-20 12:08:26

jamb
Member
Registered: 2016-05-20
Posts: 10

hang/freeze disk IO

Hi,

I encounter a rather unusual problem. Since one week or so, my system "hangs/freezes" on doing the following:

- login into KDE
- starting dolphin
- logout
- periodically during operation (thunderbird, firefox, etc.) seemingly not related to any program activity

After some time (3-10s), the operation continues as normal. During the freeze, mouse pointer can be moved, but no interaction with programs is possible.


I googled a lot, but could not come up with useful results. A few things I want to share:
- The problems remain with a newly created user.
- The problems remain after several reboots.
- "strace dolphin" displays "fdatasync(21" while the freeze occurs.
- "glances" shows an increased value (~20) on iowait (However, I cannot see the value during hang, since the system is froozen).
- In the example case of dolphin: The task is visible in the taskbar, however, the window itself is not visible.
- audio keeps playing; a video stops (audio continues) and then jumps to the "correct" position after the delay.

My system is as follows:
- Lenovo X250, arch linux (up to date via yaourt -Syu (daily))
- 2 disks (Root is on m.2 SSD) with btrfs on top of Luks
- Plasma 5.6.4



I would be glad for any hints in any direction or how to diagnose the problem

Last edited by jamb (2016-05-27 08:14:56)

Offline

#2 2016-05-20 13:58:18

lastchancetosee
Member
From: Berlin
Registered: 2009-02-19
Posts: 131

Re: hang/freeze disk IO

fdatasync is responsible for flushing data from buffers to disk. That would indicate that your system is trying to write something to disk but the device isn't responding. Things that have caused similar issues for me in the past: Drives that were spun down due to idling, NFS shares where the connection dropped for some reason, bad USB drives.

The X250 is a laptop, right? How does it have two disks?


My ship don't crash! She crashes, you crashed her!

Offline

#3 2016-05-20 14:26:27

jamb
Member
Registered: 2016-05-20
Posts: 10

Re: hang/freeze disk IO

Thank you for turning to my problem.

Yes, the Lenovo X250 is indeed a laptop. My system disk is a m.2 SSD (thus the 2 disks). However, I don't think that an SSD can spin down?

There are no NFS shares or USB drives attached to the laptop.

File operations in general are quite ok, i.e., a cp-command goes through fine without any trouble.

Offline

#4 2016-05-20 15:22:53

lastchancetosee
Member
From: Berlin
Registered: 2009-02-19
Posts: 131

Re: hang/freeze disk IO

Ah, OK. No, the SSD can't spin down. I don't know if there are any power saving states for SSDs that take time to wake up from (very unlikely), but anyway, your hang pattern (login/logout, file manager, user programs) sounds more like /home to me.

File operations in general are quite ok, i.e., a cp-command goes through fine without any trouble.

During a freeze? That would kill the inaccessible-disk theory ...


My ship don't crash! She crashes, you crashed her!

Offline

#5 2016-05-23 06:01:32

jamb
Member
Registered: 2016-05-20
Posts: 10

Re: hang/freeze disk IO

This Answer is outdated, but I keep it for reference. Please continue at my next answer.

~~~~~~~~~~~~~~~~~~~~~~~


lastchancetosee wrote:

File operations in general are quite ok, i.e., a cp-command goes through fine without any trouble.

During a freeze? That would kill the inaccessible-disk theory ...

I tried to test this as follows:


➜  ~  dd if=/dev/urandom of=a.tmp bs=1M count=200; rm a.tmp
200+0 records in
200+0 records out
209715200 bytes (210 MB, 200 MiB) copied, 21,5296 s, 9,7 MB/s


this is during normal operation. Then I executed the same command and opened 3 file manager windows (this produces the freeze):

➜  ~  dd if=/dev/urandom of=a.tmp bs=1M count=200; rm a.tmp
200+0 records in
200+0 records out
209715200 bytes (210 MB, 200 MiB) copied, 22,6356 s, 9,3 MB/s


The difference is only 1 second, so I assume it is unrelated since the freezes seem to be much longer than just one second.


Hence, my conclusion is that disk operation in general is normal. Should I test it another way? Do you have other ideas?

Last edited by jamb (2016-05-23 12:21:02)

Offline

#6 2016-05-23 12:20:20

jamb
Member
Registered: 2016-05-20
Posts: 10

Re: hang/freeze disk IO

jamb wrote:

Hence, my conclusion is that disk operation in general is normal. Should I test it another way? Do you have other ideas?

I did some more testing and it turned out that disk operation for large files ~10GB is not normal.

When copying a file onto tho m2.SSD (home partition), the following happens:

- IOwait goes up (50% to 80%)
- High Load values (10 - 20) (4 cores)
- CPU is 100% with one core
- after some seconds, system is unusable (close to the freeze, but slightly more responsive within already open programs)


So it really seems disk-related. Unfortunately, this brings me no closer to a solution, so any hints are highly appreciated!

Offline

#7 2016-05-27 06:59:12

jamb
Member
Registered: 2016-05-20
Posts: 10

Re: hang/freeze disk IO

I have some updates to the situation: I found out that btrfs had some hickup concerning it's data/metadata planning. Using "btrfs filesystem balace start -dusage=55 /" this problem disappeared. The system is now faster than before.

However, if I issue large file operations (cp test/large.tmp ~/), then the system becomes completely unusable after a few seconds and the copying process takes forever.


Is my SSD broken? (Smart values and self-test show no sign of failure). Any hints are highly appreciated.


For reference here are the smart values:

➜  ~  sudo smartctl -a /dev/sdb
[sudo] password: 
smartctl 6.5 2016-05-07 r4318 [x86_64-linux-4.5.4-1-ARCH] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:     ADATA SP900NS34
Serial Number:    2E4920023868
LU WWN Device Id: 0 000000 000000000
Firmware Version: 5.8.2
User Capacity:    256.060.514.304 bytes [256 GB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   ACS-2 T13/2015-D revision 3
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Fri May 27 08:58:17 2016 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x02) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever 
                                        been run.
Total time to complete Offline 
data collection:                (    0) seconds.
Offline data collection
capabilities:                    (0x7d) SMART execute Offline immediate.
                                        No Auto Offline data collection support.
                                        Abort Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine 
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        (  48) minutes.
Conveyance self-test routine
recommended polling time:        (   2) minutes.
SCT capabilities:              (0x0025) SCT Status supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x0032   094   094   050    Old_age   Always       -       8592773440
  5 Reallocated_Sector_Ct   0x0033   100   100   003    Pre-fail  Always       -       1
  9 Power_On_Hours          0x0032   099   099   000    Old_age   Always       -       956 (231 50 0)
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       728
171 Unknown_Attribute       0x000a   100   100   000    Old_age   Always       -       0
172 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       0
174 Unknown_Attribute       0x0030   000   000   000    Old_age   Offline      -       23
177 Wear_Leveling_Count     0x0000   000   000   000    Old_age   Offline      -       1
181 Program_Fail_Cnt_Total  0x000a   100   100   000    Old_age   Always       -       0
182 Erase_Fail_Count_Total  0x0032   100   100   000    Old_age   Always       -       0
187 Reported_Uncorrect      0x0012   100   100   000    Old_age   Always       -       0
194 Temperature_Celsius     0x0022   063   093   000    Old_age   Always       -       63 (Min/Max 13/93)
195 Hardware_ECC_Recovered  0x001c   096   096   000    Old_age   Offline      -       8592773440
196 Reallocated_Event_Count 0x0033   100   100   003    Pre-fail  Always       -       1
201 Unknown_SSD_Attribute   0x001c   096   096   000    Old_age   Offline      -       8592773440
204 Soft_ECC_Correction     0x001c   096   096   000    Old_age   Offline      -       8592773440
230 Unknown_SSD_Attribute   0x0013   100   100   000    Pre-fail  Always       -       100
231 Temperature_Celsius     0x0013   092   092   010    Pre-fail  Always       -       77309411329
233 Media_Wearout_Indicator 0x0032   000   000   000    Old_age   Always       -       33077
234 Unknown_Attribute       0x0032   000   000   000    Old_age   Always       -       2907
241 Total_LBAs_Written      0x0032   000   000   000    Old_age   Always       -       2907
242 Total_LBAs_Read         0x0032   000   000   000    Old_age   Always       -       1730

SMART Error Log not supported

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Offline             Completed without error       00%       947         -
# 2  Offline             Completed without error       00%       942         -
# 3  Extended offline    Completed without error       00%       892         -
# 4  Short offline       Completed without error       00%       890         -
# 5  Extended offline    Completed without error       00%       792         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

Offline

#8 2016-05-27 10:40:17

alex.theoto
Member
From: Athens Greece
Registered: 2014-11-30
Posts: 307

Re: hang/freeze disk IO

Is your psu ok? Ssd are sensitive on power picks.
Read journalctl as root for any related error.

Offline

#9 2016-05-27 10:53:47

jamb
Member
Registered: 2016-05-20
Posts: 10

Re: hang/freeze disk IO

I assume, you refer to the power supply by 'psu'. If I grep for psu in journalctl, nothing comes up. With the keyword 'power', I get

thermald[1037]: powercap RAPL max power limit range 15000000
thermald[1037]: powercap RAPL invalid max power limit range

The rest is okay, as some daemons for power management are started and stopped. Should I be concerned of the above entries?

Offline

#10 2016-05-27 11:26:10

alex.theoto
Member
From: Athens Greece
Registered: 2014-11-30
Posts: 307

Re: hang/freeze disk IO

Yes, I mean power supply.
I don't know if these messages has something to do.

I'd like to look journalctl last input, stress your computer till the freezing problem and when system recovers, read journalctl for new errors.

Offline

#11 2016-05-27 11:37:58

jamb
Member
Registered: 2016-05-20
Posts: 10

Re: hang/freeze disk IO

sudo journalctl gives the following. The hang occurred for about 30 sec. at 13:33. I cannot see something that catches the eye.



Mai 27 13:31:51 MX250 wpa_supplicant[1244]: wlp3s0: Failed to initiate sched scan
Mai 27 13:32:54 MX250 wpa_supplicant[1244]: wlp3s0: Failed to initiate sched scan
Mai 27 13:33:09 MX250 sudo[16425]: pam_unix(sudo:session): session closed for user root
Mai 27 13:33:11 MX250 yakuake[1504]: Attempted to re-run an already running session.
Mai 27 13:33:52 MX250 plasmashell[1488]: Currrent active notifications: QHash()
Mai 27 13:33:52 MX250 plasmashell[1488]: Guessing partOf as: 0
Mai 27 13:34:29 MX250 wpa_supplicant[1244]: wlp3s0: Failed to initiate sched scan
Mai 27 13:33:52 MX250 plasmashell[1488]:  New Notification:  "The Look Of Love" "by Diana Krall" -1 & Part of: 0
Mai 27 13:33:52 MX250 plasmashell[1488]: QDBusObjectPath: invalid path "1"
Mai 27 13:33:52 MX250 plasmashell[1488]: QDBusObjectPath: invalid path "1"
Mai 27 13:35:00 MX250 wpa_supplicant[1244]: wlp3s0: Failed to initiate sched scan
Mai 27 13:35:32 MX250 sudo[17345]:     jamb : TTY=pts/9 ; PWD=/home/jamb ; USER=root ; COMMAND=/usr/bin/journalctl
Mai 27 13:35:32 MX250 sudo[17345]: pam_unix(sudo:session): session opened for user root by jamb(uid=0)
Mai 27 13:35:38 MX250 sudo[17345]: pam_unix(sudo:session): session closed for user root
Mai 27 13:35:52 MX250 sudo[17389]:     jamb : TTY=pts/9 ; PWD=/home/jamb ; USER=root ; COMMAND=/usr/bin/journalctl
Mai 27 13:35:52 MX250 sudo[17389]: pam_unix(sudo:session): session opened for user root by jamb(uid=0)

Offline

#12 2016-05-28 11:23:36

brebs
Member
Registered: 2007-04-03
Posts: 3,742

Re: hang/freeze disk IO

jamb wrote:

thunderbird, firefox

Disable fdatasync in sqlite.

Of course, this is a separate issue to your filesystem misbehaving.

Offline

#13 2016-05-30 07:45:28

jamb
Member
Registered: 2016-05-20
Posts: 10

Re: hang/freeze disk IO

Just out of curiosity, I performed the "ssd benchmark using dd" from the Arch Wiki. Here, the problem also arises:

➜  ~  dd if=/dev/zero of=tempfile bs=1M count=1024 conv=fdatasync,notrunc status=progress
915406848 bytes (915 MB, 873 MiB) copied, 1,00356 s, 912 MB/s
1024+0 records in
1024+0 records out
1073741824 bytes (1,1 GB, 1,0 GiB) copied, 11,3113 s, 94,9 MB/s
➜  ~  dd if=/dev/zero of=tempfile bs=1M count=1024 conv=fdatasync,notrunc status=progress
1067450368 bytes (1,1 GB, 1018 MiB) copied, 2,00292 s, 533 MB/s
1024+0 records in
1024+0 records out
1073741824 bytes (1,1 GB, 1,0 GiB) copied, 13,6282 s, 78,8 MB/s
➜  ~  dd if=/dev/zero of=tempfile bs=1M count=1024 conv=fdatasync,notrunc status=progress
969932800 bytes (970 MB, 925 MiB) copied, 2,00516 s, 484 MB/s
1024+0 records in
1024+0 records out
1073741824 bytes (1,1 GB, 1,0 GiB) copied, 53,3608 s, 20,1 MB/s

I observe the following:
- Status Reporting stops after 1-2 seconds
- Write Speed is WAY beyond what the SSD should be capable of (although there is an encryption layer which slows things down)


If anyone knows how to diagnose further, please give me some hints...

Offline

#14 2016-06-01 10:25:52

jamb
Member
Registered: 2016-05-20
Posts: 10

Re: hang/freeze disk IO

I am still trying to figure out the question. In the following some experiments and their effect.
I used my swap-partition (16GB) for testing.

Each test is the multiple execution of
    sudo dd if=/dev/zero of=tempfile bs=1M count=4096 conv=fdatasync,notrunc status=progress

1) Format with ext4:
    Result: A lot of repetition run through with ~400-500 MB/s (no problem)

2) Format with Btrfs:
    Result: A lot of repetition run through with ~400-500 MB/s (no problem)

3) Luks+ext4 ("cryptsetup --cipher serpent-cbc-essiv:sha256 --key-size 256")
    Result: The first few runs run okay with ~200MB/s (seems like Luks halfes the speed)
               Then, write speed breaks down in the 5-7th repetition to ~50MB/s

4) Luks+btrfs (see above)
    Result: same as for ext 4

5) Luks+ext4+write on home
    Result: First, it went fine (~200MB/s), then I wrote onto home (the same SSD, another partition (btrfs)), then write speed dropped to ~30MB/s.



I suggest that the drive is not broken, since writing without luks+btrfs seems okay.


I am happy for any suggestions.



PS: On another note: I tried downgrading the kernel for one version (to 4.4) and this changed nothing.

Offline

#15 2016-06-02 07:32:49

jamb
Member
Registered: 2016-05-20
Posts: 10

Re: hang/freeze disk IO

I played around a bit with an arch linux live usb system (so I can exclude some config errors). It turns out that without luks, speed is fast (regardless whether I choose ext4 or btrfs).

However, once I enable luks, it becomes slow after some writes (16-24 GB written in 4 GB-files).

Contrary to the command in my last post, I use aes-xts with sha1 with a key size of 256.

Here are the luks-performances:

➜  ~  cryptsetup benchmark              
# Tests are approximate using memory only (no storage IO).
PBKDF2-sha1       824352 iterations per second for 256-bit key
PBKDF2-sha256     913393 iterations per second for 256-bit key
PBKDF2-sha512     741567 iterations per second for 256-bit key
PBKDF2-ripemd160  591747 iterations per second for 256-bit key
PBKDF2-whirlpool  439838 iterations per second for 256-bit key
#  Algorithm | Key |  Encryption |  Decryption
     aes-cbc   128b   402,0 MiB/s  1744,0 MiB/s
 serpent-cbc   128b    43,2 MiB/s   343,5 MiB/s
 twofish-cbc   128b   119,9 MiB/s   222,7 MiB/s
     aes-cbc   256b   299,9 MiB/s  1334,5 MiB/s
 serpent-cbc   256b    54,0 MiB/s   346,1 MiB/s
 twofish-cbc   256b   121,8 MiB/s   222,3 MiB/s
     aes-xts   256b  1474,0 MiB/s  1470,6 MiB/s
 serpent-xts   256b   345,5 MiB/s   334,5 MiB/s
 twofish-xts   256b   215,3 MiB/s   219,5 MiB/s
     aes-xts   512b  1145,8 MiB/s  1137,2 MiB/s
 serpent-xts   512b   345,8 MiB/s   334,7 MiB/s
 twofish-xts   512b   215,8 MiB/s   217,1 MiB/s

Should I open a new thread, now that it is pinpointed to Luks? (Is it?)

Last edited by jamb (2016-06-02 07:35:18)

Offline

Board footer

Powered by FluxBB