#1 2024-06-01 05:38:18

phoenix324
Member
Registered: 2023-08-23
Posts: 83

[Solved] System stops responding, nothing new can be run.

OS: Arch Linux x86_64
Host: GL65 Leopard 9SDK (REV:1.0)
Kernel: Linux 6.9.2-zen1-1-zen
Shell: bash 5.2.26
WM: Hyprland (Wayland)
CPU: Intel(R) Core(TM) i7-9750H (12) @ 4.50 GHz
GPU: Intel UHD Graphics 630 @ 1.15 GHz [Integrated]
Memory: 3.50 GiB / 15.42 GiB (23%)
Swap: 0 B / 24.00 GiB (0%)
Disk (/): 267.98 GiB / 451.44 GiB (59%) - btrfs

This has now happened 4 times. After the 3rd time, I panicked, made a backup, and spent a whole day trying to figure out what was causing the issue: is it my filesystem (BTRFS), my disk (NVMe), or my RAM?

When it happens, I am not able to launch any new application (Super+T for kitty) or execute any binary from an already opened terminal. I was able to execute `shutdown now` (the 1st and 2nd time, because terminals were already open), but the process got stuck and I had to hold the power button to force a shutdown.

Nothing gets logged to the journal; every time, the last 20 minutes of the journal are missing. When this happened the 4th time, I used SysRq. Here is the journal from the 4th incident, from `journalctl -b -1 -n 500 --no-pager` (I didn't use sudo, let me know if sudo is needed): http://0x0.st/XNn9.txt

For Disk(nvme) - I ran `smartctl -x /dev/nvme0`

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

For BTRFS - I ran a scrub and `btrfs check`, and they didn't report any errors. Here's the output ->

scrub status:1
ff5d4ba3-25ce-46cd-9bab-dc98091801b1:1|data_extents_scrubbed:6260029|tree_extents_scrubbed:583302|data_bytes_scrubbed:277634707456|tree_bytes_scrubbed:9556819968|read_errors:0|csum_errors:0|verify_errors:0|no_csum:2950373|csum_discards:0|super_errors:0|malloc_errors:0|uncorrectable_errors:0|corrected_errors:0|last_physical:411281915904|t_start:1717218527|t_resumed:0|duration:307|canceled:0|finished:1
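The status record above is a single line of pipe-separated `key:value` fields, which is hard to read by eye. A minimal sketch, assuming that layout, for listing only the error counters that are not zero (the function name `scrub_nonzero_errors` is made up for illustration):

```shell
# Print every *_errors counter from a btrfs scrub status record
# (pipe-separated key:value fields on stdin) whose value is not zero.
# Prints nothing when all error counters are 0.
scrub_nonzero_errors() {
    tr '|' '\n' | awk -F: '$1 ~ /_errors$/ && $2 != 0 { print $1 "=" $2 }'
}
```

Fed the record above, it should print nothing, since every `*_errors` field there is 0. The records are normally kept under /var/lib/btrfs/ in files named scrub.status.<UUID>.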

I am a little confused about the argument to `btrfs scrub start <path>`: does the command scrub the whole filesystem on which the path lies? For example, `/home` and `/` both lie on the same partition, although they are different subvolumes.

ID 261 gen 332395 top level 5 path @
ID 262 gen 332395 top level 5 path @home
ID 263 gen 328837 top level 5 path @snapshots
ID 265 gen 298353 top level 5 path @libvirt_images
ID 266 gen 332365 top level 5 path @var_cache
ID 267 gen 332394 top level 5 path @var_log
ID 269 gen 15387 top level 263 path @snapshots/1/snapshot
ID 932 gen 332386 top level 5 path @var_tmp
ID 933 gen 49184 top level 263 path @snapshots/658/snapshot
ID 2206 gen 259516 top level 263 path @snapshots/1931/snapshot
ID 2207 gen 259517 top level 263 path @snapshots/1932/snapshot

For RAM - I can't run a memory test because temperatures become very high. This is because I have a laptop with hybrid graphics (Intel + dedicated NVIDIA); during the boot stage before the kernel, the NVIDIA GPU is running and causing high temperatures. I can see this because I have an LED indicator which turns orange when the dedicated graphics card is on. Is there any way to turn off the NVIDIA GPU at this stage? I am already using a udev rule to fully power down the discrete GPU: https://wiki.archlinux.org/title/Hybrid … screte_GPU

Last edited by phoenix324 (2024-10-04 09:56:34)

#2 2024-06-01 07:04:29

seth
Member
Registered: 2012-09-03
Posts: 64,272

Re: [Solved] System stops responding, nothing new can be run.

"SMART overall-health self-assessment test result: PASSED" and an empty bag are worth about the bag, check the entire "smartctl -a"
The journal ends w/

Jun 01 10:11:26 ArchLinux systemd[1099]: Started kitty child process: 121356 launched by: 121347.

but the sysrq aren't logged - did you use the entire REISUB sequence?

Looks a bit like https://bbs.archlinux.org/viewtopic.php?id=296332 … what exact nvme model is this?

#3 2024-06-01 07:44:39

phoenix324
Member
Registered: 2023-08-23
Posts: 83

Re: [Solved] System stops responding, nothing new can be run.

smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.9.2-zen1-1-zen] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       KINGSTON OM8PCP3512F-AI1
Serial Number:                      50026B7683ECE53F
Firmware Version:                   ECFK52.8
PCI Vendor/Subsystem ID:            0x2646
IEEE OUI Identifier:                0x0026b7
Total NVM Capacity:                 512,110,190,592 [512 GB]
Unallocated NVM Capacity:           0
Controller ID:                      1
NVMe Version:                       1.3
Number of Namespaces:               1
Namespace 1 Size/Capacity:          512,110,190,592 [512 GB]
Namespace 1 Formatted LBA Size:     512
Namespace 1 IEEE EUI-64:            0026b7 683ece53f5
Local Time is:                      Sat Jun  1 12:55:52 2024 IST
Firmware Updates (0x12):            1 Slot, no Reset required
Optional Admin Commands (0x0017):   Security Format Frmw_DL Self_Test
Optional NVM Commands (0x0054):     DS_Mngmt Sav/Sel_Feat Timestmp
Log Page Attributes (0x0a):         Cmd_Eff_Lg Telmtry_Lg
Maximum Data Transfer Size:         512 Pages
Warning  Comp. Temp. Threshold:     85 Celsius
Critical Comp. Temp. Threshold:     90 Celsius

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     5.74W       -        -    0  0  0  0        0       0
 1 +     5.21W       -        -    1  1  1  1        0       0
 2 +     4.95W       -        -    2  2  2  2        0       0
 3 -   0.0490W       -        -    3  3  3  3     2000    2000
 4 -   0.0018W       -        -    4  4  4  4    25000   25000

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         2
 1 -    4096       0         1

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        45 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    20%
Data Units Read:                    62,070,794 [31.7 TB]
Data Units Written:                 55,506,653 [28.4 TB]
Host Read Commands:                 651,601,367
Host Write Commands:                935,160,312
Controller Busy Time:               5,029
Power Cycles:                       10,166
Power On Hours:                     9,250
Unsafe Shutdowns:                   163
Media and Data Integrity Errors:    0
Error Information Log Entries:      10,345
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Thermal Temp. 1 Transition Count:   17
Thermal Temp. 1 Total Time:         3

Error Information (NVMe Log 0x01, 16 of 63 entries)
Num   ErrCount  SQId   CmdId  Status  PELoc          LBA  NSID    VS  Message
  0      10345     0  0x0008  0x4004  0x028            0     0     -  Invalid Field in Command

Self-test Log (NVMe Log 0x06)
Self-test status: No self-test in progress
Num  Test_Description  Status                       Power_on_Hours  Failing_LBA  NSID Seg SCT Code
 0   Short             Aborted: Controller Reset              9207            -     -   -   -    -
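Since the overall "PASSED" line says little, it may help to record a few individual counters after each incident and compare them over time. A minimal sketch, assuming the "Name: value" layout of the NVMe health section above (the helper name `smart_field` is hypothetical):

```shell
# Extract the value of one "Name: value" field from smartctl NVMe output
# read on stdin. Spaces and thousands separators are stripped, so plain
# counters such as "10,345" come out as "10345".
smart_field() {
    awk -F: -v key="$1" '$1 == key { v = $2; gsub(/[ ,]/, "", v); print v }'
}
```

For example, `sudo smartctl -a /dev/nvme0 | smart_field 'Unsafe Shutdowns'` would print the current count. Fields that carry units or brackets (such as "Data Units Read") are not cleaned up by this sketch.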

The journal ends w/

Jun 01 10:11:26 ArchLinux systemd[1099]: Started kitty child process: 121356 launched by: 121347.

but the sysrq aren't logged - did you use the entire REISUB sequence?

The system stopped responding at 10:20:**; that's what I meant by the journal not storing the last 20 minutes of entries. Yes, I used the whole REISUB sequence. At the 'S' stage I even got a message saying `sync complete`, and I pressed 'S' twice just to be safe. But when my system rebooted, the journal entries were still missing.
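Whether each key of the REISUB sequence is honoured depends on the `kernel.sysrq` bitmask, so a working sync does not by itself prove that the other keys did anything. A minimal sketch for checking this, assuming the bit values from the kernel's SysRq documentation (the function name `sysrq_allows` is made up for illustration):

```shell
# Decide whether a SysRq key is permitted by a given kernel.sysrq value.
# $1 = value of kernel.sysrq, $2 = lowercase key from the REISUB sequence.
# Bits: 4 keyboard control (r), 16 sync (s), 32 remount read-only (u),
#       64 signalling processes (e, i), 128 reboot/poweroff (b).
# A value of 1 enables everything, 0 disables everything.
sysrq_allows() {
    case $2 in
        r)   sysrq_bit=4 ;;
        e|i) sysrq_bit=64 ;;
        s)   sysrq_bit=16 ;;
        u)   sysrq_bit=32 ;;
        b)   sysrq_bit=128 ;;
        *)   return 2 ;;
    esac
    [ "$1" -eq 1 ] || [ $(( $1 & sysrq_bit )) -ne 0 ]
}
```

Usage on a running system could look like `sysrq_allows "$(cat /proc/sys/kernel/sysrq)" b && echo "B is allowed"`. With a value of 16, for instance, only the sync key passes this check.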


what exact nvme model is this?

 KINGSTON OM8PCP3512F-AI1 

The problem seems similar, but my error frequency varies widely: 3-5 days or even a week. I will keep an eye on that thread.


From the articles I read, I was under the impression that RAM may be the issue. As in my previous post: any idea if I can turn off my discrete GPU so that it won't generate unnecessary heat? Also, the wiki says to run the test for 10 cycles, but in the memtest86+ UI there is no `Cycles` column, only `PASS`. Is that what I should be looking at to count cycles?


From next time on I will keep my composure and keep a terminal open, and also have an sshd running. Is there anything you would like me to do when I run into this issue again?

#4 2024-06-01 09:33:20

phoenix324
Member
Registered: 2023-08-23
Posts: 83

Re: [Solved] System stops responding, nothing new can be run.

[phoenix@ArchLinux ~]$ sudo btrfs check --force --readonly /dev/nvme0n1p3 
Opening filesystem to check...
WARNING: filesystem mounted, continuing because of --force
Checking filesystem on /dev/nvme0n1p3
UUID: ff5d4ba3-25ce-46cd-9bab-dc98091801b1
[1/7] checking root items
[2/7] checking extents
[3/7] checking free space tree
[4/7] checking fs roots
[5/7] checking only csums items (without verifying data)
[6/7] checking root refs
[7/7] checking quota groups skipped (not enabled on this FS)
found 282496593921 bytes used, no error found
total csum bytes: 259381772
total tree bytes: 4779606016
total fs tree bytes: 4230463488
total extent tree bytes: 231931904
btree space waste bytes: 797230204
file data blocks allocated: 1329677185024
 referenced 353884377088

#5 2024-06-01 11:42:58

seth
Member
Registered: 2012-09-03
Posts: 64,272

Re: [Solved] System stops responding, nothing new can be run.

https://wiki.archlinux.org/title/Solid_ … leshooting
"nvme_core.default_ps_max_latency_us=0 iommu=soft", https://wiki.archlinux.org/title/Kernel_parameters
and then see whether the system stabilizes.
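After editing the boot loader entry it is worth confirming that the parameters actually reached the running kernel. A minimal sketch that does an exact-word match against a command line string (the helper name `cmdline_has` is hypothetical):

```shell
# Return success if the kernel command line ($1) contains exactly the
# parameter given in $2 as a whole word, e.g. "iommu=soft".
cmdline_has() {
    case " $1 " in
        *" $2 "*) return 0 ;;
        *)        return 1 ;;
    esac
}
```

On a live system this could be called as `cmdline_has "$(cat /proc/cmdline)" nvme_core.default_ps_max_latency_us=0 && echo present`. The value the nvme_core module is using can also be read from /sys/module/nvme_core/parameters/default_ps_max_latency_us.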

#6 2024-06-01 11:47:26

phoenix324
Member
Registered: 2023-08-23
Posts: 83

Re: [Solved] System stops responding, nothing new can be run.

seth wrote:

https://wiki.archlinux.org/title/Solid_ … leshooting
"nvme_core.default_ps_max_latency_us=0 iommu=soft", https://wiki.archlinux.org/title/Kernel_parameters
and then see whether the system stabilizes.

Will do; it might take some time, a week or two, as the bug is random.



Any idea if I can disable the NVIDIA (discrete) GPU during memtest86+? Or should I just do it? I am afraid high temperatures might damage the hardware.

#7 2024-06-01 11:50:10

phoenix324
Member
Registered: 2023-08-23
Posts: 83

Re: [Solved] System stops responding, nothing new can be run.

Also, I have been on Arch Linux, and Linux in total, for 8 months. Why is APST causing issues now?

#8 2024-06-01 11:53:51

seth
Member
Registered: 2012-09-03
Posts: 64,272

Re: [Solved] System stops responding, nothing new can be run.

Any idea if i can disable nvidia (discrete-gpu) during memtest86+?

Can you disable the GPU in the UEFI?

Also i have been on ArchLinux and linux in total for 8 months. Why is APST causing issue now?

Kernel regression, UEFI/nvme firmware update/… things work and then they're broken and then we fix them - the circle of life.

On a formal note, please don't bump. Edit your previous post if nobody has yet replied.

#9 2024-06-01 14:02:40

phoenix324
Member
Registered: 2023-08-23
Posts: 83

Re: [Solved] System stops responding, nothing new can be run.

Can you disable the GPU in the UEFI?

No, there is no such option there.



[phoenix@ArchLinux ~]$ lsblk
NAME        MAJ:MIN RM   SIZE RO TYPE MOUNTPOINTS
sda           8:0    0 465.8G  0 disk 
├─sda1        8:1    0   600M  0 part 
├─sda2        8:2    0     1G  0 part 
└─sda3        8:3    0 464.2G  0 part 
nvme0n1     259:0    0 476.9G  0 disk 
├─nvme0n1p1 259:1    0   1.5G  0 part /boot
├─nvme0n1p2 259:2    0    24G  0 part [SWAP]
└─nvme0n1p3 259:3    0 451.4G  0 part /var/log
                                      /var/tmp
                                      /var/lib/libvirt/images
                                      /var/cache
                                      /home
                                      /.snapshots
                                      /

Do you think I could also save journald's output to /dev/sda3/some_file? That way, even if journald is missing the last 20 minutes of output for whatever reason, I would be able to grab it from /dev/sda3/some_file.

I looked into rsyslog and am confused whether it reads from a file on the filesystem or directly from memory. If it is the latter, logging to /dev/sda3/some_file would still work even if something is wrong with the NVMe.

#10 2024-06-01 14:10:49

seth
Member
Registered: 2012-09-03
Posts: 64,272

Re: [Solved] System stops responding, nothing new can be run.

You could simply bind-mount /var/log/journal to somewhere on sda3
Forwarding should™ only occur through a socket on the /run tmpfs, https://wiki.archlinux.org/title/System … ith_syslog

#11 2024-06-23 05:11:51

phoenix324
Member
Registered: 2023-08-23
Posts: 83

Re: [Solved] System stops responding, nothing new can be run.

It finally happened again; I will post the output of the commands that I tried.

[phoenix@ArchLinux ~]$ mount | rg dev
bash: /usr/bin/rg: Input/output error

[phoenix@ArchLinux ~]$ ls /usr/bin/
"/usr/bin/": Input/output error (os error 5)

[phoenix@ArchLinux ~]$ mount > out
bash: out: Read-only file system

[phoenix@ArchLinux ~]$ journalctl
Segmentation fault

[phoenix@ArchLinux ~]$ strace journalctl
bash: /usr/bin/strace: Input/output error
[phoenix@ArchLinux ~]$ mount
/dev/nvme0n1p3 on / type btrfs (ro,noatime,compress=zstd:3,ssd,discard=async,space_cache=v2,subvolid=261,subvol=/@)
devtmpfs on /dev type devtmpfs (rw,nosuid,size=4096k,nr_inodes=2017380,mode=755,inode64)
tmpfs on /dev/shm type tmpfs (rw,nosuid,nodev,inode64)
devpts on /dev/pts type devpts (rw,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=000)
sysfs on /sys type sysfs (rw,nosuid,nodev,noexec,relatime)
securityfs on /sys/kernel/security type securityfs (rw,nosuid,nodev,noexec,relatime)
cgroup2 on /sys/fs/cgroup type cgroup2 (rw,nosuid,nodev,noexec,relatime,nsdelegate,memory_recursiveprot)
pstore on /sys/fs/pstore type pstore (rw,nosuid,nodev,noexec,relatime)
efivarfs on /sys/firmware/efi/efivars type efivarfs (rw,nosuid,nodev,noexec,relatime)
bpf on /sys/fs/bpf type bpf (rw,nosuid,nodev,noexec,relatime,mode=700)
proc on /proc type proc (rw,nosuid,nodev,noexec,relatime)
tmpfs on /run type tmpfs (rw,nosuid,nodev,size=3234304k,nr_inodes=819200,mode=755,inode64)
systemd-1 on /proc/sys/fs/binfmt_misc type autofs (rw,relatime,fd=34,pgrp=1,timeout=0,minproto=5,maxproto=5,direct,pipe_ino=3377)
debugfs on /sys/kernel/debug type debugfs (rw,nosuid,nodev,noexec,relatime)
hugetlbfs on /dev/hugepages type hugetlbfs (rw,nosuid,nodev,relatime,pagesize=2M)
mqueue on /dev/mqueue type mqueue (rw,nosuid,nodev,noexec,relatime)
tracefs on /sys/kernel/tracing type tracefs (rw,nosuid,nodev,noexec,relatime)
tmpfs on /run/credentials/systemd-journald.service type tmpfs (ro,nosuid,nodev,noexec,relatime,nosymfollow,size=1024k,nr_inodes=1024,mode=700,inode64,noswap)
tmpfs on /run/credentials/systemd-udev-load-credentials.service type tmpfs (ro,nosuid,nodev,noexec,relatime,nosymfollow,size=1024k,nr_inodes=1024,mode=700,inode64,noswap)
tmpfs on /run/credentials/systemd-tmpfiles-setup-dev-early.service type tmpfs (ro,nosuid,nodev,noexec,relatime,nosymfollow,size=1024k,nr_inodes=1024,mode=700,inode64,noswap)
fusectl on /sys/fs/fuse/connections type fusectl (rw,nosuid,nodev,noexec,relatime)
configfs on /sys/kernel/config type configfs (rw,nosuid,nodev,noexec,relatime)
tmpfs on /run/credentials/systemd-sysctl.service type tmpfs (ro,nosuid,nodev,noexec,relatime,nosymfollow,size=1024k,nr_inodes=1024,mode=700,inode64,noswap)
tmpfs on /run/credentials/systemd-sysusers.service type tmpfs (ro,nosuid,nodev,noexec,relatime,nosymfollow,size=1024k,nr_inodes=1024,mode=700,inode64,noswap)
tmpfs on /run/credentials/systemd-tmpfiles-setup-dev.service type tmpfs (ro,nosuid,nodev,noexec,relatime,nosymfollow,size=1024k,nr_inodes=1024,mode=700,inode64,noswap)
tmpfs on /run/credentials/systemd-vconsole-setup.service type tmpfs (ro,nosuid,nodev,noexec,relatime,nosymfollow,size=1024k,nr_inodes=1024,mode=700,inode64,noswap)
/dev/nvme0n1p3 on /.snapshots type btrfs (ro,noatime,compress=zstd:3,ssd,discard=async,space_cache=v2,subvolid=263,subvol=/@snapshots)
/dev/nvme0n1p3 on /home type btrfs (ro,noatime,compress=zstd:3,ssd,discard=async,space_cache=v2,subvolid=262,subvol=/@home)
tmpfs on /tmp type tmpfs (rw,nosuid,nodev,nr_inodes=1048576,inode64)
/dev/nvme0n1p3 on /var/cache type btrfs (ro,noatime,compress=zstd:3,ssd,discard=async,space_cache=v2,subvolid=266,subvol=/@var_cache)
/dev/nvme0n1p3 on /var/lib/libvirt/images type btrfs (ro,noatime,compress=zstd:3,ssd,discard=async,space_cache=v2,subvolid=265,subvol=/@libvirt_images)
/dev/nvme0n1p3 on /var/log type btrfs (ro,noatime,compress=zstd:3,ssd,discard=async,space_cache=v2,subvolid=267,subvol=/@var_log)
/dev/nvme0n1p1 on /boot type vfat (rw,relatime,fmask=0077,dmask=0077,codepage=437,iocharset=ascii,shortname=mixed,utf8,errors=remount-ro)
/dev/nvme0n1p3 on /var/tmp type btrfs (ro,noatime,compress=zstd:3,ssd,discard=async,space_cache=v2,subvolid=932,subvol=/@var_tmp)
/dev/sda4 on /var/log/journal type ext4 (rw,noatime)
tmpfs on /run/credentials/systemd-tmpfiles-setup.service type tmpfs (ro,nosuid,nodev,noexec,relatime,nosymfollow,size=1024k,nr_inodes=1024,mode=700,inode64,noswap)
tmpfs on /run/credentials/getty@tty1.service type tmpfs (ro,nosuid,nodev,noexec,relatime,nosymfollow,size=1024k,nr_inodes=1024,mode=700,inode64,noswap)
tmpfs on /run/user/1000 type tmpfs (rw,nosuid,nodev,relatime,size=1617148k,nr_inodes=404287,mode=700,uid=1000,gid=1000,inode64)
portal on /run/user/1000/doc type fuse.portal (rw,nosuid,nodev,relatime,user_id=1000,group_id=1000)
binfmt_misc on /proc/sys/fs/binfmt_misc type binfmt_misc (rw,nosuid,nodev,noexec,relatime)
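The interesting part of that long listing is which mounts on the NVMe partition have flipped to `ro`. A minimal sketch for filtering saved `mount` output afterwards, assuming mount points without spaces (the function name `ro_mounts_on` is made up for illustration; in the failed state awk itself may not be runnable):

```shell
# From `mount` output on stdin, print the mount point of every filesystem
# on the given device ($1) that is currently mounted read-only.
# Line layout assumed: <device> on <mountpoint> type <fstype> (<options>)
ro_mounts_on() {
    awk -v dev="$1" '$1 == dev && $2 == "on" && $6 ~ /^\(ro[,)]/ { print $3 }'
}
```

Run over the listing above with /dev/nvme0n1p3 as the argument, it should list all btrfs subvolume mounts and leave out /boot (a different partition, still `rw`) and /var/log/journal on /dev/sda4.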

When I open a new terminal window, this is the error I get ->

bash: /etc/bash_completion.d/pacolog: Input/output error
Traceback (most recent call last):
  File "/usr/bin/register-python-argcomplete", line 26, in <module>
    import argparse
  File "<frozen importlib._bootstrap>", line 1360, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1322, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 1262, in _find_spec
  File "<frozen importlib._bootstrap_external>", line 1528, in find_spec
  File "<frozen importlib._bootstrap_external>", line 1502, in _get_spec
  File "<frozen importlib._bootstrap_external>", line 1605, in find_spec
  File "<frozen importlib._bootstrap_external>", line 1648, in _fill_cache
OSError: [Errno 5] Input/output error: '/usr/bin'
bash: /usr/bin/zoxide: Input/output error
bash: /usr/bin/navi: Input/output error
bash: /usr/bin/thefuck: Input/output error
bash: /usr/bin/fzf: Input/output error
bash: /usr/bin/base64: Input/output error
[phoenix@ArchLinux ~]$

I rebooted the system using SysRq, here's the journal -> http://0x0.st/XALp.txt

Last edited by phoenix324 (2024-06-23 05:27:29)

#12 2024-06-23 07:23:43

seth
Member
Registered: 2012-09-03
Posts: 64,272

Re: [Solved] System stops responding, nothing new can be run.

Before you're getting a boatload of btrfs errors, the nvme acts up w/ some bus errors

Jun 23 09:59:45 ArchLinux kernel: pcieport 0000:00:1d.0: AER: Correctable error message received from 0000:02:00.0
Jun 23 09:59:45 ArchLinux kernel: nvme 0000:02:00.0: PCIe Bus Error: severity=Correctable, type=Physical Layer, (Receiver ID)
Jun 23 09:59:45 ArchLinux kernel: nvme 0000:02:00.0:   device [2646:500c] error status/mask=00000001/00006000
Jun 23 09:59:45 ArchLinux kernel: nvme 0000:02:00.0:    [ 0] RxErr                  (First)
Jun 23 09:59:45 ArchLinux kernel: pcieport 0000:00:1d.0: AER: Correctable error message received from 0000:02:00.0
Jun 23 09:59:45 ArchLinux kernel: nvme 0000:02:00.0: PCIe Bus Error: severity=Correctable, type=Physical Layer, (Receiver ID)
Jun 23 09:59:45 ArchLinux kernel: nvme 0000:02:00.0:   device [2646:500c] error status/mask=00000001/00006000
Jun 23 09:59:45 ArchLinux kernel: nvme 0000:02:00.0:    [ 0] RxErr                  (First)
Jun 23 09:59:45 ArchLinux kernel: pcieport 0000:00:1d.0: AER: Correctable error message received from 0000:02:00.0
Jun 23 09:59:45 ArchLinux kernel: nvme 0000:02:00.0: PCIe Bus Error: severity=Correctable, type=Physical Layer, (Receiver ID)
Jun 23 09:59:45 ArchLinux kernel: nvme 0000:02:00.0:   device [2646:500c] error status/mask=00000001/00006000
Jun 23 09:59:45 ArchLinux kernel: nvme 0000:02:00.0:    [ 0] RxErr                  (First)

resulting in

Jun 23 10:00:57 ArchLinux kernel: nvme nvme0: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0x10
Jun 23 10:00:57 ArchLinux kernel: nvme nvme0: Does your device have a faulty power saving mode enabled?
Jun 23 10:00:57 ArchLinux kernel: nvme nvme0: Try "nvme_core.default_ps_max_latency_us=0 pcie_aspm=off" and report a bug
Jun 23 10:00:57 ArchLinux kernel: nvme 0000:02:00.0: enabling device (0000 -> 0002)
Jun 23 10:00:57 ArchLinux kernel: nvme nvme0: Disabling device after reset failure: -19
Jun 23 10:00:57 ArchLinux kernel: I/O error, dev nvme0n1, sector 67125264 op 0x1:(WRITE) flags 0x4000800 phys_seg 1 prio class 2

The journal doesn't cover the entire boot - is this w/ or w/o any of the parameters from the troubleshooting wiki?

Jun 23 10:00:57 ArchLinux kernel: nvme nvme0: Try "nvme_core.default_ps_max_latency_us=0 pcie_aspm=off" and report a bug

(Can't tell whether "pcie_aspm=off" will make any sense either, but it won't harm)
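To judge whether a parameter change reduces the bus errors before the controller drops out, the correctable AER messages can be counted per boot. A minimal sketch, assuming the message wording seen in this journal (other kernel versions may phrase it differently; the function name `aer_correctable_counts` is hypothetical):

```shell
# Count "AER: Correctable error message received from <device>" lines in
# kernel log text on stdin and print "<count> <pci-address>" per device.
aer_correctable_counts() {
    awk '/AER: Correctable error message received from/ { n[$NF]++ }
         END { for (d in n) print n[d], d }' | sort -k2
}
```

For the previous boot this could be fed with `journalctl -k -b -1 | aer_correctable_counts`.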

And just to see whether there might be interference from anything else

lspci -tvnn

#13 2024-06-23 08:33:14

phoenix324
Member
Registered: 2023-08-23
Posts: 83

Re: [Solved] System stops responding, nothing new can be run.

Sorry, I trimmed the journalctl output using `--since -3h`; here's the full journal of the boot - http://0x0.st/XApb.txt


[phoenix@ArchLinux ~]$ cat /proc/cmdline 
initrd=\intel-ucode.img initrd=\initramfs-linux-zen.img root=UUID=ff5d4ba3-25ce-46cd-9bab-dc98091801b1 rw rootflags=subvol=@ loglevel=3 mitigations=off nowatchdog modprobe.blacklist=iTCO_wdt
[phoenix@ArchLinux ~]$ lspci -tvnn
-[0000:00]-+-00.0  Intel Corporation 8th Gen Core Processor Host Bridge/DRAM Registers [8086:3ec4]
           +-01.0-[01]--
           +-02.0  Intel Corporation CoffeeLake-H GT2 [UHD Graphics 630] [8086:3e9b]
           +-12.0  Intel Corporation Cannon Lake PCH Thermal Controller [8086:a379]
           +-14.0  Intel Corporation Cannon Lake PCH USB 3.1 xHCI Host Controller [8086:a36d]
           +-14.2  Intel Corporation Cannon Lake PCH Shared SRAM [8086:a36f]
           +-14.3  Intel Corporation Cannon Lake PCH CNVi WiFi [8086:a370]
           +-15.0  Intel Corporation Cannon Lake PCH Serial IO I2C Controller #0 [8086:a368]
           +-15.2  Intel Corporation Cannon Lake PCH Serial IO I2C Controller #2 [8086:a36a]
           +-16.0  Intel Corporation Cannon Lake PCH HECI Controller [8086:a360]
           +-17.0  Intel Corporation Cannon Lake Mobile PCH SATA AHCI Controller [8086:a353]
           +-1d.0-[02]----00.0  Kingston Technology Company, Inc. OM8PCP Design-In PCIe 3 NVMe SSD (DRAM-less) [2646:500c]
           +-1d.6-[03]----00.0  Realtek Semiconductor Co., Ltd. RTL8111/8168/8211/8411 PCI Express Gigabit Ethernet Controller [10ec:8168]
           +-1f.0  Intel Corporation HM470 Chipset LPC/eSPI Controller [8086:a30d]
           +-1f.3  Intel Corporation Cannon Lake PCH cAVS [8086:a348]
           +-1f.4  Intel Corporation Cannon Lake PCH SMBus Controller [8086:a323]
           \-1f.5  Intel Corporation Cannon Lake PCH SPI Controller [8086:a324]

Because I don't have Windows, I won't be able to update the firmware.

First I will try "nvme_core.default_ps_max_latency_us=0"; if I still get errors, I will then try combinations of "iommu=soft" and "pcie_aspm=off".

On a different note, is there any chance of private information being in the output of journalctl? The ArchWiki doesn't mention anything about this; that's why I am always reluctant to post full journalctl output.

#14 2024-06-23 12:38:51

seth
Member
Registered: 2012-09-03
Posts: 64,272

Re: [Solved] System stops responding, nothing new can be run.

The ethernet chip is on the same bus as the nvme, but you don't seem to be using that.

"Private" - yes. "Sensitive" - ideally no. It's not supposed to.
Sometimes network managing daemons leak publicly routable IPv4 (not in your case) and sudo gets audited, so "sudo playmyporn" would show up in the journal.
Your wifi MAC is there and some database at google and the chinese government likely can infer your geolocation from that (otherwise it's worthless to anyone not in your wifi range)

People often censor their username or hostname and I then just infer that it's something embarrassing but these things are "private" and usually not sensitive ("usually" because ireallyhatethemullahs is probably not a good username if you're living in Iran…)
It's perfectly fine to pseudonymize stuff, but please don't anonymize tokens (ie. replacing aaa with xxx and bbb with yyy is fine, replacing aaa and bbb with the same xyz is not)
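The distinction between pseudonymizing and anonymizing can be sketched for MAC addresses: each distinct address gets its own stable placeholder, so the log stays readable. This is only an illustration; the function name `pseudonymize_macs` and the `MAC_n` placeholders are made up, and it only looks for colon-separated six-octet addresses:

```shell
# Replace each distinct MAC address on stdin with its own placeholder
# (MAC_1, MAC_2, ...). The same address always maps to the same
# placeholder (case-insensitively); different addresses stay distinct.
pseudonymize_macs() {
    awk '
    BEGIN {
        h  = "[0-9A-Fa-f][0-9A-Fa-f]"
        re = h ":" h ":" h ":" h ":" h ":" h
    }
    {
        line = $0; out = ""
        while (match(line, re)) {
            mac = tolower(substr(line, RSTART, RLENGTH))
            if (!(mac in id)) id[mac] = ++n
            out  = out substr(line, 1, RSTART - 1) "MAC_" id[mac]
            line = substr(line, RSTART + RLENGTH)
        }
        print out line
    }'
}
```

Usage could be `journalctl -b | pseudonymize_macs > journal.txt`. Replacing every address with the same string instead would be the anonymization that makes the log hard to follow.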

#15 2024-06-24 01:52:23

phoenix324
Member
Registered: 2023-08-23
Posts: 83

Re: [Solved] System stops responding, nothing new can be run.

Got it. Also, how do you deal with journalctl log output from other users? Do you use some program like `wget -qO- <url> | lnav`?

Should I mark this topic as solved? Testing this might again take a long time.

#16 2024-06-24 05:28:42

seth
Member
Registered: 2012-09-03
Posts: 64,272

Re: [Solved] System stops responding, nothing new can be run.

Something like that, yes.
Marking the thread as solved while we don't even know (for sure) what's going on would be deceptive - it's ok to keep it open.

#17 2024-07-05 12:53:20

phoenix324
Member
Registered: 2023-08-23
Posts: 83

Re: [Solved] System stops responding, nothing new can be run.

I ran into the issue again.

[phoenix@ArchLinux ~]$ cat /proc/cmdline 
initrd=\intel-ucode.img initrd=\initramfs-linux-zen.img root=UUID=ff5d4ba3-25ce-46cd-9bab-dc98091801b1 rw rootflags=subvol=@ loglevel=3 mitigations=off nowatchdog modprobe.blacklist=iTCO_wdt nvme_core.default_ps_max_latency_us=0

[phoenix@ArchLinux ~]$ uname -r
6.9.7-zen1-1-zen

But this time I couldn't get the journalctl logs, as I had to remove the secondary drive which I used to store the journal; I had to give it back to my friend.

Do you know any way to do this without additional drives? As you saw previously, the whole partition gets remounted as `ro` automatically. What can I do to get the logs out?
1. Do I just use `sudo mount -a` after the disk gets mounted as `ro`?
2. I have enabled ssh in the hope that I could get the journal out, but I'm not sure it would work because of `ro`.

Last edited by phoenix324 (2024-07-05 12:53:47)

#18 2024-07-05 20:31:19

seth
Member
Registered: 2012-09-03
Posts: 64,272

Re: [Solved] System stops responding, nothing new can be run.

Does ssh work?

sudo journalctl -b | curl -F 'file=@-' 0x0.st

will upload the journal to 0x0.st and give you a link, otherwise do you have a usb key?
(You'll typically need < 1MB and certainly not 1GB for the journal)

#19 2024-07-06 00:56:26

phoenix324
Member
Registered: 2023-08-23
Posts: 83

Re: [Solved] System stops responding, nothing new can be run.

seth wrote:

Does ssh work?

I tried triggering the read-only behaviour manually using ->

[phoenix@ArchLinux ~]$ sudo mount /dev/nvme0n1p3 -o remount,ro
mount: /: mount point is busy.
       dmesg(1) may have more information after failed mount system call.

Maybe I will have to wait until it happens again.

otherwise do you have a usb key?

I do, but having the USB key plugged into the laptop all the time doesn't seem feasible.

#20 2024-07-06 04:34:20

seth
Member
Registered: 2012-09-03
Posts: 64,272

Re: [Solved] System stops responding, nothing new can be run.

The remount is a symptom, not the bug.
If you can ssh into the system* you can most likely also upload the journal to the network.
Curl could be an issue, and so could journalctl, if you cannot run new processes any more.
When in doubt, keep "sudo journalctl -f" running in a terminal and take photos of that if the system stalls.

And try "iommu=soft"!

having the usb plugged all the time in laptop dosen't seem feasible

SD card?

(sshd must be running and listening on some tcp port ahead. Nb. that systemd 256 for dumb reasons starts an sshd instance unconditionally, but that's only listening on a unix socket and will not do! Test the ability to ssh into the system while you've no issues)
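One way to check the TCP requirement ahead of time is to look at the listening sockets. A minimal sketch that inspects `ss -tln` output, assuming its usual column layout with the local address in the fourth column (the function name `tcp_listening_on` is made up for illustration):

```shell
# Return success if `ss -tln` output on stdin shows a TCP listener on the
# given port ($1, default 22). A unix-socket-only sshd never appears in
# this output, so it does not count.
tcp_listening_on() {
    awk -v port="${1:-22}" '
        $1 == "LISTEN" && $4 ~ (":" port "$") { found = 1 }
        END { exit !found }'
}
```

Usage could be `ss -tln | tcp_listening_on 22 && echo "sshd reachable over TCP"`. This only shows that something listens on the port; an actual login from a second machine remains the real test.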

#21 2024-07-08 09:31:35

phoenix324
Member
Registered: 2023-08-23
Posts: 83

Re: [Solved] System stops responding, nothing new can be run.

SD card?

Nice idea.



And try "iommu=soft"!

Will do now.


But I don't understand why this problem is happening now. I thought it might be my kernel, as I was using linux-zen. So I tried the linux kernel, but it also happened there. Now I tried linux-lts and it's still happening.

Is something wrong with my hardware? Should I start preparing backups?



http://0x0.st/XBlJ.txt is the journal. I was using the parameters below; I had removed 'nvme_core.default_ps_max_latency_us=0', thinking the kernel was at fault.

[phoenix@ArchLinux ~]$ cat /proc/cmdline 
initrd=\intel-ucode.img initrd=\initramfs-linux-lts.img root=UUID=ff5d4ba3-25ce-46cd-9bab-dc98091801b1 rw rootflags=subvol=@ loglevel=3 mitigations=off nowatchdog modprobe.blacklist=iTCO_wdt

[phoenix@ArchLinux ~]$ uname -r
6.6.37-1-lts

And this is my tlp config ->

[phoenix@ArchLinux ~]$ cat /etc/tlp.d/01-custom.conf 
# --------------------------------------

# Operation mode when no power supply can be detected: AC, BAT.
TLP_DEFAULT_MODE=BAT

# Operation mode select: 0=depend on power source, 1=always use TLP_DEFAULT_MODE
TLP_PERSISTENT_DEFAULT=1

# Do not suspend USB devices
USB_AUTOSUSPEND=0

# Restores radio device state (builtin Bluetooth, Wi-Fi, WWAN) from previous shutdown on boot: 0 – disable , 1 – enable
RESTORE_DEVICE_STATE_ON_STARTUP=1

# --------------------------------------

# Radio devices to disable on connect.

DEVICES_TO_DISABLE_ON_LAN_CONNECT="wifi wwan"
DEVICES_TO_DISABLE_ON_WIFI_CONNECT="wwan"
DEVICES_TO_DISABLE_ON_WWAN_CONNECT="wifi"

#22 2024-07-08 15:24:22

seth
Member
Registered: 2012-09-03
Posts: 64,272

Re: [Solved] System stops responding, nothing new can be run.

After a boatload of pci bus errors you end with

Jul 08 14:41:03 ArchLinux kernel: nvme nvme0: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0xffff
Jul 08 14:41:03 ArchLinux kernel: nvme nvme0: Does your device have a faulty power saving mode enabled?
Jul 08 14:41:03 ArchLinux kernel: nvme nvme0: Try "nvme_core.default_ps_max_latency_us=0 pcie_aspm=off" and report a bug
Jul 08 14:41:03 ArchLinux kernel: nvme 0000:02:00.0: Unable to change power state from D3cold to D0, device inaccessible
Jul 08 14:41:03 ArchLinux kernel: nvme nvme0: Disabling device after reset failure: -19

Re-add those next to iommu=off. You first want to establish a functional baseline by aggressive measures and then ease the intervention as much as possible from there; poking into the dysfunctional system will leave you in the dark because you can't know whether any combination of measures will work at all.

Next to that, the mere presence of tlp is more "concerning" in that regard than a wifi-specific config.

What's the output of "lspci -tvnn"?
(Basically what's in the nvme's neighbourhood)

#23 2024-07-09 02:21:33

phoenix324
Member
Registered: 2023-08-23
Posts: 83

Re: [Solved] System stops responding, nothing new can be run.

seth wrote:

After a boatload of pci bus errors you end with
Re-add those next to iommu=off,

just to verify ->

[phoenix@ArchLinux ~]$ uname -r
6.9.8-arch1-1
[phoenix@ArchLinux ~]$ cat /proc/cmdline 
initrd=\intel-ucode.img initrd=\initramfs-linux.img root=UUID=ff5d4ba3-25ce-46cd-9bab-dc98091801b1 rw rootflags=subvol=@ loglevel=3 mitigations=off nowatchdog modprobe.blacklist=iTCO_wdt nvme_core.default_ps_max_latency_us=0 pcie_aspm=off iommu=off

you first want to establish a functional baseline by aggressive measures and the ease the intervention as much as possible from there, poking into the disfunctional system will leave you in the dark because you can't know whether any combination of measures will work at all.

Nice approach, will remember it.




What's the output of "lspci -tvnn"?
(Basically what's in the nvme's neighbourhood)

-[0000:00]-+-00.0  Intel Corporation 8th Gen Core Processor Host Bridge/DRAM Registers [8086:3ec4]
           +-01.0-[01]--
           +-02.0  Intel Corporation CoffeeLake-H GT2 [UHD Graphics 630] [8086:3e9b]
           +-12.0  Intel Corporation Cannon Lake PCH Thermal Controller [8086:a379]
           +-14.0  Intel Corporation Cannon Lake PCH USB 3.1 xHCI Host Controller [8086:a36d]
           +-14.2  Intel Corporation Cannon Lake PCH Shared SRAM [8086:a36f]
           +-14.3  Intel Corporation Cannon Lake PCH CNVi WiFi [8086:a370]
           +-15.0  Intel Corporation Cannon Lake PCH Serial IO I2C Controller #0 [8086:a368]
           +-15.2  Intel Corporation Cannon Lake PCH Serial IO I2C Controller #2 [8086:a36a]
           +-16.0  Intel Corporation Cannon Lake PCH HECI Controller [8086:a360]
           +-17.0  Intel Corporation Cannon Lake Mobile PCH SATA AHCI Controller [8086:a353]
           +-1d.0-[02]----00.0  Kingston Technology Company, Inc. OM8PCP Design-In PCIe 3 NVMe SSD (DRAM-less) [2646:500c]
           +-1d.6-[03]----00.0  Realtek Semiconductor Co., Ltd. RTL8111/8168/8211/8411 PCI Express Gigabit Ethernet Controller [10ec:8168]
           +-1f.0  Intel Corporation HM470 Chipset LPC/eSPI Controller [8086:a30d]
           +-1f.3  Intel Corporation Cannon Lake PCH cAVS [8086:a348]
           +-1f.4  Intel Corporation Cannon Lake PCH SMBus Controller [8086:a323]
           \-1f.5  Intel Corporation Cannon Lake PCH SPI Controller [8086:a324]

Last time when I gave the output of `lspci -tvnn`, you mentioned the ethernet controller being there along with the nvme. Also, if I use the above parameters, will they have an effect on battery life or performance?

#24 2024-07-09 06:23:50

seth
Member
Registered: 2012-09-03
Posts: 64,272

Re: [Solved] System stops responding, nothing new can be run.

Sorry, I forgot I asked you before to post this lol
Afaict from your previous journal you're not using the r8169 chip (ethernet), can you deactivate it in the BIOS/UEFI?

The parameters will cause more battery drain for sure. How much is to be seen.
You're disabling two power management features (APST and ASPM) and emulate the IOMMU in the CPU.

#25 2024-07-09 13:23:48

phoenix324
Member
Registered: 2023-08-23
Posts: 83

Re: [Solved] System stops responding, nothing new can be run.

seth wrote:

Afaict from your previous journal you're not using the r8169 chip (ethernet), can you deactivate it in the BIOS/UEFI?

There is no option to disable ethernet. These were the options available there ->

SATA Mode Selection
intel (R) Speed Shift Technology
ERP lot 3 support
Wake up on lan S5 support
intel Virtualization technology
VT-d
Hyper-Threading
CPU C states
Acoustic Noise Mitigation

As you said previously, we are using all the extreme measures to get the system into a stable state. In which order should I remove those three parameters after testing them for a few weeks?
1. nvme_core.default_ps_max_latency_us=0
2. pcie_aspm=off
3. iommu=off
