OS: Arch Linux x86_64
Host: GL65 Leopard 9SDK (REV:1.0)
Kernel: Linux 6.9.2-zen1-1-zen
Shell: bash 5.2.26
WM: Hyprland (Wayland)
CPU: Intel(R) Core(TM) i7-9750H (12) @ 4.50 GHz
GPU: Intel UHD Graphics 630 @ 1.15 GHz [Integrated]
Memory: 3.50 GiB / 15.42 GiB (23%)
Swap: 0 B / 24.00 GiB (0%)
Disk (/): 267.98 GiB / 451.44 GiB (59%) - btrfs
This has now happened 4 times. After the 3rd time, i panicked, made a backup and spent a whole day trying to figure out what was causing the issue: is it my filesystem (BTRFS), my disk (NVMe), or my RAM?
When it happens, i am not able to launch any new application (Super+T for kitty) or execute any binary from an already opened terminal. I was able to execute `shutdown now` (the 1st and 2nd time, because terminals were open), but the process got stuck and i had to hold the power button to force a shutdown.
Nothing gets logged to the journal; every time, the last 20 min of the journal are missing. When this happened the 4th time, i used SysRq. I will post the journal of the 4th incident: `journalctl -b -1 -n 500 --no-pager` (didn't use sudo, let me know if sudo is needed) http://0x0.st/XNn9.txt
For Disk(nvme) - I ran `smartctl -x /dev/nvme0`
=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
For BTRFS - I ran scrub and `btrfs check` and they didn't report any errors. Here's the output ->
scrub status:1
ff5d4ba3-25ce-46cd-9bab-dc98091801b1:1|data_extents_scrubbed:6260029|tree_extents_scrubbed:583302|data_bytes_scrubbed:277634707456|tree_bytes_scrubbed:9556819968|read_errors:0|csum_errors:0|verify_errors:0|no_csum:2950373|csum_discards:0|super_errors:0|malloc_errors:0|uncorrectable_errors:0|corrected_errors:0|last_physical:411281915904|t_start:1717218527|t_resumed:0|duration:307|canceled:0|finished:1
I am a little confused about the argument to `btrfs scrub start <path>`: does the command scrub the whole filesystem on which the path lies? For example, does it matter whether i give `/home` or `/`, both of which lie on the same partition, albeit on different subvolumes?
ID 261 gen 332395 top level 5 path @
ID 262 gen 332395 top level 5 path @home
ID 263 gen 328837 top level 5 path @snapshots
ID 265 gen 298353 top level 5 path @libvirt_images
ID 266 gen 332365 top level 5 path @var_cache
ID 267 gen 332394 top level 5 path @var_log
ID 269 gen 15387 top level 263 path @snapshots/1/snapshot
ID 932 gen 332386 top level 5 path @var_tmp
ID 933 gen 49184 top level 263 path @snapshots/658/snapshot
ID 2206 gen 259516 top level 263 path @snapshots/1931/snapshot
ID 2207 gen 259517 top level 263 path @snapshots/1932/snapshot
For RAM - i can't run a memory test because temps become very high. This is because i have a laptop with hybrid graphics (intel + nvidia dedicated): when i am in the boot stage before the kernel, nvidia is running and causing high temps. I can see this because i have an LED indicator which turns orange when the dedicated graphics card is on. Is there any way to turn off the nvidia gpu at this stage? I am already using a udev rule to fully power down the discrete gpu. https://wiki.archlinux.org/title/Hybrid … screte_GPU
Last edited by phoenix324 (2024-10-04 09:56:34)
Offline
"SMART overall-health self-assessment test result: PASSED" and an empty bag are worth about the bag, check the entire "smartctl -a"
The journal ends w/
Jun 01 10:11:26 ArchLinux systemd[1099]: Started kitty child process: 121356 launched by: 121347.
but the sysrq aren't logged - did you use the entire REISUB sequence?
Looks a bit like https://bbs.archlinux.org/viewtopic.php?id=296332 … what exact nvme model is this?
Online
smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.9.2-zen1-1-zen] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Number: KINGSTON OM8PCP3512F-AI1
Serial Number: 50026B7683ECE53F
Firmware Version: ECFK52.8
PCI Vendor/Subsystem ID: 0x2646
IEEE OUI Identifier: 0x0026b7
Total NVM Capacity: 512,110,190,592 [512 GB]
Unallocated NVM Capacity: 0
Controller ID: 1
NVMe Version: 1.3
Number of Namespaces: 1
Namespace 1 Size/Capacity: 512,110,190,592 [512 GB]
Namespace 1 Formatted LBA Size: 512
Namespace 1 IEEE EUI-64: 0026b7 683ece53f5
Local Time is: Sat Jun 1 12:55:52 2024 IST
Firmware Updates (0x12): 1 Slot, no Reset required
Optional Admin Commands (0x0017): Security Format Frmw_DL Self_Test
Optional NVM Commands (0x0054): DS_Mngmt Sav/Sel_Feat Timestmp
Log Page Attributes (0x0a): Cmd_Eff_Lg Telmtry_Lg
Maximum Data Transfer Size: 512 Pages
Warning Comp. Temp. Threshold: 85 Celsius
Critical Comp. Temp. Threshold: 90 Celsius
Supported Power States
St Op Max Active Idle RL RT WL WT Ent_Lat Ex_Lat
0 + 5.74W - - 0 0 0 0 0 0
1 + 5.21W - - 1 1 1 1 0 0
2 + 4.95W - - 2 2 2 2 0 0
3 - 0.0490W - - 3 3 3 3 2000 2000
4 - 0.0018W - - 4 4 4 4 25000 25000
Supported LBA Sizes (NSID 0x1)
Id Fmt Data Metadt Rel_Perf
0 + 512 0 2
1 - 4096 0 1
=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
SMART/Health Information (NVMe Log 0x02)
Critical Warning: 0x00
Temperature: 45 Celsius
Available Spare: 100%
Available Spare Threshold: 10%
Percentage Used: 20%
Data Units Read: 62,070,794 [31.7 TB]
Data Units Written: 55,506,653 [28.4 TB]
Host Read Commands: 651,601,367
Host Write Commands: 935,160,312
Controller Busy Time: 5,029
Power Cycles: 10,166
Power On Hours: 9,250
Unsafe Shutdowns: 163
Media and Data Integrity Errors: 0
Error Information Log Entries: 10,345
Warning Comp. Temperature Time: 0
Critical Comp. Temperature Time: 0
Thermal Temp. 1 Transition Count: 17
Thermal Temp. 1 Total Time: 3
Error Information (NVMe Log 0x01, 16 of 63 entries)
Num ErrCount SQId CmdId Status PELoc LBA NSID VS Message
0 10345 0 0x0008 0x4004 0x028 0 0 - Invalid Field in Command
Self-test Log (NVMe Log 0x06)
Self-test status: No self-test in progress
Num Test_Description Status Power_on_Hours Failing_LBA NSID Seg SCT Code
0 Short Aborted: Controller Reset 9207 - - - - -
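The health fields above that matter most for this kind of failure can be pulled out of the full output with a small filter. This is only a sketch: the function name `nvme_health_fields` is made up for illustration, and it assumes the `smartctl -a` text arrives on stdin with the field labels exactly as printed above.

```shell
# nvme_health_fields: read `smartctl -a` output on stdin and print a few
# selected SMART/Health fields as label=value pairs.
# (Hypothetical helper; the labels are copied from the output in this thread.)
nvme_health_fields() {
  awk -F': *' '
    /^(Critical Warning|Percentage Used|Unsafe Shutdowns|Media and Data Integrity Errors|Error Information Log Entries):/ {
      print $1 "=" $2
    }'
}
```

It would be used as `sudo smartctl -a /dev/nvme0 | nvme_health_fields`, which makes it easier to compare the counters between two incidents.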
The journal ends w/
Jun 01 10:11:26 ArchLinux systemd[1099]: Started kitty child process: 121356 launched by: 121347.
but the sysrq aren't logged - did you use the entire REISUB sequence?
The system stopped responding at 10:20:**, that's what i meant by the journal not storing the last 20 minutes of entries. Yes, i used the whole REISUB sequence; at the 'S' stage i even got a message saying `sync complete`, and i pressed 'S' twice just to be safe. But when my system rebooted, the journal entries were still missing.
what exact nvme model is this?
KINGSTON OM8PCP3512F-AI1
Looks a bit like https://bbs.archlinux.org/viewtopic.php?id=296332
The problem seems similar, but the frequency of my error varies widely: 3-5 days or even a week. I will keep an eye on that thread.
From the articles i read i was under the impression that RAM may be the issue. As in my previous comment, any idea if i can turn off my discrete GPU so that it won't generate unnecessary heat? Also, the wiki says to run the test for 10 cycles, but the memtest86+ UI has no `Cycles` column, only `PASS`; is that what i should be looking at to count cycles?
From next time i will keep my composure and keep a terminal open, and also have sshd running. Is there anything you would like me to do when i run into this issue again?
Offline
[phoenix@ArchLinux ~]$ sudo btrfs check --force --readonly /dev/nvme0n1p3
Opening filesystem to check...
WARNING: filesystem mounted, continuing because of --force
Checking filesystem on /dev/nvme0n1p3
UUID: ff5d4ba3-25ce-46cd-9bab-dc98091801b1
[1/7] checking root items
[2/7] checking extents
[3/7] checking free space tree
[4/7] checking fs roots
[5/7] checking only csums items (without verifying data)
[6/7] checking root refs
[7/7] checking quota groups skipped (not enabled on this FS)
found 282496593921 bytes used, no error found
total csum bytes: 259381772
total tree bytes: 4779606016
total fs tree bytes: 4230463488
total extent tree bytes: 231931904
btree space waste bytes: 797230204
file data blocks allocated: 1329677185024
referenced 353884377088
Offline
https://wiki.archlinux.org/title/Solid_ … leshooting
"nvme_core.default_ps_max_latency_us=0 iommu=soft", https://wiki.archlinux.org/title/Kernel_parameters
and then see whether the system stabilizes.
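To confirm after a reboot that the parameters actually reached the kernel, something like the following could be used. It is a minimal sketch: `has_param` and `check_nvme_params` are hypothetical helper names, and the parameter list is just the two values suggested above.

```shell
# has_param CMDLINE PARAM: succeed if PARAM appears as a whole word in CMDLINE.
has_param() {
  case " $1 " in
    *" $2 "*) return 0 ;;
    *) return 1 ;;
  esac
}

# check_nvme_params [CMDLINE]: report which of the suggested parameters are set.
# Without an argument it reads the running kernel's /proc/cmdline.
check_nvme_params() {
  cmdline=${1:-$(cat /proc/cmdline)}
  for p in nvme_core.default_ps_max_latency_us=0 iommu=soft; do
    if has_param "$cmdline" "$p"; then
      echo "set: $p"
    else
      echo "missing: $p"
    fi
  done
}
```

Running `check_nvme_params` with no argument checks the current boot. The APST latency value should also be readable from /sys/module/nvme_core/parameters/default_ps_max_latency_us while the nvme_core module is loaded.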
Online
https://wiki.archlinux.org/title/Solid_ … leshooting
"nvme_core.default_ps_max_latency_us=0 iommu=soft", https://wiki.archlinux.org/title/Kernel_parameters
and then see whether the system stabilizes.
Will do, might take some time, a week or two, as the bug is random.
Any idea if i can disable nvidia (discrete-gpu) during memtest86+? or should i just do it, i am afraid high temperature might damage hardware.
Offline
Also i have been on ArchLinux and linux in total for 8 months. Why is APST causing issue now?
Offline
Any idea if i can disable nvidia (discrete-gpu) during memtest86+?
Can you disable the GPU in the UEFI?
Also i have been on ArchLinux and linux in total for 8 months. Why is APST causing issue now?
Kernel regression, UEFI/nvme firmware update/… things work and then they're broken and then we fix them - the circle of life.
On a formal note, please don't bump. Edit your previous post if nobody has yet replied.
Online
Can you disable the GPU in the UEFI?
No, there is no such option there.
[phoenix@ArchLinux ~]$ lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINTS
sda 8:0 0 465.8G 0 disk
├─sda1 8:1 0 600M 0 part
├─sda2 8:2 0 1G 0 part
└─sda3 8:3 0 464.2G 0 part
nvme0n1 259:0 0 476.9G 0 disk
├─nvme0n1p1 259:1 0 1.5G 0 part /boot
├─nvme0n1p2 259:2 0 24G 0 part [SWAP]
└─nvme0n1p3 259:3 0 451.4G 0 part /var/log
/var/tmp
/var/lib/libvirt/images
/var/cache
/home
/.snapshots
/
Do you think i can also save the output of journald to /dev/sda3/some_file? This way, even if for whatever reason journald is missing the last 20 min of output, i will be able to grab it from /dev/sda3/some_file.
I looked into rsyslog, and am confused whether it reads from a file on the filesystem or directly from memory. If it is the latter, then even if something is wrong with the nvme, logging would still be done to /dev/sda3/some_file.
Offline
You could simply bind-mount /var/log/journal to somewhere on sda3
Forwarding should™ only occur through a socket on the /run tmpfs, https://wiki.archlinux.org/title/System … ith_syslog
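The bind-mount idea could look roughly like this. It is a sketch under assumptions: sda3 is taken to hold a Linux filesystem mounted at /mnt/sda3 (a made-up mount point), and `journal_bind_fstab_line` is a hypothetical helper that only prints the fstab line so it can be reviewed before use.

```shell
# journal_bind_fstab_line [SRC]: print an fstab entry that bind-mounts SRC
# onto /var/log/journal. SRC defaults to the assumed path /mnt/sda3/journal.
journal_bind_fstab_line() {
  src=${1:-/mnt/sda3/journal}
  printf '%s /var/log/journal none bind 0 0\n' "$src"
}

# One-off equivalent, run as root (same assumed paths):
#   mkdir -p /mnt/sda3/journal
#   mount --bind /mnt/sda3/journal /var/log/journal
#   systemctl restart systemd-journald   # so journald reopens its files there
```

The point of the design is that journald keeps writing to its usual path while the data lands on a disk that is not behind the nvme controller.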
Online
Finally it happened again, i will post output of commands that i tried.
[phoenix@ArchLinux ~]$ mount | rg dev
bash: /usr/bin/rg: Input/output error
[phoenix@ArchLinux ~]$ ls /usr/bin/
"/usr/bin/": Input/output error (os error 5)
[phoenix@ArchLinux ~]$ mount > out
bash: out: Read-only file system
[phoenix@ArchLinux ~]$ journalctl
Segmentation fault
[phoenix@ArchLinux ~]$ strace journalctl
bash: /usr/bin/strace: Input/output error
[phoenix@ArchLinux ~]$ mount
/dev/nvme0n1p3 on / type btrfs (ro,noatime,compress=zstd:3,ssd,discard=async,space_cache=v2,subvolid=261,subvol=/@)
devtmpfs on /dev type devtmpfs (rw,nosuid,size=4096k,nr_inodes=2017380,mode=755,inode64)
tmpfs on /dev/shm type tmpfs (rw,nosuid,nodev,inode64)
devpts on /dev/pts type devpts (rw,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=000)
sysfs on /sys type sysfs (rw,nosuid,nodev,noexec,relatime)
securityfs on /sys/kernel/security type securityfs (rw,nosuid,nodev,noexec,relatime)
cgroup2 on /sys/fs/cgroup type cgroup2 (rw,nosuid,nodev,noexec,relatime,nsdelegate,memory_recursiveprot)
pstore on /sys/fs/pstore type pstore (rw,nosuid,nodev,noexec,relatime)
efivarfs on /sys/firmware/efi/efivars type efivarfs (rw,nosuid,nodev,noexec,relatime)
bpf on /sys/fs/bpf type bpf (rw,nosuid,nodev,noexec,relatime,mode=700)
proc on /proc type proc (rw,nosuid,nodev,noexec,relatime)
tmpfs on /run type tmpfs (rw,nosuid,nodev,size=3234304k,nr_inodes=819200,mode=755,inode64)
systemd-1 on /proc/sys/fs/binfmt_misc type autofs (rw,relatime,fd=34,pgrp=1,timeout=0,minproto=5,maxproto=5,direct,pipe_ino=3377)
debugfs on /sys/kernel/debug type debugfs (rw,nosuid,nodev,noexec,relatime)
hugetlbfs on /dev/hugepages type hugetlbfs (rw,nosuid,nodev,relatime,pagesize=2M)
mqueue on /dev/mqueue type mqueue (rw,nosuid,nodev,noexec,relatime)
tracefs on /sys/kernel/tracing type tracefs (rw,nosuid,nodev,noexec,relatime)
tmpfs on /run/credentials/systemd-journald.service type tmpfs (ro,nosuid,nodev,noexec,relatime,nosymfollow,size=1024k,nr_inodes=1024,mode=700,inode64,noswap)
tmpfs on /run/credentials/systemd-udev-load-credentials.service type tmpfs (ro,nosuid,nodev,noexec,relatime,nosymfollow,size=1024k,nr_inodes=1024,mode=700,inode64,noswap)
tmpfs on /run/credentials/systemd-tmpfiles-setup-dev-early.service type tmpfs (ro,nosuid,nodev,noexec,relatime,nosymfollow,size=1024k,nr_inodes=1024,mode=700,inode64,noswap)
fusectl on /sys/fs/fuse/connections type fusectl (rw,nosuid,nodev,noexec,relatime)
configfs on /sys/kernel/config type configfs (rw,nosuid,nodev,noexec,relatime)
tmpfs on /run/credentials/systemd-sysctl.service type tmpfs (ro,nosuid,nodev,noexec,relatime,nosymfollow,size=1024k,nr_inodes=1024,mode=700,inode64,noswap)
tmpfs on /run/credentials/systemd-sysusers.service type tmpfs (ro,nosuid,nodev,noexec,relatime,nosymfollow,size=1024k,nr_inodes=1024,mode=700,inode64,noswap)
tmpfs on /run/credentials/systemd-tmpfiles-setup-dev.service type tmpfs (ro,nosuid,nodev,noexec,relatime,nosymfollow,size=1024k,nr_inodes=1024,mode=700,inode64,noswap)
tmpfs on /run/credentials/systemd-vconsole-setup.service type tmpfs (ro,nosuid,nodev,noexec,relatime,nosymfollow,size=1024k,nr_inodes=1024,mode=700,inode64,noswap)
/dev/nvme0n1p3 on /.snapshots type btrfs (ro,noatime,compress=zstd:3,ssd,discard=async,space_cache=v2,subvolid=263,subvol=/@snapshots)
/dev/nvme0n1p3 on /home type btrfs (ro,noatime,compress=zstd:3,ssd,discard=async,space_cache=v2,subvolid=262,subvol=/@home)
tmpfs on /tmp type tmpfs (rw,nosuid,nodev,nr_inodes=1048576,inode64)
/dev/nvme0n1p3 on /var/cache type btrfs (ro,noatime,compress=zstd:3,ssd,discard=async,space_cache=v2,subvolid=266,subvol=/@var_cache)
/dev/nvme0n1p3 on /var/lib/libvirt/images type btrfs (ro,noatime,compress=zstd:3,ssd,discard=async,space_cache=v2,subvolid=265,subvol=/@libvirt_images)
/dev/nvme0n1p3 on /var/log type btrfs (ro,noatime,compress=zstd:3,ssd,discard=async,space_cache=v2,subvolid=267,subvol=/@var_log)
/dev/nvme0n1p1 on /boot type vfat (rw,relatime,fmask=0077,dmask=0077,codepage=437,iocharset=ascii,shortname=mixed,utf8,errors=remount-ro)
/dev/nvme0n1p3 on /var/tmp type btrfs (ro,noatime,compress=zstd:3,ssd,discard=async,space_cache=v2,subvolid=932,subvol=/@var_tmp)
/dev/sda4 on /var/log/journal type ext4 (rw,noatime)
tmpfs on /run/credentials/systemd-tmpfiles-setup.service type tmpfs (ro,nosuid,nodev,noexec,relatime,nosymfollow,size=1024k,nr_inodes=1024,mode=700,inode64,noswap)
tmpfs on /run/credentials/getty@tty1.service type tmpfs (ro,nosuid,nodev,noexec,relatime,nosymfollow,size=1024k,nr_inodes=1024,mode=700,inode64,noswap)
tmpfs on /run/user/1000 type tmpfs (rw,nosuid,nodev,relatime,size=1617148k,nr_inodes=404287,mode=700,uid=1000,gid=1000,inode64)
portal on /run/user/1000/doc type fuse.portal (rw,nosuid,nodev,relatime,user_id=1000,group_id=1000)
binfmt_misc on /proc/sys/fs/binfmt_misc type binfmt_misc (rw,nosuid,nodev,noexec,relatime)
When i open a new terminal window, this is the error i get ->
bash: /etc/bash_completion.d/pacolog: Input/output error
Traceback (most recent call last):
File "/usr/bin/register-python-argcomplete", line 26, in <module>
import argparse
File "<frozen importlib._bootstrap>", line 1360, in _find_and_load
File "<frozen importlib._bootstrap>", line 1322, in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 1262, in _find_spec
File "<frozen importlib._bootstrap_external>", line 1528, in find_spec
File "<frozen importlib._bootstrap_external>", line 1502, in _get_spec
File "<frozen importlib._bootstrap_external>", line 1605, in find_spec
File "<frozen importlib._bootstrap_external>", line 1648, in _fill_cache
OSError: [Errno 5] Input/output error: '/usr/bin'
bash: /usr/bin/zoxide: Input/output error
bash: /usr/bin/navi: Input/output error
bash: /usr/bin/thefuck: Input/output error
bash: /usr/bin/fzf: Input/output error
bash: /usr/bin/base64: Input/output error
[phoenix@ArchLinux ~]$
I rebooted the system using SysRq, here's the journal -> http://0x0.st/XALp.txt
Last edited by phoenix324 (2024-06-23 05:27:29)
Offline
Before you're getting a boatload of btrfs errors the nvme acts up w/ some bus errors
Jun 23 09:59:45 ArchLinux kernel: pcieport 0000:00:1d.0: AER: Correctable error message received from 0000:02:00.0
Jun 23 09:59:45 ArchLinux kernel: nvme 0000:02:00.0: PCIe Bus Error: severity=Correctable, type=Physical Layer, (Receiver ID)
Jun 23 09:59:45 ArchLinux kernel: nvme 0000:02:00.0: device [2646:500c] error status/mask=00000001/00006000
Jun 23 09:59:45 ArchLinux kernel: nvme 0000:02:00.0: [ 0] RxErr (First)
Jun 23 09:59:45 ArchLinux kernel: pcieport 0000:00:1d.0: AER: Correctable error message received from 0000:02:00.0
Jun 23 09:59:45 ArchLinux kernel: nvme 0000:02:00.0: PCIe Bus Error: severity=Correctable, type=Physical Layer, (Receiver ID)
Jun 23 09:59:45 ArchLinux kernel: nvme 0000:02:00.0: device [2646:500c] error status/mask=00000001/00006000
Jun 23 09:59:45 ArchLinux kernel: nvme 0000:02:00.0: [ 0] RxErr (First)
Jun 23 09:59:45 ArchLinux kernel: pcieport 0000:00:1d.0: AER: Correctable error message received from 0000:02:00.0
Jun 23 09:59:45 ArchLinux kernel: nvme 0000:02:00.0: PCIe Bus Error: severity=Correctable, type=Physical Layer, (Receiver ID)
Jun 23 09:59:45 ArchLinux kernel: nvme 0000:02:00.0: device [2646:500c] error status/mask=00000001/00006000
Jun 23 09:59:45 ArchLinux kernel: nvme 0000:02:00.0: [ 0] RxErr (First)
resulting in
Jun 23 10:00:57 ArchLinux kernel: nvme nvme0: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0x10
Jun 23 10:00:57 ArchLinux kernel: nvme nvme0: Does your device have a faulty power saving mode enabled?
Jun 23 10:00:57 ArchLinux kernel: nvme nvme0: Try "nvme_core.default_ps_max_latency_us=0 pcie_aspm=off" and report a bug
Jun 23 10:00:57 ArchLinux kernel: nvme 0000:02:00.0: enabling device (0000 -> 0002)
Jun 23 10:00:57 ArchLinux kernel: nvme nvme0: Disabling device after reset failure: -19
Jun 23 10:00:57 ArchLinux kernel: I/O error, dev nvme0n1, sector 67125264 op 0x1:(WRITE) flags 0x4000800 phys_seg 1 prio class 2
The journal doesn't cover the entire boot - is this w/ or w/o any of the parameters from the troubleshooting wiki?
Jun 23 10:00:57 ArchLinux kernel: nvme nvme0: Try "nvme_core.default_ps_max_latency_us=0 pcie_aspm=off" and report a bug
(Can't tell whether "pcie_aspm=off" will make any sense either, but it won't harm)
And just to see whether there might be interference from anything else
lspci -tvnn
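The pattern described in this post (correctable AER errors first, then the controller reset) can be counted in a saved journal with a few greps. This is a sketch only: `summarize_nvme_events` is a hypothetical name, and the match strings are copied from the log lines quoted above, so other kernels may word them differently.

```shell
# summarize_nvme_events FILE: count correctable AER messages and nvme
# controller resets in a saved journal text file.
summarize_nvme_events() {
  file=$1
  # grep -c exits non-zero when the count is 0, hence the "|| true".
  aer=$(grep -c 'AER: Correctable error message received' "$file" || true)
  resets=$(grep -c 'controller is down; will reset' "$file" || true)
  echo "correctable_aer=$aer controller_resets=$resets"
}
```

For example, `journalctl -b -1 --no-pager > last_boot.txt` followed by `summarize_nvme_events last_boot.txt` gives one line per boot that can be compared across kernel parameter changes.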
Online
Sorry, i trimmed the journalctl output using `--since -3h`, here's the full journal of the boot - http://0x0.st/XApb.txt
[phoenix@ArchLinux ~]$ cat /proc/cmdline
initrd=\intel-ucode.img initrd=\initramfs-linux-zen.img root=UUID=ff5d4ba3-25ce-46cd-9bab-dc98091801b1 rw rootflags=subvol=@ loglevel=3 mitigations=off nowatchdog modprobe.blacklist=iTCO_wdt
[phoenix@ArchLinux ~]$ lspci -tvnn
-[0000:00]-+-00.0 Intel Corporation 8th Gen Core Processor Host Bridge/DRAM Registers [8086:3ec4]
+-01.0-[01]--
+-02.0 Intel Corporation CoffeeLake-H GT2 [UHD Graphics 630] [8086:3e9b]
+-12.0 Intel Corporation Cannon Lake PCH Thermal Controller [8086:a379]
+-14.0 Intel Corporation Cannon Lake PCH USB 3.1 xHCI Host Controller [8086:a36d]
+-14.2 Intel Corporation Cannon Lake PCH Shared SRAM [8086:a36f]
+-14.3 Intel Corporation Cannon Lake PCH CNVi WiFi [8086:a370]
+-15.0 Intel Corporation Cannon Lake PCH Serial IO I2C Controller #0 [8086:a368]
+-15.2 Intel Corporation Cannon Lake PCH Serial IO I2C Controller #2 [8086:a36a]
+-16.0 Intel Corporation Cannon Lake PCH HECI Controller [8086:a360]
+-17.0 Intel Corporation Cannon Lake Mobile PCH SATA AHCI Controller [8086:a353]
+-1d.0-[02]----00.0 Kingston Technology Company, Inc. OM8PCP Design-In PCIe 3 NVMe SSD (DRAM-less) [2646:500c]
+-1d.6-[03]----00.0 Realtek Semiconductor Co., Ltd. RTL8111/8168/8211/8411 PCI Express Gigabit Ethernet Controller [10ec:8168]
+-1f.0 Intel Corporation HM470 Chipset LPC/eSPI Controller [8086:a30d]
+-1f.3 Intel Corporation Cannon Lake PCH cAVS [8086:a348]
+-1f.4 Intel Corporation Cannon Lake PCH SMBus Controller [8086:a323]
\-1f.5 Intel Corporation Cannon Lake PCH SPI Controller [8086:a324]
Because i don't have windows, i won't be able to update firmware.
First i will try using "nvme_core.default_ps_max_latency_us=0", then later, if i still get the error, i will permute between "iommu=soft" and "pcie_aspm=off".
On a different note, is there any chance of private information being in the output of journalctl? ArchWiki doesn't mention anything about this, that's why i am always reluctant to post full journalctl output.
Offline
The ethernet chip is on the same bus as the nvme, but you don't seem to be using that.
"Private" - yes. "Sensitive" - ideally no. It's not supposed to.
Sometimes network managing daemons leak publicly routable IPv4 (not in your case) and sudo gets audited, so "sudo playmyporn" would show up in the journal.
Your wifi MAC is there and some database at google and the chinese government likely can infer your geolocation from that (otherwise it's worthless to anyone not in your wifi range)
People often censor their username or hostname and I then just infer that it's something embarrassing but these things are "private" and usually not sensitive ("usually" because ireallyhatethemullahs is probably not a good username if you're living in Iran…)
It's perfectly fine to pseudonymize stuff, but please don't anonymize tokens (ie. replacing aaa with xxx and bbb with yyy is fine, replacing aaa and bbb with the same xyz is not)
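As an illustration of pseudonymizing rather than anonymizing, the filter below replaces every distinct MAC address with its own stable placeholder, so two different devices stay distinguishable in the posted log. It is a sketch: `pseudonymize_macs` and the `MAC_n` placeholder format are made up for this example, and it only handles colon-separated MAC addresses.

```shell
# pseudonymize_macs: read a log on stdin and replace each distinct MAC address
# with a distinct placeholder (MAC_1, MAC_2, ...). The same address always
# maps to the same placeholder, regardless of upper/lower case.
pseudonymize_macs() {
  awk '
    BEGIN {
      h = "[0-9A-Fa-f][0-9A-Fa-f]"
      re = h ":" h ":" h ":" h ":" h ":" h
    }
    {
      out = ""
      line = $0
      while (match(line, re)) {
        mac = tolower(substr(line, RSTART, RLENGTH))
        if (!(mac in seen)) seen[mac] = "MAC_" (++n)
        out = out substr(line, 1, RSTART - 1) seen[mac]
        line = substr(line, RSTART + RLENGTH)
      }
      print out line
    }'
}
```

Typical use would be `journalctl -b --no-pager | pseudonymize_macs > journal.txt` before uploading. Usernames or hostnames could be handled the same way with one placeholder per distinct value.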
Online
Got it. Also how do you deal with journalctl log output from other users? Do you use some program like `wget -qO- <url> | lnav`?
Do i mark this topic as solved? Testing this might again take a long time.
Offline
Something like that, yes.
Marking the thread as solved while we don't even know (for sure) what's going on would be deceptive - it's ok to keep it open.
Online
I ran into the issue again.
[phoenix@ArchLinux ~]$ cat /proc/cmdline
initrd=\intel-ucode.img initrd=\initramfs-linux-zen.img root=UUID=ff5d4ba3-25ce-46cd-9bab-dc98091801b1 rw rootflags=subvol=@ loglevel=3 mitigations=off nowatchdog modprobe.blacklist=iTCO_wdt nvme_core.default_ps_max_latency_us=0
[phoenix@ArchLinux ~]$ uname -r
6.9.7-zen1-1-zen
But this time i couldn't get journalctl logs, as i had to remove the secondary drive which i used to store journal. I had to give it back to my friend.
Do you know any way to do this without additional drives? As you saw previously the whole partition gets remounted as `ro` automatically. What can i do to get the logs out?
1. Do i just use `sudo mount -a` after the disk gets mounted as `ro`?
2. I have enabled ssh in the hope that i could get the journal out, but i am not sure it would work because of `ro`.
Last edited by phoenix324 (2024-07-05 12:53:47)
Offline
Does ssh work?
sudo journalctl -b | curl -F 'file=@-' 0x0.st
will upload the journal to 0x0.st and give you a link, otherwise do you have a usb key?
(You'll typically need < 1MB and certainly not 1GB for the journal)
Online
Does ssh work?
I tried triggering the readonly system bug behaviour manually using ->
[phoenix@ArchLinux ~]$ sudo mount /dev/nvme0n1p3 -o remount,ro
mount: /: mount point is busy.
dmesg(1) may have more information after failed mount system call.
Maybe i will have to wait till it happens again.
otherwise do you have a usb key?
I do have, but having the usb plugged all the time in laptop doesn't seem feasible.
Offline
The remount is a symptom, not the bug.
If you can ssh into the system* you can most likely also upload the journal to the network.
Curl could be an issue and journalctl could be if you cannot run new processes any more.
In doubt keep "sudo journalctl -f" running in a terminal and make photos of that if the system stalls.
And try "iommu=soft"!
having the usb plugged all the time in laptop doesn't seem feasible
SD card?
(sshd must be running and listening on some tcp port ahead. Nb. that systemd 256 for dumb reasons starts an sshd instance unconditionally, but that's only listening on a unix socket and will not do! Test the ability to ssh into the system while you've no issues)
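Whether sshd is really listening on a TCP port can be checked ahead of time from the `ss` output. A minimal sketch, assuming port 22 and the usual `ss -tln` column layout (local address in the 4th column); `tcp_port_listening` is a hypothetical helper name.

```shell
# tcp_port_listening PORT: read `ss -Htln` output on stdin and succeed if any
# listening socket has a local address ending in :PORT.
tcp_port_listening() {
  port=$1
  awk -v p=":$port" '
    substr($4, length($4) - length(p) + 1) == p { found = 1 }
    END { exit !found }'
}

# Example (port 22 is an assumption, adjust to the sshd config):
#   ss -Htln | tcp_port_listening 22 && echo "sshd reachable over TCP"
```

This only shows that something listens on the port; an actual `ssh user@laptop` from a second machine while the system is healthy is still the real test.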
Online
SD card?
Nice idea.
And try "iommu=soft"!
Will do now.
But i don't understand why this problem is happening now. I thought it might be my kernel, as i was using linux-zen, so i tried the linux kernel, but it also happened there. Now i tried linux-lts and it's still happening.
Is something wrong with my hardware? Should i start preparing backups?
http://0x0.st/XBlJ.txt and i was using these parameters; i removed 'nvme_core.default_ps_max_latency_us=0' thinking it was the fault of the kernel.
[phoenix@ArchLinux ~]$ cat /proc/cmdline
initrd=\intel-ucode.img initrd=\initramfs-linux-lts.img root=UUID=ff5d4ba3-25ce-46cd-9bab-dc98091801b1 rw rootflags=subvol=@ loglevel=3 mitigations=off nowatchdog modprobe.blacklist=iTCO_wdt
[phoenix@ArchLinux ~]$ uname -r
6.6.37-1-lts
And this is my tlp config ->
[phoenix@ArchLinux ~]$ cat /etc/tlp.d/01-custom.conf
# --------------------------------------
# Operation mode when no power supply can be detected: AC, BAT.
TLP_DEFAULT_MODE=BAT
# Operation mode select: 0=depend on power source, 1=always use TLP_DEFAULT_MODE
TLP_PERSISTENT_DEFAULT=1
# Do not suspend USB devices
USB_AUTOSUSPEND=0
# Restores radio device state (builtin Bluetooth, Wi-Fi, WWAN) from previous shutdown on boot: 0 – disable , 1 – enable
RESTORE_DEVICE_STATE_ON_STARTUP=1
# --------------------------------------
# Radio devices to disable on connect.
DEVICES_TO_DISABLE_ON_LAN_CONNECT="wifi wwan"
DEVICES_TO_DISABLE_ON_WIFI_CONNECT="wwan"
DEVICES_TO_DISABLE_ON_WWAN_CONNECT="wifi"
Offline
After a boatload of pci bus errors you end with
Jul 08 14:41:03 ArchLinux kernel: nvme nvme0: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0xffff
Jul 08 14:41:03 ArchLinux kernel: nvme nvme0: Does your device have a faulty power saving mode enabled?
Jul 08 14:41:03 ArchLinux kernel: nvme nvme0: Try "nvme_core.default_ps_max_latency_us=0 pcie_aspm=off" and report a bug
Jul 08 14:41:03 ArchLinux kernel: nvme 0000:02:00.0: Unable to change power state from D3cold to D0, device inaccessible
Jul 08 14:41:03 ArchLinux kernel: nvme nvme0: Disabling device after reset failure: -19
Re-add those next to iommu=off, you first want to establish a functional baseline by aggressive measures and then ease the intervention as much as possible from there; poking into the dysfunctional system will leave you in the dark because you can't know whether any combination of measures will work at all.
Next to that, the mere presence of tlp is more "concerning" irt than a wifi-specific config.
What's the output of "lspci -tvnn"?
(Basically what's in the nvme's neighbourhood)
Online
After a boatload of pci bus errors you end with
Re-add those next to iommu=off,
just to verify ->
[phoenix@ArchLinux ~]$ uname -r
6.9.8-arch1-1
[phoenix@ArchLinux ~]$ cat /proc/cmdline
initrd=\intel-ucode.img initrd=\initramfs-linux.img root=UUID=ff5d4ba3-25ce-46cd-9bab-dc98091801b1 rw rootflags=subvol=@ loglevel=3 mitigations=off nowatchdog modprobe.blacklist=iTCO_wdt nvme_core.default_ps_max_latency_us=0 pcie_aspm=off iommu=off
you first want to establish a functional baseline by aggressive measures and then ease the intervention as much as possible from there; poking into the dysfunctional system will leave you in the dark because you can't know whether any combination of measures will work at all.
Nice approach, will remember it.
What's the output of "lspci -tvnn"?
(Basically what's in the nvme's neighbourhood)
-[0000:00]-+-00.0 Intel Corporation 8th Gen Core Processor Host Bridge/DRAM Registers [8086:3ec4]
+-01.0-[01]--
+-02.0 Intel Corporation CoffeeLake-H GT2 [UHD Graphics 630] [8086:3e9b]
+-12.0 Intel Corporation Cannon Lake PCH Thermal Controller [8086:a379]
+-14.0 Intel Corporation Cannon Lake PCH USB 3.1 xHCI Host Controller [8086:a36d]
+-14.2 Intel Corporation Cannon Lake PCH Shared SRAM [8086:a36f]
+-14.3 Intel Corporation Cannon Lake PCH CNVi WiFi [8086:a370]
+-15.0 Intel Corporation Cannon Lake PCH Serial IO I2C Controller #0 [8086:a368]
+-15.2 Intel Corporation Cannon Lake PCH Serial IO I2C Controller #2 [8086:a36a]
+-16.0 Intel Corporation Cannon Lake PCH HECI Controller [8086:a360]
+-17.0 Intel Corporation Cannon Lake Mobile PCH SATA AHCI Controller [8086:a353]
+-1d.0-[02]----00.0 Kingston Technology Company, Inc. OM8PCP Design-In PCIe 3 NVMe SSD (DRAM-less) [2646:500c]
+-1d.6-[03]----00.0 Realtek Semiconductor Co., Ltd. RTL8111/8168/8211/8411 PCI Express Gigabit Ethernet Controller [10ec:8168]
+-1f.0 Intel Corporation HM470 Chipset LPC/eSPI Controller [8086:a30d]
+-1f.3 Intel Corporation Cannon Lake PCH cAVS [8086:a348]
+-1f.4 Intel Corporation Cannon Lake PCH SMBus Controller [8086:a323]
\-1f.5 Intel Corporation Cannon Lake PCH SPI Controller [8086:a324]
Last time when i gave the output of `lspci -tvnn`, you mentioned the 'ethernet controller' being there along with the nvme. So if i use the above parameters, will it have an effect on battery or performance?
Offline
Sorry, I forgot I asked you before to post this
Afaict from your previous journal you're not using the r8169 chip (ethernet), can you deactivate it in the BIOS/UEFI?
The parameters will cause more battery drain for sure. How much is to be seen.
You're disabling two power management features (APST and ASPM) and emulate the IOMMU in the CPU.
Online
Afaict from your previous journal you're not using the r8169 chip (ethernet), can you deactivate it in the BIOS/UEFI?
There is no option to disable ethernet. These were the options available there ->
SATA Mode Selection
intel (R) Speed Shift Technology
ERP lot 3 support
Wake up on lan S5 support
intel Virtualization technology
VT-d
Hyper-Threading
CPU C states
Acoustic Noise Mitigation
As you previously said, we are using all the extreme measures to get the system into a stable state. So in which order would i need to remove those three parameters after testing them for a week or two?
1. nvme_core.default_ps_max_latency_us=0
2. pcie_aspm=off
3. iommu=off
Offline