Hi,
I've been using Arch for a while, and this is the first issue I can't troubleshoot using the wiki and the forums. I've posted this in other places too; I hope that's not against the rules.
My goal with this post is to determine the root cause of the issues I'm running into, and take appropriate action to fix stuff long term. Concretely, I could probably reinstall everything, but if my disk is indeed failing, I'll be back to square 1 in a week...
I need this computer for work, so any help is immensely appreciated.
I'll start by listing the symptoms my system is showing right now and how they evolved over time. Then I'll list some information about my disk that I find inconsistent.
1/ System Symptoms
I reached the point where I can get to a working X (run my DE, mail client, browser, ...) maybe 20% of the time. The symptoms right now are:
- If I log in on tty1, the system starts the X server and a couple of programs start correctly (basically openbox and tint2), then everything freezes except the mouse. I can switch to tty2+, but I cannot log in (the password is reported as incorrect). After a while, on tty1, tint2 closes and openbox error messages show up, saying "input/output errors"
- If I log in on tty2+, I can log in fine and run a few commands (ls, df, ...). I tried "pacman -Syu base base-devel" from there, but the update could not complete because of input/output errors. After running pacman, no command would run anymore (ls, df, ...); all of them failed with input/output errors
- In a live USB, all filesystems pass fsck.vfat (/dev/sda1, ESP) and fsck.ext4 (/dev/sda2, root). I can run smartctl self-tests on my nvme drive, and they complete fine (smartctl /dev/sda -t {short,long}). However, GNOME Disks doesn't report any temperature for this drive, and the option to run SMART tests is greyed out.
(EDIT) Here is a journal log obtained as follows: 1/ boot into tty2 and clear /var/log/journal, 2/ reboot into X, let it freeze, and wait until input/output error messages appear in openbox windows, 3/ reboot into tty2 and recover the journal
[pastebin](https://pastebin.com/UitPFZ9U)
Grepping for errors doesn't highlight an obvious root cause. Then again, if these really are disk errors, it would make sense that they might not get logged to disk correctly...
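For anyone repeating this search, here is a minimal sketch of a filter for storage-error signatures in log text (for example the saved journal, or dmesg output captured before rebooting). The pattern list is an assumption based on common kernel messages and is not exhaustive; the function name and the sample lines are made up for illustration.

```python
import re

# Fragments the kernel commonly logs when block I/O fails.
# Assumption: this list is illustrative, not exhaustive.
DISK_ERROR_PATTERNS = [
    r"I/O error",
    r"EXT4-fs error",
    r"failed command",
    r"Remounting filesystem read-only",
]
_DISK_ERROR_RE = re.compile("|".join(DISK_ERROR_PATTERNS), re.IGNORECASE)


def find_disk_errors(lines):
    """Return (line_number, line) pairs that look like storage errors."""
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, start=1)
        if _DISK_ERROR_RE.search(line)
    ]


if __name__ == "__main__":
    # Made-up sample lines, not taken from the pastebin logs.
    sample = [
        "kernel: usb 1-1: new high-speed USB device number 2",
        "kernel: blk_update_request: I/O error, dev sda, sector 123456",
        "kernel: EXT4-fs error (device sda2): ext4_find_entry: reading directory",
        "systemd[1]: Started Network Manager.",
    ]
    for number, line in find_disk_errors(sample):
        print(number, line)
```

Matching on message fragments rather than on log priority is deliberate here: some of these messages may be logged at warning level and would be missed by an errors-only filter.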
In the past, I could still boot to X every time, but had slightly different issues. These issues are still present today whenever I do manage to boot to X. After a while (typically when starting a game, but not necessarily):
- Either the system would freeze completely, leaving only the mouse moving, with no access to tty2+, forcing a reboot by holding the power button
- Or the "already running" programs would continue running, but trying to start any new program would fail with an input/output error (either in an openbox error window for graphical programs, or in the console for console commands like ls, df, ...)
Here is a journal log of what happens when I 1/ boot into X correctly, without freezing, 2/ clear /var/log/journal, 3/ start GUI programs until running into input/output errors, 4/ reboot by holding the power button and recover the journal.
[pastebin](https://pastebin.com/NyMnRktJ)
2/ Disk-related inconsistencies
- Although this is clearly an nvme drive (Kingston SHPM2280P2/240), the BIOS is configured to use it as nvme and not sata, and the BIOS reports it as nvme, the drive shows up as /dev/sda and not /dev/nvme
- It is also impossible to use nvme-cli with this disk. nvme-cli fails with the error "Inappropriate ioctl for device", regardless of whether I run it from a live USB or not.
- Whether in a live USB or in the host OS, gnome-disks fails to give any info on the drive. No temperature is reported, and the options to read SMART status or run self-tests are greyed out (although they seem to work via smartctl on the live USB)
- I find the drive runs very hot. In Windows (where the disk is not used at all), it idles at 47°C. In Linux, I cannot get a temperature reading, but by touch, the drive is uncomfortably hot at idle: you feel a burning sensation in your fingertip within 3 seconds. The burning sensation on human skin starts at around 50°C. Under load, it burns your finger immediately, so it must be hotter.
- To fix this, I removed the sticker from the drive and put a pi heatsink on the controller (not the DRAM cache, not the NAND chips). The chip is probably still too hot, because touching the heatsink feels exactly the same as touching the bare drive did, both at idle and under load. I also updated the firmware using the manufacturer's utility in Windows (note that I had input/output errors before updating anything).
3/ Conclusion
The frequent input/output errors lead me to believe that something is wrong with the disk. The high temperature could have caused some aging, especially since the drive ran for a while with a sticker on it and no heatsink. However, all filesystems seem to be okay, and the SMART tests run from the command line on the live USB report no errors.
Do you think this could be a disk error? Could it be something else? Can you suggest a next step to find out what's going on?
Any help is greatly appreciated.
Some final notes
- I tried unplugging-replugging the drive
- The kernel is linux-lts
- The memory passes 1 round of memtest
- Everything in fstab is commented out except /dev/sda
- No overclock whatsoever
Last edited by qbvt (2020-07-13 13:00:13)
Offline
Regarding your NVMe drive, is this a new drive? And what is your motherboard? I'm asking because, as you probably already know, there are 2 types of NVMe drive, PCIe and SATA, and I just wanted to make sure your motherboard supports the PCIe drive. I know it shows up as NVMe in the BIOS, but if you're getting it as /dev/sda in the OS, clearly the OS is not seeing it in the PCIe lane. One cannot install an NVMe drive made for PCIe in the SATA lane and vice versa.
Regarding heat, they do run hot, but that is fairly normal. My first NVMe drive, which I got when the technology was fairly new, is stable even at 80 degrees and idles at 60-70 degrees without any problem. However, my far newer Samsung NVMe drive runs far cooler and barely reaches 60 degrees under heavy IO.
Regarding your x problem, are you using a discrete GPU?
Offline
Thanks for your answer!
The drive is not new; it has worked correctly for ~2 years. The motherboard is an ASRock AB350 ITX/AC, with a relatively recent BIOS, although not the latest (newer BIOSes are designed for more recent Ryzen CPUs and are not recommended for older ones). I'm using a discrete nvidia GPU with the 'nvidia-dkms' driver. Do you think it could be GPU-related? In the end, everything seems to be caused by these input/output errors...
I looked into this /dev/sda thing. The drive is actually a PCIe AHCI drive (Today I Learned...), which is indeed not NVMe (this explains that), but is not SATA either. It does indeed show "PCIe" as "Transport" when checked using $ hdparm -I
Offline
The drive is actually a PCIe AHCI drive (Today I Learned...), which is indeed not NVMe (this explains that), but is not sata either.
Mmmm... I'm not sure what that means but I'm intrigued and will try to find out more about it. Was it working before with the same motherboard and setup?
Yes, I have a suspicion it is GPU-related. I've had issues with the recent nvidia-570.57-2, but they were resolved by the latest upgrade to linux 5.7.8 and nvidia 450.57-2. Have you tried updating your system to see if the problem continues? If it does, perhaps you can try the stock nvidia 570.57-2?
Offline
It's just a different protocol. I did no hardware change whatsoever in the last 2 years, and everything worked fine. I already updated the system, but I'll try the stock nvidia.
AHCI is just an older protocol: https://www.anandtech.com/show/7843/tes … ith-asus/4
Offline
Was your BIOS upgrade recent? If it was, there might be a switch in your BIOS that toggles the PCIE and SATA NVME that probably got reset in the process.
Offline
It's unlikely that it's a BIOS issue; I last updated maybe 4 months back.
The drive not being recognized as NVMe is also normal. I looked into it after your comment and concluded the following:
- NVMe, AHCI, (IDE...) are protocols
- PCIe, SATA are interfaces
- M.2 is a connector, but not an interface (which is confusing, because SATA typically implies both the connector and the interface)
- Protocols are implemented over interfaces: SATA drives typically implement AHCI or IDE protocols, while PCIe drives will typically implement AHCI (older drives) or NVMe protocols.
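On Linux, the device node name follows the driver stack (and so the protocol), not the connector. A minimal sketch of that mapping, assuming the usual kernel naming conventions; the function name is hypothetical and the check is purely by name, so it cannot tell an AHCI disk from a USB disk:

```python
import re


def guess_stack_from_node(node):
    """Guess the kernel storage stack from a block device node name."""
    name = node.rsplit("/", 1)[-1]
    # NVMe namespaces are named nvme<controller>n<namespace>, optionally p<partition>.
    if re.fullmatch(r"nvme\d+n\d+(p\d+)?", name):
        return "nvme"
    # sdX names come from the SCSI disk layer, which libata uses for AHCI
    # devices (USB mass storage shows up the same way).
    if re.fullmatch(r"sd[a-z]+\d*", name):
        return "scsi"
    return "unknown"


if __name__ == "__main__":
    for node in ("/dev/sda", "/dev/sda2", "/dev/nvme0n1", "/dev/nvme0n1p1"):
        print(node, guess_stack_from_node(node))
```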
AHCI drives are not supposed to show up as /dev/nvme, that's completely normal. It also makes sense that nvme-cli doesn't recognize it. However, it's clearly not SATA:
- It benchmarks in gnome-disks at around 1.5 GB/s (I don't remember if read or write), which is beyond SATA capabilities (6 Gbps = 0.75 GB/s raw, and about 0.6 GB/s usable after 8b/10b encoding)
- "$ hdparm -I /dev/sda" reports PCIe as "Transport"
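The bandwidth argument can be checked with a few lines of arithmetic. This is a sketch: the 1.5 GB/s figure is the approximate benchmark quoted above, and the 8b/10b factor is the standard SATA line encoding.

```python
# SATA III line rate and encoding overhead.
SATA3_LINE_RATE_GBIT = 6.0    # gigabits per second on the wire
ENCODING_EFFICIENCY = 8 / 10  # 8b/10b: 8 data bits per 10 transmitted bits

raw_gbyte = SATA3_LINE_RATE_GBIT / 8            # bits to bytes
usable_gbyte = raw_gbyte * ENCODING_EFFICIENCY  # after encoding overhead

measured_gbyte = 1.5  # approximate gnome-disks result quoted above


def exceeds_sata3(throughput_gbyte):
    """True if the throughput is above even the raw SATA III line rate."""
    return throughput_gbyte > raw_gbyte


if __name__ == "__main__":
    print(f"raw: {raw_gbyte:.2f} GB/s, usable: {usable_gbyte:.2f} GB/s")
    print("measured exceeds SATA III:", exceeds_sata3(measured_gbyte))
```

Even against the raw line rate, the measured figure is about twice what a SATA III link can carry, so the conclusion does not depend on the encoding overhead.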
Now for an update:
- While installing the nvidia package, input/output errors showed up and borked the system. It's too much effort to fix it at this point, so I'll reformat the drive and restore from backups.
- From then on, hopefully I won't get errors for a while. We'll see
- Finally, I found out about "badblocks", which I would describe as "memtest for your disks". I'll run two or three passes, see if it reports any errors, and eventually report back for reference. However, I'm kind of busy now - I set up my backup computer and I have some work to catch up on, so this might take me a few days.
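To illustrate the idea behind a read-only surface scan (the default mode of badblocks), here is a conceptual sketch: read the device block by block and record the blocks whose read fails. It is not a replacement for badblocks; the function name and the default block size are assumptions, and because it reads through the page cache, a real device could return cached data instead of hitting the media.

```python
import os


def read_scan(path, block_size=4096):
    """Read `path` block by block; return (total_blocks, bad_block_indexes)."""
    bad_blocks = []
    fd = os.open(path, os.O_RDONLY)
    try:
        # Seeking to the end gives the size for regular files and,
        # on Linux, for block devices as well.
        size = os.lseek(fd, 0, os.SEEK_END)
        total_blocks = (size + block_size - 1) // block_size
        for index in range(total_blocks):
            try:
                os.pread(fd, block_size, index * block_size)
            except OSError:
                # A failed read (e.g. EIO) marks the block as bad.
                bad_blocks.append(index)
    finally:
        os.close(fd)
    return total_blocks, bad_blocks


if __name__ == "__main__":
    import sys

    # Usage (needs read permission on the target): python scan.py /dev/sdX
    total, bad = read_scan(sys.argv[1])
    print(f"{total} blocks read, {len(bad)} bad: {bad}")
```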
In any case, I really appreciate you taking the time and helping me out!
Offline