Hello,
I have a computer with an Nvidia graphics card which used to run the proprietary nvidia driver.
Some time ago my hard drive started losing data, so I had to recover what I could from it and moved to an SSD.
I did a fresh install of Arch on the SSD, and ever since then I've been getting input/output errors.
Every time, I have to reboot to get my system working again.
From all the times it has happened I have recognised a pattern.
I run a program that uses some form of hardware acceleration (games like Minecraft or Factorio).
I get an input/output error.
I wrote the journal to a file to debug this, and what I see is that while these programs are running, my SSD gets I/O errors for some reason. Someone on #archlinux said that after a few I/O errors on the drive, the kernel tries to remount the drive read-only.
And I saw that in the journal.
Due to this, all my partitions are mounted read-only and my system becomes unusable.
To stress the point, this only happens when some kind of game is running.
If I leave my desktop running for 5 hours, nothing happens.
If I load my CPU to 100% for 5 hours, nothing happens.
The I/O error only occurs after a game has been running for somewhere between 30 minutes and 2 hours.
People on #archlinux told me to check the SSD, so I did. I ran SMART tests.
Everything was fine. I ran fsck and everything was clean. All my data is intact too.
I updated the kernel and drivers, but the problem persists.
I switched to the nouveau driver and everything works fine, except that the nouveau driver is broken: it sometimes hangs the PC and breaks Xorg.
So I'm pretty sure it is the fault of the driver.
Details:
Kernel: 5.10.24-1-lts
Nvidia-dkms: 460.67
Help?
Offline

I'm pretty sure it's temperature or power consumption. Or is the SSD actually an NVMe?
Also https://bbs.archlinux.org/viewtopic.php?id=57855 - "an input output error" is equivalent to "somehow doesn't work"
Offline
Thanks for the reply,
I don't think my ssd is an NVMe.
I got these logs after the I/O error occurred.
I switched back to the Nvidia driver, ran a game, and after about 45 minutes I got an I/O error.
No new programs could be run. I had a terminal open from before, so I ran ls and it just hung.
This is what happens whenever I get an I/O error.
I got these logs:
Dmesg in follow mode: log
Journalctl in follow mode (This is only from when I ran the command which is after reaching desktop): log
Xorg: log
Output of smartctl -a: log
I can't see any errors in these logs but that just might be me.
Offline

195 Hardware_ECC_Recovered  0x0000   100   100   000    Old_age   Offline      -       140799
…
199 UDMA_CRC_Error_Count    0x0000   100   100   000    Old_age   Offline      -       37
Do these numbers rise with the I/O errors?
There are no I/O errors at all in those logs?
What exactly are you talking about? (If in doubt, take a photo if you can't preserve the information otherwise. Only post a link; the board has a 200x200px limit.)
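One way to check whether those two counters rise is to save the attribute table before and after a test run and compare the raw values. This is only a minimal sketch: it assumes the tables were saved with `smartctl -A` and that the raw value is the last field on the line (true for the two attributes quoted above, not for every attribute). `smart_raw` and `smart_delta` are hypothetical helper names.

```shell
#!/bin/sh
# Print the raw value (last field) of one attribute from saved `smartctl -A` output.
# Usage: smart_raw ATTRIBUTE_NAME FILE
smart_raw() {
    awk -v name="$1" '$2 == name { print $NF }' "$2"
}

# Print "attribute before after delta" for the two counters discussed here.
# Usage: smart_delta BEFORE_FILE AFTER_FILE
smart_delta() {
    for attr in Hardware_ECC_Recovered UDMA_CRC_Error_Count; do
        b=$(smart_raw "$attr" "$1")
        a=$(smart_raw "$attr" "$2")
        # Missing values are treated as 0 so the arithmetic never fails.
        echo "$attr ${b:-0} ${a:-0} $(( ${a:-0} - ${b:-0} ))"
    done
}
```

For example: save `smartctl -A /dev/sda > before.txt` (the device name is an assumption), reproduce the error, reboot, save `after.txt` the same way, then run `smart_delta before.txt after.txt`. A rising UDMA_CRC_Error_Count usually points at the link between drive and controller rather than at the drive's storage itself.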
Offline
Hardware_ECC_Recovered has increased by 1 after I tested but I cannot guarantee that it was caused by the I/O error.
The I/O error logs are not written to the disk, so I had to follow dmesg and the journal on screen.
After running a game for 10-20 minutes, I could see error messages and the game froze. I couldn't run any new commands,
but I could see logs being printed to the terminal.
I couldn't copy them, however, because of the I/O error.
I ended up taking photos of the errors.
Here is a tar containing all the photos. Because of how I took the photos, the dmesg and journal logs got mixed up, but I think it is pretty easy to tell which are which.
The ordering is also gone. Sorry.
error_log.tar
Offline

The drive ceases to respond, even on soft and hard resets.
The SMART data doesn't suggest that this is a broken drive.
There's little chance that this is bus interference; there's also no problem recorded w/ the nvidia blob, and your GPU seems fine.
My money is still on power.
Can you force the nvidia driver to operate in "desktop" mode or underclock it and see whether you can still trigger the problem?
Offline
I don't know how to run in desktop mode but I'll try to underclock and see if that helps
Offline
I used nvidia-settings to underclock.
I decreased the clock by 105 MHz (the maximum allowed).
I decreased the memory transfer rate by 500.
I played Xonotic and got the same I/O error.
Offline

GPU GeForce GT 730 - that's a 38W possibly even passive GPU anyway… just make sure it's in the proper PCIe slot and that no cables touch the cooler…
You could try the 390xx legacy driver to narrow it down :\
https://aur.archlinux.org/packages/nvidia-390xx-dkms/
https://aur.archlinux.org/packages/nvidia-390xx-utils/
Offline
I switched from the main branch to 390xx but it still caused the same I/O error. 
Offline

Not the kernel, not the driver (version), not the disk, …
make sure it's in the proper PCIe slot and that no cables touch the cooler
Offline
I'm still on the 390xx branch.
I removed the gpu and put it back in and made sure everything is connected properly.
The gpu has a small fan and there are no cables near it blocking it.
I tried the Xonotic test again and the same I/O error occurred.
One thing I noticed in dmesg after boot is this:
[ 16.730326] resource sanity check: requesting [mem 0x000c0000-0x000fffff], which spans more than PCI Bus 0000:00 [mem 0x000d0000-0x000dffff window]
[ 16.730692] caller _nv001015rm+0x1bf/0x1f0 [nvidia] mapping multiple BARs
Does this have anything to do with the problem?
Offline

https://forums.developer.nvidia.com/t/d … bars/58088
but
https://forums.developer.nvidia.com/t/n … 14/55670/8
Also everybody gets that and your problem is suspiciously reproducible…
nb. that not all PCIe slots are equal, https://en.wikipedia.org/wiki/PCI_Express
Consult your board manual about this (though this should™ crash the GPU rather than the SATA bus…)
Offline
I read the links and there seem to be no answers to fix it.
All PCIe slots are equal when you have only one standard slot.
There is no board manual.
What should I do? 
Offline

Do you have a second SATA port on the board?
Btw, "[    0.000000] DMI: To Be Filled By O.E.M. To Be Filled By O.E.M./G41, BIOS 080015  10/08/2016" - wtf… what board is that actually? Are there still BIOS updates available?
Offline
I have 4 SATA ports on the board. The SSD was plugged into port 3, so I tried ports 1, 2 and 4, but no luck.
I got the same error.
The board is an obscure cheap motherboard, probably from China. It has a BIOS and I don't think there are any updates available.
Offline
UPDATE:
I had lost hope in the proprietary drivers, so I switched to the nouveau drivers (even though they are buggy).
I thought that the proprietary drivers were the problem and that the nouveau drivers worked fine.
Then I looked into improving performance with nouveau, and it turns out that with the nouveau driver the GPU stays at the lowest 'clock stage'.
You have to manually set the GPU to a higher clock.
You can check 'clock stages' like this:
me@ME: ~ > sudo cat /sys/kernel/debug/dri/0/pstate
[sudo] password for me:
07: core 405 MHz memory 810 MHz
0f: core 653-901 MHz memory 1600 MHz
AC: core 405 MHz memory 810 MHz
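Switching to a higher level is done by writing its ID back into the same file as root. This is a minimal sketch, assuming the debugfs path shown above and a nouveau build that supports reclocking on this GPU. `pstate_levels` and `set_pstate` are hypothetical helper names.

```shell
#!/bin/sh
# Default path taken from the command above; card index 0 is an assumption.
PSTATE_FILE=/sys/kernel/debug/dri/0/pstate

# List the selectable level IDs, skipping the "AC:"/"DC:" current-state lines.
# Usage: pstate_levels [FILE]
pstate_levels() {
    awk -F: '$1 != "AC" && $1 != "DC" { print $1 }' "${1:-$PSTATE_FILE}"
}

# Request a level by writing its ID into the pstate file (needs root on real hardware).
# Refuses IDs that the file does not list.
# Usage: set_pstate LEVEL [FILE]
set_pstate() {
    level=$1
    file=${2:-$PSTATE_FILE}
    if pstate_levels "$file" | grep -qx "$level"; then
        echo "$level" > "$file"
    else
        echo "unknown level: $level" >&2
        return 1
    fi
}
```

With the listing above, `set_pstate 0f` run as root would request the 653-901 MHz level, and `set_pstate 07` would go back to the low one.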
I switched to the high clock speed, and while running unigine-valley I got hardware errors like those in error_log.tar.
If I ran the benchmark only for a little while, I got one or two errors followed by successful error recovery,
whereas running it for 15 minutes causes the kernel to give up and mount the SSD read-only.
I guess there is a correlation between higher clock speeds and the SATA interface crashing?
My GPU is a GT 730 and it doesn't even take power from the power pins. It even comes in passively cooled variants, so I don't think it is overheating just from running at full capacity.
A few years ago I used to overclock it and got a good 200 MHz clock increase with no glitches or problems of any kind.
The problem only started after switching to an SSD, but I can't guarantee that because I didn't use to look at dmesg logs back then.
Offline

When you replaced the disk, did you also replace the SATA cable? Do you have spare ones?
That or the PSU or the board - but getting the same behavior w/ nouveau and depending on the power draw makes a HW issue incredibly likely…
Offline
I do have a spare SATA cable. I tried it and it's no different.
I also looked into SATA errors and tried some things suggested in the kernel documentation and the ArchWiki,
e.g. acpi=off and libata.force=noncq,
but none of them made any difference.
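Since these are boot parameters, it is worth confirming that they actually reached the running kernel before concluding that they make no difference. This is a minimal sketch that checks `/proc/cmdline`. `has_kernel_param` is a hypothetical helper name.

```shell
#!/bin/sh
# Succeed if PARAM appears as a whole word on the kernel command line.
# Usage: has_kernel_param PARAM [FILE]   (FILE defaults to /proc/cmdline)
has_kernel_param() {
    param=$1
    file=${2:-/proc/cmdline}
    # One parameter per line, then an exact fixed-string match on the whole line.
    tr ' ' '\n' < "$file" | grep -qxF -- "$param"
}
```

For example, `has_kernel_param libata.force=noncq && echo active` prints `active` only if the option was on the command line of the current boot.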
As far as I know, increasing the clock speed does increase the power draw, but that by itself doesn't cause the SATA error; a heavy 3D application needs to be running for the error to occur.
I'm thinking of trying to update the SSD firmware to see if that helps...
Offline

Maybe it's completely unrelated to the GPU itself but caused by eg. some cache burst induced by "heavy 3D application".
How does your SSD hold up against a dedicated disk stress test?
Offline
How do I conduct a dedicated stress test?
Offline

Just run a couple of parallel dd's that write some GB of /dev/zero to some file on the drive and read some superlarge™ file into /dev/null.
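Something along these lines would do. It's only a sketch: `disk_stress` is a made-up helper name, it assumes GNU dd (for `bs=1M` and `conv=fsync`), and the directory, size and job count are whatever you pass in.

```shell
#!/bin/sh
# Write JOBS files of SIZE_MB MiB each in parallel, then read them all back in parallel.
# Returns non-zero if any dd fails. The stress.N files are left in DIR for inspection.
# Usage: disk_stress DIR SIZE_MB JOBS
disk_stress() {
    dir=$1
    size_mb=$2
    njobs=$3
    rc=0

    # Write phase: zeros -> file, flushed to the drive before dd exits.
    pids=""
    i=1
    while [ "$i" -le "$njobs" ]; do
        dd if=/dev/zero of="$dir/stress.$i" bs=1M count="$size_mb" conv=fsync 2>/dev/null &
        pids="$pids $!"
        i=$((i + 1))
    done
    for p in $pids; do wait "$p" || rc=1; done

    # Read phase: file -> /dev/null.
    pids=""
    i=1
    while [ "$i" -le "$njobs" ]; do
        dd if="$dir/stress.$i" of=/dev/null bs=1M 2>/dev/null &
        pids="$pids $!"
        i=$((i + 1))
    done
    for p in $pids; do wait "$p" || rc=1; done

    return "$rc"
}
```

`disk_stress ~/stresstest 3072 4` would write and read back four 3 GB files in parallel (the directory has to exist). Files smaller than RAM may be read back from the page cache instead of the drive, so pick the sizes accordingly.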
Offline
I ran these tests:
single:
zero -> file (10GB)
file -> null (10GB)
parallel 4:
zero -> file (3GB)
file -> null (3GB)
I got 200MB/s on the single run and 50MB/s on each job in the parallel run. It did not error out.
Offline

it did not error out
That's actually the interesting part.
It's either power or some EM interference (or leakage) - at least I've no other explanation for "heavy GPU load causes IO errors"
Offline
Check out my current SMART log:
The attributes have changed suspiciously! I thought power-on hours only go forward?
log
I also did some testing...
I booted Linux Mint from a USB stick and ran unigine valley on it. (It uses the nouveau driver by default, so I raised the clock state.)
It worked fine with no errors.
Then I mounted all the SSD filesystems under /mnt and copied the unigine valley folder from the flash drive to my home folder on the SSD.
When I ran unigine valley from my SSD's home folder in Linux Mint, the SSD errored out with the same error as in the dmesg logs.
Back on my main setup,
I also tried putting unigine valley on a flash drive and running it from there, but the SSD still had the same error...
I don't have any clue what to do now... Nor can I buy another storage device.
Offline