Title says it all. I'm using libvirt, qemu, and KVM to pass a video card into a Windows VM with OVMF for gaming (and also into an Arch Linux VM for Linux games that need a better graphics card). It's been working really great for months, but for the past week or two, roughly half the time I shut the Windows guest down normally, my entire computer hard-freezes the second the guest finishes shutting down. The display is frozen, I can't switch TTYs, and I'm guessing all the processes stop running, because I share internet with another physical computer over ethernet, and that computer loses its internet access during the freeze. I update pretty frequently, probably once a week on average, so it could be new versions of anything. Since the problem doesn't happen in absolutely every case, I wanted to see if anyone's in a similar boat before I start testing package downgrades.
Oh, and it's worth noting that I'm currently running the linux-lts kernel because of this bug in 4.2 and beyond: https://bugzilla.kernel.org/show_bug.cgi?id=107561
Because of that bug, I have no way to test whether I still have this issue on anything newer than 4.1.
EDIT: A lot of new information has turned up. The details are in the posts below, but here's the summary:
- it ONLY happens when passing through a PCI device
- it affects the Windows and Linux guests equally
- it will only, and always, freeze on shutdown if the guest has been running for ~2 hours or longer
- it's still a problem in Linux 4.4 (which fixed the aforementioned bug that kept me on 4.1)
Last edited by Cadeyrn (2016-06-26 09:02:26)
Offline
...so it could be new versions of anything.
It would probably be helpful to post the exact current versions of the relevant packages, plus your configs, commands, etc.
Are you familiar with our Forum Rules, and How To Ask Questions The Smart Way?
BlueHackers // fscanary // resticctl
Offline
Right. I'll do that later today when I have access to my computer. In the meantime, you can assume I'm using whatever was the newest version of each package on Dec 3. As for configs and custom commands: I'm passing in 3 PCI devices (the video card, the video card's HDMI audio device, and a USB3 card). I'm also running a qemu hook script that forwards some ports on libvirt's virtual network (roughly sketched below), and I have the TianoCore EDK2 UEFI thingy as its BIOS, just like in the OVMF guide on the Arch Wiki. Everything else is the standard stuff you'd expect for Windows 10 in libvirt.
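The hook is basically the same idea as the port-forwarding hook examples floating around for /etc/libvirt/hooks/qemu; the guest name, address, and port below are placeholders rather than my real values:
#!/bin/bash
# /etc/libvirt/hooks/qemu -- libvirt passes the guest name as $1 and the
# operation (start, stopped, reconnect, ...) as $2
GUEST_IP=192.168.122.100   # placeholder: the VM's address on libvirt's virtual network
PORT=13000                 # placeholder port to forward
if [ "$1" = "win10" ]; then
    if [ "$2" = "stopped" ] || [ "$2" = "reconnect" ]; then
        iptables -t nat -D PREROUTING -p tcp --dport $PORT -j DNAT --to $GUEST_IP:$PORT
        iptables -D FORWARD -d $GUEST_IP -p tcp --dport $PORT -j ACCEPT
    fi
    if [ "$2" = "start" ] || [ "$2" = "reconnect" ]; then
        iptables -t nat -I PREROUTING -p tcp --dport $PORT -j DNAT --to $GUEST_IP:$PORT
        iptables -I FORWARD -d $GUEST_IP -p tcp --dport $PORT -j ACCEPT
    fi
fi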
EDIT: package versions:
linux-lts 4.1.13-1
libvirt 1.2.21-1
qemu 2.4.1-1
Last edited by Cadeyrn (2015-12-10 23:37:05)
Offline
I finally managed to figure out SOMEthing about this problem. It turns out the hang is NOT intermittent. It happens if, and only if, the VM has been running for a while, although I don't know exactly how long "a while" is. I know it never happens under an hour and always happens after several hours; my guess is that the tipping point where it'll crash my host is around the 2-hour mark. And since I've been regularly updating my system this whole time, it's not version-specific (unless it's specific to qemu 2.4.1, or is a regression that has yet to be fixed).
linux 4.4.1-2
libvirt 1.3.1-2
qemu 2.4.1-2
Offline
Alright, even more new info. I've confirmed that the crossover point is about 2 hours, and it happens on my Arch Linux guest too, so it's not specific to the guest's OS. I've also just confirmed that it only happens when I pass PCI devices into the guest. The problem started happening around the same time I upgraded to qemu 2.4.1 (from 2.4.0.1), so I'm going to downgrade to 2.4.0.1 and see if that fixes it.
EDIT: Downgrading did not fix it. I don't think it's qemu's fault.
Last edited by Cadeyrn (2016-02-14 23:52:29)
Offline
Exact same issue here, but on Gentoo. I'm also passing a video card with HDMI sound, and using OVMF for Windows 10. I'll post my package versions here to help, but Cadeyrn has already given all the info that I know to mention. I'll try to capture some crash dumps, but so far kexec has failed to trigger. Logs aren't helpful and show no abnormal behavior.
Linux 4.1.6-ck
Qemu 2.5.0 *
Not using libvirt.
*includes patch for Nvidia cards found here: https://www.redhat.com/archives/vfio-us … 00198.html
Last edited by RomeIsBurning (2016-02-22 21:16:08)
Offline
RomeIsBurning: Welcome to the Arch forums. Please note we only support Arch Linux here, so information pertaining to your (system|problem|solution) may or may not be relevant to OP and others experiencing this issue.
Are you familiar with our Forum Rules, and How To Ask Questions The Smart Way?
BlueHackers // fscanary // resticctl
Offline
Thanks, fukawi2. I was already aware of this. I don't expect any assistance with my own setup, but I was hoping to help solve the issue in general and show that it's not Arch-specific. Just let me know if there's anything I can provide that would be of use.
Offline
You are exactly what I needed. I was hoping somebody from a different distro would come up with the same issue, like what happened last time with the other recent PCI passthrough bug, so that we know it's not distro-specific. Now it's just a matter of tracking down exactly which package and which version caused it and submitting a bug, and I may have already ruled out qemu. I should try downgrading to an even older version to make sure. You've already ruled out libvirt by not using it.
EDIT: I tried downgrading qemu further, but the next version down (2.3) requires an old version of ncurses, so I'm just going to assume it's not qemu. I know for a fact that I didn't notice the problem until after upgrading to 2.4.1, but going down to 2.4.0 didn't fix it.
Last edited by Cadeyrn (2016-02-26 02:53:52)
Offline
A quick failure to report: I tried the VFIO mailing list suggestion of not attaching the HDMI audio device to qemu, while leaving it on the vfio-pci driver. No change in behavior.
Update: Not important enough for its own post, but I've discovered a few things. First, graphics-intensive gameplay (GTA V) can make the crash happen sooner: a shutdown after only 15 minutes of such gameplay resulted in a crash, whereas under light usage that only occurs after several hours. Secondly, bringing up a second passthrough VM with separate hardware will cause the same crash. The situation is identical to a shutdown, right down to the two-hour limit: bringing up another VM before the two-hour mark will not result in a crash (except in the heavy-gameplay situation).
Last edited by RomeIsBurning (2016-03-05 03:18:39)
Offline
I didn't realize that was a suggestion. I don't attach the HDMI audio to my Linux VM at all, because it refuses to work (I use networked sound instead), so I could already have confirmed that it's not a fix.
Last edited by Cadeyrn (2016-03-04 17:06:39)
Offline
I discovered something interesting by complete accident: binding any PCI device to any driver after the two-hour mark of qemu running - even PCI devices not involved with qemu - will cause the same freeze. I think that this might be an important clue toward solving this. Cadeyrn, if you could confirm this behavior on your end somehow, that would be a huge help.
Offline
Wait, how exactly do I do that? Do I just not load the module for the PCI device I'm using to test it until after the 2-hour mark?
Offline
Wait, how exactly do I do that? Do I just not load the module for the PCI device I'm using to test it until after the 2-hour mark?
Sorry, I should have been more clear. Use the sysfs interface to unbind, and then rebind, some PCI device. For example, moving a sound card from its current driver (pci-stub in my case) to the sound driver, as root:
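# detach the card from whatever driver currently owns it, then hand it to the sound driver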
echo 0000:08:00.0 > /sys/bus/pci/devices/0000:08:00.0/driver/unbind
echo 0000:08:00.0 > /sys/bus/pci/drivers/snd_ctxfi/bind
You can try this with any device not attached to qemu, as long as you bind it to the correct driver. Attempt these commands both before the two-hour mark and after it. The first time, it should work without error; the second time, it should initiate the freeze. For me, the freeze occurred on the bind rather than the unbind. If there is no freeze, try shutting down qemu as normal to see if that freezes instead. If the PCI bind doesn't freeze but the shutdown does, then the bind freeze is only on my end and not a symptom we have in common, but I suspect that won't be the case.
Last edited by RomeIsBurning (2016-03-11 08:21:36)
Offline
Curses! I tested it just now and got nothing, because it didn't even freeze when I shut it down. That happens occasionally. I'll have to try again next time and... hope this annoying issue happens. Yay for tests.
Offline
Yay for tests.
Good luck! Thanks for all the work you've put into solving this, so far.
Offline
I tested it again, and still nothing (I also tested it a second time last night and didn't mention it). That's 3 tests across 2 boots, and I can't get it to crash. Has the entire problem gone away? I'm still regularly updating everything, so an update could have fixed it.
Offline
I just gamed for 6 hours off and on, shut down, and got no crash. What an anticlimactic way for this to be solved. When I get off work, I'll get a list of recently updated packages; maybe, if this occurs again in the future, someone can use that info to fix it. You might try the same, but I'm not sure how easy it is to get such a list on Arch. Again, thanks for your help.
Offline
Arch keeps it all in /var/log/pacman.log.
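Pulling the recent upgrades out of it is just a grep away; off the top of my head, something like:
grep upgraded /var/log/pacman.log | tail -n 30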
The only upgrades I made that could've fixed it were on March 3:
pciutils 3.4.0-1 -> 3.4.1-1
linux (and -headers) 4.4.1-2 -> 4.4.3-1
But I'm not so sure, because I think the problem still showed up between 3/3 and 3/11, and 3/11 is when it suddenly disappeared, even though I upgraded nothing between those dates.
I'm going to keep testing the PCI binding for now, in case the problem shows up again and I catch it red-handed.
Also, now that I've learned about sysfs and PCI binding, I'm wondering if it would be possible to bind my gaming video card to nouveau when I want to play a demanding game on Linux without having to use my Linux VM (that would be awesome). I can unbind it and re-bind it to vfio-pci without any issues, but when I try to bind it to nouveau (after unbinding it from vfio-pci), I get the infamous "bash: echo: write error: No such device". Am I doing it wrong, or does Linux/nouveau simply not support re-binding graphics cards yet? Should I switch from vfio-pci to pci-stub instead, and would I still be able to bind the video card to a VM from there?
Last edited by Cadeyrn (2016-03-13 20:20:03)
Offline
I don't see any updated packages that we have in common, but for the sake of completeness, here is my list.
Thu Mar 10 14:37:52 2016 >>> net-libs/http-parser-2.6.2
Thu Mar 10 14:38:02 2016 >>> www-plugins/chrome-binary-plugins-49.0.2623.87_p1
Thu Mar 10 14:39:46 2016 >>> net-libs/nodejs-5.7.1
Thu Mar 10 14:39:57 2016 >>> www-client/google-chrome-49.0.2623.87_p1
Fri Mar 11 21:11:12 2016 >>> net-libs/nodejs-5.8.0
Fri Mar 11 21:11:32 2016 >>> dev-vcs/git-2.7.3
Fri Mar 11 22:25:22 2016 >>> www-client/inox-48.0.2564.109
Sat Mar 12 21:56:02 2016 >>> media-libs/a52dec-0.7.4-r7
Sat Mar 12 21:56:14 2016 >>> media-libs/libdca-0.0.5-r3
Sat Mar 12 21:56:47 2016 >>> media-libs/libsdl-1.2.15-r9
Sat Mar 12 21:57:02 2016 >>> media-libs/openal-1.15.1-r2
Sat Mar 12 21:57:15 2016 >>> media-libs/libcanberra-0.30-r5
Sat Mar 12 21:57:37 2016 >>> media-sound/sox-14.4.2
Sat Mar 12 22:02:24 2016 >>> media-video/ffmpeg-2.8.6
Sat Mar 12 22:02:42 2016 >>> media-sound/mpg123-1.18.1
Sat Mar 12 22:03:05 2016 >>> media-video/mpv-0.9.2-r1
Sat Mar 12 22:03:09 2016 >>> media-plugins/gst-plugins-meta-0.10-r10
Sat Mar 12 22:12:16 2016 >>> app-emulation/wine-1.9.5
Sat Mar 12 22:12:45 2016 >>> net-misc/openssh-7.2_p2
Sat Mar 12 22:12:51 2016 >>> www-plugins/adobe-flash-11.2.202.577
Sun Mar 13 23:58:44 2016 >>> net-misc/openssh-7.2_p2
Mon Mar 14 22:58:26 2016 >>> dev-perl/Date-Manip-6.510.0
Mon Mar 14 22:58:29 2016 >>> app-portage/genlop-0.30.9-r1
Nothing that jumps out at me as a culprit.
As for your question about GPU binding, it is possible to bind GPUs to Nouveau (or any open source driver), but it may require a different syntax than just the PCI slot. Sometimes they require a vendor and device ID, as vfio-pci does. See this code for an example:
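# pull the card's vendor and device IDs out of sysfs (0000:08:00.0 is my card's address)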
vendor=$(cat /sys/bus/pci/devices/0000:08:00.0/vendor)
device=$(cat /sys/bus/pci/devices/0000:08:00.0/device)
echo 0000:08:00.0 > /sys/bus/pci/devices/0000:08:00.0/driver/unbind
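# registering that vendor/device pair via new_id makes vfio-pci claim any matching unbound device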
echo $vendor $device > /sys/bus/pci/drivers/vfio-pci/new_id
Unfortunately, I can't remember what syntax Nouveau takes; you may just have to fiddle with it until it works, or it might be documented somewhere. The main problem with all this, though, is that the proprietary Nvidia drivers don't allow dynamic assignment, and Nouveau's 3D performance is very poor, so there really isn't a good way to do what you'd like.
Offline
In my experience, Nouveau has actually been pretty amazing at 3D compared to what I expected. Maybe that'll change when I'm using an actual high-powered card, though.
So, would this bind it to Nouveau (as root), or am I missing a step?
# echo 0000:06:00.0 > /sys/bus/pci/drivers/vfio-pci/unbind
# echo $(cat /sys/bus/pci/devices/0000:06:00.0/vendor) $(cat /sys/bus/pci/devices/0000:06:00.0/device) > /sys/bus/pci/drivers/nouveau/new_id
Does echoing it to new_id bind the driver? Sorry for asking these basic questions, but this is very hard to Google, and I have yet to find any docs on it. I tried binding after echoing to new_id and it still gave me the same error; maybe I'm supposed to just echo to new_id and skip the explicit bind?
Also, how would I get Xorg to recognize the card and use it for a program of my choice? The only examples I can find involve setting things at boot in conf files, but of course I can't use a static boot-time configuration, since my second video card will change owners as I switch between the VM and the host.
Offline
It came back just now! I shut down a VM and my computer froze. But I didn't test the PCI binding beforehand, because I had gotten complacent. I'm gonna be super vigilant about always testing it from now on.
Here's the interesting part: the VM had only been running for a few minutes, and it happened after I had shut down both VMs dozens of times in a row within an hour (I was busy doing things).
Offline
Do you have any dmesg output after shutting down your vm? Maybe it's like my problem posted here: https://bbs.archlinux.org/viewtopic.php … 5#p1613845
Offline
I doubt it's the same issue, because we get a hard freeze upon binding/unbinding things, and you apparently don't. And because of that freeze, I don't think we'd get any log output at all, even in dmesg.
However, I made a new breakthrough a few days ago that I apparently forgot to post here. The problem is indeed back, but it's only happening with the Windows VM and never with the Linux VM. I managed to try unbinding and rebinding a PCI device that has nothing to do with either VM while Windows was running, and it didn't even wait for the bind: the computer froze as soon as I unbound the device. Has that ever happened to you, RomeIsBurning? Maybe we have the same problem with slightly different symptoms? The weird part is, the only difference between the two VMs (besides the OSes they run) is that the Windows one also gets the video card's HDMI audio device passed in, and the Linux VM does not. But you confirmed that removing the audio didn't fix it for you, and until now, it wasn't fixing it for me either. I'll try taking the audio out of the Windows VM and see where that gets me.
The fact that the behavior differs between the two VMs would also explain why the problem "disappeared" between 3/3 and 3/11. I haven't been testing both VMs equally (in fact, I was almost exclusively gaming on the Linux VM that week), so the bug may well have morphed into the issue I have now because of the 3/3 upgrades to pciutils and/or linux. In that case, it would probably be the issue from that mailing list, the one fixed by removing HDMI audio. That's my hunch.
EDIT: Well, the problem just happened even after I took the audio device out, so that kills that hunch. I'm now only having this issue with the Windows VM, even though nothing about it differs from the Linux VM besides the OS.
Last edited by Cadeyrn (2016-03-22 08:21:39)
Offline
I also have this issue: the host system sometimes locks up when shutting down a win10 guest with a PCI-passthrough Nvidia 980. I have never had it happen with a Linux guest, but I don't normally run Linux VMs for gaming or anything graphics-intensive. The crash normally happens upon shutting down the win10 guest, but I was also able to replicate it by doing an 'rmmod snd-hda-intel' on the host while a win10 guest had been running for a couple of days. My host machine does have a radeon card which uses snd_hda_intel, but my passthrough graphics and sound are bound to vfio-pci.
As a test, I decided to remove sound card support from my host's kernel altogether, and so far I've not seen a lockup. It's hard to say whether I've just been lucky, since the problem is intermittent, but in my experience the system would normally have locked up after my recent usage. Maybe you guys can try blacklisting the snd and snd-hda-intel modules and see if that helps your problem (see the sketch below). Sound isn't compiled into my kernel at all, but I imagine blacklisting the modules would have the same effect.
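Something like this should do it (the filename is arbitrary; also note that blacklisting only stops the modules from being loaded automatically, so anything that loads them explicitly will still pull them in):
# /etc/modprobe.d/blacklist-snd.conf
blacklist snd_hda_intel
blacklist snd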
I will post again if my system locks upon a guest shutdown, but I plan on letting it run a few days before shutting it down again.
Offline