Hello Arch Community,
please help me figure out the random freezes on my system.
I recently upgraded my hardware and replaced my old CPU, RAM and mainboard with some used parts from eBay (i7-6700K, ASRock Z170 Extreme4, 16 GB RAM).
Since then I get random system freezes where I can't move the mouse pointer or switch to tty2. Even SysRq doesn't work anymore and I have to use the power switch to turn the PC off. These freezes happen 3 to 10 times a week. I can't find anything related to this in the logs.
There is another issue with suspend to RAM on this system. I am not sure whether the two are connected. Symptoms: on some resumes I get back to a distorted screen (huge triangles all over the screen). It is possible to switch to tty2 for the first few seconds, then the system becomes unresponsive as above -> power switch.
Here are the logfiles of my last 3 boots. What happened:
boot1: worked for almost 24 hours -> suspend -> freeze
https://gist.github.com/rolandg/5ab8a69 … 162423a051
boot2: worked for almost 2 minutes: boot & started some software -> freeze
https://gist.github.com/rolandg/fc829bd … c87f9b282b
boot3: boot after freeze, still running
https://gist.github.com/rolandg/28ce38d … 30b4e5382f
I thought it was an issue with my SSD, since at some point I saw some I/O errors from my system partition. Could that be my problem? Would a defective disk cause a freeze?
Any other ideas?
Thank you in advance
LandoR
Last edited by LandoR (2017-05-19 12:31:07)
Offline
https://wiki.archlinux.org/index.php/Kernel_Panics
https://wiki.archlinux.org/index.php/S.M.A.R.T.
https://www.archlinux.org/packages/extr … emtest86+/
Since you replaced the CPU, mainboard and RAM with used parts, try memtest first. If you're lucky, the memory timings/clocks are just set too aggressively.
Offline
Nothing obvious jumped out from the logs. I would recommend installing mcelog and enabling mcelog.service.
After you reboot from a hang, check journalctl for the prior boot and look for errors.
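For what it's worth, here is a rough shell sketch of that check. The function name hw_error_lines and its pattern list are my own assumptions, not a complete catalogue of hardware error messages; it only filters whatever journal text is piped into it.

```shell
# Filter journal text on stdin for lines that commonly accompany hardware faults.
# The patterns are illustrative assumptions, not an exhaustive list.
hw_error_lines() {
    # "|| true" keeps the function from returning non-zero when nothing matches
    grep -Ei 'machine check|hardware error|i/o error' || true
}
```

After a hang, something like `journalctl -b -1 -k --no-pager | hw_error_lines` would run it over the kernel messages of the previous boot. mcelog itself can be started (as root) with `systemctl enable --now mcelog.service`.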
Offline
Hi,
thanks for your replies.
I forgot to mention: I use Windows for gaming and have never had a freeze in Windows. I have only played about 15 h since getting the new hardware.
My RAM is new, but I tried memtest86+ and it doesn't report any errors.
The output of smartctl looks good imo:
$ sudo smartctl /dev/sda -a
smartctl 6.5 2016-05-07 r4318 [x86_64-linux-4.10.13-1-ARCH] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Device Model: Corsair Force GT SSD
Serial Number: 11246511000006930004
LU WWN Device Id: 0 000000 000000000
Firmware Version: 1.2
User Capacity: 120,034,123,776 bytes [120 GB]
Sector Size: 512 bytes logical/physical
Rotation Rate: Solid State Device
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: ACS-2 T13/2015-D revision 3
SATA Version is: SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Wed May 3 15:22:48 2017 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x00) Offline data collection activity
was never started.
Auto Offline Data Collection: Disabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 0) seconds.
Offline data collection
capabilities: (0x7f) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Abort Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 1) minutes.
Extended self-test routine
recommended polling time: ( 48) minutes.
Conveyance self-test routine
recommended polling time: ( 2) minutes.
SCT capabilities: (0x0021) SCT Status supported.
SCT Data Table supported.
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   120   120   050    Pre-fail  Always       -       0
  5 Reallocated_Sector_Ct   0x0033   100   100   003    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   095   095   000    Old_age   Always       -       5092 (211 228 0)
 12 Power_Cycle_Count       0x0032   099   099   000    Old_age   Always       -       2031
171 Unknown_Attribute       0x0032   000   000   000    Old_age   Always       -       0
172 Unknown_Attribute       0x0032   000   000   000    Old_age   Always       -       0
174 Unknown_Attribute       0x0030   000   000   000    Old_age   Offline      -       76
177 Wear_Leveling_Count     0x0000   000   000   000    Old_age   Offline      -       6
181 Program_Fail_Cnt_Total  0x0032   000   000   000    Old_age   Always       -       0
182 Erase_Fail_Count_Total  0x0032   000   000   000    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
194 Temperature_Celsius     0x0022   030   080   000    Old_age   Always       -       30 (Min/Max 7/80)
195 Hardware_ECC_Recovered  0x001c   100   100   000    Old_age   Offline      -       0
196 Reallocated_Event_Count 0x0033   100   100   003    Pre-fail  Always       -       0
201 Unknown_SSD_Attribute   0x001c   100   100   000    Old_age   Offline      -       0
204 Soft_ECC_Correction     0x001c   100   100   000    Old_age   Offline      -       0
230 Unknown_SSD_Attribute   0x0013   100   100   000    Pre-fail  Always       -       429496729700
231 Temperature_Celsius     0x0013   100   100   010    Pre-fail  Always       -       0
233 Media_Wearout_Indicator 0x0000   000   000   000    Old_age   Offline      -       9633
234 Unknown_Attribute       0x0032   000   000   000    Old_age   Always       -       9623
241 Total_LBAs_Written      0x0032   000   000   000    Old_age   Always       -       9623
242 Total_LBAs_Read         0x0032   000   000   000    Old_age   Always       -       8298
SMART Error Log not supported
SMART Self-test Log not supported
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
I enabled mcelog to report errors. No freeze since then.
Offline
> I use Windows for gaming
Fastboot issue?
Offline
> Fastboot issue?
If you mean fast start-up (https://wiki.archlinux.org/index.php/Du … t_Start-Up), it's disabled in Windows. I had some issues with it earlier on the old system. I didn't reinstall Windows for the upgrade.
Offline
I have now also disabled the fast boot option in the BIOS. It didn't help.
Here is a log with mcelog enabled.
https://gist.github.com/rolandg/b9427d1 … 4469e3f8c5
Can't find anything in it.
Offline
The BIOS setting is unrelated (it just skips some self tests, whereas the MS feature leaves the hardware in an undefined state).
Whatever /dev/sdf is, it reports a temperature > 100°C!
Either the device is already broken and reports junk, or it's pretty hot in there: too hot for most things (notably RAM and disks; it's even hot for a CPU. GPUs can usually take that much heat, but that's about it).
Also, the airflow temperature is > 60°C, in what is apparently a desktop system/tower?
Something seems off about the temperatures. Either a device heats up too much or the fan control is badly configured.
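A minimal sketch for pulling the hot sensors out of smartctl output, assuming the usual ten-column attribute table that `smartctl -A` prints. The function name hot_smart_attrs and the default limit of 60 are illustrative assumptions.

```shell
# Read `smartctl -A` output on stdin and print every temperature attribute
# whose raw value (column 10) exceeds a limit in degrees Celsius.
# The default limit of 60 is an assumed example value, not a recommendation.
hot_smart_attrs() {
    awk -v limit="${1:-60}" '
        # attribute rows start with a numeric ID; column 2 is the attribute name
        $1 ~ /^[0-9]+$/ && $2 ~ /Temperature/ && ($10 + 0) > limit { print $2, $10 }
    '
}
```

`sudo smartctl -A /dev/sdf | hot_smart_attrs 60` would list any temperature attribute above 60. Some drives pack min/max values into the raw field or reuse a temperature ID for something else, so treat the output as a hint only.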
Offline
Yes, it's even a big tower.
Both disks are a bit older and they get warm. I don't think they were at 100°C when I touched them, even if smartctl said so. They were in the lower front corner of the tower, where they don't get cooled very well.
I unplugged them for now, freezing persists.
Not sure if this is related:
I have these case fans with 4 LEDs on each of them.
When I upgraded my system I clipped one of the cables to each LED.
I did not insulate them, so one of them might be touching GND.
Could that be related?
Thanks
Offline
I'd generally suggest ruling out *any* cause involving electrical stunts, yes.
Even a brief short circuit could cause entirely random things (ask me whether dust can turn conductive ;-)
Offline
> The disk(s) both are a bit older and they get warm. I don't think they were 100°C when I touched them,
Oh, you would know if they were. It would leave blisters.
60°C exhaust air temperature for a tower is not unreasonable. Maybe a bit high.
100°C for a disk drive is too high. Be aware that this is probably not the case temperature, but some ambient sensor on the circuit board. It could also be an airflow temperature.
Temperature may be a red herring.
Have you installed and configured microcode updates for your processor?
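One quick way to see which microcode revision is actually loaded is the microcode field in /proc/cpuinfo (present on x86). A small sketch; the helper name microcode_revisions is made up for illustration.

```shell
# Read /proc/cpuinfo-style text on stdin and print the distinct microcode
# revisions reported across all logical CPUs.
microcode_revisions() {
    # lines look like "microcode<TAB>: 0xba"; split on the colon and following spaces
    awk -F': *' '/^microcode/ { print $2 }' | sort -u
}
```

`microcode_revisions < /proc/cpuinfo` should normally print a single revision, and `journalctl -k -b | grep -i microcode` shows what the kernel logged about microcode during the current boot.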
Offline
May 01 16:13:04 wks-rd kernel: NVRM: Your system is not currently configured to drive a VGA console
May 01 16:13:04 wks-rd kernel: NVRM: on the primary VGA device. The NVIDIA Linux graphics driver
May 01 16:13:04 wks-rd kernel: NVRM: requires the use of a text-mode VGA console. Use of other console
May 01 16:13:04 wks-rd kernel: NVRM: drivers including, but not limited to, vesafb, may result in
May 01 16:13:04 wks-rd kernel: NVRM: corruption and stability problems, and is not supported.
https://wiki.archlinux.org/index.php/GR … ramebuffer
Also check the tail of dmesg right after wakeup (more than just the last ten lines), and try suspending the KWin compositor before entering S3 and resuming it after wakeup.
Offline
I am intrigued that the error only occurred on Linux and not on Windows. Have you inspected the hard drive using the tools described here: https://wiki.archlinux.org/index.php/S.M.A.R.T.?
And a small edit: the microcode thing might be related, so have you tried that?
Last edited by SmallAndSimple (2017-05-11 10:52:39)
Offline
@seth
I'll try that in a minute.
@SmallAndSimple
Yes, microcode has been enabled since the reinstallation.
It was a drive formatted as ext4, so Windows has never accessed it.
Thank you guys
EDIT:
After setting the framebuffer, the system resumes to a black screen or the lock screen. I can move the mouse for a few seconds, then it freezes. No switching to tty2 or SysRq possible.
Here are the logs of two tried resumes:
log 1
log 2
Last edited by LandoR (2017-05-12 12:15:18)
Offline
What do you mean by "after setting the framebuffer"? The logs indicate that you're still using a framebuffer console, but the idea is not to use one (because that's not officially supported by the nvidia driver).
The system is btw. fully alive (SysRq is just not enabled, see https://wiki.archlinux.org/index.php/Sysctl) - the error seems to be
nvidia-modeset: ERROR: GPU:0: Idling display engine timed out
Piping that into google will get you here https://devtalk.nvidia.com/default/topi … ia-370-/11
There doesn't seem to be a known cause/solution for this, though :-(
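On the SysRq point: the current setting lives in /proc/sys/kernel/sysrq, where 0 means disabled, 1 means all functions enabled, and other values are a bitmask of allowed functions. A hedged sketch for interpreting it; the helper name sysrq_state is hypothetical.

```shell
# Describe a kernel.sysrq value passed as the first argument.
sysrq_state() {
    case "$1" in
        0) echo "disabled" ;;
        1) echo "all functions enabled" ;;
        *) echo "restricted to bitmask $1" ;;
    esac
}
```

`sysrq_state "$(cat /proc/sys/kernel/sysrq)"` prints the current state, and `sysctl kernel.sysrq=1` (as root) enables everything until the next reboot.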
Offline
I had two freezes out of nowhere yesterday. So it wasn't the unplugged disks. After one of them I was able to switch to tty2 and could read some I/O errors from my system partition. So I went to the shop and got myself a brand new SSD. I hope this helps now.
> What do you mean by "after setting the framebuffer" - the logs indicate that you're still using a framebuffer console, but the idea is to do not (because that's not officially supported by the nvidia driver)
At least I set
GRUB_TERMINAL_OUTPUT=console
+ grub-mkconfig. After that the border of the grub screen disappeared. Might be a log from before it was set.
> The system is btw. fully alive (SysRq is just not enabled
Oh, you are right. I didn't re-enable SysRq after the installation. But if the system were fully alive, it would be possible to switch to tty2. This is sometimes possible, and sometimes the system resumes correctly after 30 s. Sometimes tty2 freezes after switching to it. I have already seen all variations.
Thanks
Offline
Framebuffer/graphics issue - you could try to ssh into it for an inspection.
Do you still load the nouveau module at some point? E.g. from/in the initramfs?
The tty2 output should "look like DOS" ie. only have 25 lines of text.
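To tell which kind of console the kernel ended up with, one can look for the framebuffer handover message in the kernel log. A rough sketch: the helper name console_kind is hypothetical, and the matched string is an assumption based on what kernels typically print when fbcon takes over.

```shell
# Read kernel log text on stdin and report whether a framebuffer console
# took over. The matched message is an assumed, typical fbcon log line.
console_kind() {
    if grep -q 'switching to colour frame buffer device'; then
        echo "framebuffer"
    else
        echo "text"
    fi
}
```

`journalctl -k -b --no-pager | console_kind` would then print "text" when no handover was logged. An empty or truncated log also yields "text", so double-check against how the tty actually looks.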
Offline
OK, it seems fixed since I got the new SSD.
I have also had no more problems resuming from suspend.
Even though I have no idea why, I'll close this thread.
Offline