Context (maybe relevant)
I have been using Arch on my laptop for 2 years without facing major issues
Laptop recently failed to start at times, crashed frequently and the charging light would not come on sometimes
Technicians said that the Power IC had shorted and replaced/repaired it
Problem went away but Arch did not boot and the laptop always entered the BIOS menu
I used a bootable USB to reinstall Arch. However, I found that my partitions and data were intact, so I reused the existing partitioning without formatting
I reinstalled GRUB and got my system back to exactly how it was. So the bootloader files had probably been corrupted by the repeated hard reboots
Problem
A new problem has emerged, where the kernel panics and the system hangs
I have to hard reboot every time, and the panic messages are not persisted. Hence, the snapshot below
The first time I use the laptop each day, there are no crashes for a long time (5-10 hours)
After the first crash, subsequent sessions crash within the first 10 minutes
This strongly suggests a hardware problem, most likely a heat-related one. But I monitored the CPU temperature, which stayed fine throughout (~53 °C), and the laptop didn't seem unusually hot either
At this stage, I am clueless. Is it actually a hardware problem? If so, how do I narrow down to the faulty component?
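For anyone chasing a similar heat-related crash, here is a minimal sketch of logging every thermal zone the kernel exposes, not just the CPU package. It assumes the standard sysfs layout under /sys/class/thermal. The function names, the log path and the 10-second interval are made up for illustration, and a discrete GPU may not show up among these zones at all.

```shell
# Print one line per thermal zone: "<zone type> <temperature in degrees C>".
# The optional first argument is the directory to scan; it defaults to the
# standard sysfs location and is a parameter only to make the function easy to try out.
read_temps() {
    local base zone type milli
    base="${1:-/sys/class/thermal}"
    for zone in "$base"/thermal_zone*; do
        [ -r "$zone/temp" ] || continue
        type=$(cat "$zone/type" 2>/dev/null || echo unknown)
        milli=$(cat "$zone/temp" 2>/dev/null)
        # sysfs reports millidegrees Celsius; integer division is precise enough here
        echo "$type $((milli / 1000))"
    done
}

# Append timestamped readings to a log file and flush it to disk after every
# sample, so the last lines survive a hard reboot. Runs until interrupted.
log_temps() {
    local log
    log="${1:-$HOME/temps.log}"
    while true; do
        read_temps | sed "s/^/$(date +%T) /" >> "$log"
        sync
        sleep 10
    done
}
```

Running `log_temps` in a spare terminal and reading the tail of the log after the next crash shows the last temperatures recorded before the hang.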
Other details
[~]$ uname -a
Linux rhinoMSi 5.2.11-arch1-1-ARCH #1 SMP PREEMPT Thu Aug 29 08:09:36 UTC 2019 x86_64 GNU/Linux
[~]$ sudo lshw -short
H/W path        Device   Class          Description
============================================================
                         system         PE62 7RE (16J9.3)
/0                       bus            MS-16J9
/0/1                     memory         64KiB BIOS
/0/3e                    memory         8GiB System Memory
/0/3e/0                  memory         8GiB SODIMM DDR4 Synchronous 2400 MHz (0.4 ns)
/0/3e/1                  memory         [empty]
/0/42                    memory         256KiB L1 cache
/0/43                    memory         1MiB L2 cache
/0/44                    memory         6MiB L3 cache
/0/45                    processor      Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz
/0/100                   bridge         Xeon E3-1200 v6/7th Gen Core Processor Host Bridge/DRAM Registers
/0/100/1                 bridge         Xeon E3-1200 v5/E3-1500 v5/6th Gen Core Processor PCIe Controller (x16)
/0/100/1/0               display        GP107M [GeForce GTX 1050 Ti Mobile]
/0/100/2                 display        HD Graphics 630
/0/100/14                bus            100 Series/C230 Series Chipset Family USB 3.0 xHCI Controller
/0/100/14/0     usb1     bus            xHCI Host Controller
/0/100/14/0/7            input          MSI EPF USB
/0/100/14/0/a            communication  Bluetooth wireless interface
/0/100/14/0/b            multimedia     BisonCam, NB Pro
/0/100/14/0/c            generic        USB2.0-CRW
/0/100/14/1     usb2     bus            xHCI Host Controller
/0/100/14.2              generic        100 Series/C230 Series Chipset Family Thermal Subsystem
/0/100/16                communication  100 Series/C230 Series Chipset Family MEI Controller #1
/0/100/17                storage        HM170/QM170 Chipset SATA Controller [AHCI Mode]
/0/100/1c                bridge         100 Series/C230 Series Chipset Family PCI Express Root Port #1
/0/100/1c/0     wlp2s0   network        Dual Band Wireless-AC 3168NGW [Stone Peak]
/0/100/1c.3              bridge         100 Series/C230 Series Chipset Family PCI Express Root Port #4
/0/100/1c.3/0   enp3s0   network        QCA8171 Gigabit Ethernet
/0/100/1f                bridge         HM175 Chipset LPC/eSPI Controller
/0/100/1f.2              memory         Memory controller
/0/100/1f.3              multimedia     CM238 HD Audio Controller
/0/100/1f.4              bus            100 Series/C230 Series Chipset Family SMBus
/1                       power          To Be Filled By O.E.M.
[~]$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 39 bits physical, 48 bits virtual
CPU(s): 8
On-line CPU(s) list: 0-7
Thread(s) per core: 2
Core(s) per socket: 4
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 158
Model name: Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz
Stepping: 9
CPU MHz: 1000.385
CPU max MHz: 2800.0000
CPU min MHz: 800.0000
BogoMIPS: 5618.00
Virtualization: VT-x
L1d cache: 128 KiB
L1i cache: 128 KiB
L2 cache: 1 MiB
L3 cache: 6 MiB
NUMA node0 CPU(s): 0-7
Vulnerability L1tf: Mitigation; PTE Inversion
Vulnerability Mds: Mitigation; Clear CPU buffers; SMT vulnerable
Vulnerability Meltdown: Mitigation; PTI
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Full generic retpoline, IBPB conditional, IBRS_FW, STIBP conditional, RSB filling
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx rdseed adx smap clflushopt intel_pt xsaveopt xsavec xgetbv1 xsaves dtherm arat pln pts hwp hwp_notify hwp_act_window hwp_epp md_clear flush_l1d
Let me know if I need to post any other details. I have never dealt with kernel panic or hardware faults before. Any help is greatly appreciated.
Update: Faulty GPU. See comment below.
Last edited by rbiswas143 (2019-10-22 04:40:38)
I'm new to this forum and have read the rules. I'd appreciate your feedback with regard to my adherence to the norms.
Ensure your microcode is set up and generally update your system. You might also want to disable laptop-mode-tools to ensure it isn't triggering some faulty power saving option.
You will likely also want to test your RAM for a day or so
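A quick way to check the microcode point, as a minimal sketch: the kernel reports the revision it is running with in /proc/cpuinfo. The function name `microcode_rev` is made up for illustration, and the optional file argument exists only so it can be pointed at a saved copy of cpuinfo.

```shell
# Print the microcode revision reported for the first CPU.
# Reads the live /proc/cpuinfo unless another file is given as the first argument.
microcode_rev() {
    awk -F': *' '/^microcode/ { print $2; exit }' "${1:-/proc/cpuinfo}"
}
```

Comparing the value before and after installing the intel-ucode package (and adding it to the bootloader entry) shows whether the update is actually being applied at boot.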
Last edited by V1del (2019-10-18 09:40:05)
The scenario sounds as if maybe the case warms up, deforms, causes tension and boom: you got a loose connection.
You could check whether you can restore the 5-10h uptime capacity by aggressively cooling it down, e.g. by putting it in the fridge for some time™ (where you keep the butter, NOT where you keep the ice cream. And power it off before doing so, especially if you have a spinning HDD)
Edit: and make sure the humidity in the fridge isn't too high (i.e. if everything has drops of water on it, you should defrost it before putting electronics there)
Last edited by seth (2019-10-18 15:45:11)
Thank you for your support. I tried everything you guys suggested but without much luck. But, I kind of managed to fix the problem anyway.
Laptop soon went back to the initial state where it wouldn't boot for a long time
Booted without SSD, HDD, battery, etc (one at a time) but the problem persisted
Screen distortions showed up at times and the screen sometimes went completely green
Finally uninstalled/blacklisted GPU drivers, and that worked like a charm
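A minimal sketch of the blacklisting step, assuming nouveau is the module to keep out; the proprietary driver uses different module names. The helper name `blacklist_conf` and the file name `blacklist-gpu.conf` are made up for illustration.

```shell
# Print modprobe.d "blacklist" lines for the modules given as arguments.
# A blacklist entry stops a module from being auto-loaded for its hardware;
# it can still be loaded by hand with modprobe.
blacklist_conf() {
    local mod
    for mod in "$@"; do
        printf 'blacklist %s\n' "$mod"
    done
}

# Example, to be run as root (assumption: nouveau is the driver in use):
#   blacklist_conf nouveau > /etc/modprobe.d/blacklist-gpu.conf
```

The change takes effect on the next boot; if the module is also included in the initramfs, that has to be regenerated as well (`mkinitcpio -P` on Arch).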
My diagnosis is that I have a faulty GPU that starts misbehaving only when it heats up. The GPU is built into the motherboard, and I'd rather get a new laptop than replace the motherboard. Until then I'll be using a GPU-less laptop, which, sadly, defeats the purpose of going with MSI in the first place. I wonder whether I could still use the GPU occasionally, stopping before it heats up, without damaging anything else. Probably not. Marking this SOLVED.