[SOLVED] clocksource: timekeeping watchdog on CPU1: acpi_pm wd-wd read

uka · 2022-07-30 17:24:26

Hi all, I recently repurposed an old office Ubuntu desktop machine with an AMD A10-7860K CPU as an Arch Linux NAS with ZFS, everything is working fine. However, I took a peek into dmesg for kicks some time ago, and every time there is a kernel upgrade and noticed a barrage of messages (no longer able to see the boot logs!) like:

[782205.344976] clocksource: timekeeping watchdog on CPU1: acpi_pm wd-wd read-back delay of 1025549ns
[782205.344988] clocksource: wd-tsc-wd read-back delay of 578285ns, clock-skew test skipped!
[782951.361019] clocksource: timekeeping watchdog on CPU1: acpi_pm wd-wd read-back delay of 1055441ns
[782951.361033] clocksource: wd-tsc-wd read-back delay of 517384ns, clock-skew test skipped!
[783517.379087] clocksource: timekeeping watchdog on CPU1: acpi_pm wd-wd read-back delay of 1025549ns
[783517.379099] clocksource: wd-tsc-wd read-back delay of 578285ns, clock-skew test skipped!
[791427.581843] clocksource: timekeeping watchdog on CPU1: acpi_pm wd-wd read-back delay of 1004317ns
[791427.581858] clocksource: wd-tsc-wd read-back delay of 578565ns, clock-skew test skipped!
[792383.605345] clocksource: timekeeping watchdog on CPU1: acpi_pm wd-wd read-back delay of 1025269ns
[792383.605357] clocksource: wd-tsc-wd read-back delay of 578565ns, clock-skew test skipped!
[792923.619029] clocksource: timekeeping watchdog on CPU1: acpi_pm wd-wd read-back delay of 1025828ns
[792923.619042] clocksource: wd-tsc-wd read-back delay of 578285ns, clock-skew test skipped!
[792931.619232] clocksource: timekeeping watchdog on CPU1: acpi_pm wd-wd read-back delay of 1056838ns

Originally where it says acpi_pm it used to say hpet, but I disabled HPET in BIOS thinking it was the culprit, it wasn't. It seems the tsc clocksource is misbehaving and it decides to use HPET or ACPI as it sees fit. I don't seem to find any information about that besides this recent thread at kernel.org: https://lore.kernel.org/lkml/2022052801 … edhat.com/.

This system is mostly idling all the time, which in the thread they say "If this still happens at idle, there is definitely something wrong with the cpus." and that "The only way to stop the spamming is to build a kernel with a larger CONFIG_CLOCKSOURCE_WATCHDOG_MAX_SKEW_US. By default, it is 100us", so I guess that's the solution for the spamming. The system otherwise works fine, and at some point had an uptime of two months. Monthly ZFS scrubs report no issues. The box itself had a very comfortable life before retirement as a web browsing and office desktop machine, so no overheating or things like that. It is replacing an AMD Athlon 5350 that was somewhat unresponsive because I'm using ZFS compression and it was getting slow and 100% CPU usage with some loads.

This has been happening for many kernel versions now, with the constraint that zfs-linux package requires a slightly older version of the kernel, so now it is running 5.18.12 instead of 5.18.15.

Do somebody else had this issue? Any recommendation besides a custom kernel? I will try that as soon as I have time, but the NAS box is now indispensable for my home office, and maybe there is a tweakable boot setting or so?

Last edited by uka (2022-08-08 00:53:22)

risto3 · 2022-08-02 09:51:34

More or less the same thing, excepting that I seem to only see it on the machine running LTS (5.15.55-2-lts), as I have two supermicro H8SGL-F with an opteron 6128
the other one is running 5.18.15-arch1-1... only other difference is the microcode (the latter has 3.5b instead of 3.5) and has unbuffered ECC memory instead of RDIMMs.

Seems the LTS machine (with the older microcode) has seen these messages for months, I'll see about updating the microcode to see if that's the culprit.

uka · 2022-08-08 00:51:48

Probably solved the issue: by setting the tsc=unstable kernel boot parameter, and re-enabling HPET in the UEFI/BIOS, the Linux kernel decided to use hpet as clocksource since boot time and has been running fine for around a week. It is possible that setting clocksource=hpet as kernel parameter will render the same positive result, but because the system is running fine, I haven't tried that.

After reading and searching a bit, it seems that when tsc isn't working properly, and another clocksource (either hpet or acpi_pm) is selected due to that instability, it is quite difficult to get the timings right in a multicore system and then all those messages keep appearing, and that is what they mean.

Since the introduced change, the only (un?)related timing warning I have received are the following ones, but those are very common to see in any other system:

[ 2311.702439] perf: interrupt took too long (2844 > 2500), lowering kernel.perf_event_max_sample_rate to 70200
[ 3010.165480] perf: interrupt took too long (3591 > 3555), lowering kernel.perf_event_max_sample_rate to 55500
[ 3545.793316] perf: interrupt took too long (4874 > 4488), lowering kernel.perf_event_max_sample_rate to 40800
[ 4158.016652] perf: interrupt took too long (6428 > 6092), lowering kernel.perf_event_max_sample_rate to 30900
[ 5899.068558] perf: interrupt took too long (8403 > 8035), lowering kernel.perf_event_max_sample_rate to 23700
[ 7649.439641] perf: interrupt took too long (10794 > 10503), lowering kernel.perf_event_max_sample_rate to 18300
[10630.516664] perf: interrupt took too long (13529 > 13492), lowering kernel.perf_event_max_sample_rate to 14700
[16340.985432] perf: interrupt took too long (17060 > 16911), lowering kernel.perf_event_max_sample_rate to 11700
[18797.710808] perf: interrupt took too long (21502 > 21325), lowering kernel.perf_event_max_sample_rate to 9300
[27527.936817] perf: interrupt took too long (27218 > 26877), lowering kernel.perf_event_max_sample_rate to 7200
[41601.451793] INFO: NMI handler (perf_event_nmi_handler) took too long to run: 1.035 msecs
[49045.829185] INFO: NMI handler (perf_event_nmi_handler) took too long to run: 1.097 msecs

For full disclosure, the offending system is using a beta BIOS (F3b) to enable a FM2+ CPU (AMD A10-7860K, with latest amd-ucode loaded) on a retail FM2 budget motherboard (Gigabyte GA-F2A58M-HD2 (rev. 3.x), that used an AMD A4-6300 originally), so, maybe the tsc unstable issue stems from that scenario. However, the issue seems to appear in recent 5.1x kernels due to a very low CONFIG_CLOCKSOURCE_WATCHDOG_MAX_SKEW_US value.

For the sake of completeness and people having similar issues that come from a web search engine, the following is a log of the issue with hpet (previous post was with acpi_pm):

Jul 04 14:19:37 nas kernel: clocksource: timekeeping watchdog on CPU1: hpet wd-wd read-back delay of 1025200ns
Jul 04 14:19:37 nas kernel: clocksource: wd-tsc-wd read-back delay of 578355ns, clock-skew test skipped!
Jul 04 14:21:49 nas kernel: clocksource: timekeeping watchdog on CPU1: hpet wd-wd read-back delay of 1025689ns
Jul 04 14:21:49 nas kernel: clocksource: wd-tsc-wd read-back delay of 578355ns, clock-skew test skipped!
Jul 04 14:54:57 nas kernel: clocksource: timekeeping watchdog on CPU1: hpet wd-wd read-back delay of 1025689ns
Jul 04 14:54:57 nas kernel: clocksource: wd-tsc-wd read-back delay of 578355ns, clock-skew test skipped!
Jul 04 14:58:47 nas kernel: clocksource: timekeeping watchdog on CPU1: hpet wd-wd read-back delay of 1025200ns
Jul 04 14:58:47 nas kernel: clocksource: wd-tsc-wd read-back delay of 547555ns, clock-skew test skipped!
Jul 04 15:08:11 nas kernel: clocksource: timekeeping watchdog on CPU1: hpet wd-wd read-back delay of 1025200ns
Jul 04 15:08:11 nas kernel: clocksource: wd-tsc-wd read-back delay of 578355ns, clock-skew test skipped!
Jul 04 15:23:21 nas kernel: clocksource: timekeeping watchdog on CPU1: hpet wd-wd read-back delay of 1025689ns
Jul 04 15:23:21 nas kernel: clocksource: wd-tsc-wd read-back delay of 578844ns, clock-skew test skipped!
Jul 04 15:48:31 nas kernel: clocksource: timekeeping watchdog on CPU1: hpet wd-wd read-back delay of 1025689ns
Jul 04 15:48:31 nas kernel: clocksource: wd-tsc-wd read-back delay of 578355ns, clock-skew test skipped!
Jul 04 15:53:05 nas kernel: clocksource: timekeeping watchdog on CPU1: hpet wd-wd read-back delay of 1025200ns
Jul 04 15:53:05 nas kernel: clocksource: wd-tsc-wd read-back delay of 578844ns, clock-skew test skipped!
Jul 04 16:06:11 nas kernel: clocksource: timekeeping watchdog on CPU1: hpet wd-wd read-back delay of 1025200ns
Jul 04 16:06:11 nas kernel: clocksource: wd-tsc-wd read-back delay of 578355ns, clock-skew test skipped!
Jul 04 16:08:19 nas kernel: clocksource: timekeeping watchdog on CPU1: hpet wd-wd read-back delay of 1025200ns
Jul 04 16:08:19 nas kernel: clocksource: wd-tsc-wd read-back delay of 578355ns, clock-skew test skipped!

So, happy all around as I was trying to avoid to build a custom kernel just to set a higher CONFIG_CLOCKSOURCE_WATCHDOG_MAX_SKEW_US value, and the downside to use HPET is that is more resource intensive (tens of times slower in the nanoseconds scale) for getting time values than tsc, specifically for benchmarking and gaming, but for a mostly idling NAS machine, I'm getting the same fully saturated Ethernet throughput as always. Hooray!

Last edited by uka (2022-08-08 01:06:27)

uka · 2022-08-09 21:03:57

Rebooted over the weekend and currently testing tsc=nowatchdog, everything working fine so far (48H+, no clocksource warnings on kernel log), and this kernel parameter allows to use the recommended tsc instead of hpet, might be worth a try for others that have this issue too.

Arch Linux

#1 2022-07-30 17:24:26

[SOLVED] clocksource: timekeeping watchdog on CPU1: acpi_pm wd-wd read

#2 2022-08-02 09:51:34

Re: [SOLVED] clocksource: timekeeping watchdog on CPU1: acpi_pm wd-wd read

#3 2022-08-08 00:51:48

Re: [SOLVED] clocksource: timekeeping watchdog on CPU1: acpi_pm wd-wd read

#4 2022-08-09 21:03:57

Re: [SOLVED] clocksource: timekeeping watchdog on CPU1: acpi_pm wd-wd read

Board footer