scsi_eh_1 process high cpu after upgrading to 6.5.2

D3vil0p3r · 2023-09-15 09:16:42

How to manage this?

Last edited by D3vil0p3r (2023-09-15 09:17:36)

loqs · 2023-09-15 09:26:25

$ git bisect bad
Bisecting: 0 revisions left to test after this (roughly 1 step)
[152e52fb6ff180e97d64585e87fea44c49b8bda8] scsi: core: Support Service Action in scsi_report_opcode()

https://drive.google.com/file/d/1egNtO7 … sp=sharing linux-6.4rc1.r7.g152e52fb6ff1-1-x86_64.pkg.tar.zst
https://drive.google.com/file/d/1N6_z7E … sp=sharing linux-headers-6.4rc1.r7.g152e52fb6ff1-1-x86_64.pkg.tar.zst
Edit:
Based on leonshaw's findings https://bbs.archlinux.org/viewtopic.php … 3#p2120823 this step will be good

$ git bisect good
624885209f31eb9985bf51abe204ecbffe2fdeea is the first bad commit
commit 624885209f31eb9985bf51abe204ecbffe2fdeea
Author: Damien Le Moal <dlemoal@kernel.org>
Date:   Thu May 11 03:13:41 2023 +0200

    scsi: core: Detect support for command duration limits
    
    Introduce the function scsi_cdl_check() to detect if a device supports
    command duration limits (CDL). Support for the READ 16, WRITE 16, READ 32
    and WRITE 32 commands are checked using the function scsi_report_opcode()
    to probe the rwcdlp and cdlp bits as they indicate the mode page defining
    the command duration limits descriptors that apply to the command being
    tested.
    
    If any of these commands support CDL, the field cdl_supported of struct
    scsi_device is set to 1 to indicate that the device supports CDL.
    
    Support for CDL for a device is advertizes through sysfs using the new
    cdl_supported device attribute. This attribute value is 1 for a device
    supporting CDL and 0 otherwise.
    
    Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
    Reviewed-by: Hannes Reinecke <hare@suse.de>
    Co-developed-by: Niklas Cassel <niklas.cassel@wdc.com>
    Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com>
    Link: https://lore.kernel.org/r/20230511011356.227789-9-nks@flawful.org
    Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>

 Documentation/ABI/testing/sysfs-block-device |  9 ++++
 drivers/scsi/scsi.c                          | 81 ++++++++++++++++++++++++++++
 drivers/scsi/scsi_scan.c                     |  3 ++
 drivers/scsi/scsi_sysfs.c                    |  2 +
 include/scsi/scsi_device.h                   |  3 ++
 5 files changed, 98 insertions(+)
[stephen@arch ~/builds/linux-stable/src/linux]$ git bisect log
git bisect start
# status: waiting for both good and bad commits
# good: [6995e2de6891c724bfeb2db33d7b87775f913ad1] Linux 6.4
git bisect good 6995e2de6891c724bfeb2db33d7b87775f913ad1
# bad: [2dde18cd1d8fac735875f2e4987f11817cc0bc2c] Linux 6.5
git bisect bad 2dde18cd1d8fac735875f2e4987f11817cc0bc2c
# good: [b775d6c5859affe00527cbe74263de05cfe6b9f9] Merge tag 'mips_6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/mips/linux
git bisect good b775d6c5859affe00527cbe74263de05cfe6b9f9
# bad: [56cbceab928d7ac3702de172ff8dcc1da2a6aaeb] Merge tag 'usb-6.5-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb
git bisect bad 56cbceab928d7ac3702de172ff8dcc1da2a6aaeb
# good: [b30d7a77c53ec04a6d94683d7680ec406b7f3ac8] Merge tag 'perf-tools-for-v6.5-1-2023-06-28' of git://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next
git bisect good b30d7a77c53ec04a6d94683d7680ec406b7f3ac8
# bad: [dfab92f27c600fea3cadc6e2cb39f092024e1fef] Merge tag 'nfs-for-6.5-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
git bisect bad dfab92f27c600fea3cadc6e2cb39f092024e1fef
# bad: [28968f384be3c064d66954aac4c534a5e76bf973] Merge tag 'pinctrl-v6.5-1' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl
git bisect bad 28968f384be3c064d66954aac4c534a5e76bf973
# bad: [af92c02fb2090692f4920ea4b74870940260cf49] Merge patch series "scsi: fixes for targets with many LUNs, and scsi_target_block rework"
git bisect bad af92c02fb2090692f4920ea4b74870940260cf49
# bad: [2e2fe5ac695a00ab03cab4db1f4d6be07168ed9d] scsi: 3w-xxxx: Add error handling for initialization failure in tw_probe()
git bisect bad 2e2fe5ac695a00ab03cab4db1f4d6be07168ed9d
# good: [8759924ddb93498bd5777f0b05b6bc9cacf4ffe3] Merge patch series "scsi: hisi_sas: Some misc changes"
git bisect good 8759924ddb93498bd5777f0b05b6bc9cacf4ffe3
# good: [7907ad748bdba8ac9ca47f0a650cc2e5d2ad6e24] Merge patch series "Use block pr_ops in LIO"
git bisect good 7907ad748bdba8ac9ca47f0a650cc2e5d2ad6e24
# bad: [390e2d1a587405a522dc6b433d45648f895a352c] scsi: sd: Handle read/write CDL timeout failures
git bisect bad 390e2d1a587405a522dc6b433d45648f895a352c
# good: [734326937b65cec7ffd00bfbbce0f791ac4aac84] scsi: core: Rename and move get_scsi_ml_byte()
git bisect good 734326937b65cec7ffd00bfbbce0f791ac4aac84
# bad: [624885209f31eb9985bf51abe204ecbffe2fdeea] scsi: core: Detect support for command duration limits
git bisect bad 624885209f31eb9985bf51abe204ecbffe2fdeea
# good: [152e52fb6ff180e97d64585e87fea44c49b8bda8] scsi: core: Support Service Action in scsi_report_opcode()
git bisect good 152e52fb6ff180e97d64585e87fea44c49b8bda8
# first bad commit: [624885209f31eb9985bf51abe204ecbffe2fdeea] scsi: core: Detect support for command duration limits

https://git.kernel.org/pub/scm/linux/ke … bffe2fdeea
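
The commit message above says CDL support is advertised through a new cdl_supported sysfs attribute. A minimal sketch for listing it on a running system, assuming the attribute appears as /sys/block/<disk>/device/cdl_supported (that is my reading of the ABI file the commit touches; the function name list_cdl_support is made up for illustration):

```shell
# Hypothetical helper: print "<disk> <value>" for every block device that
# exposes a cdl_supported attribute. The optional argument is the sysfs
# root (default /sys), which makes the function easy to try on a fake tree.
list_cdl_support() {
    root=${1:-/sys}
    for attr in "$root"/block/*/device/cdl_supported; do
        [ -r "$attr" ] || continue        # skip devices without the attribute
        disk=${attr#"$root"/block/}       # sda/device/cdl_supported
        disk=${disk%%/*}                  # sda
        printf '%s %s\n' "$disk" "$(cat "$attr")"
    done
}

# On a live system:
#   list_cdl_support
```

Kernels older than the bisected commit will simply print nothing, since the attribute does not exist there.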

Last edited by loqs (2023-09-15 10:15:18)

laktak · 2023-09-15 12:12:48

loqs wrote:

Based on leonshaw's findings https://bbs.archlinux.org/viewtopic.php … 3#p2120823 this step will be good

good: 6.4.0-rc1-1-00007-g152e52fb6ff1

correct. How can we proceed? Who has experience reporting bugs to the kernel team? I don't have any (yet)

loqs · 2023-09-15 12:32:33

Either file it at https://bugzilla.kernel.org/ (Product: IO/Storage, Component: SCSI) and add Damien Le Moal <dlemoal@kernel.org> and Niklas Cassel <niklas.cassel@wdc.com> to the CC list, or reply to https://lore.kernel.org/all/20230511011 … awful.org/ (instructions on how to do that are in the link). I would also raise it with VMware (assuming you cannot reproduce the issue using a different virtualization platform), though I have no idea how you would do that.
Edit:
https://lore.kernel.org/regressions/49d … idgow.net/
Edit2:
6.5.3-arch1 with https://lore.kernel.org/linux-scsi/2023 … ernel.org/ applied:

https://drive.google.com/file/d/1oDrKqV … sp=sharing linux-6.5.3.arch1-1.3-x86_64.pkg.tar.zst
https://drive.google.com/file/d/1mPWP7g … sp=sharing linux-headers-6.5.3.arch1-1.3-x86_64.pkg.tar.zst

Last edited by loqs (2023-09-15 16:58:10)

wretchedbanana · 2023-09-15 17:32:48

loqs wrote:

I would also raise it with vmware (assuming you can not reproduce the issue using a different virtualization)

Ftr, I can reliably reproduce the issue by connecting an external USB2.0 mechanical drive on 6.5.2; i.e., [SCSI] errors, scsi_eh_n processes firing up at high CPU.

Baremetal, no virtualization involved. Unfortunately I don't own the box and have no way to further debug today. Cannot verify if it's indeed the same issue, but highly probable. The drive works fine elsewhere. Just a heads up.
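
For anyone who wants to check for the same symptom, here is a rough sketch that filters ps output for scsi_eh_N kernel threads above a CPU threshold. The function name busy_scsi_eh and the default threshold of 50 are my own assumptions, not something from this thread:

```shell
# Hypothetical helper: read "PID %CPU COMMAND" lines on stdin and print the
# scsi_eh_N threads whose CPU usage is at or above the given threshold.
busy_scsi_eh() {
    awk -v min="${1:-50}" \
        '$3 ~ /^scsi_eh_[0-9]+$/ && $2 + 0 >= min + 0 { print $1, $2, $3 }'
}

# On a live system (the trailing "=" suppresses the ps header line):
#   ps -eo pid=,pcpu=,comm= | busy_scsi_eh 50
```

Note that ps averages CPU usage over the lifetime of the thread, so a thread that only recently started spinning may show up in top before it crosses the threshold here.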

loqs · 2023-09-15 17:53:54

wretchedbanana wrote:

Cannot verify if it's indeed the same issue, but highly probable.

Can you try linux-6.5.3.arch1-1.3-x86_64.pkg.tar.zst and see if that fixes your issue?

laktak · 2023-09-15 19:34:14

Thanks loqs!

I've reported the bug here: https://bugzilla.kernel.org/show_bug.cgi?id=217914

(though bugzilla did not accept the addresses for cc)

Last edited by laktak (2023-09-15 19:35:04)

loqs · 2023-09-15 19:44:32

@laktak Please see my Edit2 to post #54. Did it fix the issue for you?

laktak · 2023-09-15 19:58:44

Yes, "6.5.3-arch1 with https://lore.kernel.org/linux-scsi/2023 … ernel.org/ applied" is good!

mattbarszcz · 2023-09-15 20:20:16

Loqs,

I can confirm that installing linux-6.5.3.arch1-1.3-x86_64.pkg.tar.zst and linux-headers-6.5.3.arch1-1.3-x86_64.pkg.tar.zst did resolve the issue on my system as well (Arch Linux VM running on ESXi 7.0.3).

loqs · 2023-09-15 21:13:41

laktak wrote:

Yes, "6.5.3-arch1 with https://lore.kernel.org/linux-scsi/2023 … ernel.org/ applied" is good!

Great. Can you please report that finding in the bugzilla report you opened?

laktak · 2023-09-18 12:59:34

I was a bit late and now it's already mentioned.

Thanks for your help!

ericj1966 · 2023-09-24 11:24:28

Not only is CPU usage high; the very high I/O wait is hurting my VM even more after upgrading to 6.5.
After applying https://lore.kernel.org/linux-scsi/2023 … ernel.org/ (edit2 in #54) the problem is solved for now, so many thanks!

Zepman · 2023-09-27 14:18:57

This bug still exists with VMware VMs and kernel 6.5.5-arch1-1 if a virtual optical drive is connected. This is also discussed here:
https://bugzilla.kernel.org/show_bug.cgi?id=217914#c6

If you do not have control over the connected virtual optical drive (e.g. you are renting a VPS), it is possible to simply disable the drive's controller as a workaround.

1.
Find out the relevant PCI device ID of the controller to which sr0 is connected.

$ ls -l /dev/disk/by-path

Format of a PCI device ID:
0000:00:12.3

Make sure that the PCI device ID is not the same as that of other storage! If it is, it means that some storage is connected to the same device as the virtual optical drive. This is unlikely, but not an impossible situation.
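
Step 1 can be scripted instead of read off the ls output by eye. This is only a sketch: it assumes the by-path link names start with pci-<address>- (which is what I see on my systems), and the function name pci_ids_for_dev is made up:

```shell
# Hypothetical helper: print the PCI address embedded in every by-path link
# that points at the given device name.
# Usage: pci_ids_for_dev DEVNAME [BY_PATH_DIR]
pci_ids_for_dev() {
    dev=$1
    dir=${2:-/dev/disk/by-path}
    for link in "$dir"/pci-*; do
        [ -L "$link" ] || continue
        target=$(readlink "$link")            # e.g. ../../sr0
        [ "${target##*/}" = "$dev" ] || continue
        name=${link##*/}                      # pci-0000:00:12.3-ata-2
        name=${name#pci-}                     # 0000:00:12.3-ata-2
        printf '%s\n' "${name%%-*}"           # 0000:00:12.3
    done | sort -u
}

# On a live system:
#   pci_ids_for_dev sr0
#   pci_ids_for_dev sda    # compare: should print a different address
```

Running it for sr0 and for each of your disks makes the "not the same as that of other storage" check a direct comparison of the printed addresses.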

2.
Find the relevant sysfs path.

$ find /sys/bus/pci/drivers -name "*0000:00:12.3*"

The relevant module is likely ata_piix. In this case, the relevant path would be:
/sys/bus/pci/drivers/ata_piix/unbind

Strip everything that comes after /sys/bus/pci/drivers/<drivername>, and add /unbind at the end.
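
As an alternative to find, the driver can be read from the driver symlink that sysfs keeps under /sys/bus/pci/devices/<address>/. A minimal sketch, with a made-up function name (unbind_path_for) and an optional sysfs root argument that exists only so it can be tried on a fake tree:

```shell
# Hypothetical helper: print the unbind path for the driver currently bound
# to a PCI address, or fail if no driver is bound.
# Usage: unbind_path_for PCI_ADDR [SYSFS_ROOT]
unbind_path_for() {
    addr=$1
    sysfs=${2:-/sys}
    link=$sysfs/bus/pci/devices/$addr/driver
    if [ ! -L "$link" ]; then
        echo "no driver bound to $addr" >&2
        return 1
    fi
    target=$(readlink "$link")                # e.g. ../../../bus/pci/drivers/ata_piix
    printf '%s/bus/pci/drivers/%s/unbind\n' "$sysfs" "${target##*/}"
}

# On a live system:
#   unbind_path_for 0000:00:12.3
```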

3.
Disable the controller:

# echo "0000:00:12.3" > /sys/bus/pci/drivers/ata_piix/unbind

Adjust the command with the PCI device ID and controller module name you found earlier.
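
If you want to script the unbind, it helps to make it a no-op when the device is already unbound, since writing an unbound address to unbind returns an error. A sketch under that assumption; the function name unbind_controller and the optional sysfs root argument are mine:

```shell
# Hypothetical helper: unbind a PCI device from its driver, but only if it is
# currently bound (bound devices show up as symlinks in the driver directory).
# Usage: unbind_controller PCI_ADDR DRIVER [SYSFS_ROOT]   (needs root on /sys)
unbind_controller() {
    addr=$1
    driver=$2
    sysfs=${3:-/sys}
    drvdir=$sysfs/bus/pci/drivers/$driver
    if [ ! -L "$drvdir/$addr" ]; then
        echo "$addr is not bound to $driver, nothing to do" >&2
        return 0
    fi
    printf '%s\n' "$addr" > "$drvdir/unbind"
}

# On a live system, as root:
#   unbind_controller 0000:00:12.3 ata_piix
```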

The CPU hogging will stop immediately after executing the command. It needs to be rerun after a reboot. A systemd service can take care of that. Alternatively, there is probably some smarter udev rule or kernel command line argument I haven't bothered to look up, but which might be more appropriate.
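
A sketch of what such a systemd service could look like, written out by a small shell function. The unit name unbind-optical-controller.service and the function name are invented for illustration, and I have not tuned the ordering relative to other units:

```shell
# Hypothetical helper: write a oneshot unit that repeats the unbind from
# step 3 at every boot.
# Usage: write_unbind_unit PCI_ADDR DRIVER [UNIT_DIR]
write_unbind_unit() {
    addr=$1
    driver=$2
    dir=${3:-/etc/systemd/system}
    cat > "$dir/unbind-optical-controller.service" <<EOF
[Unit]
Description=Unbind PCI device $addr from $driver (scsi_eh CPU workaround)

[Service]
Type=oneshot
ExecStart=/usr/bin/sh -c 'echo $addr > /sys/bus/pci/drivers/$driver/unbind'

[Install]
WantedBy=multi-user.target
EOF
}

# On a live system, as root:
#   write_unbind_unit 0000:00:12.3 ata_piix
#   systemctl daemon-reload
#   systemctl enable unbind-optical-controller.service
```

Remember to disable and delete the unit again once you are running a kernel that contains the fix.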

It might also be possible to simply unload the relevant module using rmmod ata_piix (or another module belonging to the controller of the virtual optical drive), but this carries a higher risk of also disabling access to other storage (if connected to a controller using the same module).
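
To judge that risk before reaching for rmmod, you can list every PCI device currently bound to the module. A minimal sketch (the function name devices_bound_to is made up); it relies on bound devices appearing as symlinks named after their PCI address inside the driver directory:

```shell
# Hypothetical helper: print the PCI addresses of all devices bound to a driver.
# Usage: devices_bound_to DRIVER [SYSFS_ROOT]
devices_bound_to() {
    driver=$1
    sysfs=${2:-/sys}
    for entry in "$sysfs/bus/pci/drivers/$driver"/*; do
        [ -L "$entry" ] || continue           # bind, unbind etc. are plain files
        name=${entry##*/}
        case $name in
            # keep only names that look like a PCI address, skipping "module"
            [0-9a-f][0-9a-f][0-9a-f][0-9a-f]:*) printf '%s\n' "$name" ;;
        esac
    done
}

# On a live system:
#   devices_bound_to ata_piix
```

If this prints more than one address, unloading the module would detach all of those controllers, not just the one with the optical drive.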

Last edited by Zepman (2023-09-27 14:26:14)

wretchedbanana · 2023-09-27 15:56:42

Apologies for the radio silence--been traveling and testing. The USB drive issue I was reporting was indeed another instance of this regression. Namely, as the patch describes, older storage devices sometimes entirely hang on receiving the unsupported CDL probe.

That happens far less consistently than the VMware optical drive issue: I have had servers go unresponsive after days of normal operation over the past week. Unless you are absolutely sure the broader issue does not affect you, I do not recommend the partial workarounds described in this thread.

Last edited by wretchedbanana (2023-09-27 15:57:35)

loqs · 2023-09-27 17:36:13

The fix has just landed in mainline [1], so it should be included in a future stable release. If you want the fix to be included sooner, you could open a bug report on the Arch bug tracker.
Edit:
Queued for 6.5.6 [2].

[1] https://git.kernel.org/pub/scm/linux/ke … 22244ae342
[2] https://git.kernel.org/pub/scm/linux/ke … 8f496c335e

Last edited by loqs (2023-10-04 15:54:49)

WxC · 2023-10-11 05:50:39

I can confirm that it's fixed on my 3 Arch VMs inside VMware with the 6.5.6 release.
