scsi_eh_1 process high cpu after upgrading to 6.5.2

loqs · 2023-09-13 21:18:29

$ git bisect bad
Bisecting: 231 revisions left to test after this (roughly 8 steps)
[af92c02fb2090692f4920ea4b74870940260cf49] Merge patch series "scsi: fixes for targets with many LUNs, and scsi_target_block rework"

https://drive.google.com/file/d/1UnqkZE … sp=sharing linux-6.4rc1.r201.gaf92c02fb209-1-x86_64.pkg.tar.zst
https://drive.google.com/file/d/16b-tw5 … sp=sharing linux-headers-6.4rc1.r201.gaf92c02fb209-1-x86_64.pkg.tar.zst
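
As a rough sanity check on the "roughly 8 steps" estimate: a plain binary search over the remaining revisions needs about as many tests as it takes to halve the count down to zero. The sketch below is only that bound. `bisect_steps` is a hypothetical helper, and it is not how git computes its estimate internally. It does agree with the figures printed in this thread (231 gives 8, 100 gives 7, 48 gives 6).

```sh
#!/bin/sh
# Hypothetical helper: number of halvings needed to exhaust n remaining
# revisions, i.e. ceil(log2(n + 1)). Not git's internal estimate.
bisect_steps() {
    n=$1
    steps=0
    # Each tested build rules out about half of the remaining revisions.
    while [ "$n" -gt 0 ]; do
        n=$((n / 2))
        steps=$((steps + 1))
    done
    echo "$steps"
}

bisect_steps 231   # prints 8
bisect_steps 100   # prints 7
bisect_steps 48    # prints 6
```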

laktak · 2023-09-13 21:41:28

bad: 6.4.0-rc1-1-00201-gaf92c02fb209
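
For anyone following along: the suffix after `-g` in these release strings is the abbreviated commit hash of the build under test (here `af92c02fb209`, matching the commit printed by `git bisect` above). Below is a minimal sketch for pulling it out of a release string. `commit_from_release` is a hypothetical helper name, and the assumed format is `<version>-<count>-g<hash>`, as in this thread.

```sh
#!/bin/sh
# Hypothetical helper: print the abbreviated commit hash embedded in a
# kernel release string of the form <version>-<count>-g<hash>.
commit_from_release() {
    case "$1" in
        # Strip everything up to and including the last "-g".
        *-g[0-9a-f]*) echo "${1##*-g}" ;;
        *) echo "no commit hash in: $1" >&2; return 1 ;;
    esac
}

commit_from_release "6.4.0-rc1-1-00201-gaf92c02fb209"   # prints af92c02fb209
```

On a booted test kernel the argument would be `"$(uname -r)"`. A stock release string such as 6.5.2-arch1-1 has no `-g<hash>` suffix, so the helper returns non-zero for it.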

loqs · 2023-09-13 22:10:44

$ git bisect bad
Bisecting: 100 revisions left to test after this (roughly 7 steps)
[2e2fe5ac695a00ab03cab4db1f4d6be07168ed9d] scsi: 3w-xxxx: Add error handling for initialization failure in tw_probe()

https://drive.google.com/file/d/1eWiEP1 … sp=sharing linux-6.4rc1.r100.g2e2fe5ac695a-1-x86_64.pkg.tar.zst
https://drive.google.com/file/d/19_r7E_ … sp=sharing linux-headers-6.4rc1.r100.g2e2fe5ac695a-1-x86_64.pkg.tar.zst

laktak · 2023-09-14 07:35:44

bad: 6.4.0-rc1-1-00100-g2e2fe5ac695a

Thanks again for helping with this. I'm sure you've automated the process but I didn't realize this was that much work.

loqs · 2023-09-14 08:28:44

$ git bisect bad
Bisecting: 48 revisions left to test after this (roughly 6 steps)
[8759924ddb93498bd5777f0b05b6bc9cacf4ffe3] Merge patch series "scsi: hisi_sas: Some misc changes"

https://drive.google.com/file/d/1vwvtR0 … sp=sharing linux-6.4rc1.r51.g8759924ddb93-1-x86_64.pkg.tar.zst
https://drive.google.com/file/d/1pI50Vq … sp=sharing linux-headers-6.4rc1.r51.g8759924ddb93-1-x86_64.pkg.tar.zst

jerryChung · 2023-09-14 09:00:25

I encountered the same problem in my VM that was previously running Linux 6.4.12.

After upgrading the kernel to Linux 6.5.*, the scsi_eh_1 process uses almost 50% CPU.

Is there any update on this?

laktak wrote:

I've upgraded several VMs
- from Linux arch 6.4.12-arch1-1 #1 SMP PREEMPT_DYNAMIC Thu, 24 Aug 2023 00:38:14 +0000 x86_64 GNU/Linux
- to Linux arch 6.5.2-arch1-1 #1 SMP PREEMPT_DYNAMIC Wed, 06 Sep 2023 21:01:01 +0000 x86_64 GNU/Linux
and in all of them the scsi_eh_1 process eats 10-15% CPU.
I can't find any errors in the logs. When I go back to a snapshot with 6.4.12 the problem disappears. The FS is btrfs.
scsi_eh should be the scsi error handler so I've tried to enable logging (https://github.com/ibm-s390-linux/s390- … ging_level) but I see no output in dmesg/journalctl.
How can I find out what is causing this?
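
On the logging question, here is a sketch of how the SCSI logging level value is put together. It assumes the bit layout in the kernel's drivers/scsi/scsi_logging.h as I understand it: each facility gets a 3-bit level (0 to 7) at a fixed shift. `scsi_log_shift` and `scsi_log_mask` are hypothetical helper names. The value only has an effect on kernels built with CONFIG_SCSI_LOGGING, which may be one reason for seeing no output.

```sh
#!/bin/sh
# Assumed layout: one 3-bit level per facility, at these shifts.
scsi_log_shift() {
    case "$1" in
        error)      echo 0 ;;
        timeout)    echo 3 ;;
        scan)       echo 6 ;;
        mlqueue)    echo 9 ;;
        mlcomplete) echo 12 ;;
        llqueue)    echo 15 ;;
        llcomplete) echo 18 ;;
        hlqueue)    echo 21 ;;
        hlcomplete) echo 24 ;;
        ioctl)      echo 27 ;;
        *)          return 1 ;;
    esac
}

# Usage: scsi_log_mask FACILITY LEVEL [FACILITY LEVEL ...]
# Prints the combined logging_level value as a decimal number.
scsi_log_mask() {
    mask=0
    while [ "$#" -ge 2 ]; do
        shift_by=$(scsi_log_shift "$1") || return 1
        level=$(( $2 & 7 ))                      # clamp to 3 bits
        mask=$(( mask | (level << shift_by) ))
        shift 2
    done
    echo "$mask"
}

scsi_log_mask error 3 timeout 3   # prints 27
```

The result would then be written as root to /proc/sys/dev/scsi/logging_level, for example `scsi_log_mask error 7 timeout 7 > /proc/sys/dev/scsi/logging_level`.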

seth · 2023-09-14 14:16:48

Do you have an optical drive?
Also check your journal, https://bbs.archlinux.org/viewtopic.php … 0#p2120490

laktak · 2023-09-14 19:10:24

loqs wrote:

$ git bisect bad
Bisecting: 48 revisions left to test after this (roughly 6 steps)
[8759924ddb93498bd5777f0b05b6bc9cacf4ffe3] Merge patch series "scsi: hisi_sas: Some misc changes"
https://drive.google.com/file/d/1vwvtR0 … sp=sharing linux-6.4rc1.r51.g8759924ddb93-1-x86_64.pkg.tar.zst
https://drive.google.com/file/d/1pI50Vq … sp=sharing linux-headers-6.4rc1.r51.g8759924ddb93-1-x86_64.pkg.tar.zst

I'm unable to test this - I downloaded a few times but I always get:

...
(1/2) downgrading linux                                                                                [############################################################] 100%
error: could not extract /usr/lib/modules/6.4.0-rc1-1-00051-g8759924ddb93/kernel/drivers/gpu/drm/amd/amdgpu/amdgpu.ko.zst (Zstd decompression failed: Data corruption detected)
error: problem occurred while upgrading linux
error: could not commit transaction
error: failed to commit transaction (transaction aborted)
Errors occurred, no packages were upgraded.

laktak · 2023-09-14 19:12:04

jerryChung wrote:

I encountered the same problem in my VM that was previously running Linux 6.4.12.
After upgrading the kernel to Linux 6.5.*, the scsi_eh_1 process uses almost 50% CPU.
Is there any update on this?

We are trying to find the cause.

As a workaround you can switch your drives to SATA or try to remove the optical drive.

loqs · 2023-09-14 19:48:20

I can not reproduce the error on this system. Hopefully rebuilding the package will fix whatever caused the issue.
Rebuilt package:
https://drive.google.com/file/d/1J7UMFX … sp=sharing linux-6.4rc1.r51.g8759924ddb93-1-x86_64.pkg.tar.zst
https://drive.google.com/file/d/1UKNsWm … sp=sharing linux-headers-6.4rc1.r51.g8759924ddb93-1-x86_64.pkg.tar.zst

seth · 2023-09-14 19:50:16

@laktak, could you confirm that it's directly tied to an optical drive?
Is your setup similar to MoZhonghua's?

laktak · 2023-09-14 20:27:26

good: 6.4.0-rc1-1-00051-g8759924ddb93

laktak · 2023-09-14 20:37:40

seth wrote:

@laktak, could you confirm that it's directly tied to an optical drive?
Is your setup similar to MoZhonghua's?

I can confirm that the error disappears when I remove the cd-rom even though the device is configured as ide1 (there is no ide0) and not scsi:

ide1:0.present = "TRUE"
ide1:0.autodetect = "TRUE"
ide1:0.deviceType = "cdrom-image"
ide1:0.startConnected = "FALSE"
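
To check other VMs for the same kind of device, here is a minimal sketch that lists the device prefixes in a .vmx file whose deviceType starts with "cdrom". `vmx_cdrom_devices` is a hypothetical helper name, and the pattern assumes the `key = "value"` line format shown above.

```sh
#!/bin/sh
# Hypothetical helper: print device prefixes (e.g. ide1:0) whose
# deviceType value starts with "cdrom" in the given .vmx file.
vmx_cdrom_devices() {
    sed -n 's/^\([A-Za-z]*[0-9]*:[0-9]*\)\.deviceType *= *"cdrom[^"]*".*$/\1/p' "$1"
}

# Usage (the path is a placeholder):
#   vmx_cdrom_devices /path/to/vm.vmx
```

For the configuration above this prints ide1:0. Removing the drive in the VM settings, as described in this post, is the tested workaround; editing the file by hand is untested here.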

RuneArch · 2023-09-14 20:45:55

Just to add a "me too"

I found FS#79644 - [linux] Kernel 6.5.2 Causes Marvell Technology Group 88SE9128 PCIe SATA to Constantly Reset which directed me here

Same messages in journal:

Sep 14 17:01:11 nas kernel: ata16: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
Sep 14 17:01:11 nas kernel: ata16.00: configured for UDMA/66

Downgraded the kernel from 6.5.3 to 6.4.12 and the journal messages disappear

I stepped back up to 6.5.2 and the messages reappear, so it's definitely between 6.4.12 and 6.5.2.... 6.5.3 does not resolve it (for me).

This is on a NAS (ASRock Rack C2550D4I) with no optical drives.

In my case, it's a Marvell 88SE9230:

$ ls -l /sys/class/ata_port/ | grep ata16
lrwxrwxrwx  1 root root 0 Sep 14 21:45 ata16 -> ../../devices/pci0000:00/0000:00:04.0/0000:09:00.0/ata16/ata_port/ata16

$ lspci | grep 09:00
09:00.0 SATA controller: Marvell Technology Group Ltd. 88SE9230 PCIe 2.0 x2 4-port SATA 6 Gb/s RAID Controller (rev 11)

$ cat /sys/bus/pci/devices/0000\:09\:00.0/uevent
DRIVER=ahci
PCI_CLASS=10601
PCI_ID=1B4B:9230
PCI_SUBSYS_ID=1849:9230
PCI_SLOT_NAME=0000:09:00.0
MODALIAS=pci:v00001B4Bd00009230sv00001849sd00009230bc01sc06i01
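
The lookup above (ata port, then PCI address, then controller) can be sketched as a small helper that extracts the PCI address from the sysfs link target. `pci_addr_of_ata_link` is a hypothetical name, and the path layout is assumed to match the symlink shown in this post.

```sh
#!/bin/sh
# Hypothetical helper: given the target of /sys/class/ata_port/ataN,
# print the PCI address of the controller that owns the port.
pci_addr_of_ata_link() {
    link=$1
    # Drop everything from the "/ataN/ata_port/..." component onwards,
    parent=${link%/ata[0-9]*/ata_port/*}
    # then keep the last remaining path component.
    echo "${parent##*/}"
}

pci_addr_of_ata_link "../../devices/pci0000:00/0000:00:04.0/0000:09:00.0/ata16/ata_port/ata16"
# prints 0000:09:00.0
```

On a live system the argument would come from `readlink /sys/class/ata_port/ata16`, and the result can be passed to `lspci -s` to identify the controller.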

Last edited by RuneArch (2023-09-14 20:56:28)

loqs · 2023-09-14 20:49:41

$ git bisect good
Bisecting: 21 revisions left to test after this (roughly 5 steps)
[7907ad748bdba8ac9ca47f0a650cc2e5d2ad6e24] Merge patch series "Use block pr_ops in LIO"
[stephen@arch ~/builds/linux-stable/src/linux]$ git bisect visualize

https://drive.google.com/file/d/1jYdoHb … sp=sharing linux-6.4rc1.r78.g7907ad748bdb-1-x86_64.pkg.tar.zst
https://drive.google.com/file/d/10s4GjB … sp=sharing linux-headers-6.4rc1.r78.g7907ad748bdb-1-x86_64.pkg.tar.zst

laktak · 2023-09-14 20:54:19

good: 6.4.0-rc1-1-00078-g7907ad748bdb

loqs · 2023-09-14 21:13:22

@RuneArch did linux-6.5.3.arch1-1.2-x86_64.pkg.tar.zst from https://bbs.archlinux.org/viewtopic.php … 4#p2120494 still have the issue? Would be useful to report your finding to https://lore.kernel.org/regressions/ZQH … x1-carbon/ or https://bugzilla.kernel.org/show_bug.cgi?id=217902

$ git bisect good
Bisecting: 10 revisions left to test after this (roughly 4 steps)
[390e2d1a587405a522dc6b433d45648f895a352c] scsi: sd: Handle read/write CDL timeout failures

https://drive.google.com/file/d/1PzxdNP … sp=sharing linux-6.4rc1.r11.g390e2d1a5874-1-x86_64.pkg.tar.zst
https://drive.google.com/file/d/1QxZNd2 … sp=sharing linux-headers-6.4rc1.r11.g390e2d1a5874-1-x86_64.pkg.tar.zst

laktak · 2023-09-14 21:52:19

bad: 6.4.0-rc1-1-00011-g390e2d1a5874

loqs · 2023-09-14 22:12:04

$ git bisect bad
Bisecting: 5 revisions left to test after this (roughly 3 steps)
[734326937b65cec7ffd00bfbbce0f791ac4aac84] scsi: core: Rename and move get_scsi_ml_byte()

https://drive.google.com/file/d/1J2Gqwu … sp=sharing linux-6.4rc1.r5.g734326937b65-1-x86_64.pkg.tar.zst
https://drive.google.com/file/d/1Y37yPf … sp=sharing linux-headers-6.4rc1.r5.g734326937b65-1-x86_64.pkg.tar.zst

laktak · 2023-09-14 22:26:01

good: 6.4.0-rc1-1-00005-g734326937b65

loqs · 2023-09-14 22:48:33

$ git bisect good
Bisecting: 2 revisions left to test after this (roughly 2 steps)
[624885209f31eb9985bf51abe204ecbffe2fdeea] scsi: core: Detect support for command duration limits

https://drive.google.com/file/d/1hvAmA3 … sp=sharing linux-6.4rc1.r8.g624885209f31-1-x86_64.pkg.tar.zst
https://drive.google.com/file/d/1ein1Ww … sp=sharing linux-headers-6.4rc1.r8.g624885209f31-1-x86_64.pkg.tar.zst

jerryChung · 2023-09-15 03:25:19

@laktak @loqs @seth

Unlike in your case, my VM did not have an optical drive connected.

I tried connecting an optical drive of type SATA, and the scsi_eh_1 process now behaves normally.

Thanks for your suggestions!

laktak wrote:

We are trying to find the cause.
As a workaround you can switch your drives to SATA or try to remove the optical drive.

laktak · 2023-09-15 06:55:46

bad: 6.4.0-rc1-1-00008-g624885209f31

leonshaw · 2023-09-15 07:48:12

I hit the same problem. Based on the bisect results from @laktak and @loqs, I found that removing the scsi_cdl_check() calls introduced in commit 624885209f31eb9985bf51abe204ecbffe2fdeea fixes the problem:

diff --git a/drivers/scsi/scsi_scan.c b/drivers/scsi/scsi_scan.c
index aa13feb17c62..d217be323cc6 100644
--- a/drivers/scsi/scsi_scan.c
+++ b/drivers/scsi/scsi_scan.c
@@ -1087,8 +1087,6 @@ static int scsi_add_lun(struct scsi_device *sdev, unsigned char *inq_result,
 	if (sdev->scsi_level >= SCSI_3)
 		scsi_attach_vpd(sdev);

-	scsi_cdl_check(sdev);
-
 	sdev->max_queue_depth = sdev->queue_depth;
 	WARN_ON_ONCE(sdev->max_queue_depth > sdev->budget_map.depth);
 	sdev->sdev_bflags = *bflags;
@@ -1626,7 +1624,6 @@ void scsi_rescan_device(struct device *dev)
 	device_lock(dev);

 	scsi_attach_vpd(sdev);
-	scsi_cdl_check(sdev);

 	if (sdev->handler && sdev->handler->rescan)
 		sdev->handler->rescan(sdev);

D3vil0p3r · 2023-09-15 09:15:38

Same issue here on VMware after upgrading the kernel to:
6.5.3-zen1-1-zen
