pytorch-rocm-5.7.1 and ctranslate2 gfx803 - python detects cuda device

chboi · 2024-07-14 10:50:29

PyTorch detects the GPU:

CUDA available: True
CUDA device: AMD Radeon RX 580 Series

but loading the model through ctranslate2 fails:

RuntimeError: CUDA failed with error no CUDA-capable device is detected

The script:

import whisperx
import torch
import gc

hf_token = ""

# Check if CUDA is available
print("CUDA available: ", torch.cuda.is_available())

# Print CUDA device name
if torch.cuda.is_available():
    print("CUDA device: ", torch.cuda.get_device_name(0))

# Settings
device = "cuda"
audio_file = ""
batch_size = 16 # Reduce if low on GPU memory
compute_type = "float16" # Change to "int8" if low on GPU memory (may reduce accuracy)

# Step 1: Transcribe with original whisper (batched)
model = whisperx.load_model("tiny.en", device, compute_type=compute_type)

# Optionally, save model to local path
# model_dir = ""
# model = whisperx.load_model("large-v2", device, compute_type=compute_type, download_root=model_dir)

audio = whisperx.load_audio(audio_file)
result = model.transcribe(audio, batch_size=batch_size)
print("Segments before alignment:", result["segments"])

# Delete model if low on GPU resources (drop the reference first, then free the cache)
del model
gc.collect()
torch.cuda.empty_cache()

# Step 2: Align whisper output
model_a, metadata = whisperx.load_align_model(language_code=result["language"], device=device)
result = whisperx.align(result["segments"], model_a, metadata, audio, device, return_char_alignments=False)
print("Segments after alignment:", result["segments"])

# Delete model if low on GPU resources (drop the reference first, then free the cache)
del model_a
gc.collect()
torch.cuda.empty_cache()

# Step 3: Assign speaker labels
diarize_model = whisperx.DiarizationPipeline(use_auth_token=hf_token, device=device)

# Add min/max number of speakers if known
diarize_segments = diarize_model(audio)
# diarize_model(audio, min_speakers=min_speakers, max_speakers=max_speakers)

result = whisperx.assign_word_speakers(diarize_segments, result)
print("Diarize segments:", diarize_segments)
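The script above sets device = "cuda" because torch.cuda.is_available() returns True, but ROCm builds of PyTorch expose HIP devices through the torch.cuda API, while the transcription step goes through ctranslate2, which does its own device detection. A minimal sketch of asking each backend separately before choosing a device follows. The helper names choose_settings and probe_backends are made up for illustration; torch.version.hip and ctranslate2.get_cuda_device_count() are the only library calls relied on, and the "int8" CPU fallback is an assumption taken from the comment in the script.

```python
def choose_settings(torch_gpu_ok, ct2_cuda_count):
    """Pick a device per WhisperX stage from what each backend reports.

    Hypothetical helper: transcription follows ctranslate2's device count,
    alignment and diarization follow PyTorch.
    """
    transcribe_device = "cuda" if ct2_cuda_count > 0 else "cpu"
    return {
        "transcribe_device": transcribe_device,
        # Assumption: fall back to int8 when transcription runs on the CPU.
        "compute_type": "float16" if transcribe_device == "cuda" else "int8",
        "torch_device": "cuda" if torch_gpu_ok else "cpu",
    }


def probe_backends():
    """Return (torch_gpu_ok, ct2_cuda_count); missing libraries count as no GPU."""
    torch_gpu_ok = False
    try:
        import torch

        if torch.cuda.is_available():
            # torch.version.hip is a version string on ROCm builds, None otherwise.
            print("PyTorch HIP version:", getattr(torch.version, "hip", None))
            try:
                # is_available() alone is not enough: run one real operation.
                torch.zeros(1, device="cuda").add_(1).cpu()
                torch_gpu_ok = True
            except RuntimeError as err:
                print("PyTorch GPU operation failed:", err)
    except ImportError:
        pass

    ct2_cuda_count = 0
    try:
        import ctranslate2

        ct2_cuda_count = ctranslate2.get_cuda_device_count()
    except (ImportError, RuntimeError):
        pass
    return torch_gpu_ok, ct2_cuda_count


if __name__ == "__main__":
    print(choose_settings(*probe_backends()))
```

The returned "transcribe_device" and "compute_type" would go to whisperx.load_model, and "torch_device" to the alignment and diarization steps.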

[c@archb ~]$ rocminfo
ROCk module is loaded
=====================
HSA System Attributes
=====================
Runtime Version: 1.1
System Timestamp Freq.: 1000.000000MHz
Sig. Max Wait Duration: 18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model: LARGE
System Endianness: LITTLE
Mwaitx: DISABLED
DMAbuf Support: YES

==========
HSA Agents
==========
*******
Agent 1
*******
Name: Intel(R) Core(TM) i7-7700K CPU @ 4.20GHz
Uuid: CPU-XX
Marketing Name: Intel(R) Core(TM) i7-7700K CPU @ 4.20GHz
Vendor Name: CPU
Feature: None specified
Profile: FULL_PROFILE
Float Round Mode: NEAR
Max Queue Number: 0(0x0)
Queue Min Size: 0(0x0)
Queue Max Size: 0(0x0)
Queue Type: MULTI
Node: 0
Device Type: CPU
Cache Info:
L1: 32768(0x8000) KB
Chip ID: 0(0x0)
ASIC Revision: 0(0x0)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 4500
BDFID: 0
Internal Node ID: 0
Compute Unit: 8
SIMDs per CU: 0
Shader Engines: 0
Shader Arrs. per Eng.: 0
WatchPts on Addr. Ranges:1
Features: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: FINE GRAINED
Size: 32831460(0x1f4f7e4) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 2
Segment: GLOBAL; FLAGS: KERNARG, FINE GRAINED
Size: 32831460(0x1f4f7e4) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 3
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 32831460(0x1f4f7e4) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
ISA Info:
*******
Agent 2
*******
Name: gfx803
Uuid: GPU-XX
Marketing Name: AMD Radeon RX 580 Series
Vendor Name: AMD
Feature: KERNEL_DISPATCH
Profile: BASE_PROFILE
Float Round Mode: NEAR
Max Queue Number: 128(0x80)
Queue Min Size: 64(0x40)
Queue Max Size: 131072(0x20000)
Queue Type: MULTI
Node: 1
Device Type: GPU
Cache Info:
L1: 16(0x10) KB
Chip ID: 26591(0x67df)
ASIC Revision: 1(0x1)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 1411
BDFID: 256
Internal Node ID: 1
Compute Unit: 36
SIMDs per CU: 4
Shader Engines: 4
Shader Arrs. per Eng.: 1
WatchPts on Addr. Ranges:4
Features: KERNEL_DISPATCH
Fast F16 Operation: TRUE
Wavefront Size: 64(0x40)
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Max Waves Per CU: 40(0x28)
Max Work-item Per CU: 2560(0xa00)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
Max fbarriers/Workgrp: 32
Packet Processor uCode:: 730
SDMA engine uCode:: 58
IOMMU Support:: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 8388608(0x800000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 2
Segment: GLOBAL; FLAGS:
Size: 8388608(0x800000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 3
Segment: GROUP
Size: 64(0x40) KB
Allocatable: FALSE
Alloc Granule: 0KB
Alloc Alignment: 0KB
Accessible by all: FALSE
ISA Info:
ISA 1
Name: amdgcn-amd-amdhsa--gfx803
Machine Models: HSA_MACHINE_MODEL_LARGE
Profiles: HSA_PROFILE_BASE
Default Rounding Mode: NEAR
Default Rounding Mode: NEAR
Fast f16: TRUE
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
FBarrier Max Size: 32
*** Done ***

[c@archb ~]$ clinfo
Number of platforms: 1
Platform Profile: FULL_PROFILE
Platform Version: OpenCL 2.1 AMD-APP.dbg (3590.0)
Platform Name: AMD Accelerated Parallel Processing
Platform Vendor: Advanced Micro Devices, Inc.
Platform Extensions: cl_khr_icd cl_amd_event_callback

Platform Name: AMD Accelerated Parallel Processing
Number of devices: 1
Device Type: CL_DEVICE_TYPE_GPU
Vendor ID: 1002h
Board name: AMD Radeon RX 580 Series
Device Topology: PCI[ B#1, D#0, F#0 ]
Max compute units: 36
Max work items dimensions: 3
Max work items[0]: 1024
Max work items[1]: 1024
Max work items[2]: 1024
Max work group size: 256
Preferred vector width char: 4
Preferred vector width short: 2
Preferred vector width int: 1
Preferred vector width long: 1
Preferred vector width float: 1
Preferred vector width double: 1
Native vector width char: 4
Native vector width short: 2
Native vector width int: 1
Native vector width long: 1
Native vector width float: 1
Native vector width double: 1
Max clock frequency: 1411Mhz
Address bits: 64
Max memory allocation: 7301444400
Image support: Yes
Max number of images read arguments: 128
Max number of images write arguments: 8
Max image 2D width: 16384
Max image 2D height: 16384
Max image 3D width: 16384
Max image 3D height: 16384
Max image 3D depth: 8192
Max samplers within kernel: 16
Max size of kernel argument: 1024
Alignment (bits) of base address: 1024
Minimum alignment (bytes) for any datatype: 128
Single precision floating point capability
Denorms: No
Quiet NaNs: Yes
Round to nearest even: Yes
Round to zero: Yes
Round to +ve and infinity: Yes
IEEE754-2008 fused multiply-add: Yes
Cache type: Read/Write
Cache line size: 64
Cache size: 16384
Global memory size: 8589934592
Constant buffer size: 7301444400
Max number of constant args: 8
Local memory type: Scratchpad
Local memory size: 65536
Max pipe arguments: 16
Max pipe active reservations: 16
Max pipe packet size: 3006477104
Max global variable size: 7301444400
Max global variable preferred total size: 8589934592
Max read/write image args: 64
Max on device events: 1024
Queue on device max size: 8388608
Max on device queues: 1
Queue on device preferred size: 262144
SVM capabilities:
Coarse grain buffer: Yes
Fine grain buffer: Yes
Fine grain system: No
Atomics: No
Preferred platform atomic alignment: 0
Preferred global atomic alignment: 0
Preferred local atomic alignment: 0
Kernel Preferred work group size multiple: 64
Error correction support: 0
Unified memory for Host and Device: 0
Profiling timer resolution: 1
Device endianess: Little
Available: Yes
Compiler available: Yes
Execution capabilities:
Execute OpenCL kernels: Yes
Execute native function: No
Queue on Host properties:
Out-of-Order: No
Profiling : Yes
Queue on Device properties:
Out-of-Order: Yes
Profiling : Yes
Platform ID: 0x760bb1c05010
Name: gfx803
Vendor: Advanced Micro Devices, Inc.
Device OpenCL C version: OpenCL C 2.0
Driver version: 3590.0 (HSA1.1,LC)
Profile: FULL_PROFILE
Version: OpenCL 1.2
Extensions: cl_khr_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_fp16 cl_khr_gl_sharing cl_amd_device_attribute_query cl_amd_media_ops cl_amd_media_ops2 cl_khr_image2d_from_buffer cl_khr_subgroups cl_khr_depth_images cl_amd_copy_buffer_p2p cl_amd_assembly_program

[c@archb ~]$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Thu_Jun__6_02:18:23_PDT_2024
Cuda compilation tools, release 12.5, V12.5.82
Build cuda_12.5.r12.5/compiler.34385749_0

I followed https://wiki.archlinux.org/title/AMD_Ra … _MI25#ROCm

chboi · 2024-07-14 11:10:33

[c@archb ~]$ python ballertest.py
/home/c/.pyenv/versions/3.11.9/lib/python3.11/site-packages/pyannote/audio/core/io.py:43: UserWarning: torchaudio._backend.set_audio_backend has been deprecated. With dispatcher enabled, this function is no-op. You can remove the function call.
torchaudio.set_audio_backend("soundfile")

Checking ROCM support...
GOOD: ROCM devices found: 2
Checking PyTorch...
GOOD: PyTorch is working fine.
Checking user groups...
GOOD: The user c is in RENDER and VIDEO groups.
GOOD: PyTorch ROCM support found.
Testing PyTorch ROCM support...
Everything fine! You can run PyTorch code inside of:
---> Intel(R) Core(TM) i7-7700K CPU @ 4.20GHz
---> gfx803
CUDA available: True
CUDA device: AMD Radeon RX 580 Series
Error during CUDA operation: HIP error: invalid device function
Compile with `TORCH_USE_HIP_DSA` to enable device-side assertions.

Traceback (most recent call last):
File "/home/c/ballertest.py", line 134, in <module>
main()
File "/home/c/ballertest.py", line 100, in main
model = whisperx.load_model("tiny.en", device, compute_type=compute_type)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/c/.pyenv/versions/3.11.9/lib/python3.11/site-packages/whisperx/asr.py", line 289, in load_model
model = model or WhisperModel(whisper_arch,
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/c/.pyenv/versions/3.11.9/lib/python3.11/site-packages/faster_whisper/transcribe.py", line 133, in __init__
self.model = ctranslate2.models.Whisper(
^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: CUDA failed with error no CUDA-capable device is detected
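"HIP error: invalid device function" is commonly reported when the installed binaries contain no kernels built for the GPU's architecture, which would fit gfx803 here. A rough way to check this, assuming torch.cuda.get_arch_list() lists gfx targets on a ROCm build, is to compare that list with the architecture name rocminfo prints. The helper arch_is_compiled_in is hypothetical.

```python
def arch_is_compiled_in(gpu_arch, compiled_archs):
    """True if gpu_arch (e.g. "gfx803") matches one of the compiled targets.

    Feature suffixes such as ":xnack-" are ignored on both sides.
    """
    base = gpu_arch.split(":")[0]
    return any(arch.split(":")[0] == base for arch in compiled_archs)


if __name__ == "__main__":
    try:
        import torch

        # Assumption: on a ROCm build this returns names like "gfx900".
        compiled = torch.cuda.get_arch_list()
    except ImportError:
        compiled = []
    print("Compiled targets:", compiled)
    print("gfx803 supported by this build:", arch_is_compiled_in("gfx803", compiled))
```

If gfx803 is missing from the list, the build cannot run kernels on the RX 580 even though the device is enumerated.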

import torch
import grp
import pwd
import os
import subprocess
import whisperx
import gc

# Placeholder: set this to your Hugging Face access token
YOUR_HF_TOKEN = ""

# Function to check ROCm support
def check_rocm_support():
    devices = []
    try:
        print("\n\nChecking ROCM support...")
        result = subprocess.run(['rocminfo'], stdout=subprocess.PIPE)
        cmd_str = result.stdout.decode('utf-8')
        cmd_split = cmd_str.split('Agent ')
        for part in cmd_split:
            item_single = part[0:1]
            item_double = part[0:2]
            if item_single.isnumeric() or item_double.isnumeric():
                new_split = cmd_str.split('Agent ' + item_double)
                device = new_split[1].split('Marketing Name:')[0].replace(' Name: ', '').replace('\n', '').replace(' ', '').split('Uuid:')[0].split('*******')[1]
                devices.append(device)
        if len(devices) > 0:
            print('GOOD: ROCM devices found: ', len(devices))
        else:
            print('BAD: No ROCM devices found.')
    except Exception:
        # rocminfo is missing or its output could not be parsed
        print('Cannot find rocminfo command information. Unable to determine if AMDGPU drivers with ROCM support were installed.')
    return devices

# Function to check PyTorch
def check_pytorch():
    print("Checking PyTorch...")
    x = torch.rand(5, 3)
    has_torch = False
    len_x = len(x)
    if len_x == 5:
        has_torch = True
        for i in x:
            if len(i) == 3:
                has_torch = True
            else:
                has_torch = False
    if has_torch:
        print('GOOD: PyTorch is working fine.')
    else:
        print('BAD: PyTorch is NOT working.')
    return has_torch

# Function to check user groups
def check_user_groups():
    print("Checking user groups...")
    user = os.getlogin()
    groups = [g.gr_name for g in grp.getgrall() if user in g.gr_mem]
    gid = pwd.getpwnam(user).pw_gid
    groups.append(grp.getgrgid(gid).gr_name)
    if 'render' in groups and 'video' in groups:
        print('GOOD: The user', user, 'is in RENDER and VIDEO groups.')
    else:
        print('BAD: The user', user, 'is NOT in RENDER and VIDEO groups. This is necessary for PyTorch to use HIP resources')

# Function to check CUDA support
def check_cuda_support():
    print("CUDA available: ", torch.cuda.is_available())
    if torch.cuda.is_available():
        print("CUDA device: ", torch.cuda.get_device_name(0))
        # Test a simple CUDA operation
        try:
            x = torch.tensor([1.0, 2.0, 3.0]).cuda()
            print(x)
        except RuntimeError as e:
            print(f"Error during CUDA operation: {e}")

def main():
    devices = check_rocm_support()
    pytorch_ok = check_pytorch()
    check_user_groups()

    if torch.cuda.is_available():
        print("GOOD: PyTorch ROCM support found.")
        t = torch.tensor([5, 5, 5], dtype=torch.int64, device='cuda')
        print('Testing PyTorch ROCM support...')
        if str(t) == "tensor([5, 5, 5], device='cuda:0')":
            print('Everything fine! You can run PyTorch code inside of: ')
            for device in devices:
                print('---> ', device)
    else:
        print("BAD: PyTorch ROCM support NOT found.")

    check_cuda_support()

    # WhisperX operations
    device = "cuda" if torch.cuda.is_available() else "cpu"
    audio_file = "audio.mp3"
    batch_size = 16 # reduce if low on GPU mem
    compute_type = "float16" # change to "int8" if low on GPU mem (may reduce accuracy)

    # 1. Transcribe with original whisper (batched)
    model = whisperx.load_model("tiny.en", device, compute_type=compute_type)

    # save model to local path (optional)
    # model_dir = "/path/"
    # model = whisperx.load_model("tiny.en", device, compute_type=compute_type, download_root=model_dir)

    audio = whisperx.load_audio(audio_file)
    result = model.transcribe(audio, batch_size=batch_size)
    print(result["segments"]) # before alignment

    # delete model if low on GPU resources
    # del model; gc.collect(); torch.cuda.empty_cache()

    # 2. Align whisper output
    model_a, metadata = whisperx.load_align_model(language_code=result["language"], device=device)
    result = whisperx.align(result["segments"], model_a, metadata, audio, device, return_char_alignments=False)

    print(result["segments"]) # after alignment

    # delete model if low on GPU resources
    # del model_a; gc.collect(); torch.cuda.empty_cache()

    # 3. Assign speaker labels
    diarize_model = whisperx.DiarizationPipeline(use_auth_token=YOUR_HF_TOKEN, device=device)

    # add min/max number of speakers if known
    diarize_segments = diarize_model(audio)
    # diarize_model(audio, min_speakers=min_speakers, max_speakers=max_speakers)

    result = whisperx.assign_word_speakers(diarize_segments, result)
    print(diarize_segments)
    print(result["segments"]) # segments are now assigned speaker IDs

if __name__ == "__main__":
    main()
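check_rocm_support() above recovers device names by repeatedly splitting the rocminfo text, which is hard to follow. As an alternative sketch, the "Agent N" blocks can be read line by line into dictionaries. The function name parse_rocminfo_agents and the choice of fields are illustrative only, and the sample text is trimmed from the rocminfo output posted earlier in this thread.

```python
import re

# Matches the "Agent 1" / "Agent 2" header lines, not "HSA Agents".
AGENT_HEADER = re.compile(r"^\s*Agent\s+(\d+)\s*$")
WANTED_KEYS = ("Name", "Marketing Name", "Device Type")


def parse_rocminfo_agents(text):
    """Return one dict per agent with its Name, Marketing Name and Device Type."""
    agents = []
    current = None
    for line in text.splitlines():
        header = AGENT_HEADER.match(line)
        if header:
            current = {"Agent": int(header.group(1))}
            agents.append(current)
            continue
        if current is None or ":" not in line:
            continue
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        # Keep the first occurrence only, so the later "Name:" line under
        # "ISA Info" does not overwrite the agent name.
        if key in WANTED_KEYS and key not in current:
            current[key] = value
    return agents


SAMPLE = """\
==========
HSA Agents
==========
*******
Agent 1
*******
Name: Intel(R) Core(TM) i7-7700K CPU @ 4.20GHz
Uuid: CPU-XX
Marketing Name: Intel(R) Core(TM) i7-7700K CPU @ 4.20GHz
Device Type: CPU
*******
Agent 2
*******
Name: gfx803
Uuid: GPU-XX
Marketing Name: AMD Radeon RX 580 Series
Device Type: GPU
ISA Info:
ISA 1
Name: amdgcn-amd-amdhsa--gfx803
"""

if __name__ == "__main__":
    for agent in parse_rocminfo_agents(SAMPLE):
        print(agent)
```

Filtering on "Device Type" also avoids counting the CPU agent as a ROCm device, which the "ROCM devices found: 2" line in the log above does.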

chboi · 2024-07-14 13:01:01

Downgrading ROCm to 5.5.1.

chboi · 2024-07-14 13:44:35

Downgrading to 5.4.3, but rocminfo, rocsparse and rocm-ml-libraries have no 5.4.3 equivalents, so it seems they will stay at 5.5.1.
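With some packages at 5.4.3 and others left at 5.5.1, it can help to see at a glance which package sits at which version. A small sketch, assuming the usual "name version-pkgrel" lines that pacman -Q prints; the helper name group_by_version is made up, and the sample input is invented for illustration (the rocm-core entry in particular is hypothetical and not taken from this system).

```python
from collections import defaultdict


def group_by_version(pacman_q_output):
    """Group package names by upstream version from `pacman -Q` style lines."""
    groups = defaultdict(list)
    for line in pacman_q_output.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or unexpected lines
        name, full_version = parts
        upstream = full_version.rsplit("-", 1)[0]  # drop the pkgrel suffix
        upstream = upstream.split(":", 1)[-1]      # drop an epoch prefix, if any
        groups[upstream].append(name)
    return dict(groups)


# Invented sample input, not real output from the poster's machine.
SAMPLE_OUTPUT = """\
rocminfo 5.5.1-1
rocsparse 5.5.1-1
rocm-core 5.4.3-1
"""

if __name__ == "__main__":
    for version, names in group_by_version(SAMPLE_OUTPUT).items():
        print(version, "->", ", ".join(names))
```

More than one version key in the result means the ROCm stack is mixed.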

chboi · 2024-07-14 13:47:51

I had tried following https://wiki.archlinux.org/title/AMD_Ra … _MI25#ROCm

chboi · 2024-07-14 13:50:39

https://github.com/OpenNMT/CTranslate2/issues/1072

ROCm-For-RX580 by woodrex83 - This repository hosts Docker images with ROCm backend support specifically for gfx803 architectures.
rocm-pytorch-gfx803-docker by Firstbober - This repository provides a Docker image based on ROCm/PyTorch with support for gfx803 GPUs.
xuhuisheng/rocm-gfx803 - This repository contains various fixes and patches for making ROCm work on gfx803 GPUs.

chboi · 2024-07-14 13:52:56

https://github.com/tsl0922/pytorch-gfx803
