Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Issue]: no-retry page fault: VM_L2_PROTECTION_FAULT #172

Closed
iratebadger opened this issue Sep 16, 2024 · 1 comment
Closed

[Issue]: no-retry page fault: VM_L2_PROTECTION_FAULT #172

iratebadger opened this issue Sep 16, 2024 · 1 comment

Comments

@iratebadger
Copy link

Problem Description

OS:
NAME="Ubuntu"
VERSION="24.04.1 LTS (Noble Numbat)"
CPU:
model name : AMD EPYC 7402P 24-Core Processor
GPU:
Name: AMD EPYC 7402P 24-Core Processor
Marketing Name: AMD EPYC 7402P 24-Core Processor
Name: gfx908
Marketing Name: AMD Instinct MI100
Name: amdgcn-amd-amdhsa--gfx908:sramecc+:xnack-

I'm getting a similar error to this #147

NOTE: I've committed a few sins here to get this working. Specifically I have edited the following file
/usr/src/amdgpu-6.8.5-2009582.24.04/amd/display/amdgpu_dm/amdgpu_dm_helpers.c
line 563 to be ret = drm_dp_add_payload_part2(mst_mgr, new_payload);
so that it works with the back ported changes from
torvalds/linux@5a507b7

I am running the tests for flash-attention after modifying the setup.py to allow for targeting gfx908 as an expriment.

[ 4778.915423] amdgpu 0000:c6:00.0: amdgpu: [gfxhub0] no-retry page fault (src_id:0 ring:24 vmid:3 pasid:32769)
[ 4778.916081] amdgpu 0000:c6:00.0: amdgpu: for process pt_main_thread pid 12924 thread pt_main_thread pid 12924)
[ 4778.916571] amdgpu 0000:c6:00.0: amdgpu: in page starting at address 0x0000ae8570611000 from IH client 0x1b (UTCL2)
[ 4778.917054] amdgpu 0000:c6:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00301031
[ 4778.917532] amdgpu 0000:c6:00.0: amdgpu: Faulty UTCL2 client ID: TCP (0x8)
[ 4778.918012] amdgpu 0000:c6:00.0: amdgpu: MORE_FAULTS: 0x1
[ 4778.918491] amdgpu 0000:c6:00.0: amdgpu: WALKER_ERROR: 0x0
[ 4778.918961] amdgpu 0000:c6:00.0: amdgpu: PERMISSION_FAULTS: 0x3
[ 4778.919424] amdgpu 0000:c6:00.0: amdgpu: MAPPING_ERROR: 0x0
[ 4778.919879] amdgpu 0000:c6:00.0: amdgpu: RW: 0x0
[ 4778.920325] amdgpu 0000:c6:00.0: amdgpu: [gfxhub0] no-retry page fault (src_id:0 ring:24 vmid:3 pasid:32769)
[ 4778.920781] amdgpu 0000:c6:00.0: amdgpu: for process pt_main_thread pid 12924 thread pt_main_thread pid 12924)
[ 4778.921242] amdgpu 0000:c6:00.0: amdgpu: in page starting at address 0x0000ae8570612000 from IH client 0x1b (UTCL2)
[ 4778.921713] amdgpu 0000:c6:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
[ 4778.922182] amdgpu 0000:c6:00.0: amdgpu: Faulty UTCL2 client ID: CB (0x0)
[ 4778.922656] amdgpu 0000:c6:00.0: amdgpu: MORE_FAULTS: 0x0
[ 4778.923125] amdgpu 0000:c6:00.0: amdgpu: WALKER_ERROR: 0x0
[ 4778.923594] amdgpu 0000:c6:00.0: amdgpu: PERMISSION_FAULTS: 0x0
[ 4778.924057] amdgpu 0000:c6:00.0: amdgpu: MAPPING_ERROR: 0x0
[ 4778.924513] amdgpu 0000:c6:00.0: amdgpu: RW: 0x0
[ 4778.924959] amdgpu 0000:c6:00.0: amdgpu: [gfxhub0] no-retry page fault (src_id:0 ring:24 vmid:3 pasid:32769)
[ 4778.925416] amdgpu 0000:c6:00.0: amdgpu: for process pt_main_thread pid 12924 thread pt_main_thread pid 12924)
[ 4778.925877] amdgpu 0000:c6:00.0: amdgpu: in page starting at address 0x0000ae8570613000 from IH client 0x1b (UTCL2)
[ 4778.926426] amdgpu 0000:c6:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
[ 4778.926895] amdgpu 0000:c6:00.0: amdgpu: Faulty UTCL2 client ID: CB (0x0)
[ 4778.927366] amdgpu 0000:c6:00.0: amdgpu: MORE_FAULTS: 0x0
[ 4778.927836] amdgpu 0000:c6:00.0: amdgpu: WALKER_ERROR: 0x0
[ 4778.928305] amdgpu 0000:c6:00.0: amdgpu: PERMISSION_FAULTS: 0x0
[ 4778.928768] amdgpu 0000:c6:00.0: amdgpu: MAPPING_ERROR: 0x0
[ 4778.929224] amdgpu 0000:c6:00.0: amdgpu: RW: 0x0
[ 4778.929670] amdgpu 0000:c6:00.0: amdgpu: [gfxhub0] no-retry page fault (src_id:0 ring:24 vmid:3 pasid:32769)
[ 4778.930126] amdgpu 0000:c6:00.0: amdgpu: for process pt_main_thread pid 12924 thread pt_main_thread pid 12924)
[ 4778.930588] amdgpu 0000:c6:00.0: amdgpu: in page starting at address 0x0000ae85705a8000 from IH client 0x1b (UTCL2)
[ 4778.931135] amdgpu 0000:c6:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
[ 4778.931604] amdgpu 0000:c6:00.0: amdgpu: Faulty UTCL2 client ID: CB (0x0)
[ 4778.932079] amdgpu 0000:c6:00.0: amdgpu: MORE_FAULTS: 0x0
[ 4778.932549] amdgpu 0000:c6:00.0: amdgpu: WALKER_ERROR: 0x0
[ 4778.933020] amdgpu 0000:c6:00.0: amdgpu: PERMISSION_FAULTS: 0x0
[ 4778.933484] amdgpu 0000:c6:00.0: amdgpu: MAPPING_ERROR: 0x0
[ 4778.933940] amdgpu 0000:c6:00.0: amdgpu: RW: 0x0
[ 4778.934389] amdgpu 0000:c6:00.0: amdgpu: [gfxhub0] no-retry page fault (src_id:0 ring:24 vmid:3 pasid:32769)
[ 4778.934845] amdgpu 0000:c6:00.0: amdgpu: for process pt_main_thread pid 12924 thread pt_main_thread pid 12924)
[ 4778.935306] amdgpu 0000:c6:00.0: amdgpu: in page starting at address 0x0000ae85705a9000 from IH client 0x1b (UTCL2)
[ 4778.935855] amdgpu 0000:c6:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
[ 4778.936325] amdgpu 0000:c6:00.0: amdgpu: Faulty UTCL2 client ID: CB (0x0)
[ 4778.936797] amdgpu 0000:c6:00.0: amdgpu: MORE_FAULTS: 0x0
[ 4778.937266] amdgpu 0000:c6:00.0: amdgpu: WALKER_ERROR: 0x0
[ 4778.937737] amdgpu 0000:c6:00.0: amdgpu: PERMISSION_FAULTS: 0x0
[ 4778.938199] amdgpu 0000:c6:00.0: amdgpu: MAPPING_ERROR: 0x0
[ 4778.938653] amdgpu 0000:c6:00.0: amdgpu: RW: 0x0
[ 4778.939098] amdgpu 0000:c6:00.0: amdgpu: [gfxhub0] no-retry page fault (src_id:0 ring:24 vmid:3 pasid:32769)
[ 4778.939552] amdgpu 0000:c6:00.0: amdgpu: for process pt_main_thread pid 12924 thread pt_main_thread pid 12924)
[ 4778.940011] amdgpu 0000:c6:00.0: amdgpu: in page starting at address 0x0000ae85705a8000 from IH client 0x1b (UTCL2)
[ 4778.940555] amdgpu 0000:c6:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
[ 4778.941023] amdgpu 0000:c6:00.0: amdgpu: Faulty UTCL2 client ID: CB (0x0)
[ 4778.941492] amdgpu 0000:c6:00.0: amdgpu: MORE_FAULTS: 0x0
[ 4778.941961] amdgpu 0000:c6:00.0: amdgpu: WALKER_ERROR: 0x0
[ 4778.942430] amdgpu 0000:c6:00.0: amdgpu: PERMISSION_FAULTS: 0x0
[ 4778.942891] amdgpu 0000:c6:00.0: amdgpu: MAPPING_ERROR: 0x0
[ 4778.943346] amdgpu 0000:c6:00.0: amdgpu: RW: 0x0
[ 4778.943795] amdgpu 0000:c6:00.0: amdgpu: [gfxhub0] no-retry page fault (src_id:0 ring:24 vmid:3 pasid:32769)
[ 4778.944254] amdgpu 0000:c6:00.0: amdgpu: for process pt_main_thread pid 12924 thread pt_main_thread pid 12924)
[ 4778.944717] amdgpu 0000:c6:00.0: amdgpu: in page starting at address 0x0000ae85705aa000 from IH client 0x1b (UTCL2)
[ 4778.945265] amdgpu 0000:c6:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
[ 4778.945738] amdgpu 0000:c6:00.0: amdgpu: Faulty UTCL2 client ID: CB (0x0)
[ 4778.946211] amdgpu 0000:c6:00.0: amdgpu: MORE_FAULTS: 0x0
[ 4778.946683] amdgpu 0000:c6:00.0: amdgpu: WALKER_ERROR: 0x0
[ 4778.947156] amdgpu 0000:c6:00.0: amdgpu: PERMISSION_FAULTS: 0x0
[ 4778.947617] amdgpu 0000:c6:00.0: amdgpu: MAPPING_ERROR: 0x0
[ 4778.948071] amdgpu 0000:c6:00.0: amdgpu: RW: 0x0
[ 4778.948515] amdgpu 0000:c6:00.0: amdgpu: [gfxhub0] no-retry page fault (src_id:0 ring:24 vmid:3 pasid:32769)
[ 4778.948969] amdgpu 0000:c6:00.0: amdgpu: for process pt_main_thread pid 12924 thread pt_main_thread pid 12924)
[ 4778.949429] amdgpu 0000:c6:00.0: amdgpu: in page starting at address 0x0000ae85705a9000 from IH client 0x1b (UTCL2)
[ 4778.949976] amdgpu 0000:c6:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
[ 4778.950443] amdgpu 0000:c6:00.0: amdgpu: Faulty UTCL2 client ID: CB (0x0)
[ 4778.950913] amdgpu 0000:c6:00.0: amdgpu: MORE_FAULTS: 0x0
[ 4778.951382] amdgpu 0000:c6:00.0: amdgpu: WALKER_ERROR: 0x0
[ 4778.951850] amdgpu 0000:c6:00.0: amdgpu: PERMISSION_FAULTS: 0x0
[ 4778.952311] amdgpu 0000:c6:00.0: amdgpu: MAPPING_ERROR: 0x0
[ 4778.952764] amdgpu 0000:c6:00.0: amdgpu: RW: 0x0
[ 4778.953209] amdgpu 0000:c6:00.0: amdgpu: [gfxhub0] no-retry page fault (src_id:0 ring:24 vmid:3 pasid:32769)
[ 4778.953663] amdgpu 0000:c6:00.0: amdgpu: for process pt_main_thread pid 12924 thread pt_main_thread pid 12924)
[ 4778.954122] amdgpu 0000:c6:00.0: amdgpu: in page starting at address 0x0000ae85705aa000 from IH client 0x1b (UTCL2)
[ 4778.954665] amdgpu 0000:c6:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
[ 4778.955132] amdgpu 0000:c6:00.0: amdgpu: Faulty UTCL2 client ID: CB (0x0)
[ 4778.955601] amdgpu 0000:c6:00.0: amdgpu: MORE_FAULTS: 0x0
[ 4778.956069] amdgpu 0000:c6:00.0: amdgpu: WALKER_ERROR: 0x0
[ 4778.956537] amdgpu 0000:c6:00.0: amdgpu: PERMISSION_FAULTS: 0x0
[ 4778.956998] amdgpu 0000:c6:00.0: amdgpu: MAPPING_ERROR: 0x0
[ 4778.957452] amdgpu 0000:c6:00.0: amdgpu: RW: 0x0
[ 4778.957896] amdgpu 0000:c6:00.0: amdgpu: [gfxhub0] no-retry page fault (src_id:0 ring:24 vmid:3 pasid:32769)
[ 4778.958350] amdgpu 0000:c6:00.0: amdgpu: for process pt_main_thread pid 12924 thread pt_main_thread pid 12924)
[ 4778.958810] amdgpu 0000:c6:00.0: amdgpu: in page starting at address 0x0000ae8570678000 from IH client 0x1b (UTCL2)
[ 4778.959362] amdgpu 0000:c6:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
[ 4778.959829] amdgpu 0000:c6:00.0: amdgpu: Faulty UTCL2 client ID: CB (0x0)
[ 4778.960298] amdgpu 0000:c6:00.0: amdgpu: MORE_FAULTS: 0x0
[ 4778.960767] amdgpu 0000:c6:00.0: amdgpu: WALKER_ERROR: 0x0
[ 4778.961236] amdgpu 0000:c6:00.0: amdgpu: PERMISSION_FAULTS: 0x0
[ 4778.961697] amdgpu 0000:c6:00.0: amdgpu: MAPPING_ERROR: 0x0
[ 4778.962150] amdgpu 0000:c6:00.0: amdgpu: RW: 0x0

Operating System

Ubuntu 24.04.1

CPU

AMD EPYC 7402P

GPU

AMD Instinct MI100

ROCm Version

ROCm 6.2.0

ROCm Component

ROCK-Kernel-Driver

Steps to Reproduce

Install the dkms driver and rocm packages on ubuntu 24.04. DKMS will fail, edit the source as described in the problem description and rerun apt install amdgpu-dkms it will now install.

Acquire latest flash-attention
https://github.com/iratebadger/flash-attention.git

Modify setup.py line 124 allowed_archs to include "gfx908" such that it is now
allowed_archs = ["native", "gfx908", "gfx90a", "gfx940", "gfx941", "gfx942"]

pull the rocm/pytorch:latest

from the flash-attention folder
podman run --rm -it --device /dev/kfd --device /dev/dri --group-add keep-groups --security-opt seccomp=unconfined -v ./:/source rocm/pytorch:latest bash

GPU_ARCHS=gfx908 python setup.py bdist_wheel

pip install dist/flash_attn-2.6.3-cp39-cp39-linux_x86_64.whl

pytest tests/test_flash_attn_ck.py

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

ROCk module version 6.8.5 is loaded

HSA System Attributes

Runtime Version: 1.14
Runtime Ext Version: 1.6
System Timestamp Freq.: 1000.000000MHz
Sig. Max Wait Duration: 18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model: LARGE
System Endianness: LITTLE
Mwaitx: DISABLED
DMAbuf Support: YES

==========
HSA Agents


Agent 1


Name: AMD EPYC 7402P 24-Core Processor
Uuid: CPU-XX
Marketing Name: AMD EPYC 7402P 24-Core Processor
Vendor Name: CPU
Feature: None specified
Profile: FULL_PROFILE
Float Round Mode: NEAR
Max Queue Number: 0(0x0)
Queue Min Size: 0(0x0)
Queue Max Size: 0(0x0)
Queue Type: MULTI
Node: 0
Device Type: CPU
Cache Info:
L1: 32768(0x8000) KB
Chip ID: 0(0x0)
ASIC Revision: 0(0x0)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 2800
BDFID: 0
Internal Node ID: 0
Compute Unit: 24
SIMDs per CU: 0
Shader Engines: 0
Shader Arrs. per Eng.: 0
WatchPts on Addr. Ranges:1
Memory Properties:
Features: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: FINE GRAINED
Size: 131707928(0x7d9b418) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 2
Segment: GLOBAL; FLAGS: KERNARG, FINE GRAINED
Size: 131707928(0x7d9b418) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 3
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 131707928(0x7d9b418) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
ISA Info:


Agent 2


Name: gfx908
Uuid: GPU-fdc9fe5eaa557cf3
Marketing Name: AMD Instinct MI100
Vendor Name: AMD
Feature: KERNEL_DISPATCH
Profile: BASE_PROFILE
Float Round Mode: NEAR
Max Queue Number: 128(0x80)
Queue Min Size: 64(0x40)
Queue Max Size: 131072(0x20000)
Queue Type: MULTI
Node: 1
Device Type: GPU
Cache Info:
L1: 16(0x10) KB
L2: 8192(0x2000) KB
Chip ID: 29580(0x738c)
ASIC Revision: 2(0x2)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 1502
BDFID: 50688
Internal Node ID: 1
Compute Unit: 120
SIMDs per CU: 4
Shader Engines: 8
Shader Arrs. per Eng.: 1
WatchPts on Addr. Ranges:4
Coherent Host Access: FALSE
Memory Properties:
Features: KERNEL_DISPATCH
Fast F16 Operation: TRUE
Wavefront Size: 64(0x40)
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Max Waves Per CU: 40(0x28)
Max Work-item Per CU: 2560(0xa00)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
Max fbarriers/Workgrp: 32
Packet Processor uCode:: 67
SDMA engine uCode:: 18
IOMMU Support:: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 33538048(0x1ffc000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:2048KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 2
Segment: GLOBAL; FLAGS: EXTENDED FINE GRAINED
Size: 33538048(0x1ffc000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:2048KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 3
Segment: GROUP
Size: 64(0x40) KB
Allocatable: FALSE
Alloc Granule: 0KB
Alloc Recommended Granule:0KB
Alloc Alignment: 0KB
Accessible by all: FALSE
ISA Info:
ISA 1
Name: amdgcn-amd-amdhsa--gfx908:sramecc+:xnack-
Machine Models: HSA_MACHINE_MODEL_LARGE
Profiles: HSA_PROFILE_BASE
Default Rounding Mode: NEAR
Default Rounding Mode: NEAR
Fast f16: TRUE
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
FBarrier Max Size: 32
*** Done ***

Additional Information

No response

@kentrussell
Copy link
Contributor

So the official stance is that only MI200 and MI300 are supported for flash-attention:

_AMD ROCm Support
ROCm version uses composable_kernel as the backend. It provides the implementation of FlashAttention-2.

Requirements:

ROCm 6.0 and above.
We recommend the Pytorch container from ROCm, which has all the required tools to install FlashAttention.

FlashAttention-2 with ROCm currently supports:

MI200 or MI300 GPUs.
Datatype fp16 and bf16
Forward's head dimensions up to 256. Backward head dimensions up to 128._
-https://github.com/ROCm/flash-attention

I'll close this one off, but you should open the same issue over on the flash-attention repo. The VM faults you're seeing seem like application-level faults. The addresses are valid, but they're unmapped (hence the error), so the fix should be coming from the flash-attention crew. I wish github had an easier way to move issues from one repo to another, but closing this and opening a new one at https://github.com/ROCm/flash-attention/issues/new/choose is the best way to get attention on the issue. Good luck!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants