(ppc64el) no-retry page fault: VM_L2_PROTECTION_FAULT #147

Open
tucnak opened this issue May 30, 2023 · 1 comment


tucnak commented May 30, 2023

-- on every rocminfo call
amdgpu: update_gpuvm_pte() failed
amdgpu: SG Table of BO is UNEXPECTEDLY NULL
amdgpu: Failed to map bo to gpuvm
amdgpu 0000:03:00.0: amdgpu: Failed to map peer:0000:03:00.0 mem_domain:

-- occurs in hipblas
amdgpu: init_user_pages: Failed to get user pages: -1
amdgpu 0000:03:00.0: amdgpu: [gfxhub0] no-retry page fault (src_id:0 ring:24 vmid:8 pasid:32768, for process main pid 813201 thread main pid 813201)
amdgpu 0000:03:00.0: amdgpu:   in page starting at address 0x000079c2a1fff000 from IH client 0x1b (UTCL2)
amdgpu 0000:03:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00801031
amdgpu 0000:03:00.0: amdgpu: 	 Faulty UTCL2 client ID: TCP (0x8)
amdgpu 0000:03:00.0: amdgpu: 	 MORE_FAULTS: 0x1
amdgpu 0000:03:00.0: amdgpu: 	 WALKER_ERROR: 0x0
amdgpu 0000:03:00.0: amdgpu: 	 PERMISSION_FAULTS: 0x3
amdgpu 0000:03:00.0: amdgpu: 	 MAPPING_ERROR: 0x0
amdgpu 0000:03:00.0: amdgpu: 	 RW: 0x0
amdgpu 0000:03:00.0: amdgpu: [gfxhub0] no-retry page fault (src_id:0 ring:24 vmid:8 pasid:32768, for process main pid 813201 thread main pid 813201)
amdgpu 0000:03:00.0: amdgpu:   in page starting at address 0x000079c2a1ffa000 from IH client 0x1b (UTCL2)
amdgpu 0000:03:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
amdgpu 0000:03:00.0: amdgpu: 	 Faulty UTCL2 client ID: CB (0x0)
amdgpu 0000:03:00.0: amdgpu: 	 MORE_FAULTS: 0x0
amdgpu 0000:03:00.0: amdgpu: 	 WALKER_ERROR: 0x0
amdgpu 0000:03:00.0: amdgpu: 	 PERMISSION_FAULTS: 0x0
amdgpu 0000:03:00.0: amdgpu: 	 MAPPING_ERROR: 0x0
amdgpu 0000:03:00.0: amdgpu: 	 RW: 0x0
amdgpu 0000:03:00.0: amdgpu: [gfxhub0] no-retry page fault (src_id:0 ring:24 vmid:8 pasid:32768, for process main pid 813201 thread main pid 813201)
amdgpu 0000:03:00.0: amdgpu:   in page starting at address 0x000079c2a1ff5000 from IH client 0x1b (UTCL2)
amdgpu 0000:03:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
amdgpu 0000:03:00.0: amdgpu: 	 Faulty UTCL2 client ID: CB (0x0)
amdgpu 0000:03:00.0: amdgpu: 	 MORE_FAULTS: 0x0
amdgpu 0000:03:00.0: amdgpu: 	 WALKER_ERROR: 0x0
amdgpu 0000:03:00.0: amdgpu: 	 PERMISSION_FAULTS: 0x0
amdgpu 0000:03:00.0: amdgpu: 	 MAPPING_ERROR: 0x0
amdgpu 0000:03:00.0: amdgpu: 	 RW: 0x0
amdgpu: HIQ MQD's queue_doorbell_id0 is not 0, Queue preemption time out
amdgpu: sq_intr: error, se 1, data 0x0, sh 0, priv 1, wave_id 0, simd_id 3, cu_id 0, err_type 2
amdgpu: sq_intr: error, se 1, data 0x0, sh 0, priv 1, wave_id 0, simd_id 0, cu_id 0, err_type 2
amdgpu: sq_intr: error, se 1, data 0x0, sh 0, priv 1, wave_id 0, simd_id 2, cu_id 0, err_type 2
amdgpu: HIQ MQD's queue_doorbell_id0 is not 0, Queue preemption time out
amdgpu: Resetting wave fronts (cpsch) on dev 00000000fa7830ec
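
For anyone reading the fault records above by hand, here is a minimal sketch of how a raw VM_L2_PROTECTION_FAULT_STATUS word can be split into the fields the driver prints. The bit layout in `FIELDS` is an assumption (my recollection of the gfx9-class register layout, not something stated in this thread), and the names `decode_fault_status` and `decode_log` are hypothetical helpers for illustration only.

```python
import re

# ASSUMED bit layout of VM_L2_PROTECTION_FAULT_STATUS on gfx9-class parts:
# (field name, lowest bit, width in bits). This is not taken from the thread;
# check it against the register headers for your ASIC before relying on it.
FIELDS = [
    ("MORE_FAULTS", 0, 1),
    ("WALKER_ERROR", 1, 3),
    ("PERMISSION_FAULTS", 4, 4),
    ("MAPPING_ERROR", 8, 1),
    ("CID", 9, 9),
    ("RW", 18, 1),
    ("VMID", 20, 4),
]

# Matches the status line exactly as it appears in the kernel log above.
STATUS_RE = re.compile(r"VM_L2_PROTECTION_FAULT_STATUS:\s*(0x[0-9a-fA-F]+)")


def decode_fault_status(status):
    """Split a raw status word into named fields using the assumed layout."""
    return {
        name: (status >> shift) & ((1 << width) - 1)
        for name, shift, width in FIELDS
    }


def decode_log(text):
    """Decode every status word found in a chunk of kernel log text."""
    return [
        decode_fault_status(int(match.group(1), 16))
        for match in STATUS_RE.finditer(text)
    ]


if __name__ == "__main__":
    sample = "amdgpu 0000:03:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00801031"
    for fields in decode_log(sample):
        for name, value in fields.items():
            print(f"{name}: {value:#x}")
```

Under that assumed layout, 0x00801031 decodes to the same values the driver printed for the first fault (client ID 0x8, PERMISSION_FAULTS 0x3, MORE_FAULTS 0x1) and to VMID 8, which matches `vmid:8` in the fault line. The two later faults report a status of 0x00000000, so every field decodes to zero.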

However, I'm using the amdgpu driver that came with the 6.3.4 kernel and hipBLAS from ROCm 5.3.2. Does this mean that I would have to build the kernel from this repository, and how likely is it that doing so would help?

ROCk module is loaded
=====================
HSA System Attributes
=====================
Runtime Version:         1.1
System Timestamp Freq.:  1000.000000MHz
Sig. Max Wait Duration:  18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model:           LARGE
System Endianness:       LITTLE

==========
HSA Agents
==========
*******
Agent 1
*******
  Name:                    POWER9
  Uuid:                    CPU-XX
  Marketing Name:          POWER9
  Vendor Name:             CPU
  Feature:                 None specified
  Profile:                 FULL_PROFILE
  Float Round Mode:        NEAR
  Max Queue Number:        0(0x0)
  Queue Min Size:          0(0x0)
  Queue Max Size:          0(0x0)
  Queue Type:              MULTI
  Node:                    0
  Device Type:             CPU
  Cache Info:
    L1:                      32768(0x8000) KB
  Chip ID:                 0(0x0)
  Cacheline Size:          64(0x40)
  Max Clock Freq. (MHz):   3800
  BDFID:                   0
  Internal Node ID:        0
  Compute Unit:            32
  SIMDs per CU:            0
  Shader Engines:          0
  Shader Arrs. per Eng.:   0
  WatchPts on Addr. Ranges:1
  Features:                None
  Pool Info:
    Pool 1
      Segment:                 GLOBAL; FLAGS: FINE GRAINED
      Size:                    64391040(0x3d68780) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Alignment:         4KB
      Accessible by all:       TRUE
    Pool 2
      Segment:                 GLOBAL; FLAGS: KERNARG, FINE GRAINED
      Size:                    64391040(0x3d68780) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Alignment:         4KB
      Accessible by all:       TRUE
    Pool 3
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED
      Size:                    64391040(0x3d68780) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Alignment:         4KB
      Accessible by all:       TRUE
  ISA Info:
*******
Agent 2
*******
  Name:                    gfx906
  Uuid:                    GPU-bc4261817337ecd7
  Marketing Name:          AMD Radeon Graphics
  Vendor Name:             AMD
  Feature:                 KERNEL_DISPATCH
  Profile:                 BASE_PROFILE
  Float Round Mode:        NEAR
  Max Queue Number:        128(0x80)
  Queue Min Size:          64(0x40)
  Queue Max Size:          131072(0x20000)
  Queue Type:              MULTI
  Node:                    1
  Device Type:             GPU
  Cache Info:
    L1:                      16(0x10) KB
    L2:                      8192(0x2000) KB
  Chip ID:                 26273(0x66a1)
  Cacheline Size:          64(0x40)
  Max Clock Freq. (MHz):   1725
  BDFID:                   768
  Internal Node ID:        1
  Compute Unit:            60
  SIMDs per CU:            4
  Shader Engines:          4
  Shader Arrs. per Eng.:   1
  WatchPts on Addr. Ranges:4
  Features:                KERNEL_DISPATCH
  Fast F16 Operation:      TRUE
  Wavefront Size:          64(0x40)
  Workgroup Max Size:      1024(0x400)
  Workgroup Max Size per Dimension:
    x                        1024(0x400)
    y                        1024(0x400)
    z                        1024(0x400)
  Max Waves Per CU:        40(0x28)
  Max Work-item Per CU:    2560(0xa00)
  Grid Max Size:           4294967295(0xffffffff)
  Grid Max Size per Dimension:
    x                        4294967295(0xffffffff)
    y                        4294967295(0xffffffff)
    z                        4294967295(0xffffffff)
  Max fbarriers/Workgrp:   32
  Pool Info:
    Pool 1
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED
      Size:                    33538048(0x1ffc000) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Alignment:         4KB
      Accessible by all:       FALSE
    Pool 2
      Segment:                 GROUP
      Size:                    64(0x40) KB
      Allocatable:             FALSE
      Alloc Granule:           0KB
      Alloc Alignment:         0KB
      Accessible by all:       FALSE
  ISA Info:
    ISA 1
      Name:                    amdgcn-amd-amdhsa--gfx906:sramecc+:xnack-
      Machine Models:          HSA_MACHINE_MODEL_LARGE
      Profiles:                HSA_PROFILE_BASE
      Default Rounding Mode:   NEAR
      Default Rounding Mode:   NEAR
      Fast f16:                TRUE
      Workgroup Max Size:      1024(0x400)
      Workgroup Max Size per Dimension:
        x                        1024(0x400)
        y                        1024(0x400)
        z                        1024(0x400)
      Grid Max Size:           4294967295(0xffffffff)
      Grid Max Size per Dimension:
        x                        4294967295(0xffffffff)
        y                        4294967295(0xffffffff)
        z                        4294967295(0xffffffff)
      FBarrier Max Size:       32
*** Done ***
@ppanchad-amd

@tucnak Apologies for the lack of response. Can you please check whether your issue still exists with the latest ROCm 6.2? If not, please close the ticket. Thanks!
