KVM Support on proxmox #100
Looks like your system does not support PCIe atomics. This is part of the PCIe 3 standard and is required for KFD to work on some GPUs, including Polaris. |
I checked on a fresh Ubuntu 18.04.4, rocm installs properly and cards are detected by
|
In your reference to ROCm's issue #26, the
The rationale is similar to ROCm's issue #26: the KVM guest cannot correctly recognize PCIe devices and always regards them as ordinary PCI devices that After you build your Above steps work for me, but the |
I am also seeing this, with KVM 5.2 running on Debian bullseye, kernel 5.10, guest Ubuntu kernel 5.8. ROCm and atomics work fine on the host system. In the guest with vfio-pci passthrough, I have tried the in-tree amdgpu module and the dkms package with ROCm 4.1 and 4.2, and they all fail with the kfd "PCI rejects atomics" error. Is there an older kernel version without this issue that I should try? Is someone working on a fix? |
Here are more details. On the host, both the PCIe root port and the AMD VGA device have AtomicOpsCap:
and on the guest, only AMD VGA has AtomicOpsCap:
|
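The host/guest comparison described above can be repeated with a small filter over `lspci -vvv` output. This is only a sketch: the function name `list_atomic_devices` is made up for illustration, and it assumes that `lspci` prints device header lines unindented and reports the capability with the literal label `AtomicOpsCap:`, as quoted in this thread.

```shell
# list_atomic_devices: read `lspci -vvv` output on stdin and print one line
# per device that reports an AtomicOpsCap field, together with its flags.
# Hypothetical helper, not part of ROCm or pciutils.
list_atomic_devices() {
    awk '
        # An unindented line starting with a hex digit begins a new device block.
        /^[0-9a-fA-F]/ { dev = $0; next }
        {
            # Print from the AtomicOpsCap label to the end of the line.
            i = index($0, "AtomicOpsCap:")
            if (i > 0) print dev " -> " substr($0, i)
        }
    '
}

# Usage (as root, since lspci hides capabilities from unprivileged users):
#   lspci -vvv | list_atomic_devices
```

Running it once on the host and once in the guest makes it easy to see whether the root port or bridge above the GPU is missing from the guest's list.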
If I try the workaround suggested by @GongYiLiao from #26 (patch applied to dkms source, set pci on host after vm boot, load amdgpu), I get a firmware load error:
This is with the rock-dkms package from ROCm 4.1 and HWE kernel 5.8 in the Ubuntu guest. I will try with the ROCm 4.2 dkms package as well. |
I was able to get the module to load with kfd atomics support, after a reboot, with the patch applied but NOT running the setpci on the host. However, when I try to run any ROCm code, it hangs in From strace:
So at this point I have no way to use an AMD GPU from KVM; the proposed workarounds also fail to work. Is this related to the PCIe bridge inside my KVM guest not supporting atomics? @GongYiLiao, when you got it working, did lspci show atomics supported by the PCIe bridge? |
Sorry for the noise - I now have it working completely, using setpci on the host. I guess when I was trying different things, the GPU got in a bad state, and rebooting the host fixed it. It would be great if this worked out of the box, without a hack involving patching the kernel module and running a command on host between boot and module load. |
How to cope with the ROCm crippleware
from https://en.wikipedia.org/wiki/Crippleware
If you want to use an AMD device in a KVM VM, you need to dodge the traps AMD devs deliberately built in. You need to:
=> profit. rocminfo:
Of course this is just a dirty quick hack, a workaround. It nevertheless works and transforms the hardware you bought from AMD for good money from a piece of worthless junk into something usable at last. This comment is here to spare others from spending 3 days full-time working this out, and from the unbearable pain of reading comments like "It's not the fault of ROCm". |
@rico666 : I don't think the term "crippleware" applies to open-source software. If there are traps, there is nothing deliberate about them. If we had infinite time, we'd support every feature anyone could ever wish for. We don't have infinite time. But we certainly don't waste the limited time we do have to lay elaborate traps to frustrate our users. |
You might want to read the linked WP article in its entirety.
However, I admit that my comment must be seen in light of 3 lost days full-bridge-rectifying a situation that was messed up by the devs of amdgpu without any need. You see, ROCm in the year 2017 never had this PCIe atomics problem, at least not with this hardware. Whoever coded that stuff in made it worse. I call that crippleware. I could now assemble the statements from several issues here on GitHub, some of them closed already, about how this can't be solved, how AMDGPU can't set this or that bit (yeah, but root can?) and how it would hurt performance otherwise. If I were the responsible lead developer (and team: pray I won't take that job), then the design question would be "What is more performance? 0% or 20% of the potential 100%?" => Make it work first and always, tweak later. You know, for someone who sees and treats OpenCL as the unwanted child (Nvidia), their stuff "just works". If what AMD delivers in the OpenCL area is what an "OpenCL protagonist" should deliver, then good night. |
I personally don't have any evidence suggesting that AMD intentionally "cripples" GFX8 customers to force them to buy a newer generation of Radeons. Most frustrated AMD customers may just vote with their feet when the GPU market eventually moves back to normal, assuming cryptocurrency mining eventually becomes unprofitable unless using customized FPGAs. When that time comes, AMD will take the hit. To me, the more severe problems of the ROCm platform, regardless of whether it runs on bare metal or in a virtual machine, are
Above are my own rants on how AMD's approach to Linux customers leads to some understandable anger. Let's circle back to the issue of using a Radeon for computation on a KVM guest. I don't think AMD will solve the PCIe atomics issue (if AMD deems this an issue at all) any time soon, as gfx8 is already a semi-deprecated product. The only solution I can come up with in the near future is the dirty workaround I found by running `dmesg | grep kfd` and searching for where the error message originates in the kernel module source code. Therefore, the following approach may be suitable for the long term, on :
|
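The `dmesg | grep kfd` check mentioned above can be wrapped in a small test for use inside the guest. This is a minimal sketch: the function name is hypothetical, and the only string it relies on is the "PCI rejects atomics" wording quoted earlier in this thread; the rest of the kernel message may differ between kernel versions.

```shell
# kfd_atomics_rejected: read kernel log text on stdin (for example from
# `dmesg`) and succeed if it contains the "PCI rejects atomics" message.
# Hypothetical helper; the matched wording is taken from this thread.
kfd_atomics_rejected() {
    grep -q 'PCI rejects atomics'
}

# Typical use inside the guest after loading amdgpu:
#   dmesg | grep kfd        # show all KFD messages
#   if dmesg | kfd_atomics_rejected; then
#       echo "KFD skipped the GPU because of PCIe atomics"
#   fi
```

A non-match does not prove that compute works; it only shows that this particular rejection was not logged.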
I have spent days finding out why some OpenCL programs get errors in a Linux VM. The answer is: ROCm sucks. |
The last reason I'm still trying to run ML on Radeon is that I believe Radeon hardware has more performance for these tasks than Nvidia. But the totally unusable software ecosystem makes it quite useless... |
@ROCmSupport, Does AMD plan to support running ROCm in a VM? It would be useful to be able to do this. |
Hi,
TLDR : I have a small proxmox server at home and planned on using ROCm for some deep learning tasks I have. I have followed all the different information I could find but still cannot use ROCm in a VM (KVM).
The server contains a Supermicro X10SDV-6C-TLN4F (Intel Xeon D-1528) and a Radeon Pro Duo Polaris.
If I understood the documentation correctly, a Broadwell Xeon v4 CPU should work and so should Polaris 10 cards.
The PCI tree from `lspci -tv` is:
The many steps in the tree are Broadcom PLX switches, which should also support PCIe atomics. I believe they are directly on the GPU card, and that they might actually be a single chip; I am not exactly sure how to interpret deep trees in lspci.
I first did some research and found the Virtualisation & Containers page in the documentation and the now-closed issue #26. Both explain how to set up for KVM; I followed the Ubuntu 16.04 instructions, given that I am using Proxmox, which is Debian-based.
Having had no success with ROCm, I tested OpenCL (AMDGPU-PRO) and games (Mesa) in other VMs, and both work as expected, so the issue should not be with the PCIe passthrough setup or the IOMMU, but with PCIe atomics.
Following the steps in the issue, I set the necessary bits (`setpci -v -d *:67c4 80.b=40`) after starting the VM and then loaded amdgpu, but I cannot get ROCm to run. The steps in the documentation gave the same results. I always get:
Is there anything I have missed about KVM support?
Thanks in advance for your help.
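For readers wondering what the `setpci ... 80.b=40` step writes, here is a hedged sketch. It assumes (this is not stated anywhere in the thread) that on this GPU config offset 0x80 is the PCIe Device Control 2 register, in which bit 6 (0x40) is "AtomicOp Requester Enable". The helper name is made up for illustration; it only decodes a byte read back with `setpci` and does not touch hardware.

```shell
# Assumption: offset 0x80 is Device Control 2 on this device, and bit 6
# (0x40) of that register is "AtomicOp Requester Enable".
ATOMIC_REQ_ENABLE=0x40

# atomic_req_enabled HEXBYTE
# Succeeds if bit 6 is set in the given byte (two hex digits, no 0x prefix),
# for example the value printed on the host by:
#   setpci -d *:67c4 80.b
atomic_req_enabled() {
    [ $(( 0x$1 & $ATOMIC_REQ_ENABLE )) -ne 0 ]
}

# Example:
#   if atomic_req_enabled "$(setpci -d *:67c4 80.b)"; then
#       echo "AtomicOp requester bit is set"
#   fi
```

Note that `80.b=40` overwrites the whole byte. `setpci` also accepts a `value:mask` form, so `80.b=40:40` would change only that one bit, which may be the gentler choice if the other bits in the register matter.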