feat: yet another attempt to add windows builds #231
base: main
Conversation
Hi! This is the friendly automated conda-forge-linting service. I just wanted to let you know that I linted all conda-recipes in your PR ( I do have some suggestions for making it better though... For recipe:
Hi! This is the friendly automated conda-forge-linting service. I wanted to let you know that I linted all conda-recipes in your PR ( Here's what I've got... For recipe:
Both pipelines failed because they ran out of disk space:
What would be the most idiomatic way to solve this issue?
Try following https://conda-forge.org/docs/maintainer/conda_forge_yml/#azure to clear some disk space. Set this in
and then rerender the feedstock.
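The setting referenced here is the one documented at the linked page (and confirmed further down in the thread as already enabled). As a sketch, it looks like this in conda-forge.yml:

```yaml
# conda-forge.yml
azure:
  # Remove pre-installed tooling from the Azure runner image
  # before the build starts, freeing up disk space.
  free_disk_space: true
```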
I think there’s little we can do - the Azure free disk space setting is already enabled. I’d try and see if these build locally. Perhaps there is a way to use the Quansight servers for Windows as well, the same way they are used for Linux builds? If not, I guess if there are some volunteers to build these locally then this would be an option - I did that for aarch64 for a while for qt. Conda-forge has a Windows server too, but disk space has always been quite restricted there as well, so it might be a bit of a pain.
Perhaps cross-compiling Windows from Linux is worth trying? Here is a different feedstock PR that does this: conda-forge/polars-feedstock#187. If we were to use Quansight resources for Windows, being able to run the build on Linux (so cross-compiling) would be very helpful.
Sadly that's already set: pytorch-cpu-feedstock/conda-forge.yml Line 2 in 9e99e03
I assume you mean the runners provided through open-gpu-server by Quansight and MetroStar? This PR only builds the CPU-only version, but if we also start building for CUDA I think this is the only possible way forward (let alone for other related repositories like tensorflow). However, the open-gpu-servers don't seem to provide any Windows images. Do you know who I should contact to get the ball rolling?
That would be an option, but I'd prefer to automate and open-source things as much as possible. Having something hooked up to this repository would be ideal.
The native code of the example you linked is using Rust, which makes this much easier. I doubt that this would be easy to achieve with pytorch.
I also expect another error when actual linking starts. On my local machine that takes at least 16 GB of memory. The CUDA version will most likely require more.
If we don't try, we won't know.
Although that is technically true, it's already hard enough to build pytorch natively. Adding cross-compilation into the mix seems to me to complicate this even further. I'd much rather first focus on getting native builds working, even if we need to modify the infrastructure to do so. I think having the ability to do resource-intensive Windows builds would be a huge benefit for the conda-forge ecosystem in general. However, if all else fails, cross-compiling seems like a worthwhile avenue to explore.
One thing to try is to move the build from
You should have roughly 70 GB free on
Thanks! I added that to the PR. I quickly searched GitHub and it seems c:\bld\ is used more often, so I tried that.
Just make sure that the directory exists and is writable. Also, you need to rerender for the variable to be set. This comment should trigger the bot. @conda-forge-admin, please rerender
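The actual snippet used in the PR is not shown in the thread, but as a hypothetical sketch, a Windows-specific build-path variable can be set in conda-forge.yml roughly like this (the choice of CONDA_BLD_PATH here is an assumption based on the c:\bld\ discussion above):

```yaml
# conda-forge.yml — hypothetical sketch; rerendering is required for this to take effect
azure:
  settings_win:
    variables:
      # Move conda-build's working directory to the larger C: volume
      CONDA_BLD_PATH: C:\bld\
```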
A bit of history: back when this feedstock was created 6 years ago, the pytorch team officially suggested that people install two distinct packages. I personally feel that for Windows users, we would HURT their experience by not having a GPU package in 2024.
Couldn't agree more. I started with CPU-only to be able to make incremental progress. My goal is definitely to be able to build the CUDA version too!
Well, a few things:
Typically we "stop" the compilation on the CIs when we reach your stage (seems like it is working OK enough...).
Hi @baszalmstra @hmaarrfk - do you have any updates on this? It would be amazing to see this happen :)! |
@Tobias-Fischer I'm still working on the CUDA builds, but it's a slow process because it takes ages to build them locally, so iteration times are super slow. In parallel, we are also looking into getting large Windows runners into the conda-forge infrastructure.
I got to the testing stage and noticed this: pytorch-cpu-feedstock/recipe/meta.yaml Line 300 in f9fd731
However, this seems to always fail with (this is from the logs of the latest release):
Some dependencies are missing. In particular:
(as can be seen here: https://github.com/pytorch/pytorch/blob/6c8c5ad5eaf47a62fafbb4a2747198cbffbf1ff0/test/run_test.py#L1705). Given that the test is allowed to fail (due to
The more we fix, the better. If there are really a lot of failures, we might not fix them right away (though depending on the severity of the failures, we might want to think twice about releasing something in that state). In any case, let's leave the testing in, add the required dependencies, and pick up as many fixes as we can.
FWIW, CUDA builds seem to fail with
This file is added by
Ah, it appears the nvcc activation script sets the
I think I figured out the issue: conda-forge/cuda-nvcc-feedstock#47
I'm not necessarily the best person to ask about the CUDA stack - I think the issue you opened is a good start (which'll probably have to be fixed in any case). On the other hand, if something doesn't work for 12.0 but is available for later CUDA versions, I don't think it would be an issue to increase the
Yeah, I think the issue I found is indeed the problem. I mimicked the behavior of the Linux activation script to a degree, and I was able to successfully build
Let's see what CI here does. 😄
Timing out after 12 hours, how fun… Thanks for pushing this @baszalmstra!
I bumped the specs of the runners, let's see what happens. |
The CUDA builds succeeded! 🥳🎈 Here are some of the things I want to do before I consider this PR ready:
And finally:
Anything I'm missing? @hmaarrfk @h-vetinari
Huge congratulations on this milestone! 👏 🚀 🥳 Your todo-list sounds good to me. 👍 Sidenote: pytorch 2.4.0 just got released; depending on how much time you'll still need, we might want to build that first (at least, that shouldn't cause a lot of merge conflicts...).
Yeah, just go ahead and build that first. I don't expect any merge conflicts, and otherwise I'll solve them.
Looks like starting with pytorch 2.4.1, it supports split (python and non-python) builds out of the box. That will hopefully also help with our recipe here!
@REM Clear the build from any remaining artifacts. We use sccache to avoid recompiling similar code.
cmake --build build --target clean
Why are you cleaning here? Unix builds do not.
Good question. The reason is that we get into a weird state on Windows where, apparently, the order of the variant builds for pytorch influences the compilation process. I noticed that this order is non-deterministic. We would run into a weird issue where if pytorch for Python 3.11 was built before 3.12, you would get linker errors for the 3.12 build, whereas if 3.12 was built before 3.11, all would be good.
For the last two weeks I have been trying to figure out why this happens, and why it specifically only happens for Python 3.11 and 3.12, but I couldn't figure it out. As a last resort, I opted to just remove all artifacts and rebuild them, using sccache to ensure we aren't rebuilding everything from scratch.
This is a little bit of a temporary workaround. As far as I understand with pytorch 2.4.1 split builds will be supported out of the box and all of the megabuild stuff will become a lot simpler.
If you happen to have a better idea though, I welcome it with open arms! 😄
Do you have a log with the linker errors?
Re: "As far as I understand with pytorch 2.4.1 split builds will be supported out of the box and all of the megabuild stuff will become a lot simpler" — that's pytorch/pytorch#126328. It probably won't be ready for use until 2.5.0. Interesting note: the split build is a result of pytorch developers wanting to replicate the libtorch split of conda-forge in wheels.
Yes the last time this failed in CI was here: https://github.com/conda-forge/pytorch-cpu-feedstock/actions/runs/10077183574/job/27859126386?pr=231
And interesting indeed! That's great. 👍
Can you print build/CMakeCache.txt?
Omg, it's green! I will look at using the shared protobuf binary from conda-forge during the next few days.
The issue I'm facing with protobuf is this:
It looks like
I don't see
Sorry for pushing ahead with so many version updates. I think the build system has "improved", even if some changes required "fixing" on our part. You might want to update to 2.4.1 or perhaps even to the 2.5.0 pre-releases; there seems to be an rc9 at least.
I'll give 2.5 a go; I believe the split build is now also part of that?
Look at the source; I thought that it was part of 2.4, but I could be wrong. The split build uses a different strategy than we have historically used at conda-forge. If you want to make use of it, I would first prototype the split build separately from the Windows linking issues.
Checklist
- Reset the build number to 0 (if the version changed)
- Re-rendered with the latest conda-smithy (use the phrase @conda-forge-admin, please rerender in a comment in this PR for automated rerendering)

Fixes #32
This PR is another attempt to add Windows builds (see #134).
For now I disabled all other builds to be able to test the Windows part first. I made this PR a draft so we don't accidentally merge it.