Migrate modules to Seqera Containers #5832
Comments
It is still unclear to me how images generated on demand will not harm reproducibility.
This will also increase the CPU hours used, right?
What’s the purpose of enforcing this? BioContainers are automatically generated for every Bioconda package, and are regenerated automatically on every Bioconda software version bump. I’m not seeing the reason to then manually create a Seqera container.
Hi both - thanks for your comments. You're right that this issue precedes some community discussion that we still need to have. That started with the two recent bytesize talks and the resulting conversations on Slack, but we should still open it up to wider input. To address your concerns:
Wave generates images on demand, but Seqera Containers is a registry that sits behind Wave. The intention here is that the images are generated on demand by the developer when a package is updated - but then they are cached in the Seqera Containers registry. The image URIs will then be hardcoded into pipelines and the exact same container images will always be fetched by all users - just the same as they are today. We're also going to introduce conda-lock files (see #5835) so reproducibility should be even better than it is today.
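As a sketch of what that hardcoding could look like in a module (the registry host, package tag, and hash below are illustrative placeholders, not real URIs):

```groovy
// Hypothetical sketch of an nf-core module pinned to a Seqera Containers URI.
// The tag/hash is a placeholder - the real URI would come from the Seqera
// Containers build for the exact Conda package version requested.
process SAMTOOLS_SORT {
    conda "bioconda::samtools=1.21"
    container "community.wave.seqera.io/library/samtools:1.21--d3a6a7d4c3b5a9e1"

    // ... rest of the module as today. Because the URI is hardcoded and the
    // image is cached in the registry, every user pulls this exact image.
}
```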
No - Wave / Seqera Containers handles the build server side. As mentioned above, the generated images are stored in a registry and simply downloaded. So just as today, native images will be downloaded. No increase in CPU hours.
This one is more subjective. We will not make it a requirement to use Seqera Containers, just as we don't make it a requirement to use BioContainers today, so to me it feels about the same. We will keep the vast majority of build logic (e.g. conda env files, conda lock files) on the nf-core side and will be free to reverse the decision at any point should we wish.
One of the main reasons for adopting Seqera Containers is that it'll have even more automation and less manual work than the current setup. The process will roughly be:
Note that this process will also work for multi-package containers, which is not currently the case with BioContainers (mulled images), so it should represent a significantly easier workflow. Note also that although Seqera Containers has a web interface (https://seqera.io/containers/), it works programmatically too, via CLI, API and Nextflow. BioContainers has been brilliant for nf-core, but there are several reasons to move away:
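For the Nextflow route, a minimal configuration sketch (these options are from the Nextflow Wave documentation; the access token is a placeholder):

```groovy
// nextflow.config - ask Wave to build/fetch containers from each module's
// Conda definition instead of a hard-coded container URI
wave.enabled  = true
wave.freeze   = true             // persist the built image in the target registry
wave.strategy = ['conda']        // build from the process `conda` directive
tower.accessToken = '<your token>'   // placeholder - required for freeze mode
```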
Wave and Seqera Containers have been built specifically for our community, based on our combined experience and needs, so hopefully we can mitigate / avoid these pitfalls. I hope these responses help clear things up! Shout if you have any questions or concerns, and I'd recommend checking out the podcast and bytesize videos in the top comment, as they go through how much of this works.
To provide a practical example of these points:
When opening a PR to chipseq I found the "docker image format v1" error, see here. To fix the error, I tried to bump to a newer version of the tool.
As shown here, using Wave images fixes the issue above.
I just want to clarify a difference in case anyone missed it:
For 1, I think it's not controversial. Wave is open source; we're not relying on Seqera. We're grateful they host the service, but we could host it ourselves if we needed to, just like using Platform for megatests.

For 2, using Seqera Containers as a registry can seem controversial, but it's really not. Right now we're relying on BioContainers to host our Singularity and Docker images (on quay.io). We've had pushback from BioContainers on updates, and we've had uptime issues with quay.io. If people felt more comfortable, we could point Wave at any time to DockerHub, quay.io, GitHub Containers, or host our own ECR (AWS's image registry). But from a practicality standpoint, that's not nf-core's main skill, and we don't have the resources to spend on hosting our own registry.

The main takeaway: we need more flexibility in where our containers are hosted and in how they are built, which Wave gives us. We'd also like better uptime from using Seqera Containers as a registry.
This will make it easier for end users to move their containers to a private registry if they want to back them up:

```groovy
wave.build.repository      = 'quay.io/my/lab/repo'
wave.build.cacheRepository = 'quay.io/my/lab/cache-repo'
```

https://www.nextflow.io/docs/latest/wave.html#push-to-a-private-repository
As per the conversation on Slack: will this change be compatible with AWS ECR "pull through cache"? https://docs.aws.amazon.com/AmazonECR/latest/userguide/pull-through-cache.html
So, the containers held in the Seqera Containers registry will be a drop-in replacement for whatever the current container is, for each module? Wave containers generally seem to have some lifecycle associated with them; that is not going to be the case for these containers, right? If we need to come back in a year or two and re-pull a container from the Seqera Containers registry, will it still be there?

Also, on a side note: where are the build files and logs going to be stored for these containers? Are there Dockerfiles available somewhere? I guess that applies to the current BioContainers too.
Correct.
Exactly - that was the exact motivation for the project. Due to the registry cache, they will be there forever* (we're committing to a minimum of 5 years from when they're built, but at present we have no intention of ever deleting any).
Current BioContainers don't have Dockerfiles; they're built dynamically on CI. Seqera Containers do have Dockerfiles + Conda files, which are stored with the build log. I'm not 100% sure that we're guaranteeing to store those for the same duration as the images, but I think that we are; I can check if it's a concern. It might be nice for us to build some system to store those + security scan results / SBOM files somewhere in nf-core as a duplicate / backup. I'd certainly like to make them visible from the nf-core website module page as a minimum anyway.
In the same vein, I'm slightly uncomfortable with people already adding Wave containers to nf-core modules. I know that the Wave registries can only be pushed to from the Wave build system (right?), so there's no way someone can tamper with a container and ship a trojaned samtools. But, for instance, there is
Great question - I'm just putting together a blog post covering much of this stuff, I will bulk up the part about this as it's important.
Hope that makes sense! Shout if you spot anything missing / have suggestions for improvements.
Also I'd like to note: I discussed this on the nf-core Slack for the container image you mentioned, and have already been discussing updating Bioconda with the Crabs tool authors.
I noticed this; however, I think I have had a few instances where the build ID had a different number appended to the end. Is there any clarification on this? Is it really always going to be
The plan looks great, @ewels 👍🏼
@stevekm trust you to spot that, I was hoping it'd fly under the radar 😂 Yes, you're totally right. We discussed this in the Wave dev team on Monday and suggested adding a new API endpoint that can return the build ID for a given container: basically looking for all matching IDs and then returning the one with the highest
NB: builds can fail because of Conda dependencies that change between requests, or because of simpler things like the build cluster scaling down and cutting off the build mid-way.
We also recently had a ton of failed builds due to a bad pinned version of Perl, but I think someone has already put in a PR for that.
First blog post about this is now out: https://nf-co.re/blog/2024/seqera-containers-part-1 - the second part will be more technical, working on that now.
Seqera Containers is a new service to provide Docker + Singularity containers from any Conda / PyPI packages. Images are generated on demand and can include multiple packages.
Links for background:
We should strip out the BioContainers quay.io Docker images + Galaxy server Singularity images and replace them with image URIs from Seqera Containers.
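A sketch of what that swap could look like in a module's `container` directive (version tags, hashes, and the Seqera Containers URIs are illustrative only):

```groovy
// Before: BioContainers - Galaxy depot Singularity image + quay.io Docker image
container "${ workflow.containerEngine == 'singularity' ?
    'https://depot.galaxyproject.org/singularity/samtools:1.21--h50ea8bc_0' :
    'biocontainers/samtools:1.21--h50ea8bc_0' }"

// After (hypothetical URIs): both engines served from Seqera Containers,
// with the Singularity image fetched via ORAS from the same registry
container "${ workflow.containerEngine == 'singularity' ?
    'oras://community.wave.seqera.io/library/samtools:1.21--d3a6a7d4' :
    'community.wave.seqera.io/library/samtools:1.21--d3a6a7d4' }"
```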
We should use this change as an opportunity to rethink the optimal code structure for defining image names. This is currently under discussion. Group consensus can be posted here once achieved for broader community approval.
Milestone to track broad progress on this update: https://github.com/nf-core/modules/milestone/6
Tasks