Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add gh/ci eamxx standalone testing #6637

Open
wants to merge 7 commits into
base: master
Choose a base branch
from

Conversation

mahf708
Copy link
Contributor

@mahf708 mahf708 commented Sep 23, 2024

Adds support for running eamxx standalone testing in the gh/ci containers, for now only single-precision tests due to a mix of issues. Addresses two issues that have already been fixed in the scream fork. Adds flexibility to build cprnc in eamxx.

[BFB]

@mahf708 mahf708 added Testing Anything related to unit/system tests EAMxx PRs focused on capabilities for EAMxx labels Sep 23, 2024
Copy link

github-actions bot commented Sep 23, 2024

PR Preview Action v1.4.8
🚀 Deployed preview to https://E3SM-Project.github.io/E3SM/pr-preview/pr-6637/
on branch gh-pages at 2024-09-28 19:40 UTC

@mahf708
Copy link
Contributor Author

mahf708 commented Sep 23, 2024

FYI, the eamxx build fails are due to #6635. These won't affect the standalone testing because it is "sp" (homme isn't supported there). This is part of the reason I am starting with sp only, but there's a second part and it is to do with the complexity of building and executing the other (bigger) test suites, with a weird (unexpected) fail in atm_proc. I will handle those at a later time.

@mahf708 mahf708 changed the title add gh/ci eamxx standalone testing [wip] add gh/ci eamxx standalone testing Sep 23, 2024
@mahf708 mahf708 force-pushed the mahf708/ig/standalone-ghci-oci branch from 782baa7 to cdaa5c4 Compare September 25, 2024 21:46
@mahf708 mahf708 changed the title [wip] add gh/ci eamxx standalone testing add gh/ci eamxx standalone testing Sep 25, 2024
@mahf708 mahf708 added the BFB PR leaves answers BFB label Sep 25, 2024
@mahf708
Copy link
Contributor Author

mahf708 commented Sep 25, 2024

Status update: This PR is very close to being ready, in fact, it may pass all tests, but I'd like to revise the container again before we merge. The rationale: The standalone tests are pushing the container very close to its edge in terms of disk space, resulting in unpredictable fails. Plan A: reduce the space inside the container itself first (we have 13+ GB of data, but we need only ~2 GB for the standalone tests). Plan B: reduce the space in the runner and redesign how the container is run.

@bartgol
Copy link
Contributor

bartgol commented Sep 25, 2024

What do you mean by "reduce the space in the runner"?

@mahf708
Copy link
Contributor Author

mahf708 commented Sep 25, 2024

What do you mean by "reduce the space in the runner"?

The runner image (ubuntu-latest) comes preloaded with all sorts of random things that we can delete. Nominally, they guarantee ~15 GB of space, but in reality, there's more than 120 GB on these runners. It's just that a lot of complex stuff is preinstalled. You can view a list of stuff they have on these in docs like this https://github.com/actions/runner-images/blob/ubuntu22/20240922.1/images/ubuntu/Ubuntu2204-Readme.md. See below for the runner image vs the custom image. I propose to make space in the custom image first. If it doesn't work, I will make space outside of it, but that will likely mean I cannot use the custom image as written below...

  ci:
    runs-on: ubuntu-latest <--------------------------------------- runner image
    strategy:
      fail-fast: false
      matrix:
        test:
          - sp
          - opt
    container: 
      image: ghcr.io/e3sm-project/containers-ghci:ghci-0.1.2 <----- custom image

@bartgol
Copy link
Contributor

bartgol commented Sep 25, 2024

Aren't the limits set on the container image? Why do we care about the runner?

@mahf708
Copy link
Contributor Author

mahf708 commented Sep 26, 2024

Aren't the limits set on the container image? Why do we care about the runner?

I am not sure what you mean...

The limit comes from the hardware (bare-metal), which has the runner image (ubuntu-latest), which has the container image (ghci-container). In other words, ghci-container ⊆ ubuntu-latest ⊆ bare-metal. Inside the ghci-container, you can use as much as you want (i.e., you can run a multi-TB sim if there is space...)

@bartgol
Copy link
Contributor

bartgol commented Sep 26, 2024

Aren't the limits set on the container image? Why do we care about the runner?

I am not sure what you mean...

The limit comes from the hardware (bare-metal), which has the runner image (ubuntu-latest), which has the container image (ghci-container). In other words, ghci-container ⊆ ubuntu-latest ⊆ bare-metal. Inside the ghci-container, you can use as much as you want (i.e., you can run a multi-TB sim if there is space...)

Ah, so we are not fitting on the actual machine where things end up running? I see. Re: container image size, I was thinking gh has a limit in size for images that one can upload, and that's it. Since the runner is not ours, I thought "there's no limit on our end". But yes, while we are not responsible for the runner size, the limit is hardware based.

Wow, 150GB for the runner, that's just nonsense. Why do they make them this big? Have you considered asking for alpine instead? We may have to install something at startup though, which may take some time... Alternatively, maybe a basic debian?

@rljacob
Copy link
Member

rljacob commented Sep 26, 2024

Is this still a draft?

@mahf708
Copy link
Contributor Author

mahf708 commented Sep 26, 2024

Is this still a draft?

Yes, I want to rework the container a bit so that the new tests can pass more reliably. The issue is running out of disk space (sometimes)

@mahf708 mahf708 marked this pull request as ready for review September 29, 2024 01:26
@mahf708
Copy link
Contributor Author

mahf708 commented Sep 29, 2024

Is this still a draft?

@rljacob, this is ready now (see notes below).

@bartgol, could you take another look?

Some notes:

  • this PR introduces changes in EAMxx that are already impl'ed in the scream repo (namely b33da8c for a cosp segfault and 89b7ade for pl semantics --- these two are isolated commits that hopefully won't conflict when merging the repos, but Jim can comment)
  • this PR changes a little bit about cprnc building in eamxx (with more todo later)
  • this PR introduces four different workflows that will run when eamxx or homme are touched (standalone in four flavors); the fpe and dbg flavors can take up to two hours, the build alone is ~20 mins, so these will be our longest tests (recall we have 6 hours on these runners, so we are far awar from timing out)
  • a todo item is to introduce concurrency like Luca suggested (add a concurrency check to gh workflows #6652)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
BFB PR leaves answers BFB EAMxx PRs focused on capabilities for EAMxx Testing Anything related to unit/system tests
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants