[GEOS] Define operational and HPC metrics #66

Open
2 tasks
FlorianDeconinck opened this issue May 22, 2024 · 2 comments

@FlorianDeconinck
Collaborator

Previous benchmarks were done with the "node-to-node" metric, which answers the question "can we replace a CPU node with a GPU node?".

As we gear toward operations, this metric is no longer enough on its own. It should be backed with more scientifically relevant metrics (gridpoint, SYPD, SDPD, which seems to be the GMAO's preferred metric, etc.).
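
For reference, a minimal sketch of how these throughput metrics relate to a measured wall time. The function names and the 365-day year are assumptions made for illustration, not an agreed GMAO convention:

```python
SECONDS_PER_DAY = 86_400.0
DAYS_PER_YEAR = 365.0  # assumption: non-leap calendar year


def sdpd(simulated_days: float, wall_seconds: float) -> float:
    """Simulated Days Per Day: simulated days achieved per wall-clock day."""
    if wall_seconds <= 0:
        raise ValueError("wall_seconds must be positive")
    return simulated_days * SECONDS_PER_DAY / wall_seconds


def sypd(simulated_days: float, wall_seconds: float) -> float:
    """Simulated Years Per Day, derived from SDPD."""
    return sdpd(simulated_days, wall_seconds) / DAYS_PER_YEAR


def node_to_node_speedup(cpu_node_wall_seconds: float,
                         gpu_node_wall_seconds: float) -> float:
    """Ratio of wall times for the same run on one CPU node vs one GPU node."""
    if gpu_node_wall_seconds <= 0:
        raise ValueError("gpu_node_wall_seconds must be positive")
    return cpu_node_wall_seconds / gpu_node_wall_seconds
```

For example, a run that simulates one day in one hour of wall time gives `sdpd(1, 3600) == 24.0`.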

We should also start measuring ourselves against the SCU17/18 Milan nodes and their 128 cores.

Electricity consumption and price are also previous metrics we should carry forward.
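
A rough sketch of how consumption and price could be normalized per simulated day so that different hardware can be compared. The average power draw and the electricity price are inputs that would have to be measured or supplied; all names here are hypothetical:

```python
def energy_kwh(avg_power_kw: float, wall_seconds: float) -> float:
    """Energy used by a run: average power (kW) times wall time (hours)."""
    return avg_power_kw * wall_seconds / 3600.0


def cost_per_simulated_day(avg_power_kw: float,
                           wall_seconds: float,
                           simulated_days: float,
                           price_per_kwh: float) -> float:
    """Electricity cost of one simulated day, in the currency of price_per_kwh."""
    if simulated_days <= 0:
        raise ValueError("simulated_days must be positive")
    return energy_kwh(avg_power_kw, wall_seconds) * price_per_kwh / simulated_days
```

Normalizing by simulated days rather than by node keeps the comparison tied to the science delivered.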

Another angle is the scaling and operational usefulness of each hardware platform, so that the narrative for the scientists is clear.

This process should involve the GMAO but remain led by us, to make sure we can deliver.

Overall, pragmatism is key: we are not here to give roofline projections and peak FLOPS, we are here to deliver day-to-day usage.


  • Document the metrics to be used, their impact and logic
  • Create a versioned document of the methodology to be applied for each metric
@FlorianDeconinck FlorianDeconinck changed the title Define operational and HPC metrics [GEOS] Define operational and HPC metrics May 22, 2024
@FlorianDeconinck
Collaborator Author

As part of this work we should also project the requirements for running bigger simulations, now and for every year going forward.

Per Tsengdar:

  • Can we estimate how many GPUs, and which CPU-GPU configuration, we need to support this project at C1440-L181 resolution in FY26? Do we have access to what we need?
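
As a starting point for that estimate, a minimal sketch of the gridpoint count for a cubed-sphere configuration (6 faces of N x N cells, times the number of vertical levels) and a naive GPU count derived from it. `points_per_gpu` is a hypothetical capacity that would have to come from measured memory footprint and throughput; it is not a known GEOS number:

```python
import math


def cubed_sphere_gridpoints(n_cells_per_edge: int, n_levels: int) -> int:
    """Total gridpoints for a C<N>-L<levels> cubed-sphere configuration."""
    return 6 * n_cells_per_edge ** 2 * n_levels


def gpus_needed(total_points: int, points_per_gpu: int) -> int:
    """Naive GPU count, assuming an even split and a measured per-GPU capacity."""
    if points_per_gpu <= 0:
        raise ValueError("points_per_gpu must be positive")
    return math.ceil(total_points / points_per_gpu)


# C1440-L181: 6 * 1440**2 * 181 gridpoints
c1440_l181_points = cubed_sphere_gridpoints(1440, 181)  # 2_251_929_600
```

This ignores halo exchanges, I/O and the CPU-side components, so it only gives a lower bound to refine with real measurements.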

Per Laura:

  • What do you need from us to be successful?

@FlorianDeconinck FlorianDeconinck self-assigned this Sep 3, 2024
@FlorianDeconinck
Collaborator Author

Working on it as part of the SC24 presentation.

Science

  • Resolution: required resolution to be run.
    • GEOS: expressed as the average size, in kilometers, of a square cell on the cube-sphere
  • Model skill: physical processes to be resolved
    • GEOS: dynamics, moist physics, chemistry, radiation, ocean coupling, land surface...
  • Throughput: wall-time for the target simulation
    • GEOS: expressed in Simulated Days per Day (SDPD)
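
One way to express the resolution metric above, sketched under the assumption that the "average square cell" is obtained by dividing the Earth's surface area evenly among the 6 x N x N cells of a C<N> grid. The mean Earth radius of 6371 km and the function name are assumptions for illustration:

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumption: mean Earth radius


def mean_cell_size_km(n_cells_per_edge: int) -> float:
    """Side length (km) of a square cell with the average cell area of a C<N> grid."""
    sphere_area_km2 = 4.0 * math.pi * EARTH_RADIUS_KM ** 2
    n_cells = 6 * n_cells_per_edge ** 2
    return math.sqrt(sphere_area_km2 / n_cells)
```

With this definition, `mean_cell_size_km(1440)` comes out at roughly 6.4 km.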

Software

  • Features: required skills of the technology to express the science
    • Performance
    • Ease of use
    • Completion
  • Maintainability: tools to ensure enduring good science code
    • Continuous Integration (CI) with unit, regression and functional testing
    • Numerical debug tools
    • QoS: Documentation tooling, coding standards, collaborative tools (user manual)
  • Technological Debt: managing the inevitable growth in code
    • Automatic coverage of the science code
    • Access and documentation of supporting frameworks and/or middleware

Operations

  • Time to solution: required wall time on a given hardware
  • Energy use: per hardware energy use (in kW)
  • Hardware optimization: per hardware memory bandwidth usage (in % of theoretical maximum)
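
The hardware optimization metric could be computed along these lines. This is a sketch only: the bytes moved would come from a profiler or an analytic count of reads and writes, and the peak bandwidth from the vendor specification. Both are inputs here, and the function names are hypothetical:

```python
def achieved_bandwidth_gbs(bytes_moved: float, seconds: float) -> float:
    """Achieved memory bandwidth in GB/s (1 GB = 1e9 bytes)."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return bytes_moved / seconds / 1e9


def bandwidth_utilization_pct(bytes_moved: float,
                              seconds: float,
                              peak_gbs: float) -> float:
    """Achieved bandwidth as a percentage of the theoretical maximum."""
    if peak_gbs <= 0:
        raise ValueError("peak_gbs must be positive")
    return 100.0 * achieved_bandwidth_gbs(bytes_moved, seconds) / peak_gbs
```

For instance, moving 5e11 bytes in one second on hardware with a 1000 GB/s peak corresponds to 50% utilization.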
