[WIP] Add admin.list files to warn users of known issues #197

Draft
wants to merge 4 commits into
base: main
from

Conversation

Neves-P
Member

@Neves-P Neves-P commented Aug 27, 2024

A small number of software packages installed in EESSI have issues in specific contexts that users should be made aware of (see support ticket #79).

This WIP PR adds per-architecture admin.list files, which use Lmod's module deprecation feature to display a message to users when they load a module with known issues. The information about the known issues comes from the YAML file(s) in the root of the software-layer repository: eessi-2023.06-known-issues.yml

The warnings should appear only in the context where they apply, i.e., a user using zen4 CPUs shouldn't be warned about a few failing tests for SciPy in neoverse_v1.

The admin.list files should have the following format:

/cvmfs/software.eessi.io/versions/2023.06/software/linux/modules/all/aarch64/neoverse_v1/ESPResSo/4.2.1-foss-2023a:
   There is a known issue: ESPResSo tests failing due to timeouts
   See: https://github.com/EESSI/software-layer/issues/363

/cvmfs/software.eessi.io/versions/2023.06/software/linux/modules/all/aarch64/neoverse_v1/FFTW/MPI-3.3.10-gompi-2023a:
   There is a known issue: Flaky FFTW tests, random failures
   See: https://github.com/EESSI/software-layer/issues/325

The first line of each entry can be either a module name and version or the full path to a module. The full path is preferred, since it ensures that we pick up the relevant module in the right context and that we do not display unintended warnings if EESSI is mounted at a site that has its own local installations of the same module.
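For illustration, a minimal sketch of how one such entry could be rendered from its parts. The function name `admin_list_entry` and the `MODULES_ROOT` constant are hypothetical helpers, not part of this PR; the path prefix and the three-space indentation are copied from the example entries above.

```python
from textwrap import indent

# Prefix taken from the example paths above; treated here as a fixed constant.
MODULES_ROOT = "/cvmfs/software.eessi.io/versions/2023.06/software/linux/modules/all"


def admin_list_entry(arch, module, description, issue_url):
    """Return one admin.list entry keyed by the full module path."""
    header = f"{MODULES_ROOT}/{arch}/{module}:"
    body = f"There is a known issue: {description}\nSee: {issue_url}"
    # Message lines are indented by three spaces, as in the examples above.
    return header + "\n" + indent(body, "   ") + "\n"


entry = admin_list_entry(
    "aarch64/neoverse_v1",
    "ESPResSo/4.2.1-foss-2023a",
    "ESPResSo tests failing due to timeouts",
    "https://github.com/EESSI/software-layer/issues/363",
)
print(entry)
```

Because the architecture is part of the key, writing one file per architecture keeps each warning limited to the context where it applies.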

This PR is marked as WIP because there is still an issue to solve. Parsing the known_issues.yml file yields almost the correct path, but the module directory is incorrect because the file records the easyconfig name rather than the module name. Compare (note the dash between module name and version):
Expected - /cvmfs/software.eessi.io/versions/2023.06/software/linux/modules/all/aarch64/neoverse_v1/ESPResSo/4.2.1-foss-2023a
Obtained - /cvmfs/software.eessi.io/versions/2023.06/software/linux/modules/all/aarch64/neoverse_v1/ESPResSo-4.2.1-foss-2023a

Converting between easyconfig name and module name is more complicated than I initially thought, because module names can vary widely and consist of an unknown number of words.

I see two options:

  • Coming up with a clever trick for this, which could fail for some more esoteric module names
  • Fixing the problem upstream by converting the name in the known_issues.yml file to match the expected output.
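To make the first option concrete, here is a sketch of one possible "clever trick": split the easyconfig name at the first dash that is followed by a digit. The function `guess_module_name` is a hypothetical helper and the last input is an invented name, included only to show how such a heuristic can break.

```python
import re


def guess_module_name(easyconfig_name):
    """Heuristic: split at the first dash that is followed by a digit."""
    match = re.search(r"-(?=\d)", easyconfig_name)
    if match is None:
        raise ValueError(f"no version found in {easyconfig_name!r}")
    return easyconfig_name[: match.start()] + "/" + easyconfig_name[match.end():]


print(guess_module_name("ESPResSo-4.2.1-foss-2023a"))        # ESPResSo/4.2.1-foss-2023a
print(guess_module_name("SciPy-bundle-2023.07-gfbf-2023a"))  # SciPy-bundle/2023.07-gfbf-2023a
# Hypothetical name with a digit right after a dash: the split lands too early.
print(guess_module_name("tool-2go-1.0-foss-2023a"))          # tool/2go-1.0-foss-2023a
```

The heuristic handles multi-word names, but any name containing a dash followed by a digit defeats it, which is the kind of failure the first option risks.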

@boegel
Contributor

boegel commented Aug 27, 2024

In general I like this approach, and it's sort of what we had in mind with keeping track of known issues, but we should also wonder if these "nag" messages on load won't be too alarming.

If a module is loaded indirectly as a dependency, should we emit a message in this case, for example?

Maybe the known issues list should also indicate whether or not a message should be emitted when the corresponding module is loaded, and in some cases it may only make sense to emit the message when the module is being loaded directly (not as a dependency).

We should definitely also point to the "Known issues" page in the EESSI documentation, where people can get more information on known issues. We have a page like that, but it doesn't list the known issues included in the YAML file currently: http://www.eessi.io/docs/known_issues/eessi-2023.06

@Neves-P
Member Author

Neves-P commented Aug 28, 2024

> In general I like this approach, and it's sort of what we had in mind with keeping track of known issues, but we should also wonder if these "nag" messages on load won't be too alarming.
>
> If a module is loaded indirectly as a dependency, should we emit a message in this case, for example?

My preference would be not to warn users if the module is loaded as a dependency, mostly because that could become very verbose if the modules with warnings are common dependencies. It might also be alarming in situations where there is no real cause for alarm... I haven't tried it yet, but there is a good chance the messages are triggered even for modules loaded as dependencies.

> Maybe the known issues list should also indicate whether or not a message should be emitted when the corresponding module is loaded, and in some cases it may only make sense to emit the message when the module is being loaded directly (not as a dependency).

I agree, and @bedroge suggests having an entry in the known issues YAML file that determines whether the message gets displayed. With the refactoring of the file that we discussed (details below), adding this is simple, and the Python script can easily parse it to decide whether the module gets added to the admin.list file.

> We should definitely also point to the "Known issues" page in the EESSI documentation, where people can get more information on known issues. We have a page like that, but it doesn't list the known issues included in the YAML file currently: http://www.eessi.io/docs/known_issues/eessi-2023.06

Absolutely, it would be good to automatically parse the YAML file and add the information in an easy-to-read format to the "Known Issues" page. Maybe it could also be added to the installed software list by adding the issue information to the relevant pages, but that might be more trouble than it's worth.

We discussed the best way to convert the easyconfig name to the module name and agreed to go with @casparvl's suggestion of revamping the YAML file. The reworked file would have explicit fields for architecture, module name, version, toolchain, a link to the relevant GH issue, a short description of the problem, and whether the warning message should be displayed.

We would have to enforce this formatting, and there might be corner cases that we didn't anticipate, but this approach is fairly general and would let sites that use other module naming schemes grab this information and adapt the admin.list file generation script to their particular situation.

I will do a semi-manual conversion of the current yaml file and paste it here to propose a change. We could then also add some CI checks that make sure new items added contain at least the required fields.
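A rough sketch of what such a CI check could look like, assuming each known issue has already been loaded into a flat dictionary. The field names in `REQUIRED_FIELDS` are placeholders for whatever the final format settles on, and the `Broken/1.0-foss-2023a` entry is invented to show a failing case.

```python
# Placeholder field names; the real list depends on the agreed file format.
REQUIRED_FIELDS = (
    "software_name",
    "software_version",
    "toolchain",
    "toolchain_version",
    "issue",
    "info",
)


def missing_fields(entry):
    """Return the required fields that are absent or empty in one entry."""
    return [field for field in REQUIRED_FIELDS if entry.get(field) in (None, "")]


def check_entries(entries):
    """Collect readable errors; an empty list means the check passed."""
    errors = []
    for module, entry in entries.items():
        missing = missing_fields(entry)
        if missing:
            errors.append(f"{module}: missing {', '.join(missing)}")
    return errors


entries = {
    "SciPy-bundle/2023.07-gfbf-2023a": {
        "software_name": "SciPy-bundle",
        "software_version": "2023.07",
        "toolchain": "gfbf",
        "toolchain_version": "2023a",
        "issue": "https://github.com/EESSI/software-layer/issues/318",
        "info": "4 failing tests (vs 54407 passed) in scipy test suite",
    },
    # Invented entry without an issue link, to show a failure.
    "Broken/1.0-foss-2023a": {
        "software_name": "Broken",
        "software_version": "1.0",
        "toolchain": "foss",
        "toolchain_version": "2023a",
        "info": "example entry",
    },
}
errors = check_entries(entries)
print(errors)
```

A CI job could then fail whenever `errors` is non-empty.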

@Neves-P
Member Author

Neves-P commented Sep 3, 2024

A proposal of what the yml file could contain:

- aarch64/a64x:
  - SciPy-bundle/2023.07-gfbf-2023a:
    - software_name: SciPy-bundle
    - software_version: 2023.07
    - toolchain: gfbf
    - toolchain_version: 2023a
    - issue: https://github.com/EESSI/software-layer/issues/318
    - info: "4 failing tests (vs 54407 passed) in scipy test suite"
    - warn: true
  - SciPy-bundle/2023.11-gfbf-2023b:
    - software_name: SciPy-bundle
    - software_version: 2023.11
    - toolchain: gfbf
    - toolchain_version: 2023b
    - issue: https://github.com/EESSI/software-layer/issues/318
    - info: "3 failing tests (vs 54875 passed) in scipy test suite"

The warn field can be added when the warning should be displayed on module loading for users, otherwise it can be omitted or set to false. I'd suggest also having a suffix field for when we inevitably run into an application that has a suffix in the name. This field can also be omitted if there is no suffix.

Edit: Add some of the changes from support meeting (so I don't forget :) )
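As a sketch of how a script might consume this layout: every level of the proposal is a list of single-key mappings, so the per-module fields need to be merged before use. The `parsed` literal below is what a YAML loader would be expected to return for the proposal, written out by hand so the example runs without a YAML library. Versions are written as strings here, which assumes they are quoted in the file; an unquoted value such as 2023.07 would probably be read as a number by common loaders. The helper names are hypothetical.

```python
# Hand-written equivalent of the loaded proposal: nested lists of
# single-key mappings. Versions are strings (assumes quoting in the file).
parsed = [
    {"aarch64/a64x": [
        {"SciPy-bundle/2023.07-gfbf-2023a": [
            {"software_name": "SciPy-bundle"},
            {"software_version": "2023.07"},
            {"toolchain": "gfbf"},
            {"toolchain_version": "2023a"},
            {"issue": "https://github.com/EESSI/software-layer/issues/318"},
            {"info": "4 failing tests (vs 54407 passed) in scipy test suite"},
            {"warn": True},
        ]},
        {"SciPy-bundle/2023.11-gfbf-2023b": [
            {"software_name": "SciPy-bundle"},
            {"software_version": "2023.11"},
            {"toolchain": "gfbf"},
            {"toolchain_version": "2023b"},
            {"issue": "https://github.com/EESSI/software-layer/issues/318"},
            {"info": "3 failing tests (vs 54875 passed) in scipy test suite"},
        ]},
    ]},
]


def merge(pairs):
    """Collapse a list of single-key mappings into one dictionary."""
    merged = {}
    for pair in pairs:
        merged.update(pair)
    return merged


def modules_to_warn(data):
    """Yield (arch, module, fields) for entries whose warn field is true."""
    for arch_block in data:
        for arch, modules in arch_block.items():
            for module_block in modules:
                for module, pairs in module_block.items():
                    fields = merge(pairs)
                    # A missing warn field is treated the same as false.
                    if fields.get("warn", False):
                        yield arch, module, fields


selected = [(arch, module) for arch, module, _ in modules_to_warn(parsed)]
print(selected)  # [('aarch64/a64x', 'SciPy-bundle/2023.07-gfbf-2023a')]
```

Using plain mappings instead of lists of single-key mappings at each level would remove the need for the `merge` step.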

@boegel
Contributor

boegel commented Sep 3, 2024

@Neves-P I would also keep track of software name, and then maybe this is better:

- aarch64/a64x:
  - SciPy-bundle/2023.07-gfbf-2023a:
    - software:
      - name: SciPy-bundle
      - version: 2023.07
    - toolchain:
      - name: gfbf
      - version: 2023a
    - issue: https://github.com/EESSI/software-layer/issues/318
    - info: "4 failing tests (vs 54407 passed) in scipy test suite"
    - warn: true

Maybe warn should be refined into always_warn (warn on both direct and indirect loads), plus another option for warning only on direct loads?
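One way to picture that refinement is a three-valued setting instead of a boolean. This is only a sketch: the `WarnMode` names are invented, and it assumes the caller already knows whether the module was loaded directly, which this thread has not yet established for Lmod.

```python
from enum import Enum


class WarnMode(Enum):
    """Hypothetical values for a refined warn field."""
    NEVER = "never"
    DIRECT = "direct"    # warn only when the user loads the module directly
    ALWAYS = "always"    # warn on direct and indirect (dependency) loads


def should_warn(mode, loaded_directly):
    """Decide whether to show the known-issue message for one module load."""
    if mode is WarnMode.ALWAYS:
        return True
    if mode is WarnMode.DIRECT:
        return loaded_directly
    return False


print(should_warn(WarnMode.DIRECT, loaded_directly=False))  # False
print(should_warn(WarnMode.ALWAYS, loaded_directly=False))  # True
```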


2 participants