Proposed short-term fix for the slow parsing of Very Large project.yaml files #340

Open
evansd opened this issue Oct 3, 2024 · 0 comments


evansd commented Oct 3, 2024

We're using pure Python parsing for two reasons:

  • It's required for our cross-platform, vendor-everything installation story.
  • It produces more helpful error messages for invalid YAML.

But for very large projects involving megabytes of YAML it's intolerably slow. This makes local opensafely run very difficult to use, and makes it impossible to dispatch jobs in production because the page times out.

A short-term workaround would be for the pipeline library to first try parsing the YAML with a fast, compiled parser (if one is importable) and, only if that raises an error, to re-parse it with the pure Python parser to get the helpful error messages.

In pseudo-code, what I'm proposing is something like:

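try:
    import fast_yaml_parser
except ImportError:
    # No compiled parser installed: always use the pure Python one.
    fast_yaml_parser = None

def parse_yaml(yaml_string):
    if fast_yaml_parser:
        try:
            return fast_yaml_parser.parse(yaml_string)
        except Exception:
            # Fall through and re-parse to get the helpful error message.
            pass
    return slow_but_helpful_yaml_parser(yaml_string)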
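
To make that slightly more concrete, here is a minimal sketch assuming the fast parser is PyYAML built with its LibYAML bindings (`yaml.CSafeLoader`). The helper names `load_fast_parser` and `make_parse_yaml` are hypothetical, and the slow parser is passed in as a plain callable because this sketch doesn't assume anything about how the pipeline library wires up its existing pure Python parser.

```python
def load_fast_parser():
    """Return a callable that parses YAML text with a compiled parser, or None."""
    try:
        import yaml  # PyYAML; may not be installed
    except ImportError:
        return None
    # CSafeLoader only exists when PyYAML was built against LibYAML.
    loader = getattr(yaml, "CSafeLoader", None)
    if loader is None:
        return None
    return lambda text: yaml.load(text, Loader=loader)


def make_parse_yaml(fast_parse, slow_parse):
    """Build a parse function that prefers fast_parse and falls back to slow_parse."""

    def parse_yaml(text):
        if fast_parse is not None:
            try:
                return fast_parse(text)
            except Exception:
                # Invalid YAML (or any other failure): re-parse with the slow
                # parser so the user sees its more helpful error message.
                pass
        return slow_parse(text)

    return parse_yaml
```

Usage would be something like `parse_yaml = make_parse_yaml(load_fast_parser(), slow_but_helpful_yaml_parser)`. One thing this sketch does not address is whether the two parsers agree on every valid document, which would be worth checking against some real project.yaml files.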

This does make the unhappy path slower, but not by much. And it would massively speed up the happy path, assuming that a fast parser is importable. There are three different contexts we need to think about.

1. Job Server

Here it looks like pyyaml is already available, so there'd be nothing more to do than upgrade the pipeline library.
https://github.com/opensafely-core/job-server/blob/4814a7a17d42c55a508fb527c2b3c5a9121027c3/requirements.prod.txt#L794

2. Codespaces

I'm not exactly sure of the mechanics here, but presumably we can use whatever mechanism we do for ensuring that opensafely-cli is installed to also install pyyaml.

3. Running locally

This is obviously the hardest part. I think in the first instance we'd just need to talk the affected users through installing pyyaml (or whatever we choose). That's not sustainable, but it makes it practical right now for these users to interact with their projects locally, which I think is really important.
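
When talking a user through this, it might help to have a quick way of confirming that their Python can actually see a compiled parser. A rough sketch, assuming pyyaml is what we end up choosing (`compiled_yaml_available` is a hypothetical helper name, not something that exists in the pipeline library):

```python
def compiled_yaml_available():
    """Report whether PyYAML is importable and was built with LibYAML bindings."""
    try:
        import yaml
    except ImportError:
        return False
    # PyYAML sets __with_libyaml__ to True only when its C extension loaded.
    return bool(getattr(yaml, "__with_libyaml__", False))


print("compiled YAML parser available:", compiled_yaml_available())
```

If this prints `False` after installing pyyaml, the install is using PyYAML's pure Python implementation, which is unlikely to give the speed-up we're after.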

Longer term, if we move to using uv for local installation then the need to keep all our dependencies as pure Python goes away.
