Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Gray out early clades #1132

Merged
merged 10 commits into from
Oct 3, 2024
Merged

Gray out early clades #1132

merged 10 commits into from
Oct 3, 2024

Conversation

trvrb
Copy link
Member

@trvrb trvrb commented Jul 27, 2024

Description of proposed changes

This PR is attempting to solve the same problem as #1129, but in a different fashion. In this case, we keep the subsampling in the 6m, 2m and 1m builds the same as before and so keep the long contextual tail going from 6m / 2m / 1m back to the beginning of the pandemic. However, I've added new functionality to the assign-colors.py script that takes in a --clade-recency parameter. This specifies the number of months back to look for circulating clades. So for example, --clade-recency 6 would use tips in the previous 6 months and only keep clade_membership in this time period. I added parameter logic to specify --clade-recency of 6 vs 2 vs 1 for 6m, 2m and 1m builds. The all-time builds include all clades.

Here's some comparisons.

global/6m

Live

Screenshot 2024-07-26 at 6 24 11 PM

PR

Screenshot 2024-07-26 at 6 25 26 PM

global/2m

Live

Screenshot 2024-07-26 at 6 26 20 PM

PR

Screenshot 2024-07-26 at 6 26 50 PM

global/1m

Live

Screenshot 2024-07-26 at 6 27 46 PM

PR

Screenshot 2024-07-26 at 6 28 20 PM

I think this is a big improvement. We no longer have 3 highly similar similar shades of red for currently circulating clades. It also seems semantically appropriate to gray out these early clades to correspond to "context". This the same thing we do for regions outside of the focal region ala south-america/6m?c=country

My preference would be to merge this PR (after code review) and reconsider the previous time-based sampling PR #1129. I think we'll eventually need to move the min_date up from 2020, but with clade colors fixed it's not so urgent. Additionally, having this strategy of gray clades outside of focal window would work with alternative time-based sampling windows.

Testing

Tested locally and via trial build (URLs linked above). However, this adds a new parameters.yaml option. I tested and it shouldn't break existing builds that don't specify colors:clade_recency, but attention here wouldn't hurt.

Release checklist

If this pull request introduces new features, complete the following steps:

  • Update docs/src/reference/change_log.md in this pull request to document these changes by the date they were added.

@trvrb trvrb requested a review from a team July 27, 2024 01:40
@trvrb trvrb self-assigned this Jul 27, 2024
Copy link
Contributor

@huddlej huddlej left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree this looks like a great improvement! I'm excited to try the same approach in the seasonal flu workflow.

My main comment is that I think the default value should be an explicit value that ties to the meaning of the clade_recency functionality instead of an empty string. For example, a string value of “all” or “all-time” would more clearly communicate what the config option is doing. An empty string doesn’t immediately communicate “assign colors to all clades” and looks more like a typo in the config YAML (like maybe someone forgot to enter a value?).

The clade recency argument appears in two different parts of the workflow: the workflow configuration/rules and the assign-colors.py script. We could modify either of these parts to support a default argument of "all".

In the context of the script, we could keep the default behavior of “assign colors to all clades” when the --clade-recency argument is not provided and that would not be surprising behavior. Providing that argument would be a way to opt into the behavior of filtering by date. When the user provides that argument, the script would require that the value is an integer representing number of months in the past. In the context of the workflow, this approach would require us to build a clade recency argument string to pass to the script. So, instead of this:

--clade-recency {params.clade_recency}

we would build this:

{params.clade_recency_argument}

When the user provides a value of “all” or the workflow can’t find a clade recency in the config, the clade recency argument is an empty string. When the user provides an integer value, the clade recency argument is --clade-recency 3, for example. We already use this pattern in the ncov workflow (like the get_priority_argument function).

Another pattern we use elsewhere is a custom argparse type that accepts a specific string value (“all”) or an integer like we do for Augur’s nthreads arguments. We could use this pattern to modify the assign-colors script to accept --clade-recency all as a valid input. Then the workflow can keep the same pattern in this PR where the colors rule hardcodes the --clade-recency argument with a {params.clade_recency} reference instead of building the argument as described above. When the user provides a value of all, the script happily performs the default behavior of assigning colors to all clades.

I lean slightly toward the params.clade_recency_argument pattern, since it doesn’t add complexity to the assign-colors script that would only exist for the sake of the workflow, but either approach would work.

docs/src/reference/workflow-config-file.rst Outdated Show resolved Hide resolved
scripts/assign-colors.py Outdated Show resolved Hide resolved
@trvrb
Copy link
Member Author

trvrb commented Sep 26, 2024

Thanks so much for the detailed feedback @huddlej. This was extremely helpful. With 497cb9c, I've implemented your first suggestion of swapping --clade-recency {params.clade_recency} to {params.clade_recency_argument}.

@trvrb
Copy link
Member Author

trvrb commented Sep 26, 2024

I've (finally) attended to @huddlej's helpful comments. This included swapping to {params.clade_recency_argument} and swapping to config["colors"]["default"]["clade_recency"].

I'm re-running trial builds in https://github.com/nextstrain/ncov/actions/runs/11059102557. This should update the URLs referenced in the original description like https://nextstrain.org/staging/ncov/gisaid/trial/gray-out-early-clades/global/6m.

The resulting dataset coloring should be the same as the original PR, but just with a cleaner interface.

@victorlin: I've tagged you for review here because I know that @huddlej is very busy with VCM and Options. If you could look things over, I'd very much appreciate it.

Copy link
Member

@victorlin victorlin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Didn't get to review much but sending in a comment before I sign off for the day.

Will plan to continue review on Monday.

scripts/assign-colors.py Outdated Show resolved Hide resolved
scripts/assign-colors.py Outdated Show resolved Hide resolved
nextstrain_profiles/nextstrain-gisaid/builds.yaml Outdated Show resolved Hide resolved
@victorlin
Copy link
Member

I pushed a couple more fixes: b685fbb, 3273dcc

I also contemplated renaming clade_recency to clade_min_date to align with other min_date options that take the same duration format, but clade_min_date: "all" doesn't sound right. The special value "all" could be replaced by a blank/null value (i.e. clade_min_date:) but that goes against @huddlej's point in #1132 (review).

Copy link
Member

@victorlin victorlin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. I think all the commits can get squashed into one, or at least all my commits squashed into one along the lines of "Use ISO duration format for clade recency, error on invalid values".

Comment on lines +189 to +198
def _get_clade_recency_for_wildcards(wildcards):
# check if builds.yaml contains colors:{build_name}:clade_recency
if wildcards.build_name in config["colors"] and 'clade_recency' in config["colors"][wildcards.build_name]:
return config["colors"][wildcards.build_name]["clade_recency"]
# check if builds.yaml or parameters.yaml contains colors:default:clade_recency
elif "colors" in config and "clade_recency" in config["colors"]["default"]:
return config["colors"]["default"]["clade_recency"]
# else return sensible default value
else:
return "all"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

non-blocking: I've already picked on this elsewhere and don't want implementation details to get in the way of having improved coloring

In #1131 we settled on "strategy 2" which would translate to:

def _get_clade_recency_for_wildcards(wildcards):
    # check if builds.yaml contains colors:{build_name}:clade_recency
    if wildcards.build_name in config["colors"] and 'clade_recency' in config["colors"][wildcards.build_name]:
        return config["colors"][wildcards.build_name]["clade_recency"]
    # use workflow default
    else:
        return config["colors"]["default"]["clade_recency"]

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the catch here. I agree that cleaning up default values to source directly from parameters.yaml is the right way to do things. But that said, I think I'd like to tackle this cleanup in a separate PR after this PR and #1130 are merged.

@trvrb
Copy link
Member Author

trvrb commented Oct 3, 2024

Thanks for the review @victorlin! Could you squash your commits however you'd like and I'll then update change_log.md and merge.

trvrb and others added 9 commits October 3, 2024 11:19
When assigning colors gray out clades that are not in recent samples. Use clade_recency parameter to consider how many months back to consider "recent".
Set up a build-specific override for colors: clade_recency parameter. Make the default to be empty string "" and have this keep all clades colored. Override for the Nextstrain GISAID profile to specify clade_recency of
- `1` month for "1m" build
- `2` months for "2m" build
- `6` months for "6m" build
- `""` (all months) for "all-time" build
Per rationale in issue 1131 (#1131), switch to following a three step look up for setting build-specific parameters of colors:clade_recency.
Rather than previous behavior of setting --clade-recency 6 by building --clade-recency {params.clade_recency}, instead build the entire argument via {params.clade_recency_argument}.

This allows setting colors.global_all-time.clade_recency to "all" in builds.yaml and in this case handing assign-colors.py nothing for clade_recency. This keeps the behavior of assign-colors.py as expected and backwards compatible, while simultaneously allowing for the "all" sensible default in parameters.yaml rather than the previous empty string to accomplish this same behavior.
The colors.clade_recency parameter can be specified in a build-specific fashion. If specifying a default for use, use the pattern of specifying colors.default.clade_recency rather than the previous colors.clade_recency.
Show a meaningful error message if the value is improperly formatted.
Keeps consistent with other existing usage of this format.
@victorlin
Copy link
Member

@trvrb done. Also rebased onto latest master to get the latest change log.

@trvrb trvrb merged commit 7db9229 into master Oct 3, 2024
7 checks passed
@trvrb trvrb deleted the gray-out-early-clades branch October 3, 2024 21:11
@trvrb
Copy link
Member Author

trvrb commented Oct 3, 2024

Thanks again @victorlin!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants