Exclude sequences with unusual collection dates #175

Open
joverlee521 opened this issue Aug 6, 2024 · 2 comments
Labels
enhancement New feature or request

Comments

joverlee521 commented Aug 6, 2024

Context

On Slack, @huddlej flagged sequences with unusual collection dates, where date == date_submitted. We should exclude these sequences from the builds because this is a clear metadata issue.

Possible solutions

  1. Add (date != date_submitted) to all of the filter queries across all configs
  2. Add a new filter rule in the main workflow to exclude these sequences for all builds
  3. Add a new filter rule in the upload workflow to exclude these sequences from our S3 files
  4. Add specific sequences to outliers.txt (e.g. 8209b35)
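
To make the condition behind the options above concrete, here is a minimal sketch of the `date != date_submitted` check applied to tab-separated metadata. It uses only the Python standard library and is not the workflow's actual filter rule; the function name `split_by_date_check` and the example strain names and dates are made up for illustration. Only the `date` and `date_submitted` column names come from the issue.

```python
import csv
import io


def split_by_date_check(rows):
    """Partition metadata rows into (kept, flagged).

    A row is flagged when its collection date equals its submission date,
    i.e. when it fails the `date != date_submitted` condition.
    """
    kept, flagged = [], []
    for row in rows:
        date = (row.get("date") or "").strip()
        submitted = (row.get("date_submitted") or "").strip()
        # Only flag when the date is present and identical to date_submitted.
        if date and date == submitted:
            flagged.append(row)
        else:
            kept.append(row)
    return kept, flagged


# Hypothetical example metadata; strain names and dates are made up.
EXAMPLE_TSV = (
    "strain\tdate\tdate_submitted\n"
    "example/1/2024\t2024-05-01\t2024-06-10\n"
    "example/2/2024\t2024-07-15\t2024-07-15\n"
)

rows = list(csv.DictReader(io.StringIO(EXAMPLE_TSV), delimiter="\t"))
kept, flagged = split_by_date_check(rows)
print("kept:", [r["strain"] for r in kept])
print("flagged:", [r["strain"] for r in flagged])
```

Returning the flagged rows instead of silently dropping them keeps the same check usable for any of the four options, including generating candidate entries for outliers.txt.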

joverlee521 added the enhancement label (New feature or request) Aug 6, 2024

huddlej commented Aug 6, 2024

I'm a little worried about excluding these types of records algorithmically without any notification to us. Ideally, we want to catch these data issues, alert the data provider so they can fix the records, and update our records to use the correct metadata. Another approach might be to make a QC report that runs weekly (on new data only?) with checks for this kind of issue plus Nextclade QC statuses, failed alignments, etc.

If data providers can't or won't update their records, the outlier file approach seems reasonable.
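
As a rough illustration of the QC report idea, the sketch below checks only records that were not seen in a previous run and collects issues rather than excluding anything. Everything here is an assumption made for the example: the function `qc_report`, the `seen_ids` set used to approximate "new data only", and the `qc_overall_status` column name for the Nextclade QC status. It is not an existing workflow rule.

```python
from collections import Counter


def qc_report(records, seen_ids):
    """Return (issues, summary) for records whose strain is not in seen_ids.

    `issues` is a list of (strain, issue_kind) tuples and `summary` counts
    how many times each issue kind occurred.
    """
    issues = []
    for rec in records:
        if rec["strain"] in seen_ids:
            continue  # "new data only": skip records checked in an earlier run
        if rec.get("date") and rec.get("date") == rec.get("date_submitted"):
            issues.append((rec["strain"], "date_equals_date_submitted"))
        # `qc_overall_status` is a hypothetical column name for the Nextclade QC status.
        if rec.get("qc_overall_status") == "bad":
            issues.append((rec["strain"], "nextclade_qc_bad"))
    summary = Counter(kind for _, kind in issues)
    return issues, summary


# Made-up records for illustration only.
records = [
    {"strain": "s1", "date": "2024-07-15", "date_submitted": "2024-07-15", "qc_overall_status": "good"},
    {"strain": "s2", "date": "2024-05-01", "date_submitted": "2024-06-10", "qc_overall_status": "bad"},
    {"strain": "s3", "date": "2024-01-01", "date_submitted": "2024-01-01", "qc_overall_status": "bad"},
]
issues, summary = qc_report(records, seen_ids={"s3"})
for strain, kind in issues:
    print(f"{strain}\t{kind}")
```

Because the report only lists issues, a person can decide whether to contact the data provider or fall back to the outlier file.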

joverlee521 commented

> Another approach might be to make a QC report that runs weekly (on new data only?) with checks for this kind of issue plus Nextclade QC statuses, failed alignments, etc.

Ah that would be nice! Seems like something we can add to the upload + Nextclade workflows.

> If data providers can't or won't update their records, the outlier file approach seems reasonable.

With the outlier file approach, I feel like we never go back to check if the "outliers" have been fixed. I guess if we implement the QC report, it can flag sequences that have been fixed and can be removed from the outlier files.
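
One way a report could flag fixed "outliers" is sketched below. It assumes the outlier file lists one strain name per line with optional `#` comments, and that each listed strain was added because of the date == date_submitted problem; both are assumptions for the example, as is the function name `stale_outliers`.

```python
def stale_outliers(outlier_lines, metadata_by_strain):
    """Return outlier strains whose current metadata no longer has
    date == date_submitted, i.e. candidates for removal from the file."""
    stale = []
    for line in outlier_lines:
        # Drop anything after '#' (assumed comment syntax) and surrounding whitespace.
        strain = line.split("#", 1)[0].strip()
        if not strain:
            continue  # blank or comment-only line
        rec = metadata_by_strain.get(strain)
        if rec is None:
            continue  # strain is not in the current metadata; nothing to compare
        if rec["date"] != rec["date_submitted"]:
            stale.append(strain)
    return stale


# Made-up outlier file contents and metadata for illustration only.
outlier_lines = ["# date issues", "s1", "s2 # reported to provider", "s9"]
metadata_by_strain = {
    "s1": {"date": "2024-07-15", "date_submitted": "2024-07-15"},  # still wrong
    "s2": {"date": "2024-05-01", "date_submitted": "2024-07-15"},  # fixed upstream
}
fixed = stale_outliers(outlier_lines, metadata_by_strain)
print("can be removed from outliers:", fixed)
```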
