-
Notifications
You must be signed in to change notification settings - Fork 2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(prepro, ingest, deposition): Enforce author formatting in prepro, map authors accordingly in ingest and deposition #2986
base: main
Are you sure you want to change the base?
Conversation
preprocessing/nextclade/src/loculus_preprocessing/processing_functions.py
Outdated
Show resolved
Hide resolved
preprocessing/nextclade/src/loculus_preprocessing/processing_functions.py
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will bump versions in ingest - we could work around this with a more complicated prepro that takes into account whether something comes from ingest, but that might not be worth the hassle just to avoid +1 version.
preprocessing/nextclade/src/loculus_preprocessing/processing_functions.py
Outdated
Show resolved
Hide resolved
preprocessing/nextclade/src/loculus_preprocessing/processing_functions.py
Outdated
Show resolved
Hide resolved
NIH data from NCBI virus is not consistent in their formatting of authors:
|
In the downloaded |
9c89328
to
533f051
Compare
More clean up Move checks from snakefile to config fix config update deployment update tests ci Add trigger from db option Fix cronjob Fix link to config-file fix deployment install package in dockerfile install at correct location Remove snakemake as no longer needed Add missing dependency try to debug Create an XmlNone dataclass - this is required since package update test threads stop revert exception test test upload to ena dev still works on preview Make sure test is set correctly!!! remove debug print statements Improve logs Fix merge errors Update ena-submission/README.md Co-authored-by: Cornelius Roemer <[email protected]> Apply suggestions from code review Co-authored-by: Cornelius Roemer <[email protected]> Cronjob: create results directory before writing to it format authors in prepro Fix ingest try to fix pattern simplify regex fix check Add tests # Conflicts: # preprocessing/nextclade/tests/test.py Add to ena submission fix fix other edge case Update ena-submission/scripts/ena_submission_helper.py Co-authored-by: Cornelius Roemer <[email protected]> Update ena-submission/scripts/ena_submission_helper.py Update ena-submission/scripts/ena_submission_helper.py Co-authored-by: Cornelius Roemer <[email protected]> Update ena-submission/scripts/ena_submission_helper.py Update ingest/scripts/prepare_metadata.py Update preprocessing/nextclade/src/loculus_preprocessing/processing_functions.py rename Update reformat_authors_from_genbank_to_loculus Additionally format authors with correct white space Improve error message add tests fix missing pattern improve error logs fix error Update preprocessing/nextclade/src/loculus_preprocessing/processing_functions.py improve logging more feat(ingest): Do not use processed tsv but raw jsonl when ingesting data from NCBI Virus (#2990) * Use raw jsonl instead of generated tsv when ingesting data from NCBI virus * Do not require authors list to end in ';', capitalize names correctly. * Add tests for capitalization * Add a warning if author list might be in wrong format * Add ascii specific warning * Add tests for warnings and errors * Only capitalize if full authors string is upper case * Properly capitalize initial * Move titlecase option to ingest only - add ingest tests Move author formatting functions to format_ncbi_metadata as this is a more logical location Remove duplicate group name # Conflicts: # ena-submission/scripts/get_ena_submission_list.py # ena-submission/src/ena_deposition/config.py
533f051
to
f480d67
Compare
SQL code for retroactively updating the author format of previous versions:
|
resolves #2985, #1428
preview URL: https://format-authors.loculus.org/
Changes
Breaking: Submission to Loculus
Preprocessing changes
Doe, John A.; Sanchez Roe, Jane Maria
, where last name(s) is mandatory, a comma is mandatory to separate first names/initials from last name.error
, users will receive an error message with expected formatBreaking: Loculus authors in released data/LAPIS
FirstNames LastNames, FirstNames LastNames
Breaking: Ingest bumps version of all ingested sequences and changes how metadata is processed
Ingest changes
To be accepted, ingest needs to resubmit all sequences
Problematic: Preprocessing version bump no longer possible
Old submissions won't be accepted by the new prepro pipeline, this is a problem for the current preprocessing versioning code which expects all released old sequences to be compatible with a new pipeline. There are a few options but none are obviously the best:
Summary
Preprocessing will only accept submissions where authors is formatted as
surname, first name; surname, first name;
. It will error if this is not the case.Testing
Checkout pathoplexus/pathoplexus@ce68b92 to perform regression testing (ignore authors as should be different) by comparing this branch with Loculus main:
micromamba activate pp-integrity cd tests/regression-testing snakemake results/ebola-zaire.meta.main.format-authors.diff snakemake results/cchf.meta.main.format-authors.diff snakemake results/west-nile.meta.main.format-authors.diff
There is no diff in the ebola-zaire, cchf and west nile metadata.
PR Checklist