feat(prepro, ingest, deposition): Enforce author formatting in prepro, map authors accordingly in ingest and deposition #2986

anna-parker · 2024-10-10T16:06:08Z

resolves #2985, #1428

preview URL: https://format-authors.loculus.org/

Changes

Breaking: Submission to Loculus

Preprocessing changes

From now on, we expect authors to be submitted in the new format: Doe, John A.; Sanchez Roe, Jane Maria. Last name(s) are mandatory, and a comma is mandatory to separate the last name(s) from the first names/initials.
Only ASCII alphabetical characters A-Z are allowed in names.
If the format is not followed, the sequence is annotated with an error and users receive an error message showing the expected format.
Users are warned if the authors list might be in an incorrect format.
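
To make the rules above concrete, here is a minimal sketch of such a check. It is not the code in processing_functions.py: the function name `validate_authors`, the regular expression, and the extra characters allowed inside a name (spaces, periods, hyphens, apostrophes) are assumptions made for illustration.

```python
import re

# Assumption: besides ASCII letters, a name may contain spaces, periods,
# hyphens and apostrophes (needed for e.g. "John A." or "Sanchez Roe").
NAME = r"[A-Za-z][A-Za-z .'\-]*"
# One author: mandatory last name(s), mandatory comma, optional first names.
AUTHOR_PATTERN = re.compile(rf"^\s*({NAME}),\s*({NAME})?\s*$")


def validate_authors(authors: str) -> list[str]:
    """Return a list of error messages; an empty list means the string is valid."""
    errors = []
    if not authors.isascii():
        errors.append("authors contain non-ASCII characters")
    # Authors are separated by ';'. A trailing ';' is tolerated.
    entries = [entry for entry in authors.split(";") if entry.strip()]
    if not entries:
        errors.append("authors list is empty")
    for entry in entries:
        if not AUTHOR_PATTERN.match(entry):
            errors.append(f"'{entry.strip()}' is not in the form 'Last, First'")
    return errors


print(validate_authors("Doe, John A.; Sanchez Roe, Jane Maria"))  # no errors
print(validate_authors("John Doe"))  # missing comma is reported
```

Returning a list of messages rather than raising keeps the sketch close to the described behaviour, where a sequence is annotated with errors instead of being rejected outright.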

Breaking: Loculus authors in released data/LAPIS

The same format as used for submission will also be used for programmatic data output, instead of the current format: FirstNames LastNames, FirstNames LastNames

Breaking: Ingest bumps version of all ingested sequences and changes how metadata is processed

Ingest changes

If the authors list is entirely capitalized, apply title case (leave common last-name prefixes such as van lowercase).
Add an ingest test by modifying the snakemake example files to also cover edge cases.
The tsv generated by the ncbi datasets cli from the raw jsonl does not format authors correctly, which leads to cases where the string cannot be parsed. To avoid this, use the raw jsonl directly and map the json fields to a tsv with a python script. This way we can format authors correctly and have access to the raw data.
The following unused tsv metadata fields are dropped:

ncbiHostBreed
ncbiHostCultivar
ncbiHostEcotype
ncbiHostIsolate
ncbiHostPangolin
ncbiIsVaccineStrain
ncbiMaturePeptideCount
ncbiMolType
ncbiVirusCommonName
ncbiVirusBreed
ncbiVirusCultivar
ncbiVirusEcotype
ncbiVirusIsolate
ncbi_virus (listed as "Virus Infraspecific Names Sex" in column_mapping)
ncbiVirusStrain
ncbiVirusPangolin

The following currently always empty tsv fields are kept but left empty:

ncbiHostCommonName
ncbiPurposeOfSampling

To be accepted by the new prepro pipeline, ingest needs to resubmit all sequences.
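
The title-casing rule from the ingest changes can be sketched as follows. This is an illustration only, not the PR's implementation: the function name `titlecase_if_all_caps` and the set of lowercase prefixes are assumptions.

```python
# Assumption: a small, illustrative set of last-name prefixes kept lowercase.
LOWERCASE_PREFIXES = {"van", "von", "de", "der", "den", "di", "da", "la", "le"}


def titlecase_if_all_caps(authors: str) -> str:
    """Title-case the authors string, but only if it is entirely upper case."""
    if not authors.isupper():
        # Mixed-case input is assumed to be capitalized deliberately.
        return authors

    def fix_word(word: str) -> str:
        lower = word.lower()
        if lower in LOWERCASE_PREFIXES:
            return lower
        # Capitalize each hyphen-separated part, e.g. "UL-RAHMAN" -> "Ul-Rahman".
        return "-".join(part.capitalize() for part in lower.split("-"))

    return " ".join(fix_word(word) for word in authors.split(" "))


print(titlecase_if_all_caps("VAN DER BERG, JOHN A.; DOE, JANE"))
print(titlecase_if_all_caps("Doe, John"))  # returned unchanged
```

Checking the whole string rather than individual names avoids rewriting names whose capitalization is intentional.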

Problematic: Preprocessing version bump no longer possible

Old submissions won't be accepted by the new prepro pipeline. This is a problem for the current preprocessing versioning code, which expects all released old sequences to be compatible with a new pipeline. There are a few options, but none is obviously the best:

A) Change author formatting inside the db sequence entries
B) Make the prepro pipeline submission-time aware and more lenient on old submissions before a time cutoff. We would still need to ensure that old entries have authors output in the new format, which is not trivial.

Summary

Preprocessing will only accept submissions where authors is formatted as surname, first name; surname, first name. It will error if this is not the case.

Ingest must now change how it formats the author string it receives from NIH to fit this format.
Additionally, the ena-submission pod must map this new Loculus format to the ENA expected format.

Testing

Check out pathoplexus/pathoplexus@ce68b92 to perform regression testing by comparing this branch with Loculus main (ignore authors, as they should differ):

micromamba activate pp-integrity
cd tests/regression-testing 
snakemake results/ebola-zaire.meta.main.format-authors.diff 
snakemake results/cchf.meta.main.format-authors.diff 
snakemake results/west-nile.meta.main.format-authors.diff

There is no diff in the ebola-zaire, cchf and west nile metadata.

PR Checklist

Include consistent white space
Check if this causes a version bump in ingest (it should)
Adjust documentation to include new author input format: Add new required authors format to docs pathoplexus/pathoplexus#239
Decide if we want to change the way we display authors on website
Check that old sequences pass new prepro pipeline
Merge in feat(ingest): Do not use processed tsv but raw jsonl when ingesting data from NCBI Virus #2990 so that ingested authors are in correct format

ena-submission/scripts/create_assembly.py

ena-submission/scripts/ena_submission_helper.py

ingest/scripts/prepare_metadata.py

preprocessing/nextclade/src/loculus_preprocessing/processing_functions.py

corneliusroemer · 2024-10-11T11:22:54Z

ingest/scripts/prepare_metadata.py

This will bump versions in ingest - we could work around this with a more complicated prepro that takes into account whether something comes from ingest, but that might not be worth the hassle just to avoid +1 version.

ena-submission/scripts/ena_submission_helper.py

ingest/scripts/prepare_metadata.py

preprocessing/nextclade/src/loculus_preprocessing/processing_functions.py

anna-parker · 2024-10-11T13:21:56Z

NIH data from NCBI Virus is not consistent in its formatting of authors:

13:18:17    DEBUG ( prepare_metadata.py:  83) - Config(compound_country_field='ncbiGeoLocation', fasta_id_field='genbankAccession', rename={'bioprojects': 'bioprojectAccession', 'country': 'geoLocCountry', 'division': 'geoLocAdmin1', 'genbankAccession': 'insdcAccessionFull', 'ncbiCollectionDate': 'sampleCollectionDate', 'ncbiHostCommonName': 'hostNameCommon', 'ncbiHostName': 'hostNameScientific', 'ncbiHostSex': 'hostGender', 'ncbiHostTaxId': 'hostTaxonId', 'ncbiIsLabHost': 'isLabHost', 'ncbiIsolateName': 'specimenCollectorSampleId', 'ncbiPurposeOfSampling': 'purposeOfSampling', 'ncbiSraAccessions': 'sraRunAccession', 'ncbiSubmitterAffiliation': 'authorAffiliations', 'ncbiSubmitterNames': 'authors'}, keep=['division', 'country', 'submissionId', 'insdcAccessionBase', 'insdcVersion', 'bioprojects', 'biosampleAccession', 'ncbiHostName', 'ncbiHostTaxId', 'ncbiIsLabHost', 'ncbiReleaseDate', 'ncbiUpdateDate', 'ncbiSourceDb', 'ncbiVirusName', 'ncbiVirusTaxId', 'sequence_md5', 'genbankAccession', 'jointAccession'], segmented=True)
13:18:17     INFO ( prepare_metadata.py:  85) - Reading metadata from results/filtered_metadata.tsv
13:18:17    ERROR ( prepare_metadata.py:  47) - Author list of MW915541: Shahid MF,Yaqub T,Ali M,Ul-Rahman A,Bente DA,Shahid,M.F.,Yaqub,T.,Ali,M.,Ul-Rahman,A.,Bente,D.A. has uneven number of first and last names, unable to format author names, returning empty author list
13:18:17    ERROR ( prepare_metadata.py:  47) - Author list of MW915542: Shahid MF,Yaqub T,Ali M,Ul-Rahman A,Bente DA,Shahid,M.F.,Yaqub,T.,Ali,M.,Ul-Rahman,A.,Bente,D.A. has uneven number of first and last names, unable to format author names, returning empty author list
13:18:17    ERROR ( prepare_metadata.py:  47) - Author list of MW915543: Shahid MF,Yaqub T,Ali M,Ul-Rahman A,Bente DA,Shahid,M.F.,Yaqub,T.,Ali,M.,Ul-Rahman,A.,Bente,D.A. has uneven number of first and last names, unable to format author names, returning empty author list
13:18:17    ERROR ( prepare_metadata.py:  47) - Author list of MW915544: Shahid MF,Yaqub T,Ali M,Ul-Rahman A,Bente DA,Shahid,M.F.,Yaqub,T.,Ali,M.,Ul-Rahman,A.,Bente,D.A. has uneven number of first and last names, unable to format author names, returning empty author list
13:18:17    ERROR ( prepare_metadata.py:  47) - Author list of MW915545: Shahid MF,Yaqub T,Ali M,Ul-Rahman A,Bente DA,Shahid,M.F.,Yaqub,T.,Ali,M.,Ul-Rahman,A.,Bente,D.A. has uneven number of first and last names, unable to format author names, returning empty author list
13:18:17    ERROR ( prepare_metadata.py:  47) - Author list of MW915546: Shahid MF,Yaqub T,Ali M,Ul-Rahman A,Bente DA,Shahid,M.F.,Yaqub,T.,Ali,M.,Ul-Rahman,A.,Bente,D.A. has uneven number of first and last names, unable to format author names, returning empty author list
13:18:17    ERROR ( prepare_metadata.py:  47) - Author list of MW915547: Shahid MF,Yaqub T,Ali M,Ul-Rahman A,Bente DA,Shahid,M.F.,Yaqub,T.,Ali,M.,Ul-Rahman,A.,Bente,D.A. has uneven number of first and last names, unable to format author names, returning empty author list
13:18:17    ERROR ( prepare_metadata.py:  47) - Author list of MW915548: Shahid MF,Yaqub T,Ali M,Ul-Rahman A,Bente DA,Shahid,M.F.,Yaqub,T.,Ali,M.,Ul-Rahman,A.,Bente,D.A. has uneven number of first and last names, unable to format author names, returning empty author list
13:18:17    ERROR ( prepare_metadata.py:  47) - Author list of MW915549: Shahid MF,Yaqub T,Ali M,Ul-Rahman A,Bente DA,Shahid,M.F.,Yaqub,T.,Ali,M.,Ul-Rahman,A.,Bente,D.A. has uneven number of first and last names, unable to format author names, returning empty author list
13:18:17    ERROR ( prepare_metadata.py:  47) - Author list of MW915550: Shahid MF,Yaqub T,Ali M,Ul-Rahman A,Bente DA,Shahid,M.F.,Yaqub,T.,Ali,M.,Ul-Rahman,A.,Bente,D.A. has uneven number of first and last names, unable to format author names, returning empty author list
13:18:17    ERROR ( prepare_metadata.py:  47) - Author list of MW915551: Shahid MF,Yaqub T,Ali M,Ul-Rahman A,Bente DA,Shahid,M.F.,Yaqub,T.,Ali,M.,Ul-Rahman,A.,Bente,D.A. has uneven number of first and last names, unable to format author names, returning empty author list
13:18:17    ERROR ( prepare_metadata.py:  47) - Author list of MW915552: Shahid MF,Yaqub T,Ali M,Ul-Rahman A,Bente DA,Shahid,M.F.,Yaqub,T.,Ali,M.,Ul-Rahman,A.,Bente,D.A. has uneven number of first and last names, unable to format author names, returning empty author list
13:18:17    ERROR ( prepare_metadata.py:  47) - Author list of MW915553: Shahid MF,Yaqub T,Ali M,Ul-Rahman A,Bente DA,Shahid,M.F.,Yaqub,T.,Ali,M.,Ul-Rahman,A.,Bente,D.A. has uneven number of first and last names, unable to format author names, returning empty author list
13:18:17    ERROR ( prepare_metadata.py:  47) - Author list of MW915554: Shahid MF,Yaqub T,Ali M,Ul-Rahman A,Bente DA,Shahid,M.F.,Yaqub,T.,Ali,M.,Ul-Rahman,A.,Bente,D.A. has uneven number of first and last names, unable to format author names, returning empty author list
13:18:17    ERROR ( prepare_metadata.py:  47) - Author list of MW915555: Shahid MF,Yaqub T,Ali M,Ul-Rahman A,Bente DA,Shahid,M.F.,Yaqub,T.,Ali,M.,Ul-Rahman,A.,Bente,D.A. has uneven number of first and last names, unable to format author names, returning empty author list
13:18:17    ERROR ( prepare_metadata.py:  47) - Author list of MW915556: Shahid MF,Yaqub T,Ali M,Ul-Rahman A,Bente DA,Shahid,M.F.,Yaqub,T.,Ali,M.,Ul-Rahman,A.,Bente,D.A. has uneven number of first and last names, unable to format author names, returning empty author list
13:18:17    ERROR ( prepare_metadata.py:  47) - Author list of MW915557: Shahid MF,Yaqub T,Ali M,Ul-Rahman A,Bente DA,Shahid,M.F.,Yaqub,T.,Ali,M.,Ul-Rahman,A.,Bente,D.A. has uneven number of first and last names, unable to format author names, returning empty author list
13:18:17    ERROR ( prepare_metadata.py:  47) - Author list of MW915558: Shahid MF,Yaqub T,Ali M,Ul-Rahman A,Bente DA,Shahid,M.F.,Yaqub,T.,Ali,M.,Ul-Rahman,A.,Bente,D.A. has uneven number of first and last names, unable to format author names, returning empty author list
13:18:17    ERROR ( prepare_metadata.py:  47) - Author list of MW915559: Shahid MF,Yaqub T,Ali M,Ul-Rahman A,Bente DA,Shahid,M.F.,Yaqub,T.,Ali,M.,Ul-Rahman,A.,Bente,D.A. has uneven number of first and last names, unable to format author names, returning empty author list
13:18:17    ERROR ( prepare_metadata.py:  47) - Author list of MW915560: Shahid MF,Yaqub T,Ali M,Ul-Rahman A,Bente DA,Shahid,M.F.,Yaqub,T.,Ali,M.,Ul-Rahman,A.,Bente,D.A. has uneven number of first and last names, unable to format author names, returning empty author list
13:18:17    ERROR ( prepare_metadata.py:  47) - Author list of MW915561: Shahid MF,Yaqub T,Ali M,Ul-Rahman A,Bente DA,Shahid,M.F.,Yaqub,T.,Ali,M.,Ul-Rahman,A.,Bente,D.A. has uneven number of first and last names, unable to format author names, returning empty author list
13:18:18     INFO ( prepare_metadata.py: 162) - Saved metadata for 3317 sequences
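
The author list in the ERROR lines above mixes two styles (`Shahid MF` and `Shahid,M.F.`) in one comma-joined string, so the boundaries between names cannot be recovered by splitting on commas. The following sketch assumes the names are instead available as a list, one element per author, and formats each element separately. The helper `to_loculus_name` is hypothetical and is not the function used in prepare_metadata.py.

```python
def to_loculus_name(name: str) -> str:
    """Convert one name to 'Last, First' (hypothetical helper)."""
    if "," in name:
        # Already 'Last,Initials', e.g. 'Shahid,M.F.'
        last, _, first = name.partition(",")
    else:
        # 'Last Initials', e.g. 'Shahid MF': the final token is the initials.
        last, _, first = name.rpartition(" ")
        if not last:
            # Single token: treat it as the last name.
            last, first = first, ""
    return f"{last.strip()}, {first.strip()}".strip()


# The ten names from the log above, assumed to arrive as separate elements.
names = ["Shahid MF", "Yaqub T", "Ali M", "Ul-Rahman A", "Bente DA",
         "Shahid,M.F.", "Yaqub,T.", "Ali,M.", "Ul-Rahman,A.", "Bente,D.A."]

# Joining with commas and splitting again yields an odd number of tokens,
# so first and last names can no longer be paired up.
flattened_tokens = ",".join(names).split(",")
print(len(flattened_tokens))

# Formatting each element on its own keeps every name intact.
authors = "; ".join(to_loculus_name(name) for name in names)
print(authors)
```

The sketch deliberately does not deduplicate authors that appear in both styles; whether to do so is a separate decision.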

anna-parker · 2024-10-11T13:52:06Z

MW915549

In the downloaded ncbi_dataset/data/data_report.jsonl the names are actually:
"names":["Shahid MF","Yaqub T","Ali M","Ul-Rahman A","Bente DA","Shahid,M.F.","Yaqub,T.","Ali,M.","Ul-Rahman,A.","Bente,D.A."]}

ingest/scripts/prepare_metadata.py

ingest/Snakefile

More clean up Move checks from snakefile to config fix config update deployment update tests ci Add trigger from db option Fix cronjob Fix link to config-file fix deployment install package in dockerfile install at correct location Remove snakemake as no longer needed Add missing dependency try to debug Create an XmlNone dataclass - this is required since package update test threads stop revert exception test test upload to ena dev still works on preview Make sure test is set correctly!!! remove debug print statements Improve logs Fix merge errors Update ena-submission/README.md Co-authored-by: Cornelius Roemer <[email protected]> Apply suggestions from code review Co-authored-by: Cornelius Roemer <[email protected]> Cronjob: create results directory before writing to it format authors in prepro Fix ingest try to fix pattern simplify regex fix check Add tests # Conflicts: # preprocessing/nextclade/tests/test.py Add to ena submission fix fix other edge case Update ena-submission/scripts/ena_submission_helper.py Co-authored-by: Cornelius Roemer <[email protected]> Update ena-submission/scripts/ena_submission_helper.py Update ena-submission/scripts/ena_submission_helper.py Co-authored-by: Cornelius Roemer <[email protected]> Update ena-submission/scripts/ena_submission_helper.py Update ingest/scripts/prepare_metadata.py Update preprocessing/nextclade/src/loculus_preprocessing/processing_functions.py rename Update reformat_authors_from_genbank_to_loculus Additionally format authors with correct white space Improve error message add tests fix missing pattern improve error logs fix error Update preprocessing/nextclade/src/loculus_preprocessing/processing_functions.py improve logging more

feat(ingest): Do not use processed tsv but raw jsonl when ingesting data from NCBI Virus (#2990)

* Use raw jsonl instead of generated tsv when ingesting data from NCBI virus
* Do not require authors list to end in ';', capitalize names correctly.
* Add tests for capitalization
* Add a warning if author list might be in wrong format
* Add ascii specific warning
* Add tests for warnings and errors
* Only capitalize if full authors string is upper case
* Properly capitalize initial
* Move titlecase option to ingest only - add ingest tests

Move author formatting functions to format_ncbi_metadata as this is a more logical location

Remove duplicate group name

# Conflicts:
# ena-submission/scripts/get_ena_submission_list.py
# ena-submission/src/ena_deposition/config.py

anna-parker · 2024-10-18T15:09:38Z

SQL code for retroactively updating the author format of previous versions:

WITH latest_authors AS (
    -- Find the latest version of each accession and extract its authors
    SELECT
        accession,
        metadata->'authors' AS latest_authors
    FROM entries
    WHERE (accession, version) IN (
        SELECT accession, MAX(version)
        FROM entries
        GROUP BY accession
    )
)
UPDATE entries
SET metadata = jsonb_set(metadata, '{authors}', latest_authors)
FROM latest_authors
WHERE entries.accession = latest_authors.accession
AND entries.version < (SELECT MAX(version) FROM entries e2 WHERE e2.accession = entries.accession);

Due to conda-forge/unzip-feedstock#16

anna-parker added the preview Triggers a deployment to argocd label Oct 10, 2024