feat(prepro, ingest, deposition): Enforce author formatting in prepro, map authors accordingly in ingest and deposition #2986

Open
wants to merge 3 commits into
base: main


@anna-parker anna-parker commented Oct 10, 2024

resolves #2985, #1428

preview URL: https://format-authors.loculus.org/

Changes

Breaking: Submission to Loculus

Preprocessing changes
  • From now on, authors must be submitted in the new format: Doe, John A.; Sanchez Roe, Jane Maria. Last name(s) are mandatory, and a comma is mandatory to separate the last name(s) from the first names/initials.
  • Only ASCII alphabetical characters A-Z are allowed in names.
  • Sequences whose authors do not follow this format are annotated with an error, and users receive an error message showing the expected format.
  • Users are warned if the authors list might be in an incorrect format.
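
A minimal sketch of how the rules above could be checked, not the code from this PR. The pattern `AUTHOR_RE`, the helper `check_authors`, and the decision to allow periods and hyphens alongside ASCII letters are all assumptions made for illustration.

```python
import re

# Hypothetical pattern (not the regex used in this PR): a name token is ASCII
# letters, optionally with periods or hyphens (e.g. "A." or "Ul-Rahman").
NAME_TOKEN = r"[A-Za-z][A-Za-z.\-]*"
# One author: "Last Name(s), First Name(s)/Initials"; the first-name part may be empty.
AUTHOR_RE = re.compile(rf"^{NAME_TOKEN}(?: {NAME_TOKEN})*,(?: {NAME_TOKEN})*$")


def check_authors(authors: str) -> list[str]:
    """Return a list of error messages; an empty list means the string is valid."""
    errors = []
    if not authors.isascii():
        errors.append("authors contains non-ASCII characters")
    # Authors are separated by ';'. Tolerate a trailing ';' and stray whitespace.
    parts = [part.strip() for part in authors.split(";")]
    if parts and parts[-1] == "":
        parts.pop()
    for part in parts:
        if not AUTHOR_RE.match(part):
            errors.append(f"'{part}' is not in the format 'Last, First'")
    return errors


print(check_authors("Doe, John A.; Sanchez Roe, Jane Maria"))  # no errors
print(check_authors("John Doe"))  # one error: the comma is missing
```

Under this pattern a two-author string in the old format, such as `John Doe, Jane Roe`, still passes, because it parses as last name "John Doe" and first names "Jane Roe". That is why a warning for possibly misformatted lists is useful in addition to the hard error.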

Breaking: Loculus authors in released data/LAPIS

  • The same format as used for submission will also be used for programmatic data output, instead of the current format: FirstNames LastNames, FirstNames LastNames

Breaking: Ingest bumps version of all ingested sequences and changes how metadata is processed

Ingest changes
  • If the authors list is entirely upper case, apply title case (common last-name prefixes such as van are left in lower case).
  • Add ingest tests by modifying the snakemake example files to also cover edge cases.
  • The tsv generated by the ncbi datasets cli from the raw jsonl does not format authors correctly, which can leave the author string impossible to parse. To avoid this, ingest uses the raw jsonl directly and a python script maps the json fields to a tsv. This way we can format authors correctly and have access to the raw data.
  • The following unused tsv metadata fields are dropped:
ncbiHostBreed
ncbiHostCultivar
ncbiHostEcotype
ncbiHostIsolate
ncbiHostPangolin
ncbiIsVaccineStrain
ncbiMaturePeptideCount
ncbiMolType
ncbiVirusCommonName
ncbiVirusBreed
ncbiVirusCultivar
ncbiVirusEcotype
ncbiVirusIsolate
ncbi_virus (listed as "Virus Infraspecific Names Sex" in column_mapping)
ncbiVirusStrain
ncbiVirusPangolin
  • The following currently always empty tsv fields are kept but left empty:
ncbiHostCommonName
ncbiPurposeOfSampling
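
The title-casing rule from the ingest changes could look roughly like the sketch below. This is an illustration only: the function names and the prefix set `LOWERCASE_PREFIXES` are assumptions, and the list of prefixes is not the one used in ingest.

```python
import re

# Assumed, non-exhaustive set of last-name prefixes that stay in lower case.
LOWERCASE_PREFIXES = {"van", "von", "de", "der", "den", "del", "la", "le", "di", "da"}


def title_case_word(word: str) -> str:
    """Title-case one whitespace-separated token of an author string."""
    if word.lower() in LOWERCASE_PREFIXES:
        return word.lower()
    # Capitalize every run of letters, so "UL-RAHMAN," becomes "Ul-Rahman,"
    # and initials such as "D.A." keep each letter upper case.
    return re.sub(r"[A-Za-z]+", lambda m: m.group().capitalize(), word)


def title_case_authors(authors: str) -> str:
    """Apply title case only when the whole author string is upper case."""
    if not authors.isupper():
        return authors
    return " ".join(title_case_word(word) for word in authors.split(" "))


print(title_case_authors("DOE, JOHN A.; VAN DER BERG, ANNA"))
# Doe, John A.; van der Berg, Anna
```

Checking the full string with `isupper()` rather than each name means mixed-case input is left untouched, which avoids rewriting names that submitters capitalized deliberately.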

For ingested sequences to be accepted, ingest needs to resubmit all of them.

Problematic: Preprocessing version bump no longer possible

Old submissions won't be accepted by the new prepro pipeline. This is a problem for the current preprocessing versioning code, which expects all released old sequences to be compatible with a new pipeline. There are a few options, but none is obviously the best:

  • A) Change author formatting inside the db sequence entries
  • B) Make the prepro pipeline submission-time aware and be more lenient on old submissions before a time cutoff - but we'd still somehow need to ensure old entries have authors output in the new format, which is not trivial.

Summary

Preprocessing will only accept submissions where authors are formatted as surname, first name; surname, first name;. It will raise an error if this is not the case.

  • Ingest must now change how it formats the author string it receives from NIH to fit this format.
  • Additionally, the ena-submission pod must map this new Loculus format to the ENA expected format.

Testing

Check out pathoplexus/pathoplexus@ce68b92 to perform regression testing by comparing this branch with Loculus main (ignore authors, which are expected to differ):

micromamba activate pp-integrity
cd tests/regression-testing 
snakemake results/ebola-zaire.meta.main.format-authors.diff 
snakemake results/cchf.meta.main.format-authors.diff 
snakemake results/west-nile.meta.main.format-authors.diff 

There is no diff in the ebola-zaire, cchf and west nile metadata.

PR Checklist

@anna-parker anna-parker added the preview Triggers a deployment to argocd label Oct 10, 2024

This will bump versions in ingest - we could work around this with a more complicated prepro that takes into account whether something comes from ingest, but that might not be worth the hassle just to avoid +1 version.

@anna-parker anna-parker marked this pull request as ready for review October 11, 2024 12:55
@anna-parker

NIH data from NCBI Virus is not consistent in its formatting of authors:

13:18:17    DEBUG ( prepare_metadata.py:  83) - Config(compound_country_field='ncbiGeoLocation', fasta_id_field='genbankAccession', rename={'bioprojects': 'bioprojectAccession', 'country': 'geoLocCountry', 'division': 'geoLocAdmin1', 'genbankAccession': 'insdcAccessionFull', 'ncbiCollectionDate': 'sampleCollectionDate', 'ncbiHostCommonName': 'hostNameCommon', 'ncbiHostName': 'hostNameScientific', 'ncbiHostSex': 'hostGender', 'ncbiHostTaxId': 'hostTaxonId', 'ncbiIsLabHost': 'isLabHost', 'ncbiIsolateName': 'specimenCollectorSampleId', 'ncbiPurposeOfSampling': 'purposeOfSampling', 'ncbiSraAccessions': 'sraRunAccession', 'ncbiSubmitterAffiliation': 'authorAffiliations', 'ncbiSubmitterNames': 'authors'}, keep=['division', 'country', 'submissionId', 'insdcAccessionBase', 'insdcVersion', 'bioprojects', 'biosampleAccession', 'ncbiHostName', 'ncbiHostTaxId', 'ncbiIsLabHost', 'ncbiReleaseDate', 'ncbiUpdateDate', 'ncbiSourceDb', 'ncbiVirusName', 'ncbiVirusTaxId', 'sequence_md5', 'genbankAccession', 'jointAccession'], segmented=True)
13:18:17     INFO ( prepare_metadata.py:  85) - Reading metadata from results/filtered_metadata.tsv
13:18:17    ERROR ( prepare_metadata.py:  47) - Author list of MW915541: Shahid MF,Yaqub T,Ali M,Ul-Rahman A,Bente DA,Shahid,M.F.,Yaqub,T.,Ali,M.,Ul-Rahman,A.,Bente,D.A. has uneven number of first and last names, unable to format author names, returning empty author list
13:18:17    ERROR ( prepare_metadata.py:  47) - Author list of MW915542: Shahid MF,Yaqub T,Ali M,Ul-Rahman A,Bente DA,Shahid,M.F.,Yaqub,T.,Ali,M.,Ul-Rahman,A.,Bente,D.A. has uneven number of first and last names, unable to format author names, returning empty author list
13:18:17    ERROR ( prepare_metadata.py:  47) - Author list of MW915543: Shahid MF,Yaqub T,Ali M,Ul-Rahman A,Bente DA,Shahid,M.F.,Yaqub,T.,Ali,M.,Ul-Rahman,A.,Bente,D.A. has uneven number of first and last names, unable to format author names, returning empty author list
13:18:17    ERROR ( prepare_metadata.py:  47) - Author list of MW915544: Shahid MF,Yaqub T,Ali M,Ul-Rahman A,Bente DA,Shahid,M.F.,Yaqub,T.,Ali,M.,Ul-Rahman,A.,Bente,D.A. has uneven number of first and last names, unable to format author names, returning empty author list
13:18:17    ERROR ( prepare_metadata.py:  47) - Author list of MW915545: Shahid MF,Yaqub T,Ali M,Ul-Rahman A,Bente DA,Shahid,M.F.,Yaqub,T.,Ali,M.,Ul-Rahman,A.,Bente,D.A. has uneven number of first and last names, unable to format author names, returning empty author list
13:18:17    ERROR ( prepare_metadata.py:  47) - Author list of MW915546: Shahid MF,Yaqub T,Ali M,Ul-Rahman A,Bente DA,Shahid,M.F.,Yaqub,T.,Ali,M.,Ul-Rahman,A.,Bente,D.A. has uneven number of first and last names, unable to format author names, returning empty author list
13:18:17    ERROR ( prepare_metadata.py:  47) - Author list of MW915547: Shahid MF,Yaqub T,Ali M,Ul-Rahman A,Bente DA,Shahid,M.F.,Yaqub,T.,Ali,M.,Ul-Rahman,A.,Bente,D.A. has uneven number of first and last names, unable to format author names, returning empty author list
13:18:17    ERROR ( prepare_metadata.py:  47) - Author list of MW915548: Shahid MF,Yaqub T,Ali M,Ul-Rahman A,Bente DA,Shahid,M.F.,Yaqub,T.,Ali,M.,Ul-Rahman,A.,Bente,D.A. has uneven number of first and last names, unable to format author names, returning empty author list
13:18:17    ERROR ( prepare_metadata.py:  47) - Author list of MW915549: Shahid MF,Yaqub T,Ali M,Ul-Rahman A,Bente DA,Shahid,M.F.,Yaqub,T.,Ali,M.,Ul-Rahman,A.,Bente,D.A. has uneven number of first and last names, unable to format author names, returning empty author list
13:18:17    ERROR ( prepare_metadata.py:  47) - Author list of MW915550: Shahid MF,Yaqub T,Ali M,Ul-Rahman A,Bente DA,Shahid,M.F.,Yaqub,T.,Ali,M.,Ul-Rahman,A.,Bente,D.A. has uneven number of first and last names, unable to format author names, returning empty author list
13:18:17    ERROR ( prepare_metadata.py:  47) - Author list of MW915551: Shahid MF,Yaqub T,Ali M,Ul-Rahman A,Bente DA,Shahid,M.F.,Yaqub,T.,Ali,M.,Ul-Rahman,A.,Bente,D.A. has uneven number of first and last names, unable to format author names, returning empty author list
13:18:17    ERROR ( prepare_metadata.py:  47) - Author list of MW915552: Shahid MF,Yaqub T,Ali M,Ul-Rahman A,Bente DA,Shahid,M.F.,Yaqub,T.,Ali,M.,Ul-Rahman,A.,Bente,D.A. has uneven number of first and last names, unable to format author names, returning empty author list
13:18:17    ERROR ( prepare_metadata.py:  47) - Author list of MW915553: Shahid MF,Yaqub T,Ali M,Ul-Rahman A,Bente DA,Shahid,M.F.,Yaqub,T.,Ali,M.,Ul-Rahman,A.,Bente,D.A. has uneven number of first and last names, unable to format author names, returning empty author list
13:18:17    ERROR ( prepare_metadata.py:  47) - Author list of MW915554: Shahid MF,Yaqub T,Ali M,Ul-Rahman A,Bente DA,Shahid,M.F.,Yaqub,T.,Ali,M.,Ul-Rahman,A.,Bente,D.A. has uneven number of first and last names, unable to format author names, returning empty author list
13:18:17    ERROR ( prepare_metadata.py:  47) - Author list of MW915555: Shahid MF,Yaqub T,Ali M,Ul-Rahman A,Bente DA,Shahid,M.F.,Yaqub,T.,Ali,M.,Ul-Rahman,A.,Bente,D.A. has uneven number of first and last names, unable to format author names, returning empty author list
13:18:17    ERROR ( prepare_metadata.py:  47) - Author list of MW915556: Shahid MF,Yaqub T,Ali M,Ul-Rahman A,Bente DA,Shahid,M.F.,Yaqub,T.,Ali,M.,Ul-Rahman,A.,Bente,D.A. has uneven number of first and last names, unable to format author names, returning empty author list
13:18:17    ERROR ( prepare_metadata.py:  47) - Author list of MW915557: Shahid MF,Yaqub T,Ali M,Ul-Rahman A,Bente DA,Shahid,M.F.,Yaqub,T.,Ali,M.,Ul-Rahman,A.,Bente,D.A. has uneven number of first and last names, unable to format author names, returning empty author list
13:18:17    ERROR ( prepare_metadata.py:  47) - Author list of MW915558: Shahid MF,Yaqub T,Ali M,Ul-Rahman A,Bente DA,Shahid,M.F.,Yaqub,T.,Ali,M.,Ul-Rahman,A.,Bente,D.A. has uneven number of first and last names, unable to format author names, returning empty author list
13:18:17    ERROR ( prepare_metadata.py:  47) - Author list of MW915559: Shahid MF,Yaqub T,Ali M,Ul-Rahman A,Bente DA,Shahid,M.F.,Yaqub,T.,Ali,M.,Ul-Rahman,A.,Bente,D.A. has uneven number of first and last names, unable to format author names, returning empty author list
13:18:17    ERROR ( prepare_metadata.py:  47) - Author list of MW915560: Shahid MF,Yaqub T,Ali M,Ul-Rahman A,Bente DA,Shahid,M.F.,Yaqub,T.,Ali,M.,Ul-Rahman,A.,Bente,D.A. has uneven number of first and last names, unable to format author names, returning empty author list
13:18:17    ERROR ( prepare_metadata.py:  47) - Author list of MW915561: Shahid MF,Yaqub T,Ali M,Ul-Rahman A,Bente DA,Shahid,M.F.,Yaqub,T.,Ali,M.,Ul-Rahman,A.,Bente,D.A. has uneven number of first and last names, unable to format author names, returning empty author list
13:18:18     INFO ( prepare_metadata.py: 162) - Saved metadata for 3317 sequences
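
The author strings in the ERROR lines mix two styles in one comma-joined string ("Shahid MF" and "Shahid,M.F."), so the comma cannot serve both as the author separator and as the last/first separator. A rough sketch of per-entry handling, assuming the names are available as a list (as they would be when reading the raw jsonl rather than the tsv). The helpers `ncbi_name_to_loculus` and `format_authors` are hypothetical, and initials are passed through unchanged:

```python
def ncbi_name_to_loculus(name: str) -> str:
    """Convert one NCBI name entry to 'Last, First' (the first part may be empty)."""
    if "," in name:
        # Style "Shahid,M.F.": the comma already separates last name and initials.
        last, first = name.split(",", 1)
    else:
        # Style "Shahid MF": assume the final whitespace-separated token holds the initials.
        last, _, first = name.strip().rpartition(" ")
        if not last:
            # A single token: treat it as a last name without first names.
            last, first = first, ""
    return f"{last.strip()}, {first.strip()}".strip()


def format_authors(names: list[str]) -> str:
    """Join the converted entries with '; ' as the author separator."""
    return "; ".join(ncbi_name_to_loculus(name) for name in names)


names = ["Shahid MF", "Ul-Rahman A", "Bente,D.A."]
print(",".join(names))        # Shahid MF,Ul-Rahman A,Bente,D.A.  (ambiguous)
print(format_authors(names))  # Shahid, MF; Ul-Rahman, A; Bente, D.A.
```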

@anna-parker

MW915549

In the downloaded ncbi_dataset/data/data_report.jsonl the names are actually:
"names":["Shahid MF","Yaqub T","Ali M","Ul-Rahman A","Bente DA","Shahid,M.F.","Yaqub,T.","Ali,M.","Ul-Rahman,A.","Bente,D.A."]}

@anna-parker anna-parker changed the title from "format authors in prepro" to "feat(prepro, ingest, deposition): Enforce author formatting in prepro, map authors accordingly in ingest and deposition" Oct 11, 2024
@corneliusroemer corneliusroemer changed the base branch from main to refactor_ena_deposition October 18, 2024 14:51
Base automatically changed from refactor_ena_deposition to main October 18, 2024 14:54
@corneliusroemer corneliusroemer changed the base branch from main to pg-log October 18, 2024 14:56
@corneliusroemer corneliusroemer changed the base branch from pg-log to main October 18, 2024 14:56
More clean up

Move checks from snakefile to config

fix config

update deployment

update tests ci

Add trigger from db option

Fix cronjob

Fix link to config-file

fix deployment

install package in dockerfile

install at correct location

Remove snakemake as no longer needed

Add missing dependency

try to debug

Create an XmlNone dataclass - this is required since package update

test threads stop

revert exception test

test upload to ena dev still works on preview

Make sure test is set correctly!!!

remove debug print statements

Improve logs

Fix merge errors

Update ena-submission/README.md

Co-authored-by: Cornelius Roemer <[email protected]>

Apply suggestions from code review

Co-authored-by: Cornelius Roemer <[email protected]>

Cronjob: create results directory before writing to it

format authors in prepro

Fix ingest

try to fix pattern

simplify regex

fix check

Add tests

# Conflicts:
#	preprocessing/nextclade/tests/test.py

Add to ena submission

fix

fix other edge case

Update ena-submission/scripts/ena_submission_helper.py

Co-authored-by: Cornelius Roemer <[email protected]>

Update ena-submission/scripts/ena_submission_helper.py

Update ena-submission/scripts/ena_submission_helper.py

Co-authored-by: Cornelius Roemer <[email protected]>

Update ena-submission/scripts/ena_submission_helper.py

Update ingest/scripts/prepare_metadata.py

Update preprocessing/nextclade/src/loculus_preprocessing/processing_functions.py

rename

Update reformat_authors_from_genbank_to_loculus

Additionally format authors with correct white space

Improve error message

add tests

fix missing pattern

improve error logs

fix error

Update preprocessing/nextclade/src/loculus_preprocessing/processing_functions.py

improve logging more

feat(ingest): Do not use processed tsv but raw jsonl when ingesting data from NCBI Virus (#2990)

* Use raw jsonl instead of generated tsv when ingesting data from NCBI virus

* Do not require authors list to end in ';', capitalize names correctly.

* Add tests for capitalization

* Add a warning if author list might be in wrong format

* Add ascii specific warning

* Add tests for warnings and errors

* Only capitalize if full authors string is upper case

* Properly capitalize initial

* Move titlecase option to ingest only - add ingest tests

Move author formatting functions to format_ncbi_metadata as this is a more logical location

Remove duplicate group name

# Conflicts:
#	ena-submission/scripts/get_ena_submission_list.py
#	ena-submission/src/ena_deposition/config.py
@anna-parker

SQL code for retroactively updating the author format of previous versions:

WITH latest_authors AS (
    -- Find the latest version of each accession and extract its authors
    SELECT
        accession,
        version AS latest_version,
        metadata->'authors' AS latest_authors
    FROM entries
    WHERE (accession, version) IN (
        SELECT accession, MAX(version)
        FROM entries
        GROUP BY accession
    )
)
-- Copy the latest authors value onto every older version of the same accession
UPDATE entries
SET metadata = jsonb_set(entries.metadata, '{authors}', latest_authors.latest_authors)
FROM latest_authors
WHERE entries.accession = latest_authors.accession
AND entries.version < latest_authors.latest_version
-- jsonb_set is strict: a NULL new value would set the whole metadata column to NULL
AND latest_authors.latest_authors IS NOT NULL;

Successfully merging this pull request may close these issues.

feat(preprocessing): Require formatted author list