Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ingest #6

Draft
wants to merge 12 commits into
base: master
Choose a base branch
from
Draft

Ingest #6

wants to merge 12 commits into from

Conversation

j23414
Copy link
Contributor

@j23414 j23414 commented Dec 12, 2022

Description of proposed changes

See commit messages.

Related issue(s)

Testing

  • Checks pass

If reviewer wants to run a local check as documented in ingest/README.md and README.md:

# Checkout branch
git clone https://github.com/nextstrain/ebola.git
cd ebola
git checkout ingest

# Test Ingest
cd ingest
nextstrain build .
ls -ltr data

# Test build
cd ..
nextstrain build .
nextstrain view auspice

Copy link
Contributor

@huddlej huddlej left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for putting this ingest PR together so quickly, @j23414. I'm sorry it took me so long to review this more closely. Let me know if you want to chat synchronously about any of the comments. Even though there are some bigger topics that fall outside the scope of this PR (e.g., How to best reuse ingest logic across pathogen repos?), there seem to be simple local solutions we can make in the short term.

ingest/workflow/snakemake_rules/transform.smk Outdated Show resolved Hide resolved
Snakefile Outdated Show resolved Hide resolved
Snakefile Outdated Show resolved Hide resolved
Snakefile Outdated Show resolved Hide resolved
ingest/workflow/snakemake_rules/transform.smk Outdated Show resolved Hide resolved
ingest/README.md Outdated Show resolved Hide resolved
nextstrain build . --configfiles config/config.yaml config/optional.yaml
```

### Adding new sequences not from GenBank
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tested this approach using sequences we have in fauna.

# From `master` of ebola repo:
snakemake -j 1 -p data/ebola.fasta

# From ingest directory of ingest branch:
seqkit sample -n 50 ../data/ebola.fasta > local_ebola.fasta

wget https://raw.githubusercontent.com/nextstrain/monkeypox/master/ingest/bin/fasta-to-ndjson -O bin/fasta-to-ndjson
chmod 755 bin/fasta-to-ndjson

./bin/fasta-to-ndjson \
        --fasta local_ebola.fasta \
        --fields strain virus accession collection_date region country division location source locus authors url title journal puburl \
        --separator "|" > data/local_ebola.ndjson

These steps all ran as expected. When I tried to dry-run the ingest after modifying the config file to include this new source, I got the following error:

$ nextstrain build .  -np
Building DAG of jobs...
Relative file path './source-data/geolocation-rules.tsv' starts with './'. This is redundant and strongly discouraged. It can also lead to inconsistent results of the file-matching approach used by Snakemake. You can simply omit the './' for relative file paths.
MissingInputException in line 49 of /Users/jlhudd/projects/nextstrain/ebola/ingest/workflow/snakemake_rules/fetch_sequences.smk:
Missing input files for rule fetch_all_sequences:
data/local_ebola_all.ndjson

I tried renaming the NDJSON file I created above to data/local_ebola_all.ndjson and ran the workflow again. It ran as expected. In cases where the strain names were different between fauna and GenBank for the same accession (e.g., Ebolavirus/H_sapiens_wt/SLE/2014/Makona_J0005 and J0005 for accession KP759628), both records appeared in the output metadata and sequences. While we can disambiguate the metadata records by strain name, the sequences use only the accession, so we end up with duplicate records in sequences.

These duplicate sequences cause augur filter to exit with an error (with a list of duplicate strains) in the top-level workflow. This outcome is probably close to what we would expect, since we want users to know duplicates exist and handle these accordingly. It might be nicer for users for the ingest to warn them about duplicates, but I don't think we have any tools to check for dups currently, do we?

Regarding the issue with the NDJSON filename mismatch above, we could probably address this problem by removing the serotype wildcard from the sources, since we plan to collect all Ebola serotypes into a single pair of metadata and sequence files and split by serotype in the Nextstrain workflow anyway.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regarding the issue with the NDJSON filename mismatch above,

Done with commit: fc12467

"I don't think we have any tools to check for dups currently, do we?

Would be wonderful if we did...I tend to use smof uniq , uniq_merge.py, or write up a one-off perl scripts to remove duplicate sequences. There are many possible ways to remove sequence duplicates. Was there a de-dups rule you had in mind? Maybe from a nextstrain repo?

While we can disambiguate the metadata records by strain name

I've seen some gnarly strain names in my time (and don't get me started on resequenced Isolates, or sequenced segments from the same strain)... so highly recommend keeping with the Monkeypox method of IDs by Accession. I see you're checking old fauna datasets.... My method of checking fauna was writing a quick script to replace

> strain_name

with its associated:

>Accession_number

I can spec out a quick perl/python script to do the swap where it (1) reads in a metadata file to generate strain_name-to-accession_number hash and (2) read in sequence file to replace strain_name with Accession_number.

In cases where the strain names were different between fauna and GenBank for the same accession (e.g., Ebolavirus/H_sapiens_wt/SLE/2014/Makona_J0005 and J0005 for accession KP759628), both records appeared in the output metadata and sequences.

Ah this is exactly where I use uniq_merge.py to flag accessions with multiple strain names and create a fix-stain-name.pl script. I'll think about it more. Thank you for the feedback.

j23414 pushed a commit to nextstrain/dengue that referenced this pull request Dec 30, 2022
This would allow us to version the software we use in this workflow without being affected by upstream changes until we want to bump the version. This design adds more maintenance to this workflow, but it also protects users against unexpected issues that are outside of their control.

Discussed in nextstrain/ebola#6 (comment)
j23414 pushed a commit to nextstrain/dengue that referenced this pull request Dec 30, 2022
j23414 pushed a commit to nextstrain/zika that referenced this pull request Dec 30, 2022
This would allow us to version the software we use in this workflow without being affected by upstream changes until we want to bump the version. This design adds more maintenance to this workflow, but it also protects users against unexpected issues that are outside of their control.

Discussed in nextstrain/ebola#6 (comment)
j23414 pushed a commit to nextstrain/zika that referenced this pull request Dec 30, 2022
j23414 pushed a commit to nextstrain/dengue that referenced this pull request Mar 25, 2023
This would allow us to version the software we use in this workflow without being affected by upstream changes until we want to bump the version. This design adds more maintenance to this workflow, but it also protects users against unexpected issues that are outside of their control.

Discussed in nextstrain/ebola#6 (comment)
j23414 pushed a commit to nextstrain/dengue that referenced this pull request Mar 25, 2023
@tsibley tsibley mentioned this pull request Mar 27, 2023
4 tasks
j23414 pushed a commit to nextstrain/dengue that referenced this pull request Apr 11, 2023
This would allow us to version the software we use in this workflow without being affected by upstream changes until we want to bump the version. This design adds more maintenance to this workflow, but it also protects users against unexpected issues that are outside of their control.

Discussed in nextstrain/ebola#6 (comment)
j23414 pushed a commit to nextstrain/dengue that referenced this pull request Apr 11, 2023
j23414 pushed a commit to nextstrain/dengue that referenced this pull request Apr 12, 2023
This would allow us to version the software we use in this workflow without being affected by upstream changes until we want to bump the version. This design adds more maintenance to this workflow, but it also protects users against unexpected issues that are outside of their control.

Discussed in nextstrain/ebola#6 (comment)
j23414 pushed a commit to nextstrain/dengue that referenced this pull request Apr 12, 2023
@j23414 j23414 marked this pull request as draft April 12, 2023 18:05
j23414 pushed a commit to nextstrain/dengue that referenced this pull request Apr 14, 2023
This would allow us to version the software we use in this workflow without being affected by upstream changes until we want to bump the version. This design adds more maintenance to this workflow, but it also protects users against unexpected issues that are outside of their control.

Discussed in nextstrain/ebola#6 (comment)
j23414 pushed a commit to nextstrain/dengue that referenced this pull request Apr 14, 2023
j23414 pushed a commit to nextstrain/dengue that referenced this pull request Apr 17, 2023
This would allow us to version the software we use in this workflow without being affected by upstream changes until we want to bump the version. This design adds more maintenance to this workflow, but it also protects users against unexpected issues that are outside of their control.

Discussed in nextstrain/ebola#6 (comment)
j23414 pushed a commit to nextstrain/dengue that referenced this pull request Apr 17, 2023
j23414 pushed a commit to nextstrain/dengue that referenced this pull request Apr 17, 2023
This would allow us to version the software we use in this workflow without being affected by upstream changes until we want to bump the version. This design adds more maintenance to this workflow, but it also protects users against unexpected issues that are outside of their control.

Discussed in nextstrain/ebola#6 (comment)
j23414 pushed a commit to nextstrain/dengue that referenced this pull request Apr 17, 2023
j23414 pushed a commit to nextstrain/dengue that referenced this pull request Apr 17, 2023
Use a permalink for each script to allow us to version the software we
use in this workflow without being affected by upstream changes until
we want to bump the version. This design adds more maintenance to this
workflow, but it also protects users against unexpected issues that are
outside of their control.
Discussed in nextstrain/ebola#6 (comment)

Pick curl instead of wget as discussed in:
nextstrain/ebola#6 (comment)
j23414 pushed a commit to nextstrain/dengue that referenced this pull request Apr 17, 2023
Use a permalink for each script to allow us to version the software we
use in this workflow without being affected by upstream changes until
we want to bump the version. This design adds more maintenance to this
workflow, but it also protects users against unexpected issues that are
outside of their control.
Discussed in nextstrain/ebola#6 (comment)

Pick curl instead of wget as discussed in:
nextstrain/ebola#6 (comment)
j23414 pushed a commit to nextstrain/dengue that referenced this pull request Apr 17, 2023
If a script does not need to be modified for a pathogen ingest, pull
script during runtime instead of maintaining a potentially diverging
copy.

Use a permalink for each script to allow us to version the software we
use in this workflow without being affected by upstream changes until
we want to bump the version. This design adds more maintenance to this
workflow, but it also protects users against unexpected issues that are
outside of their control.
Discussed in nextstrain/ebola#6 (comment)

Pick curl instead of wget as discussed in:
nextstrain/ebola#6 (comment)

However, this commit gets reverted later based on discussions.
j23414 pushed a commit to nextstrain/dengue that referenced this pull request Apr 17, 2023
If a script does not need to be modified for a pathogen ingest, pull
script during runtime instead of maintaining a potentially diverging
copy.

Use a permalink for each script to allow us to version the software we
use in this workflow without being affected by upstream changes until
we want to bump the version. This design adds more maintenance to this
workflow, but it also protects users against unexpected issues that are
outside of their control.
Discussed in nextstrain/ebola#6 (comment)

However, this commit gets reverted later based on discussions.
j23414 added a commit to nextstrain/dengue that referenced this pull request Apr 17, 2023
Pick curl instead of detecting curl/wget as discussed in:
nextstrain/ebola#6 (comment)
j23414 pushed a commit to nextstrain/dengue that referenced this pull request Apr 17, 2023
If a script does not need to be modified for a pathogen ingest, pull
script during runtime instead of maintaining a potentially diverging
copy.

Use a permalink for each script to allow us to version the software we
use in this workflow without being affected by upstream changes until
we want to bump the version. This design adds more maintenance to this
workflow, but it also protects users against unexpected issues that are
outside of their control.
Discussed in nextstrain/ebola#6 (comment)

However, this commit gets reverted later based on discussions.
j23414 added a commit to nextstrain/dengue that referenced this pull request Apr 17, 2023
Pick curl instead of detecting curl/wget as discussed in:
nextstrain/ebola#6 (comment)
j23414 pushed a commit to nextstrain/dengue that referenced this pull request Apr 18, 2023
If a script does not need to be modified for a pathogen ingest, pull
script during runtime instead of maintaining a potentially diverging
copy.

Use a permalink for each script to allow us to version the software we
use in this workflow without being affected by upstream changes until
we want to bump the version. This design adds more maintenance to this
workflow, but it also protects users against unexpected issues that are
outside of their control.
Discussed in nextstrain/ebola#6 (comment)

However, this commit gets reverted later based on discussions.
j23414 added a commit to nextstrain/dengue that referenced this pull request Apr 18, 2023
Pick curl instead of detecting curl/wget as discussed in:
nextstrain/ebola#6 (comment)
j23414 pushed a commit to nextstrain/dengue that referenced this pull request May 5, 2023
If a script does not need to be modified for a pathogen ingest, pull
script during runtime instead of maintaining a potentially diverging
copy.

Use a permalink for each script to allow us to version the software we
use in this workflow without being affected by upstream changes until
we want to bump the version. This design adds more maintenance to this
workflow, but it also protects users against unexpected issues that are
outside of their control.
Discussed in nextstrain/ebola#6 (comment)

However, this commit gets reverted later based on discussions.
j23414 added a commit to nextstrain/dengue that referenced this pull request May 5, 2023
Pick curl instead of detecting curl/wget as discussed in:
nextstrain/ebola#6 (comment)
j23414 pushed a commit to nextstrain/dengue that referenced this pull request May 5, 2023
If a script does not need to be modified for a pathogen ingest, pull
script during runtime instead of maintaining a potentially diverging
copy.

Use a permalink for each script to allow us to version the software we
use in this workflow without being affected by upstream changes until
we want to bump the version. This design adds more maintenance to this
workflow, but it also protects users against unexpected issues that are
outside of their control.
Discussed in nextstrain/ebola#6 (comment)

However, this commit gets reverted later based on discussions.
j23414 added a commit to nextstrain/dengue that referenced this pull request May 5, 2023
Pick curl instead of detecting curl/wget as discussed in:
nextstrain/ebola#6 (comment)
j23414 pushed a commit to nextstrain/dengue that referenced this pull request May 5, 2023
If a script does not need to be modified for a pathogen ingest, pull
script during runtime instead of maintaining a potentially diverging
copy.

https://www.oreilly.com/library/view/building-maintainable-software/9781491955987/ch04.html

Use a permalink for each script to allow us to version the software we
use in this workflow without being affected by upstream changes until
we want to bump the version. This design adds more maintenance to this
workflow, but it also protects users against unexpected issues that are
outside of their control.
Discussed in nextstrain/ebola#6 (comment)

However, this commit gets reverted later based on discussions.
j23414 added a commit to nextstrain/dengue that referenced this pull request May 5, 2023
Pick curl instead of detecting curl/wget as discussed in:
nextstrain/ebola#6 (comment)

Co-authored-by: John Huddleston <[email protected]>
j23414 added a commit to nextstrain/dengue that referenced this pull request May 5, 2023
If a script does not need to be modified for a pathogen ingest, pull
script during runtime instead of maintaining a potentially diverging
copy.

https://www.oreilly.com/library/view/building-maintainable-software/9781491955987/ch04.html

Use a permalink for each script to allow us to version the software we
use in this workflow without being affected by upstream changes until
we want to bump the version. This design adds more maintenance to this
workflow, but it also protects users against unexpected issues that are
outside of their control.
Discussed in nextstrain/ebola#6 (comment)

However, this commit gets reverted later based on discussions.
j23414 added a commit to nextstrain/dengue that referenced this pull request May 5, 2023
Pick curl instead of detecting curl/wget as discussed in:
nextstrain/ebola#6 (comment)

Co-authored-by: John Huddleston <[email protected]>
j23414 pushed a commit to nextstrain/dengue that referenced this pull request Jun 6, 2023
If a script does not need to be modified for a pathogen ingest, pull
script during runtime instead of maintaining a potentially diverging
copy.

Use a permalink for each script to allow us to version the software we
use in this workflow without being affected by upstream changes until
we want to bump the version. This design adds more maintenance to this
workflow, but it also protects users against unexpected issues that are
outside of their control.
Discussed in nextstrain/ebola#6 (comment)

However, this commit gets reverted later based on discussions.
j23414 added a commit to nextstrain/dengue that referenced this pull request Jun 6, 2023
Pick curl instead of detecting curl/wget as discussed in:
nextstrain/ebola#6 (comment)
j23414 added 12 commits June 20, 2023 16:53
Flags most of the intermediate representations of sequences and metadata as temporary files such that we only keep the final compressed outputs. 

Deleting the intermediate files reduced the data directory size from 221MB to 584 KB.
Since serotype is annotated as a column in metadata, simplify intermediate filenames like `data/sequences_{serotype}.fasta` and `data/metadata_{serotype}.tsv` to `data/sequences.fasta` and `data/metadata.tsv`.
In ingest/README.md, assume user is within ingest directory.
This would allow us to version the software we use in this workflow without being affected by upstream changes until we want to bump the version. This design adds more maintenance to this workflow, but it also protects users against unexpected issues that are outside of their control.
j23414 pushed a commit to nextstrain/dengue that referenced this pull request Jun 21, 2023
If a script does not need to be modified for a pathogen ingest, pull
script during runtime instead of maintaining a potentially diverging
copy.

Use a permalink for each script to allow us to version the software we
use in this workflow without being affected by upstream changes until
we want to bump the version. This design adds more maintenance to this
workflow, but it also protects users against unexpected issues that are
outside of their control.
Discussed in nextstrain/ebola#6 (comment)

However, this commit gets reverted later based on discussions.
j23414 added a commit to nextstrain/dengue that referenced this pull request Jun 21, 2023
Pick curl instead of detecting curl/wget as discussed in:
nextstrain/ebola#6 (comment)
j23414 pushed a commit to nextstrain/dengue that referenced this pull request Aug 19, 2023
If a script does not need to be modified for a pathogen ingest, pull
script during runtime instead of maintaining a potentially diverging
copy.

Use a permalink for each script to allow us to version the software we
use in this workflow without being affected by upstream changes until
we want to bump the version. This design adds more maintenance to this
workflow, but it also protects users against unexpected issues that are
outside of their control.
Discussed in nextstrain/ebola#6 (comment)

However, this commit gets reverted later based on discussions.
j23414 added a commit to nextstrain/dengue that referenced this pull request Aug 19, 2023
Pick curl instead of detecting curl/wget as discussed in:
nextstrain/ebola#6 (comment)
j23414 pushed a commit to nextstrain/dengue that referenced this pull request Sep 11, 2023
If a script does not need to be modified for a pathogen ingest, pull
script during runtime instead of maintaining a potentially diverging
copy.

Use a permalink for each script to allow us to version the software we
use in this workflow without being affected by upstream changes until
we want to bump the version. This design adds more maintenance to this
workflow, but it also protects users against unexpected issues that are
outside of their control.
Discussed in nextstrain/ebola#6 (comment)

However, this commit gets reverted later based on discussions.
j23414 added a commit to nextstrain/dengue that referenced this pull request Sep 11, 2023
Pick curl instead of detecting curl/wget as discussed in:
nextstrain/ebola#6 (comment)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
No open projects
Status: In Review
Development

Successfully merging this pull request may close these issues.

2 participants