ingest: Merge Nextclade metadata with augur merge
This construction reads a bit clearer and cleaner.  It's also a good
example of how to use `augur merge`.

`augur merge`'s limitation on non-seekable streams (such as pipes) means
the Nextclade fields are now written to an intermediate `temp()` file
before merging, so the workflow uses additional transient disk space,
but that typically shouldn't be an issue.  Augur's slow startup time
also lengthens the rule's execution, but the difference should be
negligible in the context of the larger workflow, and presumably we'll
fix the slow startup eventually.¹

The output is semantically identical but has some syntactic changes re:
quoting.  It's worth noting that the pre-existing TSV format was _not_
IANA TSV, despite it (still) being treated as such in a few places, but
was (and remains) a CSV-like TSV with some quoted fields.  We really
need to sort out our TSV formats³, but that's for a larger project.

¹ <nextstrain/augur#1628>
² <nextstrain/augur#1565>
³ <nextstrain/augur#1566>

Ported-from: <nextstrain/measles@4d73b7f>
Related-to: <nextstrain/measles#52>
Related-to: <#65>
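
The distinction the message draws between IANA TSV and a "CSV-like TSV with some quoted fields" can be illustrated with a minimal sketch. The sample line below is hypothetical (it is not taken from this workflow's data), and the two readers are stand-ins for any tool that does or does not apply CSV quoting rules to tab-delimited input.

```python
import csv
import io

# Hypothetical CSV-like TSV line: the second field is quoted and contains
# doubled quotes, as a CSV writer with a tab delimiter would emit it.
line = 'strain_1\t"say ""hi"""\t2024-10-03\n'

# IANA-style reading: split on tabs only; quote characters are ordinary data.
iana_fields = line.rstrip("\n").split("\t")

# CSV-like reading: tab-delimited, with CSV quoting rules applied.
csv_like_fields = next(csv.reader(io.StringIO(line), delimiter="\t"))

print(iana_fields)
print(csv_like_fields)
```

The same bytes yield different field values depending on which convention the reader assumes, which is why a change in quoting can be syntactically visible while the data stay semantically identical.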
tsibley committed Oct 3, 2024
1 parent 443d0de commit 3fc3e65
Showing 1 changed file with 26 additions and 13 deletions.
39 changes: 26 additions & 13 deletions ingest/rules/nextclade.smk
@@ -61,14 +61,12 @@ rule run_nextclade:
         """


-rule join_metadata_and_nextclade:
+rule nextclade_metadata:
     input:
         nextclade="results/nextclade.tsv",
-        metadata="data/subset_metadata.tsv",
     output:
-        metadata="results/metadata.tsv",
+        nextclade_metadata=temp("results/nextclade_metadata.tsv"),
     params:
-        metadata_id_field=config["curate"]["output_id_field"],
         nextclade_id_field=config["nextclade"]["id_field"],
         nextclade_field_map=[f"{old}={new}" for old, new in config["nextclade"]["field_map"].items()],
         nextclade_fields=",".join(config["nextclade"]["field_map"].values()),
@@ -80,13 +78,28 @@ rule join_metadata_and_nextclade:
             --field-map {params.nextclade_field_map:q} \
             --output-metadata - \
         | tsv-select --header --fields {params.nextclade_fields:q} \
-        | tsv-join -H \
-            --filter-file - \
-            --key-fields {params.nextclade_id_field} \
-            --data-fields {params.metadata_id_field} \
-            --append-fields '*' \
-            --write-all ? \
-            {input.metadata} \
-        | tsv-select -H --exclude {params.nextclade_id_field} \
-        > {output.metadata}
+        > {output.nextclade_metadata:q}
         """
+
+
+rule join_metadata_and_nextclade:
+    input:
+        metadata="data/subset_metadata.tsv",
+        nextclade_metadata="results/nextclade_metadata.tsv",
+    output:
+        metadata="results/metadata.tsv",
+    params:
+        metadata_id_field=config["curate"]["output_id_field"],
+        nextclade_id_field=config["nextclade"]["id_field"],
+    shell:
+        r"""
+        augur merge \
+            --metadata \
+                metadata={input.metadata:q} \
+                nextclade={input.nextclade_metadata:q} \
+            --metadata-id-columns \
+                metadata={params.metadata_id_field:q} \
+                nextclade={params.nextclade_id_field:q} \
+            --output-metadata {output.metadata:q} \
+            --no-source-columns
+        """

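To make the join in the new `join_metadata_and_nextclade` rule easier to reason about, here is a small Python sketch of the behaviour I understand `augur merge` to have: a full outer join on per-table id columns, where later tables' non-empty values overwrite earlier ones and the first table's id column names the output id. This is an illustration under those assumptions, not Augur's implementation; `merge_tables`, the column names, and the sample rows are all hypothetical.

```python
def merge_tables(tables, id_columns):
    """Full outer join of named tables, merged in the order given.

    tables:     dict of table name -> list of row dicts
    id_columns: dict of table name -> that table's id column
    Returns (column order, merged rows).
    """
    names = list(tables)
    out_id = id_columns[names[0]]  # first table's id column names the output id
    columns = [out_id]
    merged = {}  # id value -> combined row, in first-seen order

    for name in names:
        id_col = id_columns[name]
        for row in tables[name]:
            key = row[id_col]
            combined = merged.setdefault(key, {out_id: key})
            for col, value in row.items():
                if col in (id_col, out_id):
                    continue  # the id is already stored under out_id
                if col not in columns:
                    columns.append(col)
                # Non-empty values from later tables overwrite earlier ones.
                if value != "" or col not in combined:
                    combined[col] = value

    rows = [{c: r.get(c, "") for c in columns} for r in merged.values()]
    return columns, rows


# Hypothetical inputs mirroring the rule's two tables and their id columns.
metadata = [
    {"accession": "A1", "country": "USA", "clade": ""},
    {"accession": "A2", "country": "Kenya", "clade": ""},
]
nextclade = [
    {"seqName": "A1", "clade": "B3"},
    {"seqName": "A3", "clade": "D8"},
]

columns, rows = merge_tables(
    {"metadata": metadata, "nextclade": nextclade},
    {"metadata": "accession", "nextclade": "seqName"},
)
for row in rows:
    print("\t".join(row[c] for c in columns))
```

Note that the removed `tsv-join --write-all` pipeline kept every row of the metadata file, whereas an outer join also keeps rows that appear only in the Nextclade table (`A3` above). The two agree when every Nextclade id is also present in the metadata.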