Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(mask): Add --mask-gaps {all,terminals} option #1286

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

corneliusroemer
Copy link
Member

@corneliusroemer corneliusroemer commented Aug 19, 2023

This allowsaugur mask to allow masking of gaps at terminals via --mask-gaps terminals or all gaps via --mask-gaps all.

Masking all gaps symmetrizes treatments of gaps and insertions.
It is useful when gaps are jumpy due to sequencing artefacts and allows
gaps to be entirely ignored

Further discussion: https://bedfordlab.slack.com/archives/C01LCTT7JNN/p1692443660080129

Testing

I tested all options in a workflow. As this is a new option that is a noop unless the option is explicitly set, it shouldn't break any existing code even if the implementation happens to have a bug.

If this PR gets approved, we might want to add a simple cram test, but it'd be a waste of time to do this now before feedback on whether this PR is at all welcome.

Checklist

  • Add a message in CHANGES.md summarizing the changes in this PR that are end user focused. Keep headers and formatting consistent with the rest of the file.
  • Add simple cram test (once PR is generally accepted)

This allows`augur mask` to allow masking of gaps at terminals via `--mask-gaps terminals` or all gaps via `--mask-gaps all`.

Masking all gaps symmetrizes treatments of gaps and insertions.
It is useful when gaps are jumpy due to sequencing artefacts and allows
gaps to be entirely ignored

Further discussion: https://bedfordlab.slack.com/archives/C01LCTT7JNN/p1692443660080129
@corneliusroemer corneliusroemer requested a review from a team August 19, 2023 12:19
@codecov
Copy link

codecov bot commented Aug 19, 2023

Codecov Report

Patch coverage: 100.00% and project coverage change: +0.03% 🎉

Comparison is base (29bb674) 69.74% compared to head (40af347) 69.77%.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1286      +/-   ##
==========================================
+ Coverage   69.74%   69.77%   +0.03%     
==========================================
  Files          67       67              
  Lines        7146     7154       +8     
  Branches     1742     1745       +3     
==========================================
+ Hits         4984     4992       +8     
  Misses       1855     1855              
  Partials      307      307              
Files Changed Coverage Δ
augur/mask.py 100.00% <100.00%> (ø)

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@jameshadfield
Copy link
Member

See #1048 for some good discussion around this. In the end I didn't implement it because augur align has the functionality, but if we are going to be using nextclade for alignment then worth revisiting.

If this PR goes in can you please update https://docs.nextstrain.org/en/latest/guides/bioinformatics/missing-sequence-data.html?

@corneliusroemer
Copy link
Member Author

Thanks @jameshadfield! I missed the discussion back then.

It seems like my use case (gaps often wrong, sometimes better to strip them) wasn't discussed back then.

I don't think I've ever used augur align. It seems like masking gaps belongs most cleanly into mask.

I'll check out your prior art!

@corneliusroemer
Copy link
Member Author

Nice tests there! I've pretty much reinvented the wheel only worse 😅

Copy link
Member

@victorlin victorlin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good pending a small request and another non-blocking suggestion.

@@ -78,7 +78,7 @@ def mask_vcf(mask_sites, in_file, out_file, cleanup=True):
pass


def mask_sequence(sequence, mask_sites, mask_from_beginning, mask_from_end, mask_invalid):
def mask_sequence(sequence, mask_sites, mask_from_beginning, mask_from_end, mask_invalid, mask_gaps):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
def mask_sequence(sequence, mask_sites, mask_from_beginning, mask_from_end, mask_invalid, mask_gaps):
def mask_sequence(sequence, mask_sites, mask_from_beginning, mask_from_end, mask_invalid, mask_gaps=None):

To match the default noted in docstring.

if beginning + end > sequence_length:
beginning, end = sequence_length, 0

seq = str(sequence.seq)[beginning:-end or None]

if mask_invalid:
seq = "".join(nuc if nuc in VALID_NUCLEOTIDES else "N" for nuc in seq)

if mask_gaps == "all":
seq = "".join(nuc if nuc != "-" else "N" for nuc in seq)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(non-blocking)

Suggested change
seq = "".join(nuc if nuc != "-" else "N" for nuc in seq)
seq = "".join("N" if nuc == "-" else nuc for nuc in seq)

This reads slightly better to me - it pairs the mask character with the gap character.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
No open projects
Development

Successfully merging this pull request may close these issues.

3 participants