Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

different percent identity/alignment length for identical sequences on opposite strands #201

Open
lerminin opened this issue Sep 14, 2023 · 5 comments

Comments

@lerminin
Copy link

lerminin commented Sep 14, 2023

Hello,

I've come across an odd discrepancy - I have a plasmid with two copies of qacE gene (X68232) on opposite strands and I'm getting differing percent identity/alignment length to the reference for each copy despite the nucleotide sequences being identical. The output I get from StarAMR/Resfinder is the following:

Isolate ID Data Data Type Predicted Phenotype CGE Predicted Phenotype %Identity %Overlap HSP Length/Total Length Contig Start End Accession
plasmid qacE Resistance unknown[qacE_1_X68232] Benzylkonium Chloride, Ethidium Bromide, Chlorhexidine, Cetylpyridinium Chloride 100.0 84.68 282/333 p03A17005_A_consensus 33598 33879 X68232
plasmid qacE Resistance unknown[qacE_1_X68232] Benzylkonium Chloride, Ethidium Bromide, Chlorhexidine, Cetylpyridinium Chloride 99.65 85.59 285/333 p03A17005_A_consensus 60928 60644 X68232

Initially this suggested to me that these are two distinct sequences: one on the forward strand with 100 % identity over 282 nt, and one on the reverse strand with 99.65% identity over 285 nt. However, I aligned both qacE sequences to the X68232 reference to search for where the sequences differed and found the following:

  • qacE on forward strand: 282 nt sequence. next three bases are ATT
  • qacE on reverse strand: 285 nt sequence, with final three bases ATT (100 % identity match to forward strand qacE)
  • qacE X68232 reference 283-285 nt are GTT, so the G is mismatched

Seems like the alignment is ending at 282 nt for the forward strand, but the next three bases in the reverse strand are included with a single mismatch over the 285 nt length. Even though the forward strand has the same sequence, the last three nts are being excluded. Given the nt sequences are identical on both strands, I would expect the ResFinder output to be the same for both strands. Is this an artifact of StarAMR or is this a BLAST issue?

I'm also noticing a similar pattern with blaOXA-10 (J03427) - getting 100 % alignment over 768/801 nt in the reference when blaOXA-10 is on the reverse strand, but 99.24 % alignment over 788/801 nt in the reference when blaOXA-10 is on the forward strand. Sequences between forward and reverse are 100% identical in both cases.

plasmid file: plasmid.fasta.txt
version = 0.10.0 (same issue occurs in v0.9.1)
resfinder_gene_drug_version = 072621
resfinder_db_date = Tue, 24 May 2022 06:51

@apetkau
Copy link
Member

apetkau commented Sep 14, 2023

Thanks for the great write-up of this issue. The differences you see is interesting. I don't quite have time to look in-depth right now, but one thing you could try is to add the option --report-all-blast to StarAMR.

StarAMR runs blast against the ResFinder database, which can consist of many sequences for the same gene with minor variation. StarAMR then attempts to pick the "best" blast hit based on some logic related to percent identity and length. Logic for this is here:

def _select_hits_to_include(self, hits):
hits_to_include = []
if len(hits) >= 1:
sorted_hits_pid_first = sorted(hits, key=lambda x: (
x.get_pid(), x.get_plength(), x.get_amr_gene_length(), x.get_amr_gene_id()), reverse=True)
sorted_hits_length_first = sorted(hits, key=lambda x: (
x.get_amr_gene_length(), x.get_pid(), x.get_plength(), x.get_amr_gene_id()), reverse=True)

Maybe something strange is going on with how the blast hit to report is being selected? The --report-all-blast includes all blast hits in the output table.

@lerminin
Copy link
Author

Thanks for the quick reply. I tried running with the --report-all-blast flag but am only getting one hit per gene copy in the output file. Does this suggest its an underlying blast issue?

@apetkau
Copy link
Member

apetkau commented Sep 29, 2023

It's possibly a blast issue. You can run with verbose turned on staramr --verbose search ... to see more information, including each blast hsp.

I did test out by adding qacE_1_X68232 to the beginning and ending of an example genome (and reverse complementing the 2nd sequence) and here are the records I get:

2023-09-29 10:52:06 DEBUG ResfinderHitHSP.__init__,25: record=qseqid                                         qacE_1_X68232
sseqid                                           contig00001
pident                                                 100.0
length                                                   333
qstart                                                     1
qend                                                     333
sstart                                                     1
send                                                     333
slen                                                  641340
qlen                                                     333
sstrand                                                 plus
sseq       ATGAAAGGCTGGCTTTTTCTTGTTATCGCAATAGTTGGCGAAGTAA...
qseq       ATGAAAGGCTGGCTTTTTCTTGTTATCGCAATAGTTGGCGAAGTAA...
plength                                                100.0
Name: 0, dtype: object
2023-09-29 10:52:06 DEBUG ResfinderHitHSP.__init__,25: record=qseqid                                         qacE_1_X68232
sseqid                                           contig00001
pident                                                 100.0
length                                                   333
qstart                                                     1
qend                                                     333
sstart                                                641313
send                                                  640981
slen                                                  641340
qlen                                                     333
sstrand                                                minus
sseq       ATGAAAGGCTGGCTTTTTCTTGTTATCGCAATAGTTGGCGAAGTAA...
qseq       ATGAAAGGCTGGCTTTTTCTTGTTATCGCAATAGTTGGCGAAGTAA...
plength                                                100.0
Name: 1, dtype: object

That is, they have identical percent identity and length.

Maybe there is some particular sequence around one of the copies of your qacE gene that is causing blast to report a slightly different hsp?

@lerminin
Copy link
Author

I will take a look at the flanking sequences. This is my output from running the above plasmids with --verbose:

2023-10-10 11:48:58 DEBUG ResfinderHitHSP.__init__,25: record=qseqid                                         qacE_1_X68232
sseqid                                 p03A17005_A_consensus
pident                                                 100.0
length                                                   282
qstart                                                     1
qend                                                     282
sstart                                                 33598
send                                                   33879
slen                                                   99166
qlen                                                     333
sstrand                                                 plus
sseq       ATGAAAGGCTGGCTTTTTCTTGTTATCGCAATAGTTGGCGAAGTAA...
qseq       ATGAAAGGCTGGCTTTTTCTTGTTATCGCAATAGTTGGCGAAGTAA...
plength                                            84.684685
Name: 0, dtype: object
2023-10-10 11:48:58 DEBUG ResfinderHitHSP.__init__,25: record=qseqid                                         qacE_1_X68232
sseqid                                 p03A17005_A_consensus
pident                                                99.649
length                                                   285
qstart                                                     1
qend                                                     285
sstart                                                 60928
send                                                   60644
slen                                                   99166
qlen                                                     333
sstrand                                                minus
sseq       ATGAAAGGCTGGCTTTTTCTTGTTATCGCAATAGTTGGCGAAGTAA...
qseq       ATGAAAGGCTGGCTTTTTCTTGTTATCGCAATAGTTGGCGAAGTAA...
plength                                            85.585586
Name: 1, dtype: object

A pattern I've noticed is that this is only happening in cases (qacE, blaOXA-10, catB3) where there is a less than 100% length match to the reference sequence. Could that possibly be influencing this discrepancy? If I align my plasmid sequence against the reference 333 bp length of X68232 in the online NCBI blastn interface, I get two hits that have 100% identity over 282/333 nucleotides for both strands, which I would interpret as correct

@apetkau
Copy link
Member

apetkau commented Oct 13, 2023

Thanks. We will have to investigate this further later on as we don't have the resources to look into this issue right now.

My suspicion though is differences in blast parameters or blast versions. You could try directly running the version of blast used by StarAMR to look at the output more closely.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants