Can one read be used to support multiple fusions #203

acloop62hz · 2023-07-10T05:42:49Z

Hi @suhrig ,
I was running Arriba on this simulation dataset :
https://data.broadinstitute.org/Trinity/CTAT_FUSIONTRANS_BENCHMARKING/on_simulated_data/sim_101 (In the case I'm about to mention it's reads from sim3.) I used the default script of Arriba_v2.4.0 and found these two fusions in fusions.tsv:
ISCA1 SLC30A1 -/- -/- 9:88886922 1:211749631 CDS/splice-site CDS/splice-site translocation 299 289 152 1732 4155 high in-frame . Iron-sulphur_cluster_biosynthesis(58%)|Cation_efflux_family(52%) . . ENSG00000135070.9 ENSG00000170385.9 ENST00000375991.4 ENST00000367001.4 upstream downstream duplicates(11),low_entropy(3),mismatches(209),multimappers(162)
CTD-2561J22.1 SLC30A1 -/- -/- 19:21564171 1:211749631 exon CDS/splice-site translocation 3 247 1 10 4155 high . . |Cation_efflux_family(52%) . . ENSG00000268731.1 ENSG00000170385.9 ENST00000597319.1 ENST00000367001.4 upstream downstream duplicates(6),mismatches(100),multimappers(253)

Between these two fusions, there are 166 identical read identifiers, for example:
ISCA1|ENSG00000135070.9--SLC30A1|ENSG00000170385.9:10030480_1_76430_4501_261
ISCA1|ENSG00000135070.9--SLC30A1|ENSG00000170385.9:1010524_1_76430_4597_166
ISCA1|ENSG00000135070.9--SLC30A1|ENSG00000170385.9:10424315_1_76430_4580_164
ISCA1|ENSG00000135070.9--SLC30A1|ENSG00000170385.9:10462442_1_76430_4592_195
ISCA1|ENSG00000135070.9--SLC30A1|ENSG00000170385.9:10479657_1_76430_4532_241

According to Arriba's multimappers filter (I assume these reads are multi-mapping reads), "multi-mapping reads are reduced to a single alignment", each read should be mapped to only one fusion. Thus I wonder if my understanding for this filter is correct or if there's something wrong with the output.

Thanks for your time!

The text was updated successfully, but these errors were encountered:

suhrig · 2023-07-10T07:42:02Z

Your understanding is correct. While they are listed in the column read_identifiers for both fusions, they are counted for only one of the fusions in the columns split_reads1/2 and discordant_mates.

All read-level filters (including the multimappers filter) work like this: they just exclude reads from being counted in the columns split_reads1/2 and discordant_mates (those are relevant to Arriba's internal filters). They never remove the reads from the read_identifiers column, however. You can think of it like this: the read count columns contain all high-quality reads which passed all filters. The column read_identifiers contains all supporting reads, including ones that were removed by filters. The manual is not sufficiently clear about this, sorry.

BTW, not only multimappers may lead to read identifiers being listed for multiple fusions. Discordant mates which are in the vicinity of breakpoints close to each other also often do.

acloop62hz · 2023-07-10T14:08:45Z

Thanks for your explanation, @suhrig! Is there any way to know those read identifiers that passed all filters?

suhrig · 2023-07-10T18:28:13Z

This is not possible, unfortunately. However, I don't think this is meaningful either. If a fusion made it to the main output file of Arriba, you should consider all reads listed in the read_identifiers column as supporting evidence for the fusion; not just the ones that Arriba considers high quality. The fact that the fusion passed all filters suggests that none of the supporting reads should have been removed in the first place.

acloop62hz · 2023-07-11T17:34:23Z

I see, thanks! One more question, I saw there's another issue where Arriba just picks one of the fusions indicated by multi-mapping reads "KANSL1-ARL17A/KANSL1-ARL17B fusion detection #165". I wonder if the difference between the two cases is that in "KANSL1-ARL17A/KANSL1-ARL17B", the alignment corresponding to KANSL1-ARL17A fusion are all discarded, thus no reads support KANSL1-ARL17A (or reads supporting KANSL1-ARL17A are moved by other filters), and while in the case I mentioned above, there exist both reads reduced to alignment supporting ISCA1--SLC30A1 and CTD-2561J22.1--SLC30A1

suhrig · 2023-07-14T14:06:34Z

I wonder if the difference between the two cases is that in "KANSL1-ARL17A/KANSL1-ARL17B", the alignment corresponding to KANSL1-ARL17A fusion are all discarded, thus no reads support KANSL1-ARL17A (or reads supporting KANSL1-ARL17A are moved by other filters)

Yes, that's most likely what happened in this case.

while in the case I mentioned above, there exist both reads reduced to alignment supporting ISCA1--SLC30A1 and CTD-2561J22.1--SLC30A1

After assignment of the multimapping reads to one of these two fusions, both fusions still have a sufficient number of supporting reads for Arriba to consider them real. However, I think Arriba made a mistake here. Compare the supporting reads of the two fusions: ISCA1--SLC30A1 has many split reads mapping to both fusion partners (columns split_reads1 and split_reads2) as well as discordant mates, whereas CTD-2561J22.1--SLC30A1 only has many split_reads1 and not much else. Having only one of the columns split_reads1, split_reads2, or discordant_mates set to a non-zero value is a characteristic of a false positive. I think that even the 247 split_reads assigned to the second fusion really belong to the first one and have just been misaligned and Arriba failed to recognize this. The second fusion is probably an artifact caused by sequence similarly between ISCA1 and CTD-2561J22.1.

acloop62hz · 2023-07-27T14:46:37Z

Thanks for your reply! After looking into the code, I think the problem might be that after merging adjacent fusions of CTD-2561J22.1--SLC30A1:
fusion filter split_read1 split_read2 discordant_mate
SLC30A1--CTD-2561J22.1--211749631--21564171 merge_adjacent 245 3 1
SLC30A1--CTD-2561J22.1--211749630--21564170 496 7 1
SLC30A1--CTD-2561J22.1--211749457--21564170 mismatches 0 0 0
SLC30A1--CTD-2561J22.1--211749614--21564174 mismatches 0 0 0
SLC30A1--CTD-2561J22.1--211749629--21564169 merge_adjacent 1 0 1
SLC30A1--CTD-2561J22.1--211749627--21564167 mismatches 0 0 0
SLC30A1--CTD-2561J22.1--211749521--21564187 0 0 1
SLC30A1--CTD-2561J22.1--211749628--21564168 mismatches 0 0 0
SLC30A1--CTD-2561J22.1--211749632--21564172 merge_adjacent 1 0 1

only the fusion SLC30A1--CTD-2561J22.1--211749630--21564170 is kept, but its spilt_read_list does not include reads from other fusions, thus when going through the list and discarding the multimappers, alignments in other fusions that are also multimappers are not discarded.

Is there any reason not to include reads into the list for other merged fusions? Should we include them so that the supporting read count would be more accurate?

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Can one read be used to support multiple fusions #203

Can one read be used to support multiple fusions #203

acloop62hz commented Jul 10, 2023 •

edited

Loading

suhrig commented Jul 10, 2023

acloop62hz commented Jul 10, 2023

suhrig commented Jul 10, 2023

acloop62hz commented Jul 11, 2023

suhrig commented Jul 14, 2023 •

edited

Loading

acloop62hz commented Jul 27, 2023 •

edited

Loading

Can one read be used to support multiple fusions #203

Can one read be used to support multiple fusions #203

Comments

acloop62hz commented Jul 10, 2023 • edited Loading

suhrig commented Jul 10, 2023

acloop62hz commented Jul 10, 2023

suhrig commented Jul 10, 2023

acloop62hz commented Jul 11, 2023

suhrig commented Jul 14, 2023 • edited Loading

acloop62hz commented Jul 27, 2023 • edited Loading

acloop62hz commented Jul 10, 2023 •

edited

Loading

suhrig commented Jul 14, 2023 •

edited

Loading

acloop62hz commented Jul 27, 2023 •

edited

Loading