Reproducibility of detection #256

sotaro-kanematsu · 2024-10-15T23:55:12Z

I would like your opinion on the reproducibility of fusion genes detected with Arriba. I ran two types of replicate analyses: 1. One library was prepared from a template, sequenced twice, and each run was analyzed; 2. Independent libraries were prepared from the same template, and each was sequenced and analyzed. For both cases, I examined the concordance rate of detected fusions between the duplicates. When only the sequencing was repeated, the concordance rate was about 50% (n=8); when the experiment was repeated from library preparation, it was about 25% (n=16). Could the fusions that did not match between the repeated analyses be false positives? I was particularly surprised that the concordance rate was only around 50% even when the same library was sequenced twice.
By the way, there is no significant difference in sequencing depth or quality between the replicates.

Thank you, as always, for your kind help.

Sota

suhrig · 2024-10-16T11:11:04Z

The "n"s you mention - are they sequencing run counts or fusion counts?

In any case, the concordance you observe doesn't seem unusual. Library creation and sequencing are both stochastic processes. Between two runs, you will not amplify/sequence the exact same molecules twice. A fusion that is clearly detectable in one run may be underrepresented in another. This means that the first run will yield evidence for a somewhat different set of fusions than the second. Hence, discordance does not necessarily indicate an artifact. The fusion may simply not be detectable in one of the runs. Of course, some of the discordant calls will be artifacts.
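To illustrate how sampling alone can produce this kind of discordance, here is a minimal sketch under a deliberately simple model. The assumptions are mine and not Arriba's actual filtering logic: the number of supporting reads for a fusion is Poisson-distributed around an expected value, a fusion counts as detected when at least `min_reads` supporting reads are observed (a hypothetical threshold), and the two runs are independent.

```python
import math


def detection_probability(expected_reads, min_reads):
    """P(at least `min_reads` supporting reads) under a Poisson model."""
    below_threshold = sum(
        math.exp(-expected_reads) * expected_reads**i / math.factorial(i)
        for i in range(min_reads)
    )
    return 1.0 - below_threshold


def expected_concordance(p):
    """P(detected in both runs | detected in at least one run).

    For two independent runs with the same detection probability p:
    p^2 / (1 - (1 - p)^2), which simplifies to p / (2 - p).
    """
    return p / (2.0 - p)


# A real fusion with ~2 expected supporting reads and a threshold of 2 reads
p = detection_probability(expected_reads=2.0, min_reads=2)
concordance = expected_concordance(p)
```

Under these assumptions, a genuine fusion that is expected to yield about two supporting reads is detected in roughly 59% of runs, and it appears in both of two runs only about 42% of the time it is seen at all. Weakly supported but real fusions can therefore drive concordance down without any artifacts being involved.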

You can reduce the discordance from the sequencing step by increasing the sequencing depth. At a certain depth, you should reach detection saturation. North of 50 million reads should suffice to reliably detect the high-confidence and medium-confidence fusions (provided that the duplication rate isn't too high and you use >=75nt paired-end sequencing). If you're unsure whether you have reached saturation, you can downsample the BAM file in silico to various depths and rerun Arriba. At some point, the saturation curve should flatten.
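The in silico downsampling can be sketched as follows. This is only an illustration that builds `samtools view` commands and does not run anything. It relies on the `-s SEED.FRACTION` subsampling option of `samtools view`. The output naming scheme, the function name, and the chosen fractions are hypothetical, and it assumes that the chimeric alignments Arriba needs are contained in the same BAM file.

```python
import os
import shlex


def downsample_commands(bam_path, fractions, seed=42):
    """Build one `samtools view` subsampling command per fraction (0 < f < 1)."""
    base = os.path.splitext(bam_path)[0]
    commands = []
    for fraction in fractions:
        text = f"{fraction:.3f}"
        # Reject values that are not strictly between 0 and 1 after rounding
        if not text.startswith("0.") or text == "0.000":
            raise ValueError(f"fraction must be between 0 and 1: {fraction}")
        digits = text.split(".")[1]
        # samtools encodes seed and fraction in one number: SEED.FRACTION
        out_path = f"{base}.sub{digits}.bam"  # hypothetical naming scheme
        commands.append(
            ["samtools", "view", "-b", "-s", f"{seed}.{digits}",
             "-o", out_path, bam_path]
        )
    return commands


commands = downsample_commands("sample.bam", [0.25, 0.5, 0.75])
command_lines = [shlex.join(cmd) for cmd in commands]
```

Each resulting BAM file would then be passed to Arriba with the same settings as the full-depth run, and the number of high- and medium-confidence fusions plotted against the fraction of reads kept. If the count stops increasing well before the full depth, saturation has likely been reached.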

When comparing the concordance between two samples, I recommend ignoring low-confidence fusions. They have a high false-positive rate. Their purpose is to provide fusion calls in situations where high sensitivity is more important than high specificity. Without external knowledge (e.g., structural variant calls from whole-genome sequencing or an expectation to find a certain fusion that is characteristic of a given cancer type), these fusions should be treated with caution. You should find that the concordance of the high-/medium-confidence fusions is better and that most of the discordance in your samples comes from the low-confidence fusions.
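A comparison restricted to high- and medium-confidence calls could look like the sketch below. It reads Arriba's tab-separated output using the `#gene1`, `gene2`, and `confidence` columns. Matching fusions by gene pair and measuring concordance as the Jaccard index (shared calls divided by all calls seen in either run) are my assumptions; your definition of concordance may differ, and the breakpoint columns could be added to the key for stricter matching.

```python
import csv


def load_fusions(path):
    """Read an Arriba fusions.tsv file into a list of row dictionaries."""
    with open(path, newline="") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))


def fusion_keys(rows, confidences=("high", "medium")):
    """Return the set of gene pairs whose confidence is in `confidences`."""
    return {
        (row["#gene1"], row["gene2"])
        for row in rows
        if row["confidence"] in confidences
    }


def concordance(keys_a, keys_b):
    """Jaccard index of two fusion sets, or None if both sets are empty."""
    union = keys_a | keys_b
    if not union:
        return None
    return len(keys_a & keys_b) / len(union)
```

With two result files, `concordance(fusion_keys(load_fusions("run1.tsv")), fusion_keys(load_fusions("run2.tsv")))` gives the high-/medium-confidence concordance (the file names are placeholders). Passing `confidences=("high", "medium", "low")` gives the figure for all calls, so the two numbers can be compared directly.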

Happy to answer any follow-up questions you may have.
