rsem-extract-reference-transcripts fails with "Error Message: Strand is neither '+' nor '-'!" #220

Open
J-Moravec opened this issue Aug 21, 2024 · 0 comments

J-Moravec commented Aug 21, 2024

I am working on rice RNA-seq data using the nf-core/rnaseq pipeline (https://nf-co.re/rnaseq), which runs RSEM internally.

Here, RSEM fails on:

rsem-extract-reference-transcripts rsem/genome 0 GCF_034140825.1.filtered.gtf None 0 rsem/GCF_034140825.1.fna
The GTF file might be corrupted!
Stop at line : NC_011033.1    RefSeq  transcript  11024   315294  .   ?   .   gene_id "OrsajM_p01"; transcript_id "unassigned_transcript_653"; db_xref "GeneID:6450162"; exception "trans-splicing, RNA editing"; gbkey "mRNA"; gene "nad1"; locus_tag "OrsajM_p01"; transcript_biotype "mRNA";

The GTF2.2 specification, as far as I could find, does not list ? as a valid strand value, so I understand why RSEM performs this specification-based check.
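
To see how many records would trip this check before running RSEM, the annotation can be scanned for strand values other than + and -. This is a minimal sketch, not RSEM's own code: the function name `find_bad_strand_lines` is hypothetical, and it assumes a tab-separated, 9-column GTF with the strand in column 7.

```python
def find_bad_strand_lines(handle):
    """Yield (line_number, strand, line) for records whose strand is not '+' or '-'.

    Assumes a tab-separated GTF with at least 9 columns; comment lines,
    blank lines and shorter lines are skipped rather than reported.
    """
    for lineno, line in enumerate(handle, start=1):
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue
        strand = fields[6]  # column 7 of the GTF
        if strand not in ("+", "-"):
            yield lineno, strand, line.rstrip("\n")
```

Passing an open file handle for the GTF (for example `open("GCF_034140825.1.filtered.gtf")`) and printing each result lists the offending records with their line numbers.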

The reason for the ? is that some unusual splicing is happening in this mRNA (the record is annotated with the exception "trans-splicing, RNA editing"). This is beyond my current knowledge, but it looks like even the start codon and the stop codon are on different strands. The whole transcript is thus a patchwork of sequences from the positive and negative strands, so it cannot be assigned a single strand.

See https://www.ncbi.nlm.nih.gov/nuccore/NC_011033.1/, where this kind of complement(...) location appears for about 4 different genes.

And here is a view of the feature in the GTF file (first 8 columns):

NC_011033.1	RefSeq	gene	11024	11409	.	+	.
NC_011033.1	RefSeq	gene	239890	315294	.	+	.
NC_011033.1	RefSeq	transcript	11024	315294	.	?	.
NC_011033.1	RefSeq	exon	11024	11409	.	+	.
NC_011033.1	RefSeq	exon	241499	241580	.	-	.
NC_011033.1	RefSeq	exon	239890	240081	.	-	.
NC_011033.1	RefSeq	exon	251354	251412	.	-	.
NC_011033.1	RefSeq	exon	315036	315294	.	-	.
NC_011033.1	RefSeq	CDS	11024	11409	.	+	0
NC_011033.1	RefSeq	CDS	241499	241580	.	-	1
NC_011033.1	RefSeq	CDS	239890	240081	.	-	0
NC_011033.1	RefSeq	CDS	251354	251412	.	-	0
NC_011033.1	RefSeq	CDS	315036	315291	.	-	1
NC_011033.1	RefSeq	start_codon	11024	11026	.	+	0
NC_011033.1	RefSeq	stop_codon	315036	315038	.	-	0
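
The pattern in the listing above, exons of one transcript on both strands, can also be detected directly. The following is a rough sketch under stated assumptions: `mixed_strand_transcripts` is a hypothetical helper, it needs the full 9-column GTF (the listing above shows only 8 columns), and it relies on the `transcript_id "..."` attribute being present on exon records.

```python
import re
from collections import defaultdict

# Matches the transcript_id attribute in GTF column 9.
TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]*)"')

def mixed_strand_transcripts(lines):
    """Return {transcript_id: set of strands} for transcripts whose exons
    are annotated on more than one strand."""
    strands = defaultdict(set)
    for line in lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        # Only exon records with a complete attribute column are considered.
        if len(fields) < 9 or fields[2] != "exon":
            continue
        match = TRANSCRIPT_ID.search(fields[8])
        if match:
            strands[match.group(1)].add(fields[6])
    return {tid: s for tid, s in strands.items() if len(s) > 1}
```

Any transcript returned here is one for which a single + or - strand cannot be chosen without losing information.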

Since this is not an obscure organism but rice (and I had hoped that, working with a model organism for once, everything would be fine), should RSEM be able to handle this case?

Thanks,
-- Jirka
