Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

using own fa file and gtf file error: gzip: stdin: not in gzip format #250

Open
ppxinyi opened this issue Sep 5, 2024 · 8 comments
Open

Comments

@ppxinyi
Copy link

ppxinyi commented Sep 5, 2024

When I zip the Homo_sapiens.GRCh38.dna.primary_assembly.fa to .gz file and add this link in the download.sh, it will show the error: gzip: stdin: not in gzip format.

If I use .fa file directly, it will have another error:
EXITING because of INPUT ERROR: the file format of the genomeFastaFile: my_genomeviral.fa is not fasta: the first character is '
' (10), not '>'.
Solution: check formatting of the fasta file. Make sure the file is uncompressed (unzipped).

I have check the first character:
head Homo_sapiens.GRCh38.dna.primary_assembly.fa

1 dna:chromosome chromosome:GRCh38:1:1:248956422:1 REF
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN

Do you have some suggestion for this problem when I want to use the diy fa file and gtf file?

@suhrig
Copy link
Owner

suhrig commented Sep 5, 2024

When I zip the Homo_sapiens.GRCh38.dna.primary_assembly.fa

What command did you use to compress the file? You must use gzip for this, not zip.

EXITING because of INPUT ERROR: the file format of the genomeFastaFile: my_genomeviral.fa is not fasta: the first character is '
' (10), not '>'.

Your FastA file appears to start with a line break. Remove the line break and make sure that every line containing a chromosome name begins with a >, for example:

>1 dna:chromosome chromosome:GRCh38:1:1:248956422:1 REF
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN

@ppxinyi
Copy link
Author

ppxinyi commented Sep 5, 2024

When I zip the Homo_sapiens.GRCh38.dna.primary_assembly.fa

What command did you use to compress the file? You must use gzip for this, not zip.

EXITING because of INPUT ERROR: the file format of the genomeFastaFile: my_genomeviral.fa is not fasta: the first character is '
' (10), not '>'.

Your FastA file appears to start with a line break. Remove the line break and make sure that every line containing a chromosome name begins with a >, for example:

>1 dna:chromosome chromosome:GRCh38:1:1:248956422:1 REF
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN

sorry for missing copy the >, I think I alreday have such format, but it still show the same error :

EXITING because of INPUT ERROR: the file format of the genomeFastaFile: my_genomeviral.fa is not fasta: the first character is '
' (10), not '>'.
Solution: check formatting of the fasta file. Make sure the file is uncompressed (unzipped).

Sep 05 00:16:05 ...... FATAL ERROR, exiting

1 dna:chromosome chromosome:GRCh38:1:1:248956422:1 REF
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN

@ppxinyi
Copy link
Author

ppxinyi commented Sep 5, 2024

I just find the > can not be copies

@suhrig
Copy link
Owner

suhrig commented Sep 5, 2024

Your FastA file starts with a line break. You must remove the line break.

sed -i '/^$/D' my_genomeviral.fa

@ppxinyi
Copy link
Author

ppxinyi commented Sep 5, 2024

Your FastA file starts with a line break. You must remove the line break.

sed -i '/^$/D' my_genomeviral.fa

yes. I have use this before but it still show the same error

head Homo_sapiens.GRCh38.dna.primary_assembly.fa

1 dna:chromosome chromosome:GRCh38:1:1:248956422:1 REF
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN

@suhrig
Copy link
Owner

suhrig commented Sep 5, 2024

Hm, it probably has something to do with how you added the file to the download_references.sh script. Can you attach your modified script?

@ppxinyi
Copy link
Author

ppxinyi commented Sep 5, 2024 via email

@suhrig
Copy link
Owner

suhrig commented Sep 9, 2024

Sorry for the show response. I am currently busy. I will have more time next week.

In the meantime, have you tried building the STAR index yourself? What the script does is actually not so complicated. All it does is concatenate the RefSeq viruses to the human genome and then builds a STAR index from it. Maybe it's the easiest of you build the index manually?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants