
fasterq-dump overloads memory #903

Open
mfansler opened this issue Jan 29, 2024 · 13 comments
Comments

@mfansler
I have installed sra-tools v3.0.10, distributed via Bioconda for the linux-64 platform. Running fasterq-dump occupies far more RAM than the flags would imply (default 100 MB/core), and more than I have ever seen before with identical commands. In previous versions, I always used 8 cores + 1 GB/core, with -t pointing to local scratch disk and VDB configured with plenty of room for the ncbi/sra cache. E.g.,

fasterq-dump -e 8 -S --include-technical -o /fscratch/fanslerm/rc11_d8_1_2.fastq -t /fscratch/fanslerm SRR9117967

Using the above for any SRRs from PRJNA544617 ends with LSF killing my jobs for exceeding memory. I have retried with:

  • 8 cores + 2 GB/core (16 GB total)
  • 6 cores + 4 GB/core (24 GB total)

all eventually killed for overallocating memory. I am currently running again with 4 cores + 8 GB/core (32 GB total).

This makes me suspect something is off in this version, possibly:

  • using /tmp/ instead of the designated -t path,
  • not respecting the --mem argument (or not reading the default), or
  • a memory leak.

Please let me know if I can provide any additional information.
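For reference, here is a sketch of the same invocation with the sort-buffer cap made explicit via -m/--mem (the 1G value is an example for illustration, not what I originally ran; the default is 100MB):

```shell
# Same command as above, with an explicit sort-buffer memory cap (-m/--mem).
# Note: observed total usage can still exceed threads x mem-limit,
# which is the behavior this issue is about.
fasterq-dump -e 8 -m 1G -S --include-technical \
  -t /fscratch/fanslerm \
  -o /fscratch/fanslerm/rc11_d8_1_2.fastq \
  SRR9117967
```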

@mfansler
Author

I also tried running on a local Docker (mambaorg/micromamba:1.5.6) rather than HPC, with -e2 and 16 GB total on the container. This was also killed.

@OOAAHH

OOAAHH commented Mar 4, 2024

I ran into a similar problem. My command: sratoolkit.3.0.10-centos_linux64/bin/fasterq-dump.3.0.10 --split-3 ./ERR4027871.sra --include-technical -O ~/TOS/output -v -p

[screenshot 2024-03-04 11:53:50]

My admin told me it was definitely an OOM issue. My HPC node went down, the problem keeps recurring, and I also suspect the newer version, but I'm not sure how to find proper evidence of this.

@mfansler
Author

mfansler commented Mar 4, 2024

For completeness, I did eventually get it to complete with the 4 core and 8GB/core configuration. I expect this will be dependent on the size of the data.

@mfansler
Author

mfansler commented Mar 4, 2024

@OOAAHH I was able to run your example without any issue. The SRA file is 14GB, and unpacked it leads to a 26GB FASTQ file. Are you sure you are not running out of disk quota?

Some things I see: your example does not provide a scratch space for temporary files, so they will be written to a temporary folder in the current directory. Also, unless ~/TOS/output is symlinked elsewhere, it is under your home directory (~/), which on typical HPC clusters is capped around 100 GB. Lastly, have you configured VDB so that the NCBI cache is not under your home directory (the default)? Under worst-case assumptions, this single operation could occupy up to 75 GB of disk at peak.

It should further be noted that this particular data was uploaded as an aligned BAM. Dumping out a FASTQ file from a BAM-derived SRA file is mostly useless for scRNA-seq because any cell barcodes and UMIs will only be in the tags and not get properly dumped out. I don't know what you plan with the data, but for processing as scRNA-seq you are likely better off downloading the BAM (and .bai) directly from the ENA (see ERR4027871).
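A minimal sketch of relocating the VDB cache off of user home (the scratch path below is illustrative, not a setting from this thread):

```shell
# Point the SRA/VDB public cache at scratch instead of $HOME,
# so prefetch/fasterq-dump cache files don't exhaust the home quota.
# Adjust the path to your cluster's scratch filesystem.
vdb-config --set /repository/user/main/public/root=/fscratch/$USER/ncbi
```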

@OOAAHH

OOAAHH commented Mar 4, 2024

First of all, thank you for your prompt and detailed response. Your insights have been incredibly helpful and have shed light on several oversight areas in my approach.

  • Disk Quota and Cache Settings: You're absolutely right; I hadn't fully considered the disk quota and the cache parameter settings. I've been so focused on monitoring my memory usage that I overlooked the capacity of the disk. Based on your advice, I will start specifying a scratch space for temporary files in my commands and configuration to manage disk space more efficiently.

  • Data for scRNA-seq Projects: Also, you've made an excellent point regarding the use of data with UMIs and barcodes for my large-scale single-cell atlas project. It appears I may have encountered issues with some of the .bai files, which complicates the process. Following your suggestion, I will explore downloading the necessary indexed data directly from BioStudies:E-MTAB-8221.

@mfansler
Author

mfansler commented Mar 4, 2024

Glad to help. Fortunately, the .bai files shouldn't be essential - one can reindex with samtools index to generate new ones.
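A minimal sketch (the BAM file name is assumed from the accession; samtools index requires a coordinate-sorted BAM):

```shell
# Regenerate the index; writes ERR4027871.bam.bai alongside the BAM
samtools index ERR4027871.bam

# If the BAM is not coordinate-sorted, sort it first:
# samtools sort -o ERR4027871.sorted.bam ERR4027871.bam
# samtools index ERR4027871.sorted.bam
```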

@OOAAHH

OOAAHH commented Mar 6, 2024

I hope this message finds you well. I wanted to take a moment to update you on the significant progress I've made, thanks in large part to your invaluable advice and guidance.

Following your suggestions, I revisited my BAM files and utilized samtools to reindex them and examine the metadata more closely. This process was incredibly enlightening; not only was I able to generate new .bai files successfully, but I also uncovered crucial information embedded within the BAM files. The metadata and initial read segments revealed essential details such as cell barcodes, UMIs, and sample identifiers - precisely the data I needed for my single-cell RNA sequencing analysis.

Discovering this information was particularly critical for me, given the challenging network environment I am operating in, which makes downloading genomic data quite difficult. Being able to extract and utilize data already within my possession has saved me a tremendous amount of time!
My commands:

samtools view -H
[screenshot 2024-03-06 16:02:19: header output]

samtools view my.bam | head
[screenshot 2024-03-06 16:44:09: first reads showing barcode/UMI tags]
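The checks above can be sketched as follows (the tag names CB/UB/CR/UR are the 10x Genomics conventions, an assumption here; other pipelines may use different tags):

```shell
# Inspect the header for program lines and sample metadata
samtools view -H my.bam | head

# Peek at per-read tags; CB/UB (corrected) and CR/UR (raw) are 10x-style
# cell barcode / UMI tags -- if they appear, the single-cell
# information is recoverable from the BAM itself.
samtools view my.bam | head -n 5 | tr '\t' '\n' | grep -E '^(CB|UB|CR|UR):'
```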

@permia

permia commented Sep 3, 2024


I'm having the same issue that causes the node to go down. This doesn't seem to be because of the size of the SRA file. After running many files with the same script, only a few files are like this.

@permia

permia commented Sep 11, 2024


In this situation, use fastq-dump instead of fasterq-dump.

@OOAAHH

OOAAHH commented Sep 11, 2024

> In this situation, use fastq-dump instead of fasterq-dump.

Thank you for your attention! In fact, after encountering this issue I adopted a new computational approach: I now complete all my computational tasks inside virtual containers (K8s virtualization). During troubleshooting I found a key issue: fasterq-dump requires local storage as a cache, which can lead to poor behavior in certain constrained computing environments, depending on how the environment is configured. This is also why I now strongly recommend that bioinformatics researchers (and researchers in other fields) use containerization, especially if, like me, you have hundreds of thousands of SRA files to process. Container-based technology makes such highly repetitive computation much easier, and you get a consistent computing experience and consistent results on any device, anywhere.
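A hedged sketch of such a containerized run (the image name and mount paths are placeholders, not the actual setup described above; an accession from this thread is used as the example):

```shell
# Run fasterq-dump inside a container with an explicit memory cap and a
# mounted scratch volume for the local temp/cache files it requires.
# "my-sra-tools-image" is a placeholder for any image with sra-tools installed.
docker run --rm --memory 16g \
  -v /path/to/scratch:/scratch \
  my-sra-tools-image \
  fasterq-dump -e 2 -t /scratch -O /scratch SRR9117967
```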

@mfansler
Author

@permia please show the command you use and indicate at least one accession (SRR).

@permia

permia commented Sep 12, 2024

The command I used is correct. fasterq-dump does encounter issues when processing certain seemingly random SRA files, so providing examples would not be meaningful.
fastq-dump does not have this problem, possibly because fastq-dump does not generate temporary files. This may be the issue @OOAAHH mentioned.
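A sketch of the fastq-dump fallback (its flags differ from fasterq-dump's; note fastq-dump includes technical reads by default unless --skip-technical is given):

```shell
# fastq-dump streams output without the large sort/temp files that
# fasterq-dump uses, at the cost of speed; accession from this thread.
fastq-dump --split-3 --gzip -O ~/TOS/output ERR4027871.sra
```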

@mfansler
Author

mfansler commented Sep 12, 2024

@permia having multiple examples of failures can be valuable to developers. This thread is about possible memory issues in recent versions of fasterq-dump and it would be nice to have additional clearly-documented cases similar to the one originally reported. "Clearly-documented" means not only showing the command used, but also reporting on the version and additional system information.

Note that @OOAAHH did not, in the end, have the same issue; theirs turned out to be about disk space and managing temporary scratch space, and was ultimately resolved in an orthogonal way.
