Description of SRA archive file layout #863

Closed
noamteyssier opened this issue Oct 13, 2023 · 10 comments

@noamteyssier

Hello

I was just wondering if there was a detailed description of an *.sra file layout?

I am interested in experimenting with building a tool to extract sequencing records from these files but I can't find a good resource of what this file actually is or how the sequencing data is stored within it.

Apologies if this is obvious but would appreciate a link to a resource if one exists.

Cheers

@apredeus

apredeus commented Nov 9, 2023

I think SRA is closed source, unfortunately. Perhaps SRA tools team can clarify? I think it would have been great to publish a detailed description.

@wraetz
Contributor

wraetz commented Nov 9, 2023

SRA is not closed source; it is actually public domain and is developed in public here: https://github.com/ncbi/sra-tools and https://github.com/ncbi/ncbi-vdb. You need to understand both. There are C/C++/Java/Python bindings for the library. The physical "file layout" is very complex: it is a compressed columnar data store with its own language. Don't try to access it at that level. Use the language bindings if you want to write your own tool. The best way to understand what is inside is to use vdb-dump to explore the data layout, which is the same layout you can use from the language bindings.
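
As a rough illustration of the vdb-dump exploration suggested above (an editorial sketch, not part of the original comment), the Python snippet below assembles and optionally runs a `vdb-dump` command. It assumes `vdb-dump` from sra-tools is installed and on `PATH`, and that the `-C` (columns), `-R` (rows), and `-f` (format) options behave as described in the tool's help output. The helper names `build_vdb_dump_command` and `run_vdb_dump`, the accession `SRR000001`, and the column names `READ` and `QUALITY` are placeholders chosen for illustration; the columns actually present depend on the run.

```python
import shutil
import subprocess
from typing import List, Optional, Sequence


def build_vdb_dump_command(
    accession: str,
    columns: Optional[Sequence[str]] = None,
    rows: Optional[str] = None,
    output_format: Optional[str] = None,
) -> List[str]:
    """Assemble an argument list for vdb-dump without running it.

    Hypothetical helper: the flags used are -C (columns), -R (rows)
    and -f (format), assumed from the vdb-dump help text.
    """
    cmd = ["vdb-dump"]
    if columns:
        cmd += ["-C", ",".join(columns)]  # comma-separated column names
    if rows:
        cmd += ["-R", rows]  # e.g. "1-2" for the first two rows
    if output_format:
        cmd += ["-f", output_format]  # e.g. "csv" (assumed to be supported)
    cmd.append(accession)
    return cmd


def run_vdb_dump(cmd: List[str]) -> str:
    """Run a prepared vdb-dump command and return its standard output."""
    if shutil.which(cmd[0]) is None:
        raise FileNotFoundError("vdb-dump not found on PATH; install sra-tools first")
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout


if __name__ == "__main__":
    # Placeholder accession and columns; adjust to the run you are exploring.
    command = build_vdb_dump_command("SRR000001", columns=["READ", "QUALITY"], rows="1-2")
    print(" ".join(command))
```

Building the argument list separately from running it makes it easy to inspect the command before any data is fetched; pass the list to `run_vdb_dump` once the toolkit is installed.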

@sbooeshaghi

Hi @wraetz, is there information on the physical file layout, which you describe as "very complex"? Additionally, given a set of FASTQ files, how is the file constructed? Of course, I could try to parse this from various scripts in the repo, but it would be helpful for me to understand the structure from a manual or man page.

@durbrow
Collaborator

durbrow commented Apr 15, 2024

The file layout is complicated and unimportant.

At a high level, it is a normalized database of genomics data organized into tables, with a consistent set of columns with a consistent set of datatypes, conforming to the INSDC SRA data model.

At a low level, it is an archive file.

And at levels in between, there are different abstractions.

We don't expect people who are accessing the data to need to deal with the lower level abstractions.

durbrow closed this as completed Apr 15, 2024
@sbooeshaghi

Hi @durbrow,

Referring back to my previous question, could you please point me to documentation that describes the lower level structure of the SRA archive file? This is important to a project I am currently working on.

Thank you!
Sina B.

@durbrow
Collaborator

durbrow commented Apr 15, 2024

There is no such document. It isn't necessary for accessing the data, or even useful for that. You are welcome to examine the source code to see how it is written.

@stineaj
Collaborator

stineaj commented Apr 18, 2024

@sbooeshaghi is the physical layout necessary or would the logical layout be sufficient? Or perhaps it would make sense to discuss your project as much as you can to see what a good patch to connect the dots would be. We could converse by email if what you are working on is not ready to be in a public forum like this.

@yaschenk
Contributor

@sbooeshaghi: try the following cartoonish description of one of the flavors of SRA, the ones using compression by reference based on BAM input data:
https://ftp-trace.ncbi.nih.gov/sra/doc/csra-fileformat.ppsx
It is more than 10 years old, but still valid enough to give you an idea of the logical and physical layout. The majority of the example commands should still work on any SRA format file, not only the ones produced from BAM.

@sbooeshaghi

Hi, up-to-date documentation of the SRA file format will help diagnose multiple reported issues and will help facilitate various enhancements related to file parsing. Here are a few examples:

#452, The perennial problem of supporting gzip in fasterq-dump

#794, Increasing speed of SRA data conversion ~8.4x

#889, Potential data corruption for multiple single-cell assays

Are there plans to produce such documentation?

Thanks.

@durbrow
Collaborator

durbrow commented Apr 22, 2024

#452 is a feature request for how fasterq-dump formats its output. It has nothing to do with the .sra file format.

#794 is a feature request to have an existing API perform like fasterq-dump does. It has nothing to do with the .sra file format.

#889 is an issue with what data series are stored. It is a policy issue concerning the SRA data model. It has nothing to do with the .sra file format. This is the code repository for the SRA toolkit, and you are talking to the developers. We do not make policy.

It appears that what you want to know is how to use the same APIs we use (from ncbi-vdb) in writing the tools. I would suggest you start with our Python bindings. Here is an example. It almost certainly won't work as-is, but it should give you the idea.
