Description of SRA archive file layout #863

noamteyssier · 2023-10-13T19:03:26Z

Hello

I was just wondering if there was a detailed description of an *.sra file layout?

I am interested in experimenting with building a tool to extract sequencing records from these files but I can't find a good resource of what this file actually is or how the sequencing data is stored within it.

Apologies if this is obvious but would appreciate a link to a resource if one exists.

Cheers

apredeus · 2023-11-09T12:08:25Z

I think SRA is closed source, unfortunately. Perhaps SRA tools team can clarify? I think it would have been great to publish a detailed description.

wraetz · 2023-11-09T14:32:55Z

SRA is not closed source; it is actually public domain and is developed in public here: https://github.com/ncbi/sra-tools and https://github.com/ncbi/ncbi-vdb. You need to understand both. There are C/C++/Java/Python bindings for the library. The physical "file layout" is very complex: it is a compressed columnar data store with its own language. Don't try to access it at that level. Use the language bindings if you want to write your own tool. The best way to understand what is inside is to use vdb-dump to explore the data layout, which is the same layout you can use from the language bindings.
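
To make the vdb-dump suggestion concrete, here is a rough sketch of driving it from Python. It assumes sra-tools is installed and `vdb-dump` is on your PATH. The helper names (`build_vdb_dump_command`, `dump`) are hypothetical, the accession is only a placeholder, and the `READ`/`QUALITY` column names are assumptions about what a given run exposes. Only the `-T` (table), `-R` (rows) and `-C` (columns) options of vdb-dump are used.

```python
import shutil
import subprocess


def build_vdb_dump_command(accession, rows=None, columns=None, table=None):
    """Build the argument list for a vdb-dump call (hypothetical helper).

    rows    -- row range string passed to -R, e.g. "1-2"
    columns -- list of column names, joined with commas for -C
    table   -- table name passed to -T
    """
    cmd = ["vdb-dump"]
    if table:
        cmd += ["-T", table]
    if rows:
        cmd += ["-R", rows]
    if columns:
        cmd += ["-C", ",".join(columns)]
    cmd.append(accession)
    return cmd


def dump(accession, **kwargs):
    """Run vdb-dump and return its text output (needs sra-tools installed)."""
    if shutil.which("vdb-dump") is None:
        raise RuntimeError("vdb-dump not found on PATH; install sra-tools first")
    cmd = build_vdb_dump_command(accession, **kwargs)
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return result.stdout


if __name__ == "__main__":
    # Placeholder accession; substitute a run you actually want to inspect.
    print(build_vdb_dump_command("SRR000001", rows="1-2", columns=["READ", "QUALITY"]))
```

Calling `dump("SRR000001", rows="1-2")` with no column list should show every column for the first two rows, which is a quick way to see the logical layout without touching the physical file format.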

sbooeshaghi · 2024-04-12T22:23:01Z

Hi @wraetz, is there information on the physical file layout which you describe as "very complex"? Additionally, given a set of FASTQ files, how is the file constructed? Of course I could try to parse this from various scripts in the repo, but it would be helpful for me to understand the structure from a manual or man page.

durbrow · 2024-04-15T15:32:20Z

The file layout is complicated and unimportant.

At a high level, it is a normalized database of genomics data organized into tables, with a consistent set of columns with a consistent set of datatypes, conforming to the INSDC SRA data model.

At a low level, it is an archive file.

And at levels in between, there are different abstractions.

We don't expect people who are accessing the data to need to deal with the lower level abstractions.

sbooeshaghi · 2024-04-15T17:27:29Z

Hi @durbrow,

Referring back to my previous question, could you please point me to documentation that describes the lower level structure of the SRA archive file? This is important to a project I am currently working on.

Thank you!
Sina B.

durbrow · 2024-04-15T18:08:42Z

There is no such document. It isn't necessary for accessing the data, or even useful for that. You are welcome to examine the source code to see how it is written.

stineaj · 2024-04-18T17:21:56Z

@sbooeshaghi is the physical layout necessary or would the logical layout be sufficient? Or perhaps it would make sense to discuss your project as much as you can to see what a good patch to connect the dots would be. We could converse by email if what you are working on is not ready to be in a public forum like this.

yaschenk · 2024-04-18T17:49:20Z

@sbooeshaghi: try the following cartoonish description of one of the flavors of SRA, the ones using compression by reference based on BAM input data:
https://ftp-trace.ncbi.nih.gov/sra/doc/csra-fileformat.ppsx
It is more than 10 years old, but still valid enough to give you an idea of the logical and physical layout. The majority of the example commands should still work on any SRA format file, not only the ones produced from BAM.

sbooeshaghi · 2024-04-19T18:50:10Z

Hi, up-to-date documentation of the SRA file format will help diagnose multiple reported issues for the SRA file format and will help facilitate various enhancements related to file parsing. Here are a few examples:

#452, The perennial problem of supporting gzip in fasterq-dump

#794, Increasing speed of SRA data conversion ~8.4x

#889, Potential data corruption for multiple single-cell assays

Are there plans to produce such documentation?

Thanks.

durbrow · 2024-04-22T18:02:51Z

#452 is a feature request for how fasterq-dump formats its output. It has nothing to do with the .sra file format.

#794 is a feature request to have an existing API perform like fasterq-dump does. It has nothing to do with the .sra file format.

#889 is an issue with what data series are stored. It is a policy issue concerning the SRA data model. It has nothing to do with the .sra file format. This is the code repository for the SRA toolkit, and you are talking to the developers. We do not make policy.

It appears that what you want to know is how to use the same APIs we use (from ncbi-vdb) in writing the tools. I would suggest you start with our Python bindings. Here is an example. It almost certainly won't work as-is, but it should give you the idea.

durbrow closed this as completed Apr 15, 2024
