derived dataset: mutation load #2

jingchunzhu · 2017-05-15T22:53:04Z

Build derived datasets for TCGA
For each sample, count the number of mutations. This measures tumor mutation load; the count includes all types of mutations.

MaleicAcid · 2018-02-13T17:07:57Z

@jingchunzhu Hi,
I am a junior student in software engineering in Shanghai. This issue interests me. Although I do not know much about biology, I am willing to do my best to learn the relevant background.

Does this issue aim to measure TMB (Tumor Mutation Burden) for each sample (both tumor samples and normal samples)?
If the goal is a precise calculation, should the code remove the effects of "somatic mutations" before calculating?

jingchunzhu · 2018-02-13T19:08:59Z

@MaleicAcid The goal is to first calculate the number of somatic mutations per sample.
We assume there are no somatic mutations in normal samples.
mutation load = somatic non-synonymous mutations per megabase of coding sequence

MaleicAcid · 2018-02-14T18:00:13Z

@jingchunzhu Although this is a bit tough for me now, I'm trying to understand.
I read through a lot of material, and it seems that I first need to download the MAF file. I originally wanted to download the files from TCGA's official website, but I could not access their download links properly. Finally, I found the relevant data on Xena. One of the files I downloaded is:
https://tcga.xenahubs.net/download/TCGA.LAML.sampleMap/mutation_wustl.gz

Could you please briefly explain the main process of calculating the number of somatic mutations for a sample? (It would be best to take a specific sample as an example.)

I found many existing tools for calling somatic mutations (MuTect, MuSE, Strelka, VarScan, etc.). Do I need to know them?
Looking forward to more guidance.

jingchunzhu · 2018-02-14T18:14:59Z

Files like this https://tcga.xenahubs.net/download/TCGA.LAML.sampleMap/mutation_wustl.gz are already the somatic mutation call results. You don't need to run the calling programs.

Each row of the file is a somatic mutation, in a format similar to the following, which you can read from the header row.
Sample chromosome start end reference_base variant_base gene effect
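
As a rough illustration of reading such a file, here is a minimal sketch. It assumes the download is gzipped, tab-separated text whose first line is the header; the function names and the example rows are made up for illustration, and the real column names should be taken from the file's own header row.

```python
import csv
import gzip
import io


def parse_mutation_rows(handle):
    """Yield one dict per mutation row, keyed by the header row's column names."""
    # Assumption: columns are tab-separated and the first line is the header.
    yield from csv.DictReader(handle, delimiter="\t")


def read_mutation_file(path):
    """Read a gzipped mutation file (e.g. mutation_wustl.gz) into a list of rows."""
    with gzip.open(path, "rt", newline="") as handle:
        return list(parse_mutation_rows(handle))


# Made-up rows that only imitate the layout described above.
example = io.StringIO(
    "Sample\tchromosome\tstart\tend\treference_base\tvariant_base\tgene\teffect\n"
    "SAMPLE_A\tchr2\t100\t100\tC\tT\tGENE1\tMissense_Mutation\n"
    "SAMPLE_A\tchr5\t200\t200\tG\tA\tGENE2\tSilent\n"
)
rows = list(parse_mutation_rows(example))
print(rows[0]["Sample"], rows[0]["effect"])
```

Keeping the parser separate from the gzip handling makes it easy to try the logic on a few hand-written rows before pointing it at a real download.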

You need to count how many mutations each sample has. To be more precise, you count how many non-synonymous mutations each sample has. The effect column will tell you if it is a non-synonymous mutation. Then you divide the count by the length of coding sequence in the human genome and multiply it by 1 million to get the value:

mutation load = somatic non-synonymous mutations per megabase of coding sequence
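
The calculation described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the set of effect labels treated as non-synonymous is a guess for demonstration and is not taken from the Xena code, the column names follow the example header, and `coding_length_bp` must be supplied from a credible source (the value in the usage lines is a placeholder, not a sourced figure).

```python
# Assumption: effect labels counted as non-synonymous in this sketch.
NON_SYNONYMOUS_EFFECTS = frozenset({
    "Missense_Mutation", "Nonsense_Mutation", "Nonstop_Mutation",
    "Frame_Shift_Del", "Frame_Shift_Ins", "In_Frame_Del", "In_Frame_Ins",
    "Splice_Site", "Translation_Start_Site",
})


def mutation_load(rows, coding_length_bp, sample_col="Sample", effect_col="effect"):
    """Return {sample: non-synonymous mutations per megabase of coding sequence}."""
    counts = {}
    for row in rows:
        sample = row[sample_col]
        # Register every sample seen, so samples with only silent calls get 0.
        counts.setdefault(sample, 0)
        if row[effect_col] in NON_SYNONYMOUS_EFFECTS:
            counts[sample] += 1
    return {s: n * 1_000_000 / coding_length_bp for s, n in counts.items()}


# Made-up rows; 2_000_000 is a placeholder coding length chosen for easy arithmetic.
example_rows = [
    {"Sample": "SAMPLE_A", "effect": "Missense_Mutation"},
    {"Sample": "SAMPLE_A", "effect": "Silent"},
    {"Sample": "SAMPLE_A", "effect": "Nonsense_Mutation"},
    {"Sample": "SAMPLE_B", "effect": "Silent"},
]
loads = mutation_load(example_rows, coding_length_bp=2_000_000)
print(loads)
```

Samples that have no row at all in the mutation file will not appear in the result, so a complete dataset would also need the full sample list from elsewhere.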

MaleicAcid · 2018-02-16T09:41:19Z

@jingchunzhu Thank you very much for your patient explanation!
My understanding of the project has become clearer. But I'm still not quite sure about some of the details of this project.

If it is convenient, I want to communicate in your native language.

jingchunzhu · 2018-02-16T16:50:19Z

@MaleicAcid English please.

MaleicAcid · 2018-02-17T16:01:02Z

@jingchunzhu Please forgive my poor biological knowledge.

I'm still confused about how to determine the value "the length of coding sequence in the human genome".
In order to judge the type of mutation, I tried to find all possible values for the effect column on TCGA's wiki page.
They are:
Frame_Shift_Del, Frame_Shift_Ins, In_Frame_Del, In_Frame_Ins, Missense_Mutation, Nonsense_Mutation, Silent, Splice_Site, Translation_Start_Site, Nonstop_Mutation, 3'UTR, 3'Flank, 5'UTR, 5'Flank, IGR, Intron, RNA, Targeted_Region

Please tell me which ones belong to non-synonymous mutations.

In addition, I saw that the GDC website has a download tool called gdc-client. Combined with a manifest file, gdc-client can download data in batches. Does Xena need such a downloader?

jingchunzhu · 2018-02-22T19:56:35Z

@MaleicAcid To determine the length of coding sequence in the human genome, start by googling it.

jingchunzhu · 2018-02-22T19:59:29Z

@MaleicAcid Non-synonymous mutations are any type of mutation with a score of 4, 3, or 2 in the code here: https://github.com/ucscXena/ucsc-xena-client/blob/master/js/models/mutationVector.js#L67 .

MaleicAcid · 2018-02-25T09:50:33Z

@jingchunzhu I searched for a lot of information about genes, but I'm not sure if the results I found are correct. Please give me further guidance.

Search engines tell me that gene information can be found on specialized gene websites (NCBI, GenBank, UniProt, etc.). I tried to search for information about human genes on these websites.

One of the sites I visited is genome.ucsc.edu. In the species options column I chose human. Then I typed the name of the gene, NONO, and clicked on the first link in the results page. Finally, I landed on this page.

At the top of this page there is a line reading "17,586 bp". I guess that is probably the length of the NONO gene. I kept googling and learned that 1 bp is equivalent to 2 bases, so it means that the NONO gene has 2 * 17586 bases.

I found the other gene lengths for the TCGA-AB-3011-03 sample in the same way. You can check the TCGA-AB-3011-03 sample in the file named mutation_wustl.
The following gives a simple piece of code.

#!/usr/bin/env python3
MB = 1000000
TCGA_AB_3011_03 = {
	"tml": 0, # need to be calculated
	"mutation_count": 6,
	"gene_list": { # search from http://genome.ucsc.edu/, the unit is bp
		"NONO": 17586,
		"OR1C1": 945,
		"IDH1": 18907,
		"GTF3A": 11145,
		"WNK4": 16259,
		"TRPM4": 54096,
		"NPM1": 23768
	}
}

length = 0
for value in TCGA_AB_3011_03["gene_list"].values():
	length += value
length = length*2 # 1bp is equivalent to 2 bases

TCGA_AB_3011_03["tml"] = TCGA_AB_3011_03["mutation_count"]/length*MB
print(TCGA_AB_3011_03["tml"]) # output: 21.02224153154037

I think that if the gene lengths can only be looked up on other sites, I should write a web crawler to get the data.

Last but not least, what type of data is expected to be delivered? Database records, or data files like mutation_wustl?

jingchunzhu · 2018-02-26T05:15:03Z

@MaleicAcid Try googling something like "human genome coding sequence length" and look for publications or book excerpts. You can find results such as https://books.google.com/books?id=dSwWBAAAQBAJ&pg=PA266&lpg=PA266&dq=table+9.5+human+genome+and+human+gene+statistics&source=bl&ots=7AQm7z1ig4&sig=H8pVFYpGhnd6WrO2iv4Ei5FEbP8&hl=en&sa=X&ved=0ahUKEwje3qbc48LZAhVW5GMKHb3_DrkQ6AEIbzAN#v=onepage&q=table%209.5%20human%20genome%20and%20human%20gene%20statistics&f=false and http://kirschner.med.harvard.edu/files/bionumbers/Human%20genome%20and%20human%20gene%20statistics.pdf

jingchunzhu · 2018-02-26T21:26:49Z

@MaleicAcid Find another source for the human genome coding length. Typically, find a minimum of two credible sources.

MaleicAcid · 2018-02-28T09:15:06Z

@jingchunzhu Taking gene CYLC1 as an example, its length is 25552 bp on the UCSC genome browser, but NCBI says the value is 25575 bp. When different data sources are inconsistent, which should be taken as the standard?

Looking forward to your reply.

jingchunzhu · 2018-02-28T20:22:49Z

@MaleicAcid There are a few issues. First, the value in your example is the whole gene's length, not the coding sequence length. Second, gene annotations change depending on who did the annotation and on the genome build, so it is not surprising to see differences. For human protein-coding genes, annotations are relatively stable, meaning the difference between annotations will not be large. The last issue is that we are looking for the total length of human coding sequences. You could look for a second credible source to give you that answer. You could also add up each gene's coding sequence length. I think it would be easier for you to go with the first approach.
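
To make the difference between whole-gene length and coding sequence length concrete, here is a minimal sketch of the second approach for a single transcript. The coordinates are made up, the function name is hypothetical, and the intervals are assumed to be 0-based and half-open (start included, end excluded), as in UCSC gene tables. A length in bp is used as-is, since it already counts sequence positions; no doubling is needed.

```python
def coding_length(exons, cds_start, cds_end):
    """Sum the exon bases that fall inside the coding interval [cds_start, cds_end)."""
    total = 0
    for exon_start, exon_end in exons:
        # Clip each exon to the coding interval; UTR-only exons contribute nothing.
        overlap = min(exon_end, cds_end) - max(exon_start, cds_start)
        if overlap > 0:
            total += overlap
    return total


# Made-up transcript: three exons, with UTR sequence at both ends.
toy_exons = [(1000, 1500), (3000, 3400), (8000, 9000)]
gene_length = toy_exons[-1][1] - toy_exons[0][0]  # includes introns and UTRs
cds_length = coding_length(toy_exons, cds_start=1200, cds_end=8300)
print(gene_length, cds_length)
```

Summing this over the genome would also require choosing one transcript per gene and handling overlapping genes, which is one reason a published total from two credible sources is the simpler route.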
