derived dataset: mutation load #2

Open
jingchunzhu opened this issue May 15, 2017 · 14 comments

Comments

@jingchunzhu
Collaborator

Build derived datasets for TCGA
For each sample, count the number of mutations. This measures tumor mutation load; the count includes all types of mutations.

@MaleicAcid

MaleicAcid commented Feb 13, 2018

@jingchunzhu Hi,
I am a junior studying software engineering in Shanghai. This issue interests me. Although I do not know much about biology, I am willing to do my best to learn the relevant background.

Is the goal of this issue to measure TMB (tumor mutation burden) for each sample, both tumor and normal?
If the goal is a precise calculation, should the effects of "somatic mutations" be removed before calculating?

@jingchunzhu
Collaborator Author

jingchunzhu commented Feb 13, 2018

@MaleicAcid The goal is to first calculate the number of somatic mutations per sample.
We assume there are no somatic mutations in normal samples.
mutation load = somatic non-synonymous mutations per megabase of coding sequence

@MaleicAcid

@jingchunzhu Although this is a bit tough for me right now, I'm trying to understand.
I searched through a lot of information, and it seems that I first need to download the MAF file. I originally wanted to download it from TCGA's official website, but I could not access their download links. Finally, I found the relevant data on Xena. One of the files I downloaded is:
https://tcga.xenahubs.net/download/TCGA.LAML.sampleMap/mutation_wustl.gz

Could you please briefly explain the main process of calculating the number of somatic mutations for a sample? (Ideally with a specific sample as an example.)

I found many existing tools for calling somatic mutations (MuTect, MuSE, Strelka, VarScan, etc.). Do I need to know them?
Looking forward to more guidance.

@jingchunzhu
Collaborator Author

Files like https://tcga.xenahubs.net/download/TCGA.LAML.sampleMap/mutation_wustl.gz already contain the somatic mutation call results. You don't need to run the calling programs.

Each row of the file is a somatic mutation, in a format similar to the following, which you can read from the header row.
Sample chromosome start end reference_base variant_base gene effect

You need to count how many mutations each sample has. To be more precise, count how many non-synonymous mutations each sample has. The effect column tells you whether a mutation is non-synonymous. Then divide the count by the length of coding sequence in the human genome and multiply it by 1 million to get the value:

mutation load = somatic non-synonymous mutations per megabase of coding sequence
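
A minimal Python sketch of the counting and scaling steps described above. It is not the project's implementation. The function names, the default column names (`Sample`, `effect`, taken from the header example in this comment), and the `NON_SYNONYMOUS` set are assumptions; check them against the real header row and the project's own effect scoring. The coding sequence length is left as a parameter because no value has been agreed on in this thread.

```python
import csv
import gzip
from collections import Counter

# Assumption: effect labels counted as non-synonymous. Verify this set
# against the scoring the project actually uses before relying on it.
NON_SYNONYMOUS = {
    "Frame_Shift_Del", "Frame_Shift_Ins", "In_Frame_Del", "In_Frame_Ins",
    "Missense_Mutation", "Nonsense_Mutation", "Nonstop_Mutation",
    "Splice_Site", "Translation_Start_Site",
}


def read_rows(path):
    """Yield one dict per mutation row from a gzipped, tab-separated file."""
    with gzip.open(path, "rt", newline="") as handle:
        yield from csv.DictReader(handle, delimiter="\t")


def count_non_synonymous(rows, sample_col="Sample", effect_col="effect"):
    """Count non-synonymous mutations per sample.

    Samples that only have other kinds of mutations are kept with a count of 0.
    Samples with no rows at all in the file cannot be seen here.
    """
    counts = Counter()
    for row in rows:
        counts[row[sample_col]] += 1 if row[effect_col] in NON_SYNONYMOUS else 0
    return counts


def mutation_load(counts, coding_length_bp):
    """Scale per-sample counts to mutations per megabase of coding sequence."""
    return {s: n * 1_000_000 / coding_length_bp for s, n in counts.items()}
```

Usage would look like `mutation_load(count_non_synonymous(read_rows("mutation_wustl.gz")), coding_length_bp=...)`, with the coding length filled in once it has been confirmed from credible sources.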

@MaleicAcid

MaleicAcid commented Feb 16, 2018

@jingchunzhu Thank you very much for your patient explanation!
My understanding of the project has become clearer, but I'm still not quite sure about some of its details.

If it is convenient, I want to communicate in your native language.

@jingchunzhu
Collaborator Author

@MaleicAcid English please.

@MaleicAcid

MaleicAcid commented Feb 17, 2018

@jingchunzhu Please forgive my poor biological knowledge.

  • I'm still confused about how to determine the value "the length of coding sequence in the human genome".
  • To judge the type of mutation, I tried to find all possible values for the effect column on TCGA's wiki page.
    They are:
    Frame_Shift_Del, Frame_Shift_Ins, In_Frame_Del, In_Frame_Ins, Missense_Mutation, Nonsense_Mutation, Silent, Splice_Site, Translation_Start_Site, Nonstop_Mutation, 3'UTR, 3'Flank, 5'UTR, 5'Flank, IGR, Intron, RNA, Targeted_Region

Please tell me which ones belong to non-synonymous mutations.

In addition, I saw that the GDC website has a download tool called gdc-client. Combined with a manifest file, gdc-client can download data in batches. Does Xena need such a downloader?

@jingchunzhu
Collaborator Author

jingchunzhu commented Feb 22, 2018

@MaleicAcid To determine the length of coding sequence in the human genome, start by googling it.

@jingchunzhu
Collaborator Author

@MaleicAcid A non-synonymous mutation is any type of mutation with a score of 4, 3, or 2 in the code here: https://github.com/ucscXena/ucsc-xena-client/blob/master/js/models/mutationVector.js#L67 .
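
As a rough illustration of that rule, the filter could be sketched in Python as below. The score values in `EXAMPLE_IMPACT` are invented placeholders and are not copied from `mutationVector.js`; the real table at the link above should be used instead. The sketch also assumes that no score is higher than 4, so "a score of 4, 3, or 2" is the same as "a score of at least 2". The function name and parameters are hypothetical.

```python
# Placeholder scores for illustration only. These are NOT the values from
# mutationVector.js; replace them with the table from the linked file.
EXAMPLE_IMPACT = {
    "Nonsense_Mutation": 4,
    "Frame_Shift_Del": 4,
    "Splice_Site": 3,
    "Missense_Mutation": 2,
    "Silent": 1,
    "Intron": 0,
}


def is_non_synonymous(effect, impact=EXAMPLE_IMPACT, min_score=2):
    """Return True when the effect's score is at least min_score.

    Effect labels missing from the table default to 0, so they are excluded.
    """
    return impact.get(effect, 0) >= min_score
```

Defaulting unknown labels to 0 is a conservative choice: an unrecognized effect string is left out of the count rather than silently inflating it.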

@MaleicAcid

MaleicAcid commented Feb 25, 2018

@jingchunzhu I searched through a lot of information about genes, but I'm not sure whether what I found is correct. Please give me further guidance.

Search engines tell me that gene information can be found on specialized gene websites (NCBI, GenBank, UniProt, etc.). I tried searching for information about human genes on these websites.

One of the sites I visited is genome.ucsc.edu. In the species options I chose human. Then I typed the gene name NONO and clicked the first link on the results page. Finally, I arrived at this page.

At the top of this page there is the text "17,586 bp". I guess that is probably the length of the NONO gene. I kept googling and learned that 1 bp is equivalent to 2 bases, so the NONO gene has 2 * 17586 bases.

I found the lengths of the other genes for the TCGA-AB-3011-03 sample in the same way. You can check the TCGA-AB-3011-03 sample in the file named mutation_wustl.
The following is a simple piece of code.

#!/usr/bin/env python3
MB = 1000000
TCGA_AB_3011_03 = {
	"tml": 0, # need to be calculated
	"mutation_count": 6,
	"gene_list": { # search from http://genome.ucsc.edu/, the unit is bp
		"NONO": 17586,
		"OR1C1": 945,
		"IDH1": 18907,
		"GTF3A": 11145,
		"WNK4": 16259,
		"TRPM4": 54096,
		"NPM1": 23768
	}
}

length = 0
for value in TCGA_AB_3011_03["gene_list"].values():
	length += value
length = length*2 # 1bp is equivalent to 2 bases

TCGA_AB_3011_03["tml"] = TCGA_AB_3011_03["mutation_count"]/length*MB
print(TCGA_AB_3011_03["tml"]) # output: 21.02224153154037

I think that if gene lengths can only be looked up on other sites, I should write a web crawler to get the data.

Last but not least, what type of data is expected to be delivered: database records, or data files like mutation_wustl?

@jingchunzhu
Collaborator Author

@MaleicAcid Find another source for the human genome coding length. Typically, find a minimum of two credible sources.

@MaleicAcid

@jingchunzhu Taking gene CYLC1 as an example, its length is 25552 bp on the UCSC Genome Browser, but NCBI says it is 25575 bp. When different data sources are inconsistent, which should be taken as the standard?

Looking forward to your reply.

@jingchunzhu
Collaborator Author

jingchunzhu commented Feb 28, 2018

@MaleicAcid There are a few issues. First, your example uses the whole gene's length, not the coding sequence length. Second, gene annotations change depending on who did the annotation and on the genome build, so it is not surprising to see differences. For human coding genes, annotations are relatively stable, meaning the differences between them will not be large. Last, we are looking for the total length of human coding sequences. You could look for a second credible source to give you that answer, or you could add up the coding sequence length of each gene. I think the first approach is easier for you.
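
To illustrate the difference between whole-gene length and coding sequence length, and what "adding up each gene's coding sequence length" could look like, here is a hedged Python sketch. It assumes each transcript is described by a coding start, a coding end, and lists of exon starts and ends in 0-based, half-open coordinates (the layout used by UCSC genePred-style gene tables). The function names and the coordinates in the usage note are hypothetical, and the sum assumes one transcript per gene with no overlapping genes.

```python
def cds_length(cds_start, cds_end, exon_starts, exon_ends):
    """Count coding bases in one transcript.

    Only the parts of each exon that fall inside [cds_start, cds_end) count,
    so introns and untranslated regions are excluded.
    """
    total = 0
    for start, end in zip(exon_starts, exon_ends):
        overlap = min(end, cds_end) - max(start, cds_start)
        if overlap > 0:
            total += overlap
    return total


def total_coding_length(transcripts):
    """Sum coding bases over (cds_start, cds_end, exon_starts, exon_ends) tuples.

    Assumes one transcript per gene; overlapping genes would be double counted.
    """
    return sum(cds_length(*t) for t in transcripts)
```

For a made-up transcript with exons at 100-200, 300-400, and 500-600 and a coding region from 150 to 550, `cds_length(150, 550, [100, 300, 500], [200, 400, 600])` gives 200 coding bases, while the whole gene spans 500 bases. That gap is why a gene length read from a browser page overstates the denominator.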
