-
Notifications
You must be signed in to change notification settings - Fork 3
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
derived dataset: mutation load #2
Comments
@jingchunzhu Hi, Does this issue want to measure TMB (Tumor Mutation Burden) for each sample(both of tumor samples and normal samples)? |
@MaleicAcid The goal is to first calculate the number of somatic mutations per sample. |
@jingchunzhu Although this is a bit tough for me now, I'm trying to understand. Could you please briefly explain the main process of calculating the number of somatic mutations for a sample?(It is best to take a specific sample as an example.) I found many of the existing softwares for calling somatic mutations(mutect, muse, strella, varscan etc.).Do I need to know them? |
files like this https://tcga.xenahubs.net/download/TCGA.LAML.sampleMap/mutation_wustl.gz is already the somatic mutation call results. You don't need to run the calling programs. Each row of the file is a somatic mutation, has format similar to the following, which you can read from the header row. You need to count how many mutations each sample has. To be more precise, you count how many non-synonymous mutations each sample has. The effect column will tell you if it is a non-synonymous mutation. Then you divide the count by the length of coding sequence in the human genome multiply is by 1 million to get the value mutation load = somatic non-synonymous mutations per megabase of coding sequence |
@jingchunzhu Thanks for your patient explanation very much! If it is convenient, I want to communicate in your native language. |
@MaleicAcid English please. |
@jingchunzhu Please forgive my poor biological knowledge.
Please tell me which ones belong to non-synonymous mutations. In addition, I saw gdc website has a download tool called gdc-client. Combined with the manifest file, gdc-client can download data in batch. Does Xena need such a downloader? |
@MaleicAcid To determine the length of coding sequence in the human genome, start by googling it. |
@MaleicAcid non-synonymous mutations is any type of mutation with a score 4,3,2 in code here: https://github.com/ucscXena/ucsc-xena-client/blob/master/js/models/mutationVector.js#L67 . |
@jingchunzhu I searched for a lot of information about genes, but I'm not sure if the results I found are correct. Please give me further guidance. Search engines tells me the gene information can be found on the specialized gene website (ncbi, geneBank, uniprot, etc.).I try to search the information about human gene on these website. One of the sites I visit is called genome.ucsc.edu.In the species options column I choose human.Then I type the name of the gene: NONO and clicke on the first link in the results page.Finally, I jump to this page. Above this page there is a line of such words "17,586 bp".I guess that's probably the length of the NONO gene.I go on googling and learned that 1bp is equivalent to 2 bases.So it means that the NONO gene has 2 * 17586 bases. I found the other gene lengths for the TCGA-AB-3011-03 sample as above. You can check the TCGA-AB-3011-03 sample in the file named mutation_wustl. #!/usr/bin/python
MB = 1000000
TCGA_AB_3011_03 = {
"tml": 0, # need to be calculated
"mutation_count": 6,
"gene_list": { # search from http://genome.ucsc.edu/, the unit is bp
"NONO": 17586,
"OR1C1": 945,
"IDH1": 18907,
"GTF3A": 11145,
"WNK4": 16259,
"TRPM4": 54096,
"NPM1": 23768
}
}
length = 0
for value in TCGA_AB_3011_03["gene_list"].values():
length += value
length = length*2 # 1bp is equivalent to 2 bases
TCGA_AB_3011_03["tml"] = TCGA_AB_3011_03["mutation_count"]/length*MB
print(TCGA_AB_3011_03["tml"]) # output: 21.02224153154037 I think if the length of the gene can only be searched from other sites, I should write a web crawler to get the data. Last but not least, what type of data is expected to be delivered? Whether the database data or data files like mutation_wustl. |
@MaleicAcid googling something like "human genome coding sequence length", and find publications or books excerpts. you can results such as in https://books.google.com/books?id=dSwWBAAAQBAJ&pg=PA266&lpg=PA266&dq=table+9.5+human+genome+and+human+gene+statistics&source=bl&ots=7AQm7z1ig4&sig=H8pVFYpGhnd6WrO2iv4Ei5FEbP8&hl=en&sa=X&ved=0ahUKEwje3qbc48LZAhVW5GMKHb3_DrkQ6AEIbzAN#v=onepage&q=table%209.5%20human%20genome%20and%20human%20gene%20statistics&f=false and http://kirschner.med.harvard.edu/files/bionumbers/Human%20genome%20and%20human%20gene%20statistics.pdf |
@MaleicAcid find another source for the human genome coding length. Typically find minimum of two credible sources. |
@jingchunzhu Taking gene CYLC1 as an example,it‘s length is 25552 bp on the ucsc genome browser, but ncbi says the value is 25575 bp. When different sources of data inconsistencies, which should be taken as standard? Look forward to your reply. |
@MaleicAcid there are a few issues. In your example, it is the whole gene's length, not coding sequence length. The second issue is that gene annotations do change due to who did the annotation, and genome build. It is not surprising to see difference. For human genome coding genes, it is relatively stable, meaning the difference between annotations will not be large. The last issue is that we are looking for total length of human coding sequences. You could look for a second credible source to give you that answer. You could also add up each gene coding sequence length. I thought it is easier for you to go with the first approach. |
Build derived datasets for TCGA
For each sample, number of mutations -- this measures tumor mutation load, mutations include all types of mutations.
The text was updated successfully, but these errors were encountered: