Updated pseudocode #2

Open: wants to merge 6 commits into base: test-branch
19 changes: 12 additions & 7 deletions pseudocode.txt
@@ -6,8 +6,8 @@ TCGA VCF file Subset into breast, ovarian, colo-rectal:

Rows:
If chrom in range of genes in pathway
If pos in range of genes in pathway
Include the row in new file
If pos in range of genes in pathway
Contributor comment:
This might be old, but here's where you need to test chrom and pos together, with an "and" clause, since the location ranges you're looking for are defined by the combination of chrom and pos.

Include the row in new file
Cols:
Use GDC website to make a map file to translate from TCGA ID to Tissue type
If tissue is x, use map file to filter out columns that correspond to sample numbers in tissue cohort x.
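
The reviewer's point about testing chrom and pos together can be sketched in Python as below. This is only a rough illustration: the `ranges` layout (chromosome mapped to a list of inclusive start/end pairs), the function names, and the coordinates in the usage note are hypothetical and not taken from the PR.

```python
from typing import Dict, Iterable, Iterator, List, Tuple

# Hypothetical layout: chromosome -> list of (start, end), both inclusive.
GeneRanges = Dict[str, List[Tuple[int, int]]]


def in_pathway(chrom: str, pos: int, ranges: GeneRanges) -> bool:
    """True only when chrom AND pos fall inside the same gene range."""
    return any(start <= pos <= end for start, end in ranges.get(chrom, []))


def filter_vcf_rows(lines: Iterable[str], ranges: GeneRanges) -> Iterator[str]:
    """Yield header lines unchanged, plus data rows located in a pathway gene."""
    for line in lines:
        if line.startswith("#"):
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        chrom, pos = fields[0], int(fields[1])
        if in_pathway(chrom, pos, ranges):
            yield line
```

With a made-up range such as `{"17": [(100, 200)]}`, a row at chromosome 17, position 150 is kept, while a row at chromosome 13, position 150 is dropped even though its position alone falls inside the numeric range.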
@@ -19,14 +19,17 @@ Final file: tissue type subset where rows are the variants in the ranges of the



TSV File indexed by individual, including TCGA-ID, Chrom, Pos, Ref, Alt, Genotype Info:
TSV File organized by individual, where each variant has its own row including TCGA-ID, Chrom, Pos, Ref, Alt, Genotype Info:

Loop through columns in tissue type subset (above)
If genotype is not an empty field
Add a row of TCGA-ID (from column), and also Chrom, Pos, Ref, Alt, Genotype Info to the final file

Final file: TSV file indexed by individual, where each variant has its own row (with columns TCGA-ID, Chrom, Pos, Ref, Alt, Genotype Info)

Final file: TSV file organized by individual, where each variant has its own row (with columns TCGA-ID, Chrom, Pos, Ref, Alt, Genotype Info)
Contributor comment:
Here's what I was confused about earlier: the difference between the TSV file and the final file. If the final file is what tells you which individuals have which variants, then you don't want the TCGA ID and genotype info in the TSV file (above). You'd want the TSV file to be a nonredundant file that lists each of the unique variants in the cohort.
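
For reference, the per-individual file described in this hunk (one row per individual per variant) could be produced roughly as follows. The sketch assumes the tissue subset keeps the standard nine fixed VCF columns before the sample columns, and it treats `""`, `"."`, and `"./."` as empty genotypes; that definition of "empty" is an assumption, not something the pseudocode states.

```python
FIXED_COLS = 9  # CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO, FORMAT
EMPTY = {"", ".", "./."}  # assumption: values treated as an empty genotype


def to_long_rows(vcf_lines):
    """Yield (TCGA-ID, Chrom, Pos, Ref, Alt, Genotype) per non-empty call."""
    sample_ids = []
    for line in vcf_lines:
        if line.startswith("##"):
            continue  # skip meta-information lines
        fields = line.rstrip("\n").split("\t")
        if line.startswith("#CHROM"):
            sample_ids = fields[FIXED_COLS:]  # TCGA IDs come from the header
            continue
        chrom, pos, _variant_id, ref, alt = fields[:5]
        for sample_id, genotype in zip(sample_ids, fields[FIXED_COLS:]):
            if genotype not in EMPTY:
                yield (sample_id, chrom, pos, ref, alt, genotype)
```

The sketch walks rows first and sample columns second, so the output is ordered by variant rather than by individual; sorting the result on the first field gives the per-individual ordering.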

Test to determine success:
Sort the TCGA ids in the final file in alphabetical order
Contributor comment:
Good!

Count the number of times a unique TCGA comes up in the sorted list
Compare the number of unique TCGA IDs to the number of IDs in the tissue cohort, ideally counts should be the same
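
The success test above might look something like this in Python. It assumes the final file has been read into tuples whose first element is the TCGA ID; `cohort_ids` is a hypothetical list of the IDs expected for the tissue cohort.

```python
from collections import Counter


def check_individual_coverage(long_rows, cohort_ids):
    """Count rows per TCGA ID and compare the IDs seen with the cohort."""
    counts = Counter(row[0] for row in long_rows)
    seen = set(counts)
    expected = set(cohort_ids)
    missing = expected - seen      # cohort members with no rows in the file
    unexpected = seen - expected   # IDs in the file that are not in the cohort
    return counts, missing, unexpected
```

Returning the missing and unexpected IDs, rather than only comparing two totals, shows which individuals account for a mismatch. An individual with no variants in the pathway ranges would appear in `missing` without anything having gone wrong, which may be why the pseudocode says the counts should "ideally" match.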



@@ -35,11 +38,13 @@ VCF File with set of unique variants:

Loop through rows in product of above tissue type subset
If genotype information is not blank
If chrom, pos, ref, and alt (combination of them all) are unique
If chrom, pos, ref, and alt (combination of them all) have not been added to the final vcf file
Contributor comment:
This will help you make sure you got all the variants you wanted. You might also think about how you could test that you didn't get any variants you don't want. One way to do this would be to collect variant annotations, such as from OpenCravat, and make sure you don't have any annotations for unexpected genes.

Add row to final file

Final File: VCF file indexed by variant, where each row is a unique variant with traditional vcf columns (Chrom, Pos, ID, Ref, Alt, Qual, Filter, Info, Format). Each variant appears in this file only once, regardless of how many individuals carry it.
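
A minimal sketch of the "have not been added to the final vcf file" check, using a set of (chrom, pos, ref, alt) keys. As before, the nine fixed columns and the set of values treated as blank genotypes are assumptions made for illustration.

```python
def unique_variant_rows(vcf_lines, fixed_cols=9, empty=frozenset({"", ".", "./."})):
    """Yield the fixed VCF columns once per unique (chrom, pos, ref, alt)."""
    seen = set()
    for line in vcf_lines:
        if line.startswith("#"):
            continue  # header lines are handled separately
        fields = line.rstrip("\n").split("\t")
        # Skip rows where every sample genotype is blank.
        if all(genotype in empty for genotype in fields[fixed_cols:]):
            continue
        key = (fields[0], fields[1], fields[3], fields[4])
        if key in seen:
            continue  # already written to the final file
        seen.add(key)
        yield "\t".join(fields[:fixed_cols])
```

Keeping the keys in a set makes each "already added?" lookup cheap, so the final file does not have to be re-read for every row.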

Test to determine success:
In the tissue type subset, go through line by line and count the number of non-blank genotype fields
Compare this count with the number of lines in the final vcf file of unique variants, ideally counts should be the same
Contributor comment:
Good!
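
The reviewer's earlier suggestion in this hunk, checking that no variant is annotated to an unexpected gene, could be sketched as below. The `variant_to_gene` mapping is a hypothetical structure that would have to be built from whatever annotation export is used; no particular OpenCravat output format is assumed, and the gene names in the usage note are placeholders.

```python
def unexpected_genes(variant_to_gene, pathway_genes):
    """Return the variants whose annotated gene is not in the pathway."""
    expected = set(pathway_genes)
    return {
        variant: gene
        for variant, gene in variant_to_gene.items()
        if gene not in expected
    }
```

For instance, with a pathway of `["BRCA1", "BRCA2"]`, a variant annotated to `TP53` would be returned as unexpected; an empty result suggests the row filter did not let in variants from outside the pathway.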




26 changes: 26 additions & 0 deletions pseudocode_analysis.txt
@@ -0,0 +1,26 @@
##Sept 30, 2020
## Pseudocode for analysis


Using file of unique germline variants in a given tissue cohort:

Run variants through OpenCravat to get Revel score and pathogenic interpretations from ClinVar, establish cut offs from paper (for missense variants)
Run variants through BayesDel (Bing Feng, Utah, for indel variants)
Identify variants that result in a loss of function mutation (using cut offs for Revel, tbd for BayesDel)
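
One possible shape for the Revel cut-off step, as a sketch only. The cut-off is passed in as a parameter because the pseudocode leaves the value to be taken from the paper; the `"revel"` field name and the per-variant dictionary layout are assumptions, and BayesDel is left out because its cut-off is still to be decided.

```python
def flag_by_revel(annotations, revel_cutoff):
    """Return the variant keys whose Revel score is at or above the cut-off.

    `annotations` maps a variant key to a dict with an optional "revel"
    score (hypothetical layout). Variants without a score are not flagged.
    """
    flagged = set()
    for variant, fields in annotations.items():
        score = fields.get("revel")
        if score is not None and score >= revel_cutoff:
            flagged.add(variant)
    return flagged
```

Variants with no Revel score, such as indels, are skipped here rather than treated as benign, so they can still be picked up by a separate BayesDel step.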

Using file with data per individual, per variant:

For each gene, use this file to identify the individuals who have LOF mutations (positions already identified above)
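
A rough sketch of the per-gene lookup, assuming the per-individual file has been loaded as (TCGA-ID, Chrom, Pos, Ref, Alt, Genotype) tuples, the LOF variants are a set of (Chrom, Pos, Ref, Alt) keys, and a hypothetical `variant_to_gene` mapping is available from the annotation step.

```python
from collections import defaultdict


def lof_carriers_by_gene(long_rows, lof_variants, variant_to_gene):
    """Map each gene to the set of individuals carrying an LOF variant in it."""
    carriers = defaultdict(set)
    for sample_id, chrom, pos, ref, alt, _genotype in long_rows:
        key = (chrom, pos, ref, alt)
        if key in lof_variants:
            carriers[variant_to_gene[key]].add(sample_id)
    return dict(carriers)
```

The sketch counts any listed genotype as carrying the variant, so it relies on homozygous-reference calls having been excluded when the per-individual file was built.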

Using file of somatic variants in a given tissue cohort:

For each gene, match individuals (sample IDs) to view both germline data and somatic data in one file
Identify individuals who have both LOF germline mutation and somatic variant with gistic score of either -1 or -2
Add these individuals to a two hit group (either list or separate file)
Identify individuals who have either LOF germline mutation or somatic variant with gistic score of either -1 or -2
Add these individuals to an at least one hit group (either list or separate file)
Identify individuals who are not in either two hit or one hit group
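
The grouping steps above map naturally onto set operations. The sketch below handles a single gene and assumes a hypothetical `gistic_scores` dictionary from sample ID to gistic score; the function names are illustrative.

```python
def somatic_loss_ids(gistic_scores):
    """Sample IDs whose gistic score is -1 or -2."""
    return {sample for sample, score in gistic_scores.items() if score in (-1, -2)}


def hit_groups(all_ids, germline_lof_ids, somatic_ids):
    """Split a cohort into two-hit, at-least-one-hit, and no-hit groups."""
    germline = set(germline_lof_ids)
    somatic = set(somatic_ids)
    two_hit = germline & somatic        # both germline LOF and somatic loss
    at_least_one = germline | somatic   # either one, including the two-hit group
    no_hit = set(all_ids) - at_least_one
    return two_hit, at_least_one, no_hit
```

As written in the pseudocode, the at-least-one-hit group contains the two-hit group. If an exactly-one-hit group is wanted for the later comparison, it is `at_least_one - two_hit`.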

Using somatic mutational score data:
Compare somatic mutational scores in individuals with two hits, one hit, and no hits.
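
The pseudocode does not say how the groups should be compared, so the following is just one simple starting point: the number of scored individuals and the median score per group. The `scores` mapping (sample ID to somatic mutational score) and the group names are assumptions; choosing a statistical test is left open.

```python
from statistics import median


def summarize_groups(scores, groups):
    """Return {group name: (number of scored individuals, median score)}."""
    summary = {}
    for name, ids in groups.items():
        # Individuals without a score are left out of the summary.
        values = [scores[sample] for sample in ids if sample in scores]
        summary[name] = (len(values), median(values) if values else None)
    return summary
```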