Updated pseudocode #2

Open
wants to merge 6 commits into
base: test-branch
Conversation

letitiaismyname
Collaborator

@@ -6,8 +6,8 @@ TCGA VCF file Subset into breast, ovarian, colo-rectal:

Rows:
If chrom in range of genes in pathway
If pos in range of genes in pathway
Include the row in new file
If pos in range of genes in pathway
Contributor

This might be old, but here's where you need to test the chrom and pos together, with an `and` clause, since the location ranges you're looking for are defined by the combination of chrom and pos.
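
A minimal sketch of the combined test described above, in Python. The `Region` tuple, the `in_pathway` helper, and the coordinates are hypothetical and only illustrate the idea; real gene ranges would come from the pathway's gene list.

```python
from typing import NamedTuple


class Region(NamedTuple):
    chrom: str
    start: int  # inclusive
    end: int    # inclusive


def in_pathway(chrom: str, pos: int, regions: list[Region]) -> bool:
    """True if (chrom, pos) falls inside at least one gene region.

    chrom and pos are checked against the SAME region in one `and` test.
    """
    return any(r.chrom == chrom and r.start <= pos <= r.end for r in regions)


# Made-up coordinates, for illustration only.
regions = [Region("17", 1000, 2000), Region("13", 5000, 6000)]

hit = in_pathway("17", 1500, regions)   # chrom and pos match the same region
miss = in_pathway("13", 1500, regions)  # pos fits a region on a different chrom
```

Two separate checks ("is chrom one of the pathway chroms?" then "is pos inside any pathway range?") would accept `("13", 1500)`, because each condition is satisfied by a different gene. The combined test rejects it.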


Loop through columns in tissue type subset (above)
If genotype is not an empty field
Add a row of TCGA-ID (from column), and also Chrom, Pos, Ref, Alt, Genotype Info to the final file

Final file: TSV file indexed by individual, where each variant has its own row (with columns TCGA-ID, Chrom, Pos, Ref, Alt, Genotype Info)

Final file: TSV file organized by individual, where each variant has its own row (with columns TCGA-ID, Chrom, Pos, Ref, Alt, Genotype Info)
Contributor

Here's what I was confused about earlier: the difference between the TSV file and the final file. If the final file is what tells you which individuals have which variants, then you don't want the TCGA ID and genotype info in the TSV file (above). You'd want the TSV file to be a nonredundant file that lists each of the unique variants in the cohort.
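
One way to picture the two outputs is the sketch below. It assumes each input row is a dict with `CHROM`, `POS`, `REF`, `ALT` and one genotype field per TCGA ID, and that a blank genotype is an empty string. The function name, the field names, and the sample IDs are all hypothetical.

```python
def split_tables(rows, sample_ids):
    """Build a nonredundant variant list and a per-individual table."""
    unique_variants = []  # one row per variant, no sample information
    per_individual = []   # one row per (TCGA ID, variant) with a genotype
    seen = set()
    for row in rows:
        key = (row["CHROM"], row["POS"], row["REF"], row["ALT"])
        # Assumption: an empty string means "no genotype" for that sample.
        carriers = [(s, row[s]) for s in sample_ids if row.get(s)]
        if not carriers:
            continue  # nobody in this subset has a genotype here
        if key not in seen:
            seen.add(key)
            unique_variants.append(key)
        for sample, genotype in carriers:
            per_individual.append((sample, *key, genotype))
    return unique_variants, per_individual


# Made-up rows, for illustration only.
rows = [
    {"CHROM": "17", "POS": 1500, "REF": "A", "ALT": "G",
     "TCGA-01": "0/1", "TCGA-02": "1/1"},
    {"CHROM": "13", "POS": 5500, "REF": "C", "ALT": "T",
     "TCGA-01": "", "TCGA-02": "0/1"},
]
unique_variants, per_individual = split_tables(rows, ["TCGA-01", "TCGA-02"])
```

Here `unique_variants` carries only Chrom, Pos, Ref, Alt, while `per_individual` repeats a variant once for every individual that has a genotype for it.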


Final file: TSV file organized by individual, where each variant has its own row (with columns TCGA-ID, Chrom, Pos, Ref, Alt, Genotype Info)
Test to determine success:
Sort the TCGA ids in the final file in alphabetical order
Contributor

Good!

@@ -35,11 +38,13 @@ VCF File with set of unique variants:

Loop through rows in product of above tissue type subset
If genotype information is not blank
If chrom, pos, ref, and alt (combination of them all) are unique
If chrom, pos, ref, and alt (combination of them all) have not been added to the final vcf file
Contributor

This will help you make sure you got all the variants you wanted. You might also think about how you could test that you didn't get any variants you don't want. One way to do this would be to collect variant annotations, such as from OpenCRAVAT, and make sure you don't have any annotations for unexpected genes.
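
A rough sketch of that negative check, assuming the annotations have already been exported into a mapping from variant to gene symbol. The `unexpected_genes` helper, the mapping layout, and the example values are hypothetical; no annotation tool's API is used here.

```python
def unexpected_genes(variant_genes, pathway_genes):
    """Return genes that appear in the annotations but not in the pathway."""
    return {gene for gene in variant_genes.values() if gene not in pathway_genes}


# Made-up annotations keyed by (chrom, pos, ref, alt), for illustration only.
variant_genes = {
    ("17", 1500, "A", "G"): "BRCA1",
    ("13", 5500, "C", "T"): "BRCA2",
    ("17", 9000, "G", "A"): "TP53",
}
pathway_genes = {"BRCA1", "BRCA2"}

extra = unexpected_genes(variant_genes, pathway_genes)
```

An empty result would suggest the subset contains no variants outside the pathway; anything left in `extra` points at rows that slipped through the chrom/pos filter.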


Test to determine success:
In the tissue type subset, go through line by line and count the number of non-blank genotype fields
Compare this count with the number of lines in the final vcf file of unique variants, ideally counts should be the same
Contributor

Good!

2 participants