error running make_se_from_files using diann pg matrix #14

Open
jflucier opened this issue Aug 1, 2024 · 6 comments
Labels
enhancement New feature or request

Comments


jflucier commented Aug 1, 2024

Hi,

When I pass my DIA-NN result file to the make_se_from_files function, it returns the following error:

Error in `.rowNamesDF<-`(x, value = value) : 
  duplicate 'row.names' are not allowed
In addition: Warning message:
non-unique values when setting 'row.names': ‘A0A075B5M4’, ‘A0A075B5M7’, ‘A0A075B5N3’, ‘A0A075B5N4’, ‘A0A075B5R7’, ‘A0A075B5T2’, ‘A0A075B5Y4’, ‘A0A075B666’, ‘A0A087WRA4’, ‘A0A087WS16’, ‘A0A0A6YYP6’, ‘A0A0B4J1I0’, ‘A0A0B4J1M0’, ‘A0A0J9YVH3’, ‘A0A0R4J2B2’, ‘A0A140LIF8’, ‘A0A571BF69’, ‘A2A4P0’, ‘A2A5R2’, ‘A2A8L1’, ‘A2AAY5’, ‘A2AB59’, ‘A2ADY9’, ‘A2APV2’, ‘A2AQ07’, ‘A2ASS6’, ‘A2BH40’, ‘A2CG49’, ‘A2CG63’, ‘A3KFU5’, ‘A8DUK4’, ‘B1ARD6’, ‘B2RSH2’, ‘B2RY04’, ‘B9EJ86’, ‘D3YWQ0-2’, ‘D3YXK2’, ‘D3Z3J6’, ‘D3Z6Q9’, ‘E9PUM5’, ‘E9PVA6’, ‘E9PZM4’, ‘E9Q166’, ‘E9Q1A5’, ‘E9Q1F2’, ‘E9Q1P8’, ‘E9Q448’, ‘E9Q512’, ‘E9QA15’, ‘F8VPU6’, ‘G5E829’, ‘G5E8K5’, ‘G5E8V9’, ‘O08528’, ‘O08638’, ‘O08664’, ‘O08797’, ‘O08807’, ‘O08900’, ‘O08911’, ‘O09106’, ‘O09110’, ‘O35226’, ‘O [... truncated]

I traced the problem by executing the make_se_from_files function line by line and found where the error occurs: the make_unique function returns duplicates. When I inspect the returned proteins_unique object, the ID is truncated whenever a protein group is composed of multiple proteins. For example:

Diann protein group: ID
A0A075B5M4: A0A075B5M4
A0A075B5M4;A0A0A6YYE7: A0A075B5M4
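
To see how widespread this collision is in a given file, the leading accession of each protein group can be listed and checked for repeats. The following is a minimal sketch, not part of FragPipeAnalystR: the function name duplicated_leading_ids is hypothetical, and it assumes a tab-separated pg matrix with a header line and the protein group in the first column.

```shell
# Hypothetical helper: print every leading accession that is shared by more
# than one protein group in a DIA-NN pg matrix.
# Assumptions: tab-separated file, header on line 1, protein group in column 1,
# group members separated by ";".
duplicated_leading_ids() {
  awk -F'\t' 'NR > 1 { split($1, ids, ";"); print ids[1] }' "$1" \
    | sort \
    | uniq -d
}
```

Running `duplicated_leading_ids report.pg_matrix.tsv` would list the accessions that end up duplicated once each group is reduced to its first member.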

Would it be OK to prefilter the DIA-NN results to remove all rows whose group contains more than one protein, such as A0A075B5M4;A0A0A6YYE7, or would that bias the results?

Thank you in advance for your help,
JF

@hsiaoyi0504
Member

@jflucier Is your DIA-NN input file generated by running FragPipe?

@hsiaoyi0504
Member

Also, which version are you using?

@jflucier
Author

jflucier commented Aug 2, 2024

Hi

Is your DIA-NN input file generated by running FragPipe?

No, I run DIA-NN from the command line on a Linux cluster. Here is the command I use:

diann --threads 40 --verbose 2 \
--f $SLURM_TMPDIR/data/Fjolla_DIA_15KO_1_Slot1-32_1_24034.d \
--f $SLURM_TMPDIR/data/Fjolla_DIA_15KO_2_Slot1-33_1_24036.d \
--f $SLURM_TMPDIR/data/Fjolla_DIA_15KO_3_Slot1-34_1_24038.d \
--f $SLURM_TMPDIR/data/Fjolla_DIA_15KO_4_Slot1-35_1_24040.d \
--f $SLURM_TMPDIR/data/Fjolla_DIA_LysM_minus_1_Slot1-36_1_24046.d \
--f $SLURM_TMPDIR/data/Fjolla_DIA_LysM_minus_2_Slot1-37_1_24048.d \
--f $SLURM_TMPDIR/data/Fjolla_DIA_LysM_minus_3_Slot1-38_1_24050.d \
--f $SLURM_TMPDIR/data/Fjolla_DIA_LysM_minus_4_Slot1-39_1_24052.d \
--f $SLURM_TMPDIR/data/Fjolla_DIA_LysM_plus_1_Slot1-40_1_24055.d \
--f $SLURM_TMPDIR/data/Fjolla_DIA_LysM_plus_2_Slot1-41_1_24057.d \
--f $SLURM_TMPDIR/data/Fjolla_DIA_LysM_plus_3_Slot1-42_1_24059.d \
--f $SLURM_TMPDIR/data/Fjolla_DIA_LysM_plus_4_Slot1-43_1_24061.d \
--f $SLURM_TMPDIR/data/Fjolla_DIA_WT_1_Slot1-28_1_24025.d \
--f $SLURM_TMPDIR/data/Fjolla_DIA_WT_2_Slot1-29_1_24027.d \
--f $SLURM_TMPDIR/data/Fjolla_DIA_WT_3_Slot1-30_1_24043.d \
--f $SLURM_TMPDIR/data/Fjolla_DIA_WT_4_Slot1-31_1_24031.d \
--temp $SLURM_TMPDIR/temp \
--cut K*,R* --missed-cleavages 2 --met-excision \
--fasta "$SLURM_TMPDIR/UP000000589_10090_combo.fasta" --fasta-search \
--out-lib "$SLURM_TMPDIR/out/report-lib.tsv" --out-lib-copy \
--out "$SLURM_TMPDIR/out/report.tsv" \
--mass-acc-ms1 20 --mass-acc 20 \
--min-pep-len 7 --max-pep-len 30 \
--min-pr-charge 1 --max-pr-charge 5 \
--min-pr-mz 100 --max-pr-mz 1700 \
--min-fr-mz 100 --max-fr-mz 1500 \
--predictor --reanalyse --matrices --smart-profiling --pg-level 1 \
--unimod4 --unimod35 --var-mod UniMod:1,42.010565,*n,ntermacetyl

Also, which version you are using?

I use DIA-NN v1.8.1, installed inside a Singularity container built from a Docker image.

Thank you again for your help

@hsiaoyi0504
Member

I was asking about the version of FragPipeAnalystR. I believe FragPipe doesn't generate reports with this issue. We are willing to improve support for DIA-NN reports, but we don't support them yet. If you are willing to share your file, you can send it to me by email at [email protected]

@jflucier
Author

jflucier commented Aug 5, 2024

The FragPipeAnalystR version installed is 0.1.7

I will send my analysis file directly to the email address you provided.

Thanks again!

@jflucier
Author

Hello,

I managed to get this working by filtering the pg report to keep only proteotypic protein groups (those without a ; in the protein group name). Here is the command I used to filter:

perl -ne '
chomp($_);
my @t = split("\t",$_);
my @prot_ident = split(";",$t[0]);
if(scalar(@prot_ident) == 1){
  print $_ . "\n";
}
' report.pg_matrix.tsv > report.pg_matrix.proteoptypic.tsv
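
Since the earlier question was whether this filter could bias the results, it may help to count how many groups it removes before relying on the filtered file. This is a rough sketch under the same assumptions as the filter (tab-separated file, header on line 1, protein group in column 1). The function name pg_filter_summary is hypothetical.

```shell
# Hypothetical helper: report how many protein groups contain more than one
# protein, i.e. how many rows the filter above would drop.
# Assumptions: tab-separated file, header on line 1, protein group in column 1.
pg_filter_summary() {
  awk -F'\t' '
    NR > 1 { total++; if ($1 ~ /;/) multi++ }
    END {
      printf "%d of %d protein groups contain more than one protein\n", multi, total
    }
  ' "$1"
}
```

Calling `pg_filter_summary report.pg_matrix.tsv` prints a single summary line. A large share of multi-protein groups would suggest the filter discards a substantial part of the quantified data.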

Afterwards, the following command ran successfully:

ccrcc <- make_se_from_files(
  "report.pg_matrix.proteoptypic.tsv",
  "experiment_annotation.tsv",
  type = "DIA",
  level = "gene"
)

@hsiaoyi0504 hsiaoyi0504 added the enhancement New feature or request label Sep 17, 2024