Where did the train_data.pkl and test_data.pkl uploaded on 02-Dec-2019 come from? #37

Open
nicolasfredesfranco opened this issue Jun 29, 2021 · 11 comments

Comments

@nicolasfredesfranco

I would like to know where the files train_data.pkl and test_data.pkl came from. Specifically, I mean the train_data.pkl and test_data.pkl uploaded on 02-Dec-2019 to the data page https://deepgo.cbrc.kaust.edu.sa/data/. These files are not the same as the train and test files in data-cafa.tar, data-2016.tar, or any other folder available on the webpage. I have been using these files in some experiments, and I recently realized they are not the data used to generate the tables presented in the DeepGOPlus paper. Even so, to interpret my results I need to know the origin of these data files: whether they are a merge or subset of the other datasets on the data page, some version of UniProt, CAFA, or something else. Thanks for your help.

@coolmaksat
Contributor

Hi,
Those files are generated using the deepgoplus_data.py script. You need to provide the GO file (go.obo, downloaded from geneontology.org)
and the swissprot.pkl file, which is generated with uni2pandas.py from uniprot-sprot.dat.gz (the data file from uniprot.org).
We continuously update our data file (data.tar.gz) with every release of UniProt, which is why the files are not the same as
in the paper.
For CAFA data, we use the cafa3_data.py script to generate training and testing data.
If you would like to reuse our trained models, make sure you use the same terms.pkl file, because the order of GO
terms affects the prediction results.
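
To illustrate why the term order matters, here is a minimal sketch, not the project's own code. It assumes the model emits one score per GO term by position, so the term list is the only thing that gives each score its meaning. The term IDs, the scores, and the helper `label_scores` are hypothetical stand-ins for the contents of terms.pkl and a model output.

```python
# Hypothetical GO term order saved with a trained model (stand-in for terms.pkl).
model_terms = ["GO:0003674", "GO:0005515", "GO:0008150"]

# Hypothetical model output: one score per term, matched purely by position.
scores = [0.9, 0.7, 0.1]


def label_scores(terms, scores):
    """Pair each positional score with the GO term at the same index."""
    if len(terms) != len(scores):
        raise ValueError("terms and scores must have the same length")
    return dict(zip(terms, scores))


# Using the term order the model was trained with gives the intended labels.
correct = label_scores(model_terms, scores)

# A regenerated list with the same members in a different order silently
# attaches the scores to the wrong terms; no error is raised.
regenerated_terms = sorted(model_terms, reverse=True)
wrong = label_scores(regenerated_terms, scores)
```

Both lists contain the same terms, so a length or membership check would not catch the mismatch. Only reusing the exact terms.pkl shipped with the model does.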

@nicolasfredesfranco
Author

Hi!
First of all, I want to say I appreciate your work! Thanks for making it available.
Your explanation answers my original question. Thanks.
As I said, in some experiments in my research I've been using the train_data.pkl and test_data.pkl files that you uploaded on 02-Dec-2019. I want to evaluate the DIAMOND score on these files using your code evaluate_diamondscore.py. Therefore, I generated test_diamond.res based on your script new_evaluation.sh by running:

rm results/deepgoplus_mf.txt
rm results/deepgoplus_bp.txt
rm results/deepgoplus_cc.txt

python diamond_data.py -df data/train_data.pkl -o data/train_data.fa

python diamond_data.py -df data/test_data.pkl -o data/test_data.fa

diamond makedb --in data/train_data.fa -d data/train_data #creates train_data.dmnd

diamond blastp -d data/train_data.dmnd --more-sensitive -t /tmp -q data/test_data.fa --outfmt 6 qseqid sseqid bitscore -o data/test_diamond.res
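
For reference, the last command writes tab-separated lines with the three requested fields (qseqid, sseqid, bitscore). The following is a minimal sketch of reading such a file into a per-query lookup. It is not the parsing code from evaluate_diamondscore.py; the function name `read_diamond_hits` and the sample IDs are hypothetical.

```python
import io
from collections import defaultdict


def read_diamond_hits(handle):
    """Map each query ID to {subject ID: best bitscore} from 3-column tabular lines."""
    hits = defaultdict(dict)
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        qseqid, sseqid, bitscore = line.split("\t")
        score = float(bitscore)
        # Keep only the highest bitscore if a query/subject pair appears twice.
        if score > hits[qseqid].get(sseqid, float("-inf")):
            hits[qseqid][sseqid] = score
    return dict(hits)


# Hypothetical sample in the same layout as data/test_diamond.res.
sample = io.StringIO("Q1\tT1\t250.0\nQ1\tT2\t80.5\nQ2\tT1\t40.0\n")
hits = read_diamond_hits(sample)
```

With the real output, the same function could be called on an open file handle for data/test_diamond.res instead of the in-memory sample.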

Then, with the generated test_diamond.res, the train_data.pkl and test_data.pkl files mentioned above, and the go.obo file from your data page uploaded on 01-Dec-2019 (one day before the pkl files), I tried to run evaluate_diamondscore.py. I assumed this go.obo file matches the pkl files of 02-Dec-2019, but the DIAMOND evaluation runs into problems. It raises an error in the evaluate_annotations function (line 151) because the variable "ru" is divided by the variable "total", which is zero (line 158). I've been studying your code, and this happens because the filter you use to keep only the GO terms that belong to the selected GO subontology (mf, bp or cc) eliminates every GO term in the labels and preds lists. After the filters at lines 84 and 107, both lists are empty, so the "for" loop of the evaluation never runs. I therefore suspect that the GO terms in go.obo are in a different format from the GO terms in train_data.pkl and test_data.pkl (the go.obo set appears with suffixes such as |IDA, while the labels and preds sets do not), but I'm not sure.
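
The failure mode described above can be reproduced in isolation. This is a hedged sketch, not the code from evaluate_diamondscore.py. It assumes one side of the comparison carries an evidence-code suffix such as |IDA while the other holds bare GO IDs; which side that is in the real files is not established here. All names and term IDs are hypothetical.

```python
# Hypothetical set of bare GO IDs belonging to one subontology.
namespace_terms = {"GO:0003674", "GO:0005515"}


def strip_evidence(term):
    """Drop anything after the first '|' (for example an evidence code)."""
    return term.split("|", 1)[0]


def filter_to_namespace(annotation_sets, namespace_terms, normalize=lambda t: t):
    """Keep, per protein, only the terms that are in the subontology set."""
    return [{normalize(t) for t in terms} & namespace_terms
            for terms in annotation_sets]


def count_evaluable(filtered):
    """Number of proteins that still have at least one term after filtering."""
    return sum(1 for terms in filtered if terms)


def safe_mean(value_sum, total):
    """Return None instead of dividing by zero when nothing is evaluable."""
    return None if total == 0 else value_sum / total


# Hypothetical per-protein annotations with evidence-code suffixes.
raw = [{"GO:0005515|IDA"}, {"GO:0003674|IEA", "GO:0005515|IPI"}]

# Exact string matching removes every term, so nothing is left to evaluate.
unnormalized = filter_to_namespace(raw, namespace_terms)

# Normalizing the IDs first keeps the terms.
normalized = filter_to_namespace(raw, namespace_terms, normalize=strip_evidence)
```

In this sketch `count_evaluable(unnormalized)` is zero, which is the condition that would make a plain division by the count fail. A guard such as `safe_mean` surfaces the empty case explicitly instead of raising.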

I would really appreciate your help. I need to use precisely these pkl files because my research team invested a lot of time in simulating the structural properties of the proteins contained in them, and we now want to submit our paper, in which we of course cite you and your paper. But first I need to make your DIAMOND evaluation work with these pkl files.

@nicolasfredesfranco
Author

I solved it by replacing the string 'annotations' with 'prop_annotations' in lines 48 and 52 of evaluate_diamondscore.py. You use this column in the other evaluation files. Is this right?

@coolmaksat
Contributor

Hi,
Sorry for the late response.
Yes, you are right. It should be prop_annotations. I will update the script.
Thank you.

@nicolasfredesfranco
Author

thanks!

@simon19891216

where is data-deepgo2016/test-mf-preds.pkl?

@coolmaksat
Contributor

Could you please remind me where this file is referenced?

@simon19891216

simon19891216 commented Apr 3, 2023 via email

@coolmaksat
Contributor

To evaluate CAFA3, we ran the cafa3_data.py script to generate the CAFA test_data.pkl and then ran our model to get predictions.pkl. The data is available here: https://deepgo.cbrc.kaust.edu.sa/data/data-cafa.tar.gz

@simon19891216

simon19891216 commented Apr 3, 2023 via email

@simon19891216

simon19891216 commented Apr 3, 2023 via email
