Where the train_data.pkl and test_data.pkl uploaded on 02-Dec-2019 came from? #37

nicolasfredesfranco · 2021-06-29T03:17:04Z

I would like to know where the files train_data.pkl and test_data.pkl came from, specifically the ones uploaded on 02-Dec-2019 to the data page https://deepgo.cbrc.kaust.edu.sa/data/. These files are not the same as the train and test files in data-cafa.tar, data-2016.tar, or any other folder available on the webpage. I have been using these files in some experiments, and I recently realized they are not the data used to generate the tables presented in the DeepGOPlus paper. Still, to interpret my results, I need to know the origin of these data files: whether they are a merge or subset of the other datasets on the data webpage, some version of UniProt, CAFA, or something else. Thanks for your help.

coolmaksat · 2021-06-29T05:28:59Z

Hi,
Those files are generated using the deepgoplus_data.py script. You need to provide a GO file (go.obo, downloaded from geneontology.org)
and a swissprot.pkl file, which is generated with uni2pandas.py from uniprot-sprot.dat.gz (data file from uniprot.org).
We continuously update our data file (data.tar.gz) with every release of UniProt, which is why they are not the same as
in the paper.
For CAFA data, we use the cafa3_data.py script to generate training and testing data.
If you would like to reuse our trained models, make sure you use the same terms.pkl file, because the order of GO
terms affects the prediction results.
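
To illustrate why the term order matters, here is a minimal sketch, assuming the model emits one score per GO term and that position i in the output corresponds to the i-th entry of the term list. The helper `scores_to_terms`, the term lists, and the scores are hypothetical and are not taken from the repository.

```python
# Hypothetical illustration: the helper name, term lists, and scores are made up.
def scores_to_terms(scores, terms):
    """Pair each output score with the GO term at the same index."""
    if len(scores) != len(terms):
        raise ValueError("score vector and term list differ in length")
    return dict(zip(terms, scores))


# Term order the model was (hypothetically) trained with.
trained_terms = ["GO:0003674", "GO:0005488", "GO:0005515"]
# The same terms in a different order, e.g. from a regenerated terms file.
regenerated_terms = ["GO:0005515", "GO:0003674", "GO:0005488"]

scores = [0.99, 0.80, 0.10]  # made-up model output

correct = scores_to_terms(scores, trained_terms)
wrong = scores_to_terms(scores, regenerated_terms)

print(correct["GO:0005515"])  # score the model actually assigned to this term
print(wrong["GO:0005515"])    # score wrongly attributed after reordering
```

With a mismatched term list the scores are still valid numbers, so nothing fails loudly; they are simply attached to the wrong GO terms.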

nicolasfredesfranco · 2021-06-30T14:08:48Z

Hi!
First of all, I want to say I appreciate your work! Thanks for making it available.
Now, with your explanation, I'm clear about my original question. Thanks for your answer.
As I said, in some experiments of my research I've been using the train_data.pkl and test_data.pkl files that you uploaded on 02-Dec-2019. I want to evaluate the diamond score on these files using your code evaluate_diamondscore.py. Therefore, I generated test_diamond.res based on your script new_evaluation.sh by running:

rm results/deepgoplus_mf.txt
rm results/deepgoplus_bp.txt
rm results/deepgoplus_cc.txt

python diamond_data.py -df data/train_data.pkl -o data/train_data.fa

python diamond_data.py -df data/test_data.pkl -o data/test_data.fa

diamond makedb --in data/train_data.fa -d data/train_data #creates train_data.dmnd

diamond blastp -d data/train_data.dmnd --more-sensitive -t /tmp -q data/test_data.fa --outfmt 6 qseqid sseqid bitscore -o data/test_diamond.res

Then, with the generated test_diamond.res, the train_data.pkl and test_data.pkl files I've been talking about, and the go.obo file available on your data web page (uploaded on 01-Dec-2019, one day before the pkl files), I tried to run evaluate_diamondscore.py. I assumed this go.obo file matches the pkl files of 02-Dec-2019, but the diamond evaluation has some problems. First, it raises an error in the evaluate_annotations function (line 151) because the variable "ru" is divided by the variable "total", which is zero (line 158). I've been studying your code, and this happens because the filter you use to keep only the GO terms that belong to the GO sub-ontology (mf, bp or cc) eliminates every GO term in the labels and preds lists. After the filters on lines 84 and 107, both lists are empty, and the "for" loop of the evaluation never executes. So I suspect the GO terms in go.obo are in a different format from the GO terms in the train and test pkl files (the set from go.obo appears with |IDA and similar details, but the sets of labels and preds do not), but I'm not sure.

I would really appreciate it if you could help me. I need to use precisely these pkl files because my research team invested a lot of time in simulating the structural properties of the proteins contained in them, and now we want to submit our paper, in which, of course, we cite you and your paper. But first I need to make your diamond evaluation work with these pkl files.
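
For readers following the commands above: with `--outfmt 6 qseqid sseqid bitscore`, DIAMOND writes one tab-separated line per hit with exactly those three fields. The sketch below shows one way such a file could be read into a lookup table. The function `parse_diamond_hits` and the sample IDs are hypothetical, and this is not the parsing code used in evaluate_diamondscore.py.

```python
from collections import defaultdict


def parse_diamond_hits(lines):
    """Map each query ID to {subject ID: best bitscore}."""
    hits = defaultdict(dict)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        query_id, subject_id, bitscore = line.split("\t")
        score = float(bitscore)
        # Keep the highest bitscore if a query/subject pair appears more than once.
        if score > hits[query_id].get(subject_id, float("-inf")):
            hits[query_id][subject_id] = score
    return dict(hits)


# Made-up sample in the three-column layout requested above.
sample = [
    "P1\tT1\t250.0",
    "P1\tT2\t90.5",
    "P2\tT1\t40.0",
]
hits = parse_diamond_hits(sample)
print(hits)
```

To read a real result file, the same function can be given an open file handle, for example `parse_diamond_hits(open("data/test_diamond.res"))`.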

nicolasfredesfranco · 2021-07-06T03:27:36Z

I solved it by changing the string 'annotations' on lines 48 and 52 of evaluate_diamondscore.py to 'prop_annotations'. You use this column in the other evaluation files. Is this right?

coolmaksat · 2021-07-06T05:03:31Z

Hi,
Sorry for the late response.
Yes, you are right. It should be prop_annotations. I will update the script.
Thank you.
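
The failure discussed above can be reproduced in miniature. The sketch assumes that the 'annotations' column stores GO IDs with an evidence-code suffix (such as `|IDA`) while 'prop_annotations' stores plain GO IDs; the thread suggests this but does not state it outright. All function names and sample values are hypothetical, and `mean_recall` is a simplified stand-in, not the repository's evaluate_annotations.

```python
def keep_namespace_terms(annotations, namespace_terms):
    """Keep only annotations that are members of the sub-ontology set."""
    return {a for a in annotations if a in namespace_terms}


def strip_evidence(annotation):
    """'GO:0005515|IPI' -> 'GO:0005515'; plain IDs pass through unchanged."""
    return annotation.split("|", 1)[0]


def mean_recall(real_annots, pred_annots):
    """Average recall over proteins that have at least one true label.

    Returns None instead of dividing by zero when no protein qualifies.
    """
    total = 0
    recall_sum = 0.0
    for real, pred in zip(real_annots, pred_annots):
        if not real:
            continue  # proteins without labels do not count
        total += 1
        recall_sum += len(real & pred) / len(real)
    return recall_sum / total if total else None


mf_terms = {"GO:0003674", "GO:0005515"}               # hypothetical sub-ontology set
with_evidence = ["GO:0005515|IPI", "GO:0003674|IDA"]  # made-up suffixed entries
plain_ids = ["GO:0005515", "GO:0003674"]              # made-up plain entries

# Suffixed strings never match the plain IDs, so the filter removes everything.
filtered_raw = keep_namespace_terms(with_evidence, mf_terms)
filtered_plain = keep_namespace_terms(plain_ids, mf_terms)
filtered_stripped = keep_namespace_terms(map(strip_evidence, with_evidence), mf_terms)

print(filtered_raw, mean_recall([filtered_raw], [set()]))
print(filtered_plain, mean_recall([filtered_plain], [{"GO:0005515"}]))
```

When every label set is empty after filtering, no protein contributes to the totals, which is the situation that leads to the division by zero described earlier in the thread.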

nicolasfredesfranco · 2021-07-06T05:12:42Z

thanks!

simon19891216 · 2023-04-03T00:35:12Z

where is data-deepgo2016/test-mf-preds.pkl?

coolmaksat · 2023-04-03T05:26:02Z

Could you please remind me where this file is referenced?

simon19891216 · 2023-04-03T08:45:44Z

Indeed, I want to evaluate our annotation results using CAFA3. However, I found that only Fmax can be calculated by cafa3. I found that another program (deepgoplus) can generate other values (such as Smin), and the two files were contained in deepgoplus.

coolmaksat · 2023-04-03T09:23:43Z

To evaluate CAFA3, we ran the cafa3_data.py script to generate the CAFA test_data.pkl and then ran our model to get predictions.pkl. The data is available here https://deepgo.cbrc.kaust.edu.sa/data/data-cafa.tar.gz

simon19891216 · 2023-04-03T10:03:22Z

Thank you! I will try it again!

simon19891216 · 2023-04-03T13:26:39Z

Our annotation results contain only gene IDs and GO IDs. However, the input file for evaluate_cafa3.py should contain labels and preds. Following your suggestion, we ran cafa3_data.py, but the labels and preds information did not seem to be added to our file. Could you tell me how to add this information to our files? Thanks.
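
This question has no reply in the thread. As a rough sketch only, one way to turn (gene ID, GO ID) pairs into per-protein prediction vectors is shown below. It assumes that preds is a score vector aligned with a fixed GO term order; that layout is an assumption based on the terms.pkl remark earlier in the thread and has not been verified against evaluate_cafa3.py. The function names, IDs, and term order are hypothetical.

```python
from collections import defaultdict


def group_annotations(pairs):
    """Group (protein ID, GO ID) pairs into {protein ID: set of GO IDs}."""
    grouped = defaultdict(set)
    for protein_id, go_id in pairs:
        grouped[protein_id].add(go_id)
    return dict(grouped)


def to_score_vector(go_ids, terms, score=1.0):
    """Build a score vector aligned with the order of `terms`.

    Annotated terms get `score`; all other terms get 0.0.
    """
    return [score if t in go_ids else 0.0 for t in terms]


terms = ["GO:0003674", "GO:0005488", "GO:0005515"]  # hypothetical term order
pairs = [
    ("geneA", "GO:0005515"),
    ("geneA", "GO:0003674"),
    ("geneB", "GO:0005488"),
]

grouped = group_annotations(pairs)
preds = {pid: to_score_vector(gos, terms) for pid, gos in grouped.items()}
print(preds)
```

Because annotations without confidence scores become 0/1 vectors, threshold-based metrics such as Fmax and Smin would be evaluated at a single operating point rather than across a range of thresholds.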
