ChEBI : limit load to ChEBI that have 'UniProt' synonyms #78

Open
pgaudet opened this issue Jan 27, 2022 · 10 comments

Comments

@pgaudet

pgaudet commented Jan 27, 2022

This would vastly reduce the number of ChEBI terms to choose from, and would make sure we use the pH 7.3 forms.

Thanks, Pascale
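
To gauge how large such a filtered set would be, here is a rough sketch that lists the ChEBI IDs carrying at least one UniProt-attributed synonym. It assumes a local copy of the ChEBI OBO file in which these synonyms appear as lines like `synonym: "..." RELATED [UniProt]`. That line layout and the function name `uniprot_synonym_ids` are assumptions to check against the real file, not something confirmed in this thread.

```shell
# Print the ID of every [Term] stanza that has at least one synonym line
# containing the literal tag "[UniProt]".
# ASSUMPTION: UniProt synonyms look like:  synonym: "..." RELATED [UniProt]
uniprot_synonym_ids() {
    awk '
        /^\[Term\]/ { in_term = 1; id = ""; printed = 0; next }  # new term stanza
        /^\[/       { in_term = 0; next }                        # e.g. [Typedef]
        in_term && /^id: / { id = $2; next }
        # print each matching term once, however many UniProt synonyms it has
        in_term && /^synonym: / && index($0, "[UniProt]") && !printed {
            print id
            printed = 1
        }
    ' "$1"
}
```

Something like `uniprot_synonym_ids chebi.obo | wc -l` (file name hypothetical) would then give the size of the candidate set.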

@pgaudet pgaudet changed the title ChEBI : limit load to CheBI that have 'UniProt' synonyms ChEBI : limit load to ChEBI that have 'UniProt' synonyms Jan 27, 2022
@kltm
Member

kltm commented Jan 27, 2022

It looks like there is already a limited set from chebi coming in to the "NEO" build (~23k) versus the regular GO release build (~177k). Maybe that's what's already coming in on imports? I'm not sure there is actually anything in the current build process to shave that down more, as we're only examining GPIs and GAFs to produce this.

@cmungall
Member

The correct place to handle this is upstream. go-lego.owl imports go-plus, which uses a chebi_import determined by the editors file.

This is likely both too small (doesn't have any terms that have not been used in the ontology) and too large (includes protonation variants).

Unfortunately simply limiting to the 7.3 forms will have issues since the hierarchy for any one protonation form is often incomplete, and you need all branches with the GCIs to get a complete hierarchy (if that sounds strange and complex, that is because it is).

My preference would be to first scope out more complete requirements for what we want and don't want in chebi and then prioritize a project based on this. For example, in addition to having a canonical protonation state, we want the labels to be intuitive and searchable, we want to ensure that curators are consistent in the level they choose (e.g. L vs D form), and we want to simplify the process of using CHEBI in the ontology, and simplify things for users who might want to use CHEBI and GO together.

We can explore a hack in go-lego that subtracts from the chebi terms in go-plus but I think this will lead to marginal gain at high complexity cost.

@pgaudet
Author

pgaudet commented Jan 31, 2022

This is the file that RHEA uses:
https://ftp.expasy.org/databases/rhea/tsv/chebiId_name.tsv
(about 11k)

It would be useful to know how many chemicals we'd be missing if we used this.

Thanks, Pascale

@deustp01

> It would be useful to know how many chemicals we'd be missing if we used this.

Once I figure out how to do it, I will check the RHEA list against all the ChEBI IDs in Reactome. (If someone reading this knows how, that would be great!)

@kltm
Member

kltm commented Jan 31, 2022

@deustp01 Is there a good source for that information? If I just munge through reacto.owl:
grep -oh 'CHEBI_[0-9]*' reacto.owl | sort | uniq | sed 's/_/:/' > reacto_chebi.txt
With @pgaudet's file above, I can extract:
grep -oh 'CHEBI:[0-9]*' chebiId_name.tsv | sort | uniq > reacto_rhea.txt
File sizes compare as follows:

sjcarbon@moiraine:/tmp$:) wc -l reacto_*
  1978 reacto_chebi.txt
 10226 reacto_rhea.txt
 12204 total

@deustp01

@kltm The attached tab-delimited text file contains entries for the reference form of every chemical known to Reactome (including un-released ones), one row for each chemical. ("Reference" means the information we get from an external reference resource, almost always ChEBI, and which we use to construct "working" instances by adding subcellular location information - so there's only one water reference but many working forms differing by location.) The first entry in each row is the chemical's name; the second is its identifier in the reference resource.

If you just omitted all the rows whose identifier does NOT start with ChEBI, that would be OK - there aren't many, and basically if we can't specify something well enough to get a ChEBI identifier for it, it's not well enough specified for GO-CAM either.
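
A minimal sketch of that filtering step, assuming the file is tab-delimited with the identifier in the second column and written with a `ChEBI:` prefix, as described above. The function name `chebi_rows` is made up for illustration.

```shell
# Keep only the rows whose second (tab-separated) column starts with "ChEBI:".
# ASSUMPTION: column 1 is the chemical name, column 2 is the identifier.
chebi_rows() {
    awk -F '\t' '$2 ~ /^ChEBI:/' "$1"
}
```

Running `chebi_rows Reactome_ChEBI_list.txt | cut -f2` would then print just the retained identifiers.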

Adding @ukemi for a sanity check.

Reactome_ChEBI_list.txt

@kltm
Member

kltm commented Jan 31, 2022

@deustp01 Processing that file in a similar way:
grep -oh 'ChEBI:[0-9]*' Reactome_ChEBI_list.txt | sort | uniq | sed 's/ChEBI/CHEBI/' > reacto_reacto.txt

sjcarbon@moiraine:/tmp$:) wc -l reacto_rhea.txt reacto_reacto.txt 
 10226 reacto_rhea.txt
  7071 reacto_reacto.txt

So the Reactome list is about 3k IDs shorter than the Rhea list (7071 vs. 10226).
Diff output looks like:
https://gist.github.com/kltm/f7294fcf771cf00eada192b9734ac8ed (~10k lines)
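
The line counts alone do not say how many Reactome chemicals are absent from the Rhea list. One way to get that number is a set difference with `comm`. This is only a sketch: the helper name `missing_from` is made up, and it assumes both files hold one `CHEBI:nnnn` identifier per line, as the pipelines above produce.

```shell
# Print IDs present in the first list but absent from the second.
# Both files are expected to hold one CHEBI:nnnn identifier per line.
missing_from() {
    a=$(mktemp) && b=$(mktemp) || return 1
    # comm requires both inputs sorted in the same collation order
    LC_ALL=C sort -u "$1" > "$a"
    LC_ALL=C sort -u "$2" > "$b"
    # -23 suppresses lines unique to the second file and lines common to both
    LC_ALL=C comm -23 "$a" "$b"
    rm -f "$a" "$b"
}

# Count Reactome IDs that the Rhea list lacks (runs only if both files exist).
if [ -f reacto_reacto.txt ] && [ -f reacto_rhea.txt ]; then
    missing_from reacto_reacto.txt reacto_rhea.txt | wc -l
fi
```

Swapping the two arguments gives the opposite direction, that is, the Rhea IDs that Reactome does not use.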

@ukemi

ukemi commented Feb 1, 2022

I think one question that remains is how to handle entities from imported sources like this and build a robust and complete entity ontology for use in models. In this case Reactome is the straw man, but there have been proposals to do this with other resources as well. I think (correct me if I am wrong) that the plan for Reactome proteoforms and complexes is to move towards using PRO. So there is an ontology for that. We should be able to distinguish location for the Reactome entities using the PRO ids, existing relations and GO cellular components. I would think this could be extended to ChEBI entities, existing relations and GO cellular components.

The question that I still have with respect to this exact ticket is whether Reactome expects all the mismatches to eventually be mapped to Rhea and be incorporated into ChEBI and get blessed in the usable set.

@deustp01

deustp01 commented Feb 1, 2022

> whether Reactome expects all the mismatches to eventually be mapped to Rhea and be incorporated into ChEBI and get blessed in the usable set.

Yes, as above, that is the hope: "if we can't specify something well enough to get a ChEBI identifier for it, it's not well enough specified for GO-CAM either." I'm expecting / hoping / guessing from the work with Rhea and ChEBI over the past few years that we are not going to run into the issue of chemicals important to annotate human (patho)physiology that are a priori out of scope for these other resources. Also, there are generic terms, items like "polypeptide" or "nucleotide" that we can continue to use to ensure that all Reactome physical entities can be mapped to something in ChEBI to enable conversion to GO-CAM to proceed.

@cmungall
Member

cmungall commented Feb 1, 2022

I am confident we can get a simple biologist-friendly ChEBI subset that satisfies all our requirements IF ChEBI can fix one thing.

Right now it is impossible to make a subset of ChEBI that excludes non-pH 7.3 non-protonated forms without losing large numbers of important classifications. I finally got around to making a comprehensive report for CHEBI:

ebi-chebi/ChEBI#4207

From a GO perspective, this is one of the most important things CHEBI could work on. I suspect this will be high priority for Rhea too. I know it is a priority for multiple other ontologies that use CHEBI.

Note that we would be interested in seeing a systematic approach to this - manually synchronizing the different branches for the different protonated forms is not scalable. I am willing to spend lots of time with the CHEBI team to explain how OWL can help solve this in a systematic way.

5 participants