ChEBI : limit load to ChEBI that have 'UniProt' synonyms #78

pgaudet · 2022-01-27T09:53:29Z

This would vastly reduce the number of ChEBI terms to choose from, and would make sure we use the pH 7.3 forms.

Thanks, Pascale

kltm · 2022-01-27T23:53:30Z

It looks like there is already a limited set from chebi coming in to the "NEO" build (~23k) versus the regular GO release build (~177k). Maybe that's what's already coming in on imports? I'm not sure there is actually anything in the current build process to shave that down more, as we're only examining GPIs and GAFs to produce this.

cmungall · 2022-01-28T20:14:46Z

The correct place to handle this is upstream. go-lego.owl imports go-plus, which uses a chebi_import determined by the editors file.

This is likely both too small (doesn't have any terms that have not been used in the ontology) and too large (includes protonation variants).

Unfortunately simply limiting to the 7.3 forms will have issues since the hierarchy for any one protonation form is often incomplete, and you need all branches with the GCIs to get a complete hierarchy (if that sounds strange and complex, that is because it is).

My preference would be to first scope out more complete requirements for what we want and don't want in chebi and then prioritize a project based on this. For example, in addition to having a canonical protonation state, we want the labels to be intuitive and searchable, we want to ensure that curators are consistent in the level they choose (e.g. L vs D form), and we want to simplify the process of using CHEBI in the ontology, and simplify things for users who might want to use CHEBI and GO together.

We can explore a hack in go-lego that subtracts from the chebi terms in go-plus but I think this will lead to marginal gain at high complexity cost.

pgaudet · 2022-01-31T09:27:44Z

This is the file that RHEA uses:
https://ftp.expasy.org/databases/rhea/tsv/chebiId_name.tsv
(about 11k)

It would be useful to know how many chemicals we'd be missing if we used this.

Thanks, Pascale

deustp01 · 2022-01-31T14:57:41Z

> It would be useful to know how many chemicals we'd be missing if we used this.

Once I figure out how to do it, I will check the RHEA list against all the ChEBI IDs in Reactome. (If someone reading this knows how, that would be great!)

kltm · 2022-01-31T21:28:33Z

@deustp01 Is there a good source for that information? If I just munge through reacto.owl:
grep -oh 'CHEBI_[0-9]*' reacto.owl | sort | uniq | sed 's/_/:/' > reacto_chebi.txt
With @pgaudet's file above, I can extract:
grep -oh 'CHEBI:[0-9]*' chebiId_name.tsv | sort | uniq > reacto_rhea.txt
File sizes compare at:

sjcarbon@moiraine:/tmp$:) wc -l reacto_*
  1978 reacto_chebi.txt
 10226 reacto_rhea.txt
 12204 total

deustp01 · 2022-01-31T21:59:07Z

@kltm The attached tab-delimited text file contains entries for the reference form of every chemical known to Reactome (including un-released ones), one row for each chemical. ("Reference" means the information we get from an external reference resource, almost always ChEBI, and which we use to construct "working" instances by adding subcellular location information - so there's only one water reference but many working forms differing by location.) The first entry in each row is the chemical's name; the second is its identifier in the reference resource.

If you just omitted all the rows whose identifier does NOT start with ChEBI, that would be OK - there aren't many, and basically if we can't specify something well enough to get a ChEBI identifier for it, it's not well enough specified for GO-CAM either.
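The row filter described above could be sketched roughly as follows. This is only an illustration: it assumes the second tab-delimited column holds identifiers of the form `ChEBI:<digits>`, and the file name `list.txt` and the toy rows (including the `OTHER:0001` identifier) are hypothetical stand-ins, not the real attachment.

```shell
#!/bin/sh
export LC_ALL=C   # fixed collation so sort behaves predictably

# Toy stand-in for the tab-delimited list: name <TAB> identifier (rows are made up).
workdir=$(mktemp -d)
printf 'water\tChEBI:15377\nATP\tChEBI:30616\nunknown lipid\tOTHER:0001\nwater\tChEBI:15377\n' \
  > "$workdir/list.txt"

# Keep only rows whose second column is a ChEBI identifier,
# normalize the prefix to CHEBI, and de-duplicate.
chebi_ids=$(awk -F '\t' '$2 ~ /^ChEBI:[0-9]+$/ { sub(/^ChEBI/, "CHEBI", $2); print $2 }' \
  "$workdir/list.txt" | sort -u)

printf '%s\n' "$chebi_ids"
rm -r "$workdir"
```

Matching on the identifier column only, rather than grepping the whole line, avoids picking up anything ChEBI-like that might appear inside a chemical name.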

Adding @ukemi for a sanity check.

Reactome_ChEBI_list.txt

kltm · 2022-01-31T22:26:26Z

@deustp01 Processing that file in a similar way:
grep -oh 'ChEBI:[0-9]*' Reactome_ChEBI_list.txt | sort | uniq | sed 's/ChEBI/CHEBI/' > reacto_reacto.txt

sjcarbon@moiraine:/tmp$:) wc -l reacto_rhea.txt reacto_reacto.txt 
 10226 reacto_rhea.txt
  7071 reacto_reacto.txt

So, like 3k short.
Diff output looks like:
https://gist.github.com/kltm/f7294fcf771cf00eada192b9734ac8ed (~10k lines)
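To put a number on "how many chemicals we'd be missing", one option is `comm` on the two sorted ID lists instead of reading a raw diff. The sketch below uses toy files with the same names as above; the IDs in them are illustrative only and the counts say nothing about the real data.

```shell
#!/bin/sh
export LC_ALL=C   # comm requires both inputs sorted in the same collation order

# Toy stand-ins for reacto_reacto.txt and reacto_rhea.txt (contents are made up).
workdir=$(mktemp -d)
printf 'CHEBI:15377\nCHEBI:15378\nCHEBI:30616\n' | sort -u > "$workdir/reacto_reacto.txt"
printf 'CHEBI:15377\nCHEBI:30616\nCHEBI:43474\n' | sort -u > "$workdir/reacto_rhea.txt"

# -23 suppresses columns 2 and 3, leaving lines unique to the first file:
# IDs in the Reactome list that the Rhea list does not contain.
missing=$(comm -23 "$workdir/reacto_reacto.txt" "$workdir/reacto_rhea.txt")
missing_count=$(printf '%s\n' "$missing" | grep -c '^CHEBI:')

# -12 keeps only lines present in both files.
shared_count=$(comm -12 "$workdir/reacto_reacto.txt" "$workdir/reacto_rhea.txt" | wc -l | tr -d ' ')

echo "missing from Rhea list: $missing_count"
echo "shared: $shared_count"
rm -r "$workdir"
```

Swapping the two file arguments to `comm -23` gives the opposite direction: IDs in the Rhea list that Reactome does not use.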

ukemi · 2022-02-01T12:59:51Z

I think one question that remains is how to handle entities from imported sources like this and build a robust and complete entity ontology for use in models. In this case Reactome is the straw man, but there have been proposals to do this with other resources as well. I think (correct me if I am wrong) that the plan for Reactome proteoforms and complexes is to move towards using PRO. So there is an ontology for that. We should be able to distinguish location for the Reactome entities using the PRO ids, existing relations and GO cellular components. I would think this could be extended to ChEBI entities, existing relations and GO cellular components.

The question that I still have with respect to this exact ticket is whether Reactome expects all the mismatches to eventually be mapped to Rhea and be incorporated into ChEBI and get blessed in the usable set.

deustp01 · 2022-02-01T15:28:35Z

> whether Reactome expects all the mismatches to eventually be mapped to Rhea and be incorporated into ChEBI and get blessed in the usable set.

Yes, as above, that is the hope: "if we can't specify something well enough to get a ChEBI identifier for it, it's not well enough specified for GO-CAM either." I'm expecting / hoping / guessing from the work with Rhea and ChEBI over the past few years that we are not going to run into the issue of chemicals important to annotate human (patho)physiology that are a priori out of scope for these other resources. Also, there are generic terms, items like "polypeptide" or "nucleotide" that we can continue to use to ensure that all Reactome physical entities can be mapped to something in ChEBI to enable conversion to GO-CAM to proceed.

cmungall · 2022-02-01T21:12:03Z

I am confident we can get a simple biologist-friendly subset that satisfies all our requirements IF chebi can fix one thing.

Right now it is impossible to make a subset of chebi that excludes non-pH 7.3 non-protonated forms without losing large numbers of important classifications. I finally got around to making a comprehensive report for CHEBI:

ebi-chebi/ChEBI#4207

From a GO perspective, this is one of the most important things CHEBI could work on. I suspect this will be high priority for Rhea too. I know it is a priority for multiple other ontologies that use CHEBI.

Note that we would be interested in seeing a systematic approach to this - manually synchronizing the different branches for the different protonated forms is not scalable. I am willing to spend lots of time with the CHEBI team to explain how OWL can help solve this in a systematic way.

pgaudet changed the title ~~ChEBI : limit load to CheBI that have 'UniProt' synonyms~~ ChEBI : limit load to ChEBI that have 'UniProt' synonyms Jan 27, 2022

kltm added the enhancement label Jan 27, 2022
