
Load all Swiss-Prot entries in NEO #82

Closed
pgaudet opened this issue Feb 10, 2022 · 22 comments


pgaudet commented Feb 10, 2022

Hi @kltm

The 'ultimate' goal is to have all Swiss-Prot (reviewed) entries. The file is on the same GOA FTP site; it's called
uniprot_reviewed.gpi.gz

The bacteria-and-viruses file was for testing on a smaller set, but we'll need everything. This file is about double the size of uniprot_reviewed_virus_bacteria.gpi.gz.

Thanks, Pascale

@pgaudet changed the title from "Load all of Swiss-Prot" to "Load all Swiss-Prot entries in NEO" Feb 10, 2022

pgaudet commented Feb 14, 2022

Full URL is
ftp://ftp.ebi.ac.uk/pub/contrib/goa/uniprot_reviewed.gpi.gz

@cmungall
Member

See my latest comments in geneontology/go-site#1431

I think loading the reviewed file for SARS-CoV-2 is a bad idea, as we lose the important proteins that do the actual work.

I suspect this problem would remain for other viruses too; I have no idea how we would do useful annotation of them without entries for the polyproteins.

We have fixed the problem for SARS2 with my curated file. However, if we are serious about doing other viruses that have similar genomes, then I think we need to programmatically extract the correct entries. This would be a project:

  1. write a Python script that takes a GPI file that has polyprotein entries (PRO IDs) and keeps the longest protein for each bona fide polyprotein (a rough sketch is below)
  2. (optional) map the InterPro function predictions to the polyprotein level
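
A very rough sketch of what step 1 could look like, assuming GPI 1.2 tab-separated columns (parent object ID in column 8) and a hypothetical sequence_length() lookup for picking the longest product (the GPI file itself carries no lengths); this only illustrates the grouping, it is not a finished implementation:

```python
import csv
import sys
from collections import defaultdict

def sequence_length(db_object_id):
    """Hypothetical length lookup, e.g. from a pre-fetched UniProt/PRO
    length table; not implemented here."""
    raise NotImplementedError(db_object_id)

def longest_product_per_polyprotein(gpi_path):
    """Group GPI entries by their parent object (GPI 1.2 column 8) and keep
    only the longest product per parent; parentless rows pass through."""
    by_parent = defaultdict(list)
    with open(gpi_path) as fh:
        for row in csv.reader(fh, delimiter="\t"):
            if not row or row[0].startswith("!"):
                continue  # skip comment/header lines
            parent = row[7] if len(row) > 7 else ""
            if parent:
                by_parent[parent].append(row)
            else:
                yield row
    for rows in by_parent.values():
        yield max(rows, key=lambda r: sequence_length(f"{r[0]}:{r[1]}"))

if __name__ == "__main__":
    writer = csv.writer(sys.stdout, delimiter="\t")
    for row in longest_product_per_polyprotein(sys.argv[1]):
        writer.writerow(row)
```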


pgaudet commented Feb 18, 2022

> I suspect this problem would remain for other viruses too,

@pmasson55 says that this is not typical of all viruses. Patrick and I should look at which viruses need this special processing.

Thanks Pascale


gillespm commented Feb 18, 2022

Hi All, I was talking with Peter D'Eustachio about this and have two comments that hopefully will be of use.

  • Lots of viruses that humans care about (i.e. that infect humans) use the polyprotein strategy; there are a number of papers out there, mostly written from a protease drug-targeting point of view. Here are two examples:

  • https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7150265/

  • https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3660988/

  • There is another class of viral polyprotein that you wouldn't really call a polyprotein, but it operates in the same way, or at least has the same "subfragment" problem. Influenza is an example of this, where host proteases are used to activate viral proteins. In fact this latter mechanism, cleavage of the HA protein intracellularly, is one of the things that made the 1918 influenza virus so pathogenic. These generally use host proteases.

@pmasson55

Hi All,

Concerning Swiss-Prot viral entries, I would say this concerns about 10% of the total viral entries (roughly 1,500 out of 15,000). They are not as complex as the SARS-CoV-2 entries. Most of the time there is only one polyprotein, not a long and a short version of the same polyprotein. So if we can handle protein processing (being able to annotate chains inside polyproteins), I think we cover 99% of the viruses.


kltm commented Feb 24, 2022

Okay, picking up work from #77 here, where there are a few more details. Noting that the working branch is now: https://github.com/geneontology/neo/tree/issue-82-add-all-reviewed .

The current blocking issue is that, while we were hoping a drop-in replacement would work, there is some issue with the owltools solr loader that is preventing the load from completing. Essentially, after somewhere between ~500k and ~1m documents have loaded, we get an error like:

[2022-02-04T23:45:34.383Z] 2022-02-04 23:45:29,869 INFO  (FlexSolrDocumentLoader:47) Processed 1000 flex ontology docs at 674000 and committing...
[2022-02-04T23:45:37.612Z] 2022-02-04 23:45:36,895 INFO  (FlexCollection:253) Loaded: 675000 of 1520950, elapsed: 2:23:28.058, eta: 2:49:11.400
[2022-02-04T23:45:37.612Z] 2022-02-04 23:45:36,896 INFO  (FlexSolrDocumentLoader:47) Processed 1000 flex ontology docs at 675000 and committing...
[2022-02-04T23:46:33.662Z] Exception in thread "main" org.apache.solr.common.SolrException: [was class java.io.IOException] java.util.concurrent.TimeoutException: Idle timeout expired: 30000/30000 ms
java.lang.RuntimeException: [was class java.io.IOException] java.util.concurrent.TimeoutException: Idle timeout expired: 30000/30000 ms
	at com.ctc.wstx.util.ExceptionUtil.throwRuntimeException(ExceptionUtil.java:18)
	at com.ctc.wstx.sr.StreamScanner.throwLazyError(StreamScanner.java:731)
	at com.ctc.wstx.sr.BasicStreamReader.safeFinishToken(BasicStreamReader.java:3657)
	at com.ctc.wstx.sr.BasicStreamReader.getText(BasicStreamReader.java:809)
	at org.apache.solr.handler.XMLLoader.readDoc(XMLLoader.java:315)
	at org.apache.solr.handler.XMLLoader.processUpdate(XMLLoader.java:156)
	at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:79)
	at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:58)
	at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
	at org.apache.solr.core.SolrCore.execute(SolrCore.java:1376)
	at org.apa
[2022-02-04T23:46:33.662Z]
[2022-02-04T23:46:33.662Z] [was class java.io.IOException] java.util.concurrent.TimeoutException: Idle timeout expired: 30000/30000 ms
java.lang.RuntimeException: [was class java.io.IOException] java.util.concurrent.TimeoutException: Idle timeout expired: 30000/30000 ms
	at com.ctc.wstx.util.ExceptionUtil.throwRuntimeException(ExceptionUtil.java:18)
	at com.ctc.wstx.sr.StreamScanner.throwLazyError(StreamScanner.java:731)
	at com.ctc.wstx.sr.BasicStreamReader.safeFinishToken(BasicStreamReader.java:3657)
	at com.ctc.wstx.sr.BasicStreamReader.getText(BasicStreamReader.java:809)
	at org.apache.solr.handler.XMLLoader.readDoc(XMLLoader.java:315)
	at org.apache.solr.handler.XMLLoader.processUpdate(XMLLoader.java:156)
	at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:79)
	at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:58)
	at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
	at org.apache.solr.core.SolrCore.execute(SolrCore.java:1376)
	at org.apa
[2022-02-04T23:46:33.662Z]
[2022-02-04T23:46:33.662Z] request: http://localhost:8080/solr/update?wt=javabin&version=2
[2022-02-04T23:46:33.662Z] 	at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:427)

After running this several times, the error usually occurs between two and three hours into what should be an approximately five-hour load, given the number of documents. Note that these initial numbers are from #77, where the full number of documents would have been 1520942 (compared to our current load of 1168920 documents).

Given that we know solr can typically handle many more documents (in the main GO pipeline) and is being loaded in batches anyway, it seems unlikely to me that solr itself is choking. I suspect that there is some kind of memory-handling issue or incorrectly passed parameter to the owltools loader that eventually causes memory thrashing and then the error. As a next step, I'll rerun this and make note of memory and disk usage as it approaches the limit. If it is not in owltools directly, this should still give us information about where to look next.
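
A minimal sketch of this kind of memory/disk sampling, assuming psutil is available on the build host (a hypothetical helper, not part of the pipeline):

```python
import shutil
import time

import psutil  # assumed to be available on the build host

def sample(interval_s=60, path="/"):
    """Print overall memory and disk usage once per interval so a spike can
    be lined up against the loader's log timestamps."""
    while True:
        mem = psutil.virtual_memory()
        disk = shutil.disk_usage(path)
        print(f"{time.strftime('%Y-%m-%dT%H:%M:%S')} "
              f"mem={mem.used / 2**30:.1f}G/{mem.total / 2**30:.1f}G "
              f"disk={disk.used / 2**30:.1f}G/{disk.total / 2**30:.1f}G",
              flush=True)
        time.sleep(interval_s)

if __name__ == "__main__":
    sample()
```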


kltm commented Feb 28, 2022

After talking to @pgaudet, we'll be asking upstream to filter out the SARS-CoV-2 entries.

kltm added a commit that referenced this issue Mar 18, 2022
kltm added a commit to geneontology/pipeline that referenced this issue Mar 18, 2022

kltm commented Mar 19, 2022

Okay, I've managed to spend a little time with this and have some observations:

  • I actually managed to load the entire reviewed file when it was the only thing I loaded.
  • owltools seems to be the weak link, with memory
    • peaking at 211G, over the 192G given on the CLI (I don't know what it "normally" looks like)
    • solr never got over 82G, with 128G given on the CLI
    • loading it took about the same amount of time as the rest: ~3hrs vs ~3hrs+
    • total number of entities was about the same

All told (unless I just happened to be stupendously lucky this time), I think the issue is that owltools can do one or the other with the memory given, but will eventually thrash if it tries to do both. I think the most expedient next steps would be:

  • try a full load, but with more memory
  • see how hard it would be to set it up to load first the reviewed file, breathe, then the rest;
    worst case, a new docker image that just does them separately (although I'd rather keep this in the pipeline if at all possible and not hide this weirdness layers down)

kltm added a commit to geneontology/pipeline that referenced this issue Mar 21, 2022
kltm added a commit that referenced this issue Mar 21, 2022

kltm commented Mar 21, 2022

Okay, I'm trying to just add uniprot_reviewed in again to what we have (bumping ecocyc out for the moment). With that, we're still having the same kind of problem we've had before (i.e. #80) with:

15:31:31  Exception in thread "main" org.semanticweb.owlapi.model.OWLOntologyStorageException: org.obolibrary.oboformat.model.FrameStructureException: multiple name tags not allowed. in frame:Frame(UniProtKB:Q8IUB2 id( UniProtKB:Q8IUB2)synonym( WFDC3 BROAD)xref( Ensembl:ENSG00000124116)xref( HGNC:HGNC:15957)synonym( WFDC3 RELATED)synonym( WAP14 RELATED)property_value( https://w3id.org/biolink/vocab/category https://w3id.org/biolink/vocab/Protein)property_value( https://w3id.org/biolink/vocab/category https://w3id.org/biolink/vocab/MacromolecularMachine)xref( EMBL:AL050348)synonym( Q8IUB2 RELATED)synonym( NP_542181.1 RELATED)synonym( 15957 RELATED)synonym( AL050348 RELATED)name( WFDC3 Hsap)xref( RefSeq:NP_542181.1)synonym( ENSG00000124116 RELATED)xref( HGNC:15957)name( WFDC3 NCBITaxon:9606)synonym( HGNC:15957 RELATED)is_a( CHEBI:36080)relationship( in_taxon NCBITaxon:9606))
15:31:31  	at org.semanticweb.owlapi.oboformat.OBOFormatRenderer.render(OBOFormatRenderer.java:90)
15:31:31  	at org.semanticweb.owlapi.oboformat.OBOFormatStorer.storeOntology(OBOFormatStorer.java:42)
15:31:31  	at org.semanticweb.owlapi.util.AbstractOWLStorer.storeOntology(AbstractOWLStorer.java:155)
15:31:31  	at org.semanticweb.owlapi.util.AbstractOWLStorer.storeOntology(AbstractOWLStorer.java:119)
15:31:31  	at uk.ac.manchester.cs.owl.owlapi.OWLOntologyManagerImpl.saveOntology(OWLOntologyManagerImpl.java:1525)
15:31:31  	at uk.ac.manchester.cs.owl.owlapi.OWLOntologyManagerImpl.saveOntology(OWLOntologyManagerImpl.java:1502)
15:31:31  	at owltools.io.ParserWrapper.saveOWL(ParserWrapper.java:289)
15:31:31  	at owltools.io.ParserWrapper.saveOWL(ParserWrapper.java:209)
15:31:31  	at owltools.cli.CommandRunner.runSingleIteration(CommandRunner.java:3712)
15:31:31  	at owltools.cli.CommandRunnerBase.run(CommandRunnerBase.java:76)
15:31:31  	at owltools.cli.CommandRunnerBase.run(CommandRunnerBase.java:68)
15:31:31  	at owltools.cli.CommandLineInterface.main(CommandLineInterface.java:12)
15:31:31  Caused by: org.obolibrary.oboformat.model.FrameStructureException: multiple name tags not allowed. in frame:Frame(UniProtKB:Q8IUB2 id( UniProtKB:Q8IUB2)synonym( WFDC3 BROAD)xref( Ensembl:ENSG00000124116)xref( HGNC:HGNC:15957)synonym( WFDC3 RELATED)synonym( WAP14 RELATED)property_value( https://w3id.org/biolink/vocab/category https://w3id.org/biolink/vocab/Protein)property_value( https://w3id.org/biolink/vocab/category https://w3id.org/biolink/vocab/MacromolecularMachine)xref( EMBL:AL050348)synonym( Q8IUB2 RELATED)synonym( NP_542181.1 RELATED)synonym( 15957 RELATED)synonym( AL050348 RELATED)name( WFDC3 Hsap)xref( RefSeq:NP_542181.1)synonym( ENSG00000124116 RELATED)xref( HGNC:15957)name( WFDC3 NCBITaxon:9606)synonym( HGNC:15957 RELATED)is_a( CHEBI:36080)relationship( in_taxon NCBITaxon:9606))
15:31:31  	at org.obolibrary.oboformat.model.Frame.checkMaxOneCardinality(Frame.java:424)
15:31:31  	at org.obolibrary.oboformat.model.Frame.check(Frame.java:405)
15:31:31  	at org.obolibrary.oboformat.model.OBODoc.check(OBODoc.java:390)
15:31:31  	at org.obolibrary.oboformat.writer.OBOFormatWriter.write(OBOFormatWriter.java:183)
15:31:31  	at org.semanticweb.owlapi.oboformat.OBOFormatRenderer.render(OBOFormatRenderer.java:88)
15:31:31  	... 11 more
15:31:32  Makefile:30: recipe for target 'neo.obo' failed
15:31:32  make: *** [neo.obo] Error 1

Taking a look at the files:

bbop@wok:/var/lib/jenkins/workspace/peline_issue-neo-82-all-reviewed/neo/mirror$ zgrep Q8IUB2 *.gz
goa_human.gpi.gz:UniProtKB	Q8IUB2	WFDC3	WAP four-disulfide core domain protein 3	WFDC3|WAP14	protein	taxon:9606		HGNC:15957	db_subset=Swiss-Prot
goa_human_isoform.gpi.gz:UniProtKB	F2Z2G4	WFDC3	WAP four-disulfide core domain protein 3	WFDC3	protein	taxon:9606	UniProtKB:Q8IUB2	HGNC:15957	db_subset=TrEMBL
goa_human_isoform.gpi.gz:UniProtKB	F2Z2G5	WFDC3	WAP domain-containing protein	WFDC3	protein	taxon:9606	UniProtKB:Q8IUB2	HGNC:15957	db_subset=TrEMBL
goa_human_isoform.gpi.gz:UniProtKB	H0Y2V5	WFDC3	WAP four-disulfide core domain protein 3	WFDC3	protein	taxon:9606	UniProtKB:Q8IUB2	HGNC:15957	db_subset=TrEMBL
uniprot_reviewed.gpi.gz:UniProtKB	Q8IUB2	WFDC3	WAP four-disulfide core domain protein 3	WFDC3|WAP14	protein	taxon:9606		EMBL:AL050348|RefSeq:NP_542181.1|HGNC:HGNC:15957|Ensembl:ENSG00000124116	db_subset=Swiss-Prot|taxon_name=Homo sapiens|taxon_common_name=Human|proteome=gcrpCan

@balhoff I'm betting there will be a lot of collisions like this, and dealing with them on a one-by-one basis will take a long time. Is there a way to just have these clobber or skip, or do we need to write a filter script to take care of them up front?
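
For scale, a collision check across the mirrored GPI files could be as simple as this sketch (a hypothetical helper, not something that exists in the repo):

```python
import gzip
import sys
from collections import defaultdict

def collisions(gpi_paths):
    """Map each DB:DB_Object_ID key to the set of GPI files it appears in,
    and return only the keys that occur in more than one file."""
    seen = defaultdict(set)
    for path in gpi_paths:
        with gzip.open(path, "rt") as fh:
            for line in fh:
                if line.startswith("!") or not line.strip():
                    continue
                cols = line.rstrip("\n").split("\t")
                if len(cols) < 2:
                    continue
                seen[f"{cols[0]}:{cols[1]}"].add(path)
    return {key: paths for key, paths in seen.items() if len(paths) > 1}

if __name__ == "__main__":
    for key, paths in sorted(collisions(sys.argv[1:]).items()):
        print(key, *sorted(paths), sep="\t")
```

Run over mirror/*.gpi.gz, that would list every ID, like UniProtKB:Q8IUB2 above, that is defined in two or more inputs.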

@cmungall
Member

I suggest making a new issue for this and coordinating with Alex

For the goa_human vs goa_human_isoform issue:

The UniProt files are a bit different from the rest. The GPI specs are, AFAIK, silent on how a set of GPs should be partitioned across files, but I would strongly recommend making it a requirement that uniqueness be guaranteed for GPIs loaded into NEO. For UniProt this means

EITHER

  1. goa_X_isoform includes BOTH isoforms AND all reference entities
  2. goa_X_isoform includes ONLY isoforms AND no reference entities

My preference would be for 2.

I suggest a uniprot-specific one-line script up front that reports and filters any line in goa_X_isoform that does not match \w+\-\d+ in col2.

For uniprot_reviewed, I think the easiest thing is to filter out any already-covered taxon.
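
For the col2 check, a sketch of roughly what that one-liner expands to, reading a goa_X_isoform GPI on stdin (the regex is the one suggested above and may need tightening):

```python
import re
import sys

ISOFORM_ID = re.compile(r"^\w+-\d+$")  # e.g. Q8IUB2-2; plain accessions do not match

for line in sys.stdin:
    cols = line.rstrip("\n").split("\t")
    if line.startswith("!") or (len(cols) > 1 and ISOFORM_ID.match(cols[1])):
        sys.stdout.write(line)   # keep headers and true isoform rows
    else:
        sys.stderr.write(line)   # report anything else for review/filtering
```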


kltm commented Mar 22, 2022

Apparently there is a lot of overlap in the first pass with species we already have:

   567013 /tmp/uniprot_reviewed.gpi
   388714 /tmp/naively_filtered_file.gpi

Will bolt this in and see if there are any collisions left.
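
For reference, the naive filter amounts to roughly this sketch (a hypothetical stand-in for the actual pipeline change; covered_taxa would be collected from the GPI files we already load):

```python
import sys

def filter_uncovered(reviewed_gpi, covered_taxa):
    """Yield uniprot_reviewed lines whose taxon (GPI column 7) is not already
    represented by one of the currently loaded GPI files."""
    with open(reviewed_gpi) as fh:
        for line in fh:
            cols = line.rstrip("\n").split("\t")
            if line.startswith("!") or len(cols) < 7 or cols[6] not in covered_taxa:
                yield line

if __name__ == "__main__":
    # usage: filter.py uniprot_reviewed.gpi covered_taxa.txt > naively_filtered_file.gpi
    covered = {t.strip() for t in open(sys.argv[2]) if t.strip()}  # e.g. one "taxon:9606" per line
    sys.stdout.writelines(filter_uncovered(sys.argv[1], covered))
```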

kltm added a commit to geneontology/pipeline that referenced this issue Mar 25, 2022
kltm added a commit that referenced this issue Apr 5, 2022
kltm added a commit that referenced this issue Apr 5, 2022
kltm added a commit that referenced this issue Apr 5, 2022
kltm added a commit to geneontology/pipeline that referenced this issue Apr 5, 2022
… we don't get files we need deleted before we use them (specifically datasets.json); for geneontology/neo#82

kltm commented Apr 5, 2022

Broke up the pipeline command from "make clean all" into "make clean" and "make all" to get around an ordering issue:

touch trigger
wget http://s3.amazonaws.com/go-build/metadata/datasets.json -O datasets.json && touch datasets.json
--2022-04-05 15:33:57--  http://s3.amazonaws.com/go-build/metadata/datasets.json
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.217.48.62
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.217.48.62|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 81874 (80K) [application/json]
Saving to: ‘datasets.json’

datasets.json       100%[===================>]  79.96K   406KB/s    in 0.2s    

2022-04-05 15:33:58 (406 KB/s) - ‘datasets.json’ saved [81874/81874]

./build-neo-makefile.py -i datasets.json > Makefile-gafs.tmp && mv Makefile-gafs.tmp Makefile-gafs
rm trigger datasets.json mirror/*gz target/*.obo || echo "not all files present, perhaps last build did not complete"

kltm added a commit that referenced this issue Apr 6, 2022
… things that are not present in the datasets.json; work on #82

kltm commented Apr 6, 2022

Okay, I think we're getting a little further along with the collisions. Added an additional manual filter list to pick up the things that are "manual" in the Makefile (not datasets.json). Temporary; seeing if that can get us through the owltools conversion.

kltm added a commit that referenced this issue Apr 6, 2022

kltm commented Apr 6, 2022

@pgaudet @vanaukenk
Okay, we have had some success with the new NEO load with more entities. The formula for this, similar to how we handle things in the main pipeline, is:

(all currently loaded files: sgd pombase mgi zfin rgd dictybase fb tair wb goa_human goa_human_complex goa_human_rna goa_human_isoform goa_pig xenbase pseudocap ecocyc goa_sars-cov-2)
+
(uniprot_reviewed - (lines with taxa represented in what we currently load above))

To see how this looks, I've put it on to amigo-staging:
https://amigo-staging.geneontology.io/amigo/search/ontology

The load we currently have, for comparison, is here:
http://noctua-amigo.berkeleybop.org/amigo/search/ontology

@vanaukenk

Thanks for the update @kltm
Is the goal to ultimately have a four-letter abbreviation for each of the taxa? Some still just show the NCBITaxon id.
(I searched on sod1 as an example).


pgaudet commented Apr 6, 2022

I don't understand where these links go - did you want to show entities? I don't know how to get to entities from there.


pgaudet commented Apr 6, 2022

@vanaukenk
Are we going to make our own four-letter taxon abbreviations public? Should we not show something more standard?


kltm commented Apr 6, 2022

@vanaukenk My understanding for the moment was that we were going to start out with the taxon ID and then iterate from there.

@pgaudet Those links go to the two NEO loads, as seen through the AmiGO ontology interface; one for the newer load we're experimenting with and one for the current load. Remember to remove the "GO" filter to see all the entities available.


kltm commented Apr 6, 2022

Shout out to @cmungall for finding this. In the newest NEO load (and maybe some of these are in the older one as well), below is a list of the kinds of entities that were not correctly converted to CURIEs--1350337 in total. Some of those are probably not practically important, as nobody would be curating to them, but some seem important:

http://purl.obolibrary.org/obo/AGI_LocusCode_XYZ : 28986
http://identifiers.org/wormbase/XYZ : 152
http://identifiers.org/uniprot/XYZ : 49
http://purl.bioontology.org/ontology/provisional/XYZ : 17
http://identifiers.org/mgi/MGI:XYZ : 4

Samples from the complete list:

alters_location_of
anastomoses_with
anteriorly_connected_to
attached_to
channel_for
channels_from
...
synapsed_by
Tmp_new_group
transitively_anteriorly_connected_to
...
transitively_proximally_connected_to
trunk_part_of
TS01
...
TS28
xunion_of
http://identifiers.org/mgi/MGI:106910
http://identifiers.org/uniprot/A0A5F9CQZ0
http://identifiers.org/wormbase/B0035.8%7CWB%3AF54E12.4%7CWB%3AF55G1.3%7CWB%3AH02I12.6
http://purl.bioontology.org/ontology/provisional/1ddd2e2d-2ace-4c87-8ec6-d3b5730b3e7c
http://purl.obolibrary.org/obo/D96882F1-8709-49AB-BCA9-772A67EA6C33
http://semanticscience.org/resource/SIO_000658
http://www.geneontology.org/formats/oboInOwl#Subset
http://www.w3.org/2002/07/owl#topObjectProperty
http://xmlns.com/foaf/0.1/image

@balhoff @cmungall Is this something where owltools needs a different CURIE map? A post filter? Or is this better handled by circling back to #83?
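
For illustration, a post filter would essentially be prefix contraction over a small map, something like the sketch below; the prefix-to-CURIE pairs here are guesses at what such a map would need to contain, not owltools' actual configuration:

```python
# First-match contraction of IRIs like those reported above into CURIEs.
PREFIXES = [
    ("http://purl.obolibrary.org/obo/AGI_LocusCode_", "AGI_LocusCode:"),
    ("http://identifiers.org/wormbase/", "WB:"),
    ("http://identifiers.org/uniprot/", "UniProtKB:"),
    ("http://identifiers.org/mgi/", ""),  # local part is already "MGI:..."
]

def contract(iri: str) -> str:
    """Return a CURIE if a known prefix matches, otherwise the IRI unchanged."""
    for prefix, curie_prefix in PREFIXES:
        if iri.startswith(prefix):
            return curie_prefix + iri[len(prefix):]
    return iri

assert contract("http://identifiers.org/uniprot/A0A5F9CQZ0") == "UniProtKB:A0A5F9CQZ0"
assert contract("http://identifiers.org/mgi/MGI:106910") == "MGI:106910"
```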


kltm commented Apr 7, 2022

  • go forward with what we have--there should be no "blockers" for our current use cases
  • iterate on things like species codes and outlier non-compacting identifiers (trace back to source)
  • add some minimal tests to the project; @pgaudet @vanaukenk, could I get some help with test identifiers?


kltm commented Apr 7, 2022

Now have geneontology/go-annotation#4105 and #88 to trace entities.
For QC: #89


kltm commented Apr 21, 2022

From managers' discussion, this is now live.

kltm closed this as completed Apr 21, 2022