In some cases bad identifiers are getting into the load #112

Open
kltm opened this issue Jan 13, 2023 · 5 comments

@kltm
Member

kltm commented Jan 13, 2023

In the most recent successful load, the following warning went by in the log:

    20:35:45  2023-01-12 04:35:45,757 WARN  (OWLGraphWrapperExtended:936) Unable to retrieve the value of oboInOw#id as the identifier for http://identifiers.org/wormbase/T10C6.13%7CWB%3AF45F2.13%7CWB%3AZK131.3%7CWB%3AZK131.7%7CWB%3AK06C4.5%7CWB%3AZK131.2%7CWB%3AK06C4.13%7CWB%3AF17E9.10%7CWB%3AK03A1.1%7CWB%3AF08G2.3%7CWB%3AB0035.10%7CWB%3AF07B7.5%7CWB%3AF54E12.1%7CWB%3AF55G1.2%7CWB%3AF22B3.2; we will use an original iri as the identifier.

Nothing like this seems to be in the WB GPI. In fact, no GPI seems to have this, so it may be coming from a parsed GAF? Weird.
Before digging in more, does this ring any bells @vanaukenk ?
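
For anyone reading along, here is a minimal Python sketch (not part of the loader) that percent-decodes the IRI from the warning above. It assumes the `http://identifiers.org/wormbase/` prefix seen in the log, and the variable names are purely illustrative. The decoded tail appears to be a pipe-separated list of `WB:`-prefixed identifiers rather than a single identifier:

```python
from urllib.parse import unquote

# IRI copied verbatim from the WARN line above.
iri = (
    "http://identifiers.org/wormbase/"
    "T10C6.13%7CWB%3AF45F2.13%7CWB%3AZK131.3%7CWB%3AZK131.7%7CWB%3AK06C4.5"
    "%7CWB%3AZK131.2%7CWB%3AK06C4.13%7CWB%3AF17E9.10%7CWB%3AK03A1.1"
    "%7CWB%3AF08G2.3%7CWB%3AB0035.10%7CWB%3AF07B7.5%7CWB%3AF54E12.1"
    "%7CWB%3AF55G1.2%7CWB%3AF22B3.2"
)

# Assumed prefix, taken from the log line itself.
prefix = "http://identifiers.org/wormbase/"

# %7C decodes to "|" and %3A decodes to ":".
decoded = unquote(iri[len(prefix):])

# Split on the pipe and drop the "WB:" prefix where present.
ids = [p[3:] if p.startswith("WB:") else p for p in decoded.split("|")]

print(decoded)
print(len(ids), "identifiers:", ids)
```

If that reading is right, the "bad identifier" is a whole multi-valued field that was turned into one IRI.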

@kltm
Member Author

kltm commented Jan 13, 2023

Okay, I take that back: I've found the source in the wb.gpi:

    bbop@wok:/home/skyhook/release/products/annotations$ zcat wb-src.gpi.gz | grep "F07B7.5"
    WB	WBGene00001923	his-49	HIStone	CELE_F07B7.5	gene	taxon:6239	UniProtKB:P08898	
    WB	F07B7.5	his-49	HIStone	CELE_F07B7.5	transcript	taxon:6239	WB:WBGene00001923		
    WB	CE03253	HIS-2	HIStone	CELE_T10C6.13	protein	taxon:6239	WB:T10C6.13|WB:F45F2.13|WB:ZK131.3|WB:ZK131.7|WB:K06C4.5|WB:ZK131.2|WB:K06C4.13|WB:F17E9.10|WB:K03A1.1|WB:F08G2.3|WB:B0035.10|WB:F07B7.5|WB:F54E12.1|WB:F55G1.2|WB:F22B3.2	UniProtKB:P08898|UniProtKB:K7ZUH9	

This is ringing a bell; I'm going to dig around to see if I can find a previous instance of this.
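
A rough way to check how widespread this is: the sketch below flags GPI rows whose eighth tab-separated column contains more than one pipe-separated value. Treating column 8 as the parent object ID is an assumption based on the rows pasted above, and `multi_parent_rows` is a hypothetical helper, not something in the pipeline.

```python
def multi_parent_rows(lines, parent_col=7):
    """Yield (object_id, parent_ids) for rows whose parent column
    (0-based index `parent_col`, assumed to be column 8) holds
    more than one pipe-separated identifier."""
    for line in lines:
        # Skip blank lines and "!" header/comment lines.
        if not line.strip() or line.startswith("!"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) <= parent_col:
            continue
        parents = [p for p in cols[parent_col].split("|") if p]
        if len(parents) > 1:
            yield cols[1], parents


# Two of the rows from the grep output above, with tabs written as \t.
sample = [
    "WB\tF07B7.5\this-49\tHIStone\tCELE_F07B7.5\ttranscript\ttaxon:6239"
    "\tWB:WBGene00001923\t\t",
    "WB\tCE03253\tHIS-2\tHIStone\tCELE_T10C6.13\tprotein\ttaxon:6239"
    "\tWB:T10C6.13|WB:F45F2.13|WB:ZK131.3|WB:ZK131.7|WB:K06C4.5|WB:ZK131.2"
    "|WB:K06C4.13|WB:F17E9.10|WB:K03A1.1|WB:F08G2.3|WB:B0035.10|WB:F07B7.5"
    "|WB:F54E12.1|WB:F55G1.2|WB:F22B3.2"
    "\tUniProtKB:P08898|UniProtKB:K7ZUH9\t",
]

flagged = list(multi_parent_rows(sample))
for object_id, parents in flagged:
    print(object_id, len(parents), parents)
```

Against the real file, the same function could be fed `gzip.open("wb-src.gpi.gz", "rt")` instead of the sample list.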

@vanaukenk

Interesting.
This didn't ring any bells, but there are WB sequence identifiers buried in that string and when I check a few of them, I see that they correspond to genes that produce the exact same protein.

@kltm
Member Author

kltm commented Jan 13, 2023

Hm, it looks like we've asked similar questions in the past and felt that it didn't matter much in the grand scheme of things; see #88 (comment) (note the WB identifier).

@vanaukenk

Okay.
As C. elegans protein identifiers are currently assigned in WB, genes that ultimately produce a protein with the same amino acid sequence share a single protein ID, so we don't have a unique protein ID for each gene.
If you think we need a better way of handling this, we can discuss some more.

@kltm
Member Author

kltm commented Feb 24, 2023

@vanaukenk @pgaudet As we come up on a few months on this issue (and about a year since closing the variant #111), I was wondering whether we're just documenting this (as we did previously with #88 (comment)) or whether we're going to take the time to fix it this time around. I'm not sure how much of a problem this is in this case, or whether it's considered worth fixing right now.
