Integrate remainder of MGI pipeline into the GO pipeline #77

Open
kltm opened this issue May 24, 2023 · 18 comments

@kltm
Member

kltm commented May 24, 2023

Project link

https://github.com/orgs/geneontology/projects/136

Project description

Currently, the GOC picks up MGI ortholog and upstream annotation data from MGI. This project will be complete when the GOC directly pulls in this data, processes it, and adds it to the current data flow. This would remove MGI from the loop of directly processing MGI/mouse function data.

PI

Chris

Product owner (PO)

Li/Pascale

Technical lead (TL)

Sierra

Other personnel (OP)

Seth, Dustin, Anushya

Technical specs

While there is new software being written for this project, it is either 1) within the bounds of current technologies and practices or is 2) custom and one-off, not to be reused elsewhere. The needs of the project are described in great detail in the folders listed below; minimally meeting these requirements and rendering them into a pipeline is the entire scope of the project.

Other comments

This is a continuation of:

#42
https://github.com/orgs/geneontology/projects/109

@kltm kltm added the Needs LA approval, Needs PM approval, Needs tech doc, Needs PI, Needs PO, and Needs TL labels May 24, 2023
@kltm kltm changed the title from "TBD" to "Integrate remainder of MGI pipeline into the GO pipeline" May 24, 2023
@kltm
Member Author

kltm commented May 24, 2023

Letting @pgaudet and @ukemi know that this is seeded with likely personnel.

@pgaudet pgaudet assigned ukemi and unassigned pgaudet May 31, 2023
@kltm
Member Author

kltm commented May 31, 2023

Possible order of operations

@kltm
Member Author

kltm commented May 31, 2023

TODO: add clarification for orthology source and how to process down to positive/negative list

@kltm
Member Author

kltm commented Jun 1, 2023

@kltm
Member Author

kltm commented Jul 19, 2023

@ukemi

ukemi commented Jul 26, 2023

QC rounds show that the Rat ISO load is done.
@sierra-moxon will begin working on the human ISO annotations and @ukemi will begin QC on those.
Once GPAD specs have been finalized, the GOC will begin providing test files for Lori to load into MGI.

@kltm
Member Author

kltm commented Aug 8, 2023

New repository for this project at https://github.com/geneontology/gopreprocess

@kltm
Member Author

kltm commented Aug 11, 2023

Noting for @pgaudet that we have hit a couple of slowdown points: some core software needs updating to support recent tooling (basically, we need to start moving off some very old Python versions). This will likely result in a small overhead increase for the project and draw in myself and @dustine32 for some tasks.

@ukemi

ukemi commented Oct 25, 2023

Note that the rat and human ISO parts of the pipeline are close to completion and we have begun working on mouse annotations from Protein2GO. The rate-limiting step for completing this project is the GOC-wide conversion to the GPAD2.0 format and the generation of the GPAD2.0 files.
There are also some issues to be discussed at the GOC-level:

  1. Currently the filtering of 'duplicates' is not taking place at the GOC end. We need to put this into place not only for this project, but globally for the entire GOC.
  2. Do we want annotations that do not map to mouse genes in MGI in the corpus of annotations resident at the GOC?
  3. Do we want annotations from all of the IEA pipelines that are emitted from UniProt?
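
As a rough illustration of the first point, here is a minimal sketch of what duplicate filtering over tab-separated annotation lines could look like. It is not the GOC's actual rule: the thread does not define what counts as a 'duplicate', so the function name `filter_duplicates` and the choice of key columns in `DEFAULT_KEY_COLUMNS` are hypothetical, and the sketch simply treats two rows as duplicates when the chosen columns match.

```python
from typing import Iterable, Iterator, Sequence

# ASSUMPTION: which columns define a "duplicate" is not specified in this thread.
# These 0-based indexes are a hypothetical choice for illustration only.
DEFAULT_KEY_COLUMNS = (0, 1, 2, 3, 4, 5, 6)


def filter_duplicates(
    lines: Iterable[str],
    key_columns: Sequence[int] = DEFAULT_KEY_COLUMNS,
) -> Iterator[str]:
    """Yield lines, dropping any data row whose key columns were already seen."""
    seen = set()
    for line in lines:
        # Pass header/comment lines (starting with "!") and blank lines through.
        if line.startswith("!") or not line.strip():
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        # Missing trailing columns are treated as empty strings.
        key = tuple(fields[i] if i < len(fields) else "" for i in key_columns)
        if key in seen:
            continue  # same key as an earlier row: treat as a duplicate
        seen.add(key)
        yield line
```

With this key, two rows that differ only in a later column (for example, a date) collapse to the first one seen; whether that is the desired behaviour is exactly the kind of question that would need to be settled at the GOC level.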

@kltm
Member Author

kltm commented Feb 8, 2024

@pgaudet I had a long conversation with @sierra-moxon and have a feel for the position of the work.
Basically, in a perfect world, it may be that all direct (i.e. ontobio) software work is done and all that's left is checking, making a GPAD/GPI 2.0 announcement, and running it through the main pipeline. That said, this needs to be confirmed, and running this through a pipeline that is a decent simulation of the final work is hitting most of the same problems we run into when trying to do release pipeline stuff.
To get past this, I'll be prioritizing landing it, by whatever methods I can, on a "close enough" version of the final product so that we can do any final debugging and confirm the output with MGI. Once MGI has given that confirmation, it will be on us to make the final timeline and do the technical stuff. I've assigned myself geneontology/pipeline#325.

@pgaudet

pgaudet commented Feb 8, 2024

It may be that the GPAD/GPI production would be better off as a separate project.

@kltm
Member Author

kltm commented Feb 8, 2024

Talking to @pgaudet and @suzialeksander, the next concrete steps are:

  • produce and confirm output for current "quick" test pipeline runs
  • make sure that MGI gets access to these files (@sierra-moxon)
  • using these files, confirm that GPAD/GPI 2.0 look "good" for MGI (@sierra-moxon)
  • using these files, confirm that GPAD/GPI 2.0 look "good" for consortium (@pgaudet)
  • @pgaudet and @suzialeksander to send announcement that the format will be our primary output after date X/Y/Z
  • on date X/Y/Z, GPAD/GPI 2.0 code moved into the main pipeline branches (i.e. snapshot and release)
  • proceed.

(Note, if more work is needed on the MGI/QC side, we are likely to proceed with adding the code sooner anyways.)

@kltm
Member Author

kltm commented Feb 13, 2024

@pgaudet Some of the development team took a look at the output from the test pipeline and there are some issues with the data that we want to pin down before passing the results on to MGI, mainly an increase in annotations in one file that we're having a little trouble tracing. This will mean 1) re-running some of the data (about half a day lag, assuming the pipelines are cooperative) and 2) tweaking/checking a GPAD reprocessing step. We will be meeting again mid-week to see where we're at.

@kltm
Member Author

kltm commented Feb 13, 2024

Also tagging @sierra-moxon and @dustine32

@kltm
Member Author

kltm commented Feb 27, 2024

@pgaudet Changing PO to Li/Pascale

@kltm
Member Author

kltm commented Apr 4, 2024

@pgaudet Talking to @sierra-moxon, the remaining items in https://github.com/orgs/geneontology/projects/136 are MGI bookkeeping items, with all GO-driven items now moved or being re-created for https://github.com/orgs/geneontology/projects/155.

But these are still open items in our tracker. I think one way forward, to prevent confusion, would be to rename the project and project metadata to make clear that this is now an "MGI sub-project" and move it into the external collab category (i.e. no more "GO" resources, beyond communication, unless something bad happens).

@pgaudet

pgaudet commented Apr 12, 2024

@kltm Is geneontology/go-site#2043 an MGI task?

@kltm
Member Author

kltm commented Apr 12, 2024

@pgaudet Assuming no answer is needed, as it is now closed.

@pgaudet pgaudet assigned pgaudet and unassigned ukemi Apr 18, 2024