CI/CD Pipeline: Ensuring Builds Use Most Current Data #90

Open
2 tasks
callahantiff opened this issue Feb 5, 2021 · 3 comments
@callahantiff
Owner

TASK

Currently, the build downloads data from the URLs listed in builds/data_to_download.txt. While this works for 90% of the existing data, a few data providers include explicit versions in their URLs. As of now, this means that unless we update this text file, we are not guaranteed to get the most current data. Additionally, some of the downloads rely on running a query against a data provider's API. This should always return the most up-to-date data, but we should verify this as well.

The following resources include explicit versions in the URLs and will need updates to resolve the aforementioned problem:

  • ftp://ftp.ensembl.org/pub/release-102/gtf/homo_sapiens/Homo_sapiens.GRCh38.102.gtf.gz ➞ Ensembl
  • ftp://ftp.ensembl.org/pub/release-102/tsv/homo_sapiens/Homo_sapiens.GRCh38.102.uniprot.tsv.gz ➞ Ensembl
  • ftp://ftp.ensembl.org/pub/release-102/tsv/homo_sapiens/Homo_sapiens.GRCh38.102.entrez.tsv.gz ➞ Ensembl
  • ftp://nlmpubs.nlm.nih.gov/online/mesh/rdf/2021/mesh2021.nt ➞ MeSH
  • GTEx_Analysis_2017-06-05_v8_RNASeQCv1.1.9_gene_median_tpm.gct ➞ GTEx
  • 9606.protein.links.v11.0.txt.gz ➞ STRING
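One way to stop pinning releases in the text file is to store each versioned URL as a template and fill in the release at download time. The sketch below is only an illustration: the `URL_TEMPLATES` keys and the `render_url` helper are hypothetical names, and it assumes that future Ensembl and MeSH releases keep the same path layout as the URLs listed above. GTEx and STRING are left out because only their file names, not full URLs, appear in the list.

```python
# Hypothetical templates derived from the versioned URLs listed above.
# "{version}" stands in for the hard-coded release (102 for Ensembl, 2021 for MeSH).
# Assumption: future releases keep the same directory and file naming scheme.
URL_TEMPLATES = {
    "ensembl_gtf": (
        "ftp://ftp.ensembl.org/pub/release-{version}/gtf/homo_sapiens/"
        "Homo_sapiens.GRCh38.{version}.gtf.gz"
    ),
    "ensembl_uniprot": (
        "ftp://ftp.ensembl.org/pub/release-{version}/tsv/homo_sapiens/"
        "Homo_sapiens.GRCh38.{version}.uniprot.tsv.gz"
    ),
    "ensembl_entrez": (
        "ftp://ftp.ensembl.org/pub/release-{version}/tsv/homo_sapiens/"
        "Homo_sapiens.GRCh38.{version}.entrez.tsv.gz"
    ),
    "mesh": "ftp://nlmpubs.nlm.nih.gov/online/mesh/rdf/{version}/mesh{version}.nt",
}


def render_url(key: str, version: str) -> str:
    """Fill the release placeholder of the template stored under `key`."""
    return URL_TEMPLATES[key].format(version=version)


if __name__ == "__main__":
    # Reproduces the first Ensembl URL from the list when given release 102.
    print(render_url("ensembl_gtf", "102"))
```

With templates like these, only the version value has to change between builds, and it can come from any source the pipeline trusts.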

The following resources are generated from querying an API:



TODO

  • Modify the download code for explicitly versioned URLs to ensure that we always get the most current data
  • Verify that resources downloaded via API queries also return the most current results
callahantiff added the coding and release v3.0.0 labels Feb 5, 2021
callahantiff self-assigned this Feb 5, 2021
callahantiff added the CI/CD - KG Builds label and removed the coding and release v3.0.0 labels Feb 5, 2021
@cthoyt

cthoyt commented Feb 15, 2021

check out the bioversions project, I'm working on similar stuff for solving this problem... unfortunately the state of versioned biomedical data is just as lacking as most other things 🤡

@callahantiff
Owner Author

@cthoyt - brilliant, yes! Will definitely work on this for upcoming releases. Thanks for pointing this out!

@cthoyt

cthoyt commented Feb 23, 2021

@callahantiff please let me know if there are any resources you're using that aren't supported by bioversions already and I will add them. The syntax to get the current version for one is:

import bioversions
version_string = bioversions.get_version('resource name')
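As a rough sketch of how that call might plug into the build, the snippet below fills a URL template with whatever version the lookup returns. Everything except `bioversions.get_version` is hypothetical: the `current_version` and `resolve_url` helpers, the `ENSEMBL_GTF_TEMPLATE` constant, and the resource name "ensembl" are assumptions, as is the idea that the returned version string has the same form as the release number in the URL. The lookup is injectable so the logic can be tried without network access or without bioversions installed.

```python
from typing import Callable, Optional

# Hypothetical template based on the Ensembl GTF URL from the issue description.
ENSEMBL_GTF_TEMPLATE = (
    "ftp://ftp.ensembl.org/pub/release-{version}/gtf/homo_sapiens/"
    "Homo_sapiens.GRCh38.{version}.gtf.gz"
)


def current_version(resource: str,
                    lookup: Optional[Callable[[str], str]] = None) -> str:
    """Return the current version of `resource` using `lookup`.

    When no lookup is given, fall back to bioversions.get_version.
    """
    if lookup is None:
        # Imported lazily so this module loads even without bioversions installed.
        import bioversions
        lookup = bioversions.get_version
    return lookup(resource)


def resolve_url(template: str, resource: str,
                lookup: Optional[Callable[[str], str]] = None) -> str:
    """Fill the "{version}" placeholder with the resource's current version."""
    return template.format(version=current_version(resource, lookup))


if __name__ == "__main__":
    # Stub lookup for a dry run; the version value here is made up.
    stub = {"ensembl": "103"}.__getitem__
    print(resolve_url(ENSEMBL_GTF_TEMPLATE, "ensembl", lookup=stub))
```

In a real build the `lookup` argument would be omitted so that bioversions is queried, after checking that each resource name is one bioversions supports.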
