Qcarchive update #187

Open
wants to merge 27 commits into
base: main

chrisiacovella
Member

@chrisiacovella chrisiacovella commented Sep 21, 2023

This updates qcarchive_utils.py to be compatible with v0.5 of qcportal. Relates to issue #185

This code reproduces the same behavior as the prior implementation.

@mikemhenry
Contributor

Awesome! This is good timing with #186

Once we get both in, we should cut a new release.

@chrisiacovella
Member Author

This PR implements the logic in effectively the same way as the old code, on a per-record basis (i.e., each function operates on a single record name). The new version of qcportal provides iterators over records, which are substantially faster (orders of magnitude, thanks to prefetching and caching). The next commit will include functions that operate on entire record sets to avoid the slow per-record performance.
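
To make the difference concrete, here is a minimal sketch of the two access patterns. It assumes the qcportal v0.5 dataset object exposes `get_record(name, specification_name=...)` (as used in this PR) and an `iterate_records(specification_names=...)` method that yields `(entry_name, specification_name, record)` tuples. The helper names `fetch_per_record` and `fetch_bulk` are hypothetical and not part of espaloma.

```python
from typing import Any, Dict, Iterable


def fetch_per_record(dataset: Any, record_names: Iterable[str],
                     spec_name: str = "default") -> Dict[str, Any]:
    """One lookup per record name (the pattern this PR keeps)."""
    return {
        name: dataset.get_record(name, specification_name=spec_name)
        for name in record_names
    }


def fetch_bulk(dataset: Any, spec_name: str = "default") -> Dict[str, Any]:
    """Single pass over the dataset using its record iterator.

    Assumption: iterate_records yields (entry_name, spec_name, record) tuples.
    """
    return {
        entry_name: record
        for entry_name, _spec, record in dataset.iterate_records(
            specification_names=[spec_name]
        )
    }
```

Both helpers return the same mapping of entry name to record; only the number of round trips to the server should differ.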

@codecov-commenter

codecov-commenter commented Sep 21, 2023

Codecov Report

❗ No coverage uploaded for pull request base (main@2e61215).
The diff coverage is n/a.

@mikemhenry
Contributor

We can probably remove that line since https://github.com/choderalab/espaloma/pull/187/files#diff-ba5d22563299549a389183418fe5786b83275382be592bf1ed06fae673b7d086R33 will pull in what we need (I think, I am not sure what the "main" qcarchive package is)

Contributor

@mikemhenry mikemhenry left a comment

LGTM, had two non-blocking notes

espaloma/data/qcarchive_utils.py Outdated
mol = final_molecules[angle]
# NOTE: this is calling the first index of the optimization array
# this gives the same value as the prior implementation, but I wonder if it
# should be molecule_optimization[angle][-1] in both cases

Contributor

@kntkb or @yuanqing-wang thoughts?

Member Author

So I've been trying to figure out the structure of the torsion drive datasets (since I had not really looked at them before this). In the example dataset I used in the test (code below), each angle has n unique initial conformations that are then optimized. In this case there are 4 configurations, each with its own trajectory. So I suppose choosing the first vs. the last is somewhat irrelevant (I was initially thinking this was a set of chained optimizations, hence my comment... don't ask why I was thinking that).

Should each of these conformations be considered and added to the datasets rather than just arbitrarily picking one?

from espaloma.data import qcarchive_utils
import numpy as np

# Entry name of one record in the torsion drive dataset
record_name = "[h]c1c(c(c(c([c:1]1[n:2]([c:3](=[o:4])c(=c([h])[h])[h])c([h])([h])[h])[h])[h])n(=o)=o)[h]"
name = "OpenFF Amide Torsion Set v1.0"
collection_type = "torsiondrive"
# Fetch the dataset and the names of its records from the QCArchive server
collection, record_names = qcarchive_utils.get_collection(qcarchive_utils.get_client(), collection_type, name)
record_info = collection.get_record(record_name, specification_name="default")

# Maps each grid angle to the list of optimizations run at that angle
molecule_optimization = record_info.optimizations
angle_keys = list(molecule_optimization.keys())

# First angle, first optimization: final molecule and final-step properties
angle = angle_keys[0]
mol = molecule_optimization[angle][0].final_molecule
result = molecule_optimization[angle][0].trajectory[-1].properties

Looking at the actual configurations:

for opt in molecule_optimization[angle]:
    init = opt.initial_molecule.geometry
    final = opt.final_molecule.geometry
    print(init, "\n-\n", final, "\n--\n")

Contributor

@kntkb or @yuanqing-wang thoughts?

I don't know off the top of my head, but I've played around with different QCArchive workflows in the past. I may have some notes left somewhere, so I'll catch up shortly (tomorrow?).

Contributor

Oh, the API and the way you access the data changed in qcportal v0.5...

Member Author

I think in the older version of qcportal, "get_final_molecule()" just picked the first one in the array. The full array was still part of the data record; you just had to dig through the qcvars or something to access it. From conversations with Ben, the old version tried hard to force records into a very rigid schema; he opted to break the schema in a lot of cases to make it easier to access the relevant information (and to make it clearer what information is available).

As I mentioned in an earlier comment, it seems that for each angle, multiple (in this case 4) independent starting configurations were used. It seems like it would be better to have the code return data for each replicate, but I'm not sure how this would impact any workflows that use this function.
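
A rough sketch of what returning every replicate could look like, assuming only the attributes already used in the snippet above (`record.optimizations`, `final_molecule`, `trajectory[-1].properties`). The function name `collect_all_replicates` is hypothetical and this is not what the PR currently does.

```python
from typing import Any, Dict, List, Tuple


def collect_all_replicates(record: Any) -> List[Tuple[Any, int, Any, Dict]]:
    """Return (angle, replicate_index, final_molecule, final_properties)
    for every optimization at every grid angle, instead of only index 0."""
    rows = []
    for angle, optimizations in record.optimizations.items():
        for index, opt in enumerate(optimizations):
            # The last trajectory step holds the properties of the final geometry
            rows.append(
                (angle, index, opt.final_molecule, opt.trajectory[-1].properties)
            )
    return rows
```

Callers that only want the previous behavior could keep the rows where `replicate_index == 0`, so existing workflows would not have to change at once.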

Contributor

@ijpulidos ijpulidos left a comment

This is great. I'm glad that we are now testing the behavior and have some documentation for these utils. I agree with the comments that have been made. Looks good to be merged, just a single non-blocking comment.

espaloma/data/tests/test_qcarchive.py
@mikemhenry mikemhenry self-requested a review September 22, 2023 16:31
Contributor

@mikemhenry mikemhenry left a comment

From @jchodera

There are apparently some additional issues with the object model such that datasets beyond OptimizationDataset are not supported

@chrisiacovella
Member Author

From @jchodera

There are apparently some additional issues with the object model such that datasets beyond OptimizationDataset are not supported

Yes, the get_graph function in the initial code was only set up to work with the OptimizationDataset. I think it would be straightforward to support SinglepointDataset objects and to add checks in get_graph and get_graphs that give a descriptive failure message if a different dataset type is tried.
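
Something along these lines could serve as the check. This is only a sketch: it assumes the dataset object carries a `dataset_type` string attribute (e.g. "optimization"), and `SUPPORTED_DATASET_TYPES`, `UnsupportedDatasetError` and `check_dataset_supported` are hypothetical names, not existing espaloma code.

```python
from typing import Any

# Assumption: qcportal labels optimization datasets with this string
SUPPORTED_DATASET_TYPES = ("optimization",)


class UnsupportedDatasetError(ValueError):
    """Raised when a dataset type cannot be converted to a graph."""


def check_dataset_supported(collection: Any) -> None:
    """Fail early with a descriptive message for unsupported dataset types."""
    dataset_type = getattr(collection, "dataset_type", None)
    if dataset_type not in SUPPORTED_DATASET_TYPES:
        raise UnsupportedDatasetError(
            f"get_graph only supports dataset types {SUPPORTED_DATASET_TYPES}; "
            f"got {dataset_type!r}."
        )
```

Calling this at the top of get_graph and get_graphs would turn an obscure attribute failure into a clear error that names the offending dataset type.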

@kntkb
Contributor

kntkb commented Sep 22, 2023

@chrisiacovella I remember when fetching the results from the SinglepointDataset that uses b3lyp-d3bj (openff default level of theory), you needed to combine the results from the DFT and the dispersion correction terms. This is not the case for OptimizationDataset and TorsionDriveDataset. I wonder if this behavior is the same for the latest QCArchive server and qcportal.
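
For illustration, the combination I have in mind is just a sum of the two terms. In this sketch the dictionary keys "b3lyp" and "d3bj" are hypothetical placeholders for wherever the two energies are stored, and whether the new server still needs this step is exactly the open question.

```python
from typing import Mapping


def total_energy(energies: Mapping[str, float],
                 dft_key: str = "b3lyp",
                 dispersion_key: str = "d3bj") -> float:
    """Sum the DFT energy and its dispersion correction (same units).

    The default key names are placeholders, not confirmed qcportal names.
    """
    missing = [key for key in (dft_key, dispersion_key) if key not in energies]
    if missing:
        raise KeyError(f"missing energy terms: {missing}")
    return energies[dft_key] + energies[dispersion_key]
```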

@chrisiacovella
Member Author

chrisiacovella commented Sep 22, 2023

@chrisiacovella I remember when fetching the results from the SinglepointDataset that uses b3lyp-d3bj (openff default level of theory), you needed to combine the results from the DFT and the dispersion correction terms. This is not the case for OptimizationDataset and TorsionDriveDataset. I wonder if this behavior is the same for the latest QCArchive server and qcportal.

@kntkb This is something I started looking at when switching from the old to the new version, but I can't seem to find my notes. For some reason I think one of the specifications does include the sum, but don't quote me on that. I'm trying to figure that out right now, actually.

chrisiacovella and others added 6 commits September 22, 2023 14:10
… dataset has the smiles encoded for converting to openff.molecule
… dataset has the smiles encoded for converting to openff.molecule
…d so that it will raise the desired exception rather than failing.
…rse the singlepoint records properly at this point. Other issues need to be resolved with singlepoint energy beyond this (i.e., summation of dispersion corrections).
…rse the singlepoint records properly at this point. Other issues need to be resolved with singlepoint energy beyond this (i.e., summation of dispersion corrections). This PR should sufficiently reproduce the prior behavior, but with new qcportal.
@mikemhenry
Contributor

@chrisiacovella Is this PR good to go? I know it's a year old now BUT is it good to go?

6 participants