Coordgen is slower than RDKit's native 2d coordinate generation #39

Open
d-b-w opened this issue Sep 26, 2019 · 6 comments

@d-b-w
Collaborator

d-b-w commented Sep 26, 2019

Coordgen is slower than RDKit's native 2d coordinate generation. On average it is about 100x slower, and in the worst cases coordgen can take multiple seconds.

The two tools don't do the same things, and I think that coordgen results are much better, so the comparison is not totally fair. I do think that coordgen should target consistently producing coordinates in less than 0.1s, with averages closer to 0.001s. This would allow us to discuss making coordgen the default in RDKit, which would be cool.

I'm going to link to the internal Schrödinger bug tracker, and our internal display for performance testing below, sorry...

At the time I post this, our automated performance testing says that:

| 2d coordinate generator | Average speed (s) | Slowest (s) | Count > 0.1s | Count > 1s |
|---|---|---|---|---|
| RDKit native | 0.00035 | 0.04 | 0 | 0 |
| coordgen | 0.028 | 3.9 | 17 | 235 |
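
The columns in the table above can be derived from a list of per-molecule wall-clock timings. A minimal sketch, assuming the timings have already been collected in seconds; `summarize_timings` and its dictionary keys are hypothetical names for illustration, not part of coordgen, RDKit, or the internal performance tests, and the default thresholds simply mirror the 0.1 s and 1 s columns:

```python
from statistics import mean


def summarize_timings(times_s, thresholds=(0.1, 1.0)):
    """Reduce per-molecule timings (seconds) to average, slowest, and counts over each threshold."""
    if not times_s:
        raise ValueError("need at least one timing")
    summary = {
        "average_s": mean(times_s),
        "slowest_s": max(times_s),
    }
    # One counter per threshold, e.g. "count_over_0.1s" and "count_over_1.0s".
    for limit in thresholds:
        summary[f"count_over_{limit}s"] = sum(t > limit for t in times_s)
    return summary
```

Counting against fixed thresholds, rather than only tracking the average, is what exposes the long tail: a handful of multi-second molecules barely moves the mean over a large set.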
@ricrogz
Collaborator

ricrogz commented Sep 27, 2019

@d-b-w, it might be a good idea to add some of the molecules (especially the slow ones) from these benchmarks as tests in this repository.

@ptosco

ptosco commented Apr 9, 2021

Sorry for reviving this 2-year-old ticket. I have just stumbled on the same problem with an internal dataset using the latest RDKit 2021.03.1 release.
To reproduce the problem on public data, I fetched 2000 indoles with 50 to 60 heavy atoms from ChEMBL (CSV file attached):
chembl27_2000_indoles_50-60_ha.csv.gz

Native RDKit depiction of these 2000 molecules takes ~3 s:

%%time
rdDepictor.SetPreferCoordGen(False)
for m in mols:
    rdDepictor.Compute2DCoords(m)
CPU times: user 3.02 s, sys: 23 ms, total: 3.05 s
Wall time: 3.04 s

CoordGen takes ~360x longer:

%%time
rdDepictor.SetPreferCoordGen(True)
for m in mols:
    rdDepictor.Compute2DCoords(m)
CPU times: user 18min 10s, sys: 868 ms, total: 18min 11s
Wall time: 18min 10s

At the moment, this means that CoordGen cannot be used to depict large-ish molecules in a table.
Do you have plans to address this in the near future? Thanks a lot in advance.
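
The whole-batch timings above do not show which molecules dominate the run time. A minimal sketch of per-molecule timing, assuming `mols` is a list of RDKit molecules as in the loops above; `time_each`, `slowest`, and `time_depiction` are hypothetical helper names, and only `rdDepictor.SetPreferCoordGen` and `rdDepictor.Compute2DCoords` are taken from the snippets in this thread:

```python
import time


def time_each(items, fn):
    """Return a list of (seconds, item) pairs, one per call to fn(item)."""
    results = []
    for item in items:
        start = time.perf_counter()
        fn(item)
        results.append((time.perf_counter() - start, item))
    return results


def slowest(results, n=10):
    """Return the n slowest (seconds, item) pairs, slowest first."""
    return sorted(results, key=lambda pair: pair[0], reverse=True)[:n]


def time_depiction(mols, prefer_coordgen):
    """Time Compute2DCoords per molecule; RDKit is imported lazily here."""
    from rdkit.Chem import rdDepictor

    rdDepictor.SetPreferCoordGen(prefer_coordgen)
    return time_each(mols, rdDepictor.Compute2DCoords)
```

Calling `slowest(time_depiction(mols, True))` would then list the molecules worth pulling out as individual test cases.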

@d-b-w
Collaborator Author

d-b-w commented Apr 9, 2021

Ugh, we just accidentally blew up coordgen time by at least 10x, which should be addressed in #90.

Sorry about that. When #90 is merged, I'll immediately issue a patch release of coordgen and post a PR to RDKit.

We're definitely hoping to do further work on this before the fall RDKit release. The bug in #90 actually provides some clues to next steps.

@ptosco

ptosco commented Apr 9, 2021

Thank you for the super-fast reply, Dan! Looking forward to the PR.

@ptosco

ptosco commented Apr 10, 2021

Thanks Dan! It looks much better now :-)

%%time
rdDepictor.SetPreferCoordGen(True)
for m in chembl_mols_2000:
    rdDepictor.Compute2DCoords(m)
CPU times: user 2min 5s, sys: 53 ms, total: 2min 5s
Wall time: 2min 5s

@d-b-w
Collaborator Author

d-b-w commented Apr 10, 2021

Great! This issue should remain open; I feel like the current rate is still too slow, but it's acceptable for many use cases.
