murcko scaffold split #18

Open

karinazad wants to merge 4 commits into Genentech:main from karinazad:smdd-splits-scaffolds-murcko

Collaborator

karinazad commented May 15, 2024 •

Implements Murcko scaffold split based on RDKit

zadorozk added 2 commits

May 15, 2024 09:00


          murcko scaffold split

99f8ca7


          murcko scaffold split

d342be4

Collaborator

0x00b1 commented May 15, 2024 •

@karinazad How do you feel about beignet.subsets?

I believe it’s clearer from a discovery perspective and it matches the underlying data structure.

0x00b1 reviewed

src/beignet/splits/_murcko_scaffold_split.py Outdated

0x00b1 reviewed

src/beignet/splits/_murcko_scaffold_split.py Outdated

+                  shuffle: bool = True,
+                  include_chirality: bool = False,
+              ) -> tuple[Subset, Subset]:
+                  """Split a dataset based on Murcko scaffold splitting based

Collaborator

0x00b1 May 15, 2024

Prepend r

Collaborator Author

karinazad May 17, 2024

Can you elaborate? Is that for the input SMILES strings? Escape characters shouldn't be present in valid SMILES anyway, and if a string was already built with escape characters and no r prefix, prepending r afterwards won't change its contents, will it? Or maybe I'm just misunderstanding.

Collaborator

0x00b1 May 20, 2024

No, prepend r to the docstring.
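For illustration, a raw docstring might look like the sketch below. The reviewer does not state the reason for the request; a plausible one (an assumption here) is that docstrings for chemistry code tend to contain backslashes, either in cis/trans SMILES or in inline math, and the `r` prefix keeps them literal. The function name `murcko_scaffold_split_stub` and the docstring text are hypothetical.

```python
def murcko_scaffold_split_stub():
    r"""Hypothetical docstring used only to illustrate the raw prefix.

    A cis/trans SMILES such as F/C=C\F contains a backslash, and so does
    inline math such as :math:`\frac{n_{test}}{n}`. Without the ``r``
    prefix, ``\f`` would be read as a form-feed escape.
    """


# The backslashes survive unchanged because the literal is raw.
doc = murcko_scaffold_split_stub.__doc__
```

This only affects how the docstring literal is parsed; it has nothing to do with the SMILES strings passed in at run time.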

0x00b1 reviewed

src/beignet/splits/_murcko_scaffold_split.py Outdated

Collaborator

0x00b1 commented May 15, 2024

@karinazad You’ll also need to add this to the docs.

0x00b1 reviewed

src/beignet/splits/_murcko_scaffold_split.py Outdated

+                  for ind, s in enumerate(smiles):
+                      mol = Chem.MolFromSmiles(s)
+                      if mol is not None:
+                          scaffold = MurckoScaffoldSmiles(mol=mol, includeChirality=include_chirality)

Collaborator

0x00b1 May 15, 2024

Weird function? It looks like a wrapper around:

Chem.MolToSmiles(GetScaffoldForMol(mol), includeChirality)

No clue about this:

https://github.com/rdkit/rdkit/blob/3457c1eb60846ea821e4a319f3505933027d3cf8/rdkit/Chem/Scaffolds/MurckoScaffold.py#L77-L85

It looks autogenerated?

Collaborator Author

karinazad May 17, 2024

yeah GetScaffoldForMol looks strange

Collaborator Author

karinazad commented May 17, 2024

> @karinazad How do you feel about beignet.subsets?
>
> I believe it’s clearer from a discovery perspective and it matches the underlying data structure.

should we also change the function name? murcko_scaffold_subset or something like that?

karinazad added 2 commits

May 17, 2024 16:00


          doc, subsets

a2cec52


          docs

d7505d4

0x00b1 reviewed

tests/beignet/subsets/test__murcko_scaffold_split.py

+              from unittest.mock import MagicMock, patch
+              import pytest
+              from beignet.subsets._murcko_scaffold_split import (

Collaborator

0x00b1 May 20, 2024

from beignet.subsets import (...)

0x00b1 reviewed

tests/beignet/subsets/test__murcko_scaffold_split.py

		@@ -0,0 +1,71 @@
		from importlib.util import find_spec
		from unittest.mock import MagicMock, patch

Collaborator

0x00b1 May 20, 2024

from unittest.mock import MagicMock
import unittest.mock

Use unittest.mock.patch explicitly.
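A minimal sketch of what the explicit form could look like in a test. The helper `shuffled_copy` and the test name are hypothetical and exist only to give `patch` something to act on; they are not taken from the PR.

```python
import random
import unittest.mock
from unittest.mock import MagicMock


def shuffled_copy(items):
    """Return a shuffled copy of ``items`` (hypothetical helper for the demo)."""
    result = list(items)
    random.shuffle(result)
    return result


def test_shuffled_copy_calls_shuffle():
    # Referring to patch by its full module path makes its origin obvious.
    with unittest.mock.patch("random.shuffle") as mock_shuffle:
        result = shuffled_copy([1, 2, 3])

    assert isinstance(mock_shuffle, MagicMock)
    mock_shuffle.assert_called_once()
    # The patched shuffle does nothing, so the order is unchanged.
    assert result == [1, 2, 3]


test_shuffled_copy_calls_shuffle()
```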

0x00b1 reviewed

tests/beignet/subsets/test__murcko_scaffold_split.py

		@@ -0,0 +1,71 @@
		from importlib.util import find_spec

Collaborator

0x00b1 May 20, 2024

Make explicit.
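Read as "import the module and qualify the call", the request could be satisfied roughly as follows. This is a sketch of an optional-dependency guard, assuming `find_spec` is used to detect RDKit; the flag name `RDKIT_AVAILABLE` is hypothetical, and the fallback assignment mirrors the `Chem, MurckoScaffoldSmiles = None, None` line in the diff.

```python
import importlib.util

# True only when the top-level "rdkit" package can be located.
RDKIT_AVAILABLE = importlib.util.find_spec("rdkit") is not None

if RDKIT_AVAILABLE:
    from rdkit import Chem
    from rdkit.Chem.Scaffolds.MurckoScaffold import MurckoScaffoldSmiles
else:
    # Placeholders so the module still imports when RDKit is absent.
    Chem, MurckoScaffoldSmiles = None, None
```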

0x00b1 reviewed

src/beignet/subsets/_murcko_scaffold_split.py

@@ -0,0 +1,122 @@
+              import math
+              import random
+              from collections import defaultdict

Collaborator

0x00b1 May 20, 2024

Make explicit.

0x00b1 reviewed

src/beignet/subsets/_murcko_scaffold_split.py

+                    https://doi.org/10.1021/jm9602928
+                  - "RDKit: Open-source cheminformatics. https://www.rdkit.org"
+                  """
+                  train_idx, test_idx = _murcko_scaffold_split_indices(

Collaborator

0x00b1 May 20, 2024

Inline?

0x00b1 reviewed

src/beignet/subsets/_murcko_scaffold_split.py


		scaffolds = defaultdict(list)

		for ind, s in enumerate(smiles):

Collaborator

0x00b1 May 20, 2024

Avoid abbreviations, e.g., use index and sequence.
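With the suggested names, the grouping loop might read like the sketch below. To keep it runnable without RDKit, the scaffold computation is passed in as a callable; `group_indices_by_scaffold`, `scaffold_of`, and the stand-in `first_character` are hypothetical names, and skipping sequences whose scaffold is `None` is an assumption modeled on the `mol is not None` check in the diff.

```python
import collections


def group_indices_by_scaffold(sequences, scaffold_of):
    """Map each scaffold to the indices of the sequences that share it."""
    groups = collections.defaultdict(list)

    for index, sequence in enumerate(sequences):
        scaffold = scaffold_of(sequence)

        # Skip inputs for which no scaffold could be computed.
        if scaffold is not None:
            groups[scaffold].append(index)

    return dict(groups)


def first_character(sequence):
    """Stand-in for an RDKit-based scaffold function (illustration only)."""
    return sequence[0] if sequence else None


groups = group_indices_by_scaffold(["CCO", "c1ccccc1", "CCN", ""], first_character)
```

In the PR itself the callable would presumably wrap `MurckoScaffoldSmiles`; the stand-in here groups by first character purely so the example has visible output.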

0x00b1 reviewed

src/beignet/subsets/_murcko_scaffold_split.py

		Chem, MurckoScaffoldSmiles = None, None


		def murcko_scaffold_split(

Collaborator

0x00b1 May 20, 2024

Add generator: Generator arg.

0x00b1 reviewed

src/beignet/subsets/_murcko_scaffold_split.py

+                  seed: int = 0xDEADBEEF,
+                  shuffle: bool = True,
+                  include_chirality: bool = False,
+              ) -> tuple[Subset, Subset]:

Collaborator

0x00b1 May 20, 2024 •

I think a better design is returning List[Subset] based on a lengths parameter, e.g., see torch.utils.data.random_split.
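One way the `lengths`-based design could work is sketched below, under stated assumptions: `lengths` holds integer target sizes that sum to the number of grouped indices (fractional lengths are not handled), and because a scaffold group is never split across subsets, the targets are treated as best-effort rather than exact. The function name `split_groups_by_lengths` and the greedy first-fit strategy are illustrative choices, not the PR's implementation.

```python
def split_groups_by_lengths(groups, lengths):
    """Assign whole scaffold groups to subsets with the given target sizes."""
    total = sum(len(indices) for indices in groups.values())

    if sum(lengths) != total:
        raise ValueError("lengths must sum to the number of grouped indices")

    # Largest groups first; ties are broken by scaffold for determinism.
    ordered = sorted(groups.items(), key=lambda item: (-len(item[1]), item[0]))

    subsets = [[] for _ in lengths]

    for _, indices in ordered:
        for position, target in enumerate(lengths):
            fits = len(subsets[position]) + len(indices) <= target
            is_last = position == len(lengths) - 1

            # Place the group in the first subset with room, else the last one.
            if fits or is_last:
                subsets[position].extend(indices)
                break

    return subsets


subsets = split_groups_by_lengths({"A": [0, 1, 2], "B": [3, 4], "C": [5]}, [4, 2])
```

Each returned index list could then be wrapped as `torch.utils.data.Subset(dataset, indices)` to produce the `List[Subset]` the comment describes.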

Collaborator

0x00b1 commented May 20, 2024

Add RDKit as an extra dependency to pyproject.toml.

0x00b1 reviewed

src/beignet/subsets/_murcko_scaffold_split.py

+                  if shuffle:
+                      if seed is not None:
+                          random.Random(seed).shuffle(scaffolds)

Collaborator

0x00b1 May 20, 2024

Use generator.
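A sketch of the idea: the caller passes in a random source instead of the function building one from a seed. The reviewer most likely means a `torch.Generator`; to stay within the standard library, this example uses a `random.Random` instance as a stand-in, which is an assumption. The name `shuffle_scaffold_groups` is hypothetical.

```python
import random


def shuffle_scaffold_groups(groups, generator):
    """Return the groups in an order drawn from the supplied generator."""
    shuffled = list(groups)

    # All randomness comes from the caller's generator, not a fresh seed.
    generator.shuffle(shuffled)

    return shuffled


groups = [[0, 1], [2], [3, 4, 5]]

# Two generators with the same seed give the same order.
first = shuffle_scaffold_groups(groups, random.Random(0))
second = shuffle_scaffold_groups(groups, random.Random(0))
```

With PyTorch, the equivalent would be to draw a permutation with `torch.randperm(len(groups), generator=generator)` and index the groups by it.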
