Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEA] TSNE and UMAP: Allow input to be precomputed distance matrix #4799

Open
RichieHakim opened this issue Jul 4, 2022 · 7 comments
Open
Labels
feature request New feature or request inactive-30d

Comments

@RichieHakim
Copy link

RichieHakim commented Jul 4, 2022

Is your feature request related to a problem? Please describe.
Yes. Currently, TSNE and UMAP do not actually allow for a precomputed distance matrix to be used. This is despite there being arguments that suggest this functionality exists.

Describe the solution you'd like
Implementation of methods to allow for a custom distance matrix to be used as the sole data input to TSNE and/or UMAP. This functionality currently exists in sklearn and would allow for users already using this feature in sklearn's TSNE to directly port their workflow to rapids.
https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html
metricstr or callable, default=’euclidean’: ...If metric is “precomputed”, X is assumed to be a distance matrix. ...

Describe alternatives you've considered
What currently exists is the knn_graph argument for the .fit method. The documentation suggests that allowing for a precomputed distance matrix to be used by providing "a sparse array containing the k-nearest neighbors" should be possible, but this method is not functional.

Additional context
This is a highly desirable feature.
Similar requests here, here, here

Thanks!

@RichieHakim RichieHakim added ? - Needs Triage Need team to review and classify feature request New feature or request labels Jul 4, 2022
@divyegala divyegala removed the ? - Needs Triage Need team to review and classify label Jul 27, 2022
@divyegala
Copy link
Member

divyegala commented Jul 27, 2022

@RichieHakim did you try using the knn_graph argument and using the fit_transform API with the data you wish to transform? Or is the request that a workaround be avoided, and the API work exactly as sklearn's?

@RichieHakim
Copy link
Author

@divyegala
That method doesn't work. Same as here: #4460 (comment) I believe.

@viclafargue
Copy link
Contributor

Could you be more specific about what doesn't work for you? Is it with a specific metric? Do you have a crash or simply bad looking results?

Here is an example for UMAP :

from cuml.datasets import make_blobs
from cuml.neighbors import NearestNeighbors
from cuml.manifold.umap import UMAP

n_samples=10000
n_features=20
n_neighbors=5

X, _ = make_blobs(n_samples=n_samples, n_features=n_features)
nn = NearestNeighbors(n_neighbors=n_neighbors)
nn.fit(X)
knn_graph = nn.kneighbors_graph(X, mode="distance")

model = UMAP(random_state=42, init='random', n_neighbors=n_neighbors)
y = model.fit_transform(X, knn_graph=knn_graph.tocsr(), convert_dtype=True)

@viclafargue
Copy link
Contributor

viclafargue commented Aug 10, 2022

After reading some of your posts again, I think that I could get your point. There is indeed some issue in the way the Python API for UMAP is currently designed. But, this should however not prevent users from using the precomputed KNN graph feature.

Indeed, when using the knn_graph parameter, the Python API currently requires an unnecessary and unused X input. This problem should be fixed from the Python side as the KNN graph is still used as expected.

>>> import cupy as cp
>>> y = model.fit_transform(X, knn_graph=knn_graph, convert_dtype=True)
>>> z = model.fit_transform(cp.random.rand(n_samples, n_features), knn_graph=knn_graph, convert_dtype=True)
>>> cp.allclose(y, z)
array(True)

Created PR #4865

@cjnolet
Copy link
Member

cjnolet commented Aug 25, 2022

@RichieHakim

Unfortunately, the choice to not support the precomputed metric option was intentional here because a full pairwise distance matrix requires n_samples * n_samples entries and does not scale well in memory constrained environments. Instead, we provide the option to accept a knn graph, which can also be computed using the RAPIDS NearestNeighbors estimator, because that takes the required space down to the order n_sample * n_neighbors. Assuming we can get your code returning the correct results, does the knn_graph argument satisfy your needs? We could make a feature request to update these algorithms so they accept a precomputed pairwise distance matrix but it would be helpful to know more about your use-case here.

As far as the correctness of the existing knn_graph option- it will be helpful if you provide a small code snippet that demonstrates how you used the knn_graph argument so that we can determine if this is a legitimate bug or if an update to the documentation might be needed. The argument should work, but it is possible it might return unexpected results if not used as intended.

That method doesn't work. Same as here: #4460 (comment) I believe.

HDBSCAN does not accept a precomputed pairwise distance matrix nor does it accept a precomputed knn graph. Unfortunately, accepting the precomputed knn graph in HDBSCAN is a lot more involved than UMAP and TSNE.

@cjnolet
Copy link
Member

cjnolet commented Aug 25, 2022

@RichieHakim Looking through some of your other issues, I think I understand now what you are looking for- you already have pairwise distance matrix (and we might not be able to assume here that you have access to the original data), and you'd like to be able to invoke UMAP and TSNE w/ it.

Can you give us an idea of the size of this pairwise distance matrix? While we work to prioritize this feature, would it be helpful if you were able to convert your pairwise distance matrix into a valid knn graph by using something like cupy to do a sort/argsort and then use the knn_graph argument from there?

@github-actions
Copy link

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature request New feature or request inactive-30d
Projects
None yet
Development

No branches or pull requests

4 participants