-
Notifications
You must be signed in to change notification settings - Fork 7
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use a non-SciPy nearest-neighbor function #12
Comments
I now think that instead of replacing the SciPy k-d tree with a different pure Python k-d-tree implementation which seems to be unreasonably slow (see pull request #14), I think that a better solution is to find a simple, fast, self-contained C implementation of a k-d tree, create a new repository that provides a Python wrapper around the C code, and use that Python wrapper instead. |
That seems reasonable to me. The pure python implementation is indeed unreasonably slow. |
Note that SciPy does now (mostly) work with more recent Pypy2 and Pypy3.5 beta versions. I do see goftests test failures under Pypy, but they are unrelated to cKDTree. (Some Scipy distributions are broken under Pypy due to this PyTuple_SetItem issue: http://doc.pypy.org/en/latest/cpython_differences.html#c-api-differences). |
Well, we'll need SciPy to work on PyPy in order to support testing goftests itself, so that is definitely good news Dan. I still think it would be nice to replace a rather heavyweight dependency with a slimmer one for installing goftests. |
Potential replacement for nearest neighbors: https://github.com/spotify/annoy/ (Thanks @mattbarr for the pointer.) |
Annoy seems to be about 15 times slower than diff --git a/goftests/__init__.py b/goftests/__init__.py
index b299495..115467f 100644
--- a/goftests/__init__.py
+++ b/goftests/__init__.py
@@ -37,10 +37,10 @@ except ImportError:
import random
import sys
+import annoy
import numpy
import numpy.random
from numpy import pi
-from scipy.spatial import cKDTree
from .utils import chi2sf
@@ -197,11 +197,23 @@ def volume_of_sphere(dim, radius):
return radius ** dim * pi ** (0.5 * dim) / gamma(0.5 * dim + 1)
-def get_nearest_neighbor_distances(samples):
+def get_nearest_neighbor_distances(dim, samples):
+
+ def nn_dist(i):
+ return t.get_nns_by_item(i, 2, include_distances=True)[1][1]
+
if not hasattr(samples[0], '__iter__'):
samples = numpy.array([samples]).T
- distances, indices = cKDTree(samples).query(samples, k=2)
- return distances[:, 1]
+
+ f = dim # dimension of each vector
+ t = annoy.AnnoyIndex(f, metric='euclidean')
+ for i, sample in enumerate(samples):
+ t.add_item(i, sample)
+
+ num_trees = 10 # higher is more accurate
+ t.build(num_trees)
+
+ return numpy.array(list(map(nn_dist, range(len(samples)))))
def vector_density_goodness_of_fit(
@@ -228,7 +240,7 @@ def vector_density_goodness_of_fit(
dim = get_dim(samples[0])
assert dim
assert len(samples) > 1000 * dim, 'WARNING imprecision; use more samples'
- radii = get_nearest_neighbor_distances(samples)
+ radii = get_nearest_neighbor_distances(dim, samples)
density = len(samples) * numpy.array(probs)
volume = volume_of_sphere(dim, radii)
exp_samples = density * volume I then used the following "setup" script for a # filename: speed.py
import numpy
import scipy.stats
import goftests
NUM_BASE_SAMPLES = 250
NUM_SAMPLES_SCALE = 1000
mu = numpy.ones(3)
sigma = numpy.eye(3)
dist = scipy.stats.multivariate_normal(mu, sigma)
dim = goftests.get_dim(dist.rvs(size=2)[0])
sample_count = NUM_BASE_SAMPLES + NUM_SAMPLES_SCALE * dim
xs = dist.rvs(size=sample_count)
ps = dist.pdf(xs)
# intending to call: auto_density_goodness_of_fit(xs, ps) and timed the call to $ python -m perf timeit -s "from goftests import auto_density_goodness_of_fit; from speed import xs, ps;" "auto_density_goodness_of_fit(xs, ps)"
.....................
Mean +- std dev: 8.24 ms +- 0.31 ms and with the diff: $ python -m perf timeit -s "from goftests import auto_density_goodness_of_fit; from speed import xs, ps;" "auto_density_goodness_of_fit(xs, ps)"
.....................
Mean +- std dev: 114 ms +- 1 ms Bummer! Although there's a good chance I'm not using Annoy correctly, since this is my first time using it. |
Assuming the changes suggested in pull request #10 are acceptable, we can remove the dependency on SciPy entirely just by reimplementing the
get_nearest_neighbor_distances
function. This will facilitate goftests being runnable on PyPy, which does not yet have SciPy support.Here is a short, pure Python nearest-neighbor function backed by a k-d tree: http://code.activestate.com/recipes/577497-kd-tree-for-nearest-neighbor-search-in-a-k-dimensi/. It is published under an MIT license. Since we don't need any other behavior from the k-d tree (we just need to find the distance of each point to its nearest neighbor), I think incorporating that code is the best approach.
The text was updated successfully, but these errors were encountered: