Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

change the type of u to prevent illegal memory access #722

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

jmorlock
Copy link

I was applying implicit's AlternatingLeastSquares with CUDA on a matrix of shape (4091537, 5256984). The number of factors was set to 1011.

In doing so, AlternatingLeastSquares was failing with the following traceback:

Traceback (most recent call last):
  File "/path/to/my/project/demo.py", line 18, in <module>
    model.fit(customer_product_matrix)
  File "/path/to/my/venv/lib/python3.9/site-packages/implicit/gpu/als.py", line 162, in fit
    self.solver.least_squares(Ciu, Y, _XtX, X, self.cg_steps)
  File "_cuda.pyx", line 258, in implicit.gpu._cuda.LeastSquaresSolver.least_squares
RuntimeError: Cuda Error: an illegal memory access was encountered (/project/implicit/gpu/als.cu:196)
terminate called after throwing an instance of 'std::runtime_error'
  what():  Cuda Error: an illegal memory access was encountered (/project/implicit/gpu/matrix.cu:246)
Aborted (core dumped)

The last line in the stack trace is a follow-up error referring to the destructor of CSRMatrix.

I tracked the original error down to the CUDA kernel. To be precise, to the lines 38 and 39 in als.cu (referring to implicit version 0.7.2):

  for (int u = blockIdx.x; u < user_count; u += gridDim.x) {
    T *x = &X[u * factors];

Both u and factors are of type int and u may go up to user_count-1. Considering the aforementioned matrix dimensions, there are values for u where this product exceeds the maximum value for integers (2147483647) in C++ leading to a negative array index.

By using size_t as type for u, this can be prevented.

Side note: it is remarkable that the second call to least_squares (line 162 in als.py) is failing even though the first call (line 159 in als.py) already produces negative array indexes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant