
Error - illegal memory access - ComputeVectorNorm in PBA #7

shir-barzel-healthy opened this issue Jan 18, 2022 · 12 comments

@shir-barzel-healthy

Describe the bug
Crash due to: ComputeVectorNorm: an illegal memory access was encountered(700)
stack trace:
*** SIGABRT (@0x7d000004de5) received by PID 19941 (TID 0x7f13bf56a700) from PID 19941; stack trace: ***
@ 0x7f13ca8a3980 (unknown)
@ 0x7f13c7afcfb7 gsignal
@ 0x7f13c7afe921 abort
@ 0x7f13c84f1957 (unknown)
@ 0x7f13c84f7ae6 (unknown)
@ 0x7f13c84f7b21 std::terminate()
@ 0x7f13c84f7d54 __cxa_throw
@ 0x55fe5cea7bd1 pba::ProgramCU::CheckErrorCUDA()
@ 0x55fe5ceb2fc5 pba::ProgramCU::ComputeVectorNorm()
@ 0x55fe5cea28d0 pba::SparseBundleCU::SolveNormalEquationPCGB()
@ 0x55fe5cea6242 pba::SparseBundleCU::NonlinearOptimizeLM()
@ 0x55fe5cea6f3c pba::SparseBundleCU::BundleAdjustment()
@ 0x55fe5cea6fa6 pba::SparseBundleCU::RunBundleAdjustment()
@ 0x55fe5cb68256 colmap::ParallelBundleAdjuster::Solve()
@ 0x55fe5cbe8040 colmap::IncrementalMapper::AdjustParallelGlobalBundle()
@ 0x55fe5cac19cc colmap::(anonymous namespace)::AdjustGlobalBundle()
@ 0x55fe5cac1c1f colmap::(anonymous namespace)::IterativeGlobalRefinement()
@ 0x55fe5cac25d7 colmap::IncrementalMapperController::Reconstruct()
@ 0x55fe5cac47db colmap::IncrementalMapperController::Run()
@ 0x55fe5cc5dbbc colmap::Thread::RunFunc()
@ 0x7f13c85226df (unknown)
@ 0x7f13ca8986db start_thread
@ 0x7f13c7bdf71f clone

To Reproduce
Steps to reproduce the behavior:

  1. Change kMaxNumResidualsFloat to 1000 * 1000 in bundle_adjustment.cc line 581 (increases runtime).
  2. Run the mapper on about 80 images.
  3. The error reproduces in roughly 1 of 5 executions.
Expected behavior
No crash

Desktop (please complete the following information):

OS: Linux
Version 3.6-dev.3

@cbalint13
Owner

@shir-barzel-healthy

  • I believe at this point the vector is either zero-sized (empty) and/or vector.data() is a NULL pointer (not even initialized).
  • Can you test by inserting assert(vector.size() > 0); and/or assert(vector.data() != NULL); before the call, to see what is there?

Probably during refinement that list becomes empty (points removed as bad), but if it is outright uninitialized, that is a worse story.

Also, colmap should be more careful not to throw such bad points/pairs into the PBA; on the other hand, PBA has its limits.

@shir-barzel-healthy
Author

Thanks @cbalint13!
I've tried adding these two lines before the call to vector_norm_kernel:

  assert(vector.GetDataSize() > 0);
  assert(vector.data() != NULL);

I didn't manage to reproduce it there, but now I'm getting an error in another function (ComputeJX):

ComputeVXY:       an illegal memory access was encountered(700)
    @     0x7f4a44a37d54 __cxa_throw
    @     0x5580a21d7c55 pba::ProgramCU::CheckErrorCUDA()
    @     0x5580a21cae30 pba::SparseBundleCU::ComputeJX()
    @     0x5580a21d28d2 pba::SparseBundleCU::SolveNormalEquationPCGB()
    @     0x5580a21d6272 pba::SparseBundleCU::NonlinearOptimizeLM()
    @     0x5580a21d6f6c pba::SparseBundleCU::BundleAdjustment()
    @     0x5580a21d6fd6 pba::SparseBundleCU::RunBundleAdjustment()

So I've added these asserts in both ComputeJX and ComputeVXY:

void ProgramCU::ComputeJX(int point_offset, CuTexImage& x, CuTexImage& jc,
                          CuTexImage& jp, CuTexImage& jmap, CuTexImage& result,
                          int mode) {
  assert(x.GetDataSize() > 0);
  assert(x.data() != NULL);
  assert(jc.GetDataSize() > 0);
  assert(jc.data() != NULL);
  assert(jp.GetDataSize() > 0);
  assert(jp.data() != NULL);
  assert(jmap.GetDataSize() > 0);
  assert(jmap.data() != NULL);
  // ... rest of ComputeJX unchanged

void ProgramCU::ComputeVXY(CuTexImage& texX, CuTexImage& texY,
                           CuTexImage& result, unsigned int part,
                           unsigned int skip) {
  unsigned int len = part ? part : texX.GetLength();
  unsigned int bsize = 128;
  unsigned int nblock = (len + bsize - 1) / bsize;
  assert(texX.GetDataSize() > 0);
  assert(texX.data() != NULL);
  assert(texY.GetDataSize() > 0);
  assert(texY.data() != NULL);
  // ... rest of ComputeVXY unchanged

But none of the asserts fire before the crash, and the crash keeps happening.
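
Maybe the asserts stay silent because kernel launches are asynchronous: the host-side checks pass before the kernel actually executes, and the error one wrapper reports can come from an earlier launch. A minimal sketch of a synchronizing check that would pin the failure to the launch that caused it (placement is my assumption: right after each kernel call):

  // block until the kernel finishes, so an illegal access is reported
  // here rather than at whichever later call polls the error state first
  cudaError_t e = cudaDeviceSynchronize();
  if (e != cudaSuccess) {
    fprintf(stderr, "kernel failed at this launch: %s\n", cudaGetErrorString(e));
    abort();
  }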

@cbalint13
Owner

@ 0x7f4a44a37d54 __cxa_throw
@ 0x5580a21d7c55 pba::ProgramCU::CheckErrorCUDA()

  • Also, this CheckErrorCUDA() doesn't print anything useful about the error itself.

I would expect it to print the human-readable message via cudaGetErrorString(e); maybe you can try a variant like below:

    cudaError_t e = cudaGetLastError();
    if (e) {
      std::cout << "Error: " << cudaGetErrorString(e) << "-->";
    }
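
For reference, a minimal sketch of what the whole check could look like with that print added. The signature is my assumption here (that CheckErrorCUDA() receives the call-site name), but the throw of a char const* matches what std::terminate reports in the traces above:

    void ProgramCU::CheckErrorCUDA(const char* location) {
      cudaError_t e = cudaGetLastError();
      if (e != cudaSuccess) {
        // print the human-readable message before throwing, so the log
        // shows both where the check fired and why
        std::cout << "Error: " << cudaGetErrorString(e) << "(" << (int)e
                  << ") in " << location << std::endl;
        throw location;  // existing behavior: an instance of 'char const*'
      }
    }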

@shir-barzel-healthy
Author

I've added this and I got:

Error: an illegal memory access was encountered(700)

@shir-barzel-healthy
Author

I've restarted the machine and now I mostly get:

Colmap invocation returned with code -6 and the following error: MultiplyBlockConditioner<Point>:    an illegal memory access was encountered(700)
terminate called after throwing an instance of 'char const*'
*** Aborted at 1642520827 (unix time) try "date -d @1642520827" if you are using GNU date ***
PC: @     0x7ff3fbd08fb7 gsignal
*** SIGABRT (@0xccc) received by PID 3276 (TID 0x7ff3f3776700) from PID 3276; stack trace: ***
    @     0x7ff3feaaf980 (unknown)
    @     0x7ff3fbd08fb7 gsignal
    @     0x7ff3fbd0a921 abort
    @     0x7ff3fc6fd957 (unknown)
    @     0x7ff3fc703ae6 (unknown)
    @     0x7ff3fc703b21 std::terminate()
    @     0x7ff3fc703d54 __cxa_throw
    @     0x55594f5edc55 pba::ProgramCU::CheckErrorCUDA()
    @     0x55594f5f302c pba::ProgramCU::MultiplyBlockConditioner()
    @     0x55594f5e10c9 pba::SparseBundleCU::ApplyBlockPC()
    @     0x55594f5e88b6 pba::SparseBundleCU::SolveNormalEquationPCGB()
    @     0x55594f5ec272 pba::SparseBundleCU::NonlinearOptimizeLM()
    @     0x55594f5ecf6c pba::SparseBundleCU::BundleAdjustment()
    @     0x55594f5ecfd6 pba::SparseBundleCU::RunBundleAdjustment()
    @     0x55594f2ae286 colmap::ParallelBundleAdjuster::Solve()
    @     0x55594f32e070 colmap::IncrementalMapper::AdjustParallelGlobalBundle()
    @     0x55594f2079fc colmap::(anonymous namespace)::AdjustGlobalBundle()
    @     0x55594f207c4f colmap::(anonymous namespace)::IterativeGlobalRefinement()
    @     0x55594f208607 colmap::IncrementalMapperController::Reconstruct()
    @     0x55594f20a80b colmap::IncrementalMapperController::Run()
    @     0x55594f3a3bec colmap::Thread::RunFunc()
    @     0x7ff3fc72e6df (unknown)
    @     0x7ff3feaa46db start_thread
    @     0x7ff3fbdeb71f clone

@shir-barzel-healthy
Author

I think I solved it by removing the __shared__ from these lines (2110-2111):

__shared__ float mat[WIDTH * VSZ];

But I think it's slower now. Any idea how to solve it in a different way?
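
For context, this is the shape of the change (only the declaration itself is from the file; the comments are my reading of the effect):

  // before: one workspace per thread block, loaded cooperatively by its threads
  __shared__ float mat[WIDTH * VSZ];

  // after: one private copy per thread, which removes any cross-thread
  // interference but typically spills to local memory (hence the slowdown)
  float mat[WIDTH * VSZ];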

@cbalint13
Copy link
Owner

cbalint13 commented Jan 18, 2022

I think I solved it by removing the __shared__ from these lines (2110-2111):

  • It must be __shared__ memory to leverage parallel access to it.
  • It might be that in some corner cases two threads (warps) access the exact same address in the shared workspace region.
  • You can try this at that line (syncing after every stripe of the load instead of only once at the end):
--- cuda.cu	2022-01-18 20:23:32.877792183 +0200
+++ new.cu	2022-01-18 20:23:57.109554719 +0200
@@ -1,2 +1,4 @@
-    for(int i= 0; i < VSZ * WIDTH; i += WIDTH)    mat[i + threadIdx.x] = blocks[i + block_read_pos + threadIdx.x]; 
-    __syncthreads();
+    for(int i= 0; i < VSZ * WIDTH; i += WIDTH) {
+      mat[i + threadIdx.x] = blocks[i + block_read_pos + threadIdx.x]; 
+      __syncthreads();
+    }

__shared__ float mat[WIDTH * VSZ];

But I think it's slower now. Any idea how to solve it in a different way?

  • It is possible to get a more reliable kernel by replacing it, drawing on various ways/sources.
  • It would be interesting to capture a dump of all the variables passed into multiply_block_conditioner_kernel(); a sketch of one way to do that follows below.

I could handle this if you could help me with:

  • a full core dump, so I can see the conditions for multiply_block_conditioner_kernel() in that frame;
  • values / dumps of the multiply_block_conditioner_kernel() inputs that make it fail.
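
As an illustration, a hedged sketch of a host-side helper for capturing such dumps right before the launch (DumpDeviceBuffer is a hypothetical name, not something that exists in pba):

  #include <cstdio>
  #include <vector>
  #include <cuda_runtime.h>

  // hypothetical helper: copy a device buffer back to the host and write
  // it to disk, so the failing kernel call can be replayed in isolation
  static void DumpDeviceBuffer(const char* path, const void* dev_ptr, size_t bytes) {
    std::vector<char> host(bytes);
    cudaMemcpy(host.data(), dev_ptr, bytes, cudaMemcpyDeviceToHost);
    if (FILE* f = fopen(path, "wb")) {
      fwrite(host.data(), 1, bytes, f);
      fclose(f);
    }
  }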

@cbalint13
Owner

@shir-barzel-healthy ,

An easier way to poke around would be to just print WIDTH, BBIT, VSZ along with num and rowsz, plus the pointer addresses of *blocks, *x, *result, and see if there are anomalies in the form of odd values for WIDTH/BBIT that would drive the kernel threads into that memory access issue.

If not, a full dump as described in the earlier comment would help to look at the problem.

@shir-barzel-healthy
Author

shir-barzel-healthy commented Jan 18, 2022

@cbalint13 Thank you very much for your help, much appreciated!

  1. I've tried this change:
--- cuda.cu	2022-01-18 20:23:32.877792183 +0200
+++ new.cu	2022-01-18 20:23:57.109554719 +0200
@@ -1,2 +1,4 @@
-    for(int i= 0; i < VSZ * WIDTH; i += WIDTH)    mat[i + threadIdx.x] = blocks[i + block_read_pos + threadIdx.x]; 
-    __syncthreads();
+    for(int i= 0; i < VSZ * WIDTH; i += WIDTH) {
+      mat[i + threadIdx.x] = blocks[i + block_read_pos + threadIdx.x]; 
+      __syncthreads();
+    }

But still getting the crash.

  2. I've tried adding these prints inside multiply_block_conditioner_kernel:
  printf("WIDTH:%d,BBIT:%d,VSZ:%d,num:%d,rowsz:%d\n", WIDTH, BBIT, VSZ, num, rowsz);
  printf("address pointed to - *blocks:%p,*x:%p,*result:%p\n", (void*)blocks, (void*)x, (void*)result);
  printf("address of - *blocks:%p,*x:%p,*result:%p\n", (void*)&blocks, (void*)&x, (void*)&result);

But it causes the program to run very slowly, and I didn't manage to reproduce the crash with these prints.

I also tried redirecting the output with "1>output.txt", but the error didn't reproduce.

  3. What do you mean by the vars passed into multiply_block_conditioner_kernel()? How would you recommend I capture them?
     Thanks

Update:

Managed to reproduce it with:

  if (threadIdx.x % 100 == 0) {
    printf("WIDTH:%d,BBIT:%d,VSZ:%d,num:%d,rowsz:%d\n", WIDTH, BBIT, VSZ, num, rowsz);
    printf("address pointed to - blocks:%p,x:%p,result:%p\n", (void*)blocks, (void*)x, (void*)result);
    printf("address of - blocks:%p,x:%p,result:%p\n", (void*)&blocks, (void*)&x, (void*)&result);
  }

This is the output file (1000 lines before the crash):
output.txt

I don't see any odd values.

Hi @cbalint13
Any update?

@shir-barzel-healthy
Author

@cbalint13 ?

@cbalint13
Owner

  • I tried to reproduce a single call to multiply_block_conditioner_kernel() with random data, having "WIDTH:128,BBIT:2,VSZ:3,num:47280,rowsz:47360", but it doesn't crash on my SM61 (Pascal).

  • Now, how could I get the same setup as yours (colmap + pba + dataset) to reproduce?

@shir-barzel-healthy
Author

shir-barzel-healthy commented Jan 26, 2022

  • I tried to reproduce a single call to multiply_block_conditioner_kernel() with random data, having "WIDTH:128,BBIT:2,VSZ:3,num:47280,rowsz:47360", but it doesn't crash on my SM61 (Pascal).
  • Now, how could I get the same setup as yours (colmap + pba + dataset) to reproduce?

@cbalint13 Thanks for trying to help!

I'm running it on GCP with two Tesla T4 GPUs.
In colmap I've changed the value of kMaxNumResidualsFloat to 1000 * 1000 at this line:
https://github.com/colmap/colmap/blob/ea40ef9aecf98eec7e14bd6eeedd2a5381c6b46b/src/optim/bundle_adjustment.cc#L583
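
For reference, the change in code form (the declaration style is an assumption on my side; the stock value in colmap is larger):

  // src/optim/bundle_adjustment.cc, at the line linked above
  const size_t kMaxNumResidualsFloat = 1000 * 1000;  // lowered for this repro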
