Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replace conda with pyenv to fix incorrect libstdc++ use in jammy CI. #6966

Merged
merged 6 commits into from
Sep 30, 2024

Conversation

ssheorey
Copy link
Member

@ssheorey ssheorey commented Sep 13, 2024

Type

  • Bug fix (non-breaking change which fixes an issue): Fixes #
  • New feature (non-breaking change which adds functionality). Resolves #
  • Breaking change (fix or feature that would cause existing functionality to not work as expected) Resolves #

Motivation and Context

CUDA jammy CI (python tests) fail since pytest causes the conda libstdc++ library to be loaded instead of the system libstdc++ library. This is older and prevents Open3D cuda pybind lib from loading.

Checklist:

  • I have run python util/check_style.py --apply to apply Open3D code style
    to my code.
  • This PR changes Open3D behavior or adds new functionality.
    • Both C++ (Doxygen) and Python (Sphinx / Google style) documentation is
      updated accordingly.
    • I have added or updated C++ and / or Python unit tests OR included test
      results
      (e.g. screenshots or numbers) here.
  • I will follow up and update the code if CI fails.
  • For fork PRs, I have selected Allow edits from maintainers.

Description

  • Replace python install from conda with pyenv python install. This uses the system libraries.
  • Use RPATH instead of RUNPATH to load libc++abi.so directly in Python. No need to find and load explicitly for filament.
  • Use libc++ v11 to build in Ubuntu 22.04. Warn if newer version is used. libc++ v12 and later use LLVM libunwind, which is incompatible with system libunwind and causes the Python crash.
  • TODO: Solution for Ubuntu 24.04

TODO: Solution for Ubuntu 24.04 (newer libc++ in general).

Copy link

update-docs bot commented Sep 13, 2024

Thanks for submitting this pull request! The maintainers of this repository would appreciate if you could update the CHANGELOG.md based on your changes.

@ssheorey
Copy link
Member Author

ssheorey commented Sep 17, 2024

Seg fault in python test_tensormap.py here when built with gcc 11 (Ubuntu 20.04 jammy) in release config. No seg fault in debug or relwithdebinfo configs.

...
def test_tensormap(device):
...
    # __delitem__ operator. SEGFAULT
    with pytest.raises(RuntimeError) as excinfo:
        del tm.positions
        assert 'cannot be deleted' in str(excinfo.value)
...
   # Set primary key. SEGFAULT
    with pytest.raises(KeyError) as e:
        tm.primary_key = o3c.Tensor.ones((2, 3), dtype, device)
...
    # Get unknown attributes. SEGFAULT
    with pytest.raises(KeyError) as e:
        normals = tm.normals

Fix: switch to gcc 13.

Update nanoflannimpl to fix this warning as error:

    inlined from ‘open3d::core::nns::impl::{anonymous}::_KnnSearchCPU<float, int, open3d::core::nns::NeighborSearchAllocator<float, int>, 1>(open3d::core::nns::NanoFlannIndexHolderBase*, int64_t*, size_t, const float*, size_t, const float*, size_t, int, bool, bool, open3d::core::nns::NeighborSearchAllocator<float, int>&)::<lambda(const float*, const float*, size_t)>’ at /home/ssheorey/Documents/Open3D/Code/Open3D/cpp/open3d/core/nns/NanoFlannImpl.h:126:5,
    inlined from ‘open3d::core::nns::impl::{anonymous}::_RadiusSearchCPU<float, int, open3d::core::nns::NeighborSearchAllocator<float, int>, 1>(open3d::core::nns::NanoFlannIndexHolderBase*, int64_t*, size_t, const float*, size_t, const float*, size_t, const float*, bool, bool, bool, bool, open3d::core::nns::NeighborSearchAllocator<float, int>&)::<lambda(const tbb::detail::d1::blocked_range<long unsigned int>&)>’ at /home/ssheorey/Documents/Open3D/Code/Open3D/cpp/open3d/core/nns/NanoFlannImpl.h:258:41:
/usr/include/c++/13/bits/new_allocator.h:168:33: error: ‘void operator delete(void*, std::size_t)’ called on pointer ‘<unknown>’ with nonzero offset [4, 9223372036854775804] [-Werror=free-nonheap-object]

gcc-11 leads to a seg fault in tensormap in Release mode.
Copy link
Contributor

@benjaminum benjaminum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@@ -147,8 +140,9 @@ void _KnnSearchCPU(NanoFlannIndexHolderBase *holder,
for (size_t valid_i = 0; valid_i < num_valid; ++valid_i) {
TIndex idx = result_indices[valid_i];
if (ignore_query_point &&
points_equal(&queries[i * dimension],
&points[idx * dimension], dimension)) {
std::equal(&queries[i * dimension],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice catch! This should significantly reduce the number of heap allocations

bash Miniconda3-latest-Linux-x86_64.sh -b; \
rm Miniconda3-latest-Linux-x86_64.sh; \
curl https://pyenv.run | bash \
&& pyenv update \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was working on switching to Miniforge in #6717 😅

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

miniforge likely won't fix this issue, since it's a copy of conda and likely uses it's on libstdc++ as well.

@ssheorey
Copy link
Member Author

Seg fault with Python, CUDA 11.8, 12.1 and Ubuntu 22.04. No issue with C++, the open3d-cpu package, or if CUDA_VISIBLE_DEVICES is empty. No issue on Ubuntu 20.04 or Windows.

… No need to find and load explicitly.

Use libc++11 to build in Ubuntu 22.04. Warn if newer version is used.

TODO: Solution for Ubuntu 24.04
Copy link
Contributor

@benjaminum benjaminum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

approved. See question comment

@@ -67,34 +67,55 @@ RUN if [ "${BUILD_SYCL_MODULE}" = "ON" ]; then \
rm -rf /etc/apt/sources.list.d/oneAPI.list; \
fi

# Dependencies: basic
# Dependencies: basic and python-build
# gcc-11 causes a seg fault in tensormap when built in Release mode. Upgrade to
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have we found the root cause for that or is it really a bug only due to gcc11?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for catching that. I think it was actually the libunwind issue. I've been working with gcc-11 on jammy (22.04) without issues. Reverted.

@ssheorey ssheorey merged commit e88c7b1 into main Sep 30, 2024
41 of 45 checks passed
@ssheorey ssheorey deleted the ss/cuda-ci-jammy-fix branch September 30, 2024 06:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants