Fast-RTPS cross vendor tests are failing frequently on Windows #246

Closed
wjwwood opened this issue Dec 13, 2018 · 13 comments
Assignees: nuclearsandwich
Labels: bug (Something isn't working), ready (Work is about to start; Kanban column)

Comments

@wjwwood
Member

wjwwood commented Dec 13, 2018

Bug report

Required Info:

  • Operating System:
    • Windows (only)
  • Installation type:
    • source
  • Version or commit hash:
    • master (Fast-RTPS 1.7.0 and Connext 5.3.1)
  • DDS implementation:
    • Fast-RTPS/Connext
  • Client library (if applicable):
    • rclcpp

Steps to reproduce issue

In terminal A:

> call install\setup.bat
> set RMW_IMPLEMENTATION=rmw_connext_cpp
> install\demo_nodes_cpp\lib\demo_nodes_cpp\talker.exe

In terminal B:

> call install\setup.bat
> set RMW_IMPLEMENTATION=rmw_fastrtps_cpp
> install\demo_nodes_cpp\lib\demo_nodes_cpp\listener.exe

Expected behavior

They communicate and the listener receives data from the talker.

Actual behavior

Nothing is received by the listener.

Additional information

This also occurs when the vendors are swapped between talker and listener, and is resolved if both sides use the same implementation (either Fast-RTPS or Connext).

This is likely the root cause of new failures in our test_communication tests which do cross-vendor testing.

Screenshots: two screenshots were attached (captured 2018-12-12 at 6:52 PM and 6:54 PM); the images are not reproduced here.

@wjwwood added the bug label Dec 13, 2018
@wjwwood changed the title from "Fast-RTPS and Connext to not communicate any longer" to "Fast-RTPS and Connext do not communicate any longer" Dec 13, 2018
@richiware
Contributor

Before integrating Fast-RTPS v1.7.0 and the changes in rmw_fastrtps to support this version, they were tested against your CI (link). As far as I know, your CI checks communication between Fast-RTPS and Connext. Do you know what changes were made in rmw_fastrtps after merging v1.7.0?

@richiware
Contributor

We are trying to help investigate what the problem could be, and have been analyzing your CI jobs. There is something we don't understand. We don't know how your CI jobs work internally, and surely there is a reason, but why did job 1024 fail while job 1025 succeeded the next day? They seem to use the same configuration, and there are no significant changes in the repositories involved. Is there something we are missing, such as some configuration that changes between nightly jobs? Thanks.

@nuclearsandwich
Member

While looking at last night's jobs I saw that 1042 also has communication issues between connext and fastrtps. These failures are logged in ros2/build_farmer#153 and look to have first appeared in 1035.

Why did job 1024 fail but next day job 1025 work successfully? It seems they used the same configuration and there are no significant changes in involved repositories. Is there something we don't get? some configuration that change between nightly jobs?

It's not a full explanation, but our CI defaults to re-running failed tests up to 10 times to see if they'll pass. This is a mitigation against tests that may flake due to network or other variable conditions, but it also muddies the waters when trying to pinpoint the exact start of issues that don't occur every time. That the tests sometimes fail and sometimes don't suggests the issue does not reproduce reliably, although it has certainly become more reliable to reproduce on Windows, and possibly in debug configurations on Linux as well.
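
To illustrate why retrying masks intermittent failures, here is a minimal sketch of the arithmetic. It assumes each attempt fails independently with the same probability, and that "up to 10 times" means 10 attempts in total; neither assumption is confirmed by the CI configuration, and the function name is hypothetical.

```python
def probability_reported_as_failure(per_run_failure_rate: float, max_attempts: int) -> float:
    """Chance that a flaky test is finally reported as failed.

    Assumption: attempts are independent, so the test is reported as failed
    only when every one of the max_attempts attempts fails.
    """
    if not 0.0 <= per_run_failure_rate <= 1.0:
        raise ValueError("per_run_failure_rate must be between 0 and 1")
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    return per_run_failure_rate ** max_attempts


if __name__ == "__main__":
    # Compare a generous retry budget with a tighter one.
    for attempts in (10, 3, 1):
        for rate in (0.2, 0.5, 0.8):
            reported = probability_reported_as_failure(rate, attempts)
            print(f"attempts={attempts:2d} failure_rate={rate:.1f} "
                  f"reported_as_failed={reported:.6f}")
```

Under these assumptions, a test that fails half of the time is reported as failed in roughly 0.1% of jobs with 10 attempts, but in 12.5% of jobs with 3 attempts, so a smaller retry budget makes an intermittent regression much easier to spot.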

@MiguelCompany
Collaborator

As I have stated here, it seems Connext is now too restrictive on guidPrefix values. Could you check with eProsima/Fast-DDS#353?

@nuclearsandwich
Member

nuclearsandwich commented Dec 13, 2018

Thanks for linking that @MiguelCompany. I've triggered a build of our communication tests with the retest-until-pass setting reduced from 10 to 3.

  • Linux Build Status
  • Linux-aarch64 Build Status
  • macOS Build Status
  • Windows Build Status

Edit: Added a run on Linux in the Debug configuration, where we have sometimes seen failures as well: Build Status (see ros2/build_farmer#153)

@nuclearsandwich
Member

There is warning output during cross-vendor communication reported in ros2/demos#293 (the Fast-RTPS <-> Connext case is toward the end of the description). It doesn't appear that communication was inhibited, so those warnings may or may not be related.

@MiguelCompany
Collaborator

@nuclearsandwich @wjwwood On the Linux debug build, I see that Connext is failing to initialize the rcl node on some tests where Fast-RTPS is not involved.

For instance, this test says

[D0108|ENABLE]DDS_DomainParticipant_enableI:Automatic participant index failed to initialize. PLEASE VERIFY CONSISTENT TRANSPORT / DISCOVERY CONFIGURATION.
DDSDomainParticipant_impl::createI:ERROR: Failed to auto-enable entity
DomainParticipantFactory_impl::create_participant():!create failure creating participant
/home/rosbuild/ci_scripts/ws/src/ros2/system_tests/test_communication/test/test_messages_c.cpp:110: Failure
Expected equality of these values:
0
ret
Which is: 1
failed to create participant, at /home/rosbuild/ci_scripts/ws/src/ros2/rmw_connext/rmw_connext_shared_cpp/src/node.cpp:261, at /home/rosbuild/ci_scripts/ws/src/ros2/rcl/rcl/src/rcl/node.c:401

@MiguelCompany
Collaborator

@nuclearsandwich @wjwwood FYI, eProsima/Fast-DDS#353 has been merged on master.

@nuclearsandwich changed the title from "Fast-RTPS and Connext do not communicate any longer" to "Fast-RTPS cross vendor tests are failing frequently on Windows" Dec 14, 2018
@nuclearsandwich
Member

Cross-vendor tests between Fast-RTPS and OpenSplice are also having issues: ros2/system_tests#322

Recent example: https://ci.ros2.org/view/nightly/job/nightly_win_extra_rmw_rel/196/

@mjcarroll
Member

During the lead-up to Crystal, I tested cross-vendor support on my Windows 10 Virtual Machine, and did not find any issues.

In one of the hangouts, @wjwwood put forward the hypothesis that FastRTPS, OpenSplice, and Connext may be choosing different network interfaces in order to do discovery or connectivity, leading the nodes to fail to discover each other. This is only a hypothesis and has not been validated or researched in any way.
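
As a rough illustration of what that hypothesis would mean in practice, the sketch below checks whether two participants that each bound to a different set of interfaces share any IP network at all. The function name and the addresses are hypothetical, chosen only for illustration; this does not reflect how any of the DDS vendors actually select interfaces.

```python
import ipaddress


def shared_networks(interfaces_a, interfaces_b):
    """Return the IP networks that both sets of interface addresses belong to.

    Each argument is an iterable of strings in "address/prefix" form,
    for example "192.168.1.10/24".
    """
    networks_a = {ipaddress.ip_interface(item).network for item in interfaces_a}
    networks_b = {ipaddress.ip_interface(item).network for item in interfaces_b}
    return networks_a & networks_b


if __name__ == "__main__":
    # Hypothetical case: one vendor binds to the physical NIC,
    # the other picks a virtual adapter on a different network.
    talker = ["192.168.1.10/24"]
    listener = ["10.0.2.15/24"]
    common = shared_networks(talker, listener)
    print("shared networks:", sorted(str(net) for net in common) or "none")
```

If the two sets have no network in common, discovery traffic sent by one side would not be expected to reach the other, which would match the observed symptom of nodes never discovering each other.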

@richiware
Contributor

We fixed an issue with sending multicast on a Windows machine that has several network interfaces, some of them disconnected. We think that fix should resolve this issue. Can you confirm?
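
For readers unfamiliar with the problem class, here is a minimal sketch of sending a multicast datagram once per local interface while tolerating interfaces that are down. It is not the Fast-RTPS fix itself, only an illustration with standard sockets. The function name and payload are hypothetical; the group 239.255.0.1 and port 7400 are the RTPS defaults for discovery on domain 0.

```python
import socket

SPDP_MULTICAST_GROUP = "239.255.0.1"  # default RTPS discovery multicast group
SPDP_MULTICAST_PORT = 7400            # default discovery multicast port for domain 0


def probe_multicast(local_addresses, group=SPDP_MULTICAST_GROUP,
                    port=SPDP_MULTICAST_PORT, payload=b"probe"):
    """Try to send one multicast datagram through each local IPv4 address.

    Returns a dict mapping each address to True if the send succeeded and
    False if the interface rejected it (for example because it is
    disconnected). A failure on one interface does not stop the others.
    """
    results = {}
    for address in local_addresses:
        try:
            with socket.socket(socket.AF_INET, socket.SOCK_DGRAM,
                               socket.IPPROTO_UDP) as sock:
                # Select the outgoing interface for multicast by its local address.
                sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_IF,
                                socket.inet_aton(address))
                # Keep the probe on the local network segment.
                sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, 1)
                sock.sendto(payload, (group, port))
            results[address] = True
        except OSError:
            results[address] = False
    return results


if __name__ == "__main__":
    # Hypothetical address list; replace with the addresses of the local machine.
    print(probe_multicast(["127.0.0.1"]))
```

The key design point is that each interface gets its own send attempt and its own error handling, so a disconnected adapter cannot prevent discovery traffic from leaving through a working one.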

@nuclearsandwich self-assigned this Feb 13, 2019
@richiware
Contributor

v1.7.1 incorporates eProsima/Fast-DDS#394. Can you verify your nightly job works as expected? Thanks

@MiguelCompany
Collaborator

I think this can be closed now that ros2/ros2#814 has been merged


6 participants