Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

#671 (UKF cleanup) leads to infinite loops and hanging of the whole state estimation #777

Open
peci1 opened this issue Oct 23, 2022 · 22 comments

Comments

@peci1
Copy link

peci1 commented Oct 23, 2022

Describe the bug
Running ukf_localization_node (or the nodelet) with #671 merged leads to the node stopping working a while after startup. The responsible commit was found with git bisect.

To Reproduce
Steps to reproduce the behavior:

  1. Run on a Noetic computer
  2. Download the bag husky_ukf_bug.zip
  3. Run ukf_localization_node with the config mentioned below.
  4. Observe /imu_odom topic (output of the filter)
  5. Messages will stop coming after some while.

node config (file odom_localization.yaml):

odom_frame: odom
base_link_frame: base_link
world_frame: odom

two_d_mode: false
frequency: 20
sensor_timeout: 0.03
# publish_tf: false  # set from the launch file
reset_on_time_jump: true

odom0: husky_velocity_controller/odom
odom0_differential: false
odom0_nodelay: true
odom0_queue_size: 10
odom0_config: [
  false, false, false,   # x_pos, y_pos, z_pos
  false, false, false,   # roll, pitch, yaw
  true, true, true,      # x_vel, y_vel, z_vel 
  false, false, true,    # roll_vel, pitch_vel, yaw_vel
  false, false, false    # x_accel, y_accel, z_accel
]

imu0: imu/data_filtered
imu0_differential: false
imu0_nodelay: true
imu0_queue_size: 10
imu0_remove_gravitational_acceleration: true
imu0_config: [
  false, false, false,  # x_pos, y_pos, z_pos
  true, true, false,    # roll, pitch, yaw
  false, false, false,  # x_vel, y_vel, z_vel
  true, true, true,     # roll_vel, pitch_vel, yaw_vel
  false, false, false   # x_accel, y_accel, z_accel
]

Launch the node with:

<launch>
       <node pkg="robot_localization" type="ukf_localization_node" name="ekf_odom"
                  clear_params="true" respawn="true" output="screen">
                <rosparam command="load" file="$(dirname)/odom_localization.yaml" />
                <param name="publish_tf" value="false" />
                <remap from="odometry/filtered" to="imu_odom" />
        </node>
</launch>

Attaching to the running filter with gdb, it is always stuck here:

#0  RobotLocalization::FilterUtilities::clampRotation (rotation=-10636217665855.5) at ./src/filter_utilities.cpp:138                                                                                               
#1  0x00007f0de0c2b83e in RobotLocalization::FilterBase::wrapStateAngles (this=0x7ffc078a7078) at /usr/include/eigen3/Eigen/src/Core/util/Meta.h:300                                                               
#2  0x00007f0de0e20b7e in RobotLocalization::Ukf::correct (this=0x7ffc078a7078, measurement=...) at ./src/ukf.cpp:298                                                                                              
#3  0x00007f0de0c2d2ab in RobotLocalization::FilterBase::processMeasurement (this=this@entry=0x7ffc078a7078, measurement=...) at ./src/filter_base.cpp:234                                                         
#4  0x00007f0de14f7aa7 in RobotLocalization::RosFilter<RobotLocalization::Ukf>::integrateMeasurements (this=this@entry=0x7ffc078a6990, currentTime=...) at /usr/include/c++/9/bits/atomic_base.h:549               
#5  0x00007f0de14f8efa in RobotLocalization::RosFilter<RobotLocalization::Ukf>::periodicUpdate (this=0x7ffc078a6990, event=...) at /usr/include/c++/9/ext/new_allocator.h:119                                      
#6  0x00007f0de1361608 in ros::TimerManager<ros::Time, ros::Duration, ros::TimerEvent>::TimerQueueCallback::call() () from /opt/ros/noetic/lib/libroscpp.so                                                        
#7  0x00007f0de1383172 in ros::CallbackQueue::callOneCB(ros::CallbackQueue::TLS*) () from /opt/ros/noetic/lib/libroscpp.so                                                                                         
#8  0x00007f0de1384883 in ros::CallbackQueue::callAvailable(ros::WallDuration) () from /opt/ros/noetic/lib/libroscpp.so                                                                                            
#9  0x00007f0de13d7fcf in ros::SingleThreadedSpinner::spin(ros::CallbackQueue*) () from /opt/ros/noetic/lib/libroscpp.so                                                                                           
#10 0x00007f0de13c021f in ros::spin() () from /opt/ros/noetic/lib/libroscpp.so                                          
#11 0x000055d0fb55beb0 in main (argc=<optimized out>, argv=<optimized out>) at ./src/ukf_localization_node.cpp:54         

Notice the gigantic rotation argument of clampRotation(). Apparently, the while loop has a lot to do to get this number to -2pi +2pi by subtracting 2pi at a time. Of course, using a modulo function would prevent this stucking, but if I look into the logfile, the state is apparently exploding, so it would not fix the underlying problem.

Here is the debug log of the filter: debug_log.zip Apparently, the state explodes near the end of the log file. No more records are being added to this file once the filter gets stuck,

Version 2.6.3 does not result in this behavior, the filter runs normally.

@ayrton04
Copy link
Collaborator

Thanks, I'll take a look. I think I replaced those wrapStateAngles calls in ROS 2 with the methods in the angles library.

The old UKF logic wasn't really properly functioning as a UKF, so there are likely some parameters that need to be tuned now.

@ayrton04
Copy link
Collaborator

ayrton04 commented Oct 24, 2022

@peci1 the 671-wrap-angles branch handles the wrapping better.

Looking at the logs you posted, I don't see a single IMU measurement being received by the filter. You are attempting to produce a 3D state estimate, but your roll and pitch values have no reference. See this covariance matrix:

[0.57132     -0.077639   -0.00074252 5.7332e+05  6.3011e+05  -3.4506e+05 -1.7039e-05 -6.2144e-07 -3.3632e-06 -0.020026   0.020237    2.263e-13   -0.00053363 -2.0628e-05 -2.8947e-05 
 -0.077639   0.46173     0.00039103  -2.9696e+05 -3.2708e+05 1.7864e+05  -2.0653e-06 -1.862e-05  -9.9904e-06 -0.062799   0.061786    7.1349e-13  -6.2154e-05 -0.00057945 -8.907e-05  
 -0.00074252 0.00039103  0.50618     -1801.4     -2485.9     1103.2      1.0646e-05  1.0808e-05  -3.852e-05  0.31989     -0.32507    -3.6917e-12 0.00032444  0.00032948  -0.00033645 
 5.7332e+05  -2.9696e+05 -1801.4     -8.6652e+06 1.0946e+07  2.0866e+07  -22.448     -23.081     -112.34     -6.9081e+05 7.054e+05   7.9559e-06  -693.29     -712.81     -993.34     
 6.3011e+05  -3.2708e+05 -2485.9     1.0946e+07  -3.3151e+07 -2.137e+07  19.301      19.959      101.06      6.2201e+05  -6.2273e+05 -7.0537e-06 614.27      635.23      881.57      
 -3.4506e+05 1.7864e+05  1103.2      2.0866e+07  -2.137e+07  -3.0473e+07 20.652      21.156      103.01      6.343e+05   -6.4607e+05 -7.3079e-06 638.7       654.42      909.49      
 -1.7039e-05 -2.0653e-06 1.0646e-05  -22.448     19.301      20.652      0.0016376   3.9481e-05  0.00019072  1.174       -1.1962     -1.3544e-11 0.0011814   0.0012155   0.0016877   
 -6.2144e-07 -1.862e-05  1.0808e-05  -23.081     19.959      21.156      3.9481e-05  0.0016399   0.00019642  1.209       -1.231      -1.3941e-11 0.0012155   0.0012519   0.001738    
 -3.3632e-06 -9.9904e-06 -3.852e-05  -112.34     101.06      103.01      0.00019072  0.00019642  0.00352     5.9109      -6.0051     -6.7947e-11 0.0059219   0.0060997   0.0084748   
 -0.020026   -0.062799   0.31989     -6.9081e+05 6.2201e+05  6.343e+05   1.174       1.209       5.9109      36416       -36923      -4.1798e-07 36.436      37.527      52.119      
 0.020237    0.061786    -0.32507    7.054e+05   -6.2273e+05 -6.4607e+05 -1.1962     -1.231      -6.0051     -36923      37577       4.2508e-07  -37.058     -38.141     -52.997     
 2.263e-13   7.1349e-13  -3.6917e-12 7.9559e-06  -7.0537e-06 -7.3079e-06 -1.3544e-11 -1.3941e-11 -6.7947e-11 -4.1798e-07 4.2508e-07  0.0012794   -4.1939e-10 -4.3175e-10 -5.9975e-10 
 -0.00053363 -6.2154e-05 0.00032444  -693.29     614.27      638.7       0.0011814   0.0012155   0.0059219   36.436      -37.058     -4.1939e-10 0.03721     0.037634    0.052275    
 -2.0628e-05 -0.00057945 0.00032948  -712.81     635.23      654.42      0.0012155   0.0012519   0.0060997   37.527      -38.141     -4.3175e-10 0.037634    0.039406    0.053842    
 -2.8947e-05 -8.907e-05  -0.00033645 -993.34     881.57      909.49      0.0016877   0.001738    0.0084748   52.119      -52.997     -5.9975e-10 0.052275    0.053842    0.075756    ]```

All of your absolute orientation variables have massive covariance values, presumably because (a) you have no roll/pitch reference, and (b) the correlation between yaw, roll, and pitch mean that you are going to see yaw covariance also increase. The correlation with the rest of the state variables will eventually lead to those exploding too.

Looking at the bag you sent, the wheel encoder odometry is publishing for nearly 22 seconds before the IMU data even starts arriving. I echoed the first topic from each, then started playing the bag:

$ rostopic echo /husky_velocity_controller/odom/header -n 1
seq: 133769
stamp: 
  secs: 1666530636
  nsecs: 555635664
frame_id: "odom"
$ rostopic echo /imu/data_filtered/header -n 1
seq: 0
stamp: 
  secs: 1666530658
  nsecs: 117259502
frame_id: "imu"

So you are asking the filter to produce a 3D state estimate, but starving it of 3D orientation measurements for over 20 seconds. That's going to make the covariance explode.

@peci1
Copy link
Author

peci1 commented Oct 24, 2022

The thing with missing IMU measurements in the log is a consequence of the reported problem. If I comment out receiving odom messages, the filter processes the IMU messages. If both are used, the filter gets stuck earlier than the IMU measurements start coming.

You are right that the IMU messages are not coming for some time. Our IMU does bias estimation on startup and does not output anything until the biases are estimated. I didn't think this could be a problem because I'm fusing the absolute roll/pitch from IMU, so when messages start coming, the covariances should quickly get to reasonable numbers. And they did before #671. As the robot is forced to be stationary during the bias estimation, all odom messages should read zero velocity. Therefore, the filter should not be forced to project non-zero velocities along the exploding roll and pitch.

I see it could be better to the filter to start it only when IMU measurements are available. Is there a setting supporting this (except the explicit pause service)? I.e., can I mark a sensor as required (so that predictions are not done if the sensor is timed out)?

@ayrton04
Copy link
Collaborator

If I comment out receiving odom messages, the filter processes the IMU messages. If both are used, the filter gets stuck earlier than the IMU measurements start coming.

That's because the filter won't start until it receives its first measurement.

I didn't think this could be a problem because I'm fusing the absolute roll/pitch from IMU, so when messages start coming, the covariances should quickly get to reasonable numbers.

It might still do that if the filter was able to get to that point before things started freezing. For that reason, maybe try the branch I mentioned to see if it helps?

And they did before #671.

Yeah, the filter was effectively not properly implementing a UKF. Now it is, and I also found that it's more sensitive to this kind of thing. But I'd rather have the math be right, even if it means the filter has greater sensitivity to sensor measurement holidays. To be fair, 22 seconds is a very long time to go without any reference in one of your state variables.

Therefore, the filter should not be forced to project non-zero velocities along the exploding roll and pitch.

Prediction is done using the filter's state, not the measurements. If there's a lot of instability in the covariance values, then I can see the "almost zero" state variables (they are really on the order of 1e-15) getting corrected via massive Kalman gains into non-zero values.

That's not to say there isn't another issue lurking (and I'll poke around). It's just that I can see a legitimate cause for this behavior.

I see it could be better to the filter to start it only when IMU measurements are available. Is there a setting supporting this (except the explicit pause service)? I.e., can I mark a sensor as required (so that predictions are not done if the sensor is timed out)?

This does not exist, but it's a great idea (I think there might be a ticket for it?). I'd welcome a PR. :)

@ayrton04
Copy link
Collaborator

Yeah, one thing I am not happy about is this part of the logs:

Corrected full state is:
[1.4444e-13  5.5672e-14  1.169e-15   -7.9048e-11 -3.9406e-12 -8.5224e-13 4.9838e-16  5.1531e-16  -4.3609e-16 1.1484e-14  8.4267e-14  -5.4854e-15 6.1239e-14  -8.464e-15  -1.6683e-14 ]

Corrected full estimate error covariance is:
[0.40633     -4.7622e-26 3.7759e-17  9.1548e-24  -1.1752e-16 2.5607e-18  9.2335e-05  8.2408e-17  3.239e-16   4.2898e-27  1.3729e-15  1.0919e-19  -0.00051717 -3.6678e-16 -4.8162e-16 
 -4.7622e-26 0.40633     2.6919e-17  1.1752e-16  -1.2661e-24 8.4148e-17  -8.2408e-17 9.2335e-05  1.7534e-15  -1.3729e-15 -7.9448e-27 5.1301e-19  3.6678e-16  -0.00051717 -4.2194e-16 
 3.7759e-17  2.6919e-17  0.48739     -9.2442e-18 -3.0298e-16 1.1725e-24  -3.2867e-16 -1.6908e-15 8.2446e-05  -9.9185e-16 4.9278e-16  -2.359e-27  2.6028e-16  -3.8102e-17 -0.00079265 
 9.982e-24   1.1752e-16  -9.2442e-18 1.9925      2.6211e-21  5.6607e-13  3.4721e-25  -4.5071e-19 -1.4761e-19 0.32536     -7.5032e-24 9.6579e-16  -4.3674e-24 -7.6716e-18 -3.9759e-18 
 -1.1752e-16 -1.2665e-24 -3.0298e-16 2.6228e-21  1.9925      -1.1356e-12 4.5071e-19  -9.7783e-26 1.733e-19   8.7895e-24  0.32536     3.2638e-14  7.6716e-18  -1.0274e-24 2.7577e-18  
 2.5607e-18  8.4148e-17  1.1733e-24  5.6607e-13  -1.1356e-12 0.55163     3.6287e-20  -5.9609e-20 2.6298e-26  8.0321e-14  -2.9913e-13 0.0080254   7.4316e-19  -5.9225e-19 1.9295e-24  
 9.2335e-05  -8.2408e-17 -3.2867e-16 4.0537e-25  4.5071e-19  3.6287e-20  0.0024853   -1.7333e-30 -1.3247e-30 3.8695e-29  1.3191e-28  2.0971e-29  0.00083505  9.5695e-30  5.3446e-30  
 8.2408e-17  9.2335e-05  -1.6908e-15 -4.5071e-19 -9.7783e-26 -5.9609e-20 -1.7333e-30 0.0024853   -4.3797e-31 -1.1455e-28 2.2475e-29  -8.0514e-30 5.9518e-29  0.00083505  -5.6549e-30 
 3.239e-16   1.7534e-15  8.2446e-05  -1.4761e-19 1.733e-19   2.6296e-26  -1.4893e-30 -4.3797e-31 0.0033612   -2.9953e-29 -1.4473e-28 2.0068e-29  -1.4452e-29 5.5338e-29  0.00091879  
 4.2898e-27  -1.3729e-15 -9.9185e-16 0.32536     8.8154e-24  8.0321e-14  2.6073e-29  -1.1455e-28 -3.6264e-29 0.080958    -1.1532e-26 1.1031e-27  2.6822e-27  -2.7606e-27 -1.0205e-27 
 1.3729e-15  -7.6419e-27 4.9278e-16  -7.5032e-24 0.32536     -2.9913e-13 1.3191e-28  2.2475e-29  -1.7313e-28 -1.1532e-26 0.080958    -8.116e-28  4.8143e-27  -1.9896e-27 -4.9599e-27 
 1.0919e-19  5.1301e-19  -2.3588e-27 9.6579e-16  3.2638e-14  0.0080254   2.5259e-29  -7.0001e-31 1.1689e-29  1.141e-27   -8.116e-28  0.012872    2.22e-29    6.0675e-28  -1.2944e-28 
 -0.00051717 3.6678e-16  2.6028e-16  -4.3674e-24 7.6716e-18  7.4316e-19  0.00083505  5.794e-29   -1.9812e-30 2.6822e-27  4.6124e-27  1.0387e-29  0.016223    2.4414e-28  2.5716e-28  
 -3.6678e-16 -0.00051717 -3.8102e-17 -7.6716e-18 -1.0274e-24 -5.9225e-19 3.2586e-30  0.00083505  4.9027e-29  -2.7606e-27 -1.9391e-27 6.2076e-28  2.6938e-28  0.016223    5.9008e-28  
 -4.8162e-16 -4.2194e-16 -0.00079265 -3.9759e-18 2.7577e-18  1.9295e-24  -8.1786e-31 6.5601e-31  0.00091879  -1.0205e-27 -4.9599e-27 -1.3117e-28 2.5726e-28  5.9008e-28  0.024507    ]

But then in the next prediction, we get

Predicted state is:
[1.3392e-13  -8.4105e-15 -1.0415e-14 -3.1416     3.1416      -8.0218e-13 6.6562e-15  -2.3334e-15 -2.8866e-15 1.9008e-14  9.9602e-14  -1.0982e-15 5.297e-14   -7.7369e-16 -7.1054e-15 ]

So the roll and pitch get projected into -pi and pi, respectively. Will dig into that.

@ayrton04
Copy link
Collaborator

OK, I think I spotted an issue with how Euler angles are projected. Fix is pushed to the same 671-wrap-angles branch. Let me know if it helps.

@ayrton04
Copy link
Collaborator

No, that won't do what I thought. OK, I'll have to give the bag a go myself. Will try to find cycles for that.

@peci1
Copy link
Author

peci1 commented Oct 24, 2022

Good, thanks a lot!

@ayrton04
Copy link
Collaborator

ayrton04 commented Oct 26, 2022

@peci1 can you do me a favour and add these parameters to your config file?

alpha: 1.414213562
beta: 2.0
kappa: 0.0

And then run and let me know how it goes.

@peci1
Copy link
Author

peci1 commented Oct 26, 2022

Ok, do you mean on the real robot or just running with the bag file? The robot will not be available until next week probably...

@ayrton04
Copy link
Collaborator

I know it'll work on the bag file you sent, but if you have any others that caused failure, that would be great. But please at least verify on the bag you sent.

@peci1
Copy link
Author

peci1 commented Oct 26, 2022

Ok, with the new parameters, it does not get stuck on the bag. However, the localization is still erratic when the IMU messages start coming.

Also, nans are lurking in (although not in any fatal way) - at least rviz says so when visualizing the odometries (orientation shape_scale contains NaN: Vector3(-nan, 0.001, 1.84705)). But I haven't seen any NaNs in the rostopic echo output, so it might be invalid covariances or denormalized quaternions?

Snímek obrazovky pořízený 2022-10-26 13-18-54

I'll test on the real robot next week.


Unrelated: do you plan to backport all these news to melodic, too? I've just compiled noetic-devel branch on my Melodic PC and it seems to work ok.

@ayrton04
Copy link
Collaborator

Are you by chance filtering out the "bad" values from the bag before you run again?

@ayrton04
Copy link
Collaborator

I changed the launch file so it publishes to odometry/filtered again, and rviz seems happy for the entire bag:

image

@peci1
Copy link
Author

peci1 commented Oct 26, 2022

Are you by chance filtering out the "bad" values from the bag before you run again?

What are bad values? You mean the time when only odom is coming? No, I was now testing on the same bag that's mentioned in the original post.

What fixed frame do you use in rviz to get this result? I used 'odom'.

@peci1
Copy link
Author

peci1 commented Oct 26, 2022

Ah, I see, you mean the /imu_odom topic recorded in the bag. I forgot about it! I'll rerun my test.

@peci1
Copy link
Author

peci1 commented Oct 26, 2022

Good, when filtering out the old /imu_odom topic, I see exactly the result you show (when scaling up the covariances a lot in rviz).

Do you have an explanation for the UKF parameter values that fixed it?

@ayrton04
Copy link
Collaborator

Well, I don't think it's really a fix, as it turns out. It will solve that particular issue, but that breaks other things (like our regression tests). I know the main issue; I just need to think of a solution. In the meantime, I'd either switch to the EKF or add the feature whereby we wait for all sensor inputs before starting the filter.

@ayrton04
Copy link
Collaborator

@peci1 can you try this branch again?

#778

Thanks!

@peci1
Copy link
Author

peci1 commented Nov 14, 2022

Thanks, I'll have a look. I got busy in the meantime, but I think I can do this soon.

@daniel-zawadzki-vay
Copy link

@ayrton04 this while loop is a disaster waiting to happen. My project is affected by this bug after migration from ROS melodic to noetic. robot_localization process is stuck at RobotLocalization::FilterUtilities::clampRotation(double). I confirmed that this fix makes this issue go away. Please consider merging it.

@ayrton04
Copy link
Collaborator

ayrton04 commented Oct 9, 2023

@daniel-zawadzki-vay that change touches far more than the rotation clamping. Please feel free to submit a separate PR with that in isolation. However, that clamping logic has been around since the very early days of the package, so I'm not clear on why it's suddenly an issue.

This package is deprecated right now, so while I will definitely merge a PR to address the angle wrapping, I don't have the cycles to iterate on the UKF in general right now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants