follow_path action keeps on running after having canceled the goal #4659

milidam · 2024-09-03T10:29:45Z

Bug report

Required Info:

Operating System:
- Custom Linux distribution
ROS2 Version:
- Iron 2024-07-12
Version or commit hash:
- navigation2 1.2.9
DDS implementation:
- CycloneDDS

Steps to reproduce issue

In some cases, the follow_path action keeps on running after having canceled a navigate_to_pose goal.

Expected behavior

After canceling the navigate_to_pose goal, we would expect the controller_server to properly interrupt its follow_path action.

[controller_server-8] [INFO]: Goal was canceled. Stopping the robot.
[controller_server-8] [WARN]: [follow_path] [ActionServer] Client requested to cancel the goal. Cancelling.

Actual behavior

The bt_navigator gets a Failed to get result for compute_path_to_pose in node halt! or a Failed to get result for follow_path in node halt! error, and the controller_server keeps on trying to follow the last received path.

[bt_navigator-12] [INFO]: Starting path following
[controller_server-8] [INFO]: Received a goal, begin computing control effort.
[planner_server-9] [WARN]: Map update loop missed its desired rate of 5.0000Hz. Current loop rate is 0.9686Hz
[bt_navigator-12] [WARN]: Behavior tree engine cycle is slower than expected: 114 ms (timeout is 50 ms)
[bt_navigator-12] [WARN]: Time required to tick the tree is 114 ms
[bt_navigator-12] [WARN]: Timed out while waiting for action server to acknowledge goal request for follow_path
[bt_navigator-12] [INFO]: Pausing path following due to slow path computation (>500ms)
[planner_server-9] [INFO]: Attempting to find a path from (-12.62, 3.38) to (-54.58, 11.29).
[planner_server-9] [INFO]: SmacPlannerHybrid: Start pose in world: (-12.624103, 3.381263), in costmap (452, 103)
[planner_server-9] [INFO]: SmacPlannerHybrid: Goal pose in world: (-54.581326, 11.286437), in costmap (32, 182)
[app-21] [INFO]: Request to cancel goals on /navigate_to_pose was accepted by server, waiting for result
[controller_server-8] [WARN]: Control loop missed its desired rate of 10.0000 Hz. Current loop rate is 9.2498 Hz.
[bt_navigator-12] [ERROR]: Failed to get result for compute_path_to_pose in node halt!
[bt_navigator-12] [INFO]: Goal canceled
[bt_navigator-12] [WARN]: [navigate_to_pose] [ActionServer] Client requested to cancel the goal. Cancelling.
[controller_server-8] [WARN]: Control loop missed its desired rate of 10.0000 Hz. Current loop rate is 9.5395 Hz.
[controller_server-8] [WARN]: Control loop missed its desired rate of 10.0000 Hz. Current loop rate is 10.9403 Hz.

Additional information

In the case reported above, before that new instance of the Starting path following, in a previous attempt in the same instance of the navigate_to_pose behavior tree, the bt_navigator already encountered the following error

[bt_navigator-12] [ERROR]: Failed to get result for follow_path in node halt!

what could indicate that the bt_navigator might somehow be desynchronized with the controller_server?

Also, the map is quite big, and the planner_server takes some time to compute a path. The goal canceling occurred while the compute_path_to_pose action was still running.

The text was updated successfully, but these errors were encountered:

SteveMacenski · 2024-09-03T22:39:45Z

That occurs here [1], so the server timeout doesn't cancel the action in time. If you were to set this to say - 1 second - does that remove your issue? Also, what so you have your server_timeout set to - that could be unrealistically short.

It looks like this condition was added in [2] and may need to (generally speaking) have a longer timeout than the server timeout proportional to steady-state BT node ticking. If we're halting / exiting out a BT, we may find it better to have the rate exceeded a bit in exchange for a more realistic chance of those actions being canceled successfully.

Also, the map is quite big, and the planner_server takes some time to compute a path. The goal canceling occurred while the compute_path_to_pose action was still running.

This is definitely why. If you have a server timeout of 10ms and you've just started a planning request that is 100ms in length, then it won't always cancel in time. We added a cancel checker [3] in the Planner API so that we query in the planning loop if a cancel is requested to be more responsive, but that is not backported to Iron and you'll need to upgrade to Jazzy to receive that. That actually by itself might completely fix your issue.

[1]

navigation2/nav2_behavior_tree/include/nav2_behavior_tree/bt_action_node.hpp

Line 319 in 34d41ed

    
           if (callback_group_executor_.spin_until_future_complete(future_result, server_timeout_) !=

[2] bfb4506

[3] https://github.com/ros-navigation/navigation2/blob/main/nav2_navfn_planner/src/navfn_planner.cpp#L137

SteveMacenski · 2024-09-09T22:15:24Z

@milidam have you had a chance to look into these items? I'd love to get feedback on the precise nature of it to write up a solution for you to test

milidam · 2024-09-10T09:20:40Z

Hi @SteveMacenski,
Our server_timeout is currently set to 100ms.
I haven't had time yet to try to reproduce the issue with an increased timeout.

Our planner's max_planning_time is 30.0s.
Does that also increase the risk of not being able to actually cancel it?

Note that migration to Jazzy is planned, but not yet started.

SteveMacenski · 2024-09-10T18:14:33Z

I haven't had time yet to try to reproduce the issue with an increased timeout.

Do you know when you would be able to? If you increased that to 500ms or 1s and it went away for Follow Path, that could be something we could replace that spin with since that cancel should only happen when halting an entire tree on task exit (which keeping the 10ms tick rate is unnecessary)

Does that also increase the risk of not being able to actually cancel it?

If you haven't upgraded, I think so. This is why I want you to check for me :-) But that shouldn't impact the follow path action, just computing path to pose -- since canceling path planning that is mid-execution is only supported in Jazzy and newer.

SteveMacenski · 2024-12-11T00:11:30Z

@milidam any updates? It sounds like the issue is fixed in Jazzy and Iron is now EOL

mamariomiamo · 2024-12-20T08:37:25Z

From the terminal output shared by the OP, follow_path BT action node never cancel the current goal. This may be because of the following sequence of events:

We see the Timed out while waiting for action server to acknowledge goal request for follow_path in the console logging, which means server already timed out, future_goal_handle_ gets reset()

navigation2/nav2_behavior_tree/include/nav2_behavior_tree/bt_action_node.hpp

Lines 405 to 413 in 34d41ed

    
           bool is_future_goal_handle_complete(std::chrono::milliseconds & elapsed) 
        
           { 
        
             auto remaining = server_timeout_ - elapsed; 
        
             // server has already timed out, no need to sleep 
        
             if (remaining <= std::chrono::milliseconds(0)) { 
        
               future_goal_handle_.reset(); 
        
               return false; 
        
             }

and goal_handle_ was not assigned value

navigation2/nav2_behavior_tree/include/nav2_behavior_tree/bt_action_node.hpp

Lines 425 to 432 in 34d41ed

    
           if (result == rclcpp::FutureReturnCode::SUCCESS) { 
        
             goal_handle_ = future_goal_handle_->get(); 
        
             future_goal_handle_.reset(); 
        
             if (!goal_handle_) { 
        
               throw std::runtime_error("Goal was rejected by the action server"); 
        
             } 
        
             return true; 
        
           }

When halt() function is called, should_cancel_goal() returns false since goal_handle_ is still a nullptr and thus the goal was never canceled.

navigation2/nav2_behavior_tree/include/nav2_behavior_tree/bt_action_node.hpp

Lines 340 to 350 in 34d41ed

    
           bool should_cancel_goal() 
        
           { 
        
             // Shut the node down if it is currently running 
        
             if (status() != BT::NodeStatus::RUNNING) { 
        
               return false; 
        
             } 
        
             // No need to cancel the goal if goal handle is invalid 
        
             if (!goal_handle_) { 
        
               return false; 
        
             }

Should we add a call to cancel goal before we return failure here? Since we never get a valid goal_handle_, we can use async_cancel_all_goals()

navigation2/nav2_behavior_tree/include/nav2_behavior_tree/bt_action_node.hpp

Lines 209 to 228 in 34d41ed

    
           try { 
        
             // if new goal was sent and action server has not yet responded 
        
             // check the future goal handle 
        
             if (future_goal_handle_) { 
        
               auto elapsed = 
        
                 (node_->now() - time_goal_sent_).template to_chrono<std::chrono::milliseconds>(); 
        
               if (!is_future_goal_handle_complete(elapsed)) { 
        
                 // return RUNNING if there is still some time before timeout happens 
        
                 if (elapsed < server_timeout_) { 
        
                   return BT::NodeStatus::RUNNING; 
        
                 } 
        
                 // if server has taken more time than the specified timeout value return FAILURE 
        
                 RCLCPP_WARN( 
        
                   node_->get_logger(), 
        
                   "Timed out while waiting for action server to acknowledge goal request for %s", 
        
                   action_name_.c_str()); 
        
                 future_goal_handle_.reset(); 
        
                 return BT::NodeStatus::FAILURE; 
        
               } 
        
             }

If we do not properly cancel the follow_path action here and if we have any other recovery mechanism for example DriveOnHeading, we would have two nodes publishing to the cmd_vel at the same time.

SteveMacenski · 2024-12-20T18:01:48Z

We see the Timed out while waiting for action server to acknowledge goal request for follow_path in the console logging, which means server already timed out, future_goal_handle_ gets reset()

That's not the only false return condition from is_future_goal_handle_complete(), there's also the following which is caused when result == rclcpp::FutureReturnCode::TIMEOUT -- but yes, another form of the timeout. However future_goal_handle_ is not reset in this case.

navigation2/nav2_behavior_tree/include/nav2_behavior_tree/bt_action_node.hpp

Line 434 in 34d41ed

return false;

and goal_handle_ was not assigned value

This is true either way, yes!

When halt() function is called, should_cancel_goal() returns false since goal_handle_ is still a nullptr and thus the goal was never canceled.

True, but we use the goal_handle_ to cancel things, so if nullptr, we couldn't use it away. We'd either have to set the goal handle (which I don't think is possible) or switch the API to use `async_cancel_all_goals or async_cancel_goals_before. Neither of which are very precise, but technically possible.

Should we add a call to cancel goal before we return failure here? Since we never get a valid goal_handle_, we can use async_cancel_all_goals()

Definitely if we do something like that using the other 2 cancel goal APIs would be discouraged because it is a sludge hammer when we want something more targeted.

In this situation is future_goal_handle_.get() != nullptr in order to retrieve a goal handle to use for cancellation? If so, that seems technically possible. However isn't this block focused on the goal sending, not the actual goal execution? This is a timeout for getting an ACK that the goal was receieved. I don't think canceling is appropriate if the goal was never received because there's no goal to handle. From the action server's side, we'd have to see how a cancel to a goal that doesn't exist is handled (and if that's cleanly done that we could possibly send excessive requests and not cause a problem).

navigation2/nav2_behavior_tree/include/nav2_behavior_tree/bt_action_node.hpp

Lines 209 to 228 in 34d41ed

    
           try { 
        
             // if new goal was sent and action server has not yet responded 
        
             // check the future goal handle 
        
             if (future_goal_handle_) { 
        
               auto elapsed = 
        
                 (node_->now() - time_goal_sent_).template to_chrono<std::chrono::milliseconds>(); 
        
               if (!is_future_goal_handle_complete(elapsed)) { 
        
                 // return RUNNING if there is still some time before timeout happens 
        
                 if (elapsed < server_timeout_) { 
        
                   return BT::NodeStatus::RUNNING; 
        
                 } 
        
                 // if server has taken more time than the specified timeout value return FAILURE 
        
                 RCLCPP_WARN( 
        
                   node_->get_logger(), 
        
                   "Timed out while waiting for action server to acknowledge goal request for %s", 
        
                   action_name_.c_str()); 
        
                 future_goal_handle_.reset(); 
        
                 return BT::NodeStatus::FAILURE; 
        
               } 
        
             }

adivardi · 2025-01-22T12:29:21Z

Do you know when you would be able to? If you increased that to 500ms or 1s and it went away for Follow Path, that could be something we could replace that spin with since that cancel should only happen when halting an entire tree on task exit (which keeping the 10ms tick rate is unnecessary)

I have a similar issue. I use the navigate_w_basic_complete_coverage behavior tree. When canceling navigate_complete_coverage, I often get [bt_navigator_navigate_complete_coverage_rclcpp_node]: Failed to get result for follow_path in node halt!

Increasing default_server_timeout from 30 to even ~50ms solves this issue.

Example of full output (this repeats for a while):

[ERROR] [bt_navigator_navigate_complete_coverage_rclcpp_node]: Failed to get result for follow_path in node halt!
[WARN]  [controller_server]: Control loop missed its desired rate of 20.0000 Hz. Current loop rate is 14.2857 Hz.
[INFO]  [controller_server]: Cancellation was successful. Stopping the robot.
[INFO]  [controller_server]: [follow_path] [ActionServer] Client requested to cancel the goal. Cancelling.
[INFO]  [controller_server]: Optimizer reset
[INFO]  [controller_server]: Received a goal, begin computing control effort.
[INFO]  [controller_server]: Cancellation was successful. Stopping the robot.
[INFO]  [controller_server]: [follow_path] [ActionServer] Client requested to cancel the goal. Cancelling.
[INFO]  [controller_server]: Optimizer reset
[INFO]  [controller_server]: Received a goal, begin computing control effort.

I also have a similar issue with IsPathValid timing out, but I think this is a similar but not exactly related issue of that service being very slow.

SteveMacenski · 2025-01-22T19:11:57Z

Did you read over the thread above and determine if that is an issue for you, or just increasing the timeout is sufficient? What compute platform / ROS 2 distribution are you on?

adivardi · 2025-01-23T07:20:40Z

I am running on Jazzy on Gazebo simulation, on a laptop with i7-12700H.

Currently using the higher timeout seems to be working, but it does mean many nodes may run slower than the BT tick frequency. I am not sure how big of an issue it is, but it does not sound correct.

I did read the thread, my intention was to support the idea of either using spin for the action cancellation, or better yet supporting a configurable timeout for cancellation different than the "normal" one.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

follow_path action keeps on running after having canceled the goal #4659

follow_path action keeps on running after having canceled the goal #4659

milidam commented Sep 3, 2024

SteveMacenski commented Sep 3, 2024

SteveMacenski commented Sep 9, 2024

milidam commented Sep 10, 2024

SteveMacenski commented Sep 10, 2024 •

edited

Loading

SteveMacenski commented Dec 11, 2024 •

edited

Loading

mamariomiamo commented Dec 20, 2024

SteveMacenski commented Dec 20, 2024

adivardi commented Jan 22, 2025

SteveMacenski commented Jan 22, 2025 •

edited

Loading

adivardi commented Jan 23, 2025

follow_path action keeps on running after having canceled the goal #4659

follow_path action keeps on running after having canceled the goal #4659

Comments

milidam commented Sep 3, 2024

Bug report

Steps to reproduce issue

Expected behavior

Actual behavior

Additional information

SteveMacenski commented Sep 3, 2024

SteveMacenski commented Sep 9, 2024

milidam commented Sep 10, 2024

SteveMacenski commented Sep 10, 2024 • edited Loading

SteveMacenski commented Dec 11, 2024 • edited Loading

mamariomiamo commented Dec 20, 2024

SteveMacenski commented Dec 20, 2024

adivardi commented Jan 22, 2025

SteveMacenski commented Jan 22, 2025 • edited Loading

adivardi commented Jan 23, 2025

SteveMacenski commented Sep 10, 2024 •

edited

Loading

SteveMacenski commented Dec 11, 2024 •

edited

Loading

SteveMacenski commented Jan 22, 2025 •

edited

Loading