Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

follow_path action keeps on running after having canceled the goal #4659

Open
milidam opened this issue Sep 3, 2024 · 10 comments
Open

follow_path action keeps on running after having canceled the goal #4659

milidam opened this issue Sep 3, 2024 · 10 comments

Comments

@milidam
Copy link
Contributor

milidam commented Sep 3, 2024

Bug report

Required Info:

  • Operating System:
    • Custom Linux distribution
  • ROS2 Version:
    • Iron 2024-07-12
  • Version or commit hash:
    • navigation2 1.2.9
  • DDS implementation:
    • CycloneDDS

Steps to reproduce issue

In some cases, the follow_path action keeps on running after having canceled a navigate_to_pose goal.

Expected behavior

After canceling the navigate_to_pose goal, we would expect the controller_server to properly interrupt its follow_path action.

[controller_server-8] [INFO]: Goal was canceled. Stopping the robot.
[controller_server-8] [WARN]: [follow_path] [ActionServer] Client requested to cancel the goal. Cancelling.

Actual behavior

The bt_navigator gets a Failed to get result for compute_path_to_pose in node halt! or a Failed to get result for follow_path in node halt! error, and the controller_server keeps on trying to follow the last received path.

[bt_navigator-12] [INFO]: Starting path following
[controller_server-8] [INFO]: Received a goal, begin computing control effort.
[planner_server-9] [WARN]: Map update loop missed its desired rate of 5.0000Hz. Current loop rate is 0.9686Hz
[bt_navigator-12] [WARN]: Behavior tree engine cycle is slower than expected: 114 ms (timeout is 50 ms)
[bt_navigator-12] [WARN]: Time required to tick the tree is 114 ms
[bt_navigator-12] [WARN]: Timed out while waiting for action server to acknowledge goal request for follow_path
[bt_navigator-12] [INFO]: Pausing path following due to slow path computation (>500ms)
[planner_server-9] [INFO]: Attempting to find a path from (-12.62, 3.38) to (-54.58, 11.29).
[planner_server-9] [INFO]: SmacPlannerHybrid: Start pose in world: (-12.624103, 3.381263), in costmap (452, 103)
[planner_server-9] [INFO]: SmacPlannerHybrid: Goal pose in world: (-54.581326, 11.286437), in costmap (32, 182)
[app-21] [INFO]: Request to cancel goals on /navigate_to_pose was accepted by server, waiting for result
[controller_server-8] [WARN]: Control loop missed its desired rate of 10.0000 Hz. Current loop rate is 9.2498 Hz.
[bt_navigator-12] [ERROR]: Failed to get result for compute_path_to_pose in node halt!
[bt_navigator-12] [INFO]: Goal canceled
[bt_navigator-12] [WARN]: [navigate_to_pose] [ActionServer] Client requested to cancel the goal. Cancelling.
[controller_server-8] [WARN]: Control loop missed its desired rate of 10.0000 Hz. Current loop rate is 9.5395 Hz.
[controller_server-8] [WARN]: Control loop missed its desired rate of 10.0000 Hz. Current loop rate is 10.9403 Hz.

Additional information

In the case reported above, before that new instance of the Starting path following, in a previous attempt in the same instance of the navigate_to_pose behavior tree, the bt_navigator already encountered the following error

[bt_navigator-12] [ERROR]: Failed to get result for follow_path in node halt!

what could indicate that the bt_navigator might somehow be desynchronized with the controller_server?

Also, the map is quite big, and the planner_server takes some time to compute a path. The goal canceling occurred while the compute_path_to_pose action was still running.

@SteveMacenski
Copy link
Member

That occurs here [1], so the server timeout doesn't cancel the action in time. If you were to set this to say - 1 second - does that remove your issue? Also, what so you have your server_timeout set to - that could be unrealistically short.

It looks like this condition was added in [2] and may need to (generally speaking) have a longer timeout than the server timeout proportional to steady-state BT node ticking. If we're halting / exiting out a BT, we may find it better to have the rate exceeded a bit in exchange for a more realistic chance of those actions being canceled successfully.

Also, the map is quite big, and the planner_server takes some time to compute a path. The goal canceling occurred while the compute_path_to_pose action was still running.

This is definitely why. If you have a server timeout of 10ms and you've just started a planning request that is 100ms in length, then it won't always cancel in time. We added a cancel checker [3] in the Planner API so that we query in the planning loop if a cancel is requested to be more responsive, but that is not backported to Iron and you'll need to upgrade to Jazzy to receive that. That actually by itself might completely fix your issue.

[1]

if (callback_group_executor_.spin_until_future_complete(future_result, server_timeout_) !=

[2] bfb4506

[3] https://github.com/ros-navigation/navigation2/blob/main/nav2_navfn_planner/src/navfn_planner.cpp#L137

@SteveMacenski
Copy link
Member

@milidam have you had a chance to look into these items? I'd love to get feedback on the precise nature of it to write up a solution for you to test

@milidam
Copy link
Contributor Author

milidam commented Sep 10, 2024

Hi @SteveMacenski,
Our server_timeout is currently set to 100ms.
I haven't had time yet to try to reproduce the issue with an increased timeout.

Our planner's max_planning_time is 30.0s.
Does that also increase the risk of not being able to actually cancel it?

Note that migration to Jazzy is planned, but not yet started.

@SteveMacenski
Copy link
Member

SteveMacenski commented Sep 10, 2024

I haven't had time yet to try to reproduce the issue with an increased timeout.

Do you know when you would be able to? If you increased that to 500ms or 1s and it went away for Follow Path, that could be something we could replace that spin with since that cancel should only happen when halting an entire tree on task exit (which keeping the 10ms tick rate is unnecessary)

Does that also increase the risk of not being able to actually cancel it?

If you haven't upgraded, I think so. This is why I want you to check for me :-) But that shouldn't impact the follow path action, just computing path to pose -- since canceling path planning that is mid-execution is only supported in Jazzy and newer.

@SteveMacenski
Copy link
Member

SteveMacenski commented Dec 11, 2024

@milidam any updates? It sounds like the issue is fixed in Jazzy and Iron is now EOL

@mamariomiamo
Copy link

From the terminal output shared by the OP, follow_path BT action node never cancel the current goal. This may be because of the following sequence of events:

  1. We see the Timed out while waiting for action server to acknowledge goal request for follow_path in the console logging, which means server already timed out, future_goal_handle_ gets reset()

    bool is_future_goal_handle_complete(std::chrono::milliseconds & elapsed)
    {
    auto remaining = server_timeout_ - elapsed;
    // server has already timed out, no need to sleep
    if (remaining <= std::chrono::milliseconds(0)) {
    future_goal_handle_.reset();
    return false;
    }

    and goal_handle_ was not assigned value
    if (result == rclcpp::FutureReturnCode::SUCCESS) {
    goal_handle_ = future_goal_handle_->get();
    future_goal_handle_.reset();
    if (!goal_handle_) {
    throw std::runtime_error("Goal was rejected by the action server");
    }
    return true;
    }

  2. When halt() function is called, should_cancel_goal() returns false since goal_handle_ is still a nullptr and thus the goal was never canceled.

    bool should_cancel_goal()
    {
    // Shut the node down if it is currently running
    if (status() != BT::NodeStatus::RUNNING) {
    return false;
    }
    // No need to cancel the goal if goal handle is invalid
    if (!goal_handle_) {
    return false;
    }

Should we add a call to cancel goal before we return failure here? Since we never get a valid goal_handle_, we can use async_cancel_all_goals()

try {
// if new goal was sent and action server has not yet responded
// check the future goal handle
if (future_goal_handle_) {
auto elapsed =
(node_->now() - time_goal_sent_).template to_chrono<std::chrono::milliseconds>();
if (!is_future_goal_handle_complete(elapsed)) {
// return RUNNING if there is still some time before timeout happens
if (elapsed < server_timeout_) {
return BT::NodeStatus::RUNNING;
}
// if server has taken more time than the specified timeout value return FAILURE
RCLCPP_WARN(
node_->get_logger(),
"Timed out while waiting for action server to acknowledge goal request for %s",
action_name_.c_str());
future_goal_handle_.reset();
return BT::NodeStatus::FAILURE;
}
}

If we do not properly cancel the follow_path action here and if we have any other recovery mechanism for example DriveOnHeading, we would have two nodes publishing to the cmd_vel at the same time.

@SteveMacenski
Copy link
Member

We see the Timed out while waiting for action server to acknowledge goal request for follow_path in the console logging, which means server already timed out, future_goal_handle_ gets reset()

That's not the only false return condition from is_future_goal_handle_complete(), there's also the following which is caused when result == rclcpp::FutureReturnCode::TIMEOUT -- but yes, another form of the timeout. However future_goal_handle_ is not reset in this case.

and goal_handle_ was not assigned value

This is true either way, yes!

When halt() function is called, should_cancel_goal() returns false since goal_handle_ is still a nullptr and thus the goal was never canceled.

True, but we use the goal_handle_ to cancel things, so if nullptr, we couldn't use it away. We'd either have to set the goal handle (which I don't think is possible) or switch the API to use `async_cancel_all_goals or async_cancel_goals_before. Neither of which are very precise, but technically possible.

Should we add a call to cancel goal before we return failure here? Since we never get a valid goal_handle_, we can use async_cancel_all_goals()

Definitely if we do something like that using the other 2 cancel goal APIs would be discouraged because it is a sludge hammer when we want something more targeted.

In this situation is future_goal_handle_.get() != nullptr in order to retrieve a goal handle to use for cancellation? If so, that seems technically possible. However isn't this block focused on the goal sending, not the actual goal execution? This is a timeout for getting an ACK that the goal was receieved. I don't think canceling is appropriate if the goal was never received because there's no goal to handle. From the action server's side, we'd have to see how a cancel to a goal that doesn't exist is handled (and if that's cleanly done that we could possibly send excessive requests and not cause a problem).

try {
// if new goal was sent and action server has not yet responded
// check the future goal handle
if (future_goal_handle_) {
auto elapsed =
(node_->now() - time_goal_sent_).template to_chrono<std::chrono::milliseconds>();
if (!is_future_goal_handle_complete(elapsed)) {
// return RUNNING if there is still some time before timeout happens
if (elapsed < server_timeout_) {
return BT::NodeStatus::RUNNING;
}
// if server has taken more time than the specified timeout value return FAILURE
RCLCPP_WARN(
node_->get_logger(),
"Timed out while waiting for action server to acknowledge goal request for %s",
action_name_.c_str());
future_goal_handle_.reset();
return BT::NodeStatus::FAILURE;
}
}

@adivardi
Copy link
Contributor

Do you know when you would be able to? If you increased that to 500ms or 1s and it went away for Follow Path, that could be something we could replace that spin with since that cancel should only happen when halting an entire tree on task exit (which keeping the 10ms tick rate is unnecessary)

I have a similar issue. I use the navigate_w_basic_complete_coverage behavior tree. When canceling navigate_complete_coverage, I often get [bt_navigator_navigate_complete_coverage_rclcpp_node]: Failed to get result for follow_path in node halt!

Increasing default_server_timeout from 30 to even ~50ms solves this issue.

Example of full output (this repeats for a while):

[ERROR] [bt_navigator_navigate_complete_coverage_rclcpp_node]: Failed to get result for follow_path in node halt!
[WARN]  [controller_server]: Control loop missed its desired rate of 20.0000 Hz. Current loop rate is 14.2857 Hz.
[INFO]  [controller_server]: Cancellation was successful. Stopping the robot.
[INFO]  [controller_server]: [follow_path] [ActionServer] Client requested to cancel the goal. Cancelling.
[INFO]  [controller_server]: Optimizer reset
[INFO]  [controller_server]: Received a goal, begin computing control effort.
[INFO]  [controller_server]: Cancellation was successful. Stopping the robot.
[INFO]  [controller_server]: [follow_path] [ActionServer] Client requested to cancel the goal. Cancelling.
[INFO]  [controller_server]: Optimizer reset
[INFO]  [controller_server]: Received a goal, begin computing control effort.

I also have a similar issue with IsPathValid timing out, but I think this is a similar but not exactly related issue of that service being very slow.

@SteveMacenski
Copy link
Member

SteveMacenski commented Jan 22, 2025

Did you read over the thread above and determine if that is an issue for you, or just increasing the timeout is sufficient? What compute platform / ROS 2 distribution are you on?

@adivardi
Copy link
Contributor

I am running on Jazzy on Gazebo simulation, on a laptop with i7-12700H.

Currently using the higher timeout seems to be working, but it does mean many nodes may run slower than the BT tick frequency. I am not sure how big of an issue it is, but it does not sound correct.

I did read the thread, my intention was to support the idea of either using spin for the action cancellation, or better yet supporting a configurable timeout for cancellation different than the "normal" one.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants