Instability when launching multiple nodes simultaneously in rmw_zenoh 1.0.0. #318

Closed
Tacha-S opened this issue Nov 19, 2024 · 11 comments
@Tacha-S

Tacha-S commented Nov 19, 2024

When launching 50 nodes simultaneously with the default configuration of rmw_zenoh 1.0.0, the following logs are output. Although errors occur, communication itself seems to be working, which suggests that failures under high load (for example during scouting) are recovered through retries or a similar mechanism.

WARN net-0 ThreadId(02) zenoh::net::runtime::orchestrator: Unable to connect to any locator of scouted peer abcca54e9b9f39dcc959a291b70ad4a3: [tcp/127.0.0.1:37439]

ERROR rx-3 ThreadId(10) zenoh_transport::unicast::universal::tx: Unable to push non droppable network message to abcca54e9b9f39dcc959a291b70ad4a3. Closing transport!

ERROR rx-3 ThreadId(10) zenoh::net::routing::dispatcher::pubsub: Face{2, ba18fecf1dc6032086ef7598ee0610d3} Undeclare unknown subscriber 36
ERROR rx-3 ThreadId(10) zenoh::net::routing::dispatcher::queries: Face{2, ba18fecf1dc6032086ef7598ee0610d3} Undeclare unknown queryable 37

ERROR rx-5 ThreadId(12) zenoh::net::routing::dispatcher::pubsub: Face{2, ba18fecf1dc6032086ef7598ee0610d3} Undeclare unknown subscriber @/a58b796ef2f5719ab57e80fb2c728d75/peer/config/**!
ERROR rx-5 ThreadId(12) zenoh::net::routing::dispatcher::queries: Face{2, ba18fecf1dc6032086ef7598ee0610d3} Undeclare unknown queryable @/a58b796ef2f5719ab57e80fb2c728d75/peer/**!

ERROR rx-1 ThreadId(08) zenoh::net::routing::dispatcher::queries: Face{27, 524dd4311cd7123f970f723a5392f031} Declare queryable 40 for unknown scope 40!

[ERROR][rmw_zenoh_cpp]: topic name /clock not found in topic_map. Report this.

Reducing the number of nodes to 31 significantly reduces errors, leaving only the following two types:

ERROR ThreadId(13) zenoh_transport::unicast::universal::tx: Unable to push non droppable network message to 5512dc59d089a06f075aef2519934705. Closing transport!

ERROR rx-0 ThreadId(07) zenoh::net::routing::dispatcher::queries: Face{2, cf85ad2d74c31f9f8380d04b0f718782} Undeclare unknown queryable @/5011374e2ef9e02648f354442d5bffb/peer/**!

I have not yet investigated the parameters, but I will add comments as I find more information.
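To compare runs such as the 50-node and 31-node cases above, it may help to tally log lines by level and module instead of reading them by eye. The following is a minimal sketch, not part of rmw_zenoh or Zenoh: the line shapes are inferred only from the excerpts in this report, and the names `parse_line` and `tally` are hypothetical. Output captured through `ros2 launch` may carry an extra per-process prefix that this sketch does not handle.

```python
import re
from collections import Counter

# Assumed Zenoh line shape, inferred from the excerpts above:
#   LEVEL [thread-name] ThreadId(NN) module::path: message
ZENOH_LINE = re.compile(
    r"^(?P<level>TRACE|DEBUG|INFO|WARN|ERROR)\s+"
    r"(?:(?P<thread>\S+)\s+)?"          # thread name is missing on some lines
    r"ThreadId\(\d+\)\s+"
    r"(?P<module>[\w:]+):\s+"
    r"(?P<message>.*)$"
)

# Assumed rmw_zenoh_cpp line shape: [LEVEL][logger]: message
RMW_LINE = re.compile(
    r"^\[(?P<level>[A-Z]+)\]\[(?P<module>\w+)\]:\s+(?P<message>.*)$"
)


def parse_line(line):
    """Return (level, module) for a recognized log line, else None."""
    text = line.strip()
    for pattern in (ZENOH_LINE, RMW_LINE):
        match = pattern.match(text)
        if match:
            return match.group("level"), match.group("module")
    return None


def tally(lines):
    """Count recognized log lines per (level, module); skip everything else."""
    counts = Counter()
    for line in lines:
        key = parse_line(line)
        if key is not None:
            counts[key] += 1
    return counts
```

Feeding the captured output of each run into `tally` and printing `counts.most_common()` gives a quick view of which modules produce the most ERROR lines in each configuration.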

Additionally, while the following warnings are not problematic, I would like to suppress them if possible. I am also curious as to why ROS 2 reliability is not supported.

[WARN][rmw_zenoh_cpp]: `reliability` no longer supported on subscriber. Ignoring...

Here is our operating environment:

  • OS: Ubuntu 24.04
  • ROS 2 Jazzy
  • rmw_zenoh 1.0.0 (https://github.com/ZettaScaleLabs/rmw_zenoh/tree/dev/1.0.0)

@Yadunund
Member

Do you encounter the same issues when running with the rolling branch of rmw_zenoh?

@Tacha-S
Author

Tacha-S commented Nov 20, 2024

On the rolling branch, the following errors appear instead of the ones above.

WARN net-0 ThreadId(02) zenoh::net::runtime::orchestrator: Unable to connect to any locator of scouted peer 19e837d73b78681c57969db53c2a58a1: [tcp/127.0.0.1:37373]
ERROR rx-0 ThreadId(07) zenoh::net::routing::dispatcher::resource: Resource 1 remapped. Remapping unsupported!
ERROR rx-1 ThreadId(08) zenoh::net::routing::dispatcher::pubsub: Undeclare subscription with unknown scope!
ERROR rx-1 ThreadId(08) zenoh::net::routing::dispatcher::queries: Declare queryable for unknown scope 29!
[ERROR][rmw_zenoh_cpp]: topic name /clock not found in topic_map. Report this.

@Yadunund
Member

Could you clarify if these errors/warnings are from the rmw_zenohd router or from the nodes you are launching?

It sounds like communication itself is working, so I'm curious why you classify this as "instability". Is there some functionality that is not working as expected?

It would be great if you can provide a link to a repository that can help us reproduce this behavior.

@Tacha-S
Author

Tacha-S commented Nov 20, 2024

These are all outputs from the RMW of our node.
I believe there were also some ERROR messages from the zenohd router, but I haven't included them here.

If these behaviors are expected, then it’s fine,
but it’s hard to believe that frequent ERROR-level logs are normal, which is why I described it as "instability."
If the system can recover from these errors, perhaps the log level should be WARN rather than ERROR.

I cannot provide code that reproduces this issue,
but I believe spawning a robot in a Gazebo environment would give a similar setup.

@YuanYuYuan
Contributor

Hi @Tacha-S, the warning will be removed since it's not really useful to the users. Thanks for your feedback.

I also just tested rmw_zenoh_cpp 1.0 with 50 more demo listeners and talkers, but I didn't see any of the errors you posted. Can you share how you produce them?
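For readers who want to try a comparable load test, here is one way it might be scripted. This is a sketch under assumptions and not necessarily how the test above was run: it assumes the `demo_nodes_cpp` package (`talker` and `listener`) is installed, that the `rmw_zenohd` router is already running (for example via `ros2 run rmw_zenoh_cpp rmw_zenohd`), and the helper names `build_commands` and `launch_all` are hypothetical.

```python
import os
import subprocess


def build_commands(pairs):
    """Build one talker and one listener command per pair, each with a unique node name."""
    commands = []
    for i in range(pairs):
        for exe in ("talker", "listener"):
            commands.append([
                "ros2", "run", "demo_nodes_cpp", exe,
                "--ros-args", "-r", f"__node:={exe}_{i}",
            ])
    return commands


def launch_all(commands):
    """Start every command back to back so the nodes come up at nearly the same time.

    Assumes a sourced ROS 2 environment; selects rmw_zenoh_cpp for the child processes.
    """
    env = dict(os.environ, RMW_IMPLEMENTATION="rmw_zenoh_cpp")
    return [subprocess.Popen(cmd, env=env) for cmd in commands]
```

Calling `launch_all(build_commands(25))` would start 50 processes; the returned `Popen` handles can be terminated once the startup logs have been collected. Plain talkers and listeners create far fewer subscriptions and services per node than a full navigation stack, so such a test may not reproduce the errors reported here.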

@Tacha-S
Author

Tacha-S commented Nov 27, 2024

The issue can be reproduced with the following launch command:

ros2 launch turtlebot4_gz_bringup turtlebot4_spawn.launch.py namespace:=robot2 nav2:=true localization:=true rviz:=true

@evshary
Contributor

evshary commented Nov 29, 2024

Hi @Tacha-S, it would be great if you could provide a minimal reproducible example to reduce the complexity. On my side, turtlebot4_gz_bringup doesn't even work with CycloneDDS, so it's a little difficult to see what is expected.

@Tacha-S
Author

Tacha-S commented Nov 29, 2024

That's unfortunate.
It might be better to use:
ros2 launch turtlebot4_gz_bringup turtlebot4_gz.launch.py slam:=true nav2:=true rviz:=true
Additionally, since there is a model download involved, the first launch may take a significant amount of time.

In any case, I think the large number of errors from Zenoh during startup should be reproducible.
Did they not show up in your environment?

@Yadunund
Member

@Tacha-S we've heard other reports of high CPU loads when launching a lot of nodes at once. #408 tracks this issue. ZettaScale has a fix in upstream Zenoh, and we'll update the vendored version here once it is ready.

@Tacha-S
Author

Tacha-S commented Jan 17, 2025

I'm looking forward to the updated version being released.

@Yadunund
Member

@Tacha-S we just merged #424, which should alleviate the problem. Could you rebuild, try your application again, and let us know if there is an improvement?
