Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] Test udn node scale + informer multiplexing #2378

Open
wants to merge 34 commits into
base: master
Choose a base branch
from

Conversation

dceara
Copy link
Contributor

@dceara dceara commented Dec 3, 2024

πŸ“‘ Description

Fixes #

Additional Information for reviewers

βœ… Checks

  • My code requires changes to the documentation
  • if so, I have updated the documentation as required
  • My code requires tests
  • if so, I have added and/or updated the tests as required
  • All the tests have passed in the CI

How to verify it

@dceara dceara changed the title Test udn node scale + informer multiplexing [WIP] Test udn node scale + informer multiplexing Dec 3, 2024
@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Dec 3, 2024
@dceara
Copy link
Contributor Author

dceara commented Dec 3, 2024

/hold

Just for testing

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Dec 3, 2024
@openshift-ci openshift-ci bot requested review from abhat and tssurya December 3, 2024 15:21
@dceara dceara force-pushed the test_udn_node_scale branch from c9fe0f9 to 00d6f2f Compare December 4, 2024 15:27
@jtaleric
Copy link

jtaleric commented Dec 4, 2024

/test

Copy link
Contributor

openshift-ci bot commented Dec 4, 2024

@jtaleric: The /test command needs one or more targets.
The following commands are available to trigger required jobs:

/test 4.18-upgrade-from-stable-4.17-e2e-aws-ovn-upgrade
/test 4.18-upgrade-from-stable-4.17-e2e-gcp-ovn-rt-upgrade
/test 4.18-upgrade-from-stable-4.17-images
/test e2e-aws-ovn
/test e2e-aws-ovn-hypershift
/test e2e-aws-ovn-local-gateway
/test e2e-aws-ovn-local-to-shared-gateway-mode-migration
/test e2e-aws-ovn-serial
/test e2e-aws-ovn-shared-to-local-gateway-mode-migration
/test e2e-aws-ovn-upgrade
/test e2e-aws-ovn-upgrade-local-gateway
/test e2e-aws-ovn-windows
/test e2e-azure-ovn-upgrade
/test e2e-gcp-ovn
/test e2e-gcp-ovn-techpreview
/test e2e-metal-ipi-ovn-dualstack
/test e2e-metal-ipi-ovn-ipv6
/test gofmt
/test images
/test lint
/test unit

The following commands are available to trigger optional jobs:

/test e2e-agent-compact-ipv4
/test e2e-aws-ovn-clusternetwork-cidr-expansion
/test e2e-aws-ovn-fdp-qe
/test e2e-aws-ovn-hypershift-conformance-techpreview
/test e2e-aws-ovn-kubevirt
/test e2e-aws-ovn-single-node-techpreview
/test e2e-aws-ovn-techpreview
/test e2e-aws-ovn-virt-techpreview
/test e2e-azure-ovn
/test e2e-azure-ovn-techpreview
/test e2e-metal-ipi-ovn-dualstack-local-gateway
/test e2e-metal-ipi-ovn-dualstack-local-gateway-techpreview
/test e2e-metal-ipi-ovn-dualstack-techpreview
/test e2e-metal-ipi-ovn-ipv4
/test e2e-metal-ipi-ovn-ipv6-techpreview
/test e2e-metal-ipi-ovn-techpreview
/test e2e-openstack-ovn
/test e2e-ovn-hybrid-step-registry
/test e2e-vsphere-ovn
/test e2e-vsphere-ovn-techpreview
/test e2e-vsphere-windows
/test okd-scos-e2e-aws-ovn
/test okd-scos-images
/test openshift-e2e-gcp-ovn-techpreview-upgrade
/test qe-perfscale-aws-ovn-medium-cluster-density
/test qe-perfscale-aws-ovn-medium-node-density-cni
/test qe-perfscale-aws-ovn-small-cluster-density
/test qe-perfscale-aws-ovn-small-node-density-cni
/test qe-perfscale-aws-ovn-small-udn-density-l2
/test qe-perfscale-aws-ovn-small-udn-density-l3
/test security

Use /test all to run the following jobs that were automatically triggered:

pull-ci-openshift-ovn-kubernetes-master-4.18-upgrade-from-stable-4.17-e2e-aws-ovn-upgrade
pull-ci-openshift-ovn-kubernetes-master-4.18-upgrade-from-stable-4.17-e2e-gcp-ovn-rt-upgrade
pull-ci-openshift-ovn-kubernetes-master-4.18-upgrade-from-stable-4.17-images
pull-ci-openshift-ovn-kubernetes-master-e2e-aws-ovn
pull-ci-openshift-ovn-kubernetes-master-e2e-aws-ovn-hypershift
pull-ci-openshift-ovn-kubernetes-master-e2e-aws-ovn-hypershift-conformance-techpreview
pull-ci-openshift-ovn-kubernetes-master-e2e-aws-ovn-kubevirt
pull-ci-openshift-ovn-kubernetes-master-e2e-aws-ovn-local-gateway
pull-ci-openshift-ovn-kubernetes-master-e2e-aws-ovn-local-to-shared-gateway-mode-migration
pull-ci-openshift-ovn-kubernetes-master-e2e-aws-ovn-serial
pull-ci-openshift-ovn-kubernetes-master-e2e-aws-ovn-shared-to-local-gateway-mode-migration
pull-ci-openshift-ovn-kubernetes-master-e2e-aws-ovn-single-node-techpreview
pull-ci-openshift-ovn-kubernetes-master-e2e-aws-ovn-techpreview
pull-ci-openshift-ovn-kubernetes-master-e2e-aws-ovn-upgrade
pull-ci-openshift-ovn-kubernetes-master-e2e-aws-ovn-upgrade-local-gateway
pull-ci-openshift-ovn-kubernetes-master-e2e-aws-ovn-windows
pull-ci-openshift-ovn-kubernetes-master-e2e-azure-ovn
pull-ci-openshift-ovn-kubernetes-master-e2e-azure-ovn-techpreview
pull-ci-openshift-ovn-kubernetes-master-e2e-azure-ovn-upgrade
pull-ci-openshift-ovn-kubernetes-master-e2e-gcp-ovn
pull-ci-openshift-ovn-kubernetes-master-e2e-gcp-ovn-techpreview
pull-ci-openshift-ovn-kubernetes-master-e2e-metal-ipi-ovn-dualstack
pull-ci-openshift-ovn-kubernetes-master-e2e-metal-ipi-ovn-dualstack-local-gateway-techpreview
pull-ci-openshift-ovn-kubernetes-master-e2e-metal-ipi-ovn-dualstack-techpreview
pull-ci-openshift-ovn-kubernetes-master-e2e-metal-ipi-ovn-ipv6
pull-ci-openshift-ovn-kubernetes-master-e2e-metal-ipi-ovn-ipv6-techpreview
pull-ci-openshift-ovn-kubernetes-master-e2e-metal-ipi-ovn-techpreview
pull-ci-openshift-ovn-kubernetes-master-e2e-openstack-ovn
pull-ci-openshift-ovn-kubernetes-master-e2e-ovn-hybrid-step-registry
pull-ci-openshift-ovn-kubernetes-master-e2e-vsphere-ovn
pull-ci-openshift-ovn-kubernetes-master-e2e-vsphere-ovn-techpreview
pull-ci-openshift-ovn-kubernetes-master-gofmt
pull-ci-openshift-ovn-kubernetes-master-images
pull-ci-openshift-ovn-kubernetes-master-lint
pull-ci-openshift-ovn-kubernetes-master-okd-scos-e2e-aws-ovn
pull-ci-openshift-ovn-kubernetes-master-openshift-e2e-gcp-ovn-techpreview-upgrade
pull-ci-openshift-ovn-kubernetes-master-security
pull-ci-openshift-ovn-kubernetes-master-unit

In response to this:

/test

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@jtaleric
Copy link

jtaleric commented Dec 4, 2024

/test qe-perfscale-aws-ovn-small-udn-density-l3

@dceara dceara force-pushed the test_udn_node_scale branch 3 times, most recently from d60dde2 to c3d3b62 Compare December 11, 2024 10:22
@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Dec 11, 2024
@dceara dceara force-pushed the test_udn_node_scale branch from c3d3b62 to 2ac90fa Compare December 11, 2024 10:29
Routes via mp0 were being deleted on every ovnkube-node restart:
[root@ovn-worker ~]# ip monitor route
Deleted 192.72.3.0/24 dev ovn-k8s-mp0 proto kernel scope link src 192.72.3.2
Deleted broadcast 192.72.3.255 dev ovn-k8s-mp0 table local proto kernel scope link src 192.72.3.2
Deleted local 192.72.3.2 dev ovn-k8s-mp0 table local proto kernel scope host src 192.72.3.2
local 192.72.3.2 dev ovn-k8s-mp0 table local proto kernel scope host src 192.72.3.2
broadcast 192.72.3.255 dev ovn-k8s-mp0 table local proto kernel scope link src 192.72.3.2

This causes traffic outage during upgrade, as well as other unwanted
side effects when pod-destined traffic is routed via default gateway
route in the host. This is especially disruptive in local gateway mode.

This patch removes the teardown, and then makes the synchronization of
addresses and routes more robust, so that we can safely handle changes
to MTU or mp0 addresses.

Signed-off-by: Tim Rozet <[email protected]>
Next need to:
 - add mgmt port mac address
 - update l2 secondary
 - update node tracker

Signed-off-by: Tim Rozet <[email protected]>
- Adds locking to protect node/UDNNode syncmap updates
- Adds UDNNode client support to node controller for updating mac
- make codegen

Signed-off-by: Tim Rozet <[email protected]>
We still need workqueue and retry for nodetracker, as well as to
abstract to a singleton. The start up time of adding all nodes is a
waste everytime we create a new UDN.

Signed-off-by: Tim Rozet <[email protected]>
Signed-off-by: Tim Rozet <[email protected]>
Signed-off-by: Tim Rozet <[email protected]>
trozet and others added 6 commits December 11, 2024 11:29
Make UDN Node informer have 15 threads
Make Network Manager and NAD controller have higher thread count

Signed-off-by: Tim Rozet <[email protected]>
From the documentation of the PollUntilContextTimeout (used by
startupWaiter):

  // ConditionWithContextFunc returns true if the condition is satisfied, or an error
  // if the loop should be aborted.

In the readyFunc callback we were incorrectly returning an error if the
flows couldn't be found in the current try, causing the waiter to bail
early, essentially never retrying.

Change this to log the error inside the callback instead of returning it.

Signed-off-by: Dumitru Ceara <[email protected]>
The netConfig map is accessed from multiple threads (e.g., for UDNs).

Signed-off-by: Dumitru Ceara <[email protected]>
Increases async performance of informer cache being able to always queue
events and not blocking while performing ADD/UPDATE/DELETE operation.

Signed-off-by: Tim Rozet <[email protected]>
@dceara dceara force-pushed the test_udn_node_scale branch from 2ac90fa to 914c896 Compare December 11, 2024 10:30
@openshift-merge-robot openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Dec 11, 2024
@dceara dceara force-pushed the test_udn_node_scale branch from 914c896 to 92ffec2 Compare December 11, 2024 13:10
dceara and others added 7 commits December 12, 2024 13:50
Add a pool of Event handlers instead of a single (federated) event
handler per informer.  Ensure a controller always gets registers with
the same event handler.

TODO: this is used by the secondary L3 network controller and by the
cluster manager for now.

Signed-off-by: Dumitru Ceara <[email protected]>
Always use pool entry with index 0 for the default network controller.

Bump the queue sizes to 1K (except for the initial sync queue, keep that
small enough to avoid contention on handler initial processing).

Signed-off-by: Dumitru Ceara <[email protected]>
Compare annotations directly if possible.
For network specific map entries only compare raw json
entries without parsing the map in full.

Co-authored-by: Tim Rozet <[email protected]>
Signed-off-by: Patryk Diak <[email protected]>
Instead of always parsing all node/join subnets
parse the raw json map and only compute the results
for the affected network.

Signed-off-by: Patryk Diak <[email protected]>
Signed-off-by: Patryk Diak <[email protected]>
@dceara dceara force-pushed the test_udn_node_scale branch from 92ffec2 to f5e4576 Compare December 12, 2024 12:51
Copy link
Contributor

openshift-ci bot commented Dec 12, 2024

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: dceara
Once this PR has been reviewed and has the lgtm label, please assign abhat for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@dceara dceara force-pushed the test_udn_node_scale branch 2 times, most recently from 8266a3d to 4d764f9 Compare December 12, 2024 14:01
dceara and others added 6 commits December 12, 2024 17:24
Today ovnkube-controller relies on ovnkube-node to annotate the mac
address so that it can properly program the OVN mgmt port. In practice
with UDN, this means cluster manager creates the node and allocates some
subnet info to the node, both the node and the ovnkube-controller side
get updates, but ovnkube-controller fails to program the first time
because it is waiting for the node side to configure the mac address.

This patch changes the behavior of ovn-kubernetes to calculate the MAC
address for the management port from the mgmt port IP address of the
first subnet for the network. For backwards compatibility, it will first
attempt to read the node annotation, and if it no mgmt MAC exists, it
will derive it from the mgmt IP of the subnet.

Signed-off-by: Tim Rozet <[email protected]>
Signed-off-by: Dumitru Ceara <[email protected]>
@dceara dceara force-pushed the test_udn_node_scale branch from 4d764f9 to ad1682f Compare December 12, 2024 16:25
Copy link
Contributor

openshift-ci bot commented Dec 12, 2024

@dceara: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/qe-perfscale-aws-ovn-small-udn-density-l3 00d6f2f link false /test qe-perfscale-aws-ovn-small-udn-density-l3
ci/prow/e2e-metal-ipi-ovn-techpreview ad1682f link false /test e2e-metal-ipi-ovn-techpreview
ci/prow/gofmt ad1682f link true /test gofmt
ci/prow/e2e-gcp-ovn-techpreview ad1682f link true /test e2e-gcp-ovn-techpreview
ci/prow/4.18-upgrade-from-stable-4.17-e2e-gcp-ovn-rt-upgrade ad1682f link true /test 4.18-upgrade-from-stable-4.17-e2e-gcp-ovn-rt-upgrade
ci/prow/e2e-metal-ipi-ovn-dualstack-local-gateway-techpreview ad1682f link false /test e2e-metal-ipi-ovn-dualstack-local-gateway-techpreview
ci/prow/e2e-aws-ovn-upgrade-local-gateway ad1682f link true /test e2e-aws-ovn-upgrade-local-gateway
ci/prow/e2e-azure-ovn-upgrade ad1682f link true /test e2e-azure-ovn-upgrade
ci/prow/e2e-metal-ipi-ovn-ipv6-techpreview ad1682f link false /test e2e-metal-ipi-ovn-ipv6-techpreview
ci/prow/openshift-e2e-gcp-ovn-techpreview-upgrade ad1682f link false /test openshift-e2e-gcp-ovn-techpreview-upgrade
ci/prow/okd-scos-e2e-aws-ovn ad1682f link false /test okd-scos-e2e-aws-ovn
ci/prow/e2e-vsphere-ovn-techpreview ad1682f link false /test e2e-vsphere-ovn-techpreview
ci/prow/e2e-aws-ovn-hypershift ad1682f link true /test e2e-aws-ovn-hypershift
ci/prow/e2e-metal-ipi-ovn-ipv6 ad1682f link true /test e2e-metal-ipi-ovn-ipv6
ci/prow/e2e-metal-ipi-ovn-dualstack-techpreview ad1682f link false /test e2e-metal-ipi-ovn-dualstack-techpreview
ci/prow/e2e-aws-ovn-windows ad1682f link true /test e2e-aws-ovn-windows
ci/prow/e2e-aws-ovn-kubevirt ad1682f link false /test e2e-aws-ovn-kubevirt
ci/prow/e2e-aws-ovn-local-gateway ad1682f link true /test e2e-aws-ovn-local-gateway
ci/prow/e2e-aws-ovn ad1682f link true /test e2e-aws-ovn
ci/prow/4.18-upgrade-from-stable-4.17-e2e-aws-ovn-upgrade ad1682f link true /test 4.18-upgrade-from-stable-4.17-e2e-aws-ovn-upgrade
ci/prow/security ad1682f link false /test security
ci/prow/e2e-vsphere-ovn ad1682f link false /test e2e-vsphere-ovn
ci/prow/e2e-aws-ovn-shared-to-local-gateway-mode-migration ad1682f link true /test e2e-aws-ovn-shared-to-local-gateway-mode-migration
ci/prow/e2e-metal-ipi-ovn-dualstack ad1682f link true /test e2e-metal-ipi-ovn-dualstack
ci/prow/e2e-aws-ovn-serial ad1682f link true /test e2e-aws-ovn-serial
ci/prow/e2e-gcp-ovn ad1682f link true /test e2e-gcp-ovn
ci/prow/e2e-aws-ovn-single-node-techpreview ad1682f link false /test e2e-aws-ovn-single-node-techpreview
ci/prow/e2e-aws-ovn-hypershift-conformance-techpreview ad1682f link false /test e2e-aws-ovn-hypershift-conformance-techpreview
ci/prow/e2e-aws-ovn-local-to-shared-gateway-mode-migration ad1682f link true /test e2e-aws-ovn-local-to-shared-gateway-mode-migration
ci/prow/e2e-azure-ovn ad1682f link false /test e2e-azure-ovn
ci/prow/e2e-aws-ovn-techpreview ad1682f link false /test e2e-aws-ovn-techpreview
ci/prow/e2e-openstack-ovn ad1682f link false /test e2e-openstack-ovn
ci/prow/e2e-azure-ovn-techpreview ad1682f link false /test e2e-azure-ovn-techpreview
ci/prow/e2e-aws-ovn-upgrade ad1682f link true /test e2e-aws-ovn-upgrade
ci/prow/e2e-ovn-hybrid-step-registry ad1682f link false /test e2e-ovn-hybrid-step-registry

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Dec 15, 2024
@openshift-merge-robot
Copy link
Contributor

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants