Inaccessible LoadBalancer services when using OVN Octavia provider #2333

Closed
m-bull opened this issue Dec 16, 2024 · 7 comments
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@m-bull

m-bull commented Dec 16, 2024

/kind bug

What steps did you take and what happened:
Since #2128, LoadBalancer services backed by the OVN Octavia provider are no longer accessible. They worked while NodePorts were open to 0.0.0.0/0, but the Security Groups that are now created restrict NodePort traffic to sources inside the cluster network.

What did you expect to happen:
No change in behaviour from previous versions.

Anything else you would like to add:
This isn't strictly a bug, but it is a change in behaviour that I spent some time chasing down and might be worth documenting at least here in case it saves someone else the trouble!

As mentioned here, when using the Amphora provider for Octavia, traffic to LB members originates from within the cluster CIDR. The same is not true of the OVN provider, where traffic towards LB members originates from outside the cluster CIDR (at least in my testing). As a result, a worker Security Group that only allows NodePort traffic from inside the cluster CIDR, instead of from 0.0.0.0/0, breaks existing configurations that use the OVN Octavia provider, possibly unexpectedly.
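To illustrate the difference, here is a minimal sketch in Python of how a NodePort rule's source prefix interacts with the two providers. The addresses are made up for illustration (they are not taken from a real cluster), and `rule_allows` is a hypothetical helper that only models the source-prefix check of a Security Group rule, not the full Neutron behaviour.

```python
import ipaddress


def rule_allows(source_ip: str, remote_ip_prefix: str) -> bool:
    """Return True if a rule limited to remote_ip_prefix admits source_ip."""
    return ipaddress.ip_address(source_ip) in ipaddress.ip_network(remote_ip_prefix)


# All values below are assumptions chosen for the example.
CLUSTER_CIDR = "10.6.0.0/24"     # hypothetical cluster network
AMPHORA_SOURCE = "10.6.0.50"     # Amphora: traffic arrives from inside the cluster CIDR
EXTERNAL_CLIENT = "203.0.113.7"  # OVN: traffic arrives from outside the cluster CIDR

# Rule restricted to the cluster CIDR (the behaviour since #2128).
print(rule_allows(AMPHORA_SOURCE, CLUSTER_CIDR))   # Amphora traffic passes
print(rule_allows(EXTERNAL_CLIENT, CLUSTER_CIDR))  # OVN traffic is dropped

# Rule open to 0.0.0.0/0 (the previous behaviour).
print(rule_allows(EXTERNAL_CLIENT, "0.0.0.0/0"))   # OVN traffic passes again
```

With the restricted rule, only sources inside the cluster CIDR match, which is why the same Security Group works with Amphora and not with OVN.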

A small config change on the OCCM side to allow it to manage Security Groups itself makes this all work fine, but does potentially restore the exposure of NodePorts to the internet:

[LoadBalancer]
manage-security-groups=true

Environment:

  • Cluster API Provider OpenStack version (Or git rev-parse HEAD if manually built): v0.11.3
  • Cluster-API version: v1.8.5
  • OpenStack version: Caracal
  • Kubernetes version (use kubectl version): 1.31.2

@k8s-ci-robot added the kind/bug label Dec 16, 2024
@mdbooth
Contributor

mdbooth commented Dec 17, 2024

This looks like an oversight in the original PR and is clearly a regression.

I suspect that the correct fix here is to add a default rule covering OVN Octavia traffic. Do you know what that rule would look like?

@m-bull
Author

m-bull commented Dec 17, 2024

From a few quick tests, it seems that the source IP of packets arriving via an OVN load balancer is preserved all the way to the destination. I think that means the remote IP prefix can only realistically be 0.0.0.0/0 if we want to restore the original behaviour, which unfortunately doesn't help with tightening up the Security Group rules.

@mkjpryor
Contributor

mkjpryor commented Dec 17, 2024

@mdbooth

As @m-bull says, because of how OVN works the fix is basically to revert the change… Are we happy doing that? Because of how the networking is set up, I’m not really seeing what benefits the tighter security groups give in this case TBH, unless people are in the habit of putting floating IPs on their worker nodes? Maybe they are…

@MaysaMacedo
Contributor

MaysaMacedo commented Dec 17, 2024

@mkjpryor

Hello, the change you propose to revert is beneficial when the Amphora driver is used, right?
Perhaps, instead of reverting, we could document that users who need any additional rules opened, which is the ovn-octavia case, need to use managed security groups. Thoughts?

@mkjpryor
Contributor

mkjpryor commented Dec 18, 2024

Depends if you are happy to introduce such a severe regression in a point release, TBH. I probably wouldn’t be - personally I would revert this change, issue a new point release and then if we do still want the change it can go into v0.12.0.

P.S. I know SemVer doesn’t guarantee any backwards compatibility until 1.0.0, but in reality people are using this in prod and we can’t just do that.

@mkjpryor
Contributor

mkjpryor commented Dec 18, 2024

Also, I think the benefits it brings in the Amphora case are minimal - usually the LB and worker nodes are all on the same private network and all access to the nodes is mediated via the LBs.

I would actually argue that the thing that is different and should be documented is the exact opposite case, i.e. if all your CAPI clusters are going onto one big shared network you probably want different secgroups to shut that access down.

@EmilienM
Contributor

I think we can close the issue now that #2353 is merged, and we even have docs.

@github-project-automation github-project-automation bot moved this from Inbox to Done in CAPO Roadmap Jan 21, 2025