Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: properly garbage collecting orphaned network interfaces #642

Open
wants to merge 23 commits into
base: main
Choose a base branch
from

Conversation

Bryce-Soghigian
Copy link
Collaborator

@Bryce-Soghigian Bryce-Soghigian commented Jan 8, 2025

Fixes #92

Description
This introduces a garbage collection controller that will remove network interfaces that are orphaned.

How was this change tested?

  • make test
  • make az-all + scale up

Does this change impact docs?

  • Yes, PR includes docs updates
  • Yes, issue opened: #
  • No

Release Note

Properly garbage collect network interfaces.

nodeClaim.Labels = labels
nodeClaim.Annotations = annotations
nodeClaim.CreationTimestamp = metav1.Time{Time: *vm.Properties.TimeCreated}
nodeClaim.CreationTimestamp = metav1.Time{Time: lo.FromPtr(vm.Properties.TimeCreated)}
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this path was executed when my List query returned an empty result.

In the case where ARG has an outage, garbage collection will break, but there is no need to panic here in that scenario. So switched to lo.FromPtr() instead of normal dereference.

@Bryce-Soghigian
Copy link
Collaborator Author

Testing this locally.

Initially after our first scaleup, we were at 810 network interfaces.

After garbage collection ran we can see that we deleted 435 network interfaces, and were left with 375 for our 375 nodes.

~/dev/focus/karpenter-provider-azure (bsoghigian/network-interface-gc*) » cat logs.txt| grep "garbage collected NIC" | wc -l sillygoose@Bryces-MacBook-Pro
435

image

Copy link
Collaborator

@tallaxes tallaxes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Realize this is still draft to be cleaned up, so apologies if feedback is too early. Leaving some comments, overall I think I would be tempted to have a separate "networkinterface.garbagecollection" controller - unless there is a good reason not to.

pkg/controllers/nodeclaim/garbagecollection/controller.go Outdated Show resolved Hide resolved
pkg/controllers/nodeclaim/garbagecollection/controller.go Outdated Show resolved Hide resolved
pkg/controllers/nodeclaim/garbagecollection/controller.go Outdated Show resolved Hide resolved
pkg/controllers/nodeclaim/garbagecollection/controller.go Outdated Show resolved Hide resolved
pkg/controllers/nodeclaim/garbagecollection/controller.go Outdated Show resolved Hide resolved
pkg/controllers/nodeclaim/garbagecollection/controller.go Outdated Show resolved Hide resolved
pkg/controllers/nodeclaim/garbagecollection/controller.go Outdated Show resolved Hide resolved
@coveralls
Copy link

coveralls commented Jan 10, 2025

Pull Request Test Coverage Report for Build 12738183194

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

Details

  • 100 of 190 (52.63%) changed or added relevant lines in 8 files are covered.
  • 21 unchanged lines in 2 files lost coverage.
  • Overall coverage decreased (-0.1%) to 95.275%

Changes Missing Coverage Covered Lines Changed/Added Lines %
pkg/controllers/nodeclaim/garbagecollection/instance_garbagecollection.go 4 5 80.0%
pkg/providers/instance/instance.go 23 28 82.14%
pkg/providers/instance/arglist.go 42 58 72.41%
pkg/controllers/nodeclaim/garbagecollection/nic_garbagecollection.go 0 68 0.0%
Files with Coverage Reduction New Missed Lines %
pkg/providers/instance/armutils.go 1 65.85%
pkg/providers/instance/instance.go 20 83.87%
Totals Coverage Status
Change from base Build 12661198057: -0.1%
Covered Lines: 48191
Relevant Lines: 50581

💛 - Coveralls

@tallaxes tallaxes added area/networking Issues or PRs related to networking area/provisioning Issues or PRs related to provisioning (instance provider) area/error-handling Issues or PRs related to handling of errors labels Jan 14, 2025
Comment on lines +75 to +76
"karpenter.sh_cluster": lo.ToPtr("test-cluster"),
"karpenter.sh_nodepool": lo.ToPtr(nodepoolName),
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should use constants from instance provider? But then cyclic dependency?

@Bryce-Soghigian Bryce-Soghigian marked this pull request as ready for review January 15, 2025 23:52
Bryce-Soghigian and others added 18 commits January 16, 2025 02:40
…tor of arg related functions to their own file
…ate between them to prevent conflicts in nic deletion calls
…ontroller to avoid attempts to delete nics that are managed by a vm
@Bryce-Soghigian Bryce-Soghigian force-pushed the bsoghigian/network-interface-gc branch from 9eb0362 to 20c6625 Compare January 16, 2025 10:40
Comment on lines 91 to +93
GetNic(context.Context, string, string) (*armnetwork.Interface, error)
DeleteNic(context.Context, string) error
ListNics(context.Context) ([]*armnetwork.Interface, error)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Future refactor here would be to factor out these calls into a network interface provider.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/error-handling Issues or PRs related to handling of errors area/networking Issues or PRs related to networking area/provisioning Issues or PRs related to provisioning (instance provider)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

NicReservedForAnotherVM Error leaves nics in MC Resource Group
3 participants