
Add parameter to customize the number of cinderAPI processes #474

Closed
wants to merge 1 commit into from

Conversation

fmount
Contributor

@fmount fmount commented Dec 5, 2024

No description provided.

@fmount fmount requested review from eharney and ASBishop December 5, 2024 08:59
@fmount fmount marked this pull request as draft December 5, 2024 08:59
@openshift-ci openshift-ci bot requested review from stuggi and viroel December 5, 2024 08:59
Contributor

openshift-ci bot commented Dec 5, 2024

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: fmount
Once this PR has been reviewed and has the lgtm label, please assign abays for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@@ -666,6 +666,11 @@ spec:
type: object
transportURLSecret:
type: string
workers:
default: 4
Contributor Author

@ASBishop do you think we need a better default? And also, I'm not sure minimum: 1 is the right thing to do; we might want to start with 4 as the minimum value and scale up.

// +kubebuilder:default=4
// +kubebuilder:validation:Minimum=1
// Workers - Number of processes running CinderAPI
Workers *int32 `json:"workers"`
Contributor Author

@ASBishop as this is a parameter we tune for CinderAPI, my idea is to keep it under CinderAPITemplate: it makes it clear how to access this value from cinder_controller. Do you think there is a better place to expose this value?
FYI I'm doing the same thing for Manila; I'd like to keep the two operators in sync.
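Assuming the field lands under CinderAPITemplate as proposed, a user would set it per-API in the Cinder CR. The exact surrounding structure below is illustrative, not copied from the operator's samples:

```yaml
apiVersion: cinder.openstack.org/v1beta1
kind: Cinder
metadata:
  name: cinder
spec:
  cinderAPI:
    replicas: 1
    # hypothetical override of the kubebuilder default of 4
    workers: 8
```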

@fmount fmount marked this pull request as ready for review December 5, 2024 09:05
@openshift-ci openshift-ci bot requested a review from frenzyfriday December 5, 2024 09:05
@fmount
Contributor Author

fmount commented Dec 5, 2024

@ASBishop one interesting thing here is that we have apiTimeout as a top-level parameter, while I think it's specific to CinderAPI.
I'd like to consolidate those parameters into a substruct (httpdTuning or something like that), scoped at the CinderAPITemplate level, but that might be part of a follow-up patch to keep the focus on the goal of this change.

@stuggi
Contributor

stuggi commented Dec 5, 2024

For httpd tuning, check this keystone PR: openstack-k8s-operators/keystone-operator#500. We should do the same here.

// +kubebuilder:default=4
// +kubebuilder:validation:Minimum=1
// ProcessNumber - Number of processes running in Cinder API
ProcessNumber *int32 `json:"processNumber"`
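Since the field is an optional *int32, controller code typically guards against a nil pointer even though the +kubebuilder:default=4 marker makes the API server fill in the default for new objects. A minimal sketch of that pattern (the helper name is mine, not from the operator):

```go
package main

import "fmt"

// defaultProcessNumber mirrors the +kubebuilder:default=4 marker.
const defaultProcessNumber int32 = 4

// processNumber returns the configured value, falling back to the
// default when the field was never set (nil pointer), e.g. for CRs
// created before the field existed.
func processNumber(configured *int32) int32 {
	if configured == nil {
		return defaultProcessNumber
	}
	return *configured
}

func main() {
	eight := int32(8)
	fmt.Println(processNumber(nil))    // falls back to the default
	fmt.Println(processNumber(&eight)) // explicit value wins
}
```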
Contributor

At least in nova-operator we try to keep the replica as the unit of scaling for a service, with a hardcoded process number. What is the case where replica-based scaling is not a good solution for Cinder?

Contributor Author

There were discussions on architectural aspects where you can't scale replicas (because of scheduling constraints), but you still want to tweak this parameter on the same API.
My initial understanding was pretty much the same as yours: use replicas, as they represent the building block for scaling the API, but to reach the same number of processes with a smaller number of replicas you might need to tune this value [1].

[1] https://issues.redhat.com/browse/OSPRH-10363
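The tradeoff being discussed is simple arithmetic: the effective API capacity is replicas × processes per pod, so a deployment constrained to fewer replicas needs more processes per pod to reach the same total. A trivial illustration (names are mine, not from the operator):

```go
package main

import "fmt"

// totalWorkers is the effective number of cinder-api processes serving
// requests: each replica (pod) runs processesPerPod httpd/WSGI processes.
func totalWorkers(replicas, processesPerPod int32) int32 {
	return replicas * processesPerPod
}

func main() {
	// Same capacity, two shapes: scale out pods, or scale up processes.
	fmt.Println(totalWorkers(4, 2)) // 8 workers via 4 replicas
	fmt.Println(totalWorkers(2, 4)) // 8 workers via 2 replicas
}
```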

Contributor

There were discussions on architectural aspects where you can't scale replicas (because of scheduling constraints), but you still want to tweak this parameter on the same API.

What are the scheduling constraints that prevent us from having many replicas? I believe our anti-affinity is soft, so it only tries to spread the pods across worker nodes but does not fail if more pods need to land on the same worker. I looked at [1] but I don't see the actual scheduling problem described there.

Contributor

Moreover, if we open up the process config to the user, then it is easy to make a nonsensical config with a high process count but a low CPU resource request. I still think that having a hardcoded process count helps define a meaningful resource request for our pods.

Contributor Author

That is a good point. We don't have a way to keep all these parameters consistent and still scale in a way that keeps track of resource consumption. I'm reconsidering this set of patches because they're probably trying to solve a problem that is already addressed at the k8s level. I see we have a default number of processes: 4. Do you know if there are better defaults, or did you do anything in Nova that can lead us to choose better numbers (and if so, what metric was used in this context)?

Contributor

That is a good point. We don't have a way to keep all these parameters consistent and still scale in a way that keeps track of resource consumption. I'm reconsidering this set of patches because they're probably trying to solve a problem that is already addressed at the k8s level. I see we have a default number of processes: 4. Do you know if there are better defaults, or did you do anything in Nova that can lead us to choose better numbers (and if so, what metric was used in this context)?

In nova-operator we ended up with processes=2 [1], based on the tempest results and troubleshooting of some probe timeouts. I think 4 is equally valid if you see that as a good fit for the steps of scaling. We will probably gain more insight from the field, especially for the CPU and memory resource requests. For example, I'm currently fixing and performance-measuring a placement-api timeout / memory-consumption spike upstream, so I have data showing that an idle placement-api process needs 100 MB, but during an allocation-candidate query it easily spikes up to 650 MB while pegging one CPU core. Similar data is probably needed for each service to set up good defaults.

[1] https://github.com/openstack-k8s-operators/nova-operator/blob/main/templates/novaapi/config/httpd.conf#L76
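The nova-operator template referenced above wires that count into httpd's mod_wsgi configuration. An illustrative fragment (not copied verbatim from the template; directive values are assumptions) of how a fixed per-pod process count looks in practice:

```apache
# Fixed number of WSGI daemon processes per pod, one thread each;
# scaling beyond this is done by adding replicas, not processes.
WSGIDaemonProcess nova-api processes=2 threads=1 display-name=%{GROUP}
WSGIProcessGroup nova-api
```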

Contributor Author

Thank you for sharing. We haven't conducted this kind of exercise yet, and it is a prerequisite for learning more about resource consumption and tuning the default parameters. I don't see any particular problem coming from tempest with processes=4 as the default, but it's worth investigating in the future.

Contributor

Yeah, I think 4 is fine. I would advise against 1 or anything bigger than 4, though. Or maybe even bigger than 4 is OK if the resource request is also set accordingly.

Contributor

@Deydra71 since you added this to keystone, was there a reason for keystone to be able to customize the processes, instead of bumping the replicas?


@stuggi
Basically just the Jira conversation: https://issues.redhat.com/browse/OSPRH-10363

@fmount fmount closed this Dec 5, 2024
4 participants