Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Create ray cluster with ssa #778

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

akram
Copy link
Contributor

@akram akram commented Dec 9, 2024

Issue link

What changes have been made

Adding apply function.

Verification steps

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • Testing is not required for this change

Copy link

codecov bot commented Dec 9, 2024

Codecov Report

Attention: Patch coverage is 95.12195% with 2 lines in your changes missing coverage. Please review.

Project coverage is 92.40%. Comparing base (b5c13dc) to head (b21b177).

Files with missing lines Patch % Lines
src/codeflare_sdk/ray/cluster/cluster.py 95.12% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #778      +/-   ##
==========================================
+ Coverage   90.43%   92.40%   +1.97%     
==========================================
  Files          23       23              
  Lines        1359     1396      +37     
==========================================
+ Hits         1229     1290      +61     
+ Misses        130      106      -24     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@akram akram force-pushed the create-ray-cluster-with-ssa branch 13 times, most recently from 658779c to 11dd0cb Compare December 10, 2024 22:28
src/codeflare_sdk/ray/cluster/cluster.py Outdated Show resolved Hide resolved
src/codeflare_sdk/ray/cluster/cluster.py Outdated Show resolved Hide resolved
src/codeflare_sdk/ray/cluster/cluster.py Outdated Show resolved Hide resolved
src/codeflare_sdk/ray/cluster/cluster.py Outdated Show resolved Hide resolved
src/codeflare_sdk/ray/cluster/cluster.py Outdated Show resolved Hide resolved
@akram akram force-pushed the create-ray-cluster-with-ssa branch 12 times, most recently from 73cbb2c to 6a01352 Compare December 16, 2024 13:03
@akram akram force-pushed the create-ray-cluster-with-ssa branch 5 times, most recently from 43fb625 to 6c8f3db Compare December 16, 2024 15:25
Copy link
Contributor

@Bobbins228 Bobbins228 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@akram akram force-pushed the create-ray-cluster-with-ssa branch 2 times, most recently from 352c26c to 34b0e86 Compare January 2, 2025 14:43
Copy link
Contributor

@Bobbins228 Bobbins228 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Began testing this again on both Kind and OpenShift.
It seems that Appwrapper is not working with the new changes when creating a cluster via apply().

TypeError: Discoverer.get() takes 1 positional argument but 2 were given

src/codeflare_sdk/ray/cluster/cluster.py Outdated Show resolved Hide resolved
src/codeflare_sdk/ray/cluster/cluster.py Outdated Show resolved Hide resolved
src/codeflare_sdk/ray/cluster/cluster.py Outdated Show resolved Hide resolved
@akram akram force-pushed the create-ray-cluster-with-ssa branch from 34b0e86 to 1f606fb Compare January 6, 2025 11:00
@akram
Copy link
Contributor Author

akram commented Jan 7, 2025

Began testing this again on both Kind and OpenShift. It seems that Appwrapper is not working with the new changes when creating a cluster via apply().

TypeError: Discoverer.get() takes 1 positional argument but 2 were given

I have encapsulated the exception and was not able to reproduce the exact issue here.
Instead I was seeing an issue related to appwrapper not being enabled.

Copy link
Contributor

@Bobbins228 Bobbins228 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tested this again and Appwrapper is working for initial apply but I ran into a couple of issues when trying to update the AppWrapper and apply my changes.

I kept running into this error with regards to the field manager even with force=True:

A conflict occurred with the RayCluster resource.
Only one RayCluster with the same name is allowed. Please delete or rename the existing RayCluster before creating a new one with the desired name.
Response: b'{"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"Apply failed with 1 conflict: conflict with \\"codeflare-operator\\" using workload.codeflare.dev/v1beta2: .spec.components","reason":"Conflict","details":{"causes":[{"reason":"FieldManagerConflict","message":"conflict with \\"codeflare-operator\\" using workload.codeflare.dev/v1beta2","field":".spec.components"}]},"code":409}\n'

@akram
Copy link
Contributor Author

akram commented Jan 7, 2025

Tested this again and Appwrapper is working for initial apply but I ran into a couple of issues when trying to update the AppWrapper and apply my changes.

I kept running into this error with regards to the field manager even with force=True:

A conflict occurred with the RayCluster resource.
Only one RayCluster with the same name is allowed. Please delete or rename the existing RayCluster before creating a new one with the desired name.
Response: b'{"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"Apply failed with 1 conflict: conflict with \\"codeflare-operator\\" using workload.codeflare.dev/v1beta2: .spec.components","reason":"Conflict","details":{"causes":[{"reason":"FieldManagerConflict","message":"conflict with \\"codeflare-operator\\" using workload.codeflare.dev/v1beta2","field":".spec.components"}]},"code":409}\n'

nice catch. I reproduced it. There is 2 issues actually. For appwrappers, I was not propagating the force=True .
But, on top of that, changing the spec.components[0].template is also blocked by webhooks.
I tried to update the mutating webhook the same way I did for raycluster oauth proxy field, but that's not working.
The validating webhook is currently enforcing the raycluster.template to be immutable.

Trying to edit an appwrapper manually with oc edit leads to the same error.

(codeflare-sdk-py3.9) akram@wadez-m1 codeflare-sdk % oc edit appwrappers.workload.codeflare.dev jobtest
error: appwrappers.workload.codeflare.dev "jobtest" could not be patched: admission webhook "vappwrapper.kb.io" denied the request: spec.components[0].template.raw: Forbidden: attempt to change immutable field

@dgrove-oss
Copy link
Collaborator

AppWrappers enforce that spec.components[*].template is immutable after creation, because updates may invalidate the PodSet info that they've extracted and given to Kueue.

I'm missing the bigger context of what you are trying to do, but you will not be able to mutate a RayCluster wrapped in an AppWrapper in place.

@akram
Copy link
Contributor Author

akram commented Jan 8, 2025

AppWrappers enforce that spec.components[*].template is immutable after creation, because updates may invalidate the PodSet info that they've extracted and given to Kueue.

I'm missing the bigger context of what you are trying to do, but you will not be able to mutate a RayCluster wrapped in an AppWrapper in place.

hi @dgrove-oss ,
the goal of this PR is to introduce an apply() function that would avoid cases where users have to do down+up to update rayclusers and appwrappers. That's an idempotent alternative to the up() function that can also makes resource creation more straitghtforward.

@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 9, 2025
@akram akram force-pushed the create-ray-cluster-with-ssa branch from 1f606fb to 074557a Compare January 15, 2025 17:46
@openshift-merge-robot openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 15, 2025
Copy link
Collaborator

@KPostOffice KPostOffice left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall lgtm, just one nit

@akram akram force-pushed the create-ray-cluster-with-ssa branch from 074557a to 225db7b Compare January 16, 2025 05:52
@akram
Copy link
Contributor Author

akram commented Jan 17, 2025

@KPostOffice and @Bobbins228 do you think we can merge it?
right now, error states are already handled in _kube_api_error_handling and explicit error messages are provided.

- Adds  RayCluster.apply() implementation
- Adds e2e tests for apply
- Adds unit tests for apply
- Exclude unit tests code from coverage
- Add coverage to cluster.py
- Adding coverage for the case of an openshift cluster
- Refactoring apply
- Adding constant, adding error checking, adding documentation
- Rename _apply_resources into _apply_ray_cluster
@akram akram force-pushed the create-ray-cluster-with-ssa branch from c17095a to b21b177 Compare January 17, 2025 14:18
Copy link
Collaborator

@KPostOffice KPostOffice left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jan 21, 2025
Copy link
Contributor

openshift-ci bot commented Jan 21, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: KPostOffice

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 21, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants