
[RFE] Reset cloud allocation API #561

Open
josecastillolema opened this issue Jan 7, 2025 · 6 comments
@josecastillolema

Is your feature request related to a problem? Please describe.
The feature request is related to CI usage. Most of our tooling (jetlag, jetski) assumes a clean deployment as a prerequisite.

Describe the solution you'd like
A reset allocation API that will:

  • Reset the nodes (Foreman deploy them)
  • Optionally allow skipping the bastion (the first host of the allocation)
  • Maintain the same cloud allocation number

Describe alternatives you've considered
Manually running `hammer host update` against all of the nodes of the allocation.
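
A rough sketch of that manual loop (assuming `hammer` is already configured against the lab Foreman; the host list file is a placeholder):

```
# Mark every node of the allocation for build; allocation_hosts.txt is
# a hypothetical file listing one hostname per line.
for host in $(cat allocation_hosts.txt); do
    hammer host update --name "$host" --build 1
done
```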

@sadsfae
Member

sadsfae commented Jan 7, 2025

Hi @josecastillolema, I think all of this is either already covered by other processes or out of scope.

> Reset the nodes (Foreman deploy them)

You should just use Foreman to do this; a self-scheduled environment functions exactly the same as a deliberately scheduled one, and this isn't really in scope for the API. Foreman is a complete third-party platform, and those actions are better handled there.

> Optionally allow skipping the bastion (the first host of the allocation)

Just re-provision via Foreman and skip this node. We would have no way of knowing what you chose to use as your bastion node anyway.

> Maintain the same cloud allocation number

You will keep your environment name, cloud # and everything.

I don't think this RFE is in scope for QUADS. We do have a Foreman library, but we do not want to operate Foreman from QUADS.

@sadsfae
Member

sadsfae commented Jan 7, 2025

@josecastillolema, to add here: we will be providing an RFE shortly to allow you to choose your OS, which may let you skip re-provisioning if you were only doing this to get a newer OS than the lab default, so long as that operating system is present in that QUADS Foreman. This would be an API option for self-scheduling.

#474

(WIP patchset)
https://review.gerrithub.io/c/redhat-performance/quads/+/1206450

@grafuls
Contributor

grafuls commented Jan 7, 2025

You can achieve this via the Foreman REST API like this:

1. Get the host id:

   ```
   curl -X GET -s -k -u admin:pass "https://foreman.example.com/api/hosts?search=name=host1.example.com" | jq .results[0].id
   ```

2. PUT the build parameter, passing the host id from the previous step to the endpoint:

   ```
   curl -X PUT -s -k -u admin:pass -H "content-type: application/json" -d '{"host": {"build": 1}}' https://foreman.example.com/api/hosts/36
   ```
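
Putting the two steps together, a minimal sketch (assuming `jq` is installed; the Foreman URL and credentials are placeholders):

```
#!/usr/bin/env bash
# Look up the host id by name, then mark the host for build.
FOREMAN=https://foreman.example.com
HOST=host1.example.com

id=$(curl -s -k -u admin:pass \
    "$FOREMAN/api/hosts?search=name=$HOST" | jq '.results[0].id')

curl -s -k -u admin:pass -X PUT \
    -H "content-type: application/json" \
    -d '{"host": {"build": 1}}' \
    "$FOREMAN/api/hosts/$id"
```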

@sadsfae
Member

sadsfae commented Jan 10, 2025

@josecastillolema given the details above, and since choosing your Foreman OS got merged in development, we now have the ability for tenants to choose their Foreman-deployed OS: 189cd16

We'll be adding this to the self-service API via #563

It's not exactly what you're asking for here, but I think it removes the need to re-provision systems for a newer OS (e.g. EL9 versus an EL8 lab default) when you get systems.

We will close this RFE, as QUADS won't be performing Foreman actions beyond the initial provisioning workflows it already does with the QUADS Foreman library. Foreman already provides a robust API for what you're trying to do, and that's the best, most direct place to do things like marking a host for build.

@sadsfae sadsfae closed this as completed Jan 10, 2025
@github-project-automation github-project-automation bot moved this from To do to Done in QUADS 2.1 Series Jan 10, 2025
@josecastillolema
Author

josecastillolema commented Jan 11, 2025

Thanks for taking a look @sadsfae @grafuls
I trust your judgment regarding the QUADS roadmap, but I would like to provide more context on the ask:

  • The feature requested would be used in LTA CI allocations, not in short allocations nor in self-scheduled ones
  • We do not want custom OSes for this specific use case; we just need the default OS that most of our tooling expects (jetlag, etc.)
  • We need it, among other things, to be able to switch the environment from OCP deployments to:
    • RHOSO, which uses "normal" RHEL servers for compute nodes not part of the OCP clusters
    • Hybrid deployments with virtual machines in jetlag, which assumes a clean QUADS deployment as a prerequisite
    • Microshift
    • etc.

Re-provisioning the hosts via Foreman (programmatically from the CI) has proven to be extremely challenging. Not the API call to Foreman itself (that seems to be the easy part), but rebooting the server on the proper interface to pick up the PXE boot from Foreman and then re-establishing settings in a way that OCP deployments don't break afterwards. Some challenges we have experienced:

  • It's 100% hardware dependent; every time we try this on a new cloud we spend a lot of time trying to figure out the proper badfish_interfaces.yml file. This is what our last iteration looks like:

    ```
    foreman_r760_interfaces: NIC.PxeDevice.1-1,RAID.SL.3-2,Optical.iDRACVirtual.1-1
    foreman_r740xd_interfaces: NIC.PxeDevice.1-1,RAID.Slot.5-1,Optical.iDRACVirtual.1-1
    ```

It has worked sometimes on performancelab cloud31 but never on cloud19 without @QuantumPosix manually logging in on the server and doing some adjustments.
As you can see, the value for foreman_r740xd_interfaces (which is still not working, btw) differs from the one in the provided idrac_interfaces.yml, so we needed to come up with this one in a time-consuming trial-and-error process.

  • As we discussed in other issues, the default idrac_interfaces.yml is meant as a guide; it's not always going to be up-to-date in the repository. That leaves it up to us, from the CI, to deal with all the hardware specifics of every allocation.
  • @grafuls tried your approach here without success: https://privatebin.corp.redhat.com/
  • This is the other approach we have been trying from the CI (which doesn't work reliably either) and the corresponding doc that @radez has put together:

    ```
    hammer host update --name $i --operatingsystem "$FOREMAN_OS" --pxe-loader "Grub2 UEFI" --build 1
    badfish -H $i -u $USER -p $PWD -i ~/badfish_interfaces.yml -t foreman
    ```

About (maybe unrealistic 😁) expectations:

  • I would ideally expect the QUADS API to act as a layer between the CI and Foreman that abstracts the hardware specifics away from the CI:

    ```mermaid
    flowchart TD
        A[CI/Prow] --> B[QUADs]
        B --> C[Foreman]
    ```
  • I can't avoid feeling that we are duplicating work in the CI that's already implemented in QUADS. I may be totally wrong here (and please do correct me if that's the case), but, for example, QUADS should already know how to Foreman-deploy, e.g., performancelab's cloud19, right? What the proper order of the interfaces needs to be, etc. Can't we tap into this knowledge?

And finally about alternatives:

  • If the reset cloud allocation API is not under consideration, could we consider a QUADS API call to Foreman-rebuild one specific host? Not only requesting the Foreman deploy, but also rebooting the host to be Foreman-deployed (or whatever other way QUADS handles this during the provisioning of new allocations).

Looking forward to discussing this further.

cc @kambiz-aghaiepour @jtaleric @smalleni

@sadsfae
Member

sadsfae commented Jan 11, 2025

> Thanks for taking a look @sadsfae @grafuls. I trust your judgment regarding the QUADS roadmap, but I would like to provide more context on the ask:

@josecastillolema Thanks for such a detailed response and some great information here, I'll try to respond in-line below:

> * The feature requested would be used in LTA CI allocations, not in short allocations nor in self-scheduled ones
>
> * We do not want custom OSes for this specific use case; we just need the default OS that most of our tooling expects (jetlag, etc.)
>
> * We need it, among other things, to be able to switch the environment from OCP deployments to:
>
>   * RHOSO, which uses "normal" RHEL servers for compute nodes not part of the OCP clusters
>   * Hybrid deployments with virtual machines in jetlag, which assumes a clean QUADS deployment as a prerequisite
>   * Microshift
>   * etc.

> Re-provisioning the hosts via Foreman (programmatically from the CI) has proven to be extremely challenging. Not the API call to Foreman itself (that seems to be the easy part), but rebooting the server on the proper interface to pick up the PXE boot from Foreman and then re-establishing settings in a way that OCP deployments don't break afterwards. Some challenges we have experienced:

> * It's 100% hardware dependent; every time we try this on a new cloud we spend a lot of time trying to figure out the proper **_badfish_interfaces.yml_** file. This is what our last iteration looks like:

This is just life with bare metal; it's always a difficult challenge when you are at the mercy of so many vendors, firmware versions and hardware configurations. Things get especially interesting when complex application stacks also want to do varied things to the hardware.

idrac_interfaces.yml will never ship right out of Badfish accounting for every configuration, lab or need. The overrides go a long way, but we also don't know what we don't know. The fact that you're including ISO media in your boot interface strings is one example. idrac_interfaces.yml is designed only as a guide, meant for you to modify and get right according to your needs.

> ```
> foreman_r760_interfaces: NIC.PxeDevice.1-1,RAID.SL.3-2,Optical.iDRACVirtual.1-1
> foreman_r740xd_interfaces: NIC.PxeDevice.1-1,RAID.Slot.5-1,Optical.iDRACVirtual.1-1
> ```

> It has worked sometimes on performancelab cloud31 but never on cloud19 without @QuantumPosix manually logging in on the server and doing some adjustments. As you can see, the value for foreman_r740xd_interfaces (which is still not working, btw) differs from the one in the provided idrac_interfaces.yml, so we needed to come up with this one in a time-consuming trial-and-error process.

This seems like a hardware / lab configuration challenge. If it works reliably in one place and not the other, it's likely not something we can solve programmatically; there are just intrinsic differences between environments.

Let's take the RDU3 Performance lab for example: there are 6 different R740XD models and at least 3 different R750 models. This is by design, some of it historical, but mainly because each model has a different hardware design that introduces device-name-changing attributes due to slot placements, different mainboards, different components and so on.

The best we can hope for here is to abstract enough of these differences away so things can be installed and operated in a repeatable and reliable fashion. Customers don't have it any easier: they may have even more variety in their fleet, likely don't have something useful like Jetlag or QUADS, and fight a hard fight against the changing landscape of application installers to boot.

The onus to sort this out is still going to be on your own automation and what model(s) you receive; the best we can provide here is a reliable designator that maps to XY hardware or XZ hardware config. We can build on what the API provides to make this easier and more turnkey, but there's already a lot provided too.

In QUADS 2.x and above, hardware differences can be filtered via the models filter. We can work with you to augment or extend however you're sorting this to get the right idrac_interfaces.yml in a more manageable fashion, but it's simply going to come down to using the API and filtering in a way where intrinsic hardware differences among the same major server model don't matter.

For example, you can filter based on model in RDU3 performance lab:

https://github.com/redhat-performance/quads/blob/latest/docs/quads-host-metadata-search.md#example-hardware-filter-searches

There are a number of models there; each corresponds to a sub-variety of a major server model. Within these sub-models the hosts should be identical, so an idrac_interfaces.yml generated from this data should be correct.

```
R740XD-SL-N
R740XD-SL-G
R740XD-SL-U
R740XD-CL-N
R740XD-CL-G
R740XD-CL-U
R750-IL-N
R750-IL-G
R750-IL-U
```

If the model(s) you're using are not among those, you'll need to work with the admin of that lab (Chris) to get them added and made queryable. Go to the QUADS wiki for that lab and click on "Available" to also get a search by model.

The lab has to guarantee that there are no differences between the model QUADS refers to and the actual hardware configuration (or else another distinct sub-model needs to be added). It's on the hardware provider/lab to ensure this, because you need something reliable and repeatable that can be leveraged to describe differences so your automation can work.
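
As an illustration, pulling the host list for one sub-model could look like this (a sketch following the filter syntax in the metadata search doc linked above):

```
# List all hosts matching a single sub-model so an idrac_interfaces.yml
# can be generated per homogeneous group.
quads --ls-hosts --filter "model==R740XD-SL-N"
```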

The second big thing here is how complex application stacks like OCP or OSP want to interact with, fiddle with, or otherwise manipulate the BIOS boot order. This is also a changing landscape we have no control over; it's something your automation needs to handle, something we need to provide feedback to installer teams about, and something to track and address if things change.

> * As we discussed in other issues, the default **_idrac_interfaces.yml_** is meant as a guide; it's not always going to be up-to-date in the repository. That leaves it up to us, from the CI, to deal with all the hardware specifics of every allocation.

No, it will not for every case, but I think by utilizing things like the model filter in the API you can do this fairly easily. We ship the most common boot strings used, but can't account for all differences. Overrides go a long way, but we can't know what we don't know, either.

> * @grafuls tried your approach here without success: https://privatebin.corp.redhat.com/
>
> * This is [the other approach we have been trying from the CI](https://github.com/openshift/release/pull/57723) (which doesn't work reliably either) and the corresponding [doc](https://docs.google.com/document/d/1HyZZZih9nZT6mnXuTW_2Y4D_Qlx5gsadVCHOI4SSkc4/edit?tab=t.0) that @radez has put together:
>
>   ```
>   hammer host update --name $i --operatingsystem "$FOREMAN_OS" --pxe-loader "Grub2 UEFI" --build 1
>   badfish -H $i -u $USER -p $PWD -i ~/badfish_interfaces.yml -t foreman
>   ```

I'd have to look more at what you're trying to do here, but you shouldn't need to modify any of the PXE loader settings; perhaps this model isn't integrated correctly and needs more steps or changes in Foreman. I don't see R760 models listed at all on the RDU3 Performance Lab page. Are these models assigned to some people and not generally schedulable? Only Chris manages this lab, so we'd have to take a look with him to see if this is the case. Have you tried the same thing without modifying the PXE loader settings in the Scale Lab on R660s? They are also EFI by default and we can lend you some to test. This reads more like a lab / Foreman / configuration issue and nothing to do with QUADS or even Badfish. Let's take a look and chat about it internally.

> About (maybe unrealistic 😁) expectations:

> * I would ideally expect the QUADS API to act as a layer between the CI and Foreman that abstracts the hardware specifics away from the CI:

Sure. QUADS already does this; what kind of hardware details are you looking for that aren't provided already?

https://github.com/redhat-performance/quads/blob/latest/docs/quads-host-metadata-search.md#querying-host-information

We have a flexible, extensible and growing metadata model; we can add almost anything and keep it in QUADS for each individual server so the API can query it: https://github.com/redhat-performance/quads/blob/latest/docs/quads-host-metadata-search.md#how-to-import-host-metadata. Adding other hardware details to filter on is a great RFE we'd love to tackle.
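
Something along these lines, for example (a hypothetical sketch; the exact endpoint and parameters are documented in the links above and may differ):

```
# Query QUADS for hosts of a given model and pretty-print the metadata.
curl -s "https://quads.example.com/api/v3/hosts?model=R740XD-SL-N" | jq .
```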

> ```mermaid
> flowchart TD
>     A[CI/Prow] --> B[QUADs]
>     B --> C[Foreman]
> ```
> * I can't avoid feeling that we are duplicating work in the CI that's already implemented in QUADS. I may be totally wrong here (and please do correct me if that's the case), but, for example, QUADS should already know how to Foreman-deploy, e.g., performancelab's cloud19, right? What the proper order of the interfaces needs to be, etc. Can't we tap into this knowledge?

The QUADS Foreman library only toggles systems for build, sets the OS, sets a few host-level parameters, and manages Foreman RBAC for cloud user access; with a more recent RFE it will let you set a default OS per cloud: 189cd16

The technical reason this isn't possible is that QUADS has no RBAC / admin-token API-level ability for individual cloud users to do these things outside of self-scheduling, where it is token/bearer based (because it has to be).

Deliberately scheduled assignment RBAC and access are handled within Foreman (e.g. cloud02, cloud03, cloud04 users, views, permissions and so on). Foreman is a lifecycle management platform for systems, and thus handles access to them and their administration (rebuilding, installing another OS, etc.).

QUADS is not a provisioner; it's a scheduler first and foremost that calls out to other provisioners to do the things they do best. Right now that's Foreman, for a lot of good reasons, but it's not limited to that in the future (think AWS, hybrid cloud, etc.). From a design-principle perspective, we would not want to overlap with the mature, robust RESTful API that Foreman already provides for this or any number of things in this category.

We do not want to handle RBAC in two places or act as some kind of API proxy / translation layer to Foreman either; that's just beyond the scope of what QUADS does right now. I could foresee moving cloud/environment-based RBAC to QUADS in the future as things evolve (or syncing it with Foreman), but it's handled well by Foreman right now for tenant machine operations, and it would be a complex undertaking to redesign it. I hope this explains our approach better.

> And finally about alternatives:

> * If the reset cloud allocation API is not under consideration, could we consider a QUADS API call to Foreman-rebuild one specific host? Not only requesting the Foreman deploy, but also rebooting the host to be Foreman-deployed (or whatever other way QUADS handles this during the provisioning of new allocations).

Without cloud-user RBAC in QUADS, so far as rebooting systems goes, you have plenty of other ways to do this directly. You can power cycle the system through the Foreman API, through curl, through badfish, ipmitool, even the Python redfish library, through the Ansible uri module, through sushy, through native Python with urllib3, and likely others. I don't think talking to an API to talk to an API is a sustainable design here, not when RBAC is handled at the Foreman level anyway for IPMI/OOB.
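
Two of those direct avenues as a sketch (management hostnames and credentials are placeholders):

```
# Power cycle over IPMI:
ipmitool -I lanplus -H mgmt-host1.example.com -U "$USER" -P "$PWD" chassis power cycle

# Or the same action via badfish against the out-of-band interface:
badfish -H mgmt-host1.example.com -u "$USER" -p "$PWD" --power-cycle
```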

So far as marking for build goes, the same design principle applies: talk directly to any number of APIs via hammer, curl, urllib3, or the uri module in Ansible. Ansible has a trivial wait_for facility to depend on another web service's response and enact other automation, and there are half a dozen or more direct avenues for you to do this without relying on QUADS, beyond the duplication of RBAC we'd need to maintain to allow for that level of API POST/PUT (which would in turn just have to talk to the same service you would talk to directly).
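
A bash equivalent of that wait_for pattern is a simple polling loop (a sketch; it assumes the Foreman host object's build flag flips back to false once provisioning completes, and the host id and credentials are placeholders):

```
# Poll the Foreman API until the host is no longer marked for build.
until curl -s -k -u admin:pass "https://foreman.example.com/api/hosts/36" \
    | jq -e '.build == false' > /dev/null; do
    sleep 60
done
```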

Now, if you're doing at least one of these against the same source you should just do both to keep things simple.

Any feature that's already handled better by the Foreman API is just not something we'll likely implement and maintain in QUADS to effectively proxy for you while Foreman still manages machine lifecycle RBAC. If in the future we move our assignment RBAC services directly into QUADS this can change, and it might make sense to do it there, but that's not the case today. Foreman does an excellent job of handling systems lifecycle management and provisioning, has a full-featured RESTful API, and is already set up with granular RBAC for these needs, so it just makes sense that that's where you do it right now. It's also not uncommon or complex to expect a CI-driven workflow to talk to more than one API.

The caveat here is that with self-scheduled assignments we handle token/bearer auth in QUADS to authorize admin-level scope for only the systems within that temporary, self-scheduled assignment, to enable the API calls necessary to complete the assignment allocation workflow; systems management is still handled by Foreman like any other assignment. Similarly, when we start talking to public cloud provider APIs we'll likely abstract that in a similar fashion.

The best way to think about the lines of delineation between QUADS and any other infrastructure platform in play is the following:

  • QUADS: scheduling, hardware and assignment datasource, network/VLAN management, visualizations and inventory, notifications, reporting and capacity
  • Foreman: hardware lifecycle management, OS provisioning, another place to do power actions if you want, cloud environment RBAC
  • Badfish: one of many portable tools to manage Redfish, IPMI, boot interface, BIOS settings

> Looking forward to discussing this further.

> cc @kambiz-aghaiepour @jtaleric @smalleni

Likewise. Let me re-open this RFE; we will keep it open due to the useful information and discussion here. We can always change the title and scope, use it for further discussion, or carve out a related RFE from this.

@sadsfae sadsfae reopened this Jan 11, 2025
@github-project-automation github-project-automation bot moved this from Done to To Do: High Priority and Bugs in QUADS 2.1 Series Jan 11, 2025