Add the steps to reboot the computes after update. #2587

Open
wants to merge 1 commit into base: main

sathlan
Contributor

@sathlan sathlan commented Dec 5, 2024

This sequence implements reboot of the compute nodes after the
update. By default it's not run and cifmw_update_reboot_test must be
set to true to activate it.
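
As a minimal sketch of how the reboot sequence could be switched on, the variable can be set in whatever variable file the job already loads. The file name in the comment is hypothetical; only `cifmw_update_reboot_test` comes from this PR.

```yaml
# Hypothetical extra-vars file, for example passed with `-e @reboot-test.yml`.
# Only the variable name is taken from the PR description.
cifmw_update_reboot_test: true
```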

We have one instance created. If the hypervisor being rebooted hosts
that instance, the instance is live-migrated to another hypervisor
before the reboot and migrated back to the original hypervisor after
the reboot.
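
The migrate-away and migrate-back flow could be sketched roughly as below. This is an illustration only, not the tasks from the PR: `_instance_name`, `_target_hypervisor` and `_original_hypervisor` are hypothetical variables, and it assumes `cifmw_update_openstack_cmd` expands to a command line ending in `openstack`, as in the `set_fact` shown later in the review. It also assumes the cloud accepts compute API microversion 2.30, which the openstack client needs in order to combine `--host` with `--live-migration`.

```yaml
# Hypothetical variables: _instance_name, _target_hypervisor, _original_hypervisor.
- name: Live-migrate the instance away before the reboot
  ansible.builtin.command: >-
    {{ cifmw_update_openstack_cmd }} --os-compute-api-version 2.30
    server migrate --live-migration --host {{ _target_hypervisor }}
    --wait {{ _instance_name }}

# The hypervisor reboot and the post-reboot sanity checks would run here.

- name: Live-migrate the instance back to its original hypervisor
  ansible.builtin.command: >-
    {{ cifmw_update_openstack_cmd }} --os-compute-api-version 2.30
    server migrate --live-migration --host {{ _original_hypervisor }}
    --wait {{ _instance_name }}
```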

Some basic sanity checks are performed after the reboot and before the
migration back to ensure that the necessary services are up and
running.

During the reboot we start two scripts. One monitors and logs the
reboot of the hypervisors. The other logs where the instance is
currently running. The log files can be found in
~/ci-framework-data/tests/update/ in monitor_servers.log and
monitor_vm_placement.log respectively.

A note about node evacuation. We are still using node evacuation from
the nova cli. This command has not been ported to the openstack
cli. There's a discussion about it on launchpad
(https://bugs.launchpad.net/python-openstackclient/+bug/2055552).

Also, we do the evacuation only if there is more than one hypervisor
available. When only one compute is available we stop the instance and,
after the reboot, we just restart it.
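
The branching described above could look something like the following sketch. It is not the code from this PR: `_instance_name` and the included file name are hypothetical, and it assumes that `cifmw_update_openstack_cmd` returns clean JSON on stdout.

```yaml
- name: List the hypervisors known to nova
  ansible.builtin.command: >-
    {{ cifmw_update_openstack_cmd }} hypervisor list -f json
  register: _hypervisor_list
  changed_when: false

- name: Count the hypervisors
  ansible.builtin.set_fact:
    _hypervisor_count: "{{ _hypervisor_list.stdout | from_json | length }}"

# Hypothetical task file holding the nova-cli based evacuation.
- name: Evacuate the hypervisor when another one can take the instance
  ansible.builtin.include_tasks: evacuate_hypervisor.yml
  when: _hypervisor_count | int > 1

- name: Stop the instance when this is the only hypervisor
  ansible.builtin.command: >-
    {{ cifmw_update_openstack_cmd }} server stop {{ _instance_name }}
  when: _hypervisor_count | int == 1
```

After the reboot, the single-hypervisor branch would run the matching `server start` on the same instance.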

The official documentation mentions only the live-migration path, but
as we also use live-migration in the test sequence, that part is
covered. We still expect customers to use the nova cli as it's way
more user friendly and still works.

Closes: https://issues.redhat.com/browse/OSPRH-8937

@github-actions github-actions bot marked this pull request as draft December 5, 2024 10:25

github-actions bot commented Dec 5, 2024

Thanks for the PR! ❤️
I'm marking it as a draft. Once you're happy with it merging and the PR is passing CI, click the "Ready for review" button below.

@sathlan
Contributor Author

sathlan commented Dec 5, 2024

Currently tested with a ping test running in the background; no loss of connectivity was found.

Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/4f12a60d8dc64dad92d4f7b5f5bec990

✔️ openstack-k8s-operators-content-provider SUCCESS in 1h 40m 14s
✔️ podified-multinode-edpm-deployment-crc SUCCESS in 1h 19m 06s
✔️ cifmw-crc-podified-edpm-baremetal SUCCESS in 1h 28m 19s
✔️ noop SUCCESS in 0s
✔️ cifmw-pod-ansible-test SUCCESS in 8m 44s
cifmw-pod-pre-commit FAILURE in 8m 14s
✔️ build-push-container-cifmw-client SUCCESS in 21m 46s
cifmw-molecule-update FAILURE in 5m 04s

Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/bf06493eaeea430c898bf25520dfdd04

✔️ openstack-k8s-operators-content-provider SUCCESS in 1h 44m 32s
✔️ podified-multinode-edpm-deployment-crc SUCCESS in 1h 15m 44s
✔️ cifmw-crc-podified-edpm-baremetal SUCCESS in 1h 31m 55s
✔️ noop SUCCESS in 0s
✔️ cifmw-pod-ansible-test SUCCESS in 7m 45s
cifmw-pod-pre-commit FAILURE in 7m 28s
✔️ build-push-container-cifmw-client SUCCESS in 21m 20s
cifmw-molecule-update FAILURE in 5m 20s

Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/caf50b2e762a4aaeb3326c9399dffd15

✔️ openstack-k8s-operators-content-provider SUCCESS in 2h 21m 02s
✔️ podified-multinode-edpm-deployment-crc SUCCESS in 1h 18m 47s
✔️ cifmw-crc-podified-edpm-baremetal SUCCESS in 1h 28m 48s
✔️ noop SUCCESS in 0s
✔️ cifmw-pod-ansible-test SUCCESS in 8m 43s
cifmw-pod-pre-commit FAILURE in 7m 37s
✔️ build-push-container-cifmw-client SUCCESS in 36m 42s
cifmw-molecule-update FAILURE in 4m 29s

@sathlan sathlan force-pushed the update-reboot branch 11 times, most recently from c056d3c to 85e4367 Compare December 18, 2024 14:14
@sathlan sathlan added enhancement New feature or request and removed do-not-merge/work-in-progress labels Dec 19, 2024
@sathlan sathlan marked this pull request as ready for review December 19, 2024 08:45
@sathlan sathlan requested a review from a team as a code owner December 19, 2024 08:45
  PATH: "{{ cifmw_path | default(ansible_env.PATH) }}"
ansible.builtin.command: >-
  {{ cifmw_update_oc_cmd_prefix }}
  get openstackdataplanedeployment {{ cifmw_reboot_dep_name }}
Contributor

@ciecierski ciecierski Dec 19, 2024

Suggested change
- get openstackdataplanedeployment {{ cifmw_reboot_dep_name }}
+ wait openstackdataplanedeployment {{ cifmw_reboot_dep_name }}
+ --for=condition=ready
+ --timeout={{ cifmw_update_timeout_reboot }}m

With oc wait, the Ansible log is more readable, as no retries are logged to the output.
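
Put together, the suggestion would give a task along these lines. This is a sketch assembled from the fragments visible in this thread: the task name and the `environment:` key are assumptions, and `cifmw_update_oc_cmd_prefix` is assumed to expand to an `oc` invocation that a verb can follow directly.

```yaml
- name: Wait for the reboot deployment to become ready
  environment:
    PATH: "{{ cifmw_path | default(ansible_env.PATH) }}"
  ansible.builtin.command: >-
    {{ cifmw_update_oc_cmd_prefix }}
    wait openstackdataplanedeployment {{ cifmw_reboot_dep_name }}
    --for=condition=ready
    --timeout={{ cifmw_update_timeout_reboot }}m
  # Waiting does not change anything on the cluster.
  changed_when: false
```

Because `oc wait` blocks until the condition is met or the timeout expires, the task needs no `until`/`retries` loop, which is what keeps the output short.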

Contributor Author

Nice catch. Done.

edpm_reboot_strategy: force
ansibleLimit: {{ cifmw_update_hypervisor_short_name }}

- name: Apply the OpenStackDataPlaneDeployment CR to trigger a reboot
Contributor

Suggested change
- - name: Apply the OpenStackDataPlaneDeployment CR to trigger a reboot
+ - name: Create the OpenStackDataPlaneDeployment CR to trigger a reboot

Contributor Author

Done.

@jistr
Contributor

jistr commented Jan 9, 2025

/lgtm

Collaborator

@pablintino pablintino left a comment

I've added some minor suggestions.

- name: Define command for OpenStack client interactions
  ansible.builtin.set_fact:
    cifmw_update_openstack_cmd: >-
      oc rsh -n {{ cifmw_update_namespace }} openstackclient openstack
Collaborator

This can be a variable, not a default in the vars/main.yaml file.

Contributor Author

Not sure about that one. I've moved that definition to defaults/main.yaml as it's used many times in the reboot sequence. Let me know if that's what you had in mind.

Collaborator

That's fine :)
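
For reference, the move discussed in this thread would amount to something like the sketch below. The file path is an assumption based on the usual Ansible role layout, and the consumer task in the comments is purely illustrative.

```yaml
# roles/update/defaults/main.yml (assumed location of the role defaults)
cifmw_update_openstack_cmd: >-
  oc rsh -n {{ cifmw_update_namespace }} openstackclient openstack

# Hypothetical consumer task reusing the prefix:
# - name: List the compute services
#   ansible.builtin.command: "{{ cifmw_update_openstack_cmd }} compute service list"
```

Defining the prefix as a role default rather than with `set_fact` lets every task file in the reboot sequence use it without depending on task ordering, and lets a job override it.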

roles/update/tasks/reboot_computes.yml (resolved)
roles/update/tasks/reboot_hypervisor_using_cr.yml (outdated, resolved)
roles/update/tasks/reboot_hypervisor_using_cr.yml (outdated, resolved)
- name: Create the OpenStackDataPlaneDeployment CR used for reboot
  ansible.builtin.copy:
    dest: "{{ cifmw_update_artifacts_basedir }}/{{ cifmw_reboot_dep_name }}.yaml"
    content: |
Collaborator

Building YAML (or JSON) with plain string manipulation is asking for trouble. Why not use to_nice_yaml, like this?

vars:
  _content:
    apiVersion: dataplane.openstack.org/v1beta1
    kind: OpenStackDataPlaneDeployment
    metadata:
      name: "{{ cifmw_reboot_dep_name }}"
      namespace: "{{ cifmw_update_namespace }}"
    spec:
      nodeSets: "{{ cifmw_update_node_sets.stdout | split('\n') }}"
      servicesOverride:
        - reboot-os
      ansibleExtraVars:
        edpm_reboot_strategy: force
      # A Jinja expression must be quoted to be valid YAML.
      ansibleLimit: "{{ cifmw_update_hypervisor_short_name }}"
ansible.builtin.copy:
  dest: "{{ cifmw_update_artifacts_basedir }}/{{ cifmw_reboot_dep_name }}.yaml"
  content: "{{ _content | to_nice_yaml }}"

Contributor Author

Oki, this one is nice as well. Processing was a little more involved, but the version I've got should be working.

Contributor

openshift-ci bot commented Jan 13, 2025

New changes are detected. LGTM label has been removed.

@pablintino
Collaborator

/approve

Contributor

openshift-ci bot commented Jan 13, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: pablintino

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/b30a7822761f41f3890d3b541e5e5bd6

✔️ openstack-k8s-operators-content-provider SUCCESS in 1h 42m 25s
✔️ podified-multinode-edpm-deployment-crc SUCCESS in 1h 17m 09s
✔️ cifmw-crc-podified-edpm-baremetal SUCCESS in 1h 30m 30s
✔️ noop SUCCESS in 0s
✔️ cifmw-pod-ansible-test SUCCESS in 7m 16s
cifmw-pod-pre-commit FAILURE in 6m 30s
✔️ build-push-container-cifmw-client SUCCESS in 37m 11s
✔️ cifmw-molecule-update SUCCESS in 5m 14s

@sathlan sathlan force-pushed the update-reboot branch 2 times, most recently from 5b95216 to 0217325 Compare January 13, 2025 14:36

Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/87194aa88e3d4b5e90652416f40b57cb

✔️ openstack-k8s-operators-content-provider SUCCESS in 1h 44m 32s
✔️ podified-multinode-edpm-deployment-crc SUCCESS in 1h 18m 33s
✔️ cifmw-crc-podified-edpm-baremetal SUCCESS in 1h 22m 16s
✔️ noop SUCCESS in 0s
✔️ cifmw-pod-ansible-test SUCCESS in 8m 00s
cifmw-pod-pre-commit FAILURE in 7m 41s
✔️ build-push-container-cifmw-client SUCCESS in 36m 36s
✔️ cifmw-molecule-update SUCCESS in 5m 23s

Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/0030f67185a246beaa6641ac614816e3

✔️ openstack-k8s-operators-content-provider SUCCESS in 1h 43m 19s
podified-multinode-edpm-deployment-crc POST_FAILURE in 1h 14m 15s
✔️ cifmw-crc-podified-edpm-baremetal SUCCESS in 1h 29m 16s
✔️ noop SUCCESS in 0s
✔️ cifmw-pod-ansible-test SUCCESS in 8m 52s
✔️ cifmw-pod-pre-commit SUCCESS in 7m 58s
✔️ build-push-container-cifmw-client SUCCESS in 22m 17s
✔️ cifmw-molecule-update SUCCESS in 5m 35s

Labels
approved enhancement New feature or request