Skip to content
This repository has been archived by the owner on Aug 9, 2024. It is now read-only.

feat: add metrics collection using Prometheus #29

Merged
merged 5 commits into from
Jul 29, 2024

Conversation

jedel1043
Copy link
Contributor

@jedel1043 jedel1043 commented Jul 11, 2024

Description

This change allows integrating the charm with the Canonical Observability Stack. Using Rivos Inc's Prometheus exporter, we can collect metrics from the controller, which can then be sent to a subordinate grafana-agent that can forward the metrics to the COS on Kubernetes using a cross-model, cross-cloud relation.

Still WIP:

  • Tutorial on how to deploy the observability stack.
  • Tests to check that everything deploys correctly.

How was the code tested?

Tested locally (Ubuntu Noble) in a hybrid-cloud environment on Juju (Microk8s + LXD).

Checklist

  • I am the author of these changes, or I have the rights to submit them.
  • I have added the relevant changes to the README and/or documentation.
  • I have self reviewed my own code.
  • All requested changes and/or review comments have been resolved.

@jedel1043 jedel1043 changed the base branch from main to experimental July 11, 2024 22:13
@NucciTheBoss
Copy link
Member

Need to rebase experimental with main

Copy link
Member

@NucciTheBoss NucciTheBoss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jedel1043 can you rebase this branch with the latest changes from the experimental branch? You might need to resolve some conflicts.

I'm going to start working on putting together some test bundles that we can use within our documentation!

@jedel1043
Copy link
Contributor Author

@jedel1043 can you rebase this branch with the latest changes from the experimental branch? You might need to resolve some conflicts.

I'll do that after the CI is in a good state (probably after #31 gets merged).

@jedel1043 jedel1043 force-pushed the cos-integration branch 2 times, most recently from c17d519 to 671912a Compare July 19, 2024 23:23
@jedel1043
Copy link
Contributor Author

Moment of truth...

@jedel1043 jedel1043 marked this pull request as ready for review July 26, 2024 21:24
@NucciTheBoss
Copy link
Member

How was the code tested?

WIP

@jedel1043 could you update this section of the PR message to indicate how you verified that COS was working in a test deployment?

@NucciTheBoss NucciTheBoss self-requested a review July 26, 2024 21:46
Copy link
Member

@NucciTheBoss NucciTheBoss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Honestly, I'm surprised how painless the COS integration itself was into the charm. Most of the fun came from just getting the exporter supported 😅

Just two small nits I'd like you to address, and after that I'm good to ship 🚢

src/cos/grafana_dashboards/Slurm-overview.json Outdated Show resolved Hide resolved
requirements.txt Outdated Show resolved Hide resolved
@@ -112,6 +120,8 @@ def _on_install(self, event: InstallEvent) -> None:
self._legacy_manager.write_jwt_rsa(jwt_rsa)

self._slurmctld.enable()
self._slurmctld.munge.enable()
self._slurmctld.exporter.enable()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[foresight]: we'll want to make the port the prometheus exporter uses configurable in the future as we'll potentially have multiple exporters running per-node. We should likely address this after we have a better idea of what exporters we'll need to have installed on each of the nodes.

Enables integrating the charm with Grafana Agent, which can proxy the observability data to an external K8s cloud.
Implements some default alerts and adds a prototype Grafana dashboard.
Copy link
Member

@NucciTheBoss NucciTheBoss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 🚀

@NucciTheBoss NucciTheBoss merged commit bc56f05 into charmed-hpc:experimental Jul 29, 2024
5 checks passed
@jedel1043 jedel1043 deleted the cos-integration branch July 29, 2024 15:05
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants