Context
In a recent incident, some cloud infrastructure was running in the background that our Grafana dashboards did not track (because it was on an old cluster).
We have an issue to track using Grafana for cloud provider alerting (2i2c-org/infrastructure#1288). However, that would not have caught this problem, because the infrastructure was outside of Grafana's scope.
Each cloud provider also tends to provide its own billing monitoring and alerting infrastructure. For example, you can trigger emails or warnings at certain spend levels, and you can even automatically trigger some actions like cluster shutdown. We may also be able to automate this process.
One of the biggest concerns that researchers have with cloud is the "hidden and ballooning costs" problem, so we need to do whatever we can to reduce this uncertainty for others.
Proposal
For each of the clusters that we deploy, we should also use the cloud provider's cost management and alerting system, in order to warn us when unexpected amounts of spending occur. We can define the specific rules in collaboration with Community Representatives, but they could be something like:
Define the expected monthly cost given the estimated number of users.
Set up an alert for 100% higher than this expected amount.
(Optionally) Define a "shut down" point if we reach a really high threshold.
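The rules above could be sketched roughly as follows. This is only an illustration of the threshold logic, not a cloud provider integration: `BudgetRule`, `evaluate_spend`, the 5x shutdown multiplier, and the dollar figures are all hypothetical, and in practice the thresholds would be configured in each provider's own budget and alerting tools.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class BudgetRule:
    """Hypothetical per-cluster budget rule; all names here are illustrative."""

    # Expected monthly cost, estimated from the expected number of users.
    expected_monthly_cost: float
    # Alert at 100% higher than expected, i.e. 2x the expected cost.
    alert_multiplier: float = 2.0
    # Optional "shut down" point; None means no automatic shutdown.
    shutdown_multiplier: Optional[float] = None


def evaluate_spend(rule: BudgetRule, month_to_date_spend: float) -> str:
    """Return "ok", "alert", or "shutdown" for the current spend."""
    expected = rule.expected_monthly_cost
    if (
        rule.shutdown_multiplier is not None
        and month_to_date_spend >= expected * rule.shutdown_multiplier
    ):
        return "shutdown"
    if month_to_date_spend >= expected * rule.alert_multiplier:
        return "alert"
    return "ok"


# Assumed example values: $500/month expected, shutdown at 5x.
rule = BudgetRule(expected_monthly_cost=500.0, shutdown_multiplier=5.0)
print(evaluate_spend(rule, 400.0))
print(evaluate_spend(rule, 1000.0))
print(evaluate_spend(rule, 2500.0))
```

The actual multipliers would be agreed with Community Representatives for each cluster; the defaults here only mirror the example rules listed above.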
Updates and actions
No response