Metrics and Reporting for LEAP Hub #1279
Comments
I don't fully know the extent of the current LEAP contract, but very most definitely
Since the goal is to avoid overrunning costs, I think a progressive approach would be:
This combines an alerting mechanism (1) with a way for us to investigate usage (2). Grafana already has useful mechanisms for tracking usage, and I think BigQuery has mechanisms for tracking cloud spend. I think 'down-to-the-cent-attribution' will always be difficult, and I want to focus more on answering the question 'how can we reduce anxiety about cloud overspend?' |
@rabernat GCP now allows us to set up budgets. We can set a monthly budget, and GCP will alert us (via email) as soon as the forecasted cost goes above that monthly number. Let's do that? https://cloud.google.com/billing/docs/how-to/budgets has more info. |
I also want to suggest that:
We instead focus on usage rather than cost attribution. As the goal is to figure out 'hey, who is using things so much that it is costing us a lot?', I think usage reports (as an extension of what we have in Grafana now) are the way to go. Cost attribution is super difficult and I think will take a lot more work. If the workflow is 'some human evaluates the cost per week and looks to see if some user is using too much!' then the user reporting should be enough. |
I've just played around with the budget tool on GCP and set a budget of $2k. It'll send an email alert whenever monthly spend:
Note that (1) need not necessarily be triggered before (2) - someone using up huge resources today that'll blow us over $2000 should still trigger that alert well before $1000 in actual spend |
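
To illustrate why a forecast-based alert can fire before an actual-spend alert, here is a minimal sketch of the threshold logic. It is not GCP's implementation. The two rules (actual spend at 50% of budget, forecasted spend at 100% of budget) are assumptions inferred from the $1000 and $2000 figures in the note above, and `AlertRule` and `triggered_alerts` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertRule:
    name: str
    fraction: float  # share of the monthly budget that triggers the alert
    basis: str       # "actual" or "forecast"


# Assumed rules; the real thresholds configured in GCP may differ.
RULES = [
    AlertRule("actual spend >= 50% of budget", 0.5, "actual"),
    AlertRule("forecasted spend >= 100% of budget", 1.0, "forecast"),
]


def triggered_alerts(budget, actual, forecast, rules=RULES):
    """Return the names of the rules whose threshold has been reached."""
    fired = []
    for rule in rules:
        spend = actual if rule.basis == "actual" else forecast
        if spend >= rule.fraction * budget:
            fired.append(rule.name)
    return fired


# Early in the month: only $300 spent, but the current rate projects to $2600.
print(triggered_alerts(budget=2000, actual=300, forecast=2600))
```

In this example only the forecast rule fires, even though actual spend is still well below $1000.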
I've set it to email [email protected] and you, @rabernat. What number do you want this to be set at? |
Hi Yuvi, thanks for the quick and thoughtful response on this!
👍 This sounds like a great plan.
I think the $2000 threshold is a good place to start as we are spinning up. As usage increases, we can reassess. I accept that it is only practical to track usage, not cost on a per user and per group basis. However, I would still like to make it possible to have weekly and monthly usage reports, automatically generated or available via a dashboard, rather than only looking at usage when there is an incident. Does our current Grafana configuration allow that? |
@rabernat so with grafana I think we keep data for the last 6 months, and dashboards have variable time range - so you can definitely click a link and see 'last week', 'last month' usage reports. There is no emailing or similar process available, however. So let's track what exactly we want in a 'usage report', and make a grafana dashboard for it? I think 'per-group' would definitely need some work, while per-user already is tracked. |
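
As a rough sketch of the 'click a link' idea: Grafana dashboard URLs accept `from` and `to` query parameters with relative times such as `now-7d`, so fixed 'last week' and 'last month' links can be generated. The base URL and dashboard path below are placeholders, not the real LEAP dashboard, and `report_url` is a hypothetical helper.

```python
from urllib.parse import urlencode

# Placeholder values; substitute the real Grafana host and dashboard UID/slug.
GRAFANA_BASE = "https://grafana.example.org"
DASHBOARD_PATH = "/d/usage-report/usage-report"

# Relative time ranges understood by Grafana's time picker.
RANGES = {
    "last week": "now-7d",
    "last month": "now-30d",
}


def report_url(period, base=GRAFANA_BASE, path=DASHBOARD_PATH):
    """Build a dashboard link pinned to a relative time range."""
    query = urlencode({"from": RANGES[period], "to": "now"})
    return f"{base}{path}?{query}"


for period in RANGES:
    print(period, "->", report_url(period))
```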
👍 to this. What are the next steps? |
cc @GeorgianaElena who might be interested in pushing this conversation forward since she is looking at the monitoring topic right now... |
A few notes from a conversation w/ @yuvipanda and the team today:
So the two-step process here if there's an unexpected usage increase is:
Development for this process w/ LEAP is:
After that we should:
I appreciate all the efforts and thought that have been happening here. Can anyone provide a rough estimate of when we might expect a "usage report" dashboard for the LEAP hub? I would like to share this timeline with the LEAP leadership because they are very, very interested in this feature. Providing a timeline will prevent me from having to give updates about this at weekly meetings. |
There has been some progress here: #1310. Let me circle back with @GeorgianaElena and @yuvipanda about this one so I can give you a good estimate. |
We already have budgets, so what we want is a usage report. I think these are the following bar charts:
This should be enough to dig in and investigate if there's an unexpectedly large bill. The work that needs to happen is:
I'll check with @GeorgianaElena and @damianavila to see how we can prioritize this! |
Small update about this one:
⬆️ these have been deployed to the central 2i2c grafana and the leap cluster one. Not sure about what the best way of sharing them would be? (cc @choldgraf and @yuvipanda) |
Is there a way that we can expose these charts to only certain community leaders based on their username in the hub? Eg, any hub administrator can also see the grafana usage plots for that hub? Otherwise, could we generate these reports as PDFs on the fly to share manually? |
No, we don't yet have the infra to do this. I've opened #1437, which could be a step in this direction.
Unfortunately exporting dashboards is a grafana enterprise feature https://grafana.com/docs/grafana/latest/enterprise/export-pdf, we have these export options https://grafana.com/docs/grafana/latest/sharing/share-dashboard/ |
Depending on price, I think this could be a good investment given how important this will be to our billing/invoicing process, and only the central grafana would need to be on the enterprise plan. We could recoup the subscription cost in our overheads maybe? |
I agree, definitely worth checking out. @GeorgianaElena would you like to play around with this feature? If so, and depending on how much the enterprise version costs, we could buy a license and see if it is useful. (Via the 2i2c expensify card) If that doesn't work, I'm wondering if we could use playwright to automate PDF generation of those reports. If it is possible then this could be a nice way to have more customizability over the output (and at least basic PDF generation via playwright is not that hard) |
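
For reference, a minimal sketch of what the Playwright option could look like, using Playwright's Python sync API and its `page.pdf()` method (which works in headless Chromium). This is an untested outline, not an agreed design: it does not handle Grafana login, and `report_pdf_name` and `export_dashboard_pdf` are hypothetical names.

```python
from datetime import date


def report_pdf_name(dashboard_slug: str, week_start: date) -> str:
    """Build a predictable file name for a weekly report."""
    return f"{dashboard_slug}-{week_start.isoformat()}.pdf"


def export_dashboard_pdf(url: str, out_path: str) -> None:
    """Render a dashboard URL to PDF with headless Chromium.

    Assumes the URL is reachable without an interactive login.
    """
    # Imported here so the helper above works without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        # Wait for network activity to settle so panels have time to load.
        page.goto(url, wait_until="networkidle")
        page.pdf(path=out_path, format="A4", landscape=True, print_background=True)
        browser.close()


print(report_pdf_name("usage-report", date(2022, 6, 6)))
```

Running `export_dashboard_pdf` requires `pip install playwright` followed by `playwright install chromium`.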
IMO, installing and maintaining licenses for an enterprise version on our cluster is a lot of work and brings something proprietary into our cluster (vs a service we use). We could try paying for a hosted version. Given that the goal of this is to investigate users who might be overusing the cluster, my suggestion is we set up GitHub auth for the leap grafana and call it a day. |
The leap grafana, running at https://grafana.leap.2i2c.cloud/dashboards, now has GitHub auth enabled. So everyone in the
I believe this was the last piece that was missing, so I will close this issue as completed. Thanks everyone! |
I was finally able to log in and see the dashboard. The "Usage Report" dashboard seems great! 🚀 This gives me an overall view of the amount of usage. Some additional information I would like to be able to see
@rabernat, I believe the screenshot above is from the usage dashboard and there should be another one called usage report. |
Context
In #1050 (comment) I listed several important requirements for the new LEAP hub. That issue has been closed with some of those points unresolved, so I am raising this issue to continue tracking.
I cannot overemphasize how concerned the LEAP leadership are about runaway cloud costs. 2i2c is holding all of our Google Credits, which means we are placing a huge amount of trust in 2i2c to steward this resource on behalf of our project. LEAP leadership is concerned with the following scenarios:
We need a plan to address these concerns before LEAP leadership are comfortable opening up the hub to users. Therefore, this issue has the most urgent priority for LEAP, above all other ongoing technical development.
Proposal
I propose that 2i2c build a reporting system to deliver the following information in the form of a weekly email and / or interactive dashboard.
For each of the following breakdowns:
Report the following information:
For the following periods
A LEAP person (probably me to start) will review this report on a weekly basis to ensure that no anomalous costs are occurring. If anomalous costs are found, these reports will allow us to trace them to specific users and intervene.
Additionally, we would like to request a standing monthly meeting between LEAP and a 2i2c representative to review costs.
I understand that these reports are not currently part of 2i2c services. I see three possible ways forward:
It would be great if someone could identify which of these pathways is most likely. I need to convey a response to the LEAP leadership asap.
Linked issues:
Updates and actions
No response