Metrics and Reporting for LEAP Hub #1279

Closed
rabernat opened this issue May 5, 2022 · 23 comments

@rabernat
Contributor

rabernat commented May 5, 2022

Context

In #1050 (comment) I listed several important requirements for the new LEAP hub. That issue has been closed with some of those points unresolved, so I am raising this issue to continue tracking.

I cannot overemphasize how concerned the LEAP leadership are about runaway cloud costs. 2i2c is holding all of our Google Credits, which means we are placing a huge amount of trust in 2i2c to steward this resource on behalf of our project. LEAP leadership is concerned with the following scenarios:

  • Illegitimate use of the hub (e.g. bitcoin mining)
  • Disproportionate resource use by individuals, leading to unexpected charges
  • A credit burn rate so fast that we would run out before the 5 years are up

We need a plan to address these concerns before LEAP leadership are comfortable opening up the hub to users. Therefore, this issue has the most urgent priority for LEAP, above all other ongoing technical development.

Proposal

I propose that 2i2c build a reporting system to deliver the following information in the form of a weekly email and / or interactive dashboard.

For each of the following breakdowns:

  • Project total
  • By GitHub group (group membership used to authorize LEAP users)
  • By individual users

Report the following information:

  • Total CPU compute cost consumed
  • Total GPU compute cost consumed
  • Total storage cost consumed

For the following periods:

  • weekly total
  • monthly total
  • net total to date

A LEAP person (probably me to start) will review this report on a weekly basis to ensure that no anomalous costs are occurring. If anomalous costs are found, these reports will allow us to trace them to specific users and intervene.

Additionally, we would like to request a standing monthly meeting between LEAP and a 2i2c representative to review costs.


I understand that these reports are not currently part of 2i2c services. I see three possible ways forward:

  • You can start building this reporting system now without any additional overhead.
  • You could build this reporting system if we pay more money.
  • You will never build this reporting system.

It would be great if someone could identify which of these pathways is most likely. I need to convey a response to the LEAP leadership asap.


Linked issues:

Updates and actions

No response

@yuvipanda
Member

I don't fully know the extent of the current LEAP contract, but most definitely this option:

You could build this reporting system if we pay more money.

@yuvipanda
Member

Since the goal is to avoid overrunning costs, I think a progressive approach would be:

  1. Figure out a weekly report that just lists total cloud spend that week
  2. Have tools in place that let us investigate usage when we need to.

This combines an alerting mechanism (1) with a way for us to investigate usage (2).

Grafana already has useful mechanisms for tracking usage, and I think BigQuery has mechanisms for tracking cloud spend. I think 'down-to-the-cent-attribution' will always be difficult, and I want to focus more on answering the question 'how can we reduce anxiety about cloud overspend?'
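
As a rough illustration of step (1), here is a minimal sketch of a weekly spend total. It assumes billing rows are already available as dictionaries with `usage_start_time` and `cost` fields (modeled on the Cloud Billing export to BigQuery; the exact shape, the sample values, and the `weekly_spend` helper are assumptions for illustration, not anything 2i2c has deployed).

```python
from collections import defaultdict
from datetime import datetime


def weekly_spend(rows):
    """Sum cost per ISO (year, week) from billing-style rows.

    Each row is assumed to be a dict with an ISO 8601 `usage_start_time`
    string and a numeric `cost`; this shape is an assumption.
    """
    totals = defaultdict(float)
    for row in rows:
        start = datetime.fromisoformat(row["usage_start_time"])
        iso = start.isocalendar()
        totals[(iso[0], iso[1])] += row["cost"]
    # Sort by (year, week) and round to cents for display.
    return {key: round(value, 2) for key, value in sorted(totals.items())}


# Made-up sample data, purely illustrative.
sample_rows = [
    {"usage_start_time": "2022-05-02T10:00:00", "cost": 12.50},
    {"usage_start_time": "2022-05-04T09:00:00", "cost": 7.25},
    {"usage_start_time": "2022-05-10T15:30:00", "cost": 30.00},
]
report = weekly_spend(sample_rows)
for (year, week), cost in report.items():
    print(f"{year}-W{week:02d}: ${cost:.2f}")
```

A report like this only answers "how much did we spend this week?"; tracing the spend to users would still go through the usage dashboards.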

@yuvipanda
Member

@rabernat GCP now allows us to set up budgets, and we can set a monthly budget - and GCP will alert us (via email) as soon as forecasted cost goes above that monthly number. Let's do that? https://cloud.google.com/billing/docs/how-to/budgets has more info.

@yuvipanda
Member

I also want to suggest that, rather than this:

For each of the following breakdowns:

  • Project total
  • By GitHub group (group membership used to authorize LEAP users)
  • By individual users

Report the following information:

  • Total CPU compute cost consumed
  • Total GPU compute cost consumed
  • Total storage cost consumed

We instead focus on usage rather than cost attribution. As the goal is to figure out 'hey who is using things so much that it is costing us a lot of stuff?', I think usage reports (as an extension of what we have in Grafana now) are the way to go. Cost attribution is super difficult and I think will take a lot more work. If the workflow is 'some human evaluates the cost per week and looks to see if some user is using too much!' then the user reporting should be enough.

@yuvipanda
Member

I've just played around with the budget tool on GCP, and have it set to send alerts under the following conditions:

image

So I've set a budget of $2000, and it'll email whenever monthly spend:

  1. Crosses $1000, $1800 in actual spend
  2. Is forecast to cross $2000 by end of the month

Note that (1) need not necessarily be triggered before (2) - someone using up huge resources today that'll blow us over $2000 should still trigger that alert well before $1000 in actual spend
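
To make the interaction between the two rules concrete, here is a small sketch of the alert conditions described above. It is not how GCP evaluates budgets internally; the `triggered_alerts` function and its parameters are hypothetical, and only the dollar figures come from this thread.

```python
def triggered_alerts(actual, forecast, budget=2000.0,
                     actual_thresholds=(1000.0, 1800.0)):
    """Return the alerts that would fire for a given month.

    `actual` is spend so far this month, `forecast` is the projected
    end-of-month spend. Both rules are checked independently.
    """
    alerts = []
    for threshold in actual_thresholds:
        if actual >= threshold:
            alerts.append(f"actual spend crossed ${threshold:,.0f}")
    if forecast >= budget:
        alerts.append(f"forecast spend expected to cross ${budget:,.0f}")
    return alerts


# A sudden spike early in the month: little actual spend so far,
# but the projection already exceeds the budget.
print(triggered_alerts(actual=300.0, forecast=2500.0))

# Steady usage: one actual-spend threshold crossed, forecast under budget.
print(triggered_alerts(actual=1200.0, forecast=1500.0))
```

The first call shows the case from the note: the forecast alert fires even though no actual-spend threshold has been reached.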

@yuvipanda
Member

I've set it to email [email protected] and you, @rabernat. What number do you want this to be set at?

@rabernat
Contributor Author

rabernat commented May 6, 2022

Hi Yuvi, thanks for the quick and thoughtful response on this!

We instead focus on usage rather than cost attribution. As the goal is to figure out 'hey who is using things so much that it is costing us a lot of stuff?', I think usage reports (as an extension of what we have in Grafana now) is the way to go. Cost attribution is super difficult and I think will take a lot more work.

👍 This sounds like a great plan.

What number do you want this to be set at?

I think the $2000 threshold is a good place to start as we are spinning up. As usage increases, we can reassess.


I accept that it is only practical to track usage, not cost on a per user and per group basis. However, I would still like to make it possible to have weekly and monthly usage reports, automatically generated or available via a dashboard, rather than only looking at usage when there is an incident. Does our current Grafana configuration allow that?

@yuvipanda
Member

@rabernat so with grafana I think we keep data for the last 6 months, and dashboards have variable time range - so you can definitely click a link and see 'last week', 'last month' usage reports. There is no emailing or similar process available, however.
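
For illustration, 'last week' and 'last month' links can be built from Grafana's relative time range URL parameters (`from` and `to`). This is a minimal sketch; the base URL and dashboard path below are placeholders, not the real LEAP dashboard address.

```python
from urllib.parse import urlencode


def dashboard_link(base_url, dashboard_path, time_range):
    """Build a dashboard URL covering the last `time_range` (e.g. "7d")."""
    query = urlencode({"from": f"now-{time_range}", "to": "now"})
    return f"{base_url.rstrip('/')}/{dashboard_path.lstrip('/')}?{query}"


# Placeholder host and dashboard path, used only as an example.
last_week = dashboard_link("https://grafana.example.org", "/d/usage-report", "7d")
last_month = dashboard_link("https://grafana.example.org", "/d/usage-report", "30d")
print(last_week)
print(last_month)
```

Bookmarking two such links would give a reviewer a one-click weekly and monthly view, without any emailing machinery.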

So let's track what exactly we want in a 'usage report', and make a grafana dashboard for it? I think 'per-group' would definitely need some work, while per-user already is tracked.

@rabernat
Contributor Author

rabernat commented May 9, 2022

So let's track what exactly we want in a 'usage report', and make a grafana dashboard for it?

👍 to this. What are the next steps?

@damianavila
Contributor

cc @GeorgianaElena who might be interested in pushing this conversation forward since she is looking at the monitoring topic right now...

@damianavila damianavila moved this from Needs Shaping / Refinement to In progress in DEPRECATED Engineering and Product Backlog May 10, 2022
@choldgraf
Member

A few notes from a conversation w/ @yuvipanda and the team today:

  • There are two things to think about: budgets and monitoring / reporting usage.
  • budgets are directly tied to funds, and tend to be cloud-specific
  • monitoring / usage can be abstracted away from the cloud provider and done with Grafana etc.

So the two-step process here if there's an unexpected usage increase is:

  • Budgeting alerts / limits will be triggered by the cloud provider, raising visibility to the community rep + the 2i2c team.
  • We then use the usage dashboards to understand what could be triggering this, and plan a next set of actions.

Development for this process w/ LEAP is:

After that we should:

  • Document how to set up budgeting alerts on each of the major cloud providers, since this will be cloud-specific for each

@rabernat
Contributor Author

I appreciate all the efforts and thought that have been happening here.

Can anyone provide a rough estimate of when we might expect a "usage report" dashboard for the LEAP hub? I would like to share this timeline with the LEAP leadership because they are very, very interested in this feature. Providing a timeline will prevent me from having to give updates about this at weekly meetings.

@damianavila
Contributor

There has been some progress here: #1310
And there is ongoing work in this direction as you can see from this board: https://github.com/orgs/2i2c-org/projects/32

Let me circle back with @GeorgianaElena and @yuvipanda about this one so I can give you a good estimate.

@yuvipanda
Member

yuvipanda commented May 20, 2022

We already have budgets, so what we want is a usage report. I think these are the following bar charts:

  1. Pod memory requests, grouped by username, for notebook nodes as well as dask-gateway
  2. Pod GPU requests, grouped by username

This should be enough to dig in and investigate if there's an unexpectedly large bill.
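
As a sketch of what the first chart computes, the following groups pod memory requests by username and ranks users by total. The record shape (`username`, `component`, `memory_request_bytes`) and the sample values are assumptions for illustration; the real dashboards would get this data from the cluster's metrics rather than from a Python list.

```python
from collections import defaultdict

GIB = 1024 ** 3


def memory_requests_by_user(pods):
    """Sum memory requests per username, largest first.

    Notebook pods and dask-gateway pods are counted together, so a
    user's total reflects everything running on their behalf.
    """
    totals = defaultdict(int)
    for pod in pods:
        totals[pod["username"]] += pod["memory_request_bytes"]
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Made-up pods, purely illustrative.
sample_pods = [
    {"username": "alice", "component": "notebook", "memory_request_bytes": 8 * GIB},
    {"username": "alice", "component": "dask-worker", "memory_request_bytes": 4 * GIB},
    {"username": "alice", "component": "dask-worker", "memory_request_bytes": 4 * GIB},
    {"username": "bob", "component": "notebook", "memory_request_bytes": 2 * GIB},
]
ranking = memory_requests_by_user(sample_pods)
for username, total in ranking:
    print(f"{username}: {total / GIB:.0f} GiB requested")
```

The GPU chart would follow the same grouping with a GPU request count in place of memory bytes.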

The work that needs to happen is:

I'll check with @GeorgianaElena and @damianavila to see how we can prioritize this!

@GeorgianaElena
Member

Small update about this one:

We already have budgets, so what we want is a usage report. I think these are the following bar charts:

  1. Pod memory requests, grouped by username, for notebook nodes as well as dask-gateway
  2. Pod GPU requests, grouped by username

⬆️ these have been deployed to the central 2i2c grafana and the leap cluster one. Not sure about what the best way of sharing them would be? (cc @choldgraf and @yuvipanda)

@choldgraf
Member

Is there a way that we can expose these charts to only certain community leaders based on their username in the hub? Eg, any hub administrator can also see the grafana usage plots for that hub?

Otherwise, could we generate these reports as PDFs on the fly to share manually?

@GeorgianaElena
Member

Is there a way that we can expose these charts to only certain community leaders based on their username in the hub? Eg, any hub administrator can also see the grafana usage plots for that hub?

No, we don't yet have the infra to do this. I've opened #1437, which could be a step in this direction.

Otherwise, could we generate these reports as PDFs on the fly to share manually?

Unfortunately, exporting dashboards as PDFs is a Grafana Enterprise feature (https://grafana.com/docs/grafana/latest/enterprise/export-pdf). What we do have are these export options: https://grafana.com/docs/grafana/latest/sharing/share-dashboard/

@sgibson91
Member

exporting dashboards is a grafana enterprise feature

Depending on price, I think this could be a good investment given how important this will be to our billing/invoicing process, and only the central grafana would need to be on the enterprise plan. We could recoup the subscription cost in our overheads maybe?

@choldgraf
Member

choldgraf commented Jun 21, 2022

I agree, definitely worth checking out. @GeorgianaElena would you like to play around with this feature? If so, and depending on how much the enterprise version costs, we could buy a license and see if it is useful. (Via the 2i2c expensify card)

If that doesn't work, I'm wondering if we could use playwright to automate PDF generation of those reports. If it is possible then this could be a nice way to have more customizability over the output (and at least basic PDF generation via playwright is not that hard)
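
A minimal sketch of what basic PDF generation with Playwright's Python API could look like. It assumes the dashboard URL can be loaded without an interactive login (authentication is not handled here), and the `report_filename` and `export_dashboard_pdf` helpers are hypothetical names for illustration.

```python
from datetime import date


def report_filename(hub, day):
    """Build a dated file name such as 'leap-usage-report-2022-06-21.pdf'."""
    return f"{hub}-usage-report-{day.isoformat()}.pdf"


def export_dashboard_pdf(url, out_path):
    """Render a page to PDF with headless Chromium.

    Assumes `url` loads without an interactive login step.
    """
    # Imported lazily so report_filename works without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()  # PDF output requires headless Chromium
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")  # wait for panels to load
        page.pdf(path=out_path, format="A4", landscape=True, print_background=True)
        browser.close()
    return out_path


filename = report_filename("leap", date(2022, 6, 21))
print(filename)
# Example call (needs Playwright, a Chromium install, and a reachable URL):
# export_dashboard_pdf("https://grafana.example.org/d/usage-report", filename)
```

Rendering the page ourselves would leave layout and paper size under our control, which is the customizability mentioned above.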

@yuvipanda
Member

IMO, installing and maintaining licenses for an enterprise version on our cluster is a lot of work + brings something proprietary in our cluster (vs a service we use). We could try paying for a hosted version.

Given that the goal of this is to investigate users who might be overusing the cluster, my suggestion is we set up GitHub auth for the leap grafana and call it a day.

@GeorgianaElena
Member

The leap grafana, running at https://grafana.leap.2i2c.cloud/dashboards, now has GitHub auth enabled. So everyone in the 2i2c-org can log in and check out the usage report (it's under Manage -> Jupyterhub Default Dashboards -> Usage Report).

I believe this was the last piece that was missing, so I will close this issue as completed. Thanks everyone!

Repository owner moved this from In Progress to Done in Cloud usage monitoring and alerting infrastructure and process Jun 24, 2022
Repository owner moved this from In progress to Complete in DEPRECATED Engineering and Product Backlog Jun 24, 2022
@rabernat
Contributor Author

I was finally able to log in and see the dashboard. The "Usage Report" dashboard seems great! 🚀

image

This gives me an overall view of the amount of usage.

Some additional information I would like to be able to see

  • Breakdown of usage by individual named user
  • NFS home directory size by user
  • Dask gateway usage by user

@GeorgianaElena
Member

@rabernat, I believe the screenshot above is from the usage dashboard and there should be another one called usage report.

Screenshot 2022-08-25 at 22 41 53
