Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Transition GCP config-connector to terraform - cleanup cloudResources #1046

Closed
4 tasks done
consideRatio opened this issue Mar 2, 2022 · 9 comments · Fixed by #3778
Closed
4 tasks done

Transition GCP config-connector to terraform - cleanup cloudResources #1046

consideRatio opened this issue Mar 2, 2022 · 9 comments · Fixed by #3778
Assignees
Labels
nominated-to-be-resolved-during-q4-2023 Nomination to be resolved during q4 goal of reducing the technical debt

Comments

@consideRatio
Copy link
Contributor

consideRatio commented Mar 2, 2022

Updated issue

The "config connector" system enabled GKE clusters to define resources of custom types representing GCP resources, where GCP then would create these for us. In practice it enabled our chart's to emit k8s resources that led to the creation of GCP resources outside the k8s cluster, such as a GCP service account and a GCP storage bucket.

We have now transitioned to create such service accounts and buckets via terraform instead, but some clusters still rely on the config connector system. If we can transition to using terraform entirely, we would avoid the complexity of retaining config connector related config that is only relevant to some clusters, and we would avoid the quite significant CPU/Memory overhead requested inside the k8s cluster by the config connector system.

Action points

From Yuvi's comment #1046 (comment)

  • Make sure the cloudResources config is removed from config for:
    • pangeo-hubs/common
    • 2i2c/dask-staging
    • 2i2c/ohw
    • 2i2c/catalyst-cooperative
  • Remove set of resources from our config, under helm-charts/
  • Remove custom code we have in our values.yaml to support this
  • Remove terraform code that supported setting up config-connector

Original issue

In #1045 I removed this section.

scratchBucket:
# Enable a 'scratch' bucket per-hub, with read-write permissions for all
# users. This will set a `SCRATCH_BUCKET` env variable (and a PANGEO_SCRATCH variable
# too, for backwards compatibility). Users can share data with each other using
# this bucket.
enabled: true

It wasn't functional and instead silently ignored. It should be located under basehub.jupyterhub.custom.cloudResources.scratchBucket.enabled to be functional and not ignored.

In #1045 I opted to not change any behavior but instead remove the misconfiguration that was silently ignored and open this issue to raise awareness about the situation: daskhub won't create a scratch bucket by default etc, and you must enable this for each hub that wants it.

I lean towards thinking that any custom resources that one may want, should be explicit configuration for a cluster's hubs rather than default configuration. If there is agreement about that, there may be no action point at all but to close this issue.

Action points

Outlined by Yuvi in #1046 (comment)

@sgibson91
Copy link
Member

sgibson91 commented Mar 2, 2022

Creating the scratch bucket is something the config connector should be able to do (on GKE), but I never got it to work #669 We should scope another project, similar to the CI/CD one, that refactors/overhauls our terraform infra to make it consistent.

@choldgraf
Copy link
Member

It feels like there are two questions to answer here:

  1. Should we automatically create things like a scratch bucket as part of a daskhub, or should this be a feature flag that we turn on when people ask for it?
  2. What's the infrastructure pathway for enabling this if somebody wants it?

For 1, I am not sure what the answer is, but my instinct is that we should document it as a feature and let people ask for it explicitly so we aren't adding unnecessary complexity that may not be used.

For 2, I agree that setting up a broader project around standardizing our terraform is a great idea. @sgibson91 do you want to create an issue to start writing down a list of things that we can shape into a broader project?

@sgibson91
Copy link
Member

@sgibson91 do you want to create an issue to start writing down a list of things that we can shape into a broader project?

I will put it on my to-do list and get to it when I can

@choldgraf
Copy link
Member

sounds good - I don't think there's a strong hurry there, just want to make sure that we don't forget about it.

For this issue, it then sounds like we can close it simply by documenting in infrastructure/ that scratch buckets must be created first before the hub can be configured to use them. Is that right? It seems like these docs are where we describe the setup process so maybe we can refine those a bit?

@damianavila
Copy link
Contributor

Creating the scratch bucket is something the config connector should be able to do (on GKE), but I never got it to work #669

I would bet this is the underlying issue!

For this issue, it then sounds like we can close it simply by documenting in infrastructure/ that scratch buckets must be created first before the hub can be configured to use them. Is that right?

IIUC, we are currently creating that bucket with terraform, ie: https://github.com/2i2c-org/infrastructure/blob/master/terraform/gcp/projects/pangeo-hubs.tfvars#L16-L18
@sgibson91 might know better if there is any other place where this is actually already documented.

@sgibson91
Copy link
Member

@sgibson91 might know better if there is any other place where this is actually already documented.

I'm not sure there is any documentation yet. I caught this in the LEAP config though. #1074 (comment)

yuvipanda added a commit to yuvipanda/pilot-hubs that referenced this issue Mar 18, 2022
The GKE config connector was helpful in letting us deploy
Google Cloud Service Accounts with permissions for cloud storage
directly just from helm. However, it has been difficult to debug,
and in 2i2c-org#669
we decided to move away from it and towards creating these
cloud resources via Terraform.

This commit adds:
- Terraform code that will create a Google Service Account,
  bind it to a given Kubernetes Service Account, for a list of
  hub namespaces passed in. This means that some hub initial deployments
  now *can not be done just with CD*, but need manual work with
  terraform. I think this would be any hub that wants to use
  requestor pays or scratch buckets. This would need to be
  documented.
- Move meom-ige to use this new scheme. metadata concealment
  (https://cloud.google.com/kubernetes-engine/docs/how-to/protecting-cluster-metadata#concealment)
  which is what we were using earlier as alternative to config-connector
  + workload identity, is no longer supported by the terraform
  google provider. In b7b42ce,
  we changed the default from 'SECURE' to 'UNSPECIFIED', but
  it looks like 'UNSPECIFIED' really means 'use workload identity'
  haha. When 2i2c-org#1124 was
  deployed to meom-ige yesterday, it seems to have enabled workload
  identity, causing cloud access to stop working, leading to
  https://2i2c.freshdesk.com/a/tickets/107. Further investigation on
  what happened here is needed, but I've currently fixed it by
  just deploying this change for meom-ige.
- All hubs are given access to all buckets we create. This is
  inadequete, and needs to be more fine grained.

Ref 2i2c-org#669
Ref 2i2c-org#1046
@yuvipanda yuvipanda self-assigned this Mar 18, 2022
@yuvipanda
Copy link
Member

#1130 is relevant

yuvipanda added a commit to yuvipanda/pilot-hubs that referenced this issue Mar 29, 2022
The GKE config connector was helpful in letting us deploy
Google Cloud Service Accounts with permissions for cloud storage
directly just from helm. However, it has been difficult to debug,
and in 2i2c-org#669
we decided to move away from it and towards creating these
cloud resources via Terraform.

This commit adds:
- Terraform code that will create a Google Service Account,
  bind it to a given Kubernetes Service Account, for a list of
  hub namespaces passed in. This means that some hub initial deployments
  now *can not be done just with CD*, but need manual work with
  terraform. I think this would be any hub that wants to use
  requestor pays or scratch buckets. This would need to be
  documented.
- Move meom-ige to use this new scheme. metadata concealment
  (https://cloud.google.com/kubernetes-engine/docs/how-to/protecting-cluster-metadata#concealment)
  which is what we were using earlier as alternative to config-connector
  + workload identity, is no longer supported by the terraform
  google provider. In b7b42ce,
  we changed the default from 'SECURE' to 'UNSPECIFIED', but
  it looks like 'UNSPECIFIED' really means 'use workload identity'
  haha. When 2i2c-org#1124 was
  deployed to meom-ige yesterday, it seems to have enabled workload
  identity, causing cloud access to stop working, leading to
  https://2i2c.freshdesk.com/a/tickets/107. Further investigation on
  what happened here is needed, but I've currently fixed it by
  just deploying this change for meom-ige.
- All hubs are given access to all buckets we create. This is
  inadequete, and needs to be more fine grained.

Ref 2i2c-org#669
Ref 2i2c-org#1046
@yuvipanda
Copy link
Member

We need to basically rip out the config-connector stuff completely, as we've moved all scratch bucket stuff to terraform. Unassigning myself as I'm not currently working on it.

@yuvipanda

This comment was marked as duplicate.

@yuvipanda yuvipanda removed their assignment Nov 9, 2022
@damianavila damianavila moved this to Needs Shaping / Refinement in DEPRECATED Engineering and Product Backlog Nov 9, 2022
@damianavila damianavila moved this from Needs Shaping / Refinement to Ready to work in DEPRECATED Engineering and Product Backlog Nov 9, 2022
@consideRatio consideRatio changed the title daskhub didn't enable scratchBuckets by default, now silently ignored faulty config has been removed Cleanup cloudResources feature for GCP Feb 24, 2023
@consideRatio consideRatio changed the title Cleanup cloudResources feature for GCP Transition GCP config-connector to terraform - cleanup cloudResources Feb 24, 2023
@consideRatio consideRatio moved this to Todo 👍 in Sprint Board Feb 24, 2023
@consideRatio consideRatio added the nominated-to-be-resolved-during-q4-2023 Nomination to be resolved during q4 goal of reducing the technical debt label Oct 11, 2023
@github-project-automation github-project-automation bot moved this from Ready to work to Complete in DEPRECATED Engineering and Product Backlog Mar 6, 2024
@consideRatio consideRatio self-assigned this Mar 6, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
nominated-to-be-resolved-during-q4-2023 Nomination to be resolved during q4 goal of reducing the technical debt
Projects
No open projects
Development

Successfully merging a pull request may close this issue.

5 participants