eScience Course: 31/10-11/11 2022 (plus 28-30 November for remote presentations of final results). #42

Closed
annefou opened this issue Oct 19, 2022 · 28 comments

Comments

@annefou
Collaborator

annefou commented Oct 19, 2022

Clivar is ending very soon and we have another course coming. Very similar needs e.g. 17 students + 10 mentors (see https://www.aces.su.se/research/projects/escience-tools-in-climate-science-linking-observations-with-modelling/). I will create a repo (same as for Clivar); the organisers would also be happy if someone can deliver some of the trainings (mostly focusing on Dask + kerchunk because the rest is covered in-house).

  1. My first question is: do we keep the CLIVAR Jupyterhub and reuse it for this eScience course or will we have a new (elastic) infra?
  2. Any volunteers for teaching (online) Dask and/or kerchunk? (I understood they are flexible but should probably happen 1st or 2nd November).
@sebastian-luna-valero
Collaborator

Clivar is ending very soon and we have another course coming.

Great to know, thanks!

do we keep the CLIVAR Jupyterhub and reuse it for this eScience course?

I discussed with CESNET and after 21st Oct we need to scale down the CLIVAR JupyterHub, removing first the worker nodes with the hpc.16core-64ram-ssd-ephem flavors. Specifically, CESNET would like to have back 15 out of the 20 hpc.16core-64ram-ssd-ephem nodes that we are currently using, as they are being requested by other research groups. If @j34ni and @guillaumeeb struggle to identify which worker nodes to remove from the cluster, please contact me.

will we have a new (elastic) infra?

Tests are still ongoing between @guillaumeeb, Miguel and myself. I would say that manual scaling is still preferred.

Very similar needs e.g. 17 students + 10 mentors

According to what I see in Grafana, we should be covered after removing 15 VMs of the hpc.16core-64ram-ssd-ephem flavor for this upcoming workshop. We could add more elixir.16core-64ram nodes instead, but since we will have fewer users, we can check as we go.

Additionally, please remember to submit an application to https://c-scale.eu/call-for-use-cases/ to gain access to resources in addition to what you have in EGI-ACE, so we could host multiple JupyterHub/DaskHub instances at the same time. Actually, I would be grateful if you could spread the link to the C-SCALE call through your networks (i.e. the European Pangeo community) so others can also benefit from these resources.

@guillaumeeb
Member

do we keep the CLIVAR Jupyterhub and reuse it for this eScience course?

I discussed with CESNET and after 21st Oct we need to scale down the CLIVAR JupyterHub, removing first the worker nodes with the hpc.16core-64ram-ssd-ephem flavors. Specifically, CESNET would like to have back 15 out of the 20 hpc.16core-64ram-ssd-ephem nodes that we are currently using, as they are being requested by other research groups. If @j34ni and @guillaumeeb struggle to identify which worker nodes to remove from the cluster, please contact me.

I would suggest deploying a fresh pangeo-eosc or pangeo-egi infrastructure, even if elastic scaling is still not completely working. I assume we can stop the clues2 service to avoid auto-scaling issues if needed? But continuing with the other infrastructure is fine too.

Any volunteers for teaching (online) Dask and/or kerchunk? (I understood they are flexible but should probably happen 1st or 2nd November).

Sorry, I won't be available for that.

@j34ni
Collaborator

j34ni commented Oct 20, 2022

I started to look into building a new infrastructure, similar to the pangeo-clivar but using only the elixir.16core-64ram flavor

That works, but I found the deployment process much slower than with the hpc.16core-64ram-ssd-ephem, so there are performance differences

@sebastian-luna-valero: I would have liked to try also the hpc.30core-64ram flavor but the only possible values for the Number of CPUs in the IM Dashboard are 2, 4, 8, 16, 32 and 64: is it possible to include 30 cores?

@tinaok @annefou: the possible Size of the disk to be attached is limited to 2TB, is that sufficient for you?

@annefou
Collaborator Author

annefou commented Oct 20, 2022

Thanks for starting the creation of the new jupyterhub for the eScience course. On my side, I have created (duplicated from clivar workshop) https://github.com/pangeo-data/escience-2022

I just asked the course organisers about the storage. I see that for the clivar bootcamp, they had 1TB and it is not full. I guess 2TB is OK.

They may also be able to create minIO for reading some data from their own infrastructure (let's see if it works out).

Thanks.

@tinaok
Collaborator

tinaok commented Oct 21, 2022

Any volunteers for teaching (online) Dask and/or kerchunk? (I understood they are flexible but should probably happen 1st or 2nd November).

Sorry, I won't be available for that either.

do we keep the CLIVAR Jupyterhub and reuse it for this eScience course or will we have a new (elastic) infra?

Is it possible to separate the object storage from the one used by CLIVAR?
The CLIVAR bootcamp has finished, but the working groups are continuing their work and saving zarr/netcdf files in the Pangeo storage at CESNET. If the training is a one-off and the data for this eScience course can be deleted afterwards, it may be better to use one-off MinIO disk storage. That would be safe: the CLIVAR users who continue working would not risk having their files deleted.

For computing resources, maybe I can set up a meeting with each working group to understand when they will work intensively, so that we can adjust the resources manually in advance? (And thank you Jean for creating a separate instance!)

@annefou
Collaborator Author

annefou commented Oct 21, 2022

You are right, we need to keep the storage separate. For the eScience course, they will work until the end of November. I guess we also want to onboard them more "permanently", i.e. we would like to provide a more long-term solution but it is probably only viable once the elastic part is in place.

@sebastian-luna-valero
Collaborator

@sebastian-luna-valero: I would have liked to try also the hpc.30core-64ram flavor but the only possible values for the Number of CPUs in the IM Dashboard are 2, 4, 8, 16, 32 and 64: is it possible to include 30 cores?

We could request the 30 vCPU option to be added but the hpc.30core-64ram flavor has ~2 GB RAM per vCPU core, instead of 4 GB per vCPU core in elixir.16core-64ram. I think we agreed previously that the 4 to 1 ratio is better than 2 to 1.
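
For reference, the ratios can be checked from the flavor names alone. This is only a minimal sketch, assuming the names encode the vCPU count and the RAM in GB as `<N>core-<M>ram`; the helper `ram_per_vcpu` is hypothetical and not part of any CESNET or IM tooling.

```python
import re


def ram_per_vcpu(flavor: str) -> float:
    """Return GB of RAM per vCPU, parsed from a name like 'elixir.16core-64ram'."""
    # Assumption: the flavor name carries '<cores>core-<ram>ram' with RAM in GB.
    match = re.search(r"(\d+)core-(\d+)ram", flavor)
    if match is None:
        raise ValueError(f"unrecognised flavor name: {flavor!r}")
    cores, ram_gb = int(match.group(1)), int(match.group(2))
    return ram_gb / cores


for flavor in (
    "hpc.30core-64ram",
    "elixir.16core-64ram",
    "hpc.16core-64ram-ssd-ephem",
):
    print(f"{flavor}: {ram_per_vcpu(flavor):.2f} GB per vCPU")
```

With these names, the 30-core flavor comes out at roughly 2 GB per vCPU and both 16-core flavors at 4 GB per vCPU, which matches the ratios quoted above.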

That works, but I found the deployment process much slower than with the hpc.16core-64ram-ssd-ephem, so there are performance differences

For me the keyword here is deployment. Have you also run notebooks inside the new cluster to check the performance of these nodes after deployment? This would need to be compared with the performance of the same notebook in the clivar deployment.

Is it possible to separate the object storage from the one used by CLIVAR?

Object storage is detached from JupyterHub deployments.
NFS storage is attached to JupyterHub deployments.
As long as users are using object storage, we should be fine.

we would like to provide a more long-term solution but it is probably only viable once the elastic part is in place.

Remember that until we get automatic elasticity in place, manually scaling the cluster up and down is possible.

@tinaok
Collaborator

tinaok commented Oct 21, 2022

@sebastian-luna-valero

Is it possible to separate the object storage from the one used by CLIVAR?

Object storage is detached from JupyterHub deployments.
NFS storage is attached to JupyterHub deployments.
As long as users are using object storage, we should be fine.

Sorry, I was not clear enough. I was hoping that we would not give the eScience course students access to vo.pangeo.eu in aai.egi.eu for S3 storage just yet (until #17 is resolved), because they would then be able to write into 'ANY OF' the vo.pangeo.eu-swift disk space.
That could lead to unfortunate deletes/overwrites by eScience course students (or vice versa).
That's why I thought it may be better to separate the S3 access (if Anne plans to give write access to S3 storage), for example by using an external MinIO server or similar (which would not stay around for long).

@sebastian-luna-valero
Collaborator

I see, thanks!

If other object storage is not available, and until #17 is solved, we could also look into deploying our own MinIO. However, deploying it and keeping it operational takes extra effort, and I am not sure whether I will have the time. What about others?

@j34ni
Collaborator

j34ni commented Oct 21, 2022

@sebastian-luna-valero @tinaok @annefou

I did a quick test with dask_introduction.ipynb and the computation times are comparable between pangeo-eosc and pangeo-clivar, although I am not certain that the latter used hpc.16core-64ram-ssd-ephem and not the same elixir.16core-64ram (which would explain the similarity)...

Downloads were a lot slower however, but that could be related to the network?!

If there are no other dramatic performance losses, I guess that it should be fine for the eScience course?

This eosc infrastructure now has 16 WNs and the same values as clivar except for the amount of memory which is increased

@sebastian-luna-valero
Collaborator

the computation times are comparable between pangeo-eosc and pangeo-clivar

great!

Downloads were a lot slower however, but that could be related to the network?!

Indeed, I would say so.

This eosc infrastructure now has 16 WNs and the same values as clivar except for the amount of memory which is increased

Looking at OpenStack I see:

  • 31 VMs for pangeo-clivar, out of which 20 are hpc.16core-64ram-ssd-ephem. So the worker nodes deleted from pangeo-clivar have all been of the flavor elixir.16core-64ram. Please remember that we instead need to delete worker nodes with the flavor hpc.16core-64ram-ssd-ephem and replace them with elixir.16core-64ram.

  • 25 VMs for pangeo-eosc, all of them with the flavor elixir.16core-64ram. Great!

@j34ni please replace hpc.16core-64ram-ssd-ephem with elixir.16core-64ram. Happy to help if you need me!
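
To make the selection rule concrete, here is a rough sketch of picking which worker nodes to hand back, filtering by flavor and taking the oldest first. The inventory below is hypothetical (made-up names and creation dates, shaped only to match the 31 VMs / 20 hpc count above), and `pick_nodes_to_return` is an illustrative helper, not an OpenStack or IM call. The real list would have to come from OpenStack or the IM Dashboard.

```python
from collections import Counter

HPC_FLAVOR = "hpc.16core-64ram-ssd-ephem"
ELIXIR_FLAVOR = "elixir.16core-64ram"


def pick_nodes_to_return(nodes, flavor=HPC_FLAVOR, count=15):
    """Pick up to `count` nodes of the given flavor, oldest first."""
    candidates = sorted(
        (node for node in nodes if node["flavor"] == flavor),
        key=lambda node: node["created"],
    )
    return candidates[:count]


# Hypothetical inventory: 20 hpc VMs plus 11 elixir VMs (31 in total).
nodes = [
    {"name": f"wn-hpc-{i}", "flavor": HPC_FLAVOR, "created": f"2022-09-{i + 1:02d}"}
    for i in range(20)
]
nodes += [
    {"name": f"wn-elixir-{i}", "flavor": ELIXIR_FLAVOR, "created": f"2022-10-{i + 1:02d}"}
    for i in range(11)
]

to_return = pick_nodes_to_return(nodes)
returned_names = {node["name"] for node in to_return}
remaining = Counter(
    node["flavor"] for node in nodes if node["name"] not in returned_names
)

print("return to CESNET:", sorted(returned_names))
print("remaining by flavor:", dict(remaining))
```

Under these assumptions, 15 hpc nodes are selected and 5 hpc nodes stay in the cluster; the elixir nodes are untouched.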

@guillaumeeb
Member

guillaumeeb commented Oct 21, 2022

I just asked the course organisers about the storage. I see that for the clivar bootcamp, they had 1TB and it is not full. I guess 2TB is OK.

As this storage is only used for home directories, I don't think we need so much. Are you planning to put huge scientific datasets there?

If other object storage is not available, and until #17 is solved, we could also look into deploying our own MinIO. However, deploying it and keeping it operational takes extra effort, and I am not sure whether I will have the time. What about others?

I won't have the time either. Would creating yet another OpenStack project for the eScience workshop to host data containers be feasible? We could reduce the object store quotas on both projects if needed.

I started to look into building a new infrastructure,

@j34ni, I see that the jupyterhub is available at https://pangeo-eosc.vm.fedcloud.eu/jupyterhub/, so I guess you did not use the latest configuration provided here: https://github.com/pangeo-data/pangeo-eosc/blob/main/EGI.md? This is not crucial, but the setup is a bit simplified in this documentation.

@j34ni
Collaborator

j34ni commented Oct 21, 2022

@sebastian-luna-valero @tinaok @annefou

Should we fiddle with pangeo-clivar now, as it is being used, or should we wait before starting to remove the hpc.16core-64ram-ssd-ephem VMs (and then replace them by elixir.16core-64ram or not)?

@j34ni
Collaborator

j34ni commented Oct 21, 2022

@guillaumeeb
Sorry I missed that and simply took the current clivar_values.yaml to reuse for pangeo-eosc

@sebastian-luna-valero
Collaborator

Should we fiddle with pangeo-clivar now, as it is being used, or should we wait before starting to remove the hpc.16core-64ram-ssd-ephem VMs (and then replace them by elixir.16core-64ram or not)?

The pangeo-clivar cluster went down from 49 VMs to 31 VMs already, have you noticed any disruption?

We agreed with CESNET to keep the hpc.16core-64ram-ssd-ephem nodes until EOB today, and we need to give them back as soon as possible.

Again, I am here to help if needed.

I won't have the time either. Would creating yet another OpenStack project for the eScience workshop to host data containers be feasible? We could reduce the object store quotas on both projects if needed.

Currently the quotas are 10TB for each project. Please confirm the new quota value and I will double check with CESNET.

@j34ni
Collaborator

j34ni commented Oct 21, 2022

@sebastian-luna-valero

The pangeo-clivar cluster went down from 49 VMs to 31 VMs already, have you noticed any disruption?

I am not sure these 49 - 31 = 18 VMs were configured/used at all in the infrastructure, they never showed up in the list of VMs in the IM Dashboard anyway, so I deleted them in openstack manually

If I start to remove VMs from the infrastructure while they are in use the affected users will not be very happy

@tinaok When will be a good time to do that?

@guillaumeeb
Member

I won't have the time either. Would creating yet another OpenStack project for the eScience workshop to host data containers be feasible? We could reduce the object store quotas on both projects if needed.

Currently the quotas are 10TB for each project. Please confirm the new quota value and I will double check with CESNET.

The quota was just a proposition to see if we could create another Openstack project to have an object storage space that has different access policy. Imagine a pangeo-escience Openstack project, maybe we would need to create another user group too on check-in?

@sebastian-luna-valero
Collaborator

If I start to remove VMs from the infrastructure while they are in use the affected users will not be very happy

According to Grafana, the cluster is quiet now, and it's Friday afternoon, so I would say it's a good time to reconfigure the cluster.

The quota was just a proposition to see if we could create another Openstack project to have an object storage space that has different access policy. Imagine a pangeo-escience Openstack project, maybe we would need to create another user group too on check-in?

Sure, we can create a new group in check-in dedicated for the new object store. Please note that this would imply that every time a new user requests to enroll in the VO for the eScience Course, VO managers would have to manually add them to the new group. If that's not a problem for you, we can do it.

Anyway, we would need to update OS_PROJECT_ID in https://github.com/pangeo-data/pangeo-eosc/blob/main/EGI-CLI-Swift-S3.md so we could also create a dedicated page for the eScience Course with the new project ID, and users will simply use that.

@j34ni
Collaborator

j34ni commented Oct 21, 2022

@sebastian-luna-valero

According to Grafana, the cluster is quiet now, and it's Friday afternoon, so I would say it's a good time to reconfigure the cluster.

OK, I'll give it a go

@j34ni
Collaborator

j34ni commented Oct 21, 2022

@sebastian-luna-valero

Now I see a lot more VMs for pangeo-clivar on the IM Dashboard than when I refreshed a few minutes ago, and these 18 "ghost" VMs are suddenly back. What happened?

@sebastian-luna-valero
Collaborator

I am sorry, but I can't check since I don't have access to the pangeo-{clivar,eosc} clusters on my profile of the IM Dashboard. As a last resort, I can send you my details so you can add me as owner to check further.

@j34ni
Collaborator

j34ni commented Oct 21, 2022

@sebastian-luna-valero

Please do send me your credentials
I started by deleting the oldest HPC VMs and the IM Dashboard does not like it


@sebastian-luna-valero
Collaborator

Will do. By the way:

  • is the pangeo-clivar cluster deployed with the dev instance of IM?
  • is the pangeo-eosc cluster deployed with the prod instance of IM?

@j34ni
Collaborator

j34ni commented Oct 21, 2022

@sebastian-luna-valero

They are both on the dev instance.

@sebastian-luna-valero
Collaborator

The quota was just a proposition to see if we could create another Openstack project to have an object storage space that has different access policy. Imagine a pangeo-escience Openstack project, maybe we would need to create another user group too on check-in?

CESNET is happy to create another OpenStack project. There are two options

  • Option 1: Creating a dedicated vo.pangeo.eu in aai.egi.eu/escience group and mapping it to the new OpenStack project (bucket columns refer to Object Storage at OpenStack, by project):

| Virtual Organisation | Create/destroy VMs | "vo.pangeo.eu" public bucket | "vo.pangeo.eu" private bucket | "vo.pangeo.eu-swift" public bucket | "vo.pangeo.eu-swift" private bucket | "vo.pangeo.eu-escience" public bucket | "vo.pangeo.eu-escience" private bucket |
| --- | --- | --- | --- | --- | --- | --- | --- |
| member of vo.pangeo.eu in aai.egi.eu/pangeo.admins | yes | read/write access | read/write access | read/write access | read/write access | read-only | no access |
| member of vo.pangeo.eu in aai.egi.eu | no | read-only | no access | read/write access | read/write access | read-only | no access |
| member of vo.pangeo.eu in aai.egi.eu/escience | no | read-only | no access | read-only | no access | read/write access | read/write access |
| member of vo.pangeo.eu in aai-dev.egi.eu | no | read-only | no access | read-only | no access | read-only | no access |
| None | no | read-only | no access | read-only | no access | read-only | no access |

  • Option 2: Not creating a dedicated group (same columns):

| Virtual Organisation | Create/destroy VMs | "vo.pangeo.eu" public bucket | "vo.pangeo.eu" private bucket | "vo.pangeo.eu-swift" public bucket | "vo.pangeo.eu-swift" private bucket | "vo.pangeo.eu-escience" public bucket | "vo.pangeo.eu-escience" private bucket |
| --- | --- | --- | --- | --- | --- | --- | --- |
| member of vo.pangeo.eu in aai.egi.eu/pangeo.admins | yes | read/write access | read/write access | read/write access | read/write access | read/write access | read/write access |
| member of vo.pangeo.eu in aai.egi.eu | no | read-only | no access | read/write access | read/write access | read/write access | read/write access |
| member of vo.pangeo.eu in aai-dev.egi.eu | no | read-only | no access | read-only | no access | read-only | no access |
| None | no | read-only | no access | read-only | no access | read-only | no access |
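
To make the difference easier to check, the Option 1 matrix can be written down as a small lookup. This is only an illustrative sketch transcribed from the Option 1 table; the names `OPTION_1`, `access` and `can_write` are made up for the example and do not correspond to any OpenStack or Check-in API.

```python
RW, RO, NO = "read/write access", "read-only", "no access"

# Order of the object storage projects in the table columns.
PROJECTS = ("vo.pangeo.eu", "vo.pangeo.eu-swift", "vo.pangeo.eu-escience")

# Option 1: one (public bucket, private bucket) pair per project, per group.
OPTION_1 = {
    "vo.pangeo.eu in aai.egi.eu/pangeo.admins": ((RW, RW), (RW, RW), (RO, NO)),
    "vo.pangeo.eu in aai.egi.eu": ((RO, NO), (RW, RW), (RO, NO)),
    "vo.pangeo.eu in aai.egi.eu/escience": ((RO, NO), (RO, NO), (RW, RW)),
    "vo.pangeo.eu in aai-dev.egi.eu": ((RO, NO), (RO, NO), (RO, NO)),
    None: ((RO, NO), (RO, NO), (RO, NO)),
}


def access(group, project, private=False):
    """Return the access level of `group` on a bucket of `project`."""
    public_level, private_level = OPTION_1[group][PROJECTS.index(project)]
    return private_level if private else public_level


def can_write(group, project, private=False):
    """True if `group` has read/write access to the given bucket type."""
    return access(group, project, private) == RW


escience = "vo.pangeo.eu in aai.egi.eu/escience"
members = "vo.pangeo.eu in aai.egi.eu"
print(can_write(escience, "vo.pangeo.eu-swift"))     # course students vs CLIVAR data
print(can_write(members, "vo.pangeo.eu-escience"))   # regular members vs course data
print(can_write(escience, "vo.pangeo.eu-escience"))  # course students vs course data
```

Under Option 1 the two groups cannot overwrite each other's buckets, which is the separation asked for earlier in this thread; under Option 2 regular members would have read/write access to all "vo.pangeo.eu-escience" buckets.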

Please let me know your thoughts.

@annefou
Collaborator Author

annefou commented Oct 26, 2022

I think option 1 is best.
Thank you!

@sebastian-luna-valero
Collaborator

Thanks, please review #44

@sebastian-luna-valero
Collaborator

Closing as obsolete.
