Optimise resource #21
@sebastian-luna-valero I think it would be a good idea for the IM Dashboard that all EGI VO admins automatically get access to the clusters that are created. |
This is clearly something we should optimize! We need to define how many resources we want now and in the following weeks/months. How many VMs do we need when no event is running? Then we need to clarify how to do it, and who can do it. We can probably do something through the OpenStack UI too. Of course, ideally we should advance on the elastic Kubernetes setup! |
I believe it is better to do this explicitly (i.e. you choose who to share with) rather than automatically, for security reasons. If you plan to use the elastic cluster, I suggest doing as many tests as possible before the upcoming CLIVAR workshop in October. The main aspects we should consider are to disconnect DaskHub from EGI Check-In, allow the native authentication mechanism, and perform stress tests with fake users. Happy to participate and contribute. Also, as discussed via email, we are happy to offer more computational resources to the upcoming workshops on a different provider. |
You propose to do this in order to check that the "Elastic" functionality works? If so, we can also verify this using Dask Gateway. With Elastic Kubernetes, can we choose to increase the minimum number of VMs before a workshop, through the IM Dashboard or by other means? cc @annefou for the more computational resources on a different provider. Not sure how the workshops overlap in resource needs. This might get tricky to handle if we have two JupyterHub URLs and we need to copy datasets in two places. |
@sebastian-luna-valero @guillaumeeb |
Still, I think it is a good idea to test other infrastructure,
at least 2-4 Dask workers for each student and 1 JupyterLab? |
Hi,
Great, much easier then!
I need to investigate this, and will report back. More importantly, we need to plan and test it before the workshop.
Great. I will start looking for an alternative provider.
According to:
For 30 users, 2 Dask workers per user, and 4 vCPUs per user/worker we get:
Object storage: 10 TB. I will check what's possible and get back to you. Best regards, |
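For reference, a back-of-the-envelope version of that sizing (a sketch only; it assumes 1 JupyterLab session plus 2 Dask workers per user at 4 vCPUs each, and 2 GB RAM per vCPU, the ratio discussed later in this thread):

# Rough capacity estimate implied by the numbers above (assumptions, not official figures)
users = 30
pods_per_user = 1 + 2              # 1 JupyterLab session + 2 Dask workers (assumption)
vcpus_per_pod = 4                  # 4 vCPUs per user session / worker
ram_gb_per_vcpu = 2                # assumed RAM-to-vCPU ratio

total_vcpus = users * pods_per_user * vcpus_per_pod    # 360 vCPUs
total_ram_gb = total_vcpus * ram_gb_per_vcpu           # 720 GB RAM
print(f"{total_vcpus} vCPUs, {total_ram_gb} GB RAM")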
Getting back to the part about optimizing current resources, I have several questions:
|
Hi @guillaumeeb
This error means that you have not correctly defined the credentials to access the cloud site.
If you want to deploy a K8s cluster on an OpenStack site (using OpenStack credentials), you only have to add it in the credentials section of the IM Dashboard, setting the needed authentication data. |
Thanks @micafer !
@guillaumeeb FYI: https://docs.egi.eu/users/compute/orchestration/im/dashboard/#cloud-credentials
I think it was created by @j34ni so he should be able to share the cluster with you.
Great! |
@guillaumeeb |
Thanks @micafer
So I have a cloud credential for CESNET: I used it to deploy my own Kubernetes cluster. However, I didn't have it configured when @j34ni gave me access to the other infrastructure (74ab3cc8-1e2d-11ed-8c48-0ee20d64cb6e). This infrastructure has a status of 'unknown', whereas the one I deployed is 'configured'. I should probably try to achieve a correct status before trying to manage it from the IM Dashboard, but I don't know what to do. Could that come from the fact that I used a different ID for the CESNET cloud provider than @j34ni used? @j34ni, do you see the foss4g infrastructure in a 'configured' status?
@j34ni I don't need more resources currently, it's perfectly fine for me to keep it. I'll send you my credentials by email, but it is not mandatory that I see this cluster. |
Hi everyone, so @j34ni deleted the instance without EGI Check-In. We currently have the pangeo-foss4g instance running with a lot of resources, and a new pangeo-elastic instance that I deployed to test Elastic Kubernetes. The elastic functionality is not working right now. What are the plans for the upcoming workshop? Do you want to:
We could also use the … It should be pretty easy to deploy a new instance with more resources (and user limitations on the dask-gateway side: Dask cluster size limits) if you want. |
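For context, the "Dask cluster size limits" mentioned above are limits applied to each Dask cluster on the dask-gateway server side; a minimal sketch (the numbers are placeholders, not our actual settings):

# dask-gateway server configuration (placeholder values)
c.ClusterConfig.cluster_max_workers = 20      # max workers per Dask cluster
c.ClusterConfig.cluster_max_cores = 40        # max total cores per Dask cluster
c.ClusterConfig.cluster_max_memory = "80 G"   # max total memory per Dask cluster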
Now that we are closer to the workshop, do we have a better estimate of the required capacity?
I am currently struggling to find a new provider in time, so in the end we may simply ask CESNET to increase the available capacity for the CLIVAR workshop, if that's ok. However, the discussion to get a new provider is still on the table, and if it's not available for the CLIVAR workshop we will try to have it for the following one. Therefore, the deployment for the CLIVAR workshop can stay at full capacity for longer, even if there is an overlap with the following workshop in November. Please let me know your thoughts. |
Yes, I think it is OK to stick to CESNET. For the infrastructure, maybe we could just "rename" foss4g to pangeo-eosc or similar and add more resources. I think it takes time to add resources, and the bootcamp starts next week. On my side, I am slightly worried because we know little about the datasets, and the attendees will most likely download many of them during the workshop. Do we have storage like for FOSS4G? They also have a MinIO instance in Denmark, but it may not be very efficient for reading large amounts of data. |
Yes, we have 10 TB of object storage, but write access is only allowed to Pangeo Admins at the moment. Trainees only have read-only access; is that ok? Regarding adding resources, should we try to match the amount of resources requested in #21 (comment), or can you confirm more accurate numbers? |
Ok. Read access should be OK. For writing results and other data, they can use their MinIO. Yes, I think the amount of resources requested in #21 (comment) is OK for this course. Thanks a lot. |
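To illustrate what read-only access could look like for trainees (a sketch only: the endpoint URL, bucket and store names are placeholders, and it assumes the bucket allows anonymous reads, which may not match the actual CESNET setup):

import s3fs
import xarray as xr

# Anonymous, read-only access to a hypothetical public bucket
fs = s3fs.S3FileSystem(anon=True, client_kwargs={"endpoint_url": "https://object-store.example.org"})
ds = xr.open_zarr(fs.get_mapper("workshop-bucket/some-dataset.zarr"))
print(ds)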
Renaming can be tricky, and won't be much faster than rebuilding a fresh infrastructure. If needed, I can deploy a new pangeo-eosc platform tonight or tomorrow. Adding resources doesn't seem to be a problem: on my testing instance it was pretty fast to add a node (a few minutes, less than 10). However, it will need to be validated before we delete the pangeo-foss4g one. On my side, I cannot add resources to the pangeo-foss4g deployment; I don't know if @j34ni can? |
During the workshop, when users use Dask, we will probably need to work with temporary Zarr files. |
I can see the pangeo-elastic infrastructure as red, is that normal? I have added 2 hpc.16core-64ram-ssd-ephem nodes (16 CPUs, 64.0 GB of RAM, 80.0 GB of HD) to pangeo-foss4g and they show as "running" (orange), so I guess it is in the process of working and will hopefully soon be green. If it all turns green, how many more nodes should I add? As for the name, we have a "pangeo-egi" ready and should in principle be able to switch easily without disturbing "pangeo-eosc". |
Is it possible to increase the size of the disk on an existing infrastructure (currently 931 GiB)? Also, on OpenStack I can see a lot of 80 GB volumes which are apparently not in use (they must be leftovers from previous infrastructures and/or VMs!?); can/should we delete them? |
I'm afraid we don't know how to do that currently with CESNET object storage. We should try to advance #17, but I'm not sure we can easily answer this need. If I understand correctly, even with #23, users need to have an account on the EGI Check-In operational service to get write access using either the Swift or S3 interfaces. Is that correct @sebastian-luna-valero? We could generate either a Swift token or S3 credentials and share them, but this won't be very secure, as it means users would be able to delete every bucket we have.
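To illustrate the concern about shared credentials: anyone holding a single shared key pair could both write and delete anything it can see. A sketch with placeholder endpoint, keys and bucket names:

import numpy as np
import s3fs
import xarray as xr

# Shared credentials (placeholders) give full read/write access through the S3 interface
fs = s3fs.S3FileSystem(
    key="SHARED_ACCESS_KEY",
    secret="SHARED_SECRET_KEY",
    client_kwargs={"endpoint_url": "https://object-store.example.org"},
)
ds = xr.Dataset({"x": ("t", np.arange(10))})
ds.to_zarr(fs.get_mapper("workshop-bucket/tmp/test.zarr"), mode="w")   # writing works...
# ...but so do destructive operations, e.g.:
# fs.rm("workshop-bucket", recursive=True)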
Not really. I have had some exchanges with Miguel to try to make Elastic Kubernetes work, and maybe the manipulations done had a side effect; I also see the infrastructure as red, but it's working. I also still see pangeo-foss4g as 'unknown' on my side.
Let's wait for @annefou or @tinaok. I guess if we choose to keep the pangeo-foss4g infrastructure, we'll want to add as many nodes as we can, just leaving a few resources for testing.
I'm not sure what we are talking about here; are these local volumes attached to VMs? How would you want to use those volumes? I don't feel that using local VM storage (if this is what we are talking about) will solve the temporary storage problem. I'm under the impression we need storage accessible from every VM, a shared file system or object storage. |
I was talking about the shared file system |
As long as we can't make the cluster elastic, I think we'd better keep the pangeo-foss4g infra.
Please put in as much as possible, but keep some resources for @guillaumeeb and others to work on the elastic version and maybe Binder tests. |
I have an update on the number of students/mentors. We'll have 22 students and 14 mentors (including Anne and myself).
I agree with increasing the NFS disk space if possible. Some users would try creating Zarr stores with a local Dask cluster. It is not 'optimal' parallel computing, but until we sort out the possibility of creating Zarr stores in object storage, I think it is good to have this solution. |
@j34ni, what I can tell from the OpenStack dashboard is that every VM (13 at the moment) uses 80 GB of local disk space. On the two Kubernetes front nodes, the TOSCA template also mounts another volume which, according to the documentation on the IM Dashboard, is used to store Kubernetes Persistent Volumes. Those persistent volumes are disk space that can be requested by pods, and are used for example by JupyterHub to get a persistent volume on each user's Jupyter notebook pod, mounted on /home/jovyan. This is the 931 GiB volume you're talking about. However, this space is not mounted, and so not visible, on the Dask worker pods created by dask-gateway, and I'm not sure if mounting it there is feasible. So this is a space that is shared between users, but not shared in the sense of distributed computing; it is more a space that is kept between Jupyter sessions. As @tinaok said, we'll only be able to use this space with Dask LocalClusters, and I'm not sure what performance we can get if many users work on it at the same time.
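For reference, a minimal sketch of the LocalCluster workaround described above, writing a temporary Zarr store under the user's home volume (the dataset and path below are only examples):

import numpy as np
import xarray as xr
from dask.distributed import LocalCluster, Client

# LocalCluster workers run inside the notebook pod, so they can see /home/jovyan
cluster = LocalCluster(n_workers=4, threads_per_worker=1)
client = Client(cluster)

ds = xr.Dataset({"x": ("t", np.arange(1_000_000))}).chunk({"t": 100_000})
ds.to_zarr("/home/jovyan/tmp_store.zarr", mode="w")   # lands on the shared persistent volume

client.close()
cluster.close()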
So I guess we'll use …
For my tests, I'd say 32 cores and 128 GiB is enough! And I already use 24 cores. |
Thank you @guillaumeeb
I totally agree about the performance issue. I'm trying to run some notebooks from … |
Students will be formed into working groups (each with 3-4 students, i.e. about 6-7 working groups). They should be sharing the same datasets to work on. So I guess we can create a public cloud bucket for each working group, and we need to give them full access (read/write)? |
(or maybe we should create a new issue for organising S3 disk space for the CLIVAR workshop?) |
I can see no change since yesterday on the IM Dashboard (everything is still red); however, on OpenStack it seems that the project is now allocated 1736 vCPUs. Is it a bug? It would be great if it were true and if we could actually use them, though... |
I just benchmarked with the same CMIP6 notebook
I used 4 Dask workers (32 GB) with elastic and 26 Dask workers (52 GB) with foss4g; elastic could handle the work, but foss4g failed. According to @keewis, Dask had a big update recently, and that probably plays a role; I also think a Dask worker with just 2 GB RAM is small for heavy duty. Can the foss4g cluster be updated to a recent pangeo-notebook Docker image while keeping all the data on it? (I mean the data on NFS) |
Interesting. I was just discussing the quotas with CESNET, and in #21 (comment) 2 GB RAM per vCPU was requested. Let's see if they can provide 4 GB RAM per vCPU instead.
CESNET may need some additional time to do the checks. |
@tinaok sorry but I would like to clarify what configuration worked for you:
Could you please confirm this info? |
Thank you @sebastian-luna-valero
The benchmark shows a DaskHub configuration problem: foss4g has an older Dask version plus a small Dask worker configuration, so the cluster for CLIVAR needs to be re-created or foss4g needs an update... @guillaumeeb @j34ni, is recreating or updating possible for you before the workshop starts? |
I benchmarked using just one JupyterLab.
|
@sebastian-luna-valero I used smaller instances for my test of Elastic Kubernetes; I guess this is OK in this case?
@tinaok Okay, this was not intended. We added those lines at one point in the Helm chart values, but I thought these were hard limits and that, by default, the values in the configuration options would be used. It seems this is not the case, so I will just remove these changes in pangeo-elastic so that we rely entirely on the options part. This way, the two deployments will have a default of 2 GiB, but you can pass options and go up to 8 GiB in both (and two threads). This is done as follows:

cluster = gateway.new_cluster(worker_memory=8, worker_cores=2)

We can also put a bigger default, like 4 GiB RAM per worker.
You can already try the above code on the pangeo-foss4g instance; it should work and give you the same workers as on pangeo-elastic. This way you'll be able to know whether only the memory limits played a role in the failure. |
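For completeness, the same request can be made through the options object on the client side (the values below are just examples within the configured min/max):

from dask_gateway import Gateway

gateway = Gateway()
options = gateway.cluster_options()    # exposes worker_cores, worker_memory, image
options.worker_cores = 2
options.worker_memory = 8              # GiB, within the configured limits
cluster = gateway.new_cluster(options)
cluster.scale(4)                       # request 4 workers
client = cluster.get_client()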
Yes and yes. Updating the default Docker image and changing the default memory/threads per worker (and limits) can be done with no disruption in a few minutes. I just need to be sure @j34ni does not try to change the deployment name and host name at the same time. We also need to agree on correct values, in:

c.Backend.cluster_options = Options(
    Integer("worker_cores", default=1, min=1, max=4, label="Worker Cores"),
    Float("worker_memory", default=2, min=2, max=8, label="Worker Memory (GiB)"),
    String("image", default="pangeo/pangeo-notebook:2022.09.21", label="Image"),
    handler=options_handler,
)

What should be the default and maximum values? |
Just took the chance; I changed the settings on both infrastructures right now. On pangeo-elastic, I removed the changes mentioned above and set:

c.Backend.cluster_options = Options(
    Integer("worker_cores", default=1, min=1, max=4, label="Worker Cores"),
    Float("worker_memory", default=4, min=2, max=12, label="Worker Memory (GiB)"),
    String("image", default="pangeo/pangeo-notebook:2022.09.21", label="Image"),
    handler=options_handler,
)

This solves the problem of not being able to specify a custom value.

On pangeo-foss4g:

c.Backend.cluster_options = Options(
    Integer("worker_cores", default=1, min=1, max=4, label="Worker Cores"),
    Float("worker_memory", default=4, min=2, max=16, label="Worker Memory (GiB)"),
    String("image", default="pangeo/pangeo-notebook:2022.08.24", label="Image"),
    handler=options_handler,
)

I used a slightly older pangeo-notebook image version to avoid the dask-gateway display widget bug, and a slightly higher maximum value for the per-worker memory limit. |
@guillaumeeb Thanks a lot for the update; the min/max values of pangeo-foss4g are exactly what I would like to have ;-) |
I guess the pangeo-notebook image tag I used is not old enough. We can try an older one if you want. |
The issue with the IM Dashboard has not been resolved yet, so I cannot make any change to the pangeo-foss4g infrastructure (i.e., adding resources). Is it now OK to change the name from pangeo-foss4g to pangeo-clivar? |
I changed the image (on pangeo-foss4g only, since I do not have access to the other infrastructures that @guillaumeeb has set up) to pangeo/pangeo-notebook tag 2022.08.19 (which is very likely the version we had at FOSS4G) and the error disappeared. |
I also did a bit of manual cleaning among the pods left running; there were quite a few of them, and we ought to be careful about the resources available. As a reminder, on the machine this requires finding the name of the dask-scheduler(s) and then issuing a delete command:
That will also delete the related dask-worker(s). If the pod remains indefinitely in the terminating state, add --grace-period=0 --force. |
@j34ni didn't we talk about pangeo-egi? Apart from the name, it's OK on my side. |
👍 |
Thank you Guillaume and Jean. I checked at … I'll come back with the full test. |
Switch from pangeo-foss4g to pangeo-clivar done |
I was under the impression that pangeo-eosc would eventually become the "permanent" name, and that for this CLIVAR workshop something like pangeo-clivar was better suited than pangeo-egi (and hence I took the liberty of renaming it). However, we can always change the name to whatever you want. |
Should we give the new address https://pangeo-clivar.vm.fedcloud.eu/jupyterhub/hub/home to the workshop attendees & mentors? Or do you plan to make additional changes? |
Yes, I think that you can communicate this address. The only "change" that could be made (in terms of infrastructure) would be to add nodes as soon as the blocked IP address problem has been resolved, but this is not something we can fix ourselves, so it may not happen before the workshop. |
It's ok for tests, but as I mentioned over email, we need to think carefully about the VM flavor if we want the workshop to go smoothly. Please see this spreadsheet, and let's maybe discuss it in a separate issue.
I would be very much in favor of fixing the amount of vCPUs/RAM per Dask worker so we have a predictable amount of capacity, to avoid capacity problems. |
This way, the two deployments will have a default of 2GiB, but you can pass options and go up to 8GiB in both (and two threads). Some computations require more memory per Dask worker relative to the number of threads. I think it is preferable to keep this kind of flexibility for optimising resources? |
I understand, thanks! As long as we allocate enough capacity for the maximum amount of requested resources, we should be ok. Let's continue the discussion in #34 |
We can stay with the current version of clivar (ex-foss4g). I used 4 workers, each with … I'll continue running the notebooks at clivar-2022/tutorial/examples/notebooks/ on the clivar infrastructure, and if there are other anomalies I'll get back to you. |
Hi @sebastian-luna-valero @j34ni @guillaumeeb. The tutorial sessions at the CLIVAR bootcamp are all finished!! Thank you very much for all your work!! I would like to understand: if we need to add more nodes, do we need to delete the current https://pangeo-clivar.vm.fedcloud.eu and create a new one? I am starting to get information back from attendees about the size of data we need, and about CMIP6 data which are missing in Google Cloud. I was wondering if some of the missing data are already in some EOSC cloud (if possible in CESNET??) |
Closing as obsolete. |
We have a limited vCPU resource allocation. When we create clusters, even if we do not use them, we are consuming the resource.
As long as we are using 'non-elastic' Kubernetes clusters, we need to control this manually.
I just logged on to the OpenStack dashboard; we have no tutorial sessions going on, but we are using 240 vCPUs now.
I tried to shut them down from the IM Dashboard, but I do not find any cluster in my interface.
How can I shut them down?
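If part of what is still consuming resources is Dask Gateway clusters left running (rather than the VMs themselves), they can be listed and stopped from a notebook; a sketch, assuming the gateway address is already configured in the environment:

from dask_gateway import Gateway

gateway = Gateway()
for report in gateway.list_clusters():     # clusters still running that we are allowed to see
    print(report.name, report.status)
    gateway.stop_cluster(report.name)      # shut it down to free its workers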