Optimise resource #21
@sebastian-luna-valero I think it would be a good idea for the IM Dashboard that all EGI VO admins automatically get access to the clusters that are created. |
This is clearly something we should optimize! We need to define how many resources we want now and in the following weeks/months. How many VMs do we need when no event is running? Then we need to clarify how to do it, and who can do it. We can probably do something through the OpenStack UI too. Of course, ideally we should advance on the elastic Kubernetes setup! |
I believe it is better to do this explicitly (i.e. you choose who to share with) rather than automatically, for security reasons. If you plan to use the elastic cluster, I suggest doing as many tests as possible before the upcoming CLIVAR workshop in October. The main aspects we should consider are to disconnect DaskHub from EGI Check-In, allow the native authentication mechanism, and perform stress tests with fake users. Happy to participate and contribute. Also, as discussed via email, we are happy to offer more computational resources to the upcoming workshops on a different provider. |
You propose to do this in order to check that the "Elastic" functionality works? If so, we can also verify this using Dask Gateway. With Elastic Kubernetes, can we choose to increase the minimum number of VMs before a workshop, through the IM Dashboard or by other means? cc @annefou for the more computational resources on a different provider. Not sure how the workshops overlap in resource needs. This might get tricky to handle if we have two JupyterHub URLs and we need to copy datasets in two places. |
@sebastian-luna-valero @guillaumeeb |
Still, I think it is a good idea to test other infrastructure,
at least 2-4 Dask workers for each student and 1 JupyterLab? |
Hi,
Great, much easier then!
I need to investigate this, and will report back. More importantly, we need to plan and test it before the workshop.
Great. I will start looking for an alternative provider.
According to:
For 30 users, 2 Dask workers per user, and 4 vCPUs per user/worker we get:
Object storage: 10 TB. I will check what's possible and get back to you. Best regards, |
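For reference, a back-of-the-envelope version of that sizing (a sketch only; it assumes 1 JupyterLab session plus 2 Dask workers per user at 4 vCPUs each, and 2 GB RAM per vCPU, the ratio discussed later in this thread):

# Rough capacity estimate implied by the numbers above (assumptions, not official figures)
users = 30
pods_per_user = 1 + 2              # 1 JupyterLab session + 2 Dask workers (assumption)
vcpus_per_pod = 4                  # 4 vCPUs per user session / worker
ram_gb_per_vcpu = 2                # assumed RAM-to-vCPU ratio

total_vcpus = users * pods_per_user * vcpus_per_pod    # 360 vCPUs
total_ram_gb = total_vcpus * ram_gb_per_vcpu           # 720 GB RAM
print(f"{total_vcpus} vCPUs, {total_ram_gb} GB RAM")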
Getting back to the part about optimizing current resources, I have several questions:
|
Hi @guillaumeeb
This error means that you have not correctly defined the credentials to access the cloud site.
If you want to deploy a K8s cluster on an OpenStack site (using OpenStack credentials), you only have to add it in the credentials section of the IM Dashboard, setting the needed authentication data. |
Thanks @micafer !
@guillaumeeb FYI: https://docs.egi.eu/users/compute/orchestration/im/dashboard/#cloud-credentials
I think it was created by @j34ni so he should be able to share the cluster with you.
Great! |
@guillaumeeb |
Thanks @micafer
So I have a cloud credential for CESNET: I used it to deploy my own Kubernetes cluster. However, I didn't have it configured when @j34ni gave me access to the other infrastructure (74ab3cc8-1e2d-11ed-8c48-0ee20d64cb6e). This infrastructure has a status of 'unknown', whereas the one I deployed is 'configured'. I should probably try to achieve a correct status before trying to manage it from the IM Dashboard, but I don't know what to do. Could that come from the fact that I used a different ID for the CESNET cloud provider than @j34ni used? @j34ni, do you see the foss4g infrastructure in a 'configured' status?
@j34ni I don't need more resources currently, it's perfectly fine for me to keep it. I'll send you my credentials by email, but it is not mandatory that I see this cluster. |
Hi everyone, so @j34ni deleted the instance without EGI Check-In. We currently have the pangeo-foss4g instance running with a lot of resources, and a new pangeo-elastic instance that I deployed to test Elastic Kubernetes. The elastic functionality is not working right now. What are the plans for the upcoming workshop? Do you want to:
We could also use the … It should be pretty easy to deploy a new instance with more resources (and user limitations on the dask-gateway side: Dask cluster size limits) if you want. |
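For context, the "Dask cluster size limits" mentioned above are limits applied to each Dask cluster on the dask-gateway server side; a minimal sketch (the numbers are placeholders, not our actual settings):

# dask-gateway server configuration (placeholder values)
c.ClusterConfig.cluster_max_workers = 20      # max workers per Dask cluster
c.ClusterConfig.cluster_max_cores = 40        # max total cores per Dask cluster
c.ClusterConfig.cluster_max_memory = "80 G"   # max total memory per Dask cluster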
Now that we are closer to the workshop, do we have a better estimate of the required capacity?
I am currently struggling to find a new provider in time, so in the end we may simply ask CESNET to increase the available capacity for the CLIVAR workshop, if that's ok. However, the discussion to get a new provider is still on the table, and if it's not available for the CLIVAR workshop we will try to have it for the following one. Therefore, the deployment for the CLIVAR workshop can stay at full capacity for longer, even if there is an overlap with the following workshop in November. Please let me know your thoughts. |
Yes, I think it is OK to stick to CESNET. For the infrastructure, maybe we could just "rename" foss4g to pangeo-eosc or similar and add more resources. I think it takes time to add resources, and the bootcamp starts next week. On my side, I am slightly worried because we know little about the datasets, and the attendees will most likely download many of them during the workshop. Do we have storage like for FOSS4G? They also have a MinIO instance in Denmark, but it may not be very efficient for reading large amounts of data. |
Yes, we have 10 TB of object storage, but write access is only allowed to Pangeo Admins at the moment. Trainees only have read-only access; is that ok? Regarding adding resources, should we try to match the amount of resources requested in #21 (comment), or can you confirm more accurate numbers? |
Ok. Read access should be OK. For writing results and other data, they can use their MinIO. Yes, I think the amount of resources requested in #21 (comment) is OK for this course. Thanks a lot. |
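To illustrate what read-only access could look like for trainees (a sketch only: the endpoint URL, bucket and store names are placeholders, and it assumes the bucket allows anonymous reads, which may not match the actual CESNET setup):

import s3fs
import xarray as xr

# Anonymous, read-only access to a hypothetical public bucket
fs = s3fs.S3FileSystem(anon=True, client_kwargs={"endpoint_url": "https://object-store.example.org"})
ds = xr.open_zarr(fs.get_mapper("workshop-bucket/some-dataset.zarr"))
print(ds)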
Renaming can be tricky, and won't be much faster than rebuilding a fresh infrastructure. If needed, I can deploy a new pangeo-eosc platform tonight or tomorrow. Adding resources doesn't seem to be a problem: on my testing instance it was pretty fast to add a node (a few minutes, less than 10). However, it will need to be validated before we delete the pangeo-foss4g one. On my side, I cannot add resources to the pangeo-foss4g deployment; I don't know if @j34ni can? |
During the workshop, when users use Dask, we will probably need to work with temporary Zarr files. |
I can see the pangeo-elastic infrastructure as red, is that normal? I have added 2 hpc.16core-64ram-ssd-ephem nodes (16 CPUs, 64.0 GB of RAM, 80.0 GB of HD) to pangeo-foss4g and they show as "running" (orange), so I guess it is in the process of working and will hopefully soon be green. If it all turns green, how many more nodes should I add? As for the name, we have a "pangeo-egi" ready and should in principle be able to switch easily without disturbing "pangeo-eosc". |
Is it possible to increase the size of the disk on an existing infrastructure (currently 931 GiB)? Also, on OpenStack I can see a lot of 80 GB volumes which are apparently not in use (they must be leftovers from previous infrastructures and/or VMs!?); can/should we delete them? |
I'm afraid we don't know how to do that currently with CESNET object storage. We should try to advance #17, but I'm not sure we can easily answer this need. If I understand correctly, even with #23, users need to have an account on the EGI Check-In operational service to get write access using either the Swift or S3 interfaces. Is that correct @sebastian-luna-valero? We could generate either a Swift token or S3 credentials and share them, but this won't be very secure, as it means users would be able to delete every bucket we have.
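To illustrate the concern about shared credentials: anyone holding a single shared key pair could both write and delete anything it can see. A sketch with placeholder endpoint, keys and bucket names:

import numpy as np
import s3fs
import xarray as xr

# Shared credentials (placeholders) give full read/write access through the S3 interface
fs = s3fs.S3FileSystem(
    key="SHARED_ACCESS_KEY",
    secret="SHARED_SECRET_KEY",
    client_kwargs={"endpoint_url": "https://object-store.example.org"},
)
ds = xr.Dataset({"x": ("t", np.arange(10))})
ds.to_zarr(fs.get_mapper("workshop-bucket/tmp/test.zarr"), mode="w")   # writing works...
# ...but so do destructive operations, e.g.:
# fs.rm("workshop-bucket", recursive=True)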
Not really. I have had some exchanges with Miguel to try to make Elastic Kubernetes work, and maybe the manipulations done had a side effect; I also see the infrastructure as red, but it's working. I also still see pangeo-foss4g as 'unknown' on my side.
Let's wait for @annefou or @tinaok. I guess if we choose to keep the pangeo-foss4g infrastructure, we'll want to add as many nodes as we can, just leaving a few resources for testing.
I'm not sure what we are talking about here; are these local volumes attached to VMs? How would you want to use those volumes? I don't feel that using local VM storage (if this is what we are talking about) will solve the temporary storage problem. I'm under the impression we need storage accessible from every VM, a shared file system or object storage. |
I was talking about the shared file system |
As long as we can't make the cluster elastic, I think we'd better keep the pangeo-foss4g infra.
Please put in as much as possible, but keep some resources for @guillaumeeb and others to work on the elastic version and maybe Binder tests. |
I have an update on the number of students/mentors. We'll have 22 students and 14 mentors (including Anne and myself).
I agree with increasing the NFS disk space if possible. Some users would try creating Zarr stores with a local Dask cluster. It is not 'optimal' parallel computing, but until we sort out the possibility of creating Zarr stores in object storage, I think it is good to have this solution. |
@j34ni, what I can tell from the OpenStack dashboard is that every VM (13 at the moment) uses 80 GB of local disk space. On the two Kubernetes front nodes, the TOSCA template also mounts another volume which, according to the documentation on the IM Dashboard, is used to store Kubernetes Persistent Volumes. Those persistent volumes are disk space that can be requested by pods, and are used for example by JupyterHub to get a persistent volume on each user's Jupyter notebook pod, mounted on /home/jovyan. This is the 931 GiB volume you're talking about. However, this space is not mounted, and so not visible, on the Dask worker pods created by dask-gateway, and I'm not sure if mounting it there is feasible. So this is a space that is shared between users, but not shared in the sense of distributed computing; it is more a space that is kept between Jupyter sessions. As @tinaok said, we'll only be able to use this space with Dask LocalClusters, and I'm not sure what performance we can get if many users work on it at the same time.
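For reference, a minimal sketch of the LocalCluster workaround described above, writing a temporary Zarr store under the user's home volume (the dataset and path below are only examples):

import numpy as np
import xarray as xr
from dask.distributed import LocalCluster, Client

# LocalCluster workers run inside the notebook pod, so they can see /home/jovyan
cluster = LocalCluster(n_workers=4, threads_per_worker=1)
client = Client(cluster)

ds = xr.Dataset({"x": ("t", np.arange(1_000_000))}).chunk({"t": 100_000})
ds.to_zarr("/home/jovyan/tmp_store.zarr", mode="w")   # lands on the shared persistent volume

client.close()
cluster.close()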
So I guess we'll use …
For my tests, I'd say 32 cores and 128 GiB is enough! And I already use 24 cores. |
Thank you @guillaumeeb
I totally agree about the performance issue. I'm trying to run some notebooks from … |
Students will be formed into working groups (each with 3-4 students, i.e. about 6-7 working groups). They should be sharing the same datasets to work on. So I guess we can create a public cloud bucket for each working group, and we need to give them full access (read/write)? |
(or maybe we should create a new issue for organising S3 disk space for the CLIVAR workshop?) |
I can see no change since yesterday on the IM Dashboard (everything is still red); however, on OpenStack it seems that the project is now allocated 1736 vCPUs. Is it a bug? It would be great if it were true and if we could actually use them, though... |
I just benchmarked with the same CMIP6 notebook
I used 4 Dask workers (32 GB) with elastic and 26 Dask workers (52 GB) with foss4g; elastic could handle the work, but foss4g failed. According to @keewis, Dask had a big update recently, and that probably plays a role; I also think a Dask worker with just 2 GB RAM is small for heavy duty. Can the foss4g cluster be updated to a recent pangeo-notebook Docker image while keeping all the data on it? (I mean the data on NFS) |
Interesting. I was just discussing the quotas with CESNET, and in #21 (comment) 2 GB RAM per vCPU was requested. Let's see if they can provide 4 GB RAM per vCPU instead.
CESNET may need some additional time to do the checks. |
@tinaok sorry but I would like to clarify what configuration worked for you:
Could you please confirm this info? |
Thank you @sebastian-luna-valero
The benchmark shows a DaskHub configuration problem: foss4g has an older Dask version plus a small Dask worker configuration, so the cluster for CLIVAR needs to be re-created or foss4g needs an update... @guillaumeeb @j34ni, is recreating or updating possible for you before the workshop starts? |
I benchmarked using just one JupyterLab.
|
@sebastian-luna-valero I used smaller instances for my test of Elastic Kubernetes; I guess this is OK in this case?
@tinaok Okay, this was not intended. We added those lines at one point in the Helm chart values, but I thought these were hard limits and that, by default, the values in the configuration options would be used. It seems this is not the case, so I will just remove these changes in pangeo-elastic so that we rely entirely on the options part. This way, the two deployments will have a default of 2 GiB, but you can pass options and go up to 8 GiB in both (and two threads). This is done as follows:

cluster = gateway.new_cluster(worker_memory=8, worker_cores=2)

We can also put a bigger default, like 4 GiB RAM per worker.
You can already try the above code on the pangeo-foss4g instance; it should work and give you the same workers as on pangeo-elastic. This way you'll be able to know whether only the memory limits played a role in the failure. |
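For completeness, the same request can be made through the options object on the client side (the values below are just examples within the configured min/max):

from dask_gateway import Gateway

gateway = Gateway()
options = gateway.cluster_options()    # exposes worker_cores, worker_memory, image
options.worker_cores = 2
options.worker_memory = 8              # GiB, within the configured limits
cluster = gateway.new_cluster(options)
cluster.scale(4)                       # request 4 workers
client = cluster.get_client()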
Yes and yes. Updating the default Docker image and changing the default memory/threads per worker (and limits) can be done with no disruption in a few minutes. I just need to be sure @j34ni does not try to change the deployment name and host name at the same time. We also need to agree on correct values, in:

c.Backend.cluster_options = Options(
    Integer("worker_cores", default=1, min=1, max=4, label="Worker Cores"),
    Float("worker_memory", default=2, min=2, max=8, label="Worker Memory (GiB)"),
    String("image", default="pangeo/pangeo-notebook:2022.09.21", label="Image"),
    handler=options_handler,
)

What should be the default and maximum values? |
Just took the chance; I changed the settings on both infrastructures right now. On pangeo-elastic, I removed the changes mentioned above and set:

c.Backend.cluster_options = Options(
    Integer("worker_cores", default=1, min=1, max=4, label="Worker Cores"),
    Float("worker_memory", default=4, min=2, max=12, label="Worker Memory (GiB)"),
    String("image", default="pangeo/pangeo-notebook:2022.09.21", label="Image"),
    handler=options_handler,
)

This solves the problem of not being able to specify a custom value.

On pangeo-foss4g:

c.Backend.cluster_options = Options(
    Integer("worker_cores", default=1, min=1, max=4, label="Worker Cores"),
    Float("worker_memory", default=4, min=2, max=16, label="Worker Memory (GiB)"),
    String("image", default="pangeo/pangeo-notebook:2022.08.24", label="Image"),
    handler=options_handler,
)

I used a slightly older pangeo-notebook image version to avoid the dask-gateway display widget bug, and a slightly higher maximum value for the per-worker memory limit. |
@guillaumeeb Thanks a lot for the update; the min/max values of pangeo-foss4g are exactly what I would like to have ;-) |
I guess the pangeo-notebook image tag I used is not old enough. We can try an older one if you want. |
The issue with the IM Dashboard has not been resolved yet, so I cannot make any change to the pangeo-foss4g infrastructure (i.e., adding resources). Is it now OK to change the name from pangeo-foss4g to pangeo-clivar? |
I changed the image (on pangeo-foss4g only, since I do not have access to the other infrastructures that @guillaumeeb has set up) to pangeo/pangeo-notebook tag 2022.08.19 (which is very likely the version we had at FOSS4G) and the error disappeared. |
I also did a bit of manual cleaning among the pods left running; there were quite a few of them, and we ought to be careful about the resources available. As a reminder, on the machine this requires finding the name of the dask-scheduler(s) and then issuing a delete command:
That will also delete the related dask-worker(s). If the pod remains indefinitely in the terminating state, add --grace-period=0 --force. |
@j34ni didn't we talk about pangeo-egi? Apart from the name, it's OK on my side. |
👍 |
Thank you Guillaume and Jean. I checked at … I'll come back with the full test. |
Switch from pangeo-foss4g to pangeo-clivar done |
I was under the impression that pangeo-eosc would eventually become the "permanent" name, and that for this CLIVAR workshop something like pangeo-clivar was better suited than pangeo-egi (and hence I took the liberty of renaming it). However, we can always change the name to whatever you want. |
Should we give the new address https://pangeo-clivar.vm.fedcloud.eu/jupyterhub/hub/home to the workshop attendees & mentors? Or do you plan to make additional changes? |
Yes, I think that you can communicate this address. The only "change" that could be made (in terms of infrastructure) would be to add nodes as soon as the blocked IP address problem has been resolved, but this is not something we can fix ourselves, so it may not happen before the workshop. |
It's ok for tests, but as I mentioned over email, we need to think carefully about the VM flavor if we want the workshop to go smoothly. Please see this spreadsheet, and let's maybe discuss it in a separate issue.
I would be very much in favor of fixing the amount of vCPUs/RAM per Dask worker so we have a predictable amount of capacity, to avoid capacity problems. |
This way, the two deployments will have a default of 2GiB, but you can pass options and go up to 8GiB in both (and two threads). Some computations require more memory per Dask worker relative to the number of threads. I think it is preferable to keep this kind of flexibility for optimising resources? |
I understand, thanks! As long as we allocate enough capacity for the maximum amount of requested resources, we should be ok. Let's continue the discussion in #34 |
We can stay with the current version of clivar (ex-foss4g). I used 4 workers, each with … I'll continue running the notebooks at clivar-2022/tutorial/examples/notebooks/ on the clivar infrastructure, and if there are other anomalies I'll get back to you. |
Hi @sebastian-luna-valero @j34ni @guillaumeeb. The tutorial sessions at the CLIVAR bootcamp are all finished!! Thank you very much for all your work!! I would like to understand: if we need to add more nodes, do we need to delete the current https://pangeo-clivar.vm.fedcloud.eu and create a new one? I am starting to get information back from attendees about the size of data we need, and about CMIP6 data which are missing in Google Cloud. I was wondering if some of the missing data are already in some EOSC cloud (if possible in CESNET??) |
Closing as obsolete. |
We have a limited vCPU resource allocation. When we create clusters, even if we do not use them, we are consuming the resource.
As long as we are using 'non-elastic' Kubernetes clusters, we need to control this manually.
I just logged on to the OpenStack dashboard; we have no tutorial sessions going on, but we are using 240 vCPUs now.
I tried to shut them down from the IM Dashboard, but I do not find any cluster in my interface.
How can I shut them down?
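If part of what is still consuming resources is Dask Gateway clusters left running (rather than the VMs themselves), they can be listed and stopped from a notebook; a sketch, assuming the gateway address is already configured in the environment:

from dask_gateway import Gateway

gateway = Gateway()
for report in gateway.list_clusters():     # clusters still running that we are allowed to see
    print(report.name, report.status)
    gateway.stop_cluster(report.name)      # shut it down to free its workers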