Refresh instead of timing out #15
Comments
I'm thinking of not relying on Gradio's loading for the training process; I don't think it's suitable for tasks that last minutes or hours. You can't monitor the progress from multiple devices, and you can't hook back into the training progress once the page is closed or disconnected, so you have to rely on the terminal to monitor or abort it. Instead, we can put the training into a subprocess, run it in the background, and let the UI poll for its status, which would let us see and control the progress from multiple devices. We'd have to craft a loading UI and block other features, such as inference, during fine-tuning, though. Another thing I want to do is add CLI support, so I can run long fine-tuning jobs on SkyPilot's managed spot instances, or terminate the machine automatically after fine-tuning ends to save cost.
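To make the idea concrete, here is a rough sketch of the subprocess-plus-polling approach. It is not the project's actual code: `TrainingJob`, its method names, and the status strings are all hypothetical, and the training command is whatever script you would otherwise run from the terminal.

```python
import subprocess
import sys


class TrainingJob:
    """Hypothetical wrapper that runs a training command in a background subprocess."""

    def __init__(self, cmd):
        self.cmd = cmd          # e.g. [sys.executable, "train.py", ...] (assumed script name)
        self.proc = None
        self.aborted = False

    def start(self):
        # Refuse to start a second job while one is still running.
        if self.proc is not None and self.proc.poll() is None:
            raise RuntimeError("a training job is already running")
        self.aborted = False
        self.proc = subprocess.Popen(self.cmd)

    def status(self):
        # Non-blocking, so any UI client can call this on a timer.
        if self.proc is None:
            return "idle"
        code = self.proc.poll()
        if code is None:
            return "running"
        if self.aborted:
            return "aborted"
        return "finished" if code == 0 else "failed"

    def abort(self, timeout=5.0):
        # Ask the process to stop, then force-kill it if it does not exit in time.
        if self.proc is not None and self.proc.poll() is None:
            self.aborted = True
            self.proc.terminate()
            try:
                self.proc.wait(timeout=timeout)
            except subprocess.TimeoutExpired:
                self.proc.kill()
                self.proc.wait()
```

Because the job state lives in the server process rather than in one browser tab, any device that loads the page can poll `status()` or call `abort()`, and the UI can disable inference while `status()` returns `"running"`.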
Nice, let me know how I can help!
I just implemented it on the I'll merge it back to The current known issue is that some steps, such as loading the base model or mapping the training dataset, can't be aborted immediately by clicking the abort button in the UI; the abort only takes effect once that step finishes.
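One way to picture why the abort is delayed is a cooperative abort flag that is only checked between steps. This is a sketch under that assumption, not a description of how the branch actually works; `run_steps`, `AbortRequested`, and the step names are made up for illustration.

```python
import threading


class AbortRequested(Exception):
    """Raised when an abort was requested before the next step could start."""


def run_steps(steps, abort_event):
    """Run (name, callable) steps in order, checking the abort flag between steps.

    A step that is already running (for example, loading the base model) is not
    interrupted; the abort takes effect only once that step returns.
    """
    completed = []
    for name, step in steps:
        if abort_event.is_set():
            raise AbortRequested(f"aborted before step {name!r}")
        step()
        completed.append(name)
    return completed
```

With this pattern, clicking abort during a long blocking step sets the flag right away, but nothing happens until the step returns and the next check runs.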
My current training run takes 35 hours, so it will time out unless we refresh or increase the timeout substantially.