luigi.target.FileAlreadyExists: Destination exists on HTCondor #204
Comments
This is something I had seen when restarting the b2luigi job submission script. b2luigi then doesn't resubmit jobs for which the outputs already exist, but it doesn't recognize when a particular job is still running, so that job gets submitted again. This could be solved by saving a cache with a mapping of b2luigi IDs to HTCondor job IDs. I started implementing that in #167 but never got around to finishing it. BUT, do I understand you correctly that you get that error during a single submission process, i.e. when not rerunning the submission script? That would be weird. And you're sure it's from different tasks with different b2luigi parameter values, and not different tasks that have the same parameters and thus the same output by accident? If so, this would definitely be weird. I would believe Luigi recognizes tasks by their |
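To make the cache idea above concrete, here is a minimal sketch in plain Python. It is not the implementation from #167 and does not use any b2luigi API: `JobIdCache`, `should_submit`, and the `job_is_active` callback are hypothetical names introduced only for illustration, and the JSON file layout is an assumption.

```python
import json
from pathlib import Path


class JobIdCache:
    """Hypothetical on-disk mapping of task IDs to batch job IDs (JSON file)."""

    def __init__(self, path):
        self.path = Path(path)
        # Load an existing cache so the mapping survives a restart of the script.
        if self.path.exists():
            self._mapping = json.loads(self.path.read_text())
        else:
            self._mapping = {}

    def get(self, task_id):
        return self._mapping.get(task_id)

    def set(self, task_id, job_id):
        self._mapping[task_id] = job_id
        self._flush()

    def discard(self, task_id):
        self._mapping.pop(task_id, None)
        self._flush()

    def _flush(self):
        # Write to a sibling file first, then replace, to avoid a half-written cache.
        tmp = self.path.with_suffix(self.path.suffix + ".tmp")
        tmp.write_text(json.dumps(self._mapping, indent=2))
        tmp.replace(self.path)


def should_submit(cache, task_id, output_exists, job_is_active):
    """Return True only if the task needs a new batch job.

    `job_is_active` is a caller-supplied callable (an assumption of this
    sketch) that reports whether a batch job ID is still queued or running.
    """
    if output_exists:
        return False
    job_id = cache.get(task_id)
    if job_id is not None and job_is_active(job_id):
        return False
    return True
```

With such a cache, a restarted submission script could skip tasks whose job is still queued or running instead of submitting them a second time. How the job status is actually queried is left to the `job_is_active` callback.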
Correct, I do not rerun the submission script in this case, or if I do, it's after all HTCondor jobs have finished. But isn't there a mechanism for HTCondor where b2luigi resubmits the failed jobs? |
There is a Luigi option for retrying tasks, via the configuration:
[worker]
retry_count = 2
ping_interval = 5  # interval in seconds for pinging workers to check whether they are alive, default 1 s
count_uniques = true
keep_alive = true
[scheduler]
disable_window_seconds = 900  # 15 minutes, failures within this window are not retried
You can also set it per-task with a
|
I'm not sure whether others also experience this issue, but when I run a large set of jobs on HTCondor, there often seems to be a glitch somewhere in the scheduler: jobs are resubmitted even though the task has already finished successfully and the output file is there.
From digging into the HTCondor log files, it seems this can happen multiple times.
Could this somehow be an artifact of using "on_temporary", where a task is (for some reason) submitted multiple times?
For example, the parent task doesn't find the expected output, even though the job might be running at that time with a temporary output file, so it submits the child task again?
However, from digging into the logs, it seems that the redundant jobs can also be resubmitted after the first job has finished, which is also odd.
In the end this is not very harmful, since it just results in some tasks that superficially appear to have failed while their output is fine. However, it wastes resources, can make the HTCondor queues longer for no reason, and leaves extra "-luigi-tmp-" files that need to be cleaned up.
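The failure mode described above can be illustrated with a small, self-contained sketch. It only models the pattern "write to a temporary path, then move to the final path without overwriting". It is not Luigi's or b2luigi's actual code. The local `FileAlreadyExists` class is a stand-in for `luigi.target.FileAlreadyExists`, and `write_via_temporary`, `simulate_duplicate_jobs`, and the temporary-name scheme are assumptions made for illustration.

```python
import os
import uuid


class FileAlreadyExists(Exception):
    """Stand-in for luigi.target.FileAlreadyExists in this sketch."""


def write_via_temporary(final_path, content):
    """Write to a temporary sibling path, then move it to final_path.

    The move refuses to overwrite an existing destination, so a redundant
    writer that finishes later fails and leaves its temporary file behind.
    """
    tmp_path = f"{final_path}-luigi-tmp-{uuid.uuid4().hex[:10]}"
    with open(tmp_path, "w") as handle:
        handle.write(content)
    if os.path.exists(final_path):
        raise FileAlreadyExists(f"Destination exists: {final_path}")
    os.rename(tmp_path, final_path)


def simulate_duplicate_jobs(workdir):
    """Run the same 'task' twice, as two redundant batch jobs would."""
    final_path = os.path.join(workdir, "output.txt")
    outcomes = []
    for job in ("job-1", "job-2"):
        try:
            write_via_temporary(final_path, f"written by {job}\n")
            outcomes.append("ok")
        except FileAlreadyExists:
            outcomes.append("failed")
    # Temporary files left over from the job that lost the race.
    leftovers = sorted(n for n in os.listdir(workdir) if "-luigi-tmp-" in n)
    return outcomes, leftovers
```

In this model the first job produces a valid output, the redundant job is reported as failed, and one "-luigi-tmp-" file is left over. That matches the symptoms described in the report, although it does not explain why the job was submitted twice in the first place.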