luigi.target.FileAlreadyExists: Destination exists on HTCondor #204

Open
nrad opened this issue Oct 26, 2023 · 3 comments
Labels
bug Something isn't working help wanted Extra attention is needed htcondor concerns the htcondor batch system

Comments


nrad commented Oct 26, 2023

I'm not sure whether others also experience this issue, but when I run a large set of jobs on HTCondor, there often seems to be a glitch somewhere in the scheduler: jobs are resubmitted even though the task has already finished successfully and the output file is there.
From digging into the condor log files it seems this can happen multiple times.

Could this somehow be an artifact of using "on_temporary", where a task is (for some reason) submitted multiple times?
For example, the parent task doesn't find the expected output, even though the job might be running at that time with a temporary output file, so it submits the child task again?

Although, from digging into the logs, it seems the redundant jobs can also be resubmitted after the first job has finished, which is also odd.

In the end this is not super harmful, since it just ends with some tasks that superficially appear to have failed while their output is fine. However, it wastes resources, makes the HTCondor queues longer for no reason, and leaves extra "-luigi-tmp-" files that need to be cleaned up.
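To illustrate how a redundant run of the same task could produce both symptoms described above (the "Destination exists" error and a leftover "-luigi-tmp-" file), here is a minimal, stdlib-only sketch of the write-to-temporary-then-rename pattern. It is not b2luigi's or Luigi's actual code: `DestinationExists`, `run_job`, `demo_duplicate_run`, and the file names are hypothetical stand-ins chosen for this illustration.

```python
from __future__ import annotations

import os
import tempfile
import uuid


class DestinationExists(Exception):
    """Hypothetical stand-in for luigi.target.FileAlreadyExists."""


def run_job(final_path: str, payload: str) -> None:
    """Write to a temporary file, then move it to final_path if that path is free."""
    tmp_path = f"{final_path}-luigi-tmp-{uuid.uuid4().hex[:10]}"
    with open(tmp_path, "w") as handle:
        handle.write(payload)
    # A redundant job reaches this point after the first one has already finished.
    if os.path.exists(final_path):
        raise DestinationExists(f"Destination exists: {final_path}")
    os.rename(tmp_path, final_path)


def demo_duplicate_run() -> tuple[bool, list[str]]:
    """Run the same job twice; report whether the second run failed and what is left over."""
    with tempfile.TemporaryDirectory() as workdir:
        final_path = os.path.join(workdir, "output.root")
        run_job(final_path, "result")  # first submission succeeds
        second_failed = False
        try:
            run_job(final_path, "result")  # redundant resubmission
        except DestinationExists:
            second_failed = True
        leftovers = sorted(n for n in os.listdir(workdir) if "-luigi-tmp-" in n)
        return second_failed, leftovers
```

In this sketch the second run is reported as failed although the output of the first run is intact, and its temporary file stays behind, which matches the observation that the failures are only superficial.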

@meliache meliache added the htcondor concerns the htcondor batch system label Oct 26, 2023
Collaborator

meliache commented Oct 26, 2023

This is something I had seen when restarting the b2luigi job submission script. b2luigi then doesn't resubmit jobs for which the outputs already exist, but it does not recognize when a particular job is currently running. Actually, this could be solved by saving a cache with a mapping of b2luigi IDs to HTCondor job IDs. I started implementing that in #167 but never got around to finishing it.
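The cache idea mentioned above could look roughly like the following sketch. This is not the implementation from #167; `JobIdCache`, its methods, the file name, and the example IDs are all hypothetical, and the sketch only covers persisting the mapping between runs of the submission script.

```python
from __future__ import annotations

import json
import os
import tempfile
from typing import Dict, Optional


class JobIdCache:
    """Hypothetical JSON-backed map from a task ID to a batch job ID."""

    def __init__(self, path: str) -> None:
        self.path = path
        self._jobs: Dict[str, str] = {}
        # Reload the mapping left behind by a previous run, if any.
        if os.path.exists(path):
            with open(path) as handle:
                self._jobs = json.load(handle)

    def record(self, task_id: str, job_id: str) -> None:
        """Remember which batch job was submitted for a task and persist it."""
        self._jobs[task_id] = job_id
        # Write to a temporary file first so a crash cannot truncate the cache.
        tmp_path = self.path + ".tmp"
        with open(tmp_path, "w") as handle:
            json.dump(self._jobs, handle)
        os.replace(tmp_path, self.path)

    def lookup(self, task_id: str) -> Optional[str]:
        """Return the recorded batch job ID, or None if the task was never submitted."""
        return self._jobs.get(task_id)


def demo_cache_restart() -> tuple[Optional[str], Optional[str]]:
    """Record a job, simulate a restart, and look the job up again."""
    with tempfile.TemporaryDirectory() as workdir:
        cache_file = os.path.join(workdir, "job_ids.json")
        JobIdCache(cache_file).record("MyTask_abc123", "1234.0")
        restarted = JobIdCache(cache_file)  # fresh object, as after a restart
        return restarted.lookup("MyTask_abc123"), restarted.lookup("OtherTask_def456")
```

After a restart, the submission script could consult `lookup` and ask HTCondor (for example via `condor_q`) whether that job is still queued or running before submitting it again.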

BUT, do I understand you correctly that you get that error during a single submission process, i.e. when not rerunning the submission script? That would be weird. And you're sure it's from different tasks with different b2luigi parameter values, and not different tasks having the same parameters and thus the same output by accident?

If so, this would definitely be weird. I would believe Luigi recognizes tasks by their task_id and makes sure to avoid duplicates. But yeah, if the task finishes and the renaming from on_temporary is not done yet, maybe that could cause a race condition or something?

Author

nrad commented Oct 27, 2023

Correct, I do not rerun the submission script in this case, or if I do, it's after all htcondor jobs have finished.
I can try again with some simpler tasks, to make sure it's not different tasks with the same output file (I highly doubt it, because that would be more deterministic).

But in htcondor there is a mechanism where b2luigi resubmits the failed jobs no?
I remember somewhere there was an option for setting the number of resubmissions, although I can't find it now.

@meliache
Collaborator

> But in htcondor there is a mechanism where b2luigi resubmits the failed jobs no?
> I remember somewhere there was an option for setting the number of resubmissions, although I can't find it now.

There is a Luigi option for retrying tasks, the retry_count option. It's not HTCondor- or b2luigi-specific; it works for any kind of Luigi task. See the Luigi Configuration docs. You can set it globally as a Luigi option via a luigi.toml file in the correct location (see docs above). E.g. mine has the contents:

[worker]
retry_count = 2 
ping_interval = 5  # time interval for pinging workers whether they are alive in s, default 1s
count_uniques = true
keep_alive = true

[scheduler]
disable_window_seconds = 900  # 15 minutes, failures within this window are not retried

You can also set it per task with a .retry_count class attribute (also see docs above).

Regarding the actual bug: sadly, I currently don't really have time to work on this, and I'm not paid by any Belle II institution anymore. If it's not deterministic, I could imagine a race condition: the HTCondor worker finishes, but the temporary file has not been renamed yet. While the rename is in progress, the scheduler notices that the worker is no longer alive, checks for the final output, sees it's not there, and reschedules the task for a retry (if that option is enabled). Maybe I hadn't seen the issue that often because I manually increased the ping_interval (see config above), which might make those race conditions less likely.
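The ordering problem speculated about above can be made concrete with a small deterministic simulation. It only replays the two possible orderings of "scheduler checks for the output" and "temporary file is renamed"; it is not evidence that b2luigi behaves this way, and `simulate` and all file names are made up for the illustration.

```python
from __future__ import annotations

from typing import List, Set


def simulate(check_before_rename: bool) -> List[str]:
    """Replay one ordering of the output check and the rename; return the event log."""
    files: Set[str] = {"output-luigi-tmp-123"}  # the job wrote its temporary output
    events: List[str] = ["job process exited"]

    def scheduler_check() -> None:
        # The scheduler only looks for the final output name.
        if "output" in files:
            events.append("scheduler: output found, task done")
        else:
            events.append("scheduler: output missing, task rescheduled")

    def rename() -> None:
        # Moving the temporary file to its final name completes the task.
        files.remove("output-luigi-tmp-123")
        files.add("output")
        events.append("temporary file renamed")

    steps = [scheduler_check, rename] if check_before_rename else [rename, scheduler_check]
    for step in steps:
        step()
    return events
```

With `check_before_rename=True` the task is rescheduled even though its output appears a moment later; with `False` it is marked done. A longer interval between checks would make the first ordering less likely without ruling it out.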

But this is just speculation. Can you check whether any retries actually happened? Did you enable retries somewhere (they are off by default)? Also, you could log the b2luigi output to see whether any jobs failed, because retries only happen if Luigi thinks that a job failed. To log the b2luigi scheduler output you can e.g. use something like

python3 b2luigi_steering_file.py &> b2luigi_output.log &   tail -f b2luigi_output.log

Also, do you have any reason to believe this might be a recent regression? In #186 I changed how Luigi iterates through tasks, but I don't think that could have caused this. That PR was supposed to eliminate scheduling of duplicate tasks, and I unit-tested the code with simple examples. Furthermore, if there were a bug in it, it should be deterministic and not HTCondor-specific 🤔

@meliache meliache added bug Something isn't working help wanted Extra attention is needed labels Oct 31, 2023