Check archived jobs for last known job number before creating cluster #738
+87
−52
We ran into a bug where we deleted a job cluster and then recreated it with the same name. The old job cluster had 4 stages and the new one had 2. When a job completed, it was written to the archived tables. However, an entry with the same job ID already existed there from the old cluster. The KV provider only overwrote the rows for stages 1 and 2. It did not delete the rows for stages 3 and 4.
When Mantis tried to load the archived job, it would see job metadata indicating 2 stages, but would then receive 4 stages (stages 1 and 2 from the new job, plus the leftover stages 3 and 4 from the old job). This mismatch caused Mantis to fail to load the job.
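The stale-row behavior described above can be reproduced with a small sketch. This is not the Dynamo KV Provider's code: the in-memory map, the row-key layout, and the `archiveStages` helper are all hypothetical stand-ins, assuming only that archiving upserts one row per stage and never deletes rows.

```java
import java.util.Map;
import java.util.TreeMap;

final class StaleStageRows {
    // Hypothetical stand-in for the archived-stages table: row key -> payload.
    private final Map<String, String> archivedStageRows = new TreeMap<>();

    // Upserts one row per stage; rows for higher stage numbers are left untouched.
    void archiveStages(String jobId, int stageCount, String payload) {
        for (int stage = 1; stage <= stageCount; stage++) {
            archivedStageRows.put(jobId + "/stage-" + stage, payload);
        }
    }

    // Counts the stage rows a loader would read back for the given job ID.
    int storedStageCount(String jobId) {
        int count = 0;
        for (String key : archivedStageRows.keySet()) {
            if (key.startsWith(jobId + "/stage-")) {
                count++;
            }
        }
        return count;
    }

    // Archives a 4-stage job, then a 2-stage job that reuses the same job ID.
    static int storedStagesAfterReuse() {
        StaleStageRows table = new StaleStageRows();
        table.archiveStages("MyJob-1", 4, "old cluster");
        table.archiveStages("MyJob-1", 2, "new cluster");
        return table.storedStageCount("MyJob-1");
    }

    public static void main(String[] args) {
        // The job metadata says 2 stages, but 4 stage rows are still stored.
        System.out.println(storedStagesAfterReuse());
    }
}
```

In this sketch the second archive only replaces the first two rows, so the stored stage count no longer matches the stage count in the job metadata.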
We could probably consider this a bug in the Dynamo KV Provider, but we don't want to overwrite archived jobs in any scenario, since we'd like to maintain a record of those jobs. Instead, the problem is further upstream. When we create a job, we should be reasonably confident that the Job ID is globally unique. However, when creating a job cluster, the lastJobCount value is always set to 0. We should instead check whether there are any archived jobs with the same cluster name. If so, we should take the highest archived job number and set it as the last known job number.
We desire the following scenario
Previously, we could have an archived job "MyJob-1" and an active job "MyJob-1" that were distinct jobs sharing one ID. Stopping the active one would overwrite the archived one.
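One way to picture the proposed check is the sketch below. It is an illustration only, not the code in this pull request: the `lastKnownJobNumber` helper is hypothetical, and it assumes job IDs have the form `<clusterName>-<number>`, as in "MyJob-1". How the archived job IDs are actually fetched from storage is left out.

```java
import java.util.List;

final class LastJobNumber {
    // Returns the highest job number among archived jobs of the given cluster,
    // or 0 when the cluster has no archived jobs.
    static long lastKnownJobNumber(String clusterName, List<String> archivedJobIds) {
        String prefix = clusterName + "-";
        long last = 0;
        for (String jobId : archivedJobIds) {
            if (!jobId.startsWith(prefix)) {
                continue; // job of an unrelated cluster
            }
            String suffix = jobId.substring(prefix.length());
            try {
                last = Math.max(last, Long.parseLong(suffix));
            } catch (NumberFormatException e) {
                // Suffix is not a plain number, e.g. "MyJob-Canary-7" when the
                // cluster is "MyJob": it belongs to a different cluster, so skip it.
            }
        }
        return last;
    }

    public static void main(String[] args) {
        List<String> archived = List.of("MyJob-1", "MyJob-3", "Other-9");
        // A recreated "MyJob" cluster would start from this value instead of 0.
        System.out.println(lastKnownJobNumber("MyJob", archived));
    }
}
```

Seeding the new cluster with this value means the next job gets a number above every archived one, so an active job no longer shares an ID with an archived job.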
Checklist
./gradlew build compiles code correctly
./gradlew test passes all tests