Check archived jobs for last known job number before creating cluster #738

Open · wants to merge 1 commit into master
Conversation

timmartin-stripe
Contributor

We ran into a bug where we deleted a job cluster and then recreated a job cluster with the same name. The old job cluster had four stages and the new one had two. When a job completed, it wrote to the archived tables, but a job with the same ID already existed there. The KV provider only overwrote the rows for stages 1 and 2; it did not delete the values for stages 3 and 4.

When Mantis tried to load the archived job, it would see job metadata indicating two stages but would receive four stage records (two from the new job plus the two stale rows left over from the old job). This caused Mantis to fail to load the job.

We could arguably consider this a bug in the Dynamo KV provider, but we don't want to overwrite archived jobs in any scenario, since we'd like to maintain a record of those jobs. Instead, the root problem is further upstream. When we create a job, we should be reasonably confident that the job ID is globally unique. However, when creating a job cluster, the lastJobCount value is always set to 0. We should instead check whether there are any archived jobs with the same cluster name and, if so, use the highest archived job number as the last known job number.
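The proposed initialization can be sketched as follows. This is a minimal, self-contained model of the idea, not the Mantis code: the class and helper names are hypothetical, and the only assumption carried over from this PR is the "ClusterName-N" job-ID convention.

```java
import java.util.List;

public class NextJobNumber {
    // Parse the trailing numeric index from a job ID such as "MyJob-12".
    static long jobNumber(String jobId) {
        int dash = jobId.lastIndexOf('-');
        return Long.parseLong(jobId.substring(dash + 1));
    }

    // Given the archived job IDs for a cluster, return the last known job
    // number. Returns 0 when there is no archive, matching today's default.
    static long lastKnownJobNumber(List<String> archivedJobIds) {
        return archivedJobIds.stream()
                .mapToLong(NextJobNumber::jobNumber)
                .max()
                .orElse(0L);
    }

    public static void main(String[] args) {
        // A recreated "MyJob" cluster whose archive holds MyJob-1 and MyJob-3
        // would start counting at 3, so the next job becomes MyJob-4.
        System.out.println(lastKnownJobNumber(List.of("MyJob-1", "MyJob-3")));
    }
}
```

With this seed value, the new cluster's first job gets a number strictly greater than any archived job, so no archived row is ever overwritten.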

We want the following behavior:

  1. Create a job cluster "MyJob"
  2. Create a job "MyJob-1"
  3. Delete the job and job cluster
  4. Create another job cluster "MyJob"
  5. Create a job "MyJob-2" instead of "MyJob-1"

Previously, we would have an archived job "MyJob-1" and an active job "MyJob-1" that are distinct. Stopping the active one would overwrite the archived one.


Checklist

  • ./gradlew build compiles code correctly
  • Added new tests where applicable
  • ./gradlew test passes all tests
  • Extended README or added javadocs where applicable

// e.g. If a job cluster with the name "MyJobCluster" was deleted, but had run a job, then
// we'd have a Job ID of `MyJobCluster-1` in the archived jobs table. When the new job cluster
// came online it may partially overwrite the old archived job if we started a new job with ID MyJobCluster-1.
List<CompletedJob> completedJobs = jobStore.loadCompletedJobsForCluster(request.getJobClusterDefinition().getName(), 1, null);
Collaborator
My main concern here is that this loadCompletedJobsForCluster call essentially does a getAll (the limit is applied only after all rows are fetched), while this creation logic sits on the API response path with a default 1-second timeout. The costly getAll can therefore cause timeouts and other errors here.
What about using a different call that loads only the single row needed here instead of the full partition?
Another alternative would be to add a flag to the create request making "globally unique job ID" an optional argument, so we can still disable this call if needed.
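The reviewer's flag-based alternative could look something like the sketch below. Everything here is hypothetical (the flag name, method signature, and the use of a LongSupplier to stand in for the archive query are not from the Mantis codebase); the point is only that the expensive lookup runs solely when the caller opts in.

```java
import java.util.function.LongSupplier;

public class CreateClusterSketch {
    // When the flag is off, keep today's behavior (seed the counter at 0) and
    // never touch the archive; when it is on, pay for the archive lookup.
    static long resolveLastJobNumber(boolean globallyUniqueJobIds,
                                     LongSupplier archivedJobNumberLookup) {
        return globallyUniqueJobIds ? archivedJobNumberLookup.getAsLong() : 0L;
    }

    public static void main(String[] args) {
        // Opted in: the archive is consulted for the highest job number.
        System.out.println(resolveLastJobNumber(true, () -> 7L));
        // Opted out: the supplier is never invoked, so no extra read happens.
        System.out.println(resolveLastJobNumber(false, () -> 7L));
    }
}
```

Because the supplier is only evaluated when the flag is set, clusters that don't need globally unique job IDs avoid the full-partition read entirely.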
