Check archived jobs for last known job number before creating cluster #738

Open · wants to merge 1 commit into master
Conversation

timmartin-stripe
Contributor

We ran into a bug where we deleted a job cluster and then recreated a job cluster with the same name. The old job cluster had four stages and the new one had two. When a job completed, it wrote to the archived tables, but a job with the same ID already existed there. The KV provider only overwrote the rows for stages 1 and 2; it did not delete the values for stages 3 and 4.

When Mantis tried to load the archived job, it would see job metadata indicating two stages but would receive four stage records (two from the new job plus the two stale rows left over from the old job). This caused Mantis to fail to load the job.

We could arguably consider this a bug in the Dynamo KV provider, but we don't want to overwrite archived jobs in any scenario, since we'd like to maintain a record of those jobs. Instead, the root problem is further upstream. When we create a job, we should be reasonably confident that the job ID is globally unique. However, when creating a job cluster, the lastJobCount value is always set to 0. We should instead check whether there are any archived jobs with the same cluster name and, if so, use the highest archived job number as the last known job number.
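The proposed initialization can be sketched as follows. This is a minimal, self-contained model of the idea, not the Mantis code: the class and helper names are hypothetical, and the only assumption carried over from this PR is the "ClusterName-N" job-ID convention.

```java
import java.util.List;

public class NextJobNumber {
    // Parse the trailing numeric index from a job ID such as "MyJob-12".
    static long jobNumber(String jobId) {
        int dash = jobId.lastIndexOf('-');
        return Long.parseLong(jobId.substring(dash + 1));
    }

    // Given the archived job IDs for a cluster, return the last known job
    // number. Returns 0 when there is no archive, matching today's default.
    static long lastKnownJobNumber(List<String> archivedJobIds) {
        return archivedJobIds.stream()
                .mapToLong(NextJobNumber::jobNumber)
                .max()
                .orElse(0L);
    }

    public static void main(String[] args) {
        // A recreated "MyJob" cluster whose archive holds MyJob-1 and MyJob-3
        // would start counting at 3, so the next job becomes MyJob-4.
        System.out.println(lastKnownJobNumber(List.of("MyJob-1", "MyJob-3")));
    }
}
```

With this seed value, the new cluster's first job gets a number strictly greater than any archived job, so no archived row is ever overwritten.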

We want the following behavior:

  1. Create a job cluster "MyJob"
  2. Create a job "MyJob-1"
  3. Delete the job and job cluster
  4. Create another job cluster "MyJob"
  5. Create a job "MyJob-2" instead of "MyJob-1"

Previously, we would have an archived job "MyJob-1" and an active job "MyJob-1" that are distinct. Stopping the active one would overwrite the archived one.


Checklist

  • ./gradlew build compiles code correctly
  • Added new tests where applicable
  • ./gradlew test passes all tests
  • Extended README or added javadocs where applicable

// e.g. If a job cluster with the name "MyJobCluster" was deleted, but had run a job, then
// we'd have a Job ID of `MyJobCluster-1` in the archived jobs table. When the new job cluster
// came online it may partially overwrite the old archived job if we started a new job with ID MyJobCluster-1.
List<CompletedJob> completedJobs = jobStore.loadCompletedJobsForCluster(request.getJobClusterDefinition().getName(), 1, null);
Collaborator
My main concern here is that this loadCompletedJobsForCluster call essentially does a getAll (the limit is applied only after all rows are fetched), while this creation logic sits on the API response path with a default 1-second timeout. The costly getAll can therefore cause timeouts and other errors here.
What about using a different call that loads only the single row needed here instead of the full partition?
Another alternative would be to add a flag to the create request making "globally unique job ID" an optional argument, so we can still disable this call if needed.
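The reviewer's flag-based alternative could look something like the sketch below. Everything here is hypothetical (the flag name, method signature, and the use of a LongSupplier to stand in for the archive query are not from the Mantis codebase); the point is only that the expensive lookup runs solely when the caller opts in.

```java
import java.util.function.LongSupplier;

public class CreateClusterSketch {
    // When the flag is off, keep today's behavior (seed the counter at 0) and
    // never touch the archive; when it is on, pay for the archive lookup.
    static long resolveLastJobNumber(boolean globallyUniqueJobIds,
                                     LongSupplier archivedJobNumberLookup) {
        return globallyUniqueJobIds ? archivedJobNumberLookup.getAsLong() : 0L;
    }

    public static void main(String[] args) {
        // Opted in: the archive is consulted for the highest job number.
        System.out.println(resolveLastJobNumber(true, () -> 7L));
        // Opted out: the supplier is never invoked, so no extra read happens.
        System.out.println(resolveLastJobNumber(false, () -> 7L));
    }
}
```

Because the supplier is only evaluated when the flag is set, clusters that don't need globally unique job IDs avoid the full-partition read entirely.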
