[BATCH-2675] Rejected Job can not be restarted Created: 29/Jan/18  Updated: 19/Dec/18  Resolved: 26/Oct/18

Status: Resolved
Project: Spring Batch
Component/s: Core
Affects Version/s: 3.0.9, 4.0.1, 4.1.0.RC1
Fix Version/s: 4.1.0, 4.0.2, 3.0.10

Type: Bug Priority: Major
Reporter: Heiko Does Assignee: Mahmoud Ben Hassine
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Duplicate
is duplicated by BATCH-2760 Failed JobExecution due to unavailabl... Resolved
Pull Request URL: https://github.com/spring-projects/spring-batch/pull/659

 Description   

We had a problem in one of our batch applications because our queue capacity was too small. Some of the jobs resulted in a TaskRejectedException and were marked as FAILED in the database. After we figured that out, we wanted to restart the failed jobs via the JobOperator.restart method. This resulted in a JobExecutionAlreadyRunningException because the job had no end time, so this check failed: https://github.com/spring-projects/spring-batch/blob/master/spring-batch-core/src/main/java/org/springframework/batch/core/repository/support/SimpleJobRepository.java#L119

In my opinion this behaviour is not correct. I would suggest setting an end time when a task is rejected (https://github.com/spring-projects/spring-batch/blob/master/spring-batch-core/src/main/java/org/springframework/batch/core/launch/support/SimpleJobLauncher.java#L167) so that the job can be restarted.
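The failure mode described above can be sketched with plain java.util.concurrent (a minimal standalone model, not the real SimpleJobLauncher/JobExecution classes; all names and fields here are illustrative):

```java
import java.util.concurrent.*;

// Sketch of the reported bug: a full queue rejects the task, the execution
// is marked FAILED, but no end time is ever set, so the pre-fix
// "end time is null == still running" check blocks any later restart.
public class RejectedJobSketch {
    static String status = "STARTING";
    static Long endTime = null; // never set on rejection (the reported bug)

    static void launch(ExecutorService executor) {
        try {
            executor.execute(() -> { /* the job would run here */ });
        } catch (RejectedExecutionException e) { // Spring wraps this as TaskRejectedException
            status = "FAILED"; // recorded in the database
            // endTime stays null -> a restart attempt later fails with
            // JobExecutionAlreadyRunningException
        }
    }

    public static void main(String[] args) {
        // one busy worker, a queue with no spare capacity, no extra threads
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                1, 1, 0, TimeUnit.SECONDS, new SynchronousQueue<>());
        CountDownLatch hold = new CountDownLatch(1);
        pool.execute(() -> { try { hold.await(); } catch (InterruptedException ignored) {} });
        launch(pool); // rejected: the only worker is busy and the queue is full
        System.out.println(status + " endTime=" + endTime); // FAILED endTime=null
        hold.countDown();
        pool.shutdown();
    }
}
```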



 Comments   
Comment by Dimitrios Liapis [ 09/Oct/18 ]

Continuing the discussion here for the closed PR https://github.com/spring-projects/spring-batch/pull/643 

Michael Minella Mahmoud Ben Hassine thanks for the reviews.

In the duplicate ticket BATCH-2760 that I raised for this issue, I identified the method that gives the false positive in this case (SimpleJobExplorer#findRunningJobExecutions()). It does so mainly because the underlying query treats a null Job Execution END_TIME column as "running", which also matches these erroneous cases.

Looking at the alternative scenario suggested by Michael Minella in the PR comment: we could amend the above query to filter out these erroneous cases. For instance, we could exclude entries whose STATUS is FAILED, alongside the existing 'END_TIME is null' predicate:

SELECT E.JOB_EXECUTION_ID, E.START_TIME, E.END_TIME, E.STATUS, E.EXIT_CODE, E.EXIT_MESSAGE,
       E.CREATE_TIME, E.LAST_UPDATED, E.VERSION, E.JOB_INSTANCE_ID, E.JOB_CONFIGURATION_LOCATION
from %PREFIX%JOB_EXECUTION E, %PREFIX%JOB_INSTANCE I
where E.JOB_INSTANCE_ID = I.JOB_INSTANCE_ID
  and I.JOB_NAME = ?
  and E.END_TIME is NULL
  and E.STATUS != 'FAILED'
order by E.JOB_EXECUTION_ID desc

We could also use the Job Execution EXIT_MESSAGE column, where I think the 'TaskRejectedException' name is also logged in these erroneous cases, though I'm not sure how performant including this predicate would be.

Or maybe an alternative solution altogether?

What do you think?

Comment by Mahmoud Ben Hassine [ 10/Oct/18 ]

I would not go for filtering based on the EXIT_MESSAGE containing the text "TaskRejectedException" as this would:

  • break if we change the exit message content (the exception name for instance)
  • perform poorly if the exit message is long enough (for queries with 'like %XXX%')

Excluding failed executions based on the status would solve the issue. However, "FAILED" is not the appropriate status for these rejected executions; their status should be something like "REJECTED". The status "FAILED" implies that the execution was run and failed (which is not the case here), and having an execution in "FAILED" status without a start time and end time makes no sense to me (we can't answer questions like: when did it start, when did it fail, how long did it take to fail?). That's why I suggested setting the end time in BATCH-2760. On the other hand, it makes sense for a "REJECTED" execution to have no start/end time.

So IMO, the correct fix for this issue would be to set the status of rejected executions to "REJECTED" (a new status we would need to create, or maybe use UNKNOWN?) and filter out rejected executions in the query.

What do you think?

Comment by Dimitrios Liapis [ 10/Oct/18 ]

Agreed that the EXIT_MESSAGE is not a good way to go about this; I only mentioned it as a way to narrow down these particular failures.

Having a dedicated status (either the existing "UNKNOWN" or a new "REJECTED") sounds like a good idea to me and will keep things a bit cleaner.

Pending Michael Minella's opinion on this new approach, I can start putting together a new PR.

One question in case we decide to introduce a new dedicated status such as "REJECTED" to capture this: where exactly should we place it inside the BatchStatus enum, given its implicit ordering? I would assume either directly after "UNKNOWN" or directly before it.

Comment by Mahmoud Ben Hassine [ 16/Oct/18 ]

Hi Dimitrios Liapis,

After discussing this issue with Michael, we will not create a new status "REJECTED", as this might introduce confusion about the orchestration part of batch jobs.

Finding running executions should be done by looking at executions that have a start time but no end time yet. In the query this can be written as ... and E.START_TIME is not NULL and E.END_TIME is NULL ....

Since rejected tasks have no start time set (and no end time either), they will be excluded from the "findRunningExecutions" query. Note that we no longer need to filter on the status.

Can you please open a PR with this change? We would like to include the fix in the upcoming 4.1.0.RELEASE.

Kind regards,
Mahmoud

Comment by Dimitrios Liapis [ 16/Oct/18 ]

Hi Mahmoud Ben Hassine,

thanks for that.

I've started working on this today and while working on the tests, another thing popped up:

MapJobExecutionDao#findRunningJobExecutions() internally uses JobExecution#isRunning(), where isRunning() is defined as endTime == null.

Initially I changed that to startTime != null && endTime == null, to be in sync with the above fix to the query inside JdbcJobExecutionDao#findRunningJobExecutions().
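The two predicates can be sketched side by side (a minimal standalone model as plain booleans, not the real JobExecution class):

```java
// Old vs. fixed definitions of "running", modeled on nullable timestamps.
public class IsRunningSketch {
    // old definition: only the end time is consulted
    static boolean isRunningOld(Long start, Long end) {
        return end == null;
    }

    // fixed definition, in sync with the amended JDBC query:
    // START_TIME is not NULL and END_TIME is NULL
    static boolean isRunningFixed(Long start, Long end) {
        return start != null && end == null;
    }

    public static void main(String[] args) {
        // rejected execution: no start, no end
        System.out.println(isRunningOld(null, null));   // true  (false positive)
        System.out.println(isRunningFixed(null, null)); // false (correctly excluded)
        // genuinely running execution: started, not yet finished
        System.out.println(isRunningOld(1L, null));     // true
        System.out.println(isRunningFixed(1L, null));   // true
    }
}
```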

This latter change surfaced some new test failures with a common pattern:

SimpleJobRepository#createJobExecution() internally uses the JobExecution#isRunning() method. Therefore, e.g. in the test MapJobRepositoryFactoryBeanTests#testCreateRepository(), two consecutive SimpleJobRepository#createJobExecution() calls no longer detect a running execution, because the start time is not set.

This got me questioning whether JobExecution#isRunning() is really a check for whether a JobExecution has been created, rather than whether it is running (which implies that it has already started).

What are your thoughts?

Many thanks

Comment by Mahmoud Ben Hassine [ 17/Oct/18 ]

Thank you for fixing the map-based DAO to be in sync with the JDBC one.

This got me questioning whether JobExecution#isRunning() is really a check for whether a JobExecution has been created, rather than whether it is running (which implies that it has already started).

We have createTime for the creation time, so isRunning is really about running executions, not created ones. It is the test that should be updated:

repository.createJobExecution(job.getName(), jobParameters);

try {
	repository.createJobExecution(job.getName(), jobParameters);
	fail("Expected JobExecutionAlreadyRunningException");
}
catch (JobExecutionAlreadyRunningException e) {
	// expected
}

In this test, we create a job execution (note that we don't run it), then try to create another one and expect the first to be seen as running (but it was never run). If you look at the JobExecution#setStartTime method, it is called from AbstractJob#doExecute, so we need to execute the job for the start time to be set (which makes sense to me). The test should therefore be updated to something like:

JobExecution jobExecution = repository.createJobExecution(job.getName(), jobParameters);
job.execute(jobExecution); // this would set the start time and update the execution in the job repository

try {
	repository.createJobExecution(job.getName(), jobParameters);
	fail("Expected JobExecutionAlreadyRunningException");
}
catch (JobExecutionAlreadyRunningException e) {
	// expected
}
Comment by Dimitrios Liapis [ 18/Oct/18 ]

Thanks, sounds good.

Looking at it now, AbstractJob#execute() sets the start time but also sets the end time in a finally block.

If there isn't a natural way to set the start time alone without the end time, I'll try circumventing the end-time setting by mocking/stubbing it somehow.

Comment by Mahmoud Ben Hassine [ 18/Oct/18 ]

I see; there should be some time between the call to job.execute(jobExecution) and the second call to repository.createJobExecution(job.getName(), jobParameters), so that the repository sees the execution as running (start time set, but end time not set yet). The simplest thing to do is to simulate a running execution, something like:

@Test
public void testCreateRepository() throws Exception {
	tested.afterPropertiesSet();
	JobRepository repository = tested.getObject();
	Job job = new JobSupport("jobName");
	JobParameters jobParameters = new JobParameters();

	JobExecution jobExecution = repository.createJobExecution(job.getName(), jobParameters);

	// simulate a running execution
	jobExecution.setStartTime(new Date());
	repository.update(jobExecution);

	try {
		repository.createJobExecution(job.getName(), jobParameters);
		fail("Expected JobExecutionAlreadyRunningException");
	}
	catch (JobExecutionAlreadyRunningException e) {
		// expected
	}
}

Otherwise, we need to run a job (in a separate thread) and make it take some time, so that the job repository has a chance to see the execution as running when it tries to create a new one. For example:

@Test
public void testCreateRepository() throws Exception {
	ApplicationContext applicationContext = new AnnotationConfigApplicationContext(JobConfiguration.class);
	JobRepository jobRepository = applicationContext.getBean(JobRepository.class);
	Job job = applicationContext.getBean(Job.class);
	JobLauncher jobLauncher = applicationContext.getBean(JobLauncher.class);

	jobLauncher.run(job, new JobParameters());

	try {
		jobRepository.createJobExecution(job.getName(), new JobParameters());
		fail("Expected JobExecutionAlreadyRunningException");
	}
	catch (JobExecutionAlreadyRunningException e) {
		// expected
	}
}

@Configuration
@EnableBatchProcessing
static class JobConfiguration {

	private JobBuilderFactory jobBuilderFactory;
	private StepBuilderFactory stepBuilderFactory;

	public JobConfiguration(JobBuilderFactory jobBuilderFactory, StepBuilderFactory stepBuilderFactory) {
		this.jobBuilderFactory = jobBuilderFactory;
		this.stepBuilderFactory = stepBuilderFactory;
	}

	@Bean
	public JobLauncher jobLauncher(JobRepository jobRepository) {
		SimpleJobLauncher jobLauncher = new SimpleJobLauncher();
		jobLauncher.setJobRepository(jobRepository);
		jobLauncher.setTaskExecutor(new SimpleAsyncTaskExecutor());
		return jobLauncher;
	}

	@Bean
	public Job job() {
		return jobBuilderFactory.get("job")
				.start(stepBuilderFactory.get("step")
						.tasklet((contribution, chunkContext) -> {
							Thread.sleep(500);
							return RepeatStatus.FINISHED;
						})
						.build())
				.build();
	}
}

Both tests pass on my side. I let you choose how to update the test (or suggest another solution) and we can discuss it on the PR.

Comment by Dimitrios Liapis [ 18/Oct/18 ]

Many thanks for the concrete examples - it made things very clear.

PR raised: https://github.com/spring-projects/spring-batch/pull/659

Could we also please consider this fix for 3.0.10.RELEASE too?

We have substantial infrastructure based on this version and we can't currently upgrade to version 4.

Comment by Michael Minella [ 18/Oct/18 ]

Dimitrios Liapis We can definitely take a look at back-porting this fix (no promises), but I'd be curious what is holding you back from upgrading to Spring Batch 4.

Comment by Dimitrios Liapis [ 18/Oct/18 ]

Michael Minella It's really the Spring 5 dependency, which we can't currently upgrade to (we're on an almost-latest 4).

It would be highly appreciated if we could back-port it, and I would be more than happy to help with that (applying it independently via a PR to the 3.0.x branch, or any other way you prefer).

Comment by Mahmoud Ben Hassine [ 26/Oct/18 ]

I'm resolving this as the PR was merged to the master branch (for the upcoming 4.1.0 GA) and back-ported to 4.0.x (for 4.0.2).

@Dimitrios Liapis you are welcome to help back-port the fix to 3.0.x via a PR if you want (otherwise we will try to do it, but no promises, as mentioned by Michael).

Comment by Spazbob [ 14/Dec/18 ]

Mahmoud Ben Hassine

This fix tripped us up a bit. We have built various dashboards that use SimpleJobExplorer's findRunningJobExecutions(). We have a worker pool of 12 threads to run jobs, so if you submit 16 jobs at once, prior to the fix all 16 would appear in the list; now 4 of them do not start immediately (as they are waiting for an execution thread) and so they drop off the list.

It worked well for us when the "waiting" jobs appeared in the list, as it made it clear that they would be running shortly.

What is the recommended approach to get a list of waiting AND running jobs? I know we could use getJobExecutions() and then filter, but that would imply pulling back a significant amount of history which is inefficient.

Comment by Mahmoud Ben Hassine [ 19/Dec/18 ]

@Spazbob Thank you for reporting this; your use case confirms that there was an issue. In your example, the 4 executions that have not started yet should not be returned by JobExplorer#findRunningJobExecutions, because they are not effectively running.

What is the recommended approach to get a list of waiting AND running jobs? I know we could use getJobExecutions() and then filter, but that would imply pulling back a significant amount of history which is inefficient.

  • For running jobs, you can use findRunningJobExecutions
  • For waiting jobs, if you don't want to use getJobExecutions and filter the results, you can do the filtering in the database by sending a query that grabs only waiting executions (those with status = STARTING). Alternatively, if you use a ThreadPoolTaskExecutor with your JobLauncher, you can get the number of pending tasks from the task executor's queue with taskExecutor.getThreadPoolExecutor().getQueue().size() (in your example this would return 4). I've seen people use a custom JobLauncher that submits jobs via a RabbitMQ queue and monitor the queue size to track pending jobs; the same idea applies to the in-memory queue of a ThreadPoolTaskExecutor.
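The queue-size idea above can be sketched with a raw java.util.concurrent.ThreadPoolExecutor (the class that Spring's ThreadPoolTaskExecutor wraps); the method name countWaiting and the latch machinery are illustrative, not part of any Spring API:

```java
import java.util.concurrent.*;

// Submitting more jobs than there are worker threads leaves the surplus
// sitting in the executor's queue, where they can be counted as "waiting".
public class PendingJobsSketch {

    // Submit `jobs` long-running tasks to a pool of `workers` threads and
    // return how many are still queued (waiting, not yet running).
    static int countWaiting(int workers, int jobs) throws InterruptedException {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                workers, workers, 0, TimeUnit.SECONDS, new LinkedBlockingQueue<>());
        CountDownLatch started = new CountDownLatch(workers);
        CountDownLatch release = new CountDownLatch(1);
        for (int i = 0; i < jobs; i++) {
            pool.execute(() -> {
                started.countDown();
                try { release.await(); } catch (InterruptedException ignored) {}
            });
        }
        started.await(); // wait until all worker threads are really busy
        int waiting = pool.getQueue().size();
        release.countDown();
        pool.shutdown();
        pool.awaitTermination(5, TimeUnit.SECONDS);
        return waiting;
    }

    public static void main(String[] args) throws InterruptedException {
        // 12 worker threads, 16 submitted jobs -> 4 waiting (as in the comment)
        System.out.println(countWaiting(12, 16)); // prints 4
    }
}
```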

Does this help?

Generated at Sun Nov 17 19:50:04 UTC 2019 using Jira 7.13.8#713008-sha1:1606a5c1e7006e1ab135aac81f7a9566b2dbc3a6.