[BATCH-2675] Rejected Job can not be restarted Created: 29/Jan/18 Updated: 19/Dec/18 Resolved: 26/Oct/18
|Affects Version/s:||3.0.9, 4.0.1, 4.1.0.RC1|
|Fix Version/s:||4.1.0, 4.0.2, 3.0.10|
|Reporter:||Heiko Does||Assignee:||Mahmoud Ben Hassine|
|Remaining Estimate:||Not Specified|
|Time Spent:||Not Specified|
|Original Estimate:||Not Specified|
|Pull Request URL:||https://github.com/spring-projects/spring-batch/pull/659|
We had a problem in one of our batch applications because our queue capacity was too small.
In my opinion this behaviour is not correct, and I would suggest setting an end time when a task is rejected (https://github.com/spring-projects/spring-batch/blob/master/spring-batch-core/src/main/java/org/springframework/batch/core/launch/support/SimpleJobLauncher.java#L167) so that the job can be restarted.
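To illustrate the idea outside of Spring Batch (the class and method names below are invented for this sketch, and plain Java's RejectedExecutionException stands in for Spring's TaskRejectedException): when a saturated executor rejects the task, the launcher can mark the execution failed and also set its end time, so it no longer looks like a running execution and can be restarted.

```java
import java.util.Date;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.RejectedExecutionException;

// Invented stand-in for a JobExecution, for illustration only.
class MiniExecution {
    volatile Date startTime;
    volatile Date endTime;
    volatile String status = "STARTING";
}

class RejectionDemo {
    // Sketch of a launcher: hand the work to an executor; if the executor is
    // saturated and rejects it, record the failure AND an end time.
    static MiniExecution launch(ExecutorService executor, Runnable work) {
        MiniExecution execution = new MiniExecution();
        try {
            executor.execute(() -> {
                execution.startTime = new Date();
                work.run();
                execution.endTime = new Date();
                execution.status = "COMPLETED";
            });
        } catch (RejectedExecutionException e) {
            execution.status = "FAILED";
            // The suggested fix: set the end time on rejection so the
            // execution is not mistaken for a running one later.
            execution.endTime = new Date();
        }
        return execution;
    }
}
```

Without that last `setEndTime`-style line, a rejected execution has neither a start time nor an end time and, under the original "END_TIME is null" check, looks like it is still running forever.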
|Comment by Dimitrios Liapis [ 09/Oct/18 ]|
Continuing the discussion here for the closed PR https://github.com/spring-projects/spring-batch/pull/643
In the duplicate ticket BATCH-2760 that I raised for this issue, I identified the method that gives the false positive in this case (SimpleJobExplorer#findRunningJobExecutions()): the underlying query selects Job Executions whose END_TIME column is null, which also includes these erroneous cases.
Looking at the alternative suggested by Michael Minella in the PR comment, we could amend the above query to filter out these erroneous cases. For instance, we could exclude entries whose STATUS is FAILED, alongside the existing 'END_TIME is null' predicate:
SELECT E.JOB_EXECUTION_ID, E.START_TIME, E.END_TIME, E.STATUS, E.EXIT_CODE, E.EXIT_MESSAGE, E.CREATE_TIME, E.LAST_UPDATED, E.VERSION, E.JOB_INSTANCE_ID, E.JOB_CONFIGURATION_LOCATION from %PREFIX%JOB_EXECUTION E, %PREFIX%JOB_INSTANCE I where E.JOB_INSTANCE_ID=I.JOB_INSTANCE_ID and I.JOB_NAME=? and E.END_TIME is NULL and E.STATUS != 'FAILED' order by E.JOB_EXECUTION_ID desc
We could also use the Job Execution EXIT_MESSAGE column, where I believe the 'TaskRejectedException' name is also logged in these erroneous cases, though I'm not sure how performant including that predicate would be.
Or maybe an alternative solution altogether?
What do you think?
|Comment by Mahmoud Ben Hassine [ 10/Oct/18 ]|
I would not go for filtering based on the EXIT_MESSAGE containing the text "TaskRejectedException", as this would couple the query to free-text message content and, as you noted, its performance would be questionable.
Excluding failed executions based on the status would solve the issue. However, "FAILED" is not the appropriate status for these rejected executions; their status should be something like "REJECTED". "FAILED" implies that the execution actually ran and failed (which is not the case here), and an execution in "FAILED" status with neither a start time nor an end time makes no sense to me: we can't answer questions like when it started, when it failed, or how long it took to fail. That's why I suggested setting the end time when the task is rejected.
So IMO, the correct fix for this issue would be to set the status of rejected executions to "REJECTED" (a new status we would need to create, or maybe use UNKNOWN?) and filter rejected executions out of the query.
What do you think?
|Comment by Dimitrios Liapis [ 10/Oct/18 ]|
Agreed that the EXIT_MESSAGE is not a good way to go about this; I only mentioned it as a way to narrow down these particular failures.
Having a dedicated status (either the existing "UNKNOWN" or a new "REJECTED") sounds like a good idea to me and would keep things a bit cleaner.
Pending Michael Minella's opinion on this new approach as well, I can start putting together a new PR.
One question in case we decide to introduce a new dedicated status, e.g. "REJECTED": where exactly should we place it inside the BatchStatus enum, given its implicit ordering? I would assume either directly after "UNKNOWN" or directly before it.
|Comment by Mahmoud Ben Hassine [ 16/Oct/18 ]|
Hi Dimitrios Liapis,
After discussing this issue with Michael, we will not create a new status "REJECTED", as this might introduce confusion about the orchestration part of batch jobs.
Finding running executions should be done by looking at executions that have a start time but no end time yet. In the query, this can be written as ... and E.START_TIME is not NULL and E.END_TIME is NULL ....
Since rejected tasks have no start time set (and no end time either), they will be excluded by the "findRunningExecutions" query. Note that we no longer need to filter on the status.
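Applied to the full query quoted earlier, the amended version might look like this (a sketch of the proposed predicate; the exact SQL that ships may differ):

```sql
SELECT E.JOB_EXECUTION_ID, E.START_TIME, E.END_TIME, E.STATUS, E.EXIT_CODE, E.EXIT_MESSAGE,
       E.CREATE_TIME, E.LAST_UPDATED, E.VERSION, E.JOB_INSTANCE_ID, E.JOB_CONFIGURATION_LOCATION
from %PREFIX%JOB_EXECUTION E, %PREFIX%JOB_INSTANCE I
where E.JOB_INSTANCE_ID=I.JOB_INSTANCE_ID
  and I.JOB_NAME=?
  and E.START_TIME is not NULL and E.END_TIME is NULL
order by E.JOB_EXECUTION_ID desc
```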
Can you please open a PR with this change? We would like to include the fix in the upcoming 4.1.0.RELEASE.
|Comment by Dimitrios Liapis [ 16/Oct/18 ]|
thanks for that.
I've started working on this today and while working on the tests, another thing popped up:
MapJobExecutionDao#findRunningJobExecutions() internally uses JobExecution#isRunning(), where isRunning() is defined as endTime == null.
Initially I changed that to startTime != null && endTime == null, to be in sync with the above fix to the query inside JdbcJobExecutionDao#findRunningJobExecutions().
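The two definitions side by side, as a plain-Java sketch (the class below is an invented stub for illustration, not the real org.springframework.batch.core.JobExecution):

```java
import java.util.Date;

// Invented stub showing the old vs. proposed isRunning() semantics.
class RunningCheck {
    // Original definition: any execution without an end time counts as running.
    static boolean isRunningOld(Date startTime, Date endTime) {
        return endTime == null;
    }

    // Proposed definition, in sync with the JDBC query fix: an execution is
    // running only once it has actually started.
    static boolean isRunningProposed(Date startTime, Date endTime) {
        return startTime != null && endTime == null;
    }
}
```

The difference only shows for executions that were created but never started: the old check reports them as running, the proposed one does not.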
This latter change surfaced some new test failures with a common pattern:
SimpleJobRepository#createJobExecution() internally uses the JobExecution#isRunning() method. Therefore, for example in MapJobRepositoryFactoryBeanTests#testCreateRepository(), two consecutive SimpleJobRepository#createJobExecution() calls no longer detect a running execution, because the start time is never set.
This got me questioning whether JobExecution#isRunning() is really a check for whether a JobExecution has been created, rather than whether it is actually running (which implies that it has already started).
What are your thoughts?
|Comment by Mahmoud Ben Hassine [ 17/Oct/18 ]|
Thank you for fixing the map-based DAO to be in sync with the JDBC one.
We have createTime for creation time, so isRunning is really about running executions and not created ones. It is the test that should be updated:
In this test, we create a job execution (note that we don't run it), then try to create another one and expect to find a running execution (but the first one was never run). If you look at the JobExecution#setStartTime method, it is used in AbstractJob#doExecute, so we need to execute the job for its start time to be set (which makes sense to me). The test should be updated accordingly.
|Comment by Dimitrios Liapis [ 18/Oct/18 ]|
Thanks, sounds good.
I'm guessing that if there isn't a way to naturally set the Start Time alone, without the End Time, I'll try circumventing the End Time setting by mocking/stubbing it somehow.
|Comment by Mahmoud Ben Hassine [ 18/Oct/18 ]|
I see; there should be some time between the call to job.execute(jobExecution); and the second call to repository.createJobExecution(job.getName(), jobParameters); so that the repository sees the execution as running (start time set, but end time not set yet). The simplest thing to do is to simulate a running execution, something like:
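The original snippet attached to this comment is not preserved in the export. A minimal stand-in for the repository check that the test exercises (the classes below are invented stubs; the real SimpleJobRepository throws JobExecutionAlreadyRunningException rather than IllegalStateException):

```java
import java.util.ArrayList;
import java.util.Date;
import java.util.List;

// Invented stand-ins for JobExecution and SimpleJobRepository.
class FakeExecution {
    volatile Date startTime;
    volatile Date endTime;
    // Proposed semantics: running means started but not yet finished.
    boolean isRunning() { return startTime != null && endTime == null; }
}

class FakeRepository {
    private final List<FakeExecution> executions = new ArrayList<>();

    // Refuse to create a new execution while an existing one is still running.
    FakeExecution createJobExecution() {
        for (FakeExecution e : executions) {
            if (e.isRunning()) {
                // the real repository throws JobExecutionAlreadyRunningException
                throw new IllegalStateException("execution already running");
            }
        }
        FakeExecution created = new FakeExecution();
        executions.add(created);
        return created;
    }
}
```

With these stubs, "simulate a running execution" is just setting the start time on an execution before the next createJobExecution() call, which then detects it as running.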
Otherwise, we need to run a job (in a separate thread) and make it take some time, so that the job repository has a chance to see the execution as running when it tries to create a new one. For example:
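The example attached here is likewise missing from the export. A plain-Java sketch of the threaded variant (the class is invented for illustration; in the real test the "job" would be a Spring Batch Job run on a separate thread, slowed down so the repository can observe it mid-flight):

```java
import java.util.Date;

// Invented demo: a slow "job" runs on a separate thread long enough for an
// observer to see it with a start time set but no end time yet, i.e. running.
class ThreadedJobDemo {
    volatile Date startTime;
    volatile Date endTime;

    boolean isRunning() { return startTime != null && endTime == null; }

    Thread startJob(final long workMillis) {
        Thread worker = new Thread(() -> {
            startTime = new Date();
            try {
                Thread.sleep(workMillis); // stands in for the job's work
            } catch (InterruptedException ignored) {
            }
            endTime = new Date();
        });
        worker.start();
        return worker;
    }

    // Observe the execution midway through the work, then again after it ends.
    static boolean[] observe(long workMillis, long checkDelayMillis) {
        ThreadedJobDemo demo = new ThreadedJobDemo();
        Thread worker = demo.startJob(workMillis);
        try {
            Thread.sleep(checkDelayMillis);
            boolean runningMidway = demo.isRunning();
            worker.join();
            return new boolean[] { runningMidway, demo.isRunning() };
        } catch (InterruptedException e) {
            throw new IllegalStateException(e);
        }
    }
}
```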
Both tests pass on my side. I'll let you choose how to update the test (or suggest another solution), and we can discuss it on the PR.
|Comment by Dimitrios Liapis [ 18/Oct/18 ]|
Many thanks for the concrete examples - it made things very clear.
Could we also please consider this fix for 3.0.10.RELEASE too?
We have substantial infrastructure based on this version and we can't currently upgrade to version 4.
|Comment by Michael Minella [ 18/Oct/18 ]|
Dimitrios Liapis We definitely can take a look at back porting this fix (no promises), but I'd be curious as to what is holding you back from upgrading to Spring Batch 4.
|Comment by Dimitrios Liapis [ 18/Oct/18 ]|
It would be highly appreciated if you could back-port it, and I would be more than happy to help with that (by applying it independently via a PR to the 3.0.x branch, or any other way you prefer).
|Comment by Mahmoud Ben Hassine [ 26/Oct/18 ]|
I'm resolving this as the PR was merged to the master branch (for the upcoming 4.1.0 GA) and back-ported to 4.0.x (for 4.0.2).
@ Dimitrios Liapis you are welcome to help in back porting the fix to 3.0.x via a PR if you want (otherwise we will try to do it, but no promises as mentioned by Michael).
|Comment by Spazbob [ 14/Dec/18 ]|
This fix tripped us up a bit. We have built various dashboards that use SimpleJobExplorer's findRunningJobExecutions(). We have a worker pool of 12 threads to run jobs, so if 16 jobs are submitted at once, prior to the fix all 16 would appear in the list; now 4 of them do not start immediately (they are waiting for an execution thread) and so they drop off the list.
It worked well for us when the "waiting" jobs appeared in the list, as it made it clear that they would be running shortly.
What is the recommended approach to get a list of waiting AND running jobs? I know we could use getJobExecutions() and then filter, but that would imply pulling back a significant amount of history, which is inefficient.
|Comment by Mahmoud Ben Hassine [ 19/Dec/18 ]|
@Spazbob Thank you for reporting this, your use case confirms that there was an issue. In your example, the 4 executions that are not started yet should not be returned by JobExplorer#findRunningJobExecutions because they are not effectively running.
Does this help?