[BATCH-2711] Existing failed job is restarted, even if new job contains different job parameters Created: 17/Apr/18 Updated: 17/Dec/19 Resolved: 26/Oct/18
|Fix Version/s:||4.1.0, 4.0.2|
|Reporter:||Tonie||Assignee:||Mahmoud Ben Hassine|
|Remaining Estimate:||Not Specified|
|Time Spent:||Not Specified|
|Original Estimate:||Not Specified|
Windows, RedHat Linux, SQL Server
Created a Spring Batch job as described in the documentation 'Getting Started - Creating a Batch Service' (https://spring.io/guides/gs/batch-processing/) in order to show the issue. SQLServer is used as database for the meta-data tables as well as the business data table.
The job reads a .csv file with 6 records and the ItemProcessor throws an exception on the 4th record. The job then stops and the meta-data tables are updated correctly. The job parameters used for the job are: parm=0 fileName=error.csv
After this, a new job is started with the same job name but different parameters: parm=1 fileName=new.csv. What happens is that Spring Batch restarts the failed job, with the same errors, even though the new job contains different parameters.
This worked exactly as expected in version 3.0.8; the issue appeared after upgrading to version 4.0.1.
Logs from 3.0.8 shows the following:
Logs from 4.0.1 shows the following:
Notice the difference in the parameters used in SimpleJobLauncher, even though JobLauncherCommandLineRunner shows the same correct parameters in both cases.
|Comment by Tonie [ 19/Apr/18 ]|
I traced this back to Spring Boot.
Version 2.0.0.M5 used the getNextJobParameters method in the JobLauncherCommandLineRunner class.
Version 2.0.0.M6 changed JobLauncherCommandLineRunner to use the getNextJobParameters method in the JobParametersBuilder class.
Now, the question is: which of these is the correct design, and how do you achieve the same result in the later versions as in version 2.0.0.M5?
|Comment by Michael Minella [ 19/Apr/18 ]|
Can you please provide the command you're using to launch your job? The difference you're describing was just moving the code from one place to another. The code itself should be the same.
|Comment by Tonie [ 19/Apr/18 ]|
I am using the following commands to start the job:
Second job that restarts the first job instead of starting a new job instance:
The code was moved from one place to another, but it also changed.
Code in version 2.0.0.M5:
JobLauncherCommandLineRunner execute method:
JobLauncherCommandLineRunner getNextJobParameters method:
JobLauncherCommandLineRunner merge method:
Code in version 2.0.0.M6:
JobLauncherCommandLineRunner execute method:
JobParametersBuilder getNextJobParameters method:
JobParametersBuilder addJobParameters method:
As you can see, the old code uses the new job parameters (as they differ from the previous job's), while the new code uses the previous job's parameters. I debugged this to confirm that it is indeed happening like this.
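To make the difference concrete, here is a minimal sketch of the two merge strategies described above, modelled with plain Java maps rather than the real JobParameters API; the method names are illustrative, not Spring's:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class MergeComparison {

    // 2.0.0.M5-style merge: new (command-line) parameters override previous ones.
    static Map<String, String> mergeOldStyle(Map<String, String> previous,
                                             Map<String, String> fresh) {
        Map<String, String> merged = new LinkedHashMap<>(previous);
        merged.putAll(fresh); // fresh values win on key collisions
        return merged;
    }

    // 2.0.0.M6-style behaviour after a failed run: the previous execution's
    // parameters are re-applied, so they effectively replace the fresh ones.
    static Map<String, String> mergeNewStyle(Map<String, String> previous,
                                             Map<String, String> fresh) {
        Map<String, String> merged = new LinkedHashMap<>(fresh);
        merged.putAll(previous); // previous values win on key collisions
        return merged;
    }

    public static void main(String[] args) {
        Map<String, String> previous = Map.of("parm", "0", "fileName", "error.csv");
        Map<String, String> fresh = Map.of("parm", "1", "fileName", "new.csv");

        // old style keeps parm=1/new.csv; new style reverts to parm=0/error.csv
        System.out.println(mergeOldStyle(previous, fresh));
        System.out.println(mergeNewStyle(previous, fresh));
    }
}
```

With the new-style merge, the launcher ends up with the failed job's parameters, which is why Spring Batch treats the run as a restart of the failed instance.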
|Comment by Tonie [ 21/Apr/18 ]|
I changed the addJobParameters method of the JobParametersBuilder class in the latest version of Spring Boot 2.0.1, with Spring Batch core 4.0.1, as follows:
I first ran a job as follows:
This resulted in an error (as intended) and the job in the Spring meta-data tables showed a status of FAILED.
I then ran a second job as follows:
This started the second job with a new job instance and it completed successfully.
I then fixed the data file that caused the error for the first job and ran it again with:
This restarted the first job and it completed successfully.
With this change the behaviour is the same as in version 2.0.0.M5.
Please advise on next steps.
|Comment by Tonie [ 23/Apr/18 ]|
While "playing" with different scenarios in order to understand what is happening here, I also found the following two side effects of this issue:
I have decided not to use the default Spring Boot logic but to write my own custom job launcher and job parameter processing logic for now. I am, however, willing to help resolve this issue if more information is required.
|Comment by Michael Minella [ 26/Apr/18 ]|
I'm looking at this, and while the original move was just to simplify the Boot code and make this available to any user, I can't help but feel that this is wrong anyway. This code assumes that the run after a failed JobExecution is a restart. Period. That may not be the case. I'm also not sure that the JobExecution would end up the way you want: the code assumes a consistent set of identifying parameters is always passed in. The CommandLineJobRunner has an explicit flag for the restart scenario. While I'm not proposing that, I do think we should dig deeper to determine whether a restart makes sense.
|Comment by Tonie [ 27/Apr/18 ]|
I totally agree with your overall analysis and statement:
If I look at the documentation for Spring batch, I see the following playing a role during the job start logic:
I feel that Spring Boot should implement the use case as follow:
If the new job contains exactly the same job parameters as a previous failed job:
If the new job contains different job parameters from a previous failed job:
If the new job contains the same job parameters as a previous failed job, but also specifies an incrementer:
If the new job contains the same job parameters as a previous failed job, but also contains an isRestartable indicator set to false:
Does this make sense?
|Comment by Minhyeok Jeong [ 31/Aug/18 ]|
The root cause of this issue is the change to the merge method.
In the previous version, 3.0.9.RELEASE, existing parameters and new parameters were merged as shown below:
In the current version, 4.0.1.RELEASE:
This happens when the previous job either failed or was stopped.
But was this change intended? Isn't it misbehavior?
I think the addition of a new public method, addJobParameters(JobParameters), to the builder was a good thing. But it needs more test cases to verify that the builder keeps its previous behaviour without any side effects.
I made a PR for this issue: https://github.com/spring-projects/spring-batch/pull/630
Please review the new test case in the PR:
I hope this issue will be resolved as soon as possible.
|Comment by Aritz Bastida [ 01/Oct/18 ]|
Hello! Does the PR really solve the issue? In my opinion, it just hides it a bit deeper.
Note that, when the last execution fails, the non-identifying parameters are removed from the input map (probably to relaunch the previous execution as it was). The resulting map will then contain a mix of externally provided parameters (identifying ones) and previous-execution parameters (non-identifying ones).
On the other hand, there might as well be a design problem in JobLauncherCommandLineRunner (spring-boot-autoconfigure), which systematically obtains the next parameters, no matter what.
As reported in this issue, these two classes (JobParametersBuilder and JobLauncherCommandLineRunner) in tandem make it impossible to launch a new instance without first completing the previous job execution. For the same reasons, a previously failed execution can't be restarted either.
Compare this to the old semantics from Spring Batch 3 + Spring Batch Admin. Back then, the externally provided job parameters were used to find a matching job instance. If it existed and it was restartable, the call to getNext() was omitted. Otherwise, the incrementer was called, but using the externally provided job parameters as input.
In Spring Batch 4, an empty JobParameters instance will be passed to the first execution, and the previous execution's parameters otherwise. The job parameters specified at launch-time are not propagated as getNextParameters() arguments.
Somehow, the semantics have changed: For better? Is it intentional?
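The contrast between the two semantics can be sketched as follows, using plain Java maps and a toy "run.id" incrementer; every name here (batch3Next, batch4Next, getNext) is illustrative, not the actual Spring Batch API:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

public class LaunchSemantics {

    // Spring Batch 3 + Admin: the external parameters locate the instance;
    // getNext() runs only for new instances, seeded with the external params.
    static Map<String, String> batch3Next(Optional<Map<String, String>> matchingInstance,
                                          boolean restartable,
                                          Map<String, String> external) {
        if (matchingInstance.isPresent() && restartable) {
            return external; // restart: the incrementer is skipped
        }
        return getNext(external); // incrementer seeded with external params
    }

    // Spring Batch 4 (as reported): launch-time parameters are not propagated;
    // getNext() sees either an empty map or the previous execution's params.
    static Map<String, String> batch4Next(Optional<Map<String, String>> previousExecution) {
        return getNext(previousExecution.orElse(Map.of()));
    }

    // Toy incrementer: bump a numeric "run.id" parameter.
    static Map<String, String> getNext(Map<String, String> params) {
        int runId = Integer.parseInt(params.getOrDefault("run.id", "0"));
        Map<String, String> next = new LinkedHashMap<>(params);
        next.put("run.id", String.valueOf(runId + 1));
        return next;
    }

    public static void main(String[] args) {
        Map<String, String> external = Map.of("fileName", "new.csv");
        // SB3: new instance, incrementer sees the external parameters
        System.out.println(batch3Next(Optional.empty(), true, external));
        // SB4: first execution, the external parameters are lost
        System.out.println(batch4Next(Optional.empty()));
    }
}
```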
To summarize, I'd say that the job launcher should allow the following use cases:
|Comment by Aritz Bastida [ 01/Oct/18 ]|
In order to satisfy the use cases I mentioned above, I've revised the code for both JobLauncherCommandLineRunner and JobParametersBuilder, and implemented my own "custom" versions.
Now, you can do any of the following:
The call to getNextParameters() will only be made in the latter case, which is the only situation where it makes sense. The incrementer uses the latest job execution to calculate the next parameters (similar to a database sequence). Note that the parameters modified in this method override the provided ones (typically given via command-line arguments). The addJobParametersIfAbsent() method is no longer needed, because it was only necessary for a restart.
Any attempt to launch a previously completed job instance or a non-restartable one is detected by JobLauncher (with JobInstanceAlreadyCompleteException and JobRestartException, respectively), so it need not be checked in JobLauncherCommandLineRunner or JobParametersBuilder.
All in all, I think the separation of concerns is clearer with this design, although, if accepted, the change would impact at least three projects (spring-batch, spring-boot and spring-cloud-task). Feedback is welcome!
By the way, I'm assuming that JobLauncherCommandLineRunner is the preferred option instead of the older CommandLineJobRunner. Am I right?
|Comment by Minhyeok Jeong [ 04/Oct/18 ]|
Aritz Bastida, thanks for your detailed explanation. I've been following your many edits.
As you mentioned, this PR does not take care of non-identifying parameters. But I agree that the loss of non-identifying parameters is another problem.
Although you suggest custom code for JobLauncherCommandLineRunner and JobParametersBuilder from a different point of view, I don't think this is such a complicated issue.
If the problem is the loss of non-identifying parameters, then we can simply fix it by removing the removeNonIdentifying(this.parameterMap); line in JobParametersBuilder#getNextJobParameters.
Then you can restart a failed job with new non-identifying parameters. But I'm not sure whether this is the intended behaviour, since the Spring team deliberately added this method (removeNonIdentifying()).
The official documentation about non-identifying parameters is insufficient, so I would like more explanation to be added about non-identifying parameters and their behaviour, especially when a batch job restarts.
Another point where my view differs from yours is the modification of JobLauncherCommandLineRunner, which is part of Spring Boot. One of the goals of Spring Boot is to help run Spring Batch conveniently, not to get involved in its behaviour. If JobLauncherCommandLineRunner determines whether or not to retrieve the previous JobParameters, Spring Boot is too concerned with the behaviour (or design) of Spring Batch. Therefore I think the role of JobLauncherCommandLineRunner is sufficient as it stands now.
Anyway, Mahmoud Ben Hassine has read your comments and will take a look. I hope the issue gets resolved soon.
|Comment by Aritz Bastida [ 04/Oct/18 ]|
Hi Minhyeok Jeong! Thanks for your feedback. But I'm sorry, I still think the PR tries to "hide" the design problem with getNextJobParameters(), and the fix you are suggesting is an indication of why.
The incrementer should (only) be called to calculate the "next" parameters for new job instances, as stated in the Spring Batch documentation (JobParametersIncrementer). Now, let's take a job with the same incrementer (i.e. an auto-generated "run.id" parameter) as in the Spring Batch example, and consider the following two situations:
Actually, there could be yet another side effect: if "run.id" were provided as a command-line parameter, it would take precedence over the value calculated by the incrementer, which is probably something we don't want.
I take the point that Spring Boot should not be too concerned with Spring Batch internals, but a bean called JobLauncherCommandLineRunner must at least know how to launch a job, and, as it currently stands, that includes calculating the next parameters. The problem is that it always calls getNextJobParameters(), whether it's a restart or not, even though this is only needed for new instances (as stated in the Spring Batch reference guide). It is also quite strange that the "next" parameters are sometimes the previous parameters (i.e. when the last execution failed).
To conclude, the code that determines whether it's a restart or not should live outside getNextJobParameters(): either in JobLauncherCommandLineRunner or, if we want to keep it internal to Spring Batch, within the JobLauncher itself. Something like:
...the third parameter being a boolean called "useIncrementer".
Internally, that would use the job repository to find out whether it's a restart or not, and call JobParametersBuilder.getNextJobParameters() only when necessary. In fact, SimpleJobLauncher already does the first part, so the change would be simple enough.
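The decision logic being proposed here can be sketched in a few lines; the boolean inputs stand in for a jobRepository.getLastJobExecution(...) lookup and the Job's restartable flag, and all names are illustrative rather than the actual Spring Batch API:

```java
public class LaunchDecision {

    // Decide how to handle a launch request: only a genuinely new instance
    // should go through the JobParametersIncrementer.
    static String decide(boolean instanceExists,
                         boolean lastExecutionFailed,
                         boolean restartable) {
        if (instanceExists && lastExecutionFailed && restartable) {
            // Restart: reuse the failed execution's identifying parameters;
            // do NOT call the incrementer.
            return "RESTART";
        }
        // New instance: this is the only place getNextJobParameters()
        // (and thus the incrementer) would be invoked.
        return "NEW_INSTANCE_WITH_INCREMENTER";
    }

    public static void main(String[] args) {
        System.out.println(decide(true, true, true));   // restart scenario
        System.out.println(decide(false, false, true)); // brand-new instance
        System.out.println(decide(true, false, true));  // last run completed
    }
}
```

Launching a previously completed or non-restartable instance would still be rejected downstream by the JobLauncher (JobInstanceAlreadyCompleteException / JobRestartException), as noted earlier in the thread.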
If you'd like to, I can contribute with a PR to prove my point.
|Comment by Minhyeok Jeong [ 10/Oct/18 ]|
Aritz Bastida, it looks like we have different points of view on this issue.
You may still take issue with the name, next, but I think it is reasonable: as you noted for JobParametersIncrementer, getNextJobParameters() has a similar name, and since you know the purpose of getNext(), I believe you can also accept getNextJobParameters(). (But who knows? The Spring team may yet change its name or design.)
Anyway, I respect your opinion, and this is a good place to propose your solution. However, since we think differently, it would be better to create another PR and leave the link in this JIRA ticket. From your explanation, you would need to open PRs against both the spring-batch and spring-boot projects.
|Comment by Aritz Bastida [ 10/Oct/18 ]|
Hi Minhyeok Jeong. That's right, it seems we have different opinions about what "next" means in Spring Batch. Historically, when using launchers like CommandLineJobRunner or JobService, the incrementer's getNext() method was only invoked for new instances, never for a restart. And there was a clear distinction: for example, CommandLineJobRunner has an explicit command-line argument, "-next". It seems that in Spring Boot this meaning has somehow changed.
Anyway, we are deviating from the main subject. As I pointed out in my last email, there is an additional problem with the fix you are suggesting for getNextJobParameters(): if the last execution failed and you want to run a new instance, the incrementer will never be called. Conversely, if you want to restart a previously failed execution but the last execution completed, the incrementer will be called (even though it is not necessary).
|Comment by Aritz Bastida [ 10/Oct/18 ]|
By the way, I will be glad to share a PR when I come back from holidays in three weeks.
As discussed, this issue can be solved in many ways, but the first thing a launch request must do is distinguish between a restart and a new instance, and for that we should use a DAO method such as jobRepository.getLastJobExecution().
|Comment by Mahmoud Ben Hassine [ 10/Oct/18 ]|
Thank you all for this interesting analysis and constructive discussion!
I will share more details soon about how we would approach this issue.
|Comment by Mahmoud Ben Hassine [ 11/Oct/18 ]|
There are a lot of details discussed here, so I'll try to summarize. But first, let me state some important facts that might influence the solution to this issue:
That said, there are two things that need to be addressed:
In regards to the PRs:
1. PR #630
The addJobParametersIfAbsent method is used in getNextJobParameters, so it is supposed to be used in a "start next instance" scenario (equivalent to CommandLineJobRunner with "-next"). This method is not correct, because its Javadoc says: "If the map previously contained a mapping for the key, the new value is ignored.", while the new value should override the existing one (see the Javadoc of CommandLineJobRunner: "If the -next option is used the parameters on the command line (if any) are appended to those retrieved from the incrementer, overriding any with the same key.").
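The gap between the two semantics can be illustrated with plain maps (a simplified stand-in for the real JobParameters type; the method names here are illustrative): "ifAbsent" mirrors the PR's addJobParametersIfAbsent Javadoc, while "overriding" mirrors the behaviour CommandLineJobRunner documents for "-next".

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class NextParameterSemantics {

    // PR #630 semantics: the new value is ignored when the key already exists.
    static Map<String, String> ifAbsent(Map<String, String> existing,
                                        Map<String, String> fresh) {
        Map<String, String> result = new LinkedHashMap<>(existing);
        fresh.forEach(result::putIfAbsent);
        return result;
    }

    // Documented "-next" semantics: command-line values are appended to the
    // incrementer's output and override entries with the same key.
    static Map<String, String> overriding(Map<String, String> fromIncrementer,
                                          Map<String, String> commandLine) {
        Map<String, String> result = new LinkedHashMap<>(fromIncrementer);
        result.putAll(commandLine);
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> incremented = Map.of("run.id", "5");
        Map<String, String> cli = Map.of("run.id", "99", "fileName", "new.csv");
        System.out.println(ifAbsent(incremented, cli));   // run.id stays 5
        System.out.println(overriding(incremented, cli)); // run.id becomes 99
    }
}
```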
2. PR #625
|Comment by Mahmoud Ben Hassine [ 24/Oct/18 ]|
After further investigation, this issue should be fixed in both Spring Batch and Spring Boot. The JobParametersBuilder#getNextJobParameters method (originally moved over from Boot) was dealing with both restartability and getting the parameters of the next instance. This made sense in Boot, but now that it has moved to Batch, it is inconsistent with the behaviour of CommandLineJobRunner and JobOperator. I explained in more detail why the fix should be done on both sides here, and opened a PR for each project.
|Comment by Aritz Bastida [ 06/Nov/18 ]|
Hi Mahmoud Ben Hassine! Thanks for considering all the feedback. Glad to know the issue has been resolved.
By the way, although not mentioned in your last comment, I see that this change also affected TaskJobLauncherCommandLineRunner in the spring-cloud-task project, just for the record.
|Comment by Spring Issues [ 17/Dec/19 ]|
The Spring Batch Framework has moved from Jira to GitHub Issues. This issue was migrated to spring-projects/spring-batch#882.