[BATCH-2442] When an error is thrown on write and another error is thrown on process during retry, the job gets in a infinite loop and never finishes. Created: 02/Oct/15 Updated: 30/Oct/19 Resolved: 29/Mar/18
|Fix Version/s:||4.1.0, 4.0.2, 3.0.10, 4.1.0.M1|
|Reporter:||Yoann GENDRE||Assignee:||Mahmoud Ben Hassine|
|Remaining Estimate:||Not Specified|
|Time Spent:||Not Specified|
|Original Estimate:||Not Specified|
|Attachments:||Batch2442IgnoredItems.zip FaultTolerantChunkProcessor.java test-case-infiniteLoopOnretry.zip|
|Pull Request URL:||https://github.com/spring-projects/spring-batch/pull/592|
Job get in a infinite loop when an exception is first thrown during write and a second exception is thrown during the retry on processing the first item of the chunk.
In attachment, maven project with a job test case with commit-interval = 5 and skip-limit=3
It seems inputs is never set to busy=false in method scan of FaultTolerantChunkProcessor. (output is never empty)
|Comment by Yoann GENDRE [ 05/Oct/15 ]|
Today I try to fix the problem. It seems to be OK (Unit test of spring-batch-core-2.2.7.RELEASE and mine are OK ).
I have updated class FaultTolerantChunkProcessor (you can find it in attachment) as detailed bellow.
In method transform, if we are in a scan and process return null (item is skipped), I removed cached output for this item :
In method write, If we are in a scan, I only write if item was not skipped during process :
|Comment by Yoann GENDRE [ 10/Jan/17 ]|
The proposed correction in class FaultTolerantChunkProcessor in the pull request is a little bit different.
|Comment by Mahmoud Ben Hassine [ 20/Feb/18 ]|
Thank you for reporting this case and for providing an example.
When using a fault tolerant step, the item processor should be idempotent (as explained in the reference documentation). In your example, the ItemProcessor succeeds the first time for item2 but fails the second time. This behavior is not idempotent. Since it is not always possible to write an idempotent processor (think of transient exceptions for instance), what you can do is:
1. either declare the exception as retryable (in addition to declaring it as skippable)
I tested both ways and they do make the test in your example pass (For solution 1, I adapted the assertions since there are no more skips in processing. For solution 2, just adding processor-transactional="false" to the chunk element will make the test pass).
In regards to your PR, the code changes will actually fix the case but what will happen is:
1. cache the value the first time when it succeeds (not part of your PR)
This is by design (of the transactional property of the processor) what we don't want to do. Either we cache the value and reuse it without recalling the processor, or we don't cache any value and retry calling the processor according to the retry policy you provide.
Does this make sense?
|Comment by Mahmoud Ben Hassine [ 02/Mar/18 ]|
I'm closing this as "Won't fix" since the item processor used in the example is not idempotent as the documentation suggests. If the item processor cannot be made idempotent, it should be declared as non transactional or declare the exception it might throw in subsequent calls as retryable (more details in my previous comment).
|Comment by Yoann GENDRE [ 23/Mar/18 ]|
Thanks for the return. OK for your answer and your comment on the code changes I suggest (I must admit I have not understand all code in FaultTolerantChunkProcessor - too complicated for me).
I think we still have a problem. We should not go in a infinite loop in this case !!! From what I remember some items are not processed after first write error --> run state is KO and can't be restarted.
To me (and I think it's the mots common use case ???) it's really hard to implement idempotent processor. In most case we can do it from business point of view but we can't be sure that a technical exception (SGBD or WS access) won't be thrown.
|Comment by Mahmoud Ben Hassine [ 27/Mar/18 ]|
I agree with you. From a robustness point of view, SB should not go in an infinite loop.
It should not fail, the step can proceed to completion and count the item as skipped. Failing the whole step for a single failed item is not ideal. I opened a PR to fix this issue and added a test similar to yours (I just simplified the configuration). The fix in the PR also makes your attached test to pass. Can you give it a try?
|Comment by Yoann GENDRE [ 29/Mar/18 ]|
I have just tried your fix. It seems to be OK for me.
|Comment by Mahmoud Ben Hassine [ 29/Mar/18 ]|
Great! Thank you for taking time to test the fix.
|Comment by Conosci [ 31/Aug/19 ]|
I have a scenario where I believe this fix is causing another dangerous bug, some items don't end properly their lifecycle and are silenty ignored. The scenario is the following:
At runtime, after the
The same scenario is working well before the fix version, using Spring Batch version 4.0.1.RELEASE all the listeners are correctly invoked and the ending counts are correct.
Please look into this critical issue.
|Comment by Wolfgang Klaus [ 30/Oct/19 ]|
the fix for this issue is really problematic and critical. As Conosci commented on Aug/31/2019 it is possible that some items are silently ignored.
I'm attaching an Test for this issue. When you look at the StandardOutput after running the test you can see that for item 1 the writer is never called. In addition the Processor for item 19 is called twice without calling the writer.
Here the relevant output before your changes:
And here the relevant output after your changes:
As you can see the writer for item 1 is never called although it is in the list to write and the processor for item 19 is called twice.
To fix this the condition introduced with this fix in FaultTolerantChunkProcessor.java:584 has to be extended