[BATCH-2442] When an error is thrown on write and another error is thrown on process during retry, the job gets in a infinite loop and never finishes. Created: 02/Oct/15  Updated: 30/Oct/19  Resolved: 29/Mar/18

Status: Resolved
Project: Spring Batch
Component/s: Core, Retry
Affects Version/s: 2.2.7
Fix Version/s: 4.1.0, 4.0.2, 3.0.10, 4.1.0.M1

Type: Bug Priority: Major
Reporter: Yoann GENDRE Assignee: Mahmoud Ben Hassine
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: Zip Archive Batch2442IgnoredItems.zip     Java Source File FaultTolerantChunkProcessor.java     Zip Archive test-case-infiniteLoopOnretry.zip    
Issue Links:
Relate
relates to BATCH-2302 exception on process while recovering... Resolved
Pull Request URL: https://github.com/spring-projects/spring-batch/pull/592

 Description   

Job get in a infinite loop when an exception is first thrown during write and a second exception is thrown during the retry on processing the first item of the chunk.

In attachment, maven project with a job test case with commit-interval = 5 and skip-limit=3

  • read 5 String from ListItemReader
  • process String - increment count item processed - throw an exception if count = 6 (first process after first exception)
  • write items - throw an exception on first chunk

It seems inputs is never set to busy=false in method scan of FaultTolerantChunkProcessor. (output is never empty)
For now I can't see where the problem come from...



 Comments   
Comment by Yoann GENDRE [ 05/Oct/15 ]

Today I try to fix the problem. It seems to be OK (Unit test of spring-batch-core-2.2.7.RELEASE and mine are OK ).

I have updated class FaultTolerantChunkProcessor (you can find it in attachment) as detailed bellow.

In method transform, if we are in a scan and process return null (item is skipped), I removed cached output for this item :

/*
			 * We only want to process the first item if there is a scan for a
			 * failed item.
			 */
			if (data.scanning()) {
				// on scanning processed item is skipped - remove matching cached output
				if (output == null) {
					if (cacheIterator != null && cacheIterator.hasNext()) {
						cacheIterator.next();
						cacheIterator.remove();
					}
				}

				while (cacheIterator != null && cacheIterator.hasNext()) {
					outputs.add(cacheIterator.next());
				}

				// Only process the first item if scanning
				break;
			}

In method write, If we are in a scan, I only write if item was not skipped during process :

				if (!data.scanning()) {
					...
				} else {
					// scan only if item was not skipped
					if (inputs.getSkips().size() == 0) {
						scan(contribution, inputs, outputs, chunkMonitor, false);
					} else {
						logger.debug("Scan skipped because item was skipped during process");
					}
				}
				return null;
Comment by Yoann GENDRE [ 10/Jan/17 ]

The proposed correction in class FaultTolerantChunkProcessor in the pull request is a little bit different.
In pull request a test case illustrating the problem :
Step with commit-interval=5
7 items (item1,item2,item3,item4,item5,item6,item7) in input
Writer fail on 1rst chunk (writer.cpt=1)
Processor fail on 1rst item during scan (processor.cpt=6) and on 2nd item during scan (processor.cpt=7)
Expected output =
Item:id=3,nbProcessed=2,nbWritten=2
Item:id=4,nbProcessed=2,nbWritten=2
Item:id=5,nbProcessed=2,nbWritten=2
Item:id=6,nbProcessed=1,nbWritten=1
Item:id=7,nbProcessed=1,nbWritten=1
In the test cas a step listener is used to stop the infinite loop and throw an exception when chunk number > 200

Comment by Mahmoud Ben Hassine [ 20/Feb/18 ]

Hi,

Thank you for reporting this case and for providing an example.

When using a fault tolerant step, the item processor should be idempotent (as explained in the reference documentation). In your example, the ItemProcessor succeeds the first time for item2 but fails the second time. This behavior is not idempotent. Since it is not always possible to write an idempotent processor (think of transient exceptions for instance), what you can do is:

1. either declare the exception as retryable (in addition to declaring it as skippable)
2. or declare the processor as non transactional (in which case item2 will be cached the first time and (re)used the second time during the scan)

I tested both ways and they do make the test in your example pass (For solution 1, I adapted the assertions since there are no more skips in processing. For solution 2, just adding processor-transactional="false" to the chunk element will make the test pass).

In regards to your PR, the code changes will actually fix the case but what will happen is:

1. cache the value the first time when it succeeds (not part of your PR)
2. remove the cached value the second time when scanning (undo step 1)
3. detect that through skip count and prevent scanning the (single skipped item) chunk

This is by design (of the transactional property of the processor) what we don't want to do. Either we cache the value and reuse it without recalling the processor, or we don't cache any value and retry calling the processor according to the retry policy you provide.

Does this make sense?

Kr
Mahmoud

Comment by Mahmoud Ben Hassine [ 02/Mar/18 ]

I'm closing this as "Won't fix" since the item processor used in the example is not idempotent as the documentation suggests. If the item processor cannot be made idempotent, it should be declared as non transactional or declare the exception it might throw in subsequent calls as retryable (more details in my previous comment).

Comment by Yoann GENDRE [ 23/Mar/18 ]

Hi Mahmoud

Thanks for the return. OK for your answer and your comment on the code changes I suggest (I must admit I have not understand all code in FaultTolerantChunkProcessor - too complicated for me).

I think we still have a problem. We should not go in a infinite loop in this case !!! From what I remember some items are not processed after first write error --> run state is KO and can't be restarted.
Shouldn't SB fail on getting an exception on an item processed a 2nd time (first time with success). This "NonIdempotentProcessingItemFailure" would prevent from getting in this infinite loop ?

To me (and I think it's the mots common use case ???) it's really hard to implement idempotent processor. In most case we can do it from business point of view but we can't be sure that a technical exception (SGBD or WS access) won't be thrown.

Kind regards
Yoann

Comment by Mahmoud Ben Hassine [ 27/Mar/18 ]

Hi Yoann,

I agree with you. From a robustness point of view, SB should not go in an infinite loop.

Shouldn't SB fail on getting an exception on an item processed a 2nd time (first time with success). This "NonIdempotentProcessingItemFailure" would prevent from getting in this infinite loop ?

It should not fail, the step can proceed to completion and count the item as skipped. Failing the whole step for a single failed item is not ideal. I opened a PR to fix this issue and added a test similar to yours (I just simplified the configuration). The fix in the PR also makes your attached test to pass. Can you give it a try?

Kind regards,
Mahmoud

Comment by Yoann GENDRE [ 29/Mar/18 ]

Hi Mahmoud,

I have just tried your fix. It seems to be OK for me.

Thanks.

Comment by Mahmoud Ben Hassine [ 29/Mar/18 ]

Hi Yoann,

Great! Thank you for taking time to test the fix.

Kr,
Mahmoud

Comment by Conosci [ 31/Aug/19 ]

Dear Mahmoud,

I have a scenario where I believe this fix is causing another dangerous bug, some items don't end properly their lifecycle and are silenty ignored. The scenario is the following:

  • Job with 1 Step With Chunk Processing (ChunkSize=10)
  • Reader, Processor, Writer, all with faultTolerant() and skipPolicy(..) (Always Skip) and skipListener(..) to listen to 'onSkipInWrite' and 'onSkipInProcess'
  • Source Data with 3 Records: the first is a good record, the second is giving Processing Error and the third is giving Write Error (duplicate primary key against the first record)

At runtime, after the BATCH-2442 fix, the result is the following:

  1. Reader reads the 3 records chunk
  2. Processor start processing and meet Processing Error in the second record
  3. The System will skip (according to the configured policy) the second record and will invoke the SkipListener (onSkipInProcess) giving to the user the possibility to manage the issue
  4. The System will take the remaining 2 records (the first and the third), will process again them and will try to write them in chunk (same transaction)
  5. During the write the database will return exception (duplicate primary key)
  6. The System will go in 'recovery' mode and will start processing the two items one by one
  7. Now during the scan the following code will execute:
    1. //BATCH-2442 : do not scan skipped items
      if (!inputs.getSkips().isEmpty()) {
        if (outputIterator.hasNext()) {    outputIterator.remove();    return;   }

      }

  8. The condition is then verified, there is a skip (the first Processor error for the second record) and one record from the remaining two (first and third) is then removed without any reason!
  9. At this point only 1 record is then available in outputIterator (the first record), and it will be written successfully.
  10. The job then ends with a non consistent state (there is a writeSkipCount element missing, the totals are wrong):
    1. readCount = 3
    2. processSkipCount = 1
    3. writeCount = 1
    4. writeSkipCount = 0
    5. rollbackCount = 2
    6. commitCount = 2
  11. Basically what happened is that removing the wrong element from the outputIterator, the system didn't catch the Write Exception and didn't execute the proper lifecycle (doOnWriteError/shouldSkip/onSkipInWrite) givinig no change to the user to know about the skipped element in the skipListener

 

The same scenario is working well before the fix version, using Spring Batch version 4.0.1.RELEASE all the listeners are correctly invoked and the ending counts are correct.

Please look into this critical issue.

Thanks

 

Comment by Wolfgang Klaus [ 30/Oct/19 ]

Hi Mahmoud,

the fix for this issue is really problematic and critical. As Conosci commented on Aug/31/2019 it is possible that some items are silently ignored.

I'm attaching an Test for this issue. When you look at the StandardOutput after running the test you can see that for item 1 the writer is never called. In addition the Processor for item 19 is called twice without calling the writer.

Here the relevant output before your changes:

......
write items [19, 18, 17, 16, 14, 13, 12, 9, 8, 7, 6, 4, 3, 2, 1]
fail Item 13
process item 19
write items [19]
.........
process item 1
write items [1]
[info 2019/10/30 20:14:06.589 MEZ <Time-limited test> tid=0x11] Step: [step] executed in 306ms

And here the relevant output after your changes:

.......
write items [19, 18, 17, 16, 14, 13, 12, 9, 8, 7, 6, 4, 3, 2, 1]
fail Item 13
process item 19
process item 19
write items [19]
----
process item 2
write items [2]
[info 2019/10/30 20:16:06.589 MEZ <Time-limited test> tid=0x11] Step: [step] executed in 306ms

As you can see the writer for item 1 is never called although it is in the list to write and the processor for item 19 is called twice.

To fix this the condition introduced with this fix in FaultTolerantChunkProcessor.java:584 has to be extended
With the following change everything is allright.

if (!inputs.getSkips().isEmpty()  && inputs.getItems().size() != outputs.getItems().size() ) {
Generated at Sun Nov 17 20:25:54 UTC 2019 using Jira 7.13.8#713008-sha1:1606a5c1e7006e1ab135aac81f7a9566b2dbc3a6.