Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 2.1.7
    • Fix Version/s: None
    • Component/s: Core
    • Labels:
      None
    • Environment:
      Windows 7 Ultimate, but I imagine this is true for all OSes; Java 1.6.0_20

      Description

      If a process hosting a job dies unexpectedly or is killed (Ctrl-C, kill -9, etc), the job cannot be restarted, because Spring Batch is not aware that the process was killed.

      Much like other software can detect whether it was shut down gracefully versus abruptly, Spring Batch should be enhanced to do so. My guess would be to have Spring Batch update the JobRepository with a flag saying that a particular JobExecution was completed gracefully only upon normal exit (via -stop or end of flow). Then, upon a restart attempt of the JobExecution, Spring Batch could detect whether the last JobExecution terminated abruptly or gracefully and restart appropriately.

        Activity

        Dave Syer added a comment -

        Graceful completion means the StepExecution gets status=COMPLETED or FAILED or STOPPED, and
        the StepExecution and JobExecution both have a lastUpdated field which is persisted to the database frequently throughout the execution. I don't think there is much more that we can do in the general case to decide what to do with a StepExecution that has status=RUNNING. We leave it up to the user to decide based on the data that are available whether the step is really running or not, and if not running then if it is restartable or not (if it is, it can be assigned a status of FAILED or STOPPED, otherwise ABANDONED or COMPLETED is more appropriate). Some automatic restart scenarios are already supported by the JobService in Spring Batch Admin (and can be implemented generally speaking using JobExplorer, JobRepository and JobLauncher), but we can only work with FAILED or STOPPED executions (RUNNING is running as far as the framework can determine). Maybe you could suggest some other common scenarios where the framework actually could take this decision automatically?
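        A minimal sketch of the kind of check a user could run on the lastUpdated data to decide whether a RUNNING execution is plausibly dead. It assumes the status and timestamp have already been read from the batch metadata; StaleExecutionCheck, looksDead and maxSilence are hypothetical names, not part of Spring Batch, and the result is only a hint for the operator, not proof.

```java
import java.time.Duration;
import java.time.Instant;

/** Hypothetical helper for illustration; not part of Spring Batch. */
final class StaleExecutionCheck {

    /** Mirrors only the status values named in this discussion. */
    enum Status { RUNNING, COMPLETED, FAILED, STOPPED, ABANDONED }

    private StaleExecutionCheck() {}

    /**
     * Returns true when an execution is still marked RUNNING but its
     * lastUpdated timestamp is older than the allowed silence window.
     * A true result means "worth investigating", nothing stronger.
     */
    static boolean looksDead(Status status, Instant lastUpdated,
                             Instant now, Duration maxSilence) {
        if (status != Status.RUNNING) {
            return false; // already has a terminal status
        }
        return Duration.between(lastUpdated, now).compareTo(maxSilence) > 0;
    }
}
```

        The silence window has to be chosen per job: a step with long-running chunks may legitimately go a long time between metadata updates.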

        Matthew T. Adams added a comment -

        It seems to me that if you could add a boolean terminatedGracefully property to an execution (StepExecution, JobExecution, etc), that is false by default and only gets set to true (and persisted to the batch database) upon graceful termination of the execution, then you would have enough information to detect whether an execution terminated gracefully.

        The issue that I'm taking with Spring Batch here is that if a JobExecution dies or is killed without warning, the user has to manually correct the database, AFAICT. We've even gone so far as to write stored procedures that completely delete JobInstance and JobExecution instances in the database so that we can start (not restart) the job afresh (new JobInstance) after manually deleting any work that was written by that job instance.

        If the framework could detect graceful versus abrupt termination, it might be possible to establish some kind of abrupt termination recovery policy, where a reasonable default might be to update the exitStatus of the execution to FAILED, UNKNOWN or some other appropriate value, and then go from there. This would allow abruptly terminated executions to be restarted without requiring manual database intervention on the part of the user. That is, after the abrupt termination recovery policy executes, which updates JobInstance, JobExecution, and StepExecution state, Spring Batch could effectively restart its bootstrap process and take whatever actions it normally does.

        WDYT?

        Dave Syer added a comment -

        I don't see the difference between a boolean property that says "finished gracefully" and an entry in the BatchStatus enum. The BatchStatus is always updated when a job finishes (gracefully or not) as long as the JVM is still alive. If it is dead, then no-one can make the change, and it doesn't matter if it is boolean or enum. So the framework can detect the situation you describe, but only if it has some knowledge of whether the Job is still alive or not, and if not, whether the metadata is in a consistent state: only the user can decide if that is the case or not. If you, as a user, decide that the database is in a consistent state, and the job is no longer running, then you can use the JobExplorer and JobRepository to change the Step and Job statuses from RUNNING to FAILED, and then restart. (Possibly you have a valid request here for a convenience method to do that, but I need you to be on the same page before we make that into a new feature.)
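        The RUNNING-to-FAILED correction described above can be sketched against a plain in-memory model. Everything here (AbruptTerminationRecovery, JobRecord, StepRecord, markRunningAsFailed) is a hypothetical stand-in for the batch metadata, not Spring Batch API; it only illustrates the order of the updates, and it assumes the user has already confirmed that no process is still executing these jobs.

```java
import java.util.ArrayList;
import java.util.List;

/** In-memory stand-ins for the batch metadata; all names are hypothetical. */
final class AbruptTerminationRecovery {

    enum Status { RUNNING, COMPLETED, FAILED, STOPPED }

    static final class StepRecord {
        final String name;
        Status status;

        StepRecord(String name, Status status) {
            this.name = name;
            this.status = status;
        }
    }

    static final class JobRecord {
        final long id;
        Status status;
        final List<StepRecord> steps = new ArrayList<>();

        JobRecord(long id, Status status) {
            this.id = id;
            this.status = status;
        }
    }

    private AbruptTerminationRecovery() {}

    /**
     * Marks every RUNNING job, and each of its RUNNING steps, as FAILED.
     * Returns the ids of the jobs that were changed so the caller can
     * restart them. Steps that already finished are left untouched.
     */
    static List<Long> markRunningAsFailed(List<JobRecord> jobs) {
        List<Long> recovered = new ArrayList<>();
        for (JobRecord job : jobs) {
            if (job.status != Status.RUNNING) {
                continue; // only orphaned RUNNING executions are corrected
            }
            for (StepRecord step : job.steps) {
                if (step.status == Status.RUNNING) {
                    step.status = Status.FAILED;
                }
            }
            job.status = Status.FAILED;
            recovered.add(job.id);
        }
        return recovered;
    }
}
```

        In a real application the lookups and updates would go through JobExplorer and JobRepository as described in the comment, and whether the written data is consistent enough to restart remains the user's call.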

        Matthew T. Adams added a comment -

        Ok, I hadn't thought of detecting whether the job is still alive in another thread; I agree that it is only the user that can detect such a situation. One possible solution to this is to allow the user to tell the framework, via configuration, that a given job will only be run once at a time (<job ... max-concurrent-executions="1">, <job ... parallel-executions="false">, <job ... serialized-executions="true"> or something like that). The framework could then take the user at his word, and do the recovery you describe automatically in the form of a DefaultAbruptTerminationRecoveryPolicy.

        I have a hunch that this would be a very common scenario. WDYT?

        Dave Syer added a comment -

        There is support for "only one instance at a time" in Spring Batch Admin (probably it will move into Spring Batch in 2.2, but I don't think anyone asked for it so there's no JIRA yet). Look at JobLauncherSynchronizer (it's AOP advice to add to the JobLauncher).

        I don't understand how that relates to your last comment about recovery though. The framework already guarantees there is only one execution of a given instance active. That is enough to be sure that you aren't running the job somewhere else in most cases. The problem for non-graceful JVM exit is that it is impossible to tell from the database whether it has happened, and the data can be in an inconsistent state that should not be restarted. It's still up to the user to decide if a job is running, restartable or should be abandoned.

        Dean Hiller added a comment -

        I would really like an option on startup to reset state for certain jobs. If I am starting up and I know I have only one server, it would be great to clear out the state. I just got a ticket from QA on this issue: their job won't start because someone hit Ctrl-C on it.

        Also, a Ctrl-C should result in a call to Spring Batch's shutdown hook, and that hook "should" make a best effort at updating the database to some sensible state, since Ctrl-C is used quite a bit. Obviously with a kill -9 we are basically screwed, as the JVM has no time to modify the database state.
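        A rough sketch of such a best-effort hook, using the standard Runtime.addShutdownHook call. The status map is a hypothetical in-memory stand-in for the batch metadata tables, and ShutdownHookSketch, newBestEffortHook and install are made-up names; a real hook would have to write to the database within whatever time the JVM has left.

```java
import java.util.Map;

/** Illustrative only; the status map stands in for the batch metadata. */
final class ShutdownHookSketch {

    private ShutdownHookSketch() {}

    /**
     * Builds a hook thread that flips every RUNNING entry to STOPPED.
     * Entries with any other status are left as they are.
     */
    static Thread newBestEffortHook(Map<Long, String> statusByExecutionId) {
        return new Thread(
            () -> statusByExecutionId.replaceAll(
                (id, status) -> "RUNNING".equals(status) ? "STOPPED" : status),
            "batch-shutdown-hook");
    }

    /**
     * Registers the hook with the JVM. Hooks run on normal exit and on
     * Ctrl-C, but not when the process is killed with kill -9.
     */
    static Thread install(Map<Long, String> statusByExecutionId) {
        Thread hook = newBestEffortHook(statusByExecutionId);
        Runtime.getRuntime().addShutdownHook(hook);
        return hook;
    }
}
```

        Because the hook never runs after kill -9 or a JVM crash, it reduces the number of orphaned RUNNING executions but cannot replace a recovery step at startup.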

        Damien Hollis added a comment -

        I also have exactly the same issue, and our solution was to write a SmartLifecycle bean that looks for jobs in RUNNING state during startup, corrects their state and restarts them. This is not ideal because we are setting the internal state of batch jobs manually, which means any change to your internal state model will break our code.

        Reading this discussion, it seems like Dave is maybe thinking about an environment where batch jobs are started as different processes, so it is hard for the batch job framework to know that a job that says it is RUNNING has really died. However, in our scenario, and I suspect in Matt and Dean's, we are running batch jobs within our web application. It is the only process that will run a batch job and if it dies and is restarted, you can be completely sure that any job that says it is running (in the database), is no longer running.

        I think in our scenario it would be very feasible to introduce a SmartLifecycle bean as part of the framework that can gracefully start up and shut down Spring Batch, making best attempts to recover jobs during startup and potentially marking running jobs as incomplete (or similar) during a shutdown.


          People

          • Assignee:
            Dave Syer
            Reporter:
            Matthew T. Adams
          • Votes:
            1
            Watchers:
            2

            Dates

            • Created:
              Updated: