Spring Batch 2.0 New Feature Rundown

Dave Syer

In this article we outline the main themes of Spring Batch 2.0, and highlight the changes from 1.x. Development work on the new release is well underway, with an M2 release last week, and we are getting a lot of interest, so now seems like a good time to give a few pointers.

Spring Batch 2.0 Themes

The four main themes of the new release are

  • Java 5 and Spring 3.0
  • Non-sequential execution
  • Scalability
  • Configuration: annotations and XML namespace

so we'll cover each of those areas separately and describe what they mean and the impact of the changes on Spring Batch existing users. There is more detail below for features that are already implemented, which is mostly in the first category with some enabling features in other areas.

There are no changes to the physical layout of the project in Spring Batch 2.0.0.M2 (same old downloads, same basic layout of Java packages). We have not removed any features, but we have taken the opportunity to revise a couple of APIs, and there are some minor changes for people updating projects from 1.x. Spring Batch is immature enough and we were adding some pretty big features, so we decided a major version change was a good opportunity to have a bit of a clean out. We don't expect anyone to have any difficulty upgrading, and if you are an existing user this article will help you to get the measure of the changes.

Java 5 and Spring 3.0

As you may know, Spring 3.0 is going to be the first major release of Spring to target Java 5 exclusively (I'll leave it to Juergen and Arjen to clarify that in more detail). Now that Sun has put an "End of Service Life" stamp on the JDK 1.4 it seems appropriate, and there are some great new features on Spring 3.0 that we want to take advantage of.

Type Safety

Most of the work in the 2.0.0.M1 release of Spring Batch went into converting the existing code to Java 5, taking advantage of generics and parameterised types wherever we could. This gives users of the framework a much nicer programming experience, allowing compile time checks for type safety and ultimately reducing maintenance costs for projects using Spring Batch. For instance the in the ItemReader, one of the central interfaces and user extension points in Spring Batch, we now have a typesafe read method:

public interface ItemReader<S> {
    S read();
}

A further point to note here is that the old (1.x) framework callbacks mark() and reset() have gone from this interface, making it more friendly to end users, and preventing misunderstanding about what the framework requires and when. The same concerns (or mark and reset) are now handled internally in the Step implementations that the framework provides.

Chunk-oriented Processing

Similar changes, and one slightly more radical, have also occurred in the partner ItemWriter interface, used by the framework for writing data:

public interface ItemWriter<T> {
    void write(List<? extends T> items);
}

The old framework callbacks for flush() and clear() have gone from this interface too, and to compensate for that, the write() method has a new signature. The bottom line here is that we have moved to a chunk-oriented processing paradigm internally to the framework. This is actually much more natural in a batch framework than the old item-oriented approach because for performance reasons we often need to buffer and flush, and the old interface made that awkward for users. Now you can do all the batching you need inside the write() method.

Step Factory Bean Changes

A side effect of the chunk-oriented approach to processing is a change in the step factory bean implementations. The old SkipLimitStepFactoryBean has been renamed to FaultTolerantStepFactoryBean, and it can still be used for most common use cases as a replacement for the old factory bean. By default it creates a step implementation that buffers input items across rollbacks, so that the new ItemReader and ItemWriter interfaces work properly with non-transactional input sources (like files). For input sources that re-present the items after a rollback, clients have to set a flag in the factory bean (isReaderTransactional is the name of the flag as of M2) – JmsItemReader is the only relevant reader in the framework and not many projects use it so this isn't a big change.

Business Processing

A related new feature is that there is a new kid on the block in the form of the ItemProcessor:

public interface ItemProcessor<S,T> {
    T process(S item);
}

In 1.x the transformation between input items of type S and output items of type T had to be hidden inside one of the other participants (usually the ItemWriter). Now we have genericised this concern and placed it at the same level of importance in the framework as its siblings, ItemReader and ItemWriter. Users of 1.x might recognise the traces of the old ItemTransformer interface here, which has now been removed.

A More Useful Tasklet Interface

Many people, looking at Spring Batch 1.x, have asked "what if my business logic is not reading and writing?" To answer this question more satisfactorily we have modified the Tasklet interface. In Spring Batch 1.x it is fairly bland – little more than a Callable in fact, but in 2.0 we have given it some more flexibility and slotted it more into the mainstream of the framework (e.g. the central chunk-oriented step implementation is now implemented as a Tasklet). Here is the new interface:

public interface Tasklet {
    ExitStatus execute(StepContribution contribution,
        AttributeAccessor attributes);
}

The idea is that the tasklet can now contribute more back to the enclosing step, and this makes it a much more flexible platform for implementing business logic. The StepContribution was already part of the 1.x API, but it wasn't very publicly exposed. Its role is to collect updates to the current StepExecution without the programmer having to worry about concurrent modifications in another thread. This also tells us that the Tasklet will be called repeatedly (instead of just once per step in the 1.x framework), and so it can be used to carry out a greater range of business processing tasks. The AttributeAccessor is a chunk-scoped bag of key-value pairs. The tasklet can use this to store intermediate results that will be preserved across a rollback.

Late Binding of Job and Step Attributes

So far (up to M2) we haven't been able to take advantage of any new Spring 3.0 features because there are no releases of Spring 3.0 yet. But starting with the next milestone (M3) we will be using the new Spring EL (Expression Language) features to enable late binding of JobParameters and ExecutionContext attributes to step components. This is a much requested feature, and we have some workarounds for special cases in 1.x, like the StepExecutionResourceProxy for binding to a filename as an input parameter to a job. A more generic solution available in Spring Batch 2.0 is to allow those step attributes to be bound to arbitrary components, by defining them in an appropriate Spring scope, e.g.

<bean id="step1" parent="simpleStep">
  <property name="itemReader">
    <bean class="org.springframework.batch.item.file.FlatFileItemReader" scope="step">
        <property value="#{jobParameters[inputFile]}" />
        …
    </bean>
  </property>
  …
</bean>

The item reader needs to be bound to a file name that is only available at runtime. To do this we have declared it as scope="step" and used the Spring EL binding pattern #{...} to bind in a job parameter. The same pattern will work with step and job level execution context attributes.

The StepScope is available in nightly snapshots, but we are still using Spring 2.5.5, so the EL bindings are not yet working. Step scoped beans are also a good solution to the old Spring Batch problem of how to keep your steps thread safe. If a step relies on a stateful component like a FlatFileItemReader, then it only has to define that component as scope="step" and the framework will create a lazy initializing proxy for it, and it will be created as needed once per step execution. There's still noting wrong with creating a new ApplicationContext for each job execution, which is the current best practice in 1.x for keeping threads from colliding across jobs. But now step scope gives you another option, and one that is probably easier for most Spring users to get to grips with.

Non-sequential Execution

In 1.x the model of a job was always as a linear sequence of steps, and if one step failed, then the job failed. Although many jobs still fit that pattern so it hasn't gone away, in 2.0 we are lifting that restriction by introducing some new features. These are planned for the M3 release so the implementation details might change, but the idea is to support three features:

  • Conditional execution: branching to a different step based on the ExitStatus of the last one. This includes the ability to branch on a FAILED status, which implies that a step failure is no longer fatal for a job.
  • Pause execution and wait for explicit instruction to proceed. This is useful for instance where there is a business rule that forces manual intervention to check the validity of business critical data.
  • Parallel execution of multiple steps. Where steps are independent the user can specify which branches can be executed in parallel.

Nightly snapshots as of the time of writing this article contain a ConditionalJob that supports the first use case above.

Scalability

Spring Batch 1.x was always intended as a single VM, possibly multi-threaded model, but we built a lot of features into it that support parallel execution in multiple processes. Many projects have successfully implemented a scalable solution relying on the quality of service features of Spring Batch to ensure that processing only happens in the correct sequence. In 2.0 we plan to expose those features more explicitly. There are two approaches to scalability, and we will support both: remote chunking, and partitioning.

Remote Chunking

Remote chunking is a technique for dividing up the work of a step without any explicit knowledge of the structure of the data. Any input source can be split up dynamically by reading it in a single process (as per normal in 1.x) and sending the items as a chunk to a remote worker process. The remote process implements a listener pattern, responding to the request, processing the data and sending an asynchronous reply. The transport for the request and reply has to be durable with guaranteed delivery and a single consumer, and those features are readily available with any JMS implementation. But Spring Batch is building the remote chunking feature on top of Spring Integration, so actually it is agnostic to the actual implementation of the message middleware.

Partitioning

Partitioning is an alternative approach which in contrast depends on having some knowledge of the structure of the input data, like a range of primary keys, or the name of a file to process. The advantage of this model is that the processors of each element in a partition can act as if they are a single step in a normal Spring Batch job. They don't have to implement any special or new patterns, which makes them easy to configure and test. Partitioning in principle is more scalable than remote chunking because there is no serialization bottleneck arising from reading all the input data in one place.

In Spring Batch 2.0 partitioning is supported by two interfaces: PartitionHandler and StepExecutionSplitter. The PartitionHandler is the one that knows about the execution fabric – it has to transmit requests to remote steps and collect the results using whatever grid or remoting technology is available. PartitionHandler is an SPI, and we provide one implementation out of the box for local execution through a TaskExecutor. This will be useful immediately to a number of projects we have seen where parallel processing of heavily IO bound tasks is required, since in those cases remote execution only complicates the deployment and doesn't necessarily help much with the performance. Other implementations will be specific to the execution fabric, e.g. one of the grid providers (IBM, Oracle, Terracotta, Appistry etc.), and we don't want to imply a preference for any of those over the others in Spring Batch.

SpringSource is planning an Enterprise Batch product that will provide a full runtime solution for partitioning and remote chunking, as well as admin and scheduling concerns. Watch this space for more details of how that is shaping up, or come to Spring One Americas

Configuration: annotations and XML namespace

The idea behind using annotations to implement batch logic is by analogy with Spring @MVC. The net effect is that instead of having to implement and possibly register a bunch of interfaces (reader, writer, processor, listeners, etc.) you would just annotate a POJO and plug it into a step. There are method level annotations corresponding to the various interfaces, and parameter level annotations and factory methods corresponding to job, step and chunk level attributes (a bit like @ModelParameter and @RequestParameter in Spring @MVC.

A XML namespace for Spring Batch makes configuration of common things even easier. For example, the ConditionalJob we mentioned above has an XML configuration option, so a simple conditional execution might look like this:

<job>
  <step name="gamesLoad" next="playerLoad"/>
  <step name="playerLoad">
    <next on="*" to="summarize"/>
    <next on="FAILED" to="compensate"/>
  </step>
  <step name="compensate"/>
  <step name="summarize"/>
</job>

The first step ("gamesLoad") is followed immediately by "playerLoad", but if that fails, then we can go to an alternative step ("compensate") instead of finishing the job normally with the "summarize" step. The name= attribute of the <step element in the XML is a bean reference, so the implementation of Step is defined elsewhere (possibly through annotations).

Database Schema Changes

There are a couple of tidying up tasks and extensions to the data model in the meta data schema. We will be providing update scripts for anyone moving from 1.x to 2.0, so they shouldn't cause any problems, and will definitely make it easier to navigate and interact with the meta data. For those of you who are new to Spring Batch, some of the key benefits of the framework are the quality of service features like restartability and idempotence (process business data once and only once). We implement these features through shared state in a relational database (in most use cases), and the definition of the data model in this database has changed slightly in 2.0.

The main changes are to do with the storage of ExecutionContext, which used to be centralised in one table, even though the context can be associated either with a StepExecution or a JobExecution. The new model will be more popular with DBAs because it makes the relationships more transparent in the DDL. We also started storing the context values in JSON, to make them easier to read and track for human users (the context for a single entity is all stored in one row in a table, instead of many).

We have also added some more statistics for the counting and accounting of items executed and skipped, splitting out counts for total items read, processed and written at each stage. For steps (or tasklets) that do not split their execution into read, process, write, this is more comprehensive than is needed, but for the majority use case it is more appropriate than just storing an overall item count.

Closing Remarks

Hopefully this article has whetted your appetite for Spring Batch 2.0, and you will have time to download a snapshot or milestone and try it out. We get such good feedback from the community that the product wouldn't be half as good as it is without your help and support. If I have missed something that you were interested in, or haven't provided enough detail I apologise, but I needed to squeeze this into a reasonable blog-sized format. I can write more if anyone is interested – just ask.

If you do want to know more, there is an excellent opportunity to hear about Spring Batch 2.0 and all the other SpringSource activities from the team at this year's Spring One Americas. Lucas Ward will be presenting Spring Batch 2.0 features in one session and I will be showing the scalability features in a bit more detail. Also check out the discussions on the forum, or go to the home page here.

Similar Posts

Share this Post
  • Digg
  • Sphinn
  • del.icio.us
  • Facebook
  • Mixx
  • Google Bookmarks
  • DZone
  • LinkedIn
  • Slashdot
  • Technorati
  • TwitThis
 

9 responses


  1. Thanks for the great overview.
    This will help me justify a migration now.

    I have made great progress with v1.x; but
    the new interfaces and integration with Spring Integration in v2.0 will
    greatly improve the development scalability.


  2. This all looks very good, especially the parallel execution bit, which reminds me of the old JES 3 job netting. Now all you need to do is add support to the Spring IDE for visually realizing the structure of your parallel job net. If this were done it should be a relatively easy step to put in some visual monitoring of a batch stream (so you can see what is running, what has run, how long a task has taken etc), all in a GUI.


  3. "Spring Batch is immature enough " – I guess this was supposed to sound "Spring Batch is mature enough"? :-)


  4. The framework really seems to be picking fast.Presently in 2.0Rc1 it seems the partitioning is happening at Step level.Few questions if you can throw a light on it?
    A. Does the framework support Job Level partitioning. A Job has 10 parallel steps, can it be configured to make it run 10 steps on different VM's?
    B. What is the suggested approach, Breaking in different jobs and then scaling up or breaking in steps?

    Thanks


  5. Partitioning is all about running the same step in parallel in multiple processes (or threads) with the input data partitioned. See the use guide for more information (http://static.springframework.org/spring-batch/trunk/reference/).

    Please ask more questions on the forum (better chance of wide readership): http://forum.springframework.org/forumdisplay.php?f=41.


  6. Hi
    I use RC2.0 spring batch and Chunk-oriented Processing is exactly what I need. I use JmsItemReader to read IBM MQ message (commitInterval = 100) and found I can only GET 400 messages in 25 sec after I added taskExecutor in my SimpleStepFactoryBean. Have you guys test the preformance of JmsItemReader? How many message can it GET in one sec?
    thx


  7. JmsItemReader is nothing but a thin wrapper around a JmsTemplate, so I don't think Spring Batch is going to prevent you from improving your throughput. But as with all use of JMS and Spring, if performance is important you need to be careful with how your JMS provider supplies Connections and MessageConsumers: the latter in particular probably need to be cached. If you are using Spring 2.5.6 then the CachingConnectionFactory will do it, but I'd be surprised if MQ doesn't provide the same features natively.


  8. thank you for the quick reply. i found the spring jms forum is pretty abandon sorry i post here again.
    I found two issues, I use JmsItemReader and setup a pool for taskExecutor but i dont see 10 readers running.
    in 2.0 API, JmsTransactionManager, it seems like I only get one connection and one session.
    This transaction strategy will typically be used in combination with SingleConnectionFactory, which uses a single JMS Connection for all JMS access in order to avoid the overhead of repeated Connection creation, typically in a standalone application. Each transaction will then share the same JMS Connection, while still using its own individual JMS Session.

    transactionManager is required for this simple step bean. what is the best transaction manager for jms reader?

    really really appreicated.


  9. Well, there are some Batch-related questions in there, so you can try the Batch forum (http://forum.springsource.org/forumdisplay.php?f=41) for those. Try to provide some more detail about what you want / need to do. You can also try the Spring Integration forum (most of the message-heads hang out there now).

Leave a Reply