Spring Batch

Introduction
I've been working hard with a couple of clients on a new product called Spring Batch. The aim is to provide tools and applications to support bulk processing in an enterprise environment. Spring Batch is part of the Spring Portfolio with an initial release in the Spring 2.1 release train.
The original impetus to build some prototype code actually came independently from a number of Interface21 clients. This provides some useful additional detail and some constraints on the implementation so that it can be applied to the real-world problems posed by the clients. I hope that this article will stimulate some more interest and provide feedback on the general approach.
Rod Johnson will be presenting at JavaOne on Spring Batch, along with our partners from Accenture. If you are lucky enough to be there you will see some of the details and the thinking behind the product. Here I will show some of the details of the infrastructure layer of Spring Batch that won't be covered in the presentation.
The source code will be published in subversion as soon as I figure out how to get all the artifacts together (web site, JIRA, continuous integration etc.). I also plan to blog a couple more times with more details of the way the product is being designed. There is a mailing list for people who are interested in following the release process as we move towards a 1.0 release. To sign up to the list go to the list information page.
Spring Batch Infrastructure
The initial release provides some low level tools to support the other parts of the product. We call these the Spring Batch Infrastructure.
One of the goals of the infrastructure is to provide a declarative or semi-declarative approach to optimisation of bulk processing generally. This includes the ability to batch operations together, and to retry a piece of work if there is an exception. Both requirements have a transactional flavour, and similar concepts may be relevant (propagation, synchronisation). They also both lend themselves to the template programming model common in Spring, cf. TransactionTemplate, JdbcTemplate, JmsTemplate.
The core interfaces in the framework are BatchOperations and BatchCallback, and the main implementation of BatchOperations is BatchTemplate. Example usage:
template.setCompletionPolicy(new FixedChunkSizeCompletionPolicy(20));
template.iterate(new BatchCallback() {
    public boolean doInIteration(BatchContext context) {
        // Do stuff in batch…
        return true; // Return false signals that data are exhausted
    }
});
The callback is executed repeatedly, until the completion policy determines that the batch should end, in this case after 20 operations. In real operations this would be wrapped in a transaction using the normal Spring transaction management framework.
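To make the loop behind that description concrete, here is a minimal self-contained sketch of a template that calls a callback until a completion policy says stop or the callback reports that the data are exhausted. The types Callback, CompletionPolicy and Template are hypothetical stand-ins written for this illustration; they are not the Spring Batch classes, and the real BatchTemplate may well work differently.

```java
class BatchLoopSketch {
    /** Hypothetical callback: returns false when the data are exhausted. */
    interface Callback {
        boolean doInIteration(int iteration);
    }

    /** Hypothetical policy: decides when the batch should end. */
    interface CompletionPolicy {
        boolean isComplete(int iterationsSoFar);
    }

    /** Hypothetical template that owns the loop, so the caller only writes the callback. */
    static final class Template {
        private final CompletionPolicy policy;

        Template(CompletionPolicy policy) {
            this.policy = policy;
        }

        /** Runs the callback repeatedly and returns the number of iterations executed. */
        int iterate(Callback callback) {
            int count = 0;
            while (!policy.isComplete(count)) {
                boolean more = callback.doInIteration(count);
                count++;
                if (!more) {
                    break; // the callback signalled that there is no more data
                }
            }
            return count;
        }
    }

    public static void main(String[] args) {
        // A fixed chunk size of 20, mirroring the example above.
        Template template = new Template(n -> n >= 20);
        int executed = template.iterate(i -> true);
        System.out.println("Iterations executed: " + executed);
    }
}
```

Keeping the loop in the template and the stopping rule in a separate policy means the chunk size can change without touching the business logic in the callback.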
The Spring Batch infrastructure also provides an API for automatic retry of a business operation. This is independent of the batching support, but will often be used in conjunction with it. The central interfaces in this case are RetryOperations and RetryCallback, and the main implementation of RetryOperations is RetryTemplate.
Example usage:
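As a rough illustration of the retry idea only, the following self-contained sketch retries a unit of work a fixed number of times. SimpleRetry and its execute method are hypothetical names invented for this sketch; they are not the Spring Batch RetryTemplate or RetryCallback, whose API was still subject to change at the time of writing.

```java
import java.util.function.Supplier;

class RetrySketch {
    /** Hypothetical stand-in for a retry template with a fixed attempt limit. */
    static final class SimpleRetry {
        private final int maxAttempts;

        SimpleRetry(int maxAttempts) {
            if (maxAttempts < 1) {
                throw new IllegalArgumentException("maxAttempts must be at least 1");
            }
            this.maxAttempts = maxAttempts;
        }

        /** Calls the work until it succeeds or the attempts are used up. */
        <T> T execute(Supplier<T> work) {
            RuntimeException last = null;
            for (int attempt = 1; attempt <= maxAttempts; attempt++) {
                try {
                    return work.get();
                } catch (RuntimeException e) {
                    last = e; // remember the failure and try again
                }
            }
            throw last; // attempts exhausted: rethrow the most recent failure
        }
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // The work fails twice with a simulated transient error, then succeeds.
        String result = new SimpleRetry(3).execute(() -> {
            if (++calls[0] < 3) {
                throw new IllegalStateException("transient failure");
            }
            return "ok";
        });
        System.out.println(result + " after " + calls[0] + " calls");
    }
}
```

In a transactional setting each attempt would typically run in its own transaction, so the retry loop sits outside the transaction boundary rather than inside it.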
Optimising Transactional Pipeline Processing
Once the product is released, the infrastructure can be used immediately to simplify batch optimisations and automatic retries. The framework is oriented around application developers not needing to know any details of the framework - there are a few application developer interfaces that can be used for convenient construction of data processing pipelines, but apart from that we aim to support as close to a POJO programming model as is practical. This is similar to the approach taken in Spring Core in the area of transaction management and DAO implementation.
Spring Batch Container Layer
The design vision for Spring Batch is that the infrastructure can be used to implement a class of process-oriented batch applications in what we call the Spring Batch Container Layer. The first container to be published is a bulk-processing application, using the infrastructure in its implementation. This so-called "simple batch execution container" will provide robust features for traceability and management of a batch lifecycle. A key goal is that the management of the batch process (locating a job and its input and results, starting, scheduling, restarting) should be as easy as possible for a non-developer, like an application support team with some business back up. People tell me this is sacrilege, but I like to think of this as an "ETL" tool (Extract Transform and Load). At least ETL is what the container is literally doing, even if it doesn't fit with some people's notion of traditional ETL. The Spring programming model is perfect for this kind of problem - write your POJO that knows how to locate and process a single item of data, and let the framework do the plumbing.
Watch this space for more blogs on the Container Layer, on the domain concepts, ubiquitous language and design details.
Future Directions
In addition to the simple container, we also want to provide an extension which can take an input source, partition it into sub-ranges, and process those concurrently. A common application of this will be to put the concurrent processing behind a remote proxy, such as an EJB or web-service. All the concurrent sub-processes are able to identify themselves individually, show statistics and restart from the last known good record after an error. They are also able to aggregate their reportable details up to the parent process to give an operator a single view of a parallel job. The same business logic implemented as a logical unit of work can be used as in the simple container. The difference is only in the configuration - the Spring programming model again at its best.
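The partitioning idea can be sketched with nothing more than the JDK. The example below splits an index range into contiguous sub-ranges, processes each on its own thread, and aggregates the per-partition counts into a single total for the parent. The class and method names (PartitionSketch, partition, processAll) are hypothetical and chosen for this illustration; the planned Spring Batch extension is not shown here and may look quite different.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class PartitionSketch {
    /** Splits [0, total) into at most `parts` contiguous {start, end} ranges (end exclusive). */
    static List<int[]> partition(int total, int parts) {
        if (parts < 1) {
            throw new IllegalArgumentException("parts must be at least 1");
        }
        List<int[]> ranges = new ArrayList<>();
        int size = (total + parts - 1) / parts; // ceiling division
        for (int start = 0; start < total; start += size) {
            ranges.add(new int[] {start, Math.min(start + size, total)});
        }
        return ranges;
    }

    /** Processes every range concurrently and returns the aggregated item count. */
    static int processAll(int total, int parts) {
        ExecutorService pool = Executors.newFixedThreadPool(parts);
        try {
            List<Future<Integer>> results = new ArrayList<>();
            for (int[] range : partition(total, parts)) {
                results.add(pool.submit(() -> {
                    int done = 0;
                    for (int i = range[0]; i < range[1]; i++) {
                        done++; // placeholder for the per-item business logic
                    }
                    return done;
                }));
            }
            int sum = 0;
            for (Future<Integer> f : results) {
                sum += f.get(); // aggregate each partition's count up to the parent
            }
            return sum;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("partition failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("Items processed: " + processAll(1000, 4));
    }
}
```

The per-item logic is the same whether one thread or several run it; only the way the input is divided changes, which is the point made above about configuration.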
Matt Welsh's work shows that SEDA (staged event-driven architecture) has enormous benefits over more rigid processing architectures, and messaging containers give us a lot of resilience out of the box. So we want to provide a more SEDA-flavoured container, or container support, as well as supporting the more traditional approach. There might be a tie-in with Mule and/or other ESB tools here, giving the benefit of a very scalable architecture, where the choice of transport and distribution strategy can be made as late as possible. The same application code could be used in principle for a standalone tool processing a small amount of data, and a massive enterprise-scale bulk-processing engine.
Michael Mayr says:
Added on May 7th, 2007 at 3:57 pm
All I can say is: WOW! Batch Processing is really a market niche with no major open source software. I expect Spring Batch to be a huge success if it has the same quality level the Spring Framework has - plus the "Spring" name on it.
Anthony Smith says:
Added on May 8th, 2007 at 12:27 am
Hello Dave,
Am VERY interested in your Spring Batch Framework, as we have developed something similar for a client's project. The core features of our Spring based Batch Framework are:
- scheduling using quartz or JMX (start, stop)
- long running batch processes
- execution order of batch processes
- transaction management
- transparent tracing and logging
- monitoring and batch management using JMX
- environment checks
Davide Baroncelli says:
Added on May 8th, 2007 at 2:44 am
Batch processing is definitely an area where Spring can help: we used it extensively in the past exactly for ETL activities, and had to develop some sort of infrastructure based on Spring. Another area where your project could help is something we've done in the last few weeks: an integrated jasperreports solution which massively generates a number of reports/forms/checklists meant for print and manual filling in our stores. For this we refactored the Spring jasperreports support (which is web-only) making it more independent and usable in a batch context.
Robert Fischer says:
Added on May 8th, 2007 at 8:21 am
I've been working on a library that provides low-level multithreaded pipelining, and it was designed with an eye towards Spring. I'm currently publishing it under the name "JConch" over at GoogleCode. I'd like to see more about Spring Batch and how I could integrate with it.
The start of the code is in a browsable SVN repository at http://jconch.googlecode.com/svn/trunk/src/jconch/pipeline/
The meager project home page is at http://code.google.com/p/jconch/
Mike Monette says:
Added on May 8th, 2007 at 9:40 am
Dave,
This is exciting news! I'm glad to see someone taking the lead to produce a consistent framework for batch Java processing.
I see talk about support for job restart, but I haven't yet seen any mention of an integrated process checkpoint scheme. Do you have plans to offer infrastructure-level support for checkpoints?
Also, I'm interested in what kind of support you plan to offer for complex job streams, i.e., sequences of batch jobs with conditional execution.
Dave Syer (blog author) says:
Added on May 8th, 2007 at 9:48 am
Thanks for the feedback everyone. Feel free to use the forum as well at http://forum.springframework.org/forumdisplay.php?f=41.
We are providing quite a lot of infrastructure to support "checkpoints" or "savepoints" in file-based streams. Database input / output can handle the requirements more naturally, and with more guarantees about transactional semantics. The framework strategises with the concepts of Restartable and RestartData, so I guess that adds up to "an integrated…scheme".
Conditional execution is limited in the first container implementations. A job is just a sequence of steps (each of which can decide whether to execute or not). In the future there will undoubtedly be more support for conditional execution if the demand is there.
Sabarish Sasidharan says:
Added on May 9th, 2007 at 2:00 pm
Batch infrastructure is one of those pieces for which developers end up writing one framework after another for each and every application they work on. I don't know why, even in big enterprises, a common framework doesn't usually get adopted. So it's great to see a common framework that would save all that precious time (and the time spent on designing it the right way). When I heard the Spring Batch announcement, I guessed it would have facilities to apply processing logic using a callback. I also expected it to allow us to split files and then feed the chunks to different threads that would run our batch processing logic. I am glad to hear those features are all in there!
Sabarish Sasidharan says:
Added on May 9th, 2007 at 2:09 pm
Another feature that would be nice to have:
Support for data binding, i.e. transforming a row of data from a file into a Java object, something similar to what Castor does with XML (but using field separators or char positions instead). Unless there's already a lib doing that out there.
Dave Syer (blog author) says:
Added on May 12th, 2007 at 7:03 am
Binding lines (or sections of input streams) to Java objects is one of our infrastructure features. The first release will support line-oriented flat file record structures (delimited or fixed length), and XML object mapping using XStream. That probably covers 90% or more of the cases we have seen in live systems over the years. More options will undoubtedly be supported in the future.
Thanks for all the feedback, everyone: it is really important to us to make the product fit the needs of clients, so keep it coming. We have several early adopters using the tool in a pre-release stage and they have all been able to make valuable contributions. When the product is publicly available and you can all get hold of the source code and start using it we expect that high-quality feedback such as we are seeing here will be very important to the direction of the development effort.
Dan says:
Added on May 15th, 2007 at 10:38 am
Is any additional info available regarding the line-oriented flat file record structures? I'm currently working on something similar, and would rather not re-invent the wheel.
John Walsh says:
Added on May 16th, 2007 at 9:35 am
Would be great to have support for writing/reading Cobol copybook formats. We use Websphere Message Broker for marshalling between formats today and it is great but fairly complex and of course, expensive ;o)
Dave Syer (blog author) says:
Added on May 18th, 2007 at 11:56 am
The line-oriented stuff is undergoing quite a bit of revision at the moment, so the javadocs on the website will most likely not be up to date. We don't want to go overboard with features or design complexity at the moment; we would prefer to get feedback from users before committing to anything very heavyweight. So it won't be rocket science, at least to start with.
Does anyone know a good Apache / LGPL / similar licensed tool for dealing with Cobol copybook specifically?
Keith says:
Added on May 29th, 2007 at 10:00 am
We have used the following copybook to xml package that may be of interest to you:
http://sourceforge.net/projects/cb2xml/
Dave Syer (blog author) says:
Added on May 29th, 2007 at 12:58 pm
Awesome, thanks. I will take a closer look.
Mark Nuttall says:
Added on May 31st, 2007 at 2:07 pm
I had glossed over this before. One of my coworkers mentioned he was watching the TSS videos so I came back to take a look.
I think this is exactly what I have been looking for for quite a while. While there are Java based ETL tools, there is nothing that allows me to map to POJO's (not that I could find). I've asked at least one developer of one of those tools how that might occur and he didn't understand why or what I was trying to get at.
I guess my only question is how is performance? ETL tools use low level things like CLI to gain performance.
Dave Syer (blog author) says:
Added on June 1st, 2007 at 4:57 am
Glad to hear that a POJO approach is appealing.
As far as performance goes we may be limited by JDBC, but we have collective experience of many projects that use Java batch processing and find ways to meet their target performance levels. Needless to say we will be thinking hard about performance, and of course trying to measure it in realistic test cases. I guess once we have a benchmark it would be interesting to compare lower level approaches.
By CLI you mean the DB2 "Command Level Interface" (is that what it's called)?
Mark Nuttall says:
Added on June 1st, 2007 at 9:55 am
By CLI, I was thinking DB2. That might not be exactly what they are using. But they are doing low level stuff to get performance.
For example, a few years ago I was modifying a process that used DTS. I tried using ADO to perform the new step but it was horrendously slow. I ended up creating a dynamically created DTS step and letting the DTS process do that work. It was light years faster. So obviously it was doing something low level.
Frank Vilhelmsen says:
Added on June 6th, 2007 at 4:37 am
I'm very glad to see this kind of framework. And I need it now!
At the moment I am working in a bank and we have to start some of the batch jobs with legacy systems like, e.g., OPC.
Is there a solution for external schedules?
When can we download alpha versions?
Vasudev Ram says:
Added on June 13th, 2007 at 2:49 pm
This sounds like a very interesting and useful project.
Good luck with it! I blogged about it here:
http://jugad.livejournal.com/2007/06/09/
(2nd post on that page).
Great!
Vasudev Ram
http://www.dancingbison.com
Jose Hernandez says:
Added on June 14th, 2007 at 6:03 am
Hi,
This sounds very good. Are you planning Grid Computing support for the project?
We're experts in grid computing, and I think it would be very useful to abstract from grid computing in the development, following the Spring Batch philosophy.
Aj says:
Added on June 14th, 2007 at 9:41 am
Do you have an expected time frame for when it will be released?
Jason says:
Added on June 14th, 2007 at 9:14 pm
Spring Batch sounds perfect for a financial transaction processing thing I have to build. I'd like to know if there is intended to be support for a delay, or custom trigger handler as part of the retry policy? I'm not clear how that bit is intended to work, but in my case if a transaction fails I want to retry in 10 mins, then 20, then an hour for a total of 24 hours before finally giving up and accepting the problem as an error. Will this scenario be possible with Spring Batch?
Dave Syer (blog author) says:
Added on June 15th, 2007 at 1:55 am
This retry scenario is pretty common, and one of the intended uses of the RetryTemplate. Most of the API is still subject to change, but this bit is pretty stable (http://static.springframework.org/spring-batch/spring-batch-infrastructure/apidocs/index.html). You could even implement your use case declaratively with an AOP interceptor. You only have to be careful to put the retry around the transaction boundary (not inside it).
Mats Henricson says:
Added on June 23rd, 2007 at 4:43 pm
Great! However, I see this as a failure for Quartz. I've used it for about 5 months now, and I had to twist its arms to get it to do pretty basic things. Apparently they're hiding Quartz behind this new Batch framework, and that sounds like a great idea.
Ares says:
Added on June 28th, 2007 at 8:07 pm
Awesome! I've been working with Batch Processing for a year in the financial industry. I've implemented it using Microsoft SQL Server Agent. Maybe it sounds stupid, but it works :p
I want to ask you, Dave: are there any differences between batch processing that runs at the application level and at the database level? Which one is the best practice?
Now I'm working with .NET technology, and as my previous project used the Spring Framework for .NET, I hope Spring Batch gets its .NET version too.
Guenni says:
Added on August 12th, 2007 at 4:42 am
It is awesome, but as of 08-12-07 it is very hard to tell the current status of Spring Batch.
Is there some sort of release available, how active is the project?
I sure would like to get my hands on it ASAP.
Dave Syer (blog author) says:
Added on August 12th, 2007 at 6:24 am
The project is very active, with quite a few early adopters, just not quite public. We will be publishing the code in SVN on sourceforge very soon.
Guenni says:
Added on August 12th, 2007 at 8:05 am
Dear Dave,
how soon is soon? The reason why I'm asking is we are just about to try to develop our own, and since that task is left with poor me, I'd prefer I didn't have to, but could tell my boss to wait just a little with that.
Could I become an early adopter in case it will be some more weeks / months instead?
Dave Syer (blog author) says:
Added on August 12th, 2007 at 8:11 am
Definitely not weeks or months until the code is public - I've been saying "next week" for several weeks now though. You can follow the news with all the other frustrated potential users on the forum (http://forum.springframework.org/forumdisplay.php?f=41).
VJ says:
Added on October 10th, 2007 at 5:53 pm
Dave: Any good news for us? I am working on a batch process. We are very much interested in using this. I saw the demo at JavaOne, so I know this will solve my problem, and I am waiting to hear good words…
Dave Syer (blog author) says:
Added on October 11th, 2007 at 1:29 am
The code has been out for quite some time, and there was a milestone release a couple of weeks ago - there were announcements on the forum and on the mailing list. Sorry I forgot to post back here. See the website for more details and updates (http://static.springframework.org/spring-batch/).
Lalit says:
Added on October 25th, 2007 at 4:11 am
Dave,
I have seen some inhouse batch frameworks in my organization and really appreciate the way Spring Batch framework is thought of (Core, Environment, Framework and Application Layers).
I am analysing the feasibility of using Spring Batch in our project, which is very load intensive (it should handle 3 million records). I would be interested to know when (or in which version of Spring Batch) you are planning to release the "Partitioned Batch Execution Environment", which looks to be more suitable for our requirement. It would be great to have that in version 1.1.
I am still looking into the documentation (btw, the m2 documentation needs some rework to improve the readability) and will post more messages once I go through the Interfaces and Sample application.
Thanks,
Lalit
Dave Syer (blog author) says:
Added on October 25th, 2007 at 5:35 am
Thanks for the feedback. Partitioning is still planned for 1.1.
If you want to post more messages, it might be better to use the forum (http://forum.springframework.org/forumdisplay.php?f=41) because more people will see that.
If you have specific comments on the documentation, please post those as JIRA issues (http://opensource.atlassian.com/projects/spring/browse/BATCH).
Jean-Rémi says:
Added on November 5th, 2007 at 9:35 am
Hi Dave,
I'm a tech leader at IBM France and want to know if there is a roadmap for Spring Batch. May we hope for a production version for large amounts of data (>10 million transactions) in a few months (3-4 months max)?
Jean-Rémi
Dave Syer (blog author) says:
Added on November 5th, 2007 at 12:42 pm
The roadmap is pretty much what you get from JIRA (http://opensource.atlassian.com/projects/spring/browse/BATCH). We are being cautious with dates, but 1.0 should be out in that sort of timescale. We already have some pretty high volume projects using existing (milestone) releases actually. If you are willing to do a bit of customisation and some work to upgrade when new releases come out you could probably get something scaling almost arbitrarily well with the right middleware (we will add this kind of support natively later). I would certainly like to help out if you would contact me directly (dsyer at interface21 dot com - details on the website as well).
Snehal says:
Added on November 13th, 2007 at 9:13 am
Hey Dave,
There is an interesting discussion about Spring Batch and the landscape of batch computing generally taking place on a theserverside.com thread:
http://www.theserverside.com/news/thread.tss?thread_id=47506
There seems to be some confusion about the role of various batch technologies like Flux, Quartz, Spring Batch, WebSphere XD Compute Grid, Tivoli Workload Scheduler, Control-M, Zeke, and so on. I'd like to know your thoughts, and the thoughts of this community, on my following post (copied below):
http://www.theserverside.com/news/thread.tss?thread_id=47506#242521
"To have a meaningful discussion about the technologies within the batch domain, I think that we need to clearly lay out the layers of batch processing. There are four of them: Schedulers, Batch Execution Environments, Batch Application Containers, and the actual batch applications.
1. Schedulers; schedulers manage job dependencies, resource dependencies, scheduled submissions, and some form of job lifecycle and execution management. Quartz, Flux, and other such open source schedulers provide time-based scheduling and some form of dependency management. Tivoli Workload Scheduler, Control-M, Zeke, and other schedulers however provide more scheduling features and are typical products found at bigger customer shops. These shops have built complete batch infrastructures around the scheduler including security models, auditing mechanisms, archiving, and so on.
2. Batch Execution Environments (BEE); they host batch application containers, and provide features like: transaction management, checkpointing, recoverability, security management, connection management, scalability, high availability, output processing, and so on; the inherent qualities of service and integration with existing schedulers are provided by the BEE. XD Compute Grid Delivers a BEE.
3. Batch Application Containers provide a well-formed invocation model for the business logic. The container manages the lifecycle of the application and gives control to the underlying transaction manager, security manager, etc as needed. XD Compute delivers a batch application container. I would argue that Spring Batch is a Batch application container too. Spring Batch doesn’t provide a transaction manager, security manager, explicit high availability, and so on, but it does allow them to be injected into the container and therefore available to the application.
4. Batch Applications that implement the actual business logic and run within a batch application container. Nothing special to discuss right now, perhaps portability among containers in the future"
Thanks,
Snehal Antani
Parag Teredesai says:
Added on December 16th, 2007 at 6:29 pm
When I started work on a batch application one of the things I debated was whether we "NEED" to run in a J2EE container.
What is your view on this?
Some points to consider:
1) Most J2EE is geared towards online apps where the attempt is to use 'N' resource connections to handle demand which is many times 'N', so we generally write/use code to getConnection to a resource, use it, and return it to the pool as quickly as possible.
In batch perhaps it's more efficient to grab the N connections or so and use them for the life of the application?
2) For J2EE we let the container handle threading and scaling.
In batch we control scaling by changing configuration before the run starts: if the batch size is larger, we start with more threads and more connections upfront, for example.
Obviously this is far from an exhaustive list, but what are the thoughts on this and other strategies?