Monday, 15 February 2010

Managing Data and System Integrity in an SOA environment. Are You Prepared?

One thing that worries me a bit is that you hardly find any discussion of data and system integrity in an SOA environment. At least, I don't see much. To me, data and system integrity is the most challenging issue we face today and in the days to come. We will see more and more combinations of services, SaaS and legacy apps, which will make a 'standard solution' to this problem even more relevant.

Lots of vendors I talk to tend to minimize the problems of integrity and robustness. They point at their infrastructure and say: well, our infrastructure is WS-Transaction compliant, you will not lose any messages. OK, that may be true, but will it fix failures? You can guess the next comment: with our high-availability strategy, we can ensure 99.99% availability, so you need not worry.

Despite their reassurances, I still tend to worry: 99.99% availability is not 100%. Things can still (and will!) go wrong.

Let me try to explain the issue as I see it. A very simplistic example:

Service A calls Service B to execute a process. During execution of Service B it calls upon Service C to handle financial details.

Suppose Service B needs to be restored to a certain point in time because of an internal failure (does it really matter what caused the need for point-in-time recovery? I think it does ...). What does this mean for Service A, B and C? What kind of functionality do we need to have in place to make sure the entire system will not lose its integrity? How do I make sure that Service B 'catches up' with Service A and C? One might argue that due to the statelessness of a service, this shouldn't be a problem, but it is (besides the fact that there are loads of stateful services out there).
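To make the example a bit more concrete, here is a minimal Python sketch of the scenario. Everything in it is an assumption for illustration: the names `ServiceB`, `process` and `find_integrity_gaps` are made up, and in-memory sets stand in for the real state of Services A, B and C. It only shows how a point-in-time restore of B leaves A and C holding records that B no longer knows about; it does not fix anything.

```python
from dataclasses import dataclass, field


@dataclass
class ServiceB:
    """Hypothetical stateful service that remembers the orders it processed."""
    orders: set[str] = field(default_factory=set)

    def snapshot(self) -> set[str]:
        # Copy of the current state, standing in for a backup.
        return set(self.orders)

    def restore(self, snapshot: set[str]) -> None:
        # Point-in-time recovery: everything after the snapshot is gone.
        self.orders = set(snapshot)


def find_integrity_gaps(
    a_sent: set[str], b_known: set[str], c_charged: set[str]
) -> dict[str, set[str]]:
    """Compare what each service believes has happened."""
    return {
        # A had these acknowledged, but B no longer has a record of them.
        "lost_in_b": a_sent - b_known,
        # C handled the finances for work that B no longer knows about.
        "orphaned_in_c": c_charged - b_known,
    }


a_sent: set[str] = set()     # requests Service A considers done
c_charged: set[str] = set()  # financial records held by Service C
b = ServiceB()


def process(order_id: str) -> None:
    # A calls B, and B calls C for the financial details.
    a_sent.add(order_id)
    b.orders.add(order_id)
    c_charged.add(order_id)


process("order-1")
backup = b.snapshot()   # the point in time B will be restored to
process("order-2")
process("order-3")

b.restore(backup)       # internal failure in B, roll back to the backup
gaps = find_integrity_gaps(a_sent, b.orders, c_charged)
print(sorted(gaps["lost_in_b"]))      # ['order-2', 'order-3']
print(sorted(gaps["orphaned_in_c"]))  # ['order-2', 'order-3']
```

In a real landscape you cannot simply read the state of A and C like this, certainly not when they belong to someone else. That is exactly why the comparison has to be designed in beforehand.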

This used to be no problem in our legacy application environment. You just rolled back the whole system and started all over again. However, our boundaries have become much smaller and larger at the same time. Rolling back is still a valid approach within a service boundary, but not in an SOA environment (which has no clear boundaries to begin with), especially when you use services that might not be under your control (SaaS vendors, chain partners, etc.).

My worries come from the fact that most companies I visit do not have a strategy to maintain this integrity. Mostly, they do not even acknowledge the problem until they are confronted with it in real life. Suddenly it becomes a major problem: it is very hard to determine what to do, yet there's a lot of pressure to fix it right this very minute!

What we need to keep in mind here is that it is not just a technological problem. It is functional as well. How does the business want to respond to failing (internal or external) components?

Luckily, a good middleware infrastructure goes a long way towards reducing the problem, but especially in high-volume environments you really need to have a well-thought-out and tested strategy in place. There's no falling back to fixing things manually when you're processing thousands of transactions a minute.


  1. Are you familiar with "eventual consistency"? Here's some info on how Amazon handles this:

    Is this usable for smaller SOA implementations?

  2. Mike

    I think you are right that there's no falling back to manual f-i-x. It would be too late for any company that has chosen the SOA way.

    But what about falling back to manual o-r-g-a-n-i-z-a-t-i-o-n of collaboration, instead of handing it over to IT systems, like SOA?

    I think the boring IT-business gap indicates that 'more manual' organization strategies are still an option to be taken seriously.

    Thanks, Peter

  3. Mike

    I think you are right that there is no falling back to manual f-i-x. It would be too late for companies that have chosen the SOA way.

    But what about falling back to 'more manual' o-r-g-a-n-i-z-a-t-i-o-n of collaboration, instead of handing it over to SOA systems? I think, in view of the boring IT-business gap, this is still an option to be taken seriously.

    Thanks, Peter

  4. Hello,

    Thank you for providing insight on this potential issue, which is very relevant and something we need to be aware of. My question is: if it does occur, what should be addressed and how (with respect to Oracle Service Bus)?



  5. @Shariq: what you do need to address (beforehand) is your recovery strategy. How are you going to make sure no messages are lost? Some measures can be implemented easily, like setting up a message warehouse (on a different server), so you have at least all (or most) messages to work with. Oracle Service Bus does facilitate that, btw.

    @Peter: well, there's no going back to the old way. Just make sure you use the new ways to do things better, not to introduce new problems.

    @ako: very interesting. Seems like enough information for a sequel! Thanks.
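
The message warehouse mentioned in the reply to Shariq could look roughly like the sketch below. This is not how Oracle Service Bus does it; the classes `Message`, `MessageWarehouse` and `RestoredService` are hypothetical, the timestamps are simple counters, and a list in memory stands in for a store on a different server. The idea is only that every message is copied somewhere safe, so that a restored service can be fed the messages it missed, and that handling is idempotent so replaying too much does no harm.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    message_id: str
    received_at: int  # logical timestamp; a real store would use wall-clock time
    payload: str


class MessageWarehouse:
    """Append-only copy of every message, kept apart from the service it protects."""

    def __init__(self) -> None:
        self._messages: list[Message] = []

    def store(self, message: Message) -> None:
        self._messages.append(message)

    def since(self, restore_point: int) -> list[Message]:
        # Messages that arrived after the given point in time, in arrival order.
        return [m for m in self._messages if m.received_at > restore_point]


class RestoredService:
    """Stand-in for a service after recovery; handling is idempotent per message id."""

    def __init__(self, already_processed: set[str]) -> None:
        self.processed = set(already_processed)
        self.applied: list[str] = []

    def handle(self, message: Message) -> bool:
        if message.message_id in self.processed:
            return False  # seen before the restore point: skip it
        self.processed.add(message.message_id)
        self.applied.append(message.payload)
        return True


warehouse = MessageWarehouse()
payloads = ["create order", "add item", "confirm payment"]
for i, payload in enumerate(payloads, start=1):
    warehouse.store(Message(f"msg-{i}", received_at=i, payload=payload))

# The service was restored to a state that already included msg-1.
service = RestoredService(already_processed={"msg-1"})

# Replay from the very beginning on purpose: the duplicate msg-1 is skipped.
replayed = [m.message_id for m in warehouse.since(0) if service.handle(m)]
print(replayed)  # ['msg-2', 'msg-3']
```

Replaying only brings the restored service itself up to date. Whether the calls it makes to other services during replay are safe to repeat is a separate question, and part of the recovery strategy you need to decide on beforehand.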