Mastering Architecture: February 2010

Monday, 15 February 2010

Managing Data and System Integrity in an SOA environment. Are You Prepared?

One thing that worries me a bit, is that you hardly find any discussion on data and system integrity in an SOA environment. At least, I don't see many. To me, data and system integrity is the most challenging issue we face today and the days to come. We will see more and more combinations of Services, SaaS, legacy apps, which will make a 'standard solution' to this problem even more relevant.

Lots of vendors I talk to tend to minimize the problems of integrity and robustness. They point at their infrastructure and say: well, our infrastructure is WS-Transaction compliant, you will not lose any messages. OK, that may be true, but will it fix failures? You can guess the next comment: with our high availability strategem, we can ensure a 99,99% availability, so you need not to worry.

Despite their reassurances, I still tend to worry: 99,99% availability is not 100%. Things can still (and will!) go wrong.

Let me try to explain the issue as I see it. A very simplistic example:

Service A calls Service B to execute a process. During execution of Service B it calls upon Service C to handle financial details.

Suppose Service B needs to be restored to a certain point in time because of an internal failure (does it really matter what caused the need for in-time recovery? I think it does ...). What does this mean for Service A, B and C? What kind of functionality do we need to have in place to make sure the entire system will not lose its integrity? How do I make sure that Service B 'catches up' with Service A and C? One might argue that due to the statelessness of a service, this shouldn't be a problem, but it is (besides the fact that there's loads of statefull services out there).

This used to be no problem in our legacy application environment. You just rolled-back the whole system and started all over again. However, our boundaries have become much smaller and larger at the same time. It is still a valid approach within a service boundary, but not in an SOA environment (which has no clear boundaries to begin with). Especially as you use services that might not be under your control (SaaS vendors, chainpartners, etc).

My worries come from the fact that most companies I visit, do not have a strategy to maintain this integrity. Mostly, they do not even acknowledge this problem, until they are confronted with it in real life. Suddenly it's become a major problem, because it is very hard to determine what to do, but there's a lot of pressure to fix it right this very minute!

What we need to keep in mind here, is that it is not just a technological problem. It is functional as well. How does the business wants to respond to failing (internal or external) components?

Luckily, having a good middleware infrastructure mitigates the problem somewhat, so it is possible to reduce the problem a lot, but especially in high-volume environments you really need to have a well thought and tried-out strategy in place. There's no falling back to manually fix things when you're processing thousands of transactions a minute.