Thursday, April 30, 2009

SOA/BPM performance best practices (Process Engine configuration)

Different process engines have different tuning options. This post discusses some relevant options for WebSphere Process Server (WPS).

Thread pool sizes
While long-running business processes spend most of their lifetime in the default thread pool (JMS based navigation) or the WorkManager thread pool (WorkManager based navigation), short-running processes don't have a specific thread pool assigned to them. Depending on where the request to run a microflow comes from, a microflow runs within:
  • the ORB thread pool (e.g. the microflow is started from a different JVM via a remote EJB invocation)
  • the Web container thread pool (e.g. the microflow is started using an HTTP request)
  • the default thread pool (e.g. the microflow is started using a JMS message)
If microflow parallelism is not sufficient, examine your application and increase the respective thread pool. The key is to maximize the concurrency of the end-to-end application flow until the available processor or other resources are maximized.
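
These pools can be resized in the administrative console or via wsadmin scripting. The following is a minimal wsadmin (Jython) sketch; the cell, node, and server names as well as the pool sizes are examples and need to be adapted to your topology and measured concurrency needs.

    # Enlarge the thread pools a microflow may run in (example values).
    server = AdminConfig.getid('/Cell:myCell/Node:myNode/Server:server1/')
    for pool in AdminConfig.list('ThreadPool', server).splitlines():
        name = AdminConfig.showAttribute(pool, 'name')
        if name in ['WebContainer', 'Default']:
            AdminConfig.modify(pool, [['minimumSize', '10'], ['maximumSize', '50']])
    # The ORB thread pool is configured as part of the ORB service.
    orb = AdminConfig.list('ObjectRequestBroker', server)
    AdminConfig.modify(AdminConfig.showAttribute(orb, 'threadPool'),
                       [['maximumSize', '50']])
    AdminConfig.save()

Increase the sizes only until the processors (or other resources) are saturated; oversized pools merely add context-switching overhead.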

Navigation mechanisms for long-running flows
The business process engine in WPS processes long-running flows using a number of chained transactions. Since V6.1 there are two types of process navigation techniques in WPS: JMS based navigation and WorkManager based navigation. Both types of navigation provide the same quality of service. The default is JMS based navigation. In lab tests, WorkManager based navigation has shown throughput improvements of up to 100%.

However, the behaviour of the system changes a bit. If JMS based navigation is used, there is no scheme (for example age-based or priority-based) to process older or more highly prioritized business process instances first. This makes it hard to predict the actual duration of single instances, especially on a heavily loaded system. With WorkManager based navigation, instances that are currently being processed continue to be navigated as long as there is outstanding work for them. While this is quite efficient, it favours already running process instances over new ones.
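
WorkManager based navigation is switched on in the Business Flow Manager configuration. Purely as an illustration, a wsadmin (Jython) toggle could look like the sketch below; both the configuration type ('BusinessFlowManager') and the property name ('workManagerNavigation') are placeholders, not verified names, so look up the exact names for your WPS version in the InfoCenter before using anything like this.

    # Placeholder sketch -- the type and property names below are NOT the
    # verified configuration names; consult the WPS documentation.
    bfm = AdminConfig.list('BusinessFlowManager')
    AdminConfig.modify(bfm, [['workManagerNavigation', 'true']])
    AdminConfig.save()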

Resource dependencies
Start by adjusting the parameters for the SCA and MDB activation specifications and the WorkManager threads (if used), and then continue down the dependency chain as depicted below:

All those engine threads on the left side of the picture require JDBC connections to databases, and it helps the throughput of the system considerably if they don't have to wait for database connections from the related connection pools.
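
A minimal wsadmin (Jython) sketch of such a chain, assuming the default messaging provider; the activation spec name ('MyModuleAS'), the data source name ('BPEDataSourceDb2'), and the numbers are examples only:

    # 1. Raise the concurrency of an SCA/MDB activation specification
    #    ('maxConcurrency' is a property of default-messaging activation specs).
    spec = AdminConfig.getid('/J2CActivationSpec:MyModuleAS/')
    for prop in AdminConfig.list('J2EEResourceProperty', spec).splitlines():
        if AdminConfig.showAttribute(prop, 'name') == 'maxConcurrency':
            AdminConfig.modify(prop, [['value', '20']])
    # 2. Size the connection pool of the BPE data source so the threads above
    #    do not queue up waiting for connections.
    ds = AdminConfig.getid('/DataSource:BPEDataSourceDb2/')
    pool = AdminConfig.showAttribute(ds, 'connectionPool')
    AdminConfig.modify(pool, [['maxConnections', '60']])
    AdminConfig.save()

A simple rule of thumb is to give each pool at least as many connections as there are threads that may request one concurrently.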

Other process engine tuning knobs
Business process engine environments have means to record what is happening inside, either for monitoring purposes or for problem determination. Any such recording requires a certain amount of processing capacity, so one should minimize monitoring recordings and make sure that problem determination traces are disabled during normal operations.
Some of the data stores in out-of-the-box configurations may default to simple file-based, single-user databases like Derby. This eases simple setups, since some database administration tasks can be avoided. When performance and production-level transactional integrity are more important, it is advisable to place these data stores on production-level database systems. Throughput can improve by a factor of 2 to 5.
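
A quick way to spot Derby-backed data sources is a wsadmin (Jython) loop like the sketch below; actually moving a store to a production database then follows the product's documented steps.

    # List data sources whose JDBC provider still points to Derby.
    for ds in AdminConfig.list('DataSource').splitlines():
        provider = AdminConfig.showAttribute(ds, 'provider')
        providerName = AdminConfig.showAttribute(provider, 'name')
        if providerName.find('Derby') != -1:
            print AdminConfig.showAttribute(ds, 'jndiName'), '->', providerName
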
When using WPS' common event infrastructure (CEI) for recording business-relevant events, it might help to disable the CEI data store within WPS, since CEI consumes these events and stores them in its own database. Also, validation of CEI events can be turned off once it has been verified that the emitted events are valid.
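
Disabling the data store is a checkbox ('Enable event data store') on the event service settings page of the administrative console. Turning off event validation is commonly done through a JVM custom property; the property name in the sketch below is an assumption on my part, so verify it against the documentation for your release.

    # wsadmin (Jython): add a JVM custom property intended to bypass CEI event
    # validation. The property name is unverified -- check your WPS release.
    server = AdminConfig.getid('/Cell:myCell/Node:myNode/Server:server1/')
    jvm = AdminConfig.list('JavaVirtualMachine', server).splitlines()[0]
    AdminConfig.create('Property', jvm,
        [['name', 'com.ibm.events.configuration.bypass.validation'],
         ['value', 'true']])
    AdminConfig.save()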

Monday, April 6, 2009

SOA/BPM performance best practices (Deployment and application packaging)

Current SOA runtime environments are usually implemented on J2EE based application servers. Business applications are deployed to these environments as one or more SCA modules.

Using common object libraries
Often a set of such business applications shares common definitions for data and interfaces, and classical application design considerations suggest putting such common objects into a common object library module.
Some current SOA runtime environments, however, treat such packaging schemes in a perhaps unexpected manner. The module-specific class loader for module one finds a reference to another module (the common object library) and loads that module. When the module-specific class loader for module two then loads its module's code and finds a reference to another module (again the common object library), it loads that module as well. This continues for all the application modules. Because the platform's application class loaders have no knowledge of what other application-level class loaders have already loaded, a shared object library module is actually shared "by copying". And if the memory footprint of a single copy of that shared object library is several hundred megabytes, the available heap space can be exhausted pretty fast.
Such excessive memory usage does not necessarily have a performance impact, but when the JVM's garbage collection takes place, the responsiveness of the affected applications can suffer dramatically.
A less memory-consuming approach in this case is to define one library module per application module. While this approach definitely requires more development and packaging effort, it can considerably reduce the overall memory requirement. Reductions of up to 57% have been observed.

Modularization effects
One of the internally used benchmark workloads is organized as three SCA modules because that seemed the most natural model of a production application implementation: one module creates the business events and one consumes them, while the module in between contains the business logic responsible for synchronization. For ease of code maintenance, an SCA developer may be tempted to separate an application into even more modules. In order to demonstrate the performance costs incurred by modularization, the benchmark's three modules were reorganized into two modules and, alternatively, into a single SCA module.

Measurements indicate a throughput improvement of 14% in the two-module version and of 32% in the one-module version, both relative to the original workload implementation. As might be expected, data sharing among SCA components is more expensive across modules than it is for components within the same module (where additional optimizations are available).