WebSphere Process Server Performance: 2009

Friday, June 19, 2009

Performance boost through disabling file system caching

In a recent engagement to troubleshoot lousy performance, colleagues of mine were able to "fix it".

Usually it is a good thing when an operating system (here: kernel and file system code) caches data in buffers at the file system level and flushes it out to disk when appropriate (and in the right chunks) to minimize the number of physical I/O operations.

When database files are placed on such file systems, however, the file system's caching algorithms can be extremely counterproductive.

In the case here, Solaris, the Veritas file system, and Oracle were involved. Re-mounting the file system with the parameter

mincache=direct

reduced the time for SQL inserts into the SIB tables from up to 11 seconds down to 0.03 seconds.

When dealing with the combination AIX, JFS2, and DB2, the DB2 command

db2 alter tablespace your-tablespace-name-here no file system caching

has more or less the same effect: data caching at the file system level is disabled and the data is persisted as fast as possible. According to the AIX documentation, the same effect can be achieved at the file system level by using "mount ... -o dio". Read performance for non-DB files on that file system might suffer, because caching is reduced. DB data will still be cached in the DB bufferpools.

Friday, May 8, 2009

SOA/BPM performance best practices (System and subsystem configuration)

This section handles some scalability related tuning knobs and some relevant tuning parameters for the involved subsystems like the JVMs and the databases.

Clustering topologies:
In order to take care of growth and workload distribution, modern business process engines can run in a clustered setup, spreading the workload across various physical nodes (horizontal scaling) or making better use of spare resources within existing nodes (vertical scaling).
For WPS, three different cluster topology patterns have been identified and described, e.g. in http://www-01.ibm.com/support/docview.wss?uid=swg27010320&aid=1 or http://www.ibm.com/developerworks/websphere/library/techarticles/0703_redlin/0703_redlin.html .

The first pattern (shown on the left) is also known as the “bronze topology”. It consists of a single application server cluster, where the WPS business applications, the support applications like CEI and BPC Explorer, and the messaging infrastructure hosting the messaging engines (MEs) that form the system integration buses (SIBus) all reside within each of the application servers that form the cluster.
This bronze topology is suitable for a solution that consists only of synchronous web services and synchronous SCA invocations, preferably with short running flows only.

The second pattern (shown in the middle) is also known as the “silver topology”. It has two clusters, the first one containing the WPS business applications and the support applications as before, but the messaging infrastructure is located in the second cluster.
This silver topology is suitable for a solution that uses long running processes but needs neither CEI, nor message sequencing, nor asynchronous deferred response, nor JMS or MQ bindings.

The third pattern (shown on the right) is called the “golden topology”. Compared to the previous patterns, the support applications are separated into a third cluster.
This golden topology is suited for all the remaining cases, where asynchronous processing plays a nontrivial role in the solution. It also provides the most “JVM space” for the business process applications that should run in this environment. If the available hardware resources allow for setting up this golden topology, then it is advisable to start with this topology pattern from the very beginning as it is the most versatile one.
What is not shown in the above figure is the management infrastructure that controls the cluster(s). It consists of node agents and a deployment manager node as the central point of administration for the entire cell these clusters belong to. A tuning tip for this management infrastructure is to turn off automatic synchronization of the node configurations. Depending on the complexity of the setup, this synchronization processing is better kicked off manually during defined maintenance windows in off-peak times.

JVM Garbage Collection
Verbose garbage collection is not as verbose as the name suggests. The few lines of information that are produced when verboseGC is turned on don't really hurt the system's performance. On the other hand, they can be a very helpful source of information when troubleshooting performance problems.
The JVM used by WPS V6.1 supports several garbage collection strategies: the Throughput Garbage Collector (TGC), the Low Pause Garbage Collector (LPGC), and the Generational Garbage Collector (GGC).
The TGC provides the best out-of-box throughput for applications running on a JVM by minimizing costs of the garbage collector. However, it has “stop-the-world” phases that can take between 100ms and multiple seconds during garbage collection.
The LPGC provides garbage collection in parallel to the JVM’s work. Due to increased synchronization costs, throughput decreases. If response time is more important than highest possible throughput, this garbage collector could be a good choice.
The GGC is new in the IBM 1.5 JVM. It is well suited for applications that produce a lot of short-lived small Java objects. As it reduces pause times it should be tried in such cases instead of the TGC or LPGC. When properly tuned, it provides the best garbage collection performance for SOA/BPM workloads. [http://www.redbooks.ibm.com/abstracts/redp4431.html]

JVM memory considerations
Increasing the heap size of the JVM of the application server can improve the throughput of business processes. However, it should be ensured that there is enough real memory available so that the operating system does not start swapping. Detailed information on JVM parameter tuning can be found in [http://www.ibm.com/developerworks/java/jdk/diagnosis/].

Database subsystem tuning
To a large degree, the performance of long running flows and/or human tasks in a SOA/BPM solution depends on a properly tuned, enterprise class database management system, in addition to the aforementioned application server tuning. This paper provides some tuning guidelines for IBM's DB2 database system as an example. Most of the rules should also be applicable to other production database management systems.
It is not advisable to use simple file-based databases like Cloudscape or Derby as a database management system for WPS other than for the purpose of unit testing.

Configuration advisor
DB2 comes with a built-in configuration advisor. After creating the database, the advisor can be used to configure the database for the usage scenario expected. The input for the Configuration Advisor depends on the actual system environment, load assumptions, etc. Details on how to use this advisor can be found in [http://www-01.ibm.com/support/docview.wss?uid=swg27012639&aid=1]. Some parameter settings in the output of the advisor should be checked and adjusted afterwards.

MINCOMMIT: A value of ‘1’ is strongly recommended. The advisor sometimes suggests other values.
NUM_IOSERVERS: The value of NUM_IOSERVERS should match the number of physical disks the database resides on, plus 2.
NUM_IOCLEANERS: Especially on multi-processor machines, enough IO cleaners should be available to make sure that dirty pages in the bufferpool are written to disk. Provide at least one IO cleaner per processor.
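
The three adjustments above can be restated as a small helper. This is only a minimal Python sketch: the function name `suggest_io_settings` and its two inputs are made up for illustration, and the returned values merely encode the rules of thumb from this post. They still have to be applied to the database configuration by the usual DB2 means.

```python
def suggest_io_settings(physical_disks: int, processors: int) -> dict:
    """Restate the rules of thumb from the text (hypothetical helper)."""
    if physical_disks < 1 or processors < 1:
        raise ValueError("need at least one disk and one processor")
    return {
        "MINCOMMIT": 1,                       # recommended regardless of the advisor's output
        "NUM_IOSERVERS": physical_disks + 2,  # number of physical disks, plus 2
        "NUM_IOCLEANERS": processors,         # lower bound: one IO cleaner per processor
    }


if __name__ == "__main__":
    # Invented example machine: 6 disks, 4 processors.
    for name, value in suggest_io_settings(physical_disks=6, processors=4).items():
        print(f"{name} = {value}")
```

Note that the NUM_IOCLEANERS value is a minimum, not a target; more cleaners may be appropriate on systems with heavy write activity.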

Database statistics
Optimal database performance requires the database optimizer to do its job well. The optimizer acts based on statistical data about the number of rows in a table, the use of space by a table or index, and other information. When the system is set up, these statistics are empty. As a consequence the optimizer usually makes sub-optimal decisions, leading to poor performance.
Therefore after initially putting load on your system, or whenever the data volume in the database changes significantly, you should update the statistics by running the RUNSTATS utility (DB2). Make sure there is sufficient data (> 2000 process instances) in the database before you run RUNSTATS. Avoid running RUNSTATS on an empty database as this will lead to bad performance.

Enable Re-Optimization
If BPC API queries (as used e.g. by the BPC Explorer) are used regularly on your system, it is recommended to allow the database to re-optimize SQL queries once, as described at [http://www-01.ibm.com/support/docview.wss?rs=2307&uid=swg21299450]. This tuning step greatly improves the response times of BPC API queries. In lab tests the response time for one and the same query has been reduced from over 20 seconds down to 300 milliseconds. With improvements of such orders of magnitude, the additional overhead for re-optimizing SQL queries should be affordable.

Database indexes
In most cases the BPM product's datastore is not shipped with all the database indexes defined that might potentially be used. In order to avoid unnecessary processing out of the box, it is much more likely that only those indexes have been defined that are necessary to run the most basic queries with an acceptable response time.
As a tuning step one can analyze the SQL statements resulting from end user queries to see how the query filters used by the end user (or in the related API call) relate to the WHERE clauses in the resulting SQL statements, and define additional indexes on the related tables to improve the performance of these queries. After defining new indexes, the above mentioned RUNSTATS action needs to be run to enable the use of the newly created indexes.
Sometimes customers are uncertain about whether they are turning their environment into an unsupported state when defining additional indexes. This is definitely not the case. Customers are even encouraged to apply such tuning steps and check whether they help. If not, they can be undone easily, e.g. by removing the index.

Further database tuning
Any decent database management system can keep its data in memory buffers called bufferpools to avoid physical I/O. Data that is in these bufferpools need not be read from disk when referred to; it can be taken from these memory buffers directly. Hence it makes a lot of sense to make these buffers large enough to hold as much data as possible.
The key tuning metric to look at is the bufferpool hit ratio, which describes the share of logical data and index reads that can be satisfied from the bufferpool without a physical read. As a rule of thumb you can increase the size of the bufferpools as long as you get a corresponding increase of the bufferpool hit ratio. A well tuned system can easily have a hit ratio well above 90%.
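
As a minimal sketch of how the hit ratio can be derived from monitoring counters: the parameter names below are modelled on DB2's bufferpool read counters for logical and physical data and index reads, the function name is made up, and the sample numbers are invented.

```python
def bufferpool_hit_ratio(pool_data_l_reads: int, pool_index_l_reads: int,
                         pool_data_p_reads: int, pool_index_p_reads: int) -> float:
    """Percentage of logical reads served from the bufferpool without physical I/O."""
    logical = pool_data_l_reads + pool_index_l_reads
    physical = pool_data_p_reads + pool_index_p_reads
    if logical <= 0 or physical < 0:
        raise ValueError("counters must describe at least one logical read")
    return 100.0 * (logical - physical) / logical


if __name__ == "__main__":
    # Invented counters: 1,000,000 logical reads, 50,000 of which went to disk.
    ratio = bufferpool_hit_ratio(800_000, 200_000, 40_000, 10_000)
    print(f"{ratio:.1f}%")  # prints 95.0%
```

Comparing this value before and after enlarging a bufferpool shows whether the additional memory actually pays off.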
WPS accesses its databases in multiple concurrent threads and uses row level locking to ensure data consistency during its transactions. As a result, a lot of row locks can be active at times of heavy processing. The related database parameters for the space where the database maintains the lock information might have to be adjusted.
For DB2 the affected database configuration parameters are LOCKLIST and MAXLOCKS. Shortages in this lock maintenance space can lead to so-called lock escalations, where row locks are escalated to undesirable table locks, which can even lead to deadlock situations. Data integrity is still maintained in such situations, but the associated wait times can severely impact throughput and response times.

Monday, May 4, 2009

Poor Man's Flight Recorder

Problem statement:
Sometimes when watching WPS you're faced with questions like "what is it doing right now?". In this situation you'd wish the product had some production level, minimally intrusive tracing capability that shows you what is going on at the programming model level, e.g. which process instances, BPEL activities, invokes, API calls, etc. are being executed "as we speak".
After some poking around in WPS' BPEDB I constructed an SQL statement that can at least serve as a simple surrogate for such a not-only-nice-to-have trace.

Solution: Poor Man's Flight Recorder:
Thanks to some very helpful timestamps in some of the tables, it was easy to construct an SQL statement that shows what has been recorded in the BPEDB during the 10 seconds before the most recent activity state change (or just the last 2 seconds - change it as you like).

SELECT
ai.last_state_change as AI_last_state_change,
substr(atp.name,1,30) as AI_templatename,
case ai.state
when 0 then 'null(0)'
when 1 then 'inactive(1)'
when 2 then 'ready(2)'
when 3 then 'running(3)'
when 4 then 'skipped(4)'
when 5 then 'finished(5)'
when 6 then 'failed(6)'
when 7 then 'terminated(7)'
when 8 then 'claimed(8)'
when 9 then 'terminating(9)'
when 10 then 'failing(10)'
when 11 then 'waiting(11)'
when 12 then 'expired(12)'
when 13 then 'stopped(13)'
when 14 then 'processing_undo'
end as AI_state,
substr(pt.name,1,30) as PI_templatename,
case pi.state
when 0 then 'deleted(0)'
when 1 then 'ready(1)'
when 2 then 'running(2)'
when 3 then 'finished(3)'
when 4 then 'compensating(4)'
when 5 then 'failed(5)'
when 6 then 'terminated(6)'
when 7 then 'compensated(7)'
when 8 then 'terminating(8)'
when 9 then 'failing(9)'
when 10 then 'indoubt(10)'
when 11 then 'suspended(11)'
when 12 then 'compensation_fail'
end as PI_state,
ai.piid as Process_Instance_ID
FROM
activity_instance_b_t ai,
activity_template_b_t atp,
process_instance_b_t pi,
process_template_b_t pt
where
atp.atid = ai.atid and
ai.piid = pi.piid and
pi.ptid = pt.ptid and
ai.last_state_change > ((select max(last_state_change) from activity_instance_b_t) - 10 seconds)
order by
ai.last_state_change desc
fetch first 100 rows only
with ur ;

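This statement is written for IBM's DB2 database. Oracle users need to adapt the DB2-specific parts, e.g. by wrapping the ordered query in an outer query with "WHERE ROWNUM <= 100" instead of using "fetch first 100 rows only" (ROWNUM is assigned before the ORDER BY is applied), and by replacing the "- 10 seconds" arithmetic and the "with ur" clause.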

Sample output:

AI_LAST_STATE_CHANGE       AI_TEMPLATENAME     AI_STATE    PI_TEMPLATENAME PI_STATE    PROCESS_INSTANCE_ID
-------------------------- ------------------- ----------- --------------- ----------- ----------------------------------
2009-04-03-10.22.37.545000 Receive_waitforever waiting(11) tptmain         running(2)  x'900301206B806201FEFFFF808BEE1841'
2009-04-03-10.22.37.540000 Receive_waitforever waiting(11) tptmain         running(2)  x'900301206B8061BBFEFFFF808BEE1828'
2009-04-03-10.22.37.517000 Callsub             finished(5) tptmain         running(2)  x'900301206B8061BBFEFFFF808BEE1828'
2009-04-03-10.22.37.517000 Callsub             finished(5) tptmain         running(2)  x'900301206B806201FEFFFF808BEE1841'
2009-04-03-10.22.37.507000 Receive_waitforever waiting(11) tptmain         running(2)  x'900301206B805FCCFEFFFF808BEE17E4'
2009-04-03-10.22.37.477000 Callsub             finished(5) tptmain         running(2)  x'900301206B805FCCFEFFFF808BEE17E4'
2009-04-03-10.22.37.461000 Receive_waitforever waiting(11) tptmain         running(2)  x'900301206B805F5CFEFFFF808BEE17AB'
2009-04-03-10.22.37.451000 -                   finished(5) tptsub          finished(3) x'900301206B806316FEFFFF808BEE18B7'
2009-04-03-10.22.37.451000 Reply               finished(5) tptsub          finished(3) x'900301206B806316FEFFFF808BEE18B7'
2009-04-03-10.22.37.441000 Snippet2            finished(5) tptsub          finished(3) x'900301206B806316FEFFFF808BEE18B7'
2009-04-03-10.22.37.439000 -                   finished(5) tptsub          finished(3) x'900301206B806318FEFFFF808BEE18B8'
2009-04-03-10.22.37.439000 Reply               finished(5) tptsub          finished(3) x'900301206B806318FEFFFF808BEE18B8'
2009-04-03-10.22.37.432000 Snippet2            finished(5) tptsub          finished(3) x'900301206B806318FEFFFF808BEE18B8'
2009-04-03-10.22.37.400000 Callsub             finished(5) tptmain         running(2)  x'900301206B805F5CFEFFFF808BEE17AB'
......

Explanation:
The generated list has the following columns:

AI_LAST_STATE_CHANGE = activity instance last state change
The last point in time at which the activity's state changed. The table is sorted on this column in descending order, i.e. newest entry first.
AI_TEMPLATENAME = the template name of the activity
AI_STATE = the state of the activity instance
PI_TEMPLATENAME = the name of the process template of the process instance this activity is part of
PI_STATE = the state of the process instance
PROCESS_INSTANCE_ID = the ID of the process instance

Possible performance impact:
The possible impact greatly depends on the size of the related tables. Most of the operation is being done on the ACTIVITY_INSTANCE_B_T table and there on the column LAST_STATE_CHANGE. By default, this column doesn't have an index. So if you experience very long execution times for the above SQL, you're most likely doing a table scan on the entire ACTIVITY_INSTANCE_B_T table. If so, you may want to define a suitable index for that column and perform a "runstats .... with distribution and detailed indexes all" on this table. This should reduce the table scan to an index scan, which should result in a noticeable reduction of the execution time of this SQL.

Disclaimer:
This SQL is light-years away from being a fully fledged flight recorder that shows all relevant events. On the other hand, it can at least provide some initial insight for problem determination purposes, and it is certainly far less intrusive (in terms of resource consumption) than a full blown BPE trace.

(finally also published (slightly modified) at http://www-01.ibm.com/support/docview.wss?uid=swg21384848 )

Thursday, April 30, 2009

SOA/BPM performance best practices (Process Engine configuration)

Different process engines have different tuning options. Here some relevant options for WPS will be discussed.

Thread pool sizes
While long-running business processes spend most of their lifetime in the default thread pool (JMS based navigation) or the WorkManager thread pool (WorkManager based navigation), short running processes don’t have a specific thread pool assigned to them. Depending on where the request to run a microflow comes from, a microflow runs within:

the ORB thread pool (e.g. the microflow is started from a different JVM with remote EJB invocation)
the Web container thread pool (e.g. the microflow is started using a http request)
the default thread pool (e.g. the microflow is started using a JMS message)

If microflow parallelism is not sufficient, examine your application and increase the respective thread pool. The key is to maximize the concurrency of the end-to-end application flow until the available processor or other resources are maximized.

Navigation mechanisms for long running flows
The business process engine in WPS processes long-running flows using a number of chained transactions. There are two types of process navigation techniques in WPS (since V6.1): JMS based navigation and WorkManager based navigation. Both types of navigation provide the same quality of service. The default is JMS based navigation. In Lab tests, WorkManager based navigation has shown throughput improvements of up to 100%.

However, the behaviour of the system changes a bit. If JMS based navigation is used, there is no scheme (for example age-based or priority-based) to process older or more highly prioritized business process instances first. This makes it hard to predict the actual duration of single instances, especially on a heavily loaded system. If WorkManager based navigation is used, instances that are currently being processed continue to be processed as long as there is outstanding work for them. While this is quite efficient, it favors process instances that are already running.

Resource dependencies
Start adjusting the parameters for the SCA and MDB activation specs and the WorkManager threads (if used) and then continue down the dependency chain as depicted below:

All those engine threads on the left side of the picture require JDBC connections to databases and it is very helpful for the throughput of the system if they don't have to wait for database connections from the related connection pools.

Other process engine tuning knobs
Business process engine environments have means to record what is happening inside, either for monitoring purposes or for problem determination. Any such recording requires a certain amount of processing capacity, so one should try to minimize monitoring recordings and definitely disable problem determination traces during normal operations.
Some of the data stores in out-of-the-box configurations may default to simple file-based single user databases like Derby. This eases simple setups, since some database administrative tasks can be avoided. When performance and production level transactional integrity are more important, it is advisable to place these data stores on production level database systems. Throughput could improve by factors of 2 to 5.
When using WPS' common event infrastructure (CEI) for recording business relevant events it might help to disable the CEI data store within WPS since CEI consumes these events and stores them in its own database. Also, validation of CEI events could be turned off once it has been verified that the emitted events are valid.

Monday, April 6, 2009

SOA/BPM performance best practices (Deployment and application packaging)

Current SOA runtime environments are usually implemented on J2EE based application servers. Business applications are deployed to these environments as one or more SCA modules.

Using common object libraries
Often a set of such business applications shares common definitions for data and interfaces and classical application design considerations suggest to put such common objects into a common objects library module.

Some current SOA runtime environments, however, treat such packaging schemas in a perhaps unexpected manner. The module specific class loader for module one finds a reference to another module (the common objects library) and loads that module. When the module specific class loader for module two then loads its module's code and finds a reference to another module (again the common objects library), it loads that module as well. This continues for all the application modules. The platform's application class loaders have no knowledge about what has already been loaded by other application level class loaders, so a shared object library module is actually shared “by copying”. And if the memory footprint of a single copy of that shared object library is several hundred megabytes, the available heap space can be exhausted pretty fast.
Such excessive memory usage does not necessarily have a performance impact, but when the JVM's garbage collection takes place, the responsiveness of the affected applications can suffer dramatically.

A less memory consuming approach in this case would be to define one library module per application module. While this approach definitely requires more development and packaging effort, it can considerably reduce the overall memory requirement. Reductions of up to 57% have been observed.

Modularization effects
One of the internally used benchmark workloads is organized as three SCA modules because that seemed the most natural model for a production application implementation: one module creates the business events and one consumes them, while the module in between contains the business logic responsible for synchronization. For ease of code maintenance an SCA developer may be tempted to separate an application into more modules. In order to demonstrate the performance costs incurred with modularization, the benchmark's three modules were reorganized into a two-module version and a one-module version.

Measurements indicate a throughput improvement of 14% in the two module version and of 32% in the one module version, both relative to the original workload implementation. As might be expected, data sharing among SCA components is more expensive across modules than it is for components within the same module (where additional optimizations are available).

Tuesday, March 31, 2009

Slow BPC processing - a PD story using DB analysis

One of the educational principles and an essential base of what is called “experience” is to learn from errors made. Second best is to learn from errors made by others – which is still better than not learning at all. Maybe this story helps a bit.

A customer's WPS seemed to become a bit sluggish and all the usual tuning efforts hadn't helped a lot. One of the things that still could be done was to look into the BPEDB itself to see whether there were unexpected amounts of data and/or relations.

Finally, the investigations revealed an unhandled condition in a process model that led to an undesirable loop.

We started with a

select count(*) from process_instance_b_t with ur

to get the number of rows in the process instance table - which represents the number of process instances WPS knows about.

We include the "with UR" clause in all these selects to prevent the database from setting any locks that would interfere even more with other users (applications) of the database. Or - to phrase it differently - to be as unintrusive as possible.

If you're interested in a few more details, you can ask for the number of instances in a certain state.

select count(*) from process_instance_b_t where state = 2 with ur

would yield the number of all running instances. Valid states are: DELETED=0, READY=1, RUNNING=2, FINISHED=3, COMPENSATING=4, FAILED=5, TERMINATED=6, COMPENSATED=7, TERMINATING=8, FAILING=9, INDOUBT=10, SUSPENDED=11, COMPENSATION_FAILED=12.

Similarly you can collect some statistics on activities:

select count(*) from activity_instance_b_t with ur
select count(*) from activity_instance_b_t where state = 5 with ur

Valid states are: INACTIVE=1, READY=2, RUNNING=3, SKIPPED=4, FINISHED=5, FAILED=6, TERMINATED=7, CLAIMED=8, TERMINATING=9, FAILING=10, WAITING=11, EXPIRED=12, STOPPED=13, PROCESSING_UNDO=14.

When your typical (or average) process model executes e.g. 30 activities, then the number of rows in the activity_instance_b_t table should roughly be in the order of magnitude of up to 30 times the number of rows in the process_instance_b_t table.

In this case we found over 39M activity instances for about 50k process instances, where we only expected up to 1M activity instances.
That justified some deeper investigation - we wanted to find out, which instances had the most activities.
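
The plausibility check behind these numbers is simple arithmetic. The following Python sketch (function and parameter names are made up for illustration) compares the two row counts against an expected number of activities per instance; the value 20 is an assumption chosen to match the "up to 1M" expectation above.

```python
def activity_ratio_report(process_rows: int, activity_rows: int,
                          expected_per_instance: int) -> dict:
    """Compare the observed activity count with what the process models suggest."""
    if process_rows <= 0 or expected_per_instance <= 0:
        raise ValueError("row count and expectation must be positive")
    expected_max = process_rows * expected_per_instance
    return {
        "expected_max_activities": expected_max,
        "observed_per_instance": activity_rows / process_rows,
        "factor_over_expectation": activity_rows / expected_max,
    }


if __name__ == "__main__":
    # Figures from this case: about 50k process instances, over 39M activity instances.
    report = activity_ratio_report(50_000, 39_000_000, expected_per_instance=20)
    for key, value in report.items():
        print(f"{key}: {value}")
```

With these inputs the average comes out at 780 activities per instance, 39 times the expected maximum, which is far beyond anything that differences between process models could explain.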

SELECT
PI.PIID, PT.NAME, PI.STATE,
COUNT(AI.AIID) AS NUMBER_OF_ACTIVITIES
FROM
ACTIVITY_INSTANCE_B_T AS AI,
PROCESS_INSTANCE_B_T AS PI,
PROCESS_TEMPLATE_B_T AS PT
WHERE
PI.PTID = PT.PTID AND
AI.PIID = PI.PIID
GROUP BY
PI.PIID, PT.NAME, PI.STATE
ORDER BY
NUMBER_OF_ACTIVITIES DESC
FETCH FIRST 20 ROWS ONLY
WITH UR

Some explanations:
PI.PIID is the process instance ID from the process_instance_b_t table
To see a bit more than just a hex ID, the name of the related process template (PT.NAME) is included.
And as we're interested in the amount of activities for these process instances, we'd need to include something from the activity_instance_b_t table and do the necessary grouping.

In our specific case this resulted in

PIID	NAME	STATE	NUMBER_OF_ACTIVITIES
-----------------------------------	------------	-------	--------------------
x'9003011CE5DED75B3EFDEB538C02DAE4'	LGClaimEH	6	147047
x'9003011E841DE9AF3EFDEB53045C4103'	LGClaimEH	6	96609
x'9003011E841DDEF13EFDEB53045C3DD9'	LGClaimEH	6	96462
. . . .

LGClaimEH was expected to have 20-30 activities at most. Position 100 in that ordered list had about 57k and position 1000 had 7k activities. Furthermore state=6 indicated that these instances were no longer running (6=terminated). From their name, the customer could tell, that they're no longer relevant and should have been cleaned up automatically.

A lookup of

select auto_delete from process_template_b_t where name = 'LGClaimEH'

proved that the “delete instance when finished” flag had not been set when the process was modelled (auto_delete=0), which was the reason why all these instances still existed.

However, why there were so many activities in these instances needed further investigation. We tried to get a list of the most recently executed activities in some of the topmost cases in the above list.
(N.B. only activities with the “business relevance” flag set are persisted in the activity instance table)

SELECT
AI.LAST_STATE_CHANGE, ATP.NAME, AI.STATE
FROM
ACTIVITY_INSTANCE_B_T AI,
ACTIVITY_TEMPLATE_B_T ATP
WHERE
AI.ATID = ATP.ATID and
AI.PIID = x'9003011CE5DED75B3EFDEB538C02DAE4'
ORDER BY
AI.LAST_STATE_CHANGE DESC
FETCH FIRST 40 ROWS ONLY
WITH UR

resulted in something like

LAST_STATE_CHANGE          NAME        STATE
-------------------------- ----------- -----
2009-03-22-16.24.17.964333 Activity_17 7
2009-03-22-16.23.55.925757 Activity_14 5
2009-03-22-16.23.32.528576 Activity_14 5
2009-03-22-16.23.11.976875 Activity_14 5
2009-03-22-16.22.49.582347 Activity_14 5
2009-03-22-16.22.24.257894 Activity_14 5
2009-03-22-16.22.01.723894 Activity_14 5
. . . .

which continued on and on with “Activity_14” in state finished(=5), with roughly 20 seconds between the timestamps, for thousands of rows.
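
Such a regular pattern is easy to confirm programmatically once the query output has been saved. A minimal Python sketch, assuming the LAST_STATE_CHANGE values have been copied into a list in the order the query returns them (newest first); the function name is made up for illustration.

```python
from datetime import datetime

# Timestamp layout as printed by the DB2 command line in the output above.
DB2_TS_FORMAT = "%Y-%m-%d-%H.%M.%S.%f"


def intervals_seconds(timestamps: list[str]) -> list[float]:
    """Seconds between consecutive rows of a newest-first timestamp list."""
    parsed = [datetime.strptime(ts, DB2_TS_FORMAT) for ts in timestamps]
    return [(newer - older).total_seconds()
            for newer, older in zip(parsed, parsed[1:])]


if __name__ == "__main__":
    # The Activity_14 rows from the output above.
    activity_14 = [
        "2009-03-22-16.23.55.925757",
        "2009-03-22-16.23.32.528576",
        "2009-03-22-16.23.11.976875",
        "2009-03-22-16.22.49.582347",
        "2009-03-22-16.22.24.257894",
        "2009-03-22-16.22.01.723894",
    ]
    gaps = intervals_seconds(activity_14)
    print([round(g, 1) for g in gaps])
    print(f"average gap: {sum(gaps) / len(gaps):.1f} s")
```

A narrow spread of the gaps around one value, as seen here, hints at a timer or business rule driving the repetition rather than external events.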

After the application developer saw this, it didn't take long to find out, what to correct.
In this case the “delete when finished” flag was activated, the (not shown here) business rule that determined the 20 second interval was adapted, and, most importantly, the business logic that led to the undesired loop was corrected.

At least this case could be used as a demonstration of robustness, as the oldest of these instances dated back over 9 months. The server performed this “nonsense” in the application code for quite a long time before its slowness in processing (most obvious in deleting instances) was noticed. And even then - it didn't break.

The final cleanup in the database was done in small chunks using the deleteCompletedProcessInstances.py script in the {install_root}/runtimes/bi_v6/ProcessChoreographer/admin directory.

Tuesday, March 3, 2009

Performance PD techniques - sniffing into the database

Sometimes performance problem determination can be assisted by tracing what is going on in the process engine database.
Usually database managers offer very helpful facilities to assist here in a least intrusive way.

In case you don't have your DBA available (or don't have one - or have enough access rights yourself) you can e.g. start monitoring SQL statements or deadlocks.

An SQL statement event monitor can be started (here with DB2 on AIX) with the following sequence of commands.

db2 connect to your-bpedb-name-here
db2 "create event monitor statmnts for statements write to file '/tmp/DB2_smon'"
---- (note: this path (directory) should exist)
db2 set event monitor statmnts state 1
db2 flush event monitor statmnts
---- now perform whatever scenario you want to do the tracing for
db2 set event monitor statmnts state 0
db2 flush event monitor statmnts
db2evmon -path '/tmp/DB2_smon' > '/tmp/DB2_smon/sqltrace.txt'
db2 connect reset

Similarly an event monitor for deadlocks can be set up:
db2 connect to your-bpedb-name-here
db2 "create event monitor dedlk for deadlocks with details history values write to file '/tmp/DB2_dlock'"
---- (note: this path (directory) should exist)
db2 set event monitor dedlk state 1
db2 flush event monitor dedlk
---- now perform whatever scenario you want to capture deadlocks for
db2 set event monitor dedlk state 0
db2 flush event monitor dedlk
db2evmon -path '/tmp/DB2_dlock' > '/tmp/DB2_dlock/sqldeadlocks.txt'
db2 connect reset

Monday, March 2, 2009

SOA/BPM performance best practices (End user interaction considerations)

When designing the end user's interface and interaction with the BPM system, one mainly has to deal with getting a list of tasks to be worked at, claiming tasks, and completing tasks.

Querying to-do tasks
Experience has shown that when end users have large degrees of freedom in filtering and sorting the list of tasks they have access to, they might develop a kind of cherry-picking behaviour. This might result in many more task queries being performed than actual work being completed. Cases of a 20 to 1 ratio have been observed. When the nature of the business requires the flexibility of arbitrary filtering and sorting, then there is not a lot that can be done, but if this freedom is not necessary, it is advisable to design the end user's interaction with the system in a way that doesn't offer filtering and sorting capabilities that aren't required.

Concurrent access to tasks
When two or more persons try to claim or check out the same task from their group's task list, only the first one will succeed. The others will get some kind of access error when trying to claim a task that has already been claimed by someone else.
Such claim collisions are counterproductive, since they usually trigger additional queries. A possible remedy is to design the client-side code so that whenever a claim attempt reports a collision, the code randomly tries to claim one of the next tasks in the list and presents that one to the end user.
The probability of claim collisions grows with the number of concurrent users accessing a common group task list, with the frequency of task list refreshes, and with the unsuitability of the mechanism used to select a task from a common list. Claim collisions not only cause additional queries to be processed, they can also frustrate the end users.
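
The remedy described above can be sketched roughly as follows. This is only an illustration under assumptions: the TaskList class, the ClaimCollision exception, and the window parameter are made-up stand-ins and not the WPS task API, which reports collisions in its own way.

```python
import random


class ClaimCollision(Exception):
    """Hypothetical error: the task was already claimed by someone else."""


class TaskList:
    """In-memory stand-in for a group task list (not the WPS API)."""

    def __init__(self, task_ids):
        self.owner = {task_id: None for task_id in task_ids}

    def claim(self, task_id, user):
        if self.owner[task_id] is not None:
            raise ClaimCollision(task_id)
        self.owner[task_id] = user


def claim_with_fallback(tasks, visible_ids, wanted_id, user, window=5, rng=random):
    """Claim wanted_id; on a collision, try the next `window` tasks in random order.

    Returns the id of the claimed task, or None if every attempt collided.
    """
    try:
        tasks.claim(wanted_id, user)
        return wanted_id
    except ClaimCollision:
        pass
    start = visible_ids.index(wanted_id) + 1
    candidates = visible_ids[start:start + window]  # slice copy, safe to shuffle
    rng.shuffle(candidates)
    for task_id in candidates:
        try:
            tasks.claim(task_id, user)
            return task_id
        except ClaimCollision:
            continue
    return None  # only now does the client need to query the list again
```

The point of the design is that a collision is resolved with further claim attempts against the list the client already holds, so a new task query is only needed when all candidates are gone.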

Multiple task assignments
The previous sub-section suggests keeping the number of persons accessing a common task list “small”. If a single task is assigned to multiple people instead of being put on a common task list, similar collision effects can occur. Keeping the number of persons a single task is assigned to as small as possible is again a good strategy.

Monday, February 9, 2009

SOA/BPM performance best practices (BPEL process definitions)

WPS can handle two different types of BPEL flows: long running flows (also known as macro flows or interruptible flows) and short running flows (also known as microflows or non-interruptible flows).

Long running flows
A business transaction that is represented by a long running flow has a lifetime that can span minutes, hours, days, or even months, and is typically divided into several technical transactions (each enclosed by begin and commit). The state of such a process instance is persisted in a database (the BPE DB) between two transactions, so that operating system resources are only occupied during an in-flight transaction. WPS allows the technical transaction boundaries to be tuned, so that at process definition time the developer can, for example, extend the scope of a transaction by combining several transactions into a single one, thus saving some of the transaction handling overhead.

Short running flows
The other type of flow, the short running flow, is used when the corresponding business transaction is fully automated, completes within a short time frame, and has no asynchronous request/response operations. Here the entire set of flow activities runs within one single technical transaction, navigation is all done in memory, and intermediate state is not saved to a database. Such short running flows can run between 5 and 50 times faster than comparable long running flows and should be preferred where possible.
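
To make the difference between the two flow types more tangible, here is a toy model that only counts how often process state would be written to the database. It rests on a simplifying assumption of mine (one state write per committed technical transaction in a long running flow, none in a short running flow) and says nothing about how WPS implements navigation or about the actual speed-up.

```python
def count_state_writes(activities, long_running, activities_per_transaction=1):
    """Count state writes in a toy model of flow navigation.

    Assumption: a long running flow persists its state once per technical
    transaction; a short running flow keeps all state in memory.
    """
    if not long_running:
        return 0
    # Ceiling division: number of technical transactions needed.
    return -(-activities // activities_per_transaction)


if __name__ == "__main__":
    print(count_state_writes(10, long_running=True))    # one transaction per activity
    print(count_state_writes(10, long_running=True,
                             activities_per_transaction=5))  # boundaries combined
    print(count_state_writes(10, long_running=False))   # short running flow
```

In this model, combining transactions at process definition time reduces the number of state writes, and the short running flow avoids them altogether.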

Programming at the business level
BPEL can be considered a programming language. This should not, however, lead to the assumption that it is a suitable base for developing applications that are usually written in languages like C++ or Java. BPEL should be considered an interpreted language, although there have been some investigations into the possible advantages of compiled BPEL. When the execution of a BPEL flow itself is compared with the interaction with and invocation of the orchestrated services that SOA provides, it becomes apparent that a considerable share of the overall execution path length is attributable to SOA's invocation mechanisms, for example SOAP or messaging. A BPEL compiler would therefore only optimize a smaller part of the overall execution path length. And with compiled BPEL one might lose some of the flexibility and interoperability that the current implementations offer.
In the 1980s, one of the trends in the IT industry was called Business Process Re-engineering. The solutions developed for it were usually more or less large monolithic programs, in most cases with the business level logic hard-coded in their modules. In BPEL based business applications, this business level logic is moved to the BPEL layer. Some thought should be given to where to place the dividing line between the BPEL layer and the orchestrated lower level business logic services. The more flow logic details are put into BPEL, the larger the BPEL related share of the overall processing gets, and one might end up doing low level or fine-grained programming in BPEL. This is not what BPEL is meant to be used for. After all, the “B” in BPEL stands for “business”. So BPEL should be used for programming on the business logic level only.

Business process data
Every business process deals with some amount of variable data that might be used for decisions within the flow or as input or output parameters of some flow activities. The amount or size of that data can have a considerable impact on the amount of processing that needs to take place. For large business objects, the memory needed may quickly exhaust the available JVM heap, and in the case of long running processes the size of the business objects directly determines the amount of data that needs to be saved to and retrieved from a data store at each transactional boundary. CPU capacity is affected as well, because of object serialization and de-serialization. The advice is to use as little data as possible within a business process instance. Instead of, for example, passing large images through a flow, passing a pointer to the image in the form of a file name or image id causes much less overhead in the business flow engine.
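
The effect of passing a pointer instead of the data itself can be illustrated with a small sketch. Python's pickle module serves here merely as a stand-in for whatever serialization the process engine uses, and the field names and the image size are invented for the example.

```python
import pickle


def serialized_size(business_object):
    """Bytes that would have to be written at a transaction boundary."""
    return len(pickle.dumps(business_object))


# Stand-in for a 5 MiB scanned image (all zero bytes, purely illustrative).
image_bytes = bytes(5 * 1024 * 1024)

# Variant 1: the image travels through the flow as part of the business object.
by_value = {"orderId": 4711, "scan": image_bytes}

# Variant 2: only a (hypothetical) image id travels through the flow.
by_reference = {"orderId": 4711, "scanId": "scan-4711"}

if __name__ == "__main__":
    print("by value:    ", serialized_size(by_value), "bytes")
    print("by reference:", serialized_size(by_reference), "bytes")
```

In a long running flow the larger object would be serialized, stored, and read back at every transactional boundary, which is why the second variant causes so much less overhead.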

Invocation types
SCA environments offer different kinds of invocation mechanisms. Some invocations can be done synchronously, some asynchronously. Synchronous invocations typically imply less internal processing than asynchronous invocations. Asynchronous invocations typically also require the currently open transaction to be committed, so that the outgoing request message becomes visible to the consumer it is targeted at. Even when the binding used is synchronous, unnecessary serialization and de-serialization can be avoided if the target service resides within the same JVM, so that internally just object pointers are passed.
If the target service resides within the same module, some internal name lookup processing can be saved as well.
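
The difference between passing object pointers and serializing the request can be sketched as follows. Again pickle is only a stand-in for the real wire format, and the two invoke functions are hypothetical; they do not correspond to any SCA API.

```python
import pickle


def target_service(order):
    """Example service: marks the order as checked and returns it."""
    order["status"] = "checked"
    return order


def invoke_by_reference(service, payload):
    """Caller and service share a process: only the object reference is passed."""
    return service(payload)


def invoke_with_serialization(service, payload):
    """Caller and service are separated: request and response are copied."""
    request = pickle.loads(pickle.dumps(payload))   # serialize + de-serialize
    response = service(request)
    return pickle.loads(pickle.dumps(response))     # and once more on the way back


if __name__ == "__main__":
    order = {"id": 1}
    print(invoke_by_reference(target_service, order) is order)         # same object
    other = {"id": 2}
    print(invoke_with_serialization(target_service, other) is other)   # a copy
```

The serialized variant performs two extra copy steps per call, and their cost grows with the size of the business object.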

Audit logging
Most BPM engines allow keeping a record of whatever is happening on the business logic level. Producing such audit logs doesn't come for free either, since at least some I/O is associated with it. The recommendation here is to restrict audit logging to only those events that are really relevant to the business and to omit all others, in order to keep the logging overhead as small as possible.
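
The idea of restricting audit logging to business relevant events boils down to a filter in front of the log writer. The sketch below is only meant to show that principle; the event names are invented, and in a real BPM product the selection is normally made through configuration or in the process model rather than in code like this.

```python
# Hypothetical event names; a real engine defines its own event types.
RELEVANT_EVENTS = {"PROCESS_STARTED", "PROCESS_COMPLETED", "TASK_CLAIMED"}


def audit(event_type, details, sink, relevant=RELEVANT_EVENTS):
    """Append the event to the audit sink only if it is business relevant.

    Returns True if the event was logged, False if it was dropped.
    """
    if event_type not in relevant:
        return False  # no I/O for events nobody asked for
    sink.append((event_type, details))
    return True


if __name__ == "__main__":
    log = []
    audit("ACTIVITY_READY", {"pid": 1}, log)    # dropped
    audit("PROCESS_STARTED", {"pid": 1}, log)   # logged
    print(log)
```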

Wednesday, January 28, 2009

SOA/BPM performance best practices (Addressing complex environments)

Except for maybe simple proof-of-concept, proof-of-technology, or demo setups, today's real-life business process automation environments are complex.
Complexity often goes along with a variety of different ways to look at it. Some of these views may overlap in content, others might be disjoint and perhaps also orthogonal to other views.
Having a SOA environment suggests looking at the SOA reference architecture[1] when considering performance related questions in architecture reviews.

Having this architecture chart in mind can help in assessing a solution's performance behaviour or when doing performance problem determination, but usually diagrams of the operational topology and the application architecture provide much better assistance. Such charts show how all the involved components are connected and make it possible to depict how requests flow through the system.

So whenever you have to troubleshoot performance problems, especially in an unknown environment, make sure you get the charts and pictures that help you understand what things look like and how they relate to each other. This material also helps a lot in communicating with, for example, system support personnel.

Monday, January 26, 2009

SOA/BPM performance best practices (Intro)

Today's application solutions that include SOA and Business Process Management are usually spread across a nontrivial operational topology like the one in the picture on the right.

The quality of service that a business process provides to the business is directly dependent on the integrity, availability, and performance characteristics of the involved IT systems. The more components are involved, the more complex the solution becomes in every aspect throughout its life cycle.

The following posts will try to provide an overview (from my limited point of view) of the most important performance aspects to take care of in such a complex environment by

identifying performance relevant areas,
providing the most important items to take care of in each of these areas,
discussing some governance principles to address the integrity issues that might arise in case of performance problems and with planned or unplanned outages.

Tuesday, January 20, 2009

SOA/BPM performance best practices (Abstract)

In the following few posts, I'd like to provide a wide, high-level overview of lessons learned, best practices, and performance engineering for business process management (BPM) and choreography in the context of SOA. The considerations concentrate on WebSphere Process Server (WPS) and its BPM component Business Process Choreographer (BPC). However, most of the principles should apply to other BPM automation products as well.

BPM/BPC is one of the core services of IBM's SOA stack and is actually older than SOA itself. The areas of best practices encompass more than just the classical IT disciplines.

It starts with business process analysis and business process modeling. Analyzing and modeling the ways an enterprise works (or should work) in too much detail can lead to low level logic being processed within the context of a business process model running in the business process engine. As a consequence, more cycles will be used and business process versioning can become more difficult to manage.

Careful planning and design of the end user interactions with new business process applications can prevent selecting and getting a unit of work from taking more cycles than getting the work done.

Defining the operational topology to meet scalability and availability requirements can have similar performance implications as the large set of configuration parameters in a single WAS/WPS node.

Since the above product stack runs on classical operating systems and their infrastructure, all typical performance considerations need to be applied there as well. The scope spans from influencing dispatching priorities and memory related tuning knobs in the participating subsystems to the I/O subsystem configurations used for persisting business process related state and data.

Just as SOA with BPM/BPC integrates and aggregates business services into new, more complex solutions, its use also makes it harder to manage product and service dependencies when business process applications move from development into day-to-day operation under the classical IT management disciplines, including (but not limited to) monitoring, problem management, and change management.

Performance Engineering can be extended to include business services and business processes within the scope of the various performance engineering disciplines.
