Friday, May 8, 2009

SOA/BPM performance best practices (System and subsystem configuration)

This section handles some scalability related tuning knobs and some relevant tuning parameters for the involved subsystems like the JVMs and the databases.

Clustering topologies:
In order to take care of growth and workload distribution, modern business process engines can run in a clustered setup, spreading the workload across various physical nodes (horizontal scaling) or for better utilizing spare resources within existing nodes (vertical scaling).
For WPS three different cluster topology patterns have been identified and described e.g in http://www-01.ibm.com/support/docview.wss?uid=swg27010320&aid=1 or http://www.ibm.com/developerworks/websphere/library/techarticles/0703_redlin/0703_redlin.html .
The first pattern (shown on the left) is also known as the “bronze topology”. It consists of a single application server cluster, where the WPS business applications, the support applications like CEI and BPC Explorer, and the messaging infrastructure hosting the messaging engines (MEs) that form the system integration buses (SIBus) all reside within each of the application servers, that form the cluster.
This bronze topology is suitable for a solution, that comprises of only synchronous web services and synchronous SCA invocations, preferably with short running flows only.

The second pattern (shown in the middle) is also known as the “silver topology”. It has two clusters, the first one containing the WPS business applications and the support applications as before, but the messaging infrastructure is located in the second cluster.
This silver topology is suitable for a solution that uses long running processes, but does neither need CEI, nor message sequencing, nor asynchronous deferred response, nor JMS or MQ bindings, nor message sequencing mechanisms.

The third pattern (shown on the right) is called the “golden topology”. Compared to the previous patterns, the support applications are separated into a third cluster.
This golden topology is suited for all the remaining cases, where asynchronous processing plays a nontrivial role in the solution. It also provides the most “JVM space” for the business process applications that should run in this environment. If the available hardware resources allow for setting up this golden topology, then it is advisable to start with this topology pattern from the very beginning as it is the most versatile one.
What is not shown in the above figure is the management infrastructure, that controls the cluster(s). These consist of node agents and a deployment manager node as the central point of administration of the entire cell, these clusters belong to. A tuning tip for this management infrastructure is to turn off automatic synchronization of the node configurations. Depending on the complexity of the setup, this synchronization processing is better kicked off manually during defined maintenance windows in off-peak times.

JVM Garbage Collection
Verbose garbage collection is not as verbose as the name suggests. Those few lines of information that are produced, when verboseGC is turned on don't really hurt the system's performance. On the other side they can be a very helpful source of information when troubleshooting performance problems.
The JVM used by WPS V6.1 supports several garbage collection strategies: the Throughput Garbage Collector (TGC), the Low Pause Garbage Collector (LPGC), and the Generational Garbage Collector (GGC).
The TGC provides the best out-of-box throughput for applications running on a JVM by minimizing costs of the garbage collector. However, it has “stop-the-world” phases that can take between 100ms and multiple seconds during garbage collection.
The LPGC provides garbage collection in parallel to the JVM’s work. Due to increased synchronization costs, throughput decreases. If response time is more important than highest possible throughput, this garbage collector could be a good choice.
The GGC is new in the IBM 1.5 JVM. It is well suited for applications that produce a lot of short-lived small Java objects. As it reduces pause times it should be tried in such cases instead of the TGC or LPGC. When properly tuned, it provides the best garbage collection performance for SOA/BPM workloads. [http://www.redbooks.ibm.com/abstracts/redp4431.html]

JVM memory considerations
Increasing the heap size of the JVM of the application server can improve the throughput of business processes. However it should ensured, that there is enough real memory available to avoid that the operating system would start swapping. Detailed information on JVM parameter tuning can be found in [http://www.ibm.com/developerworks/java/jdk/diagnosis/].

Database subsystem tuning
To a large degree the performance of long running flows and/or human tasks in a SOA/BPM solution depends on a properly tuned, enterprise class database management system besides the afore mentioned application server tuning. This paper provides some tuning guidelines for IBM's DB2 database system as an example. Most of the rules should also be applicable to other production database management systems.
It is not advisable to use simple file based databases like Cloudscape or Derby as a database management system for WPS other than for the purpose of unit testing.

Configuration advisor
DB2 comes with a built-in configuration advisor. After creating the database, the advisor can be used to configure the database for the usage scenario expected. The input for the Configuration Advisor depends on the actual system environment, load assumptions, etc. Details on how to use this advisor can be found in [http://www-01.ibm.com/support/docview.wss?uid=swg27012639&aid=1]. Some parameter settings in the output of the advisor should be checked and adjusted afterwards.

MINCOMMIT A value of ‘1’ is strongly recommended. The advisor sometimes suggests other values.
NUM_IOSERVERS The value of NUM_IOSERVERS should match the number of physical disks (+2) the database resides on.
NUM_IOCLEANERS Especially on multi-processor machines, enough IO cleaners should be available to make sure that dirty pages in the bufferpool are written to disk. Provide at least one IO cleaner per processor.

Database statistics
Optimal database performance requires the database optimizer to do its job well. The optimizer acts based on statistical data about the number of rows in a table, the use of space by a table or index, and other information. When the system is set up, these statistics are empty. As a consequence the optimizer usually takes sub-optimal decisions, leading to poor performance.
Therefore after initially putting load on your system, or whenever the data volume in the database changes significantly, you should update the statistics by running the RUNSTATS utility (DB2). Make sure there is sufficient data (> 2000 process instances) in the database before you run RUNSTATS. Avoid running RUNSTATS on an empty database as this will lead to bad performance.

Enable Re-Optimization
If BPC API queries (as used by the BPC Explorer e.g.) are used regularly on your system, it is recommended to allow the database to re-optimize SQL queries once, as described at [http://www-01.ibm.com/support/docview.wss?rs=2307&uid= swg21299450]. This tuning step greatly improves the response times of BPC API queries. In Lab tests the response time for one and the same query has been reduced from over 20 seconds down to 300 milliseconds. With improvements in such orders of magnitude the additional overhead for re-optimizing SQL queries should be affordable.

Database indexes
In most cases the BPM product's datastore has not been defined such, that all the database indexes that might potentially be used have been defined. In order to avoid unnecessary processing out of the box, it is much more likely, that only those indexes have been defined, that are necessary to run the most basic queries with an acceptable response time.
As a tuning step one can do some analysis on the SQL statements resulting from end user queries to see, how the query filters used by the end user (or in the related API call) relate to the WHERE clauses in the resulting SQL statements and define additional indexes on the related tables to improve the performance of these queries. After defining new indexes, the above mentioned RUNSTATS action needs to be run to enable the use of the newly created indexes.
Sometimes customers are uncertain about whether they are turning their environment into an unsupported state when defining additional indexes. This is definitively not the case. Customers are even encouraged to apply such tuning steps and check whether they help. If not, they can be undone easily e.g. by removing the index.

Further database tuning
Any decent database management system can keep its data in memory buffers called bufferpools to avoid physical I/O. Data, that is in these bufferpools needs not be read from disk when referred to, it can be taken from these memory buffers directly. Hence it makes a lot of sense to make these buffers large enough to hold as much data as possible.
The key tuning parameter to look at is called bufferpool hit ratio and describes the ratio between the physical data and index reads and the logical reads. As a rule of thumb you can increase the size of the buffer pools as long as you get a corresponding increase of the bufferpool hit ratio. A well tuned system can easily have a hit ratio well above 90%.
WPS accesses it's databases in multiple concurrent threads and uses row level locking to ensure data consistency during it's transactions. As a result, there can be a lot of row locks being active at times of heavy processing. The related database parameters for the space, where the database maintains the lock information might have to be adjusted.
For DB2 the affected database configuration parameters are LOCKLIST and MAXLOCKS. Shortages in this lock maintenance space can lead to so called lock escalations, where row locks are escalated to undesirable table locks, which even can lead to deadlock situations. Data integrity is still maintained in such situations, but the associated wait times can severely impact throughput and response times.

Monday, May 4, 2009

Poor Man's Flight Recorder

Problem statement:
Sometimes when watching WPS you're faced with questions similar to "what is it doing rightnow?". In this situation you'd wish the product had some production level, low intrusive tracing capability, that shows you, what is going on on the programming model level. Which would e.g. be what process instances, BPEL activities, invokes, API calls, etc. are being executed "as we speak".
After some poking around in WPS' BPEDB I constructed an SQLs statement, that at least can serve as a simple surrogate for such a not-only-nice-to-have trace.

Solution: Poor Man's Flight Recorder:
Due to some very helpful timestamps in some of the tables, it made it easy to construct an SQL statement, that can show you what has been recorded in the BPEDB for the last 10 seconds (or just the last 2 seconds - change it as you like).
SELECT
ai.last_state_change as AI_last_state_change,
substr(atp.name,1,30) as AI_templatename,
case ai.state
when 0 then 'null(0)'
when 1 then 'inactive(1)'
when 2 then 'ready(2)'
when 3 then 'running(3)'
when 4 then 'skipped(4)'
when 5 then 'finished(5)'
when 6 then 'failed(6)'
when 7 then 'terminated(7)'
when 8 then 'claimed(8)'
when 9 then 'terminating(9)'
when 10 then 'failing(10)'
when 11 then 'waiting(11)'
when 12 then 'expired(12)'
when 13 then 'stopped(13)'
when 14 then 'processing_undo'
end as AI_state,
substr(pt.name,1,30) as PI_templatename,
case pi.state
when 0 then 'deleted(0)'
when 1 then 'ready(1)'
when 2 then 'running(2)'
when 3 then 'finished(3)'
when 4 then 'compensating(4)'
when 5 then 'failed(5)'
when 6 then 'terminated(6)'
when 7 then 'compensated(7)'
when 8 then 'terminating(8)'
when 9 then 'failing(9)'
when 10 then 'indoubt(10)'
when 11 then 'suspended(11)'
when 12 then 'compensation_fail'
end as PI_state,
ai.piid as Process_Instance_ID
FROM
activity_instance_b_t ai,
activity_template_b_t atp,
process_instance_b_t pi,
process_template_b_t pt
where
atp.atid = ai.atid and
ai.piid = pi.piid and
pi.ptid = pt.ptid and
ai.last_state_change > ((select max(last_state_change) from activity_instance_b_t) - 10 seconds)
order by
ai.last_state_change desc
fetch first 100 rows only
with ur ;
This statement is for IBM's DB2 database - Oracle users might want to use "WHERE ROWNUM < 100" instead of "fetch first 100 rows only".

Sample output:
(if it looks distorted, you may want to enlarge your browser window horizontally to fit it in)
AI_LAST_STATE_CHANGEAI_TEMPLATENAMEAI_STATEPI_TEMPLATENAMEPI_STATEPROCESS_INSTANCE_ID
-------------------------------------------------------------------------------------------------------------------
2009-04-03-10.22.37.545000Receive_waitforeverwaiting(11)tptmainrunning(2)x'900301206B806201FEFFFF808BEE1841'
2009-04-03-10.22.37.540000Receive_waitforeverwaiting(11)tptmainrunning(2)x'900301206B8061BBFEFFFF808BEE1828'
2009-04-03-10.22.37.517000Callsubfinished(5)tptmainrunning(2)x'900301206B8061BBFEFFFF808BEE1828'
2009-04-03-10.22.37.517000Callsubfinished(5)tptmainrunning(2)x'900301206B806201FEFFFF808BEE1841'
2009-04-03-10.22.37.507000Receive_waitforeverwaiting(11)tptmainrunning(2)x'900301206B805FCCFEFFFF808BEE17E4'
2009-04-03-10.22.37.477000Callsubfinished(5)tptmainrunning(2)x'900301206B805FCCFEFFFF808BEE17E4'
2009-04-03-10.22.37.461000Receive_waitforeverwaiting(11)tptmainrunning(2)x'900301206B805F5CFEFFFF808BEE17AB'
2009-04-03-10.22.37.451000-finished(5)tptsubfinished(3)x'900301206B806316FEFFFF808BEE18B7'
2009-04-03-10.22.37.451000Replyfinished(5)tptsubfinished(3)x'900301206B806316FEFFFF808BEE18B7'
2009-04-03-10.22.37.441000Snippet2finished(5)tptsubfinished(3)x'900301206B806316FEFFFF808BEE18B7'
2009-04-03-10.22.37.439000-finished(5)tptsubfinished(3)x'900301206B806318FEFFFF808BEE18B8'
2009-04-03-10.22.37.439000Replyfinished(5)tptsubfinished(3)x'900301206B806318FEFFFF808BEE18B8'
2009-04-03-10.22.37.432000Snippet2finished(5)tptsubfinished(3)x'900301206B806318FEFFFF808BEE18B8'
2009-04-03-10.22.37.400000Callsubfinished(5)tptmainrunning(2)x'900301206B805F5CFEFFFF808BEE17AB'
......
Explanation:
The generated list has the following columns:
  • AI_LAST_STATE_CHANGE = activity instance last state change
    The last point in time, the activitiy's state got changed - the table is sorted on this column, descending - i.e. newest entry first.
  • AI_TEMPLATENAME = the template name of the activity
  • AI_STATE = the state of the activity instance
  • PI_TEMPLATENAME = the name of the process template of the processinstance, this activity is part of
  • PI_STATE = the state of the process instance and finally
  • PROCESS_INSTANCE_ID = the ID of the process instance.
Possible performance impact:
The possible impact greatly depends on the size of the related tables. Most of the operation is being done on the ACTIVITY_INSTANCE_B_T table and there on the column LAST_STATE_CHANGE. By default, this column doesn't have an index. So if you experience very long execution times for the above SQL, you're most likely doing a table scan on the entire ACTIVITY_INSTANCE_B_T table. If so, you may want to define a suitable index for that column and perform a "runstats .... with distribution and detailed indexes all" on this table. This should reduce the table scan to an index scan, which should result in a noticeable reduction of the execution time of this SQL.

Disclaimer:
This SQL is lightyears away from being a fully fledged flight recorder, that shows all relevant events. On the other side it can at least provide some initial insight for problem determination purposes and it is for sure far less intrusive (in terms of resource consumption) than a full blown BPE trace.

(finally also published (slightly modified) at http://www-01.ibm.com/support/docview.wss?uid=swg21384848 )